➢ Pre-trained models for 10 languages

python -m spacy download en_core_web_lg

import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp(text)
print([(token.text, token.pos_) for token in doc])

from spacy import displacy
displacy.serve(doc, style='ent')

Output:
an DET   AI PROPN   scientist NOUN   in ADP   Silo PROPN   . PUNCT   AI PROPN   ’s NOUN   NLP PROPN   team NOUN   . PUNCT
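A minimal runnable version of the snippet above. The input sentence is an assumption (the slide does not show the full text), but it reproduces the point: the default tokenizer splits "Silo.AI" apart.

import spacy

# Assumed example sentence; the slide's full input text is not shown.
text = "I work as an AI scientist in Silo.AI's NLP team."

nlp = spacy.load("en_core_web_lg")   # requires: python -m spacy download en_core_web_lg
doc = nlp(text)

# Each token with its coarse-grained part-of-speech tag.
print([(token.text, token.pos_) for token in doc])

# Serve an interactive named-entity visualisation in the browser.
from spacy import displacy
displacy.serve(doc, style="ent")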
from spacy.attrs import ORTH, NORM
nlp.tokenizer.add_special_case("Silo.AI", [{ORTH: "Silo.AI"}])

Output:
an DET   AI PROPN   scientist NOUN   in ADP   Silo.AI PROPN   ’s PART   NLP PROPN   team NOUN   . PUNCT

➢ Built-in language-specific rules for 50 languages
➢ Pull requests improving your favourite language are always welcome!
➢ Extend the default rules with your own
➢ Define specific tokenization exceptions, e.g.
   "don't": [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}]
➢ Implement an entirely new tokenizer
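A self-contained sketch of the special-case rule above, using a blank English pipeline (an assumption to keep the example light; the slide uses en_core_web_lg, but tokenization rules are the same):

import spacy
from spacy.attrs import ORTH

nlp = spacy.blank("en")

# Before the rule: the infix rules split "Silo.AI" on the period.
print([t.text for t in nlp("an AI scientist in Silo.AI's NLP team.")])
# e.g. ['an', 'AI', 'scientist', 'in', 'Silo', '.', 'AI', "'s", 'NLP', 'team', '.']

# Teach the tokenizer to keep "Silo.AI" together as a single token.
nlp.tokenizer.add_special_case("Silo.AI", [{ORTH: "Silo.AI"}])

print([t.text for t in nlp("an AI scientist in Silo.AI's NLP team.")])
# e.g. ['an', 'AI', 'scientist', 'in', 'Silo.AI', "'s", 'NLP', 'team', '.']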
("Filip Ginter works for Silo.AI.", {"entities": [(0, 12, "PERSON"), (24, 30, "ORG")]}), (…) ] for itn in range(n_iter): random.shuffle(TRAIN_DATA) batches = minibatch(TRAIN_DATA) losses = {} for batch in batches: texts, annotations = zip(*batch) nlp.update(texts, annotations, sgd=optimizer, drop=0, losses=losses) print(f"Loss at {itn} is {losses['ner']}") Retrain / refine existing ML models ➢ Add new labels (e.g. NER) ➢ Feed in new data ➢ Ensure the model doesn’t “forget” what it learned before! ➢ Feed in “old” examples too optimizer = nlp.begin_training() optimizer = nlp.resume_training()
Train ML models from scratch
➢ Built-in support for UD annotations

python -m spacy convert ud-treebanks-v2.4\UD_Finnish-TDT\fi_tdt-ud-train.conllu fi_json
python -m spacy convert ud-treebanks-v2.4\UD_Finnish-TDT\fi_tdt-ud-dev.conllu fi_json
python -m spacy train fi output fi_json\fi_tdt-ud-train.json fi_json\fi_tdt-ud-dev.json

Itn    Tag loss     Tag %     Dep loss       UAS       LAS
1      39475.358    89.109    201983.313     65.778    51.614
2      23837.115    90.463    169409.391     71.149    59.22
3      18800.934    91.146    153834.198     73.244    62.157
4      15685.533    91.818    142268.533     74.149    63.751
5      13529.039    92.118    134673.218     75.209    65.086
…      …            …         …              …         …

Note that this won’t automatically give state-of-the-art results: there is no tuning, hyperparameter selection or language-specific customization (yet)!
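Once training finishes, the resulting pipeline can be loaded like any other model. The sub-directory name below is an assumption based on spaCy v2's default output layout (model-best / model-final), and the Finnish sentence is only illustrative.

import spacy

nlp = spacy.load("output/model-best")      # assumed default output layout of `spacy train`

doc = nlp("Suomi on tasavalta.")           # illustrative Finnish sentence
print([(token.text, token.pos_, token.dep_, token.head.text) for token in doc])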
Thinc: a deep learning library
➢ Released in January 2020
➢ Has been powering spaCy for years
➢ Entirely revamped for Python 3
➢ Type annotations
➢ Functional-programming concept: no computational graph, just higher-order functions
➢ Wrappers for PyTorch, MXNet & TensorFlow
➢ Extensive documentation: https://thinc.ai

from typing import Callable, Tuple
from thinc.types import Floats2d

def relu(inputs: Floats2d) -> Tuple[Floats2d, Callable[[Floats2d], Floats2d]]:
    mask = inputs >= 0
    def backprop_relu(d_outputs: Floats2d) -> Floats2d:
        return d_outputs * mask
    return inputs * mask, backprop_relu

➢ A layer performs the forward function
➢ Returns the forward results + a callback for the backprop
➢ The backprop calculates the gradient of the inputs, given the gradient of the outputs
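To make the forward/backprop pattern concrete, here is a small usage sketch of the relu layer defined above; the input values and the gradient of the outputs are made up for illustration.

import numpy

inputs = numpy.asarray([[-1.0, 2.0], [3.0, -4.0]], dtype="float32")

# Forward pass: returns the outputs plus the backprop callback.
outputs, backprop_relu = relu(inputs)
print(outputs)        # negative inputs are zeroed out

# Backward pass: pass in the gradient of the outputs, get the gradient of the inputs.
d_outputs = numpy.ones_like(outputs)
d_inputs = backprop_relu(d_outputs)
print(d_inputs)       # gradient flows through only where inputs were >= 0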
Coming soon

➢ Control over all settings & parameters
➢ Configurations can easily be saved & shared
➢ Running experiments quickly, e.g. swap in a different tok2vec component

python -m spacy train-from-config fi train.json dev.json config.cfg

➢ Full support for pre-trained BERT, XLNet and GPT-2, a.k.a. “starter models” (cf. the spacy-transformers package)
➢ Keep an eye out for spaCy v3!
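As an illustration of the starter models, a minimal sketch assuming the spacy-transformers v0.x API: the package name en_trf_bertbaseuncased_lg and the trf_last_hidden_state extension attribute are assumptions about that release.

import spacy

# Requires: pip install spacy-transformers
#           python -m spacy download en_trf_bertbaseuncased_lg   (assumed package name)
nlp = spacy.load("en_trf_bertbaseuncased_lg")

doc = nlp("Apple shares rose on the news. Apple pie is delicious.")
# Token vectors now come from BERT, so the two "Apple"s get contextual representations.
print(doc[0].similarity(doc[7]))
# The raw transformer output is exposed as a Doc extension attribute (assumed name).
print(doc._.trf_last_hidden_state.shape)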
talk show host, or American football player?
Russ Cochran: American golfer, or publisher?
Rose: English footballer, or character from the TV series "Doctor Who"?
NER has already happened on the raw text, so we have entities + labels.

STEP 1: Candidate generation: create a list of plausible WikiData IDs for a mention
STEP 2: Entity Linking (EL): disambiguate these candidates to the most likely ID

Pipeline: Text → NER → NER mentions → candidate generation → list of candidates for each mention → EL → one entity ID for each mention

Example: "Ms Byron would become known as the first programmer."
NER mention:  Byron: PERSON
Candidates:   Q5679: Lord George Gordon Byron
              Q272161: Anne Isabella Byron
              Q7259: Ada Lovelace
Linked ID:    Q7259
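A minimal sketch of the candidate-generation step using spaCy v2.2's KnowledgeBase API. The frequencies, entity vectors and prior probabilities below are made-up illustrative values, not the ones used in the actual system.

import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.blank("en")

# Toy knowledge base: three WikiData candidates for the alias "Byron".
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)
kb.add_entity(entity="Q5679", freq=50, entity_vector=[1.0, 0.0, 0.0])    # Lord George Gordon Byron
kb.add_entity(entity="Q272161", freq=20, entity_vector=[0.0, 1.0, 0.0])  # Anne Isabella Byron
kb.add_entity(entity="Q7259", freq=30, entity_vector=[0.0, 0.0, 1.0])    # Ada Lovelace
kb.add_alias(alias="Byron", entities=["Q5679", "Q272161", "Q7259"],
             probabilities=[0.6, 0.1, 0.3])

# STEP 1: candidate generation for a recognised mention.
for candidate in kb.get_candidates("Byron"):
    print(candidate.entity_, candidate.prior_prob)
# STEP 2 (the entity linking model) then disambiguates these candidates in context.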
➔ Trained a convolutional NN model on 165K Wikipedia articles
➔ Manually annotated news data for evaluation: 360 entity links

                              Context    Corpus stats    Gold label    Accuracy
Random baseline                  -            -              -         33.4 %
Entity linking only              x            -              -         60.1 %
Prior probability baseline       -            x              -         67.2 %
EL + prior probability           x            x              -         71.4 %
Oracle KB performance            x            x              x         85.2 %

➔ Adding in coreference resolution
   • All entities in the same coref chain should link to the same entity
   • Assign the KB ID with the highest confidence across the chain
   • Performance (EL + prior) drops to 70.9% → further work required
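A sketch of the coreference heuristic described above: every mention in a coref chain gets the KB ID that was predicted with the highest confidence anywhere in that chain. The data structure and scores are illustrative assumptions, not the system's actual implementation.

from typing import Dict, List, Tuple

def propagate_over_chain(chain: List[Tuple[str, str, float]]) -> Dict[str, str]:
    """chain: list of (mention, predicted_kb_id, confidence)."""
    best_id = max(chain, key=lambda m: m[2])[1]
    return {mention: best_id for mention, _, _ in chain}

chain = [
    ("Ms Byron", "Q7259", 0.80),   # Ada Lovelace, highest confidence in the chain
    ("Byron", "Q5679", 0.55),
    ("she", "Q272161", 0.30),
]
print(propagate_over_chain(chain))
# {'Ms Byron': 'Q7259', 'Byron': 'Q7259', 'she': 'Q7259'}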
spaCy is a production-ready NLP library in Python that lets you quickly implement text-mining solutions, and is extensible & retrainable. spaCy 3.0 will bring lots of cool new features!

Thinc is a new deep learning library making extensive use of type annotations and supporting configuration files.

Prodigy is Explosion’s annotation tool, powered by Active Learning.

@explosion_ai   @oxykodit   @spacy.io