➢ Designed for production usage
➢ Speed & efficiency
➢ Python + Cython
➢ Comparison to other NLP libraries: https://spacy.io/usage/facts-figures
➢ Open source (MIT license): https://github.com/explosion/spaCy/
➢ Created by Explosion AI (Ines Montani & Matthew Honnibal)
➢ Tokenization (50 languages), lemmatization, POS tagging, dependency parsing
➢ NER, text classification, rule-based matching (API + one implementation)
➢ Word vectors, BERT-style pre-training
➢ Statistical models in 10 languages (v. 2.2): DE, EN, EL, ES, FR, IT, LT, NL, NB, PT
➢ One multi-lingual NER model covering DE, EN, ES, FR, IT, PT, RU
Entity recognition
➔ A named entity is a consecutive span of one or several tokens
➔ It has a label or type, such as “PERSON”, “LOC” or “ORG”
➔ An NER algorithm is trained on annotated data (e.g. OntoNotes)
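As a quick illustration (not from the original slides), spaCy exposes recognized entities on doc.ents; this minimal sketch assumes the en_core_web_sm model has been downloaded.

    import spacy

    # Assumes: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Ms Byron would become known as the first programmer.")
    for ent in doc.ents:
        # each entity is a consecutive span of tokens with a label
        print(ent.text, ent.start_char, ent.end_char, ent.label_)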
Entity linking
➔ Tokenization, word embeddings, dependency trees … all define words in terms of other words
➔ Entity linking: resolve named entities to concepts from a Knowledge Base (KB)
➔ Ground the lexical information in the “real world”
➔ Allows database facts to be fully integrated with textual information
NER has already happened on the raw text, so we have entities + labels
STEP 1: Candidate generation: create a list of plausible WikiData IDs for a mention
STEP 2: Entity linking: disambiguate these candidates to the most likely ID

Pipeline: Text → NER → NER mentions → candidate generation → list of candidates for each mention → EL → one entity ID for each mention

Example: “Ms Byron would become known as the first programmer.”
• NER mention: Byron (PERSON)
• Candidates: Q5679 (Lord George Gordon Byron), Q272161 (Anne Isabella Byron), Q7259 (Ada Lovelace)
• EL output: Q7259 (Ada Lovelace)
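A minimal sketch of this two-step flow (not the actual spaCy implementation); get_candidates() and score_candidate() are hypothetical stand-ins for the components described on the next slides.

    # STEP 1 + STEP 2 in pseudocode-style Python
    def link_entities(mentions, kb, context):
        links = {}
        for mention in mentions:
            candidates = get_candidates(kb, mention)                 # STEP 1: candidate generation
            if not candidates:
                continue
            best = max(candidates,
                       key=lambda c: score_candidate(c, context))    # STEP 2: disambiguation
            links[mention] = best
        return links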
STEP 1: Candidate generation
What: given a textual mention, produce a list of candidate IDs
How: build a Knowledge Base (KB) to query candidates from. This is done by parsing links on Wikipedia:
➔ “William King” is a synonym for “William King-Noel, 1st Earl of Lovelace”
  • Other synonyms found on Wikipedia: “Earl of Lovelace”, “8th Baron King”, ...
➔ For each synonym, deduce how likely it points to a certain ID by normalizing the pair frequencies to prior probabilities
  • e.g. “Byron” refers to “Lord Byron” in 35% of the cases, and to “Ada Lovelace” in 55% of the cases

Example interwiki link: She married [[William King-Noel, 1st Earl of Lovelace|William King]] in 1835
The alias “Byron” maps to the candidates Q5679, Q272161 and Q7259
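A minimal sketch of building such a KB by hand with the spaCy v2.2 KnowledgeBase API (in later versions the class and config differ); the frequencies, zero vectors and prior probabilities below are made-up illustrations, not values from the real KB.

    import spacy
    from spacy.kb import KnowledgeBase

    nlp = spacy.load("en_core_web_sm")
    kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=64)

    # entities with (made-up) frequencies and placeholder 64D vectors
    kb.add_entity(entity="Q5679", freq=342, entity_vector=[0.0] * 64)    # Lord Byron
    kb.add_entity(entity="Q272161", freq=23, entity_vector=[0.0] * 64)   # Anne Isabella Byron
    kb.add_entity(entity="Q7259", freq=111, entity_vector=[0.0] * 64)    # Ada Lovelace

    # prior probabilities = normalized alias-entity pair frequencies
    kb.add_alias(alias="Byron", entities=["Q5679", "Q272161", "Q7259"],
                 probabilities=[0.35, 0.10, 0.55])

    # candidate generation: query the KB for a textual mention
    for cand in kb.get_candidates("Byron"):
        print(cand.entity_, cand.prior_prob)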
STEP 2: Entity linking (disambiguation)
What: given a list of candidate IDs + the textual context, produce the most likely identifier
How: compare lexical clues between the candidates and the context

Example: “Ms Byron would become known as the first programmer.”
Candidates for “Byron”: Q5679, Q272161, Q7259 → predicted: Q7259

WikiData ID  WikiData name        WikiData description                                              Context similarity
Q5679        Lord Byron           English poet and a leading figure in the Romantic movement        0.1
Q272161      Anne Isabella Byron  Wife of Lord Byron                                                0.3
Q7259        Ada Lovelace         English mathematician, considered the first computer programmer   0.9

The description of Q7259 is most similar to the original sentence (context)
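A rough sketch of this kind of context scoring (not the trained spaCy encoders): embed the mention’s sentence and each candidate’s description, then rank candidates by cosine similarity; spaCy doc vectors stand in for the real encoders, assuming a model with word vectors such as en_core_web_md.

    import numpy as np
    import spacy

    nlp = spacy.load("en_core_web_md")  # assumes a model with word vectors

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

    sentence = nlp("Ms Byron would become known as the first programmer.")
    descriptions = {
        "Q5679": "English poet and a leading figure in the Romantic movement",
        "Q272161": "Wife of Lord Byron",
        "Q7259": "English mathematician, considered the first computer programmer",
    }
    scores = {qid: cosine(sentence.vector, nlp(desc).vector)
              for qid, desc in descriptions.items()}
    print(max(scores, key=scores.get))  # ideally Q7259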
WikiData contains many infrequently linked topics: a long tail of infrequent entities. To keep the KB manageable in memory, it requires some pruning:
• Keep only entities with min. 20 incoming interwiki links (from 8M down to 1M entities)
• Each alias-entity pair should occur at least 5 times in WP
• Keep at most 10 candidate entities per alias/mention
• Result: ca. 1.1M entities and 1.5M aliases
• 350MB file size to store 1M entities and 1.5M aliases + pretrained 64D entity vectors

The KB only stores 14% of all WikiData concepts, yet the EL still achieves max. 84.2% accuracy (with an oracle EL disambiguation step)
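A toy sketch of these pruning rules (not the actual parsing scripts); link_counts and alias_entity_counts are hypothetical dicts collected while parsing the dump.

    from collections import Counter

    MIN_ENTITY_FREQ = 20   # min. incoming interwiki links per entity
    MIN_PAIR_FREQ = 5      # min. occurrences of an alias-entity pair in WP
    MAX_CANDIDATES = 10    # candidates kept per alias/mention

    def prune(link_counts, alias_entity_counts):
        # drop rarely linked entities
        kept_entities = {e for e, n in link_counts.items() if n >= MIN_ENTITY_FREQ}
        aliases = {}
        for (alias, entity), n in alias_entity_counts.items():
            if entity in kept_entities and n >= MIN_PAIR_FREQ:
                aliases.setdefault(alias, Counter())[entity] = n
        # keep only the 10 most frequent candidates per alias
        return {a: dict(c.most_common(MAX_CANDIDATES)) for a, c in aliases.items()}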
Trained on 200,000 mentions in Wikipedia articles (2h); tested on 5,000 mentions in (different) Wikipedia articles
• The random baseline picks a random entity from the set of candidates
• The prior probability baseline picks the most likely entity for a given synonym, regardless of context
• The EL algorithm (by itself) significantly outperforms the random baseline (73.9% > 54.0%) and marginally improves upon the prior probability baseline (79.0% > 78.2%)

Accuracy %:
  Random baseline      54.0
  EL only              73.9
  Prior prob baseline  78.2
  EL + prior prob      79.0
  Oracle KB (max)      84.2
“Kampong Cham, ... and Svay Rieng.”
→ predicted: City in Cambodia
→ gold WP link: Province of Cambodia

“Societies in the ancient civilizations of Greece and Rome preferred small families.”
→ predicted: Greece
→ gold WP link: Ancient Greece

“Agnes Maria of Andechs-Merania (died 1201) was a Queen of France.”
→ predicted: kingdom in Western Europe from 987 to 1791
→ gold WP link: current France (gold was incorrect!)
Curated the WP training data with Prodigy (https://prodi.gy/):
• Took the original “gold” ID from the interwiki link
• Mixed in all other candidate IDs
• Presented them in a random order

Annotation of 500 cases:
• 7.4% did not constitute a proper sentence
• 8.2% did not refer to a proper entity

Of the remaining 422 cases:
• 87.7% were found to be the same
• 5.2% were found to be different
• 7.1% were found to be ambiguous or needed context outside the sentence
Entities without sentence context, e.g. in enumerations, tables, “See also” sections
→ Remove from the dataset

Some links are not really Named Entities but refer to other concepts such as “privacy”
→ Prune the WikiData KB

WP annotations are not always aligned to the entity types
• “Fiji has experienced many coups recently, in 1987, 2000, and 2006.”
  → Link to “2000 Fijian coup d'état” or to the year “2000”?
• “Full metro systems are in operation in Paris, Lyon and Marseille”
  → WP links to “Marseille Metro” instead of to “Marseille”
Accuracy %                  Random baseline  EL only  Prior prob baseline  EL + prior prob  Oracle KB (max)
Easy cases (230 entities)   40.9             58.7     84.8                 87.8             100
Hard cases (122 entities)   14.9             17.4     25.6                 27.3             33.9
All (352 entities)          29.6             44.7     64.4                 67.0             77.2

➔ The original annotation effort started with 500 randomly selected entities
➔ 16% were not proper entities/sentences, or were Date entities such as “nearly two months”
➔ 9% referred to concepts not in WikiData
➔ 2% were too difficult to resolve
➔ 3% of Prodigy matches could not be matched with the spaCy nlp model
➔ On the news dataset, EL improves more upon the prior probability baseline
There will always be entities too vague or outside the KB (e.g. “a tourist called Julia ...”)
Candidate generation can fail due to small lexical variants
• middle name, F.B.I. instead of FBI, “‘s” or “the” as part of the entity, ...
Often background information beyond what is in the article is required for disambiguation
Metonymy is hard to resolve correctly, even for a human
• e.g. “the World Economic Forum at Davos … Davos had come to embody that agenda”
Dates and numbers are often impossible to resolve correctly (or need metadata)
➔ “... what happened in the middle of December”
The WikiData knowledge graph is incredibly helpful when analysing the entities manually
Example (knowledge graph view): “... N. J. Mr. Feeney served as a radio operator in the Air Force and attended Cornell University on the G. I. Bill.”
Linked entities: Q2967590 (Chuck Feeney), Q138311 (Elizabeth), Q2335128 (Province of New Jersey), Q49115 (Cornell University)
WikiData relations connecting them: place of birth, capital of, educated at
Coreference resolution helps to link mentions of the same concept across sentences. The whole coreference chain should then be linked to the same WikiData ID. The EL algorithm is (currently) trained to predict entity links per sentence.
➔ How to obtain consistency across the chain?
➔ First idea: take the prediction with the highest confidence across all entities in the chain (see the sketch below)

Assessment on Wikipedia dev set
➔ 0.3% decrease :(
➔ WP links are biased: only the first occurrence (with most context) is usually linked

Assessment on News evaluation set
➔ 0.5-0.8% increase (might not be significant – need to look at more data)
➔ “easy” from 87.8 to 88.3, “hard” from 27.3 to 28.1, “all” from 67.0 to 67.5
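A minimal sketch of the “highest confidence wins” idea; `chain` is a hypothetical list of (mention, {entity_id: confidence}) pairs produced by a coreference resolver plus the EL model.

    def link_chain(chain):
        best_id, best_conf = None, -1.0
        for _mention, scores in chain:
            for entity_id, conf in scores.items():
                if conf > best_conf:
                    best_id, best_conf = entity_id, conf
        # every mention in the chain gets the single most confident entity ID
        return {mention: best_id for mention, _scores in chain}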
More data for training & evaluation, to better benchmark model choices

The hierarchy of WikiData concepts should be taken into account
➔ Predicting the province instead of its capital city is not as bad as predicting an unrelated city
➔ Taking this into account in the loss function could make the training more robust

Coreference resolution can give entity linking a performance boost
➔ Use coreference resolution to obtain a more consistent set of predictions
➔ Use coreference resolution to enrich the training data

Try it out yourself?
➔ CLI scripts for creating a KB from a WP dump (any language)
➔ CLI scripts for extracting training data from WP/WD
➔ Example code on how to train an entity linking pipe (a rough sketch follows below)
➔ Possibility to use a custom implementation (through the provided APIs)
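A rough sketch of training an entity_linker pipe with the spaCy v2.2 API (later versions configure this differently); “my_kb” is a hypothetical path to a previously dumped KB, and TRAIN_DATA is a made-up one-example dataset in spaCy’s “links” annotation format.

    import random
    import spacy
    from spacy.kb import KnowledgeBase
    from spacy.util import minibatch

    nlp = spacy.load("en_core_web_lg")
    kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=64)
    kb.load_bulk("my_kb")  # hypothetical path to a previously built KB

    entity_linker = nlp.create_pipe("entity_linker")
    entity_linker.set_kb(kb)
    nlp.add_pipe(entity_linker, last=True)

    TRAIN_DATA = [
        ("Ms Byron would become known as the first programmer.",
         {"links": {(3, 8): {"Q7259": 1.0, "Q5679": 0.0, "Q272161": 0.0}}}),
    ]

    other_pipes = [p for p in nlp.pipe_names if p != "entity_linker"]
    with nlp.disable_pipes(*other_pipes):  # only train the entity linker
        optimizer = nlp.begin_training()
        for _ in range(10):
            random.shuffle(TRAIN_DATA)
            for batch in minibatch(TRAIN_DATA, size=4):
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer)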
Processing Wikipedia:
• Parse aliases and prior probabilities from interwiki links to build the KB
• Takes about 2 hours to parse 1100M lines of the Wikipedia ENG XML dump

Processing Wikidata:
• Link English Wikipedia to interlingual Wikidata identifiers
• Retrieve concise Wikidata descriptions for each entity
• Takes about 7 hours to parse 55M lines of the Wikidata JSON dump

Knowledge Base:
• 55MB to store 1M entities and 1.5M aliases
• 350MB file size to store 1M entities and 1.5M aliases + pretrained 64D entity vectors
• Because of efficient Cython data structures, the KB can be kept in memory
• Written to file, and read back in, in a matter of seconds
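A sketch of writing the KB to disk and reading it back with the spaCy v2.2 API (“my_kb” is a hypothetical path); both operations complete in seconds.

    import spacy
    from spacy.kb import KnowledgeBase

    nlp = spacy.load("en_core_web_lg")
    kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=64)
    # ... add entities and aliases as shown earlier ...
    kb.dump("my_kb")

    # read it back into a fresh KB with the same vector length
    kb2 = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=64)
    kb2.load_bulk("my_kb")
    print(kb2.get_size_entities(), kb2.get_size_aliases())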
Model architecture (diagram): for each (NER mention, candidate entity ID) pair, the EL model predicts P(E|M) in [0, 1], trained against a gold label {0, 1} via the loss function.
• From text: the sentence (context) around the NER mention goes through the sentence encoder → 128D
• From the KB: the candidate entity ID’s description goes through the entity encoder
• NER type: one-hot encoding → 16D
• Prior probability: float → 1D
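A toy sketch (not the actual spaCy code) of how the inputs in the diagram combine into a single probability P(E|M); the random weights are placeholders for the trained sentence encoder, entity encoder and EL layer.

    import numpy as np

    rng = np.random.default_rng(0)

    def p_entity_given_mention(sentence_vec_128d, entity_desc_vec_64d,
                               ner_type_onehot_16d, prior_prob):
        features = np.concatenate([sentence_vec_128d,     # from the sentence encoder
                                   entity_desc_vec_64d,   # from the entity encoder
                                   ner_type_onehot_16d,   # one-hot NER type
                                   [prior_prob]])         # prior probability, a single float
        w = rng.normal(size=features.shape[0])            # placeholder EL weights
        return 1.0 / (1.0 + np.exp(-features @ w))        # squashed into [0, 1]

    p = p_entity_given_mention(np.zeros(128), np.zeros(64), np.eye(16)[2], 0.55)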