tokenizer → tagger → parser → ner → …
The current spaCy nlp pipeline works purely on the textual information itself:
• Tokenizing input text into words & sentences
• Parsing syntax & grammar
• Recognising meaningful entities and their types
• …
But how can we ground that information into the “real world” (or its approximation – a knowledge base)?
Synonymy
• Augusta Byron = Ada Byron = Countess of Lovelace = Ada Lovelace = Ada King
Polysemy
• 4 different barons were called “George Byron”
• “George Byron” is an American singer
• “George Byron Lyon-Fellowes” was the mayor of Ottawa in 1876
• …
Vagueness
• e.g. “The president”
Context is everything!
talk show host, or American football player?
Russ Cochran: American golfer, or publisher?
Rose: English footballer, or character from the TV series “Doctor Who”?
For the prototype, focus on Wikidata instead of Wikipedia:
• Stable IDs
• Higher coverage (WP:EN has 5.8M pages, Wikidata has 55M entities)
• Better support for cross-lingual entity linking
Canonical knowledge base with potentially language-specific feature vectors
Do the KB reconciliation once, as an offline data-dependent step
In-memory (fast!) implementation of the KB, using a Cython backend
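The in-memory KB described above can be sketched in plain Python; the class and method names here are illustrative only (spaCy's actual implementation uses a Cython backend), and the 64D zero vector stands in for a real entity embedding:

```python
# Minimal pure-Python sketch of an in-memory KB mapping aliases to
# candidate entities with prior probabilities. Illustrative only;
# not spaCy's actual (Cython-backed) API.
from dataclasses import dataclass, field

@dataclass
class KnowledgeBaseSketch:
    entity_vectors: dict = field(default_factory=dict)  # entity ID -> feature vector
    aliases: dict = field(default_factory=dict)         # alias -> [(entity ID, prior prob)]

    def add_entity(self, entity_id, vector):
        self.entity_vectors[entity_id] = vector

    def add_alias(self, alias, entities, priors):
        # store candidates sorted by prior probability, highest first
        self.aliases[alias] = sorted(zip(entities, priors), key=lambda c: -c[1])

    def get_candidates(self, alias):
        return self.aliases.get(alias, [])

kb = KnowledgeBaseSketch()
kb.add_entity("Q7259", vector=[0.0] * 64)  # Q7259 = Ada Lovelace in Wikidata
kb.add_alias("Ada", entities=["Q7259"], priors=[0.9])
print(kb.get_candidates("Ada"))  # [('Q7259', 0.9)]
```

Because everything lives in plain dictionaries keyed by stable Wikidata IDs, the reconciliation step can run once offline and the result can be serialized and reloaded quickly.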
• 1st Earl of Lovelace
• Earl of Lovelace
• William King
• William King-Noel, 8th Baron King
• ...
She married [[William King-Noel, 1st Earl of Lovelace|William King]] in 1835
Aliases and prior probabilities from intrawiki links
Takes about 2 hours to parse 1100M lines of Wikipedia XML dump
It likewise takes hours to parse 55M lines of Wikidata JSON dump
→ Link English Wikipedia to interlingual Wikidata identifiers
→ Retrieve concise Wikidata descriptions for each entity
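The alias-and-prior extraction above can be sketched as follows; the regex and helper names are mine, and the real pipeline streams the full Wikipedia XML dump rather than single strings:

```python
# Sketch: derive alias -> entity prior probabilities from intrawiki links
# such as "[[William King-Noel, 1st Earl of Lovelace|William King]]".
import re
from collections import Counter, defaultdict

LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def count_links(wikitext, counts):
    for target, anchor in LINK.findall(wikitext):
        alias = anchor or target  # bare links use the page title as the alias
        counts[alias][target] += 1

def priors(counts):
    # P(entity | alias) = link count / total link count for that alias
    return {alias: {ent: n / sum(c.values()) for ent, n in c.items()}
            for alias, c in counts.items()}

counts = defaultdict(Counter)
count_links("She married [[William King-Noel, 1st Earl of Lovelace|William King]] in 1835", counts)
count_links("[[William King]] was also the name of other people", counts)
print(priors(counts)["William King"])
```

Counting every anchor-target pair across the dump gives, for each surface form, an empirical distribution over the entities it links to.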
Pruning to keep the KB manageable in memory:
• Keep only entities with min. 20 incoming intrawiki links (from 8M to 1M entities)
• Each alias-entity pair should occur at least 5 times in WP
• Keep 10 candidate entities per alias/mention
KB size:
• ca. 1M entities and 1.5M aliases
• ca. 55MB file size without entity vectors
• ca. 350MB file size with 64D entity vectors
• Written to file, and read back in, in a matter of seconds
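The three pruning steps can be sketched as one filter pass, assuming simple dict inputs (incoming-link counts per entity, occurrence counts per alias-entity pair); the function and thresholds mirror the bullets above but the data shapes are my own:

```python
# Sketch of the KB pruning described above.
def prune(entity_links, alias_counts, min_links=20, min_pair=5, top_k=10):
    # 1) keep only entities with enough incoming links
    entities = {e for e, n in entity_links.items() if n >= min_links}
    pruned = {}
    for alias, pairs in alias_counts.items():
        # 2) drop rare alias-entity pairs and pairs pointing at pruned entities
        kept = [(e, n) for e, n in pairs.items() if n >= min_pair and e in entities]
        # 3) keep only the top-k candidates per alias, by frequency
        kept.sort(key=lambda p: -p[1])
        if kept:
            pruned[alias] = kept[:top_k]
    return pruned

entity_links = {"Q1": 100, "Q2": 3}                          # hypothetical IDs
alias_counts = {"Byron": {"Q1": 50, "Q2": 40}, "rare": {"Q1": 2}}
print(prune(entity_links, alias_counts))  # {'Byron': [('Q1', 50)]}
```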
Candidate generation
• Input: an alias or textual mention (e.g. “Byron”)
• Output: list of candidates, i.e. (entity ID, prior probability) tuples
➔ Currently implemented as the top X of entities, sorted by their prior probabilities
Within the list of candidates, the entity linker (EL) needs to find the best match (if any)
Text → NER → NER mentions → candidate generation → list of candidates for each mention → EL → one entity ID for each mention
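As a sketch under the assumptions above (the alias table shape and the entity IDs are hypothetical), candidate generation is just a prior-sorted lookup truncated to the top X:

```python
# Sketch of candidate generation: top X (entity ID, prior probability)
# tuples for a textual mention, sorted by prior probability.
def get_candidates(kb_aliases, mention, top_x=10):
    priors = kb_aliases.get(mention, {})  # entity ID -> prior probability
    ranked = sorted(priors.items(), key=lambda p: -p[1])
    return ranked[:top_x]

# Hypothetical alias table with made-up IDs and priors
kb_aliases = {"Byron": {"Q1": 0.65, "Q2": 0.25, "Q3": 0.10}}
print(get_candidates(kb_aliases, "Byron", top_x=2))  # [('Q1', 0.65), ('Q2', 0.25)]
```

The entity linker then scores each candidate in this short list rather than the full 1M-entity KB.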
EL model (architecture sketch):
• Sentence (context), from text → CNN encoder → 128D
• NER type → one-hot encoding → 16D
• Prior prob → float → 1D
• Entity ID (candidate) → entity encoder, from KB
• EL output: P(E|M) ∈ [0, 1], compared against the gold label {0, 1} via the loss function
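The combination step can be sketched as below; the toy dimensions, the fixed projection matrix, and the sigmoid readout are illustrative stand-ins for the trained CNN, shown only to make the feature concatenation concrete:

```python
# Sketch: combine context encoding, one-hot NER type, and prior probability
# into a mention vector, compare it to the candidate's entity vector, and
# squash to a probability. The real model learns these weights end-to-end.
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def score(context_vec, ner_onehot, prior, entity_vec, weights):
    # concatenate mention-side features (real model: 128D + 16D + 1D)
    mention = context_vec + ner_onehot + [prior]
    # project into the entity-vector space (here: a fixed toy matrix)
    projected = [dot(row, mention) for row in weights]
    # sigmoid over the similarity -> P(E|M) in [0, 1]
    return 1 / (1 + math.exp(-dot(projected, entity_vec)))

ctx = [0.1, 0.2]             # toy 2D context encoding (real: 128D)
ner = [1.0, 0.0]             # toy 2D one-hot NER type (real: 16D)
prior = 0.8                  # prior probability from the KB
ent = [0.5, 0.5]             # toy 2D entity vector (real: 64D)
W = [[0.1] * 5, [0.2] * 5]   # toy 2x5 projection; learned in the real model
p = score(ctx, ner, prior, ent, W)
```

At training time, p is compared against the gold label {0, 1} through the loss function.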
WP intrawiki links with en_core_web_lg NER mentions
• Custom filtering: articles < 30K characters and sentences 5-100 tokens
• Trained on 200,000 mentions
• KB has 1.1M entities (14% of all entities)

Accuracy %:
• Random baseline: 54.0
• Context only: 73.9
• Prior prob baseline: 78.2
• Context + prior prob: 79.0
• Oracle KB (max): 84.2

The context encoder by itself is viable and significantly outperforms the random baseline. It only marginally improves on the prior prob baseline though, and is limited by the oracle performance.
• “Kampong Cham, ... and Svay Rieng.” → predicted “City in Cambodia” but should have been “Province of Cambodia”
• “Societies in the ancient civilizations of Greece and Rome preferred small families.” → predicted “Greece” instead of “Ancient Greece”
• “Roman, Byzantine, Greek origin are amongst the more popular ancient coins collected” → predicted “Ancient Rome” instead of “Roman currency” (but the latter has no description)
• “Agnes Maria of Andechs-Merania (died 1201) was a Queen of France.” → predicted “kingdom in Western Europe from 987 to 1791” but should have been “republic with mainland in Europe and numerous oversea territories” (gold was incorrect)
“a hill worth climbing”
• We need to obtain a better dataset that is not automatically created / biased
• Only then can we continue improving the ML models & architecture
Add in coreference resolution
• Entity linking for coreference chains (often not available in WP data)
• Improve document consistency of the predictions
Exploit the Wikidata knowledge graph
• Improve semantic similarity between the entities
• cf. OpenTapioca, Delpeuch 2019
Beyond Wikipedia & Wikidata:
• Reliable estimates of prior probabilities are more difficult to come by
• Candidate generation by featurizing entity names (e.g. scispaCy)