
LUKE@NLPコロキウム

Ikuya Yamada
January 18, 2022

Presentation slides from the NLP Colloquium (NLPコロキウム).

Transcript

  1. LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention

     Ikuya Yamada (Studio Ousia, RIKEN AIP), Akari Asai (University of Washington), Hiroyuki Shindo (Nara Institute of Science and Technology, RIKEN AIP), Hideaki Takeda (National Institute of Informatics), and Yuji Matsumoto (RIKEN AIP)
  2. Self-introduction: Ikuya Yamada (@ikuyamada)

     Co-founder and Chief Scientist at Studio Ousia; software engineer, serial entrepreneur, and researcher. Visiting researcher at RIKEN AIP (Knowledge Acquisition Team, Language Information Access Technology Team)
     • Founded a student startup upon entering university and later sold it (2000–2006)
       ◦ Led R&D on core Internet infrastructure technology (NAT traversal for peer-to-peer communication)
       ◦ The acquiring company went public
     • Co-founded Studio Ousia to work on natural language processing (2007–)
       ◦ Leads R&D on NLP centered on question answering
     • Loves programming
       ◦ Frequently used libraries: PyTorch, PyTorch-lightning, transformers, Wikipedia2Vec
     • Has taken part in various competitions and shared tasks
       ◦ Winning entries: #Microposts @ WWW 2015, W-NUT Task #1 @ ACL 2015, HCQA @ NAACL 2016, HCQA @ NIPS 2017, Semantic Web Challenge @ ISWC 2020 2
  6. Overview

     • LUKE provides new contextualized representations of words and entities, built on an improved transformer architecture with a novel entity-aware self-attention mechanism
     • The effectiveness of LUKE is demonstrated by state-of-the-art results on five important entity-related tasks: SQuAD, ReCoRD, CoNLL-2003, TACRED, and Open Entity
     • LUKE is officially supported by Hugging Face Transformers
     • LUKE has been cited more than 100 times within a year 6
  7. Background 7

     Contextualized word representations (CWRs) do not represent entities in text well
     ◦ CWRs do not provide span-level representations of entities
     ◦ It is difficult to capture relationships between entities that are split into multiple tokens
     ◦ The pretraining task of CWRs is not well suited to entities
     (Image: Mark Hamill, by Gage Skidmore; caption: "BERT…? ELMo…? The Force is not strong with them.")
  8. Background 8

     Contextualized word representations (CWRs) do not represent entities in text well
     ◦ CWRs do not provide span-level representations of entities
     ◦ It is difficult to capture relationships between entities that are split into multiple tokens
     ◦ The pretraining task of CWRs is not well suited to entities: predicting "Rings" given "The Lord of the [MASK]" is clearly easier than predicting the entire entity
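To make the subword-masking point concrete, here is a minimal illustration (assuming the standard bert-base-uncased tokenizer from the transformers library, which is not part of the original slides): a conventional tokenizer breaks an entity name into ordinary word pieces, so masked language modeling only ever has to recover one piece at a time.

```python
from transformers import AutoTokenizer

# A conventional subword tokenizer has no single token for the entity
# "The Lord of the Rings"; it becomes a sequence of ordinary word-level pieces.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("The Lord of the Rings"))
# e.g. ['the', 'lord', 'of', 'the', 'rings'] -- masking one piece ("rings")
# is far easier than predicting the entity as a whole
```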
  9. LUKE: Language Understanding with Knowledge-based Embeddings 9

     LUKE is a pretrained contextualized representation model based on the transformer
     • New architecture that treats both words and entities as tokens
     • New pretraining strategy: randomly masking and predicting both words and entities
     • Entity-aware self-attention mechanism
     Input text with Wikipedia entity annotations: "Beyoncé lives in Los Angeles"
  10. The Architecture of LUKE 10

     • LUKE treats words and entities as independent tokens
     • Because entities are treated as tokens:
       ◦ LUKE provides span-level entity representations
       ◦ The relationships between entities can be directly captured inside the transformer
     Input text with Wikipedia entity annotations: "Beyoncé lives in Los Angeles" (figure: Computing Input Representations)
  11. Input Representations: Three Types of Embeddings 11

     • Token embedding: represents the corresponding token in the vocabulary
       ◦ The entity token embeddings are decomposed into two small matrices: B (projection matrix) and U
     • Position embedding: represents the position of the token in the word sequence
       ◦ An entity spanning multiple words is represented as the average of the corresponding position embedding vectors
     • Entity type embedding: represents that the token is an entity
  16. Input Representations: Word and Entity Input Representations 16

     • Word input representation: token embedding + position embedding
     • Entity input representation: token embedding + position embedding + entity type embedding
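As a minimal sketch of how these three embeddings could be combined (illustrative dimensions and variable names; not the authors' implementation), the entity input vector sums a projected entity token embedding, the averaged position embeddings of the span, and a single entity type vector:

```python
import torch
import torch.nn as nn

hidden_size, entity_emb_size = 768, 256          # assumed sizes
entity_vocab_size, max_positions = 500_000, 512

entity_embeddings = nn.Embedding(entity_vocab_size, entity_emb_size)      # U: small entity embeddings
entity_projection = nn.Linear(entity_emb_size, hidden_size, bias=False)   # B: projects U to the hidden size
position_embeddings = nn.Embedding(max_positions, hidden_size)
entity_type_embedding = nn.Parameter(torch.zeros(hidden_size))            # "this token is an entity"

def entity_input_representation(entity_id: int, span_positions: list) -> torch.Tensor:
    token_emb = entity_projection(entity_embeddings(torch.tensor(entity_id)))
    # an entity covering several word positions averages their position embeddings
    pos_emb = position_embeddings(torch.tensor(span_positions)).mean(dim=0)
    return token_emb + pos_emb + entity_type_embedding

# e.g. an entity such as "Los Angeles" occupying word positions 4 and 5
vec = entity_input_representation(entity_id=42, span_positions=[4, 5])
```

The word input representation is the same sum without the entity type embedding, using a full-size word embedding table instead of the B·U decomposition.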
  18. Pretraining: Masking Words and Entities 18

     LUKE is trained to predict randomly masked words and entities in an entity-annotated corpus obtained from Wikipedia
     • Wikipedia hyperlinks are treated as entity annotations
     • 15% of words and entities are randomly replaced with the [MASK] word and the [MASK] entity, respectively
     Example:
     "Born and raised in Houston, Texas, Beyoncé performed in various singing and dancing competitions as a child. She rose to fame in the late 1990s as the lead singer of Destiny's Child"
     → "Born and [MASK] in Houston, Texas, [MASK] performed in various [MASK] and dancing competitions as a [MASK]. She rose to fame in the [MASK] 1990s as the lead singer of Destiny's Child"
  19. Pretraining: Task 19

     LUKE is trained to predict randomly masked words and entities in an entity-annotated corpus obtained from Wikipedia, i.e. to
     • predict the original word of each masked word over the entire word vocabulary
     • predict the original entity of each masked entity over the entire entity vocabulary
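A simplified sketch of the masking step (illustrative names; it only performs plain [MASK] replacement and omits any further replacement tricks the actual pretraining may use):

```python
import torch

def mask_tokens(token_ids: torch.Tensor, mask_id: int, mask_prob: float = 0.15):
    """Randomly replace tokens with [MASK]; labels are -100 where no prediction is made."""
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape) < mask_prob
    labels[~mask] = -100                 # only masked positions contribute to the loss
    masked_ids = token_ids.clone()
    masked_ids[mask] = mask_id           # words use the word [MASK]; entities use the entity [MASK]
    return masked_ids, labels

# The same routine is applied independently to the word sequence and the entity sequence,
# and the model is trained with cross-entropy over the respective vocabularies.
word_ids = torch.randint(0, 50_000, (128,))
masked_words, word_labels = mask_tokens(word_ids, mask_id=103)
```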
  20. Background: Transformer's Self-attention Mechanism 20

     The transformer's self-attention mechanism relates tokens to each other based on the attention weight between each pair of tokens.
     Given the input vector sequence x_1, x_2, …, x_k, the output vector y_i corresponding to the i-th token is computed as the weighted sum of the projected input vectors of all tokens:
         y_i = Σ_j α_ij (V x_j)
     The attention weight α_ij is computed from the dot product of two vectors, normalized with a softmax:
         α_ij = softmax_j( (Q x_i) · (K x_j) / √d )
     ◦ Q x_i: the input vector of the attending token, projected by the query matrix Q
     ◦ K x_j: the input vector of the token attended to, projected by the key matrix K
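For reference, a compact single-head sketch of the mechanism just described (illustrative shapes; the real model uses multiple heads and learned projections):

```python
import math
import torch

def self_attention(x: torch.Tensor, Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """x: (k, d) input vectors; Q, K, V: (d, d) projection matrices."""
    q, k, v = x @ Q.T, x @ K.T, x @ V.T
    attn = torch.softmax(q @ k.T / math.sqrt(x.size(-1)), dim=-1)   # alpha_ij
    return attn @ v                                                  # y_i = weighted sum of V x_j

x = torch.randn(6, 64)                                # six tokens, 64-dimensional inputs
Q, K, V = (torch.randn(64, 64) for _ in range(3))
y = self_attention(x, Q, K, V)
```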
  22. Proposed Method: Entity-aware Self-attention Mechanism 22

     A simple extension of the self-attention mechanism that lets the model use the types of the target tokens when computing attention weights:
     • We extend the self-attention mechanism by using a different query matrix for each possible pair of token types (word or entity) of x_i and x_j
     (Figures: original self-attention mechanism vs. entity-aware self-attention mechanism)
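A sketch of the entity-aware variant under the same illustrative assumptions (the double loop is written for clarity, not efficiency): the query projection is chosen per token-type pair (word-to-word, word-to-entity, entity-to-word, entity-to-entity), while the key and value projections stay shared.

```python
import math
import torch

def entity_aware_attention(x, is_entity, queries, K, V):
    """x: (k, d); is_entity: (k,) bool; queries: dict with keys 'w2w', 'w2e', 'e2w', 'e2e'."""
    k_proj, v_proj = x @ K.T, x @ V.T
    d = x.size(-1)
    scores = torch.empty(x.size(0), x.size(0))
    for i in range(x.size(0)):
        for j in range(x.size(0)):
            # pick the query matrix based on the types of the attending and attended-to tokens
            kind = ("e" if is_entity[i] else "w") + "2" + ("e" if is_entity[j] else "w")
            scores[i, j] = (queries[kind] @ x[i]) @ k_proj[j] / math.sqrt(d)
    attn = torch.softmax(scores, dim=-1)
    return attn @ v_proj

x = torch.randn(5, 64)
is_entity = torch.tensor([False, False, False, True, True])          # three words, two entities
queries = {k: torch.randn(64, 64) for k in ("w2w", "w2e", "e2w", "e2e")}
y = entity_aware_attention(x, is_entity, queries, K=torch.randn(64, 64), V=torch.randn(64, 64))
```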
  23. Experiments: Overview 23

     We advance the state of the art on five diverse tasks, using a similar architecture for every task: a linear classifier on top of the representations of words, entities, or both.
     Dataset        Task
     Open Entity    Entity typing
     TACRED         Relation classification
     CoNLL-2003     Named entity recognition
     ReCoRD         Cloze-style QA
     SQuAD          Extractive QA
  24. How to Compute Entity Representations in Downstream Tasks 24

     Entity representations can be computed in two ways (see the sketch below):
     • using the [MASK] entity as the input token(s)
       ◦ The model gathers information about the entities from the input text
       ◦ Used in all tasks except extractive QA (SQuAD)
     • using Wikipedia entities as the input token(s)
       ◦ The entity representations are computed from the information stored in the entity token embeddings
       ◦ The word representations are enriched by the entity representations inside the transformer
       ◦ Used in the extractive QA (SQuAD) task
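Both input modes can be tried through the Hugging Face integration mentioned later in the deck; the checkpoint name and keyword arguments below follow the Transformers documentation for LUKE and should be treated as assumptions if your library version differs.

```python
from transformers import LukeModel, LukeTokenizer

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
model = LukeModel.from_pretrained("studio-ousia/luke-base")

text = "Beyoncé lives in Los Angeles."
spans = [(0, 7), (17, 28)]          # character spans of "Beyoncé" and "Los Angeles"

# (1) [MASK] entities: only the spans are given, so the model gathers
#     entity information from the surrounding text.
inputs_mask = tokenizer(text, entity_spans=spans, return_tensors="pt")

# (2) Wikipedia entities: the entity names are given as well, so the pretrained
#     entity token embeddings contribute the knowledge stored in them.
inputs_wiki = tokenizer(text, entities=["Beyoncé", "Los Angeles"],
                        entity_spans=spans, return_tensors="pt")

outputs = model(**inputs_mask)
entity_reps = outputs.entity_last_hidden_state    # one vector per input entity
```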
  26. Experiments: Entity Typing, Relation Classification, Cloze-style QA 26

     Datasets:
     • Open Entity (entity typing)
     • TACRED (relation classification)
     • ReCoRD (cloze-style QA)
     Model: a linear classifier that takes the output entity representation(s) as the input feature
     Model inputs:
     • words in the target sentence
     • [MASK] entities representing the target entity span(s)
     SOTA on these three important entity-related tasks
     (Tables: results on Open Entity, TACRED, and ReCoRD)
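A minimal sketch of the downstream head for the entity typing case (illustrative checkpoint and label count; not the released fine-tuning code): a linear classifier reads the output representation of the [MASK] entity placed on the target mention. For relation classification, the representations of the two [MASK] entities would be concatenated before the classifier.

```python
import torch.nn as nn
from transformers import LukeModel, LukeTokenizer

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
model = LukeModel.from_pretrained("studio-ousia/luke-base")
classifier = nn.Linear(model.config.hidden_size, 9)   # 9 entity types assumed for illustration

inputs = tokenizer("Beyoncé lives in Los Angeles.", entity_spans=[(0, 7)], return_tensors="pt")
entity_rep = model(**inputs).entity_last_hidden_state[:, 0]   # the [MASK] entity on "Beyoncé"
logits = classifier(entity_rep)                               # one score per entity type
```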
  27. Experiments: Named Entity Recognition (CoNLL-2003) 27

     Model:
     1. Enumerate all possible spans in the input text as entity name candidates
     2. Classify each span into an entity type or the non-entity type with a linear classifier, based on the span's entity representation and the word representations of its first and last words
     3. Greedily select spans based on their logits
     Model inputs:
     • words in the input text
     • [MASK] entities corresponding to all possible entity name candidates
     SOTA on the CoNLL-2003 named entity recognition dataset
     (Table: results on CoNLL-2003)
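A rough sketch of the span-enumeration and scoring step (random tensors stand in for LUKE's actual outputs; names and the maximum span length are assumptions):

```python
import torch
import torch.nn as nn

def enumerate_spans(num_words: int, max_span_len: int = 16):
    """All (start, end) word-index pairs up to a maximum span length."""
    return [(i, j) for i in range(num_words)
            for j in range(i, min(i + max_span_len, num_words))]

hidden, num_labels = 768, 5                     # 4 CoNLL-2003 entity types + non-entity
span_classifier = nn.Linear(hidden * 3, num_labels)

word_reps = torch.randn(12, hidden)             # stand-in for the output word representations
spans = enumerate_spans(num_words=12)
entity_reps = torch.randn(len(spans), hidden)   # stand-in for one [MASK] entity per candidate span

features = torch.cat([entity_reps,
                      word_reps[[s for s, _ in spans]],    # first word of each span
                      word_reps[[e for _, e in spans]]],   # last word of each span
                     dim=-1)
logits = span_classifier(features)   # spans are then selected greedily by these logits
```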
  28. Experiments: Extractive Question Answering (SQuAD v1.1) 30

     Model: two linear classifiers on top of the output word representations predict the start and end positions of the answer
     Model inputs:
     • words in the question and the passage
     • Wikipedia entities in the passage, generated automatically with a heuristic entity linking method
     SOTA on the SQuAD v1.1 extractive question answering dataset; LUKE reached #1 on the leaderboard
     (Table: results on SQuAD v1.1)
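A compact sketch of the extractive-QA head (shapes are illustrative): two linear outputs over the word representations give start and end logits, and the answer span is read off from them.

```python
import torch
import torch.nn as nn

hidden, seq_len = 768, 384
word_reps = torch.randn(1, seq_len, hidden)     # stand-in for LUKE's output word representations

qa_head = nn.Linear(hidden, 2)                  # one logit each for "start" and "end"
start_logits, end_logits = qa_head(word_reps).split(1, dim=-1)

# naive decoding: per-token argmax (real decoding also constrains end >= start)
answer_start = start_logits.squeeze(-1).argmax(dim=-1)
answer_end = end_logits.squeeze(-1).argmax(dim=-1)
```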
  30. Ablation Study (1): Entity Representations 32

     When the tasks are addressed without entity inputs, performance degrades significantly on CoNLL-2003 and SQuAD v1.1
     (Table: ablation results; columns labeled "Using [MASK] entities as inputs" and "Using Wikipedia entities as inputs")
  31. Ablation Study (2): Entity-aware Self-attention 33 Our entity-aware self-attention mechanism

    consistently outperforms the original mechanism across all tasks
  32. Adding LUKE to Hugging Face Transformers 34

     • LUKE is officially supported by Hugging Face Transformers
     • The state-of-the-art results reported in the paper can now be easily reproduced using Transformers on Colab notebooks:
       ◦ NER on CoNLL-2003
       ◦ Relation extraction on TACRED
       ◦ Entity typing on Open Entity
     https://github.com/studio-ousia/luke/issues/38
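As a hedged example of the reproduction path, the relation-classification checkpoint can be used roughly as follows (class and checkpoint names follow the Transformers LUKE documentation; treat them as assumptions if they differ in your version):

```python
from transformers import LukeForEntityPairClassification, LukeTokenizer

ckpt = "studio-ousia/luke-large-finetuned-tacred"
tokenizer = LukeTokenizer.from_pretrained(ckpt)
model = LukeForEntityPairClassification.from_pretrained(ckpt)

text = "Beyoncé lives in Los Angeles."
entity_spans = [(0, 7), (17, 28)]   # head ("Beyoncé") and tail ("Los Angeles") mentions

inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
logits = model(**inputs).logits
print(model.config.id2label[int(logits.argmax(-1))])   # e.g. a cities-of-residence relation
```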
  33. Summary 35

     • LUKE provides new contextualized representations of words and entities, built on an improved transformer architecture with a novel entity-aware self-attention mechanism
     • The effectiveness of LUKE is demonstrated by state-of-the-art results on five important entity-related tasks
     Paper: https://arxiv.org/abs/2010.01057
     Code: https://github.com/studio-ousia/luke
     Twitter: @ikuyamada
     Email: [email protected]