
LUKE@NLPコロキウム

Presentation slides from the NLP Colloquium (NLPコロキウム).

Ikuya Yamada

January 18, 2022

Transcript

  1. LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention
     Ikuya Yamada (1,2), Akari Asai (3), Hiroyuki Shindo (4,2), Hideaki Takeda (5), and Yuji Matsumoto (2)
     (1) Studio Ousia  (2) RIKEN AIP  (3) University of Washington  (4) Nara Institute of Science and Technology  (5) National Institute of Informatics
  2. About me: Ikuya Yamada (@ikuyamada)
     Co-founder and Chief Scientist at Studio Ousia; software engineer, serial entrepreneur, and researcher; Visiting Researcher at RIKEN AIP (Knowledge Acquisition Team, Language Information Access Technology Team)
     • Founded a student startup upon entering university and later sold it (2000–2006)
       ◦ Led R&D on core Internet technology (the NAT traversal problem in peer-to-peer communication)
       ◦ The acquiring company went public
     • Co-founded Studio Ousia, working on natural language processing (2007–)
       ◦ Leads R&D on NLP centered on question answering
     • Loves programming
       ◦ Frequently used libraries: PyTorch, PyTorch Lightning, transformers, Wikipedia2Vec
     • Has entered many competitions and shared tasks
       ◦ Won: #Microposts @ WWW 2015, W-NUT Task #1 @ ACL 2015, HCQA @ NAACL 2016, HCQA @ NIPS 2017, Semantic Web Challenge @ ISWC 2020
  3. Overview
     • LUKE is a new pretrained contextualized representation of words and entities, built on an improved transformer architecture with a novel entity-aware self-attention mechanism
     • The effectiveness of LUKE is demonstrated by state-of-the-art results on five important entity-related tasks: SQuAD, ReCoRD, CoNLL-2003, TACRED, and Open Entity
     • LUKE is officially supported by Hugging Face Transformers
     • LUKE has been cited more than 100 times within a year
  4. Background
     Contextualized word representations (CWRs) do not represent entities in text well:
       ◦ CWRs do not provide span-level representations of entities
       ◦ It is difficult to capture the relationships between entities that are split into multiple tokens
       ◦ The pretraining task of CWRs is not well suited to entities: predicting "Rings" given "The Lord of the [MASK]" is clearly easier than predicting the entire entity
     [Image: Mark Hamill (photo by Gage Skidmore), captioned "BERT…? ELMo…? The Force is not strong with them."]
  5. LUKE: Language Understanding with Knowledge-based Embeddings
     LUKE is a pretrained contextualized representation based on the transformer:
     • a new architecture that treats both words and entities as tokens
     • a new pretraining strategy: randomly masking and predicting both words and entities
     • an entity-aware self-attention mechanism
     [Figure: input text with Wikipedia entity annotations: "Beyoncé lives in Los Angeles"]
  6. The Architecture of LUKE
     • LUKE treats words and entities as independent tokens
     • Because entities are treated as tokens:
       ◦ LUKE provides span-level entity representations
       ◦ the relationships between entities can be captured directly inside the transformer
     [Figure: computing input representations for the input text with Wikipedia entity annotations: "Beyoncé lives in Los Angeles"]
  7. Input Representations: Three Types of Embeddings
     • Token embedding: represents the corresponding token in the vocabulary
       ◦ The entity token embedding is decomposed into two small matrices, B (projection matrix) and U
     • Position embedding: represents the position of the token in the word sequence
       ◦ An entity spanning multiple tokens is represented by the average of the corresponding position embedding vectors
     • Entity type embedding: represents that the token is an entity
  8. Input Representations: Word and Entity Input Representations
     • Word input representation: token embedding + position embedding
     • Entity input representation: token embedding + position embedding + entity type embedding
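The following is a minimal PyTorch sketch of how these input representations can be assembled (token, position, and entity type embeddings, with the low-rank entity embedding U projected by B and the averaged position embedding for multi-token entities). It is an illustration under assumed names, shapes, and sizes, not the authors' implementation.

# Illustrative sketch of LUKE-style input representations (not the authors' code).
import torch
import torch.nn as nn

class LukeStyleEmbeddings(nn.Module):
    def __init__(self, word_vocab=50000, entity_vocab=500000,
                 hidden=768, entity_emb=256, max_pos=512):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, hidden)
        # Entity token embeddings are stored in a small dimension (U) and
        # projected up to the hidden size (B) to keep the table compact.
        self.entity_emb_U = nn.Embedding(entity_vocab, entity_emb)
        self.entity_proj_B = nn.Linear(entity_emb, hidden, bias=False)
        self.pos_emb = nn.Embedding(max_pos, hidden)
        self.entity_type_emb = nn.Parameter(torch.zeros(hidden))

    def word_input(self, word_ids, positions):
        # word input = token embedding + position embedding
        return self.word_emb(word_ids) + self.pos_emb(positions)

    def entity_input(self, entity_ids, span_positions, span_mask):
        # entity token embedding = B(U[entity])
        tok = self.entity_proj_B(self.entity_emb_U(entity_ids))
        # position embedding of a multi-token entity = average over its span
        pos = self.pos_emb(span_positions)              # (batch, span_len, hidden)
        mask = span_mask.unsqueeze(-1).float()
        pos = (pos * mask).sum(1) / mask.sum(1).clamp(min=1e-6)
        # entity input = token + averaged position + entity type embedding
        return tok + pos + self.entity_type_emb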
  9. Pretraining: Masking Words and Entities
     LUKE is trained to predict randomly masked words and entities in an entity-annotated corpus obtained from Wikipedia
     • Wikipedia hyperlinks are treated as entity annotations
     • 15% of words and entities, chosen at random, are replaced with the [MASK] word and the [MASK] entity, respectively
     Example:
       Original: Born and raised in Houston, Texas, Beyoncé performed in various singing and dancing competitions as a child. She rose to fame in the late 1990s as the lead singer of Destiny's Child
       Masked:   Born and [MASK] in Houston, Texas, [MASK] performed in various [MASK] and dancing competitions as a [MASK]. She rose to fame in the [MASK] 1990s as the lead singer of Destiny's Child
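As a rough illustration of this masking step, the sketch below replaces about 15% of a token-id sequence with a [MASK] id and records the original ids as prediction targets; the same procedure is applied to the word sequence and the entity sequence of each pretraining example. Illustrative only (the -100 ignore value follows the PyTorch cross-entropy convention).

# Illustrative 15% masking of a sequence of word or entity ids.
import random

def mask_tokens(token_ids, mask_id, prob=0.15):
    """Replace ~prob of the ids with the [MASK] id; return masked ids and labels."""
    masked, labels = [], []
    for tid in token_ids:
        if random.random() < prob:
            masked.append(mask_id)
            labels.append(tid)       # predict the original id
        else:
            masked.append(tid)
            labels.append(-100)      # ignored by the loss
    return masked, labels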
  10. Pretraining: Task
      LUKE is trained to
      • predict the original word of each masked word over the whole word vocabulary
      • predict the original entity of each masked entity over the whole entity vocabulary
  20. The attention weight (αij ) is computed based on the

    dot product of two vectors: Given the input vector sequence x 1 ,x 2 …,x k , the output vector y i corresponding to the i-th token is computed based on the weighted sum of the projected input vectors of all tokens Background: Transformer’s Self-attention Mechanism 20 The transformer’s self-attention mechanism relates tokens each other based on the attention weight between each pair of tokens ◦ Qx i : The input vector corresponding to the attending token projected by query matrix Q ◦ Kx j : The input vector corresponding to the token attended to projected by key matrix K
  21. The attention weight (αij ) is computed based on the

    dot product of two vectors: Given the input vector sequence x 1 ,x 2 …,x k , the output vector y i corresponding to the i-th token is computed based on the weighted sum of the projected input vectors of all tokens Background: Transformer’s Self-attention Mechanism 21 The transformer’s self-attention mechanism relates tokens each other based on the attention weight between each pair of tokens ◦ Qx i : The input vector corresponding to the attending token projected by query matrix Q ◦ Kx j : The input vector corresponding to the token attended to projected by key matrix K
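For reference, a single-head version of this standard computation can be written in a few lines of PyTorch (illustrative only; the scaling by √d matches the usual transformer formulation):

# Standard single-head self-attention, written to parallel the notation above.
import torch
import torch.nn.functional as F

def self_attention(x, Q, K, V):
    # x: (seq_len, hidden); Q, K, V: (hidden, hidden) projection matrices
    q, k, v = x @ Q, x @ K, x @ V
    alpha = F.softmax(q @ k.t() / (x.size(-1) ** 0.5), dim=-1)  # attention weights α_ij
    return alpha @ v  # y_i = Σ_j α_ij (V x_j)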
  12. Proposed Method: Entity-aware Self-attention Mechanism
      A simple extension of the self-attention mechanism that lets the model use the types of the interacting tokens when computing attention weights
      • We extend self-attention by using a different query matrix for each possible pair of token types (word or entity) of x_i and x_j: word-to-word, word-to-entity, entity-to-word, and entity-to-entity
      [Figure: original self-attention mechanism vs. entity-aware self-attention mechanism]
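Below is a minimal single-head sketch of this idea: one of four query matrices is selected according to the types of the attending and attended tokens, while the key and value matrices are shared. It illustrates the mechanism rather than reproducing the authors' implementation; module and matrix names are assumptions.

# Illustrative single-head entity-aware self-attention (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EntityAwareSelfAttention(nn.Module):
    def __init__(self, hidden=768):
        super().__init__()
        # One query matrix per (attending type, attended type) pair.
        self.query = nn.ModuleDict({
            "w2w": nn.Linear(hidden, hidden, bias=False),
            "w2e": nn.Linear(hidden, hidden, bias=False),
            "e2w": nn.Linear(hidden, hidden, bias=False),
            "e2e": nn.Linear(hidden, hidden, bias=False),
        })
        self.key = nn.Linear(hidden, hidden, bias=False)    # shared key matrix K
        self.value = nn.Linear(hidden, hidden, bias=False)  # shared value matrix V
        self.scale = hidden ** 0.5

    def forward(self, x, is_entity):
        # x: (seq_len, hidden); is_entity: (seq_len,) bool marking entity tokens
        k, v = self.key(x), self.value(x)
        scores = x.new_zeros(x.size(0), x.size(0))
        for name, proj in self.query.items():
            s = proj(x) @ k.t() / self.scale            # scores under this query matrix
            rows = is_entity if name[0] == "e" else ~is_entity
            cols = is_entity if name[2] == "e" else ~is_entity
            mask = rows.unsqueeze(1) & cols.unsqueeze(0)
            scores = torch.where(mask, s, scores)       # keep s only where the types match
        alpha = F.softmax(scores, dim=-1)               # attention weights α_ij
        return alpha @ v                                # y_i = Σ_j α_ij (V x_j)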
  13. Experiments: Overview
      We advance the state of the art on five diverse tasks, using a similar architecture for every task: a linear classifier on top of the representations of words, entities, or both

      Dataset      Task
      Open Entity  Entity typing
      TACRED       Relation classification
      CoNLL-2003   Named entity recognition
      ReCoRD       Cloze-style QA
      SQuAD        Extractive QA
  14. How to Compute Entity Representations in Downstream Tasks
      Entity representations can be computed in two ways:
      • using the [MASK] entity as the input token(s)
        ◦ the model gathers the information about the entities from the input text
        ◦ used in all tasks except extractive QA (SQuAD)
      • using Wikipedia entities as the input token(s)
        ◦ the entity representations are computed from the information stored in the entity token embeddings
        ◦ the word representations are enriched by the entity representations inside the transformer
        ◦ used in the extractive QA (SQuAD) task
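The two input modes can be tried directly with the LUKE support in Hugging Face Transformers. The snippet below follows the library's documented LUKE API (LukeTokenizer, LukeModel, and the studio-ousia/luke-base checkpoint); exact argument and attribute names are assumptions that may vary across library versions.

# Hedged example of the two entity-input modes with Hugging Face Transformers.
import torch
from transformers import LukeTokenizer, LukeModel

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
model = LukeModel.from_pretrained("studio-ousia/luke-base")

text = "Beyoncé lives in Los Angeles."
spans = [(0, 7), (17, 28)]  # character spans of "Beyoncé" and "Los Angeles"

# (a) [MASK] entities: only the spans are given, so the model has to gather
#     entity information from the input text (used for most tasks).
inputs_mask = tokenizer(text, entity_spans=spans, return_tensors="pt")

# (b) Wikipedia entities: entity titles are given, so knowledge stored in the
#     pretrained entity embeddings is injected (used for extractive QA).
inputs_wiki = tokenizer(text, entities=["Beyoncé", "Los Angeles"],
                        entity_spans=spans, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs_wiki)
print(out.last_hidden_state.shape)          # contextualized word representations
print(out.entity_last_hidden_state.shape)   # contextualized entity representations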
  15. Experiments: Entity Typing, Relation Classification, Cloze-style QA
      Datasets: Open Entity (entity typing), TACRED (relation classification), ReCoRD (cloze-style QA)
      Model: a linear classifier that takes the output entity representation(s) as its input feature
      Model inputs:
      • the words in the target sentence
      • [MASK] entities representing the target entity span(s)
      Result: SOTA on these three important entity-related tasks
      [Tables: results on Open Entity, TACRED, and ReCoRD]
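As a concrete illustration of the entity-typing setup (a [MASK] entity placed over the target span, followed by a linear classifier), the following uses the fine-tuned checkpoint referenced in the Transformers documentation for LukeForEntityClassification; the checkpoint name and output handling are assumptions tied to that documentation.

# Hedged entity-typing example with a fine-tuned LUKE checkpoint.
import torch
from transformers import LukeTokenizer, LukeForEntityClassification

name = "studio-ousia/luke-large-finetuned-open-entity"  # assumed checkpoint name
tokenizer = LukeTokenizer.from_pretrained(name)
model = LukeForEntityClassification.from_pretrained(name)

text = "Beyoncé lives in Los Angeles."
inputs = tokenizer(text, entity_spans=[(0, 7)], return_tensors="pt")  # span of "Beyoncé"
with torch.no_grad():
    logits = model(**inputs).logits

# Open Entity is multi-label; argmax is used here only for brevity.
print("Predicted type:", model.config.id2label[int(logits.argmax(-1))])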
  16. Experiments: Named Entity Recognition (CoNLL-2003)
      Model:
      1. Enumerate all possible spans in the input text as entity name candidates
      2. Classify each span into an entity type or the non-entity type using a linear classifier over the entity representation and the word representations of the first and last words of the span
      3. Greedily select spans based on the logits
      Model inputs:
      • the words in the input text
      • [MASK] entities corresponding to all possible entity name candidates
      Result: SOTA on the CoNLL-2003 named entity recognition dataset
      [Table: results on CoNLL-2003]
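The decoding procedure above (enumerate candidate spans, classify them, then greedily keep non-overlapping spans) can be sketched as follows; the scoring function is a stub standing in for the linear classifier, and the maximum span length is an assumed hyperparameter.

# Illustrative span-based NER decoding (not the authors' code).
from typing import Callable, List, Tuple

def greedy_span_ner(n_tokens: int,
                    score_span: Callable[[int, int], Tuple[str, float]],
                    max_span_len: int = 16) -> List[Tuple[int, int, str]]:
    # 1. Enumerate all candidate spans up to a maximum length (inclusive ends).
    candidates = [(i, j) for i in range(n_tokens)
                  for j in range(i, min(i + max_span_len, n_tokens))]
    # 2. Score every span; keep those not classified as the non-entity type "O".
    scored = []
    for i, j in candidates:
        label, logit = score_span(i, j)
        if label != "O":
            scored.append((logit, i, j, label))
    # 3. Greedily accept the highest-scoring spans that do not overlap
    #    with an already selected span.
    scored.sort(reverse=True)
    selected, used = [], set()
    for logit, i, j, label in scored:
        if all(t not in used for t in range(i, j + 1)):
            selected.append((i, j, label))
            used.update(range(i, j + 1))
    return selected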
  17. Experiments: Extractive Question Answering (SQuAD v1.1)
      Model: two linear classifiers on top of the output word representations, predicting the start and end positions of the answer
      Model inputs:
      • the words in the question and the passage
      • Wikipedia entities in the passage, generated automatically with a heuristic entity linking method
      Result: SOTA on the SQuAD v1.1 extractive question answering dataset; LUKE reached #1 on the leaderboard
      [Table: results on SQuAD v1.1]
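A minimal sketch of such a span-extraction head is shown below: a single linear layer produces a start logit and an end logit per word position. It illustrates the standard extractive-QA head rather than the authors' exact code.

# Illustrative start/end span-extraction head over the output word representations.
import torch
import torch.nn as nn

class SpanExtractionHead(nn.Module):
    def __init__(self, hidden=1024):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden, 2)  # one logit for start, one for end

    def forward(self, word_hidden_states):
        # word_hidden_states: (batch, seq_len, hidden)
        logits = self.qa_outputs(word_hidden_states)
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)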
  18. Ablation Study (1): Entity Representations
      When the tasks are addressed without inputting entities, performance degrades significantly on CoNLL-2003 and SQuAD v1.1
      [Tables: results when using [MASK] entities as inputs and when using Wikipedia entities as inputs]
  19. Ablation Study (2): Entity-aware Self-attention
      Our entity-aware self-attention mechanism consistently outperforms the original attention mechanism across all tasks
  20. Adding LUKE to Hugging Face Transformers
      • LUKE is officially supported by Hugging Face Transformers
      • The state-of-the-art results reported in the paper can now be easily reproduced using Transformers on Colab notebooks:
        ◦ NER on CoNLL-2003
        ◦ Relation extraction on TACRED
        ◦ Entity typing on Open Entity
      https://github.com/studio-ousia/luke/issues/38
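For example, TACRED-style relation classification can be run with the fine-tuned checkpoint referenced in the Transformers documentation for LukeForEntityPairClassification; the checkpoint name is an assumption tied to that documentation and may differ across library versions.

# Hedged relation-classification example with a fine-tuned LUKE checkpoint.
import torch
from transformers import LukeTokenizer, LukeForEntityPairClassification

name = "studio-ousia/luke-large-finetuned-tacred"  # assumed checkpoint name
tokenizer = LukeTokenizer.from_pretrained(name)
model = LukeForEntityPairClassification.from_pretrained(name)

text = "Beyoncé lives in Los Angeles."
spans = [(0, 7), (17, 28)]  # head and tail entity spans
inputs = tokenizer(text, entity_spans=spans, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

print("Predicted relation:", model.config.id2label[int(logits.argmax(-1))])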
  21. Summary
      • LUKE is a new pretrained contextualized representation of words and entities, built on an improved transformer architecture with a novel entity-aware self-attention mechanism
      • The effectiveness of LUKE is demonstrated by state-of-the-art results on five important entity-related tasks
      Contact: [email protected], @ikuyamada
      Paper: https://arxiv.org/abs/2010.01057
      Code: https://github.com/studio-ousia/luke