
Encoder Decoder Models


language modeling, Recurrent Neural Network Language Model (RNNLM), encoder-decoder models, sequence-to-sequence models, attention mechanism, reading comprehension, question answering, headline generation, multi-task learning, character-based RNN, byte-pair encoding, SentencePiece, Convolutional Sequence to Sequence (ConvS2S), Transformer, coverage, round-trip translation

Naoaki Okazaki

August 07, 2020



Transcript

  1. Encoder Decoder Models Naoaki Okazaki School of Computing, Tokyo Institute

    of Technology [email protected] PowerPoint template designed by https://ppt.design4u.jp/template/
  2. Main task: Machine Translation (MT) 1  Translate a text in one language into another  Basic idea  How do Computers Learn a New Language? An Introduction to Statistical Machine Translation  https://www.youtube.com/watch?v=_ghMKb6iDMM (6:29)  こんにちは Hello 您好 Hola
  3. Statistical Machine Translation (SMT) 2  Building probabilistic models
    Input x: 私は東京に行った.  Output y: I went to Tokyo.  Goal: model P(y|x)
    Translation model (Japanese to English), estimated from supervision data (parallel corpus), e.g. 私は動物園に行った.彼らは東京に行った. ↔ I went to the zoo. They went to Tokyo.:
    P(I | 私は) = 0.8, P(they | 彼らは) = 0.8, P(went | 行った) = 0.9, P(to | に) = 0.9, P(the zoo | 動物園) = 0.8, P(Tokyo | 東京) = 0.8
    Language model (naturalness in English), estimated from supervision data (monolingual corpus), e.g. “I went to Tokyo to meet my friend last Sunday. It was the first time since ……”, “We went to the zoo near Ueno for ……”:
    P(the | of) = .012243, P(the | in) = .007208, P(the | to) = .005042, ……, P(was | it) = .000522, P(to | went) = .000080
  4. DNNs applied to MT 3  Replace the probabilistic models with DNNs  Input x: 私は東京に行った.  Output y: I went to Tokyo.  P(y|x): Deep Neural Networks (DNNs)  It is not as simple as importing DNN architectures that were successful in other research fields (e.g., computer vision)
  5. Connection to the previous lecture 4  Embeddings for phrases and sentences seem to be useful for solving tasks  Is it possible to generate a sentence (sequence of words) from embeddings?  Yes, encoder-decoder models can do that!  Example: very good movie → Enc → Dec → とても 良い 映画
  6. Demo: Text-generation with GPT-2 6 https://github.com/graykode/gpt-2-Pytorch Text generated by giving

    the first paragraph of the Wikipedia article of “Harry Potter” https://en.wikipedia.org/wiki/Harry_Potter
  7. Language model (LM) 7  For a given word sequence w_1, …, w_m, LMs compute the joint probability P(w_1, …, w_m)  Example: which word fills the blank in “I have a ___.”?  argmax_{w ∈ V} P(I, have, a, w)  (V: set of all words in the vocabulary; candidate words: pen, dog, PC, ……)  Used to assess the naturalness of a sentence (sequence of words) generated by machine translation, speech recognition, etc.
  8. Probabilistic language models 8
    P(w_1, …, w_m) = ∏_{i=1}^{m} P(w_i | w_1, …, w_{i-1})
    Predict the next word w_i after the word sequence w_1, …, w_{i-1}:
    P(w_i | w_1, …, w_{i-1}) = #(w_1, …, w_{i-1}, w_i) / #(w_1, …, w_{i-1})
    Example: P(This, is, a, pen) = P(This | BOS) P(is | This) P(a | This is) P(pen | This is a) P(EOS | This is a pen)
    ☹ Data sparseness problem: insufficient statistics to estimate the probability with a longer sequence of words
  9. n-gram probabilistic language modeling 9
    P(w_1, …, w_m) ≈ ∏_{i=1}^{m} P(w_i | w_{i-n+1}, …, w_{i-1})
    Remedy the data sparseness problem by compromising with a shorter context: predict the next word w_i from the preceding word sequence w_{i-n+1}, …, w_{i-1} of length n − 1:
    P(w_i | w_{i-n+1}, …, w_{i-1}) = #(w_{i-n+1}, …, w_{i-1}, w_i) / #(w_{i-n+1}, …, w_{i-1})  →  We have more counts!
    Example with 2-gram: P(This, is, a, pen) = P(This | BOS) P(is | This) P(a | is) P(pen | a) P(EOS | pen)
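The count-based estimate above can be made concrete in a few lines of Python. The following is a minimal sketch of a 2-gram model estimated by relative frequencies from a toy corpus (the corpus, the BOS/EOS handling, and the absence of smoothing are illustrative assumptions, not part of the slides):

```python
# A minimal count-based bigram (2-gram) language model:
# P(w_i | w_{i-1}) = #(w_{i-1}, w_i) / #(w_{i-1}).
from collections import Counter

corpus = [["BOS", "this", "is", "a", "pen", "EOS"],
          ["BOS", "this", "is", "a", "dog", "EOS"]]

unigram, bigram = Counter(), Counter()
for sent in corpus:
    for prev, cur in zip(sent[:-1], sent[1:]):
        unigram[prev] += 1
        bigram[(prev, cur)] += 1

def p_bigram(cur, prev):
    # Relative-frequency estimate; returns 0 for unseen events (no smoothing).
    return bigram[(prev, cur)] / unigram[prev] if unigram[prev] else 0.0

def sentence_prob(words):
    # P(w_1, ..., w_m) ≈ product of P(w_i | w_{i-1}) under the 2-gram assumption.
    prob = 1.0
    for prev, cur in zip(words[:-1], words[1:]):
        prob *= p_bigram(cur, prev)
    return prob

print(sentence_prob(["BOS", "this", "is", "a", "pen", "EOS"]))  # 0.5
```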
  10. Sentence generation with LM 10  Find the word sequence w_1, …, w_m that maximizes the probability:  argmax_{w_1,…,w_m ∈ V} P(w_1, …, w_m)  However, we cannot specify a desired output this way  Generate a sentence w_1, …, w_m conditioned on an input x_1, …, x_n:  argmax_{w_1,…,w_m ∈ V} P(w_1, …, w_m) Q(w_1, …, w_m | x_1, …, x_n)  Q is the translation model in machine translation  Q: whether the output is a correct translation of the input  P: whether the generated sentence is natural in the language
  11. Sentence generation as a search problem 11  Sentence generation has O(|V|^m) time complexity:  argmax_{w_1,…,w_m ∈ V} P(w_1, …, w_m) Q(w_1, …, w_m | x_1, …, x_n)  Unrealistic to enumerate all possible candidates  Usually |V| > 10,000 and m is 20~100: 10,000^20 = 10^80  Search one word after another (i.e., greedy / beam search), expanding a tree of partial hypotheses from BOS and scoring each prefix with P(w_1, …, w_t) Q(w_1, …, w_t | X)
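The greedy/beam search mentioned above can be sketched as follows. This is a toy illustration, not the decoder of an actual MT system: `next_token_scores` is a hypothetical stand-in for the per-step model score (the log of P·Q from the slide), and the beam size and vocabulary are assumptions:

```python
# A minimal beam-search sketch over a hypothetical per-step scoring function.
def next_token_scores(prefix):
    # Hypothetical: return {token: log-score} for the next position.
    return {"I": -0.5, "have": -1.0, "a": -1.2, "pen": -1.5, "EOS": -2.0}

def beam_search(beam_size=2, max_len=5):
    beams = [(["BOS"], 0.0)]                      # (prefix, cumulative log score)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, logp in next_token_scores(prefix).items():
                candidates.append((prefix + [tok], score + logp))
        # Keep only the top `beam_size` partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == "EOS" else beams).append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

print(beam_search())
```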
  12. Issues in LM (before the DNN era) 12  Data sparseness  Rare words suffer from the insufficiency of statistics  The insufficiency gets worse when using n-grams (word combinations)  Addressed by smoothing methods (e.g., Good-Turing, Kneser-Ney)  Surface variations  Surface variations with the same meaning have different probabilities  For example, P(girl | clever) and P(girl | smart) are independent even if ‘clever’ and ‘smart’ have similar meanings  Addressed by ‘class’ models that merge similar words into a group  Long-distance dependency  n-gram models cannot consider dependencies longer than n words  Neural LMs address these issues using distributed representations (word embeddings and their compositions)
  13. Recurrent Neural Network Language Model (Mikolov+ 2010) 13
    Input: w_0 = BOS, w_1 = I, w_2 = have, w_3 = a, w_4 = pen; at each step a softmax layer outputs p_1(I), p_2(have), p_3(a), p_4(pen), p_5(EOS)
    P(w_1, …, w_m) = p_1(I) × p_2(have) × p_3(a) × p_4(pen) × p_5(EOS)
    The number of dimensions of the output layer is |V|, where V is the set of possible words; each element presents the probability of generating the corresponding word
    The probability of a sequence of words is a product of token prediction probabilities
    T Mikolov, M Karafiát, L Burget, J Černocký, S Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH, pp. 1045-1048.
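A minimal sketch of an RNN language model in PyTorch, a simplified stand-in for the RNNLM of Mikolov+ 2010 (layer sizes and the toy vocabulary are assumptions):

```python
# Embed tokens, run an RNN, and predict the next token at every timestep
# with a softmax over the vocabulary.
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)   # |V|-dimensional output layer

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))            # (batch, seq, hidden)
        return self.out(h)                             # logits for the next token

# Toy usage: "BOS I have a pen" -> predict "I have a pen EOS".
vocab = {"BOS": 0, "I": 1, "have": 2, "a": 3, "pen": 4, "EOS": 5}
inp = torch.tensor([[0, 1, 2, 3, 4]])
tgt = torch.tensor([[1, 2, 3, 4, 5]])
model = RNNLM(len(vocab))
logits = model(inp)
loss = nn.CrossEntropyLoss()(logits.view(-1, len(vocab)), tgt.view(-1))
# P(w_1, ..., w_m) is the product of the per-step softmax probabilities.
```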
  14. Recurrent Neural Networks (RNNs) (Sutskever+ 2011) 15
    Example input: “John loves Mary much”
    Word embeddings: represent a word with a vector x_t ∈ R^d
    Recurrent computation: compose a hidden vector h_t from an input word x_t and the hidden vector h_{t-1} at the previous timestep:  h_t = f(W_hx x_t + W_hh h_{t-1}),  h_0 = 0
    Fully-connected layer for a task: make a prediction y from the hidden vector h_4, which is composed from all words in the sentence, by using a fully-connected layer W_yh and softmax
    ☺ The parameters W_hx, W_hh, W_yh are shared over the entire sequence; they are trained from the supervision signal (x_1, …, x_4, y) using backpropagation
    I Sutskever, J Martens, G Hinton. 2011. Generating text with recurrent neural networks. In ICML, pp. 1017–1024.
  15. Convolutional Neural Network (CNN) (Kim 2014) 16
    Example input: “It is a very good movie indeed”
    Convolution: a filter W maps each window of word vectors x_{t:t+δ} to a vector p_t
    Max pooling: each dimension c_i is the maximum of the values p_{t,i} over timesteps,  c_i = max_{1<t<T−δ+1} p_{t,i}
    A fully-connected layer and softmax predict the label y from the pooled vector  ☺
    Y Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP, pp. 1746-1751.
  16. Encoding 17  These models can be decomposed into  Encoding (variable-length input to feature vector):  z = φ(x_1, …, x_m)  (φ is a part of the NN)  Solving the task (e.g., classify the text using the feature vector):  y = ψ(z)  (ψ is also a part of the NN)  (Both the RNN of slide 14 and the CNN of slide 15 fit this decomposition)
  17. Using RNNLM for generating sentences 19  Predict a sequence of words for a given input, in addition to scoring the naturalness of the generated sentence  Input: BOS I have a pen → Output: I have a pen EOS
  18. Encoder decoder model (EncDec) (Sutskever+ 2014; Cho+ 2014) 20
    Encoder input: I have a pen → Decoder input: BOS ペン を 持つ → Decoder output: ペン を 持つ EOS  (this illustration omits the matrices of the RNNs; the final encoder state is the representation of the input)
    Encode an input sentence into a feature vector, and generate a sentence by decoding (predicting) a word sequence from the feature vector
    Also known as the sequence-to-sequence model
    Machine translation is realized by a single NN!  Machine translation had been a mix of various theories and methods before neural machine translation (NMT)
    I Sutskever, O Vinyals, Q V Le. 2014. Sequence to sequence learning with neural networks. In NIPS, pp. 3104–3112.
    K Cho, B van Merrienboer, C Gulcehre, D Bahdanau, F Bougares, H Schwenk, Y Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, pp. 1724–1734.
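A bare-bones sketch of an encoder-decoder model in PyTorch, in the spirit of Sutskever+ 2014 / Cho+ 2014 but not their exact architectures (the GRU layers, dimensions, and teacher-forcing setup are assumptions):

```python
# The encoder compresses the source into its final hidden vector, and the
# decoder generates the target conditioned on that vector.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src, tgt_in):
        # Encode: the last hidden state summarizes the whole source sentence.
        _, h = self.encoder(self.src_embed(src))
        # Decode with teacher forcing: feed the gold target shifted by one.
        dec_states, _ = self.decoder(self.tgt_embed(tgt_in), h)
        return self.out(dec_states)                  # next-token logits

# Usage sketch: src ~ "I have a pen", tgt_in ~ "BOS ペン を 持つ"; the loss
# would compare the logits against "ペン を 持つ EOS".
model = Seq2Seq(src_vocab=100, tgt_vocab=120)
logits = model(torch.tensor([[5, 6, 7, 8]]), torch.tensor([[0, 11, 12, 13]]))
```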
  19. Caption generation (Vinyals+ 2015) 21 O Vinyals, A Toshev, S

    Bengio, D Erhan. 2015. Show and tell: A neural image caption generator. In CVPR.
  20. Chatbot (Vinyals+ 2015) 22  Supervision data: OpenSubtitles  Scripts

    extracted from movie subtitles (6G sentences)  A chat example from the EncDec model O Vinyals, Q V Le. 2015. A neural conversational model, In ICML Deep Learning Workshop.
  21. Summary 23  Encoder-decoder architecture  An encoder converts an input sentence into a feature vector  A decoder generates a sentence based on the vector  We can train an encoder-decoder model in an end-to-end fashion  (An autoregressive) decoder predicts a token sequence by feeding predicted tokens back into the input layer  We can connect different modalities (e.g., language and vision) in a single NN as long as they are represented as vectors
  22. Weakness of EncDec 25  EncDec represents a variable-length input with a fixed-size vector  EncDec has no flexibility about the amount of information kept from an input  EncDec suffers when handling longer sentences  (Example: I have a pen → BOS ペン を 持つ → ペン を 持つ EOS)
  23. The idea of attention mechanism 26  Example: This is a pen → BOS これ は ペン → これ は ペン EOS, with weights a(1), …, a(5) over the input positions  At each timestep in the decoder, predict a word using the weighted sum of all hidden vectors in the input  The attention mechanism determines the weights automatically from the decoder state  The decoder now has access to all hidden vectors in the input
  24. Attention mechanism (Bahdanau+ 2015, Luong+ 2015) 27
    Encoder positions s = 1 … 4 (This, is, a, pen) with hidden vectors h̄_s; decoder positions t (BOS, これ, …) with hidden vectors h_t  Different variables are used for the time steps of the encoder (s) and the decoder (t)
    score(h_t, h̄_s): how much the decoder at time step t needs information from time step s in the encoder; here score(h_t, h̄_s) = h_t · h̄_s
    a_t(s) = exp(score(h_t, h̄_s)) / Σ_{s'} exp(score(h_t, h̄_{s'}))
    c_t = Σ_s a_t(s) h̄_s
    h̃_t = tanh(W_c [c_t; h_t])
    y_t = softmax(W_y h̃_t)
    Computation flow (Luong+ 2015): y_{t-1} → h_t → a_t(s) → c_t → h̃_t → y_t → h_{t+1}
    D Bahdanau, K Cho, Y Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
    M-T Luong, H Pham, C D Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP, pp. 1412-1421.
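A sketch of one decoder step of Luong-style dot-product attention, following the computation flow above (the tensor shapes and the use of PyTorch are assumptions):

```python
# One attention step: scores -> weights a_t(s) -> context c_t -> h_tilde.
import torch
import torch.nn.functional as F

def luong_attention_step(h_t, enc_states, W_c):
    # h_t:        (d,)      current decoder hidden state
    # enc_states: (S, d)    encoder hidden states h_bar_1 ... h_bar_S
    # W_c:        (d, 2d)   parameter of the output combination
    scores = enc_states @ h_t                 # score(h_t, h_bar_s) = dot product
    a_t = F.softmax(scores, dim=0)            # attention weights over source positions
    c_t = a_t @ enc_states                    # context vector: weighted sum of encoder states
    h_tilde = torch.tanh(W_c @ torch.cat([c_t, h_t]))
    return h_tilde, a_t

d, S = 8, 5
h_tilde, a = luong_attention_step(torch.randn(d), torch.randn(S, d), torch.randn(d, 2 * d))
```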
  25. Computing attention scores 28  Attention:  a_t(s) = exp(score(h_t, h̄_s)) / Σ_{s'} exp(score(h_t, h̄_{s'}))  Scores are normalized into a probability distribution  Various approaches for computing the attention score:  score(h_t, h̄_s) = h_t · h̄_s (dot),  h_t^⊤ W_a h̄_s (product),  v_a · tanh(W_a [h_t; h̄_s]) (concat)  v_a and W_a are parameters (trained by backpropagation)
  26. Attention has an advantage on longer sentences 29 (Luong+ 2015)

    local-p: Attention mechanism that predicts the focal range of the input sequence based on the hidden state of the decoder M-T Luong, H Pham, C D Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP, pp. 1412-1421.
  27. Attention roughly represents alignments 30 Global attention Local monotonic focus

    Gold alignment Local predictive focus (Luong+ 2015) M-T Luong, H Pham, C D Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP, pp. 1412-1421.
  28. Show, attend and tell (Xu+ 2015) 31 (Xu+ 2015) K

    Xu, J Ba, R Kiros, K Cho, A Courville, R Salakhutdinov, R Zemel, Y Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In ICML, pp. 2048-2057.
  29. RNN/LSTM and CNN 33  It is hard to parallelize

    RNN/LSTM for time steps  It is easy to parallelize CNN for time steps ☺ RNN may capture distant dependencies of tokens ☹ Need to traverse the full distance of the path of words ☹ Hard to parallelize ☺ We can compute convolutions in parallel ☹ CNN may not capture dependencies beyond the window
  30. ByteNet (Kalchbrenner+ 16) 34  ☺ Requires log n traverses for handling dependencies between tokens n positions apart  (Kalchbrenner+ 2016)  N Kalchbrenner, L Espeholt, K Simonyan, A van den Oord, A Graves, K Kavukcuoglu. 2016. Neural Machine Translation in Linear Time. arXiv:1610.10099.
  31. Convolutional Sequence to Sequence (ConvS2S) (Gehring+ 17) 35  An encoder-decoder model built only with CNNs  Example: encoder これ は ペン です _ EOS _, decoder _ BOS This is _ a pen EOS  A rotation animation represents the composition of a hidden state of the decoder by attending to the ones in the encoder  Predict a word → compose the decoder vector → predict the next word; to realize this, dummy tokens _ are inserted  J Gehring, M Auli, D Grangier, D Yarats, Y N Dauphin. 2017. Convolutional sequence to sequence learning. In ICML. pp. 1243-1252.
  32. Vector composition in ConvS2S 36
    Position embedding:  e_t = w_t + p_t  (word embedding w_t plus position embedding p_t, e.g. これ<1>, は<2>)
    Gated Linear Unit (GLU):  h'_t = (E W_c + b_c) ⊗ σ(E W_g + b_g), where E stacks the input embeddings in the convolution window
    Residual connection:  h_t = h'_t + w_t
    The encoder and decoder use the same architecture; their experiments use 20-layer CNNs with a window length of 3
  33. Transformer: “Attention is all you need” (Vaswani+ 2017) 38 A

    Vaswani, N Shazeer, N Parmar, J Uszkoreit, L Jones, A N. Gomez, L Kaiser, I Polosukhin. 2017. Attention is all you need. In NIPS, pp. 5998–6008. https://research.googleblog.com/2017/08/transformer-novel-neural-network.html
  34. The architecture of Transformer 39 (Vaswani+ 2017) A Vaswani, N

    Shazeer, N Parmar, J Uszkoreit, L Jones, A N. Gomez, L Kaiser, I Polosukhin. 2017. Attention is all you need. In NIPS, pp. 5998–6008.
  35. The architecture of Transformer 40
    Example: encoder input “John loves Mary”; decoder input “BOS ジョン は メアリー …”
    Both stacks combine the numbered components of the following slides: 1. Multi-head attention (self attention in the encoder and decoder, plus cross attention from the decoder to the encoder), 2. Positional encoding, 3. Residual + Layer-norm after each sub-layer, and 4. Feedforward
  36. QKV attention mechanism 41  Attention mechanism with query (Q), key (K), and value (V)  Analogous to querying an associative array (key-value store); queries, keys, and values are represented by vectors  Yields a weighted sum of values instead of returning a single value  The weights are computed from the relatedness between a query and the keys  A query q attends to the keys and obtains q̂ as a weighted sum of the values:
    K = (k_1, …, k_I), V = (v_1, …, v_I)  (k_i, v_i ∈ R^d),  q ∈ R^d,  q̂ ∈ R^d
    q̂ = V softmax(c K^⊤ q),  c = 1/√d  (relatedness by k_i^⊤ q; weighted sum of v_1, …, v_I)
    1. Multi-head attention
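The formula q̂ = V softmax(c K^⊤ q) can be written directly in NumPy; the following is a minimal sketch for a single query (storing keys and values as the columns of K and V is an assumption consistent with the notation above):

```python
# Single-query QKV attention: q_hat = V softmax(c K^T q), c = 1/sqrt(d).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def qkv_attention(q, K, V):
    # q: (d,); K, V: (d, I) with one key/value vector per column.
    d = q.shape[0]
    a = softmax((K.T @ q) / np.sqrt(d))   # relatedness of q to each key, normalized
    return V @ a                          # weighted sum of the value vectors

q_hat = qkv_attention(np.array([1., -2., -1., 2.]),
                      np.random.randn(4, 3), np.random.randn(4, 3))
```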
  37. QKV generalizes the conventional attention mechanism 42
    Conventional: construct a decoder vector ẑ_j from z_j by attending to the encoder states h_1, …, h_I:
    ẑ_j = tanh(W_ẑh [z_j; ĥ_j])  (compose ẑ_j from z_j and ĥ_j)
    ĥ_j = H a_j  (a sum of h_1, …, h_I weighted by a_j)
    a_j = softmax(a'_j)  (normalization a'_j → a_j)
    a'_j = H^⊤ z_j  (weights a'_j ∈ R^I are dot products of h_1, …, h_I and z_j)
    H = (h_1, …, h_I) ∈ R^{d_h×I}  (attend the I vectors in H)
    QKV: compute weights a = softmax(c K^⊤ q) and construct q̂ as a weighted sum of V:
    q̂ = V a  (a sum of V = (v_1, …, v_I) weighted by a)
    a = softmax(a')  (normalization a' → a)
    a' = c K^⊤ q  (weights a' ∈ R^I are dot products of K = (k_1, …, k_I) and q; c = 1/√d compensates for larger dot products when d is larger)
    q ∈ R^d (query vector), q̂ ∈ R^d, K ∈ R^{d×I} (keys), V ∈ R^{d×I} (values)
    1. Multi-head attention
  38. Computing QKV attention 43
    q = (1, −2, −1, 2),  K = (k_1, k_2, k_3) with k_1 = (1, 1, −1, 1), k_2 = (3, 1, 3, 1), k_3 = (−1, 0, 1, 0),  V = (v_1, v_2, v_3) with v_1 = (1, 0, 0, 1), v_2 = (0, 1, 0, 1), v_3 = (0, 0, 1, 1)
    a' = c K^⊤ q with c = 1/√4 = 1/2:  (1/2) k_1^⊤ q = 1,  (1/2) k_2^⊤ q = 0,  (1/2) k_3^⊤ q = −1,  so a' = (1, 0, −1)
    a = softmax(a') = (0.67, 0.24, 0.09)
    q̂ = V a = 0.67 v_1 + 0.24 v_2 + 0.09 v_3 = (0.67, 0.24, 0.09, 1.00)
    1. Multi-head attention
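The worked example can be checked numerically; the following NumPy snippet assumes the keys and values are stored as the columns of K and V, as reconstructed above:

```python
# Reproducing the worked example: scores = [1, 0, -1], a ≈ [0.67, 0.24, 0.09],
# q_hat ≈ [0.67, 0.24, 0.09, 1.00].
import numpy as np

q = np.array([1., -2., -1., 2.])
K = np.array([[ 1.,  3., -1.],
              [ 1.,  1.,  0.],
              [-1.,  3.,  1.],
              [ 1.,  1.,  0.]])            # k_1, k_2, k_3 as columns
V = np.array([[1., 0., 0.],
              [0., 1., 0.],
              [0., 0., 1.],
              [1., 1., 1.]])               # v_1, v_2, v_3 as columns

scores = (K.T @ q) / np.sqrt(len(q))       # c K^T q
a = np.exp(scores) / np.exp(scores).sum()  # softmax
q_hat = V @ a                              # weighted sum of the values
print(scores, a, q_hat)
```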
  39. Cross (source-target) attention with QKV 44
    Compute a weighted sum of the encoder vectors h_1, …, h_I based on each decoder vector q_j  (example: encoder “John loves Mary”, decoder “ジョン は メアリー を …”; each decoder position applies W_Q, scaling by 1/√d_k, softmax, and a weighted sum)
    Q̂ = V A = V softmax(c K^⊤ Q),  Q = W_Q Z,  K = W_K H,  V = W_V H
    H = (h_1, …, h_I) ∈ R^{d×I},  Z = (z_1, …, z_J) ∈ R^{d×J},  Q = (q_1, …, q_J) ∈ R^{d×J},  Q̂ = (q̂_1, …, q̂_J) ∈ R^{d×J},  W_Q, W_K, W_V ∈ R^{d×d}
    1. Multi-head attention
  40. Self attention with QKV (encoder) 45
    Compute a weighted sum of the encoder vectors h_1, …, h_I based on word pairs in the encoder  (example: “John loves Mary”)
    Q̂ = V A = V softmax(c K^⊤ Q),  Q = W_Q H,  K = W_K H,  V = W_V H
    H = (h_1, …, h_I) ∈ R^{d×I},  Q = (q_1, …, q_I) ∈ R^{d×I},  Q̂ = (q̂_1, …, q̂_I) ∈ R^{d×I},  W_Q, W_K, W_V ∈ R^{d×d}
    1. Multi-head attention
  41. Self attention with QKV (decoder) 46
    Compute a weighted sum of the decoder vectors z_1, …, z_J based on word pairs in the decoder  (example: “ジョン は メアリー …”)
    Q̂ = V A = V softmax(c K^⊤ Q),  Q = W_Q Z,  K = W_K Z,  V = W_V Z
    Z = (z_1, …, z_J) ∈ R^{d×J},  Q = (q_1, …, q_J) ∈ R^{d×J},  Q̂ = (q̂_1, …, q̂_J) ∈ R^{d×J},  W_Q, W_K, W_V ∈ R^{d×d}
    1. Multi-head attention
  42. Formalization of QKV attention 47
    Construct Q̂ = (q̂_1, …, q̂_T) as sums of V = (v_1, …, v_S) weighted by dot products between Q = (q_1, …, q_T) and K = (k_1, …, k_S):
    Q̂ = Attention(Q, K, V) = V softmax(c K^⊤ Q)
    Q = (q_1, …, q_T) ∈ R^{d×T},  Q̂ = (q̂_1, …, q̂_T) ∈ R^{d×T},  K = (k_1, …, k_S) ∈ R^{d×S},  V = (v_1, …, v_S) ∈ R^{d×S}
    Self attention in encoders (reconstruct H by attending to H):  Q = W_Q H, K = W_K H, V = W_V H,  W_Q, W_K, W_V ∈ R^{d×d}  (S = T = I)
    Self attention in decoders (reconstruct Z by attending to Z):  Q = W_Q Z, K = W_K Z, V = W_V Z,  W_Q, W_K, W_V ∈ R^{d×d}  (S = T = J)
    Cross attention (attend to H with queries from Z):  Q = W_Q Z, K = W_K H, V = W_V H,  W_Q, W_K, W_V ∈ R^{d×d}  (S = I, T = J)
    1. Multi-head attention
  43. Multi-head attention mechanism 48
    QKV attention computes only a single pattern of weights (A ∈ R^{S×T})  Introduce multiple attention mechanisms with different perspectives:
    Q̂ = MultiHead(Q, K, V) = W_O Concat(Q̂^(1), …, Q̂^(H))
    Q̂^(h) = Attention(W_Q^(h) Q, W_K^(h) K, W_V^(h) V)  (h = 1, …, H)
    Q̂^(h) ∈ R^{d_k×T},  W_O ∈ R^{d×H d_v},  W_Q^(h) ∈ R^{d_k×d},  W_K^(h) ∈ R^{d_k×d},  W_V^(h) ∈ R^{d_v×d}
    Usually we set d_k = d_v = d/H and create a subspace for each attention head; W_Q^(h), W_K^(h), W_V^(h) transform the inputs into query, key, and value subspaces of R^{d_k}
    Example with self-attention of an encoder:  Q̂^(h) = Attention(W_Q^(h) W_Q H, W_K^(h) W_K H, W_V^(h) W_V H), equivalent to splitting the matrices of H transformed by W_Q, W_K, W_V into H heads
    1. Multi-head attention
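A compact sketch of multi-head attention following the formulation above (a didactic per-head loop rather than an optimized implementation; the matrix shapes and random test data are assumptions):

```python
# Each head projects Q, K, V into a d/H-dimensional subspace, runs scaled
# dot-product attention, and the concatenated heads are mixed by W_O.
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (d_k, T), K: (d_k, S), V: (d_v, S); returns (d_v, T).
    return V @ softmax((K.T @ Q) / np.sqrt(Q.shape[0]), axis=0)

def multi_head(Q, K, V, W_Q, W_K, W_V, W_O, H):
    heads = []
    for h in range(H):
        heads.append(attention(W_Q[h] @ Q, W_K[h] @ K, W_V[h] @ V))
    return W_O @ np.concatenate(heads, axis=0)   # (d, T)

d, H, S, T = 8, 4, 5, 3
rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.normal(size=(H, d // H, d)) for _ in range(3))
W_O = rng.normal(size=(d, d))
out = multi_head(rng.normal(size=(d, T)), rng.normal(size=(d, S)),
                 rng.normal(size=(d, S)), W_Q, W_K, W_V, W_O, H)
```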
  44. Computing multi-head attention mechanisms (d = 8, H = 4) 49
    The query q and the keys/values K, V are projected by W_Q^(h), W_K^(h), W_V^(h) into H subspaces (q^(h), K^(h), V^(h)); each head runs scaled dot-product (QKV) attention to obtain q̂^(h); the H outputs q̂^(1), …, q̂^(4) are concatenated and transformed by W_O into q̂
    1. Multi-head attention
  45. Why self-attention? 50  Self-attention is usually faster than RNNs (when n < d)  “NLP researchers feared the n² cost, but the Google engineers didn’t”  Self-attention is parallelizable over a sequence  Self-attention connects all positions in O(1) steps  RNN requires O(n) sequential computations  CNN requires O(log_k n) stacked convolution operations  (Vaswani+ 2017)  A Vaswani, N Shazeer, N Parmar, J Uszkoreit, L Jones, A N. Gomez, L Kaiser, I Polosukhin. 2017. Attention is all you need. In NIPS, pp. 5998–6008.  1. Multi-head attention
  46. Positional encoding 51  Transformer has no recurrence nor convolution  We need to inject position information into the hidden states in some way  Add a positional encoding p_t ∈ R^d to the token embedding w_t ∈ R^d at position t to represent positions of hidden states in the encoder and decoder:  x_t = w_t + p_t
    The value of the i-th dimension of the vector p_t (d: the number of dimensions of the vectors):
    p_t[i] = sin(ω_k t) if i = 2k,  cos(ω_k t) if i = 2k + 1,  ω_k = 1 / 10000^{2k/d}
    Modified from the figure in (Vaswani+ 2017)  A Vaswani, N Shazeer, N Parmar, J Uszkoreit, L Jones, A N. Gomez, L Kaiser, I Polosukhin. 2017. Attention is all you need. In NIPS, pp. 5998–6008.  2. Positional encoding
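A sketch of the sinusoidal positional encoding defined above (assumes an even number of dimensions d):

```python
# p_t[2k] = sin(w_k t), p_t[2k+1] = cos(w_k t), w_k = 1 / 10000^(2k/d).
import numpy as np

def positional_encoding(max_len, d):
    t = np.arange(max_len)[:, None]              # positions 0 .. max_len-1
    k = np.arange(d // 2)[None, :]
    omega = 1.0 / (10000 ** (2 * k / d))         # one frequency per dimension pair
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(omega * t)
    pe[:, 1::2] = np.cos(omega * t)
    return pe                                    # added to token embeddings: x_t = w_t + p_t

pe = positional_encoding(max_len=50, d=16)
```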
  47. Properties of positional encoding 52
    PE_d(t, i) = sin(ω_k t) if i = 2k,  cos(ω_k t) if i = 2k + 1,  ω_k = 1 / 10000^{2k/d}
    Values of lower dimensions change a lot, but those of higher dimensions do not  It looks like a continuous version of a binary code  Positional encodings from close positions yield similar values  2. Positional encoding
  48. Residual connection (He+ 16) 53  Suppose that we want to learn a function h(x)  We consider another mapping: f(x) = h(x) − x  Then the original mapping is h(x) = f(x) + x  We hypothesize that training f(x) is easier than h(x)  If the identity mapping is a good default, pushing f(x) toward 0 may be easier  We can view f(x) + x as a feedforward neural network with shortcut connections  Useful for building a deeper network: gradients flow through the shortcut connections  Proposed in ResNet (He+ 2016)  K He, X Zhang, S Ren, J Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR.  3. Residual + Layer-norm
  49. Layer normalization (Ba+ 16) 54  Ensure zero mean and unit variance of a vector x^(new) computed from x ∈ R^d:
    x_i^(new) ← (x_i − μ) / √(σ² + ε),  μ = (1/d) Σ_{i=1}^{d} x_i,  σ² = (1/d) Σ_{i=1}^{d} (x_i − μ)²
    This is used in various places in Transformer; the mean μ and variance σ² are computed at each time step
    How it works(?) (adapted from batch normalization (Bjorck+ 2018)):  Large activations in a lower layer cannot be propagated uncontrollably to upper layers because of the normalization operation  This prevents gradients from exploding (e.g., becoming too large)  This enables higher learning rates (recall that the amount of a parameter update is the product of a learning rate and a gradient)  A large learning rate η leads to larger noise in SGD (proportional to η²)  A larger SGD noise prevents the network from getting “trapped” in sharp minima and biases it towards wider minima with better generalization
    J. L. Ba, J. R. Kiros, G. E. Hinton. 2016. Layer Normalization. arXiv:1607.06450.  J. Bjorck, C. Gomes, B. Selman, K. Q. Weinberger. 2018. Understanding Batch Normalization. In NIPS, pp. 7694-7705.  3. Residual + Layer-norm
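A sketch of a sub-layer with a residual connection followed by layer normalization, combining the two previous slides (the epsilon value and the stand-in sub-layer are assumptions):

```python
# "Post-LN" arrangement: layer_norm(sublayer(x) + x), as around each
# Transformer sub-layer.
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize a single d-dimensional vector to zero mean and unit variance.
    mu = x.mean()
    var = x.var()
    return (x - mu) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    # h(x) = f(x) + x, then layer normalization.
    return layer_norm(sublayer(x) + x)

x = np.random.randn(8)
y = residual_block(x, lambda v: 0.5 * v)   # stand-in for an attention/FFN sub-layer
```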
  50. Feedforward layer 55  4. Feedforward  Two linear transformations with ReLU in between (Linear → ReLU → Linear), applied at each timestep:
    FFN(x) = W_2 max(0, W_1 x + b_1) + b_2
    W_1 ∈ R^{d_f×d},  b_1 ∈ R^{d_f},  W_2 ∈ R^{d×d_f},  b_2 ∈ R^d,  x ∈ R^d,  FFN(x) ∈ R^d
    The original paper sets d_f = 4d in the experiments
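The position-wise feedforward layer can be written directly from the formula above (the weight initialization here is arbitrary):

```python
# FFN(x) = W2 max(0, W1 x + b1) + b2, applied independently at each timestep.
import numpy as np

def ffn(x, W1, b1, W2, b2):
    return W2 @ np.maximum(0.0, W1 @ x + b1) + b2    # Linear -> ReLU -> Linear

d, d_f = 8, 32                                       # the paper uses d_f = 4d
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_f, d)), np.zeros(d_f)
W2, b2 = rng.normal(size=(d, d_f)), np.zeros(d)
y = ffn(rng.normal(size=d), W1, b1, W2, b2)
```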
  51. Masked self attention (in decoders) 56  When training an encoder-decoder model, we give all source and target tokens to the input layer at once  We want to complete all computation as matrix operations for better parallelization  However, the decoder must not look at future tokens  Before computing the softmax, we set all elements in the score matrix that point to future tokens to −∞ (in practice a large negative value, e.g. −10^9) (masking)  Example: decoder input “BOS ジョン は メアリー が 大好き”, target “ジョン は メアリー が 大好き EOS”, with an upper-triangular mask over the score matrix
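A sketch of the masking step: an upper-triangular matrix of large negative values is added to the attention scores before the softmax (the value −10^9 follows the slide; the shapes are assumptions):

```python
# Masked self-attention scores for a decoder: positions pointing to future
# tokens get a large negative value before the softmax.
import numpy as np

def causal_mask(T, neg=-1e9):
    # mask[t, s] = 0 if s <= t (visible), neg if s > t (future token).
    return np.triu(np.full((T, T), neg), k=1)

scores = np.random.randn(4, 4)            # raw attention scores (query t, key s)
masked = scores + causal_mask(4)
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # each row attends only to s <= t
```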
  52. Hyper-parameters 57
    Parameter                            Base   Big
    # layers (N)                         6      6
    # dimensions (d)                     512    1024
    # dimensions for FFN (d_f)           2048   4096
    # attention heads (h)                8      16
    # dimensions of keys/queries (d_k)   64     64
    # dimensions of values (d_v)         64     64
    Dropout rate P_drop                  0.1    0.3
    # training steps                     100K   300K
    # total parameters                   65M    213M
    Some training tips exist for Transformer  E.g., the learning rate is increased linearly for the first warm-up steps, and then decreased proportionally to the inverse square root of the step number
  53. Task performance 58 Transformer established the new state-of-the-art performance on

    En-De translation even with the base model (fewer parameters than big) (Vaswani+ 2017) A Vaswani, N Shazeer, N Parmar, J Uszkoreit, L Jones, A N. Gomez, L Kaiser, I Polosukhin. 2017. Attention is all you need. In NIPS, pp. 5998–6008.
  54. Coreference handling in self attention 59 The animal didn’t cross

    the street because it was too tired. The animal didn’t cross the street because it was too wide. A Vaswani, N Shazeer, N Parmar, J Uszkoreit, L Jones, A N. Gomez, L Kaiser, I Polosukhin. 2017. Attention is all you need. In NIPS, pp. 5998–6008. (Vaswani+ 2017)
  55. Performance improvements on machine translation 60  BLEU scores for English-to-German translation on the WMT 2014 dataset (higher is better):
    Transformer Big + Back translation (Edunov+ 18): 35
    Transformer Big (Ott+ 18): 29.3
    DeepL (press release, 17): 33.3
    Transformer (Vaswani+ 17): 28.4
    ConvS2S (Gehring+ 17): 25.16
    Google's NMT (Wu+ 16): 24.61
    Attention mechanism (Luong+ 15): 23
    RNNsearch (Jean+ 15): 21.6
    Statistical Machine Translation (Durrani+ 14): 20.7  ← 20 years of research on SMT
  56. What is GPT (Radford+ 2018)? 63  A generic language

    model that is transferable to various NLP tasks  A single model for different tasks  Question answering, document classification, semantic similarity, …  GPT-3 has been a hot topic recently (in 2020)  Pretraining and finetuning (a kind of transfer learning)  Pretraining learns parameters that are generic to the language  Finetuning learns task-specific parameters on supervision data, leveraging the parameters acquired in pretraining  Based on Transformer decoder  Generative Pre-Training (GPT) A Radford, K Narasimhan, T Salimans, I Sutskever. 2018. Improving Language Understanding by Generative Pre-Training. Technical report. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
  57. GPT-3 (Brown+ 2020) 64  https://twitter.com/sharifshameem/status/1282676454690451457  T. B. Brown, B. Mann, N. Ryder, M. Subbiah, et al. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165.
  58. Architecture of GPT 67  (Radford+ 2018)  GPT uses the Transformer decoder across different tasks  Pretraining is based on language modeling  Finetuning adds an output layer to the pretrained model for a target task and trains it using supervision data  A Radford, K Narasimhan, T Salimans, I Sutskever. 2018. Improving Language Understanding by Generative Pre-Training. Technical report. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
  59. Pretraining: Language modeling 68
    Input sequence: “Once upon a time there was a girl who really loved”; output sequence: “upon a time there was a girl who really loved books”
    Input embeddings h_t^0 are the sum of token embeddings WE_t and position embeddings PE_t; they pass through GPT (Transformer decoder with L layers) to give output embeddings h_t^L
    The model (Transformer decoder) is trained to predict the next token at each time step, using a large corpus as the supervision data
  60. Example of finetuning: Textual entailment 69
    Input sequence: “Tokyo Tech is located in Ookayama $ Japan has a university” (the two sentences joined by a delimiter $); label: Entail
    As in pretraining, input embeddings h_t^0 (token embeddings WE_t plus position embeddings PE_t) pass through the Transformer decoder with L layers to give h_t^L, and a linear layer W_y with softmax predicts the label
    After training the language model, add linear and softmax layers to predict labels of the target task, and adapt the parameters using the supervision data  In addition to the objective of the target task, we also train the model with the objective of language modeling on the supervision data
  61. Training the GPT model 70  Pretraining  BooksCorpus dataset (7,000 unique books) and the 1B Words Benchmark  Finetuning  Detail of the Transformer architecture  12-layer decoder-only Transformer with masked self attention  Number of dimensions d = 768 (12 attention heads)  Vocabulary of 40,000 subword tokens built by Byte-Pair Encoding (BPE)  117M parameters in total  A Radford, K Narasimhan, T Salimans, I Sutskever. 2018. Improving Language Understanding by Generative Pre-Training. Technical report. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf  (Radford+ 2018)
  62. Evaluation results 71  Natural Language Inference: SoTA on all datasets  Improvements: 1.5% on MNLI, 5% on SciTail, 5.8% on QNLI, and 0.6% on SNLI  Question answering and commonsense reasoning: SoTA on all datasets  Improvements: 8.9% on Story Cloze, and 5.7% overall on RACE  Semantic similarity: SoTA on two out of three datasets  Classification: SoTA on the GLUE benchmark (72.8 ← 68.9)  Performance drastically drops without pre-training (see the table below)  (Radford+ 2018)  A Radford, K Narasimhan, T Salimans, I Sutskever. 2018. Improving Language Understanding by Generative Pre-Training. Technical report. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
  63. GPT-2 (Radford+ 2019) 72  The paper explores whether a language model trained on text can solve NLP tasks such as question answering without finetuning  The architecture is the same as GPT, but the Transformer block is changed from Post-LN (layer normalization after each sub-layer) to Pre-LN (layer normalization before each sub-layer)  Training of the language model  8M high-quality documents (40GB) crawled from the Web  A Radford, J Wu, R Child, D Luan, D Amodei, I Sutskever. 2019. Language Models are Unsupervised Multitask Learners. Technical report, https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
  64. Performance of GPT-2 73 117M (12 layers, 768 dims); 345M

    (24 layers, 1024 dims); 762M (36 layers, 1280 dims); 1542M (48 layers, 1600 dims) A Radford, J Wu, R Child, D Luan, D Amodei, I Sutskever. 2019. Language Models are Unsupervised Multitask Learners. Technical report, https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf (Radford+ 2019)
  65. Answers generated by GPT-2 on the dev set of Natural

    Questions 74 A Radford, J Wu, R Child, D Luan, D Amodei, I Sutskever. 2019. Language Models are Unsupervised Multitask Learners. Technical report, https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
  66. GPT-3 (Brown+ 2020) 75  The paper explores whether a language model trained on text can solve NLP tasks in zero-shot, one-shot, or few-shot settings, without updating parameters for the task (no finetuning)  T. B. Brown, B. Mann, N. Ryder, M. Subbiah, et al. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165.
  67. The architecture of GPT-3 76  The architecture is the same as GPT-2, but GPT-3 uses alternating dense and locally banded sparse attention patterns in the layers of the Transformer, similar to the Sparse Transformer  The GPT-3 models are extremely large  Training GPT-3 (175B) requires 3.14 × 10^23 flops  “Even at theoretical 28 TFLOPS for V100 and lowest 3 year reserved cloud pricing we could find, this will take 355 GPU-years and cost $4.6M for a single training run.” [1]  T. B. Brown, B. Mann, N. Ryder, M. Subbiah, et al. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165.  [1] OpenAI's GPT-3 Language Model: A Technical Overview. https://lambdalabs.com/blog/demystifying-gpt-3/
  68. Performance of GPT-3 77  T. B. Brown, B. Mann, N. Ryder, M. Subbiah, et al. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165.
  69. Limitations even with GPT-3 (Brown+ 2020) 78  Inferior performance on some tasks compared to the finetuning approach  Notable weaknesses in text synthesis  Repetitions/contradictions at the document level, lost coherence over sufficiently long passages, and non-sequitur sentences  Difficulty with “common sense physics”  Difficult to answer a question like “If I put cheese into the fridge, will it melt?”  Structural and algorithmic limitations  No bi-directional architecture (unlike BERT), which is disadvantageous for some tasks (e.g., fill-in-the-blank tasks) that require re-reading or carefully considering a long passage and then generating a very short answer  Poor sample efficiency during pre-training  Pre-training requires much more text than a human reads in their lifetime  Test-time sample efficiency is closer to that of humans (one/zero-shot), though  Other limitations that are shared by most deep learning systems  Interpretability of decisions, biases of the data, gender, etc.  T. B. Brown, B. Mann, N. Ryder, M. Subbiah, et al. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165.
  70. What is BERT (Devlin+ 2019)? 80  A generic model

    for various NLP tasks  Question answering, document classification, semantic inference, …  Became a popular methodology, achieving state-of-the-art performance  Pretraining and finetuning (a kind of transfer learning)  Pretraining learns parameters that are generic to the language  Finetuning learns task-specific parameters on supervision data, leveraging the parameters acquired in pretraining  Based on Transformer encoder  BERT is not an encoder-decoder model (without a decoder)  A kind of contextualized word embeddings  Word embeddings that can represent context  Bidirectional Encoder Representations from Transformer (BERT)  → Embeddings from Language Models (ELMo) J Devlin, M-W Chang, K Lee, K Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL, pp. 4171-4186.
  71. Pretraining and finetuning 81 (Devlin+ 2019)  BERT uses a

    unified architecture across different tasks  Pretraining is based on bidirectional language modeling  Starting with a pretrained model, finetuning updates output layers (sometimes tailored for target tasks) as well as all internal parameters J Devlin, M-W Chang, K Lee, K Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL, pp. 4171-4186.
  72. Pretraining task 1: Masked language model 82  Idea: Train the model so that it can solve the Cloze task  Obtain supervision data by masking tokens in large corpora  BooksCorpus (800M words) and English Wikipedia (2,500M words)  Procedure:  Choose 15% of token positions at random for prediction  Choose one of the following operations:  [80%]: Replace the target token with [MASK] (“My dog is [MASK]”)  [10%]: Replace the target token with a random token (“My dog is apple”)  [10%]: Keep the target token unchanged (“My dog is cute”)  In every case BERT must recover the original token: “My dog is [ ].” → [ ] = cute  These treatments are needed because the [MASK] token does not appear in downstream tasks
  73. Masked language modeling (15% × 80%): [MASK] input 83
    Input sequence: [CLS] my dog [MASK] cute [SEP] he likes [MASK] ##ing [SEP]  (Sentence 1 / Sentence 2)
    Input embeddings are the sum of token embeddings E, segment embeddings SE_A/SE_B, and position embeddings PE_t; BERT (Transformer encoder) produces the output embeddings C, T_1, …, T_[SEP]
    The model is trained to predict the masked tokens (“is”, “play”); we don’t predict other tokens
  74. Masked language modeling (15% × 10%): random input 84
    Input sequence: [CLS] my dog look cute [SEP] he likes cat ##ing [SEP]  (the target positions are replaced with random tokens)
    The model is trained to predict the original target tokens (“is”, “play”); we don’t predict other tokens
  75. Masked language modeling (15% × 10%): original input 85
    Input sequence: [CLS] my dog is cute [SEP] he likes play ##ing [SEP]  (the target positions keep the original tokens)
    The model is trained to predict the target tokens (“is”, “play”); we don’t predict other tokens
  76. Pretraining task 2: Next sentence prediction 86  Idea: Train the model so that it can classify whether two given sentences are consecutive or not  Obtain supervision data by extracting sentences from large corpora  BooksCorpus (800M words) and English Wikipedia (2,500M words)  Procedure:  Choose two sentences that are consecutive 50% of the time  Choose two sentences that are not consecutive 50% of the time  Example: “My dog is cute. He likes playing.” → Yes;  “My dog is cute. I went to the station.” → No
  77. Next sentence prediction 87
    Input sequence: [CLS] my dog is cute [SEP] he likes play ##ing [SEP]  (Sentence 1 / Sentence 2)
    Input embeddings (token + segment + position embeddings) pass through BERT (Transformer encoder); the output embedding C at the [CLS] position predicts IsNext (or NotNext otherwise)
  78. Finetuning 88  BERT models are flexible for tasks over a single text or text pairs  Self-attention allows bidirectional cross attention between two sentences  We can view the output embeddings as feature representations of the input text  T_i: contextual word embedding of the token at position i  C: embedding for the single sentence or sentence pair  We reuse the model architecture and parameters for downstream tasks  Finetune BERT models on target tasks  We modify the label definition and output layers for a downstream task
  79. Finetuning task type 1: Sentence pair classification 89
    Input: [CLS] Sentence 1 [SEP] Sentence 2 [SEP];  the output embedding C predicts the label
    Task example: Multi-Genre Natural Language Inference (MultiNLI)  Sentence 1: “At the other end of Pennsylvania Avenue, people began to line up for a White House tour.”  Sentence 2: “People formed a line at the end of Pennsylvania Avenue.”  Label: entailment
  80. Finetuning task type 2: Single sentence classification 90
    Input: [CLS] Tok1 … TokN;  the output embedding C predicts the label
    Task example: Stanford Sentiment Treebank (SST)  Input sentence: “You’ll probably love it.”  Label: positive
  81. Finetuning task type 3: Question answering 91
    Input: [CLS] Question [SEP] Paragraph [SEP];  the output embeddings of the paragraph tokens predict the START and END positions of the answer span
    Stanford Question Answering Dataset (SQuAD)  https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/Doctor_Who.html
  82. Finetuning task type 4: Single sentence tagging 92
    Input: [CLS] Tok1 … TokN;  the output embeddings T_1, …, T_N predict a tag for each token
    Task example: Named Entity Recognition (NER) (as sequential labeling)  Input: “In March 2005, the New York Times acquired About, Inc.”  Output: O B-TEMP I-TEMP O B-ORG I-ORG I-ORG I-ORG O B-ORG
  83. Performance on downstream tasks 93 GLUE benchmark [1] SQuAD 1.0

    (Q&A) CoNLL 2003 (NER) J Devlin, M-W Chang, K Lee, K Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL, pp. 4171-4186. [1] https://gluebenchmark.com/leaderboard
  84. Summary  Attention mechanism addresses the drawback of fixed-size vector

    representation in encoder-decoder models  A decoder can extract features directly from encoder states  Parameters in attention mechanism are trained by a target task (without explicit supervision data for attention mechanism)  RNN/LSTM is difficult to parallelize across timesteps  Encoder-decoder models using CNN and positional encoding  Transformer removes recurrent computation by using self-attention and positional encoding  GPT and BERT are Transformer models applicable to various NLP tasks  GPT is a uni-directional language model based on Transformer decoder  BERT is a bi-directional model based on Transformer encoder 94