A model pre-trained in an unsupervised fashion with Masked Language Model and Next Sentence Prediction; by swapping out only the output layer, it can be adapted to a wide variety of natural language tasks (a minimal fine-tuning sketch is given at the end of this note).
Achieved the SoTA on multiple benchmarks at the time of publication.
Had a major impact on deep-learning-based natural language processing.

[Figure 1 of the paper: the left panel shows pre-training (Mask LM and NSP over an unlabeled sentence A/B pair), the right panel shows fine-tuning on downstream tasks such as SQuAD question/paragraph pairs, NER, and MNLI; both panels use the same BERT encoder over [CLS], Tok 1 … Tok N, [SEP] inputs, with a Start/End Span output head for SQuAD.]

Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).

Text excerpts captured from the same page of the paper:

"… modeling and auto-encoder objectives have been used for pre-training such models (Howard and Ruder, 2018; Radford et al., 2018; Dai and Le, 2015)."

"2.3 Transfer Learning from Supervised Data
There has also been work showing effective transfer from supervised tasks with large datasets, such as natural language inference (Conneau et al., 2017) and machine translation (McCann et al., 2017). Computer vision research has also demonstrated the importance of transfer learning from …"

"… minimal difference between the pre-trained architecture and the final downstream architecture.
Model Architecture: BERT's model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library. Because the use of Transformers has become common and our implementation is almost identical to the original, we will omit an exhaustive background description …"

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)
https://arxiv.org/abs/1810.04805v2
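
As a concrete illustration of "swap out only the output layer", here is a minimal fine-tuning sketch. It is not code from the paper or this note: it assumes PyTorch and the Hugging Face transformers package are available, uses the bert-base-uncased checkpoint, and the BertClassifier class name, task (sentence-pair classification), and hyperparameters are illustrative choices only.

```python
# Minimal sketch: pre-trained BERT encoder + a new task-specific output layer.
# Assumes PyTorch and Hugging Face `transformers`; names here are illustrative.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertClassifier(nn.Module):
    """Pre-trained BERT encoder with a freshly initialized classification head."""
    def __init__(self, num_labels: int, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)   # pre-trained with MLM + NSP
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(hidden, num_labels)        # the only task-specific part

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_repr = out.last_hidden_state[:, 0]                 # hidden state of the [CLS] token
        return self.classifier(cls_repr)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertClassifier(num_labels=2)

# The tokenizer inserts [CLS] and [SEP] automatically; a sentence pair becomes
# "[CLS] sentence A [SEP] sentence B [SEP]", matching the figure's input layout.
batch = tokenizer(
    ["a premise sentence", "another premise"],
    ["a hypothesis sentence", "another hypothesis"],
    padding=True, truncation=True, return_tensors="pt",
)
logits = model(batch["input_ids"], batch["attention_mask"])

# During fine-tuning, all parameters (encoder and new head) are updated,
# as the figure caption notes.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
labels = torch.tensor([0, 1])                                  # dummy labels for the sketch
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```

In this sketch the pre-trained encoder supplies the [CLS] representation and only the small linear head is new, which is the sense in which "swapping the output layer" adapts one pre-trained model to different downstream tasks.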