• In other words: reordering sublayers improves performance on language modeling, but has no effect on translation.

Excerpt (Press et al., 2019):

"…which is comparable to the performance of the baseline with the same number of parameters.

We next generalize this model and the original interleaved transformer, creating the family of sandwich transformers. A sandwich^n_k transformer consists of 2n sublayers in total (n of each type), conforming to the regular expression s^k (sf)^(n-k) f^k. The first k sublayers are purely self-attention (s), while the last k are feedforward sublayers (f). In between, we use the original interleaving pattern (sf) to fill the remaining 2(n-k) sublayers. When k = 0, we get the original transformer model, and when k = n-1 (its maximal value) we get the s^n f^n model.

[…] than the average baseline transformer. Of those, 6 models outperform the best baseline transformer (k = 5, 6, 8, 9, 10, 11). The best performance of 17.84 perplexity is obtained when k = 6. We compare this model to the baseline on WikiText-103's test set.

Table 3 shows that, despite its simple design, the sandwich transformer outperforms the original transformer baseline by roughly double the gap between the baseline (Baevski and Auli, 2019) and Transformer XL (Dai et al., 2019). This improvement comes at no extra cost in parameters, data, memory, or computation; we did not even change any of the original hyperparameters, including the number of training epochs.

To check whether this advantage is consistent, we train 4 more sandwich^16_6 models with different random seeds […]"

Table 3: Performance on the WikiText-103 test set. We compare the best sandwich transformer to the unmodified, interleaved transformer baseline (Baevski and Auli, 2019) trained over 5 random seeds and to other previously reported results.

  Model                               Test perplexity
  Baseline (Baevski and Auli, 2019)   18.70
  Transformer XL (Dai et al., 2019)   18.30
  kNN-LM (Khandelwal et al., 2019)    15.79
  Baseline (5 runs)                   18.63 ± 0.26
  Sandwich^16_6                       17.96

Figure 5: The transformer's sandwich coefficient (k) and validation perplexity, for k ∈ {1, …, 15}. The dotted line is the average baseline model's perplexity (trained with different random seeds), whereas the dashed line represents the best baseline model.

Figure 6: Performance on the WikiText-103 development set of the sandwich^16_6 transformer and the baseline […]

Figures from [Press+2019]

• Bold = Baseline; lower values mean better performance.
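As an aside (not from the slides or the paper): the sublayer ordering defined by the regular expression s^k (sf)^(n-k) f^k is easy to generate programmatically. Below is a minimal sketch; `sandwich_pattern` is a hypothetical helper name chosen for illustration, not code from Press et al.

```python
def sandwich_pattern(n: int, k: int) -> str:
    """Sublayer ordering s^k (sf)^(n-k) f^k of a sandwich^n_k transformer:
    2n sublayers in total, n self-attention (s) and n feedforward (f)."""
    assert 0 <= k <= n - 1, "k ranges from 0 (interleaved) to n - 1 (s^n f^n)"
    return "s" * k + "sf" * (n - k) + "f" * k

# k = 0 reproduces the original interleaved transformer, (sf)^n:
assert sandwich_pattern(16, 0) == "sf" * 16

# k = n - 1, the maximal value, collapses to the s^n f^n model:
assert sandwich_pattern(16, 15) == "s" * 16 + "f" * 16

# The best model from Table 3, sandwich^16_6: 6 leading s sublayers,
# 6 trailing f sublayers, and the interleaved (sf) pattern in between.
p = sandwich_pattern(16, 6)
assert len(p) == 32 and p.count("s") == p.count("f") == 16
print(p)  # "ssssss" + "sfsfsfsfsfsfsfsfsfsf" + "ffffff"
```

The string only fixes the ordering; in the actual model each symbol stands for a full sublayer with its own parameters, which is why the reordering costs nothing extra in parameters or computation.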