
Report on Participating in a Machine Translation Competition

Shun Kiyono
February 26, 2021


Slides from a talk at the 6th Patent Information Symposium.



Transcript

1. Self-introduction
• Background
  • 2013–2019: Tohoku University (B.S., M.S.)
  • 2019–: RIKEN Center for Advanced Intelligence Project (AIP)
  • 2020–: Tohoku University (doctoral program)
• Research so far
  • Abstractive summarization [BlackboxNLP 2018], [PACLIC 2018]
  • Fast, large-scale semi-supervised learning [AAAI 2019]
  • Grammatical error correction [EMNLP 2019], [TASLP 2020]
  • Machine translation? — The core techniques of MT are used across many tasks, so experience from other tasks carries over. In other words, as long as you have enthusiasm, you will be fine!

2. The competition we entered: WMT
• WMT was originally a workshop on machine translation
  • It became a full conference a few years ago
• Several shared-task competitions are co-located with it:
  • News translation ← the task we entered (the oldest and most fiercely competitive one)
  • Translation of sentences from unseen domains
  • Unsupervised machine translation
  • Machine translation of biomedical documents
  • Machine translation of chat messages

3. How the competition works (Japanese–English case)
• ① System building: prepare and preprocess the datasets (a Ja–En parallel corpus, a Japanese monolingual corpus, an English monolingual corpus, and the test data from previous years), then the team members iterate by trial and error on a very large number of GPUs (100+) until the system is complete
• ② System evaluation: translate the test data (Japanese sentences → translations), followed by automatic evaluation (BLEU) and human evaluation (a minimal BLEU-scoring sketch follows below)

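To make the "automatic evaluation (BLEU)" step concrete, here is a minimal sketch of scoring a system output with the sacrebleu library, which is commonly used for WMT-style BLEU scoring. The file names are hypothetical placeholders, not files from the actual system.

```python
# Minimal sketch: score system outputs against references with sacrebleu.
# (pip install sacrebleu) The file names below are hypothetical placeholders.
import sacrebleu

with open("system_output.en.txt", encoding="utf-8") as f:
    hypotheses = [line.rstrip("\n") for line in f]
with open("reference.en.txt", encoding="utf-8") as f:
    references = [line.rstrip("\n") for line in f]

# corpus_bleu takes the hypotheses and a list of reference streams
# (one stream per reference set), hence the extra brackets.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")
```
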
4. The participating teams
• Us: the Tohoku University / RIKEN AIP / NTT team
• Other participants included Kyoto University, NICT, DeepMind, Facebook, the University of Edinburgh, NAVER, OPPO, Tencent, WeChat, DiDi, and more
• Members: Ryuto Konno, Shun Kiyono, Takumi Ito, Makoto Morishita, Jun Suzuki (Tohoku University, RIKEN AIP, NTT Communication Science Laboratories)
• Regulars near the top of these competitions (WAT, WMT); first place at WAT

5. Results: top ranks on the automatic metric (BLEU)
• De→En (top 5 by BLEU): Tohoku-AIP-NTT 43.8, Huoshan_Translate 43.5, OPPO 43.2, UEDIN 42.3, Online-B 41.9
• En→De: Tohoku-AIP-NTT 38.8, Tencent_Translation 38.6, OPPO 38.6, Huoshan_Translate 38.2, eTranslation 37.9
• Ja→En: NiuTrans 26.7, Tohoku-AIP-NTT 25.5, OPPO 24.8, NICT_Kyoto 22.8, eTranslation 22.2
• En→Ja: NiuTrans 28.4, OPPO 27.3, ENMT 25.9, Tohoku-AIP-NTT 25.8, NICT_Kyoto 23.9

6. Human evaluation: first place (or a statistical tie for it) in every language pair we entered
• [The slide reproduces Tables 12 and 13 from the WMT20 findings paper, "Official results of WMT20 News Translation Task", for the into-English and out-of-English directions. Systems are ordered by human-evaluation z-score; systems within a cluster are considered tied, with clusters determined by Wilcoxon rank-sum tests; grayed entries indicate resources outside the task constraints.]
• (When there is no statistically significant difference in the human evaluation, systems are treated as tied for first place.)

7. Other results: strong scores on the linguistic-phenomena test suite as well
• A test suite on the handling of linguistic phenomena such as multi-word expressions, named entities, function words, and verb tense [Avramidis+2020]
• Our system obtains the best macro-averaged accuracy (the micro/macro distinction is sketched after the table)

  category                      items   Tohoku  Huoshan  UEdin   Onl-B   Onl-G   Onl-A   PROMT
  Ambiguity                        81     82.7     77.8   72.8    79.0    84.0    76.5    64.2
  Composition                      49     98.0     98.0   93.9    93.9    95.9    93.9    89.8
  Coordination & ellipsis          78     89.7     91.0   89.7    91.0    85.9    87.2    87.2
  False friends                    36     72.2     80.6   72.2    80.6    77.8    69.4    72.2
  Function word                    72     86.1     80.6   86.1    90.3    90.3    83.3    88.9
  LDD & interrogatives            174     89.1     86.2   85.1    83.3    86.8    77.6    81.0
  MWE                              80     80.0     75.0   71.3    77.5    77.5    71.3    70.0
  Named entity & terminology       89     92.1     84.3   87.6    82.0    82.0    88.8    87.6
  Negation                         20    100.0    100.0  100.0   100.0   100.0    95.0   100.0
  Non-verbal agreement             61     91.8     88.5   88.5    86.9    90.2    83.6    82.0
  Punctuation                      60     96.7     98.3   98.3    71.7    61.7   100.0    98.3
  Subordination                   180     90.6     88.3   91.1    91.1    92.2    88.9    90.0
  Verb tense/aspect/mood         4447     84.6     85.3   80.3    75.9    79.6    77.5    75.1
  Verb valency                     87     79.3     81.6   77.0    81.6    77.0    77.0    71.3
  micro-average                  5514     85.3     85.4   81.2    77.7    80.6    78.7    76.5
  macro-average                  5514     88.1     86.8   85.3    84.6    84.3    83.6    82.7
  BLEU                                    43.8     43.5   42.3    41.9    41.4    40.4    39.6

  (Excerpt from "Table 5: Accuracies (%) of successful translations" in [Avramidis+2020]; the original table covers 11 systems and 14 categories.)

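The last rows of the table distinguish micro- and macro-averaged accuracy, which is what the "best macro-average" claim rests on. The snippet below is only an illustrative sketch of that difference; the per-category counts are made up, not taken from the table. The micro average pools all test items, so the huge verb-tense category dominates, while the macro average weights every category equally.

```python
# Illustrative micro- vs. macro-averaged accuracy over test-suite categories.
# The per-category numbers below are made up, NOT the values from the table.
categories = {
    "Negation":               {"correct": 20,   "items": 20},
    "Ambiguity":              {"correct": 67,   "items": 81},
    "Verb tense/aspect/mood": {"correct": 3762, "items": 4447},
}

# Micro average: pool every item, so the largest category dominates.
micro = (sum(c["correct"] for c in categories.values())
         / sum(c["items"] for c in categories.values()))

# Macro average: average the per-category accuracies, so each category counts equally.
macro = (sum(c["correct"] / c["items"] for c in categories.values())
         / len(categories))

print(f"micro-average: {100 * micro:.1f}%")
print(f"macro-average: {100 * macro:.1f}%")
```
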
8. The usual view of SMT vs. NMT
• SMT (statistical machine translation)
  • [Diagram of a typical SMT pipeline, taken from the tutorial slides cited below, naming tools such as GIZA++, MGIZA, FastAlign, Nile, SRILM, KenLM, RNNLM, Moses, Joshua, Travatar, KyotoEBMT, MERT, MIRA, and PRO, with N-best translation candidates handed between modules]
  😩 A cumbersome system built from many modules
  😩 Error propagation between modules hurts translation quality
• NMT (neural machine translation)
  • A single translation model trained on a parallel corpus: the source sentence goes in, the decoder produces the target sentence
  😀 End-to-end training with a single model
  😀 No error propagation → higher translation quality
• Figures cited from "Neural Network Machine Translation from Scratch" (ゼロから始めるニューラルネットワーク機械翻訳), https://www.slideshare.net/ToshiakiNakazawa/nlp2017-nmt-tutorial

9. The usual view of SMT vs. NMT, continued
• (Same slide as above, with one addition:) In many settings this picture is correct, but...
• ...with state-of-the-art NMT, the situation is different.

10. From everyday NMT to state-of-the-art NMT
• [System diagram] A monolingual corpus is fed through a back-translation model to produce a pseudo-parallel corpus; together with the real parallel corpus and a target-domain corpus, these train the translation model. At test time the model produces N output candidates, which are scored by left-to-right and right-to-left forward translation models, left-to-right and right-to-left reverse translation models, a masked language model, and a unidirectional language model to select the final translation.

11. State-of-the-art NMT is a pile of "unglamorous" techniques
• The translation model (Transformer) is extremely sensitive to hyperparameters, and hyperparameter know-how keeps evolving ⇛ hyperparameter tuning
• The parallel corpus alone is not enough data; we want to add data from monolingual corpora ⇛ data augmentation by back-translation
• The test data is in the newswire domain, so we want to adapt the model to newswire data ⇛ fine-tuning
• Training runs vary in quality, so train several models independently and let them vote ⇛ ensembling (see the sketch below)
• Two heads are better than one: let other models weigh in on which output to choose ⇛ reranking
• [Same system diagram as the previous slide]

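As a rough illustration of the ensembling idea mentioned above, the sketch below averages the next-token distributions of several independently trained models at each decoding step. `model.step(...)` is a hypothetical interface standing in for whatever the actual toolkit exposes, so this is a sketch of the technique, not the team's implementation.

```python
# Illustrative sketch of ensembling at decoding time: average the next-token
# probabilities of several independently trained models at every step.
# `model.step(src, prefix)` is a hypothetical interface (not a real toolkit API)
# that returns unnormalized scores over the vocabulary for the next token.
import torch

def ensemble_next_token_logprobs(models, src, prefix):
    """Average next-token distributions in probability space, then take the log."""
    probs = [torch.softmax(m.step(src, prefix), dim=-1) for m in models]
    avg = torch.stack(probs, dim=0).mean(dim=0)
    return torch.log(avg)  # log-probs, ready to be fed back into beam search
```
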
12. State-of-the-art NMT is a pile of "unglamorous" techniques: hyperparameter tuning
• The translation model (Transformer) is extremely sensitive to hyperparameters, and hyperparameter know-how keeps evolving ⇛ hyperparameter tuning
• [Same system diagram as before]

13. Hyperparameter tuning
• Model: Transformer [Vaswani+2017]
  • The de facto standard model in recent years; not using it is not really an option
  • We enlarge the feed-forward dimension and increase the depth from 6 to 9 layers, aiming to absorb more training data
• Very large batch size [Ott+2018]
  • From the usual 4,000 tokens to 512,000 tokens
  • Faster convergence and better generalization
  • Empirically, training is also more stable
  • Achieved by using update delay (also called "ghost batches")
• Very large learning rate [Ott+2018]
  • Adam step size 0.0005 → 0.001
  • Faster convergence
  • The combination with the large batch size is crucial
• Checkpoint averaging
  • Save the model at regular intervals (e.g., every epoch, or every 2k updates)
  • After training, average the saved checkpoints and use the average for inference (see the sketch below)
  • Improves the BLEU score by roughly 0.1–0.2 points [Popel+2018]
• Pre-layer-normalization
  • Apply LayerNorm before the feed-forward and attention sub-layers
  • Reported to stabilize training of deep Transformers [Xiong+2020]
• [The slide also reproduces figure and text excerpts from [Vaswani+2017] (the Transformer architecture) and [Xiong+2020] (Post-LN vs. Pre-LN Transformer layers).]

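Checkpoint averaging is simple enough to sketch directly. The snippet below assumes each saved checkpoint is a dict with its weights under a "model" key (fairseq-style); the paths and the number of averaged checkpoints are illustrative, not the team's actual settings.

```python
# Minimal sketch of checkpoint averaging. Assumes each checkpoint file stores its
# weights under a "model" key (fairseq-style); paths and count are illustrative.
import torch

def average_checkpoints(paths):
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# e.g. average the last five epoch checkpoints and save the result for inference
paths = [f"checkpoints/checkpoint{i}.pt" for i in range(46, 51)]
torch.save({"model": average_checkpoints(paths)}, "checkpoints/checkpoint_avg.pt")
```
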
14. State-of-the-art NMT is a pile of "unglamorous" techniques: back-translation
• The parallel corpus alone is not enough data; we want to add data from a monolingual corpus ⇛ data augmentation by back-translation
• [Same system diagram as before]

15. What is back-translation?
• Back-translation (BT) [Sennrich+2016]
  • A recipe for generating a pseudo-parallel corpus from a monolingual corpus
  • The de facto standard data augmentation method for NMT
  • A reverse-direction translation model translates target-language sentences "back" into the source language
• Example: to build an En→Ja model, a Ja→En translation model translates the Japanese monolingual corpus into English; the translated English paired with the original Japanese forms an En–Ja pseudo-parallel corpus

16. The back-translation process (the whole recipe is sketched in code below)
• ① Train the back-translation model: train a Ja→En model on the Ja–En parallel corpus
• ② Translate the Japanese monolingual corpus to generate pseudo data: the Ja→En model turns the Japanese monolingual corpus into translated English, yielding an En–Ja pseudo-parallel corpus
• ③ Train on the pseudo data: train the En→Ja translation model on the real En–Ja parallel corpus together with the pseudo-parallel corpus

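The three steps above can be summarized as the following data-flow sketch. The trainer and translator are placeholder stubs standing in for a real NMT toolkit, and the toy sentences only keep the snippet self-contained; only the data flow is meant to mirror the slide.

```python
# Data-flow sketch of the three back-translation steps. `train_nmt` and
# `translate` are placeholder stubs standing in for a real NMT toolkit.
def train_nmt(src_sentences, tgt_sentences):
    """Placeholder: train a translation model on (src, tgt) sentence pairs."""
    return {"pairs": list(zip(src_sentences, tgt_sentences))}

def translate(model, sentences):
    """Placeholder: batch-translate sentences with a trained model."""
    return [f"<translation of: {s}>" for s in sentences]

parallel_en = ["I have been extremely lucky."]            # toy En side of the parallel corpus
parallel_ja = ["とても幸運でした"]                          # toy Ja side of the parallel corpus
monolingual_ja = ["和平プロセスに影響を及ぼしたくはない"]    # toy Ja monolingual corpus

# (1) Train the reverse-direction (Ja->En) back-translation model on real parallel data.
ja_en_model = train_nmt(parallel_ja, parallel_en)

# (2) "Back"-translate the Japanese monolingual corpus into synthetic English.
synthetic_en = translate(ja_en_model, monolingual_ja)

# (3) Train the final En->Ja model on the real parallel data plus the pseudo pairs.
en_ja_model = train_nmt(parallel_en + synthetic_en, parallel_ja + monolingual_ja)
```
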
17. State-of-the-art NMT is a pile of "unglamorous" techniques: reranking
• Two heads are better than one: let other models weigh in on which output to choose ⇛ reranking
• [Same system diagram as before]

18. Reranking picks a good translation out of the candidates
• Without reranking (sketched below):
  1. Beam search generates N candidate sentences
  2. The candidate with the highest score is output
• But a high score ≠ the best translation
  • Another candidate may actually be the better translation
• Reranking: a post-processing step for digging the good translation out of the candidates
• Example on the slide: for "I have been extremely lucky.", the translation model produces candidates such as とても幸運でした / 非常に運が良かった。/ 極めて幸運であった / 私は本当に幸運でした / 私は、非常に幸運だった with model scores of 9.5, 8.2, 4.2, 2.9, and 1.1; without reranking the top-scored candidate is output, even though a different candidate is the one we really want

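Without reranking, the decoder's own score decides everything, as in this tiny sketch. The scores attached to the candidates are illustrative, not the exact pairing shown on the slide.

```python
# Without reranking: output the candidate with the highest translation-model score.
# The scores attached here are illustrative, not the exact numbers on the slide.
candidates = [
    ("とても幸運でした", 9.5),
    ("非常に運が良かった。", 8.2),
    ("極めて幸運であった", 4.2),
    ("私は本当に幸運でした", 2.9),
    ("私は、非常に幸運だった", 1.1),
]
best_sentence, best_score = max(candidates, key=lambda c: c[1])
print(best_sentence)  # the top-scored candidate wins, even if another one reads better
```
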
19. Aiming for a good translation with the models' collective knowledge
• ① Generate the N candidates with beam search
• ② Score the N candidates with each module and sort by the total score (see the sketch below)
• Scoring modules: a unidirectional language model, a bidirectional language model, the back-translation model, a reverse-direction translation model, and so on

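A minimal sketch of step ②: re-score every candidate with several modules and sort by a weighted total. The scorer callables below are dummies; in a real system they would be language models and reverse-direction translation models, and the weights would typically be tuned on held-out data.

```python
# Sketch of reranking: re-score every candidate with several modules and sort by
# a weighted total. The scorers below are dummies, not real model scorers.
def rerank(candidates, scorers, weights):
    """candidates: list of str; scorers: callables str -> float; weights: floats."""
    def total(sentence):
        return sum(w * score(sentence) for score, w in zip(scorers, weights))
    return sorted(candidates, key=total, reverse=True)

# Toy usage: a dummy "fluency" scorer and a dummy "brevity" scorer, equally weighted.
dummy_scorers = [lambda s: -s.count("、"), lambda s: -len(s)]
reranked = rerank(["とても幸運でした", "私は、非常に幸運だった"], dummy_scorers, [1.0, 1.0])
print(reranked[0])  # the best candidate according to the combined score
```
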
20. Experiment results

  ID   Setting                          En→De   De→En   En→Ja   Ja→En
  (a)  Baseline                          42.4    42.0    19.7    21.6
  (b)  Baseline + back-translation       42.7    42.5    22.0    23.9
  (c)  (b) + fine-tuning                 44.9    42.3    23.1    24.4
  (d)  (c) × 4 (ensemble)                45.5    42.8    23.9    25.4
  (e)  (d) + reranking                   45.7    43.8    24.9    26.2
  -    Last year's winning system        44.9    42.8     -       -

• Combining the techniques improves performance step by step
• Reaching state-of-the-art performance requires a complex system

21. The system we built: everything gets bigger
• The amount of training data grows
  • Usually: at most about …
  • This time: … for En–De
• The number of model parameters grows
  • Usually: 6 layers each for the encoder and the decoder
  • This time: 9 layers each
• The number of models grows
  • Ensembling and reranking require multiple models
  • Eight models are needed per language direction → 32 models in total
⇒ A large increase in the resources needed to build the system

22. What you need, in the end, is money
• A DGX-2-class machine costs about $60/hour on AWS
• So building one model costs about $1,440
• So building 32 models costs about $46,080
  • That is just under 5 million yen
• Adding the trial and error along the way, the back-of-the-envelope total runs to tens of millions of yen (the arithmetic is reproduced in the snippet below)
• This is only an estimate for renting GPUs on AWS
  • We used our organizations' own machines, so the actual amount is different
• "So how much did you actually spend?" — That is confidential.

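The cost estimate above is simple arithmetic; the snippet below reproduces it, with the per-model training time inferred from $1,440 ÷ $60 per hour and an assumed exchange rate (both marked in the comments).

```python
# Reproducing the rough cost arithmetic above (AWS-rental estimate only).
usd_per_hour = 60                        # DGX-2-class machine on AWS (figure from the slide)
hours_per_model = 1440 / usd_per_hour    # 24 hours, inferred from $1,440 per model
n_models = 32                            # 8 models per language direction x 4 directions
usd_to_jpy = 105                         # assumed early-2021 exchange rate, not from the slide

total_usd = usd_per_hour * hours_per_model * n_models   # 46,080 USD
print(f"{total_usd:,.0f} USD ≈ {total_usd * usd_to_jpy / 1_000_000:.1f} million JPY")
# -> 46,080 USD ≈ 4.8 million JPY, i.e. "just under 5 million yen"
```
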
23. (An aside) There is always someone bigger
• What it takes to build GPT-3
  • V100 32GB GPUs (about 1 million yen apiece)
  • 355 GPU-years of compute
  • Figures quoted from https://lambdalabs.com/blog/demystifying-gpt-3/
• (Joke on the slide) "The GPT-3 our ancestors started training will finally be finished next week." "Are you sure that isn't a mix-up with GPT-371, which came out a while ago?"

24. Once more, the experiment results
• [The results table from slide 20 is shown again: baseline → +back-translation → +fine-tuning → ×4 ensemble → +reranking, ending at 45.7 / 43.8 / 24.9 / 26.2 BLEU for En→De / De→En / En→Ja / Ja→En, versus 44.9 / 42.8 for last year's winning system.]

25. Observation: does back-translation do nothing?
• Back-translation grows the training data roughly 10×
• Yet for En–De the BLEU score barely moves
  • e.g., 42.4 → 42.7
• The performance gain does not seem to justify the effort...
• Is it "more data, better performance", or really just "more data, same performance"?
• [The results table is shown again]

26. The effect of back-translation cannot be measured with BLEU
• On the effect of back-translation in state-of-the-art NMT [Edunov+2020] [Bogoychev+2019]
  • BLEU: with back-translation ≈ without back-translation
  • Human evaluation: with back-translation > without back-translation
  • Training on the pseudo-parallel corpus makes the output more fluent [Edunov+2020]
• 😀 So back-translation was not wasted effort
• 😩 But we now live in a world where human evaluation and BLEU no longer correlate...

27. Observation: reranking helps less than hoped
• Reranking requires several times as many models
• Which means several times as much money
• Yet the BLEU score does not improve as much as expected
  • e.g., 45.5 → 45.7
• Once again, the performance gain does not seem to justify the effort...
• [The results table is shown again]

28. Are we trying to solve an unsolvable problem?
• Perhaps a good translation simply cannot be identified from the source sentence and the candidates alone
  • Even humans cannot tell which candidate is the "good" translation
  • There is not enough information to make the call
• What extra information would help? ⇛ Context?
• The candidates the translation system produces are nearly indistinguishable, e.g. (all meaning roughly "I do not want to affect the peace process"): 和平プロセスに影響を及ぼしたくはない / 和平プロセスに影響を与えたくありません。/ 和平プロセスに影響を及ぼして欲しくない / 和平プロセスに影響を与えたくないのです。/ 和平プロセスに影響が出ないようにしたい。/ 和平プロセスに影響を及ぼしたくありません / 和平プロセスに影響を与えることは望まない。/ 和平プロセスに影響を与えることを望まない。

29. Summary
• Three things we learned from building a state-of-the-art NMT system:
  • ① State-of-the-art NMT systems are built from "unglamorous" techniques
  • ② They require enormous resources
  • ③ A new world is coming into view beyond "unglamorous" NMT
• Japanese organizations, too, can take on the world — given the resources!

30. References
• [Avramidis+2020]: Avramidis, E., Macketanz, V., Lommel, A., & Uszkoreit, H. (2018). Fine-grained Evaluation of Quality Estimation for Machine Translation Based on a Linguistically Motivated Test Suite. In Proceedings of the AMTA 2018 Workshop on Translation Quality Estimation and Automatic Post-Editing (pp. 243–248). Association for Machine Translation in the Americas.
• [Vaswani+2017]: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems 30 (NIPS 2017) (pp. 5998–6008).
• [Ott+2018]: Ott, M., Edunov, S., Grangier, D., & Auli, M. (2018). Scaling Neural Machine Translation. In Proceedings of the Third Conference on Machine Translation: Research Papers (pp. 1–9). Association for Computational Linguistics.
• [Xiong+2020]: Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., & Liu, T.-Y. (2020). On Layer Normalization in the Transformer Architecture. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020) (pp. 10524–10533). PMLR.
• [Sennrich+2016]: Sennrich, R., Haddow, B., & Birch, A. (2016). Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 86–96). Association for Computational Linguistics.
• [Edunov+2020]: Edunov, S., Ott, M., Ranzato, M., & Auli, M. (2020). On The Evaluation of Machine Translation Systems Trained With Back-Translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 2836–2846). Association for Computational Linguistics.
• [Bogoychev+2019]: Bogoychev, N., & Sennrich, R. (2019). Domain, Translationese and Noise in Synthetic Data for Neural Machine Translation. CoRR, abs/1911.03362.
• [Freitag+2020]: Freitag, M., Grangier, D., & Caswell, I. (2020). BLEU might be Guilty but References are not Innocent. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 61–71). Association for Computational Linguistics.