
[SNLP2019] Generalized Data Augmentation for Low-Resource Translation

Shun Kiyono

September 20, 2019
Transcript

  1. Generalized Data Augmentation for Low-Resource Translation
     Presenter: Shun Kiyono (RIKEN AIP / Tohoku University, Inui-Suzuki Laboratory)
     Paper: Mengzhou Xia, Xiang Kong, Antonios Anastasopoulos, Graham Neubig.
     Language Technologies Institute, Carnegie Mellon University.
     {mengzhox, xiangk, aanastas, gneubig}@andrew.cmu.edu
     From the abstract: "Translation to or from low-resource languages (LRLs) poses challenges for machine translation in terms of both adequacy and fluency."
     Note: figures and tables without annotation are quoted from the paper.
  2. What is this paper about?
     • Background / problem
       • For low-resource language pairs, back-translation struggles to improve performance.
     • Idea
       • Exploit data from a linguistically close high-resource language pair (e.g., Azerbaijani and Turkish).
     • Contributions
       • How best to use the high-resource language (HRL) is not obvious, so the authors exhaustively test combinations of techniques.
       • Substituting HRL words with low-resource-language (LRL) words at the word level works best.
       • Pivoting through the HRL makes monolingual data usable as well.
     (Slide caption: Turkish and Azerbaijani are closely related.)
     September 28, 2019  RIKEN AIP / Inui-Suzuki Laboratory
  3. Background: back-translation works great!
     • Back-translation: use translations produced by a reverse-direction translation model as new pseudo-parallel data.
       • Lets you exploit large, high-quality monolingual corpora!
     • The standard data-augmentation technique for machine translation.
     • Performance scales with the amount of pseudo-parallel data.
     (Figure: BLEU on newstest2012 vs. total training data for greedy / beam / top-10 sampling / beam+noise. Diagram from https://arxiv.org/abs/1808.09381)
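The back-translation recipe on this slide can be sketched in a few lines. This is a minimal illustration, not the paper's code; `translate_e2l` is a stand-in name for any trained reverse-direction (ENG→LRL) model.

```python
def back_translate(mono_english, translate_e2l):
    """Turn an English monolingual corpus into pseudo LRL-ENG parallel data.

    translate_e2l: any callable mapping an English sentence to an LRL
    sentence (a trained reverse-direction model in practice).
    Each output pair is (synthetic LRL source, real English target).
    """
    return [(translate_e2l(eng), eng) for eng in mono_english]


def build_training_set(true_parallel, pseudo_parallel):
    """Mix true and pseudo parallel data before training the forward model."""
    return list(true_parallel) + list(pseudo_parallel)
```

With a toy "model" that just tags its input, `back_translate(["hello"], lambda s: "<lrl> " + s)` yields `[("<lrl> hello", "hello")]`.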
  4. Background: with low-resource languages, back-translation is not so great…
     • In the low-resource (LRL) setting
       • Here, "low-resource" means a few thousand to tens of thousands of sentence pairs.
     • Gains from back-translation are limited.
     • Performance sometimes even degrades…
     (Table: excerpt of the paper's Table 2, BLEU for X→ENG on AZE/BEL/GLG/SLK; reproduced in full on slide 13.)
  5. Idea: exploit a high-resource language pair
     • Want to make good use of a high-resource language (HRL).
     • But how to use the HRL is not obvious.
     • Exhaustively try combinations of techniques and report the results.
       • (No new method is proposed.)
       • (Sharing the findings is the main contribution.)
       • (So what exactly does "Generalized Data Augmentation" mean?)
     • Assumption: the HRL is highly related to the LRL in syntactic structure and vocabulary.
  6. Overall picture of Generalized Data Augmentation
     (Figure 1 from the paper: "With a low-resource language (LRL) and a related high-resource language (HRL), typical data augmentation scenarios use any available parallel data [b]" and monolingual data; the arrows [1] ENG→LRL, [2] ENG→HRL, [3]/[4] HRL→LRL mark the generation directions.)
     • Plain back-translation struggles to improve performance…
     • → Try combinations of these four generation directions.
  7. Usage 1: HRL→LRL
     • A reasonable amount of HRL-ENG parallel data exists.
     • Translating the HRL side into the LRL yields a pseudo LRL-ENG parallel corpus.
     • Mix it with the true parallel data and train.
     (Figure 1 of the paper, repeated.)
  8. Usage 2: ENG→HRL→LRL
     • Translate ENG into the LRL by pivoting through the HRL.
     • Reuses the ENG→HRL back-translation model.
     • From an English monolingual corpus, this produces LRL→ENG pseudo-parallel data.
     • The payoff: English monolingual corpora are practically unlimited.
     (Figure 1 of the paper, repeated.)
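The pivoting pipeline above chains two models. A minimal sketch, with hypothetical callables standing in for the trained ENG→HRL model and the HRL→LRL converter (word substitution or unsupervised MT in the paper):

```python
def pivot_augment(mono_english, translate_e2h, translate_h2l):
    """ENG -> HRL -> LRL pivoting: from English monolingual text,
    produce pseudo (LRL, ENG) pairs for training an LRL->ENG model.

    translate_e2h: ENG->HRL back-translation model (any callable here).
    translate_h2l: HRL->LRL converter, e.g. word substitution or
    unsupervised MT as in the paper.
    """
    pairs = []
    for eng in mono_english:
        hrl = translate_e2h(eng)   # direction [2] in the paper's Figure 1
        lrl = translate_h2l(hrl)   # direction [4]
        pairs.append((lrl, eng))
    return pairs
```

The English side of each pair is kept untouched, so the target side of the pseudo-corpus stays clean natural text, which is the usual back-translation rationale.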
  9. How do we translate HRL→LRL?
     • HRL→LRL translation is itself a low-resource problem.
     • Assumption: the HRL and LRL are linguistically similar.
     • So unsupervised methods can be expected to reach reasonable translation accuracy.
     • Two approaches: (1) word-level substitution and (2) unsupervised MT.
     (Table 3 of the paper: a POR-GLG pivoting example with pivot BLEU scores. The HRL source S_HE scores 0.09, word substitution Ŝw_{H→L} 0.18, and modified UMT Ŝm_{H→L} 0.54 against the GLG reference; edits by word substitution or M-UMT are highlighted.)
  10. (1) Word-level substitution
     1. Train word vectors for each language separately, then learn a mapping W between the two spaces [Xing+2015].
     2. Add nearest-neighbor word pairs in the shared embedding space to a dictionary.
     3. Replace each HRL word with its corresponding LRL word.
        • Words with no dictionary entry are ignored.
     From the paper: finding the optimal mapping between the source and target embedding spaces X and Y is formulated as the Procrustes problem (Schönemann, 1966), which can be solved by singular value decomposition (SVD):

         min_W || W X − Y ||²_F   s.t.  WᵀW = I

     Identical words shared by the two languages serve as the seed dictionary. With the learned W, distances between mapped source words and target words are computed with the CSLS similarity measure (Lample et al., 2018b), and a word pair is added to the dictionary only if the two words are each other's nearest neighbors; adding an LRL word for every HRL word gives relatively poor performance.
     (Diagram from https://arxiv.org/abs/1710.04087: the two embedding distributions are roughly aligned with an adversarially learned rotation W, the mapping is refined via Procrustes using frequent aligned words as anchors, and translation then uses W plus a distance metric.)
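Steps 1-3 can be sketched compactly in NumPy. One simplification to flag: the orthogonal Procrustes solution via SVD matches the objective above exactly, but plain cosine similarity with a mutual-nearest-neighbor filter stands in for the CSLS measure the paper actually uses, and the vocabularies and dictionary here are toy stand-ins.

```python
import numpy as np


def procrustes(X, Y):
    """Solve min_W ||WX - Y||_F  s.t.  W^T W = I  (Schoenemann, 1966).

    X, Y: (d, n) matrices whose columns are embeddings of seed word pairs.
    The optimal orthogonal W is U V^T, where U S V^T is the SVD of Y X^T.
    """
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt


def induce_dictionary(src_vecs, tgt_vecs, W):
    """Keep a (src, tgt) index pair only if the two words are mutual
    nearest neighbors; cosine similarity is used here, whereas the
    paper uses CSLS."""
    mapped = W @ src_vecs                        # (d, n_src)
    norm = lambda M: M / np.linalg.norm(M, axis=0, keepdims=True)
    sim = norm(mapped).T @ norm(tgt_vecs)        # (n_src, n_tgt) cosines
    s2t = sim.argmax(axis=1)
    t2s = sim.argmax(axis=0)
    return {i: int(j) for i, j in enumerate(s2t) if t2s[j] == i}


def substitute(hrl_tokens, dictionary, hrl_vocab, lrl_vocab):
    """Replace each HRL word with its LRL counterpart; words without a
    dictionary entry are ignored, as on the slide."""
    idx = {w: i for i, w in enumerate(hrl_vocab)}
    out = []
    for tok in hrl_tokens:
        j = dictionary.get(idx.get(tok, -1))
        if j is not None:
            out.append(lrl_vocab[j])
    return out
```

On synthetic data where the target space is an exact rotation of the source space, `procrustes` recovers the rotation and the mutual-NN filter keeps every pair; on real embeddings the filter is what prunes unreliable pairs.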
  11. (2) Unsupervised MT
     • Essentially the same as existing unsupervised MT methods.
       • (The paper calls it "Modified UMT", but the difference was not clear to me…)
       • (Presumably the modification is that HRL→LRL word substitution is applied beforehand.)
     • The objective is a weighted sum of two losses: denoising auto-encoding and iterative back-translation.
     From the paper, initialization proceeds in three steps: (1) use the induced dictionary to substitute HRL words in M_H with LRL words, producing a pseudo-LRL monolingual dataset M̂_L; (2) learn a joint word segmentation model on M_L and M̂_L and apply it to both; (3) train an NMT model between M_L and M̂_L in an unsupervised fashion. The training objective L is

         L = λ₁ [ E_{x∼M_L} log P_{s→s}(x | C(x)) + E_{y∼M̂_L} log P_{t→t}(y | C(y)) ]
           + λ₂ [ E_{x∼M_L} log P_{t→s}(x | u*(y|x)) + E_{y∼M̂_L} log P_{s→t}(y | u*(x|y)) ]

     where C is a noise function and u* denotes translations obtained with the current model.
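The two ingredients of the objective above can be illustrated without a full NMT system. The noise function C below (word dropout plus a local shuffle) follows the style of Lample et al., 2018; the excerpt does not spell out the exact C used, so treat it as an assumption, and `umt_loss` simply assembles the weighted sum once the four expectation terms have been estimated.

```python
import random


def corrupt(tokens, p_drop=0.1, k=3, rng=None):
    """Noise function C(x) for the denoising auto-encoding term:
    drop each word with probability p_drop, then shuffle so that no
    surviving word moves more than k positions (word dropout + local
    shuffle, in the style of Lample et al., 2018)."""
    rng = rng or random.Random()
    kept = [t for t in tokens if rng.random() >= p_drop]
    # Adding uniform(0, k) jitter to each index and re-sorting bounds
    # every word's displacement by k positions.
    keys = [i + rng.uniform(0, k) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda p: p[0])]


def umt_loss(dae_src, dae_tgt, bt_src, bt_tgt, lam1=1.0, lam2=1.0):
    """Weighted sum of the four objective terms:
    L = lam1 * (denoising AE on M_L + on M_hat_L)
      + lam2 * (iterative back-translation on M_L + on M_hat_L)."""
    return lam1 * (dae_src + dae_tgt) + lam2 * (bt_src + bt_tgt)
```

With `p_drop=0` the corruption only reorders locally, so the token multiset is preserved; that is the property the denoising auto-encoder is trained to invert.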
  12. Experimental setup
     • Data: Multilingual TED corpus [Qi+2018]
     • Language pairs (LRL, with related HRL in parentheses): AZE (TUR), BEL (RUS), GLG (POR), SLK (CES)
     • Monolingual corpora come from Wikipedia.
     • Model: Transformer (4 layers)
     • Baseline: multilingual NMT over the HRL and the LRL.
     Table 1 of the paper (number of sentences):

         Dataset       AZE(TUR)  BEL(RUS)  GLG(POR)  SLK(CES)
         S_LE, T_LE       5.9K      4.5K      10K       61K
         S_HE, T_HE       182K      208K      185K      103K
         S_LH, T_LH       5.7K      4.2K      3.8K      44K
         M_L             2.02M     1.95M     1.98M       2M
         M_H                2M        2M        2M       2M
         M_E                       2M / 200K

     (Paper excerpt: a joint segmentation model is trained for each LRL-HRL pair and a separate one for English; vocabularies are capped at 20K per model; FastText is used for word embeddings; 2M/200K English sentences are sampled for augmentation from English.)
     • Low-resource here means roughly thousands to tens of thousands of pairs.
  13. Results: a strong showing
     Table 2 of the paper (BLEU for X→ENG; columns AZE(TUR), BEL(RUS), GLG(POR), SLK(CES)):

         Results from literature
           SDE (Wang et al., 2019)                          12.89  18.71  31.16  29.16
           many-to-many (Aharoni et al., 2019)              12.78  21.73  30.65  29.54
         Standard NMT
           1  {S_LE ∪ S_HE, T_LE ∪ T_HE} (supervised MT)    11.83  16.34  29.51  28.12
           2  {M_L, M_E} (unsupervised MT)                   0.47   0.18   1.15   0.75
         Standard supervised back-translation
           3  + {Ŝs_{E→L}, M_E}                             11.84  15.72  29.19  29.79
           4  + {Ŝs_{E→H}, M_E}                             12.46  16.40  30.07  30.60
         Augmentation from HRL-ENG
           5  + {Ŝs_{H→L}, T_HE} (supervised MT)            11.92  15.79  29.91  28.52
           6  + {Ŝu_{H→L}, T_HE} (unsupervised MT)          11.86  13.83  29.80  28.69
           7  + {Ŝw_{H→L}, T_HE} (word subst.)              14.87  23.56  32.02  29.60
           8  + {Ŝm_{H→L}, T_HE} (modified UMT)             14.72  23.31  32.27  29.55
           9  + {Ŝw_{H→L} ∪ Ŝm_{H→L}, T_HE ∪ T_HE}          15.24  24.25  32.30  30.00
         Augmentation from ENG by pivoting
           10 + {Ŝw_{E→H→L}, M_E} (word subst.)             14.18  21.74  31.72  30.90
           11 + {Ŝm_{E→H→L}, M_E} (modified UMT)            13.71  19.94  31.39  30.22
         Combinations
           12 + {Ŝw_{H→L} ∪ Ŝw_{E→H→L}, T_HE ∪ M_E}         15.74  24.51  33.16  32.07
           13 + {Ŝw_{H→L} ∪ Ŝm_{H→L} ∪ Ŝw_{E→H→L} ∪ Ŝm_{E→H→L},
                 T_HE ∪ T_HE ∪ M_E ∪ M_E}                   15.91  23.69  32.55  31.58

     Slide annotations: the top rows are published numbers from prior work; rows 1-2 are the baselines; the combination rows mark the best data-augmentation results ("you can get about this far").
  14. Analysis 1: plain back-translation
     • ENG→LRL back-translation does not improve performance (row 3).
       • If anything, it degrades it.
     • Adding ENG→HRL back-translated data (row 4) gives a slight gain.
     • For some corpora (BEL) the effect is limited.
       • (HRL-LRL similarity likely plays a role here.)
     • Bottom line: plain back-translation still falls short.
     (Excerpt of Table 2, rows 3-4; full table on slide 13.)
  15. Analysis 2: results of HRL→LRL augmentation
     • Merely substituting words on the HRL side improves performance dramatically (row 7).
     • Unsupervised MT also helps (row 8), but only matches or trails word substitution.
       • (A disappointing result, given how much harder unsupervised MT is to train.)
     • Combining word substitution with unsupervised MT improves performance further (row 9).
     • Word-level substitution is remarkably effective.
     (Excerpt of Table 2, rows 5-9; full table on slide 13.)
  16. Analysis 3: effect of ENG→HRL→LRL
     • Back-translating through the HRL pivot makes monolingual corpora genuinely useful (rows 10-11).
     • The trend mirrors the HRL→LRL case:
       • word-level substitution > unsupervised MT.
     • Monolingual corpora also contribute to the gains.
     (Excerpt of Table 2, rows 10-11, and the paper's caption: "Evaluation of translation performance over four language pairs. Rows 1 and 2 show pre-training BLEU scores. Rows 3-13 show scores after fine tuning. Statistically significantly best scores are highlighted (p < 0.05)." Fine-tuning follows the mixed fine-tuning strategy of Chu et al. (2017): the base model is fine-tuned on the concatenation of the base and augmented datasets.)
  17. Analysis 4: in the end, unsupervised MT loses
     • The best configuration combines word-substituted HRL→LRL data with word-substituted ENG→HRL→LRL data (row 12).
     • Adding the unsupervised-MT data on top of that degrades performance (row 13)…
     (Excerpt of Table 2, rows 12-13; full table on slide 13.)
  18. Why doesn't unsupervised MT work well?
     • The experiments consistently show word-level substitution > unsupervised MT.
     • Is unsupervised MT actually translating properly?
     • → It does translate (pivot BLEU goes up), but that does not help the end task (the final translation BLEU goes down).
     • The authors attribute this to running unsupervised MT after the word-substitution step. (I am not fully convinced…)
     (Figure: HRL→LRL quality of S_HL vs. Ŝw_HL plotted against the final BLEU score.)
  19. (Recap) What is this paper about?
     • Background / problem
       • For low-resource language pairs, back-translation struggles to improve performance.
     • Idea
       • Exploit data from a linguistically close high-resource language pair (e.g., Azerbaijani and Turkish).
     • Contributions
       • How best to use the high-resource language is not obvious, so combinations of techniques are tested exhaustively.
       • Substituting HRL words with LRL words at the word level works best.
       • Pivoting through the HRL makes monolingual data usable.
  20. Impressions
     • The paper feels short of truly "Generalized" Data Augmentation.
       • What exactly is "generalized"?
       • I would have liked more guideline-style information and experiments.
     • Why is unsupervised MT so ineffective?
       • Its HRL→LRL translation quality is high, and yet…
       • Intuitively, the higher the quality of the pseudo-data, the more it should contribute.
       • Is the problem that the quality is only halfway high?
       • There are also reports that pseudo-data works better when it can be told apart from real data:
         • [Edunov+2018] Understanding Back-Translation at Scale
         • [Caswell+2019] Tagged Back-Translation
     • Figure 1 is excellent (it gives a bird's-eye view of how the pseudo-data is constructed).