Understanding Back-Translation at Scale
ysasano
February 12, 2019
Technology
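An introduction to a paper that examines what happens when back-translation, one of the data augmentation methods for machine translation, is evaluated on large amounts of data.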
Transcript
Understanding Back-Translation at Scale
Yasumasa Sasano (@SquirrelYellow)
Reading the data in a back-translation paper
Edunov et al. 2018 @ EMNLP 2018
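What is Back-Translation (BT)? (1) Train a translation model in the direction opposite to the main model (for Japanese-to-English, train English-to-Japanese).
Diagram: target text data + source text data -> train -> back-translation model (BT).
The wording here is borrowed from this article: https://qiita.com/tkmaroon/items/4b8f469db1534d5e265b
What is Back-Translation (BT)? (2) Use BT to increase the data.
Diagram: target monolingual data -> inference with the back-translation model (BT) -> synthetic source monolingual data.
What is Back-Translation (BT)? (3) Train on the increased data.
Diagram: target text data + source text data, plus target monolingual data + synthetic source monolingual data -> train -> main model.
The paper does not say so, but I think the reason for deliberately translating in "reverse" is that we want to optimize against correct sentences as the training target.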
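The three BT steps above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `back_translate` and `toy_train` are hypothetical names, and `toy_train` is a word-lookup stand-in for real NMT training, used only so the data flow can be run end to end.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]              # (source sentence, target sentence)
Translator = Callable[[str], str]   # maps one sentence to its translation


def toy_train(pairs: List[Pair]) -> Translator:
    """Hypothetical stand-in for NMT training: memorises word pairs by position."""
    table: Dict[str, str] = {}
    for src, tgt in pairs:
        for s, t in zip(src.split(), tgt.split()):
            table.setdefault(s, t)
    return lambda sentence: " ".join(table.get(w, w) for w in sentence.split())


def back_translate(
    bitext: List[Pair],
    target_monolingual: List[str],
    train: Callable[[List[Pair]], Translator],
) -> Translator:
    # (1) Train the reverse model (target -> source) by flipping each pair.
    reverse_model = train([(tgt, src) for src, tgt in bitext])
    # (2) Turn target-side monolingual text into synthetic source sentences.
    synthetic = [(reverse_model(tgt), tgt) for tgt in target_monolingual]
    # (3) Train the main model (source -> target) on real + synthetic pairs.
    return train(bitext + synthetic)


if __name__ == "__main__":
    bitext = [("inu", "dog"), ("neko", "cat")]
    main_model = back_translate(bitext, ["dog cat"], toy_train)
    print(main_model("neko inu"))
```

Note that the synthetic side is always the source: the target side of every training pair stays human-written text.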
BT has drawn attention for large accuracy gains: http://deeplearning.hatenablog.com/entry/back_translation
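Why I chose this paper
It is not actually a paper proposing a new method. It verifies what happens when existing methods are evaluated on large amounts of data (at scale).
I want to read the verification data on data augmentation and discuss it. BT is a kind of data augmentation.
- For work reasons, I am motivated to make full use of the data we already have.
- Much is still unexplained about which kinds of data augmentation are effective, so the topic interests me.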
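To avoid confusion, the slides mark four kinds of statement separately: evidence, the paper's claims, personal impressions, and points of concern.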
Synthetic data generation method: about the synthetic data created with BT
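Evaluating differences due to how the synthetic data is generated
Greedy Search: fix the top-ranked word at each step, then move on to the next word.
Beam Search: choose the output with the highest probability over the whole sequence. Exhaustive search is infeasible, so the search uses a beam of finite width.
Using argmax removes diversity from the translations, which is a problem.
Diagram: word probabilities (sorted), with Greedy Search and Beam Search at the argmax end, followed by Top 10, Sampling, and Beam + Noise (noised). The example candidate words include "cold (weather)", "a cold (illness)", "sweater", and "fridge".
Top 10: random sampling restricted to ranks 1 through 10. k = 5, 10, 20, 50 were tried, but changing k made no difference.
Sampling: unrestricted random sampling. According to Ott et al. 2018a, the uncertainty is quite large and there is a high chance of producing odd words. First used in Imamura et al. 2018 (NICT). It gives the generated sentences diversity. As a text generation technique it is old, used in Graves et al. 2003 and elsewhere.
Beam + Noise: delete words (p=0.1), replace words with BLANK (p=0.1), and swap words uniformly with a maximum move of 3 positions. Proposed as part of an unsupervised learning method in Lample et al. 2018a.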
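To make the difference between the word-selection rules concrete, here is a small sketch of choosing one next word under greedy, top-k, and unrestricted sampling. It assumes a toy probability table rather than a real translation model; `pick_next_word` and the example words are illustrative, and beam search is left out because it scores whole sequences rather than single steps.

```python
import random
from typing import Dict, Optional


def pick_next_word(
    probs: Dict[str, float],
    method: str,
    k: int = 10,
    rng: Optional[random.Random] = None,
) -> str:
    """Choose the next word from a (hypothetical) model distribution."""
    rng = rng or random.Random()
    # Sort candidates from most to least probable.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if method == "greedy":
        # Argmax: always the single most probable word, so no diversity.
        return ranked[0][0]
    if method == "top_k":
        # Keep only the k best candidates before sampling.
        ranked = ranked[:k]
    elif method != "sampling":
        raise ValueError(f"unknown method: {method}")
    words = [w for w, _ in ranked]
    weights = [p for _, p in ranked]
    # random.choices accepts unnormalised weights, so the truncated
    # top-k distribution is effectively renormalised here.
    return rng.choices(words, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy = {"cold": 0.6, "flu": 0.3, "fridge": 0.1}
    rng = random.Random(0)
    for method in ("greedy", "top_k", "sampling"):
        print(method, [pick_next_word(toy, method, k=2, rng=rng) for _ in range(5)])
```

With `k=2`, top-k can never emit the unlikely word "fridge", while unrestricted sampling occasionally will; that tail is where both the extra diversity and the odd words come from.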
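Evaluating differences due to how the synthetic data is generated (results)
sampling and beam+noise score 1.7-2.0 BLEU higher than beam and greedy.
top10 is better than beam and greedy, but worse than sampling and beam+noise.
At one data size (the figure is not legible in the transcript), sampling and beam+noise improve performance by nearly twice as much as beam.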
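Analysis of the generated sentences
Greedy search and beam search lose diversity and richness in the data. According to Ott et al. 2018a, they tend to stop producing low-frequency words, so sampling methods are preferable.
Similarity to denoising autoencoders: sentences produced by sampling or beam+noise are far from realistic, but "replacement" and "reordering" do occur in ordinary text, so including this kind of processing makes the model robust.
Because the next word cannot be predicted, the task becomes harder and accuracy goes up.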
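The noise applied to beam outputs (deletion, replacement with a blank token, and local reordering) can be sketched as below. This is a sketch under stated assumptions: `add_noise` and the `<BLANK>` token name are hypothetical, and the reordering uses one common trick (sort positions by index plus a random offset below `max_shift + 1`), which keeps every word within `max_shift` positions of where it started.

```python
import random
from typing import List


def add_noise(
    tokens: List[str],
    rng: random.Random,
    p_drop: float = 0.1,
    p_blank: float = 0.1,
    max_shift: int = 3,
    blank: str = "<BLANK>",  # hypothetical filler token name
) -> List[str]:
    # 1) Delete each word independently with probability p_drop.
    kept = [t for t in tokens if rng.random() >= p_drop]
    # 2) Replace each remaining word with the filler token with probability p_blank.
    blanked = [blank if rng.random() < p_blank else t for t in kept]
    # 3) Local shuffle: sort by (index + uniform offset in [0, max_shift + 1)).
    #    A word can only be overtaken by neighbours at most max_shift away.
    keys = [i + rng.random() * (max_shift + 1) for i in range(len(blanked))]
    order = sorted(range(len(blanked)), key=keys.__getitem__)
    return [blanked[i] for i in order]


if __name__ == "__main__":
    sentence = "it is cold today so I caught a cold".split()
    print(add_noise(sentence, random.Random(1)))
```

The defaults mirror the values on the slide (p=0.1, p=0.1, maximum move of 3); all three are plain parameters so other settings can be tried.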
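Analysis of the generated sentences (personal consideration)
Coloring acceptable words (OK) blue and clearly wrong words (NG) in another color shows that the clearly wrong words appear only "locally".
Hypothesis: did generalization improve on the normal parts that remain, so that the model is not thrown off by whatever noise word appears? In other words, generalization improved through "local noise".
One could also read the result as "fine, because diversity increased regardless of quality", but even so the accuracy gain seems too large, so I want to dig a little deeper.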
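(Personal consideration, continued)
Example: "犬が好きです" (I like dogs) is translated as "I like dog". With local noise added, "クトゥルフ神が好きです" (I like the god Cthulhu) is translated as "I am scared of Cthulhu".
Many natural language processing models can be fooled easily by changing the input only slightly (Deep Text Classification Can be Fooled, Liang et al. 2016).
Diagram: translation of unseen data, with backpropagation of the error.
BT with noise may be acting as a countermeasure to this problem: even if "Cthulhu" is an unknown noun, "好き" is still "like".
(Backpropagating the error to the noise part is a complete waste, so this may be improvable.)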
Low Resource & High Resource: about differences in the amount of parallel data that BT starts from
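What happens when the starting parallel data is small?
Diagram: target + source -> main model, with target monolingual data + synthetic source monolingual data -> train. The parallel data here is small (about 80K sentences), roughly the size of paperback books (transcribed figures: 1.12 million characters, 80 characters per sentence; the book count is not legible).
With 80K sentences, the ranking of sampling and beam search is reversed.
The more data there is, the stronger sampling becomes.
When the starting data is small, the accuracy of the BT model is not high, so it becomes vulnerable to the negative effect of the noise that sampling adds.
The accuracy of BT needs to be raised.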
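Mitigating the small-data problem (1): build language models from monolingual data and apply transfer learning
Diagram: source -> Encoder -> Decoder -> target, with a source language model and a target language model each connected by transfer learning or weight sharing.
The problem that "language models are hard to transfer" was resolved by Devlin et al. 2018 (BERT), so there may be progress here.
A remarkable paper had been published before I noticed. Reference: Lample et al. 2019 (XLM). It applies transfer learning from BERT, trains translation as a single language model rather than in Encoder-Decoder form, and raises the BLEU state of the art for unsupervised German-English translation on WMT (the year, the BLEU margin, and the arXiv submission date are not legible in the transcript).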
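Mitigating the small-data problem (2): Dual Learning
Diagram: the main model and a "dual" model, trained with target monolingual data and source monolingual data. The data does not need to be parallel.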
Domain of synthetic data: verification of the domain of the synthetic data
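Domain adaptation
Diagram: target text data + source text data -> main model; news target monolingual data -> synthetic news source monolingual data -> train.
Can the model become strong on news even without parallel news data?
When the BT domain (news) matches the domain of the evaluation data, the gain is 83% of what real data would give.
Even when the BT domain (news) does not match the evaluation data domain at all, the gain is 32.5% of what real data would give.
Both cases improve, but when the domains match, accuracy exceeds that of general-purpose data.
Even without parallel data for a given genre, monolingual data in that genre is enough to strengthen translation for it.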
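Summary
- Adding back-translation raises accuracy with any method, but how the back-translations are generated can double the size of the gain.
- Performance drops in relative terms when data is scarce, so sampling cannot be used carelessly.
- BT can be used for domain adaptation.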