Neural Machine Translation with Byte-Level Subwords

Scatter Lab Inc.

May 15, 2020

Transcript

  1. Neural Machine Translation with Byte-level Subwords Overview

    • “Neural Machine Translation with Byte-level Subwords” • Changhan Wang, Kyunghyun Cho, and Jiatao Gu (Facebook AI Research) • AAAI 2020 (arXiv 2019)
  2. Byte-Pair Encoding (BPE) 1. Introduction

    • Repeatedly merges the most frequent pair of characters

    vocab = all_unique_characters
    while len(vocab) < max_vocab_size:
        pair = get_max_pair(corpus)
        corpus = merge_vocab(corpus, pair)
        vocab.append(pair)
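The merge loop on this slide can be made runnable as a minimal sketch. The helper names `get_max_pair` and `merge_vocab` come from the slide's pseudocode, but their bodies, the `train_bpe` wrapper, and the toy word list are assumptions made for illustration, not the authors' or SentencePiece's implementation.

```python
from collections import Counter


def get_max_pair(corpus):
    """Return the most frequent adjacent symbol pair, or None if no pair exists."""
    pairs = Counter()
    for word in corpus:
        pairs.update(zip(word, word[1:]))
    return max(pairs, key=pairs.get) if pairs else None


def merge_vocab(corpus, pair):
    """Replace every occurrence of `pair` in the corpus with one merged symbol."""
    merged = pair[0] + pair[1]
    out = []
    for word in corpus:
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(merged)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        out.append(new_word)
    return out


def train_bpe(words, max_vocab_size):
    """Hypothetical wrapper: start from single characters, merge until the vocab is full."""
    corpus = [list(w) for w in words]
    vocab = sorted({ch for w in corpus for ch in w})
    while len(vocab) < max_vocab_size:
        pair = get_max_pair(corpus)
        if pair is None:  # nothing left to merge
            break
        corpus = merge_vocab(corpus, pair)
        vocab.append(pair[0] + pair[1])
    return vocab, corpus


vocab, corpus = train_bpe(["low", "low", "lower"], max_vocab_size=7)
```

With this toy input the two merges end up producing the symbol `low`, so the first word is segmented as a single token.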
  3. Character? Byte? 1. Introduction

    • Character (a, b, c, 가, 나, 다, …) • "BPE" usually refers to character-level BPE • Representing text as a sequence of characters is the natural choice • Byte (E3, 81, AE, …) • Compactness: 256 tokens are enough to represent anything • Usable regardless of language
  5. Limitations of Character-level BPE 1. Introduction

    • Characters can take up too many slots in the vocabulary • Rare characters from noisy text • Character-rich languages (such as CJK languages) • Poorly suited to handling several languages at once • bilingual and multilingual • Covering 150 languages requires 138K Unicode characters • By contrast, UTF-8 covers all of them with only 248 of the 256 possible byte values
  6. Byte-level BPE (BBPE) 2. Byte-level BPE

    • Unicode characters are first encoded as UTF-8 • 1 Unicode character = 1 to 4 bytes • BPE is trained on the encoded sequence of bytes • Final vocab: UTF-8 byte set + variable-length byte n-grams added by BPE • Byte sequence (UTF-8 for 가나다라마바사): EA B0 80 EB 82 98 EB 8B A4 EB 9D BC EB A7 88 EB B0 94 EC 82 AC • Byte set: EA, B0, 80, EB, 82, 98, 8B, A4, 9D, BC, A7, 88, 94, EC, AC • Variable-length n-gram bytes: EA B0, EB 82 98, A4 EB, …
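The byte sequence on this slide can be reproduced with Python's built-in UTF-8 encoder. This is a small sketch; the variable names are illustrative, and the input string 가나다라마바사 is inferred from the bytes shown on the slide.

```python
text = "가나다라마바사"

# Encode the Unicode string to UTF-8; each Hangul syllable becomes 3 bytes.
byte_seq = text.encode("utf-8")
hex_tokens = [f"{b:02X}" for b in byte_seq]

# Base byte set: unique bytes in order of first appearance.
byte_set = list(dict.fromkeys(hex_tokens))

print(" ".join(hex_tokens))
print(", ".join(byte_set))
```

BPE merges are then learned over `hex_tokens` rather than over characters, which is how n-grams such as `EA B0` (part of one character) or `A4 EB` (spanning two characters) can enter the vocabulary.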
  7. Byte-level BPE (BBPE) 2. Byte-level BPE

    • Unicode characters are first encoded as UTF-8 • 1 Unicode character = 1 to 4 bytes • BPE is trained on the encoded sequence of bytes • Final vocab: UTF-8 byte set + variable-length byte n-grams added by BPE • As a result, a single character can be split across tokens
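  8. Contextualization 2. Byte-level BPE

    • The paper states that contextualization is needed before the input reaches the main model • The embeddings pass through a simple CNN or GRU and are then fed into the main model, a Transformer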
  9. Decoding 2. Byte-level BPE

    • Every sentence can be represented as a byte sequence, but some byte sequences cannot be cleanly restored (decoded) into a sentence • Ex) Generation, Translation
  10. Decoding 2. Byte-level BPE

    • Empirically, trained models are reported to rarely output invalid byte sequences • Almost none appeared in the experiments, and they were rare even on a large test set of 165K examples • Under-trained models tend to repeat redundant bytes • The goal is to recover as many Unicode characters as possible from such error patterns in linear time • A dynamic-programming-based algorithm is proposed
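  11. Decoding: algorithm 2. Byte-level BPE

    • A byte sequence $\{B\}_{k=1}^{N}$ is given • Let $f(k)$ be the maximum number of characters that can be recovered from its first $k$ bytes • $f(k)$ can be computed by dynamic programming: $f(k) = \max_{t=1,2,3,4} \{ f(k-t) + g(k-t+1, k) \}$ • $g(i, j) = 1$ if $\{B\}_{k=i}^{j}$ is a valid character, otherwise 0 • Recording the choice made at each $k$ while computing $f(k)$ recursively lets the solution be recovered by backtracking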
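The decoding procedure can be sketched in Python as follows, assuming the recurrence f(k) = max over t in {1, 2, 3, 4} of f(k - t) + g(k - t + 1, k) with f(0) = 0. The function names `is_valid_char` and `recover` are hypothetical, and validity is checked with Python's strict UTF-8 decoder; this is an illustration of the idea, not the authors' code.

```python
def is_valid_char(chunk: bytes) -> bool:
    """g(i, j): True if the chunk is exactly one valid UTF-8 encoded character."""
    try:
        return len(chunk.decode("utf-8")) == 1
    except UnicodeDecodeError:
        return False


def recover(data: bytes) -> str:
    """Recover as many characters as possible from a possibly invalid byte sequence."""
    n = len(data)
    f = [0] * (n + 1)          # f[k]: max characters recoverable from data[:k]
    choice = [0] * (n + 1)     # choice[k]: length t of the span chosen to end at k
    valid = [False] * (n + 1)  # valid[k]: whether that chosen span is a character

    for k in range(1, n + 1):
        best = -1
        for t in range(1, min(4, k) + 1):  # a UTF-8 character is 1 to 4 bytes
            ok = is_valid_char(data[k - t:k])
            score = f[k - t] + (1 if ok else 0)
            if score > best:
                best, choice[k], valid[k] = score, t, ok
        f[k] = best

    # Backtrack through the recorded choices, skipping spans that are not characters.
    chars = []
    k = n
    while k > 0:
        t = choice[k]
        if valid[k]:
            chars.append(data[k - t:k].decode("utf-8"))
        k -= t
    return "".join(reversed(chars))
```

Each position inspects at most four candidate spans, so the running time is linear in the number of bytes. For example, `recover(bytes.fromhex("EB82EB8298"))` skips the dangling `EB 82` and returns the single character 나.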
  12. Experimental Setting 3. Experiments

    • Dataset • Bilingual: En-De, Ja-En, Si-En • Multilingual: Many-to-English (X-En) → TED Talk Corpus, parallel data for 59 languages • BPE & BBPE: trained with SentencePiece on source + target sentences
  13. Experimental Setting 3. Experiments

    • Model and Learning • Uses the Transformer • Largely follows the settings of Vaswani et al., 2017 • Inference and Evaluation • Beam size: 4 for En-De, 5 for the others • We calculate case-sensitive tokenized BLEU (Papineni et al. 2002) as the metrics using sacreBLEU (Post 2018).
  14. Results: Qualitative Comparison: BPE vs. BBPE 3. Experiments

    • Symbol Frequency Distribution • BBPE symbols are far more evenly spread. There is almost no long tail, and rare words are represented as subwords
  15. Results: Qualitative Comparison: BPE vs. BBPE 3. Experiments

    • Ratio of BBPE tokens with partial characters • For Japanese and the multilingual setting, the ratio of partial characters is substantial. Character set: Japanese (8K), Multilingual (11K)
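One way to measure this ratio is to treat a token as containing a partial character when its bytes do not decode as complete UTF-8. The sketch below assumes that definition and computes the ratio over a vocabulary; the function names and the three example tokens (taken from the n-gram examples earlier in the deck) are illustrative, and the paper's exact counting procedure may differ.

```python
def has_partial_character(token: bytes) -> bool:
    """True if the token's bytes do not decode to whole UTF-8 characters."""
    try:
        token.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def partial_ratio(vocab) -> float:
    """Fraction of tokens in `vocab` that contain a partial character."""
    if not vocab:
        return 0.0
    return sum(has_partial_character(t) for t in vocab) / len(vocab)


# EA B0 is the start of a character, EB 82 98 is a whole character,
# and A4 EB spans the end of one character and the start of the next.
example_vocab = [bytes.fromhex("EAB0"), bytes.fromhex("EB8298"), bytes.fromhex("A4EB")]
ratio = partial_ratio(example_vocab)
```

In this toy vocabulary two of the three tokens are partial, so `ratio` is 2/3.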
  16. Results: Qualitative Comparison: BPE vs. BBPE 3. Experiments

    • Cross-lingual Sharing • How much do the symbols overlap with those of X-En? • Tested on Ar, He, Ru, Ko, and It • Overall, BBPE symbols overlap much more • On the model side, this benefits parameter sharing • On the vocab side, this benefits universal modeling
  17. Results: Qualitative Comparison: BPE vs. BBPE 3. Experiments

    • Impact on Sequence Length • Because BBPE works with finer-grained units, one might expect its sequences to be longer, but they are not dramatically longer
  18. Results: Importance of Contextualization 3. Experiments

    • Three settings are compared on X-En • none • 1-layer CNN • 1-layer Bi-GRU • The finer-grained the vocab, the larger the effect
  19. Results: BBPE on Noisy Character Sets 3. Experiments

    • The En-De dataset contains some non-Latin alphabets • As a result, its character set is as large as 3.4K • BPE has to include the entire character set, so these characters waste many vocab slots • BBPE 2K and 4K achieve results similar to BPE 32K • while using far fewer parameters
  20. Results: BBPE on Character-Rich Languages 3. Experiments

    • Chinese and Japanese have character sets of more than 50K • The Ja-En dataset has a character set of 8K in total, and the top 2.4K characters cover 99% of it • With this in mind, the BBPE size is set to 4K • Performance is comparable to BPE
  21. Results: BBPE on Many-to-En Translation 3. Experiments

    • Impact on Sequence Length • The large gap between source and target lengths seems to make attention harder. Could that be why (B)BPE performance drops?
  22. Results: BBPE on Many-to-En Translation 3. Experiments

    • Nevertheless, BBPE overall seems to strike a good balance between performance and speed
  23. Results: Transfer Learning on Unseen Characters 3. Experiments

    • BBPE contains every UTF-8 byte, so it cannot have an OOV problem • This makes transfer possible between two languages whose character sets do not overlap at all • Fine-tuning a model pre-trained on X-En for Si-En shows that the transfer works well
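The no-OOV property can be illustrated with a short sketch. The English-only `char_vocab` is a hypothetical character vocabulary, and `byte_vocab` uses all 256 byte values as a stand-in for the base byte set; the Sinhala string is only an example input.

```python
# Hypothetical character-level vocabulary built from English-only data.
char_vocab = set("abcdefghijklmnopqrstuvwxyz ")

# Assumed base byte vocabulary: every possible byte value.
byte_vocab = set(range(256))

# The word "Sinhala" written in Sinhala script (5 code points).
text = "\u0dc3\u0dd2\u0d82\u0dc4\u0dbd"

# Every character is out of vocabulary for the character-level model...
oov_chars = [ch for ch in text if ch not in char_vocab]

# ...but every UTF-8 byte of the same text is already in the byte vocabulary.
oov_bytes = [b for b in text.encode("utf-8") if b not in byte_vocab]
```

Here `oov_chars` holds all five characters while `oov_bytes` is empty, which is why a byte-level vocabulary learned on X-En can still encode Si-En text.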
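  24. Contributions 4. Conclusion

    • Proposes BBPE, which builds a byte-level subword vocabulary • Compared with character-based methods, it keeps performance while making the vocabulary much smaller • In multilingual settings it sometimes performs even better • No OOV problem at all • Transfer across diverse languages is possible; the approach is highly generic and brings gains in performance and training acceleration • Sequences are also shorter than with character-based methods, enabling faster training and inference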
  25. Future Work 4. Conclusion

    • Address the drop in performance when the source and target lengths differ greatly • Evaluate in One-to-Many and Many-to-Many settings as well
  26. Thank you ✌

    If you have further questions or anything you are curious about, feel free to reach out at any time using the contact details below! 이주홍 (ML Research Scientist, Pingpong) Email. [email protected] Facebook. @roomylee LinkedIn. @roomylee