Huang et al. 2020 Unsupervised Multimodal Neural Machine Translation with Pseudo Visual Pivoting

Unsupervised Multimodal Neural Machine Translation with Pseudo Visual Pivoting
ACL 2020
Presenter: Tosho Hirasawa, Tokyo Metropolitan University

Summary
• Unsupervised multimodal machine translation
  • Train: L1-Img, L2-Img (usually, no image overlap)
  • Test: L1-Img-L2
• Introduces three new losses, from:
  • Multilingual visual-semantic embedding
  • Pivoted Captioning for Back-Translation
  • Pivoted Captioning for Paired-Translation
• SOTA for unsupervised multimodal MT on Multi30K

Unsupervised Machine Translation
• Common principles for unsupervised MT
  • Pre-training step is essential
    • Masked language model [Conneau and Lample, 2019]
    • Span-based seq-to-seq masking [Song et al., 2019]
  • Back-translation loss
Approach          Train/Val   Test    Loss
Supervised MT     L1-L2       L1-L2   [on slide]
Unsupervised MT   L1, L2      L1-L2   [on slide]
g∗, h∗: sentence predictors

Unsupervised Multimodal Machine Translation
Approach                     Train/Val        Test        Loss
Supervised MT                L1-L2            L1-L2       [on slide]
Unsupervised MT              L1, L2           L1-L2       [on slide]
Supervised multimodal MT     L1-Img-L2        L1-Img-L2   MT loss; (auxiliary loss from L1-Img)
Unsupervised multimodal MT   L1-Img, L2-Img   L1-Img-L2   BT loss; Multilingual Visual-Semantic Embedding; Pivoted Captioning for Back-Translation; Pivoted Captioning for Paired-Translation
g∗, h∗: sentence predictors

UMMT: Overall Architecture
• Train Enc./Dec. for both directions (x->y, y->x) synchronously

UMMT: Back-Translation
• Back-translation objective: [equation on slide]
• z: image
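The equation itself did not survive in the transcript. As a rough sketch only, a back-translation loss conditioned on the image is usually written as below. The notation is partly an assumption: x and y are sentences in the two languages, z_x and z_y their paired images, g∗ and h∗ the sentence predictors named on the slides, and p_{x→y}, p_{y→x} the two translation directions.

```latex
\mathcal{L}_{x \leftrightarrow y}^{\mathrm{BT}}
  = \mathbb{E}_{(x, z_x)}\!\left[ -\log p_{y \to x}\!\left(x \mid g^{*}(x, z_x),\, z_x\right) \right]
  + \mathbb{E}_{(y, z_y)}\!\left[ -\log p_{x \to y}\!\left(y \mid h^{*}(y, z_y),\, z_y\right) \right]
```

Here g∗(x, z_x) is the current model's translation of x into the other language, so each term asks the opposite direction to reconstruct the original sentence from that translation and the image.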

UMMT: Multilingual Visual-Semantic Emb.
• Intuition: Align the latent spaces of the source and target languages by using images as the pivot
• VSE objective: Max-margin loss w/ negative sampling
• Compute similarity of two sequences
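A minimal sketch of a max-margin loss with negatives, to make the VSE bullet concrete. Several things here are assumptions rather than details from the slides: sentences and images are already pooled into single vectors, similarity is cosine, the negatives are the other items in the same batch, and the margin value is arbitrary. The function names are hypothetical.

```python
import numpy as np

def cosine_sim_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a (B, d) and b (B, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def max_margin_loss(sent_emb: np.ndarray, img_emb: np.ndarray,
                    margin: float = 0.1) -> float:
    """Bidirectional hinge loss; row i of each input is a matched pair."""
    sim = cosine_sim_matrix(sent_emb, img_emb)        # (B, B)
    pos = np.diag(sim)                                # matched-pair similarity
    # Sentence i against every non-matching image j.
    cost_s2i = np.maximum(0.0, margin + sim - pos[:, None])
    # Image j against every non-matching sentence i.
    cost_i2s = np.maximum(0.0, margin + sim - pos[None, :])
    off_diag = 1.0 - np.eye(sim.shape[0])             # drop the positives
    return float(((cost_s2i + cost_i2s) * off_diag).sum() / sim.shape[0])

def multilingual_vse_loss(x_emb, zx_emb, y_emb, zy_emb, margin=0.1) -> float:
    """Both languages are pulled toward the same image space (assumed form)."""
    return (max_margin_loss(x_emb, zx_emb, margin)
            + max_margin_loss(y_emb, zy_emb, margin))
```

Because both languages share the image side, minimizing the two terms together pushes source and target sentences describing similar images toward nearby points, which is the alignment the intuition bullet describes.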

UMMT: Pivoted Captioning for BT
• Intuition: Reconstruct pseudo image captions
• Pre-train captioning models w/ large-scale dataset
• Train two models (c_x, c_y) on disjoint subsets
• Objective: [equation on slide]
• Gold sentences/translations are NOT involved in CBT
g∗, h∗: sentence predictors
c∗: caption predictors

UMMT: Pivoted Captioning for Paired-Trans.
• Intuition: Translate pseudo image captions
• Same captioning models as CBT
• Objective: [equation on slide]
• Gold sentences/translations are NOT involved in CPT
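The data flow behind the two captioning losses can be sketched as follows. This is an illustration of the intuition bullets only, not the authors' code: `caption_x`, `caption_y`, and `translate_xy` are hypothetical callables standing in for the captioning models c_x, c_y and the current x->y translator, and the dictionary layout is made up for readability.

```python
from typing import Callable, Dict, List

Image = str  # placeholder type; real inputs would be visual features

def cbt_examples(images: List[Image],
                 caption_x: Callable[[Image], str],
                 translate_xy: Callable[[str, Image], str]) -> List[Dict[str, str]]:
    """CBT: caption the image, translate the caption, then train the
    reverse direction to reconstruct the pseudo caption."""
    examples = []
    for z in images:
        pseudo_x = caption_x(z)               # pseudo caption in language x
        pseudo_y = translate_xy(pseudo_x, z)  # its machine translation
        examples.append({"input": pseudo_y, "image": z, "target": pseudo_x})
    return examples

def cpt_examples(images: List[Image],
                 caption_x: Callable[[Image], str],
                 caption_y: Callable[[Image], str]) -> List[Dict[str, str]]:
    """CPT: two captions of the same image act as a pseudo parallel pair."""
    return [{"input": caption_x(z), "image": z, "target": caption_y(z)}
            for z in images]
```

In both cases every sentence is generated from the image, which is why no gold sentence or translation is involved. The mirrored direction (y->x) would be built the same way with the roles of the two captioners swapped.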

UMMT: Loss overview
• Minimizing joint loss: [equation on slide]
• No interpolation weight for each loss
• BUT, the weight of the CPT loss is decreased by a scheduler
  • Decrease weight from 1.0 to 0.1 at the 10th epoch
  • Avoids training on noisy captions in the later stages of training
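A small sketch of the scheduling described above. The slide does not say whether the weight drops in one step at epoch 10 or is annealed until then, so both readings are offered behind a flag; the function names and the 0-indexed epoch convention are assumptions.

```python
def cpt_weight(epoch: int, switch_epoch: int = 10,
               start: float = 1.0, end: float = 0.1,
               linear: bool = False) -> float:
    """Weight applied to the CPT loss at a given epoch (0-indexed, assumed)."""
    if epoch >= switch_epoch:
        return end
    if linear:
        # Alternative reading: anneal linearly until the switch epoch.
        return start + (end - start) * epoch / switch_epoch
    return start  # step reading: full weight until the switch epoch

def joint_loss(l_bt: float, l_vse: float, l_cbt: float, l_cpt: float,
               epoch: int) -> float:
    """Unweighted sum of the losses, except for the scheduled CPT term."""
    return l_bt + l_vse + l_cbt + cpt_weight(epoch) * l_cpt
```

Only the CPT term is scaled; the other three losses enter with weight 1, matching the "no interpolation weight" bullet.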

Experiments: Dataset
• English -> {German, French}
• Multi30K
  • Train: 29K, Val: 1K, Test: 1K
• Multi30K-half
  • Train: 14,500 (En-Img), 14,500 ({De, Fr}-Img), no overlap
  • Validation: 507 (En-Img), 507 ({De, Fr}-Img), no overlap
  • Test: 1,000 (En-Img-{De, Fr})
• Preprocess
  • Sentence: tokenization, Byte Pair Encoding
  • Image: Faster-RCNN, 36 objects per image
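One way a non-overlapping split like Multi30K-half could be produced is sketched below. How the halves were actually chosen is not stated on the slides; the seeded shuffle and the function name are assumptions for illustration.

```python
import random
from typing import List, Sequence, Tuple

def split_disjoint_halves(image_ids: Sequence[int],
                          seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split image ids into two equal halves that share no image.

    The first half would keep only its source-language captions and the
    second half only its target-language captions.
    """
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)   # seeded for reproducibility
    half = len(ids) // 2
    return ids[:half], ids[half:2 * half]

# 29,000 training images give 14,500 per language, as on the slide.
src_half, tgt_half = split_disjoint_halves(range(29_000))
```

Since each image appears in only one half, no sentence in one language has a counterpart in the other through a shared image, which is what makes the setting unsupervised.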

Experiments: Model
• Transformer
  • 6 layers
  • 8 heads
  • 1024 hidden units
  • 4096 feed-forward filter size
• Multimodal Transformer
  • Hierarchical multi-head multimodal attention [Libovicky and Helcl, 2017]
    1. Compute two individual context vectors from encoder states and visual features
    2. Map both to a space with the same number of units
    3. Compute attention over the encoder and visual context vectors
    4. Weighted sum
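The four steps above can be sketched with NumPy as follows. This is a simplified single-head illustration for one decoder query, not the multi-head layer used in the model; the projection matrices `W_q_img`, `W_txt`, `W_img` and the scaled dot-product scoring are assumptions.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query: np.ndarray, keys: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention where the keys double as values."""
    scores = keys @ query / np.sqrt(query.shape[0])
    return softmax(scores) @ keys

def hierarchical_attention(query, enc_states, vis_feats, W_q_img, W_txt, W_img):
    """query: (d,), enc_states: (n, d), vis_feats: (m, d_v)."""
    # Step 1: one context vector per modality.
    c_txt = attend(query, enc_states)             # (d,)
    c_img = attend(query @ W_q_img, vis_feats)    # (d_v,)
    # Step 2: project both contexts into a space of the same size.
    contexts = np.stack([c_txt @ W_txt, c_img @ W_img])   # (2, d)
    # Step 3: attend over the two modality contexts.
    weights = softmax(contexts @ query / np.sqrt(query.shape[0]))  # (2,)
    # Step 4: weighted sum of the projected contexts.
    return weights @ contexts, weights

rng = np.random.default_rng(0)
d, d_v = 8, 16
context, modality_weights = hierarchical_attention(
    rng.normal(size=d), rng.normal(size=(5, d)), rng.normal(size=(36, d_v)),
    rng.normal(size=(d, d_v)), rng.normal(size=(d, d)), rng.normal(size=(d_v, d)))
```

The second-level weights show how much the decoder relies on text versus image at each step, so the model can fall back on the textual context when the visual one is uninformative.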

Experiments: Pre-training
• Pre-train Transformer model
  • Dataset: WMT News Crawl from 2007 to 2017
    • 10M sentences for English/German/French
  • Objective: Masked seq-to-seq objective
• Pre-train captioning model
  • Dataset: MS-COCO
    • 56,643 images and 283,215 captions for English
    • Use Google Translate to generate German/French translations
  • Objective: [equation on slide]

Experiments: Evaluation
• BLEU (multi-bleu from Moses)
• METEOR
• Model selection: BLEU scores of “round-trip” translation [Lample+, 2018]
  • source -> target -> source -> evaluate
  • target -> source -> target -> evaluate
  • Empirically shown to correlate well with the test metrics
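The round-trip criterion can be sketched as below. The translators and the metric are passed in as plain callables; `exact_match` is only a toy stand-in for BLEU, and averaging the two directions is an assumption, since the slide does not say how they are combined.

```python
from typing import Callable, List, Sequence

Translator = Callable[[str], str]
Metric = Callable[[List[str], List[str]], float]

def exact_match(hyps: List[str], refs: List[str]) -> float:
    """Toy metric standing in for BLEU: share of exactly recovered sentences."""
    return sum(h == r for h, r in zip(hyps, refs)) / len(refs)

def round_trip_score(src: Sequence[str], tgt: Sequence[str],
                     x2y: Translator, y2x: Translator,
                     metric: Metric = exact_match) -> float:
    """Score a checkpoint using monolingual validation data only."""
    # source -> target -> source, scored against the original source
    src_back = [y2x(x2y(s)) for s in src]
    # target -> source -> target, scored against the original target
    tgt_back = [x2y(y2x(t)) for t in tgt]
    return 0.5 * (metric(src_back, list(src)) + metric(tgt_back, list(tgt)))
```

The checkpoint with the highest score would be kept. No parallel validation data is needed, which fits the unsupervised setting.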

Results: Overall performance

Results: Ablation w.r.t. objectives
• All objectives work well
• Contribution: VSE > CPT > CBT

Results: Generalizability
• Train with images, test without images
• VSE is the key component for using visual information
  • The full model is more sensitive to removing images (-0.65 BLEU) than the model w/o VSE (-0.25 BLEU)
• Results with images: [shown on slide]

Results: Real-pivoting & Low-resource
• Performance improves when training with overlapping images
• The proposed method also works in the low-resource setting

Results: Supervised MT
• Supervised training
  • Multi30K (100% overlapped)
  • Supervised MT objective
• Visual information contributes less to improving performance in supervised MT

Conclusion
• Proposed pseudo visual pivoting for unsupervised multimodal MT
  • Improves the cross-lingual alignments in the shared latent space (VSE)
  • Trains on image-pivoted pseudo sentences (CBT, CPT)
