
Research Trends in Neural Language Models (slides for an invited talk at the NL SIG meeting)


Sho Takase

June 13, 2019



Transcript

1. Research Trends in Neural Language Models. 2019/6/13. Sho Takase, https://takase.github.io/
2. About the speaker
• 2008 - 2017: (university)
• 2017 - 2018: NTT CS Labs
• 2018 - : (current affiliation)
• Selected work: [IJCNLP 17, EMNLP 18, AAAI 19], [EMNLP 16, NAACL 19], [ACL 16]
3. Outline of the talk (agenda slide).
4. Outline of the talk (agenda slide, repeated).
5. What is a language model?
• A language model assigns a probability to a word sequence: natural word order should score higher, e.g. P(I have a dream) > P(a have I dream), while nonsense sequences (P(fuga spam hoge ...)) should score very low.
• Model quality is measured with perplexity on held-out text (lower is better).
• The probability is factorized left to right, and each conditional is estimated by an RNN (the same machinery used in RNN encoder-decoders):
P(I have a dream) = P(I) P(have | I) P(a | I have) P(dream | I have a)
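A minimal sketch of this factorization in PyTorch (the tiny vocabulary and the LSTMLM class below are illustrative placeholders, not the models discussed in the talk):

    import math
    import torch
    import torch.nn as nn

    class LSTMLM(nn.Module):
        def __init__(self, vocab_size, dim=32):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)
            self.lstm = nn.LSTM(dim, dim, batch_first=True)
            self.out = nn.Linear(dim, vocab_size)
        def forward(self, x):
            h, _ = self.lstm(self.embed(x))
            return self.out(h)                      # next-word logits at every position

    vocab = {"<bos>": 0, "I": 1, "have": 2, "a": 3, "dream": 4}
    ids = torch.tensor([[0, 1, 2, 3, 4]])           # <bos> I have a dream
    model = LSTMLM(len(vocab))
    log_p = torch.log_softmax(model(ids[:, :-1]), dim=-1)
    token_log_p = log_p.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    sentence_log_p = token_log_p.sum()              # log P(I) + log P(have | I) + ...
    perplexity = math.exp(-token_log_p.mean().item())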
  6. ",*#' • Noisy channel model – P(T) ", – *352.+&-14

     •  – )%(",   • /,0$ ! – Skip-gramELMoBERT 7
7. Corpora for language modeling
• Small corpora: Penn Treebank (PTB) and WikiText-2; standard for comparing architectures and regularization because experiments are cheap.
• Large corpora: WikiText-103 and the 1 billion word corpus; used for large-scale training and for pretraining such as ELMo and BERT, at much higher computational cost.
8. Corpora for language modeling (the same slide, repeated).
9. The standard benchmark: Penn Treebank (PTB)
• Wall Street Journal text from the Penn Treebank, preprocessed for language modeling by [Mikolov+ 11].
• Vocabulary capped at 10,000 words.
• Numbers are replaced with N (e.g. "10 million" → "N million"); out-of-vocabulary words are replaced with <unk>.
• About 887,521 training tokens, roughly 1/1000 the size of the 1 billion word corpus.
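A rough sketch of this preprocessing in Python (illustrative only; the regular expression and the tiny vocabulary below are my own simplification, not the actual script of [Mikolov+ 11]):

    import re

    def preprocess(tokens, vocab):
        out = []
        for tok in tokens:
            if re.fullmatch(r"[\d.,]+", tok):   # collapse numbers: "10" -> "N"
                out.append("N")
            elif tok.lower() in vocab:
                out.append(tok.lower())
            else:
                out.append("<unk>")             # out-of-vocabulary words
        return out

    vocab = {"the", "company", "million", "said"}
    print(preprocess("The company earned 10 million dollars".split(), vocab))
    # ['the', 'company', '<unk>', 'N', 'million', '<unk>']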
10. Perplexity on PTB over the years
2012: 141.2 Kneser-Ney smoothed 5-gram; 124.7 RNN language model [Mikolov+ 12]
2014: 78.4 LSTM language model [Zaremba+ 14]
2016: 75.0 LSTM language model with variational dropout [Gal+ 16]
2017: 68.5 Recurrent Highway Network [Zilly+ 17]; 66.1 weight tying of input/output embeddings [Inan+ 17]; 64.4 input-to-output gate [Takase+ 17]; 60.3 Simple Recurrent Unit [Lei+ 17]; 57.3 LSTM with careful regularization and optimization (AWD-LSTM) [Merity+ 18]; 54.4 Mixture of Softmaxes [Yang+ 18]
2018: 52.4 Mixture of Softmaxes with direct output connections [Takase+ 18]; 47.2 Ensemble [Takase+ 18]
Search-based approaches: 62.4 neural architecture search with reinforcement learning [Zoph+ 17]; 58.3 LSTM with large-scale hyperparameter search [Melis+ 18]
11. (The PTB perplexity table from the previous slide, repeated with additional annotations.)
12. ASGD Weight-Dropped LSTM (AWD-LSTM) [Merity+ 18]
• Combines existing regularization and optimization techniques:
- Variational dropout [Gal+ 16]
- DropConnect [Wan+ 13]
- Weight tying [Inan+ 17, Press+ 17]
- Averaged SGD [Polyak+ 92]
• Hyperparameters also differ substantially from the earlier LSTM LM [Zaremba+ 14], e.g. number of training epochs (roughly 30 ~ 40 vs. 500 ~ 1000) and gradient clipping (5 vs. 0.25), tuned on PTB.
13. ASGD Weight-Dropped LSTM (AWD-LSTM) [Merity+ 18] (the overview slide, repeated).
14. The LSTM language model [Zaremba+ 14]
• The LSTM state f(x, h_{t-1}) carries the whole history, whereas an N-gram model only sees the previous N-1 words.
• Perplexity on PTB: 141.2 (Kneser-Ney 5-gram) → 78.4.
• Dropout is applied only to the non-recurrent (vertical) connections: at training time units are dropped with probability p and the rest are scaled by 1/(1 - p); the recurrent connections are left untouched.
[Figure: an LSTM reads <BOS>, I, have, ... and outputs P(I), P(have | I), P(a | I have), ...]
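A minimal sketch of this style of regularization in PyTorch (layer sizes and the dropout rate are illustrative; only the non-recurrent connections are dropped):

    import torch
    import torch.nn as nn

    class ZarembaLM(nn.Module):
        def __init__(self, vocab_size, dim=650, p=0.5):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)
            self.drop = nn.Dropout(p)                 # inverted dropout: scales by 1/(1-p)
            self.lstm1 = nn.LSTM(dim, dim, batch_first=True)
            self.lstm2 = nn.LSTM(dim, dim, batch_first=True)
            self.out = nn.Linear(dim, vocab_size)
        def forward(self, x):
            h = self.drop(self.embed(x))              # dropout on the input to layer 1
            h, _ = self.lstm1(h)
            h = self.drop(h)                          # dropout between layer 1 and layer 2
            h, _ = self.lstm2(h)
            return self.out(self.drop(h))             # dropout before the softmax layer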
15. AWD-LSTM overview (repeated); the next slides cover Variational dropout.
16. Variational dropout [Gal+ 16]
• Sample one dropout mask per sequence and reuse the same mask at every time step, including on the recurrent connections.
• Earlier work applied dropout inside the recurrence [Moon+ 15]; the key difference is keeping the mask fixed over time.
• The recipe is derived from a Bayesian interpretation of the RNN (next slides).
[Figure: a standard LSTM LM vs. the same LM with variational dropout; the same mask is reused at every step.]
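A minimal sketch of the shared mask, assuming inputs of shape (batch, time, dim); this only covers the layer input, but the same idea applies to the recurrent state:

    import torch

    def variational_dropout(x, p, training):
        # one Bernoulli mask per (batch, dim), broadcast over the time dimension
        if not training or p == 0.0:
            return x
        mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1 - p) / (1 - p)
        return x * mask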
17. Why share the mask? A Bayesian view (1/3)
• Treat the network weights θ as random variables rather than fixed values.
• We want the posterior p(θ | X, Y) given the training data X, Y, but it is intractable, so we approximate it with a simpler distribution q(θ).
• q(θ) is fit by minimizing KL( q(θ) || p(θ | X, Y) ); dropout will turn out to correspond to a particular choice of q(θ).
18. A Bayesian view (2/3)
• Minimizing KL( q(θ) || p(θ | X, Y) ) yields an objective in which, for every training input x_i, one set of weights θ is sampled from q(θ) and the whole output y = f_θ(x_i) is computed with that single sample.
• Because the same sampled θ must be used for the entire sequence x_i, the corresponding dropout mask must also stay fixed across time steps; a new mask at every step would amount to sampling new weights mid-sequence.
19. A Bayesian view (3/3)
• Choose q(θ_k) so that each weight vector is, with probability p, (near) zero and otherwise close to a learned value m_k, i.e. a mixture of narrow Gaussians.
• Sampling from this q(θ) is exactly dropout: draw a 0/1 mask over units with drop probability p and reuse it for the whole sequence.
• Dropout with a mask that is shared over time is therefore variational inference under this approximation.
20. Does variational dropout help?
• Only the way the dropout mask is applied changes; the architecture stays the same.
• [Gal+ 16] additionally evaluate by averaging predictions over 1,000 sampled masks at test time (MC dropout), which gives a further gain of one to two points: 79.7 → 78.6 (medium LSTM) and 75.2 → 73.4 (large LSTM).
• The same considerations apply to RNNs in general, not only to language models.
21. AWD-LSTM overview (repeated); the next slide covers DropConnect.
22. DropConnect [Wan+ 13]
• Dropout zeroes unit activations with probability p; DropConnect instead zeroes individual weights with probability p (the mask M is drawn from Bernoulli(1 - p) over the weight matrix).
• AWD-LSTM applies DropConnect to the hidden-to-hidden (recurrent) weight matrices of the LSTM, which regularizes the recurrence without changing the mask at every time step.
[Figure: a weight matrix W with some entries zeroed, applied to the activations v to produce a.]
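A sketch of a single LSTM step with DropConnect on the recurrent weights (tensor shapes are given in the comments; this is a simplified illustration, not the AWD-LSTM implementation):

    import torch

    def lstm_step_with_weight_drop(x, h, c, w_ih, w_hh, b, p=0.5, training=True):
        # x, h, c: (batch, dim); w_ih, w_hh: (4*dim, dim); b: (4*dim,)
        if training:
            mask = torch.bernoulli(torch.full_like(w_hh, 1 - p))
            w_hh = w_hh * mask / (1 - p)            # DropConnect on recurrent weights only
        gates = x @ w_ih.t() + h @ w_hh.t() + b
        i, f, g, o = gates.chunk(4, dim=-1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c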
23. AWD-LSTM overview (repeated); the next slide covers Weight tying.
24. Weight tying [Inan+ 17, Press+ 17]
• Share the input word-embedding matrix and the output (softmax) weight matrix: E = W^T.
• This removes one of the largest parameter blocks and improves perplexity.
• [Inan+ 17] also motivate the tying theoretically and add a loss defined in the embedding space.
[Figure: an n-layer LSTM f_n whose output layer reuses the embedding of the 1-hot input x_t.]
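A minimal weight-tying sketch in PyTorch (the embedding and hidden sizes must match; names are illustrative):

    import torch
    import torch.nn as nn

    class TiedLM(nn.Module):
        def __init__(self, vocab_size, dim=400):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)
            self.lstm = nn.LSTM(dim, dim, batch_first=True)
            self.out = nn.Linear(dim, vocab_size, bias=False)
            self.out.weight = self.embed.weight       # E = W^T: share one parameter tensor
        def forward(self, x):
            h, _ = self.lstm(self.embed(x))
            return self.out(h)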
25. AWD-LSTM overview (repeated); the next slide covers Averaged SGD.
26. Averaged SGD (ASGD) [Polyak+ 92]
• Plain SGD returns the final iterate θ_t; ASGD returns the average of the iterates instead.
• AWD-LSTM trains with plain SGD and switches to ASGD once the validation perplexity stops improving (a non-monotonic trigger).
• Compared with plain SGD, this yields better perplexity.
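A sketch of this switch using torch.optim (the learning rate and trigger condition are illustrative, not the exact AWD-LSTM recipe):

    import torch

    def make_optimizer(params, lr=30.0, averaging=False):
        # plain SGD at first; ASGD keeps a running average of the iterates
        # (t0=0 starts averaging immediately, lambd=0 disables decay)
        if averaging:
            return torch.optim.ASGD(params, lr=lr, t0=0, lambd=0.0)
        return torch.optim.SGD(params, lr=lr)

    # inside the training loop (illustrative pseudo-condition):
    # if not averaging and validation perplexity has not improved for n epochs:
    #     optimizer = make_optimizer(model.parameters(), averaging=True)
    #     averaging = True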
27. (The PTB perplexity table from slide 10, repeated.)
28. Mixture of Softmaxes [Yang+ 18]
• A single softmax computed from the final LSTM state limits the rank of the matrix of next-word distributions (the "softmax bottleneck").
• [Yang+ 18]: compute several softmax distributions P1, P2, P3, ... from the final hidden state and mix them with learned weights to obtain P.
• [Takase+ 18] (Direct Output Connection): also compute softmaxes from intermediate layers and mix them.
[Figure: LSTM outputs feed several softmaxes P1, P2, P3 that are combined into P.]
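A minimal Mixture of Softmaxes output layer in PyTorch (k and the layer names are illustrative):

    import torch
    import torch.nn as nn

    class MoSOutput(nn.Module):
        def __init__(self, dim, vocab_size, k=3):
            super().__init__()
            self.k = k
            self.prior = nn.Linear(dim, k)              # mixture weights
            self.latent = nn.Linear(dim, k * dim)       # one projected context per component
            self.decoder = nn.Linear(dim, vocab_size)   # shared (typically tied) softmax weights
        def forward(self, h):                           # h: (batch, dim)
            pi = torch.softmax(self.prior(h), dim=-1)                   # (batch, k)
            ctx = torch.tanh(self.latent(h)).view(-1, self.k, h.size(-1))
            probs = torch.softmax(self.decoder(ctx), dim=-1)            # (batch, k, vocab)
            return (pi.unsqueeze(-1) * probs).sum(dim=1)                # (batch, vocab)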
29. FAQ on Mixture of Softmaxes: whether a single softmax is really the limiting factor (yes, per the rank analysis in [Yang+ 18]), how large the benefit is (roughly 5 ~ 10% in the reported settings), how many softmax components are needed, and whether the effect is specific to a small corpus like PTB.
30. Do these techniques carry over to other tasks (e.g. machine translation)?
• Variational dropout: applicable to encoder-decoder models; an implementation exists in OpenNMT (the lua version).
• DropConnect: likewise applicable to encoder-decoder models.
• Weight tying: adopted in the Transformer [Vaswani+ 17].
• ASGD: the Transformer [Vaswani+ 17] does not use ASGD itself, but checkpoint averaging plays a similar role and gives a BLEU gain of about 0.2 on En-Cz [Popel+ 18].
• MoS: applied to an LSTM encoder-decoder (OpenNMT), it improves BLEU by 1.7 on En-Fr (IWSLT 2016) [Takase+ 18].
31. Summary of this part
• Many architectures have been proposed, but on these benchmarks a well-regularized, well-tuned LSTM remains very strong: extensive hyperparameter search alone is competitive [Melis+ 18].
• The representative model is AWD-LSTM [Merity+ 18], which combines Variational dropout, DropConnect, Weight tying and ASGD.
• Improving the output layer lowers perplexity further [Yang+ 18, Takase+ 18].
• Several of these regularization techniques also help in other tasks.
32. Outline of the talk (agenda slide, repeated).
33. Recent methods and their perplexity on PTB
57.3 AWD-LSTM [Merity+ 18]
56.8 AWD-LSTM + Fraternal Dropout [Zołna+ 18]
56.5 AWD-LSTM (reproduced) [Merity+ 18]
56.1 AWD-LSTM + FRAGE [Gong+ 18]
55.7 DARTS (architecture search) [Liu+ 19]
54.4 AWD-LSTM-MoS [Yang+ 18]
53.3 AWD-LSTM-MoS + FRAGE [Gong+ 18]
52.4 AWD-LSTM-DOC [Takase+ 18]
48.4 AWD-LSTM-MoS + Dynamic evaluation (reproduced) [Yang+ 18]
47.7 AWD-LSTM-MoS + Dynamic evaluation [Yang+ 18]
47.7 AWD-LSTM-MoS (SigSoftmax) + Dynamic evaluation [Kanai+ 18]
47.7 AWD-LSTM-MoS + FRAGE + Dynamic evaluation (reproduced) [Gong+ 18]
47.2 AWD-LSTM-DOC (Ensemble) [Takase+ 18]
46.5 AWD-LSTM-MoS + FRAGE + Dynamic evaluation [Gong+ 18]
34. Lowering perplexity further: Dynamic evaluation [Krause+ 17]
• While processing the evaluation data, keep updating the model parameters with gradient steps, so the model adapts to the local context of the test text.
• The idea follows cache language models [Kuhn+ 90, Grave+ 17], which also exploit the recent history at evaluation time.
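A sketch of dynamic evaluation (the learning rate and segmenting are illustrative; [Krause+ 17] use a more elaborate update rule):

    import math
    import torch
    import torch.nn.functional as F

    def dynamic_eval(model, segments, lr=1e-4):
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        total_loss, total_tokens = 0.0, 0
        for x, y in segments:               # x: input ids, y: next-word targets
            logits = model(x)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
            total_loss += loss.item() * y.numel()
            total_tokens += y.numel()
            opt.zero_grad()
            loss.backward()                 # adapt the parameters on the observed text
            opt.step()
        return math.exp(total_loss / total_tokens)   # perplexity under adaptation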
35. A caution about the reported numbers
• Reproducing published results is not always straightforward: rerunning the released settings, AWD-LSTM-MoS (+ Dynamic evaluation) moves from the reported 47.7 to 48.4, and AWD-LSTM-MoS + FRAGE (+ Dynamic evaluation) from 46.5 to 47.7.
• Differences of about one perplexity point can thus fall within reproduction variance, so small gaps between methods should be read with care.
36. (The table of recent PTB results from slide 33, repeated.)
37. Fraternal Dropout [Zołna+ 18]
• Run the same network twice on the same input with two different dropout masks (rates p1, p2) and add a penalty that keeps the two sets of predictions close, making the model less sensitive to the particular mask.
• Reported perplexity: 57.3 → 56.8 over AWD-LSTM; however, the reproduced AWD-LSTM baseline already reaches 56.5, so the margin over a tuned baseline is small.
• AWD-LSTM-DOC reaches lower perplexity (52.4).
[Figure: two forward passes with different dropout masks whose outputs are tied together by an extra loss.]
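A sketch of the training objective (kappa and the use of logits in the penalty are illustrative simplifications):

    import torch
    import torch.nn.functional as F

    def fraternal_loss(model, x, y, kappa=0.1):
        logits1 = model(x)                  # forward pass with dropout mask 1
        logits2 = model(x)                  # same weights, a second mask is sampled
        ce = 0.5 * (F.cross_entropy(logits1.view(-1, logits1.size(-1)), y.view(-1)) +
                    F.cross_entropy(logits2.view(-1, logits2.size(-1)), y.view(-1)))
        consistency = ((logits1 - logits2) ** 2).mean()   # keep the two predictions close
        return ce + kappa * consistency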
38. FRAGE [Gong+ 18]
• Word embeddings tend to be separable by word frequency; FRAGE makes them frequency-agnostic with adversarial training: a discriminator (parameters θ_D, loss L_D) tries to tell frequent from rare words by their embeddings, and the overall objective combines the task loss L_T with L_D (L_T + L_D on the slide), optimized adversarially so that the discriminator fails.
• Reported perplexity: 57.3 → 56.1, against the reproduced AWD-LSTM baseline of 56.5.
[Figure: an LSTM LM with a discriminator attached to the embedding layer.]
39. AWD-LSTM-MoS + FRAGE
• Reproducing the reported 53.3 gives 53.8.
• Retuning dropout on plain AWD-LSTM-MoS reaches 53.6.
• AWD-LSTM-DOC still achieves lower perplexity.
40. Discussion: how to read these recent PTB results.
41. Looking ahead
• Very large language models trained on massive web corpora, such as GPT-2 [Radford+ 19], can generate strikingly fluent text.
• How to evaluate such models, and how the benchmark-driven results above relate to them, remain open questions.
42. Outline of the talk (agenda slide, repeated).
43. Pretraining with language models
• Train a language model on a large unlabeled corpus (1 billion word corpus, English Wikipedia, ...) and reuse it for downstream tasks.
• Left-to-right models (LSTM-based ELMo, Transformer-based GPT) are trained to predict the next word: P(have | I), P(a | I have), P(dream | I have a), ...
• BERT instead replaces some input words with [MASK] (e.g. "I [MASK] a", answer "have") and is trained to recover them: a masked language model.
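A minimal sketch of BERT-style masking (the 15% rate follows BERT; the -100 ignore-index convention and everything else here are illustrative):

    import torch

    def mask_tokens(ids, mask_id, p=0.15):
        ids = ids.clone()
        mask = torch.rand_like(ids, dtype=torch.float) < p
        labels = torch.where(mask, ids, torch.full_like(ids, -100))  # -100 = ignored in the loss
        ids[mask] = mask_id
        return ids, labels   # train the model to recover `labels` at the masked positions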
44. From word2vec to BERT
• [Mikolov+ 13]: the word embeddings of an RNN language model capture linguistic regularities (man : woman ≈ king : queen).
• [Mikolov+ 13]: skip-gram and CBOW (word2vec) replace the RNN with much cheaper objectives, making training on the 1 billion word corpus practical.
• [Peters+ 17]: a pretrained LSTM language model improves semi-supervised sequence tagging.
• ELMo [Peters+ 18]: deep bidirectional LSTM language models trained on the 1 billion word corpus give contextualized word representations (the vector for a word in "I have a dream that ..." depends on its context).
• BERT [Devlin+ 19]: replaces the LSTM with a Transformer and next-word prediction with masked-word prediction.
45. How does this relate to the PTB results?
• Pretraining research (ELMo, BERT) and benchmark language modeling research have developed somewhat separately; regularizers that matter on PTB, such as the variational dropout used in AWD-LSTM and AWD-LSTM-DOC [Merity+ 18, Takase+ 18], are not necessarily used when training on large corpora.
• Transformer-based language models are now competitive on PTB as well: Transformer-XL [Dai+ 19] reports 54.44, better than AWD-LSTM's 57.3.
46. Summary
• On the standard benchmarks, an LSTM with strong regularization (AWD-LSTM [Merity+ 18]) remains a very strong model, and improvements to the output layer help further [Yang+ 18, Takase+ 18].
• Recent gains are small relative to reproduction variance, so comparisons should be made against carefully reproduced baselines.
• Pretrained language models are the other major current direction.
47. References (1/5)
• Mikolov et al., Empirical Evaluation and Combination of Advanced Language Modeling Techniques. INTERSPEECH 2011.
• Mikolov et al., Context Dependent Recurrent Neural Network Language Model. SLT 2012.
• Zaremba et al., Recurrent Neural Network Regularization. 2014.
• Gal et al., A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. NIPS 2016.
• Zilly et al., Recurrent Highway Networks. ICML 2017.
• Inan et al., Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. ICLR 2017.
• Takase et al., Input-to-Output Gate to Improve RNN Language Models. IJCNLP 2017.
48. References (2/5)
• Zoph et al., Neural Architecture Search with Reinforcement Learning. ICLR 2017.
• Lei et al., Simple Recurrent Units for Highly Parallelizable Recurrence. EMNLP 2018.
• Melis et al., On the State of the Art of Evaluation in Neural Language Models. ICLR 2018.
• Merity et al., Regularizing and Optimizing LSTM Language Models. ICLR 2018.
• Yang et al., Breaking the Softmax Bottleneck: A High-Rank RNN Language Model. ICLR 2018.
• Takase et al., Direct Output Connection for a High-Rank Language Model. EMNLP 2018.
49. References (3/5)
• Wan et al., Regularization of Neural Networks using DropConnect. ICML 2013.
• Press et al., Using the Output Embedding to Improve Language Models. EACL 2017.
• Polyak et al., Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization 1992.
• Vaswani et al., Attention Is All You Need. NIPS 2017.
• Popel et al., Training Tips for the Transformer Model. PBML 2018.
50. References (4/5)
• Zołna et al., Fraternal Dropout. ICLR 2018.
• Gong et al., FRAGE: Frequency-Agnostic Word Representation. NIPS 2018.
• Liu et al., Deep Residual Output Layers for Neural Language Generation. ICML 2019.
• Kanai et al., Sigsoftmax: Reanalysis of the Softmax Bottleneck. NIPS 2018.
• Krause et al., Dynamic Evaluation of Neural Sequence Models. 2017.
• Kuhn et al., A Cache-Based Natural Language Model for Speech Recognition. PAMI 1990.
• Grave et al., Improving Neural Language Models with a Continuous Cache. ICLR 2017.
51. References (5/5)
• Radford et al., Language Models are Unsupervised Multitask Learners. 2019.
• Mikolov et al., Linguistic Regularities in Continuous Space Word Representations. NAACL 2013.
• Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality. NIPS 2013.
• Peters et al., Semi-supervised Sequence Tagging with Bidirectional Language Models. ACL 2017.
• Peters et al., Deep Contextualized Word Representations. NAACL 2018.
• Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.