of our deep SRL model: (1) applying recent advances in training deep recurrent neural networks such as highway connections (Srivastava et al., 2015) and RNN-dropouts (Gal and Ghahramani, 2016),2 and (2) using an A* decoding algorithm (Lewis and Steedman, 2014; Lee et al., 2016) to enforce structural consistency at prediction time without adding more complexity to the training process.

Formally, our task is to predict a sequence y given a sentence-predicate pair (w, v) as input. Each y_i \in y belongs to a discrete set of BIO tags \mathcal{T}. Words outside argument spans have the tag O, and words at the beginning and inside of argument spans with role r have the tags B_r and I_r respectively. Let n = |w| = |y| be the length of the sequence. Predicting an SRL structure under our model involves finding the highest-scoring tag sequence over the space of all possibilities \mathcal{Y}:

\hat{y} = \operatorname*{argmax}_{y \in \mathcal{Y}} f(w, y)    (1)

We use a deep bidirectional LSTM (BiLSTM) to learn a locally decomposed scoring function conditioned on the input: \sum_{t=1}^{n} \log p(y_t \mid w).

To incorporate additional information (e.g., structural consistency, syntactic input), we augment the scoring function with penalization terms:

f(w, y) = \sum_{t=1}^{n} \log p(y_t \mid w) - \sum_{c \in \mathcal{C}} c(w, y_{1:t})    (2)

Each constraint function c applies a non-negative penalty given the input w and a length-t prefix y_{1:t}. These constraints can be hard or soft depending on whether the penalties are finite.

2.1 Deep BiLSTM Model

Our model computes the distribution over tags using stacked BiLSTMs, which we define as follows:

i_{l,t} = \sigma(W_i^l [h_{l,t+\delta_l}, x_{l,t}] + b_i^l)    (3)
o_{l,t} = \sigma(W_o^l [h_{l,t+\delta_l}, x_{l,t}] + b_o^l)    (4)
f_{l,t} = \sigma(W_f^l [h_{l,t+\delta_l}, x_{l,t}] + b_f^l + 1)    (5)
\tilde{c}_{l,t} = \tanh(W_c^l [h_{l,t+\delta_l}, x_{l,t}] + b_c^l)    (6)
c_{l,t} = i_{l,t} \circ \tilde{c}_{l,t} + f_{l,t} \circ c_{l,t+\delta_l}    (7)
h_{l,t} = o_{l,t} \circ \tanh(c_{l,t})    (8)

where x_{l,t} is the input to the LSTM at layer l and timestep t, and \delta_l is either 1 or -1, indicating the directionality of the LSTM at layer l.

2 We thank Mingxuan Wang for suggesting highway connections with simplified inputs and outputs. Part of our model is extended from his unpublished implementation.

Figure 1: Highway LSTM with four layers. The curved connections represent highway connections, and the plus symbols represent transform gates that control inter-layer information flow.

To stack the LSTMs in an interleaving pattern, as proposed by Zhou and Xu (2015), the layer-specific inputs x_{l,t} and directionality \delta_l are arranged in the following manner:

x_{l,t} = \begin{cases} [W_{emb}(w_t), W_{mask}(t = v)] & l = 1 \\ h_{l-1,t} & l > 1 \end{cases}    (9)

\delta_l = \begin{cases} 1 & \text{if } l \text{ is even} \\ -1 & \text{otherwise} \end{cases}    (10)

The input vector x_{1,t} is the concatenation of token w_t's word embedding and an embedding of the binary feature (t = v) indicating whether w_t is the given predicate.

Finally, the locally normalized distribution over output tags is computed via a softmax layer:

p(y_t \mid x) \propto \exp(W_{tag}^{y_t} h_{L,t} + b_{tag})    (11)

Highway Connections  To alleviate the vanishing gradient problem when training deep BiLSTMs, we use gated highway connections (Zhang et al., 2016; Srivastava et al., 2015). We include transform gates r_t to control the weight of linear and non-linear transformations between layers (see Figure 1). The output h_{l,t} is changed to:

r_{l,t} = \sigma(W_r^l [h_{l,t-1}, x_t] + b_r^l)    (12)
h'_{l,t} = o_{l,t} \circ \tanh(c_{l,t})    (13)
h_{l,t} = r_{l,t} \circ h'_{l,t} + (1 - r_{l,t}) \circ W_h^l x_{l,t}    (14)
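To make the recurrence concrete, the following is a minimal NumPy sketch of one timestep of a highway-LSTM layer following Equations 3-8 and 12-14. The class name, weight shapes, and random initialization are illustrative assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class HighwayLSTMCell:
    """One layer of the highway LSTM (Eqs. 3-8, 12-14); shapes are illustrative."""

    def __init__(self, input_dim, hidden_dim, rng=np.random.default_rng(0)):
        concat = hidden_dim + input_dim
        # Gate and cell parameters act on the concatenation [h_prev, x] (Eqs. 3-6, 12).
        self.W_i, self.b_i = rng.normal(scale=0.1, size=(hidden_dim, concat)), np.zeros(hidden_dim)
        self.W_o, self.b_o = rng.normal(scale=0.1, size=(hidden_dim, concat)), np.zeros(hidden_dim)
        self.W_f, self.b_f = rng.normal(scale=0.1, size=(hidden_dim, concat)), np.zeros(hidden_dim)
        self.W_c, self.b_c = rng.normal(scale=0.1, size=(hidden_dim, concat)), np.zeros(hidden_dim)
        self.W_r, self.b_r = rng.normal(scale=0.1, size=(hidden_dim, concat)), np.zeros(hidden_dim)
        # Linear projection of the layer input used by the highway bypass (Eq. 14).
        self.W_h = rng.normal(scale=0.1, size=(hidden_dim, input_dim))

    def step(self, x_t, h_prev, c_prev):
        z = np.concatenate([h_prev, x_t])
        i = sigmoid(self.W_i @ z + self.b_i)                # input gate, Eq. 3
        o = sigmoid(self.W_o @ z + self.b_o)                # output gate, Eq. 4
        f = sigmoid(self.W_f @ z + self.b_f + 1.0)          # forget gate with +1 bias, Eq. 5
        c_tilde = np.tanh(self.W_c @ z + self.b_c)          # candidate cell state, Eq. 6
        c = i * c_tilde + f * c_prev                        # new cell state, Eq. 7
        r = sigmoid(self.W_r @ z + self.b_r)                # transform gate, Eq. 12
        h_nonlinear = o * np.tanh(c)                        # LSTM output, Eqs. 8 / 13
        h = r * h_nonlinear + (1.0 - r) * (self.W_h @ x_t)  # highway mix, Eq. 14
        return h, c

# Example: run one forward (left-to-right) layer over a short sequence.
cell = HighwayLSTMCell(input_dim=8, hidden_dim=16)
xs = [np.random.default_rng(1).normal(size=8) for _ in range(4)]
h, c = np.zeros(16), np.zeros(16)
for x_t in xs:
    h, c = cell.step(x_t, h, c)
```

Stacking such layers in the interleaved pattern of Equations 9-10 means odd layers run left to right over the sentence, even layers run right to left, and each layer above the first consumes the hidden states of the layer below as its input x_{l,t}.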
Figure 1: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Among the three, only BERT representations are jointly conditioned on both left and right context in all layers.

• BERTBASE: L=12, H=768, A=12, Total Parameters=110M
• BERTLARGE: L=24, H=1024, A=16, Total Parameters=340M

BERTBASE was chosen to have an identical model size as OpenAI GPT for comparison purposes. Critically, however, the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to context to its left. We note that in the literature the bidirectional Transformer is often referred to as a "Transformer encoder" while the left-context-only version is referred to as a "Transformer decoder" since it can be used for text generation. The comparisons between BERT, OpenAI GPT and ELMo are shown visually in Figure 1.
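The contrast between bidirectional and left-context-only self-attention comes down to the mask added to the attention scores before the softmax. The sketch below builds both masks in NumPy; the function names and the additive 0 / -inf masking convention are assumptions made for illustration, not code from either model.

```python
import numpy as np

def bidirectional_mask(seq_len: int) -> np.ndarray:
    """Encoder-style mask: every token may attend to every other token."""
    return np.zeros((seq_len, seq_len))

def left_to_right_mask(seq_len: int) -> np.ndarray:
    """Decoder-style mask: token t may only attend to positions <= t."""
    mask = np.zeros((seq_len, seq_len))
    mask[np.triu_indices(seq_len, k=1)] = -np.inf  # block attention to future positions
    return mask

def masked_attention_weights(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Softmax over attention scores after adding the additive mask."""
    masked = scores + mask
    masked -= masked.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

# Example: with 4 tokens, the left-to-right mask zeroes out attention to future
# positions, while the bidirectional mask leaves all positions reachable.
scores = np.random.default_rng(0).normal(size=(4, 4))
print(masked_attention_weights(scores, left_to_right_mask(4)))
print(masked_attention_weights(scores, bidirectional_mask(4)))
```

Under the left-to-right mask, row t of the attention weights is nonzero only for positions up to t, which is what allows the decoder-style Transformer to be used for text generation.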
Figure 1: (left) Transformer architecture and training objectives used in this work. (right) Input transformations for fine-tuning on different tasks. We convert all structured inputs into token sequences to be processed by our pre-trained model, followed by a linear+softmax layer.

3.3 Task-specific input transformations

For some tasks, like text classification, we can directly fine-tune our model as described above. Certain other tasks, like question answering or textual entailment, have structured inputs such as ordered sentence pairs, or triplets of document, question, and answers. Since our pre-trained model was trained on contiguous sequences of text, we require some modifications to apply it to these tasks. Previous work proposed learning task-specific architectures on top of transferred representations [44]. Such an approach re-introduces a significant amount of task-specific customization and does not use transfer learning for these additional architectural components. Instead, we use a traversal-style approach [52], where we convert structured inputs into an ordered sequence that our pre-trained model can process. These input transformations allow us to avoid making extensive changes to the architecture across tasks. We provide a brief description of these input transformations below, and Figure 1 provides a visual illustration. All transformations include adding randomly initialized start and end tokens (⟨s⟩, ⟨e⟩).

Textual entailment  For entailment tasks, we concatenate the premise p and hypothesis h token sequences, with a delimiter token ($) in between.
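As a concrete example of the traversal-style conversion, the sketch below linearizes a premise-hypothesis pair into a single token sequence with start, delimiter, and end tokens. The literal strings "<s>", "$", and "<e>" and the helper name linearize_entailment are hypothetical stand-ins for the randomly initialized special tokens described above.

```python
from typing import List

START, DELIM, END = "<s>", "$", "<e>"  # placeholders for the learned special tokens

def linearize_entailment(premise_tokens: List[str], hypothesis_tokens: List[str]) -> List[str]:
    """Traversal-style input for entailment: premise, delimiter, hypothesis,
    wrapped in start/end tokens, so the pair is fed to the pre-trained model
    as one contiguous token sequence."""
    return [START] + premise_tokens + [DELIM] + hypothesis_tokens + [END]

# Example usage:
premise = "the cats love hats".split()
hypothesis = "cats like hats".split()
print(linearize_entailment(premise, hypothesis))
# ['<s>', 'the', 'cats', 'love', 'hats', '$', 'cats', 'like', 'hats', '<e>']
```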