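$\{b_j\}_{j=1}^{M}$ are fed into the fully connected layers $FC_{1,2}$
• $f = F_v(\mathrm{Concat}(FC_1(v_1), FC_2(b_1), \ldots, FC_1(v_M), FC_2(b_M))) \in \mathbb{R}^{(2M) \times d}$

Encoder-Decoder module

Context-aware Question Encoder
• The input question is replaced with $X$ = "context: {caption} + {tags}. question: {question}" and fed into the Transformer encoder $F_q$
• $q = F_q(X)$

Generative Decoder
• The concatenation of $\alpha_i$, $\beta_j$, $f$, and $q$ is fed into the Transformer decoder $F_y$
• $y = F_y(\mathrm{Concat}(\alpha_1, \ldots, \alpha_K, \beta_1, \ldots, \beta_L, f, q))$
• $\mathcal{L} = -\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t})$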
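
As a rough illustration of two steps on this slide, the sketch below fills the context-aware question template and evaluates the negative log-likelihood loss for a toy sequence. The function names `build_context_question` and `sequence_nll` are hypothetical, the comma-separated tag joining is an assumption (the slide does not say how tags are serialized), and the probabilities are made-up inputs rather than model outputs.

```python
import math
from typing import Sequence


def build_context_question(caption: str, tags: Sequence[str], question: str) -> str:
    """Fill the template: context: {caption} + {tags}. question: {question}."""
    # Assumption: tags are joined with ", "; the slide does not specify this.
    tag_text = ", ".join(tags)
    return f"context: {caption} + {tag_text}. question: {question}"


def sequence_nll(step_probs: Sequence[float]) -> float:
    """Negative log-likelihood: -sum_t log p(y_t | y_<t).

    step_probs[t] is the probability the decoder assigned to the
    ground-truth token at step t, given the preceding tokens.
    """
    if any(p <= 0.0 or p > 1.0 for p in step_probs):
        raise ValueError("each probability must lie in (0, 1]")
    return -sum(math.log(p) for p in step_probs)


# Toy inputs, chosen only for illustration.
x = build_context_question(
    "a man riding a wave", ["surfboard", "ocean"], "What sport is this?"
)
loss = sequence_nll([0.5, 0.25])  # equals log(8)
```

In a real system the string `x` would be tokenized and passed to the question encoder, and the per-step probabilities would come from the decoder's softmax over the vocabulary; both are outside the scope of this sketch.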