
RoPE(Rotary Position Embeddings)

LiberalArts
August 29, 2024

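This deck publishes the description of RoPE (Rotary Position Embeddings) from the book below.

・関連研究から読み解くGemma 2の仕組み (Understanding How Gemma 2 Works Through Related Research)
https://lib-arts.booth.pm/items/6036560

RoPE is the position embedding used in many LLMs, so it is worth getting at least a rough grasp of it.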


Transcript

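  1. Chapter 2: Main Techniques Adopted in Gemma 2

    Chapter 2 summarizes the main techniques adopted in Gemma 2. Section 2.1 covers RoPE, which handles positional information; Section 2.2 covers GEGLU, which is used in the FFN layer; Section 2.3 covers Sparse Attention; Section 2.4 covers Logit soft-capping; Section 2.5 covers Grouped-Query Attention (GQA), which is adopted to speed up inference; and Section 2.6 covers distillation, which is used to make training more efficient.

    2.1 RoPE

    RoPE (Rotary Position Embeddings) is a technique that uses the principle of rotation matrices to give positional information (Position Embeddings) to a Transformer.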
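  2. Chapter 2: Main Techniques Adopted in Gemma 2

    2.1.1 Position Embeddings in the Transformer Paper and Their Interpretation

    The Transformer paper uses Position Embeddings based on trigonometric functions, as in Eq. 2.1.

    Eq. 2.1: Position Embeddings (Transformer paper)

    $$
    p_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \qquad
    p_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)
    $$

    In Eq. 2.1, pos is the index of the token, and 2i and 2i + 1 are indices into that token's distributed representation (hidden layer).

    ▲ Program 2.1: Visualizing the Position Embeddings of the Transformer paper

        import numpy as np
        import matplotlib.pyplot as plt

        # x: hidden-dimension index (0..500), y: token position pos (0..500)
        x_, y_ = np.arange(0, 501, 1), np.arange(0, 501, 1)
        x, y = np.meshgrid(x_, y_)
        # sin(pos / 10000^(2i/d)), with x/500 standing in for 2i/d
        z = np.sin(y / 10000 ** (x / 500))

        plt.pcolormesh(x, y, z, cmap="viridis")
        plt.colorbar()
        plt.show()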
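As a complement to the visualization, Eq. 2.1 can also be written as a small function that returns the embedding matrix itself. This is a minimal sketch, not code from the book: the function name `sinusoidal_position_embeddings` and the sizes used in the call are assumptions chosen for illustration, and d is assumed to be even.

```python
import numpy as np


def sinusoidal_position_embeddings(num_positions: int, d: int) -> np.ndarray:
    """Return a (num_positions, d) array following Eq. 2.1 (d assumed even)."""
    pos = np.arange(num_positions)[:, None]   # token index, shape (num_positions, 1)
    two_i = np.arange(0, d, 2)[None, :]       # even hidden indices 2i, shape (1, d/2)
    angle = pos / 10000 ** (two_i / d)        # pos / 10000^(2i/d)
    pe = np.empty((num_positions, d))
    pe[:, 0::2] = np.sin(angle)               # p(pos, 2i)
    pe[:, 1::2] = np.cos(angle)               # p(pos, 2i+1)
    return pe


# Hypothetical sizes, chosen only to keep the example small.
pe = sinusoidal_position_embeddings(num_positions=8, d=16)
print(pe.shape)
```

At pos = 0 every sine entry is 0 and every cosine entry is 1, which gives a quick sanity check on the interleaving of even and odd indices.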
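  3. 2.1 RoPE

    ▲ Figure 2.1: Execution result and its interpretation (visualization of Position Embeddings)

    Eq. 2.1 can be visualized as in Figure 2.1*1. In addition, when Eq. 2.1 is collected into a vector p_pos for each pos, p_pos can be written as in Eq. 2.2.

    *1 Figure 2.1 describes the wavelengths as ranging from 2π to 10000 · 2π, but note that this is not strictly correct, because i takes only one of the two values 0 and d/2, not both. In the rest of this book, the range of i is defined so that the wavelength is at least 2π and less than 10000 · 2π.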
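The wavelength range in the footnote can be checked numerically. The sketch below assumes the convention i = 0, 1, ..., d/2 - 1 and a hypothetical dimension d = 512; neither value is taken from the book. Since sin(pos / 10000^(2i/d)) has period 2π · 10000^(2i/d) in pos, the shortest wavelength is exactly 2π and the longest stays below 10000 · 2π.

```python
import numpy as np

d = 512                                   # hypothetical embedding dimension
i = np.arange(d // 2)                     # i = 0, 1, ..., d/2 - 1 (assumed convention)
wavelength = 2 * np.pi * 10000 ** (2 * i / d)

print(wavelength.min() / (2 * np.pi))     # shortest wavelength in units of 2*pi
print(wavelength.max() / (2 * np.pi))     # longest wavelength, strictly below 10000
```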
  4. Chapter 2: Main Techniques Adopted in Gemma 2

    Eq. 2.2: Vector representation of Position Embeddings

    $$
    \mathbf{p}_{pos} =
    \begin{pmatrix}
    p_{(pos,0)} \\ p_{(pos,1)} \\ \vdots \\ p_{(pos,d-2)} \\ p_{(pos,d-1)}
    \end{pmatrix}
    =
    \begin{pmatrix}
    p_{(pos,0)} \\ p_{(pos,1)} \\ \vdots \\ p_{(pos,2(k-1))} \\ p_{(pos,2(k-1)+1)}
    \end{pmatrix},
    \qquad k = \frac{d}{2}
    $$

    With p_pos defined as in Eq. 2.2, the inner product of p_{pos_1} and p_{pos_2} can be computed as in Eq. 2.3.
  5. 2.1 RoPE

    Eq. 2.3: Inner product of two Position Embeddings

    $$
    \begin{aligned}
    \mathbf{p}_{pos_1} \cdot \mathbf{p}_{pos_2}
    &= \sum_{i=0}^{2(k-1)+1} p_{(pos_1,i)}\, p_{(pos_2,i)} \\
    &= \sum_{i=0}^{k-1} \left( p_{(pos_1,2i)}\, p_{(pos_2,2i)} + p_{(pos_1,2i+1)}\, p_{(pos_2,2i+1)} \right) \\
    &= \sum_{i=0}^{k-1} \left( \sin\frac{pos_1}{10000^{2i/d}} \sin\frac{pos_2}{10000^{2i/d}} + \cos\frac{pos_1}{10000^{2i/d}} \cos\frac{pos_2}{10000^{2i/d}} \right) \\
    &= \sum_{i=0}^{k-1} \cos\left( \frac{pos_1}{10000^{2i/d}} - \frac{pos_2}{10000^{2i/d}} \right) \\
    &= \sum_{i=0}^{k-1} \cos\left( \frac{pos_1 - pos_2}{10000^{2i/d}} \right)
    \end{aligned}
    $$

    The derivation in Eq. 2.3 shows that the inner product of the Transformer paper's Position Embeddings is a sum of k = d/2 cosine functions, and that it attains its maximum value k when pos_1 = pos_2.

    2.1.2 The Principle of RoPE

    RoPE is defined as in Eq. 2.4.
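The identity in Eq. 2.3 is easy to confirm numerically. The following is a minimal sketch under assumed values (d = 64, pos_1 = 17, pos_2 = 5); the helper `position_vector` is a hypothetical name that builds p_pos from Eq. 2.1 and Eq. 2.2.

```python
import numpy as np


def position_vector(pos: int, d: int) -> np.ndarray:
    """Build p_pos of Eq. 2.2 from the sin/cos entries of Eq. 2.1 (d assumed even)."""
    two_i = np.arange(0, d, 2)
    angle = pos / 10000 ** (two_i / d)
    p = np.empty(d)
    p[0::2] = np.sin(angle)   # p(pos, 2i)
    p[1::2] = np.cos(angle)   # p(pos, 2i+1)
    return p


d = 64                        # hypothetical dimension
k = d // 2
pos1, pos2 = 17, 5            # hypothetical token positions

# Left-hand side of Eq. 2.3: the plain inner product.
lhs = position_vector(pos1, d) @ position_vector(pos2, d)
# Right-hand side of Eq. 2.3: a sum of k cosines of the position difference.
rhs = np.sum(np.cos((pos1 - pos2) / 10000 ** (np.arange(0, d, 2) / d)))
print(np.isclose(lhs, rhs))

# When pos1 == pos2, every cosine equals 1, so the inner product equals k.
same = position_vector(pos1, d) @ position_vector(pos1, d)
print(np.isclose(same, k))
```

Because the result depends only on pos_1 - pos_2, the additive sinusoidal embeddings already carry relative-position information in their inner product, which is the property RoPE builds on.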
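  6. Chapter 2: Main Techniques Adopted in Gemma 2

    Eq. 2.4: RoPE (1)

    $$
    f_{\{q,k\}}(x_m, m) = R^d_{\Theta,m} W_{\{q,k\}} x_m
    $$

    $$
    R^d_{\Theta,m} =
    \begin{pmatrix}
    \cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\
    \sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\
    0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\
    0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots & 0 & 0 \\
    \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
    0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\
    0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2}
    \end{pmatrix}
    $$

    $$
    \Theta = \left\{ \theta_i = 10000^{-2(i-1)/d},\ i \in [1, 2, \cdots, d/2] \right\}
    $$

    The matrix in Eq. 2.4 can also be read as a block matrix with k = d/2 rotation matrices placed along the diagonal. Rotation matrices are also covered in the Appendix, so please refer to it as well.

    With RoPE defined as in Eq. 2.4, the inner product of the Query at position m and the Key at position n is given by Eq. 2.5.

    Eq. 2.5: RoPE (2)

    $$
    \begin{aligned}
    q_m^{T} k_n
    &= \left( R^d_{\Theta,m} W_q x_m \right)^{T} \left( R^d_{\Theta,n} W_k x_n \right) \\
    &= x_m^{T} W_q^{T} \left( R^d_{\Theta,m} \right)^{T} R^d_{\Theta,n} W_k x_n \\
    &= x_m^{T} W_q^{T} R^d_{\Theta,n-m} W_k x_n
    \end{aligned}
    $$

    R^d_Θ in Eq. 2.5 is an orthogonal matrix, so positional information can be encoded stably throughout training. It is also worth noting that many Position Embeddings, such as those in the Transformer paper, are additive, whereas RoPE is multiplicative.

    Furthermore, because the product of (R^d_{Θ,m})^T and R^d_{Θ,n} yields R^d_{Θ,n-m}, RoPE expresses positional information as a function of the relative position between tokens (relative position embedding). The derivation in Eq. 2.6 confirms that (R^d_{Θ,m})^T R^d_{Θ,n} = R^d_{Θ,n-m} holds for a 2D rotation matrix.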
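The relative-position property in Eq. 2.5 can be illustrated with a small numerical sketch. Everything below is an assumption made for illustration: the function name `rope_matrix`, the dimension d = 8, the random stand-ins for W_q, W_k, x_m and x_n, and the positions. The dense block-diagonal matrix is built explicitly only for clarity; it is not meant to represent how RoPE is implemented in practice.

```python
import numpy as np


def rope_matrix(m: int, d: int) -> np.ndarray:
    """Dense block-diagonal rotation matrix R^d_{Theta,m} of Eq. 2.4 (d assumed even)."""
    i = np.arange(1, d // 2 + 1)
    theta = 10000.0 ** (-2 * (i - 1) / d)        # theta_i = 10000^(-2(i-1)/d)
    R = np.zeros((d, d))
    for j, t in enumerate(theta):
        c, s = np.cos(m * t), np.sin(m * t)
        R[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, -s], [s, c]]
    return R


rng = np.random.default_rng(0)
d = 8                                            # hypothetical dimension
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # random stand-ins
x_m, x_n = rng.normal(size=d), rng.normal(size=d)            # random stand-ins
m, n = 3, 10                                     # hypothetical positions

# First line of Eq. 2.5: rotate the query and the key, then take the inner product.
q = rope_matrix(m, d) @ W_q @ x_m
k = rope_matrix(n, d) @ W_k @ x_n
score = q @ k

# Last line of Eq. 2.5: a single rotation by the relative offset n - m.
relative = x_m @ W_q.T @ rope_matrix(n - m, d) @ W_k @ x_n
print(np.isclose(score, relative))

# Shifting both positions by the same amount leaves the score unchanged.
shift = 100
q2 = rope_matrix(m + shift, d) @ W_q @ x_m
k2 = rope_matrix(n + shift, d) @ W_k @ x_n
print(np.isclose(q2 @ k2, score))

# R^d_{Theta,m} is orthogonal: R^T R = I.
R = rope_matrix(m, d)
print(np.allclose(R.T @ R, np.eye(d)))
```

Because each 2x2 block acts on only one pair of coordinates, the same result could be obtained by rotating the pairs elementwise instead of forming the full d x d matrix.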
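  7. 2.1 RoPE

    Eq. 2.6: RoPE (3)

    $$
    \begin{aligned}
    \left( R^d_{\Theta,m} \right)^{T} R^d_{\Theta,n}
    &= \begin{pmatrix} \cos m\theta_1 & -\sin m\theta_1 \\ \sin m\theta_1 & \cos m\theta_1 \end{pmatrix}^{T}
       \begin{pmatrix} \cos n\theta_1 & -\sin n\theta_1 \\ \sin n\theta_1 & \cos n\theta_1 \end{pmatrix} \\
    &= \begin{pmatrix} \cos m\theta_1 & \sin m\theta_1 \\ -\sin m\theta_1 & \cos m\theta_1 \end{pmatrix}
       \begin{pmatrix} \cos n\theta_1 & -\sin n\theta_1 \\ \sin n\theta_1 & \cos n\theta_1 \end{pmatrix} \\
    &= \begin{pmatrix}
       \cos m\theta_1 \cos n\theta_1 + \sin m\theta_1 \sin n\theta_1 & -\cos m\theta_1 \sin n\theta_1 + \sin m\theta_1 \cos n\theta_1 \\
       -\sin m\theta_1 \cos n\theta_1 + \cos m\theta_1 \sin n\theta_1 & \sin m\theta_1 \sin n\theta_1 + \cos m\theta_1 \cos n\theta_1
       \end{pmatrix} \\
    &= \begin{pmatrix}
       \cos n\theta_1 \cos m\theta_1 + \sin n\theta_1 \sin m\theta_1 & -(\sin n\theta_1 \cos m\theta_1 - \cos n\theta_1 \sin m\theta_1) \\
       \sin n\theta_1 \cos m\theta_1 - \cos n\theta_1 \sin m\theta_1 & \cos n\theta_1 \cos m\theta_1 + \sin n\theta_1 \sin m\theta_1
       \end{pmatrix} \\
    &= \begin{pmatrix} \cos (n-m)\theta_1 & -\sin (n-m)\theta_1 \\ \sin (n-m)\theta_1 & \cos (n-m)\theta_1 \end{pmatrix}
     = R^d_{\Theta,n-m}
    \end{aligned}
    $$

    2.1.3 A Schematic View of RoPE

    RoPE can be drawn schematically as in Figure 2.2.

    ▲ Figure 2.2: Schematic of RoPE (Figure 1 of the RoPE paper)

    To understand Figure 2.2, it is enough to read it as "take the elements of each token's distributed representation two at a time and rotate each pair in the way a rotation matrix rotates a vector."

  8. Chapter 2: Main Techniques Adopted in Gemma 2

    2.2 GEGLU

    Gemma 2 uses GEGLU (Gaussian Error Gated Linear Units) as the activation function in the Transformer's FFN. GEGLU can be understood as a combination of GLU and GELU. Below, GELU, GLU, and GEGLU are reviewed in that order.

    2.2.1 GELU

    The formula for GELU (Gaussian Error Linear Unit) uses the cumulative distribution function of the standard normal distribution N(0, 1). Writing the probability density function of the standard normal distribution as φ(x) and its cumulative distribution function as Φ(x), the two are given as follows.

    Eq. 2.7:

    $$
    \phi(x) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right), \qquad
    \Phi(x) = \int_{-\infty}^{x} \phi(t)\, dt
    $$

    Based on the cumulative distribution function Φ(x) of Eq. 2.7, GELU(x) is defined as in Eq. 2.8.

    Eq. 2.8:

    $$
    \mathrm{GELU}(x) = x\, \Phi(x)
    $$

    The graphs of ReLU and GELU can be produced by running Program 2.2, so it is worth looking at them together.

    ▲ Program 2.2: Plotting the graphs of ReLU and GELU

        x = np.arange(-2.5, 2.51, 0.01)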