
RoPE(Rotary Position Embeddings)

LiberalArts
August 29, 2024


This deck publishes the section on RoPE (Rotary Position Embeddings) from the book listed below.

・関連研究から読み解くGemma 2の仕組み (Understanding How Gemma 2 Works Through Related Research)
https://lib-arts.booth.pm/items/6036560

RoPE is the position embedding used in many LLMs, so it is worth having at least a rough understanding of it.


Transcript

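  1. Chapter 2: Main techniques adopted in Gemma 2

    Chapter 2 summarizes the main techniques adopted in Gemma 2. Section 2.1 covers RoPE, which handles positional information; Section 2.2 covers GEGLU, used in the FFN layers; Section 2.3 covers Sparse Attention; Section 2.4 covers Logit soft-capping; Section 2.5 covers Grouped-Query Attention (GQA), adopted to speed up inference; and Section 2.6 covers distillation, used to make training more efficient.

    2.1 RoPE

    RoPE (Rotary Position Embeddings) is a technique that gives a Transformer positional information (position embeddings) by using the principle of rotation matrices.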
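  2. 2.1.1 Position Embeddings in the Transformer paper and their interpretation

    The Transformer paper uses position embeddings based on trigonometric functions, as in Equation 2.1.

    Equation 2.1: Position Embeddings (Transformer paper)

    \[
    p_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \qquad
    p_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)
    \]

    In Equation 2.1, pos is the index of the token, while 2i and 2i + 1 are indices into the dimensions of each token's distributed representation (hidden layer).

    ▲ Program 2.1: Visualizing the Position Embeddings of the Transformer paper

        import numpy as np
        import matplotlib.pyplot as plt

        # x: embedding-dimension index (plays the role of 2i, with d = 500)
        # y: token position pos
        x_, y_ = np.arange(0, 501, 1), np.arange(0, 501, 1)
        x, y = np.meshgrid(x_, y_)
        z = np.sin(y / 10000**(x / 500))  # sine branch of Equation 2.1

        plt.pcolormesh(x, y, z, cmap="viridis")
        plt.colorbar()
        plt.show()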
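  3. ▲ Figure 2.1: Output and its interpretation (visualization of the Position Embeddings)

    Equation 2.1 can be visualized as in Figure 2.1 (*1). In addition, when Equation 2.1 is collected into a vector p_{pos} for each pos, p_{pos} can be written as in Equation 2.2.

    *1 Figure 2.1 states the wavelength as ranging from 2π to 10000 · 2π, but note that this is not strictly correct, because i takes only one of the two values 0 and d/2, not both. In the rest of this book, the computation over i is defined so that the wavelength is at least 2π and less than 10000 · 2π.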
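The wavelength range in the footnote can be made concrete with a small sketch. It assumes that i runs over 0, 1, ..., d/2 - 1, which is what the footnote's convention implies; the function name `wavelengths` and the choice d = 512 are illustrative and do not come from the book.

```python
import math

def wavelengths(d, base=10000.0):
    """Wavelength in pos of the sin/cos pair for each i = 0, ..., d/2 - 1."""
    # sin(pos / base**(2i/d)) repeats whenever pos grows by 2*pi * base**(2i/d).
    return [2 * math.pi * base ** (2 * i / d) for i in range(d // 2)]

w = wavelengths(d=512)  # d = 512 is an arbitrary example size
print("shortest wavelength / 2*pi:", w[0] / (2 * math.pi))
print("longest wavelength / 2*pi:", w[-1] / (2 * math.pi))
```

Under this assumption the shortest wavelength is exactly 2π (at i = 0), and the longest stays strictly below 10000 · 2π because the exponent 2i/d never reaches 1.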
  4. Equation 2.2: Vector representation of the Position Embeddings

    \[
    p_{pos} =
    \begin{pmatrix}
    p_{(pos,0)} \\ p_{(pos,1)} \\ \vdots \\ p_{(pos,d-2)} \\ p_{(pos,d-1)}
    \end{pmatrix}
    =
    \begin{pmatrix}
    p_{(pos,0)} \\ p_{(pos,1)} \\ \vdots \\ p_{(pos,2(k-1))} \\ p_{(pos,2(k-1)+1)}
    \end{pmatrix},
    \qquad k = \frac{d}{2}
    \]

    With p_{pos} defined as in Equation 2.2, the inner product of p_{pos_1} and p_{pos_2} can be computed as in Equation 2.3.
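The result that Equation 2.3 arrives at, namely that the inner product depends only on pos_1 - pos_2, can also be checked numerically. The following is a minimal sketch rather than code from the book; the helper names (`position_embedding`, `dot`, `cosine_sum`) and the values d = 64, pos_1 = 7, pos_2 = 19 are assumptions chosen for illustration.

```python
import math

def position_embedding(pos, d, base=10000.0):
    """Vector p_pos of Equation 2.2: sin at even indices, cos at odd indices."""
    p = []
    for i in range(d // 2):
        angle = pos / base ** (2 * i / d)
        p.extend([math.sin(angle), math.cos(angle)])
    return p

def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def cosine_sum(pos1, pos2, d, base=10000.0):
    """Final line of Equation 2.3: sum over i of cos((pos1 - pos2) / base**(2i/d))."""
    return sum(math.cos((pos1 - pos2) / base ** (2 * i / d)) for i in range(d // 2))

d = 64
lhs = dot(position_embedding(7, d), position_embedding(19, d))
rhs = cosine_sum(7, 19, d)
shifted = dot(position_embedding(107, d), position_embedding(119, d))  # same offset
same_pos = dot(position_embedding(7, d), position_embedding(7, d))     # pos1 == pos2

print(lhs, rhs, shifted, same_pos)
```

Here `lhs`, `rhs`, and `shifted` should agree up to floating-point error, and `same_pos` should equal k = d/2 = 32, the maximum value of the sum of cosines.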
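  5. Equation 2.3: Computing the inner product of two Position Embeddings

    \[
    \begin{aligned}
    p_{pos_1} \cdot p_{pos_2}
    &= \begin{pmatrix}
    p_{(pos_1,0)} \\ p_{(pos_1,1)} \\ \vdots \\ p_{(pos_1,2(k-1))} \\ p_{(pos_1,2(k-1)+1)}
    \end{pmatrix}
    \cdot
    \begin{pmatrix}
    p_{(pos_2,0)} \\ p_{(pos_2,1)} \\ \vdots \\ p_{(pos_2,2(k-1))} \\ p_{(pos_2,2(k-1)+1)}
    \end{pmatrix} \\
    &= \sum_{i=0}^{2(k-1)+1} p_{(pos_1,i)} \, p_{(pos_2,i)} \\
    &= \sum_{i=0}^{k-1} \left( p_{(pos_1,2i)} \, p_{(pos_2,2i)} + p_{(pos_1,2i+1)} \, p_{(pos_2,2i+1)} \right) \\
    &= \sum_{i=0}^{k-1} \left[ \sin\left(\frac{pos_1}{10000^{2i/d}}\right) \sin\left(\frac{pos_2}{10000^{2i/d}}\right) + \cos\left(\frac{pos_1}{10000^{2i/d}}\right) \cos\left(\frac{pos_2}{10000^{2i/d}}\right) \right] \\
    &= \sum_{i=0}^{k-1} \cos\left( \frac{pos_1}{10000^{2i/d}} - \frac{pos_2}{10000^{2i/d}} \right) \\
    &= \sum_{i=0}^{k-1} \cos\left( \frac{pos_1 - pos_2}{10000^{2i/d}} \right)
    \end{aligned}
    \]

    The derivation in Equation 2.3 shows that the inner product of the Transformer paper's Position Embeddings is a sum of k = d/2 cosine functions, and that its maximum value is k, attained when pos_1 = pos_2.

    2.1.2 Principle of RoPE

    RoPE is defined as in Equation 2.4.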
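  6. Equation 2.4: RoPE ①

    \[
    f_{\{q,k\}}(x_m, m) = R^d_{\Theta,m} W_{\{q,k\}} x_m
    \]

    \[
    R^d_{\Theta,m} =
    \begin{pmatrix}
    \cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\
    \sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\
    0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\
    0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots & 0 & 0 \\
    \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
    0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\
    0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2}
    \end{pmatrix}
    \]

    \[
    \Theta = \left\{ \theta_i = 10000^{-2(i-1)/d}, \; i \in [1, 2, \cdots, d/2] \right\}
    \]

    The matrix R^d_{\Theta,m} in Equation 2.4 can also be interpreted as a block matrix with k = d/2 rotation matrices arranged along the diagonal. Rotation matrices are also covered in the Appendix, so please refer to it as well.

    With RoPE defined as in Equation 2.4, the inner product of the Query at position m and the Key at position n is given by Equation 2.5.

    Equation 2.5: RoPE ②

    \[
    \begin{aligned}
    q_m^T k_n
    &= \left( R^d_{\Theta,m} W_q x_m \right)^T \left( R^d_{\Theta,n} W_k x_n \right) \\
    &= x_m^T W_q^T \left( R^d_{\Theta,m} \right)^T R^d_{\Theta,n} W_k x_n \\
    &= x_m^T W_q^T R^d_{\Theta,n-m} W_k x_n
    \end{aligned}
    \]

    R^d_{\Theta} in Equation 2.5 is an orthogonal matrix, so positional information can be encoded stably throughout training. It is also worth noting that many position embeddings, such as the one in the Transformer paper, are additive, whereas RoPE is multiplicative.

    Furthermore, because the product of (R^d_{\Theta,m})^T and R^d_{\Theta,n} yields R^d_{\Theta,n-m}, RoPE expresses positional information as a function of the relative positions of tokens (relative position embedding). The transformation in Equation 2.6 confirms that (R^d_{\Theta,m})^T R^d_{\Theta,n} = R^d_{\Theta,n-m} holds for a 2D rotation matrix.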
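The relative-position property in Equation 2.5 can be illustrated with a short sketch that applies the block-diagonal rotation pair by pair instead of building the full matrix. This is an assumption-laden illustration, not the book's or any library's implementation: the function name `rope`, the 8-dimensional vectors `q` and `k` (standing in for W_q x_m and W_k x_n), and the chosen positions are all hypothetical.

```python
import math

def rope(x, m, base=10000.0):
    """Apply R^d_{Theta,m} of Equation 2.4 to a vector x of even length d."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)  # theta_{i+1} in the 1-based notation of Equation 2.4
        c, s = math.cos(m * theta), math.sin(m * theta)
        a, b = x[2 * i], x[2 * i + 1]
        out.extend([c * a - s * b, s * a + c * b])  # rotate the pair (a, b) by m * theta
    return out

def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

# Stand-ins for W_q x_m and W_k x_n (arbitrary example values).
q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]
k = [-0.8, 0.2, 0.6, -1.0, 0.4, 0.3, -0.5, 1.3]

score_a = dot(rope(q, 3), rope(k, 10))   # m = 3,  n = 10
score_b = dot(rope(q, 45), rope(k, 52))  # m = 45, n = 52, same n - m = 7
score_c = dot(q, rope(k, 7))             # q^T R_{n-m} k with n - m = 7

print(score_a, score_b, score_c)
```

All three scores should agree up to floating-point error, since only the offset n - m enters the inner product. Rotating pairs directly avoids materializing the mostly-zero d × d matrix, which is a design choice of this sketch only.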
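  7. Equation 2.6: RoPE ③

    \[
    \begin{aligned}
    \left( R^d_{\Theta,m} \right)^T R^d_{\Theta,n}
    &= \begin{pmatrix} \cos m\theta_1 & -\sin m\theta_1 \\ \sin m\theta_1 & \cos m\theta_1 \end{pmatrix}^T
       \begin{pmatrix} \cos n\theta_1 & -\sin n\theta_1 \\ \sin n\theta_1 & \cos n\theta_1 \end{pmatrix} \\
    &= \begin{pmatrix} \cos m\theta_1 & \sin m\theta_1 \\ -\sin m\theta_1 & \cos m\theta_1 \end{pmatrix}
       \begin{pmatrix} \cos n\theta_1 & -\sin n\theta_1 \\ \sin n\theta_1 & \cos n\theta_1 \end{pmatrix} \\
    &= \begin{pmatrix}
       \cos m\theta_1 \cos n\theta_1 + \sin m\theta_1 \sin n\theta_1 & -\cos m\theta_1 \sin n\theta_1 + \sin m\theta_1 \cos n\theta_1 \\
       -\sin m\theta_1 \cos n\theta_1 + \cos m\theta_1 \sin n\theta_1 & \sin m\theta_1 \sin n\theta_1 + \cos m\theta_1 \cos n\theta_1
       \end{pmatrix} \\
    &= \begin{pmatrix}
       \cos n\theta_1 \cos m\theta_1 + \sin n\theta_1 \sin m\theta_1 & -(\sin n\theta_1 \cos m\theta_1 - \cos n\theta_1 \sin m\theta_1) \\
       \sin n\theta_1 \cos m\theta_1 - \cos n\theta_1 \sin m\theta_1 & \cos n\theta_1 \cos m\theta_1 + \sin n\theta_1 \sin m\theta_1
       \end{pmatrix} \\
    &= \begin{pmatrix} \cos (n-m)\theta_1 & -\sin (n-m)\theta_1 \\ \sin (n-m)\theta_1 & \cos (n-m)\theta_1 \end{pmatrix} \\
    &= R^d_{\Theta,n-m}
    \end{aligned}
    \]

    2.1.3 Diagram of RoPE

    RoPE can be diagrammed as in Figure 2.2.

    ▲ Figure 2.2: Diagram of RoPE (Figure 1 of the RoPE paper)

    To understand Figure 2.2, it is enough to interpret it as "taking the elements of each token's distributed representation two at a time and rotating each pair in the manner of a rotation matrix."

  8. 2.2 GEGLU

    Gemma 2 uses GEGLU (Gaussian Error Gated Linear Units) as the activation function in the Transformer's FFN. GEGLU can be understood as a combination of GLU and GELU. Below, GELU, GLU, and GEGLU are reviewed in that order.

    2.2.1 GELU

    The formula for GELU (Gaussian Error Linear Unit) uses the cumulative distribution function of the standard normal distribution N(0, 1). Writing the probability density function of the standard normal distribution as φ(x) and its cumulative distribution function as Φ(x), the two are given as follows.

    Equation 2.7:

    \[
    \phi(x) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right), \qquad
    \Phi(x) = \int_{-\infty}^{x} \phi(t) \, dt
    \]

    Based on the cumulative distribution function Φ(x) of the standard normal distribution in Equation 2.7, GELU(x) is defined as in Equation 2.8.

    Equation 2.8:

    \[
    \mathrm{GELU}(x) = x \, \Phi(x)
    \]

    The graphs of ReLU and GELU can be produced by running Program 2.2, so it is worth reviewing it as well.

    ▲ Program 2.2: Plotting the graphs of ReLU and GELU

        x = np.arange(-2.5, 2.51, 0.01)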
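As a separate illustration of Equations 2.7 and 2.8 (and not a reconstruction of Program 2.2, whose remaining lines are not part of this transcript), GELU can be evaluated with the standard identity Φ(x) = (1 + erf(x / √2)) / 2. The function names and sample points below are assumptions made for this sketch.

```python
import math

def std_normal_cdf(x):
    """Phi(x) of Equation 2.7, via Phi(x) = (1 + erf(x / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x):
    """GELU(x) = x * Phi(x), as in Equation 2.8."""
    return x * std_normal_cdf(x)

def relu(x):
    """ReLU(x) = max(0, x), shown for comparison."""
    return max(0.0, x)

# Tabulate both activations at a few points in the range used by Program 2.2.
for x in (-2.5, -1.0, 0.0, 1.0, 2.5):
    print(f"x={x:+.1f}  ReLU={relu(x):.4f}  GELU={gelu(x):.4f}")
```

The table should show GELU tracking ReLU closely for large positive x while taking small negative values for negative x, which is the qualitative difference a plot of the two curves makes visible.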