
[Paper Introduction] Intuitive Fine-Tuning

The 17th Cutting-Edge NLP Study Group (最先端NLP勉強会), 2025


Ryokan RI

August 30, 2025

Transcript

  1. Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process
     Ermo Hua, Biqing Qi, Kaiyan Zhang, Kai Tian, Xingtai Lv, Ning Ding, Bowen Zhou (ACL 2025)
     Cutting-Edge NLP Study Group 2025
     Presenter: Ryokan RI (李 凌寒)
  2. Overview
     - Systematically organizes SFT- and PO (Preference Optimization)-style methods for tuning LLMs
     - Proposes Intuitive Fine-Tuning (IFT), a new method that compensates for the weaknesses of SFT that this organization reveals
     🧭 To get the key points across in a short time, this talk starts from the method itself.
  3. Intuitive Preference Estimation
     Motivation
     - The problem with SFT is that it tries to reshape the LLM's distribution using only reference outputs
     - This can also be described as exposure bias caused by teacher forcing
     - To address this, we want to incorporate the LLM's own output distribution as well
  4. SFT w/ Intuitive Preference Estimation
     [Figure: input tokens s_{i-1}, s_i pass through Embeddings and a Transformer to predict s_{i+1}; ŝ_i denotes the model's previous-step prediction, and a loss is computed against s_i]
     🔧 The embedding of the token the LLM predicted at the previous step is linearly interpolated with the input embedding
     🤔 This shape looks familiar from somewhere...
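To make the interpolation concrete, below is a minimal PyTorch sketch of the idea as I read it from the slide. The function name `ift_mixed_embeddings`, the mixing coefficient `lam`, and the use of greedy (argmax) previous-step predictions are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

def ift_mixed_embeddings(embedding: nn.Embedding, input_ids: torch.Tensor,
                         prev_logits: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """Linearly interpolate each ground-truth input embedding with the
    embedding of the token the model predicted at the previous step.
    (Sketch of IFT's Intuitive Preference Estimation; lam and argmax
    decoding of the previous-step prediction are assumptions.)"""
    gold_emb = embedding(input_ids)                # (batch, seq, dim)
    # Logits at position i-1 predict token i, so shift predictions right by one;
    # the first position keeps its ground-truth embedding.
    pred_ids = prev_logits[:, :-1].argmax(dim=-1)  # (batch, seq-1)
    pred_emb = embedding(pred_ids)                 # (batch, seq-1, dim)
    mixed = gold_emb.clone()
    mixed[:, 1:] = (1.0 - lam) * gold_emb[:, 1:] + lam * pred_emb
    return mixed  # feed to the Transformer in place of the usual embeddings
```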
  5. Scheduled Sampling (Bengio et al., 2015)
     [Figure: the same architecture, with the previous-step prediction ŝ_i fed back as input]
     💡 Using the token the LLM predicted at the previous step for training is the same idea as Scheduled Sampling
     Whether to use the ground-truth s_i or the prediction is decided at random
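For comparison, here is a sketch of the Scheduled Sampling counterpart: a hard, random swap in token space rather than a soft interpolation in embedding space. The fixed `sampling_prob` is a placeholder; Bengio et al. anneal this probability over the course of training.

```python
import torch

def scheduled_sampling_ids(input_ids: torch.Tensor, prev_logits: torch.Tensor,
                           sampling_prob: float = 0.25) -> torch.Tensor:
    """Randomly replace each ground-truth token s_i with the model's
    previous-step prediction (Scheduled Sampling, Bengio et al., 2015)."""
    pred_ids = prev_logits[:, :-1].argmax(dim=-1)  # prediction for position i
    # Coin flip per position: use the prediction with probability sampling_prob.
    use_pred = torch.rand(pred_ids.shape, device=pred_ids.device) < sampling_prob
    mixed = input_ids.clone()
    mixed[:, 1:] = torch.where(use_pred, pred_ids, input_ids[:, 1:])
    return mixed
```

The contrast with the IFT sketch above is the point of this slide: Scheduled Sampling makes a discrete either/or choice per token, while IFT mixes the two continuously in embedding space.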
  6. Dynamic Relation Propagation
     Motivation
     - With Intuitive Preference Estimation, the prediction at a given timestep now influences future predictions
     - So when computing the loss at each timestep, we want to take the "future losses" into account
     - For example, if a prediction at some timestep results in larger losses at subsequent timesteps, we want to weight the loss corresponding to that prediction more heavily
  7. Dynamic Relation Propagation
     loss = L_i × (L_i + … + L_N) + L_{i+1} × (L_{i+1} + … + L_N) + …
     (each timestep's loss L_i is scaled by the accumulated loss from that timestep to the end of the sequence; a code sketch follows below)
     ✏ The above is an explanation based on Algorithm 1 of the original paper and the authors' implementation.
     (Equation (21) in the paper also defines this loss, but it appears to differ considerably. Typo?)
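A minimal sketch of this weighting over a vector of per-token losses, following the Algorithm-1 reading above. Detaching the suffix-sum weights so that gradients flow only through the raw per-token losses is my assumption; check the authors' implementation for the exact treatment.

```python
import torch

def dynamic_relation_loss(token_losses: torch.Tensor) -> torch.Tensor:
    """Scale each timestep's loss L_i by the accumulated future loss
    (L_i + ... + L_N), then sum over the sequence."""
    # Suffix sums via a double flip: suffix[i] = L_i + L_{i+1} + ... + L_N
    suffix = token_losses.flip(-1).cumsum(-1).flip(-1)
    # Treat the suffix sums as constant weights (detached from the graph).
    return (token_losses * suffix.detach()).sum(-1)
```

With `token_losses` of shape (batch, seq), this returns one weighted loss per sequence; positions whose errors compound into later timesteps receive larger weights, as the motivation slide describes.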
  8. Overview
     - Systematically organizes SFT- and PO (Preference Optimization)-style methods for tuning LLMs
     - Proposes Intuitive Fine-Tuning (IFT), a new method that compensates for the weaknesses of SFT that this organization reveals
  9. A View of SFT and PO
     - Treat the token sequence as the state and the prediction of the next token as the action.
     - SFT and PO share the same objective: matching the model's state transition probabilities to the human's.
     - What differs is where the states used for training come from:
       - SFT: ground-truth states written by humans
       - PO: states generated by the model
     Because the human-written states differ from the distribution of states the model generates, optimization is inefficient; this is IFT's motivation.
  10. On the Systematic Organization
      - Each individual formalization was convincing, but their integration was not the kind that opens up some genuinely new perspective
      - At the point leading to IFT, the crucial concepts were never expressed formally, and some parts failed to click (around "biased/unbiased estimation for model/human preference")
      - For the relation between SFT and PO/RL, the following papers are easier to follow:
        - DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
        - On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
  11. Discussion
      The idea of bringing SFT closer to RL comes up from time to time
      - Intuitive Fine-tuning, Dynamic Fine-tuning
      - The motivation is understandable, but in real-world applications this direction probably will not surpass the efficiency of RL-style methods (though as a warm-up before RL, it could become a strict upgrade over SFT).
      Strengths of RL-style methods
      - Training efficiency from the training data being on-policy
      - Data-collection efficiency from not having to author ground-truth responses