
Building JGLUE and the Future of Japanese LLM Evaluation

Keisuke Kamata
November 15, 2023


Transcript

  1. Classification of LLMs
     • Encoder-decoder models (e.g., T5): attention within the source language, attention within the target language, and attention between source and target languages; input is a sentence in the source language and output is the next word in the target language, given the words previously generated (e.g., "I am a student" → "je suis étudiant"), with 6 encoder and 6 decoder layers in the original Transformer.
     • Encoder-only models (BERT family)
     • Decoder-only models (GPT family)
     [Figure adapted from https://jalammar.github.io/illustrated-transformer/]

  2. Language understanding benchmark: GLUE (General Language Understanding Evaluation) [Wang+ 2018]
     Task / Description
     • SST-2: sentiment analysis of movie reviews (positive/negative)
     • CoLA: whether a sentence is acceptable
     • MRPC: whether two sentences have the same meaning
     • STS-B: similarity of two sentences (1-5)
     • QQP: whether two questions have the same meaning
     • MNLI: recognizing the entailment relation between two sentences (entailment/contradiction/neutral)
     • QNLI: whether a sentence contains the answer to a question (built from SQuAD)
     • RTE: whether two sentences stand in an entailment relation
     • WNLI: whether two sentences stand in an entailment relation (Winograd Schema Challenge)

  3. Background (1/2)
     • A benchmark such as GLUE [Wang+ 2018] is indispensable for comprehensive evaluation and analysis of LLMs.
     • Benchmarks have also been built for languages other than English: French FLUE [Le+ 2020], Chinese CLUE [Xu+ 2020], Korean KLUE [Park+ 2021], ...
     → We built JGLUE, a Japanese language understanding benchmark.
     (Building ever harder benchmarks and improving LLM performance drive each other.)

  4. Background (2/2)
     • Issues with existing Japanese datasets:
       1. Translation-based datasets (e.g., JSNLI [Yoshikoshi+ 2020], JSICK [Yanaka+ 2021])
          - Unnatural Japanese produced by machine or human translation
          - Regional and cultural gaps with Japan (e.g., many texts about American place names, politicians, etc.)
          → Build from scratch in Japanese
       2. Domain-specific datasets
          - e.g., JRTE [Hayashibe+ 2020]: hotel reviews
          → Build on general-domain text
     • We built JGLUE, a Japanese language understanding benchmark, to promote language understanding research.

  5. Composition of JGLUE
     • Designed to broadly cover the tasks of GLUE and SuperGLUE
     • Built using Yahoo! Crowdsourcing

     Task                           Dataset                 train    dev    test
     Text classification            MARC-ja               187,528  5,654  5,639
                                    JCoLA [Someya+ 2022]        -      -      -
     Sentence-pair classification   JSTS                   12,451  1,457  1,589
                                    JNLI                   20,073  2,434  2,508
     QA                             JSQuAD                 62,859  4,442  4,420
                                    JCommonsenseQA          8,939  1,119  1,118

  6. JGLUE data examples (1)

     MARC-ja
     • "The color and the fit are both great. I use them for boat fishing in the summer." → positive
     • "What is this? 4%... the card data won't transfer, and checking with a command just throws an error. Unacceptable." → negative
     • "If you simply want to know the week's weather, this is enough. But is charging money for only this much reasonable?" → original label positive, relabeled negative via crowdsourcing (answers: positive 0, negative 10)

     JSTS / JNLI
     • Sentence 1: "A large bus is driving along a road in town." / Sentence 2: "A large bus is driving along a road." → similarity: 4.4, inference relation: entailment
     • Sentence 1: "Dishes are laid out on a table." / Sentence 2: "There are half-eaten dishes on the table." → similarity: 3.0, inference relation: neutral
     • Sentence 1: "A baseball player is swinging a bat." / Sentence 2: "A baseball player is playing catch." → similarity: 2.0, inference relation: contradiction

  7. JGLUE data examples (2)

     JSQuAD
     [Title] Tokaido Shinkansen
     Passage: With the breakup and privatization of Japanese National Railways on April 1, 1987, JR Central took over operation. Through service is run with the Sanyo Shinkansen, which was taken over by West Japan Railway Company (JR West), and JR West rolling stock is sometimes used even on trains that operate only within the Tokaido Shinkansen section. As of March 2020, the fastest travel time between Tokyo Station and Shin-Osaka Station is 2 hours 21 minutes, with trains running at a maximum speed of 285 km/h.
     Question: In 2020, what is the fastest travel time between Tokyo and Shin-Osaka? Answer: 2 hours 21 minutes
     Question: Which line runs through service with the Tokaido Shinkansen? Answer: the Sanyo Shinkansen

     JCommonsenseQA
     Question: What do you call the person with the highest responsibility in a company? Choices: teacher, department manager, company president, subordinate, part-time worker
     Question: What utensil do you use to drink soup? Choices: spoon, menu, plate, fork, chopsticks

  8. Construction flow of JSTS and JNLI
     [Figure: starting from image caption pairs (e.g., "A blue car is driving." / "A blue car is driving along the coast."), sentence pairs are annotated with similarity scores to form JSTS (subsets JSTS-A/B/C, including low-similarity cross-image pairs such as "A man lit by the sunset." / "A white dog...", score 1.2); entailment relations are then annotated to form JNLI-A, and contradiction sentences are written by hand (e.g., "A man lit by the sunset." → "A man lit by moonlight.", label: contradiction) to form JNLI-C. Image sources: Irasutoya (https://www.irasutoya.com/), ONWA Illust (https://onwa-illust.com/)]

  9. How each task is solved (BERT-style fine-tuning)
     • Single-sentence classification (MARC-ja): [CLS] sentence → classification head (e.g., [CLS] この PC は 丈夫 で 軽い 。 → positive)
     • Sentence-pair classification/regression (JSTS, JNLI): [CLS] sentence1 [SEP] sentence2 [SEP] → label (e.g., entailment) or similarity score
     • Span extraction (JSQuAD): [CLS] question [SEP] passage [SEP] → predict the start/end positions of the answer span
     • Multiple choice (JCommonsenseQA): encode [CLS] question [SEP] choice_i [SEP] for each of the five choices, score each, and take a softmax over the five scores

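     As an illustration of the multiple-choice setup, the following is a minimal sketch using the HuggingFace Transformers multiple-choice head; the checkpoint name is a placeholder and the snippet is not from the talk.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

# Placeholder checkpoint; any Japanese BERT-style encoder fine-tuned on JCommonsenseQA would do.
model_name = "cl-tohoku/bert-base-japanese-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMultipleChoice.from_pretrained(model_name)

question = "スープを飲む時に使う道具は？"  # "What utensil do you use to drink soup?"
choices = ["スプーン", "メニュー", "皿", "フォーク", "はし"]

# Encode the question paired with each choice: [CLS] question [SEP] choice [SEP]
enc = tokenizer([question] * len(choices), choices, padding=True, return_tensors="pt")
# The multiple-choice head expects tensors of shape (batch, num_choices, seq_len)
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_choices)
print(choices[logits.argmax(dim=-1).item()])
```
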
  10. JCommonsenseQA 2.0: improving a commonsense reasoning dataset through human-computer collaboration [Kurihara+ 2023]
     • V1 question: Which stationery item is popular as a present? Choices: fountain pen, outdoors, moon, cup, rice bowl
     • After adversarial distractor generation: Which stationery item is popular as a present? Choices: fountain pen, pencil, origami paper, cup, rice bowl
     • After question rewriting (V2): Which one do you refill with ink to write with? Choices: fountain pen, pencil, origami paper, cup, rice bowl

     Accuracy                        V1     V2 (distractor generation)   V2 (question rewriting)
     Human                           0.988  0.997                        0.996
     Tohoku Univ. BERT_BASE          0.782  0.571                        0.678
     Tohoku Univ. BERT_LARGE         0.822  0.617                        0.736
     Waseda Univ. RoBERTa_BASE       0.849  0.551                        0.672
     Waseda Univ. RoBERTa_LARGE      0.901  0.807                        0.865

     → Building an even harder benchmark (V1 → V2) in step with improving LLM performance.

  11. MMLU (Measuring Massive Multitask Language Understanding) [Hendrycks+ 2021]
     • Four-choice questions across 57 subjects such as mathematics, physics, law, and history
     • Includes questions from exams such as the GRE and the United States Medical Licensing Examination
     • Typically answered and evaluated in a few-shot setting

     Example (Microeconomics): One of the reasons that the government discourages and regulates monopolies is that (A) producer surplus is lost and consumer surplus is gained. (B) monopoly prices ensure productive efficiency but cost society allocative efficiency. (C) monopoly firms do not engage in significant research and development. (D) consumer surplus is lost with higher prices and lower levels of output.
     Example (Conceptual Physics): When you drop a ball from rest it accelerates downward at 9.8 m/s². If you instead throw it downward assuming no air resistance, its acceleration immediately after leaving your hand is (A) 9.8 m/s² (B) more than 9.8 m/s² (C) less than 9.8 m/s² (D) Cannot say unless the speed of throw is given.
     Example (College Mathematics): In the complex z-plane, the set of points satisfying the equation z² = |z|² is a (A) pair of points (B) circle (C) half-line (D) line

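     For reference, a minimal sketch of the commonly used few-shot prompt format for MMLU-style evaluation; the exact header and layout vary by harness, so treat this as an assumption rather than the canonical implementation.

```python
def format_mmlu_prompt(subject, fewshot_examples, question, choices):
    """Build a few-shot multiple-choice prompt that ends in 'Answer:' for the model to complete."""
    prompt = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    for ex in fewshot_examples:  # each ex: {"question": str, "choices": [str]*4, "answer": "A".."D"}
        prompt += ex["question"] + "\n"
        for letter, choice in zip("ABCD", ex["choices"]):
            prompt += f"{letter}. {choice}\n"
        prompt += f"Answer: {ex['answer']}\n\n"
    prompt += question + "\n"
    for letter, choice in zip("ABCD", choices):
        prompt += f"{letter}. {choice}\n"
    prompt += "Answer:"
    return prompt
```
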
  12. lm-evaluation-harness (EleutherAI)
     • Evaluates generative LLMs in a unified way on more than 200 datasets
     • ARC, BIG-Bench, BLiMP, CrowS-Pairs, DROP, LAMBADA, MGSM, MMLU, PAWS-X, QNLI, SQuAD v2, SWAG, TruthfulQA, XCOPA, XWinograd, ...
     https://github.com/EleutherAI/lm-evaluation-harness

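     Typical usage looks roughly like the following sketch, assuming the Python API (`simple_evaluate`) exposed by recent harness releases; the model name and task list are placeholders.

```python
# Sketch only: assumes lm_eval >= 0.4 with the simple_evaluate API.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",                                      # HuggingFace causal-LM backend
    model_args="pretrained=EleutherAI/pythia-1.4b",  # placeholder checkpoint
    tasks=["lambada_openai", "mmlu"],
    num_fewshot=5,
)
print(results["results"])  # per-task metrics
```
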
  13. AlpacaEval (tatsu-lab)
     • 805 diverse questions (generation tasks)
     • Automatic evaluation (GPT-4, Claude)
     • Ranking by win rate based on pairwise comparison
     • Compared against text-davinci-003
     https://tatsu-lab.github.io/alpaca_eval/

  14. MT-Bench (LMSYS) [Zheng+ 2023]
     • Automatic evaluation (GPT-4) or human evaluation
     • Absolute score (1-10) or pairwise comparison
     • 80 questions probing multi-turn dialogue ability and instruction following
     • 8 categories (10 questions each, 2 turns): writing, roleplay, reasoning, math, coding, extraction, knowledge I (STEM), knowledge II (humanities/social science)
     [Figure: category-wise MT-Bench scores (0-10) of six models: GPT-4, Claude-v1, GPT-3.5-turbo, Vicuna-13B, Alpaca-13B, LLaMA-13B]
     https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard

  15. MT-Bench: examples of multi-turn questions [Zheng+ 2023]

     Writing
     • 1st turn: Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.
     • 2nd turn: Rewrite your previous response. Start every sentence with the letter A.

     Math
     • 1st turn: Given that f(x) = 4x^3 - 9x - 14, find the value of f(2).
     • 2nd turn: Find x such that f(x) = 0.

     Knowledge
     • 1st turn: Provide insights into the correlation between economic indicators such as GDP, inflation, and unemployment rates. Explain how fiscal and monetary policies ...
     • 2nd turn: Now, explain them again like I'm five.

  16. Prompt for absolute-score evaluation (MT-Bench)

     [System]
     Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question. Your evaluation should consider correctness and helpfulness. You will be given a reference answer and the assistant's answer. Your evaluation should focus on the assistant's answer to the second question. Begin your evaluation by comparing the assistant's answer with the reference answer. Identify and correct any mistakes. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".

     <|The Start of Reference Answer|>
     ### User: {question_1}
     ### Reference answer: {ref_answer_1}
     ### User: {question_2}
     ### Reference answer: {ref_answer_2}
     <|The End of Reference Answer|>

     <|The Start of Assistant A's Conversation with User|>
     ### User: {question_1}
     ### Assistant A: {answer_1}
     ### User: {question_2}
     ### Assistant A: {answer_2}
     <|The End of Assistant A's Conversation with User|>

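     As an illustration only (not part of the talk), the score can be pulled out of a judge model's reply with a simple regex, since the prompt pins the output format to "Rating: [[n]]"; the judge-calling code itself is omitted.

```python
import re
from typing import Optional

RATING_RE = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")

def parse_rating(judge_reply: str) -> Optional[float]:
    """Extract the 1-10 score from a judge reply such as 'Rating: [[7]]'."""
    m = RATING_RE.search(judge_reply)
    return float(m.group(1)) if m else None  # None signals a format failure

# Example: parse_rating("The answer is mostly correct ... Rating: [[7]]") -> 7.0
```
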
  17. Issues with LLM-based automatic evaluation [Zheng+ 2023]
     • Position bias: the answer presented first tends to be judged better → evaluate with both presentation orders, A-B and B-A (see the sketch after this list)
     • Name bias: "Assistant A" tends to be judged stronger than "Assistant B"
     • Verbosity bias (length bias): longer answers tend to be judged better
     • Self-enhancement bias: the judge LLM tends to favor its own generations
     • (Limited math and reasoning ability)

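     A minimal sketch of the order-swapping mitigation for position bias; `judge_pair` is a hypothetical function that asks the judge model which of two answers is better and returns "A", "B", or "tie".

```python
def debiased_verdict(question: str, answer_1: str, answer_2: str, judge_pair) -> str:
    """Query the judge with both presentation orders and only accept a consistent verdict."""
    first = judge_pair(question, answer_1, answer_2)   # shown as A=answer_1, B=answer_2
    second = judge_pair(question, answer_2, answer_1)  # swapped presentation order

    # Map the second verdict back to the original labels ("A" in the swapped order means answer_2).
    swapped = {"A": "B", "B": "A", "tie": "tie"}.get(second, "tie")
    if first == swapped:
        return first   # consistent across both orders
    return "tie"       # treating inconsistent verdicts as a tie is one common convention
```
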
  18. Constraints on evaluation with GPT-4
     • According to OpenAI's Terms of Use:
       "2. Usage Requirements (c) Restrictions: You may not (i) use the Services in a way that infringes, misappropriates or violates any person's rights; (ii) reverse assemble, reverse compile, decompile, translate or otherwise attempt to discover the source code or underlying components of models, algorithms, and systems of the Services (except to the extent such restrictions are contrary to applicable law); (iii) use output from the Services to develop models that compete with OpenAI; (iv) ..."
     • Developers of LLMs (that compete with OpenAI) may not use GPT-4's output, i.e., its evaluation results.

  19. Recent Japanese leaderboards and benchmarks

     Benchmark                Original questions?   # Questions   Task type                      Evaluation
     lm-evaluation-harness    no (existing data)    -             classification & generation    automatic
     Nejumi                   no (existing data)    -             classification                 automatic
     Rakuda                   yes                   40            generation                     automatic
     Japanese VicunaQA        yes                   80            generation                     automatic
     Japanese MT-Bench        yes                   80            generation                     automatic
     ELYZA-tasks-100          yes                   100           generation                     human, (automatic)

  20. lm-evaluation-harness (Stability AI)
     • A Japanese version of EleutherAI/lm-evaluation-harness
     • Supported datasets: JGLUE (JCommonsenseQA, JNLI, MARC-ja, JSQuAD), JAQKET v2, XLSum (ja), XWinograd (ja), MGSM
     • Evaluated few-shot (2- or 3-shot)
     • Ranking (leaderboard) by average accuracy
     https://github.com/Stability-AI/lm-evaluation-harness/

  21. Nejumi (Weights & Biases)
     • Turns JGLUE into a leaderboard: MARC-ja, JNLI, JSQuAD, JCommonsenseQA
     • Evaluated zero-shot
     • Ranking by average accuracy
     • Differences from the Stability AI harness (quoted from the Nejumi article; see the sketch after this list):
       - "For the MARC-ja, JNLI, and JCommonsenseQA tests, Stability AI's evaluation takes a classifier-like approach that picks, among the given answer choices, the candidate with the highest log-likelihood, so irrelevant answers, format errors, and misspellings cannot occur; we instead let the model generate freely over the entire vocabulary, so it cannot score unless it overcomes these issues."
       - "For the JSQuAD test, Stability AI's evaluation gives the model the number of tokens in the gold answer and has it output exactly that many tokens, whereas in our evaluation the model has to stop generating on its own."
     https://note.com/wandb_jp/n/n2464e3d85c1a
     https://wandb.me/nejumi

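     To make the contrast concrete, here is a minimal sketch (not from the talk) of the classifier-like scoring described above: sum the log-probabilities a causal LM assigns to each answer choice appended to the prompt and pick the highest-scoring choice. The model name and prompt are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "rinna/japanese-gpt-neox-3.6b"  # placeholder Japanese causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities of the choice tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)          # position t predicts token t+1
    positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    targets = full_ids[0, prompt_ids.shape[1]:]
    return sum(log_probs[pos, tok].item() for pos, tok in zip(positions, targets))

prompt = "質問: スープを飲む時に使う道具は？\n答え: "
choices = ["スプーン", "メニュー", "皿", "フォーク", "はし"]
print(max(choices, key=lambda c: choice_logprob(prompt, c)))
```
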
  22. Rakuda (YuzuAI)
     • 40 human-written questions about Japanese geography, politics, history, and society
     • Automatic evaluation (GPT-4)
     • Pairwise comparison (both presentation orders)
     • Models are scored with Bradley-Terry strengths (a refinement of Elo ratings) and placed on a leaderboard
     https://yuzuai.jp/benchmark

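     For reference, the standard Bradley-Terry model behind such rankings (not specific to Rakuda) assigns each model i a strength parameter β_i and models the probability that i beats j in a pairwise comparison as:

```latex
P(i \succ j) = \frac{e^{\beta_i}}{e^{\beta_i} + e^{\beta_j}}
```

     The strengths β are then fit by maximum likelihood over all observed pairwise outcomes, which is more stable than Elo's incremental, order-dependent updates.
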
  23. Japanese VicunaQA (Kyoto University)
     • 80 questions on generic topics, knowledge, roleplay, common sense, Fermi estimation, counterfactuals, coding, math, and writing
     • A translation of the 80 Vicuna Eval questions, the predecessor of MT-Bench
     • Automatic evaluation (GPT-4)
     • Win rates computed from pairwise comparisons (both presentation orders)
     • Example questions:
       - What are some good ways to cope with stress?
       - What factors would you consider when designing an inclusive and accessible public transportation system?
       - If you were a pirate captain, what would you say to your crew to motivate them to search for treasure?

  24. Japanese MT-Bench (Stability AI)
     • 80 questions probing multi-turn conversation ability and instruction following
     • 8 categories (10 questions each, 2 turns): writing, roleplay, reasoning, math, coding, extraction, knowledge I (STEM), knowledge II (humanities/social science)
     • MT-Bench translated and adapted to fit Japanese culture
     • Automatic evaluation (GPT-4), absolute score (1-10)
     • Evaluation run by shi3z: https://note.com/shi3zblog/n/n6b2ac5874021

  25. Example questions from Japanese MT-Bench
     • Write a guide on business-email etiquette for new employees. Include the correct use of honorific language and points to keep in mind in Japanese business culture.
       - Follow-up: Objectively evaluate the guide you wrote and point out anything that could be improved.
     • Start a conversation role-playing Nobita from Doraemon. Begin with the following question: "After washing your hands, do you think an air dryer is necessary?"
       - Follow-up: Let's have a meal together in town. Shall we go together by bus?
     • To your left you can see a beautiful red house, to your right a fantastical greenhouse, and in front of you an attractive pink place. Now, where is the white house?
       - Follow-up: Does the original question contain any clue that definitively determines the location of the white house?

  26. ELYZA-tasks-100 (ELYZA)
     • 100 questions involving complex instructions and tasks
     • Comes with example answers and evaluation criteria
     • Mainly human evaluation (5-point scale, 3 evaluators); an evaluation result sheet is published
     • Example questions:
       - List five ideas for regaining enthusiasm for your work.
       - Read the following sentences and rate how angry the speaker is on a scale of 1 to 10 (1 = not angry, 10 = extremely angry). 1. "A failing grade on the test again? You really are ..." 2. "A failing grade? It was a hard test this time, wasn't it."
       - Reply to the following email: "Thank you for your work. I am feeling unwell today, so my arrival is likely to be a little later than planned. I expect to arrive shortly after 1 p.m. at the latest. I apologize for the inconvenience and ask for your understanding."

  27. Perspectives in LLM evaluation
     • Seen/Unseen: whether the model has been trained with supervision on the task
       - Traditional benchmarks such as GLUE use a seen setting
       - Recent leaderboards are almost always implicitly unseen (zero-/few-shot)
     • Contamination
       - The evaluation data may have been included in the training data
       - cf. "Catch me if you can! How to beat GPT-4 with a 13B model" [blog]
     • Task type: classification (understanding) vs. generation
     • Evaluation method: automatic vs. human
       - Classification tasks are evaluated automatically
       - Generation tasks use both (automatic evaluation of generation mostly relies on GPT-4, but ...)
     • Model type
       - Training method: pretrained, fine-tuned (SFT), RLHF
       - Number of parameters
       - Training languages

  28. Issues with current Japanese LLM evaluation
     • Evaluating only classification (understanding) tasks such as JGLUE gives a one-sided picture
     • Issues with current generation datasets
       - Few large-scale datasets exist
       - News article summarization: XLSum (ja) [Hasan+ 2021]
       - Everyday dialogue corpus: Japanese Daily Dialogue [Akama+ 2023]
     • Issues with generation questions for LLM evaluation
       - Only a few dozen to about 100 questions per dataset
       - Evaluation is done by humans, or automatically with GPT-4

  29. Toward benchmarks suited to LLM evaluation
     • Expanding classification (understanding) datasets
       - Translating MMLU into Japanese (Waseda Kawahara Lab / RIKEN AIP)
       - The llm-jp-eval and jaster efforts (LLM-jp study group)
     • Expanding generation datasets
       - Automatic evaluation is indispensable, but we want to avoid relying on GPT-4 as the evaluator
       - Build an evaluator by annotating good and bad generations and fine-tuning on that data (see the sketch after this list); cf. BLEURT [Sellam+ 2020], COMET [Rei+ 2020]
       - Highly open-ended tasks such as chit-chat dialogue are hard to evaluate (automatically); summarization and QA are candidates (JGLUE v2)
       - Considering allowing in-house corporate texts, in addition to open texts, as annotation targets
     • Rethinking the evaluation setting
       - Unseen setting? Few-shot setting?
       - Coping with prompt sensitivity

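     As a rough illustration of that direction (a sketch in the spirit of BLEURT/COMET, not the speaker's actual implementation), one could fine-tune an encoder with a single regression output on (instruction, generated output, human score) pairs; model name and data are placeholders.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cl-tohoku/bert-base-japanese-v2"  # placeholder encoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)  # regression head

# Hypothetical annotated data: (instruction, model output, human quality score)
data = [
    ("仕事の熱意を取り戻すためのアイデアを5つ挙げてください。", "1. 目標を見直す ...", 4.0),
    ("以下のメールに返信してください。...", "承知しました。", 1.5),
]

def collate(batch):
    instructions, outputs, scores = zip(*batch)
    enc = tokenizer(list(instructions), list(outputs), padding=True, truncation=True, return_tensors="pt")
    enc["labels"] = torch.tensor(scores, dtype=torch.float)  # with num_labels=1, HF uses an MSE loss
    return enc

loader = DataLoader(data, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for batch in loader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```
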
  30. Toward benchmarks suited to LLM evaluation: looking further ahead
     [Figure 1 of [Guo+ 2023] (Awesome-LLMs-Evaluation-Papers): a proposed taxonomy of LLM evaluation with major categories Knowledge and Capability (question answering, tool learning, reasoning, knowledge completion), Alignment Evaluation (ethics and morality, bias, toxicity, truthfulness), Safety (robustness evaluation, risk evaluation), Specialized LLMs (biology and medicine, education, legislation, computer science, finance), and Evaluation Organization (benchmarks for holistic evaluation, for knowledge and reasoning, and for NLU and NLG).]