uchi_k
August 17, 2020

The Effect of Preprocessing on Word Embeddings: A Comprehensive Analysis of Preprocessing for Word Representation Learning in Affective Tasks

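This deck reads through "A Comprehensive Analysis of Preprocessing for Word Representation Learning in Affective Tasks", a paper accepted at ACL 2020. The paper examines how preprocessing affects word embeddings, particularly in emotion-recognition tasks, and tests whether commonly used experimental settings are actually sound.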

Transcript

  1. About me: Kenshi Uchihashi (内橋 堅志), uchi_k, @__uchi_k__

    Representative of yuni, inc. and organizer of nlpaper.challenge. Freelance Machine Learning Engineer / Researcher. Former: Kyoto University Graduate School of Informatics, MITOU 16, FreakOut Machine Learning Engineer.
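  2. #distributional hypothesis #word embedding
    Limits of word embeddings based on the distributional hypothesis

    The Distributional Hypothesis is that words that occur in the same contexts tend to have similar meanings [Harris, 1954].
    In other words, words that frequently appear in similar contexts are assumed to be semantically similar, and so they also end up close together in the embedding space.
    The distributional hypothesis is one way of determining what a word means. It derives meaning statistically: there are inference-based models such as word2vec, and count-based methods that simply apply dimensionality reduction to co-occurrence statistics.
    This can produce counterintuitive similarities. For example, the pair "happy" and "sadness" may score a higher similarity than the pair "happy" and "joy". Word embeddings therefore need to be adjusted for each task.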
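The counterintuitive similarity described on this slide can be reproduced on a toy scale. The following is a minimal sketch, assuming a made-up six-sentence corpus and raw co-occurrence counts in place of a trained embedding; the corpus, the window size, and the function names are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter, defaultdict

# Toy corpus (invented for illustration): "happy" and "sad" occur in the same
# sentence frames, while "joy" occurs in different ones.
CORPUS = [
    "i feel happy today",
    "i feel sad today",
    "she was happy yesterday",
    "she was sad yesterday",
    "they shouted with joy",
    "tears of joy",
]


def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = cooccurrence_vectors(CORPUS)
sim_happy_sad = cosine(vectors["happy"], vectors["sad"])
sim_happy_joy = cosine(vectors["happy"], vectors["joy"])
print(f"happy~sad: {sim_happy_sad:.2f}  happy~joy: {sim_happy_joy:.2f}")
```

On this corpus the antonym pair shares every context and the synonym-like pair shares none, so a purely distributional measure ranks "happy" closer to "sad" than to "joy".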
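  3. A Comprehensive Analysis of Preprocessing for Word Representation Learning in Affective Tasks #abstract
    Nastaran Babanejad (Department of Electrical Engineering and Computer Science) et al., ACL 2020

    The paper examines how preprocessing affects word embeddings, particularly in emotion-recognition tasks, and checks whether commonly used experimental settings are really correct.
    We tend to reach for pretrained word embeddings. But if there are embeddings in which, for example, the pair "happy" and "sadness" is more similar than the pair "happy" and "joy", can emotion recognition really be solved with them?
    Isn't the essential question how to use preprocessing such as stopwords, negation, POS, and lemmatization?
    The goal is to measure how large an effect preprocessing has on word embeddings and to revisit the conventional experimental settings.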
  4. Related work (word2vec, UGC) #recent study #ugc #word2vec

    • Improving Distributional Similarity with Lessons from Word Embeddings
      ◦ Omer Levy (Bar-Ilan University) et al., ACL
      ◦ Showed that, for word embeddings, count-based methods can beat inference-based methods such as word2vec, depending on hyperparameter tuning.
    • NL-FIIT at IEST-2018: Emotion Recognition utilizing Neural Networks and Multi-level Preprocessing
      ◦ Samuel Pecar (Slovak University of Technology) et al., EMNLP
      ◦ Investigates the importance of preprocessing when working with user-generated content. In particular, it raised scores by recognizing emoticons and emoji in detail.
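  5. Related work (preprocessing) #recent study #preprocessing #emotion

    • On stopwords, filtering and data sparsity for sentiment analysis of twitter
      ◦ Hassan Saif (Knowledge Media Institute, The Open University) et al., LREC
      ◦ Whether stopword removal helps depends heavily on how the word list is built and on the task. For Twitter sentiment, the paper showed that the commonly used approach does more harm than good.
    • A comparative evaluation of pre-processing techniques and their interactions for twitter sentiment analysis
      ◦ Symeon Symeonidis, Expert Systems with Applications
      ◦ After trying a range of preprocessing techniques, the paper found that lemmatization, number removal, and contraction replacement contributed most to the score in sentiment analysis.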
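  6. #preprocessing #pipeline
    The preprocessing flow in NLP

    Stages: cleaning, segmentation, normalization, compression, vectorization.
    Techniques shown, in slide order: removal of tags and symbols, punctuation, morphological analysis, adding dictionary entries, dependency parsing, number replacement, recognition of emoticons and the like, spell check, spelling variation, lowercasing, replacement with a representative word, abbreviations, lemmatization, stemming, negation, ontology, stopword removal, POS, CBOW, skip-gram, BERT, coverage analysis, bringing the vocabulary closer to the classification data, etc.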
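The first stages of such a flow can be sketched in a few lines. This is a minimal illustration using only the standard library; the regular expressions, the `<num>` placeholder, and the function names are assumptions chosen for the example, not the pipeline used in the paper.

```python
import re


def clean(text):
    """Cleaning: drop HTML-like tags, then replace punctuation with spaces."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"[^\w\s]", " ", text)


def split(text):
    """Segmentation: whitespace tokenization (enough for English text)."""
    return text.split()


def normalize(tokens):
    """Normalization: lowercase, and map digit-only tokens to a placeholder."""
    return ["<num>" if tok.isdigit() else tok.lower() for tok in tokens]


def preprocess(text):
    """Run cleaning, segmentation, and normalization in order."""
    return normalize(split(clean(text)))


print(preprocess("<p>The flight was delayed 3 hours!</p>"))
# ['the', 'flight', 'was', 'delayed', '<num>', 'hours']
```

Keeping each stage as its own function makes it easy to switch a single step on or off, which is the kind of comparison the paper is concerned with.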
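  7. #preprocessing #negation
    Negation

    • Building an antonym dictionary
      ◦ An antonym dictionary is built from the WordNet corpus.
      ◦ If a word has no antonym or exactly one, it is used as is. If it has several, the antonym with the highest frequency in the ukWaC corpus is chosen, or one is simply picked at random.
    • Replacing negated words with their antonyms
      ◦ When a negation word is found, the word that follows it is extracted and looked up in the antonym dictionary. If an antonym is found, the negation word and the negated word are replaced by it.
      ◦ For example, in the sentence "I am not happy today", the negation word ("not") and the word it applies to ("happy") are identified. The antonym of "happy" ("sad") is looked up in the antonym dictionary, and "not happy" is replaced with "sad".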
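The replacement step described on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: the small `ANTONYMS` dictionary stands in for one built from WordNet, the `FREQUENCY` table stands in for ukWaC counts, and the negation word list is invented for the example. None of these values come from the paper.

```python
# Hypothetical antonym dictionary; the slide builds this from WordNet.
ANTONYMS = {
    "happy": ["sad", "unhappy"],
    "good": ["bad"],
}
# Hypothetical corpus frequencies standing in for ukWaC counts.
FREQUENCY = {"sad": 120, "unhappy": 45, "bad": 300}
# Hypothetical list of negation words.
NEGATION_WORDS = {"not", "never", "no"}


def pick_antonym(word):
    """Return the most frequent known antonym of `word`, or None if there is none."""
    candidates = ANTONYMS.get(word, [])
    if not candidates:
        return None
    return max(candidates, key=lambda w: FREQUENCY.get(w, 0))


def replace_negations(tokens):
    """Replace 'negation word + word' with the word's antonym when one is known."""
    result = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in NEGATION_WORDS and i + 1 < len(tokens):
            antonym = pick_antonym(tokens[i + 1])
            if antonym is not None:
                result.append(antonym)
                i += 2  # skip both the negation word and the negated word
                continue
        result.append(token)
        i += 1
    return result


print(replace_negations("i am not happy today".split()))
# ['i', 'am', 'sad', 'today']
```

When no antonym is known, the tokens are left unchanged, so "not blue" stays "not blue". Picking at random among several antonyms, the other option the slide mentions, would replace the `max` call with `random.choice(candidates)`.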
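  8. #corpus #training #dataset
    Training Corpus

    Preprocessing is applied to corpora that differ in size and nature.
    • News: articles from American publications, collected over a span of several years.
    • Wikipedia: a corpus made up of Wikipedia articles, larger than News.
    Overall, stopword removal and POS leave the vocab size almost unchanged but reduce the corpus size substantially.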
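The observation that stopword removal shrinks the corpus far more than the vocabulary follows from a simple count: the vocabulary can lose at most one entry per stopword, while the corpus loses every occurrence. The sketch below checks this on a three-sentence toy corpus; the sentences and the stopword list are invented for illustration and are unrelated to the News and Wikipedia corpora.

```python
# Hypothetical short stopword list, for illustration only.
STOPWORDS = {"the", "a", "is", "of", "and", "to", "in"}

CORPUS = [
    "the movie is a triumph of style and the cast is superb",
    "the plot of the movie is thin and the pacing is slow",
    "a return to form in the final act of the movie",
]


def corpus_stats(sentences):
    """Return (corpus size in tokens, vocabulary size in distinct tokens)."""
    tokens = [tok for s in sentences for tok in s.lower().split()]
    return len(tokens), len(set(tokens))


def remove_stopwords(sentences):
    """Drop every token that appears in STOPWORDS."""
    return [
        " ".join(tok for tok in s.lower().split() if tok not in STOPWORDS)
        for s in sentences
    ]


tokens_before, vocab_before = corpus_stats(CORPUS)
tokens_after, vocab_after = corpus_stats(remove_stopwords(CORPUS))
print(f"corpus size: {tokens_before} -> {tokens_after}")
print(f"vocab size:  {vocab_before} -> {vocab_after}")
```

Here the corpus drops from 35 to 15 tokens while the vocabulary drops from 20 to 13 entries. On a real corpus the vocabulary is orders of magnitude larger than the stopword list, so its relative change is far smaller than in this toy case.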
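  9. #corpus #evaluation #dataset
    Evaluating Corpus

    Evaluation covers three tasks: sentiment analysis, emotion classification, and sarcasm detection.
    Sentiment analysis (positive/negative polarity)
    • IMDB
      ◦ Movie reviews with positive/negative labels.
    • SemEval
      ◦ Tweets with positive/negative labels.
    • Airline
      ◦ Tweets about airline companies.
    Emotion detection (emotion class classification)
    • ISEAR
      ◦ Personal accounts of experiences that evoke emotions.
    • Alm
      ◦ Fairy tales.
    • SSEC
      ◦ Tweets from SemEval that were re-annotated.
    Sarcasm detection
    • Onion
      ◦ News headlines collected from media that deal in sarcasm and media that do not.
    • IAC
      ◦ Utterance and response pairs.
    • Reddit
      ◦ Reddit posts labeled by their authors.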