
Research and Development at Cookpad / HCG2020

j.harashima
December 16, 2020

Transcript

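  1. History (the year and month of each event are not preserved in this transcript)

    - Coin Ltd. (now Cookpad Inc.) established
    - kitchen@coin, a recipe posting and search service, launched
    - Service renamed to Cookpad
    - Premium service launched
    - Listed on the Mothers market
    - Moved to the First Section of the Tokyo Stock Exchange
    - Premium service membership passed … (the figure, in units of ten thousand members, is not preserved)
    - Overseas expansion stepped up in earnest
    - Research and development department established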
  2. NLP

  3. Calorie Estimation (Harashima et al.)

    Example recipe: Asari Clam Rice (… go)

    Ingredients: asari clam, rice, salt, sake, soy sauce, sweet sake

    - We have estimated the number of calories in over … recipes and actually use them in our recipe service
    - Use the single source model for serving estimation

    c(r) = ∑_{i ∈ I_r} c(i) · q(i) / 100

    s(r) = 306.6
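The calorie formula c(r) = ∑ c(i) · q(i) / 100 on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the division by 100 is read here as c(i) being calories per 100 g and q(i) being the quantity in grams, which the slide does not state explicitly. The class name, the function name, and all numeric values are hypothetical and are not taken from the deck.

```python
from dataclasses import dataclass


@dataclass
class Ingredient:
    name: str
    kcal_per_100g: float  # c(i); assumed to mean kcal per 100 g
    grams: float          # q(i); assumed to mean the quantity in grams


def recipe_calories(ingredients):
    """Return c(r): the sum of c(i) * q(i) / 100 over the ingredients of recipe r."""
    return sum(i.kcal_per_100g * i.grams / 100 for i in ingredients)


# Made-up values for illustration only; they are not real nutrition data.
example_recipe = [
    Ingredient("rice", kcal_per_100g=350.0, grams=300.0),
    Ingredient("asari clam", kcal_per_100g=30.0, grams=200.0),
]

total_kcal = recipe_calories(example_recipe)
print(total_kcal)
```

Ingredients whose quantity cannot be converted to grams would need separate handling, which this sketch leaves out.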
  4. CV

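  5. Detection of food photos (version and model numbers are not preserved in this transcript)

    - ver. …: CaffeNet; binary classification of food vs. non-food
    - ver. …: Inception v…; multiclass classification of food or non-food (food, plants, …)
    - ver. …: Inception v… with patched classification; binary classification of food vs. non-food

    Training data

    - Positive examples: images from Cookpad recipes
    - Negative examples: various license-free images

    Recognition results of ver. …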
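  6. Usage (related topic; the counts and dates are not preserved in this transcript)

    - Before public release (until …): … universities, … laboratories
    - After public release (…): … universities, … laboratories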
  7. Cookpad Parsed Corpus: Linguistic Annotations of Japanese Recipes

    Jun Harashima and Makoto Hiramatsu (Cookpad Inc.)
    The 14th Linguistic Annotation Workshop

    Background

    - The number of cooking recipes on the Internet has grown
    - Recipe-related studies and datasets are also increasing
    - However, there are still few datasets that provide linguistic annotations for recipe-related studies, even though such annotations should form the basis of those studies

    Table 1. Existing recipe-related datasets and our corpus

    | Name | Year | Main content |
    |---|---|---|
    | CURD | 2008 | Machine-readable language representations |
    | Flow Graph Corpus | 2014 | Graph representations and named entities |
    | SIMMR Recipe Dataset | 2015 | Graph representations |
    | Cookpad Recipe Dataset | 2016 | Reviews and meals |
    | Cookpad Image Dataset | 2017 | Food images and cooking images |
    | Recipe1M | 2017 | Food images |
    | RecipeQA | 2018 | Question-answer pairs |
    | Storyboarding Data | 2019 | Cooking images |
    | r-FG BB dataset | 2019 | Bounding boxes for cooking images |
    | English Recipe Flow Graph Corpus | 2020 | Graph representations and named entities |
    | Multimodal Aligned Recipe Corpus | 2020 | URLs to YouTube videos |
    | Multi-modal Recipe Structure dataset | 2020 | Graph representations and cooking images |
    | Cookpad Parsed Corpus | 2020 | Linguistic annotations |

    Table 2. Existing Japanese parsed corpora and our corpus

    | Name | Year | Target documents |
    |---|---|---|
    | KU Text Corpus | 2002 | Newspaper articles |
    | GDA Corpus | 2005 | Newspaper articles and dictionary entries |
    | NAIST Text Corpus | 2007 | Newspaper articles |
    | KU and NTT Blog Corpus | 2011 | Blogs |
    | KU Web Document Leads Corpus | 2012 | Web documents |
    | BCCWJ | 2014 | Newspaper articles, books, magazines, etc. |
    | Cookpad Parsed Corpus | 2020 | Cooking recipes |

    Contributions of this study

    - Construction of a novel corpus, which contains linguistic annotations of 500 Japanese recipes
    - Benchmark results on the corpus for Japanese morphological analysis (MA), named entity recognition (NER), and dependency parsing (DP)

    Cookpad Parsed Corpus

    - We randomly selected 500 recipes from the Cookpad Recipe Dataset
    - 4,738 sentences in the 500 recipes were annotated with morphemes, named entities, and dependency relations

    Morphemes

    - We decided the boundaries and part of speech of each morpheme based on the IPA dictionary, which is commonly used for MA

    Named entities

    - Morphemes were annotated with 17 tags such as Fi (food ingredient) and Sf (state of food) in IOB2 format

    Dependency relations

    - Bunsetsus were annotated with relations such as D (normal dependency) and P (coordination dependency)
    - A bunsetsu is a unit of Japanese that consists of one or more content words and zero or more function words
    - Bunsetsus were also annotated with 7 types such as Topic

    Other

    - Content in the Cookpad Recipe and Image Datasets, which include the same 500 recipes, can also be used

    Figure 1. Linguistic annotations for an example sentence (Cut the raw salmon into bite-size chunks and sprinkle them with salt.) in our corpus. The Japanese text of the figure is not recoverable from this transcript; the annotation is for Step-ID 1, Sentence-ID 1-1, and its recoverable fields are:

    | Bunsetsu | Head and relation | Gloss | Named-entity tags |
    |---|---|---|---|
    | 0 | 4D | raw salmon (topic marker) | B-Fi, I-Fi, O |
    | 1 | 2D | a bite size (dative) | B-Sf, I-Sf, O |
    | 2 | 4P | cut | B-Ap |
    | 3 | 4D | salt (accusative) | B-Fi, O |
    | 4 | -1O | sprinkle . | B-Ap, O |

    Benchmarks

    - We divided our corpus into training (400 recipes), validation (100 recipes), and test sets (100 recipes) and tested popular tools or methods for Japanese MA, NER, and DP
    - We also tested the tools with domain adaptation

    MA

    - MeCab, the most popular morphological analyzer for Japanese, was tested
    - All metrics were 89–91%, although the tool achieved over 98% on newspaper articles in Kudo et al. (2004)

    Table 3. Benchmark results for MA

    | | Precision | Recall | F1 |
    |---|---|---|---|
    | MeCab | 88.91 | 88.95 | 88.93 |
    | MeCab w/ domain adaptation | 91.12 | 91.04 | 91.08 |

    NER

    - We trained and tested two recognizers using our training and test sets
    - Many errors were caused by domain-specific unknown words

    Table 4. Benchmark results for NER

    | | Accuracy | Precision | Recall | F1 |
    |---|---|---|---|---|
    | Sasada et al. (2015) | 88.30 | 74.65 | 82.77 | 78.50 |
    | Lample et al. (2016) | 91.41 | 88.17 | 87.18 | 87.67 |

    DP

    - We tested CaboCha, the most popular dependency parser for Japanese
    - Accuracy was 92–95% (over 20% of the sentences in our test set had at least one parsing error)

    Table 5. Benchmark results for DP

    | | Accuracy |
    |---|---|
    | CaboCha | 92.21 |
    | CaboCha w/ domain adaptation | 94.68 |

    - There is still room for improvement in Japanese MA, NER, and DP of cooking recipes
    - By improving these analyses using our corpus, a variety of recipe-related studies based on them can also be improved
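To illustrate the IOB2 format used for the named-entity annotations, here is a minimal sketch of how tagged morphemes could be grouped into entity spans. It is not the corpus's own tooling. The function name `iob2_spans` is hypothetical, the tag sequence follows the example sentence in Figure 1, and the tokens are English glosses standing in for the Japanese morphemes.

```python
def iob2_spans(tokens, tags):
    """Group IOB2-tagged tokens into (entity_type, tokens) spans.

    A "B-X" tag opens a span of type X, "I-X" continues an open span of the
    same type, and any other tag closes the open span. An "I-X" tag that does
    not continue a span of type X is treated like "O" in this sketch.
    """
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_type is not None:
                spans.append((current_type, current_tokens))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            if current_type is not None:
                spans.append((current_type, current_tokens))
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, current_tokens))
    return spans


# English glosses stand in for the Japanese morphemes (an assumption).
tokens = ["raw", "salmon", "TOPIC", "a-bite", "size", "DATIVE",
          "cut", "salt", "ACCUSATIVE", "sprinkle", "."]
tags = ["B-Fi", "I-Fi", "O", "B-Sf", "I-Sf", "O",
        "B-Ap", "B-Fi", "O", "B-Ap", "O"]

for entity_type, entity_tokens in iob2_spans(tokens, tags):
    print(entity_type, " ".join(entity_tokens))
```

Because every entity starts with an explicit B- tag in IOB2, adjacent entities such as "cut" (Ap) and "salt" (Fi) are separated without needing an O tag between them.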