
NLP introduction in R 1

kur0cky
November 29, 2019

Lecture materials


Transcript

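  1. Basic analysis workflow
    Recent developments
    - Deep learning is state of the art on a wide range of tasks
    - BERT was a breakthrough
      - It learns context bidirectionally, from both the preceding and the following text
      - A single model can be reused for many different tasks
    Pipeline: text → morpheme sequence → numeric vector → result
    - Text → morpheme sequence: preprocessing, morphological analysis
    - Morpheme sequence → numeric vector: preprocessing
    - Numeric vector → result: machine learning, statistical analysis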
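  2. Morphological analysis
    - Morpheme: the smallest unit of a character or symbol sequence that carries meaning
    - A morphological analyzer consists of an analysis program and a dictionary
      - The features you can build therefore depend on the quality of the dictionary
      - e.g. whether proper nouns and technical terms are recognized
    Example of morphological analysis with MeCab
    - Input: 「研究室に入りたいが，成績が足りない．」 ("I want to join the lab, but my grades are not good enough.")
    - Part-of-speech tags: 名詞 名詞 名詞 助詞 動詞 助動詞 助詞 記号 名詞 助詞 動詞 助動詞 記号 (noun, noun, noun, particle, verb, auxiliary verb, particle, symbol, noun, particle, verb, auxiliary verb, symbol)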
  3. MeCab (和布蕪)
    - An analysis engine from the Kyoto University Graduate School of Informatics and NTT Communication Science Laboratories
    - By default it uses the IPADic dictionary
    - From R, it is available through the RMeCab package
    Install (Mac, terminal)
        brew install mecab
        brew install mecab-ipadic
    Install (Mac, R)
        install.packages("RMeCab", repos = "http://rmecab.jp/R", type = "source")
  4. Let's try it
        library(RMeCab)
        words <- RMeCabC(str = "すもももももももものうち", mypref = 0)
        do.call(c, words)
        # RMeCabC() returns a list; do.call(c, ...) flattens it into a vector
        # With mypref = 1, each word is converted to its base (dictionary) form
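To make the shape of that output concrete without requiring a MeCab installation, here is a minimal sketch that builds the list by hand. The `words` list below is a hand-written stand-in, assumed to mirror what `RMeCabC()` returns (one length-one character vector per token, named with its part of speech); it is not actual analyzer output.

```r
# Hand-built stand-in for the list returned by RMeCabC() (an assumption for
# illustration; calling RMeCabC() itself needs MeCab to be installed).
# Each element is a length-one character vector named with its part of speech.
words <- list(
  c("名詞" = "すもも"), c("助詞" = "も"), c("名詞" = "もも"),
  c("助詞" = "も"), c("名詞" = "もも"), c("助詞" = "の"), c("名詞" = "うち")
)

# Flatten the list into one named character vector
flat <- do.call(c, words)

tokens <- unname(flat)  # the words themselves
pos <- names(flat)      # the part-of-speech tags

print(data.frame(pos = pos, word = tokens))
```

Keeping the part of speech in the names is what later makes it possible to build a two-column table of tag and word from a single vector.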
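  5. Numeric vectorization
    - To analyze documents, they first have to be converted into numbers
    - The simplest representation is frequency (Term Frequency; TF)
    [Table: term counts per novel, one row per novel (小説) and one column per word: 私, いる, する, 夜, 煙草, …]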
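A term-frequency table like the one on this slide can be sketched in base R. The documents and words below are hypothetical toy data, already split into tokens, and the variable names are chosen only for this illustration.

```r
# Toy documents, already split into words (hypothetical data)
docs <- list(
  novel_a = c("night", "smoke", "night", "me"),
  novel_b = c("me", "me", "night")
)

# Shared vocabulary, so every document gets the same columns
vocab <- sort(unique(unlist(docs)))

# One row per document, one column per word; cells hold raw counts
tf <- t(sapply(docs, function(words) table(factor(words, levels = vocab))))
print(tf)
```

Fixing the factor levels to the shared vocabulary is what produces explicit zero counts for words a document never uses.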
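  6. TF-IDF weighting
    TF (Term Frequency) × IDF (Inverse Document Frequency)
    - $tf_{ij}$: the number of times term $j$ appears in document $i$
    - $df_j$: the number of texts in which term $j$ appears
    - $N$: the total number of texts
    $$\mathrm{TFIDF}_{ij} = tf_{ij} \times \log \frac{N}{df_j}$$
    The factor $\log(N / df_j)$ grows as term $j$ appears in fewer texts, so rarer words receive more weight.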
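As a quick worked example with hypothetical numbers (and assuming the natural logarithm, which the slide does not specify): suppose there are $N = 4$ texts, term $j$ occurs $tf_{ij} = 3$ times in document $i$, and it appears in only $df_j = 1$ text.

```latex
\mathrm{TFIDF}_{ij} = tf_{ij} \times \log\frac{N}{df_j}
                    = 3 \times \log\frac{4}{1}
                    \approx 3 \times 1.386
                    \approx 4.159
```

By contrast, a term that appears in all four texts has $\log(4/4) = 0$, so its weight is zero however often it occurs.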
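  7. Stop words
    - Words that are not useful for the analysis
      - Function words such as particles and auxiliary verbs
      - "the", "a", and so on
    - How to remove them
      - Use a dictionary
        Reference: http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt
      - Use occurrence frequency: the highest-ranked words account for most of all occurrences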
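The frequency-based approach can be sketched in a few lines of base R. The token vector and the cutoff `k` are hypothetical; in practice the cutoff would be chosen by looking at the cumulative share.

```r
# Toy token vector (hypothetical data)
tokens <- c("the", "a", "the", "cat", "the", "a", "dog", "the", "a", "the")

# Frequency of each word, most frequent first
freq <- sort(table(tokens), decreasing = TRUE)

# Cumulative share of all tokens covered by the top-ranked words
share <- cumsum(freq) / sum(freq)
print(share)

# Treat the top k words as stop-word candidates (k = 2 is an arbitrary choice)
k <- 2
stop_candidates <- names(freq)[seq_len(k)]
filtered <- tokens[!tokens %in% stop_candidates]
print(filtered)
```

In this toy data the two most frequent words already cover 80% of all tokens, which is the pattern the slide describes.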
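  8. Preparation
    - A dataset of novels was prepared from Aozora Bunko (青空文庫): Dazai, Soseki, Akutagawa
        library(tidyverse)  # the usual
        library(RMeCab)     # for morphological analysis
        # Load the dataset
        df <- read_csv("aozora.csv")
        # As an example, extract the text of 人間失格 (No Longer Human)
        ningen <- df %>%
          filter(title == "人間失格") %>%
          pull(main_text)
        print(df)
        print(ningen)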
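  9. Removal with regular expressions
    Regular expression
    - A way to express a set of strings as a single string (Wikipedia)
    - Example: {"くろき", "くりき", "くさき"} → "く.{2}"
      - "." matches any single character
      - "{}" specifies the number of repetitions
        # Remove line breaks, parenthesized runs of 1 to 10 characters,
        # half-width and full-width spaces, and the character 一
        ningen <- str_remove_all(ningen, "\n|\r|（.{1,10}）| |　|一")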
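The two regex features used here (any character, and a repetition count) can be tried out with base R alone. The strings below are romanized stand-ins for the slide's examples, an assumption made so that the sketch does not depend on how the local environment handles Japanese text. `grepl()` and `gsub()` are the base R counterparts of stringr's detect and remove-all functions.

```r
# Romanized stand-ins for the slide's example strings (hypothetical data)
candidates <- c("kuroki", "kuriki", "kusaki", "kuki")

# "ku", then exactly two arbitrary characters, then "ki"
matches <- grepl("^ku.{2}ki$", candidates)
print(matches)

# Cleaning in the spirit of the str_remove_all() call: drop line breaks,
# parenthesized runs of 1 to 10 characters, and spaces
raw <- "kare (he) wa\nhashitta"
cleaned <- gsub("\n|\r|\\(.{1,10}\\)| ", "", raw)
print(cleaned)
```

Note that ASCII parentheses must be escaped in the pattern, whereas the full-width parentheses in the slide's pattern are ordinary characters and need no escaping.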
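  10. Morphological analysis and stop word removal
    Use the list linked earlier as the stop words.
        stopwords <- scan(file = "http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt",
                          what = "")  # what = "": read the input as character strings
        ningen_words <- RMeCabC(ningen, mypref = 1) %>%  # morphological analysis, converting to base forms
          unlist() %>%                                   # list -> vector
          tibble(verb = names(.),                        # convert to a data frame; add a part-of-speech column
                 word = .) %>%
          filter(!word %in% stopwords)                   # remove stop words; %in% is TRUE when the word is in the list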
  11. Computing TF
        # Symbols, particles, and auxiliary verbs dominate
        count(ningen_words, verb, sort = TRUE)
        count(ningen_words, word, sort = TRUE)
        # Count only nouns, adjectives, and verbs
        ningen_words %>%
          filter(verb %in% c("名詞", "形容詞", "動詞")) %>%
          count(word, sort = TRUE)
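  12. Morphological analysis of multiple novels
    You do not need to study this code closely; just run it, then inspect df2.
        res_mecab <- df %>%
          filter(author == "芥川龍之介") %>%
          mutate(mecab = map(main_text, RMeCabC, mypref = 1)) %>%  # analyze each novel, base forms
          select(author, title, mecab)
        df2 <- tibble()
        for (i in seq_len(nrow(res_mecab))) {
          try({  # if one novel fails, report the error and continue with the next
            df2 <- tibble(verb = sapply(res_mecab$mecab[[i]], names),  # part of speech
                          word = sapply(res_mecab$mecab[[i]], c)) %>%   # word
              filter(verb %in% c("名詞", "形容詞", "動詞")) %>%  # keep nouns, adjectives, verbs
              filter(!word %in% stopwords) %>%                   # remove stop words
              mutate(author = res_mecab$author[[i]],
                     title = res_mecab$title[[i]]) %>%
              bind_rows(df2, .)                                  # append to the rows collected so far
          })
        }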
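For readers who do want to see the idea behind the loop, the same "one table per novel, then stack them" pattern can be sketched in base R on toy data. Everything here is hypothetical: the `results` list stands in for the per-novel analyzer output, English labels stand in for the Japanese part-of-speech tags, and `lapply()` with a single `rbind` replaces growing `df2` inside a loop. Unlike the slide's code, this sketch has no `try()` guard.

```r
# Hand-built stand-in for per-novel analysis results (hypothetical data);
# each element is a named character vector whose names are parts of speech
results <- list(
  novel_a = c(noun = "river", particle = "of", verb = "flow"),
  novel_b = c(noun = "gate", verb = "wait")
)
stopwords <- c("of")

# Build one small data frame per novel ...
per_novel <- lapply(names(results), function(title) {
  tagged <- results[[title]]
  out <- data.frame(verb = names(tagged), word = unname(tagged),
                    title = title, stringsAsFactors = FALSE)
  # keep content words only, and drop stop words
  out[out$verb %in% c("noun", "adjective", "verb") & !out$word %in% stopwords, ]
})

# ... then bind them together once
df2 <- do.call(rbind, per_novel)
print(df2)
```

The result has one row per retained word, with its part of speech and the title of the novel it came from, which is the layout the next step needs for counting.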
  13. In-class exercise
        # Fill in each ***** blank
        TF <- df2 %>%
          count( ***** , ***** , name = "TF")
        IDF <- df2 %>%
          distinct(title, word) %>%
          group_by(word) %>%
          summarise(IDF = ***** )
        TF_IDF <- TF %>%
          left_join(IDF, by = ***** ) %>%
          mutate(TF_IDF = TF * IDF) %>%
          group_by(title) %>%
          mutate(TF_IDF = TF_IDF / sum(TF)) %>%
          ungroup() %>%
          arrange(desc(TF_IDF))