
#javajo Getting Started with Machine Learning in Java/Scala

Slides from the talk given at https://javajo.doorkeeper.jp/events/27588.

KOMIYA Atsushi

July 23, 2015


Transcript

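  1. We're hiring!
     iOS engineers / Android engineers / web application engineers / productivity engineers / machine learning and natural language processing engineers / growth hack engineers / server-side engineers / advertising engineers…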
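  2. What do we use as input data?
     • Only numeric sequences (vectors) can be handled
     • Unstructured data (images, audio, text, access logs, etc.) cannot be used as is
       • Features must be extracted and turned into vectors
       • This is what is known as feature engineering
     • For the training data of supervised learning, the correct answer is additionally attached as a "label"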
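  3. What do we use as input data?
     • Extracting and transforming features
       • Categorical variables: One-hot encoding
       • Natural language: Term frequency, Word2vec, etc.
       • Images: SIFT, SURF, AKAZE, etc.
       • Recently, Deep learning is also used for feature extraction
     • Representing high-dimensional, sparse feature vectors
       • Feature hashing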
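One-hot encoding of a categorical variable can be sketched in plain Java as follows. This is a minimal illustration only: the class `OneHotSketch`, its `encode` method, and the category names are hypothetical and do not come from any library mentioned in the slides.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of any library named in the slides.
final class OneHotSketch {
    // Returns a vector with 1.0 at the position of `value` in `categories` and 0.0 elsewhere.
    static double[] encode(List<String> categories, String value) {
        double[] vector = new double[categories.size()];
        int index = categories.indexOf(value);
        if (index < 0) {
            throw new IllegalArgumentException("unknown category: " + value);
        }
        vector[index] = 1.0;
        return vector;
    }

    public static void main(String[] args) {
        // Made-up category list, used only for the demonstration.
        List<String> platforms = Arrays.asList("ios", "android", "web");
        System.out.println(Arrays.toString(encode(platforms, "android"))); // prints [0.0, 1.0, 0.0]
    }
}
```

Each category gets its own dimension, which is why categorical variables with many distinct values lead to the high-dimensional, sparse vectors mentioned on the slide.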
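  4. Are the results we obtained correct?
     • Verifying correctness
       • k-fold cross validation
     • Measuring correctness
       • Classification
         • AUC, Precision, Recall, F-measure
       • Prediction / regression
         • Correlation coefficient, coefficient of determination, MAE, RMSE, LogLoss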
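As a rough illustration of the classification metrics listed above, the following plain-Java sketch computes precision, recall, and F-measure (F1) from 0/1 labels. The class and method names are hypothetical, and treating label 1 as the positive class is an assumption made for the example.

```java
// Hypothetical helper for illustration only.
final class BinaryMetricsSketch {
    // Returns {precision, recall, f1} for 0/1 labels, treating 1 as the positive class (an assumption).
    static double[] precisionRecallF1(int[] actual, int[] predicted) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < actual.length; i++) {
            if (predicted[i] == 1 && actual[i] == 1) tp++;       // true positive
            else if (predicted[i] == 1 && actual[i] == 0) fp++;  // false positive
            else if (predicted[i] == 0 && actual[i] == 1) fn++;  // false negative
        }
        double precision = tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
        double recall = tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
        double f1 = precision + recall == 0.0 ? 0.0 : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        int[] actual = {1, 1, 1, 1, 0, 0};
        int[] predicted = {1, 1, 0, 0, 1, 0};
        double[] m = precisionRecallF1(actual, predicted);
        System.out.println("precision=" + m[0] + " recall=" + m[1] + " f1=" + m[2]);
    }
}
```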
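  5. Online learning and offline learning
     • Online learning
       • The model is updated continually from data as it arrives
       • Think of it as stream processing
       • Data that has been used can be discarded instead of stored
       • (Grab one, toss it, grab the next, toss it…)
     • Offline learning
       • The model is updated in one go from accumulated data
       • Corresponds to batch processing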
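The online-learning idea above can be sketched as a model that takes one stochastic gradient step per incoming example and then lets the example go. This is a minimal sketch using logistic regression as the model; the class name, learning rate, and toy data are assumptions for illustration and are not taken from the slides. An offline learner would instead iterate over a stored dataset.

```java
// Hypothetical online learner for illustration; one gradient step per example.
final class OnlineLogisticSketch {
    private final double[] weights;
    private final double learningRate;

    OnlineLogisticSketch(int dimensions, double learningRate) {
        this.weights = new double[dimensions];
        this.learningRate = learningRate;
    }

    // Predicted probability that x belongs to class 1.
    double predict(double[] x) {
        double z = 0.0;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Updates the weights from a single example; the example is not stored anywhere.
    void update(double[] x, int label) {
        double error = predict(x) - label;
        for (int i = 0; i < weights.length; i++) {
            weights[i] -= learningRate * error * x[i];
        }
    }

    public static void main(String[] args) {
        OnlineLogisticSketch model = new OnlineLogisticSketch(2, 0.1);
        // Simulated stream: feature 0 indicates class 1, feature 1 indicates class 0.
        for (int step = 0; step < 100; step++) {
            model.update(new double[] {1.0, 0.0}, 1);
            model.update(new double[] {0.0, 1.0}, 0);
        }
        System.out.println(model.predict(new double[] {1.0, 0.0}));
        System.out.println(model.predict(new double[] {0.0, 1.0}));
    }
}
```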
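  6. An example workflow
     1. Identify the problem to be solved
        • What kind of task fits it?
     2. Deepen your understanding of the data you hold
        • What kind of features can be extracted?
     3. Build a model
        • Which algorithm should be used?
        • Which features should be used?
     4. Evaluate the model you built
        • How accurate is it?
     5. Integrate it into a system / turn it into a system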
  7. An example workflow (same five steps as slide 6)
     Callout: this is roughly where machine learning is put to use
  8. An example workflow (same five steps as slide 6)
     Callout: this part requires ad hoc analysis
  9. An example workflow (same five steps as slide 6)
     Callout: this is roughly where Java is a good fit
  10. An example workflow (same five steps as slide 6)
     Callout: Spark (Scala) might be able to cover this part as well
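  11. Use the right tool for the job
     • Do ad hoc analysis and model building in R or Python
       • Interactive operation is easy
       • Repeated trial and error is easy
       • Spark's interactive shell may also be a good option
     • Use Python, Java, or Scala for the system integration part
       • Cases that demand performance are exactly where Java and Scala come in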
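  12. Spark / MLlib
     • gradle 'org.apache.spark:spark-mllib_2.10:1.1.1'
     • https://github.com/apache/spark
     • ★ 2,336 → 4,813
     • MLlib is a machine learning library designed to be used on the distributed processing framework Spark
     • Features are still being added and improved at a brisk pace
     • Can also be used as an environment for ad hoc analysis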
  13. liblinear-java
     • gradle 'de.bwaldvogel:liblinear:1.95'
     • https://github.com/bwaldvogel/liblinear-java
     • ★ 121 → 144
     • A Java port of the library that specializes LibSVM for linear classification and regression
     • Makes a fair effort to keep up with the latest version of the original (C++) implementation
  14. Mahout
     • gradle 'org.apache.mahout:mahout-core:0.9'
     • https://github.com/apache/mahout
     • ★ 229 → 507
     • A machine learning library on the distributed processing framework Hadoop
     • Development appears to be saying "Goodbye MapReduce" and improving affinity with Spark and h2o
       • https://issues.apache.org/jira/browse/MAHOUT-1510
     • Even so, there is a faint air of it being past its prime…
  15. SAMOA
     • https://github.com/yahoo/samoa
     • ★ 363 → 397
     • A machine learning library that runs on distributed streaming frameworks such as Storm
     • Made by Yahoo! Labs
     • Development does not seem very active lately?
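  16. Jubatus
     • https://github.com/jubatus/jubatus
     • ★ 389 → 453
     • A distributed processing framework and online machine learning library
     • The core is implemented in C++, but a Java client library is provided
     • Still under active development, for example a Bandit algorithm has been implemented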
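  17. h2o
     • https://github.com/h2oai/h2o
     • ★ 1,333 → 1,741
     • A machine learning library that can be used on the distributed processing framework Hadoop
     • If you want to do the much-talked-about Deep learning in Java, is this the only choice!?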
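  18. Spam mail detection
     • Decide whether a given text is spam or not
     • Corresponds to a classification task in supervised learning
     • Extract term frequency from the text as features
     • This time we use logistic regression (rather than the usual Naive Bayes)
     • Tune parameters with grid search while evaluating the model with k-fold cross validation and AUC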
  19. UCI Machine learning repository
     • https://archive.ics.uci.edu/ml/datasets.html
     • Datasets are provided in formats such as CSV files
     • It also hosts Iris, the "Hello world" dataset of the data analysis and machine learning community
     • This time we use the SMS Spam collection
       • https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
  20. SMS Spam collection (TSV file)
     ham   Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
     ham   Ok lar... Joking wif u oni...
     spam  Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
  21. SMS Spam collection (TSV file, same sample rows as slide 20)
     Callout: the first column is the label; ham → not spam, spam → spam
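Reading one row of this TSV file into a label and a text could look like the sketch below. It assumes that the label and the message are separated by a single tab, and it maps `ham` to 0.0 and `spam` to 1.0; that numeric mapping and the class name `SmsLineSketch` are assumptions for illustration, not something stated in the slides.

```java
// Hypothetical parser for illustration; assumes "<label><TAB><text>" rows.
final class SmsLineSketch {
    final double label; // 1.0 for spam, 0.0 for ham (assumed encoding)
    final String text;

    private SmsLineSketch(double label, String text) {
        this.label = label;
        this.text = text;
    }

    static SmsLineSketch parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("expected a tab between label and text");
        }
        String rawLabel = line.substring(0, tab);
        String text = line.substring(tab + 1);
        if (rawLabel.equals("spam")) {
            return new SmsLineSketch(1.0, text);
        } else if (rawLabel.equals("ham")) {
            return new SmsLineSketch(0.0, text);
        }
        throw new IllegalArgumentException("unexpected label: " + rawLabel);
    }

    public static void main(String[] args) {
        SmsLineSketch row = parse("ham\tOk lar... Joking wif u oni...");
        System.out.println(row.label + " | " + row.text);
    }
}
```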
  22. Spark / MLlib
     • Normally you would build a Spark cluster and use that
     • That is a hassle, so this time we make do with a self-contained application
       • https://spark.apache.org/docs/latest/quick-start.html#self-contained-applications
     • In the spirit of a modern Spark application (?), we try out the ML Pipeline API and the DataFrame API
  23. ML Pipeline API
     • Nothing more than a mechanism that makes MLlib's algorithms easier to use
     • Note that not all of MLlib's algorithms are available through it
       • "Developers should contribute new algorithms to spark.mllib and can optionally contribute to spark.ml."
       • K-Means, for example, cannot be used
  24. DataFrame API
     • A dataset that carries schema information
     • SQL-like operations can be applied to the dataset
       • select / join / filter / aggregation and so on
     • Compared with RDDs, the performance differences between language bindings are small
     • For details, see Ishikawa-san's slideshare
       • http://www.slideshare.net/yuishikawa/2015-0312-lt2-spark-dataframe-introduction
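  25. CSV / TSV to DataFrame
     • Use spark-csv to load a CSV / TSV file as a DataFrame
       • https://github.com/databricks/spark-csv
     • It seems better to specify the schema explicitly
       • Without a schema every column is treated as a string, so take particular care when the file contains numeric columns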
  26. Feature extraction from text data
     • Tokenize the text on white space
       • org.apache.spark.ml.feature.Tokenizer
     • Count how often each word occurs (TF, term frequency)
       • org.apache.spark.ml.feature.HashingTF
       • Uses the hashing trick
     • Prepare a label column and a features column so that the DataFrame can be fed to LogisticRegression
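To make the two steps concrete without a Spark dependency, here is a plain-Java sketch of whitespace tokenization followed by hashed term frequencies. It is not Spark's implementation: `String.hashCode` stands in for whatever hash function `HashingTF` uses internally, a dense `double[]` replaces a sparse vector for brevity, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical stand-in for the Tokenizer + HashingTF steps; not Spark code.
final class HashingTfSketch {
    // Lowercases the text and splits it on runs of whitespace.
    static String[] tokenize(String text) {
        String normalized = text.trim().toLowerCase(Locale.ROOT);
        return normalized.isEmpty() ? new String[0] : normalized.split("\\s+");
    }

    // Counts tokens per hash bucket; different tokens may collide in the same bucket.
    static double[] termFrequencies(String[] tokens, int numFeatures) {
        double[] features = new double[numFeatures];
        for (String token : tokens) {
            // String.hashCode is an assumption; a real implementation may use another hash.
            int bucket = Math.floorMod(token.hashCode(), numFeatures);
            features[bucket] += 1.0;
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("Free entry in 2 a wkly comp to win FA Cup final tkts");
        System.out.println(Arrays.toString(termFrequencies(tokens, 16)));
    }
}
```

The vector length is fixed by `numFeatures` regardless of vocabulary size, which is the point of the hashing trick; the price is that colliding words become indistinguishable.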
  27. Logistic regression in Spark / MLlib
     • org.apache.spark.ml.classification.LogisticRegression
     • The optimizer is LBFGS only (for now?); SGD cannot be used
     • Parameters
       • regParam: regularization parameter
       • elasticNetParam: 0.0 gives L2 regularization, 1.0 gives L1 regularization
       • maxIter: maximum number of iterations allowed for convergence
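How `regParam` and `elasticNetParam` interact can be sketched as a penalty function. The formula below assumes the elastic-net form described in the Spark ML documentation, regParam × (elasticNetParam × ‖w‖₁ + (1 − elasticNetParam) / 2 × ‖w‖₂²); the class and method names are hypothetical and this is not Spark's code.

```java
// Hypothetical helper; assumes the elastic-net penalty form described in the Spark ML docs.
final class ElasticNetSketch {
    // regParam scales the whole penalty; elasticNetParam mixes L1 (1.0) and L2 (0.0).
    static double penalty(double[] weights, double regParam, double elasticNetParam) {
        double l1 = 0.0;
        double l2Squared = 0.0;
        for (double w : weights) {
            l1 += Math.abs(w);
            l2Squared += w * w;
        }
        return regParam * (elasticNetParam * l1 + (1.0 - elasticNetParam) * 0.5 * l2Squared);
    }

    public static void main(String[] args) {
        double[] weights = {3.0, -4.0};
        System.out.println("pure L2: " + penalty(weights, 0.1, 0.0));
        System.out.println("pure L1: " + penalty(weights, 0.1, 1.0));
        System.out.println("mixed:   " + penalty(weights, 0.1, 0.5));
    }
}
```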
  28. What we tune this time
     • Feature extraction
       • numFeatures: number of dimensions after feature hashing
     • Logistic regression
       • regParam: regularization parameter
       • maxIter: number of iterations until convergence
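  29. AUC (Area under the curve)
     • Figure: ROC curve, quoted from Prof. Okumura's "ROC curve" page (https://oku.edu.mie-u.ac.jp/~okumura/stat/ROC.html)
     • The area under this curve is the AUC
     • The larger the area (the closer to 1), the better the accuracy
     • True positive: the probability that spam is correctly detected
     • False positive: the probability that a message is wrongly judged to be spam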
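AUC can also be computed without drawing the curve: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, counting ties as one half. The sketch below uses that pairwise definition in plain Java; it is quadratic in the number of examples, the names are hypothetical, and it is not the code Spark uses.

```java
// Hypothetical helper; pairwise AUC, O(n^2), for illustration only.
final class AucSketch {
    // labels: 1 = positive (spam), 0 = negative (ham); scores: higher means "more likely positive".
    static double auc(double[] scores, int[] labels) {
        double wins = 0.0;
        long pairs = 0;
        for (int i = 0; i < scores.length; i++) {
            if (labels[i] != 1) continue;
            for (int j = 0; j < scores.length; j++) {
                if (labels[j] != 0) continue;
                pairs++;
                if (scores[i] > scores[j]) wins += 1.0;       // positive ranked above negative
                else if (scores[i] == scores[j]) wins += 0.5; // tie counts as half
            }
        }
        if (pairs == 0) {
            throw new IllegalArgumentException("need at least one positive and one negative example");
        }
        return wins / pairs;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.3, 0.1};
        int[] labels = {1, 0, 1, 0};
        System.out.println(auc(scores, labels)); // prints 0.75
    }
}
```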
  30. Cross validation & metrics
     • Cross validation
       • org.apache.spark.ml.tuning.CrossValidator
       • #fit() returns the model trained on the given training data together with the best parameters
     • Metric used for parameter selection
       • org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
       • Computes the AUC for us
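Conceptually, combining a parameter grid with k-fold cross validation means scoring every candidate parameter on each held-out fold and keeping the candidate with the best mean score. The plain-Java sketch below shows only that control flow; it is not Spark's `CrossValidator`. The `FoldScorer` interface, the modulo-based fold assignment, and all names are assumptions made for illustration.

```java
import java.util.stream.IntStream;

// Hypothetical sketch of grid search with k-fold cross validation; not Spark code.
final class GridSearchCvSketch {
    // Trains with `param` on trainIdx and returns a score on testIdx (higher is better, e.g. AUC).
    interface FoldScorer {
        double score(double param, int[] trainIdx, int[] testIdx);
    }

    // Returns the grid value with the highest mean score over k folds of n examples.
    static double bestParam(double[] grid, int n, int k, FoldScorer scorer) {
        double best = grid[0];
        double bestMean = Double.NEGATIVE_INFINITY;
        for (double param : grid) {
            double total = 0.0;
            for (int fold = 0; fold < k; fold++) {
                final int held = fold;
                // Example i belongs to fold i % k (no shuffling, to keep the sketch deterministic).
                int[] train = IntStream.range(0, n).filter(i -> i % k != held).toArray();
                int[] test = IntStream.range(0, n).filter(i -> i % k == held).toArray();
                total += scorer.score(param, train, test);
            }
            double mean = total / k;
            if (mean > bestMean) {
                bestMean = mean;
                best = param;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy scorer that ignores the data and simply prefers values near 0.1.
        double best = bestParam(new double[] {0.01, 0.1, 1.0}, 10, 5,
                (param, train, test) -> -Math.abs(param - 0.1));
        System.out.println(best); // prints 0.1
    }
}
```

In a real run the scorer would fit a model on the training indices and return a metric such as AUC on the held-out indices.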