カンファレンスセッションの選択傾向を知りたい / Let’s study trends of entry to conference sessions

Slides presented at Kanazawa.rb meetup #84.

Sample code used for this analysis:
https://github.com/TAKAyukiatkwsk/session_analytics_sample

TAKAyukiatkwsk

August 17, 2019
Transcript

  1. Who am I?
    • Takayuki Takagi (高木貴之 / ニボシーニョ)
    • @TAKAyuki_atkwsk / takayukiatkwsk
    • Freelance programmer
    • Remote work
    • Scala, Ruby, Python, AWS, Docker, etc.
    • Likes beer and gyoza
  2. Today’s topic
    • I want to know the trends in the sessions that conferences select
      ◦ Extract characteristic words from session titles and descriptions with NLP (natural language processing)
      ◦ But I’m NOT familiar with NLP, so I want to use the easiest tools possible
      ◦ Easy tools: cloud APIs
  3. Cloud APIs for NLP
    • Amazon Comprehend API (AWS)
    • Cloud Natural Language API (GCP)
      ◦ Syntactic analysis
    • Text Analytics API (Azure)
      ◦ Key-phrase extraction API
    -> These APIs can be used directly with Japanese text!
  4. Make input data
    • Copy session titles and descriptions to a spreadsheet
      ◦ Japan Container Days 2018 (no descriptions)
      ◦ Scala Kansai Summit 2018
      ◦ JAWS DAYS 2019
      ◦ Scala Matsuri 2019
      ◦ Google Cloud Next Tokyo 2019
    • Export as CSV (script input)
      ◦ id, title
      ◦ id, title + description
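
The two script inputs listed on this slide could be built from a single exported CSV roughly as follows. This is a minimal sketch, not the author's script: the column names (`id`, `title`, `description`), the sample rows, and the `build_inputs` helper are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical export; the real spreadsheet's columns and rows may differ.
SAMPLE_CSV = """id,title,description
1,Kubernetes in production,Lessons learned from running clusters
2,Intro to Akka,
"""


def build_inputs(csv_text):
    """Return (title_rows, title_desc_rows) as lists of (id, text) tuples."""
    title_rows = []
    title_desc_rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        title = row["title"].strip()
        description = (row.get("description") or "").strip()
        title_rows.append((row["id"], title))
        # Sessions without a description fall back to the title alone.
        combined = f"{title} {description}".strip()
        title_desc_rows.append((row["id"], combined))
    return title_rows, title_desc_rows


title_rows, title_desc_rows = build_inputs(SAMPLE_CSV)
```

Keeping both inputs keyed by the same id makes it straightforward to compare what an analysis returns for a title alone against what it returns for the title plus description.
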
  5. Analysis methods
    1. Extract key phrases with Text Analytics API
    2. Analyze syntax with Cloud Natural Language API
    3. Analyze syntax with MeCab + NEologd (for comparison)
    • Source code
      ◦ https://github.com/TAKAyukiatkwsk/session_analytics_sample
  6. Key-phrase frequency
    • [title] Finds technical words, but their frequencies are low (max 4, mostly 1)
    • [title + description] Frequencies are higher (max 9), but the frequent phrases are not technical words
    • “Kubernetes” is counted 4 times with title only, but 3 times with title + description
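
Once key phrases have been extracted for each session, the frequencies discussed on this slide amount to a simple tally. The sketch below assumes each phrase is counted at most once per session and matched case-sensitively; the slides do not say how the author counted, and the sample phrases and the `key_phrase_frequency` helper are hypothetical.

```python
from collections import Counter


def key_phrase_frequency(phrases_per_session):
    """Count in how many sessions each key phrase appears."""
    counts = Counter()
    for phrases in phrases_per_session:
        # A set keeps a phrase repeated within one session from counting twice.
        counts.update(set(phrases))
    return counts


# Hypothetical key phrases, one list per session (not real API output).
sample = [
    ["Kubernetes", "production"],
    ["Kubernetes", "Istio"],
    ["Scala", "Akka"],
]
frequency = key_phrase_frequency(sample)
top = frequency.most_common(1)
```

Because phrases are matched as exact strings, a longer extracted phrase that merely contains a word is tallied separately from the word itself.
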
  7. N-gram Frequency
    • [title unigram] More general topics (e.g. Scala, Kubernetes, サービス (service), コンテナ (container), Cloud)
    • Trends: Scala, Kubernetes, Akka, 機械学習 (machine learning), サーバーレス (serverless), Cloud Spanner, マイクロサービス (microservices)
  8. N-gram Frequency
    • [title + description bigram] Yields more understandable words than the title bigram
    • “分散トレーシング” (distributed tracing) is a characteristic phrase
  9. N-gram Frequency
    • “型” (type) is tokenized as a noun (the Natural Language API treats it as an affix)
    • “機械学習” (machine learning) and “サーバーレス” (serverless) are each tokenized as one word
    • “関数型” (functional) is a characteristic phrase
  10. N-gram Frequency
    • Abstract words such as “よう”, “こと”, and “ため” do not appear
    • “GraphQL” and “マイクロサービス” (microservices) are each tokenized as one word (not in chart)
    • “分散トレーシング” (distributed tracing) is a characteristic phrase
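
The unigram and bigram counts on the N-gram slides can be sketched in a few lines once each session's text has been tokenized. The example assumes the tokens have already been produced by a tokenizer such as the Natural Language API or MeCab; the token lists and the `ngrams` / `ngram_frequency` helpers are hypothetical and are not taken from the author's repository.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_frequency(tokenized_sessions, n):
    """Count n-grams across sessions without crossing session boundaries."""
    counts = Counter()
    for tokens in tokenized_sessions:
        counts.update(ngrams(tokens, n))
    return counts


# Hypothetical, already-tokenized sessions (not real tokenizer output).
sessions = [
    ["分散", "トレーシング", "入門"],
    ["マイクロサービス", "分散", "トレーシング"],
]
unigrams = ngram_frequency(sessions, 1)
bigrams = ngram_frequency(sessions, 2)
```

Under this assumed tokenization, a compound that the tokenizer splits into two tokens only shows up as a phrase in the bigram counts, while a compound kept as one token already appears among the unigrams.
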
  11. Results
    • Trends: Kubernetes, Serverless, Scala, 分散トレーシング (distributed tracing), 関数型 (functional), Akka, Cloud Spanner
      ◦ Frequencies depend on the quality of the titles and descriptions
    • Cloud APIs are useful
    • Key-phrase extraction alone is not enough; using N-grams as well works better
    • MeCab + NEologd can analyze better than the Natural Language API (for Japanese, or for this specific domain?)
  12. References
    • Cloud Natural Language | Cloud Natural Language API | Google Cloud
      ◦ https://cloud.google.com/natural-language/?hl=ja
    • What is the Text Analytics API? - Features - Azure Cognitive Services | Microsoft Docs
      ◦ https://docs.microsoft.com/ja-jp/azure/cognitive-services/text-analytics/overview
    • Amazon Comprehend (discover insights and relationships in text) | AWS
      ◦ https://aws.amazon.com/jp/comprehend/
    • Review trends of highly rated ramen shops as seen through TF-IDF (natural language processing, TF-IDF, MeCab, wordcloud, morphological analysis, word segmentation) - ギークなエンジニアを目指す男
      ◦ https://www.takapy.work/entry/2019/01/14/142128
    • Text analysis using N-gram models: index page
      ◦ http://www.shuiren.org/chuden/teach/n-gram/index-j.html