Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Text Mining: Exploratory Data Analysis to Machi...
Search
Julia Silge
March 04, 2019
Technology
260
1
Share
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Julia Silge
March 04, 2019
More Decks by Julia Silge
See All by Julia Silge
Introducing Positron
juliasilge
1
360
The right tool for the job
juliasilge
0
76
Good practices for applied machine learning
juliasilge
0
240
Applied machine learning with tidymodels
juliasilge
0
170
Maintaining an R Package
juliasilge
0
430
Publishing the Stack Overflow Developer Survey
juliasilge
2
96
Text Mining Using Tidy Data Principles
juliasilge
0
180
North American Developer Hiring Landscape
juliasilge
0
83
Understanding Principal Component Analysis Using Stack Overflow Data
juliasilge
13
4.6k
Other Decks in Technology
See All in Technology
2026年に相応しい 最先端プラグインホストの設計<del>と実装</del>
atsushieno
0
120
Introduction to Sansan, inc / Sansan Global Development Center, Inc.
sansan33
PRO
0
3k
組織的なAI活用を阻む 最大のハードルは コンテキストデザインだった
ixbox
7
1.8k
60分で学ぶ最新Webフロントエンド
mizdra
PRO
33
16k
ログ基盤・プラグイン・ダッシュボード、全部整えた。でも最後は人だった。
makikub
5
1.9k
非エンジニア職からZOZOへ 〜登壇がキャリアに与えた影響〜
penpeen
0
430
Bill One 開発エンジニア 紹介資料
sansan33
PRO
5
18k
Contract One Engineering Unit 紹介資料
sansan33
PRO
0
16k
JOAI2026講評会資料(近藤佐介)
element138
1
110
Azure Speech で音声対応してみよう
kosmosebi
0
100
建設的な現実逃避のしかた / How to practice constructive escapism
pauli
4
330
Databricksで構築するログ検索基盤とアーキテクチャ設計
cscengineer
0
180
Featured
See All Featured
"I'm Feeling Lucky" - Building Great Search Experiences for Today's Users (#IAC19)
danielanewman
231
23k
Side Projects
sachag
455
43k
Noah Learner - AI + Me: how we built a GSC Bulk Export data pipeline
techseoconnect
PRO
0
160
Into the Great Unknown - MozCon
thekraken
40
2.3k
Taking LLMs out of the black box: A practical guide to human-in-the-loop distillation
inesmontani
PRO
3
2.1k
Templates, Plugins, & Blocks: Oh My! Creating the theme that thinks of everything
marktimemedia
31
2.7k
How to build a perfect <img>
jonoalderson
1
5.4k
Why You Should Never Use an ORM
jnunemaker
PRO
61
9.8k
Building the Perfect Custom Keyboard
takai
2
720
How to Think Like a Performance Engineer
csswizardry
28
2.5k
The Success of Rails: Ensuring Growth for the Next 100 Years
eileencodes
47
8k
4 Signs Your Business is Dying
shpigford
187
22k
Transcript
T E X T M I N I N G
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO T I D Y T E X T Data
Scientist at Stack Overflow @juliasilge https://juliasilge.com/ I’m Julia Silge
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = T I D
Y T E X T
https://github.com/juliasilge/tidytext
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
T I D Y T E X T EXPLORATORY DATA
ANALYSIS N-GRAMS AND MORE WORDS MACHINE LEARNING
EXPLORATORY DATA ANALYSIS T I D Y T E X
T
from the Washington Post’s Wonkblog
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT? T I D Y T
E X T TERM FREQUENCY INVERSE DOCUMENT FREQUENCY
None
None
• As part of the NASA Datanauts program, I worked
on a project to understand NASA datasets • Metadata includes title, description, keywords, etc
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L N-GRAMS, NETWORKS, & NEGATION
None
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TOPIC MODELING
TOPIC MODELING T I D Y T E X T
•Each DOCUMENT = mixture of topics •Each TOPIC = mixture of words
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TEXT CLASSIFICATION
TRAIN A GLMNET MODEL T I D Y T E
X T
TEXT CLASSIFICATION T I D Y T E X T
> library(glmnet) > library(doMC) > registerDoMC(cores = 8) > > is_jane <- books_joined$title == "Pride and Prejudice" > > model <- cv.glmnet(sparse_words, is_jane, family = "binomial", + parallel = TRUE, keep = TRUE)
None
None
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com JULIA SILGE
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com Author portraits from Wikimedia Photos by Glen Noble and Kimberly Farmer on Unsplash JULIA SILGE