Julia Silge
March 04, 2019
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Transcript
TEXT MINING
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO
I’m Julia Silge
Data Scientist at Stack Overflow
@juliasilge
https://juliasilge.com/
TIDYTEXT
TEXT DATA IS INCREASINGLY IMPORTANT
NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = TIDYTEXT
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
EXPLORATORY DATA ANALYSIS
N-GRAMS AND MORE WORDS
MACHINE LEARNING
EXPLORATORY DATA ANALYSIS
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT?
TERM FREQUENCY
INVERSE DOCUMENT FREQUENCY
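The slide names the two ingredients of tf-idf without showing how they combine. A minimal sketch in base R, assuming a made-up three-document corpus and the common definition idf = ln(number of documents / number of documents containing the word); the tidytext package offers bind_tf_idf() for the same statistic, but this is a hand-rolled illustration, not that function's code:

```r
# Toy corpus: three tiny "documents" (invented for illustration).
docs <- c(
  a = "the cat sat on the mat",
  b = "the dog sat on the log",
  c = "the cat chased the dog"
)

# One row per document-word pair: the "one token per row" tidy layout.
tokens <- do.call(rbind, lapply(names(docs), function(d) {
  data.frame(document = d, word = strsplit(docs[[d]], " ")[[1]],
             stringsAsFactors = FALSE)
}))

# Count each word within each document, dropping zero counts.
counts <- as.data.frame(table(document = tokens$document, word = tokens$word),
                        responseName = "n", stringsAsFactors = FALSE)
counts <- counts[counts$n > 0, ]

# tf: the share of a document's words taken up by this word.
counts$tf <- counts$n / ave(counts$n, counts$document, FUN = sum)

# idf: log(number of documents / number of documents containing the word).
docs_with_word <- table(counts$word)
counts$idf <- log(length(docs) / as.numeric(docs_with_word[counts$word]))

# tf-idf is high for words frequent in one document but rare elsewhere.
counts$tf_idf <- counts$tf * counts$idf
counts[order(-counts$tf_idf), ]
```

A word such as "the", which appears in every document, gets an idf of zero and so a tf-idf of zero, while a word confined to one document rises to the top.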
• As part of the NASA Datanauts program, I worked on a project to understand NASA datasets
• Metadata includes title, description, keywords, etc.
TAKING TIDYTEXT TO THE NEXT LEVEL
N-GRAMS, NETWORKS, & NEGATION
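The slides for this section are images, so the transcript carries no detail. As a rough sketch of why bigrams matter for negation, the base R snippet below pairs each word with its successor and picks out words preceded by "not". The sentence is invented, and the approach is an illustration rather than the talk's own code:

```r
# A made-up sentence for illustration.
text <- "i am not happy and i do not like this"
words <- strsplit(text, " ")[[1]]

# Bigrams: each word paired with the word that follows it.
bigrams <- data.frame(
  word1 = head(words, -1),
  word2 = tail(words, -1),
  stringsAsFactors = FALSE
)

# Negation: words preceded by "not". A sentiment lexicon that scores
# words one at a time would treat "happy" and "like" as positive here.
negated <- bigrams$word2[bigrams$word1 == "not"]
negated
```

Counting how often each word1-word2 pair occurs in a larger corpus gives a weighted edge list, which is one way to get from bigrams to a word network.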
TAKING TIDYTEXT TO THE NEXT LEVEL
TOPIC MODELING
TOPIC MODELING
• Each DOCUMENT = mixture of topics
• Each TOPIC = mixture of words
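The two mixtures can be written as two small probability matrices. The sketch below uses hypothetical numbers and hypothetical words; the names topic_word and doc_topic are chosen for this example (they play the roles that tidytext's tidy() output for LDA models labels "beta" and "gamma"). It shows the arithmetic of the mixture, not how a topic model is fitted:

```r
# Hypothetical numbers, chosen only to illustrate the two mixtures.
# topic_word: each row is a topic, a probability distribution over words.
topic_word <- rbind(
  topic1 = c(ocean = 0.6, ice = 0.3, soil = 0.1),
  topic2 = c(ocean = 0.1, ice = 0.1, soil = 0.8)
)

# doc_topic: each row is a document, a probability distribution over topics.
doc_topic <- rbind(
  doc1 = c(topic1 = 0.9, topic2 = 0.1),
  doc2 = c(topic1 = 0.2, topic2 = 0.8)
)

# Word probabilities per document: blend the topics' word distributions
# using each document's topic proportions.
word_probs <- doc_topic %*% topic_word
word_probs

# Each row is still a probability distribution over words.
rowSums(word_probs)
```

Fitting a topic model runs this in reverse: it starts from observed word counts and estimates both matrices.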
TAKING TIDYTEXT TO THE NEXT LEVEL
TEXT CLASSIFICATION
TRAIN A GLMNET MODEL
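TEXT CLASSIFICATION
library(glmnet)
library(doMC)
registerDoMC(cores = 8)  # register a parallel backend with 8 cores

# books_joined and sparse_words are not defined in this transcript
is_jane <- books_joined$title == "Pride and Prejudice"

# cross-validated regularized logistic regression
model <- cv.glmnet(sparse_words, is_jane, family = "binomial",
                   parallel = TRUE, keep = TRUE)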
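The transcript does not show how sparse_words is built. One plausible shape for it is a sparse document-term matrix, with one row per document and one column per word. The sketch below builds such a matrix from hypothetical word counts using sparseMatrix() from the Matrix package; tidytext's cast_sparse() serves a similar purpose, and whether the talk used it is an assumption:

```r
library(Matrix)

# Hypothetical per-document word counts, one row per document-word pair.
word_counts <- data.frame(
  document = c("d1", "d1", "d2", "d2", "d3"),
  word     = c("pride", "ball", "war", "ball", "pride"),
  n        = c(2, 1, 3, 1, 1),
  stringsAsFactors = FALSE
)

docs  <- unique(word_counts$document)
vocab <- unique(word_counts$word)

# Rows are documents, columns are words; zero counts are not stored.
sparse_words <- sparseMatrix(
  i = match(word_counts$document, docs),
  j = match(word_counts$word, vocab),
  x = word_counts$n,
  dimnames = list(docs, vocab)
)

dim(sparse_words)
```

The response vector passed to the model then needs one entry per row of this matrix, in the same document order.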
THANK YOU
JULIA SILGE
@juliasilge
https://juliasilge.com
Author portraits from Wikimedia
Photos by Glen Noble and Kimberly Farmer on Unsplash