Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Text Mining: Exploratory Data Analysis to Machi...
Search
Sponsored
·
SiteGround - Reliable hosting with speed, security, and support you can count on.
→
Julia Silge
March 04, 2019
Technology
260
1
Share
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Julia Silge
March 04, 2019
More Decks by Julia Silge
See All by Julia Silge
Introducing Positron
juliasilge
1
370
The right tool for the job
juliasilge
0
80
Good practices for applied machine learning
juliasilge
0
250
Applied machine learning with tidymodels
juliasilge
0
170
Maintaining an R Package
juliasilge
0
430
Publishing the Stack Overflow Developer Survey
juliasilge
2
98
Text Mining Using Tidy Data Principles
juliasilge
0
180
North American Developer Hiring Landscape
juliasilge
0
85
Understanding Principal Component Analysis Using Stack Overflow Data
juliasilge
13
4.6k
Other Decks in Technology
See All in Technology
UIライブラリに依存しすぎないReact Native設計を目指して
grandbig
0
190
GKE Agent SandboxでAIが生成したコードを 安全に実行してみた
lamaglama39
0
180
AI時代の品質はテストプロセスの作り直し #scrumniigata
kyonmm
PRO
4
1.2k
独断と偏見で試してみる、 シングル or マルチエージェント どっちがいいの?
shichijoyuhi
1
240
ボトムアップの改善の火を灯し続けろ!〜支援現場で学んだ、消えないための3つの打ち手〜 / 20260509 Kazuki Mori
shift_evolve
PRO
2
440
20260428_Product Management Summit_Loglass_JoeHirose
loglassjoe
4
6.7k
サービスの信頼性を高めるため、形骸化した「プロダクションミーティング」を立て直すまでの取り組み
stefafafan
1
230
AI駆動開発で生産性を追いかけたら、行き着いたのは品質とシフトレフトだった
littlehands
0
320
「誰一人取り残されない」 AIエージェント時代のプロダクト設計思想 Product Management Summit 2026
mizushimac
1
2.8k
生成AIはソフトウェア開発の革命か、ソフトウェア工学の宿題再提出なのか -ソフトウェア品質特性の追加提案-
kyonmm
PRO
2
830
Microsoft 365 / Microsoft 365 Copilot : 自分の状態を確認する「ラベル」について
taichinakamura
0
450
AI時代に越境し、 組織を変えるQAスキルの正体 / QA Skills for Transforming an Organization
mii3king
5
3.6k
Featured
See All Featured
Context Engineering - Making Every Token Count
addyosmani
9
860
Future Trends and Review - Lecture 12 - Web Technologies (1019888BNR)
signer
PRO
0
3.5k
個人開発の失敗を避けるイケてる考え方 / tips for indie hackers
panda_program
122
21k
Breaking role norms: Why Content Design is so much more than writing copy - Taylor Woolridge
uxyall
0
270
[SF Ruby Conf 2025] Rails X
palkan
2
1k
Public Speaking Without Barfing On Your Shoes - THAT 2023
reverentgeek
1
380
Discover your Explorer Soul
emna__ayadi
2
1.1k
Bridging the Design Gap: How Collaborative Modelling removes blockers to flow between stakeholders and teams @FastFlow conf
baasie
0
530
Embracing the Ebb and Flow
colly
88
5k
エンジニアに許された特別な時間の終わり
watany
106
240k
AI Search: Implications for SEO and How to Move Forward - #ShenzhenSEOConference
aleyda
1
1.2k
Ruling the World: When Life Gets Gamed
codingconduct
0
220
Transcript
T E X T M I N I N G
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO T I D Y T E X T Data
Scientist at Stack Overflow @juliasilge https://juliasilge.com/ I’m Julia Silge
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = T I D
Y T E X T
https://github.com/juliasilge/tidytext
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
T I D Y T E X T EXPLORATORY DATA
ANALYSIS N-GRAMS AND MORE WORDS MACHINE LEARNING
EXPLORATORY DATA ANALYSIS T I D Y T E X
T
from the Washington Post’s Wonkblog
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT? T I D Y T
E X T TERM FREQUENCY INVERSE DOCUMENT FREQUENCY
None
None
• As part of the NASA Datanauts program, I worked
on a project to understand NASA datasets • Metadata includes title, description, keywords, etc
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L N-GRAMS, NETWORKS, & NEGATION
None
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TOPIC MODELING
TOPIC MODELING T I D Y T E X T
•Each DOCUMENT = mixture of topics •Each TOPIC = mixture of words
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TEXT CLASSIFICATION
TRAIN A GLMNET MODEL T I D Y T E
X T
TEXT CLASSIFICATION T I D Y T E X T
> library(glmnet) > library(doMC) > registerDoMC(cores = 8) > > is_jane <- books_joined$title == "Pride and Prejudice" > > model <- cv.glmnet(sparse_words, is_jane, family = "binomial", + parallel = TRUE, keep = TRUE)
None
None
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com JULIA SILGE
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com Author portraits from Wikimedia Photos by Glen Noble and Kimberly Farmer on Unsplash JULIA SILGE