Julia Silge
March 04, 2019
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Transcript
TEXT MINING
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO
TIDYTEXT
I’m Julia Silge
Data Scientist at Stack Overflow
@juliasilge https://juliasilge.com/
TEXT DATA IS INCREASINGLY IMPORTANT
NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = TIDYTEXT
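A minimal sketch of what "tidy data principles + count-based methods" can look like in practice, assuming the dplyr and tidytext packages are installed. The toy corpus and the names `text_df`, `tidy_words`, and `word_counts` are made up for illustration and do not come from the talk.

```r
library(dplyr)
library(tidytext)

# Toy corpus (hypothetical): one row per line of text
text_df <- tibble(
  line = 1:3,
  text = c("Text data is increasingly important",
           "Tidy data principles make text data easier to handle",
           "Count-based methods start from word counts")
)

# Tidy step: one token (here, one lowercased word) per row
tidy_words <- text_df %>%
  unnest_tokens(word, text)

# Count-based step: how often does each word occur?
word_counts <- tidy_words %>%
  count(word, sort = TRUE)

print(word_counts)
```

Because the result is an ordinary data frame, the usual dplyr verbs (filter, join, group_by) apply to it directly.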
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
• EXPLORATORY DATA ANALYSIS
• N-GRAMS AND MORE WORDS
• MACHINE LEARNING
EXPLORATORY DATA ANALYSIS
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT?
TERM FREQUENCY, INVERSE DOCUMENT FREQUENCY
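As a rough illustration of term frequency and inverse document frequency, the sketch below uses `bind_tf_idf()` from tidytext on made-up counts. The documents, words, and counts are hypothetical. Here tf is a word's count divided by the total count for its document, and idf is the natural log of the number of documents divided by the number of documents containing the word.

```r
library(dplyr)
library(tidytext)

# Toy word counts per document (hypothetical data)
doc_words <- tibble(
  document = c("A", "A", "B", "B"),
  word     = c("the", "rocket", "the", "ocean"),
  n        = c(10, 5, 10, 5)
)

# Add tf, idf, and tf_idf columns, then rank words by tf_idf
doc_tf_idf <- doc_words %>%
  bind_tf_idf(word, document, n) %>%
  arrange(desc(tf_idf))

print(doc_tf_idf)
```

A word such as "the" that appears in every document gets idf = ln(2/2) = 0, so its tf-idf is zero, while "rocket" and "ocean" each appear in only one document and score (5/15) × ln(2).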
• As part of the NASA Datanauts program, I worked on a project to understand NASA datasets
• Metadata includes title, description, keywords, etc.
TAKING TIDYTEXT TO THE NEXT LEVEL
N-GRAMS, NETWORKS, & NEGATION
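One possible way to approach n-grams and negation with tidytext is sketched below, assuming dplyr, tidyr, and tidytext are installed. The two sentences and the names `bigrams` and `negated` are invented for illustration. The sketch covers bigrams and negation only, not networks.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Toy sentences (hypothetical)
text_df <- tibble(
  line = 1:2,
  text = c("I am not happy with this ending",
           "She was happy and not sad at all")
)

# Tokenize into bigrams: one pair of consecutive words per row
bigrams <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Split each bigram and keep the pairs whose first word is "not"
negated <- bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(word1 == "not") %>%
  count(word1, word2)

print(negated)
```

Words that follow "not" are candidates for having their sentiment flipped, which a single-word analysis would miss.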
TAKING TIDYTEXT TO THE NEXT LEVEL
TOPIC MODELING
TOPIC MODELING
• Each DOCUMENT = mixture of topics
• Each TOPIC = mixture of words
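The two bullets above can be made concrete with a small sketch. It assumes latent Dirichlet allocation via the topicmodels package, which is an assumption since the transcript does not name the algorithm, plus dplyr, tidytext, and tm (needed by `cast_dtm()`). The toy counts and all object names are hypothetical.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Toy word counts for four short documents (hypothetical data)
word_counts <- tibble(
  document = c("space1", "space1", "space1", "space2", "space2",
               "sea1", "sea1", "sea1", "sea2", "sea2"),
  word     = c("rocket", "orbit", "launch", "rocket", "orbit",
               "ocean", "wave", "tide", "ocean", "wave"),
  n        = c(5L, 4L, 3L, 6L, 2L, 5L, 4L, 3L, 6L, 2L)
)

# Cast the tidy counts to a document-term matrix and fit a 2-topic model
dtm <- word_counts %>%
  cast_dtm(document, word, n)
lda <- LDA(dtm, k = 2, control = list(seed = 1234))

# gamma: per-document topic proportions (each document = mixture of topics)
doc_topics <- tidy(lda, matrix = "gamma")

# beta: per-topic word probabilities (each topic = mixture of words)
topic_words <- tidy(lda, matrix = "beta")

print(doc_topics)
print(topic_words)
```

Within each document the gamma values sum to 1, and within each topic the beta values sum to 1, which is the "mixture" in both bullets.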
TAKING TIDYTEXT TO THE NEXT LEVEL
TEXT CLASSIFICATION
TRAIN A GLMNET MODEL
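TEXT CLASSIFICATION
    library(glmnet)
    library(doMC)

    # Register 8 cores so the cross-validation folds can be fit in parallel
    registerDoMC(cores = 8)

    # Response: TRUE where the title is "Pride and Prejudice"
    is_jane <- books_joined$title == "Pride and Prejudice"

    # Cross-validated, regularized logistic regression on the sparse word matrix
    model <- cv.glmnet(sparse_words, is_jane, family = "binomial",
                       parallel = TRUE, keep = TRUE)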
THANK YOU
JULIA SILGE
@juliasilge https://juliasilge.com
Author portraits from Wikimedia
Photos by Glen Noble and Kimberly Farmer on Unsplash