Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Text Mining: Exploratory Data Analysis to Machi...
Search
Julia Silge
March 04, 2019
Technology
1
220
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Julia Silge
March 04, 2019
Tweet
Share
More Decks by Julia Silge
See All by Julia Silge
Introducing Positron
juliasilge
1
170
The right tool for the job
juliasilge
0
22
Good practices for applied machine learning
juliasilge
0
180
Applied machine learning with tidymodels
juliasilge
0
91
Maintaining an R Package
juliasilge
0
310
Publishing the Stack Overflow Developer Survey
juliasilge
2
55
Text Mining Using Tidy Data Principles
juliasilge
0
110
North American Developer Hiring Landscape
juliasilge
0
33
Understanding Principal Component Analysis Using Stack Overflow Data
juliasilge
13
4.4k
Other Decks in Technology
See All in Technology
【Startup CTO of the Year 2024 / Audience Award】アセンド取締役CTO 丹羽健
niwatakeru
0
1.3k
ExaDB-D dbaascli で出来ること
oracle4engineer
PRO
0
3.9k
Making your applications cross-environment - OSCG 2024 NA
salaboy
0
200
AGIについてChatGPTに聞いてみた
blueb
0
130
AWS Lambdaと歩んだ“サーバーレス”と今後 #lambda_10years
yoshidashingo
1
180
OCI Vault 概要
oracle4engineer
PRO
0
9.7k
New Relicを活用したSREの最初のステップ / NRUG OKINAWA VOL.3
isaoshimizu
3
630
データプロダクトの定義からはじめる、データコントラクト駆動なデータ基盤
chanyou0311
2
330
AWS Lambda のトラブルシュートをしていて思うこと
kazzpapa3
2
180
いざ、BSC討伐の旅
nikinusu
2
780
100 名超が参加した日経グループ横断の競技型 AWS 学習イベント「Nikkei Group AWS GameDay」の紹介/mediajaws202411
nikkei_engineer_recruiting
1
170
Flutterによる 効率的なAndroid・iOS・Webアプリケーション開発の事例
recruitengineers
PRO
0
120
Featured
See All Featured
Product Roadmaps are Hard
iamctodd
PRO
49
11k
Measuring & Analyzing Core Web Vitals
bluesmoon
4
130
Understanding Cognitive Biases in Performance Measurement
bluesmoon
26
1.4k
VelocityConf: Rendering Performance Case Studies
addyosmani
325
24k
[Rails World 2023 - Day 1 Closing Keynote] - The Magic of Rails
eileencodes
33
1.9k
The Illustrated Children's Guide to Kubernetes
chrisshort
48
48k
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
PRO
10
720
A Philosophy of Restraint
colly
203
16k
A Modern Web Designer's Workflow
chriscoyier
693
190k
Practical Orchestrator
shlominoach
186
10k
Fight the Zombie Pattern Library - RWD Summit 2016
marcelosomers
232
17k
GitHub's CSS Performance
jonrohan
1030
460k
Transcript
T E X T M I N I N G
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO T I D Y T E X T Data
Scientist at Stack Overflow @juliasilge https://juliasilge.com/ I’m Julia Silge
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = T I D
Y T E X T
https://github.com/juliasilge/tidytext
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
T I D Y T E X T EXPLORATORY DATA
ANALYSIS N-GRAMS AND MORE WORDS MACHINE LEARNING
EXPLORATORY DATA ANALYSIS T I D Y T E X
T
from the Washington Post’s Wonkblog
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT? T I D Y T
E X T TERM FREQUENCY INVERSE DOCUMENT FREQUENCY
None
None
• As part of the NASA Datanauts program, I worked
on a project to understand NASA datasets • Metadata includes title, description, keywords, etc
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L N-GRAMS, NETWORKS, & NEGATION
None
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TOPIC MODELING
TOPIC MODELING T I D Y T E X T
•Each DOCUMENT = mixture of topics •Each TOPIC = mixture of words
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TEXT CLASSIFICATION
TRAIN A GLMNET MODEL T I D Y T E
X T
TEXT CLASSIFICATION T I D Y T E X T
> library(glmnet) > library(doMC) > registerDoMC(cores = 8) > > is_jane <- books_joined$title == "Pride and Prejudice" > > model <- cv.glmnet(sparse_words, is_jane, family = "binomial", + parallel = TRUE, keep = TRUE)
None
None
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com JULIA SILGE
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com Author portraits from Wikimedia Photos by Glen Noble and Kimberly Farmer on Unsplash JULIA SILGE