Lock in $30 Savings on PRO—Offer Ends Soon! ⏳
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Text Mining: Exploratory Data Analysis to Machi...
Search
Julia Silge
March 04, 2019
Technology
1
250
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Julia Silge
March 04, 2019
Tweet
Share
More Decks by Julia Silge
See All by Julia Silge
Introducing Positron
juliasilge
1
330
The right tool for the job
juliasilge
0
60
Good practices for applied machine learning
juliasilge
0
210
Applied machine learning with tidymodels
juliasilge
0
140
Maintaining an R Package
juliasilge
0
390
Publishing the Stack Overflow Developer Survey
juliasilge
2
79
Text Mining Using Tidy Data Principles
juliasilge
0
150
North American Developer Hiring Landscape
juliasilge
0
63
Understanding Principal Component Analysis Using Stack Overflow Data
juliasilge
13
4.5k
Other Decks in Technology
See All in Technology
re:Invent2025 コンテナ系アップデート振り返り(+CloudWatchログのアップデート紹介)
masukawa
0
300
Karate+Database RiderによるAPI自動テスト導入工数をCline+GitLab MCPを使って2割削減を目指す! / 20251206 Kazuki Takahashi
shift_evolve
PRO
1
460
ブロックテーマとこれからの WordPress サイト制作 / Toyama WordPress Meetup Vol.81
torounit
0
390
小さな判断で育つ、大きな意思決定力 / 20251204 Takahiro Kinjo
shift_evolve
PRO
1
570
今からでも間に合う!速習Devin入門とその活用方法
ismk
1
420
安いGPUレンタルサービスについて
aratako
2
2.6k
因果AIへの招待
sshimizu2006
0
920
[CMU-DB-2025FALL] Apache Fluss - A Streaming Storage for Real-Time Lakehouse
jark
0
110
AIと二人三脚で育てた、個人開発アプリグロース術
zozotech
PRO
0
680
pmconf2025 - データを活用し「価値」へ繋げる
glorypulse
0
700
re:Inventで気になったサービスを10分でいけるところまでお話しします
yama3133
1
120
新 Security HubがついにGA!仕組みや料金を深堀り #AWSreInvent #regrowth / AWS Security Hub Advanced GA
masahirokawahara
1
1.3k
Featured
See All Featured
Raft: Consensus for Rubyists
vanstee
141
7.2k
The Cult of Friendly URLs
andyhume
79
6.7k
Templates, Plugins, & Blocks: Oh My! Creating the theme that thinks of everything
marktimemedia
31
2.6k
Thoughts on Productivity
jonyablonski
73
5k
Chrome DevTools: State of the Union 2024 - Debugging React & Beyond
addyosmani
9
1k
Side Projects
sachag
455
43k
Save Time (by Creating Custom Rails Generators)
garrettdimon
PRO
32
1.8k
Optimising Largest Contentful Paint
csswizardry
37
3.5k
Helping Users Find Their Own Way: Creating Modern Search Experiences
danielanewman
31
3k
Reflections from 52 weeks, 52 projects
jeffersonlam
355
21k
Site-Speed That Sticks
csswizardry
13
990
RailsConf 2023
tenderlove
30
1.3k
Transcript
T E X T M I N I N G
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO T I D Y T E X T Data
Scientist at Stack Overflow @juliasilge https://juliasilge.com/ I’m Julia Silge
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = T I D
Y T E X T
https://github.com/juliasilge/tidytext
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
T I D Y T E X T EXPLORATORY DATA
ANALYSIS N-GRAMS AND MORE WORDS MACHINE LEARNING
EXPLORATORY DATA ANALYSIS T I D Y T E X
T
from the Washington Post’s Wonkblog
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT? T I D Y T
E X T TERM FREQUENCY INVERSE DOCUMENT FREQUENCY
None
None
• As part of the NASA Datanauts program, I worked
on a project to understand NASA datasets • Metadata includes title, description, keywords, etc
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L N-GRAMS, NETWORKS, & NEGATION
None
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TOPIC MODELING
TOPIC MODELING T I D Y T E X T
•Each DOCUMENT = mixture of topics •Each TOPIC = mixture of words
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TEXT CLASSIFICATION
TRAIN A GLMNET MODEL T I D Y T E
X T
TEXT CLASSIFICATION T I D Y T E X T
> library(glmnet) > library(doMC) > registerDoMC(cores = 8) > > is_jane <- books_joined$title == "Pride and Prejudice" > > model <- cv.glmnet(sparse_words, is_jane, family = "binomial", + parallel = TRUE, keep = TRUE)
None
None
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com JULIA SILGE
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com Author portraits from Wikimedia Photos by Glen Noble and Kimberly Farmer on Unsplash JULIA SILGE