Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Text Mining: Exploratory Data Analysis to Machi...
Search
Sponsored
·
Ship Features Fearlessly
Turn features on and off without deploys. Used by thousands of Ruby developers.
→
Julia Silge
March 04, 2019
Technology
1
260
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Julia Silge
March 04, 2019
Tweet
Share
More Decks by Julia Silge
See All by Julia Silge
Introducing Positron
juliasilge
1
350
The right tool for the job
juliasilge
0
73
Good practices for applied machine learning
juliasilge
0
230
Applied machine learning with tidymodels
juliasilge
0
160
Maintaining an R Package
juliasilge
0
420
Publishing the Stack Overflow Developer Survey
juliasilge
2
89
Text Mining Using Tidy Data Principles
juliasilge
0
170
North American Developer Hiring Landscape
juliasilge
0
76
Understanding Principal Component Analysis Using Stack Overflow Data
juliasilge
13
4.6k
Other Decks in Technology
See All in Technology
プロジェクトマネジメントをチームに宿す -ゼロからはじめるチームプロジェクトマネジメントは活動1年未満のチームの教科書です- / 20260304 Shigeki Morizane
shift_evolve
PRO
1
140
Security Diaries of an Open Source IAM
ahus1
0
210
マルチアカウント環境でSecurity Hubの運用!導入の苦労とポイント / JAWS DAYS 2026
genda
0
100
JAWS DAYS 2026 ExaWizards_20260307
exawizards
0
120
Introduction to Sansan for Engineers / エンジニア向け会社紹介
sansan33
PRO
6
72k
ビズリーチにおける検索・推薦の取り組み / DEIM2026
visional_engineering_and_design
1
120
A Gentle Introduction to Transformers
keio_smilab
PRO
2
930
JAWS DAYS 2026 楽しく学ぼう!ストレージ 入門
yoshiki0705
2
110
「ヒットする」+「近い」を同時にかなえるスマートサジェストの作り方.pdf
nakasho
0
150
EMからICへ、二周目人材としてAI全振りのプロダクト開発で見つけた武器
yug1224
5
480
チームメンバー迷わないIaC設計
hayama17
5
4k
DevOpsエージェントで実現する!! AWS Well-Architected(W-A) を実現するシステム設計 / 20260307 Masaki Okuda
shift_evolve
PRO
3
260
Featured
See All Featured
Introduction to Domain-Driven Design and Collaborative software design
baasie
1
620
Chrome DevTools: State of the Union 2024 - Debugging React & Beyond
addyosmani
10
1.1k
Impact Scores and Hybrid Strategies: The future of link building
tamaranovitovic
0
220
Reality Check: Gamification 10 Years Later
codingconduct
0
2k
Building Adaptive Systems
keathley
44
2.9k
Highjacked: Video Game Concept Design
rkendrick25
PRO
1
310
Amusing Abliteration
ianozsvald
0
120
Ecommerce SEO: The Keys for Success Now & Beyond - #SERPConf2024
aleyda
1
1.8k
BBQ
matthewcrist
89
10k
Navigating Weather and Climate Data
rabernat
0
130
B2B Lead Gen: Tactics, Traps & Triumph
marketingsoph
0
69
CoffeeScript is Beautiful & I Never Want to Write Plain JavaScript Again
sstephenson
162
16k
Transcript
T E X T M I N I N G
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO T I D Y T E X T Data
Scientist at Stack Overflow @juliasilge https://juliasilge.com/ I’m Julia Silge
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = T I D
Y T E X T
https://github.com/juliasilge/tidytext
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
T I D Y T E X T EXPLORATORY DATA
ANALYSIS N-GRAMS AND MORE WORDS MACHINE LEARNING
EXPLORATORY DATA ANALYSIS T I D Y T E X
T
from the Washington Post’s Wonkblog
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT? T I D Y T
E X T TERM FREQUENCY INVERSE DOCUMENT FREQUENCY
None
None
• As part of the NASA Datanauts program, I worked
on a project to understand NASA datasets • Metadata includes title, description, keywords, etc
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L N-GRAMS, NETWORKS, & NEGATION
None
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TOPIC MODELING
TOPIC MODELING T I D Y T E X T
•Each DOCUMENT = mixture of topics •Each TOPIC = mixture of words
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TEXT CLASSIFICATION
TRAIN A GLMNET MODEL T I D Y T E
X T
TEXT CLASSIFICATION T I D Y T E X T
> library(glmnet) > library(doMC) > registerDoMC(cores = 8) > > is_jane <- books_joined$title == "Pride and Prejudice" > > model <- cv.glmnet(sparse_words, is_jane, family = "binomial", + parallel = TRUE, keep = TRUE)
None
None
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com JULIA SILGE
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com Author portraits from Wikimedia Photos by Glen Noble and Kimberly Farmer on Unsplash JULIA SILGE