Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Text Mining: Exploratory Data Analysis to Machi...
Search
Sponsored
·
Your Podcast. Everywhere. Effortlessly.
Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.
→
Julia Silge
March 04, 2019
Technology
260
1
Share
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Julia Silge
March 04, 2019
More Decks by Julia Silge
See All by Julia Silge
Introducing Positron
juliasilge
1
360
The right tool for the job
juliasilge
0
76
Good practices for applied machine learning
juliasilge
0
240
Applied machine learning with tidymodels
juliasilge
0
170
Maintaining an R Package
juliasilge
0
430
Publishing the Stack Overflow Developer Survey
juliasilge
2
96
Text Mining Using Tidy Data Principles
juliasilge
0
180
North American Developer Hiring Landscape
juliasilge
0
83
Understanding Principal Component Analysis Using Stack Overflow Data
juliasilge
13
4.6k
Other Decks in Technology
See All in Technology
ログ基盤・プラグイン・ダッシュボード、全部整えた。でも最後は人だった。
makikub
5
1.9k
生成AI時代のエンジニア育成 変わる時代と変わらないコト
starfish719
0
730
Azure Speech で音声対応してみよう
kosmosebi
0
100
#jawsugyokohama 100 LT11, "My AWS Journey 2011-2026 - kwntravel"
shinichirokawano
0
230
Data Hubグループ 紹介資料
sansan33
PRO
0
2.9k
ルールルルルル私的函館観光ガイド── 函館の街はイクラでも楽しめる!
nomuson
0
190
2026年、知っておくべき最新 サーバレスTips10選/serverless-10-tips
slsops
12
4.4k
CDK Insightsで見る、AIによるCDKコード静的解析(+AI解析)
k_adachi_01
2
140
Introduction to Sansan for Engineers / エンジニア向け会社紹介
sansan33
PRO
6
74k
🀄️ on swiftc
giginet
PRO
0
350
システムは「動く」だけでは 足りない - 非機能要件・分散システム・トレードオフの基礎
nwiizo
29
8.8k
聞き手の目線で考えるプロポーザル
takefumiyoshii
0
390
Featured
See All Featured
Code Reviewing Like a Champion
maltzj
528
40k
Redefining SEO in the New Era of Traffic Generation
szymonslowik
1
270
Joys of Absence: A Defence of Solitary Play
codingconduct
1
340
Refactoring Trust on Your Teams (GOTO; Chicago 2020)
rmw
35
3.4k
Exploring the Power of Turbo Streams & Action Cable | RailsConf2023
kevinliebholz
37
6.3k
Have SEOs Ruined the Internet? - User Awareness of SEO in 2025
akashhashmi
0
310
Thoughts on Productivity
jonyablonski
76
5.1k
A designer walks into a library…
pauljervisheath
211
24k
The Pragmatic Product Professional
lauravandoore
37
7.2k
Building a Scalable Design System with Sketch
lauravandoore
463
34k
Building Flexible Design Systems
yeseniaperezcruz
330
40k
Optimising Largest Contentful Paint
csswizardry
37
3.6k
Transcript
T E X T M I N I N G
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO T I D Y T E X T Data
Scientist at Stack Overflow @juliasilge https://juliasilge.com/ I’m Julia Silge
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = T I D
Y T E X T
https://github.com/juliasilge/tidytext
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
T I D Y T E X T EXPLORATORY DATA
ANALYSIS N-GRAMS AND MORE WORDS MACHINE LEARNING
EXPLORATORY DATA ANALYSIS T I D Y T E X
T
from the Washington Post’s Wonkblog
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT? T I D Y T
E X T TERM FREQUENCY INVERSE DOCUMENT FREQUENCY
None
None
• As part of the NASA Datanauts program, I worked
on a project to understand NASA datasets • Metadata includes title, description, keywords, etc
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L N-GRAMS, NETWORKS, & NEGATION
None
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TOPIC MODELING
TOPIC MODELING T I D Y T E X T
•Each DOCUMENT = mixture of topics •Each TOPIC = mixture of words
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TEXT CLASSIFICATION
TRAIN A GLMNET MODEL T I D Y T E
X T
TEXT CLASSIFICATION T I D Y T E X T
> library(glmnet) > library(doMC) > registerDoMC(cores = 8) > > is_jane <- books_joined$title == "Pride and Prejudice" > > model <- cv.glmnet(sparse_words, is_jane, family = "binomial", + parallel = TRUE, keep = TRUE)
None
None
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com JULIA SILGE
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com Author portraits from Wikimedia Photos by Glen Noble and Kimberly Farmer on Unsplash JULIA SILGE