Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Text Mining: Exploratory Data Analysis to Machi...
Search
Julia Silge
March 04, 2019
Technology
1
230
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Julia Silge
March 04, 2019
Tweet
Share
More Decks by Julia Silge
See All by Julia Silge
Introducing Positron
juliasilge
1
280
The right tool for the job
juliasilge
0
43
Good practices for applied machine learning
juliasilge
0
200
Applied machine learning with tidymodels
juliasilge
0
120
Maintaining an R Package
juliasilge
0
370
Publishing the Stack Overflow Developer Survey
juliasilge
2
68
Text Mining Using Tidy Data Principles
juliasilge
0
130
North American Developer Hiring Landscape
juliasilge
0
50
Understanding Principal Component Analysis Using Stack Overflow Data
juliasilge
13
4.5k
Other Decks in Technology
See All in Technology
cdk initで生成されるあのファイル達は何なのか/cdk-init-generated-files
tomoki10
1
670
microCMSではじめるAIライティング
himaratsu
0
150
How to Quickly Call American Airlines®️ U.S. Customer Care : Full Guide
flyaahelpguide
0
240
United™️ Airlines®️ Customer®️ USA Contact Numbers: Complete 2025 Support Guide
flyunitedguide
0
800
ゼロから始めるSREの事業貢献 - 生成AI時代のSRE成長戦略と実践 / Starting SRE from Day One
shinyorke
PRO
0
110
衛星運用をソフトウェアエンジニアに依頼したときにできあがるもの
sankichi92
1
1k
Copilot coding agentにベットしたいCTOが開発組織で取り組んだこと / GitHub Copilot coding agent in Team
tnir
0
190
AIでテストプロセス自動化に挑戦する
sakatakazunori
1
530
Data Engineering Study#30 LT資料
tetsuroito
1
180
モニタリング統一への道のり - 分散モニタリングツール統合のためのオブザーバビリティプロジェクト
niftycorp
PRO
1
520
IPA&AWSダブル全冠が明かす、人生を変えた勉強法のすべて
iwamot
PRO
2
230
助けて! XからWaylandに移行しないと新しいGNOMEが使えなくなっちゃう 2025-07-12
nobutomurata
2
200
Featured
See All Featured
Evolution of real-time – Irina Nazarova, EuRuKo, 2024
irinanazarova
8
830
Balancing Empowerment & Direction
lara
1
450
Designing Dashboards & Data Visualisations in Web Apps
destraynor
231
53k
Code Reviewing Like a Champion
maltzj
524
40k
How to Think Like a Performance Engineer
csswizardry
25
1.7k
The Cult of Friendly URLs
andyhume
79
6.5k
Build your cross-platform service in a week with App Engine
jlugia
231
18k
GraphQLとの向き合い方2022年版
quramy
49
14k
Large-scale JavaScript Application Architecture
addyosmani
512
110k
Speed Design
sergeychernyshev
32
1k
GraphQLの誤解/rethinking-graphql
sonatard
71
11k
The Art of Programming - Codeland 2020
erikaheidi
54
13k
Transcript
T E X T M I N I N G
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO T I D Y T E X T Data
Scientist at Stack Overflow @juliasilge https://juliasilge.com/ I’m Julia Silge
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = T I D
Y T E X T
https://github.com/juliasilge/tidytext
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
T I D Y T E X T EXPLORATORY DATA
ANALYSIS N-GRAMS AND MORE WORDS MACHINE LEARNING
EXPLORATORY DATA ANALYSIS T I D Y T E X
T
from the Washington Post’s Wonkblog
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT? T I D Y T
E X T TERM FREQUENCY INVERSE DOCUMENT FREQUENCY
None
None
• As part of the NASA Datanauts program, I worked
on a project to understand NASA datasets • Metadata includes title, description, keywords, etc
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L N-GRAMS, NETWORKS, & NEGATION
None
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TOPIC MODELING
TOPIC MODELING T I D Y T E X T
•Each DOCUMENT = mixture of topics •Each TOPIC = mixture of words
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TEXT CLASSIFICATION
TRAIN A GLMNET MODEL T I D Y T E
X T
TEXT CLASSIFICATION T I D Y T E X T
> library(glmnet) > library(doMC) > registerDoMC(cores = 8) > > is_jane <- books_joined$title == "Pride and Prejudice" > > model <- cv.glmnet(sparse_words, is_jane, family = "binomial", + parallel = TRUE, keep = TRUE)
None
None
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com JULIA SILGE
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com Author portraits from Wikimedia Photos by Glen Noble and Kimberly Farmer on Unsplash JULIA SILGE