Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Text Mining: Exploratory Data Analysis to Machi...
Search
Julia Silge
March 04, 2019
Technology
260
1
Share
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Julia Silge
March 04, 2019
More Decks by Julia Silge
See All by Julia Silge
Introducing Positron
juliasilge
1
370
The right tool for the job
juliasilge
0
88
Good practices for applied machine learning
juliasilge
0
250
Applied machine learning with tidymodels
juliasilge
0
170
Maintaining an R Package
juliasilge
0
440
Publishing the Stack Overflow Developer Survey
juliasilge
2
100
Text Mining Using Tidy Data Principles
juliasilge
0
190
North American Developer Hiring Landscape
juliasilge
0
89
Understanding Principal Component Analysis Using Stack Overflow Data
juliasilge
13
4.6k
Other Decks in Technology
See All in Technology
最低限これだけ押さえれ大丈夫_Claude Enterprise/Team企業展開ガバナンス入門
tkikuchi
1
490
エンジニアは生成AIと どのように向き合うべきか? ことばの意味という観点から
verypluming
3
290
Spring Boot における AOT Cache 活用テクニックと 起動時間改善事例
ntt_dsol_java
0
170
Kaggle未経験社員をメダリストに育てる「AIドラゴン桜」
lycorptech_jp
PRO
0
650
『家族アルバム みてね』における インシデント対応との向き合い方 / Approach incident response in Family Album
kohbis
2
240
long-running-tasks
cipepser
2
440
20260528_生成AIを専属DSに_Howの次にすべきことを考える
doradora09
PRO
0
250
Datadog 認定試験の概要と対策
uechishingo
0
160
AIガバナンス実践 - 生成AIコネクタのデータ漏洩リスクと実務対策
knishioka
0
130
Oracle AI Database@Azure:サービス概要のご紹介
oracle4engineer
PRO
6
1.8k
基礎から解説!Icebergで紐解くSnowflake×Databricks連携の現在地
cm_yasuhara
0
370
AI時代の私の技術インプットとアウトプット術
tonkotsuboy_com
15
7.6k
Featured
See All Featured
VelocityConf: Rendering Performance Case Studies
addyosmani
333
25k
Exploring anti-patterns in Rails
aemeredith
3
370
Principles of Awesome APIs and How to Build Them.
keavy
128
17k
A brief & incomplete history of UX Design for the World Wide Web: 1989–2019
jct
2
380
Public Speaking Without Barfing On Your Shoes - THAT 2023
reverentgeek
1
410
Raft: Consensus for Rubyists
vanstee
141
7.5k
Making the Leap to Tech Lead
cromwellryan
135
9.9k
Impact Scores and Hybrid Strategies: The future of link building
tamaranovitovic
0
290
Building Better People: How to give real-time feedback that sticks.
wjessup
370
20k
Redefining SEO in the New Era of Traffic Generation
szymonslowik
1
310
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
141
35k
The untapped power of vector embeddings
frankvandijk
2
1.7k
Transcript
T E X T M I N I N G
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO T I D Y T E X T Data
Scientist at Stack Overflow @juliasilge https://juliasilge.com/ I’m Julia Silge
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = T I D
Y T E X T
https://github.com/juliasilge/tidytext
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
T I D Y T E X T EXPLORATORY DATA
ANALYSIS N-GRAMS AND MORE WORDS MACHINE LEARNING
EXPLORATORY DATA ANALYSIS T I D Y T E X
T
from the Washington Post’s Wonkblog
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT? T I D Y T
E X T TERM FREQUENCY INVERSE DOCUMENT FREQUENCY
None
None
• As part of the NASA Datanauts program, I worked
on a project to understand NASA datasets • Metadata includes title, description, keywords, etc
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L N-GRAMS, NETWORKS, & NEGATION
None
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TOPIC MODELING
TOPIC MODELING T I D Y T E X T
•Each DOCUMENT = mixture of topics •Each TOPIC = mixture of words
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TEXT CLASSIFICATION
TRAIN A GLMNET MODEL T I D Y T E
X T
TEXT CLASSIFICATION T I D Y T E X T
> library(glmnet) > library(doMC) > registerDoMC(cores = 8) > > is_jane <- books_joined$title == "Pride and Prejudice" > > model <- cv.glmnet(sparse_words, is_jane, family = "binomial", + parallel = TRUE, keep = TRUE)
None
None
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com JULIA SILGE
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com Author portraits from Wikimedia Photos by Glen Noble and Kimberly Farmer on Unsplash JULIA SILGE