Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Text Mining: Exploratory Data Analysis to Machi...
Search
Julia Silge
March 04, 2019
Technology
1
250
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Julia Silge
March 04, 2019
Tweet
Share
More Decks by Julia Silge
See All by Julia Silge
Introducing Positron
juliasilge
1
340
The right tool for the job
juliasilge
0
67
Good practices for applied machine learning
juliasilge
0
220
Applied machine learning with tidymodels
juliasilge
0
150
Maintaining an R Package
juliasilge
0
410
Publishing the Stack Overflow Developer Survey
juliasilge
2
84
Text Mining Using Tidy Data Principles
juliasilge
0
160
North American Developer Hiring Landscape
juliasilge
0
68
Understanding Principal Component Analysis Using Stack Overflow Data
juliasilge
13
4.6k
Other Decks in Technology
See All in Technology
人はいかにして 確率的な挙動を 受け入れていくのか
vaaaaanquish
4
2.6k
Claude Codeベストプラクティスまとめ
minorun365
46
25k
KubeCon + CloudNativeCon NA ‘25 Recap, Extensibility: Gateway API / NRI
ladicle
0
130
【Oracle Cloud ウェビナー】ランサムウェアが突く「侵入の隙」とバックアップの「死角」 ~ 過去の教訓に学ぶ — 侵入前提の防御とデータ保護 ~
oracle4engineer
PRO
2
220
かわいい身体と声を持つ そういうものに私はなりたい
yoshimura_datam
0
470
ファシリテーション勉強中 その場に何が求められるかを考えるようになるまで / 20260123 Naoki Takahashi
shift_evolve
PRO
3
380
みんなでAI上手ピーポーになろう! / Let’s All Get AI-Savvy!
kaminashi
0
220
Claude in Chromeで始める自律的フロントエンド開発
diggymo
1
270
Kaggleコンペティション「MABe Challenge - Social Action Recognition in Mice」振り返り
yu4u
1
760
Databricks Free Edition講座 データエンジニアリング編
taka_aki
0
2.8k
Git Training GitHub
yuhattor
1
270
The Engineer with a Three-Year Cycle - 2
e99h2121
0
190
Featured
See All Featured
<Decoding/> the Language of Devs - We Love SEO 2024
nikkihalliwell
1
120
Gemini Prompt Engineering: Practical Techniques for Tangible AI Outcomes
mfonobong
2
270
Why Our Code Smells
bkeepers
PRO
340
58k
Evolution of real-time – Irina Nazarova, EuRuKo, 2024
irinanazarova
9
1.1k
Applied NLP in the Age of Generative AI
inesmontani
PRO
4
2k
Building the Perfect Custom Keyboard
takai
2
670
StorybookのUI Testing Handbookを読んだ
zakiyama
31
6.5k
Designing Powerful Visuals for Engaging Learning
tmiket
0
210
What the history of the web can teach us about the future of AI
inesmontani
PRO
1
410
Documentation Writing (for coders)
carmenintech
77
5.2k
The B2B funnel & how to create a winning content strategy
katarinadahlin
PRO
0
260
Deep Space Network (abreviated)
tonyrice
0
36
Transcript
T E X T M I N I N G
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO T I D Y T E X T Data
Scientist at Stack Overflow @juliasilge https://juliasilge.com/ I’m Julia Silge
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = T I D
Y T E X T
https://github.com/juliasilge/tidytext
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
T I D Y T E X T EXPLORATORY DATA
ANALYSIS N-GRAMS AND MORE WORDS MACHINE LEARNING
EXPLORATORY DATA ANALYSIS T I D Y T E X
T
from the Washington Post’s Wonkblog
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT? T I D Y T
E X T TERM FREQUENCY INVERSE DOCUMENT FREQUENCY
None
None
• As part of the NASA Datanauts program, I worked
on a project to understand NASA datasets • Metadata includes title, description, keywords, etc
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L N-GRAMS, NETWORKS, & NEGATION
None
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TOPIC MODELING
TOPIC MODELING T I D Y T E X T
•Each DOCUMENT = mixture of topics •Each TOPIC = mixture of words
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TEXT CLASSIFICATION
TRAIN A GLMNET MODEL T I D Y T E
X T
TEXT CLASSIFICATION T I D Y T E X T
> library(glmnet) > library(doMC) > registerDoMC(cores = 8) > > is_jane <- books_joined$title == "Pride and Prejudice" > > model <- cv.glmnet(sparse_words, is_jane, family = "binomial", + parallel = TRUE, keep = TRUE)
None
None
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com JULIA SILGE
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com Author portraits from Wikimedia Photos by Glen Noble and Kimberly Farmer on Unsplash JULIA SILGE