Julia Silge
March 04, 2019
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Transcript
TEXT MINING: EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO
I'm Julia Silge, Data Scientist at Stack Overflow
@juliasilge
https://juliasilge.com/
TEXT DATA IS INCREASINGLY IMPORTANT
NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = TIDYTEXT
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
• EXPLORATORY DATA ANALYSIS
• N-GRAMS AND MORE WORDS
• MACHINE LEARNING
EXPLORATORY DATA ANALYSIS
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT?
TERM FREQUENCY, INVERSE DOCUMENT FREQUENCY
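The slide names tf-idf without showing how it is computed. As a minimal sketch, assuming the tidytext and dplyr packages are installed, the idea can be tried on a toy corpus. The three short documents below are invented for illustration and are not from the talk.

```r
library(dplyr)
library(tidytext)

# Toy corpus, made up for illustration (not data from the talk)
docs <- tibble(
  doc  = c("a", "b", "c"),
  text = c("the moon orbits the earth",
           "the earth orbits the sun",
           "the sun is a star")
)

doc_words <- docs %>%
  unnest_tokens(word, text) %>%   # one row per word per document
  count(doc, word) %>%            # how often each word appears in each document
  bind_tf_idf(word, doc, n) %>%   # adds tf, idf, and tf_idf columns
  arrange(desc(tf_idf))

doc_words
```

A word such as "the" that appears in every document gets an idf of zero, so its tf-idf is zero too. Words that are frequent in one document but rare across the collection rise to the top.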
• As part of the NASA Datanauts program, I worked on a project to understand NASA datasets
• Metadata includes title, description, keywords, etc.
TAKING TIDYTEXT TO THE NEXT LEVEL: N-GRAMS, NETWORKS, & NEGATION
TAKING TIDYTEXT TO THE NEXT LEVEL: TOPIC MODELING
TOPIC MODELING
• Each DOCUMENT = mixture of topics
• Each TOPIC = mixture of words
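The two "mixture" statements above correspond to the two probability matrices that a fitted topic model exposes. The sketch below assumes latent Dirichlet allocation via the topicmodels package (the slide does not name a specific algorithm) and uses an invented four-document corpus. The choice of k = 2 and the seed are arbitrary.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Toy corpus, made up for illustration (not data from the talk)
docs <- tibble(
  doc  = c("d1", "d2", "d3", "d4"),
  text = c("rocket orbit launch rocket moon",
           "moon orbit launch satellite rocket",
           "bread flour oven yeast bread",
           "oven yeast flour butter bread")
)

# Count words per document and cast to a document-term matrix
dtm <- docs %>%
  unnest_tokens(word, text) %>%
  count(doc, word) %>%
  cast_dtm(doc, word, n)

# Fit a two-topic LDA model; k and the seed are arbitrary choices
lda <- LDA(dtm, k = 2, control = list(seed = 1234))

# beta: per-topic word probabilities (each TOPIC = mixture of words)
word_topics <- tidy(lda, matrix = "beta")

# gamma: per-document topic proportions (each DOCUMENT = mixture of topics)
doc_topics <- tidy(lda, matrix = "gamma")
```

Within each topic the beta values sum to one, and within each document the gamma values sum to one, which is what "mixture" means here.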
TAKING TIDYTEXT TO THE NEXT LEVEL: TEXT CLASSIFICATION
TRAIN A GLMNET MODEL
TEXT CLASSIFICATION
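library(glmnet)
library(doMC)
registerDoMC(cores = 8)  # parallel backend used by cv.glmnet
# TRUE for documents from "Pride and Prejudice"
is_jane <- books_joined$title == "Pride and Prejudice"
# Cross-validated regularized logistic regression on the sparse document-term matrix
model <- cv.glmnet(sparse_words, is_jane, family = "binomial",
                   parallel = TRUE, keep = TRUE)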
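The slide uses `sparse_words` and `books_joined` without showing where they come from. One plausible way to build them, assuming tidytext's `cast_sparse()` and a table with one row per document, is sketched below. The toy `books` table, its column names, and the second title are hypothetical stand-ins, not the data from the talk.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in data: one row per document, labelled by title
books <- tibble(
  document = 1:4,
  title = c("Pride and Prejudice", "Pride and Prejudice",
            "Some Other Book", "Some Other Book"),
  text = c("it is a truth universally acknowledged",
           "a single man in possession of a good fortune",
           "the engine hummed as the ship left orbit",
           "the crew watched the planet shrink behind them")
)

# Sparse document-term matrix: rows are documents, columns are words
sparse_words <- books %>%
  unnest_tokens(word, text) %>%
  count(document, word) %>%
  cast_sparse(document, word, n)

# Rebuild the labels in the same order as the matrix rows
books_joined <- tibble(document = as.integer(rownames(sparse_words))) %>%
  left_join(books %>% select(document, title), by = "document")

is_jane <- books_joined$title == "Pride and Prejudice"
```

Joining on the matrix row names keeps the response vector aligned with the rows that `cv.glmnet()` sees. A corpus this small is only useful for checking shapes; cross-validation needs far more documents per class.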
THANK YOU
JULIA SILGE
@juliasilge
https://juliasilge.com
Author portraits from Wikimedia
Photos by Glen Noble and Kimberly Farmer on Unsplash