Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Text Mining: Exploratory Data Analysis to Machi...
Search
Julia Silge
March 04, 2019
Technology
1
240
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Julia Silge
March 04, 2019
Tweet
Share
More Decks by Julia Silge
See All by Julia Silge
Introducing Positron
juliasilge
1
310
The right tool for the job
juliasilge
0
50
Good practices for applied machine learning
juliasilge
0
210
Applied machine learning with tidymodels
juliasilge
0
130
Maintaining an R Package
juliasilge
0
370
Publishing the Stack Overflow Developer Survey
juliasilge
2
75
Text Mining Using Tidy Data Principles
juliasilge
0
140
North American Developer Hiring Landscape
juliasilge
0
54
Understanding Principal Component Analysis Using Stack Overflow Data
juliasilge
13
4.5k
Other Decks in Technology
See All in Technology
いま注目のAIエージェントを作ってみよう
supermarimobros
0
370
slog.Handlerのよくある実装ミス
sakiengineer
4
490
Oracle Base Database Service 技術詳細
oracle4engineer
PRO
10
75k
「Linux」という言葉が指すもの
sat
PRO
4
150
2025/09/16 仕様駆動開発とAI-DLCが導くAI駆動開発の新フェーズ
masahiro_okamura
0
150
AI時代を生き抜くエンジニアキャリアの築き方 (AI-Native 時代、エンジニアという道は 「最大の挑戦の場」となる) / Building an Engineering Career to Thrive in the Age of AI (In the AI-Native Era, the Path of Engineering Becomes the Ultimate Arena of Challenge)
jeongjaesoon
0
270
使いやすいプラットフォームの作り方 ー LINEヤフーのKubernetes基盤に学ぶ理論と実践
lycorptech_jp
PRO
1
180
機械学習を扱うプラットフォーム開発と運用事例
lycorptech_jp
PRO
0
710
20250905_MeetUp_Ito-san_s_presentation.pdf
magicpod
1
100
バイブスに「型」を!Kent Beckに学ぶ、AI時代のテスト駆動開発
amixedcolor
3
590
AWSを利用する上で知っておきたい名前解決のはなし(10分版)
nagisa53
10
3.3k
現場で効くClaude Code ─ 最新動向と企業導入
takaakikakei
1
260
Featured
See All Featured
Statistics for Hackers
jakevdp
799
220k
YesSQL, Process and Tooling at Scale
rocio
173
14k
GitHub's CSS Performance
jonrohan
1032
460k
JavaScript: Past, Present, and Future - NDC Porto 2020
reverentgeek
52
5.6k
Measuring & Analyzing Core Web Vitals
bluesmoon
9
590
VelocityConf: Rendering Performance Case Studies
addyosmani
332
24k
Intergalactic Javascript Robots from Outer Space
tanoku
272
27k
Making Projects Easy
brettharned
117
6.4k
Rebuilding a faster, lazier Slack
samanthasiow
83
9.2k
Principles of Awesome APIs and How to Build Them.
keavy
126
17k
Documentation Writing (for coders)
carmenintech
74
5k
The Art of Programming - Codeland 2020
erikaheidi
56
13k
Transcript
T E X T M I N I N G
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO T I D Y T E X T Data
Scientist at Stack Overflow @juliasilge https://juliasilge.com/ I’m Julia Silge
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = T I D
Y T E X T
https://github.com/juliasilge/tidytext
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
T I D Y T E X T EXPLORATORY DATA
ANALYSIS N-GRAMS AND MORE WORDS MACHINE LEARNING
EXPLORATORY DATA ANALYSIS T I D Y T E X
T
from the Washington Post’s Wonkblog
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT? T I D Y T
E X T TERM FREQUENCY INVERSE DOCUMENT FREQUENCY
None
None
• As part of the NASA Datanauts program, I worked
on a project to understand NASA datasets • Metadata includes title, description, keywords, etc
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L N-GRAMS, NETWORKS, & NEGATION
None
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TOPIC MODELING
TOPIC MODELING T I D Y T E X T
•Each DOCUMENT = mixture of topics •Each TOPIC = mixture of words
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TEXT CLASSIFICATION
TRAIN A GLMNET MODEL T I D Y T E
X T
TEXT CLASSIFICATION T I D Y T E X T
> library(glmnet) > library(doMC) > registerDoMC(cores = 8) > > is_jane <- books_joined$title == "Pride and Prejudice" > > model <- cv.glmnet(sparse_words, is_jane, family = "binomial", + parallel = TRUE, keep = TRUE)
None
None
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com JULIA SILGE
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com Author portraits from Wikimedia Photos by Glen Noble and Kimberly Farmer on Unsplash JULIA SILGE