Julia Silge
March 04, 2019
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Transcript
TEXT MINING
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO
TIDYTEXT
I’m Julia Silge
Data Scientist at Stack Overflow
@juliasilge
https://juliasilge.com/
TEXT DATA IS INCREASINGLY IMPORTANT
NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = TIDYTEXT
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
EXPLORATORY DATA ANALYSIS
N-GRAMS AND MORE WORDS
MACHINE LEARNING
EXPLORATORY DATA ANALYSIS
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT?
TERM FREQUENCY
INVERSE DOCUMENT FREQUENCY
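As a rough illustration of the tf-idf idea on this slide, here is a minimal sketch using `bind_tf_idf()` from tidytext. The three-document toy corpus and the column names `document` and `text` are assumptions made up for this example, not data from the talk.

```r
library(dplyr)
library(tidytext)

# Toy corpus (hypothetical, for illustration only)
docs <- data.frame(
  document = c("a", "b", "c"),
  text = c("the rocket launched the satellite",
           "the ocean covers the planet",
           "the satellite maps the ocean"),
  stringsAsFactors = FALSE
)

# Tidy format: one row per word per document, then count occurrences
word_counts <- docs %>%
  unnest_tokens(word, text) %>%
  count(document, word, sort = TRUE)

# tf is a word's share of its document; idf is ln(n_docs / n_docs containing the word)
doc_words <- word_counts %>%
  bind_tf_idf(word, document, n) %>%
  arrange(desc(tf_idf))

print(doc_words)
```

A word such as "the" that appears in every document gets an idf of zero, so its tf-idf is zero, while a word found in only one document scores highest there.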
• As part of the NASA Datanauts program, I worked on a project to understand NASA datasets
• Metadata includes title, description, keywords, etc.
TAKING TIDYTEXT TO THE NEXT LEVEL
N-GRAMS, NETWORKS, & NEGATION
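The slide names n-grams and negation without showing code in the transcript. One way this could look, as a sketch under assumptions: tokenize into bigrams with `unnest_tokens(token = "ngrams", n = 2)`, split each bigram with `tidyr::separate()`, and keep pairs whose first word is "not". The two sentences below are invented for illustration.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Toy text (hypothetical, for illustration only)
docs <- data.frame(
  line = 1:2,
  text = c("I am not happy with this", "this is not good at all"),
  stringsAsFactors = FALSE
)

# One row per pair of consecutive words
bigrams <- docs %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Split each bigram and keep the words that follow "not"
negated <- bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 == "not") %>%
  count(word1, word2, sort = TRUE)

print(negated)
```

The same `word1`/`word2` pairs could also serve as an edge list for a word network, which this sketch does not attempt.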
TAKING TIDYTEXT TO THE NEXT LEVEL
TOPIC MODELING
TOPIC MODELING
• Each DOCUMENT = mixture of topics
• Each TOPIC = mixture of words
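The two bullets above can be made concrete with a small sketch. It assumes LDA from the topicmodels package as the topic model (the transcript does not say which implementation the talk used) and a made-up four-document corpus. The tidied `beta` matrix gives per-topic word probabilities and the tidied `gamma` matrix gives per-document topic proportions.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Toy corpus (hypothetical, for illustration only)
docs <- data.frame(
  document = c("d1", "d2", "d3", "d4"),
  text = c("rocket orbit satellite launch rocket",
           "satellite orbit launch mission",
           "ocean whale coral reef ocean",
           "coral reef whale tide"),
  stringsAsFactors = FALSE
)

# Count words per document and cast to a document-term matrix
dtm <- docs %>%
  unnest_tokens(word, text) %>%
  count(document, word) %>%
  cast_dtm(document, word, n)

# Fit a two-topic model; k = 2 and the seed are arbitrary choices
lda <- LDA(dtm, k = 2, control = list(seed = 1234))

topic_words <- tidy(lda, matrix = "beta")   # each topic = mixture of words
doc_topics  <- tidy(lda, matrix = "gamma")  # each document = mixture of topics

print(doc_topics)
```

Within each topic the `beta` values sum to one, and within each document the `gamma` values sum to one, which is what "mixture" means here.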
TAKING TIDYTEXT TO THE NEXT LEVEL
TEXT CLASSIFICATION
TRAIN A GLMNET MODEL
TEXT CLASSIFICATION
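library(glmnet)
library(doMC)
registerDoMC(cores = 8)  # register 8 cores for the parallel backend

# TRUE for documents from "Pride and Prejudice", FALSE otherwise
is_jane <- books_joined$title == "Pride and Prejudice"

# cross-validated, regularized logistic regression on the sparse word matrix
model <- cv.glmnet(sparse_words, is_jane, family = "binomial",
                   parallel = TRUE, keep = TRUE)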
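The slide's call takes `sparse_words` and `books_joined` as given. The sketch below shows one way such inputs might be built with `cast_sparse()` from tidytext. The `books` data frame, its columns, and its text are hypothetical stand-ins, and the talk may have prepared these objects differently.

```r
library(dplyr)
library(tidytext)

# Toy documents (hypothetical, for illustration only)
books <- data.frame(
  document = 1:4,
  title = c("Pride and Prejudice", "Pride and Prejudice",
            "The War of the Worlds", "The War of the Worlds"),
  text = c("a ball and a dance at the estate",
           "a letter about the marriage",
           "the cylinder fell from the sky",
           "the machines crossed the river"),
  stringsAsFactors = FALSE
)

# Sparse document-by-word count matrix: one row per document
sparse_words <- books %>%
  unnest_tokens(word, text) %>%
  count(document, word) %>%
  cast_sparse(document, word, n)

# Recover each row's title in the same order as the matrix rows
books_joined <- data.frame(document = as.integer(rownames(sparse_words))) %>%
  left_join(books %>% select(document, title), by = "document")

# Response vector aligned with the rows of sparse_words
is_jane <- books_joined$title == "Pride and Prejudice"
```

Joining on the matrix row names keeps the response aligned with the predictors, which matters because a model fitted on misaligned rows would still run without error.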
THANK YOU
JULIA SILGE
@juliasilge
https://juliasilge.com
Author portraits from Wikimedia
Photos by Glen Noble and Kimberly Farmer on Unsplash