Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Text Mining: Exploratory Data Analysis to Machi...
Search
Sponsored
·
Your Podcast. Everywhere. Effortlessly.
Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.
→
Julia Silge
March 04, 2019
Technology
1
250
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Julia Silge
March 04, 2019
Tweet
Share
More Decks by Julia Silge
See All by Julia Silge
Introducing Positron
juliasilge
1
340
The right tool for the job
juliasilge
0
67
Good practices for applied machine learning
juliasilge
0
220
Applied machine learning with tidymodels
juliasilge
0
150
Maintaining an R Package
juliasilge
0
410
Publishing the Stack Overflow Developer Survey
juliasilge
2
84
Text Mining Using Tidy Data Principles
juliasilge
0
160
North American Developer Hiring Landscape
juliasilge
0
68
Understanding Principal Component Analysis Using Stack Overflow Data
juliasilge
13
4.6k
Other Decks in Technology
See All in Technology
re:Inventで出たインフラエンジニアが嬉しかったアップデート
nagisa53
4
210
AIとともに歩む情報セキュリティ / Information Security with AI
kanny
4
2k
OCI技術資料 : OS管理ハブ 概要
ocise
2
4.2k
3リポジトリーを2ヶ月でモノレポ化した話 / How I turned 3 repositories into a monorepo in 2 months
kubode
0
110
JuliaTokaiとしてはこれが最後かもしれない(仮) for NGK2026S
antimon2
0
120
現場で活かす生成AI実践セミナー「広報×AI活用」編
matyuda
0
100
エンジニアとマネジメントの距離/Engineering and Management
ikuodanaka
3
620
re:Inventで見つけた「運用を捨てる」技術。
ezaki
1
140
メルカリのAI活用を支えるAIセキュリティ
s3h
6
3.5k
クラウドセキュリティの進化 — AWSの20年を振り返る
kei4eva4
0
160
Zephyr RTOS の発表をOpen Source Summit Japan 2025で行った件
iotengineer22
0
270
AI Agent Standards and Protocols: a Walkthrough of MCP, A2A, and more...
glaforge
1
560
Featured
See All Featured
Design of three-dimensional binary manipulators for pick-and-place task avoiding obstacles (IECON2024)
konakalab
0
340
Designing for Timeless Needs
cassininazir
0
120
Put a Button on it: Removing Barriers to Going Fast.
kastner
60
4.1k
Producing Creativity
orderedlist
PRO
348
40k
The Psychology of Web Performance [Beyond Tellerrand 2023]
tammyeverts
49
3.3k
[RailsConf 2023] Rails as a piece of cake
palkan
59
6.3k
Odyssey Design
rkendrick25
PRO
0
470
Highjacked: Video Game Concept Design
rkendrick25
PRO
1
280
Distributed Sagas: A Protocol for Coordinating Microservices
caitiem20
333
22k
Keith and Marios Guide to Fast Websites
keithpitt
413
23k
Fireside Chat
paigeccino
41
3.8k
What's in a price? How to price your products and services
michaelherold
247
13k
Transcript
T E X T M I N I N G
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO T I D Y T E X T Data
Scientist at Stack Overflow @juliasilge https://juliasilge.com/ I’m Julia Silge
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = T I D
Y T E X T
https://github.com/juliasilge/tidytext
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
T I D Y T E X T EXPLORATORY DATA
ANALYSIS N-GRAMS AND MORE WORDS MACHINE LEARNING
EXPLORATORY DATA ANALYSIS T I D Y T E X
T
from the Washington Post’s Wonkblog
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT? T I D Y T
E X T TERM FREQUENCY INVERSE DOCUMENT FREQUENCY
None
None
• As part of the NASA Datanauts program, I worked
on a project to understand NASA datasets • Metadata includes title, description, keywords, etc
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L N-GRAMS, NETWORKS, & NEGATION
None
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TOPIC MODELING
TOPIC MODELING T I D Y T E X T
•Each DOCUMENT = mixture of topics •Each TOPIC = mixture of words
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TEXT CLASSIFICATION
TRAIN A GLMNET MODEL T I D Y T E
X T
TEXT CLASSIFICATION T I D Y T E X T
> library(glmnet) > library(doMC) > registerDoMC(cores = 8) > > is_jane <- books_joined$title == "Pride and Prejudice" > > model <- cv.glmnet(sparse_words, is_jane, family = "binomial", + parallel = TRUE, keep = TRUE)
None
None
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com JULIA SILGE
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com Author portraits from Wikimedia Photos by Glen Noble and Kimberly Farmer on Unsplash JULIA SILGE