Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Text Mining: Exploratory Data Analysis to Machi...
Search
Sponsored
·
Your Podcast. Everywhere. Effortlessly.
Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.
→
Julia Silge
March 04, 2019
Technology
1
260
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Julia Silge
March 04, 2019
Tweet
Share
More Decks by Julia Silge
See All by Julia Silge
Introducing Positron
juliasilge
1
350
The right tool for the job
juliasilge
0
73
Good practices for applied machine learning
juliasilge
0
230
Applied machine learning with tidymodels
juliasilge
0
160
Maintaining an R Package
juliasilge
0
420
Publishing the Stack Overflow Developer Survey
juliasilge
2
89
Text Mining Using Tidy Data Principles
juliasilge
0
170
North American Developer Hiring Landscape
juliasilge
0
76
Understanding Principal Component Analysis Using Stack Overflow Data
juliasilge
13
4.6k
Other Decks in Technology
See All in Technology
Ultra Ethernet (UEC) v1.0 仕様概説
markunet
3
230
ビズリーチにおける検索・推薦の取り組み / DEIM2026
visional_engineering_and_design
1
120
Datadog の RBAC のすべて
nulabinc
PRO
3
330
わたしがセキュアにAWSを使えるわけないじゃん、ムリムリ!(※ムリじゃなかった!?)
cmusudakeisuke
1
420
非情報系研究者へ送る Transformer入門
rishiyama
0
670
研究開発部メンバーの働き⽅ / Sansan R&D Profile
sansan33
PRO
4
22k
マルチプレーンGPUネットワークを実現するシャッフルアーキテクチャの整理と考察
markunet
2
150
Claude Codeが爆速進化してプラグイン追従がつらいので半自動化した話 ver.2
rfdnxbro
0
430
Security Diaries of an Open Source IAM
ahus1
0
210
組織のSREを推進するためのPlatform EngineeringとEKS / Platform Engineering and EKS to drive SRE in your organization
chmikata
0
190
元エンジニアPdM、IDEが恋しすぎてCursorに全業務を集約したら、スライド作成まで爆速になった話
doiko123
1
480
大規模サービスにおける レガシーコードからReactへの移行
magicpod
1
170
Featured
See All Featured
Primal Persuasion: How to Engage the Brain for Learning That Lasts
tmiket
0
280
Designing for Timeless Needs
cassininazir
0
150
Building a A Zero-Code AI SEO Workflow
portentint
PRO
0
370
The Mindset for Success: Future Career Progression
greggifford
PRO
0
270
How to Grow Your eCommerce with AI & Automation
katarinadahlin
PRO
1
130
Everyday Curiosity
cassininazir
0
150
Why Your Marketing Sucks and What You Can Do About It - Sophie Logan
marketingsoph
0
99
How To Stay Up To Date on Web Technology
chriscoyier
790
250k
Documentation Writing (for coders)
carmenintech
77
5.3k
DBのスキルで生き残る技術 - AI時代におけるテーブル設計の勘所
soudai
PRO
62
51k
Marketing to machines
jonoalderson
1
5k
Building an army of robots
kneath
306
46k
Transcript
T E X T M I N I N G
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO T I D Y T E X T Data
Scientist at Stack Overflow @juliasilge https://juliasilge.com/ I’m Julia Silge
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = T I D
Y T E X T
https://github.com/juliasilge/tidytext
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
T I D Y T E X T EXPLORATORY DATA
ANALYSIS N-GRAMS AND MORE WORDS MACHINE LEARNING
EXPLORATORY DATA ANALYSIS T I D Y T E X
T
from the Washington Post’s Wonkblog
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT? T I D Y T
E X T TERM FREQUENCY INVERSE DOCUMENT FREQUENCY
None
None
• As part of the NASA Datanauts program, I worked
on a project to understand NASA datasets • Metadata includes title, description, keywords, etc
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L N-GRAMS, NETWORKS, & NEGATION
None
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TOPIC MODELING
TOPIC MODELING T I D Y T E X T
•Each DOCUMENT = mixture of topics •Each TOPIC = mixture of words
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TEXT CLASSIFICATION
TRAIN A GLMNET MODEL T I D Y T E
X T
TEXT CLASSIFICATION T I D Y T E X T
> library(glmnet) > library(doMC) > registerDoMC(cores = 8) > > is_jane <- books_joined$title == "Pride and Prejudice" > > model <- cv.glmnet(sparse_words, is_jane, family = "binomial", + parallel = TRUE, keep = TRUE)
None
None
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com JULIA SILGE
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com Author portraits from Wikimedia Photos by Glen Noble and Kimberly Farmer on Unsplash JULIA SILGE