Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Text Mining: Exploratory Data Analysis to Machi...
Search
Julia Silge
March 04, 2019
Technology
1
220
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Julia Silge
March 04, 2019
Tweet
Share
More Decks by Julia Silge
See All by Julia Silge
Introducing Positron
juliasilge
1
200
The right tool for the job
juliasilge
0
24
Good practices for applied machine learning
juliasilge
0
180
Applied machine learning with tidymodels
juliasilge
0
93
Maintaining an R Package
juliasilge
0
320
Publishing the Stack Overflow Developer Survey
juliasilge
2
55
Text Mining Using Tidy Data Principles
juliasilge
0
120
North American Developer Hiring Landscape
juliasilge
0
34
Understanding Principal Component Analysis Using Stack Overflow Data
juliasilge
13
4.4k
Other Decks in Technology
See All in Technology
Denoで作るチーム開発生産性向上のためのCLIツール
sansantech
PRO
0
120
サーバーなしでWordPress運用、できますよ。
sogaoh
PRO
0
170
いまからでも遅くないコンテナ座学
nomu
0
190
20240513 - 框裡框外_文學院學生如何在AI世代安身立命 @ 淡江大學
dpys
0
600
信頼されるためにやったこと、 やらなかったこと。/What we did to be trusted, What we did not do.
bitkey
PRO
0
1.1k
Formal Development of Operating Systems in Rust
riru
1
290
サイボウズフロントエンドエキスパートチームについて / FrontendExpert Team
cybozuinsideout
PRO
5
39k
普通のエンジニアがLaravelコアチームメンバーになるまで
avosalmon
0
660
生成AIによるテスト設計支援プロセスの構築とプロセス内のボトルネック解消の取り組み / 20241220 Suguru Ishii
shift_evolve
0
160
Visual StudioとかIDE関連小ネタ話
kosmosebi
1
140
スタートアップで取り組んでいるAzureとMicrosoft 365のセキュリティ対策/How to Improve Azure and Microsoft 365 Security at Startup
yuj1osm
0
280
組織に自動テストを書く文化を根付かせる戦略(2024冬版) / Building Automated Test Culture 2024 Winter Edition
twada
PRO
25
6.9k
Featured
See All Featured
The Web Performance Landscape in 2024 [PerfNow 2024]
tammyeverts
3
330
Being A Developer After 40
akosma
89
590k
Making the Leap to Tech Lead
cromwellryan
133
9k
The Pragmatic Product Professional
lauravandoore
32
6.3k
Documentation Writing (for coders)
carmenintech
67
4.5k
How To Stay Up To Date on Web Technology
chriscoyier
790
250k
GitHub's CSS Performance
jonrohan
1030
460k
StorybookのUI Testing Handbookを読んだ
zakiyama
28
5.4k
XXLCSS - How to scale CSS and keep your sanity
sugarenia
248
1.3M
Making Projects Easy
brettharned
116
6k
Creating an realtime collaboration tool: Agile Flush - .NET Oxford
marcduiker
26
1.9k
GraphQLとの向き合い方2022年版
quramy
44
13k
Transcript
T E X T M I N I N G
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO T I D Y T E X T Data
Scientist at Stack Overflow @juliasilge https://juliasilge.com/ I’m Julia Silge
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = T I D
Y T E X T
https://github.com/juliasilge/tidytext
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
T I D Y T E X T EXPLORATORY DATA
ANALYSIS N-GRAMS AND MORE WORDS MACHINE LEARNING
EXPLORATORY DATA ANALYSIS T I D Y T E X
T
from the Washington Post’s Wonkblog
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT? T I D Y T
E X T TERM FREQUENCY INVERSE DOCUMENT FREQUENCY
None
None
• As part of the NASA Datanauts program, I worked
on a project to understand NASA datasets • Metadata includes title, description, keywords, etc
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L N-GRAMS, NETWORKS, & NEGATION
None
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TOPIC MODELING
TOPIC MODELING T I D Y T E X T
•Each DOCUMENT = mixture of topics •Each TOPIC = mixture of words
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TEXT CLASSIFICATION
TRAIN A GLMNET MODEL T I D Y T E
X T
TEXT CLASSIFICATION T I D Y T E X T
> library(glmnet) > library(doMC) > registerDoMC(cores = 8) > > is_jane <- books_joined$title == "Pride and Prejudice" > > model <- cv.glmnet(sparse_words, is_jane, family = "binomial", + parallel = TRUE, keep = TRUE)
None
None
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com JULIA SILGE
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com Author portraits from Wikimedia Photos by Glen Noble and Kimberly Farmer on Unsplash JULIA SILGE