Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Text Mining: Exploratory Data Analysis to Machi...
Search
Julia Silge
March 04, 2019
Technology
1
240
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Julia Silge
March 04, 2019
Tweet
Share
More Decks by Julia Silge
See All by Julia Silge
Introducing Positron
juliasilge
1
330
The right tool for the job
juliasilge
0
56
Good practices for applied machine learning
juliasilge
0
210
Applied machine learning with tidymodels
juliasilge
0
140
Maintaining an R Package
juliasilge
0
390
Publishing the Stack Overflow Developer Survey
juliasilge
2
77
Text Mining Using Tidy Data Principles
juliasilge
0
150
North American Developer Hiring Landscape
juliasilge
0
60
Understanding Principal Component Analysis Using Stack Overflow Data
juliasilge
13
4.5k
Other Decks in Technology
See All in Technology
入社したばかりでもできる、 アクセシビリティ改善の第一歩
unachang113
2
330
国産クラウドを支える設計とチームの変遷 “技術・組織・ミッション”
kazeburo
3
3.5k
大規模プロダクトで実践するAI活用の仕組みづくり
k1tikurisu
4
1.6k
PostgreSQL で列データ”ファイル”を利用する ~Arrow/Parquet を統合したデータベースの作成~
kaigai
0
120
Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
peisuke
0
160
その意思決定、まだ続けるんですか? ~痛みを超えて未来を作る、AI時代の撤退とピボットの技術~
applism118
0
1.4k
リアーキテクティングのその先へ 〜品質と開発生産性の壁を越えるプラットフォーム戦略〜 / architecture-con2025
visional_engineering_and_design
0
1.7k
ABEMAのCM配信を支えるスケーラブルな分散カウンタの実装
hono0130
4
1k
明日から真似してOk!NOT A HOTELで実践している入社手続きの自動化
nkajihara
1
860
「データ無い! 腹立つ! 推論する!」から 「データ無い! 腹立つ! データを作る」へ チームでデータを作り、育てられるようにするまで / How can we create, use, and maintain data ourselves?
moznion
8
4.5k
未回答質問の回答一覧 / 開発をリードする品質保証 QAエンジニアと開発者の未来を考える-Findy Online Conference -
findy_eventslides
0
300
Axon Frameworkのイベントストアを独自拡張した話
zozotech
PRO
0
210
Featured
See All Featured
The MySQL Ecosystem @ GitHub 2015
samlambert
251
13k
BBQ
matthewcrist
89
9.9k
Raft: Consensus for Rubyists
vanstee
140
7.2k
How to Create Impact in a Changing Tech Landscape [PerfNow 2023]
tammyeverts
55
3.1k
Refactoring Trust on Your Teams (GOTO; Chicago 2020)
rmw
35
3.2k
[RailsConf 2023 Opening Keynote] The Magic of Rails
eileencodes
31
9.7k
Git: the NoSQL Database
bkeepers
PRO
432
66k
JavaScript: Past, Present, and Future - NDC Porto 2020
reverentgeek
52
5.7k
CoffeeScript is Beautiful & I Never Want to Write Plain JavaScript Again
sstephenson
162
15k
How to train your dragon (web standard)
notwaldorf
97
6.4k
Automating Front-end Workflow
addyosmani
1371
200k
Templates, Plugins, & Blocks: Oh My! Creating the theme that thinks of everything
marktimemedia
31
2.6k
Transcript
T E X T M I N I N G
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO T I D Y T E X T Data
Scientist at Stack Overflow @juliasilge https://juliasilge.com/ I’m Julia Silge
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = T I D
Y T E X T
https://github.com/juliasilge/tidytext
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
T I D Y T E X T EXPLORATORY DATA
ANALYSIS N-GRAMS AND MORE WORDS MACHINE LEARNING
EXPLORATORY DATA ANALYSIS T I D Y T E X
T
from the Washington Post’s Wonkblog
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT? T I D Y T
E X T TERM FREQUENCY INVERSE DOCUMENT FREQUENCY
None
None
• As part of the NASA Datanauts program, I worked
on a project to understand NASA datasets • Metadata includes title, description, keywords, etc
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L N-GRAMS, NETWORKS, & NEGATION
None
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TOPIC MODELING
TOPIC MODELING T I D Y T E X T
•Each DOCUMENT = mixture of topics •Each TOPIC = mixture of words
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TEXT CLASSIFICATION
TRAIN A GLMNET MODEL T I D Y T E
X T
TEXT CLASSIFICATION T I D Y T E X T
> library(glmnet) > library(doMC) > registerDoMC(cores = 8) > > is_jane <- books_joined$title == "Pride and Prejudice" > > model <- cv.glmnet(sparse_words, is_jane, family = "binomial", + parallel = TRUE, keep = TRUE)
None
None
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com JULIA SILGE
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com Author portraits from Wikimedia Photos by Glen Noble and Kimberly Farmer on Unsplash JULIA SILGE