text_mining_slides_20180512
Leo Lu
May 12, 2018
Transcript
Text Mining and Data Viz
2018-05-12, leoluyi@iii
Slides: http://pcse.pw/6WHWJ
© leoluyi, 2018

About me
- Leo Lu
- ݣय़ૡᓕ
- ፓ獮ෝᰂᣟ禂๐率
- Build data products
  - ETL
  - Models
  - Text mining
  - Viz
  - ...

Text Mining: Workflow and Tools

Old-era tools vs. new-generation tools

In the past we all used packages written abroad: tm + tmcn, Rwordseg

But these packages often have unknown pitfalls when used on Chinese text

Today we will use some newer tools

Workflow: Get data ➜ Tokenize ➜ Embedding ➜ Viz ➜ Model

Get data

PTT is a rich source of data

Lots and lots of new content every day

Write your own crawler: devtools::install_github("leoluyi/PTTr")

Cleaning and preprocessing text: keep the information, remove the noise

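
The slide does not say which noise is removed, so the following is only a minimal base-R sketch of the idea. The helper name clean_text and the choice of noise (URLs, HTML tags, repeated whitespace) are assumptions for illustration, not something taken from the deck.

```r
# Hypothetical helper (not from the slides): strip common noise from raw posts.
clean_text <- function(x) {
  x <- gsub("https?://[^[:space:]]+", " ", x)  # drop URLs
  x <- gsub("<[^>]+>", " ", x)                 # drop HTML tags
  x <- gsub("[[:space:]]+", " ", x)            # collapse runs of whitespace
  trimws(x)                                    # trim leading/trailing spaces
}

raw <- c("Check <b>this</b> out: http://example.com/post?id=1  now",
         "  plain   text\n")
cleaned <- clean_text(raw)
print(cleaned)
```

The same pattern extends to other kinds of noise (signatures, quoted replies) by adding one gsub() per rule, and stringr offers equivalent functions if a more readable pipeline is preferred.
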
Tokenize: transform whole text into parts

For English
- normalization
- stemming
- lemmatization
- POS tagging
- ...

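
To make the first two items concrete, here is a small base-R sketch of normalization and tokenization, followed by a deliberately crude suffix rule. The function names are hypothetical, and toy_stem is only a stand-in: it is not the Porter stemmer or any other published algorithm.

```r
# Normalization + tokenization in base R (illustrative helper names).
tokenize_en <- function(x) {
  x <- tolower(x)                          # normalization: case folding
  x <- gsub("[^a-z0-9]+", " ", x)          # keep only letters and digits
  tokens <- strsplit(trimws(x), " ", fixed = TRUE)[[1]]
  tokens[nzchar(tokens)]                   # drop empty strings
}

# Toy "stemmer": strip a trailing "ing" or "s" if at least 3 characters remain.
toy_stem <- function(tokens) {
  sub("^(.{3,}?)(ing|s)$", "\\1", tokens, perl = TRUE)
}

tokens <- tokenize_en("Mining texts, visualizing Results!")
stems <- toy_stem(tokens)
print(tokens)
print(stems)
```

The toy rule shows why real stemmers need more care: it maps "mining" to "min". For actual work a dedicated stemming or lemmatization package would be the better choice.
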
Chinese seems trickier
- with word segmentation
- without word segmentation
- POS tagging
- ...

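
The slide does not say what replaces segmentation when it is skipped. One common option, assumed here purely for illustration, is to use overlapping character n-grams as tokens. The function name char_ngrams is hypothetical.

```r
# Character n-grams: one way to tokenize Chinese text without a segmenter.
char_ngrams <- function(x, n = 2) {
  chars <- intToUtf8(utf8ToInt(x), multiple = TRUE)  # split into single characters
  if (length(chars) < n) return(character(0))
  starts <- seq_len(length(chars) - n + 1)
  vapply(starts,
         function(i) paste(chars[i:(i + n - 1)], collapse = ""),
         character(1))
}

zh <- "\u6587\u5b57\u63a2\u52d8"  # four Chinese characters meaning "text mining"
bigrams <- char_ngrams(zh)
print(bigrams)
```

For the segmentation route, jiebaR (listed under R tools in this deck) provides a dictionary-based segmenter.
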
Semantic Parsing vs. Bag-of-Words

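
The bag-of-words side of this contrast can be sketched in a few lines of base R: each document becomes a row of term counts and word order is thrown away. The three toy documents are made up for illustration.

```r
# Toy, already-tokenized documents (invented for this example).
docs <- list(d1 = c("text", "mining", "with", "r"),
             d2 = c("data", "viz", "with", "r"),
             d3 = c("text", "data", "text"))

vocab <- sort(unique(unlist(docs)))

# rows = documents, columns = terms, cells = raw counts (word order is discarded)
dtm <- t(vapply(docs,
                function(tokens) as.integer(table(factor(tokens, levels = vocab))),
                integer(length(vocab))))
colnames(dtm) <- vocab
print(dtm)
```

Semantic parsing, by contrast, would keep the structure of each sentence, which this representation cannot express.
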
R tools
- stringr
- jiebaR

Embedding (Encode, Feature Extraction)

Embedding
In a nutshell, Word Embedding turns text into numbers.
- Embedding Layer [1]
- Word2Vec
- GloVe
- doc2vec
- sense2vec
[1] https://machinelearningmastery.com/what-are-word-embeddings/

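
"Turns text into numbers" can be pictured as a lookup table with one numeric row per vocabulary word. The sketch below assumes random values and an arbitrary width of 3; a trained model such as Word2Vec or GloVe would learn these values instead.

```r
set.seed(42)
vocab <- c("text", "mining", "data", "viz")
dim_size <- 3  # arbitrary embedding width, chosen for illustration

# One row per vocabulary word; real models learn these values, here they are random.
embedding <- matrix(rnorm(length(vocab) * dim_size),
                    nrow = length(vocab),
                    dimnames = list(vocab, NULL))

tokens <- c("text", "viz", "text")
vectors <- embedding[tokens, , drop = FALSE]  # lookup: token -> numeric vector
doc_vector <- colMeans(vectors)               # crude document vector: the mean
print(vectors)
```

Averaging word vectors is only one simple way to get a document-level representation; doc2vec, listed above, learns document vectors directly.
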
Demo: Information Retrieval

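
The demo itself is not in the transcript. As a rough idea of what vector-based retrieval involves, the sketch below ranks toy documents by cosine similarity to a query; the vectors and vocabulary are invented, and cosine ranking is an assumption about the approach rather than a record of the demo.

```r
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Toy term-count vectors over the vocabulary (text, mining, data, viz)
docs <- rbind(d1 = c(2, 1, 0, 0),
              d2 = c(0, 0, 1, 2),
              d3 = c(1, 0, 1, 0))
query <- c(1, 1, 0, 0)

scores <- apply(docs, 1, cosine, b = query)        # one similarity per document
ranking <- names(sort(scores, decreasing = TRUE))  # best match first
print(round(scores, 3))
print(ranking)
```

The same ranking step works unchanged when the rows are embedding vectors instead of term counts.
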
Visualize
- Dimension Reduction
  - t-SNE
  - PCA
- Clustering
- Interactive or static plots

Visualize
- tsne::tsne()
- prcomp()

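
A minimal sketch of the prcomp() route, assuming the input is a matrix with one row per word. The data here is random and stands in for real word vectors, so the resulting layout carries no meaning.

```r
set.seed(1)
# Stand-in for word vectors: 20 "words" in 10 dimensions (random, illustration only)
vectors <- matrix(rnorm(20 * 10), nrow = 20,
                  dimnames = list(paste0("w", 1:20), NULL))

pca <- prcomp(vectors, center = TRUE, scale. = FALSE)
coords <- pca$x[, 1:2]  # first two principal components, one row per word
print(head(coords))
```

The two columns of coords can be passed to plot() and labelled with text(coords, labels = rownames(coords)) for a static scatter plot. tsne::tsne() can be swapped in for a non-linear projection at a higher computational cost.
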
Model

Tasks
- Classification: sort documents into categories
- Clustering: find similar items
- Generative models: automatic text generation

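
As one example of these tasks, the clustering case can be sketched with base R's hierarchical clustering. The six 2-D points are invented stand-ins for document vectors, and average linkage is an arbitrary choice for the illustration.

```r
# Toy 2-D document vectors forming two obvious groups
x <- rbind(a1 = c(0.0, 0.1), a2 = c(0.2, 0.0), a3 = c(0.1, 0.2),
           b1 = c(5.0, 5.1), b2 = c(5.2, 4.9), b3 = c(4.9, 5.0))

tree <- hclust(dist(x), method = "average")  # agglomerative clustering, Euclidean distance
groups <- cutree(tree, k = 2)                # cut the tree into two clusters
print(groups)
```

With real text, x would be a document-term matrix or a matrix of document embeddings, and the number of clusters is usually not known in advance.
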
In the end, you will want to write your own toolkit
- Sparse matrix manipulation
- Information retrieval tools
- ...

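
Document-term matrices are mostly zeros, which is why sparse storage comes up. The sketch below uses the Matrix package (an assumption; the slide names no package) to build a small sparse matrix from triplets and run two typical operations on it. The counts are invented.

```r
library(Matrix)

# Triplets (document index, term index, count): only non-zero cells are stored.
i <- c(1, 1, 2, 3, 3)
j <- c(1, 2, 3, 1, 4)
x <- c(2, 1, 1, 1, 3)
dtm <- sparseMatrix(i = i, j = j, x = x, dims = c(3, 4),
                    dimnames = list(paste0("d", 1:3),
                                    c("text", "mining", "data", "viz")))

term_totals <- colSums(dtm)   # total count of each term across documents
doc_sim <- tcrossprod(dtm)    # document-by-document dot products, still sparse
print(term_totals)
print(doc_sim)
```

Keeping everything in sparse form matters once the vocabulary reaches tens of thousands of terms, where a dense matrix would no longer fit in memory.
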
Summary
1. Problem definition & specific goal: Get Curious About Text
2. Finding Your Data
3. Preprocessing Your Data
   - Removing stopwords, Stemming, Segmentation, ...
4. Feature Extraction
   - Document-Term Matrix: tm, text2vec
   - Named Entity Recognition, POS tagging
   - Word embeddings: word2vec, GloVe
5. More Text Mining Skills
   - sentiment analysis
   - topicmodels, LDAvis: LDA
6. More Than Words - Visualizing Your Results

Thank you
Leo Lu
leoluyi@github
https://leoluyi.github.io