Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
NLTK Intro for PUGS
Search
Victor Neo
March 27, 2012
Programming
7
540
NLTK Intro for PUGS
Slides for the NLTK talk given on March 2012 for Python User Group SG Meetup.
Victor Neo
March 27, 2012
Tweet
Share
More Decks by Victor Neo
See All by Victor Neo
Django - The Next Steps
victorneo
5
580
DevOps: Python tools to get started
victorneo
9
13k
Git and Python workshop
victorneo
2
780
Other Decks in Programming
See All in Programming
hotwire_or_react
harunatsujita
6
2.7k
Why Spring Matters to Jakarta EE - and Vice Versa
ivargrimstad
0
560
推し活としてのrails new/oshikatsu_ha_iizo
sakahukamaki
3
920
詳細解説! ArrayListの仕組みと実装
yujisoftware
0
420
レイトレ合宿10 レンダラー紹介 / Renderer Introduction, Ray Tracing Camp 10
shocker_0x15
0
680
リリース8年目のサービスの1800個のERBファイルをViewComponentに移行した方法とその結果
katty0324
2
2.8k
GPU Hash Table | レイトレ合宿10
yknishidate
0
360
テスト駆動開発✅️
akitoshiga
1
210
tc39 x jsconf.jp Panel Discussion 2024
yosuke_furukawa
PRO
0
130
WEBエンジニア向けAI活用入門
sutetotanuki
0
280
本編では話さない Zig の話
andpad
2
150
Nuxt UI Pro、NuxtHub、Nuxt Scripts、Nuxtエコシステムをふんだんに利用して開発するコーポレートサイト@Vue Fes Japan 2024
shingangan
3
820
Featured
See All Featured
The Success of Rails: Ensuring Growth for the Next 100 Years
eileencodes
43
6.6k
The Art of Delivering Value - GDevCon NA Keynote
reverentgeek
5
140
Testing 201, or: Great Expectations
jmmastey
38
7k
Fontdeck: Realign not Redesign
paulrobertlloyd
81
5.2k
Sharpening the Axe: The Primacy of Toolmaking
bcantrill
37
1.8k
Unsuck your backbone
ammeep
668
57k
ReactJS: Keep Simple. Everything can be a component!
pedronauck
664
120k
Responsive Adventures: Dirty Tricks From The Dark Corners of Front-End
smashingmag
250
21k
Designing Dashboards & Data Visualisations in Web Apps
destraynor
228
52k
Building a Modern Day E-commerce SEO Strategy
aleyda
38
6.9k
Navigating Team Friction
lara
183
14k
Speed Design
sergeychernyshev
24
560
Transcript
Natural Language Toolkit @victorneo
Natural Language Processing
"the process of a computer extracting meaningful information from natural
language input and/or producing natural language output"
None
Getting started with NLTK
Open source Python modules, linguistic data and documentation for research
and development in natural language processing and text analytics, with distributions for Windows, Mac OSX and Linux. NLTK
None
installatio n # you might need numpy pip install nltk
# enter Python shell import nltk nltk.download()
None
packages # For Part of Speech tagging maxent_treebank_pos_tagger # Get
a list of stopwords stopwords # Brown corpus to play around brown
Preparing data / corpus
tokens NLTK works on Tokens, for example, "Hello World!" will
be tokenized to: ['Hello', 'World', '!'] The built-in tokenizer for most use cases: nltk.word_tokenize("Hello World!")
text processing HTML text: raw = nltk.clean_html(html_text) tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens) Use BeautifulSoup for preprocessing of the HTML text to discard unnecessary data.
Part-of-speech tagging
pos tagging text = "Run away!" nltk.word_tokenize(text) nltk.pos_tag(tokens) [('Run', 'NNP'),
('away', 'RB'), ('!', '.')]
pos tagging [('Run', 'NNP'), ('away', 'RB'), ('!', '.')] NNP: Proper
Noun, Singular RB : Adverb http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos. html
pos tagging "The sailor dogs the barmaid." [('The', 'DT'), ('sailor',
'NN'), ('dogs', 'NNS'), ('the', 'DT'), ('barmaid', 'NN'), ('.', '.')]
Sentiment Analysis Code: http://bit.ly/GLu2Q9
Differentiate between "happy" and "sad" tweets. Teach the classifier the
"features" of happy & sad tweets and test how good it is.
Happy: "Looking through old pics and realizing everything happens for
a reason. So happy with where I am right now" Sad: "So sad I have 8 AM class tomorrow"
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
happy.txt sad.txt happy_test.txt sad_test.txt } training data } testing data
Tweets obtained from Twitter Search API
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
Happy tweets usually contain the following words: "am happy", "great
day" etc. Sad tweets usually contain the following: "not happy", "am sad" etc. features
{'contains(not)': False, 'contains(view)': False, 'contains(best)': False, 'contains(excited)': False, 'contains(morning)': False,
'contains(about)': False, 'contains(horrible)': True, 'contains(like)': False, ... } output of extract_features()
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
training_set = \ nltk.classify.util.\ apply_features(extract_features, tweets) classifier = \ NaiveBayesClassifier.train
(training_set) training the classifer training classifer
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
def classify_tweet(tweet): return \ classifier.classify(extract_features (tweet)) testing classifer
$ python classification.py Total accuracy: 90.00% (18/20) 18 tweets got
classified correctly.
Where to go from here.
http://www.nltk.org/book
https://class.coursera.org/nlp/auth/welcome
http://www.slideshare.net/shanbady/nltk-boston-text-analytics
[('Thank', 'NNP'), ('you', 'PRP'), ('.', '.')] @victorneo