Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
NLTK Intro for PUGS
Search
Victor Neo
March 27, 2012
Programming
590
7
Share
NLTK Intro for PUGS
Slides for the NLTK talk given on March 2012 for Python User Group SG Meetup.
Victor Neo
March 27, 2012
More Decks by Victor Neo
See All by Victor Neo
Django - The Next Steps
victorneo
5
690
DevOps: Python tools to get started
victorneo
9
13k
Git and Python workshop
victorneo
2
810
Other Decks in Programming
See All in Programming
AIベース静的検査器の偽陽性率を抑える工夫3選
orgachem
PRO
4
460
WebAssembly を読み込むベストプラクティス 2026年春版 / Best Practices for Loading WebAssembly (Spring 2026)
petamoriken
5
1.1k
クラウドネイティブなエンジニアに向ける Raycastの魅力と実際の活用事例
nealle
2
260
ソフトウェア設計の結合バランス #phperkaigi
kajitack
0
510
The Past, Present, and Future of Enterprise Java
ivargrimstad
0
350
書き換えて学ぶTemporal #fukts
pirosikick
2
380
Claude CodeでETLジョブ実行テストを自動化してみた
yoshikikasama
0
1.2k
【ディップ|26年新卒研修資料】TDD実装演習
dip_tech
PRO
0
190
サーバーレスで作る、動画データ管理基盤
oyasumipants
0
190
いつか誰かが、と思っていた フロントエンド刷新5年間の実践知
kiichisugihara
1
280
Back to the roots of date
jinroq
0
850
t *testing.T は どこからやってくるの?
otakakot
1
940
Featured
See All Featured
How to Grow Your eCommerce with AI & Automation
katarinadahlin
PRO
1
180
JavaScript: Past, Present, and Future - NDC Porto 2020
reverentgeek
52
5.9k
A brief & incomplete history of UX Design for the World Wide Web: 1989–2019
jct
2
370
エンジニアに許された特別な時間の終わり
watany
106
240k
ReactJS: Keep Simple. Everything can be a component!
pedronauck
666
130k
RailsConf 2023
tenderlove
30
1.4k
Navigating Algorithm Shifts & AI Overviews - #SMXNext
aleyda
1
1.2k
The Psychology of Web Performance [Beyond Tellerrand 2023]
tammyeverts
49
3.4k
A better future with KSS
kneath
240
18k
Refactoring Trust on Your Teams (GOTO; Chicago 2020)
rmw
35
3.5k
10 Git Anti Patterns You Should be Aware of
lemiorhan
PRO
659
62k
Java REST API Framework Comparison - PWX 2021
mraible
34
9.3k
Transcript
Natural Language Toolkit @victorneo
Natural Language Processing
"the process of a computer extracting meaningful information from natural
language input and/or producing natural language output"
None
Getting started with NLTK
Open source Python modules, linguistic data and documentation for research
and development in natural language processing and text analytics, with distributions for Windows, Mac OSX and Linux. NLTK
None
installatio n # you might need numpy pip install nltk
# enter Python shell import nltk nltk.download()
None
packages # For Part of Speech tagging maxent_treebank_pos_tagger # Get
a list of stopwords stopwords # Brown corpus to play around brown
Preparing data / corpus
tokens NLTK works on Tokens, for example, "Hello World!" will
be tokenized to: ['Hello', 'World', '!'] The built-in tokenizer for most use cases: nltk.word_tokenize("Hello World!")
text processing HTML text: raw = nltk.clean_html(html_text) tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens) Use BeautifulSoup for preprocessing of the HTML text to discard unnecessary data.
Part-of-speech tagging
pos tagging text = "Run away!" nltk.word_tokenize(text) nltk.pos_tag(tokens) [('Run', 'NNP'),
('away', 'RB'), ('!', '.')]
pos tagging [('Run', 'NNP'), ('away', 'RB'), ('!', '.')] NNP: Proper
Noun, Singular RB : Adverb http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos. html
pos tagging "The sailor dogs the barmaid." [('The', 'DT'), ('sailor',
'NN'), ('dogs', 'NNS'), ('the', 'DT'), ('barmaid', 'NN'), ('.', '.')]
Sentiment Analysis Code: http://bit.ly/GLu2Q9
Differentiate between "happy" and "sad" tweets. Teach the classifier the
"features" of happy & sad tweets and test how good it is.
Happy: "Looking through old pics and realizing everything happens for
a reason. So happy with where I am right now" Sad: "So sad I have 8 AM class tomorrow"
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
happy.txt sad.txt happy_test.txt sad_test.txt } training data } testing data
Tweets obtained from Twitter Search API
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
Happy tweets usually contain the following words: "am happy", "great
day" etc. Sad tweets usually contain the following: "not happy", "am sad" etc. features
{'contains(not)': False, 'contains(view)': False, 'contains(best)': False, 'contains(excited)': False, 'contains(morning)': False,
'contains(about)': False, 'contains(horrible)': True, 'contains(like)': False, ... } output of extract_features()
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
training_set = \ nltk.classify.util.\ apply_features(extract_features, tweets) classifier = \ NaiveBayesClassifier.train
(training_set) training the classifer training classifer
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
def classify_tweet(tweet): return \ classifier.classify(extract_features (tweet)) testing classifer
$ python classification.py Total accuracy: 90.00% (18/20) 18 tweets got
classified correctly.
Where to go from here.
http://www.nltk.org/book
https://class.coursera.org/nlp/auth/welcome
http://www.slideshare.net/shanbady/nltk-boston-text-analytics
[('Thank', 'NNP'), ('you', 'PRP'), ('.', '.')] @victorneo