Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
NLTK Intro for PUGS
Search
Sponsored
·
Ship Features Fearlessly
Turn features on and off without deploys. Used by thousands of Ruby developers.
→
Victor Neo
March 27, 2012
Programming
590
7
Share
NLTK Intro for PUGS
Slides for the NLTK talk given on March 2012 for Python User Group SG Meetup.
Victor Neo
March 27, 2012
More Decks by Victor Neo
See All by Victor Neo
Django - The Next Steps
victorneo
5
670
DevOps: Python tools to get started
victorneo
9
13k
Git and Python workshop
victorneo
2
800
Other Decks in Programming
See All in Programming
一度始めたらやめられない開発効率向上術 / Findy あなたのdotfilesを教えて!
k0kubun
3
2.7k
野球解説AI Agentを開発してみた - 2026/02/27 LayerX社内LT会資料
shinyorke
PRO
0
380
存在論的プログラミング: 時間と存在を記述する
koriym
5
740
見せてもらおうか、 OpenSearchの性能とやらを!
shunta27
1
170
今こそ押さえておきたい アマゾンウェブサービス(AWS)の データベースの基礎 おもクラ #6版
satoshi256kbyte
1
220
それはエンジニアリングの糧である:AI開発のためにAIのOSSを開発する現場より / It serves as fuel for engineering: insights from the field of developing open-source AI for AI development.
nrslib
1
820
Reactive ❤️ Loom: A Forbidden Love Story
franz1981
2
210
20260315 AWSなんもわからん🥲
chiilog
2
180
年間50登壇、単著出版、雑誌寄稿、Podcast出演、YouTube、CM、カンファレンス主催……全部やってみたので面白さ等を比較してみよう / I’ve tried them all, so let’s compare how interesting they are.
nrslib
4
640
AI Assistants for YourAngular Solutions @Angular Graz, March 2026
manfredsteyer
PRO
0
140
AI活用のコスパを最大化する方法
ochtum
0
360
L’IA au service des devs : Anatomie d'un assistant de Code Review
toham
0
170
Featured
See All Featured
The State of eCommerce SEO: How to Win in Today's Products SERPs - #SEOweek
aleyda
2
10k
"I'm Feeling Lucky" - Building Great Search Experiences for Today's Users (#IAC19)
danielanewman
231
22k
Ten Tips & Tricks for a 🌱 transition
stuffmc
0
95
Hiding What from Whom? A Critical Review of the History of Programming languages for Music
tomoyanonymous
2
620
The Language of Interfaces
destraynor
162
26k
Introduction to Domain-Driven Design and Collaborative software design
baasie
1
700
A better future with KSS
kneath
240
18k
The Cult of Friendly URLs
andyhume
79
6.8k
It's Worth the Effort
3n
188
29k
Put a Button on it: Removing Barriers to Going Fast.
kastner
60
4.2k
Stop Working from a Prison Cell
hatefulcrawdad
274
21k
GitHub's CSS Performance
jonrohan
1032
470k
Transcript
Natural Language Toolkit @victorneo
Natural Language Processing
"the process of a computer extracting meaningful information from natural
language input and/or producing natural language output"
None
Getting started with NLTK
Open source Python modules, linguistic data and documentation for research
and development in natural language processing and text analytics, with distributions for Windows, Mac OSX and Linux. NLTK
None
installatio n # you might need numpy pip install nltk
# enter Python shell import nltk nltk.download()
None
packages # For Part of Speech tagging maxent_treebank_pos_tagger # Get
a list of stopwords stopwords # Brown corpus to play around brown
Preparing data / corpus
tokens NLTK works on Tokens, for example, "Hello World!" will
be tokenized to: ['Hello', 'World', '!'] The built-in tokenizer for most use cases: nltk.word_tokenize("Hello World!")
text processing HTML text: raw = nltk.clean_html(html_text) tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens) Use BeautifulSoup for preprocessing of the HTML text to discard unnecessary data.
Part-of-speech tagging
pos tagging text = "Run away!" nltk.word_tokenize(text) nltk.pos_tag(tokens) [('Run', 'NNP'),
('away', 'RB'), ('!', '.')]
pos tagging [('Run', 'NNP'), ('away', 'RB'), ('!', '.')] NNP: Proper
Noun, Singular RB : Adverb http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos. html
pos tagging "The sailor dogs the barmaid." [('The', 'DT'), ('sailor',
'NN'), ('dogs', 'NNS'), ('the', 'DT'), ('barmaid', 'NN'), ('.', '.')]
Sentiment Analysis Code: http://bit.ly/GLu2Q9
Differentiate between "happy" and "sad" tweets. Teach the classifier the
"features" of happy & sad tweets and test how good it is.
Happy: "Looking through old pics and realizing everything happens for
a reason. So happy with where I am right now" Sad: "So sad I have 8 AM class tomorrow"
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
happy.txt sad.txt happy_test.txt sad_test.txt } training data } testing data
Tweets obtained from Twitter Search API
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
Happy tweets usually contain the following words: "am happy", "great
day" etc. Sad tweets usually contain the following: "not happy", "am sad" etc. features
{'contains(not)': False, 'contains(view)': False, 'contains(best)': False, 'contains(excited)': False, 'contains(morning)': False,
'contains(about)': False, 'contains(horrible)': True, 'contains(like)': False, ... } output of extract_features()
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
training_set = \ nltk.classify.util.\ apply_features(extract_features, tweets) classifier = \ NaiveBayesClassifier.train
(training_set) training the classifer training classifer
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
def classify_tweet(tweet): return \ classifier.classify(extract_features (tweet)) testing classifer
$ python classification.py Total accuracy: 90.00% (18/20) 18 tweets got
classified correctly.
Where to go from here.
http://www.nltk.org/book
https://class.coursera.org/nlp/auth/welcome
http://www.slideshare.net/shanbady/nltk-boston-text-analytics
[('Thank', 'NNP'), ('you', 'PRP'), ('.', '.')] @victorneo