Lock in $30 Savings on PRO—Offer Ends Soon! ⏳
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
NLTK Intro for PUGS
Search
Victor Neo
March 27, 2012
Programming
7
570
NLTK Intro for PUGS
Slides for the NLTK talk given on March 2012 for Python User Group SG Meetup.
Victor Neo
March 27, 2012
Tweet
Share
More Decks by Victor Neo
See All by Victor Neo
Django - The Next Steps
victorneo
5
640
DevOps: Python tools to get started
victorneo
9
13k
Git and Python workshop
victorneo
2
790
Other Decks in Programming
See All in Programming
社内オペレーション改善のためのTypeScript / TSKaigi Hokuriku 2025
dachi023
1
540
モデル駆動設計をやってみようワークショップ開催報告(Modeling Forum2025) / model driven design workshop report
haru860
0
230
LLMで複雑な検索条件アセットから脱却する!! 生成的検索インタフェースの設計論
po3rin
1
410
AI時代もSEOを頑張っている話
shirahama_x
0
260
ソフトウェア設計の課題・原則・実践技法
masuda220
PRO
26
22k
Reactive Thinking with Signals and the new Resource API
manfredsteyer
PRO
0
170
AIコーディングエージェント(Manus)
kondai24
0
150
無秩序からの脱却 / Emergence from chaos
nrslib
2
12k
STYLE
koic
0
100
tsgolintはいかにしてtypescript-goの非公開APIを呼び出しているのか
syumai
5
1.9k
関数の挙動書き換える
takatofukui
4
770
「コードは上から下へ読むのが一番」と思った時に、思い出してほしい話
panda728
PRO
37
23k
Featured
See All Featured
Done Done
chrislema
186
16k
Refactoring Trust on Your Teams (GOTO; Chicago 2020)
rmw
35
3.3k
Testing 201, or: Great Expectations
jmmastey
46
7.8k
Fashionably flexible responsive web design (full day workshop)
malarkey
407
66k
Unsuck your backbone
ammeep
671
58k
Facilitating Awesome Meetings
lara
57
6.7k
Design and Strategy: How to Deal with People Who Don’t "Get" Design
morganepeng
132
19k
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
25
1.6k
Building Better People: How to give real-time feedback that sticks.
wjessup
370
20k
Embracing the Ebb and Flow
colly
88
4.9k
Designing for humans not robots
tammielis
254
26k
Building Adaptive Systems
keathley
44
2.9k
Transcript
Natural Language Toolkit @victorneo
Natural Language Processing
"the process of a computer extracting meaningful information from natural
language input and/or producing natural language output"
None
Getting started with NLTK
Open source Python modules, linguistic data and documentation for research
and development in natural language processing and text analytics, with distributions for Windows, Mac OSX and Linux. NLTK
None
installatio n # you might need numpy pip install nltk
# enter Python shell import nltk nltk.download()
None
packages # For Part of Speech tagging maxent_treebank_pos_tagger # Get
a list of stopwords stopwords # Brown corpus to play around brown
Preparing data / corpus
tokens NLTK works on Tokens, for example, "Hello World!" will
be tokenized to: ['Hello', 'World', '!'] The built-in tokenizer for most use cases: nltk.word_tokenize("Hello World!")
text processing HTML text: raw = nltk.clean_html(html_text) tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens) Use BeautifulSoup for preprocessing of the HTML text to discard unnecessary data.
Part-of-speech tagging
pos tagging text = "Run away!" nltk.word_tokenize(text) nltk.pos_tag(tokens) [('Run', 'NNP'),
('away', 'RB'), ('!', '.')]
pos tagging [('Run', 'NNP'), ('away', 'RB'), ('!', '.')] NNP: Proper
Noun, Singular RB : Adverb http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos. html
pos tagging "The sailor dogs the barmaid." [('The', 'DT'), ('sailor',
'NN'), ('dogs', 'NNS'), ('the', 'DT'), ('barmaid', 'NN'), ('.', '.')]
Sentiment Analysis Code: http://bit.ly/GLu2Q9
Differentiate between "happy" and "sad" tweets. Teach the classifier the
"features" of happy & sad tweets and test how good it is.
Happy: "Looking through old pics and realizing everything happens for
a reason. So happy with where I am right now" Sad: "So sad I have 8 AM class tomorrow"
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
happy.txt sad.txt happy_test.txt sad_test.txt } training data } testing data
Tweets obtained from Twitter Search API
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
Happy tweets usually contain the following words: "am happy", "great
day" etc. Sad tweets usually contain the following: "not happy", "am sad" etc. features
{'contains(not)': False, 'contains(view)': False, 'contains(best)': False, 'contains(excited)': False, 'contains(morning)': False,
'contains(about)': False, 'contains(horrible)': True, 'contains(like)': False, ... } output of extract_features()
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
training_set = \ nltk.classify.util.\ apply_features(extract_features, tweets) classifier = \ NaiveBayesClassifier.train
(training_set) training the classifer training classifer
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
def classify_tweet(tweet): return \ classifier.classify(extract_features (tweet)) testing classifer
$ python classification.py Total accuracy: 90.00% (18/20) 18 tweets got
classified correctly.
Where to go from here.
http://www.nltk.org/book
https://class.coursera.org/nlp/auth/welcome
http://www.slideshare.net/shanbady/nltk-boston-text-analytics
[('Thank', 'NNP'), ('you', 'PRP'), ('.', '.')] @victorneo