NLTK Intro for PUGS
Victor Neo
March 27, 2012
Slides for the NLTK talk given in March 2012 at the Python User Group SG Meetup.
Transcript
Natural Language Toolkit @victorneo
Natural Language Processing
"the process of a computer extracting meaningful information from natural
language input and/or producing natural language output"
Getting started with NLTK
NLTK: Open source Python modules, linguistic data and documentation for research
and development in natural language processing and text analytics, with distributions for Windows, Mac OS X and Linux.
installation
    # you might need numpy
    pip install nltk
    # enter Python shell
    import nltk
    nltk.download()
packages
    # For part-of-speech tagging
    maxent_treebank_pos_tagger
    # Get a list of stopwords
    stopwords
    # Brown corpus to play around with
    brown
Preparing data / corpus
tokens
NLTK works on tokens. For example, "Hello World!" will be tokenized to:
    ['Hello', 'World', '!']
The built-in tokenizer covers most use cases:
    nltk.word_tokenize("Hello World!")
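To make the idea of a token concrete, here is a minimal standard-library sketch that reproduces the slide's output for this one sentence. The name simple_tokenize and its regular expression are hypothetical stand-ins for illustration; this is not how nltk.word_tokenize is implemented, and the real tokenizer handles contractions and abbreviations with more care.

```python
import re

# Hypothetical stand-in for illustration only; NOT the NLTK implementation.
# A token is either a run of word characters or one punctuation character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("Hello World!"))  # ['Hello', 'World', '!']
```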
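text processing
HTML text:
    raw = nltk.clean_html(html_text)
    tokens = nltk.word_tokenize(raw)
    text = nltk.Text(tokens)
Use BeautifulSoup to preprocess the HTML text and discard unnecessary data. (In NLTK 3, nltk.clean_html is no longer implemented and raises an error that points to BeautifulSoup instead.)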
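As a rough illustration of the preprocessing step, the sketch below strips markup using only the standard library's html.parser. It is an assumption-laden stand-in, not the talk's code: TextExtractor and html_to_text are hypothetical names, and BeautifulSoup, as the slide recommends, is the more robust choice for real pages.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html_text):
    """Return the visible text of an HTML string, chunks joined by spaces."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return " ".join(parser.chunks)


html_text = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello <b>World</b>!</p></body></html>"
)
raw = html_to_text(html_text)
print(raw)
```

The resulting raw string can then be passed to the tokenizer in place of the output of nltk.clean_html.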
Part-of-speech tagging
pos tagging
    text = "Run away!"
    tokens = nltk.word_tokenize(text)
    nltk.pos_tag(tokens)
    [('Run', 'NNP'), ('away', 'RB'), ('!', '.')]
pos tagging
    [('Run', 'NNP'), ('away', 'RB'), ('!', '.')]
NNP: Proper Noun, Singular
RB: Adverb
http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
pos tagging
    "The sailor dogs the barmaid."
    [('The', 'DT'), ('sailor', 'NN'), ('dogs', 'NNS'), ('the', 'DT'), ('barmaid', 'NN'), ('.', '.')]
Sentiment Analysis Code: http://bit.ly/GLu2Q9
Differentiate between "happy" and "sad" tweets. Teach the classifier the
"features" of happy & sad tweets and test how good it is.
Happy: "Looking through old pics and realizing everything happens for a reason. So happy with where I am right now"
Sad: "So sad I have 8 AM class tomorrow"
Process data (tweets) -> Extract features -> Train classifier -> Test classifier accuracy
Tokenize tweets -> extract_features -> Naive Bayes Classifier
Training data: happy.txt, sad.txt
Testing data: happy_test.txt, sad_test.txt
Tweets obtained from Twitter Search API
features
Happy tweets usually contain phrases such as "am happy" and "great day". Sad tweets usually contain phrases such as "not happy" and "am sad".
output of extract_features():
    {'contains(not)': False, 'contains(view)': False, 'contains(best)': False,
     'contains(excited)': False, 'contains(morning)': False, 'contains(about)': False,
     'contains(horrible)': True, 'contains(like)': False, ... }
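A dictionary of this shape could be produced by a function along the following lines. This is a minimal sketch rather than the talk's actual code: the WORD_FEATURES vocabulary is a hypothetical list taken from the keys shown on the slide, whereas a real run would presumably derive it from the training tweets.

```python
# Hypothetical vocabulary, copied from the keys visible on the slide.
WORD_FEATURES = [
    "not", "view", "best", "excited", "morning", "about", "horrible", "like",
]


def extract_features(tokens):
    """Map a tokenized tweet to {'contains(word)': bool} per vocabulary word."""
    present = {token.lower() for token in tokens}
    return {"contains(%s)" % word: (word in present) for word in WORD_FEATURES}


features = extract_features(["What", "a", "horrible", "day"])
print(features["contains(horrible)"])  # True
print(features["contains(best)"])      # False
```

Each tweet becomes a fixed set of boolean features, which is the input format a Naive Bayes classifier can be trained on.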
training the classifier
    training_set = nltk.classify.util.apply_features(extract_features, tweets)
    classifier = NaiveBayesClassifier.train(training_set)
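The training step can be tried end to end on a handful of hand-written examples. The sketch below assumes NLTK is installed; the four tweets and the WORD_FEATURES list are made-up toy data, not the talk's data set, and the simplified extract_features here is an assumption about what the talk's version does.

```python
import nltk
from nltk.classify.util import apply_features

# Hypothetical toy vocabulary and data, for illustration only.
WORD_FEATURES = ["happy", "great", "sad", "horrible"]


def extract_features(tokens):
    """Map a tokenized tweet to {'contains(word)': bool} per vocabulary word."""
    present = set(tokens)
    return {"contains(%s)" % word: (word in present) for word in WORD_FEATURES}


# Each item is (tokens, label), which apply_features treats as labeled data.
tweets = [
    (["so", "happy", "today"], "happy"),
    (["great", "day"], "happy"),
    (["so", "sad"], "sad"),
    (["horrible", "morning"], "sad"),
]

# apply_features computes the feature dictionaries lazily, one tweet at a time.
training_set = apply_features(extract_features, tweets)
classifier = nltk.NaiveBayesClassifier.train(training_set)

print(classifier.classify(extract_features(["feeling", "sad"])))
```

Using apply_features instead of a plain list keeps memory use low on larger tweet collections, since features are built only when the classifier reads them.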
testing the classifier
    def classify_tweet(tweet):
        return classifier.classify(extract_features(tweet))
    $ python classification.py
    Total accuracy: 90.00% (18/20)
18 of the 20 test tweets were classified correctly.
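The accuracy figure is simply the number of correct predictions divided by the number of test tweets, so 18 of 20 gives 90%. A minimal sketch of that calculation follows; accuracy_report and toy_classify are hypothetical names, and toy_classify is a keyword rule standing in for the trained classifier so the example runs without NLTK.

```python
def accuracy_report(classify, labeled_tweets):
    """Return (correct, total, percent) for a classify(tweet) -> label function."""
    total = len(labeled_tweets)
    correct = sum(1 for tweet, label in labeled_tweets if classify(tweet) == label)
    percent = 100.0 * correct / total if total else 0.0
    return correct, total, percent


def toy_classify(tweet):
    """Hypothetical stand-in for the trained model: a single keyword rule."""
    return "sad" if "sad" in tweet.lower() else "happy"


test_tweets = [
    ("So happy with where I am right now", "happy"),
    ("So sad I have 8 AM class tomorrow", "sad"),
    ("Not happy about this at all", "sad"),
    ("Great day", "happy"),
]

correct, total, percent = accuracy_report(toy_classify, test_tweets)
print("Total accuracy: %.2f%% (%d/%d)" % (percent, correct, total))
# Total accuracy: 75.00% (3/4)
```

The keyword rule misses "Not happy about this at all", which is the kind of case the 'contains(not)' feature is meant to help a trained classifier with.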
Where to go from here.
http://www.nltk.org/book
https://class.coursera.org/nlp/auth/welcome
http://www.slideshare.net/shanbady/nltk-boston-text-analytics
[('Thank', 'NNP'), ('you', 'PRP'), ('.', '.')] @victorneo