Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
NLTK Intro for PUGS
Search
Sponsored
·
SiteGround - Reliable hosting with speed, security, and support you can count on.
→
Victor Neo
March 27, 2012
Programming
7
590
NLTK Intro for PUGS
Slides for the NLTK talk given on March 2012 for Python User Group SG Meetup.
Victor Neo
March 27, 2012
Tweet
Share
More Decks by Victor Neo
See All by Victor Neo
Django - The Next Steps
victorneo
5
670
DevOps: Python tools to get started
victorneo
9
13k
Git and Python workshop
victorneo
2
800
Other Decks in Programming
See All in Programming
Angular-Apps smarter machen mit Gen AI: Lokal und offlinefähig - Hands-on Workshop!
christianliebel
PRO
0
140
今からFlash開発できるわけないじゃん、ムリムリ! (※ムリじゃなかった!?)
arkw
0
160
見せてもらおうか、 OpenSearchの性能とやらを!
shunta27
1
150
CS教育のDX AIによる育成の効率化
niftycorp
PRO
0
170
AI時代のシステム設計:ドメインモデルで変更しやすさを守る設計戦略
masuda220
PRO
6
1.1k
ロボットのための工場に灯りは要らない
watany
12
3.2k
GC言語のWasm化とComponent Modelサポートの実践と課題 - Scalaの場合
tanishiking
0
130
野球解説AI Agentを開発してみた - 2026/02/27 LayerX社内LT会資料
shinyorke
PRO
0
370
脱 雰囲気実装!AgentCoreを良い感じにWEBアプリケーションに組み込むために
takuyay0ne
3
410
What Spring Developers Should Know About Jakarta EE
ivargrimstad
0
720
AI Assistants for Your Angular Solutions
manfredsteyer
PRO
0
160
AI Assistants for YourAngular Solutions @Angular Graz, March 2026
manfredsteyer
PRO
0
120
Featured
See All Featured
Six Lessons from altMBA
skipperchong
29
4.2k
The Impact of AI in SEO - AI Overviews June 2024 Edition
aleyda
5
780
Mind Mapping
helmedeiros
PRO
1
130
The Illustrated Guide to Node.js - THAT Conference 2024
reverentgeek
1
320
[Rails World 2023 - Day 1 Closing Keynote] - The Magic of Rails
eileencodes
38
2.8k
CSS Pre-Processors: Stylus, Less & Sass
bermonpainter
360
30k
Primal Persuasion: How to Engage the Brain for Learning That Lasts
tmiket
0
300
Documentation Writing (for coders)
carmenintech
77
5.3k
Distributed Sagas: A Protocol for Coordinating Microservices
caitiem20
333
22k
We Analyzed 250 Million AI Search Results: Here's What I Found
joshbly
1
1.1k
Intergalactic Javascript Robots from Outer Space
tanoku
273
27k
YesSQL, Process and Tooling at Scale
rocio
174
15k
Transcript
Natural Language Toolkit @victorneo
Natural Language Processing
"the process of a computer extracting meaningful information from natural
language input and/or producing natural language output"
None
Getting started with NLTK
Open source Python modules, linguistic data and documentation for research
and development in natural language processing and text analytics, with distributions for Windows, Mac OSX and Linux. NLTK
None
installatio n # you might need numpy pip install nltk
# enter Python shell import nltk nltk.download()
None
packages # For Part of Speech tagging maxent_treebank_pos_tagger # Get
a list of stopwords stopwords # Brown corpus to play around brown
Preparing data / corpus
tokens NLTK works on Tokens, for example, "Hello World!" will
be tokenized to: ['Hello', 'World', '!'] The built-in tokenizer for most use cases: nltk.word_tokenize("Hello World!")
text processing HTML text: raw = nltk.clean_html(html_text) tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens) Use BeautifulSoup for preprocessing of the HTML text to discard unnecessary data.
Part-of-speech tagging
pos tagging text = "Run away!" nltk.word_tokenize(text) nltk.pos_tag(tokens) [('Run', 'NNP'),
('away', 'RB'), ('!', '.')]
pos tagging [('Run', 'NNP'), ('away', 'RB'), ('!', '.')] NNP: Proper
Noun, Singular RB : Adverb http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos. html
pos tagging "The sailor dogs the barmaid." [('The', 'DT'), ('sailor',
'NN'), ('dogs', 'NNS'), ('the', 'DT'), ('barmaid', 'NN'), ('.', '.')]
Sentiment Analysis Code: http://bit.ly/GLu2Q9
Differentiate between "happy" and "sad" tweets. Teach the classifier the
"features" of happy & sad tweets and test how good it is.
Happy: "Looking through old pics and realizing everything happens for
a reason. So happy with where I am right now" Sad: "So sad I have 8 AM class tomorrow"
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
happy.txt sad.txt happy_test.txt sad_test.txt } training data } testing data
Tweets obtained from Twitter Search API
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
Happy tweets usually contain the following words: "am happy", "great
day" etc. Sad tweets usually contain the following: "not happy", "am sad" etc. features
{'contains(not)': False, 'contains(view)': False, 'contains(best)': False, 'contains(excited)': False, 'contains(morning)': False,
'contains(about)': False, 'contains(horrible)': True, 'contains(like)': False, ... } output of extract_features()
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
training_set = \ nltk.classify.util.\ apply_features(extract_features, tweets) classifier = \ NaiveBayesClassifier.train
(training_set) training the classifer training classifer
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
def classify_tweet(tweet): return \ classifier.classify(extract_features (tweet)) testing classifer
$ python classification.py Total accuracy: 90.00% (18/20) 18 tweets got
classified correctly.
Where to go from here.
http://www.nltk.org/book
https://class.coursera.org/nlp/auth/welcome
http://www.slideshare.net/shanbady/nltk-boston-text-analytics
[('Thank', 'NNP'), ('you', 'PRP'), ('.', '.')] @victorneo