Alyona Medelyan: Understanding human language with Python

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
Alyona Medelyan:
Understanding human language with Python
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
@ Kiwi PyCon 2014 - Sunday, 14 Sep 2014 - Track 1
http://kiwi.pycon.org/

**Audience level**

Novice

**Description**

Natural Language Processing (NLP) is an area of Computer Science that studies how computers can understand human language. This talk will explain the main principles behind NLP and introduce some key Python libraries.

**Abstract**

Natural Language Processing (NLP) is an area of Computer Science that studies how computers can understand human language. Thanks to NLP, one day you might have your own intelligent assistant who can understand any dialect, who can answer your questions by finding the right answers on the web, and who can help you communicate in any language through instant and accurate translation.

There are many challenges that still need to be overcome for this to happen, but in the meantime researchers have been finessing the building blocks required for NLP analysis. Some of the algorithms out there are quite powerful already. Many exist in Python and are readily available for any Pythonista to use.

This talk will explain the main principles behind NLP and introduce some key Python libraries. If you are interested in finding out more, please also attend the in-depth tutorial on Friday.

**YouTube**

https://www.youtube.com/watch?v=vZsW-xCXfRI

New Zealand Python User Group

September 14, 2014

Transcript

  1. Who am I?
     Alyona Medelyan
     ▪ In Natural Language Processing since 2000
     ▪ PhD in NLP & Machine Learning from Waikato
     ▪ Author of the open source keyword extraction algorithm Maui
     ▪ Author of the most-cited 2009 journal survey “Mining Meaning with Wikipedia”
     ▪ Past: Chief Research Officer at Pingar
     ▪ Now: Founder of Entopix, NLP consultancy & software development
     aka @zelandiya
  2. Word segmentation complexities
     ▪ [example text not captured in the transcript]
     ▪ [example text not captured in the transcript]
     ▪ The first hot dogs were sold by Charles Feltman on Coney Island in 1870.
     ▪ The first hot dogs were sold by Charles Feltman on Coney Island in 1870.
  3. What can we do with text?
     text → sentiment, keywords, tags, genre, categories, taxonomy terms, entities, names, patterns, biochemical entities, …
  4. How to get to the core words? Remove stopwords with NLTK
     ▪ even the acting in transcendence is solid, with the dreamy depp turning in a typically strong performance
     ▪ i think that transcendence has a pretty solid acting, with the dreamy depp turning in a strong performance as he usually does

         >>> from nltk.corpus import stopwords
         >>> stop = stopwords.words('english')
         >>> words = ['the', 'acting', 'in', 'transcendence', 'is', 'solid', 'with', 'the', 'dreamy', 'depp']
         >>> print [word for word in words if word not in stop]
         ['acting', 'transcendence', 'solid', 'dreamy', 'depp']

  5. Getting closer to the meaning: Part of Speech tagging with NLTK
     Flying planes can be dangerous

         >>> import nltk
         >>> from nltk.tokenize import word_tokenize
         >>> nltk.pos_tag(word_tokenize("Flying planes can be dangerous"))
         [('Flying', 'VBG'), ('planes', 'NNS'), ('can', 'MD'), ('be', 'VB'), ('dangerous', 'JJ')] ✓

  6. TFxIDF with Gensim

         from nltk.corpus import movie_reviews
         from gensim import corpora, models

         # One list of tokens per review
         texts = []
         for fileid in movie_reviews.fileids():
             texts.append(movie_reviews.words(fileid))

         # Map tokens to ids, convert each review to a bag of words, fit TFxIDF
         dictionary = corpora.Dictionary(texts)
         corpus = [dictionary.doc2bow(text) for text in texts]
         tfidf = models.TfidfModel(corpus)

  7. TFxIDF with Gensim (Results)

         for word in ['film', 'movie', 'comedy', 'violence', 'jolie']:
             my_id = dictionary.token2id.get(word)
             print word, '\t', tfidf.idfs[my_id]

         film      0.190174003903
         movie     0.364013496254
         comedy    1.98564470702
         violence  3.2108967825
         jolie     6.96578428466

  8. Where does this text belong? Text Categorization with NLTK
     Entertainment / Politics
     TVNZ: “Obama and Hangover star trade insults in interview”

         >>> train_set = [(document_features(d), c) for (d,c) in categorized_documents]
         >>> classifier = nltk.NaiveBayesClassifier.train(train_set)
         >>> doc_features = document_features(new_document)
         >>> category = classifier.classify(doc_features)

  9. Sentiment analysis with TextBlob

         >>> from textblob import TextBlob
         >>> blob = TextBlob("I love this library")
         >>> blob.sentiment
         Sentiment(polarity=0.5, subjectivity=0.6)

         for review in transcendence:
             blob = TextBlob(open(review).read())
             print review, blob.sentiment.polarity

         ../data/transcendence_1star.txt   0.0170799124247
         ../data/transcendence_5star.txt   0.0874591503268
         ../data/transcendence_8star.txt   0.256845238095
         ../data/transcendence_10star.txt  0.304310344828

  10. Keyword extraction in 3h: Understanding a movie review
      Keywords: bellboy, jennifer beals, four rooms, beals, rooms, tarantino, madonna, antonio banderas, valeria golino
      Review excerpt: … four of the biggest directors in hollywood: quentin tarantino, robert rodriguez, … were all directing one big film with a big and popular cast ... the second room (jennifer beals) was better, but lacking in plot ... the bumbling and mumbling bellboy … ruins every joke in the film …
      github.com/zelandiya/KiwiPyCon-NLP-tutorial
  11. Keyword extraction on 2000 movie reviews: What makes a successful movie?
      Negative: van damme, zeta-jones, smith, batman, de palma, eddie murphy, killer, tommy lee jones, wild west, mars, murphy, ship, space, brothers, de bont, ...
      Positive: star wars, disney, war, de niro, jackie, alien, jackie chan, private ryan, truman show, ben stiller, cameron, science fiction, cameron diaz, fiction, jack, ...
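
The stopword, TFxIDF and text categorization slides (4, 6, 7 and 8) can be tied together in a few lines of plain Python. The block below is only a rough sketch, written for Python 3 (the slide snippets use Python 2 print statements), and it makes several assumptions that are not in the talk. `STOP_WORDS` is a tiny hand-written stand-in for NLTK's English stop list, the four toy documents are invented, and `document_features` is a hypothetical version of the helper that slide 8 calls but does not define. The IDF formula log2(N / df) is assumed to be what Gensim's `TfidfModel` applies by default. It is at least consistent with slide 7: log2(2000 / 16) ≈ 6.9658, the value printed for 'jolie', which would mean the word occurs in 16 of the 2000 reviews.

```python
import math

# Assumption: a tiny hand-written stop list standing in for
# nltk.corpus.stopwords.words('english'), which is much longer.
STOP_WORDS = {"the", "in", "is", "with", "a", "of", "and"}


def remove_stopwords(words, stop_words=STOP_WORDS):
    """Slide 4: keep only the words that are not in the stop list."""
    return [word for word in words if word not in stop_words]


def inverse_document_frequencies(documents):
    """Slides 6-7: IDF as log2(N / df), df = number of documents containing the word."""
    total = len(documents)
    doc_freq = {}
    for words in documents:
        for word in set(words):  # count each word at most once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log2(total / df) for word, df in doc_freq.items()}


def document_features(words, vocabulary):
    """Slide 8: one boolean feature per vocabulary word (hypothetical helper)."""
    present = set(words)
    return {"contains(%s)" % word: word in present for word in vocabulary}


# Toy corpus invented for illustration; this is not data from the talk.
documents = [
    ["the", "acting", "in", "transcendence", "is", "solid"],
    ["the", "plot", "is", "thin"],
    ["solid", "acting", "with", "a", "thin", "plot"],
    ["the", "dreamy", "depp"],
]

filtered = [remove_stopwords(doc) for doc in documents]
idfs = inverse_document_frequencies(filtered)
features = document_features(filtered[0], sorted(idfs))

# Rare words get a higher IDF than words that occur in many documents.
for word in sorted(idfs, key=idfs.get):
    print(word, idfs[word])
```

Each dictionary returned by `document_features` can be paired with a category label to form the `(features, label)` tuples that `nltk.NaiveBayesClassifier.train` takes on slide 8.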