Alyona Medelyan: Understanding human language with Python

@ Kiwi PyCon 2014 - Sunday, 14 Sep 2014 - Track 1

Natural Language Processing (NLP) is an area of Computer Science that studies how computers can understand human language. Thanks to NLP, one day you might have your own intelligent assistant who can understand any dialect, who can answer your questions by finding the right answers on the web, and who can help you communicate in any language through instant and accurate translation.

There are many challenges that still need to be overcome for this to happen, but in the meantime researchers have been finessing the building blocks required for NLP analysis. Some of the algorithms out there are quite powerful already. Many exist in Python and are readily available for any Pythonista to use.

This talk will explain the main principles behind NLP and introduce some key Python libraries. If you are interested in finding out more, please also attend the in-depth tutorial on Friday.



September 14, 2014

Who am I?

Alyona Medelyan
▪ In Natural Language Processing since 2000
▪ PhD in NLP & Machine Learning from Waikato
▪ Author of the open source keyword extraction algorithm Maui
▪ Author of the most-cited 2009 journal survey “Mining Meaning with Wikipedia”
▪ Past: Chief Research Officer at Pingar
▪ Now: Founder of Entopix, NLP consultancy & software development
▪ aka @zelandiya

Word segmentation complexities

The first hot dogs were sold by Charles Feltman on Coney Island in 1870.
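
One reason this sentence is hard to segment: splitting on whitespace breaks "hot dogs" and "Coney Island" into separate tokens, although each is a single unit of meaning. A minimal sketch of one workaround, greedy merging against a list of known multi-word expressions, could look like this. The `MULTIWORD` set and the `segment` helper are hypothetical and chosen for this sentence only.

```python
import re

# Hypothetical list of known multi-word expressions (lower-cased tuples).
MULTIWORD = {("hot", "dogs"), ("coney", "island"), ("charles", "feltman")}
MAX_LEN = max(len(m) for m in MULTIWORD)

def segment(sentence):
    """Split into word tokens, then greedily merge known multi-word units."""
    words = re.findall(r"\w+", sentence)
    tokens, i = [], 0
    while i < len(words):
        # Try the longest possible match first.
        for n in range(MAX_LEN, 1, -1):
            chunk = tuple(w.lower() for w in words[i:i + n])
            if chunk in MULTIWORD:
                tokens.append(" ".join(words[i:i + n]))
                i += n
                break
        else:
            # No multi-word unit starts here: keep the single word.
            tokens.append(words[i])
            i += 1
    return tokens

sentence = "The first hot dogs were sold by Charles Feltman on Coney Island in 1870."
tokens = segment(sentence)
print(tokens)
```

A fixed list only covers expressions that are already known, which is part of why segmentation stays difficult in practice.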

What can we do with text?

sentiment, keywords, tags, genre, categories, taxonomy terms, entities, names, patterns, biochemical entities, …

How to get to the core words?
Remove Stopwords with NLTK

even the acting in transcendence is solid, with the dreamy depp turning in a typically strong performance

i think that transcendence has a pretty solid acting, with the dreamy depp turning in a strong performance as he usually does

>>> from nltk.corpus import stopwords
>>> stop = stopwords.words('english')
>>> words = ['the', 'acting', 'in', 'transcendence', 'is', 'solid', 'with', 'the', 'dreamy', 'depp']
>>> print([word for word in words if word not in stop])
['acting', 'transcendence', 'solid', 'dreamy', 'depp']
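
The two review sentences above are worded differently but share most of their content words. A rough sketch of that comparison follows. To stay self-contained it uses a small hand-picked `STOP` set (an assumption for illustration, much shorter than NLTK's English list) and a hypothetical `core_words` helper.

```python
import re

# Hand-picked stopwords for illustration only; NLTK's English list is longer.
STOP = {"the", "in", "is", "with", "a", "i", "that", "has", "as", "he", "does"}

def core_words(sentence):
    """Lower-case, tokenize on letters, and drop stopwords."""
    return {w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP}

s1 = ("even the acting in transcendence is solid, with the dreamy depp "
      "turning in a typically strong performance")
s2 = ("i think that transcendence has a pretty solid acting, with the dreamy "
      "depp turning in a strong performance as he usually does")

# Words that survive stopword removal in both sentences.
shared = core_words(s1) & core_words(s2)
print(sorted(shared))
```

With this stopword set, the words left in only one of the two sentences are modifiers such as "typically" or "pretty", while the shared set carries the subject and the judgement.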

Getting closer to the meaning:
Part of Speech tagging with NLTK

Flying planes can be dangerous

>>> import nltk
>>> from nltk.tokenize import word_tokenize
>>> nltk.pos_tag(word_tokenize("Flying planes can be dangerous"))
[('Flying', 'VBG'), ('planes', 'NNS'), ('can', 'MD'), ('be', 'VB'), ('dangerous', 'JJ')]
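
Once tokens carry tags, they can be filtered by word class. The sketch below uses the tagger output shown above as hard-coded input, so no tagger model is needed to run it. In the Penn Treebank tag set, noun tags start with NN and adjective tags with JJ. Treating those two groups as "content words" is an assumption made here, and `CONTENT_PREFIXES` and `content_words` are hypothetical names.

```python
# Tagger output copied from the example above (Penn Treebank tags).
tagged = [("Flying", "VBG"), ("planes", "NNS"), ("can", "MD"),
          ("be", "VB"), ("dangerous", "JJ")]

# Assumed choice: treat nouns (NN*) and adjectives (JJ*) as content words.
CONTENT_PREFIXES = ("NN", "JJ")

def content_words(pairs):
    """Keep the words whose tag starts with one of the chosen prefixes."""
    return [word for word, tag in pairs if tag.startswith(CONTENT_PREFIXES)]

print(content_words(tagged))
```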

TFxIDF with Gensim

from nltk.corpus import movie_reviews
from gensim import corpora, models

# Collect the tokenized text of every review in the corpus.
texts = []
for fileid in movie_reviews.fileids():
    texts.append(movie_reviews.words(fileid))

# Map each word to an integer id, then turn each review into a bag of words.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)

TFxIDF with Gensim (Results)

for word in ['film', 'movie', 'comedy', 'violence', 'jolie']:
    my_id = dictionary.token2id.get(word)
    print(word, '\t', tfidf.idfs[my_id])

film    0.190174003903
movie   0.364013496254
comedy  1.98564470702
violence 3.2108967825
jolie   6.96578428466
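
The scores above behave like inverse document frequencies: common words such as "film" score low, and rare ones such as "jolie" score high. A minimal sketch of that calculation follows, assuming the weighting is log base 2 of (number of documents / number of documents containing the word), which is my understanding of gensim's default. The `idf` helper and the toy corpus are hypothetical.

```python
import math

def idf(doc_freq, total_docs):
    """Inverse document frequency: log2(total_docs / doc_freq)."""
    return math.log2(total_docs / doc_freq)

# Toy corpus: each document is a list of tokens.
docs = [
    ["film", "comedy", "film"],
    ["film", "violence"],
    ["film", "movie"],
    ["movie", "jolie"],
]

# Document frequency counts each word at most once per document.
doc_freq = {}
for doc in docs:
    for word in set(doc):
        doc_freq[word] = doc_freq.get(word, 0) + 1

scores = {word: idf(df, len(docs)) for word, df in doc_freq.items()}
for word in sorted(scores, key=scores.get):
    print(word, round(scores[word], 3))
```

Under this formula, a word found in 16 of 2000 documents scores log2(125), roughly 6.9658, which is consistent with the value printed for "jolie" in a corpus of 2000 reviews.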

Where does this text belong?
Text Categorization with NLTK

Entertainment or Politics?
TVNZ: "Obama and Hangover star trade insults in interview"

>>> train_set = [(document_features(d), c) for (d, c) in categorized_documents]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> doc_features = document_features(new_document)
>>> category = classifier.classify(doc_features)
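
The snippet above relies on a `document_features` function that is not shown on the slide. Below is a minimal sketch of what such a function could look like, using word-presence features in the style of the NLTK book. The vocabulary and the two training headlines are made up for illustration and are not taken from the talk.

```python
import re

# Hypothetical vocabulary; in practice this might be the most frequent
# words in the training documents.
VOCABULARY = ["obama", "election", "senate", "star", "movie", "interview"]

def document_features(document):
    """Map a text to a dict of word-presence features."""
    words = set(re.findall(r"[a-z]+", document.lower()))
    return {"contains(%s)" % w: (w in words) for w in VOCABULARY}

# Toy labelled documents (made up for this sketch).
categorized_documents = [
    ("Senate debates election reform", "Politics"),
    ("Movie star wins award", "Entertainment"),
]

train_set = [(document_features(d), c) for (d, c) in categorized_documents]

headline = "Obama and Hangover star trade insults in interview"
features = document_features(headline)
print(sorted(name for name, present in features.items() if present))
```

Here `train_set` is a list of (feature dict, label) pairs, which is the shape that `nltk.NaiveBayesClassifier.train` accepts. The headline triggers both a politics-flavoured feature ("obama") and an entertainment-flavoured one ("star"), which is what makes it a hard case.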

Sentiment analysis with TextBlob

>>> from textblob import TextBlob
>>> blob = TextBlob("I love this library")
>>> blob.sentiment
Sentiment(polarity=0.5, subjectivity=0.6)

# transcendence: a list of paths to review files
for review in transcendence:
    blob = TextBlob(open(review).read())
    print(review, blob.sentiment.polarity)

../data/transcendence_1star.txt 0.0170799124247
../data/transcendence_5star.txt 0.0874591503268
../data/transcendence_8star.txt 0.256845238095
../data/transcendence_10star.txt 0.304310344828
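
The four scores rise with the star rating, which suggests that polarity tracks reviewer opinion at least roughly on this small sample. A short sketch that checks this ordering follows, using the numbers printed above as hard-coded input. The `stars` helper is a hypothetical name.

```python
import re

# Polarity values copied from the output above.
results = {
    "../data/transcendence_1star.txt": 0.0170799124247,
    "../data/transcendence_5star.txt": 0.0874591503268,
    "../data/transcendence_8star.txt": 0.256845238095,
    "../data/transcendence_10star.txt": 0.304310344828,
}

def stars(path):
    """Pull the star rating out of a file name such as '..._8star.txt'."""
    return int(re.search(r"_(\d+)star", path).group(1))

# Order the reviews by rating and check that polarity never decreases.
ranked = sorted(results, key=stars)
polarities = [results[path] for path in ranked]
is_monotonic = all(a < b for a, b in zip(polarities, polarities[1:]))
print([stars(path) for path in ranked], is_monotonic)
```

All four polarities are positive, including the 1-star review, so a fixed threshold at zero would not separate negative from positive reviews here.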

Keyword extraction in 3h:
Understanding a movie review

bellboy
jennifer beals
four rooms
beals
rooms
tarantino
madonna
antonio banderas
valeria golino

    …four of the biggest directors in hollywood: quentin tarantino, robert rodriguez, … were all directing one big film with a big and popular cast...the second room (jennifer beals) was better, but lacking in plot... the bumbling and mumbling bellboy… ruins every joke in the film…

github.com/zelandiya/KiwiPyCon-NLP-tutorial
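
A common first step in keyword extraction is generating candidate phrases. The sketch below is not the tutorial's code and not the Maui algorithm. It is a simplified illustration that collects one- and two-word candidates from the review excerpt above, never crossing punctuation and never starting or ending on a stopword. The `STOP` set and the `candidates` helper are hypothetical.

```python
import re
from collections import Counter

# Small hand-picked stopword set for illustration only.
STOP = {"the", "of", "in", "a", "and", "was", "were", "but", "with",
        "all", "one", "every"}

def candidates(text, max_len=2):
    """Count n-grams (up to max_len words) that could serve as keywords."""
    counts = Counter()
    # Punctuation ends a phrase, so n-grams never span it.
    for phrase in re.split(r"[^\w\s]+", text.lower()):
        words = phrase.split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                # Skip candidates that start or end with a stopword.
                if gram[0] in STOP or gram[-1] in STOP:
                    continue
                counts[" ".join(gram)] += 1
    return counts

review = ("…four of the biggest directors in hollywood: quentin tarantino, "
          "robert rodriguez, … were all directing one big film with a big and "
          "popular cast...the second room (jennifer beals) was better, but "
          "lacking in plot... the bumbling and mumbling bellboy… ruins every "
          "joke in the film…")

found = candidates(review)
print([phrase for phrase in found if " " in phrase])
```

Raw frequency alone would rank "film" and "big" highest on this excerpt, so a real extractor would go on to score and filter the candidates rather than stop here.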

Keyword extraction on 2000 movie reviews:
What makes a successful movie?

van damme
zeta-jones
smith
batman
de palma
eddie murphy
killer
tommy lee jones

Negative