
Adam Palay - "Words, words, words": Reading Shakespeare with Python

This talk will give an introduction to text analysis with Python by asking some questions about Shakespeare and discussing the quantitative methods that will go into answering them. While we’ll use Shakespeare to illustrate our methodologies, we’ll also discuss how they can be ported over to more 21st-century texts, like tweets or New York Times articles.

https://us.pycon.org/2015/schedule/presentation/339/

PyCon 2015

April 18, 2015

Transcript

  1. Motivation
     How can we use Python to supplement our reading of Shakespeare? How can we get Python to read for us?
  2. Why Shakespeare?
     Polonius: What do you read, my lord?
     Hamlet: Words, words, words.
     P: What is the matter, my lord?
     H: Between who?
     P: I mean, the matter that you read, my lord.
     (Hamlet, II.2.184)
  3. Challenges
     • Language, especially English, is messy
     • Texts are usually unstructured
     • Pronunciation is not standard
     • Reading is pretty hard!
  4. Humans and Computers
     Humans are good at: nuance, ambiguity, close reading
     Computers are good at: counting, repetitive tasks, making graphs
  5. Shakespeare’s Sonnets
     • A sonnet is a 14-line poem
     • There are many different rhyme schemes a sonnet can have; Shakespeare was unusual in sticking to just one
     • This is a huge win for us, since we can “hard code” his rhyme scheme in our analysis (a sketch follows Sonnet 18 below)
  6. Sonnet 18
     Shall I compare thee to a summer’s day? (a)
     Thou art more lovely and more temperate: (b)
     Rough winds do shake the darling buds of May, (a)
     And summer’s lease hath all too short a date; (b)
     Sometime too hot the eye of heaven shines, (c)
     And often is his gold complexion dimm'd; (d)
     And every fair from fair sometime declines, (c)
     By chance or nature’s changing course untrimm'd; (d)
     But thy eternal summer shall not fade, (e)
     Nor lose possession of that fair thou ow’st; (f)
     Nor shall death brag thou wander’st in his shade, (e)
     When in eternal lines to time thou grow’st: (f)
     So long as men can breathe or eyes can see, (g)
     So long lives this, and this gives life to thee. (g)
     http://www.poetryfoundation.org/poem/174354
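Since the scheme is fixed, it can be hard-coded as the pairs of lines that rhyme. A minimal sketch (hypothetical names, zero-indexed lines; not the talk's own code):

     # abab cdcd efef gg: pairs of (zero-indexed) lines that rhyme.
     RHYME_PAIRS = [(0, 2), (1, 3), (4, 6), (5, 7),
                    (8, 10), (9, 11), (12, 13)]

     def rhyme_words(sonnet_lines):
         # Yield (word, rhyming word) pairs of line-final words.
         for i, j in RHYME_PAIRS:
             first = sonnet_lines[i].split()[-1].strip("?:;,.!").lower()
             second = sonnet_lines[j].split()[-1].strip("?:;,.!").lower()
             yield first, second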
  7. Rhyme Distribution
     • Most common rhymes → nltk.FreqDist (frequency distribution)
     • Given a word, what is the frequency distribution of the words that rhyme with it? → nltk.ConditionalFreqDist (conditional frequency distribution)
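A minimal sketch of both distributions, assuming rhyme pairs like those produced by rhyme_words above (hypothetical data; not the talk's own code):

     from nltk import FreqDist, ConditionalFreqDist

     # How often does each line-final word occur?
     final_words = ["day", "temperate", "may", "date", "day"]
     fdist = FreqDist(final_words)
     print fdist.most_common(3)

     # Given a word, how often does each rhyming word occur with it?
     rhyme_pairs = [("day", "may"), ("temperate", "date"), ("day", "may")]
     cfd = ConditionalFreqDist(rhyme_pairs)
     print cfd["day"].most_common()   # e.g. [('may', 2)]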
  8. Our Classifier
     Can we write code to tell if a given speech is from a tragedy or a comedy?
  9. Classifiers: overview
     • Requires labeled text
       ◦ (in this case, speeches labeled by genre)
       ◦ [(<speech>, <genre>), ...]
     • Requires “training”
     • Predicts labels of text
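A concrete toy example of data in that shape (hypothetical; the quotes are from Titus Andronicus, a tragedy, and Twelfth Night, a comedy):

     # Hypothetical labeled data: (speech, genre) pairs.
     labeled_speeches = [
         ("Farewell, Andronicus, my noble father, ...", "tragedy"),
         ("If music be the food of love, play on.", "comedy"),
     ]
     speeches = [speech for speech, genre in labeled_speeches]
     labels = [genre for speech, genre in labeled_speeches]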
  10. Classifiers: ingredients
     • Classifier
     • Vectorizer, or feature extractor
     • Classifiers only interact with features, not the text itself
  11. Vectorizers (or Feature Extractors)
     • A vectorizer, or feature extractor, transforms a text into quantifiable information about the text.
     • Theoretically, these features could be anything, e.g. (see the sketch below):
       ◦ How many capital letters does the text contain?
       ◦ Does the text end with an exclamation point?
     • In practice, a common model is “Bag of Words”.
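A toy feature extractor along those lines (hypothetical; not the talk's own code):

     def extract_features(text):
         # Map a text to a few hand-picked, quantifiable features.
         return {
             "num_capitals": sum(1 for ch in text if ch.isupper()),
             "ends_with_bang": text.rstrip().endswith("!"),
         }

     print extract_features("Hello, Will!")
     # -> num_capitals: 2, ends_with_bang: True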
  12. Bag of Words
     A kind of feature extraction where:
     • The set of features is the set of all words in the texts you’re analyzing
     • A single text is represented by how many times each word appears in it
  13. Bag of Words: Simple Example
     Two texts:
     • “Hello, Will!”
     • “Hello, Globe!”
     Bag: [“Hello”, “Will”, “Globe”]
  14. Bag of Words: Simple Example
     Two texts:
     • “Hello, Will!”
     • “Hello, Globe!”
     Bag: [“Hello”, “Will”, “Globe”]

                         “Hello”  “Will”  “Globe”
     “Hello, Will”          1       1        0
     “Hello, Globe”         1       0        1
  15. Bag of Words: Simple Example
     Two texts:
     • “Hello, Will!”
     • “Hello, Globe!”

                         “Hello”  “Will”  “Globe”
     “Hello, Will”          1       1        0
     “Hello, Globe”         1       0        1

     “Hello, Will” → “A text that contains one instance of the word ‘Hello’, contains one instance of the word ‘Will’, and does not contain the word ‘Globe’.” (Less readable for us, more readable for computers!)
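The same example in code, as a sketch using scikit-learn’s CountVectorizer (which the training slide below also uses; note that it lowercases tokens):

     from sklearn.feature_extraction.text import CountVectorizer

     texts = ["Hello, Will!", "Hello, Globe!"]
     vectorizer = CountVectorizer()
     counts = vectorizer.fit_transform(texts)
     print vectorizer.get_feature_names()   # ['globe', 'hello', 'will']
     print counts.toarray()
     # [[0 1 1]
     #  [1 1 0]]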
  16. Why are these called “Vectorizers”?
     text_1 = "words, words, words"
     text_2 = "words, words, birds"
     [Plot: each text as a point in 2-D space, with axes “# times ‘words’ is used” and “# times ‘birds’ is used”]
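Counting by hand shows the vectors the plot is drawing (a hypothetical sketch):

     def count(word, text):
         # Count occurrences of `word` among the text's comma-separated tokens.
         return text.replace(",", " ").split().count(word)

     text_1 = "words, words, words"
     text_2 = "words, words, birds"
     print (count("words", text_1), count("birds", text_1))   # (3, 0)
     print (count("words", text_2), count("birds", text_2))   # (2, 1)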
  17. Classification: Steps
     1) Split pre-labeled text into training and testing sets (sketched below)
     2) Vectorize text (extract features)
     3) Train classifier
     4) Test classifier
     Text → Features → Labels
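A sketch of step 1, assuming the speeches and labels lists from above (train_test_split lived in sklearn.cross_validation in 2015; newer releases use sklearn.model_selection):

     from sklearn.cross_validation import train_test_split

     # Hold out 25% of the labeled speeches for testing.
     train_speeches, test_speeches, train_labels, test_labels = \
         train_test_split(speeches, labels, test_size=0.25)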
  18. Classifier Training

     from sklearn.feature_extraction.text import CountVectorizer
     from sklearn.naive_bayes import MultinomialNB

     vectorizer = CountVectorizer()
     vectorizer.fit(train_speeches)
     train_features = vectorizer.transform(train_speeches)

     classifier = MultinomialNB()
     classifier.fit(train_features, train_labels)
  19. Classifier Testing

     test_speech = test_speeches[0]
     print test_speech

     Farewell, Andronicus, my noble father,
     The woefull'st man that ever liv'd in Rome.
     Farewell, proud Rome, till Lucius come again;
     He loves his pledges dearer than his life.
     ...
     (From Titus Andronicus, III.1.288-300)
  20. Classifier Testing

     test_speech = test_speeches[0]
     test_label = test_labels[0]
     test_features = vectorizer.transform([test_speech])
     prediction = classifier.predict(test_features)[0]

     print prediction
     >>> 'tragedy'
     print test_label
     >>> 'tragedy'
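A natural extension (not on the slides): score the classifier on the whole test set rather than a single speech, using the classifier's built-in accuracy score:

     test_features = vectorizer.transform(test_speeches)
     accuracy = classifier.score(test_features, test_labels)
     print accuracy   # fraction of test speeches labeled correctly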
  21. Critiques
     • “Bag of Words” assumes a correlation between word use and label. This correlation is stronger in some cases than in others.
     • Beware of highly disproportionate training data (a quick check is sketched below).
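One quick, hypothetical way to check for that imbalance before training:

     from collections import Counter

     # If one genre dominates the training set, the classifier will tend
     # to predict that genre regardless of a speech's actual words.
     print Counter(train_labels)   # e.g. Counter({'tragedy': 120, 'comedy': 35})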