Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Same content. Different words.

Same content. Different words.

Introducing Word Mover's Distance in gensim.

Lev Konstantinovskiy

April 05, 2016
Tweet

More Decks by Lev Konstantinovskiy

Other Decks in Technology

Transcript

  1. Credits Olavur Mortensen Applied Mathematics student DTU in Copenhagen RaReTech

    Incubator program Added WMD to Gensim http://rare-technologies.com/incubator/
  2. Business Problem All reviewers are raving about the same thing

    “The Sicilian gelato was extremely rich” “The Italian ice-cream was very velvety” What about Ambiance, Service and Prices? Let’s filter “gelato” out and add other aspects! Credit: Sudeep Das @datamusing applied WMD to restaurant reviews. http://tech.opentable. com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/
  3. Ways to find similar documents • Count common words (

    bag of words, TF-IDF) ◦ #Dimensions = #Vocabulary (thousands) Stuck if no words in common. “Gelato” != “Ice-cream”
  4. Ways to find similar documents • Low-dimensional latent features ◦

    Eigen-values (LSI) ◦ Probability (LDA) Nice features. That is basically what most of Gensim is. But there is something better now… WMD!
  5. New way to find similar documents • Word Mover’s Distance

    ◦ Built on top of Google’s word2vec ◦ Well-used concept in other fields known as Earth Mover’s Distance Beats BOW, TF-IDF, LDA, LSI in k Nearest Neigbours classification tasks.
  6. Finding similar reviews from gensim.similarities import WmdSimilarity similiar_reviews = WmdSimilarity(reviews,

    model, num_best=10) similar_reviews['Very good, you should seat outdoor.']
  7. Thanks! Lev Konstantinovskiy github.com/tmylk @teagermylk Events coming up: - Word2vec

    successor Swivel DS Journal Club Meetup 21 April - Word embeddings talk at PyData London conference 7-8 May - Word2vec tutorial at PyData Berlin 20-21 May
  8. Ways to find similar documents • Google’s Doc2vec ◦ Built

    on top of word2vec ◦ Document tags are just extra words in the document Hard to tune. Slow inference.
  9. Earth Mover’s Distance How do you best move piles of

    sand to fill up holes of the same total volume? Stated by Monge in 1781. Solved by Kantorovich in [Image: APS/Alan Stonebraker]
  10. Google’s Word2vec algorithm • Word becomes a vector in 100-dimensional

    space. • king - man + woman = queen http://nbviewer.jupyter.org/github/fbkarsdorp/doc2vec/blob/master/doc2vec.ipynb http://radimrehurek.com/2014/02/word2vec-tutorial