Mining topics in documents with topic modelling and Python @ London Python meetup

Introduction to topic modelling in Python - presentation given at the London Python meetup in September 2019 (https://www.meetup.com/LondonPython/events/264921863/)

Title: What are they talking about? Mining topics in documents with topic modelling and Python

Abstract:
This presentation is a practical introduction to topic modelling in Python, tackling the problem of analysing large data sets of text, in order to identify topics of interest and related keywords.

Demo:
https://github.com/bonzanini/topic-modelling

Marco Bonzanini

September 26, 2019

Transcript

  1. Mining Topics in Documents with Topic Modelling and Python
     @MarcoBonzanini
     London Python meetup - September 2019
     Demo on: github.com/bonzanini/topic-modelling
  2. • Sept 2016: Intro to NLP
     • Sept 2017: Intro to Word Embeddings
     • Sept 2018: Intro to NLG
     • Sept 2019: Intro to Topic Modelling
     • Sept 2020: Intro to … ???
  3. Nice to meet you
     • Data Science consultant: NLP, Machine Learning, Data Engineering
     • Corporate training: Python + Data Science
     • PyData London chairperson
  4. This presentation
     • Introduction to Topic Modelling
     • Depending on time/interest: happy to discuss broader applications of NLP
     • The audience (tell me about you):
       - new-ish to NLP?
       - new-ish to Python tools for NLP?
     github.com/bonzanini/topic-modelling
  5. Motivation
     Suppose you:
     • have a huge number of (text) documents
     • want to know what they’re talking about
     • can’t read them all
  6. Topic Modelling
     • Bird’s-eye view on the whole corpus (dataset of docs)
     • Unsupervised learning
       pros: no need for labelled data
       cons: how to evaluate the model?
  7. Topic Modelling
     Output:
     - K topics
     - their word distributions
     movie, actor, soundtrack, director, …
     goal, match, referee, champions, …
     price, invest, market, stock, …
  8. Distributional Hypothesis
     • “You shall know a word by the company it keeps” (J. R. Firth, 1957)
     • “Words that occur in similar context, tend to have similar meaning” (Z. Harris, 1954)
     • Context approximates Meaning
  9. Term-document matrix

              Word 1   Word 2   Word N
     Doc 1      1        7        2
     Doc 2      3        0        5
     Doc N      0        4        2
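
The matrix on this slide can be built with nothing more than word counts. The following is a minimal sketch, assuming whitespace tokenisation and a made-up three-document corpus; the function name `term_document_matrix` and the sample sentences are illustrative and do not come from the talk or its demo repository.

```python
from collections import Counter

# Toy corpus: three short, made-up documents (illustrative only).
docs = [
    "the match ended with a late goal",
    "the director praised the actor and the soundtrack",
    "stock market investors watch the price",
]

def term_document_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    # Naive tokenisation: lowercase and split on whitespace (an assumption).
    tokenised = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    matrix = []
    for tokens in tokenised:
        counts = Counter(tokens)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

vocabulary, matrix = term_document_matrix(docs)
for doc_id, row in enumerate(matrix, start=1):
    print(f"Doc {doc_id}:", dict(zip(vocabulary, row)))
```

Rows are documents and columns are words, as in the table above. A real pipeline would usually add stop-word removal and better tokenisation before counting.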
  10. Latent Dirichlet Allocation
      • Commonly used topic modelling approach
      • Key idea:
        each document is a distribution of topics
        each topic is a distribution of words
  11. Latent Dirichlet Allocation
      • “Latent” as in hidden: only words are visible, other variables are hidden
      • “Dirichlet Allocation”: topics are assumed to be distributed with a specific probability (Dirichlet prior)
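
One way to read the two LDA slides is as a recipe for generating documents: draw a topic mixture for the document from a Dirichlet prior, then for each word position pick a hidden topic and emit a visible word from that topic. The sketch below illustrates only this generative story, using the standard library. The vocabulary, the hand-written `topic_word` table and the `alpha` values are assumptions made for illustration; fitting a model to real data (inferring the hidden variables) is a separate step that is not shown here.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocabulary, alpha, length, rng):
    """Generate one document following the LDA generative story."""
    # Each document gets its own distribution over topics (Dirichlet prior).
    doc_topics = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Pick a hidden (latent) topic for this word position...
        topic = rng.choices(range(len(topic_word)), weights=doc_topics)[0]
        # ...then pick a visible word from that topic's word distribution.
        words.append(rng.choices(vocabulary, weights=topic_word[topic])[0])
    return doc_topics, words

rng = random.Random(0)
vocabulary = ["movie", "actor", "goal", "match", "price", "stock"]
# Hypothetical topic-word distributions (each row sums to 1), hand-written for illustration.
topic_word = [
    [0.45, 0.45, 0.02, 0.04, 0.02, 0.02],  # a cinema-like topic
    [0.02, 0.04, 0.45, 0.45, 0.02, 0.02],  # a football-like topic
    [0.02, 0.02, 0.02, 0.04, 0.45, 0.45],  # a finance-like topic
]
doc_topics, words = generate_document(
    topic_word, vocabulary, alpha=[0.5, 0.5, 0.5], length=10, rng=rng
)
print("topic mixture:", [round(p, 2) for p in doc_topics])
print("document:", " ".join(words))
```

Small `alpha` values tend to produce documents dominated by one or two topics, while larger values give more even mixtures.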
  12. Topic Model Evaluation
      • How good is my topic model? “Unsupervised learning”… is there a correct answer?
      • Extrinsic metrics: what’s the task?
      • Intrinsic metrics: e.g. topic coherence
      • More interesting:
        - how useful is my topic model?
        - data visualisation can help to get some insights
  13. Topic Coherence
      • It gives a score of the topic quality
      • Relationship with Information Theory (Pointwise Mutual Information)
      • Used to find the best number of topics for a corpus
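
To make the link with Pointwise Mutual Information concrete, here is a simplified sketch that scores a topic by the average PMI of its top word pairs, where PMI(w1, w2) = log(P(w1, w2) / (P(w1) P(w2))). It is not the exact measure used in the talk or in any particular library: probabilities are estimated from whole-document co-occurrence rather than sliding windows, and the `eps` smoothing term, the function name `pmi_coherence` and the toy corpus are all assumptions.

```python
import math
from itertools import combinations

def pmi_coherence(top_words, documents, eps=1e-12):
    """Average pairwise PMI of a topic's top words, using document co-occurrence."""
    doc_sets = [set(doc.lower().split()) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all the given words.
        return sum(all(w in s for w in words) for s in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        joint = prob(w1, w2)
        # eps avoids log(0) when a pair never co-occurs (a smoothing assumption).
        scores.append(math.log((joint + eps) / (prob(w1) * prob(w2) + eps)))
    return sum(scores) / len(scores)

docs = [
    "the actor and the director discussed the movie",
    "the movie soundtrack impressed the director",
    "the referee stopped the match after the goal",
    "a late goal decided the match",
]
coherent = pmi_coherence(["movie", "director"], docs)
mixed = pmi_coherence(["movie", "goal"], docs)
print(f"movie/director: {coherent:.3f}")
print(f"movie/goal:     {mixed:.3f}")
```

Words that tend to appear in the same documents get a higher score than words that never co-occur. Comparing average scores across models trained with different values of K is one way to shortlist a number of topics.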
  14. Conclusions
      • Topic Modelling gives you a bird’s-eye view on a collection of documents
      • It doesn’t give you:
        - a “name” for each topic (you have to find out)
        - the exact number of topics (you have to find out)
      • Excellent tool for exploratory analysis and knowledge discovery