
What Makes BERT Special? A Study on Emotion Detection in Text

Python Conference Indonesia 2021
4-5 December, 2021 [Online]

https://pycon.id

Kuncahyo Setyo Nugroho

December 05, 2021

Transcript

1. What Makes BERT Special? A Study on Emotion Detection in Text. Kuncahyo Setyo Nugroho. Presented at Python Conference Indonesia 2021, 5 December 2021.
2. About Me
▪ Software Engineer, ITC and Data Center, Widya Gama University, Indonesia
▪ Master's student of Computer Science, Faculty of Computer Science, Brawijaya University, Indonesia (Intelligent System Laboratory, Affective Computing Research Interest Group)
▪ Research interests: Artificial Intelligence, Affective Computing, Natural Language Processing
3. Outline
• The current state of the art in Natural Language Processing (NLP)
• What makes BERT "special", and how has it evolved?
• How does BERT see the context?
• BERT fine-tuning for text-based emotion detection in Indonesian
• Future work
4. 01: The Current State of the Art in Natural Language Processing (NLP)
5. Why Embedding?
• Humans learn words from environment and experience, but a machine has no such context.
• Word embeddings are the basis of deep learning for NLP.
• A word is a symbolic representation of semantics:
• It has meaning;
• Words with similar meanings should have similar vectors;
• The distance between vectors should reflect how closely the concepts are related.
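To make the "similar meanings, similar vectors" idea concrete, here is a minimal sketch using gensim's Word2Vec; the toy corpus and parameter choices are illustrative assumptions, not material from the talk.

```python
# Minimal sketch: words used in similar contexts end up with similar vectors.
# The corpus below is a toy; real embeddings need millions of sentences.
from gensim.models import Word2Vec

sentences = [
    ["i", "love", "python"],
    ["i", "enjoy", "python"],
    ["i", "love", "coding"],
    ["i", "enjoy", "coding"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=42)

# "love" and "enjoy" appear in identical contexts, so their vectors
# should be comparatively close (high cosine similarity).
print(model.wv.similarity("love", "enjoy"))
print(model.wv.most_similar("python", topn=2))
```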
6. Evolution of Word Embedding
• Word2Vec (2013): https://arxiv.org/abs/1301.3781
• GloVe (2014): https://nlp.stanford.edu/pubs/glove.pdf
• FastText (2016): https://arxiv.org/abs/1607.04606
7. Limitations of Word Embedding
• Problem: some words have several distinct meanings (homonymy and polysemy).
• Word embeddings are applied in a context-free manner: a word like "bank" gets a single vector whether it means a river bank or a financial institution.
• Solution: train contextual representations on a text corpus.
8. ELMo: Deep Contextualized Word Representations
Instead of using a fixed embedding for each word, ELMo looks at the entire sentence before assigning each word in it an embedding. Peters et al., 2018: https://arxiv.org/abs/1802.05365 (illustration: https://www.mihaileric.com/posts/deep-contextualized-word-representations-elmo)
9. 02: What Makes BERT "Special", and How Has It Evolved? BERT (Bidirectional Encoder Representations from Transformers), Devlin et al., 2018: https://arxiv.org/abs/1810.04805
10. BERT
• Problem: language models use only left context or only right context, but language understanding is bidirectional.
• Why are LMs unidirectional?
• Directionality is needed to generate a well-formed probability distribution.
• Words can "see themselves" in a bidirectional encoder.
11. BERT
• BERT improves on ELMo:
• Replaces LSTMs with Transformers, which deal better with long-term dependencies.
• Truly bidirectional architecture: left-to-right and right-to-left contexts are learned by the same network.
12. BERT Training Strategies: MLM
Mask out k% of the input words (15% in BERT), then predict the masked words.
Input: the man went to the [MASK1] he bought a [MASK2] of milk
Labels: [MASK1] = store; [MASK2] = gallon
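A quick way to see MLM in action is the fill-mask pipeline from the Hugging Face transformers library; this sketch and the choice of bert-base-uncased are my additions, not the talk's code.

```python
# Query a pre-trained BERT for the most likely words behind a [MASK].
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses both the left and the right context to fill the blank.
for pred in unmasker("the man went to the [MASK] and bought a gallon of milk."):
    print(pred["token_str"], round(pred["score"], 3))
```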
13. BERT Training Strategies: NSP
Learn relationships between sentences: predict whether Sentence B is the actual sentence that follows Sentence A, or a random sentence.
Sentence A: the man went to the store.
Sentence B: he bought a gallon of milk.
Label: IsNextSentence
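The slide's sentence pair can be scored directly with the pre-trained NSP head; a minimal sketch using Hugging Face transformers (my addition, not the talk's code):

```python
# Score the slide's sentence pair with BERT's next-sentence-prediction head.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

inputs = tokenizer("the man went to the store.",
                   "he bought a gallon of milk.",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Index 0 = IsNextSentence, index 1 = NotNextSentence.
probs = torch.softmax(logits, dim=-1)
print("P(IsNextSentence) =", probs[0, 0].item())
```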
14. The Two Parts of BERT
• Unsupervised training: BERT pre-training
• Supervised training: BERT fine-tuning
15. 03: BERT Under the Covers: How Does BERT See the Context? Transformers (Attention Is All You Need), Vaswani et al., 2017: https://arxiv.org/abs/1706.03762
16. Transformer Architecture
• BERT uses the encoder stack of the Transformer: 12 or 24 layers, each layer with multiple (12 or 16) attention heads.
• The outputs of all heads in the same layer are combined and run through a fully connected feed-forward module.
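These layer and head counts can be read straight off a pre-trained checkpoint; a small sketch with Hugging Face transformers (the library choice is mine):

```python
# Inspect the encoder stack sizes of BERT-base.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
print(model.config.num_hidden_layers)    # 12 encoder layers (BERT-large has 24)
print(model.config.num_attention_heads)  # 12 heads per layer (BERT-large has 16)
```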
17. BERT: Input Representations
• A "sentence" can be an arbitrary span of contiguous text.
• A "sequence" refers to the input token sequence to BERT: a single sentence or two sentences packed together.
• Uses a 30,000-token WordPiece vocabulary on the input.
• Each input token's representation is the sum of three embeddings: token, segment, and position.
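The WordPiece tokens, the packed two-sentence sequence, and the segment IDs are easy to inspect; a sketch with the Hugging Face tokenizer (my addition; the example sentences are arbitrary):

```python
# Inspect BERT's input representation: WordPiece tokens and segment IDs.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("I love Python.", "It makes NLP easy.")

# [CLS] sentence A [SEP] sentence B [SEP]; rare words split into subwords.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
print(encoded["token_type_ids"])  # 0 for sentence A tokens, 1 for sentence B
```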
18. Self-Attention Mechanism
Self-attention function: attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
• Each row of the input X is the embedding vector of one token.
• The dimensions of matrix X are [sentence length × embedding dimension].
• Weight matrices W_Q, W_K, W_V each multiply the same input matrix X to produce the Q, K, and V matrices.
19. Self-Attention Mechanism
Compute the dot product between the Q and K matrices. Why?
• Every row represents the similarity between one token and all the other tokens.
• E.g., we can see that the word "I" is more related to itself than to the words "love" and "python".
20. Self-Attention Mechanism
Divide the dot product of Q and K by the square root of the key dimension, then normalize: softmax(QKᵀ / √dₖ).
Why divide by √dₖ?
• To obtain stable gradients.
Why normalize with the softmax function?
• To map the values into the range 0 to 1 so that each row of scores sums to 1.
21. Self-Attention Mechanism
Final step: compute the attention matrix Z by multiplying the score matrix by V.
• Z expresses how each word is related to all the words in the sentence.
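Pulling the last four slides together, here is a minimal NumPy sketch of the whole computation for a three-token sentence; the random weights and toy dimensions are placeholders, not real BERT parameters.

```python
# Scaled dot-product self-attention, end to end, for 3 tokens.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 3, 8, 8                 # "I love python", toy sizes

X = rng.normal(size=(seq_len, d_model))         # one embedding row per token
W_Q = rng.normal(size=(d_model, d_k))           # learned in a real model
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V             # same X, three projections

scores = Q @ K.T / np.sqrt(d_k)                 # token-to-token similarity, scaled
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax: rows sum to 1

Z = weights @ V                                 # attention matrix Z
print(weights.round(2))                         # who attends to whom
```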
22. Multi-Head Attention
Instead of computing a single attention matrix, we can compute multiple attention matrices in parallel, one per head, and combine their outputs.
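PyTorch ships this as a ready-made module; a short sketch with BERT-base sizes (the toy input is mine):

```python
# Multi-head self-attention: several heads run in parallel, outputs combined.
import torch

embed_dim, num_heads, seq_len = 768, 12, 5      # BERT-base sizes, toy sequence
mha = torch.nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(1, seq_len, embed_dim)          # (batch, sequence, embedding)
out, attn = mha(x, x, x)                        # self-attention: Q = K = V = x
print(out.shape, attn.shape)                    # (1, 5, 768) and (1, 5, 5)
```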
23. BertViz: Visualize BERT Attention
An open-source tool for visualizing attention in Transformer models: https://github.com/jessevig/bertviz
• Attention Head View
• Neuron View
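A minimal way to get the head view in a Jupyter notebook (a sketch based on the repo's documented usage; the example sentence is mine):

```python
# Render BertViz's interactive head view for one sentence (Jupyter only).
from bertviz import head_view
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

input_ids = tokenizer.encode("I love Python", return_tensors="pt")
attention = model(input_ids).attentions          # one tensor per layer
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])

head_view(attention, tokens)                     # interactive visualization
```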
24. 04: BERT for Text-Based Emotion Detection in Indonesian
25. What is Needed?
• Dataset
• Computing resources
• Code
• A cup of coffee ☕
Challenges:
• The need for more and more data …
• Powerful computing resources (GPU or TPU)
26. The Code
https://github.com/google-research/bert
27. BERT Indonesia Pre-Trained Models (https://huggingface.co)
# | Pre-Trained Name | Author | Dataset
1 | BERT multilingual | Devlin et al., 2018 | 104 languages from Wikipedia dump
2 | IndoBERT | Wilie et al., 2020 | Indo4B dataset (4B words / 250M sentences)
3 | IndoBERT | Koto et al., 2020 | Indonesian Wikipedia (74M words), news articles (55M words), Indonesian Web Corpus (90M words)
4 | IndoBERTweet | Koto et al., 2021 | 26M tweets (409M word tokens)
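To tie the parts together, here is a minimal fine-tuning sketch with the Hugging Face Trainer. The model ID follows the Wilie et al., 2020 IndoBERT as published on https://huggingface.co (an assumption on my part), and the two-example dataset, emotion label set, and hyperparameters are toy placeholders, not the talk's actual setup (that lives at the link on the next slide).

```python
# Fine-tune an Indonesian BERT for emotion classification (toy setup).
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "indobenchmark/indobert-base-p1"             # assumed model ID
emotions = ["anger", "fear", "happy", "love", "sadness"]  # placeholder labels

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=len(emotions))

# Toy labeled data: "I am very happy today" / "I am very scared".
train_ds = Dataset.from_dict({
    "text": ["aku sangat senang hari ini", "aku sangat takut"],
    "label": [emotions.index("happy"), emotions.index("fear")],
}).map(lambda batch: tokenizer(batch["text"], truncation=True,
                               padding="max_length", max_length=64),
       batched=True)

args = TrainingArguments(output_dir="emotion-indobert",
                         num_train_epochs=1,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```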
28. Thank you! Slide & Code: https://s.id/cahyopycon21