Kuncahyo Setyo Nugroho
▪ Software Engineer, ITC and Data Center, Widya Gama University, Indonesia
▪ Master's Student in Computer Science, Faculty of Computer Science, Brawijaya University, Indonesia (Intelligent System Laboratory, Affective Computing Research Interest Group)
▪ Research Interests: Artificial Intelligence, Affective Computing, Natural Language Processing
• Current state of the art in Natural Language Processing (NLP)
• What makes BERT "special", and how has it evolved?
• How does BERT see context?
• BERT fine-tuning for text-based emotion detection in Indonesian
• Future work
• Humans learn words from their environment and experience, but a machine has no such context.
• Word embeddings are the foundation of deep learning for NLP.
• A word is a symbolic representation of its semantics:
  • it has meaning;
  • words with similar meanings should have similar vectors;
  • distances between vectors for the same concepts should be similar.
• Word2Vec (2013): https://arxiv.org/abs/1301.3781
• GloVe (2014): https://nlp.stanford.edu/pubs/glove.pdf
• FastText (2016): https://arxiv.org/abs/1607.04606
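As an illustration of these static embeddings, here is a minimal sketch that trains a tiny Word2Vec model with gensim; the toy corpus and hyperparameters are assumptions for demonstration, not from the slides.

```python
from gensim.models import Word2Vec

# Toy tokenized corpus (illustrative only).
sentences = [
    ["i", "love", "natural", "language", "processing"],
    ["i", "love", "machine", "learning"],
    ["deep", "learning", "drives", "modern", "nlp"],
]

# Train a small skip-gram model: every word gets ONE fixed vector,
# regardless of the sentence it appears in.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

# Words used in similar contexts end up with similar (close) vectors.
print(model.wv.most_similar("learning", topn=3))
print(model.wv.similarity("love", "learning"))
```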
• Problem: some words have different meanings (homonymy and polysemy), yet word embeddings are applied in a context-free manner.
• Solution: train contextual representations on a text corpus.
Instead of using a fixed embedding for each word, ELMo looks at the entire sentence before assigning each word in it an embedding (Peters et al., 2018, https://arxiv.org/abs/1802.05365; illustration from https://www.mihaileric.com/posts/deep-contextualized-word-representations-elmo).
02. What Makes BERT "Special", and How Has It Evolved?
BERT: Bidirectional Encoder Representations from Transformers (Devlin et al., 2018, https://arxiv.org/abs/1810.04805)
• Problem: language models use only left context or right context, but language understanding is bidirectional.
• Why are LMs unidirectional?
  • Directionality is needed to generate a well-formed probability distribution.
  • Words could "see themselves" in a naive bidirectional encoder.
• BERT improves on ELMo:
  • It replaces LSTMs with Transformers, which deal better with long-range dependencies.
  • Its architecture is truly bidirectional: left-to-right and right-to-left contexts are learned by the same network.
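A minimal sketch of what "contextual" means in practice: the same surface word receives a different vector in each sentence. The model name and example sentences are illustrative assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative model choice (assumption): the public English BERT-base checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    """Return the contextual vector BERT assigns to `word` inside `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]              # [seq_len, 768]
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = word_vector("i deposited money at the bank", "bank")
v2 = word_vector("we walked along the river bank", "bank")
# Same word, different contexts -> different vectors (cosine similarity well below 1).
print(torch.cosine_similarity(v1, v2, dim=0).item())
```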
Masked language modeling: mask out k% of the input words, then predict the masked words.
Input: the man went to the [MASK1] he bought a [MASK2] of milk
Labels: [MASK1] = store; [MASK2] = gallon
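A minimal sketch of masked-word prediction with the Hugging Face fill-mask pipeline; the model name is an assumption, and the standard pipeline fills one [MASK] at a time.

```python
from transformers import pipeline

# Illustrative model choice (assumption): the public English BERT-base checkpoint.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the masked word from its bidirectional context.
for prediction in fill_mask("the man went to the [MASK] and bought a gallon of milk."):
    print(prediction["token_str"], round(prediction["score"], 3))
```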
Next sentence prediction: learn relationships between sentences by predicting whether Sentence B is the actual sentence that follows Sentence A or a random sentence.
Sentence A: the man went to the store.
Sentence B: he bought a gallon of milk.
Label: IsNextSentence
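A minimal sketch of the next-sentence-prediction head via transformers; the model name and example sentences are assumptions.

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

# Illustrative model choice (assumption): the public English BERT-base checkpoint.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "the man went to the store."
sentence_b = "he bought a gallon of milk."

# Both sentences are packed into one sequence: [CLS] A [SEP] B [SEP].
encoding = tokenizer(sentence_a, sentence_b, return_tensors="pt")
with torch.no_grad():
    logits = model(**encoding).logits   # index 0 = IsNextSentence, index 1 = NotNextSentence

probs = torch.softmax(logits, dim=-1)
print("P(IsNextSentence) =", probs[0, 0].item())
```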
03. BERT Under the Covers: How Does BERT See Context?
Transformers: "Attention Is All You Need" (Vaswani et al., 2017, https://arxiv.org/abs/1706.03762)
• BERT uses the encoder stack of the Transformer: 12 or 24 layers, each with multiple (12 or 16) attention heads.
• The outputs of all heads in the same layer are combined and run through a fully connected feed-forward module.
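Those layer and head counts can be read straight from the published configurations with the Hugging Face transformers library (the checkpoint names refer to the public English BERT-base and BERT-large models):

```python
from transformers import BertConfig

# BERT-base vs. BERT-large, read from their published configurations.
for name in ("bert-base-uncased", "bert-large-uncased"):
    config = BertConfig.from_pretrained(name)
    print(name,
          "layers:", config.num_hidden_layers,        # 12 for base, 24 for large
          "heads:", config.num_attention_heads,       # 12 for base, 16 for large
          "hidden size:", config.hidden_size)
```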
• A "sentence" can be an arbitrary span of contiguous text.
• A "sequence" is the input token sequence to BERT: either a single sentence or two sentences packed together.
• The input uses a 30,000-token WordPiece vocabulary.
• Each input token's representation is the sum of three embeddings (token, segment, and position).
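A minimal sketch of WordPiece tokenization and sentence packing with a BERT tokenizer; the checkpoint name and example sentences are assumptions, and the exact sub-word splits depend on the vocabulary.

```python
from transformers import BertTokenizer

# Illustrative model choice (assumption): the public English BERT-base checkpoint.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# WordPiece splits rarer words into sub-word units marked with "##".
print(tokenizer.tokenize("contextual embeddings"))
# e.g. ['contextual', 'em', '##bed', '##ding', '##s'] (exact split depends on the vocabulary)

# Two "sentences" packed into one input sequence: [CLS] A [SEP] B [SEP].
enc = tokenizer("the man went to the store.", "he bought a gallon of milk.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
print(enc["token_type_ids"])    # segment ids: 0 for sentence A, 1 for sentence B
print(len(tokenizer))           # vocabulary size, roughly 30,000 WordPiece tokens
```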
Self-attention function:
$\mathrm{attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$
• Each row of the input X is the embedding vector of one token.
• Matrix X therefore has dimensions [sentence length × embedding dimension].
• Weight matrices $W^Q$, $W^K$, and $W^V$ each multiply the same input matrix X to obtain the Q, K, and V matrices.
Compute the dot product between the Q and K matrices. Why?
• Each row represents the similarity between one token and every other token.
• E.g., we can see that the word "I" is more related to itself than to the words "love" and "python".
Divide the dot product of Q and K by the square root of the key dimension, $\sqrt{d_k}$. Why?
• To obtain stable gradients.
Normalize with the softmax function. Why?
• To map the values into the range 0 to 1 so that each row of scores sums to 1:
$\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)$
Final step: compute the attention matrix Z by multiplying the score matrix by V.
• Z captures how each word is related to all the other words in the sentence.
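A minimal NumPy sketch of the full self-attention computation walked through above; the toy token count, dimensions, and random matrices are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy setup (assumption): 3 tokens ("i", "love", "python"), embedding dim 4, d_k = 4.
np.random.seed(0)
X = np.random.randn(3, 4)                      # [sentence length, embedding dimension]
W_Q, W_K, W_V = (np.random.randn(4, 4) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # project the SAME input three ways
scores = Q @ K.T                               # token-to-token similarity (dot products)
scores /= np.sqrt(K.shape[-1])                 # divide by sqrt(d_k) for stable gradients
weights = softmax(scores)                      # each row now sums to 1
Z = weights @ V                                # attention output: context-mixed vectors

print(weights.round(2))   # how much each token attends to every other token
print(Z.shape)            # (3, 4): one context-aware vector per token
```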
BertViz: an open-source tool for visualizing attention in Transformer models (https://github.com/jessevig/bertviz).
• Attention head view
• Neuron view
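A rough usage sketch for BertViz's head view in a Jupyter notebook, assuming bertviz and transformers are installed; the checkpoint name and input sentence are illustrative.

```python
import torch
from bertviz import head_view
from transformers import AutoModel, AutoTokenizer

# Illustrative model choice (assumption): the public English BERT-base checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("i love python", return_tensors="pt")
with torch.no_grad():
    attention = model(**inputs).attentions      # one attention tensor per layer

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head_view(attention, tokens)                    # renders the interactive view in the notebook
```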
What do we need?
• A dataset
• Computing resources
• Code
• A cup of coffee ☕
Challenges:
• More and more data is needed.
• Powerful computing resources (GPU or TPU) are required.
Pre-trained models covering Indonesian, available on https://huggingface.co:
1. BERT multilingual (Devlin et al., 2018): 104 languages from Wikipedia dumps
2. IndoBERT (Wilie et al., 2020): Indo4B dataset (4B words / 250M sentences)
3. IndoBERT (Koto et al., 2020): Indonesian Wikipedia (74M words), news articles (55M words), and the Indonesian Web Corpus (90M words)
4. IndoBERTweet (Koto et al., 2021): 26M tweets (409M word tokens)
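A rough fine-tuning sketch for Indonesian emotion classification with one of the checkpoints above. Assumptions not taken from the slides: the Hugging Face model id "indobenchmark/indobert-base-p1" for IndoBERT (Wilie et al., 2020), a toy two-example dataset, and the training hyperparameters.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_ID = "indobenchmark/indobert-base-p1"   # assumed checkpoint id for IndoBERT (Wilie et al.)

# Toy labeled examples (illustrative only); a real run needs a proper emotion dataset.
data = {"text": ["aku senang sekali hari ini", "aku sangat marah padamu"],
        "label": [0, 1]}                       # e.g. 0 = happy, 1 = angry
dataset = Dataset.from_dict(data)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

# Tokenize the text column into the input_ids / attention_mask expected by the model.
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="emotion-indobert",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
)
trainer.train()
```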