(Figure: an encoder-decoder model translating "very good movie" into "とても よい 映画" ("very good movie" in Japanese))
- Word embeddings: representing a word as a vector
- Semantic composition: computing the vector of a phrase from its constituent words
- Encoder-decoder model: generating a sequence of words from the composed vector
DNNs made breakthroughs in speech processing and computer vision
- Reduced the error rate of image recognition by more than 10% (ILSVRC 2012)
At first, DNNs had limited impact on NLP
- Natural languages have symbols that represent semantic information
Recently, DNNs have been applied successfully to various tasks
- DNNs achieve state-of-the-art performance on most NLP tasks
- DNNs learn vector representations of text and generate text (e.g., a sequence of words) from the representations
Local representation: assigns a unit (neuron, dimension, symbol) to every concept
Distributed representation:
- Each concept is represented by multiple units (micro-features)
- Each unit commits to multiple concepts
(Figure: units #249, #809, #18329, ... each participating in multiple concepts)
(Figure: concordance lines from a corpus showing the contexts of "beer" and "wine"; both occur with words such as drink, pint, glasses, red, alcoholic)
"You shall know a word by the company it keeps"
Z Harris. 1954. Distributional structure. Word, 10(23):146-162.
J Firth. 1957. A synopsis of linguistic theory 1930-1955. In Studies in Linguistic Analysis, pp. 1-32.
(Table: co-occurrence matrix whose rows are target words such as beer, wine, car, train, book and whose columns are context words such as drink, bottle, train, book, speed, read)
Context: words appearing within ±h word offsets of the target word
Rows: target words; columns: context words
Cell value: frequency of co-occurrences of the word with the context word (for example, "train" co-occurred with "drink" three times)
The row vector represents the meaning of the word "beer"
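A minimal sketch of how such a co-occurrence matrix can be counted with a ±h window, assuming a toy corpus given as tokenized sentences (the corpus and counts here are made up for illustration):

```python
from collections import Counter, defaultdict

def cooccurrence_counts(corpus, h=2):
    """Count, for every target word, the words appearing within +-h offsets."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for i, target in enumerate(sentence):
            lo, hi = max(0, i - h), min(len(sentence), i + h + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][sentence[j]] += 1
    return counts

corpus = [["people", "drink", "beer", "in", "pubs"],
          ["people", "drink", "red", "wine"]]
counts = cooccurrence_counts(corpus, h=2)
print(counts["beer"]["drink"])   # 1
```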
Given two vectors u and v whose angle is θ:
  u · v = ‖u‖ ‖v‖ cos θ
Therefore, cos θ = (u · v) / (‖u‖ ‖v‖)
The value of cos θ:
- θ → 0 (same direction): cos θ → +1
- θ → π/2 (orthogonal): cos θ → 0
- θ → π (opposite direction): cos θ → −1
In this way, cos θ can measure the similarity of two vectors within the range [−1, +1]
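A direct translation of the formula into NumPy (the vectors are toy values for illustration):

```python
import numpy as np

def cosine(u, v):
    """cos(theta) = (u . v) / (|u| |v|)"""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([3.0, 1.0, 0.0])
v = np.array([2.0, 2.0, 1.0])
print(cosine(u, v))  # a value in [-1, +1]
```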
S Deerwester, S Dumais, G Furnas, T Landauer, R Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407.
Singular Value Decomposition (SVD) on X:
  X = U Σ Vᵀ
  U: unitary matrix, Σ: diagonal matrix with singular values, Vᵀ: unitary matrix
Truncate Σ to the top d singular values:
  X_d = U_d Σ_d V_dᵀ  (d-rank approximation)
  (X_d is a minimizer of ‖X − Y‖ among rank-d matrices Y)
Use U_d Σ_d as d-dimensional word vectors:
  X_d X_dᵀ = U_d Σ_d V_dᵀ V_d Σ_d U_dᵀ = (U_d Σ_d)(U_d Σ_d)ᵀ
  The inner products of the rows of U_d Σ_d are equal to those of X_d
Example: 3-rank approximation of the co-occurrence matrix for beer, wine, car, train, book, using the top three singular values (the first three columns of U and the first three rows of Σ Vᵀ)
(Figure: the original matrix and its 3-rank approximation)
cos(beer, wine) = 0.96, cos(beer, train) = 0.37
Truncated SVD (Halko+ 2011) finds the top-d singular values of a matrix efficiently (for example, sklearn.decomposition.TruncatedSVD)
N Halko, P G Martinsson, and J A Tropp. 2011. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review, 53(2):217-288.
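A sketch of this step with scikit-learn; the co-occurrence counts are random stand-ins, so the printed similarity will not match the values above:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

words = ["beer", "wine", "car", "train", "book"]

# Made-up co-occurrence counts: 5 target words x 7 context words
rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(5, 7)).astype(float)

# fit_transform returns U_d Sigma_d: the d-dimensional word vectors
svd = TruncatedSVD(n_components=3)
vectors = svd.fit_transform(X)          # shape (5, 3)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

i, j = words.index("beer"), words.index("wine")
print(cosine(vectors[i], vectors[j]))
```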
(Figure: the target word "beer" in a sentence about pubs, cider, and wine; it predicts its surrounding context words as positive examples, while randomly sampled words such as "show", "take", "season" serve as negative examples)
Word vector w ∈ ℝ^d, context vector c̃ ∈ ℝ^d
- Each word vector predicts the 2h context words around it in the corpus (positive examples)
- Sample k words as negative words from the unigram distribution
- Update the vectors (update rule below) such that word vectors do not predict the negative words
T Mikolov, I Sutskever, K Chen, G Corrado, and J Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS 2013, pp. 3111–3119.
Word vectors (w): initialize with random values in [0, 1]
Context vectors (c̃): initialize with zeros
Repeat from the head to the tail of the training corpus (t ← t + 1):
  Learning rate: η_t = η_0 (1 − t / (T + 1))
  For each context word c connected with the target word w (label y = 1 for a true context word, y = 0 for a sampled negative word):
    g = y − σ(w · c̃)
    (positive pair: the update pushes the inner product w · c̃ toward +∞; negative pair: toward −∞)
    c̃ ← c̃ + η_t g w
    w ← w + η_t g c̃
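A minimal NumPy sketch of this update for one (target, context) pair plus its negative samples, assuming the word and context vectors are stored in two arrays W and C; the negatives are drawn uniformly here as a stand-in for the unigram distribution:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(W, C, w_id, c_id, neg_ids, lr):
    """One SGNS step: one positive (w_id, c_id) pair plus k negative context ids."""
    pairs = [(c_id, 1.0)] + [(n_id, 0.0) for n_id in neg_ids]
    w = W[w_id]
    grad_w = np.zeros_like(w)
    for ctx_id, label in pairs:
        c = C[ctx_id]
        g = label - sigmoid(np.dot(w, c))   # g = y - sigma(w . c~)
        grad_w += g * c                      # accumulate before touching w
        C[ctx_id] += lr * g * w              # c~ <- c~ + eta g w
    W[w_id] += lr * grad_w                   # w <- w + eta g c~ (summed over pairs)

# toy usage: vocabulary of 1000 words, 100-dimensional vectors
rng = np.random.default_rng(0)
W = rng.random((1000, 100))    # word vectors, random in [0, 1]
C = np.zeros((1000, 100))      # context vectors, zeros
sgns_update(W, C, w_id=3, c_id=7,
            neg_ids=rng.integers(0, 1000, size=5),  # uniform stand-in for unigram sampling
            lr=0.025)
```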
English: trained on the Google News dataset (100B words)
https://code.google.com/archive/p/word2vec/
Japanese: (trained by me) trained on Japanese Wikipedia articles (400M words)
Use gensim for manipulating them in Python:
https://github.com/chokkan/deeplearning/blob/master/notebook/word2vec_ja.ipynb
https://github.com/chokkan/deeplearning/blob/master/notebook/word2vec_en.ipynb
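A minimal gensim sketch for loading and querying the English vectors; the file name is assumed from the standard word2vec distribution, so adjust it to your actual download:

```python
from gensim.models import KeyedVectors

# Pretrained vectors downloaded from the archive above (file name assumed)
path = "GoogleNews-vectors-negative300.bin"
wv = KeyedVectors.load_word2vec_format(path, binary=True)

print(wv.similarity("beer", "wine"))    # cosine similarity of the two word vectors
print(wv.most_similar("beer", topn=5))  # nearest neighbours in the embedding space
```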
Example of semantic analogy: Athens : Greece = Tokyo : Japan
Example of syntactic analogy: cool : cooler = deep : deeper
T Mikolov, I Sutskever, K Chen, G Corrado, and J Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS 2013, pp. 3111–3119.
king − man + woman ≈ queen (Mikolov+ 2013)
T Mikolov, I Sutskever, K Chen, G Corrado, and J Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS 2013, pp. 3111–3119.
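Using the `wv` object from the earlier gensim sketch, this analogy can be queried directly; `most_similar` with positive and negative word lists performs exactly this vector arithmetic:

```python
# vector("king") - vector("man") + vector("woman") should be close to vector("queen")
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# e.g. [('queen', ...)]
```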
Maximum likelihood estimation (MLE): P(c|w) is modeled by softmax
  L = − ∑_{w ∈ D} ∑_{c ∈ C_w} log P(c|w)
  D: corpus (sequence of words)
  C_w: the set of words appearing within the offset ±h from the word w
  Probability to predict c ∈ C_w from w:
    P(c|w) = exp(w · c̃) / ∑_{c' ∈ V} exp(w · c̃')
  Too heavy computation, as this requires the sum of exponentials of inner products between the word w and all words c' ∈ V
Approximate log P(c|w) with logistic regressions (negative sampling):
  log P(c|w) ≈ log σ(w · c̃) + k · E_{n ∼ P_unigram} [ log σ(−w · c̃_n) ]
  Sample a negative word n from the unigram distribution (k times)
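A NumPy sketch contrasting the two quantities on random toy vectors (the sizes and the uniform negative sampling are illustrative assumptions): the softmax term touches all |V| context vectors, while the negative-sampling objective touches only k + 1 of them.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
V, d, k = 10000, 100, 5                 # vocabulary size, dimension, negatives
W = rng.normal(size=(V, d)) * 0.1       # word vectors
C = rng.normal(size=(V, d)) * 0.1       # context vectors
w_id, c_id = 3, 7

# Exact softmax: needs inner products with all |V| context vectors
scores = C @ W[w_id]
log_p_exact = scores[c_id] - np.log(np.sum(np.exp(scores)))

# Negative sampling: one positive term plus k sampled negative terms
neg_ids = rng.integers(0, V, size=k)    # stand-in for unigram sampling
log_p_approx = (np.log(sigmoid(W[w_id] @ C[c_id]))
                + np.sum(np.log(sigmoid(-W[w_id] @ C[neg_ids].T))))

print(log_p_exact, log_p_approx)
```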
SGNS implicitly factorizes a shifted PMI co-occurrence matrix M:
  M_{wc} = PMI(w, c) − log k ≈ w · c̃
This is similar to training word vectors by building a co-occurrence matrix weighted by PMI (shifted down by log k)
The previous approach (PMI + SVD) could therefore also realize additive composition
O Levy and Y Goldberg. 2014. Neural word embedding as implicit matrix factorization. In NIPS 2014, pp. 2177–2185.
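A sketch of the matrix SGNS implicitly factorizes, computed from co-occurrence counts (the counts are made up, and unobserved pairs are simply left at zero rather than using the shifted positive PMI of the paper):

```python
import numpy as np

def shifted_pmi(X, k=5):
    """M_wc = PMI(w, c) - log k, from a co-occurrence count matrix X (words x contexts)."""
    total = X.sum()
    p_w = X.sum(axis=1, keepdims=True) / total   # P(w)
    p_c = X.sum(axis=0, keepdims=True) / total   # P(c)
    with np.errstate(divide="ignore"):
        pmi = np.log((X / total) / (p_w * p_c))
    M = pmi - np.log(k)
    M[X == 0] = 0.0            # leave unobserved pairs at zero
    return M

X = np.array([[36.0, 14.0,  3.0],
              [14.0, 28.0,  1.0],
              [ 3.0,  0.0, 40.0]])   # made-up counts
print(shifted_pmi(X, k=5))
```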
Minimize:
  J = ∑_{i,j=1}^{|V|} f(X_ij) (w_i · w̃_j + b_i + b̃_j − log X_ij)²
  f(x) = (x / x_max)^α  (if x < x_max),  1  (otherwise)
  X_ij: co-occurrence frequency between words i and j
  |V|: total number of words (vocabulary size)
  w_i: vector of word i (vector #1), w̃_j: vector of word j (vector #2)
  b_i: bias for word i, b̃_j: bias for word j
Similarly to SGNS, each word has two vectors assigned; this study uses (w + w̃) after training the vectors (this treatment improves the performance)
x_max = 100, α = 0.75; trained by AdaGrad
J Pennington, R Socher, and C Manning. 2014. Glove: Global vectors for word representation. In EMNLP 2014, pp. 1532–1543.
Deriving the GloVe objective from co-occurrence ratios P_{ik}/P_{jk} (1/4)
Consider representing a relation of words i and j on an aspect by using a context word k
  E.g., the relation between ice and steam on thermodynamics
The ratio P_{ik}/P_{jk} may be more useful than P_{ik} = P(k|i) to capture the characteristics of words i and j
  E.g., (as context words) solid and gas are more useful than water and fashion (Pennington+ 2014)
J Pennington, R Socher, and C Manning. 2014. Glove: Global vectors for word representation. In EMNLP 2014, pp. 1532–1543.
Deriving the GloVe objective from co-occurrence ratios P_{ik}/P_{jk} (2/4)
Let w_i, w_j, w̃_k be the vectors of words i, j, k (w̃_k is a context vector, distinct from the word vector w_k)
In order to represent P_{ik}/P_{jk} with word vectors:
  F(w_i − w_j, w̃_k) = P_{ik}/P_{jk}
  Represent the contrast of the characteristics of words i and j with the vector subtraction w_i − w_j
The simplest way to cast the type of the left-hand side (vectors) into that of the right-hand side (a scalar):
  F((w_i − w_j) · w̃_k) = P_{ik}/P_{jk}
We will decide the form of F later
Deriving the GloVe objective from co-occurrence ratios P_{ik}/P_{jk} (4/4)
Words and contexts should be interchangeable: consider w ↔ w̃ and X ↔ Xᵀ at the same time
As it stands, words and contexts are not interchangeable:
  w_i · w̃_k = log P_{ik} = log X_{ik} − log X_i
  because log X_i depends only on the word i and has no counterpart for the context word k
Represent log X_i as a bias term b_i, and introduce a new bias term b̃_k for w̃_k:
  w_i · w̃_k + b_i + b̃_k = log X_{ik}
Weighting function f(X_ij): f(X_ij) = 0 when X_ij = 0
- Most elements in X are 0 (sparse matrix); we ignore unobserved statistics
We should not respect rare co-occurrences:
- Hard to reproduce rare co-occurrences with vectors
- Lower the weight to (X_ij / x_max)^α when X_ij < x_max
We should not respect frequent co-occurrences too much:
- Treat frequent co-occurrences with the same importance
- Clip the weight to 1 when X_ij ≥ x_max
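A small NumPy sketch of the weighting function and the objective above; the matrix sizes are made up, and real GloVe training would update W, W̃, b, b̃ with AdaGrad rather than just evaluating the loss:

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: (x / x_max)^alpha below x_max, clipped to 1 above; f(0) = 0."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_tilde, b, b_tilde):
    """J = sum_ij f(X_ij) (w_i . w~_j + b_i + b~_j - log X_ij)^2 over observed pairs."""
    J = 0.0
    for i, j in zip(*np.nonzero(X)):            # skip X_ij = 0, since f(0) = 0
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        J += weight(X[i, j]) * diff ** 2
    return J

# toy usage with made-up sizes
rng = np.random.default_rng(0)
V, d = 50, 10
X = rng.integers(0, 20, size=(V, V)).astype(float)
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = np.zeros(V), np.zeros(V)
print(glove_loss(X, W, W_tilde, b, b_tilde))
```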
Hyperparameters explored for PPMI, SVD, SGNS, and GloVe:
Preprocessing
- win: window size h ∈ {2, 5, 10}
- dyn: weighted context with (1/h), none (*1)
- sub: subsampling (with, none)
- del: rare word removal (with, none)
Association measure
- neg: negative samples k ∈ {1, 5, 15} (*2)
- cds: context distribution correction α ∈ {1, 0.75} (*3)
Postprocessing
- w+c: vector summation w, (w + w̃)
- eig: weighted singular values, exponent ∈ {0, 0.5, 1.0}
- nrm: normalization (both, col, row, none) (*4)
*1: The same weighting method implemented in word2vec
*2: These are set by shifted PPMI
*3: These are implemented by modifying the denominator of the PMIs
*4: Normalization for each word vector was the best
O Levy, Y Goldberg, and I Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics (TACL), 3:211-225.
- Use context distribution smoothing (cds = 0.75)
- Use SVD with symmetric variants (eig = 0 or 0.5)
- No effect with neg > 1 in Shifted PPMI
- SGNS is a robust baseline: it does not underperform in any scenario, and it trains word embeddings the fastest with the cheapest memory consumption
- Larger negative samples are better in SGNS
- Worth trying w+c in SGNS and GloVe: may result in substantial gains (but sometimes in losses)
O Levy, Y Goldberg, and I Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics (TACL), 3:211-225.
A human worker chooses the most similar word among the candidates computed by word embeddings
- GloVe was poor at adverbs for some reason
- CBOW suffers from larger candidate sets (50 nearest neighbors)
T Schnabel, I Labutov, D Mimno, T Joachims. 2015. Evaluation methods for unsupervised word embeddings. In EMNLP 2015, pp. 298-307.
There are no almighty word embeddings for all tasks
In order to improve the performance on a task, we should fine-tune word embeddings on the target task (Schnabel+ 2015)
T Schnabel, I Labutov, D Mimno, T Joachims. 2015. Evaluation methods for unsupervised word embeddings. In EMNLP 2015, pp. 298-307.
fastText: make use of internal letters in words
- Extend SGNS to consider letter n-grams (subword units)
- The use of subword units is also effective in machine translation
- E.g., the word "offer" is represented by the subword units <of, off, ffe, fer, er> (with "<" and ">" marking word boundaries) in addition to the whole word <offer>
- The word vector is the sum of its subword vectors; the update procedure is the same as in SGNS
(Figure: the summed vector of a target word predicting its context words while negative words are sampled, as in SGNS)
P Bojanowski, E Grave, A Joulin, T Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics (TACL), 5:135-146.
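A sketch of the subword extraction; fastText uses n-gram lengths from 3 to 6 by default, but only n = 3 is used here to match the example above:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word with '<' and '>' as boundary markers,
    plus the whole word itself (as in the fastText subword scheme)."""
    token = "<" + word + ">"
    ngrams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            ngrams.add(token[i:i + n])
    ngrams.add(token)
    return sorted(ngrams)

print(char_ngrams("offer", n_min=3, n_max=3))
# ['<of', '<offer>', 'er>', 'fer', 'ffe', 'off']
```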
Subword information favors syntactic analogy more than semantic analogy
fastText (sisg) outperforms the other methods except for WS353 in English
P Bojanowski, E Grave, A Joulin, T Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics (TACL), 5:135-146.
Word embeddings can represent word meanings to some extent
The underlying idea is the distributional hypothesis:
- "You shall know a word by the company it keeps"
- You shall know a word by predicting its companies
No almighty word embeddings exist for all downstream tasks
Next question: can we represent a phrase/sentence with a vector?