Word2vec implementation in gensim

Word2vec Implementation @masa_kazama

Contents
• Overview of Word2vec
• Gensim's Word2vec implementation (Python, Cython)
  ◦ Negative sampling
  ◦ Hierarchical softmax
• Modifying Gensim's Cython code
• References

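Word2vec
• A method for representing words as vectors, proposed by Mikolov et al. in 2013
• Builds the vectors from the distributional hypothesis: words that appear in the same contexts are similar
• Supports analogy arithmetic such as king - man + woman
• In recent years it has also been used in recommender systems, in the form of item2vec

These slides do not explain the basics of word2vec in detail; see the following resources instead:
• word2vec Parameter Learning Explained
• 数式からみるWord2Vec (Word2Vec seen through its equations)
• Word2Vec のニューラルネットワーク学習過程を理解する (Understanding the learning process of the Word2Vec neural network)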

Word2vec
• Models
  ◦ Skip-gram
  ◦ Continuous Bag of Words (CBOW)
• Parameter optimization methods
  ◦ Negative sampling
  ◦ Hierarchical softmax

Skip-gram
• A model that predicts the words surrounding the input word

CBOW
• A model that predicts the center word from its surrounding words

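Objective function and optimization methods
[Slide equation: objective function]
• Differentiating the objective function gives the update rule
• That update rule is very expensive to compute, because it requires a sum over every word in the vocabulary
• For this reason, methods have been proposed that change or approximate the objective function to reduce the cost
  ◦ Negative sampling
  ◦ Hierarchical softmax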
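To make the cost argument concrete, the sketch below computes a full-softmax probability in plain Python. The 3-dimensional vectors are made up for illustration, and `softmax_probability` is a hypothetical helper, not a Gensim function. The only point is that the denominator needs one dot product per vocabulary word.

```python
import math

def softmax_probability(input_vec, output_vecs, target):
    """p(target | input) under a full softmax over all output vectors."""
    # One dot product per vocabulary word: this is the expensive part.
    scores = [sum(a * b for a, b in zip(input_vec, out)) for out in output_vecs]
    top = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    return exps[target] / sum(exps)

# Made-up 3-dimensional vectors for a 6-word vocabulary (illustrative only).
input_vec = [0.4, 0.9, 0.1]
output_vecs = [
    [0.1, 0.2, 0.1], [0.8, 0.5, 0.4], [0.2, 0.1, 0.7],
    [0.6, 0.7, 0.3], [0.5, 0.8, 0.8], [0.4, 0.3, 0.9],
]
probs = [softmax_probability(input_vec, output_vecs, t) for t in range(len(output_vecs))]
print(probs)
```

With a real vocabulary of hundreds of thousands of words, this sum runs for every training pair, which is what negative sampling and hierarchical softmax avoid.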

Negative sampling
[Slide equations: objective function, update rule]
Speeds up the computation by sampling a small number of negative examples

Hierarchical softmax
[Slide equations: objective function, update rule]
Reduces the computational cost by approximating the softmax with a Huffman tree

Example of building the training data
he is a very good man (window size = 2)

Input   Output
he      is
he      a
is      he
is      a
is      very
a       he
a       is
...     ...
man     good

(input, output) word pairs are built from the sentence.
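The pair construction above can be sketched in a few lines of Python. `skipgram_pairs` is a hypothetical helper written for this example, not part of Gensim, and it uses a fixed window.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input, output) pairs for each word and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # a word is not paired with itself
                pairs.append((center, tokens[j]))
    return pairs

sentence = "he is a very good man".split()
pairs = skipgram_pairs(sentence, window=2)
for inp, out in pairs:
    print(inp, out)
```

The first pairs printed follow the table (he is, he a, is he, ...) and the last one is man good.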

Parameters
Each word has two vectors: an input vector and an output vector.
In Gensim they are stored in model.wv.syn0 and model.syn1neg.
After training, only the input vectors are used for similarity and analogy computations.

Index  Word  Input vector        Output vector
0      he    [0.4, 0.9, …, 0.1]  [0.1, 0.2, …, 0.1]
1      is    [0.2, 0.7, …, 0.2]  [0.8, 0.5, …, 0.4]
2      a     [0.6, 0.6, …, 0.7]  [0.2, 0.1, …, 0.7]
3      very  [0.2, 0.5, …, 0.9]  [0.6, 0.7, …, 0.3]
4      kind  [0.1, 0.4, …, 0.1]  [0.5, 0.8, …, 0.8]
5      man   [0.5, 0.3, …, 0.5]  [0.4, 0.3, …, 0.9]

(For some tasks and datasets, using the sum of the input vector and the output vector as the word vector after training has been reported to improve performance. [Levy 2015])
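As a small illustration of how the two vectors are used after training, the sketch below computes a cosine similarity from input vectors only, and again from the sum of input and output vectors. The 3-dimensional values are made up (they are not trained), and `cosine` and `combined` are hypothetical helpers rather than Gensim functions.

```python
import math

# Made-up 3-dimensional vectors (illustrative only, not trained values).
input_vectors = {"he": [0.4, 0.9, 0.1], "is": [0.2, 0.7, 0.2], "man": [0.5, 0.3, 0.5]}
output_vectors = {"he": [0.1, 0.2, 0.1], "is": [0.8, 0.5, 0.4], "man": [0.4, 0.3, 0.9]}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def combined(word):
    """Input vector plus output vector, the variant mentioned with [Levy 2015]."""
    return [a + b for a, b in zip(input_vectors[word], output_vectors[word])]

sim_input_only = cosine(input_vectors["he"], input_vectors["is"])
sim_combined = cosine(combined("he"), combined("is"))
print(sim_input_only, sim_combined)
```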

Parameter updates with negative sampling

    # dot and outer come from numpy, expit from scipy.special
    l1 = context_vectors[context_index]
    word_indices = [predict_word.index]
    while len(word_indices) < model.negative + 1:
        w = model.cum_table.searchsorted(model.random.randint(model.cum_table[-1]))
        if w != predict_word.index:
            word_indices.append(w)
    l2b = model.syn1neg[word_indices]  # 2d matrix, k+1 x layer1_size
    prod_term = dot(l1, l2b.T)
    fb = expit(prod_term)  # propagate hidden -> output
    gb = (model.neg_labels - fb) * alpha  # vector of error gradients multiplied by the learning rate
    model.syn1neg[word_indices] += outer(gb, l1)  # learn hidden -> output
    neu1e += dot(gb, l2b)  # save error
    l1 += neu1e

gensim/models/word2vec.py train_sg_pair
The code has been partially modified for explanation.
The following slides walk through it line by line, from the top.
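The update can also be tried outside Gensim. The following is a minimal standalone sketch of the same step in NumPy, under some assumptions: `train_pair_ns` and `sigmoid` are hypothetical names, the negative samples are passed in rather than drawn from `cum_table`, and the 3-dimensional vectors are made up.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair_ns(syn0, syn1neg, input_index, word_indices, alpha):
    """One negative-sampling step; word_indices[0] is the true output word."""
    l1 = syn0[input_index]                     # view: in-place edits change syn0
    labels = np.zeros(len(word_indices))
    labels[0] = 1.0                            # 1 for the true word, 0 for negatives
    l2b = syn1neg[word_indices]                # fancy indexing copies, shape (k+1, dim)
    fb = sigmoid(l2b @ l1)                     # predicted probabilities
    gb = (labels - fb) * alpha                 # error times learning rate
    syn1neg[word_indices] += np.outer(gb, l1)  # update the output vectors
    l1 += gb @ l2b                             # update the input vector with the old output vectors
    return fb

# Made-up 3-dimensional input vectors for he, is, a, very, kind, man.
syn0 = np.array([[0.4, 0.9, 0.1], [0.2, 0.7, 0.2], [0.6, 0.6, 0.7],
                 [0.2, 0.5, 0.9], [0.1, 0.4, 0.1], [0.5, 0.3, 0.5]])
syn1neg = np.zeros_like(syn0)                  # output vectors start at zero in this sketch

# Train the pair (he, is) repeatedly with "very" and "kind" as negatives.
history = [train_pair_ns(syn0, syn1neg, 0, [1, 3, 4], alpha=0.1) for _ in range(20)]
print(history[0], history[-1])
```

Over the repeated steps the probability for the true word rises above 0.5 while the probabilities for the two negatives fall below it.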

Example: (input, output) = (he, is)

Input word                         he
Input index (context_index)        0
Input vector (l1)                  [0.4, 0.9, …, 0.1]
Output word                        is
Output index (predict_word.index)  1
Output vector                      [0.8, 0.5, …, 0.4]

    l1 = context_vectors[context_index]
    word_indices = [predict_word.index]

Negative sampling

    while len(word_indices) < model.negative + 1:
        w = model.cum_table.searchsorted(model.random.randint(model.cum_table[-1]))
        if w != predict_word.index:
            word_indices.append(w)

Using the precomputed cum_table, model.negative words are drawn as negative samples.
(As an example, the following slides assume that "very" and "kind" were drawn.)

Reference: how cum_table is built

    def make_cum_table(self, wv, domain=2**31 - 1):
        vocab_size = len(wv.index2word)
        self.cum_table = zeros(vocab_size, dtype=uint32)
        # compute sum of all power (Z in paper)
        train_words_pow = 0.0
        for word_index in range(vocab_size):
            train_words_pow += wv.vocab[wv.index2word[word_index]].count**self.ns_exponent
        cumulative = 0.0
        for word_index in range(vocab_size):
            cumulative += wv.vocab[wv.index2word[word_index]].count**self.ns_exponent
            self.cum_table[word_index] = round(cumulative / train_words_pow * domain)
        if len(self.cum_table) > 0:
            assert self.cum_table[-1] == domain

Word distribution used for negative sampling: each word is drawn with probability proportional to count(word)^α, where α is ns_exponent (not the learning rate alpha).
α = 3/4 is considered a good choice for natural language processing, while negative values of α are reported to work well for recommender systems [Hugo 2018].
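The cumulative-table trick can be reproduced with the standard library alone. The sketch below is a standalone approximation of the idea, not Gensim's code: the word counts are made up, `sample_word` is a hypothetical helper, and `bisect.bisect_left` stands in for NumPy's `searchsorted`.

```python
import bisect
import random
from collections import Counter

def make_cum_table(counts, ns_exponent=0.75, domain=2**31 - 1):
    """Cumulative table whose gaps are proportional to count ** ns_exponent."""
    powered = [c ** ns_exponent for c in counts]
    total = sum(powered)
    cum_table, cumulative = [], 0.0
    for p in powered:
        cumulative += p
        cum_table.append(round(cumulative / total * domain))
    return cum_table

def sample_word(cum_table, rng):
    """Draw a word index: uniform integer in [0, domain), then binary search."""
    return bisect.bisect_left(cum_table, rng.randrange(cum_table[-1]))

counts = [100, 10, 1]                      # made-up word frequencies
cum_table = make_cum_table(counts)

rng = random.Random(0)
draws = Counter(sample_word(cum_table, rng) for _ in range(20000))
freq = [draws[i] / 20000 for i in range(len(counts))]
print(freq)
```

With the exponent 0.75 the most frequent word is drawn in roughly 83% of cases instead of the 90% its raw count would give, so rarer words are sampled somewhat more often.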

Negative sampling

    l2b = model.syn1neg[word_indices]  # 2d matrix, k+1 x layer1_size

l2b:
Word  Output vector
is    [0.8, 0.5, …, 0.4]   predict_word
very  [0.6, 0.7, …, 0.3]   negative sample
kind  [0.5, 0.8, …, 0.8]   negative sample

Negative sampling

    prod_term = dot(l1, l2b.T)
    fb = expit(prod_term)  # propagate hidden -> output

l2b:
Word  Output vector
is    [0.8, 0.5, …, 0.4]
very  [0.6, 0.7, …, 0.3]
kind  [0.5, 0.8, …, 0.8]

l1:
Word  Input vector
he    [0.4, 0.9, …, 0.1]

prod_term  [0.9, 0.4, 0.2]
fb         [0.71, 0.59, 0.54]

expit(prod_term) = 1/(1+exp(-prod_term)) = 1/(1+exp(-[0.9, 0.4, 0.2]))

Negative sampling

    gb = (model.neg_labels - fb) * alpha  # vector of error gradients multiplied by the learning rate

alpha is the learning rate. neg_labels is 1 for predict_word and 0 for all other words.

Word  Output vector       neg_labels
is    [0.8, 0.5, …, 0.4]  1
very  [0.6, 0.7, …, 0.3]  0
kind  [0.5, 0.8, …, 0.8]  0

gb = ([1, 0, 0] - [0.71, 0.59, 0.54]) * 0.1 = [0.029, -0.059, -0.054]

Reference: how neg_labels is built

    self.neg_labels = []
    if self.negative > 0:
        # precompute negative labels optimization for pure-python training
        self.neg_labels = zeros(self.negative + 1)
        self.neg_labels[0] = 1.
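The arithmetic of this step can be checked with a few lines of plain Python. The inputs are the example values from the slides (prod_term = [0.9, 0.4, 0.2], learning rate 0.1); the variable names simply mirror the slide code.

```python
import math

prod_term = [0.9, 0.4, 0.2]        # dot products from the example
neg_labels = [1.0, 0.0, 0.0]       # 1 for predict_word, 0 for the negative samples
alpha = 0.1                        # learning rate used in the example

# Sigmoid of each dot product, then the scaled error for each word.
fb = [1.0 / (1.0 + math.exp(-x)) for x in prod_term]
gb = [(label - f) * alpha for label, f in zip(neg_labels, fb)]

print([round(f, 3) for f in fb])
print([round(g, 3) for g in gb])
```

The results agree with the slide values up to rounding: the true word gets a positive correction and the two negative samples get negative ones.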

Negative sampling: updating the output vectors
The output vectors of predict_word and of the negative-sampled words are updated.

    model.syn1neg[word_indices] += outer(gb, l1)  # learn hidden -> output

Word  Output vector            outer(gb, l1)
is    [0.8, 0.5, …, 0.4]  +=    0.029 * [0.4, 0.9, …, 0.1]
very  [0.6, 0.7, …, 0.3]  +=   -0.059 * [0.4, 0.9, …, 0.1]
kind  [0.5, 0.8, …, 0.8]  +=   -0.054 * [0.4, 0.9, …, 0.1]

Negative sampling: updating the input vector

    neu1e += dot(gb, l2b)  # save error
    l1 += neu1e

gb     Word  Output vector (l2b)
 0.029  is    [0.8, 0.5, …, 0.4]
-0.059  very  [0.6, 0.7, …, 0.3]
-0.054  kind  [0.5, 0.8, …, 0.8]

Input vector (l1) [0.4, 0.9, …, 0.1] += dot(gb, l2b)

Negative sampling (Cython)

    for d in range(negative+1):
        if d == 0:
            target_index = word_index
            label = ONEF
        else:
            target_index = bisect_left(cum_table, (next_random >> 16) % cum_table[cum_table_len-1], 0, cum_table_len)
            next_random = (next_random * <unsigned long long>25214903917ULL + 11) & modulo
            if target_index == word_index:
                continue
            label = <REAL_t>0.0

        row2 = target_index * size
        f_dot = our_dot(&size, &syn0[row1], &ONE, &syn1neg[row2], &ONE)  # dot product
        if f_dot <= -MAX_EXP or f_dot >= MAX_EXP:
            continue
        f = EXP_TABLE[<int>((f_dot + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]  # sigmoid via lookup table
        g = (label - f) * alpha
        our_saxpy(&size, &g, &syn1neg[row2], &ONE, work, &ONE)  # temporary for the input-vector update: work += g * output_vector
        our_saxpy(&size, &g, &syn0[row1], &ONE, &syn1neg[row2], &ONE)  # update the output vector: output_vector += g * input_vector

    our_saxpy(&size, &word_locks[word2_index], work, &ONE, &syn0[row1], &ONE)  # update the input vector: input_vector += word_locks[word2_index] * work

    return next_random

gensim/models/word2vec_inner.pyx
The code has been partially modified for explanation.

Negative sampling (Cython)
• The Python version handles all negative-sampled words at once with matrix operations
• The Cython version updates the vectors one negative-sampled word at a time
• Several optimizations are applied for speed
  ◦ The sigmoid function is computed in advance and stored in an array (EXP_TABLE)
  ◦ Vector operations such as dot products are delegated to fast BLAS routines

    cdef scopy_ptr scopy=<scopy_ptr>PyCObject_AsVoidPtr(fblas.scopy._cpointer)  # y = x
    cdef saxpy_ptr saxpy=<saxpy_ptr>PyCObject_AsVoidPtr(fblas.saxpy._cpointer)  # y += alpha * x
    cdef sdot_ptr sdot=<sdot_ptr>PyCObject_AsVoidPtr(fblas.sdot._cpointer)  # float = dot(x, y)
    cdef dsdot_ptr dsdot=<dsdot_ptr>PyCObject_AsVoidPtr(fblas.sdot._cpointer)  # double = dot(x, y)
    cdef snrm2_ptr snrm2=<snrm2_ptr>PyCObject_AsVoidPtr(fblas.snrm2._cpointer)  # sqrt(x^2)
    cdef sscal_ptr sscal=<sscal_ptr>PyCObject_AsVoidPtr(fblas.sscal._cpointer)  # x = alpha * x
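The precomputed sigmoid can be sketched in plain Python as follows. The constants `EXP_TABLE_SIZE = 1000` and `MAX_EXP = 6` are assumptions here (the slides do not state them), and `table_sigmoid` is a hypothetical helper; only the index expression is taken from the Cython code.

```python
import math

EXP_TABLE_SIZE = 1000   # assumed table size
MAX_EXP = 6             # assumed clipping range: inputs outside (-6, 6) are skipped

# Precompute sigmoid(x) for x evenly spaced over [-MAX_EXP, MAX_EXP).
EXP_TABLE = []
for i in range(EXP_TABLE_SIZE):
    e = math.exp((i / EXP_TABLE_SIZE * 2 - 1) * MAX_EXP)
    EXP_TABLE.append(e / (e + 1))

def table_sigmoid(x):
    """Approximate sigmoid by table lookup; None where the Cython loop would `continue`."""
    if x <= -MAX_EXP or x >= MAX_EXP:
        return None
    return EXP_TABLE[int((x + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]

for x in (-3.0, 0.0, 0.9, 3.0):
    exact = 1.0 / (1.0 + math.exp(-x))
    print(x, table_sigmoid(x), exact)
```

The lookup trades a small approximation error for avoiding an `exp` call in the innermost training loop.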

Hierarchical softmax

    from gensim.models import Word2Vec

    sentences = [["he", "is", "a", "very", "kind", "man"]]
    model = Word2Vec(sentences, min_count=1, seed=1, hs=1)
    # the vocabulary lives on model.wv, like model.wv.syn0
    for word in model.wv.vocab.keys():
        print("word:", word)
        print("index", model.wv.vocab[word].index)
        print("code", model.wv.vocab[word].code)
        print("point", model.wv.vocab[word].point)
        print("-------------")

Output:

    ('word:', 'a')
    ('index', 0)
    ('code', array([1, 0, 0], dtype=uint8))
    ('point', array([4, 3, 1], dtype=uint32))
    -------------
    ('word:', 'kind')
    ('index', 1)
    ('code', array([1, 0, 1], dtype=uint8))
    ('point', array([4, 3, 1], dtype=uint32))
    -------------
    ('word:', 'very')
    ('index', 2)
    ('code', array([1, 1, 1], dtype=uint8))
    ('point', array([4, 3, 0], dtype=uint32))
    -------------
    ('word:', 'is')
    ('index', 3)
    ('code', array([0, 1], dtype=uint8))
    ('point', array([4, 2], dtype=uint32))
    -------------
    ('word:', 'he')
    ('index', 4)
    ('code', array([0, 0], dtype=uint8))
    ('point', array([4, 2], dtype=uint32))
    -------------
    ('word:', 'man')
    ('index', 5)
    ('code', array([1, 1, 0], dtype=uint8))
    ('point', array([4, 3, 0], dtype=uint32))
    -------------

Gensim stores the hierarchical softmax data structure in the form of index, code, and point.

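Hierarchical softmax

Huffman tree (inner nodes 0 to 4, edges labeled with the code bit):

    node 4 (root)
    ├─ 0 → node 2
    │      ├─ 0 → he
    │      └─ 1 → is
    └─ 1 → node 3
           ├─ 0 → node 1
           │      ├─ 0 → a
           │      └─ 1 → kind
           └─ 1 → node 0
                  ├─ 0 → man
                  └─ 1 → very

word  index  code     point
a     0      [1,0,0]  [4,3,1]
kind  1      [1,0,1]  [4,3,1]
very  2      [1,1,1]  [4,3,0]
is    3      [0,1]    [4,2]
he    4      [0,0]    [4,2]
man   5      [1,1,0]  [4,3,0]

point lists the inner nodes passed through on the way from the root to the word.
code records which branch was taken at each of those nodes (0 = left, 1 = right).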
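To show how code and point are used together, the sketch below computes a word probability as a product of sigmoid decisions along its path, following the sign convention of the update rule `(1 - code - f)`: a code bit of 0 uses sigmoid(x) and a bit of 1 uses 1 - sigmoid(x). The node vectors and input vector are made-up 3-dimensional values, and `hs_probability` is a hypothetical helper, not a Gensim function.

```python
import math

# (code, point) for each word, copied from the table above.
tree = {
    "a":    ([1, 0, 0], [4, 3, 1]),
    "kind": ([1, 0, 1], [4, 3, 1]),
    "very": ([1, 1, 1], [4, 3, 0]),
    "is":   ([0, 1],    [4, 2]),
    "he":   ([0, 0],    [4, 2]),
    "man":  ([1, 1, 0], [4, 3, 0]),
}

# Made-up 3-dimensional vectors for inner nodes 0..4 (illustrative only).
node_vectors = [[0.8, 0.5, 0.4], [0.2, 0.1, 0.7], [0.6, 0.7, 0.3],
                [0.5, 0.8, 0.8], [0.4, 0.3, 0.9]]
input_vector = [0.1, 0.4, 0.1]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hs_probability(word):
    """Multiply one binary decision per inner node on the path to `word`."""
    code, point = tree[word]
    p = 1.0
    for bit, node in zip(code, point):
        f = sigmoid(sum(a * b for a, b in zip(input_vector, node_vectors[node])))
        p *= f if bit == 0 else 1.0 - f
    return p

probs = {word: hs_probability(word) for word in tree}
print(probs, sum(probs.values()))
```

Because every inner node splits its probability mass between its two children, the six word probabilities sum to 1 without ever summing over the vocabulary.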

Building the Huffman tree

    def create_binary_tree(self, wv):
        # build the huffman tree
        heap = list(itervalues(wv.vocab))
        heapq.heapify(heap)
        for i in range(len(wv.vocab) - 1):
            min1, min2 = heapq.heappop(heap), heapq.heappop(heap)
            heapq.heappush(
                heap, Vocab(count=min1.count + min2.count, index=i + len(wv.vocab), left=min1, right=min2)
            )

        # recurse over the tree, assigning a binary code to each vocabulary word
        if heap:
            max_depth, stack = 0, [(heap[0], [], [])]
            while stack:
                node, codes, points = stack.pop()
                if node.index < len(wv.vocab):
                    # leaf node => store its path from the root
                    node.code, node.point = codes, points
                    max_depth = max(len(codes), max_depth)
                else:
                    # inner node => continue recursion
                    points = array(list(points) + [node.index - len(wv.vocab)], dtype=uint32)
                    stack.append((node.left, array(list(codes) + [0], dtype=uint8), points))
                    stack.append((node.right, array(list(codes) + [1], dtype=uint8), points))

• The Huffman tree is built with a heap, repeatedly merging the two nodes with the smallest counts
• point and code are then assigned to each node

gensim/models/word2vec.py
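The same construction can be run without Gensim. The following standalone sketch uses tuples on a `heapq` heap in place of Gensim's Vocab objects; `huffman_codes`, the tie-breaking counter, and the word counts are assumptions made for this example.

```python
import heapq
import itertools

def huffman_codes(counts):
    """Map each word to (code, point), with inner nodes numbered from 0."""
    words = sorted(counts, key=lambda w: (-counts[w], w))   # index 0 = most frequent
    n = len(words)
    tie = itertools.count()                                 # unique tiebreaker for equal counts
    # Heap entries: (count, tiebreak, node_index, left_child, right_child)
    heap = [(counts[w], next(tie), i, None, None) for i, w in enumerate(words)]
    heapq.heapify(heap)
    for i in range(n - 1):
        min1 = heapq.heappop(heap)
        min2 = heapq.heappop(heap)
        heapq.heappush(heap, (min1[0] + min2[0], next(tie), n + i, min1, min2))

    result = {}
    stack = [(heap[0], [], [])]
    while stack:
        node, code, point = stack.pop()
        _, _, index, left, right = node
        if index < n:                       # leaf: store its path from the root
            result[words[index]] = (code, point)
        else:                               # inner node: descend into both children
            point = point + [index - n]
            stack.append((left, code + [0], point))
            stack.append((right, code + [1], point))
    return result

codes = huffman_codes({"the": 50, "of": 30, "cat": 10, "sat": 5, "mat": 5})
for word, (code, point) in codes.items():
    print(word, code, point)
```

Frequent words end up with short codes, so they need fewer sigmoid evaluations per training pair than rare words.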

Hierarchical softmax parameters

word  vector
a     [0.4, 0.9, …, 0.1]
kind  [0.2, 0.7, …, 0.2]
very  [0.6, 0.6, …, 0.7]
is    [0.2, 0.5, …, 0.9]
he    [0.1, 0.4, …, 0.1]
man   [0.5, 0.3, …, 0.5]

node   vector
node0  [0.8, 0.5, …, 0.4]
node1  [0.2, 0.1, …, 0.7]
node2  [0.6, 0.7, …, 0.3]
node3  [0.5, 0.8, …, 0.8]
node4  [0.4, 0.3, …, 0.9]

The parameters are the input vector of each word and the vector of each Huffman tree node.
(Output vectors do not appear.)

Parameter updates with hierarchical softmax

    l1 = context_vectors[context_index]
    l2a = deepcopy(model.syn1[predict_word.point])  # 2d matrix, codelen x layer1_size
    prod_term = dot(l1, l2a.T)
    fa = expit(prod_term)  # propagate hidden -> output
    ga = (1 - predict_word.code - fa) * alpha  # vector of error gradients multiplied by the learning rate
    model.syn1[predict_word.point] += outer(ga, l1)  # learn hidden -> output
    neu1e += dot(ga, l2a)  # save error
    l1 += neu1e

If l2b in negative sampling is read as l2a here, and neg_labels is read as 1 - predict_word.code, the update is almost the same as for negative sampling.

(Note: the average length of predict_word.code corresponds to the average code length of the Huffman tree, which is of the same order as the number of negative samples, so negative sampling and hierarchical softmax have roughly the same computational cost.)

Hierarchical softmax (Cython)

    for b in range(codelen):
        row2 = word_point[b] * size
        f_dot = our_dot(&size, &syn0[row1], &ONE, &syn1[row2], &ONE)  # dot product
        if f_dot <= -MAX_EXP or f_dot >= MAX_EXP:
            continue
        f = EXP_TABLE[<int>((f_dot + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]  # sigmoid via lookup table
        g = (1 - word_code[b] - f) * alpha
        our_saxpy(&size, &g, &syn1[row2], &ONE, work, &ONE)  # temporary for the input-vector update: work += g * syn1[row2]
        our_saxpy(&size, &g, &syn0[row1], &ONE, &syn1[row2], &ONE)  # update the node vector: syn1[row2] += g * syn0[row1]

    our_saxpy(&size, &word_locks[word2_index], work, &ONE, &syn0[row1], &ONE)  # update the input vector: input_vector += word_locks[word2_index] * work
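For readers who do not use Cython, the per-node loop can be transliterated into plain Python lists. This is a sketch under assumptions: `hs_update` is a hypothetical name, `MAX_EXP = 6` is assumed, the sigmoid is computed directly instead of through EXP_TABLE, and `lock` stands in for `word_locks[word2_index]`.

```python
import math

MAX_EXP = 6  # assumed clipping range, not stated in the slides

def hs_update(input_vec, node_vecs, code, point, alpha, lock=1.0):
    """One hierarchical-softmax step, one inner node at a time; returns the sigmoids."""
    work = [0.0] * len(input_vec)          # accumulates the input-vector update
    outputs = []
    for bit, node in zip(code, point):
        row = node_vecs[node]
        f_dot = sum(a * b for a, b in zip(input_vec, row))
        if f_dot <= -MAX_EXP or f_dot >= MAX_EXP:
            continue                       # saturated: skip this node
        f = 1.0 / (1.0 + math.exp(-f_dot))
        outputs.append(f)
        g = (1 - bit - f) * alpha
        for i in range(len(work)):
            work[i] += g * row[i]          # uses the node vector before its update
        for i in range(len(row)):
            row[i] += g * input_vec[i]     # update the node vector
    for i in range(len(input_vec)):
        input_vec[i] += lock * work[i]     # apply the accumulated update once
    return outputs

# Made-up 3-dimensional values: input vector for "he", node vectors starting at zero.
input_vec = [0.1, 0.4, 0.1]
node_vecs = [[0.0, 0.0, 0.0] for _ in range(5)]

# Train the pair (he -> is); "is" has code [0, 1] and point [4, 2].
for _ in range(50):
    f_root, f_node2 = hs_update(input_vec, node_vecs, [0, 1], [4, 2], alpha=0.1)
print(f_root, f_node2, f_root * (1.0 - f_node2))
```

After repeated steps the sigmoid at the root moves toward 1 (code bit 0) and the one at node 2 moves toward 0 (code bit 1), so the probability of "is" grows.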

Modifying Gensim's Cython code
• With a window size of 3, Gensim's word2vec does not always take the 3 words before and after the center word. It picks a random integer from 1 to 3 for each word and takes that many words on each side, so that nearby words are sampled more heavily.
• As an example of a code change, we make it always take the full window instead of a random one.

Before:

    # precompute "reduced window" offsets in a single randint() call
    for i, item in enumerate(model.random.randint(0, c.window, effective_words)):
        c.reduced_windows[i] = item

After:

    print("fix windowsize")
    # precompute "reduced window" offsets in a single randint() call
    for i, item in enumerate(model.random.randint(0, c.window, effective_words)):
        c.reduced_windows[i] = 0

gensim/models/word2vec_inner.pyx train_batch_sg

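The effect of the "reduced window" can be simulated in plain Python. The sketch assumes the effective window is `window - reduced`, with `reduced` drawn uniformly from 0 to window - 1 as in the `randint(0, c.window, ...)` call; `context_indices` is a hypothetical helper and the loop bounds are my reading of the training loop, not copied from it.

```python
import random

def context_indices(i, sentence_len, window, reduced):
    """Positions used as context for position i when the window is shrunk by `reduced`."""
    start = max(0, i - window + reduced)
    end = min(sentence_len, i + window + 1 - reduced)
    return [j for j in range(start, end) if j != i]

rng = random.Random(0)
window = 3
counts = {1: 0, 2: 0, 3: 0}                # how often each distance is used
for _ in range(3000):
    reduced = rng.randrange(window)        # 0 .. window-1, like randint(0, window)
    for j in context_indices(10, 21, window, reduced):
        counts[abs(j - 10)] += 1
print(counts)

# With reduced fixed to 0 (the modified code), the full window is always used.
print(context_indices(10, 21, window, 0))
```

Distance 1 is used on every draw, distance 2 on about two thirds of them, and distance 3 on about one third, which is the heavier weighting of nearby words described above.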
Modifying Gensim's Cython code
• Steps to change the code, compile it, and install it

    git clone https://github.com/RaRe-Technologies/gensim.git
    cd gensim
    virtualenv gensim_env  # create an environment for gensim
    source gensim_env/bin/activate
    vim gensim/models/word2vec_inner.pyx  # change the code
    cython -2 gensim/models/word2vec_inner.pyx  # compile with Cython
    pip install -e .[test]  # install the modified gensim

Running the code below prints "fix windowsize", which confirms that the modified code is in effect.

    from gensim.models import Word2Vec

    sentences = [["he", "is", "a", "very", "kind", "man"]]
    model = Word2Vec(sentences, min_count=1, seed=1, negative=1, sg=1)

References
• 数式からみるWord2Vec (Word2Vec seen through its equations)
• Word2Vec のニューラルネットワーク学習過程を理解する (Understanding the learning process of the Word2Vec neural network)
• Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. 2013
• Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. 2013
• Yoav Goldberg, Omer Levy. word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. 2014
• Hugo Caselles-Dupré, Florian Lesaint, Jimena Royo-Letelier. Word2Vec applied to Recommendation: Hyperparameters Matter. 2018
• Omer Levy, Yoav Goldberg, Ido Dagan. Improving Distributional Similarity with Lessons Learned from Word Embeddings. 2015
• Gensim contribution guide
• Gensim developer page

Masa Kazama
