NLP: Word2Vec Skip-Gram (CS224n), implemented from scratch and in TensorFlow

The following content and images are from cs224n and hankcs.

Before getting into word2vec, consider how you understand a sentence. When you read, how do you work out the meaning of a whole string of words, and the specific image each word evokes? In computer science the question becomes: how do you teach a computer the “meaning” of a word or a sentence? For the last couple of decades, researchers relied on hand-built taxonomies such as WordNet, but it takes a massive amount of human effort to put the words in order, and such a resource does not handle word similarity well.

Then the linguist J. R. Firth proposed that a word can be understood through its context. This is the basic idea behind statistical NLP, and the word vectors built on it are called distributed representations.

word2vec uses a “center word” and its context to predict each other. cs224n covers 2 algorithms:

Skip-gram: uses the center word to predict its context
CBOW (Continuous Bag of Words): uses the context to predict the center word
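To make the skip-gram direction concrete, here is a minimal sketch of how (center, context) training pairs could be generated from a sentence. The helper name `skipgram_pairs`, the toy sentence, and the window size are illustrative assumptions, not part of the cs224n code.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every position in `tokens`.

    Hypothetical helper, for illustration only.
    """
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the window at the sentence boundaries.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the quick brown fox".split()
pairs = skipgram_pairs(sentence, window=1)
for center, context in pairs:
    print(center, "->", context)
```

CBOW would use the same windows but group the context words together as the input and treat the center word as the target.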

Another, more efficient training method is called Negative Sampling.

Skip-gram:

We use conditional probabilities to describe how well the center word predicts its context. Our task is to maximize the product of all of these conditional probabilities; when it is large, the model predicts the context well. We can then write down the likelihood function:

screenshot

The objective function is then:

screenshot

We take the negative log of the likelihood function, so the task becomes minimizing the objective function.
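In the usual cs224n notation, the two formulas can be written as below. This assumes a corpus of T words w_1, ..., w_T, a window of m words on each side, and model parameters θ; the averaging factor 1/T follows the lecture convention.

```latex
L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta)

J(\theta) = -\frac{1}{T} \log L(\theta)
          = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)
```

Because the logarithm is monotonic, minimizing J(θ) is equivalent to maximizing L(θ).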

So how do we calculate each conditional probability? We use softmax, because it maps arbitrary values x_i to a probability distribution p_i:

screenshot

u_o is the vector of one context word (outside word), v_c is the vector of the center word, and u_w ranges over the vectors of all the words in the vocabulary.
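As a small illustration of the softmax step, the sketch below turns arbitrary scores into a probability distribution. Subtracting the maximum before exponentiating is a common numerical-stability trick and an assumption here, not something the lecture formula requires; the score values are made up.

```python
import numpy as np


def softmax(x):
    """Softmax over a 1-D array of scores."""
    shifted = x - np.max(x)   # does not change the result, avoids overflow
    exps = np.exp(shifted)
    return exps / np.sum(exps)


scores = np.array([3.0, 1.0, -2.0])   # e.g. u_w . v_c for three words
probs = softmax(scores)
print(probs, probs.sum())

# Adding a constant to every score leaves the distribution unchanged.
shifted_probs = softmax(scores + 1000.0)
```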

Some fundamental math:

screenshot

A slide from Manning's lecture shows all the stages of skip-gram:

First we look up the center word in the word embedding matrix by multiplying its one-hot vector with the matrix; the product is the representation of the center word, v_c. We then multiply v_c by the output representation matrix to get a similarity score between every word and v_c. Finally we apply softmax to turn the scores into probabilities.
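The stages just described can be sketched in a few lines of NumPy. The matrix names `W_in` and `W_out`, the toy sizes, and the random values are assumptions for illustration; rows are word vectors, as in the assignment code.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 6, 4                        # toy vocabulary size and embedding size
W_in = rng.normal(size=(V, d))     # input (center word) embeddings
W_out = rng.normal(size=(V, d))    # output (outside word) embeddings

center_idx = 2
one_hot = np.zeros(V)
one_hot[center_idx] = 1.0

# Stage 1: one-hot vector times the embedding matrix selects one row.
v_c = one_hot @ W_in

# Stage 2: a similarity score between v_c and every output vector.
scores = W_out @ v_c

# Stage 3: softmax turns the scores into P(o | c) for every word o.
exps = np.exp(scores - scores.max())
probs = exps / exps.sum()
print(probs)
```

In practice the one-hot product is replaced by a direct row lookup, `W_in[center_idx]`, which gives the same vector without the multiplication.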

First we need normalizeRows, a function that normalizes each row of a matrix to have unit length.

def normalizeRows(x):
    """ Row normalization function

    Implement a function that normalizes each row of a matrix to have
    unit length.
    """

    ### YOUR CODE HERE
    # L2 norm of each row (computed across the columns)
    denominator = np.apply_along_axis(lambda row: np.sqrt(row.T.dot(row)), 1, x)
    # Divide every row by its norm, in place
    x /= denominator[:, None]
    ### END YOUR CODE

    return x
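For comparison, here is a vectorized sketch of the same idea using `np.linalg.norm`. The name `normalize_rows` is hypothetical, and unlike the version above it returns a new array rather than modifying its argument; it also assumes no row is all zeros.

```python
import numpy as np


def normalize_rows(x):
    """Return a copy of `x` whose rows all have unit L2 norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)   # shape (n_rows, 1)
    return x / norms


x = np.array([[3.0, 4.0],
              [1.0, 2.0]])
out = normalize_rows(x)
print(out)
print(np.linalg.norm(out, axis=1))
```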

Then comes softmaxCostAndGradient. First we calculate the dot product of v_hat (the predicted word vector, i.e. the center word) with every output vector, then apply softmax and cross entropy to get the loss, and then compute the gradients. The function returns the cost (loss), gradPred (the gradient for the center word), and grad (the gradients for the other, outside, words).

Take the derivative with respect to v_c:

screenshot

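U = [u1, u2, ..., uW] is the matrix made of all the output word vectors. y_hat is the predicted probability vector and y is the one-hot vector of the target word, so y_hat - y is the prediction error.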

Take the derivative with respect to U:

screenshot
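Written out, the two derivatives take the following form. This is a sketch that assumes the word vectors are the columns of U, as in the definition above, and that o is the index of the target word; the code stores them as rows instead, so its expressions are the transposes of these.

```latex
J = -\log \hat{y}_o, \qquad \hat{y} = \operatorname{softmax}(U^{\top} v_c)

\frac{\partial J}{\partial v_c} = U (\hat{y} - y)

\frac{\partial J}{\partial U} = v_c (\hat{y} - y)^{\top}
```

Both follow from the fact that the derivative of the cross entropy loss with respect to the softmax scores is y_hat - y.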

def softmaxCostAndGradient(predicted, target, outputVectors, dataset):
    """ Softmax cost function for word2vec models

    Implement the cost and gradients for one predicted word vector
    and one target word vector as a building block for word2vec
    models, assuming the softmax prediction function and cross
    entropy loss.

    Arguments:
    predicted -- numpy ndarray, predicted word vector (hat{v} in
                 the written component)
    target -- integer, the index of the target word
    outputVectors -- "output" vectors (as rows) for all tokens
    dataset -- needed for negative sampling, unused here.

    Return:
    cost -- cross entropy cost for the softmax word prediction
    gradPred -- the gradient with respect to the predicted word
           vector
    grad -- the gradient with respect to all the other word
           vectors

    We will not provide starter code for this function, but feel
    free to reference the code you previously wrote for this
    assignment!
    """

    ### YOUR CODE HERE
    # Softmax over the scores of every output vector against v_hat
    # (softmax is the function written earlier in the assignment)
    vhat = predicted
    z = np.dot(outputVectors, vhat)
    preds = softmax(z)

    # Cross entropy loss for the target word
    cost = -np.log(preds[target])

    # Gradient of the loss w.r.t. the scores: y_hat - y
    delta = preds.copy()
    delta[target] -= 1.0
    grad = np.outer(delta, vhat)               # w.r.t. the outside word vectors
    gradPred = np.dot(outputVectors.T, delta)  # w.r.t. the center word vector
    ### END YOUR CODE

    return cost, gradPred, grad
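One way to gain confidence in analytic gradients like these is a finite-difference check. The sketch below re-implements the cost and gradient in a self-contained form (the snake_case names, toy sizes, and random inputs are assumptions) and compares gradPred against a central-difference estimate.

```python
import numpy as np


def softmax(x):
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()


def softmax_cost_and_grad(v_hat, target, output_vectors):
    """Self-contained copy of the softmax cost and gradients above."""
    preds = softmax(output_vectors @ v_hat)
    cost = -np.log(preds[target])
    delta = preds.copy()
    delta[target] -= 1.0                      # y_hat - y
    grad_pred = output_vectors.T @ delta      # w.r.t. the center vector
    grad_out = np.outer(delta, v_hat)         # w.r.t. the output vectors
    return cost, grad_pred, grad_out


rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))    # 5 words, 3 dimensions, vectors as rows
v = rng.normal(size=3)
target = 2

cost, grad_pred, grad_out = softmax_cost_and_grad(v, target, U)

# Central differences: perturb one coordinate of v at a time.
eps = 1e-6
num_grad = np.zeros_like(v)
for i in range(v.size):
    step = np.zeros_like(v)
    step[i] = eps
    c_plus, _, _ = softmax_cost_and_grad(v + step, target, U)
    c_minus, _, _ = softmax_cost_and_grad(v - step, target, U)
    num_grad[i] = (c_plus - c_minus) / (2 * eps)

print("max abs difference:", np.max(np.abs(num_grad - grad_pred)))
```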

Then we implement the skipgram part. Here too, all we have to do is compute the cost (loss) and the gradients. The function takes a number of parameters:

currentWord – a string of the current center word
C – integer, context size
contextWords – list of no more than 2*C strings, the context words
tokens – a dictionary that maps words to their indices in the word vector list (important in the implementation)
inputVectors – “input” word vectors (as rows) for all tokens
outputVectors – “output” word vectors (as rows) for all tokens
word2vecCostAndGradient – the cost and gradient function for a prediction vector given the target word vectors, could be one of the two cost functions you implemented above.

Recall that the objective function is the negative log of the likelihood function:

screenshot

In the for loop, we iterate over all the contextWords and calculate the cost and gradients for each:

screenshot

The costs and gradients are then summed over all the context words:

def skipgram(currentWord, C, contextWords, tokens, inputVectors, outputVectors,
             dataset, word2vecCostAndGradient=softmaxCostAndGradient):
    """ Skip-gram model in word2vec

    Implement the skip-gram model in this function.

    Arguments:
    currentWord -- a string of the current center word
    C -- integer, context size
    contextWords -- list of no more than 2*C strings, the context words
    tokens -- a dictionary that maps words to their indices in
              the word vector list
    inputVectors -- "input" word vectors (as rows) for all tokens
    outputVectors -- "output" word vectors (as rows) for all tokens
    word2vecCostAndGradient -- the cost and gradient function for
                               a prediction vector given the target
                               word vectors, could be one of the two
                               cost functions you implemented above.

    Return:
    cost -- the cost function value for the skip-gram model
    grad -- the gradient with respect to the word vectors
    """

    cost = 0.0
    gradIn = np.zeros(inputVectors.shape)
    gradOut = np.zeros(outputVectors.shape)

    ### YOUR CODE HERE
    # tokens maps each word to its index; use the index to look up
    # the center word's vector in the input vectors
    centerword_idx = tokens[currentWord]
    vhat = inputVectors[centerword_idx]

    # Run the word2vec cost function on every context word and
    # accumulate the cost and gradients
    for j in contextWords:
        u_idx = tokens[j]
        c_cost, c_grad_in, c_grad_out = word2vecCostAndGradient(
            vhat, u_idx, outputVectors, dataset)
        cost += c_cost
        gradIn[centerword_idx] += c_grad_in
        gradOut += c_grad_out
    ### END YOUR CODE

    return cost, gradIn, gradOut
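A toy run shows what the accumulated gradients look like. The sketch below uses simplified, self-contained stand-ins for the two functions above (the names `softmax_cost_and_grad` and `skipgram_toy`, the four-word vocabulary, and the random vectors are all assumptions). Only the center word's row of the input gradient is touched, while every row of the output gradient receives a contribution.

```python
import numpy as np


def softmax_cost_and_grad(v_hat, target, output_vectors):
    scores = output_vectors @ v_hat
    exps = np.exp(scores - scores.max())
    preds = exps / exps.sum()
    cost = -np.log(preds[target])
    preds[target] -= 1.0                      # preds now holds y_hat - y
    return cost, output_vectors.T @ preds, np.outer(preds, v_hat)


def skipgram_toy(center, context, tokens, input_vectors, output_vectors):
    """Sum cost and gradients over all context words of one center word."""
    cost = 0.0
    grad_in = np.zeros_like(input_vectors)
    grad_out = np.zeros_like(output_vectors)
    c = tokens[center]
    for word in context:
        c_cost, c_grad_in, c_grad_out = softmax_cost_and_grad(
            input_vectors[c], tokens[word], output_vectors)
        cost += c_cost
        grad_in[c] += c_grad_in
        grad_out += c_grad_out
    return cost, grad_in, grad_out


tokens = {"a": 0, "b": 1, "c": 2, "d": 3}
rng = np.random.default_rng(0)
W_in = rng.normal(size=(4, 3))
W_out = rng.normal(size=(4, 3))

cost, grad_in, grad_out = skipgram_toy("c", ["a", "b", "d"], tokens, W_in, W_out)
print("cost:", cost)
print("rows of grad_in that changed:", np.nonzero(grad_in.any(axis=1))[0])
```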

Implementing skip-gram in TensorFlow:

Some packages need to be imported (the code below uses the TensorFlow 1.x API):

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
from tensorflow.contrib.tensorboard.plugins import projector
import tensorflow as tf

import utils
import word2vec_utils

First we set the model hyperparameters:

# Model hyperparameters
VOCAB_SIZE = 50000
BATCH_SIZE = 128
EMBED_SIZE = 128            # dimension of the word embedding vectors
SKIP_WINDOW = 1             # the context window
NUM_SAMPLED = 64            # number of negative examples to sample
LEARNING_RATE = 1.0
NUM_TRAIN_STEPS = 100000
VISUAL_FLD = 'visualization'
SKIP_STEP = 5000

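In TensorFlow we normally build a graph for the model first. Each function that builds part of the graph wraps its ops in a name_scope, which groups them together; variable sharing with tf.get_variable is governed by variable_scope.

In the SkipGramModel, as in any model, the first thing to do is create an iterator over the dataset: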

  # Step 1: get input, output from the dataset
  iterator = dataset.make_initializable_iterator()
  center_words, target_words = iterator.get_next()

After this, we need to define the weights. Here the weights are the embedding matrix, and we initialize it in this step.

  # Step 2: define weights.
  # In word2vec, it's the weights that we care about
  embed_matrix = tf.get_variable('embed_matrix', 
                                  shape=[VOCAB_SIZE, EMBED_SIZE],
                                  initializer=tf.random_uniform_initializer())

Notice that the shape is [VOCAB_SIZE, EMBED_SIZE]. In the next step we define the inference, using tf.nn.embedding_lookup. It returns the rows of embed_matrix at the given indices (here center_words holds the indices of the center words).

  # Step 3: define the inference
  embed = tf.nn.embedding_lookup(embed_matrix, center_words, name='embed')
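For intuition, with a single embedding matrix the lookup behaves like integer-array indexing in NumPy. The sketch below is plain NumPy rather than TensorFlow, with made-up sizes and indices, and only illustrates the shapes involved.

```python
import numpy as np

VOCAB_SIZE, EMBED_SIZE = 10, 4     # toy sizes, much smaller than the real ones
embed_matrix = np.arange(VOCAB_SIZE * EMBED_SIZE, dtype=np.float32)
embed_matrix = embed_matrix.reshape(VOCAB_SIZE, EMBED_SIZE)

center_words = np.array([3, 0, 7])     # a batch of word indices
embed = embed_matrix[center_words]     # one row per index

print(embed.shape)
print(embed[0])
```

Each entry of the batch picks one row, so the result has shape [batch size, EMBED_SIZE].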

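The next step is to define the loss function. Recall that in the skip-gram model above we used softmax for the probabilities; in this TensorFlow version we use NCE (noise-contrastive estimation) loss instead, which avoids computing the softmax over the whole vocabulary by sampling NUM_SAMPLED negative classes.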

  # Step 4: define loss function
  # construct variables for NCE loss
  nce_weight = tf.get_variable('nce_weight', 
                               shape=[VOCAB_SIZE, EMBED_SIZE],
                               initializer=tf.truncated_normal_initializer(stddev=1.0 / (EMBED_SIZE ** 0.5)))
  nce_bias = tf.get_variable('nce_bias', initializer=tf.zeros([VOCAB_SIZE]))

For the overall loss we need to aggregate over the batch. tf.nn.nce_loss returns one loss value per example, so we simply wrap it in tf.reduce_mean, which averages them into a single scalar.

  # define loss function to be NCE loss function
  loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight, 
                                      biases=nce_bias, 
                                      labels=target_words, 
                                      inputs=embed, 
                                      num_sampled=NUM_SAMPLED, 
                                      num_classes=VOCAB_SIZE), name='loss')
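To give a feel for why sampling helps, the NumPy sketch below computes the negative-sampling loss from word2vec for one (center, target) pair. This is a closely related but simpler objective, not the exact computation performed by tf.nn.nce_loss; the function name, the uniform sampling of negatives, and the toy sizes are all assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def neg_sampling_loss(v_c, target, neg_indices, output_vectors):
    """Loss for one (center, target) pair with K sampled negative words."""
    # Push the true outside word's score up...
    pos = np.log(sigmoid(output_vectors[target] @ v_c))
    # ...and the sampled words' scores down.
    neg = np.log(sigmoid(-(output_vectors[neg_indices] @ v_c))).sum()
    return -(pos + neg)


rng = np.random.default_rng(0)
V, d, K = 50, 8, 5
U = rng.normal(scale=0.1, size=(V, d))   # output vectors as rows
v_c = rng.normal(scale=0.1, size=d)      # center word vector
target = 3

# Simplification: sample negatives uniformly, excluding the target word.
candidates = np.array([i for i in range(V) if i != target])
neg_indices = rng.choice(candidates, size=K, replace=False)

loss = neg_sampling_loss(v_c, target, neg_indices, U)
print(loss)
```

The cost of one example depends on K + 1 dot products rather than on the vocabulary size, which is the point of sampling.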

Then comes the optimizer. tf.reduce_mean and tf.train.GradientDescentOptimizer are worth remembering, since they are used frequently.

  # Step 5: define optimizer that follows gradient descent update rule
  # to minimize loss
  optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)
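The update rule behind plain gradient descent is simply "parameters minus learning rate times gradient". As a minimal NumPy sketch (not TensorFlow; the quadratic toy loss, learning rate, and step count are assumptions), repeated updates move the parameters toward the minimizer:

```python
import numpy as np


def sgd_step(params, grad, learning_rate):
    """One gradient descent update."""
    return params - learning_rate * grad


goal = np.array([1.0, -2.0, 3.0])
w = np.zeros(3)

# Toy loss: 0.5 * ||w - goal||^2, whose gradient is (w - goal).
for _ in range(20):
    grad = w - goal
    w = sgd_step(w, grad, learning_rate=0.5)

print(w)
```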

For the training part: first initialize the iterator and the variables. Then, at each training step, sess.run the loss and the optimizer. Every SKIP_STEP (5000) steps we print the average loss, total_loss / SKIP_STEP, and reset total_loss, as in the code below:

  with tf.Session() as sess:

      # Step 6: initialize iterator and variables
      sess.run(iterator.initializer)
      sess.run(tf.global_variables_initializer())

      total_loss = 0.0 
      writer = tf.summary.FileWriter('graphs/word2vec_simple', sess.graph)

      for index in range(NUM_TRAIN_STEPS):
          try:
              # Step 7: execute optimizer and fetch loss
              loss_batch, _ = sess.run([loss, optimizer])

              total_loss += loss_batch

              if (index + 1) % SKIP_STEP == 0:
                  print('Average loss at step {}: {:5.1f}'.format(index, total_loss / SKIP_STEP))
                  total_loss = 0.0
          except tf.errors.OutOfRangeError:
              # The dataset is exhausted: restart the iterator
              sess.run(iterator.initializer)
      writer.close()
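The reporting pattern in the loop (accumulate, print the mean every SKIP_STEP steps, reset) can be isolated in plain Python. The helper `average_losses` and the loss values below are made up for illustration.

```python
def average_losses(losses, skip_step):
    """Return the mean loss of each consecutive block of `skip_step` steps."""
    averages = []
    total = 0.0
    for index, loss in enumerate(losses):
        total += loss
        if (index + 1) % skip_step == 0:
            averages.append(total / skip_step)
            total = 0.0   # start a fresh window
    return averages


averages = average_losses([4.0, 2.0, 3.0, 1.0, 5.0, 5.0], skip_step=2)
print(averages)
```

Resetting the total after each report means every printed value covers only the most recent window, which makes the downward trend easier to read than a running mean over the whole run.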

After training in Spyder, the result:

screenshot
