CS231n Assignment 3: Image Captioning with RNNs

Yesterday I finished implementing vanilla recurrent neural networks and used them to train a model that can generate novel captions for images. It's really exciting that word embeddings from the NLP world can be combined with a computer-vision CNN for image captioning, which is sort of like building with Lego. (All the images drawn in this draft are from link.)

screenshot

The whole process looks roughly like this:

screenshot


Vanilla RNN single step forward:

screenshot

import numpy as np


def rnn_step_forward(x, prev_h, Wx, Wh, b):
    """
    Run the forward pass for a single timestep of a vanilla RNN that uses a tanh
    activation function.

    The input data has dimension D, the hidden state has dimension H, and we use
    a minibatch size of N.

    Inputs:
    - x: Input data for this timestep, of shape (N, D).
    - prev_h: Hidden state from previous timestep, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - next_h: Next hidden state, of shape (N, H)
    - cache: Tuple of values needed for the backward pass.
    """
    next_h, cache = None, None
    ##############################################################################
    # TODO: Implement a single forward step for the vanilla RNN. Store the next  #
    # hidden state and any values you need for the backward pass in the next_h   #
    # and cache variables respectively.                                          #
    ##############################################################################
    next_h = np.tanh(prev_h.dot(Wh) + x.dot(Wx) + b)
    cache = (x, Wx, Wh, prev_h, next_h)
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return next_h, cache
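A quick way to build trust in this function is a shape check with toy sizes (random data; the sizes here are my own choice, not from the assignment):

np.random.seed(0)
N, D, H = 3, 5, 4
x = np.random.randn(N, D)
prev_h = np.random.randn(N, H)
Wx, Wh, b = np.random.randn(D, H), np.random.randn(H, H), np.random.randn(H)
next_h, _ = rnn_step_forward(x, prev_h, Wx, Wh, b)
print(next_h.shape)  # (3, 4); all values lie in (-1, 1) because of tanh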

Notice these two parameters:

- Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
- Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
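In equation form (my notation; shapes as listed above), a single step computes:

$$h_t = \tanh\left(x_t W_x + h_{t-1} W_h + b\right), \qquad x_t \in \mathbb{R}^{N \times D},\; h_{t-1}, h_t \in \mathbb{R}^{N \times H}$$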

For the RNN single-step backward pass, work out each gradient from the shapes above and pay attention to the transposes.

screenshot
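Concretely, with upstream gradient $dh_t$ and $\odot$ denoting elementwise multiplication, the code below computes:

$$\delta = dh_t \odot (1 - h_t^2)$$
$$dx = \delta\, W_x^\top, \quad dW_x = x_t^\top \delta, \quad dh_{t-1} = \delta\, W_h^\top, \quad dW_h = h_{t-1}^\top \delta, \quad db = \sum_{n=1}^{N} \delta_{n}$$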

def rnn_step_backward(dnext_h, cache):
    """
    Backward pass for a single timestep of a vanilla RNN.
    Inputs:
    - dnext_h: Gradient of loss with respect to next hidden state, of shape (N, H)
    - cache: Cache object from the forward pass
    Returns a tuple of:
    - dx: Gradients of input data, of shape (N, D)
    - dprev_h: Gradients of previous hidden state, of shape (N, H)
    - dWx: Gradients of input-to-hidden weights, of shape (D, H)
    - dWh: Gradients of hidden-to-hidden weights, of shape (H, H)
    - db: Gradients of bias vector, of shape (H,)
    """
    dx, dprev_h, dWx, dWh, db = None, None, None, None, None
    ##############################################################################
    # TODO: Implement the backward pass for a single step of a vanilla RNN.      #
    #                                                                            #
    # HINT: For the tanh function, you can compute the local derivative in terms #
    # of the output value from tanh.                                             #
    ##############################################################################
    x, Wx, Wh, prev_h, next_h = cache
    dtanh = 1 - next_h**2  # local derivative of tanh, expressed via its output
    dx = (dnext_h * dtanh).dot(Wx.T)
    dWx = x.T.dot(dnext_h * dtanh)
    dprev_h = (dnext_h * dtanh).dot(Wh.T)
    dWh = prev_h.T.dot(dnext_h * dtanh)
    db = np.sum(dnext_h * dtanh, axis=0)  # sum over the batch dimension
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return dx, dprev_h, dWx, dWh, db
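It is worth gradient-checking this numerically. The sketch below is self-contained (it avoids the course's eval_numerical_gradient_array helper); num_grad is my own helper name:

def num_grad(f, a, d_out, h=1e-5):
    """Centered-difference gradient of sum(f() * d_out) with respect to array a."""
    grad = np.zeros_like(a)
    it = np.nditer(a, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = a[idx]
        a[idx] = old + h
        fp = f()
        a[idx] = old - h
        fm = f()
        a[idx] = old
        grad[idx] = np.sum((fp - fm) * d_out) / (2 * h)
        it.iternext()
    return grad

np.random.seed(1)
N, D, H = 2, 3, 4
x, prev_h = np.random.randn(N, D), np.random.randn(N, H)
Wx, Wh, b = np.random.randn(D, H), np.random.randn(H, H), np.random.randn(H)
dnext_h = np.random.randn(N, H)

_, cache = rnn_step_forward(x, prev_h, Wx, Wh, b)
dx, dprev_h, dWx, dWh, db = rnn_step_backward(dnext_h, cache)

f = lambda: rnn_step_forward(x, prev_h, Wx, Wh, b)[0]
print(np.max(np.abs(dx - num_grad(f, x, dnext_h))))    # all of these should
print(np.max(np.abs(dWx - num_grad(f, Wx, dnext_h))))  # be tiny, ~1e-9 to 1e-7
print(np.max(np.abs(db - num_grad(f, b, dnext_h))))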

After the single step works, we need to finish the recurrent loop part.

screenshot

screenshot

def rnn_forward(x, h0, Wx, Wh, b):
    """
    Run a vanilla RNN forward on an entire sequence of data. We assume an input
    sequence composed of T vectors, each of dimension D. The RNN uses a hidden
    size of H, and we work over a minibatch containing N sequences. After running
    the RNN forward, we return the hidden states for all timesteps.
    Inputs:
    - x: Input data for the entire timeseries, of shape (N, T, D).
    - h0: Initial hidden state, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)
    Returns a tuple of:
    - h: Hidden states for the entire timeseries, of shape (N, T, H).
    - cache: Values needed in the backward pass
    """
    h, cache = None, None
    ##############################################################################
    # TODO: Implement forward pass for a vanilla RNN running on a sequence of    #
    # input data. You should use the rnn_step_forward function that you defined  #
    # above. You can use a for loop to help compute the forward pass.            #
    ##############################################################################
    # Get the sizes of x and h0
    N, T, D = x.shape
    _, H = h0.shape
    # Initialize the hidden states
    h = np.zeros((N, T, H))
    cache = []
    h_next = h0
    for i in range(T):
        h[:, i, :], cache_next = rnn_step_forward(x[:, i, :], h_next, Wx, Wh, b)
        h_next = h[:, i, :]
        cache.append(cache_next)
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return h, cache
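As before, a tiny shape check (toy sizes and random data of my own choosing) confirms the unrolling: each slice h[:, t, :] is the hidden state at timestep t.

N, T, D, H = 2, 5, 3, 4
x_seq = np.random.randn(N, T, D)
h0 = np.random.randn(N, H)
Wx, Wh, b = np.random.randn(D, H), np.random.randn(H, H), np.random.randn(H)
h_all, _ = rnn_forward(x_seq, h0, Wx, Wh, b)
print(h_all.shape)  # (2, 5, 4): one H-dimensional hidden state per timestep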

For the backward RNN:

screenshot

def rnn_backward(dh, cache):
    """
    Compute the backward pass for a vanilla RNN over an entire sequence of data.
    Inputs:
    - dh: Upstream gradients of all hidden states, of shape (N, T, H).

    NOTE: 'dh' contains the upstream gradients produced by the
    individual loss functions at each timestep, *not* the gradients
    being passed between timesteps (which you'll have to compute yourself
    by calling rnn_step_backward in a loop).
    Returns a tuple of:
    - dx: Gradient of inputs, of shape (N, T, D)
    - dh0: Gradient of initial hidden state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, H)
    - db: Gradient of biases, of shape (H,)
    """
    dx, dh0, dWx, dWh, db = None, None, None, None, None
    ##############################################################################
    # TODO: Implement the backward pass for a vanilla RNN running an entire      #
    # sequence of data. You should use the rnn_step_backward function that you   #
    # defined above. You can use a for loop to help compute the backward pass.   #
    ##############################################################################
    x, Wx, Wh, prev_h, next_h = cache[-1]  # peek at the final step to read D
    _, D = x.shape
    N, T, H = dh.shape
    dx = np.zeros((N, T, D))  # initialization
    dh0 = np.zeros((N, H))
    dWx = np.zeros((D, H))
    dWh = np.zeros((H, H))
    db = np.zeros(H)
    dprev_h_ = np.zeros((N, H))
    for i in range(T - 1, -1, -1):  # start from the final timestep
        # The upstream gradient at step i has two parts: the per-step loss
        # gradient dh[:, i, :] and the gradient dprev_h_ flowing back from step i+1.
        dx_, dprev_h_, dWx_, dWh_, db_ = rnn_step_backward(dh[:, i, :] + dprev_h_, cache.pop())
        dx[:, i, :] = dx_
        dh0 = dprev_h_
        # Weights are shared across timesteps, so their gradients accumulate.
        dWx += dWx_
        dWh += dWh_
        db += db_
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return dx, dh0, dWx, dWh, db

Word embedding: forward. In deep learning systems, we commonly represent words using vectors. Each word of the vocabulary is associated with a vector, and these vectors are learned jointly with the rest of the system. The whole process looks like the image below; going from captions_in to X is the word-embedding step:

screenshot

def word_embedding_forward(x, W):
    """
    Forward pass for word embeddings. We operate on minibatches of size N where
    each sequence has length T. We assume a vocabulary of V words, assigning each
    to a vector of dimension D.

    Inputs:
    - x: Integer array of shape (N, T) giving indices of words. Each element idx
      of x must be in the range 0 <= idx < V.
    - W: Weight matrix of shape (V, D) giving word vectors for all words.

    Returns a tuple of:
    - out: Array of shape (N, T, D) giving word vectors for all input words.
    - cache: Values needed for the backward pass
    """
    out = W[x, :]  # fancy indexing: look up one row of W per word index
    cache = (W, x)

    return out, cache
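To make the fancy indexing concrete, here is a toy example (made-up numbers):

W = np.array([[0.0, 0.1],    # word 0
              [1.0, 1.1],    # word 1
              [2.0, 2.1]])   # word 2 -> V=3, D=2
x = np.array([[0, 2],
              [2, 1]])       # N=2 sequences of T=2 word indices
out, _ = word_embedding_forward(x, W)
print(out.shape)   # (2, 2, 2)
print(out[1, 0])   # [2.0, 2.1], the row of W for word index 2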

This forward pass simply selects, from the matrix of vectors for all words, the rows at the given indices. The backward pass looks like:

def word_embedding_backward(dout, cache):
    """
    Backward pass for word embeddings. We cannot back-propagate into the words
    since they are integers, so we only return gradient for the word embedding
    matrix.
    HINT: Look up the function np.add.at
    Inputs:
    - dout: Upstream gradients of shape (N, T, D)
    - cache: Values from the forward pass
    Returns:
    - dW: Gradient of word embedding matrix, of shape (V, D).
    """
    dW = None
    ##############################################################################
    # TODO: Implement the backward pass for word embeddings.                     #
    #                                                                            #
    # Note that words can appear more than once in a sequence.                   #
    # HINT: Look up the function np.add.at                                       #
    ##############################################################################
    W, x = cache
    dW = np.zeros_like(W)
    # Add dout at the indices x to dW
    np.add.at(dW, x, dout)
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return dW
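The reason for np.add.at rather than a plain fancy-indexed += is repeated words: dW[x] += dout keeps only the last update for a repeated index because NumPy buffers the operation, while np.add.at accumulates every contribution. A toy demonstration:

dW_wrong = np.zeros(3)
dW_right = np.zeros(3)
idx = np.array([1, 1, 2])          # word 1 appears twice
upstream = np.array([1.0, 1.0, 1.0])
dW_wrong[idx] += upstream          # buffered: word 1 gets 1.0, not 2.0
np.add.at(dW_right, idx, upstream)
print(dW_wrong)   # [0. 1. 1.]
print(dW_right)   # [0. 2. 1.]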

Notice that at every timestep we use an affine function to transform the RNN hidden vector into scores for each word in the vocabulary. I omit this here because I already implemented it in assignment 2; if you want to see the code, you can go to my GitHub, in this file, in the functions temporal_affine_forward/backward.
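For readers who don't want to jump over there, the idea is just an affine layer applied to every timestep at once, by folding the time dimension into the batch. A minimal sketch (my reconstruction, equivalent in spirit to the assignment's version):

def temporal_affine_forward_sketch(x, w, b):
    # x: (N, T, D), w: (D, M), b: (M,) -> out: (N, T, M)
    N, T, D = x.shape
    M = b.shape[0]
    out = x.reshape(N * T, D).dot(w).reshape(N, T, M) + b
    return out, (x, w, b)

def temporal_affine_backward_sketch(dout, cache):
    x, w, b = cache
    N, T, D = x.shape
    M = b.shape[0]
    dx = dout.reshape(N * T, M).dot(w.T).reshape(N, T, D)
    dw = x.reshape(N * T, D).T.dot(dout.reshape(N * T, M))
    db = dout.sum(axis=(0, 1))
    return dx, dw, db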

At every timestep we produce a score for each word in the vocabulary, then use the ground-truth word to compute the softmax loss.
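The course provides temporal_softmax_loss for this; in case you want the idea without digging through files, here is a minimal sketch of a masked temporal softmax loss (my reconstruction, not necessarily line-for-line identical to the handout):

def temporal_softmax_loss_sketch(x, y, mask):
    # x: (N, T, V) scores, y: (N, T) ground-truth indices,
    # mask: (N, T) bool, False where the target is <NULL> padding.
    N, T, V = x.shape
    x_flat = x.reshape(N * T, V)
    y_flat = y.reshape(N * T)
    mask_flat = mask.reshape(N * T)

    # Numerically stable softmax
    probs = np.exp(x_flat - x_flat.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)

    loss = -np.sum(mask_flat * np.log(probs[np.arange(N * T), y_flat])) / N

    dx_flat = probs.copy()
    dx_flat[np.arange(N * T), y_flat] -= 1
    dx_flat = dx_flat * mask_flat[:, None] / N
    return loss, dx_flat.reshape(N, T, V)

With these pieces in place, the CaptioningRNN's training-time loss chains everything together: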

    def loss(self, features, captions):
        """
        Compute training-time loss for the RNN. We input image features and
        ground-truth captions for those images, and use an RNN (or LSTM) to compute
        loss and gradients on all parameters.
        Inputs:
        - features: Input image features, of shape (N, D)
        - captions: Ground-truth captions; an integer array of shape (N, T) where
          each element is in the range 0 <= y[i, t] < V
        Returns a tuple of:
        - loss: Scalar loss
        - grads: Dictionary of gradients parallel to self.params
        """
        # Cut captions into two pieces: captions_in has everything but the last word
        # and will be input to the RNN; captions_out has everything but the first
        # word and this is what we will expect the RNN to generate. These are offset
        # by one relative to each other because the RNN should produce word (t+1)
        # after receiving word t. The first element of captions_in will be the START
        # token, and the first element of captions_out will be the first word.
        captions_in = captions[:, :-1]
        captions_out = captions[:, 1:]

        # You'll need this
        mask = (captions_out != self._null)

        # Weight and bias for the affine transform from image features to initial
        # hidden state
        W_proj, b_proj = self.params['W_proj'], self.params['b_proj']

        # Word embedding matrix
        W_embed = self.params['W_embed']

        # Input-to-hidden, hidden-to-hidden, and biases for the RNN
        Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']

        # Weight and bias for the hidden-to-vocab transformation.
        W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']

        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the forward and backward passes for the CaptioningRNN.   #
        # In the forward pass you will need to do the following:                   #
        # (1) Use an affine transformation to compute the initial hidden state     #
        #     from the image features. This should produce an array of shape (N, H)#
        # (2) Use a word embedding layer to transform the words in captions_in     #
        #     from indices to vectors, giving an array of shape (N, T, W).         #
        # (3) Use either a vanilla RNN or LSTM (depending on self.cell_type) to    #
        #     process the sequence of input word vectors and produce hidden state  #
        #     vectors for all timesteps, producing an array of shape (N, T, H).    #
        # (4) Use a (temporal) affine transformation to compute scores over the    #
        #     vocabulary at every timestep using the hidden states, giving an      #
        #     array of shape (N, T, V).                                            #
        # (5) Use (temporal) softmax to compute loss using captions_out, ignoring  #
        #     the points where the output word is <NULL> using the mask above.     #
        #                                                                          #
        # In the backward pass you will need to compute the gradient of the loss   #
        # with respect to all model parameters. Use the loss and grads variables   #
        # defined above to store loss and gradients; grads[k] should give the      #
        # gradients for self.params[k].                                            #
        #                                                                          #
        # Note also that you are allowed to make use of functions from layers.py   #
        # in your implementation, if needed.                                       #
        ############################################################################
        # Word embedding
        captions_in_emb, emb_cache = word_embedding_forward(captions_in, W_embed)
        # Affine forward: image features -> initial hidden state
        h_0, feature_cache = affine_forward(features, W_proj, b_proj)
        # RNN part
        h, rnn_cache = rnn_forward(captions_in_emb, h_0, Wx, Wh, b)

        # Temporal affine: hidden states -> vocabulary scores
        temporal_out, temporal_cache = temporal_affine_forward(h, W_vocab, b_vocab)

        # Temporal softmax loss
        loss, dout = temporal_softmax_loss(temporal_out, captions_out, mask)

        # Gradients, in reverse order through the network
        dtemp, grads['W_vocab'], grads['b_vocab'] = temporal_affine_backward(dout, temporal_cache)
        drnn, dh0, grads['Wx'], grads['Wh'], grads['b'] = rnn_backward(dtemp, rnn_cache)
        dfeatures, grads['W_proj'], grads['b_proj'] = affine_backward(dh0, feature_cache)

        grads['W_embed'] = word_embedding_backward(drnn, emb_cache)
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return loss, grads

This function basically implements the process shown in the image that we saw at the very beginning.

After finishing this function, the cs231n assignment 3 notebook also provides code for overfitting a small dataset; only by passing this check can we trust that the model works at all. So don't forget to overfit small data first, right after finishing your model!
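The notebook drives this with its CaptioningSolver, but the idea fits in a few lines. A minimal hand-rolled sketch, assuming model is a CaptioningRNN and feats/caps are a tiny batch of image features and captions (hypothetical names; the learning rate is an assumed value):

learning_rate = 5e-3
for step in range(250):
    loss, grads = model.loss(feats, caps)
    for name in model.params:
        model.params[name] -= learning_rate * grads[name]  # plain SGD
    if step % 25 == 0:
        print('step %d, loss %f' % (step, loss))
# On a handful of examples the loss should fall close to zero.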

screenshot

If you see this image show up, you should have a big smile on your face : )

At test time, the generated description of the image starts with the <START> token and ends with the <END> token; the results look like:

screenshot
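For reference, test-time sampling is greedy decoding, one word per step; the assignment implements this in CaptioningRNN.sample. Below is a minimal sketch in the same spirit, assuming the parameter names from loss() above and a model._start attribute holding the <START> index (both assumptions on my part):

def sample_sketch(model, features, max_length=30):
    N = features.shape[0]
    W_proj, b_proj = model.params['W_proj'], model.params['b_proj']
    W_embed = model.params['W_embed']
    Wx, Wh, b = model.params['Wx'], model.params['Wh'], model.params['b']
    W_vocab, b_vocab = model.params['W_vocab'], model.params['b_vocab']

    h = features.dot(W_proj) + b_proj                 # initial hidden state (N, H)
    word = np.full(N, model._start, dtype=np.int32)   # feed <START> to every sample
    captions = np.zeros((N, max_length), dtype=np.int32)
    for t in range(max_length):
        x = W_embed[word]                             # embed current words (N, D)
        h, _ = rnn_step_forward(x, h, Wx, Wh, b)
        scores = h.dot(W_vocab) + b_vocab             # vocabulary scores (N, V)
        word = np.argmax(scores, axis=1)              # greedy: pick the best word
        captions[:, t] = word
    return captions

The real sample() also stops at the <END> token; this sketch just runs for max_length steps.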

 

screenshot

 

TBH, it can be really hard to write the code from the very beginning, building all the functions and connecting them logically. But if you can picture the whole process, how the data flows and gets transformed, and recall that one specific process image in your mind, the functions become much easier to implement. Tip: every time you do a dot product, make sure you know the shapes of the operands!
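One habit that helps with that tip (my own habit, not from the assignment): assert shapes right before each dot product while developing.

N, D, H = 4, 10, 8
x = np.random.randn(N, D)
Wx = np.random.randn(D, H)
assert x.shape[1] == Wx.shape[0], (x.shape, Wx.shape)  # inner dims must match
out = x.dot(Wx)  # (N, D) @ (D, H) -> (N, H)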

3rd Aug 2018



