Word2Vec – Example

Here’s a short description of the hands-on code “word2vec.py” presented at the Cork AI Meetup on 15th March 2018. Instructions for running it on an AWS virtual machine, along with the code and sample documents, can be found on GitHub.

This code was obtained from Pete Warden’s GitHub here: https://github.com/petewarden/tensorflow_makefile/blob/master/tensorflow/examples/tutorials/word2vec/word2vec_basic.py. The following reference provides an excellent description of word2vec with the skip-gram model:

McCormick, C. (2016, April 19). Word2Vec Tutorial – The Skip-Gram Model. Retrieved from http://www.mccormickml.com

The following parameters are used:

  • Batch Size for learning: 128
  • Number of dimensions for embedding vector: 128
  • Window size: 1 to left and right.
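To make the window size concrete, here is a minimal sketch of how skip-gram training pairs can be formed with one word of context on each side. The function name `skipgram_pairs` and the toy sentence are assumptions for illustration; this is not the script's own generate_batch().

```python
def skipgram_pairs(tokens, window=1):
    """Return (centre, context) pairs for each token and its neighbours within `window`."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs


# Toy sentence, made up for illustration.
sentence = "the quick brown fox".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

With a window of 1, each interior word contributes two pairs (left and right neighbour), and the words at either end contribute one.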

This code performs the following tasks:

  • maybe_download() and read_data(): Download and read about 1.7 million words from a zip file.
  • build_dataset(): Builds a dictionary mapping each term to an integer index, and a reverse dictionary mapping each index back to its term. Less common words are represented by the token ‘UNK’.
  • generate_batch(): Generates a batch of skip-gram training pairs (a term and one of its neighbouring terms). Batch size is 128.
  • Creates a TensorFlow graph with a single hidden layer of dimensions “vocabulary_size” by “embedding_size”. Its weights are the mappings from each term into the vector space.
  • Creates a loss function: the mean over the batch of the Noise Contrastive Estimation (NCE) loss.
  • Trains the model for 100000 steps. Every 10000 steps, shows example terms together with their semantically closest terms.
  • Creates a visual representation of the semantic vector space, reduced to 2 dimensions using Principal Component Analysis (PCA).
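The dictionary-building step can be sketched in a few lines of plain Python. This is a simplified illustration patterned on the description above, not the script's exact code; the toy word list and the tiny vocabulary size are assumptions. Here `dictionary` maps a term to its index and `reverse_dictionary` maps an index back to its term.

```python
from collections import Counter


def build_dataset(words, vocabulary_size):
    """Encode words as integer ids; words outside the most common ones become 'UNK' (id 0)."""
    counts = [("UNK", 0)]
    # Keep the (vocabulary_size - 1) most frequent words; slot 0 is reserved for 'UNK'.
    counts.extend(Counter(words).most_common(vocabulary_size - 1))
    dictionary = {word: idx for idx, (word, _) in enumerate(counts)}
    data = [dictionary.get(word, 0) for word in words]
    reverse_dictionary = {idx: word for word, idx in dictionary.items()}
    return data, dictionary, reverse_dictionary


# Toy corpus, made up for illustration.
words = "a b a c a b d".split()
data, dictionary, reverse_dictionary = build_dataset(words, vocabulary_size=3)
print(data)
print(reverse_dictionary)
```

The encoded `data` list is what a batch generator would slide its window over; the reverse dictionary is only needed to turn ids back into readable terms when printing results.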

The program outputs the following for the last iteration. It shows a loss of about 4.7, along with a sample of terms and the terms semantically closest to each:

[Image: Word2VecResult]

Notice how the term “three” is semantically similar to “five”, “four”, “seven” etc. as would be expected.
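One common way to find "semantically close" terms is cosine similarity between embedding vectors. The sketch below assumes that measure and uses made-up 2-dimensional vectors rather than output from the trained model; the helper name `nearest` is hypothetical.

```python
import numpy as np


def nearest(word, vocab, embeddings, top_k=3):
    """Return the top_k words with the highest cosine similarity to `word`."""
    # Scale each row to unit length so a dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(word)]
    order = np.argsort(-sims)  # most similar first
    return [vocab[i] for i in order if vocab[i] != word][:top_k]


# Toy embeddings, invented for illustration only.
vocab = ["three", "four", "five", "cat"]
embeddings = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [1.0, 0.0],
    [0.0, 1.0],
])
neighbours = nearest("three", vocab, embeddings, top_k=2)
print(neighbours)
```

In this toy data the number words point in nearly the same direction, so they rank above "cat", which is close to orthogonal to them.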

The semantic space visualization looks like the following:

[Image: tsne]
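A reduction of the embedding vectors to two dimensions with PCA can be sketched using NumPy's SVD. This is a minimal illustration under the assumption that plain PCA is used as described above; it is not the script's own plotting code, and the random vectors stand in for learned embeddings.

```python
import numpy as np


def pca_2d(embeddings):
    """Project row vectors onto their first two principal components."""
    centred = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T


rng = np.random.default_rng(0)
vectors = rng.normal(size=(50, 128))  # stand-in for 50 embeddings of 128 dimensions
points = pca_2d(vectors)
print(points.shape)
```

Each row of `points` is an (x, y) coordinate that could be passed to a scatter plot and labelled with its term.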

Zooming in shows an area in some detail, showing related words:

[Image: Wordspacebit]
