Here’s a short description of the hands-on code “word2vec.py” from the Cork AI Meetup, 15th March 2018. Instructions for executing it on an AWS virtual machine, along with the code and sample documents, can be found on GitHub.
This code was obtained from Pete Warden’s GitHub here: https://github.com/petewarden/tensorflow_makefile/blob/master/tensorflow/examples/tutorials/word2vec/word2vec_basic.py. This reference provides an excellent description of word2vec with skip-gram:
McCormick, C. (2016, April 19). Word2Vec Tutorial – The Skip-Gram Model. Retrieved from http://www.mccormickml.com
The following parameters are used:
- Batch Size for learning: 128
- Number of dimensions for embedding vector: 128
- Window size: 1 word to the left and 1 to the right.
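To make the window parameter concrete, here is a small, hypothetical illustration (not taken from the meetup code) of the (target, context) pairs that a skip-gram window of 1 produces:

```python
# Hypothetical helper for illustration only: with a window of 1, each word
# is paired with its immediate left and right neighbours.
def skipgram_pairs(words, window=1):
    """Return (target, context) pairs for a skip-gram window."""
    pairs = []
    for i, target in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((target, words[j]))
    return pairs

pairs = skipgram_pairs(["the", "quick", "brown", "fox"])
# "quick" is paired with both "the" and "brown"
```

With a larger window, each target word would be paired with more surrounding context words, at the cost of more training pairs per sentence.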
This code performs the following tasks:
- maybe_download() and read_data(): Downloads and reads about 1.7 million words from a zip file.
- build_dataset(): Builds a dictionary mapping each term to an index, and a reverse dictionary mapping each index back to its term. Less common words are replaced by the token ‘UNK’.
- generate_batch(): Generates a batch of (target, context) training pairs using the skip-gram approach. Batch size is 128.
- Creates a TensorFlow graph with a single hidden layer of dimensions “vocabulary_size” by “embedding_size”; its weights are the mappings from terms into the vector space.
- Creates a loss function based on reducing the mean Noise Contrastive Estimation (NCE) loss.
- Trains the model for 100,000 steps. Every 10,000 steps it prints examples of a term together with other semantically close terms.
- Create a visual representation of the semantic vector space reduced to 2 dimensions using Principal Component Analysis (PCA).
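The dictionary-building step above can be sketched in a few lines. This is a simplified version of the idea behind build_dataset(), not the meetup code verbatim: keep the most frequent words, map everything else to ‘UNK’, and build both forward and reverse lookups.

```python
import collections

def build_dataset(words, vocabulary_size):
    # Reserve index 0 for 'UNK'; fill the rest with the most common words.
    counts = [("UNK", -1)]
    counts.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = {word: i for i, (word, _) in enumerate(counts)}   # term -> index
    data = [dictionary.get(word, 0) for word in words]             # 0 is UNK
    reverse_dictionary = {i: word for word, i in dictionary.items()}  # index -> term
    return data, dictionary, reverse_dictionary

data, d, rd = build_dataset(["a", "b", "a", "c", "a", "b"], vocabulary_size=3)
# 'c' is too rare for a vocabulary of 3, so it maps to the 'UNK' index
```

The reverse dictionary is what lets the training loop print human-readable words when it reports semantically close terms every 10,000 steps.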
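The visualization step reduces the 128-dimensional embedding vectors to 2 dimensions. A minimal PCA-via-SVD sketch of that reduction (an assumption-laden stand-in, not the meetup script itself) looks like:

```python
import numpy as np

def pca_2d(embeddings):
    """Project row vectors onto their two directions of greatest variance."""
    centered = embeddings - embeddings.mean(axis=0)        # center each dimension
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T                             # shape (n_words, 2)

rng = np.random.default_rng(0)
vecs = rng.normal(size=(50, 128))   # stand-in for 50 learned 128-d embeddings
points = pca_2d(vecs)               # 50 points ready for a 2-D scatter plot
```

Each row of the result can then be scattered on a plane and labeled with its word, which is how the semantic-space plots below were produced.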
The program outputs the following for the last iteration, showing the loss to be about 4.7 together with a sample of terms and their semantically similar neighbours:
Notice how the term “three” is semantically similar to “five”, “four”, “seven”, etc., as would be expected.
The semantic space visualization looks like the following:
Zooming in shows an area in some detail, showing related words: