First, documents can be represented by a “Bag of Words” (a list of the unique words in a document) or a “Frequency Distribution” (a list of the unique words in a document together with the occurrence frequency).
The simplest similarity measure covered in this series is the Jaccard Similarity measure. This uses a bag of words and compares the number of common words between two documents with the overall number of words. This does not take into account the relative frequency nor the order of the words in the two documents.
Three measures using Frequency Distributions are described. In all these cases a n-dimensional space is created from the Frequency Distribution with a dimension for each word in the documents being compared. They are:
- Euclidean Distance – The shortest distance between two documents in the Frequency Distribution space.
- Manhattan Distance – The sum of all the sides in the hyper rectangle formed around two documents in the Frequency Distribution Space.
- Cosine Distance – The cosine of the angle subtended at the origin between two documents in the Frequency Distribution Space.
These measures take into account the word frequency, but Cosine Distance cannot distinguish between documents where the relative frequency of words is the same (rather than the absolute frequency). They do not take into account the order of the words in the documents.
The final measure is the Levenshtein Minimum Edit Distance. This measure aligns two documents and calculates the number of inserts, deletes or substitutions that are required to change the first document into the second document, which may not necessarily be the same length. This measure takes into account the words, the frequency of words and the order of words in the document.
By using MinHash and Locality Sensitivity Hashing, similar documents can be identified very, very efficiently – these techniques are related to the Jaccard Similarity Index.