Jaccard Similarity Index for measuring Document Similarity

The Jaccard Similarity Index can be used to represent the similarity between two documents. A value “0” means the documents are completely dissimilar, “1” that they are identical,  and values between 0 and 1 representing a degree of similarity.

The Jaccard index is defined as the size of the intersection of the two documents divided by the size of the union of the two documents.

Of course, we need to decide how to represent the two documents as sets. Of the different ways this can be done, we might:

1. Use a Bag Of Words, which is the list of unique words in the document, with no frequency count.
2. A Bag of Words with Stop Words removed (common language words like “the”, “at” for English).
3. Sets that take into account the frequency of words in the document.
4. k-Shingles which are fixed k length overlapping runs of characters extracted from the document.
5. n-grams which are fixed length n word runs. For example, a 2-gram are overlapping pairs of words taken from a document.

A simple, efficient, way of representing a bag of words is to create a dictionary of all words and assign a unique integer value to each word. Then, the bag of words for a document is simply the unique list of integer values for the words in the document, e.g. List.

For a comparison between Bag of Words and Frequency Distributions see here.

Since sets are not ordered, a bag of words represented like this takes no account of the order in which words appear in the document.

Here is a C# class to calculate the Jaccard index:

public class Jaccard
{

public static double Calc(HashSet<int> hs1, HashSet<int> hs2)
{
return ((double)hs1.Intersect(hs2).Count() /
(double)hs1.Union(hs2).Count());
}

public static double Calc(List<int> ls1, List<int> ls2)
{
HashSet<int> hs1 = new HashSet<int>(ls1);
HashSet<int> hs2 = new HashSet<int>(ls2);
return Calc(hs1, hs2);
}
}

Notice how the List objects are converted to HashSet, since the HashSet class has methods for performing set calculations. This C# code shows some tests for this class:

public void JaccardTest1()
{
Dictionary<int, string> wordDict = new Dictionary<int, string>();

List<int> doc1 = new List<int>();

List<int> doc2 = new List<int>();

List<int> doc3 = new List<int>();

Console.WriteLine("Jaccard: " + Jaccard.Calc(doc1, doc2));
Console.WriteLine("Jaccard: " + Jaccard.Calc(doc1, doc1));
Console.WriteLine("Jaccard: " + Jaccard.Calc(doc1, doc3));
}

This creates three documents:

• Doc1: “Word2”, “Word3”, “Word4”, “Word2”.
• Doc2: “Word1”, “Word5”, “Word4”, “Word2”.
• Doc3: “Word1”.

The output is:

Jaccard: 0.4
Jaccard: 1
Jaccard: 0

For the first result, the total number of unique words (union) in “Doc1” and “Doc2” is 5, and the number of shared words (intersection) is 2. This gives the Jaccard Similarity of 2.0 / 5.0 = 0.4.

The second result shows that a document compared to itself gives a value of 1 (completely similar) and the third test shows a value of 0 for completely dissimilar documents.

Reference: “Mining of Massive Datasets” by A.Rajaraman, J. Leskovec, J.D. Ullman, Cambridge University Press. 2011.  P.75. See http://infolab.stanford.edu/~ullman/mmds.html