Document Clustering – Example

Here’s a description of the hands-on code “doccluster.py” presented at the Cork AI Meetup on 15th March 2018. Instructions for running it on an AWS virtual machine, along with the code and sample documents, can be found on GitHub. The code runs against a set of 13 HTML documents with content from w3c.org, located in the “Documents” folder.

This code performs the following tasks:

  • getdocumentlist(): Gets the list of documents to cluster.
  • readdocs(): Reads each document, extracts text from HTML using BeautifulSoup and adds the text to a dictionary keyed by the document filename.
  • vectorize(): Creates a vector of TF-IDF (Term Frequency-Inverse Document Frequency) term weights for each document. Vectorization uses tokenize() to tokenize the text extracted from the HTML documents, and calls stemtokens() to stem the tokens using the Porter stemmer from NLTK (Natural Language Toolkit).
  • hierarchical_cluster(): Creates a dendrogram showing the relatedness of the documents, computed from a Euclidean distance matrix with the Ward linkage algorithm.
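
The reading and vectorizing steps above can be sketched roughly as follows. This is a minimal illustration, not the actual doccluster.py: the sample HTML strings, the regex-based tokenizer and the helper names extract_text() and tokenize_and_stem() are assumptions made for the example, and scikit-learn's TfidfVectorizer is used here as one common way to compute TF-IDF.

```python
import re

from bs4 import BeautifulSoup
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical in-memory stand-ins for the HTML files in the Documents folder.
SAMPLE_DOCS = {
    "css_v1.html": "<html><body><h1>CSS</h1><p>Style sheets describe the styling of documents.</p></body></html>",
    "css_v2.html": "<html><body><h1>CSS</h1><p>Style sheets describe styling and layout of documents.</p></body></html>",
    "xml.html": "<html><body><h1>XML</h1><p>A markup language for encoding structured data.</p></body></html>",
}

stemmer = PorterStemmer()


def extract_text(html):
    """Strip the markup and return only the text content."""
    return BeautifulSoup(html, "html.parser").get_text(separator=" ")


def tokenize_and_stem(text):
    """Lowercase, keep runs of letters as tokens, then Porter-stem each one.

    The regex tokenizer is an assumption for this sketch; the original
    tokenize() may split text differently.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(token) for token in tokens]


def vectorize(docs):
    """Return filenames, a TF-IDF matrix (one row per document) and the vectorizer."""
    names = sorted(docs)
    texts = [extract_text(docs[name]) for name in names]
    # token_pattern=None because the custom tokenizer replaces the default pattern.
    vectorizer = TfidfVectorizer(tokenizer=tokenize_and_stem, token_pattern=None)
    matrix = vectorizer.fit_transform(texts)
    return names, matrix, vectorizer


names, tfidf, vectorizer = vectorize(SAMPLE_DOCS)
print(names)
print(tfidf.shape)  # (number of documents, number of distinct stems)
```

Each row of the resulting matrix is one document, and each column is one stemmed term, so documents that share many distinctive terms end up with similar rows.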

The program produces the following output:

[Image: program output]

The dendrogram is created in the “output_images” folder and looks like the following:

[Image: dendrogram of the clustered documents]

As expected, versions of the same document are more closely related to each other than to other documents.
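
To make that observation concrete, here is a small sketch of Ward-linkage clustering using SciPy. The three vectors and their filenames are made up for illustration (two near-identical "versions" and one unrelated document); they are not taken from the meetup data, and the real hierarchical_cluster() may be written differently.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical toy vectors standing in for TF-IDF rows.
labels = ["spec_v1.html", "spec_v2.html", "other.html"]
vectors = np.array([
    [0.9, 0.1, 0.0],  # spec_v1.html
    [0.8, 0.2, 0.0],  # spec_v2.html, almost the same as v1
    [0.0, 0.1, 0.9],  # other.html, unrelated content
])

# Ward linkage over Euclidean distances between the rows.
merges = linkage(vectors, method="ward", metric="euclidean")

# The first row of the linkage matrix holds the two items merged first,
# i.e. the closest pair.
first_pair = sorted(int(i) for i in merges[0, :2])
print([labels[i] for i in first_pair])

# no_plot=True returns the tree layout without needing matplotlib.
tree = dendrogram(merges, labels=labels, no_plot=True)
print(tree["ivl"])  # leaf labels in left-to-right dendrogram order
```

The two versions are merged first because they are the closest pair, and the unrelated document only joins at a much greater distance. Dropping no_plot=True and saving the figure with matplotlib would give an image like the one above.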
