Here’s a description of the hands-on code “doccluster.py” from the Cork AI Meetup on 15th March 2018. The instructions for running it on an AWS virtual machine, along with the code and sample documents, can be found on GitHub. The code runs against a set of 13 HTML documents with content from w3c.org, located in the “Documents” folder.
This code performs the following tasks:
- getdocumentlist(): Gets the list of documents to cluster.
- readdocs(): Reads each document, extracts text from HTML using BeautifulSoup and adds the text to a dictionary keyed by the document filename.
- vectorize(): Creates a TF-IDF (Term Frequency-Inverse Document Frequency) vector for each document, with one weight per term. Vectorization uses tokenize() to tokenize the text extracted from the HTML documents, and calls stemtokens() to stem the tokens using the Porter stemmer from NLTK (Natural Language Toolkit).
- hierarchical_cluster(): Creates a dendrogram showing how closely the documents are related, computed from a Euclidean distance matrix using the Ward linkage algorithm.
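The text extraction and vectorization steps above can be sketched roughly as follows. This is a minimal illustration, not the code from doccluster.py: it assumes scikit-learn's TfidfVectorizer for the TF-IDF step (the post does not say which library the script uses), a simple regular expression for tokenizing, and two made-up HTML snippets in place of the real documents. The helper name html_to_text() and the sample filenames are hypothetical.

```python
import re

from bs4 import BeautifulSoup
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()


def html_to_text(html):
    """Strip the markup and return the text content of an HTML string."""
    return BeautifulSoup(html, "html.parser").get_text(separator=" ")


def tokenize(text):
    """Lowercase, keep alphabetic runs, then Porter-stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(token) for token in tokens]


def vectorize(docs):
    """Return (filenames, TF-IDF matrix) with one row per document."""
    names = sorted(docs)
    # token_pattern=None because our own tokenizer replaces the default one
    vectorizer = TfidfVectorizer(tokenizer=tokenize, token_pattern=None)
    matrix = vectorizer.fit_transform([docs[name] for name in names])
    return names, matrix


# Made-up stand-ins for the files in the "Documents" folder
raw_html = {
    "a.html": "<html><body><h1>Clustering</h1><p>Clusters of documents</p></body></html>",
    "b.html": "<html><body><p>Styling pages with style sheets</p></body></html>",
}

# Dictionary of extracted text keyed by filename, as readdocs() is described
docs = {name: html_to_text(html) for name, html in raw_html.items()}
names, matrix = vectorize(docs)
```

Here matrix has one row per document and one column per stemmed term, so “Clustering” and “Clusters” end up sharing a column.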
The program produces the following output:
The dendrogram is created in the “output_images” folder and looks like the following:
As expected, versions of a specific document are more closely related to each other than to other documents.
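To see why near-identical documents pair up first, here is a small sketch of Ward linkage on three toy vectors. It assumes SciPy's scipy.cluster.hierarchy module, which may not be what doccluster.py calls, and the labels and vector values are invented for illustration. Passing the raw vectors to linkage() lets SciPy compute the Euclidean distances itself.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical document vectors: two versions of one spec and an unrelated page
labels = ["spec-v1", "spec-v2", "other"]
vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])

# Each row of `merges` records one merge: the two cluster ids joined,
# the distance at which they were joined, and the size of the new cluster
merges = linkage(vectors, method="ward", metric="euclidean")

# The first row is the closest pair, which here is the two versions
first_pair = sorted(labels[int(i)] for i in merges[0, :2])

# no_plot=True returns the dendrogram layout without drawing it
tree = dendrogram(merges, labels=labels, no_plot=True)
leaf_order = tree["ivl"]
```

Dropping no_plot=True (with matplotlib installed) would draw the tree, with the two versions joined at the lowest height.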