Document clustering (or text clustering) is the application of cluster analysis to textual documents. It has applications in automatic document organisation, topic extraction and fast information retrieval or filtering.
In this post, I will examine the sub-task of grouping documents that refer to the same topic, using an approach based on Word2Vec and KMeans.
This approach is quite common, and there are a lot of similar works on Medium (see here). However, I will start from an architectural point of view. The goal is the creation of an application for text clustering (TCA for brevity) that gives us the ability to:
- import different text formats
- manage different Clustering Algorithms
- compare the results
A possible architecture is the following:
The application is composed of 4 modules:
- Read Module that reads the documents from a repository
- Word2Vec Module that transforms the documents into embeddings
- Cluster Module that applies different clustering techniques to the documents (KMeans, LDA, …)
- Presentation Layer that reads the different outputs from the Cluster Module and shows them to the end users
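To make the architecture concrete, the first three modules could be sketched as interfaces. This is a plain-Python sketch, and the class and method names are my own invention, not part of any library:

```python
from abc import ABC, abstractmethod

class ReadModule(ABC):
    """Reads documents from a repository (file, database, ...)."""
    @abstractmethod
    def read(self, path: str) -> list[str]: ...

class EmbeddingModule(ABC):
    """Transforms documents into fixed-size embedding vectors."""
    @abstractmethod
    def embed(self, documents: list[str]) -> list[list[float]]: ...

class ClusterModule(ABC):
    """Assigns a cluster id to every embedded document (KMeans, LDA, ...)."""
    @abstractmethod
    def cluster(self, embeddings: list[list[float]]) -> list[int]: ...
```

The Presentation Layer then only depends on the list of cluster ids produced by a ClusterModule, so clustering algorithms can be swapped without touching the rest of the pipeline.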
We can imagine TCA as an application that exploits Spark for the computation and has a web application for the interaction with the end user. However, for this post, I will show a simple module that:
- Reads the documents from a text file
- Computes the embeddings using Word2Vec
- Clusters the documents using KMeans
- Shows the results
The repository of the documents is a simple text file, called documents.txt, where each line represents a document (plus a header line with the column name text, since the Read Module loads the file as a CSV with a header). For this demo, the file contains only 4 documents:
- The sun is a star
- The earth is a planet
- The moon is a satellite’s earth
- The sun is yellow
Just for fun, let’s think about how we would manually cluster our documents. I would guess that the first and last documents belong to one cluster, and the other two to another.
Now we can start coding. The Read Module is quite simple, so I will just show the code:
documentsPath = "../data/documents.txt"
documents = sqlContext.read.format("csv").option("header", "true").load(documentsPath)
Now, we must perform some basic preprocessing, such as lowercasing the text and splitting the documents into words.
# To lower and split by space
documents = documents.withColumn("text_splitted", split(lower(col("text")), " "))
The result is a DataFrame with a column called text_splitted:
- [the, sun, is, a, star]
- [the, earth, is, a, planet]
- [the, moon, is, a, satellite’s, earth]
- [the, sun, is, yellow]
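For clarity, here is the same preprocessing in plain Python (no Spark), applied to the four demo documents:

```python
docs = [
    "The sun is a star",
    "The earth is a planet",
    "The moon is a satellite's earth",
    "The sun is yellow",
]

# Lowercase and split on single spaces, mirroring
# split(lower(col("text")), " ") in the Spark version.
tokenized = [d.lower().split(" ") for d in docs]

print(tokenized[0])  # ['the', 'sun', 'is', 'a', 'star']
```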
Now, let’s start with Word2Vec. It is a family of techniques that takes a large corpus of text and produces, for each word, a vector (typically with hundreds of dimensions). The mathematics behind Word2Vec is not the goal of this post; if you want to know more, I wrote a post about Word2Vec in Scala.
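The practical payoff of these vectors is that similarity between words becomes measurable, usually with cosine similarity. Here is a minimal sketch with made-up 3-dimensional vectors (real Word2Vec vectors have hundreds of dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings", invented for illustration only.
sun = [0.9, 0.1, 0.2]
star = [0.8, 0.2, 0.1]
planet = [0.1, 0.9, 0.3]

# A good embedding places "sun" closer to "star" than to "planet".
assert cosine_similarity(sun, star) > cosine_similarity(sun, planet)
```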
Let’s see the code for the creation of the Word2Vec model:
# Word2Vec
word2Vec = Word2Vec(vectorSize=100, minCount=0, maxIter=100,
                    inputCol="text_splitted", outputCol="features")
- vectorSize is the dimension of the vector for each word
- minCount is the minimum number of times a word must appear to be included in the vocabulary
- maxIter is the maximum number of iterations for the computation
- inputCol is the name of the column containing the split text
- outputCol is the name of the column of the vectors
Now, we can train the model with our documents:
model = word2Vec.fit(documents)
Due to the limited number of documents, the model is ready in a few seconds. The next step is the creation of an embedding for every document, and here we meet the first problem: how can we combine the word vectors of a document into a single vector? Without loss of generality, we can take the document vector to be a combination (max, min, average) of the embeddings of its words. For this, Spark ML helps us with the transform method:
result = model.transform(documents)
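Under the hood, Spark ML's Word2VecModel.transform represents each document as the component-wise average of its word vectors. A plain-Python sketch of that combination, with 2-dimensional word vectors made up for illustration:

```python
def average_vectors(word_vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(word_vectors)
    dim = len(word_vectors[0])
    return [sum(v[i] for v in word_vectors) / n for i in range(dim)]

# Made-up 2-dimensional word vectors for one document.
vectors = [
    [2.0, 0.0],  # "the"
    [0.0, 2.0],  # "sun"
    [1.0, 1.0],  # "is"
]
doc_embedding = average_vectors(vectors)
print(doc_embedding)  # [1.0, 1.0]
```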
Now we can perform the text clustering. There are different techniques, but in this post, I use KMeans:
# KMeans Clustering
numIterations = 200
numberClusters = 2
kmeans = KMeans().setMaxIter(numIterations).setK(numberClusters).setSeed(1)
kmeans_model = kmeans.fit(result)
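For intuition, this is roughly what KMeans does under the hood (Lloyd's algorithm): alternately assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. A minimal plain-Python sketch on made-up 2D points, with fixed initial centroids instead of random initialization to keep it deterministic:

```python
import math

def kmeans(points, k, centroids, iterations=20):
    """Minimal Lloyd's algorithm; `centroids` are the k starting centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [math.dist(p, c) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (a centroid with no points keeps its position).
        centroids = [
            [sum(coord) / len(cluster) for coord in zip(*cluster)]
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    # Return the final cluster id of each point.
    return [min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points]

points = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]
labels = kmeans(points, k=2, centroids=[[0.0, 0.0], [5.0, 5.0]])
print(labels)  # [0, 0, 1, 1]
```

Spark ML's KMeans implements the same idea, distributed over the cluster and with smarter initialization (k-means||).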
To see the cluster assigned to each document, we use the transform method:
# Make predictions
predictions = kmeans_model.transform(result)
predictions.show()
Here we are! Below is the cluster assigned to each document:
- The sun is a star => 1
- The earth is a planet => 0
- The moon is a satellite’s earth => 0
- The sun is yellow => 1
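This matches the manual guess from earlier. Note that the cluster ids themselves are arbitrary (KMeans could just as well have returned 0 and 1 swapped), so comparing a result with a manual guess means comparing groupings, not labels. A small helper for that, written here just for illustration:

```python
def same_partition(labels_a, labels_b):
    """True if the two labelings group the items identically,
    regardless of which integer names each cluster."""
    mapping = {}
    for a, b in zip(labels_a, labels_b):
        if a in mapping and mapping[a] != b:
            return False
        mapping[a] = b
    # The mapping must also be one-to-one.
    return len(set(mapping.values())) == len(mapping)

manual = [0, 1, 1, 0]     # guess: docs 1 & 4 together, docs 2 & 3 together
predicted = [1, 0, 0, 1]  # the KMeans output above
print(same_partition(manual, predicted))  # True
```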
Just to close the circle, let’s see the code for saving the predictions to a CSV file.
# Save Output
predictionsPath = "../data/predictions.txt"
predictions.select(["text", "prediction"]).write.csv(predictionsPath, mode="overwrite", header="true")
Not bad for a few lines of code. As usual, the code is available here.
In the next posts, I will extend this simple approach by allowing more clustering techniques (like LDA) and discussing the performance and correctness of this approach.