Text Clustering using Python and Spark

Hi everyone,

in this post, I’ll show a very simple text clustering module built with SparkML and Python. The first question is: what is text clustering?

From Wikipedia:

Document clustering (or text clustering) is the application of cluster analysis to textual documents. It has applications in automatic document organisation, topic extraction and fast information retrieval or filtering.

Of course, the full task requires a lot of work, so in this post I will examine only the sub-task of grouping documents that refer to the same topic, using an approach based on Word2Vec and KMeans.
This approach is quite common, and there are many similar articles on Medium (see here). However, I will start from an architectural point of view. The goal is to create an application for text clustering (TCA for brevity) that lets us:

  • import different text formats
  • manage different Clustering Algorithms
  • compare the results

A possible architecture is the following:

This application is formed by 4 modules:

  1. Read Module that reads the documents from a repository
  2. Word2Vec Module that transforms the documents into embeddings
  3. Cluster Module that applies different clustering techniques to the documents (KMeans, LDA, …)
  4. Presentation Layer that reads the different outputs from the Cluster Module and shows them to the end users
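
The four modules above can be sketched as plain functions that are plugged together. This is only an illustration of the architecture, not the actual TCA code: the type aliases, `run_tca`, and the toy stand-ins are hypothetical names introduced here, and the toy functions replace Spark so the sketch runs on its own.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical type aliases for this sketch (not part of the original code).
Document = str
Vector = List[float]
Reader = Callable[[str], List[Document]]
Embedder = Callable[[Sequence[Document]], List[Vector]]
Clusterer = Callable[[Sequence[Vector]], List[int]]


def run_tca(path: str, reader: Reader, embedder: Embedder,
            clusterers: Dict[str, Clusterer]) -> Dict[str, Dict[Document, int]]:
    """Run every registered clustering technique on the same embeddings."""
    documents = reader(path)                    # 1. Read Module
    vectors = embedder(documents)               # 2. Word2Vec Module
    results: Dict[str, Dict[Document, int]] = {}
    for name, cluster in clusterers.items():    # 3. Cluster Module
        labels = cluster(vectors)
        results[name] = dict(zip(documents, labels))
    return results                              # 4. input for the Presentation Layer


# Toy stand-ins so the sketch runs without Spark.
def toy_reader(path: str) -> List[Document]:
    return ["The sun is a star", "The earth is a planet"]


def toy_embedder(docs: Sequence[Document]) -> List[Vector]:
    return [[float(len(d))] for d in docs]      # one "feature": the text length


def by_length(vectors: Sequence[Vector]) -> List[int]:
    return [0 if v[0] < 18 else 1 for v in vectors]


results = run_tca("ignored-by-the-toy-reader", toy_reader, toy_embedder,
                  {"by_length": by_length})
```

Keeping the clustering techniques in a dictionary is one way to support the "manage different clustering algorithms" and "compare the results" goals: each technique sees the same embeddings and its output is stored under its own name.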

We can imagine TCA as an application that uses Spark for the computation and has a web application for interacting with the end user. However, for this post, I will show a simple module that:

  1. Reads the documents from a text file
  2. Computes the embeddings using Word2Vec
  3. Clusters the documents using KMeans
  4. Shows the results

The repository of the documents is a simple text file, called data.dat, and each line represents a document. For this demo, the file contains only 4 documents:

  • The sun is a star
  • The earth is a planet
  • The moon is a satellite’s earth
  • The sun is yellow

Just for fun, let’s think about how we would cluster these documents manually. I would expect the first and last documents to belong to one cluster, and the other two to another.

Read Module

Now we can start coding. The Read Module is quite simple, so I will just show the code:

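# sqlContext is predefined in the PySpark shell; in a standalone script, use a SparkSession instead
# header=true: the first line of the file holds the column name ("text")
documentsPath = "../data/documents.txt"
documents = sqlContext.read.format("csv").option("header", "true").load(documentsPath)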

Next, we apply some basic preprocessing: converting the text to lowercase and splitting each document into words.

from pyspark.sql.functions import col, lower, split

# To lower and split by space
documents = documents.withColumn("text_splitted", split(lower(col("text")), " "))

The result is a DataFrame with a column called text_splitted:

  • [the, sun, is, a, star]
  • [the, earth, is, a, planet]
  • [the, moon, is, a, satellite’s, earth]
  • [the, sun, is, yellow]
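
To make the preprocessing concrete, here is a plain-Python sketch of the same lowercase-and-split step, applied to the four demo documents. It is not the Spark code: `tokenize` is a hypothetical helper introduced here, and it only mirrors the Spark expression for text separated by single spaces.

```python
from typing import List

documents = [
    "The sun is a star",
    "The earth is a planet",
    "The moon is a satellite’s earth",
    "The sun is yellow",
]


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it on single spaces."""
    return text.lower().split(" ")


text_splitted = [tokenize(doc) for doc in documents]
for words in text_splitted:
    print(words)
```

Note that no punctuation is stripped, so "satellite’s" stays a single token, exactly as in the list above.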

Word2Vec Module

Now, let’s start with Word2Vec. It is a group of models that take a large corpus of text and produce, for each word, a vector in a vector space (typically of hundreds of dimensions). The mathematics behind Word2Vec is beyond the scope of this post; however, if you want to know more, I wrote a post about Word2Vec in Scala.

Let’s see the code for the creation of the Word2Vec model:

from pyspark.ml.feature import Word2Vec

# Word2Vec
word2Vec = Word2Vec(vectorSize=100, minCount=0, maxIter=100, inputCol="text_splitted", outputCol="features")


  • vectorSize is the dimension of the vector for each word
  • minCount is the minimum number of times a word must appear in the corpus to be included in the vocabulary
  • maxIter is the maximum number of training iterations
  • inputCol is the name of the column that contains the split text
  • outputCol is the name of the column that will contain the vectors
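
The choice of minCount=0 matters for such a tiny corpus. The sketch below counts word frequencies in plain Python and shows which words would survive different thresholds. It only illustrates the filtering idea: `vocabulary` is a hypothetical helper, and the comparison with 5 assumes that is Spark's default minCount.

```python
from collections import Counter
from typing import List, Set

corpus = [
    ["the", "sun", "is", "a", "star"],
    ["the", "earth", "is", "a", "planet"],
    ["the", "moon", "is", "a", "satellite’s", "earth"],
    ["the", "sun", "is", "yellow"],
]


def vocabulary(docs: List[List[str]], min_count: int) -> Set[str]:
    """Keep only the words that appear at least min_count times in the corpus."""
    counts = Counter(word for doc in docs for word in doc)
    return {word for word, n in counts.items() if n >= min_count}


vocab_all = vocabulary(corpus, 0)      # every word is kept
vocab_default = vocabulary(corpus, 5)  # assumed default threshold: nothing survives
```

Even the most frequent words ("the" and "is") appear only four times here, so a threshold of 5 would leave an empty vocabulary.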

Now, we can train the model with our documents:

model = word2Vec.fit(documents)

Because there are so few documents, the model is ready in a few seconds. The next step is to create an embedding for every document. This raises the first problem: how can we combine the word vectors of a document? Without loss of generality, we can take the vector of a document to be a combination (max, min, average) of the embeddings of its words. SparkML helps us here with the transform method, which represents each document by the average of its word vectors.

result = model.transform(documents)
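
Spark's Word2Vec model turns a document into a vector by averaging the vectors of its words. Here is a minimal plain-Python sketch of that averaging. The 2-dimensional word vectors are made-up toy values (the real model uses vectorSize=100), `document_vector` is a hypothetical helper, and every word is assumed to be in the vocabulary, which holds here because minCount=0.

```python
from typing import Dict, List

# Toy word vectors, invented for this sketch.
word_vectors: Dict[str, List[float]] = {
    "the": [1.0, 1.0],
    "sun": [1.0, 0.0],
    "is": [0.0, 1.0],
    "yellow": [2.0, 2.0],
}


def document_vector(words: List[str], vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of the words, component by component."""
    dimension = len(next(iter(vectors.values())))
    total = [0.0] * dimension
    for word in words:
        for i, value in enumerate(vectors[word]):
            total[i] += value
    return [value / len(words) for value in total]


doc = document_vector(["the", "sun", "is", "yellow"], word_vectors)
```

With these toy values the four word vectors sum to [4.0, 4.0], so the document vector is [1.0, 1.0].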

KMeans Clustering

Now we can perform the text clustering. There are different techniques, but in this post I use KMeans:

from pyspark.ml.clustering import KMeans

# KMeans Clustering
numIterations = 200
numberClusters = 2
kmeans = KMeans().setMaxIter(numIterations).setK(numberClusters).setSeed(1)
kmeans_model = kmeans.fit(result)
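
For readers who want to see what the fit step does conceptually, here is a small plain-Python sketch of the classic k-means loop (Lloyd's algorithm): assign each point to its nearest center, move each center to the mean of its points, and repeat until the assignments stop changing. It is not Spark's implementation: the function names and the 2-D points are invented for illustration, and the initial centers are passed in explicitly, whereas Spark picks them itself using the seed.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: List[Sequence[float]]) -> List[float]:
    return [sum(component) / len(points) for component in zip(*points)]


def kmeans(points: List[Sequence[float]],
           initial_centers: List[Sequence[float]],
           max_iter: int = 200) -> Tuple[List[int], List[List[float]]]:
    centers = [list(c) for c in initial_centers]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        new_labels = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for j in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = mean(members)
    return labels, centers


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
# Deliberately poor start: both initial centers sit in the same group.
labels, centers = kmeans(points, initial_centers=[points[0], points[1]])
```

Even with the poor starting centers, the loop separates the two groups after a couple of iterations.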

To see the cluster assigned to each document, we use the transform method:

# Make predictions
predictions = kmeans_model.transform(result)

Here we are. Below I show the cluster assigned to each document:

  • The sun is a star => 1
  • The earth is a planet => 0
  • The moon is a satellite’s earth => 0
  • The sun is yellow => 1
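
The cluster numbers themselves are arbitrary: what matters is which documents end up together. A small sketch for comparing two clusterings regardless of how the clusters are numbered (for example, the manual guess from earlier against the KMeans output) could look like this. `same_partition` is a hypothetical helper introduced here, and the manual labels are my encoding of "first and last together, the other two together".

```python
from typing import Dict, Hashable, Sequence


def same_partition(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> bool:
    """True if both labelings group the items identically, up to renaming the clusters."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b: Dict[Hashable, Hashable] = {}
    b_to_a: Dict[Hashable, Hashable] = {}
    for a, b in zip(labels_a, labels_b):
        # Each label in one clustering must map to exactly one label in the other.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


manual_guess = [0, 1, 1, 0]   # sun/star, earth, moon, sun/yellow
kmeans_output = [1, 0, 0, 1]  # the predictions listed above
matches = same_partition(manual_guess, kmeans_output)
```

Here `matches` is True: the predicted clusters agree with the manual grouping, only the numbering is swapped.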

To close the loop, let’s see the code for saving the predictions to a CSV file.

# Save Output
predictionsPath = "../data/predictions.txt"
predictions.select(["text", "prediction"]).write.csv(predictionsPath, mode="overwrite", header="true")
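
One thing to keep in mind: Spark writes the output as a directory (here named predictions.txt) that contains one or more part files, not as a single file. A minimal sketch for reading such a directory back in plain Python is shown below. `read_spark_csv_dir` is a hypothetical helper, and it assumes uncompressed part files named part-*.csv, each starting with the header row.

```python
import csv
import glob
import os
from typing import Dict, List


def read_spark_csv_dir(path: str) -> List[Dict[str, str]]:
    """Collect the rows of every part-*.csv file inside a Spark output directory."""
    rows: List[Dict[str, str]] = []
    for part in sorted(glob.glob(os.path.join(path, "part-*.csv"))):
        with open(part, newline="", encoding="utf-8") as handle:
            # Each part file carries its own header because header="true" was used.
            rows.extend(csv.DictReader(handle))
    return rows
```

A call such as `read_spark_csv_dir("../data/predictions.txt")` would then return one dictionary per document, with the keys "text" and "prediction" (both values as strings).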

Not bad for a few lines of code. As usual, the code is available here.

In the next posts, I will extend this simple approach with more clustering techniques (like LDA) and discuss its performance and correctness.

