Word embedding using Scala and Spark (part 1)

In this post, I will show a very simple way to calculate word embeddings with Spark MLlib in Scala, reusing the well-known Word2Vec technique.
Word2Vec is a group of techniques that take a large corpus of text and produce a vector for each word (typically with hundreds of dimensions). This prototype is formed by 3 modules:

  • Document repository
  • Input module
  • Word2Vec module

The repository could be a database or a web service, but in this case it is a simple text file, called data.dat, with one document per row. For this demo, the file contains just 4 documents (a snippet to create it follows the list):

  • The sun is a star
  • The earth is a planet
  • The moon is the earth’s satellite
  • The sun is yellow
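
If you want to follow along, the file can be created directly from Scala. This is just a convenience snippet, not part of the original prototype; it assumes the ../files path used by the reading module below:

import java.nio.file.{Files, Paths}

// Write the four demo documents to data.dat, one per row
val docs = Seq(
  "The sun is a star",
  "The earth is a planet",
  "The moon is the earth's satellite",
  "The sun is yellow")
Files.write(Paths.get("../files/data.dat"), docs.mkString("\n").getBytes("UTF-8"))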

Before describing the reading module, let’s look at the libraries used for this job.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Rebinding the session to a stable identifier lets us import
// its implicits, needed below for the toDF conversion
val spark2 = spark
import spark2.implicits._
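
These snippets assume the spark-shell, where spark and sc are already defined. If you run the code outside the shell, you can create them yourself; a minimal sketch, with an arbitrary application name and a local master:

val spark = SparkSession.builder().
     appName("word2vec-demo").
     master("local[*]").
     getOrCreate()
val sc = spark.sparkContext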

Beyond the typical Spark libraries, we will use some org.apache.spark.ml classes, such as Word2Vec and Word2VecModel, for calculating the embeddings. Now we can describe the reading module. It is very simple: in fact, it is a single statement that creates a DataFrame from the data.dat file:

val documents = sc.textFile("../files/data.dat").
            map(line => line.toLowerCase).  // normalize the case
            map(_.split(" ")).              // split each document into words
            map(Tuple1.apply).              // wrap each array in a tuple for toDF
            toDF("text")

In more detail, for each row we apply the toLowerCase transformation and then split the document into words. The result is a DataFrame with a column called text, where each row is the array containing the split document (an alternative tokenization with Spark ML’s Tokenizer is sketched after the list):

  • [the, sun, is, a, star]
  • [the, earth, is, a, planet]
  • [the, moon, is, the, earth’s, satellite]
  • [the, sun, is, yellow]
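
As a side note, the same tokenization can be obtained with Spark ML’s Tokenizer, which lowercases and splits on whitespace out of the box. This is just an alternative sketch, not the pipeline used in this post:

import org.apache.spark.ml.feature.Tokenizer

// spark.read.text loads the file into a single string column named "value"
val raw = spark.read.text("../files/data.dat")
val tokenizer = new Tokenizer().setInputCol("value").setOutputCol("text")
val documentsAlt = tokenizer.transform(raw).select("text")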

Next, we have to create the Word2Vec object. The code is the following:

val word2vec = new Word2Vec().
     setInputCol("text").       // column containing the tokenized documents
     setOutputCol("features").  // column that will contain the embeddings
     setVectorSize(100).        // dimensionality of the word vectors
     setMinCount(0).            // minimum token frequency for the vocabulary
     setMaxIter(100)            // maximum number of training iterations

First, we use the Word2Vec class to create an object that takes the documents DataFrame and produces the embeddings in the column “features”. For each word, the module will create a vector of 100 dimensions.
Moreover, we can set the minimum frequency a token needs in order to be included in the word2vec model’s vocabulary (setMinCount) and the maximum number of iterations (setMaxIter).
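
Word2Vec exposes a few more parameters (window size, step size, seed, and so on). To list them all with their current values, you can print the parameter documentation:

println(word2vec.explainParams())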
Now we can train the model with the loaded documents:

val modelW2V = word2vec.fit(documents)

Due to the limited number of documents, the training completes in a few seconds. To see the vector learned for the word sun, we can use the following:

modelW2V.getVectors.filter($"word" === "sun").show(false)

The vector with 100 elements for the word “sun” is:

[0.09446191042661667,0.1552284210920334,-0.09290853142738342,0.0445389524102211,0.06610991805791855,-0.02337893657386303,0.11622816324234009,-0.08182769268751144,0.06403583288192749,0.0231629628688097,-0.01798628270626068,-0.06546609103679657,-0.032661933451890945,-0.01344525907188654,-0.02399328164756298,0.11137351393699646,-0.022922128438949585,0.030550602823495865,-0.07536254078149796,-0.05032337084412575,0.029553625732660294,-0.1476014405488968,-0.0011099856346845627,0.08598096668720245,0.004308049101382494,-0.05415727570652962,-0.0620625875890255,-0.09463722258806229,0.08285677433013916,0.07196198403835297,-0.01583552174270153,-0.09427937865257263,-0.030169321224093437,0.01813172921538353,-6.091261748224497E-4,0.1360599845647812,-0.043689873069524765,0.08709631860256195,-0.11659324914216995,0.03771822899580002,-0.03156512603163719,-0.048343855887651443,-0.14942805469036102,-0.18071718513965607,-0.03739666938781738,-0.01532907783985138,0.05065494030714035,0.023628517985343933,0.01560444850474596,0.10551293194293976,0.062271784991025925,0.05541074648499489,0.1813959926366806,-0.06118335574865341,-0.0463416650891304,-0.08667901903390884,-0.05646215006709099,-0.001103816437534988,0.06259763240814209,-0.01996113732457161,-0.1205950379371643,-0.004463272634893656,0.02382899634540081,-0.012207966297864914,0.10697317123413086,0.019047774374485016,0.03311692550778389,0.013069088570773602,0.04143095389008522,-0.03262871876358986,-0.0456211194396019,0.05934283137321472,0.12997221946716309,-0.017169641330838203,-0.04913315549492836,-0.010379312559962273,0.05310118943452835,-0.13470590114593506,0.011609711684286594,0.0948350802063942,0.10490880906581879,0.03674982860684395,0.09822071343660355,0.014429733157157898,-0.03731803596019745,-0.13551275432109833,0.08128348737955093,0.006473873741924763,-0.03733252361416817,-0.09077411144971848,0.016692079603672028,-0.03317994624376297,-0.010351890698075294,-0.052576109766960144,0.18440976738929749,0.017175475135445595,0.006681717000901699,-0.046664170920848846,0.0673007071018219,-0.0704694539308548]
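
Besides inspecting the raw vectors, the Word2VecModel class can also return the words closest to a given one. For example, this quick check (not part of the original prototype) prints the 2 nearest neighbors of “sun” with their cosine similarity:

modelW2V.findSynonyms("sun", 2).show()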


The next step is the creation of the embeddings for every document. This is the first problem: how can we combine the word vectors of a document? Without losing generality, we can think of the vector for a document as a combination (max, min, average) of the embeddings of its words. For this, Spark ML helps us with the transform method.
In fact, the transform method, which belongs to the Word2VecModel class, uses the average function to create the vector for a document.

// Creation of embeddings for documents
val parsedData = modelW2V.transform(documents)
parsedData.show(false)
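
To convince yourself that transform really averages the word vectors, you can recompute the vector of the first document by hand. A minimal sketch, relying on the getVectors schema (a word column followed by a vector column):

// Collect the learned embeddings into a Map from word to Array[Double]
val vocab = modelW2V.getVectors.collect().
     map(r => (r.getString(0), r.getAs[org.apache.spark.ml.linalg.Vector](1).toArray)).
     toMap

// Average, component by component, the vectors of the words of the first document
val words = documents.first().getSeq[String](0)
val manualAvg = words.map(vocab).transpose.map(xs => xs.sum / words.size)

Up to floating-point rounding, manualAvg should match the features column of parsedData for that document.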

With fewer than 20 lines of code, we have a vector for each word and for each document.

For now, that’s all…