Simple example with pipe in Spark

Hi everyone,

in the last few days, I read a couple of blog posts about the pipe transformation in Spark. It is a very useful technique for processing RDD data with an external application. In fact, there is a lot of legacy software out there that is perfect for solving a specific problem and that you can easily integrate into your Spark application this way.

According to the official Spark documentation, pipe has the following syntax:

pipe(command, [envVars]) Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process’s stdin and lines output to its stdout are returned as an RDD of strings.
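Conceptually, what pipe does for each partition can be sketched in plain Python with the subprocess module. This is only an illustration of the mechanism (not Spark's actual implementation): the partition's elements are written to the external process's stdin, and its stdout lines become the new partition.

```python
import subprocess

def pipe_partition(lines, command):
    # Mimic RDD.pipe for a single partition: write each element to the
    # external process's stdin, one per line, and return its stdout
    # lines as a list of strings.
    proc = subprocess.run(
        command,
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        shell=True,
    )
    return proc.stdout.splitlines()

# Pipe a "partition" through the standard `cat` command (identity):
print(pipe_partition(["1", "2", "3"], "cat"))  # → ['1', '2', '3']
```

Spark simply applies this idea to every partition in parallel, launching one external process per partition.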

As an example, I will call Java code from a PySpark script. Let’s start with the Java code, Square.java. It is a simple program that reads integers from stdin, calculates their squares and prints them to stdout.

import java.util.Scanner;

public class Square {

    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);
        // Read integers from stdin until the input ends
        while (scanner.hasNextInt()) {
            int i = scanner.nextInt();
            // Print the square of the input
            System.out.println(i * i);
        }
        scanner.close();
    }
}

Now let’s see how to compile and run the Java code:

javac Square.java
java Square

The following is a sample run (typed input alternates with the printed squares):

$ java Square
2
4
3
9

Now let’s create the PySpark script, sparkScript.py:

from pyspark import SparkContext

sc = SparkContext()
data = sc.parallelize(["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"])
# Print the original RDD
print("Output : " + str(data.collect()))
# Pipe each partition through the external Java program
command = "java Square"
pipeRDD = data.pipe(command)
# Print the final output
print("Output : " + str(pipeRDD.collect()))

The PySpark script simply creates an RDD with the numbers from 1 to 10, pipes the RDD through the Java program and prints the output. Execute the script with the following command:

spark-submit sparkScript.py | grep "Output :"

you can see the output:

Output : ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
Output : ['1', '4', '9', '16', '25', '36', '49', '64', '81', '100']
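As a side note, pipe also accepts environment variables for the external process (in PySpark the parameter is named env). To illustrate the mechanism outside of Spark, here is a plain-Python sketch of the same per-partition behaviour; the FACTOR variable and the small sh loop are just hypothetical examples for this illustration:

```python
import os
import subprocess

def pipe_partition_with_env(lines, command, extra_env):
    # Like RDD.pipe(command, env=...): the external process sees the
    # extra environment variables on top of the current environment.
    env = dict(os.environ, **extra_env)
    proc = subprocess.run(
        command,
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        shell=True,
        env=env,
    )
    return proc.stdout.splitlines()

# A small sh loop that multiplies each input line by $FACTOR:
cmd = 'while read x; do echo $((x * FACTOR)); done'
print(pipe_partition_with_env(["1", "2", "3"], cmd, {"FACTOR": "3"}))  # → ['3', '6', '9']
```

In the real Spark script above, the equivalent call would be data.pipe(command, env={"FACTOR": "3"}), with the external program reading FACTOR from its environment.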

As usual, this example is available on GitHub.

That is all for this post. Another post with examples of Spark’s pipe is available here.

Sharing is caring!
