From Pandas to Spark

A work-in-progress cheat sheet for translating common Pandas operations to PySpark…

Operation-by-operation translation:

Read a file
  Pandas:  df = pd.read_csv(file, sep=';', encoding='utf-8')
  PySpark: df = spark.read.format('csv').options(header='true', inferSchema='true', delimiter=';').load(file)

Group by (e.g. sum field2 per field1)
  Pandas:  group_by = df.groupby('field1')['field2'].sum()
  PySpark: group_by = df.groupBy('field1').agg({'field2': 'sum'})

Unique values of a column
  Pandas:  unique = df['field1'].nunique()
  PySpark: unique = df.select('field1').distinct().count()

Head
  Pandas:  df.head(number_rows)
  PySpark: df.show(number_rows)  (df.head(number_rows) works too, but its output is ugly)

Tail
  Pandas:  df.tail(number_rows)
  PySpark: N.A. before Spark 3.0; df.tail(number_rows) since then

Inferring types
  Pandas:  df.dtypes
  PySpark: df.printSchema()

Describe data
  Pandas:  df.describe()
  PySpark: df.describe()

Add a column
  Pandas:  df['new_column_name'] = value/formula
  PySpark: df_new = df.withColumn('new_column_name', value/formula)

Write a CSV
  Pandas:  df.to_csv(filename, sep=';', encoding='utf-8')
  PySpark: df.write.format('com.databricks.spark.csv').option('header', 'true').save(filename)  (on Spark 2+, the built-in format('csv') does the same)
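As a quick sanity check of the Pandas column above, here is a minimal runnable sketch; the field names and toy CSV data are made up for illustration, and the PySpark equivalents are shown as comments:

```python
import io

import pandas as pd

# Toy CSV standing in for `file` in the table above (made-up data).
csv_data = "field1;field2\na;1\na;2\nb;3\n"

# Read file -- PySpark: spark.read.format('csv').options(header='true', inferSchema='true', delimiter=';').load(file)
df = pd.read_csv(io.StringIO(csv_data), sep=';')

# Group by -- PySpark: df.groupBy('field1').agg({'field2': 'sum'})
group_by = df.groupby('field1')['field2'].sum()
print(group_by['a'])          # 3

# Unique values of a column -- PySpark: df.select('field1').distinct().count()
unique = df['field1'].nunique()
print(unique)                 # 2

# Add a column -- PySpark: df.withColumn('field3', df['field2'] * 2)
df['field3'] = df['field2'] * 2
print(df['field3'].tolist())  # [2, 4, 6]
```

Nothing here is Spark-specific; the point is that each line has a one-line counterpart on a Spark DataFrame.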

Convert Spark DataFrame to Pandas

pandas_df = spark_df.toPandas()

Create a Spark DataFrame from Pandas

spark_df = spark.createDataFrame(pandas_df)  (where spark is your SparkSession; on old versions, a SQLContext)
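Both conversions materialise the whole dataset on the driver, so they only make sense for data that fits in one machine's memory. On Spark 2.3+ they can be sped up considerably by enabling Apache Arrow; a config sketch, assuming spark is an active SparkSession:

```python
# Optional: use Apache Arrow for faster Pandas <-> Spark conversion.
# Spark 2.3-2.4:
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
# Spark 3.x (the old key is deprecated in favour of this one):
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
```

If Arrow is not installed, Spark silently falls back to the slower row-by-row conversion.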
