Category: Big Data

From Pandas to Spark

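Work in progress for translating Pandas to Spark.

| Operation | Pandas | PySpark |
| --- | --- | --- |
| Read file | df = pd.read_csv(file, sep=';', encoding='latin-1') | df = spark.read.format('csv').options(header='true', inferSchema='true', delimiter=';').load(file) |
| Group By | group_by = df.groupby('field1')['field2'] | group_by = df.groupBy('field1').agg({'field2': 'sum'}) |
| Unique values of column | unique = df['field1'].nunique() | unique = df.select('field1').distinct().count() |
| Head | df.head(number_rows) | df.head(number_rows) returns a list of Row objects (ugly); df.show(number_rows) prints a formatted table |

Note on Group By: PySpark's agg() needs an explicit aggregation, so 'sum' is used here only as an example. The Pandas expression returns a grouped object that still needs an aggregation such as .sum() chained onto it.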