Pandas Optimization

Nowadays, Pandas provides high-performance, easy-to-use data structures for managing data in Python. Moreover, it is open source and released under the BSD license.
When you analyze a small dataset (around 100 MB), it is difficult to run into performance issues, but with a dataset of some GB you can.
Fortunately, Pandas offers several different techniques for optimizing the memory occupation of a DataFrame.
In this post, which is based on this one, I will show three different techniques for optimizing a DataFrame:

  1. Downcasting numeric columns
  2. Categorical columns
  3. Optimized reading

For the experiments, I will use a 1.5 MB file (downloadable from this link) from the GDELT project.
But first, what is GDELT?

> The GDELT Project is an initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world, connecting every person, organization, location, count, theme, news source, and event across the planet into a single massive network that captures what's happening around the world, what its context is and who's involved, and how the world is feeling about it, every single day.

GDELT is an acronym for Global Database of Events, Language, and Tone. A single year of the dataset in CSV format is over 2.5 TB. I do not want to analyze 2.5 TB on a single machine (I would prefer a cluster with Spark), so I will use just a single file. The file describes the number of events for each country in a month.

## Evaluating the situation

First of all, let’s load the data from the remote URL.

import io
import requests
import pandas as pd

# Load Data
url = 'http://data.gdeltproject.org/normfiles/monthly_country.csv'
urlData = requests.get(url).content
df = pd.read_csv(io.StringIO(urlData.decode('utf-8')), names=["Data", "Country", "Value"])

print(df.head())

The code is quite simple: first I import io and requests for the remote load, and then I use pandas to create the DataFrame.
As described below, I will load the GDELT file called monthly_country.csv. This file is formed by three fields:

1. Data: the year/month of the aggregation (e.g. 197901 for January 1979)
2. Country: the two-letter code of a country (IT for Italy, FR for France, ...)
3. Value: the number of events [to verify]

The last statement shows the first lines of the DataFrame.

     Data Country  Value
0  197901     NaN   1876
1  197901      AE     14
2  197901      AF    121
3  197901      AG     46
4  197901      AJ      3

Nothing difficult so far. Let’s start doing something nice.
First, let’s see the DataFrame info:

df.info(memory_usage='deep')

The output is the following:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105375 entries, 0 to 105374
Data columns (total 3 columns):
Data       105375 non-null int64
Country    104904 non-null object
Value      105375 non-null int64
dtypes: int64(2), object(1)
memory usage: 7.5 MB

OK, the occupation is 7.5 MB for only 105375 rows. Now imagine analyzing 10^9 rows… Impossible.
Let’s see a very simple technique: the downcast. As shown above, the columns Data and Value are described as int64. The question is whether int64 is really needed.
Let’s describe the data:

print(df.describe())
                Data         Value
count  105375.000000  1.053750e+05
mean   199912.720019  4.808708e+03
std      1128.853207  4.496002e+04
min    197901.000000  1.000000e+00
25%    198911.000000  4.600000e+01
50%    200001.000000  3.270000e+02
75%    200906.000000  1.703000e+03
max    201803.000000  2.103586e+06

As shown above, the field Value has a max of 2.103586e+06, which easily fits in a 32-bit unsigned integer (max 4,294,967,295).

print("Downcast to Int")
df_int = df.copy()
df_int['Value'] = pd.to_numeric(df_int.Value, downcast='unsigned')
print("New Memory Usage")
df_int.info(memory_usage='deep')

With this first improvement, we go from 7.5 MB to 7.1 MB, unfortunately a very small improvement:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105375 entries, 0 to 105374
Data columns (total 3 columns):
Data       105375 non-null int64
Country    104904 non-null object
Value      105375 non-null int32
dtypes: int32(1), int64(1), object(1)
memory usage: 7.1 MB

The function pd.to_numeric has other conversion options, available here.
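
By the way, nothing stops us from downcasting all the integer columns in one shot. Here is a minimal sketch (the df_small and int_cols names are mine, not part of the original code):

# Minimal sketch: downcast every int64 column at once
df_small = df.copy()
int_cols = df_small.select_dtypes(include=['int64']).columns
df_small[int_cols] = df_small[int_cols].apply(pd.to_numeric, downcast='unsigned')
df_small.info(memory_usage='deep')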

OK, interesting. But, as is easy to see, the Country column contains two-letter country codes. Pandas stores each one as a full Python string object, which costs far more than 2 bytes, but if we see the codes as simple constants, we could imagine a table like the following:

ID Country
1 NaN
2 AE
3 AF
4 AG
5 AJ

So if we swap each country label with its ID, a small integer per row is enough, since the number of distinct countries is far less than 255. This is, in short, the approach of Categoricals. In fact, the category type uses integer codes to represent the values in a column, together with a lookup table of the distinct labels. It was introduced in Pandas 0.15.
Let’s see the code:

df_cat = df.copy()
print("Create Category")
df_cat['Country'] = df_cat.Country.astype('category')
df_cat.info(memory_usage='deep')

In this case, the memory occupancy is about 1.8 MB of RAM (around 25% of the original).

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105375 entries, 0 to 105374
Data columns (total 3 columns):
Data       105375 non-null int64
Country    104904 non-null category
Value      105375 non-null int64
dtypes: category(1), int64(2)
memory usage: 1.8 MB

As you can imagine, with the category type we cannot do arbitrary operations on the values (there are several gotchas), so before using it for real, read the official [documentation](http://pandas.pydata.org/pandas-docs/stable/categorical.html#gotchas).
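
To peek under the hood, we can look at the integer codes and the lookup table that back the column; this is a quick check of mine, not part of the original snippet:

# Integer codes behind the category column (-1 marks NaN)
print(df_cat.Country.cat.codes.head())
# The lookup table of distinct labels
print(df_cat.Country.cat.categories[:5])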

Until now we have optimized an already-loaded DataFrame, but can we read the file directly in an optimized version? The answer is yes, but we have to pass a dictionary with the description of the types.

column_types = {"Data": "category", "Country": "category", "Value": "int32"}
read_and_optimized = pd.read_csv(io.StringIO(urlData.decode('utf-8')), names=["Data", "Country", "Value"], dtype=column_types)
print(read_and_optimized.info(memory_usage='deep'))

In this last example, we have created a dictionary called column_types and passed it to the pd.read_csv function. In the dictionary, we impose a type for each column field:

Field Type
Data category
Country category
Value int32

Let’s see the memory occupancy:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105375 entries, 0 to 105374
Data columns (total 3 columns):
Data       105375 non-null category
Country    104904 non-null category
Value      105375 non-null int32
dtypes: category(2), int32(1)
memory usage: 877.6 KB

Just 877.6 KB, around 12% of the original. Fantastic!
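
If you want to quantify the saving programmatically, here is a tiny helper of mine (not part of the original post) that compares the two DataFrames:

def mem_mb(frame):
    # Deep memory usage of a DataFrame, in megabytes
    return frame.memory_usage(deep=True).sum() / 1024 ** 2

print(f"original:  {mem_mb(df):.1f} MB")
print(f"optimized: {mem_mb(read_and_optimized):.2f} MB")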
I think that this is all. As usual, the presented code is available on GitHub.

Sharing is caring!
