July 1st, 2021
Several Python DataFrame libraries are benchmarked on CASFS on AWS. Statistics such as runtime, memory usage, and cost are compared across varying dataset sizes to help users determine which library best fits their specific use case.
When most aspiring data professionals begin learning data analysis in Python, the first libraries they typically encounter are NumPy and Pandas, which together form the foundation of modern data processing. While Pandas is, and will remain, the standard for processing tabular data in Python, its biggest drawback is that processing time grows linearly with dataset size. Whereas other libraries make use of multithreading or clusters of machines, Pandas is designed to use only a single core, so adding more processing cores does not improve its performance.
In this paper, we compare alternatives to Pandas and make recommendations on how to improve code performance with increasing data sizes on CASFS. The libraries we compare are Pandas, Polars, Dask, Ray, and PySpark. For libraries that can deploy code to a cluster, we compare both running the code on a single machine and running it on a cluster.
The libraries were tested using daily financial datasets at three sizes: 1, 10, and 100 daily files.
We processed the data on an r5.24xlarge machine with 96 cores and 768 GB of RAM; this machine averages $1.00 per hour. For the code processed on a cluster, we used 10 r5.2xlarge machines with a total of 80 cores and 640 GB of RAM; this cluster averages $0.80 per hour. We deliberately gave the cluster fewer total resources to demonstrate that some of these libraries actually perform better on a cluster than on a single machine, even when the single machine has more cores. Because of this, it may be more cost-effective to use a cluster rather than a single machine.
To process the data, we used the same generalized algorithm for every library: read the daily files, process them, concatenate the results, and write the combined output to Parquet.
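As a rough illustration, a single-machine Pandas version of this pipeline might look like the sketch below. The file glob, column names, and aggregation are hypothetical placeholders, not the exact transformations we ran on the financial data.

```python
import glob
import pandas as pd

def process(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for the processing step; the real transformations
    # are specific to the daily financial datasets.
    return df.groupby("symbol", as_index=False)["price"].mean()

# Read all the daily files into one DataFrame, process it, and write to Parquet.
raw = pd.concat(
    (pd.read_csv(path) for path in sorted(glob.glob("data/*.csv"))),
    ignore_index=True,
)
process(raw).to_parquet("processed.parquet")
```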
For each library, we tried to keep the code as similar as possible to the Pandas version. Polars and Dask have a similar syntax to Pandas and require few code changes. PySpark's API differs from Pandas, so most of the PySpark code was rewritten, though it still follows the same algorithm. To implement Ray, we did not have to change the syntax of the Pandas code, but we did have to determine where in the algorithm to parallelize. In the algorithm used by Pandas, Polars, Dask, and PySpark, we read in all the data at once and let the library decide how to best use resources. In the algorithm used by Ray, we determined the best split was to read in and process files individually in parallel, and then concatenate them before writing to Parquet.
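For illustration, the Ray split described above might look roughly like the following sketch. Again, the file glob and the groupby step are placeholders, and the call to ray.init() assumes a Ray runtime is available.

```python
import glob
import pandas as pd
import ray

ray.init()  # starts a local Ray runtime (use address="auto" to join an existing cluster)

@ray.remote
def read_and_process(path: str) -> pd.DataFrame:
    # Each task reads and processes a single daily file in parallel.
    df = pd.read_csv(path)
    return df.groupby("symbol", as_index=False)["price"].mean()

# Launch one task per file, gather the results, then concatenate and write once.
futures = [read_and_process.remote(path) for path in sorted(glob.glob("data/*.csv"))]
result = pd.concat(ray.get(futures), ignore_index=True)
result.to_parquet("processed.parquet")
```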
Our testing results are presented in the table and graphs on the subsequent pages. The x and y axes use a base-10 logarithmic scale to show which libraries' runtimes scale linearly as the dataset size increases. Memory usage was not recorded for cluster processing because it was difficult to accurately aggregate total memory usage across the machines of the cluster. Overall, PySpark and Ray were the most performant and memory efficient as the data size increased.
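We do not reproduce the full measurement harness here, but for a single-machine run, wall-clock time and peak resident memory can be captured with the standard library alone. The helper below is a sketch of that kind of instrumentation, not the exact code used for these benchmarks.

```python
import resource
import time

def measure(fn):
    """Run fn() and return (elapsed seconds, peak RSS in MiB) for this process."""
    start = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - start
    # On Linux, ru_maxrss is reported in KiB; divide by 1024 to get MiB.
    peak_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    return elapsed, peak_mib
```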
| Library | N Files | Time (seconds) | Memory (MiB) | Cost Per Hour (dollars) | Total Cost (dollars) |
| --- | --- | --- | --- | --- | --- |
| pandas | 1 | 7.89 | 1255.12 | 1.00 | 0.0022 |
| pandas | 10 | 76.00 | 11452.78 | 1.00 | 0.0211 |
| pandas | 100 | 797.00 | 130078.20 | 1.00 | 0.2214 |
| polars | 1 | 3.09 | 2712.89 | 1.00 | 0.0009 |
| polars | 10 | 30.30 | 15161.97 | 1.00 | 0.0084 |
| polars | 100 | 318.00 | 106192.75 | 1.00 | 0.0883 |
| dask standalone | 1 | 7.09 | 1642.15 | 1.00 | 0.0020 |
| dask standalone | 10 | 34.40 | 11501.16 | 1.00 | 0.0096 |
| dask standalone | 100 | 370.00 | 75321.80 | 1.00 | 0.1028 |
| spark standalone | 1 | 19.50 | 62.62 | 1.00 | 0.0054 |
| spark standalone | 10 | 37.50 | 63.91 | 1.00 | 0.0104 |
| spark standalone | 100 | 86.00 | 63.93 | 1.00 | 0.0239 |
| ray standalone | 1 | 6.73 | 40.05 | 1.00 | 0.0019 |
| ray standalone | 10 | 7.53 | 144.20 | 1.00 | 0.0021 |
| ray standalone | 100 | 26.30 | 917.56 | 1.00 | 0.0073 |
| dask cluster | 1 | 8.74 | n/a | 0.90 | 0.0022 |
| dask cluster | 10 | 26.88 | n/a | 0.90 | 0.0067 |
| dask cluster | 100 | 241.50 | n/a | 0.90 | 0.0604 |
| spark cluster | 1 | 22.78 | n/a | 0.90 | 0.0057 |
| spark cluster | 10 | 22.84 | n/a | 0.90 | 0.0057 |
| spark cluster | 100 | 53.65 | n/a | 0.90 | 0.0134 |
| ray cluster | 1 | 8.10 | n/a | 0.90 | 0.0020 |
| ray cluster | 10 | 21.32 | n/a | 0.90 | 0.0053 |
| ray cluster | 100 | 159.81 | n/a | 0.90 | 0.0400 |
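The Total Cost column is consistent with simply prorating the hourly rate by runtime: Total Cost = (Time in seconds / 3600) × Cost Per Hour. For example, the single-file Pandas run works out to 7.89 / 3600 × $1.00 ≈ $0.0022.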
After completing this research, we provide the following recommendations:
File size:
The performance of Pandas workloads in Python can be improved by switching to other DataFrame libraries, such as Polars, Dask, Ray, and PySpark. Dask, Ray, and PySpark can be used either on a single machine or on a cluster. On CASFS, when users add worker nodes to their Analytics cluster, CASFS automatically joins those nodes to both a Dask cluster and a Ray cluster. Currently, we provide a script to launch Spark clusters on Analytics, but in the future, Analytics clusters will automatically connect worker nodes to a Spark cluster as well.
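As a rough sketch of what attaching to those clusters looks like from a notebook or script, the snippet below connects a Dask client and a Ray session to already-running clusters. The scheduler address is a placeholder, and the exact endpoints on a CASFS Analytics cluster may differ.

```python
from dask.distributed import Client
import ray

# Attach to an existing Dask scheduler (the address here is a placeholder).
dask_client = Client("tcp://scheduler-host:8786")

# Connect to the already-running Ray cluster on the same nodes.
ray.init(address="auto")
```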