Increasing Upload Speed With CASFS

100x Faster Database Inserts by Code Willing

Code Willing has determined the most efficient way to upload large amounts of data into a database hosted on their custom-built file system.

Following extensive research, Code Willing greatly decreased the time it takes to insert massive amounts of data into a TimescaleDB SQL database hosted on their file system, CASFS.

During their experiment, Code Willing used two different types of uploading methods: upserting data using Pangres and copying the data by using Timescale-parallel-copy.

The data being used in this research was time-sequential financial data daily files, with each of them having 2,229,120 rows and 19 columns.

The control variables during the course of the investigation were the size of the financial data files remaining identical for both uploading tests and using Pandas to process the data before inserting the data into the database.

While using the Pangres method, it allowed for easy insertion from Pandas dataframes into PostgreSQL databases using their upsert method. The upsert method takes advantage of Postgres’ insert function and handles issues like duplicates, adding missing columns, and datatypes.

By using the Pangres method, the upload time for inserting the data was 14.23 minutes (853.8 seconds).

 

Upload Origin Cores Ram Workers Upload Method Parallel Processing Server_data relationship Time (minutes)
Python Notebook 8 64 1 Pangres No none 14.23

Following this result, Code Willing wanted to find a quicker and more efficient way to upload their data. That is where the second uploading method came in.

Timescale-parallel-copy is, “a command-line program for parallelizing PostgreSQL’s built-in `copy` functionality for bulk inserting data into TimescaleDB.”

Code Willing performed three different ways to insert their data using Timescale-parallel-copy.

The first test was where the server-data relationship was local. The result led to an upload time of 1 minute (60 seconds).

 

Upload Origin Cores Ram Workers Upload Method Parallel Processing Server_data relationship Time (minutes)
Command Line 8 64 1 Timescale-parallel-copy No local 1.00

 

For their second trial, they considered using parallel processing and increasing workers by 1 to decrease the upload time further.

This resulted in an upload time of .35 minutes (21 seconds). 

 

Upload Origin Cores Ram Workers Upload Method Parallel Processing Server_data relationship Time (minutes)
Command Line 8 64 2 Timescale-parallel-copy Yes local 0.35 

 

During the third trial, Code Willing set up a TimescaleDB on a separate server and ran it continuously. This is described in the variable “Server_data relationship” as “not local.”

With this change in place, everything else remained the same, including parallel processing, except for changing the number of workers. After testing 1, 2, 4, 6, and 8 workers, they determined 8 workers were the most efficient. The upload time with 8 workers showed .13 minutes (8 seconds).

 

Upload Origin Cores Ram Workers Upload Method Parallel Processing Server_data relationship Time (minutes)
Command Line 8 64 8 Timescale-parallel-copy Yes Not local 0.13 

 

Through this research, Code Willing improved Pangres’ upsert method 99.1%, going from 14.23 minutes to 0.13 minutes, by using TimescaleDB’s parallel copy functionality and implementing Ray’s parallel processing.

To break it down even further, this new method allowed them to copy 278,640 rows per second.

In the end, Code Willing’s experiment determined when uploading copious amounts of data, the most efficient and quickest way is by using the built-in “copy” function (like Timescale-parallel-copy), as well as parallel processing with large amounts of workers to decrease the upload time of time-sequential data into a Postgres database (TimescaleDB).

 

Leave a Reply

Your email address will not be published. Required fields are marked *

6 + 3 =