Quantbot CEO & Co-Founder Talks Evolution of Data Science

Speaking at the Neudata Summer Online Summit 2021, Quantbot Technologies CEO and co-founder, Paul White, presented “When Data is Everything,” where he talked about how far we have come in the world of data science.

During the presentation, White discussed how Quantbot manages its ever-growing data science demands using tools from Code Willing and Options IT for smart task routing across private and public clouds.

One of the tools they are currently using is CASFS+, a cloud-based file system and cloud resource manager built by Code Willing. 

In particular, White dove into costs and how working with CASFS+ has given Quantbot the ability to control per-user budgets.

In data science work, the amount of stored data grows continuously. While having as much information as possible is desirable, it also makes it harder to research the data while keeping costs to a minimum.

When using the CASFS+ program, Quantbot can set budget controls for every user, ensuring no one exceeds their allotted spend. Ultimately, this lets the company rest easy knowing that the cost it forecast will indeed be the actual cost in the end.

Currently, CASFS+ is available as a fully managed option directly through Code Willing. For those working with a tighter budget, Code Willing will soon provide CASFS+ as a self-service option through the AWS Marketplace.

Click here to review the entire presentation of “When Data is Everything.”

Increasing Upload Speed With CASFS

100x Faster Database Inserts by Code Willing

Code Willing has determined the most efficient way to upload large amounts of data into a database hosted on their custom-built file system.

Following extensive research, Code Willing greatly decreased the time it takes to insert massive amounts of data into a TimescaleDB SQL database hosted on their file system, CASFS.

During their experiment, Code Willing used two different upload methods: upserting the data with Pangres and bulk-copying it with timescaledb-parallel-copy.

The data used in this research consisted of daily files of time-sequential financial data, each with 2,229,120 rows and 19 columns.

Two variables were held constant throughout the investigation: the size of the financial data files, which was identical for both upload tests, and the use of Pandas to process the data before inserting it into the database.

The Pangres method allows easy insertion from Pandas DataFrames into PostgreSQL databases through its upsert method. The upsert method takes advantage of Postgres’ insert functionality and handles issues like duplicates, missing columns, and datatypes.
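The article doesn’t include code, but a minimal sketch of this approach might look like the following; the connection string, file name, table name, and index columns are hypothetical placeholders, and Pangres is assumed to resolve conflicts on the DataFrame’s named, unique index.

```python
import pandas as pd
from sqlalchemy import create_engine
from pangres import upsert

# Hypothetical connection string -- not from the original article.
engine = create_engine("postgresql://user:password@localhost:5432/marketdata")

# Load one daily file of time-sequential financial data with Pandas.
df = pd.read_csv("daily_prices.csv", parse_dates=["timestamp"])

# Pangres uses the (named, unique) index for conflict resolution.
df = df.set_index(["timestamp", "symbol"])

# Upsert: insert new rows, update existing ones on index collision.
upsert(con=engine, df=df, table_name="prices", if_row_exists="update")
```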

By using the Pangres method, the upload time for inserting the data was 14.23 minutes (853.8 seconds).

| Upload Origin | Cores | RAM (GB) | Workers | Upload Method | Parallel Processing | Server-data Relationship | Time (minutes) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Python Notebook | 8 | 64 | 1 | Pangres | No | none | 14.23 |

Following this result, Code Willing wanted to find a quicker and more efficient way to upload their data. That is where the second upload method came in.

timescaledb-parallel-copy is “a command-line program for parallelizing PostgreSQL’s built-in `copy` functionality for bulk inserting data into TimescaleDB.”
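As a rough illustration of how a run might be launched (wrapped here in Python’s subprocess; the connection settings, database, table, and file names are placeholders, not details from the article):

```python
import subprocess

# Hypothetical single-worker run of timescaledb-parallel-copy.
subprocess.run(
    [
        "timescaledb-parallel-copy",
        "--connection", "host=localhost user=postgres sslmode=disable",
        "--db-name", "marketdata",
        "--table", "prices",
        "--file", "daily_prices.csv",
        "--workers", "1",          # single worker, no parallelism
        "--copy-options", "CSV",   # parse the input as CSV
        "--skip-header",           # first line holds column names
    ],
    check=True,
)
```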

Code Willing tested three different configurations for inserting their data with timescaledb-parallel-copy.

In the first test, the server-data relationship was local: the database ran on the same machine as the data. The upload took 1 minute (60 seconds).

| Upload Origin | Cores | RAM (GB) | Workers | Upload Method | Parallel Processing | Server-data Relationship | Time (minutes) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Command Line | 8 | 64 | 1 | timescaledb-parallel-copy | No | local | 1.00 |

For their second trial, they enabled parallel processing and raised the worker count from 1 to 2 to decrease the upload time further.

This resulted in an upload time of 0.35 minutes (21 seconds).

| Upload Origin | Cores | RAM (GB) | Workers | Upload Method | Parallel Processing | Server-data Relationship | Time (minutes) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Command Line | 8 | 64 | 2 | timescaledb-parallel-copy | Yes | local | 0.35 |

During the third trial, Code Willing ran TimescaleDB continuously on a separate server; the “Server-data Relationship” variable records this as “not local.”

With this change in place, everything else remained the same, including parallel processing; only the number of workers varied. After testing 1, 2, 4, 6, and 8 workers, they determined 8 workers were the most efficient, with an upload time of 0.13 minutes (about 8 seconds).

| Upload Origin | Cores | RAM (GB) | Workers | Upload Method | Parallel Processing | Server-data Relationship | Time (minutes) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Command Line | 8 | 64 | 8 | timescaledb-parallel-copy | Yes | not local | 0.13 |
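A worker sweep like that can be scripted. The sketch below is hypothetical (same placeholder connection details as above); it times one upload per worker count and assumes the tool’s --truncate flag to clear the table between runs.

```python
import subprocess
import time

# Hypothetical sweep over worker counts, mirroring the trial described above.
for workers in [1, 2, 4, 6, 8]:
    start = time.perf_counter()
    subprocess.run(
        [
            "timescaledb-parallel-copy",
            "--connection", "host=db.example.com user=postgres sslmode=disable",
            "--db-name", "marketdata",
            "--table", "prices",
            "--file", "daily_prices.csv",
            "--workers", str(workers),
            "--copy-options", "CSV",
            "--skip-header",
            "--truncate",  # start each run from an empty table
        ],
        check=True,
    )
    elapsed = time.perf_counter() - start
    print(f"{workers} workers: {elapsed:.1f} s")
```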


Through this research, Code Willing improved on Pangres’ upsert method by 99.1%, going from 14.23 minutes to 0.13 minutes, by using TimescaleDB’s parallel-copy functionality and implementing Ray’s parallel processing.

To break it down even further, this new method allowed them to copy 278,640 rows per second (2,229,120 rows ÷ 8 seconds).
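Both headline figures follow directly from the measured times and the per-file row count:

```python
# Speedup and throughput implied by the reported measurements.
rows = 2_229_120        # rows per daily file
pangres_secs = 853.8    # 14.23 minutes with Pangres upsert
parallel_secs = 8.0     # 0.13 minutes with 8 parallel workers

print(f"time reduction: {(pangres_secs - parallel_secs) / pangres_secs:.1%}")  # 99.1%
print(f"speedup: {pangres_secs / parallel_secs:.0f}x")                         # ~107x
print(f"throughput: {rows / parallel_secs:,.0f} rows/s")                       # 278,640
```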

In the end, Code Willing’s experiment determined that, when uploading copious amounts of data, the quickest and most efficient approach is to use PostgreSQL’s built-in “copy” functionality (via a tool like timescaledb-parallel-copy) together with parallel processing and a large number of workers, sharply reducing the upload time of time-sequential data into a Postgres database (TimescaleDB).