Benchmarking EC2 Instances on CASFS+

September 24th, 2021

Abstract

Several EC2 instance types are benchmarked on CASFS+ on AWS. Statistics such as processing time, memory usage, and cost are compared across varying dataset sizes to help users determine which EC2 instance is best for their specific use case.

Introduction

In our previous benchmarking analysis, we compared the performance of various Python dataframe libraries on an r5.24xlarge (96 cores, 768 GB memory) machine on CASFS+. In that testing, we found that Pandas parallelized over several cores with Ray was the fastest approach, since the workload parallelizes trivially (each file can be processed independently).

With this paper, we run the same process on several EC2 instances of various types and sizes to make recommendations on which instance is best suited to the user's needs. First, we compared eight-core machines, as these are commonly used for daily work. Then we compared 96-core machines, which can be used for larger workloads.

EC2 Instance Description

All instance descriptions were taken from https://aws.amazon.com/ec2/instance-types/

General Purpose

General purpose instances provide a balance of compute, memory and networking resources, and can be used for a variety of diverse workloads. These instances are ideal for applications that use these resources in equal proportions, such as web servers and code repositories.


  • M4 instances provide a balance of compute, memory, and network resources, and are a good choice for many applications.
  • M5 instances are the latest generation of General Purpose Instances powered by Intel Xeon® Platinum 8175M processors. This family provides a balance of compute, memory, and network resources, and is a good choice for many applications.
  • M5a instances are the latest generation of General Purpose Instances powered by AMD EPYC 7000 series processors. M5a instances deliver up to 10% cost savings over comparable instance types. With M5ad instances, local NVMe-based SSDs are physically connected to the host server and provide block-level storage that is coupled to the lifetime of the instance.
  • M5n instances are ideal for workloads that require a balance of compute, memory, and networking resources, including web and application servers, small and mid-sized databases, cluster computing, gaming servers, and caching fleets. The higher-bandwidth M5n and M5dn instance variants are ideal for applications that can take advantage of improved network throughput and packet rate performance.
  • T3 instances are the next generation burstable general-purpose instance type that provides a baseline level of CPU performance with the ability to burst CPU usage at any time, for as long as required. T3 instances offer a balance of compute, memory, and network resources and are designed for applications with moderate CPU usage that experience temporary spikes in use.

Compute Optimized

Compute Optimized instances are ideal for compute-bound applications that benefit from high-performance processors. Instances belonging to this family are well suited for batch-processing workloads, media transcoding, high-performance web servers, high-performance computing (HPC), scientific modeling, dedicated gaming servers and ad server engines, machine learning inference, and other compute intensive applications.


  • C4 instances are optimized for compute-intensive workloads and deliver very cost-effective high performance at a low price per compute ratio.
  • C5 instances are optimized for compute-intensive workloads and deliver cost-effective high performance at a low price per compute ratio.
  • C5n instances are ideal for high-compute applications (including High-Performance Computing (HPC) workloads, data lakes, and network appliances such as firewalls and routers) that can take advantage of improved network throughput and packet rate performance. C5n instances offer up to 100 Gbps network bandwidth and increased memory over comparable C5 instances. C5n.18xlarge instances support Elastic Fabric Adapter (EFA), a network interface for Amazon EC2 instances that enables customers to run applications requiring high levels of inter-node communications, like High-Performance Computing (HPC) applications using the Message Passing Interface (MPI), at scale on AWS.

Memory Optimized

Memory-optimized instances are designed to deliver fast performance for workloads that process large data sets in memory.


  • R4 instances are optimized for memory-intensive applications and offer better price per GiB of RAM than R3 instances.
  • R5 instances deliver 5% more memory per vCPU than R4 instances, and the largest size provides 768 GiB of memory. In addition, R5 instances deliver a 10% price per GiB improvement and a ~20% increase in CPU performance over R4 instances.
  • R5a instances are the latest generation of Memory-Optimized instances ideal for memory-bound workloads and are powered by AMD EPYC 7000 series processors. R5a instances deliver up to 10% lower cost per GiB memory over comparable instances.
  • R5n instances are ideal for memory-bound workloads including high-performance databases, distributed web-scale in-memory caches, mid-sized in-memory databases, real-time big data analytics, and other enterprise applications. The higher-bandwidth R5n and R5dn instance variants are ideal for applications that can take advantage of improved network throughput and packet rate performance.
  • X1e instances are optimized for high-performance databases, in-memory databases, and other memory intensive enterprise applications. X1e instances offer one of the lowest price per GiB of RAM among Amazon EC2 instance types.

Suffix Descriptions


  • a — AMD machine; all other machines are Intel
  • n, dn — Network optimized
  • e — Database optimized
  • metal — bare-metal instances; provide 96 logical processors on 48 physical cores and run on single servers with two physical Intel sockets

Accelerated Computing

Accelerated computing instances use hardware accelerators, or co-processors, to perform functions such as floating-point number calculations, graphics processing, or data pattern matching more efficiently than is possible in software running on CPUs.


  • G4dn instances are designed to help accelerate machine learning inference and graphics-intensive workloads.
  • P3 instances deliver high-performance compute in the cloud with up to eight NVIDIA® V100 Tensor Core GPUs and up to 100 Gbps of networking throughput for machine learning and HPC applications.

Data Description

The process was tested using daily financial datasets. The dataset sizes are as follows:


  • One file - 2,524,365 rows x 20 columns
  • 10 files - 23,746,635 rows x 20 columns
  • 100 files - 241,313,625 rows x 20 columns
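A rough sense of whether these datasets fit in memory can be had from the row and column counts. Assuming 8-byte numeric columns (an assumption, since the actual dtypes are not listed), the in-memory footprints are roughly:

```python
def approx_dataframe_bytes(rows, cols, bytes_per_cell=8):
    # Rough footprint assuming 8-byte (float64/int64) cells;
    # string/object columns would take more.
    return rows * cols * bytes_per_cell

for n_files, rows in [(1, 2_524_365), (10, 23_746_635), (100, 241_313_625)]:
    gib = approx_dataframe_bytes(rows, 20) / 2**30
    print(f"{n_files:>3} file(s): ~{gib:.1f} GiB")
# → roughly 0.4, 3.5, and 36 GiB respectively
```

Even the largest case fits comfortably in the RAM of the instances benchmarked here, which is consistent with the conclusion that the C instances' smaller memory is not a bottleneck for this workload.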

Process Description

To process the data, we used this generalized algorithm, with each file being parallelized out to a single core:


  1. Read in the data from csv files
  2. Query the data
  3. Group by three columns and aggregate two separate columns
  4. Calculate a new column using a vectorized operation
  5. Write processed data to parquet
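The steps above can be sketched as a per-file Pandas function. The column names (`date`, `ticker`, `exchange`, `volume`, `price`) and the filter predicate are placeholders, since the actual schema is not given:

```python
import pandas as pd

def process_file(csv_path, out_path=None):
    # 1. Read in the data from a csv file
    df = pd.read_csv(csv_path)
    # 2. Query the data (the predicate here is illustrative)
    df = df.query("volume > 0")
    # 3. Group by three columns and aggregate two separate columns
    agg = (df.groupby(["date", "ticker", "exchange"], as_index=False)
             .agg(total_volume=("volume", "sum"),
                  mean_price=("price", "mean")))
    # 4. Calculate a new column using a vectorized operation
    agg["notional"] = agg["total_volume"] * agg["mean_price"]
    # 5. Write processed data to parquet (requires pyarrow or fastparquet)
    if out_path is not None:
        agg.to_parquet(out_path)
    return agg

# In the benchmark, each file was parallelized out to its own core with Ray:
#   futures = [ray.remote(process_file).remote(p) for p in csv_paths]
#   results = ray.get(futures)
```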

This process does not use GPUs or databases, so it did not benefit from the advantages of the P3, G4dn, or X1e machines. Further testing would be needed to analyze the benefits of those machines.

Results

First, we tested the process on a single file across the eight-core machines to get a baseline of which performed best relative to cost for this particular process.
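The Total Cost column in each table is simply the per-run time converted to hours, multiplied by the hourly rate; a minimal sketch of that conversion:

```python
def total_cost(mean_time_s, hourly_rate):
    # Seconds -> hours, times the instance's $/hr rate.
    return mean_time_s / 3600 * hourly_rate

# e.g. c5.2xlarge: 6.38 s at $0.09/hr
print(round(total_cost(6.38, 0.09), 5))  # → 0.00016
```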


Table 1

| EC2 Instance | RAM (GiB) | GPU | Mean Time (s) | STD Time (s) | Cost ($/hr) | Total Cost ($) |
|--------------|-----------|-----|---------------|--------------|-------------|----------------|
| c5.2xlarge   | 16        | 0   | 6.38          | 0.014        | 0.09        | 0.00016        |
| c5n.2xlarge  | 21        | 0   | 6.69          | 0.074        | 0.08        | 0.00015        |
| g4dn.2xlarge | 32        | 1   | 6.99          | 0.053        | 0.23        | 0.00045        |
| r5.2xlarge   | 64        | 0   | 6.99          | 0.042        | 0.10        | 0.00019        |
| c4.2xlarge   | 15        | 0   | 7.07          | 0.162        | 0.07        | 0.00014        |
| r5n.2xlarge  | 64        | 0   | 7.34          | 0.043        | 0.09        | 0.00018        |
| m5n.2xlarge  | 32        | 0   | 7.38          | 0.044        | 0.08        | 0.00016        |
| m5.2xlarge   | 32        | 0   | 7.57          | 0.038        | 0.08        | 0.00017        |
| m5a.2xlarge  | 32        | 0   | 7.79          | 0.032        | 0.08        | 0.00017        |
| r5a.2xlarge  | 64        | 0   | 7.84          | 0.073        | 0.10        | 0.00022        |
| r4.2xlarge   | 61        | 0   | 8.09          | 0.031        | 0.08        | 0.00018        |
| m4.2xlarge   | 32        | 0   | 8.43          | 0.061        | 0.08        | 0.00019        |
| p3.2xlarge   | 61        | 1   | 8.45          | 0.071        | 0.92        | 0.00216        |
| x1e.2xlarge  | 244       | 0   | 8.53          | 0.051        | 0.50        | 0.00118        |
| t3.2xlarge   | 32        | 0   | 9.23          | 0.395        | 0.10        | 0.00026        |



[Figure: Processing Time by Prefix]

[Figure: Processing Time by Instance Number]

[Figure: Processing Time by Suffix]

Next, we tested 96-core machines for the C, M, and R EC2 instance families. Since this process uses neither GPUs nor databases, we left out the other EC2 instance types.


Table 2

| EC2 Instance | RAM (GiB) | N Files | Mean Time (s) | STD Time (s) | Cost ($/hr) | Total Cost ($) |
|--------------|-----------|---------|---------------|--------------|-------------|----------------|
| c5.metal     | 192       | 1       | 6.28          | 0.312        | 0.91        | 0.00159        |
| c5.24xlarge  | 192       | 1       | 6.45          | 0.207        | 1.03        | 0.00185        |
| r5.metal     | 768       | 1       | 6.85          | 0.148        | 1.03        | 0.00196        |
| m5.metal     | 384       | 1       | 6.87          | 0.100        | 0.96        | 0.00183        |
| r5.24xlarge  | 768       | 1       | 7.07          | 0.282        | 1.21        | 0.00238        |
| r5n.24xlarge | 768       | 1       | 7.08          | 0.287        | 1.00        | 0.00197        |
| m5.24xlarge  | 384       | 1       | 7.14          | 0.288        | 0.96        | 0.00190        |
| m5n.24xlarge | 384       | 1       | 7.20          | 0.345        | 0.96        | 0.00192        |
| m5a.24xlarge | 384       | 1       | 7.93          | 0.090        | 0.96        | 0.00211        |
| r5a.24xlarge | 768       | 1       | 8.00          | 0.187        | 1.00        | 0.00222        |



Table 3

| EC2 Instance | RAM (GiB) | N Files | Mean Time (s) | STD Time (s) | Cost ($/hr) | Total Cost ($) |
|--------------|-----------|---------|---------------|--------------|-------------|----------------|
| c5.24xlarge  | 192       | 10      | 7.02          | 0.075        | 1.03        | 0.00201        |
| c5.metal     | 192       | 10      | 7.06          | 0.247        | 0.91        | 0.00178        |
| m5.metal     | 384       | 10      | 7.71          | 0.197        | 0.96        | 0.00206        |
| r5.24xlarge  | 768       | 10      | 7.76          | 0.148        | 1.21        | 0.00261        |
| r5.metal     | 768       | 10      | 7.80          | 0.286        | 1.03        | 0.00223        |
| r5n.24xlarge | 768       | 10      | 7.86          | 0.305        | 1.00        | 0.00218        |
| m5n.24xlarge | 384       | 10      | 7.97          | 0.350        | 0.96        | 0.00213        |
| m5.24xlarge  | 384       | 10      | 8.07          | 0.353        | 0.96        | 0.00215        |
| r5a.24xlarge | 768       | 10      | 9.54          | 0.443        | 1.00        | 0.00265        |
| m5a.24xlarge | 384       | 10      | 9.56          | 0.448        | 0.96        | 0.00255        |



Table 4

| EC2 Instance | RAM (GiB) | N Files | Mean Time (s) | STD Time (s) | Cost ($/hr) | Total Cost ($) |
|--------------|-----------|---------|---------------|--------------|-------------|----------------|
| c5.24xlarge  | 192       | 100     | 25.50         | 0.416        | 1.03        | 0.00730        |
| c5.metal     | 192       | 100     | 25.50         | 0.704        | 0.91        | 0.00645        |
| r5.metal     | 768       | 100     | 26.20         | 0.342        | 1.03        | 0.00750        |
| r5.24xlarge  | 768       | 100     | 26.60         | 0.453        | 1.21        | 0.00894        |
| m5n.24xlarge | 384       | 100     | 26.90         | 0.529        | 0.96        | 0.00717        |
| m5.24xlarge  | 384       | 100     | 27.00         | 0.445        | 0.96        | 0.00720        |
| m5.metal     | 384       | 100     | 27.20         | 0.797        | 0.96        | 0.00725        |
| r5n.24xlarge | 768       | 100     | 27.20         | 0.498        | 1.00        | 0.00756        |
| m5a.24xlarge | 384       | 100     | 31.30         | 1.230        | 0.96        | 0.00835        |
| r5a.24xlarge | 768       | 100     | 33.90         | 0.772        | 1.00        | 0.00942        |



[Figure: Mean Processing Time by Instance Prefix]

[Figure: Mean Total Cost by Instance Prefix]

[Figure: Mean Processing Time by Instance Suffix]

[Figure: Mean Total Cost by Instance Suffix]

[Figure: Mean Processing Time by Metal/Non-Metal]

[Figure: Mean Total Cost by Metal/Non-Metal]

Analysis

  1. C instances are the fastest, but have the least RAM.

  2. R instances have the most RAM and slightly better performance than M instances, but at a higher cost.

  3. M instances offer higher performance than the other general-purpose machines while remaining cost-effective.

  4. Newer generations outperform older ones within the same family. For example, the m5 performs better than the m4, with only a minimal cost increase.

  5. CASFS+ is already network optimized, so instances with the -n suffix (which are network optimized) do not receive additional network performance benefits. However, they are cheaper on average than the base machines and can provide cost savings.

  6. Intel machines outperform AMD machines (instances with the -a suffix), and the AMD machines do not provide much of a cost benefit, so we recommend using Intel machines.

  7. Metal machines provide a performance boost when the dataset is smaller, but perform on par with 24xlarge machines as dataset size increases. They are generally cheaper than the 24xlarge instances and can provide cost savings.

Conclusion

For workloads that are not GPU or database dependent, we recommend a C, M, or R machine, depending on your RAM needs and budget. If the dataset fits into memory, the C machines will always provide the best performance. After that, choosing between an M and an R machine comes down to memory and budget. If budget is not an issue, the R instances will provide slightly better performance; if it is, M machines perform comparably at a lower cost.