Prediction of AWS Expenses with Machine Learning

May 18th, 2021

About the Research

In this research, we analyzed users' total expenses in the AWS analytics environment and forecasted future user costs for the upcoming months so that management can set a proper budget for each user. The total cost included expenses incurred from using the CASFS file system, as well as any other Amazon Web Services such as EC2, EBS, or EFS. We used an ensemble of gradient boosted models to predict usage for future months.

About the Data

The data we collected on user costs is available to any CASFS administrator. The columns in the data set are described below.

  • Username: the user who ordered the service

  • Date: the date the service was ordered and used

  • Service: the name of the service that was used

  • Cost: the cost incurred while using this service on this specific day

The data is ordered time-sequentially by Username and Date and grouped by Service. An example two-day snapshot of three users' usage data is shown below:

*usernames have been changed for security reasons*

Index,Username,Date,Service,Cost
0,david,2021-05-06,CAS,3.22
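To make the schema concrete, here is a minimal sketch of loading and ordering such an export with pandas. The extra rows beyond the one shown above are made-up illustrative values, and the per-day grouping is an assumption about how the raw export would be aggregated:

```python
import io
import pandas as pd

# Illustrative rows in the same schema as the export above
# (all rows after the first are made-up example values).
raw = io.StringIO(
    "Index,Username,Date,Service,Cost\n"
    "0,david,2021-05-06,CAS,3.22\n"
    "1,david,2021-05-07,EC2,1.10\n"
    "2,sarah,2021-05-06,CAS,0.75\n"
)
df = pd.read_csv(raw, parse_dates=["Date"])

# Order time-sequentially by Username and Date, as described above.
df = df.sort_values(["Username", "Date"]).reset_index(drop=True)

# Daily total per user per service.
daily = df.groupby(["Username", "Date", "Service"], as_index=False)["Cost"].sum()
```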
Feature Engineering

To extract further information from the data, we performed feature engineering to create more accurate predictions. The original data contained the user_name, service, month, day, week_day, and cost. First, we added lag features: the cost 1, 7, 14, and 21 days before each observation. We also calculated several rolling statistics, namely the 3-, 7-, 14-, 21-, and 30-day mean, median, standard deviation, minimum, and maximum. A portion of the engineered features can be seen below; the mean, median, standard deviation, minimum, and maximum were computed for the 14-, 21-, and 30-day windows as well, but are not shown in the picture.
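The lag and rolling features described above can be sketched with pandas as follows. The column names and the sample series are illustrative, and shifting before rolling (so a day's own cost never appears in its features) is our assumption about how leakage would be avoided:

```python
import pandas as pd

# Illustrative daily per-user cost series (values are made up).
df = pd.DataFrame({
    "user_name": ["david"] * 10,
    "date": pd.date_range("2021-05-01", periods=10, freq="D"),
    "cost": [3.2, 1.1, 0.0, 2.5, 4.0, 1.8, 0.6, 3.3, 2.2, 5.1],
}).sort_values(["user_name", "date"])

g = df.groupby("user_name")["cost"]

# Lag features: the cost 1, 7, 14, and 21 days earlier.
for lag in (1, 7, 14, 21):
    df[f"cost_lag_{lag}"] = g.shift(lag)

# Rolling statistics over trailing windows, shifted by one day so the
# current day's cost does not leak into its own features.
# (Median, minimum, and maximum are computed the same way.)
for win in (3, 7, 14, 21, 30):
    df[f"mean_{win}d"] = g.transform(
        lambda s, w=win: s.shift(1).rolling(w, min_periods=1).mean())
    df[f"std_{win}d"] = g.transform(
        lambda s, w=win: s.shift(1).rolling(w, min_periods=1).std())
```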


Model Building

We determined that the best way to predict usage was an ensemble of three specific models: XGBoost, LightGBM, and CatBoost. These are all gradient boosted decision tree algorithms for predictive analytics. Each model differs slightly from the others, which is why the ensemble of all three gave us our best results.

For example, comparing LightGBM to XGBoost: LightGBM uses Gradient-based One-Side Sampling (GOSS) when searching for the best split, while XGBoost uses a histogram-based algorithm. Each algorithm also treats categorical features differently. CatBoost has a parameter that takes the column indices of the categorical variables; it then uses a combination of one-hot encoding and a method similar to mean (target) encoding so that the algorithm can use the data. LightGBM can also handle categorical features when you supply the feature names; it encodes them using a special algorithm that finds split values for categorical features. Finally, XGBoost cannot handle categorical features: your data must already be encoded before it is passed to the XGBoost algorithm.
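Since XGBoost needs pre-encoded categoricals, a minimal pandas sketch of integer-encoding is shown below. The column names and values are illustrative, and the comments summarize the parameter names the other two libraries expose (`cat_features` for CatBoost, `categorical_feature` for LightGBM):

```python
import pandas as pd

# Illustrative categorical columns (values are made up).
df = pd.DataFrame({
    "user_name": ["david", "sarah", "david"],
    "service": ["CAS", "EC2", "CAS"],
})

# CatBoost: pass cat_features=[0, 1] (categorical column indices).
# LightGBM: pass categorical_feature=["user_name", "service"], with the
#           columns cast to the pandas "category" dtype.
# XGBoost: encode to integers yourself before training, e.g.:
for col in ("user_name", "service"):
    df[col + "_code"] = df[col].astype("category").cat.codes
```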

We cleaned the data and created the features used to train our models; these features are discussed above in the "Feature Engineering" section of this paper. The resulting data set contained 35 features in total, with cost as the Y variable and the rest as the X variables. We then split this data into training and testing sets with a 20% test split. After the split, we fit all three models on the training data: the LightGBM Regressor, CatBoost Regressor, and XGBoost Regressor. Once all models were fitted and their parameters appropriately tuned, we ran the test data through them to predict the cost of each instance. We saved the three predictions from the three algorithms, as well as an ensemble value that was the average of all three predictions.
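The 80/20 split can be sketched as below. The feature matrix is a random stand-in for the 35 engineered features, and the manual index shuffle is just one way to do it (scikit-learn's `train_test_split` would work equally well):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins: 100 rows of the 35 engineered features
# and the corresponding cost target.
X = rng.normal(size=(100, 35))
y = rng.normal(size=100)

# 80/20 train/test split via a shuffled index.
idx = rng.permutation(len(X))
n_test = int(0.2 * len(X))
test_idx, train_idx = idx[:n_test], idx[n_test:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```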

We then took a vertical stacking approach: we used the predicted results from all three algorithms, plus the ensemble value (the mean of the three), as the X variables and cost as the Y variable, and ran this data through a Gradient Boosting Regressor. Mean Absolute Error (MAE) was used to measure the accuracy of the predictions from this vertical stack.
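The stacked input and the MAE metric can be sketched with numpy. The prediction arrays are made-up stand-ins for the three models' outputs, not real results:

```python
import numpy as np

# Illustrative stand-ins for the three base models' test-set predictions.
pred_xgb = np.array([2.1, 0.9, 3.4])
pred_lgbm = np.array([2.3, 1.0, 3.1])
pred_cat = np.array([1.9, 1.1, 3.6])
y_true = np.array([2.0, 1.2, 3.3])  # illustrative actual costs

# Ensemble value: element-wise mean of the three predictions.
ensemble = np.mean([pred_xgb, pred_lgbm, pred_cat], axis=0)

# Stack the predictions (plus the ensemble) column-wise to form the
# X matrix fed to the final Gradient Boosting Regressor.
X_stack = np.column_stack([pred_xgb, pred_lgbm, pred_cat, ensemble])

def mae(y_true, y_pred):
    """Mean Absolute Error between actual and predicted costs."""
    return float(np.mean(np.abs(y_true - y_pred)))
```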


Overall, we ran four machine learning models. First, we trained the LightGBM Regressor, CatBoost Regressor, and XGBoost Regressor. We then took the predicted values from these algorithms and vertically assembled them as the independent variables for the final model, the Gradient Boosting Regressor. The Gradient Boosting model forecasted each day's usage for the next month, and we calculated the upper and lower bounds of a 95% confidence interval for each monthly prediction. Finally, we summed the predicted daily usage values to get a predicted monthly total, which is the final value shown to the user as their next month's projected usage.
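The monthly roll-up can be sketched as follows. The daily forecasts and standard errors are illustrative stand-ins, and the interval construction (summing variances under an independence assumption) is our reading of how a 95% interval on the monthly total could be formed, not necessarily the exact method used:

```python
import numpy as np
import pandas as pd

# Illustrative daily forecasts for the next month, with per-day
# standard errors (stand-ins for the model's actual outputs).
dates = pd.date_range("2021-06-01", periods=30, freq="D")
daily_pred = pd.Series(np.full(30, 2.0), index=dates)
daily_se = pd.Series(np.full(30, 0.5), index=dates)

# Projected monthly total: sum of the daily predictions.
monthly_total = daily_pred.sum()

# Rough 95% interval, assuming independent daily errors
# (the variance of a sum is then the sum of the variances).
monthly_se = float(np.sqrt((daily_se ** 2).sum()))
lower = monthly_total - 1.96 * monthly_se
upper = monthly_total + 1.96 * monthly_se
```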

Below is a graph of our results. The y-axis represents the cost and the x-axis represents the date the cost was incurred. The blue dots represent actual observed values, while the green line represents values predicted by our vertical ensemble machine learning method.
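A plot in this style (blue dots for actuals, a green line for predictions) can be reproduced with matplotlib; the series below are randomly generated stand-ins, not our actual results:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Illustrative actual and predicted daily costs (made-up values).
dates = pd.date_range("2021-05-01", periods=30, freq="D")
actual = np.random.default_rng(1).uniform(0, 5, size=30)
predicted = actual + np.random.default_rng(2).normal(0, 0.5, size=30)

fig, ax = plt.subplots()
ax.scatter(dates, actual, color="blue", label="actual")
ax.plot(dates, predicted, color="green", label="predicted")
ax.set_xlabel("Date")
ax.set_ylabel("Cost")
ax.legend()
fig.savefig("cost_forecast.png")
```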