# The Cost of Doing Data Science on Laptops

on May 4, 2017

At the heart of the data science process are the resource-intensive tasks of modeling and validation. During these tasks, data scientists will try and discard thousands of temporary models to find the optimal configuration. Even for small data sets, this could take hours to process.

Because of this, data scientists who rely on their laptops or departmental servers for processing power must choose between fast processing time and model complexity.

In either case, performance and revenue suffer:

• Decreased model complexity leads to less accurate models, which impacts revenue.
• Increased processing time means running fewer experiments, which limits innovation and therefore impacts revenue.

## Cost of Less Accurate Models

Take the example of churn prediction, a problem common to almost all organizations. The analysis below shows that even with a small dataset of 5,000 customers, there is a difference of roughly 10 percentage points in accuracy between a simple and a complex model. That seemingly small difference leads to $28,750 in lost revenue due to customer loss.

Our analysis is based on the CHURN dataset from the UCI repository. This dataset contains a list of 5,000 telecom customers, each with attributes such as account length and number of customer service calls, and whether the customer churned.

We assumed a $50 cost of intervention with a 60% success rate, and a loss of $500 per churned customer.

We trained three models and evaluated their churn predictions on a test set of 1,000 accounts. We used laptop-quality hardware, just as a data scientist would in many organizations.

| Model | Missed Churns | Unnecessary Outreach | Cost |
| --- | --- | --- | --- |
| GLMNET | 120 | 17 | $36,850 |
| GBM | 38 | 14 | $12,100 |
| H2O Ensemble | 24 | 18 | $8,100 |
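The cost figures in the table are consistent with a simple piece of arithmetic. The sketch below is our reading of the stated assumptions, not a formula spelled out in the analysis: each missed churn is treated as a $500 loss that an intervention would have prevented 60% of the time, and each unnecessary outreach as a wasted $50. The helper name `model_cost` is hypothetical.

```python
# Assumptions stated in the analysis.
INTERVENTION_COST = 50   # dollars spent per outreach
SUCCESS_RATE = 0.60      # chance an intervention retains a churning customer
LOSS_PER_CHURN = 500     # dollars lost per churned customer


def model_cost(missed_churns: int, unnecessary_outreach: int) -> float:
    """Expected cost of a model's errors (assumed formula, see lead-in)."""
    # Revenue an intervention would have saved, had the churn been predicted.
    missed = missed_churns * LOSS_PER_CHURN * SUCCESS_RATE
    # Outreach spent on customers who were not going to churn.
    wasted = unnecessary_outreach * INTERVENTION_COST
    return missed + wasted


# (missed churns, unnecessary outreach) per model, taken from the table.
results = {
    "GLMNET": (120, 17),
    "GBM": (38, 14),
    "H2O Ensemble": (24, 18),
}

costs = {name: model_cost(*errors) for name, errors in results.items()}
for name, cost in costs.items():
    print(f"{name}: ${cost:,.0f}")

# Gap between the simplest and the most complex model.
savings = costs["GLMNET"] - costs["H2O Ensemble"]
print(f"Difference: ${savings:,.0f}")
```

Under these assumptions the sketch reproduces the three costs in the table, and the gap between GLMNET and the H2O Ensemble comes out to $28,750.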

### Linear Model using R’s GLMNET

This is representative of the type of model a data scientist working with hardware constraints might train.

Results:

• 86% overall accuracy on the test set
• 120 unidentified churned customers
• 17 instances of unnecessary outreach to customers who were unlikely to churn

### Cutting Edge Stacked H2O Ensemble

This is the cutting edge of modeling techniques. It is the kind of modeling data scientists want to do but often cannot because of hardware constraints.

This model leverages a Gradient Boosting Machine, a Random Forest, and a deep learning neural network to provide an ensemble prediction. It provided the highest performance and the biggest cost savings.

Results:

• 95.8% overall accuracy
• 24 unidentified churned customers
• 18 instances of unnecessary outreach
• $8,100 loss due to model underperformance

The difference between a cutting-edge model and a simple model is about 10 percentage points in accuracy, which translates to $28,750. Remember this is with a small dataset of just 5,000 customers, using conservative estimates. The cost of less accurate models at larger organizations can easily reach hundreds of thousands if not millions.
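The accuracy figures can be recovered from the error counts. This is a minimal sketch, assuming that missed churns and unnecessary outreach are the only two kinds of error on the 1,000-account test set (false negatives and false positives respectively). The helper name `accuracy` is hypothetical.

```python
TEST_SET_SIZE = 1_000  # accounts in the test set, per the analysis


def accuracy(missed_churns: int, unnecessary_outreach: int,
             n: int = TEST_SET_SIZE) -> float:
    """Share of test accounts classified correctly (assumed error model)."""
    errors = missed_churns + unnecessary_outreach
    return (n - errors) / n


glmnet_acc = accuracy(120, 17)
ensemble_acc = accuracy(24, 18)
gap_points = (ensemble_acc - glmnet_acc) * 100  # in percentage points

print(f"GLMNET: {glmnet_acc:.1%}")
print(f"H2O Ensemble: {ensemble_acc:.1%}")
print(f"Gap: {gap_points:.1f} percentage points")
```

With these counts GLMNET lands at 86.3% and the ensemble at 95.8%, which matches the reported figures and puts the gap at roughly 9.5 percentage points.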

Why not just use the best available models all the time? The answer is in the training times and the cost of limited processing power.

## Cost of Restricted Processing Power

High-performance models require more processing power, and on standard laptops it could take hours to train these models. Here are the training times for each of the models in our analysis:

| Model | Training Time (Laptop) | Missed Churns | Unnecessary Outreach | Cost |
| --- | --- | --- | --- | --- |
| GLMNET | 43 seconds | 120 | 17 | $36,850 |
| GBM | 828 seconds | 38 | 14 | $12,100 |
| H2O Ensemble | hours | 24 | 18 | $8,100 |

Data scientists working on restricted hardware such as laptops are less likely to try high-performance models when it takes half of their day to get results. This is not even considering the additional time it would take to validate those results with each model.

If they do choose to wait for hours in order to get a more accurate model, they are left with less time to run other experiments that could lead to even better results. This opportunity cost leads to slow or stagnant innovation, and an inability to make a significant impact for the organization.

This is a terrible set of choices, yet many data scientists are put into this position every day. As long as data scientists are forced to work on restricted machines, such as laptops or self-managed departmental servers, the organization will continue to lose money and competitive edge.

## Another Option: Cloud

The solution is to enable data scientists to run experiments on cloud hardware. The table below shows training times for each of the models in our analysis when run in the cloud, demonstrating that it’s possible to develop accurate models without sacrificing time.

| Model | Training Time (Laptop) | Training Time (Cloud) | Missed Churns | Unnecessary Outreach | Cost |
| --- | --- | --- | --- | --- | --- |
| GLMNET | 43 seconds | 9 seconds | 120 | 17 | $36,850 |
| GBM | 828 seconds | 27 seconds | 38 | 14 | $12,100 |
| H2O Ensemble | hours | 71 seconds | 24 | 18 | $8,100 |
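To put the cloud timings in perspective, the speedup can be worked out for the two models with exact laptop timings. This is only an illustrative calculation. The H2O Ensemble is left out because its laptop time is reported only as "hours", and guessing a number would be misleading.

```python
# (laptop seconds, cloud seconds) taken from the table above.
timings = {
    "GLMNET": (43, 9),
    "GBM": (828, 27),
}

# Speedup = laptop time divided by cloud time.
speedups = {name: laptop / cloud for name, (laptop, cloud) in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: about {factor:.1f}x faster in the cloud")
```

On these numbers GLMNET trains a little under 5 times faster and GBM about 30 times faster, so the heavier model benefits far more from the extra hardware.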

The cutting-edge H2O model, which took hours on a laptop, trained in just over a minute on an AWS X1 instance at a cost of around 39 cents. That saves $28,750 for the organization, and leaves the data scientist with many hours of free time to try other models and experiments.

## Conclusion

The cost of having data scientists work on laptops is significant. Even when working with small datasets, data scientists must choose between developing accurate models and developing them faster. Both options lead to lost revenue for the organization.

The cloud is the optimal home for data science teams. It enables them to try more audacious experiments and use cutting-edge techniques, resulting in significant and quantifiable ROI for the organization.

The easiest and fastest way to give data scientists access to on-demand and scalable cloud hardware, without the need to provision or maintain cloud services, is with a data science platform such as Domino.