The Cost of Doing Data Science on Laptops

At the heart of the data science process are the resource-intensive tasks of modeling and validation. During these tasks, data scientists will try and discard thousands of temporary models to find the optimal configuration. Even for small datasets, this can take hours to process.

Because of this, data scientists who rely on their laptops or departmental servers for processing power must choose between fast processing time and model complexity.

In either case, performance and revenue suffer:

  • Decreased model complexity leads to less accurate models, which impacts revenue.
  • Increased processing time means running fewer experiments, which limits innovation and therefore impacts revenue.

Cost of Less Accurate Models

Take the example of churn prediction, something common to almost all organizations. The analysis below shows that even with a small dataset of 5,000 customers, there is a difference of roughly 10 percentage points in accuracy between a simple and a complex model. That gap leads to $28,750 in lost revenue due to customer loss.

Our analysis is based on the CHURN dataset from the UCI repository. This dataset contains a list of 5,000 telecom customers, each with attributes such as account length and number of customer service calls, and whether the customer churned. We assumed a $50 cost of intervention with a 60% success rate, and a loss of $500 per churned customer.

We trained three models to predict churn and evaluated them on a test set of 1,000 accounts. We used laptop-quality hardware, just as a data scientist would in many organizations.

Model        | Missed Churns | Unnecessary Outreach | Cost
GLMNET       | 120           | 17                   | $36,850
GBM          | 38            | 14                   | $12,100
H2O Ensemble | 24            | 18                   | $8,100
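
The article does not spell out how the Cost column is calculated. One reading that reproduces all three figures is sketched below: each missed churn is charged the expected revenue an intervention would have saved (60% of $500, or $300), and each unnecessary outreach is charged the $50 intervention cost. This formula is inferred from the numbers, not taken from the original analysis, and the names `MISSED_CHURN_COST` and `model_cost` are illustrative.

```python
# Assumptions stated in the article.
INTERVENTION_COST = 50   # dollars per outreach
SUCCESS_RATE = 0.60      # share of interventions that retain the customer
LOSS_PER_CHURN = 500     # dollars lost per churned customer

# Inferred, not stated in the article: a missed churn is charged the
# expected revenue that an intervention would have saved.
MISSED_CHURN_COST = SUCCESS_RATE * LOSS_PER_CHURN


def model_cost(missed_churns: int, unnecessary_outreach: int) -> float:
    """Dollar cost of a model's errors under the inferred formula."""
    return (missed_churns * MISSED_CHURN_COST
            + unnecessary_outreach * INTERVENTION_COST)


# (missed churns, unnecessary outreach) from the results table.
results = {
    "GLMNET": (120, 17),
    "GBM": (38, 14),
    "H2O Ensemble": (24, 18),
}

costs = {name: model_cost(*counts) for name, counts in results.items()}
for name, cost in costs.items():
    print(f"{name}: ${cost:,.0f}")

# Gap between the simplest and the most complex model.
gap = costs["GLMNET"] - costs["H2O Ensemble"]
print(f"GLMNET vs. H2O Ensemble: ${gap:,.0f}")
```

Under this reading, the gap between the GLMNET and H2O Ensemble rows is the $28,750 quoted earlier in this section.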

Linear Model using R’s GLMNET

This is representative of the type of model a data scientist working with hardware constraints might train.

Results:

  • 86% overall accuracy on the test set
  • 120 unidentified churned customers
  • 17 instances of unnecessary outreach to customers who were unlikely to churn
  • $36,850 loss due to model underperformance

GBM Model using R’s GBM

This model uses a more advanced algorithm than a linear model and provides a significant improvement of 8 percentage points in accuracy when predicting which customers will churn.

Results:

  • 94% overall accuracy
  • 38 unidentified churned customers
  • 14 instances of unnecessary outreach
  • $12,100 loss due to model underperformance

Cutting Edge Stacked H2O Ensemble

This is the cutting edge of modeling techniques. It is the kind of modeling data scientists want to do, but hardware constraints hold them back.

This model leverages a Gradient Boosting Machine, a Random Forest, and a deep learning neural network to provide an ensemble prediction. It provided the highest performance and the biggest cost savings.

Results:

  • 95.8% overall accuracy
  • 24 unidentified churned customers
  • 18 instances of unnecessary outreach
  • $8,100 loss due to model underperformance
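
As a rough cross-check, the accuracy figures can be recovered from the error counts. The sketch below assumes the test set holds exactly 1,000 accounts and that missed churns and unnecessary outreach are the only two kinds of error; the `accuracy` helper is illustrative and not part of the original analysis.

```python
TEST_SET_SIZE = 1_000  # accounts in the test set, per the article

# (missed churns, unnecessary outreach) as reported for each model.
errors = {
    "GLMNET": (120, 17),
    "GBM": (38, 14),
    "H2O Ensemble": (24, 18),
}


def accuracy(missed: int, unnecessary: int, total: int = TEST_SET_SIZE) -> float:
    """Share of accounts classified correctly, assuming no other error types."""
    return (total - missed - unnecessary) / total


accuracies = {name: accuracy(*counts) for name, counts in errors.items()}
for name, acc in accuracies.items():
    print(f"{name}: {acc:.1%}")
```

Under these assumptions the results land close to the reported 86%, 94%, and 95.8%.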

The difference between a cutting-edge model and a simple model is roughly 10 percentage points in accuracy, which translates to $28,750 ($36,850 minus $8,100). Remember, this is with a small dataset of just 5,000 customers, using conservative estimates. The cost of less accurate models at larger organizations can easily reach hundreds of thousands of dollars, if not millions.

Why not just use the best available models all the time? The answer is in the training times and the cost of limited processing power.

Cost of Restricted Processing Power

High-performance models require more processing power, and on standard laptops it could take hours to train these models. Here are the training times for each of the models in our analysis:

Model        | Training Time (Laptop) | Missed Churns | Unnecessary Outreach | Cost
GLMNET       | 43 seconds             | 120           | 17                   | $36,850
GBM          | 828 seconds            | 38            | 14                   | $12,100
H2O Ensemble | hours                  | 24            | 18                   | $8,100

Data scientists working on restricted hardware such as laptops are less likely to try high-performance models when it takes half of their day to get results. This is not even considering the additional time it would take to validate those results with each model.

If they do choose to wait hours for a more accurate model, they are left with less time to run other experiments that could lead to even better results. This opportunity cost leads to slow or stagnant innovation and an inability to make a significant impact for the organization.

This is a terrible set of choices, yet many data scientists are put into this position every day.

As long as data scientists are forced to work on restricted machines—such as laptops or self-managed departmental servers—the organization will continue to lose money and competitive edge.

Another Option: Cloud

The solution is to enable data scientists to run experiments on cloud hardware.

The table below shows training times for each of the models in our analysis when run in the cloud, demonstrating that it’s possible to develop accurate models without sacrificing time.

Model        | Training Time (Laptop) | Training Time (Cloud) | Missed Churns | Unnecessary Outreach | Cost
GLMNET       | 43 seconds             | 9 seconds             | 120           | 17                   | $36,850
GBM          | 828 seconds            | 27 seconds            | 38            | 14                   | $12,100
H2O Ensemble | hours                  | 71 seconds            | 24            | 18                   | $8,100
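
To put the timings in perspective, the speedup for each model is the laptop time divided by the cloud time. The short sketch below works this out for the two models with exact laptop timings. The H2O Ensemble is left out because its laptop time is reported only as "hours".

```python
# (laptop seconds, cloud seconds) from the table above.
# The H2O Ensemble is omitted: its laptop time is given only as "hours".
timings = {
    "GLMNET": (43, 9),
    "GBM": (828, 27),
}

speedups = {name: laptop / cloud for name, (laptop, cloud) in timings.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster in the cloud")
```

By this measure GLMNET trains a little under 5 times faster in the cloud, and GBM a little over 30 times faster.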

The cutting-edge H2O model, which took hours on a laptop, trained in just over a minute on an AWS X1 instance at a cost of around 39 cents. That saves $28,750 for the organization and leaves the data scientist with many hours of free time to try other models and experiments.

Conclusion

The cost of having data scientists work on laptops is significant. Even when working with small datasets, data scientists must choose between developing accurate models and developing them faster. Both options lead to lost revenue for the organization.

The cloud is the optimal home for data science teams. It enables them to try more audacious experiments and use cutting-edge techniques, resulting in significant and quantifiable ROI for the organization.

The easiest and fastest way to give data scientists access to on-demand and scalable cloud hardware, without the need to provision or maintain cloud services, is with a data science platform such as Domino.

(You can view, fork, and play with the analysis used in this article in Domino.)

  • Hi,
    I’ve got a couple of quick comments on your analysis without looking at the notebooks to check the details.
    Firstly, the cost of the AWS X1 instance is misleading since you have to pay per hour (around $3.7926/hr for an x1.32xl in North Virginia with spot pricing, $13.338/hr for on-demand pricing, https://aws.amazon.com/ec2/pricing/on-demand/). This could work out to more than $5k per year depending on use, unless the instance is reserved.
    Secondly, given the hype around ML and deep learning over the last year or so in particular, there’s a fair chance that your data scientist might want a decent gaming laptop in order to use the GPU for training, not gaming :), say with an Nvidia GTX 1080 GPU. Heavier than a MacBook, but potentially a lot more powerful, and cheaper. And with that kind of laptop (which would have up to 4 cores/8 threads for use; I think you used 2, based on “Dual Core GBM Churn Prediction”?), and the use of H2O Deep Water for example (though 1000 points for the dataset is somewhat small), you might well find the results to be a bit more favourable for the laptop, plus you wouldn’t have to worry about network connectivity.
    There are also the AWS instances with GPUs that could be just as interesting, at least for deep learning or other workloads that can take advantage of CUDA.
    Interesting article anyhow, and I look forward to trying the analysis on my own workstation to see how things compare 🙂

    • To add to that: it is possible to give them a desktop/laptop combination, which means they can work more slowly when remote and “dock” into something with multiple GPUs for CUDA at their desk. A good example is something like a Razer Core. Also, if the cryptocurrency boys have taught me anything, it’s that building a massive GPU server rack is not as expensive as it sounds. If you don’t mind ugly, you could have on-prem systems in the $10K range that work even when the data scientists might not be.

  • Vance Lopez

    Very clear write up and analysis. Thank you.

  • Hey, thanks for sharing this very insightful blog post on the cost of doing data science on laptops. Very clear and informative.