by Eduardo Ariño de la Rubia on May 4th, 2017
At the heart of the data science process are the resource intensive tasks of modeling and validation. During these tasks, data scientists will try and discard thousands of temporary models to find the optimal configuration. Even for small data sets, this could take hours to process.
Because of this, data scientists who rely on their laptops or departmental servers for processing power must choose between fast processing time and model complexity.
In either case, performance and revenue suffer:
- Decreased model complexity leads to less accurate models, which impacts revenue.
- Increased processing time means running fewer experiments, which limits innovation and therefore impacts revenue.
Cost of Less Accurate Models
Take the example of churn prediction—something common to almost all organizations. The below analysis shows that even with a small dataset of 5,000 customers, there is a difference of 10% in accuracy between a simple and a complex model. The small 10% difference in model accuracy leads to $28,750 in lost revenues due to customer loss.
Our analysis is based on the CHURN dataset from the UCI repository. This dataset contains a list of 5,000 telecom customers, each with attributes such as account length and number of customer service calls, and whether the customer churned. We assumed a $50 cost of intervention with a 60% success rate, and a loss of $500 per churned customer.
We trained 3 models to predict churn on a test set of 1,000 accounts. We used laptop-quality hardware, just as a data scientist would in many organizations.
|Model||Missed Churns||Unnecessary Outreach||Cost|
Linear Model using R’s GLMNET
This is representative of the type of model a data scientist working with hardware constraints might train.
- 86% overall accuracy on the test set
- 120 unidentified churned customers
- 17 instances of unnecessary outreach to customers who were unlikely to churn
- $36,850 loss due to model underperformance
GBM Model using R’s GBM
This model leverages more advanced algorithms than a linear model, and provides a significant 8% improvement in predicting which customers will churn.
- 94% overall accuracy
- 38 unidentified churned customers
- 14 instances of unnecessary outreach
- $12,100 loss due to model underperformance
Cutting Edge Stacked H2O Ensemble
This is the cutting edge of modeling techniques. It is the kind of modeling data scientists want to do, but are limited by hardware constraints.
This model leverages a Gradient Boosting Machine, a Random Forest, and a deep learning neural network to provide an ensemble prediction. It provided the highest performance and the biggest cost savings.
- 95.8% overall accuracy
- 24 unidentified churned customers
- 18 instances of unnecessary outreach
- $8,100 loss due to model underperformance
The difference between a cutting-edge model and a simple model is 10% in accuracy which translates to $27,850. Remember this is with a small dataset of just 5,000 customers, using conservative estimates. The cost of less accurate models at larger organizations can easily reach hundreds of thousands if not millions.
Why not just use the best available models all the time? The answer is in the training times and the cost of limited processing power.
Cost of Restricted Processing Power
High-performance models require more processing power, and on standard laptops it could take hours to train these models. Here are the training times for each of the models in our analysis:
|Model||Training Time (Laptop)||Missed Churns||Unnecessary Outreach||Cost|
Data scientists working on restricted hardware such as laptops are less likely to try high-performance models when it takes half of their day to get results. This is not even considering the additional time it would take to validate those results with each model.
If they do choose to wait for hours in order to get a more accurate models, they are left with less time to run other experiments that could lead to even better results. This cost of opportunity leads to slow or stagnant innovation, and an inability to make a significant impact for the organization.
This is a terrible set of choices, yet many data scientists are put into this position every day.
As long as data scientists are forced to work on restricted machines—such as laptops or self-managed departmental servers—the organization will continue to lose money and competitive edge.
Another Option: Cloud
The solution is to enable data scientists to run experiments on cloud hardware.
The table below shows training times for each of the models in our analysis when run in the cloud, demonstrating that it’s possible to develop accurate models without sacrificing time.
|Model||Training Time (Laptop)||Training Time (Cloud)||Missed Churns||Unnecessary Outreach||Cost|
|GLMNET||43 seconds||9 seconds||120||17||$36,850|
|GBM||828 seconds||27 seconds||38||14||$12,100|
|H2O Ensemble||hours||71 seconds||24||18||$8,100|
The cutting-edge H2O model—which took hours on a laptop—trained in just over a minute on an AWS X1 instance at a cost of around 39 cents. That saves $27,850 for the organization, and leaves the data scientist with many hours of free time to try other models and experiments.
The cost of having data scientists work on laptops is significant. Even when working with small datasets, data scientists must choose between developing accurate models and developing them faster. Both options lead to lost revenue for the organization.
The cloud is the optimal home for data science teams. It enables them to try more audacious experiments and use cutting-edge techniques, resulting in significant and quantifiable ROI for the organization.
The easiest and fastest way to give data scientists access to on-demand and scalable cloud hardware, without the need to provision or maintain cloud services, is with a data science platform such as Domino.
(You can view, fork, and play with the analysis used in this article in Domino.)