Data Science on AWS: Benefits and Common Pitfalls

More than two years ago, we wrote about the misguided fear of the cloud among many enterprise companies. How quickly things change! Today, every enterprise we work with is either using the cloud or in the process of moving there. We work with companies that insisted, just two years ago, that they “can’t use the cloud” — and are now undertaking strategic initiatives to have “real work in AWS by the end of 2017.” We see this happening across industries including finance, insurance, pharmaceuticals, retail, and even government.

AWS is especially compelling for data science workloads, which benefit from bursts of elastic compute for computationally intensive experiments, and often from specialized hardware such as GPUs. For example, hedge funds and banks can backtest investment strategies faster by spreading work out across machines. Pharmaceutical research teams can run faster Monte Carlo simulations to speed up drug design. And insurance companies experimenting with driver telemetry models or image recognition can more easily benefit from deep learning techniques.
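As a minimal sketch of what spreading work out looks like in code, the Python below splits a Monte Carlo estimate into independent chunks and hands them to an executor. Everything here is illustrative: the function names are hypothetical, estimating pi stands in for a real backtest or simulation, and a local thread pool stands in for a fleet of machines.

```python
import random
from concurrent.futures import Executor, ThreadPoolExecutor


def simulate_chunk(seed: int, n_samples: int) -> int:
    """Count random points in the unit square that fall inside the quarter circle."""
    rng = random.Random(seed)  # per-chunk seed keeps chunks independent and reproducible
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(executor: Executor, n_chunks: int, samples_per_chunk: int) -> float:
    """Fan the chunks out to the executor, then combine the partial counts."""
    seeds = range(n_chunks)
    sizes = [samples_per_chunk] * n_chunks
    total_hits = sum(executor.map(simulate_chunk, seeds, sizes))
    return 4.0 * total_hits / (n_chunks * samples_per_chunk)


def run_local(n_chunks: int = 8, samples_per_chunk: int = 20_000) -> float:
    # Assumption: a thread pool is used only to show the fan-out/combine shape.
    # It will not speed up CPU-bound Python; real gains need separate processes
    # or separate machines.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return estimate_pi(pool, n_chunks, samples_per_chunk)
```

Because each chunk depends only on its seed and sample count, the same `simulate_chunk` function could in principle be handed to a process pool or a cluster scheduler without changing the combine step.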

In order to realize the full potential of the cloud for data science, organizations must address a series of peripheral use cases beyond simply providing access to infrastructure. Because we built Domino with first-class AWS integration from day one, it provides an integrated, complete solution for data science workflows in the cloud. Below we cover a few common questions and pitfalls we’ve encountered over the last two years of watching data science teams move to the cloud.

DevOps Automation

You don’t want your data scientists spending time on DevOps tasks like creating AMIs, defining Security Groups, and creating EC2 instances. Data science workloads benefit from large machines for exploratory analysis in tools like Jupyter or RStudio, as well as elastic scalability to support bursty demand from teams, or parallel execution of data science experiments, which are often computationally intensive.
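To give a sense of the configuration these tasks involve, here is a hedged sketch that assembles the parameters for a single EC2 launch request. The helper name and all IDs are placeholders invented for illustration. The keys mirror the keyword arguments of boto3's `run_instances` call, but the sketch only builds the dictionary and never contacts AWS.

```python
from typing import Any, Dict, List


def build_launch_request(
    image_id: str,
    instance_type: str,
    security_group_ids: List[str],
    owner: str,
) -> Dict[str, Any]:
    """Assemble launch parameters for one instance (hypothetical helper)."""
    return {
        "ImageId": image_id,                      # the AMI to boot from
        "InstanceType": instance_type,            # hardware size, e.g. a GPU instance
        "MinCount": 1,                            # launch exactly one instance
        "MaxCount": 1,
        "SecurityGroupIds": list(security_group_ids),
        "TagSpecifications": [
            {
                "ResourceType": "instance",
                # Tagging by owner makes later cost attribution possible.
                "Tags": [{"Key": "owner", "Value": owner}],
            }
        ],
    }


# Placeholder values only; none of these IDs refer to real resources.
request = build_launch_request(
    image_id="ami-EXAMPLE",
    instance_type="p2.xlarge",
    security_group_ids=["sg-EXAMPLE"],
    owner="data-science-team",
)
```

With boto3 installed and credentials configured, a dictionary like this would typically be passed as `boto3.client("ec2").run_instances(**request)`. Even this small example shows how many infrastructure details a data scientist would otherwise have to track by hand.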

Cost controls, resource monitoring and reporting

Data science workloads often benefit from high-end hardware, which can be expensive. When data scientists have more access to scalable compute, how do you mitigate the risk of runaway costs, enforce limits, and attribute costs across multiple groups or teams?
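One way to picture attribution and limits is the small Python sketch below. It is an assumption-laden illustration, not a description of any particular product: the instance kinds, hourly rates, team names, and limits are all made up, and the rates are not actual AWS prices.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical hourly rates in dollars; NOT actual AWS prices.
HOURLY_RATE: Dict[str, float] = {"cpu-large": 0.50, "gpu-large": 3.00}

UsageRecord = Tuple[str, str, float]  # (team, instance kind, hours used)


def attribute_costs(usage: Iterable[UsageRecord]) -> Dict[str, float]:
    """Sum the cost of each usage record under the team that incurred it."""
    totals: Dict[str, float] = defaultdict(float)
    for team, kind, hours in usage:
        totals[team] += HOURLY_RATE[kind] * hours
    return dict(totals)


def teams_over_limit(costs: Dict[str, float], limits: Dict[str, float]) -> List[str]:
    """Return, in sorted order, the teams whose spend exceeds their limit."""
    # Teams without a configured limit are treated as unlimited.
    return sorted(
        team for team, cost in costs.items()
        if cost > limits.get(team, float("inf"))
    )


usage = [
    ("research", "gpu-large", 10.0),
    ("research", "cpu-large", 4.0),
    ("pricing", "cpu-large", 20.0),
]
costs = attribute_costs(usage)
over = teams_over_limit(costs, {"research": 25.0, "pricing": 50.0})
```

In this made-up data, the research team's GPU hours dominate its bill and push it past its limit, which is the kind of signal a monitoring report would need to surface before costs run away.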

Environment management

Data scientists need agility to experiment with new open source tools and packages, which are evolving faster than ever before. System administrators, meanwhile, must ensure the stability and safety of environments. How can you balance these competing needs?

GPUs

Neural networks and other effective data science techniques benefit from GPU acceleration, but configuring and utilizing GPUs remains easier said than done. How can you provide efficient access to GPUs for your data scientists without miring them in DevOps configuration tasks?

Security

AWS offers world-class security in its environment, but you must still make choices about how you configure security for your applications running on AWS. These choices can make a significant difference in mitigating risk as your data scientists transfer logic (source code) and data sets that may represent your most valuable intellectual property.


Learn more

Domino runs natively in AWS and can be installed in your VPC or used in our managed hosted environment. Companies including Monsanto, Mashable, Eventbrite, and DBRS, along with hedge funds, banks, and insurance companies that we can’t name, all use Domino in AWS as a force multiplier for their data scientists. Let us know if you’d like to learn how.

  • How is AWS for GPU work? I have only used it for standard CPU work with threading. I actually haven’t done anything with GPU work for R in general. Is it worth looking into?