Open Source Winning Against Proprietary Data Science Vendors


With the recent publication of Gartner’s Magic Quadrant for Advanced Analytics, we wanted to know how proprietary data science software vendors were faring against open source challengers. We discovered compelling evidence that open source tools have had a dramatic impact on SAS, IBM, Microsoft and others.

We also investigated how open source tools were faring against each other. Which tools have seen the most growth, and which are falling behind? And what of the R versus Python debate?

Inspired by RedMonk’s Programming Language Rankings, we utilized search data from Google Trends and StackOverflow question volume to conduct our investigation. The tools we studied include popular open source and proprietary software used by data science teams.

UPDATE 03/17/2016 14:00 PT - We've corrected some inconsistencies with Google Trends data. While the magnitude of the data in the original report has changed (with most open source tools seeing higher search volume), the directional findings and conclusion remain the same.


Want to see our workings? Visit our Python notebook on Domino.

It’s a busy, and confusing, market

We included 31 technologies in our research:

  • Leading proprietary advanced analytics products (as well as Business Intelligence products that have advanced analytics capabilities);
  • All vendors included in the “Leaders” and “Visionaries” quadrants of the 2016 Gartner Magic Quadrant (MQ) for Advanced Analytics;
  • Open source tools commonly (and some not so commonly) used by data science teams.

As evidenced by the two colorful (and not very useful) overview charts, this makes for a large number of technologies. We have chosen to split this population into 3 software groups: Proprietary, Gartner MQ, and Open Source. Proprietary is a superset of the Gartner MQ group.

Data sources and bias

We used two sources of trend data in this study: Google Trends search data, and tag usage on the StackOverflow Q&A site. Google Trends reflects interest from a broader population than StackOverflow, and provides unbiased data for both proprietary and open source technologies. The StackOverflow data suffers from a selection bias: the site is used largely by the open source developer community. The StackOverflow data is, however, very useful for exploring usage of open source data science technologies.

Some keywords used in the Google Trends research are a vendor’s company name, and do not refer directly to the vendor’s advanced analytics products alone. We chose to use these keywords where product names are generic (‘SAS STAT’), or the company name is synonymous with the vendor’s advanced analytics product. This may result in overstating interest for proprietary analytics products, particularly where vendors have broad product portfolios.

All's not well in Magic Quadrant land

Within larger companies, the use of SAS, IBM SPSS, and other products is pervasive. Despite this, interest in these products and product suites is waning. This is perhaps unsurprising: Gartner’s customer interviews revealed low satisfaction, implementation challenges, high prices, and concerns about lack of pricing transparency with these tools.

Other than for SAS, IBM SPSS and Cognos, search volume was relatively low for these vendors, and all saw a fall in interest. SAS saw a fall in search volume of 26%, and Microsoft Analysis Services a dramatic 46% from 2008 to 2015. Over the same period, IBM SPSS and Cognos lost 29% and 37%, respectively. Statistica's search volume relative to other vendors was, from 2011 to 2013, too low for Google Trends to provide data.

The notable exception in this group is Alteryx, a relatively new challenger to the established vendors. Founded in 2010, Alteryx has no 2008 baseline and so does not feature on the Google Search Trends - Indexed to 2008 chart below. For comparison, we included the search volume for R.
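Indexing a series to a base year can be sketched as follows. This is a minimal illustration, not the code from the notebook: the function name `index_to_base` and all the numbers are made up for the example. It also shows why a tool with no value in the base year cannot appear on an indexed chart.

```python
def index_to_base(series, base_year, base_value=100.0):
    """Rescale a {year: value} series so the base year equals base_value."""
    if base_year not in series:
        raise KeyError(f"no data for base year {base_year}")
    base = series[base_year]
    if base == 0:
        raise ValueError("base-year value is zero, so the series cannot be indexed")
    return {year: base_value * value / base for year, value in series.items()}


# Made-up yearly search-interest values, for illustration only.
established = {2008: 40.0, 2009: 44.0, 2010: 50.0}
newcomer = {2010: 2.0, 2011: 5.0}  # no 2008 value, like a tool launched later

indexed = index_to_base(established, 2008)
print(indexed)

try:
    index_to_base(newcomer, 2008)
except KeyError as err:
    print("cannot index:", err)
```

Once indexed, every series starts at 100 in the base year, so relative growth can be compared regardless of absolute search volume.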

Many of the Gartner MQ participants have search volume too low for Google Trends to report data. These are reported on the right-hand side of the x-axis on the compound annual growth rate (CAGR) chart below. Here, we can clearly see Alteryx’s significant growth in search volume, albeit off a low base compared to established competitors. Statistica makes a reappearance with such low relative search volume that the CAGR is not meaningful.
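For readers unfamiliar with CAGR, here is a minimal sketch of the standard formula, (end / start) ^ (1 / years) - 1. The helper name `cagr` is ours. The first example converts the 26% fall reported above for SAS into an annual rate, which is our own arithmetic rather than a figure from the study. The second uses made-up numbers to show how a low base inflates growth rates.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    if start_value <= 0 or end_value < 0 or years <= 0:
        raise ValueError("need start_value > 0, end_value >= 0 and years > 0")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# A 26% fall between 2008 and 2015 (7 years), as reported for SAS,
# works out to roughly -4.2% per year.
sas_rate = cagr(100.0, 74.0, 7)
print(f"SAS, annualized: {sas_rate:.1%}")

# Low-base effect (made-up numbers): going from 1 to 4 index points in
# two years is a 100% CAGR, yet the absolute change is only 3 points.
low_base_rate = cagr(1.0, 4.0, 2)
print(f"Low-base example: {low_base_rate:.1%}")
```

This is why growth rates computed from very small search volumes, as for Statistica, say little about real adoption.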

Enterprise analytics vendors such as Microsoft and IBM have responded to the changing market landscape with SaaS or cloud-based solutions. We’ve included these in our study: Microsoft with Cortana Analytics and Azure Machine Learning, and IBM with Watson Analytics.

None of these technologies, except for Watson Analytics, has seen enough search traffic for Google Trends reporting. We did, however, see Watson Analytics, Cortana Analytics and Azure Machine Learning in our StackOverflow analysis. The 2014/2015 year-on-year (YoY) Google Trends and StackOverflow tag growth for all three of these technologies is misleading because it is calculated from low numbers.

Open source: knocking it out of the park

Open source analytics tools have seen significant growth in interest over the last 5 years. Many tools see search volumes and growth rates far exceeding those of proprietary vendors. This tells a compelling story about the future of the data science tools market.

For a sense of scale, we have included Apache Hadoop, a popular distributed data processing technology, and SAS in the Google Trends traffic chart. As expected, Hadoop, R and Python saw the highest search volume.

Python versus R: Tools democratization in action?

Python is a general purpose programming language. Growth in Python search volume, in and of itself, is not indicative of increased Python usage within data science teams. Search data for Pandas, Numpy, and scikit-learn, popular Python data analytics and machine learning add-on packages, more accurately reflects adoption. And here, the growth has been spectacular: a 4 year CAGR for Pandas of 45%, and a 2 year CAGR for scikit-learn of 58%.

While search volume for R grew at a faster pace than Python’s over the period from 2008 (an 8.8% CAGR versus 5.5% for Python), this momentum appears to have tapered: R’s two year CAGR to 2015 was 5.9% versus Python’s 10.5%. And as discussed above, the growth in interest in Python’s analytics packages far outpaced the growth in Python itself.

The growth in use of R and Python-related StackOverflow tags reveals a similar pattern to that of the Google Trends data. These trends may reflect a democratization of advanced analytics tools beyond data scientists. Software engineering teams who do not use R are increasingly using powerful, high-quality tools for data analysis and machine learning.

Spark ignites and Scala follows

While Apache Spark did not feature in Google Trends data prior to 2013, year-on-year growth for 2014-2015 was a phenomenal 121%. Scala, tied closely to the success of Spark, has seen accelerated adoption over the period: from a 4 year CAGR of 8.1% to a 2014-2015 year-on-year growth rate of 12.4%.

GNU Octave on the wrong note

GNU Octave, primarily used in educational settings, was the biggest loser in the open source Google Trends analysis. This may reflect the relatively stable interest in MATLAB, and possibly the growth in use of Python in schools to teach some topics in data science.

Beyond the Magic Quadrant

Looking at proprietary vendors beyond Gartner MQ participants, Tableau, Teradata, and QlikView all saw increased interest. Tableau dramatically so. Meanwhile Mathematica saw a significant drop. All these vendors offer technologies beyond advanced analytics, from business intelligence to data warehousing. It is difficult to determine whether their analytics products are driving this increased interest, or their mainstay businesses.

We noticed some seasonality to MATLAB’s data that may indicate significant use in educational settings, rather than by commercial data science teams.
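One simple way to look for this kind of seasonality is to average the values for each calendar month and compare them with the overall mean. The sketch below is an assumption about how such a check could be done, not the method used in the study. The helper name `seasonal_profile` and the data are hypothetical, with a deliberate dip in July and August to mimic an academic calendar.

```python
from collections import defaultdict
from statistics import mean


def seasonal_profile(monthly):
    """Return {month: mean for that month / overall mean}.

    `monthly` is a list of ((year, month), value) pairs.
    """
    by_month = defaultdict(list)
    for (_, month), value in monthly:
        by_month[month].append(value)
    overall = mean(value for _, value in monthly)
    return {month: mean(values) / overall for month, values in sorted(by_month.items())}


# Made-up monthly interest: 100 during term time, 60 in July and August.
synthetic = [
    ((year, month), 60.0 if month in (7, 8) else 100.0)
    for year in (2014, 2015)
    for month in range(1, 13)
]

profile = seasonal_profile(synthetic)
for month, ratio in profile.items():
    print(month, round(ratio, 2))
```

Ratios well below 1 in the same months every year suggest a seasonal pattern, though a real analysis would also need to separate the seasonal component from the long-term trend.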

Proprietary versus open source: correlation with causation?

When looking at the Google Trends charts we noticed an interesting relationship between search volume for proprietary and open source tools. In 2010, searches for tools from the three largest advanced analytics vendors in our study appear to have reached an inflection point. So did search volume for R, and pandas followed a year later.

We tested the time series data for correlation, and found a strong inverse relationship between R and the proprietary vendors.
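A correlation test of this kind can be sketched with a plain Pearson coefficient. The article does not say which test was used, so treat this as an assumption. The function name `pearson` and the two series are invented for illustration and are not the study's data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up yearly search-interest values: one rising, one falling.
open_source = [10.0, 12.0, 15.0, 19.0, 24.0]
proprietary = [50.0, 47.0, 45.0, 40.0, 36.0]

r = pearson(open_source, proprietary)
print(f"r = {r:.2f}")  # strongly negative for these invented series
```

A value near -1 indicates a strong inverse relationship. Any two series with opposite trends will produce a strongly negative coefficient, so a result like this shows that the series move in opposite directions, not that one causes the other.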

While this result does not imply causation, the actions of proprietary vendors over the last few years provide insight into the impact of open source tools on the advanced analytics market.

Some vendors have invested heavily in supporting open source tools. Microsoft acquired Revolution Analytics, the developers of a high-performance distribution of R, while others, SAS included, have integrated their products with R and Python. This coopetition extends to the cloud and SaaS services discussed above. Many of the analytics services available on Microsoft Azure support both Python and R.

While Microsoft Azure analytics services and IBM Watson are seeing some usage, it’s probable that this reflects adoption by existing customers and engineering teams, rather than by data science teams. The jury is still out on whether these products will see widespread adoption by data scientists.
