by Anna Anisin on June 3rd, 2015
Both Python and R are popular open source languages for performing data science tasks. As a consequence, we at DataCamp often get questions from our students on whether they should use one or the other for their statistical chores.
This post describes some of the basic differences between the two languages and the places they occupy in the world of data science, based on the research we did for our recent infographic "Data Science Wars: R vs Python".
DataCamp is an online interactive education platform that offers tutorials in data science and R programming. Each R tutorial is built around a certain data science topic, and combines video instruction with in-browser coding challenges so that you can learn by doing. You can start every course for free, whenever you want, wherever you want.
Introducing the participants
This section highlights the history and benefits of each language. It will become clear that R’s functionality was developed with statisticians in mind, while Python is more of a general-purpose programming language, praised for its simple syntax and gentle learning curve.
Intro to Python
Python is a great first language for programmers who want to expand towards data science, especially if they are working in an engineering environment. It is a flexible language that emphasizes readability and productivity. Its rise in popularity in the data science community has been matched by a growth in data science packages on the Python Package Index, PyPI.
For those who sometimes participate in data science quiz contests: Python was created by Guido van Rossum in 1991, and the name is indeed inspired by Monty Python =]
Intro to R
R was created in 1995 by Ross Ihaka and Robert Gentleman and finds its roots in an older language called S. As with many programming languages, the initial traction came from academia and research, but in recent years the enterprise world has been rapidly discovering R as well. Just like Python’s PyPI, R also has a package index: CRAN. Here, you can find over 6600 curated packages.
R’s success is strongly linked to its huge community. The R community is a breeding ground for the rapid adoption and implementation of new ideas and statistical techniques. This gives R a clear advantage over commercial software packages such as SAS or SPSS, which are by nature slower innovators, and over other open source communities that do not yet have the bandwidth to support all the latest progress fast enough.
When and how to use R or Python?
Want to easily integrate statistical tasks with web apps or into a production database? Consider Python. As a full-fledged programming language, Python is a great solution for the implementation of algorithms for production use.
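To make the idea of a statistical task living inside an application a little more concrete, here is a minimal sketch that uses only Python's standard library. The function name `summarize` and the sample values are hypothetical and chosen for illustration; a real web app would return the resulting string from whichever framework it uses.

```python
import json
import statistics


def summarize(values):
    """Return basic summary statistics for `values` as a JSON string."""
    summary = {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }
    return json.dumps(summary)


# Hypothetical measurements, for illustration only.
response_body = summarize([2, 4, 4, 4, 5, 5, 7, 9])
print(response_body)
```

Because the result is plain JSON, the same function could plausibly feed a web endpoint, a scheduled job or a database loader without changes.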
Want to do some exploratory work? Dealing with a data analysis task that requires stand-alone computing or analysis on individual servers? Consider R. It's handy for almost any type of data analysis, because of the huge number of packages.
Going for Python?
Then your next step is to create a toolbox of data analysis libraries. In the past, the immaturity of Python's data analysis packages was an issue, but over the years this has improved significantly. Consider the following packages, depending on the type of work you need to do:
- NumPy/SciPy for scientific computing
- pandas to make Python usable for data analysis
- matplotlib to make graphics
- scikit-learn for machine learning
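As a quick way to see which of the packages above are already available in your environment, the following sketch may help. It relies only on the standard library's `importlib`; the helper name `check_toolbox` and the `TOOLBOX` mapping are our own illustrative choices, not part of any of these libraries.

```python
import importlib.util

# Import names for the packages listed above. Note that scikit-learn
# is imported under the name "sklearn".
TOOLBOX = {
    "NumPy": "numpy",
    "SciPy": "scipy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "scikit-learn": "sklearn",
}


def check_toolbox(packages=TOOLBOX):
    """Map each package name to True if it can be imported, else False."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in packages.items()
    }


for name, installed in check_toolbox().items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

Anything reported as missing can usually be installed with your package manager of choice, for example pip or conda.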
Going for R?
Install the fantastic RStudio IDE and have a look at the following packages:
- dplyr, plyr and data.table for data manipulation,
- stringr to manipulate strings,
- zoo for time series,
- ggvis, lattice, and ggplot2 for visualization, and
- caret for machine learning
Advantages and disadvantages of Python
- Just check out IPython Notebook and be amazed. It makes it easy to work with Python and data.
- Python is a general purpose programming language. The easy syntax increases the speed at which you can write a program. So more play time =]
- Great testing framework with low barrier to entry.
- Python is a multi-purpose language, thereby bringing people with different backgrounds together. How cool is it to be able to build a single tool in one language that integrates with every part of your workflow!
- Visualizations are important. Python has visualization libraries (Seaborn, Bokeh and Pygal) but R’s are more flexible and (in our opinion) more aesthetically pleasing.
- Python is a challenger to R in data science. It does not offer an alternative to the hundreds of R packages, making it unclear whether people will give up on R.
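To illustrate the point about testing in the list above, here is a small sketch using `unittest` from Python's standard library. We are assuming this is the kind of framework meant; the `mean` function and the test class are hypothetical examples.

```python
import unittest


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


class MeanTest(unittest.TestCase):
    def test_mean_of_integers(self):
        self.assertEqual(mean([1, 2, 3, 4]), 2.5)

    def test_empty_input_is_rejected(self):
        # An empty sequence has no mean, so an error is expected.
        with self.assertRaises(ValueError):
            mean([])
```

Saved in a file, these tests can be run from the command line with `python -m unittest`, with no extra setup required.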
Advantages and disadvantages of R
- R is probably the best statistical tool in the world for data visualization.
- R has a rich community and ecosystem of packages across fields such as finance, pharmaceuticals, actuarial analysis, web technologies, machine learning, etc. Search through all R packages at Rdocumentation.
- R is developed by statisticians for statisticians, making it the lingua franca of data science.
- R can be experienced as slow due to poorly written code. However, there are solutions such as pqR, Renjin and FastR, which are alternative implementations of R.
- A steep learning curve. This is especially true if you come from a GUI world such as Excel.
Comparing adoption and popularity numbers for Python and R is not that straightforward. Since Python is used as a general-purpose language, its applications are more numerous (mainly web development) which inflates some of the numbers.
Instead of using these numbers for comparison, it’s better to see them in the light of how these two languages are evolving in the overall ecosystem of computer science.
To get a more detailed look at how R and Python compare in a data science environment, we looked at polls conducted by parties such as KDnuggets. If you look at recent polls, R is often a clear winner.
In addition, other figures indicate that more people are switching from Python to R than from R to Python. It also seems that a growing number of people are using a combination of both languages, which is exactly in line with our recommendation.
Either way, if you plan to start a career in data science, you’ll do well with either language: job trends indicate an increasing demand for both skills, and wages are well above average.
In general, the most important thing to remember is that choosing between Python and R depends on the type of task you need to get done. It’s your job to pick the language that best fits your needs. Just make sure to ask the right questions, such as “What problem am I trying to solve?”, “What costs are involved in switching between languages?”, “What tools are standard in the field?”, etc.