Using R and Python for Common SAS Functions


SAS is the recognized incumbent in the analytics, statistics, and data science tool space. As the software celebrates its 50th birthday this year, it has evolved into a broad suite of tools and approaches that tries to do everything: from basic inference to the most complex clinical trials, SAS aims to provide a framework for everyone. Even with 50 years of code (or perhaps because of 50 years of code), there are some areas where SAS may be falling behind.

People interested in data science have been watching open source statistical environments develop as alternative solutions for full-cycle data science programs. Over the last five years, two contenders, R and Python, have proven themselves capable and worthwhile investments, both professionally and organizationally.

SAS programmers are accustomed to a number of great tools in their environment and have worried that R and Python can't do all that they need. This blog post looks at alternatives to SAS approaches for basic functions and operations. There are entire textbooks written on higher level operations, such as Applied Predictive Modeling and Data Visualization, so instead we are going to focus on some of the smaller issues that may cause a transitioning SAS programmer some challenges in R and Python.

Jupyter Notebooks

There is no doubt that Jupyter notebooks have changed how data scientists interact with their data and create reproducible results. SAS can now be used as a Jupyter kernel! Using the open-source sas_kernel, SAS users get the same cell-based tools that have been available in the Python and R ecosystems via Jupyter. This way of interleaving code, narrative, and results leads to some fantastic collaboration use cases, and it's exciting to see SAS embrace this new method of interaction.

Loading Data

In R and Python, loading data from a CSV file is trivial with their built-in tools. Often, however, transitioning SAS developers will have data files in the sas7bdat format. Fortunately, there is a package in each language that makes this transition much easier for new R/Python users.
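As a quick illustration of the CSV case, here is a minimal pandas sketch. It reads from an in-memory buffer so the example is self-contained; in practice you would point read_csv at a real file, and the column names here are illustrative.

```python
import io

import pandas as pd

# In practice you would call pd.read_csv("data/cars.csv"); an in-memory
# buffer stands in for the file here so the snippet runs anywhere.
csv_text = "MPG,Cylinders,Displacement,Weight\n18,8,307,3504\n15,8,350,3693\n"
cars = pd.read_csv(io.StringIO(csv_text))

print(cars.shape)           # (2, 4)
print(list(cars.columns))   # ['MPG', 'Cylinders', 'Displacement', 'Weight']
```

R is just as terse: read.csv("data/cars.csv") returns a data.frame directly.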


The haven package provides an interface that makes it easy and fast to load data from a sas7bdat file into an R data.frame object.



library(haven)

cars_data <- read_sas("data/cars.sas7bdat")
head(cars_data)
1  18   8 307 3504
2  15   8 350 3693
3  18   8 318 3436
4  16   8 304 3433
5  17   8 302 3449
6  15   8 429 4341

This is now a regular R data frame, and we can use it for whatever analysis we would usually do in R. For example, after loading the GGally package we can plot a generalized pairs plot of the cars data with a single command.


library(GGally)
ggpairs(cars_data)

[Generalized pairs plot of the cars data]

Similarly, Python has the sas7bdat package, which lets a user read a sas7bdat file and turn it into a pandas DataFrame in a single command.

from sas7bdat import SAS7BDAT

cars = SAS7BDAT('data/cars.sas7bdat').to_data_frame()
cars.head()
0 18 8 307 3504
1 15 8 350 3693
2 18 8 318 3436
3 16 8 304 3433
4 17 8 302 3449

This opens up data that was previously only accessible inside the SAS environment to the full suite of tools available in Python. Here we can simply call describe and get a table of summary statistics. We can then use scikit-learn to fit a linear regression! Let's see how well we can predict MPG from all the other columns in this table.

cars.describe()
              MPG   Cylinders  Displacement       Weight
count  392.000000  392.000000    392.000000   392.000000
mean    23.445918    5.471939    194.411990  2977.584184
std      7.805007    1.705783    104.644004   849.402560
min      9.000000    3.000000     68.000000  1613.000000
25%     17.000000    4.000000    105.000000  2225.250000
50%     22.750000    4.000000    151.000000  2803.500000
75%     29.000000    8.000000    275.750000  3614.750000
max     46.599998    8.000000    455.000000  5140.000000

import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
%matplotlib inline

# Hold out roughly 20% of the rows as a test set.
msk = np.random.rand(len(cars)) < 0.8
train_cars = cars[msk]
test_cars = cars[~msk]

regr = linear_model.LinearRegression()
regr.fit(train_cars.drop('MPG', axis=1), train_cars['MPG'])

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
      % np.mean((regr.predict(test_cars.drop('MPG', axis=1)) - test_cars['MPG']) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(test_cars.drop('MPG', axis=1), test_cars['MPG']))
('Coefficients: \n', array([ 0.05076733, -0.01593617, -0.00597654]))
Residual sum of squares: 14.48
Variance score: 0.64

Not bad! There are more advanced models that you could use in scikit-learn, from random forests all the way through gradient boosted trees. These more advanced models are outside of the scope of this post, but it's good to know that they're all available right within Python!

A Survey of Some Common SAS String Processing Functions

The rest of this post is a quick survey of the most common functions used in SAS, based on the SAS-provided document A Survey of Some of the Most Useful SAS® Functions. It should give you a handy way to translate your SAS code into either R or Python.


LENGTHN and LENGTHC

These two functions return information about the length of character values. LENGTHN returns the length of its argument, not counting trailing blanks. LENGTHC returns the storage length of a character variable.



strings <- c("one", "two", "three", "four", "five with spaces", "six with trailing spaces    ")
lengthc <- nchar(strings)


lengthn <- nchar(trimws(strings, which = "right"))
[1]  3  3  5  4 16 28
[1]  3  3  5  4 16 24


strings = ["one", "two", "three", "four", "five with spaces", "six with trailing spaces    "]
lengthc = [len(s) for s in strings]

lengthn = [len(s.rstrip()) for s in strings]
print(lengthc)
print(lengthn)

[3, 3, 5, 4, 16, 28]
[3, 3, 5, 4, 16, 24]


FIND

The FIND function searches the string given as its first argument for the first occurrence of the substring you supply as the second argument. If the substring is found, the function returns its position. It's important to note that indexes in R are 1-based, whereas in Python they are 0-based.



needle <- "asdf"
haystack <- "werjliwejltiawjelij asdf liwejrliwaelrijawer"

pos = regexpr(needle, haystack)
print(paste("It found it at: ", pos[[1]]))

pos = regexpr("will not find it", haystack)
print(paste("It did not find it at: ", pos[[1]]))
[1] "It found it at:  21"
[1] "It did not find it at:  -1"


needle = "asdf"
haystack = "werjliwejltiawjelij asdf liwejrliwaelrijawer"

pos = haystack.find(needle)
print("It found it at %d" % pos)

pos = haystack.find("will not find it")
print("It did not find it at: %d" % pos)
It found it at 20
It did not find it at: -1

It's important to note that while both Python and R return -1 when the substring is not found, they report different positions for a found substring: R's position is 1-based, Python's is 0-based.
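If you want the two languages to agree, one option (a small sketch, not a standard library feature; the function name is illustrative) is to shift Python's result to the 1-based convention, handling the not-found case separately:

```python
haystack = "werjliwejltiawjelij asdf liwejrliwaelrijawer"

def find_1based(haystack, needle):
    """Return a 1-based position like R's regexpr, or -1 when absent."""
    pos = haystack.find(needle)  # Python's 0-based find; -1 if not found
    return pos + 1 if pos != -1 else -1

print(find_1based(haystack, "asdf"))              # 21, matching the R result
print(find_1based(haystack, "will not find it"))  # -1
```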


SUBSTR

If you need to extract a substring from a string, the SUBSTR function is the way to go. R and Python both provide this as a standard feature that makes string manipulation easy.


The R function, however, also has a cool capability: you can assign to a substring!


my_string <- "i love to eat potatoes"
the_word_eat <- substr(my_string, 11, 13)
print(the_word_eat)

substr(my_string, 11, 13) <- "fry"
print(my_string)

substr(my_string, 11, 13) <- "cook"
print(my_string)

[1] "eat"
[1] "i love to fry potatoes"
[1] "i love to coo potatoes"

You will notice, however, that if you try to assign more characters than the substring spans, R truncates the replacement to fit.


Python does not expose assignable substrings this way, since its strings are immutable. Its slicing syntax is more concise, though, as it's inherited from array indexing.

my_string = "i love to eat potatoes"

the_word_eat = my_string[10:13]
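Since Python strings are immutable, the usual way to emulate R's substring assignment is to build a new string from slices. A short sketch:

```python
original = "i love to eat potatoes"

# Characters 10..12 (0-based) hold "eat"; build a new string around them.
fried = original[:10] + "fry" + original[13:]
print(fried)   # i love to fry potatoes

# Unlike R's substring assignment, a longer replacement simply grows the string.
cooked = original[:10] + "cook" + original[13:]
print(cooked)  # i love to cook potatoes
```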



SCAN

You use the SCAN function to parse (take apart) a string. The first argument is the string you want to parse. The second argument specifies which "word" you want to extract. The third (optional) argument is a list of delimiters. This function is easily replaced with R and Python code using simple string splitting; regular expressions offer a much more powerful approach, but this shows the simple case.



my_string <- "i love to eat potatoes"

the_word_eat <- unlist(strsplit(my_string, " "))[4]
[1] "eat"


my_string = "i love to eat potatoes"
print(my_string.split(" ")[3])
eat

You will notice that in the sentence "i love to eat potatoes", eat is the 4th word: R indexes it with 4, while Python indexes it with 3. This is just one of the small surprises caused by languages with different origin indexing.
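SAS's SCAN also accepts a list of delimiters as its third argument. A hedged Python sketch of that behaviour uses re.split with a character class (the function name scan here is just illustrative, not a library call):

```python
import re

def scan(s, n, delims=" ,;"):
    """Rough analogue of SAS SCAN: split s on any of the delimiter
    characters and return the nth word, counting from 1."""
    # Filter out empty pieces produced by runs of adjacent delimiters.
    words = [w for w in re.split("[" + re.escape(delims) + "]", s) if w]
    return words[n - 1]

print(scan("i love,to;eat potatoes", 4))  # eat
```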


UPCASE, LOWCASE, and PROPCASE

These three functions change the case of their argument. UPCASE and LOWCASE are pretty obvious. PROPCASE (short for "proper case") capitalizes the first character of each "word" and sets the remaining letters to lowercase.


R does not provide a function like PROPCASE, but it's easy to write one.


my_string <- "i LOVE to eat potatoes"

upcase <- toupper(my_string)
print(paste("upcase:", upcase))

lowcase <- tolower(my_string)
print(paste("lowcase:", lowcase))

f_propcase <- function(x) {
  s <- strsplit(x, " ")[[1]]
  paste(toupper(substring(s, 1, 1)), tolower(substring(s, 2)),
        sep = "", collapse = " ")
}

propcase <- f_propcase(my_string)
print(paste("propcase:", propcase))
[1] "upcase: I LOVE TO EAT POTATOES"
[1] "lowcase: i love to eat potatoes"
[1] "propcase: I Love To Eat Potatoes"


Python has the standard upper- and lowercase functions as methods on the string object. It also has a super cool method called title, which capitalizes every word in the string, as if it were the title of a book or movie!

my_string = "i love to eat potatoes"

print(my_string)
print(my_string.title())
i love to eat potatoes
I Love To Eat Potatoes
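One caveat before treating title as a drop-in for PROPCASE: it capitalizes after any non-letter, so words with apostrophes come out oddly. The standard library's string.capwords, which splits on whitespace and capitalizes each piece, is often closer to what you want:

```python
import string

s = "don't put ALL your eggs in one basket"

# title() restarts capitalization after the apostrophe.
print(s.title())           # Don'T Put All Your Eggs In One Basket

# capwords() capitalizes each whitespace-separated word and lowercases the rest.
print(string.capwords(s))  # Don't Put All Your Eggs In One Basket
```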

Looking Forward

SAS is a long way from being dethroned as the leader in this space. With a decades-old code base and a large installed base, SAS will keep going for the foreseeable future.

At the same time, the domain and capabilities of open data science platforms are improving all the time. The open source community is making incredible strides in adding more and better capabilities, and these tools are often free. And tens of thousands of new data scientists are being trained each year, most coming up through programs that use free tools instead of SAS.

Data science leaders thinking about the future of their own data science program should keep this in mind as they consider their longer term platform investments.