We recently caught up with Jason Toy, CEO of filepicker.io, a cloud file management platform for developers. Previously, Jason was a CTO and co-founder of Socmetrics, a topical influencer platform. As CTO at Socmetrics, Jason and his team supplied clients with business intelligence enabling them to extract more value from their current customers.
How did you become interested in data science?
I have quite a technical background; I have been writing software for almost 15 years. I’ve built software; I’ve done infrastructure stuff, DevOps, and the whole stack of building websites. But the thing that has always been most interesting to me is data: how are we going to use data to improve our products and services? How do we learn from data better? Data analysis is one of the most important parts of any business today. There’s so much data coming in that we need tools to keep track of it all and tools to derive insights. With Socmetrics we focused a lot on customer segmentation and social media, combining those two data sets and then analyzing them. So I spent a long time thinking about and building analytical platforms.
What skills are needed in your day-to-day work?
The number one competency you need to work with data is math. You need to have an extremely strong understanding of mathematical concepts, excellent programming skills, and an analytical mindset. Data science has many different names: data science, machine learning, analysis, big data, artificial intelligence, statistics, etc.
Oftentimes we hear that in order to be a data scientist you need to be a unicorn. Do you agree with that statement?
You don’t need to be a unicorn, but if you can be that unicorn you will be in a very strong position. Typically, most people choose to go into one specific area of data science. There’s the engineering aspect; you can work specifically on cleaning data and moving data around as a full-time job. Then there’s the analysis part, where data scientists are playing with the data, looking for insights, and testing out different hypotheses. The third one is theoretical; that’s when data scientists are actually coming up with algorithms. This is more common in academia and large companies with big research teams; you won’t see it as much in smaller businesses. You can slice up the field into different segments. It’s nice to be a unicorn, but I suggest focusing on your strengths and diving deeper into that specific field.
Which one of those skill sets have you applied to your past analytical work the most?
My skills are stronger on the “applied” or practical side, anywhere from building systems to processing the data to doing the actual analysis. I have an undergraduate computer science degree, but not a PhD where I can actually come up with new algorithms. That’s a tough field. There are a lot of people working on this problem; Google, Microsoft, and all the other large companies have large research teams. Generally, most data scientists won’t be on the research side.
Is there any work you can share with us from a personal perspective?
I’ve built a few interesting products at my previous companies. One of them that I mentioned earlier is Socmetrics and the other one is Truelens. I’ve also built quite a few machine learning-based internal marketing tools at filepicker.io to help us make better decisions with our marketing.
What are some of the challenges with data science in a sophisticated enterprise setting?
One of the challenges for large enterprises is getting a centralized place for all their data. Large enterprises have so many “silos” of data, and typically there’s not enough knowledge sharing. If you can get all your data into one place and sync it with Spark or Hadoop, you can get a lot more cross-pollination of analysis to occur. Besides that, choosing the right tools can be tough. Data science has been around for quite a while, but the machine learning-data science movement is quite new. The buzz picked up in the last couple of years with products like Hadoop. Things have changed a lot since I’ve been working with data full time; Spark is the big hot thing right now, and I have yet to use it.
How do you leverage technology in your analysis?
One of the tools I use is R, along with all kinds of R packages, strictly for doing analysis and machine learning. Also scikit-learn, a Python library that a lot of data scientists integrate into their own workflow to do analysis. Also Hadoop; I programmed a lot of Hadoop projects on top of Amazon’s EMR platform.
Would you have any advice for organizations trying to decide whether to use Hadoop or not?
It all depends on how much data you have. If you’re attempting to get more insights by pumping in more data, I would first try to see if you could actually just solve the problem with a smaller data set. There’s also a trend towards what’s called “small data,” which says: don’t fall into the hype of having too much data. Unless you’re at Google scale and you have that kind of data, in most cases you can get away with, and do better with, less data, because processing the data, developing the algorithms, and actually injecting the data into these systems is a long, tedious process. Even a few gigs or tens of gigs of data is actually “small data,” so I would try that first. A lot of people think they have big data but they really don’t.
If you could wave a magic wand and have one thing be different about data science today, what would it be?
I wouldn’t say magic wand, because technology always changes, but sometimes I certainly wish that we had tools to simplify the lives of data scientists. There are a few platforms out there, including Domino, that I’ve seen do this. But I’ve always dreamed of this magic API where you can just pump in data and it will automatically run all the different algorithms for you, run all the different types of regression tests and statistical significance tests to make sure that you chose the right algorithms, and also make sure that your model is not overfitting the data. When you’re choosing an algorithm to use, you’ll sometimes find that, for example, with this particular set of features, this algorithm, and this data set the system performs better, but with another algorithm and other factors it doesn’t. Oftentimes that gives us false answers because we didn’t properly segment the data or properly test it, so we introduced bias and the system is not learning properly; it’s just learning about that specific data set. There’s a lot of work you have to do as a data scientist to make sure that your system is properly trained. I wish there was an algorithm or an API that could automatically do this for you. This is a secret little project I’m working on :).
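To make the idea of running several algorithms and checking for overfitting a bit more concrete, here is a minimal Python sketch of k-fold cross-validation across a few candidate models. It is only an illustration, not the project Jason mentions: the three toy models, the synthetic data, and every function name (`fit_mean`, `fit_line`, `fit_nearest`, `cross_validate`, `compare_models`) are assumptions made up for this example.

```python
import random
import statistics

def fit_mean(xs, ys):
    """Baseline: always predict the mean of the training targets."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Ordinary least-squares line y = intercept + slope * x."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_nearest(xs, ys):
    """1-nearest-neighbour: memorises the training set, noise included."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given points."""
    return statistics.fmean((model(x) - y) ** 2 for x, y in zip(xs, ys))

def cross_validate(fit, xs, ys, k=5, seed=0):
    """Return (mean training error, mean held-out error) over k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    train_errors, test_errors = [], []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        train_x, train_y = [xs[i] for i in train], [ys[i] for i in train]
        test_x, test_y = [xs[i] for i in fold], [ys[i] for i in fold]
        model = fit(train_x, train_y)
        train_errors.append(mse(model, train_x, train_y))
        test_errors.append(mse(model, test_x, test_y))
    return statistics.fmean(train_errors), statistics.fmean(test_errors)

def compare_models(candidates, xs, ys, k=5):
    """Cross-validate every candidate on the same data."""
    return {name: cross_validate(fit, xs, ys, k) for name, fit in candidates.items()}

# Synthetic data (an assumption for the demo): a noisy straight line.
rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1.0) for x in xs]

CANDIDATES = {"mean": fit_mean, "line": fit_line, "nearest": fit_nearest}
results = compare_models(CANDIDATES, xs, ys)

for name, (train_error, test_error) in results.items():
    print(f"{name:8s} train MSE={train_error:.3f}  held-out MSE={test_error:.3f}")
```

A large gap between training error and held-out error is the warning sign described above: the model is learning that specific data set rather than the underlying pattern. The nearest-neighbour model shows this most clearly, since it fits its own training points exactly.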
What advice do you have for someone entering the field?
Like I said before, make sure you have a strong mathematical background, because data science is pretty math-heavy. You can get around it by using different libraries, but I would try learning and implementing some of the basic algorithms by hand just so you have a basic understanding of how they work. There are lots of data sets online you can play with. I would also try one of the smaller data science competitions on Kaggle. You can either compete with others or download the data, do the analysis on your own, and compare your results at a later time. This will let you know where you stand in terms of progress and also where you fit in the whole data science marketplace. Location also matters; the best place to kick off your career as a data scientist is in the Valley and San Francisco. However, there is also a big data science community online, and most major cities have data science Meetups. If there isn’t a data science group, there’s an artificial intelligence group, which is a slightly different but similar field.
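As one example of what implementing a basic algorithm by hand might look like, here is a small Python sketch that fits a straight line with gradient descent, using no libraries. The choice of algorithm, the function name `fit_linear_gd`, the learning rate, the step count, and the toy data are all assumptions chosen for illustration; they are not from the interview.

```python
def fit_linear_gd(xs, ys, learning_rate=0.5, steps=2000):
    """Fit y = intercept + slope * x by gradient descent on half the mean squared error."""
    intercept, slope = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every point under the current parameters.
        errors = [intercept + slope * x - y for x, y in zip(xs, ys)]
        # Gradients of 0.5 * mean(error^2) with respect to each parameter.
        grad_intercept = sum(errors) / n
        grad_slope = sum(e * x for e, x in zip(errors, xs)) / n
        # Step downhill.
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope

# Toy data lying exactly on the line y = 1 + 2x, with x kept in [0, 1).
xs = [i / 20 for i in range(20)]
ys = [1.0 + 2.0 * x for x in xs]

intercept, slope = fit_linear_gd(xs, ys)
print(f"intercept={intercept:.4f} slope={slope:.4f}")
```

Writing the update loop yourself makes it clear what a library call is doing: if the learning rate is too large for the scale of the inputs, the parameters diverge instead of settling, which is easy to see by changing `learning_rate` and rerunning.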
We’re very grateful to Jason for his time and insight. You can follow Jason @jtoy.