We recently caught up with Sean McClure, PhD, Sr. Data Scientist at ThoughtWorks. Sean, first, thank you for the interview. Let’s start with your background and how you became interested in data science.
What is your 30-second bio?
My academic background is in scientific computing, where I worked with high-performance computers and cutting-edge algorithms to calculate molecular properties for nanotechnology-based devices. I fell in love with using computers and data to solve very challenging problems. After my PhD I started my own advanced analytics business, where I used machine learning and database technologies to help businesses compete analytically. After doing that for a few years I joined ThoughtWorks as a Data Scientist, where I currently work. I love writing and public speaking, and enjoy playing the piano.
How did you get interested in Data Science and Machine Learning?
During my PhD I was exposed to a variety of approaches to solving challenging problems on the computer. I quickly recognized the power of machine learning techniques for solving the problems in my domain. I saw that my experience as a scientific researcher could be blended with the approaches used in machine learning to address problems well beyond academia. Data science was just starting to heat up as I graduated and I felt that it was a natural extension of my skills and passion.
Was there a specific “aha” moment when you realized the power of data?
One of my earlier projects was in healthcare and involved building out an application that attempted to correlate a variety of symptoms with disease outcome. Although the product was far from perfect, it opened my eyes to the power of data and how transformational it could be in the world. This was one of those moments when I realized that there is so much important work we can do with data, and that it can have a positive transformational impact on our humanity.
What kind of work are you doing at ThoughtWorks?
At ThoughtWorks I build adaptive applications that help automate decision-making by learning from their environment. I work with software developers to integrate the models I build into the working codebase of real-world applications. At ThoughtWorks, we see this as the next generation of software, where the ideals of agility go beyond design-time and are found throughout the entire lifecycle of the software.
What in your career are you most proud of so far?
The first time I received verification from a business leader that my work added value to the organization. Getting positive feedback from a domain expert that your work made a real impact is a great feeling.
What has been the most surprising insight or development you have found?
I was working for a search engine marketing (SEM) firm where we were trying to find patterns inside massive amounts of market data. We discovered that while many marketing campaigns were dealt with very well by the staff, there were other clusters of campaigns that deviated greatly from commonly-held beliefs. Although this is a core reason why we are there, it is still a delightful surprise to uncover new opportunities in an organization’s data.
What personal/professional projects have you been working on this year, and why/how are they interesting to you?
This year has been filled with a lot of excitement around building automation into existing business processes. I recently finished an application that ingests thousands of manually entered text descriptions and surfaces the core topics being discussed. This is being used to uncover the main failures that occur throughout large systems; failures that over the years have been manually described in documentation. It’s a great example of teaching a machine to do something that would have taken a large number of people and thousands of hours to accomplish. It uses concepts found in natural language processing which I think is going to play a very large role in intelligent data products of the future.
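The interview does not say which method the application uses. As a rough illustration only, here is a minimal Python sketch of surfacing recurring terms from free-text descriptions, assuming a simple TF-IDF term ranking as a crude stand-in for topic extraction. The function names, the stop-word list, and the sample failure descriptions are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "was", "is",
              "after", "during", "at"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def top_terms(documents, k=3):
    """Rank terms by their TF-IDF score summed over all documents; return the top k."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = Counter()
    for tokens in tokenized:
        if not tokens:
            continue
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            # Smoothed inverse document frequency, always positive.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            scores[term] += tf * idf
    return [term for term, _ in scores.most_common(k)]


# Hypothetical manually entered failure descriptions.
descriptions = [
    "pump seal leaked during startup",
    "pump seal failure after overheating",
    "valve stuck, pump seal replaced",
    "sensor drift on valve controller",
]
print(top_terms(descriptions, k=3))
```

Summing scores across documents favors terms that recur in many descriptions, which is the behavior wanted when looking for common failure themes. A production system would more likely use a dedicated topic model or clustering on top of a proper NLP pipeline.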
What does the future of Data Science look like?
We are still in the early stages of defining this field and what it means for organizations looking to compete analytically. I think in the future we’ll see a lot of the hype die down and settle into a solid discipline that turns data into value-delivering products. The need for turning data into value is only going to increase, and the way to accomplish that is by doing quality scientific research; research that leads to models that capture the underlying patterns that drive the domains we are interested in. I am personally working to help define this field in my writing and public speaking, and to ensure that the hype doesn’t interfere with the need for great science on great data.
What publications, websites, blogs, conferences and/or books are helpful to your work?
A great book to return to every now and again is Learning from Data by Yaser S. Abu-Mostafa. It does well to explain the feasibility of making machines that learn and describes many of the core concepts that we apply daily as data scientists. I also regularly use DeepDyve, which allows people to rent academic articles. This way we can stay up-to-date on the latest research without having to pay large fees for journal subscriptions.
What machine learning methods have you found or do you envision being most helpful? What are your favorite tools/applications to work with?
Machine learning forms the core of the algorithmic variety we use as data scientists. I don’t think you can put any one of the methods above another until the specific challenge you are trying to solve exposes some of its secrets. These are like flags that pop up during discovery that hint at methods and approaches that may prove feasible. But ultimately, jumping in with a variety of approaches and letting nature tell you what does and doesn’t work is the key. For tools and applications I first and foremost use R and Python, as these have the richest variety of scientific computing libraries available. This variety is crucial to allowing a scientist to explore the data from a variety of different angles and uncover insight into how to model the system of interest. I also use RStudio when working with R and IPython for Python. When it makes sense to scale the model, I look to tools like Spark that are making great headway in making machine learning scalable and fast. H2O is also starting to look promising as a tool for scaling our science. Beyond that I work with NoSQL databases and the Hadoop ecosystem.
Any words of wisdom for Data Science / Machine Learning students or practitioners starting out?
For those looking to start out in this field, I have two pieces of advice: 1) focus on the core concepts, as these are timeless, and 2) jump in and practice constantly. Failure is the only way to learn. I recently wrote an article on this called The Only Skill you Should be Concerned With.
Sean, thank you so much for your time! Really enjoyed learning more about what you are achieving at ThoughtWorks.