tl;dr: My path into data science involved dabbling in a new programming language, stalking GitHub for a popular machine learning package, doing some networking and finding some other people to teach.

Posted: 2017-05-15

Last week, I had the amazing opportunity of co-presenting a talk called "Navigating the AI Revolution" at Microsoft's developer conference, //build 2017. I spoke about questions we can answer with machine learning and mentioned a little bit about my path into data science. Afterwards, many folks approached me about this path and asked for some more guidance and advice. I thought I'd begin by expanding on my story of how I started out. My hope is that something in here is helpful to you.

Before I was a data scientist, I was a developer. TBH I'm now a developer again and getting to do more data science than ever before. One of my favorite articles of late states that the real prerequisite for machine learning is not math, but rather data analysis skills like data viz and wrangling (link here). But what came next, i.e., how did I actually begin learning data science?

Tip: Start doing data visualization and/or data cleaning
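
If you're wondering what "data cleaning" looks like at its smallest, here is a minimal sketch using only the Python standard library. The data is made up for illustration (it is not from any real dataset or from my own work), and the function names are hypothetical. It tidies inconsistent labels, drops rows with unusable measurements, and then summarizes what's left.

```python
import csv
import io
import statistics

# Hypothetical messy data, invented purely for illustration.
RAW_CSV = """species,petal_length_cm
setosa,1.4
 Setosa ,1.3
versicolor,
versicolor,4.5
VERSICOLOR,4.7
virginica,n/a
virginica,6.0
"""


def clean_rows(raw_text):
    """Normalize species labels and drop rows whose measurement is not a number."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        species = row["species"].strip().lower()
        try:
            length = float(row["petal_length_cm"])
        except ValueError:
            # Empty strings and placeholders like "n/a" end up here.
            continue
        cleaned.append((species, length))
    return cleaned


def mean_by_species(rows):
    """Group the cleaned rows by species and average each group."""
    groups = {}
    for species, length in rows:
        groups.setdefault(species, []).append(length)
    return {species: statistics.mean(values) for species, values in groups.items()}


rows = clean_rows(RAW_CSV)
summary = mean_by_species(rows)
for species, mean_length in sorted(summary.items()):
    print(f"{species}: {mean_length:.2f} cm")
```

In practice you would probably reach for a library like pandas, but starting small like this makes it clear what the cleaning step is actually doing.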

At the beginning, this historic tweet by Josh Wills was both funny and helpful to me -- giving me perspective.

Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.

Before I presented on AI to 700 people at //build, before I taught an ML package at a workshop, before I had read docs and gone through the tutorials and before I had trolled YouTube for ML videos, I had attended the Open Data Science Conference in the fall of 2015. I networked, as we do, and met a group of devs/data scientists/physicists/biologists interested in data science like me (a great, diverse group of super smart people!) and immediately got involved in a Python workshop group (GitHub organization). They were planning a workshop on ML, scikit-learn and TensorFlow for the following year, and they asked me to help give it because I had some basic Python materials they could use.

Tip: Network with those in your field interested in data science

Most of what I had learned was from one single package, scikit-learn, which I chose based on its popularity and GitHub activity. I also studied videos on scikit-learn like this one by Jake VanderPlas and this one by Olivier Grisel. After some YouTube'ing and reading the docs, plus talking about it with other data scientists in my workshop-planning group, I was ready to teach the basics to others. The old adage is true: the best way to learn is to teach.

Scikit-learn is a Python machine learning library with incredible docs that not only explain the package, but actually taught me something useful once I had watched a couple of basic videos like the ones just mentioned (check out this page from the docs). The development of this package is now led by Andreas Müller, a lecturer at the Data Science Institute at Columbia University. He's also a big proponent of women as contributors to scikit-learn and open source, openly encouraging more and more women to contribute.

Tip: Pick a package to learn that is popular and active

In May of 2016, I co-instructed a workshop (close to 200 folks, mostly women) at a Google office in Mountain View, CA. They thought it was funny when I said "I'm now in the belly of the beast" (I'm a Microsofty). I was still a new data scientist, but I knew how to use a powerful ML tool to understand some simple data (the iris dataset) and could communicate that to others.
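
To give a flavor of what "using an ML tool on the iris dataset" means, here is a minimal sketch with scikit-learn. This is not the workshop material itself. The choice of a k-nearest-neighbors classifier, the 25% test split, and the random seed are all my own assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the iris dataset that ships with scikit-learn:
# 150 flowers, 4 measurements each, 3 species.
iris = load_iris()

# Hold out a quarter of the flowers so the model is scored on data it has not seen.
# random_state is fixed only to make the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0, stratify=iris.target
)

# k-nearest neighbors is an illustrative choice; other classifiers follow
# the same fit/score pattern.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

The fit-then-score pattern is the same across scikit-learn's estimators, which is a big part of why the package is approachable for newcomers.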

I didn't start out teaching or creating courses by any means. I started out by being curious about analytics, for many years doing genomics and proteomics. I started dabbling in Python as necessary and a little R for my analyses and visualizations. I chose to write some teaching material because I wanted to learn Python better (you can start out small even with a readme or two on GitHub) and off I went.

Tip: Pick a language like R or Python and plan to create some training material on something analytical for work or fun

All of this certainly required experience in a language that was new to me, Python, which the data science community has adopted, along with R, as a go-to programming language for its intuitive syntax, readability, and existing math libraries, among other reasons. I had been dabbling in Python for a few years, doing basic math and web programming. I really grokked Python, however, after I had created a Python course of my own (find it here). See a theme here? :)