My path to becoming a full-stack data scientist, helpful data science tools and resources, machine learning competitions, our obligation to use data science for good, and more
In October 2020, I was interviewed by DrivenData, an organization that hosts data science competitions for good, one of which I placed third in while teaching myself data science. A condensed, edited version of the interview appeared on DrivenData’s blog; here are my full, unedited answers.
Even if you don’t make it through this article (there’s a reason my responses were condensed), I’d recommend checking out DrivenData’s challenges to get experience with realistic data science problems that try to benefit society. As I try to make clear in this interview, data science is a powerful tool, one that should be put to use in the best interests of our world.
Data Science Career
What do you do professionally?
I’m currently a full-stack data scientist at Cortex Building Intelligence, a startup that helps office buildings operate more energy-efficiently. Full-stack means I’m responsible for the whole data science pipeline, taking concepts from idea through to machine learning models providing continuous, real-time operating recommendations. (Here’s a good article on end-to-end data scientists). The work is very rewarding, in terms of both the day-to-day construction of machine learning systems and the larger mission of reducing carbon dioxide emissions while saving building owners significantly on energy bills. Energy efficiency is a field where the best decision for the environment is also the best for your wallet and where a little data science can go a long way.
How did you get started in data science?
We tend to look at the past and draw a straight line connecting disparate events, making it seem like the way things turned out was inevitable (the narrative fallacy). It’s tempting to tell a neat story about how I got into data science, but it really was a random series of unexpected events that led me to this career. In college, I studied mechanical engineering but had two terrible internships that convinced me to change fields. While working one of those internships at NASA, I decided to explore this new field of data science I heard so much about. Since I didn’t have real responsibilities, I spent almost every hour on the job taking Udacity courses in data science and machine learning.
After the internship, I went back for my senior year, and while it was too late to switch my major, I was able to get on a research project using machine learning to help buildings save energy. Little did I realize this project would precisely match a DrivenData challenge and, eventually, my career at Cortex. I spent that year self-studying data science, writing an article a week, and competing in the Power Laws: Forecasting Energy Consumption DrivenData challenge. Coming in third in that competition was a huge confidence boost and convinced me I could make it in this field even without formal university training. After graduation, I was fortunate to be hired as a data scientist at FeatureLabs (the maintainers of the featuretools open-source library, since acquired by Alteryx) on the strength of my blogs and projects. A few months later, I came to work at Cortex, where I’ve been for almost two years.
So, the clean story is: a disillusioned mechanical engineer teaches himself data science, works on a data science research project, writes some blog posts, and then goes to work for a startup changing the world with machine learning. The reality was a disconnected bunch of events that ended with me as a data scientist through minimal planning on my part. In hindsight, the one smart thing I did was write publicly about data science projects. The takeaway is twofold: you don’t need university courses to be a data scientist, and you should be prepared to seize the unexpected opportunities that come your way.
Data science is a broad field. What areas are you particularly interested in (e.g. computer vision, NLP, geospatial analysis)? Are there particular problems you’re interested in solving?
The applied aspect of data science, meaning how we can get machine learning models into production and put the results in front of decision-makers, interests me the most. I’m thankful for researchers who develop the algorithms (the random forest being my go-to), but I’m more concerned about changing real-world outcomes with data science than algorithmic advances. That means I spend a lot of time thinking about infrastructure, including databases, cloud computing, monitoring, and what is generally considered software engineering: testing, deployment, writing high-quality software. At the end of the day, you can use the most cutting-edge algorithms, but if you can’t get those results to the people who need them when they need them, you can’t improve real-world systems.
There are so many fields with so many problems that need data science — healthcare, education, government, transportation, housing — but I’ve chosen to focus on energy, in particular energy efficiency. Climate change is the defining problem of our generation, and data science can help immensely on both the supply (clean energy sources) and demand (reducing energy needs) sides. Admittedly, I’m somewhat of an idealist, and I firmly believe data science should be used to make the world a better place. It’s frustrating seeing all the talent and effort devoted to getting people to click on ads, spend more time on social media, or buy more consumer goods. I want to tell my grandkids I was part of the solution, not the problem, and climate change is where I can have the most positive impact. Each of us has a choice to make, and I believe our obligation is not to maximize material gain but to ensure humanity’s flourishing far into the future.
Useful Tools and Resources
Any good tools, posts, or projects from other developers that you appreciate or think the community might enjoy?
There are too many great things in data science to list, but a few of my favorites are:
- Editor: VSCode for development, classic Jupyter Notebook for exploration
- Data science book: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron — taught me 90%+ of what I know about data science
- PSQL CLI: pgcli — excellent tab-completion + nice aesthetics, what’s not to like?
- Blog: Normcore Tech — talks about tech without all the hype
- Code autocompletion: Tabnine — worryingly-accurate AI code completion
- Python package: pandas — the one library for working with data
- Visualization library: plotly — switching to plotly from matplotlib was life-altering
- YouTube channel: sentdex — made me fall for Python
- Podcast: AI Podcast with Lex Fridman — good in-depth conversations on AI
- Udacity course: Data Analyst Nanodegree — much better than any of my university courses
- Data Science Newsletter: The Data Science Roundup by Tristan Handy — filters through all the articles to send you a few quality ones every week
- Ethics checklist: Deon — every organization should have a data science code of ethics
Machine Learning Competitions
What motivated you to join a DrivenData competition or to continue participating in challenges?
The combination of gaining experience solving actual problems with data science and working on a project with socially beneficial objectives initially drew me to DrivenData. The first competition I participated in was Forecasting Energy Consumption, part of the Power Laws series of challenges. Seeing that I could predict actual energy usage with machine learning was a transformational experience. I saw that what I had read about and studied could be put to use. Working with the toy datasets used in books and college classes gets tedious pretty quickly, but DrivenData provided real-world data and the reward of tackling more ambitious problems. That first competition cemented not only my desire to work in the data science field, but also my desire to use machine learning for positive ends.
I continue participating in challenges to keep my skills up-to-date. There’s always something new on the data science scene, and, while I mostly use proven, battle-tested methods in my job, it’s still fun to see the latest developments and try to put them to practice. It feels like if you’re not continually learning in data science, you’re falling behind (The Red Queen’s Race). The community aspect of challenges is also a large draw — many of the competitions are more like collaborations (especially if you’re not in it for the cash), and it’s enjoyable to work with others. Combine the prospects of improving my skill set, doing a small part to make the world better, and connecting with others in the field, and DrivenData competitions are a win all-around.
Is there a particular DrivenData challenge you’ve enjoyed working on?
The demand forecasting challenge was my favorite. It was the first time I took the material I’d learned through Udacity courses and books and applied it to actual data. The idea that you could relatively easily build machine learning models that accurately predicted building energy use was a revelation to me.
The problem was straightforward as far as data science problems go: use weather, date, time, and employee occupancy patterns to predict energy use at office buildings. My approach was to combine all the data sources, do a little feature engineering, start off with simple machine learning models, progress to more complex models, and do hyperparameter optimization. I made sure to save some of the data for validation, which helped reduce overfitting, a nearly universal problem in these data science challenges. In hindsight, I probably spent too much time trying to tune the model hyperparameters or use complex models when I should have devoted that time to learning more about the domain and building relevant features. Generally, I’ve found that the return on time invested is much higher for feature engineering than model tuning. That’s something I came to learn through the challenge and has shaped my approach to data science problems since.
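The workflow described above (engineer a few features, start with a simple baseline, score everything on held-out data) can be sketched in a few lines. This is a minimal illustration, not the competition code: the readings are synthetic, the helper names `make_features` and `synthetic_readings` are hypothetical, and the "model" is just a per-group average standing in for whatever estimator you would actually use.

```python
import random
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def make_features(ts):
    """Derive simple calendar features from a timestamp (illustrative choices)."""
    return (ts.hour, ts.weekday() >= 5)  # (hour of day, is_weekend)


def synthetic_readings(n_hours, seed=0):
    """Made-up hourly energy readings: higher during weekday office hours."""
    rng = random.Random(seed)
    start = datetime(2020, 1, 6)  # a Monday
    rows = []
    for i in range(n_hours):
        ts = start + timedelta(hours=i)
        hour, is_weekend = make_features(ts)
        occupied = (not is_weekend) and 8 <= hour < 18
        rows.append((ts, 50.0 + (40.0 if occupied else 0.0) + rng.gauss(0, 2)))
    return rows


def mean_absolute_error(actual, predicted):
    return mean(abs(a - p) for a, p in zip(actual, predicted))


rows = synthetic_readings(24 * 7 * 4)  # four weeks of hourly data
split = 24 * 7 * 3                     # hold out the final week for validation
train, valid = rows[:split], rows[split:]

# Baseline: always predict the overall training mean.
global_mean = mean(y for _, y in train)

# Feature-based model: mean usage for each (hour, is_weekend) group.
groups = defaultdict(list)
for ts, y in train:
    groups[make_features(ts)].append(y)
group_means = {key: mean(values) for key, values in groups.items()}

actual = [y for _, y in valid]
baseline_mae = mean_absolute_error(actual, [global_mean] * len(valid))
feature_mae = mean_absolute_error(
    actual,
    [group_means.get(make_features(ts), global_mean) for ts, _ in valid],
)

print(f"baseline MAE: {baseline_mae:.2f}, feature model MAE: {feature_mae:.2f}")
```

The split is chronological rather than random because the data is a time series: the model is scored on a week it has never seen, which is closer to how a forecast would be used. Two domain-driven features are enough to beat the baseline here, which is the point about feature engineering paying off before model tuning.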
Did you have any domain expertise, or any particular insights from exploring the data, that helped you solve the problem?
I wish I’d had some domain knowledge before working on the competition. At the time, I bought into the idea that data scientists didn’t need to know about the subject matter and could just let the model learn from the data. It turns out that approach doesn’t work so well, especially with smaller datasets. Now, I know the value of incorporating domain knowledge into data science through engineered features.
What was the most important technical tip or trick you used to solve the problem? Non-technical?
Save a validation set or, better yet, use cross-validation to score models. It’s easy to overfit if you’re only working with the training set and judging model performance by the public leaderboard. At the end of the competition, I think I was in 5th place on the public leaderboard but ended up in 3rd place on the private leaderboard (and in the money!). The public leader ended up out of the cash, indicating severe overfitting to the public leaderboard. A lot of these challenges seem to be won with “tricks” like ensembling models, but I didn’t go that far; I mostly stuck to relatively basic models and made sure to use all the provided data. This approach may have prevented me from taking first, but it’s proven invaluable in my day job, where a simple model that we can deploy is much preferable to a complex ensemble model that proves impossible to productionize (illustrated by the Netflix Prize competition, where the winning model was never used).
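Here is a small, dependency-free sketch of why cross-validation is a better judge than the score on data the model has already seen. Everything in it is an assumption made for illustration: the data is synthetic, the function names are hypothetical, and a 1-nearest-neighbour "memorizer" stands in for an overfit model. In practice you would more likely reach for scikit-learn's `cross_val_score` than hand-roll the folds.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for a straight line; returns a predict function."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return lambda x: slope * (x - mx) + my


def fit_memorizer(xs, ys):
    """1-nearest-neighbour: predicts the target of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mae(model, xs, ys):
    return mean(abs(model(x) - y) for x, y in zip(xs, ys))


def cross_val_mae(fit, xs, ys, k=5):
    """Average held-out error across k folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(
            mae(model, [xs[i] for i in held_out], [ys[i] for i in held_out])
        )
    return mean(scores)


# Synthetic data: a noisy straight line.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

line_train = mae(fit_line(xs, ys), xs, ys)
memo_train = mae(fit_memorizer(xs, ys), xs, ys)
line_cv = cross_val_mae(fit_line, xs, ys)
memo_cv = cross_val_mae(fit_memorizer, xs, ys)

print(f"line:      train MAE {line_train:.2f}, cross-validated MAE {line_cv:.2f}")
print(f"memorizer: train MAE {memo_train:.2f}, cross-validated MAE {memo_cv:.2f}")
```

Judged on the training data alone, the memorizer looks perfect because it simply returns the answers it has stored. The cross-validated error tells the opposite story, and that gap is the same one that separates a public leaderboard position from a private one.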
Advice and Data Science Community
What advice (career or otherwise) would you give to your younger self?
Prioritize mental health. Reach out and get help as soon as you need it. There’s nothing stigmatizing about mental health issues, and there are effective ways of addressing these very real problems.
Besides your internships, no employers will look at your resume, so spend less time worrying about grades and more time enjoying the college experience. Who you know is often more important than what you know, so spend at least as much time forming relationships as studying. You don’t understand something until you can implement it, so spend less time reading and more time doing. Have a bias towards action.
Explaining a topic to others is also a good test of whether you know the concept, and communication is one of the essential skills you’ll need in any career. Share your results early and often to get feedback and spot issues while they can be quickly addressed. Constructive criticism is invaluable — always be asking what you can do better.
What hurdles have you had to overcome to become a data scientist? What advice would you give to others facing the same challenges?
There aren’t yet established best practices for many aspects of data science, so if you’re looking for the optimal solution, you’re probably not going to find a consensus. As a perfectionist, I often struggled because I couldn’t find the ideal answer. I probably spent too much time reading about methods and not enough time trying them out. Data science is mostly an empirical (as opposed to theoretical) field, which means that if you want to know whether a given method will work, you just have to try it.
Being self-taught, I often worried I didn’t have the credentials to make it as a data scientist. That first DrivenData competition helped boost my confidence, as did writing about my data science projects and seeing the positive feedback (and occasional constructive criticism). I’d tell others who don’t have the largely meaningless yet highly-valued pieces of paper known as degrees to do projects with real-world data and make them public (blog posts). Having a body of work to point to can help you get in the door and gives you a chance to demonstrate your skills. Moreover, be ambitious when applying for jobs. My current position asked for 5–10 years of data science experience (as if anyone had that much at the time!); I had three months at another startup and nine months of research. Fortunately, my oeuvre of data science projects spoke volumes, and I was given a chance to prove myself with an assignment. That might not have been possible at a larger company, but still, try reaching out and pointing to your public work.
Anything else you want to share with the community?
Data science is a powerful tool, and as practitioners, we have a responsibility to use that power for good. DrivenData demonstrates a few of the numerous fields in which data science can make the world a better place. I’d encourage every individual to think about their values and whether their work aligns with those values. I don’t think about saving the world in my daily work, but when it gets tough, I take solace in the broader picture of what Cortex is doing to address climate change. Working for a company where the mission agrees with my morals is extremely rewarding. Individual choices do matter, and every data scientist who goes to work for a non-profit, government, cleantech startup, or company working to make healthcare more efficient is a win for humanity.
Have you read any good books or articles recently?
Books are the best tools we have for sharing knowledge so I try to read a good deal across a wide range of subjects. Here are a few books I’ve recently found worthwhile:
- Enlightenment Now: The Case for Reason, Science, Humanism, and Progress by Steven Pinker — In almost every measurable category — wealth, health, longevity, rights — the world is better now than ever; it’s only our perception (shaped by the news) that is getting more negative. This is critical because it shows that progress is possible, and we can continue to increase human flourishing with the tools of science and technology.
- Rewiring America: A Handbook for Winning the Climate Fight by Saul Griffith (available for free) — A no-nonsense plan for combating climate change by electrifying America, creating millions of jobs and improving quality of life in the process. This report is a welcome contrast to the usual doom and gloom that pervades climate change discussions and provides an optimistic path to address climate change.
- Utopia for Realists: How We Can Build the Ideal World by Rutger Bregman — Universal Basic Income (UBI) should form the foundation of our social safety net, so people are not forced to work meaningless jobs just to survive. (Yes, I’m solidly on Team Yang)
- Thirst: A Story of Redemption, Compassion, and a Mission to Bring Clean Water to the World by Scott Harrison — A nightclub promoter gives up a life of drinking and partying to start a charity to bring clean water to every person on Earth. Doesn’t get much more inspiring than that.
- Uncanny Valley by Anna Wiener: A non-technologist takes her skills to Silicon Valley and achieves impressive success while detailing many of the faults of the tech industry.
- The Road to Character by David Brooks — biographies of individuals who dedicated their lives to socially beneficial causes. From the stories, it’s clear that there is nothing more fulfilling than spending one’s life working for the good of humanity.
What do you enjoy doing outside of work?
Outside of work, I like to give back to the community through volunteering. Helping others provides a deep sense of fulfillment and allows you to build lasting friendships with other volunteers. In-person volunteering has been put on hold with the pandemic. Still, there are ways to help out online, like calling to check in on seniors, or working on data science problems with a social benefit on DrivenData! I used to write frequently about data science on Towards Data Science, but that’s been put on hold while I pursue an online Master’s in Computer Science from Georgia Tech. For stress relief, I go on long runs every day and occasionally compete in ultramarathons (the farthest so far was 100 miles). Challenges, whether data science or physical, have always appealed to me, and I try to make sure I’m being tested in at least one dimension.
Where do you live?
Until March of this year, I lived in New York City, but since then, I’ve been sheltering back home in the Midwestern US. Cortex has gone fully remote, and, now that I’m fortunate to be able to work from anywhere with an Internet connection, I’m trying to be thoughtful about where I live next. NYC is an incredible place, but living there grinds you down, so I’m looking for a more livable city (recommendations appreciated)!
Where can the community find you online (GitHub, Twitter, etc.)?
Since you’ve made it to the end of this article — and maybe even read most of it — I can also be reached at firstname.lastname@example.org (of course I still use my school email for all those discounts). I’m glad to talk more about my experience and to try and help others with career and personal development.