Helping others with their data science careers

Last week, I posted to LinkedIn about successfully helping a friend in her job search, which got thousands of views and a ton of positive feedback. I was quickly overwhelmed with dozens of connection requests from folks in the data science industry, or trying to break into it, so I used the momentum to create a Collaborative Data Careers Community LinkedIn group, which is now nearly 400 members strong.

A very effective way to help your stellar colleagues and friends is to periodically take the time to recognize their achievements and skills by writing them a thorough recommendation on LinkedIn. This is especially helpful if you know they are looking to make a career change, or they are unemployed. I have firsthand experience of this method working, even when the job seeker is not using LinkedIn to apply for a job: recruiters and HR professionals use the platform to vet people and look over their profile even if they have a resume in hand.

Another way you can help is by offering to proofread their profile or resume. This is extra helpful if you are not in their industry, because you can spot jargon that a recruiter or future employer will not understand. Finally, you can also pass them along as possible candidates to recruiters who contact you. Your mileage may vary with this, since every role is different, and there can be a mismatch between the candidate and the company. Job searching and unemployment are often difficult and can lead to depression, so it pays to be proactive and compassionate.

My hope for the group is that it will give members a place to reach out to one another for career advice, job leads, insight into job opportunities, and also to learn about tools that can make them more effective in their jobs. Those of us who are not actively looking for jobs can help others who are by making introductions and referrals. Likewise, those in the community who are actively looking can take advantage of the folks who are willing to offer their help. I also posted about the group on Twitter, hoping that folks from the #rstats community would join and foster the same sense of compassion and giving back that I’ve seen from them. The whole idea is to work together, believing that we only get ahead by helping others along their own journeys. Just as it’s harder to take on a huge task at work or school alone, it’s more difficult to find the job you want or need on your own, and everyone’s career is a collaborative effort.

Recently, I have also received private messages asking for advice and for the story of my career progression. I think everybody’s journey in this new field is different, and to an extent atypical, so I don’t think mine is necessarily the model path for data science careers. Nevertheless, I decided to write a blog post including the back story of how I got involved in data science:

How I became a “data scientist”

I started coding back in the day when it was cool to change up your MySpace page with custom code snippets. I enjoyed manipulating everything by adding custom CSS and HTML.

I went on from there to learn some PHP, understand how servers worked, get free hosting plans, and build a slew of websites, both in WordPress and from scratch. I never considered myself great at this, but I could do it, and I enjoyed it for the most part (except deployment headaches and troubleshooting CSS issues). It was mostly reverse-engineering what others had done rather than building much from the ground up. I didn’t feel like I had a firm grasp on PHP, or enjoyed it that much, so I went on to study management at Clemson, which included heavy emphases on business strategy, logistics, information systems, and more. I continued web development as a side job to help pay for expenses.

After an internship at the US State Department, I got my start in the tech world on the development team for four major budgeting applications used by the DOD and several other agencies. My role included everything from functional/business analyst to client-facing product owner. This required not only expertise in finance and the US government budgeting process, and excellent internal and external communication skills (all things I developed on the job), but also a keen understanding of how the data was stored, presented, and manipulated in our applications. I taught myself SQL online and got plenty of use out of it, since part of my role included serving as the Quality Assurance / Testing lead. I learned some handy formulas in Excel to build digestible test plans. More importantly, I learned how to work well with software developers to get things done.

After a couple of years in that role, I was able to pivot and land a job at CARFAX as a Business Data Analyst on the Data Services team. There, we analyzed all of the data sources: potential sources (for Data Acquisition to evaluate) and new sources (for the Development Team to understand how to load into the databases). We also needed to become familiar with the data and what it meant so that we could work with the Product Team to control how it was displayed in our products, and make changes when needed.

After working with tools like SAS and MS Access databases, I quickly learned how to program in R, made extensive use of the tidyverse and RStudio products, and became essential to the team by building dashboards and automating processes that used to be manual.
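To give a flavor of what that kind of automation looks like (this is a made-up, minimal sketch rather than my actual code, and the file and column names are hypothetical), here is the sort of tidyverse script that can replace a manual "pull, pivot, paste" spreadsheet workflow:

```r
# Hypothetical sketch: read a raw extract, summarize it, and write a report
# that used to be assembled by hand in a spreadsheet.
library(readr)
library(dplyr)

raw <- read_csv("daily_loads.csv")                  # hypothetical raw extract

summary_tbl <- raw %>%
  group_by(source, load_date) %>%                   # hypothetical columns
  summarise(records = n(), .groups = "drop") %>%
  arrange(desc(records))

write_csv(summary_tbl, "daily_load_summary.csv")    # output others can consume
```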

Eventually I was promoted to a role under someone who had been an internal client, and my responsibilities increased. I am on a team that is seen as the company’s go-to for ad hoc data questions. My position includes a lot more now than it did, but the technical part of it is mainly using R, SQL, Spark, and other big data technologies to drive insights for the Senior Team. I also work with various internal teams on data-related projects and issues, creating and consulting on new data products. I build, maintain, and use several Shiny dashboards. Sometimes, I get the chance to create a predictive model or play around with cool stuff in R, like building web scrapers, or learning how to use the furrr package to hit a GraphQL API with parallel requests for a project. I collect data from many different sources, and I have had to become an internal expert on what damage data means, how it is used, what flags represent, and more. To a certain extent, in various projects, the job includes being both a data librarian and a data engineer. (When a product development team needs to know whether a feature is worth implementing, based on some data, and you can’t successfully interrupt the development queue of an agile data engineering team, sometimes you need to learn the tools to get the data yourself!)
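For the curious, here is roughly what the furrr piece looks like. This is only a minimal sketch under assumptions — the endpoint, query shape, and field names are invented, and the real project needed more error handling — but it shows the pattern of fanning GraphQL requests out across workers with future_map():

```r
# Minimal sketch of parallel GraphQL requests with furrr.
# The endpoint, query, and fields below are hypothetical.
library(future)
library(furrr)
library(httr)
library(jsonlite)

plan(multisession, workers = 4)  # spin up parallel R sessions

vehicle_ids <- c("VIN001", "VIN002", "VIN003", "VIN004")  # made-up IDs

fetch_vehicle <- function(id) {
  query <- sprintf('{ vehicle(id: "%s") { make model year } }', id)
  resp <- POST(
    "https://example.com/graphql",   # hypothetical endpoint
    body = list(query = query),
    encode = "json"
  )
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))$data$vehicle
}

# future_map() works like purrr::map(), but distributes the calls across workers
results <- future_map(vehicle_ids, fetch_vehicle)
```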

My experience as a data scientist:

In my view, the data scientist role is much bigger than just knowing how to deploy a model. Terms are used loosely across the industry, and I tend to agree with David Robinson that data science produces insights and involves a combination of statistics, software engineering, and domain expertise. Efficiently getting insights from data in business environments now demands a lot more than knowing how to train a scikit-learn model for a Kaggle competition. Not to belittle that as an important skill: there are data science positions that focus more exclusively on building and maintaining models. For instance, we have a few people who continually work on improving our history-based value models. I don’t think I’d really like the lack of variety in that role, so if you are looking for a job in this industry, it’s important to understand the makeup of the role(s) you are considering.

It seems to me that a data scientist should, at the end of the day, be focused on understanding the business and what drives value. I think there are a lot of people who know how to build models, chatbots, NLP programs, and all sorts of cool stuff, but can’t get those “put into production” because they haven’t made the case to the business. For instance, during a company hackathon that my team won, some of my colleagues and I trained a machine learning model to categorize car photos on our Used Car Listings site. But you don’t put a model like that into production just because it’s possible and fun to work on. You need the business case to justify the potential expenses and opportunity cost. In this case, that could mean paying to host those images internally rather than relying on a current partner / service provider, ending a contract, or standing up AWS for compute, etc. The bottom line, though, is: will such a proposal pay for itself in the long run? Does it increase conversion enough to justify the salary cost of the time you spend building and optimizing it? If senior leadership believes in such a product and you get assigned to it, maybe you’ll use some of those cool skills.

It’s a balance between using data technology in novel ways and solving real business needs. Modeling is one part of that, but right now a lot of companies simply don’t know the value of their data, and analytics may be a wiser investment than prediction. Perhaps they know they need modeling for a particular product, and they build out a team for that. Across the market, the impression I get is that a lot of business leaders don’t understand enough about the technology or the data to know what they need. I’m hearing a lot on LinkedIn and Twitter about data scientists being hired without access to existing or good data pipelines and having to learn data engineering skills on the job. This isn’t a bad thing in my opinion, since it makes you a better generalist and more capable, but you want to know what you’re getting into.

As the resident data expert, you will frequently be asked questions that you don’t have the best data to answer, and the only way forward is a rough estimate (which, to your surprise, may be good enough for the business!). For example, you might use one data set to narrow down and qualify a scenario, then multiply the ratio you get by a slightly different data set of known observations with a similar distribution. It’s far from a perfect way of estimating, but it’s what the business needs to make a good and timely “gut” judgement call. You have to learn to thrive in the ambiguity, because you won’t always have the data to answer the questions you’re asked.
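As a toy illustration of that kind of back-of-the-envelope estimate (the numbers and column names here are invented, not from any real data I work with):

```r
# Toy sketch of a ratio-based estimate: measure a rate in one data set where
# the condition is observable, then apply it to a larger set of known
# observations with a (hopefully) similar distribution.
library(dplyr)

set.seed(42)

# Data set A: a sample where we can actually observe the condition of interest
sample_a <- tibble(
  id = 1:1000,
  has_condition = rbinom(1000, size = 1, prob = 0.12)  # invented rate
)

ratio <- mean(sample_a$has_condition)  # observed share in sample A

# Data set B: a larger population where we only know the total count
known_total_b <- 250000

estimate_b <- ratio * known_total_b  # rough estimate, not a rigorous inference
estimate_b
```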

These are my thoughts, but the situation may be completely different at companies with a major data science focus, or where models are the product. I’m speaking from experience at a company where information is the core product and the clients are consumers and players in the automotive industry.

Another note: since predictive modeling is largely based on applying mathematical algorithms to data, it’s important to understand the theory and intuition behind the algorithms at your disposal in order to know what will and won’t be effective or correct for your data set and your goals. This is why statistics and math are important in this discipline. I took advanced calculus and statistics classes at my university, but I didn’t learn them at a master’s or PhD level, and I don’t consider myself a “statistician” like some of the folks in this field. To counteract my impostor syndrome, I have recently been learning a lot more outside of work to brush up on my knowledge and understanding. I have been taking classes online, working through books like “ISLR”, and learning through classes with practical exercises on DataCamp (which I highly recommend!).

I hope this has been helpful, and I’d like to hear your stories as well. If you are somehow involved in the data world, join the group here and share!