Data Science for Everyone
Data science is a key part of the world of STEM (science, technology, engineering, and mathematics). You may have heard it described as the “sexiest job of the twenty-first century”. There’s a soaring demand for data scientists, and the role involves a wide variety of skills. However, the day-to-day job of a data scientist, the impact of their work, and the most effective way to collaborate with them can remain a mystery within an organization.
In this guide, we’ll take a look at the field of data science, how you can use data science skills in your own day-to-day work, and how to best learn key data science skills.
An Overview of the Data Science Field
You may be familiar with the overlapping-circles (Venn diagram) representation of the skills required to be a data scientist, which looks something like this:
Image source: datanami.com
This representation suggests that data science requires a combination of skills from mathematics, computer science, and domain expertise relevant to the questions being asked. There are numerous additional representations of overlapping skill sets that include communication, business knowledge, and social sciences. Each set of skills is valuable in different types of data science roles, and the mix has evolved over the last several years. We’ll dive into the skills commonly required today.
Mathematics
A data scientist generally needs an understanding of statistics, linear algebra and calculus. Each of these underlies the most common statistical and machine learning methods used to answer questions and create machine learning models.
Statistical tests are used to evaluate experiments (for example, A/B tests), to determine the effectiveness of a program, a marketing campaign, or another business action, and to decide how best to allocate funds in a business. Some example questions include:
- Does a webinar series lead to an increase in paid subscribers?
- Which educational program leads to the most positive outcomes in test scores?
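As a rough illustration of how a question like the webinar one might be tested, here is a minimal sketch of a two-proportion z-test using only Python's standard library. The counts are invented for the example, and the function name `two_proportion_z_test` is hypothetical. It also assumes the two groups are comparable (as in a randomized A/B test); with self-selected webinar attendees, a small p-value would not by itself show that the webinar caused the difference.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: paid subscribers among users who did not attend (group A)
# and users who did attend (group B) the webinar series.
z, p_value = two_proportion_z_test(conv_a=120, n_a=2000, conv_b=165, n_b=2000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the difference in subscription rates is unlikely to be due to chance alone; how small is "small enough" is a decision to make before looking at the data.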
An understanding of linear algebra and calculus will enable you to work with regression, classification, and other algorithms used to answer common questions. It will also provide the foundation necessary to understand how deep learning models work; these models are key to many advanced systems (think Tesla’s automated driving system, or Google’s text completion algorithms). Even if you’re not looking to create a complicated AI system, the mathematics underlying machine learning can enable you to answer simpler questions that create significant value within a business. Some examples include:
- Which customers are most likely to cancel their contracts in the next year?
- Can we recommend products to customers based on other customers’ previous purchases?
- How many support tickets is the company expecting over the course of the next six months?
Each of the above questions can use one or more of the algorithms discussed to support business planning, save time and resources, and better engage your audience.
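To make the support-ticket question concrete, here is a minimal sketch that fits a straight-line trend with ordinary least squares and extrapolates it six months ahead. The monthly counts are made up, `fit_line` is a hypothetical helper, and a straight line is only one simple assumption; real ticket volumes often have seasonality that this ignores.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical monthly ticket counts for the past eight months.
months = list(range(1, 9))
tickets = [410, 425, 440, 452, 470, 481, 495, 512]

slope, intercept = fit_line(months, tickets)
# Extrapolate the fitted trend over the next six months (months 9 to 14).
forecast = [round(intercept + slope * m) for m in range(9, 15)]
print(f"trend: {slope:.1f} extra tickets per month")
print("forecast:", forecast)
```

The slope formula here is where the linear algebra and calculus come in: it is the value that minimizes the sum of squared differences between the line and the observed counts.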
Computer Science
Data is everywhere, and most of it isn’t readily available in the format needed to perform a statistical analysis. Computer science skills are necessary for leveraging the full scope of possible data sources in your analysis. The ability to perform repeatable tasks—such as web scraping, data cleaning and transformation, and retraining an algorithm—is necessary to be effective as a data scientist.
Data scientists typically use Python or R for programming, and almost all use SQL for working with data in relational databases. The majority of learning resources and tutorials available online will use either Python or R. Either programming language is effective for performing most tasks, but Python tends to be used more heavily in production environments and for machine learning tasks. The R programming language has more libraries available for complex statistical modeling tasks, which we’ll discuss in the later section on types of models.
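To show how Python and SQL tend to work together, here is a small sketch using the `sqlite3` module from Python's standard library. The `orders` table and its rows are invented for illustration; in a real project the connection would point at an existing database rather than an in-memory one.

```python
import sqlite3

# In-memory database with a hypothetical "orders" table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("acme", 120.0), ("acme", 80.0), ("globex", 300.0), ("initech", 50.0)],
)

# SQL does the aggregation; Python receives the summarized rows.
rows = conn.execute(
    """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
    """
).fetchall()
conn.close()

for customer, n_orders, total in rows:
    print(f"{customer}: {n_orders} orders, {total:.2f} total")
```

Letting the database group and sum the rows, and handing only the summary to Python, is a common division of labor when the underlying tables are large.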
Domain Expertise
The requirement of domain expertise is highly subjective, but usually points to an area of background knowledge and context that supports your understanding of the types of questions to ask, what data sources to seek out, and what prior knowledge exists on your topic of interest.
For example, a data scientist working on an HR analytics team will typically benefit from domain expertise in human behavior, industrial/organizational psychology, and trends in employment and hiring practices. The most effective way to leverage data science methods and skills is to use your own domain expertise. The knowledge you gain from the industries, areas of study, and areas of interest that you’ve participated in will provide invaluable context for your work, and it is the most effective starting point for learning data science or starting a new career.
Types of Models
Data science education typically classifies algorithms as supervised or unsupervised, depending on the type of question they answer. In this guide, we’ll focus instead on the different purposes of models and on the tools and statistical tests you may use to answer specific questions.
Modeling for Explanation
A common use case for statistical models or machine learning algorithms is for explanation. Modeling for explanation answers questions looking for specific relationships between variables. Some example questions that are answered by explanation include:
- Did the new student after-school program improve class attendance and homework completion?
- Does a new feature in the company software improve time to complete a task?
Each of the above questions is best answered by determining the best fitting statistical model, and also reporting on how the variables in your model relate. Is there a positive or negative correlation between them? If customers use the new feature in your software, does it increase or decrease the time to complete a task? Is there a point of diminishing returns for students attending the after-school program?
Each of these requires you to report on the overall performance of your statistical model as well as evaluate the individual relationships between predictors and the outcome. The details of the relationships between variables are what provide value and guide future actions.
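As a sketch of what reporting both pieces can look like, the example below fits a one-predictor linear model and reports the slope (the individual relationship) alongside R² (the overall fit). The data and the helper name `explain_relationship` are hypothetical, and a single predictor is a simplifying assumption; a negative slope in observational data like this describes an association, not necessarily a causal effect.

```python
from statistics import mean


def explain_relationship(xs, ys):
    """Fit y = intercept + slope * x and report slope, intercept, and R^2."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2: share of the variation in y that the fitted line accounts for.
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - ss_resid / ss_total
    return slope, intercept, r_squared


# Hypothetical data: weekly uses of the new feature vs. minutes to complete a task.
feature_uses = [0, 1, 2, 3, 5, 8, 10, 12]
task_minutes = [30, 29, 27, 26, 24, 20, 19, 16]

slope, intercept, r_squared = explain_relationship(feature_uses, task_minutes)
direction = "decreases" if slope < 0 else "increases"
print(f"Each extra use {direction} task time by about {abs(slope):.2f} minutes")
print(f"R^2 = {r_squared:.2f}")
```

Here the sign and size of the slope answer the "does it increase or decrease?" question, while R² indicates how much of the variation in task time the model accounts for.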
Modeling for Prediction
Models trained to optimize for the accuracy of a predicted outcome are the most commonly discussed applications of machine learning. Modeling for prediction requires focusing on different performance metrics, and typically answers different sets of questions than models for explanation. Some examples include:
- Which customers are predicted to purchase a specific product?
- Which water main lines in a city are predicted to burst in the next year?
For each question, a data scientist will train several types of algorithms on a training set of data and choose the model with the highest accuracy on a test set. For this type of question, the value comes from the predictions the model generates for future data points whose outcome is not yet known.
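The train-then-compare workflow can be sketched in a few lines. The example below uses synthetic data and two deliberately simple, hypothetical "models" (a majority-class baseline and a one-feature threshold rule) so that it runs with only the standard library; real projects would typically use a machine learning library and richer models. In practice, model selection is often done on a separate validation set, with the test set held back for a final estimate, and accuracy is only one of several possible metrics.

```python
import random


def accuracy(predict, rows):
    """Fraction of (feature, label) rows that a predict function labels correctly."""
    return sum(predict(x) == y for x, y in rows) / len(rows)


def fit_majority(train):
    """Baseline: always predict the most common label in the training set."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


def fit_threshold(train):
    """One-feature rule: predict 1 when x exceeds the best training cutoff."""
    candidates = sorted({x for x, _ in train})
    best = max(candidates, key=lambda c: accuracy(lambda x: int(x > c), train))
    return lambda x: int(x > best)


# Synthetic data: x is a hypothetical "visits last month" feature and the
# label is 1 when the customer purchased. Generated only for illustration.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.randint(0, 20)
    data.append((x, int(x + rng.gauss(0, 3) > 10)))

rng.shuffle(data)
train, test = data[:150], data[150:]

# Fit each candidate on the training set only, then score on held-out data.
models = {"majority": fit_majority(train), "threshold": fit_threshold(train)}
scores = {name: accuracy(model, test) for name, model in models.items()}
best_name = max(scores, key=scores.get)
print(scores, "->", best_name)
```

Including a trivial baseline such as the majority-class model is a useful habit: it shows how much of a model's accuracy comes from actually learning something.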
Thinking Like a Data Scientist
An important skill is your ability to think like a data scientist. You can practice identifying problems that can be solved with a statistical test or algorithm, and solving them effectively using the skills we’ve discussed so far. However, your ability to creatively find data, structure it, and connect multiple pieces of information together to solve a problem will make your work more effective. Some simple examples include the following:
- Is your customer engagement (store visits) impacted by the weather? How? Can you use this to predict sales in the next week?
- Is your business performance correlated with any aggregate economic measures, such as the unemployment rate, or a stock market index?
There are many ways to “think” like a data scientist, and these are skills that aren’t necessarily taught in a specific course. However, they’re an effective asset regardless of your specific job title or role. If you seek evidence to support your decisions, ask questions of your data, and seek to augment that data for your knowledge, then you’re thinking like a data scientist!
Why Is This Important?
Data literacy and the ability to perform analysis are becoming increasingly in-demand skills, and they’re necessary for performing a variety of jobs outside the role of a data scientist. Professionals across different fields routinely have access to large volumes of data about business processes, customers, and other key pieces of information that can support their decisions, track their performance, and improve their workflows. They also have the ability to answer questions that they may not have been able to answer before.
Answering Complex Questions
Data science methods allow you to answer more complex questions and plan around predicted values, such as predictions on which customers will cancel their contracts in the next calendar year. A business owner can engage with customers predicted to cancel, thus potentially changing the outcome and reducing the amount of revenue lost.
You can also understand the complex mathematical relationships between multiple predictors and an outcome of interest, which can guide future planning. Even if you’re not yet proficient in leveraging machine learning or statistical models, your ability to formulate clear questions and collaborate with the data science team at work will streamline the path toward value using data.
Optimizing Workflows
An easy way to get started learning data science methods is to optimize your workflow using the programming skills needed to be a data scientist. If there are repetitive manual tasks you perform using data in spreadsheets or similar files, where you can expect to perform the same tasks in the future, you can leverage Python or a similar language to streamline your work. For example, if you need to generate a weekly report by joining two spreadsheets and creating a pivot table, you can write a Python script to perform the work for you. Every small amount of time you save by not doing this manual task is time you can use to optimize your workflows further.
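As one possible shape for the weekly report script described above, here is a minimal sketch that joins two tables on a shared column and builds a pivot table of totals. It uses only the standard library's `csv` module, with inline text standing in for the two spreadsheets; the column names and values are hypothetical. Libraries such as pandas provide merge and pivot-table helpers that do the same job in fewer lines.

```python
import csv
import io
from collections import defaultdict

# Stand-ins for two exported spreadsheets; in practice these would be files
# opened with open("orders.csv", newline=""). All names here are hypothetical.
orders_csv = io.StringIO(
    "order_id,customer_id,week,amount\n"
    "1,c1,2024-W01,100\n"
    "2,c2,2024-W01,250\n"
    "3,c1,2024-W02,75\n"
    "4,c3,2024-W02,300\n"
)
customers_csv = io.StringIO(
    "customer_id,region\n"
    "c1,East\n"
    "c2,West\n"
    "c3,East\n"
)

# Join step: look up each order's region through the shared customer_id column.
region_by_customer = {
    row["customer_id"]: row["region"] for row in csv.DictReader(customers_csv)
}

# Pivot step: rows are weeks, columns are regions, cells are summed amounts.
pivot = defaultdict(lambda: defaultdict(float))
for order in csv.DictReader(orders_csv):
    region = region_by_customer.get(order["customer_id"], "Unknown")
    pivot[order["week"]][region] += float(order["amount"])

regions = sorted({region for cells in pivot.values() for region in cells})
print("week", *regions, sep="\t")
for week in sorted(pivot):
    print(week, *(pivot[week].get(r, 0.0) for r in regions), sep="\t")
```

Once a script like this exists, producing next week's report is a matter of pointing it at the new files and running it again.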
Learning Data Science
There are a great number of free or affordable resources online that you can use to learn new skills in the realm of data science. I’ll cover some of the resources I’ve found most effective in my own work, but keep in mind this list is not exhaustive! There are many other options based on your goals and preferred learning style.
Free Resources
If you want to get started optimizing your workflows, a great resource is the book Automate the Boring Stuff with Python. There are several great, practical examples to get you started if you want to learn the Python programming language and interact with common types of files used in data science.
Another option is to research “degrees” or other curricula curated by professionals in the field. One example is Not a Real Degree, combining both free and paid resources that you can use to further your learning goals.
Online Courses
Online platforms such as AI+ Training or Dataquest offer online courses using video and/or interactive environments to practice specific topics and skills, with curated certificate programs and tracks to ensure you cover the breadth of knowledge used in data science. This is far from exhaustive, and you can find many other options and entire articles written about learning resources that best suit your needs.
Finding Datasets and Answering Questions
There are many great repositories of free datasets available to practice answering questions, including on Kaggle (where you can also participate in competitions to win money solving machine learning problems), data.world, and NYC Open Data. There’s a dataset for everyone based on topics that interest you, so I encourage you to start there by looking up data, tutorials, and other free information online.
Conclusion
Data science is a continually evolving and growing field. Regardless of your current skill set or your career interests, data science methods and approaches can benefit you. With an interest in answering questions and a desire to use data to augment your decision-making, you can benefit from any of the resources discussed in this guide.