How to Transition from Software Development to Data Science
While data science is not a new field, demand for people with data science skills is soaring, and it’s drawing people in from many other disciplines. Software engineers, in particular, might like to consider transitioning into data science, as many skillsets are shared between the two disciplines. For example, data engineers should be well versed in programming languages like SQL, which allow them to build, maintain and secure big data. They should also be very well versed in logical thinking. Indeed, most data engineers have been software engineers at some time in their careers and have moved on to specializing in data engineering.
In this piece, we’ll aim to answer the question: what can a software engineer do to transition into data science?
As a software engineer, you’re in a privileged position to transition into data science, as most of the skillsets you already have are shared between both fields. Most of the skills you’ll need to acquire are more related to mindset than to technical skills.
What Is Data Science?
Before one starts thinking about transitioning into data science, it’s important to actually know what data science really is.
This is the definition of data science provided by Wikipedia:
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains ... Data science is a “concept to unify statistics, data analysis, informatics, and their related methods” to “understand and analyze actual phenomena” with data.
It makes sense that, for this particular field of engineering, we use advanced mathematics, statistics, and various computer science-related skills. If you’re not familiar with the Venn diagram that describes the skills needed to be a data scientist, it looks like this:
[Figure: Venn diagram of the skills needed to be a data scientist. Image from datanami.com]
Inside this computer science slice, we find languages and tools like Python or Scala for general-purpose programming; R and MATLAB for statistical and numeric computing; and libraries like Pandas for data analysis and manipulation, PyTorch for deep learning, and NumPy for operating on matrices and arrays.
Just like the fields of mathematics, statistics, and informatics, on which data science is based, this field also brings a whole package of new terminology that one needs to understand. Taking into consideration the definition of data science and all the layers of knowledge it involves, we should now be able to define some of the steps needed for a transition from software engineering into data science:
- understand the concepts and terminology of data science
- understand the role of a data scientist
- merge your technical skillset with the one needed for data analysis
- get involved in the data science community
- specialize!
The Role of a Data Scientist
After understanding what data science is, we need to understand the role of a data scientist. As a software engineer, your responsibilities are to use your programming skills to build and test enterprise software. As a data scientist, you’ll normally need to use your programming skills to collect, analyze, interpret and visualize large data sets.
After this, you’ll need to leverage your programming skills to solve a different set of problems. This involves learning the proper tools for data gathering and analysis and the methods behind it.
Getting involved in the community is also very important. Having contact with the community is where you’ll find out what’s new in terms of tools, technologies, and new ways of solving new and more complex problems.
Transitioning to data science from software engineering is not a hard process. As a software engineer, you should already have two-thirds of the skillset needed, since good programming skills, a problem-solving mindset, and knowledge of infrastructure and architecture design are needed in both fields.
With this in mind, there’s one last question you should answer. Do you want to switch to data science?
The data science field is still in its infancy. It presents new and exciting challenges, and its popularity is still rising. As it becomes more widespread, more and more industries, like marketing, retail and logistics, are changing their ways to take advantage of it. Because of this, opportunities and salaries are also increasing.
Switching to Data Science
If your answer is “Yes, I want to switch to data science”, then I have some tips for you.
Tip 1: Getting Immersed
There are a couple of different ways to get immersed in data science.
First, I would recommend reading some of the most important books on the subject, such as these:
These books are all industry standards and will help you get immersed in the topic and understand it on a deeper level.
If you’re more of an online course person, then both Harvard University and DataCamp have excellent courses that you can use to get deeper into the subject.
Finally, I highly recommend you get involved in the community. This includes data science newsletters such as Data Science Weekly, Data Science Roundup and Data is Plural; Twitter influencers such as @kdnuggets, @BernardMarr and @KirkDBorne; and data science publications like Towards Data Science on Medium. These are all resources that you should start consuming to understand more about the field and its trends. They also give you access to more experienced people who might help you solve your problems in the future.
Tip 2: Think Like a Data Scientist
Start thinking like a data scientist. Data science is much more than just learning Python or R. While those skills are essential, they’re just a small piece of the puzzle. As a data scientist, you’re expected to analyze day-to-day activities from a unique perspective—quite different from that of software engineers. When a data scientist is made aware of an event (for example, the need for a new KPI), they have the duty to ensure that any data generated by that event is accounted for, and that it’s used to its full potential.
Here are some ways you can start developing this mindset:
- Start paying more attention to opportunities to build documentation for database tables, code bases, or processes that would benefit other employees in your company.
- Take the time to explain why something works, or where the data came from, to your business counterparts. (This means understanding where the data comes from and how it’s transformed.)
- Practice communicating complex data science concepts to people around you who aren’t in the field. This will ensure you actually know how to break down complex data science topics into digestible information.
Tip 3: Get Some Experience
Start solving some real problems (simple ones to start). This way, you can consolidate all the terms and basic skills in a practical manner. Good projects for this can be things like building a simple chatbot; performing sentiment analysis from product reviews; or building some sort of price recommendation engine.
Another great way of getting started is by following tutorials that guide you through an entire project. For example, these are both excellent resources for getting you started with understanding more of the core concepts and techniques of data science:
Tip 4: Go Deeper
This is where it’s good to consolidate all the concepts and technologies available and start using them together to solve bigger and more complex problems.
Start looking at the following:
Database architecture and design. Data is the core of data science, so it’s no surprise that databases are the primary mechanism for data storing. Because of that, you want to be as familiar as possible with how to build, maintain and optimize databases.
This is where a book like Database Management Systems can come in very handy, as it provides comprehensive and up-to-date coverage of the fundamentals of database systems with plenty of coherent explanations and practical examples to help you.
SQL for aggregation, analysis, and modeling of database data. A good knowledge of SQL goes a long way when it comes to obtaining data.
There’s no shortage of good sources for learning SQL over the Internet. Books like Learning SQL: Generate, Manipulate, and Retrieve Data, Practical SQL or Simply SQL will get you started, but there are also a lot of online courses that will go step by step into the world of SQL.
Also make sure to check SQLFiddle for a simple sandbox where you can practice your SQL.
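To give a feel for what “SQL for aggregation” means in practice, here’s a minimal sketch that runs a GROUP BY query from Python using the sqlite3 module from the standard library. The orders table, its columns, and the sample rows are all made up for illustration, and an in-memory SQLite database stands in for whatever database you’d actually be querying.

```python
import sqlite3


def revenue_by_category(rows):
    """Return (category, order_count, total_revenue) tuples, largest revenue first.

    `rows` is a list of (category, amount) pairs. The table and column
    names are hypothetical, chosen only for this example.
    """
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE orders (category TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        query = """
            SELECT category,
                   COUNT(*)    AS order_count,
                   SUM(amount) AS total_revenue
            FROM orders
            GROUP BY category
            ORDER BY total_revenue DESC
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


# Made-up sample data
sample_orders = [("books", 12.5), ("games", 40.0), ("books", 7.5), ("music", 9.0)]

for category, order_count, total_revenue in revenue_by_category(sample_orders):
    print(category, order_count, total_revenue)
```

The same SELECT statement can be pasted into a sandbox like SQLFiddle once a matching table has been created there.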
Big Data. This is a huge collection of data that grows exponentially over time. The best datasets are the ones that keep on growing and providing fresh information. Being able to properly analyze these huge sets can be a great asset.
Coursera has a great introductory course for big data that’s worth checking out.
Hadoop. Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. This is a great tool to have when analyzing huge sets of data that would otherwise be nearly impossible to process on a single machine.
If you want to get started with Apache Hadoop, try going through their documentation. It’s very complete, and goes step by step through what you’ll have to do to start using Hadoop, as well as providing some examples for you to practice with.
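Hadoop’s best-known processing model is MapReduce, and the idea behind it can be tried out before installing anything. The sketch below imitates the map, shuffle and reduce phases on a single machine in plain Python. It doesn’t use Hadoop or any of its APIs, and the function names are purely illustrative; on a real cluster, the framework would run the map and reduce steps on many nodes in parallel.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key (done by the framework on a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run the three phases one after another on a single machine."""
    pairs = []
    for document in documents:  # each document could be mapped on a different node
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(pairs))


# Made-up sample input
documents = ["big data needs big tools", "data tools scale out"]
print(word_count(documents))
```

The point of splitting the work this way is that no step needs to see the whole dataset at once, which is what lets the real thing spread across many computers.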
R, Scala, Python, MATLAB, and libraries like NumPy, Pandas, and PyTorch. As mentioned earlier, these are some of the core tools used by a data scientist.
In terms of core tools, there are a lot of them to choose from. So let’s look at each one of them individually:
R is a programming language for statistical computing and graphics, which you can use to clean, analyze, and graph data. It’s widely used by statisticians and by researchers from diverse disciplines to estimate and display results.
Scala is a high-level language that combines functional and object-oriented programming with high-performance runtimes. Since Spark (a distributed processing system used for big data workloads) was built using Scala, it makes sense that learning it will be a great asset for any data scientist.
Python needs no introduction. The fact that it’s open source, together with the number of libraries available, makes it the preferred tool for most data scientists.
MATLAB is a high-performance language for technical computing. It integrates computation, visualization and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation.
NumPy, Pandas and PyTorch are all Python packages that help in different fields of data science. Pandas is one of the most widely used data analysis libraries. It provides data structures and functions for loading, cleaning, and analyzing datasets. NumPy is the fundamental package for scientific computing, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. Finally, PyTorch is an optimized tensor library primarily used for deep learning applications using GPUs and CPUs.
Data Science concepts such as Data Manipulation, Data Visualization, Statistical Analysis, and Machine Learning. Learning and mastering the core concepts is essential.
Wikipedia is a great resource when it comes to getting explanations on most data science concepts. Normally, every concept is well explained with plenty of links to explore. Take, for example, the concept of data analysis.
There are also some blog posts worth reading on the subject of core concepts, but you’ll be better served diving deeper into each concept, since the limited amount of information in blog posts might leave precious information behind.
Data analysis performance. Performance is a huge part of data science. Knowing how to improve analysis time can be the difference between having results now or in a couple of days.
The topic of performance in data analysis can be an entire field of research in itself. Each tool, each method and each strategy has its own ways of improving performance. Dealing with those tools and methodologies means you’ll have to deal with performance. Always keep in mind that performance is a top priority in data science.
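As a small illustration of how much the choice of method can matter, the sketch below times two ways of finding the IDs that two datasets have in common, using only the standard library. The data is made up, and the actual timings will vary from machine to machine, so treat the printed numbers as indicative only.

```python
import timeit


def shared_ids_slow(left, right):
    """Quadratic: every membership test scans the whole `right` list."""
    return [item for item in left if item in right]


def shared_ids_fast(left, right):
    """Linear on average: build a set once, then do constant-time lookups."""
    right_set = set(right)
    return [item for item in left if item in right_set]


# Made-up sample data: two lists of numeric IDs
left = list(range(0, 2000, 2))
right = list(range(0, 2000, 3))

slow_seconds = timeit.timeit(lambda: shared_ids_slow(left, right), number=5)
fast_seconds = timeit.timeit(lambda: shared_ids_fast(left, right), number=5)
print(f"list lookup: {slow_seconds:.4f}s, set lookup: {fast_seconds:.4f}s")
```

Both functions return the same result. Only the data structure used for the lookup changes, and that difference grows quickly as the datasets get larger.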
Mathematical and statistical concepts involve algebra, calculus, probability, statistics, and regression. Once again, data science is as much computer science as it is mathematics and statistics. If you want to be good at data science, you’ll have to be good at all three.
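To show how the math and the programming meet, here’s a minimal sketch of simple linear regression (ordinary least squares with a single input variable) written with nothing but the standard library. The sample data is made up so that the expected answer is easy to check by hand; in practice you’d more likely reach for a library than write this yourself.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of products of deviations from the means (numerator of the slope)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Sum of squared deviations of x (denominator of the slope)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Made-up data that lies exactly on the line y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Working through a formula like this by hand once makes it much easier to understand what a library is doing when it fits a model for you.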
These concepts and technologies all work together to properly gather, treat and analyze data, so a good knowledge of them is required if you want to specialize in this field.
After you get deep down into the specialization, it’s again time to practice with some real projects. Pick a fun project you’d like to execute from beginning to the end. This will allow you to think independently and creatively about solving data-related problems.
Here’s a list of some of those data-related areas you might like to tackle:
Natural language processing is an old and hard problem. Natural language is always bound to a specific context, and while words are unique, they can take different meanings depending on context. This is what makes this an exciting problem to research.
Great projects for getting you started with natural language problems are chatbots, autocomplete and autocorrect algorithms, and sentiment analysis, where the goal is to interpret and classify subjective data using natural language processing and machine learning. All of these require the interpretation of not only language but also context, and are great starting projects.
For more information on this topic you can read “Getting Started with Natural Language Processing in Python” or “Your Guide to Natural Language Processing (NLP)”, which work as beginning guides to natural language processing.
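As a starting point for a sentiment analysis project, here’s a deliberately naive sketch that scores a review by counting words from a tiny hand-made lexicon. The word lists are invented for this example and are far too small for real use, and this approach uses no machine learning at all. It’s only meant to show the shape of the problem and why context makes it hard.

```python
import re

# Tiny hand-made lexicon, for illustration only
POSITIVE = {"good", "great", "love", "excellent", "fast"}
NEGATIVE = {"bad", "poor", "hate", "slow", "broken"}
NEGATIONS = {"not", "never", "no"}


def sentiment(review):
    """Classify a review as 'positive', 'negative' or 'neutral'."""
    score = 0
    negate = False
    for token in re.findall(r"[a-z']+", review.lower()):
        if token in NEGATIONS:
            negate = True
            continue
        # +1 for a positive word, -1 for a negative word, 0 otherwise
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        if polarity:
            score += -polarity if negate else polarity
        negate = False  # a negation only affects the very next word
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(sentiment("Great battery and fast shipping"))
print(sentiment("Not good, the screen arrived broken"))
print(sentiment("Not a good phone"))
```

The last review is classified as positive, because the negation in this sketch only looks one word ahead. That’s exactly the kind of context problem that makes natural language processing hard, and a good reason to move on to proper tooling once the basics are clear.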
Predict Prices of a Stock. Not only can this be a very complete exercise as it involves maths and statistics, it can also bring you some extra money on the side.
Once again, you can delve deeper by reading “Can You Use Data Science in the Stock Market?” to understand stocks and the concepts of data science associated with them.
Predict the strength of a Password. This is a very common problem, and coming up with a new and creative solution for it can be a very exciting challenge.
The Wikipedia entry on password strength provides a wealth of information that one can use to try to solve password strength problems using the strengths of data analysis.
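A simple baseline to start from is the entropy estimate for randomly generated passwords: the password length multiplied by the base-2 logarithm of the size of the character pool. The sketch below implements that estimate. The character classes and the thresholds for the labels are assumptions made for this example, and since people don’t pick characters at random, the estimate will be too generous for most human-chosen passwords. Improving on it is where the data analysis comes in.

```python
import math
import string


def estimate_entropy_bits(password):
    """Estimate entropy as length * log2(size of the character pool in use).

    Assumes every character was picked at random from the pool, which
    overestimates the strength of human-chosen passwords.
    """
    pool = 0
    if any(c in string.ascii_lowercase for c in password):
        pool += len(string.ascii_lowercase)
    if any(c in string.ascii_uppercase for c in password):
        pool += len(string.ascii_uppercase)
    if any(c in string.digits for c in password):
        pool += len(string.digits)
    if any(c in string.punctuation for c in password):
        pool += len(string.punctuation)
    if pool == 0:
        return 0.0
    return len(password) * math.log2(pool)


def strength_label(password, weak_below=40, strong_from=70):
    """Map the estimate to a label. The thresholds are arbitrary example values."""
    bits = estimate_entropy_bits(password)
    if bits < weak_below:
        return "weak"
    if bits < strong_from:
        return "medium"
    return "strong"


for candidate in ["password", "abc123ABC", "Tr0ub4dor&3"]:
    print(candidate, round(estimate_entropy_bits(candidate), 1), strength_label(candidate))
```

A natural next step would be to compare this baseline against a list of commonly used passwords, since a word like “password” is far weaker in practice than its character count suggests.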
These are real-world problems that need solving and that you can use to practice your newly acquired skills.
Conclusion
Congratulations, you are now on your path to becoming a data scientist, and you have all the information you need to successfully transition from software development into data science. Hopefully, you can start looking for your first job in the field soon.
And remember: a data scientist needs to be a jack of all trades, but master of some. Most of the time, you won’t be working solely on modeling the data pulled by data engineers. Often, many companies lack resources in data science teams, so to deliver maximum benefit to the business, you’ll have to work across the complete end-to-end data science product development life cycle. This is where your past as a software engineer will help a lot.