A Primer on Machine Learning with Python

In the past decade, machine learning has moved from scientific research labs into everyday web and mobile apps. Machine learning enables your applications to perform tasks that were previously very difficult to program, such as detecting objects and faces in images, detecting spam and hate speech, and generating smart replies for emails and messaging apps.

But performing machine learning is fundamentally different from classic programming. In this article, you’ll learn the basics of machine learning and will create a basic model that can predict the species of flowers based on their measurements.

Key Takeaways

Machine learning has evolved from scientific research labs into everyday web and mobile apps, enabling applications to perform tasks that were previously very difficult to program.
Machine learning relies on experience, training models through examples rather than providing them with rules. There are different categories of machine learning algorithms, each of which can solve specific problems: supervised learning, unsupervised learning, and reinforcement learning.
Python is a popular language for machine learning due to its simplicity, readability, and extensive ecosystem, including libraries and frameworks like Scikit-learn, TensorFlow, and PyTorch. However, understanding Python programming, libraries like NumPy, Pandas, and Matplotlib, and basic concepts of statistics and probability are prerequisites.
The process of implementing a machine learning model involves defining the problem, gathering data, splitting the dataset into training and test sets, building the model, and evaluating its performance. Techniques like cross-validation and train-test split, alongside metrics like accuracy, precision, recall, and F1 score, can be used to validate the model’s performance.

How Does Machine Learning Work?

Classic programming relies on well-defined problems that can be broken down into distinct classes, functions, and if–else statements. Machine learning, on the other hand, develops its behavior based on experience. Instead of providing machine learning models with rules, you train them through examples.
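To make the contrast concrete, here is a minimal sketch that solves the same toy task twice: once with a hand-written rule and once with a model that infers the rule from examples. The petal lengths, the 2.5 cm threshold, and the "short-petal"/"long-petal" labels are made up for illustration, and the decision tree is just one convenient choice of model.

```python
from sklearn.tree import DecisionTreeClassifier


# Classic programming: a human writes the rule by hand.
def classify_by_rule(petal_length_cm):
    # The 2.5 cm threshold is a made-up value for illustration.
    if petal_length_cm < 2.5:
        return "short-petal"
    return "long-petal"


# Machine learning: the rule is inferred from labeled examples.
examples = [[1.3], [1.4], [1.5], [4.5], [4.7], [5.1]]  # petal lengths in cm
labels = ["short-petal"] * 3 + ["long-petal"] * 3

learned = DecisionTreeClassifier(random_state=0)
learned.fit(examples, labels)

rule_answer = classify_by_rule(1.6)
learned_answer = learned.predict([[1.6]])[0]
print(rule_answer, learned_answer)
```

Both approaches agree on this input, but only the first required someone to know the threshold in advance.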

There are different categories of machine learning algorithms, each of which can solve specific problems.

Supervised learning

Supervised learning is suitable for problems where you want to go from input data to outcomes. The common trait of all supervised learning problems is that there’s a ground truth against which you can test your model, such as labeled images or historical sales data.

Supervised learning models can solve regression or classification problems. Regression models predict quantities (such as the number of items sold or the price of a stock), while classification models try to determine the category of input data (such as cat/dog/fish/bird, fraud/not fraud).

Image classification, face detection, stock price prediction, and sales forecasting are examples of problems supervised learning can solve.

Some popular supervised learning algorithms include linear and logistic regression, support vector machines, decision trees, and artificial neural networks.
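As a rough sketch of the difference between regression and classification, the snippet below fits one model of each kind on the same toy inputs. The data and the "small"/"large" meaning of the labels are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Regression: predict a quantity. The toy targets follow y = 2x + 1 exactly.
y_quantity = np.array([3.0, 5.0, 7.0, 9.0])
regressor = LinearRegression().fit(X, y_quantity)
predicted_quantity = regressor.predict([[5.0]])[0]

# Classification: predict a category (0 = "small", 1 = "large"; made-up labels).
y_category = np.array([0, 0, 1, 1])
classifier = LogisticRegression().fit(X, y_category)
predicted_category = classifier.predict([[5.0]])[0]

print(predicted_quantity, predicted_category)
```

The regressor returns a continuous value, while the classifier returns one of the class labels it saw during training.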

Unsupervised learning

Unsupervised learning is suitable for problems where you have data but, instead of outcomes, you’re looking for patterns. For instance, you might want to group your customers into segments based on their similarities. This is called clustering. Or you might want to detect malicious network traffic that deviates from the normal activity in your enterprise. This is called anomaly detection, another unsupervised learning task. Unsupervised learning is also useful for dimensionality reduction, a technique that simplifies machine learning tasks by reducing the number of features and discarding those that carry little information.

Some popular unsupervised learning algorithms include K-means clustering and principal component analysis (PCA).
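The following sketch shows both algorithms on a handful of unlabeled points. The data is made up so that the two groups are obvious, and the parameter choices (two clusters, two components) are assumptions for this toy case.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Six unlabeled points that form two well-separated groups (made-up data).
points = np.array([
    [1.0, 1.1, 0.9], [0.9, 1.0, 1.0], [1.1, 0.9, 1.1],
    [8.0, 8.1, 7.9], [7.9, 8.0, 8.0], [8.1, 7.9, 8.1],
])

# Clustering: assign each point to one of two groups without any labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(points)

# Dimensionality reduction: project the three features down to two.
pca = PCA(n_components=2)
reduced = pca.fit_transform(points)

print(cluster_ids)
print(reduced.shape)
```

Note that the cluster numbers themselves are arbitrary: K-means only tells you which points belong together, not what each group means.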

Reinforcement learning

Reinforcement learning is a branch of machine learning in which an intelligent agent tries to achieve a goal by interacting with its environment. Reinforcement learning involves actions, states, and rewards. An untrained RL agent starts by randomly taking actions. Each action changes the state of the environment. If the agent finds itself in the desired state, it receives a reward. The agent tries to find sequences of actions and states that produce the most rewards.

Reinforcement learning is used in recommendation systems, robotics, and game-playing bots such as DeepMind’s AlphaGo and AlphaStar.
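To illustrate actions, states, and rewards, here is a minimal sketch of tabular Q-learning, one common reinforcement learning algorithm. Everything about the environment is hypothetical: a five-cell corridor where the agent starts in cell 0 and is rewarded for reaching cell 4. For simplicity, the agent explores purely at random during training, and the learned policy is read off at the end.

```python
import random

random.seed(0)

N_STATES = 5                  # corridor cells 0..4; cell 4 is the goal
ACTIONS = ("left", "right")
ALPHA, GAMMA = 0.5, 0.9       # learning rate and discount factor (arbitrary choices)


def step(state, action):
    """Move one cell; the reward is 1.0 only when the goal is reached."""
    if action == "right":
        next_state = min(state + 1, N_STATES - 1)
    else:
        next_state = max(state - 1, 0)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


# Q-table: estimated future reward for every (state, action) pair.
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for _ in range(500):          # 500 training episodes
    state = 0
    while state != N_STATES - 1:
        action = random.choice(ACTIONS)          # explore at random
        next_state, reward = step(state, action)
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
        state = next_state

# The learned policy: the highest-valued action in each non-goal cell.
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
print(policy)
```

After training, the agent prefers moving right in every cell, which is the shortest path to the reward.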

Setting Up the Python Environment

In this post, we’ll focus on supervised learning, because it’s the most popular branch of machine learning and its results are easier to evaluate. We will be using Python, because it has many features and libraries that support machine learning applications. But the general concepts can be applied to any programming language that has similar libraries.

(In case you’re new to Python, freeCodeCamp has a great crash course that will get you started with the basics.)

One of the Python libraries often used for data science and machine learning is Scikit-learn, which provides implementations of popular machine learning algorithms. Scikit-learn is not part of the base Python installation and you must install it manually.

Most Linux distributions come with Python preinstalled. Recent macOS releases no longer bundle it, so you may need to install Python 3 first. To install the Scikit-learn library, type the following command in a terminal window:

pip install scikit-learn

Or, to make sure the package is installed for your Python 3 interpreter:

python3 -m pip install scikit-learn

On Microsoft Windows, you must install Python first. You can get the installer of the latest version of Python 3 for Windows from the official website. After installing Python, type the following command in a command-line window:

python -m pip install scikit-learn

Alternatively, you can install the Anaconda distribution, which includes an independent installation of Python 3 along with Scikit-learn and many other libraries used for data science and machine learning, such as NumPy, SciPy, and Matplotlib. You can find the installation instructions for the free Individual Edition of Anaconda on its official website.
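Whichever route you choose, a quick way to confirm that the installation worked is to import the library and print its version. This is only a sanity check; the version number you see will depend on your setup.

```python
import sklearn

# If this import fails, Scikit-learn is not installed for this interpreter.
installed_version = sklearn.__version__
print(installed_version)
```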

Step 1: Define the Problem

The first step to every machine learning project is knowing what problem you want to solve. Defining the problem will help you determine the kind of data you need to gather and give you an idea of the kind of machine learning algorithm you’ll need to use.

In our case, we want to create a model that predicts the species of a flower based on the measurements of the petal and sepal length and width.

This is a supervised classification problem. We’ll need to gather a list of measurements of different specimens of flowers and their corresponding species. Then we’ll use this data to train and test a machine learning model that can map measurements to species.

Step 2: Gather the Data

One of the trickiest parts of machine learning is gathering data to train your models. You’ll have to find a source where you can gather data in the quantity needed to train your model. You’ll also need to verify the quality of your data, make sure it’s representative of the different cases your model will handle, and avoid collecting data that contains hidden biases.

Luckily for us, Scikit-learn contains several toy datasets to try out different machine learning algorithms. One of them is the “Iris flower dataset”, which happens to contain the exact data that we need for our problem. All we need to do is to load it from the library.

The following code loads the Iris dataset:

from sklearn.datasets import load_iris

iris = load_iris()

The Iris dataset contains 150 observations, each containing four measurements (iris.data) and the target flower species (iris.target). The names of data columns can be seen in iris.feature_names:

print(iris.feature_names)
'''
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']
'''

iris.target contains the numerical index (0–2) of one of three flower species registered in the dataset. The names of the flower species are available in iris.target_names:

print(iris.target_names)
'''['setosa' 'versicolor' 'virginica']'''
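Before training anything, it helps to take a quick look at the data. The sketch below checks the shape of the dataset, counts the examples per species, and decodes the first observation. The variable names are just illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# 150 observations, four measurements each.
n_samples, n_features = iris.data.shape

# Number of examples for each of the three species.
class_counts = np.bincount(iris.target)

# The first observation and the name of its species.
first_row = iris.data[0]
first_species = iris.target_names[iris.target[0]]

print(n_samples, n_features)
print(class_counts)
print(first_row, first_species)
```

The three species are equally represented, which makes accuracy a reasonable metric for this problem.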

Step 3: Split the Dataset

Before beginning the training, you must split your data into a train and test set. You’ll use the train set to train your machine learning model and the test set to verify its accuracy.

This is to make sure your model has not overfit on the training data. Overfitting happens when your machine learning model performs well on the training examples but poorly on unseen data. Overfitting can happen as a result of choosing the wrong machine learning algorithm, making the wrong configuration on the model, having poor training data, or having too few training examples.

Depending on the kind of problem you’re solving and the amount of data you have, you must determine how much of your data you’ll allocate to the test set. Usually, when you have a lot of data (on the order of tens of thousands of examples), even a small sample of about one percent will be adequate to test your model. In the case of the Iris dataset, which contains a total of 150 records, we’ll choose a 75–25 split.

Scikit-learn has a train_test_split function that splits the dataset into train and test datasets:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, stratify=iris.target, random_state=42)

train_test_split takes the data and target datasets and returns two pairs of datasets for training (X_train and y_train) and testing (X_test and y_test). The test_size parameter determines the fraction (between 0 and 1) of data that will be allocated to testing. The stratify parameter makes sure that the train and test arrays preserve the class proportions of the original dataset, which here means roughly equal numbers of each species. The random_state parameter, which is present in many functions of Scikit-learn, seeds the random number generator so that the split is reproducible.
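To see what the split actually produced, you can check the sizes of the resulting arrays and count the classes in each. This sketch repeats the split so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, stratify=iris.target, random_state=42
)

# With 150 records and test_size=0.25, the test set is rounded up to 38 rows.
print(len(X_train), len(X_test))

# Thanks to stratify, each species appears in nearly equal numbers in both sets.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts, test_counts)
```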

Step 4: Build the Model

Now that our data is ready, we can create a machine learning model and train it on the train set. There are many different machine learning algorithms that can solve classification problems like the one we’re dealing with. In our case, we’ll use the “logistic regression” algorithm, which is very fast and suitable for classification problems that are simple and don’t contain too many dimensions.

Scikit-learn’s LogisticRegression class implements this algorithm. After instantiating it, we train it on our train set (X_train and y_train) by calling its fit method. This will tune the model’s parameters to find a mapping between the measurements and the flower species.

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train)

Step 5: Evaluate the Model

Now that we’ve trained the model, we want to measure its accuracy. The LogisticRegression class has a score method that returns the accuracy of the model. First, we’ll measure the accuracy of the model on the training data:

print(lr.score(X_train, y_train))

This will return approximately 0.97, which means the model correctly predicts the class of about 97 percent of the training examples. That’s pretty good, given that we only had around 37 training examples per species.

Next, we’ll check the accuracy of the model on the test set:

print(lr.score(X_test, y_test))

This will give us around 95 percent, a bit lower than the training accuracy, which is natural because these are examples that the model has never seen before. By creating a larger dataset or trying another machine learning algorithm (such as support vector machines), we might be able to further improve the model’s accuracy and bridge the gap between training and test performance.
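As a sketch of what trying another algorithm looks like, the snippet below trains a support vector machine on the same split. It uses Scikit-learn’s SVC class with default settings, which is an assumption made here for brevity. Whether it beats logistic regression depends on the split and the settings, so treat the numbers as something to compare rather than a guaranteed improvement.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, stratify=iris.target, random_state=42
)

# Same workflow as before: instantiate, fit on the train set, score on both sets.
svm = SVC()
svm.fit(X_train, y_train)

svm_train_acc = svm.score(X_train, y_train)
svm_test_acc = svm.score(X_test, y_test)
print(svm_train_acc, svm_test_acc)
```

Because every Scikit-learn classifier exposes the same fit and score methods, swapping algorithms only requires changing the line that creates the model.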

Finally, we want to see how we can use our trained model on new examples. The LogisticRegression class has a predict method that takes an array of observations as input and returns an array with the predicted class of each one. In the case of our flower classifier model, we need to provide it with an array of four measurements (sepal length, sepal width, petal length, petal width) per flower, and it will return an integer for each flower that represents its class:

output = lr.predict([[4.4, 3.2, 1.3, 0.2]])
print(iris.target_names[output[0]])

'''setosa'''
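The predict method only reports the most likely class. If you also want to know how confident the model is, LogisticRegression offers a predict_proba method that returns one probability per class. The sketch below retrains the model so that it runs on its own. Setting max_iter=1000 is a choice made here to give the solver room to converge, not something the steps above require.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, stratify=iris.target, random_state=42
)

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

# One row per observation, one column per species.
probabilities = lr.predict_proba([[4.4, 3.2, 1.3, 0.2]])
most_likely = iris.target_names[probabilities[0].argmax()]

for name, p in zip(iris.target_names, probabilities[0]):
    print(f"{name}: {p:.3f}")
```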

Congratulations! You’ve created your first machine learning model. We can now put it together into an application that takes measurements from users and returns the flower species:

sepal_l = float(input("Sepal length (cm): "))
sepal_w = float(input("Sepal width (cm): "))
petal_l = float(input("Petal length (cm): "))
petal_w = float(input("Petal width (cm): "))

measurements = [[sepal_l, sepal_w, petal_l, petal_w]]
output = lr.predict(measurements)
print(f"Your flower is {iris.target_names[output[0]]}")

Hopefully, this will be your first step toward becoming a machine learning guru. From here, you can continue to learn other machine learning algorithms, learn more about the fundamental concepts of machine learning, and move on to more advanced topics such as neural networks and deep learning. With a bit of study and practice, you’ll be able to create remarkable applications that can detect objects in images, process voice commands, and engage in conversations with users.

Frequently Asked Questions (FAQs) on Machine Learning with Python

What are the prerequisites for learning machine learning with Python?

To start learning machine learning with Python, you need to have a basic understanding of Python programming. Familiarity with libraries like NumPy, Pandas, and Matplotlib is also beneficial. Additionally, a basic understanding of statistics and probability is essential, as these fields form the core of machine learning algorithms.

How does Python compare to other languages for machine learning?

Python is one of the most popular languages for machine learning due to its simplicity and readability. It has a wide range of libraries and frameworks like Scikit-learn, TensorFlow, and PyTorch that simplify the development of machine learning models. Other languages like R and Java are also used, but Python’s extensive ecosystem makes it a preferred choice for many.

What are some common machine learning algorithms I can implement with Python?

Python’s Scikit-learn library provides implementations for a wide range of machine learning algorithms. Some of the commonly used ones include linear regression, logistic regression, decision trees, random forests, support vector machines, and k-nearest neighbors. For deep learning, you can use libraries like TensorFlow and PyTorch.

How can I validate the performance of my machine learning model in Python?

You can use techniques like cross-validation and train-test split to validate the performance of your model. Python’s Scikit-learn library provides functions for these. Additionally, you can use metrics like accuracy, precision, recall, and F1 score for classification problems, and mean squared error or R-squared for regression problems.
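As a minimal sketch of these techniques, the snippet below runs five-fold cross-validation on the Iris dataset and computes macro-averaged precision, recall, and F1 score from the cross-validated predictions. The choice of five folds, macro averaging, and max_iter=1000 are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import cross_val_predict, cross_val_score

iris = load_iris()
model = LogisticRegression(max_iter=1000)

# Accuracy on each of the five held-out folds.
scores = cross_val_score(model, iris.data, iris.target, cv=5)
print(scores, scores.mean())

# Predictions made for each example while it was in a held-out fold.
predictions = cross_val_predict(model, iris.data, iris.target, cv=5)
precision, recall, f1, _ = precision_recall_fscore_support(
    iris.target, predictions, average="macro"
)
print(precision, recall, f1)
```

Cross-validation is especially useful for small datasets like this one, where a single test split of a few dozen examples gives a noisy estimate.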

Can I use Python for both supervised and unsupervised learning?

Yes, Python supports both supervised and unsupervised learning. Supervised learning algorithms like regression and classification can be implemented using libraries like Scikit-learn. For unsupervised learning, you can use clustering algorithms like k-means, hierarchical clustering, and DBSCAN.

How can I handle overfitting in my machine learning model?

Overfitting can be handled using techniques like regularization, early stopping, and dropout for neural networks. You can also use ensemble methods like bagging and boosting to reduce overfitting.
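To give a feel for regularization, the sketch below trains two logistic regression models that differ only in the C parameter, which in Scikit-learn is the inverse of the regularization strength. The specific values (100 and 0.01) are arbitrary picks for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()

# Large C = weak regularization; small C = strong regularization.
weak = LogisticRegression(C=100.0, max_iter=5000).fit(iris.data, iris.target)
strong = LogisticRegression(C=0.01, max_iter=5000).fit(iris.data, iris.target)

# Stronger regularization shrinks the learned coefficients toward zero.
weak_norm = np.linalg.norm(weak.coef_)
strong_norm = np.linalg.norm(strong.coef_)
print(weak_norm, strong_norm)
```

Smaller coefficients make for a simpler model that is less able to memorize noise in the training data, at the cost of possibly underfitting if the penalty is too strong.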

What is the role of data preprocessing in machine learning with Python?

Data preprocessing is a crucial step in machine learning. It involves cleaning the data, handling missing values, encoding categorical variables, and scaling features. Python provides libraries like Pandas and Scikit-learn for efficient data preprocessing.
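Here is a small sketch of two of these steps, filling in a missing value and scaling features, chained together with a Scikit-learn pipeline. The input numbers are made up, and replacing missing values with the column mean is only one of several possible strategies.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two features on very different scales, with one missing value (made-up data).
raw = np.array([
    [1.0, 200.0],
    [2.0, np.nan],
    [3.0, 400.0],
    [4.0, 600.0],
])

# Step 1 replaces NaN with the column mean; step 2 rescales each column
# to zero mean and unit variance.
preprocess = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
clean = preprocess.fit_transform(raw)
print(clean)
```

In a real project, you would fit the pipeline on the training set only and then apply it to the test set, so that no information from the test data leaks into training.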

How can I visualize my machine learning model’s performance in Python?

You can use libraries like Matplotlib and Seaborn to visualize your model’s performance with plots such as confusion matrices, ROC curves, and learning curves. Scikit-learn computes the underlying values and also offers display helpers built on Matplotlib.

Can I use Python for natural language processing (NLP)?

Yes, Python provides libraries like NLTK and spaCy for natural language processing. These libraries provide functionality for tokenization, part-of-speech tagging, named entity recognition, and sentiment analysis.

How can I deploy my machine learning model built with Python?

You can deploy your machine learning model using web frameworks like Flask or Django. For large-scale deployment, you can use cloud platforms like AWS, Google Cloud, or Azure. They provide services for model deployment, scaling, and monitoring.
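Whatever framework or platform you pick, deployment usually starts with saving the trained model to disk so that another process can load it. The sketch below does this with joblib, a library that is installed alongside Scikit-learn. The file name is hypothetical, and a temporary directory is used only to keep the example self-contained.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
model = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

with tempfile.TemporaryDirectory() as tmp_dir:
    # "iris_model.joblib" is a made-up file name for this example.
    model_path = os.path.join(tmp_dir, "iris_model.joblib")
    joblib.dump(model, model_path)        # save the trained model
    restored = joblib.load(model_path)    # load it back, as a server would

# The restored model should behave exactly like the original.
same_predictions = bool((restored.predict(iris.data) == model.predict(iris.data)).all())
print(same_predictions)
```

Only load model files from sources you trust, and use the same Scikit-learn version for saving and loading where possible.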