Cluster Analysis in Python
If you’ve browsed through Google News, you’ll most likely have noticed that news articles on similar topics are grouped together. Google scans the Internet for new articles, and then analyzes the text within each of them to decide which articles talk about the same topic! It’s essentially a result of an unsupervised machine learning algorithm. Cluster analysis is an example of an unsupervised learning algorithm. Another example of clustering is trending topics on Twitter. Twitter goes through new tweets in real time to determine words, hashtags, and a combination of them, to determine what’s causing a buzz.
Unsupervised learning involves the use of machine learning techniques to find patterns and make inferences on unlabelled data. Labelled data has some kind of a tag associated with it—such as a name or an attribute. This label can be categorical or numerical data. Unlabelled data contains no tags. Going back to our example on Google News, news articles are examples of unlabelled data. A cluster of news articles has similar words and word associations repeating in it.
Common unsupervised learning algorithms are cluster analysis, anomaly detection, and neural networks. Cluster analysis involves analyzing unlabelled data and grouping similar data points together. The data points are grouped on the basis of certain characteristics. These groups are called clusters. A common example of clustering is customer segmentation. Cluster analysis can group similar customers on the basis of their demographics and spending patterns, and help businesses cater their offerings differently to different clusters.
In this tutorial, we’ll look at the following:
- what cluster analysis is
- when cluster analysis should be used
- how data can be prepared for cluster analysis
- two techniques for performing cluster analysis
- how to interpret the results of cluster analysis