Intuition on Unsupervised Learning with K-Means Clustering

Overview

Clustering is an unsupervised learning technique that groups similar data points together based on their features. It aims to identify relationships within the given data and group the data accordingly, so that the grouping generalizes to new data points. Clustering has many real-world applications, such as Customer Segmentation and Anomaly Detection. In this blog, we take a brief look at an unsupervised machine learning algorithm called K-Means.


Introduction

Supervised: A machine learning technique in which the model is trained on labeled data. We directly provide the target or output labels during training. The goal of supervised learning is to learn the mapping from inputs to labels in the given dataset so that predictions can be made for new or unseen data. Examples of supervised learning include Neural Networks and Linear Regression.

Unsupervised: A machine learning technique in which the model is trained on an unlabeled dataset, meaning there are no target or output labels. Unsupervised learning aims to identify structure in the given data, such as groups (clusters). Examples of unsupervised learning include K-Means Clustering and PCA (Principal Component Analysis).

Clustering Examples:

Grouping fruits in a market (apple, orange, etc.)
Clustering similar documents
Customer Segmentation
Clustering news articles

Intuition of K-Means

[Figure: kmeans. Source: Google]

We input a set of unlabeled data into our model, and the model finds the hidden patterns or relationships, also called similarities, in the given data. Below, I have taken fruit clustering as an example for easy understanding.

[Figure: kmeans2. Source: Google]

Imagine the black dots above are a mix of different fruits. We can group the fruits based on their similarity. For example, apples are red and roughly round. Imagine that other fruits, like oranges and pineapples, are clustered on the right side according to the similarities found in their color and shape. We could do this manually, but we want the machine to find the pattern by understanding the data and then predict the group of new data.

Working Principle of K-Means

Step 1: Choose the K value and initialize K centroids.

Step 2: Assign each data point to a cluster.

Step 3: Move the centroids.

Repeat Steps 2 and 3 until the centroids stop moving, that is, until the algorithm has converged.

We provide n_clusters to the model so that it groups the data into n_clusters clusters.

Note: n_clusters (the K value) is the total number of groups or clusters we want from the given data.

Note: K-Means is an iterative flow consisting of the steps below:

Cluster Assignment Step
Move Centroid Step
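To make the flow above concrete, here is a minimal sketch of the whole loop in plain Python. It is an illustration under a few assumptions rather than a production implementation: the function name kmeans, the max_iter and seed parameters, the sample points, and the choice to keep an empty cluster's old centroid are all hypothetical.

```python
import math
import random


def kmeans(points, n_clusters, max_iter=100, seed=0):
    """Basic K-Means on a list of equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: pick n_clusters distinct data points as the starting centroids.
    centroids = rng.sample(points, n_clusters)
    labels = []
    for _ in range(max_iter):
        # Step 2 (cluster assignment): index of the nearest centroid for each point.
        labels = [
            min(range(n_clusters), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 3 (move centroid): mean of the points assigned to each centroid.
        new_centroids = []
        for j in range(n_clusters):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical 2-D data with two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, n_clusters=2)
print(centroids)
print(labels)
```

Which group ends up labelled 0 or 1 depends on the random initialization, so only the grouping itself is meaningful, not the label numbers.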

Cluster Assignment Step

Once we randomly initialize the centroids, the model calculates the distance between each data point and each centroid.

Example: n_clusters = K = 2 (centroids C1 and C2)

Sample distances:

[Figure: kmeans3]

1st data point to centroid 1: 5 pt
1st data point to centroid 2: 5.5 pt
2nd data point to centroid 1: 2.1 pt
2nd data point to centroid 2: 3 pt
3rd data point to centroid 1: 8 pt
3rd data point to centroid 2: 3.6 pt

Based on the distances calculated from each data point to the given centroids, each data point is assigned to its nearest centroid.

For centroid 1: data points 1 and 2 are assigned (5 < 5.5 and 2.1 < 3)
For centroid 2: data point 3 is assigned (3.6 < 8)

Likewise, we do this for all data points.
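The assignment rule can be checked with a few lines of Python. This small sketch reuses the sample distances listed above; the variable names are only illustrative. For each data point it picks the centroid with the smallest distance.

```python
# Distances from the worked example: one row per data point,
# one column per centroid (C1, C2).
distances = [
    [5.0, 5.5],  # 1st data point
    [2.1, 3.0],  # 2nd data point
    [8.0, 3.6],  # 3rd data point
]

# Each point goes to the centroid with the smallest distance
# (centroids are numbered from 1 to match the text).
assignments = [row.index(min(row)) + 1 for row in distances]
print(assignments)  # [1, 1, 2]
```

With these numbers, the 1st and 2nd data points are closest to centroid 1, and the 3rd data point is closest to centroid 2.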

Move Centroid Step (centroid: the center point of a cluster)

Once every data point has been assigned to a centroid, average the data points assigned to each centroid and move that centroid to the average point.
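As a rough sketch of this step, the new centroid is simply the coordinate-wise mean of the points assigned to it. The helper name move_centroid and the sample points below are hypothetical and used only for illustration.

```python
def move_centroid(members):
    """Return the coordinate-wise mean of the points assigned to one centroid."""
    return tuple(sum(axis) / len(members) for axis in zip(*members))


# Hypothetical 2-D points currently assigned to one cluster.
cluster_points = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
new_centroid = move_centroid(cluster_points)
print(new_centroid)  # (3.0, 4.0)
```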

Visualize:

[Figure: kmeans4. Source: Google]

The sample diagram above shows the clusters after 4 iterations.

How do you choose n_clusters or the K value?

[Figure: kmeans5. Source: Google]

The ideal value of K for K-Means clustering depends on the data and the use case. There are several popular methods for choosing the number of clusters: the Elbow technique, the Silhouette method, domain expertise, and manual experimentation. I would recommend using multiple methods and comparing the results.
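As an illustration of the Elbow technique, the sketch below runs a basic K-Means for several values of K and records the sum of squared distances from each point to its nearest centroid (often called inertia). The "elbow" is the K after which this value stops dropping sharply. The helper kmeans_inertia, its parameters, and the sample data are assumptions made for this example, not part of any particular library.

```python
import math
import random


def kmeans_inertia(points, k, n_init=20, max_iter=100, seed=0):
    """Run a basic K-Means several times and return the lowest inertia found.

    Inertia = sum of squared distances from each point to its nearest centroid.
    """
    rng = random.Random(seed)
    best = math.inf
    for _restart in range(n_init):
        # Start from k distinct data points chosen at random.
        centroids = rng.sample(points, k)
        for _iteration in range(max_iter):
            # Cluster assignment step.
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
                clusters[nearest].append(p)
            # Move centroid step (an empty cluster keeps its old centroid).
            updated = [
                tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centroids[j]
                for j, c in enumerate(clusters)
            ]
            if updated == centroids:
                break
            centroids = updated
        inertia = sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)
        best = min(best, inertia)
    return best


# Hypothetical data: three tight groups of 2-D points.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.9),
]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 6)}
for k, value in inertias.items():
    print(k, round(value, 3))
```

For data like this, the inertia should fall steeply up to K = 3 and change very little afterwards, which suggests K = 3. On real data the elbow is often less clear, which is one reason to compare it with the other methods.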

Conclusion

K-Means is a machine learning technique used for data clustering by finding hidden similarities in the given unlabeled data. We have seen the intuition behind K-Means Clustering. It is a simple and powerful technique with many real-world applications, from segmentation to anomaly detection.

Drop a query if you have any questions regarding K-Means Clustering and we will get back to you quickly.



FAQs

1. What is the K value in K-Means?

ANS: – The K value, or n_clusters, is the number of clusters to form from the given data. K is a hyperparameter, which is set before training.

2. How is the distance between a centroid and a data point calculated?

ANS: – There are several methods to calculate the distance between two points; one commonly used method is the Euclidean distance.
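For reference, the Euclidean distance is the square root of the sum of squared coordinate differences. Here is a small sketch in Python; the function name euclidean_distance and the sample point and centroid are illustrative only, while math.dist is the standard-library equivalent available since Python 3.8.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points of the same dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


point = (1.0, 2.0)
centroid = (4.0, 6.0)
print(euclidean_distance(point, centroid))  # 5.0
print(math.dist(point, centroid))           # 5.0
```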

WRITTEN BY Ganesh Raj

Ganesh Raj V works as a Sr. Research Associate at CloudThat. He is a highly analytical, creative, and passionate individual experienced in Data Science, Machine Learning algorithms, and Cloud Computing. In a quest to learn and work with recent technologies, he strives to stay updated on advanced technologies while solving problems efficiently and analytically.