Intuition on Unsupervised Learning with K-Means Clustering

Overview

Clustering is an unsupervised learning technique that attempts to group similar data points based on their features. It aims to identify relationships within the given data and group the data accordingly, so that the grouping generalizes to new data points. Clustering has many real-world applications, such as customer segmentation and anomaly detection. In this blog, we take a brief look at an unsupervised machine learning algorithm called K-Means.

Introduction

Supervised: A machine learning technique in which the model is trained on labeled data. We directly provide the target or output labels during training. The goal of the supervised technique is to learn the mapping from inputs to labels in the given dataset so that predictions can be made for new or unseen data. Examples of supervised learning include neural networks, linear regression, etc.

Unsupervised: A machine learning technique in which the model is trained on an unlabeled dataset, meaning there are no target or output labels. Unsupervised learning aims to identify the structure or groups in the given data, for example as clusters. A few examples of unsupervised learning are K-Means clustering, PCA (principal component analysis), etc.

Clustering examples:

  • Fruit market (apples, oranges, etc.)
  • Clustering similar documents
  • Customer segmentation
  • Clustering news, and many more

Intuition of K-Means

[Figure: kmeans. Source: Google]

We input a set of unlabeled data into our model, and the model finds the hidden patterns or relationships, also called similarities, in the given data. Below, I have taken fruit clustering as an example for easy understanding.

[Figure: kmeans2. Source: Google]

Imagine the black dots above are a mix of different fruits. We can group the fruits based on the similarities within each group. For example, apples are red and roughly round. Imagine other fruits, such as oranges and pineapples, clustered on the right side according to the similarities found. We could do this manually, but we want the machine to find the pattern by learning from the data and then assign new data to the right group.

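Working Principle of K-Means

Step 1: Choose the K value and randomly initialize K centroids.

Step 2: Assign each data point to the cluster of its nearest centroid.

Step 3: Move each centroid to the average of the data points assigned to it.

Repeat Steps 2 and 3 until the centroids stop moving, that is, until the cluster assignments no longer change.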

We provide n_clusters to our model so that it groups the data into n_clusters clusters.

Note: n_clusters is the total number of groups or clusters we want from the given data.
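A minimal sketch of the initialization, assuming we start from K data points picked at random (one common choice, not the only one). The function name `init_centroids`, the `seed` parameter, and the sample points are hypothetical and used only for illustration.

```python
import random

def init_centroids(points, k, seed=0):
    """Pick k distinct data points to serve as the starting centroids."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of data points")
    rng = random.Random(seed)  # seeded so that the run is repeatable
    return rng.sample(points, k)

# Hypothetical 2-D data, invented for this example.
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0)]
centroids = init_centroids(points, k=2)
print(centroids)
```

Here `k` plays the role of n_clusters: it fixes how many centroids, and therefore how many clusters, the rest of the procedure works with.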

Note: K-Means is an iterative flow consisting of the following steps:

  • Cluster Assignment Step
  • Move Centroid Step

Cluster Assignment Step

Once we randomly initialize the centroids, the model calculates the distance between each data point and each centroid.

Example: n_clusters = k = 2 (centroids C1 and C2)

Sample distances:

[Figure: kmeans3]

  • 1st data point to centroid 1: 5 pt
  • 1st data point to centroid 2: 5.5 pt
  • 2nd data point to centroid 1: 2.1 pt
  • 2nd data point to centroid 2: 3 pt
  • 3rd data point to centroid 1: 8 pt
  • 3rd data point to centroid 2: 3.6 pt

Based on the distance calculated from each data point to the given centroids, each data point is assigned to its nearest centroid.

  • For centroid 1: the 1st and 2nd data points are assigned (5 < 5.5 and 2.1 < 3)
  • For centroid 2: the 3rd data point is assigned (3.6 < 8)

Likewise, we do this for all data points.
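The assignment step can be sketched with the sample distances listed above. This is only an illustration: the `assign_clusters` helper is a hypothetical name, and the distances are taken as given rather than computed from coordinates.

```python
# Distances copied from the sample above: one row per data point,
# one column per centroid (C1 first, then C2).
distances = [
    [5.0, 5.5],  # 1st data point
    [2.1, 3.0],  # 2nd data point
    [8.0, 3.6],  # 3rd data point
]

def assign_clusters(distances):
    """Return, for each data point, the index of its nearest centroid."""
    return [row.index(min(row)) for row in distances]

labels = assign_clusters(distances)
print(labels)  # [0, 0, 1]: points 1 and 2 go to C1, point 3 goes to C2
```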

Move Centroid Step (centroid: the center point of a cluster)

Once every data point has been assigned to a centroid, average the data points assigned to each centroid and move that centroid to the average point.
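A small sketch of the move step, assuming each data point is a tuple of coordinates and `labels` holds the index of the centroid each point was assigned to. The name `move_centroids` and the sample values are hypothetical. Leaving a centroid in place when no points are assigned to it is an assumption made here for simplicity.

```python
def move_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    moved = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            moved.append(old)  # assumption: an empty cluster keeps its centroid
            continue
        # zip(*members) groups the coordinates axis by axis.
        moved.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return moved

# Hypothetical values, invented for this example.
points = [(1.0, 1.0), (3.0, 1.0), (8.0, 8.0)]
labels = [0, 0, 1]
centroids = [(0.0, 0.0), (9.0, 9.0)]
print(move_centroids(points, labels, centroids))  # [(2.0, 1.0), (8.0, 8.0)]
```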

Visualize:

[Figure: kmeans4. Source: Google]

The sample diagram above shows the clusters after 4 iterations.
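Putting the cluster assignment step and the move centroid step together, the whole iteration might look like the sketch below. It is a plain-Python illustration under stated assumptions, not a production implementation: the names `kmeans` and `nearest`, the `max_iters` and `seed` parameters, and the sample data are all hypothetical, and Euclidean distance is assumed.

```python
import math
import random

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialization from the data
    for _ in range(max_iters):
        # Cluster assignment step: each point joins its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Move centroid step: each centroid moves to the mean of its points.
        moved = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                moved.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                moved.append(centroids[j])  # assumption: keep an empty cluster's centroid
        if moved == centroids:  # nothing moved, so the loop has converged
            break
        centroids = moved
    return centroids, [nearest(p, centroids) for p in points]

# Two hypothetical, well-separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

The loop stops as soon as an iteration leaves every centroid where it was, which is the "repeat until the centroids stop moving" condition; `max_iters` is only a safety cap.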

How do you choose n_clusters or the K value?

[Figure: kmeans5. Source: Google]

The ideal value of K for K-Means clustering depends on the data and the use case. There are several popular methods for choosing the number of clusters: the Elbow technique, the Silhouette method, domain expertise, and manual experimentation. I would recommend using multiple methods and comparing the results.
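As an illustration of the Elbow technique, the sketch below runs K-Means for several values of K and records the within-cluster sum of squared distances (often called inertia). The K after which the inertia stops dropping sharply is the "elbow". Everything here is a hypothetical sketch: the helper names and sample data are invented, and a deterministic farthest-point initialization is used instead of random initialization so that the numbers are repeatable.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))

def farthest_first(points, k):
    """Assumption: start at the first point, then keep adding the point
    that is farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, max_iters=100):
    centroids = farthest_first(points, k)
    for _ in range(max_iters):
        labels = [nearest(p, centroids) for p in points]
        moved = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                moved.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                moved.append(centroids[j])
        if moved == centroids:
            break
        centroids = moved
    return centroids, [nearest(p, centroids) for p in points]

def inertia(points, centroids, labels):
    """Sum of squared distances from each point to its own centroid."""
    return sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))

# Three hypothetical groups of three points each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 1.0), (9.0, 1.0), (8.0, 2.0),
          (5.0, 8.0), (5.0, 9.0), (6.0, 8.0)]

inertias = {}
for k in range(1, 5):
    centroids, labels = kmeans(points, k)
    inertias[k] = inertia(points, centroids, labels)
    print(f"K={k}: inertia={inertias[k]:.2f}")
```

On this sample data the inertia falls steeply up to K = 3 and only slightly after that, which matches the three groups the points were drawn from. In practice the curve is usually plotted and the bend is judged by eye, so it is worth cross-checking with another method.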

Conclusion

K-Means is a machine learning technique used for data clustering by finding hidden similarities among the given unlabeled data. We have seen the intuition behind K-Means clustering. It is a simple and powerful technique that is used in many real-world applications, from segmentation to anomaly detection.

Drop a query if you have any questions regarding K-Means Clustering and we will get back to you quickly.

FAQs

1. What is the K value in K-Means?

ANS: – The K value, or n_clusters, is the number of clusters we want to find in the given data. K is a hyperparameter, which is set before training.

2. How is the distance between the centroid and the data points calculated?

ANS: – There are several methods to calculate the distance between two points; one commonly used method is the Euclidean distance.
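For reference, the Euclidean distance is the square root of the sum of squared differences along each feature. A minimal sketch follows, with `euclidean_distance` as a hypothetical helper name and made-up coordinates.

```python
import math

def euclidean_distance(point, centroid):
    """Square root of the summed squared differences along each feature."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point, centroid)))

print(euclidean_distance((1.0, 2.0), (4.0, 6.0)))  # 5.0
```

Python's standard library offers the same calculation as `math.dist` (Python 3.8 and later).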

WRITTEN BY Ganesh Raj

Ganesh Raj V works as a Sr. Research Associate at CloudThat. He is a highly analytical, creative, and passionate individual experienced in Data Science, Machine Learning algorithms, and Cloud Computing. In a quest to learn and work with recent technologies, he strives to stay updated on advanced technologies while solving problems efficiently and analytically.
