Intuition on Unsupervised Learning with K-Means Clustering

Overview

Clustering is an unsupervised learning technique that groups similar data points together based on their features. It aims to identify relationships within the given data and group the data accordingly, so that new data points can be placed into one of the discovered groups. Clustering has many real-world applications, such as customer segmentation and anomaly detection. In this blog, we take a brief look at an unsupervised machine learning algorithm called K-Means.

Introduction

Supervised: A machine learning technique that trains the model on labeled data, meaning we directly provide the target or output labels during training. The goal of supervised learning is to learn the mapping from inputs to labels in the given dataset so that predictions can be made for new or unseen data. Examples of supervised learning include Neural Networks, Linear Regression, etc.

Unsupervised: A machine learning technique where the model is trained on an unlabeled dataset, meaning there are no target or output labels. Unsupervised learning aims to identify structure in the given data, such as groups (clusters). A few examples of unsupervised learning are K-Means Clustering, PCA (Principal Component Analysis), etc.

Clustering Examples:

  • Fruits Market (apple, orange, etc.)
  • Clustering similar documents
  • Customer segmentation
  • Clustering news

And many more.

Intuition of K-Means

[Figure: kmeans. Source: Google]

We input a set of unlabeled data into our model, and the model finds the hidden pattern, relationship, or similarity in the given data. Below, I have taken fruit clustering as an example for easy understanding.

[Figure: kmeans2. Source: Google]

Imagine that the black dots in the figure above are a mix of different fruits. We can group the fruits based on their similarity within each group. For example, apples are red in color and roughly round in shape. Imagine that other fruits, like oranges and pineapples, are clustered on the right side of the figure according to the similarities found. We can do this manually, but we want the machine to find the pattern by understanding the data and then place new data into the right group.

Working Principle of K-Means

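Step 1: Choose the K value and initialize K centroids.

Step 2: Assign each data point to its nearest centroid (cluster).

Step 3: Move each centroid to the center of its cluster.

Repeat Steps 2 and 3 until the centroids stop moving, that is, until the assignments no longer change.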
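The three steps can be sketched in plain Python as follows. This is a minimal illustration, assuming 2-D points stored as tuples and Euclidean distance. The function names, the `initial_centroids` parameter, the sample `fruit_points`, and the rule that an empty cluster keeps its old centroid are illustrative assumptions, not part of the description above.

```python
import math
import random


def assign_points(points, centroids):
    """Step 2: give each point the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]


def move_centroids(points, labels, centroids):
    """Step 3: move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return new_centroids


def kmeans(points, k, initial_centroids=None, max_iterations=100, seed=0):
    """Repeat Steps 2 and 3 until no centroid moves (or max_iterations is hit)."""
    # Step 1: choose K and pick K starting centroids (random data points by default).
    if initial_centroids is None:
        initial_centroids = random.Random(seed).sample(points, k)
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iterations):
        labels = assign_points(points, centroids)
        updated = move_centroids(points, labels, centroids)
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return centroids, assign_points(points, centroids)


# Hypothetical data: two loose groups of 2-D points.
fruit_points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = kmeans(fruit_points, k=2)
print(centroids)
print(labels)
```

Because the starting centroids are random, different seeds can produce different cluster numbering, and on harder data they can produce different clusters altogether.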

We provide n_clusters to our model so that it groups the data into that many clusters.

Note: n_clusters is the total number of groups or clusters we want from the given data (the K in K-Means).

Note: K-Means is an iterative process consisting of the steps below:

  • Cluster Assignment Step
  • Move Centroid Step

Cluster Assignment Step

Once we randomly initialize the centroids, the model calculates the distance between each data point and each centroid.

Example: n_clusters = K = 2 (centroids C1 and C2)

Sample distances:

[Figure: kmeans3]

  • 1st data point to centroid 1: 5 pt
  • 1st data point to centroid 2: 5.5 pt
  • 2nd data point to centroid 1: 2.1 pt
  • 2nd data point to centroid 2: 3 pt
  • 3rd data point to centroid 1: 8 pt
  • 3rd data point to centroid 2: 3.6 pt

Based on the distances calculated from each data point to the given centroids, each data point is assigned to its nearest centroid.

  • For centroid 1: the 1st and 2nd data points are assigned (5 < 5.5 and 2.1 < 3)
  • For centroid 2: the 3rd data point is assigned (3.6 < 8)

Likewise, we do this for all data points.
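As a small check of the assignment rule, the sketch below takes the six sample distances listed above exactly as given and picks the nearest centroid for each data point. The dictionary layout and variable names are illustrative assumptions.

```python
# Distances copied from the sample list above: distances[point][centroid].
distances = {
    1: {1: 5.0, 2: 5.5},
    2: {1: 2.1, 2: 3.0},
    3: {1: 8.0, 2: 3.6},
}

# Each data point is assigned to the centroid with the smallest distance.
assignment = {
    point: min(to_centroids, key=to_centroids.get)
    for point, to_centroids in distances.items()
}

# Group the data points by the centroid they were assigned to.
clusters = {}
for point, centroid in assignment.items():
    clusters.setdefault(centroid, []).append(point)

print(assignment)  # {1: 1, 2: 1, 3: 2}
print(clusters)    # {1: [1, 2], 2: [3]}
```

With these numbers, data points 1 and 2 are closest to centroid 1, and data point 3 is closest to centroid 2.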

Move Centroid Step

A centroid is the center point of a cluster. Once every data point has been assigned to a centroid, average the data points assigned to each centroid and move that centroid to the average point.
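The averaging can be sketched as below. The coordinates are hypothetical and chosen only to keep the arithmetic easy to follow; `mean_point` is an illustrative helper name.

```python
def mean_point(points):
    """Return the coordinate-wise average of a list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


# Hypothetical 2-D data points currently assigned to one centroid.
assigned_points = [(1.0, 2.0), (2.0, 4.0), (3.0, 3.0)]
old_centroid = (0.0, 0.0)

# The centroid moves to the average of its assigned points.
new_centroid = mean_point(assigned_points)
print(old_centroid, "->", new_centroid)  # (0.0, 0.0) -> (2.0, 3.0)
```

The x-coordinates average to (1 + 2 + 3) / 3 = 2 and the y-coordinates to (2 + 4 + 3) / 3 = 3, so the centroid moves to (2, 3).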

Visualize:

[Figure: kmeans4. Source: Google]

The sample diagram above shows the clusters after 4 iterations.

How do you choose n_clusters or the K value?

[Figure: kmeans5. Source: Google]

The ideal value of K for K-Means clustering depends on the data and the use case. There are several popular methods for choosing the number of clusters: the Elbow method, the Silhouette method, domain expertise, and manual experimentation. I recommend using more than one method and comparing the results.
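To make the Elbow method concrete, here is a rough sketch that runs K-Means for several values of K and records the within-cluster sum of squares (WCSS) for each. It rests on several assumptions: the data points are hypothetical, the compact `kmeans` helper is written only for this example, and the starting centroids are chosen with a deterministic farthest-first rule instead of random initialization so that the numbers are reproducible.

```python
import math


def farthest_first_init(points, k):
    """Assumed start: first point, then repeatedly the point farthest from all chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def kmeans(points, k, max_iterations=100):
    """Compact K-Means: assign, move, repeat until the centroids stop moving."""
    centroids = farthest_first_init(points, k)
    for _ in range(max_iterations):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            updated.append(
                tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            )
        if updated == centroids:
            break
        centroids = updated
    return centroids


def wcss(points, centroids):
    """Within-cluster sum of squares: squared distance of each point to its nearest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)


# Hypothetical data with three visible groups.
points = [
    (1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
    (8.0, 8.0), (8.0, 9.0), (9.0, 8.0),
    (1.0, 8.0), (1.0, 9.0), (2.0, 8.0),
]

wcss_by_k = {k: wcss(points, kmeans(points, k)) for k in range(1, 6)}
for k, value in wcss_by_k.items():
    print(k, round(value, 2))
```

On this data the WCSS drops sharply up to K = 3 and only slightly afterwards, and that bend (the "elbow") suggests K = 3. On real data the bend is often less clear, which is one reason to compare it with another method.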

Conclusion

K-Means is a machine learning technique that clusters data by finding hidden similarities in unlabeled data. We have seen the intuition behind K-Means clustering. It is a simple and powerful technique with many real-world applications, from segmentation to anomaly detection.

Drop a query if you have any questions regarding K-Means Clustering and we will get back to you quickly.

About CloudThat

CloudThat is an official AWS (Amazon Web Services) Advanced Consulting Partner and Training partner, AWS Migration Partner, AWS Data and Analytics Partner, AWS DevOps Competency Partner, Amazon QuickSight Service Delivery Partner, AWS EKS Service Delivery Partner, and Microsoft Gold Partner, helping people develop knowledge of the cloud and help their businesses aim for higher goals using best-in-industry cloud computing practices and expertise. We are on a mission to build a robust cloud computing ecosystem by disseminating knowledge on technological intricacies within the cloud space. Our blogs, webinars, case studies, and white papers enable all the stakeholders in the cloud computing sphere.

To get started, go through our Consultancy page and Managed Services Package, CloudThat’s offerings.

FAQs

1. What is the K value in K-Means?

ANS: – The K value, or n_clusters, is the number of clusters we want the algorithm to find in the given data. K is a hyperparameter, which is set before training.

2. How is the distance between the centroid and the data points calculated?

ANS: – There are several methods to calculate the distance between two points; however, one commonly used method is the Euclidean distance.
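For illustration, the Euclidean distance between two points is the square root of the sum of squared coordinate differences. The sketch below computes it by hand and with Python's built-in `math.dist`. The two points are hypothetical, and the Manhattan distance is included only as an example of another method.

```python
import math


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences (one alternative distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical data point and centroid.
data_point = (0.0, 0.0)
centroid = (3.0, 4.0)

print(euclidean(data_point, centroid))  # 5.0
print(math.dist(data_point, centroid))  # 5.0
print(manhattan(data_point, centroid))  # 7.0
```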

WRITTEN BY Ganesh Raj

Ganesh Raj V works as a Sr. Research Associate at CloudThat. He is a highly analytical, creative, and passionate individual experienced in Data Science, Machine Learning algorithms, and Cloud Computing. In a quest to learn and work with recent technologies, he strives to stay updated on advanced technologies while efficiently solving problems analytically.
