Clustering is an unsupervised learning technique that groups similar data points based on their features. It aims to identify relationships within the given data and group the points accordingly, so that the grouping generalizes to new data points. Clustering has many real-world applications, such as customer segmentation and anomaly detection. In this blog, we take a brief look at an unsupervised machine learning algorithm called K-Means.
Supervised: A machine learning technique in which the model is trained on labeled data; the target or output labels are provided directly during training. The goal is to learn the mapping from inputs to labels in the given dataset so that predictions can be made for new or unseen data. Examples of supervised learning include Neural Networks and Linear Regression.
Unsupervised: A machine learning technique in which the model is trained on an unlabeled dataset, meaning there are no target or output labels. Unsupervised learning aims to identify structure in the given data, such as groups (clusters). Examples of unsupervised learning include K-Means clustering and PCA (Principal Component Analysis).
Examples of clustering use cases:
- Fruit market (grouping apples, oranges, etc.)
- Clustering similar documents
- Customer segmentation
- Clustering news articles, and many more
Intuition of K-Means
We input a set of unlabeled data into our model, and the model finds the hidden patterns, relationships, or similarities in that data. Below, I use fruit clustering as an easy-to-understand example.
Imagine the black dots above are a mix of different fruits. We can group the fruits based on the similarities within each group. For example, an apple is red in color and has an oblong shape. Other fruits, such as oranges and pineapples, are clustered on the right side according to the similarities found among them. We could do this manually, but we want the machine to find the pattern by learning from the data and then assign new data points to the right group.
Working Principle of K-Means
Step 1: Initialize the K value.
Step 2: Assign Data Points to Cluster
Step 3: Move the Centroid
Repeat Steps 2 and 3 until the centroids stop moving, that is, until you get optimal centroid points.
We provide n_clusters to the model so that it groups the data into n_clusters clusters.
Note: n_clusters is the total number of groups or clusters we want from the given data.
Note: K-Means is an iterative process consisting of the following steps:
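As a minimal sketch of the initialization, assuming the common approach of picking K distinct data points at random as the starting centroids. The sample points and the helper name `init_centroids` are illustrative assumptions, not part of any particular library:

```python
import random

def init_centroids(points, n_clusters, seed=None):
    """Pick n_clusters distinct data points as the initial centroids."""
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and the number of points")
    rng = random.Random(seed)  # a fixed seed makes the choice reproducible
    return rng.sample(points, n_clusters)

# Hypothetical 2-D data: two features per sample
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]
centroids = init_centroids(points, n_clusters=2, seed=0)
print(centroids)
```

Sampling from the data itself guarantees that every starting centroid lies where data actually exists, which is one simple way to avoid centroids that attract no points at all.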
- Cluster Assignment Step
- Move Centroid Step
Cluster Assignment Step
Once the centroids are randomly initialized, the model calculates the distance between each data point and each centroid.
Example: n_clusters = k = 2 (centroids C1 and C2)
- 1st data point to centroid 1: 5 pt
- 1st data point to centroid 2: 5.5 pt
- 2nd data point to centroid 1: 2.1 pt
- 2nd data point to centroid 2: 3 pt
- 3rd data point to centroid 1: 8 pt
- 3rd data point to centroid 2: 3.6 pt
Based on the distances calculated, each data point is assigned to its nearest centroid.
- For centroid 1: data points 1 and 2 are assigned (5 < 5.5 and 2.1 < 3)
- For centroid 2: data point 3 is assigned (3.6 < 8)
Likewise, we do this for all data points.
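The assignment step for the example distances above can be sketched as follows. The function name `assign_clusters` is a hypothetical helper, and labels are zero-based indices into the list of centroids:

```python
# Distances from the example: rows are data points, columns are centroids C1 and C2
distances = [
    [5.0, 5.5],  # 1st data point
    [2.1, 3.0],  # 2nd data point
    [8.0, 3.6],  # 3rd data point
]

def assign_clusters(distances):
    """Return, for each data point, the index of its nearest centroid."""
    return [row.index(min(row)) for row in distances]

labels = assign_clusters(distances)
for i, label in enumerate(labels, start=1):
    print(f"data point {i} -> centroid {label + 1}")
# data point 1 -> centroid 1
# data point 2 -> centroid 1
# data point 3 -> centroid 2
```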
Move Centroid Step (centroid: the center point of a cluster)
Once every data point has been assigned to a centroid, average the data points assigned to each centroid and move that centroid to the average (mean) point.
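A small sketch of this averaging, assuming 2-D points stored as tuples and zero-based cluster labels. The data and the helper name `move_centroids` are made up for illustration:

```python
def move_centroids(points, labels, n_clusters):
    """Return the mean of the points assigned to each cluster."""
    centroids = []
    for k in range(n_clusters):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            raise ValueError(f"cluster {k} has no points; it needs re-initializing")
        # Average each coordinate (feature) separately
        centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return centroids

# Hypothetical 2-D points and their current cluster labels
points = [(1.0, 1.0), (2.0, 3.0), (8.0, 8.0)]
labels = [0, 0, 1]
new_centroids = move_centroids(points, labels, n_clusters=2)
print(new_centroids)  # [(1.5, 2.0), (8.0, 8.0)]
```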
The sample diagram shows the clusters after 4 iterations.
How do you choose n_clusters or the K value?
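Putting the steps together, the whole loop might look like the sketch below. It is a plain-Python illustration under a few assumptions (Euclidean distance, random data points as initial centroids, stopping when no centroid moves), not a reference implementation; the data and the function name `kmeans` are hypothetical.

```python
import math
import random

def kmeans(points, n_clusters, max_iter=100, seed=0):
    """Plain K-Means: returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, n_clusters)  # Step 1: random initial centroids
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid
        labels = [
            min(range(n_clusters), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = []
        for k in range(n_clusters):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # leave an empty cluster's centroid in place
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical data with two well-separated groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, n_clusters=2)
print(centroids)
print(labels)
```

Which group ends up with label 0 or 1 depends on the random initialization, so only the grouping itself is meaningful, not the label numbers.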
The ideal value of K for K-Means clustering depends on the data and the use case. There are several popular methods for choosing the number of clusters: the Elbow technique, the Silhouette method, domain expertise, and manual experimentation. I recommend using multiple methods and comparing the results.
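To illustrate the Elbow technique: run K-Means for several values of K, record the within-cluster sum of squared distances (often called inertia) for each, and look for the K after which the improvement flattens out. The sketch below is only an illustration. To keep the output reproducible it assumes a deterministic farthest-point initialization rather than random initialization, and the data and function names are hypothetical.

```python
import math

def farthest_point_init(points, n_clusters):
    """Deterministic init: start at the first point, then keep adding the
    point that is farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < n_clusters:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids

def kmeans_inertia(points, n_clusters, max_iter=100):
    """Run K-Means and return the within-cluster sum of squared distances."""
    centroids = farthest_point_init(points, n_clusters)
    for _ in range(max_iter):
        clusters = [[] for _ in range(n_clusters)]
        for p in points:
            nearest = min(range(n_clusters), key=lambda k: math.dist(p, centroids[k]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centroids[k]
            for k, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)

# Hypothetical data with three visible groups
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.9),
]
curve = {k: kmeans_inertia(points, k) for k in range(1, 6)}
for k, inertia in curve.items():
    print(f"K={k}: inertia={inertia:.2f}")
```

For this data the inertia drops sharply up to K=3 and barely improves afterwards, so the "elbow" suggests three clusters. On real data the bend is often less clear, which is one reason to compare it with the other methods listed above.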
1. What is the K value in K-Means?
ANS: – The K value, or n_clusters, is the number of clusters to form from the given data. K is a hyperparameter that is set before training.
2. How is the distance between the centroid and the data points calculated?
ANS: – There are several ways to calculate the distance between two points; one commonly used method is the Euclidean distance.
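For reference, the Euclidean distance between two points is the square root of the sum of squared differences of their coordinates. A minimal sketch follows; the point and centroid values are made up:

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of features."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of features")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

point = (1.0, 2.0)
centroid = (4.0, 6.0)
print(euclidean_distance(point, centroid))  # 5.0
```

In Python 3.8 and later, the standard library's `math.dist(point, centroid)` computes the same value.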
WRITTEN BY Ganesh Raj
Ganesh Raj V works as a Sr. Research Associate at CloudThat. He is a highly analytical, creative, and passionate individual experienced in Data Science, Machine Learning algorithms, and Cloud Computing. In a quest to learn and work with recent technologies, he strives to stay updated on advanced technologies while efficiently solving problems analytically.