Addressing Class Imbalance: Empowering Predictive Models with SMOTE

Overview

Machine learning is a branch of artificial intelligence that involves building algorithms and statistical models that enable computers to learn and make predictions or decisions without being explicitly programmed. To put it differently, machine learning algorithms enable computers to gather knowledge from data and enhance their capabilities progressively.

There are several types of machine learning algorithms, including:

Supervised Learning
Unsupervised Learning
Semi-Supervised Learning
Reinforcement Learning

To keep the scope manageable, this blog focuses on Supervised Learning.

Supervised Machine Learning is where a model is trained on labeled data, in which each data point has a known output or target value. The model learns the mapping between input and output and then uses it to make predictions on unseen data.

Supervised Machine Learning is further divided into:

Regression: Regression algorithms are used when the output is continuous, such as predicting the demand for rental bikes at any hour of the day.
Classification: Classification algorithms are used when the output variable is categorical, such as predicting whether a person might suffer from a deadly disease in the coming years based on current health-related features. The algorithm learns a function that maps the input variables to a discrete output variable, which may be binary or multi-class.

Now that we have covered machine learning and its subfields, it’s time to build a foundation for the topic we will discuss today.

Binary classification problems are common, and it is rare to find a dataset in which both classes are equally represented. For example, take a binary classification problem in which we have to predict whether a patient will suffer from a deadly disease in the coming 10 years, based on current habits and health-related factors such as blood pressure, hemoglobin, etc. If we have data on 1 lakh (100,000) patients, it is unlikely that the two classes will have equal or even comparable proportions.

Since we are dealing with supervised learning, our model will learn from training data that is skewed toward one class (say, 95,000 patients have no risk of the deadly disease in the coming 10 years, and only 5,000 do).

Why is this class imbalance an issue? First, it makes accuracy misleading when evaluating the model, because a model can score high accuracy simply by predicting the majority class. Second, the model learns from biased data. Predictions from such a model are unreliable in use cases with serious consequences, such as those in the healthcare industry.
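
To make the accuracy problem concrete, here is a small sketch using the 95,000 / 5,000 split from the example above. The "always predict no risk" baseline is a hypothetical model introduced only for illustration; it is not part of the original analysis.

```python
# Labels for the example above: 95,000 patients without risk (0), 5,000 with risk (1).
y_true = [0] * 95_000 + [1] * 5_000

# Hypothetical baseline that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of all patients labeled correctly.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Minority recall: share of at-risk patients that the model actually flags.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy = {accuracy:.2%}")       # accuracy = 95.00%
print(f"minority recall = {recall:.2%}")  # minority recall = 0.00%
```

The baseline looks accurate, yet it never identifies a single at-risk patient, which is why accuracy alone should not be trusted on imbalanced data.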

Introduction

SMOTE (Synthetic Minority Over-sampling Technique) is a method employed to tackle the issue of imbalanced classes in datasets. It is an oversampling technique wherein synthetic samples are generated for the class with fewer occurrences in our dataset. It operates in the feature space, interpolating between closely located minority-class instances to generate new instances.

SMOTE follows a set of steps to generate synthetic data. Each iteration starts by selecting a minority class instance at random. Next, its k nearest neighbors within the minority class are found, and one of these neighbors is used to generate a synthetic instance that helps offset the class imbalance.

The neighbors are found using a distance metric such as Euclidean Distance or Manhattan Distance. The difference between the feature vector of the selected instance and that of the chosen neighbor is then calculated. This difference is multiplied by a random value between 0 and 1 and added to the original feature vector to generate a synthetic data point.
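
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the imbalanced-learn implementation: the function name `smote_sample`, the default `k=5`, the use of Euclidean distance, and the toy `minority` points are all hypothetical choices made for the example.

```python
import math
import random


def smote_sample(minority, k=5, rng=None):
    """Generate one synthetic point from a list of minority-class feature vectors."""
    rng = rng or random.Random()
    # 1. Pick a minority instance at random.
    base = rng.choice(minority)
    # 2. Find its k nearest minority neighbors (Euclidean distance), excluding itself.
    others = [p for p in minority if p is not base]
    neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
    # 3. Pick one neighbor and compute the difference to the base instance.
    neighbor = rng.choice(neighbors)
    # 4. Scale the difference by a random factor between 0 and 1 and add it to the base.
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbor)]


# Toy minority class with two features (made-up values).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
synthetic = smote_sample(minority, k=2, rng=random.Random(42))
print(synthetic)
```

Because the new point lies on the line segment between an existing instance and one of its neighbors, it always stays inside the region already occupied by the minority class.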

Implementation of SMOTE

Let’s now have a look at the implementation of SMOTE:

We have used the Cardiovascular dataset, which has two classes:
1: Represents that there is a risk of deadly disease to the patient.
0: Represents that there is no risk of deadly disease to the patient.
The dataset is not very large; it has around 3400 records.

Let’s have a glimpse of what the dataset looks like:

[Image: preview of the dataset]

We will now see how both classes are distributed:

[Image: class distribution before applying SMOTE]

We can see that class 0 (no risk of disease) is the majority class. If the model learns from this data, its predictions will be biased toward that class. Thus, we will use SMOTE to counter the issue.

Let’s have a look at the code:

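# Import the libraries
import pandas as pd
from imblearn.over_sampling import SMOTE

# Resample only the minority class
sm = SMOTE(sampling_strategy='minority', random_state=42)

# df_transformed is the preprocessed DataFrame (its creation is not shown in this blog).
# Fit SMOTE and generate the resampled features (X) and target (y).
X, y = sm.fit_resample(df_transformed.drop('TenYearCHD', axis=1), df_transformed['TenYearCHD'])

# Combine the resampled features and target into a single DataFrame
df_smote = pd.concat([pd.DataFrame(X), pd.DataFrame(y)], axis=1)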

After applying SMOTE, let’s now have a look at how the classes are distributed:

[Image: class distribution after applying SMOTE]

The classes are now equally distributed.

Advantages and Disadvantages of SMOTE

Advantages of SMOTE:

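It keeps all of the original data, since no samples are discarded as in under-sampling.
It reduces the risk of overfitting compared with random oversampling, because it creates new points instead of exact duplicates.
It can be applied to datasets with many numeric features, although nearest-neighbor distances become less informative as dimensionality grows.
It is easy to implement.
It generates more diverse samples than simple duplication.
It can be combined with other techniques.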

Disadvantages of SMOTE:

It generates synthetic data points from the minority class, which may sometimes introduce noisy points.
New data points are generated from existing data points, which can still lead to overfitting.
Sampling parameters can be difficult to tune.

Conclusion

Through this blog, we tried to understand what class imbalance is, why it matters, and how to tackle it with SMOTE. Several other techniques can be used to address the issue. In upcoming blogs, I will try to cover other techniques to combat class imbalance.

FAQs

1. List some other techniques to address the issue of class imbalance.

ANS: – Other techniques to combat class imbalance include:

ADASYN (Adaptive Synthetic Sampling Approach)
Random Under-sampling
Random Over-sampling
Hybridization (SMOTE + Tomek Links)
Hybridization (SMOTE + ENN)

2. What is the difference between SMOTE and random oversampling?

ANS: – Random oversampling involves duplicating instances of the minority class randomly, whereas SMOTE generates synthetic instances by interpolating between minority class instances and their k-nearest neighbors in the feature space. SMOTE is typically considered to be more effective than random oversampling.
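
The contrast can be illustrated with a small sketch. It is a simplification under stated assumptions: the `minority` points are made up, the helper `interpolate` is hypothetical, and any other minority point is allowed to act as the neighbor, whereas SMOTE restricts the choice to the k nearest neighbors.

```python
import random

rng = random.Random(0)

# Toy minority class with two features (made-up values).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2)]

# Random oversampling: draw existing minority points again, producing exact duplicates.
duplicated = [rng.choice(minority) for _ in range(5)]


def interpolate(a, b, gap):
    """Return the point that lies a fraction `gap` of the way from a to b."""
    return tuple(x + gap * (y - x) for x, y in zip(a, b))


# SMOTE-style interpolation: each new point lies between two distinct minority points.
synthetic = []
for _ in range(5):
    a, b = rng.sample(minority, 2)
    synthetic.append(interpolate(a, b, rng.random()))

# Oversampled points are all copies of existing ones; interpolated points are new.
print(all(p in minority for p in duplicated))
print(all(p not in minority for p in synthetic))
```

Random oversampling only repeats points the model has already seen, while interpolation adds previously unseen points within the minority region.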

3. Why is class imbalance a problem in machine learning?

ANS: – Class imbalance can lead to biased models that perform poorly on the minority class. Machine learning algorithms tend to favor the majority class due to its larger representation in the dataset.

WRITTEN BY Parth Sharma

Parth works as a Subject Matter Expert at CloudThat. He has been involved in a variety of AI/ML projects and has a growing interest in machine learning, deep learning, generative AI, and cloud computing. With a practical approach to problem-solving, Parth focuses on applying AI to real-world challenges while continuously learning to stay current with evolving technologies and methodologies.