Feature Engineering for Highly Skewed Features

Introduction

In this article, we cover three scenarios of skewed data and implement feature engineering for each:

  1. When data is highly right-skewed
  2. When data is highly left-skewed
  3. When data spikes at extreme values

Dataset Information

The data comes from the Kaggle competition “Predicting the Beats-per-Minute of Songs”. Click the link below for the dataset:

https://www.kaggle.com/competitions/playground-series-s5e9/data

Handling Different Scenarios

Let’s discuss how to handle each of these three scenarios in detail:

1. Right-Skewed Feature Data

The feature ‘VocalContent’ is highly right-skewed, with a significant spike at the minimum value and a long right tail.

Here, approximately 30% of the values are equal to the minimum, and the rest are distributed from 0.03 to 0.25. To handle such right-skewed data, create a new binary feature that flags whether a value belongs to the min-value spike group, as in the code below:

# Extract minimum threshold value
threshold = df['VocalContent'].min() 

# Create a binary indicator
df["VocalContent_bin"] = df['VocalContent'].apply(lambda x: 0 if (x <= threshold) else 1).astype(int)
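To see the flag in action, here is a minimal runnable sketch on made-up values (the toy numbers below are illustrative, not from the Kaggle dataset). It also shows a vectorized comparison, which gives the same result as the `apply()` above but is faster on large frames:

```python
import pandas as pd

# Toy stand-in for the real column: a spike at the minimum (0.02)
# plus a right tail up to 0.25 (hypothetical values).
df = pd.DataFrame({"VocalContent": [0.02, 0.02, 0.02, 0.05, 0.10, 0.18, 0.25]})

# Extract minimum threshold value
threshold = df["VocalContent"].min()

# Vectorized equivalent of the apply() above:
# 0 marks the min-value spike group, 1 marks the tail.
df["VocalContent_bin"] = (df["VocalContent"] > threshold).astype(int)

# Share of rows sitting on the spike (~30% in the real dataset)
spike_share = (df["VocalContent_bin"] == 0).mean()
print(df["VocalContent_bin"].tolist())  # [0, 0, 0, 1, 1, 1, 1]
```

The vectorized form also avoids Python-level function calls per row, which matters on competition-sized data.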

2. Left-Skewed Feature Data

The feature ‘AudioLoudness’ is highly left-skewed, with a significant spike at the maximum value and a long tail on the left side.

Here, approximately 11% of the values are equal to the maximum, and the rest are spread across the left tail. To handle such left-skewed data, create a new binary feature that flags whether a value belongs to the max-value spike group, as in the code below:

# Extract maximum threshold value
threshold = df['AudioLoudness'].max() 

# Create a binary indicator
df["AudioLoudness_bin"] = df['AudioLoudness'].apply(lambda x: 1 if (x >= threshold) else 0).astype(int)
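A quick sketch on made-up loudness values (hypothetical, not from the dataset) shows the flag working. One caveat worth noting: with floating-point features, an exact comparison against the maximum can miss values that differ only by rounding, so `np.isclose` is a more forgiving alternative:

```python
import numpy as np
import pandas as pd

# Toy stand-in: a left tail plus a spike at the maximum (-0.5).
df = pd.DataFrame({"AudioLoudness": [-30.0, -18.5, -7.2, -0.5, -0.5, -0.5]})

# Extract maximum threshold value
threshold = df["AudioLoudness"].max()

# Vectorized equivalent of the apply() above:
# 1 marks the max-value spike group, 0 marks the tail.
df["AudioLoudness_bin"] = (df["AudioLoudness"] >= threshold).astype(int)

# Tolerant variant in case of floating-point noise around the spike value
df["AudioLoudness_bin_safe"] = np.isclose(df["AudioLoudness"], threshold).astype(int)
print(df["AudioLoudness_bin"].tolist())  # [0, 0, 0, 1, 1, 1]
```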

3. Data Spikes at Extreme Values

The feature ‘MoodScore’ exhibits a significant spike at the minimum and maximum values, along with a long tail in between.

Here, approximately 9% of the values are equal to the minimum or maximum, and the rest are distributed between them. To handle this type of distribution, add binary indicators that mark whether a row belongs to an extreme group, either a very low or a very high mood score, as in the code below:

# Define thresholds for "low" or "high" spikes
low_threshold = df['MoodScore'].min() # near 0
high_threshold = df['MoodScore'].max() # near 1

# Create binary indicators
df["MoodScore_is_low"] = (df["MoodScore"] <= low_threshold).astype(int)
df["MoodScore_is_high"] = (df["MoodScore"] >= high_threshold).astype(int)

# Create a combined indicator for "extreme"
df["MoodScore_extreme"] = ((df["MoodScore"] <= low_threshold) | (df["MoodScore"] >= high_threshold)).astype(int)
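To see all three indicators together, here is a runnable sketch on made-up mood scores (the toy values are illustrative, not from the dataset). The combined flag can also be built from the two indicators already created, which keeps the logic in one place:

```python
import pandas as pd

# Toy stand-in: spikes at both extremes (0.0 and 1.0) with values between.
df = pd.DataFrame({"MoodScore": [0.0, 0.0, 0.3, 0.55, 0.8, 1.0]})

# Define thresholds for "low" or "high" spikes
low_threshold = df["MoodScore"].min()   # near 0
high_threshold = df["MoodScore"].max()  # near 1

# Create binary indicators
df["MoodScore_is_low"] = (df["MoodScore"] <= low_threshold).astype(int)
df["MoodScore_is_high"] = (df["MoodScore"] >= high_threshold).astype(int)

# Combined "extreme" flag derived from the two indicators above
df["MoodScore_extreme"] = (df["MoodScore_is_low"] | df["MoodScore_is_high"]).astype(int)

print(df["MoodScore_extreme"].tolist())  # [1, 1, 0, 0, 0, 1]
```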

Conclusion

Be experimental when handling different kinds of features; fixed formulas won’t work for every one. And, trust me, the right feature engineering has a significant impact even on a simple model.

Stay Tuned!!

Learn how to do the Multi-Class Prediction of Obesity Risk on a Kaggle dataset in detail by clicking the link below:

Multi-Class Prediction of Obesity Risk- Kaggle Dataset

Keep learning and keep implementing!!
