Predict Introverts and Extroverts on Kaggle Dataset


In this article, we will predict whether a person is an introvert or an extrovert using machine learning algorithms, to gain a deeper understanding of human behavior and personality.

1. Introduction

This is a playground dataset that I have used to predict introverts versus extroverts by analyzing different personality traits. Machine learning algorithms help in understanding the traits that distinguish introverts from extroverts, such as:

  • Extroverts like to socialize and interact with people more.
  • Introverts think before speaking, while extroverts tend to think as they speak.
  • Introverts prefer small, closed groups while extroverts are often seen in large groups.
  • Extroverts post more regularly on social platforms.

Overall, distinguishing introverts from extroverts is important to learn more about human behavior.

2. Data Source

The data source for “Predicting Introverts from Extroverts” is taken from Kaggle (Playground Series, Season 5, Episode 7).

In the data source, we have three CSV files:

train.csv – the training dataset; ‘Personality’ is the target variable.

test.csv – the test dataset; your objective is to predict the ‘Personality’ variable.

sample_submission.csv – a sample submission file in the correct format.

3. Objective

The objective is to predict whether a person is an Introvert or an Extrovert, given their social behavior and personality traits. Evaluation will be done on the Accuracy Score. The goal is to predict the target variable ‘Personality’ for test data and submit it in ‘sample_submission.csv’ format.

4. Evaluation Metrics

Submissions are evaluated on Accuracy between the predicted value and the observed target.
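
For reference, here is a minimal, self-contained sketch of how this metric is computed with scikit-learn (the labels below are made up for illustration only):

Code to compute Accuracy (example)

from sklearn.metrics import accuracy_score

# Hypothetical labels for illustration only
y_true = [1, 0, 1, 1, 0]  # observed targets
y_pred = [1, 0, 0, 1, 0]  # predicted values
print(accuracy_score(y_true, y_pred))  # 0.8, i.e. 4 out of 5 correct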

5. Data Understanding

Let’s understand the dataset in detail with Python Code.

Import Libraries

The first step is to import relevant libraries.

Code to import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import category_encoders as ce

import xgboost as xgb  # needed later for xgb.plot_importance
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_curve, auc)

Loading Data

Load train, test, and sample_submission CSV files.

Code to load data

df_train = pd.read_csv("/kaggle/input/playground-series-s5e7/train.csv")
df_test = pd.read_csv("/kaggle/input/playground-series-s5e7/test.csv")
df_sub = pd.read_csv("/kaggle/input/playground-series-s5e7/sample_submission.csv")

print("Shape of training data: ", df_train.shape)
print("Shape of test data: ", df_test.shape)
print("Shape of submission file: ", df_sub.shape)
print("Print columns data ", df_train.columns.values)

Output

Shape of training data:  (18524, 9)
Shape of test data:  (6175, 8)
Shape of submission file:  (6175, 2)
Print columns data  ['id' 'Time_spent_Alone' 'Stage_fear' 'Social_event_attendance'
 'Going_outside' 'Drained_after_socializing' 'Friends_circle_size'
 'Post_frequency' 'Personality']

Data Description

Training data has 18524 rows with 9 columns, out of which 8 are independent features and ‘Personality’ is the dependent feature, or target variable, described below:

id: Denotes the serial number.

Time_spent_Alone: Number of hours spent alone.

Stage_fear: Whether the person has stage fear: Yes or No.

Social_event_attendance: Frequency of attending social events.

Going_outside: Frequency of going out.

Drained_after_socializing: Whether the person feels drained after socializing: Yes or No.

Friends_circle_size: Size of the friends circle.

Post_frequency: Frequency of social media posts.

Personality: The target variable that denotes the personality of a person: Introvert or Extrovert.

Data Cleaning

Understand data types for all features and check for missing and duplicate values.

Code to get train data info

df_train.info()

Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18524 entries, 0 to 18523
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         18524 non-null  int64  
 1   Time_spent_Alone           17334 non-null  float64
 2   Stage_fear                 16631 non-null  object 
 3   Social_event_attendance    17344 non-null  float64
 4   Going_outside              17058 non-null  float64
 5   Drained_after_socializing  17375 non-null  object 
 6   Friends_circle_size        17470 non-null  float64
 7   Post_frequency             17260 non-null  float64
 8   Personality                18524 non-null  object 
dtypes: float64(5), int64(1), object(3)
memory usage: 1.3+ MB

Code to check for duplicate values

df_train.duplicated().sum()

Output

0

Observations

  1. Output Feature: ‘Personality’ (categorical): two possible values, ‘Extrovert’ and ‘Introvert’.
  2. Numerical Features: ‘Time_spent_Alone’, ‘Social_event_attendance’, ‘Going_outside’, ‘Friends_circle_size’, and ‘Post_frequency’. These are discrete numerical features.
  3. Categorical Features: ‘Stage_fear’ and ‘Drained_after_socializing’; both are nominal, so label encoding can be used.
  4. There appear to be no outliers in the numerical features.
  5. There are many NaN values in the features ‘Time_spent_Alone’, ‘Stage_fear’, ‘Social_event_attendance’, ‘Going_outside’, ‘Drained_after_socializing’, ‘Friends_circle_size’, and ‘Post_frequency’ (quantified below).
  6. There are no duplicate rows.
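
Before handling them, let’s quantify the NaN values noted above:

Code to count missing values

# Count missing values per column, largest first
print(df_train.isnull().sum().sort_values(ascending=False))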

Code to handle null values in ‘Time_spent_Alone’

# Fill NaNs with the per-Personality median
median_Time_spent_Alone = df_train.groupby("Personality")["Time_spent_Alone"].median()
print(median_Time_spent_Alone)
df_train['Time_spent_Alone'] = df_train.apply(
    lambda row: median_Time_spent_Alone[row['Personality']]
    if pd.isnull(row['Time_spent_Alone']) else row['Time_spent_Alone'],
    axis=1)

Output

Personality
Extrovert    2.0
Introvert    7.0
Name: Time_spent_Alone, dtype: float64

Observations

  1. The median time spent alone for the ‘Extrovert’ personality is 2.0 hours.
  2. The median time spent alone for the ‘Introvert’ personality is 7.0 hours.

You can handle null values in the other numerical features in the same way, and fill the categorical features with their mode value, as sketched below. To see the full code, kindly go to the Kaggle notebook.
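
As an illustration, here is a minimal sketch (not the notebook’s exact code) of filling the categorical features with the per-‘Personality’ mode:

Code to fill categorical features with the mode (sketch)

# Fill categorical NaNs with the mode within each Personality group
for col in ['Stage_fear', 'Drained_after_socializing']:
    mode_by_personality = df_train.groupby('Personality')[col].agg(lambda s: s.mode()[0])
    df_train[col] = df_train.apply(
        lambda row: mode_by_personality[row['Personality']]
        if pd.isnull(row[col]) else row[col],
        axis=1)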

6. Exploratory Data Analysis and Visualizations

Now is the time for exploratory data analysis and visualizations. Let’s start with univariate analysis to find statistics such as the mean, median, standard deviation, percentiles, quantiles, and skewness of the numerical features.

Code for Univariate Analysis

df_train.describe()  # Univariate statistics for numerical features only

Output

Observations

  1. The summary statistics confirm the earlier observations: the numerical features are discrete, and there appear to be no outliers in ‘Time_spent_Alone’, ‘Social_event_attendance’, ‘Going_outside’, ‘Friends_circle_size’, and ‘Post_frequency’.
  2. Data is highly imbalanced towards ‘Extroverts’.

Code to find unique categorical values

def find_unique(feature):
    print("Unique features of " + feature + ": ", df_train[feature].unique())

for column in ['Time_spent_Alone', 'Stage_fear', 'Social_event_attendance',
               'Going_outside', 'Drained_after_socializing', 'Friends_circle_size',
               'Post_frequency', 'Personality']:
    find_unique(column)

Output

Unique features of Time_spent_Alone:  [ 0.  1.  6.  3.  2.  4.  5.  9. 10.  7.  8. 11.]
Unique features of Stage_fear:  ['No' 'Yes']
Unique features of Social_event_attendance:  [ 6.  7.  1.  4.  8.  2.  5.  0.  9.  3. 10.]
Unique features of Going_outside:  [ 4.  3.  0.  5.  1.  6.  2.  7.]
Unique features of Drained_after_socializing:  ['No' 'Yes']
Unique features of Friends_circle_size:  [15. 10.  3. 11. 13.  4.  0. 14.  5.  9. 12.  8.  2.  1.  6.  7.]
Unique features of Post_frequency:  [ 5.  8.  0.  3.  4.  2.  9. 10.  6.  7.  1.]
Unique features of Personality:  ['Extrovert' 'Introvert']

Observations

  1. Numerical features are discrete.
  2. Categorical features are binary.
  3. ‘Time_spent_Alone’ has 12 unique numerical values.
  4. ‘Social_event_attendance’ has 11 unique numerical values.
  5. ‘Going_outside’ has 8 unique numerical values.
  6. ‘Friends_circle_size’ has 16 unique numerical values.
  7. ‘Post_frequency’ has 11 unique numerical values.
  8. ‘Stage_fear’ has two values: Yes and No.
  9. ‘Drained_after_socializing’ has two values: Yes and No.
  10. ‘Personality’ has two values: Extrovert and Introvert.
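
A compact alternative that reproduces the unique-value counts listed above:

Code to count unique values per column

# Number of unique values in each column
print(df_train.nunique())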

Let’s draw a histogram to understand the distribution of the target variable ‘Personality’.

Code to draw ‘Personality’ histogram

fig = px.histogram(df_train, x='Personality', barmode='group', title="Personality Histogram")
fig.show()

Output

Observation

  1. Data is highly imbalanced towards the ‘Extrovert’ personality.
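
To put a number on this imbalance, a quick check of the class distribution:

Code to compute the class distribution

# Fraction of each Personality class in the training data
print(df_train['Personality'].value_counts(normalize=True))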

Code to draw a bar chart of ‘Time_spent_Alone’ grouped by ‘Personality’

fig = px.histogram(df_train, x='Time_spent_Alone', color="Personality", barmode='group',
                   title="Grouped Bar Chart of Time_spent_Alone")
fig.show()

Output

Observations

  1. Persons with an ‘Extrovert’ personality mostly spend less than 4 hours alone.
  2. Persons with an ‘Introvert’ personality mostly spend 4-11 hours alone.

7. Feature Engineering

Now is the time to perform feature engineering: transforming and aligning the features so that our machine learning model performs well.

Label Encoding

We will first label encode the categorical features Stage_fear, Drained_after_socializing, and Personality.

Code to label encode ‘Stage_fear’, ‘Drained_after_socializing’, and ‘Personality’

# Label encode: Yes → 1, No → 0
df_train['Stage_fear_encoded'] = df_train['Stage_fear'].map({'Yes': 1, 'No': 0})
df_train['Drained_after_socializing_encoded'] = df_train['Drained_after_socializing'].map({'Yes': 1, 'No': 0})
# Label encode the target: Extrovert → 1, Introvert → 0
df_train['Personality_encoded'] = df_train['Personality'].map({'Extrovert': 1, 'Introvert': 0})

Output

Observation

  1. The categorical value ‘Yes’ is encoded to ‘1’ and ‘No’ to ‘0’.
  2. The categorical value ‘Extrovert’ is encoded to ‘1’ and ‘Introvert’ to ‘0’.
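
Since map() returns NaN for any value not in the dictionary, a quick sanity check (a minimal sketch) can confirm the encoding is one-to-one:

Code to verify the encoding

# Unmapped values would become NaN; confirm there are none
print(df_train['Stage_fear_encoded'].isnull().sum())
# Cross-tabulate original vs. encoded values
print(pd.crosstab(df_train['Stage_fear'], df_train['Stage_fear_encoded']))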

Feature Standardization

We will standardize all numerical features using StandardScaler.

Code to standardize all numerical features.

# Standardize all numerical features to zero mean and unit variance
numerical_features = ['Time_spent_Alone', 'Social_event_attendance', 'Going_outside',
                      'Friends_circle_size', 'Post_frequency']

scaler = StandardScaler()
df_train[numerical_features] = scaler.fit_transform(df_train[numerical_features])
df_train.head()

Output

Observation

  1. All numerical features are standardized to a mean of ‘0’ and a standard deviation of ‘1’.
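
A quick check that the scaling worked as expected (means of approximately 0 and standard deviations of approximately 1, up to floating-point error):

Code to verify standardization

# Means should be ~0 and standard deviations ~1 after StandardScaler
print(df_train[numerical_features].mean().round(6))
print(df_train[numerical_features].std().round(6))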

Handle Test Data

  1. Handle null values and duplicate values in the test data (df_test).
  2. Apply all the same feature engineering steps to the test data, as sketched below.
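
Here is a minimal sketch of the idea (not the notebook’s exact code). The key point is to reuse the mappings and the scaler fitted on the training data, calling transform rather than fit_transform on the test set:

Code to apply feature engineering to test data (sketch)

# Null values should be filled first (e.g., with train-set medians/modes)
# Same label encoding as on the training data
df_test['Stage_fear_encoded'] = df_test['Stage_fear'].map({'Yes': 1, 'No': 0})
df_test['Drained_after_socializing_encoded'] = df_test['Drained_after_socializing'].map({'Yes': 1, 'No': 0})

# Reuse the scaler fitted on df_train: transform only, never refit on test data
df_test[numerical_features] = scaler.transform(df_test[numerical_features])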

8. Machine Learning Algorithms Implementation

Once all the data is cleaned, processed, and feature engineered, it is time to implement the machine learning algorithms.

Data Splitting

Let’s split ‘df_train’ into 70% training data and 30% validation data for training the machine learning algorithm.

Code to split data into training data and validation data.

y = df_train['Personality_encoded']
# Drop the target and the raw (non-numeric) columns so only model-ready features remain
X = df_train.drop(['Personality_encoded', 'Personality', 'Stage_fear',
                   'Drained_after_socializing', 'id'], axis=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

Machine Learning Algorithm Implementation

We have implemented XGBClassifier with tuned parameters (a tuning sketch follows the observations below).

Code to train on the training data using XGBClassifier

model_xg = XGBClassifier(max_depth=10, n_estimators=100, learning_rate=0.01,
                         colsample_bytree=0.7, subsample=0.7, reg_alpha=0.01,
                         random_state=42, n_jobs=-1)
model_xg.fit(X_train, y_train)

# Predict on the validation set
y_pred_xg = model_xg.predict(X_val)

# Evaluate with a classification report
report_xg = classification_report(y_val, y_pred_xg)
print(report_xg)

Output

Observation

  1. We get a detailed classification report, and XGBoost trains very quickly.
  2. XGBoost performs very well, achieving an accuracy of 97% on the validation data.
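
The hyperparameters above were tuned beforehand; as a minimal sketch (with a hypothetical, deliberately small grid), the GridSearchCV imported earlier could be used like this:

Code to tune hyperparameters with GridSearchCV (sketch)

# Hypothetical grid for illustration; the real search space may differ
param_grid = {
    'max_depth': [6, 10],
    'learning_rate': [0.01, 0.1],
    'subsample': [0.7, 1.0],
}
grid = GridSearchCV(XGBClassifier(n_estimators=100, random_state=42, n_jobs=-1),
                    param_grid, scoring='accuracy', cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)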

Finding important features

Let’s find out the most and least important features in the dataset as learned by the trained XGBClassifier.

Code to plot feature importances from the trained XGBClassifier model

xgb.plot_importance(model_xg, max_num_features=10)
plt.show()

Output

Observations

  1. ‘Time_spent_Alone’ is the most important feature.
  2. ‘Drained_after_socializing’ and ‘Stage_fear’ are the least important features.

9. Results

Now, predict the output on the test data (df_test) and submit the results in a ‘submission.csv’ file, as sketched below.
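
A minimal sketch of this step (assuming df_test has been through the same feature engineering as the training data):

Code to create the submission file (sketch)

# Keep the same feature columns used during training
X_test = df_test.drop(['id', 'Stage_fear', 'Drained_after_socializing'], axis=1)
test_pred = model_xg.predict(X_test)

# Map encoded predictions back to labels and write the submission file
df_sub['Personality'] = pd.Series(test_pred).map({1: 'Extrovert', 0: 'Introvert'})
df_sub.to_csv('submission.csv', index=False)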

The accuracy achieved on unseen test data is ‘0.973279’ on Kaggle.

Kaggle Notebook Link: 

Click on the link below for the complete notebook, and kindly upvote if you learned from it. Also, drop a comment with any queries you have.

https://www.kaggle.com/code/playingmyway/simplified-predict-introverts-from-extroverts

Stay Tuned!!

Learn the complete data science project lifecycle by clicking on the link below:

Data Science Project Lifecycle: A Comprehensive Overview

Keep learning and keep implementing!!
