Gender Classification | K-Nearest Neighbors (KNN)



by: Bryan Antonnio, Nicky Santano

In this article, we will show you the results of our model’s performance when analyzing the Gender Classification dataset on Kaggle using the K-Nearest Neighbors method.

Gender is a social construct. Men and women are treated differently from birth, and this shapes their behavior and personal preferences into what society expects of them based only on their gender.

The dataset we use is designed to test whether a person’s gender can be predicted with accuracy well above 50% based on their personal preferences.

Let me take a real-life example to illustrate the statement above.

With feminism, the differences between men and women in terms of personal preferences have decreased in recent years. For example, historically in many cultures, colors like blue were considered masculine, while colors like red and pink were seen as feminine. In this era, such ideas are considered out of date.

So now we know the problem that society currently has. We will try to analyze this problem with Python in Google Colab using the data below. First, we need to import all of the necessary libraries so that we can get the best analysis results:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

The data that we used

link = ""
datas = pd.read_csv(link)
datas
Showing all values in the “Gender Classification” dataset

After showing the whole dataset, the fun part begins. Next, we’ll go to the “Exploratory Data Analysis” section. Once we’ve displayed all the dataset’s values, we’ll convert all of the data into numeric values.

“Gender Classification” dataset on values

We’ll show each variable’s composition in the dataset graphically.

fav_clr, clr_count = np.unique(datas['Favorite Color'], return_counts=True)
plt.pie(clr_count, labels=fav_clr, autopct='%1.2f%%')
plt.axis('equal')
plt.title("Favorite Color Composition", pad=20)
Favorite Color Composition

So in our final result for the Favorite Color composition, the category with the biggest percentage is the “Cool” color.

fav_msc, msc_count = np.unique(datas['Favorite Music Genre'], return_counts=True)
plt.pie(msc_count, labels=fav_msc, autopct='%1.2f%%')
plt.axis('equal')
plt.title("Favorite Music Genre Composition", pad=20)
Favorite Music Genre Composition

So in our final result for the Favorite Music Genre composition, the genre with the biggest percentage is “Rock”.

fav_bev, bev_count = np.unique(datas['Favorite Beverage'], return_counts=True)
plt.pie(bev_count, labels=fav_bev, autopct='%1.2f%%')
plt.axis('equal')
plt.title("Favorite Beverage Composition", pad=20)
Favorite Beverage Composition

So in our result for the Favorite Beverage composition, the category with the biggest percentage is “Doesn’t drink”.

fav_sft, sft_count = np.unique(datas['Favorite Soft Drink'], return_counts=True)
plt.pie(sft_count, labels=fav_sft, autopct='%1.2f%%')
plt.axis('equal')
plt.title("Favorite Soft Drink Composition", pad=20)
Favorite Soft Drink Composition

Now, we’ll create the X and y variables to use for training and testing.

X = datas[["Favorite Color", "Favorite Music Genre", "Favorite Beverage", "Favorite Soft Drink"]]
y = datas["Gender"]

And we’ll show each column’s data type.

The data above are still of type object. We’ll use LabelEncoder next to convert them to numerical data.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
X = X.apply(le.fit_transform)
y = le.fit_transform(y)
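As a toy illustration of what LabelEncoder does (using made-up category values, not the actual dataset), each distinct string is mapped to the index of its alphabetically sorted category:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values, not taken from the actual dataset
colors = ["Cool", "Warm", "Neutral", "Cool"]

toy_le = LabelEncoder()
encoded = toy_le.fit_transform(colors)

print(list(toy_le.classes_))  # categories, sorted alphabetically: ['Cool', 'Neutral', 'Warm']
print(list(encoded))          # each string replaced by its category's index: [0, 2, 1, 0]
```

Note that `le.fit_transform` is re-fitted per column by `X.apply`, so the integer codes are only meaningful within each column, not across columns.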

We’ll split the data into a training set and a test set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=30)

After all of that, we will move on to:

Classification Test with K-NN method

K-Nearest Neighbors is a supervised machine learning algorithm, since the target variable is known. KNN is a lazy learner: it has no real training step, and all data points are used only at prediction time, which makes the prediction step costly. (An eager learner, by contrast, does its work during the training step.) KNN also makes no assumption about the underlying data distribution.

Reference: K-Nearest Neighbors(KNN) by Renu Khandelwal
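To make the “no training step” point concrete, here is a minimal from-scratch sketch of KNN prediction on made-up 2-D data (not the Kaggle set): for a query point, it computes the Euclidean distance to every stored point and takes a majority vote among the k closest.

```python
import numpy as np
from collections import Counter

def knn_predict(points, labels, query, k=3):
    # There is no training step: every stored point is examined at prediction time.
    dists = np.linalg.norm(points - query, axis=1)   # Euclidean distance to each point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    votes = Counter(labels[i] for i in nearest)      # majority vote among the neighbors
    return votes.most_common(1)[0][0]

# Toy 2-D data (made up): class 0 near the origin, class 1 near (5, 5)
X_toy = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5])))  # → 0
print(knn_predict(X_toy, y_toy, np.array([5.5, 5.5])))  # → 1
```

This is what `KNeighborsClassifier` does under the hood (with more efficient neighbor search).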

Training data is the main and most important data that helps the machine learn and make predictions. This dataset is used by the machine learning engineer to develop the algorithm, and it typically makes up more than 70% of the total data used in the project. So, we’ll train our model first:

from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors=3, metric="euclidean", p=2), y_train)

Predicting the y value with X_test value

y_pred = classifier.predict(X_test)

Last but not least, we’ll calculate the confusion matrix and the accuracy score for our final KNN results.

from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)
print("Accuracy Score: ", accuracy_score(y_test, y_pred))
K-NN’s final results

Our final accuracy from the K-NN method is 53%.
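The accuracy score reported above is simply the trace of the confusion matrix (the correctly classified samples on the diagonal) divided by its sum. A quick check with a made-up 2×2 matrix, not our actual results:

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm_example = np.array([[5, 3],
                       [4, 3]])

correct = np.trace(cm_example)   # diagonal = correctly classified samples
total = cm_example.sum()         # all samples in the test set
accuracy = correct / total

print(accuracy)  # 8 / 15 ≈ 0.533
```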

Now that we know the results of the K-NN method, we’ll try to analyze the same data using other methods, because we are curious how they compare.

Classification Test with Naive Bayes method

It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

Reference: 6 Easy Steps to Learn Naive Bayes Algorithm with codes in Python and R by SUNIL RAY
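The independence assumption can be sketched from scratch on a tiny made-up categorical dataset (not the real one): each class’s score is its prior probability times the product of per-feature conditional probabilities.

```python
# Made-up categorical rows: (favorite color, favorite music genre), label = gender
rows = [("Cool", "Rock"), ("Cool", "Pop"), ("Warm", "Pop"), ("Warm", "Rock"), ("Cool", "Rock")]
labels = ["M", "M", "F", "F", "M"]

def nb_score(x, label):
    # Score = P(label) * product over features of P(feature value | label),
    # using the "naive" assumption that features are independent given the class.
    idx = [i for i, lab in enumerate(labels) if lab == label]
    score = len(idx) / len(labels)                    # prior probability of the class
    for j, value in enumerate(x):
        matches = sum(1 for i in idx if rows[i][j] == value)
        score *= matches / len(idx)                   # conditional likelihood of feature j
    return score

# Classify a new sample by picking the label with the higher score
sample = ("Cool", "Rock")
pred = max(["M", "F"], key=lambda lab: nb_score(sample, lab))
print(pred)  # → M
```

Note that `GaussianNB`, used below, instead models each feature as a Gaussian per class, which is a rough fit for label-encoded categorical data; `CategoricalNB` would match this data type more closely.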

from sklearn.naive_bayes import GaussianNB

classifierNB = GaussianNB(), y_train)

Predicting the y value with X_test value

y_pred = classifierNB.predict(X_test)

We’ll calculate the confusion matrix and the accuracy score for our final Naive Bayes results.

from sklearn.metrics import confusion_matrix, accuracy_score

cmNB = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cmNB)
print("Accuracy Score: ", accuracy_score(y_test, y_pred))
Naive Bayes’s final results

Our final accuracy from the Naive Bayes method is 35%.

Classification Test with Logistic Regression method

Logistic regression is a type of regression analysis. So, before we delve into logistic regression, let us first introduce the general concept of regression analysis.

Regression analysis is a type of predictive modeling technique which is used to find the relationship between a dependent variable (usually known as the “Y” variable) and either one independent variable (the “X” variable) or a series of independent variables. When two or more independent variables are used to predict or explain the outcome of the dependent variable, this is known as multiple regression.

Reference: What is Logistic Regression? A Beginner’s Guide by ANAMIKA THANDA
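The core of logistic regression is the sigmoid function, which squashes a linear combination of the features into a probability between 0 and 1. A sketch with made-up weights (not fitted to our data):

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and a single feature vector (not from our dataset)
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([2.0, 1.0])

p = sigmoid(np.dot(w, x) + b)  # predicted probability of the positive class
label = int(p >= 0.5)          # threshold at 0.5 to get a class label

print(round(p, 3), label)
```

Fitting the model means finding `w` and `b` that maximize the likelihood of the training labels, which is what `` does below.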

from sklearn.linear_model import LogisticRegression

classifierLR = LogisticRegression(random_state=0), y_train)

Predicting the y value with X_test value

y_pred = classifierLR.predict(X_test)

We’ll calculate the confusion matrix and the accuracy score for our final Logistic Regression results.

from sklearn.metrics import confusion_matrix, accuracy_score

cmLR = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cmLR)
print("Accuracy Score: ", accuracy_score(y_test, y_pred))
Logistic Regression’s final results.

Our final accuracy from the Logistic Regression method is 35%.

After analyzing with the Naive Bayes and Logistic Regression methods, we can conclude that our Logistic Regression result is the same as the Naive Bayes result (35%), and both fall short of K-NN’s 53%.

CLAP if you liked this article 😀


