Machine Learning projects


This project uses the Turkiye Student Evaluation dataset from Kaggle. The dataset consists of 5820 rows and 33 features: 'instr', 'class', 'nb.repeat', 'attendance', 'difficulty', and the 28 course-evaluation questions 'Q1' through 'Q28'. Every feature in this dataset is relevant to the analysis.

# Implementation of the project

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
pd.options.display.max_columns = 99  # show all columns when displaying the dataframe

#loading the dataset

df = pd.read_csv(r"D:\projects\ML Projects datasets\turkiye-student-evaluation_generic.csv")

df.head()

df.shape

(5820, 33)

#finding null values

df.isna().sum()

instr         0
class         0
nb.repeat     0
attendance    0
difficulty    0
Q1            0
Q2            0
Q3            0
Q4            0
Q5            0
Q6            0
Q7            0
Q8            0
Q9            0
Q10           0
Q11           0
Q12           0
Q13           0
Q14           0
Q15           0
Q16           0
Q17           0
Q18           0
Q19           0
Q20           0
Q21           0
Q22           0
Q23           0
Q24           0
Q25           0
Q26           0
Q27           0
Q28           0
dtype: int64

df.describe()  # statistical summary

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5820 entries, 0 to 5819
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   instr       5820 non-null   int64
 1   class       5820 non-null   int64
 2   nb.repeat   5820 non-null   int64
 3   attendance  5820 non-null   int64
 4   difficulty  5820 non-null   int64
 5   Q1          5820 non-null   int64
 6   Q2          5820 non-null   int64
 7   Q3          5820 non-null   int64
 8   Q4          5820 non-null   int64
 9   Q5          5820 non-null   int64
 10  Q6          5820 non-null   int64
 11  Q7          5820 non-null   int64
 12  Q8          5820 non-null   int64
 13  Q9          5820 non-null   int64
 14  Q10         5820 non-null   int64
 15  Q11         5820 non-null   int64
 16  Q12         5820 non-null   int64
 17  Q13         5820 non-null   int64
 18  Q14         5820 non-null   int64
 19  Q15         5820 non-null   int64
 20  Q16         5820 non-null   int64
 21  Q17         5820 non-null   int64
 22  Q18         5820 non-null   int64
 23  Q19         5820 non-null   int64
 24  Q20         5820 non-null   int64
 25  Q21         5820 non-null   int64
 26  Q22         5820 non-null   int64
 27  Q23         5820 non-null   int64
 28  Q24         5820 non-null   int64
 29  Q25         5820 non-null   int64
 30  Q26         5820 non-null   int64
 31  Q27         5820 non-null   int64
 32  Q28         5820 non-null   int64
dtypes: int64(33)
memory usage: 1.5 MB

# EDA

sns.countplot(x='instr', data=df)

sns.countplot(x='class', data=df)

sns.countplot(x='nb.repeat', data=df)

sns.countplot(x='attendance', data=df)

sns.countplot(x='difficulty', data=df)
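The same distributions can also be read off numerically with `value_counts`. A minimal sketch, using a small hypothetical stand-in dataframe (the real `df` is loaded from the CSV above):

```python
import pandas as pd

# Hypothetical stand-in for the survey dataframe (for illustration only).
df = pd.DataFrame({'difficulty': [1, 1, 2, 3, 3, 3, 4, 5]})

# value_counts gives the same distribution the countplot draws, as numbers.
counts = df['difficulty'].value_counts().sort_index()
print(counts)
```

This is handy when the exact counts matter, e.g. for checking class imbalance before clustering.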

# mean rating of each question
x_questions = df.iloc[:, 5:33]
q_mean = x_questions.mean(axis=0)
total_mean = q_mean.mean()

q_mean = q_mean.to_frame('mean')
q_mean.reset_index(level=0, inplace=True)
q_mean.head()

# overall mean rating across all questions

total_mean

#plot the mean

plt.figure(figsize=(14,8))
sns.barplot(x='index', y='mean', data=q_mean)

# correlation
plt.figure(figsize=(20, 20))
corr = df.corr()
sns.heatmap(corr, annot=True, cmap='gist_earth_r')
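Beyond eyeballing the heatmap, the most strongly correlated pairs can be extracted programmatically by ranking the upper triangle of the correlation matrix. A sketch on synthetic stand-in columns (the Q1-Q3 values here are fabricated for illustration, not the real survey data):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the rating columns (assumption: real data is Q1..Q28).
rng = np.random.default_rng(42)
base = rng.integers(1, 6, size=100)
df = pd.DataFrame({
    'Q1': base,
    'Q2': base + rng.integers(0, 2, size=100),  # strongly tied to Q1
    'Q3': rng.integers(1, 6, size=100),         # independent of the others
})

corr = df.corr()
# Keep only the upper triangle (k=1 drops the diagonal), then rank pairs by |r|.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().abs().sort_values(ascending=False)
print(pairs.head())
```

On the real 33-column frame this immediately surfaces which questions move together, which is useful context for the PCA step below.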

# principal component analysis
# to reduce the number of dimensions
x = df.iloc[:, 5:33]

from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=42)
x_pca = pca.fit_transform(x)
x_pca

# how much variance we retained from the dataset
pca.explained_variance_ratio_.cumsum()[1]
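Rather than hard-coding `n_components=2`, one could also ask how many components are needed to retain a given share of the variance. A sketch on a synthetic 28-column matrix standing in for the question block (the two-factor structure is an assumption made so the example has a clear answer):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 28-dimensional matrix standing in for the Q1..Q28 block (assumption).
rng = np.random.default_rng(42)
latent = rng.normal(size=(500, 2))              # two underlying factors
loadings = rng.normal(size=(2, 28))
x = latent @ loadings + 0.1 * rng.normal(size=(500, 28))

# Fit with all components, then find how many reach 95% cumulative variance.
pca = PCA().fit(x)
cum = np.cumsum(pca.explained_variance_ratio_)
n_95 = int(np.searchsorted(cum, 0.95) + 1)
print(n_95)
```

The same two lines on the real `x` would justify (or challenge) the choice of 2 components before clustering.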

# model training
# k-means clustering
from sklearn.cluster import KMeans
distortions = []  # inertia for each candidate number of clusters
# elbow method
cluster_range = range(1, 6)
for i in cluster_range:
    model = KMeans(n_clusters=i, init='k-means++', n_init=10, random_state=42)
    model.fit(x_pca)
    distortions.append(model.inertia_)

plt.plot(cluster_range, distortions, marker='o')
plt.xlabel('number of clusters')
plt.ylabel('distortion')
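The elbow plot can be cross-checked with the silhouette score, which peaks at the best-separated number of clusters instead of requiring a visual judgment. A sketch on synthetic blobs standing in for `x_pca` (the three-blob layout is an assumption for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs standing in for the PCA-projected data.
rng = np.random.default_rng(42)
x_pca = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0, 5, 10)])

# The silhouette score is highest when clusters are compact and well separated.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(x_pca)
    scores[k] = silhouette_score(x_pca, labels)
best_k = max(scores, key=scores.get)
print(best_k)
```

Agreement between the elbow and the silhouette peak gives more confidence in `n_clusters=3` than either criterion alone.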

# use the best number of clusters
model = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
model.fit(x_pca)
y = model.predict(x_pca)

plt.scatter(x_pca[y == 0, 0], x_pca[y == 0, 1], s=50, c='red', label='cluster 1')
plt.scatter(x_pca[y == 1, 0], x_pca[y == 1, 1], s=50, c='blue', label='cluster 2')
plt.scatter(x_pca[y == 2, 0], x_pca[y == 2, 1], s=50, c='green', label='cluster 3')
plt.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1], s=100, c='yellow', label='centroids')
plt.title('clusters of students')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.legend()

from collections import Counter
Counter(y)

Counter({2: 2360, 0: 2220, 1: 1240})

# training on the whole dataset (all 28 questions, without PCA)
model = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
model.fit(x)
y1 = model.predict(x)

Counter(y1)

Counter({0: 2359, 1: 2221, 2: 1240})
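The near-identical cluster sizes above suggest the PCA-space and full-feature clusterings agree, but that can be quantified: the adjusted Rand index compares two labelings up to relabeling (1.0 means identical partitions). A sketch on synthetic data (the real comparison would use `x` and `x_pca` from above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in: three separated groups in 28 dimensions (assumption).
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(loc=c, scale=0.5, size=(200, 28)) for c in (1, 3, 5)])

x_pca = PCA(n_components=2, random_state=42).fit_transform(x)
y_pca = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(x_pca)
y_full = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(x)

# ARI is invariant to label permutation, so 0/1/2 vs 2/0/1 still scores 1.0.
ari = adjusted_rand_score(y_pca, y_full)
print(round(ari, 3))
```

A high ARI on the real data would confirm that the 2-component projection loses little of the cluster structure.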

Conclusion:

By the end of this project we have assigned each student response to one of three clusters and visualized the data in the form of graphs.

Let's be in touch.

I have started a YouTube channel.
Python Playlist (English):- https://www.youtube.com/watch?v=9AiYyhKcBzI&list=PL-fvvgBPtI-XLPIbFMsO3i7uG9N9FShXW

GitHub:- https://www.github.com/Vamsi-2203

LinkedIn:- www.linkedin.com/in/vamsireddy2203

For more updates, follow me.


Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot
