Heart Disease Prediction with Machine Learning




II. Python — Processing

Once you have downloaded the .csv file, open your Python editor. Import pandas and read the CSV file into a data frame with this piece of code:

import pandas as pd
dfstack=pd.read_csv(r'C:\Users\...\HeartDiseaseTrain-Test.csv')

Let us take a look at the columns:

print(dfstack.columns)

This will give you the following when you run this line:

Index(['age', 'sex', 'chest_pain_type', 'resting_blood_pressure',
       'cholestoral', 'fasting_blood_sugar', 'rest_ecg', 'Max_heart_rate',
       'exercise_induced_angina', 'oldpeak', 'slope',
       'vessels_colored_by_flourosopy', 'thalassemia', 'target'],
      dtype='object')

If you take a look at the data frame, you will notice that many columns contain categorical data. For example, the 'sex' column holds only two values: Male and Female.

At this point, we want to convert these categorical columns to numerical data. The quickest way to do this is the following:

df_stack = pd.get_dummies(dfstack, prefix_sep='_')

The get_dummies function identifies the categorical columns and converts them to numerical data. To illustrate this, here are the column names of the data frame df_stack we just created with get_dummies:

Index(['age', 'resting_blood_pressure', 'cholestoral', 'Max_heart_rate',
       'oldpeak', 'target', 'sex_Female', 'sex_Male',
       'chest_pain_type_Asymptomatic', 'chest_pain_type_Atypical angina',
       'chest_pain_type_Non-anginal pain', 'chest_pain_type_Typical angina',
       'fasting_blood_sugar_Greater than 120 mg/ml',
       'fasting_blood_sugar_Lower than 120 mg/ml',
       'rest_ecg_Left ventricular hypertrophy', 'rest_ecg_Normal',
       'rest_ecg_ST-T wave abnormality', 'exercise_induced_angina_No',
       'exercise_induced_angina_Yes', 'slope_Downsloping', 'slope_Flat',
       'slope_Upsloping', 'vessels_colored_by_flourosopy_Four',
       'vessels_colored_by_flourosopy_One',
       'vessels_colored_by_flourosopy_Three',
       'vessels_colored_by_flourosopy_Two',
       'vessels_colored_by_flourosopy_Zero', 'thalassemia_Fixed Defect',
       'thalassemia_No', 'thalassemia_Normal',
       'thalassemia_Reversable Defect'],
      dtype='object')

What you will notice is that each categorical column has been split into one column per unique value it contains. For example, the sex column has become two columns, sex_Male and sex_Female. Instead of holding the string "Male" or "Female", as in the original column, the new columns hold 0s and 1s:

As you can see, when the sex is Male, there is a 1 on that row in the sex_Male column and a 0 in the sex_Female column. This gets switched when the sex is Female.
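To see this encoding on a small scale, here is a minimal sketch; the toy data frame below is made up for illustration and is not taken from the heart disease dataset:

```python
import pandas as pd

# A tiny, hypothetical frame with one categorical column
toy = pd.DataFrame({'sex': ['Male', 'Female', 'Male']})

# One 0/1 column per unique value; dtype=int keeps the output numeric
encoded = pd.get_dummies(toy, prefix_sep='_', dtype=int)
print(encoded)
#    sex_Female  sex_Male
# 0           0         1
# 1           1         0
# 2           0         1
```

The dtype=int argument is worth knowing about: recent versions of pandas return boolean dummy columns by default, and dtype=int forces the 0/1 integers shown above.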

Next, in order to apply the algorithm, we have to separate the class column that we are predicting (in this case named "target") from the rest of the dataset, which contains the features the algorithm will use to predict it.

We can do that with the code below:

X = df_stack.drop('target', axis=1)
y = df_stack['target']

As we can see, we are dropping the ‘target’ column and assigning the rest of the data frame to X. Then, we assign just the ‘target’ column to y.

Next, let us import the necessary package to create a test-train split:

from sklearn.model_selection import train_test_split

Here is the standard code for creating a split:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

Next, we will do some feature scaling using the StandardScaler from sklearn. Here is the standard code for this:

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
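Note that the scaler is fitted on the training set only and then reused to transform the test set, so no information from the test set leaks into the scaling. Here is a small sketch of what StandardScaler actually computes; the numbers are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature training data
train = np.array([[170.0], [180.0], [190.0]])

sc = StandardScaler()
scaled = sc.fit_transform(train)

# Each column now has mean 0 and unit variance
print(scaled.ravel())       # roughly [-1.22, 0.0, 1.22]
print(sc.mean_, sc.scale_)  # the column mean and std learned from the data
```

Calling sc.transform on new data then reuses the stored mean_ and scale_ instead of recomputing them.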

III. Python — Apply ML Algo

It is now time to use a predictive algorithm. In this case, we will use a support vector machine:

from sklearn.svm import SVC
svclassifier = SVC()
svclassifier.fit(X_train, y_train)

As you can see we are fitting our svc classifier algorithm to the training set.

Now, we can pass the test set into the predict function like so:

y_pred = svclassifier.predict(X_test)

This predicts the labels of the X_test.
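As a quick sanity check of the whole fit/predict workflow, here is a self-contained sketch on synthetic data generated with make_classification, standing in for the heart disease CSV:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the heart disease features and target
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

# Scale, fit, predict: the same three steps as in the article
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

svclassifier = SVC()
svclassifier.fit(X_train, y_train)
y_pred = svclassifier.predict(X_test)

print(y_pred.shape)  # one predicted label per test row
```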

IV. Analysis

To understand how our algorithm works, we need to print out the confusion matrix and other metrics:

from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

When you run the whole program, you will get a confusion matrix like this one (your exact counts may differ, since the train-test split is random):

[[141   9]
 [  6 152]]

In sklearn, the predicted labels are at the top and the actuals are on the left, so the diagonal holds the correct predictions.

Let us calculate some metrics to analyze what this confusion matrix shows.

First, let us calculate accuracy:

Accuracy = correctly predicted / overall
Correctly predicted --> 141 + 152 = 293
Overall --> 141 + 152 + 9 + 6 = 308
Therefore, the overall accuracy is:
293/308 = 95.13%
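The same arithmetic can be done in code. For an sklearn 2x2 confusion matrix cm, the idiom tn, fp, fn, tp = cm.ravel() unpacks the four counts; here they are written out directly from the matrix above:

```python
# Counts taken from the confusion matrix above
tn, fp, fn, tp = 141, 9, 6, 152

# Accuracy = correctly predicted / overall
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(accuracy * 100, 2))  # 95.13
```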

While the accuracy is high, let us remember that this is a disease dataset. That matters because disease diagnosis relies on certain metrics that are just as important as accuracy: the first is sensitivity and the second is specificity. Let us go ahead and understand what these are and then calculate them.

Here is the formula for sensitivity:

Sensitivity = number of true positives / (number of true positives + number of false negatives)

Let us break this down to understand what sensitivity is telling us.

The denominator of the formula is the sum of the true positives and false negatives. Let us go over each of these terms.

True positive means that the person was diagnosed as having the disease (i.e. a positive diagnosis) when they actually had it, so a correct positive diagnosis was made.

False negative means that the person was diagnosed as NOT having the disease (i.e. a negative diagnosis) when they actually had it.

Therefore, if you add the true positives and the false negatives, you get the total number of sick people in the confusion matrix.

Sensitivity = number of true positives / (number of true positives + number of false negatives)
Sensitivity = 152/(6+152) --> 96.2%

Therefore, we now know that 96.2% of the time, the algorithm was able to correctly predict that a person had the disease when they actually had it.

Now let us look at specificity:

Specificity = number of true negatives / (number of true negatives + number of false positives)

True negative means the test predicted that the person did not have the disease when they actually did not have it.

False positive means the test predicted that the person had the disease when they actually did not.

Specificity =  number of true negatives / (number of true negatives + number of false positives)
Specificity = 141/(141+9) --> 94%

Therefore, we now know that 94% of the time, the algorithm was able to correctly predict that a person did NOT have the disease when they actually did not have it.
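Both formulas can be checked in code with the same four counts from the confusion matrix:

```python
# Counts taken from the confusion matrix above
tn, fp, fn, tp = 141, 9, 6, 152

sensitivity = tp / (tp + fn)  # true positive rate: sick people caught
specificity = tn / (tn + fp)  # true negative rate: healthy people cleared

print(round(sensitivity * 100, 1))  # 96.2
print(round(specificity * 100, 1))  # 94.0
```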

Thanks for reading.


Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot
