Predicting Heart Disease Using Machine Learning
A Jupyter notebook exploring data visualization and various supervised machine learning models applied to a heart disease dataset (11 common features, plus a target).
- Problem Definition
- Import Libraries
- Import and Prepare the Data
- Exploratory Data Analysis (EDA)
- Data Processing
- Build Models
- Conclusions
Sources:
- The original dataset is from Kaggle, which combines 5 heart disease datasets over 11 common features; this project uses a modified version of the dataset.
- The project below closely follows Fares Sayah's Kaggle project to explore the data.
- I also derived insights from Karan Bhanot's article on Medium.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import sklearn
%matplotlib inline
sns.set_style("whitegrid")
plt.style.use("fivethirtyeight")
heart_disease = pd.read_csv("heart-disease.csv")
heart_disease.info()
- age (age in years)
- sex (1 = male; 0 = female)
- cp (chest pain type, on a scale of 0-3)
    - 0: Typical angina
    - 1: Atypical angina (chest pain not related to heart)
    - 2: Non-anginal pain (usually esophageal spasms)
    - 3: Asymptomatic (chest pain with no signs of disease)
- trestbps (resting systolic blood pressure, in mmHg)
- chol (serum cholesterol, in mg/dL)
- fbs (fasting blood sugar; 0 = less than or equal to 126 mg/dL; 1 = greater than 126 mg/dL)
- restecg (resting ECG)
    - 0: Nothing to note
    - 1: ST-T wave abnormality
    - 2: Possible or definite left ventricular hypertrophy
- thalach (maximum heart rate achieved on stress test, between 60-202)
- exang (exercise-induced angina; 1 = yes; 0 = no)
- oldpeak (amount of ST depression induced by exercise)
- slope (slope of the peak exercise ST segment)
    - 0: Upsloping
    - 1: Flat
    - 2: Downsloping
- ca (number of major vessels colored by fluoroscopy, 0-3)
- thal (thallium stress test)
    - 1-3: normal
    - 6: fixed defect
    - 7: reversible defect (no blood flow during exercise)
- target (1 = heart disease; 0 = no heart disease)
heart_disease.head()
- What questions am I trying to solve?
- What kind of data do I have, and how do I work with it?
- What's missing from the data, and how do I deal with it?
- Where are the outliers and why are they important?
- How can I add, change, or remove features to get more out of the data?
heart_disease.shape
pd.set_option("display.float_format", "{:.2f}".format)
heart_disease.describe()
counts = heart_disease.target.value_counts()
counts
p = sns.countplot(data=heart_disease, x="target")
p.set_xlabel("Heart Disease")
p.set_ylabel("Counts");
heart_disease.isna().sum()
Note:
- It looks like there are 165 people with heart disease, and 138 without, so the dataset is fairly balanced.
- Also, there are no null values.
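A quick way to quantify that balance (a minimal sketch using the same target column):
# Share of each target class; roughly 54% with heart disease vs 46% without
heart_disease.target.value_counts(normalize=True)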
categorical_val = []
continuous_val = []
# Treat columns with 10 or fewer unique values as categorical, the rest as continuous
for column in heart_disease.columns:
    if len(heart_disease[column].unique()) <= 10:
        categorical_val.append(column)
    else:
        continuous_val.append(column)
categorical_val
continuous_val
p = sns.countplot(data=heart_disease, x="target", hue="sex")
p.set_xlabel("Heart Disease")
plt.legend(bbox_to_anchor=(1.05,1), labels=['Female', 'Male']);
p = sns.countplot(data=heart_disease, x="cp", hue="target")
p.set_xlabel("Chest Pain")
plt.legend(labels=["No heart disease", "Heart Disease"]);
Chest Pain (based on a scale of 0-3)
- 0: Typical angina
- 1: Atypical angina (chest pain not related to heart)
- 2: Non-anginal pain (usually esophageal spasms)
- 3: Asymptomatic (chest pain with no signs of disease)
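To put numbers behind the count plot, a crosstab of chest pain type against the target can be added (an optional check using the same heart_disease frame):
# Rows: chest pain type (0-3); columns: target (0 = no heart disease, 1 = heart disease)
pd.crosstab(heart_disease.cp, heart_disease.target)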
plt.figure(figsize=(15,15))
for i, column in enumerate(continuous_val, 1):
    plt.subplot(3,2,i)
    sns.histplot(data=heart_disease, x=column, hue="target", multiple="stack")
#plt.legend(labels=["No Ht Disease", "Heart Disease"]);
- trestbps : resting blood pressure
- chol : serum cholesterol (mg/dL)
- thalach : maximum heart rate achieved on stress test (60-202)
- oldpeak : ST depression induced by exercise relative to rest
Target:
- 0 (blue) = no heart disease
- 1 (orange) = heart disease
plt.figure(figsize=(9,7))
plt.scatter(heart_disease.age[heart_disease.target == 1],
heart_disease.chol[heart_disease.target == 1],
c="salmon")
plt.scatter(heart_disease.age[heart_disease.target == 0],
heart_disease.chol[heart_disease.target == 0],
c="lightblue")
plt.title("Correlation of Heart Disease with Age and Cholesterol")
plt.xlabel("Age")
plt.ylabel("Cholesterol")
plt.legend(["Disease", "No Disease"]);
There is no obvious correlation between cholesterol levels and heart disease!
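One quick numerical check of that impression (a minimal sketch; a value near zero would be consistent with the plot):
# Linear correlation between cholesterol and the binary target
heart_disease.chol.corr(heart_disease.target)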
corr_matrix = heart_disease.corr()
fig, ax = plt.subplots(figsize=(15,15))
ax = sns.heatmap(corr_matrix,
annot = True,
linewidths = 0.5,
fmt = ".2f",
cmap = "YlGnBu");
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
Notes:
- The presence of chest pain and thalach (maximum heart rate achieved on stress test) seem to have the highest correlations with the target value.
- Also, fasting blood sugar and cholesterol have the lowest correlation with the target variable.
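Those observations can be read straight off the target column of the correlation matrix computed above, for example:
# Correlation of every feature with the target, sorted strongest to weakest
corr_matrix["target"].sort_values(ascending=False)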
- Create dummy variables for the categorical features
- Scale the continuous values
categorical_val.remove('target')
dataset = pd.get_dummies(heart_disease, columns = categorical_val)
dataset.head()
print(heart_disease.columns)
print(dataset.columns)
from sklearn.preprocessing import StandardScaler
s_sc = StandardScaler()
col_to_scale = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
dataset[col_to_scale] = s_sc.fit_transform(dataset[col_to_scale])
dataset.head()
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    # Print accuracy, classification report, and confusion matrix
    # for either the training set (train=True) or the test set (train=False)
    if train:
        pred = clf.predict(X_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")
    else:
        pred = clf.predict(X_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")
from sklearn.model_selection import train_test_split
X = dataset.drop('target', axis=1)
y = dataset.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Now that the data is split into training and test sets, machine learning models can be created:
- Train each model on the training set
- Evaluate it on the test set
Here, I'll try 5 different machine learning models:
- Logistic Regression
- K-Nearest Neighbors Classifier
- Support Vector Machine
- Decision Tree Classifier
- Random Forest Classifier
from sklearn.linear_model import LogisticRegression
lr_clf = LogisticRegression(solver='liblinear')
lr_clf.fit(X_train, y_train)
print_score(lr_clf, X_train, y_train, X_test, y_test, train=True)
print_score(lr_clf, X_train, y_train, X_test, y_test, train=False)
test_score = accuracy_score(y_test, lr_clf.predict(X_test)) * 100
train_score = accuracy_score(y_train, lr_clf.predict(X_train)) * 100
results_df = pd.DataFrame(data=[["Logistic Regression", train_score, test_score]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
results_df
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)
print_score(knn_clf, X_train, y_train, X_test, y_test, train=True)
print_score(knn_clf, X_train, y_train, X_test, y_test, train=False)
test_score = accuracy_score(y_test, knn_clf.predict(X_test)) * 100
train_score = accuracy_score(y_train, knn_clf.predict(X_train)) * 100
results_df_2 = pd.DataFrame(data=[["K-nearest neighbors", train_score, test_score]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
results_df
from sklearn.svm import SVC
svm_clf = SVC(kernel='rbf', gamma=0.1, C=1.0)
svm_clf.fit(X_train, y_train)
print_score(svm_clf, X_train, y_train, X_test, y_test, train=True)
print_score(svm_clf, X_train, y_train, X_test, y_test, train=False)
test_score = accuracy_score(y_test, svm_clf.predict(X_test)) * 100
train_score = accuracy_score(y_train, svm_clf.predict(X_train)) * 100
results_df_2 = pd.DataFrame(data=[["Support Vector Machine", train_score, test_score]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
results_df
from sklearn.tree import DecisionTreeClassifier
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
print_score(tree_clf, X_train, y_train, X_test, y_test, train=True)
print_score(tree_clf, X_train, y_train, X_test, y_test, train=False)
test_score = accuracy_score(y_test, tree_clf.predict(X_test)) * 100
train_score = accuracy_score(y_train, tree_clf.predict(X_train)) * 100
results_df_2 = pd.DataFrame(data=[["Decision Tree Classifier", train_score, test_score]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
results_df
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
rf_clf = RandomForestClassifier(n_estimators=1000, random_state=42)
rf_clf.fit(X_train, y_train)
print_score(rf_clf, X_train, y_train, X_test, y_test, train=True)
print_score(rf_clf, X_train, y_train, X_test, y_test, train=False)
test_score = accuracy_score(y_test, rf_clf.predict(X_test)) * 100
train_score = accuracy_score(y_train, rf_clf.predict(X_train)) * 100
results_df_2 = pd.DataFrame(data=[["Random Forest Classifier", train_score, test_score]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
results_df
The next step would be to explore hyperparameter tuning for the various ML models.
From these preliminary results, though, the Logistic Regression and Support Vector Machine models give the highest testing accuracy, each predicting heart disease from the features with greater than 86% accuracy.
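As a sketch of that next step, a randomized search over the random forest's hyperparameters (using the RandomizedSearchCV already imported above) might look like the following; the parameter ranges are illustrative assumptions, not tuned values.
# Illustrative hyperparameter search for the random forest (ranges are assumptions)
param_dist = {
    "n_estimators": [100, 500, 1000],
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
rf_search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                               param_distributions=param_dist,
                               n_iter=20, cv=5, scoring="accuracy",
                               random_state=42)
rf_search.fit(X_train, y_train)
print(rf_search.best_params_)
print(f"Tuned test accuracy: {accuracy_score(y_test, rf_search.predict(X_test)) * 100:.2f}%")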