I will be doing my project on prediction and classification using a heart disease dataset. The dataset contains metrics such as Cholesterol, Age, and Resting Blood Pressure, and I will use these metrics to determine whether or not a patient has heart disease. As a student very interested in the field of computational biochemistry, I believe this project will help me gain a strong understanding of the procedures necessary to apply PCA and machine learning to a variety of similar datasets. I found the dataset on Kaggle; it is a diagnostic dataset with a multitude of variables to choose from, as described above. In total, there are 918 patient records.
library(tidyverse)
library(knitr)
library(tidytext)
# read in the heart datasets
heart <- read_csv("heart.csv")
# tidy the dataset: add a readable yes/no version of the HeartDisease indicator
heart <- heart %>% mutate(Heart_Disease = ifelse(HeartDisease ==
1, "yes", "no"))
# glimpse of dataset
glimpse(heart)
## Rows: 918
## Columns: 13
## $ Age <dbl> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49…
## $ Sex <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ ChestPainType <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA"…
## $ RestingBP <dbl> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130,…
## $ Cholesterol <dbl> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211,…
## $ FastingBS <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ RestingECG <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal"…
## $ MaxHR <dbl> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, …
## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N",…
## $ Oldpeak <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0,…
## $ ST_Slope <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "F…
## $ HeartDisease <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, …
## $ Heart_Disease <chr> "no", "yes", "no", "yes", "no", "no", "no", "no", "yes…
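One thing worth checking before clustering: some Cholesterol values in this dataset are recorded as 0, which is a placeholder rather than a real measurement (the cluster medoids later make this visible). A minimal sketch of a quick count of those zeros:
# count placeholder zero values in the numeric indicator columns
heart %>% summarize(zero_cholesterol = sum(Cholesterol == 0),
    zero_restingbp = sum(RestingBP == 0))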
library(cluster)
library(GGally)
# clustering code here
#create the correlation matrix
cormat <- heart %>% select(Age,Cholesterol,MaxHR,RestingBP) %>% cor(use="pair") %>% round(3)
cormat
## Age Cholesterol MaxHR RestingBP
## Age 1.000 -0.095 -0.382 0.254
## Cholesterol -0.095 1.000 0.236 0.101
## MaxHR -0.382 0.236 1.000 -0.112
## RestingBP 0.254 0.101 -0.112 1.000
#tidy the correlation matrix
tidycor <- cormat %>% as.data.frame %>%
rownames_to_column("Variable1") %>%
pivot_longer(-1,names_to="Variable2",values_to="correlation")
#plot the correlation matrix
tidycor %>% ggplot(aes(Variable1,Variable2,fill=correlation))+
geom_tile()+
scale_fill_gradient2(low="red",mid="white",high="blue")+ #makes colors!
geom_text(aes(label=round(correlation,2)),color = "black", size = 4)+ #overlay values
theme(axis.text.x = element_text(angle = 90, hjust=1))+ #rotates x-axis labels
coord_fixed()+ggtitle("Correlation Matrix of Heart Disease Indicators")
#I selected the numeric variables age, cholesterol, max heart rate, and resting blood pressure
clust_dat<-heart%>%dplyr::select(Age,Cholesterol,MaxHR,RestingBP)
#graph the silhouette width to find the correct number of clusters
sil_width<-vector() #empty vector to hold mean sil width
for(i in 2:10){
kms <- kmeans(clust_dat,centers=i) #compute k-means solution
sil <- silhouette(kms$cluster,dist(clust_dat)) #get sil widths
sil_width[i]<-mean(sil[,3]) #take averages (higher is better)
}
ggplot()+geom_line(aes(x=1:10,y=sil_width))+scale_x_continuous(name="k",breaks=1:10)+ggtitle("Silhouette Width for Cluster Analysis")
#based on the silhouette width graph, we plan on using 2 clusters for further analysis
#use the pam method
pam1<-clust_dat%>%pam(k=2)
#pam1
#save this in a cluster dataset
pamclust<-clust_dat%>%mutate(cluster=as.factor(pam1$clustering))
#compute the mean for the cluster set
pamclust%>%group_by(cluster)%>%summarize_if(is.numeric,mean,na.rm=T)
## # A tibble: 2 x 5
## cluster Age Cholesterol MaxHR RestingBP
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 52.8 246. 140. 133.
## 2 2 56.4 2.87 122. 130.
#show the medoids for the data
clust_dat%>%slice(pam1$id.med)
## # A tibble: 2 x 4
## Age Cholesterol MaxHR RestingBP
## <dbl> <dbl> <dbl> <dbl>
## 1 47 235 145 130
## 2 56 0 122 130
#visualize the clusters across all pairs of variables
ggpairs(pamclust, aes(color=cluster))+ggtitle("Pairwise Cluster Plot")
Discussion of clustering
From the silhouette width plot, the recommended number of clusters for the dataset is 2. Additionally, the correlation matrix shows that many of the variables have little relation to each other; the strongest correlation is between MaxHR and Age, a negative relationship (-0.38). From a cluster perspective, Age and RestingBP have similar density plots across the two clusters. For the MaxHR density plot, the first cluster captures more of the larger values. Lastly, Cholesterol shows a strong ability to differentiate cluster 1 from cluster 2, as most of the points in cluster 2 sit at the low end of the Cholesterol range.
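Because the four variables sit on very different scales (Cholesterol spans hundreds of units while Age spans decades), the Euclidean distances used by PAM are dominated by Cholesterol, which is likely why it drives the split above. A minimal sketch of rerunning PAM on standardized variables as a sensitivity check:
#standardize the clustering variables, then rerun PAM with k = 2
clust_dat_scaled <- clust_dat %>% scale %>% as.data.frame
pam_scaled <- clust_dat_scaled %>% pam(k=2)
#medoids of the scaled solution, shown as rows of the original data
clust_dat %>% slice(pam_scaled$id.med)
#average silhouette width for the scaled solution (higher is better)
pam_scaled$silinfo$avg.width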
# PCA code here
heartpca <- heart %>% select(Age, Cholesterol, MaxHR, RestingBP) %>%
scale
# run the pca test for this dataset
pca1 <- princomp(heartpca, cor = T)
# examine the signs and magnitudes of the loadings; the first 3 PCs
# capture about 85% of the variance, so 3 PC scores are sufficient
summary(pca1, loadings = T)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 1.2511464 1.0671626 0.8463246 0.7612695
## Proportion of Variance 0.3913418 0.2847090 0.1790663 0.1448828
## Cumulative Proportion 0.3913418 0.6760508 0.8551172 1.0000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4
## Age 0.630 0.169 0.310 0.692
## Cholesterol -0.303 0.705 0.617 -0.173
## MaxHR -0.622 0.201 -0.346 0.673
## RestingBP 0.353 0.659 -0.635 -0.197
# the first 3 PCs account for roughly 85% of the total variance
# plot the dimension-reduced data for further analysis
# scatter plot showing the differentiation between heart
# disease and sex of patient
heart %>% mutate(PC1 = pca1$scores[, 1], PC2 = pca1$scores[,
2], PC3 = pca1$scores[, 3]) %>% ggplot(aes(x = PC1, y = PC2,
color = Heart_Disease, shape = Sex)) + geom_point(alpha = 0.5)
# relationship to indicators and their trends of PC loading
# scores
pca1$loadings[1:4, 1:2] %>% as.data.frame %>% rownames_to_column %>%
ggplot() + geom_hline(aes(yintercept = 0), lty = 2) + geom_vline(aes(xintercept = 0),
lty = 2) + ylab("PC2") + xlab("PC1") + geom_segment(aes(x = 0,
y = 0, xend = Comp.1, yend = Comp.2), arrow = arrow(), col = "red") +
geom_label(aes(x = Comp.1 * 1.1, y = Comp.2 * 1.1, label = rowname)) +
ggtitle("Relationship of Heart Disease Indicators to Principle Component Loadings")
Discussion of PCA
I wanted to identify which direction each variable's loading vector points on the PC score plot. Age and RestingBP load positively on both PC1 and PC2. Cholesterol and MaxHR load negatively on PC1 and positively on PC2, so as PC1 decreases these variables increase. From the scatter plot of PC scores colored by heart disease status, a lower PC1 score is usually an indicator of no heart disease, whereas PC2 shows much more variability, so it is harder to read a pattern along that axis.
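To support the claim that three components capture about 85% of the variance, a scree-style plot can be drawn directly from the princomp output; a minimal sketch:
#proportion of variance explained by each principal component
eigvals <- pca1$sdev^2
varprop <- eigvals/sum(eigvals)
ggplot() + geom_col(aes(x = 1:4, y = varprop)) +
    geom_line(aes(x = 1:4, y = cumsum(varprop))) +
    geom_point(aes(x = 1:4, y = cumsum(varprop))) +
    scale_x_continuous(name = "Principal Component", breaks = 1:4) +
    ylab("Proportion of Variance") + ggtitle("Scree Plot with Cumulative Variance")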
#-------------------------------------------------------
# fit a logistic regression classifier
logistic_fit <- glm(Heart_Disease == "yes" ~ Age + Cholesterol +
MaxHR + RestingBP, data = heart, family = "binomial")
# generate the predicted scores/probabilities for the original observations
prob_reg <- predict(logistic_fit, type = "response")
# prob_reg %>% round(3)
# Compute Classification Diagnostics
class_diag(prob_reg, heart$Heart_Disease, positive = "yes")
## acc sens spec ppv f1 ba auc
## Metrics 0.7015 0.752 0.639 0.7208 0.736 0.6955 0.7626
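The class_diag() function used here is a course-provided helper assumed to be loaded earlier in the document. For readers running this code without that helper, a minimal stand-in sketch that reproduces two of its columns (accuracy at a 0.5 cutoff and a rank-based AUC):
#hypothetical stand-in for class_diag, reporting accuracy and AUC only
class_diag_sketch <- function(probs, truth, positive = "yes") {
    is_pos <- truth == positive
    acc <- mean((probs > 0.5) == is_pos)
    #AUC = probability that a random positive case outranks a random negative case
    auc <- mean(outer(probs[is_pos], probs[!is_pos], ">") +
        0.5 * outer(probs[is_pos], probs[!is_pos], "=="))
    data.frame(acc = round(acc, 4), auc = round(auc, 4))
}
class_diag_sketch(prob_reg, heart$Heart_Disease, positive = "yes")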
# Report a Confusion Matrix
# in the confusion matrix, truth 1 represents someone with heart
# disease and truth 0 represents someone without heart disease
table(truth = heart$HeartDisease, predictions = prob_reg > 0.5)
## predictions
## truth FALSE TRUE
## 0 262 148
## 1 126 382
# cross-validation of linear classifier here
set.seed(322)
k = 10
data <- sample_frac(heart) #randomly order rows
folds <- rep(1:k, length.out = nrow(data)) #create folds
diags <- NULL
i = 1
for (i in 1:k) {
# create training and test sets
train <- data[folds != i, ]
test <- data[folds == i, ]
truth <- test$Heart_Disease
# train model
fit <- glm(Heart_Disease == "yes" ~ Age + Cholesterol + MaxHR +
    RestingBP, data = train, family = "binomial")
# test model
probs <- predict(fit, newdata = test, type = "response")
# get performance metrics for each fold
diags <- rbind(diags, class_diag(probs, truth, positive = "yes"))
}
# average performance metrics across all folds
summarize_all(diags, mean)
## acc sens spec ppv f1 ba auc
## 1 0.70151 0.7552 0.64154 0.72266 0.73507 0.69836 0.76646
Discussion
Using a generalized linear model, we see an AUC of 0.7626. This is not an especially strong score, and after cross-validating the data the AUC changes only slightly, to 0.76646. Such a small change suggests no signs of overfitting; however, this is still not a great model for our data, so it would not be preferred, and we should see how a non-parametric classifier performs.
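To visualize where that AUC comes from, an ROC curve can be plotted from the fitted probabilities. A minimal sketch, assuming the pROC package is installed:
library(pROC)
#build the ROC curve from the logistic regression probabilities
roc_logistic <- roc(response = heart$Heart_Disease, predictor = prob_reg,
    levels = c("no", "yes"), direction = "<")
plot(roc_logistic, main = "ROC Curve for the Logistic Regression Classifier")
auc(roc_logistic)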
library(caret)
# non-parametric classifier code here
knn_fit <- knn3(Heart_Disease == "yes" ~ Age + Cholesterol +
MaxHR + RestingBP, data = heart)
prob_knn <- predict(knn_fit, heart)[, 2]
# prob_knn %>% round(3)
# Compute Classification Diagnostics
class_diag(prob_knn, heart$Heart_Disease, positive = "yes")
## acc sens spec ppv f1 ba auc
## Metrics 0.781 0.8071 0.7488 0.7992 0.8031 0.7779 0.8659
# Report a Confusion Matrix
knn_fit2 <- knn3(factor(HeartDisease == 1, levels = c("TRUE",
"FALSE")) ~ Age + Cholesterol + MaxHR + RestingBP, data = heart)
# create prediction columns based on the knn fit
y_hat_knn <- predict(knn_fit2, heart)
# report the confusion matrix based on the truth and
# prediction counts
table(truth = factor(heart$HeartDisease == 1, levels = c("TRUE",
"FALSE")), prediction = factor(y_hat_knn[, 1] > 0.5, levels = c("TRUE",
"FALSE")))
## prediction
## truth TRUE FALSE
## TRUE 410 98
## FALSE 103 307
# cross-validation of np classifier here
set.seed(322)
k = 10
data <- sample_frac(heart) #randomly order rows
folds <- rep(1:k, length.out = nrow(data)) #create folds
diags <- NULL
i = 1
for (i in 1:k) {
# create training and test sets
train <- data[folds != i, ]
test <- data[folds == i, ]
truth <- test$Heart_Disease
# train model
fit <- knn3(Heart_Disease == "yes" ~ Age + Cholesterol +
    MaxHR + RestingBP, data = train)
# test model
probs <- predict(fit, newdata = test)[, 2]
# get performance metrics for each fold
diags <- rbind(diags, class_diag(probs, truth, positive = "yes"))
}
# average performance metrics across all folds
summarize_all(diags, mean)
## acc sens spec ppv f1 ba auc
## 1 0.78102 0.8072 0.75214 0.79809 0.80147 0.77966 0.86628
Discussion
We see a much higher AUC of 0.8659, which, while not a perfect performance metric, is very good given the size and scope of our data. After performing cross-validation, the AUC stays essentially the same at 0.86628, which shows no signs of overfitting with our data. Choosing between the two approaches, I would prefer the kNN methodology for further analysis and implementation on this dataset.
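One caveat: knn3() defaults to k = 5 neighbors and k was not tuned above, so the AUC could likely be improved. A minimal sketch of tuning k with caret's built-in 10-fold cross-validation (note that train() selects k by accuracy rather than AUC by default):
set.seed(322)
#convert the outcome to a factor so caret treats this as classification
heart_knn <- heart %>% mutate(Heart_Disease = factor(Heart_Disease))
#tune the number of neighbors over a grid of odd values to avoid ties
knn_tuned <- train(Heart_Disease ~ Age + Cholesterol + MaxHR + RestingBP,
    data = heart_knn, method = "knn",
    trControl = trainControl(method = "cv", number = 10),
    tuneGrid = data.frame(k = seq(3, 25, by = 2)))
knn_tuned$bestTune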
# regression model code here
# Regression model for predicting the age of person based on
# Cholesterol,MaxHR,and RestingBP
# subset the variables used for the regression model
linClassifierData <- heart %>% select(Age, Cholesterol, MaxHR,
    RestingBP, Heart_Disease)
# fit a linear regression model to predict Age
linear_fit <- lm(heart$Age ~ Cholesterol + MaxHR + RestingBP,
data = heart)
# generate the predicted Age values for the original observations
score <- predict(linear_fit)
# score %>% round(3)
# Compute Mean Squared Error
paste0("Mean Squared Error: ", mean((heart$Age - score)^2))
## [1] "Mean Squared Error: 71.7753752128872"
# cross-validation of the regression model
set.seed(322)
k = 10
data <- sample_frac(heart) #randomly order rows
folds <- rep(1:k, length.out = nrow(data)) #create folds
diags <- NULL
i = 1
for (i in 1:k) {
# create training and test sets
train <- data[folds != i, ]
test <- data[folds == i, ]
# train model
fit <- lm(Age ~ Cholesterol + MaxHR + RestingBP, data = train)
# test model
probs <- predict(fit, newdata = test)
# get performance metrics for each fold
# accumulate the MSE for each fold
diags <- c(diags, mean((test$Age - probs)^2))
}
# Computes average MSE across all folds
paste0("Average MSE Across the Folds: ", mean(diags))
## [1] "Average MSE Across the Folds: 68.3555007508285"
Discussion
The mean squared error for the full dataset is 71.77, which is quite high from a prediction standpoint. After cross-validating the linear model, the MSE is 68.36, a slight improvement. In conclusion, linear regression with these predictors is a poor way to model Age, as it performs badly. There do not appear to be signs of overfitting in the linear regression model, but that is likely because the model fits the data so poorly to begin with.
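To put the MSE in the units of the response, taking the square root gives the typical prediction error in years: an MSE of 71.77 corresponds to an RMSE of roughly 8.5 years. A small sketch:
#root mean squared error: typical Age prediction error in years
sqrt(mean((heart$Age - score)^2))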
library(reticulate)
# python code here
import pandas as pd
heartDataset = r.heart
#aggregate the number of patients with heart disease by their chest pain type
heartDataset.filter(['ChestPainType','Heart_Disease']).query('Heart_Disease == "yes"').groupby(['ChestPainType']).agg(['size'])
#Query the patients and their attributes with/without heart disease for further analysis
## Heart_Disease
## size
## ChestPainType
## ASY 392
## ATA 24
## NAP 72
## TA 20
patientsWithHeartDisease = heartDataset.loc[heartDataset['HeartDisease'] == 1].filter(['Age','Sex','ChestPainType','Cholesterol','MaxHR'])
patientsWithoutHeartDisease = heartDataset.loc[heartDataset['HeartDisease'] == 0].filter(['Age','Sex','ChestPainType','Cholesterol','MaxHR'])
#Maintain Records of patients with/without heart disease
print(patientsWithHeartDisease.head(20))
## Age Sex ChestPainType Cholesterol MaxHR
## 1 49.0 F NAP 180.0 156.0
## 3 48.0 F ASY 214.0 108.0
## 8 37.0 M ASY 207.0 130.0
## 11 58.0 M ATA 164.0 99.0
## 13 49.0 M ASY 234.0 140.0
## 16 38.0 M ASY 196.0 166.0
## 18 60.0 M ASY 248.0 125.0
## 19 36.0 M ATA 267.0 160.0
## 23 44.0 M ATA 288.0 150.0
## 30 53.0 M NAP 518.0 130.0
## 32 54.0 M ASY 224.0 122.0
## 33 41.0 M ASY 172.0 130.0
## 36 65.0 M ASY 306.0 87.0
## 41 54.0 F NAP 294.0 100.0
## 44 43.0 M ASY 175.0 120.0
## 49 41.0 M ASY 289.0 170.0
## 50 50.0 M ASY 233.0 121.0
## 51 47.0 F ASY 205.0 98.0
## 56 31.0 M ASY 270.0 153.0
## 57 58.0 M NAP 213.0 140.0
print(patientsWithoutHeartDisease.head(20))
## Age Sex ChestPainType Cholesterol MaxHR
## 0 40.0 M ATA 289.0 172.0
## 2 37.0 M ATA 283.0 98.0
## 4 54.0 M NAP 195.0 122.0
## 5 39.0 M NAP 339.0 170.0
## 6 45.0 F ATA 237.0 170.0
## 7 54.0 M ATA 208.0 142.0
## 9 48.0 F ATA 284.0 120.0
## 10 37.0 F NAP 211.0 142.0
## 12 39.0 M ATA 204.0 145.0
## 14 42.0 F NAP 211.0 137.0
## 15 54.0 F ATA 273.0 150.0
## 17 43.0 F ATA 201.0 165.0
## 20 43.0 F TA 223.0 142.0
## 21 44.0 M ATA 184.0 142.0
## 22 49.0 F ATA 201.0 164.0
## 24 40.0 M NAP 215.0 138.0
## 25 36.0 M NAP 209.0 178.0
## 26 53.0 M ASY 260.0 112.0
## 27 52.0 M ATA 284.0 118.0
## 28 53.0 F ATA 468.0 127.0
Discussion
By converting the heart dataset from an R object to a Python object using reticulate's r object, I am able to perform queries on the data with Python syntax. The examples above show how to query the data for further statistical analysis in Python: aggregating the patients with heart disease by chest pain type, and keeping separate records of patients with and without heart disease. I struggled with calling Python from R due to environment errors, but this is something I will work on to improve the project.
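For the R-to-Python errors mentioned above, reticulate provides helpers for checking which Python installation is bound and whether pandas is available; a minimal sketch, run from the R side:
# report which Python binary and environment reticulate is using
reticulate::py_config()
# confirm pandas can be imported from that environment
reticulate::py_module_available("pandas")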