I will be doing my project on prediction and classification using a heart disease dataset. The dataset contains metrics such as Cholesterol, Age, and Resting Blood Pressure, and I will use these metrics to determine whether or not a patient has heart disease. As a student very interested in the field of computational biochemistry, I believe this project will help me gain a strong understanding of the procedures necessary to apply PCA and machine learning to a variety of similar datasets. I found the dataset on Kaggle; it is a diagnostic dataset with a multitude of variables to choose from, as described above. In total, there are 918 patient records.
library(tidyverse)
library(knitr)
library(tidytext)
# read in the heart datasets
heart <- read_csv("heart.csv")
# tidy the dataset: add a readable yes/no version of the HeartDisease indicator
heart <- heart %>% mutate(Heart_Disease = ifelse(HeartDisease ==
1, "yes", "no"))
# glimpse of dataset
glimpse(heart)
## Rows: 918
## Columns: 13
## $ Age <dbl> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49…
## $ Sex <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ ChestPainType <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA"…
## $ RestingBP <dbl> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130,…
## $ Cholesterol <dbl> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211,…
## $ FastingBS <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ RestingECG <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal"…
## $ MaxHR <dbl> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, …
## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N",…
## $ Oldpeak <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0,…
## $ ST_Slope <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "F…
## $ HeartDisease <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, …
## $ Heart_Disease <chr> "no", "yes", "no", "yes", "no", "no", "no", "no", "yes…
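One thing worth checking before clustering: some Cholesterol values in this dataset are recorded as 0, which is a placeholder rather than a real measurement (the cluster medoids later make this visible). A minimal sketch of a quick count of those zeros:
# count placeholder zero values in the numeric indicator columns
heart %>% summarize(zero_cholesterol = sum(Cholesterol == 0),
    zero_restingbp = sum(RestingBP == 0))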
library(cluster)
library(GGally)
# clustering code here
#create the correlation matrix
cormat <- heart %>% select(Age,Cholesterol,MaxHR,RestingBP) %>% cor(use="pair") %>% round(3)
cormat
## Age Cholesterol MaxHR RestingBP
## Age 1.000 -0.095 -0.382 0.254
## Cholesterol -0.095 1.000 0.236 0.101
## MaxHR -0.382 0.236 1.000 -0.112
## RestingBP 0.254 0.101 -0.112 1.000
#tidy the correlation matrix
tidycor <- cormat %>% as.data.frame %>%
rownames_to_column("Variable1") %>%
pivot_longer(-1,names_to="Variable2",values_to="correlation")
#plot the correlation matrix
tidycor %>% ggplot(aes(Variable1,Variable2,fill=correlation))+
geom_tile()+
scale_fill_gradient2(low="red",mid="white",high="blue")+ #makes colors!
geom_text(aes(label=round(correlation,2)),color = "black", size = 4)+ #overlay values
theme(axis.text.x = element_text(angle = 90, hjust=1))+ #rotates x-axis labels
coord_fixed()+ggtitle("Correlation Matrix of Heart Disease Indicators")
#I selected the numeric variables age, cholesterol, max heart rate, and resting blood pressure
clust_dat<-heart%>%dplyr::select(Age,Cholesterol,MaxHR,RestingBP)
#graph the silhouette width to find the correct number of clusters
sil_width<-vector() #empty vector to hold mean sil width
for(i in 2:10){
kms <- kmeans(clust_dat,centers=i) #compute k-means solution
sil <- silhouette(kms$cluster,dist(clust_dat)) #get sil widths
sil_width[i]<-mean(sil[,3]) #take averages (higher is better)
}
ggplot()+geom_line(aes(x=1:10,y=sil_width))+scale_x_continuous(name="k",breaks=1:10)+ggtitle("Silhouette Width for Cluster Analysis")
#based on the silhouette width graph, we plan on using 2 clusters for further analysis
#use the pam method
pam1<-clust_dat%>%pam(k=2)
#pam1
#save this in a cluster dataset
pamclust<-clust_dat%>%mutate(cluster=as.factor(pam1$clustering))
#compute the mean for the cluster set
pamclust%>%group_by(cluster)%>%summarize_if(is.numeric,mean,na.rm=T)
## # A tibble: 2 x 5
## cluster Age Cholesterol MaxHR RestingBP
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 52.8 246. 140. 133.
## 2 2 56.4 2.87 122. 130.
#show the medoids for the data
clust_dat%>%slice(pam1$id.med)
## # A tibble: 2 x 4
## Age Cholesterol MaxHR RestingBP
## <dbl> <dbl> <dbl> <dbl>
## 1 47 235 145 130
## 2 56 0 122 130
#visualize the clusters across all pairs of variables
ggpairs(pamclust, aes(color=cluster))+ggtitle("Pairwise Cluster Plot")
Discussion of clustering
From the silhouette width plot, the recommended number of clusters for the dataset is 2. Additionally, the correlation matrix shows that many of the variables have little relation to each other; the strongest correlation is between MaxHR and Age, a negative relationship (-0.38). From a cluster perspective, Age and RestingBP have similar density plots across the two clusters. For the MaxHR density plot, the first cluster captures more of the larger values. Lastly, Cholesterol shows a strong ability to differentiate cluster 1 from cluster 2, as most of the points in cluster 2 sit at the low end of the Cholesterol range.
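Because the four variables sit on very different scales (Cholesterol spans hundreds of units while Age spans decades), the Euclidean distances used by PAM are dominated by Cholesterol, which is likely why it drives the split above. A minimal sketch of rerunning PAM on standardized variables as a sensitivity check:
#standardize the clustering variables, then rerun PAM with k = 2
clust_dat_scaled <- clust_dat %>% scale %>% as.data.frame
pam_scaled <- clust_dat_scaled %>% pam(k=2)
#medoids of the scaled solution, shown as rows of the original data
clust_dat %>% slice(pam_scaled$id.med)
#average silhouette width for the scaled solution (higher is better)
pam_scaled$silinfo$avg.width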
# PCA code here
heartpca <- heart %>% select(Age, Cholesterol, MaxHR, RestingBP) %>%
scale
# run the pca test for this dataset
pca1 <- princomp(heartpca, cor = T)
# examine the signs and magnitudes of the loadings; the first 3 PCs
# capture about 85% of the variance, so 3 PC scores are sufficient
summary(pca1, loadings = T)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 1.2511464 1.0671626 0.8463246 0.7612695
## Proportion of Variance 0.3913418 0.2847090 0.1790663 0.1448828
## Cumulative Proportion 0.3913418 0.6760508 0.8551172 1.0000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4
## Age 0.630 0.169 0.310 0.692
## Cholesterol -0.303 0.705 0.617 -0.173
## MaxHR -0.622 0.201 -0.346 0.673
## RestingBP 0.353 0.659 -0.635 -0.197
# the first 3 PCs account for roughly 85% of the total variance
# plot the dimension-reduced data for further analysis
# scatter plot showing the differentiation between heart
# disease and sex of patient
heart %>% mutate(PC1 = pca1$scores[, 1], PC2 = pca1$scores[,
2], PC3 = pca1$scores[, 3]) %>% ggplot(aes(x = PC1, y = PC2,
color = Heart_Disease, shape = Sex)) + geom_point(alpha = 0.5)
# relationship to indicators and their trends of PC loading
# scores
pca1$loadings[1:4, 1:2] %>% as.data.frame %>% rownames_to_column %>%
ggplot() + geom_hline(aes(yintercept = 0), lty = 2) + geom_vline(aes(xintercept = 0),
lty = 2) + ylab("PC2") + xlab("PC1") + geom_segment(aes(x = 0,
y = 0, xend = Comp.1, yend = Comp.2), arrow = arrow(), col = "red") +
geom_label(aes(x = Comp.1 * 1.1, y = Comp.2 * 1.1, label = rowname)) +
ggtitle("Relationship of Heart Disease Indicators to Principle Component Loadings")
Discussion of PCA
I wanted to identify which direction each variable's loading vector points on the PC score plot. Age and RestingBP load positively on both PC1 and PC2. Cholesterol and MaxHR load negatively on PC1 and positively on PC2, so as PC1 decreases these variables increase. From the scatter plot of PC scores colored by heart disease status, a lower PC1 score is usually an indicator of no heart disease, whereas PC2 shows much more variability, so it is harder to read a pattern along that axis.
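To support the claim that three components capture about 85% of the variance, a scree-style plot can be drawn directly from the princomp output; a minimal sketch:
#proportion of variance explained by each principal component
eigvals <- pca1$sdev^2
varprop <- eigvals/sum(eigvals)
ggplot() + geom_col(aes(x = 1:4, y = varprop)) +
    geom_line(aes(x = 1:4, y = cumsum(varprop))) +
    geom_point(aes(x = 1:4, y = cumsum(varprop))) +
    scale_x_continuous(name = "Principal Component", breaks = 1:4) +
    ylab("Proportion of Variance") + ggtitle("Scree Plot with Cumulative Variance")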
#-------------------------------------------------------
# fit a logistic regression classifier
logistic_fit <- glm(Heart_Disease == "yes" ~ Age + Cholesterol +
MaxHR + RestingBP, data = heart, family = "binomial")
# generate the predicted scores/probabilities for the original observations
prob_reg <- predict(logistic_fit, type = "response")
# prob_reg %>% round(3)
# Compute Classification Diagnostics
class_diag(prob_reg, heart$Heart_Disease, positive = "yes")
## acc sens spec ppv f1 ba auc
## Metrics 0.7015 0.752 0.639 0.7208 0.736 0.6955 0.7626
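The class_diag() function used here is a course-provided helper assumed to be loaded earlier in the document. For readers running this code without that helper, a minimal stand-in sketch that reproduces two of its columns (accuracy at a 0.5 cutoff and a rank-based AUC):
#hypothetical stand-in for class_diag, reporting accuracy and AUC only
class_diag_sketch <- function(probs, truth, positive = "yes") {
    is_pos <- truth == positive
    acc <- mean((probs > 0.5) == is_pos)
    #AUC = probability that a random positive case outranks a random negative case
    auc <- mean(outer(probs[is_pos], probs[!is_pos], ">") +
        0.5 * outer(probs[is_pos], probs[!is_pos], "=="))
    data.frame(acc = round(acc, 4), auc = round(auc, 4))
}
class_diag_sketch(prob_reg, heart$Heart_Disease, positive = "yes")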
# Report a Confusion Matrix
# in the confusion matrix, truth 1 represents someone with heart
# disease and truth 0 represents someone without heart disease
table(truth = heart$HeartDisease, predictions = prob_reg > 0.5)
## predictions
## truth FALSE TRUE
## 0 262 148
## 1 126 382
# cross-validation of linear classifier here
set.seed(322)
k = 10
data <- sample_frac(heart) #randomly order rows
folds <- rep(1:k, length.out = nrow(data)) #create folds
diags <- NULL
i = 1
for (i in 1:k) {
# create training and test sets
train <- data[folds != i, ]
test <- data[folds == i, ]
truth <- test$Heart_Disease
# train model
fit <- glm(Heart_Disease == "yes" ~ Age + Cholesterol + MaxHR +
    RestingBP, data = train, family = "binomial")
# test model
probs <- predict(fit, newdata = test, type = "response")
# get performance metrics for each fold
diags <- rbind(diags, class_diag(probs, truth, positive = "yes"))
}
# average performance metrics across all folds
summarize_all(diags, mean)
## acc sens spec ppv f1 ba auc
## 1 0.70151 0.7552 0.64154 0.72266 0.73507 0.69836 0.76646
Discussion
Using a generalized linear model, we see an AUC of 0.7626. This is not an especially strong score, and after cross-validating the data the AUC changes only slightly, to 0.76646. Such a small change suggests no signs of overfitting; however, this is still not a great model for our data, so it would not be preferred, and we should see how a non-parametric classifier performs.
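To visualize where that AUC comes from, an ROC curve can be plotted from the fitted probabilities. A minimal sketch, assuming the pROC package is installed:
library(pROC)
#build the ROC curve from the logistic regression probabilities
roc_logistic <- roc(response = heart$Heart_Disease, predictor = prob_reg,
    levels = c("no", "yes"), direction = "<")
plot(roc_logistic, main = "ROC Curve for the Logistic Regression Classifier")
auc(roc_logistic)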
library(caret)
# non-parametric classifier code here
knn_fit <- knn3(Heart_Disease == "yes" ~ Age + Cholesterol +
MaxHR + RestingBP, data = heart)
prob_knn <- predict(knn_fit, heart)[, 2]
# prob_knn %>% round(3)
# Compute Classification Diagnostics
class_diag(prob_knn, heart$Heart_Disease, positive = "yes")
## acc sens spec ppv f1 ba auc
## Metrics 0.781 0.8071 0.7488 0.7992 0.8031 0.7779 0.8659
# Report a Confusion Matrix
knn_fit2 <- knn3(factor(HeartDisease == 1, levels = c("TRUE",
"FALSE")) ~ Age + Cholesterol + MaxHR + RestingBP, data = heart)
# create prediction columns based on the knn fit
y_hat_knn <- predict(knn_fit2, heart)
# report the confusion matrix based on the truth and
# prediction counts
table(truth = factor(heart$HeartDisease == 1, levels = c("TRUE",
"FALSE")), prediction = factor(y_hat_knn[, 1] > 0.5, levels = c("TRUE",
"FALSE")))
## prediction
## truth TRUE FALSE
## TRUE 410 98
## FALSE 103 307
# cross-validation of np classifier here
set.seed(322)
k = 10
data <- sample_frac(heart) #randomly order rows
folds <- rep(1:k, length.out = nrow(data)) #create folds
diags <- NULL
i = 1
for (i in 1:k) {
# create training and test sets
train <- data[folds != i, ]
test <- data[folds == i, ]
truth <- test$Heart_Disease
# train model
fit <- knn3(Heart_Disease == "yes" ~ Age + Cholesterol +
    MaxHR + RestingBP, data = train)
# test model
probs <- predict(fit, newdata = test)[, 2]
# get performance metrics for each fold
diags <- rbind(diags, class_diag(probs, truth, positive = "yes"))
}
# average performance metrics across all folds
summarize_all(diags, mean)
## acc sens spec ppv f1 ba auc
## 1 0.78102 0.8072 0.75214 0.79809 0.80147 0.77966 0.86628
Discussion
We see a much higher AUC of 0.8659, which, while not a perfect performance metric, is very good given the size and scope of our data. After performing cross-validation, the AUC stays essentially the same at 0.86628, which shows no signs of overfitting with our data. Choosing between the two approaches, I would prefer the kNN methodology for further analysis and implementation on this dataset.
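One caveat: knn3() defaults to k = 5 neighbors and k was not tuned above, so the AUC could likely be improved. A minimal sketch of tuning k with caret's built-in 10-fold cross-validation (note that train() selects k by accuracy rather than AUC by default):
set.seed(322)
#convert the outcome to a factor so caret treats this as classification
heart_knn <- heart %>% mutate(Heart_Disease = factor(Heart_Disease))
#tune the number of neighbors over a grid of odd values to avoid ties
knn_tuned <- train(Heart_Disease ~ Age + Cholesterol + MaxHR + RestingBP,
    data = heart_knn, method = "knn",
    trControl = trainControl(method = "cv", number = 10),
    tuneGrid = data.frame(k = seq(3, 25, by = 2)))
knn_tuned$bestTune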
# regression model code here
# Regression model for predicting the age of person based on
# Cholesterol,MaxHR,and RestingBP
# subset the variables used for the regression model
linClassifierData <- heart %>% select(Age, Cholesterol, MaxHR,
    RestingBP, Heart_Disease)
# fit a linear regression model to predict Age
linear_fit <- lm(heart$Age ~ Cholesterol + MaxHR + RestingBP,
data = heart)
# generate the predicted Age values for the original observations
score <- predict(linear_fit)
# score %>% round(3)
# Compute Mean Squared Error
paste0("Mean Squared Error: ", mean((heart$Age - score)^2))
## [1] "Mean Squared Error: 71.7753752128872"
# cross-validation of the regression model
set.seed(322)
k = 10
data <- sample_frac(heart) #randomly order rows
folds <- rep(1:k, length.out = nrow(data)) #create folds
diags <- NULL
i = 1
for (i in 1:k) {
# create training and test sets
train <- data[folds != i, ]
test <- data[folds == i, ]
# train model
fit <- lm(Age ~ Cholesterol + MaxHR + RestingBP, data = train)
# test model
probs <- predict(fit, newdata = test)
# get performance metrics for each fold
# accumulate the MSE for each fold
diags <- c(diags, mean((test$Age - probs)^2))
}
# Computes average MSE across all folds
paste0("Average MSE Across the Folds: ", mean(diags))
## [1] "Average MSE Across the Folds: 68.3555007508285"
Discussion
The mean squared error for the full dataset is 71.77, which is quite high from a prediction standpoint. After cross-validating the linear model, the MSE is 68.36, a slight improvement. In conclusion, linear regression with these predictors is a poor way to model Age, as it performs badly. There do not appear to be signs of overfitting in the linear regression model, but that is likely because the model fits the data so poorly to begin with.
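To put the MSE in the units of the response, taking the square root gives the typical prediction error in years: an MSE of 71.77 corresponds to an RMSE of roughly 8.5 years. A small sketch:
#root mean squared error: typical Age prediction error in years
sqrt(mean((heart$Age - score)^2))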
library(reticulate)
# python code here
import pandas as pd
heartDataset = r.heart
#aggregate the number of patients with heart disease by their chest pain type
heartDataset.filter(['ChestPainType','Heart_Disease']).query('Heart_Disease == "yes"').groupby(['ChestPainType']).agg(['size'])
#Query the patients and their attributes with/without heart disease for further analysis
## Heart_Disease
## size
## ChestPainType
## ASY 392
## ATA 24
## NAP 72
## TA 20
patientsWithHeartDisease = heartDataset.loc[heartDataset['HeartDisease'] == 1].filter(['Age','Sex','ChestPainType','Cholesterol','MaxHR'])
patientsWithoutHeartDisease = heartDataset.loc[heartDataset['HeartDisease'] == 0].filter(['Age','Sex','ChestPainType','Cholesterol','MaxHR'])
#Maintain Records of patients with/without heart disease
print(patientsWithHeartDisease.head(20))
## Age Sex ChestPainType Cholesterol MaxHR
## 1 49.0 F NAP 180.0 156.0
## 3 48.0 F ASY 214.0 108.0
## 8 37.0 M ASY 207.0 130.0
## 11 58.0 M ATA 164.0 99.0
## 13 49.0 M ASY 234.0 140.0
## 16 38.0 M ASY 196.0 166.0
## 18 60.0 M ASY 248.0 125.0
## 19 36.0 M ATA 267.0 160.0
## 23 44.0 M ATA 288.0 150.0
## 30 53.0 M NAP 518.0 130.0
## 32 54.0 M ASY 224.0 122.0
## 33 41.0 M ASY 172.0 130.0
## 36 65.0 M ASY 306.0 87.0
## 41 54.0 F NAP 294.0 100.0
## 44 43.0 M ASY 175.0 120.0
## 49 41.0 M ASY 289.0 170.0
## 50 50.0 M ASY 233.0 121.0
## 51 47.0 F ASY 205.0 98.0
## 56 31.0 M ASY 270.0 153.0
## 57 58.0 M NAP 213.0 140.0
print(patientsWithoutHeartDisease.head(20))
## Age Sex ChestPainType Cholesterol MaxHR
## 0 40.0 M ATA 289.0 172.0
## 2 37.0 M ATA 283.0 98.0
## 4 54.0 M NAP 195.0 122.0
## 5 39.0 M NAP 339.0 170.0
## 6 45.0 F ATA 237.0 170.0
## 7 54.0 M ATA 208.0 142.0
## 9 48.0 F ATA 284.0 120.0
## 10 37.0 F NAP 211.0 142.0
## 12 39.0 M ATA 204.0 145.0
## 14 42.0 F NAP 211.0 137.0
## 15 54.0 F ATA 273.0 150.0
## 17 43.0 F ATA 201.0 165.0
## 20 43.0 F TA 223.0 142.0
## 21 44.0 M ATA 184.0 142.0
## 22 49.0 F ATA 201.0 164.0
## 24 40.0 M NAP 215.0 138.0
## 25 36.0 M NAP 209.0 178.0
## 26 53.0 M ASY 260.0 112.0
## 27 52.0 M ATA 284.0 118.0
## 28 53.0 F ATA 468.0 127.0
Discussion
By converting the heart dataset from an R object to a Python object using reticulate's r object, I am able to perform queries on the data with Python syntax. The examples above show how to query the data for further statistical analysis in Python: aggregating the patients with heart disease by chest pain type, and keeping separate records of patients with and without heart disease. I struggled with calling Python from R due to environment errors, but this is something I will work on to improve the project.
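For the R-to-Python errors mentioned above, reticulate provides helpers for checking which Python installation is bound and whether pandas is available; a minimal sketch, run from the R side:
# report which Python binary and environment reticulate is using
reticulate::py_config()
# confirm pandas can be imported from that environment
reticulate::py_module_available("pandas")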