Classification – Quadratic vs. Linear Discriminant Analysis (Pima Indians Data Set)
April 2, 2018 · By Pascal Schmidt · Machine Learning, R
In previous blog posts, we discussed the theory behind linear and quadratic discriminant analysis, and we examined the assumptions for the Pima Indians Data Set with a Shiny app.
In this blog post, we are going to implement these two algorithms and see which one performs better.
First, we are going to load all the required packages for this tutorial.
library(dplyr)
library(MASS)
library(ggplot2)
library(broom)
library(AUC)

diabetes <- read.csv("diabetes.csv")
head(diabetes, 3)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3

dim(diabetes)
## [1] 768   9

diabetes <- dplyr::rename(diabetes, Pedigree = DiabetesPedigreeFunction)

# replace all zero values from column two to six with NA
diabetes[, 2:6][diabetes[, 2:6] == 0] <- NA

# now we omit all rows with NA values
diabetes <- na.omit(diabetes)

# recode the outcome from 0/1 to descriptive labels
diabetes$Outcome <- ifelse(diabetes$Outcome == 1, "Diabetes", "No Diabetes")
After this data cleaning, in which we removed all rows with missing values, we are ready for our analysis.
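As a quick sanity check, we can confirm how many complete cases remain after cleaning; this is the 392 that shows up as the denominator in the accuracy calculations below:

# 768 rows before cleaning, 392 complete cases afterwards
nrow(diabetes)
## [1] 392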
Linear Discriminant Analysis
lda_model <- MASS::lda(Outcome ~ ., data = diabetes)
preds <- predict(lda_model)
head(preds$posterior)
##      Diabetes No Diabetes
## 4  0.01976163  0.98023837
## 5  0.87150729  0.12849271
## 7  0.02576147  0.97423853
## 9  0.90400328  0.09599672
## 14 0.85368115  0.14631885
## 15 0.77561936  0.22438064
The output above shows the posterior probabilities of being classified into the “Diabetes” or “No Diabetes” group. For example, the first observation (row 4) belongs to the “No Diabetes” group with a probability of 98%, and the second (row 5) belongs to the “Diabetes” group with a probability of 87%. By default, predict() assigns each observation to the class with the larger posterior probability, which amounts to a 50% threshold.
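To make the 50% threshold explicit, here is a minimal sketch that rebuilds the class labels from the posterior probabilities and checks that they match the classes stored in preds$class:

# assign "Diabetes" whenever its posterior probability exceeds 0.5
manual_class <- ifelse(preds$posterior[, "Diabetes"] > 0.5,
                       "Diabetes", "No Diabetes")

# TRUE if our manual rule reproduces predict()'s classifications
all(manual_class == as.character(preds$class))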
lda_model
## Call:
## lda(Outcome ~ ., data = diabetes)
##
## Prior probabilities of groups:
##    Diabetes No Diabetes
##   0.3316327   0.6683673
##
## Group means:
##             Pregnancies  Glucose BloodPressure SkinThickness  Insulin
## Diabetes       4.469231 145.1923      74.07692      32.96154 206.8462
## No Diabetes    2.721374 111.4313      68.96947      27.25191 130.8550
##                  BMI  Pedigree      Age
## Diabetes    35.77769 0.6255846 35.93846
## No Diabetes 31.75076 0.4721679 28.34733
##
## Coefficients of linear discriminants:
##                         LD1
## Pregnancies   -0.0576971998
## Glucose       -0.0285466972
## BloodPressure -0.0002434335
## SkinThickness -0.0074724303
## Insulin        0.0005493946
## BMI           -0.0415379420
## Pedigree      -0.7002019020
## Age           -0.0261835760
From our output we can read off the prior probabilities \hat{\Pi}_1 = 0.332 and \hat{\Pi}_2 = 0.668. This means that around 33.2% of the people in our data set have been diagnosed with diabetes and 66.8% have not.
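Because we did not pass a prior argument to lda(), MASS estimates the priors from the class proportions in the training data, which we can verify directly:

# empirical class proportions match the priors reported by lda()
table(diabetes$Outcome) / nrow(diabetes)
##    Diabetes No Diabetes
##   0.3316327   0.6683673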
Below the prior probabilities, we can see the group means of each predictor within each class. On average, people who have been diagnosed with diabetes were pregnant 4.5 times, have a glucose level of 145.2, a blood pressure of 74.1 mm Hg, a triceps skinfold thickness of 33 mm, an insulin level of 207 mu U/ml, a BMI of 35.8 (weight in kg / (height in m)^2), a diabetes pedigree function of 0.63, and are 36 years old.
The coefficients below the group means are the scores used to classify observations into “Diabetes” and “No Diabetes”.
plot(lda_model)
More specifically, these coefficients of the linear discriminant define the linear combination of the predictors that forms the LDA decision rule. When an observation's linear combination (its LD1 score) is negative, the probability increases that the observation has diabetes (see plot), whereas when it is positive, the observation is more likely to belong to the “No Diabetes” group.
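As a minimal sketch (using only objects created above, and relying on the priors equalling the sample proportions, as they do here), we can reconstruct the LD1 scores ourselves by centering the predictors at their overall means and projecting onto the coefficient vector stored in lda_model$scaling:

# predictor columns, in the order the model saw them
X <- as.matrix(diabetes[, c("Pregnancies", "Glucose", "BloodPressure",
                            "SkinThickness", "Insulin", "BMI",
                            "Pedigree", "Age")])

# center at the overall means, then project onto the discriminant direction
ld1_manual <- scale(X, center = TRUE, scale = FALSE) %*% lda_model$scaling

# the two LD1 columns should agree up to floating point error
head(cbind(ld1_manual, preds$x))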
diabetes <- data.frame(diabetes, predicted = preds$class)
xtabs(~ predicted + Outcome, data = diabetes)
##              Outcome
## predicted     Diabetes No Diabetes
##   Diabetes          76          30
##   No Diabetes       54         232

# prediction accuracy
round((232 + 76) / 392, 4)
## [1] 0.7857

# error
round((30 + 54) / 392, 4)
## [1] 0.2143
We get a prediction error of about 21.4%. Our sensitivity (true positive rate) is only 76/(76+54) = 58.5%. We have a specificity (true negative rate) of 232/(232+30) = 88.5%.
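These rates can also be computed straight from the confusion matrix instead of typing the counts in by hand; a small sketch:

conf_mat <- xtabs(~ predicted + Outcome, data = diabetes)

# sensitivity: correctly predicted "Diabetes" among all who truly have diabetes
conf_mat["Diabetes", "Diabetes"] / sum(conf_mat[, "Diabetes"])

# specificity: correctly predicted "No Diabetes" among all who truly do not
conf_mat["No Diabetes", "No Diabetes"] / sum(conf_mat[, "No Diabetes"])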
ROC Curve
With the ROC curve, we can play a little bit with our sensitivity and specificity. Remember, we used a threshold of 50% for our posterior probabilities. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) as we vary this threshold, so it shows the trade-off between true and false positives that the threshold controls. The ideal point is the upper left corner of the ROC plot. Hence, we want to choose a threshold whose point on the curve lies closest to the upper left corner.
posterior_yes <- preds$posterior[, "Diabetes"]
true_yes <- (diabetes$Outcome == "Diabetes") %>%
  as.numeric() %>%
  factor()

ROC_res <- roc(posterior_yes, true_yes)
tidy_ROC <- tidy(ROC_res)

ggplot(tidy_ROC, aes(x = fpr, y = tpr)) +
  geom_point(pch = ".") +
  geom_vline(xintercept = 0.15)
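Instead of eyeballing the vertical line, we can also find the cutoff closest to the corner programmatically; a minimal sketch using the tidy_ROC data frame from above:

# Euclidean distance from each point on the curve to the ideal corner
# (fpr = 0, tpr = 1)
corner_dist <- sqrt(tidy_ROC$fpr^2 + (1 - tidy_ROC$tpr)^2)

# the row whose cutoff lies closest to the upper left corner
tidy_ROC[which.min(corner_dist), ]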
tidy_ROC %>%
  filter(fpr >= 0.148, fpr <= 0.151)
##   instance    cutoff      fpr       tpr
## 1      283 0.3661268 0.148855 0.6923077
## 2      541 0.3637505 0.148855 0.7000000
## 3      244 0.3620957 0.148855 0.7076923

n <- nrow(diabetes)
thresh <- 0.3620957
dclass <- rep("No Diabetes", n)
dclass[posterior_yes > thresh] <- "Diabetes"

outcome <- data.frame(diabetes, prednew = dclass)
xtabs(~ prednew + Outcome, data = outcome)
##              Outcome
## prednew       Diabetes No Diabetes
##   Diabetes          92          39
##   No Diabetes       38         223

(223 + 92) / 392
## [1] 0.8035714
Above, we chose a threshold of around 0.36. The number of observations correctly classified into the “Diabetes” group has increased from 76 to 92, so our false-negative count has dropped from 54 to 38. The price is more false positives: we now classify 39 people into the “Diabetes” group who have not been diagnosed with diabetes, up from 30. At this threshold, the sensitivity is 92/130 = 70.8% and the specificity is 223/262 = 85.1%. Notice that overall, however, we have decreased our error rate from 21.4% to about 19.6%.
Quadratic Discriminant Analysis
# exclude the LDA predictions we appended to the data frame earlier
qda_model <- qda(Outcome ~ . - predicted, data = diabetes)
preds <- predict(qda_model)
head(preds$posterior)
##       Diabetes  No Diabetes
## 4  0.006150471 9.938495e-01
## 5  0.999980207 1.979306e-05
## 7  0.012858630 9.871414e-01
## 9  0.999993547 6.453313e-06
## 14 1.000000000 4.001310e-14
## 15 0.834364633 1.656354e-01

# overwrite the LDA predictions with the QDA ones
diabetes$predicted <- preds$class
xtabs(~ predicted + Outcome, data = diabetes)
##              Outcome
## predicted     Diabetes No Diabetes
##   Diabetes          86          32
##   No Diabetes       44         230

(230 + 86) / 392
## [1] 0.8061224
posterior_yes <- preds$posterior[, "Diabetes"]
true_yes <- (diabetes$Outcome == "Diabetes") %>%
  as.numeric() %>%
  factor()

ROC_res <- roc(posterior_yes, true_yes)
tidy_ROC <- tidy(ROC_res)

ggplot(tidy_ROC, aes(x = fpr, y = tpr)) +
  geom_point(pch = ".") +
  geom_vline(xintercept = 0.135)
tidy_ROC %>%
  filter(fpr >= 0.13, fpr <= 0.14)
##   instance    cutoff       fpr       tpr
## 1      748 0.4568396 0.1335878 0.6923077
## 2       32 0.4488076 0.1335878 0.7000000
## 3      391 0.4432782 0.1374046 0.7000000

n <- nrow(diabetes)
thresh <- 0.4432782
dclass <- rep("No Diabetes", n)
dclass[posterior_yes > thresh] <- "Diabetes"

outcome <- data.frame(diabetes, prednew = dclass)
xtabs(~ prednew + Outcome, data = outcome)
##              Outcome
## prednew       Diabetes No Diabetes
##   Diabetes          91          36
##   No Diabetes       39         226

(91 + 226) / 392
## [1] 0.8086735
In terms of code, quadratic discriminant analysis differs very little from linear discriminant analysis. The quadratic discriminant analysis algorithm yields the best classification rate here. This might be because the class covariance matrices differ, or because the true decision boundary is not linear.
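QDA's extra flexibility comes from estimating a separate covariance matrix for each class. As a quick, informal look (not a formal test such as Box's M), we can compare the within-class covariances of two of the predictors:

# per-class covariance matrices for two predictors; noticeably different
# matrices would support using QDA over LDA
by(diabetes[, c("Glucose", "Insulin")], diabetes$Outcome, cov)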
We play again with the ROC curve and determine a threshold that yields a better classification rate. Maybe you can achieve an even better one. Try it out! Keep in mind, though, that this exercise is mainly useful for understanding ROC curves better. It does not matter much whether the accuracy is 82% or 78%, because these numbers are computed on the training data. If we applied our chosen threshold to a new data set, we would see different performance.
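To make that last point concrete, here is a minimal sketch of a holdout evaluation; the 70/30 split and the seed are arbitrary choices for illustration, not part of the original analysis:

set.seed(123)  # arbitrary seed, for reproducibility only
train_idx <- sample(nrow(diabetes), floor(0.7 * nrow(diabetes)))
train <- diabetes[train_idx, ]
test  <- diabetes[-train_idx, ]

# fit on the training fold only, using the original eight predictors
lda_fit <- MASS::lda(Outcome ~ Pregnancies + Glucose + BloodPressure +
                       SkinThickness + Insulin + BMI + Pedigree + Age,
                     data = train)

# accuracy on data the model has never seen; expect it to differ from
# the training accuracy reported above
test_preds <- predict(lda_fit, newdata = test)$class
mean(test_preds == test$Outcome)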