Assumption Checking of LDA vs. QDA – R Tutorial (Pima Indians Data Set)

In this blog post, we discuss how to check the assumptions behind linear and quadratic discriminant analysis (LDA and QDA) for the Pima Indians data. We also built a Shiny app for this purpose. The code is available here. Let’s start with the assumption checking of LDA vs. QDA.

What we will be covering:

  • Data checking and data cleaning
  • Checking assumption of equal variance-covariance matrices
  • Checking normality assumption

In the next blog post, we will implement LDA and QDA.

Assumption Checking of LDA vs. QDA – Data Checking and Data Cleaning

Before we do some assumption checking of LDA vs. QDA, we have to load some libraries and then read in the data.

When examining the data frame, we notice that some people have an insulin level of zero. Insulin levels might be very important in determining whether someone has diabetes, and an insulin level of 0 is not possible. Therefore, we decided to discard all observations that have an insulin level of 0.

Given that we only have 768 observations, it hurts to throw away almost half of them. Alternatively, one could do the analysis without Insulin, or impute the missing insulin levels based on whether someone has diabetes or not. In addition to the zero values in the Insulin column, there are also zero values in columns two to six. We are going to remove these observations as well.
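
The cleaning step can be sketched as follows. This is a minimal sketch and not the original code of this post: the toy data frame `pima_toy`, the helper `drop_zero_rows`, and the name of the class column (`Outcome`) are assumptions made for illustration.

```r
# Hypothetical toy data frame that mimics the layout of the Pima data.
pima_toy <- data.frame(
  Pregnancies   = c(6, 1, 8, 1, 0),
  Glucose       = c(148, 85, 183, 89, 137),
  BloodPressure = c(72, 66, 64, 66, 40),
  SkinThickness = c(35, 29, 0, 23, 35),
  Insulin       = c(0, 0, 0, 94, 168),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 43.1),
  Outcome       = c(1, 0, 1, 0, 1)
)

# Keep only the rows in which none of the given columns is zero.
drop_zero_rows <- function(df, cols) {
  has_zero <- rowSums(df[, cols, drop = FALSE] == 0) > 0
  df[!has_zero, , drop = FALSE]
}

# Columns two to six: a zero in these columns is treated as a missing value.
zero_cols  <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")
pima_clean <- drop_zero_rows(pima_toy, zero_cols)
nrow(pima_clean)
```

Pregnancies is deliberately left out of `zero_cols`, because zero pregnancies is a valid value.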

Assumption Checking of LDA vs. QDA – Checking Assumption of Equal Variance-Covariance Matrices

We want to use LDA and QDA to classify our observations into diabetes and no diabetes. From this post we know the assumptions of LDA and QDA, so let’s check them.

Checking the Assumption of Equal Variance

First, in order to get a sense of our data and if we have equal variances among each class, we can use boxplots.

[Figure: boxplots for the diabetes data set]

The three different boxplots show us that the length of the boxes clearly differs between the classes. This is an indication of unequal variances.
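
A boxplot per class can be drawn with base R as sketched below. The data here are simulated stand-ins, not the real Pima values, and the class labels and the object names `sim` and `spread` are assumptions for illustration.

```r
set.seed(1)

# Simulated stand-in for the cleaned data (NOT the real Pima values):
# the diabetes group is given a larger spread on purpose.
sim <- data.frame(
  Outcome = factor(rep(c("no diabetes", "diabetes"), each = 100)),
  Glucose = c(rnorm(100, mean = 110, sd = 20),
              rnorm(100, mean = 145, sd = 30))
)

# One box per class; a longer box and longer whiskers mean more spread.
boxplot(Glucose ~ Outcome, data = sim,
        main = "Glucose by class", ylab = "Glucose")

# Numerical companion to the plot: variance and IQR per class.
spread <- sapply(split(sim$Glucose, sim$Outcome),
                 function(x) c(variance = var(x), IQR = IQR(x)))
round(spread, 1)
```

The length of a box is the interquartile range, so comparing the `IQR` row across the classes puts a number on what the plot shows.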

We can examine further the assumption of homogeneity of variance covariance matrices by plotting covariance ellipses.

Checking the Assumption of Equal Covariance Ellipses

[Figure: covariance ellipses for the Pima Indians data set]

From this scatterplot, we can clearly see that the diabetes group has a much larger variance than the non-diabetes group: the blue points (diabetes) are spread more widely than the red points (non-diabetes).
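
To illustrate what a covariance ellipse is, here is a minimal base-R sketch. It uses simulated stand-in data rather than the real Pima values, and the helper `cov_ellipse` is a hypothetical function written for this example; it is not the plotting function used for the figure above.

```r
set.seed(2)
n <- 150
# Simulated stand-in for two predictors in each class (NOT the real data).
no_diab <- data.frame(Glucose = rnorm(n, 110, 20), BMI = rnorm(n, 31, 5))
diab    <- data.frame(Glucose = rnorm(n, 145, 35), BMI = rnorm(n, 35, 8))

# Points on the ellipse that covers `level` of a bivariate normal
# distribution with the sample mean and covariance of (x, y).
cov_ellipse <- function(x, y, level = 0.68, n_points = 200) {
  S      <- cov(cbind(x, y))
  centre <- c(mean(x), mean(y))
  theta  <- seq(0, 2 * pi, length.out = n_points)
  circle <- cbind(cos(theta), sin(theta))
  radius <- sqrt(qchisq(level, df = 2))
  # chol(S) maps the unit circle onto the ellipse defined by S
  sweep(radius * (circle %*% chol(S)), 2, centre, `+`)
}

plot(c(no_diab$Glucose, diab$Glucose), c(no_diab$BMI, diab$BMI),
     type = "n", xlab = "Glucose", ylab = "BMI", main = "Covariance ellipses")
points(no_diab$Glucose, no_diab$BMI, col = "red",  pch = 20)
points(diab$Glucose,    diab$BMI,    col = "blue", pch = 20)
lines(cov_ellipse(no_diab$Glucose, no_diab$BMI), col = "red",  lwd = 2)
lines(cov_ellipse(diab$Glucose,    diab$BMI),    col = "blue", lwd = 2)
```

The area of each ellipse grows with the square root of the determinant of the group covariance matrix, so ellipses of clearly different size or orientation hint at unequal covariance matrices.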

We use Box’s M test to check our assumption of homogeneity of variance-covariance matrices.
H_0: The covariance matrices of the predictors are equal across all groups of the outcome variable
H_a: The covariance matrix of the predictors differs for at least one group
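
To make the test statistic concrete, the following is a minimal base-R sketch of Box's M test with the usual chi-square approximation. The data are simulated stand-ins, not the real Pima values, and `box_m` is a hypothetical helper written for this example; in practice one would more likely call a packaged implementation.

```r
set.seed(42)
n <- 150
# Simulated stand-in for the predictors (NOT the real Pima values);
# the diabetes group is generated with a larger spread.
X <- rbind(
  cbind(Glucose = rnorm(n, 110, 20), BMI = rnorm(n, 31, 5)),
  cbind(Glucose = rnorm(n, 145, 35), BMI = rnorm(n, 35, 8))
)
group <- factor(rep(c("no diabetes", "diabetes"), each = n))

# Box's M test with the usual chi-square approximation.
box_m <- function(X, group) {
  X   <- as.matrix(X)
  p   <- ncol(X)                   # number of predictors
  lev <- levels(group)
  g   <- length(lev)               # number of groups
  n_i <- as.vector(table(group))   # group sizes, in the order of `lev`
  N   <- sum(n_i)

  log_det <- function(S) as.numeric(determinant(S, logarithm = TRUE)$modulus)
  covs    <- lapply(lev, function(l) cov(X[group == l, , drop = FALSE]))
  pooled  <- Reduce(`+`, Map(function(S, m) (m - 1) * S, covs, n_i)) / (N - g)
  ld      <- setNames(sapply(covs, log_det), lev)

  M    <- (N - g) * log_det(pooled) - sum((n_i - 1) * ld)
  corr <- (sum(1 / (n_i - 1)) - 1 / (N - g)) *
    (2 * p^2 + 3 * p - 1) / (6 * (p + 1) * (g - 1))
  stat <- M * (1 - corr)
  df   <- p * (p + 1) * (g - 1) / 2
  list(statistic = stat, df = df,
       p.value = pchisq(stat, df, lower.tail = FALSE),
       log_det = c(ld, pooled = log_det(pooled)))
}

res <- box_m(X, group)
res$p.value < 0.05   # TRUE would mean: reject equal covariance matrices
```

The `log_det` element holds the log determinant of each group covariance matrix and of the pooled matrix. These are the components that enter the statistic, so groups with very different log determinants drive the test towards rejection.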

When we choose an alpha of 0.05, our result leads us to conclude that we have a problem of heterogeneity of variance-covariance matrices. The plot below shows how the groups differ in the components that go into Box’s M test.

The log determinants are ordered according to the sizes of the ellipses we saw in the covariance ellipse plots. This plot confirms what our visualizations showed: the ellipses have different sizes and, therefore, the variance-covariance matrices are not equal. It is worth noting that Box’s M test is very sensitive and can flag even small departures from homogeneity.

As an additional check, we can perform a Levene test to check for equal variances.

From this test, we can see that the variances of the groups differ for Pregnancies. They also differ for the variables Age, Glucose, and Insulin (not shown).

For BloodPressure, the variances seem to be equal. This is also the case for SkinThickness and BMI (not shown).
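
For readers who want to see what the Levene test does under the hood, here is a minimal base-R sketch of the median-centred variant (a one-way ANOVA on absolute deviations from the group medians). The data are simulated stand-ins, not the real Pima values, and `levene_median` is a hypothetical helper for this example; the `car` package offers a packaged version as `leveneTest`.

```r
set.seed(7)
n <- 150
# Simulated stand-in for one predictor in both classes (NOT the real data).
y     <- c(rnorm(n, mean = 110, sd = 20), rnorm(n, mean = 145, sd = 35))
group <- factor(rep(c("no diabetes", "diabetes"), each = n))

# Median-centred Levene test: a one-way ANOVA on the absolute
# deviations of each value from its own group median.
levene_median <- function(y, group) {
  z   <- abs(y - ave(y, group, FUN = median))
  tab <- anova(lm(z ~ group))
  c(F = tab[["F value"]][1], p.value = tab[["Pr(>F)"]][1])
}

levene_median(y, group)
```

A small p-value suggests that the spread differs between the groups. Unlike Box's M test, this check looks at one variable at a time and ignores the covariances.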

Assumption Checking of LDA vs. QDA – Checking Assumption of Normality

With the following Q-Q plots, we check whether the predictors are normally distributed within the diabetes group and within the non-diabetes group.
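
Such plots can be produced with `qqnorm` and `qqline` from base R, as in the sketch below. The two variables are simulated stand-ins (one roughly normal, one right skewed), not the real Pima values, and `sample_skewness` is a hypothetical helper added as a rough numerical companion.

```r
set.seed(3)
# Simulated stand-ins (NOT the real Pima values): one roughly normal
# predictor and one right-skewed predictor.
skin_thickness <- rnorm(100, mean = 33, sd = 10)
insulin        <- rgamma(100, shape = 2, scale = 100)

old_par <- par(mfrow = c(1, 2))
qqnorm(skin_thickness, main = "SkinThickness")
qqline(skin_thickness, col = "red")   # reference line through the quartiles
qqnorm(insulin, main = "Insulin")
qqline(insulin, col = "red")
par(old_par)

# Sample skewness: clearly positive values point to a right-skewed variable.
sample_skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(SkinThickness = sample_skewness(skin_thickness),
  Insulin       = sample_skewness(insulin))
```

In the plot of the skewed variable, the points at both ends lie above the red line, which gives the bow shape described in this section.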

  • Diabetes Group

[Figure: Q-Q plots of Glucose, Pregnancies, Insulin, and Age for the diabetes group]

When looking at the distribution of Glucose, we can see that it is roughly normally distributed because most of the dots fall on the red line. However, the tails of the distribution are thicker than the tails of a normal distribution, because the points deviate from the line at the left and at the right of the plot.

The variable Pregnancies is not normally distributed because the points deviate substantially from the line.

Insulin and Age are also not normally distributed. The points follow a curve in the shape of a bow, which indicates skewness. The points are above the line, then below it, and then above it again. This indicates that the distribution is skewed to the right.

[Figure: Q-Q plots of SkinThickness, BloodPressure, BMI, and PedigreeFunction for the diabetes group]

The variables SkinThickness and BloodPressure look roughly normally distributed. The down-swing at the left and the up-swing at the right of the plot for BloodPressure suggest that its distribution is heavier-tailed than the theoretical normal distribution.

For the BMI variable, the points swing up substantially at the right of the plot. These points might be outliers, but we cannot confirm that from these plots alone.

The distribution of the PedigreeFunction variable again looks right skewed and not normally distributed.

If one wants to test for normality, the shapiro.test function is an option. Here is an example of how to use it:

The null hypothesis is that the data is normally distributed. Based on the results, we fail to reject the null hypothesis for SkinThickness. This does not prove that SkinThickness is normally distributed, but the test gives us no evidence against normality.
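
A call to `shapiro.test` could look like the sketch below. The two variables are simulated stand-ins, not the real Pima values, so the p-values shown by this code say nothing about the actual data.

```r
set.seed(4)
# Simulated stand-ins (NOT the real data): one normal, one right skewed.
skin_thickness <- rnorm(100, mean = 33, sd = 10)
insulin        <- rgamma(100, shape = 2, scale = 100)

res_skin    <- shapiro.test(skin_thickness)
res_insulin <- shapiro.test(insulin)

# A small p-value is evidence against normality; a large one only
# means that the test found no evidence against it.
c(SkinThickness = res_skin$p.value, Insulin = res_insulin$p.value)
```

To test a predictor within one class only, one would pass the subset of that class, for example the SkinThickness values of the diabetes group.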

  • Non Diabetes Group

[Figure: Q-Q plots for the non-diabetes group, Pima Indians data set]

None of the variables seem to be normally distributed, and all distributions seem to be right skewed based on the bow shape of the points in all four plots.

[Figure: Q-Q plots of BMI, SkinThickness, BloodPressure, and PedigreeFunction for the non-diabetes group]

BMI, SkinThickness, and BloodPressure seem to be roughly normally distributed, whereas PedigreeFunction shows the bow shape that suggests it is right skewed.

Another visualization technique is to plot the density of the predictors for each group. Through the plots below, we can see whether the predictors in each group are normally distributed, and we can also check for equal variance.

[Figure: density plots of the predictors by group, Pima Indians data set]

From the plots above, we can conclude that many of the distributions are right skewed and that the variances often differ as well.
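
Density plots per group can be drawn with base R as in this minimal sketch. The data are simulated stand-ins, not the real Pima values, and the colours simply follow the red and blue convention used for the two groups earlier in this post.

```r
set.seed(5)
# Simulated stand-in for one predictor in both classes (NOT the real data).
glucose_no  <- rnorm(200, mean = 110, sd = 20)
glucose_yes <- rnorm(100, mean = 145, sd = 35)

# Kernel density estimate for each class
d_no  <- density(glucose_no)
d_yes <- density(glucose_yes)

plot(d_no, col = "red", lwd = 2, main = "Glucose by class",
     xlim = range(d_no$x, d_yes$x), ylim = range(0, d_no$y, d_yes$y))
lines(d_yes, col = "blue", lwd = 2)
legend("topright", legend = c("no diabetes", "diabetes"),
       col = c("red", "blue"), lwd = 2)
```

A wider and flatter curve indicates a larger variance, and a long tail on the right side indicates right skew.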

I hope you have enjoyed this post about the assumption checking of LDA vs. QDA and now know why these algorithms can behave differently on certain data sets.

In our next post, we are going to implement LDA and QDA and see which algorithm gives us a better classification rate.
