Assumption Checking for Multiple Linear Regression – R Tutorial (Part 1)

In this blog post, we are going through the underlying assumptions of a multiple linear regression model. These assumptions are:

  • Constant Variance (Assumption of Homoscedasticity)
  • Residuals are normally distributed
  • No multicollinearity between predictors (or only very little)
  • Linear relationship between the response variable and the predictors

We are going to build a model with life expectancy as our response variable and a model for inference purposes. Meaning, that we do not want to build a complicated model with interaction terms only to get higher prediction accuracy. We are rather interested in one, that is very interpretable.

These are the packages you may need for part 1, part 2, and part 3:

library(dplyr)
library(gapminder)
library(leaps)
library(glmnet)
library(ggplot2)
library(MASS)
library(usdm)
library(car)
library(corrplot)
library(texreg)
library(MPV)

 

Data Preparation

For our analysis, we are using the gapminder data set and are merging it with another one from Kaggle.com. First, we are going to read in the data from gapminder and kaggle.

data("gapminder")
gap.2002 <- filter(gapminder, year == 2002)
gap.2007 <- filter(gapminder, year == 2007)
gap.2002 <- rename(gap.2002, Country = country)
gap.2007 <- rename(gap.2007, Country = country)
lifeExp <- read.csv("expect.csv")
lifeExp.2002 <- filter(lifeExp, Year == 2002)
lifeExp.2007 <- filter(lifeExp, Year == 2007)

We are choosing our data to only be from 2002 and 2007 and are merging on Country for each year.  After that, we can do an rbind for these two years.

gapExp.2002 <- merge(lifeExp.2002, gap.2002, by = "Country")
gapExp.2007 <- merge(lifeExp.2007, gap.2007, by = "Country")
gapExp <- rbind(gapExp.2002, gapExp.2007)

Now, we are throwing away the variables that appear twice in our data set and also Hepatitis.B because of the large amount of NA values.

gapExp <- dplyr::select(gapExp, -c(Year, year, Life.expectancy, Population, GDP, Country, Hepatitis.B, Income.composition.of.resources))
gapExp <- na.omit(gapExp)

We are also deciding to not include variables like Status, year, and continent in our analysis because they do not have any physical meaning.

gapExp.muu = dplyr::select(gapExp, -c(Status, continent)
(data.frame(Variables = colnames(gapExp.muu))

##                 Variables
## 1         Adult.Mortality
## 2           infant.deaths
## 3                 Alcohol
## 4  percentage.expenditure
## 5                 Measles
## 6                     BMI
## 7       under.five.deaths
## 8                   Polio
## 9       Total.expenditure
## 10             Diphtheria
## 11               HIV.AIDS
## 12   thinness..1.19.years
## 13     thinness.5.9.years
## 14              Schooling
## 15                lifeExp
## 16                    pop
## 17              gdpPercap

dim(gapExp) 
## [1] 236 17

In the end, we are ending up with 16 predictors and one response variable (lifeExp). There are 236 observations in our data set.

Assumption Checking

First, we are deciding to fit a model with all predictors included and then look at the constant variance assumption.

all.reg <- lm(lifeExp ~. -continent -Status,data = gapExp)
summary(all.reg)

## 
## Call:
## lm(formula = lifeExp ~ . - continent - Status, data = gapExp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.9242  -2.1037   0.5952   2.6685  15.4555 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4.600e+01  2.288e+00  20.102  < 2e-16 ***
## Adult.Mortality        -9.800e-03  2.855e-03  -3.432 0.000715 ***
## infant.deaths           6.868e-02  3.998e-02   1.718 0.087223 .  
## Alcohol                -2.530e-01  1.143e-01  -2.214 0.027846 *  
## percentage.expenditure -1.164e-04  1.933e-04  -0.602 0.547543    
## Measles                 1.260e-05  5.671e-05   0.222 0.824411    
## BMI                     6.356e-02  2.327e-02   2.732 0.006814 ** 
## under.five.deaths      -4.963e-02  2.757e-02  -1.800 0.073158 .  
## Polio                   4.774e-02  2.293e-02   2.082 0.038483 *  
## Total.expenditure       4.224e-02  1.473e-01   0.287 0.774576    
## Diphtheria              3.442e-02  1.974e-02   1.744 0.082620 .  
## HIV.AIDS               -6.101e-01  4.800e-02 -12.712  < 2e-16 ***
## thinness..1.19.years   -4.188e-01  1.255e-01  -3.336 0.000999 ***
## thinness.5.9.years      2.438e-01  1.279e-01   1.905 0.058057 .  
## Schooling               1.280e+00  1.814e-01   7.056 2.23e-11 ***
## pop                     2.791e-09  5.893e-09   0.474 0.636276    
## gdpPercap               1.688e-04  5.072e-05   3.328 0.001025 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.613 on 219 degrees of freedom
## Multiple R-squared:  0.8694, Adjusted R-squared:  0.8598 
## F-statistic:  91.1 on 16 and 219 DF,  p-value: < 2.2e-16
plot(all.reg, which = 1)

Clearly, we can see that the constant variance assumption is violated. There is an upswing and then a downswing visible, which indicates that the homoscedasticity assumption is not fulfilled. We will fix this later in form of transformations.

plot(all.reg, which = 2)

In the plot above we can see that the residuals are roughly normally distributed. The down-swing in residuals at the left and up-swing in residuals at the right of the plot suggests that the distribution of residuals is heavier-tailed than the theoretical distribution.

Next, we will have a look at the no multicollinearity assumption. We can do this by looking at the variance inflation factors (VIF). When the variance inflation factor  is above 5, then there exists multiollinearity. From the output below, infant.deaths and under.five.deaths have very high variance inflation factors. One way to deal with that is to center theses two variables and see if the VIF values decrease.

vif.1 <- usdm::vif(gapExp)
vif.1

##                 Variables        VIF
## 1                  Status         NA
## 2         Adult.Mortality   1.979009
## 3           infant.deaths 479.910310
## 4                 Alcohol   4.138553
## 5  percentage.expenditure   2.403586
## 6                 Measles   4.000891
## 7                     BMI   2.731662
## 8       under.five.deaths 432.757300
## 9                   Polio   2.769248
## 10      Total.expenditure   1.355378
## 11             Diphtheria   2.430362
## 12               HIV.AIDS   2.521836
## 13   thinness..1.19.years   4.380513
## 14     thinness.5.9.years   4.904032
## 15              Schooling   6.336510
## 16              continent         NA
## 17                lifeExp   9.143558
## 18                    pop   9.424969
## 19              gdpPercap   5.628120
centre <- function(x) { x - mean(x)}
gapExp.centered <- data.frame(gapExp, infant.deaths = centre(gapExp$infant.deaths), under.five.deaths = centre(gapExp$under.five.deaths))
gapExp.centered <- dplyr::select(gapExp.centered, -c(under.five.deaths, infant.deaths))
usdm::vif(gapExp.centered)


##                 Variables        VIF
## 1                  Status         NA
## 2         Adult.Mortality   1.979009
## 3                 Alcohol   4.138553
## 4  percentage.expenditure   2.403586
## 5                 Measles   4.000891
## 6                     BMI   2.731662
## 7                   Polio   2.769248
## 8       Total.expenditure   1.355378
## 9              Diphtheria   2.430362
## 10               HIV.AIDS   2.521836
## 11   thinness..1.19.years   4.380513
## 12     thinness.5.9.years   4.904032
## 13              Schooling   6.336510
## 14              continent         NA
## 15                lifeExp   9.143558
## 16                    pop   9.424969
## 17              gdpPercap   5.628120
## 18        infant.deaths.1 479.910310
## 19    under.five.deaths.1 432.757300

Unfortunately, centering did not help in lowering the VIF values for these varaibles. Consequently, we are forced to throw away one of these variables in order to lower the VIF values. We are deciding to throw away under.five.deaths. 

gapExp.centered <- dplyr::select(gapExp.centered, -under.five.deaths.1)
gapExp <- dplyr::select(gapExp, -under.five.deaths)
usdm::vif(gapExp)

##                 Variables      VIF
## 1                  Status       NA
## 2         Adult.Mortality 1.965442
## 3           infant.deaths 3.671070
## 4                 Alcohol 3.900174
## 5  percentage.expenditure 2.401401
## 6                 Measles 3.331261
## 7                     BMI 2.727756
## 8                   Polio 2.766994
## 9       Total.expenditure 1.355349
## 10             Diphtheria 2.398405
## 11               HIV.AIDS 2.507556
## 12   thinness..1.19.years 4.348967
## 13     thinness.5.9.years 4.812785
## 14              Schooling 6.288410
## 15              continent       NA
## 16                lifeExp 9.128520
## 17                    pop 5.636178
## 18              gdpPercap 5.596200

Now, every single VIF value is below 10 which is not bad. At this point we are continuing with our assumption checking and deal with the VIF values that are above 5 later on, when we are building a model with only a subset of predictors.

We also assume that there is a linear relationship between our response variable and the predictors. Let’s check this assumption with scatterplots.

ggplot(gapExp, aes(x = HIV.AIDS , y = lifeExp, col = continent)) + 
    geom_point() + 
    geom_smooth(method = "lm", formula= y ~ log(x),  se = F, aes(group = 1), col = "black") + 
    theme_minimal()

ggplot(gapExp, aes(x = log(HIV.AIDS + 1) , y = lifeExp, col = continent)) + 
    geom_point() + 
    geom_smooth(method = "lm", formula= y ~ x,  se = F, aes(group = 1), col = "black") + 
    theme_minimal()

ggplot(gapExp, aes(x = gdpPercap , y = lifeExp, col = continent)) + 
    geom_point() + 
    geom_smooth(method = "lm", formula = y ~ log(x), aes(group = 1), col = "black", se = F) + 
    theme_minimal()

ggplot(gapExp, aes(x = log(gdpPercap) , y = lifeExp, col = continent)) + 
    geom_point() + 
    geom_smooth(method = "lm", formula = y ~ x, aes(group = 1), col = "black", se = F) + 
    theme_minimal()

The black curve represents a logarithm curve. We can see that the data points follow this curve quite closely. Therefore, we are deciding to log transform our predictors HIV.AIDS and gdpPercap. We can, see in the plots above, that the linear relationship is stronger after these variables have been log trabsformed.

ggplot(gapExp, aes(x = pop , y = lifeExp, col = continent)) + 
    geom_point() + 
    geom_smooth(method = "lm", aes(group = 1), col = "black", se = F) + 
    theme_minimal()

ggplot(gapExp, aes(x = log(pop) , y = lifeExp, col = continent)) + 
    geom_point() + 
    geom_smooth(method = "lm", aes(group = 1), col = "black", se = F) + 
    theme_minimal()

ggplot(gapExp, aes(x = infant.deaths , y = lifeExp, col = continent)) + 
    geom_point() + 
    geom_smooth(method = "lm", aes(group = 1), col = "black", se = F) +
    theme_minimal()

ggplot(gapExp, aes(x = log(infant.deaths + 1) , y = lifeExp, col = continent)) + 
    geom_point() + 
    geom_smooth(method = "lm", aes(group = 1), col = "black", se = F) +
    theme_minimal()

We are also deciding to log transform pop and infant.deaths in order to normalize these variables.

Through the visualizations, the transormations are looking very promising and it seems that we can improve the linear relationship of the response variable with the predictors above by log – transforming them. To make sure that this makes sense, we are checking the correlation coefficients before and after our transformations.

N <- cor(gapExp[, c(3, 6, 11, 16, 17, 18)])[4, c(1:3, 5, 6), drop = F]
corrplot(N, method = "number", cl.pos = "n")

 

gapExp$lHIV.AIDS <- log(gapExp$HIV.AIDS + 1)
gapExp$lgdpPercap <- log(gapExp$gdpPercap)
gapExp$lpop <- log(gapExp$pop)
gapExp$lMeasles <- log(gapExp$Measles + 1)
gapExp$linfant.deaths <- log(gapExp$infant.deaths + 1)
gapExp <- dplyr::select(gapExp, -c(HIV.AIDS, gdpPercap, pop, Measles, infant.deaths))
M <- cor(gapExp[, 13:18])[1, 2:6, drop = FALSE]
corrplot(M, method = "number", cl.pos = "n")


We can see that the correlation coefficient increased for every single variable that we have log transformed. This says that there is now a stronger linear relationship between these predictors and lifeExp. Exactly what we wanted. In addition to that, these transormations might also improve our residual versus fitted plot (constant variance). We will see later when we are building a model.

 

Other predictors seem to have a quadratic relationship with our response variable.

ggplot(gapExp, aes(x = thinness..1.19.years, y = lifeExp, col = continent)) + 
    geom_point() + 
    geom_smooth(method = "lm", formula = y ~ poly(x, 2), aes(group = 1), col = "black", se = F) + 
    theme_minimal()

ggplot(gapExp, aes(x = thinness.5.9.years, y = lifeExp, col = continent)) + 
   geom_point() + 
   geom_smooth(method = "lm", formula = y ~ poly(x, 2), aes(group = 1), col = "black", se = F) + 
   theme_minimal()

ggplot(gapExp, aes(x = Polio, y = lifeExp, col = continent)) + 
   geom_point() + 
   geom_smooth(method = "lm", formula = y ~ poly(x, 2), aes(group = 1), col = "black", se = F) + 
   theme_minimal()

ggplot(gapExp, aes(x = Diphtheria, y = lifeExp, col = continent)) + 
   geom_point() + 
   geom_smooth(method = "lm", formula = y ~ poly(x, 2), aes(group = 1), col = "black", se = F) + 
   theme_minimal()


For our later model, we will include polynomials of degree two for Diphtheria, Polio, thinness.5.9.years, and thinness..1.19.years. Based on our visualizations, there might exists a quadratic relationship between these variables.

In our next blog post, we will finally start to build our multiple linear regression model and decide on good model through variable selection and other criteria. In our final blog post of this series, we will build a Lasso model and see how it compares to the multiple linear regression model.

Comments (3)

  1. When you begin fitting your model with all predictors, you choose to exclude Status and continent; however, neither predictor exists in the dataset as they were removed in the data preparation step. The code reflected could be modified from “lifeExp ~. -continent -Status” to “lifeExp ~.”

  2. When you begin charting your scatterplots, you’re including the continent variable; however, this variable was removed in the data preparation step. The color should be removed.

  3. When you check for correlation coefficients, the column references are inaccurate. Overall, I enjoyed the post but there are a lot of errors that appear to have occurred over time as edits took place. I’d recommend correcting the issues so that someone new can go through them.

Post your comment