Multiple Linear Regression Model Building – R Tutorial (Part 2)

April 30, 2018 · By Pascal Schmidt · Machine Learning, R

In part one, we prepared our data and checked all the assumptions necessary to build a successful regression model. In this blog post, we are going to build and select the “best” model.

Model 1

First, we start off with the backward selection method to choose the “best” subset model. The regsubsets function comes from the leaps package. Make sure to center the variables for which we include a polynomial term: the simple and quadratic terms of a variable tend to be highly correlated, and centering first helps avoid that source of multicollinearity in the model.

centre <- function(x) x - mean(x)

# overwrite the columns in place; passing them to data.frame() would
# create duplicate column names (e.g. Polio.1) instead of replacing them
gapExp.centered <- gapExp
gapExp.centered$Polio <- centre(gapExp$Polio)
gapExp.centered$Diphtheria <- centre(gapExp$Diphtheria)
gapExp.centered$thinness..1.19.years <- centre(gapExp$thinness..1.19.years)
gapExp.centered$thinness.5.9.years <- centre(gapExp$thinness.5.9.years)

 

library(leaps)

back <- regsubsets(lifeExp ~ Adult.Mortality + Alcohol + percentage.expenditure +
                     BMI + poly(Polio, 2) + Total.expenditure + poly(Diphtheria, 2) +
                     poly(thinness..1.19.years, 2) + poly(thinness.5.9.years, 2) +
                     Schooling + lHIV.AIDS + lgdpPercap + lpop + lMeasles + linfant.deaths,
                   data = gapExp.centered, method = "backward")
plot(back, scale = "Cp")

From the plot above, we can see that the algorithm selects the variables Alcohol, thinness..1.19.years, thinness.5.9.years, lHIV.AIDS, lgdpPercap, lpop, lMeasles, and linfant.deaths. In more detail, we want to choose the model with the lowest Mallows' C_p value. The C_p values are displayed on the y-axis and the variables on the x-axis. The equation for Mallows' C_p is C_p = \frac{1}{n}({\rm RSS} + 2d\hat{\sigma}^2), where RSS is the residual sum of squares of the regression, d is the number of predictors, and \hat{\sigma}^2 is an estimate of the variance of the error \epsilon. In other words, C_p has the form RSS plus a penalty, where the penalty term depends on the error variance and the number of predictors. The more predictors we add, the larger the penalty term gets. The algorithm reaches the lowest C_p value at the point where adding or removing a predictor no longer decreases the RSS by more than it changes the penalty. The lowest C_p value here is 20, and in that row, the black boxes mark the variables included in the model that achieves it.
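Instead of reading the C_p values off the plot, we can also extract them directly from the regsubsets summary (a short sketch using the same back object from above):

back.summary <- summary(back)
back.summary$cp  # Mallows' C_p for each model size considered

best <- which.min(back.summary$cp)
coef(back, best)  # coefficients of the lowest-C_p model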

mo <- lm(lifeExp ~ Alcohol + thinness..1.19.years + thinness.5.9.years + lHIV.AIDS + lgdpPercap + lpop + lMeasles + linfant.deaths, data = gapExp)
summary(mo)
## 
## Call:
## lm(formula = lifeExp ~ Alcohol + thinness..1.19.years + thinness.5.9.years + 
##     lHIV.AIDS + lgdpPercap + lpop + lMeasles + linfant.deaths, 
##     data = gapExp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.8066  -1.6774   0.1524   2.0257   9.7778 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           9.89247    4.33627   2.281 0.023456 *  
## Alcohol              -0.31169    0.08065  -3.865 0.000145 ***
## thinness..1.19.years -0.38499    0.09250  -4.162 4.48e-05 ***
## thinness.5.9.years    0.25809    0.08838   2.920 0.003850 ** 
## lHIV.AIDS            -5.52135    0.32431 -17.025  < 2e-16 ***
## lgdpPercap            2.51421    0.32887   7.645 5.85e-13 ***
## lpop                  3.06582    0.33560   9.135  < 2e-16 ***
## lMeasles             -0.27631    0.08822  -3.132 0.001965 ** 
## linfant.deaths       -3.54575    0.39576  -8.959  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.462 on 227 degrees of freedom
## Multiple R-squared:  0.9237, Adjusted R-squared:  0.9211 
## F-statistic: 343.7 on 8 and 227 DF,  p-value: < 2.2e-16

The summary output looks very good and every predictor is statistically significant at the 1% level. Our adjusted R-squared value is 0.9211, meaning the predictors in our model explain roughly 92% of the variation in the response variable lifeExp, after adjusting for the number of predictors.
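As a quick sanity check, the adjusted R-squared can be reproduced by hand from the multiple R-squared, the sample size, and the number of predictors:

# adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - d - 1)
# n = 236 observations (227 residual df + 8 predictors + 1 intercept), d = 8
1 - (1 - 0.9237) * (236 - 1) / (236 - 8 - 1)  # ~0.9211, matching the summary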

car::vif(mo)
##              Alcohol thinness..1.19.years   thinness.5.9.years 
##             1.914408             3.957565             3.812058 
##            lHIV.AIDS           lgdpPercap                 lpop 
##             1.962144             3.849195             5.288542 
##             lMeasles       linfant.deaths 
##             1.745723             8.907579

When checking the VIF values, we can see that we have a problem with multicollinearity: lpop and linfant.deaths are above 5. This is a problem because multicollinearity inflates the standard errors of the affected coefficient estimates, meaning they are larger than they would be without it. This is not good!
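To make these numbers concrete: the VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors in the model. As a small sketch, we can reproduce the worst value by hand:

# auxiliary regression of linfant.deaths on the remaining predictors
aux <- lm(linfant.deaths ~ Alcohol + thinness..1.19.years + thinness.5.9.years +
            lHIV.AIDS + lgdpPercap + lpop + lMeasles, data = gapExp)
1 / (1 - summary(aux)$r.squared)  # reproduces car::vif(mo)["linfant.deaths"], ~8.91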

plot(mo, which = c(1, 2))  # residuals vs. fitted and normal q-q plot



The q-q plot looks decent and we can say that our residuals are roughly normally distributed, with a thicker left tail than the theoretical distribution; we can tell because the points deviate from the line on the left. The residuals versus fitted plot looks better than the one in our first blog post. However, we cannot assume constant variance yet because the red line has a bow shape and hence follows a pattern. Ideally, the red line would be a horizontal straight line at y = 0, which would let us assume homoscedasticity.
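Next, we compute the PRESS statistic. Note that PRESS() is not a base R function; packages such as MPV provide one, or you can define it yourself from the leverage values (a minimal sketch):

# predicted residual error sum of squares: the leave-one-out residual
# of a linear model is e_i / (1 - h_ii), where h_ii is the leverage
PRESS <- function(fit) {
  sum((residuals(fit) / (1 - hatvalues(fit)))^2)
}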

PRESS(mo)
## [1] 3114.46

The PRESS statistic gives us the predicted residual error sum of squares and measures how well the model predicts observations it was not fitted on. The lower, the better.

We are deciding not to use the model we built above because of the high multicollinearity among the predictors. linfant.deaths has the highest VIF value of all the predictors, and centering would not help here, as it only addresses the correlation between a variable and its own polynomial term. Hence, we are dropping this variable.

 

Model 2

back <- regsubsets(lifeExp ~ Adult.Mortality + Alcohol + percentage.expenditure +
                     BMI + poly(Polio, 2) + Total.expenditure + poly(Diphtheria, 2) +
                     poly(thinness..1.19.years, 2) + poly(thinness.5.9.years, 2) +
                     Schooling + lHIV.AIDS + lgdpPercap + lpop + lMeasles,
                   data = gapExp.centered, method = "backward")
plot(back, scale = "Cp")

From the plot above, we can see that the backward selection algorithm chose Adult.Mortality, Diphtheria (the simple and the quadratic term), thinness..1.19.years, thinness.5.9.years, Schooling, lHIV.AIDS, and lgdpPercap. As before, the black fields mark variables included in our model and the white fields mark variables excluded from it. The “best” model has a C_p value of 7.3, lower than for our previous model.

final.model <- lm(lifeExp ~ Adult.Mortality + poly(Diphtheria, 2) + thinness..1.19.years + thinness.5.9.years + Schooling + lHIV.AIDS + lgdpPercap, data = gapExp)
summary(final.model)
## 
## Call:
## lm(formula = lifeExp ~ Adult.Mortality + poly(Diphtheria, 2) + 
##     thinness..1.19.years + thinness.5.9.years + Schooling + lHIV.AIDS + 
##     lgdpPercap, data = gapExp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7444  -1.6650   0.3242   2.0197   8.3439 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          45.494590   2.578429  17.644  < 2e-16 ***
## Adult.Mortality      -0.006518   0.002254  -2.891  0.00421 ** 
## poly(Diphtheria, 2)1 20.736139   4.447150   4.663 5.32e-06 ***
## poly(Diphtheria, 2)2 18.106869   4.361209   4.152 4.67e-05 ***
## thinness..1.19.years -0.391420   0.097285  -4.023 7.81e-05 ***
## thinness.5.9.years    0.242412   0.091462   2.650  0.00861 ** 
## Schooling             0.505195   0.137563   3.672  0.00030 ***
## lHIV.AIDS            -6.041489   0.344766 -17.523  < 2e-16 ***
## lgdpPercap            2.477335   0.371811   6.663 2.01e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.65 on 227 degrees of freedom
## Multiple R-squared:  0.9153, Adjusted R-squared:  0.9123 
## F-statistic: 306.5 on 8 and 227 DF,  p-value: < 2.2e-16

The adjusted R-squared is around 0.91 and all of our predictors are statistically significant at the 1% level. The q-q plot looks similar to the one for our previous model. The residuals versus fitted plot has improved and we got rid of the bow shape: the red line is almost horizontal now and the residuals bounce randomly around the y = 0 line.
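The diagnostic plots discussed above can be reproduced with the same call as for the first model:

plot(final.model, which = c(1, 2))  # residuals vs. fitted and normal q-q plot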

car::vif(final.model)
##                          GVIF Df GVIF^(1/(2*Df))
## Adult.Mortality      1.793482  1        1.339210
## poly(Diphtheria, 2)  1.925111  2        1.177915
## thinness..1.19.years 3.939165  1        1.984733
## thinness.5.9.years   3.673942  1        1.916753
## Schooling            4.321024  1        2.078707
## lHIV.AIDS            1.995428  1        1.412596
## lgdpPercap           4.427470  1        2.104155

All VIF values for our final model are below 5, which indicates that there is no longer a serious multicollinearity problem among the predictors. Note that because poly(Diphtheria, 2) uses two degrees of freedom, car reports generalized VIFs (GVIF); for such terms, the quantity to compare against a threshold is GVIF^(1/(2*Df)), measured against √5 ≈ 2.24 rather than 5.
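A quick programmatic version of that check, using the column name car assigns to the corrected values:

# all GVIF^(1/(2*Df)) values should stay below sqrt(5) ~ 2.24
all(car::vif(final.model)[, "GVIF^(1/(2*Df))"] < sqrt(5))  # TRUE for this model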

PRESS(final.model)
## [1] 3463.621

The PRESS statistic for our final model is higher (worse) compared to the previous one.

In conclusion, the second model is preferred because of the better-looking residuals versus fitted plot and the lower VIF values. The C_p value for our second model is 7.3, compared to 20 for our first model. Even though its PRESS statistic is slightly worse, considering all things, we are going with the second model as our “best” model.
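As a complementary check (not part of the selection procedure above), the two candidate models can also be compared with AIC. Since both estimate the same number of coefficients, this essentially compares their residual sums of squares and will favor the first model, so the preference for the second model rests on the diagnostics and the collinearity rather than on raw in-sample fit:

AIC(mo, final.model)  # lower is better; model 1 wins on in-sample fit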

In our next post, we are going to implement the Lasso and see which variables this algorithm selects.
