The Lasso – R Tutorial (Part 3)

This is the third part of our regression series. Check out parts one and two. In this blog post, we are going to implement the Lasso.

The Lasso is a shrinkage method: it biases the coefficient estimates in exchange for reduced variance. For the data set we used in parts one and two, we had multicollinearity problems among our predictors, which can inflate the standard errors of the estimates. The Lasso is a good way to address that because, as just mentioned, it reduces the variance of the estimates.

The Lasso objective looks like this: {\rm RSS} + \lambda \sum_{j=1}^p |\beta_j|. It consists of the residual sum of squares and the penalty term, often called the \ell_1 penalty. The penalty term has two components: the tuning parameter \lambda and the sum of the absolute values of the coefficients. The Lasso has an advantage over Ridge regression in that it performs variable selection for us, shrinking some of the coefficients exactly to zero.
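Written out with the RSS in full, the Lasso picks the coefficients that minimize

\sum_{i=1}^n \big(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\big)^2 + \lambda \sum_{j=1}^p |\beta_j|,

whereas Ridge regression penalizes the squared coefficients instead: {\rm RSS} + \lambda \sum_{j=1}^p \beta_j^2. This difference in the penalty is exactly why the Lasso can set coefficients to zero while Ridge only shrinks them towards zero.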

Now that’s enough about the theory. Let’s jump into the code! We are using the glmnet package for our analysis.

 

Building The Lasso Model

Xfull <- model.matrix(lifeExp ~ . - continent, data = gapExp)
Xfull <- Xfull[, -1] # drop the intercept column; glmnet adds its own intercept
Y <- gapExp$lifeExp

library(glmnet)

lambdas <- 10^seq(from = -2, to = 5, length = 100) # grid of candidate lambdas
lafit <- glmnet(Xfull, Y, alpha = 1, lambda = lambdas) # alpha = 1 gives the Lasso penalty
plot(lafit, xvar = "lambda")

The glmnet package can also do cross-validation for us and choose the \lambda that minimizes the estimated test set error; we will use that below. The plot above shows the coefficients on the y-axis and log lambda on the x-axis. Each colored line corresponds to one predictor, so there should be 17 lines in the plot, one for each of our 17 predictors. As the log lambda value gets bigger, more and more coefficients are shrunk towards zero. At a log lambda value of around 3, every single coefficient is equal to zero.
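We can check this numerically by asking glmnet for the coefficients at a particular \lambda; for example, at log lambda = 3, i.e. s = exp(3). (The coef() method for glmnet fits interpolates between the lambdas on the grid.)

coef(lafit, s = exp(3)) # every entry except the intercept should be zero here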

cv.rrfit <- cv.glmnet(Xfull, Y, alpha = 1, lambda = lambdas) 
plot(cv.rrfit)


The plot above shows the log lambda value on the x-axis and the cross-validated mean-squared error on the y-axis. The numbers above the plot give the number of predictors in the model at each value of log lambda. The first vertical dotted line marks the log lambda with the lowest mean-squared error; at that value, 11 variables are included in the model. However, we do not necessarily want the model with the lowest mean-squared error; we want a model with a low mean-squared error that is also parsimonious. Hence, the second dotted line, at a log lambda of around zero, marks the largest lambda whose error is within one standard error of the minimum. As we can see, for only a small increase in the mean-squared error, we get a model with only 8 predictors instead of the 11 at the minimum.
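Both candidate lambdas are stored in the cv.glmnet object, and we can count how many coefficients survive at each; a quick sketch:

log(cv.rrfit$lambda.min) # lambda with the lowest CV error
log(cv.rrfit$lambda.1se) # lambda from the one-standard-error rule
sum(coef(cv.rrfit, s = "lambda.min")[-1, ] != 0) # should match the 11 predictors read off the plot
sum(coef(cv.rrfit, s = "lambda.1se")[-1, ] != 0) # should match the 8 predictors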

rr.best.lam <- cv.rrfit$lambda.1se 
rr.best.lam 

## [1] 1.123324

This is our best lambda (second dotted line).

rr.best <- glmnet(Xfull, Y, alpha = 1, lambda = rr.best.lam)
coef(rr.best)

## 18 x 1 sparse Matrix of class "dgCMatrix"
##                                   s0
## (Intercept)             4.628603e+01
## StatusDeveloping        .           
## Adult.Mortality        -1.020503e-02
## infant.deaths           .           
## Alcohol                 .           
## percentage.expenditure  .           
## Measles                 .           
## BMI                     4.777275e-02
## under.five.deaths       .           
## Polio                   5.231503e-02
## Total.expenditure       .           
## Diphtheria              2.283656e-02
## HIV.AIDS               -5.625701e-01
## thinness..1.19.years   -6.696799e-02
## thinness.5.9.years      .           
## Schooling               1.282836e+00
## pop                     .           
## gdpPercap               8.992658e-05

The output above displays the coefficients that the model at our chosen lambda keeps. The predictors are similar to our backward selection model from part two. The only difference is that the Lasso chose BMI and Polio and excluded thinness.5.9.years.
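Rather than reading the selected variables off the printout, we can also pull them out programmatically; a small sketch:

lasso.coefs <- as.matrix(coef(rr.best))
rownames(lasso.coefs)[lasso.coefs != 0][-1] # names of the selected predictors, intercept dropped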

Lasso.lm <- lm(lifeExp ~ Adult.Mortality + BMI + Polio + Diphtheria + HIV.AIDS + thinness..1.19.years + Schooling + gdpPercap, data = gapExp)
summary(Lasso.lm)

## 
## Call:
## lm(formula = lifeExp ~ Adult.Mortality + BMI + Polio + Diphtheria + 
##     HIV.AIDS + thinness..1.19.years + Schooling + gdpPercap, 
##     data = gapExp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.0784  -2.2553   0.4728   3.0024  16.0524 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          44.9733909  2.1397800  21.018 < 2e-16 ***
## Adult.Mortality      -0.0116677  0.0028721  -4.062 6.69e-05 ***
## BMI                   0.0569323  0.0237958   2.393  0.01755 *  
## Polio                 0.0648003  0.0228560   2.835  0.00499 ** 
## Diphtheria            0.0355485  0.0203689   1.745  0.08230 .  
## HIV.AIDS             -0.6258084  0.0496056 -12.616  < 2e-16 ***
## thinness..1.19.years -0.1251632  0.0837355  -1.495  0.13637    
## Schooling             1.2324330  0.1705728   7.225 7.52e-12 ***
## gdpPercap             0.0001160  0.0000417   2.781  0.00588 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.82 on 227 degrees of freedom
## Multiple R-squared:  0.8522, Adjusted R-squared:  0.847 
## F-statistic: 163.6 on 8 and 227 DF,  p-value: < 2.2e-16

Not all predictors that the Lasso selects are statistically significant at the 5% level; thinness..1.19.years, for example, has a p-value of about 0.14.

plot(Lasso.lm, c(1, 2))



From the Q-Q plot, we can say that the residuals are roughly normally distributed, with a somewhat heavier left tail. The residuals versus fitted plot suggests non-constant variance.

PRESS(Lasso.lm) # PRESS() as provided by, e.g., the MPV package
## [1] 5957.213

vif(Lasso.lm) # vif() from the car package
##      Adult.Mortality                  BMI                Polio 
##             1.668949             2.251578             2.272961 
##           Diphtheria             HIV.AIDS thinness..1.19.years 
##             2.161373             1.388837             1.672961 
##            Schooling            gdpPercap 
##             3.808492             2.558608

The PRESS statistic is higher compared to the model in part two. The VIF values are all below 5, which is what we would expect after the Lasso has dealt with the multicollinearity.
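As a sanity check, the PRESS statistic can be reproduced by hand with the leave-one-out shortcut e_i / (1 - h_ii), using only base R:

sum((residuals(Lasso.lm) / (1 - hatvalues(Lasso.lm)))^2) # should reproduce the PRESS value above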

Because the residuals versus fitted plot does not look satisfactory, we will try to achieve constant variance with some transformations, as in part one of this blog post series.

gapExp$lHIV.AIDS <- log(gapExp$HIV.AIDS + 1) # add 1 so that zero values do not break the log
gapExp$lgdpPercap <- log(gapExp$gdpPercap)
gapExp <- dplyr::select(gapExp, -c(HIV.AIDS, gdpPercap, pop, Measles, infant.deaths)) # drop the untransformed and unused variables

We are log-transforming the variables HIV.AIDS (adding one first, to avoid taking the log of zero) and gdpPercap. Moreover, we will include a polynomial of degree two for the Diphtheria variable.
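One thing to keep in mind: poly() generates orthogonal polynomial terms by default, so the two Diphtheria coefficients we will see below are not the usual x and x^2 coefficients. Passing raw = TRUE gives the ordinary parameterization instead; the fitted values are the same either way.

head(poly(gapExp$Diphtheria, 2)) # orthogonal basis (the default)
head(poly(gapExp$Diphtheria, 2, raw = TRUE)) # raw x and x^2 columns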

Lasso.2 <- lm(lifeExp ~ Adult.Mortality + BMI + Polio + poly(Diphtheria, 2) + lHIV.AIDS + thinness..1.19.years + Schooling + lgdpPercap, data = gapExp)
step(Lasso.2)

## Start:  AIC=629.09
## lifeExp ~ Adult.Mortality + BMI + Polio + poly(Diphtheria, 2) + 
##     lHIV.AIDS + thinness..1.19.years + Schooling + lgdpPercap
## 
##                        Df Sum of Sq    RSS    AIC
## - BMI                   1       0.0 3117.4 627.10
## - Polio                 1       0.0 3117.4 627.10
## <none>                              3117.4 629.09
## - Adult.Mortality       1      91.7 3209.1 633.94
## - thinness..1.19.years  1     122.7 3240.1 636.21
## - Schooling             1     170.8 3288.1 639.68
## - poly(Diphtheria, 2)   2     232.5 3349.9 642.07
## - lgdpPercap            1     551.7 3669.1 665.55
## - lHIV.AIDS             1    4068.0 7185.3 824.17
## 
## Step:  AIC=627.1
## lifeExp ~ Adult.Mortality + Polio + poly(Diphtheria, 2) + lHIV.AIDS + 
##     thinness..1.19.years + Schooling + lgdpPercap
## 
##                        Df Sum of Sq    RSS    AIC
## - Polio                 1       0.0 3117.4 625.10
## <none>                              3117.4 627.10
## - Adult.Mortality       1      91.8 3209.2 631.95
## - thinness..1.19.years  1     127.6 3245.0 634.56
## - Schooling             1     178.9 3296.2 638.26
## - poly(Diphtheria, 2)   2     233.2 3350.6 640.12
## - lgdpPercap            1     570.0 3687.4 664.73
## - lHIV.AIDS             1    4077.8 7195.2 822.49
## 
## Step:  AIC=625.1
## lifeExp ~ Adult.Mortality + poly(Diphtheria, 2) + lHIV.AIDS + 
##     thinness..1.19.years + Schooling + lgdpPercap
## 
##                        Df Sum of Sq    RSS    AIC
## <none>                              3117.4 625.10
## - Adult.Mortality       1      93.5 3210.9 630.07
## - thinness..1.19.years  1     128.8 3246.2 632.65
## - Schooling             1     180.1 3297.5 636.35
## - poly(Diphtheria, 2)   2     375.3 3492.7 647.92
## - lgdpPercap            1     579.8 3697.2 663.36
## - lHIV.AIDS             1    4152.1 7269.5 822.92
## 
## Call:
## lm(formula = lifeExp ~ Adult.Mortality + poly(Diphtheria, 2) + 
##     lHIV.AIDS + thinness..1.19.years + Schooling + lgdpPercap, 
##     data = gapExp)
## 
## Coefficients:
##          (Intercept)       Adult.Mortality  poly(Diphtheria, 2)1  
##            45.863721             -0.005945             20.236357  
## poly(Diphtheria, 2)2             lHIV.AIDS  thinness..1.19.years  
##            17.270051             -6.081137             -0.192632  
##            Schooling            lgdpPercap  
##             0.505747              2.452236

After running the step() function in R, the algorithm settles on the variables Adult.Mortality, Diphtheria (both the linear and the quadratic term), lHIV.AIDS, thinness..1.19.years, Schooling, and lgdpPercap, based on the lowest AIC value (a criterion similar to the Cp we discussed in part two).
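Note that step() reports values from extractAIC(), which drops an additive constant from the log-likelihood; AIC() on the same model therefore returns a different number, but the model rankings are unaffected:

extractAIC(Lasso.2) # equivalent degrees of freedom and the AIC value step() uses
AIC(Lasso.2) # the full AIC; differs from the above by a constant offset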

Lasso.best <- lm(formula = lifeExp ~ Adult.Mortality + poly(Diphtheria, 2) + lHIV.AIDS + thinness..1.19.years + Schooling + lgdpPercap, data = gapExp)
summary(Lasso.best)

## 
## Call:
## lm(formula = lifeExp ~ Adult.Mortality + poly(Diphtheria, 2) + 
##     lHIV.AIDS + thinness..1.19.years + Schooling + lgdpPercap, 
##     data = gapExp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2761  -1.5576   0.2977   2.1011   8.2153 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          45.863721   2.608459  17.583  < 2e-16 ***
## Adult.Mortality      -0.005945   0.002273  -2.615 0.009518 ** 
## poly(Diphtheria, 2)1 20.236357   4.501471   4.495 1.10e-05 ***
## poly(Diphtheria, 2)2 17.270051   4.406860   3.919 0.000118 ***
## lHIV.AIDS            -6.081137   0.348963 -17.426  < 2e-16 ***
## thinness..1.19.years -0.192632   0.062772  -3.069 0.002410 ** 
## Schooling             0.505747   0.139369   3.629 0.000351 ***
## lgdpPercap            2.452236   0.376569   6.512 4.67e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.698 on 228 degrees of freedom
## Multiple R-squared:  0.9126, Adjusted R-squared:   0.91 
## F-statistic: 340.3 on 7 and 228 DF,  p-value: < 2.2e-16

Now, every predictor is statistically significant at the 1% level, and the Adjusted R-squared value is 0.91. This model is very similar to our “best” model in part two, except that the adjusted Lasso model does not include the variable thinness.5.9.years.

PRESS(Lasso.best) 

## [1] 3463.337

The PRESS statistic is almost identical to the one in our final model in part two.

par(mfrow = c(1, 2)) 
plot(Lasso.best, c(1,2))


These two plots also look very similar to our final model in part two.

In conclusion, the final model from part two and the adjusted Lasso model are almost identical, except that the Lasso model does not include the variable thinness.5.9.years. As we said in part one of this series, we want to go with the most parsimonious model which is easy to interpret. Hence, we are choosing the adjusted Lasso model to be the “best”.

Notice that we can also do a partial F-test in order to choose between two nested models.

anova(Lasso.best, final.model)

## Analysis of Variance Table
## 
## Model 1: lifeExp ~ Adult.Mortality + poly(Diphtheria, 2) + lHIV.AIDS + 
##     thinness..1.19.years + Schooling + lgdpPercap
## Model 2: lifeExp ~ Adult.Mortality + poly(Diphtheria, 2) + thinness..1.19.years + 
##     thinness.5.9.years + Schooling + lHIV.AIDS + lgdpPercap
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1    228 3117.4                                
## 2    227 3023.8  1    93.574 7.0247 0.008605 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
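
As a quick arithmetic check, the F statistic in the table can be reproduced by hand from the two residual sums of squares: the drop in RSS per extra degree of freedom, divided by the residual mean square of the larger model.

((3117.4 - 3023.8) / 1) / (3023.8 / 227) # roughly 7.02, matching the table up to rounding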

According to the partial F-test, the thinness.5.9.years variable from the part two model does add explanatory power. However, as we said earlier, we do not want to include this variable because it barely improves the model in terms of the Adjusted R-squared, the PRESS statistic, and the AIC value.
