Gradient Descent and Line Search Methods for Linear Regression

In this tutorial, we will be using the gradient descent optimization algorithm to find the beta values that are minimizing the objective function of a multiple linear regression. In the last blog post, we discussed the intuition behind gradient descent and showed how to minimize x^2. Check it out for some more information about gradient descent and line search methods.

What we will be covering in this post is:

  • The intuition behind linear regression.
  • Is the objective function of a linear regression convex?
  • Gradient descent algorithm framework.
  • Gradient descent with constant step length, exact step length, backtracking line search Armijo), and Wolfe conditions.

gradient descent linear regression

Intuition Behind the Objective Function of a Linear Regression

Let’s consider the most basic regression model, a simple linear regression.

We have a response variable y and a predictor x. Visually this would look like this:

linear regression

What we are essentially doing is trying to minimize the sum of squares of the actual values (dots in red) minus the predicted values (the regression line). This would look like this:

linear regression sum of squares

As you can see, we are trying to minimize the sum of squares in the picture above. There is only one line for which the sum of squares are minimized.

Now, you might ask yourself why it is not the sum of absolute values. In order to answer this question, we have to know how to minimize our objective function. We need to take the derivative and set it equal to zero. However, dealing with derivatives and absolute values can be tricky in mathematics and can result in discontinuous derivatives. Therefore, we prefer the sum of squares rather than the sum of absolute values.

In order to minimize the sum of squares, we have to subtract every single actual value of our response variable Y from the predicted value \hat{y} and and then add all of them up. Putting that into a formula would look like this:

f(\beta) = \frac{1}{2} \sum\limits_{i = 1}^n(y_i - \hat{y}_i)^2

where

\hat{y}_i = \hat{\beta}^T X_i

So,

f(\beta) = \frac{1}{2} \sum\limits_{i = 1}^n(y_i - \hat{\beta}^T X_i)^2

The equation above can also be written with matrices and vectors. The objective function looks like this in matrix notation:

f(\beta) = \frac{1}{2} (Y - X\beta)^T(Y-X\beta)

Does Gradient Descent Always Find A Global Minimizer of a Linear Regression?

In order to find a global minimizer, the objective linear regression function has to be convex. In **this** blog post, I talked about gradient descent and convexity. The objective function has to fulfill the following criteria:

  • The second derivative of the objective function has to be a positive semidefinite.

Above, we derived the linear regression objective function. In order to check the second derivative or hessian matrix of the function, we first have to multiply out the brackets.

Let’s have a look at the objective function of a linear regression again:

f(\beta) = (Y - X\beta)^T(Y-X\beta)

When we are multiplying out the terms, we get:

= \beta^TX^TX\beta - \beta^TX^TY - Y^TX\beta + Y^TY

Now we take the first-order derivative (or gradient) of the function with respect to beta:

\nabla f(\beta) = X^TY -X^TY + 2X^TX\beta

And now, we are taking the second-order derivative with respect to beta:

\nabla^2 f(\beta) = 2X^TX \succeq 0

The matrix above will always be a positive semidefinite and hence, the linear regression objective function is convex which ensures that any local minimizer is a global minimizer.

Preparing Our Data Set for Linear Regression

We are using the Gapminder data set to construct a linear regression model. Our response variable will be life expectancy lifeExp, and continent, gdpPercap, and pop will be our predictors.
We ar needing the following libraries:

library(tidyverse) 
library(ggpubr)
library(gapminder)

Then, we are constructing our Y variable (lifeExp) and our design matrix X. Our design matrix consists of all predictor variables with indicator variables and all ones in the first column. A generic design matrix or model matrix looks like this:

\begin{bmatrix} 1 & x_{11} & \dots & x_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m1} & \dots & x_{mn} \end{bmatrix}

We will be using lifeExp as our response variable and continent, pop, and gdpPercapas predictors.

# response variable Y (life expectancy)
Y <- matrix(gapminder$lifeExp, nrow = nrow(gapminder), ncol = 1)

# setting up our design matrix x
gapminder %>%
  tidyr::spread(key = continent, value = continent) %>%
  dplyr::mutate_at(
    .vars = vars(Oceania, Americas, Asia, Europe),
    .funs = funs(ifelse(is.na(.), 0, 1))
  ) %&gt;%
  dplyr::select(Americas, Asia, Europe, Oceania) %>%
  data.matrix() -> gap

# setting up our design matrix
gapminder %>%
  dplyr::select(pop, gdpPercap) %>%
  data.matrix() %>%
  scale() %>%
  cbind(rep(1, nrow(.)), ., gap) %>%
  magrittr::set_colnames(c("Intercept", colnames(.[, -1]))) -> x

head(x)

The design matrix is displayed below. We used spread()from the tidyr package to construct the indicator variables with Africa as our baseline.

#      Intercept        pop  gdpPercap Americas Asia Europe Oceania
# [1,]         1 -0.1994753 -0.6528949        0    1      0       0
# [2,]         1 -0.1917924 -0.6486942        0    1      0       0
# [3,]         1 -0.1821262 -0.6454228        0    1      0       0
# [4,]         1 -0.1701545 -0.6471376        0    1      0       0
# [5,]         1 -0.1556338 -0.6568984        0    1      0       0
# [6,]         1 -0.1386693 -0.6522184        0    1      0       0

The performance of the gradient descent algorithm depends on how well the problem is formulated. An important issue in the formulation of the problem is scaling. When we look at the original Gapminder data frame, then we can see that the value for pop is much greater than the values for gdpPercap. Hence, our linear regression function will be very sensitive to small changes in pop but no very sensitive to small changes in gdpPercap. Therefore, we have to scale these variables in order to get the gradient descent algorithm to converge.

The Gradient Descent Algorithm Framework

Given \: x_0 (our initial guess),

k = 0

while \: ||\nabla f(x_k)>10^{-4}||_2 (while the norm of the first derivative or gradient is smaller than some threshold)

p_0=-\nabla f(x_0) (search direction)

\alpha= \frac{\nabla f_k^T \nabla f_k}{\nabla f_k^T X^TX \nabla f_k} (step length)

p_{k} = -\nabla f(x_k)

x_{k+1} = x_k + \alpha p_k

k = k + 1

end (while)

Now that we understand the framework, we can get to the code. First, we have to make a guess about the beta coefficients. In our case, we guess them to be all zeros. For the first time, we are trying to get the beta coefficients that are minimizing the objective function value, by assigning a constant step length of 0.0005.

Gradient Descent With Constant Step Length

First, we are doing gradient descent with a fixed step length and after that, we are using the formula from above to calculate the exact step length after each iteration. If you want to find out more about the exact step length you can check out **this** post about gradient descent.

# initial guess for beta
beta_coefficients <- list(
  beta_0 = 0,
  beta_1 = 0,
  beta_2 = 0,
  beta_3 = 0,
  beta_4 = 0,
  beta_5 = 0,
  beta_6 = 0
)

step_length <- 0.0005
beta_vector <- do.call(rbind, beta_coefficients) %>%
  set_rownames(colnames(x))

# gradient of objective function at beta
grad <- -t(x) %*% (Y - x %*% beta_vector)

# initialize iterations
k <- 0

while (norm(grad) > 10^-4) {

  # objective functio value at beta
  obj_fun <- 0.5 * t((Y - x %*% beta_vector)) %*% (Y - x %*% beta_vector)

  # gradient of objective function value at beta
  grad <- -t(x) %*% (Y - x %*% beta_vector)

  # search direction alpha
  search_direction <- -grad

  # update beta coefficients
  beta_vector <- beta_vector + step_length * search_direction

  # update iterations
  k <- k + 1
}

# prepare data
dat <- as.data.frame(scale(gapminder[, c("pop", "gdpPercap")])) %>%
  cbind(gapminder[, c("lifeExp", "continent")])

# fit regression model
lm_summary <- summary(lm(lifeExp ~ pop + gdpPercap + continent, data = dat))
coefs <- lm_summary$coefficients[, 1] %>%
  base::unname() %>%
  round(5)

beta_vector <- beta_vector %>%
  base::unname() %>%
  c() %>%
  round(5)

# compare gradient descent with base R lm() function
all.equal(beta_vector, coefs)

## [1] TRUE

# print iterations
k

## [1] 1418

The gradient descent algorithm took 1418 iterations until convergence with a constant step length of 0.0005. As we can see above, the lm() function in R gives us the same result as the gradient descent algorithm for the beta values.

Now, we are fitting a simple linear regression model with only one predictor to visualize the updated regression line.

Simple Linear Regression to Visualize Gradient Descent

# response variable Y (life expectancy)
Y <- matrix(log(gapminder$lifeExp), nrow = nrow(gapminder), ncol = 1)

# setting up our design matrix
gapminder %>%
  dplyr::select(pop) %>%
  dplyr::mutate(pop = log(pop)) %>%
  data.matrix() %>%
  scale() %>%
  cbind(rep(1, nrow(.)), .) -> x

# initial guess for beta
beta_coefficients <- list(
  beta_0 = 0,
  beta_1 = 0
)

step_length <- 0.0005
beta_vector <- do.call(rbind, beta_coefficients)

# gradient of objective function at beta
grad <- -t(x) %*% (Y - x %*% beta_vector)

# initialize iterations
k <- 0

while (norm(grad) > 10^-4) {

    # plot data
    plot(x[, "pop"], Y,
    main = "Gradient Descent With Fixed Step Length \n and 11 Iterations",
    xlab = "Population",
    ylab = "Life Expectancy"
  )

  # add updated regression line after each iteration
  abline(beta_vector[1, 1], beta_vector[2, 1], col = "red")
  par(new = T)

  # objective functio value at beta
  obj_fun <- 0.5 * t((Y - x %*% beta_vector)) %*% (Y - x %*% beta_vector)

  # gradient of objective function value at beta
  grad <- -t(x) %*% (Y - x %*% beta_vector)

  # search direction alpha
  search_direction <- -grad

  # update beta coefficients
  beta_vector <- beta_vector + step_length * search_direction

  # update iterations
  k <- k + 1
}

k

## [1] 11
gradient descent regression lines

The gradient descent algorithm for the simple linear regression with fixed step length took 11 iterations to converge. The red line above shows how the regression line adjusts after each iteration of gradient descent.

Gradient Descent With Exact Step Length

Now, we are calculating the exact step length after each gradient descent iteration with the updated beta coefficients.

# response variable Y (life expectancy)
Y <- matrix(gapminder$lifeExp, nrow = nrow(gapminder), ncol = 1)

# setting up our design matrix x
gapminder %>%
  tidyr::spread(key = continent, value = continent) %>%
  dplyr::mutate_at(
    .vars = vars(Oceania, Americas, Asia, Europe),
    .funs = funs(ifelse(is.na(.), 0, 1))
  ) %&gt;%
  dplyr::select(Americas, Asia, Europe, Oceania) %>%
  data.matrix() -> gap

# setting up our design matrix
gapminder %>%
  dplyr::select(pop, gdpPercap) %>%
  base::scale() %>%
  data.matrix() %>%
  cbind(rep(1, nrow(.)), ., gap) %>%
  magrittr::set_colnames(c("Intercept", colnames(.[, -1]))) -> x

# initial guess for beta
beta_coefficients <- list(
  beta_0 = 0,
  beta_1 = 0,
  beta_2 = 0,
  beta_3 = 0,
  beta_4 = 0,
  beta_5 = 0,
  beta_6 = 0
)

beta_vector <- do.call(rbind, beta_coefficients) %>%
  set_rownames(colnames(x))

# gradient of objective function value
grad <- -t(x) %*% (Y - x %*% beta_vector)

# initialize iteration
k <- 0

while (norm(grad) > 10^-4) {

  # objective function value at beta
  obj_fun <- 0.5 * t((Y - x %*% beta_vector)) %*% (Y - x %*% beta_vector)

  # gradient of objective function at beta
  grad <- -t(x) %*% (Y - x %*% beta_vector)

  # search direction alpha
  search_direction <- -grad

  # calculation of exact step length at each iteration k
  step_length <- (t(grad) %*% grad) / (t(grad) %*% (t(x) %*% x) %*% grad)
  step_length <- c(step_length)

  # updating beta values
  beta_vector <- beta_vector + step_length * search_direction

  # updating iteration
  k <- k + 1
}

# print iterations
k

## [1] 35

The gradient descent algorithm converges in 35 iterations. This is way fewer iterations compared to the gradient descent algorithm with fixed step length. So, as you can see, guessing the step length is not always a good idea.

For the Gapminder problem, it makes sense to calculate the exact step length for every iteration of gradient descent. However, when you look at the formula, we have to make some matrix multiplication. Therefore, when we have a data set with a million rows or more, these matrix calculations can take a lot of time, and using the exact step length might not always make the algorithm converge faster.

Gradient Descent With Backtracking Line Search

We discussed the backtracking line search in this blog post. Check it out if you want to learn more about the Armijo condition and Wolfe conditions which are important to understand when doing the backtracking line search.

Let’s reset all of our coefficients to zero again.

# initial guess for beta
beta_coefficients <- list(
  beta_0 = 0,
  beta_1 = 0,
  beta_2 = 0,
  beta_3 = 0,
  beta_4 = 0,
  beta_5 = 0,
  beta_6 = 0
)

beta_vector <- do.call(rbind, beta_coefficients) %>%
  set_rownames(colnames(x))

Now, we are implementing the backtracking line search method by only using the Armijo condition. The Armijo condition ensures that there will be a sufficient decrease in the objective function value after each iteration for an appropriate step length.

The Armijo inequality looks like this:

f(x_k + \alpha p_k) \leq f(x_k) + c_1 \alpha \nabla f(x_k)^Tp_k

The backtracking framework looks like this:

Choose

\bar{\alpha} > 0

\rho \in (0, 1)

c_1 \in (0, 1)

\alpha = \bar{\alpha}

while

f(x_k + \alpha p_k) > f(x_k) + c_1 \alpha \nabla f(x_k)^Tp_k

\alpha = \rho \alpha

end (while)

# for Armijo inequality
c <- 0.9
rho <- 0.95

# gradient of objective function value
grad <- -t(x) %*% (Y - x %*% beta_vector)

# search direction
search_direction <- -grad

# objective function value at beta
obj_fun <- 0.5 * t((Y - x %*% beta_vector)) %*% (Y - x %*% beta_vector)

# initialize iteration
k <- 0

while (norm(grad) > 10^-4) {

  # reset step length to 1 before entering loop 
  step_length <- 1
  
  # left hand side of Armijo inequality
  lhs <- 0.5 * t((Y - x %*% (beta_vector + step_length * search_direction))) %*% (Y - x %*% (beta_vector + step_length * search_direction))

  # right hand side of Armijo inequality
  rhs <- obj_fun + c * step_length %*% t(grad) %*% search_direction
  
  # new calculation of alpha with backtracking method
  while (lhs > rhs) {
    
    step_length <- rho * step_length
    
    # update Armijo inequality
    lhs <- 0.5 * t((Y - x %*% (beta_vector + step_length * search_direction))) %*% (Y - x %*% (beta_vector + step_length * search_direction))
    rhs <- obj_fun + c * step_length %*% t(grad) %*% search_direction
  }
  
  # objective function value at beta
  obj_fun <- 0.5 * t((Y - x %*% beta_vector)) %*% (Y - x %*% beta_vector)

  # gradient of objective function at beta
  grad <- -t(x) %*% (Y - x %*% beta_vector)
  
  # search direction alpha
  search_direction <- -grad

  # updating beta values
  beta_vector <- beta_vector + step_length * search_direction

  # updating iteration
  k <- k + 1
}

# print iterations
k

## [1] 42

The backtracking line search is a really simple line search method to code and yields very good results. It only takes 7 iterations more before convergence as compared to the exact step length.

I hope you have enjoyed this blog post about gradient descent and the implementation of it with linear regression.

In the next tutorial, we will be covering the same Gapminder data and linear regression by applying two other optimization techniques, Newton’s method and conjugate gradient descent.

If you have any suggestions or questions, feel free to leave a comment.

Post your comment