Conjugate Gradient Descent for Linear Regression

July 15, 2019, by Pascal Schmidt

Optimization techniques are used throughout machine learning to minimize objective functions. In this blog post, we will use two of them, conjugate gradient descent and preconditioned conjugate gradient descent, to find the beta values that minimize the linear regression objective function.

In the last blog post, we found the beta values by applying gradient descent with different line search methods to the Gapminder data set. This time, we will use the same data set again.

What we will be covering:

  • Introduction to conjugate gradient descent and its framework.
  • Applying the conjugate gradient method to the Gapminder data set to find the beta coefficients that minimize the sum of squares.
  • Can conjugate gradient descent with a Cholesky preconditioner outperform ordinary conjugate gradient descent?


Let’s go!

Preparing the Data and Design Matrix for Linear Regression

# required libraries
library(tidyverse)
library(ggpubr)
library(gapminder)

# response variable Y (life expectancy)
Y <- matrix(gapminder$lifeExp, nrow = nrow(gapminder), ncol = 1)

# setting up our design matrix x
gapminder %>%
  tidyr::spread(key = continent, value = continent) %>%
  dplyr::mutate_at(
    .vars = vars(Oceania, Americas, Asia, Europe),
    .funs = funs(ifelse(is.na(.), 0, 1))
  ) %>%
  dplyr::select(Americas, Asia, Europe, Oceania) %>%
  data.matrix() -> gap

# setting up our design matrix
gapminder %>%
  dplyr::select(pop, gdpPercap) %>%
  data.matrix() %>%
  scale() %>%
  cbind(rep(1, nrow(.)), ., gap) %>%
  magrittr::set_colnames(c("Intercept", colnames(.[, -1]))) -> x

head(x)
##      Intercept        pop  gdpPercap Americas Asia Europe Oceania
## [1,]         1 -0.1994753 -0.6528949        0    1      0       0
## [2,]         1 -0.1917924 -0.6486942        0    1      0       0
## [3,]         1 -0.1821262 -0.6454228        0    1      0       0
## [4,]         1 -0.1701545 -0.6471376        0    1      0       0
## [5,]         1 -0.1556338 -0.6568984        0    1      0       0
## [6,]         1 -0.1386693 -0.6522184        0    1      0       0

Now that we are done with the data preparation, we can have a look at the framework of the conjugate gradient descent.

The Conjugate Gradient Method

The conjugate gradient method is among the most useful techniques for solving large linear systems of equations with a symmetric positive definite coefficient matrix. Its performance is determined by the distribution of the eigenvalues of that coefficient matrix. In our case, the coefficient matrix is built from the design matrix we constructed above. Namely,

X^TX

The conjugate gradient algorithm converges in at most as many iterations as there are distinct eigenvalues of the coefficient matrix. Let's have a look at the eigenvalues of X^TX for the Gapminder data set.

# number of unique eigenvalues
eigen(t(x) %*% x)$values %>%
  unique() %>%
  length() -> unique_eigenvalues

We can see that we have 7 eigenvalues in total, all of which are unique. Hence, we can conclude that conjugate gradient descent will find the optimal beta values in at most 7 iterations. This is far fewer iterations than the gradient descent method we applied in the previous blog post.

Later, we will also implement the preconditioned conjugate gradient method, which can decrease the number of iterations even further. A preconditioner clusters the eigenvalues of the coefficient matrix more tightly and therefore reduces the number of iterations the method needs to converge.
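
To get a feel for this, here is a small illustrative sketch (not part of the analysis below) that compares the eigenvalue spread of X^TX with that of a simple Jacobi (diagonal) preconditioned version. It assumes the design matrix x built above; the names A_example, M_inv_sqrt, and A_precond are introduced only for this illustration.

# illustrative only: a Jacobi (diagonal) preconditioner typically pulls
# the eigenvalues of A = X^T X much closer together
A_example <- t(x) %*% x
M_inv_sqrt <- diag(1 / sqrt(diag(A_example)))         # M = diag(A)
A_precond <- M_inv_sqrt %*% A_example %*% M_inv_sqrt  # preconditioned system

range(eigen(A_example)$values)   # widely spread eigenvalues
range(eigen(A_precond)$values)   # noticeably more clustered eigenvalues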

The Framework of the Conjugate Gradient Method

Remember that our objective function looks like this:

(Y - X\beta)^T(Y-X\beta) = || Y - X\beta ||^2_2

The conjugate gradient method finds the minimizer \beta by solving A\beta = b for \beta. Now, before jumping into the algorithm, we need to know what A and b are in this case. In order to see that, we have to multiply out the brackets of our objective function.

Note that

(AB)^T = B^TA^T

So, let’s get started with the calculations.

(Y - X\beta)^T(Y-X\beta)

= \beta^TX^TX\beta - \beta^TX^TY - Y^TX\beta + Y^TY

Now we take the derivative with respect to \beta and set it to zero in order to find the minimizer \beta.

We get

-X^TY -X^TY + 2X^TX\beta = 0

X^TX\beta = X^TY

From the equation above, we can see that X^TX = A and X^TY = b which gives us

A\beta = b

Up to a factor of 2, the gradient of the objective function equals the residual of the linear system. So, we have

\nabla f(\beta)=A\beta-b=r

If you want more explanations, this post also provides a great explanation of conjugate gradient descent and linear regression.
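
As a quick sanity check (a short sketch, not part of the original post), we can verify in R that A = X^TX is symmetric positive definite and that the residual, and hence the gradient, vanishes at the solution of the normal equations. It uses the x and Y objects created earlier.

# sanity check: A = X^T X is symmetric positive definite, and the
# residual A beta - b is (numerically) zero at the least-squares solution
A <- t(x) %*% x
b <- t(x) %*% Y

isSymmetric(A)               # should be TRUE
all(eigen(A)$values > 0)     # should be TRUE, i.e. positive definite
beta_ls <- solve(A, b)       # solves the normal equations directly
norm(A %*% beta_ls - b, "2") # should be close to zero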

So the conjugate gradient algorithm that minimizes our objective function looks like the framework below.

Conjugate Gradient Descent Algorithm

Given

\beta_0;

r_0=A\beta_0 - b =X^TX\beta_0-X^TY

p_0=- r_0

k=0

while r_k\neq0

\alpha_k = \frac{r_k^T r_k}{p^T_kX^TXp_k}

x_{k+1} = x_k + \alpha_kp_k

r_{k+1} = r_k + \alpha_kX^TXp_k

\beta_{k + 1} = \frac{r^T_{k+1}r_{k+1}}{r^T_kr_k}

p_{k + 1} = -r_{k + 1} + \beta_{k + 1}p_k

k = k + 1

end (while)

Now that we have defined all the terms and know what the algorithm looks like, we can implement the method in R to get the beta coefficients for our linear regression. Note that in the framework above, x_k denotes the current estimate of the regression coefficients (our \beta vector), while \beta_{k+1} is the scalar used to update the search direction.

# initial guess for beta
beta_coefficients <- list(
  beta_0 = 0,
  beta_1 = 0,
  beta_2 = 0,
  beta_3 = 0,
  beta_4 = 0,
  beta_5 = 0,
  beta_6 = 0
)

beta_vector <- do.call(rbind, beta_coefficients) %>%
  magrittr::set_rownames(colnames(x))

# we derived A from the equations above
A <- t(x) %*% x

# we derived b from the equations above as well
b <- t(x) %*% Y

# residual r
r <- A %*% beta_vector - b

# search direction. In the algorithm, it is called p
search_direction <- -r

# number of iterations
k <- 0

# iterate until the residual norm falls below the threshold
while (norm(r, "2") > 0.01) {

  # step length alpha
  step_length <- (t(r) %*% r) / (t(search_direction) %*% A %*% search_direction)
  step_length <- c(step_length)

  # keep updating our beta coefficients
  beta_vector <- beta_vector + step_length * search_direction

  # residual r
  r_plus_one <- r + step_length * A %*% search_direction

  # calculate beta for search direction
  beta_plus_one <- (t(r_plus_one) %*% r_plus_one) / (t(r) %*% r)
  beta_plus_one <- c(beta_plus_one)

  # update search direction (p)
  search_direction <- -r_plus_one + beta_plus_one * search_direction

  r <- r_plus_one

  # + 1 for iterations
  k <- k + 1
}
# print iterations 
k

## [1] 7

# number of unique eigenvalues
eigen(t(x) %*% x)$values %>%
  unique() %>%
  length() -> unique_eigenvalues

k == unique_eigenvalues

## [1] TRUE

As we said earlier, conjugate gradient descent converges in at most 7 iterations, and 7 is exactly the number of unique eigenvalues of the coefficient matrix X^TX.
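
As an additional check (a small sketch that is not part of the original post), the CG solution should agree with solving the normal equations directly via solve(), up to the stopping tolerance of the while loop:

# optional check: compare the CG solution with the direct solution of A beta = b
beta_direct <- solve(A, b)
max(abs(beta_vector - beta_direct))   # expected to be very small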

Now, there is an even faster approach than the plain conjugate gradient method. The preconditioned conjugate gradient algorithm clusters similar eigenvalues and can therefore converge in even fewer iterations.

Preconditioned Conjugate Gradient Descent

As our preconditioner, we can use a Cholesky factorization. However, many other preconditioners could be used as well. Let's have a look at the framework:

Given \beta_0 and preconditioner M where M=C^TC;

Set

r_0=X^TX\beta_0-X^TY

Solve

My_0 = r_0 for y_0

p_0=- y_0

k=0

while r_k\neq0

\alpha_k = \frac{r_k^T y_k}{p^T_kX^TXp_k}

x_{k+1} = x_k + \alpha_kp_k

r_{k+1} = r_k + \alpha_kX^TXp_k

Solve My_{k+1} = r_{k+1} for y_{k+1}

\beta_{k + 1} = \frac{r^T_{k+1}y_{k+1}}{r^T_ky_k}

p_{k + 1} = -y_{k + 1} + \beta_{k + 1}p_k

k = k + 1

end (while)

# initial guess for beta
beta_coefficients <- list(
  beta_0 = 0,
  beta_1 = 0,
  beta_2 = 0,
  beta_3 = 0,
  beta_4 = 0,
  beta_5 = 0,
  beta_6 = 0
)

beta_vector <- do.call(rbind, beta_coefficients) %>%
  magrittr::set_rownames(colnames(x))

# we derived A from the equations above
A <- t(x) %*% x

# we derived b from the equations above as well
b <- t(x) %*% Y

# residual r
r <- A %*% beta_vector - b

# Cholesky factorization
C <- chol(A)

# solve for z
z <- solve(t(C), r)

# solve for y
y <- solve(C, z)

# search direction. In the algorithm, it is called p
search_direction <- -y

# number of iterations
k <- 0

# iterate until the residual norm falls below the threshold
while (norm(r, "2") > 0.01) {

  # step length alpha
  step_length <- (t(r) %*% y) / (t(search_direction) %*% A %*% search_direction)
  step_length <- c(step_length)

  # keep updating our beta coefficients
  beta_vector <- beta_vector + step_length * search_direction
  
  # residual r
  r_plus_one <- r + step_length * A %*% search_direction
  
  # solve M y_{k+1} = r_{k+1} with two triangular solves,
  # reusing the Cholesky factor C computed before the loop
  z <- solve(t(C), r_plus_one)

  # solve for y
  y_plus_one <- solve(C, z)
  
  # calculate beta for search direction
  beta_plus_one <- (t(r_plus_one) %*% y_plus_one) / (t(r) %*% y)
  beta_plus_one <- c(beta_plus_one)

  # update search direction (p)
  search_direction <- -y_plus_one + beta_plus_one * search_direction

  r <- r_plus_one
  y <- y_plus_one

  # update iterations
  k <- k + 1
}

# print iterations
k

## [1] 1

Wow! We only needed one iteration of the preconditioned conjugate gradient algorithm to find the beta coefficients that minimize the objective function. This makes sense: with the full Cholesky factor, the preconditioner M = C^TC equals A exactly, so the preconditioned system is perfectly conditioned and converges in a single step.
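
A quick sketch (not from the original post) confirms this, using the A and C objects defined above:

# with the full Cholesky factor, the preconditioner M = C^T C recovers A exactly,
# so the symmetrically preconditioned system is (numerically) the identity
M <- t(C) %*% C
all.equal(M, A)                               # should be TRUE

A_hat <- solve(t(C)) %*% A %*% solve(C)       # preconditioned system
range(eigen(A_hat, symmetric = TRUE)$values)  # all eigenvalues are essentially 1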

As a quick validation, let’s use the lm() function in R and make sure that the coefficients are equal.

# prepare data
dat <- as.data.frame(scale(gapminder[, c("pop", "gdpPercap")])) %>%
  cbind(gapminder[, c("lifeExp", "continent")])

# fit regression model
lm_summary <- summary(lm(lifeExp ~ pop + gdpPercap + continent, data = dat))
coefs <- lm_summary$coefficients[, 1] %>%
  base::unname() %>%
  round(5)

beta_vector <- beta_vector %>%
  base::unname() %>%
  c() %>%
  round(5)

# compare gradient descent with base R lm() function
all.equal(beta_vector, coefs)

## [1] TRUE

As we can see from the output above, the beta coefficients are identical. So far, conjugate gradient descent with a preconditioner outperforms both gradient descent with different line search methods and ordinary conjugate gradient descent in terms of the number of iterations.

In the next blog post, we will explore Newton's method and the BFGS method on the same Gapminder data set. Do you think these methods can outperform conjugate gradient descent?

If you are interested in other optimization algorithms then check out my post about the basics of gradient descent.

If you have questions or suggestions, feel free to leave a comment.
