Conjugate Gradient Descent for Linear Regression
July 15, 2019 · By Pascal Schmidt · Machine Learning, R
Optimization techniques are used throughout machine learning to minimize some objective function. In this blog post, we will be using two of them, conjugate gradient descent and preconditioned conjugate gradient descent, to find the beta values that minimize the linear regression objective function.
In the last blog post, we found the beta values by applying gradient descent and different line search methods to the Gapminder data set. This time, we are using this data set again.
What we will be covering:
- Introduction to conjugate gradient descent and its framework.
- Applying conjugate gradient descent to the gapminder data set to find the beta coefficients that minimize the sum of squares.
- Can conjugate gradient descent with a Cholesky preconditioner outperform ordinary conjugate gradient descent?
Let’s go!
Preparing the Data and Design Matrix for Linear Regression
```r
# required libraries
library(tidyverse)
library(ggpubr)
library(gapminder)

# response variable Y (life expectancy)
Y <- matrix(gapminder$lifeExp, nrow = nrow(gapminder), ncol = 1)

# dummy variables for the continents (Africa is the baseline)
gapminder %>%
  tidyr::spread(key = continent, value = continent) %>%
  dplyr::mutate_at(
    .vars = vars(Oceania, Americas, Asia, Europe),
    .funs = funs(ifelse(is.na(.), 0, 1))
  ) %>%
  dplyr::select(Americas, Asia, Europe, Oceania) %>%
  data.matrix() -> gap

# setting up our design matrix x
gapminder %>%
  dplyr::select(pop, gdpPercap) %>%
  data.matrix() %>%
  scale() %>%
  cbind(rep(1, nrow(.)), ., gap) %>%
  magrittr::set_colnames(c("Intercept", colnames(.[, -1]))) -> x

head(x)
##      Intercept        pop  gdpPercap Americas Asia Europe Oceania
## [1,]         1 -0.1994753 -0.6528949        0    1      0       0
## [2,]         1 -0.1917924 -0.6486942        0    1      0       0
## [3,]         1 -0.1821262 -0.6454228        0    1      0       0
## [4,]         1 -0.1701545 -0.6471376        0    1      0       0
## [5,]         1 -0.1556338 -0.6568984        0    1      0       0
## [6,]         1 -0.1386693 -0.6522184        0    1      0       0
```
Now that we are done with the data preparation, we can have a look at the framework of the conjugate gradient descent.
The Conjugate Gradient Method
The conjugate gradient method is among the most useful techniques for solving large linear systems of equations with a positive definite design matrix. The performance of the conjugate gradient method is determined by the distribution of the eigenvalues of the coefficient matrix. In our case, the coefficient matrix is the design matrix that we have constructed above. Namely,
X^TX
The conjugate gradient algorithm converges in at most n iterations, where n is the number of distinct eigenvalues of the coefficient matrix X^TX. Let's have a look at the eigenvalues for the Gapminder data set.
```r
# number of unique eigenvalues
eigen(t(x) %*% x)$values %>%
  unique() %>%
  length() -> unique_eigenvalues
```
We can see that we have 7 eigenvalues in total and all of them are distinct. Hence, conjugate gradient descent will find the optimal beta values in at most 7 iterations. That is far fewer iterations than the gradient descent method we applied in the previous blog post needed.
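The conjugate gradient method also requires A = X^TX to be positive definite. As a quick illustration (this check is not part of the original analysis), we can confirm that all eigenvalues are positive and see how spread out they are:

```r
# illustration only: positive definiteness and eigenvalue spread of t(x) %*% x
eigenvalues <- eigen(t(x) %*% x)$values
min(eigenvalues) > 0                  # TRUE means positive definite
max(eigenvalues) / min(eigenvalues)   # large ratio = widely spread eigenvalues
```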
Later, we will also implement the preconditioned conjugate gradient method, which can decrease the number of iterations even further. A preconditioner clusters the eigenvalues of the coefficient matrix and thereby reduces the number of iterations the method needs to converge.
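To see what clustering the eigenvalues means in practice, we can look at the preconditioned matrix directly. The snippet below is an illustration rather than part of the original analysis: with a Cholesky preconditioner M = C^TC built from A = X^TX itself, the preconditioned matrix C^{-T}AC^{-1} has all of its eigenvalues equal to one, i.e. perfectly clustered.

```r
# illustration only: eigenvalues of the preconditioned matrix are all (numerically) 1
A <- t(x) %*% x
C <- chol(A)            # upper triangular, A = t(C) %*% C
C_inv <- solve(C)
preconditioned_A <- t(C_inv) %*% A %*% C_inv
round(eigen(preconditioned_A)$values, 10)
```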
The Framework of the Conjugate Gradient Method
Remember that our objective function looks like this:
(Y - X\beta)^T(Y-X\beta) = || Y - X\beta ||^2_2
The conjugate gradient method finds the minimizer by solving a linear system A\beta = b. Before jumping into the algorithm, we must know what A and b are in our case. To see that, we have to multiply out the brackets of our objective function.
Note that
(AB)^T = B^TA^T
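If you want to convince yourself of this identity numerically, here is a quick check with two made-up matrices (purely illustrative, not part of the derivation):

```r
# purely illustrative: numerical check of (AB)^T = B^T A^T with random matrices
set.seed(123)
A_demo <- matrix(rnorm(6), nrow = 2)   # a 2 x 3 matrix
B_demo <- matrix(rnorm(12), nrow = 3)  # a 3 x 4 matrix
all.equal(t(A_demo %*% B_demo), t(B_demo) %*% t(A_demo))  # should be TRUE
```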
So, let’s get started with the calculations.
(Y - X\beta)^T(Y-X\beta)
= \beta^TX^TX\beta - \beta^TX^TY - Y^TX\beta + Y^TY
Now we take the derivative with respect to \beta and set it equal to zero in order to find the minimizer.
We get
-X^TY -X^TY + 2X^TX\beta = 0
X^TX\beta = X^TY
From the equation above, we can see that X^TX = A and X^TY = b which gives us
A\beta = b
The gradient of the objective function is, up to a constant factor of 2, equal to the residual of the linear system. So, we have
\nabla f(\beta) = X^TX\beta - X^TY = A\beta - b = r
If you want more explanations, this post also provides a great explanation of conjugate gradient descent and linear regression.
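Before running the conjugate gradient algorithm, it can also help to know what solution we should end up with. The snippet below is not part of the original post; it simply solves the normal equations A\beta = b directly with R's solve() so that we have a reference solution for the iterative methods.

```r
# reference solution (illustration only): solve the normal equations directly
A <- t(x) %*% x
b <- t(x) %*% Y
beta_direct <- solve(A, b)
```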
So the conjugate gradient algorithm that minimizes our objective function looks like the framework below.
Conjugate Gradient Descent Algorithm
Given
x_0 = \beta_0 (our initial guess for the regression coefficients);
r_0 = Ax_0 - b = X^TX\beta_0 - X^TY
p_0=- r_0
k=0
while r_k\neq0
\alpha_k = \frac{r_k^T r_k}{p^T_kX^TXp_k}
x_{k+1} = x_k + \alpha_kp_k
r_{k+1} = r_k + \alpha_kX^TXp_k
\beta_{k + 1} = \frac{r^T_{k+1}r_{k+1}}{r^T_kr_k}
p_{k + 1} = -r_{k + 1} + \beta_{k + 1}p_k
k = k + 1
end (while)
Note that the scalar \beta_{k+1} inside the loop is the usual conjugate gradient update factor for the search direction and should not be confused with the regression coefficients. Now that we have defined all the terms and know what the algorithm looks like, we can implement the method in R to get the beta coefficients for our linear regression.
```r
# initial guess for beta
beta_coefficients <- list(
  beta_0 = 0, beta_1 = 0, beta_2 = 0, beta_3 = 0,
  beta_4 = 0, beta_5 = 0, beta_6 = 0
)

beta_vector <- do.call(rbind, beta_coefficients) %>%
  magrittr::set_rownames(colnames(x))

# we derived A from the equations above
A <- t(x) %*% x

# we derived b from the equations above as well
b <- t(x) %*% Y

# residual r
r <- A %*% beta_vector - b

# search direction. In the algorithm, it is called p
search_direction <- -r

# number of iterations
k <- 0

# keep iterating until the residual norm drops below the threshold
while (norm(r, "2") > 0.01) {

  # step length alpha
  step_length <- (t(r) %*% r) / (t(search_direction) %*% A %*% search_direction)
  step_length <- c(step_length)

  # keep updating our beta coefficients
  beta_vector <- beta_vector + step_length * search_direction

  # residual r
  r_plus_one <- r + step_length * A %*% search_direction

  # calculate beta for the search direction update
  beta_plus_one <- (t(r_plus_one) %*% r_plus_one) / (t(r) %*% r)
  beta_plus_one <- c(beta_plus_one)

  # update search direction (p)
  search_direction <- -r_plus_one + beta_plus_one * search_direction
  r <- r_plus_one

  # + 1 for iterations
  k <- k + 1
}

# print iterations
k
## [1] 7

# number of unique eigenvalues
eigen(t(x) %*% x)$values %>%
  unique() %>%
  length() -> unique_eigenvalues

k == unique_eigenvalues
## [1] TRUE
```
As stated earlier, conjugate gradient descent converges in at most 7 iterations, which is exactly the number of distinct eigenvalues of the coefficient matrix X^TX.
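As a quick sanity check that was not part of the original post, we can compare the coefficients found by conjugate gradient descent with the direct solution of the normal equations:

```r
# sanity check (illustration only): CG solution vs. direct solve of A beta = b
beta_direct <- solve(t(x) %*% x, t(x) %*% Y)
# the difference should be tiny, since the loop stops once the
# residual norm drops below 0.01
max(abs(c(beta_vector) - c(beta_direct)))
```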
Now, there is an even faster method than the conjugate gradient method. The preconditioned conjugate gradient algorithm clusters the eigenvalues and can therefore converge in even fewer iterations.
Preconditioned Conjugate Gradient Descent
As the preconditioner, we can use a Cholesky factorization. However, many other preconditioners could be used as well. Let's have a look at the framework:
Given x_0 = \beta_0 and a preconditioner M where M=C^TC;
Set
r_0=X^TX\beta_0-X^TY
Solve
My_0 = r_0 for y_0
p_0=- y_0
k=0
while r_k\neq0
\alpha_k = \frac{r_k^T y_k}{p^T_kX^TXp_k}
x_{k+1} = x_k + \alpha_kp_k
r_{k+1} = r_k + \alpha_kX^TXp_k
Solve My_{k+1} = r_{k+1} for y_{k+1}
\beta_{k + 1} = \frac{r^T_{k+1}y_{k+1}}{r^T_ky_k}
p_{k + 1} = -y_{k + 1} + \beta_{k + 1}p_k
k = k + 1
end (while)
```r
# initial guess for beta
beta_coefficients <- list(
  beta_0 = 0, beta_1 = 0, beta_2 = 0, beta_3 = 0,
  beta_4 = 0, beta_5 = 0, beta_6 = 0
)

beta_vector <- do.call(rbind, beta_coefficients) %>%
  magrittr::set_rownames(colnames(x))

# we derived A from the equations above
A <- t(x) %*% x

# we derived b from the equations above as well
b <- t(x) %*% Y

# residual r
r <- A %*% beta_vector - b

# Cholesky factorization, A = t(C) %*% C, which gives the preconditioner M = C^TC
C <- chol(A)

# solve M y = r in two triangular solves: first for z, then for y
z <- solve(t(C), r)
y <- solve(C, z)

# search direction. In the algorithm, it is called p
search_direction <- -y

# number of iterations
k <- 0

# keep iterating until the residual norm drops below the threshold
while (norm(r, "2") > 0.01) {

  # step length alpha
  step_length <- (t(r) %*% y) / (t(search_direction) %*% A %*% search_direction)
  step_length <- c(step_length)

  # keep updating our beta coefficients
  beta_vector <- beta_vector + step_length * search_direction

  # residual r
  r_plus_one <- r + step_length * A %*% search_direction

  # solve M y_{k+1} = r_{k+1} for the new preconditioned residual
  z <- solve(t(C), r_plus_one)
  y_plus_one <- solve(C, z)

  # calculate beta for the search direction update
  beta_plus_one <- (t(r_plus_one) %*% y_plus_one) / (t(r) %*% y)
  beta_plus_one <- c(beta_plus_one)

  # update search direction (p)
  search_direction <- -y_plus_one + beta_plus_one * search_direction
  r <- r_plus_one
  y <- y_plus_one

  # update iterations
  k <- k + 1
}

# print iterations
k
## [1] 1
```
Wow! We only needed one iteration with the preconditioned conjugate gradient descent algorithm to find the beta coefficients that minimize the objective function.
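The single iteration is no coincidence. Because we built the preconditioner from the full Cholesky factorization of A, we have M = C^TC = A, so the preconditioning step solves the linear system exactly. A short calculation (added here for clarity, using the notation of the framework above) shows why the very first step already lands on the solution:

y_0 = M^{-1}r_0 = A^{-1}(Ax_0 - b), \quad p_0 = -y_0

\alpha_0 = \frac{r_0^Ty_0}{p_0^TAp_0} = \frac{r_0^TA^{-1}r_0}{r_0^TA^{-1}AA^{-1}r_0} = 1

x_1 = x_0 + \alpha_0p_0 = x_0 - A^{-1}(Ax_0 - b) = A^{-1}b

which is exactly the solution of the normal equations A\beta = b. In practice, one would use a cheaper, approximate preconditioner (for example an incomplete Cholesky factorization), which typically needs more than one iteration but still far fewer than the unpreconditioned method.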
As a quick validation, let's use the lm() function in R and make sure that the coefficients are equal.
```r
# prepare data
dat <- as.data.frame(scale(gapminder[, c("pop", "gdpPercap")])) %>%
  cbind(gapminder[, c("lifeExp", "continent")])

# fit regression model
lm_summary <- summary(lm(lifeExp ~ pop + gdpPercap + continent, data = dat))

coefs <- lm_summary$coefficients[, 1] %>%
  base::unname() %>%
  round(5)

beta_vector <- beta_vector %>%
  base::unname() %>%
  c() %>%
  round(5)

# compare gradient descent with base R lm() function
all.equal(beta_vector, coefs)
## [1] TRUE
```
As we can see from the output above, the beta coefficients are the same. So far, preconditioned conjugate gradient descent outperforms both gradient descent with different line search methods and ordinary conjugate gradient descent.
In the next blog post, we are exploring Newton’s method and the BFGS method with the same Gapminder data set. Do you think these methods can outperform conjugate gradient descent?
If you are interested in other optimization algorithms then check out my post about the basics of gradient descent.
If you have questions or suggestions, feel free to leave a comment.