Newton’s Method and the Quasi-Newton Method (BFGS) for Linear Regression
August 7, 2019 By Pascal Schmidt
In this blog post, we will be going over two more optimization techniques, Newton’s method and the quasi-Newton method BFGS, to find the minimum of the objective function of linear regression.
What we will be covering:
- An introduction to Newton’s method and when to apply this optimization technique.
- Newton’s method for linear regression with a step length of 1 and backtracking line search (Armijo condition).
- Quasi-Newton method (BFGS) for Linear Regression.
In previous blog posts, we have already covered
- An introduction to gradient descent and line search methods.
- Solving linear regression with gradient descent and various line search methods.
- Using conjugate gradient descent and conjugate gradient descent with a preconditioner to find the beta coefficients that are minimizing the sum of squares of linear regression.
Newton’s method and quasi-Newton methods are two more iterative optimization techniques that can find the optimal beta coefficients. However, in comparison to gradient descent, Newton’s method requires us to compute the Hessian of the objective function. Moreover, we have to ensure that this Hessian is positive definite.
If it is not positive definite, then the search direction may not be a descent direction and the algorithm can fail. Hence, we first have to check whether the Hessian for our Gapminder data is positive definite. If it is not, there are techniques we can use to transform an indefinite or negative definite matrix into a sufficiently positive definite matrix.
Remember that a symmetric matrix is positive definite if and only if all its eigenvalues are positive. Hence, one popular Hessian modification technique is an eigenvalue modification.
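To make the idea concrete, here is a minimal sketch of such an eigenvalue modification. The helper name make_positive_definite is made up for illustration and is not used anywhere else in this post; it simply clips every eigenvalue below a small threshold and rebuilds the matrix.
# hypothetical helper: clip eigenvalues below delta and rebuild the matrix
make_positive_definite <- function(A, delta = 1e-8) {
  # eigendecomposition of the (symmetric) matrix
  eig <- eigen(A, symmetric = TRUE)
  # lift any eigenvalue below delta up to delta
  lambda <- pmax(eig$values, delta)
  # reconstruct a sufficiently positive definite matrix
  eig$vectors %*% diag(lambda) %*% t(eig$vectors)
}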
Let’s construct the design matrix and then check whether the Hessian is a positive definite matrix.
Preparing the Data and Design Matrix
We first have to construct the design matrix for the linear regression. From the Gapminder data set, we will be using lifeExp as the response variable and continent, pop, and gdpPercap as predictors. We will be scaling the continuous variables. If you are interested in why we are doing this, check out this post.
# required libraries
library(tidyverse)
library(ggpubr)
library(gapminder)
# response variable Y (life expectancy)
Y <- matrix(gapminder$lifeExp, nrow = nrow(gapminder), ncol = 1)

# setting up the continent dummy variables
gapminder %>%
  tidyr::spread(key = continent, value = continent) %>%
  dplyr::mutate_at(
    .vars = vars(Oceania, Americas, Asia, Europe),
    .funs = funs(ifelse(is.na(.), 0, 1))
  ) %>%
  dplyr::select(Americas, Asia, Europe, Oceania) %>%
  data.matrix() -> gap

# setting up our design matrix
gapminder %>%
  dplyr::select(pop, gdpPercap) %>%
  data.matrix() %>%
  scale() %>%
  cbind(rep(1, nrow(.)), ., gap) %>%
  magrittr::set_colnames(c("Intercept", colnames(.[, -1]))) -> x

head(x)
##      Intercept        pop  gdpPercap Americas Asia Europe Oceania
## [1,]         1 -0.1994753 -0.6528949        0    1      0       0
## [2,]         1 -0.1917924 -0.6486942        0    1      0       0
## [3,]         1 -0.1821262 -0.6454228        0    1      0       0
## [4,]         1 -0.1701545 -0.6471376        0    1      0       0
## [5,]         1 -0.1556338 -0.6568984        0    1      0       0
## [6,]         1 -0.1386693 -0.6522184        0    1      0       0
Now, we will check whether the Hessian matrix is positive definite in order to determine if Newton’s method is appropriate.
hessian <- t(x) %*% x
eigen(hessian)$values > 0
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
That is good news. The Hessian matrix is positive definite and therefore, we can apply Newton’s method without any Hessian modification. Note that the linear regression objective function is convex: its Hessian 2X^TX is always positive semi-definite, and it is positive definite whenever the design matrix X has full column rank. Hence, when we want to solve a linear regression with Newton’s method, we can do so without any Hessian modification as long as the design matrix has full column rank.
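For completeness, here is a quick way to verify the full column rank assumption for our design matrix (this extra check is not part of the original workflow):
# x has full column rank if the rank from its QR decomposition equals the number of columns
qr(x)$rank == ncol(x)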
To make this point clearer, below are the objective function of linear regression, its gradient, and its Hessian.
Let’s have a look at the objective function of a linear regression again:
f(\beta) = (Y - X\beta)^T(Y-X\beta)
When we are multiplying out the terms, we get:
= \beta^TX^TX\beta - \beta^TX^TY - Y^TX\beta + Y^TY
Now we take the first-order derivative (or gradient) of the function with respect to beta:
\nabla f(\beta) = 2X^TX\beta - X^TY - X^TY = 2X^TX\beta - 2X^TY
And now, we are taking the second-order derivative with respect to beta:
\nabla^2 f(\beta) = 2X^TX \succeq 0
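As a quick sanity check that is not part of the original post: setting the gradient to zero gives the normal equations X^TX\beta = X^TY, so the minimizer can also be computed in closed form, and Newton’s method should land on exactly these coefficients. The name beta_closed_form is just an illustrative variable.
# closed-form least-squares solution from the normal equations
beta_closed_form <- solve(t(x) %*% x) %*% t(x) %*% Y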
Newton’s Method for Linear Regression
Newton’s method has a quadratic rate of convergence and therefore converges faster than gradient descent, which only achieves a linear rate of convergence (and merely a sublinear rate on general convex problems). However, the drawback of Newton’s method is that we have to evaluate the Hessian matrix, which gradient descent does not require. Another potential drawback is the Hessian modification needed when the Hessian is not positive definite.
Let’s first have a look at the framework of Newton’s method before jumping into the code.
Given x_0 (our initial guess),
k = 0
while ||\nabla f(x_k)||_2 > 10^{-4} (while the norm of the first derivative or gradient is larger than some threshold)
B_k = \nabla^2 f(x_k) (where B_k is sufficiently positive definite)
p_{k} = -B_k^{-1}\nabla f(x_k)
\alpha= 1
x_{k+1} = x_k + \alpha p_k
k = k + 1
end (while)
From the framework above, we can see that if we replace B_k with the identity matrix, the search direction becomes p_k = -\nabla f(x_k) and Newton’s method reduces to the gradient descent algorithm.
For the step length alpha, we can choose 1, or choose alpha such that it satisfies the Wolfe or Armijo condition. Check out this post to read up on the Wolfe and Armijo condition.
Now that we understand the framework better, let’s jump into the code and find the beta values which are minimizing the objective function.
# initial guess for beta
beta_coefficients <- list(
  beta_0 = 0, beta_1 = 0, beta_2 = 0, beta_3 = 0,
  beta_4 = 0, beta_5 = 0, beta_6 = 0
)

beta_vector <- do.call(rbind, beta_coefficients) %>%
  magrittr::set_rownames(colnames(x))

# gradient evaluated at betas
grad <- -t(x) %*% (Y - x %*% beta_vector)

# hessian matrix
hessian <- t(x) %*% x

# initializing iterations
k <- 0

while (norm(grad, "2") > 10^-4) {

  # gradient at betas
  grad <- -t(x) %*% (Y - x %*% beta_vector)

  # hessian matrix
  hessian <- t(x) %*% x

  # search direction p_k
  search_direction <- solve(-hessian) %*% grad

  # update beta values
  beta_vector <- beta_vector + search_direction

  # update iteration
  k <- k + 1
}

# print number of iterations
k
## [1] 2
Newton’s method performs as well as the conjugate gradient descent algorithm with the preconditioner in terms of iterations.
Let’s have a look at Newton’s method when we choose the step length (alpha), such that it satisfies the Armijo condition (backtracking line search).
Newton’s Method with Backtracking Line Search
The step length alpha has to satisfy the Armijo inequality:
f(x_k + \alpha p_k) \leq f(x_k) + c_1 \alpha \nabla f(x_k)^Tp_k
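For illustration only, a generic backtracking routine in R could look like the sketch below. The helper backtrack is hypothetical and is not used in the code that follows, which performs the same check inline.
# shrink alpha by a factor rho until the Armijo inequality holds
backtrack <- function(f, grad_f, x_k, p_k, alpha = 1, rho = 0.95, c1 = 0.9) {
  while (f(x_k + alpha * p_k) > f(x_k) + c1 * alpha * sum(grad_f(x_k) * p_k)) {
    alpha <- rho * alpha
  }
  alpha
}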
# initial guess for beta
beta_coefficients <- list(
  beta_0 = 0, beta_1 = 0, beta_2 = 0, beta_3 = 0,
  beta_4 = 0, beta_5 = 0, beta_6 = 0
)

beta_vector <- do.call(rbind, beta_coefficients) %>%
  magrittr::set_rownames(colnames(x))

# for Armijo inequality
c <- 0.9
rho <- 0.95

# objective function value at beta
obj_fun <- 0.5 * t(Y - x %*% beta_vector) %*% (Y - x %*% beta_vector)

# gradient evaluated at betas
grad <- -t(x) %*% (Y - x %*% beta_vector)

# hessian matrix
hessian <- t(x) %*% x

# initializing iterations
k <- 0

while (norm(grad, "2") > 10^-4) {

  # reset step length to 1 before entering the backtracking loop
  # (note: on the first pass, search_direction still holds its value from the previous code chunk)
  step_length <- 1

  # left hand side of Armijo inequality
  lhs <- 0.5 * t(Y - x %*% (beta_vector + step_length * search_direction)) %*%
    (Y - x %*% (beta_vector + step_length * search_direction))

  # right hand side of Armijo inequality
  rhs <- obj_fun + c * step_length %*% t(grad) %*% search_direction

  # new calculation of alpha with backtracking method
  while (lhs > rhs) {
    step_length <- rho * step_length

    # update Armijo inequality
    lhs <- 0.5 * t(Y - x %*% (beta_vector + step_length * search_direction)) %*%
      (Y - x %*% (beta_vector + step_length * search_direction))
    rhs <- obj_fun + c * step_length %*% t(grad) %*% search_direction
  }

  # objective function value at beta
  obj_fun <- 0.5 * t(Y - x %*% beta_vector) %*% (Y - x %*% beta_vector)

  # gradient at betas
  grad <- -t(x) %*% (Y - x %*% beta_vector)

  # hessian matrix
  hessian <- t(x) %*% x

  # search direction p_k
  search_direction <- solve(-hessian) %*% grad

  # update beta values
  beta_vector <- beta_vector + step_length * search_direction

  # update iteration
  k <- k + 1
}

# print number of iterations
k
## [1] 29
With backtracking line search, we need 29 iterations until convergence, which is a lot more than the 2 iterations we needed when we set the step length alpha to 1.
Let’s check whether Newton’s method actually calculated the beta values correctly by comparing them to the beta values from the lm() function.
# prepare data
dat <- as.data.frame(scale(gapminder[, c("pop", "gdpPercap")])) %>%
  cbind(gapminder[, c("lifeExp", "continent")])

# fit regression model
lm_summary <- summary(lm(lifeExp ~ pop + gdpPercap + continent, data = dat))

coefs <- lm_summary$coefficients[, 1] %>%
  base::unname() %>%
  round(5)

beta_vector <- beta_vector %>%
  base::unname() %>%
  c() %>%
  round(5)

# compare Newton's method with the base R lm() function
all.equal(beta_vector, coefs)
## [1] TRUE
The beta values are equal to the ones the lm() function gives us.
Quasi-Newton Method (BFGS) for Linear Regression
The BFGS method converges superlinearly. Because the objective function is convex, we can use a backtracking line search to find the step length alpha. We could also choose alpha to be 1 again. However, when the objective function is not convex, backtracking line search should not be used to determine the step length. Instead, the Wolfe conditions should be used to ensure convergence.
Again, let’s first have a look at the framework before we jump into the code.
Given starting point x_0, convergence tolerance \epsilon > 0,
inverse Hessian approximation H_0
k = 0
while ||\nabla f_k|| > \epsilon
Compute search direction
p_k = -H_k \nabla f_k
Set x_{k+1} = x_k + \alpha_k p_k
where \alpha_k is computed from a line search procedure to satisfy the Wolfe conditions
Define s_k = x_{k+1} - x_k and y_k = \nabla f_{k+1} - \nabla f_k
Compute H_{k+1} = (I - \rho_k s_k y_{k}^T) H_k (I - \rho_k y_k s_{k}^T) + \rho_k s_k s^T_k, where \rho_k = 1/(y_k^T s_k)
k = k + 1
end (while)
Let’s have a look at the code.
# initial guess for beta
beta_coefficients <- list(
  beta_0 = 0, beta_1 = 0, beta_2 = 0, beta_3 = 0,
  beta_4 = 0, beta_5 = 0, beta_6 = 0
)

beta_vector <- do.call(rbind, beta_coefficients) %>%
  magrittr::set_rownames(colnames(x))

# gradient of objective function at beta
grad <- -t(x) %*% (Y - x %*% beta_vector)

# objective function at beta
obj_fun <- 0.5 * t(Y - x %*% beta_vector) %*% (Y - x %*% beta_vector)

# dimension of square matrix
n <- dim(t(x) %*% x)[1]

# identity matrix as initial inverse Hessian approximation
H <- diag(n)

# initial search direction
search_direction <- -H %*% grad

# parameters for Armijo condition (backtracking line search)
rho_backtracking <- 0.95
c <- 0.9

# initialize iterations
k <- 0

while (norm(grad, "2") > 10^-4) {

  # search direction p_k
  search_direction <- -H %*% grad

  # reset step length to 1 before entering the backtracking loop
  step_length <- 1

  # left hand side of Armijo inequality
  lhs <- 0.5 * t(Y - x %*% (beta_vector + step_length * search_direction)) %*%
    (Y - x %*% (beta_vector + step_length * search_direction))

  # right hand side of Armijo inequality
  rhs <- obj_fun + c * step_length %*% t(grad) %*% search_direction

  # new calculation of alpha with backtracking method
  while (lhs > rhs) {
    step_length <- rho_backtracking * step_length

    # update Armijo inequality
    lhs <- 0.5 * t(Y - x %*% (beta_vector + step_length * search_direction)) %*%
      (Y - x %*% (beta_vector + step_length * search_direction))
    rhs <- obj_fun + c * step_length %*% t(grad) %*% search_direction
  }

  # update beta_vector
  beta_vector_plus_one <- beta_vector + step_length * search_direction

  # update gradient
  grad_plus_one <- -t(x) %*% (Y - x %*% beta_vector_plus_one)

  # BFGS update of the inverse Hessian approximation
  s <- beta_vector_plus_one - beta_vector
  y <- grad_plus_one - grad
  rho <- 1 / (t(y) %*% s) %>% c()
  H <- (diag(n) - rho * s %*% t(y)) %*% H %*% (diag(n) - rho * y %*% t(s)) + rho * s %*% t(s)

  beta_vector <- beta_vector_plus_one
  grad <- grad_plus_one
  k <- k + 1
}

# print number of iterations
k
## [1] 14
The BFGS method needed 14 iterations to converge.
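As an additional cross-check that is not part of the original post, R’s built-in optim() function also implements BFGS and should arrive at essentially the same coefficients. The names obj, grad_fn, and fit below are just illustrative; obj and grad_fn are the objective and its gradient written as plain functions of the coefficient vector.
# objective and gradient as functions of the coefficient vector
obj <- function(beta) 0.5 * sum((Y - x %*% beta)^2)
grad_fn <- function(beta) as.vector(-t(x) %*% (Y - x %*% beta))

# BFGS via base R's optim(); fit$par should match beta_vector up to rounding
fit <- optim(par = rep(0, ncol(x)), fn = obj, gr = grad_fn, method = "BFGS")
round(fit$par, 5)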
If you want some more information about Newton’s method and BFGS, then check out these slides.
If you have any suggestions or questions, feel free to leave a comment below.
Comments (3)
Hi,
Is it possible to do all this using python?
Hi,
Yes for sure! Just translate all the code into python.
Hey, great stuff! Note that one iteration of Newton’s algorithm is identical to ordinary least squares (the terms involving the initial value cancel out and only leave the OLS solution).