Newton’s Method and Quasi-Newton’s Method (BFGS) for Linear Regression

In this blog post, we will go over two more optimization techniques, Newton’s method and the quasi-Newton method (BFGS), to find the minimum of the objective function of linear regression.

What we will be covering:

  • An introduction to Newton’s method and when to apply this optimization technique.
  • Newton’s method for linear regression with a step length of 1 and backtracking line search (Armijo condition).
  • Quasi-Newton method (BFGS) for Linear Regression.

Newton's method optimization

In previous blog posts, we have already covered gradient descent and conjugate gradient descent.

Newton’s method and the quasi-Newton method are two more iterative optimization techniques that can find the optimal beta coefficients. However, in comparison to gradient descent, Newton’s method requires us to compute the Hessian of the objective function. Moreover, we have to ensure that this Hessian is positive definite.

If it is not positive definite, the search direction may not be a descent direction and the algorithm can fail. Hence, we first have to check whether the Hessian for our Gapminder regression is positive definite. If it is not, there are techniques we can use to transform an indefinite or negative definite matrix into a sufficiently positive definite one.

Remember that a symmetric matrix is positive definite if all its eigenvalues are positive. Hence, one popular Hessian modification technique is an eigenvalue modification.
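
To make this idea concrete, here is a minimal sketch of such an eigenvalue modification (the function name eigenvalue_modification and the choice of delta are ours): decompose the matrix, lift any non-positive eigenvalues up to a small positive value delta, and put the matrix back together.

# minimal sketch of an eigenvalue modification
eigenvalue_modification <- function(A, delta = 1e-8) {
  eig <- eigen(A, symmetric = TRUE)
  lambda <- pmax(eig$values, delta)  # lift eigenvalues below delta up to delta
  eig$vectors %*% diag(lambda) %*% t(eig$vectors)
}

# example: an indefinite 2 x 2 matrix (eigenvalues 5 and -1) becomes positive definite
A <- matrix(c(2, 3, 3, 2), nrow = 2)
eigen(eigenvalue_modification(A))$values > 0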

Let’s construct the design matrix and then check whether the Hessian is positive definite.

Preparing the Data and Design Matrix

We first have to construct the design matrix for the linear regression. From the Gapminder data set, we will be using lifeExp as the response variable and continent, pop, and gdpPercap as predictors. We will be scaling the continuous variables. If you are interested in why we are doing this, check out this post.

# required libraries

library(tidyverse)
library(ggpubr)
library(gapminder)
library(magrittr) # for set_rownames() and set_colnames()
# response variable Y (life expectancy)
Y <- matrix(gapminder$lifeExp, nrow = nrow(gapminder), ncol = 1)

# setting up our design matrix x
gapminder %>%
  tidyr::spread(key = continent, value = continent) %>%
  dplyr::mutate_at(
    .vars = vars(Oceania, Americas, Asia, Europe),
    .funs = ~ ifelse(is.na(.), 0, 1)
  ) %>%
  dplyr::select(Americas, Asia, Europe, Oceania) %>%
  data.matrix() -> gap

# setting up our design matrix
gapminder %>%
  dplyr::select(pop, gdpPercap) %>%
  data.matrix() %>%
  scale() %>%
  cbind(rep(1, nrow(.)), ., gap) %>%
  magrittr::set_colnames(c("Intercept", colnames(.[, -1]))) -> x

head(x)

##      Intercept        pop  gdpPercap Americas Asia Europe Oceania
## [1,]         1 -0.1994753 -0.6528949        0    1      0       0
## [2,]         1 -0.1917924 -0.6486942        0    1      0       0
## [3,]         1 -0.1821262 -0.6454228        0    1      0       0
## [4,]         1 -0.1701545 -0.6471376        0    1      0       0
## [5,]         1 -0.1556338 -0.6568984        0    1      0       0
## [6,]         1 -0.1386693 -0.6522184        0    1      0       0

Now, we will check whether the Hessian matrix is positive definite in order to determine whether Newton’s method is appropriate.

hessian <- t(x) %*% x
eigen(hessian)$values > 0

## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE

That is good news. The Hessian matrix is positive definite and, therefore, we can apply Newton’s method without any Hessian modification. Note that the linear regression objective function is convex, so its Hessian 2X^TX is always at least positive semidefinite, and it is positive definite whenever the design matrix X has full column rank. Hence, when we want to solve a linear regression with Newton’s method, we can usually skip the Hessian check, as long as we know that our design matrix has full column rank.
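
As a quick sanity check, we can confirm that our design matrix has full column rank with base R's QR decomposition (one line; strictly speaking, the check is not needed here):

# TRUE when the design matrix has full column rank
qr(x)$rank == ncol(x)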

To make this point clearer, below are the objective function of linear regression, its gradient, and its Hessian.

Let’s have a look at the objective function of a linear regression again:

f(\beta) = (Y - X\beta)^T(Y-X\beta)

When we are multiplying out the terms, we get:

= \beta^TX^TX\beta - \beta^TX^TY - Y^TX\beta + Y^TY

Now we take the first-order derivative (or gradient) of the function with respect to beta:

\nabla f(\beta) = 2X^TX\beta - 2X^TY

And now, we are taking the second-order derivative with respect to beta:

\nabla^2 f(\beta) = 2X^TX \succeq 0
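
Setting this gradient to zero gives the normal equations X^TX\beta = X^TY. As an aside (the variable name beta_closed_form is ours), we could solve them directly to obtain the least squares solution that the iterative methods below should converge to:

# closed-form least squares solution from the normal equations
beta_closed_form <- solve(t(x) %*% x, t(x) %*% Y)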

Newton’s Method for Linear Regression

Newton’s method has a quadratic rate of convergence and therefore converges faster than gradient descent, which converges at best linearly. However, the drawback of Newton’s method is that we have to evaluate (and invert) the Hessian matrix, which we do not have to do for gradient descent. In addition, a Hessian modification can be an extra cost whenever the Hessian is not positive definite.

Let’s first have a look at the framework of Newton’s method before jumping into the code.

Given x_0 (our initial guess),

k = 0

while ||\nabla f(x_k)||_2 > 10^{-4} (while the norm of the gradient is larger than some threshold)

B_k = \nabla^2 f(x_k) (where B_k is sufficiently positive definite)

p_{k} = -B_k^{-1}\nabla f(x_k)

\alpha= 1

x_{k+1} = x_k + \alpha p_k

k = k + 1

end (while)

From the framework above, we can see that if the Hessian matrix is replaced by the identity matrix, Newton’s method reduces to the gradient descent algorithm.

For the step length alpha, we can choose 1, or choose alpha such that it satisfies the Wolfe or Armijo conditions. Check out this post to read up on the Wolfe and Armijo conditions.

Now that we understand the framework better, let’s jump into the code and find the beta values that minimize the objective function. Note that the code below works with the scaled objective \frac{1}{2}(Y - X\beta)^T(Y-X\beta), so the gradient becomes X^TX\beta - X^TY and the Hessian becomes X^TX; the factor of \frac{1}{2} does not change the minimizing beta values.

# initial guess for beta
beta_coefficients <- list(
  beta_0 = 0,
  beta_1 = 0,
  beta_2 = 0,
  beta_3 = 0,
  beta_4 = 0,
  beta_5 = 0,
  beta_6 = 0
)

beta_vector <- do.call(rbind, beta_coefficients) %>%
  set_rownames(colnames(x))

# gradient evaluated at betas
grad <- -t(x) %*% (Y - x %*% beta_vector)

# hessian matrix
hessian <- t(x) %*% x

# initializing iterations
k <- 0

while (norm(grad, "2") > 10^-4) {

  # gradient at betas
  grad <- -t(x) %*% (Y - x %*% beta_vector)
  
  # hessian matrix
  hessian <- t(x) %*% x

  # search direction p_k
  search_direction <- solve(-hessian) %*% grad
  
  # update beta values
  beta_vector <- beta_vector + search_direction
  
  # update iteration
  k <- k + 1
}

# print number of iterations
k

## [1] 2

Newton’s method performs as well as the conjugate gradient descent algorithm with the preconditioner in terms of iterations. This is expected: for a quadratic objective, a single full Newton step lands exactly on the minimizer, and the second iteration only confirms that the gradient norm is below our tolerance.

Let’s have a look at Newton’s method when we choose the step length (alpha), such that it satisfies the Armijo condition (backtracking line search).

Newton’s Method with Backtracking Line Search

The step length alpha has to satisfy the Armijo inequality:

f(x_k + \alpha p_k) \leq f(x_k) + c_1 \alpha \nabla f(x_k)^Tp_k
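
To illustrate the procedure (the helper name backtracking and its argument names are ours; the code below implements the same logic inline), backtracking simply shrinks alpha until the inequality above holds:

# minimal sketch of a backtracking line search
backtracking <- function(f, grad_fk, x_k, p_k, alpha = 1, rho = 0.95, c1 = 0.9) {
  # shrink alpha until the Armijo inequality is satisfied
  while (f(x_k + alpha * p_k) > f(x_k) + c1 * alpha * c(t(grad_fk) %*% p_k)) {
    alpha <- rho * alpha
  }
  alpha
}

# example objective for our regression: f <- function(b) 0.5 * sum((Y - x %*% b)^2)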

# initial guess for beta
beta_coefficients <- list(
  beta_0 = 0,
  beta_1 = 0,
  beta_2 = 0,
  beta_3 = 0,
  beta_4 = 0,
  beta_5 = 0,
  beta_6 = 0
)

beta_vector <- do.call(rbind, beta_coefficients) %>%
  set_rownames(colnames(x))

# for Armijo inequality
c <- 0.9
rho <- 0.95

# objective function value at beta
obj_fun <- 0.5 * t((Y - x %*% beta_vector)) %*% (Y - x %*% beta_vector)

# gradient evaluated at betas
grad <- -t(x) %*% (Y - x %*% beta_vector)

# hessian matrix
hessian <- t(x) %*% x

# initializing iterations
k <- 0

while (norm(grad, "2") > 10^-4) {
  
  # objective function value at the current beta
  obj_fun <- 0.5 * t((Y - x %*% beta_vector)) %*% (Y - x %*% beta_vector)
  
  # gradient at the current beta
  grad <- -t(x) %*% (Y - x %*% beta_vector)
  
  # hessian matrix
  hessian <- t(x) %*% x
  
  # search direction p_k
  search_direction <- solve(-hessian) %*% grad
  
  # reset step length to 1 before entering the backtracking loop
  step_length <- 1
  
  # left hand side of Armijo inequality
  lhs <- 0.5 * t((Y - x %*% (beta_vector + step_length * search_direction))) %*% (Y - x %*% (beta_vector + step_length * search_direction))

  # right hand side of Armijo inequality
  rhs <- obj_fun + c * step_length * t(grad) %*% search_direction
  
  # new calculation of alpha with backtracking method
  while (lhs > rhs) {
    
    step_length <- rho * step_length
    
    # update Armijo inequality
    lhs <- 0.5 * t((Y - x %*% (beta_vector + step_length * search_direction))) %*% (Y - x %*% (beta_vector + step_length * search_direction))
    rhs <- obj_fun + c * step_length * t(grad) %*% search_direction
  }
  
  # update beta values
  beta_vector <- beta_vector + step_length * search_direction
  
  # update iteration
  k <- k + 1
}

# print number of iterations
k

## [1] 29

When using backtracking line search, we need 29 iterations until convergence, which is a lot more than the 2 iterations we needed when we set the step length alpha to 1.

Let’s check whether Newton’s method actually calculated the beta values correctly by comparing them to the beta values from the lm() function.

# prepare data
dat <- as.data.frame(scale(gapminder[, c("pop", "gdpPercap")])) %>%
  cbind(gapminder[, c("lifeExp", "continent")])

# fit regression model
lm_summary <- summary(lm(lifeExp ~ pop + gdpPercap + continent, data = dat))
coefs <- lm_summary$coefficients[, 1] %>%
  base::unname() %>%
  round(5)

beta_vector <- beta_vector %>%
  base::unname() %>%
  c() %>%
  round(5)

# compare Newton's method with base R lm() function
all.equal(beta_vector, coefs)

## [1] TRUE

The beta values are equal to the ones the lm() function gives us.

Quasi-Newton Method (BFGS) for Linear Regression

The BFGS method converges superlinearly. Because our objective function is strongly convex, the curvature condition y_k^Ts_k > 0 holds automatically, so we can use a simple backtracking line search (Armijo condition) to find the step length alpha, or we could again choose alpha to be 1. However, when the objective function is not convex, backtracking alone should not be used to determine the step length. A line search that satisfies the Wolfe conditions should be used instead, because the curvature condition keeps the inverse Hessian approximation positive definite and ensures convergence.

Again, let’s first have a look at the framework before we jump into the code.

Given starting point x_0, convergence tolerance \epsilon > 0, and

inverse Hessian approximation H_0

k = 0

while ||\nabla f_k|| > \epsilon

Compute search direction

p_k = -H_k \nabla f_k

Set x_{k+1} = x_k + \alpha_k p_k

where \alpha_k is computed from a line search procedure to satisfy the Wolfe conditions (in the code below, we only enforce the Armijo condition via backtracking)

Define s_k = x_{k+1} - x_k \: and \: y_k= \nabla f_{k+1} - \nabla f_k

Compute H_{k+1} = (I - \rho_k s_k y_{k}^T) H_k (I - \rho_k y_k s_{k}^T) + \rho_k s_k s^T_k, where \rho_k = \frac{1}{y_k^Ts_k}

k = k + 1

end (while)
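
To make the inverse Hessian update concrete, here is a minimal helper (the function name bfgs_update is ours; the code below performs the same update inline):

# one BFGS update of the inverse Hessian approximation H,
# given the step s = x_{k+1} - x_k and the gradient difference y
bfgs_update <- function(H, s, y) {
  rho <- c(1 / (t(y) %*% s))  # curvature scaling rho_k
  I_n <- diag(nrow(H))
  (I_n - rho * s %*% t(y)) %*% H %*% (I_n - rho * y %*% t(s)) + rho * s %*% t(s)
}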

Let’s have a look at the code.

# initial guess for beta
beta_coefficients <- list(
  beta_0 = 0,
  beta_1 = 0,
  beta_2 = 0,
  beta_3 = 0,
  beta_4 = 0,
  beta_5 = 0,
  beta_6 = 0
)

beta_vector <- do.call(rbind, beta_coefficients) %>%
  set_rownames(colnames(x))

# gradient of objective function at beta
grad <- -t(x) %*% (Y - x %*% beta_vector)

# objective function at beta
obj_fun <- 0.5 * t((Y - x %*% beta_vector)) %*% (Y - x %*% beta_vector)

# dimension of square matrix
n <- dim(t(x) %*% x)[1]

# identity matrix
H <- diag(n)

# search direction
search_direction <- -H %*% grad

# parameters for Armijo condition (backtracking line search)
rho_backtracking <- 0.95
c <- 0.9

# initialize iterations
k <- 0

while (norm(grad, "2") > 10^-4) {
  
  search_direction <- -H %*% grad

  # reset step length to 1 before entering loop
  step_length <- 1

  # left hand side of Armijo inequality
  lhs <- 0.5 * t((Y - x %*% (beta_vector + step_length * search_direction))) %*% (Y - x %*% (beta_vector + step_length * search_direction))

  # right hand side of Armijo inequality
  rhs <- obj_fun + c * step_length %*% t(grad) %*% search_direction

  # new calculation of alpha with backtracking method
  while (lhs > rhs) {
    
    step_length <- rho_backtracking * step_length

    # update Armijo inequality
    lhs <- 0.5 * t((Y - x %*% (beta_vector + step_length * search_direction))) %*% (Y - x %*% (beta_vector + step_length * search_direction))
    rhs <- obj_fun + c * step_length %*% t(grad) %*% search_direction
    
  }
  
  # update beta_vector
  beta_vector_plus_one <- beta_vector + step_length * search_direction
  
  # update gradient
  grad_plus_one <- -t(x) %*% (Y - x %*% beta_vector_plus_one)

  # BFGS quantities: step s_k and gradient difference y_k
  s <- beta_vector_plus_one - beta_vector
  y <- grad_plus_one - grad
  
  # curvature scaling rho_k = 1 / (y_k^T s_k)
  rho <- c(1 / (t(y) %*% s))
  
  # BFGS update of the inverse Hessian approximation
  H <- (diag(n) - rho * s %*% t(y)) %*% H %*% (diag(n) - rho * y %*% t(s)) + rho * s %*% t(s)
  
  beta_vector <- beta_vector_plus_one
  grad <- grad_plus_one
  k <- k + 1
  
}

# print number of iterations
k

## [1] 14

The BFGS method needed 14 iterations to converge.
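
As a quick sanity check, we can compare the BFGS solution with the closed-form least squares solution from the normal equations; up to the convergence tolerance of the while loop, the two should agree.

# compare the BFGS solution with the closed-form least squares solution
all.equal(
  c(beta_vector),
  c(solve(t(x) %*% x, t(x) %*% Y)),
  tolerance = 1e-4
)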

If you want some more information about Newton’s method and BFGS, then check out these slides.

If you have any suggestions or questions, feel free to leave a comment below.
