Stanford EE364A Convex Optimization I Stephen Boyd I 2023 I Lecture 14

Stanford EE364A Convex Optimization I Stephen Boyd I 2023 I Lecture 14

Gradient Descent

  • Gradient descent works poorly when the level sets are poorly conditioned (anisotropic) and works well when the level sets are well-conditioned (isotropic or spherical).
  • The efficiency or speed of gradient descent depends on how aspherical or anisotropic the sublevel sets are.
  • Steepest descent direction depends on the norm used.
  • Steepest descent in the Euclidean norm is precisely gradient descent.
  • The choice of the metric or norm for steepest descent makes a big difference in convergence.
  • It is desirable to use a norm that is aligned with the sublevel sets of the function.

Newton's Method

  • The Newton step involves using the Hessian (second derivative) of the function at the current point as a metric to determine the direction of steepest descent.
  • The Newton step can be thought of as a variable metric method, as the metric is changed at each step.
  • Another interpretation of the Newton step is to minimize a second-order approximation of the function.
  • The Newton step can also be derived by linearizing the optimality condition (gradient equals zero) and solving for the displacement.
  • In one dimension, the Newton step involves finding the zero-crossing of a linear approximation of the derivative.
  • Repeated application of the Newton step leads to rapid convergence to the minimum of a smooth convex function.
  • Newton's method uses the Newton step and the gradient to minimize a function.
  • The Newton decrement is the predicted decrease in the function when taking a Newton step.
  • The metric defined by the Hessian is a quadratic metric that is invariant under affine transformations.
  • Newton's method is independent of linear changes of coordinates.
  • Newton's method works exceptionally well for quadratic functions because it solves the entire problem in one step.
  • The convergence of Newton's method depends on the third derivative of the function.
  • Newton's method works exceptionally well when the third derivative of a function is small.
  • A small third derivative indicates that the function is close to quadratic.
  • The Lipschitz constant (L) is used to express the smallness of the third derivative.
  • If the norm of the gradient is larger than a certain number, Newton's method guarantees a fixed decrease in the function.
  • Once the norm of the gradient becomes smaller than that number, the convergence accelerates and becomes quadratic.
  • The damped Newton phase is followed by the quadratically convergent phase.
  • The maximum number of steps in the damped Newton step is determined by the function value minus the optimal value divided by gamma.
  • The number of iterations until reaching a certain level of suboptimality is proportional to the logarithm of the logarithm of the desired accuracy.
  • The classical analysis of Newton's method for optimization involves complex hypotheses and parameters that are often unknown in practical problems, making the convergence results useless.
  • The author criticizes the emphasis on theoretical convergence analysis rather than practical performance and suggests that many Western optimization convergence results are conceptually flawed.
  • Quasi-Newton methods and other variations of Newton's method can be used to approximate the Hessian matrix and make the method more practical.
  • The author argues that the convergence analysis of Newton's method is aesthetically embarrassing because it is not scale-invariant, while the actual code is.
  • The author encourages a focus on practical performance and simplicity in optimization methods, rather than overly complex theoretical analysis.

Practical Considerations

  • Newton's method is a powerful optimization algorithm that converges quadratically, meaning it rapidly approaches the optimal solution with each iteration.
  • It is particularly effective for smooth, unconstrained optimization problems.
  • Newton's method involves calculating the gradient and Hessian (matrix of second derivatives) of the objective function at each iteration to determine the search direction and step size.
  • Despite its efficiency, Newton's method can be computationally expensive, especially for large-scale problems with many variables.
  • Quasi-Newton methods are alternatives that approximate the Hessian and are often used in practice.
  • Newton's method is widely used in various fields, including machine learning, signal processing, and optimal control problems, where it can achieve high accuracy with relatively few iterations.

Advanced Analysis

  • The classical analysis of Newton's method has a less sophisticated analysis compared to the actual algorithm.
  • The issue with the classical analysis is that it says the third derivative is small in a way that is not affine independent.
  • Nesterov and Nemirovski introduced the concept of self-concordance, which is an affine invariant way of saying the third derivative is small.
  • Many practical functions are self-concordant, and Newton's method based on self-concordance has a simpler analysis.
  • Nesterov and Nemirovski's analysis of Newton's method based on self-concordance gives a complexity bound that is much better than the classical analysis.

Implementation Details

  • Newton's method reduces the solution of minimizing a smooth convex function to minimizing a sequence of quadratic functions.
  • Each iteration of Newton's method involves solving a linear system of equations.
  • If the Hessian matrix is structured, the linear system can be solved efficiently in linear time.
  • A variant of Newton's method called the "lazy Newton" method updates the Hessian matrix every 20 steps, which can save computation time.

Overwhelmed by Endless Content?