Stanford EE364A Convex Optimization I Stephen Boyd I 2023 I Lecture 14
Gradient Descent
- Gradient descent works poorly when the level sets are poorly conditioned (anisotropic) and works well when the level sets are well-conditioned (isotropic or spherical).
- The efficiency or speed of gradient descent depends on how aspherical or anisotropic the sublevel sets are.
- Steepest descent direction depends on the norm used.
- Steepest descent in the Euclidean norm is precisely gradient descent.
- The choice of the metric or norm for steepest descent makes a big difference in convergence.
- It is desirable to use a norm that is aligned with the sublevel sets of the function.
Newton's Method
- The Newton step involves using the Hessian (second derivative) of the function at the current point as a metric to determine the direction of steepest descent.
- The Newton step can be thought of as a variable metric method, as the metric is changed at each step.
- Another interpretation of the Newton step is to minimize a second-order approximation of the function.
- The Newton step can also be derived by linearizing the optimality condition (gradient equals zero) and solving for the displacement.
- In one dimension, the Newton step involves finding the zero-crossing of a linear approximation of the derivative.
- Repeated application of the Newton step leads to rapid convergence to the minimum of a smooth convex function.
- Newton's method uses the Newton step and the gradient to minimize a function.
- The Newton decrement is the predicted decrease in the function when taking a Newton step.
- The metric defined by the Hessian is a quadratic metric that is invariant under affine transformations.
- Newton's method is independent of linear changes of coordinates.
- Newton's method works exceptionally well for quadratic functions because it solves the entire problem in one step.
- The convergence of Newton's method depends on the third derivative of the function.
- Newton's method works exceptionally well when the third derivative of a function is small.
- A small third derivative indicates that the function is close to quadratic.
- The Lipschitz constant (L) is used to express the smallness of the third derivative.
- If the norm of the gradient is larger than a certain number, Newton's method guarantees a fixed decrease in the function.
- Once the norm of the gradient becomes smaller than that number, the convergence accelerates and becomes quadratic.
- The damped Newton phase is followed by the quadratically convergent phase.
- The maximum number of steps in the damped Newton step is determined by the function value minus the optimal value divided by gamma.
- The number of iterations until reaching a certain level of suboptimality is proportional to the logarithm of the logarithm of the desired accuracy.
- The classical analysis of Newton's method for optimization involves complex hypotheses and parameters that are often unknown in practical problems, making the convergence results useless.
- The author criticizes the emphasis on theoretical convergence analysis rather than practical performance and suggests that many Western optimization convergence results are conceptually flawed.
- Quasi-Newton methods and other variations of Newton's method can be used to approximate the Hessian matrix and make the method more practical.
- The author argues that the convergence analysis of Newton's method is aesthetically embarrassing because it is not scale-invariant, while the actual code is.
- The author encourages a focus on practical performance and simplicity in optimization methods, rather than overly complex theoretical analysis.
Practical Considerations
- Newton's method is a powerful optimization algorithm that converges quadratically, meaning it rapidly approaches the optimal solution with each iteration.
- It is particularly effective for smooth, unconstrained optimization problems.
- Newton's method involves calculating the gradient and Hessian (matrix of second derivatives) of the objective function at each iteration to determine the search direction and step size.
- Despite its efficiency, Newton's method can be computationally expensive, especially for large-scale problems with many variables.
- Quasi-Newton methods are alternatives that approximate the Hessian and are often used in practice.
- Newton's method is widely used in various fields, including machine learning, signal processing, and optimal control problems, where it can achieve high accuracy with relatively few iterations.
Advanced Analysis
- The classical analysis of Newton's method has a less sophisticated analysis compared to the actual algorithm.
- The issue with the classical analysis is that it says the third derivative is small in a way that is not affine independent.
- Nesterov and Nemirovski introduced the concept of self-concordance, which is an affine invariant way of saying the third derivative is small.
- Many practical functions are self-concordant, and Newton's method based on self-concordance has a simpler analysis.
- Nesterov and Nemirovski's analysis of Newton's method based on self-concordance gives a complexity bound that is much better than the classical analysis.
Implementation Details
- Newton's method reduces the solution of minimizing a smooth convex function to minimizing a sequence of quadratic functions.
- Each iteration of Newton's method involves solving a linear system of equations.
- If the Hessian matrix is structured, the linear system can be solved efficiently in linear time.
- A variant of Newton's method called the "lazy Newton" method updates the Hessian matrix every 20 steps, which can save computation time.