# Stanford EE364A Convex Optimization I Stephen Boyd I 2023 I Lecture 14

()

• Gradient descent works poorly when the level sets are poorly conditioned (anisotropic) and works well when the level sets are well-conditioned (isotropic or spherical).
• The efficiency or speed of gradient descent depends on how aspherical or anisotropic the sublevel sets are.
• Steepest descent direction depends on the norm used.
• Steepest descent in the Euclidean norm is precisely gradient descent.
• The choice of the metric or norm for steepest descent makes a big difference in convergence.
• It is desirable to use a norm that is aligned with the sublevel sets of the function.

## Newton's Method

• The Newton step involves using the Hessian (second derivative) of the function at the current point as a metric to determine the direction of steepest descent.
• The Newton step can be thought of as a variable metric method, as the metric is changed at each step.
• Another interpretation of the Newton step is to minimize a second-order approximation of the function.
• The Newton step can also be derived by linearizing the optimality condition (gradient equals zero) and solving for the displacement.
• In one dimension, the Newton step involves finding the zero-crossing of a linear approximation of the derivative.
• Repeated application of the Newton step leads to rapid convergence to the minimum of a smooth convex function.
• Newton's method uses the Newton step and the gradient to minimize a function.
• The Newton decrement is the predicted decrease in the function when taking a Newton step.
• The metric defined by the Hessian is a quadratic metric that is invariant under affine transformations.
• Newton's method is independent of linear changes of coordinates.
• Newton's method works exceptionally well for quadratic functions because it solves the entire problem in one step.
• The convergence of Newton's method depends on the third derivative of the function.
• Newton's method works exceptionally well when the third derivative of a function is small.
• A small third derivative indicates that the function is close to quadratic.
• The Lipschitz constant (L) is used to express the smallness of the third derivative.
• If the norm of the gradient is larger than a certain number, Newton's method guarantees a fixed decrease in the function.
• Once the norm of the gradient becomes smaller than that number, the convergence accelerates and becomes quadratic.
• The damped Newton phase is followed by the quadratically convergent phase.
• The maximum number of steps in the damped Newton step is determined by the function value minus the optimal value divided by gamma.
• The number of iterations until reaching a certain level of suboptimality is proportional to the logarithm of the logarithm of the desired accuracy.
• The classical analysis of Newton's method for optimization involves complex hypotheses and parameters that are often unknown in practical problems, making the convergence results useless.
• The author criticizes the emphasis on theoretical convergence analysis rather than practical performance and suggests that many Western optimization convergence results are conceptually flawed.
• Quasi-Newton methods and other variations of Newton's method can be used to approximate the Hessian matrix and make the method more practical.
• The author argues that the convergence analysis of Newton's method is aesthetically embarrassing because it is not scale-invariant, while the actual code is.
• The author encourages a focus on practical performance and simplicity in optimization methods, rather than overly complex theoretical analysis.

## Practical Considerations

• Newton's method is a powerful optimization algorithm that converges quadratically, meaning it rapidly approaches the optimal solution with each iteration.
• It is particularly effective for smooth, unconstrained optimization problems.
• Newton's method involves calculating the gradient and Hessian (matrix of second derivatives) of the objective function at each iteration to determine the search direction and step size.
• Despite its efficiency, Newton's method can be computationally expensive, especially for large-scale problems with many variables.
• Quasi-Newton methods are alternatives that approximate the Hessian and are often used in practice.
• Newton's method is widely used in various fields, including machine learning, signal processing, and optimal control problems, where it can achieve high accuracy with relatively few iterations.

• The classical analysis of Newton's method has a less sophisticated analysis compared to the actual algorithm.
• The issue with the classical analysis is that it says the third derivative is small in a way that is not affine independent.
• Nesterov and Nemirovski introduced the concept of self-concordance, which is an affine invariant way of saying the third derivative is small.
• Many practical functions are self-concordant, and Newton's method based on self-concordance has a simpler analysis.
• Nesterov and Nemirovski's analysis of Newton's method based on self-concordance gives a complexity bound that is much better than the classical analysis.

## Implementation Details

• Newton's method reduces the solution of minimizing a smooth convex function to minimizing a sequence of quadratic functions.
• Each iteration of Newton's method involves solving a linear system of equations.
• If the Hessian matrix is structured, the linear system can be solved efficiently in linear time.
• A variant of Newton's method called the "lazy Newton" method updates the Hessian matrix every 20 steps, which can save computation time.