Stanford EE364A Convex Optimization I Stephen Boyd I 2023 I Lecture 10


Penalty Functions

  • A penalty function expresses how irritated you are by a residual of a given value.
  • The goal is to minimize total irritation by minimizing the sum of these penalties.
  • L1 (Lasso) penalty functions are sparsifying, often resulting in many zero entries in the solution to an optimization problem.
  • An L1 penalty is relatively relaxed about large residuals but penalizes small residuals comparatively heavily; a quadratic penalty is the opposite, nearly indifferent to small residuals and very unhappy with large ones.
  • Huber penalties blend the two, matching least squares for small residuals and growing like an L1 penalty for large residuals (see the sketch below).
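
As a rough illustration of these three penalties, here is a minimal CVXPY sketch (not from the lecture; the problem data A and b are random placeholders) that solves the same approximation problem with a quadratic, an L1, and a Huber penalty on the residual.

```python
import numpy as np
import cvxpy as cp

# Random placeholder data, not the lecture's example.
np.random.seed(0)
m, n = 100, 30
A = np.random.randn(m, n)
b = np.random.randn(m)

x = cp.Variable(n)
r = A @ x - b  # residual

# Quadratic (least squares) penalty: very unhappy with large residuals.
cp.Problem(cp.Minimize(cp.sum_squares(r))).solve()
x_ls = x.value

# L1 penalty: relaxed about large residuals; tends to make many residuals exactly zero.
cp.Problem(cp.Minimize(cp.norm(r, 1))).solve()
x_l1 = x.value

# Huber penalty: quadratic for |r| <= 1, linear beyond, blending the two behaviors.
cp.Problem(cp.Minimize(cp.sum(cp.huber(r, 1.0)))).solve()
x_huber = x.value
```

With the L1 penalty, many entries of the residual are exactly zero; the Huber solution behaves like least squares on small residuals while being far less sensitive to large ones.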

Regularized Approximation

  • Regularized approximation is a criterion problem with multiple objectives, often involving tracking a desired trajectory, minimizing input size, and minimizing input variation.
  • Weights δ (delta) and η (eta) shape the solution and can be adjusted to achieve desired results.
  • Changing the penalty function to the sum of absolute values (L1) would likely result in a sparse solution.
  • A sparse first difference means that most of the time the input equals its previous value, i.e., the input is encouraged to be piecewise constant (see the sketch below).
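
Here is a hedged CVXPY sketch of this kind of multi-objective problem with made-up data: H and y_des stand in for the system map and desired trajectory, and delta and eta are the two weights (which weight goes with which term is my assumption, not the lecture's). Swapping the quadratic variation term for an L1 norm of the first difference yields a roughly piecewise-constant input.

```python
import numpy as np
import cvxpy as cp

# Placeholder data: H maps the input u to the output, y_des is the target trajectory.
np.random.seed(0)
T = 200
H = 0.1 * np.tril(np.random.randn(T, T))   # arbitrary causal map, stand-in only
y_des = np.sin(np.linspace(0, 4 * np.pi, T))

u = cp.Variable(T)
delta, eta = 0.1, 1.0   # weights shaping the trade-off (illustrative values)

tracking = cp.sum_squares(H @ u - y_des)
size = cp.sum_squares(u)
variation_l2 = cp.sum_squares(cp.diff(u))   # quadratic penalty on input variation
variation_l1 = cp.norm(cp.diff(u), 1)       # L1 penalty: sparse first difference

# Quadratic smoothing of the input.
cp.Problem(cp.Minimize(tracking + delta * size + eta * variation_l2)).solve()
u_smooth = u.value

# L1 smoothing: most entries of diff(u) become zero, so u is roughly piecewise constant.
cp.Problem(cp.Minimize(tracking + delta * size + eta * variation_l1)).solve()
u_pwc = u.value
```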

Signal Reconstruction

  • Signal reconstruction aims to form an approximation of a corrupted signal by minimizing a regularization function or smoothing objective.
  • The trade-off in signal reconstruction is between deviating from the given corrupted signal and the size of the smoothing cost.
  • Cross-validation can be used to select the amount of smoothing by randomly removing a portion of the data and pretending it is missing.
  • When the penalty is an L1 Norm on the first difference, it is called the total variation, and the solution is expected to have a sparse difference corresponding to a piece-wise constant approximation.
  • Total variation smoothing preserves sharp boundaries in images.
  • Excessive regularization of total variation denoising results in a cartoonish effect with constant grayscale values in different regions.
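
A small sketch (with a synthetic corrupted signal, not the lecture's example) contrasting quadratic smoothing with total variation reconstruction; lam controls the amount of smoothing and would in practice be chosen, for example, by cross-validation.

```python
import numpy as np
import cvxpy as cp

# Placeholder corrupted signal: piecewise-constant signal plus noise.
np.random.seed(0)
n = 500
x_true = np.concatenate([np.zeros(200), np.ones(150), -0.5 * np.ones(150)])
x_corr = x_true + 0.1 * np.random.randn(n)

lam = 1.0                 # amount of smoothing (illustrative value)
x_hat = cp.Variable(n)
fit = cp.sum_squares(x_hat - x_corr)   # deviation from the given corrupted signal

# Quadratic smoothing: penalizes the first difference quadratically, blurs sharp edges.
cp.Problem(cp.Minimize(fit + lam * cp.sum_squares(cp.diff(x_hat)))).solve()
x_quad = x_hat.value

# Total variation: L1 norm of the first difference, preserves sharp boundaries.
cp.Problem(cp.Minimize(fit + lam * cp.tv(x_hat))).solve()
x_tv = x_hat.value
```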

Robust Approximation

  • Robust approximation addresses uncertainty in the model itself, for example uncertainty in the matrix A.
  • In practice the most common way of handling uncertainty is to ignore it, or to summarize it with a mean and variance used as a crude probability distribution.
  • A better approach is to generate multiple plausible scenarios, by simulation or educated guessing, and account for all of them.
  • Regularization in machine learning and statistics addresses uncertainty in data.

Handling Uncertainty

  • Different approaches to handling uncertainty include stochastic methods, worst-case methods, and hybrids of these.
  • Stochastic methods assume that the uncertain parameter comes from a probability distribution.
  • Worst-case methods assume that the uncertain parameter satisfies certain constraints.
  • Robust stochastic methods combine elements of both stochastic and worst-case methods.
  • A simple example illustrates the different methods and their effects on the resulting model.
  • A practical trick for handling uncertainty is to obtain a small number of plausible models and minimize the worst case over them, a min-max approach (see the sketch after this list).
  • The speaker discusses how to generate multiple plausible models when faced with uncertainty in data.
  • One approach is to use bootstrapping, which is a valid method in this context.
  • Another principled approach is to use maximum likelihood estimation and then generate models with parameters that are close to the maximum likelihood estimate.
  • The speaker introduces stochastic robust least squares, a regularized least-squares method that accounts for uncertainty in the data.
  • Regularization, such as Ridge regression, can be interpreted as a way of acknowledging uncertainty in the features of the data.
  • The speaker provides an example of a uniform distribution on the unit disc in matrix space to illustrate uncertainty in the model parameters.
  • The speaker emphasizes the importance of acknowledging uncertainty in data and models, and suggests that simply admitting uncertainty can provide significant benefits.
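
To make the min-max trick concrete, here is a sketch under assumed data: K plausible models are generated by perturbing a nominal matrix (in practice they might come from bootstrapping or perturbations of a maximum likelihood estimate), and the worst-case residual over the K models is minimized, alongside the stochastic (average-case) variant for comparison.

```python
import numpy as np
import cvxpy as cp

# Placeholder: K plausible models around a nominal A (stand-ins for bootstrapped models).
np.random.seed(0)
m, n, K = 50, 20, 5
A_nom = np.random.randn(m, n)
b = np.random.randn(m)
models = [A_nom + 0.1 * np.random.randn(m, n) for _ in range(K)]

x = cp.Variable(n)

# Worst-case (min-max) approach: minimize the largest residual norm over the K models.
worst_case = cp.maximum(*[cp.norm(A @ x - b, 2) for A in models])
cp.Problem(cp.Minimize(worst_case)).solve()
x_robust = x.value

# Stochastic approach for comparison: minimize the average squared residual over the models.
avg = sum(cp.sum_squares(A @ x - b) for A in models) / K
cp.Problem(cp.Minimize(avg)).solve()
x_stoch = x.value
```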

Maximum Likelihood Estimation

  • The lecture turns to statistical estimation, specifically maximum likelihood estimation.
  • Maximum likelihood estimation aims to find the parameter values that maximize the likelihood of observing the data.
  • The log likelihood is often used instead of the likelihood itself; since the log is monotone increasing, maximizing either gives the same estimate.
  • Regularization can be added to the maximum likelihood function to prevent overfitting.
  • Maximum likelihood estimation is a convex problem when the log likelihood function is concave.
  • An example of a linear measurement model is given, where the goal is to estimate the unknown parameter X from a set of measurements.
  • The log likelihood function for this model is derived and shown to be concave (so maximum likelihood estimation is a convex problem) when the noise is Gaussian.
  • The maximum likelihood solution for this model is the least squares solution.
  • The noise distribution can also be Laplacian, which has heavier tails than a Gaussian distribution.
  • Maximum likelihood estimation with Laplacian noise is equivalent to L1 estimation.
  • L1 estimation is more robust to outliers compared to least squares estimation because it is less sensitive to large residuals.
  • The difference between L2 fitting and L1 approximation can be explained by the difference in the assumed noise density.
  • Huber fitting can be interpreted as maximum likelihood estimation with a noise density that is Gaussian near zero and has exponential (Laplacian-like) tails (see the sketch below).
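
The correspondence between noise models and penalties can be checked directly. The sketch below (with synthetic data containing a few outliers, not the lecture's example) computes the ML estimates under Gaussian, Laplacian, and Huber-type noise assumptions, which reduce to least squares, L1, and Huber fitting respectively.

```python
import numpy as np
import cvxpy as cp

# Placeholder linear measurement model y = A x + v, with a few large outliers in v.
np.random.seed(0)
m, n = 100, 20
A = np.random.randn(m, n)
x_true = np.random.randn(n)
noise = 0.1 * np.random.randn(m)
noise[:5] += 5.0                     # a handful of outliers
y = A @ x_true + noise

x = cp.Variable(n)
r = A @ x - y

# Gaussian noise -> ML estimate is the least squares solution.
cp.Problem(cp.Minimize(cp.sum_squares(r))).solve()
x_gauss = x.value

# Laplacian noise -> ML estimate is L1 estimation, much less sensitive to the outliers.
cp.Problem(cp.Minimize(cp.norm(r, 1))).solve()
x_laplace = x.value

# Huber fitting: ML for a density that is Gaussian in the middle with exponential tails.
cp.Problem(cp.Minimize(cp.sum(cp.huber(r, 1.0)))).solve()
x_huber = x.value
```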

Logistic Regression

  • Logistic regression is a model for binary classification where the probability of an outcome is modeled using a logistic function.
  • When the log likelihood is unbounded above, the data are linearly separable: the maximum-likelihood logistic curve becomes infinitely sharp, since making it any less sharp would lower the probabilities of the observed outcomes and lose log likelihood.
  • This degeneracy is fixed by adding regularization (see the sketch below).
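
A minimal sketch of regularized logistic regression via maximum likelihood in CVXPY, with made-up data; the L2 regularization term keeps the coefficients finite even when the data happen to be linearly separable.

```python
import numpy as np
import cvxpy as cp

# Placeholder binary classification data with labels in {0, 1}.
np.random.seed(0)
m, n = 80, 5
X = np.random.randn(m, n)
y = (X @ np.ones(n) + 0.5 * np.random.randn(m) > 0).astype(float)

theta = cp.Variable(n)
b = cp.Variable()
z = X @ theta + b

# Negative log likelihood for logistic regression; cp.logistic(z) = log(1 + exp(z)).
nll = cp.sum(cp.logistic(z)) - y @ z

lam = 0.1   # regularization weight (illustrative value)
cp.Problem(cp.Minimize(nll + lam * cp.sum_squares(theta))).solve()
theta_hat, b_hat = theta.value, b.value
```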

Hypothesis Testing

  • In basic hypothesis testing, a randomized detector is a 2×n matrix whose k-th column gives the probabilities of guessing each of the two distributions when outcome k is observed.
  • The detection probability matrix (a 2×2 confusion matrix) is obtained by multiplying the randomized detector by the matrix whose columns are the two distributions p and q.
  • The goal is to have the detection matrix be an identity matrix, indicating perfect accuracy in guessing the distribution.
  • The choice of the randomized detector involves a multi-criterion problem, balancing the probabilities of false negatives and false positives.
  • In some cases, an analytical solution exists for the optimal randomized detector, resulting in a deterministic detector.
  • The likelihood ratio test is a simple method for determining the distribution from which an outcome originated based on the ratio of P over Q.
  • The same idea extends to continuous distributions: compare the ratio of the two densities to a threshold.
  • Deterministic detectors can be used to classify data points based on their density ratio.
  • Other design criteria, such as minimizing the worst-case probability of being wrong, can also be used.
  • These criteria often result in randomized (non-deterministic) detectors (see the sketch after this list).
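
A sketch of the detector-design problem described above, for assumed distributions p and q: the randomized detector T is chosen to minimize the larger of the false-negative and false-positive probabilities, a linear program whose optimum is typically randomized on at most a few borderline outcomes.

```python
import numpy as np
import cvxpy as cp

# Placeholder pair of distributions p, q over n outcomes.
np.random.seed(0)
n = 10
p = np.random.rand(n); p /= p.sum()
q = np.random.rand(n); q /= q.sum()
P = np.column_stack([p, q])            # n x 2, columns are the two distributions

# Randomized detector: column k gives the probabilities of guessing
# distribution 1 or 2 when outcome k is observed.
T = cp.Variable((2, n), nonneg=True)
D = T @ P                              # 2 x 2 detection probability matrix

# Minimax design: minimize the larger of the two error probabilities
# (false negative D[1, 0] and false positive D[0, 1]).
objective = cp.Minimize(cp.maximum(D[1, 0], D[0, 1]))
constraints = [cp.sum(T, axis=0) == 1]  # each column of T is a probability distribution
cp.Problem(objective, constraints).solve()

T_opt = T.value
```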
