# Stanford EE364A Convex Optimization I Stephen Boyd I 2023 I Lecture 10


## Penalty Functions

• Penalty functions describe the irritation with a residual of a certain value.
• The goal is to minimize total irritation by minimizing the sum of these penalties.
• L1 (Lasso) penalty functions are sparsifying, often resulting in many zero entries in the solution to an optimization problem.
• Relative to a quadratic penalty, an L1 penalty is lenient on large residuals but charges relatively more for small ones; a quadratic penalty is the opposite, nearly indifferent to small residuals and harsh on large ones.
• Huber penalties blend quadratic and L1 penalties, matching least squares for small residuals and resembling an L1 penalty for large residuals.
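The three penalties above can be sketched in a few lines of plain Python; the Huber threshold M = 1 below is an illustrative choice, not a value from the lecture.

```python
# Compare quadratic, L1 (absolute value), and Huber penalties on a few
# residual values. Huber matches the quadratic for |u| <= M and grows
# linearly (L1-like) beyond that.

def quadratic(u):
    return u * u

def l1(u):
    return abs(u)

def huber(u, M=1.0):
    # Quadratic for small residuals, linear for large ones.
    if abs(u) <= M:
        return u * u
    return M * (2 * abs(u) - M)

for r in [0.1, 0.5, 2.0, 10.0]:
    print(f"r={r:5.1f}  quad={quadratic(r):7.2f}  l1={l1(r):6.2f}  huber={huber(r):6.2f}")
```

Note how the quadratic penalty explodes for the residual of 10 (a value of 100) while Huber charges only 19, which is why Huber fitting tolerates outliers.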

## Regularized Approximation

• Regularized approximation is a criterion problem with multiple objectives, often involving tracking a desired trajectory, minimizing input size, and minimizing input variation.
• Weights (δ and η) on the regularization terms shape the solution and can be adjusted to achieve desired results.
• Changing the penalty function to the sum of absolute values (L1) would likely result in a sparse solution.
• Sparse first difference means that most of the time the value is equal to the previous value, encouraging piece-wise constant inputs.
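The sparse-first-difference idea is easy to see numerically; the signal below is a made-up illustration.

```python
# A piecewise-constant signal has a sparse first difference: most
# consecutive values are equal, so most entries of Dx are zero.

x = [0, 0, 0, 2, 2, 2, 2, 5, 5, 5]

# First difference Dx: x[i+1] - x[i] for i = 0 .. n-2.
dx = [x[i + 1] - x[i] for i in range(len(x) - 1)]

nonzeros = sum(1 for d in dx if d != 0)
print(dx)        # mostly zeros
print(nonzeros)  # only 2 jumps
```

An L1 penalty on `dx` drives most of its entries exactly to zero, which is what produces piecewise-constant inputs.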

## Signal Reconstruction

• Signal reconstruction aims to form an approximation of a corrupted signal by minimizing a regularization function or smoothing objective.
• The trade-off in signal reconstruction is between deviating from the given corrupted signal and the size of the smoothing cost.
• Cross-validation can be used to select the amount of smoothing by randomly removing a portion of the data and pretending it is missing.
• When the penalty is an L1 Norm on the first difference, it is called the total variation, and the solution is expected to have a sparse difference corresponding to a piece-wise constant approximation.
• Total variation smoothing preserves sharp boundaries in images.
• Excessive regularization of total variation denoising results in a cartoonish effect with constant grayscale values in different regions.
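As a toy illustration of the smoothing trade-off, the quadratic version of the reconstruction objective can be minimized with plain gradient descent (the lecture uses proper convex solvers; the step size, λ, and iteration count here are ad hoc choices, not values from the lecture).

```python
# Quadratic smoothing of a noisy 1-D signal by gradient descent on
#   f(x) = sum_i (x_i - y_i)^2 + lam * sum_i (x_{i+1} - x_i)^2,
# trading fidelity to the corrupted signal y against smoothness of x.

def smooth(y, lam=5.0, step=0.02, iters=3000):
    x = list(y)
    n = len(x)
    for _ in range(iters):
        # Gradient of the fidelity term.
        g = [2 * (x[i] - y[i]) for i in range(n)]
        # Gradient of the smoothness term.
        for i in range(n - 1):
            d = 2 * lam * (x[i + 1] - x[i])
            g[i] -= d
            g[i + 1] += d
        x = [x[i] - step * g[i] for i in range(n)]
    return x

def total_sq_variation(x):
    return sum((x[i + 1] - x[i]) ** 2 for i in range(len(x) - 1))

y = [0, 1, 0, 1, 0, 1, 0, 1]   # a jittery signal
xs = smooth(y)
print(total_sq_variation(xs) < total_sq_variation(y))  # True: smoother
```

Replacing the squared differences with absolute differences gives the total variation objective described above, whose solutions are piecewise constant rather than smooth.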

## Robust Approximation

• Robust approximation addresses uncertainty in the model data (e.g., in the matrix A) rather than treating the nominal model as exact.
• In practice, the most common way uncertainty is handled is to ignore it, or at best to summarize it with a mean and variance as a crude probability distribution.
• A better approach is to describe the uncertainty explicitly, for example by simulating multiple scenarios or making an educated guess at a distribution.
• Regularization in machine learning and statistics addresses uncertainty in data.
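The connection between averaging over uncertainty and regularization can be seen in a scalar example (the numbers below are made up): averaging the squared residual over zero-mean perturbations of the model adds a ridge-like term.

```python
# Scalar illustration: averaging ((a + u) x - b)^2 over zero-mean
# perturbations u of the model coefficient a gives
#   (a x - b)^2 + x^2 * E[u^2],
# i.e. the nominal residual plus a ridge-like penalty on x.

a, b, x = 2.0, 2.0, 1.5
us = [-0.3, -0.1, 0.0, 0.1, 0.3]   # zero-mean scenarios for u

avg = sum(((a + u) * x - b) ** 2 for u in us) / len(us)
nominal = (a * x - b) ** 2
ridge = x ** 2 * sum(u * u for u in us) / len(us)

print(abs(avg - (nominal + ridge)) < 1e-9)  # True: identity holds
```

Expanding the square, the cross term vanishes because the perturbations have zero mean, leaving exactly the nominal cost plus `x**2 * E[u**2]`.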

## Handling Uncertainty

• Different approaches to handling uncertainty include stochastic methods, worst-case methods, and hybrids of these.
• Stochastic methods assume that the uncertain parameter comes from a probability distribution.
• Worst-case methods assume that the uncertain parameter satisfies certain constraints.
• Robust stochastic methods combine elements of both stochastic and worst-case methods.
• A simple example illustrates the different methods and their effects on the resulting model.
• A practical trick for handling uncertainty is to obtain a small number of plausible models and use a min-max approach.
• The speaker discusses how to generate multiple plausible models when faced with uncertainty in data.
• One approach is bootstrapping: refit the model on resampled versions of the data to obtain a family of plausible models.
• Another principled approach is to use maximum likelihood estimation and then generate models with parameters that are close to the maximum likelihood estimate.
• The speaker introduces stochastic robust least squares, a regularized least squares method that accounts for uncertainty in the data.
• Regularization, such as Ridge regression, can be interpreted as a way of acknowledging uncertainty in the features of the data.
• The speaker gives an example of a uniform distribution on a unit-norm ball of matrices to illustrate uncertainty in model parameters.
• The speaker emphasizes the importance of acknowledging uncertainty in data and models, and suggests that simply admitting uncertainty can provide significant benefits.
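The min-max trick above can be sketched in a scalar toy problem (the plausible models, target b, and candidate grid below are made up; a real implementation would solve this as a convex problem rather than by grid search).

```python
# Given a handful of plausible models (candidate slopes a_k), pick x to
# minimize the worst-case residual  max_k (a_k * x - b)^2.

A_plausible = [1.8, 2.0, 2.3]   # e.g. obtained from bootstrap fits
b = 4.0

def worst_case(x):
    return max((a * x - b) ** 2 for a in A_plausible)

# Crude grid search over candidate x values.
xs = [i / 1000 for i in range(0, 4001)]
x_star = min(xs, key=worst_case)
print(x_star, worst_case(x_star))
```

The minimax solution balances the two extreme models: at the optimum the residuals under the smallest and largest plausible slopes are equal in magnitude.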

## Maximum Likelihood Estimation

• The lecture turns to statistical estimation, specifically maximum likelihood estimation.
• Maximum likelihood estimation aims to find the parameter values that maximize the likelihood of observing the data.
• The log likelihood function is often used instead of the likelihood function since maximizing both functions is equivalent.
• Regularization can be added to the maximum likelihood function to prevent overfitting.
• Maximum likelihood estimation is a convex problem when the log likelihood function is concave.
• An example of a linear measurement model is given, where the goal is to estimate the unknown parameter X from a set of measurements.
• The log likelihood function for this model is derived and shown to be concave whenever the noise density is log-concave (e.g., Gaussian), so maximizing it is a convex problem.
• The maximum likelihood solution for this model is the least squares solution.
• The noise distribution can also be Laplacian, which has heavier tails than a Gaussian distribution.
• Maximum likelihood estimation with Laplacian noise is equivalent to L1 estimation.
• L1 estimation is more robust to outliers compared to least squares estimation because it is less sensitive to large residuals.
• The difference between L2 fitting and L1 approximation can be explained by the difference in the assumed noise density.
• Huber fitting can be interpreted as maximum likelihood estimation with a noise density that is Gaussian for small values and has heavier, exponential (Laplacian-like) tails.
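The robustness claim above is concrete in the simplest case, estimating a constant from noisy measurements: Gaussian noise makes the ML estimate the sample mean (least squares), while Laplacian noise makes it a sample median (L1). The data below, including the gross outlier, are made up for illustration.

```python
# Measurements y_i = x + v_i with one gross outlier. The mean (Gaussian
# ML / least squares) is badly skewed by the outlier; the median
# (Laplacian ML / L1 estimation) is essentially unaffected.

y = [1.0, 1.2, 0.9, 1.1, 100.0]

mean = sum(y) / len(y)            # least-squares / Gaussian ML estimate
median = sorted(y)[len(y) // 2]   # L1 / Laplacian ML estimate

print(mean)    # pulled far from the bulk of the data by the outlier
print(median)  # 1.1, robust to the outlier
```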

## Logistic Regression

• Logistic regression is a model for binary classification where the probability of an outcome is modeled using a logistic function.
• When the data are linearly separable, the maximum-likelihood logistic curve is infinitely sharp: any finite slope can be improved by steepening it, so the log-likelihood is unbounded above.
• Adding regularization fixes this, yielding a finite, well-defined solution.
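The unboundedness is easy to demonstrate numerically; the separable 1-D data set and the regularization weight below are made up for illustration.

```python
import math

# Separable 1-D data: all negative examples lie left of all positives.
# Scaling the logistic slope t drives the log-likelihood toward 0 from
# below, so it is unbounded above in t; adding lam * t^2 restores a
# finite maximizer.

data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

def log_lik(t):
    ll = 0.0
    for u, y in data:
        p = 1.0 / (1.0 + math.exp(-t * u))   # P(y = 1 | u)
        ll += math.log(p if y == 1 else 1.0 - p)
    return ll

print(log_lik(1.0) < log_lik(10.0) < log_lik(100.0))  # True: keeps growing

lam = 0.01
reg = [log_lik(t) - lam * t * t for t in (1.0, 10.0, 100.0)]
print(reg[1] > reg[2])  # True: the penalty caps the growth
```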

## Hypothesis Testing

• In basic hypothesis testing, a randomized detector is a 2×n matrix whose k-th column gives the probabilities of guessing each of the two distributions when outcome k is observed.
• The confusion matrix, or detection probability matrix, is obtained by multiplying the randomized detector matrix by the probability distributions P and Q.
• The goal is to have the detection matrix be an identity matrix, indicating perfect accuracy in guessing the distribution.
• The choice of the randomized detector involves a multi-criterion problem, balancing the probabilities of false negatives and false positives.
• In some cases, an analytical solution exists for the optimal randomized detector, resulting in a deterministic detector.
• The likelihood ratio test is a simple method for determining the distribution from which an outcome originated based on the ratio of P over Q.
• The same idea extends to continuous distributions: compare the ratio of the two densities to a threshold.
• Deterministic detectors can be used to classify data points based on their density ratio.
• Other formulations, such as minimizing the maximum probability of being wrong, can also be used to design the detector.
• These formulations can have optimal detectors that are genuinely randomized.
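A deterministic likelihood ratio detector and its detection (confusion) matrix can be computed directly; the distributions P and Q and the threshold below are made-up illustrative values.

```python
# Likelihood ratio detector for two discrete distributions P and Q over
# n outcomes: guess P when p_k / q_k exceeds a threshold. The detector
# matrix T has one column per outcome; each column is a probability
# distribution over the two guesses (here 0/1, i.e. deterministic).

P = [0.5, 0.3, 0.2]
Q = [0.1, 0.3, 0.6]
threshold = 1.0

# Column k of T (stored as rows here): [prob guess P, prob guess Q].
T = [[1.0, 0.0] if P[k] / Q[k] > threshold else [0.0, 1.0]
     for k in range(len(P))]

# Detection matrix D: D[i][j] is the probability of guessing
# distribution i when the true distribution is j. Ideally D = I.
D = [[sum(T[k][i] * dist[k] for k in range(len(P))) for dist in (P, Q)]
     for i in range(2)]

print(D)  # columns sum to 1; diagonal entries are the hit rates
```

Raising the threshold trades a lower false-alarm rate (off-diagonal in the Q column) against a lower detection rate for P, which is exactly the bi-criterion trade-off described above.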