# Stanford EE364A Convex Optimization I | Stephen Boyd | 2023 | Lecture 10

## Penalty Functions

- Penalty functions describe how much "irritation" a residual of a given value causes.
- The goal is to minimize total irritation by minimizing the sum of these penalties.
- L1 (Lasso) penalty functions are sparsifying, often resulting in many zero entries in the solution to an optimization problem.
- Compared to quadratic penalties, L1 penalties are relatively relaxed about large residuals but relatively harsh on small ones; quadratic penalties are the opposite, penalizing large residuals heavily and small residuals hardly at all.
- Huber penalties blend quadratic and L1 penalties, matching least squares for small residuals and resembling an L1 penalty for large residuals.
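The three penalties above can be sketched in a few lines of numpy; the Huber threshold `M = 1.0` is an arbitrary illustrative choice:

```python
import numpy as np

def quadratic(r):
    return r ** 2

def l1(r):
    return np.abs(r)

def huber(r, M=1.0):
    # Quadratic for |r| <= M, linear (slope 2M) beyond, so it matches
    # least squares near zero and grows like an L1 penalty for large r.
    a = np.abs(r)
    return np.where(a <= M, a ** 2, M * (2 * a - M))

# For a large residual, the quadratic penalty is huge while L1 and Huber
# stay moderate; for a small residual, L1 is the harshest of the three.
r = 10.0
print(quadratic(r), l1(r), huber(r))
```

Because the Huber penalty grows only linearly past the threshold, a few large residuals cannot dominate the total irritation the way they do under least squares.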

## Regularized Approximation

- Regularized approximation is a multicriterion problem, often involving tracking a desired trajectory, minimizing input size, and minimizing input variation.
- Weights (δ and η) shape the solution and can be adjusted to achieve desired results.
- Changing the penalty function to the sum of absolute values (L1) would likely result in a sparse solution.
- Sparse first difference means that most of the time the value is equal to the previous value, encouraging piece-wise constant inputs.
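With quadratic penalties, this multicriterion problem collapses to one ordinary least-squares problem by stacking the weighted objectives. A minimal sketch, in which the linear map `A`, the desired trajectory, and the weight values are all made-up placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))        # hypothetical map from input u to output
y_des = np.sin(np.linspace(0, 3, n))   # desired trajectory to track

delta, eta = 0.1, 0.5                  # weights on input size and input variation
D = np.diff(np.eye(n), axis=0)         # first-difference matrix: (Du)_i = u_{i+1} - u_i

# minimize ||A u - y_des||^2 + delta ||u||^2 + eta ||D u||^2
# Stacking the square roots of the weights turns the weighted sum
# into a single least-squares objective.
A_stack = np.vstack([A, np.sqrt(delta) * np.eye(n), np.sqrt(eta) * D])
b_stack = np.concatenate([y_des, np.zeros(n), np.zeros(n - 1)])
u = np.linalg.lstsq(A_stack, b_stack, rcond=None)[0]
```

Increasing η smooths the input; replacing the `||D u||^2` term with an L1 penalty (no longer solvable by `lstsq`) is what drives the first difference sparse and the input piece-wise constant.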

## Signal Reconstruction

- Signal reconstruction aims to form an approximation of a corrupted signal by minimizing a regularization function or smoothing objective.
- The trade-off in signal reconstruction is between deviating from the given corrupted signal and the size of the smoothing cost.
- Cross-validation can be used to select the amount of smoothing by randomly removing a portion of the data and pretending it is missing.
- When the penalty is an L1 norm on the first difference, it is called the total variation, and the solution is expected to have a sparse difference corresponding to a piece-wise constant approximation.
- Total variation smoothing preserves sharp boundaries in images.
- Excessive regularization of total variation denoising results in a cartoonish effect with constant grayscale values in different regions.
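A minimal sketch of the quadratic-smoothing version of this trade-off (total variation would replace the squared norm on `D x` with an L1 norm, which has no closed form); the test signal, noise level, and weight λ = 20 are all arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
t = np.linspace(0, 4 * np.pi, n)
x_true = np.sin(t)
x_cor = x_true + 0.3 * rng.standard_normal(n)   # corrupted signal

lam = 20.0                       # smoothing weight (the trade-off parameter)
D = np.diff(np.eye(n), axis=0)   # first-difference operator

# Quadratic smoothing: minimize ||x - x_cor||^2 + lam * ||D x||^2.
# Setting the gradient to zero gives the linear system
# (I + lam * D^T D) x = x_cor.
x_hat = np.linalg.solve(np.eye(n) + lam * D.T @ D, x_cor)
```

Sweeping λ traces out the trade-off curve: small λ stays close to the corrupted signal, large λ over-smooths, and cross-validation on held-out samples is one way to pick a value in between.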

## Robust Approximation

- Robust approximation addresses uncertainty in the model rather than pretending the model is known exactly.
- In practice, the most common way of handling uncertainty is to ignore it, or at best to summarize it with a mean and a variance treated as a probability distribution.
- A better approach is to simulate multiple plausible scenarios, or to make an educated guess about the range of possible models, and account for that uncertainty explicitly.
- Regularization in machine learning and statistics addresses uncertainty in data.

## Handling Uncertainty

- Different approaches to handling uncertainty include stochastic methods, worst-case methods, and hybrids of these.
- Stochastic methods assume that the uncertain parameter comes from a probability distribution.
- Worst-case methods assume that the uncertain parameter satisfies certain constraints.
- Robust stochastic methods combine elements of both stochastic and worst-case methods.
- A simple example illustrates the different methods and their effects on the resulting model.
- A practical trick for handling uncertainty is to obtain a small number of plausible models and use a min-max approach.
- The speaker discusses how to generate multiple plausible models when faced with uncertainty in data.
- One approach is to use bootstrapping, which is a valid method in this context.
- Another principled approach is to use maximum likelihood estimation and then generate models with parameters that are close to the maximum likelihood estimate.
- The speaker introduces stochastic robust least squares, a regularized least squares method that accounts for uncertainty in the data.
- Regularization, such as Ridge regression, can be interpreted as a way of acknowledging uncertainty in the features of the data.
- The speaker provides an example of a uniform distribution on a unit disc in matrix space to illustrate the concept of uncertainty in model parameters.
- The speaker emphasizes the importance of acknowledging uncertainty in data and models, and suggests that simply admitting uncertainty can provide significant benefits.
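The "small number of plausible models" trick above can be sketched with the stochastic (average-over-scenarios) variant, which reduces to a single least-squares problem by stacking. The nominal matrix, the perturbation size, and the dimensions are all made-up placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 30, 10, 5
A_nom = rng.standard_normal((m, n))   # nominal model matrix
b = rng.standard_normal(m)

# A few plausible models, here generated as small perturbations of the
# nominal one (in practice they might come from bootstrapping or from
# sampling near a maximum likelihood estimate).
models = [A_nom + 0.1 * rng.standard_normal((m, n)) for _ in range(k)]

# Stochastic robust least squares: minimize the average of ||A_i x - b||^2
# over the k scenarios. Stacking the scenarios turns this into one
# ordinary least-squares problem.
A_stack = np.vstack(models)
b_stack = np.tile(b, k)
x_robust = np.linalg.lstsq(A_stack, b_stack, rcond=None)[0]

# Solution that trusts the nominal model alone, for comparison.
x_nom = np.linalg.lstsq(A_nom, b, rcond=None)[0]
```

By construction `x_robust` has a smaller average squared residual across the scenarios than the nominal solution; the min-max (worst-case) variant would minimize the maximum residual instead, which requires a convex solver rather than plain least squares.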

## Maximum Likelihood Estimation

- The video discusses statistical estimation, specifically maximum likelihood estimation.
- Maximum likelihood estimation aims to find the parameter values that maximize the likelihood of observing the data.
- The log likelihood function is often used instead of the likelihood function since maximizing both functions is equivalent.
- Regularization can be added to the maximum likelihood function to prevent overfitting.
- Maximum likelihood estimation is a convex problem when the log likelihood function is concave.
- An example of a linear measurement model is given, where the goal is to estimate the unknown parameter X from a set of measurements.
- The log likelihood function for this model is derived and shown to be concave when the noise density is log-concave, as it is for Gaussian noise.
- The maximum likelihood solution for this model is the least squares solution.
- The noise distribution can also be Laplacian, which has heavier tails than a Gaussian distribution.
- Maximum likelihood estimation with Laplacian noise is equivalent to L1 estimation.
- L1 estimation is more robust to outliers compared to least squares estimation because it is less sensitive to large residuals.
- The difference between L2 fitting and L1 approximation can be explained by the difference in the assumed noise density.
- Huber fitting can be interpreted as maximum likelihood estimation with a noise density that is Gaussian near zero and has exponential (Laplacian-like) tails.
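The Gaussian-versus-Laplacian contrast is easiest to see in the simplest location model, where the two ML estimates have closed forms (sample mean and sample median); the data values below are arbitrary:

```python
import numpy as np

# Location model y_i = x + v_i. With Gaussian noise, maximum likelihood
# minimizes the sum of squared residuals, giving the sample mean; with
# Laplacian noise it minimizes the sum of absolute residuals, giving
# the sample median.
y = np.array([1.0, 1.1, 0.9, 1.05, 0.95, 100.0])  # last point is an outlier

x_gauss = np.mean(y)      # Gaussian ML / least squares estimate
x_laplace = np.median(y)  # Laplacian ML / L1 estimate

# The mean is dragged far toward the outlier; the median stays near 1.
print(x_gauss, x_laplace)
```

This is the robustness claim above in miniature: the heavy-tailed Laplacian model expects occasional large residuals, so its estimate barely moves when one appears.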

## Logistic Regression

- Logistic regression is a model for binary classification where the probability of an outcome is modeled using a logistic function.
- When the data are linearly separable, the maximum-likelihood logistic curve becomes infinitely sharp, because any finite slope assigns probability less than one to correctly classified points and so loses log likelihood.
- In that case the log-likelihood is unbounded above, and the issue can be fixed by adding regularization.
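A minimal sketch of regularized logistic regression fit by gradient ascent on the penalized log likelihood; the synthetic one-dimensional data, the penalty weight, and the step size are all arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = np.concatenate([rng.normal(-1.0, 1.0, n), rng.normal(1.0, 1.0, n)])
y = np.concatenate([np.zeros(n), np.ones(n)])   # binary labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit p(y=1 | x) = sigmoid(a*x + b) by gradient ascent on the log
# likelihood minus an l2 penalty on the slope a. The penalty keeps the
# slope finite even when the data happen to be linearly separable.
a, b, lam, step = 0.0, 0.0, 0.1, 0.1
for _ in range(2000):
    p = sigmoid(a * x + b)
    grad_a = np.sum((y - p) * x) - lam * a   # gradient of penalized log likelihood
    grad_b = np.sum(y - p)
    a += step * grad_a / len(x)
    b += step * grad_b / len(x)
```

Without the `lam * a` term, separable data would drive the slope `a` toward infinity; with it, the objective is bounded and the fitted curve stays finitely sharp.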

## Hypothesis Testing

- In basic hypothesis testing, a randomized detector is a 2×n matrix whose k-th column gives the probabilities of guessing each of the two distributions when outcome k is observed (so each column sums to one).
- The detection probability matrix, a 2×2 confusion matrix, is obtained by multiplying the randomized detector matrix by the matrix whose columns are the two distributions P and Q.
- The goal is to have the detection matrix be an identity matrix, indicating perfect accuracy in guessing the distribution.
- The choice of the randomized detector involves a multi-criterion problem, balancing the probabilities of false negatives and false positives.
- In some cases, an analytical solution exists for the optimal randomized detector, resulting in a deterministic detector.
- The likelihood ratio test is a simple method for determining the distribution from which an outcome originated based on the ratio of P over Q.
- For continuous distributions, the same idea carries over: compare the ratio of the two densities to a threshold.
- Deterministic detectors can be used to classify data points based on their density ratio.
- Other approaches, such as minimizing the probability of being wrong, can also be used for classification.
- These approaches often result in non-deterministic detectors.
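The detector and confusion-matrix machinery above fits in a few lines; the two distributions and the threshold below are arbitrary illustrative choices:

```python
import numpy as np

# Two distributions p and q over 4 possible outcomes.
p = np.array([0.5, 0.3, 0.15, 0.05])
q = np.array([0.1, 0.2, 0.3, 0.4])

# Deterministic likelihood ratio test: guess "p" whenever p_k / q_k
# exceeds a threshold. Each column of T is the (degenerate) probability
# of guessing p (row 0) or q (row 1) given the observed outcome.
threshold = 1.0
guess_p = p / q > threshold
T = np.vstack([guess_p, ~guess_p]).astype(float)

# Detection probability matrix D = T [p q]: D[0,0] is the probability of
# correctly guessing p, D[1,1] of correctly guessing q, and the
# off-diagonal entries are the two error probabilities.
D = T @ np.column_stack([p, q])
print(D)
```

Raising the threshold trades one error probability against the other; a perfect detector would make `D` the identity, which is only possible when the two distributions have disjoint support.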