Regularization
Machine learning models must simultaneously meet two conflicting goals:
- Fit data well.
- Fit data as simply as possible.
One approach to keeping a model simple is to penalize complex models; that is, to force the model to become simpler during training. Penalizing complex models is one form of regularization.
Loss and complexity
So far, this course has suggested that the only goal when training was to minimize loss; that is:

minimize(loss)
As you’ve seen, models focused solely on minimizing loss tend to overfit. A better training optimization algorithm minimizes some combination of loss and complexity:

minimize(loss + complexity)
Unfortunately, loss and complexity are typically inversely related. As complexity increases, loss decreases. As complexity decreases, loss increases. You should find a reasonable middle ground where the model makes good predictions on both the training data and real-world data. That is, your model should find a reasonable compromise between loss and complexity.
Quantifying complexity
How would you quantify complexity?
L2 regularization is a popular regularization metric, which uses the following formula:

L2 regularization = w1² + w2² + ... + wn²

In this formula, weights close to 0 have little effect on model complexity, while outlier weights contribute far more.
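For instance, here is a minimal sketch of that computation (the weight values are hypothetical, used only to illustrate the formula):

```python
import numpy as np

# Hypothetical model weights, not values from the course.
weights = np.array([0.2, -0.5, 1.5, 0.8, -0.1])

# L2 regularization term: the sum of the squares of all the weights.
l2_penalty = np.sum(weights ** 2)
print(l2_penalty)  # 0.04 + 0.25 + 2.25 + 0.64 + 0.01 = 3.19
```

The single outlier weight (1.5) contributes most of the penalty, which is why L2 regularization pushes training toward many small weights rather than a few large ones.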
Regularization rate (lambda)
As noted, training attempts to minimize some combination of loss and complexity:

minimize(loss + complexity)
Model developers tune the overall impact of complexity on model training by multiplying its value by a scalar called the regularization rate. The Greek character lambda typically symbolizes the regularization rate.
That is, model developers aim to do the following:

minimize(loss + λ complexity)
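In a framework such as Keras, the regularization rate is typically passed to a per-layer weight regularizer. The sketch below is illustrative only; the layer sizes and the rate value are assumptions, not part of the course:

```python
import tensorflow as tf

# Regularization rate (lambda); an assumed value for illustration.
reg_rate = 0.01

# Each regularized Dense layer adds reg_rate * sum(w^2) for its weights to
# the training loss, so the optimizer minimizes loss + lambda * complexity.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        64, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(reg_rate)),
    tf.keras.layers.Dense(
        1,
        kernel_regularizer=tf.keras.regularizers.l2(reg_rate)),
])
model.compile(optimizer="adam", loss="mse")
```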
A high regularization rate:
- Strengthens the influence of regularization, thereby reducing the chances of overfitting.
- Tends to produce a histogram of model weights having the following characteristics (see the sketch after the next list):
  - a normal distribution
  - a mean weight of 0.
A low regularization rate:
- Lowers the influence of regularization, thereby increasing the chances of overfitting.
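The sketch below trains the same small model at a few regularization rates and summarizes the resulting weight distributions. The synthetic data, architecture, and rate values are assumptions made for illustration, not values from the course:

```python
import numpy as np
import tensorflow as tf

# Synthetic regression data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20)).astype("float32")
true_w = rng.normal(size=(20, 1))
y = (X @ true_w + 0.1 * rng.normal(size=(1000, 1))).astype("float32")

def weights_after_training(reg_rate):
    """Train a small model with the given L2 regularization rate and
    return all of its weights as one flat array."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(
            32, activation="relu",
            kernel_regularizer=tf.keras.regularizers.l2(reg_rate)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=20, verbose=0)
    return np.concatenate([w.flatten() for w in model.get_weights()])

# A higher rate pulls the weight distribution into a tight cluster
# centered on zero; a lower rate leaves the weights more spread out.
for rate in (0.0, 0.001, 0.1):
    w = weights_after_training(rate)
    print(f"lambda={rate}: mean={w.mean():.3f}, std={w.std():.3f}")
```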
Dropout
Another tool for reducing overfitting is dropout. Dropout layers encourage sparse representations in your model; that is, they push it to do inference with less information.
Dropout layers work by randomly setting a fraction of the input tensor's elements to zero during training; dropout layers are always turned off for inference. This forces the model to learn from this masked, reduced input rather than relying on any particular activation.
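As a minimal sketch of a dropout layer in Keras (the dropout fraction and layer sizes are illustrative assumptions, not values from the course):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    # Randomly zero out 20% of the previous layer's outputs on each
    # training step; the mask is applied only in training mode.
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```

Keras applies the dropout mask only when the model is called in training mode (for example, during fit()), so no change is needed at inference time.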