- Dropout is an important technique for regularization.
Imagine that you have one layer that connects to another layer.
The values that go from one layer to the next are often called activations.
Take those activations and, for every example you train your network on, randomly set half of them to 0.
Then we get the next figure:
Completely at random, you basically take half of the data that's flowing through your network and just destroy it.
Then, on the next example, you destroy a different random half.
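The zero-out step above can be sketched in a few lines. This is a minimal illustration, assuming the activations are a NumPy array and a 50% drop probability; the function name `dropout` and the `p_drop` parameter are just names chosen for this sketch:

```python
import numpy as np

def dropout(activations, p_drop=0.5, rng=None):
    """Randomly zero out roughly a fraction p_drop of the activations."""
    rng = np.random.default_rng() if rng is None else rng
    # Each entry is kept with probability (1 - p_drop), dropped (set to 0) otherwise.
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask

acts = np.ones((4, 8))
dropped = dropout(acts, p_drop=0.5)
# About half the entries of `dropped` are now 0, a fresh random half on every call.
```

Each call draws a new random mask, which is exactly the "and then randomly again" part: every training example sees a different half of its activations destroyed.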
So what happens with dropout?
Your network can never rely on any given activation to be present, because it might be squashed to zero at any given moment.
So it is forced to learn a redundant representation for everything, to make sure that at least some of the information remains. It's like a game of whack-a-mole: one activation gets smashed, but there are always one or more others doing the same job that don't get killed, so everything still works in the end.
Forcing your network to learn redundant representations might sound very inefficient.
But in practice, it makes things more robust and prevents overfitting.
It also makes your network act as if taking the consensus over an ensemble of networks, which is always a good way to improve performance.
If dropout doesn't work for you, you should probably be using a bigger network.
??How to do evaluation when using dropout
When you evaluate a network that's been trained with dropout, you obviously don't want this randomness anymore; you want your evaluation results to be deterministic.
Instead, you want to take the consensus over these redundant models. You get the consensus opinion by averaging the activations. That is, you want the activation at evaluation time to equal the expected value of the (dropped-out) activations seen during training.
Here is a trick to make sure this expectation holds: during training, not only do you zero out the activations that you drop, but you also scale the remaining activations by a factor of 2 (for a 50% drop rate). That way, the expected value of each activation is unchanged, and at evaluation time you can simply use the activations as they are, with no dropout and no extra scaling.
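This train-time scaling trick (often called "inverted dropout") can be sketched as follows. This is an illustrative sketch, not any particular library's implementation; the function names and the `p_drop` parameter are assumptions of this example. Scaling the survivors by 1/(1 - p_drop) is a factor of 2 when p_drop = 0.5:

```python
import numpy as np

def dropout_train(activations, p_drop=0.5, rng=None):
    """Training: drop a fraction p_drop of activations and scale the
    survivors by 1/(1 - p_drop), so the expected activation is unchanged."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

def dropout_eval(activations):
    """Evaluation: no randomness and no scaling; thanks to the train-time
    scaling, the activations already match the training-time expectation."""
    return activations

# Check the expectation: all activations equal to 3.0 before dropout.
acts = np.full((1000, 100), 3.0)
train_out = dropout_train(acts, p_drop=0.5)
# The mean of train_out stays close to 3.0: half the entries are zeroed,
# but the surviving half are doubled, so the average is preserved.
```

Because the expectation is already correct, `dropout_eval` is just the identity: the deterministic evaluation pass needs no knowledge of the dropout rate used during training.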