PAC results derived from Hoeffding’s and Bernstein’s inequalities are the bread and butter of machine learning theory. In this post, we’ll see how to use Hoeffding’s inequality to derive agnostic PAC bounds with an error rate that decreases as $O(1/\sqrt{n})$ in the number of training samples $n$. Then, we’ll see that in the realizable setting we can use Bernstein’s inequality to improve this to $O(1/n)$.
Supervised Learning
Consider the task of regression with input space $\mathcal{X}$ and output space $\mathcal{Y}$. We have access to a model class (also known as a hypothesis family) $\mathcal{F}$. We will assume $\mathcal{F}$ to be finite for simplicity. There exists some unknown data distribution $D$ that is generating examples. Of course, we don’t have access to this data distribution, but we do have access to some samples from it. Let $S = \{(x_i, y_i)\}_{i=1}^n$ be $n$ independent and identically distributed (i.i.d.) samples from the data distribution, i.e., $(x_i, y_i) \sim D$, and the probability of observing a sample $S$ is given by $P(S) = \prod_{i=1}^n D(x_i, y_i)$.
To evaluate the performance of a model we need a loss function. Let $\ell: \mathcal{Y} \times \mathcal{Y} \to [0, 1]$ be a loss function. Given a datapoint $(x, y)$ and a model $f \in \mathcal{F}$, the loss associated with our prediction is given by $\ell(y, f(x))$. Beyond this boundedness, which the proofs below rely on, we will not make any other assumption about $\ell$.
Our aim is to find a function $f \in \mathcal{F}$ that is as good as possible. What is a good function? One that has the least expected loss under the data distribution, i.e., its generalization error $L(f)$ is as low as possible: $L(f) = \mathbb{E}_{(x, y) \sim D}[\ell(y, f(x))]$. The only signal we have for finding such an $f$ is the training data $S$. We define the training error $\hat{L}_S(f) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i))$. The aim of machine learning theory is to (i) allow us to find a “good” $f$, and (ii) bound the performance of the predicted model in terms of $n$ and other relevant parameters. The first result we will look at is the Occam’s Razor bound, and then we will see how to improve this result.
Occam’s Razor Bound
One of the first results taught in a machine learning theory class is the Occam’s Razor bound. The formal result is given in the theorem below. The Occam’s Razor bound suggests using the model that minimizes the error on the observed sample $S$, i.e., the solution of the following optimization: $\hat{f}_S \in \arg\min_{f \in \mathcal{F}} \hat{L}_S(f)$. This solution is called the empirical risk minimizer (ERM) or the ERM solution. We will drop the $S$ from the notation when it is clear from the context.
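As a minimal sketch of empirical risk minimization over a finite model class, the snippet below picks the model with the smallest training error. The threshold models, the 0-1 loss, and the toy sample are illustrative assumptions and are not part of the original post.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[float], float]
Sample = Sequence[Tuple[float, float]]


def zero_one_loss(y_true: float, y_pred: float) -> float:
    """A loss bounded in [0, 1]: 0 if the prediction matches, 1 otherwise."""
    return 0.0 if y_true == y_pred else 1.0


def training_error(model: Model, sample: Sample) -> float:
    """Average loss of `model` over the sample (the empirical risk)."""
    return sum(zero_one_loss(y, model(x)) for x, y in sample) / len(sample)


def erm(models: Sequence[Model], sample: Sample) -> Model:
    """Return a model with the smallest training error (first one wins ties)."""
    return min(models, key=lambda f: training_error(f, sample))


def make_threshold(t: float) -> Model:
    """Hypothetical model for illustration: predict 1 when x >= t, else 0."""
    return lambda x: 1.0 if x >= t else 0.0


# A finite model class of five threshold rules and a toy sample (assumed data).
models = [make_threshold(t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
sample = [(0.1, 0.0), (0.3, 0.0), (0.6, 1.0), (0.9, 1.0)]

best = erm(models, sample)
print(training_error(best, sample))
```

Because the class is finite, the minimization is a plain scan over the models; nothing here depends on the particular loss as long as it stays in $[0, 1]$.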
Theorem: Let $(\mathcal{X}, \mathcal{Y}, D, \ell)$ be a supervised learning problem with input space $\mathcal{X}$, output space $\mathcal{Y}$, data distribution $D$, and loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to [0, 1]$. We are given $n$ i.i.d. samples $S = \{(x_i, y_i)\}_{i=1}^n$ from the data distribution $D$. Let $\mathcal{F}$ be a finite collection of models from $\mathcal{X}$ to $\mathcal{Y}$. Write $L(f)$ for the generalization error, $\hat{L}_S(f)$ for the training error, and $\hat{f}_S$ for the ERM solution. Fix any $\delta \in (0, 1)$; then with probability at least $1 - \delta$ over samples $S$ drawn i.i.d. from $D$, we have:
$$\forall f \in \mathcal{F}: \quad |L(f) - \hat{L}_S(f)| \le \sqrt{\frac{1}{2n} \log \frac{2|\mathcal{F}|}{\delta}},$$ and
$$L(\hat{f}_S) \le \min_{f \in \mathcal{F}} L(f) + 2\sqrt{\frac{1}{2n} \log \frac{2|\mathcal{F}|}{\delta}}.$$
Proof: Fix any model in the model family, $f \in \mathcal{F}$. Let $Z_i = \ell(y_i, f(x_i))$. Then observe that $Z_i \in [0, 1]$ and $Z_1, \dots, Z_n$ are i.i.d. random variables. They are random variables with respect to the sample $S$, which is randomly chosen. Further, $\hat{L}_S(f) = \frac{1}{n} \sum_{i=1}^n Z_i$ and
$$\mathbb{E}[\hat{L}_S(f)] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}[Z_i] = L(f),$$
where the first equality follows from linearity of expectation and the second equality follows from the i.i.d. assumption. Then, using Hoeffding’s inequality, for any $t > 0$:
$$P\left(|\hat{L}_S(f) - L(f)| \ge t\right) \le 2\exp(-2nt^2).$$
Setting $\delta = 2\exp(-2nt^2)$, we get: $|\hat{L}_S(f) - L(f)| \le \sqrt{\frac{1}{2n} \log \frac{2}{\delta}}$ with probability at least $1 - \delta$. Let $E_f$ be the event that the inequality just derived holds; then we have $P(E_f) \ge 1 - \delta$, or in other words $P(E_f^c) \le \delta$ ($E^c$ represents the complement of a measurable set $E$). Even though we derived this result for a fixed $f$, we never used any property of $f$, and therefore it holds for every $f \in \mathcal{F}$. Therefore, we can bound the probability of $\bigcup_{f \in \mathcal{F}} E_f^c$ using the union bound:
$$P\Big(\bigcup_{f \in \mathcal{F}} E_f^c\Big) \le \sum_{f \in \mathcal{F}} P(E_f^c) \le |\mathcal{F}|\,\delta.$$
This means that with probability at least $1 - |\mathcal{F}|\,\delta$ we have:
$$\forall f \in \mathcal{F}: \quad |\hat{L}_S(f) - L(f)| \le \sqrt{\frac{1}{2n} \log \frac{2}{\delta}}.$$
Replacing $\delta$ by $\delta / |\mathcal{F}|$ proves the first result. Let $\hat{f}$ be the ERM solution and fix $f$ in $\mathcal{F}$. Then:
$L(\hat{f}) \le \hat{L}_S(\hat{f}) + \sqrt{\frac{1}{2n} \log \frac{2|\mathcal{F}|}{\delta}}$ (from Hoeffding’s inequality)
$\hat{L}_S(\hat{f}) \le \hat{L}_S(f)$ (from the definition of ERM)
$\hat{L}_S(f) \le L(f) + \sqrt{\frac{1}{2n} \log \frac{2|\mathcal{F}|}{\delta}}$ (from Hoeffding’s inequality)
This implies that, for any $f \in \mathcal{F}$, with probability at least $1 - \delta$: $L(\hat{f}) \le L(f) + 2\sqrt{\frac{1}{2n} \log \frac{2|\mathcal{F}|}{\delta}}$. As this holds for any $f \in \mathcal{F}$, we can pick the $f$ with the smallest value of $L(f)$. This proves the second result.
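The uniform deviation statement can be sanity-checked numerically. The sketch below assumes the standard Hoeffding-plus-union-bound radius $\sqrt{\log(2|\mathcal{F}|/\delta)/(2n)}$ and a made-up data distribution (uniform inputs, labels given by a threshold at 0.5 and flipped with probability 0.1) together with eleven threshold models; all of these choices are illustrative assumptions. It counts how often the worst-case gap between training and true error exceeds the radius.

```python
import math
import random


def uniform_deviation_bound(n: int, num_models: int, delta: float) -> float:
    """Radius from Hoeffding plus a union bound over a finite class."""
    return math.sqrt(math.log(2 * num_models / delta) / (2 * n))


def true_error(t: float, noise: float = 0.1) -> float:
    """Exact 0-1 error of the rule 1[x >= t] under the assumed distribution."""
    gap = abs(t - 0.5)  # mass where the rule disagrees with the clean label
    return noise * (1 - gap) + (1 - noise) * gap


def draw_sample(n: int, rng: random.Random, noise: float = 0.1):
    """Draw n i.i.d. points: x uniform on [0, 1), label flipped w.p. `noise`."""
    sample = []
    for _ in range(n):
        x = rng.random()
        clean = 1 if x >= 0.5 else 0
        y = 1 - clean if rng.random() < noise else clean
        sample.append((x, y))
    return sample


def training_error(t: float, sample) -> float:
    """Fraction of sample points misclassified by the rule 1[x >= t]."""
    return sum((1 if x >= t else 0) != y for x, y in sample) / len(sample)


def violation_rate(n: int, delta: float, trials: int, seed: int = 0) -> float:
    """Fraction of trials where some model deviates by more than the bound."""
    rng = random.Random(seed)
    thresholds = [k / 10 for k in range(11)]
    eps = uniform_deviation_bound(n, len(thresholds), delta)
    violations = 0
    for _ in range(trials):
        sample = draw_sample(n, rng)
        worst = max(abs(training_error(t, sample) - true_error(t))
                    for t in thresholds)
        violations += worst > eps
    return violations / trials


rate = violation_rate(n=200, delta=0.05, trials=200)
print(rate)
```

The theorem only promises a violation rate of at most $\delta$; since Hoeffding’s inequality and the union bound are both conservative, the observed rate is typically far below that.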
There are a couple of things that stand out in this result:
- More training data helps: Formally, the bound improves as $O(1/\sqrt{n})$ in the sample size $n$. Informally, this means that as we get more training data, we will learn a model whose generalization error has a better upper bound.
- Complex model class can be undesirable: The bound becomes weaker as the model family $\mathcal{F}$ becomes larger. This is an important lesson in model regularization: if we have two model classes $\mathcal{F}_1, \mathcal{F}_2$ that have the same training error on a sample $S$, and $\mathcal{F}_2$ is a more complex class than $\mathcal{F}_1$, i.e., $|\mathcal{F}_2| > |\mathcal{F}_1|$, then the bound on the generalization error for $\mathcal{F}_2$ will be weaker. Note that the result does not say that the larger model class is always bad. We can get better bounds with a more complex model class provided we can appropriately improve the training error.
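Both observations can be read off numerically. The sketch below assumes the deviation term has the form $\sqrt{\log(2|\mathcal{F}|/\delta)/(2n)}$; the function names and the example values of $n$, $|\mathcal{F}|$, and $\delta$ are hypothetical choices for illustration.

```python
import math


def occam_bound(n: int, num_models: int, delta: float) -> float:
    """Deviation term of the Occam's Razor bound for a finite class."""
    return math.sqrt(math.log(2 * num_models / delta) / (2 * n))


def samples_needed(eps: float, num_models: int, delta: float) -> int:
    """Smallest n for which the deviation term is at most eps."""
    return math.ceil(math.log(2 * num_models / delta) / (2 * eps ** 2))


# Quadrupling the sample size halves the bound (the 1/sqrt(n) rate).
for n in (100, 400, 1600):
    print(n, round(occam_bound(n, num_models=1000, delta=0.05), 4))

# A larger class only costs a logarithmic factor, but it does cost something.
for size in (10, 1000, 100000):
    print(size, round(occam_bound(400, num_models=size, delta=0.05), 4))
```

Inverting the bound, as `samples_needed` does, shows that the sample size required for a target accuracy grows like $1/\epsilon^2$ but only logarithmically in the size of the class.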
Improving Occam’s Razor Bound with Bernstein’s Inequality
Bernstein’s inequality allows us to incorporate the variance of the random variables into the bound, thereby giving us a tighter bound in low-variance settings.
Theorem (Bernstein’s Inequality): Let $X_1, \dots, X_n$ be $n$ i.i.d. random variables with mean $\mu$ and variance $\sigma^2$. Further, let $|X_i - \mu| \le M$ almost surely, for all $i$. Then, for any $t > 0$:
$$P\left(\left|\frac{1}{n}\sum_{i=1}^n X_i - \mu\right| > t\right) \le 2\exp\left(-\frac{n t^2}{2\sigma^2 + \frac{2}{3} M t}\right).$$
If the variance were negligible, $\sigma^2 \approx 0$, then we would get $P\left(\left|\frac{1}{n}\sum_{i=1}^n X_i - \mu\right| > t\right) \le 2\exp\left(-\frac{3nt}{2M}\right)$. This bound, if we repeat what we did for the Occam’s Razor bound, will give us an $O(1/n)$ bound.
Before we proceed to derive an $O(1/n)$ bound in the realizable setting, we will first state Bernstein’s inequality in its more familiar form. Setting the right-hand side to $\delta$ and solving the resulting quadratic equation in $t$ gives us:
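With probability at least $1 - \delta$ we have:
$$\left|\frac{1}{n}\sum_{i=1}^n X_i - \mu\right| \le \frac{M}{3n}\log\frac{2}{\delta} + \sqrt{\frac{M^2}{9n^2}\log^2\frac{2}{\delta} + \frac{2\sigma^2}{n}\log\frac{2}{\delta}}.$$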
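The quadratic step can be spelled out as follows. This is a sketch assuming the two-sided form of Bernstein’s inequality with exponent $-n t^2 / (2\sigma^2 + \tfrac{2}{3} M t)$; the shorthand $c$ is notation introduced only for this derivation.

```latex
\begin{aligned}
2\exp\!\left(-\frac{n t^2}{2\sigma^2 + \tfrac{2}{3} M t}\right) = \delta
  &\iff n t^2 - \frac{2 M c}{3}\, t - 2\sigma^2 c = 0,
  \qquad c = \log\frac{2}{\delta}, \\
t &= \frac{1}{2n}\left(\frac{2 M c}{3}
     + \sqrt{\frac{4 M^2 c^2}{9} + 8 n \sigma^2 c}\right)
   = \frac{M c}{3n} + \sqrt{\frac{M^2 c^2}{9 n^2} + \frac{2\sigma^2 c}{n}} .
\end{aligned}
```

Only the positive root is kept, since $t$ is a deviation and must be positive.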
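We now apply the inequality $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ (for $a, b \ge 0$) to get:
$$\left|\frac{1}{n}\sum_{i=1}^n X_i - \mu\right| \le \frac{2M}{3n}\log\frac{2}{\delta} + \sqrt{\frac{2\sigma^2}{n}\log\frac{2}{\delta}}.$$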
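To see when the variance term pays off, the sketch below compares the Hoeffding deviation for variables in $[0, 1]$ with a Bernstein-style deviation of the form $\frac{2M}{3n}\log\frac{2}{\delta} + \sqrt{\frac{2\sigma^2}{n}\log\frac{2}{\delta}}$. The function names and the example values are assumptions made for illustration.

```python
import math


def hoeffding_deviation(n: int, delta: float) -> float:
    """Hoeffding deviation for i.i.d. variables taking values in [0, 1]."""
    return math.sqrt(math.log(2 / delta) / (2 * n))


def bernstein_deviation(n: int, delta: float, variance: float,
                        M: float = 1.0) -> float:
    """Bernstein deviation: a 1/n range term plus a variance-weighted term."""
    c = math.log(2 / delta)
    return 2 * M * c / (3 * n) + math.sqrt(2 * variance * c / n)


n, delta = 1000, 0.05
# Variance 0.25 is the largest possible for a variable in [0, 1].
for variance in (0.25, 0.01, 0.001, 0.0):
    print(variance,
          round(hoeffding_deviation(n, delta), 5),
          round(bernstein_deviation(n, delta, variance), 5))
```

At the maximal variance the Bernstein form is slightly worse than Hoeffding because of the extra range term, while for small variance it is dominated by the $1/n$ term and becomes much tighter.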
Theorem: Let $(\mathcal{X}, \mathcal{Y}, D, \ell)$ be a supervised learning problem with input space $\mathcal{X}$, output space $\mathcal{Y}$, data distribution $D$, and loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to [0, 1]$. Let there be $f^\star \in \mathcal{F}$ such that $L(f^\star) = 0$. We are given $n$ i.i.d. samples $S = \{(x_i, y_i)\}_{i=1}^n$ from the data distribution $D$. Let $\mathcal{F}$ be a finite collection of models from $\mathcal{X}$ to $\mathcal{Y}$, and let $\hat{f}$ be the ERM solution. Fix any $\delta \in (0, 1)$; then with probability at least $1 - \delta$ we have:
$$L(\hat{f}) \le \frac{20}{3n} \log \frac{2|\mathcal{F}|}{\delta}.$$
Proof: Fix an $f \in \mathcal{F}$ and define $Z_i = \ell(y_i, f(x_i))$ as before. This time we will apply Bernstein’s inequality. We have $|Z_i - L(f)| \le 1$ as the loss function is bounded in $[0, 1]$. Bernstein’s inequality (with $M = 1$) gives us:
$$|\hat{L}_S(f) - L(f)| \le \frac{2}{3n}\log\frac{2}{\delta} + \sqrt{\frac{2\sigma^2}{n}\log\frac{2}{\delta}},$$
with probability at least $1 - \delta$. All we need to do is bound the variance $\sigma^2 = \mathrm{Var}(Z_i)$. As the loss function is bounded in $[0, 1]$, we have $Z_i^2 \le Z_i$. This gives us $\sigma^2 \le \mathbb{E}[Z_i^2] \le \mathbb{E}[Z_i] = L(f)$. Plugging this in we get:
$$|\hat{L}_S(f) - L(f)| \le \frac{2}{3n}\log\frac{2}{\delta} + \sqrt{\frac{2 L(f)}{n}\log\frac{2}{\delta}}.$$
Applying the AM-GM inequality, we have $\sqrt{L(f) \cdot \frac{2}{n}\log\frac{2}{\delta}} \le \frac{L(f)}{2} + \frac{1}{n}\log\frac{2}{\delta}$, giving us
$$|\hat{L}_S(f) - L(f)| \le \frac{L(f)}{2} + \frac{5}{3n}\log\frac{2}{\delta},$$
with probability at least $1 - \delta$. Applying the union bound over all hypotheses, and replacing $\delta$ by $\delta / |\mathcal{F}|$, we get:
$$\forall f \in \mathcal{F}: \quad |\hat{L}_S(f) - L(f)| \le \frac{L(f)}{2} + \frac{5}{3n}\log\frac{2|\mathcal{F}|}{\delta},$$
with probability at least $1 - \delta$. Let $\hat{f}$ be the ERM solution; then:
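$L(\hat{f}) \le \hat{L}_S(\hat{f}) + \frac{L(\hat{f})}{2} + \frac{5}{3n}\log\frac{2|\mathcal{F}|}{\delta}$ (Bernstein’s inequality)
$\hat{L}_S(\hat{f}) \le \hat{L}_S(f^\star)$ (ERM solution)
$\hat{L}_S(f^\star) \le L(f^\star) + \frac{L(f^\star)}{2} + \frac{5}{3n}\log\frac{2|\mathcal{F}|}{\delta}$ (Bernstein’s inequality)
$L(f^\star) = 0$ (realizability condition)
Combining these we get $\frac{1}{2} L(\hat{f}) \le \frac{10}{3n}\log\frac{2|\mathcal{F}|}{\delta}$, i.e., $L(\hat{f}) \le \frac{20}{3n}\log\frac{2|\mathcal{F}|}{\delta}$, with probability at least $1 - \delta$. Hence, proved.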
So what is happening? We are making a realizability assumption, i.e., there is a function $f^\star \in \mathcal{F}$ such that $L(f^\star) = 0$. As the loss is lower bounded by 0 and its mean is 0, the loss of $f^\star$ is 0 almost surely, so its variance is 0. Now, as our ERM solution gets better with more data, it converges towards this $f^\star$ whose variance is 0. Therefore, the variance of the ERM solution’s loss (which is bounded by its mean $L(\hat{f})$) becomes smaller. This helps in bounding the rate of convergence.
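The difference between the two rates is easy to tabulate. The sketch below assumes an agnostic excess-risk term of $2\sqrt{\log(2|\mathcal{F}|/\delta)/(2n)}$ and a realizable bound of $\frac{20}{3n}\log\frac{2|\mathcal{F}|}{\delta}$; the constant $20/3$ in particular is an assumption of this sketch and depends on how the AM-GM step is split.

```python
import math


def agnostic_excess_risk(n: int, num_models: int, delta: float) -> float:
    """Hoeffding-based excess-risk term, decaying as 1/sqrt(n)."""
    return 2 * math.sqrt(math.log(2 * num_models / delta) / (2 * n))


def realizable_risk(n: int, num_models: int, delta: float) -> float:
    """Bernstein-based bound in the realizable case, decaying as 1/n.

    The constant 20/3 is an assumed value, not a tight one.
    """
    return 20 * math.log(2 * num_models / delta) / (3 * n)


for n in (100, 10_000, 1_000_000):
    print(n,
          round(agnostic_excess_risk(n, 1000, 0.05), 5),
          round(realizable_risk(n, 1000, 0.05), 5))
```

With 100 times more data the agnostic term shrinks by a factor of 10 while the realizable bound shrinks by a factor of 100. For small $n$ the realizable bound can be the larger of the two because of its constant, so the improvement is an asymptotic one.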
Summary: We saw how Hoeffding’s inequality gives us an $O(1/\sqrt{n})$ bound in the number of samples $n$. When we are in the realizable case, we can get $O(1/n)$ using Bernstein’s inequality.