
  • 00:00

    ANNOUNCER: The following program is brought to you by Caltech.

  • 00:17

    YASER ABU-MOSTAFA: Welcome back. Last time, we introduced the learning problem. And if you have an application in your domain that you wonder if machine learning is the right technique for it, we found that there are three criteria that you should check.

  • 00:37

    You should ask yourself: is there a pattern to begin with that we can learn? And we realized that this condition can be intuitively met in many applications, even if we don't know mathematically what the pattern is. The example we gave was the credit card approval. There is clearly a pattern-- if someone has a particular salary, has been in a residence for so long, has that much debt, and so on, that this is somewhat correlated to their credit behavior. And therefore, we know that the pattern exists in spite of the fact that we don't know exactly what the pattern is.

  • 01:13

    The second item is that we cannot pin down the pattern mathematically, like the example I just gave. And this is why we resort to machine learning.

  • 01:23

    The third one is that we have data that represents that pattern. In the case of the credit application, for example, there are historical records of previous customers, and we have the data they wrote in their application when they applied, and we have some years' worth of records of their credit behavior. So we have data that are going to enable us to correlate what they wrote in the application to their eventual credit behavior, and that is what we are going to learn from.

  • 01:52

    Now, if you look at the three criteria, basically there are two that you can do without, and one that is absolutely essential.

  • 02:00

    What do I mean? Let's say that you don't have a pattern. Well, if you don't have a pattern, then you can try learning. And the only problem is that you will fail.

  • 02:13

    That doesn't sound very encouraging. But the idea here is that, when we develop the theory of learning, we will realize that you can apply the technique regardless of whether there is a pattern or not. And you are going to determine whether there's a pattern or not.

  • 02:30

    So you are not going to be fooled and think, I learned, and then give the system to your customer, and the customer will be disappointed. There is something you can actually measure that will tell you whether you learned or not. So if there's no pattern, there is no harm done in trying machine learning.

  • 02:46

    The other one, also, you can do without. Let's say that we can pin the thing down mathematically. Well, in that case, machine learning is not the recommended technique. It will still work. It may not be the optimal technique.

  • 02:58

    If you can outright program it, and find the result perfectly, then why bother generating examples, and trying to learn, and going through all of that? But machine learning is not going to refuse. It is going to learn, and it is going to give you a system. It may not be the best system in this case, but it's a system nonetheless.

  • 03:15

    The third one, I'm afraid you cannot do without. You have to have data. Machine learning is about learning from data. And if you don't have data, there is absolutely nothing you can do.

  • 03:26

    So this is basically the picture about the context of machine learning.

  • 03:31

    Now, we went on to focus on one type, which is supervised learning. And in the case of supervised learning, we have a target function. The target function we are going to call f. That is our standard notation. And this corresponds, for example, to the credit application. x is your application, and f of x is whether you are a good credit risk or not, for the bank.

  • 03:57

    So if you look at the target function, the main criterion about the target function is that it's unknown. This is a property that we are going to insist on. And obviously, unknown is a very generous assumption, which means that you don't have to worry about what pattern you are trying to learn. It could be anything, and you will learn it-- if we manage to do that. There's still a question mark about that.

  • 04:21

    But it's a good assumption to have, or lack of assumption, if you will, because then you know that you don't worry about the environment that generated the examples. You only worry about the system that you use to implement machine learning.

  • 04:35

    Now, you are going to be given data. And the reason it's called supervised learning is that you are not only given the input x's, as you can see here. You're also given the output-- the target outputs. So in spite of the fact that the target function is generally unknown, it is known on the data that I give you. This is the data that you are going to use as training examples, and that you are going to use to figure out what the target function is.

  • 04:59

    So in the case of supervised learning, you have the targets explicitly. In the other cases, you have less information than the target, and we talked about it-- like unsupervised learning, where you don't have anything, and reinforcement learning, where you have partial information, which is just a reward or punishment for a choice of a value of y that may or may not be the target.

  • 05:19

    Finally, you have the solution tools. These are the things that you're going to choose in order to solve the problem, and they are called the learning model, as we discussed. They are the learning algorithm and the hypothesis set.

  • 05:30

    And the learning algorithm will produce a hypothesis-- the final hypothesis, the one that you are going to give your customer, and we give the symbol g for that. And hopefully g approximates f, the actual target function, which remains unknown. And g is picked from a hypothesis set, and the generic symbol for a member of the hypothesis set is h. So h is a generic hypothesis. The one you happen to pick, you are going to call g.

  • 05:59

    Now, we looked at an example of a learning algorithm. First, the learning model-- the perceptron itself, which is a linear function, thresholded. That happens to be the hypothesis set. And then, there is an algorithm that goes with it that chooses which hypothesis to report based on the data.

  • 06:17

    And the hypothesis in this case is represented by the purple line. Different hypotheses in the set H will result in different lines. Some of them are good and some of them are bad, in terms of correctly separating the examples, which are the pluses and minuses.

  • 06:32

    And we found that there's a very simple rule to adjust the current hypothesis, while the algorithm is still running, in order to get a better hypothesis. And once you have all the points classified correctly, which is guaranteed in the case of the perceptron learning algorithm if the data was linearly separable in the first place, then you will get there, and that will be the g that you are going to report.
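
The "very simple rule" described above can be sketched in a few lines. This is not the course's own code; it is a minimal illustration in which the data, the hidden target weights, and the random seed are all made up for the demonstration. The update w ← w + y·x on a misclassified point is the perceptron learning algorithm, and on linearly separable data the loop is guaranteed to terminate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data: labels come from a hidden "target" line,
# which plays the role of the unknown f (we use it only to generate y).
X = rng.uniform(-1, 1, size=(50, 2))
X = np.hstack([np.ones((50, 1)), X])     # prepend x0 = 1 for the threshold term
w_target = np.array([0.1, 1.0, -1.0])    # hypothetical target weights
y = np.sign(X @ w_target)

# Perceptron learning algorithm: while any point is misclassified,
# pick one and nudge the current hypothesis toward classifying it correctly.
w = np.zeros(3)                          # current hypothesis h
while True:
    misclassified = np.nonzero(np.sign(X @ w) != y)[0]
    if len(misclassified) == 0:
        break                            # all points correct: report g = w
    i = rng.choice(misclassified)
    w = w + y[i] * X[i]                  # the simple adjustment rule
```

After the loop exits, w is the final hypothesis g: a line that separates all the pluses from all the minuses in the training sample.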

  • 06:54

    Now, we ended the lecture on sort of a sad note, because after all of this encouragement about learning, we asked ourselves: well, can we actually learn?

  • 07:04

    So we said it's an unknown function. An unknown function is an attractive assumption, as I said. But can we learn an unknown function, really? And then we realized that if you look at it, it's really impossible.

  • 07:16

    Why is it impossible? Because I'm going to give you a finite data set, and I'm going to give you the value of the function on this set. Good. Now, I'm going to ask you: what is the function outside that set? How in the world are you going to tell what the function is outside, if the function is genuinely unknown? Couldn't it assume any value it wants? Yes, it can. I can give you 1000 points, a million points, and on the million-and-first point, still the function can behave any way it wants.

  • 07:49

    So it doesn't look like the statement we made is feasible in terms of learning, and therefore we have to do something about it. And what we are going to do about it is the subject of this lecture.

  • 08:03

    Now, the lecture is called Is Learning Feasible? And I am going to address this question in extreme detail from beginning to end. This is the only topic for this lecture.

  • 08:19

    Now, if you want an outline-- it's really a logical flow. But if you want to cluster it into points-- we are going to start with a probabilistic situation, a very simple probabilistic situation. It doesn't seem to relate to learning. But it will capture the idea-- can we say something outside the sample data that we have? So we're going to answer it in a way that is concrete, and where the mathematics is very friendly.

  • 08:44

    And then after that, I'm going to be able to relate that probabilistic situation to learning as we stated it. It will take two stages. First, I will just translate the expressions into something that relates to learning, and then we will move forward and make it correspond to real learning. That's the last one.

  • 09:04

    And then after we do that, and we think we are done, we find that there is a serious dilemma that we have. And we will find a solution to that dilemma, and then declare victory-- that indeed, learning is feasible in a very particular sense.

  • 09:18

    So let's start with the experiment that I talked about. Consider the following situation. You have a bin, and the bin has marbles. The marbles are either red or green. That's what it looks like. And we are going to do an experiment with this bin. And the experiment is to pick a sample from the bin-- some marbles.

  • 09:44

    Let's formalize what the probability distribution is. There is a probability of picking a red marble, and let's call it mu. So now you think of mu as the probability of a red marble.

  • 09:59

    Now, the bin is really just a visual aid to make us relate to the experiment. You can think of this abstractly as a binary experiment-- two outcomes, red or green. Probability of red is mu, independently from one point to another. If you want to stick to the bin, you can say the bin has an infinite number of marbles and the fraction of red marbles is mu. Or maybe it has a finite number of marbles, and you are going to pick the marbles, but replace them. But the idea now is that every time you reach in the bin, the probability of picking a red marble is mu. That's the rule.

  • 10:33

    Now, there's a probability of picking a green marble. And what might that be? That must be 1 minus mu. So that's the setup.

  • 10:44

    Now, the value of mu is unknown to us. So in spite of the fact that you can look at this particular bin and see there's less red than green, so mu must be small, and all of that-- you don't have that advantage in reality. The bin is opaque-- it's sitting there, and I reach for it like this.

  • 11:03

    So now that I declare mu is unknown, you probably see where this is going. Unknown is a famous word from last lecture, and that will be the link to what we have.

  • 11:14

    Now, we pick N marbles independently. Capital N. And I'm using the same notation for N, which is the number of data points in learning, deliberately.

  • 11:26

    So the sample will look like this. And it will have some red and some green. It's a probabilistic situation. And we are going to call the fraction of red marbles in the sample-- this now is a probabilistic quantity. mu is an unknown constant sitting there. If you pick a sample, and someone else picks a sample, you will have a different in-sample frequency from the other person. And we are going to call it nu.
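
The experiment just described is easy to run on a computer. In this sketch, the values mu = 0.6 and N = 1000 are assumed purely for illustration (in reality mu is unknown-- here we fix it only so we can simulate the bin), and the seed is arbitrary.

```python
import random

random.seed(1)

mu = 0.6   # probability of a red marble (unknown in reality; fixed here to simulate)
N = 1000   # sample size, capital N, as in the number of data points in learning

# Reach into the bin N times independently; each pick is red with probability mu.
sample = [1 if random.random() < mu else 0 for _ in range(N)]

# nu is the in-sample fraction of red marbles -- a random quantity:
# a different sample would give a different nu.
nu = sum(sample) / N
```

Running this repeatedly (or with different seeds) gives a different nu each time, but each one lands near mu-- which is exactly the question the lecture turns to next.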

  • 11:56

    Now, interestingly enough, nu also should appear in the figure. It says nu equals fraction of red marbles. So that's where it lies. Here is nu!

  • 12:07

    For some reason that I don't understand, the app wouldn't show nu in the figures. So I decided maybe the app is actually a machine learning expert. It doesn't like things in sample. It only likes things that are real. So it knows that nu is not important. It's not an indication. We are really interested in knowing what's outside. So it kept the mu, but actually deleted the nu. At least that's what we are going to believe for the rest of the lecture.

  • 12:32

    Now, this is the bin.
    Now, this is the bin.

  • 12:35

    So now, the next step is to ask ourselves the question we asked in
    So now, the next step is to ask ourselves the question we asked in

  • 12:40

    machine learning.
    machine learning.

  • 12:41

    Does nu, which is the sample frequency, tell us anything about mu,
    Does nu, which is the sample frequency, tell us anything about mu,

  • 12:47

    which is the actual frequency in the bin that we are interested in knowing?
    which is the actual frequency in the bin that we are interested in knowing?

  • 12:53

    The short answer--
    The short answer--

  • 12:56

    this is to remind you what it is.
    this is to remind you what it is.

  • 12:58

    The short answer is no.
    The short answer is no.

  • 13:02

    Why?
    Why?

  • 13:04

    Because the sample can be mostly green, while the bin is mostly red.
    Because the sample can be mostly green, while the bin is mostly red.

  • 13:14

    Anybody doubts that?
    Anybody doubts that?

  • 13:18

    The thing could have 90% red, and I pick 100 marbles, and all
    The thing could have 90% red, and I pick 100 marbles, and all

  • 13:26

    of them happen to be green.

  • 13:28

    This is possible, correct?

  • 13:31

    So if I ask you what is actually mu, you really don't know from the sample.

  • 13:35

    You don't know anything about the marbles you did not pick.

  • 13:40

    Well, that's the short answer.

  • 13:42

    The long answer is yes.

  • 13:47

    Not because no and yes, but this is more elaborate.

  • 13:50

    We have to really discuss a lot in order to get there.

  • 13:54

    So why is it yes?

  • 13:55

    Because if you know a little bit about probability, you realize that if the

  • 14:03

    sample is big enough, the sample frequency, which is nu-- the mysterious

  • 14:09

    disappearing quantity here-- that is likely to be close to mu.

  • 14:16

    Think of a presidential poll.

  • 14:19

    There are maybe 100 million or more voters in the US, and you make a poll

  • 14:25

    of 3000 people.

  • 14:26

    You have 3000 marbles, so to speak.

  • 14:29

    And you look at the result in the marbles, and you tell me how the 100

  • 14:32

    million will vote.

  • 14:33

    How the heck did you know that?
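The polling intuition is easy to check with a quick simulation. This sketch uses an invented bin frequency mu and the 3000-person sample size mentioned above; it just draws marbles independently and compares the sample frequency to the bin frequency:

```python
import random

random.seed(0)

# A "bin" with a red-marble frequency mu; the value is invented for
# illustration -- in the actual setup, mu is unknown.
mu = 0.52
N = 3000   # sample size, like the 3000-person poll

# Draw N marbles independently; each is red with probability mu.
sample = [random.random() < mu for _ in range(N)]
nu = sum(sample) / N   # the sample frequency

# For most samples of this size, nu lands close to mu.
print(nu)
```
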

  • 14:36

    So now the statistics come in.

  • 14:37

    That's where the probability plays a role.

  • 14:40

    And the main distinction between the two answers is

  • 14:43

    possible versus probable.

  • 14:46

    In science and in engineering, you go a huge distance by settling for not

  • 14:52

    absolutely certain, but almost certain.

  • 14:55

    It opens a world of possibilities, and this is one of the

  • 14:57

    possibilities that it opens.

  • 15:01

    So now we know that, from a probabilistic point of view, nu does

  • 15:06

    tell me something about mu.

  • 15:08

    The sample frequency tells me something about the bin.

  • 15:10

    So what does it exactly say?

  • 15:13

    Now we go into a mathematical formulation.

  • 15:19

    In words, it says: in a big sample, nu, the sample frequency,

  • 15:24

    should be close to mu, the bin frequency.

  • 15:28

    So now, the symbols that go with that-- what is a big sample?

  • 15:32

    Large N, our parameter N.

  • 15:35

    And how do we say that nu is close to mu?

  • 15:39

    We say that they are within epsilon.

  • 15:43

    That is our criterion.

  • 15:45

    Now, with this in mind, we are going to formalize this.

  • 15:50

    The formula that I'm going to show you is a formula that is going to

  • 15:54

    stay with us for the rest of the course.

  • 15:56

    I would like you to pay attention.

  • 15:57

    And I'm going to build it gradually.

  • 16:01

    We are going to say that the probability of something is small.

  • 16:09

    So we're going to say that it's less than or equal to, and hopefully the

  • 16:12

    right-hand side will be a small quantity.

  • 16:15

    Now if I am claiming that the probability of something is small, it

  • 16:18

    must be that that thing is a bad event.

  • 16:22

    I don't want it to happen.

  • 16:24

    So we have a probability of something bad happening being small.

  • 16:28

    What is a bad event in the context we are talking about?

  • 16:35

    It is that nu does not approximate mu well.

  • 16:40

    They are not within epsilon of each other.

  • 16:42

    And if you look at it, here you have nu minus mu in absolute value, so

  • 16:47

    that's the difference in absolute value.

  • 16:49

    That happens to be bigger than epsilon.

  • 16:51

    So that's bad, because that tells us that they are further apart than our

  • 16:56

    tolerance epsilon.

  • 16:57

    We don't want that to happen.

  • 16:59

    And we would like the probability of that happening to

  • 17:02

    be as small as possible.

  • 17:04

    Well, how small can we guarantee it?

  • 17:09

    Good news.

  • 17:11

    It's e to the minus N.

  • 17:13

    It's a negative exponential.

  • 17:16

    That is great, because negative exponentials tend to die very fast.

  • 17:20

    So if you get a bigger sample, this will be diminishingly small

  • 17:23

    probability.

  • 17:24

    So the probability of something bad happening will be very small, and we

  • 17:27

    can claim that, indeed, nu will be within epsilon of mu, and we will be

  • 17:33

    wrong only a minute fraction of the time.

  • 17:36

    But that's the good news.

  • 17:39

    Now the bad news--

  • 17:41

    ouch!

  • 17:43

    Epsilon is our tolerance.

  • 17:45

    If you're a very tolerant person, you say:

  • 17:47

    I just want nu and mu to be within, let's say, 0.1.

  • 17:52

    That's not very much to ask.

  • 17:54

    Now, the price you pay for that is that you plug into the exponent

  • 17:59

    not epsilon, but epsilon squared.

  • 18:01

    So that becomes 0.01.

  • 18:04

    0.01 will dampen N significantly, and you lose a lot of the benefit of the

  • 18:09

    negative exponential.

  • 18:11

    And if you are more stringent and you say, I really want nu

  • 18:15

    to be close to mu.

  • 18:15

    I am not fooling around here.

  • 18:17

    So I am going to pick epsilon to be 10 to the minus 6.

  • 18:20

    Good for you.

  • 18:22

    10 to the minus 6?

  • 18:23

    Pay the price for it.

  • 18:24

    You go here, and now that's 10 to the minus 12.

  • 18:27

    That will completely kill any N you will ever encounter.

  • 18:31

    So the exponent now will be around zero.

  • 18:34

    So this probability will be around 1, if that was the final answer.

  • 18:37

    That's not yet the final answer.

  • 18:38

    So now, you know that the probability is less than or equal to 1.

  • 18:41

    Congratulations!

  • 18:43

    You knew that already. [LAUGHTER]

  • 18:46

    Well, this is almost the formula, but it's not quite.

  • 18:51

    What we need is fairly trivial.

  • 18:54

    We just put 2 here, and 2 there.

  • 18:57

    Now, between you and me, I prefer the original formula,

  • 19:01

    without the 2's.

  • 19:03

    However, the formula with the 2's has the distinct advantage of being: true. [LAUGHTER]

  • 19:11

    So we have to settle for that.
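The bound being built here, P[|nu − mu| > epsilon] ≤ 2·exp(−2·epsilon²·N), is simple enough to evaluate numerically, and doing so shows exactly the epsilon-squared damping just described. The values of N and epsilon below are illustrative choices:

```python
import math

def hoeffding_bound(eps, n):
    """Right-hand side of the inequality being built:
    P[|nu - mu| > eps] <= 2 * exp(-2 * eps^2 * N)."""
    return 2 * math.exp(-2 * eps**2 * n)

N = 1000
# Tolerant: eps = 0.1, so eps^2 = 0.01; the bound is still tiny.
print(hoeffding_bound(0.1, N))
# Stringent: eps = 1e-6, so eps^2 = 1e-12; the exponent is around zero
# and the bound is around 2 -- i.e., it tells you nothing.
print(hoeffding_bound(1e-6, N))
```
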

  • 19:13

    Now that inequality is called Hoeffding's Inequality.

  • 19:20

    It is the main inequality we are going to be using in the course.

  • 19:24

    You can look for the proof.

  • 19:26

    It's a basic proof in mathematics.

  • 19:28

    It's not that difficult, but definitely not trivial.

  • 19:31

    And we are going to use it all the way-- and this is the same formula

  • 19:35

    that will get us to prove something about the VC dimension.

  • 19:38

    If the buzzword 'VC dimension' means anything to you, it will come from

  • 19:42

    this after a lot of derivation.

  • 19:44

    So this is the building block that you have to really know cold.

  • 19:51

    Now, if you want to translate the Hoeffding Inequality into words, what

  • 19:56

    we have been talking about is that we would like to make the

  • 20:00

    statement: mu equals nu.

  • 20:03

    That would be the ultimate.

  • 20:04

    I look at the in-sample frequency, that's the out-of-sample frequency.

  • 20:06

    That's the real frequency out there.

  • 20:09

    But that's not the case.

  • 20:10

    We actually are making the statement mu equals nu, but we're not

  • 20:15

    making the statement outright--

  • 20:16

    we are making a PAC statement.

  • 20:19

    And that stands for: this statement is probably, approximately, correct.

  • 20:30

    Probably because of this.

  • 20:34

    This is small, so the probability of violation is small.

  • 20:37

    Approximately because of this.

  • 20:39

    We are not saying that mu equals nu.

  • 20:42

    We are saying that they are close to each other.

  • 20:44

    And that theme will remain with us in learning.

  • 20:51

    So we put the glorified Hoeffding's Inequality at the top, and we spend

  • 20:55

    a viewgraph analyzing what it means.

  • 20:59

    In case you forgot what nu and mu are, I put the figure.

  • 21:02

    So mu is the frequency within the bin.

  • 21:08

    This is the unknown quantity that we want to determine.

  • 21:10

    And nu is the disappearing quantity which happens to be the frequency in

  • 21:13

    the sample you have.

  • 21:16

    So what about the Hoeffding Inequality?

  • 21:20

    Well, one attraction of this inequality is that it is valid for

  • 21:25

    every N, positive integer, and every epsilon which is greater than zero.

  • 21:31

    Pick any tolerance you want, and for any number of examples you

  • 21:35

    want, this is true.

  • 21:36

    It's not an asymptotic result.

  • 21:38

    It's a result that holds for every N and epsilon.

  • 21:41

    That's a very attractive proposition for something that has

  • 21:43

    an exponential in it.

  • 21:46

    Now, Hoeffding Inequality belongs to a large class of mathematical laws,

  • 21:50

    which are called the Laws of Large Numbers.

  • 21:54

    So this is one law of large numbers, one form of it, and

  • 21:57

    there are tons of them.

  • 21:58

    This happens to be one of the friendliest, because it's not

  • 22:01

    asymptotic, and happens to have an exponential in it.

  • 22:06

    Now, one observation here is that if you look at the left-hand side, we are

  • 22:12

    computing this probability.

  • 22:14

    This probability patently depends on mu.

  • 22:17

    mu appears explicitly in it, and also mu affects the probability

  • 22:22

    distribution of nu.

  • 22:23

    nu comes from the sample, the N marbles you picked.

  • 22:26

    That's a very simple binomial distribution.

  • 22:29

    You can find the probability that nu equals anything based on

  • 22:33

    the value of mu.
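As a sketch of this point, one can compute the left-hand side exactly from the binomial distribution for several values of mu, and check that the single, mu-free Hoeffding bound covers them all. The choices of N and epsilon here are arbitrary:

```python
import math

def exact_deviation_prob(mu, n, eps):
    """P[|nu - mu| > eps], computed exactly from the binomial
    distribution of the red-marble count k, where nu = k / n."""
    total = 0.0
    for k in range(n + 1):
        if abs(k / n - mu) > eps:
            total += math.comb(n, k) * mu**k * (1 - mu)**(n - k)
    return total

n, eps = 100, 0.1
bound = 2 * math.exp(-2 * eps**2 * n)   # the same bound for every mu

for mu in (0.2, 0.5, 0.8):
    p = exact_deviation_prob(mu, n, eps)
    assert p <= bound   # the bound holds uniformly, whatever mu is
    print(mu, p, bound)
```
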

  • 22:34

    So the probability that this quantity, which depends on mu, exceeds epsilon--

  • 22:40

    the probability itself does depend on mu.

  • 22:43

    However, we are not interested in the exact probability.

  • 22:46

    We just want to bound it.

  • 22:47

    And in this case, we are bounding it uniformly.

  • 22:50

    As you see, the right-hand side does not have mu in it.

  • 22:53

    And that gives us a great tool, because now we don't use the quantity

  • 22:59

    that, we already declared, is unknown.

  • 23:02

    mu is unknown.

  • 23:02

    It would be a vicious cycle if I go and say that it depends on mu,

  • 23:07

    but I don't know what mu is.

  • 23:08

    Now you know uniformly, regardless of the value of mu-- mu could be anything

  • 23:12

    between 0 and 1, and this will still be bounding the deviation of the

  • 23:16

    sample frequency from the real frequency.

  • 23:20

    That's a good advantage.

  • 23:23

    Now, the other point is that there is a trade-off that you can read off the

  • 23:28

    inequality.

  • 23:29

    What is the trade-off?

  • 23:32

    The trade-off is between N and epsilon.

  • 23:35

    In a typical situation, if we think of N as the number of examples that are

  • 23:39

    given to you-- the amount of data-- in this case, the number of marbles out

  • 23:43

    of the bin,

  • 23:44

    N is usually dictated.

  • 23:45

    Someone comes and gives you a certain resource of examples.

  • 23:49

    Epsilon is your taste in tolerance.

  • 23:52

    You are very tolerant. You pick epsilon equals 0.5.

  • 23:55

    That will be very easy to satisfy.

  • 23:58

    And if you are very stringent, you can pick epsilon smaller and smaller.

  • 24:01

    Now, because they get multiplied here, the smaller the epsilon is, the bigger

  • 24:08

    the N you need in order to compensate for it and come up with the same level

  • 24:12

    of probability bound.

  • 24:15

    And that makes a lot of sense.

  • 24:17

    If you have more examples, you are more sure that nu and mu will be close

  • 24:22

    together, even closer and closer and closer,

  • 24:25

    as you get larger N.

  • 24:27

    So this makes sense.
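The trade-off can also be read off quantitatively: fix a tolerance epsilon and a target value for the right-hand side (called delta below, a name not used in the lecture), and solve 2·exp(−2·epsilon²·N) ≤ delta for N. A sketch:

```python
import math

def examples_needed(eps, delta):
    """Smallest N with 2 * exp(-2 * eps^2 * N) <= delta,
    i.e. N >= ln(2 / delta) / (2 * eps^2)."""
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

# Because epsilon enters the exponent squared, halving the tolerance
# roughly quadruples the number of examples you need.
print(examples_needed(0.1, 0.05))    # eps = 0.1
print(examples_needed(0.05, 0.05))   # eps = 0.05: about 4x as many
```
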

  • 24:30

    Finally,

  • 24:31

    it's a subtle point, but it's worth saying.

  • 24:34

    We are making the statement that nu is approximately the same as mu.

  • 24:40

    And this implies that mu is approximately the same as nu.

  • 24:47

    What is this?

  • 24:50

    The logic here is a little bit subtle.

  • 24:53

    Obviously, the statement is a tautology, but I'm just making

  • 24:56

    a logical point, here.

  • 24:59

    When you run the experiment, you don't know what mu is.

  • 25:02

    mu is an unknown.

  • 25:03

    It's a constant.

  • 25:05

    The only random fellow in this entire operation is nu.

  • 25:09

    That is what the probability is with respect to.

  • 25:12

    You generate different samples, and you compute the probability.

  • 25:15

    This is the probabilistic thing.

  • 25:16

    This is a happy constant sitting there, albeit unknown.

  • 25:22

    Now, the way you are using the inequality is to infer mu, the bin,

  • 25:30

    from nu, the sample here.

  • 25:34

    That is not the cause and effect that actually takes place.

  • 25:38

    The cause and effect is that mu affects nu, not the other way around.

  • 25:43

    But we are using it the other way around.

  • 25:47

    Lucky for us, the form of the probability is symmetric.

  • 25:51

    Therefore, instead of saying that nu tends to be close to mu, which will

  • 25:58

    be the accurate logical statement-- mu is there, and nu has a tendency to be

  • 26:03

    close to it.

  • 26:04

    We, instead of that, say that I know already nu, and now mu tends to

  • 26:10

    be close to nu.

  • 26:12

    That's the logic we are using.

  • 26:19

    Now, I think we understand what the bin situation is, and we know what the

  • 26:25

    mathematical condition that corresponds to it is.

  • 26:28

    What I'd like to do,

  • 26:29

    I'd like to connect that to the learning problem we have.

  • 26:35

    In the case of a bin, the unknown quantity that we want to decipher is

  • 26:42

    a number, mu.

  • 26:43

    Just unknown.

  • 26:44

    What is the frequency inside the bin.

  • 26:48

    In the learning situation that we had, the unknown quantity we would like to

  • 26:53

    decipher is a full-fledged function.

  • 26:57

    It has a domain, X, that could be a 10-dimensional Euclidean space.

  • 27:02

    Y could be anything.

  • 27:03

    It could be binary, like the perceptron.

  • 27:04

    It could be something else.

  • 27:06

    That's a huge amount of information.

  • 27:08

    The bin has only one number.

  • 27:10

    This one, if you want to specify it, that's a lot of specification.

  • 27:13

    So how am I going to be able to relate the learning problem to something that

  • 27:18

    simplistic?

  • 27:23

    The way we are going to do it is the following.

  • 27:29

    Think of the bin as your input space in the learning problem.

  • 27:34

    That's the correspondence.

  • 27:35

    So every marble here is a point x.

  • 27:38

    That is a credit card applicant.

  • 27:41

    So if you look closely at the gray thing, you will read: salary, years in

  • 27:46

    residence, and whatnot.

  • 27:47

    You can't see it here because it's too small!

  • 27:50

    Now the bin has all the points in the space. Therefore, this

  • 27:57

    is really the space.

  • 27:59

    That's the correspondence in our mind.

  • 28:01

    Now we would like to give colors to the marbles.

  • 28:08

    So here are the colors.

  • 28:14

    There are green marbles, and they correspond to something in the

  • 28:18

    learning problem.

  • 28:19

    What do they correspond to?

  • 28:22

    They correspond to your hypothesis getting it right.

  • 28:30

    So what does that mean?

  • 28:31

    There is a target function sitting there, right?

  • 28:34

    You have a hypothesis.

  • 28:35

    The hypothesis is a full function, like the target function is.

  • 28:39

    You can compare the hypothesis to the target function on every point.

  • 28:45

    And they either agree or disagree.

  • 28:48

    If they agree, please color the corresponding point

  • 28:52

    in the input space--

  • 28:53

    color it green.

  • 28:56

    Now, I'm not saying that you know which ones are green and which ones

  • 28:59

    are not, because you don't know the target function overall.

  • 29:03

    I'm just telling you the mapping that takes an unknown target function into

  • 29:07

    an unknown mu.

  • 29:09

    So both of them are unknown, admittedly, but that's the

  • 29:12

    correspondence that maps it.

  • 29:14

    And now you go, and there are some red ones.

  • 29:17

    And, you guessed it.

  • 29:19

    You color the thing red if your hypothesis got the answer wrong.

  • 29:25

    So now I am collapsing the entire thing into just agreement and

  • 29:29

    disagreement between your hypothesis and the target function, and that's

  • 29:34

    how you get to color the bin.

  • 29:36

    Because of that, you have a mapping for every point, whether it's green or

  • 29:43

    red, according to this rule.
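The coloring rule can be sketched in code: fix one hypothesis h, compare it with the target f at sample points drawn from the input space, and the fraction of red points (disagreements) plays the role of nu. The particular f, h, and input distribution below are invented toy choices, used only so the simulation can run:

```python
import random

random.seed(1)

# Toy target and hypothesis on a 1-dimensional input space (invented
# for illustration; in reality f is unknown).
f = lambda x: 1 if x > 0.3 else -1   # target function
h = lambda x: 1 if x > 0.5 else -1   # our fixed hypothesis

# Draw N points from the input space (uniform on [0, 1) here).
N = 10000
xs = [random.random() for _ in range(N)]

# A point is "red" when h disagrees with f; nu is the red frequency.
nu = sum(h(x) != f(x) for x in xs) / N

# For this toy choice, the true disagreement probability is
# mu = P[0.3 < x <= 0.5] = 0.2, and nu should land close to it.
print(nu)
```
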

  • 29:45

    Now, this will add a component to the learning problem that we
    Now, this will add a component to the learning problem that we

  • 29:49

    did not have before.
    did not have before.

  • 29:52

    There is a probability associated with the bin.
    There is a probability associated with the bin.

  • 29:54

    There is a probability of picking a marble, and
    There is a probability of picking a marble, and

  • 29:57

    independently, and all of that.
    independently, and all of that.

  • 29:58

    When we talked about the learning problem, there was no probability.
    When we talked about the learning problem, there was no probability.

  • 30:01

    I will just give you a sample set, and that's what you work with.
    I will just give you a sample set, and that's what you work with.

  • 30:05

    So let's see what is the addition we need to do in order to adjust the
    So let's see what is the addition we need to do in order to adjust the

  • 30:09

    statement of the learning problem to accommodate the new ingredient.
    statement of the learning problem to accommodate the new ingredient.

  • 30:14

    And the new ingredient is important, because otherwise we cannot learn.
    And the new ingredient is important, because otherwise we cannot learn.

  • 30:16

    It's not like we have the luxury of doing without it.
    It's not like we have the luxury of doing without it.

  • 30:20

So we go back to the learning diagram from last time. Do you remember this one? Let me remind you. Here is your target function, and it's unknown. And I promised you last time that it will remain unknown, and the promise will be fulfilled. We are not going to touch this box. We're just going to add another box to accommodate the probability. The target function generates the training examples. These are the only things that the learning algorithm sees. It picks a hypothesis from the hypothesis set, and produces it as the final hypothesis, which hopefully approximates f. That's the game.

So what is the addition we are going to do? In the bin analogy, this is the input space. Now the input space has a probability. So I need to apply this probability to the points from the input space that are being generated. I am going to introduce a probability distribution over the input space. Now the points in the input space-- let's say the d-dimensional Euclidean space-- are not just generic points. There is a probability of picking one point versus another. And that is captured by the probability, which I'm going to call capital P.

Now the interesting thing is that I'm making no assumptions about P. P can be anything. I just want a probability. So invoke any probability you want, and I am ready with the machinery. I am not going to restrict the probability distributions over X. That's number one. So this is not as bad as it looks. Number two, I don't even need to know what P is. Of course, the choice of P will affect the probability of getting a green marble or a red marble-- the probabilities of the different marbles change, so it could change the value of mu. But the good news with the Hoeffding Inequality is that I could bound the performance independently of mu. So I can get away with not only any P, but with a P that I don't know, and I'll still be able to make the mathematical statement. So this is a very benign addition to the problem. And it will give us very high dividends, which is the feasibility of learning.

So what do you do with the probability? You use the probability to generate the points x_1 up to x_N. So now x_1 up to x_N are assumed to be generated by that probability independently. That's the only assumption that is made. If you make that assumption, we are in business. And the good news is, as I mentioned before, we did not compromise about the target function. You don't need to make assumptions about the function you don't know and want to learn, which is good news. And the addition is almost technical: there is a probability somewhere, generating the points. If I know that, then I can make a statement in probability. Obviously, you can make that statement only to the extent that the assumption is valid, and we can discuss in later lectures what happens when the assumption is not valid.
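The i.i.d. sampling assumption is all the Hoeffding machinery needs, whatever P happens to be. A minimal simulation can illustrate the point; the particular values of mu, N, and the tolerance epsilon below are arbitrary illustrative choices, not from the lecture:

```python
import math
import random

def in_sample_error(mu, N, rng):
    """Draw N marbles i.i.d. from a bin whose red fraction is mu; return nu."""
    return sum(rng.random() < mu for _ in range(N)) / N

rng = random.Random(0)
mu, N, eps = 0.3, 1000, 0.05   # illustrative values
trials = 2000

# Empirical frequency of the "bad event" |nu - mu| > eps over many samples...
bad = sum(abs(in_sample_error(mu, N, rng) - mu) > eps
          for _ in range(trials)) / trials

# ...versus the Hoeffding bound 2 exp(-2 eps^2 N), which needs no knowledge of mu or P.
bound = 2 * math.exp(-2 * eps**2 * N)
print(bad, bound)
```

Whatever mu you plug in, the empirical frequency of the bad event stays below the same bound, which is exactly why not knowing P (and hence mu) does no harm.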

So, OK. Happy ending. We are done, and we now have the correspondence. Are we done? Well, not quite. Why are we not done? Because the analogy I gave you requires a particular hypothesis in mind. I told you that the red and green marbles correspond to the agreement between h and the target function. So when you tell me what h is, you dictate the colors here. All of these colors. This is green not because it's inherently green, not because of anything inherent about the target function. It's because of the agreement between the target function and your hypothesis, h.

That's fine, but what is the problem? The problem is that I know that for this h, nu generalizes to mu. You're probably saying, yeah, but h could be anything. I don't see the problem yet. Now here is the problem. What we have actually discussed is not learning, it's verification. The situation as I describe it-- you have a single bin and you have red and green marbles, and this and that-- corresponds to the following. A bank comes to my office. We would like a formula for credit approval. And we have data. So instead of actually taking the data, and searching hypotheses, and picking one, like the perceptron learning algorithm, here is what I do that corresponds to what I just described. You guys want a linear formula? OK. I guess the salary should have a big weight. Let's say 2. The outstanding debt is negative, so that should get a weight of minus 0.5. And years in residence are important, but not that important. So let's give them a 0.1. And let's pick a threshold that is high, in order for you not to lose money. Let's pick a threshold of 0.5. Sitting down, improvising an h.
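The improvised formula is just a fixed linear threshold rule, which can be written down directly. Nothing here is learned from data; the input units and scaling are assumed for illustration:

```python
def approve_credit(salary, debt, years_in_residence):
    """The improvised hypothesis h: a linear score with dictated weights.
    Nothing was fit to data -- the weights were simply made up on the spot."""
    score = 2.0 * salary - 0.5 * debt + 0.1 * years_in_residence
    return score > 0.5   # high threshold, so the bank doesn't lose money
```

Whether this particular h happens to agree with the bank's data is pure luck, which is exactly the point being made.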

Now, after I fix the h, I ask you for the data and just verify whether the h I picked is good or bad. That I can do with the bin, because I'm going to look at the data. If I miraculously agree with everything in your data, I can definitely declare victory by Hoeffding. But what are the chances that this will happen in the first place? I have no control over whether I will be good on the data or not. The whole idea of learning is that I'm searching the space to deliberately find a hypothesis that works well on the data. In this case, I just dictated a hypothesis. And I was able to tell you for sure what happens out-of-sample. But I have no control over what news I'm going to tell you. You can come to my office. I improvise this. I go to the data. And I tell you: I have a fantastic system. It generalizes perfectly, and it does a terrible job. That's what I have, because when I tested it, nu was terrible. So that's not what we are looking for. What we are looking for is to make it learning.

So how do we do that? There is no guarantee that nu will be small. And we need to choose the hypothesis from multiple h's. That's the game. And in that case, you are going to go for the sample, so to speak, generated by every hypothesis, and then you pick the hypothesis that is most favorable, the one that gives you the least error. So now, that doesn't look like a difficult thing. It worked with one bin. Maybe I can have more than one bin, to accommodate the situation where I have more than one hypothesis. It looks plausible. So let's do that.

We will just take multiple bins. So here is the first bin. Now you can see that this is a bad bin. So that hypothesis is terrible. And the sample reflects that, to some extent. But we are going to have other bins, so let's call this something. This bin corresponds to a particular h. And since we are going to have other hypotheses, we are going to call this h_1, in preparation for the next guy. The next guy comes in, and you have h_2. And you have another mu_2. This one looks like a good hypothesis, and it's also reflected in the sample.

And it's important to look at the correspondence. If you look at the top red point here and the top green point here, this is the same point in the input space. It just was colored red here and colored green here. Why did that happen? Because the target function disagrees with this h, and the target function happens to agree with this h. That's what got this the color green. And when you pick a sample, the sample also will have different colors, because the colors depend on which hypothesis. And these are different hypotheses.

That looks simple enough. So let's continue. And we can have M of them. I am going to consider a finite number of hypotheses, just to make the math easy for this lecture. And we're going to go more sophisticated when we get into the theory of generalization. So now I have this. This is good. I have samples, and the samples here are different. And I can do the learning, and the learning now, abstractly, is to scan these samples looking for a good sample. And when you find a good sample, you declare victory, because of Hoeffding, and you say that it must be that the corresponding bin is good, and the corresponding bin happens to be the hypothesis you chose. So that is an abstraction of learning. That was easy enough.
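This scan-and-pick abstraction is easy to simulate, and the simulation already hints at the trouble ahead: even when every hypothesis is bad, the best-looking sample can look very good. All values below are illustrative assumptions, not from the lecture:

```python
import random

rng = random.Random(1)
M, N = 1000, 10      # M hypotheses (bins), a small sample of N marbles from each
mu = 0.5             # every hypothesis is truly bad: 50% out-of-sample error

def nu(rng):
    """In-sample error frequency for one bin: N i.i.d. marbles, red w.p. mu."""
    return sum(rng.random() < mu for _ in range(N)) / N

# "Learning": scan all M samples and keep the one with the least error.
best_in_sample = min(nu(rng) for _ in range(M))
print(best_in_sample)   # typically near 0, although every bin has mu = 0.5
```

The minimum over many samples is much smaller than any single sample's typical value, which is precisely what breaks the single-bin guarantee in the next part of the lecture.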

Now, because this is going to stay with us, I am now going to introduce the notation that will survive with us for the entire discussion of learning. So here is the notation. We realize that both mu, which happens to be inside the bin, and nu, which happens to be the sample frequency-- in this case, the sample frequency of error-- both of them depend on h. So I'd like to give a notation that makes that explicit. The first thing: I am going to call mu and nu by descriptive names. So nu, which is the frequency in the sample you have, is in-sample. That is a standard definition for what happens in the data that I give you. If you perform well in-sample, it means that your error in the sample that I give you is small. And because it is called in-sample, we are going to denote it by E_in. I think this is worth blowing up, because it's an important one. This is our standard notation for the error that you have in-sample. Now, we go and get the other one, which happens to be mu. And that is called out-of-sample. So if you are in this field, I guess what matters is the out-of-sample performance. That's the lesson. Out-of-sample means something that you haven't seen. And if you perform well out-of-sample, on something that you haven't seen, then you must have really learned. That's the standard for it, and the name for it is E_out.

With this in mind, we realize that we don't yet have the dependency on h which we need. So we are going to make the notation a little bit more elaborate, by calling E_in and E_out-- calling them E_in of h, and E_out of h. Why is that? Well, the in-sample performance-- you are trying to see the error of approximating the target function by your hypothesis. That's what E_in is. So obviously, it depends on your hypothesis. So it's E_in of h. Someone else picks another h, and they will get another E_in of their h. Similarly for E_out: the corresponding one is E_out of h. So now, what used to be nu is now E_in of h. What used to be mu, inside the bin, is E_out of h.

Now, the Hoeffding Inequality, which we know all too well by now, said that. So all I'm going to do is just replace the notation. And now it looks a little bit more crowded, but it's exactly the same thing. The probability that your in-sample performance deviates from your out-of-sample performance by more than your prescribed tolerance is less than or equal to a number that is hopefully small. And you can go back and forth. There's nu and mu, or you can go here and you get the new notation. So we're settled on the notation now.
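In the new notation, the statement being referred to is the standard two-sided Hoeffding bound, which for a single, fixed h reads:

```latex
\mathbb{P}\big[\,|E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\,\big] \;\le\; 2e^{-2\epsilon^2 N}
```

This is term-for-term the earlier inequality with nu replaced by E_in(h) and mu replaced by E_out(h).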

Now, let's go for the multiple bins and use this notation. These are the multiple bins as we left them. We have the hypotheses h_1 up to h_M, and we have the mu_1 and mu_M. And if you see 1, 2, M, again, this is a disappearing nu-- the symbol that the app doesn't like. But thank God we switched notations, so that something will appear. Yeah!

So right now, that's what we have. Every bin has an out-of-sample performance, and out-of-sample is: Out. Of. Sample. So this is a sample. What's in it is in-sample. What is not in it is out-of-sample. And the out-of-sample depends on h_1 here, h_2 here, and h_M here. And obviously, these quantities will be different according to the sample, and these quantities will be different according to the ultimate performance of your hypothesis.

So we solved the problem. It's not verification. It's not a single bin. It's real learning. I'm going to scan these. So that's pretty good. Are we done already? Not so fast.

[LAUGHING]

What's wrong? Let me tell you what's wrong. The Hoeffding Inequality, that we have happily studied and declared important and all of that, doesn't apply to multiple bins. What? You told us mathematics, and you go read the proof, and all of that. Are you just pulling tricks on us? What is the deal here? And you can even complain: we sat for 40 minutes now going from a single bin, mapping it to the learning diagram, mapping it to multiple bins, and now you tell us that the main tool we developed doesn't apply. Why doesn't it apply, and what can we do about it? Let me start by saying why it doesn't apply, and then we can go for what we can do about it.

Now, everybody has a coin. I hope the online audience has a coin ready. I'd like to ask you to take the coin out and flip it, let's say, five times. And record what happens. And when you at home flip the coin five times, please, if you happen to get all five heads in your experiment, then text us that you got all five heads. If you get anything else, don't bother to text us. We just want to know if someone will get five heads.

Everybody is done flipping the coin. Because you have been so generous and cooperative, you can keep the coin!

[LAUGHTER]

Now, did anybody get five heads? All five heads? Congratulations, sir. You have a biased coin, right? We just argued that in-sample corresponds to out-of-sample, and we have this Hoeffding thing, and therefore if you get five heads, it must be that this coin gives you heads. We know better. So in the online audience, what happened?

MODERATOR: Yeah, in the online audience, there's also five heads.

PROFESSOR: There are lots of biased coins out there. Are they really biased coins? No. What is the deal here? Let's look at it.

    Here, with the audience here, I didn't want to push my luck with 10 flips,
    Here, with the audience here, I didn't want to push my luck with 10 flips,

  • 46:03

    because it's a live broadcast.
    because it's a live broadcast.

  • 46:05

    So I said five will work.
    So I said five will work.

  • 46:07

    For the analytical example, let's take 10 flips.
    For the analytical example, let's take 10 flips.

  • 46:11

    Let's say you have a fair coin, which every coin is.
    Let's say you have a fair coin, which every coin is.

  • 46:14

    You have a fair coin.
    You have a fair coin.

  • 46:15

    And you toss it 10 times.
    And you toss it 10 times.

  • 46:18

    What is the probability that you will get all 10 heads?
    What is the probability that you will get all 10 heads?

  • 46:22

    Pretty easy.
    Pretty easy.

  • 46:23

    One half, times one half, 10 times, and that will give
    One half, times one half, 10 times, and that will give

  • 46:27

    you about 1 in 1000.
    you about 1 in 1000.

  • 46:30

    No chance that you will get it--
    No chance that you will get it--

  • 46:32

    not no chance, but very little chance.
    not no chance, but very little chance.

  • 46:35

    Now, the second question is the one we actually ran the experiment for.
    Now, the second question is the one we actually ran the experiment for.

  • 46:40

    If you toss 1000 fair coins-- it wasn't 1000 here. It's how many there.
    If you toss 1000 fair coins-- it wasn't 1000 here. It's how many there.

  • 46:44

    Maybe out there is 1000.
    Maybe out there is 1000.

  • 46:47

    What is the probability that some coin will give you all 10 heads?
    What is the probability that some coin will give you all 10 heads?

  • 46:54

    Not difficult at all to compute.
    Not difficult at all to compute.

  • 46:56

    And when you get the answer, the answer will be it's actually more
    And when you get the answer, the answer will be it's actually more

  • 47:01

    likely than not.
    likely than not.

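    The two probabilities quoted here are easy to check numerically. A minimal sketch (the 10 flips and 1000 coins are from the lecture; the code itself is just one way to do the arithmetic):

    ```python
    # Probability that a single fair coin gives 10 heads in 10 flips.
    p_one = 0.5 ** 10          # = 1/1024, about 1 in 1000

    # Probability that at least one of 1000 fair coins gives 10 heads,
    # via the complement: the chance that no coin gets all heads.
    p_some = 1 - (1 - p_one) ** 1000

    print(p_one)   # ~0.000977
    print(p_some)  # ~0.624 -- more likely than not
    ```

    So a single coin almost never shows 10 heads, yet with 1000 coins some coin showing 10 heads is the likely outcome.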
  • 47:06

    So now it means that the 10 heads in this case are no indication at all of

  • 47:13

    the real probability.

  • 47:14

    That is the game we are playing.

  • 47:16

    Can I look at the sample and infer something about the real probability?

  • 47:18

    No.

  • 47:19

    In this case, you will get 10 heads, and the coin is fair.

  • 47:23

    Why did this happen?

  • 47:25

    This happened because you tried too hard.

  • 47:28

    Eventually what will happen is--

  • 47:29

    Hoeffding applies to any one of them.

  • 47:32

    But there is a probability, let's say half a percent, that you

  • 47:36

    will be off here.

  • 47:37

    Another half a percent that you will be off here.

  • 47:39

    If you do it often enough, and you are unlucky enough that the half percents

  • 47:42

    are disjoint, you will end up with extremely high probability that

  • 47:47

    something bad will happen, somewhere.

  • 47:50

    That's the key.

  • 47:52

    So let's translate this into the learning situation.

  • 47:56

    Here are your coins.

  • 47:59

    And how do they correspond to the bins?

  • 48:02

    Well, it's a binary experiment, whether you are picking a red marble

  • 48:05

    or a green marble, or you are flipping a coin getting heads or tails.

  • 48:09

    It's a binary situation.

  • 48:11

    So there's a direct correspondence.

  • 48:12

    Just take the probability of heads to be mu, which is the probability of

  • 48:16

    a red marble, and they correspond.

  • 48:18

    So because the coins are fair,

  • 48:20

    actually all the bins in this case are half red, half green.

  • 48:24

    That's really bad news for a hypothesis.

  • 48:26

    The hypothesis is completely random.

  • 48:28

    Half the time it agrees with the target function.

  • 48:30

    Half the time it disagrees.

  • 48:31

    No information at all.

  • 48:33

    Now you apply the learning paradigm we mentioned, and you say: let me

  • 48:37

    generate a sample from the first hypothesis.

  • 48:41

    I get this, I look at it, and I don't like that.

  • 48:43

    It has some reds.

  • 48:43

    I want really a clean hypothesis that performs perfectly--

  • 48:46

    all green.

  • 48:48

    You move on.

  • 48:49

    And, OK.

  • 48:51

    This one--

  • 48:52

    even, I don't know.

  • 48:53

    This is even worse.

  • 48:54

    You go on and on and on.

  • 48:55

    And eventually, lo and behold, I have all greens.

  • 49:02

    Bingo.

  • 49:03

    I have the perfect hypothesis.

  • 49:04

    I am going to report this to my customer, and if my customer is in

  • 49:07

    financial forecasting, we are going to beat the stock market and

  • 49:10

    make a lot of money.

  • 49:11

    And you start thinking about the car you are going to buy, and all of that.

  • 49:15

    Well, is it bingo?

  • 49:19

    No, it isn't.

  • 49:21

    And that is the problem.

  • 49:24

    So now, we have to find something that makes us deal with

  • 49:29

    multiple bins properly.

  • 49:31

    The Hoeffding Inequality-- if you have one experiment, it has a guarantee.

  • 49:35

    The guarantee gets terribly diluted as you go, and we want to know exactly

  • 49:41

    how the dilution goes.

  • 49:43

    So here is a simple solution.

  • 49:46

    This is a mathematical slide. I'll do it step-by-step.

  • 49:49

    There is absolutely nothing mysterious about it.

  • 49:53

    This is the quantity we've been talking about.

  • 49:57

    This is the probability of a bad event.

  • 50:01

    But in this case, you realize that I'm putting g.

  • 50:03

    Remember, g was our final hypothesis.

  • 50:06

    So this corresponds to a process where you had a bunch of h's, and you picked

  • 50:11

    one according to a criterion, that happens to be an in-sample criterion,

  • 50:15

    minimizing the error there, and then you report the g as the

  • 50:18

    one that you chose.

  • 50:20

    And you would like to make a statement that the probability for the g you

  • 50:23

    chose-- the in-sample error-- happens to be close to the out-of-sample error.

  • 50:29

    So you'd like the probability of the deviation being bigger than your

  • 50:32

    tolerance to be, again, small.

  • 50:35

    All we need to do is find a Hoeffding counterpart to this, because

  • 50:39

    now this fellow is loaded.

  • 50:41

    It's not just a fixed hypothesis and a fixed bin.

  • 50:44

    It actually corresponds to a large number of bins, and I am visiting the

  • 50:48

    random samples in order to pick one.

  • 50:50

    So clearly the assumptions of Hoeffding-- which correspond

  • 50:54

    to a single bin-- don't apply.

  • 50:56

    This probability is less than or equal to the

  • 50:59

    probability of the following.

  • 51:01

    I have M hypotheses--

  • 51:03

    capital M hypotheses.

  • 51:05

    h_1, h_2, h_3, ..., h_M.

  • 51:08

    That's my entire learning model.

  • 51:10

    That's the hypothesis set that I have, finite as I said I would assume.

  • 51:14

    If you look at what is the probability that the hypothesis you

  • 51:18

    pick is bad? Well, this will be less than or equal to the probability that the

  • 51:24

    first hypothesis is bad, or the second hypothesis is bad, or, or, or the last

  • 51:35

    hypothesis is bad.

  • 51:37

    That is obvious.

  • 51:39

    g is one of them.

  • 51:41

    If it's bad, one of them is bad.

  • 51:44

    So less than or equal to that.

  • 51:46

    This is called the union bound in probability.

  • 51:49

    It's a very loose bound, in general, because it doesn't

  • 51:51

    consider the overlap.

  • 51:53

    Remember when I told you that the half a percent here, half a percent here,

  • 51:56

    half a percent here--

  • 51:57

    if you are very unlucky and these are non-overlapping, they add up.

  • 52:02

    The non-overlapping is the worst-case assumption, and it is the assumption

  • 52:05

    used by the union bound.

  • 52:07

    So you get this.

  • 52:08

    And the good news about this is that I have a handle on each term of them.

  • 52:12

    The union bound is coming up.

  • 52:13

    So I put the OR's.

  • 52:14

    And then I use the union bound to say that this is less than or equal to, and simply sum

  • 52:20

    the individual probabilities.

  • 52:22

    So the half a percent plus half a percent plus half a percent--

  • 52:26

    this will be an upper bound on all of them.

  • 52:28

    The probability that one of them goes wrong, the probability that someone

  • 52:31

    gets all heads, and I add the probability for all of you, and that

  • 52:35

    makes it a respectable probability.

  • 52:37

    So this event here is implied.

  • 52:40

    Therefore, I have the implication because of the OR, and this one

  • 52:44

    because of the union bound, where I have the pessimistic assumption that I

  • 52:50

    just need to add the probabilities.

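    Written out, the chain of inequalities on that slide is (my transcription in the course's notation, not a verbatim copy of the slide):

    ```latex
    \mathbb{P}\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big]
      \;\le\; \mathbb{P}\Big[\bigcup_{m=1}^{M}
          \big\{\,|E_{\text{in}}(h_m) - E_{\text{out}}(h_m)| > \epsilon\,\big\}\Big]
      \;\le\; \sum_{m=1}^{M}
          \mathbb{P}\big[\,|E_{\text{in}}(h_m) - E_{\text{out}}(h_m)| > \epsilon\,\big]
    ```

    The first inequality is the implication (if g is bad, some h_m is bad), and the second is the union bound, with its worst-case assumption that the bad events do not overlap.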
  • 52:52

    Now, all of this-- again, we make simplistic assumptions, which are

  • 52:56

    really not simplistic as in trivially restricting, but rather the opposite.

  • 53:01

    We just don't want to make any assumptions that restrict the

  • 53:03

    applicability of our result.

  • 53:05

    So we took the worst case.

  • 53:07

    It cannot get worse than that.

  • 53:09

    If you look at this, now I have good news for you.

  • 53:12

    Because each term here is a fixed hypothesis.

  • 53:15

    I didn't choose anything.

  • 53:17

    Every one of them is a hypothesis that was declared ahead of time.

  • 53:20

    Every one of them is a bin.

  • 53:22

    So if I look at a term by itself, Hoeffding applies to it, exactly the

  • 53:26

    same way it applied before.

  • 53:29

    So this is a mathematical statement now.

  • 53:31

    I'm not looking at the bigger experiment.

  • 53:33

    I reduced the bigger experiment to a bunch of quantities.

  • 53:35

    Each of them corresponds to a simple experiment that we already solved.

  • 53:39

    So I can substitute for each of these the bound that

  • 53:42

    Hoeffding gives me.

  • 53:47

    So what is the bound that Hoeffding gives me?

  • 53:55

    That's the one.

  • 53:56

    For every one of them, each of these guys was less than or

  • 54:00

    equal to this quantity.

  • 54:03

    One by one.

  • 54:04

    All of them are obviously the same.

  • 54:07

    So each of them is smaller than this quantity.

  • 54:08

    Each of them is smaller than this quantity.

  • 54:10

    Now I can be confident that the probability that I'm interested in,

  • 54:16

    which is the probability that the in-sample error

  • 54:21

    is not close to the out-of-sample error-- the deviation between them is bigger

  • 54:25

    than my tolerance, the bad event.

  • 54:27

    Under the genuine learning scenario-- you generate marbles from every bin,

  • 54:33

    and you look deliberately for a sample that happens to be all green or as

  • 54:41

    green as possible, and you pick this one.

  • 54:43

    And you want an assurance that whatever that might be, the

  • 54:46

    corresponding bin will genuinely be good out-of-sample.

  • 54:49

    That is what is captured by this probability.

  • 54:51

    That is still bounded by something, which also has that exponential in it,

  • 54:55

    which is good.

  • 54:56

    But it has an added factor that will be a very bothersome factor, which is:

  • 55:03

    I have M of them.

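    Plugging numbers into the resulting bound, 2·M·exp(−2ε²N), shows how the guarantee dilutes with M. A sketch (the values of N and ε here are my own illustrative choices, not from the lecture):

    ```python
    import math

    def hoeffding_bound(M, N, eps):
        """Union-bound Hoeffding: P[|E_in - E_out| > eps] <= 2*M*exp(-2*eps^2*N)."""
        return 2 * M * math.exp(-2 * eps**2 * N)

    N, eps = 1000, 0.05
    for M in (1, 10, 1_000_000):
        print(M, hoeffding_bound(M, N, eps))
    # With M = 1 the bound is a small probability; with M = 1,000,000 it is far
    # above 1 -- formally true, like "less than 10", but it guarantees nothing.
    ```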
  • 55:08

    Now, this is the bad event.

  • 55:10

    I'd like the probability to be small.

  • 55:12

    I don't like to magnify the right-hand side, because that is the probability

  • 55:16

    of something bad happening.

  • 55:19

    Now, with M, you realize that

  • 55:22

    if you use 10 hypotheses, this probability is probably tight.

  • 55:27

    If you use a million hypotheses, we probably are already in trouble.

  • 55:33

    There is no guarantee, because now the million gets multiplied by what used

  • 55:38

    to be a respectable probability, which is 1 in 100,000, and now you can make

  • 55:42

    the statement that the probability that something bad happens

  • 55:45

    is less than 10.

  • 55:47

    [LAUGHING]

  • 55:47

    Yeah, thank you very much.

  • 55:50

    We have to take a graduate course to learn that!

  • 55:53

    Now you see what the problem is.

  • 55:55

    And the problem is extremely intuitive.

  • 55:58

    In the Q&A session after the last lecture, we arrived through the

  • 56:03

    discussion at the assertion that if you have a more sophisticated model, the

  • 56:08

    chances are you will memorize in-sample, and you are not going to

  • 56:11

    really generalize well out-of-sample, because you have so many

  • 56:14

    parameters to work with.

  • 56:16

    There are so many ways to look at that intuitively, and this is one of them.

  • 56:20

    If you have a very sophisticated model-- M is huge, let alone infinite.

  • 56:24

    That's later to come.

  • 56:26

    That's what the theory of generalization is about.

  • 56:29

    But if you pick a very sophisticated model with a large M, you lose the

  • 56:34

    link between the in-sample and the out-of-sample.

  • 56:38

    So you look at here.

  • 56:41

    [LAUGHING], I didn't mean it this way, but let me go back just to show

  • 56:46

    you what it is.

  • 56:47

    At least you know it's over, so that's good.

  • 56:50

    So this fellow is supposed to track this fellow.

  • 56:54

    The in-sample is supposed to track the out-of-sample.

  • 56:57

    The more sophisticated the model you use, the more loosely the in-sample will

  • 57:02

    track the out-of-sample.

  • 57:03

    Because the probability of them deviating becomes bigger and bigger

  • 57:06

    and bigger.

  • 57:07

    And that is exactly the intuition we have.

  • 57:11

    Now, surprise.

  • 57:12

    The next one is for the Q&A. We will take a short break, and then we will

  • 57:16

    go to the questions and answers.

  • 57:21

    We are now in the Q&A session.

  • 57:24

    And if anybody wants to ask a question, they can go to the

  • 57:28

    microphone and ask, and we can start with the online audience questions, if

  • 57:32

    there are any.

  • 57:36

    MODERATOR: The first question is:

  • 57:38

    what happens when the Hoeffding Inequality

  • 57:40

    gives you something trivial, like less than 2?

  • 57:43

    PROFESSOR: Well, it means that either the resources of the examples

  • 57:48

    you have, the amount of data you have, is not sufficient to guarantee any

  • 57:51

    generalization, or--

  • 57:54

    which is somewhat equivalent--

  • 57:56

    that your tolerance is too stringent.

  • 58:00

    The situation is not really mysterious.

  • 58:03

    Let's say that you'd like to take a poll for the president.

  • 58:08

    And let's say that you ask five people at random.

  • 58:12

    How can you interpret the result?

  • 58:15

    Nothing.

  • 58:16

    You need a certain number of respondents in order for the

  • 58:20

    right-hand side to start becoming interesting.

  • 58:23

    Other than that, it's completely trivial.

  • 58:24

    It's very likely that what you have seen in-sample doesn't correspond to

  • 58:28

    anything out-of-sample.

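    The poll analogy can be made quantitative: solving 2·exp(−2ε²N) ≤ δ for N gives the sample size at which the right-hand side starts to be interesting. A sketch (the single-hypothesis bound from the lecture; the ε and δ values are illustrative):

    ```python
    import math

    def min_sample_size(eps, delta):
        """Smallest N with 2*exp(-2*eps^2*N) <= delta, for one hypothesis/coin."""
        return math.ceil(math.log(2 / delta) / (2 * eps**2))

    # Five respondents tell you nothing; for a 5% tolerance at 95% confidence:
    print(min_sample_size(0.05, 0.05))  # 738
    ```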
  • 58:32

    MODERATOR: So in the case of the perceptron--
    MODERATOR: So in the case of the perceptron--

  • 58:36

    the question is would each set of w's be considered a new m?
    the question is would each set of w's be considered a new m?

  • 58:42

    PROFESSOR: The perceptron and, as
    PROFESSOR: The perceptron and, as

  • 58:45

    a matter of fact, every learning model of interest
    a matter of fact, every learning model of interest

  • 58:49

    that we're going to encounter, the number of hypotheses, M,
    that we're going to encounter, the number of hypotheses, M,

  • 58:53

    happens to be infinite.
    happens to be infinite.

  • 58:56

    We were just talking about the right-hand side not being meaningful
    We were just talking about the right-hand side not being meaningful

  • 58:59

    because it's bigger than 1. If you take an infinite hypothesis set and
    because it's bigger than 1. If you take an infinite hypothesis set and

  • 59:03

    verbatim apply what I said, then you find that the probability is actually
    verbatim apply what I said, then you find that the probability is actually

  • 59:07

    less than infinity.
    less than infinity.

  • 59:10

    That's very important.
    That's very important.

  • 59:12

    However, this is our first step.
    However, this is our first step.

  • 59:14

    There will be another step, where we deal with infinite hypothesis sets.
    There will be another step, where we deal with infinite hypothesis sets.

  • 59:18

    And we are going to be able to describe them with an abstract quantity
    And we are going to be able to describe them with an abstract quantity

  • 59:21

    that happens to be finite, and that abstract quantity will be the one we
    that happens to be finite, and that abstract quantity will be the one we

  • 59:25

    are going to use in the counterpart for the Hoeffding Inequality.
    are going to use in the counterpart for the Hoeffding Inequality.

  • 59:28

    That's why there is mathematics that needs to be done.
    That's why there is mathematics that needs to be done.

  • 59:32

    Obviously, the perceptron has an infinite number of hypotheses because
    Obviously, the perceptron has an infinite number of hypotheses because

  • 59:39

    you have real space, and here is your hypothesis, and you can perturb this
    you have real space, and here is your hypothesis, and you can perturb this

  • 59:46

    continuously as you want.
    continuously as you want.

  • 59:47

    Even just by doing this, you already have an infinite number of hypotheses
    Even just by doing this, you already have an infinite number of hypotheses

  • 59:50

    without even exploring further.
    without even exploring further.

  • 59:53

    MODERATOR: OK, and this is a popular one.
    MODERATOR: OK, and this is a popular one.

  • 59:55

    Could you go over again in slide 6, of the implication of nu equals mu and
    Could you go over again in slide 6, of the implication of nu equals mu and

  • 00:00

    vice versa.
    vice versa.

  • 01:00:01

    PROFESSOR: Six.
    PROFESSOR: Six.

  • 01:00:07

    It's a subtle point, and it's common between machine learning and
    It's a subtle point, and it's common between machine learning and

  • 01:00:11

    statistics.
    statistics.

  • 01:00:11

    What do you do in statistics?
    What do you do in statistics?

  • 01:00:13

    What is the cause and effect for a probability and a sample?
    What is the cause and effect for a probability and a sample?

  • 01:00:16

    The probability results in a sample.
    The probability results in a sample.

  • 01:00:18

    So if I know the probability, I can tell you exactly what is the
    So if I know the probability, I can tell you exactly what is the

  • 01:00:22

    likelihood that you'll get one sample or another or another.
    likelihood that you'll get one sample or another or another.

  • 01:00:26

    Now, what you do in statistics is the reverse of that.

  • 01:00:29

    You already have the sample, and you are trying to infer which probability

  • 01:00:34

    gave rise to it.

  • 01:00:35

    So you are using the effect to decide the cause rather than

  • 01:00:39

    the other way around.

  • 01:00:41

    So the same situation here.

  • 01:00:43

    The bin is the cause.

  • 01:00:45

    The frequency in the sample is the effect.

  • 01:00:48

    I can definitely tell you what the distribution is like in the sample,

  • 01:00:53

    based on the bin.

  • 01:00:55

    The utility, in terms of learning, is that I look at the sample

  • 01:00:58

    and infer the bin.

  • 01:01:00

    So I infer the cause based on the effect.

  • 01:01:04

    There's absolutely nothing terrible about that.

  • 01:01:07

    I just wanted to make the point clear, that when we write the Hoeffding

  • 01:01:12

    Inequality, which you can see here, we are talking about this event.

  • 01:01:19

    You should always remember that nu is the thing that plays around

  • 01:01:24

    and causes the probability to happen, and mu is a constant.

  • 01:01:27

    When we use it to predict that the out-of-sample will be the same as the in-

  • 01:01:32

    sample, we are really taking nu as fixed, because this is the in-

  • 01:01:37

    sample we've got.

  • 01:01:39

    And then we are trying to interpret what mu gave rise to it.

  • 01:01:42

    And I'm just saying that, in this case, since the statement is of the

  • 01:01:45

    form that the difference between them, which is symmetric, is greater than

  • 01:01:50

    epsilon, then if you look at this as saying mu is there and I know that nu

  • 01:01:56

    will be approximately the same, you can also flip that.

  • 01:02:00

    And you can say, nu is here, and I know that mu that gave rise to it must

  • 01:02:04

    be the same.

  • 01:02:05

    That's the whole idea.

  • 01:02:06

    It's a logical thing rather than a mathematical thing.
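The symmetry just described can be checked numerically. The following is a hypothetical sketch, not part of the lecture: draw samples of size N from a bin with a fixed mu, and compare the observed frequency of the event |nu - mu| > epsilon with the Hoeffding bound 2 exp(-2 epsilon^2 N), which holds no matter which of nu or mu you treat as the known quantity.

```python
import math
import random

def deviation_probability(mu=0.6, n=100, eps=0.1, trials=20000, seed=0):
    """Estimate P[|nu - mu| > eps] by repeatedly sampling the bin."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        # nu: fraction of "red" marbles in one sample of size n
        nu = sum(rng.random() < mu for _ in range(n)) / n
        if abs(nu - mu) > eps:
            bad += 1
    return bad / trials

mu, n, eps = 0.6, 100, 0.1
empirical = deviation_probability(mu, n, eps)
bound = 2 * math.exp(-2 * eps**2 * n)  # Hoeffding bound; does not depend on mu
print(empirical, bound)
```

Because the event is symmetric in nu and mu, the same bound justifies reading a fixed, observed nu as evidence about the unknown mu.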

  • 01:02:10

    MODERATOR: OK.

  • 01:02:11

    Another conceptual question that is arising is that a more complicated

  • 01:02:16

    model corresponds to a larger number of h's.

  • 01:02:19

    And some people are asking--

  • 01:02:21

    they thought each h was a model.

  • 01:02:24

    PROFESSOR: OK.

  • 01:02:25

    Each h is a hypothesis.

  • 01:02:28

    A particular function, one of them you are going to pick, which is going to

  • 01:02:31

    be equal to g, and this is the g that you're going to report as your best

  • 01:02:35

    guess as an approximation for f.

  • 01:02:37

    The model is the hypotheses that you're allowed to visit in order to

  • 01:02:41

    choose one.

  • 01:02:42

    So that's the hypothesis set, which is H.

  • 01:02:46

    But, again, there is an interesting point.

  • 01:02:48

    I'm using the number of hypotheses as a measure for the complexity in the

  • 01:02:52

    intuitive argument that I gave you.

  • 01:02:55

    It's not clear at all that the pure number corresponds to the complexity.

  • 01:02:59

    It's not clear that anything that has to do with the size, really, is the

  • 01:03:02

    complexity.

  • 01:03:02

    Maybe the complexity has to do with the structure of individual

  • 01:03:05

    hypotheses.

  • 01:03:07

    And that's a very interesting point.

  • 01:03:08

    And that will be discussed at some point-- the complexity of individual

  • 01:03:11

    hypotheses versus the complexity of the model that captures all the

  • 01:03:14

    hypotheses.

  • 01:03:15

    This will be a topic that we will discuss much later in the course.

  • 01:03:19

    MODERATOR: Some people are getting ahead.

  • 01:03:21

    So how do you pick g?

  • 01:03:23

    PROFESSOR: OK.

  • 01:03:24

    We have one way of picking g-- that already was established last time--

  • 01:03:27

    which is the perceptron learning algorithm.

  • 01:03:30

    So your hypothesis set is H.

  • 01:03:33

    Script H.

  • 01:03:34

    It has a bunch of h's, which are the different lines in the plane.

  • 01:03:37

    And you pick g by applying the PLA, the perceptron learning algorithm,

  • 01:03:43

    playing around with this boundary, according to the update rule, until it

  • 01:03:46

    classifies the inputs correctly, assuming they are linearly separable,

  • 01:03:49

    and the one you end up with is what is declared g.

  • 01:03:53

    So g is just a matter of notation, a name for whichever one we settle on,

  • 01:03:56

    the final hypothesis.

  • 01:03:58

    How you pick g depends on what algorithm you use, and what

  • 01:04:02

    hypothesis set you use.

  • 01:04:03

    So it depends on the learning model, and obviously on the data.

  • 01:04:09

    MODERATOR: OK.

  • 01:04:10

    This is a popular question.

  • 01:04:12

    So it says: how would you extend the equation to support an output that

  • 01:04:16

    is a valid range of responses and not a binary response?

  • 01:04:20

    PROFESSOR: It can be done.

  • 01:04:22

    One of the things that I mentioned here is that this fellow, the

  • 01:04:29

    probability here, is uniform.

  • 01:04:31

    Now, let's say that you are not talking about a binary experiment.

  • 01:04:36

    Instead of taking the frequency of error versus the probability of error,

  • 01:04:41

    you take the expected value of something versus the

  • 01:04:44

    sample average of it.

  • 01:04:46

    And they will be close to each other, and some, obviously technical,

  • 01:04:51

    modification is needed here.

  • 01:04:53

    And basically, the set of laws of large numbers, of which this is one member,

  • 01:05:00

    has a bunch of members that actually have to do with expected value and

  • 01:05:06

    sample average, rather than just the specific case of probability and

  • 01:05:10

    sample average.

  • 01:05:12

    If you take your function as being 1, 0, and you take the expected value,

  • 01:05:16

    that will give you the sample frequency as the sample average, and the probability as

  • 01:05:20

    the expected value.

  • 01:05:21

    So it's not a different animal.

  • 01:05:23

    It's just a special case that is easier to handle.

  • 01:05:25

    And in the other case, one of the things that matters is the variance of

  • 01:05:29

    your variable.

  • 01:05:30

    So it will affect the bounds.

  • 01:05:32

    Here, I'm choosing epsilon in general, because the variance of this variable

  • 01:05:37

    is very limited.

  • 01:05:39

    Let's say that the probability is mu, so the variance is mu

  • 01:05:42

    times 1 minus mu.

  • 01:05:43

    It goes from a certain value to a certain value.

  • 01:05:45

    So it can be absorbed.

  • 01:05:46

    It's bounded above and below.

  • 01:05:48

    And this is the reason why the right-hand side here can

  • 01:05:50

    be uniformly done.

  • 01:05:51

    If you have something that has variance that can be huge or small,

  • 01:05:54

    then that will play a role in your choice of epsilon, such that

  • 01:05:58

    this will be valid.

  • 01:06:00

    So the short answer is: it can be done.

  • 01:06:02

    There is a technical modification, and the main aspect of the technical

  • 01:06:05

    modification, that needs to be taken into consideration, is the variance of

  • 01:06:10

    the variable I'm talking about.
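The reduction described above can be made concrete: for a 0/1 indicator variable, the sample average is exactly the sample frequency nu, the expected value is exactly the probability mu, and the variance mu(1 - mu) is bounded by 1/4. A hypothetical numerical sketch (the numbers are made up):

```python
import random

rng = random.Random(1)
mu = 0.3       # probability of a "red" outcome
n = 100000

# The 0/1 indicator of each outcome; its sample average is the frequency nu.
sample = [1 if rng.random() < mu else 0 for _ in range(n)]
nu = sum(sample) / n

variance = mu * (1 - mu)  # at most 1/4, whatever mu is
print(nu, variance)       # nu lands close to mu
```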

  • 01:06:12

    MODERATOR: OK.

  • 01:06:13

    There's also a common confusion.

  • 01:06:14

    Why are there multiple bins?

  • 01:06:18

    PROFESSOR: OK.

  • 01:06:19

    The bin was only our conceptual tool to argue that learning is

  • 01:06:24

    feasible in a probabilistic sense.

  • 01:06:28

    When we used a single bin, we had a correspondence with a hypothesis, and

  • 01:06:33

    it looked like we actually captured the essence of learning, until we

  • 01:06:37

    looked closer and we realized that, if you restrict yourself to one bin and

  • 01:06:41

    apply the Hoeffding Inequality directly to it, what you are really

  • 01:06:44

    working with--

  • 01:06:45

    if you want to put it in terms of learning--

  • 01:06:48

    is that my hypothesis set has only one hypothesis.

  • 01:06:52

    And that corresponds to the bin.

  • 01:06:54

    So now I am picking it--

  • 01:06:56

    which is my only choice.

  • 01:06:57

    I don't have anything else.

  • 01:06:58

    And all I'm doing now is verifying that its in-sample performance will

  • 01:07:03

    correspond to the out-of-sample performance, and that is guaranteed by

  • 01:07:05

    the plain-vanilla Hoeffding.

  • 01:07:08

    Now, if you have actual learning, then you have more than one

  • 01:07:11

    hypothesis.

  • 01:07:12

    And we realize that the bin changes with the hypothesis, because whether

  • 01:07:17

    a marble is red or green depends on whether the hypothesis agrees or

  • 01:07:20

    disagrees with your target function.

  • 01:07:22

    Different hypotheses will lead to different colors.

  • 01:07:25

    Therefore, you need multiple bins to represent multiple hypotheses, which

  • 01:07:29

    is the only situation that admits learning as we know it--

  • 01:07:33

    that I'm going to explore the hypotheses, based on their performance in-sample,

  • 01:07:37

    and pick the one that performs best, perhaps, in-sample, and hope that it

  • 01:07:41

    will generalize well out-of-sample.
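The danger behind multiple bins can be shown numerically. In this hypothetical sketch (not from the lecture), every bin is "fair" with mu = 0.5, yet as the number of bins grows, the chance that at least one sample frequency nu deviates from mu by more than epsilon approaches 1, while always staying under the union bound 2M exp(-2 epsilon^2 N):

```python
import math
import random

def deviation_somewhere(num_bins, n=50, eps=0.1, trials=2000, seed=2):
    """Estimate P[ some bin has |nu - mu| > eps ] when every bin has mu = 0.5."""
    rng = random.Random(seed)
    mu = 0.5
    bad = 0
    for _ in range(trials):
        for _ in range(num_bins):
            nu = sum(rng.random() < mu for _ in range(n)) / n
            if abs(nu - mu) > eps:
                bad += 1
                break  # one deviating bin is enough
    return bad / trials

n, eps = 50, 0.1
p_one = deviation_somewhere(1)      # a single bin rarely deviates
p_many = deviation_somewhere(100)   # with 100 bins, some bin almost surely does
union_bound = lambda m: 2 * m * math.exp(-2 * eps**2 * n)
print(p_one, p_many)
```

One bin deviating is rare, but picking the best-looking bin out of 100 after the fact is exactly why the factor M appears in the modified inequality.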

  • 01:07:46

    MODERATOR: OK.

  • 01:07:46

    Another confusion.

  • 01:07:48

    Can you resolve the relationship between the probability and the big H?

  • 01:07:54

    So I'm not clear exactly what--

  • 01:07:58

    PROFESSOR: We applied the--

  • 01:08:00

    there are a bunch of components in the learning

  • 01:08:04

    situation, so let me get the--

  • 01:08:08

    It's a big diagram, and it has lots of components.

  • 01:08:11

    So one big space or set is X, and another one is H. So if you

  • 01:08:17

    look here.

  • 01:08:19

    This is the hypothesis set H. It's a set.

  • 01:08:21

    OK, fine.

  • 01:08:23

    And also, if you look here, the target function is defined from X to Y, and

  • 01:08:28

    in this case, X is also a set.

  • 01:08:32

    The only invocation of probability that we needed to do, in order to get

  • 01:08:37

    the benefit of the probabilistic analysis in learning, was to put

  • 01:08:42

    a probability distribution on X.

  • 01:08:45

    H, which is down there, is left as a fixed hypothesis set.

  • 01:08:51

    There is no question of a probability on it.

  • 01:08:53

    When we talk about the Bayesian approach, in the last lecture in

  • 01:08:57

    fact, there will be a question of putting a probability distribution

  • 01:09:02

    here in order to make the whole situation probabilistic.

  • 01:09:04

    But that is not the approach that is followed for the entire course, until

  • 01:09:08

    we discuss that specific approach at the end.

  • 01:09:11

    Question.

  • 01:09:12

    STUDENT: What do we do when there are many possible hypotheses which

  • 01:09:18

    will satisfy my criteria?

  • 01:09:20

    Like, in perceptron, for example.

  • 01:09:22

    I could have several hyperplanes which could be separating the set.

  • 01:09:25

    So how do I pick the best--

  • 01:09:27

    PROFESSOR: Correct.

  • 01:09:27

    Usually, with a pre-specified algorithm,

  • 01:09:32

    you'll end up with something.

  • 01:09:33

    So the algorithm will choose it for you.

  • 01:09:35

    But your remark now is that,

  • 01:09:37

    given that there are many solutions that happen to have zero in-sample

  • 01:09:42

    error, there is really no distinction between them in terms of the out-of-

  • 01:09:46

    sample performance.

  • 01:09:47

    I'm using the same hypothesis set, so M is the same.

  • 01:09:49

    And the in-sample error is the same.

  • 01:09:51

    So my prediction for the out-of-sample error would be the same, as there's no

  • 01:09:54

    distinction between them.

  • 01:09:56

    The good news is that the learning algorithm will solve this for you, because

  • 01:09:58

    it will give you one specific one, the one it ended with.

  • 01:10:01

    But even within the ones that achieve zero error, there is a method,

  • 01:10:07

    that we'll talk about later on when we talk about support vector machines,

  • 01:10:10

    that prefers one particular solution as having a better chance of

  • 01:10:13

    generalization.

  • 01:10:14

    Not clear at all given what I said so far, but I'm just telling you,

  • 01:10:18

    as an appetizer, there's something to be done in that regard.

  • 01:10:23

    MODERATOR: OK.

  • 01:10:24

    A question is: does the inequality hold for any g,

  • 01:10:30

    even if g is not optimal?

  • 01:10:34

    PROFESSOR: What about the g?

  • 01:10:36

    MODERATOR: Does it hold for any g, no matter how you pick g?

  • 01:10:39

    PROFESSOR: Yeah.

  • 01:10:40

    So the whole idea--

  • 01:10:41

    once you write the symbol g, you already are talking about any

  • 01:10:45

    hypothesis.

  • 01:10:45

    Because by definition, g is the final hypothesis, and your algorithm is

  • 01:10:50

    allowed to pick any h from the hypothesis set and call it g.

  • 01:10:55

    Therefore, when I say g, don't look at a fixed hypothesis.

  • 01:10:59

    Look at the entire learning process that went through the H, the

  • 01:11:03

    set of hypotheses, according to the data and according to the learning

  • 01:11:07

    rule, and went through and ended up with one that is declared the right

  • 01:11:12

    one, and now we call this g.

  • 01:11:14

    So the answer is patently that g can be different.

  • 01:11:18

    Patently yes, just by the notation that I'm using.

  • 01:11:24

    MODERATOR: Also, some confusion.

  • 01:11:26

    With the perceptron algorithm or any linear algorithm--

  • 01:11:30

    there's a confusion that, at each step, there's a hypothesis, but--

  • 01:11:36

    PROFESSOR: Correct.

  • 01:11:36

    But these are hidden processes for us.

  • 01:11:40

    As far as the analysis I mentioned, you get the data,

  • 01:11:43

    the algorithm does something magic, and ends up with a final hypothesis.

  • 01:11:48

    In the course of doing that, it will obviously be visiting lots of

  • 01:11:50

    hypotheses.

  • 01:11:51

    So the abstraction of having just the samples sitting there, and eyeballing

  • 01:11:55

    them and picking the one that happens to be green, is an abstraction.

  • 01:11:59

    In reality, these guys happen in a space, and you are moving from one

  • 01:12:03

    hypothesis to another by moving some parameters.

  • 01:12:06

    And in the course of doing that, including in the perceptron learning

  • 01:12:10

    algorithm, you are moving from one hypothesis to another.

  • 01:12:14

    But I'm not accounting for that, because I haven't found my final

  • 01:12:17

    hypothesis yet.

  • 01:12:18

    When you find the final hypothesis, you call it g.

  • 01:12:21

    On the other hand, because I use the union bound, I use the worst-case

  • 01:12:25

    scenario, so the generalization bound applies to every single hypothesis you

  • 01:12:29

    visited or you didn't visit.

  • 01:12:32

    Because what I did to get the bound, on the deviation between in-sample and out-of-

  • 01:12:36

    sample, is that I consider that all the hypotheses simultaneously behave, from

  • 01:12:41

    in-sample to out-of-sample, closely according to your epsilon criterion.

  • 01:12:46

    And that obviously guarantees that whichever one you end up

  • 01:12:49

    with will be fine.

  • 01:12:52

    But obviously, it could be an overkill.

  • 01:12:54

    And among the positive side effects of that is that even the

  • 01:12:59

    intermediate values have good generalization--

  • 01:13:01

    not that we look at it or consider it, but just to answer the question.

  • 01:13:07

    MODERATOR: A question about the punchline.

  • 01:13:09

    They say that they don't understand exactly how the Hoeffding works--

  • 01:13:15

    shows that learning is feasible.

  • 01:13:18

    PROFESSOR: OK.

  • 01:13:19

    Hoeffding shows that verification is feasible.

  • 01:13:23

    The presidential poll makes sense.

  • 01:13:25

    That, if you have a sample and you have one question to ask, and you see

  • 01:13:30

    how the question is answered in the sample, then there is a reason to

  • 01:13:33

    believe that the answer in the general population, or in the big bin, will be

  • 01:13:37

    close to the answer you got in-sample.

  • 01:13:39

    So that's the verification.

  • 01:13:41

    In order to move from verification to learning, you need to be able to make

  • 01:13:45

    that statement simultaneously on a number of these guys, and that's why

  • 01:13:50

    you had the modified Hoeffding Inequality at the end,

  • 01:13:55

    which is this one

  • 01:13:55

    that has the red M in it.

  • 01:13:58

    This is no longer the plain-vanilla Hoeffding Inequality.

  • 01:14:00

    We'll still call it Hoeffding.

  • 01:14:02

    But it basically deals with a situation where you have M of these

  • 01:14:05

    guys simultaneously, and you want to guarantee that all of them are

  • 01:14:08

    behaving well.

  • 01:14:10

    Under those conditions, this is the probability that the guarantee can

  • 01:14:13

    give, and the probability, obviously, is looser than it used to be.

  • 01:14:16

    So the probability that a bad thing happens when you have many

  • 01:14:18

    possibilities is bigger than the probability that a bad thing happens when

  • 01:14:22

    you have one of them.

  • 01:14:23

    And this is the case where you added up as if they happen disjointly, as I

  • 01:14:27

    mentioned before.

  • 01:14:29

    MODERATOR: Can it be said that the bin corresponds to the entire
    MODERATOR: Can it be said that the bin corresponds to the entire

  • 01:14:32

    population in a--
    population in a--

  • 01:14:34

    PROFESSOR: The bin corresponds to the entire

  • 01:14:37

    population before coloring.

  • 01:14:40

    So remember the gray bin--

  • 01:14:41

    I have it somewhere.

  • 01:14:44

    We had a viewgraph where the bin had gray marbles.

  • 01:14:49

    So this is my way of saying this is a generic input, and we

  • 01:14:51

    call it X.

  • 01:14:53

    And this is indeed the input space in this case, or the general population.

  • 01:14:57

    Now, we start coloring it when you give me a hypothesis.

  • 01:15:01

    So now there's more in the process than just the input space.

  • 01:15:06

    But indeed, the bin can correspond to the general population, and the sample

  • 01:15:10

    will correspond to the people you polled over the phone, in the case of

  • 01:15:12

    the presidential poll.
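As a small sketch of this correspondence (the population size, target, and hypothesis below are my own illustrative choices, not anything from the lecture): the bin is the whole input space, a marble gets its color only once a hypothesis is fixed, and the telephone poll is a random sample from the bin.

```python
import random

# Sketch (illustrative choices are mine): the "bin" is the whole
# population / input space. Coloring happens only after a hypothesis h
# is fixed: a point is "red" where h disagrees with the target f.

random.seed(1)
POP = 100_000
f = [random.randint(0, 1) for _ in range(POP)]   # unknown target on each point
h = [0] * POP                                    # one fixed hypothesis

# mu: true fraction of red marbles over the entire bin
mu = sum(h[i] != f[i] for i in range(POP)) / POP

# nu: red fraction in the sample -- "the people you polled over the phone"
polled = random.sample(range(POP), 1000)
nu = sum(h[i] != f[i] for i in polled) / len(polled)

print(mu, nu)   # nu tracks mu, which is what Hoeffding guarantees
```

Before `h` is chosen there is nothing to color; the coloring, and hence the red fraction μ, exists only relative to a hypothesis.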

  • 01:15:16

    MODERATOR: Is there a relation between the Hoeffding Inequality and

  • 01:15:20

    p-values in statistics?

  • 01:15:22

    PROFESSOR: Yes.

  • 01:15:25

    The area where we are trying to say that if I have a sample, and I get

  • 01:15:29

    an estimate on the sample, then the estimate is reliable--

  • 01:15:31

    the estimate is close to the out-of-sample,

  • 01:15:33

    the probability that you will deviate-- is a huge body of work.

  • 01:15:38

    And the p-value in statistics is one approach.

  • 01:15:40

    And there are other laws of large numbers that come with it.

  • 01:15:45

    I don't want to venture too much into that.

  • 01:15:48

    I basically picked from that jungle of mathematics the single most useful

  • 01:15:53

    formula that will get me home when I talk about the theory of

  • 01:15:56

    generalization.

  • 01:15:57

    And I want to focus on it.

  • 01:15:59

    I want to understand it-- this specific formula-- perfectly, so that when we

  • 01:16:03

    keep modifying it until we get to the VC dimension, things are clear.

  • 01:16:07

    And, obviously, if you get curious about the law of large numbers, and

  • 01:16:10

    different manifestations of in-sample being close to out-of-sample and

  • 01:16:14

    probabilities of error, that is a very fertile ground, and a very useful

  • 01:16:17

    ground to study.

  • 01:16:18

    But it is not a core subject of the course.

  • 01:16:23

    The subject is only borrowing one piece as a utility

  • 01:16:26

    to get what it wants.
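The specific formula in question is the Hoeffding Inequality, P[|ν − μ| > ε] ≤ 2e^(−2ε²N), and the modification discussed above multiplies the right side by M for M hypotheses. A minimal sketch of both versions (the function names and example numbers are mine):

```python
import math

# Sketch of the formula the course keeps modifying (names/numbers mine):
#   P[|E_in - E_out| > eps] <= 2 * M * exp(-2 * eps^2 * N)

def hoeffding_bound(N, eps, M=1):
    """Union-bound version of Hoeffding for M hypotheses (M=1: one bin)."""
    return 2 * M * math.exp(-2 * eps**2 * N)

def samples_needed(eps, delta, M=1):
    """Smallest N that pushes the bound down to at most delta."""
    return math.ceil(math.log(2 * M / delta) / (2 * eps**2))

print(hoeffding_bound(1000, 0.05))         # one hypothesis: a tight bound
print(hoeffding_bound(1000, 0.05, M=100))  # 100 hypotheses: 100x looser
print(samples_needed(0.05, 0.05, M=100))   # N needed to restore confidence
```

The second function inverts the bound: since the M factor loosens the guarantee, restoring the same confidence level requires more samples, which is the trade-off the VC analysis later sharpens.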

  • 01:16:29

    So that ends the questions here?

  • 01:16:30

    Let's call it a day, and we will see you next week.


Lecture 02 - Is Learning Feasible?


