Learnings in ML — PAC Learning


We understand very well the importance of dataset size when training a machine learning model. But questions such as what the algorithm can actually learn from the dataset, and how well it can learn it, are harder to answer.

In machine learning we have a framework that helps us answer what can be learned efficiently by an algorithm, and also helps us estimate the sample size that gives a good result. This framework is called the Probably Approximately Correct (PAC) learning framework.

PAC helps us describe which concepts an algorithm can probably learn; this depends on factors such as the sample size (sample complexity) and the time and space complexity of the algorithm.

PAC Framework

Before getting into more detail, let's first look at the notation/terminology we are going to use to describe the PAC framework:

* c — a concept to learn, a mapping X -> Y; since Y = {0, 1}, c : X -> {0, 1}
* C — concept class (the set of concepts to learn)
* H — hypothesis set (a set of candidate concepts, which may not coincide with C)
* D — data distribution (samples are assumed to be independently and identically distributed)
* S — sample of labeled examples drawn from D
* h_S — hypothesis returned by the algorithm for the sample S
* ε — accuracy parameter
* δ — confidence parameter

PAC Learning

A concept class C is termed PAC-learnable if there is an algorithm A such that, for any ε, δ > 0, running A on N samples returns a hypothesis h_S whose generalization error R(h_S) is at most ε with probability at least 1 − δ, where N is polynomial in 1/ε and 1/δ. The combination of "probably" (the 1 − δ confidence) and "approximately" (the ε error bound) gives the name PAC — Probably Approximately Correct.
* Pr [R(h_S) ≤ ε] ≥ 1 − δ

The assumptions made here are that ε, δ > 0 and that the hypothesis set H is finite. An algorithm/classifier that achieves error at most ε with probability at least 1 − δ is termed approximately correct in learning the features/concepts.

Also, if the algorithm A runs in time polynomial in 1/ε and 1/δ, then C is said to be efficiently PAC-learnable. Here we are looking for generalized learning (with small generalization error) rather than memorization of the concepts/features by the algorithm.
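As a toy illustration (a minimal sketch, not from the article — the function names and the choice of threshold concepts are mine): a consistent learner for threshold concepts c(x) = 1 if x ≥ t on the uniform distribution. Its generalization error shrinks as the sample size N grows, which is exactly the PAC behaviour described above.

```python
import random

def learn_threshold(n, true_t=0.5):
    """Learn c(x) = [x >= t] from n i.i.d. uniform samples by choosing
    the smallest positive example as the threshold (a consistent learner).
    Returns the generalization error, which here is the measure of
    the interval [true_t, learned_t) where h and c disagree."""
    random.seed(42)  # fixed seed so the sketch is reproducible
    sample = [random.random() for _ in range(n)]
    positives = [x for x in sample if x >= true_t]
    learned_t = min(positives) if positives else 1.0
    return learned_t - true_t

print(learn_threshold(10))    # larger error with few samples
print(learn_threshold(1000))  # error shrinks as N grows
```

The error is exactly the probability mass between the true and learned thresholds, so more samples push the learned threshold closer to the true one.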

Generalisation error — the generalisation error (or true error) of a hypothesis h with respect to a concept c is the probability, for a random instance x drawn from D, that the two disagree, i.e. h(x) != c(x). This is where h and c do not overlap (their labelings differ on that point).
* R(h) = Pr [h(x) != c(x)]
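Since D is usually unknown in closed form, the generalisation error can be estimated by sampling. A minimal sketch (function names are mine, not from the article):

```python
import random

def generalization_error(h, c, draw, n=100_000):
    """Monte Carlo estimate of R(h) = Pr_{x~D}[h(x) != c(x)],
    where draw() returns one instance x sampled from D."""
    errors = 0
    for _ in range(n):
        x = draw()
        if h(x) != c(x):
            errors += 1
    return errors / n

# toy example: D = uniform on [0, 1), c labels x >= 0.5, h labels x >= 0.6;
# they disagree exactly on [0.5, 0.6), so the true error is 0.1
random.seed(0)
err = generalization_error(lambda x: x >= 0.6, lambda x: x >= 0.5, random.random)
print(err)
```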

Sample complexity — Using PAC we can also bound the number of samples needed for high probability, assuming that C is PAC-learnable and H is finite. For a consistent hypothesis to have error at most ε with probability at least 1 − δ (high probability), the number of training samples needed can be bounded as
* N ≥ 1/ε (ln|H| + ln(1/δ))
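Plugging numbers into this bound is straightforward; a small sketch (the function name is mine):

```python
import math

def sample_complexity(h_size, eps, delta):
    """Sample bound for a finite hypothesis set and a consistent learner:
    N >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((1.0 / eps) * (math.log(h_size) + math.log(1.0 / delta)))

# e.g. |H| = 1000 hypotheses, 5% error (eps), 95% confidence (delta = 0.05)
print(sample_complexity(1000, 0.05, 0.05))  # 199 samples suffice
```

Note that the bound grows only logarithmically in |H| and 1/δ, but linearly in 1/ε.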

What if the hypothesis H is infinite?


The previous results hold for a finite hypothesis set. If the hypothesis set is infinite, we instead measure its capacity by how it can label sets of points: a set of points is said to be shattered by the hypothesis set if, for every possible labeling of those points, some hypothesis realizes exactly that labeling.

To make this measurable there is a term VC dimension, or VC dim, which is the size of the largest set of points that can be shattered by the hypothesis set. A set of k points has 2^k possible labelings, so shattering k points means realizing all of them.
We can also say,
* number of labelings = 2^k, where k is the number of points
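A direct way to check shattering is to enumerate which labelings a (finite sample of a) hypothesis set realizes on a point set, and compare against 2^k. A minimal sketch (names are mine), using 1-D threshold classifiers, whose VC dimension is 1:

```python
def shatters(hypotheses, points):
    """True if the hypotheses realize all 2^k labelings of `points`."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

# threshold classifiers h_t(x) = 1 if x >= t, else 0
thresholds = [lambda x, t=t: int(x >= t) for t in [0, 1, 2, 3]]

print(shatters(thresholds, [1.5]))       # True: both labelings realizable
print(shatters(thresholds, [0.5, 1.5]))  # False: labeling (1, 0) impossible
```

No threshold can label 0.5 positive while labeling the larger point 1.5 negative, so two points can never be shattered: the VC dimension of thresholds is 1.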

Let's say the VC dimension is d: there exists some set of d points that can be shattered, but no set of d + 1 points can be shattered. How? Let's look at an example.

Example of VC Dimension 1

Consider graph (a), where two points labeled T and F (d points) are separated into two groups by a line; any other line achieving the split could equally be chosen.

Consider graph (b), where the same two points carry the swapped labels F and T (d points); this labeling can likewise be realized by a line.

Example of VC Dimension 2

Consider graph (a), where there are 4 points labeled T, T, T, F (d + 1 points). How can we shatter it? Any line we choose leaves the F point on the wrong side, so this labeling cannot be realized and graph (a) cannot be shattered.

Consider graph (b), where there are 3 points labeled T, T, F (d points). This labeling can easily be realized with a single line.

In the above examples we saw that a set of ≤ 3 points can be shattered by a line, but a set of 4 points (d + 1 points) cannot always be split perfectly.
So the largest set of points that can be shattered has 3 points: the VC dimension of lines in the plane is 3.
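The classic 4-point counterexample is the XOR labeling, and the impossibility can be verified directly (a short worked argument, not from the article; the point coordinates are my choice). Take the corners (0,0) and (1,1) labeled F, and (0,1) and (1,0) labeled T. A separating line h(x) = w1·x1 + w2·x2 + b would need:
* b < 0 and w1 + w2 + b < 0 (the two F points on the negative side)
* w2 + b > 0 and w1 + b > 0 (the two T points on the positive side)
Adding the first pair gives w1 + w2 + 2b < 0, while adding the second pair gives w1 + w2 + 2b > 0 — a contradiction, so no line realizes this labeling.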
