Probabilistic Learning
Probabilistic learning methods offer a more transparent approach than traditional neural networks such as the MLP. Neural networks often provide limited interpretability: inspecting neuron activations and weights may not give clear insights. With probabilistic methods, however, we can directly observe the probabilities involved in the decision-making process.
In this section, we explore classification using probabilities estimated from the frequency of examples in the training data, which makes the basis of each decision explicit. We also introduce unsupervised learning methods for situations where training labels are not available. When the data can be assumed to come from probability distributions of a known form with unknown parameters, the Expectation-Maximization (EM) algorithm solves the estimation problem effectively.
Unsupervised Learning
Unsupervised learning is used when labels are not available for the training examples. Instead of
using labeled data, we focus on identifying patterns or clusters in the data. The EM algorithm is
a widely-used technique in this context.
The EM algorithm works in two main steps:
Expectation (E-step): Estimate the probability that each data point belongs to each cluster.
Maximization (M-step): Recalculate the parameters of the clusters (e.g., means and variances) to maximize the likelihood of the data.
This iterative process continues until the algorithm converges, resulting in well-defined clusters
in the data.
Nearest neighbour methods use the stored training examples themselves to make predictions. The k-nearest neighbor algorithm (k-NN) is a popular approach, where we find the k closest examples in the dataset and use their labels to predict the label of a new data point.
For example, if we want to classify a fruit based on its size and color, we would look for the k
most similar fruits in the dataset. By checking their labels (e.g., apple, orange), we can determine
the label for the new fruit.
Probabilistic learning offers a more interpretable way to handle classification problems compared to neural networks. With methods like classification based on frequency, unsupervised learning through the EM algorithm, and nearest neighbor approaches, we can build models that provide clearer insights into how decisions are made.
Gaussian Mixture Models
The overall density is modelled as
p(x) = \sum_{m=1}^{M} \alpha_m \, \phi(x; \mu_m, \Sigma_m),
where \phi(x; \mu_m, \Sigma_m) is the Gaussian density with mean \mu_m and covariance matrix \Sigma_m, and the \alpha_m are mixture weights with the constraint
\sum_{m=1}^{M} \alpha_m = 1.
This equation describes how the overall probability distribution is a mixture of several Gaussian
distributions.
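To make the mixture density concrete, here is a minimal numpy sketch (not from the text) that evaluates p(x) for a small two-component mixture in two dimensions; the weights, means, and covariances are invented purely for illustration.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density phi(x; mean, cov)."""
    d = len(mean)
    diff = x - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ inv @ diff) / norm

def mixture_pdf(x, weights, means, covs):
    """p(x) = sum_m alpha_m * phi(x; mu_m, Sigma_m)."""
    return sum(a * gaussian_pdf(x, m, c)
               for a, m, c in zip(weights, means, covs))

# Illustrative two-component mixture (parameters chosen arbitrarily).
weights = [0.3, 0.7]                                   # alpha_m, summing to 1
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # mu_m
covs = [np.eye(2), 2.0 * np.eye(2)]                    # Sigma_m

print(mixture_pdf(np.array([1.0, 1.0]), weights, means, covs))
```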
Model Definition
Assume there are two Gaussian distributions:
G_1 = \mathcal{N}(\mu_1, \sigma_1^2), \qquad G_2 = \mathcal{N}(\mu_2, \sigma_2^2)
y = p\,G_1 + (1 - p)\,G_2
Challenge
The goal is to compute the maximum likelihood solution. However, directly differentiating the log-
likelihood function is complex due to the hidden variable f , which indicates from which Gaussian
the data point was generated.
E-step: Estimate the expected value of the hidden variable for each data point, given the current parameter estimates:
\gamma_i(\hat{\mu}_1, \hat{\mu}_2, \hat{\sigma}_1, \hat{\sigma}_2, \hat{\pi}) = E(f_i \mid \hat{\mu}_1, \hat{\mu}_2, \hat{\sigma}_1, \hat{\sigma}_2, \hat{\pi}, D)
M-step: Re-estimate the parameters using these responsibilities; in particular, step 5 updates \hat{\pi}:
\hat{\pi} = \frac{\sum_{i=1}^{N} \hat{\gamma}_i}{N}
Iteration
Repeat the E-step and M-step until convergence is achieved. The EM algorithm is guaranteed to
converge to a local maximum of the likelihood function.
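The whole procedure fits in a short program. The sketch below (not the text's own code) runs EM for the two-Gaussian model on synthetic data; it uses the common convention that the responsibility \hat{\gamma}_i refers to the second component and \hat{\pi} is its mixing weight, and it fills in the usual responsibility-weighted updates for the means and variances alongside the \hat{\pi} update shown above.

```python
import numpy as np

def normal_pdf(y, mu, sigma):
    """1-D Gaussian density N(mu, sigma^2) evaluated at y."""
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_two_gaussians(y, n_iter=100):
    """EM for a two-component 1-D Gaussian mixture.

    Convention: gamma[i] is the responsibility of component 2 for point i,
    and pi is the mixing weight of component 2 (component 1 has weight 1 - pi).
    """
    # Crude initial guesses; any reasonable starting point works.
    mu1, mu2 = y.min(), y.max()
    s1 = s2 = y.std()
    pi = 0.5
    for _ in range(n_iter):
        # E-step: responsibilities gamma_i = E(f_i | current parameters, data).
        p1 = (1 - pi) * normal_pdf(y, mu1, s1)
        p2 = pi * normal_pdf(y, mu2, s2)
        gamma = p2 / (p1 + p2)
        # M-step: responsibility-weighted re-estimates of the parameters.
        mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
        mu2 = np.sum(gamma * y) / np.sum(gamma)
        s1 = np.sqrt(np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma))
        s2 = np.sqrt(np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma))
        pi = np.mean(gamma)          # M-step 5: pi-hat = (1/N) * sum_i gamma_i
    return mu1, mu2, s1, s2, pi

# Synthetic data drawn from two Gaussians, just for demonstration.
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 0.5, 100)])
print(em_two_gaussians(y))
```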
Conclusion
The EM algorithm effectively handles missing or hidden data by iteratively estimating the expectations of the latent variables and optimizing the model parameters. This framework is widely used in various applications, particularly in clustering and density estimation tasks.
In many practical learning scenarios, only a subset of relevant instance features is observable. For instance, when training or utilizing a Bayesian belief network, certain variables may be observed while others remain hidden. To effectively learn in the presence of these unobserved variables, the EM (Expectation-Maximization) algorithm provides a systematic approach. It can be applied even when the values of certain variables are never directly observed, given that the form of the probability distribution governing these variables is known.
Radial Basis Function Networks: the EM algorithm can also be used here, for example to choose the positions and widths of the Gaussian basis functions.
In this setting, the E-step computes the responsibility of each Gaussian component for every data point, and the M-step updates the parameters of the Gaussian components based on these responsibilities.
This process iterates until the parameters converge, refining the model to better explain the data.
The EM algorithm serves as a powerful tool for parameter estimation in models involving
unobserved variables. By leveraging current hypotheses to estimate hidden data and iteratively
refining those hypotheses, EM approaches a maximum likelihood solution. Its versatility across
various applications makes it a fundamental method in both machine learning and statistical
inference.
Data Points
Consider the following data points in a 2-dimensional space:
Data points: {(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9), (10, 10)}
We also have labels for these data points:
Labels: {A, A, A, A, B, B, B, B, B, B}
Our new point is (6.5, 6.5), and we want to classify it using the KNN algorithm with k = 3.
1. Calculate Distances: Compute the Euclidean distance from the new point to each data point. For example, the distance to (6, 6) is
\text{Distance} = \sqrt{(6.5 - 6)^2 + (6.5 - 6)^2} = \sqrt{0.5^2 + 0.5^2} = \sqrt{0.25 + 0.25} = \sqrt{0.5} \approx 0.71
2. Sort the Distances: Ordered from nearest to farthest, the data points are:
Sorted Distances: {(6, 6), (7, 7), (5, 5), (8, 8), (4, 4), (9, 9), (3, 3), (10, 10), (2, 2), (1, 1)}
3. Select k Nearest Neighbors: With k = 3, the three nearest neighbors are (6, 6), (7, 7), and (5, 5).
Labels: {B, B, B}
Since the majority of the neighbors belong to class B, we classify the new point (6.5, 6.5) as
class B.
By using the KNN algorithm with k = 3, the new point (6.5, 6.5) is classified as class B.
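The worked example translates directly into code. Below is a minimal sketch using plain numpy (no k-NN library); the data points, labels, query point, and k are exactly those listed above.

```python
import numpy as np
from collections import Counter

# The data points, labels, and query point from the worked example.
points = np.array([(i, i) for i in range(1, 11)], dtype=float)
labels = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B']
query = np.array([6.5, 6.5])
k = 3

# 1. Compute Euclidean distances from the query to every point.
distances = np.sqrt(((points - query) ** 2).sum(axis=1))

# 2-3. Sort by distance and take the k nearest neighbours.
nearest = np.argsort(distances)[:k]
neighbour_labels = [labels[i] for i in nearest]

# Majority vote among the neighbours decides the class.
prediction = Counter(neighbour_labels).most_common(1)[0][0]
print(neighbour_labels, '->', prediction)   # ['B', 'B', 'B'] -> B
```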
The same procedure can be stated more generally:
1. Calculating Distances:
- We compute the distance between the new point and every stored example.
- We can ignore the square root in the distance formula because we only want to know which points are closest.
2. Choosing Neighbours:
- After calculating the distances, we identify the k nearest neighbours to the new point.
- The class of the new point is assigned based on the most common class among these neighbours.
Choosing k
The choice of k is crucial:
- If k is too small, the method can be sensitive to noise.
- If k is too large, it may include points that are not relevant, reducing accuracy.
In practice, k is often chosen by comparing several values with cross-validation, as sketched below.
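As a minimal sketch of that idea (the data set and candidate values of k are invented for illustration), the following compares several k values by leave-one-out error with a from-scratch k-NN classifier:

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Classify a query point by majority vote among its k nearest training points."""
    dists = np.sqrt(((train_x - query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return Counter(train_y[nearest].tolist()).most_common(1)[0][0]

def loo_error(x, y, k):
    """Leave-one-out error rate of k-NN for a given k."""
    errors = 0
    for i in range(len(x)):
        mask = np.arange(len(x)) != i                 # hold out point i
        pred = knn_predict(x[mask], y[mask], x[i], k)
        errors += pred != y[i]
    return errors / len(x)

# Small synthetic two-class data set (invented for the demonstration).
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y = np.array(['A'] * 30 + ['B'] * 30)

for k in (1, 3, 5, 7, 9):
    print(k, loo_error(x, y, k))
```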
Curse of Dimensionality
As we increase the number of dimensions (d), the distance calculations become more complex. Although methods like KD-Trees can help with this, the following issues arise:
- As dimensions increase, points tend to spread out, making it harder to find truly "nearby" points.
- Distances become less meaningful as many points are far apart in some dimensions but close in others.
The short simulation below illustrates how distances lose contrast as the dimension grows.
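A quick simulation (not from the text) makes this concrete: for uniformly random points in the unit hypercube, the gap between the nearest and farthest point shrinks relative to the distances themselves as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000):
    # 500 random points and one query point in the d-dimensional unit cube.
    points = rng.random((500, d))
    query = rng.random(d)
    dists = np.sqrt(((points - query) ** 2).sum(axis=1))
    # As d grows, the relative contrast (far - near) / near tends to shrink,
    # so "nearest" becomes less meaningful.
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  nearest={dists.min():.2f}  farthest={dists.max():.2f}  contrast={contrast:.2f}")
```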
Bias-Variance Trade-off
For the k-nearest neighbours (KNN) algorithm, we can analyze the bias-variance trade-off:
E\left[(y - \hat{y})^2\right] = \sigma^2 + \left( f(x) - \frac{1}{k} \sum_{i=1}^{k} f(x_i) \right)^2 + \frac{\sigma^2}{k},
where y = f(x) + \varepsilon is a new observation with noise variance \sigma^2, \hat{y} is the average response of the k nearest neighbours x_1, \dots, x_k, the squared term is the bias, and \sigma^2 / k is the variance of the estimate.
The bias-variance tradeoff is crucial in understanding the performance of the KNN algorithm.
Bias
Bias refers to the error introduced by approximating a real-world problem using a simplified
model.
In KNN, when k is small (e.g., k = 1), the model is flexible and fits the training data closely,
resulting in low bias.
However, this flexibility can lead to high variance, as the model captures noise in the training
data.
Variance
Variance refers to the error introduced by the model’s sensitivity to fluctuations in the
training data.
A high variance model, like KNN with small k, performs well on training data but poorly
on unseen data due to overfitting.
When k is large, the opposite behaviour appears:
– High Bias: The model generalizes too much, potentially missing important patterns.
– Low Variance: The model is stable, averaging over more neighbors, reducing noise impact.
Optimal k
Finding a balance between bias and variance is key: an intermediate value of k typically gives the lowest overall error.
In KNN, understanding the bias-variance tradeoff is essential for tuning the algorithm effectively, with the goal of minimizing overall error.
The nearest neighbour methods rely on the idea of learning from similar data points. By
carefully choosing k and considering dimensionality, we can effectively classify new data based on
existing patterns.
As an example of constructing a KD-Tree, consider the points:
(5, 4), (1, 6), (6, 1), (7, 5), (2, 7), (2, 2), (5, 8)
1. Split on the x-coordinate: Sorting the points by their x-coordinates gives:
(1, 6), (2, 2), (2, 7), (5, 4), (5, 8), (6, 1), (7, 5)
- The median point (middle value) is at index 3 (0-indexed), which is (5, 4).
- This creates a split at x = 5.
2. Left Subtree (Points with x < 5):
- Remaining points: (1, 6), (2, 2), (2, 7)
- Sort by y-coordinates: (2, 2), (1, 6), (2, 7)
- The median, (1, 6), becomes the root of the left subtree, with (2, 2) and (2, 7) as its children.
Visualization
At this point, the KD-Tree has the following structure:
- Root Node: (5, 4)
  - Left Child: (1, 6)
    - Left: (2, 2)
    - Right: (2, 7)
  - Right Child: (7, 5)
    - Left: (5, 4)
      - Child: (6, 1)
    - Right: (5, 8)
This process continues recursively until every point has been placed in the tree. During a nearest-neighbour search, whole branches can then be ruled out without checking their points individually, making the KD-Tree efficient in finding nearest neighbors with significantly reduced computational costs.
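The construction can be written as a short recursive function. The sketch below is a simplified variant (alternating the splitting axis with depth and storing the median point at each node), so its tree differs slightly from the hand-built one above; it is run on the same seven points.

```python
from typing import List, Optional, Tuple

class Node:
    def __init__(self, point, left=None, right=None):
        self.point = point
        self.left = left
        self.right = right

def build_kdtree(points: List[Tuple[float, float]], depth: int = 0) -> Optional[Node]:
    """Recursively build a KD-Tree, alternating the splitting axis with depth."""
    if not points:
        return None
    axis = depth % 2                               # 0: split on x, 1: split on y
    points = sorted(points, key=lambda p: p[axis])
    median = len(points) // 2                      # the median point becomes this node
    return Node(points[median],
                left=build_kdtree(points[:median], depth + 1),
                right=build_kdtree(points[median + 1:], depth + 1))

def show(node, indent=0):
    """Print the tree with indentation proportional to depth."""
    if node is not None:
        print("  " * indent + str(node.point))
        show(node.left, indent + 1)
        show(node.right, indent + 1)

pts = [(5, 4), (1, 6), (6, 1), (7, 5), (2, 7), (2, 2), (5, 8)]
show(build_kdtree(pts))
```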
Distance Measures
Distance measures are essential in data analysis, particularly in clustering and classification tasks.
They help quantify how similar or different two data points are. Below, we explore several important distance metrics along with their real-time applications.
Manhattan Distance
In contrast to the Euclidean distance, the Manhattan distance (also known as city-block distance)
measures the distance between two points based on a grid-like path. It adds the absolute differences
in each dimension. For two points (x_1, y_1) and (x_2, y_2), the Manhattan distance d_C is given by:
d_C = |x_1 - x_2| + |y_1 - y_2|
Manhattan distance is often used in robotics, particularly for pathfinding algorithms. For
example, a robot navigating through a city grid will calculate its distance to a destination using
Manhattan distance to find the most efficient path while avoiding obstacles like buildings.
Minkowski Distance
The Minkowski distance is a generalization of both the Euclidean and Manhattan distances. It is
defined for two points x and y in an n-dimensional space, with the parameter k controlling the
distance measure. It is expressed as:
L_k(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^k \right)^{1/k}    (5)
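Since the Minkowski distance reduces to the Manhattan distance for k = 1 and to the Euclidean distance for k = 2, one small function covers all three. The sketch below (with arbitrarily chosen example points) demonstrates this:

```python
import numpy as np

def minkowski(x, y, k):
    """L_k(x, y) = (sum_i |x_i - y_i|^k)^(1/k)."""
    return np.sum(np.abs(x - y) ** k) ** (1.0 / k)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(minkowski(x, y, 1))   # Manhattan (city-block) distance: 3 + 2 + 0 = 5
print(minkowski(x, y, 2))   # Euclidean distance: sqrt(9 + 4) ≈ 3.61
print(minkowski(x, y, 3))   # A higher-order Minkowski distance
```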
By understanding these distance metrics, we can better analyze and classify data effectively.
Eager Learning
Eager learning is a machine learning approach where the model is built during the training phase,
resulting in a general representation of the data. This model is then used for making predictions,
leading to faster response times during inference since no additional computation is needed. Eager
learning is termed so because the model eagerly captures patterns and relationships from the
training data upfront.
Examples of eager learning algorithms include:
Decision Trees: Create a tree-like model based on feature splits.
Neural Networks: Learn representations through layered architectures.
Support Vector Machines: Find the optimal hyperplane for classification.
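As a small illustration of the eager idea, the sketch below fits a decision tree once and then answers new queries without revisiting the training data; it relies on scikit-learn, which is an assumed dependency not mentioned in the text, and the toy data is invented.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumed dependency, not from the text

# Toy training data: two clusters with class labels 0 and 1 (invented for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Eager phase: all the work happens here, once, at training time.
model = DecisionTreeClassifier(max_depth=2)
model.fit(X, y)

# Prediction phase: fast, since the tree is already built; the training data
# is no longer needed (unlike k-NN, which must search it for every query).
print(model.predict(np.array([[0.5, 0.5], [3.2, 2.8]])))
```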