
Probabilistic Learning

Unit-4 MACHINE LEARNING


Dr. John Babu

Probabilistic Learning
Probabilistic learning methods offer a more transparent approach than traditional neural networks such as the MLP. Neural networks often provide limited interpretability: examining neuron activations and weights may not give clear insight into how a decision was reached. With probabilistic methods, however, we can directly observe the probabilities involved in the decision-making process.
In this section, we explore classification using probabilities derived from the frequency of
examples in the training data. Probabilistic methods help to handle classification tasks more
explicitly. We also introduce unsupervised learning methods for situations where training labels
are not available. In cases where data is drawn from known probability distributions, we can use
the Expectation-Maximization (EM) algorithm to solve the problem effectively.

Classification using Frequency


In classification problems, we calculate the frequency of each class in the training data. This gives
us an estimate of the probability of that class. We can use these probabilities to assign labels to
new examples.
For example, consider a dataset of fruits with features like size and color. Based on the observed
features, we can estimate the probability of a fruit being an apple or an orange. By choosing the
class with the highest probability, we can classify new data points effectively.
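As a rough illustration, here is a minimal Python sketch that estimates class probabilities from label frequencies and picks the most probable class; the fruit labels are made up for the example:

```python
from collections import Counter

# Hypothetical labelled fruit data (labels only, for estimating class priors).
labels = ["apple", "apple", "orange", "apple", "orange", "apple"]

counts = Counter(labels)
total = len(labels)
priors = {cls: count / total for cls, count in counts.items()}

print(priors)                          # e.g. {'apple': 0.667, 'orange': 0.333}
print(max(priors, key=priors.get))     # class with the highest estimated probability
```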

Unsupervised Learning
Unsupervised learning is used when labels are not available for the training examples. Instead of
using labeled data, we focus on identifying patterns or clusters in the data. The EM algorithm is
a widely-used technique in this context.
The EM algorithm works in two main steps:

• Expectation (E-step): Estimate the probability that each data point belongs to each cluster.

• Maximization (M-step): Recalculate the parameters of the clusters (e.g., means and variances) to maximize the likelihood of the data.

This iterative process continues until the algorithm converges, resulting in well-defined clusters
in the data.
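As an aside, if scikit-learn is available, its GaussianMixture class fits such a mixture by running EM internally; the snippet below is only a sketch on synthetic two-cluster data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two well-separated blobs of 2-D points (synthetic data for illustration).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[4, 4], scale=0.5, size=(100, 2)),
])

# Fit a 2-component Gaussian mixture; the fit runs the EM iterations internally.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.means_)                 # estimated cluster means
print(gmm.predict_proba(X[:3]))   # soft (E-step style) responsibilities
print(gmm.predict(X[:3]))         # hard cluster assignments
```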

Nearest Neighbor Methods


Another way to use probabilistic learning is through nearest neighbor methods. Instead of using a pre-defined model, these methods look for the closest examples from the training set to make predictions. The k-nearest neighbor algorithm (k-NN) is a popular approach, where we find the k closest examples in the dataset and use their labels to predict the label of a new data point.
For example, if we want to classify a fruit based on its size and color, we would look for the k
most similar fruits in the dataset. By checking their labels (e.g., apple, orange), we can determine
the label for the new fruit.
Probabilistic learning offers a more interpretable way to handle classification problems, com-
pared to neural networks. With methods like classification based on frequency, unsupervised
learning through the EM algorithm, and nearest neighbor approaches, we can build models that
provide clearer insights into how decisions are made.

Gaussian Mixture Models


For the Bayes’ classifier discussed previously, we utilized target labels for supervised learning.
However, when we have data without target labels, we require unsupervised learning methods. In
this section, we explore a special case where different classes originate from their own Gaussian
distributions, known as multi-modal data.
If we know the number of classes, we can estimate the parameters for that many Gaussians
simultaneously. If the number of classes is unknown, we can experiment with various counts to
find the best fit.

Gaussian Mixture Model Equation


The output for any particular data point input into the algorithm is given by:

f(x) = \sum_{m=1}^{M} \alpha_m \, \phi(x; \mu_m, \Sigma_m)

where \phi(x; \mu_m, \Sigma_m) is the Gaussian function with mean \mu_m and covariance matrix \Sigma_m, and the \alpha_m are weights with the constraint:

\sum_{m=1}^{M} \alpha_m = 1.

This equation describes how the overall probability distribution is a mixture of several Gaussian distributions.

Estimating Class Probabilities


The probability that an input x_i belongs to class m can be estimated as:

p(x_i \in c_m) = \frac{\hat{\alpha}_m \, \phi(x_i; \hat{\mu}_m, \hat{\Sigma}_m)}{\sum_{k=1}^{M} \hat{\alpha}_k \, \phi(x_i; \hat{\mu}_k, \hat{\Sigma}_k)}.
The challenge lies in selecting the weights αm . The standard approach is to seek a maximum
likelihood solution, where we maximize the likelihood of the data given the model.
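The following is a minimal NumPy/SciPy sketch of these two formulas, the mixture density f(x) and the posterior p(x_i ∈ c_m); the parameter values are purely illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, alphas, means, covs):
    """f(x) = sum_m alpha_m * phi(x; mu_m, Sigma_m)."""
    return sum(a * multivariate_normal.pdf(x, mean=mu, cov=S)
               for a, mu, S in zip(alphas, means, covs))

def responsibilities(x, alphas, means, covs):
    """Posterior p(x in c_m) for every component m."""
    weighted = np.array([a * multivariate_normal.pdf(x, mean=mu, cov=S)
                         for a, mu, S in zip(alphas, means, covs)])
    return weighted / weighted.sum()

# Hypothetical two-component model in 2-D.
alphas = [0.6, 0.4]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]

x = np.array([2.5, 2.8])
print(mixture_density(x, alphas, means, covs))
print(responsibilities(x, alphas, means, covs))   # the entries sum to 1
```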

Expectation-Maximization (EM) Algorithm


The EM algorithm is a powerful statistical technique used for finding maximum likelihood estimates of parameters in probabilistic models, especially when the data has missing or hidden variables (often called latent variables). The central idea is to introduce these latent variables to simplify the optimization process.

Gaussian Mixture Model Example


To illustrate the EM algorithm, consider a simple case of a Gaussian mixture model with two
components. Here’s how it works:

Model Definition
Assume there are two Gaussian distributions:

G_1 = N(\mu_1, \sigma_1^2)
G_2 = N(\mu_2, \sigma_2^2)

The overall data distribution is represented as:

y = \pi G_1 + (1 - \pi) G_2

The probability density function can be expressed as:

P(y) = \pi \, \phi(y; \mu_1, \sigma_1) + (1 - \pi) \, \phi(y; \mu_2, \sigma_2)

where π is the mixing coefficient.

Challenge
The goal is to compute the maximum likelihood solution. However, directly differentiating the log-
likelihood function is complex due to the hidden variable f , which indicates from which Gaussian
the data point was generated.

Introducing Latent Variables


Introduce a latent variable f :

• f = 0 indicates the data came from Gaussian G1

• f = 1 indicates the data came from Gaussian G2

Expectation Step (E-step)

Compute the expected value of the latent variable given the current estimates of the parameters. Since f = 0 corresponds to G1, the responsibility \gamma_i is the probability that data point y_i came from G1:

\gamma_i(\hat{\mu}_1, \hat{\mu}_2, \hat{\sigma}_1, \hat{\sigma}_2, \hat{\pi}) = E(1 - f_i \mid \hat{\mu}_1, \hat{\mu}_2, \hat{\sigma}_1, \hat{\sigma}_2, \hat{\pi}, D)

Specifically, this expectation computes:

\gamma_i = \frac{\hat{\pi} \, \phi(y_i; \hat{\mu}_1, \hat{\sigma}_1)}{\hat{\pi} \, \phi(y_i; \hat{\mu}_1, \hat{\sigma}_1) + (1 - \hat{\pi}) \, \phi(y_i; \hat{\mu}_2, \hat{\sigma}_2)}



Maximization Step (M-step)
Maximize the expected log-likelihood with respect to the model parameters. Update the parameters based on the computed expectations:

• M-step 1: Update \hat{\mu}_1:

\hat{\mu}_1 = \frac{\sum_{i=1}^{N} \hat{\gamma}_i y_i}{\sum_{i=1}^{N} \hat{\gamma}_i}

• M-step 2: Update \hat{\mu}_2:

\hat{\mu}_2 = \frac{\sum_{i=1}^{N} (1 - \hat{\gamma}_i) y_i}{\sum_{i=1}^{N} (1 - \hat{\gamma}_i)}

• M-step 3: Update \hat{\sigma}_1^2:

\hat{\sigma}_1^2 = \frac{\sum_{i=1}^{N} \hat{\gamma}_i (y_i - \hat{\mu}_1)^2}{\sum_{i=1}^{N} \hat{\gamma}_i}

• M-step 4: Update \hat{\sigma}_2^2:

\hat{\sigma}_2^2 = \frac{\sum_{i=1}^{N} (1 - \hat{\gamma}_i)(y_i - \hat{\mu}_2)^2}{\sum_{i=1}^{N} (1 - \hat{\gamma}_i)}

• M-step 5: Update \hat{\pi}:

\hat{\pi} = \frac{\sum_{i=1}^{N} \hat{\gamma}_i}{N}

Iteration
Repeat the E-step and M-step until convergence is achieved. The EM algorithm is guaranteed to
converge to a local maximum of the likelihood function.
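A minimal NumPy/SciPy sketch of this iteration, implementing the E-step and M-step updates above for the two-component one-dimensional case, might look like the following (the initialisation and the synthetic data are illustrative choices, not part of the derivation):

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(y, n_iter=100):
    """Sketch of EM for a two-component 1-D Gaussian mixture."""
    # Crude initialisation from the data (an illustrative choice).
    mu1, mu2 = y.min(), y.max()
    s1 = s2 = y.std()
    pi = 0.5
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each point.
        p1 = pi * norm.pdf(y, mu1, s1)
        p2 = (1 - pi) * norm.pdf(y, mu2, s2)
        gamma = p1 / (p1 + p2)
        # M-step: weighted parameter updates, as in M-steps 1-5 above.
        mu1 = np.sum(gamma * y) / np.sum(gamma)
        mu2 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
        s1 = np.sqrt(np.sum(gamma * (y - mu1) ** 2) / np.sum(gamma))
        s2 = np.sqrt(np.sum((1 - gamma) * (y - mu2) ** 2) / np.sum(1 - gamma))
        pi = gamma.mean()
    return mu1, mu2, s1, s2, pi

# Synthetic data drawn from two Gaussians, purely for demonstration.
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 300)])
print(em_two_gaussians(y))   # estimates should land near (0, 5, 1, 1, 0.4)
```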

Conclusion
The EM algorithm effectively handles missing or hidden data by iteratively estimating the expec-
tations of the latent variables and optimizing the model parameters. This framework is widely
used in various applications, particularly in clustering and density estimation tasks.
In many practical learning scenarios, only a subset of relevant instance features is observ-
able. For instance, when training or utilizing a Bayesian belief network, certain variables may
be observed while others remain hidden. To effectively learn in the presence of these unobserved
variables, the EM (Expectation-Maximization) algorithm provides a systematic approach. It can
be applied even when the values of certain variables are never directly observed, given that the
form of the probability distribution governing these variables is known.

Applications of the EM Algorithm


• Bayesian Belief Networks: EM is used to train networks where some variables are unobserved.

• Radial Basis Function Networks: The algorithm can also be utilized here.

• Unsupervised Clustering: Many clustering algorithms are based on the EM algorithm.

• Partially Observable Markov Models: The Baum-Welch algorithm, which employs EM, is widely used in this context.

Physical Significance of the EM Steps


• The E-step calculates the probability (responsibility) that each data point was generated by each Gaussian component.

• The M-step updates the parameters of the Gaussian components based on these responsibilities.

• This process iterates until the parameters converge, refining the model to better explain the data.

The EM algorithm serves as a powerful tool for parameter estimation in models involving
unobserved variables. By leveraging current hypotheses to estimate hidden data and iteratively
refining those hypotheses, EM approaches a maximum likelihood solution. Its versatility across
various applications makes it a fundamental method in both machine learning and statistical
inference.

Comparison between EM Algorithm and K-Means Algorithm

EM Algorithm | K-Means Algorithm
EM is a probabilistic model-based algorithm that estimates parameters. | K-Means is a centroid-based algorithm that partitions data into clusters.
EM can accommodate different types of distributions, including Gaussian. | K-Means assumes that clusters are spherical and equally sized.
EM involves an iterative process of expectation and maximization steps. | K-Means involves iteratively assigning points to the nearest centroid.
EM can produce soft assignments, providing probabilities for cluster membership (soft clustering). | K-Means gives hard assignments, where each point belongs to exactly one cluster (hard clustering).
EM can handle missing data more effectively by estimating hidden variables. | K-Means does not handle missing data and requires complete datasets.
EM generally converges reliably to a local maximum of the likelihood function. | K-Means can converge to different solutions based on initial centroid placement.

Table 1: Differences between EM Algorithm and K-Means Algorithm

Nearest Neighbour Methods


Concept
Imagine we want to know the marks percentage of a particular student but we do not have his marks. We could then look at the marks of his close friends and conclude that his marks would be in the vicinity of the average of those friends' marks. This is similar to how nearest neighbour methods work: when we don't have a model for our data, we look at nearby data points to make a decision.

K-Nearest Neighbors (KNN) Algorithm Example


We will demonstrate the K-Nearest Neighbors (KNN) algorithm using a simple numerical example
with 10 data points. Our goal is to classify a new data point based on its closest neighbors.

Data Points
Consider the following data points in a 2-dimensional space:

Data points: {(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9), (10, 10)}
We also have labels for these data points:

Labels: {A, A, A, A, B, B, B, B, B, B}
Our new point is (6.5, 6.5), and we want to classify it using the KNN algorithm with k = 3.

Step-by-Step KNN Algorithm


1. Calculate Distances: First, we compute the distance between the new point (6.5, 6.5) and
each data point using Euclidean distance:
Distance = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}
For example, the distance between (6.5, 6.5) and (6, 6) is:

Distance = \sqrt{(6.5 - 6)^2 + (6.5 - 6)^2} = \sqrt{0.5^2 + 0.5^2} = \sqrt{0.25 + 0.25} = \sqrt{0.5} \approx 0.71

Similarly, we calculate the distances for all other points.


2. Sort by Distance: After calculating the distances, we sort the points by their distances
to the new point:

Sorted Distances: {(6, 6), (7, 7), (5, 5), (8, 8), (4, 4), (9, 9), (3, 3), (10, 10), (2, 2), (1, 1)}

3. Select k Nearest Neighbors: We now select the 3 nearest neighbors (as k = 3):

Nearest Neighbors: {(6, 6), (7, 7), (5, 5)}


4. Determine Majority Class: The labels for these 3 nearest neighbors are:

Labels: {B, B, B}
Since the majority of the neighbors belong to class B, we classify the new point (6.5, 6.5) as
class B.
By using the KNN algorithm with k = 3, the new point (6.5, 6.5) is classified as class B.
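The worked example can be reproduced with a few lines of NumPy; note that (5, 5) and (8, 8) are tied for third place, but both carry label B, so the classification is unaffected:

```python
import numpy as np
from collections import Counter

# The ten training points and labels from the example above.
points = np.array([(i, i) for i in range(1, 11)], dtype=float)
labels = ["A", "A", "A", "A", "B", "B", "B", "B", "B", "B"]
new_point = np.array([6.5, 6.5])
k = 3

# Step 1: Euclidean distances to every training point.
dists = np.sqrt(((points - new_point) ** 2).sum(axis=1))

# Steps 2-3: indices of the k nearest neighbours (the third place is a tie).
nearest = np.argsort(dists)[:k]
print([tuple(points[i]) for i in nearest])   # e.g. (6,6), (7,7), then (5,5) or (8,8)

# Step 4: majority vote among their labels.
vote = Counter(labels[i] for i in nearest).most_common(1)[0][0]
print(vote)   # 'B'
```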



How It Works
1. Finding Neighbours: We have data points in a space, and we need to determine which ones are close to a new point. To do this, we compute the distance from the new point to each data point. If there are N points in d dimensions, this requires O(N · d) operations for each new point. We can ignore the square root in the distance formula because we only need to know which points are closest, not the exact distances.

2. Choosing Neighbours: After calculating the distances, we identify the k nearest neighbours to the new point. The class of the new point is assigned based on the most common class among these neighbours.

Choosing k
The choice of k is crucial: if k is too small, the method can be sensitive to noise; if k is too large, it may include points that are not relevant, reducing accuracy.

Curse of Dimensionality
As we increase the number of dimensions (d), the distance calculations become more complex. Although methods like KD-Trees can help with this, the following issues arise: as dimensions increase, points tend to spread out, making it harder to find truly "nearby" points; and distances become less meaningful, as many points are far apart in some dimensions but close in others.

Bias-Variance Trade-off
For the k-nearest neighbours (KNN) algorithm, we can analyze the bias-variance trade-off. At a query point x with true function f, noise variance \sigma^2, and k nearest neighbours x_1, \ldots, x_k, the expected prediction error decomposes as:

E\left[(y - \hat{y})^2\right] = \sigma^2 + \left( f(x) - \frac{1}{k} \sum_{i=1}^{k} f(x_i) \right)^2 + \frac{\sigma^2}{k},

where the middle term is the squared bias and the final term is the variance of the estimate, which shrinks as k grows.

The bias-variance tradeoff is crucial in understanding the performance of the KNN algorithm.

Bias
• Bias refers to the error introduced by approximating a real-world problem using a simplified model.

• In KNN, when k is small (e.g., k = 1), the model is flexible and fits the training data closely, resulting in low bias.

• However, this flexibility can lead to high variance, as the model captures noise in the training data.

Variance
• Variance refers to the error introduced by the model's sensitivity to fluctuations in the training data.

• A high variance model, like KNN with small k, performs well on training data but poorly on unseen data due to overfitting.



KNN and the Tradeoff
• Small k (e.g., k = 1):

  – Low Bias: The model fits training data closely.
  – High Variance: The model is sensitive to noise, leading to overfitting.

• Large k (e.g., k = 10):

  – High Bias: The model generalizes too much, potentially missing important patterns.
  – Low Variance: The model is stable, averaging over more neighbors, reducing noise impact.

Optimal k
Finding a balance between bias and variance is key:

• Too small k leads to overfitting (high variance, low bias).

• Too large k leads to underfitting (high bias, low variance).

• Cross-validation can help determine the optimal k by evaluating model performance on unseen data.

In KNN, understanding the bias-variance tradeoff is essential for tuning the algorithm effec-
tively, with the goal of minimizing overall error.
The nearest neighbour methods rely on the idea of learning from similar data points. By
carefully choosing k and considering dimensionality, we can effectively classify new data based on
existing patterns.
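Returning to the cross-validation suggestion above: assuming scikit-learn is available, one could compare several values of k by cross-validation on synthetic data, as in the sketch below (the data and the candidate values of k are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification data purely for illustration.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Score a range of k values with 5-fold cross-validation and compare means.
for k in [1, 3, 5, 10, 20]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, scores.mean())
```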

Efficient Distance Computations: the KD-Tree


Computing distances between all pairs of points can be very expensive. To solve this, we can
use a data structure called the KD-Tree, which reduces the cost of finding a nearest neighbor to
O(log N ) for O(N ) storage. The construction of the tree takes O(N log N ), mainly due to finding
the median.
The KD-tree is built by creating a binary tree that splits one dimension at a time using the
median of the point coordinates. Consider the following seven two-dimensional points:

(5, 4), (1, 6), (6, 1), (7, 5), (2, 7), (2, 2), (5, 8)

Steps for Splitting


1. Initial Split (First Dimension: x): Sort the points by their x-coordinates:

(1, 6), (2, 2), (2, 7), (5, 4), (5, 8), (6, 1), (7, 5)

The median point (the 4th of the seven, index 3 when 0-indexed) is (5, 4). This creates a split at x = 5.
2. Left Subtree (Points with x < 5): The remaining points are (1, 6), (2, 2), (2, 7). Sorted by y-coordinate:

(2, 2), (1, 6), (2, 7)

The median point is (1, 6), which creates a split at y = 6.

3. Right Subtree (Points with x ≥ 5): The remaining points are (5, 4), (5, 8), (6, 1), (7, 5). Sorted by y-coordinate:

(6, 1), (5, 4), (7, 5), (5, 8)

Taking the upper median gives (7, 5), which creates a split at y = 5.

Visualization
At this point, the KD-Tree has the following structure:

- Root node: (5, 4), split at x = 5
  - Left child: (1, 6), split at y = 6
    - Left: (2, 2)
    - Right: (2, 7)
  - Right child: (7, 5), split at y = 5
    - Left: (5, 4), with child (6, 1)
    - Right: (5, 8)

Searching the Tree


To find the nearest neighbor, we start at the root and compare dimensions one at a time. For example, if we introduce a test point, say (3, 5):

1. Start at the root (5, 4): the split is at x = 5 and the test point has x = 3 < 5, so go left.

2. Move to (1, 6): the split is at y = 6 and the test point has y = 5 < 6, so go left, reaching the leaf (2, 2).

3. The leaf found, (2, 2), is only a first candidate. Check distances while backtracking up the tree to see if there is a closer point.

This process continues until all potential points are checked, making the KD-Tree efficient in
finding nearest neighbors with significantly reduced computational costs.
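For a quick check of the example, SciPy's KDTree can be queried for the nearest neighbour of (3, 5). Note that SciPy builds its own tree internally, so its splits need not match the hand-built tree above, and several points happen to be tied at distance √5:

```python
from scipy.spatial import KDTree

# The seven example points from the construction above.
points = [(5, 4), (1, 6), (6, 1), (7, 5), (2, 7), (2, 2), (5, 8)]
tree = KDTree(points)

dist, idx = tree.query((3, 5))   # single nearest neighbour of the test point
print(points[idx], dist)         # one of the tied points at distance sqrt(5) ≈ 2.24
```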



Figure 1: Example KD-Tree Structure

Distance Measures
Distance measures are essential in data analysis, particularly in clustering and classification tasks.
They help quantify how similar or different two data points are. Below, we explore several impor-
tant distance metrics along with their real-time applications.



Euclidean Distance
The Euclidean distance is the most commonly used distance measure. It calculates the straight-
line distance between two points in Euclidean space. Given two points (x1 , y1 ) and (x2 , y2 ), the
Euclidean distance dE is defined as:
d_E = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}    (1)
This distance is derived from the Pythagorean theorem, where the distance represents the
hypotenuse of a right triangle formed by the differences in the coordinates. In three dimensions,
the formula extends to:
d_E = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}    (2)
In a recommendation system for online shopping, Euclidean distance can be used to measure
the similarity between users based on their purchasing behavior. For instance, if two users have
similar purchase histories, the system can recommend products based on what similar users have
bought.

Manhattan Distance
In contrast to the Euclidean distance, the Manhattan distance (also known as city-block distance)
measures the distance between two points based on a grid-like path. It adds the absolute differences
in each dimension. For two points (x1 , y1 ) and (x2 , y2 ), the Manhattan distance dC is given by:

d_C = |x_1 - x_2| + |y_1 - y_2|    (3)


This metric is particularly useful in urban planning, where one must navigate through streets
and blocks rather than a straight line. In higher dimensions, it generalizes to:
d_C = \sum_{i=1}^{n} |x_i - y_i|    (4)

Manhattan distance is often used in robotics, particularly for pathfinding algorithms. For
example, a robot navigating through a city grid will calculate its distance to a destination using
Manhattan distance to find the most efficient path while avoiding obstacles like buildings.

Minkowski Distance
The Minkowski distance is a generalization of both the Euclidean and Manhattan distances. It is
defined for two points x and y in an n-dimensional space, with the parameter k controlling the
distance measure. It is expressed as:

L_k(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^k \right)^{1/k}    (5)

- For k = 1, this reduces to the Manhattan distance:


L_1(x, y) = \sum_{i=1}^{n} |x_i - y_i|    (6)

- For k = 2, it becomes the Euclidean distance:


L_2(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_n - y_n)^2}    (7)



In machine learning, Minkowski distance is commonly used in k-nearest neighbors (KNN)
algorithms, allowing flexibility in choosing the distance metric. For instance, when classifying
images, k can be adjusted to emphasize certain features based on their importance in distinguishing
between classes.
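The three metrics discussed above can be written as small NumPy functions; this is only a sketch, with the test points chosen arbitrarily:

```python
import numpy as np

def euclidean(x, y):
    """Straight-line (L2) distance."""
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def manhattan(x, y):
    """City-block (L1) distance."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)))

def minkowski(x, y, k):
    """Minkowski distance; k=1 gives Manhattan, k=2 gives Euclidean."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** k) ** (1.0 / k)

a, b = (1, 2), (4, 6)
print(euclidean(a, b))      # 5.0
print(manhattan(a, b))      # 7
print(minkowski(a, b, 1))   # 7.0
print(minkowski(a, b, 2))   # 5.0
```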

Choosing the Right Distance Metric


The choice of distance metric can significantly impact the results of data analysis. Factors such as
the dimensionality of the data, the presence of outliers, and the underlying data distribution should
influence the selection of the appropriate distance measure. In some cases, more sophisticated
metrics, such as the Mahalanobis distance or invariant metrics like the tangent distance, may be
preferable, particularly for applications such as image recognition.

Figure 2: Distance Metrics

By understanding these distance metrics, we can better analyze and classify data effectively.



Lazy vs Eager learning
Lazy Learning
Lazy learning refers to a type of machine learning where the model does not explicitly generalize
from the training data during the training phase. Instead, it stores the training instances and
delays the learning process until a query is made.
K-Nearest Neighbors (KNN) is considered lazy learning because it does not build a model
ahead of time. Instead, it relies on the entire dataset at the time of prediction, calculating the
distance to find the nearest neighbors to make decisions, which can lead to high memory usage
and slow predictions.

Eager Learning
Eager learning is a machine learning approach where the model is built during the training phase,
resulting in a general representation of the data. This model is then used for making predictions,
leading to faster response times during inference since no additional computation is needed. Eager
learning is termed so because the model eagerly captures patterns and relationships from the
training data upfront.
Examples of eager learning algorithms include:

• Decision Trees: Create a tree-like model based on feature splits.

• Neural Networks: Learn representations through layered architectures.

• Support Vector Machines: Find the optimal hyperplane for classification.

Lazy Learning | Eager Learning
Learns from training data at the time of query. | Learns from training data during the training phase.
Examples: K-Nearest Neighbors (KNN). | Examples: Decision Trees, Neural Networks, Support Vector Machines (SVM).
No explicit model is built; the training data is stored. | An explicit model is built during training.
Fast training time, slow prediction time. | Slow training time, fast prediction time.
High memory usage due to storing all instances. | Lower memory usage, as only the model parameters are stored.
Adapts quickly to new data; instant updates. | Requires retraining to adapt to new data.

Table 2: Differences between Lazy Learning and Eager Learning
