
Machine learning

UNIT 3
Statistical Learning:
Machine Learning and Inferential Statistical Analysis:
Inferential statistics is the branch of statistics that uses analytical tools to
draw conclusions about a population by analyzing random samples drawn from it. Its purpose
is to make generalizations about a population. Inferential statistics uses
a statistic computed from sample data (such as the sample mean) to make inferences about a
population parameter (e.g., the population mean).

Examples of Inferential Statistics:


Inferential statistics is beneficial and cost-effective because it can make inferences about a
population without collecting complete data. Some examples of inferential statistics are given
below:

Suppose the average marks of a sample of 100 students in a particular country are known. Using this
sample information, inferential statistics can approximate the average marks of all students in
that country.

Suppose a coach wants to find out how many throws, on average, his college sophomores can
make without stopping. A sample of students is asked to throw, and the
average is calculated. Inferential statistics then uses this data to infer how many throws
sophomores can make on average.

Why do we Need Inferential Statistics?


Let's say you're interested in learning what data science experts in India earn on average.
Which of the following approaches could be used to figure it out?

Meet each and every data science specialist in India, take note of their pay, and then compute
the overall average.

Or survey a few professionals in a city like Gurgaon, note their salaries, and use them to calculate
the Indian average.

The first method is not impossible, but it would require enormous resources and time. But
today, companies want to make decisions quickly and efficiently, so the first method has no
chance.

On the other hand, the second method seems feasible. However, there is a caveat. What if
the population of Gurgaon does not reflect the entire population of India?

Then there is a good chance that you will make a very wrong estimate of the salaries of Indian
data science professionals.
Purpose of Inferential Statistics
There are two primary purposes of inferential statistics. The first is parameter estimation:
we use a statistic from our data set, like the sample standard deviation, to estimate a
more general parameter, such as the standard deviation of the whole population.

A second place where inferential statistics is useful is in hypothesis tests. These can be
especially useful for gathering information about something that can only be administered to
a small group, like a new diabetes medication. The information gathered can be used to
construct a forecast about whether this medication will be effective for the "full population" of
diabetic patients (typically by computing a z-score).

Types of Inferential Statistics


Inferential statistics can be divided into hypothesis testing and regression analysis.
Hypothesis testing also involves using confidence intervals to test population parameters.
Below are the different types of inferential statistics.

The Hypothesis is of Two Types:

Null Hypothesis: A null hypothesis is a hypothesis in which we assume that the
sample observations are purely the result of chance. It is denoted H0.

Alternative Hypothesis: An alternative hypothesis is a hypothesis in which we assume that
the sample observations are not due to chance alone; some non-random effect influences
them. The alternative hypothesis is denoted H1 or Ha.

Steps of Hypothesis Testing

The process of determining whether or not to reject the null hypothesis based on sample
data is called hypothesis testing. It consists of four steps:

 Define the null and alternative hypotheses

 Define an analysis plan that determines how the sample data will be used to evaluate
the null hypothesis
 Analyze the sample data to produce a single number called the "test statistic"
 Interpret the result by applying a decision rule to decide whether or not to reject the null
hypothesis

If the p-value is less than the significance level, we reject the null hypothesis; otherwise,
we fail to reject the null hypothesis.

Technically, we never accept the null hypothesis. We say that we either fail to reject or reject
the null hypothesis.

Terms in Hypothesis Testing


Significance Level
The significance level is the probability of rejecting the null hypothesis when it is actually
true. For example, a significance level of 0.05 means that there is a 5% risk of
concluding that there is some difference when there is no difference. It is denoted alpha (α).

In the usual two-tailed illustration, the two shaded regions are equidistant from the center of the
distribution, each with a probability of 0.025 and a total of 0.05, which is our significance
level. In the case of a two-tailed test, the shaded region is called the critical region.

P-value

The p-value is the probability of obtaining a test statistic at least as extreme as the calculated value if
the null hypothesis is true. A sufficiently low p-value is a reason to reject the null hypothesis:
we reject the null hypothesis if the p-value is less than the significance level.

Errors in Hypothesis Testing

We have explained what hypothesis testing is and the steps to test it. Now, while doing
hypothesis testing, some errors may occur.

We classify these errors into two categories.

Type 1 Error: A Type 1 error is when we reject the null hypothesis, but it is actually true.
The probability of a type 1 error is called the significance level alpha(α).

Type 2 Error: A Type 2 error is when we fail to reject the null hypothesis, but it is actually
false. The probability of a type 2 error is called beta(β).

Z-test

The Z-test is mainly used when the data is normally distributed. We compute the
Z-statistic for the sample mean. The z-score of a single observation is given by the formula,

Z-score = (x – µ) / σ

and for a sample mean the statistic is z = (x̄ – µ) / (σ / √n). The z-test is mainly used when the
population mean and standard deviation are known.
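As a concrete, hedged illustration, the sketch below runs a two-tailed one-sample z-test with SciPy; the population parameters and sample figures are made-up values, not data from the text.

```python
# Hypothetical one-sample z-test: known population mean mu and sigma,
# sample of n observations with sample mean x_bar (all numbers invented).
from math import sqrt
from scipy.stats import norm

mu, sigma = 100, 15            # assumed population mean and standard deviation
n, x_bar = 50, 103             # assumed sample size and sample mean

z = (x_bar - mu) / (sigma / sqrt(n))    # z-statistic for the sample mean
p_value = 2 * (1 - norm.cdf(abs(z)))    # two-tailed p-value

print(f"z = {z:.3f}, p = {p_value:.4f}")
# Reject H0 at alpha = 0.05 if p_value < 0.05.
```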

T-test

A t-test is similar to a z-test. It is used when the population standard deviation is unknown and
only the sample standard deviation is available, or when the sample size is small (n < 30).

Different Types of T-tests

One Sample T-test

A one-sample t-test compares the mean of sample data with a known value; when we have
to compare the mean of sample data with the population mean, we use a one-sample t-test.

We can perform a one-sample t-test when we do not have the population standard deviation or
when the sample size is less than 30.

The t-statistic is given by t = (x̄ – µ) / (s / √n), where x̄ is the sample mean, µ is the
hypothesized population mean, s is the sample standard deviation, and n is the sample size.
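A minimal sketch of a one-sample t-test using scipy.stats.ttest_1samp; the sample values and the hypothesized mean of 50 are illustrative assumptions.

```python
# One-sample t-test: H0 says the population mean equals 50 (hypothetical data).
from scipy import stats

sample = [52.1, 48.3, 55.0, 49.7, 51.2, 53.8, 47.9, 50.6]
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# If p_value < 0.05, reject H0; otherwise, fail to reject it.
```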


Two-Sample T-Test

We use the two-sample T-test when we want to evaluate whether the mean of the two
samples is different or not. In the two-sample T-test, we have two more categories:

Independent Sample T-Test: The two samples should be chosen from two entirely
separate populations and drawn independently. In other words, it is inappropriate for one
group to depend on the other.

Paired T-Test: If our samples are related in some way, we need to use a paired t-test. Here,
linkage means that the samples are linked because we are collecting data from the same
group twice, e.g., a blood test of patients in a hospital before and after medication.
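The following hedged sketch shows both variants with SciPy: ttest_ind for independent samples and ttest_rel for paired samples; all numbers are invented for illustration.

```python
# Independent vs. paired two-sample t-tests (illustrative numbers only).
from scipy import stats

group_a = [23, 25, 28, 22, 26, 27]      # scores from one population
group_b = [30, 29, 31, 28, 32, 27]      # scores from a separate population
before  = [120, 135, 128, 140, 132]     # same subjects before medication
after   = [115, 130, 126, 134, 129]     # same subjects after medication

t_ind, p_ind = stats.ttest_ind(group_a, group_b)   # independent-sample t-test
t_rel, p_rel = stats.ttest_rel(before, after)      # paired t-test

print(f"independent: t = {t_ind:.2f}, p = {p_ind:.4f}")
print(f"paired:      t = {t_rel:.2f}, p = {p_rel:.4f}")
```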

Chi-square Test

The Chi-square test is used when we have to compare categorical data.
The chi-square test is of two types. Both use the chi-square statistic and the chi-square distribution, but for
different purposes.

The Goodness of Fit

Determines whether the sample data of the categorical variables match the population.

Test of Independence

Compares two categorical variables to see whether they are related.

The chi-square statistic is given by χ² = Σ (O – E)² / E, where O is the observed frequency
and E is the expected frequency.
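As a hedged sketch, both chi-square variants can be run with SciPy: chi2_contingency for the test of independence and chisquare for goodness of fit; the contingency table and die-roll counts below are made up.

```python
# Chi-square tests with made-up counts.
from scipy.stats import chi2_contingency, chisquare

# Test of independence on a 2x2 contingency table (hypothetical categories).
observed = [[30, 10],
            [20, 40]]
chi2, p, dof, expected = chi2_contingency(observed)
print(f"independence: chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")

# Goodness of fit: do observed die-roll frequencies match a fair die?
rolls = [18, 22, 16, 24, 21, 19]            # hypothetical observed frequencies
chi2_gof, p_gof = chisquare(rolls)          # expected counts default to uniform
print(f"goodness of fit: chi2 = {chi2_gof:.2f}, p = {p_gof:.4f}")
```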

ANOVA (Analysis of Variance)

An ANOVA test can be used to assess the significance of experimental data. It is typically
applied when there are more than two groups and we need to determine whether the
population means of the groups are equal (assuming roughly equal variances).

For instance, suppose students from many universities sit for the same exam, and we want to see if one
college does better than the others.

There are Two Types of ANOVA Tests:-

1. One-way ANOVA
2. Two-way ANOVA
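For the exam example above, a one-way ANOVA can be run with scipy.stats.f_oneway, as in the hedged sketch below; the three score lists are invented.

```python
# One-way ANOVA: are the mean exam scores of three colleges equal? (made-up data)
from scipy.stats import f_oneway

college_a = [78, 82, 85, 74, 80]
college_b = [72, 75, 70, 78, 74]
college_c = [88, 84, 90, 86, 87]

f_stat, p_value = f_oneway(college_a, college_b, college_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests at least one college mean differs from the others.
```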

Conclusion
Thus, this section has outlined the main concepts of inferential statistics and how they are
applied in practice.
Descriptive Statistics in learning techniques

Descriptive statistics serves as the initial step in understanding and

summarizing data. It involves organizing, visualizing, and summarizing raw data

to create a coherent picture. The primary goal of descriptive statistics is to

provide a clear and concise overview of the data’s main features. This helps us

identify patterns, trends, and characteristics within the data set without making

broader inferences.

Key Aspects of Descriptive Statistics

 Measures of Central Tendency: Descriptive statistics include calculating

the mean, median, and mode, which offer insights into the center of the

data distribution.

 Measures of Dispersion: Variance, standard deviation, and range help us

understand the spread or variability of the data.

 Visualizations: Graphs such as histograms, bar charts, and pie charts

visually represent the data's distribution and characteristics.

Types of Descriptive Statistics

There are various dimensions in which this data can be described. The three

main dimensions used for describing data are the central tendency, dispersion,

and the shape of the data. Now, let’s look at them in detail, one by one.

Descriptive Statistics Based on the Central Tendency of Data

The central tendency of data is the center of the distribution of data. It describes

the location of data and concentrates on where the data is located. The three
most widely used measures of the “center” of the data are Mean, Median, and

Mode.

Mean

The “Mean” is the average of the data. The average can be identified by

summing up all the numbers and then dividing them by the number of

observations.

Mean = (X1 + X2 + X3 + … + Xn) / n

Example:

Data – 10,20,30,40,50 and Number of observations = 5

Mean = [ 10+20+30+40+50 ] / 5

Mean = 30

The central tendency of the data may be influenced by outliers. You may now

ask, 'What are outliers?' Well, outliers are extreme values: an outlier is a

data point that differs significantly from other observations. It can cause serious

problems in analysis.
Example:

Data – 10,20,30,40,200

Mean = [ 10+20+30+40+200 ] / 5

Mean = 60

Solution to the outlier problem: removing the outliers before taking the average

will give us better results.

Median

It is the 50th percentile of the data. In other words, it is exactly the center point of

the data. The median can be identified by ordering the data, splitting it into two

equal parts, and then finding the number in the middle. It is the best way to find

the center of the data.

Note that, in this case, the central tendency of the data is not affected by outliers.
Example:

Odd number of Data – 10,20,30,40,50

Median is 30.

Even number of data – 10,20,30,40,50,60

Find the middle two values and take the mean of those two values.

Here, 30 and 40 are the middle values.

Now, add them and divide the result by 2:

(30 + 40) / 2 = 35

Median is 35.

Mode

The mode of the data is the most frequently occurring data or elements in a

dataset. If an element occurs the highest number of times, it is the mode of that

data. If no number in the data is repeated, then that data has no mode. There

can be more than one mode in a dataset if two values have the same frequency,

which is also the highest frequency.

Outliers do not influence the mode. The mode can be calculated for

both quantitative and qualitative data.


Example:

Data – 1,3,4,6,7,3,3,5,10, 3

Mode is 3, because 3 has the highest frequency (4 times)
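The three measures above can be checked with Python's standard statistics module, as in this small sketch reusing the example data.

```python
# Mean, median, and mode of the example datasets above.
import statistics as st

print(st.mean([10, 20, 30, 40, 50]))              # 30
print(st.median([10, 20, 30, 40, 50, 60]))        # 35.0 (mean of the middle pair)
print(st.mode([1, 3, 4, 6, 7, 3, 3, 5, 10, 3]))   # 3 (most frequent value)
```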

Descriptive Statistics Based on the Dispersion of Data

The dispersion is the “spread of the data”. It measures how far the data is

spread. In most datasets, the data values are located close to the mean; in other

datasets, the values are widely spread out from the mean. These dispersions of

data can be measured by the Inter Quartile Range (IQR), range, standard deviation,

and variance of the data.

Let us see these measures in detail.

Inter Quartile Range (IQR)

Quartiles are special percentiles.

1st Quartile (Q1) is the same as the 25th percentile.

2nd Quartile (Q2) is the same as the 50th percentile.

3rd Quartile (Q3) is the same as the 75th percentile.

Steps to find quartiles and percentiles

 The data should be sorted and ordered from the smallest to the largest.
 For quartiles, the ordered data is divided into 4 equal parts.

 For percentiles, the ordered data is divided into 100 equal parts.

The Inter Quartile Range is the difference between the third quartile (Q3) and the

first quartile (Q1)

IQR = Q3 – Q1

The Inter Quartile Range is therefore the spread of the middle half (50%) of

the data.
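A small sketch with NumPy's percentile function; the data values are arbitrary and only illustrate the computation.

```python
# Quartiles and interquartile range (IQR) with NumPy (illustrative values).
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])
q1, q2, q3 = np.percentile(data, [25, 50, 75])   # 25th, 50th, 75th percentiles
iqr = q3 - q1                                    # spread of the middle 50% of the data
print(q1, q2, q3, iqr)
```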

Range

The range is the difference between the largest and the smallest value in the

data.

Standard Deviation

The most common measure of spread is the standard deviation. The standard

deviation measures how far the data deviates from the mean value. The standard

deviation formula differs for a population and a sample. Both

formulas are similar but not the same.

 Symbol used for Sample Standard Deviation – “s” (lowercase)


 Symbol used for Population Standard Deviation – “σ” (sigma, lower case)

Steps to find the Standard Deviation

If x is a number, then the difference “x – mean” is its deviation. The deviations

are used to calculate the standard deviation.

Sample Standard Deviation, s = square root of the sample variance:

s = √[ Σ(x − x̄)² / (n − 1) ], where x̄ is the sample mean and n is the number of samples.

Population Standard Deviation, σ = square root of the population variance:

σ = √[ Σ(x − μ)² / N ], where μ is the population mean and N is the size of the population.

The standard deviation is always positive or zero. It will be large when the data

values are spread out from the mean.


Variance

The variance is a measure of variability. It is the average squared deviation from

the mean. The symbol σ² represents the population variance, and the symbol

s² represents the sample variance.

Population variance: σ² = Σ(x − μ)² / N

Sample variance: s² = Σ(x − x̄)² / (n − 1)
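In NumPy the ddof argument switches between the population formulas (divide by N) and the sample formulas (divide by n − 1), as in this sketch.

```python
# Population vs. sample variance and standard deviation with NumPy.
import numpy as np

x = np.array([10, 20, 30, 40, 50])

print(np.var(x), np.var(x, ddof=1))   # 200.0 (divide by N) vs. 250.0 (divide by n-1)
print(np.std(x), np.std(x, ddof=1))   # square roots of the values above
```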

Descriptive Statistics Based on the Shape of the Data

The shape of the data is important because the probability distribution of the data

depends on its shape. The shape describes the type of the graph.

The shape of the data can be described by three properties: symmetry,

skewness, and kurtosis.
Symmetric

In the symmetric shape of the graph, the data is distributed the same on both

sides. In symmetric data, the mean and median are located close together. The

curve formed by this symmetric graph is called a normal curve.

Skewness

Skewness is the measure of the asymmetry of the distribution of data. The data

is not symmetrical (i.e.) it is skewed towards one side. Skewness is classified

into two types: positive skew and negative skew.

 Positively skewed: In a Positively skewed distribution, the data values are

clustered around the left side of the distribution, and the right side is longer. The

mean and median will be greater than the mode in the positive skew.

 Negatively skewed: In a Negatively skewed distribution, the data values are

clustered around the right side of the distribution, and the left side is longer. The

mean and median will be less than the mode.


Kurtosis

Kurtosis is a measure describing the tailedness of the distribution of data. Based on kurtosis, data can be

distributed in three different ways: platykurtic, mesokurtic, and leptokurtic.

 Platykurtic: A platykurtic distribution has thin, flat tails and a low peak; extreme

outliers are less frequent than in a normal distribution.
 Mesokurtic: A mesokurtic distribution has the same kurtosis as a normal

distribution, i.e., it matches the normal distribution.

 Leptokurtic: A leptokurtic distribution has a sharp, high peak and heavy tails; the

data are concentrated around the center, but extreme values occur more often.
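Skewness and excess kurtosis can be computed with scipy.stats, as in this hedged sketch on synthetic right-skewed data.

```python
# Skewness and kurtosis of a synthetic, right-skewed sample.
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=1000)   # exponential data is positively skewed

print(skew(data))        # > 0 indicates a positive (right) skew
print(kurtosis(data))    # excess kurtosis; about 0 for a mesokurtic (normal-like) shape
```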

BAYESIAN REASONING: A PROBABILISTIC APPROACH TO INFERENCE

The Bayesian model of statistical decision making has a softer edge. It associates a

probability with each prediction. Bayesian classification aims to estimate the

probabilities that a pattern to be classified belongs to the various possible categories.

The Bayesian method assumes that the variables of

interest are governed by probability distributions and that optimal decisions are

made by reasoning about these probabilities along with the observed data [48].

Bayesian learning techniques have relevance in the study of machine learning for

two separate reasons.


(i) Bayesian learning algorithms that compute explicit probabilities, for example, the

naive Bayes classifier, are among the most practical approaches to specific kinds of

learning problems. For instance, the naive Bayes classifier is probably among the

most effective algorithms for learning to classify text documents. The naive

Bayes technique is extremely helpful in the case of huge datasets. For example, Google

employs a naive Bayes classifier to correct spelling mistakes in the text typed in by

users. Empirical outcomes reported in the literature show that the naive Bayes classifier

is competitive with other algorithms, such as decision trees and neural networks; in

several cases with huge datasets, it outperforms the other methods.

(ii) The importance of Bayesian techniques to our study of machine learning is also

that they give a meaningful perspective for understanding various learning

algorithms that do not explicitly manipulate probabilities. It is essential to have at

least a basic familiarity with Bayesian techniques to understand and characterize the

operations of several algorithms in machine learning.


Bayes’ theorem is fundamental in machine learning, especially in the context of
Bayesian inference. It provides a way to update our beliefs about a hypothesis
based on new evidence.
What is Bayes theorem?
Bayes’ theorem is a fundamental concept in probability theory that plays a crucial role in
various machine learning algorithms, especially in the fields of Bayesian statistics and
probabilistic modelling. It provides a way to update probabilities based on new evidence
or information. In the context of machine learning, Bayes’ theorem is often used in
Bayesian inference and probabilistic models.
The theorem can be mathematically expressed as:
P(A|B) = P(B|A) · P(A) / P(B)
where
 P(A|B) is the posterior probability of event A given event B.
 P(B|A) is the likelihood of event B given event A.
 P(A) is the prior probability of event A.
 P(B) is the total probability of event B.
In the context of modeling hypotheses, Bayes' theorem allows us to update our belief in a
hypothesis based on new data. We start with a prior belief in the hypothesis, represented
by P(A), and then update this belief based on how likely the data are to be observed under
the hypothesis, represented by P(B|A). The posterior probability P(A|B) represents our
updated belief in the hypothesis after considering the data.
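A small worked example of the formula, using a hypothetical diagnostic test; the prior and likelihood values are assumptions chosen purely for illustration.

```python
# Bayes' theorem on a hypothetical diagnostic test.
p_a = 0.01              # P(A): prior probability of having the disease
p_b_given_a = 0.95      # P(B|A): probability of a positive test if ill
p_b_given_not_a = 0.05  # P(B|~A): probability of a positive test if healthy

# Total probability of a positive test, P(B)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))   # about 0.161: the belief is updated from 1% to ~16%
```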
Key Terms Related to Bayes Theorem
1. Likelihood (P(B|A)):
 Represents the probability of observing the given evidence
(features) given that the class is true.
 In the Naive Bayes algorithm, a key assumption is that features are
conditionally independent given the class label, so this likelihood
factorizes into a product of per-feature probabilities.
2. Prior Probability (P(A)):
 In machine learning, this represents the probability of a particular
class before considering any features.
 It is estimated from the training data.
3. Evidence Probability( P(B) ):
 This is the probability of observing the given evidence (features).
 It serves as a normalization factor and is often calculated as the sum
of the joint probabilities over all possible classes.
4. Posterior Probability( P(A∣B) ):
 This is the updated probability of the class given the observed
features.
 It is what we are trying to predict or infer in a classification task.
Now, to utilise this in machine learning, we use the Naive Bayes classifier; but in
order to understand how precisely this classifier works, we must first understand the
maths behind it.
Applications of Bayes Theorem in Machine learning
1. Naive Bayes Classifier
The Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes’
theorem with a strong (naive) independence assumption between the features. It is
widely used for text classification, spam filtering, and other tasks involving high-
dimensional data. Despite its simplicity, the Naive Bayes classifier often performs well in
practice and is computationally efficient.
How it works?
 Assumption of Independence: The “naive” assumption in Naive Bayes is that
the presence of a particular feature in a class is independent of the presence of
any other feature, given the class. This is a strong assumption and may not hold
true in real-world data, but it simplifies the calculation and often works well in
practice.
 Calculating Class Probabilities: Given a set of features x1,x2,…,xn, the Naive
Bayes classifier calculates the probability of each class Ck given the features
using Bayes’ theorem:
P(Ck | x1, x2, …, xn) = P(x1, x2, …, xn | Ck) · P(Ck) / P(x1, x2, …, xn),
o the denominator P(x1, x2, …, xn) is the same for all classes and
can be ignored for the purpose of comparison.
 Classification Decision: The classifier selects the class Ck with the highest
probability as the predicted class for the given set of features.
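As a hedged sketch, scikit-learn's GaussianNB applies exactly this rule with Gaussian per-feature likelihoods; the Iris dataset is used here only as a convenient example.

```python
# Naive Bayes classification with scikit-learn (minimal sketch on the Iris data).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = GaussianNB().fit(X_train, y_train)   # estimates P(Ck) and per-feature likelihoods
y_pred = model.predict(X_test)               # picks the class with the highest posterior
print(accuracy_score(y_test, y_pred))
```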
2. Bayes optimal classifier
The Bayes optimal classifier is a theoretical concept in machine learning that represents
the best possible classifier for a given problem. It is based on Bayes’ theorem, which
describes how to update probabilities based on new evidence.
In the context of classification, the Bayes optimal classifier assigns the class label that has
the highest posterior probability given the input features. Mathematically, this can be
expressed as:
ŷ = argmax_y P(y | x)
where ŷ is the predicted class label, y is a class label, x is the input feature vector,
and P(y|x) is the posterior probability of class y given the input features.
3. Bayesian Optimization
Bayesian optimization is a powerful technique for global optimization of expensive-to-
evaluate functions. To choose which point to evaluate next, a probabilistic model of the
objective function, typically based on a Gaussian process, is constructed. Bayesian
optimization finds a good solution quickly and requires few evaluations by intelligently
searching the search space and iteratively improving the model. Because of this, it is
especially well suited for tasks like hyperparameter tuning of machine learning models,
where each evaluation may be computationally costly.
4. Bayesian Belief Networks
Bayesian Belief Networks (BBNs), also known as Bayesian networks, are probabilistic
graphical models that represent a set of random variables and their conditional
dependencies using a directed acyclic graph (DAG). Each node represents a random
variable, and the graph's edges show the dependencies between the nodes.

BBNs are employed for modeling uncertainty and generating probabilistic conclusions
regarding the network’s variables. They may be used to provide answers to queries like
“What is the most likely explanation for the observed data?” and “What is the probability
of variable A given the evidence of variable B?”

BBNs are extensively utilized in several domains, such as risk analysis, diagnostic
systems, and decision-making. They are useful tools for reasoning under uncertainty
because they provide a graphical and understandable representation of complicated
probabilistic relationships between variables.

K-Nearest Neighbor Classifier



TheK-Nearest Neighbors (KNN) algorithm is a supervised machine learning method
employed to tackle classification and regression problems. Evelyn Fix and Joseph Hodges
developed this algorithm in 1951, which was subsequently expanded by Thomas Cover. The
article explores the fundamentals, workings, and implementation of the KNN algorithm.
What is the K-Nearest Neighbors Algorithm?
KNN is one of the most basic yet essential classification algorithms in machine learning. It
belongs to the supervised learning domain and finds intense application in pattern
recognition, data mining, and intrusion detection.
It is widely applicable in real-life scenarios since it is non-parametric, meaning it does not
make any underlying assumptions about the distribution of data (as opposed to other
algorithms such as GMM, which assume a Gaussian distribution of the given data). We are
given some prior data (also called training data), which classifies coordinates into groups
identified by an attribute.
As an example, consider the following table of data points containing two features:

KNN Algorithm working visualization

Now, given another set of data points (also called testing data), allocate these points to a
group by analyzing the training set. Note that the unclassified points are marked as ‘White’.
Intuition Behind KNN Algorithm
If we plot these points on a graph, we may be able to locate some clusters or groups. Now,
given an unclassified point, we can assign it to a group by observing what group its nearest
neighbors belong to. This means a point close to a cluster of points classified as ‘Red’ has a
higher probability of getting classified as ‘Red’.
Intuitively, we can see that the first point (2.5, 7) should be classified as ‘Green’, and the
second point (5.5, 4.5) should be classified as ‘Red’.
Why do we need a KNN algorithm?
The K-NN algorithm is a versatile and widely used machine learning algorithm that is primarily
used for its simplicity and ease of implementation. It does not require any assumptions
about the underlying data distribution. It can also handle both numerical and categorical
data, making it a flexible choice for various types of datasets in classification and regression
tasks. It is a non-parametric method that makes predictions based on the similarity of data
points in a given dataset. K-NN is less sensitive to outliers compared to other algorithms.
The K-NN algorithm works by finding the K nearest neighbors to a given data point based on
a distance metric, such as Euclidean distance. The class or value of the data point is then
determined by the majority vote or average of the K neighbors. This approach allows the
algorithm to adapt to different patterns and make predictions based on the local structure
of the data.
Distance Metrics Used in KNN Algorithm
As we know, the KNN algorithm helps us identify the nearest points or the groups for a
query point. But to determine the closest groups or the nearest points for a query point we
need some metric. For this purpose, we use the distance metrics below:
Euclidean Distance
This is simply the Cartesian distance between two points lying in the
plane/hyperplane. Euclidean distance can also be visualized as the length of the straight line
that joins the two points under consideration. This metric helps us calculate the net
displacement between the two states of an object.
distance(x, Xi) = √( Σ_{j=1..d} (xj − Xij)² )
Manhattan Distance
Manhattan Distance metric is generally used when we are interested in the total distance
traveled by the object instead of the displacement. This metric is calculated by summing the
absolute difference between the coordinates of the points in n-dimensions.
d(x, y) = Σ_{i=1..n} |xi − yi|
Minkowski Distance
We can say that the Euclidean, as well as the Manhattan distance, are special cases of
the Minkowski distance.
d(x, y) = ( Σ_{i=1..n} |xi − yi|^p )^(1/p)
From the formula above we can say that when p = 2 then it is the same as the formula for
the Euclidean distance and when p = 1 then we obtain the formula for the Manhattan
distance.
The metrics discussed above are the most common when dealing with a machine
learning problem, but there are other distance metrics as well, like the Hamming distance, which
comes in handy when dealing with problems that require overlapping comparisons between
two vectors whose contents can be Boolean as well as string values.
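The three metrics can be computed directly with NumPy; the sketch below reuses the two example points mentioned earlier, (2.5, 7) and (5.5, 4.5).

```python
# Euclidean, Manhattan, and Minkowski distances between two points.
import numpy as np

a = np.array([2.5, 7.0])
b = np.array([5.5, 4.5])

euclidean = np.sqrt(np.sum((a - b) ** 2))            # Minkowski with p = 2
manhattan = np.sum(np.abs(a - b))                    # Minkowski with p = 1
p = 3
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)    # general p

print(euclidean, manhattan, minkowski)
```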
How to choose the value of k for KNN Algorithm?
The value of k is very crucial in the KNN algorithm to define the number of neighbors in the
algorithm. The value of k in the k-nearest neighbors (k-NN) algorithm should be chosen
based on the input data. If the input data has more outliers or noise, a higher value of k
would be better. It is recommended to choose an odd value for k to avoid ties in
classification. Cross-validation methods can help in selecting the best k value for the given
dataset.
Algorithm for K-NN
nearest = the k (distance, value) pairs for the first k training examples, sorted by distance
for i = k+1 to number of training records:
    dist = distance(test example, i-th training example)
    if dist < the largest distance currently stored in nearest:
        remove that farthest entry from nearest
        insert (dist, value of i-th example) into nearest, keeping it sorted by distance
return the average of the values in nearest (for classification, return the majority vote instead)
A fit using K-NN (K > 1) is more reasonable than 1-NN; K-NN is affected much less by noise if the dataset is
large.
In the K-NN algorithm, we can see jumps in the prediction values due to a unit change in the input. The
reason for this is a change in the set of neighbors. To handle this situation, we can use weighting
of the neighbors in the algorithm. If the distance from a neighbor is high, we want less effect from
that neighbor. If the distance is low, that neighbor should be more influential than the others.

Workings of KNN algorithm


The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity, where it
predicts the label or value of a new data point by considering the labels or values of its K
nearest neighbors in the training dataset.

Step-by-Step explanation of how KNN works is discussed below:


Step 1: Selecting the optimal value of K
 K represents the number of nearest neighbors that needs to be considered while
making prediction.
Step 2: Calculating distance
 To measure the similarity between target and training data points, Euclidean
distance is used. Distance is calculated between each of the data points in the
dataset and target point.
Step 3: Finding Nearest Neighbors
 The k data points with the smallest distances to the target point are the nearest
neighbors.
Step 4: Voting for Classification or Taking Average for Regression
 In a classification problem, the class label is determined by performing
majority voting. The class with the most occurrences among the neighbors
becomes the predicted class for the target data point.
 In a regression problem, the prediction is calculated by taking the average of the
target values of the K nearest neighbors. The calculated average value becomes the
predicted output for the target data point.
Let X be the training dataset with n data points, where each data point is represented by a
d-dimensional feature vector Xi, and let Y be the corresponding labels or values for each data
point in X. Given a new data point x, the algorithm calculates the distance between x and
each data point Xi in X using a distance metric, such as the Euclidean
distance: distance(x, Xi) = √( Σ_{j=1..d} (xj − Xij)² )
The algorithm selects the K data points from X that have the shortest distances to x. For
classification tasks, the algorithm assigns the label y that is most frequent among the K
nearest neighbors to x. For regression tasks, the algorithm calculates the average or
weighted average of the values y of the K nearest neighbors and assigns it as the predicted
value for x.
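A minimal, hedged sketch of the full procedure with scikit-learn's KNeighborsClassifier; the Iris dataset and K = 5 are arbitrary choices for illustration.

```python
# KNN classification with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")   # K = 5 nearest neighbors
knn.fit(X_train, y_train)            # "training" only stores the data (lazy learner)
print(knn.score(X_test, y_test))     # accuracy of majority-vote predictions
```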
Advantages of the KNN Algorithm
 Easy to implement, as the complexity of the algorithm is not that high.
 Adapts easily – As the KNN algorithm stores all the data in
memory, whenever a new example or data point is added the
algorithm adjusts itself to that new example, and the example contributes to
future predictions as well.
 Few hyperparameters – The only parameters required in the training
of a KNN algorithm are the value of k and the choice of the distance metric.
Disadvantages of the KNN Algorithm
 Does not scale – The KNN algorithm is also considered a lazy algorithm.
The main significance of this term is that it takes
a lot of computing power as well as data storage at prediction time. This makes the algorithm both
time-consuming and resource-intensive.
 Curse of dimensionality – There is a term known as the peaking phenomenon,
according to which the KNN algorithm is affected by the curse of
dimensionality, which implies the algorithm faces a hard time classifying the data
points properly when the dimensionality is too high.
 Prone to overfitting – As the algorithm is affected by the curse of
dimensionality, it is prone to the problem of overfitting as well. Hence,
feature selection as well as dimensionality reduction techniques are generally
applied to deal with this problem.
Discriminant functions and regression functions

Classification and regression problems were defined as the prediction of categorical (class, label)
variables and the prediction of numeric (continuous) variables, respectively. These two kinds of
problems are of utmost importance. For example, in the case of speech or character
recognition systems, fault detection systems, readers of magnetic-strip codes or credit cards,
various alarm systems, and so on, we predict a class or category. In control applications, signal
processing, and financial markets, we predict numeric values such as various signals and stock
prices based on past performance.

Classification and Discriminant Functions:

**Classification**

In machine learning, classification is a type of supervised learning problem where the goal
is to predict the class or category that an instance belongs to. In other words, it is a
technique used to identify which group or class an instance belongs to, based on a set of
input features.

**Types of Classification**

There are two main types of classification:


1. **Binary Classification**: This type of classification involves predicting one of two
classes or categories. For example, spam vs. non-spam emails.

2. **Multi-Class Classification**: This type of classification involves predicting one of more


than two classes or categories. For example, predicting the type of fruit (apple, banana,
orange, etc.).

**Discriminant Functions**

A discriminant function is a mathematical function that takes a set of input features and
outputs a class label or probability distribution over the classes. The goal of a discriminant
function is to separate the classes as much as possible.

**Linear Discriminant Function**

A linear discriminant function is a linear combination of the input features that maximizes
the distance between classes while minimizing the distance within classes. The general
form of a linear discriminant function is:

f(x) = w^T*x + b

where w is a weight vector, x is the input feature vector, and b is the bias term.

**Non-Linear Discriminant Function**

A non-linear discriminant function is a non-linear combination of the input features that


maximizes the distance between classes while minimizing the distance within classes. The
most common type of non-linear discriminant function is the Radial Basis Function (RBF)
network.

**Example: Binary Classification with Linear Discriminant Function**

Suppose we want to classify emails as either spam or non-spam based on three features:
number of words in the subject line, number of words in the body, and whether the email
contains an attachment.

We can use a linear discriminant function to classify emails based on these features. Let's
say we have two weight vectors:

w1 = [1, 2, 3]
w2 = [-1, -2, -3]

The two discriminant functions would be:

f1(x) = w1^T*x + b1

f2(x) = w2^T*x + b2

where x is the input feature vector and b1 and b2 are bias terms.

For example, if we have an email with 15 words in the subject line, 75 words in the body,
and no attachment:

x = [15, 75, 0]

The outputs of the discriminant functions would be:

f1(x) = w1^T*x + b1 = (1*15 + 2*75 + 3*0) + b1 = 165 + b1

f2(x) = w2^T*x + b2 = (-1*15 - 2*75 - 3*0) + b2 = -165 + b2

If f1(x) > f2(x), we classify the email as spam; otherwise, we classify it as non-spam.

This is just a simple example to illustrate how discriminant functions work in classification
problems. In practice, you would need to train a model using a large dataset and tune
hyperparameters to achieve good performance.
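The email example above can be evaluated directly with NumPy, as in this hedged sketch; the bias terms are assumed to be zero here, which is an extra assumption not fixed by the text.

```python
# Evaluating the two linear discriminant functions from the email example.
import numpy as np

w1 = np.array([1, 2, 3])      # weights of the "spam" discriminant
w2 = np.array([-1, -2, -3])   # weights of the "non-spam" discriminant
b1 = b2 = 0                   # bias terms assumed to be zero for this sketch

x = np.array([15, 75, 0])     # subject words, body words, has attachment

f1 = w1 @ x + b1              # 165: spam score
f2 = w2 @ x + b2              # -165: non-spam score
print("spam" if f1 > f2 else "non-spam")
```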

Numeric Prediction and Regression Functions

**Numeric Prediction**

In machine learning, numeric prediction refers to the process of predicting a numerical

value or a range of values based on a set of input features. This type of prediction is

often used in regression problems, where the goal is to predict a continuous output

variable.

**Types of Numeric Prediction**

There are several types of numeric prediction:

1. **Continuous Regression**: Predicting a continuous value or range of values.


2. **Discrete Regression**: Predicting a discrete value from a finite set of

options.

**Regression Functions**

A regression function is a mathematical function that takes a set of input

features and outputs a predicted value or a range of values. The goal of a

regression function is to minimize the difference between the predicted values

and the actual values.

**Linear Regression Function**

A linear regression function is a linear combination of the input features that

minimizes the mean squared error (MSE) between the predicted values and the

actual values. The general form of a linear regression function is:

y = β0 + β1*x1 + β2*x2 + ... + βn*xn

where y is the target variable, x1, x2, ..., xn are the input features, and β0, β1,

β2, ..., βn are the coefficients.

**Non-Linear Regression Function**

A non-linear regression function is a non-linear combination of the input

features that minimizes the mean squared error (MSE) between the predicted

values and the actual values. The most common type of non-linear regression

function is the polynomial regression function, which is defined as:


y = β0 + β1*x + β2*x^2 + ... + βn*x^n

where y is the target variable, x is the input feature, and β0, β1, β2, ..., βn are

the coefficients.

**Example: Predicting Stock Prices**

Suppose we want to predict the stock price of XYZ Inc. based on three features: daily

trading volume, previous day's closing price, and industry sector.

We can use a linear regression model to predict stock prices based on these features.

The linear regression function would be:

Price = β0 + β1*Trading Volume + β2*Previous Day's Close + β3*Sector

Here the categorical Sector feature is one-hot encoded (for example, Technology = 1, any other
sector = 0). After training the model, suppose we get:

β0 = 40

β1 = 0.0001

β2 = 0.05

β3 = 5

Now we can use this model to predict the stock price for tomorrow with a trading
volume of 130000, a previous day's close of 58, and the Technology sector:

Price = 40 + (0.0001*130000) + (0.05*58) + (5*1) = 60.9

This predicted price is close to the actual price of $62.

This is just a simple example to illustrate how numeric prediction and regression

functions work in machine learning. In practice, you would need to handle

outliers, missing values, and non-linear relationships between variables to

achieve good performance.
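The stock-price example can be reproduced as a short script; the coefficients and inputs are the illustrative values from above, with the Technology sector encoded as a dummy value of 1.

```python
# Evaluating the fitted regression function from the stock-price example.
b0, b1, b2, b3 = 40, 0.0001, 0.05, 5   # illustrative fitted coefficients

trading_volume = 130_000
prev_close = 58
is_technology = 1                      # one-hot (dummy) encoding of the sector

price = b0 + b1 * trading_volume + b2 * prev_close + b3 * is_technology
print(round(price, 1))                 # 60.9
```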

Practical Hypothesis Functions:

Classical statistical techniques are based on the fundamental assumption that, in

most real-life problems, the stochastic component of the data follows a normal

distribution. Linear discriminant functions and linear regression functions have a

variety of pleasant analytical properties. However, the assumptions on which the

classical statistical paradigm relied turned out to be inappropriate for many

contemporary real-life problems.

Heuristic search is organized as per the following two-step procedure. (i) The search

is first focused on a hypothesis class chosen for the learning task at hand. Different

hypothesis classes are appropriate for learning different kinds of functions.

The main hypothesis classes are: 1. Linear Models, 2. Logistic Models, 3. Support

Vector Machines, 4. Neural Networks, 5. Fuzzy Logic Models, 6. Decision Trees, 7. k-

Nearest Neighbors (k-NN), 8. Naive Bayes.


Linear Regression with Least Square Error Criterion:

Least Square Regression is a statistical method commonly used in machine learning for
analyzing and modelling data. It involves finding the line of best fit that minimizes the sum
of the squared residuals (the difference between the actual values and the predicted values)
between the independent variable(s) and the dependent variable.
We can use Least Square Regression for simple linear regression, where there is only
one independent variable, as well as for multiple linear regression, where there are several
independent variables. This method is widely used in a variety of fields, such as economics,
engineering, and finance, to model and predict relationships between variables. Before
learning least square regression, let's understand linear regression.
Linear Regression
Linear regression is one of the basic statistical techniques in regression analysis. People use
it for investigating and modelling the relationship between variables (i.e. dependent variable
and one or more independent variables).

Before being promptly adopted into machine learning and data science, linear models were
used as basic statistical tools to assist prediction analysis and data mining. If the model
involves only one regressor variable (independent variable), it is called simple linear
regression, and if the model has more than one regressor variable, the process is
called multiple linear regression.
Equation of Straight Line
Let’s consider a simple example of an engineer wanting to analyze vending
machines' product delivery and service operations. He/she wants to determine the
relationship between the time required by a deliveryman to load a machine and the
volume of the products delivered. The engineer collected the delivery time (in
minutes) and the volume of the products (in a number of cases) of 25 randomly
selected retail outlets with vending machines. The scatter diagram is the
observations plotted on a graph.

Now, consider Y as the delivery time (dependent variable) and X as the product volume
delivered (independent variable). Then we can represent the linear relationship
between these two variables as

Y = mX + c

Okay! Now that looks familiar: it is the equation of a straight line, where m is the slope
and c is the y-intercept. Our objective is to estimate these unknown parameters in
the regression model such that they give minimal error for the given
dataset. This is commonly referred to as parameter estimation or model fitting. In
machine learning, the most common method of estimation is the Least
Squares method.
What is the Least Square Regression Method?
Least squares is a commonly used method in regression analysis for estimating the
unknown parameters by creating a model which will minimize the sum of squared
errors between the observed data and the predicted data.
Basically, it is one of the widely used methods of fitting curves: it works by
making the sum of squared errors as small as possible. It helps you draw the line of
best fit for your data points.
Finding the Line of Best Fit Using Least Square Regression
Given any collection of a pair of numbers and the corresponding scatter graph, the
line of best fit is the straight line that you can draw through the scatter points to
represent the relationship between them best. So, back to our equation of the
straight line, we have:

Where,
Y: Dependent Variable
m: Slope
X: Independent Variable
c: y-intercept
Our aim here is to calculate the values of the slope and y-intercept and substitute them in the
equation, along with the values of the independent variable X, to determine the values of the
dependent variable Y. Assuming we have 'n' data points, we can
calculate the slope using the formula below:

m = [ n*Σ(XY) − ΣX*ΣY ] / [ n*Σ(X²) − (ΣX)² ]

Then, the y-intercept is calculated using the formula:

c = Ymean − m * Xmean

Lastly, we substitute these values in the final equation Y = mX + c. Simple enough,


right? Now let’s take a real-life example and implement these formulas to find the
line of best fit.
Least Squares Regression Example
Let us take a simple dataset to demonstrate the least squares regression method.
Step 1: The first step is to calculate the slope ‘m’ using the formula

After substituting the respective values in the formula, m = 4.70 approximately.


Step 2: Next, calculate the y-intercept ‘c’ using the formula (ymean — m * xmean). By
doing that, the value of c approximately is c = 6.67.

Step 3: Now we have all the information needed for the equation, and by substituting
the respective values in Y = mX + c, we get the following table. Using this
information, you can now plot the graph.

This way, the least squares regression method provides the closest relationship
between the dependent and independent variables by minimizing the residuals (errors),
i.e., the vertical distances between the data points and the trend line (line of best fit). Therefore, the
sum of squares of the residuals (errors) is minimal under this approach.
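The slope and intercept formulas can be applied directly with NumPy, as in this sketch; the x/y values are a small made-up dataset, not the delivery-time data mentioned above.

```python
# Least-squares slope and intercept computed from the formulas above.
import numpy as np

x = np.array([2, 4, 6, 8, 10], dtype=float)
y = np.array([5, 9, 14, 18, 22], dtype=float)
n = len(x)

m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
c = y.mean() - m * x.mean()

print(m, c)        # 2.15 and 0.7 for this data
print(m * x + c)   # fitted values Y = mX + c on the line of best fit
```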

Logistic Regression
Logistic regression is used for binary classification. It uses the sigmoid function, which
takes the independent variables as input and produces a probability value between 0 and 1.
For example, if we have two classes, Class 0 and Class 1, and the value of the logistic function
for an input is greater than 0.5 (the threshold value), then the input belongs to Class 1; otherwise it
belongs to Class 0. It is referred to as regression because it is an extension of linear
regression but is mainly used for classification problems.
Key Points:
 Logistic regression predicts the output of a categorical dependent variable.
Therefore, the outcome must be a categorical or discrete value.
 It can be either Yes or No, 0 or 1, true or False, etc. but instead of giving the
exact value as 0 and 1, it gives the probabilistic values which lie between 0 and
1.
 In Logistic regression, instead of fitting a regression line, we fit an “S” shaped
logistic function, which predicts two maximum values (0 or 1).
Logistic Function – Sigmoid Function
 The sigmoid function is a mathematical function used to map the predicted
values to probabilities.
 It maps any real value into another value within a range of 0 and 1. The value of
the logistic regression must be between 0 and 1, which cannot go beyond this
limit, so it forms a curve like the “S” form.
 The S-form curve is called the Sigmoid function or the logistic function.
 In logistic regression, we use the concept of a threshold value, which defines
the boundary between the probabilities of 0 and 1: values above the threshold tend
to 1, and values below the threshold tend to 0.
Types of Logistic Regression
On the basis of the categories, Logistic Regression can be classified into three types:
1. Binomial: In binomial Logistic regression, there can be only two possible types
of the dependent variables, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial Logistic regression, there can be 3 or more
possible unordered types of the dependent variable, such as “cat”, “dogs”, or
“sheep”
3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered
types of dependent variables, such as “low”, “Medium”, or “High”.
Assumptions of Logistic Regression
We will explore the assumptions of logistic regression, as understanding these assumptions
is important to ensure that we are applying the model appropriately. The
assumptions include:
1. Independent observations: Each observation is independent of the others,
meaning there is no correlation between observations.
2. Binary dependent variables: It takes the assumption that the dependent
variable must be binary or dichotomous, meaning it can take only two values.
For more than two categories SoftMax functions are used.
3. Linearity relationship between independent variables and log odds: The
relationship between the independent variables and the log odds of the
dependent variable should be linear.
4. No outliers: There should be no outliers in the dataset.
5. Large sample size: The sample size is sufficiently large
Terminologies involved in Logistic Regression
Here are some common terms involved in logistic regression:
 Independent variables: The input characteristics or predictor factors applied to
the dependent variable’s predictions.
 Dependent variable: The target variable in a logistic regression model, which
we are trying to predict.
 Logistic function: The formula used to represent how the independent and
dependent variables relate to one another. The logistic function transforms the
input variables into a probability value between 0 and 1, which represents the
likelihood of the dependent variable being 1 or 0.
 Odds: It is the ratio of something occurring to something not occurring. it is
different from probability as the probability is the ratio of something occurring
to everything that could possibly occur.
 Log-odds: The log-odds, also known as the logit function, is the natural
logarithm of the odds. In logistic regression, the log odds of the dependent
variable are modeled as a linear combination of the independent variables and
the intercept.
 Coefficient: The logistic regression model’s estimated parameters, show how
the independent and dependent variables relate to one another.
 Intercept: A constant term in the logistic regression model, which represents
the log odds when all independent variables are equal to zero.
 Maximum likelihood estimation: The method used to estimate the coefficients
of the logistic regression model, which maximizes the likelihood of observing
the data given the model.
How does Logistic Regression work?
The logistic regression model transforms the continuous output of the linear regression
function into a categorical output using a sigmoid function, which maps any real-valued
combination of the independent input variables to a value between 0 and 1. This function is
known as the logistic function.
Let the independent input features be x1, x2, …, xn. The model first forms the linear
combination z = w1*x1 + w2*x2 + … + wn*xn + b and then applies the sigmoid function
σ(z) = 1 / (1 + e^(−z)) to obtain the predicted probability.
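A hedged sketch of the whole pipeline: scikit-learn fits the coefficients, and the same probability is then recomputed explicitly with the sigmoid; the hours-studied/pass data is invented.

```python
# Logistic regression: linear combination -> sigmoid -> probability -> class label.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])   # hours studied (hypothetical)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                   # fail / pass labels

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[4.5]])[0, 1])   # P(pass | 4.5 hours)
print(clf.predict([[4.5]])[0])            # class after applying the 0.5 threshold

# The same probability written out with the sigmoid function sigma(z) = 1 / (1 + e^-z):
z = clf.intercept_[0] + clf.coef_[0, 0] * 4.5
print(1 / (1 + np.exp(-z)))
```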
Fisher's Linear Discriminant and Thresholding for Classification:

Fisher’s Linear Discriminant

Linear Discriminant Analysis in Machine Learning is a generalized form of Fisher’s Linear

Discriminant (FLD). Initially, Fisher, in his paper, used a discriminant function to classify

between two plant species, namely Iris Setosa and Iris Versicolor.
The basic idea of FLD is to project data points onto a line in order to maximize the

between-class scatter and minimize the within-class scatter. Consequently, this

approach aims to enhance the separation between different classes by optimizing the

distribution of data points along a linear dimension.

This might sound a bit cryptic but it is quite straightforward. So, before we delve deep

into the derivation part, we need to familiarize ourselves with certain terms and

expressions.

 Let's suppose we have d-dimensional data points x1, …, xn belonging to 2 classes Ci (i = 1, 2),
having N1 and N2 samples respectively.

 Consider W as a unit vector onto which we will project the data points. Since we

are only concerned with the direction, we choose a unit vector for this purpose.

 Number of samples: N = N1 + N2

 If x(n) are the samples in the feature space, then W^T x(n) denotes the data points

after projection.

 Means of the classes before projection: mi

 Means of the classes after projection: Mi = W^T mi

Datapoint X before and after projection
Scatter matrix: used to estimate the covariance matrix. It is an m x m positive

semi-definite matrix, given by the sample variance multiplied by the number of samples.

Note: Scatter and variance measure the same thing but on different scales. So, we

might use both words interchangeably. So, do not get confused.

Two Types of Scatter Matrices

Here we will be dealing with two types of scatter matrices

 Between class scatter = Sb = measures the distance between class means

 Within class scatter = Sw = measures the spread around means of each class

Now, assuming we are clear with the basics let’s move on to the derivation part.

As per Fisher’s LDA :

arg max J(W) = (M1 − M2)² / (S1² + S2²) ........... (1)


The numerator here is between class scatter while the denominator is within-class

scatter. So to maximize the function we need to maximize the numerator and minimize

the denominator, simple math. To maximize the above function we need to first

express the above equation in terms of W.

Numerator: the between-class term (M1 − M2)² can be written as W^T Sb W.

Denominator: for the denominator we have S1² + S2², which can be written as W^T Sw W.

Now, with both the numerator and denominator expressed in terms of W:

J(W) = (W^T Sb W) / (W^T Sw W)

Upon differentiating the above function w.r.t. W and equating it to 0, we get a
generalized eigenvalue-eigenvector problem:

Sb W = v Sw W

Sw being a full-rank matrix, its inverse is feasible:

=> Sw⁻¹ Sb W = v W

where v is an eigenvalue and W is the corresponding eigenvector.
Linear Discriminant Analysis (LDA) for multiple classes:

Linear Discriminant Analysis (LDA) can be generalized for multiple classes. Here are the

generalized forms of between-class and within-class matrices.

Note: Sb is the sum of C different rank-1 matrices, so rank(Sb) ≤ C − 1. That

means we can have at most C − 1 eigenvectors with non-zero eigenvalues, and thus we can

project the data points onto a subspace of dimension at most C − 1.

Equation (4) above gives us the scatter for each of our classes, and equation (5)

adds all of them to give the within-class scatter. Similarly, equation (6) gives us the

between-class scatter. Finally, the eigendecomposition of Sw⁻¹ Sb gives us the desired

eigenvectors from the corresponding eigenvalues. There can be at most C − 1 non-zero eigenvalues.
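A minimal sketch with scikit-learn's LinearDiscriminantAnalysis on the Iris data (C = 3 classes), which projects the 4 features onto at most C − 1 = 2 discriminant directions; the dataset choice is only illustrative.

```python
# Fisher/LDA projection of a 3-class dataset onto C - 1 = 2 directions.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)   # n_components cannot exceed C - 1
X_proj = lda.fit(X, y).transform(X)

print(X_proj.shape)                    # (150, 2): data in the discriminant subspace
print(lda.explained_variance_ratio_)   # share of between-class variance per direction
```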
Minimum Description Length Principle.
