
Unit 1

Math, Probability, and Statistical Modelling
Exploring Probability and Inferential Statistics

• Making predictions and searching for different structures in data is the most
important part of data science.
• Probability and statistics are important because they underpin a wide range of
analytical tasks.
• Probability and statistics are involved in many of the predictive algorithms used in
Machine Learning. They help in deciding how reliable the data is.
• Probability is one of the most fundamental concepts in statistics.
• A statistic is a result that’s derived from performing a mathematical operation on
numerical data.
• Probability is all about chance, whereas statistics is more about how we handle
data using different techniques.
Probability basics:-
• Probability denotes the possibility of the outcome of any random event.
• The term refers to the extent to which an event is likely to happen.
• For example, when we flip a coin in the air, what is the possibility of getting a
head? The answer to this question is based on the number of possible outcomes.
Here the possibility is that either a head or a tail will be the outcome. So, the
probability of a head coming up is 1/2.
• Probability is the measure of the likelihood of an event happening. It measures
the certainty of the event. The formula for probability is given by:
• P(E) = Number of Favourable Outcomes/Number of total outcomes
• P(E) = n(E)/n(S)
• Probability denotes the possibility of something happening.
• It is a mathematical concept that predicts how likely events are to
occur.
• The probability values are expressed between 0 and 1.
• The definition of probability is the degree to which something is likely
to occur.
• This fundamental theory of probability is also applied to probability
distributions.
Axioms of probability
• An axiom is a rule or principle that most people believe to be true. It is the
premise on the basis of which we do further reasoning.
• There are three axioms of probability that form the foundation of probability
theory:
• Axiom 1: Probability of an Event
• The first axiom is that the probability of an event is always between 0 and 1,
where 1 indicates that the event is certain to occur and 0 indicates that the
event cannot occur.

• Axiom 2: Probability of Sample Space


• For sample space, the probability of the entire sample space is 1.
• Axiom 3: Mutually Exclusive Events
• And the third one is: the probability of the union of two mutually disjoint
(mutually exclusive) events is the sum of their individual probabilities.
• one can derive several mathematical relationships on probability of events using set theory logic
• The following elementary rules of probability are directly deduced from the original three axioms
of probability, using the set theory relationships:
1. For any event A, the probability of the complementary event, written A’, is given by
P(A’) = 1 – P(A)
• If P(A) is a probability of observing a fraudulent transaction at an e-commerce portal, then P(A’) is
the probability of observing a genuine transaction.
2. The probability of an empty or impossible event, ∅, is zero:
P(∅) = 0
3. If occurrence of an event A implies that an event B also occurs, so that the event class A is a subset
of event class B, then the probability of A is less than or equal to the probability of B:
P(A) ≤ P(B)
Random variable:-
• A random variable is a variable whose value is unknown or a function
that assigns values to each of an experiment's outcomes.
• A random variable is a numerical description of the outcome of a
statistical experiment.
• In probability and statistics, random variables are used to quantify
outcomes of a random occurrence, and therefore, can take on many
values.
• Random variables are required to be measurable and are typically real
numbers.
Discrete Random Variables

• If the random variable X can assume only a finite or countably infinite set of values, then it is
called a discrete random variable.
• There are many situations where the random variable X can assume only a finite or countably
infinite set of values. Examples of discrete random variables are:
• 1. Credit rating (usually classified into different categories such as low, medium and high or using
labels such as AAA, AA, A, BBB, etc.).
• 2. Number of orders received at an e-commerce retailer which can be countably infinite.
• 3. Customer churn [the random variables take binary values: (a) Churn and (b) Do not churn].
• 4. Fraud [the random variables take binary values: (a) Fraudulent transaction and (b) Genuine
transaction].
• 5. Any experiment that involves counting (for example, number of returns in a day from customers
of e-commerce portals such as Amazon, Flipkart; number of customers not accepting job offers
from an organization).
Continuous Random Variables

• A random variable X which can take a value from an infinite set of values is called
a continuous random variable. Examples of continuous random variables are listed
below:
• 1. Market share of a company (which can take any value from an infinite set of values
between 0% and 100%).
• 2. Percentage of attrition among employees of an organization.
• 3. Time to failure of engineering systems.
• 4. Time taken to complete an order placed at an e-commerce portal.
• 5. Time taken to resolve a customer complaint at call and service centers.
• For example, the letter X may be designated to represent the sum of
the resulting numbers after three dice are rolled.
• In this case, X could be 3 (1 + 1+ 1), 18 (6 + 6 + 6), or somewhere
between 3 and 18, since the highest number of a die is 6 and the
lowest number is 1.
• Random variables are often designated by letters and can be classified
as discrete, which are variables that have specific values, or
continuous, which are variables that can have any values within a
continuous range.
• A random variable has a probability distribution that represents the
likelihood that any of the possible values would occur.
• Let’s say that the random variable, Z, is the number on the top face of a
die when it is rolled once.
• The possible values for Z will thus be 1, 2, 3, 4, 5, and 6. The probability
of each of these values is 1/6 as they are all equally likely to be the value
of Z.
• For instance, the probability of getting a 3, or P (Z=3), when a die is
thrown is 1/6, and so is the probability of having a 4 or a 2 or any other
number on all six faces of a die. Note that the sum of all probabilities is 1.
• Discrete Random Variables
• Discrete random variables take on a countable number of distinct values. Consider an
experiment where a coin is tossed three times.
• If X represents the number of times that the coin comes up heads, then X is a discrete
random variable that can only have the values 0, 1, 2, 3 (from no heads in three successive
coin tosses to all heads). No other value is possible for X.
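To make this concrete, here is a small Python sketch (added for illustration, not part of the original slides) that enumerates the eight equally likely outcomes of three tosses and tabulates X:

# Enumerate the 8 equally likely outcomes of three coin tosses
# and tabulate X = number of heads.
from itertools import product
from collections import Counter

outcomes = list(product("HT", repeat=3))            # ('H','H','H'), ('H','H','T'), ...
counts = Counter(seq.count("H") for seq in outcomes)
for heads in sorted(counts):
    print(f"P(X = {heads}) = {counts[heads]}/{len(outcomes)}")
# X can only take the values 0, 1, 2, 3 -- a discrete random variable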

• Continuous Random Variables


• Continuous random variables can represent any value within a specified range or interval
and can take on an infinite number of possible values.
• An example of a continuous random variable would be an experiment that involves
measuring the amount of rainfall in a city over a year or the average height of a random
group of 25 people.
• To understand discrete and continuous distribution, think of two
variables from a dataset describing cars.
• A “color” variable would have a discrete distribution because cars
have only a limited range of colours (black, red, or blue, for example).
The observations would be countable per the color grouping.
• A variable describing cars’ miles per gallon, or “mpg,” would have a
continuous distribution because each car could have its own separate
value for “mpg.”
Probability distributions
• In Statistics, the probability distribution gives the possibility of each
outcome of a random experiment or event. It provides the probabilities
of different possible occurrences.
• A probability distribution is a mathematical function that describes
the probability of different possible values of a variable. Probability
distributions are often depicted using graphs or probability tables.
• For example, the following probability distribution table tells us the
probability that a certain soccer team scores a certain number of goals
in a given game:
Probability distributions
• When the roulette wheel spins off, you intuitively understand that
there is an equal chance that the ball will fall into any of the slots of
the cylinder on the wheel.
• The slot where the ball will land is totally random, and the probability,
or likelihood, of the ball landing in any one slot over another is the
same.
• Because the ball can land in any slot, with equal probability, there is
an equal probability distribution, or a uniform probability distribution
— the ball has an equal probability of landing in any of the slots in the
cylinder.
• But the slots of the roulette wheel are not all the same — the wheel has 18 black
slots and 20 slots that are either red or green. Because of this arrangement, there is
18/38 probability that your ball will land on a black slot.
• You plan to make successive bets that the ball will land on a black slot.
Types of Probability Distributions

• The two major kinds of distributions, based on the type of likely values for
the variables, are:
1.Discrete Distributions
2.Continuous Distributions
Discrete Distribution Vs Continuous Distribution

Discrete Distributions vs Continuous Distributions:

• Discrete distributions have a finite number of different possible outcomes; continuous
distributions have infinitely many consecutive possible values.
• For a discrete distribution, we can add up individual values to find out the probability of an
interval; for a continuous distribution we cannot, because there are infinitely many values.
• A discrete distribution can be expressed with a graph, piece-wise function or table; a
continuous distribution can be expressed with a continuous function or graph.
• In a discrete distribution, the graph consists of bars lined up one after the other; in a
continuous distribution, the graph consists of a smooth curve.
DISCRETE DISTRIBUTIONS:

• Discrete distributions have a finite number of different possible outcomes.


• Characteristics of Discrete Distribution
• We can add up individual values to find out the probability of an interval
• Discrete distributions can be expressed with a graph, piece-wise
function or table
• In discrete distributions, graph consists of bars lined up one after the
other
• Expected values might not be achievable
Graphically, a discrete distribution looks like a set of bars lined up one after the other.

Examples of Discrete Distributions:


1.Bernoulli Distribution
2.Binomial Distribution
3.Uniform Distribution
4.Poisson Distribution
• Bernoulli Distribution
• In Bernoulli distribution there is only one trial and only two possible outcomes i.e.
success or failure. It is denoted by y ~Bern(p).
• Characteristics of Bernoulli distributions
• It consists of a single trial
• Two possible outcomes
• E(Y) = p
• Examples and Uses:
• Guessing a single True/False question.
• It is mostly used when trying to find out what we expect to obtain in a single trial of an
experiment.
Binomial Distribution

• A sequence of identical Bernoulli events is called Binomial and follows a Binomial
distribution. It is denoted by Y ~ B(n, p).
• Characteristics of Binomial distribution
• Over the n trials, it measures the frequency of occurrence of one of the possible
results.
• E(Y) = n × p
• P(Y = y) = C(n, y) × p^y × (1 – p)^(n–y)
• Examples and Uses:
• For example, determining how many times we obtain a head if we flip a coin 10 times.
• It is mostly used when we try to predict how likely an event is to occur over a
series of trials.
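As a quick illustration, the binomial formula above can be applied to the 10-coin-flip example with a few lines of Python (a sketch added here, not from the slides):

# Binomial probabilities for Y ~ B(n=10, p=0.5),
# using P(Y = y) = C(n, y) * p**y * (1 - p)**(n - y)
from math import comb

n, p = 10, 0.5
for y in range(n + 1):
    prob = comb(n, y) * p**y * (1 - p)**(n - y)
    print(f"P(Y = {y}) = {prob:.4f}")
print("E(Y) =", n * p)   # expected number of heads = 5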
CONTINUOUS DISTRIBUTIONS:

• Continuous distributions have infinitely many consecutive possible values.
• Characteristics of Continuous Distributions
• We cannot add up individual values to find out the probability of an
interval because there are many of them
• Continuous distributions can be expressed with a continuous function
or graph
• In continuous distributions, graph consists of a smooth curve
• To calculate the chance of an interval, we require integrals
Examples of Continuous Distributions
1.Normal Distribution
2.Chi-Squared Distribution
3.Exponential Distribution
4.Logistic Distribution
5.Students’ T Distribution
Normal Distribution

• It shows a distribution that most natural events follow. It is denoted by
Y ~ N(µ, σ²). The main characteristics of normal distribution are:
• Characteristics of normal distribution
• The graph obtained from a normal distribution is a bell-shaped curve, symmetric and
with thin tails.
• About 68% of all its values should fall in the interval (µ – σ, µ + σ)
• E(Y) = µ
• Var(Y) = σ2
• Examples and Uses
• Normal distributions are mostly observed in the size of animals in the desert.
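A small check of the 68% rule stated above (a Python sketch, assuming SciPy is available):

# Check that about 68% of values of Y ~ N(mu, sigma^2)
# fall in the interval (mu - sigma, mu + sigma).
from scipy.stats import norm

mu, sigma = 0, 1
prob = norm.cdf(mu + sigma, loc=mu, scale=sigma) - norm.cdf(mu - sigma, loc=mu, scale=sigma)
print(f"P(mu - sigma < Y < mu + sigma) = {prob:.4f}")   # ~0.6827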
Categorical (non-numeric) distribution
Represents either non-numeric categorical variables or ordinal variables (a special case of numeric variable that
can be grouped and ranked like a categorical variable).
Conditional probability
• Conditional probability is the probability of an event occurring given that
another event has already occurred. In other words, it's the probability of
one event happening, considering that we know some information about
the occurrence of another event.
• Events:
• Event A (even number): Rolling an even number (2, 4, or 6).
• Event B (greater than 2): Rolling a number greater than 2 (3, 4, 5, or 6).
• Now, let's find the conditional probability P(A∣B), which is the probability
of rolling an even number given that the roll is greater than 2.
• Calculations:
1.Find the probability of Event A (rolling an even number):
P(A)=Number of even numbers/Total number of outcomes=3/6=1/2
2.Find the probability of Event B (rolling a number greater than 2):
P(B)=Number of numbers greater than 2/Total number of outcomes=4/6=2/3
3.Find the joint probability of both A and B (rolling an even number
greater than 2):
P(A∩B)=Number of even numbers greater than 2/Total number of outcomes
=2/6=1/3​
• Use the formula for conditional probability:
P(A∣B) = P(A∩B)/P(B) = (1/3)/(2/3) = 1/2
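The same result can be verified with a short Python sketch (added for illustration) that simply enumerates the six faces of the die:

# Verify P(A|B) = 1/2 by enumerating the six faces of a fair die.
from fractions import Fraction

faces = range(1, 7)
A = {f for f in faces if f % 2 == 0}        # Event A: even number
B = {f for f in faces if f > 2}             # Event B: greater than 2
p_B = Fraction(len(B), 6)
p_A_and_B = Fraction(len(A & B), 6)
print("P(A|B) =", p_A_and_B / p_B)          # 1/2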
Bayes’ Theorem
• Bayes' Theorem is a mathematical formula that helps us update the
probability of an event based on new evidence or information. In
simple terms, it allows us to revise our beliefs about the likelihood of
an event happening given some observed data.
• Bayes' Theorem states that the conditional probability of an event,
based on the occurrence of another event, is equal to the likelihood of
the second event given the first event multiplied by the probability of
the first event.
• Bayes' Theorem allows you to update the predicted probabilities of an
event by incorporating new information.
• Prior probability, in Bayesian statistics, is the probability of an event
before new data is collected. This is the best rational assessment of the
probability of an outcome based on the current knowledge before an
experiment is performed.
• A posterior probability, in Bayesian statistics, is the revised or
updated probability of an event occurring after taking into
consideration new information. The posterior probability is calculated
by updating the prior probability using Bayes' theorem. In statistical
terms, the posterior probability is the probability of event A occurring
given that event B has occurred.
Bayes’ Theorem(Example)
• Mathematically Bayes’ theorem can be stated as:

Basically, we are trying to find the probability of event A, given that event B is true.
Here P(A) is called the prior probability, i.e. the probability of the event before the
evidence is seen, and P(A|B) is called the posterior probability, i.e. the probability of
the event after the evidence is seen. P(B|A) is the likelihood and P(B) is the evidence.
• With regard to our dataset, this formula can be re-written as:
• Y: class of the variable
• X: dependent feature vector (of size n)
Example:-
• Let's say you visit a doctor because you have a cough. The doctor knows that there
are two possible causes for the cough: a common cold or a more serious lung disease.
Let A be the event "you have the lung disease" and B the event "the lung scan comes
back positive".
1.Prior Beliefs:
1. Before any tests, the doctor estimates that 90% of patients with a cough have a common cold
(P(A′) = 0.9) and 10% have the lung disease (P(A) = 0.1).
2.Test Information:
1. The doctor performs a lung scan. The test detects the disease in 95% of patients who have it
(P(B∣A) = 0.95), but there is a 10% chance of a false positive (P(B∣A′) = 0.1, where A′ is not
having the lung disease).
3.Updating Beliefs:
1. You get the test result, and it indicates that you have the lung disease. Now, you want to
know the probability that you actually have the disease, P(A∣B).
• Applying Bayes' Theorem:
• Plug the values into the formula:
• P(A∣B) = P(B∣A) × P(A) / P(B)
• P(A∣B) = 0.95 × 0.1 / P(B)
Calculating P(B):
• Calculate the denominator using the law of total probability:
P(B) = P(B∣A) × P(A) + P(B∣A′) × P(A′)
P(B) = 0.95 × 0.1 + 0.1 × 0.9 = 0.185
• Final Calculation:
• Substitute the values back into the formula:
P(A∣B) = (0.95 × 0.1) / 0.185 ≈ 0.514
• Even after a positive scan, the probability of the lung disease is only about 51%,
because the disease was unlikely to begin with.
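The same update as a few lines of Python (a sketch added here, using the numbers from the example above):

# Bayes' theorem update for the lung-disease example.
p_A = 0.1                 # prior: P(lung disease)
p_not_A = 0.9             # prior: P(just a cold), i.e. no lung disease
p_B_given_A = 0.95        # P(positive scan | disease)
p_B_given_not_A = 0.10    # P(positive scan | no disease), the false-positive rate

p_B = p_B_given_A * p_A + p_B_given_not_A * p_not_A   # law of total probability
p_A_given_B = p_B_given_A * p_A / p_B                 # Bayes' theorem
print(f"P(A|B) = {p_A_given_B:.3f}")                  # ~0.514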
Conditional probability with Naïve Bayes
• You can use the Naïve Bayes machine learning method, which was borrowed straight from the
statistics field, to predict the likelihood that an event will occur, given evidence defined in your
data features — something called conditional probability.
• Naïve Bayes, which is based on classification and regression, is especially useful if you need to
classify text data.
• This model is easy to build and is mostly used for large datasets. It is a probabilistic machine
learning model that is used for classification problems.
• The core of the classifier depends on the Bayes theorem with an assumption of independence
among predictors. That means changing the value of a feature doesn’t change the value of another
feature.
• Why is it called Naive?
• It is called Naive because of the assumption that 2 variables are independent when they may not
be. In a real-world scenario, there is hardly any situation where the features are independent.
• Conditional probability is defined as the likelihood of an event or outcome occurring, based on the
occurrence of a previous event or outcome. The related joint probability of both events is calculated by
multiplying the probability of the preceding event by the updated (conditional) probability of the succeeding event.
• A conditional probability would look at such events in relationship with one another.
• Conditional probability is thus the likelihood of an event or outcome occurring based on the occurrence of
some other event or prior outcome.
• Two events are said to be independent if one event occurring does not affect the probability that the other
event will occur.
• However, if one event occurring or not does, in fact, affect the probability that the other event will occur,
the two events are said to be dependent. If events are independent, then the probability of some event B is
not contingent on what happens with event A.
• A conditional probability, therefore, relates to those events that are dependent on one another.
• Conditional probability is often portrayed as the "probability of A given B," notated as P(A|B).
• The joint probability of two events is calculated by multiplying the probability of the preceding event by the
conditional probability of the succeeding event; rearranging this gives the conditional probability formula below.
• Four candidates A, B, C, and D are running for a political office. Each
has an equal chance of winning: 25%. However, if candidate A drops
out of the race due to ill health, the probability will change: P(Win |
One candidate drops out) = 33.33%.
The formula for conditional probability is:
P(B|A) = P(A and B) / P(A)
which you can also rewrite as:
P(B|A) = P(A∩B) / P(A)
Example:-
• In a group of 100 sports car buyers, 40 bought alarm systems, 30
purchased bucket seats, and 20 purchased an alarm system and bucket
seats. If a car buyer chosen at random bought an alarm system, what is
the probability they also bought bucket seats?
• Step 1: Figure out P(A). It’s given in the question as 40%, or 0.4.
• Step 2: Figure out P(A∩B). This is the intersection of A and B: both
happening together. It’s given in the question 20 out of 100 buyers, or
0.2.
• Step 3: Insert your answers into the formula:
P(B|A) = P(A∩B) / P(A) = 0.2 / 0.4 = 0.5
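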
What is Naive Bayes?

• Bayes’ rule provides us with the formula for the probability of Y given some
feature X. In real-world problems, we hardly find any case where there is only
one feature.
• When the features are independent, we can extend Bayes’ rule to what is called
Naive Bayes. It assumes that the features are independent, meaning that
changing the value of one feature doesn’t influence the values of the other variables,
and this is why we call this algorithm “NAIVE”.
• Naive Bayes can be used for various things like face recognition, weather
prediction, Medical Diagnosis, News classification, Sentiment Analysis, and a lot
more.
• When there are multiple X variables, we simplify it by assuming that X’s are
independent, so
1.Convert the given dataset into frequency tables.
2.Generate Likelihood table by finding the probabilities of given
features.
3.Now, use Bayes theorem to calculate the posterior probability.
For n number of X, the formula becomes Naive Bayes:
Naive Bayes Example
• Let’s take a dataset to predict whether we can pet an animal or not.
Assumptions of Naive Bayes

• All the variables are independent. That is, if the animal is a Dog, that
doesn’t mean that its Size will be Medium.
• All the predictors have an equal effect on the outcome. That is, the
animal being a Dog does not have more importance in deciding whether we
can pet it or not. All the features have equal importance.
• We should try to apply the Naive Bayes formula on the above dataset
however before that, we need to do some precomputations on our
dataset.
• We also need the probabilities (P(y)), which are calculated in the table
below. For example, P(Pet Animal = NO) = 6/14.
• Now if we send our test data, suppose test = (Cow, Medium, Black)
Probability of petting an animal :

And the probability of not petting an animal:


• We know P(Yes|Test)+P(No|test) = 1 So, we will normalize the result:

We see here that P(Yes|Test) > P(No|Test), so the prediction that we can pet this animal
is “Yes”.
Types of Naïve Bayes
• Naïve Bayes comes in these three popular flavors:
• »»MultinomialNB: Use this version if your variables (categorical or continuous) describe discrete
frequency counts, like word counts.
• This version of Naïve Bayes assumes a multinomial distribution, as is often the case with text data.
• It does not accept negative values.
• »»BernoulliNB: If your features are binary, you can use Bernoulli Naïve Bayes to make
predictions.
• This version works for classifying text data, but isn’t generally known to perform as well as
MultinomialNB.
• If you want to use BernoulliNB to make predictions from continuous variables, that will work, but
you first need to sub-divide them into discrete interval groupings (also known as binning).
• »»GaussianNB: Use this version if all predictive features are normally distributed. It’s not a good
option for classifying text data, but it can be a good choice if your data contains both positive and
negative values (and if your features have a normal distribution, of course).
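As an illustrative sketch of the three variants (assuming scikit-learn is installed; the tiny datasets below are made up for demonstration):

# Using the three Naive Bayes flavours on small, made-up datasets.
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# GaussianNB: continuous, roughly normally distributed features
X_cont = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [6.7, 3.1]])
y = np.array([0, 0, 1, 1])
print(GaussianNB().fit(X_cont, y).predict([[6.0, 3.0]]))

# MultinomialNB: non-negative discrete counts (e.g. word counts)
X_counts = np.array([[2, 0, 1], [3, 1, 0], [0, 4, 2], [0, 3, 3]])
print(MultinomialNB().fit(X_counts, y).predict([[1, 3, 2]]))

# BernoulliNB: binary features (here, counts binned to present/absent)
X_bin = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 1]]))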
Statistics Basics:-

• Statistics is the study of the collection, analysis, interpretation, presentation, and
organization of data.
• It is a method of collecting and summarising the data. This has many applications
from a small scale to large scale.
• Whether it is the study of the population of the country or its economy, stats are
used for all such data analysis.
• Statistics has a huge scope in many fields such as sociology, psychology, geology,
weather forecasting, etc.
• The data collected here for analysis could be quantitative or qualitative.
Quantitative data are also of two types such as: discrete and continuous. Discrete
data has a fixed value whereas continuous data is not a fixed data but has a range.
Exploring Descriptive and Inferential Statistics

• In general, you use statistics in decision making. Statistics come in two flavours:
• Descriptive: Descriptive statistics provide a description that illuminates some
characteristic of a numerical dataset, including dataset distribution, central
tendency (such as mean, min, or max), and dispersion (as in standard deviation
and variance).
• Inferential: Rather than focus on pertinent descriptions of a dataset, inferential
statistics carve out a smaller section of the dataset and attempt to deduce
significant information about the larger dataset.
• Use this type of statistics to get information about a real-world measure in which
you’re interested.
Descriptive Statistics
• Descriptive statistics describe the characteristics of a numerical dataset, but that
doesn’t tell you why you should care.
• Most data scientists are interested in descriptive statistics only because of what
they reveal about the real-world measures they describe.
• For example, a descriptive statistic is often associated with a degree of accuracy,
indicating the statistic’s value as an estimate of the real-world measure.
• You can use descriptive statistics in many ways — to detect outliers, for example,
or to plan for feature pre-processing requirements or to quickly identify what
features you may want, or not want, to use in an analysis.
Statistic | Class value
Mean | 79.18
Range | 66.21 – 96.53
Proportion >= 70 | 86.7%
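For illustration, descriptive statistics like those in the table above can be computed with pandas (a sketch using hypothetical scores, not the actual class data):

# Descriptive statistics on a made-up set of class scores.
import pandas as pd

scores = pd.Series([66.21, 71.4, 75.8, 79.0, 82.5, 85.3, 88.9, 92.1, 96.53])
print("Mean      :", round(scores.mean(), 2))
print("Range     :", f"{scores.min()} to {scores.max()}")
print("Std. dev. :", round(scores.std(), 2))
print("Prop >= 70:", f"{(scores >= 70).mean():.1%}")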
Inferential Statistics
• Inferential statistics are used to reveal something about a real-world measure.
• Inferential statistics do this by providing information about a small data selection,
so you can use this information to infer something about the larger dataset from
which it was taken.
• In statistics, this smaller data selection is known as a sample, and the larger,
complete dataset from which the sample is taken is called the population.
• If your dataset is too big to analyse in its entirety, pull a smaller sample of this
dataset, analyse it, and then make inferences about the entire dataset based on
what you learn from analysing the sample.
• You can also use inferential statistics in situations where you simply can’t afford
to collect data for the entire population.
• In this case, you’d use the data you do have to make inferences about the
population at large.
• At other times, you may find yourself in situations where complete information
for the population is not available. In these cases, you can use inferential statistics
to estimate values for the missing data based on what you learn from analysing the
data that is available
• For an inference to be valid, you must select your sample carefully so that you get
a true representation of the population.
• Even if your sample is representative, the numbers in the sample dataset will
always exhibit some noise — random variation, in other words — that guarantees
the sample statistic is not exactly identical to its corresponding population
statistic.
Quantifying Correlation
• Many statistical and machine learning methods assume that your features are independent.
• To test whether they’re independent, though, you need to evaluate their correlation — the extent
to which variables demonstrate interdependency.
• We will have brief introduction to Pearson correlation and Spearman’s rank correlation.
• Correlation is used to test relationships between quantitative variables or categorical variables. In
other words, it's a measure of how things are related. The study of how variables are correlated
is called correlation analysis.
• Some examples of data that have a high correlation: Your caloric intake and your weight.
• Correlation means finding out the association between two variables, and correlation
coefficients are used to find out how strong the relationship between the two variables is. The most
popular correlation coefficient is Pearson’s Correlation Coefficient. It is very commonly used in
linear regression.
• Correlation is quantified per the value of a variable called r, which
ranges between –1 and 1.
• The closer the r-value is to 1 or –1, the more correlation there is
between two variables.
• If two variables have an r-value that’s close to 0, it could indicate that
they’re independent variables.
Calculating correlation with Pearson’s r
• If you want to uncover dependent relationships between continuous variables in
a dataset, you’d use statistics to estimate their correlation.
• The simplest form of correlation analysis is the Pearson correlation, which
assumes that
• Your data is normally distributed.
• You have continuous, numeric variables.
• Your variables are linearly related.
• Because the Pearson correlation has so many conditions, only use it to determine
whether a relationship between two variables exists, but not to rule out possible
relationships.
• If you were to get an r-value that is close to 0, it indicates that there is no linear
relationship between the variables, but that a nonlinear relationship between them
still could exist.
• Consider the example of car price detection where we have to detect
the price considering all the variables that affect the price of the car
such as carlength, curbweight, carheight, carwidth, fueltype, carbody,
horsepower, etc.
• We can see in the scatterplot, as the carlength, curbweight, carwidth
increases price of the car also increases.
• So, we can say that there is a positive correlation between the above
three variables with car price.
• Here, we also see that there is no correlation between the carheight
and car price.
• To find the Pearson coefficient, also referred to as the Pearson correlation
coefficient or the Pearson product-moment correlation coefficient, the two
variables are placed on a scatter plot. The variables are denoted as X and Y.
• There must be some linearity for the coefficient to be calculated; a scatter
plot not depicting any resemblance to a linear relationship will be useless.
• The closer the resemblance to a straight line of the scatter plot, the higher
the strength of association.
• Numerically, the Pearson coefficient is represented the same way as a
correlation coefficient that is used in linear regression, ranging from -1 to
+1.
Formula:-
r = [nΣxy – (Σx)(Σy)] / √{[nΣx² – (Σx)²][nΣy² – (Σy)²]}
Find the value of the correlation
coefficient from the following table:

Subject | Age x | Glucose Level y
1 | 43 | 99
2 | 21 | 65
3 | 25 | 79
4 | 42 | 75
5 | 57 | 87
6 | 59 | 81
Subject | Age x | Glucose Level y | xy | x² | y²
1 | 43 | 99 | 4257 | 1849 | 9801
2 | 21 | 65 | 1365 | 441 | 4225
3 | 25 | 79 | 1975 | 625 | 6241
4 | 42 | 75 | 3150 | 1764 | 5625
5 | 57 | 87 | 4959 | 3249 | 7569
6 | 59 | 81 | 4779 | 3481 | 6561
Σ | 247 | 486 | 20485 | 11409 | 40022

r = [6(20485) – 247(486)] / √{[6(11409) – 247²][6(40022) – 486²]} = 2868 / 5413.27 ≈ 0.5298
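The same coefficient can be obtained with SciPy (a sketch using the data from the table, assuming scipy is installed):

# Pearson correlation between age and glucose level.
from scipy.stats import pearsonr

age = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]
r, p_value = pearsonr(age, glucose)
print(f"Pearson r = {r:.4f}")   # ~0.5298, matching the hand calculation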
Ranking variable-pairs using Spearman’s rank
correlation
• The Spearman’s rank correlation is a popular test for determining correlation
between ordinal variables.
• Spearman’s rank correlation measures the strength and direction of association
between two ranked variables. It basically gives the measure of monotonicity of
the relation between two variables i.e. how well the relationship between two
variables could be represented using a monotonic function.
• Unlike Pearson correlation, Spearman's correlation does not assume a linear
relationship between the variables but rather examines the degree to which the
variables tend to increase or decrease together.
• By applying Spearman’s rank correlation, you’re converting numeric variable-
pairs into ranks by calculating the strength of the relationship between variables
and then ranking them per their correlation.
• The Spearman’s rank correlation assumes that
• Your variables are ordinal.
• Your variables are related non-linearly.
• Your data is non-normally distributed.
Formula:-
ρ = 1 – 6Σd² / (n(n² – 1))
• The scores for nine students in physics and math are as follows:
• Physics: 35, 23, 47, 17, 10, 43, 9, 6, 28
• Mathematics: 30, 33, 45, 23, 8, 49, 12, 4, 31
• Compute the student’s ranks in the two subjects and compute the
Spearman rank correlation.
Add a third column, d, to your data. The d is the difference
between ranks.
• Sum (add up) all of your d-squared values.
4 + 4 + 1 + 0 + 1 + 1 + 1 + 0 + 0 = 12. You’ll need this for the formula
• ρ = 1 – (6 × 12)/(9 × (81 – 1))
= 1 – 72/720
= 1 – 0.1
= 0.9
The Spearman Rank Correlation for this set of data is 0.9.
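The same result can be reproduced with SciPy (a sketch, assuming scipy is installed):

# Spearman rank correlation for the physics and maths scores.
from scipy.stats import spearmanr

physics = [35, 23, 47, 17, 10, 43, 9, 6, 28]
maths   = [30, 33, 45, 23, 8, 49, 12, 4, 31]
rho, p_value = spearmanr(physics, maths)
print(f"Spearman rho = {rho:.1f}")   # 0.9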
Reducing Data Dimensionality with Linear Algebra

• Any intermediate-level data scientist should have a pretty good understanding of
linear algebra and how to do math using matrices.
• Array and matrix objects are the primary data structure in analytical
computing.
• You need them in order to perform mathematical and statistical
operations on large and multidimensional datasets — datasets with
many different features to be tracked simultaneously.
• When you have a dataset with a large number of features, it is a very challenging task
to work with such a dataset.
• having a high number of variables is both a boon and a curse.
• It’s great that we have loads of data for analysis, but it is challenging due to
size.
• It’s not feasible to analyse each and every variable at a microscopic level.
• It might take us days or months to perform any meaningful analysis and we’ll
lose a ton of time and money for our business
• Also, the amount of computational power this will take is higher.
• We need a better way to deal with high dimensional data so that we can quickly
extract patterns and insights from it.
• So, using dimensionality reduction techniques, we can reduce the number of features in a
dataset without having to lose much information and keep (or improve) the model’s
performance.
• It’s a really powerful way to deal with huge datasets
• Data dimensionality reduction is a technique used to simplify and streamline large datasets
by reducing the number of features or variables while retaining important information.
• In simpler terms, it's like trying to capture the essence of the data with fewer characteristics.
• Imagine you have a dataset with many columns or features, each representing a different
aspect or measurement. However, some of these features might be redundant or provide
similar information. Dimensionality reduction helps in identifying and keeping only the most
relevant and essential features, making the data more manageable and often improving the
efficiency of analysis or machine learning algorithms.
What is Dimension?

• let’s first define what a dimension is. Given a matrix A, the dimension
of the matrix is the number of rows by the number of columns. If A
has 3 rows and 5 columns, A would be a 3x5 matrix.
• Now, in the simplest of terms, dimensionality reduction is exactly
what it sounds like, you’re reducing the dimension of a matrix to
something smaller than it currently is.
• Given a square (n by n) matrix A, the goal would be to reduce the
dimension of this matrix to be smaller than n x n.
• Current Dimension of A : n
Reduced Dimension of A : n - x, where x is some positive integer
• the most common application would be for data visualization
purposes. It’s quite difficult to visualize something graphically which
is in a dimension space greater than 3.
• Through dimensionality reduction, you’ll be able to transform your
dataset of 1000s of rows and columns into one small enough to
visualize in 3 / 2 / 1 dimensions.
What is dimensionality Reduction?
• As data generation and collection keeps increasing, visualizing it and
drawing inferences becomes more and more challenging.
• One of the most common ways of doing visualization is through
charts.
• Suppose we have 2 variables, Age and Height. We can use a scatter or
line plot between Age and Height and visualize their relationship
easily:
• Now consider a case in which we have, say 100 variables (p=100).
• It does not make much sense to visualize each of them separately.
• In such cases where we have a large number of variables, it is better to
select a subset of these variables (p<<100) which captures as much
information as the original set of variables.
• We can reduce the p dimensions of the data into a subset of k dimensions
(k << p). This is called dimensionality reduction.
Benefits of Dimensionality Reduction

• It helps to remove redundancy and noise in the features, ultimately enhancing
visualization of the given data set.
• Excellent memory management is exhibited due to dimensionality reduction,
because less data has to be stored.
• It improves the performance of the model by choosing the right features and
removing the unnecessary ones from the dataset.
• Fewer dimensions require less computation, so the model trains faster with
improved model accuracy.
• It considerably reduces the complexity and overfitting of the overall model,
improving its performance.
• Dimensionality reduction can be achieved by simply dropping columns.
• for example, those that may show up as collinear with others or identified as not
being particularly predictive of the target as determined by an attribute importance
ranking technique.
• But it can also be achieved by deriving new columns based on linear combinations
of the original columns.
• In both cases, the resulting transformed data set can be provided to machine
learning algorithms to yield faster model build times, faster scoring times, and
more accurate models.

• While SVD can be used for dimensionality reduction, it is often used in digital
signal processing for noise reduction, image compression, and other areas.
Eigenvector and Eigenvalue:-
• In simple terms, an eigenvalue is a special number associated with a
particular vector that represents how that vector stretches or shrinks
when a linear transformation is applied to it.
• Imagine you have a square, and you want to transform it by stretching
or shrinking it in different directions. A linear transformation can be
represented by a matrix.
• Now, the eigenvalue (often denoted by λ) is a number that tells you
how much the vector v gets scaled during this transformation.
Eigenvector and Eigenvalue:-
• Eigenvectors and eigenvalues have many important applications in computer
vision and machine learning in general.
• Well known examples are PCA (Principal Component Analysis) for
dimensionality reduction or EigenFaces for face recognition.
• An eigenvector is a vector whose direction remains unchanged when a linear
transformation is applied to it.
• To conceptualize an eigenvector, think of a matrix called A. Now consider a
nonzero vector called x such that Ax = λx for a scalar λ.
• In this scenario, the scalar λ is what’s called an eigenvalue of matrix A.
• The eigenvalue λ is permitted to take on a value of 0.
• Furthermore, x is the eigenvector that corresponds to λ, and unlike λ, it is not
permitted to be the zero vector.
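A small NumPy sketch (added for illustration) that computes eigenvalues and eigenvectors and verifies Ax = λx:

# Compute eigenpairs of a small symmetric matrix and check Ax = lambda * x.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
for lam, x in zip(eigenvalues, eigenvectors.T):   # eigenvectors are the columns
    print(lam, np.allclose(A @ x, lam * x))       # True for each eigenpair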
SVD(Singular Value Decomposition)
• The SVD linear algebra method decomposes the data matrix into the three
resultant matrices shown in Figure .
• The product of these matrices, when multiplied together, gives you back your
original matrix.
• SVD is handy when you want to remove redundant information by compressing
your dataset.

The SVD of an m x n matrix A is given by the formula:

A = u × S × vᵀ

where:
• »»A: This is the matrix that holds all your original data.
• »»u: This is a left-singular vector (an eigenvector) of A, and it holds all the
important, non-redundant information about your data’s observations.
• »»v: This is a right-singular eigenvector of A. It holds all the important,
non-redundant information about the columns in your dataset’s features.
• »»S: This is a diagonal matrix of the singular values of A (the square roots of the
eigenvalues of AᵀA). It contains all the information about the procedures performed
during the compression.
• Singular Value Decomposition (SVD) is useful in data dimensionality
reduction because it helps find the most important features in a dataset,
allowing us to represent the data with fewer, but still meaningful,
characteristics.
• Consider a dataset with information about houses, including features like
square footage, number of bedrooms, and distance to the city center. Using
SVD, you can find the most critical features that determine the overall
characteristics of a house. Instead of dealing with all the features, you focus
on the most important ones, reducing the dimensionality of the data while still
capturing the main factors that influence housing characteristics.
•A: Input data matrix — m x n matrix (eg. m documents, n terms)
•U: Left Singular Vectors — m x r matrix (m documents, r concepts)
•Σ: Singular Values — r x r diagonal matrix (strength of each ‘concept’)
where r is rank of matrix A
•V: Right Singular Vectors — n x r matrix (n terms, r concepts)
The theorem states the following:
1.U, Σ,V: unique
2.U, V: Both matrices are orthonormal in
nature. Orthonormal matrices are those whose
columns have Euclidean length 1; in other words,
the sum of squared values in each column of U and
V equals 1. Also, the columns are orthogonal: in
simple terms, the dot product of any two distinct
columns of U (or of V) is zero.
3.Σ: Diagonal — All the entries (singular
values) are positive and are sorted in decreasing
order (σ1≥σ2≥….≥0)
Example:-
• Left Singular Matrix (U): Columns of matrix U can be thought of as concepts; the first
column of U corresponds to the SciFi concept and the second column of U corresponds to
the Romance concept. What it basically shows is that the first 4 users correspond to the SciFi
concept and the last 3 users correspond to the Romance concept.
• Matrix U would be a “Users-to-Concept” similarity matrix. Each value in matrix U determines how
much a given user corresponds to a given concept (in our case there are two concepts, SciFi
and Romance). In the given matrix, for example, the first user corresponds to the SciFi concept
whereas the fifth user corresponds to the Romance concept.
• Singular Values (Σ): In this diagonal matrix, each diagonal value is a non-zero positive
number. Each value depicts the strength of a concept. For instance, it can be seen that the
“strength” of the SciFi concept is higher than that of the Romance concept.
• Right Singular Matrix (V): V is a “movie-to-concept” matrix. For instance, it shows that the
first three movies belong heavily to the first concept, i.e. the SciFi concept, while the last two
belong to the second concept, which is the Romance concept.
• Although it might sound complicated, it’s pretty simple. Imagine that you’ve compressed your
dataset and the resulting matrix S has values that sum to 100.
• If the first two values in S together account for 94 of that total, this means that the first two
components contain 94 percent of the dataset’s information.
• In other words, the first two columns of the u matrix and the first two rows of the v matrix contain
94 percent of the important information held in your original dataset, A.
• To isolate only the important, non-redundant information, you’d keep only those two columns and
discard the rest.
• When you go to reconstruct your matrix by taking the dot product of S, u, and v, you’ll probably
notice that the resulting matrix is not an exact match to your original dataset. That’s the data that
remains after much of the information redundancy and noise has been filtered out by SVD.
• When deciding the number of rows and columns to keep, it’s okay to get rid of rows and columns, as long as
you make sure that you retain at least 70 percent of the dataset’s original information.
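A minimal NumPy sketch of truncating an SVD to keep only the strongest components (added for illustration; the matrix below is random, not a real dataset):

# Decompose a matrix, keep the k strongest singular values, and reconstruct.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

U, S, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(S) @ Vt
k = 2                                              # keep the 2 strongest components
A_approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
kept = S[:k].sum() / S.sum()
print(f"share of singular-value 'information' kept: {kept:.2%}")
print("reconstruction error:", np.linalg.norm(A - A_approx))

The reconstruction is not an exact match to A, which is exactly the filtering effect described above.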
Reducing dimensionality with factor analysis
• Factor analysis is along the same lines as SVD in that it’s a method you can use for filtering out
redundant information and noise from your data.
• An offspring of the psychometrics field, this method was developed to help you derive a root
cause, in cases where a shared root cause results in shared variance — when a variable’s variance
correlates with the variance of other variables in the dataset.
• A variable’s variability measures how much variance it has around its mean.
• The greater a variable’s variance, the more information that variable contains.
• When you find shared variance in your dataset, that means information redundancy is at
play.
• You can use factor analysis or principal component analysis to clear your data of this information
redundancy.
• In order to apply Factor Analysis, we must make sure the data we have
is suitable for it.
• The simplest approach would be to look at the correlation matrix of
the features and identify groups of intercorrelated variables.
• If there are some correlated features with a correlation degree of more
than 0.3, perhaps it would be interesting to use Factor Analysis.
Groups of highly intercorrelated features will be merged into one
latent variable, called a factor.
• Factor analysis makes the following assumptions:
• Your features are metric — numeric variables on which meaningful calculations
can be made.
• Your features should be continuous or ordinal.
• You have more than 100 observations in your dataset and at least 5 observations
per feature.
• Your sample is homogenous.
• There is r > 0.3 correlation between the features in your dataset.
• In factor analysis, you do a regression on features to uncover underlying latent
variables, or factors.
• You can then use those factors as variables in future analyses, to represent the
original dataset from which they’re derived.
• At its core, factor analysis is the process of fitting a model to prepare a dataset for
analysis by reducing its dimensionality and information redundancy.
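An illustrative sketch of fitting a factor model (assuming scikit-learn; the data below is synthetic, generated from two hidden factors):

# Fit a 2-factor model to six correlated numeric features.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))                       # two hidden "root causes"
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))

fa = FactorAnalysis(n_components=2, random_state=0)
factors = fa.fit_transform(X)        # factor scores for each observation
print(factors.shape)                 # (200, 2): 6 correlated features reduced to 2 factors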
Decreasing dimensionality and removing outliers with PCA

• Principal component analysis (PCA) is another dimensionality reduction technique that’s
closely related to SVD.
• This unsupervised statistical method finds relationships between features in your dataset
and then transforms and reduces them to a set of non-information-redundant principal
components — uncorrelated features that embody and explain the information that’s
contained within the dataset (that is, its variance).
• These components act as a synthetic, refined representation of the dataset, with the
information redundancy, noise, and outliers stripped out.
• You can then take those reduced components and use them as input for your machine
learning algorithms, to make predictions based on a compressed representation of your
data.
• Principal Component Analysis is an unsupervised dimension reduction technique that
focuses on capturing maximum variation of the data.
• In this technique, variables are transformed into a new set of variables,
which are linear combination of original variables.
• This new set of variables is known as the principal components.
• They are obtained in such a way that the first principal component
accounts for most of the possible variation in the original data, after
which each succeeding component has the highest possible remaining variance.
• The PCA algorithm is based on some mathematical concepts such as:
• Variance and Covariance
• Eigenvalues and Eigenvectors
• The PCA model makes these two assumptions:
• Multivariate normality is desirable, but not required. (normal distribution)
• Variables in the dataset should be continuous.
• Although PCA is like factor analysis, there are two major differences
• One difference is that PCA does not regress to find some underlying cause of
shared variance, but instead decomposes a dataset to succinctly represent its most
important information in a reduced number of features.
• The other key difference is that, with PCA, the first time you run the model, you
don’t specify the number of components to be discovered in the dataset. You let
the initial model results tell you how many components to keep, and then you
rerun the analysis to extract those features.
• A small amount of information from your original dataset will not be captured by
the principal components.
• Just keep the components that capture at least 95 percent of the dataset’s total
variance. The remaining components won’t be that useful, so you can get rid of
them.
• When using PCA for outlier detection, simply plot the principal components on an
x-y scatter plot and visually inspect for areas that might have outliers.
• Those data points correspond to potential outliers that are worth investigating.
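A minimal scikit-learn sketch (added for illustration; the data is random) of keeping the components that explain 95 percent of the variance and producing the reduced representation used for outlier inspection:

# Keep enough principal components to explain 95% of the total variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))

pca = PCA(n_components=0.95)        # float in (0, 1) selects components by explained variance
X_reduced = pca.fit_transform(X)
print("components kept   :", pca.n_components_)
print("variance explained:", pca.explained_variance_ratio_.sum())
# A scatter plot of X_reduced[:, 0] vs X_reduced[:, 1] would reveal candidate outliers.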
Modelling Decisions with Multi-Criteria Decision
Making
• You can use MCDM methods in anything from stock portfolio management to
fashion-trend evaluation, from disease outbreak control to land development
decision making.
• Anywhere you have two or more criteria on which you need to base your
decision, you can use MCDM methods to help you evaluate alternatives.
• To use multi-criteria decision making, the following two assumptions must be
satisfied:
• Multi-criteria evaluation: You must have more than one criterion to optimize.
• Zero-sum system: Optimizing with respect to one criterion must come at the
sacrifice of at least one other criterion.
• This means that there must be trade-offs between criteria — to gain with respect
to one means losing with respect to at least one other.
• When taking a decision, there might not always be a finite number of
choices or there might be many alternatives to the original decision.
• There is also some possibility of not having a suitable choice for a given
criterion. Multiple Criteria Decision Making (MCDM) is an approach
designed for the evaluation of problems with a finite or an infinite
number of choices.
Example 1:-
• The best way to get a solid grasp on MCDM is to see how it’s used to solve a real- world problem.
• MCDM is commonly used in investment portfolio theory.
• Pricing of individual financial instruments typically reflects the level of risk you incur, but an entire portfolio
can be a mixture of virtually riskless investments (U.S. government bonds, for example) and minimum-,
moderate-, and high-risk investments.
• Your level of risk aversion dictates the general character of your investment portfolio.
• Highly risk-averse investors seek safer and less lucrative investments, and less risk-averse investors choose
riskier investments.
• In the process of evaluating the risk of a potential investment, you’d likely consider the following criteria:
• Earnings growth potential: Here, an investment that falls under an earnings growth potential threshold gets
scored as a 0; anything above that threshold gets a 1.
• Earnings quality rating: If an investment falls within a ratings class for earnings quality, it gets scored as a
0; otherwise, it gets scored as a 1.
• Earnings quality refers to various measures used to determine how reliable a
company’s reported earnings are.
• Dividend performance: When an investment doesn’t reach a set dividend
performance threshold, it gets a 0; if it reaches or surpasses that threshold, it gets a
1.
• Imagine that you’re evaluating 20 different potential investments.
• In this evaluation, you’d score each criterion for each of the investments.
• To eliminate poor investment choices, simply sum the criteria scores for each of the
alternatives and then dismiss any investments that do not get a total score of 3 —
leaving you with the investments that fall within a certain threshold of earning
growth potential, that have good earnings quality, and whose dividends perform at a
level that’s acceptable to you.
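A minimal Python sketch of this binary screen (the investment names and criterion scores below are made up):

# Sum the binary criterion scores and keep only alternatives that score 3.
investments = {
    "Stock A": {"growth": 1, "quality": 1, "dividend": 1},
    "Stock B": {"growth": 1, "quality": 0, "dividend": 1},
    "Stock C": {"growth": 1, "quality": 1, "dividend": 1},
}
keep = [name for name, criteria in investments.items() if sum(criteria.values()) == 3]
print(keep)   # only alternatives scoring 1 on every criterion survive the screen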
Example 2:-
• A shopper is in an electronics shop. Their objective is to purchase a new
mobile phone.
• Their criteria are price, screen size, storage space and appearance, with
price, storage space and screen size as the non-beneficial values and
appearance being a beneficial value that the shopper evaluates using a
five-point scale. To this shopper, all the criteria are equal in value, so each
has a weight of 25%. They've reduced their choices to three phones with the
following ratings:
• Phone A: 16,000, 6.2 inches, 32 GB, average looks (3 out of 5)
• Phone B: 19,000, 5.8 inches, 64 GB, excellent looks (5 out of 5)
• Phone C: 17,500, 6.0 inches, 64 GB, above-average looks (4 out of 5)
• In mathematics, a set is a group of numbers that shares some similar
characteristic.
• In traditional set theory, membership is binary — in other words, an
individual is either a member of a set or it’s not.
• If the individual is a member, it is represented with the number 1. If it
is not a member, it is represented by the number 0.
• Traditional MCDM is characterized by binary membership.
Focusing on fuzzy MCDM
• If you prefer to evaluate suitability within a range, instead of using binary membership terms of 0
or 1, you can use fuzzy multi-criteria decision making (FMCDM) to do that.
• With FMCDM you can evaluate all the same types of problems as you would with MCDM.
• The term fuzzy refers to the fact that the criteria being used to evaluate alternatives offer a range of
acceptability — instead of the binary, crisp set criteria associated with traditional MCDM.
• Evaluations based on fuzzy criteria lead to a range of potential outcomes, each with its own level
of suitability as a solution.
• One important feature of FMCDM: You’re likely to have a list of several fuzzy criteria, but
these criteria might not all hold the same importance in your evaluation.
• To correct for this, simply assign weights to criteria to quantify their relative importance.
Introducing Regression Methods
• Regression in statistics is a method used to understand and quantify the relationship between two
or more variables. It helps us predict or explain the value of one variable based on the values of
others.
• Machine learning algorithms of the regression variety were adopted from the statistics field, to
provide data scientists with a set of methods for describing and quantifying the relationships
between variables in a dataset.
• Use regression techniques if you want to determine the strength of correlation between
variables in your data.
• Regression analysis is a group of statistical methods that estimate the relationship between
a dependent variable (otherwise known as the outcome variables) and one or more independent
variables (often called predictor variables).
• Unlike many other models in Machine Learning, regression analyses
can be used for two separate purposes.
• First, in the social sciences, it is common to use regression analyses to
infer a causal relationship between a set of variables
• second, in data science, regression models are frequently used to
predict and forecast new values
• A regression model provides a function that describes the relationship
between one or more independent variables and a response, dependent,
or target variable.
• You can use regression to predict future values from historical values,
but be careful
• Regression methods assume a cause-and-effect relationship between
variables, but present circumstances are always subject to flux.
• Predicting future values from historical ones will generate incorrect
results when present circumstances change.
Linear regression
• Linear regression is a machine learning method you can use to describe and
quantify the relationship between your target variable, y — the predictant, in
statistics lingo — and the dataset features you’ve chosen to use as predictor
variables (commonly designated as dataset X in machine learning).
• Linear regression shows the linear relationship between the
independent(predictor) variable i.e. X-axis and the dependent(output) variable i.e.
Y-axis.
• If there is a single input variable X (the independent variable), such linear regression is
called simple linear regression.
• you can also use linear regression to quantify correlations between several
variables in a dataset — called multiple linear regression.
• Equation of Simple Linear Regression, where a0 is the intercept, a1 is the
coefficient or slope, x is the independent variable and y is the dependent variable:

y = a0 + a1x + ε

Y = Dependent Variable (Target Variable)
X = Independent Variable (Predictor Variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error
• Equation of Multiple Linear Regression, where b0 is the intercept,
b1, b2, b3, b4, …, bn are the coefficients or slopes of the independent variables
x1, x2, x3, x4, …, xn, and y is the dependent variable:

y = b0 + b1x1 + b2x2 + b3x3 + … + bnxn + ε

• A Linear Regression model’s main aim is to find the best fit linear
line and the optimal values of intercept and coefficients such that
the error is minimized.
The above graph presents the linear relationship between the output(y) variable and predictor(X)
variables. The blue line is referred to as the best fit straight line. Based on the given data points, we
attempt to plot a line that fits the points the best.
• Before using linear regression, though, make sure you’ve considered
its limitations:
• Linear regression only works with numerical variables, not categorical ones.
• If your dataset has missing values, it will cause problems. Be sure to address
your missing values before attempting to build a linear regression model.
• If your data has outliers present, your model will produce inaccurate results.
• Check for outliers before proceeding.
• The linear regression assumes that there is a linear relationship
between dataset features and the target variable. Test to make sure this
is the case, and if it’s not, try using a log transformation to
compensate.
• The linear regression model assumes that all features are independent
of each other.
• Prediction errors, or residuals, should be normally distributed.
• you should have at least 20 observations per predictive feature if you
expect to generate reliable results using linear regression.
Logistic regression
• Logistic Regression is a “Supervised machine learning” algorithm that can be used to model the
probability of a certain class or event. It is used when the data is linearly separable and the
outcome is binary in nature.
• Logistic regression is a machine learning method you can use to estimate values for a categorical
target variable based on your selected features.
• Your target variable should be numeric, and contain values that describe the target’s class — or
category.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome
must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or False, etc. but
instead of giving the exact value as 0 and 1, it gives the probabilistic values which lie between 0
and 1.
• Logistic Regression is very similar to Linear Regression except in how it is used.
Linear Regression is used for solving regression problems, whereas Logistic Regression is used
for solving classification problems.
• In Logistic regression, instead of fitting a regression line, we fit an "S"
shaped logistic function, which predicts two maximum values (0 or 1).
• The curve from the logistic function indicates the likelihood of something
such as whether the cells are cancerous or not, a mouse is obese or not
based on its weight, etc.
• Logistic Regression is a significant machine learning algorithm because it
has the ability to provide probabilities and classify new data using
continuous and discrete datasets.
• Logistic Regression can be used to classify the observations using
different types of data and can easily determine the most effective
variables used for the classification.
Logistic Function (Sigmoid Function):
• The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
• It maps any real value into another value within a range of 0 and 1.
• The value of the logistic regression must be between 0 and 1, which
cannot go beyond this limit, so it forms a curve like the "S" form. The
S-form curve is called the Sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which defines the boundary
between the two classes: values above the threshold tend toward class 1, and values below the
threshold tend toward class 0.
The Logistic Regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get the Logistic Regression equation are given below:
• We know the equation of the straight line can be written as:
y = b0 + b1x1 + b2x2 + ... + bnxn
• In Logistic Regression y can be between 0 and 1 only, so let's divide the above y by (1 - y)
to form the odds, which is 0 for y = 0 and infinity for y = 1:
y / (1 - y)
• But we need a range from -infinity to +infinity, so take the logarithm of the expression; it
becomes:
log[ y / (1 - y) ] = b0 + b1x1 + b2x2 + ... + bnxn
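To make the mapping concrete, here is a small sketch (using numpy; the 0.5 threshold is an assumption and can be tuned) of the sigmoid function turning linear scores into probabilities and then into class labels:

import numpy as np

def sigmoid(z):
    # Map any real value z into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])   # linear scores b0 + b1*x1 + ...
probs = sigmoid(z)
labels = (probs >= 0.5).astype(int)          # apply the threshold to get class 0/1
print(probs)    # approximately [0.018, 0.269, 0.5, 0.731, 0.982]
print(labels)   # [0 0 1 1 1]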
• One cool thing about logistic regression is that, in addition to predicting the class of observations
in your target variable, it indicates the probability for each of its estimates. Though logistic
regression is like linear regression, its requirements are simpler, in that:
• There does not need to be a linear relationship between the features and target variable.
• Residuals don’t have to be normally distributed.
• Predictive features are not required to have a normal distribution.
• When deciding whether logistic regression is a good choice for you, make sure to consider the
following limitations:
• Missing values should be treated or removed.
• Your target variable must be binary or ordinal.
• Predictive features should be independent of each other.
• Logistic regression requires a greater number of observations (than linear regression) to produce a
reliable result.
• The rule of thumb is that you should have at least 50 observations per predictive feature if you
expect to generate reliable results.
Example:-
• Let us consider a problem where we are given a dataset containing
Height and Weight for a group of people.
• Our task is to predict the Weight for new entries in the Height column.
• So we can figure out that this is a regression problem where we will
build a Linear Regression model.
• We will train the model with provided Height and Weight values.
• Once the model is trained we can predict Weight for a given unknown
Height value.
• Now suppose we have an additional field Obesity and we have to classify whether a person is
obese or not depending on their provided height and weight.
• This is clearly a classification problem where we have to segregate the dataset into two classes
(Obese and Not-Obese).
• So, for the new problem, we can again follow the Linear Regression steps and build a regression
line.
• This time, the line will be based on two parameters, Height and Weight, and the regression line will
fit between two discrete sets of values.
• As this regression line is highly susceptible to outliers, it will not do a good job in classifying two
classes.
• To get a better classification, we will find probability for each output value from the regression
line.
• Now based on a predefined threshold value, we can easily classify the output into two classes
Obese or Not-Obese.
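A hedged sketch of this example with scikit-learn (the heights, weights, and obesity labels below are purely hypothetical) could look like this:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: [height in cm, weight in kg] and obesity labels (1 = obese)
X = np.array([[150, 50], [160, 58], [165, 90], [170, 72], [175, 110], [180, 80]])
y = np.array([0, 0, 1, 0, 1, 0])

clf = LogisticRegression().fit(X, y)

new_person = np.array([[172, 95]])
print("P(obese):", clf.predict_proba(new_person)[0, 1])   # probability between 0 and 1
print("class:", clf.predict(new_person)[0])                # thresholded at 0.5 by default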
Ordinary least squares (OLS) regression
methods
• Ordinary Least Squares regression (OLS) is a common technique for estimating coefficients of
linear regression equations which describe the relationship between one or more independent
quantitative variables and a dependent variable (simple or multiple linear regression).
• Least squares refers to minimizing the sum of squared errors (SSE). Maximum likelihood and the
Generalized Method of Moments estimator are alternative approaches to OLS.
• Example: We want to predict the height of plants depending on the number of days they have spent
in the sun. Before getting exposure, they are 30 cm. A plant grows 1 mm (0.1 cm) after being
exposed to the sun for a day.
• Y is the height of the plants
• X is the number of days spent in the sun
• β0 is 30 because it is the value of Y when X is 0.
• β1 is 0.1 because it is the coefficient multiplied by the number of days.
• A plant being exposed 5 days to the sun has therefore an estimated height of Y = 30 + 0.1*5 = 30.5
cm.
How do ordinary least squares (OLS) work?
• The OLS method aims to minimize the sum of squared differences between the observed and
predicted values.
• For example, if your real values are 2, 3, 5, 2, and 4 and your predicted values are 3, 2, 5, 1,
and 5, then the total error would be (3-2)+(2-3)+(5-5)+(1-2)+(5-4) = 1-1+0-1+1 = 0 and the average
error would be 0/5 = 0, which could lead to false conclusions.
• However, if you compute the sum of squared errors, you get
(3-2)^2+(2-3)^2+(5-5)^2+(1-2)^2+(5-4)^2 = 4, so the mean squared error is 4/5 = 0.8. By taking the
square root to scale the error back to the data, we get sqrt(0.8) ≈ 0.89, so on average the
predictions differ by about 0.89 from the real value.
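The same arithmetic can be checked with a few lines of numpy (values taken from the example above):

import numpy as np

actual = np.array([2, 3, 5, 2, 4])
predicted = np.array([3, 2, 5, 1, 5])

errors = predicted - actual
print("mean error:", errors.mean())                               # 0.0 (positive and negative errors cancel)
print("mean squared error:", np.mean(errors ** 2))                # 0.8
print("root mean squared error:", np.sqrt(np.mean(errors ** 2)))  # approximately 0.894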
• Now, the idea of Simple Linear Regression is finding those parameters
α and β for which the error term is minimized.
• To be more precise, the model will minimize the squared errors:
indeed, we do not want our positive errors to be compensated by the
negative ones, since they are equally penalizing for our model.
• This procedure is called Ordinary Least Squared error — OLS.
• With OLS, you do this by squaring the vertical distance values that describe the
distances between the data points and the best-fit line, adding up those squared
distances, and then adjusting the placement of the best-fit line so that the summed
squared distance value is minimized.
• Use OLS if you want to construct a function that’s a close approximation to your
data.
• As always, don’t expect the actual value to be identical to the value predicted by
the regression.
• Values predicted by the regression are simply estimates that are most similar to
the actual values in the model.
• OLS is particularly useful for fitting a regression line to models containing more than one
independent variable.
• In this way, you can use OLS to estimate the target from dataset features.
• When using OLS regression methods to fit a regression line that has more than one independent
variable, two or more of the IVs may be interrelated.
• When two or more IVs are strongly correlated with each other, this is called multicollinearity.
• Multicollinearity tends to adversely affect the reliability of the IVs as predictors when they’re
examined apart from one another.
• Luckily, however, multicollinearity doesn’t decrease the overall predictive reliability of the model
when it’s considered collectively.
Detecting Outliers
• Many statistical and machine learning approaches assume that there are no
outliers in your data.
• Outlier removal is an important part of preparing your data for analysis.
• Analysing extreme values
• Outliers are data points with values that are significantly different than the
majority of data points comprising a variable.
• It is important to find and remove outliers, because, left untreated, they skew
variable distribution, make variance appear falsely high, and cause a
misrepresentation of intervariable correlations.
Effect of outlier on dataset
• Most machine learning and statistical models assume that your data is free of outliers, so spotting
and removing them is a critical part of preparing your data for analysis.
• Not only that, you can use outlier detection to spot anomalies that represent fraud, equipment
failure, or cybersecurity attacks.
• In other words, outlier detection is a data preparation method and an analytical method in its own
right.
• Outliers fall into the following three categories:
• Point: Point outliers are data points with anomalous values compared to the normal range of
values in a feature.
• Contextual: Contextual outliers are data points that are anomalous only within a specific context.
• To illustrate, if you are inspecting weather station data from January in Orlando, Florida, and you
see a temperature reading of 23 degrees F, this would be quite anomalous because the average
temperature there is 70 degrees F in January.
• But consider if you were looking at data from January at a weather
station in Anchorage, Alaska — a temperature reading of 23 degrees F
in this context is not anomalous at all.
• Collective: These outliers appear nearby to one another, all having
similar values that are anomalous to the majority of values in the
feature.
• You can detect outliers using either a univariate or multivariate
approach.
• This outlier could be the result of many different issues:
• Human error
• Instrument error
• Experimental error
• Intentional creation
• Data processing error
• Sampling error
• Natural outlier
• The purpose for being able to identify this outlier of course can also be
different.
• This could be because an outlier would indicate something has
changed in the action that produces the data which is useful in the case
of:
• Fraud detection
• Intrusion detection
• Fault diagnostics
• Time series monitoring
• Health monitoring
Detecting outliers with univariate analysis
• Univariate outlier detection is where you look at features in your dataset, and inspect them
individually for anomalous values.
• There are two simple methods for doing this:
• Tukey outlier labelling
• Tukey boxplot
• It is cumbersome to detect outliers using Tukey outlier labelling, but if you want to do it, the trick
here is to see how far the minimum and maximum values are from the 25th and 75th percentiles.
• The distance between the 1st quartile Q1 (at 25 percent) and the 3rd quartile (at 75 percent) Q3
is called the inter-quartile range (IQR), and it describes the data’s spread.
• When you look at a variable, consider its spread, its Q1 / Q3 values, and its minimum and
maximum values to decide whether the variable is suspect for outliers.
• Any data point that falls outside of either 1.5 times the IQR below the first
quartile or 1.5 times the IQR above the third quartile is considered an outlier.
• Here’s a good rule of thumb: a = Q1 - 1.5*IQR and b = Q3 + 1.5*IQR.
• If your minimum value is less than a, or your maximum value is greater than b,
the variable probably has outliers.
Method:-
1.Sort your data from low to high
2.Identify the first quartile (Q1), the median, and the third quartile (Q3).
3.Calculate your IQR = Q3 – Q1
4.Calculate your upper fence = Q3 + (1.5 * IQR)
5.Calculate your lower fence = Q1 – (1.5 * IQR)
6.Use your fences to highlight any outliers, all values that fall outside
your fences.
Your dataset has 11 values. You have a couple of extreme values in your dataset, so you’ll use the
IQR method to check whether they are outliers.
26 37 24 28 35 22 31 53 41 64 29
Step 1: Sort your data from low to high
First, you’ll simply sort your data in ascending order.
22 24 26 28 29 31 35 37 41 53 64
Step 2: Identify the median, the first quartile (Q1), and the third quartile (Q3)
The median is the value exactly in the middle of your dataset when all values are ordered from low to
high.
Since you have 11 values, the median is the 6th value. The median value is 31.
22 24 26 28 29 31 35 37 41 53 64
• Next, we’ll use the exclusive method for identifying Q1 and Q3. This
means we remove the median from our calculations.
• The Q1 is the value in the middle of the first half of your dataset,
excluding the median. The first quartile value is 26.
22 24 26 28 29
Your Q3 value is in the middle of the second half of your dataset, excluding the median. The third
quartile value is 41.
35 37 41 53 64
Calculate your IQR
The IQR is the range of the middle half of your dataset. Subtract Q1 from Q3 to calculate the IQR.
Formula: IQR = Q3 – Q1
Calculation: Q1 = 26, Q3 = 41, so IQR = 41 – 26 = 15
Calculate your upper fence
The upper fence is the boundary around the third quartile. It tells you that any values
exceeding the upper fence are outliers.
Formula: Upper fence = Q3 + (1.5 * IQR)
Calculation: Upper fence = 41 + (1.5 * 15) = 41 + 22.5 = 63.5
Calculate your lower fence
The lower fence is the boundary around the first quartile. Any values less than the lower fence
are outliers.
Formula: Lower fence = Q1 – (1.5 * IQR)
Calculation: Lower fence = 26 – (1.5 * 15) = 26 – 22.5 = 3.5
Use your fences to highlight any outliers
Go back to your sorted dataset from Step 1 and highlight any values that are greater than the upper
fence or less than your lower fence.
These are your outliers.
•Upper fence = 63.5
•Lower fence = 3.5
22 24 26 28 29 31 35 37 41 53 64
The value 64 is greater than the upper fence of 63.5, so it is the only outlier in this dataset.
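A short numpy sketch of the same fence calculation follows. Note that numpy's default percentile interpolation differs slightly from the exclusive-quartile method used above, so Q1 and Q3 will not match exactly, although 64 is still flagged:

import numpy as np

data = np.array([26, 37, 24, 28, 35, 22, 31, 53, 41, 64, 29])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
print("fences:", lower_fence, upper_fence)   # 9.0 and 57.0 with numpy's default method
print("outliers:", outliers)                 # [64]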
Tukey Boxplot
• In comparison, a Tukey boxplot is a pretty easy way to spot outliers.
• A Tukey boxplot, also known as a box-and-whisker plot, is a graphical
representation of the distribution of a dataset.
• It provides a visual summary of key statistics, including the median, quartiles,
and potential outliers.
• The plot consists of a rectangular "box" that represents the interquartile range
(IQR) and "whiskers" that extend from the box to indicate the range of the
data. Outliers can also be displayed as individual points.
• Each boxplot has whiskers that are set at 1.5*IQR. Any values that lie beyond
these whiskers are outliers.
• Figure shows outliers as they appear within a Tukey boxplot.
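For instance, a boxplot of the dataset from the worked example can be drawn with matplotlib (whis=1.5 is already the default whisker length):

import matplotlib.pyplot as plt

data = [26, 37, 24, 28, 35, 22, 31, 53, 41, 64, 29]

plt.boxplot(data, whis=1.5)   # points beyond the 1.5*IQR whiskers are drawn individually as outliers
plt.title("Tukey boxplot")
plt.show()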
Detecting outliers with multivariate analysis
• Sometimes outliers only show up within combinations of data points from disparate variables.
• These outliers really wreak havoc on machine learning algorithms, so it’s important to detect and
remove them.
• You can use multivariate analysis of outliers to do this.
• A multivariate approach to outlier detection involves considering two or more variables at a time
and inspecting them together for outliers.
• There are several methods you can use, including
• Scatter-plot matrix
• Boxplot
• Density-based spatial clustering of applications with noise (DBScan)
• Principal component analysis (PCA)
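As one hedged sketch of the multivariate approach, DBSCAN from scikit-learn labels points that do not belong to any dense cluster as noise (label -1), and those noise points can be treated as candidate outliers. The eps and min_samples values below are assumptions that would need tuning on real data:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))                      # bulk of the data around the origin
X = np.vstack([X, [[8.0, 8.0], [-7.0, 9.0]]])      # two obvious multivariate outliers

labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(X)
print("rows flagged as noise:", np.where(labels == -1)[0])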
Introducing Time Series Analysis
• A time series is just a collection of data on attribute values over time.
• Time series analysis is performed to predict future instances of the measure based
on the past observational data.
• To forecast or predict future values from data in your dataset, use time series
techniques
• In time series the order of observations provides a source of additional
information that should be analysed and used in the prediction process
• Time series are typically assumed to be generated at regularly spaced interval of
time (e.g. daily temperature), and so are called regular time series.
• A time series is ordered by time; the ordering may be at the level of years, months, weeks,
days, hours, minutes, or seconds.
• A time series is a set of observations taken at discrete, successive time intervals.
• Plotted, a time series looks like a running chart.
• Time Series Analysis (TSA) is used in many fields for time-based predictions, such as weather
forecasting, finance, signal processing, and engineering domains like control systems and
communications systems.
• Because TSA works with information that arrives in a particular sequence, it is distinct from
spatial and other analyses.
• Time series can have one or more variables that change over time.
• If there is only one variable varying over time, we call it Univariate
time series.
• If there is more than one variable it is called Multivariate time series.
Time Series Analysis
• Time series analysis is a method of examining and interpreting data points collected over
time to identify patterns, trends, and make predictions. In simpler terms, it's about
understanding how a particular quantity changes over time.
• Here are the basic components of time series analysis explained in simple terms:
1.Time Series Data:
1. Time series data consists of observations or measurements taken at different points in time. For
example, daily stock prices, monthly temperature readings, or yearly sales figures.
2.Components of Time Series:
1. Trend: The long-term movement or pattern in the data. It shows whether the values are generally
increasing, decreasing, or staying constant over time.
2. Seasonality: Repeating patterns or cycles that occur at regular intervals. For instance, retail sales
might have a seasonal pattern, increasing during holiday seasons.
3. Random Fluctuations: Unpredictable and irregular variations that do not follow a specific pattern.
• Analysis Techniques:
• Descriptive Statistics: Simple statistics like mean, median, and
standard deviation to understand the central tendency and variability
of the data.
• Data Visualization: Plots and charts (e.g., line charts) to visually
inspect trends and patterns.
• Moving Averages: A technique to smooth out fluctuations and
highlight trends over time.
• Forecasting: Predicting future values based on historical data.
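As a small illustration of the moving-average idea (the values below are invented; pandas is assumed to be available):

import pandas as pd

values = pd.Series(
    [12, 14, 13, 17, 18, 21, 20, 24, 23, 27, 30, 29],
    index=pd.date_range("2024-01-01", periods=12, freq="D"),
)
print(values.rolling(window=3).mean())   # 3-point moving average smooths short-term fluctuations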
How to analyse Time Series?
• Collecting the data and cleaning it
• Preparing visualizations of time versus the key feature
• Observing the stationarity of the series
• Developing charts to understand its nature
• Model building – AR, MA, ARMA and ARIMA
• Extracting insights from the predictions
Applications:
1. Economics and Finance: Analyzing stock prices, GDP trends, and interest rates.
2. Meteorology: Studying temperature, rainfall, and other weather-related data.
3. Business: Forecasting sales, demand, and inventory levels.
4. Healthcare: Tracking patient data over time for disease patterns.
Importance:
• Time series analysis helps in making informed decisions by understanding historical patterns
and predicting future trends.
• It is used in various fields to identify anomalies, plan for future scenarios, and optimize
processes.
Identifying patterns in time series
• Time series exhibit specific patterns.
• Take a look at Figure to get a better understanding of what these patterns are all
about.
• Constant time series remain at roughly the same level over time, but are subject
to some random error.
• In contrast, trended series show a stable linear movement up or down.
• Whether constant or trended, time series may also sometimes exhibit seasonality
— predictable, cyclical fluctuations that reoccur seasonally throughout a year.
• As an example of seasonal time series, consider how many businesses show
increased sales during the holiday season.
• Let’s discuss the time series’ data types and their influence. While discussing TS
data-types, there are two major types.
• Stationary
• Non- Stationary
• Stationary: a series without trend, seasonality, cyclical, or irregular components; as rules of
thumb:
• The MEAN of the series should be constant over the period of analysis
• The VARIANCE should be constant with respect to the time-frame
• The COVARIANCE between observations should depend only on the lag between them, not on the
point in time at which it is measured
• Non- Stationary: This is just the opposite of Stationary.
• If you’re including seasonality in your model, incorporate it in the quarter, month, or even 6-
month period — wherever it’s appropriate.
• Time series may show nonstationary processes — or, unpredictable cyclical behaviour that is
not related to seasonality and that results from economic or industry-wide conditions instead.
• Because they’re not predictable, nonstationary processes can’t be forecasted.
• You must transform nonstationary data to stationary data before moving forward with an
evaluation (a stationarity-check sketch follows this list).
• There are 2 methods used for time series forecasting.
• Univariate Time-series Forecasting: only two variables in which one is time and the other is
the field to forecast.
• Multivariate Time-series Forecasting: contain multiple variables keeping one variable as
time and others will be multiple in parameters.
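As the sketch referenced above (assuming statsmodels is installed; the series is synthetic), an Augmented Dickey-Fuller test can flag a nonstationary series, and first differencing is one common transformation toward stationarity:

import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(1)
series = np.cumsum(rng.normal(size=200)) + 0.05 * np.arange(200)   # random walk with drift: nonstationary

p_before = adfuller(series)[1]
p_after = adfuller(np.diff(series))[1]             # first difference of the series
print("p-value before differencing:", p_before)    # large: cannot reject nonstationarity
print("p-value after differencing:", p_after)      # small: the differenced series looks stationary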
Modelling univariate time series data
• A univariate time series is a sequence of measurements of the same variable collected over
time. Most often, the measurements are made at regular time intervals.
• Univariate time series analysis focuses on the analysis of a single variable over
time. In simpler terms, it involves examining and understanding the patterns,
trends, and characteristics of a time-ordered sequence of data points for one
specific variable.
• This type of analysis is particularly useful when you are interested in the behavior
of a single quantity over a period of time.
• Univariate time series analysis deals with data that consists of observations or
measurements taken at different points in time for one variable. For example,
daily stock prices, monthly temperature readings, or hourly sales figures for a
specific product.
• Similar to how multivariate analysis is the analysis of relationships between multiple
variables, univariate analysis is the quantitative analysis of only one variable at a time.
• When you model univariate time series, you are modelling time series changes that
represent changes in a single variable over time.
• Autoregressive moving average (ARMA) is a class of forecasting methods that you
can use to predict future values from current and historical data.
• As its name implies, the family of ARMA models combines autoregression techniques (analyses
that assume that previous observations are good predictors of future values, and perform an
autoregression analysis to forecast those future values) with
• moving average techniques (models that measure the level of the constant time series and then
update the forecast model if any changes are detected).
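A hedged sketch of fitting such a model with statsmodels (its ARIMA class with d = 0 gives a plain ARMA; the series below is synthetic):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
noise = rng.normal(size=200)
series = np.zeros(200)
for t in range(1, 200):
    series[t] = 0.6 * series[t - 1] + noise[t]    # simulate an AR(1) process

model = ARIMA(series, order=(1, 0, 1)).fit()      # ARMA(1, 1): AR order 1, no differencing, MA order 1
print("estimated parameters:", model.params)
print("next 5 forecasts:", model.forecast(steps=5))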
Time Series modelling Steps
• There are three steps that you have to check before you can advance to
time series modeling
• Step 1: Is your data stationary?
• The first step that you have to do is to check whether your data is
stationary or not.
• Stationary here means that the time series data properties such as mean,
variance, autocorrelation, etc. are all constant over time.
• This is important because most of the forecasting methods in time series
analysis are based on the assumption that the time series can be rendered
approximately stationary through the use of mathematical transformations.
• Step 2: Is there any exogenous variable available?
• Exogenous variables, also called independent variables or predictors, are external factors
that are considered in a model but are not influenced by other variables within the system
being analyzed.
• If your time series data contains exogenous variables, include them in your model to improve
the information gained from it.
• Step 3: How much is your data?
• If your data are pretty small (say, hundreds of observation periods in
your data), then traditional statistics models— like ARIMA, SARIMA,
ARCH-GARCH, etc. and their modifications — will be enough for
good model performance, since the model equation that will be
generated from that data (and the data that are used to train your
model) won’t be too complex.
• On the contrary, if your data are very big, then use data-driven models
like machine learning models, neural network models, etc. to help you
reach better model performance
Univariate Time-series Models:
• Univariate time series analysis involves various models to analyze and
predict the behavior of a single variable over time. Here is a list of
common univariate time series analysis models:
• Autoregressive Integrated Moving Average (ARIMA)
• Seasonal ARIMA (SARIMA)
• Exponential Smoothing State Space Models (ETS)
• Autoregressive Integrated Moving Average with Exogenous Inputs
(ARIMAX)
• Long Short-Term Memory (LSTM) Networks
• Moving Averages (MA)
• Vector AutoRegressive (VAR) and Vector Moving Average (VMA) models (strictly speaking, these
are multivariate extensions of AR and MA)