Unit1 - Read-Only
• Making predictions and searching for structure in data are among the most
important parts of data science.
• Probability and Statistics are important because they support a wide range of analytical
tasks.
• Probability and Statistics underlie many of the predictive algorithms used in Machine
Learning. They help in deciding how reliable the data, and the conclusions drawn from it, are.
• Probability is one of the most fundamental concepts in statistics.
• A statistic is a result that’s derived from performing a mathematical operation on
numerical data.
• Probability is all about chance, whereas statistics is about how we collect, analyse,
and interpret data using different techniques.
Probability basics:-
• Probability denotes the possibility of the outcome of any random event.
• The term refers to the extent to which any event is likely to happen.
• For example, when we flip a coin in the air, what is the possibility of getting a
head? The answer to this question is based on the number of possible outcomes.
Here the possibility is either head or tail will be the outcome. So, the probability
of a head to come as a result is 1/2.
• The probability is the measure of the likelihood of an event to happen. It
measures the certainty of the event. The formula for probability is given by;
• P(E) = Number of Favourable Outcomes/Number of total outcomes
• P(E) = n(E)/n(S)
• Probability denotes the possibility of something happening.
• It is a mathematical concept that predicts how likely events are to
occur.
• The probability values are expressed between 0 and 1.
• The definition of probability is the degree to which something is likely
to occur.
• This fundamental theory of probability is also applied to probability
distributions.
Axioms of probability
• An axiom is a rule or principle that is widely accepted as true. It is the
premise on the basis of which we do further reasoning.
• There are three axioms of probability that form the foundation of probability
theory:
• Axiom 1: Probability of an Event
• The probability of an event is always between 0 and 1. A probability of 1
indicates that the event is certain to occur, and a probability of 0 indicates that the
event cannot occur.
• Axiom 2: The probability of the entire sample space is 1, i.e., P(S) = 1.
• Axiom 3: For mutually exclusive events, the probability that at least one of them occurs
is the sum of their individual probabilities.
Discrete Random Variables
• If the random variable X can assume only a finite or countably infinite set of values, then it is
called a discrete random variable.
• There are many situations where the random variable X can assume only a finite or countably
infinite set of values. Examples of discrete random variables are:
• 1. Credit rating (usually classified into different categories such as low, medium and high or using
labels such as AAA, AA, A, BBB, etc.).
• 2. Number of orders received at an e-commerce retailer which can be countably infinite.
• 3. Customer churn [the random variables take binary values: (a) Churn and (b) Do not churn].
• 4. Fraud [the random variables take binary values: (a) Fraudulent transaction and (b) Genuine
transaction].
• 5. Any experiment that involves counting (for example, number of returns in a day from customers
of e-commerce portals such as Amazon, Flipkart; number of customers not accepting job offers
from an organization).
Continuous Random Variables
• A random variable X which can take a value from an infinite set of values is called
a continuous random variable. Examples of continuous random variables are listed
below:
• 1. Market share of a company (which can take any value from an infinite set of values
between 0 and 100%).
• 2. Percentage of attrition among employees of an organization.
• 3. Time to failure of engineering systems.
• 4. Time taken to complete an order placed at an e-commerce portal.
• 5. Time taken to resolve a customer complaint at call and service centers.
• For example, the letter X may be designated to represent the sum of
the resulting numbers after three dice are rolled.
• In this case, X could be 3 (1 + 1+ 1), 18 (6 + 6 + 6), or somewhere
between 3 and 18, since the highest number of a die is 6 and the
lowest number is 1.
• Random variables are often designated by letters and can be classified
as discrete, which are variables that have specific values, or
continuous, which are variables that can have any values within a
continuous range.
• A random variable has a probability distribution that represents the
likelihood that any of the possible values would occur.
• Let’s say that the random variable, Z, is the number on the top face of a
die when it is rolled once.
• The possible values for Z will thus be 1, 2, 3, 4, 5, and 6. The probability
of each of these values is 1/6 as they are all equally likely to be the value
of Z.
• For instance, the probability of getting a 3, or P (Z=3), when a die is
thrown is 1/6, and so is the probability of having a 4 or a 2 or any other
number on all six faces of a die. Note that the sum of all probabilities is 1.
• Discrete Random Variables
• Discrete random variables take on a countable number of distinct values. Consider an
experiment where a coin is tossed three times.
• If X represents the number of times that the coin comes up heads, then X is a discrete
random variable that can only have the values 0, 1, 2, 3 (from no heads in three successive
coin tosses to all heads). No other value is possible for X.
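A quick enumeration confirms these probabilities; the following is just a minimal Python sketch of the three-toss example above, listing all eight equally likely outcomes.

```python
# Distribution of X = number of heads in three fair coin tosses
from itertools import product
from collections import Counter

outcomes = list(product("HT", repeat=3))          # all 8 equally likely sequences
counts = Counter(seq.count("H") for seq in outcomes)

for heads in sorted(counts):
    print(heads, counts[heads] / len(outcomes))   # P(X = 0..3) = 1/8, 3/8, 3/8, 1/8
```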
• Two major kinds of distributions, based on the type of values the variables can
take, are:
1.Discrete Distributions
2.Continuous Distributions
Discrete Distribution Vs Continuous Distribution
Basically, we are trying to find the probability of event A, given that event B is true:
P(A|B) = P(B|A) × P(A) / P(B)
Here P(A) is called the prior probability, which means it is the probability of the
event before the evidence is seen; P(B|A) is the likelihood; and P(B) is the marginal
probability of the evidence.
P(A|B) is called the posterior probability
• Probability of the event after the evidence is seen. With regards to our
dataset, this formula can be re-written as P(Y|X) = P(X|Y) × P(Y) / P(X), where:
• Y: class of the variable
• X: dependent feature vector (of size n)
Example:-
• Let's say you visit a doctor because you have a cough. The doctor knows that there
are two possible causes for the cough: a common cold or a more serious lung disease.
Let Event A be "you have the lung disease" (so A′ means "you only have a cold") and
let Event B be "the lung scan comes back positive".
1.Prior Beliefs:
1. Before any tests, the doctor estimates that 90% of patients with a cough have only a common cold
(P(A′) = 0.9), and 10% have the lung disease (P(A) = 0.1).
2.Test Information:
1. The doctor performs a lung scan. The scan detects the lung disease 95% of the time when it is
present (P(B|A) = 0.95), but there is a 10% chance of a false positive (P(B|A′) = 0.1).
3.Updating Beliefs:
1. You get the test result, and it indicates that you have the lung disease. Now, you want to
know the probability that you actually have the disease (P(A|B)).
• Applying Bayes' Theorem:
• Plug the values into the formula:
• P(A|B) = P(B|A) × P(A) / P(B)
• P(A|B) = 0.95 × 0.1 / P(B)
Calculating P(B):
• Calculate the denominator using the law of total probability:
P(B) = P(B|A) × P(A) + P(B|A′) × P(A′)
P(B) = 0.95 × 0.1 + 0.1 × 0.9 = 0.185
• Final Calculation:
• Substitute the values back into the formula:
P(A|B) = (0.95 × 0.1) / (0.95 × 0.1 + 0.1 × 0.9) = 0.095 / 0.185 ≈ 0.51
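A quick way to sanity-check the arithmetic above is to code the update directly; this minimal Python sketch simply re-applies Bayes' theorem to the numbers from the example.

```python
# Bayes' theorem for the cough example: A = lung disease, B = positive scan
p_a = 0.1                # prior P(A): lung disease
p_not_a = 0.9            # prior P(A'): common cold only
p_b_given_a = 0.95       # sensitivity P(B|A)
p_b_given_not_a = 0.10   # false-positive rate P(B|A')

# Law of total probability for the evidence P(B)
p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a

# Posterior P(A|B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_b, 3), round(p_a_given_b, 3))   # 0.185 and roughly 0.514
```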
Conditional probability with Naïve Bayes
• You can use the Naïve Bayes machine learning method, which was borrowed straight from the
statistics field, to predict the likelihood that an event will occur, given evidence defined in your
data features — something called conditional probability.
• Naïve Bayes, a probabilistic classification method, is especially useful if you need to
classify text data.
• This model is easy to build and is mostly used for large datasets. It is a probabilistic machine
learning model that is used for classification problems.
• The core of the classifier depends on the Bayes theorem with an assumption of independence
among predictors. That means changing the value of a feature doesn’t change the value of another
feature.
• Why is it called Naive?
• It is called Naive because of the assumption that 2 variables are independent when they may not
be. In a real-world scenario, there is hardly any situation where the features are independent.
• Conditional probability is defined as the likelihood of an event or outcome occurring, based on the
occurrence of a previous event or outcome. The joint probability of the two events is calculated by
multiplying the probability of the preceding event by the conditional probability of the succeeding event.
• A conditional probability would look at such events in relationship with one another.
• Conditional probability is thus the likelihood of an event or outcome occurring based on the occurrence of
some other event or prior outcome.
• Two events are said to be independent if one event occurring does not affect the probability that the other
event will occur.
• However, if one event occurring or not does, in fact, affect the probability that the other event will occur,
the two events are said to be dependent. If events are independent, then the probability of some event B is
not contingent on what happens with event A.
• A conditional probability, therefore, relates to those events that are dependent on one another.
• Conditional probability is often portrayed as the "probability of A given B," notated as P(A|B).
• The joint probability P(A and B) is calculated by multiplying the probability of the preceding event by
the conditional probability of the succeeding event; rearranging this relationship gives the conditional
probability formula shown below.
• Four candidates A, B, C, and D are running for a political office. Each
has an equal chance of winning: 25%. However, if candidate A drops
out of the race due to ill health, the probability will change: P(Win |
One candidate drops out) = 33.33%.
The formula for conditional probability is:
P(B|A) = P(A and B) / P(A)
which you can also rewrite as:
P(B|A) = P(A∩B) / P(A)
Example:-
• In a group of 100 sports car buyers, 40 bought alarm systems, 30
purchased bucket seats, and 20 purchased an alarm system and bucket
seats. If a car buyer chosen at random bought an alarm system, what is
the probability they also bought bucket seats?
• Step 1: Figure out P(A). It’s given in the question as 40%, or 0.4.
• Step 2: Figure out P(A∩B). This is the intersection of A and B: both
happening together. It’s given in the question 20 out of 100 buyers, or
0.2.
• Step 3: Insert your answers into the formula:
P(B|A) = P(A∩B) / P(A) = 0.2 / 0.4 = 0.5
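The same steps can be written as a short Python snippet; this just restates the counts from the example.

```python
# Conditional probability from counts: P(B|A) = P(A and B) / P(A)
total_buyers = 100
alarm = 40            # bought an alarm system (event A)
alarm_and_seats = 20  # bought both an alarm system and bucket seats (A and B)

p_a = alarm / total_buyers
p_a_and_b = alarm_and_seats / total_buyers
p_b_given_a = p_a_and_b / p_a
print(p_b_given_a)    # 0.5
```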
What is Naive Bayes?
• Bayes’ rule provides us with the formula for the probability of Y given some
feature X. In real-world problems, we hardly find any case where there is only
one feature.
• When there are multiple features, we can extend Bayes' rule to what is called
Naive Bayes, which assumes that the features are independent. That means
changing the value of one feature doesn't influence the values of the other features,
and this is why we call this algorithm "NAIVE".
• Naive Bayes can be used for various things like face recognition, weather
prediction, Medical Diagnosis, News classification, Sentiment Analysis, and a lot
more.
• When there are multiple X variables, we simplify it by assuming that X’s are
independent, so
1.Convert the given dataset into frequency tables.
2.Generate Likelihood table by finding the probabilities of given
features.
3.Now, use Bayes theorem to calculate the posterior probability.
For n features X1, X2, …, Xn, the Naive Bayes formula becomes:
P(y | X1, …, Xn) ∝ P(y) × P(X1 | y) × P(X2 | y) × … × P(Xn | y)
Naive Bayes Example
• Let’s take a dataset to predict whether we can pet an animal or not.
Assumptions of Naive Bayes
• All the variables are independent. That is, knowing the animal is a Dog tells us
nothing about whether its Size will be Medium.
• All the predictors have an equal effect on the outcome. That is, the
animal being dog does not have more importance in deciding If we
can pet him or not. All the features have equal importance.
• We can now apply the Naive Bayes formula to the above dataset; however, before
that, we need to do some precomputations on our dataset.
• We also need the probabilities (P(y)), which are calculated in the table
below. For example, P(Pet Animal = NO) = 6/14.
• Now if we send our test data, suppose test = (Cow, Medium, Black)
Probability of petting an animal :
We see here that P(Yes|Test) > P(No|Test), so the prediction that we can pet this animal
is “Yes”.
Types of Naïve Bayes
• Naïve Bayes comes in these three popular flavors:
• »»MultinomialNB: Use this version if your variables (categorical or continuous) describe discrete
frequency counts, like word counts.
• This version of Naïve Bayes assumes a multinomial distribution, as is often the case with text data.
• It does not accept negative values.
• »»BernoulliNB: If your features are binary, you use Bernoulli Naïve Bayes to make
predictions.
• This version works for classifying text data, but isn’t generally known to perform as well as
MultinomialNB.
• If you want to use BernoulliNB to make predictions from continuous variables, that will work, but
you first need to sub-divide them into discrete interval groupings (also known as binning).
• »»GaussianNB: Use this version if all predictive features are normally distributed. It’s not a good
option for classifying text data, but it can be a good choice if your data contains both positive and
negative values (and if your features have a normal distribution, of course).
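As a rough illustration (assuming scikit-learn is installed; the tiny datasets below are invented for the example), all three flavours share the same fit/predict interface:

```python
# Minimal sketch of the three Naive Bayes flavours in scikit-learn
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# GaussianNB: continuous, roughly normally distributed features
X_cont = np.array([[5.1, 3.5], [4.9, 3.0], [6.7, 3.1], [6.3, 2.9]])
y = np.array([0, 0, 1, 1])
print(GaussianNB().fit(X_cont, y).predict([[5.0, 3.4]]))

# MultinomialNB: non-negative counts (e.g., word counts)
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 2], [0, 3, 3]])
print(MultinomialNB().fit(X_counts, y).predict([[1, 0, 0]]))

# BernoulliNB: binary (0/1) features
X_bin = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 0]]))
```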
Statistics Basics:-
• In general, you use statistics in decision making. Statistics come in two flavours:
• Descriptive: Descriptive statistics provide a description that illuminates some
characteristic of a numerical dataset, including dataset distribution, central
tendency (such as mean, min, or max), and dispersion (as in standard deviation
and variance).
• Inferential: Rather than focus on pertinent descriptions of a dataset, inferential
statistics carve out a smaller section of the dataset and attempt to deduce
significant information about the larger dataset.
• Use this type of statistics to get information about a real-world measure in which
you’re interested.
Descriptive Statistics
• descriptive statistics describe the characteristics of a numerical dataset, but that
doesn’t tell you why you should care.
• most data scientists are interested in descriptive statistics only because of what
they reveal about the real-world measures they describe.
• For example, a descriptive statistic is often associated with a degree of accuracy,
indicating the statistic’s value as an estimate of the real-world measure.
• You can use descriptive statistics in many ways — to detect outliers, for example,
or to plan for feature pre-processing requirements or to quickly identify what
features you may want, or not want, to use in an analysis.
Statistic            Class value
Mean                 79.18
Range                66.21 – 96.53
Proportion >= 70     86.7%
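Summaries like the ones in the table above can be computed in a few lines of Python; the class scores below are invented, since the underlying dataset is not shown.

```python
# Descriptive statistics for a hypothetical list of class scores
import statistics

scores = [66.21, 72.4, 75.0, 79.5, 81.2, 84.3, 88.9, 90.1, 96.53]  # invented data

mean = statistics.mean(scores)
low, high = min(scores), max(scores)
prop_70_plus = sum(s >= 70 for s in scores) / len(scores)

print(f"Mean: {mean:.2f}")
print(f"Range: {low} - {high}")
print(f"Proportion >= 70: {prop_70_plus:.1%}")
```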
Inferential Statistics
• inferential statistics are used to reveal something about a real-world measure.
• Inferential statistics do this by providing information about a small data selection,
so you can use this information to infer something about the larger dataset from
which it was taken.
• In statistics, this smaller data selection is known as a sample, and the larger,
complete dataset from which the sample is taken is called the population.
• If your dataset is too big to analyse in its entirety, pull a smaller sample of this
dataset, analyse it, and then make inferences about the entire dataset based on
what you learn from analysing the sample.
• You can also use inferential statistics in situations where you simply can’t afford
to collect data for the entire population.
• In this case, you’d use the data you do have to make inferences about the
population at large.
• At other times, you may find yourself in situations where complete information
for the population is not available. In these cases, you can use inferential statistics
to estimate values for the missing data based on what you learn from analysing the
data that is available.
• For an inference to be valid, you must select your sample carefully so that you get
a true representation of the population.
• Even if your sample is representative, the numbers in the sample dataset will
always exhibit some noise — random variation, in other words — that guarantees
the sample statistic is not exactly identical to its corresponding population
statistic.
Quantifying Correlation
• Many statistical and machine learning methods assume that your features are independent.
• To test whether they’re independent, though, you need to evaluate their correlation — the extent
to which variables demonstrate interdependency.
• We will have brief introduction to Pearson correlation and Spearman’s rank correlation.
• Correlation is used to test relationships between quantitative variables or categorical variables. In
other words, it's a measure of how things are related. The study of how variables are correlated
is called correlation analysis.
• Some examples of data that have a high correlation: Your caloric intake and your weight.
• Correlation means to find out the association between two variables, and correlation
coefficients are used to find out how strong the relationship between the two variables is. The most
popular correlation coefficient is Pearson's Correlation Coefficient. It is very commonly used in
linear regression.
• Correlation is quantified per the value of a variable called r, which
ranges between –1 and 1.
• The closer the r-value is to 1 or –1, the more correlation there is
between two variables.
• If two variables have an r-value that’s close to 0, it could indicate that
they’re independent variables.
Calculating correlation with Pearson’s r
• If you want to uncover dependent relationships between continuous variables in
a dataset, you’d use statistics to estimate their correlation.
• The simplest form of correlation analysis is the Pearson correlation, which
assumes that
• Your data is normally distributed.
• You have continuous, numeric variables.
• Your variables are linearly related.
• Because the Pearson correlation has so many conditions, only use it to determine
whether a relationship between two variables exists, but not to rule out possible
relationships.
• If you were to get an r-value that is close to 0, it indicates that there is no linear
relationship between the variables, but that a nonlinear relationship between them
still could exist.
• Consider the example of car price detection where we have to detect
the price considering all the variables that affect the price of the car
such as carlength, curbweight, carheight, carwidth, fueltype, carbody,
horsepower, etc.
• We can see in the scatterplot that as carlength, curbweight, and carwidth
increase, the price of the car also increases.
• So, we can say that there is a positive correlation between these three variables
and car price.
• Here, we also see that there is no correlation between the carheight
and car price.
• To find the Pearson coefficient, also referred to as the Pearson correlation
coefficient or the Pearson product-moment correlation coefficient, the two
variables are placed on a scatter plot. The variables are denoted as X and Y.
• There must be some linearity for the coefficient to be calculated; a scatter
plot not depicting any resemblance to a linear relationship will be useless.
• The closer the resemblance to a straight line of the scatter plot, the higher
the strength of association.
• Numerically, the Pearson coefficient is represented the same way as a
correlation coefficient that is used in linear regression, ranging from -1 to
+1.
Formula:-
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σ(xᵢ − x̄)² × Σ(yᵢ − ȳ)² ]
Find the value of the correlation
coefficient from the following table:
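Since the table referred to above is not reproduced here, the following sketch computes Pearson's r for a small invented dataset using NumPy:

```python
# Pearson correlation coefficient for two invented variables
import numpy as np

x = np.array([43, 21, 25, 42, 57, 59])   # hypothetical values
y = np.array([99, 65, 79, 75, 87, 81])

r = np.corrcoef(x, y)[0, 1]              # off-diagonal entry of the 2x2 matrix
print(round(r, 3))                       # roughly 0.53 for this made-up data
```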
• let’s first define what a dimension is. Given a matrix A, the dimension
of the matrix is the number of rows by the number of columns. If A
has 3 rows and 5 columns, A would be a 3x5 matrix.
• Now, in the simplest terms, dimensionality reduction is exactly
what it sounds like: you are reducing the dimension of a matrix to
something smaller than it currently is.
• Given a square (n by n) matrix A, the goal would be to reduce the
dimension of this matrix to be smaller than n x n.
• Current Dimension of A : n
Reduced Dimension of A : n - x, where x is some positive integer
• the most common application would be for data visualization
purposes. It’s quite difficult to visualize something graphically which
is in a dimension space greater than 3.
• Through dimensionality reduction, you’ll be able to transform your
dataset of 1000s of rows and columns into one small enough to
visualize in 3 / 2 / 1 dimensions.
What is dimensionality Reduction?
• As data generation and collection keeps increasing, visualizing it and
drawing inferences becomes more and more challenging.
• One of the most common ways of doing visualization is through
charts.
• Suppose we have 2 variables, Age and Height. We can use a scatter or
line plot between Age and Height and visualize their relationship
easily:
• Now consider a case in which we have, say 100 variables (p=100).
• It does not make much sense to visualize each of them separately.
• In such cases where we have a large number of variables, it is better to
select a subset of these variables (k << 100) which captures as much
information as the original set of variables.
• We can reduce the p dimensions of the data to a subset of k dimensions
(k << p). This is called dimensionality reduction.
Benefits of Dimensionality Reduction
• While SVD can be used for dimensionality reduction, it is often used in digital
signal processing for noise reduction, image compression, and other areas.
Eigenvector and Eigenvalue:-
• In simple terms, an eigenvalue is a special number associated with a
particular vector that represents how that vector stretches or shrinks
when a linear transformation is applied to it.
• Imagine you have a square, and you want to transform it by stretching
or shrinking it in different directions. A linear transformation can be
represented by a matrix.
• Now, the eigenvalue (often denoted by λ) is a number that tells you
how much the vector v gets scaled during this transformation.
Eigenvector and Eigenvalue:-
• Eigenvectors and eigenvalues have many important applications in computer
vision and machine learning in general.
• Well known examples are PCA (Principal Component Analysis) for
dimensionality reduction or EigenFaces for face recognition.
• An eigenvector is a vector whose direction remains unchanged when a linear
transformation is applied to it.
• To conceptualize an eigenvector, think of a matrix called A. Now consider a
nonzero vector called x and that Ax = λx for a scalar λ.
• In this scenario, scalar λ is what's called an eigenvalue of matrix A.
• The eigenvalue λ is permitted to take on a value of 0.
• Furthermore, x is the eigenvector that corresponds to λ, and unlike λ, it is not
permitted to be the zero vector.
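As a small sketch (the matrix below is chosen arbitrarily for illustration), NumPy can verify the relation Ax = λx directly:

```python
# Eigenvalues and eigenvectors with NumPy: check that A @ x == lambda * x
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])                    # arbitrary example matrix

eigenvalues, eigenvectors = np.linalg.eig(A)

for i in range(len(eigenvalues)):
    lam = eigenvalues[i]
    x = eigenvectors[:, i]                    # eigenvectors are the columns
    print(lam, np.allclose(A @ x, lam * x))   # True for each pair
```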
SVD(Singular Value Decomposition)
• The SVD linear algebra method decomposes the data matrix into the three
resultant matrices shown in Figure .
• The product of these matrices, when multiplied together, gives you back your
original matrix.
• SVD is handy when you want to remove redundant information by compressing
your dataset.
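A minimal NumPy sketch (using an arbitrary matrix) shows the decomposition and confirms that multiplying the three factors reconstructs the original matrix:

```python
# Singular Value Decomposition with NumPy: A = U * S * V^T
import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])              # arbitrary 2x3 example matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)                                # singular values as a diagonal matrix

print(np.allclose(A, U @ S @ Vt))             # True: the product gives back A
```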
• A Linear Regression model’s main aim is to find the best fit linear
line and the optimal values of intercept and coefficients such that
the error is minimized.
The above graph presents the linear relationship between the output(y) variable and predictor(X)
variables. The blue line is referred to as the best fit straight line. Based on the given data points, we
attempt to plot a line that fits the points the best.
• Before using linear regression, though, make sure you’ve considered
its limitations:
• Linear regression only works with numerical variables, not categorical ones.
• If your dataset has missing values, it will cause problems. Be sure to address
your missing values before attempting to build a linear regression model.
• If your data has outliers present, your model will produce inaccurate results.
• Check for outliers before proceeding.
• The linear regression assumes that there is a linear relationship
between dataset features and the target variable. Test to make sure this
is the case, and if it’s not, try using a log transformation to
compensate.
• The linear regression model assumes that all features are independent
of each other.
• Prediction errors, or residuals, should be normally distributed.
• you should have at least 20 observations per predictive feature if you
expect to generate reliable results using linear regression.
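With those limitations in mind, a minimal sketch of fitting a linear regression (assuming scikit-learn, with a small invented single-feature dataset) looks like this:

```python
# Simple linear regression sketch with scikit-learn
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: X is the predictor, y is the target
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # estimated intercept and slope
print(model.predict([[6]]))            # predicted y for a new X value
```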
Logistic regression
• Logistic Regression is a “Supervised machine learning” algorithm that can be used to model the
probability of a certain class or event. It is used when the data is linearly separable and the
outcome is binary in nature.
• Logistic regression is a machine learning method you can use to estimate values for a categorical
target variable based on your selected features.
• Your target variable should be numeric, and contain values that describe the target’s class — or
category.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome
must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or False, etc. but
instead of giving the exact value as 0 and 1, it gives the probabilistic values which lie between 0
and 1.
• Logistic Regression is very similar to Linear Regression except in how it is used.
Linear Regression is used for solving regression problems, whereas Logistic Regression is used
for solving classification problems.
• In Logistic regression, instead of fitting a regression line, we fit an "S"
shaped logistic function, which predicts two maximum values (0 or 1).
• The curve from the logistic function indicates the likelihood of something
such as whether the cells are cancerous or not, a mouse is obese or not
based on its weight, etc.
• Logistic Regression is a significant machine learning algorithm because it
has the ability to provide probabilities and classify new data using
continuous and discrete datasets.
• Logistic Regression can be used to classify the observations using
different types of data and can easily determine the most effective
variables used for the classification.
Logistic Function (Sigmoid Function):
• Logistic regression starts from the straight-line equation of linear regression:
y = b0 + b1x1 + b2x2 + … + bnxn.
• In Logistic Regression, y can be between 0 and 1 only, so for this let's divide the
above equation by (1 − y), which gives the odds y / (1 − y), ranging from 0 to +[infinity].
• But we need a range between −[infinity] and +[infinity], so take the logarithm of the
equation and it will become:
log[ y / (1 − y) ] = b0 + b1x1 + b2x2 + … + bnxn
• One cool thing about logistic regression is that, in addition to predicting the class of observations
in your target variable, it indicates the probability for each of its estimates. Though logistic
regression is like linear regression, its requirements are simpler, in that:
• There does not need to be a linear relationship between the features and target variable.
• Residuals don’t have to be normally distributed.
• Predictive features are not required to have a normal distribution.
• When deciding whether logistic regression is a good choice for you, make sure to consider the
following limitations:
• Missing values should be treated or removed.
• Your target variable must be binary or ordinal.
• Predictive features should be independent of each other.
• Logistic regression requires a greater number of observations (than linear regression) to produce a
reliable result.
• The rule of thumb is that you should have at least 50 observations per predictive feature if you
expect to generate reliable results.
Example:-
• Let us consider a problem where we are given a dataset containing
Height and Weight for a group of people.
• Our task is to predict the Weight for new entries in the Height column.
• So we can figure out that this is a regression problem where we will
build a Linear Regression model.
• We will train the model with provided Height and Weight values.
• Once the model is trained we can predict Weight for a given unknown
Height value.
• Now suppose we have an additional field Obesity and we have to classify whether a person is
obese or not depending on their provided height and weight.
• This is clearly a classification problem where we have to segregate the dataset into two classes
(Obese and Not-Obese).
• So, for the new problem, we can again follow the Linear Regression steps and build a regression
line.
• This time, the line will be based on two parameters, Height and Weight, and the regression line will
fit between two discrete sets of values.
• As this regression line is highly susceptible to outliers, it will not do a good job in classifying two
classes.
• To get a better classification, we will find probability for each output value from the regression
line.
• Now based on a predefined threshold value, we can easily classify the output into two classes
Obese or Not-Obese.
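A rough sketch of this classification workflow with scikit-learn (the height, weight, and obesity values below are invented purely for illustration):

```python
# Logistic regression sketch: classify Obese / Not-Obese from height and weight
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented training data: [height_cm, weight_kg] and 0 = Not-Obese, 1 = Obese
X = np.array([[170, 60], [165, 55], [180, 72], [160, 85], [175, 110], [168, 95]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

new_person = np.array([[172, 100]])
print(clf.predict(new_person))         # predicted class (0 or 1)
print(clf.predict_proba(new_person))   # probabilities for each class
```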
Ordinary least squares (OLS) regression
methods
• Ordinary Least Squares regression (OLS) is a common technique for estimating coefficients of
linear regression equations which describe the relationship between one or more independent
quantitative variables and a dependent variable (simple or multiple linear regression).
• Least squares stand for the minimum squares error (SSE). Maximum likelihood and Generalized
method of moments estimator are alternative approaches to OLS.
• Example: We want to predict the height of plants depending on the number of days they have spent
in the sun. Before getting exposure, they are 30 cm. A plant grows 1 mm (0.1 cm) after being
exposed to the sun for a day.
• Y is the height of the plants
• X is the number of days spent in the sun
• β0 is 30 because it is the value of Y when X is 0.
• β1 is 0.1 because it is the coefficient multiplied by the number of days.
• A plant being exposed 5 days to the sun has therefore an estimated height of Y = 30 + 0.1*5 = 30.5
cm.
How do ordinary least squares (OLS) work?
• OLS estimates the coefficients by choosing the values that minimize the sum of squared
errors (SSE): the sum of the squared differences between the observed values of Y and the
values predicted by the regression line.
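To see this in practice, here is a minimal NumPy sketch that recovers coefficients close to the plant example's β0 = 30 and β1 = 0.1 from noisy, invented measurements; np.polyfit is used here simply as a convenient least-squares fitter.

```python
# OLS sketch: estimate intercept (b0) and slope (b1) from noisy plant heights
import numpy as np

days = np.array([0, 1, 2, 3, 4, 5, 6, 7])                  # days in the sun
height = 30 + 0.1 * days + np.random.normal(0, 0.05, 8)    # invented noisy data

b1, b0 = np.polyfit(days, height, deg=1)   # least-squares fit of a straight line
print(round(b0, 2), round(b1, 3))          # roughly 30 and 0.1
```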
Detecting outliers with the interquartile range (IQR)
• Consider the following dataset:
26 37 24 28 35 22 31 53 41 64 29
Step 1: Sort your data from low to high
First, you’ll simply sort your data in ascending order.
22 24 26 28 29 31 35 37 41 53 64
Step 2: Identify the median, the first quartile (Q1), and the third quartile (Q3)
The median is the value exactly in the middle of your dataset when all values are ordered from low to
high.
Since you have 11 values, the median is the 6th value. The median value is 31.
22 24 26 28 29 31 35 37 41 53 64
• Next, we’ll use the exclusive method for identifying Q1 and Q3. This
means we remove the median from our calculations.
• The Q1 is the value in the middle of the first half of your dataset,
excluding the median. The first quartile value is 26.
22 24 26 28 29
Your Q3 value is in the middle of the second half of your dataset, excluding the median. The third
quartile value is 41.
35 37 41 53 64
Calculate your IQR
The IQR is the range of the middle half of your dataset. Subtract Q1 from Q3 to calculate the IQR.
Formula: IQR = Q3 – Q1
Calculation: Q1 = 26, Q3 = 41
IQR = 41 – 26
= 15
Calculate your upper fence
The upper fence is the boundary around the third quartile. It tells you that any values
exceeding the upper fence are outliers.
Formula: Upper fence = Q3 + (1.5 * IQR)
Calculation: Upper fence = 41 + (1.5 * 15)
= 41 + 22.5
= 63.5
Calculate your lower fence
The lower fence is the boundary around the first quartile. Any values less than the lower fence
are outliers.
Formula: Lower fence = Q1 – (1.5 * IQR)
Calculation: Lower fence = 26 – (1.5 * 15)
= 26 – 22.5
= 3.5
Use your fences to highlight any outliers
Go back to your sorted dataset from Step 1 and highlight any values that are greater than the upper
fence or less than your lower fence.
These are your outliers.
•Upper fence = 63.5
•Lower fence = 3.5
22 24 26 28 29 31 35 37 41 53 64
Here, 64 is greater than the upper fence of 63.5, so 64 is the outlier in this dataset.
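The same fence calculation can be sketched in Python; note that NumPy's default quartile interpolation differs from the exclusive method used above, so the quartiles and fences come out slightly different, though 64 is still flagged.

```python
# IQR-based outlier detection for the example dataset
import numpy as np

data = np.array([26, 37, 24, 28, 35, 22, 31, 53, 41, 64, 29])

q1, q3 = np.percentile(data, [25, 75])   # NumPy's default (linear) interpolation
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
print(lower_fence, upper_fence, outliers)
```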
Tukey Boxplot
• In comparison, a Tukey boxplot is a pretty easy way to spot outliers.
• A Tukey boxplot, also known as a box-and-whisker plot, is a graphical
representation of the distribution of a dataset.
• It provides a visual summary of key statistics, including the median, quartiles,
and potential outliers.
• The plot consists of a rectangular "box" that represents the interquartile range
(IQR) and "whiskers" that extend from the box to indicate the range of the
data. Outliers can also be displayed as individual points.
• Each boxplot has whiskers that are set at 1.5*IQR. Any values that lie beyond
these whiskers are outliers.
• Figure shows outliers as they appear within a Tukey boxplot.
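A boxplot like this can be produced with matplotlib (a minimal sketch; the whiskers default to 1.5 * IQR, so points beyond them are drawn as individual outlier markers):

```python
# Tukey box-and-whisker plot: points beyond 1.5*IQR show up as outlier markers
import matplotlib.pyplot as plt

data = [26, 37, 24, 28, 35, 22, 31, 53, 41, 64, 29]

plt.boxplot(data, whis=1.5)   # whiskers at 1.5 * IQR (the default)
plt.title("Tukey boxplot of the example dataset")
plt.ylabel("Value")
plt.show()
```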
Detecting outliers with multivariate analysis
• Sometimes outliers only show up within combinations of data points from disparate variables.
• These outliers really wreak havoc on machine learning algorithms, so it’s important to detect and
remove them.
• You can use multivariate analysis of outliers to do this.
• A multivariate approach to outlier detection involves considering two or more variables at a time
and inspecting them together for outliers.
• There are several methods you can use, including
• Scatter-plot matrix
• Boxplot
• Density-based spatial clustering of applications with noise (DBScan)
• Principal component analysis (PCA)
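As one example of the multivariate approach, the following sketch uses scikit-learn's DBSCAN on a small invented two-variable dataset; points labelled -1 are those the algorithm treats as noise, i.e., potential outliers.

```python
# Multivariate outlier detection sketch using DBSCAN (density-based clustering)
import numpy as np
from sklearn.cluster import DBSCAN

# Invented 2-D data: a tight cluster plus one point far away from it
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0], [1.1, 1.2],
              [1.0, 0.8], [8.0, 9.0]])

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)                # -1 marks points DBSCAN could not assign to a cluster
print(X[labels == -1])       # the candidate outlier(s)
```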
Introducing Time Series Analysis
• A time series is just a collection of data on attribute values over time.
• Time series analysis is performed to predict future instances of the measure based
on the past observational data.
• To forecast or predict future values from data in your dataset, use time series
techniques
• In time series the order of observations provides a source of additional
information that should be analysed and used in the prediction process
• Time series are typically assumed to be generated at regularly spaced interval of
time (e.g. daily temperature), and so are called regular time series.
• A Time-Series represents data ordered by time. The time unit can be
Years, Months, Weeks, Days, Hours, Minutes, or Seconds.
• A time series is a sequence of observations taken at successive, discrete
time intervals.
• Plotted in order, a time series looks like a running chart.
• Time Series Analysis (TSA) is used in different fields for time-based
predictions – like Weather Forecasting, Financial, Signal processing,
Engineering domain – Control Systems, Communications Systems.
• Since TSA deals with information that is produced in a particular temporal
sequence, it is distinct from spatial and other analyses.
• Time series can have one or more variables that change over time.
• If there is only one variable varying over time, we call it Univariate
time series.
• If there is more than one variable it is called Multivariate time series.
Time Series Analysis
• Time series analysis is a method of examining and interpreting data points collected over
time to identify patterns, trends, and make predictions. In simpler terms, it's about
understanding how a particular quantity changes over time.
• Here are the basic components of time series analysis explained in simple terms:
1.Time Series Data:
1. Time series data consists of observations or measurements taken at different points in time. For
example, daily stock prices, monthly temperature readings, or yearly sales figures.
2.Components of Time Series:
1. Trend: The long-term movement or pattern in the data. It shows whether the values are generally
increasing, decreasing, or staying constant over time.
2. Seasonality: Repeating patterns or cycles that occur at regular intervals. For instance, retail sales
might have a seasonal pattern, increasing during holiday seasons.
3. Random Fluctuations: Unpredictable and irregular variations that do not follow a specific pattern.
• Analysis Techniques:
• Descriptive Statistics: Simple statistics like mean, median, and
standard deviation to understand the central tendency and variability
of the data.
• Data Visualization: Plots and charts (e.g., line charts) to visually
inspect trends and patterns.
• Moving Averages: A technique to smooth out fluctuations and
highlight trends over time.
• Forecasting: Predicting future values based on historical data.
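For example, a moving average takes only a couple of lines with pandas (a minimal sketch on an invented daily series):

```python
# Moving average sketch: smooth an invented daily series with pandas
import pandas as pd

dates = pd.date_range("2024-01-01", periods=10, freq="D")
values = [10, 12, 11, 15, 14, 18, 17, 21, 20, 24]     # invented observations
series = pd.Series(values, index=dates)

rolling_mean = series.rolling(window=3).mean()        # 3-day moving average
print(rolling_mean)
```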
How to analyse Time Series?