Data Analytics Unit-2 PPT Notes
[Figure: example directed acyclic graph over nodes A, B, C, D]
A Bayesian network can be used for building models from data and expert opinions, and it consists of two parts:
Directed Acyclic Graph
Table of conditional probabilities.
Explanation Bayesian Belief Network
Directed Acyclic Graph: the pictorial representation of the events that occurred.
Node: hypothesis
Edge: conditional probability
Note: each node has at least two probabilities, and the probabilities are calculated on the basis of the node's parents.
Conditional Probability Table for P(Dog_Bark | Rain):

              R       ~R
  Bark       9/48    18/48
  ~Bark      3/48    18/48

[Figure: network with nodes Rain -> Dog_Bark -> Cat_Hide and their conditional probability tables]
Solved Example: Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and David and Sophia have both called Harry.
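As a reasoning step (the numerical CPT values come from the slides' tables, which are not reproduced here), the Bayesian network factorizes the joint probability of each node given its parents, so:
P(D, S, A, ~B, ~E) = P(D|A) · P(S|A) · P(A|~B, ~E) · P(~B) · P(~E)
where B = burglary, E = earthquake, A = alarm, D = David calls, S = Sophia calls.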
Bayesian Belief Network
[Diagram: Bayesian statistics builds on conditional probability and Bayes' theorem]
Conditional Probability
Conditional probability is the likelihood of an event or outcome occurring based on the occurrence of a previous event or outcome. It is calculated by multiplying the probability of the preceding event by the updated probability of the succeeding, or conditional, event.
The probability of occurrence of an event A when another event B related to A has already occurred is known as conditional probability. It is denoted by P(A|B).
Conditional Probability
[Figure: Venn diagram of 100 students; 40 students like apples (set A), 30 students like oranges (set B), and 20 students like both (A∩B)]

P(B) = 30/100 = 0.3
P(A∩B) = 20/100 = 0.2
P(A|B) = P(A∩B) / P(B) = 0.2 / 0.3 ≈ 0.67
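A quick check of this arithmetic in Python (a minimal sketch; the counts come from the example above):

    # P(A|B) = P(A ∩ B) / P(B), using the student counts from the example
    p_b = 30 / 100        # P(B): fraction who like oranges
    p_a_and_b = 20 / 100  # P(A ∩ B): fraction who like both
    p_a_given_b = p_a_and_b / p_b
    print(round(p_a_given_b, 2))  # 0.67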
Bayes Theorem
Bayes' theorem is a mathematical formula used to determine the conditional probability of events.
Bayes' theorem describes the probability of an event based on prior knowledge of the conditions that might be relevant to the event.
Introduced by: Thomas Bayes
Year: 1763
Posterior P(A|B) => probability of event A being true, given that event B has already occurred.
Likelihood P(B|A)=> Probability of the evidence given that the hypothesis is
True.
Prior P(A)=> Probability of hypothesis before considering the evidence.
Marginal P(B)=> Probability of evidence/Data.
Bayes Theorem
P(A|B): probability of hypothesis A, given evidence or data B.
P(B|A): probability of the evidence, given that the hypothesis is true.
P(A): probability of A.
P(B): probability of B.
P(A|B) = P(A∩B) / P(B)
P(B|A) = P(B∩A) / P(A)
LHS = RHS:
P(A|B)·P(B) = P(A∩B) = P(B∩A) = P(B|A)·P(A)
Hence:
P(A|B) = P(B|A)·P(A) / P(B)
Bayes Theorem
For example: calculate P(King|Face), the posterior probability that a drawn card is a king given that it is a face card.
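A worked step, assuming a standard 52-card deck in which the 4 kings are among the 12 face cards (so P(Face|King) = 1):
P(King|Face) = P(Face|King) · P(King) / P(Face) = (1 × 4/52) / (12/52) = 4/12 ≈ 0.33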
Step 3: Use Naive Bayes equation to calculate the posterior probability for
each class. The class with the highest posterior probability is the outcome of
prediction.
Problem: Players will play if the weather is rainy. Is this statement correct?
You can solve it using the method of posterior probability discussed above.
P(Yes | Rainy) = P(Rainy | Yes) × P(Yes) / P(Rainy)
Here, P(Rainy | Yes) = 2/9 ≈ 0.22, P(Rainy) = 5/14 ≈ 0.36, and P(Yes) = 9/14 ≈ 0.64.
Now, P(Yes | Rainy) = 0.22 × 0.64 / 0.36 ≈ 0.39. Since this probability is below 0.5, the statement is more likely incorrect.
Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in NLP problems like sentiment analysis, text classification, etc.
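A quick numeric check of this posterior in Python (a minimal sketch; the counts 2/9, 5/14 and 9/14 come from the weather-frequency table the example assumes):

    # Posterior check for the "play if rainy?" example
    p_rainy_given_yes = 2 / 9   # P(Rainy | Yes)
    p_yes = 9 / 14              # P(Yes)
    p_rainy = 5 / 14            # P(Rainy)

    p_yes_given_rainy = p_rainy_given_yes * p_yes / p_rainy
    print(round(p_yes_given_rainy, 2))  # ~0.39, below 0.5, so "No" is more likely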
Fuzzy
The word fuzzy refers to things which are not clear or are vague.
Any event, process, or function that is changing continuously
cannot always be defined as either true or false, which means that
we need to define such activities in a Fuzzy manner.
Fuzzy-Logic
Fuzzy Logic resembles the human decision-making methodology. It deals with vague and imprecise information. It is a deliberate simplification of real-world problems, based on degrees of truth rather than the usual true/false (1/0) of Boolean logic.
Take a look at the following diagram. It shows that in fuzzy systems, the
values are indicated by a number in the range from 0 to 1. Here 1.0
represents absolute truth and 0.0 represents absolute falseness. The number
which indicates the value in fuzzy systems is called the truth value.
Fuzzy Architecture
RULE BASE: It contains the set of rules and the IF-THEN conditions
provided by the experts to govern the decision-making system, on the basis
of linguistic information. Recent developments in fuzzy theory offer
several effective methods for the design and tuning of fuzzy controllers.
Most of these developments reduce the number of fuzzy rules.
FUZZIFICATION: It is used to convert inputs i.e. crisp numbers into
fuzzy sets. Crisp inputs are basically the exact inputs measured by sensors
and passed into the control system for processing, such as temperature,
pressure, rpm’s, etc.
INFERENCE ENGINE: It determines the matching degree of the current
fuzzy input with respect to each rule and decides which rules are to be
fired according to the input field. Next, the fired rules are combined to
form the control actions.
DEFUZZIFICATION: It is used to convert the fuzzy sets obtained by the
inference engine into a crisp value. There are several defuzzification
methods available and the best-suited one is used with a specific expert
system to reduce the error.
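A minimal sketch of the fuzzification and defuzzification stages (the triangular membership parameters, temperature value, and output points are illustrative assumptions, not from the slides):

    # Triangular membership: 0 outside [a, c], peaking at 1 when x == b
    def triangular(x, a, b, c):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

    # Fuzzification: a crisp temperature is mapped to degrees of membership
    temp = 22.0
    cold = triangular(temp, 0, 10, 20)
    warm = triangular(temp, 15, 22, 30)
    hot  = triangular(temp, 25, 35, 45)
    print(cold, warm, hot)

    # Centroid defuzzification: weighted average of candidate crisp outputs,
    # weighted by the firing strengths coming from the rule base
    points  = [10, 20, 30]
    weights = [cold, warm, hot]
    crisp = sum(p * w for p, w in zip(points, weights)) / sum(weights)
    print(crisp)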
U : All Students
G: Good Students
S: Bad Students
G = {(x, μG(x))}, where μG(x) denotes the degree of goodness of x.
G={(A, 0.9), (B, 0.7), (C, 0.1), (D, 0.3)}
S ={(A, 0.1), (B,0.3), (C, 0.9) (D, 0.7)}
Membership Function
Features of the Membership Function
1. Core:
The core of a membership function for some fuzzy set A is that region of the universe characterized by complete membership in the set A, i.e., those elements x of the universe such that μA(x) = 1.
2. Support:
a. The support of a membership function for some fuzzy set A is defined as that region of the universe that is characterized by nonzero membership in the set A.
b. The support comprises those elements x of the universe such that μA(x) > 0.
Features of the Membership Function
3. Boundaries :
The boundaries of a membership function for some fuzzy set are defined
as that region of the universe containing elements that have a non-zero
membership but not complete membership.
The boundaries comprise those elements x of the universe such that 0 < μA(x) < 1.
Benefits of Fuzzy Logic in Real Life
Tabu Search
Hill Climbing Algorithm
Simple hill climbing is the simplest way to implement a hill climbing algorithm. It evaluates only one neighbor node state at a time and selects the first one that improves the current cost, setting it as the current state. It checks only one successor state; if that successor is better than the current state, it moves there, otherwise it stays in the same state. This algorithm has the following features:
Less time consuming
Less optimal solution, and the solution is not guaranteed
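A minimal sketch of simple hill climbing on a one-dimensional objective (the function, starting point and step size are illustrative assumptions):

    # Maximize f(x) = -(x - 3)^2 by always taking the first improving neighbor
    def f(x):
        return -(x - 3) ** 2

    x, step = 0.0, 0.1
    while True:
        # Examine neighbors one at a time; move to the first improvement
        for neighbor in (x + step, x - step):
            if f(neighbor) > f(x):
                x = neighbor
                break
        else:
            break  # no neighbor improves: a (local) optimum is reached
    print(round(x, 1))  # ~3.0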
[Figure: neuron model with summation unit and axon]
Step 1: y = w1·x1 + w2·x2 + w3·x3 + … + wn·xn + bias = Σ(i=1 to n) wi·xi + bias
Step 2: z = φ(y)
Activation function: if the value is less than the threshold value, the neuron will not be activated; if it is greater, the neuron will be activated.
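A minimal sketch of this two-step computation (the weights, inputs, bias and threshold are illustrative assumptions):

    # Step 1: weighted sum; Step 2: threshold activation
    w = [0.5, -0.6, 0.2]
    x = [1.0, 0.0, 1.0]
    bias = 0.1

    y = sum(wi * xi for wi, xi in zip(w, x)) + bias
    threshold = 0.0
    z = 1 if y > threshold else 0   # the neuron fires only above the threshold
    print(round(y, 2), z)           # ~0.8, 1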
Multiple Layer Neural Network
An ANN comprises multiple hidden layers to train at a deeper level.
Multiple Layer Neural Network
A neural network consists of three layers.
Input Layer: The first layer is the input layer. It contains the input
neurons that send information to the hidden layer.
Hidden Layer: The hidden layer performs the computations on input data
and transfers the output to the output layer. It includes weight, activation
function, cost function.
Output Layer: Final processing output.
The connections between neurons carry weights, which are numerical values. The weights between neurons determine the learning ability of the neural network. During the learning of an artificial neural network, the weights between neurons change.
Training Algorithm of ANN
Gradient Descent Algorithm: the simplest training algorithm, used in the case of a supervised training model. If the actual output differs from the target output, the difference, or error, is computed. Gradient descent then adjusts the weights in such a manner as to minimize this error.
Back Propagation: an extension of gradient-based learning. Here, after the error is computed, it is propagated backward from the output layer to the input layer via the hidden layers. It is used in the case of multilayer neural networks.
Neural Network Architecture Type
1. Single Layer Perceptron Model
2. Radial Basis Function Neural Network
3. Multi-Layer Perceptron Neural Network
4. Recurrent Neural Network
5. Hopfield Neural Network
6. Boltzmann Machine Neural Network
Single Layer Perceptron
Single Layer Perceptron
The perceptron model, proposed by Frank Rosenblatt and later analyzed by Minsky and Papert, is one of the simplest and oldest models of the neuron. It is the smallest unit of a neural network that does certain computations to detect features or business intelligence in the input data. It accepts weighted inputs and applies the activation function to obtain the output as the final result. The perceptron is also known as a TLU (threshold logic unit).
The perceptron is a supervised learning algorithm that classifies data into two categories; thus it is a binary classifier. A perceptron separates the input space into two categories by a hyperplane, represented by the following equation:
w·x + b = 0
Advantages of Perceptron:
Perceptrons can implement Logic Gates like AND, OR, or NAND.
Disadvantages of Perceptron:
Perceptrons can only learn linearly separable problems, such as the Boolean AND problem. For non-linear problems, such as the Boolean XOR problem, they do not work.
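A minimal sketch of perceptron learning on the AND gate (the learning rate and epoch count are illustrative assumptions):

    # Perceptron learning rule: w += lr * (target - prediction) * x
    data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
    w, b, lr = [0.0, 0.0], 0.0, 0.1

    for _ in range(20):                       # epochs; AND converges quickly
        for x, target in data:
            y = 1 if w[0]*x[0] + w[1]*x[1] + b > 0 else 0
            err = target - y
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err

    for x, target in data:
        pred = 1 if w[0]*x[0] + w[1]*x[1] + b > 0 else 0
        print(x, pred, target)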
Radial Basis Function Network
A Radial Basis Function Network, or RBFN for short, is a form of neural
network that relies on the integration of the Radial Basis Function and is
specialized for tasks involving non-linear classification.
[Figure: an RNN (green box) with feedback arrows, shown both rolled and unrolled over time steps t]
The green box represents a neural network. The arrows indicate memory, or simply feedback to the next input.
The first figure shows the RNN; the second figure shows the same RNN unrolled in time.
Applications:
Generating Text
Machine Translation
Speech Recognition
Generating image description
Process of RNN
The three sets of parameters (U, V, and W) are used to apply linear transformations over their respective inputs:
Parameter U transforms the input xt to the state st
Parameter W transforms the previous state st-1 to the current state st
And parameter V maps the computed internal state st to the output Ot
Formula to calculate the current state:
ht = f(ht-1, xt)
Here, ht is the current state, ht-1 is the previous state, and xt is the current input.
The equation after applying the activation function (tanh) is:
ht = tanh(Whh·ht-1 + Wxh·xt)
Here, Whh is the weight at the recurrent neuron and Wxh is the weight at the input neuron.
After calculating the final state, we can then produce the output.
The output state can be calculated as:
Ot = Why·ht
Here, Ot is the output state, Why is the weight at the output layer, and ht is the current state.
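A minimal sketch of one forward pass through these equations (the weights and input sequence are illustrative scalar assumptions; in practice Whh, Wxh and Why are matrices):

    import math

    w_hh, w_xh, w_hy = 0.5, 0.8, 1.2   # recurrent, input, output weights
    h = 0.0                            # initial state h0

    for x_t in [1.0, 0.5, -0.3]:       # input sequence x1, x2, x3
        h = math.tanh(w_hh * h + w_xh * x_t)   # ht = tanh(Whh*h(t-1) + Wxh*xt)
        o = w_hy * h                           # Ot = Why * ht
        print(round(h, 3), round(o, 3))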
Hopfield Neural Network
The Hopfield network is a form of recurrent artificial neural network, invented by John Hopfield.
It serves as a content-addressable memory system with binary threshold units.
Binary neurons with discrete time, updated one at a time:
Vj(t+1) = 1 if Σk Tjk·Vk(t) + Ij > 0, and 0 otherwise
Graded neurons with continuous time:
dxj/dt = −xj/τ + Σk Tjk·g(xk) + Ij
Here, Vj denotes the activity of the j-th neuron.
xj is the mean internal potential of the neuron.
Ij is direct input (e.g., sensory input or bias current) to the neuron.
Tjk is the strength of synaptic input from neuron k to neuron j .
g is a monotone function that converts internal potential into firing rate output of the
neuron, i.e., Vj=g(xj) .
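A minimal sketch of the binary update rule with asynchronous updates (the weight matrix below is an illustrative assumption, not a trained memory):

    # Binary Hopfield update: V[j] <- 1 if sum_k T[j][k]*V[k] + I[j] > 0 else 0
    T = [[0, 1, -1],
         [1, 0, 1],
         [-1, 1, 0]]      # symmetric weights, zero diagonal
    I = [0.0, 0.0, 0.0]   # direct inputs / bias currents
    V = [1, 0, 1]         # initial binary state

    for _ in range(5):                 # a few asynchronous sweeps
        for j in range(len(V)):        # update one neuron at a time
            s = sum(T[j][k] * V[k] for k in range(len(V))) + I[j]
            V[j] = 1 if s > 0 else 0
    print(V)                           # settles to a stable state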
Boltzmann Network
The Boltzmann Machine is an unsupervised DL model in which every node is connected to every other node. That is, unlike ANNs, CNNs, RNNs and SOMs, Boltzmann Machines are undirected (the connections are bidirectional). A Boltzmann Machine is not a deterministic DL model but a stochastic, or generative, DL model. It is rather a representation of a certain system.
There are two types of nodes in the Boltzmann Machine: visible nodes, which we can and do measure, and hidden nodes, which we cannot or do not measure. Although the node types are different, the Boltzmann Machine considers them the same, and everything works as one single system. The training data is fed into the Boltzmann Machine, and the weights of the system are adjusted accordingly. Boltzmann Machines help us understand abnormalities by learning how the system works under normal conditions.
Activation Function
An activation function decides whether a neuron should be activated or not by calculating the weighted sum and further adding a bias to it. The purpose of the activation function is to introduce non-linearity into the output of a neuron.
Explanation: We know that a neural network has neurons that work in correspondence with their weights, biases and respective activation functions. In a neural network, we update the weights and biases of the neurons on the basis of the error at the output. This process is known as back-propagation. Activation functions make back-propagation possible, since the gradients are supplied along with the error to update the weights and biases.
Why do we need non-linear activation functions? A neural network without an activation function is essentially just a linear regression model. The activation function performs the non-linear transformation of the input, making the network capable of learning and performing more complex tasks.
There are various types of activation functions; a minimal code sketch of several of them follows this list.
Step Function
Signum Function
Linear Function
ReLU Function
Leaky ReLU Function
Hyperbolic Tangent Function
Sigmoid Function
Softmax Function
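A minimal sketch of several of the listed functions, using only the standard library:

    import math

    def step(x):        return 1 if x >= 0 else 0
    def sigmoid(x):     return 1 / (1 + math.exp(-x))
    def tanh(x):        return math.tanh(x)
    def relu(x):        return max(0.0, x)
    def leaky_relu(x, alpha=0.01):  return x if x > 0 else alpha * x

    # Compare the functions on a negative and a positive input
    for f in (step, sigmoid, tanh, relu, leaky_relu):
        print(f.__name__, round(f(-1.5), 4), round(f(1.5), 4))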
Backpropagation
Backpropagation is the essence of neural network training. It is the method
of fine-tuning the weights of a neural network based on the error rate
obtained in the previous epoch (i.e., iteration). Proper tuning of the weights
allows you to reduce error rates and make the model reliable by increasing
its generalization.
Backpropagation in neural networks is short for "backward propagation of errors." It is a standard method of training artificial neural networks. This method helps calculate the gradient of the loss function with respect to all the weights in the network.
Most prominent advantages of Backpropagation are:
Backpropagation is fast, simple and easy to program
It has no parameters to tune apart from the number of inputs
It is a flexible method as it does not require prior knowledge about the network
It is a standard method that generally works well
It does not need any special mention of the features of the function to be learned.
Backpropagation Algorithm
Input:
D, a dataset consisting of the training tuples and their associated target
values.
L, the learning rate.
Network, a multi-layer feed-forward network.
Output:
A trained neural network.
Method:
1. Initialize all the weights and biases in the network.
2. While (training condition is not satisfied)
3. {
4. For each training tuple X in D
5. { // Propagate the input forward:
For each input layer unit j:
{
Oj = Ij;  // the output of an input unit is its actual input value
}
For each hidden or output layer unit j:
{
Ij = Σi wij·Oi + θj
where wij is the weight from unit i in the previous layer to unit j,
Oi is the output of unit i from the previous layer, and
θj is the bias of the current unit.
Oj = 1 / (1 + e^(−Ij))
}
// Backpropagate the errors:
For each unit j in the output layer:
{
Errj = Oj·(1 − Oj)·(Tj − Oj)
}
For each unit j in the hidden layers:
{
Errj = Oj·(1 − Oj)·Σk Errk·wjk
where wjk is the weight of the connection from unit j to unit k in the next layer, and
Errk is the error of unit k.
}
For each weight wij in the network:
{
Δwij = L·Errj·Oi  // weight increment, L is the learning rate
wij = wij + Δwij  // weight update
}
}
}
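Below is a minimal runnable sketch of this algorithm for a 2-2-1 sigmoid network trained on XOR; the architecture, learning rate, random seed and epoch count are illustrative assumptions, not from the slides:

    import math, random

    random.seed(1)
    def sig(x): return 1 / (1 + math.exp(-x))

    # 2 inputs -> 2 hidden units -> 1 output, with bias terms
    w_h = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
    b_h = [random.uniform(-1, 1) for _ in range(2)]
    w_o = [random.uniform(-1, 1) for _ in range(2)]
    b_o = random.uniform(-1, 1)
    data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
    L = 0.5  # learning rate

    for _ in range(10000):  # an unlucky initialization may need more epochs
        for x, t in data:
            # Forward: Ij = sum(wij*Oi) + bias, Oj = sigmoid(Ij)
            h = [sig(w_h[j][0]*x[0] + w_h[j][1]*x[1] + b_h[j]) for j in range(2)]
            o = sig(w_o[0]*h[0] + w_o[1]*h[1] + b_o)
            # Backpropagate the errors
            err_o = o * (1 - o) * (t - o)
            err_h = [h[j] * (1 - h[j]) * err_o * w_o[j] for j in range(2)]
            # Weight/bias updates: delta = L * Err * O
            for j in range(2):
                w_o[j] += L * err_o * h[j]
                for i in range(2):
                    w_h[j][i] += L * err_h[j] * x[i]
                b_h[j] += L * err_h[j]
            b_o += L * err_o

    for x, t in data:
        h = [sig(w_h[j][0]*x[0] + w_h[j][1]*x[1] + b_h[j]) for j in range(2)]
        print(x, round(sig(w_o[0]*h[0] + w_o[1]*h[1] + b_o), 2), t)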
The above graph presents the linear relationship between the output (y) variable and the predictor (X) variable. The blue line is referred to as the best-fit straight line. Based on the given data points, we attempt to plot the line that fits the points best.
Linear Regression
To calculate the best-fit line, linear regression uses the traditional slope-intercept form given below:
Yi = β0 + β1·Xi
Where Yi = dependent variable at the given value of the independent variable,
β0 = constant/intercept (the predicted value of y when x is 0),
β1 = slope or regression coefficient (how much we expect y to change as x increases),
Xi = independent variable (the variable we expect to influence the dependent variable y).
This algorithm explains the linear relationship between the dependent(output) variable
y and the independent(predictor) variable X using a straight line.
But how does linear regression find which line is the best fit?
The goal of the linear regression algorithm is to find the best values for β0 and β1 to produce the best-fit line. The best-fit line is the line that has the least error, which means the error between predicted values and actual values should be minimal.
You can use simple linear regression when you want to know:
1. How strong the relationship between two variables is.
2. The value of the dependent variable at a certain value of the independent variable.
Assumptions Linear Regression
1. Homogeneity of variance: the size of the error in our prediction does not change significantly across the values of the independent variable.
2. Independence of observations: the observations in the dataset were collected using statistically valid sampling methods, and there are no hidden relationships among variables.
3. Normality: The data follows a normal distribution.
The goodness of fit is measured by the coefficient of determination:
R² = Σ(ŷi − ȳ)² / Σ(yi − ȳ)²
where ŷi is the predicted value and ȳ is the mean of the observed values y.
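A minimal sketch that fits β0 and β1 by the least-squares formulas and reports R² (the small dataset is an illustrative assumption):

    x = [1, 2, 3, 4, 5]
    y = [2, 4, 5, 4, 5]

    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # b1 = sum((xi-mx)(yi-my)) / sum((xi-mx)^2);  b0 = my - b1*mx
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    b0 = my - b1 * mx

    pred = [b0 + b1 * xi for xi in x]
    r2 = sum((p - my) ** 2 for p in pred) / sum((yi - my) ** 2 for yi in y)
    print(round(b0, 2), round(b1, 2), round(r2, 2))  # 2.2, 0.6, 0.6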
Linear Regression Solved Numerical
Multiple Linear Regression
Multiple linear regression (MLR), also known simply as multiple
regression, is a statistical technique that uses several explanatory variables
to predict the outcome of a response variable. The goal of multiple linear
regression is to model the linear relationship between the explanatory
(independent) variables and response (dependent) variables.
yi=β0+β1xi1+β2xi2+...+βpxip+ϵ
where, for i = 1, …, n observations:
yi = dependent variable
xi = explanatory (independent) variables
β0=y intercept (constant term)
βp=slope coefficients for each explanatory variable
ϵ=the model’s error term (also known as the residuals)
As the number of independent variables increases to 2, the graph becomes 3D. The added third dimension represents the other independent variable.
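A minimal sketch using least squares to fit two explanatory variables (the toy data is an illustrative assumption; numpy is assumed to be available):

    import numpy as np

    # Model: y = b0 + b1*x1 + b2*x2 + error
    X = np.array([[1, 1, 2],
                  [1, 2, 1],
                  [1, 3, 4],
                  [1, 4, 3],
                  [1, 5, 5]], dtype=float)   # leading 1s give the intercept b0
    y = np.array([6, 8, 15, 16, 21], dtype=float)

    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta)          # [b0, b1, b2]
    print(X @ beta)      # fitted values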
Multiple Linear Regression Numerical
Logistic Regression
Logistic Regression is a “Supervised machine learning” algorithm that can be
used to model the probability of a certain class or event. It is used when the data is
linearly separable and the outcome is binary in nature.
That means Logistic regression is usually used for Binary classification problems.
Logistic Regression: P(y = 1 | x) = 1 / (1 + e^−(β0 + β1x))
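A minimal sketch of using a fitted logistic model to turn a score into a class probability (the coefficients are illustrative assumptions, not fitted values):

    import math

    b0, b1 = -4.0, 1.5          # assumed coefficients
    def p_yes(x):
        return 1 / (1 + math.exp(-(b0 + b1 * x)))

    for x in (1, 3, 5):
        prob = p_yes(x)
        print(x, round(prob, 3), "class:", 1 if prob >= 0.5 else 0)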
Linear SVM
Linear SVM is used for linearly separable data: if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
Non-linear SVM
Non-Linear SVM is used for non-linearly separable data: if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Hyper-plane
There can be multiple lines/decision boundaries to segregate the classes in
n-dimensional space, but we need to find out the best decision boundary
that helps to classify the data points. This best boundary is known as the
hyper-plane of SVM.
The dimensions of the hyper-plane depend on the features present in the dataset: if there are 2 features (as shown in the image), then the hyper-plane will be a straight line, and if there are 3 features, then the hyper-plane will be a two-dimensional plane. We always create a hyper-plane that has the maximum margin, which means the maximum distance between the data points.
Support Vectors
The data points or vectors that are closest to the hyper-plane, and which affect the position of the hyper-plane, are termed support vectors.
By adding the third dimension, the sample space will become as below image:
Concept of SVM
So now, SVM will divide the datasets into classes in the following way. Consider
the below image:
Since we are in 3-D space, it looks like a plane parallel to the x-axis.
If we convert it to 2-D space with z = 1, it becomes:
Hence we get a circumference of radius 1 in case of non-linear data.
SVM Pros/Cons
Pros of SVM
SVM classifiers offer great accuracy and work well in high-dimensional spaces. SVM classifiers basically use a subset of the training points, and as a result they use very little memory.
[Figure: two classes separated by candidate hyperplanes A and B]
Scenario-2: Here we have 3 hyperplanes (A, B, and C) and all segregate the classes well. Now how can we identify the right hyperplane?
[Figure: hyperplanes A, B, and C with their margins]
Here, maximizing the distance between the nearest data points and the hyperplane will help us decide the right hyperplane.
In the above picture, the margin of hyperplane C is larger than the others, so C is selected as the best hyperplane.
How Do We Identify the Right Hyperplane?
Scenario-3: Here we have 2 hyperplanes (A and B). Use the rules discussed in the previous section to identify the right hyperplane.
SVM selects the hyperplane which classifies the classes accurately prior to maximizing the margin.
Hyperplane B has a classification error, while A has classified everything accurately. Therefore the right hyperplane is A.
How Do We Identify the Right Hyperplane?
Advantages of SVM
Disadvantages of SVM
How Do We Identify the Right Hyperplane?
Scenario-4: Below we are not able to segregate the two classes using a straight line, as one of the stars lies in the territory of the circles.
The star at the other end is like an outlier for the star class. The SVM algorithm has the feature of ignoring outliers and finding the hyperplane with the maximum margin. Hence we can say that SVM classification is robust to outliers.
Non-Linear SVM
When the data cannot be separated by a straight line in its original feature space, SVM maps it into a higher-dimensional space in which a linear separator exists; this mapping is done implicitly through a kernel function, described next.
Kernel Function
In machine learning, a kernel refers to a method that allows us to
apply linear classifiers to non-linear problems by mapping non-linear
data into a higher-dimensional space without the need to visit or
understand that higher-dimensional space.
Kernel Trick
Types of Kernel Function
Polynomial Kernel
RBF Kernel
Sigmoid Kernel
Polynomial Kernel
The polynomial kernel is a kernel function commonly used with support
vector machines and other kernelized models, that represents the similarity
of vectors in a feature space over polynomials of the original variables
It is popular in image processing.
K(xi, xj) = (xiᵀ·xj + 1)^d

RBF Kernel
K(x, y) = exp(−‖x − y‖² / (2σ²))
Hyperbolic Tangent Kernel
Mainly used in neural networks.
K(xi, xj) = tanh(k·xiᵀ·xj + c)
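A minimal sketch of the three kernels for plain Python lists (the degree d, width sigma, and constants k and c are illustrative assumptions):

    import math

    def dot(x, y):
        return sum(xi * yi for xi, yi in zip(x, y))

    def polynomial_kernel(x, y, d=2):
        return (dot(x, y) + 1) ** d

    def rbf_kernel(x, y, sigma=1.0):
        sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
        return math.exp(-sq_dist / (2 * sigma ** 2))

    def tanh_kernel(x, y, k=0.5, c=-1.0):
        return math.tanh(k * dot(x, y) + c)

    a, b = [1.0, 2.0], [2.0, 1.0]
    print(polynomial_kernel(a, b), rbf_kernel(a, b), tanh_kernel(a, b))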
Examples of time series data:
Weather data
Rainfall measurements
Temperature readings
Heart rate monitoring (EKG)
Brain monitoring (EEG)
Quarterly sales
Stock prices
Automated stock trading
Industry forecasts
Interest rates
Components for Time Series Analysis
Trend
1. Graphical Method
[Figure: yearly values plotted from 1999 to 2008 to show the trend; the 2008 value is 19.3]
Moving Average Method
[Figure: training data shown as a scatter of "+" and "–" points]
Procedure of Extracting Rules
Step 2: Rule development :
a. The objective in this step is to cover all “+” data points using classification
rules with none or as few “–” as possible.
b. For example, in Fig. 2.10.2 , rule r1 identifies the area of four “+” in the
top left corner.
c. Since this rule is based on simple logic operators in conjuncts, the boundary is rectilinear.
d. Once rule r1 is formed, the entire data points covered by r1 are eliminated
and the next best rule is found from data sets.
[Fig. 2.10.2: rule r1 covers the four "+" points in the top left corner of the scatter]
Procedure of Extracting Rules
Step 3 : Learn-One-Rule :
a. Each rule ri is grown by the learn-one-rule approach.
b. Each rule starts with an empty rule set and conjuncts are added one by one
to increase the rule accuracy.
c. Rule accuracy is the ratio of the number of "+" records covered by the rule to all records covered by the rule:
Rule accuracy A(ri) = (records correctly covered by the rule) / (all records covered by the rule)
d. Learn-one-rule starts with an empty rule set: if {} then class = “+”.
e. The accuracy of this rule is the same as the proportion of + data points in
the data set. Then the algorithm greedily adds conjuncts until the accuracy
reaches 100 %.
f. If the addition of a conjunct decreases the accuracy, then the algorithm
looks for other conjuncts or stops and starts the iteration of the next rule.
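A minimal sketch of this greedy learn-one-rule loop (the toy records and candidate conjuncts are illustrative assumptions):

    # Each record: (attributes, label); a rule is a list of (attr, value) conjuncts
    data = [({"outlook": "sunny", "windy": False}, "+"),
            ({"outlook": "sunny", "windy": True},  "+"),
            ({"outlook": "rainy", "windy": False}, "-"),
            ({"outlook": "rainy", "windy": True},  "-")]

    def covers(rule, rec):
        return all(rec[a] == v for a, v in rule)

    def accuracy(rule):
        covered = [lbl for rec, lbl in data if covers(rule, rec)]
        return covered.count("+") / len(covered) if covered else 0.0

    rule = []                      # start with the empty rule: if {} then "+"
    candidates = [("outlook", "sunny"), ("windy", False)]
    while accuracy(rule) < 1.0 and candidates:
        # greedily add the conjunct that most improves rule accuracy
        best = max(candidates, key=lambda c: accuracy(rule + [c]))
        if accuracy(rule + [best]) <= accuracy(rule):
            break                  # no conjunct helps: stop growing this rule
        rule.append(best)
        candidates.remove(best)
    print(rule, accuracy(rule))    # [('outlook', 'sunny')] 1.0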
Procedure of Extracting Rules
Step 4 : Next rule :
a. After a rule is developed, then all the data points covered by the
rule are eliminated from the data set.
b. The above steps are repeated for the next rule to cover the rest of the “+”
data points.
c. In Fig. 2.10.3, rule r2 is developed after the data points covered by r1
are eliminated.
[Fig. 2.10.3: rule r2 developed on the remaining points after those covered by r1 are eliminated]
Procedure of Extracting Rules
Step 5 : Development of rule set :
a. After the rule set is developed to identify all “+” data points, the rule
model is evaluated with a data set used for pruning to reduce generalization
errors.
b. The metric used to evaluate the need for pruning is (p – n)/(p + n), where p
is the number of positive records covered by the rule and n is the number of
negative records covered by the rule.
c. All rules to identify “+” data points are aggregated to form a rule
group.