Unit 4
A Bayesian network graph is made up of nodes and arcs (directed links), where:
o Each node corresponds to a random variable, which can be continuous or discrete.
o Arcs, or directed arrows, represent causal relationships or conditional probabilities between random variables. These directed links connect pairs of nodes in the graph. A link indicates that one node directly influences the other; if there is no directed link between two nodes, they are independent of each other.
o In the above diagram, A, B, C, and D are random variables represented by the nodes of the network graph.
o If node B is connected to node A by a directed arrow from A to B, then node A is called the parent of node B.
o Node C is independent of node A.
Bayesian networks are applied in many fields, for example disease diagnosis, optimized web search, spam filtering, gene regulatory networks, and more. The main objective of these networks is to understand the structure of causal relationships. To clarify this, consider a disease diagnosis problem. Given symptoms and the diseases they result in, we construct our belief network, and when a new patient arrives, we can infer which disease or diseases the patient may have by computing probabilities for each disease. Similar causal relations can be constructed for other problems, and inference techniques can be applied to obtain interesting results.
Mathematical Definition of Belief Networks
Probabilities in a belief network are calculated by the following formula:

P(x1, x2, ..., xn) = Π P(xi | Parents(Xi))

that is, the joint distribution is the product, over all variables, of each variable's probability conditioned on its parents in the network. As you can see from the formula, to calculate the joint distribution we need the conditional probabilities specified by the network. Moreover, once we have the joint distribution, we can start asking interesting questions. For example, in the first example, we can ask for the probability of "RAIN" given that "SEASON" is "WINTER" and "DOG BARK" is "TRUE".
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm responds reliably to a burglary, but it also responds to minor earthquakes. Harry has two neighbors, David and Sophia, who have taken responsibility for informing Harry at work when they hear the alarm. David always calls Harry when he hears the alarm, but sometimes he confuses the phone ringing with the alarm and calls then too. On the other hand, Sophia likes to listen to loud music, so sometimes she fails to hear the alarm.
Problem:
Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and both David and Sophia have called Harry.
Solution:
o The Bayesian network for the above problem is given below. The network structure shows that Burglary and Earthquake are the parent nodes of Alarm and directly affect the probability of the alarm going off, whereas David's and Sophia's calls depend only on the Alarm node.
o The network thus encodes our assumptions that David and Sophia do not directly perceive the burglary, do not notice the minor earthquake, and do not confer with each other before calling.
o The conditional distribution for each node is given as a conditional probability table, or CPT.
o Each row in a CPT must sum to 1, because the entries in a row represent an exhaustive set of cases for the variable.
o In a CPT, a boolean variable with k boolean parents requires 2^k probabilities. Hence, if there are two parents, the CPT will contain 4 probability values.
List of all events occurring in this network:
o Burglary (B)
o Earthquake (E)
o Alarm (A)
o David Calls (D)
o Sophia Calls (S)
We can write the event of the problem statement in the form of a probability: P[D, S, A, ¬B, ¬E]. Using the joint-distribution formula, this can be rewritten as:

P[D, S, A, ¬B, ¬E] = P[D | A] · P[S | A] · P[A | ¬B, ¬E] · P[¬B] · P[¬E]

From the conditional probability tables:
P(B = False) = 0.998, the probability that no burglary occurred.
P(E = False) = 0.999, the probability that no earthquake occurred.
P(A = True | B = False, E = False) = 0.001, the probability that the alarm goes off with neither cause present.
The conditional probability that David calls depends on its parent node Alarm: P(D = True | A = True) = 0.91.
The conditional probability that Sophia calls likewise depends on its parent node Alarm: P(S = True | A = True) = 0.75.
From the formula for the joint distribution, we can therefore write the problem statement as:

P[D, S, A, ¬B, ¬E] = 0.91 × 0.75 × 0.001 × 0.998 × 0.999
= 0.00068045.
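To make the factorization concrete, here is a minimal Python sketch of the calculation above. The alarm CPT entries for the other parent combinations (0.94, 0.95, 0.31) and the call probabilities when the alarm is silent (0.05, 0.02) are assumptions standing in for the example's full probability tables, which are not reproduced here.

```python
# Sketch: computing the joint probability P(D, S, A, ~B, ~E) for the
# burglar-alarm network, using the CPT values from the worked example.

p_burglary = 0.002          # P(B = True)
p_earthquake = 0.001        # P(E = True)

# P(Alarm = True | Burglary, Earthquake), keyed by (B, E).
# Values other than (False, False) are assumed for illustration.
p_alarm = {
    (True, True): 0.94,
    (True, False): 0.95,
    (False, True): 0.31,
    (False, False): 0.001,
}

p_david_given_alarm = {True: 0.91, False: 0.05}   # P(D = True | A)
p_sophia_given_alarm = {True: 0.75, False: 0.02}  # P(S = True | A)

# The joint probability factorizes over the network structure:
# P(D, S, A, ~B, ~E) = P(D|A) * P(S|A) * P(A|~B,~E) * P(~B) * P(~E)
joint = (
    p_david_given_alarm[True]
    * p_sophia_given_alarm[True]
    * p_alarm[(False, False)]
    * (1 - p_burglary)
    * (1 - p_earthquake)
)
print(f"P(D, S, A, ~B, ~E) = {joint:.8f}")  # ~0.00068045
```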
Introduction To Machine Learning
Machine learning (ML) is the study of computer algorithms that improve automatically
through experience. It is seen as a subset of artificial intelligence. Machine learning
algorithms build a mathematical model based on sample data, known as "training data", in
order to make predictions or decisions without being explicitly programmed to do
so. Machine learning algorithms are used in a wide variety of applications, such as email
filtering and computer vision, where it is difficult or infeasible to develop conventional
algorithms to perform the needed tasks.
How does ML work?
Gathering past data in any form suitable for processing. The better the quality of the data, the more suitable it will be for modeling.
Data Processing – Sometimes the data collected is in raw form and needs to be pre-processed.
Example: Some tuples may have missing values for certain attributes, and in this case they have to be filled with suitable values in order to perform machine learning or any form of data mining.
Missing values for numerical attributes, such as the price of a house, may be replaced with the mean value of the attribute, whereas missing values for categorical attributes may be replaced with the attribute's mode (its most frequent value). This depends on the types of filters we use. If the data is in the form of text or images, then converting it to numerical form, be it a list, array, or matrix, will be required. Simply put, the data must be made relevant and consistent, and converted into a format the machine can understand.
Dividing the input data into training, cross-validation, and test sets. A commonly used ratio between the respective sets is 6:2:2 (see the sketch after this list).
Building models with suitable algorithms and techniques on the training set.
Testing our conceptualized model with data that was not fed to the model at training time, and evaluating its performance using metrics such as F1 score, precision, and recall.
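As a rough illustration of the last three steps, here is a minimal sketch using scikit-learn; the synthetic dataset and the choice of logistic regression are placeholders, not prescribed by the text.

```python
# Sketch: a 60/20/20 train / cross-validation / test split and evaluation
# with precision, recall, and F1, using scikit-learn on a toy dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

X, y = make_classification(n_samples=1000, random_state=0)

# First hold out 20% for the test set, then carve 20% of the total
# (25% of the remainder) out as the cross-validation set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# Evaluate on data the model never saw during training.
pred = model.predict(X_test)
print("precision:", precision_score(y_test, pred))
print("recall:   ", recall_score(y_test, pred))
print("F1 score: ", f1_score(y_test, pred))
```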
Supervised Learning
Supervised learning is the most popular paradigm for machine learning. It is the easiest to
understand and the simplest to implement. It is very similar to teaching a child with the use of
flash cards.
Given data in the form of examples with labels, we can feed a learning algorithm these
example-label pairs one by one, allowing the algorithm to predict the label for each example,
and giving it feedback as to whether it predicted the right answer or not. Over time, the
algorithm will learn to approximate the exact nature of the relationship between examples and
their labels. When fully-trained, the supervised learning algorithm will be able to observe a
new, never-before-seen example and predict a good label for it.
Supervised learning focuses on a singular task, feeding more and more examples to the algorithm until it can accurately perform on that task. This is the learning type that you will most likely encounter, as it appears in many common applications:
Advertisement Popularity: Selecting the ads that will perform well is often a supervised learning task. Many of the ads you see as you browse the internet are placed there because a learning algorithm said that they were of reasonable popularity (and clickability). Furthermore, the placement of an ad on a certain site, or alongside a certain query (if you find yourself using a search engine), is largely due to a learned algorithm saying that the matching between ad and placement will be effective.
Spam Classification: If you use a modern email system, chances are you’ve encountered
a spam filter. That spam filter is a supervised learning system. Fed email examples and
labels (spam/not spam), these systems learn how to preemptively filter out malicious
emails so that their user is not harassed by them. Many of these also behave in such a way
that a user can provide new labels to the system and it can learn user preference.
Face Recognition: Do you use Facebook? Most likely your face has been used in a
supervised learning algorithm that is trained to recognize your face. Having a system that
takes a photo, finds faces, and guesses who that is in the photo (suggesting a tag) is a
supervised process. It has multiple layers to it, finding faces and then identifying them, but it is a supervised process nonetheless.
Unsupervised Learning
Unsupervised learning is very much the opposite of supervised learning. It features no labels.
Instead, our algorithm would be fed a lot of data and given the tools to understand the
properties of the data. From there, it can learn to group, cluster, and/or organize the data in a
way such that a human (or other intelligent algorithm) can come in and make sense of the newly organized data.
What makes unsupervised learning such an interesting area is that an overwhelming majority
of data in this world is unlabelled. Having intelligent algorithms that can take our terabytes
and terabytes of unlabelled data and make sense of it is a huge source of potential profit for
many industries. That alone could help boost productivity in a number of fields.
For example, what if we had a large database of every research paper ever published and we
had unsupervised learning algorithms that knew how to group these in such a way so that you
were always aware of the current progression within a particular domain of research. Now,
you begin to start a research project yourself, hooking your work into this network that the
algorithm can see. As you write your work up and take notes, the algorithm makes suggestions
to you about related works, works you may wish to cite, and works that may even help you
push that domain of research forward. With such a tool, your productivity could be boosted enormously.
Because unsupervised learning is based upon the data and its properties, we can say that
unsupervised learning is data-driven. The outcomes from an unsupervised learning task are
controlled by the data and the way it is formatted. Some areas where you might see unsupervised learning crop up are:
Recommender Systems: If you’ve ever used YouTube or Netflix, you’ve most likely
encountered a video recommendation system. These systems are oftentimes placed in the
unsupervised domain. We know things about videos, maybe their length, their genre, etc.
We also know the watch history of many users. Taking into account users that have
watched similar videos as you and then enjoyed other videos that you have yet to see, a
recommender system can see this relationship in the data and prompt you with such a
suggestion.
Buying Habits: It is likely that your buying habits are contained in a database somewhere
and that data is being bought and sold actively at this time. These buying habits can be
used in unsupervised learning algorithms to group customers into similar purchasing
segments. This helps companies market to these grouped segments and can even resemble
recommender systems.
Grouping User Logs: Less user-facing, but still very relevant, we can use unsupervised
learning to group user logs and issues. This can help companies identify central themes to
issues their customers face and rectify these issues, through improving a product or
designing an FAQ to handle common issues. Either way, it is something that is actively
done and if you’ve ever submitted an issue with a product or submitted a bug report, it is
likely that it was fed to an unsupervised learning algorithm to cluster it with other similar
issues.
Reinforcement Learning
Reinforcement learning likewise uses no fixed labels; instead, an agent interacts with an environment, takes actions, and receives rewards or penalties as feedback, learning a policy that maximizes its cumulative reward over time.
Machine learning tasks are commonly divided into the following types of problems:
1. Classification : When inputs are divided into two or more classes, and the learner must
produce a model that assigns unseen inputs to one or more (multi-label classification) of
these classes. This is typically tackled in a supervised way. Spam filtering is an example
of classification, where the inputs are email (or other) messages and the classes are
“spam” and “not spam”.
2. Regression : Also a supervised problem, this is the case where the outputs are continuous rather than discrete.
3. Clustering : When a set of inputs is to be divided into groups. Unlike in classification,
the groups are not known beforehand, making this typically an unsupervised task.
Supervised Algorithms
1. Linear Regression (used for regression)
2. Logistic Regression (used for classification)
3. Support Vector Machine
4. Decision Tree
5. KNN (K-Nearest Neighbour)
6. Naïve Bayes
7. Convolutional Neural Network (CNN)
Statistical Learning Method
Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and
presentation of data. Statistics is a collection of tools that you can use to get answers to
important questions about data.
You can use descriptive statistical methods to transform raw observations into
information that you can understand and share.
Statistical Learning is a set of tools for understanding data. These tools broadly come under two classes: supervised learning and unsupervised learning. The main ideas in this approach are data and hypotheses. Here, data is evidence, i.e., instantiations of some or all of the random variables describing the domain. Bayesian learning calculates the probability of each hypothesis given the data and uses these probabilities to make predictions.
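As a minimal sketch of Bayesian learning, consider a toy hypothesis space where each hypothesis is a possible bias of a coin; the hypotheses, priors, and observations below are invented purely for illustration.

```python
# Sketch: Bayesian learning over a small hypothesis space. Each hypothesis
# is the probability that a coin lands heads; priors and data are made up.
hypotheses = [0.2, 0.5, 0.8]        # P(heads) under each hypothesis
priors = [1 / 3, 1 / 3, 1 / 3]      # uniform prior P(h_i)

data = ["H", "H", "T", "H"]          # observed evidence

# Posterior P(h_i | d) is proportional to P(d | h_i) * P(h_i).
posteriors = []
for h, prior in zip(hypotheses, priors):
    likelihood = 1.0
    for outcome in data:
        likelihood *= h if outcome == "H" else (1 - h)
    posteriors.append(likelihood * prior)

total = sum(posteriors)
posteriors = [p / total for p in posteriors]  # normalize

# Predict the next outcome by averaging over all hypotheses.
p_next_heads = sum(h * p for h, p in zip(hypotheses, posteriors))
print("posteriors:", posteriors)
print("P(next = H):", p_next_heads)
```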
Raw observations alone are data, but they are not information or knowledge; data raises questions that statistical methods help us answer.
It would be fair to say that statistical methods are required to effectively work through a
machine learning predictive modelling project.
Below are examples of where statistical methods are used in an applied machine learning project.
Problem Framing: Requires the use of exploratory data analysis and data mining.
Data Understanding: Requires the use of summary statistics and data visualization.
Data Cleaning: Requires the use of outlier detection, imputation, and more.
Data Selection: Requires the use of data sampling and feature selection methods.
Data Preparation: Requires the use of data transforms, scaling, encoding, and much more.
Example: Naive Bayes Classifier
1. The Naive Bayes algorithm is a supervised learning algorithm, which is based on Bayes' theorem and used for solving classification problems.
2. It is mainly used in text classification, which involves high-dimensional training datasets.
3. The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps in building fast machine learning models that can make quick predictions.
4. It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
5. Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
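A minimal sketch of Naive Bayes text classification with scikit-learn follows; the tiny spam/ham corpus is invented for illustration.

```python
# Sketch: Naive Bayes for text classification (e.g., spam filtering) with
# scikit-learn; the tiny corpus below is invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [
    "win money now", "limited offer claim prize",   # spam
    "meeting at noon", "project report attached",   # not spam
]
labels = ["spam", "spam", "ham", "ham"]

# Turn text into the high-dimensional count features Naive Bayes works on.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

model = MultinomialNB().fit(X, labels)

new_mail = vectorizer.transform(["claim your prize money"])
print(model.predict(new_mail))         # likely ['spam']
print(model.predict_proba(new_mail))   # class probabilities
```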
Expectation-Maximization (EM) Algorithm
The EM algorithm was explained, proposed, and given its name in a paper published in 1977 by Arthur Dempster, Nan Laird, and Donald Rubin. It is used to find the local maximum-likelihood parameters of a statistical model in cases where latent variables are involved and the data is missing or incomplete.
Algorithm:
1. Given a set of incomplete data, consider a set of starting parameters.
2. Expectation step (E – step): Using the observed available data of the dataset,
estimate (guess) the values of the missing data.
3. Maximization step (M – step): Complete data generated after the expectation
(E) step is used in order to update the parameters.
4. Repeat step 2 and step 3 until convergence.
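The following is a minimal NumPy sketch of these four steps for a two-component, one-dimensional Gaussian mixture; the data and the deliberately rough starting parameters are invented for illustration.

```python
# Sketch: the E- and M-steps of EM for a two-component 1-D Gaussian
# mixture; data and starting parameters are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])

# Step 1: starting parameters (deliberately rough guesses).
means = np.array([1.0, 4.0])
stds = np.array([1.0, 1.0])
weights = np.array([0.5, 0.5])

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(50):  # Step 4: repeat E- and M-steps until convergence
    # E-step: estimate responsibilities (our "guess" at the hidden labels).
    resp = np.stack([w * gaussian(data, m, s)
                     for w, m, s in zip(weights, means, stds)])
    resp /= resp.sum(axis=0)

    # M-step: update the parameters using the completed data.
    n_k = resp.sum(axis=1)
    means = (resp * data).sum(axis=1) / n_k
    stds = np.sqrt((resp * (data - means[:, None]) ** 2).sum(axis=1) / n_k)
    weights = n_k / len(data)

print("means:", means)     # should approach 0 and 5
print("weights:", weights)
```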
Usage of EM algorithm –
It can be used to fill the missing data in a sample.
It can be used as the basis of unsupervised learning of clusters.
It can be used for estimating the parameters of a Hidden Markov Model (HMM).
It can be used for discovering the values of latent variables.
Advantages of the EM algorithm –
It is always guaranteed that the likelihood will increase with each iteration.
The E-step and M-step are often quite easy to implement for many problems.
Solutions to the M-step often exist in closed form.
Disadvantages of the EM algorithm –
It has slow convergence.
It converges to a local optimum only.
It requires both the forward and backward probabilities (numerical optimization requires only the forward probability).
Decision Tree
Decision trees are of two main types:
1. Categorical Variable Decision Tree: A decision tree with a categorical target variable is called a categorical variable decision tree.
2. Continuous Variable Decision Tree: A decision tree with a continuous target variable is called a continuous variable decision tree.
Example: Let's say we have a problem to predict whether a customer will pay his renewal
premium with an insurance company (yes/ no). Here we know that the income of customers
is a significant variable but the insurance company does not have income details for all
customers. Now, as we know this is an important variable, then we can build a decision tree
to predict customer income based on occupation, product, and various other variables. In this case, we are predicting values for a continuous variable.
Important Terminology related to Decision Trees
1. Root Node: It represents the entire population or sample, and it further gets divided into two or more homogeneous sets.
2. Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
3. Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.
4. Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are the children of that parent node.
Decision trees classify instances by sorting them down the tree from the root to some leaf
node, which provides the classification of the instance. Each node in the tree specifies a test
of some attribute of the instance, and each branch descending from that node corresponds to
one of the possible values for this attribute. An instance is classified by starting at the root
node of the tree, testing the attribute specified by this node, then moving down the tree
branch corresponding to the value of the attribute in the given example. This process is then
repeated for the subtree rooted at the new node.
For example, the instance (Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong) would be sorted down the leftmost branch of this decision tree and would therefore be classified as a negative instance (i.e., the tree predicts that PlayTennis = no).
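A minimal sketch of this sorting process, with the PlayTennis tree represented as nested dictionaries (the tree structure is taken from the classic example):

```python
# Sketch: sorting an instance down a decision tree represented as nested
# dicts. The tree below is the classic PlayTennis tree from the example.
tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {
            "attribute": "Humidity",
            "branches": {"High": "No", "Normal": "Yes"},
        },
        "Overcast": "Yes",
        "Rain": {
            "attribute": "Wind",
            "branches": {"Strong": "No", "Weak": "Yes"},
        },
    },
}

def classify(node, instance):
    # A leaf is just a class label; otherwise test the node's attribute
    # and follow the branch matching the instance's value.
    while isinstance(node, dict):
        node = node["branches"][instance[node["attribute"]]]
    return node

instance = {"Outlook": "Sunny", "Temperature": "Hot",
            "Humidity": "High", "Wind": "Strong"}
print(classify(tree, instance))  # -> "No" (PlayTennis = no)
```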
Attribute selection measures used in decision tree algorithms such as ID3 (Iterative Dichotomiser 3) and CART include:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of
a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision
tree.
o A decision tree algorithm always tries to maximize the value of information gain, and
a node/attribute having the highest information gain is split first. It can be calculated
using the below formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
or, equivalently,
Gain(S, A) = Entropy(S) − Σv (|Sv| / |S|) × Entropy(Sv)
where Sv is the subset of S for which attribute A has value v.
2. Entropy:
Entropy is a metric for measuring the impurity (randomness) of a set of examples S. It can be calculated as:
Entropy(S) = −Σi pi log2(pi)
where pi is the proportion of examples in S belonging to class i. For a boolean classification this reduces to:
Entropy(S) = −p+ log2(p+) − p− log2(p−)
where p+ and p− are the proportions of positive and negative examples in S.
3. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o It is calculated by subtracting the sum of squared probabilities of each class
from one.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
o The Gini index can be calculated using the following formula:
Gini(S) = 1 − Σi (pi)²
where pi is again the proportion of examples in S belonging to class i.
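To make the three measures concrete, here is a minimal sketch that computes entropy, information gain, and the Gini index from class counts; the 9-positive/5-negative toy split is invented for illustration.

```python
# Sketch: computing entropy, information gain, and Gini index for a split
# from class proportions; the toy label lists are invented for illustration.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())

def gini(labels):
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    # Gain = Entropy(S) - sum over subsets of (|Sv| / |S|) * Entropy(Sv)
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

S = ["yes"] * 9 + ["no"] * 5                  # 9 positive, 5 negative
split = [["yes"] * 6 + ["no"] * 2,            # subset where Wind = Weak
         ["yes"] * 3 + ["no"] * 3]            # subset where Wind = Strong
print("Entropy(S):", entropy(S))                      # ~0.940
print("Gain(S, Wind):", information_gain(S, split))   # ~0.048
print("Gini(S):", gini(S))
```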
A too-large tree increases the risk of overfitting, while a small tree may not capture all the important features of the dataset. Pruning is a technique that decreases the size of the learned tree without reducing accuracy. There are mainly two types of tree pruning techniques used: cost complexity pruning and reduced error pruning.
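As a minimal sketch of cost complexity pruning, scikit-learn exposes it through the ccp_alpha parameter of DecisionTreeClassifier; the dataset and the alpha value below are placeholders.

```python
# Sketch: cost-complexity pruning with scikit-learn; raising ccp_alpha
# prunes more aggressively, trading tree size against training accuracy.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

print("unpruned leaves:", unpruned.get_n_leaves())
print("pruned leaves:  ", pruned.get_n_leaves())
```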