ML Notes

Ch. 1 Introduction to Machine Learning

What is Machine Learning? (Just for understanding)


To solve a problem on a computer, we need an algorithm. An algorithm is a sequence of instructions
that should be carried out to transform the input to output. For example, one can devise an algorithm
for sorting. The input is a set of numbers and the output is their ordered list. For the same task, there
may be various algorithms and we may be interested in finding the most efficient one, requiring the
least number of instructions or memory or both.
For some tasks, however, we do not have an algorithm—for example, to tell spam emails from
legitimate emails. We know what the input is: an email document that in the simplest case is a file of
characters. We know what the output should be: a yes/no output indicating whether the message is
spam or not. We do not know how to transform the input to the output. What can be considered spam
changes in time and from individual to individual.
What we lack in knowledge, we make up for in data. We can easily compile thousands of example
messages, some of which we know to be spam, and what we want is to “learn” what constitutes spam
from them. In other words, we would like the computer (machine) to extract the algorithm for this
task automatically. There is no need to learn to sort numbers; we already have algorithms for that,
but there are many applications for which we do not have an algorithm but do have example data.
With advances in computer technology, we currently have the ability to store and process large
amounts of data, as well as to access it from physically distant locations over a computer network.
Most data acquisition devices are digital now and record reliable data.
Think, for example, of a supermarket chain that has hundreds of stores all over a country selling
thousands of goods to millions of customers. The point-of-sale terminals record the details of each
transaction: date, customer identification code, goods bought and their amount, total money spent,
and so forth. This typically amounts to gigabytes of data every day. What the supermarket chain wants
is to be able to predict who the likely customers for a product are. Again, the algorithm for this is not
evident; it changes in time and by geographic location. The stored data becomes useful only when it
is analyzed and turned into information that we can make use of, for example, to make predictions.
We do not know exactly which people are likely to buy this ice cream flavor, or the next book of this
author, or see this new movie, or visit this city, or click this link. If we knew, we would not need any
analysis of the data; we would just go ahead and write down the code. But because we do not, we can
only collect data and hope to extract the answers to these and similar questions from data.
We do believe that there is a process that explains the data we observe. Though we do not know the
details of the process underlying the generation of data—for example, consumer behavior—we know
that it is not completely random. People do not go to supermarkets and buy things at random. When
they buy beer, they buy chips; they buy ice cream in summer and spices in winter. There are certain
patterns in the data.
We may not be able to identify the process completely, but we believe we can construct a good and
useful approximation. That approximation may not explain everything, but may still be able to account
for some part of the data. We believe that though identifying the complete process may not be
possible, we can still detect certain patterns or regularities. This is the niche of machine learning. Such
patterns may help us understand the process, or we can use those patterns to make predictions:
Assuming that the future, at least the near future, will not be much different from the past when the
sample data was collected, the future predictions can also be expected to be right.
Application of machine learning methods to large databases is called data mining. The analogy is that
a large volume of earth and raw material is extracted from a mine, which when processed leads
to a small amount of very precious material; similarly, in data mining, a large volume of data is
processed to construct a simple model with valuable use, for example, having high predictive
accuracy. Its application areas are abundant: In addition to retail, in finance banks analyze their past
data to build models to use in credit applications, fraud detection, and the stock market. In
manufacturing, learning models are used for optimization, control, and troubleshooting. In medicine,
learning programs are used for medical diagnosis. In telecommunications, call patterns are analyzed
for network optimization and maximizing the quality of service.
In science, large amounts of data in physics, astronomy, and biology can only be analyzed fast enough
by computers. The World Wide Web is huge; it is constantly growing, and searching for relevant
information cannot be done manually.

 Machine learning, more specifically the field of predictive modeling is primarily concerned
with minimizing the error of a model or making the most accurate predictions possible, at the
expense of explainability.
 Machine learning is a type of artificial intelligence (AI) that provides computers with the ability
to learn without being explicitly programmed. Machine learning focuses on the development
of computer programs that can change when exposed to new data.
 The process of machine learning is similar to that of data mining. Both systems search through
data to look for patterns. However, instead of extracting data for human comprehension -- as
is the case in data mining applications -- machine learning uses that data to detect patterns in
data and adjust program actions accordingly. Machine learning algorithms are often
categorized as being supervised or unsupervised. Supervised algorithms can apply what has
been learned in the past to new data. Unsupervised algorithms can draw inferences from
datasets.

Key terminology
Before we jump into the machine learning algorithms, it would be best to explain some terminology.
The best way to do so is through an example of a system someone may want to make. We’ll go through
an example of building a bird classification system. This sort of system is an interesting topic often
associated with machine learning called expert systems. By creating a computer program to recognize
birds, we’ve replaced an ornithologist with a computer. The ornithologist is a bird expert, so we’ve
created an expert system.
The table below lists values for four characteristics of various birds that we decided to measure. We
chose to measure weight, wingspan, whether it has webbed feet, and the color of its back. In reality,
you’d want to measure more than this. It’s common practice to measure just about anything you can
measure and sort out the important parts later. The four things we’ve measured are called features;
these are also called attributes, but we’ll stick with the term features in this book. Each row in the
table is an instance made up of features.
The first two features in the table above are numeric and can take on decimal values. The third feature
(webbed feet) is binary: it can only be 1 or 0. The fourth feature (back color) is an enumeration over
the color palette we’re using, and I just chose some very common colors. Say we ask the people doing
the measurements to choose one of seven colors; then back color would be just an integer. (I know
choosing one color for the back of a bird is a gross oversimplification; please excuse this for the
purpose of illustration).
If you happen to see a Campephilus principalis (Ivory-billed Woodpecker), give me a call ASAP!
Don’t tell anyone else you saw it; just call me and keep an eye on the bird until I get there. (There’s a
$50,000 reward for anyone who can lead a biologist to a living Ivory-billed Woodpecker.) One task in
machine learning is classification; I’ll illustrate this using table 1.1 and
the fact that information about an Ivory-billed Woodpecker could get us $50,000. We want to identify
this bird out of a bunch of other birds, and we want to profit from this. We could set up a bird feeder
and then hire an ornithologist (bird expert) to
watch it and when they see an Ivory-billed Woodpecker give us a call. This would be expensive, and
the person could only be in one place at a time. We could also automate this process: set up many bird
feeders with cameras and computers attached to
them to identify the birds that come in. We could put a scale on the bird feeder to get the bird’s weight
and write some computer vision code to extract the bird’s wingspan, feet type, and back color. For the
moment, assume we have all that information. How do we then decide if a bird at our feeder is an
Ivory-billed Woodpecker or something else? This task is called classification, and there are many
machine learning algorithms that are good at classification. The class in this example is the bird
species; more specifically, we can reduce our classes to Ivory-billed Woodpecker or everything else.
Say we’ve decided on a machine learning algorithm to use for classification. What we need to do next
is train the algorithm, or allow it to learn. To train the algorithm we feed it quality data known as a
training set. A training set is the set of training examples we’ll use to train our machine learning
algorithms. In the table above, our training set has six training examples. Each training example has four
features and one target variable; this is depicted in the figure below. The target variable is what we’ll be
trying to predict with our machine learning algorithms. In classification the target variable takes on a
nominal value, and in the task of regression its value could be continuous. In a training set the target
variable is known. The machine learns by finding some relationship between the features and the
target variable. The target variable is the species, and as I mentioned earlier, we can reduce this to
take nominal values. In the classification problem the target variables are called classes, and there is
assumed to be a finite number of classes.
(NOTE: Features or attributes are the individual measurements that, when combined with other features,
make up a training example. They usually correspond to columns in a training or test set.)

To test machine learning algorithms what’s usually done is to have a training set of data and a separate
dataset, called a test set. Initially the program is fed the training examples; this is when the machine
learning takes place. Next, the test set is fed to the program. The target variable for each example from
the test set isn’t given to the program, and the program decides which class each example should
belong to. The target variable or class that the training example belongs to is then compared to the
predicted value, and we can get a sense for how accurate the algorithm is. There are better ways to
use all the information in the test set and training set. We’ll discuss them later.
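To make the train/test procedure concrete, here is a minimal Python sketch (added for illustration, not part of the original notes); the predicted and actual bird labels are hypothetical.

# A minimal sketch of how test-set accuracy is typically computed: compare the
# classifier's predictions against the known target variable of each test example.
def accuracy(predicted_labels, true_labels):
    """Fraction of test examples whose predicted class matches the known class."""
    correct = sum(1 for p, t in zip(predicted_labels, true_labels) if p == t)
    return correct / len(true_labels)

# Hypothetical example: 6 test examples, the classifier gets 5 of them right.
predictions = ['other', 'ivory-billed', 'other', 'other', 'ivory-billed', 'other']
actual      = ['other', 'ivory-billed', 'other', 'other', 'other',        'other']
print(accuracy(predictions, actual))   # 0.833...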

In our bird classification example, assume we’ve tested the program and it meets our desired level of
accuracy. Can we see what the machine has learned? This is called knowledge representation. The
answer is it depends. Some algorithms have knowledge representation that’s more readable by
humans than others. The knowledge representation may be in the form of a set of rules; it may be a
probability distribution or an example from the training set. In some cases we may not be interested
in building an expert system but interested only in the knowledge representation that’s acquired from
training a machine learning algorithm.

Figure: Features and target variable identified.

Types of Machine Learning

Supervised Learning: A training set of examples with the correct responses (targets) is provided
and, based on this training set, the algorithm generalizes to respond correctly to all possible inputs.
This is called learning from examples.

Unsupervised Learning: Correct responses are not provided; instead the algorithm tries to identify
similarities between the inputs so that inputs that have something in common are categorized
together. The statistical approach to unsupervised learning is known as density estimation.

Reinforcement Learning: This is somewhere between supervised and unsupervised learning.


The algorithm gets told when the answer is wrong, but does not get told how to correct it. It has to
explore and try out different possibilities until it works out how to get the answer right.
Reinforcement learning is sometimes called learning with a critic because of this monitor that scores
the answer, but does not suggest improvements.

Evolutionary learning: Biological evolution can be seen as a learning process: biological organisms
adapt to improve their survival rates and chance of having offspring in their environment. We’ll look
at how we can model this in a computer, using an idea of fitness, which corresponds to a score for
how good the current solution is.

The most common type of learning is supervised learning, and it is going to be the focus of the next
few chapters. So, before we get started, we’ll have a look at what it is, and the kinds of problems that
can be solved using it.

Applications of machine learning:

 Adaptive websites
 Affective computing
 Bioinformatics
 Brain-machine interfaces
 Cheminformatics
 Classifying DNA sequences
 Computational anatomy
 Computer vision, including object recognition
 Detecting credit card fraud
 Game playing
 Information retrieval
 Internet fraud detection
 Marketing
 Machine perception
 Medical diagnosis
 Economics
 Natural language processing
 Natural language understanding
 Optimization and metaheuristics
 Online advertising
 Recommender systems
 Robot locomotion
 Search engines
 Sentiment analysis (or opinion mining)
 Sequence mining
 Software engineering
 Speech and handwriting recognition
 Stock market analysis
 Structural health monitoring
 Syntactic pattern recognition
 User behavior analytics
How to choose the right algorithm

 With all the different algorithms available, how can you choose which one to use?
 First, you need to consider your goal. What are you trying to get out of this? (Do you want a
probability that it might rain tomorrow, or do you want to find groups of voters with similar
interests?) What data do you have or can you collect? Those are the big questions.
 Let’s talk about your goal. If you’re trying to predict or forecast a target value, then you need
to look into supervised learning. If not, then unsupervised learning is the place you want to be.
 If you’ve chosen supervised learning, what’s your target value? Is it a discrete value like
Yes/No, 1/2/3, A/B/C, or Red/Yellow/Black? If so, then you want to look into classification. If
the target value can take on a number of values, say any value from 0.00 to 100.00, or -999 to
999, or −∞ to +∞, then you need to look into regression.
 If you’re not trying to predict a target value, then you need to look into unsupervised learning.
Are you trying to fit your data into some discrete groups? If so and that’s all you need, you
should look into clustering. Do you need to have some numerical estimate of how strong the
fit is into each group? If you answer yes, then you probably should look into a density
estimation algorithm. The rules given here should point you in the right direction but are not
unbreakable laws. The second thing you need to consider is your data. You should spend some
time getting to know your data, and the more you know about it, the better you’ll be able to
build a successful application.
 Things to know about your data are these: Are the features nominal or continuous? Are there
missing values in the features? If there are missing values, why are there missing values? Are
there outliers in the data? Are you looking for a needle in a haystack, something that happens
very infrequently? All of these features about your data can help you narrow the algorithm
selection process.
 With the algorithm narrowed, there’s no single answer to what the best algorithm is or what
will give you the best results. You’re going to have to try different algorithms and see how they
perform.
 There are other machine learning techniques that you can use to improve the performance of
a machine learning algorithm. The relative performance of two algorithms may change after
you process the input data.

Steps in developing a machine learning application


Our approach to understanding and developing an application using machine learning in this book
will follow a procedure similar to this:
1. Collect data. You could collect the samples by scraping a website and extracting data, or you
could get information from an RSS feed or an API. You could have a device collect wind speed
measurements and send them to you, or blood glucose levels, or anything you can measure.
The number of options is endless. To save some time and effort, you could use publicly
available data.
2. Prepare the input data. Once you have this data, you need to make sure it’s in a useable
format. The format we’ll be using in this book is the Python list. We’ll talk about Python more
in a little bit, and lists are reviewed in appendix A. The benefit of having this standard format
is that you can mix and match algorithms and data sources. You may need to do some
algorithm-specific formatting here. Some algorithms need features in a special format, some
algorithms can deal with target variables and features as strings, and some need them to be
integers. We’ll get to this later, but the algorithm-specific formatting is usually trivial
compared to collecting data.
3. Analyze the input data. This is looking at the data from the previous task. This could be as
simple as looking at the data you’ve parsed in a text editor to make sure steps 1 and 2 are
actually working and you don’t have a bunch of empty values. You can also look at the data to
see if you can recognize any patterns or if there’s anything obvious, such as a few data points
that are vastly different from the rest of the set. Plotting data in one, two, or three dimensions
can also help.
But most of the time you’ll have more than three features, and you can’t easily plot the data across all
features at one time. You could, however, use some advanced methods we’ll talk about later to distill
multiple dimensions down to two or three so you can visualize the data.
If you’re working with a production system and you know what the data should look like, or
you trust its source, you can skip this step. This step takes human involvement, and for an
automated system you don’t want human involvement.
The value of this step is that it assures you that you don’t have garbage coming in.
4. Train the algorithm. This is where the machine learning takes place. This step and the next
step are where the “core” algorithms lie, depending on the algorithm. You feed the algorithm
good clean data from the first two steps and extract knowledge or information. This knowledge
you often store in a format that’s readily useable by a machine for the next two steps. In the
case of unsupervised learning, there’s no training step because you don’t have a target value.
Everything is used in the next step.
5. Test the algorithm. This is where the information learned in the previous step is put to use.
When you’re evaluating an algorithm, you’ll test it to see how well it does. In the case of
supervised learning, you have some known values you can use to evaluate the algorithm. In
unsupervised learning, you may have to use some other metrics to evaluate the success. In
either case, if you’re not satisfied, you can go back to step 4, change some things, and try testing
again. Often the collection or preparation of the data may have been the problem, and you’ll
have to go back to step 1.
6. Use it. Here you make a real program to do some task, and once again you see if all the previous
steps worked as you expected. You might encounter some new data and have to revisit steps
1–5.
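As a rough illustration of the six steps above, here is a toy end-to-end sketch (added, not from the notes) that uses Python lists as the data format and a deliberately trivial threshold "model"; all the numbers and labels are made up.

# 1. Collect data (here: hard-coded instead of scraping a site or calling an API).
raw = [("2.1", "yes"), ("0.8", "no"), ("1.9", "yes"), ("0.5", "no")]

# 2. Prepare the input data: convert features to numbers, labels stay strings.
data = [(float(x), label) for x, label in raw]

# 3. Analyze the input data: a quick sanity check for empty or odd values.
values = [x for x, _ in data]
print("min:", min(values), "max:", max(values), "count:", len(values))

# 4. Train the algorithm: learn a threshold halfway between the two class means.
yes_mean = sum(x for x, l in data if l == "yes") / sum(l == "yes" for _, l in data)
no_mean = sum(x for x, l in data if l == "no") / sum(l == "no" for _, l in data)
threshold = (yes_mean + no_mean) / 2

def predict(x):
    return "yes" if x > threshold else "no"

# 5. Test the algorithm on held-out examples with known labels.
test = [(2.0, "yes"), (0.6, "no")]
accuracy = sum(predict(x) == l for x, l in test) / len(test)
print("accuracy:", accuracy)

# 6. Use it: classify a brand-new measurement.
print(predict(1.7))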
Ch2. Learning with Regression
Supervised Learning

 Learning a mapping from a set of inputs to a target variable


 Classification: Target variable is discrete (spam mail)
 Regression: Target variable is real-valued (stock market)

 Regression is used to predict continuous values.

 Classification is used to predict which class a data point is part of (discrete value).

Example: I have a house with W rooms, X bathrooms, Y square-footage and Z lot-size. Based
on other houses in the area that have recently sold, how much (dollar amount) can I sell my
house for? I would use regression for this kind of problem.

Example: I have an unknown fruit that is yellow in color, 5.5 inches long, diameter of an inch,
and density of X. What fruit is this? I would use classification for this kind of problem to
classify it as a banana (as opposed to an apple or orange).

A good infographic can help reason through which of these methods to use for your problem (not
reproduced here).
2.1 Linear Regression

 Machine learning, more specifically the field of predictive modeling is primarily concerned
with minimizing the error of a model or making the most accurate predictions possible, at the
expense of explainability. In applied machine learning we will borrow, reuse and steal
algorithms from many different fields, including statistics and use them towards these ends.

 As such, linear regression was developed in the field of statistics and is studied as a model for
understanding the relationship between input and output numerical variables, but has been
borrowed by machine learning. It is both a statistical algorithm and a machine learning
algorithm.

Linear regression is a linear model, e.g. a model that assumes a linear relationship between the input
variables (x) and the single output variable (y). More specifically, that y can be calculated from a linear
combination of the input variables (x).
 When there is a single input variable (x), the method is referred to as simple linear
regression.
 When there are multiple input variables, literature from statistics often refers to the method
as multiple linear regression.
Different techniques can be used to prepare or train the linear regression equation from data, the
most common of which is called Ordinary Least Squares.

 The representation is a linear equation that combines a specific set of input values (x), the
solution to which is the predicted output for that set of input values (y). As such, both the input
values (x) and the output value are numeric.
A simple linear regression model is based on a single independent variable and its general
form is:

Yt = α + β Xt + εt

where,
Yt = dependent variable or response variable
Xt = independent variable or predictor
α = intercept
β = slope / regression coefficient
εt = random error or disturbance term
Question is, can we use a known value of X to help predict the Y? And the answer is, we draw
straight line through points and we will use that line for prediction.

Consider the example,


Correct(x) Attitude(y)
17 94
13 73
12 59
15 80
16 93
14 85
16 66
16 79
18 77
19 91
To predict the value of “attitude” we need the regression formulae.
Linear regression function: y = a + bx
Slope: b = r (sy / sx), where r is the Pearson correlation coefficient and sx, sy are the standard
deviations of x and y
y-intercept: a = y̅ - b x̅, where y̅ is the mean of the y samples and x̅ is the mean of the x samples

Correct(x)  Attitude(y)  (x- x̅)  (y- y̅)   (x- x̅)(y- y̅)  (x- x̅)2  (y- y̅)2
17          94            1.4    14.3       20.02        1.96    204.49
13          73           -2.6    -6.7       17.42        6.76     44.89
12          59           -3.6   -20.7       74.52       12.96    428.49
15          80           -0.6     0.3       -0.18        0.36      0.09
16          93            0.4    13.3        5.32        0.16    176.89
14          85           -1.6     5.3       -8.48        2.56     28.09
16          66            0.4   -13.7       -5.48        0.16    187.69
16          79            0.4    -0.7       -0.28        0.16      0.49
18          77            2.4    -2.7       -6.48        5.76      7.29
19          91            3.4    11.3       38.42       11.56    127.69
x̅=15.6      y̅=79.7                          ∑=134.8      ∑=42.4   ∑=1206.1

Pearson correlation coefficient(r)


r = ∑((x- x̅)(y- y̅)) / √( ∑(x- x̅)2 ∑(y- y̅)2 )
= 134.8 / √(42.4 × 1206.1)

= 0.596 (a moderately strong positive linear correlation between x and y)

sy = √( ∑(y- y̅)2 / (n-1) ) = √(1206.1 / 9) = 11.576

sx = √( ∑(x- x̅)2 / (n-1) ) = √(42.4 / 9) = 2.171

where n = 10 is the number of samples.
b = r*(sy/sx)
= 0.596*(11.576/2.171)
= 3.178

a= y̅ -bx̅
= 79.7 - 3.178 * 15.6
= 30.123

y = a + bx
= 30.123 + 3.178x
If x = 15 then we can predict y = 30.123 + 3.178 × 15 ≈ 77.79
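The same worked example can be reproduced in a few lines of Python (a sketch added here; it uses only the x and y values from the table above).

# Linear regression coefficients for the correct/attitude data, done directly.
from math import sqrt

x = [17, 13, 12, 15, 16, 14, 16, 16, 18, 19]   # Correct
y = [94, 73, 59, 80, 93, 85, 66, 79, 77, 91]   # Attitude
n = len(x)

x_bar = sum(x) / n                    # 15.6
y_bar = sum(y) / n                    # 79.7
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))   # 134.8
sxx = sum((xi - x_bar) ** 2 for xi in x)                         # 42.4
syy = sum((yi - y_bar) ** 2 for yi in y)                         # 1206.1

r = sxy / sqrt(sxx * syy)             # ~0.596, Pearson correlation coefficient
b = sxy / sxx                         # slope, ~3.18 (same as r * sy / sx)
a = y_bar - b * x_bar                 # intercept, ~30.1

print(f"r={r:.3f}  b={b:.3f}  a={a:.3f}")
print("prediction for x=15:", a + b * 15)   # ~77.8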

2.2 Logistic Regression


Suppose we have some data points and we fit a line, called the best-fit line, to those points; that’s regression.
What happens in logistic regression is we have a bunch of data, and with the data we try to build an
equation to do classification for us. The exact math behind this you’ll see in the next part of the book,
but the regression aspect means that we try to find a best-fit set of parameters. Finding the best fit
is similar to regression, and in this method it’s how we train our classifier. We’ll use optimization
algorithms to find these best-fit parameters. This best-fit idea is where the name regression comes
from. We’ll talk about the math behind making this a classifier that puts out one of two values.

Classification with logistic regression and the sigmoid function: a tractable step function
We’d like to have an equation we can give all of our features and it will predict the class. In the two-
class case, the function will spit out a 0 or a 1. Perhaps you’ve seen this before; it’s called the Heaviside
step function, or sometimes just the step function. The problem with the Heaviside step function is
that at the point where it steps from 0 to 1, it does so instantly. This instantaneous step is sometimes
difficult to deal with. There’s another function that behaves in a similar fashion, but it’s much easier
to deal with mathematically. This function is called the sigmoid. The sigmoid is given by the following
equation:

σ(z) = 1 / (1 + e^(-z))

Two plots of the sigmoid are given in figure 5.1. At 0 the value of the sigmoid is 0.5. For increasing
values of x, the sigmoid will approach 1, and for decreasing values of x, the sigmoid will approach 0.
On a large enough scale (the bottom frame of figure 5.1), the sigmoid looks like a step function. For
the logistic regression classifier we’ll take our features and multiply each one by a weight and then
add them up. This result will be put into the sigmoid, and we’ll get a number between 0 and 1.
Anything above 0.5 we’ll classify as a 1, and anything below 0.5 we’ll classify as a 0. You can also think
of logistic regression as a probability estimate.
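A small added sketch of the decision rule just described: weight each feature, sum the results, pass the sum through the sigmoid, and threshold at 0.5. The feature values and weights below are made-up numbers for illustration.

from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def classify(features, weights):
    z = sum(w * x for w, x in zip(weights, features))  # z = w0*x0 + w1*x1 + ...
    return 1 if sigmoid(z) > 0.5 else 0

print(sigmoid(0))                                     # 0.5, the midpoint of the sigmoid
print(classify([1.0, 2.0, -1.5], [0.4, 0.3, 0.2]))    # hypothetical example -> 1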

The question now becomes, what are the best weights, or regression coefficients to use, and how do
we find them? The next section will address this question.
Using optimization to find the best regression coefficients
The input to the sigmoid function described will be z, where z is given by the following:
z = w0x0 + w1x1 + w2x2 + …+ wnxn

In vector notation we can write this as z=wTx. All that means is that we have two vectors of numbers
and we’ll multiply each element and add them up to get one number. The vector x is our input data,
and we want to find the best coefficients w, so that this classifier will be as successful as possible. In
order to do that, we need to consider some ideas from optimization theory.
We’ll first look at optimization with gradient ascent. We’ll then see how we can use this method of
optimization to find the best parameters to model our dataset. Next, we’ll show how to plot the
decision boundary generated with gradient ascent. This will help you visualize the successfulness of
gradient ascent. Next, you’ll learn about stochastic gradient ascent and how to make modifications to
yield better results.

Gradient ascent
The first optimization algorithm we’re going to look at is called gradient ascent. Gradient ascent is
based on the idea that if we want to find the maximum point on a function, then the best way to move
is in the direction of the gradient. We write the gradient with the symbol ∇, and the gradient of a
function f(x,y) is given by the equation

∇f(x,y) = ( ∂f(x,y)/∂x , ∂f(x,y)/∂y )
This is one of the aspects of machine learning that can be confusing. The math isn’t difficult. You just
need to keep track of what symbols mean. So this gradient means that we’ll move in the x direction
by amount ∂f(x,y)/∂x and in the y direction by amount ∂f(x,y)/∂y. The function f(x,y) needs to be defined and
differentiable around the points where it’s being evaluated. An example of this is shown in figure 5.2.
The gradient ascent algorithm shown in figure 5.2 takes a step in the direction given by the gradient.
The gradient operator will always point in the direction of the greatest increase. We’ve talked about
direction, but I didn’t mention anything to do with magnitude of movement. The magnitude, or step
size, we’ll take is given by the parameter α. In vector notation we can write the gradient ascent
algorithm as

w := w + α ∇w f(w)

This step is repeated until we reach a stopping condition: either a specified number of steps or the
algorithm is within a certain tolerance margin.
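A minimal added sketch of gradient ascent on a simple concave function f(x, y) = -(x-1)² - (y-2)², whose maximum is at (1, 2). The update rule w := w + α∇f(w) is the one given above; the step size and iteration count are arbitrary choices.

def gradient(x, y):
    # Partial derivatives of f: df/dx = -2(x - 1), df/dy = -2(y - 2)
    return (-2 * (x - 1), -2 * (y - 2))

alpha = 0.1          # step size (the "magnitude" parameter)
x, y = 0.0, 0.0      # starting point

for _ in range(100):             # stopping condition: a fixed number of steps
    gx, gy = gradient(x, y)
    x, y = x + alpha * gx, y + alpha * gy

print(round(x, 3), round(y, 3))  # converges toward (1.0, 2.0)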
Ch3. Learning with Trees
 We are now going to consider a rather different approach to machine learning, starting with
one of the most common and powerful data structures in the whole of computer science: the
binary tree.
 The computational cost of making the tree is fairly low, but the cost of using it is even lower:
O(logN), where N is the number of datapoints.
 This is important for machine learning, since querying the trained algorithm should be as fast
as possible since it happens more often, and the result is often wanted immediately.
 This is sufficient to make trees seem attractive for machine learning.

The idea of a decision tree is that we break classification down into a set of choices about each feature
in turn, starting at the root (base) of the tree and progressing down to the leaves, where we receive
the classification decision. The trees are very easy to understand, and can even be turned into a set of
if-then rules, suitable for use in a rule induction system.
In terms of optimization and search, decision trees use a greedy heuristic to perform search,
evaluating the possible options at the current stage of learning and choosing the one that seems optimal
at that point. This works well a surprisingly large amount of the time.

USING DECISION TREES

As a student it can be difficult to decide what to do in the evening.


 There are four things that you actually quite enjoy doing, or have to do: going to the pub,
watching TV, going to a party, or even (gasp) studying.
 The choice is sometimes made for you—if you have an assignment due the next day, then you
need to study, if you are feeling lazy then the pub isn’t for you, and if there isn’t a party then
you can’t go to it.
 You are looking for a nice algorithm that will let you decide what to do each evening without
having to think about it every night.

 Figure 12.1 provides just such an algorithm. Each evening you start at the top (root) of the tree
and check whether any of your friends know about a party that night. If there is one, then you
need to go, regardless. Only if there is not a party do you worry about whether or not you have
an assignment deadline coming up. If there is a crucial deadline, then you have to study, but if
there is nothing that is urgent for the next few days, you think about how you feel. A sudden
burst of energy might make you study, but otherwise you’ll be slumped in front of the TV
indulging your secret love of Shortland Street (or other soap opera of your choice) rather than
studying.
Of course, near the start of the semester when there are no assignments to do, and you are feeling
rich, you’ll be in the pub.
One of the reasons that decision trees are popular is that we can turn them into a set of logical
disjunctions (if ... then rules) that then go into program code very simply—the first part of the tree
above can be turned into:
 if there is a party then go to it
 if there is not a party and you have an urgent deadline then study
 etc.

CONSTRUCTING DECISION TREES

In the example above, the three features that we need for the algorithm are the state of
your energy level, the date of your nearest deadline, and whether or not there is a party
tonight. The question we need to ask is how, based on those features, we can construct the
tree. There are a few different decision tree algorithms, but they are almost all variants of
the same principle: the algorithms build the tree in a greedy manner starting at the root,
choosing the most informative feature at each step. We are going to start by focusing on
the most common: Quinlan’s ID3, although we’ll also mention its extension, known as C4.5,
and another known as CART.

Entropy in information theory


Information theory was ‘born’ in 1948 when Claude Shannon published a paper called “A
Mathematical Theory of Communication.” In that paper, he proposed the measure of information
entropy, which describes the amount of impurity in a set of features. The entropy H of a set of
probabilities pi is (for those who know some physics, the relation to physical entropy should be
clear):

where the logarithm is base 2 because we are imagining that we encode everything using binary digits
(bits), For our decision tree, the best feature to pick as the one to classify on now is the one that gives
you the most information, i.e., the one with the highest entropy. After using that feature, we re-
evaluate the entropy of each feature and again pick the one with the highest entropy.
(For more information on entropy and examples
http://www.csun.edu/~twang/595DM/Slides/Information%20&%20Entropy.pdf)
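A short added sketch of the entropy formula in Python:

from math import log2

def entropy(probabilities):
    """H(p) = -sum(p_i * log2(p_i)), skipping zero probabilities."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))     # 1.0 bit: a fair coin is maximally impure
print(entropy([1.0]))          # 0.0 bits: a pure set (may print as -0.0)
print(entropy([0.25, 0.75]))   # ~0.811 bits (used in the ID3 example below)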

ID3
Now that we have a suitable measure, entropy, for choosing which feature to pick next, we just have
to work out how to apply it. The important idea is to work out how much the entropy of the whole
training set would decrease if we choose each particular feature for the next classification step.
This is known as the information gain, and it is defined as the entropy of the whole set minus the
entropy when a particular feature is chosen. This is defined by (where S is the set of examples, F is a
possible feature out of the set of all possible ones, and |Sf | is a count of the number of members of S
that have value f for feature F):

Gain(S, F) = Entropy(S) − Σf∈values(F) (|Sf| / |S|) Entropy(Sf)

As an example, suppose that we have data (with outcomes) S = {s1 = true, s2 = false, s3 = false, s4 =
false} and one feature F that can have values {f1, f2, f3}. In the example, the feature value for s1 could
be f2, for s2 it could be f2, for s3, f3, and for s4, f1. Then we can calculate the entropy of S as (where ⊕
means true, of which we have one example, and Θ means false, of which we have three examples):

Entropy(S) = −(1/4) log2(1/4) − (3/4) log2(3/4) = 0.5 + 0.311 = 0.811

The function Entropy(Sf ) is similar, but only computed with the subset of data where feature F has
values f.
We now want to compute the information gain of F, so we now need to compute each of the values
(|Sf| / |S|) Entropy(Sf) inside the summation:

(|Sf1| / |S|) Entropy(Sf1) = (1/4) × 0 = 0
(|Sf2| / |S|) Entropy(Sf2) = (2/4) × 1 = 0.5
(|Sf3| / |S|) Entropy(Sf3) = (1/4) × 0 = 0

(In the fuller evening-activity example, the features would be ‘Deadline’, ‘Party’, and ‘Lazy’.)

The information gain from adding this feature is the entropy of S minus the sum of the
three values above:

Gain(S, F) = 0.811 − (0 + 0.5 + 0) = 0.311

The ID3 algorithm computes this information gain for each feature and chooses the one that produces
the highest value. In essence, that is all there is to the algorithm. It searches the space of possible trees
in a greedy way by choosing the feature with the highest information gain at each stage. The output
of the algorithm is the tree, i.e., a list of nodes, edges, and leaves. As with any tree in computer science,
it can be constructed recursively.
At each stage the best feature is selected and then removed from the dataset, and the algorithm is
recursively called on the rest. The recursion stops when either there is only one class remaining in
the data (in which case a leaf is added with that class as its label), or there are no features left, when
the most common label in the remaining data is used.
The ID3 Algorithm
• If all examples have the same label:
– return a leaf with that label
• Else if there are no features left to test:
– return a leaf with the most common label
• Else:
– choose the feature F̂ that maximises the information gain of S to be the next node
– add a branch from the node for each possible value f of F̂
– for each branch:
* calculate Sf by removing F̂ from the set of features
* recursively call the algorithm with Sf, to compute the gain relative to the
current set of examples
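An added sketch of the information-gain computation for the small example above (S = {s1=true, s2=false, s3=false, s4=false} with feature values f2, f2, f3, f1 respectively); it reproduces Gain(S, F) = 0.311.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    n = len(labels)
    gain = entropy(labels)
    for f in set(feature_values):
        subset = [l for l, v in zip(labels, feature_values) if v == f]
        gain -= (len(subset) / n) * entropy(subset)   # weighted entropy of S_f
    return gain

labels = [True, False, False, False]
feature = ['f2', 'f2', 'f3', 'f1']
print(round(information_gain(labels, feature), 3))   # 0.311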
Ch.5 Learning with Classification

Classification is a form of data analysis that extracts models describing important data classes. Such
models, called classifiers, predict categorical (discrete, unordered) class labels. For example, we can
build a classification model to categorize bank loan applications as either safe or risky. Such analysis
can help provide us with a better understanding of the data at large. Many classification methods have
been proposed by researchers in machine learning, pattern recognition, and statistics. Most
algorithms are memory resident, typically assuming a small data size. Recent data mining research
has built on such work, developing scalable classification and prediction techniques capable of
handling large amounts of disk-resident data. Classification has numerous applications, including
fraud detection, target marketing, performance prediction, manufacturing, and medical diagnosis.

What Is Classification?
A bank loans officer needs analysis of her data to learn which loan applicants are “safe” and which are
“risky” for the bank. A marketing manager at AllElectronics needs data analysis to help guess whether
a customer with a given profile will buy a new computer. A medical researcher wants to analyze
breast cancer data to predict which one of three specific treatments a patient should receive. In each
of these examples, the data analysis task is classification, where a model or classifier is constructed
to predict class (categorical) labels, such as “safe” or “risky” for the loan application data; “yes” or “no”
for the marketing data; or “treatment A,” “treatment B,” or “treatment C” for the medical data. These
categories can be represented by discrete values, where the ordering among values has no meaning.
For example, the values 1, 2, and 3 may be used to represent treatments A, B, and C, where there is no
ordering implied among this group of treatment regimes. Suppose that the marketing manager wants
to predict how much a given customer will spend during a sale at AllElectronics. This data analysis
task is an example of numeric prediction, where the model constructed predicts a continuous-valued
function, or
ordered value, as opposed to a class label. This model is a predictor. Regression analysis is a
statistical methodology that is most often used for numeric prediction; hence the two terms tend to
be used synonymously, although other methods for numeric prediction exist. Classification and
numeric prediction are the two major types of prediction problems. This chapter focuses on
classification.

General Approach to Classification


“How does classification work?” Data classification is a two-step process, consisting of a learning step
(where a classification model is constructed) and a classification step (where the model is used to
predict class labels for given data). The process is shown for the loan application data of Figure
5.1. (The data are simplified for illustrative purposes. In reality, we may expect many more attributes
to be considered.)
In the first step, a classifier is built describing a predetermined set of data classes or concepts. This is
the learning step (or training phase), where a classification algorithm builds the classifier by
analyzing or “learning from” a training set made up of database tuples and their associated class
labels. A tuple, X, is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting
n measurements made on the tuple from n database attributes, respectively, A1, A2, ..., An. Each
tuple, X, is assumed to belong to a predefined class as determined by another database attribute called
the class label attribute. The class label attribute is discrete-valued and unordered. It is categorical
(or nominal) in that each value serves as a category or class. The individual tuples
making up the training set are referred to as training tuples and are randomly sampled from the
database under analysis. In the context of classification, data tuples can be referred to as samples,
examples, instances, data points, or objects.

Figure: 5.1 The data classification process: (a) Learning: Training data are analyzed by a classification
algorithm. Here, the class label attribute is loan decision, and the learned model or classifier is
represented in the form of classification rules. (b) Classification: Test data are used to estimate the
accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied
to the classification of new data tuples.

This first step of the classification process can also be viewed as the learning of a mapping
or function, y = f (X), that can predict the associated class label y of a given tuple X.

In this view, we wish to learn a mapping or function that separates the data classes. Typically, this
mapping is represented in the form of classification rules, decision trees, or mathematical formulae.
In our example, the mapping is represented as classification rules that identify loan applications as
being either safe or risky (Figure 5.1a). The rules can be used to categorize future data tuples, as well
as provide deeper insight into the data contents. They also provide a compressed data representation.

“What about classification accuracy?” In the second step (Figure 5.1b), the model is used for
classification. First, the predictive accuracy of the classifier is estimated. If we were to use the training
set to measure the classifier’s accuracy, this estimate would likely be optimistic, because the classifier
tends to overfit the data (i.e., during learning it may incorporate some particular anomalies of the
training data that are not present in the general data set overall). Therefore, a test set is used, made
up of test tuples and their associated class labels. They are independent of the training tuples,
meaning that they were not used to construct the classifier.

The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly
classified by the classifier.

The associated class label of each test tuple is compared with the learned classifier’s class prediction
for that tuple.

Rule-Based Classification
In this section, we look at rule-based classifiers, where the learned model is represented as a set of
IF-THEN rules.
We first examine how such rules are used for classification. We then study ways in which they can be
generated, either from a decision tree or directly from the training data using a sequential covering
algorithm.

Using IF-THEN Rules for Classification


Rules are a good way of representing information or bits of knowledge. A rule-based classifier
uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression of the form

IF condition THEN conclusion.

An example is rule R1,

R1: IF age = youth AND student = yes THEN buys computer = yes.

Points to remember −

 The IF part of the rule is called rule antecedent or precondition.

 The THEN part of the rule is called rule consequent.

 The antecedent part (the condition) consists of one or more attribute tests, and these tests are
logically ANDed.

 The consequent part consists of a class prediction.


In the rule antecedent, the condition consists of one or more attribute tests
(e.g., age = youth and student = yes) that are logically ANDed.

The rule’s consequent contains a class prediction (in this case, we are predicting whether a
customer will buy a computer).

R1 can also be written as

R1: (age = youth) ∧ (student = yes) ⇒ (buys computer = yes).

If the condition (i.e., all the attribute tests) in a rule antecedent holds true for a given tuple, we say
that the rule antecedent is satisfied (or simply, that the rule is satisfied) and that the rule covers the
tuple.
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

 A rule r ‘covers’ an instance x if the attributes of the instance satisfy the condition of the rule

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?

The rule R1 covers a hawk => Bird


The rule R3 covers the grizzly bear => Mammal
A rule R can be assessed by its coverage and accuracy.
Given a tuple, X, from a class labeled data set, D, let ncovers be the number of tuples covered by R;
ncorrect be the number of tuples correctly classified by R; and |D| be the number of tuples in D.

We can define
the coverage and accuracy of R as

coverage(R) = ncovers / |D|
accuracy(R) = ncorrect / ncovers

That is, a rule’s coverage is the percentage of tuples that are covered by the rule (i.e., their attribute
values hold true for the rule’s antecedent). For a rule’s accuracy, we look at the tuples that it covers
and see what percentage of them the rule can correctly classify.

Example;
 Coverage of a rule: Fraction of records that satisfy the antecedent of a rule.
 Accuracy of a rule: Fraction of records that satisfy both the antecedent and
consequent of a rule.

(Status = Single) → No

Coverage = 40%, Accuracy = 50%

Coverage= 4/10=0.4 (40%)

Accuracy= 2/4 =0.5 (50%)

Fig. 5.2 Training Data to calculate coverage and accuracy

How does Rule-based Classifier Work?


R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?
A lemur triggers rule R3, so it is classified as a mammal
A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules
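An added sketch showing how the rules R1–R5 above are matched against the three query tuples; a rule covers a tuple when every attribute test in its antecedent holds for that tuple.

rules = [
    ({"Give Birth": "no",  "Can Fly": "yes"},        "Birds"),       # R1
    ({"Give Birth": "no",  "Live in Water": "yes"},  "Fishes"),      # R2
    ({"Give Birth": "yes", "Blood Type": "warm"},    "Mammals"),     # R3
    ({"Give Birth": "no",  "Can Fly": "no"},         "Reptiles"),    # R4
    ({"Live in Water": "sometimes"},                 "Amphibians"),  # R5
]

def triggered(tuple_, rules):
    """Return the consequents of every rule whose antecedent the tuple satisfies."""
    return [cls for antecedent, cls in rules
            if all(tuple_.get(attr) == val for attr, val in antecedent.items())]

lemur   = {"Blood Type": "warm", "Give Birth": "yes", "Can Fly": "no", "Live in Water": "no"}
turtle  = {"Blood Type": "cold", "Give Birth": "no",  "Can Fly": "no", "Live in Water": "sometimes"}
dogfish = {"Blood Type": "cold", "Give Birth": "yes", "Can Fly": "no", "Live in Water": "yes"}

print(triggered(lemur, rules))    # ['Mammals']                  -> only R3 fires
print(triggered(turtle, rules))   # ['Reptiles', 'Amphibians']   -> both R4 and R5 fire
print(triggered(dogfish, rules))  # []                           -> no rule fires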

Characteristics of Rule-Based Classifier

 Mutually exclusive rules


 Classifier contains mutually exclusive rules if the rules are independent of each other
 Every record is covered by at most one rule
 Exhaustive rules
 Classifier has exhaustive coverage if it accounts for every possible combination of
attribute values
 Each record is covered by at least one rule

Rule Extraction from a Decision Tree

Decision tree classifiers are a popular method of classification—it is easy to understand how decision
trees work and they are known for their accuracy. Decision trees can become large and difficult to
interpret. In this subsection, we look at how to build a rule based classifier by extracting IF-THEN
rules from a decision tree. In comparison with a decision tree, the IF-THEN rules may be easier for
humans to understand, particularly if the decision tree is very large.

 To extract rules from a decision tree, one rule is created for each path from the root to a leaf
node.
 Each splitting criterion along a given path is logically ANDed to form the rule antecedent (“IF”
part). The leaf node holds the class prediction, forming the rule consequent (“THEN” part).

Fig.5.3 A decision tree for the concept buys computer, indicating whether an AllElectronics customer
is likely to purchase a computer. Each internal (nonleaf) node represents a test on an attribute. Each
leaf node represents a class (either buys-computer= yes or buys-computer= no).

Example: Extracting classification rules from a decision tree. The decision tree of Figure 5.3 can
be converted to classification IF-THEN rules by tracing the path from the root node to each leaf node
in the tree. The rules extracted from Figure 5.3 are as follows:

R1: IF age = youth AND student = no THEN buys-computer = no


R2: IF age = youth AND student = yes THEN buys-computer = yes
R3: IF age = middle aged THEN buys computer=yes
R4: IF age = senior AND credit rating = excellent THEN buys computer = yes
R5: IF age = senior AND credit rating = fair THEN buys computer = no

Since we end up with one rule per leaf, the set of extracted rules is not much simpler than the
corresponding decision tree! The extracted rules may be even more difficult to interpret than the
original trees in some cases. As an example, Figure 5.4 showed decision trees that suffer from subtree
repetition and replication. The resulting set of rules extracted can be large and difficult to follow,
because some of the attribute tests may be irrelevant or redundant. So, the plot thickens. Although it
is easy to extract rules from a decision tree, we may need to do some more work by pruning the
resulting rule set.
Fig.5.4
“How can we prune the rule set?” For a given rule antecedent, any condition that does not improve the
estimated accuracy of the rule can be pruned (i.e., removed), thereby generalizing the rule. C4.5
extracts rules from an unpruned tree, and then prunes the rules using a pessimistic approach similar
to its tree pruning method. The training tuples and their associated class labels are used to estimate
rule accuracy. However, because this would result in an optimistic estimate, alternatively, the
estimate is adjusted to compensate for the bias, resulting in a pessimistic estimate. In addition, any
rule that does not contribute to the overall accuracy of the entire rule set can also be pruned.

Rule Induction Using a Sequential Covering Algorithm


IF-THEN rules can be extracted directly from the training data (i.e., without having to generate a
decision tree first) using a sequential covering algorithm. The name comes from the notion that the
rules are learned sequentially (one at a time), where each rule for a given class will ideally cover many
of the class’s tuples (and hopefully none of the tuples of other classes). Sequential covering algorithms
are the most widely used approach to mining disjunctive sets of classification rules, and form the topic
of this subsection.
There are many sequential covering algorithms. Popular variations include AQ, CN2, and the more
recent RIPPER. The general strategy is as follows. Rules are learned one at a time. Each time a rule is
learned, the tuples covered by the rule are removed, and the process repeats on the remaining tuples.
This sequential learning of rules is in contrast to decision tree induction. Because the path to each leaf
in a decision tree corresponds to a rule, we can consider decision tree induction as learning a set of
rules simultaneously.

A basic sequential covering algorithm is shown below. Here, rules are learned for one class at a time.
Ideally, when learning a rule for a class, C, we would like the rule to cover all (or many) of the
training tuples of class C and none (or few) of the tuples from other classes. In this way, the rules
learned should be of high accuracy.
The rules need not necessarily be of high coverage. This is because we can have more than one rule
for a class, so that different rules may cover different tuples within the same class. The process
continues until the terminating condition is met, such as when there are no more training tuples or
the quality of a rule returned is below a user-specified threshold. The Learn_One_Rule procedure
finds the “best” rule for the current class, given the current set of training tuples.

Algorithm: Sequential covering. Learn a set of IF-THEN rules for classification.


Input:
 D, a data set of class-labeled tuples;
 Att_vals, the set of all attributes and their possible values.
Output: A set of IF-THEN rules.
Method:
(1) Rule_set = {}; // initial set of rules learned is empty
(2) for each class c do
(3) repeat
(4) Rule = Learn_One_Rule(D, Att vals, c);
(5) remove tuples covered by Rule from D;
(6) Rule_set = Rule_set + Rule; // add new rule to rule set
(7) until terminating condition;
(8) endfor
(9) return Rule_Set ;
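A compact added Python sketch of the sequential-covering loop above. The notes do not spell out Learn_One_Rule, so this version assumes a simple greedy strategy (repeatedly add the attribute test with the highest accuracy on the remaining tuples); the small data set at the bottom is hypothetical.

def learn_one_rule(data, target_class, attributes):
    """Greedily grow one rule (a dict of attribute tests) for target_class."""
    rule = {}
    covered = list(data)
    while covered and any(label != target_class for _, label in covered):
        best = None
        for attr in attributes:
            if attr in rule:
                continue
            for value in {t[attr] for t, _ in covered}:
                subset = [(t, l) for t, l in covered if t[attr] == value]
                acc = sum(l == target_class for _, l in subset) / len(subset)
                if best is None or acc > best[0]:
                    best = (acc, attr, value)
        if best is None:
            break
        _, attr, value = best
        rule[attr] = value
        covered = [(t, l) for t, l in covered if t[attr] == value]
    return rule

def sequential_covering(data, attributes):
    rule_set = []
    remaining = list(data)                       # D: tuples not yet covered
    for target_class in {label for _, label in data}:
        while any(label == target_class for _, label in remaining):
            rule = learn_one_rule(remaining, target_class, attributes)
            rule_set.append((rule, target_class))
            # remove the tuples covered by the new rule and repeat
            remaining = [(t, l) for t, l in remaining
                         if not all(t[a] == v for a, v in rule.items())]
    return rule_set

# Tiny hypothetical data set: predict whether someone buys, from two attributes.
data = [({"student": "yes", "age": "youth"},  "buys"),
        ({"student": "no",  "age": "youth"},  "no"),
        ({"student": "yes", "age": "senior"}, "buys"),
        ({"student": "no",  "age": "senior"}, "no")]
# e.g. [({'student': 'yes'}, 'buys'), ({}, 'no')]; the last class often ends up
# with an empty-antecedent (default) rule, and the order may vary.
print(sequential_covering(data, ["student", "age"]))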

Classification by Backpropagation
 Backpropagation: A neural network learning algorithm

 During the learning phase, the network learns by adjusting the weights so as to be able to
predict the correct class label of the input tuples

 Also referred to as connectionist learning due to the connections between units

Neural Network as a Classifier

 Weakness

 Long training time


 Require a number of parameters typically best determined empirically, e.g., the
network topology or “structure."
 Poor interpretability: Difficult to interpret the symbolic meaning behind the learned
weights and of “hidden units" in the network
 Strength
 High tolerance to noisy data
 Ability to classify untrained patterns
 Well-suited for continuous-valued inputs and outputs
 Successful on a wide array of real-world data
 Algorithms are inherently parallel
 Techniques have recently been developed for the extraction of rules from trained
neural networks

A Multi-Layer Feed-Forward Neural Network

 The inputs to the network correspond to the attributes measured for each training tuple
 Inputs are fed simultaneously into the units making up the input layer
 They are then weighted and fed simultaneously to a hidden layer
 The number of hidden layers is arbitrary, although usually only one
 The weighted outputs of the last hidden layer are input to units making up the output layer,
which emits the network's prediction
 The network is feed-forward in that none of the weights cycles back to an input unit or to an
output unit of a previous layer

Backpropagation
 Iteratively process a set of training tuples & compare the network's prediction with the
actual known target value
 For each training tuple, the weights are modified to minimize the mean squared error
between the network's prediction and the actual target value
 Modifications are made in the “backwards” direction: from the output layer, through each
hidden layer down to the first hidden layer, hence “backpropagation”
 Steps
o Initialize weights (to small random #s) and biases in the network
o Propagate the inputs forward (by applying activation function)
o Backpropagate the error (by updating weights and biases)
o Terminating condition (when error is very small, etc.)

Backpropagation Algorithm
Input: Data set D, learning rate l, and a multilayer feed-forward network. Output: A trained neural network.

Backpropagation Example:
Learning Rate= 0.9

X= (1, 0, 1) , with class label of 1 (Tj=1)

Step 1: Net input and output calculation

Formula required for calculating net input: Ij = Σi wij·Oi + θj

Formula required for calculating output: Oj = 1 / (1 + e^(−Ij))

Unit j   Net input Ij   Output Oj
4   I4 = (w14·x1 + w24·x2 + w34·x3) + θ4 = (0.2·1 + 0.4·0 + (−0.5)·1) + (−0.4) = −0.7   O4 = 1/(1 + e^0.7) = 0.332
5   I5 = (w15·x1 + w25·x2 + w35·x3) + θ5 = ((−0.3)·1 + 0.1·0 + 0.2·1) + 0.2 = 0.1   O5 = 1/(1 + e^−0.1) = 0.525
6   I6 = (w46·O4 + w56·O5) + θ6 = ((−0.3)·0.332 + (−0.2)·0.525) + 0.1 = −0.105   O6 = 1/(1 + e^0.105) = 0.474

Step 2: Calculation of the error at each node


Formulas required:
For the output node/layer: Errj = Oj(1 − Oj)(Tj − Oj)
For a middle/hidden layer: Errj = Oj(1 − Oj) Σk Errk·wjk

Unit j   Err j
6   Err6 = O6(1 − O6)(T6 − O6) = 0.474·(1 − 0.474)·(1 − 0.474) = 0.1311
5   Err5 = O5(1 − O5)·Err6·w56 = 0.525·(1 − 0.525)·(0.1311·(−0.2)) = −0.0065
4   Err4 = O4(1 − O4)·Err6·w46 = 0.332·(1 − 0.332)·(0.1311·(−0.3)) = −0.0087

Step 3: Weight and bias updating (just for understanding)

Each weight and bias is then updated with the learning rate l = 0.9 using the standard rules
wij = wij + (l)·Errj·Oi and θj = θj + (l)·Errj
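The whole worked example can be checked with a short Python sketch. The weights and biases below are the ones used in the tables above, and the Step 3 update uses the rules just stated, so treat this as an illustrative check rather than part of the original notes:

import math

l = 0.9                                      # learning rate
x = {1: 1.0, 2: 0.0, 3: 1.0}                 # input tuple X = (1, 0, 1)
T6 = 1.0                                     # target class label
w = {(1, 4): 0.2, (2, 4): 0.4, (3, 4): -0.5,
     (1, 5): -0.3, (2, 5): 0.1, (3, 5): 0.2,
     (4, 6): -0.3, (5, 6): -0.2}
theta = {4: -0.4, 5: 0.2, 6: 0.1}

sigmoid = lambda i: 1.0 / (1.0 + math.exp(-i))

# Step 1: propagate the inputs forward
O = dict(x)
for j in (4, 5):
    O[j] = sigmoid(sum(w[(i, j)] * O[i] for i in (1, 2, 3)) + theta[j])
O[6] = sigmoid(w[(4, 6)] * O[4] + w[(5, 6)] * O[5] + theta[6])

# Step 2: backpropagate the error
err = {6: O[6] * (1 - O[6]) * (T6 - O[6])}
for j in (4, 5):
    err[j] = O[j] * (1 - O[j]) * err[6] * w[(j, 6)]

# Step 3: update weights and biases
for (i, j) in w:
    w[(i, j)] += l * err[j] * O[i]
for j in theta:
    theta[j] += l * err[j]

print(round(O[4], 3), round(O[5], 3), round(O[6], 3))    # 0.332 0.525 0.474
print({j: round(e, 4) for j, e in err.items()})          # matches the tables above up to rounding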


Bayesian Belief Network
Bayesian belief networks—probabilistic graphical models, which unlike naïve Bayesian classifiers
allow the representation of dependencies among subsets of attributes. Bayesian belief networks can
be used for classification.

Note: Before reading this topic just go through naïve bayes classifier (Bayesian network.ppt)

Concepts and Mechanisms


 The naive Bayesian classifier makes the assumption of class conditional independence, that is,
given the class label of a tuple, the values of the attributes are assumed to be conditionally
independent of one another.
 This simplifies computation. When the assumption holds true, then the naïve Bayesian
classifier is the most accurate in comparison with all other classifiers.
 In practice, however, dependencies can exist between variables. Bayesian belief networks
specify joint conditional probability distributions. They allow class conditional
independencies to be defined between subsets of variables.
 They provide a graphical model of causal relationships, on which learning can be performed.
 Trained Bayesian belief networks can be used for classification. Bayesian belief networks are
also known as belief networks, Bayesian networks, and probabilistic networks. For
brevity, we will refer to them as belief networks.
 A belief network is defined by two components—a directed acyclic graph and a set of
conditional probability tables (Figure 1).
 Each node in the directed acyclic graph represents a random variable. The variables may be
discrete- or continuous-valued. They may correspond to actual attributes given in the data or
to “hidden variables” believed to form a relationship (e.g., in the case of medical data, a hidden
variable may indicate a syndrome, representing a number of symptoms that, together,
characterize a specific disease).
 Each arc represents a probabilistic dependence. If an arc is drawn from a node Y to a node Z,
then Y is a parent or immediate predecessor of Z, and Z is a descendant of Y. Each variable
is conditionally independent of its nondescendants in the graph, given its parents.

 Figure 5.5 is a simple belief network, adapted from Russell, Binder, Koller, and Kanazawa
[RBKK95] for six Boolean variables.
 The arcs in Figure 5.5(a) allow a representation of causal knowledge. For example, having lung
cancer is influenced by a person’s family history of lung cancer, as well as whether or not the
person is a smoker.
 Note that the variable PositiveXRay is independent of whether the patient has a family history
of lung cancer or is a smoker, given that we know the patient has lung cancer. In other words,
once we know the outcome of the variable LungCancer, then the variables FamilyHistory and
Smoker do not provide any additional information regarding PositiveXRay.
 The arcs also show that the variable LungCancer is conditionally independent of Emphysema,
given its parents, FamilyHistory and Smoker.
 A belief network has one conditional probability table (CPT) for each variable.
 The CPT for a variable Y specifies the conditional distribution P(Y | Parents(Y)), where
Parents(Y) are the parents of Y.
 Figure 5.5(b) shows a CPT for the variable LungCancer. The conditional probability for each
known value of LungCancer is given for each possible combination of the values of its parents.
For instance, from the upper leftmost and bottom rightmost entries, respectively, we see that
P(LungCancer = yes | FamilyHistory = yes, Smoker = yes) = 0.8 and
P(LungCancer = no | FamilyHistory = no, Smoker = no) = 0.9.
Let X = (x1, . . . , xn) be a data tuple described by the variables or attributes Y1, . . . , Yn, respectively.
Recall that each variable is conditionally independent of its nondescendants in the network graph,
given its parents. This allows the network to provide a complete representation of the existing joint
probability distribution with the following equation:

P(x1, . . . , xn) = ∏(i=1 to n) P(xi | Parents(Yi))

where P(x1, . . . , xn) is the probability of a particular combination of values of X, and the values for
P(xi |Parents(Yi)) correspond to the entries in the CPT for Yi . A node within the network can be
selected as an “output” node, representing a class label attribute. There may be more than one output
node. Various algorithms for inference and learning can be applied to the network. Rather than
returning a single class label, the classification process can return a probability distribution that gives
the probability of each class. Belief networks can be used to answer probability of evidence queries
(e.g., what is the probability that an individual will have LungCancer, given that they have both
PositiveXRay and Dyspnea) and most probable explanation queries (e.g., which group of the
population is most likely to have both PositiveXRay and Dyspnea).
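As a concrete illustration of the factored joint probability, the short sketch below multiplies CPT entries for the four variables FamilyHistory, Smoker, LungCancer, and PositiveXRay; the probability values are hypothetical placeholders, not the entries of Figure 5.5:

# P(x1, ..., xn) = product of P(xi | Parents(Yi)), evaluated by multiplying CPT entries.
# All numbers below are hypothetical placeholders for illustration.
p_fh = {True: 0.1, False: 0.9}                    # P(FamilyHistory)
p_s = {True: 0.3, False: 0.7}                     # P(Smoker)
p_lc = {(True, True): 0.8, (True, False): 0.5,    # P(LungCancer = yes | FH, S)
        (False, True): 0.7, (False, False): 0.1}
p_xray = {True: 0.9, False: 0.2}                  # P(PositiveXRay = yes | LungCancer)

def joint(fh, s, lc, xray):
    p = p_fh[fh] * p_s[s]
    p *= p_lc[(fh, s)] if lc else 1 - p_lc[(fh, s)]
    p *= p_xray[lc] if xray else 1 - p_xray[lc]
    return p

print(joint(True, True, True, True))   # P(FH = yes, S = yes, LC = yes, XRay = positive)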
Hidden Markov Model
 The Hidden Markov Model is one of the most popular graphical models.
 It is used in speech processing and in a lot of statistical work.
 The HMM generally works on a set of temporal data. At each clock tick the system moves into
a new state, which can be the same as the previous one.
 Its power comes from the fact that it deals with situations where you have a Markov model,
but you do not know exactly which state of the Markov model you are in—instead, you see
observations that do not uniquely identify the state. This is where the hidden in the title comes
from.
 Performing inference on the HMM is not that computationally expensive, which is a big
improvement over the more general Bayesian network.
 The applications that it is most commonly applied to are temporal: a set of measurements
made at regular time intervals, which comprise the observations of the state. In fact, the HMM
is the simplest dynamic Bayesian network, a Bayesian network that deals with sequential
(often time-series) data. Figure 5.6 shows the HMM as a graphical model.

FIGURE 5.6 The Hidden Markov Model is an example of a dynamic Bayesian network. The figure
shows the first three states and the related observations unrolled as time progresses.(On is
observation and Wn is state)

 The example that we will use is this: As a caring teacher I want to know whether or not you
are actually working towards the exam.
 I know that there are four things that you do in the evenings (go to the pub, watch TV, go to a
party, study) and I want to work out whether or not you are studying.
 However, I can’t just ask you, because you would probably lie to me. So all I can do is try to
make observations about your behaviour and appearance. Specifically, I can probably work
out if you look tired, hungover, scared, or fine.
 I want to use these observations to try to work out what you did last night.
The problem is that I don’t know why you look the way you do, but I can guess by assigning
probabilities to those things. So if you look hungover, then I might give probability 0.5 to the guess
that you went to the pub last night, 0.25 to the guess that you went to a party, 0.2 to watching TV, and
0.05 to studying. In fact, we will use these the other way round, using the probability that you look
hungover given what you did last night. These are known as observation or emission probabilities.

 Each day that I see you in lectures I make an observation of your appearance, o(t), and I want
to use that observation to guess the state w(t).
 This requires me to build up some kind of probabilities P(Ok(t)|wj(t)), which is the probability
that I see observation Ok (e.g., you are tired) given that you were in state wj (e.g., you went to a
party) last night. These are usually labelled as bj(ok).
 The other information that I have, or think I have, is the transition probability, which tells
me how likely you are to be in state wj tonight given that you were in state wi last night.
 So if I think you were at the pub last night I will probably guess that the probability of you
being there again tonight is small because your student loan won’t be able to handle it. This is
written as P(wj(t+1)|wi(t)) and is usually labeled as ai,j .
 I can add one more constraint to each of the probability distributions ai,j and bi.
 I know that you did something last night, so ∑j ai,j = 1 and
 I know that I will make some observation (since if you aren’t in the lecture I’ll assume you
were too tired), so ∑k bj(ok) = 1.
 There is one other thing that is generally assumed, which is that the Markov chain is ergodic,
it means that there is a non-zero probability of reaching every state eventually, no matter what
the starting state.

The HMM itself is made up of the transition probabilities ai,j and the observation probabilities bj(ok),
and the probability of starting in each of the states, π. So these are the things that I need to specify for
myself, starting with the transition probabilities (which are also shown in Figure 5.7):

FIGURE 5.7 The example HMM with transition and observation probabilities shown.
The Forward Algorithm
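The notes do not reproduce the algorithm here. As a sketch, the forward algorithm computes P(o1, . . . , oT) by summing over all hidden-state paths while reusing the partial sums αt(j); in the NumPy version below, A holds the transition probabilities ai,j, B the observation probabilities bj(ok), and pi the initial state distribution (all assumed given):

import numpy as np

def forward(pi, A, B, observations):
    # pi: (N,) initial state probabilities
    # A:  (N, N) transition probabilities, A[i, j] = P(w_j at t+1 | w_i at t)
    # B:  (N, M) observation probabilities, B[j, k] = P(o_k | w_j)
    # observations: sequence of observation indices o_1, ..., o_T
    alpha = pi * B[:, observations[0]]        # alpha_1(j) = pi_j * b_j(o_1)
    for o in observations[1:]:
        alpha = (alpha @ A) * B[:, o]         # alpha_t(j) = (sum_i alpha_{t-1}(i) a_{i,j}) * b_j(o_t)
    return alpha.sum()                        # P(o_1, ..., o_T)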
Ch.6 Dimensionality Reduction
Dimension Reduction refers to the process of converting a set of data having vast dimensions into
data with lesser dimensions ensuring that it conveys similar information concisely. These techniques
are typically used while solving machine learning problems to obtain better features for a
classification or regression task.

Further, the dimensionality is an explicit factor for the computational cost of many algorithms. These
are some of the reasons why dimensionality reduction is useful. However, it can also remove noise,
significantly improve the results of the learning algorithm, make the dataset easier to work with, and
make the results easier to understand.

• From a theoretical point of view, increasing the number of features should lead to better
performance.

• In practice, beyond a certain point the inclusion of more features often leads to worse
performance (i.e., the curse of dimensionality).

• The number of training examples required increases exponentially with dimensionality.

• Significant improvements can be achieved by first mapping the data into a lower-dimensional
space.

• Dimensionality can be reduced by:

− Combining features using a linear or non-linear transformation.

− Selecting a subset of features (i.e., feature selection).

There are three different ways to do dimensionality reduction.

 The first is feature selection, which typically means looking through the features that are
available and seeing whether or not they are actually useful, i.e., correlated to the output
variables.
 The second method is feature derivation, which means deriving new features from the old
ones, generally by applying transforms to the dataset that simply change the axes
(coordinate system) of the graph by moving and rotating them, which can be written simply
as a matrix that we apply to the data. The reason this performs dimensionality reduction is
that it enables us to combine features, and to identify which are useful and which are not.
 The third method is simply to use clustering in order to group together similar datapoints,
and to see whether this allows fewer features to be used.
Dimensionality reduction techniques

There are dimensionality reduction techniques that work on labeled (supervised) and unlabeled
(unsupervised) data. Here we’ll focus on unlabeled data because it’s applicable to both types.

 The first method for dimensionality reduction is called principal component analysis (PCA).
 In PCA, the dataset is transformed from its original coordinate system to a new coordinate
system.
 The new coordinate system is chosen by the data itself. The first new axis is chosen in the
direction of the most variance in the data. The second axis is orthogonal to the first axis
and in the direction of an orthogonal axis with the largest variance.
 This procedure is repeated for as many features as we had in the original data. We’ll find
that the majority of the variance is contained in the first few axes.
 Therefore, we can ignore the rest of the axes, and we reduce the dimensionality of our
data.

 Factor analysis is another method for dimensionality reduction.


 In factor analysis, we assume that some unobservable latent variables are generating the
data we observe.
 The data we observe is assumed to be a linear combination of the latent variables and
some noise.
 The number of latent variables is possibly lower than the amount of observed data,
which gives us the dimensionality reduction.
 Factor analysis is used in social sciences, finance, and other areas.

 Another common method for dimensionality reduction is independent component analysis


(ICA).
 ICA assumes that the data is generated by N sources, which is similar to factor analysis.
 The data is assumed to be a mixture of observations of the sources.
 The sources are assumed to be statistically independent, unlike PCA, which assumes the
data is uncorrelated.
 As with factor analysis, if there are fewer sources than the amount of our observed data,
we’ll get a dimensionality reduction.

Of the three methods of dimensionality reduction, PCA is by far the most commonly used.

Principal Component Analysis (PCA)

• Dimensionality reduction implies information loss; PCA preserves as much information as
possible by minimizing the reconstruction error (the error between the original data and the
data reconstructed from the lower-dimensional representation).
• How should we determine the “best” lower dimensional space?
The “best” low-dimensional space can be determined by the “best” eigenvectors of the
covariance matrix of the data (i.e., the eigenvectors corresponding to the “largest” eigen values
– also called “principal components”).

Dimensionality problem
Suppose an object can be represented by extracting some features f1,f2,f3,…,fn. Then
F=( f1,f2,f3,…,fn) is called as feature vector.

Question is how many features have to use? And how many are important? If we use all or
many features then Training data size will also increase and this can degrade the performance
of classifiers.

The solution to this is to reduce the number of features without losing any useful information;
this is where dimensionality reduction comes into the picture. See the figure below, where we are
trying to reduce 2-dimensional data to 1 dimension.

Suppose we have data to classify boys and girls based on their height (h) and weight (w). To
reduce the dimension, we could map all data points onto the h-axis or onto the w-axis. But one
dimension (h or w alone) will not be enough to classify boys and girls, so we need some other
solution, which is Principal Component Analysis (PCA).
From the figure above, we can instead map all data points onto a line chosen in such a way that no
useful information is lost.

Where Z1 is called the first principal component and Z2 the second principal component.
Note that Z1 and Z2 are uncorrelated, orthogonal principal components.

Steps:-

 Suppose x1, x2, ... , xM are N x 1 vectors

 Calculate the mean of all vectors:

μ = (1/M) Σ(i=1 to M) xi

 Now calculate (xi − μ) and (xi − μ)T

 Calculate the covariance matrix and find its eigenvalues and eigenvectors

Note that the eigenvectors corresponding to the highest eigenvalues are considered for further calculation.
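These steps translate almost directly into NumPy. The sketch below assumes the M samples are stacked as the columns of an N x M matrix and keeps the k eigenvectors with the largest eigenvalues:

import numpy as np

def pca(X, k):
    # X: N x M matrix, each column x_i is one N x 1 sample vector
    mu = X.mean(axis=1, keepdims=True)          # mean of all vectors
    Xc = X - mu                                 # (x_i - mu)
    cov = (Xc @ Xc.T) / X.shape[1]              # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues and eigenvectors
    order = np.argsort(eigvals)[::-1]           # largest eigenvalues first
    W = eigvecs[:, order[:k]]                   # top-k principal directions
    return W.T @ Xc                             # k x M projected data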
Independent Components Analysis (ICA)
 There is a related approach to factor analysis that is known as Independent Components
Analysis.
 When we looked at PCA above, the components were chosen so that they were orthogonal
and uncorrelated (so that the covariance matrix was diagonal, i.e., so cov(bi, bj) = 0 if i ≠ j).
 If, instead, we require that the components are statistically independent (so that E[bi bj]
= E[bi]E[bj] as well as the bi being uncorrelated), then we get ICA.
 The common motivation for ICA is the problem of blind source separation. As with factor
analysis, the assumption is that the data we see are actually created by a set of underlying
physical processes that are independent.
 The reason why the data we see are correlated is because of the way the outputs from different
processes have been mixed together. So given some data, we want to find a transformation
that turns it into a mixture of independent sources or components.

The most popular way to describe blind source separation is known as the cocktail party problem.
If you are at a party, then your ears hear lots of different sounds coming from lots of different locations
(different people talking, the clink of glasses, background music, etc.) but you are somehow able to
focus on the voice of the people you are talking to, and can in fact separate out the sounds from all of
the different sources even though they are mixed together. The cocktail party problem is the challenge
of separating out these sources, although there is one wrinkle: for the algorithm to work, you need as
many ears as there are sources.
This is because the algorithm does not have the information we have about what things sound like.

 Suppose that we have two sources making noise (s1(t), s2(t)), where the index t covers the fact
that there are lots of datapoints appearing over time, and two microphones that hear things,
giving inputs (x1(t), x2(t)). The sounds that are heard come from the sources as:

x1 = as1 + bs2,

x2 = cs1 + ds2,

which can be written in matrix form as:

x = As,

 where A is known as the mixing matrix.


 Reconstructing s looks easy now: we just compute s = A−1x. Except that, unfortunately, we
don’t know A. The approximation to A−1 that we work out is generally labelled as W, and it is
a square matrix since we have the same number of microphones as we do sources.
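In practice the unmixing matrix W is usually estimated with an algorithm such as FastICA. A minimal sketch using scikit-learn, assuming two recorded signals x1 and x2 of equal length:

import numpy as np
from sklearn.decomposition import FastICA

def unmix(x1, x2):
    # x1, x2: the two microphone recordings, 1-D arrays of the same length
    X = np.column_stack([x1, x2])        # observations, one row per time step
    ica = FastICA(n_components=2)
    S_est = ica.fit_transform(X)         # estimated independent sources
    return S_est[:, 0], S_est[:, 1]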

Difference between PCA and ICA


Both PCA and ICA try to find a set of vectors, a basis, for the data. So you can write any point (vector)
in your data as a linear combination of the basis.

In PCA the basis you want to find is the one that best explains the variability of your data. The first
vector of the PCA basis is the one that best explains the variability of your data (the principal
direction) the second vector is the 2nd best explanation and must be orthogonal to the first one, etc.
In ICA the basis you want to find is the one in which each vector is an independent component of your
data, you can think of your data as a mix of signals and then the ICA basis will have a vector for each
independent signal.

As an example of ICA consider these two images:

While not 100% perfect it is an excellent separation of the two mixed images.

In a more practical way we can say that PCA helps when you want to find a reduced-rank
representation of your data and ICA helps when you want to find a representation of your data as
independent sub-elements. In layman terms PCA helps to compress data and ICA helps to separate
data.
Ch.7 Learning with Clustering
Points:

K-means clustering, Hierarchical clustering, Expectation Maximization Algorithm, Supervised


learning after clustering, Radial Basis functions

Clustering is the classification of objects into different groups, or more precisely, the
partitioning of a data set into subsets (clusters), so that the data in each subset (ideally)
share some common trait - often according to some defined distance measure.

Types of clustering:

1. Hierarchical algorithms: these find successive clusters using previously established


clusters.

o Agglomerative ("bottom-up"): Agglomerative algorithms begin with each


element as a separate cluster and merge them into successively larger clusters.

o Divisive ("top-down"): Divisive algorithms begin with the whole set and
proceed to divide it into successively smaller clusters.

2. Partitional clustering: Partitional algorithms determine all clusters at once.

They include:

o K-means and derivatives

o Fuzzy c-means clustering

o QT clustering algorithm

K-Means Clustering

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data
(i.e., data without defined categories or groups). The goal of this algorithm is to find groups in
the data, with the number of groups represented by the variable K. The algorithm works iteratively
to assign each data point to one of K groups based on the features that are provided. Data points
are clustered based on feature similarity. The results of the K-means clustering algorithm are:

1. The centroids of the K clusters, which can be used to label new data

2. Labels for the training data (each data point is assigned to a single cluster)
Rather than defining groups before looking at the data, clustering allows you to find and analyze
the groups that have formed organically. The "Choosing K" section below describes how the
number of groups can be determined.

Each centroid of a cluster is a collection of feature values which define the resulting groups.
Examining the centroid feature weights can be used to qualitatively interpret what kind of group
each cluster represents.

Algorithm
The Κ-means clustering algorithm uses iterative refinement to produce a final result. The
algorithm inputs are the number of clusters Κ and the data set. The data set is a collection of
features for each data point. The algorithm starts with initial estimates for the Κ centroids, which
can either be randomly generated or randomly selected from the data set. The algorithm then
iterates between two steps:

1. Data assignment step:


Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest
centroid, based on the squared Euclidean distance. More formally, if ci is the collection of centroids in
set C, then each data point x is assigned to the cluster whose centroid ci minimizes dist(ci, x)²,
where dist( · ) is the standard (L2) Euclidean distance. Let the set of data point assignments for
each ith cluster centroid be Si.

2. Centroid update step:

In this step, the centroids are recomputed. This is done by taking the mean of all data points
assigned to that centroid's cluster.

The algorithm iterates between steps one and two until a stopping criterion is met (i.e., no data points
change clusters, the sum of the distances is minimized, or some maximum number of iterations is
reached).

This algorithm is guaranteed to converge to a result. The result may be a local optimum (i.e. not
necessarily the best possible outcome), meaning that assessing more than one run of the algorithm
with randomized starting centroids may give a better outcome.
Choosing K
The algorithm described above finds the clusters and data set labels for a particular pre-chosen
K. To find the number of clusters in the data, the user needs to run the K-means clustering
algorithm for a range of K values and compare the results. In general, there is no method for
determining the exact value of K, but an accurate estimate can be obtained using the following
techniques.

One of the metrics that is commonly used to compare results across different values of K is the
mean distance between data points and their cluster centroid. Since increasing the number of
clusters will always reduce the distance to data points, increasing K will always decrease this
metric, to the extreme of reaching zero when K is the same as the number of data points. Thus, this
metric cannot be used as the sole target. Instead, mean distance to the centroid as a function of K
is plotted and the "elbow point," where the rate of decrease sharply shifts, can be used to roughly
determine K.

Example: Apply K-means algorithm on given data for k=3. Use C1(2), C2(16), and C3(38) as

initial cluster centers.

Data: 2, 4, 6, 3, 31, 12, 15, 16, 38, 35, 14, 21, 23, 25, 30
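The example can be worked by hand or with the short sketch below, which implements the two steps above for 1-D data. With the given data and initial centers, the assignments stop changing after a couple of passes:

def kmeans_1d(data, centers, max_iter=100):
    centers = list(centers)
    for _ in range(max_iter):
        # Data assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in centers]
        for x in data:
            j = min(range(len(centers)), key=lambda i: (x - centers[i]) ** 2)
            clusters[j].append(x)
        # Centroid update step: mean of the points assigned to each cluster
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:               # stopping criterion: no change
            break
        centers = new_centers
    return centers, clusters

data = [2, 4, 6, 3, 31, 12, 15, 16, 38, 35, 14, 21, 23, 25, 30]
print(kmeans_1d(data, centers=[2, 16, 38]))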

Hierarchical Clustering

• Hierarchical clustering involves creating clusters that have a predetermined ordering from top
to bottom. For example, all files and folders on the hard disk are organized in a hierarchy.
There are two types of hierarchical clustering, Divisive and Agglomerative.

• Use distance matrix as clustering criteria. This method does not require the number of clusters
k as an input, but needs a termination condition

• Clusters are created in levels actually creating sets of clusters at each level.
• Agglomerative method: In this method we assign each observation to its own cluster. Then,
compute the similarity (e.g., distance) between each of the clusters and join the two most
similar clusters. Finally, repeat steps until there is only a single cluster left.

– Initially each item in its own cluster

– Iteratively clusters are merged together

– Bottom Up process

• Divisive method: In this method we assign all of the observations to a single cluster and then
partition the cluster to two least similar clusters. Finally, we proceed recursively on each
cluster until there is one cluster for each observation.

– Initially all items in one cluster

– Large clusters are successively divided

– Top Down process

Hierarchical Methods

• Single Link

• MST Single Link

• Complete Link

• Average Link

Dendrogram

• Dendrogram: a tree data structure which illustrates hierarchical clustering techniques.


• Each level shows clusters for that level.

– Leaf – individual clusters

– Root – one cluster

• A cluster at level i is the union of its children clusters at level i+1.


1. Single Linkage: In single linkage, we define the distance between two clusters to be the
minimum distance between any single data point in the first cluster and any single data point in
the second cluster. On the basis of this definition of distance between clusters, at each stage of
the process we combine the two clusters that have the smallest single linkage distance.

2. Complete Linkage: In complete linkage, we define the distance between two clusters to be the
maximum distance between any single data point in the first cluster and any single data point in
the second cluster. On the basis of this definition of distance between clusters, at each stage of
the process we combine the two clusters that have the smallest complete linkage distance.

3. Average Linkage: In average linkage, we define the distance between two clusters to be the
average distance between data points in the first cluster and data points in the second cluster. On
the basis of this definition of distance between clusters, at each stage of the process we combine
the two clusters that have the smallest average linkage distance.

4. Centroid Method: In centroid method, the distance between two clusters is the distance between
the two mean vectors of the clusters. At each stage of the process we combine the two clusters
that have the smallest centroid distance.

5. Ward’s Method: This method does not directly define a measure of distance between two points
or clusters. It is an ANOVA-based approach. At each stage, the two clusters are merged that provide
the smallest increase in the combined error sum of squares from one-way univariate
ANOVAs that can be done for each variable with groups defined by the clusters at that stage of
the process.

In the following table the form of the distances is described. The graph gives a geometric
interpretation.

Notationally, define

 X1, X2, ... , Xk = Observations from cluster 1


 Y1, Y2, ... , Yl = Observations from cluster 2
 d ( x,y ) = Distance between a subject with observation vector x and a subject with
observation vector y

Linkage Methods for Measuring Association d12 Between Clusters 1 and 2

 Single Linkage: the distance between the closest members of the two clusters.
 Complete Linkage: the distance between the members that are farthest apart (most dissimilar).
 Average Linkage: this method involves looking at the distances between all pairs and averages all of these distances. It is also called UPGMA – Unweighted Pair Group Mean Averaging.
 Centroid Method: this involves finding the mean vector location for each of the clusters and taking the distance between these two centroids.
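These linkage criteria are available directly in SciPy, which both merges the clusters and can draw the dendrogram; a small sketch on a made-up 2-D point set:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1, 2], [2, 2], [8, 8], [9, 8], [5, 1]])   # example points
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)    # agglomerative merge history
    dendrogram(Z)
    plt.title(method + " linkage")
    plt.show()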

Expectation Maximization Algorithm

 The Expectation-Maximization (EM) algorithm (Dempster, Laird, and Rubin 1977; Redner and
Walker 1984) is used in maximum likelihood estimation where the problem involves two sets
of random variables of which one, X, is observable and the other, Z, is hidden.
 The goal of the algorithm is to find the parameter vector Φ that maximizes the likelihood of
the observed values of X, L(Φ|X).
 But in cases where this is not feasible, we associate the extra hidden variables Z and express
the underlying model using both, to maximize the likelihood of the joint distribution of X and
Z, the complete likelihood Lc(Φ|X,Z).
 Since the Z values are not observed, we cannot work directly with the complete data likelihood
Lc ; instead, we work with its expectation, Q, given X and the current parameter values Φl,
where l indexes iteration.
 This is the expectation (E) step of the algorithm. Then in the maximization (M) step, we look
for the new parameter values, Φl+1, that maximize this.

Thus the two steps are:

E-step: Q(Φ|Φl) = E[ log Lc(Φ|X, Z) | X, Φl ]
M-step: Φl+1 = arg maxΦ Q(Φ|Φl)

Dempster, Laird, and Rubin (1977) proved that an increase in Q implies an increase in the
incomplete likelihood: L(Φl+1|X) ≥ L(Φl|X).

 In the case of mixtures, the hidden variables are the sources of observations, namely, which
observation belongs to which component.
 If these were given, for example, as class labels in a supervised setting, we would know which
parameters to adjust to fit that data point.
 The EM algorithm works as follows: in the E-step we estimate these labels given our current
knowledge of components, and in the M-step we update our component knowledge given the
labels estimated in the E-step.
 These two steps are the same as the two steps of k-means; calculation of bti (E-step) and
reestimation of mi (M-step).
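As a concrete illustration, a sketch of EM for a two-component 1-D Gaussian mixture: the E-step computes the posterior responsibilities (playing the role of the bti above) and the M-step re-estimates the means, variances, and mixing proportions. The starting values are arbitrary choices for illustration:

import numpy as np

def em_gmm_1d(x, n_iter=50):
    # Initial guesses for the two components (arbitrary starting values)
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities
        Nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / len(x)
    return mu, var, pi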
Supervised Learning after Clustering

 Clustering, like the dimensionality reduction methods can be used for two purposes: it can be
used for data exploration, to understand the structure of data.
 Dimensionality reduction methods are used to find correlations between variables and thus
group variables; clustering methods, on the other hand, are used to find similarities between
instances and thus group instances.
 If such groups are found, these may be named (by application experts) and their attributes be
defined.
 One can choose the group mean as the representative prototype of instances in the group, or
the possible range of attributes can be written.
 This allows a simpler description of the data.
 For example, if the customers of a company seem to fall in one of k groups, called segments,
customers being defined in terms of their demographic attributes and transactions with the
company, then a better understanding of the customer base will be provided that will allow
the company to provide different strategies for different types of customers; this is part of
customer relationship management (CRM).
 Likewise, the company will also be able to develop strategies for those customers who do not
fall in any large group, and who may require attention, for example, churning customers.
 Frequently, clustering is also used as a preprocessing stage.
 Just like the dimensionality reduction methods which allowed us to make a mapping to a new
space, after clustering, we also map to a new k-dimensional space where the dimensions are
hi (or bi at the risk of loss of information).
 In a supervised setting, we can then learn the discriminant or regression function in this new
space.
 The difference from dimensionality reduction methods like PCA however is that k, the
dimensionality of the new space, can be larger than d, the original dimensionality.
 When we use a method like PCA, where the new dimensions are combinations of the original
dimensions, to represent any instance in the new space, all dimensions contribute; that is, all
zj are nonzero.
 In the case of a method like clustering where the new dimensions are defined locally, there are
many more new dimensions, bj , but only one (or if we use hj , few) of them have a nonzero
value.
 In the former case, where there are few dimensions but all contribute, we have a distributed
representation; in the latter case, where there are many dimensions but few contribute, we
have a local representation.
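A minimal scikit-learn sketch of this use of clustering as preprocessing: cluster the data, map each instance to its distances from the k centres (a new k-dimensional space), and then train a supervised learner there. The toy data and the choice of k are only for illustration:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                        # toy data, d = 5 original dimensions
y = (X[:, 0] + X[:, 1] > 0).astype(int)              # toy labels

k = 10                                               # new dimensionality (may exceed d)
km = KMeans(n_clusters=k, n_init=10).fit(X)
H = km.transform(X)                                  # N x k: distance of each instance to each centre
clf = LogisticRegression(max_iter=1000).fit(H, y)    # supervised learning in the new space
print(clf.score(H, y))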

Radial Basis functions


• This is becoming an increasingly popular neural network with diverse applications and is
probably the main rival to the multi-layered perceptron

• Much of the inspiration for RBF networks has come from traditional statistical pattern
classification techniques

• The basic architecture for a RBF is a 3-layer network, as shown in Fig.

• The input layer is simply a fan-out layer and does no processing.


• The second or hidden layer performs a non-linear mapping from the input space into a
(usually) higher dimensional space in which the patterns become linearly separable.

[Figure: a 3-layer RBF network – inputs x1, x2, x3 enter a fan-out input layer, then a hidden layer whose weights correspond to cluster centres (output function usually Gaussian), then an output layer forming a linear weighted sum to give outputs y1, y2.]

Output layer

• The final layer performs a simple weighted sum with a linear output.

• If the RBF network is used for function approximation (matching a real number) then this
output is fine.
• However, if pattern classification is required, then a hard-limiter or sigmoid function could
be placed on the output neurons to give 0/1 output values.
Clustering

• The unique feature of the RBF network is the process performed in the hidden layer.

• The idea is that the patterns in the input space form clusters.

• If the centres of these clusters are known, then the distance from the cluster centre can be
measured.

• Furthermore, this distance measure is made non-linear, so that if a pattern is in an area that
is close to a cluster centre it gives a value close to 1.

• Beyond this area, the value drops dramatically.


• The notion is that this area is radially symmetrical around the cluster centre, so that the non-
linear function becomes known as the radial-basis function.

Gaussian function

• The most commonly used radial-basis function is a Gaussian function, φ(r) = exp(−r² / (2σ²))

• In an RBF network, r is the distance from the cluster centre.

The equation represents a Gaussian bell-shaped curve, as shown in Fig.

[Figure: the Gaussian bell-shaped curve, φ plotted against x.]

Distance measure

• The distance measured from the cluster centre is usually the Euclidean distance.

• For each neuron in the hidden layer, the weights represent the co-ordinates of the centre of
the cluster.
• Therefore, when that neuron receives an input pattern, X, the distance is found using the
following equation:

rj = √( Σ(i=1 to n) (xi − wij)² )


Width of hidden unit basis function

(hidden unit)j = exp( − Σ(i=1 to n) (xi − wij)² / (2σ²) )
The variable sigma, σ, defines the width or radius of the bell shape and is something that has to be
determined empirically. When the distance from the centre of the Gaussian reaches σ, the output
drops from 1 to about 0.6.
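Putting the pieces together, the sketch below builds an RBF network in which the hidden-layer centres come from k-means, the hidden activations are the Gaussian above, and the output weights are found by a linear least-squares fit. The number of centres and the value of sigma are illustrative choices that would normally be determined empirically:

import numpy as np
from sklearn.cluster import KMeans

def rbf_fit(X, y, n_centres=10, sigma=1.0):
    # Hidden layer: centres found by clustering the inputs
    centres = KMeans(n_clusters=n_centres, n_init=10).fit(X).cluster_centers_
    # Gaussian of the Euclidean distance from each input to each centre
    r2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    H = np.exp(-r2 / (2 * sigma ** 2))
    # Output layer: linear weighted sum, solved by least squares
    w, *_ = np.linalg.lstsq(H, y, rcond=None)
    return centres, w

def rbf_predict(X, centres, w, sigma=1.0):
    r2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-r2 / (2 * sigma ** 2)) @ w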
Ch.8 Reinforcement Learning
Points: Introduction, Elements of Reinforcement Learning, Model based learning, Temporal
Difference Learning, Generalization, Partially Observable States.

In reinforcement learning, the learner is a decision-making agent that takes actions in an environment
and receives reward (or penalty) for its actions in trying to solve a problem. After a set of trial-and-error
runs, it should learn the best policy, which is the sequence of actions that maximize the total reward.

Introduction
Let us say we want to build a machine that learns to play chess. In this case we cannot use a supervised
learner for two reasons. First, it is very costly to have a teacher that will take us through many games
and indicate to us the best move for each position. Second, in many cases, there is no such thing as the
best move; the goodness of a move depends on the moves that follow. A single move does not count;
a sequence of moves is good if after playing them we win the game. The only feedback is at the end of
the game when we win or lose the game. Another example is a robot that is placed in a maze. The
robot can move in one of the four compass directions and should make a sequence of movements to
reach the exit. As long as the robot is in the maze, there is no feedback and the robot tries many moves
until it reaches the exit and only then does it get a reward. In this case there is no opponent, but we
can have a preference for shorter trajectories, implying that in this case we play against time.
These two applications have a number of points in common: there is a decision maker, called the
agent, that is placed in an environment (see figure 8.1). In chess, the game-player is the decision maker
and the environment is the board; in the second case, the maze is the environment of the robot. At
any time, the environment is in a certain state that is one of a set of possible states—for example, the
state of the board, the position of the robot in the maze. The decision maker has a set of actions
possible: legal movement of pieces on the chess board, movement of the robot in possible directions
without hitting the walls, and so forth. Once an action is chosen and taken, the state changes. The
solution to the task requires a sequence of actions, and we get feedback, in the form of a reward, only
rarely, generally when the complete sequence is carried out. The reward defines the problem and is
necessary if we want a learning agent. The learning agent learns the best sequence of actions to solve
a problem where “best” is quantified as the sequence of actions that has the maximum cumulative
reward. Such is the setting of reinforcement learning.
Elements of Reinforcement Learning
Beyond the agent and the environment, one can identify four main sub elements of a reinforcement
learning system:
• a policy,
• a reward function,
• a value function, and,
• optionally, a model of the environment.

 A policy π, defines the learning agent's way of behaving at a given time. Roughly speaking, a policy
is a mapping from perceived states of the environment to actions to be taken when in those states.
π : S → A (S-state , A-Action)
• It corresponds to what in psychology would be called a set of stimulus-response rules or
associations.
• In some cases the policy may be a simple function or lookup table, whereas in others it
may involve extensive computation such as a search process.
• The policy is the core of a reinforcement learning agent in the sense that it alone is
sufficient to determine behavior. In general, policies may be stochastic.

 A reward function defines the goal in a reinforcement learning problem. Roughly speaking, it
maps each perceived state (or state-action pair) of the environment to a single number, a reward,
indicating the intrinsic desirability of that state.
• A reinforcement learning agent's sole objective is to maximize the total reward it receives
in the long run.
• The reward function defines what are the good and bad events for the agent. In a biological
system, it would not be inappropriate to identify rewards with pleasure and pain.
• They are the immediate and defining features of the problem faced by the agent.
• As such, the reward function must necessarily be unalterable by the agent.
• It may, however, serve as a basis for altering the policy.
• For example, if an action selected by the policy is followed by low reward, then the policy
may be changed to select some other action in that situation in the future. In general,
reward functions may be stochastic.
 Whereas a reward function indicates what is good in an immediate sense, a value function
specifies what is good in the long run.
• Roughly speaking, the value of a state is the total amount of reward an agent can expect to
accumulate over the future, starting from that state.
• Whereas rewards determine the immediate, intrinsic desirability of environmental states,
values indicate the long-term desirability of states after taking into account the states that
are likely to follow, and the rewards available in those states.
• For example, a state might always yield a low immediate reward but still have a high value
because it is regularly followed by other states that yield high rewards.
• Or the reverse could be true.
• To make a human analogy, rewards are like pleasure (if high) and pain (if low), whereas
values correspond to a more refined and farsighted judgment of how pleased or
displeased we are that our environment is in a particular state.
 The fourth and final element of some reinforcement learning systems is a model of the
environment.
• This is something that mimics the behavior of the environment.
• For example, given a state and action, the model might predict the resultant next state
and next reward.
• Models are used for planning, by which we mean any way of deciding on a course of
action by considering possible future situations before they are actually experienced.
• The incorporation of models and planning into reinforcement learning systems is a
relatively new development.
• Early reinforcement learning systems were explicitly trial-and-error learners; what they
did was viewed as almost the opposite of planning.

Model based learning


 We start with model-based learning where we completely know the environment model
parameters, p(rt+1|st, at ) and P(st+1|st, at ).
 In such a case, we do not need any exploration and can directly solve for the optimal value
function and policy using dynamic programming.
 The optimal value function is unique and is the solution to the simultaneous equations.
 Once we have the optimal value function, the optimal policy is to choose the action that
maximizes the value in the next state:

Where,
π – Policy (π*- Optimal policy), S- State, r- Reward, V*- Expected Cumulative Reward
The problem is modeled using a Markov decision process (MDP). The reward and next state are
sampled from their respective probability distributions, p(rt+1|st, at ) and P(st+1|st, at ).
In the finite-horizon or episodic model, the agent tries to maximize the expected reward for the next
T steps: E[ rt+1 + rt+2 + · · · + rt+T ].

Value Iteration
 To find the optimal policy, we can use the optimal value function, and there is an iterative
algorithm called value iteration that has been shown to converge to the correct V∗ values. Its
pseudo code is given in figure 8.2.
 We say that the values converged if the maximum value difference between two iterations is
less than a certain threshold δ:

 where l is the iteration counter. Because we care only about the actions with the maximum
value, it is possible that the policy converges to the optimal one even before the values
converge to their optimal values.
 Each iteration is O(|S|2|A|), but frequently there is only a small number k <|S| of next possible
states, so complexity decreases to O(k |S| |A|).
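A compact sketch of value iteration for a small finite MDP (the pseudo code of figure 8.2 is not reproduced here). The transition model is given per state and action as a list of (probability, next state, reward) triples; the two-state MDP at the bottom is made up purely for illustration:

def value_iteration(P, gamma=0.9, delta=1e-6):
    # P: dict mapping state -> action -> list of (prob, next_state, reward)
    V = {s: 0.0 for s in P}
    while True:
        V_new = {}
        for s in P:
            V_new[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                           for a in P[s])
        if max(abs(V_new[s] - V[s]) for s in P) < delta:   # convergence test
            return V_new
        V = V_new

P = {
    "s0": {"a0": [(1.0, "s1", 0.0)], "a1": [(1.0, "s0", 1.0)]},
    "s1": {"a0": [(1.0, "s0", 2.0)], "a1": [(1.0, "s1", 0.0)]},
}
print(value_iteration(P))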

Policy Iteration
 In policy iteration, we store and update the policy rather than doing this indirectly over the
values.
 The pseudo code is given in figure 8.3.
 The idea is to start with a policy and improve it repeatedly until there is no change.
 The value function can be calculated by solving for the linear equations.
 We then check whether we can improve the policy by taking these into account. This step is
guaranteed to improve the policy, and when no improvement is possible, the policy is
guaranteed to be optimal.
 Each iteration of this algorithm takes O(|A||S|2 +|S|3) time that is more than that of value
iteration, but policy iteration needs fewer iterations than value iteration.

Temporal Difference Learning


 An obvious approach to learning the value function is to update the estimate of the value
function when the actual return Rt is known: V(st) ← V(st) + α[Rt − V(st)].
 This method is called the constant-α Monte Carlo method, where α
is a learning parameter between 0 and 1.
 Since the actual return is the sum of all future rewards, this algorithm must wait until the end
of the episode when the expected return is known before the value function is updated.
 Richard Sutton proposed to instead estimate the expected return by the next reward plus the
value of the next state.
 The update to the value function takes the difference of successive estimates of the value
function, thus the name temporal difference. The simplest method, known as TD(0), is given by
V(st) ← V(st) + α[rt+1 + γV(st+1) − V(st)].
More generally, the expected return can be estimated by the next n rewards, giving the n-step
return Rt(n) = rt+1 + γrt+2 + · · · + γ^(n−1)rt+n + γ^n V(st+n).

Richard Sutton proposed the TD(λ) algorithm, which mixes TD(0) and Monte Carlo methods and
uses a weighted average of n-step returns. The resulting estimate of the expected return, called
the λ-return, is defined as

Rt(λ) = (1 − λ) Σ(n=1 to ∞) λ^(n−1) Rt(n)

If λ=0, this reduces to Rt(1), which is the TD(0) algorithm. If λ=1, this reduces to Rt(∞), or just Rt,
which is the constant-α Monte Carlo method.
 The parameter λ can vary between 0 and 1, and provides a trade off between updating based
on the final result and updating based only on the next estimate.
 The definition given above is called the forward view of TD(λ), since the weight update
requires knowledge about the future.
 Richard Sutton also proved an equivalent view of TD(λ), called the backward view. In the
backward view of TD(λ), every state has an eligibility trace at time t, denoted et(s).
 On each step, the eligibility traces for all states are decayed by γλ and the trace of the current
state gets incremented by one. The TD(λ) algorithm becomes
δt = rt+1 + γV(st+1) − V(st),   V(s) ← V(s) + α δt et(s) for all s.
 In the backward view, the value function of all recently visited states is updated by δt, where
“recently visited” is defined in terms of the states eligibility trace.
 Temporal difference learning can be easily extended to the control problem, that is, learning
the optimal policy π*.
 One policy learning method is to use an on-line ε-greedy policy while training: instead of
learning V(s) learn Q(s,a), then every time-step select the action with the largest value of Q(s,a)
1-ε percent of the time and select a random action ε percent of the time.
 The random actions cause the agent to explore the state space.
 The TD(λ) is extended to the Sarsa(λ) algorithm for on-line control by replacing V(s) with
Q(s,a) and et(s) with et(s,a).
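A sketch of tabular Sarsa(0) with an ε-greedy policy. The environment interface (reset(), step(action) returning next state, reward, and a done flag, plus an actions list) is an assumption made for the example, not a specific library API:

import random
from collections import defaultdict

def sarsa(env, n_episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
    Q = defaultdict(float)

    def choose(s):
        # epsilon-greedy: explore with probability eps, otherwise act greedily
        if random.random() < eps:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()
        a = choose(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = choose(s2)
            # TD update: Q(s,a) <- Q(s,a) + alpha [r + gamma Q(s',a') - Q(s,a)]
            Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2
    return Q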

Generalization
 Until now, we assumed that the Q(s, a) values (or V(s), if we are estimating values of states) are
stored in a lookup table, and the algorithms we considered earlier are called tabular
algorithms.
 There are a number of problems with this approach: (1) when the number of states and the
number of actions is large, the size of the table may become quite large; (2) states and actions
may be continuous, for example, turning the steering wheel by a certain angle, and to use a
table, they should be discretized which may cause error; and (3) when the search space is
large, too many episodes may be needed to fill in all the entries of the table with acceptable
accuracy.
 Instead of storing the Q values as they are, we can consider this a regression problem. This is
a supervised learning problem where we define a regressor Q(s, a|θ), taking s and a as inputs
and parameterized by a vector of parameters, θ, to learn Q values.
 For example, this can be an artificial neural network with s and a as its inputs, one output, and
θ its connection weights.
 A good function approximator has the usual advantages and solves the problems discussed
previously. A good approximation may be achieved with a simple model without explicitly
storing the training instances; it can use continuous inputs; and it allows generalization. If we
know that similar (s, a) pairs have similar Q values, we can generalize from past cases and
come up with good Q(s, a) values even if that state-action pair has never been encountered
before.
 To be able to train the regressor, we need a training set. In the case of Sarsa(0), we saw before
that we would like Q(st, at ) to get close to rt+1 + γQ(st+1, at+1).
 So, we can form a set of training samples where the input is the state-action pair (st, at ) and
the required output is rt+1 +γQ(st+1, at+1).
 We can write the squared error as

Et(θ) = [rt+1 + γQ(st+1, at+1) − Q(st , at )]²

 Training sets can similarly be defined for Q(0) and TD(0), where in the latter case we learn
V(s), and the required output is rt+1 + γV(st+1).
 Once such a set is ready, we can use any supervised learning algorithm for learning the
training set.
 If we are using a gradient-descent method, as in training neural networks, the parameter
vector is updated as

Δθ = η[rt+1 + γQ(st+1, at+1) − Q(st, at )]∇θt Q(st , at )

This is a one-step update. In the case of Sarsa(λ), the eligibility trace is also taken into account:

Δθ = ηδtet

where the temporal difference error is

δt = rt+1 + γQ(st+1, at+1) − Q(st, at )


and the vector of eligibilities of parameters are updated as
et = γλet−1 +∇θt Q(st , at )
with e0 all zeros.
 In the case of a tabular algorithm, the eligibilities are stored for the state-action pairs because
they are the parameters (stored as a table).
 In the case of an estimator, eligibility is associated with the parameters of the estimator. We
also note that this is very similar to the momentum method for stabilizing backpropagation.
 The difference is that in the case of momentum previous weight changes are remembered,
whereas here previous gradient vectors are remembered.

Partially Observable States


 In certain applications, the agent does not know the state exactly.
 It is equipped with sensors that return an observation, which the agent then uses to estimate
the state.
 Let us say we have a robot that navigates in a room. The robot may not know its exact location
in the room, or what else is there in the room. The robot may have a camera with which sensory
observations are recorded. This does not tell the robot its state exactly but gives some
indication as to its likely state.
 For example, the robot may only know that there is an obstacle to its right.
 The setting is like a Markov decision process, except that after taking an action at , the new
state st+1 is not known, but we have an observation ot+1 that is a stochastic function of st and
at : p(ot+1|st, at ). This is called a partially observable MDP (POMDP).
 If ot+1 = st+1, then POMDP reduces to the MDP.
 This is just like the distinction between observable and hidden Markov models and the
solution is similar; that is, from the observation, we need to infer the state (or rather a
probability distribution for the states) and then act based on this. If the agent believes that it
is in state s1 with probability 0.4 and in state s2 with probability 0.6, then the value of any
action is 0.4 times the value of the action in s1 plus 0.6 times the value of the action in s2.
 The Markov property does not hold for observations. The next state observation does not only
depend on the current action and observation.
 When there is limited observation, two states may appear the same but are different and if
these two states require different actions, this can lead to a loss of performance, as measured
by the cumulative reward.
 The agent should somehow compress the past trajectory into a current unique state estimate.
These past observations can also be taken into account by taking a past window of
observations as input to the policy, or one can use a recurrent neural network to maintain the
state without forgetting past observations.
 At any time, the agent may calculate the most likely state and take an action accordingly. Or it
may take an action to gather information and reduce uncertainty, for example, search for a
landmark, or stop to ask for directions.
 This implies the importance of the value of information, and indeed POMDPs can be modeled
as dynamic influence diagrams. The agent chooses between actions based on the amount of
information they provide, the amount of reward they produce, and how they change the state
of the environment.
 To keep the process Markov, the agent keeps an internal belief state bt that summarizes its
experience (see figure above).
 The agent has a state estimator that updates the belief state bt+1 based on the last action at ,
current observation ot+1, and its previous belief state bt. There is a policy π that generates the
next action at+1 based on this belief state, as opposed to the actual state that we had in a
completely observable environment.
 The belief state is a probability distribution over states of the environment given the initial
belief state (before we did any actions) and the past observation-action history of the agent
(without leaving out any information that could improve the agent’s performance). Q learning in
such a case involves belief state–action pair values Q(bt, at), instead of the actual state–action pair values Q(st, at).
