
INDEX

S. No Topic Page No.


Week 1
1 Fundamentals of Image Processing Part I 1
2 Fundamentals of Image Processing Part II 21
3 Image Transform Part I 39
4 Image Transform Part II 58
Week 2
5 Projective Geometry – Part I 79
6 Projective Geometry – Part II 94
7 Projective Transformation 114
8 Homography: Properties – Part I 140
9 Homography: Properties – Part II 155
10 Homography: Properties – Part III 172
Week 3
11 Camera Geometry – Part I 189
12 Camera Geometry – Part II 208
13 Camera Geometry – Part III 223
14 Camera Geometry – Part IV 240
15 Camera Geometry – Part V 261
Week 4
16 Stereo Geometry – Part I 279
17 Stereo Geometry – Part II 295
18 Stereo Geometry – Part III 310
19 Stereo Geometry – Part IV 322
Week 5
20 Stereo Geometry – Part VI 334
21 Stereo Geometry – Part V 346
22 Stereo Geometry – Part VII 363
23 Stereo Geometry – Part VIII 380
Week 6
24 Feature Detection And Description – Part I 394
25 Feature Detection And Description – Part II 413
26 Feature Detection And Description – Part III 432
27 Feature Detection And Description – Part IV 449
28 Feature Detection And Description – Part V 474
Week 7
29 Feature Matching And Model Fitting- Part I 485
30 Feature Matching And Model Fitting- Part II 502
31 Feature Matching And Model Fitting- Part III 514
32 Feature Matching And Model Fitting- Part IV 527
33 Feature Matching And Model Fitting- Part V 540
Week 8
34 Color Fundamentals And Processing-Part I 558
35 Color Fundamentals And Processing-Part II 574
36 Color Fundamentals And Processing-Part III 591
37 Color Fundamentals And Processing-Part IV 613
38 Color Fundamentals And Processing-Part V 633
39 Color Fundamentals And Processing-Part VI 651
40 Color Fundamentals And Processing-Part VII 665
Week 9
41 Range Image Processing – Part I 679
42 Range Image Processing – Part II 693
43 Range Image Processing – Part III 707
44 Range Image Processing – Part IV 720
45 Range Image Processing – Part V 735
Week 10
46 Clustering And Classification – Part I 757
47 Clustering And Classification – Part II 780
48 Clustering And Classification – Part III 799
49 Clustering And Classification – Part IV 816
50 Clustering And Classification – Part V 832
Week 11
51 Dimensional Reduction And Sparse Representation – Part I 855
52 Dimensional Reduction And Sparse Representation – Part II 871
53 Dimensional Reduction And Sparse Representation – Part III 891
54 Dimensional Reduction And Sparse Representation – Part IV 909
Week 12
55 Deep Neural Architecture And Applications – Part I 927
56 Deep Neural Architecture And Applications – Part II 950
57 Deep Neural Architecture And Applications – Part III 970
58 Deep Neural Architecture And Applications – Part IV 988
59 Deep Neural Architecture And Applications – Part V 1007
60 Deep Neural Architecture And Applications – Part VI 1028
Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 01
Fundamentals of Image Processing – Part I

In this lecture, we will discuss different fundamental operations of image processing and give an overview of image processing.

(Refer Slide Time: 00:26)

So, let us first understand how images are represented in a computer. Consider an image displayed on your screen, and take a small portion of it, as shown by this rectangle. If I zoom into this portion, you will see an enlarged view of that part of the image: some of the details become more visible, but you also start to see small rectangular regions of pixels. If I zoom further into this area, you will observe small squares of uniform brightness, whose values are shown here by the numbers in those squares. So, at the very bottom level of representation, every point of the image carries a number, and that number represents the brightness value at that particular point. While displaying it on a screen, each such point is shown as a small area, and its brightness value is rendered over that area.

So, the image is represented as a 2D array of integers; these numbers are the elements of that array. As you know, for a two-dimensional array you also need to specify the array size, and that size gives the width and height of the image. For example, for this image the width is 256, which means there are 256 points along its width, and the height is 384, so there are 384 points along its height.

(Refer Slide Time: 02:26)

When we represent a colour image, there are three such two-dimensional arrays, one for each of the primary colours red, green and blue. If you consider any particular location, the three array elements with the same array indices together represent a colour as a combination of these three primary colours. So, there is a red component of the image, which can be displayed using only the red colour on the screen.

There is also a green component of the image, which is displayed here in green, and a blue component, displayed in blue. Once we superimpose all these colours on a screen, we get the colour representation of the image itself. So, a colour image in this case is represented by three two-dimensional arrays.

(Refer Slide Time: 03:30)

When this information is stored on a computer hard disk, that is, in secondary storage, you know that any information in secondary storage is stored as a file. Similarly, an image has to be stored as a file, and a file consists of a stream of bytes. In this case every byte, or a group of bytes, represents a pixel, and we can regard the file as a stream of pixels.

However, to use the image in your program, in your computation, you require other associated information about that image, which also has to be stored along with this stream. Usually it is stored ahead of the stream in a predefined format, which is called the header of the image file. For example, a header typically contains the width, the height, and the number of components, which for a colour image should be three, one per colour channel.

It also contains the number of bytes per pixel; as I mentioned, a pixel could be represented by several bytes, but in the most elementary representation it is usually 1 byte per pixel, with unsigned integer values ranging from 0 to 255. Of course, there should also be an end-of-file marker.

There are different standard file formats available for image representation; they are well standardized and their documentation is available. So, when you get an image file in one of these formats, you should know the corresponding format, parse the header, and read the image. Some example formats are TIFF, BMP and GIF.

(Refer Slide Time: 05:31)

So, let us consider how an image is formed in an optical camera. Consider a camera, represented here, in which the lens plays a very critical role. There is a plane onto which points of the 3-dimensional scene are projected; this 2-dimensional plane is usually the focal plane of the lens. The image is formed there, those points are finally sensed by corresponding sensors and digitised, and you get a digital representation.

So, let us see how a point in 3-dimensional space is mapped to an image point. Consider a point P from which light is reflected; that light passes through the centre of the lens and intersects the plane of projection, or the image plane. The point denoted here by the small p is the image of the scene point denoted by the capital P. So, in this case image formation has taken place due to the phenomenon of reflection, as you can see.

There is another piece of information which is also encoded at this image point, namely the amount of reflection, that is, the amount of energy received at this point after being reflected from the point P. That is the action of the lens: it tries to collect as much of the reflected energy as possible and focus it onto this point p on the plane, and that is what is called focusing. That is why you get a very sharp picture if the image is properly focused, a sharp point representation of the same scene point.

So, this is another encoding: the amount of energy reflected from the scene point is received at the image point and sensed there. The interpretation of the image is that it gives a brightness distribution over this 2-dimensional plane, where the value at each point is proportional to the amount of energy reflected from its corresponding scene point.

Let us look at this once again more closely: what is the rule of projection that I mentioned here? It provides a very simple mathematical tool to compute, given a point P in the 3-dimensional world, its corresponding image point on the 2-dimensional plane. As we can see in this case, the image point is obtained if I draw a line from the point P through a particular fixed point O, here the centre of the lens, and extend that ray until it hits the image plane; that intersection point defines the image point of the 3-dimensional scene point.

(Refer Slide Time: 08:53)

So, this is the centre of projection, as I mentioned, and this is the image plane. We can summarize this rule as: the image point is formed by the intersection of the ray from a point P passing through the centre of projection O with the image plane. This kind of projection is known as perspective projection.
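To make the rule concrete, here is a minimal sketch of perspective projection under the usual pinhole assumptions: the centre of projection O is placed at the origin of a camera-centred coordinate frame, the image plane is at distance f (a focal length chosen for illustration) along the Z axis, and a scene point P = (X, Y, Z) maps to the image point (f·X/Z, f·Y/Z).

import numpy as np

def perspective_project(P, f=1.0):
    # Project a 3D scene point P = (X, Y, Z) onto the image plane Z = f
    # through the centre of projection at the origin.
    X, Y, Z = P
    assert Z > 0, "point must lie in front of the camera"
    return np.array([f * X / Z, f * Y / Z])

# Example: a point twice as far away projects to half the image coordinates.
print(perspective_project((2.0, 1.0, 4.0), f=1.0))   # [0.5  0.25]
print(perspective_project((2.0, 1.0, 8.0), f=1.0))   # [0.25 0.125]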

(Refer Slide Time: 09:14)

But this is not the only way images are formed; there can be other kinds of imaging principles, other rules of projection. For example, in the case I am going to show here, an image of this cube is formed where one of the planes of the cube has been projected onto this plane. As you can see, all these points are projected in parallel onto the plane; there is a particular projection direction to be considered in this case, which could be normal to the plane or any other direction in 3-dimensional space.

So, we can summarize this rule as: the image points are formed by the intersection of parallel rays with the image plane. One example of this kind of imaging is X-ray imaging, where parallel X-ray beams pass through our body, through bones and tissues, and then intersect the X-ray plate, which acts like an image plane in this case and forms the image. This projection is known as parallel projection.

(Refer Slide Time: 10:20)

Let us take another imaging principle. Consider the surface of an object, and an imaging sensor consisting of a transmitter, which transmits an electromagnetic or acoustic wave, and a receiver, which receives the reflected wave. The time interval between transmission and reception can be measured, and if you know the velocity of the wave, you can compute the corresponding distance to the surface point.

Now consider scanning radially: you take this measurement at regular intervals and sweep radially over the surface points. You can also translate the transmitter-receiver pair along certain directions and repeat the same action, so that for every surface point along that path you get a distance. So, you can measure the distance; moreover, the amount of reflection you get from the surface also tells you about the orientation and the material properties of the surface.

One example is echocardiography, where acoustic waves, that is, ultrasound waves, are used.

(Refer Slide Time: 11:43)

So, to summarize: what is an image, how do we define an image? In a very short sentence, we can say it is an impression of the physical world. To elaborate a little, we can say that it is a spatial distribution of a measurable quantity encoding the geometry and material properties of objects.

(Refer Slide Time: 12:14)

So, now I will discuss a few concepts and operations from image processing. In this course we will require some of these concepts; as I mentioned earlier, you are not required to have gone through a first-level image processing course to attend this computer vision course, and these are the primers that I will be discussing here. However, it would be better if you also follow some image processing textbook to learn more details about these concepts.

(Refer Slide Time: 12:50)

So, let us consider first a very simple concept: the first-level statistics of the distribution of pixel values, which can be captured in the form of a frequency distribution of the brightness values in a particular image. Here I have shown an image of a scanned page; we call this class of images document images. Once again, in this image you have a brightness value at every pixel.

As you can see, there are mostly two types of pixels: one depicts the text of the document, and the other belongs to the background. Usually, for such an image, the histogram should show a bimodal kind of characteristic. But in this case, since there are so many white pixels, the histogram is quite skewed, and the distribution in the text zone looks rather flat.

We will come back to how this could be processed further to make it more bimodal; for the present, let us concentrate on the fact that an image histogram is nothing but the frequency distribution of the brightness values. From this frequency distribution you can obtain the probability distribution of the brightness values: if I normalize the histogram, that is, divide each frequency by the total number of pixels of the image, then I get the probability distribution of the brightness values.
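A minimal sketch of this computation for an 8-bit grayscale image, using numpy (the function name and the 256-bin assumption are mine):

import numpy as np

def histogram_and_pdf(img):
    # img: 2D uint8 array. Returns the 256-bin frequency histogram
    # and its normalized version, i.e. the probability distribution p(x).
    hist = np.bincount(img.ravel(), minlength=256)   # frequency of each brightness value
    pdf = hist / img.size                            # divide by total number of pixels
    return hist, pdf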

(Refer Slide Time: 14:29)

So, one of the problems of document image analysis is to separate the foreground from the background, and this process is called binarization. One simple technique of binarization uses a threshold value to decide whether a pixel is foreground or background.

In our present example, the foreground consists of the dark pixels and the background consists of the bright pixels, that is, the white region of the document. After binarization, each pixel is set to one of two values; for example, we can let 255 represent the white region and 0 represent the text portion, the dark pixels.

(Refer Slide Time: 15:30)

One of the simplest binarization algorithms is as follows. Choose a threshold value T, some value in the brightness interval. A pixel with value greater than T is set to 255; otherwise it is set to 0.
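A direct translation of this rule into code could look like the following sketch, assuming an 8-bit grayscale image held in a numpy array:

import numpy as np

def binarize(img, T):
    # Set pixels with value greater than T to 255, the rest to 0.
    return np.where(img > T, 255, 0).astype(np.uint8)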

(Refer Slide Time: 15:55)

So, this is a very simple algorithm; let us see what its effect is. Consider this document and its histogram, as displayed here. Let us choose a particular threshold value, say 156, perform the thresholding, and you get an image like this. You can see that there are only two types of pixels, pixels with value 0 and pixels with value 255, when your threshold value is 156.

If I choose another value, say 192, you get a different binarized image, and you can see the difference between these two images. If the threshold value is higher, you get more foreground pixels and your text becomes sharper, but there is also more spurious noise in your document, which is not desirable. So, what is the optimum threshold value? The desirable threshold is one that makes the text sharper, keeps the proper foreground pixels, and also removes the noisy pixels.

This kind of manual choice of threshold may not help when you are trying to process a variety of documents, so one objective would be to automate this thresholding operation.

(Refer Slide Time: 17:26)

So, one of the techniques that I will discuss here is a Bayesian classification of foreground and background pixels of a particular image. In this case we assume that the histogram of the image is a bimodal histogram; a schematic diagram is shown here. There are two modes, two peaks, in this histogram, and our assumption is that most of the pixels around one mode, around one particular peak, come from the foreground, while the pixels coming from the background are centred around the other mode. So, we consider two classes of pixels, and these are the notations for the classes: we call them w1 and w2, just as an abstract representation of this problem.

So, what do we need to do in this case? We need to compute the probability of class w1 given x,

𝑝(𝑤1|𝑥)

and the probability of class w2 given x,

𝑝(𝑤2|𝑥)

This is because, by the Bayesian classification rule, we assign the pixel x to class w1 if

𝑝(𝑤1|𝑥) > 𝑝(𝑤2|𝑥)

and otherwise we assign it to w2; that is the Bayes classification rule. So, how can we compute this probability, which incidentally is called the posterior probability? We can apply Bayes' theorem, which is described here.

So, consider a pixel x. The probability of a class given the pixel x can be computed from three quantities:

𝑝(𝑤|𝑥) = 𝑝(𝑤) 𝑝(𝑥|𝑤) / 𝑝(𝑥)

It is simpler to compute 𝑝(𝑥|𝑤), which is called the likelihood, than to compute 𝑝(𝑤|𝑥) directly. We can assume that the pixels around one mode form a class distribution coming from the foreground pixels, and we can assume it is a Gaussian distribution; this gives the probability distribution 𝑝(𝑥|𝑤1). Similarly, for the background class we can consider another probability distribution, which is 𝑝(𝑥|𝑤2). These probabilities are called likelihoods, and they can be computed easily, rather than computing the posterior directly.

You can also compute the class probabilities 𝑝(𝑤): if some threshold value is chosen, then the proportional areas of the histogram on either side give the two class probabilities. I will describe this in a subsequent slide, but it is interesting to note that you may not need to use 𝑝(𝑥) at all in this computation. After computing 𝑝(𝑤1|𝑥) and 𝑝(𝑤2|𝑥) you only need to compare

𝑝(𝑤1|𝑥) > 𝑝(𝑤2|𝑥)

and since both posteriors share the same denominator 𝑝(𝑥), which for a given x is already determined by the data itself, it cancels out in the comparison.

(Refer Slide Time: 21:11)

So, let us see how these computations can be carried out. There is an algorithm by which we determine this threshold; we call it the expectation maximization algorithm. Let me explain the algorithm here. Consider once again the histogram of the image, or equivalently the probability distribution of the pixel values, and assume an initial threshold value, say at this point. This value divides the brightness interval into two halves.

We can consider that one half belongs to the foreground region and the other half belongs to the background region; so this is the representation of the foreground part and this is the representation of the background part. Given this threshold, we can compute 𝑝(𝑤1) and 𝑝(𝑤2) by computing the area under the histogram on each side and taking the proportional areas of the two regions.

This is how the class probabilities can be computed once a threshold value is given. After that, we concentrate on each class separately and compute the parameters of 𝑝(𝑥|𝑤1) by assuming it to be Gaussian, and similarly the parameters of 𝑝(𝑥|𝑤2) by assuming it to be Gaussian.

If we look at the Gaussian distribution function, it has two parameters: µ, which is the mean of the distribution, and σ, which is the standard deviation. So, to compute the likelihood 𝑝(𝑥|𝑤1), you simply need to compute these parameters; then you can compute the probability of any value x given those parameters. Let us call the corresponding parameters for class w1 µ1 and σ1, and the corresponding parameters for class w2 µ2 and σ2.

(Refer Slide Time: 23:34)

So, next, this is how the corresponding parameters are computed in this case.

𝑝(𝑤1) = ∑_{𝑥=0}^{𝑇ℎ} 𝑝(𝑥)

𝑝(𝑤2) = 1 − 𝑝(𝑤1)

µ1 = ∑_{𝑥=0}^{𝑇ℎ} 𝑥 ⋅ 𝑝(𝑥)

𝜎1² = ∑_{𝑥=0}^{𝑇ℎ} 𝑥² ⋅ 𝑝(𝑥) − µ1²

µ2 = ∑_{𝑥=𝑇ℎ+1}^{255} 𝑥 ⋅ 𝑝(𝑥)

𝜎2² = ∑_{𝑥=𝑇ℎ+1}^{255} 𝑥² ⋅ 𝑝(𝑥) − µ2²

So, we are computing the weighted means and the weighted variances of these values. These are all simple arithmetic expressions, and if you look at a book of statistics it will be very clear how these quantities are computed by these expressions. Once we get these values, we determine a new threshold value as the brightness value up to which

𝑝(𝑤1|𝑥) > 𝑝(𝑤2|𝑥)

that is, as soon as the posterior of w1 becomes less than that of w2, we take that value as the new threshold. We expected the old value to be the threshold, but after computing these parameters, after maximizing the probability of occurrence of the observed pixels, we find a better threshold value that gives a better probabilistic fit to the observations. So, that becomes the new threshold value, and we iterate this process until it converges. This is your Bayesian classification based binarization method.
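The iteration described above can be sketched as follows. This is my own compact reading of the procedure, working directly on the normalized histogram p(x); the function names are mine, and the class means and variances are normalized by the class probabilities, which is the usual convention.

import numpy as np

def gaussian(x, mu, sigma):
    # Gaussian likelihood p(x | w), with a small floor on sigma for stability.
    sigma = max(sigma, 1e-6)
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def em_threshold(pdf, T0=128, max_iter=100):
    # pdf: length-256 normalized histogram p(x). Returns the converged threshold.
    x = np.arange(256)
    T = T0
    for _ in range(max_iter):
        p1 = pdf[:T + 1].sum()                  # p(w1): area below the threshold
        p2 = 1.0 - p1                           # p(w2)
        if p1 <= 0 or p2 <= 0:
            break
        mu1 = (x[:T + 1] * pdf[:T + 1]).sum() / p1
        mu2 = (x[T + 1:] * pdf[T + 1:]).sum() / p2
        var1 = (x[:T + 1] ** 2 * pdf[:T + 1]).sum() / p1 - mu1 ** 2
        var2 = (x[T + 1:] ** 2 * pdf[T + 1:]).sum() / p2 - mu2 ** 2
        # Posterior comparison: p(w1|x) > p(w2|x)  <=>  p(w1) p(x|w1) > p(w2) p(x|w2)
        post1 = p1 * gaussian(x, mu1, np.sqrt(max(var1, 0)))
        post2 = p2 * gaussian(x, mu2, np.sqrt(max(var2, 0)))
        dominated = np.where(post1 > post2)[0]
        T_new = int(dominated[-1]) if len(dominated) else T
        if T_new == T:                          # converged: threshold no longer changes
            break
        T = T_new
    return T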

(Refer Slide Time: 25:44)

There is another method we can consider here, which is quite similar and which also defines an optimization function from which you can obtain the threshold value. This optimization function is the between-class variance of the two classes, which, as you can see, is defined in terms of the parameters discussed earlier:

𝜎𝐵² = 𝑝(𝑤1) 𝑝(𝑤2) (µ2 − µ1)²

So, given a threshold value I can compute this 𝜎𝐵²: p(w1) from this part of the histogram, p(w2) from the other part, then µ1 from here and µ2 from here. You compute this value for every candidate threshold in your interval, say 0 to 255, and you take the value where the between-class variance is maximum as your threshold. This thresholding technique was proposed by Otsu and is known as the Otsu thresholding technique.
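A minimal sketch of this method, scanning all candidate thresholds and maximizing the between-class variance σB² = p(w1) p(w2) (µ2 − µ1)²; the function name is mine, and the class means are normalized by the class probabilities as usual:

import numpy as np

def otsu_threshold(pdf):
    # pdf: length-256 normalized histogram p(x). Returns the threshold
    # that maximizes the between-class variance.
    x = np.arange(256)
    best_T, best_var = 0, -1.0
    for T in range(256):
        p1 = pdf[:T + 1].sum()
        p2 = 1.0 - p1
        if p1 == 0 or p2 == 0:
            continue
        mu1 = (x[:T + 1] * pdf[:T + 1]).sum() / p1
        mu2 = (x[T + 1:] * pdf[T + 1:]).sum() / p2
        sigma_b2 = p1 * p2 * (mu2 - mu1) ** 2    # between-class variance
        if sigma_b2 > best_var:
            best_var, best_T = sigma_b2, T
    return best_T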

(Refer Slide Time: 27:03)

Here is an example of this processing with that particular document image. We have computed the Otsu threshold, which is 157 in this case, and it gives this kind of image. If I consider the Bayesian threshold, we find another image; incidentally, although the threshold values are the same, you can see there is a little difference between the two images, though their quality is almost similar. This difference arises because, for the Bayesian method, we would like to make the histogram bimodal.

So, before applying the Bayesian classification, we process the image so that the histogram has sharper modes, also in the foreground zone. How we do this processing, I will discuss next.

(Refer Slide Time: 27:58)

So, this is the method I was referring to: it is a contrast enhancement method, and here the concept of pixel mapping is used. The idea is that an input pixel value is mapped to an output pixel value in such a way that the dynamic range of the input is expanded. That means, suppose in this case the dynamic range of the input is from 0 to some value which is approximately half of the interval; in the output we expand that dynamic range to 0 to 255, which makes the contrast sharper.

One property, of course, that you need to preserve is that this mapping function must be monotonically increasing: if you have two pixel values x1 and x2, then

If 𝑥1 > 𝑥2, then 𝑦1 > 𝑦2

This keeps the display consistent, showing brighter pixels brighter and darker pixels darker, and that is why you require this property.

One popular mapping function used in this case is derived from the probability distribution of the pixel values itself; it is the cumulative distribution:

𝑦 = 255 ∑_{𝑡=0}^{𝑥} 𝑝(𝑡)
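A minimal sketch of this mapping, that is, histogram equalization via the cumulative distribution, assuming an 8-bit image (the rounding to integers is my addition):

import numpy as np

def equalize(img):
    # Map each input value x to y = 255 * sum_{t <= x} p(t),
    # where p is the normalized histogram of the image.
    pdf = np.bincount(img.ravel(), minlength=256) / img.size
    cdf = np.cumsum(pdf)                       # cumulative distribution, monotonically increasing
    mapping = np.round(255 * cdf).astype(np.uint8)
    return mapping[img]                        # apply the lookup table pixel-wise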

(Refer Slide Time: 29:28)

If I apply this operation, we get a contrast-enhanced image where the features are more visible, more prominent. You can also look at the histogram: it has a similar shape, but the dynamic range has been expanded and the modes are more clearly visible. In fact, this is the technique I was referring to, the one we applied to the document that was processed for binarization. With this, let me stop here; this is the first part of this particular talk, and we will continue with the next part in the next lecture.

Thank you very much for listening.

Keywords: Images, projection, histogram, thresholding, expectation maximization, Bayesian, equalization

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 02
Fundamentals of Image Processing – Part II

We will move to the next part of this topic. We are still giving an overview of some of the fundamentals of image processing.

(Refer Slide Time: 00:25)

We will start with the concept called segmentation. Segmentation is the partitioning of image pixels into meaningful non-overlapping sets. The binarization process we discussed earlier is a special case of segmentation: in that case we have only two such sets, one for the background and one for the foreground. But it can be extended to more than two groups of pixels.

(Refer Slide Time: 01:01)

For example, we can consider a multilevel thresholding scheme. In the previous case we had only a single threshold; in multilevel thresholding we can have more than one threshold. Consider this particular image and its histogram, and suppose we have selected these two thresholds. We can use a computation similar to what we discussed earlier; in particular, the Bayesian classification method can be extended to determine these thresholds automatically, or we can select them manually by visualizing the modes.

In this particular case there are three intervals, and for each interval we assign a particular colour to its pixels. There are 3 colours in this image, blue, green and yellow, and each colour represents a segment of the image. The intervals are 0 to 60, 61 to 119, and 120 to 255:

𝐼𝑛𝑡𝑒𝑟𝑣𝑎𝑙𝑠: [0,60], [61,119], [120,255] 𝑓𝑜𝑟 𝐵𝑙𝑢𝑒, 𝐺𝑟𝑒𝑒𝑛, 𝑌𝑒𝑙𝑙𝑜𝑤

These are the definitions of the non-overlapping intervals, and this is how the pixels are grouped into 3 different segments in this image.

Note that 0 to 60 corresponds to the darker values, which form the water channel here; 61 to 119 covers the middle values, corresponding to this region; and 120 to 255 covers the brighter values, corresponding to the yellow portions of the image. So, this is the blue zone, this is the green zone, and this is the yellow zone.
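As a sketch, grouping pixels into such intervals can be expressed with a lookup over the interval boundaries; the boundaries below are the ones from this example, and the returned label values (0, 1, 2) simply index the three segments.

import numpy as np

def multilevel_segment(img, thresholds=(60, 119)):
    # Assign each pixel a segment index according to the intervals
    # [0, 60], [61, 119], [120, 255].
    # np.digitize with right=True returns 0, 1 or 2 for these three intervals.
    return np.digitize(img, bins=np.array(thresholds), right=True)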

(Refer Slide Time: 02:57)

So, one task of segmentation still remains even after grouping the pixels into different intervals, and that is called component labelling. In this case, we also need to partition the image pixels in the plane itself: the objective is to partition connected image pixels into meaningful non-overlapping sets. For this we can use various neighbourhood definitions to define connectedness.

For example, we can consider the 4-neighbourhood definition or the 8-neighbourhood definition. The 4-neighbours are defined in this fashion: given a pixel, these are the positions of the neighbouring pixels, and out of them, these are the pixels called 4-neighbours. Suppose the pixel is at coordinate location (x, y). Then (x + 1, y) is its right neighbour; similarly, (x − 1, y) is its left neighbour, (x, y + 1) its top neighbour, and (x, y − 1) its bottom neighbour.

These are the 4-neighbours of the pixel, and this neighbourhood is called the 4-neighbourhood. If I also consider the diagonal positions as neighbours, then the definition becomes the 8-neighbourhood definition. So, you can have different such neighbourhood definitions, and accordingly connectivity between two pixels is defined.

(Refer Slide Time: 04:35)

So, let us discuss a particular case where you want to extract the components preserving 8-connectivity of pixels. Take this example: this is a small image, say a 4 x 4 image, and each pixel value is represented by these numbers. We can form a graph with edges between neighbouring pixels having the same label: consider every pixel as a node of the graph, and form an edge whenever two neighbours have the same pixel value.

For example, these pixels with value 20 all form edges among themselves; similarly, we can form an edge between the two pixels with value 50, and an edge between these pixels with value 100. In another region the two 20s form an edge, and the two 100s form another edge. This gives a graph representation of the image following the neighbourhood definition.

Now your task is to find the connected components of this graph. You can use any graph traversal algorithm, like breadth-first search or depth-first search, to find each component and declare it a segment. I have displayed those segments here with different colours, showing the different components in this case.

One interesting question that could be asked here is whether you require an explicit graph representation at all. Because we already have the image array, and the neighbourhood definitions are precisely specified by its well-organised structure, we actually do not require any graph representation: you can compute everything on the image array itself to get these components, as the sketch below illustrates.
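A minimal sketch of 8-connected component labelling by breadth-first search, operating directly on the image array as suggested above; no explicit graph is built, and the names are mine.

import numpy as np
from collections import deque

def label_components(img):
    # Label 8-connected regions of equal pixel value. Returns an int array
    # where each connected component gets a distinct label starting from 1.
    h, w = img.shape
    labels = np.zeros((h, w), dtype=int)
    current = 0
    for sy in range(h):
        for sx in range(w):
            if labels[sy, sx]:
                continue
            current += 1
            labels[sy, sx] = current
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):            # visit all 8 neighbours
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and not labels[ny, nx]
                                and img[ny, nx] == img[y, x]):
                            labels[ny, nx] = current
                            queue.append((ny, nx))
    return labels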

(Refer Slide Time: 06:59)

Let me show you some examples of this processing. These are the two images we displayed earlier: this is the original image, and this is the image segmented into 3 segments, 3 classes of intervals. Now we form the connected components from each segment, and I will show some of the prominent components in order of the number of pixels they contain. This is the largest component, and you can see that in this component we could recover the river channel as a single component.

Similarly, there are various other components. This one corresponds to this part of the scene; it is a green segment, so connected green segments are shown here. This one mostly shows connected regions in the yellow segment, so you get the brighter pixels, the yellow segment parts, from here, and in this way various other components can be retrieved. I have shown only 6, but there are many other components possible in this image.
(Refer Slide Time: 08:15)

There is an interesting question I can ask you from this result itself; you need to observe the result carefully. You can see that in the original image there is a river channel, and in the largest component you get the pixels corresponding to the river channel, but part of the river channel is missing in this component.

So, can you tell why this part is missing? This is a question for you to work out yourself: check why this information is missing in this part.

(Refer Slide Time: 09:05)

So, let us now discuss another operation on images, the computation of gradient values. A gradient operation can be defined in this way: an image is a 2-dimensional function, and for a 2-dimensional function the gradient vector is defined by the partial derivatives along the two principal directions. The partial derivative with respect to x is

𝜕𝑓(𝑥, 𝑦)/𝜕𝑥 = 𝑓(𝑥 + 1, 𝑦) − 𝑓(𝑥, 𝑦)

and the partial derivative with respect to y is

𝜕𝑓(𝑥, 𝑦)/𝜕𝑦 = 𝑓(𝑥, 𝑦 + 1) − 𝑓(𝑥, 𝑦)

and together they form the vector

∇𝑓(𝑥, 𝑦) = (𝜕𝑓(𝑥, 𝑦)/𝜕𝑥) 𝑖̂ + (𝜕𝑓(𝑥, 𝑦)/𝜕𝑦) 𝑗̂

which is the gradient vector.

Computation of this gradient is quite simple and direct. We can use the finite difference method: the gradient along the x direction is simply the difference between the right neighbour and the pixel itself, and similarly the gradient along the y direction is the difference between the top neighbour and the pixel itself, that is, the difference of the functional values at those positions. With these simple computations you can compute these gradients and obtain the gradient vector.
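A direct sketch of these forward differences (with array rows indexing y and columns indexing x; the last row and column, which have no forward neighbour, are set to zero here, which is my simplification):

import numpy as np

def gradient(img):
    # Forward-difference gradients: gx = f(x+1, y) - f(x, y), gy = f(x, y+1) - f(x, y).
    f = img.astype(float)
    gx = np.empty_like(f)
    gy = np.empty_like(f)
    gx[:, :-1] = f[:, 1:] - f[:, :-1]      # difference with the neighbour along x
    gx[:, -1] = 0.0
    gy[:-1, :] = f[1:, :] - f[:-1, :]      # difference with the neighbour along y
    gy[-1, :] = 0.0
    magnitude = np.hypot(gx, gy)           # gradient magnitude
    return gx, gy, magnitude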

(Refer Slide Time: 10:21)

Let me discuss these computations in a more general framework; there is a motivation behind this which will become clear later. These difference operations can be performed by a computation with a mask. So, let us define a mask: a mask is a kind of array of weights, and in this case it is a one-dimensional array of 2 elements, with 2 rows and 1 column.

The values in the mask represent the weights. To compute the gradient along the vertical direction, you place this mask at every pixel position: one particular point of the mask corresponds to the central pixel position, shown here by the coordinate (x, y). You place the mask at position (x, y) and then perform the computation, namely the weighted sum of the pixel values of the neighbours covered by the mask.

So, once you place the mask at (x, y), the functional value at (x, y) is multiplied by −1 and the functional value at (x, y + 1) is multiplied by 1; effectively, if you take the sum, that is the difference between the values at (x, y + 1) and (x, y). The algorithm for this mask is: scan the image top to bottom and left to right; at every point (x, y) place the mask and compute the weighted sum, that is, multiply each weight by the functional value at the corresponding location and add them up; then replace the central value by this weighted sum.

𝑔(𝑥, 𝑦) = 1 ⋅ 𝑓(𝑥, 𝑦 + 1) + (−1) ⋅ 𝑓(𝑥, 𝑦)

In this way you can compute the finite differences along the vertical direction. The same computation can be carried out with this other mask, which computes the difference along the horizontal direction.

(Refer Slide Time: 12:37)

To generalize this computation and make the gradient computation more robust, instead of computing the difference only with the right neighbour or the top neighbour, we can consider a neighbourhood region, look at the statistics of the finite differences in that region, and take their mean.

In this particular case we take a 3 x 3 mask, whose centre is at position (x, y). With respect to this position, we can also compute finite differences using these other pairs of pixels; in fact, we compute the finite difference 6 times around the neighbourhood, and the individual values will not all be the same. But it is expected that, if there is a gradient in a particular direction, the noise or error will be reduced statistically when we take the average of all these values. Since these are all linear combinations, the whole computation can be carried out by a single mask designed in this fashion.

Note that at every cell of the mask we simply add up the weights; for example, 1 and −1 give 0 here, and this one is 1. In this way we obtain the distribution of weights in the mask, and we can carry out the same computation with this mask to compute the gradient in the horizontal direction.

(Refer Slide Time: 14:37)

Similarly, the gradient in the vertical direction can be computed by this mask. Incidentally, these two masks are called the Prewitt operator in image processing, and as you understand, the gradient in a particular direction has effectively been computed 6 times.

(Refer Slide Time: 14:59)

There is another example of robust gradient computation. In this case, we give more weight to the central gradients than to the off-centre gradients. For example, when computing the gradient in the horizontal direction, we give a weight of 2 here.

So, the weights are doubled compared to the off-centre rows, and once again, as I mentioned, every cell of the mask can be computed by summing the corresponding weighted values. The distribution of weights in the mask then looks like this for the computation in the horizontal direction, and similarly, for the vertical direction, the distribution is as shown here. This is the Sobel operator, and the gradient has effectively been computed 8 times in this case.

(Refer Slide Time: 15:57)

Let me show you the results of these gradient operations. The image at the top right is the original image. If I compute the vertical gradients, I show the pixels where the vertical gradients are prominent; you can see that the vertical bars of the fence become prominent in this image. For the horizontal gradient, the horizontal parts of the fence, those parallel to the ground, become more prominent.

But if I consider the resultant of these two components by taking the magnitude of the gradient vector, we get the edge pixels in every direction, and the corresponding edge points or boundary points of the different object surfaces become prominent in this image. This is how you compute the gradients.

(Refer Slide Time: 16:53)

Let us generalize these computations to an arbitrary mask, that is, an arbitrary weight distribution. This is an example of a 3 x 3 mask, but it could be of any size. We consider the weighted sum of the pixel values in the neighbourhood of the central pixel, and we can represent the computation in the form

𝑔(𝑥, 𝑦) = 𝑤1 𝑓(𝑥 − 1, 𝑦 + 1) + 𝑤2 𝑓(𝑥, 𝑦 + 1) + 𝑤3 𝑓(𝑥 + 1, 𝑦 + 1) + 𝑤4 𝑓(𝑥 − 1, 𝑦) + 𝑤𝑐 𝑓(𝑥, 𝑦) + 𝑤5 𝑓(𝑥 + 1, 𝑦) + 𝑤6 𝑓(𝑥 − 1, 𝑦 − 1) + 𝑤7 𝑓(𝑥, 𝑦 − 1) + 𝑤8 𝑓(𝑥 + 1, 𝑦 − 1)

So, we take the functional values at the corresponding locations, multiply them by their weights, and sum them all up; this gives the weighted sum, which replaces the functional value in the processed image. This computation is nothing but a convolution operation: at every pixel you place the mask, compute this weighted sum, and replace the functional value with the processed value.

It is the convolution operation, an operation meant for computing the output of a linear shift-invariant system given an input, and these weights define the discrete impulse response. This is also called filtering, because there is a frequency domain, that is, transform domain interpretation, which I will discuss in the next lecture. So, this mask is sometimes also called a filter, and this is the filter response h(x, y).
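A minimal sketch of this mask computation (a correlation-style weighted sum, which is what the lecture describes; boundary pixels are skipped for simplicity, which is my choice), shown here with the Sobel masks discussed in the earlier slides:

import numpy as np

def apply_mask(img, mask):
    # Replace each interior pixel by the weighted sum of its neighbourhood,
    # with the weights given by the mask placed with its centre on the pixel.
    f = img.astype(float)
    kh, kw = mask.shape
    ph, pw = kh // 2, kw // 2
    g = np.zeros_like(f)
    for y in range(ph, f.shape[0] - ph):
        for x in range(pw, f.shape[1] - pw):
            region = f[y - ph:y + ph + 1, x - pw:x + pw + 1]
            g[y, x] = np.sum(region * mask)      # weighted sum over the neighbourhood
    return g

# Sobel masks for the gradients along the two directions.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
sobel_y = sobel_x.T

# Gradient magnitude as the resultant of the two directional gradients:
# gx, gy = apply_mask(img, sobel_x), apply_mask(img, sobel_y)
# edges = np.hypot(gx, gy)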

(Refer Slide Time: 18:39)

One example of this kind of filtering is noise filtering, where the mask values are as shown here. You can see that every value is positive, so the output is a weighted combination of the neighbouring values, and the sum of all the weights equals 1. This gives a kind of low-pass filtering action. If I apply this convolution operation on this image, with the same mask computation we discussed, that is, the weighted sum of pixel values around each neighbourhood, and replace the original pixel value by the processed value, we get an image something like this, where you can see that the noisy structures look somewhat reduced.

(Refer Slide Time: 19:33)

A special case of this kind of filtering operation, often used in image processing tasks, is called Gaussian smoothing.

𝐺(𝑥, 𝑦) = (1/(2𝜋𝜎²)) 𝑒^{−((𝑥 − 𝑥𝑐)² + (𝑦 − 𝑦𝑐)²)/(2𝜎²)}

Here the weights of the filter, or mask, are computed following a Gaussian distribution. You place the origin of this distribution at the central pixel, and the distribution has a width, the standard deviation of the Gaussian. Accordingly, the mask size has to be chosen so that the mask captures most of the significant functional values of the Gaussian.

The symbol * is usually used for the convolution operation; we denote this computation with the mask as a convolution:

𝑔(𝑥, 𝑦) = 𝑓(𝑥, 𝑦) ∗ 𝐺(𝑥, 𝑦), 𝑤𝑖𝑡ℎ 𝜎 = 2, 𝑚𝑎𝑠𝑘 𝑠𝑖𝑧𝑒: 9 × 9

So, in this particular case, if the mask size is 9 x 9 and σ is 2, then using this Gaussian mask and this convolution operation you can get a result like this.
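A minimal sketch of building such a Gaussian mask and applying it with the same mask computation as above; the normalization to unit sum is my addition, so that the smoothing does not change the overall brightness.

import numpy as np

def gaussian_mask(size=9, sigma=2.0):
    # Weights of a size x size mask sampled from a 2D Gaussian centred
    # on the middle cell, normalized to sum to 1.
    c = size // 2
    y, x = np.mgrid[-c:c + 1, -c:c + 1]
    g = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return g / g.sum()

# Example: smoothed = apply_mask(img, gaussian_mask(9, 2.0))  # sigma = 2, 9 x 9 mask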

(Refer Slide Time: 20:43)

Median filtering is another kind of computation; it does not use convolution, and it is very simple to describe. It computes the median value among the neighbouring pixels and replaces the original value by this median value.

It is a kind of non-linear filtering operation, but it is a very effective method to reduce noise such as spot noise or salt-and-pepper noise, that is, random dots scattered over the image; these can be reduced, mitigated, using this kind of filtering.
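A minimal sketch of median filtering over a 3 x 3 neighbourhood (boundary pixels are left unchanged here, which is my simplification):

import numpy as np

def median_filter(img):
    # Replace each interior pixel by the median of its 3 x 3 neighbourhood.
    out = img.copy()
    for y in range(1, img.shape[0] - 1):
        for x in range(1, img.shape[1] - 1):
            out[y, x] = np.median(img[y - 1:y + 2, x - 1:x + 2])
    return out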

(Refer Slide Time: 21:35)

So, here we are at the end of our lecture on the fundamentals of image processing; let me summarize what we discussed. Images are formed by an optical camera, and the mapping of 3-dimensional points to 2-dimensional points takes place through a kind of projection known as perspective projection.

We discussed the rule of projection: from a 3-dimensional scene point, you draw a straight line which passes through a particular point known as the centre of projection and hits the image plane; that point of intersection defines the corresponding image point. That is perspective projection. Then we discussed different operations in image processing: binarization, and in binarization particularly the thresholding operation; contrast enhancement; segmentation; and a special task of segmentation, component labelling.

We also discussed gradient computation, where we saw how edges can be computed, and convolution operations. With the convolution operation we also introduced the concept of filtering, which is a computation with a mask: a mask consists of weights in its cells, and the weighted sum gives the corresponding output. Finally, we discussed median filtering at the end. So, here let me stop.

Thank you very much for listening.

Keywords: Binarization, segmentation, gradients, convolution, filtering.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology Kharagpur

Lecture - 03
Image Transform Part - I

In this lecture I will introduce Image Transforms.

(Refer Slide Time: 00:21)

Let us consider an image as a continuous function: it is a two-dimensional function, and a point lies in the two-dimensional real space. Let us also consider a set of basis functions, which are also two-dimensional functions; we can represent it as a set B where each member function is given by bᵢ(x, y). You should note that the functional values could be either in the real or in the complex domain. We can expand the two-dimensional function f(x, y) using B as a linear combination of these basis functions, as given here in the form λᵢ × bᵢ(x, y) summed over i, where i indexes the basis functions of the set.

So, the transform of f with respect to B is given by the set of coefficients λᵢ; these are called the transform coefficients, and we can represent the function as a linear combination of the basis functions where the coefficients of the linear combination are listed here. You can see that this is an alternative description of the image: instead of representing the image by the functional form f(x, y), I can simply represent it as a list of coefficients, or these coefficients can even be a function of the indices.

The indexing may be multidimensional; for example, for a two-dimensional function we can use two indices to denote a coefficient, and in that case the function can be expanded as a linear combination of two-dimensional basis functions with a double summation in the representation. One of the advantages of an image transform is that the properties of the basis functions can be exploited in the analysis.

(Refer Slide Time: 02:37)

Let us consider a particular type of property which is very useful when the basis functions have it. This property is called the orthogonality property, and if I expand the function in terms of such orthogonal basis functions, then we call that expansion an orthogonal expansion. Following up the discussion on image transforms, we will restrict our discussion to one dimension first, and later we will see that these properties can easily be extended to two dimensions. So, we will understand the one-dimensional transform initially.

One thing we would like to define here is the inner product operation. The inner product is a binary operation whose two operands are two functions, say f(x) and g(x). It is the integral, over every point of the space x, of the product of the two functions. Of course, there is something we should note about this product: it is not a simple product, but the product of f(x) with the complex conjugate of g(x). If the values of both f(x) and g(x) are in the real domain, then the complex conjugate is the same as the function itself, and we can write the inner product as the integral of f(x) g(x) dx.

An orthogonal expansion is possible when the set of basis functions satisfies a particular property, the orthogonality property. It says that the inner product of any two different basis functions should be equal to 0, whereas the inner product of a basis function with itself has a non-zero, positive value. If this is true for every pair of basis functions in the set B, then we say that the basis functions of that set are orthogonal.

Transform coefficients in an orthogonal expansion can be easily computed by exploiting this property; that is one of its uses. Simply take the inner product of the function with a basis function and divide it by cᵢ; then we get the corresponding λᵢ. If cᵢ is equal to 1, the expansion becomes an orthonormal expansion, and we can simply write λᵢ as the inner product of the function and bᵢ. Since we are expressing the function in terms of only the transform coefficients, we call this operation the forward transform.

So now, instead of being represented by f(x), the function is represented by these λ's, and the reverse transform, or inverse transform, computes the function back from these coefficients. Because of the orthogonality property, and also the orthonormality property, we can simply write it in this form: it is nothing but the linear combination of the basis functions, and since it is a continuous domain we take the integration over all the index values; otherwise, in a discrete domain, we can write it as a summation.

(Refer Slide Time: 06:29)

One special case of this orthogonal expansion is the Fourier transform. In this case the set of basis functions is given in the following form:

B = { e^{jωx} | −∞ < ω < ∞ }    (eq. 1)

Each member of the set is a complex sinusoid, as given above. Regarding completeness of the basis functions: any subset of an orthogonal basis set will also remain orthogonal, but using linear combinations of that subset alone you will not get complete reconstruction of the function.

A basis set which allows complete reconstruction is called a complete basis. In the Fourier transform, the set defined here is in fact a complete basis, because it can give back the function as a linear combination of its members. You can see that this set (eq. 1) is actually an infinite set, though every individual sinusoid in it can be distinguished.

The orthogonality property is reflected by this particular relationship, where δ(ω) is the unit impulse function whose area is equal to 1, centred at ω equal to 0, and whose value is 0 everywhere else. It is a unit impulse function, and this particular property gives you the orthogonality of this Fourier basis.

So, the Fourier transform can be defined in the following form, which is the forward transform:

F(f(x)) = f̂(jω) = ∫_{−∞}^{∞} f(x) e^{−jωx} dx

As you see above, it is the inner product of f(x) with the members of the complete basis. The basis function is e^{jωx}, so you take its complex conjugate, which is e^{−jωx}. If I take the inverse transform, then once again it is given by

f(x) = (1/2π) ∫_{−∞}^{∞} f̂(jω) e^{jωx} dω

So, f̂(jω) gives the corresponding coefficients, and the linear combination of the basis functions gives you the corresponding inverse transform.

So, it gives you full reconstruction because it is a complete basis, as I mentioned. One interesting fact can be noticed about the complex sinusoid

e^{−jωx} = cos(ωx) − j sin(ωx)

It can be decomposed into real and imaginary parts: the real part consists of cos(ωx) and the imaginary part consists of −sin(ωx) in this case. So, the transform relation can be expressed using the following expression:

f(x) = (1/2π) ∫_{−∞}^{∞} f̂(jω) (cos(ωx) − j sin(ωx)) dω

So, it has one transform component consisting of the real part and another consisting of the imaginary part; in the real part we use the basis functions cos(ωx), whereas in the imaginary part we use the basis functions sin(ωx), or −sin(ωx), depending on your interpretation.

So, we can consider the set of cosine functions as a set of basis functions, and it is also orthogonal:

C = { cos(ωx) | −∞ < ω < ∞ }

S = { sin(ωx) | −∞ < ω < ∞ }

We know these trigonometric function sets are orthogonal; the sine set is also orthogonal. But, as I mentioned, individually they do not form a complete basis. So, if I use only the real part of the transform with the cos(ωx) basis, that will not give back the full function; similarly, if I use only the imaginary part of the transform with the sin(ωx) basis, that will not give back the full function either.

(Refer Slide Time: 10:55)

So, it is not a complete basis, but there are certain functions which can be reconstructed fully using only cosine functions or only sine functions; these are called even and odd functions.

An even function should satisfy the property

f(−x) = f(x) for all x

i.e., it should be symmetric about the origin x = 0. An odd function, on the other hand, should be antisymmetric:

f(−x) = −f(x) for all x

which naturally implies

f(0) = 0

i.e., at x = 0 the value has to be 0 for this definition to hold.

So, a function could be even, odd, or neither. When it has one of these properties, you can expand it using only cosine or only sine functions. Let us see: for an even f(x), the integral of the product with sin(ωx) vanishes, i.e.,

∫_{−∞}^{∞} f(x) sin(ωx) dx = 0 for all ω

So, all the sine terms are 0; that is why only cosine terms remain, and the transform coefficients can be prescribed by a cosine transform alone. Similarly, for an odd f(x) we have

∫_{−∞}^{∞} f(x) cos(ωx) dx = 0 for all ω

and that is why an odd function can be reconstructed using only the sine basis.

Now, this can easily be derived from the relationships of cos(θ) and sin(θ) with the complex exponentials:

cos(θ) = (e^{jθ} + e^{−jθ}) / 2

sin(θ) = (e^{jθ} − e^{−jθ}) / 2j

So, full reconstruction is possible with cosines when the function is even and with sines when
the function is odd.
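
As a quick numerical sanity check of this in the discrete setting, the following sketch (Python with NumPy; the sample values are arbitrary, chosen only for illustration) builds an even and an odd periodic sequence and shows that the DFT of the even one is purely real (cosine terms only) while the DFT of the odd one is purely imaginary (sine terms only), up to round-off.

import numpy as np

# Even sequence under periodic extension: f[n] == f[(N - n) % N]
f_even = np.array([5.0, 3.0, 1.0, 2.0, 7.0, 2.0, 1.0, 3.0])
# Odd sequence under periodic extension: f[n] == -f[(N - n) % N], with f[0] = f[N/2] = 0
f_odd  = np.array([0.0, 3.0, 1.0, 2.0, 0.0, -2.0, -1.0, -3.0])

F_even = np.fft.fft(f_even)
F_odd  = np.fft.fft(f_odd)

print(np.max(np.abs(F_even.imag)))   # ~1e-15: only cosine (real) terms survive
print(np.max(np.abs(F_odd.real)))    # ~1e-15: only sine (imaginary) terms survive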

(Refer Slide Time: 13:03)

Now, let us consider the discrete representation. A discrete representation of a function can be made in the following form:

f(n) = { f(nX₀) | n ∈ Z }

That is, the function is sampled at a periodic interval, providing a sequence of functional values, where each sequence position is an integer and X₀ is the sampling interval associated with this definition.

So, it can also be considered as a vector in an infinite-dimensional vector space. But since we will always be working with images or signals of finite extent, represented computationally in the discrete domain, we will have only a finite-dimensional vector:

f(n), n = 0, 1, ..., N − 1

For example, we can represent a function within a certain interval, say from 0 to N − 1, as shown above.

You should note that the sampling interval is only implicit in this form. Even without the sampling interval we have a representation of the function; the sampling interval is needed only when you interpret the function in terms of physical quantities in the functional space.

So, let us represent the function by a finite-dimensional vector, an N-dimensional column vector in this case; that is why the transpose operation is used as shown below.

f = [f(0) f(1) ... f(N − 1)]^T

(Refer Slide Time: 15:05)

So, how do you define a discrete linear transform? It is very simple. Whenever we perform a matrix multiplication of the form

Y_{m×1} = B_{m×n} X_{n×1}

we have an n-dimensional column vector X and an m × n matrix B; multiplying them gives another vector of dimension m × 1, that is, an m-dimensional vector.

So, this is a transformation of one column vector into another column vector, possibly of a different dimension. We call it a linear transform or, since we are using the discrete representation, a discrete linear transform. This transform has an inverse when the transformation matrix B is a square matrix and is invertible.
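
A minimal sketch of this idea in Python/NumPy (the matrix B and the vector x below are arbitrary illustrative values, not from the lecture): when B is square and invertible, the original vector is recovered exactly by the inverse transform.

import numpy as np

B = np.array([[2.0, 1.0],
              [1.0, 3.0]])      # an arbitrary invertible transformation matrix
x = np.array([4.0, -1.0])       # input column vector

y = B @ x                        # forward discrete linear transform
x_back = np.linalg.inv(B) @ y    # inverse transform recovers x
print(y, x_back)                 # x_back equals [4., -1.] up to round-off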

(Refer Slide Time: 16:11)

One interesting fact to note about this transformation matrix is that the rows of B form the basis vectors; this is the analogue of the basis functions. Instead of the inner product between two functions, here we have the inner product of two vectors, which is equivalent to the dot product of two vectors.

So, consider the representation of the transformation matrix whose rows are the vectors b_i^{*T}: the transpose indicates that each is a row vector, and the complex conjugate is taken to keep the representation consistent with our definition of the inner product.

When we perform this dot product, or inner product, between a basis vector and the input, we get the corresponding element: the i-th basis vector provides the i-th element of the transformed vector Y.

⟨b_i, b_j⟩ = b_i^{*T} b_j = 0 if i ≠ j
                          = c_i otherwise

So, the orthogonality condition in this case is reflected in the above form: if you take any pair of distinct basis vectors, their inner product should be equal to 0; otherwise it has a non-zero value c_i.

(Refer Slide Time: 17:43)

With this form, we have a similar representation: a function can be considered as a linear combination of all those basis vectors. If we look at the discrete Fourier transform, its basis vectors are represented in the following form, defined over N functional points:

b_k(n) = (1/√N) e^{j2πkn/N}, for 0 ≤ n ≤ N − 1 and 0 ≤ k ≤ N − 1

So, b_k(n) is the n-th element, which means you can form a basis vector by computing it at each integer value of n from n = 0 to n = N − 1 within this interval. That gives a vector, the k-th basis vector, and there are N such vectors, with the index k varying from 0 to N − 1.

f̂(k) = Σ_{n=0}^{N−1} f(n) e^{−j2πkn/N}, for 0 ≤ k ≤ N − 1

So, the forward transform, or discrete Fourier transform, of a discrete sequence f(n) of finite length N can be expressed in the above form.

f(n) = (1/N) Σ_{k=0}^{N−1} f̂(k) e^{j2πkn/N}, for 0 ≤ n ≤ N − 1

You can see that this is nothing but the inner product of f(n) with e^{j2πkn/N}. The only thing is that, instead of putting the 1/√N factor in this expression, we take care of the square-root normalization during the inverse transform by multiplying by 1/N, as shown above. This is a simple operation.

So, we keep this division (normalization) operation, but we remove it from the forward transform and include it during the reconstruction. The values we get in the forward transform are therefore only proportional to the normalized coefficients, which does not matter at this stage; when you reconstruct, you recover the same values because the normalization is taken care of during the reconstruction operation.
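
A small sketch (Python/NumPy; the four-sample test sequence is arbitrary) implementing the forward DFT without any scale factor and the inverse with the 1/N factor, exactly as described above, and confirming full reconstruction.

import numpy as np

def dft(f):
    # Forward DFT: f_hat(k) = sum_n f(n) e^{-j 2 pi k n / N}   (no normalization)
    N = len(f)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return np.exp(-2j * np.pi * k * n / N) @ f

def idft(F):
    # Inverse DFT: f(n) = (1/N) sum_k f_hat(k) e^{+j 2 pi k n / N}
    N = len(F)
    k = np.arange(N)
    n = k.reshape(-1, 1)
    return (np.exp(2j * np.pi * n * k / N) @ F) / N

f = np.array([1.0, 2.0, 0.0, -1.0])        # four functional values (arbitrary)
F = dft(f)
print(np.allclose(idft(F), f))              # True: full reconstruction
print(np.allclose(F, np.fft.fft(f)))        # True: matches NumPy's unnormalized FFT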

You can see that the reconstruction is also a linear combination of the corresponding basis vectors, with coefficients given by f̂(k), the discrete Fourier transform coefficients. We can also observe from the above expression that the discrete Fourier transform is nothing but the Fourier series of a periodic function. So, let us consider a finite sequence; for simplicity let me take only four functional values, which means the value of N is 4 here, and this is the function for which I will compute the discrete Fourier transform.

Since there is no definition outside this interval, I can use any definition as per my convenience for the other functional values, perform the transformation, and after the inverse transformation keep my observation window within the interval from 0 to 3. A periodic extension of this signal has the form where the values are repeated, so that the defining property of a periodic function with period N, namely f(n + N) = f(n), is satisfied; as you can see here, the value of N is equal to 4.

So, you will see that after every fourth sample the same value repeats. It becomes a periodic signal, and as you know any periodic signal can be expressed as a linear combination of sinusoids; you can perform a Fourier series expansion, which is what is being done here. Then, while performing the inverse Fourier transform, you compute it only for these four sample points. Just note how this is related to the actual physical sampling interval, say X₀ here.

So, the fundamental frequency is determined by the length of the period of this signal, which is N X₀; the fundamental frequency is therefore 1/(N X₀).

The k-th harmonic is represented by k/N, and implicitly there is k/(N X₀), which is the physical frequency. So, we call k/N the normalized frequency in this representation.

(Refer Slide Time: 22:47)

The discrete Fourier transform can also be expressed as a linear transform, which means we can express it as the multiplication of a matrix with a column vector, where the column vector contains the functional values, the finite-dimensional vector we considered earlier. Multiplying by this matrix gives the transformed vector, and you can see that the matrix elements are obtained from the corresponding basis vector definition.

So, we can represent this matrix in a shorter form where the (k, n)-th element is given by the expression shown below:

ℱ_N = [ e^{−j2πkn/N} ]_{0 ≤ (k,n) ≤ N−1}

The indices k and n range from 0 to N − 1, which gives an N × N matrix. In the forward transform we simply multiply this transformation matrix with the column vector f, which represents [f(0) f(1) ... f(N − 1)]^T.

Then we get the output transformed vector: the (N × N) transform matrix maps this column vector to the column vector [F(0) F(1) ... F(N − 1)]^T, which contains the coefficients of the discrete Fourier transform, i.e., F = ℱ_N f.

The inverse transform follows naturally: if I take the inverse of ℱ_N and multiply it with F, we get back the original column vector, f = (ℱ_N)^{−1} F.

Incidentally, because of the orthogonality of these basis vectors (and their orthonormality when the 1/√N scaling is kept), the inverse of the discrete Fourier transform matrix is, up to the 1/N normalization, simply its Hermitian transpose: (ℱ_N)^{−1} = (1/N)(ℱ_N)^H, and for the orthonormally scaled matrix the inverse is exactly the Hermitian transpose.

The Hermitian transpose is the transpose of the matrix combined with the complex conjugate operation on its elements; that gives the Hermitian (conjugate) transpose of the matrix.
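
A short check of this property (Python/NumPy; N = 8 is an arbitrary size): for the unnormalized matrix [e^{−j2πkn/N}] the inverse carries the 1/N factor, while for the 1/√N-scaled (unitary) version the inverse is exactly the Hermitian transpose.

import numpy as np

N = 8
k = np.arange(N).reshape(-1, 1)
n = np.arange(N)

F_N = np.exp(-2j * np.pi * k * n / N)       # unnormalized DFT matrix [e^{-j2πkn/N}]
F_u = F_N / np.sqrt(N)                       # orthonormal (unitary) version

print(np.allclose(np.linalg.inv(F_N), F_N.conj().T / N))   # True
print(np.allclose(np.linalg.inv(F_u), F_u.conj().T))        # True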

(Refer Slide Time: 24:53)

Now, there could be other kinds of orthogonal basis vectors, and some of them can be derived or extended from the discrete Fourier transform representation. We call this the generalized discrete Fourier transform.

Observe that in the discrete Fourier transform the basis vectors are generated by sampling the complex sinusoid at regular intervals within the range 0 to N − 1. Now, if I introduce a phase shift of β in that interval, we get a variation there; and, since in defining the basis vectors we generated the harmonics at regular intervals, we can likewise shift the harmonics by α. This gives

b_k^{α,β}(n) = (1/√N) e^{j2π(k+α)(n+β)/N}, for 0 ≤ n ≤ N − 1 and 0 ≤ k ≤ N − 1

So, by also giving a shift of α in the frequency space, you get a different basis vector. The above representation generates N basis vectors of N dimensions each, which once again form an orthogonal basis; stacked together they give a square matrix that is invertible.

So, it can easily be used, once again, to define a transform; this is the generalized discrete Fourier transform. We can use these basis vectors to get the corresponding expression for the forward transform, and similarly we can get back the function by applying the inverse transform, in the same form as for the discrete Fourier transform. The corresponding transformation matrix can be expressed in a similar way; its elements retain the same kind of expression, only now there are parameters α and β, which give a different transformation matrix for each choice.

There are some popular special cases. The values α = 0 and β = 0 give back the discrete Fourier transform we discussed earlier. If I take α = 0 and set β = 1/2, we call that transform the Odd Time Discrete Fourier Transform, or OTDFT; similarly, α = 1/2 and β = 0 gives the Odd Frequency Discrete Fourier Transform, and if both are 1/2, the Odd Frequency Odd Time Discrete Fourier Transform.
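
A sketch (Python/NumPy) that builds the generalized DFT basis b_k^{α,β}(n) = (1/√N) e^{j2π(k+α)(n+β)/N} and checks that, for the OTDFT choice α = 0, β = 1/2 (and likewise for the other choices), the basis vectors are orthonormal, so the matrix is invertible via its Hermitian transpose. N = 8 is an arbitrary size.

import numpy as np

def gdft_matrix(N, alpha, beta):
    # Rows are the basis vectors b_k(n) = (1/sqrt(N)) exp(j 2π (k+alpha)(n+beta) / N).
    k = np.arange(N).reshape(-1, 1)
    n = np.arange(N)
    return np.exp(2j * np.pi * (k + alpha) * (n + beta) / N) / np.sqrt(N)

N = 8
B = gdft_matrix(N, alpha=0.0, beta=0.5)        # OTDFT basis (alpha = 0, beta = 1/2)
print(np.allclose(B @ B.conj().T, np.eye(N)))   # True: orthonormal rows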

There are various other properties which I am not discussing here; these transforms have been given just as examples, and they have their inverse transforms as well. Those relationships are shown in the table on this slide.

So, there are different relationships for the inverse transforms. Let us stop here at this point and start from here in the next lecture, where we will see that, though in the continuous domain it is not possible to have a cosine transform or a sine transform for every kind of function, in the discrete domain you can define cosine and sine transforms for any finite-dimensional sequence.

For that, we will be using this generalized discrete Fourier transform. Thank you for listening to this talk; we will move on to the next lecture.

Thank you.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 04
Image Transform - Part – II

So, let us discuss the second part of image transforms. In the previous lecture I discussed how the discrete Fourier transform can be generalized. Now we will see that, using those generalized discrete Fourier transforms, we can also derive discrete cosine and discrete sine transforms for discrete sequences, and that they form a complete basis, so you can get a complete reconstruction of those sequences.

(Refer Slide Time: 00:54)

To understand that, let us consider the concept of a symmetric or antisymmetric extension of a finite sequence. As I mentioned for the case of the discrete Fourier transform as well, given a finite sequence you can define it in the undefined region as per your convenience, giving the sequence certain properties that will be useful and convenient for performing transformations on the extended sequences.

One such type of extension used for this purpose is the symmetric and antisymmetric extension. Let us understand what a symmetric extension and an antisymmetric extension are. You can see an example of a symmetric extension here: this was my original sequence, and the center of symmetry lies at the end sample of the original sequence itself.

We call this kind of symmetric extension a whole-symmetric extension, or whole symmetry. Given 4 samples, you can see that we require an additional 3 samples, and by maintaining the symmetry of the sample values we create the whole symmetry.
The other kind of symmetry we can create is called half symmetry; in this case the center of symmetry is separated from the end sample by an interval of half the sampling period, and that point is the center of symmetry.

Towards its left you have 4 samples and towards its right you also have 4 samples; this is called half symmetry. For antisymmetry we similarly define whole antisymmetry: in whole antisymmetry, the center of antisymmetry lies at a newly introduced sample value, which must be 0.

The next values are then determined by the original sequence: they are the negatives of the corresponding sample values, and this is whole antisymmetry. So, you can see the total length in this definition: it is 4, plus the introduced value making 5, and then another 4, so it is 9. This is an example of whole antisymmetry.

Similarly, we can have half antisymmetry: instead of introducing a 0 explicitly, we assume that the center of antisymmetry lies halfway in the interval between the two samples. So, no additional 0 is required; the samples are simply inverted, and this is called a half-antisymmetric extension.

Now, the significance of this symmetric extension is that, if I make the symmetric extension, the function becomes an even function when the origin x = 0 is taken at the center of symmetry; similarly, with an antisymmetric extension it becomes an odd function when the origin lies at the center of antisymmetry. So, as I was mentioning, given an original sequence you have converted it into an even function, and if you further consider a periodic extension of this function, you can apply a discrete Fourier transform (some form of discrete Fourier transform) on it.

Then, since it is an even function, the transformation requires only the cosine parts; so you can have a discrete cosine transform using this kind of function. You can then reconstruct using that discrete cosine transform and keep your observation window only on the original interval, so you get the original sequence back.

In this way you recover the whole original sequence using only cosine transforms. Similarly, for odd functions it is sufficient to use only sine functions, because for odd functions we have seen that the Fourier transform reduces to the application of only the sine functions from the basis set; here also only the sine functions are used.

That is how, using these symmetric and antisymmetric extensions, we can have DCTs and DSTs, i.e., Discrete Cosine Transforms and Discrete Sine Transforms, for any finite sequence. That is the reason they exist and why you find them in textbooks and in many applications. For an even extension it is a DCT and for an odd extension it is a DST.
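
A small numerical illustration of this idea (Python/NumPy; the 4-sample sequence is arbitrary): a whole-symmetric extension of a finite sequence gives one period of an even function, so its DFT is purely real, i.e., only the cosine terms survive, which is exactly what a DCT captures.

import numpy as np

x = np.array([4.0, 2.0, 5.0, 1.0])           # an arbitrary finite sequence (N + 1 = 4 samples)

# One period of the whole-symmetric periodic extension: [x0, x1, x2, x3, x2, x1] (length 6).
ext = np.concatenate([x, x[-2:0:-1]])
print(ext)                                    # [4. 2. 5. 1. 5. 2.]

F = np.fft.fft(ext)
# ext[m] == ext[(-m) % 6], an even function, so its DFT has no sine (imaginary) part.
print(np.max(np.abs(F.imag)))                 # ~1e-15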

(Refer Slide Time: 06:26)

So, this is what I mean: there can be various kinds of discrete cosine transforms, because there are so many different varieties. You can have different types of symmetric or antisymmetric extensions at the two ends, observe whether the sequence becomes an even or an odd function, and also use different kinds of discrete Fourier transforms from the definition of the generalized discrete Fourier transform.

So, you can have different DCTs and different DSTs. Take this example: suppose we have a whole-symmetric extension at both ends; in that case the functional values are symmetric around the end value. This defines the period of the symmetric extension, and if I extend it periodically we get a periodic signal.

This is the original function, and using the whole-symmetric extension at both ends you can identify the period, i.e., the minimum period formed by this particular extension. If the number of samples was 4, you can see that the length of the period becomes 6. We will now observe how this affects the corresponding cosine transform.

It is also possible to have sine transforms, but if we simply apply the discrete Fourier transform on this extension we get the type I even DCT, whose expression is given on the slide. You can see that only the cosine basis functions are used. The period is 2N, but observe the definition of N in this case: the index ranges from 0 to N, which means there are N+1 samples. So, if N+1 equals 4, the actual value of N is 3, and that is why the period is 6.

So, this is how the type I even DCT is defined. There is also a definition of α(p); this is a normalization factor for making the expansion orthogonal and orthonormal, so that the required property is satisfied when you perform the reconstruction.

(Refer Slide Time: 09:15)

Similarly, we can have other kinds of symmetric extensions, such as HSHS: at both ends we have a half-symmetric extension. If the original samples are of length 4, the half-symmetric extension at both ends provides a periodic even signal of length 8. If I apply the generalized discrete Fourier transform with α = 0 and β = 1/2, that is, the odd time discrete Fourier transform, then you will find it is the type II even DCT.

Here you can observe that given N samples you are generating a period of 2N, and you have the following expression:

C_{2e}(x(n)) = X^{IIe}(k) = √(2/N) α(k) Σ_{n=0}^{N−1} x(n) cos( 2πk(n + 1/2) / (2N) ),   0 ≤ k ≤ N − 1

This is the familiar DCT expression you see in textbooks, and it is the one mostly used in image and video compression. If one simply says DCT, by default it is this type II even DCT that is meant.
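
As a check of this expression, the sketch below (Python with NumPy and SciPy; the test vector is arbitrary) evaluates the type II even DCT directly from the formula, using the usual normalization α(0) = 1/√2 and α(k) = 1 otherwise, and compares it with SciPy's orthonormal DCT-II.

import numpy as np
from scipy.fft import dct

def dct2e(x):
    # X(k) = sqrt(2/N) * alpha(k) * sum_n x(n) * cos(2*pi*k*(n + 1/2) / (2*N))
    N = len(x)
    n = np.arange(N)
    X = np.empty(N)
    for k in range(N):
        alpha = 1.0 / np.sqrt(2.0) if k == 0 else 1.0
        X[k] = np.sqrt(2.0 / N) * alpha * np.sum(x * np.cos(2 * np.pi * k * (n + 0.5) / (2 * N)))
    return X

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])       # arbitrary test sequence
print(np.allclose(dct2e(x), dct(x, type=2, norm='ortho')))    # True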

For the discrete sine transform, similarly, we can have antisymmetric extensions, like the whole-antisymmetric extension; given 4 samples, the extension has length 9, which would correspond to (2N + 1). So, instead of starting the index from 0, we define it from 1 to N − 1, i.e., effectively with respect to N − 1 samples, and that gives the period of 2N; this is the type I even DST.

(Refer Slide Time: 11:14)

For the type II even DST, similarly, we have a half-antisymmetric extension, and there also we apply the corresponding generalized discrete Fourier transform with its particular transformation matrix. You then get the type II even DST, whose expression is the following:

S_{2e}(x(n)) = X_s^{IIe}(k) = √(2/N) α(k) Σ_{n=0}^{N−1} x(n) sin( 2πk(n + 1/2) / (2N) ),   1 ≤ k ≤ N − 1

There are many other DCTs and DSTs; in fact, since there are two ends, and at each end we can have two varieties of symmetry and two varieties of antisymmetry, in total there are 16 different types of DCTs and DSTs. The type II even DCT is the one mostly used in signal, image, and video compression.

(Refer Slide Time: 12:10)

In matrix form we can represent, for example, the type II DCT; all of these transformations can also be expressed as the discrete linear transforms we discussed earlier. The (k, n)-th element of the matrix is denoted in this form, and this matrix is referred to as the N-point DCT matrix, which here is the type II DCT matrix.

In this case there are certain interesting properties which are exploited in developing different algorithms using DCT coefficients. One of these properties is that each row of this transformation matrix is either symmetric (we call these the even rows) or antisymmetric. The following equations express this property:

C_N(k, N − 1 − n) = C_N(k, n) for k even
                  = −C_N(k, n) for k odd

The transformation can be expressed in terms of a multiplication with the column vector:

X = C_N x

This gives the corresponding transformed DCT column vector, and the inverse is given in the following form:

(C_N)^{−1} = (C_N)^T

Since in this case it is an orthonormal expansion, the inverse is simply the transpose.
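
A sketch (Python/NumPy; N = 8 is arbitrary) that builds the N-point type II DCT matrix from the same element expression and checks the two properties mentioned here: rows are symmetric for even k and antisymmetric for odd k, and the inverse equals the transpose.

import numpy as np

def dct_matrix(N):
    # C_N(k, n) = sqrt(2/N) * alpha(k) * cos(2*pi*k*(n + 1/2) / (2*N)), alpha(0) = 1/sqrt(2)
    n = np.arange(N)
    C = np.zeros((N, N))
    for k in range(N):
        alpha = 1.0 / np.sqrt(2.0) if k == 0 else 1.0
        C[k] = np.sqrt(2.0 / N) * alpha * np.cos(2 * np.pi * k * (n + 0.5) / (2 * N))
    return C

C = dct_matrix(8)
signs = np.where(np.arange(8) % 2 == 0, 1.0, -1.0).reshape(-1, 1)
print(np.allclose(C[:, ::-1], signs * C))        # True: row symmetry / antisymmetry
print(np.allclose(np.linalg.inv(C), C.T))         # True: orthonormal, inverse is transpose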

(Refer Slide Time: 13:24)

One of the applications is that it can simplify convolution operations performed in the functional domain, because the convolution operation in the functional domain becomes equivalent to a multiplication operation in the transform domain when you consider the Fourier transform.

The relation is the following: if I consider a function and the impulse response of the system is h(t), then the Fourier transform of the convolution of the impulse response with that function is equal to the product of the Fourier transform of the function and the Fourier transform of the impulse response. This property is called the convolution-multiplication property of Fourier transforms.

(Refer Slide Time: 14:19)

Let us consider how this property is reflected in the discrete Fourier transform. In the discrete domain the linear convolution operation is represented with summations instead of integrations; the terms are shifted impulse responses, that is, unit impulse responses shifted along the functional domain to the integer points.

The linear combination of all those shifted responses, with coefficients coming from the functional values themselves, gives you the convolution. When you perform linear convolution, we assume that both the sequence and the impulse response are of infinite length. But in practice we handle finite sequences, and the discrete Fourier transform is applied to a finite sequence.

So, how can we modify this definition? In the part of the functional domain where the finite sequence is not defined, the values do not necessarily have to be set to 0. If you set them to 0, it becomes equivalent to linear convolution, but you can instead consider a periodic extension, as we did for the definition of the discrete Fourier transform itself. The periodic convolution of two finite sequences is then defined as the convolution between the two sequences after periodic extension.

It uses the same linear convolution formula, and it can be observed that if the two sequences have the same period, the convolved periodic sequence will also have that period. With this property we can define a circular convolution: it is sufficient to compute the periodic convolution for a single period only; it need not be computed over the whole functional domain.

(f ⍟ h)(n) = Σ_{m=0}^{N−1} f(m) h(n − m)

           = Σ_{m=0}^{n} f(m) h(n − m) + Σ_{m=n+1}^{N−1} f(m) h(n − m + N)

In a circular convolution, as I mentioned, we use the periodic extension and compute only within the interval from 0 to N−1. If I apply the periodicity definition, the sum can be broken into the two parts shown above. This is the definition of circular convolution. The interesting part is that the convolution-multiplication property of the DFT holds for the circular convolution.

If I take the impulse response of a system and a finite input sequence, both of the same length, take their discrete Fourier transforms, and multiply them point-wise, coefficient by coefficient, then I get the transform coefficients of the sequence that would have been the output of the system under periodic extension, i.e., of their circular convolution.
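
The following sketch (Python/NumPy; the two length-4 sequences are arbitrary) checks the convolution-multiplication property: the inverse DFT of the point-wise product of the two DFTs equals the circular convolution computed directly from its definition.

import numpy as np

def circular_convolution(f, h):
    # Direct computation: (f ⍟ h)(n) = sum_m f(m) * h((n - m) mod N)
    N = len(f)
    out = np.zeros(N)
    for n in range(N):
        for m in range(N):
            out[n] += f[m] * h[(n - m) % N]
    return out

f = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([0.5, -1.0, 2.0, 0.0])

direct = circular_convolution(f, h)
via_dft = np.fft.ifft(np.fft.fft(f) * np.fft.fft(h)).real
print(np.allclose(direct, via_dft))     # True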

(Refer Slide Time: 17:36)

So, this is the property in the discrete Fourier transform domain. There is another kind of convolution, based on an antiperiodic extension. Like the periodic extension, we can have an antiperiodic extension: we first perform the antiperiodic extension over the functional domain, with an antiperiod N,

f(x + N) = −f(x)

This defines an antiperiodic function. Interestingly, if I do this antiperiodic extension, the result is also a periodic function: being antiperiodic does not mean it is not periodic. It is actually periodic, but the period is now doubled; you can check this by making an antiperiodic extension, and you will see a periodicity of twice the number of samples of the original sequence.

Skew circular convolution is defined with respect to this antiperiodic extension, because we will again observe the convolved output only within the observation window of the original sequence. A skew circular convolution is defined as follows: you start again from the linear convolution definition itself, but apply the properties of the antiperiodic extension; then you get the corresponding expression.

(Refer Slide Time: 19:18)

Using circular convolution and skew circular convolution, we find that different convolution-multiplication properties also hold for DCTs and DSTs. I will show some of them for DCTs. For example, take the case where you have two functions, one of which, say y, can be considered as the impulse response, both of the same length. If you take the type I DCT of one and the type I DCT of the other and multiply them, the product must also be multiplied by a scale factor that follows from the definition of the DCTs given in this lecture.

You then get the result in the transform domain itself; it is the type I DCT of the circularly convolved output of the two sequences. Similarly, the type II DCT of one sequence together with the type I DCT of the impulse response gives the type II DCT of the corresponding convolved result. You should note that the number of samples depends on the type of DCT being applied, because in the end the periodicities must match: here one sequence has N+1 samples and the other has N samples, since N samples define a type II DCT of period 2N while N+1 samples define a type I DCT of period 2N.

So, you need to be careful while applying these particular properties. Type III DCTs have a similar interesting property, where the corresponding relation holds for the skew circular convolution.

(Refer Slide Time: 21:20)

Whatever we have discussed for one-dimensional transforms can easily be extended to two-dimensional transforms. Consider basis functions in two dimensions with a particular property, the separability property: they are separable if I can write each basis function in the form B = {b_ij(x, y) = g_i(x) · g_j(y)}

That means you can write it as a product of two basis functions, as shown above; each factor can be computed independently using its own variable. So, a 2D basis function is the product of two one-dimensional basis functions.

If both of them are orthogonal, then this set of 2D basis functions is also orthogonal, and we can reuse the one-dimensional transform computation, expressing the coefficients with the following equation:

λ_ij = Σ_y g_j*(y) ( Σ_x f(x, y) g_i*(x) )

So, first we compute the transform with respect to x, over the variation of the sequence in x, and then we take the transform with respect to y. We will see how this computation is reflected in terms of matrix operations. Note that although in this slide we have used the same notation for the two functions, they could be different; the only requirement is that both are orthogonal, to keep the orthogonality property of the 2D basis functions.

(Refer Slide Time: 23:11)

A 2D discrete transform can easily be computed using this separability property. Consider the computation this way: you can transform the columns and then transform the rows. Suppose you have a one-dimensional transformation matrix B and your input is given as an m × n matrix; that is your data.

First, I transform the columns of this input data, the m × n image block. We know that the dimension of each column is m, so we use the transformation matrix that deals with m-dimensional vectors, and each column is transformed into another m-dimensional column. Doing this for all n columns, you get m-dimensional transformed columns, n of them, so the result is again an m × n matrix.

After that, you can transform the rows of this matrix, which means you take the transpose and then apply the n × n transformation. Since all rows are n-dimensional, you use the n-point transformation matrix here, and then you get the final transform of the two-dimensional image of size m × n.

So, if I expand Y₁, I can write the whole operation in the following composite form:

Y_{m×n} = [B_{n×n} (Y₁)^T]^T = Y₁ (B_{n×n})^T = B_{m×m} X_{m×n} (B_{n×n})^T

The whole operation can be described in this particular form, and that gives the 2-dimensional discrete transform. This is how a 2-dimensional discrete transform is computed: given the 2-dimensional input image, we multiply by the corresponding transformation matrices, one on its left and the transpose of the other on its right, and we get the corresponding transformed image in the transform domain.
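
A sketch of this separable computation (Python/NumPy; the 3 × 4 input block is arbitrary), using the DFT matrices so the result can be compared against NumPy's 2D FFT:

import numpy as np

def dft_matrix(N):
    # Unnormalized DFT matrix with (k, n)-th element e^{-j 2π k n / N}
    k = np.arange(N).reshape(-1, 1)
    n = np.arange(N)
    return np.exp(-2j * np.pi * k * n / N)

X = np.arange(12, dtype=float).reshape(3, 4)    # an arbitrary m x n image block (m = 3, n = 4)
Bm, Bn = dft_matrix(3), dft_matrix(4)

Y = Bm @ X @ Bn.T                                # transform columns, then rows: B_m X B_n^T
print(np.allclose(Y, np.fft.fft2(X)))            # True: matches the direct 2D DFT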

(Refer Slide Time: 25:39)

A typical example is the discrete Fourier transform; we can express this transform using summation operations because the basis vectors here are separable and are expressed in that form. You can see that it is a simple extension of the discrete Fourier transform that you had in the one-dimensional case. Using the separability property, we can once again simplify this computation, and when we express it in terms of matrix multiplication we use the matrix representation of the DFT transformation matrix that we have already defined.

We consider an m × m transformation matrix and an n × n matrix; we denote the former simply as the m-point discrete Fourier transformation matrix. This gives the corresponding transform of the image f.

(Refer Slide Time: 26:47)

A typical example is shown here: given this image, we can perform the Fourier transform. Since every transform coefficient of the discrete Fourier transform is a complex quantity, it has a magnitude and a phase at every point, so you get two components of this transformation. For this particular image you should note that if I apply the Fourier transform directly, the coefficient values would be very large and very difficult to display.

So, we have scaled them to display well and have also shifted the origin of the transform space.

(Refer Slide Time: 27:33)

In a similar way to the discrete Fourier transform, we can also define the two-dimensional discrete cosine transform. These are simple extensions of what we had in one dimension, and in the matrix representation we have the same form as discussed when extending a one-dimensional transform to a two-dimensional transform.

(Refer Slide Time: 28:00)

This is an example of a DCT: given this input image, you get this discrete cosine transform. In this case also the coefficient values are scaled for display. As I have mentioned, there can be 16 different types of DCTs and DSTs; this is just one typical example, the type II even DCT.

(Refer Slide Time: 28:27)

I will conclude this lecture by mentioning why we require image transforms. As you can see, an image transform gives an alternative representation of an image: instead of representing the image in the functional domain itself, we represent it in a different domain, which gives a different insight into the structure of images.

For example, if I consider the frequency-domain representation using Fourier transforms, we get low-frequency and high-frequency components. This may be useful for providing a more compact representation: you can use only a few transform coefficients to get a very close approximation of the functional representation, or you can perform selective quantization of components, considering their effect on our perception.

These are used in algorithms for image compression and even video compression. Many processing tasks also become convenient when we use the transform coefficients; we have already discussed filtering, and there are operations like enhancement, restoration, and many others where these transforms are useful. With this I end my lecture on image transforms.

Thank you very much for your patience and listening to my lecture.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Module – 05
Projective Geometry

In this lecture we will discuss Projective Geometry and its application to the analysis of images in Computer Vision.

(Refer Slide Time: 00:27)

Earlier, during the discussion of the fundamentals of image processing, we saw how images are formed in a camera. Here you can see the rule of projection: given a scene point, you draw a ray from that point which passes through the center of the lens, and the point where that ray intersects the image plane is the image point of that three-dimensional scene point.
And this perspective projection rule is applied for any scene point for getting the image point
of any scene point you take another scene point here and once again let me draw a line
passing through this center and when it is intersecting the plane of projection, that makes the
image point of the other scene point. So, in this way the projections are formed. So, we can
see that this particular geometry can be well understood or well studied in the case of using

79
projective geometry. So, let us understand what is a projective geometry. So, these are the
things for same point.

(Refer Slide Time: 02:00)

Usually in mathematics we study geometry in real space, in school-book geometry or later in analytical geometry. Consider a two-dimensional space where any point is denoted by a particular pair of coordinates, x and y; since this is the Cartesian product of real axes, we denote this two-dimensional space as R², and the point belongs to this two-dimensional coordinate space. Following this coordinate convention, there is an origin and there are x and y axes with respect to which these coordinates are correspondingly defined.

Now consider a projective space, again a projective space in two dimensions, but implicitly there is a three-dimensional space behind that definition. It is like an image: when you look at an image, all the points in the image lie in a two-dimensional plane, but ultimately those points are related to three-dimensional points in the scene.

If we apply the rule of perspective projection, we can consider that in this three-dimensional space there is a point which is the origin, and if there is a ray from the origin passing through a particular point p, then that point is the representative of the ray itself: any point on this ray is equivalently represented by this representative two-dimensional point.

This is what we call a point in a projective space, and we call this space a two-dimensional projective space. Note that if I take any other point in this space and again draw a line passing through the origin, extending it as a ray, the whole line is represented by its point in this plane. This is the two-dimensional projective point, and every point in this plane is represented in this form: every point in the plane actually represents a ray. That is the basic representation of a projective space.

In this case a point (x, y, 1) is the canonical representation in the projective space, where you have a three-dimensional coordinate system. Implicitly, in this three-dimensional coordinate system, the plane is at a distance of one along the z direction, and the coordinate axes in this plane are parallel to the corresponding axes of the implicit three-dimensional coordinate system. Any point on this ray is represented with a proportionality factor k: it can be written as (kx, ky, k), where k is a parameter.

Finally, a point (x, y) in the two-dimensional real space is equivalently represented in the two-dimensional projective space as the set of points (kx, ky, k), where k varies; this set actually represents the ray. This is the relationship between considering an image as a two-dimensional real space and, on the other hand, considering it as a two-dimensional projective space under the perspective projection of imaging.

This coordinate system is called the homogeneous coordinate system, and it has an additional dimension which denotes the scale. If all other coordinates are divided by this scale, you get the non-homogeneous representation, which has a direct relationship with the two-dimensional real space.

(Refer Slide Time: 06:46)

Let us understand a bit more about this homogeneous representation. Consider a point in R², the real space, the ordinary space we are used to in our geometry. Let us consider a point x; here I am using the vector notation to denote that it is a tuple, a two-tuple of coordinates. So, it is a vector; you can also represent it as a column vector of coordinates, and that is the representation I will follow throughout this lecture.

Let this point be x = (x, y)^T. The corresponding point in the projective space is denoted as (kx, ky, k)^T, or canonically (x, y, 1)^T. Here also you can see that it is a 3-vector representation with an additional dimension denoting the scale, which multiplies all the other coordinates. These two are equivalent representations from our perspective: I can either denote a point as a point in the two-dimensional real space or as a point in the two-dimensional projective space with three coordinate dimensions.

82
One of the interesting facts in this projective space is that the origin of that space is not
included.

So, it is the singular point you cannot form a projection ray between two at that point itself.
So, it is a singular point in this space which is not included as a set of points in the projective
space.

So, any point of the three-dimensional real space excluding the origin can be represented as a point in the projective space; however, you should understand that the elements of the projective space are not really points. They are rays passing through the origin, and each ray can be mapped to a point in a particular plane in that space.

(Refer Slide Time: 09:21)

Let us understand this with another example. Consider a point in the projective space given as (25, 30, 5). What should be the corresponding point in the real space? How do we convert it? As I mentioned, there is an additional dimension which represents the scale.

To convert into the real space, I simply divide the coordinates by this scale, and then we get the corresponding real-space point. You can divide 25 by 5, which gives 5, and 30 by 5, which gives 6, so this point is equivalently represented by the coordinates (5, 6) in the real space.

Let us also consider another tricky situation. Suppose I am considering the point (0, 0, 0) in the projective space: what should be the corresponding point in the real space? As we have discussed, the origin is not included, so this question is not valid, because (0, 0, 0) is not a point in the projective space. I cannot convert this point; as you can see, 0/0 is also mathematically undefined, so this is consistent with the mathematical notation and operations. It is not a point in the projective space, so I cannot consider any equivalent point of it in the real space; it does not belong to the projective space.
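
A small helper sketch for this conversion (Python/NumPy; the function name is my own, not from the lecture):

import numpy as np

def to_inhomogeneous(p):
    # Convert a homogeneous 3-vector (kx, ky, k) into the real-space point (x, y).
    p = np.asarray(p, dtype=float)
    if p[2] == 0.0:
        raise ValueError("scale is 0: no corresponding finite point in the real plane")
    return p[:2] / p[2]

print(to_inhomogeneous([25, 30, 5]))    # [5. 6.]
# to_inhomogeneous([0, 0, 0]) raises an error: (0, 0, 0) is not a point of the projective space.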

(Refer Slide Time: 11:05)

Let us understand another fact: the homogeneous representation of a line in a plane, and see how lines are represented here. Once again consider a two-dimensional real space, let p be a point, and consider a line through this point, represented by our familiar straight-line equation ax + by + c = 0, where a, b, c are the parameters of the straight line.

In this representation the parameters a, b, c themselves represent the straight line, and we know that two parameters are sufficient instead of three, because it is the ratios of a, b, c that uniquely identify a particular straight line. So, there is a projective space which represents lines: a line in the two-dimensional space is represented by a point in this projective space, following the same representation.

You have an implicit three-dimensional coordinate system; it is once again a two-dimensional projective space, and any element of this space is a ray passing through the origin. If it passes through the particular point (a/c, b/c, 1), that is a representation of this straight line. So, a straight line is represented like a point in this space; it is an element, as I mentioned, and any point on the ray, whether (a, b, c) or (ka, kb, kc), represents the same straight line. As you can see, the third dimension again represents a scale for this line element.

In fact, we can write this straight-line relationship as a dot product of the two vectors, or equivalently as a matrix multiplication, where one factor is the transpose of the column vector of the projective point and the other is the straight-line vector. In a simple form, this point-containment relationship can be written as x^T l = 0, using the transpose of the column vector of the point x.

Equivalently, the line vector transposed and multiplied with the column vector of the point is also equal to 0, which again represents the straight-line equation. So, the same operation can also be written as l^T x = 0. In this way you can see that the point-containment relationship can be represented either way.

(Refer Slide Time: 14:38)

Points and lines in the projective space have nice complementary relationships, which we call duality relationships. Let us explain these dual relationships. Any line in the real space can be defined by two points; we know that two points uniquely define a line. This line can be computed as the cross product of two 3-vectors: x₁ is the projective (homogeneous) representation of the point X₁(x₁, y₁) and x₂ is the homogeneous representation of the point X₂(x₂, y₂). These are 3-vectors, so if I take their cross product, I get the straight line.

In the real space, a point can also be defined as the intersection of two straight lines, which in the projective space has the following form: p = l₁ × l₂.

Here it is the cross product of the two lines represented in the projective form, and we get the corresponding intersection point, also represented in the homogeneous coordinate system, i.e., as an element of the projective space. You should recall how the cross product of two vectors is defined; this is the usual mathematics we learnt in school.

You should see that X₁ corresponds to the homogeneous coordinates [x₁, y₁, 1] and [x₂, y₂, 1] are the coordinates corresponding to the point X₂. The cross product of two 3-vectors is defined through the determinant whose rows are (i, j, k), (x₁, y₁, 1), and (x₂, y₂, 1), where i, j, k are the unit vectors; if I expand this determinant, I can always compute the line.

To summarize: there is exactly one line that passes through two points and exactly one point at the intersection of two lines, and their relationships in the projective space are quite similar. A straight line can be expressed as the cross product of the two points in the projective space, and a point can be expressed as the cross product of the two straight lines, with all representations again in the projective space.

(Refer Slide Time: 17:32)

One of the interesting applications is to find the straight line passing through two points by applying this method. We have learnt this method from simple coordinate geometry, but we can see that the computation can be summarized very briefly using the cross product operation and the concept of projective geometry.

Let us consider the problem of finding the line passing through the points (3, 5) and (5, 0) in a plane. Naturally, (3, 5) and (5, 0) are points in the two-dimensional real space. We take the cross product of the corresponding projective points: the homogeneous representation of (3, 5) and the homogeneous representation of (5, 0); that is, the cross product of the vectors (3, 5, 1) and (5, 0, 1). That is the representation of the straight line.

Let me expand this and do the sum. As written earlier, we arrange the components in a determinant with the unit vector directions i, j, k in the first row and the corresponding vector components in the next rows. Expanding the determinant: for the i component I take the first sub-determinant, which gives (5 − 0)i; then I suppress the middle column, take that sub-determinant, and negate it, which gives −(3 − 5)j; and for the k component I take the remaining sub-determinant, which gives (0 − 25)k.

So, this is 5i + 2j − 25k. I hope my computation is correct; let me check it against the result, and you should verify it as well. Finally, the line is represented by the vector (5, 2, −25). If I write the equation of the straight line, 5 is a, 2 is b, and −25 is c, so the equation of the straight line is 5x + 2y − 25 = 0. That is the equation.

(Refer Slide Time: 21:35)

Let us see the result. This is a summary: as you can see, we get the equation 5x + 2y − 25 = 0, the equation of the straight line in our usual understanding of coordinate geometry. Let us do another example. We have computed the line from two points; now consider two lines whose point of intersection we would like to compute, given by the two equations below.

Once again let us work it out. What operation do we have to do? If the intersection point is p, it is the cross product of the two vectors l₁ and l₂. The lines are l₁: 5x − 2y + 4 = 0 and l₂: 6x − 7y − 3 = 0, so l₁ = (5, −2, 4) and l₂ = (6, −7, −3), and we need to take the cross product of these two. Once again I consider the same kind of determinant.

Then, if I expand it, what is the expansion? For the i component, the sub-determinant gives 6 plus 28, that is, (6 + 28)i. For the middle component, suppressing the middle column, the sub-determinant is (−15 − 24), and with the negative sign in the expansion it contributes −(−15 − 24)j, that is, +39j. For the third component we take the remaining sub-determinant, which is (−35 − (−12))k, that is, (−35 + 12)k.
know.

So, the expansion is 34i + 39j − 23k, and p is represented by the vector (34, 39, −23). This is the homogeneous representation of the point; to get the actual point, you need to divide all the coordinate values by the scale value (−23).

(Refer Slide Time: 26:13)

Let us find out the result of this operation. You can see that we get (34, 39, −23), and as I mentioned, all the coordinate values have to be divided by the scale value −23, which gives the point (−34/23, −39/23). You should verify the computation as well.
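
Both worked examples can be verified with a few lines (Python/NumPy), using the cross product exactly as described above:

import numpy as np

# Line through the points (3, 5) and (5, 0): cross product of their homogeneous forms.
l = np.cross([3.0, 5.0, 1.0], [5.0, 0.0, 1.0])
print(l)                                   # [  5.   2. -25.]  ->  5x + 2y - 25 = 0

# Intersection of 5x - 2y + 4 = 0 and 6x - 7y - 3 = 0: cross product of the line vectors.
p = np.cross([5.0, -2.0, 4.0], [6.0, -7.0, -3.0])
print(p)                                   # [ 34.  39. -23.]
print(p[:2] / p[2])                        # [-1.478... -1.695...], i.e. (-34/23, -39/23)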

(Refer Slide Time: 26:36)

Let us complete this lecture by explaining the duality principle. As we have seen, points and lines are both represented in the projective space, and their relationships complement each other nicely. For example, the containment relationship can be expressed as x^T l = 0 and also as l^T x = 0; if I interchange l and x, the relationship still holds. That is the duality principle.

It is not only this point containment: consider also the line-intersection and point relationships expressed by cross products of 3-vectors. The cross product of two lines, l × l′, gives the corresponding point, and the cross product of two points, x × x′, gives the corresponding line.

Once again, if you interchange the places of lines and points, you still get the same principle; that is the duality principle. To any theorem of two-dimensional projective geometry there corresponds a dual theorem, which may be derived by interchanging the roles of points and lines in the original theorem. With this understanding of the duality principle, let me stop here; we will continue this lecture in the next video lecture session.

Thank you very much for listening to my lecture.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 06
Projective Geometry Part - II

(Refer Slide Time: 00:20)

We will continue our discussion on Projective Geometry. We have seen how points and lines are represented in a two-dimensional projective space. We have seen that a point in the projective space is represented with an additional dimension, because there is an implicit three-dimensional representation behind the two-dimensional projective space.

In this case, any point represents a particular element, i.e., a ray passing through the origin and through that point, and every point in this space is represented by such an element. Similarly, a line in the plane of projection, given by the corresponding equation ax + by + c = 0, is also represented as an element of a projective space, a different projective space that represents lines. There too, a point in that projective space represents a line, and it also represents an element, a ray passing through the origin, through that point, and extending towards infinity.

As you can see, the parameters of this equation are now used to represent the line in the two-dimensional projective space. We have also learned the relationships between points and lines; there is a duality in expressing these relationships. For example, the point-containment relationship can be expressed in this form,

this is a point is represented that say transposition of the column vector of point
representation in a canonical form, and this is the line representation of the line what we have
shown here. So, if I take this matrix product that should be equal to 0, as if this is a dot
product of these two vectors.

So, this could be expressed in this form: I can write it as $x^T l = 0$. The transposition is
represented here, and it becomes a matrix multiplication. And we can see in this relationship
that if we interchange the positions of point and line, the same relationship holds; that is the
duality principle. Similarly, there is another example of this kind of dual representation,
namely when you would like to compute a point given two lines, or a line given two points in
this space.

How do you get that relationship? A point is the intersection of two lines. So, we consider
another line $l'$, and the operation $l \times l'$ gives you the corresponding point of
intersection. Similarly, if we consider a line defined by two points, say $x$ and $x'$, you get
$x \times x'$ as the line $l$. So, this is the duality I was talking about: you interchange
points and lines in this relationship, and the relationship still holds. Now we will continue
this discussion and see what further properties there are in projective geometry.

(Refer Slide Time: 04:55)

So, one of the interesting property in this space that how do you express the intersection of
parallel lines? We know in normal two-dimensional real space which we studied in our
school geometry, two parallel lines they intersect at infinity, but there we could not qualify
the nature of infinite point, nature of point of intersection at infinity. We will see in the two-
dimensional projective space this could be qualified. Let us see how, let us compute this
intersection.

So, here we are going to compute, here you can see that there is an there are examples of two
parallel lines this is given by this equation. Suppose, this is this parallel line and take another
parallel another line which is parallel to it by this equation. You can observe that the
coefficients a and b they remain same; they remain the same, so that is why this parallelism is
established.

So, to compute the intersection of these two parallel lines, we can apply the cross product
operations of three vectorial representation of these lines. So, we will perform that
competition say line l 1 represented by this three vectorial form, and l 2 is also represented by
another three vectorial form. And we would like to take the cross product of these two to
compute the point of intersection. So, as we did this exercise in the previous lecture, we will
carry out the same computations in the similar fashion we will be computing it.

So, we are computing the cross product. Let me consider the components of these vectors,
arrange them in rows, and then expand the determinant. As you understand, the sub-determinant
we need to compute as the component of $i$ gives $(bc_2 - bc_1)i$. The middle part we write as
the negative of its sub-determinant, which is $-(ac_2 - ac_1)j$. And finally, the third
component, obtained by suppressing the third column, is $(ab - ab)k$.

So, if I write it in vector form, the resultant vector is $(b(c_2 - c_1),\; a(c_1 - c_2),\; 0)$;
note the change of sign in the middle component because of the negative sign, and the third
component is 0. So, this is the intersection point. In fact, I can equivalently write this
vector as $(b, -a, 0)$ by taking the scale factor $(c_2 - c_1)$ outside.

This equivalent representation itself is sufficient to say that this is the point of
intersection of the two parallel lines. Now, notice that the scale value of this point is 0.
So, if I divide the other coordinates by the scale value, those coordinates become infinite.
But the nature of the infinity is captured here, because it is qualified by the two values $b$
and $-a$. Let us try to understand the significance of this representation.
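The computation above is easy to reproduce numerically; here is a small sketch in Python/NumPy (an illustration of mine, with assumed values for $a$, $b$, $c_1$, $c_2$):

```python
import numpy as np

a, b = 2.0, 3.0                   # shared coefficients of the two parallel lines
l1 = np.array([a, b, 1.0])        # a*x + b*y + c1 = 0, with c1 = 1
l2 = np.array([a, b, 5.0])        # a*x + b*y + c2 = 0, with c2 = 5

p = np.cross(l1, l2)              # homogeneous point of intersection
print(p)                          # [12. -8.  0.], proportional to (b, -a, 0)
```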

(Refer Slide Time: 10:18)

So, if I erase these computations, you can see that we have the point of intersection in this
form. And how is this particular point of intersection represented here? We will say that in
the two-dimensional projective space this point $(b, -a, 0)$ is represented as a point in a
plane which is parallel to the projection plane. This plane is called the principal plane,
because it contains the axes $x$ and $y$.

And not only this point b minus a b minus a 0, but also the ray passing through this point
connecting to the center O, the whole ray itself is representing this point because that is how
the elements in the projective space is represent and this is the point of intersection in this
representation. This point is called ideal point that is a technical term will be using it more
often. And the plane where all these points are line for all of them the third coordinate is 0
that plane is also called ideal plane which is incidentally is the principal plane of this
representation. This form of representation is called canonical form of representation.

(Refer Slide Time: 12:12)

So, let us understand the meaning of an ideal point. So, we consider a two-dimensional plane,
where you have these two parallel lines, and these are the x axis, and this is the y axis. So,
this is x axis; this is y axis; say this is the origin of this representation. So, a straight line
given this equation ax plus by plus c equals 0, one of the straight lines in this representation,
and you know the other straight line which is parallel to it should can be represented as ax
plus by plus some value c 1 which is not equal to c in this case that should be equal to 0.

This straight line can also be represented in another, well-known analytical geometric form:
$y = -\frac{a}{b}x - \frac{c}{b}$, where $-\frac{a}{b}$ is the slope of this representation. The
relationship between the slope and the angle which the line makes with the $x$ axis is also
known to us: the tangent of this angle gives you the slope.

So, we can see how a and b they are related with this representation. So, intersection point is
given by this b, minus a, 0 that you have computed and this point is related with this slope.
So, what is a point, ideal point? In that case it is simply representing a direction, a direction in
this two-dimensional plane. So, a point ordinary point in the two-dimensional perspective
projection space or two-dimensional projection space is representing a ordinary point, there is
an one to one correspondence with the ordinary point of a two-dimensional real space also.
Whereas for the ideal point, it corresponds to a direction in that plane, and that direction is
given by this angle theta which makes an angle with respect to x axis that is the implication
of an ideal point.

(Refer Slide Time: 15:09)

So, just to summarize this fact, ideal points are points on the x y plane or principal plane
parallel to projection plane. And for canonical coordinate system, they are of the form x y 0.
So, the third dimension which represent the scale that would be 0. An ideal point denotes a
direction toward infinity that is the implication of an ideal point.

(Refer Slide Time: 15:52)

There is another interesting concept in this projective space, called the line at infinity. So,
let us try to understand what the line at infinity is. Notice this particular axis, which is
extended towards the direction of the parameter $c$; it is represented by the column vector
$(0, 0, 1)^T$. This is also an element of the two-dimensional projective space which represents
all the lines. So, this is the representation of a special line.

Let us see what is the property of this spatial line. Let us consider this particular operation. It
says multiplication of the transpose of a point incidentally which is an ideal point. So, this is
an ideal point and this is the line, what I was referring at. If I perform this multiplication, you
can say that this is giving you 0, it is very simple to check this computation.

So, what does it signify? Choose any ideal point and perform this operation; you will get 0.
This is the relationship between a point and a line, the point containment relationship, which
means all the ideal points lie on this particular line, and this line is called the line at
infinity. To summarize the definition of the line at infinity: it is the line containing every
ideal point, and in the canonical system it is given by $(0, 0, 1)^T$.
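A one-line numerical check of this containment (my own illustration in Python/NumPy):

```python
import numpy as np

l_inf = np.array([0.0, 0.0, 1.0])    # line at infinity in canonical form
ideal = np.array([3.0, -2.0, 0.0])   # an ideal point of the form (b, -a, 0)

print(ideal @ l_inf)                 # 0.0 -> every ideal point lies on this line
```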

(Refer Slide Time: 17:58)

So, what should be a model for the projective plane? In this case, we can represent all the
points in the projective plane using this geometric concepts, this is a geometric model. So,
you can observe that this is a plane of projection, this is a plane of projection which is
represented by the symbol pi. So, all the points which are in the real space and which
corresponds to a point in the projective space directly, they lie on this particular plane of
projection. And every point corresponds to a ray passing through this point connecting the
origin. So, any point is related with a ray connecting to origin passing through that point.

Similarly, if I have considered a point in the principal plane or ideal plane that is also an
element of the projective space. So, all this point which are lying in this plane there was a
part of the projective space. And they are representing all ideal points and as I mentioned
they are representing a direction with respect to this plan of observation. And any straight line
on this plane you can see it is geometric interpretation is that it is a intersection of a plane
containing the origin and intersection with the plane of projection.

So, this is what is your a geometric model by which we can understand the two-dimensional
projective space, so that is what a straight line passing through the origin, that is how a point
is represented in a projective space. And a plane passing through the origin intersection of
that plane with respect to the projection plane that intersection represents a line in that on that
plane, or any line is actually representing a plane passing through the origin.

So, mathematically we can say the set of all points in a projective space is also related or they
are equivalent to set of all points in the three-dimensional real space excluding the origin, as I
mentioned earlier origin is a singular point of the projective space. Similarly, I can consider
also a real space two-dimensional, real space every point in that real space representing some
point in the projective space.

In addition to that, there is another plane parallel to the real space that is the canonical in the
canonical representation or ideal plane, all points in that ideal plane is also represented. So,
instead of writing it as a plane containing all points, simply I can write all those points, they
lie on a particular line which is called line at infinity. And this line at infinity is given by this
particular you know structure. So, these itself represents all the points in the ideal plane. So,
this is a summary of this representation.

(Refer Slide Time: 22:02)

Let us try to understand another particular feature in the two-dimensional projective space
that is called projection of parallel lines that feature we would like to check from any
arbitrary plane how this projection appears in the projective space. Let us consider a
projective space given by this representation, that means, there is an implicit three-
dimensional representation. You have this ideal plane, you have those access a plane of
projection, and any point in this projection plane is represented through this plane of
projection.

And let us consider a plane in a arbitrary plane, and a parallel line two parallel lines in that
plane. So, a plane is denoted here by the symbol pi. And if you would like to project this
parallel line on the canonical plane, let me draw these two rays passing through any points
lying on this plane. So, these rays they intersect the plane of projection, and the intersection
would be given by a straight line lying on that plane of projection.

Similarly, consider the other line which is parallel to the parallel to this line. And if I consider
the other line and perform the same representation, same projection, and projection of that
line on the canonical plane which means I have to get the intersection of rays connecting two
points lying on that straight line, and those intersecting points they will form a line. What do
you observe that though the lines are parallel in plane pi, but in the canonical projection plane
these lines they are meeting to a particular point. And this line this point is called vanishing
point. So, vanishing point is a point of intersection of parallel lines which are projected on the
canonical plane.

(Refer Slide Time: 24:33)

We try to understand a bit more about this vanishing points; the their implications would be
more clear here. You consider parallel lines on plane pi in various directions. Suppose, you
take two directions and there are two representative lines which parallel lines which are
denoting those directions. And if I take the projections of those lines in the canonical
projection plane or in the plane of projection, as we can see that these two lines they would
appear like meeting at some point which is a vanishing point.

Similarly, say other two lines, it would appear also meeting at some point which is a another
different vanishing point. So, all parallel lines in that direction, so if I take another parallel
line in this direction, that would also meet at the same vanishing point. If I take another
parallel line see in this direction, that would also met in the same vanishing point here for this
group of lines. Interestingly if I connect these two vanishing points, then we get a line, and
this line is called vanishing line, because any parallel lines set of parallel lines in any
directions they are vanishing points in the plane of projection will lie on this particular line
which is called vanishing line.

For example, if I consider another set of parallel lines and take their projection on the plane
of projection, what will I get? I will observe that those two projected lines meet at a point,
which is their vanishing point, and that point will also lie on the same straight line
connecting the vanishing points we have seen earlier; that means the point lies on the
vanishing line. So, this is the summary: the vanishing points corresponding to the parallel
lines of a plane lie on a line, and that is called the vanishing line.
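In practice a vanishing point can be computed directly from two projected image lines using the cross products introduced earlier. Here is a small sketch in Python/NumPy; the image coordinates are hypothetical, chosen only for illustration:

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two image points given as (x, y)."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

# Assumed endpoints of two edges that are parallel in the scene.
l1 = line_through((10, 20), (200, 60))
l2 = line_through((10, 120), (200, 100))

v = np.cross(l1, l2)          # vanishing point in homogeneous form
print(v[:2] / v[2])           # about [326.7, 86.7] in image coordinates
```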

(Refer Slide Time: 27:21)

Let me draw a real life example to show that how vanishing points do exist. You take this
particular image and you can see that the edges in the horizontal direction as we understand
from that notice board and edges in the vertical directions, they are meeting at some point.

For example, in the horizontal direction, if these two edges this particular two edges they are
meeting here and in the vertical direction.

So, here what is shown here that even you take another parallel line, another line parallel to
same direction like this text; text are also in the horizontal direction. So, this line also will be
meeting at the same vanishing point, because as I mentioned all lines parallel to a given
direction will meet on a single point that is the vanishing point. And similarly the vertical
edges also they will also meet some vanishing point, and connecting these two vanishing
point we will give you a vanishing line.

(Refer Slide Time: 28:49)

This is a visual demonstration of a vanishing point: this is an image of a road, captured from
the front of a car. You can see how the edges of the road meet at a point at infinity. We can
sense this point, but it remains forever out of reach, let us say. So, our journey is towards
infinity; we can see it from the perspective projection point of view, but we can never really
touch it. That is how a vanishing point could also be interpreted.

(Refer Slide Time: 29:35)

There is another element in the two-dimensional projective space, called conics, and we will
also consider their representation in a projective space. So, how are conics represented? They
are curves described by a second-degree equation, and this is the form of the equation which
has been shown here.

If we translate this representation into homogeneous coordinates, each point, instead of being
represented by the 2D real coordinates $x$ and $y$, is represented using the scale factor: $x$
is equated with $x_1/x_3$ and $y$ with $x_2/x_3$. If I substitute these into the equation, we
get a representation of conics in the homogeneous coordinate representation, and this is how
that representation looks.

(Refer Slide Time: 30:58)

To keep this representation brief, once again we will use the matrix form, using vectors to
represent a point. We can see that a two-dimensional matrix represents a conic; this is the
general form of the representation of a conic. The coefficients $a, b, c, d, e, f$ represent
the conic, and the equation can simply be written as $x^T C x = 0$. If you are wondering how I
could get it, consider the homogeneous coordinate representation: write $x^T$ as
$(x_1, x_2, x_3)$, take $C$ as given by this matrix, and then the column vector
$(x_1, x_2, x_3)^T$.

If I perform this matrix multiplication, you can check that you simply get this expression. So,
finally, a conic is represented by this matrix $C$; you can observe that it is a symmetric
matrix of dimension $3 \times 3$. How many parameters are there? There are six parameters,
$a, b, c, d, e, f$. But, as you know, if I multiply $C$ by $k$ in this equation, it still
remains the same conic; so it is an element of a projective space, and one of the parameters
can be treated as a scale factor. Ultimately the degrees of freedom of this representation are
5, though there are 6 parameters: I can represent a conic by these 6 elements, but one of them
is a scale, so the degrees of freedom are 5.

(Refer Slide Time: 33:09)

So, naturally to define a conic uniquely I need at least five points in the two-dimensional
projective space and I can write those equations using that five points. So, this is the
equation, they should satisfy this equation, and this is a representation of a conic also in a
vectorial form.

(Refer Slide Time: 33:42)

And if I get the five points, I can write five equations and then solve them, because there are
5 degrees of freedom; by fixing one of the parameters at some value, I can solve it. There
could be rank deficiency in this representation, rank deficiency in $C$. In that case the
degrees of freedom are less than 5, and there are special cases called degenerate conics: for
example, two lines for rank 2, and a repeated line for rank 1. Those are the rank-deficient
representations of $C$. We will also check how these representations can be expressed
analytically.
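As an illustration of fitting a conic from five points, here is a sketch in Python/NumPy. It assumes the usual form $ax^2 + bxy + cy^2 + dx + ey + f = 0$ with the symmetric matrix built from these coefficients (as in Hartley and Zisserman); the five sample points are hypothetical, chosen on the unit circle so the answer can be checked:

```python
import numpy as np

# Five assumed points lying on the unit circle x^2 + y^2 - 1 = 0.
pts = [(1, 0), (0, 1), (-1, 0), (0, -1), (np.sqrt(0.5), np.sqrt(0.5))]

# Each point contributes one row [x^2, xy, y^2, x, y, 1].
A = np.array([[x*x, x*y, y*y, x, y, 1.0] for x, y in pts])

# The coefficient vector (a, b, c, d, e, f) spans the null space of A.
_, _, Vt = np.linalg.svd(A)
a, b, c, d, e, f = Vt[-1]

C = np.array([[a,   b/2, d/2],
              [b/2, c,   e/2],
              [d/2, e/2, f  ]])
print(C / C[0, 0])   # proportional to diag(1, 1, -1), the unit circle
```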

(Refer Slide Time: 34:34)

For a conic, the tangent line at a point lying on the conic is related to that point in a very
convenient form, by a simple linear relationship: if I multiply the point by the matrix $C$, we
get the corresponding tangent line, $l = Cx$.
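A tiny numerical check of this relation (my own sketch in Python/NumPy, using the unit circle as the conic):

```python
import numpy as np

C = np.diag([1.0, 1.0, -1.0])   # x^2 + y^2 - 1 = 0 as a conic matrix
x = np.array([1.0, 0.0, 1.0])   # the point (1, 0) lying on the conic

l = C @ x                       # tangent line at that point
print(l)                        # [ 1.  0. -1.], i.e. the vertical line x = 1
```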

(Refer Slide Time: 35:07)

Now, this gives an interesting relationship: we have a dual representation of a conic. In this
case a conic can be represented by all its tangent lines, which form an envelope of the conic.
And you can see that the expression is also of a similar form: in the previous case it was
$x^T C x = 0$, and now it is $l^T C^* l = 0$, where $C^*$ is another representation of the
conic, a different $3 \times 3$ matrix.

The relationship between the original conic representation and the dual conic representation
can be found from this particular case. If I have $l = Cx$, then I can get $x = C^{-1}l$, and
$x^T C x = 0$ is the original representation. From there I can derive a representation
involving only the line $l$: in the algebraic manipulation every $x$ is replaced by $C^{-1}l$,
and using the transpose property of those matrices we finally get $l^T$, then a composite
matrix involving $C$, and another $l$, equal to 0, namely $l^T (C^{-1})^T C\, C^{-1} l = 0$.

Now, this whole expression can be considered as a representation of another form of the conic,
which is the dual representation. We can simplify the expression further using matrix algebra:
it can be written as $l^T C^{-T} l = 0$, involving the transpose of $C^{-1}$, because you know
that $C C^{-1}$ is equal to the identity matrix.

So, this is the identity matrix, and we can simply drop it from the term; what remains is what
we call $C^*$. Incidentally, since $C$ is a symmetric matrix, the transpose of its inverse is
the same as the inverse itself, that means $C^* = C^{-1}$.

So, finally, the dual conic representation of $C$ is nothing but its inverse, which is an
interesting and very beautiful relationship involving conics. There is a nice picture from the
book by Hartley and Zisserman: you can see the original representation of the conic in the
point space and the representation of the conic with lines; these are the two dual
representations.
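Continuing the small numerical example, the tangent line found above indeed satisfies the dual conic equation (again my own sketch in Python/NumPy):

```python
import numpy as np

C = np.diag([1.0, 1.0, -1.0])     # point conic (unit circle)
C_star = np.linalg.inv(C)         # dual (line) conic; here it equals C itself

l = np.array([1.0, 0.0, -1.0])    # the tangent line x = 1 found earlier
print(l @ C_star @ l)             # 0.0 -> the line belongs to the dual conic
```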

(Refer Slide Time: 38:11)

And degenerate conics we mentioned earlier if the rank of matrix C itself is less than 3, then
we have some degenerate conditions of representation. Like with rank 2, a conic is defined by
only two lines or two points, which are contained in a conic and they are defined only that
lines and points. For example, a rank 1 it is the repeated lines and points.

For example, a degenerate point conic is specified using two lines, say the lines given by the
vectors $l$ and $m$. Then $l m^T + m l^T$ itself gives you the conic representation. Note that
$l$ in vector form is a 3-vector, so in this computation one factor is $3 \times 1$ and the
other is $1 \times 3$; the dimension of the product is $3 \times 3$, which is a conic
representation, but its rank is 2, since it involves only two independent directions.

So, if I take any line $l$ and another line $m$, this operation on the pair of lines represents
a conic: any point lying on either of the two lines satisfies the conic equation, and that is
what represents the degenerate condition. Similarly, a degenerate dual conic is represented by
two points, as $x y^T + y x^T$. So, this is the degenerate representation of conics.
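A short check of the rank-2 construction (my own sketch in Python/NumPy, with two assumed lines):

```python
import numpy as np

l = np.array([1.0, 0.0, 0.0])         # the line x = 0
m = np.array([0.0, 1.0, 0.0])         # the line y = 0

C = np.outer(l, m) + np.outer(m, l)   # degenerate point conic of the two lines
print(np.linalg.matrix_rank(C))       # 2

x = np.array([0.0, 5.0, 1.0])         # a point on the line x = 0
print(x @ C @ x)                      # 0.0 -> contained in the degenerate conic
```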

(Refer Slide Time: 40:09)

So, we can summarize our discussion on this projective geometry is two-dimensional


projective space that a point in a 2-D projective space it is represented by a ray passing
through origin of an implicit 3D space. It requires an additional dimension for representation.
And we call that representation as the homogenous coordinate representation. Then a straight
lines in 2D real space those are also can be represented as an elements of a 2D projective
space that is the space representing for lines of 2D real space.

Points and lines obey the duality theorem. These are the duality relationships we have learned:
$x^T l = 0$ is the point containment relationship, which can also be expressed in the dual form
$l^T x = 0$; and $x = l \times l'$, the intersection of two lines giving a point, whose dual
form is the cross product of two points giving a line, through the same kind of operation. Then
there are conics in the 2D projective space, which are represented by a $3 \times 3$ symmetric
matrix, and every conic has a dual conic, or line conic, as the envelope of its tangents. Here
we come to the end of this particular lecture.
lecture.

Thank you very much for listening.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 07
Projective Transformation

In this lecture, we will discuss transformations in projective space.

(Refer Slide Time: 00:23)

So, let us define what is meant by a projective transformation. Firstly, a projective
transformation is a transformation of a point in a projective space to a point in another
projective space. This transformation should be invertible, and it should preserve the
collinearity of every three points. For example, if you have three points, say
$x_1, x_2, x_3$, which lie on the same line, then the transformed points should also lie on a
common line. That property has to be satisfied for every triple of points, and then the
transformation is called a projective transformation.

There is a figurative explanation that we can provide here. Consider this particular
configuration: we have shown three points lying on a particular line, drawn in different
colors. Consider a transformation; again, the colors show the corresponding transformed points
in another space, and the mapping is shown by these arrows. The mapping is one-to-one and
invertible, and since the original points were collinear, the transformed points should also be
collinear. When these properties are satisfied for every configuration of points, this is a
projective transformation.

(Refer Slide Time: 02:24)

There are various examples of this transformation. For example, change of coordinate
convention. You consider this particular case where a point explained in a plane which is
shown here as  ' plane and that is mapped with a point x in another plane. So, these two
points, every point, for every point in this plane there is a corresponding point in this  '
plane and they are related by this geometric rule. If I draw a ray connecting that point to
the center of the coordinate which is a center of projections in this case then you will find,
I mean you will get the corresponding transformed point.

So, you can see that with this particular configuration every point in this space is mapped to
another point in that space, which is also a two-dimensional projective space. There could be
different coordinate conventions: the coordinate axes could be in different orientations; they
need not be parallel to the coordinate axes of the implicit three-dimensional space
representing this projection, and each plane could have its own coordinate definition. So, if I
represent this point in canonical form as $(x', y', 1)^T$ and the other representation as
$(x, y, 1)^T$, we can observe that they are related by a transformation, and this is called a
projective transformation. We will see what the form of this transformation is subsequently.

So, this is one example because if I consider any line, so this line will be also projected
as a line which means all the points lying on that line after transformation they are also
lying on a straight line. So, they satisfy all the three conditions of the projective
transformation that is why it is a case of projective transformation.

(Refer Slide Time: 05:08)

So, there could be several other examples of this projective transformation like rotation
of axes, change of scale, translation of origin in planar coordinate system. So, every
operation will give you a different set of coordinates for the transform points, but
geometrically you can see that they preserve the properties of collinearity, the properties
of invertibility of the transformation one to one mapping of the transformation and
finally, those are points in 2D projection projective space.

(Refer Slide Time: 05:41)

There could be more examples. Like this is another example, where you are rotating your
plane of projections about an axis and you consider any two plane of two planes which
are related by this rotation and consider a ray passing through the center of, a center
origin of this particular plane and the intersection points in this plane they are giving you
the corresponding transfer point.

Similarly, some other examples are also given here. In this case, this point x and this
point x' they are related by this rule that they are formed, they have two different center
of projections, but they are formed by the rays connecting the same point on a particular
planar surface. So, here also any straight line, any line in this planar surface they will
projected as lying here, which means all collinear points will be also mapped to another
set of collinear points in the transform space satisfying the conditions of projective
transformation. In this case, also you can see the shadow formation that is also a case of
projective transformation because the same set of properties those are true here.

(Refer Slide Time: 07:18)

So, let us discuss what would be the form of this transformation and interestingly only
one form is possible and that to a very simple linear from, that is what we will be finding
out here. And it is as I mentioned, it is linear, it would be invertible that is from the
property of the projective transformation and the form is given here, in the linear form as
shown below

$$\begin{pmatrix} x_1' \\ x_2' \\ x_3' \end{pmatrix} = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}$$

You can see it is a $3 \times 3$ matrix, which should be an invertible matrix, and it maps a
point $(x_1, x_2, x_3)^T$ uniquely to another point represented by the 3-vector
$(x_1', x_2', x_3')^T$.

In short, in our lecture we will mostly represent this transformation matrix by the symbol $H$,
which denotes a $3 \times 3$ matrix; column vectors are represented here by bold letters. The
transformation matrix $H$ is equivalent to $kH$, where $k$ is a scalar value: the same
relationship can be denoted after multiplication by a scalar, because a point $x'$ and $kx'$
are the same point in the projective space. So, I can multiply the transformation matrix by any
scalar constant and it still gives me the same transformation, which means the transformation
matrix itself is an element of a projective space.

So, this particular matrix has 8 degree of freedom due to this fact, because as I
mentioned one of them can be treated as a scale and you can express all other elements in
proportion to this scale. So, there are 9 elements effectively in the 3 3 matrix, but out of
which one of them will denote scale factor, so there are 8 independent parameters or the
degree of freedom of this matrix is 8. This matrix is also called homography and this
transformation is also called homography, and this matrix is called homography matrix.

(Refer Slide Time: 10:04)

So, $Hx$ preserves collinearity; that property we can verify very easily. Let us consider a
line in this two-dimensional projective space. A point $x$ on this line satisfies the point
containment relationship, which is given by $l^T x = 0$. This itself I can write as
$l^T H^{-1} H x = 0$; you can see the trick we are using here, since $(H^{-1}H)$ is an identity
matrix.

So, the identity matrix is replaced by $H^{-1}H$, and then, using the rule for matrix
transposition, $l^T H^{-1}$ can be written as $(H^{-T} l)^T$. This quantity itself can be
considered as a new line, and we have $(H^{-T} l)^T (Hx) = 0$. That is what I was referring to:
this represents a line on which the transformed points are lying. So, this is the line in the
transformed space on which all the points of line $l$ now lie.
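This containment-preservation argument is easy to verify numerically; the following is my own sketch in Python/NumPy with an arbitrary invertible matrix standing in for $H$:

```python
import numpy as np

H = np.array([[1.0, 0.2, 3.0],
              [0.0, 1.5, -1.0],
              [0.1, 0.0, 1.0]])   # some assumed invertible homography

x = np.array([2.0, 1.0, 1.0])     # a point
l = np.array([1.0, -2.0, 0.0])    # a line containing it: x - 2y = 0
print(l @ x)                      # 0.0 (containment before the mapping)

x_t = H @ x                       # transformed point
l_t = np.linalg.inv(H).T @ l      # transformed line H^{-T} l
print(l_t @ x_t)                  # ~0.0 (containment is preserved)
```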

(Refer Slide Time: 12:24)

So, this is the summary: $H^{-T} l$ is the transformed line of $l$, and that shows how $Hx$
preserves collinearity. It is more difficult to show that $H$ is the only form of homography,
and that is not in the scope of this particular course.

So, we will accept this fact that H is the only form of homography and this has certain
advantages in computing H if we accept these particular fact. This is also true, as I
mentioned that you can prove it, but that proof is not discussed here because it requires a
complex arguments.

(Refer Slide Time: 13:15)

So, the implications of this fact is that, if there is a homography, there exists a unique H
and it is a 3 3 invertible matrix. Since, its functional form is known, because we know
that is a matrix those are the elements and we can expand the relationships in an
algebraic form, so it is easier to estimate. As I mentioned also that H and any scalar
multiplication of H which is say kH, they are equivalent. Number of unknowns in H is 8.

(Refer Slide Time: 14:04)

So, now we will consider how this homography matrix could be computed. Suppose we have a set of
point correspondences; a typical point correspondence can be specified in this form: you have a
point $x_i$ in the original space and its corresponding transformed point $x_i'$ in the
transformed space, and their relationship is given by $x_i' = Hx_i$, or rather
$x_i' = kHx_i$, as you should note for the projective transformation relationship.

Now, we need to estimate $H$; that is the computational problem. As I mentioned there are 8
unknowns, and a given point correspondence yields only two independent equations. Let us see
how we get these two independent equations. Consider this particular fact: the coordinate in
the two-dimensional real coordinate space can be expressed as the ratio $x_1'/x_3'$, which is a
division by the scale factor.

So, $x'$ in the homogeneous coordinate representation is given as $(x_1', x_2', x_3')^T$, and
from this representation the real coordinate is given by that ratio. Since we have the
corresponding matrix multiplication, the transformed vector $(x_1', x_2', x_3')^T$ is obtained
by multiplying the homography matrix, whose elements are

$$\begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix},$$

with the point represented in the original space. So, you can see that
$x_1' = h_{11}x + h_{12}y + h_{13}$, which is the numerator given here.

In this particular representation the original point is taken as $(x, y, 1)^T$, that is, its
third coordinate is taken as 1, while the third coordinate of the transformed point, $x_3'$,
acts as the scale factor.

(Refer Slide Time: 18:18)

Let me erase this and make it more concrete by writing the equation once again. This is $x'$,
this is $y'$, and this is $x_3'$, and this is the matrix
$$\begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix}$$
multiplied by the column vector $(x, y, 1)^T$. From there we get this equation: you can see
that the numerator comes from the first row, and the denominator, that is $x_3'$, comes from
the third row, $h_{31}x + h_{32}y + h_{33}$. When the coordinate $x_1'$ is divided by $x_3'$,
you get the coordinate in the real space which is observed here,
$$x' = \frac{h_{11}x + h_{12}y + h_{13}}{h_{31}x + h_{32}y + h_{33}}.$$

(Refer Slide Time: 19:28)

Similarly, we can do the same for the $y$ coordinate; that is, $y'$ can also be expressed in
this form, $y' = \frac{h_{21}x + h_{22}y + h_{23}}{h_{31}x + h_{32}y + h_{33}}$. So, we get
these two equations from this particular relationship. There are 8 unknowns, so you require at
least 4 points to get 8 equations; one of the parameters can be set equal to some known value
and then you can solve the problem. So, a minimum of 4 point correspondences is needed to solve
this problem.

(Refer Slide Time: 19:54)

Let us take this particular example. Consider this image, the same image we displayed in a
previous lecture; you can see that the horizontal lines do not remain horizontal in the
projective space: because of the projection they look oblique and meet at a finite point, which
is the vanishing point. If we would like to remove the projective distortion, which means
making the horizontal lines look horizontal and the vertical lines look vertical, let us see
how we can apply the concept of homography here.

So, we select 4 points in a plane with known coordinates. So, consider this selection.
Here I have shown by drawing a particular contour, but we can select any 4 points, the
endpoints of this edges say we have selected this point, this point here, this one and this
one. So, these are the points which are selected. And then in the transform space we
would like to straighten this lines, so that they look like parallel. So, in the transform
space we want this rectangle, this quadrilateral should look like a rectangle; that means,
we will be mapping this points to a corner point of this rectangle, the respective corner
points of these rectangles. So, this is a kind of mapping you would like to have. And we
know the coordinate point. So, we know these coordinates and we can also define the
coordinates of this space and we can define the coordinates of these corner points.

So, we know the corresponding pairs of points, and from there you can form equations: each pair
of corresponding points gives you two equations, so we get 8 equations. This is one example of
such a pair of equations; forming the pairs for the remaining points as well, you get 8
equations, and they are all linear with respect to the parameters of the transformation matrix,
that is, the elements of the $H$ matrix. As I mentioned, you can set one of them to the value 1
and express the others in proportion to it: there are 9 elements, but one of them is a scale
element, so let us take $h_{33}$ as the scale element, set $h_{33} = 1$, and then solve.

If you do that, then this particular image in the transformed space will look like this, where
the straight lines now remain parallel. But there is a caution when you apply this method: if
$h_{33}$ is a non-zero value then the method will work, but if $h_{33}$ itself is 0 then it
cannot act as a scale factor. In that case you need to choose another element, and we will see
that there is actually a method which removes this particular constraint and is generally
applicable for any kind of homography matrix. Another interesting part here is that you do not
require any calibration of the cameras; it is your coordinate definition that does the trick to
remove this projective distortion.

(Refer Slide Time: 24:04)

So, this computations let me elaborate a little further. Say take this typical case these are
the point correspondences, I have defined these coordinate and these are the points in the
in the original image, this is the points in the original image and this is a point in the
transforms which is a point of the rectangle. And using this set of point correspondences
I can form these 8 equations and then these equation can be conveniently represented in
the matrix form in this particular form.

(Refer Slide Time: 24:42)

As I set $h_{33} = 1$, I can rearrange the elements in such a way that all the unknown
parameters remain on the left-hand side and the constant terms form the column vector on the
right-hand side. Then inverting this matrix and multiplying the inverted matrix with this
column vector will give you the solution, and you will get a homography matrix like this.
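For illustration, here is a minimal sketch of this 4-point solution in Python/NumPy. It assumes $h_{33} \neq 0$ and uses hypothetical corner correspondences (a quadrilateral mapped to a rectangle), not the coordinates from the slide:

```python
import numpy as np

def homography_from_4_points(src, dst):
    """Solve the 8x8 linear system with h33 fixed to 1 (assumes h33 != 0)."""
    A, b = [], []
    for (x, y), (xp, yp) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -xp * x, -xp * y]); b.append(xp)
        A.append([0, 0, 0, x, y, 1, -yp * x, -yp * y]); b.append(yp)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

# Hypothetical correspondences: quadrilateral corners -> rectangle corners.
src = [(30, 40), (220, 60), (210, 200), (40, 180)]
dst = [(0, 0), (200, 0), (200, 150), (0, 150)]
H = homography_from_4_points(src, dst)

p = H @ np.array([30.0, 40.0, 1.0])
print(p[:2] / p[2])               # approximately (0, 0), the mapped corner
```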

(Refer Slide Time: 25:21)

And if I apply the homography then you can see this is the image which I am showing
here in an enlarged form. And you can see the utility of this technique, the text is now
conveniently being read by applying this homography.

(Refer Slide Time: 25:38)

So, the other method by which this homography could be computed is called direct linear
transformation method. It is a general method, because yours working with only 4 point
correspondences, but you may get more number of observations and you may that would
make your method more robust because there could be noisy observations, and if you
just use only 4 points to compute homography that noise will heavily affect the quality of
your result.

So, let us consider a method where you have more number of observations, more than 4
points. So, in general I would like to represent my transformation coordinate space by
this particular representation and consider your homography matrix is represented in this
form, a vectorial form where you can you are representing by this particular symbol h1 it
represents a row of H, h 2 represents a row of second row h 3 and transposition of this
that represents third row. See this is a matrix representation, all these are row vectors,
these are the row vectors.

So, if I multiply this matrix $H$ with the vector $x_i$, I can also represent the
multiplication in this sub-matrix form: each row vector is multiplied by the column vector, and
each product gives this particular form. You should note that each such element is a scalar, so
$Hx_i$ is a 3-vector, which is the transformed point.

$$x_i' \times Hx_i = \begin{pmatrix} y_i'\, h^{3T}x_i - w_i'\, h^{2T}x_i \\ w_i'\, h^{1T}x_i - x_i'\, h^{3T}x_i \\ x_i'\, h^{2T}x_i - y_i'\, h^{1T}x_i \end{pmatrix} = 0$$

Since a scale is involved, it is difficult to work with the equality sign directly. Consider
that $x_i'$ is a 3-vector and $Hx_i$ is a 3-vector, and they are related by a scale factor, so
they should have the same direction. Instead of enforcing equality, what can we do? We can take
the cross product of these two vectors: because they have the same direction they are parallel
vectors, so their cross product should give the zero vector. Note that this 0 is a 3-vector,
the zero vector.
vector you should note that this is a 3 vectorial representation this is a 0 vector.

So, if I perform the cross product, the computations can be expanded in this form. So,
you can see that these are the 3 components of this cross product. And this should be
equal to 0, and this 0 is not a simple 0, its a 0 column vector of dimension 3 1 which
means this element should be equated with 0, this second row this is equated with second
0, third row is equated with this 0.

It can be shown that there is a redundancy in this representation. If I multiply the first
equation by $x_i'$ and the second equation by $y_i'$, the first gives
$x_i' y_i' h^{3T}x_i - x_i' w_i' h^{2T}x_i = 0$ and the second gives
$y_i' w_i' h^{1T}x_i - y_i' x_i' h^{3T}x_i = 0$. If I add them, the terms involving
$h^{3T}x_i$ cancel, and taking $w_i'$ as a common factor we find that this gives exactly the
third equation, $x_i' h^{2T}x_i - y_i' h^{1T}x_i = 0$.

(Refer Slide Time: 30:54)

So, the third equation is redundant and you get only two independent equations; that is the
summary of this particular exercise.

(Refer Slide Time: 31:00)

So, in the direct linear transformation we consider only the first two rows of this matrix,
which give the two equations. I can represent this in the form $A_i h = 0$, where $A_i$ is
given by
$$A_i = \begin{pmatrix} 0^T & -w_i' x_i^T & y_i' x_i^T \\ w_i' x_i^T & 0^T & -x_i' x_i^T \end{pmatrix}.$$
You can note that the dimension of $A_i$ is $2 \times 9$, because there are two rows and each
entry is the transpose of a 3-vector: $0^T$ is $(0, 0, 0)$, $-w_i' x_i^T$ is the original-space
point $x_i^T$, which we may write as $(x_{i1}, x_{i2}, 1)$, multiplied by $-w_i'$, and
similarly $y_i' x_i^T$ is another 3-vector. So, if I arrange them, there are 9 such columns,
and the matrix is $2 \times 9$. A single point correspondence thus gives two equations,
$A_i h = 0$.
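As a small illustration of building these two rows for one correspondence, here is my own sketch in Python/NumPy (with $w_i' = 1$ by default):

```python
import numpy as np

def dlt_rows(x, y, xp, yp, wp=1.0):
    """Two independent rows of A_i for one correspondence (x, y) -> (xp, yp)."""
    X = np.array([x, y, 1.0])
    zero = np.zeros(3)
    return np.vstack([np.hstack([zero,  -wp * X,  yp * X]),
                      np.hstack([wp * X,  zero,  -xp * X])])

print(dlt_rows(2.0, 3.0, 5.0, 7.0).shape)   # (2, 9)
```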

(Refer Slide Time: 32:26)

This set of equations can be solved as a set of non-homogeneous equations, and you can see the
expanded form I described earlier. We can consider a larger number of points, and stacking
these two rows for each point correspondence gives a composite matrix $A$ of higher dimension.
If there are $n$ point correspondences, the dimension of $A$ would be $2n \times 8$, because if
we set the parameter $h_{33} = 1$, the representation of $h$ becomes
$\begin{pmatrix} \tilde{h} \\ 1 \end{pmatrix}$. In this representation $\tilde{h}$ consists of
the column-vector arrangement of the rows of the matrix $H$; only the last element, $h_{33}$,
is set to 1, and rearranging the equations you get this system.

So, finally, the dimension of $\tilde{A}_i$ is $2 \times 8$, and if I stack them for all the
points it will be $2n \times 8$. Since there are more equations than the 8 unknowns, we can use
the least squares error estimation method. The objective is to minimize
$\|\tilde{A}\tilde{h} - b\|$, where the dimension of $\tilde{h}$ is $8 \times 1$ and the
dimension of $b$ is $2n \times 1$.

The solution of this system can be obtained in the following form: solving the least squares
problem, we multiply $b$ by what is called the pseudo-inverse of $\tilde{A}$; if I multiply by
the pseudo-inverse matrix, I get this vector. I can show you one simple way to obtain this
pseudo-inverse using matrix operations themselves. Approximately, we want $\tilde{A}\tilde{h}$
to be equal to $b$; so what can we do? Let us pre-multiply each side by $\tilde{A}^T$.

Now take the inverse of $\tilde{A}^T\tilde{A}$ and multiply it with the resulting right-hand
side; that gives $\tilde{h} = (\tilde{A}^T\tilde{A})^{-1}\tilde{A}^T b$, which is exactly what
you are getting here. If you use this particular solution, you will get the elements of the
matrix $H$ in this form; append the element $h_{33} = 1$ and you will get the transformation
matrix.

But here also the problem is that your assumption of $h_{33} = 1$ may not hold, because if
$h_{33}$ is 0 then you simply cannot set it to 1, and you cannot apply this method in that
case.
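A minimal sketch of this normal-equation (pseudo-inverse) solution in Python/NumPy, on a tiny made-up over-determined system:

```python
import numpy as np

def solve_least_squares(A, b):
    """h = (A^T A)^{-1} A^T b, the pseudo-inverse solution."""
    return np.linalg.inv(A.T @ A) @ (A.T @ b)

# Tiny over-determined sanity check: 3 equations, 2 unknowns.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 3.1])
print(solve_least_squares(A, b))   # close to [1.03, 2.03]
```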

(Refer Slide Time: 36:03)

So, there is a method which is more generally applicable, called the solution of homogeneous
equations. In this case we do not make any assumptions; we do not set any parameter to a
particular value. We consider solving the whole problem with the right-hand side specified as
0, that is, $Ah = 0$, and we would like to get the value of $h$ which minimizes this system of
equations.

So, the error term, or objective function, is a bit different in this case. The dimension of
$A$ is now $2n \times 9$, because we have not shifted one of the parameters to the right-hand
side and set its value to 1; we consider the whole vector as a 9-dimensional vector formed from
the elements of the transformation matrix, and each point gives you a pair of equations;
$A_1, A_2, \ldots$ provide the corresponding rows.

So, the dimension of $A$ is $2n \times 9$ and the dimension of $h$ is $9 \times 1$, so the
dimension of $Ah$ is $2n \times 1$, and the dimension of the 0 vector is also $2n \times 1$.
Our objective is to minimize the norm of $Ah$ such that the norm of $h$ equals 1, because there
is a scale factor involved; we have to put a constraint on $h$ while minimizing this value. It
is a constrained optimization problem where we keep the norm of $h$ equal to 1. If you were
wondering what is meant by the norm of a vector, it is simply the magnitude of that vector:
consider a 3-vector, say $(h_1, h_2, h_3)^T$; its norm is the square root of the sum of the
squares of its components, that is, the magnitude of the vector. So, we are minimizing the norm
of $Ah$, which is a vector of dimension $2n \times 1$, subject to this constraint.

(Refer Slide Time: 38:50)

There is a well-known solution for this problem: it is the unit eigenvector corresponding to
the smallest eigenvalue of $A^T A$. I am not going to discuss the theory by which you get this
solution; we will use the fact that if I compute the eigenvector corresponding to the smallest
eigenvalue, that eigenvector is the solution of this equation.
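Putting the pieces together, here is a compact sketch of the homogeneous DLT estimate in Python/NumPy. The smallest-eigenvalue eigenvector of $A^T A$ is obtained here from the SVD of $A$; the correspondences are hypothetical, reused from the earlier illustration:

```python
import numpy as np

def estimate_homography_dlt(src, dst):
    """DLT: h is the unit eigenvector of A^T A for its smallest eigenvalue."""
    rows = []
    for (x, y), (xp, yp) in zip(src, dst):
        X = np.array([x, y, 1.0]); Z = np.zeros(3)
        rows.append(np.hstack([Z, -X, yp * X]))      # first row of A_i (w' = 1)
        rows.append(np.hstack([X, Z, -xp * X]))      # second row of A_i
    A = np.vstack(rows)                              # shape (2n, 9)
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)                      # unit-norm h as a 3x3 matrix

src = [(30, 40), (220, 60), (210, 200), (40, 180)]
dst = [(0, 0), (200, 0), (200, 150), (0, 150)]
H = estimate_homography_dlt(src, dst)
p = H @ np.array([30.0, 40.0, 1.0])
print(p[:2] / p[2])                                  # approximately (0, 0)
```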

(Refer Slide Time: 39:20)

So, this is one way of solving the problem: you can use a least squares error estimate for a
set of homogeneous equations, or for a set of non-homogeneous equations as we discussed
earlier. But there could be other error criteria. The error term we used in the previous
methods, known as direct linear transform methods, is called the algebraic error; those are the
least squares errors as we defined them, the sum of squared deviations. There could also be a
geometric error: for example, you can take the distance between the true observed point in the
transformed space and the estimated transformed point, and the sum of the squares of all these
distances gives you an error.

The idea is something like this: you have the original space point $x$ and you transform it to
the transformed space point $x'$ by performing the $H$ operation. This may be your true
observation point, while after estimation your point may land at a slightly different location;
the distance between these two gives you a component of the error, and you are going to
minimize the sum of all these distances. So, this is one form of geometric error.

And this is the Euclidean distance we use. In the geometric error with re-projection we also
use the inverse mapping; that means we also apply $H^{-1}$ to the observed transformed points
and get the estimated point in the original space. Take that point and your observed point in
the original space and compute the distance between them; that gives you the component from
the inverse mapping. So, this is the geometric error with re-projection.
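One common way to write this two-way criterion is the symmetric transfer error; the following is my own sketch in Python/NumPy of that variant:

```python
import numpy as np

def symmetric_transfer_error(H, src, dst):
    """Sum of squared distances in the forward and the inverse mapping."""
    H_inv = np.linalg.inv(H)
    err = 0.0
    for (x, y), (xp, yp) in zip(src, dst):
        f = H @ np.array([x, y, 1.0]);        f = f[:2] / f[2]
        g = H_inv @ np.array([xp, yp, 1.0]);  g = g[:2] / g[2]
        err += np.sum((f - [xp, yp]) ** 2) + np.sum((g - [x, y]) ** 2)
    return err

print(symmetric_transfer_error(np.eye(3), [(1, 2)], [(1, 2)]))   # 0.0
```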

(Refer Slide Time: 42:01)

And there are different methods like non-linear iterative optimization techniques, such as
Newton iteration, Levenberg-Marquardt method, etc those methods could be used for
solving this problem.

(Refer Slide Time: 42:16)

There is another interesting fact, that should be noted. we can apply even
transformation on these points, on the point set of points which are observed in the two
spaces and even we can estimate the homography even after transformation. So, let us
understand this particular fact.

Say, consider $y_i = Tx_i$ and $y_i' = T'x_i'$. These are the transformations $T$ and $T'$
applied in the two spaces, the original and the transformed space. Now you estimate the
transformation matrix between these transformed sets of points, and we can see that even from
there you can estimate $H$. The relationship is that $x'$ is related to the point $x$ by $Hx$;
applying the transformation relationships with $T$ and $T'$ to the corresponding sets of
points, $x' = T'^{-1}y'$ and $x = T^{-1}y$, so $y' = T'HT^{-1}y$. This composite matrix can be
considered as the transformation between $y$ and $y'$; this is the transformation $G$.

So, using $y_i$ and $y_i'$ I can estimate $G$, and from there I can estimate $H$ as
$T'^{-1}GT$. But one thing: if you are using a least squares estimate, then you may only get a
close estimate of $H$, not an equivalent one. If I estimate $H$ without transformation, and if
I estimate $G$ with transformation and then recover $H$ from it, the results will not be
equivalent, because of the constrained optimization issues.

(Refer Slide Time: 44:21)

One example of such a transformation, which is used very often, is to transform the point set
so that its centre becomes the origin of the plane and the average distance from it is
$\sqrt{2}$. We can transform the coordinates, scaling down or scaling up, so that finally,
after the transformation $T$, the spread of the points is such that their average distance from
the centre of the points is $\sqrt{2}$.

(Refer Slide Time: 45:09)

So, this is the idea, and there is a particular type of transformation which performs this job
easily: subtract from each point the centre of the point set and then divide by its standard
deviation $\sigma_x$. With this transformation you can preserve this particular property, and
now you apply the direct linear transformation on the transformed points. It can recover the
homography, and the computation is robust because it takes care of the large values that we
would otherwise get in the different coefficients.
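A small sketch of such a normalizing similarity (here the commonly used variant that makes the mean distance from the centroid exactly $\sqrt{2}$), written in Python/NumPy as my own illustration:

```python
import numpy as np

def normalizing_transform(points):
    """Similarity T: centroid -> origin, average distance -> sqrt(2)."""
    pts = np.asarray(points, float)
    centroid = pts.mean(axis=0)
    mean_dist = np.mean(np.linalg.norm(pts - centroid, axis=1))
    s = np.sqrt(2.0) / mean_dist
    return np.array([[s, 0.0, -s * centroid[0]],
                     [0.0, s, -s * centroid[1]],
                     [0.0, 0.0, 1.0]])

T = normalizing_transform([(100, 200), (300, 250), (150, 400)])
print(T)
```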

(Refer Slide Time: 45:39)

So, just to summarize this particular lecture on computation of homography; what we


have seen that a projective transformation, has these properties that it is invertible, it
preserves collinearity, and it is always in a linear form. And that information gives you a
convenient techniques for estimating projective transformation because you can apply
the linear model there.

These are the facts about the projective transformation: it is defined as a transformation of a
point of one projective space to a point of another projective space. A line in the original
space is also transformed, and with the transformation $H$ it is related by this particular
property: if I multiply $H^{-T}$ with the line in the original space, I get the corresponding
line in the transformed space.

And there is a computational problem for estimation of homography in this particular, in


this respect which we discussed that is if you have a set of point correspondences then
you can compute the homography. And there we require minimum 4 point
correspondences to solve it, but if you have more number of observations then you can
make it robust and you can apply least square error techniques, like direct linear
transformation technique for solving them. So, it makes your computation robust.

Thank you for listening.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 08
Homography: Properties Part – I

In this lecture, we will talk about different Properties of Homography.

(Refer Slide Time: 00:21)

So, let us summarize the features of a projective transformation. We know that for a transformation to be projective, first it has to be a mapping from a two-dimensional projective space to another two-dimensional projective space. Then, it should be invertible, and collinearity of every three points has to be preserved; that is, if three points lie on a straight line, then after transformation the transformed points should also lie on a straight line. And we have also discussed that there is only one form of this kind of transformation, and that is a non-singular 3×3 matrix.

(Refer Slide Time: 01:19)

So, let us consider a typical case of this projective transformation, a kind of schematic diagram by which we can explain these properties. Consider a two-dimensional projective space: you can see that O is the center of projection and a projection plane, in the canonical form, is placed at a distance 1 along the z axis of this projective space. And let us consider another projective space in a similar representation, but with different coordinates for its center of projection, different orientations of the coordinate axes, and, accordingly, its own canonical projection plane.

In this case O' is the center of projection of the other projective space. A point in the first projective space is represented as shown earlier: we know that any point in the canonical space is represented by a ray connecting it to the center of projection, and a point in the transformed space is represented in the same way. Here we are showing that there is a mapping between the points p and p', and that mapping is defined by a 3×3 transformation matrix, as we have discussed earlier. This is how we denote a projective transformation. We can visualize it as follows: for every point in the original space there is a unique point in the transformed space to which it maps.

To summarize this property, we can say that if there is a straight line and points lie on that straight line, then if we map all the points on that straight line, they will also form another straight line, and all the mapped points will lie on it. So, we can say there is a mapping from the straight line l to l' under this transformation.

And we have also discussed how we can derive the transformed points or transformed straight lines from the points or straight lines of the original projective space. The relationship is that for a straight line you have to apply the matrix which is the transpose of the inverse of the transformation matrix used for the point transformation.

So, the relationship can be summarized in this form: for a point, if I multiply the point x in its homogeneous coordinate representation by the 3×3 transformation matrix H, then we get the corresponding transformed point, also in homogeneous coordinates, as x' = Hx. Similarly, for a line, if we multiply it by the matrix H^{-T}, which is the transpose of the inverse of H, then we get the corresponding transformed line, l' = H^{-T} l.

(Refer Slide Time: 05:47)

So, let us discuss how vanishing points arise in the transformed space. In this case we have drawn a pair of parallel lines, one of which is denoted by l. Consider that l is mapped in the transformed space to a line l', and we know how l and l' are related: you represent l in the two-dimensional projective space and multiply it by the transpose of the inverse of H, that is, you multiply l by H^{-T}, to get the transformed line l'. Similarly, if I transform the other line of that parallel pair, we get another straight line, and we observe that although the two lines are parallel in the original projective space, in the transformed space they appear as converging straight lines; in fact they intersect at a certain point. This point is called the vanishing point of these straight lines.

The interpretation of this vanishing point in the transformed space can be given very easily if we understand what the intersection of these parallel lines is in the original space. We have already discussed that when two lines are parallel in a projective space, they intersect at a point at infinity, but there is a finite representation of that infinite point.

In this case, for example, if we represent the straight line by the equation ax + by + c = 0, then all lines parallel to this straight line intersect at a point in the projective space which can be represented by the coordinates (b, -a, 0) in the homogeneous coordinate representation. The ray connecting this point and passing through the origin represents this element, and we know that this point is also called an ideal point.

Now, the vanishing point of these parallel lines in the transformed space is nothing but the transformation of this ideal point into that space, which means that if I multiply the transformation matrix with this point, then I will get the vanishing point. So, we represent the intersection point of the parallel lines in the original space, which is the column vector (b, -a, 0)^T; it represents the intersection point of all lines parallel to the line ax + by + c = 0. If I multiply it by H, then I get the corresponding vanishing point, that is, the coordinates of this point in the transformed projective space. So, this is a simple interpretation of a vanishing point, and that is what we have represented here.

Now, another interesting part is this: suppose you take another set of parallel lines in the original space, and then transform these lines into the other projective space; they will also meet at some vanishing point, and all these vanishing points lie on a particular line, which is called the vanishing line. So, how do you get this vanishing line in the transformed space? What is its representation?

We know that, in the original space, all these intersection points of parallel lines (the ideal points) lie on a particular line, called the line at infinity, which is given by the representation (0, 0, 1)^T; it is an element of the projective space representing a line. The vanishing line is nothing but the transformation of this line at infinity into the transformed space. This means that if I transform the line at infinity by following the same rule of line transformation, then I will get the corresponding vanishing line.

So, let me write out the equation which is hidden here. What we can do is multiply the line at infinity, represented by (0, 0, 1)^T, by H^{-T}; then we get the corresponding vanishing line, which is this line here. So, this is the interpretation of a vanishing point and a vanishing line in the transformed space.
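
To summarize the two formulas in code, here is a small sketch, assuming H is a given non-singular 3×3 numpy array and the family of parallel lines has the form ax + by + c = 0 (the function names are placeholders):

import numpy as np

def vanishing_point(H, a, b):
    # lines parallel to a*x + b*y + c = 0 meet at the ideal point (b, -a, 0)
    return H @ np.array([b, -a, 0.0])

def vanishing_line(H):
    # vanishing line = image of the line at infinity (0, 0, 1) under H^{-T}
    return np.linalg.inv(H).T @ np.array([0.0, 0.0, 1.0])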

(Refer Slide Time: 11:59)

So, just to summarize all the properties of this projective transformation, or homography, these are the transformations we have studied. Point transformation: we multiply a point in the projective space by the transformation matrix to get the corresponding point in the transformed space; remember that all the representations are in homogeneous coordinate form. Line transformation: a line has to be multiplied by the matrix H^{-T}, the transpose of the inverse of the transformation matrix, to get the corresponding transformed line in the transformed space.

Similarly, for the vanishing point of lines parallel to a line represented by (a, b, c)^T, the vanishing point can be derived by multiplying the intersection point of those parallel lines in the original space, which is (b, -a, 0)^T, by the transformation matrix H. That is the typical representation of the intersection point of lines parallel to the line given by the parameters a, b, c. And the vanishing line is the transformation of the line at infinity of the original projective space, which is given by (0, 0, 1)^T; it has to be multiplied by H^{-T}, the transpose of the inverse of the transformation matrix, and that gives us the vanishing line.

(Refer Slide Time: 14:01)

So, let us consider an example by which we can show all these computations. Consider a homography matrix, that is, a transformation matrix given in this form; it is a 3×3 matrix and it should, of course, be non-singular, which you can verify. The computational problem is to compute the transformation of the line formed by the two points (2, 4, 2) and (6, 9, 3), both points being in the two-dimensional projective space. Let us see how we can compute this.

(Refer Slide Time: 14:43)

There are a few methods by which you can do it; I will discuss two of them. In the first method, you first compute the transformed points of (2, 4, 2) and (6, 9, 3), so you get the corresponding points in the transformed projective space. Figuratively, you have a projective space with the points p and q, where p is the point (2, 4, 2) and q is (6, 9, 3), and you find their corresponding points in the transformed space.

Suppose these points are p' and q'. What do you need to do? You have to multiply the transformation matrix H with p and with q; then you get the points p' and q', which are the transformed points. Now, you compute the straight line in the transformed space, which means that if I take the cross product p' × q', I will get the corresponding transformed line. Let me show this computation once again.

(Refer Slide Time: 16:17)

First, let me reduce the representation of these points. Since (2, 4, 2) has a common scalar factor 2, I can divide all its coordinates to get the simpler representation (1, 2, 1), and similarly (6, 9, 3) has the simpler representation (2, 3, 1); you could have worked with (2, 4, 2) and (6, 9, 3) directly as well, I have taken these two just for convenience of computation. Now, you transform (1, 2, 1) and (2, 3, 1). If I transform (1, 2, 1) by multiplying it with this matrix, then I get the point (5, -1, 11); you should check this multiplication yourself. And if I multiply (2, 3, 1), then I get the coordinates (12, 3, 16).

(Refer Slide Time: 17:17)

Now, you have to take their cross product to compute the transformed line. If you perform the cross product of these two points, you will find that the result is (-49, 52, 27)^T. That is the representation of the straight line, which means that in our conventional coordinate form the straight line is -49x + 52y + 27 = 0. So, this is one of the methods by which we can carry out this computation.

(Refer Slide Time: 18:07)

Let me discuss the other method. In this case, we first compute the line in the original space. Consider the original projective space and, once again, the two points p and q. You compute the line through p and q by taking the cross product of these two points in the homogeneous coordinate representation.

Now, you transform this line. What should you do in this case? You should multiply it by the transpose of the inverse of the transformation matrix H, which is written as H^{-T}. If I perform this operation, then I get the corresponding line in the transformed space, so l' = H^{-T} l. This is the computation we will carry out; let me detail it.

(Refer Slide Time: 19:27)

First, we compute the cross product of the two points, and the line comes out to be (-1, 1, -1); you can verify this by computing the cross product of the two points yourself. Then we transform this line by multiplying it with the transpose of the inverse of H, that is H^{-T}, and we get the corresponding transformed line l'. H^{-1} in this case is given by the matrix shown on the slide.

You know how to compute the inverse of a homography matrix, or of any 3×3 matrix: you compute the determinant, compute the co-factors, take the transpose of the co-factor matrix, and divide it by the determinant to get the inverse matrix.

Please go through a standard textbook on matrix operations; you will find the steps for computing an inverse, and you should be familiar with this computation. In particular, since it is a 3×3 matrix, the inversion should not take much time.

After computing the inverse of the transformation matrix, you take its transpose and perform the multiplication. If you do this computation, your line comes out in this form, up to a scale factor which you can ignore; you know that the scale does not matter. It is the same as the column vector (-49, 52, 27), and it gives you the same equation of the straight line that you derived earlier by method one in the previous slides. So, this is the equation of the straight line that you will get. We will continue this example by computing the vanishing line in the transformed space.
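
A small numerical sketch of the two methods (the matrix H below is only a placeholder, not the matrix from the slide; substituting the slide's values reproduces the numbers quoted above):

import numpy as np

H = np.array([[1.0, 0.0, 2.0],      # placeholder entries, not the slide's matrix
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
p = np.array([1.0, 2.0, 1.0])
q = np.array([2.0, 3.0, 1.0])

# Method 1: transform the points first, then take their cross product
l1 = np.cross(H @ p, H @ q)

# Method 2: take the cross product first, then transform the line by H^{-T}
l2 = np.linalg.inv(H).T @ np.cross(p, q)

# The two results agree up to a scale factor
print(l1 / l1[-1], l2 / l2[-1])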

(Refer Slide Time: 21:37)

So, here, as we discussed, the vanishing line can be computed by transforming the line at infinity of the original projective space, which is (0, 0, 1)^T. We perform this computation: if I multiply the transpose of the inverse of H with (0, 0, 1)^T, then I get the corresponding vanishing line as (5, -15, 5)^T. In the conventional coordinate representation this is equivalent to the straight line 5x - 15y + 5 = 0.

(Refer Slide Time: 22:39)

Now we will study another interesting property of the projective space. We will see that projective transformations form a group; these are linear transformations, and we call that group the projective linear group.

The essence of this property is the following: consider a transformation of one space to the points of another space by the transformation H_1, and subsequently another transformation from that space to a third projective space by H_2. This is equivalent to transforming the first space to the third space directly by a matrix H, which can be derived by composing the two matrices, simply by multiplying them: H = H_2 H_1 (H_1 is applied first, then H_2). This holds because of the group property, which means that a cascade of transformations can be replaced by a single transformation; that is one of the implications of this property.

(Refer Slide Time: 24:03)

The other implication is that if I consider a series of transformations, say we first transform the points by H_1, next take the transformed points to another space by H_2, and then from that third space to a fourth space by H_3, then, by cascading, we can simply multiply all these matrices H_1, H_2, H_3 and obtain a single transformation that represents the mapping from the first to the fourth space directly.

The interesting part is that, as you know, matrix multiplication is associative. So, it does not matter in which order you group these multiplications; that means we can carry out this computation by first multiplying H_3 and H_2 and then multiplying with H_1, or by first multiplying H_2 and H_1 and then multiplying H_3 with that product. Either grouping gives the same composite matrix H. So, this is another property: you can perform the computation with different groupings, and this is possible because of the group property.

(Refer Slide Time: 25:27)

So, there is some interesting structure in this group, and we will see that there are subgroups with a hierarchy among them. Our parent group is the projective linear group; all projective transformations fall under this category. We have discussed its properties, and those properties hold for all such transformations.

Now, there is a special case of projective transformations, called affine transformations, and they form the affine group. One key property of these transformations is that the last row of the matrix should be (0, 0, 1), or any scalar multiple of (0, 0, 1). So, it is very easily distinguishable; we can easily determine whether a transformation is affine or not.

Then a special class of the affine group is called the Euclidean group, obtained when the upper left 2×2 sub-matrix is orthogonal. As I have told you, the last row should be (0, 0, 1), or a scalar multiple of (0, 0, 1); that is what makes the affine group, and a special class of the affine group becomes Euclidean when this 2×2 sub-matrix is orthogonal. Finally, there is another subgroup, a special class of the Euclidean group, called the oriented Euclidean group, obtained when the determinant of the upper left 2×2 sub-matrix is equal to 1. So, we call that group the oriented Euclidean group.
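
As a rough illustration of this hierarchy (not from the lecture; the tolerance and the test used for "orthogonal up to scale" are my own choices), the following sketch reports the innermost subgroup a given 3×3 matrix belongs to:

import numpy as np

def classify_homography(H, tol=1e-8):
    H = H / H[2, 2]                      # normalize the free scale (assumes H[2,2] != 0)
    if not np.allclose(H[2, :2], 0, atol=tol):
        return "general projective"      # projective linear group
    A = H[:2, :2]
    if not np.allclose(A @ A.T, (A @ A.T)[0, 0] * np.eye(2), atol=tol):
        return "affine"                  # last row (0,0,1) but A is not a scaled orthogonal block
    if abs(np.linalg.det(A) - 1) > tol:
        return "Euclidean"               # orthogonal upper-left block
    return "oriented Euclidean"          # determinant of the 2x2 block equals 1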

So, as I mentioned, the oriented Euclidean group is a subclass of the Euclidean group, the Euclidean group is a subgroup of the affine group, and the affine group is a subgroup of the projective linear group. So, we have this hierarchy in the transformation space. With this let me stop for this lecture, and we will continue this topic in the next lecture.

Thank you for listening.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 09
Homography: Properties Part - II

So, we will continue our discussion on the properties of projective transformations; we are discussing the existence of different subgroups in the group of projective transformations.

(Refer Slide Time: 00:27)

We have seen that, in general, projective transformations belong to a group called the projective linear group, and a subgroup of the projective linear group is the affine group. Its characterization is that the last row of the transformation matrix should be (0, 0, 1), or a scalar multiple of the (0, 0, 1) row vector. Then a subgroup of the affine group is the Euclidean group, where the upper left 2×2 transformation sub-matrix should be orthogonal; which means that if you take the dot product of two different rows it should be 0, and the self dot products should be non-zero.

And then the bottommost subgroup, which is a special category of the Euclidean group, is called the oriented Euclidean group, where the determinant of this orthogonal sub-matrix should be equal to 1. So, these are the different kinds of groups or subgroups present in the group of projective transformations. Let us discuss what the different properties are for each group.

(Refer Slide Time: 02:09)

The parent group, which is the projective group, is represented in this form. You can see that it is a 3×3 matrix, but we are representing it by its sub-matrices, H = [A  t; v^T  v], where A is a 2×2 sub-matrix, t is a 2×1 column vector, and v, as we have already shown, is the column vector (v_1, v_2)^T, so that v^T itself is 1×2; the remaining v is just a scalar quantity. This is the representation of any general projective transformation matrix, which we have already discussed. Some typical examples of the transformation of a rectangle or a parallelogram are shown here; you can see that these shapes do not preserve parallelism under this transformation; in general, parallelism is not necessarily preserved.

The degree of freedom, that is, the number of independent parameters in the transformation matrix, is 8: there are 2 scales, 2 rotations, 2 translations, and 2 parameters for the line at infinity. When we decompose this matrix we will find that it can be decomposed into these parameters; this will become clear later on, when I discuss the matrix decomposition of a projective transformation. So, let us take it for the time being that its degree of freedom is 8, which means there are 8 independent parameters by which you can describe this transformation matrix.

[A  t; v^T  v] (x_1, x_2, 0)^T = ( A (x_1, x_2)^T , v_1 x_1 + v_2 x_2 )^T

And this is another interesting property shown here: if you transform any ideal point (note that the third coordinate of an ideal point is 0), the result is not always an ideal point, because the third coordinate of the transformed point is v_1 x_1 + v_2 x_2, which may be non-zero; in that case the point becomes a finite point in the transformed projective space.

That is why the line at infinity becomes a finite line in the transformed space; it becomes a vanishing line, and it allows you to observe vanishing points, or the horizon, after the transformation. So, what are the properties conserved under the transformation? Concurrency: if you have 3 straight lines which are concurrent, then after transformation the transformed straight lines are also concurrent.

It preserves collinearity, which is the defining property of any projective transformation; by definition itself collinearity has to be preserved. It also preserves the order of contact, and it preserves the cross ratio, which is a ratio of ratios. Let me explain this cross ratio further. How do you define a cross ratio?

(Refer Slide Time: 06:07)

Let us consider four points lying on a straight line. The cross ratio is defined as a ratio of ratios. The first ratio, the numerator of the ratio of ratios, is the ratio of the lengths |X_1 X_2| and |X_2 X_4|, and the other ratio is that of |X_1 X_3| and |X_3 X_4|. The ratio of these two ratios defines the cross ratio. Now, if you transform these points by applying a projective transformation, first of all they will all remain collinear, since that is the property of a projective transformation, and they will also maintain the same order.

If the cross ratio is computed in the transformed space in the same fashion, you will find that this cross ratio and the original cross ratio are the same. So, this is one of the invariants of this particular transformation.
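
A quick numerical check of this invariance (the homography and the four collinear points below are arbitrary placeholders):

import numpy as np

def cross_ratio(p1, p2, p3, p4):
    # points given as inhomogeneous 2D coordinates lying on one straight line
    d = lambda a, b: np.linalg.norm(a - b)
    return (d(p1, p2) / d(p2, p4)) / (d(p1, p3) / d(p3, p4))

H = np.array([[1.0, 0.2, 3.0],
              [0.1, 1.0, -1.0],
              [0.001, 0.002, 1.0]])      # placeholder homography

pts = np.array([[t, 2 * t + 1] for t in (0.0, 1.0, 2.0, 5.0)])   # four collinear points
hom = np.hstack([pts, np.ones((4, 1))])                          # homogeneous form
mapped = (H @ hom.T).T
mapped = mapped[:, :2] / mapped[:, 2:3]                          # back to 2D

print(cross_ratio(*pts), cross_ratio(*mapped))                   # the two values agree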

(Refer Slide Time: 07:29)

So, let us consider the properties of the affine group. As we have already discussed, the third row of an affine transformation should be the vector (0 0 1), or a scalar multiple of it; for the sake of convenience, let us consider the representation (0 0 1). Here I have also shown figuratively how parallelograms, when transformed by an affine transformation, still remain parallelograms, which means that one of the invariants of this transformation is the parallelism of lines.

Parallel lines still remain parallel after an affine transformation. And how many degrees of freedom do we have here? You can count the number of free elements; there are 6, and in fact there are six independent parameters. Further, the matrix A can be decomposed into a set of operations, as if you are applying a deformation along two perpendicular axis directions.

Applying the deformation means that you first apply a rotation to align with the axes along which the deformation is to be applied, and then you apply the deformation in the two perpendicular directions, given by the diagonal matrix D = diag(λ_1, λ_2). So, you introduce the deformation using two different scales, and your coordinate measurements are effectively along two oblique axes; as if your coordinate convention, instead of being rectilinear, is an oblique axis representation. Let me draw it once again.

You had a rectilinear coordinate representation, but in the transformed space you are following a non-rectilinear representation, and measurements are taken along lines parallel to those two baselines. Your unit scales are scaled by different factors, and that is how the coordinate transformation takes place under the affine transformation. Along with this, you can rotate the coordinate axes; that is another parameter, θ:

A = R(θ) R(−φ) D R(φ)

You can see that there are four independent parameters in A: one is the angle φ, then the two scale factors λ_1 and λ_2, and, after rotating back by −φ, you apply a rotation of the whole set of transformed points by the angle θ. This is how you account for the coordinate transformation through A. The other thing you are doing is translating the origin, which gives another two independent parameters. So, there are 2 translation parameters, 2 rotation parameters, and 2 scale parameters; those are the parameters of this particular affine transformation, the same counts we discussed in the previous case.
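
A small sketch of extracting these parameters from a given 2×2 block A via the SVD, which yields exactly the form R(θ) R(−φ) D R(φ) when the singular vector matrices are proper rotations (an assumption of this sketch; names are placeholders):

import numpy as np

def decompose_affine_block(A):
    # SVD: A = U diag(l1, l2) V^T.  Writing V^T = R(phi) and U V^T = R(theta)
    # gives A = R(theta) R(-phi) D R(phi).  Assumes det(U) = det(V^T) = +1.
    U, S, Vt = np.linalg.svd(A)
    phi = np.arctan2(Vt[1, 0], Vt[0, 0])          # rotation angle of R(phi)
    UV = U @ Vt                                   # equals R(theta)
    theta = np.arctan2(UV[1, 0], UV[0, 0])
    return theta, phi, S                          # S holds the two scales (l1, l2)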

(Refer Slide Time: 11:21)

So, just to summarize, you have 6 independent parameters. And once again, if I apply an affine transformation to an ideal point, you can see that it remains ideal, because the third coordinate remains 0; this means that parallel lines remain parallel, because after the transformation their intersection points are still ideal points. So, the line at infinity stays at infinity; the line at infinity is one of the invariants of this transformation. What are the invariants of the affine group? Parallelism; ratios of areas; ratios of lengths on parallel lines, for example midpoints; and linear combinations of vectors, for example centroids.

They all remain invariant, and the line at infinity remaining the same is one of the major properties. If you transform the line at infinity, represented by the vector (0 0 1), then you still get (0 0 1) as the corresponding transformed line. Which means that if I take the line at infinity and multiply it by the matrix H_A^{-T}, the transpose of the inverse of this affine transformation matrix, then I get back the representation of the line at infinity; it may be multiplied by a scale factor, but it still represents the line at infinity (0 0 1).

The implication is that all the intersection points of parallel lines, after the transformation, still lie on (0 0 1), which means they are all ideal points, and so after the transformation all parallel lines still remain parallel; that is one of the very interesting properties of the affine group.

(Refer Slide Time: 13:41)

The Euclidean group we discussed can also be renamed the similarity group, because it maintains the similarity property of triangles: when two triangles have their corresponding edges parallel, the ratios of their edges are preserved. In this case ratios of distances are preserved, and that is why it is called the similarity group. Its particular structure is given in this form: you can see that the corresponding 2×2 sub-matrix can simply be represented by a decomposition into a scalar multiple of an orthonormal matrix R, where R^T R equals the identity matrix.

The identity matrix here is [1 0; 0 1], which means that R is an orthonormal matrix; t is once again a 2×1 column vector. So, this transformation matrix is a special case of an affine transformation matrix which, on top of that, satisfies the constraint that makes the 2×2 sub-matrix orthogonal. In this case you have four independent parameters: one is the scale s, another is the rotation (you keep the axes rectilinear, but you can rotate their orientation), and 2 parameters are for the translation of the origin.

(Refer Slide Time: 15:47)

There is another interesting property. There are 2 special points, given in the form (1, i, 0)^T: instead of a real space, you consider a projective space with a complex coordinate system. So, we have the points (1, i, 0)^T and (1, -i, 0)^T; that is the representation. One of the axes is considered as an imaginary axis, and that is how a two-dimensional projective space can also be represented with complex numbers.

These points are interesting because if you apply any similarity transformation to them, you will find that each remains the same point; that means, if I apply a similarity transformation to the point I = (1, i, 0)^T, then the transformed point is still a scalar multiple of (1, i, 0)^T, which means that, in the projective space under consideration, it is the same point, just multiplied by a scale factor. This is true for the other point J as well. So, these are invariants of this transformation; we will see later that they have useful properties and are very useful for certain computations.
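
A quick numerical check of this fixed-point property (the scale, angle and translation below are arbitrary placeholder values):

import numpy as np

s, theta, tx, ty = 2.0, 0.7, 3.0, -1.0             # placeholder similarity parameters
H_S = np.array([[s * np.cos(theta), -s * np.sin(theta), tx],
                [s * np.sin(theta),  s * np.cos(theta), ty],
                [0.0,                0.0,               1.0]])

I = np.array([1.0, 1.0j, 0.0])                     # circular point (1, i, 0)
mapped = H_S @ I
print(mapped / mapped[0])                          # proportional to (1, i, 0) again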

So, the things which are invariant under a similarity transformation are ratios of lengths, angles, and the circular points I, J. Of course, you also have all the other invariants listed earlier for the affine group and for the projective group: since the relationship is hierarchical, all those properties are also true for similarity transformations. This means that parallel lines remain parallel after this transformation, and the line at infinity still remains the line at infinity (0 0 1). Those are invariants as well. Finally, we come to the oriented Euclidean group, which is also called an isometry.

(Refer Slide Time: 18:07)

Because it preserves even the distances, the lengths between 2 points. You can see its particular structure; the representation is

[ε cos θ   −sin θ   t_x ; ε sin θ   cos θ   t_y ; 0   0   1], where ε = ±1,

which means that if I take the determinant of the upper 2×2 sub-matrix it equals ε, that is, ±1; that is one interesting property. The parameters, as you can see, are θ, t_x and t_y, so the number of parameters in this case is only 3, and the transformation is orientation preserving when ε = 1; otherwise it is orientation reversing, like a reflection.

So, there are 3 independent parameters, 1 rotation and 2 translations, as I mentioned; and in this case you have special invariants: lengths are preserved. It is not just ratios of lengths, which are preserved in the case of a similarity transformation; the lengths themselves are preserved, as are angles and areas. That is why this transformation is called an isometry. This group is a subgroup of the similarity group, and all the properties of the projective, affine, and similarity transformations are also true for an isometry; in addition to them you have these three other invariants: length, angle and area.

(Refer Slide Time: 20:11)

So, let us take an example of the decomposition of a projective transformation. Any projective transformation can be decomposed as a cascade of these special transformations: a similarity transformation, an affine transformation, and a general projective transformation. Their forms are also given here: the similarity transformation has the structure shown, and in the affine factor K is an upper triangular matrix with det K = 1. You should note that K is not an orthogonal matrix, but its determinant is one; that is why that factor is an affine transformation, and its last row is once again (0 0 1).

H = H_S H_A H_P = [sR  t; 0^T  1] [K  0; 0^T  1] [I  0; v^T  v] = [A  t; v^T  v]

The last factor is the projective part, where I is the 2×2 identity matrix and the remaining parameters are v^T, a 1×2 row vector, and the scalar v. Looking at the relationship with the original transformation matrix that has been decomposed, you can see that its third row defines the projective component, whereas the translation vector t is used directly in the similarity transformation. And A has to be decomposed in the form A = sRK + t v^T, which is what this matrix multiplication expresses. So, you know t and you know v^T.

So, you have to take the matrix obtained after subtracting t v^T from A and then perform a matrix decomposition on it: as you can see, R is orthonormal and K is an upper triangular matrix. This decomposition can be performed with any QR decomposition method, where Q is an orthogonal matrix and R is an upper triangular matrix; that is a very standard method in linear algebra. The decomposition is unique if s is kept positive, since, as you know, for a projective transformation the overall scale factor is arbitrary.

So, we put the restriction that s has to be positive. As I mentioned, through QR decomposition you can derive the corresponding components R and K, and this is how you can get all the transformation matrices; we can decompose any projective transformation matrix as a cascade of a similarity, an affine, and a general projective transformation matrix, each in a special form.
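
A rough sketch of this procedure (not from the slides; it assumes H is scaled so that its bottom-right entry is 1 and that A - t v^T has positive determinant, so that R is a proper rotation):

import numpy as np

def decompose_homography(H):
    H = H / H[2, 2]                       # fix the free scale, so the scalar v becomes 1
    A, t = H[:2, :2], H[:2, 2]
    vvec = H[2, :2]
    M = A - np.outer(t, vvec)             # M = s R K
    Q, U = np.linalg.qr(M)                # Q orthogonal, U upper triangular
    D = np.diag(np.sign(np.diag(U)))      # make the diagonal of U positive
    Q, U = Q @ D, D @ U
    s = np.sqrt(np.linalg.det(U))         # det U = s^2, since det K = 1
    R, K = Q, U / s
    H_S = np.block([[s * R, t[:, None]], [np.zeros((1, 2)), np.ones((1, 1))]])
    H_A = np.block([[K, np.zeros((2, 1))], [np.zeros((1, 2)), np.ones((1, 1))]])
    H_P = np.block([[np.eye(2), np.zeros((2, 1))], [vvec[None, :], np.ones((1, 1))]])
    return H_S, H_A, H_P                  # H equals H_S @ H_A @ H_P up to round-off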

(Refer Slide Time: 23:33)

Let me give you an example. Consider a transformation matrix H given in this form; if I perform this decomposition, it looks like this. Here you can see the projective transformation part, where this is the 2×2 identity matrix, and the row [1 2 1] is defined entirely by the last row of the transformation matrix. Then the translation parameters are used directly in the similarity transformation matrix.

After that, you can perform the matrix decomposition of A and obtain the corresponding similarity matrix and affine matrix. So, this is a good example of how the matrix decomposition can be carried out to express the transformation as a series of transformations of a certain structure.

(Refer Slide Time: 24:55)

These properties can be used for rectifying images; this is called the rectification process. Let me explain what is meant by rectification. Consider the image observed in this form, given in the plane π_2. This is what you have observed; you can see that the edges do not appear parallel because of the transformation. That means that in the original space you actually have a rectangle, in this particular example a square.

After applying the two-dimensional projective transformation, it takes the shape of a quadrilateral, where the edges that are parallel in the original space appear to meet at a vanishing point, and you can get a vanishing line after this transformation. So, one of the tasks could be the following: from our experience we can identify which of these seemingly converging lines should be parallel, so how do we make them parallel again by applying another transformation, say H_P'?

This means that after this transformation I should at least get these 2 lines parallel. It does not really give me back the original square shape, since the right angles at the corners are not preserved, but at least we recover the parallelism of those edges under this transformation. This process is called rectification. So, we are trying to compute a transformation from the observed plane to a new plane that makes the corresponding edges parallel.

Now you should note that the transformation from the original space π_1 to π_3 is an affine transformation. However, the observed transformation H_P is a general projective transformation, and H_P' is also a general projective transformation; the cascade of these 2 transformations can make the overall mapping affine. So, we would like to see how we can compute this H_P'. Suppose you get the vanishing line in the transformed space, represented by the vector (l_1, l_2, l_3)^T; then you can define a transformation of the form

[1  0  0; 0  1  0; l_1  l_2  l_3]

You can see that this is a general projective transformation, because its last row is not [0 0 1]. H_A is any affine transformation; if I multiply this matrix with any affine transformation, I still get a general projective transformation. The property of this transformation is that if I transform the vanishing line by it, I get the line at infinity; that is very interesting.

That means, whatever vanishing points you had here will all become ideal points after this transformation H_P', which means that these lines, in the transformed space, will appear as parallel lines. So, this is one good method by which you can at least make the edges parallel, and this is how we can carry out rectification. Let me take some examples of this process.

(Refer Slide Time: 28:51)

First, let me give you a figurative example of how this computation proceeds. Consider an input image where you can see that the edges do not look parallel; in fact, seemingly parallel edges meet in this particular image at some finite point. You choose two such sets of parallel lines; each set gives you one vanishing point, so together they give two vanishing points, and with these two vanishing points you can compute the vanishing line.

You get the representation of the vanishing line, which is the transformation of the line at infinity, and from there you can compute the corresponding projective transformation. The computation goes like this: you take the cross product of the lines l_3 and l_4, which gives you v_2; similarly, the cross product of the lines l_1 and l_2 gives you v_1; and then v_1 × v_2 gives you the vanishing line, which is a finite line, the image of the line at infinity.

And if I apply the transformation obtained from there, then, as this example shows, the edges become parallel after the transformation. They have not become a square or a rectangle, but at least they have become parallelograms. We will continue with another example.
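
A compact sketch of this pipeline (not from the slides), assuming each line is specified as a homogeneous 3-vector and that the two pairs (l1, l2) and (l3, l4) are images of parallel scene lines:

import numpy as np

def line_through(p, q):
    return np.cross(p, q)                     # line joining two homogeneous points

def rectifying_homography(l1, l2, l3, l4):
    v1 = np.cross(l1, l2)                     # vanishing point of the first parallel pair
    v2 = np.cross(l3, l4)                     # vanishing point of the second parallel pair
    vl = np.cross(v1, v2)                     # vanishing line (image of the line at infinity)
    vl = vl / vl[2]                           # assumes vl[2] != 0
    H = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [vl[0], vl[1], vl[2]]])     # maps the vanishing line back to (0, 0, 1)
    return H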

(Refer Slide Time: 30:39)

Take this particular example: you have an image here, and you can see that the edges in the image are not really parallel, though you know that in the actual scene they are not only parallel but form a rectangle. Now, you consider two parallel lines and compute their vanishing point, or rather you first compute the lines. Suppose you have taken this parallel pair: for one line you have considered the two points which define it, so l_1 is defined by this line; similarly you compute l_2, which means you take another two points.

From those two points you get l_2, and continuing this process you take another two lines by choosing their two end points. So, in this case each line is defined by two end points, and from there you get l_3 and l_4. Then, as we have discussed, we take the cross product of l_1 and l_2, which gives you the vanishing point l_1 × l_2 = [-10556416, -72128, 64]. Big numbers are coming; you can see that, relative to the last component 64, these are big numbers.

We will see the effect later on. So, this is one vanishing point; the vanishing point of l_3 and l_4 is another such vector, and your vanishing line is given by their cross product. There are big numbers; do not be bothered by that, they arise from the computations, but finally, as a scale factor is associated with every homogeneous vector, the vanishing line can be simplified as shown.

(Refer Slide Time: 32:49)

So, for the vanishing line, as I was mentioning, if I take out the scale the essential component can be considered to be just this vector, and I can represent it equivalently in this form as well, by making the third component equal to 1. It is actually a vanishing line at a very large distance and it is almost horizontal: as you can see, the first component is nearly 0, so its equation becomes approximately y = 10^4 / 6.

That is why the horizontal edges look almost horizontal: they meet at a point on a line which is almost parallel to them, the vanishing line. And the transformation matrix is given by [1  0  0; 0  1  0; 0  -0.0006  1]; you can identify that the last row holds the components (l_1, l_2, l_3) of the vanishing line. You can see the effect: if I apply this transformation matrix to the image points, then you get this image, where you can find that the edges look parallel. With this let me stop here.

Thank you very much for listening.

ax 2  bxy  cy 2  dx  ey  f  0 Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 10
Homography: Properties – Part III

We will continue our discussion on the Properties of Homography or projective


transformation.

(Refer Slide Time: 00:19)

Now we will look at the properties of conics: how they appear in the transformed space. Earlier we discussed the general representation of a conic. In a non-homogeneous coordinate representation a conic is given in the following form:

ax^2 + bxy + cy^2 + dx + ey + f = 0

whereas in a homogeneous coordinate representation we can represent it in a simple quadratic structure, X^T C X = 0,

where C is given by the following parameters:

C = [ a     b/2   d/2 ;
      b/2   c     e/2 ;
      d/2   e/2   f   ]

It is 3 3 symmetric matrix and as you can see that there are 5 independent parameters
because C is also an element of the projective space. So, if I multiply C with a scale k
still it will remain the same conics and another interesting property of this conics is that,
there is a dual representation of the same conics. it is a representation by their line
tangents. So, take any line tangent given by l that should satisfy also the following
property l T C *l  0

where C * is a dual conic representation and that is given as C 1 . We have discussed all
this properties in our previous lectures.
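
A small helper (names are placeholders) that builds this matrix from the six coefficients and tests whether a homogeneous point lies on the conic:

import numpy as np

def conic_matrix(a, b, c, d, e, f):
    return np.array([[a,     b / 2, d / 2],
                     [b / 2, c,     e / 2],
                     [d / 2, e / 2, f    ]])

def on_conic(C, X, tol=1e-9):
    return abs(X @ C @ X) < tol               # X^T C X = 0 for points on the conic

# example: the unit circle x^2 + y^2 - 1 = 0 contains the point (1, 0)
C = conic_matrix(1, 0, 1, 0, 0, -1)
print(on_conic(C, np.array([1.0, 0.0, 1.0])))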

(Refer Slide Time: 02:11)

So, let us see how a conic behaves under transformation, how a conic appears in the transformed space. Consider a transformation H where a point X is transformed to X'. All the points lying on a conic are given by the equation X^T C X = 0.

Let us see what property they follow in the transformed space. We can write the equation X^T C X = 0 in the form (H^{-1} X')^T C (H^{-1} X') = 0, because X is replaced by H^{-1} X'; this comes from

X' = HX, hence X = H^{-1} X'.

So, X equals H^{-1} X'. If I substitute it, I get the equivalent form (H^{-1} X')^T C (H^{-1} X') = 0. We expand this equation once again, that is, the transpose operation is expanded over the matrix multiplication, which gives X'^T H^{-T} C H^{-1} X' = 0.

You can see that there is a quantity H^{-T} C H^{-1}. It is a matrix product, a composite, and it gives another matrix; let me write this matrix as C'. It is also a 3×3 matrix, so you have a 3×3 matrix even in the transformed space, and it satisfies the equation X'^T C' X' = 0. This is nothing but another conic representation.

(Refer Slide Time: 04:25)

So, the transformed conic is given by C' = H^{-T} C H^{-1}. A conic remains a conic under homography; that is the essence of this mathematical exercise, so the conic is an invariant class under the transformation. Since the dual conic is nothing but the inverse of the conic in its point representation, if I perform the inverse operation I get the corresponding dual conic, which transforms as C*' = H C* H^T, and that is consistent with this transformation.
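
A quick numerical sketch of this rule, using a placeholder homography and the unit circle as the conic:

import numpy as np

H = np.array([[2.0, 0.1, 1.0],
              [0.0, 1.5, -2.0],
              [0.01, 0.0, 1.0]])               # placeholder homography
C = np.diag([1.0, 1.0, -1.0])                  # the unit circle x^2 + y^2 = 1

Hinv = np.linalg.inv(H)
C_prime = Hinv.T @ C @ Hinv                    # transformed conic C' = H^{-T} C H^{-1}

X = np.array([np.cos(0.3), np.sin(0.3), 1.0])  # a point on the circle
Xp = H @ X                                     # its image under H
print(X @ C @ X, Xp @ C_prime @ Xp)            # both values are ~0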

(Refer Slide Time: 05:11)

There is another interesting property of this transformation, particularly concerning conics. There are two points known as the circular points; we have already mentioned them while discussing the invariances of the similarity transformation, and you have seen that these points are invariant under similarity transformations. The definition of the circular points is given as:

"The circular points I, J are fixed points under the projective transformation H iff H is a similarity. They also lie on l_∞."

I mentioned that you need to consider a two-dimensional projective space with a complex representation, where one axis is the imaginary axis, another is the real axis, and there is a scale axis.

So, these are the two points:

I = (1, i, 0)^T,   J = (1, -i, 0)^T

If I apply a transformation, I once again get a representation in the complex space itself. We have already discussed that under a similarity transformation these points remain the same. So, they are the fixed points of a similarity transformation.
And this is the summary of this particular property that the circular points are fixed
points under the projective transformation H, if and only if H is a similarity.

They are also on line at infinity because you note that their scale i,e, this third dimension
is scale value which is 0. If I apply the point containment relationship with the line at
infinity, it will satisfy that relationship which means the dot product of the column vector
represented by circular points and line at infinities they should equals to 0. Just to
elaborate this part. So, if I perform the following operation

0 
1 i 00  0
1

This is the point containment relationship.

(Refer Slide Time: 07:37)

Every circle intersects the line at infinity at I and J, which is another property of circles. Let us elaborate this: consider the representation of a circle in the homogeneous coordinate system,

x_1^2 + x_2^2 + d x_1 x_3 + e x_2 x_3 + f x_3^2 = 0.

The above equation is the representation of a circle; you can see that the coefficients of x_1^2 and x_2^2 are the same, and in this case we have normalized the equation with respect to those coefficients.

We require only three parameters d, e, f, and that is the representation of the circle. Now if I set x_3 = 0, which means computing the intersection with the line at infinity l_∞ (because any point whose third coordinate is 0 lies on the line at infinity), the equation becomes x_1^2 + x_2^2 = 0.

You can note that the two points I and J satisfy this equation, because for I we have 1^2 + i^2 = 1 - 1 = 0; that is, the circular point I lies on the circle and it also lies on the line at infinity l_∞. That is why every circle intersects the line at infinity at I and J. So, this is another interesting property of the circular points.
points.

(Refer Slide Time: 09:51)

Since we have these two circular points, we can also define a conic dual to them. Previously we discussed the dual representation of conics; now, using the two circular points, we can define a degenerate conic in the following form:

C*_∞ = I J^T + J I^T

You can see that this is a 3×3 form, which eventually has the simplified representation

C*_∞ = [1  0  0; 0  1  0; 0  0  0]

This is the typical dual conic formed by the two circular points, and it has a very interesting property: since the circular points are fixed under similarity, this dual conic also remains fixed under similarity. So, this is another interesting feature; earlier we mentioned that the circular points are fixed under similarity, and so this dual conic representation built from the circular points is also fixed under similarity.

The degree of freedom of this dual conic at infinity is 4, and its determinant is equal to 0; this is another property of this dual conic. The other thing is that the line at infinity is the null vector of the dual conic at infinity, because if I multiply the dual conic at infinity with the line at infinity, I get the zero vector:

C*_∞ (0, 0, 1)^T = (0, 0, 0)^T

Now let us discuss how an angle can be measured under homography.

(Refer Slide Time: 12:09)

Suppose we have two straight lines l and m, as shown in the figure, which make an angle θ, and the first two components of their line vectors are (l_1, l_2) and (m_1, m_2). In the projective space we have the 3-vector representation, but we know that the first two components of that 3-vector give the corresponding direction ratios. So, they can be used for measuring the angle between the two straight lines, as shown in this expression: if you take the dot product of the vectors (l_1, l_2) and (m_1, m_2) and normalize, basically the dot product of the unit vectors representing those two directions, then you get the cosine of the angle, cos(θ):

cos(θ) = (l_1 m_1 + l_2 m_2) / sqrt((l_1^2 + l_2^2)(m_1^2 + m_2^2))

Interestingly, this measurement can also be carried out even under homography; there is an invariance which gives you this flexibility. Let me discuss this property: cos(θ) can be expressed using the dual conic at infinity in the following form,

cos(θ) = (l^T C*_∞ m) / sqrt((l^T C*_∞ l)(m^T C*_∞ m))

That means, if I take l^T C*_∞ m in the numerator, it expresses l_1 m_1 + l_2 m_2. Note here that C*_∞ is the dual conic at infinity, given by diag(1, 1, 0), so

l^T C*_∞ m = [l_1  l_2  l_3] diag(1, 1, 0) (m_1, m_2, m_3)^T = l_1 m_1 + l_2 m_2,

which computes the numerator. Similarly, (l_1^2 + l_2^2) and (m_1^2 + m_2^2) can be computed as l^T C*_∞ l and m^T C*_∞ m respectively. Hence,

cos(θ) = (l^T C*_∞ m) / sqrt((l^T C*_∞ l)(m^T C*_∞ m))

computes the same quantity. The interesting part is that there is an invariance associated with this measure, particularly with the product l^T C*_∞ m: if I apply the homography and use the corresponding transformed lines and the transformed dual conic at infinity, they preserve the same quantity. Let me explain that.
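
A small sketch of this angle measurement (the two example lines are placeholders):

import numpy as np

C_inf = np.diag([1.0, 1.0, 0.0])               # dual conic at infinity

def angle_between_lines(l, m, C=C_inf):
    num = l @ C @ m
    den = np.sqrt((l @ C @ l) * (m @ C @ m))
    return np.arccos(num / den)

l = np.array([1.0, 0.0, -2.0])                 # the line x = 2
m = np.array([1.0, 1.0, 0.0])                  # the line x + y = 0
print(np.degrees(angle_between_lines(l, m)))   # 45 degrees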

(Refer Slide Time: 15:47)

So, this is invariant under homography, and we use the property that the dual conic at infinity under the homography H is related to the dual conic at infinity in the original space by C*'_∞ = H C*_∞ H^T. And, as you know, the transformed line and the original line are related by l' = H^{-T} l.

Let me show you the derivation by substituting these quantities. l'^T, the transpose of the transformed line, becomes l^T H^{-1}; the transformed dual conic C*'_∞ becomes H C*_∞ H^T; and m' is the transformed line H^{-T} m. As you know, H^{-1} H is the identity matrix, so

l'^T C*'_∞ m' = l^T H^{-1} H C*_∞ H^T H^{-T} m = l^T C*_∞ m.

Similarly, H^T H^{-T} also forms the identity matrix, so what you essentially get equals l^T C*_∞ m. This is how it becomes an invariant: the measurement of cos(θ) is given by the same expression even in the transformed space; that is, if I use the corresponding transformed lines and the transformed dual conic, then I still get the same measure cos(θ). That is how it becomes invariant.

So, what you need to do is this: given an image under homography, you would like to compute the transformed dual line conic at infinity, C*'_∞. If you can obtain it, then you can use this relationship to obtain the original angle formed by two lines in the original space.

Which means: suppose there are two perpendicular lines in the original space; after the homography they will not remain perpendicular, but this relationship will still hold, and, as you know, the cosine of the angle between perpendicular lines equals 0. So, we can always recover this orthogonality property of two lines using the dual line conic at infinity in the transformed space.

So, let us consider the following: if l and m are orthogonal, then

l'^T C*'_∞ m' = 0.

Note that l', m' and C*'_∞ are all quantities in the transformed space. So this can be used for measuring angles and, in fact, for recovering metric properties, as we explain in the next slide.

(Refer Slide Time: 19:41)

So, first we need to estimate C*' ; that means, the dual line conic under transformation at

infinity dual line conic at infinity under transformation and we will be using this property
of orthogonal lines which means I would consider that those lines are transformed lines.

So, l 'C*'m '  0 . In this way that would give me an equation. Since there are there are

five independent parameters in the conics. So, minimum 5 such orthogonal pairs are
needed thatwould give five different equations to solve it. So, a typical equation will
look like this. So, you consider a particular line say l and m we can consider l and m
itself in that space. So, there is no problem in notation its a notational description only.

So, you will be using this equation. Here l is represented as (l_1, l_2, l_3)^T, m is represented as (m_1, m_2, m_3)^T, and the equation is l^T C m = 0, where C is a general conic. I will be estimating this conic, which is represented by the parameters (a, b, c, d, e, f); that is, I can write C as the column vector

c = (a, b, c, d, e, f)^T

These are the parameters, and C can be represented as a symmetric matrix using those parameters; we have discussed earlier how C is built from these parameters.

So, you have to obtain these parameters using this equation. As you can see, there are six parameters and there are five equations, and in fact it is a set of homogeneous equations.

(Refer Slide Time: 22:25)

So, you need to use the method we have used earlier for solving a set of homogeneous equations: we can apply the direct linear transform method to solve the set of homogeneous equations and get C. If I express the set of five equations in matrix form, each equation looks like the following:

[ l_1 m_1,  (l_1 m_2 + l_2 m_1)/2,  l_2 m_2,  (l_1 m_3 + l_3 m_1)/2,  (l_2 m_3 + l_3 m_2)/2,  l_3 m_3 ] c = 0

I can represent the set of equations as a matrix multiplication A c = 0, where each row of A is formed by this particular row vector for every pair of orthogonal lines. If there are five pairs there will be five such rows; if there are more, there will be more rows, and you have to use a least-squares method.

So, you have to minimize the norm ‖Ac‖² subject to ‖c‖ = 1; that is the problem statement in this context. We know that the solution of this particular set of equations is the eigenvector of A^T A which has the minimum eigenvalue; that is the solution for the parameters of C.

Otherwise, you can also convert it into a set of non-homogeneous equations when you are working with exactly five equations: set f = 1. Once again the limitation is still there; if f = 0 in the representation of the conic, then this will not work, but let us assume that we can work with some non-zero f in your particular problem. Then you can use non-homogeneous equations and solve them by inversion, putting the values of f on the right-hand side of the equations; we have discussed all these methods. Using this kind of technique you can estimate the conic, and in particular estimate the dual line conic at infinity.
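As an illustration of this estimation step, here is a minimal NumPy sketch, assuming the orthogonal line pairs are already available as homogeneous 3-vectors (the pair list and function name are hypothetical). It builds one row of A per pair and takes the right singular vector of A for the smallest singular value, which is equivalent to the minimum-eigenvalue eigenvector of A^T A.

```python
import numpy as np

def estimate_dual_conic(line_pairs):
    """Estimate the transformed dual line conic at infinity from >= 5
    pairs of image lines (l, m) that are orthogonal in the world."""
    rows = []
    for l, m in line_pairs:
        l1, l2, l3 = l
        m1, m2, m3 = m
        # One homogeneous equation per orthogonal pair: row . c = 0
        rows.append([l1 * m1,
                     (l1 * m2 + l2 * m1) / 2.0,
                     l2 * m2,
                     (l1 * m3 + l3 * m1) / 2.0,
                     (l2 * m3 + l3 * m2) / 2.0,
                     l3 * m3])
    A = np.asarray(rows)
    # Right singular vector for the smallest singular value minimizes ||Ac||, ||c|| = 1.
    _, _, Vt = np.linalg.svd(A)
    a, b, c, d, e, f = Vt[-1]
    # Assemble the symmetric conic matrix from the 6 parameters.
    return np.array([[a,      b / 2, d / 2],
                     [b / 2,  c,     e / 2],
                     [d / 2,  e / 2, f    ]])
```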

(Refer Slide Time: 24:57)

And then we need to do one more operation: as we know, the dual line conic at infinity is a rank-deficient matrix; it should be a rank-2 matrix. So, the trick we can perform here is to apply the singular value decomposition on the estimated C and set the minimum singular value to 0. Let me explain this. Suppose you get a C and you perform the singular value decomposition; C can be represented as

C = U D V^T

where D is a diagonal matrix whose singular values are D = diag(σ_1, σ_2, σ_3), and U and V^T are both 3×3 matrices. Suppose the singular values have been arranged such that σ_1 is the maximum, then σ_2, and then σ_3. What we can do is set σ_3 = 0 and use the modified matrix as our estimate; that is, use U D̃ V^T, where D̃ is the diagonal matrix with σ_3 set to 0. That is how we can make it a rank-2 matrix. In this way you can estimate the dual line conic at infinity in the transformed space. Now you can perform a metric rectification using this, as I was mentioning. So, let us discuss that part also: how can we recover the metric properties?
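A small sketch of this rank-2 enforcement, continuing the NumPy setting above (the function name is hypothetical):

```python
import numpy as np

def enforce_rank2(C):
    """Project the estimated conic onto the nearest rank-2 matrix
    by zeroing its smallest singular value."""
    U, s, Vt = np.linalg.svd(C)
    s[-1] = 0.0                      # drop the minimum singular value
    return U @ np.diag(s) @ Vt       # rank-2 estimate of the dual conic
```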

(Refer Slide Time: 26:53)

So, what we require to do now is to estimate the homography from C^{*\prime}_\infty, and we have seen that under a similarity transform this dual line conic remains preserved. So, what can we do? The homography can be multiplied by any similarity homography; the conic will still remain the same and its properties will also remain the same. So, let us consider that we will be computing this matrix up to a similarity, which means the ratios of distances between edges will be preserved, not the exact distances; that is what you can estimate in this way.

So, we can use a matrix decomposition method to compute the homography, which means I need a homography matrix such that it generates the transformed dual line conic at infinity that we have estimated. Note here that the canonical dual line conic at infinity C^*_\infty is given by the matrix

C^*_\infty = diag(1, 1, 0) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}

So, if I can decompose the estimated matrix in such a way that a matrix of this form sits in the middle of the three factors, then the outer factor could be the homography. One such method could be the decomposition

C^{*\prime}_\infty = U \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix} U^T

Since a conic is a symmetric matrix, we can decompose it in this way: we can adjust the singular values so that the two non-zero entries become 1, 1, and accordingly the columns of U and U^T are adjusted. So, if you use this kind of decomposition, it is a singular value decomposition exactly as we did, but the diagonal matrix has to be adjusted so that both non-zero singular values become one, with the columns of U scaled accordingly. As you see, the middle matrix satisfies the structure of C^*_\infty, the dual line conic at infinity. Then we can take U as the homography and apply the inverse of this homography to the image to get the rectified image, that is, to recover the metric properties; as I mentioned, you can get the ratios of the distances.
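A possible sketch of this rectification step, assuming the rank-2 estimate of the transformed dual conic is already available (for example from the helpers sketched earlier); the square roots of the two non-zero singular values are folded into U so that the middle factor becomes diag(1, 1, 0). The function name and the sign-fix heuristic are assumptions, not from the lecture.

```python
import numpy as np

def metric_rectifying_homography(C_est):
    """Given a rank-2 estimate of the transformed dual line conic at
    infinity, return a homography to apply to the image for metric
    rectification (up to a similarity)."""
    C = -C_est if np.trace(C_est) < 0 else C_est   # fix overall sign so C is PSD
    U, s, _ = np.linalg.svd(C)
    # C = U diag(s1, s2, 0) U^T = H diag(1, 1, 0) H^T
    # with H = U diag(sqrt(s1), sqrt(s2), 1).
    H = U @ np.diag([np.sqrt(s[0]), np.sqrt(s[1]), 1.0])
    return np.linalg.inv(H)   # apply this to image points to rectify
```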

(Refer Slide Time: 29:39)

So, with this let me conclude this lecture and also this topic, with the following summary. We have studied various transformations under homography: we studied how a point is transformed into another point in the transformed space and how the two are related. They are related by a nonsingular 3×3 matrix, which is called a projective transformation; it preserves the collinearity of points in the transformed space, it is invertible, and correspondingly a line is also transformed into another line in the transformed space. The relationship between the transformed line and the original line is given by l' = H^{-T} l, the transpose of the inverse of the transformation matrix.

Then, conics remain conics after transformation, and their relationships have been summarized here. Similarly, the dual conic representation is also valid in the transformed space. Projective transformations form a group, the projective linear group, and there are different subgroups and a hierarchy that we have discussed: the projective linear group having 8 degrees of freedom, the affine group having 6 degrees of freedom, the Euclidean group having 4 degrees of freedom, and the oriented Euclidean group having 3 degrees of freedom.

Incidentally, the Euclidean group here is the similarity transformation that we discussed, and the oriented Euclidean group corresponds to the isometric transformation that we discussed.

(Refer Slide Time: 31:27)

Then we discussed the conic dual to the circular points, and you have seen its various interesting properties: first, it is invariant under similarity transformations; then, the line at infinity is a null vector of this particular dual conic; and it preserves the cosine of the angle between two lines under transformation, which is interesting. That is, using the transformed dual conic you can compute the cosine of those angles.

So, this is the particular property of how the cosine of an angle can be computed using the transformed dual conic. Homography is used in various tasks; some examples are given here. For instance, you can use homography for rectifying images; there are two types of rectification we discussed, one is affine rectification and the other is stratification, which helps you recover the metric properties, particularly proportionate distances, which you can recover on performing those corrections. So, with this let me stop my lecture here, and thank you for listening; we will continue our discussions on other topics from the next lecture.

Thank you very much.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 11
Camera Geometry Part - I

(Refer Slide Time: 00:20)

In this lecture we will discuss Camera Geometry. We have already discussed how a 3 dimensional scene point is mapped to a 2 dimensional image point. As we know, if I consider a 3 dimensional scene point P and from there draw a straight line which passes through the centre of the lens and intersects the image plane at another point, some small p, that intersection forms the corresponding image point.

Now, logically we can also consider the image to be formed in the same way in front of the camera, where the corresponding image plane is placed at the same distance at which the sensors are usually placed behind the lens. If I keep that distance the same, then you will get images of the same size, and there is a logical transformation of coordinates from this point to this point that we can consider.

So, this is a convenient way of representing the coordinate system and we will use this
principle while explaining the mapping of 3 dimensional scene point to 2 dimensional
image point using this kind of projective geometry.

(Refer Slide Time: 01:40)

So, let us consider the principle here, how these are related. This is the configuration usually shown for pinhole camera mapping: once again the center of projection is taken as the point from which all the rays emanate, and a ray is projected onto the image plane; the point of intersection is the image of the corresponding 3 dimensional point.

The only thing you should notice here is that we are interested in the mapping of physical 3 dimensional coordinates in the scene to the corresponding 2 dimensional image point in the image plane. Though, in the sense of projective geometry, the image point represents the complete ray, physically images are formed by the phenomenon of reflection, where the energy reflected from the scene point is focused at this point.

So, we are interested to build up that relationship and mathematically if I know this point,
then it is very easy to establish the relationship between the coordinates of the image
point and also the coordinates of the 3 dimensional point where now, we have to
considered some convenient representation of this coordinate system. So, let us
considered the center of projection or the camera centered at the origin of this coordinate
system and also the axes of the 3 dimensional coordinate system those axes, the X axis
and Y axis of that 3 dimensional coordinate system that is parallel to the coordinate
convention what is followed in the image plane.

The origin of the image planes coordinate that is formed by the intersection of the Z axis
with that image plane and this is considered as the origin of the image plane and as I
mentioned this X and Y axis they are axis parallel with respect to the 3 dimensional
coordinate system. So, with this convention then it is very easy to compute the
coordinate x and y from the 3 dimensional coordinate which is represented here by upper
cases of X, Y, Z or capital X, Y, Z. We can apply the law of similar triangle, we assume
this plane is at a distance f because f is the focal length of a camera.

So, in that case, by applying the law of similar triangles (these are the scene points, this is the image coordinate, and f is the focal length), we get the image coordinates as

x = fX / Z,    y = fY / Z

So, you are using the coordinates of the scene point for this computation; you can apply the law of similar triangles by considering the triangle used for finding out the x coordinate. Here X is the offset from the YZ plane, one side of the similar triangle has length f, and the corresponding side in the scene has length Z, so x = fX/Z. Similarly, y = fY/Z.

So, this is how we map a 3 dimensional coordinate point to its corresponding 2 dimensional image point; you see that we are using the actual scene coordinates for which this image has been formed. So, if I give you X, Y, Z it is very easy to get the point (x, y), but the inverse is not true. If I give you only the image point, it naturally represents the whole ray, and unless you specify some other constraint you cannot say of which point it is the image. So, let us see how we can resolve all these things.

(Refer Slide Time: 06:31)

Another nomenclature I should mention is as follows: we call the XY plane of the original 3 dimensional coordinate system the principal plane, and the Z axis of that coordinate system is called the principal axis.

(Refer Slide Time: 06:51)

So, mathematically we have seen this already; we have written it in the form of the equations x = fX/Z and y = fY/Z, and these equations can be represented in a matrix form if I use a homogeneous coordinate system for representing the points. Let us see how you can do it. In this representation, the coordinate point (X, Y, Z) acquires an additional dimension in the homogeneous coordinate system.

So, it is a 3 dimensional homogeneous coordinate system, or you can also say this is a 3 dimensional projective space; that is why it is a point in the projective space P^3. This point is mapped to a point in the 2 dimensional projective space, which is given by (x, y, 1)^T, where the last coordinate acts as a scale factor. You can see the trick in this representation: since the products are fX and fY, if I divide them by Z then we get the actual image coordinates.

\begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} fX \\ fY \\ Z \end{pmatrix} = \begin{pmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}

So, the above matrix multiplication will provide you this representation of fX, fY, Z in
the projective space in a 2 dimensional projective space which is equivalent to the image
point. That is why, this mapping can be expressed as a linear mapping in the projective
space and it is a mapping between a 3 dimensional projective space to a 2 dimensional
projective space.

So, that is the difference between homography, what we discussed in earlier lectures,
where a homography was defined in our context as a mapping from a 2 dimensional
projective space to another 2 dimensional projective space that is one difference, here it
is from 3 dimensional projective space to a 2 dimensional projective space.

The other difference is that a homography was invertible, so the mapping was one-to-one. In this case, a 3 dimensional point is mapped to a unique image point, but the converse is not true: when an image point is projected back to the 3 dimensional space, it becomes a projecting ray, and any point on that ray is represented by this image point. That is why this mapping is not invertible. You can also see it from the dimension of this matrix: it is not a square matrix, which is one of the conditions for having an invertible linear transformation (of course, it should be non-singular as well).

But in any case, since it is 3 4 matrix; so we can say that this mapping is not an
invertible mapping. So, this matrix is called projection matrix and this matrix can be also
represented in this form, you can see that it is formed by composition of two matrices,

f 0 0  1 0 0 0 
P   0 f 0 0 1 0 0  diag ( f , f ,1)[ I | 0]
 0 0 1 0 0 1 0

Where one is a 3 3 matrix. The other one is 3 4 matrix and the corresponding the
structure of the matrix are very simple and you note the notation what you are using for
representing this matrix, this particular matrix which contains information of the focal
length that is represented in a short form, we will be using this notation because it is a
diagonal matrix we will be simply listing its corresponding diagonal elements.

Whereas, for representing the other part of this you know the right hand side matrix in
this component. So, there are two sub matrixes, one sub matrix is an identity matrix. So,
this is a 3 3 identity matrix which is shown here in this form and the other sub matrix is
a column vector which is a 0 column vector which is also shown here in this
representation.

So, diag ( f , f ,1)[ I | 0] is a short form of representation of a projective projection matrix,


there are motivations for representing in this form that would be clear when we perform
the corresponding different mathematical operations on this matrices. So, let us proceed
and we will understand why we are taking this kind of representations.
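To make the mapping concrete, here is a tiny NumPy sketch of this canonical projection (the focal length and scene point below are illustrative values, not taken from the lecture):

```python
import numpy as np

f = 1.2  # illustrative focal length, in the same unit as the scene coordinates

# P = diag(f, f, 1) [I | 0], the canonical projection matrix
P = np.diag([f, f, 1.0]) @ np.hstack([np.eye(3), np.zeros((3, 1))])

X = np.array([2.0, 1.0, 4.0, 1.0])   # homogeneous scene point (X, Y, Z, 1)
x = P @ X                            # homogeneous image point (fX, fY, Z)
x_img = x[:2] / x[2]                 # divide by the scale factor Z -> (fX/Z, fY/Z)
print(x_img)                         # [0.6 0.3]
```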

(Refer Slide Time: 11:50)

So, we will try to generalize this coordinate mapping considering different other
scenarios. So, in the canonical form as we discussed earlier, we have considered the
projection plane or the image plane the axes in the image plane they should be parallel to
the axes what are used for the 3 dimensional coordinate systems. So, there are two
coordinate systems in this configuration, one coordinate system is the 3 dimensional
coordinate system which is also called world coordinate system. The other coordinate
system which is called image coordinate system. So, this image coordinate system also
has its own coordinate convention and that is the representation of the corresponding 2
dimensional image plane and you can also considered that is represented by a 2
dimensional projective space.

So, it has its origin here and in this convention in the 2 dimensional real 2 dimensional
space. So, it has its origin and this origin is formed as I mentioned as an intersection of
the optical axis or principal axis which is a Z axis with the image plane. So, that is how
this origin is formed and then axis are parallel I told you.

So, this is the canonical configuration, and in this case we are assuming, of course, that the image plane is at a distance equal to the focal length f; in a further canonical form we can even normalize it to a distance of 1. We have already seen, in the previous representation itself, that the focal length is taken outside of the projection matrix, and the other, right-hand side component contains only the identity matrix and the 0 column vector.

So, if I use only that [I | 0] part, it is the minimal representation of this configuration, where the projection plane is taken at unit distance from the center of projection. Now let us consider that we can have a different coordinate system in the image plane; the origin can be at a different position. In the canonical form discussed here, when the image plane is at a distance f from the principal plane, the mapping of a 3 dimensional point (X, Y, Z) to an image point (x, y) is given by the equations x = fX/Z and y = fY/Z, which we have already discussed.

Now, when you consider a different coordinate system in the image plane, the origin may be shifted to some other point, but we may assume the coordinate axes still remain parallel to the X axis and Y axis. When the origin has been shifted to a different point, the principal point will have a coordinate other than 0 in the image plane, say (p_x, p_y). Any point in the image plane will now be represented as (p_x + x) and (p_y + y). So, if I call the new coordinates x' and y' under this translation of the origin, then

x' = p_x + fX/Z,    y' = p_y + fY/Z

So, this is how these coordinates should be interpreted, and if I transfer this relation to matrix form, then we can write it simply as follows: the third column of the projection matrix changes from (0, 0, 1)^T to (p_x, p_y, 1)^T. You can verify the corresponding expressions; if I simplify the matrix multiplication, the operation gives me

\begin{pmatrix} fX + p_x Z \\ fY + p_y Z \\ Z \end{pmatrix}

This is a point of the 2 dimensional projective space, and Z is the scale factor. So, finally, when I divide by Z, the coordinates become p_x + fX/Z and, similarly for the y coordinate, p_y + fY/Z. So, this is equivalent to the shifted coordinate system, and this is how the shift of the origin in the image plane affects the projection matrix.
(Refer Slide Time: 18:04)

So, this is the form of the projection matrix when there is an offset in the origin of the
image plane and this offset is given by p x , p y and which forms the third column of this

projection matrix so, this particular information is contained in the third column of the
projection matrix. So, once again, you can decompose this matrix into the form

f 0 p x  1 0 0 0 
P   0 f p y  0 1 0 0  K [ I | 0]
 0 0 1  0 0 1 0

197
1 0 0 0 
what I discussed earlier, you can see that 0 1 0 0 remains the same that is
0 0 1 0

canonical form of projection matrix when your image plane is at a distance of unity in
front of the camera center.

So, this mapping is very simple, it is an identity matrix and you know this column is this
0 vector column and once again this is represented by this particular matrix and K is
f 0 px 
representing  0 f p y  . So, this 3 3 matrix is represented by the symbol K here
 0 0 1 

which is called incidentally calibration matrix, we can see that this is camera calibration
matrix and you can find out in this matrix, we have some of the parameters which are
defining the mapping of the image coordinates, canonical form of image coordinate to
its corresponding coordinate what has been observed in what is given to us.

So, what are the elements there? You can see that once again the diagonal elements
particularly, it is found by the corresponding focal length it is diagonal [f f 1] that is
these are the offset principal. So, finally, the relationship between the image point and
the 3 dimensional scene point that can be summarized in the following form:
x  K [ I | 0] X as this is the image point, small this represented by this convention I will
be using the small letter to denote any image point and sometimes I will be using the
bold font just to denote that it is a column vector. Similarly, the upper cases, they are the
world coordinate system and they are all represented in homogenous coordinate system.

198
(Refer Slide Time: 20:48)

Now, let us consider that your world coordinate system also is different from the camera
centric coordinate system; that means, your origin of the world coordinate system is
different and also the axis of the world coordinate system there at different orientations
which are not parallel to the axes of image plane. So, this is the canonical configuration
what we discussed earlier so, this is the canonical form we discussed. So, this is a
canonical form, but this is the actual coordinate system, world coordinate system what
we will be considering.

So, that first we need to consider the transformation between this coordinate system and
that and the corresponding canonical coordinate system. So, there are operations like
rotations of axes and translation of origin those are involved in transforming any
coordinate to the corresponding canonical world canonical camera centric coordinate
~ ~ ~
system. So, this could be related by X c  R ( X  C ) where R is given by a rotation matrix

which is a 3 3 rotation matrix.

So, first we have to shift the coordinates of the 3 dimensional point. Incidentally, you
note that there are two type of representation of coordinate representations in our
discussion; one is that representation in the non-homogenous coordinate space or in
homogenous coordinate space. So, that is the normal 3 dimensional coordinates what we
use in or analytical geometrics and the other representation of the same coordinate point
in a 3 dimensional projective space or we call it homogenous coordinate system.

199
So, in this case this relations, it involves only the coordinates in the usual convention of
non-homogenous coordinate system, even the centre is also represented. when I will be
using symbol tilde over the point symbol of that point, then I will consider the
coordinate representation is a non-homogeneous coordinate representation. Otherwise, if
this tilde is not there at the top of that symbol then usually unless I mention that is
representating homogenous coordinate system.

So, you can see that the first you have to translate the point from camera center. So,
translate the origin by this operation. So, this is the corresponding location of the center.
So, first you translate the point to this center and then you apply the corresponding
rotation at this point and then you will get the corresponding transformed camera
coordinate system as in the camera coordinate system. So, this is the transformation
which is well explained by any analytical geometry book .you can t refer to it.

So, R is in the structure of it is a 3 3 matrix which is which is called rotation matrix


and . So, any coordinate X , X ' , it can be expressed in the coordinate system camera
centric coordinate system by this particular operation. So, now, what we need to do? We
need to apply this transformation of X to this canonical coordinates camera centric
coordinates system and then we can apply the same projection matrix.

(Refer Slide Time: 24:50)

So, which can be expressed in this form so, you can see that in this expression, I can
express this transformation of a coordinate system in the world coordinate the camera

200
centric coordinate system, all are expressed in the form of a 3 dimensional or
homogenous coordinate part.

X 
~  
 R  RC  Y
Xc    
 0 1  Z 
 
1 

~
You notice that this is the X part that is the non-homogenous coordinate representation
if I add another additional dimension, then it becomes homogenous representation and
~ ~
this operation so, this R ( X  C ) which is explained by this sub matrix multiplication.

~ ~
 R  RC   X 
  
0 1  1 

X 
~  
Where X  Y 
 Z 

(Refer Slide Time: 25:59)

We can now apply the relationship with the projection matrix in the camera-centric system. You already know that when your coordinate convention is the camera-centric one, that is the canonical form we discussed while deriving the projection matrix.

The relationship between the image point and the corresponding coordinate in the camera-centric coordinate system is given by x = K [I | 0] X_c. Once again, K is the camera calibration matrix, which is a 3×3 matrix, I is an identity matrix, and 0 is the corresponding column vector. This can be further expanded: X_c can again be related to the observed 3 dimensional world coordinate system in the following form.

x = K [I \mid 0] \begin{pmatrix} R & -R\tilde{C} \\ 0^T & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}

So, finally, this is the mapping of a point. You see that the whole expression K [I | 0] [R, −RC̃; 0^T, 1] provides a cascade of matrices, and if I multiply all these matrices the result is a matrix of dimension 3×4, because K is of dimension 3×3, [I | 0] is of dimension 3×4, and the last factor is of dimension 4×4. So, if I perform the matrix multiplication, the resulting matrix is represented by the symbol P, which is called the projection matrix.

So, this projection matrix of 3 4 so, if I multiply once again this projection matrix with
the 3 dimensional coordinates in the projective space of a same point, then I will get its
corresponding image point once again in a 2 dimensional projective space. So, that is
how, we derive the relationship in a more general case.
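Here is a compact sketch of this composition for a general pose, assuming K, a rotation matrix R, and a camera centre C̃ are already known; all numeric values below are illustrative, not taken from the lecture.

```python
import numpy as np

def projection_matrix(K, R, C_tilde):
    """P = K [R | -R C_tilde], mapping homogeneous world points to image points."""
    t = -R @ C_tilde
    return K @ np.hstack([R, t.reshape(3, 1)])

# Illustrative intrinsics and pose.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])
R = np.eye(3)                        # no rotation, for simplicity
C_tilde = np.array([0.0, 0.0, -2.0]) # camera centre in world coordinates

P = projection_matrix(K, R, C_tilde)
Xw = np.array([0.5, 0.25, 2.0, 1.0]) # homogeneous world point
x = P @ Xw
print(x[:2] / x[2])                  # pixel coordinates [420., 290.]
```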

(Refer Slide Time: 28:02)

To summarize this representation once again in brief: this matrix is the projection matrix, shown here by the symbol P, and the whole composition can be shown in different ways.

Suppose I take the rotation matrix out of this operation; then I can equivalently represent this matrix as P = K R [I | −C̃]. In this case KR is a 3×3 matrix, because K is a 3×3 matrix and R is also a 3×3 matrix; once again I is the identity matrix, and here, instead of a 0 column vector, what we get is the negative of the coordinates of the camera center, which is very interesting.

So, from the projection matrix elements themselves we can determine the position of the camera center in the world coordinate system; we will come to that part in the next slides, where the computations will be elaborated. In the other form of representation, if I take only K out and keep R inside the 3×4 part, then that part has two components: one is the rotation matrix and the other is a translation, P = K [R | t].

So, this gives you the coordinate-transformation parameters from the world coordinate system to the camera-centric coordinate system; that means you can get the rotation matrix and the translation of the origin. If I can find out the matrix K, then from there we can get all these components. These forms are interesting because they give you different information from the projection matrix itself.

(Refer Slide Time: 30:21)

So, when we observe a digital image, we actually observe pixel coordinates. How is a pixel defined? A pixel is defined by a sensor element, which is usually a CCD sensor for a digital camera; it could also be a CMOS sensor, there are different sensors, but the information of each pixel is captured by one sensor element. We usually call this arrangement of pixels a CCD camera arrangement.

So, if I consider the CCD camera arrangement, then we also need to consider the parameters involved in the digitization of the points on the sensor plane; there is a resolution involved there. We have already shown the components of the projection matrix without considering the parameters of a CCD camera, but now let us consider that per unit length there are m_x sensors in the horizontal direction and m_y sensors in the vertical direction; that is how a unit length is covered by that many sensors. Since f acts as a multiplicative factor of length in the calibration matrix, in our case f can be replaced through this transformation as α_x = f m_x and α_y = f m_y in their respective places.

So, this means that when we take into account the number of pixels per unit length along the X and Y axes, our camera calibration matrix will look like

K = \begin{pmatrix} \alpha_x & 0 & p_x \\ 0 & \alpha_y & p_y \\ 0 & 0 & 1 \end{pmatrix}

The diagonal elements, instead of f, are now α_x and α_y, which take care of the pixel resolutions of the camera. And then there is also another issue: the alignment of the sensors in the horizontal and vertical directions may not always be exactly rectilinear.

There would be small deviations because of manufacturing; very small, though they should still be accounted for. So, let us consider that instead of forming a 90 degree angle, there is a small deviation; it forms an angle θ from the perpendicular direction. This provides a kind of skewed coordinate system, so we need to take care of that and correct that skew, and this is taken care of by introducing another parameter into the calibration matrix, shown here:

K = \begin{pmatrix} \alpha_x & s & p_x \\ 0 & \alpha_y & p_y \\ 0 & 0 & 1 \end{pmatrix}

The skew s can be derived from the corresponding coordinate transformation; you can also verify that s would be tan θ, or you can even say sin θ, since θ is very small, so you can write s ≈ θ in radian units.
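As a sketch, the calibration matrix of such a CCD camera could be assembled like this (all numeric values are illustrative placeholders, not from the lecture):

```python
import numpy as np

def ccd_calibration_matrix(f, mx, my, px, py, theta=0.0):
    """K for a CCD camera: alpha_x = f*mx, alpha_y = f*my,
    skew s ~ tan(theta) ~ theta (radians, assumed very small),
    principal point (px, py) in pixels."""
    ax, ay = f * mx, f * my
    s = np.tan(theta)
    return np.array([[ax, s,  px],
                     [0., ay, py],
                     [0., 0., 1.]])

K = ccd_calibration_matrix(f=0.008, mx=100000, my=100000,
                           px=320, py=240, theta=1e-3)
print(K)   # alpha_x = alpha_y = 800 pixels, with a tiny skew
```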

(Refer Slide Time: 34:01)

So, this is the final form of the calibration matrix, and a general projective camera matrix can be described when I consider all the parameters and all the transformations in this form. It consists of the following elements: K has 5 elements, α_x, α_y, s, p_x, p_y, and the other component contains the remaining parameters, the rotation and the translation.

These parameters are grouped into different forms. There are 11 degrees of freedom in this projection matrix: it has 12 elements, but it is associated with an overall scale factor, so, considering the scale factor, there are 11 independent parameters.

So, 3 of them denote the rotation matrix, that is, the rotations about the 3 axes; then 3 denote the translation of the origin of the 3 dimensional coordinate system to the camera center; and 5 parameters are related to the calibration matrix, given by the elements α_x, α_y, s, p_x, p_y. In total there are 11 independent parameters. Out of them, the extrinsic parameters, which are related to the transformation of the world coordinate system to the camera-centric coordinate system, are given by the right-side part of the 3×4 matrix, and the intrinsic parameters are inside the camera calibration matrix; these 5 parameters are called intrinsic parameters. So, with this let me stop here and we will continue our discussion in the next lecture.

Thank you for listening.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 12
Camera Geometry Part - II

(Refer Slide Time: 00:21)

We will continue our discussion on Camera Geometry. In the previous lecture, we


discussed the form of a general projective camera. So, there is a projection matrix which
maps a 3 dimensional coordinate point to a 2 dimensional image point and in the
projective space, we can represent them as in this form.

P = [M | p_4] = M [I | M^{-1} p_4] = K R [I | −C̃] are the different kinds of representations of the projection matrix.

So, we can say that if there is a 3 dimensional point X and I multiply it with the projection matrix, then we get a 2 dimensional image point, x = PX; in the homogeneous coordinate system x is a 3-vector and X is a 4-vector. We can see that a projection matrix has different components; it has the calibration matrix, and there are different forms of representation of this projection matrix:

P = K R [I | −C̃] = K [R | t]

It can be shown as a combination of different types of matrices, as in the above representation: it is K times R, where R is the rotation matrix involved in the coordinate transformation from the world coordinates to the camera-centric coordinates; by rotating the coordinate axes they get aligned with the camera-centric coordinate system. Then the translation of the origin is also represented by this parameter, and in fact the centre of the camera is also shown here. So, these are the different kinds of information embedded in this particular matrix representation.

You should note a feature of the camera calibration matrix: it is an upper triangular matrix, and if I take the determinant of this camera calibration matrix it is the product of the two resolution factors α_x and α_y. These are expressed in terms of the number of pixels representing the focal length; they are involved with the resolutions along the x and y directions.

So, α_x and α_y are given by α_x = f m_x, where f is the focal length and m_x is the number of pixels per unit length along the horizontal direction, and α_y = f m_y, where m_y is the number of pixels per unit length along the vertical direction; this is what we discussed in the previous lecture. You should also note that there are different notations by which we refer to this projection matrix; a projection matrix is a 3×4 matrix.

So, M out of them, the first 3 3 sub matrix that left side of sub matrix with 3 3 sub
matrix is denoted by a symbol say M and the fourth column vector; that means, this is a
3 1 column vector that is also denoted by p4 .

~
P  [ M | p4 ]  M [ I | M 1 p4 ]  KR[ I | C ]

So, this representation is M and this separation just show that this is a sub matrix and this
is p4 . Another representation if I take M outside then I can consider this is an identity
matrix which is a kind of canonical form of representation of projection matrix if you
remember that is the identity part of the sub matrix and this is nothing, but the negation
of the centre; that means, the this M 1 p4 gives you the camera centre negation of the
camera centre.

209
~
So, we can also represent this element as  C and M itself can be decomposed into 2
3 3 matrix, it has a component calibration matrix, the other component is a rotation
matrix. So, these are different ways this projection matrices can be represented and it
could be understood all in those forms.

(Refer Slide Time: 04:49)

So, this is what I mentioned M is the product of K and R and p4 is the last column of the
projection matrix P.

(Refer Slide Time: 04:57)

Another thing we should note concerns the inverse of the calibration matrix. It is also interesting that this inverse is again an upper triangular matrix, and since the relationship between the image point and the corresponding camera-centric coordinates has the form discussed above, if I apply K^{-1} x, that is, if I multiply the image coordinate by the inverse of the calibration matrix, it provides the corresponding coordinates in the canonical form: the image coordinates in the configuration where the focal plane is at unit distance and its axes are parallel to the axes of the camera-centric coordinate system. So, this is the interpretation.

(Refer Slide Time: 05:51)

So, we will summarize here and look at some of the properties of the projective camera matrix. First, as I mentioned, the interpretation of the projection matrix is that if I multiply it with a 3 dimensional coordinate point, I get the 2 dimensional image point. The rank of this projection matrix is 3, its size is 3×4, and the number of independent parameters in the projection matrix is 11, so the degree of freedom is 11.

Out of these, the number of extrinsic parameters is 6, as we have discussed: 3 rotation parameters involved in forming the rotation matrix, that is, 3 rotations about the axes, and 3 parameters for the translation of the origin. The number of intrinsic parameters is 5; as we have already seen, the calibration matrix is an upper triangular matrix with 5 independent parameters. And when we expand this relationship in terms of the individual coordinates, we actually get two independent equations.

So, we will see how those equations can be written in the next slides. So, since there are
two independent equations; that means, if I give you certain point correspondences
suppose, the problem here is that I give you the scene points say X 1 and also its
corresponding you know image point small x1 . So, if it is given to you then I can apply
this equations, I will get two equations, but how many unknowns are there as I
mentioned in the projection matrix there are 11 unknowns. So, I should get at least 11
equations to solve for all the unknowns, but since 1 point correspondence gives me 2
equations.

So, I require at least 6 such point correspondences to form the equations, and from there I can estimate the projection matrix. This is one technique by which we can find the projection matrix: through experimentation we can always label some coordinate points in the world coordinate system and identify their image points in the images, which establishes the point correspondences; then, by knowing their coordinates, we can form these equations and derive the elements of the projection matrix.

(Refer Slide Time: 08:58)

So, this is how these equations are written, as I mentioned in the previous slide. Note here that a point in the image is given in the very general form (x_i, y_i, w_i)^T, denoting a column vector; we have used the scale factor w_i here. This means that (x_i, y_i) is not exactly the observed image coordinate, because for the observed coordinate the scale factor would be 1, but theoretically I can express an image coordinate using this scale factor in the projective space.

So, if I multiply a scene point with the projection matrix, I get a point in the 2 dimensional projective space, in this form. The problem here is that, because of the scale factor, a simple equality is difficult to write: we cannot simply equate PX_i and x_i component-wise, since they are equal only after a scale adjustment; their equivalence is established up to scale, hence PX_i ≅ x_i.

So, instead of that we can consider them as vectors and these vectors so, their directions
are to be same because the proportionate scaling does not change the direction of a
vector. So, if I consider since they are equivalent in that case the interpretation is that
those vectors they are parallel vectors they should have same directions. So, if I take the
cross product of these two vectors and then form the equation because we know that the
cross product of 2 parallel vector is a 0 vector.

So, we will take this cross product, and if we perform it we can derive the equations. Let me show how this representation is possible. Consider a representation of the projection matrix in terms of its row vectors instead of its column vectors, since there are 3 rows:

P = \begin{pmatrix} r_1^T \\ r_2^T \\ r_3^T \end{pmatrix}

So, the first row is represented by r_1, the second row by r_2, and the third row by r_3; I have used the transpose operation just to denote that in my presentation these are all column vectors, but since they are rows I have to apply the transpose to those column vectors.

PX_i = \begin{pmatrix} r_1^T X_i \\ r_2^T X_i \\ r_3^T X_i \end{pmatrix}

So, if I perform PX_i, the operation gives r_1^T X_i, then r_2^T X_i, and then r_3^T X_i; this is the column vector that we get from this operation. Note the dimension of each term: r_1 has dimension 4×1, because each row of P is a 4 dimensional vector, and the dimension of X_i is also 4×1, so r_1^T X_i gives a scalar value, r_2^T X_i gives a scalar value, and r_3^T X_i gives a scalar value.

So, finally, you will get a in this form you will get a 3 1 column vector right. And X i is

represented in this way  xi wi  and if I take the cross product of this two as we
T
yi

did earlier in our first few lectures to get the cross product, we can write it as r1T X i , this

is r2T X i and this is r3T X i and then x i this is small x i y, i and w i. So, if I expand in this

form. So, you will get the following vector


i j k
r1T X i r2T X i r3T X i
xi yi wi
 ( wi r2T X i  yi r3T X i )i  ( xi r3T X i  wi r1T X i ) j  ( yi r1T X i  xi r2T X i )k

r 2 transpose X I, w I into r 2 transpose X i minus y I into r 3 transpose X i this is i plus


let me write from this form.

214
So, X i into r 3 transpose X i minus w i into r 1 transpose X i that is j plus so, from
surpassing this part. So, will be getting y i r 1 transpose X i minus x i r 2 transpose
capital X i k. So, you will get this vector if I write this vector, we will get these vectors
as w 1 r 2 transpose X i minus y i r 3 transpose. So, I have to use the other notation, this
is not giving you the space, you can start again. So, w 1 r 2 transpose X i and then y i r 3
transpose X i so, this is the first component and other components and then that is equal
to 0 0 0. So,

 wi r2T X i  yi r3T X i  0


 T T
  
 xi r3 X i  wi r1 X i   0
 y r T X  x r T X  0 
 i1 i i 2 i   

Here the unknowns are the rows r_1, r_2, r_3, because those are the quantities to be estimated. Now, these equations can be written in matrix form. If you note the first component, you see it involves only r_2 and r_3; since it does not involve r_1, the block multiplying r_1 is a zero vector, a 1×4 zero row written as 0^T, while the block multiplying r_2 is −w_i X_i^T, and so on. Since in my representation the rows are themselves stored as column vectors, I express all of them with transpose operations.

\begin{pmatrix} 0^T & -w_i X_i^T & y_i X_i^T \\ w_i X_i^T & 0^T & -x_i X_i^T \\ -y_i X_i^T & x_i X_i^T & 0^T \end{pmatrix} \begin{pmatrix} r_1 \\ r_2 \\ r_3 \end{pmatrix} = 0

So, the first row reads −w_i X_i^T r_2 + y_i X_i^T r_3 = 0, which is (up to sign) the first component equation above; this is how the first equation is formed. Similarly, the second and third equations can be formed; I have shown only one equation for the sake of your understanding, and you can check on your own how the other equations are formed. Now, you will see that you actually get three equations, but only two of them are independent; one is redundant. How is that so? Let me show you.

(Refer Slide Time: 17:38)

So, if I multiply the first equation by x_i, multiply the second equation by y_i, and add them, you will get the third equation again, multiplied by w_i; you can check this. This means the third equation can be derived from the first two, so there are only two independent equations; the other equation can be derived from them. That is why I mentioned earlier that two independent equations are formed from a single point correspondence.

So, in this way you require six such point correspondences, from which you can form twelve equations and use them to estimate the parameters of the projection matrix. So, let me continue this operation further.

(Refer Slide Time: 18:47)

So, this is the summary. Finally, as you can see, I have eliminated the third row, just to show that only two equations are formed from a single point correspondence. If I have n point correspondences, each one gives me two equations, so I get 2n such rows. Regarding the dimensions: each of the three sub-blocks in a row is a 1×4 sub-matrix, so each row has 12 columns, matching the 12 elements of the projection matrix; the unknown vector, formed by stacking the rows r_1, r_2, r_3, is of dimension 12×1. For n point correspondences you therefore expect a coefficient matrix of dimension 2n×12, and the equations take this form.

We represent the corresponding matrix as A: if I stack all these rows, A is of dimension 2n×12, and it multiplies the elements of the projection matrix represented by the row vectors r_1, r_2, r_3 stacked as a 12×1 vector; each point correspondence gives two equations. Finally, the right-hand side is a zero column vector of dimension 2n×1. This is the interpretation of the matrix representation of the linear equations: there are actually 2n linear equations when you have n point correspondences.

(Refer Slide Time: 20:59)

So, as you understand, you have more equations than unknowns, so you have to apply a least-squares error technique, as we did earlier for the estimation of the homography; and in this case, too, the linear equations form a set of homogeneous equations.

The standard technique is to minimize the objective ‖Ap‖, where A is the matrix derived from the point correspondences and p is the solution, that is, the elements of the projection matrix with the row vectors concatenated one after another from the first row to the third row, giving a 12 dimensional vector. We have to minimize this norm, and since the solution lives in a projective space (any scaled vector is also a solution), we put the constraint that the norm of the vector should be equal to 1, ‖p‖ = 1. This is once again the problem formulation for estimating the projection matrix, and we have seen the same formulation for the estimation of the homography matrix.

We can use the same techniques, namely the direct linear transformation: either a least-squares estimation after setting one element of the vector to some fixed value and solving the resulting non-homogeneous equations, or finding the eigenvectors of A^T A and taking the eigenvector corresponding to the minimum eigenvalue. We have already discussed this in the previous class.
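A minimal NumPy sketch of this DLT estimation, assuming at least six 3D–2D correspondences are given (variable names and the absence of coordinate normalization are simplifications for illustration):

```python
import numpy as np

def estimate_projection_matrix(X_world, x_img):
    """DLT estimate of the 3x4 projection matrix P from n >= 6
    correspondences. X_world: (n, 4) homogeneous scene points,
    x_img: (n, 3) homogeneous image points (w usually 1)."""
    rows = []
    for X, (x, y, w) in zip(X_world, x_img):
        zero = np.zeros(4)
        # Two independent equations per correspondence.
        rows.append(np.hstack([zero, -w * X, y * X]))
        rows.append(np.hstack([w * X, zero, -x * X]))
    A = np.asarray(rows)                     # shape (2n, 12)
    _, _, Vt = np.linalg.svd(A)
    p = Vt[-1]                               # ||Ap|| minimized with ||p|| = 1
    return p.reshape(3, 4)                   # rows r1, r2, r3
```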

(Refer Slide Time: 23:14)

So, let me now consider, once you have this projective camera matrix, what properties it has and what information you can get by exploiting those properties; we will be discussing these things. As we have discussed, the projection matrix can be represented in different ways: as two sub-matrices, a 3×3 matrix denoted M and a 3×1 column vector denoted p_4; or as a set of four column vectors, first, second, third and fourth, each a 3×1 sub-matrix; or as a stack of rows, where each row is a 4×1 sub-matrix written transposed.

So, this is how we can represent a projection matrix with these notations, and using them we can relate its parameters to different types of geometric points. The first is the camera center; we have already established this relation. We know from projective geometry that the camera centre is a singular point: if I would like to take the image of the camera center, I will not be able to form an image, because I cannot form a ray connecting the point to itself.

This means that if I multiply the camera center with the projection matrix, I get a singularity, which is a 0 vector; that is the interpretation of PC = 0. For a finite camera M is non-singular, and for a camera at infinity you will find M is singular; we will return to this interpretation, noting that a camera centre at infinity means its scale factor is equal to 0.

Now, using the representation PC = 0, I can derive the camera center very easily through sub-matrix manipulation. Writing the projection matrix as [M | p_4] and the center as (C̃; 1), where C̃ is the camera center in the world coordinate system, the condition PC = 0 becomes

[M \mid p_4] \begin{pmatrix} \tilde{C} \\ 1 \end{pmatrix} = M\tilde{C} + p_4 = 0, \quad\text{so}\quad \tilde{C} = -M^{-1} p_4

So, if I have the projection matrix elements, I can easily estimate the camera center by exploiting this relation. For computing the center when M is singular, we have to use M alone to find its null vector d (with M d = 0), and the center is the point at infinity C = (d; 0); we will elaborate on this computation later.

There are also interesting relationships with its column vectors: p_1, p_2, p_3 are the vanishing points of the X, Y and Z axes, whereas p_4 is the image of the coordinate origin. What is a vanishing point? Suppose I consider a particular direction, or a particular ray; the point at infinity in that direction is also projected onto the image plane, and it can be shown that the images of all points lying along that direction converge to this point, which is called the vanishing point of that direction. So, what you need to do is simply represent that point at infinity: if the direction is given by the vector d, the point is represented as (d; 0).

Note that d is a 3×1 vector and the last coordinate is 0. So, if I multiply this vector with the projection matrix, we get the corresponding vanishing point of that direction. For example, the direction of the X-axis is given by d = (1, 0, 0)^T, and if I multiply P with the corresponding point at infinity, what you get is only p_1:

P \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix} = p_1

So, p_1 is the vanishing point of the X-axis; similarly, p_2 is the vanishing point of the Y-axis and p_3 is the vanishing point of the Z-axis. So, this can be summarized very easily.
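A quick check of these column properties in code; the matrix P below is an arbitrary illustrative example, not from the lecture:

```python
import numpy as np

P = np.array([[800., 0., 320., 100.],
              [0., 800., 240., 200.],
              [0.,   0.,   1.,   2.]])   # any 3x4 projection matrix

def vanishing_point(P, d):
    """Image of the point at infinity in direction d (a 3-vector)."""
    return P @ np.append(np.asarray(d, dtype=float), 0.0)

print(np.allclose(vanishing_point(P, [1, 0, 0]), P[:, 0]))   # True: p1
print(np.allclose(P @ np.array([0., 0., 0., 1.]), P[:, 3]))  # True: p4, image of origin
```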

(Refer Slide Time: 28:56)

So, this is what I just described:

p_1 = \begin{pmatrix} p_1 & p_2 & p_3 & p_4 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}

and similarly for p_2 and p_3. The other thing to note is that if I consider the point (0, 0, 0, 1)^T and multiply it with the projection matrix, I get p_4, the fourth column vector. What is the interpretation here, what is this point? Note that the first three coordinates, 0, 0, 0, are the origin of the world coordinate system, and the last coordinate is the scale factor in the standard homogeneous representation of a point. So, (0, 0, 0, 1)^T is nothing but the origin of the world coordinate system, and if I take the image of the origin of the world coordinate system, I get the column vector p_4; which means p_4 is the image of the coordinate origin. So, with this let me stop here for this lecture, and I will continue this topic in the next lecture.

Thank you very much.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture- 13
Camera Geometry Part- III

(Refer Slide Time: 00:20)

We will continue our discussion on single view Camera Geometry. In the last lecture we started discussing retrieving various information from the projective camera matrix by exploiting its properties. So, we have already considered how a projective matrix is represented. There are various forms by which we represent a projective matrix, and for the convenience of our discussion all these relations can be expressed in one of these forms.

So, the first form is $P = [M \mid p_4]$. You should note that a projective matrix, which is a $3\times 4$ matrix, is represented here by its sub-matrix components. In this representation it has a $3\times 3$ sub-matrix, which is denoted by the symbol capital $M$, and there is a column vector which is the fourth column. We use the notation $p_4$ just to show that it is the column vector at the fourth position, and its size is $3\times 1$.

$$P = [\,p_1\ p_2\ p_3\ p_4\,] = \begin{bmatrix} r_1^T \\ r_2^T \\ r_3^T \end{bmatrix}$$

We can also represent the projection matrix by its column vectors only. Since there are 4 column vectors, the first, second, third and fourth columns are denoted by the symbols $p_1, p_2, p_3, p_4$. Otherwise we can represent it as a stack of rows, where, since in our convention we represent a vector in column form, we use the transpose operation to denote that it is a row; so $r_1^T$, $r_2^T$ and $r_3^T$ are the three rows in this representation.

So, let us see how with these representations we can express various relations with different parameters of the camera geometry, or different information related to camera geometry; we will discuss those issues. First, the camera center, which can be obtained as we also discussed in the previous lecture. The camera center has the property of being a right null vector of the projection matrix, because there is a singularity at the camera center in the projection.

All the image points are formed as the intersection of a ray, connected to the center of the camera, with the image plane. But if the point itself is the camera center, you cannot form such a ray. So, that is the problem, and that is why there is no definition of the image of the camera center; this can be translated mathematically into the following form: $PC = 0$.

So, if I multiply the camera center coordinate, in the homogeneous coordinate system, with the projection matrix $P$ in this form, then I should get a 0 vector.

So, you understand this 0 is not a single scalar 0 value. It is a vector, because the dimensions have to match: $P$ is $3\times 4$ and $C$ is $4\times 1$, so the product is the $3\times 1$ zero vector $(0, 0, 0)^T$. So, that is a 0 vector. Now, if I solve this problem, how can we express the camera center? Suppose the center in the non-homogeneous coordinate system is denoted as $\tilde{C}$, as we used this convention in our earlier lectures also. If I use a scaling coordinate 1 in the additional dimension of the homogeneous coordinate system, then $\begin{bmatrix}\tilde{C}\\ 1\end{bmatrix}$ represents the camera center.

So, I can write this relation as

$$[M \mid p_4]\begin{bmatrix}\tilde{C}\\ 1\end{bmatrix} = 0.$$

This can be expanded if I perform the sub-matrix multiplication, and you should note that when you are doing sub-matrix multiplication, the first check should be the dimensionality matching between the corresponding factors. For example, $M$ is a sub-matrix of dimension $3\times 3$ and $\tilde{C}$ is a column vector of dimension $3\times 1$.

So, I can multiply $M\tilde{C}$, which gives me a column vector of size $3\times 1$, and then I multiply $p_4$ by 1; it is the same matrix multiplication rule I am applying. So, I will be multiplying $M$ with $\tilde{C}$ plus multiplying $p_4$ with 1. As you understand, $p_4$ has dimension $3\times 1$ and 1 is a scalar of dimension $1\times 1$, so this term is simply $p_4$, again a $3\times 1$ matrix, and the matrix multiplication is valid.

So, $M\tilde{C} + p_4$ should be equal to 0, which is also a $3\times 1$ column vector. Again I can reduce this equation: $M\tilde{C} = -p_4$ and then $\tilde{C} = -M^{-1}p_4$. So, you can see that you can get the camera center very easily by using this expression from the projective camera matrix. But note that $M$ should be invertible; only then can you use this expression. There are camera matrices where $M$ is a singular matrix, which is not invertible. In that case the camera center does not lie at any physical location; you cannot capture it in a physical dimension, but you can represent it mathematically as lying on the plane at infinity.

That is a concept we will discuss once again, but any point on the plane at infinity is represented in the form $C = \begin{bmatrix} d \\ 0 \end{bmatrix}$. You see that it is a point, once again in the homogeneous coordinate representation, where the value of the scale dimension is 0, so that it is a point at infinity. If I divide by 0 everything becomes infinity, but as I discussed earlier, the interpretation of any point in this form, which is at infinity, is that it is a direction. So, it is just showing a direction; it is the point lying in that direction at infinity.

So, this is the form of the camera center in that case, and this point can be computed once again by computing the right null vector of $M$: because $M$ is singular, there is again a null vector. In this case the center is of the form $\begin{bmatrix} d \\ 0 \end{bmatrix}$, so the contribution of $p_4$ vanishes (it is multiplied by 0), and the condition $PC = 0$ reduces to $Md = 0$. So, by finding the null vector of $M$ we can obtain this point; it is $Md = 0$ that we have to solve.

So, these are the things related to the camera center. Similarly, from the column vectors we can get some interesting information. One piece of information that we can get is the vanishing points of certain directions. As we discussed earlier, when we consider a point at infinity, you can choose any direction, and a point at infinity in that direction can be represented by simply using the corresponding direction cosines as a vector and setting the additional homogeneous scaling dimension to 0.

(Refer Slide Time: 10:09)

1 
So, if I consider for example, X axis so, X axis direction is given by say 0 and a point
0

which is at infinity along that direction should be expressed by adding another 0,


1 
0 
concatenating another 0 in the homogenous coordinate dimension (   ). So, if I would
0 
 
0 
like to get the image of these particular point so, I should multiply with the projection
matrix, then whatever value I would get in the homogenous coordinate system that is the
image point to the corresponding point at infinity.

So, we call this kind of points as vanishing point. So, if you move towards that direction,
all the images of points all lying on this particular (Refer Time: 11:02) line or will be
converging to that vanishing point. So, in this case since P can be represented in the form
of column vector so, n you can check that this operation will simply give me p1 . So, that
is why p1 is the vanishing point of X axis. Similarly, p2 is also vanishing point of Y axis
and p3 is vanishing point of Z axes. So, this is one relation I just have shown here.

The other interesting fact here, as you can see, is: what about $p_4$? You get $p_4$ when you consider the point $(0, 0, 0, 1)^T$; if you multiply it with the projection matrix $P$ then you will get $p_4$. So, what is this point? This is not a point at infinity, because its scale value is 1; it is a physical point in the space itself, and in the three-dimensional space it is simply the origin. So, this means $p_4$ is the image coordinate of the origin. So, this is the other property; this is about $p_4$.
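To make these relations concrete, here is a small numerical sketch (in Python with NumPy, assuming a finite camera whose $M$ is invertible; the matrix `P` below is only an illustrative placeholder, not one from the slides):

```python
import numpy as np

# Placeholder 3x4 projection matrix (illustrative values only).
P = np.array([[ 2.0, 0.0, -1.0,  3.0],
              [ 0.0, 3.0,  1.0, -2.0],
              [ 1.0, 1.0,  1.0,  1.0]])

M, p4 = P[:, :3], P[:, 3]          # P = [M | p4]

# Camera centre (finite camera): C~ = -M^{-1} p4, in world coordinates.
C = -np.linalg.solve(M, p4)

# Columns p1, p2, p3 are the vanishing points of the X, Y, Z axes,
# e.g. p1 = P @ [1, 0, 0, 0]^T; p4 is the image of the world origin.
vp_x = P @ np.array([1.0, 0.0, 0.0, 0.0])
origin_image = P @ np.array([0.0, 0.0, 0.0, 1.0])

# Convert a homogeneous image point to pixel coordinates.
to_pixel = lambda v: v[:2] / v[2]
print("camera centre:", C)
print("vanishing point of X axis:", to_pixel(vp_x))
print("image of world origin:", to_pixel(origin_image))
```

One can verify that `P @ np.append(C, 1.0)` is numerically the zero vector, which is exactly the condition $PC = 0$ discussed above.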

(Refer Slide Time: 12:16)

(Refer Slide Time: 12:20)

So, next we will be considering some of the geometric concepts related to this projection geometry, related to this imaging. We have defined these concepts earlier also: the principal plane, principal axis and principal point. In our notation, as you can see, this is a very simple representation of the projection in a camera-centric coordinate space, where p is the image of a point in the scene represented by P.

Externally, you have another coordinate system in which we observe the positions of these points, and we call it the world coordinate system. The relationship between the world coordinate system and the camera-centric coordinate system can be established by applying a simple rigid body transformation, where $R$ denotes the rotation of axes, represented by a $3\times 3$ rotation matrix, and $t$ denotes the translation of the origin to the camera center, represented by a column vector, just a 3-dimensional vector. This representation we have already discussed while deriving the projection matrix.

Now, we will be considering how to derive the relationships or information related to the principal plane, principal axis and principal point. So, let me redefine them once again. First, the principal axis: it is the Z axis of the camera-centric coordinate system, which is incidentally the optical axis of the lens in an optical camera. We have considered that as the Z axis in our coordinate convention; that is the canonical form.

So, in that way the Z axis denotes the principal axis. Next, the principal plane is the XY plane of the camera-centric coordinate system. The XY plane is a plane parallel to the image plane, and that plane contains the center of the camera. Geometrically, that is the definition independent of any coordinate convention: it is the plane which is parallel to the image plane and contains the center of projection. That is the principal plane; in our canonical camera-centric coordinate system it happens to be the XY plane. And the principal point is the point of intersection of the principal axis with the image plane.

So, these are a few definitions which we discussed earlier and which I am providing here again; now let us consider what information we can get related to the principal plane and the others. Once again we will be considering the representation of the projection matrix in this form, and here I have shown those relationships.

(Refer Slide Time: 15:36)

So, one of the facts shown here is, first, how we can compute the principal plane. In this case you can see that the principal plane is parallel to the image plane, so for any point in the principal plane you do not have a physical image point in the image plane, because the ray formed from that point to the camera center does not intersect the image plane; it intersects it at infinity, in that sense.

This means that the scaling dimension in the homogeneous coordinate representation of the image point should always be equal to 0. So, if I consider any point, say q, which lies on the principal plane, and multiply that point with the projection matrix, you will get an image point; you do not know what the first two components are, but surely in the third place, the scaling dimension, you will get 0.

So, I can write it simply; in this case let me represent the point by X instead of q. If I consider the multiplication with respect to the row vectors, we will get $r_3^T X = 0$. Now you can see that this is nothing but the equation of a plane, and that is the principal plane.

You should note that this is how you get the principal plane, but it is in the world coordinate system; that you should know. Similarly, if you consider a point whose image has, say, its first coordinate equal to 0 (and you do not care about the rest), this means $r_1^T X = 0$, and that is a plane: all the points lying on that plane are imaged in this form. So, what is this plane? That would be interesting to understand, because in this case the x coordinate of the image is always equal to 0, which means you are considering only the Y axis in the image plane.

In the image coordinate convention you have an X axis and a Y axis, whereas in your camera-centric coordinate system you have the center C; if I consider this three-dimensional configuration, say you consider the image plane with its Y axis and X axis, and then you have the camera-centric coordinate representation where C is the center. It is the plane formed by the image Y axis and the camera center; whatever points lie in that plane will be projected onto the Y axis. So, this is where the image X coordinate is equal to 0.

(Refer Slide Time: 19:56)

So, this plane is given by $r_1^T X = 0$. Similarly, if I consider the image X axis, the corresponding plane is given by $r_2^T X = 0$. These planes are called axis planes, as noted here: $r_1^T X = 0$ is imaged onto the Y axis of the image coordinates and is the plane containing the camera center and the Y axis; similarly, $r_2^T X = 0$ is the plane defined by the camera center and the X axis of the image.

What about the principal point? Now, the third row itself is showing you the principal plane. As we have mentioned, the third row $r_3$ is represented as a column vector $r_3^T$; let us write it as a $3\times 1$ vector $m_{r3}$ (its first three elements, that is, the third row of $M$) together with a scale value.

So, this $m_{r3}$ is a $3\times 1$ column vector. As you understand from the equation of a plane: if I write the equation of a plane in the form $ax + by + cz + d = 0$, the direction of the normal of that plane is given by the vector $a : b : c$. So, $m_{r3}$ is actually giving you this direction, the normal of the principal plane.

So, what is the principal point? The principal point is the image of the point which lies at infinity in that direction, and as I mentioned, any point lying at infinity in a direction is represented with the homogeneous scale 0. In this case $m_{r3}$ is that direction, the point lying at infinity in that direction has scale 0, and the image of that point is the principal point. It is nothing but the intersection of the normal of the principal plane with the image plane. So, if I take the image of this point, it is represented as $x_0 = M\, m_{r3}$; that is what is shown here, and then you get the corresponding principal point. So, this is how these three important pieces of information can be obtained from the camera matrix.

(Refer Slide Time: 22:58)

We have discussed this principal point, and here it is explained once again how this computation is elaborated with respect to the projection matrix. The vector $m_{r3}$ is written out in terms of the elements of the projection matrix in this form:

$$x_0 = P\begin{bmatrix} p_{31} \\ p_{32} \\ p_{33} \\ 0 \end{bmatrix} = M\, m_{r3}$$

You note that $p_{31}, p_{32}, p_{33}$ are the elements of the third row, and you need to multiply this vector with $M$; then you will get the corresponding principal point.

Similarly, the principal ray is nothing but the direction given by $m_{r3}$; it should be proportional to $p_{31}, p_{32}, p_{33}$, which I can write as $p_{31} : p_{32} : p_{33}$. So, this vector, this direction, is providing you the corresponding direction of the principal ray; now, in the projective sense, it could be along the camera-centric Z axis or it could be in the opposite direction also.

So, there we exploit a property of the transformation matrix: if you take the determinant of the matrix $M$ and consider its sign, that sign shows the actual direction of this ray. The principal ray should therefore be represented as the sign of the determinant times $m_{r3}$; instead of the sign, since the vector can be scaled, I can simply multiply the vector by the determinant, that is, take $\det(M)\, m_{r3}$; that is also possible.
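As a quick sketch of these relations (again Python/NumPy with an illustrative placeholder `P`, not a matrix from the slides):

```python
import numpy as np

P = np.array([[ 2.0, 0.0, -1.0,  3.0],
              [ 0.0, 3.0,  1.0, -2.0],
              [ 1.0, 1.0,  1.0,  1.0]])
M = P[:, :3]

principal_plane = P[2, :]        # r3^T X = 0  (plane coefficients in the world frame)
axis_plane_y    = P[0, :]        # r1^T X = 0  (plane of camera centre and image Y axis)
axis_plane_x    = P[1, :]        # r2^T X = 0  (plane of camera centre and image X axis)

m_r3 = M[2, :]                   # first three entries of the third row of P
x0 = M @ m_r3                    # principal point x0 = M m_r3, homogeneous image coords
principal_point = x0[:2] / x0[2]

principal_ray = np.linalg.det(M) * m_r3   # direction of the principal axis in world frame

print(principal_plane, principal_point, principal_ray)
```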

(Refer Slide Time: 24:56)

So, let me then discuss another interesting fact: how the points which are there in the image plane are back-projected to form a ray. First let us understand the mapping of vanishing points on the plane at infinity. This we have discussed earlier also, with respect to the vanishing points of the principal axes, like the X axis, Y axis and Z axis.

Here, if I consider any particular direction $d$, we can simply multiply the projection matrix with the vector $\begin{bmatrix} d \\ 0\end{bmatrix}$, and that gives

$$x = [M \mid p_4]\begin{bmatrix} d \\ 0 \end{bmatrix} = Md,$$

simply the multiplication of the matrix $M$ with $d$. That is the vanishing point of the direction, that is, the image of the point at infinity lying in the direction $d$. You should note that the vanishing point is only affected by $M$.

The back projection is the ray formed by the camera center and the image point, and you can express that ray in the world coordinate system. So, consider the point $\begin{bmatrix} M^{-1}x \\ 0 \end{bmatrix}$, which is a point at infinity lying in the direction $M^{-1}x$; the 0 shows it is at infinity, and its image is $x$, which means this direction is actually the direction of the projection ray. So, finally, we can form a ray which passes through the center $\tilde{C}$, represented in the homogeneous coordinate system, and the point at infinity $\begin{bmatrix} M^{-1}x \\ 0 \end{bmatrix}$. This can be easily expressed in the parametric form of the equation of a straight line in three-dimensional space, where $M^{-1}x$ is the direction and a point on the line is given by $\tilde{C}$.

So, the parametric form is a linear combination of the point at infinity $D$ in the direction $M^{-1}x$ and the camera center $C$, and the following is the expanded form:

$$X(\lambda) = \lambda D + C = \lambda\begin{bmatrix} M^{-1}x \\ 0 \end{bmatrix} + \begin{bmatrix} \tilde{C} \\ 1 \end{bmatrix} = \lambda\begin{bmatrix} M^{-1}x \\ 0 \end{bmatrix} + \begin{bmatrix} -M^{-1}p_4 \\ 1 \end{bmatrix} = \begin{bmatrix} M^{-1}(\lambda x - p_4) \\ 1 \end{bmatrix}$$

In fact, the last form, $\begin{bmatrix} M^{-1}(\lambda x - p_4) \\ 1 \end{bmatrix}$, is the more convenient one. These are the points lying on the projection ray connecting to the camera center.

(Refer Slide Time: 28:00)

How do you compute the depth of points? Once again, the depth of a point is nothing but the projection, onto the principal axis, of the vector connecting the camera center to the scene point. So, you consider any particular scene point, say $\tilde{X}$, in three-dimensional space; its projection on the principal axis, which is the Z axis of the canonical camera-centric system, gives the Z distance. This distance is the depth information, and it can be easily expressed in this form:

$$\text{Depth} = \hat{m}_{r3}^T(\tilde{X} - \tilde{C})$$

So, $m_{r3}$ is the corresponding direction, and the hat on top of $m_{r3}$ denotes that it is the unit vector along that direction. Taking the projection means you have to take the dot product with the unit vector of that direction, and the vector being projected is $\tilde{X} - \tilde{C}$; then you get the corresponding depth information.
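A compact sketch tying the back-projection and depth formulas together (Python/NumPy; `P`, the image point and the scene point are illustrative placeholders):

```python
import numpy as np

P = np.array([[ 2.0, 0.0, -1.0,  3.0],
              [ 0.0, 3.0,  1.0, -2.0],
              [ 1.0, 1.0,  1.0,  1.0]])
M, p4 = P[:, :3], P[:, 3]
C = -np.linalg.solve(M, p4)                # camera centre (world coordinates)

# Back-project an image point x (homogeneous) to the ray X(lam) = [M^{-1}(lam*x - p4); 1].
x = np.array([2.0, 7.0, 1.0])
ray = lambda lam: np.linalg.solve(M, lam * x - p4)
print(ray(0.0), ray(1.0))                  # lam = 0 recovers the camera centre itself

# Depth of a scene point: projection of (X~ - C~) on the unit principal-axis direction,
# oriented forward using the sign of det(M) as discussed above.
m_r3 = M[2, :]
unit_axis = np.sign(np.linalg.det(M)) * m_r3 / np.linalg.norm(m_r3)
X_scene = np.array([1.0, 2.0, 5.0])
print("depth:", unit_axis @ (X_scene - C))
```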

(Refer Slide Time: 29:13)

There is another, easier way to obtain the camera center, by solving the set of equations $M\tilde{C} = -p_4$ expressed in this form. As we discussed earlier, there are in fact three equations, because the camera center is represented by three coordinates, so you need to find three coordinates. This can be easily done using Cramer's rule and the representation of $M$ in terms of its column vectors $p_1, p_2, p_3$.

So, let me show you the solution. $X_c$, the X coordinate of the camera center, is given by Cramer's rule as a ratio of determinants formed from the column vectors:

$$X_c = \frac{\det[\,-p_4\ \ p_2\ \ p_3\,]}{\det[\,p_1\ \ p_2\ \ p_3\,]}, \qquad Y_c = \frac{\det[\,p_1\ \ -p_4\ \ p_3\,]}{\det[\,p_1\ \ p_2\ \ p_3\,]}, \qquad Z_c = \frac{\det[\,p_1\ \ p_2\ \ -p_4\,]}{\det[\,p_1\ \ p_2\ \ p_3\,]}$$

So, you simply apply Cramer's rule and then you get this expression.
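One can check numerically that this agrees with $\tilde{C} = -M^{-1}p_4$ (continuing the earlier placeholder `P`):

```python
import numpy as np

P = np.array([[ 2.0, 0.0, -1.0,  3.0],
              [ 0.0, 3.0,  1.0, -2.0],
              [ 1.0, 1.0,  1.0,  1.0]])
p1, p2, p3, p4 = P.T                        # the four columns of P
den = np.linalg.det(np.column_stack([p1, p2, p3]))

# Cramer's rule for M C~ = -p4: replace one column of M at a time by -p4.
Xc = np.linalg.det(np.column_stack([-p4, p2, p3])) / den
Yc = np.linalg.det(np.column_stack([p1, -p4, p3])) / den
Zc = np.linalg.det(np.column_stack([p1, p2, -p4])) / den

assert np.allclose([Xc, Yc, Zc], -np.linalg.solve(P[:, :3], p4))
```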

(Refer Slide Time: 30:15)

So, next we will be discussing how you can get the camera parameters from the projection matrix P. Here we are again representing the projection matrix in different forms: you have $P = [M \mid p_4]$, or $[M \mid -M\tilde{C}]$, where $-M\tilde{C}$ is also representing $p_4$; or you consider a decomposed form, $P = K[R \mid -R\tilde{C}]$, where you have the camera calibration matrix $K$ and the rotation matrix $R$, so that $KR = M$ and the last column is $-KR\tilde{C}$. All these representations have been discussed in previous lectures.

So, one of the easiest ways to get the camera parameters is by using matrix decomposition, because you know that $M$ is formed by the multiplication of the two matrices $K$ and $R$. $K$ is the camera calibration matrix and $R$ is the rotation matrix of the transformation from the world coordinate system to the camera coordinate system; the property of $K$ is that it is an upper triangular matrix, and $R$ is an orthogonal matrix.

So, if I apply the RQ decomposition, where R is an upper triangular matrix and Q is an orthogonal matrix, then I will get two matrices which satisfy this property. Now, this solution may not be unique in that sense, but still we consider that this is one possible solution through matrix decomposition.

After the decomposition, the R factor is equated with $K$, since it is upper triangular, and the Q factor is equated with the rotation matrix $R$ of the camera; note the slight notational ambiguity here, but this is the standard terminology of the decomposition. So, this is the form we are using. From there you can get the parameters of the calibration matrix from the upper triangular matrix; similarly, from the rotation matrix you can get the rotation parameters, and by computing the camera center as we discussed earlier, you can get the corresponding translation parameters also.
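A minimal sketch of this decomposition (Python with NumPy/SciPy, assuming `scipy.linalg.rq` is available; `P` is again only a placeholder, and the sign normalisation shown is one common convention, not the only one):

```python
import numpy as np
from scipy.linalg import rq

P = np.array([[ 2.0, 0.0, -1.0,  3.0],
              [ 0.0, 3.0,  1.0, -2.0],
              [ 1.0, 1.0,  1.0,  1.0]])
M, p4 = P[:, :3], P[:, 3]

K, R = rq(M)                       # M = K R, with K upper triangular and R orthogonal

# Force positive diagonal entries in K (absorb the signs into R; T is its own inverse).
T = np.diag(np.sign(np.diag(K)))
K, R = K @ T, T @ R

K = K / K[2, 2]                    # conventional normalisation K[2,2] = 1
C = -np.linalg.solve(M, p4)        # camera centre, so the translation is t = -R C~
t = -R @ C

print("K =\n", K, "\nR =\n", R, "\nt =", t)
```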

So, this is how you get all the parameters of the camera matrix. So, here we will stop our
lecture at this point and we will continue this discussion in the next lecture.

Thank you very much for your listening.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 14
Camera Geometry Part - IV

We are discussing single view Camera Geometry, and we have discussed how different information can be obtained from a projection matrix. Now, let us discuss some exercises for your practice; solving some of the problems related to this will give you a better insight.

(Refer Slide Time: 00:44)

So, consider this problem: you have a projection matrix which is given in this form, and you need to compute the following: its camera center, the vanishing point of the X-axis, the image point of the origin, and the vanishing point of the line with direction cosines $2 : 3 : 4$.

Now, here I would suggest that you pause the video lecture, solve this problem, and then resume. I will discuss the solution in the next slide.

(Refer Slide Time: 01:18)

So, as I have shown, this is the projection matrix you have considered: a $3\times 3$ sub-matrix $M$ is defined in this form, and then there is the other column vector, the $p_4$ column. We can compute the camera center from the expression discussed earlier: we have to invert the matrix $M$ and then compute $-M^{-1}p_4$ to get the camera center directly in the world coordinate system.

In this case, to carry out the inversion, the steps are that you compute the cofactor matrix, you compute the determinant of the matrix, and then the inverse is obtained by taking the transpose of the cofactor matrix and dividing it by the determinant. Finally, if you perform the computation $-M^{-1}p_4$, you will get $\tilde{C}$ in this form.

You should note that this result is already in the world coordinate system itself; you need not, and should not, treat it as a homogeneous coordinate and convert it, otherwise it will be wrong. It directly gives you the result in the three-dimensional coordinate system, as we understand.
(Refer Slide Time: 02:37)

Let us consider how to compute the vanishing point of the X-axis. As I discussed earlier, for the vanishing point of the X-axis we have to multiply the projection matrix with the direction along the X-axis together with the additional scale value 0, because that represents a point at infinity along the X-axis. So, you have to multiply the projection matrix with this vector; I am showing it for brevity in row form, it is the column vector $(1, 0, 0, 0)^T$. Then you will get $p_1$, and in particular you will get $(-9, 3, 2)^T$.

So, this is the column vector $(-9, 3, 2)^T$, and you should note that this is in the homogeneous coordinate system of the image. If I would like to get it in our two-dimensional, non-homogeneous coordinate system, I have to express the x coordinate as $-9/2$ and the y coordinate as $3/2$. So, in image coordinates, in our understanding of the normal two-dimensional coordinate system, this point is at $(-9/2,\ 3/2)$.

Similarly, the image point of the origin is the 4th column vector, so you will get $(1, 1, 1)^T$; if I adjust the scale, it is the coordinate $(1, 1)$. Then, for the vanishing point of the line with direction cosines $2 : 3 : 4$: once again, the point at infinity along this direction is expressed by the vector $(2, 3, 4, 0)^T$, and if I multiply it with the projection matrix $P$, you will get the vector $(0, 3, -18)^T$. Once again this is in the homogeneous coordinate system, which means it is equivalent to the coordinate $(0, -3/18)$ in our two-dimensional real space.

(Refer Slide Time: 04:50)

So, let us consider another example; here also you are given a projection matrix in this form, and you are asked to do this computation: if you have an image point (2, 7) in $R^2$, that is, in the normal two-dimensional coordinate convention, you have to compute its corresponding scene point, given that the point is at a distance of 40 units from the center of the camera. Once again I would request you to pause the video at this point, solve this problem, and then resume it. I am going to discuss the solution in the next slide.

(Refer Slide Time: 05:33)

So, here I have shown some of the structure of this solution. As you can see, the projection matrix is represented by the sub-matrices $M$ and $p_4$; $M$ is the $3\times 3$ sub-matrix and $p_4$ is the column vector. First you need to compute the camera center, because you have to compute the corresponding projection ray. What do we require? We require the camera center $\tilde{C}$ and we require the direction cosines $(l, m, n)$. $\tilde{C}$ is given by the relation $\tilde{C} = -M^{-1}p_4$, so you have to compute $M^{-1}$ and then $-M^{-1}p_4$ will give you the corresponding $\tilde{C}$, again in the three-dimensional coordinate space. The direction, on the other hand, is given by $M^{-1}(2,\ 7,\ 1)^T$.

So, (2, 7) was the image point, and in the homogeneous coordinate system it is (2, 7, 1). If I perform these computations, I will get this particular direction $(l, m, n)$. Then any point on this ray which lies at a distance $\lambda$ from the camera center can be expressed as

$$X(\lambda) = \tilde{C} + \frac{\lambda}{\sqrt{l^2 + m^2 + n^2}}\begin{bmatrix} l \\ m \\ n \end{bmatrix}$$

You see that the second term uses nothing but the unit vector along the direction $(l, m, n)$, so we can always express the projection ray in this parametric form of the equation of a straight line. That is how you get the corresponding projection ray; you can put $\lambda = 40$ and that gives you the required scene point. So, this is the solution of this problem.
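The same steps can be written out directly (Python/NumPy; since the transcript does not reproduce the slide's projection matrix, `P` below is only a placeholder to illustrate the procedure):

```python
import numpy as np

P = np.array([[ 2.0, 0.0, -1.0,  3.0],
              [ 0.0, 3.0,  1.0, -2.0],
              [ 1.0, 1.0,  1.0,  1.0]])
M, p4 = P[:, :3], P[:, 3]

C = -np.linalg.solve(M, p4)              # camera centre C~
d = np.linalg.solve(M, [2.0, 7.0, 1.0])  # ray direction (l, m, n) = M^{-1} (2, 7, 1)^T

X = C + 40.0 * d / np.linalg.norm(d)     # scene point at distance 40 along the ray
print(X)
```

Note that this picks the point on one side of the centre; the point at distance 40 in the opposite direction, $\tilde{C} - 40\,\hat{d}$, also satisfies the distance condition, so in practice the physically meaningful solution (in front of the camera) is chosen.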

(Refer Slide Time: 07:23)

We continue the same problem. Here you are asked to compute the principal plane of the imaging system with the given projection matrix. So, how do you get the principal plane? As we know, the principal plane is given by the third row, that is, by $r_3^T X = 0$.

So, this solution is basically an explanation once again: the image point of a point in the principal plane has its scale coordinate equal to 0, so it should satisfy $r_3^T X = 0$. The last row of P is given by $(1, -5, 8, 1)$, and if I write it in our normal convention of a planar equation, it is $x - 5y + 8z + 1 = 0$. So, this is how you get the principal plane.

(Refer Slide Time: 08:42)

So, now we will continue our discussion on this topic. Let us consider another scenario, where the camera center is at infinity. In this case, as we have discussed earlier, $M$ would be singular; that is the first thing, because in that case the null vector of $M$ gives you the center, and the fourth dimension of that center, the scale factor, would be 0. Now, there are two situations: it could be an affine camera or a non-affine camera. We are not interested in non-affine cameras for this kind of situation; we will consider only the affine camera.

One of the simple characteristics of an affine camera is that its last row is of the form [0 0 0 1]; you could use any other non-zero value instead of 1, but canonically you can put it as [0 0 0 1], and then every other element gets fixed with respect to that scale of 1. So, the first property of the affine camera is that its principal plane is the plane at infinity; that means all the points which lie at infinity, along all directions, lie in the principal plane.
And the camera center also lies on the principal plane, naturally. Further, points at infinity are mapped to points at infinity. If you consider any point at infinity and take its image, the so-called vanishing point, that image is also at infinity, that is, its scale value is also 0. As you understand, the intersection of parallel lines, in our Euclidean geometric sense, can be explained as an intersection at a point at infinity, which is represented in the homogeneous coordinate system with scale value 0.

So, the lines after the transformation, which means after getting imaged, still remain parallel, because the intersection of the images of those lines is also at infinity. That is one of the major properties of this camera, and it explains why a point at infinity still maps to a point at infinity.

(Refer Slide Time: 11:28)

So, for the affine projection, one simplification of the camera relationship can be done in this way. In an affine camera, this is the world coordinate point in the non-homogeneous coordinate system, which means it is simply a three-dimensional point. You are multiplying this three-dimensional point by a matrix of size $2\times 3$, then you get a two-dimensional vector, and to this you add a parameter which we call the translation parameter of this affine transformation; then you get the corresponding image point. So, this is the image point of $\tilde{X}$.

That is also in the non-homogeneous coordinate system, because this will give you a $2\times 1$ vector. This is the actual image coordinate you get; you do not have to do any scale adjustment. So, this is a very simple relationship in affine geometry, where you can express the relationship entirely in Euclidean space: from the three-dimensional Euclidean space of the scene to the two-dimensional Euclidean space of the images.

Now, how do you get this kind of structure? If I consider the canonical representation of the affine projection matrix, you can see that the last row of this projection matrix is the vector [0 0 0 1]. So, if I do the matrix multiplication and consider the corresponding sub-matrix multiplication, it equivalently comes to this point. This simplifies the relation between the image point and the world coordinate point, and it also simplifies the computation of finding this affine matrix.

So, let us see how you can do it. You can express this equation in the following form:

$$\tilde{x} = M_{2\times 3}\tilde{X} + t$$

$M_{2\times 3}$ is the sub-matrix represented above, $\tilde{X}$ represents the scene point, and $t$ is added. So, for the affine projection, how many independent parameters are there? As we can see, $t$ is a $2\times 1$ vector and $M_{2\times 3}$ is $2\times 3$, so there are 8 parameters, that is, 8 independent parameters or 8 degrees of freedom.

So, this is the conclusion, and this is the advantage you have: you require fewer points to estimate this projection matrix, because you have only 8 independent parameters. It requires 4 point correspondences; each point correspondence gives you two equations, one for the x coordinate and another for the y coordinate. So, if I have 4 point correspondences I get 8 equations and I can solve this problem; that is the minimum requirement. If you have more point correspondences then you can perform a least squares estimate, which we will discuss in the next slide.

(Refer Slide Time: 14:58)

So, this is how we will be discussing it. The first thing is that in the affine camera the center lies at infinity, and it is the direction of parallel rays. The geometric interpretation is that in an affine camera the imaging takes place using parallel rays, instead of considering a particular center to which all the rays connect, which is the perspective projection geometry. In affine geometry you consider rays parallel to one particular direction; let me explain.

Suppose this is your image plane and this is your scene point X, and say the rule is that this is the direction of the vector d. Then the image of X is obtained by drawing a line through X parallel to this direction, and where it intersects the image plane becomes the image of this point. So, this is the interpretation of imaging for the affine camera, which is a parallel projection in terms of imaging geometry.
(Refer Slide Time: 16:10)

So, this is the relationship between M and d: you can get d by exploiting the equation $M_{2\times 3}\, d = 0$. The interpretation of t is that it is the image of the world origin, and the principal plane of the affine projection matrix is the plane at infinity, which you can see from the last row [0 0 0 1]. So, that is the form.

So, in $P_A$ the last row is [0 0 0 1]. This denotes the principal plane, and that is the plane at infinity; that is the interpretation in the projective space. And $M_{2\times 3}$ should be of rank 2, to ensure that $P_A$ is of rank 3. So, these are certain interpretations.

(Refer Slide Time: 17:11)

So, let us consider the estimation of an affine camera when you have more points; you require a minimum of 4 point correspondences, but if you have more point correspondences then your estimation will be more robust. In the specification shown here, capital $X_i$ and $x_i$ are the point correspondences: $X_i$ is the scene point and the corresponding image point is $x_i$, shown in bold font; in the non-bold font their coordinates are expressed. From these you form the equations.

So, once again I will be considering the concatenation of the row vectors as the parameters of my projection matrix. This can be represented by two equations per correspondence: $X_i^T r_1$ gives the $x_i$ coordinate and $X_i^T r_2$ gives the $y_i$ coordinate. Note that $X_i$ should appear in transposed form here, because they act as row vectors; then you can form these equations.

So, for n points, I can consider the matrix built from the blocks $\begin{bmatrix} X_i^T & 0^T \\ 0^T & X_i^T \end{bmatrix}$ as a matrix A. If there are n points, each one gives me two equations, so there will be 2n equations. And the dimensions, as you can see: each block is $1\times 4$, and two of them side by side make 8, so A is $2n\times 8$; the unknown vector is $8\times 1$; and the right-hand side is $2n\times 1$, formed by the respective image coordinates. This is the set of equations which you need to solve, and it is a non-homogeneous set of equations, because you do not expect every coordinate to be 0; this right-hand-side vector is not going to be 0 in your experiments or observations.

$$\begin{bmatrix} r_1 \\ r_2 \end{bmatrix} = [A^TA]^{-1}A^Tb$$

So, you can solve it by using the standard least squares error method for a non-homogeneous set of equations, and the solution is given in this form; as discussed earlier for this kind of problem, I can use $[A^TA]^{-1}A^Tb$, which gives me the solution.
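A small sketch of this estimation (Python/NumPy, with randomly generated synthetic correspondences standing in for real measurements):

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth affine camera: the first two rows of P_A (each 1x4).
r_true = rng.normal(size=(2, 4))

# Synthetic correspondences: n scene points and their affine images.
n = 10
X = np.hstack([rng.normal(size=(n, 3)), np.ones((n, 1))])   # homogeneous scene points
x = X @ r_true.T                                             # image points (n x 2)

# Build the 2n x 8 design matrix A and the 2n right-hand side b.
A = np.zeros((2 * n, 8))
b = np.zeros(2 * n)
A[0::2, :4] = X
A[1::2, 4:] = X
b[0::2] = x[:, 0]
b[1::2] = x[:, 1]

# Least squares solution [r1; r2] = (A^T A)^{-1} A^T b  (via lstsq for numerical stability).
r_est, *_ = np.linalg.lstsq(A, b, rcond=None)
print(r_est.reshape(2, 4))      # should match r_true up to numerical error
```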

(Refer Slide Time: 20:04)

So, we will discuss how the points lying on a plane form images under a projective camera. Without loss of generality, let us consider that the plane is the XY plane, where Z equals 0; you can always make a coordinate transformation to turn any plane into the XY plane and apply this principle, which is why I said it is without loss of generality.

Say a scene point is given as Q, and this is the camera configuration, where C is the center of projection and there is an image plane. A ray is formed between C and Q, and the intersection of that ray with the image plane, shown by the point q, is the image point. So, this is how the imaging takes place.

So, now I can express the image coordinate q in this way:

$$q = PQ = [\,p_1\ p_2\ p_3\ p_4\,]\begin{bmatrix} X \\ Y \\ 0 \\ 1\end{bmatrix} = [\,p_1\ p_2\ p_4\,]\begin{bmatrix} X \\ Y \\ 1\end{bmatrix}$$

You can see here that any point in the XY plane can be written with coordinates $(X, Y, 0, 1)^T$. So, finally, the multiplication of the projection matrix with the three-dimensional point in the homogeneous coordinate system reduces to a form where you require only the two coordinates X and Y, the point in the coordinate convention of the plane, multiplied by a $3\times 3$ matrix.

We know this form: it is nothing but a $3\times 3$ matrix transformation of a homogeneous point in a two-dimensional projective space to another two-dimensional projective space, which is a homography. So, if you take the imaging of a plane, it establishes a homography between the image points and the scene points; that is the crux of this discussion. In other words, the perspective projection of a scene plane is a projective transformation.
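As a small sketch (Python/NumPy; `P` is again an illustrative placeholder), the homography induced by the world plane Z = 0 is obtained by simply dropping the third column of P:

```python
import numpy as np

P = np.array([[ 2.0, 0.0, -1.0,  3.0],
              [ 0.0, 3.0,  1.0, -2.0],
              [ 1.0, 1.0,  1.0,  1.0]])

H = P[:, [0, 1, 3]]                      # columns p1, p2, p4: maps plane (X, Y, 1) -> image

Q_plane = np.array([4.0, -1.0, 1.0])     # point (X, Y) = (4, -1) on the plane Z = 0
q = H @ Q_plane
print(q[:2] / q[2])                      # pixel coordinates of its image
```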

(Refer Slide Time: 22:27)

We will discuss the imaging of a three-dimensional straight line and try to see how the corresponding projected line on the image plane is related to the three-dimensional configuration. So, consider a line L, a three-dimensional line as shown in the diagram, and you have an image plane and the camera center at C. If you would like to project this line, I can consider any two points on the line, find their image points, and connect those image points.

You will get another line, the image of the original line, which is also a straight line, because the property of projective transformation applies in this case. Now, let this line be denoted by l in my figure; as you understand, this line is represented in a two-dimensional projective space. So, what kind of three-dimensional information can we recover if we know the projection matrix? That is what I would like to discuss here.

So, consider that there is a point X on the three-dimensional straight line, and its corresponding image point, shown here by drawing the projected ray, is x. The relationship between x and l can be found from the point containment relation. You can also see that the configuration forms a plane, a three-dimensional plane which contains the camera center and the three-dimensional straight line.

Note that the image of that straight line also lies on that plane. So, as I was mentioning, the point containment relation of the image point x can be expressed as $x^Tl = 0$. If I express x in terms of the imaging of the three-dimensional point, that is, multiply the three-dimensional point in its homogeneous representation by the camera matrix P, I get $PX$ equal to the image point x.

So, $(PX)^Tl = 0$, and using the matrix property I can convert this expression into $X^TP^Tl = 0$. Note that this relationship is again a relationship of point containment in a plane, where the plane is given by $P^Tl$. So, given the projection matrix and given a straight line in the image, we can find the expression of the plane on which the three-dimensional straight line, the camera center and the image line all lie.

(Refer Slide Time: 26:03)

The third exercise: here you have this projection matrix, and consider 4 image points $x_1, x_2, x_3$ and $x_4$ given in this form; denote the camera center as a point O (it is not the origin, it is just a point O). You have to compute the dihedral angle between the planes $Ox_1x_2$ and $Ox_3x_4$. How is a dihedral angle defined? It is the angle between two planes, that is, the angle between their normals.

So, this is the problem you need to solve. Just to show you diagrammatically what I have asked: consider an image plane with points $x_1, x_2$ and $x_3, x_4$, and say this is the camera center O. So, you can form a plane, since 3 points define a plane. Similarly, you can form another plane. Now, what is the angle between these two planes? That is the problem you need to solve.

(Refer Slide Time: 27:37)

So, this is the solution we will get; once again, as a summary, those are the points shown, and the projection matrix is also shown here. First you have to form the first dihedral plane. You can see that by performing the cross product of $x_1$ and $x_2$ you get the image line joining $x_1$ and $x_2$, and we know how the three-dimensional plane can be obtained from a line in the image plane: it is P transpose times the line, in its image representation itself, that is, $P^T(x_1\times x_2)$.

If I perform this operation, I get the first plane as given here, a 4-vector whose first three components (35, 86 and 142 in magnitude) give the plane normal and whose last component is 0. Similarly, the second plane, formed by the image points $x_3$, $x_4$ and the camera center O, is obtained in the same way. Now these are the two planes. You take their corresponding unit normal vectors, computed in this form for the first plane and for the second plane; then, to compute the angle between the two, you take the dot product of the unit normals and apply the cosine rule. The inverse cosine of this gives you the angle; that is the answer, and it comes out to about 53.752 degrees in this case.
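A sketch of this computation (Python/NumPy; the projection matrix and the four image points are illustrative placeholders, not the values from the slide):

```python
import numpy as np

P = np.array([[ 2.0, 0.0, -1.0,  3.0],
              [ 0.0, 3.0,  1.0, -2.0],
              [ 1.0, 1.0,  1.0,  1.0]])

# Image points in homogeneous coordinates (placeholder values).
x1, x2 = np.array([1.0, 2.0, 1.0]), np.array([3.0, 1.0, 1.0])
x3, x4 = np.array([0.0, 4.0, 1.0]), np.array([2.0, 5.0, 1.0])

def back_projected_plane(P, xa, xb):
    """Plane through the camera centre and the image line joining xa, xb: pi = P^T (xa x xb)."""
    return P.T @ np.cross(xa, xb)

pi1 = back_projected_plane(P, x1, x2)
pi2 = back_projected_plane(P, x3, x4)

# Dihedral angle = angle between the plane normals (first three components).
n1, n2 = pi1[:3], pi2[:3]
cos_angle = abs(n1 @ n2) / (np.linalg.norm(n1) * np.linalg.norm(n2))
print(np.degrees(np.arccos(cos_angle)))
```

Taking the absolute value of the dot product reports the acute angle between the planes; dropping it gives the angle between the normals as oriented by the chosen point order.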

(Refer Slide Time: 29:09)

Next we will discuss the relationship between the image points when your camera center is fixed and the image plane is changing. Consider these two scenarios: there are two image shots, one of them taken in this configuration when this is the image plane and $x_1$ is the image of the scene point X. As you can see, the camera center remains at the same position.

The camera projection matrix is given by $P_1$ in this case, represented in the general form $P_1 = K_1R_1[I \mid -\tilde{C}]$ using this camera center, where $K_1$ is the calibration matrix and $R_1$ is the rotation matrix; for the other imaging system, this is the image plane and its corresponding projection matrix is $P_2$, given as $P_2 = K_2R_2[I \mid -\tilde{C}]$. We can relate $P_2$ and $P_1$ in this way: if I apply $(K_1R_1)^{-1}P_1$, this gives $[I \mid -\tilde{C}]$, and finally, if I multiply it with $K_2R_2$, it becomes $P_2$. That is how this relationship is established.

If I apply this relationship, we find that the two image points form a homography, which we already discussed in our treatment of two-dimensional projective geometry and projective transformation in particular. So, I can express this homography in this form. I will start from $x_2$, which is the image of X in the camera $P_2$, that is, $P_2X$:

$$x_2 = P_2X = K_2R_2(K_1R_1)^{-1}P_1X = K_2R_2(K_1R_1)^{-1}x_1 = Hx_1$$

Now, $P_2$ is related to $P_1$ in this form, and as you can see, $P_1X$ reduces to the image point $x_1$ of the other camera for the same scene point. Then the relationship between $x_2$ and $x_1$ is given as $Hx_1$, where $H$ is the matrix $K_2R_2(K_1R_1)^{-1}$, and it is invertible. So, this also expresses the projective transformation between the corresponding image points; from the camera geometry itself we can explain this relation.

(Refer Slide Time: 32:05)

When the image planes are parallel, the phenomenon can be described as a zooming of images. The image planes are parallel, as if you are changing the focal length, that is, the distance from the camera center to the image plane. So, you are changing the focal length; let us say the ratio of the focal lengths is k. We know that in this situation a homography is established, because there is no rotation between the two cameras.

If the rotation is the identity matrix, then in a simpler form this relationship can be expressed as $K_2K_1^{-1}x$; $K_2K_1^{-1}$ gives you the corresponding homography matrix. Here $f_2$ is the distance of the second image plane from the center, $f_1$ that of the first, and $k = f_2/f_1$ is the ratio.

Now, this can be elaborated further. Consider the deviation of any point from the principal point $x_0$. If I consider this vector from the principal point, these deviations, these vectors, should also be scaled by the same amount: the direction of the vector remains the same, but it is scaled by the factor k:

$$\tilde{x}' = \tilde{x}_0 + k(\tilde{x} - \tilde{x}_0) = (1-k)\tilde{x}_0 + k\tilde{x}$$

So, any point can be expressed in the form given in this equation, in the normal two-dimensional coordinate system: $\tilde{x}_0$ plus k times this deviation vector. From the geometry itself you can use this relation, and this is the expression. Now, this expression reveals the structure of the homography matrix. How is it so?

(Refer Slide Time: 34:23)

So, let me show you in the next slide. This is the structure: as we have mentioned, your homography is $K_2K_1^{-1}$, and finally, using that relationship and using those deviations, I can relate H to these elements:

$$H = \begin{bmatrix} kI & (1-k)\tilde{x}_0 \\ 0 & 1 \end{bmatrix} = K_2K_1^{-1}$$

Using this relationship I can relate the calibration matrices of the two configurations, $K_2$ and $K_1$. What we observe is that the calibration matrix $K_2$ is related to $K_1$ simply by multiplying by the diagonal matrix $\mathrm{diag}(k, k, 1)$.

I already explained how this diagonal notation should be interpreted: you have the diagonal elements k, k, 1 and the rest are 0s, that is, $\begin{bmatrix} k & 0 & 0 \\ 0 & k & 0 \\ 0 & 0 & 1 \end{bmatrix}$. So, if I multiply $K_1$ with this matrix then I will get $K_2$. This is the relationship when you have zooming.
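As a tiny check of this relation (Python/NumPy; the calibration matrix below is an illustrative placeholder):

```python
import numpy as np

K1 = np.array([[800.0,   0.0, 320.0],
               [  0.0, 800.0, 240.0],
               [  0.0,   0.0,   1.0]])
k = 1.5                                     # zoom factor f2 / f1

K2 = K1 @ np.diag([k, k, 1.0])              # zoomed calibration matrix
H = K2 @ np.linalg.inv(K1)                  # homography between the two images

x0 = np.array([320.0, 240.0])               # principal point
expected = np.block([[k * np.eye(2), ((1 - k) * x0)[:, None]],
                     [np.zeros((1, 2)), np.ones((1, 1))]])
print(np.allclose(H, expected))             # True: H = [[kI, (1-k)x0], [0, 1]]
```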

(Refer Slide Time: 35:35)

So, the effect of zooming is simply to multiply the calibration matrix on the right by the diagonal matrix diag(k, k, 1). With this let me stop here for this lecture; we will continue this topic in the next lecture.

Thank you very much for listening.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 15
Camera Geometry Part – V

So, we will continue our lecture on single view camera geometry. We are discussing different kinds of homography that exist with a single camera, when you have multiple views of the same scene taken by that single camera.

(Refer Slide Time: 00:39)

So, right now, what we are going to discuss is this: suppose the camera center is fixed, but you have rotated the image plane about its vertical axis. Consider this scenario: you have an image plane, a center C, and an image formed in this plane which is given by x.

Now, if I rotate this image plane, say by an angle $\theta$ about an axis, then you can consider another image plane due to this rotation, and you will get an intersection point x' in the image plane of the rotated camera. Now, we will see that there exists a homography between these two points. We already discussed this particular feature when we discussed projective transformation; in terms of camera matrices and their parameters, we can relate this homography.

Consider, in this case, the first position as the reference position of the camera; its projection matrix is given in the form $K[I \mid 0]$. We can note that this is essentially a camera-centric coordinate system, but with a calibration matrix involved; it has the parameters of the calibration matrix in it.

$$x' = K[R \mid 0]X$$

When you rotate this camera, everything else remains the same: the translation of the origin is 0, because there is no translation of the origin in that case, and that is reflected by the 0 column vector; but it has the rotation matrix. This R is the matrix which denotes the transformation due to the rotation about this vertical axis by the angle $\theta$.

So, I can write the projection matrix of the second position as $K[R \mid 0]$. I can simplify this by noting that it can equivalently be written as $KRK^{-1}K[I \mid 0]$, because $K^{-1}K$ is nothing but the identity matrix; and, as we know, $K[I \mid 0]$ is the projection matrix of the first camera.

That is what gives you the image point of the first camera. So, finally, x' is related to x by $x' = KRK^{-1}x$: it is a multiplication by the matrix $KRK^{-1}$, and that matrix itself is the homography. This is how the homography is established between these two views, for the corresponding image points of the same scene point. There are certain interesting properties of this homography matrix, because it has the same eigenvalues (up to scale) as the rotation matrix, and the eigenvalues of a rotation matrix are well defined, thanks to the particular structure of a rotation matrix which rotates by an angle $\theta$ about an axis.

So, the eigenvalues should be $\{\mu,\ \mu e^{i\theta},\ \mu e^{-i\theta}\}$, complex quantities as we can see, where $\mu$ is a scale factor; up to scale I can write them as $\{1,\ e^{i\theta},\ e^{-i\theta}\}$.

Now, H is also known as a conjugate rotation homography and can be used to measure the angle of rotation between the two views. The eigenvector corresponding to the real eigenvalue, which is $\mu$, the scalar indicating the scale of this set of eigenvalues, is the vanishing point of the rotation axis. That is another interesting piece of information. So, let us see how this homography can be used in relating different images.
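A brief sketch of how one might recover the rotation angle and the vanishing point of the rotation axis from such a homography (Python/NumPy; the calibration matrix and the test rotation used to build H are placeholders):

```python
import numpy as np

# Build a conjugate-rotation homography H = K R K^{-1} for a known test rotation.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
theta = np.deg2rad(25.0)
R = np.array([[ np.cos(theta), 0.0, np.sin(theta)],   # rotation about the camera Y axis
              [ 0.0,           1.0, 0.0          ],
              [-np.sin(theta), 0.0, np.cos(theta)]])
H = K @ R @ np.linalg.inv(K)

# Eigenvalues of H are {mu, mu e^{i theta}, mu e^{-i theta}}.
vals, vecs = np.linalg.eig(H)
complex_idx = np.argmax(np.abs(vals.imag))   # one of the complex-conjugate pair
real_idx = np.argmin(np.abs(vals.imag))      # the real eigenvalue mu

angle = np.degrees(np.abs(np.angle(vals[complex_idx] / vals[real_idx])))
axis_vp = vecs[:, real_idx].real             # vanishing point of the rotation axis
axis_vp = axis_vp / np.linalg.norm(axis_vp)  # keep it homogeneous (it may lie at infinity)

print("recovered rotation angle:", angle)                      # ~25 degrees
print("rotation-axis vanishing point (homogeneous):", axis_vp)
```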

(Refer Slide Time: 06:01)

Some of the applications of projective transformation associated with this single view camera geometry, which we were discussing earlier, are worth pointing out. We have already considered the corresponding affine rectification, or stratification, tasks that involve using a homography with the images. Now, we can point out those applications once again here.

First, we can generate synthetic views given a single view, as we did in a previous example in my lectures. Our natural tendency is to look at fronto-parallel planes, and the sensation of obliqueness comes with respect to such a fronto-parallel plane. So, if you have a plane which makes an angle with the fronto-parallel plane, and there is some planar object or image on that plane, we can apply a homography to straighten it out, to bring it onto the fronto-parallel plane. Consider this particular example: in a fronto-parallel view, keeping the same aspect ratio, we can redefine the image points, compute the homography (as in the sketch below), and warp the source image with H. This example we have already discussed.
(Refer Slide Time: 07:51)

Another kind of application could be panoramic mosaicking of images. So, consider you
have a wide view, but at a particular time you have only a limited view of taking
images. So, you have a wide panoramic scene and with the single camera, you want to
capture the whole scene, but your camera is restricted by only a small viewing angle.

So, you have an image plane, where only those points, which are intersecting with
respect to the image plane, which is been sensed by the sensors of your camera those are
only captured. So, you may in that case capture a series of images by rotating the camera.
So, getting view, getting images from different views and then again perform the
homography transformation with respect to a reference plane and put them under the
same reference plane.

So, all those images would then look as if they were images on a single plane; that is the task of mosaicking. We discussed in my previous slides how rotation about an axis leads to a homography. So, consider here that this is your reference plane and these are the other views from which you are looking through your camera.

So, you apply a homography from points on one view to the reference view, say this homography is H1, and again apply a homography from the next view to the reference view; then all the points are registered in the same coordinate system and you can get the larger image. So, that is how you can get a panoramic mosaic using this kind of computation.
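The steps above can be sketched as follows with OpenCV; the image file names are hypothetical, and ORB feature matching is used here only as one convenient way to obtain point correspondences between the two rotated views.

```python
import cv2
import numpy as np

# Hypothetical file names: two overlapping views taken by rotating the camera.
img_ref = cv2.imread("view1.jpg")
img_new = cv2.imread("view2.jpg")

# Obtain point correspondences between the two views (ORB + brute-force matching).
orb = cv2.ORB_create(2000)
k1, d1 = orb.detectAndCompute(img_ref, None)
k2, d2 = orb.detectAndCompute(img_new, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)

pts_ref = np.float32([k1[m.queryIdx].pt for m in matches])
pts_new = np.float32([k2[m.trainIdx].pt for m in matches])

# H1 maps points of the new view onto the reference image plane.
H1, _ = cv2.findHomography(pts_new, pts_ref, cv2.RANSAC)

# Warp the new view into the reference frame; both views now lie on one planar canvas.
canvas = cv2.warpPerspective(img_new, H1, (img_ref.shape[1] * 2, img_ref.shape[0]))
canvas[:img_ref.shape[0], :img_ref.shape[1]] = img_ref
```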

(Refer Slide Time: 09:47)

So, let us discuss the concept of vanishing points in imaging; I have referred to vanishing points in my previous lectures as well. Vanishing points with respect to a homography, that is, with respect to a projective transformation, and vanishing points with respect to imaging by the camera transformation have similar kinds of features. So, vanishing points are nothing but images of points which are at infinity.

Let me give you an analogy with respect to a one dimensional scene. A one dimensional scene can be thought of as an infinite line, with points lying on it. So, you consider this as an infinite line with points lying on it, and we project any point by the following rule: there is a centre, you take any point on this line, draw a ray from that point to the centre, and that ray intersects your image plane, which in this case is also an image line. So, the point of intersection of this ray gives you the image of this particular scene point. The scene, as I mentioned, is a one dimensional space.

So, if I go on drawing this kind of image for all the points on this line, we will see its effect. Suppose you take another point; once again there is a new image point, then another point. So, these are the corresponding images that you are observing, and if you go on doing this, finally, as you understand, there is a limiting point beyond which you will not get any intersection point, because once your projection ray becomes parallel to the line, two parallel lines intersect only at the point at infinity.

So, what we can observe here is that if you go on doing this, finally there exists a limiting point V which is defined geometrically in this way. Consider a line which is parallel to L and passing through the centre C; when that line intersects the image line at a point V, that is the vanishing point, because any other point you choose from this line will still intersect at a point which does not go beyond V (upwards, in this case).

So, this is the interpretation of a vanishing point when we restrict our scene to a one dimensional straight line and our imaging structure is also a line. We extend this idea to imaging of a three dimensional scene on a two dimensional plane, and we will see what kind of conditions we get. As I mentioned, this is the vanishing point, and the summary is that the vanishing point of a line L is the point where the line parallel to L and passing through the camera centre C intersects the image plane.

(Refer Slide Time: 13:21)

(Refer Slide Time: 13:39)

So, if I extend this concept, as I was mentioning, consider a line defined in three dimensional space. Once again, your imaging geometry is defined by a centre of the camera, or centre of projection, C, and there is an image plane. If I apply the same projection construct, you will find that all the projected points on the image plane also lie on a line. And finally, when the projection ray through the camera centre C becomes parallel to the direction of the line L, their intersection would be a point at infinity, that is, the point at infinity in the direction of L; and the point of intersection of that ray with the image plane acts like a vanishing point.

So, let me show the construct here: in a similar fashion we go on constructing, and finally, as I mentioned, the line parallel to L intersects the image plane at the point v, which becomes the vanishing point. You note that v, r, q all lie on a straight line, because the image of a straight line is also a straight line.

$$v = P X_\infty = K[I \mid 0]\begin{bmatrix} d \\ 0 \end{bmatrix} = Kd$$

And so, even as you move further and further along the line L, you will never cross v in that direction. So, that is the vanishing point of the straight line L; the constructed line is parallel to L. And as we have already discussed, a point at infinity in a particular direction d is denoted by $\begin{bmatrix} d \\ 0 \end{bmatrix}$. So, d is the direction of the line L, and if I apply the projection of that point, assuming the camera model in the simple canonical form $K[I \mid 0]$, then multiplying it with the column vector $\begin{bmatrix} d \\ 0 \end{bmatrix}$ we get $Kd$; that is the vanishing point.

So, in this case you note that the vanishing point is independent of the translation of the camera, or independent of the position of the camera centre, provided the camera is not rotated. As you can see from the transformation matrix, here we have taken only the identity matrix, which means there is no rotation: the orientation of the axes remains the same, and you may only translate the camera. But still, if I apply that translation as well:

$$K[I \mid t]\begin{bmatrix} d \\ 0 \end{bmatrix} = Kd$$

So, if I consider that transformation, the new camera matrix after translation would be something like $K[I \mid t]$, where the translation is given by a column vector t. Then, if I consider the projection of the point at infinity $\begin{bmatrix} d \\ 0 \end{bmatrix}$, where d is a column vector, what happens? Since the last coordinate is 0, the translation column has no effect, and the result is again $Kd$. So, this is how the vanishing point is computed, and you can see that it is the same point.
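A small numerical check of this invariance; the values of K, d and t below are purely illustrative and not from the lecture.

```python
import numpy as np

# Illustrative values only: K is an assumed calibration matrix, d the direction of the
# world line, t a pure translation of the camera centre.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
d = np.array([1.0, 2.0, 5.0])
t = np.array([3.0, -1.0, 2.0])

X_inf = np.append(d, 0.0)                              # point at infinity along d

P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])      # K [I | 0]
P1 = K @ np.hstack([np.eye(3), t.reshape(3, 1)])       # K [I | t], translated camera

v0, v1 = P0 @ X_inf, P1 @ X_inf
print(np.allclose(v0, K @ d), np.allclose(v0, v1))     # True True: v = K d in both cases
```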

So, this is an important mathematical explanation of why points at a great distance appear to stay at almost the same position: with respect to our frame of reference they do not move, or rather they appear to move along with us. For example, when you are moving and watching the moon at a distance, you will find that in your frame of reference the moon always remains at the same point, because if your motion does not involve any rotation, it is a simple translatory motion.

So, it will seem as if the moon is also moving with you, which means it remains at the same location with respect to you. So, this is a nice explanation of those phenomena.

(Refer Slide Time: 18:07)

So, this is a follow up of that discussion: vanishing points are independent of the camera position if the camera is not rotated. But when you have a rotated camera, you can also get a very simple expression by applying the same mathematical logic. If I rotate the camera, the projection matrix can be written as $KR[I \mid -\tilde{C}]$, which accounts for the rotation R and the effect of moving the camera centre to the position $\tilde{C}$.

So, suppose this is the corresponding projection matrix after the rotation R and after moving the camera centre to $\tilde{C}$; then if I apply the projection of the point at infinity, I can simply write the vanishing point as $KRd$. That is what is shown here; you can see this expression here.
here.

So, the implication is that if I know the vanishing points, I can recover directions and rotations. There are ways by which you can compute the vanishing points: you may take images of parallel straight lines and find their point of intersection, which gives you the vanishing point. And then, if you know those vanishing points and also the camera parameters, like the calibration matrix K, then we can compute the rotation. So, this is one of the interesting implications of this result.

$$\hat{d} = \frac{K^{-1} v}{\lVert K^{-1} v \rVert}$$

So, this is how the direction of the corresponding line can also be computed from this relationship: you apply $K^{-1}$ to v and normalize the result. Using this particular relationship, you can get the direction of the straight line in three dimensions if you know its vanishing point and its corresponding calibration parameters.

Similarly, you can get the direction of the same straight line in the reference frame. So, from there you can get the rotation matrix: if the two directions are related by the same transformation R, each such correspondence provides two independent constraints on R, and R can be computed using this relationship.

Let me explain these constraints: the first is that R is an orthonormal matrix, and then there is an angle of rotation involved which you need to determine. So, with this you can find it out.
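A minimal sketch of this direction recovery, assuming K and a vanishing point v are known; the helper name and values are mine.

```python
import numpy as np

def direction_from_vanishing_point(K, v):
    """Unit 3D direction of the world lines whose vanishing point is v (homogeneous)."""
    d = np.linalg.inv(K) @ np.asarray(v, dtype=float)
    return d / np.linalg.norm(d)

# Illustrative usage with made-up values.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
v = np.array([500.0, 300.0, 1.0])
d_hat = direction_from_vanishing_point(K, v)
```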

(Refer Slide Time: 21:23)

So, now let us consider another kind of geometric information, or interpretation, with respect to imaging. Consider a plane $\pi$ which is specified by its unit normal; that normal direction is known to us, and of course, to specify a particular plane you also need to know a point on it. In this case, let us consider that we know a set of straight lines in this plane $\pi$; those are also specified. So, if I take the images of these straight lines, what will we get?

So, consider a plane which is parallel to this plane $\pi$, that is, a plane whose normal is also the same $\hat{n}$, as shown; that is the construct: it is a parallel plane through the camera centre, and we can denote this plane as $\pi_{\parallel}$, as shown here. So, what you are getting here are the vanishing points of the directions in the plane: all lines parallel to the direction of M have a vanishing point, say at $v_M$, and similarly $v_L$ is the vanishing point related to the direction specified by L, which lies in this plane $\pi$. So, any line parallel to that direction has the same vanishing point $v_L$.

So, if I connect these two points, you will get a line on the image plane, and in fact that is the vanishing line. The interpretation is that the vanishing points of all the lines lying on that plane $\pi$ will lie on that line. Moreover, for any plane parallel to $\pi$, the lines in those directions have the same vanishing line. So, that is the interpretation. We can then define the vanishing line l in this way: it is the intersection of two planes, namely the plane $\pi_{\parallel}$, the parallel plane through the camera centre, and the image plane.

So, it is a plane which is parallel to $\pi$ and which passes through the camera centre, and when it intersects the image plane, the two planes intersect in a line. That intersecting line is the vanishing line; that is the geometric interpretation of the vanishing line, and it is associated with the plane. So, it is the vanishing line with respect to this plane. So, this is the interpretation.

So, from this we can also compute the normal of the plane, because if you know the vanishing line l, then I can compute the corresponding plane itself by using the same transformation $P^T l$, and that gives me the normal direction of the plane for which this line is the vanishing line. And if it is the reference camera, that is, if the camera is represented by the projection matrix $K[I \mid 0]$, then multiplying with l it is simply $K^T l$. So, for a unit direction you need to normalize that vector.

(Refer Slide Time: 25:09)

So, let us again work out an exercise involving this particular computation. Suppose you have a camera which has the following projection matrix P, as given here, and suppose you have a line in the image coordinate space given by the equation $3x + 4y = 5$. We have to compute the normal of the plane for which this line appears as the horizon. You understand that the horizon should be the vanishing line of the world plane which is parallel to our particular view,

that is, of the world plane which has this particular vanishing line. So, you would like to get that plane, whose parallel plane through the camera centre intersects the image plane in this line. That is what we would like to find out.

(Refer Slide Time: 26:19)

So, let us see the corresponding solution. You have this projection matrix, and the line is denoted by the column vector $l = \begin{bmatrix} 3 \\ 4 \\ -5 \end{bmatrix}$. The plane formed by the camera centre and this line, as you know, is $P^T l$. So, to perform this computation you have to transpose the camera matrix and multiply it with l; then you get the equation of the plane. If I would like to get the normal of this plane, I should restrict myself to the first three elements of that column vector, since the equation of the plane is represented as $47x + 72y + 8z + 5 = 0$. So, the vector of the first three coefficients provides the normal; you have to normalize it, and then you get the unit vector. So, that is what we will be doing.
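A sketch of this computation in NumPy; the projection matrix below is a made-up placeholder (the actual P is the one given on the slide), so only the procedure, not the numbers, carries over.

```python
import numpy as np

# Placeholder projection matrix (the actual P is the one given on the slide).
P = np.array([[120.0, 5.0, 60.0, 10.0],
              [3.0, 110.0, 40.0, 20.0],
              [0.0, 0.0, 1.0, 1.0]])

l = np.array([3.0, 4.0, -5.0])            # the line 3x + 4y = 5 in homogeneous form

pi = P.T @ l                               # plane through the camera centre containing the back-projection of l
n_hat = pi[:3] / np.linalg.norm(pi[:3])    # unit normal of every plane whose vanishing line is l
```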

(Refer Slide Time: 27:23)

So, all planes parallel to this plane have the vanishing line l; that is the interpretation, and the normal can be computed in this fashion. So, finally we can compute the normal as shown in this computation.

(Refer Slide Time: 27:37)

So, how do you compute a vanishing line in that case? You have to identify sets of parallel lines in a plane along different directions. We can obtain their vanishing points and then take the line through them. In this way you can get the vanishing line.
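In homogeneous coordinates this is just a couple of cross products; the numbers below are illustrative only.

```python
import numpy as np

# Each vanishing point is the intersection of the images of a set of parallel scene lines.
# Here two vanishing points are simply given as illustrative homogeneous vectors.
v1 = np.array([120.0, 80.0, 1.0])
v2 = np.array([-40.0, 95.0, 1.0])

# The vanishing line of the plane is the image line through the two vanishing points.
l = np.cross(v1, v2)
```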

(Refer Slide Time: 27:49)

So, now I will be summarizing the content, that is, the particular highlights which we discussed in this topic of single view camera geometry. The first thing, as you can see, is that the pinhole camera model provides a projection matrix which maps a 3 dimensional point to an image point, and the projection matrix has a known structure: it is a 3 × 4 matrix. It is a mapping from the 3 dimensional world to a 2 dimensional plane, and it has 11 degrees of freedom; that means there are 11 independent parameters.

As you can understand, 3 × 4 means there are 12 elements in that matrix. Since you can scale those elements and still get the same mapping, one of the elements acts as a scale factor, and the rest of the elements are independent of that scale factor. Then, out of these 11, there are 5 intrinsic parameters and 6 extrinsic parameters, and you require a minimum of 6 point correspondences to estimate this projection matrix. There is another kind of projection matrix which we call the affine projection matrix.

So, in this case, instead of projection rays converging at a centre of projection or camera centre, we consider that the images are formed by parallel rays and the camera centre lies at infinity.

The structure of the projection matrix for this kind of affine projection has a unique distinction: it should have a row of the form [0 0 0 1], or any scale value in the place of 1. If I fix that scale value at 1, the rest of the elements are the independent parameters. So, if I take out these four elements from 12, 8 remain; its degree of freedom is 8, or the number of independent parameters is 8, and each point correspondence provides 2 equations. So, we require a minimum of 4 point correspondences to estimate this affine projection matrix.

(Refer Slide Time: 30:13)

Then we discussed the geometry which is encoded in a projection matrix. Projection matrices can be represented in different forms to express these relations. As we can see, we can have the form $[M \mid p_4]$; these are the notations we have used, where M is a 3 × 3 sub-matrix and $p_4$ is the fourth column vector. Or the projection matrix can be considered as a set of 4 column vectors, denoted by $p_1, p_2, p_3, p_4$, or it can be considered as a stack of rows, which are row vectors.

So, in this way the projection matrix can be represented, and some interesting information about the geometry encoded in it can be retrieved from the values of the projection matrix itself. For example, the camera centre is given by $-M^{-1}p_4$ in the world coordinate system itself, and for an affine projection matrix you have to take the right null vector of M, which is interpreted as a direction.

So, it would give you the direction of the parallel rays which form the images. Then we discussed vanishing points in imaging. For example, the vanishing point of the X axis is given by the first column vector $p_1$, that of the Y axis by $p_2$, and for the Z axis the image of its vanishing point is at $p_3$; the image of the world origin is at $p_4$. And there are also special planes which we can recover from the elements of the projection matrix, particularly from its rows.

So, some of the special planes: the principal plane is given by the relation $r_3^T X = 0$, and since it is the principal plane, its normal gives you the principal axis. So, the first three elements of the vector $r_3$ give you the direction of the normal. Then the principal point is the intersection of the principal axis with the image plane, which is expressed by this relationship: M is the corresponding 3 × 3 sub-matrix of the projection matrix, and we need to multiply the direction of the optical axis with M to get the principal point. Next, consider the plane formed by the x axis of the image coordinate system together with the centre of projection.

So, with the centre of projection and the x axis, this makes a plane in the 3 dimensional space, given by the relation $r_1^T X = 0$. Similarly, the plane formed by the y axis of the image coordinate system together with the camera centre is $r_2^T X = 0$.

(Refer Slide Time: 33:21)

So, there are other geometric derivations from the projection matrix. The first is that you can form the projection ray at an image point: you can form the three dimensional equation of that particular line, whose direction ratio is given by the expression $M^{-1}x$. Note that x is an image point expressed in the homogeneous coordinate system of the 2 dimensional projective space.

So, this gives you a direction ratio in 3 dimensional space, and together with a point on the ray it defines the line; incidentally, that ray has to pass through the camera centre, and you know that the camera centre can be computed as $-M^{-1}p_4$. That gives you the camera centre, and this is how the projection ray is formed.
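A minimal sketch of this back-projection, assuming a finite camera P (the function name is mine):

```python
import numpy as np

def back_project(P, x):
    """Projection ray of image point x (homogeneous 3-vector) for a finite camera P = [M | p4]."""
    M, p4 = P[:, :3], P[:, 3]
    centre = -np.linalg.inv(M) @ p4                              # camera centre, a point on the ray
    direction = np.linalg.inv(M) @ np.asarray(x, dtype=float)    # direction ratio of the ray
    return centre, direction
```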

Similarly, you can also compute the plane formed by a line in the image plane together with the camera centre; it is given by $P^T l$. And the vanishing point of a line with direction d is given by $Md$. So, with this let me conclude this particular topic here. I hope you have learnt some features of single view geometry; next, we will be discussing stereo geometry, or two view camera geometry.

Thank you very much for listening.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 16
Stereo Geometry Part – 1

In this lecture, we will start a new topic, that is, Stereo Geometry. So, let us consider what is meant by a stereo setup. In a stereo setup, we have two cameras, and these two cameras are specified by their camera centres and their image planes.

(Refer Slide Time: 00:37)

And there is a scene point X of which you will be taking images in both these cameras. So, if I apply the rule of projection here, we get the image of this scene point X in the first camera, whose camera centre is given as C, at a point, say x.

Similarly, the second camera, given by the camera centre C', forms an image of that same scene point at x' in its own image plane. So, you have two images of the same scene point, and this is what you will get in a stereo setup: for every scene point, there are two image points, provided the point is viewable from both cameras.

Now, this particular construct has certain geometric properties. What we can see particularly is that if I consider the plane formed by C, C' and the scene point X, then the image points x and x' also lie on the same plane.

That is the first restriction. Also, if I draw a line between C and C', which for a stereo setup is called the baseline, that line also lies on the same plane, and it intersects the image planes.

These intersections give some special image points in the image planes; in fact, as we can see, one of these intersection points, shown here, can be interpreted as the image of the camera centre of the other camera. And the other point, shown here as the point of intersection of the baseline with the other image plane, can be interpreted as the image of the camera centre of the first camera.

So, these two points are called epipoles, and they also lie on the same plane. They are denoted here by e and e', and the line joining e and x is nothing but the intersection of the plane containing all these points X, C, C' and their image points with the first image plane. That intersection with the first image plane gives you a line, and similarly you get two such lines in the two image planes, because they are the intersections with the three dimensional plane formed by the scene point and the camera centres.

Now, those two lines are called epipolar lines. So, we have this particular configuration, and these terms will be explained more and more in subsequent lectures. But you should note this interesting configuration: all of these points and lines lie on the same plane, and that plane is called the epipolar plane with respect to this scene point. So, you have these epipolar lines, and this plane, formed by both camera centres and the scene point, is called the epipolar plane.

(Refer Slide Time: 04:09)

So, this geometry in general is referred to as epipolar geometry. A summary of the different concepts of epipolar geometry is shown here once again. As you can see, you have the corresponding baseline here, and the intersections of the baseline give the epipoles; note particularly this epipolar line in the second camera. If I consider the projection ray formed by the first camera centre and the image point, then any point lying on that ray is a possible candidate for the scene point whose image is this x, and if I take the image of such a scene point by the second camera,

this image will lie on the epipolar line. So, that is the geometric interpretation: all the possible candidate scene points for this image point project onto an epipolar line. So, this is a constraint imposed by this particular configuration: the corresponding point of x in the right image lies on l'.

(Refer Slide Time: 05:29)

And you can also interpret this configuration in this fashion: for any epipolar plane $\pi$, its lines of intersection with the image planes are the corresponding epipolar lines. So, all points on $\pi$ project onto those lines only: any point on the plane $\pi$ will be projected, for the first camera, onto the epipolar line l, and for the second camera, onto the epipolar line l'.

(Refer Slide Time: 06:09)

So, in epipolar geometry we have these concepts; once again we will now define them using the geometric configuration only. Epipoles, as we see, are e and e' in the figure here. Epipoles are defined as the intersections of the baseline with the image planes; that is one kind of definition. As you can see, this is the baseline, and its intersections with the image planes respectively provide those points, which are the epipoles. That is one kind of interpretation; or you can also interpret them, as I mentioned earlier, as the projection of the projection centre in the other image, because you can consider, for example, this camera centre, which is the projection centre of the first camera.

So, this epipole is the projection of that camera centre onto this image plane, because you draw the ray from this point to that point, which is again the baseline, and its point of intersection with the image plane gives you this particular epipole. We call this epipole the right epipole. So, in our convention we will use this directional notation: the first camera we will call the left camera, following the diagram's representation, and the second camera we will refer to as the right camera.

Similarly, of the epipoles, the first epipole e is considered the left epipole; that is just the epipole on the reference image plane, or first image plane. So, the first camera is considered the reference camera of the stereo setup, and the second epipole is referred to as the right epipole, that is, the corresponding epipole on the second camera. So, these are certain notations, or terminologies, that will be used in our discussion.

You can also interpret epipoles as the vanishing point of the camera motion direction. So, what is the camera motion direction? It is the direction of translation of the camera centre, which is again given by the baseline. In that direction, as you can see, the intersection of that ray with the image plane is always the vanishing point along that direction. So, epipoles are also the vanishing points of the camera motion direction. In these various ways you can interpret the epipoles.

On the other hand, you define an epipolar plane as a plane containing the baseline. So, you can see that baseline. It is a one dimensional family in the sense that the baseline acts like the axis of a pencil of planes: all the planes rotating around that particular line define epipolar planes. And an epipolar line is the intersection of an epipolar plane with an image plane, and epipolar lines always come in corresponding pairs.

(Refer Slide Time: 09:25)

So, this is the explanation I was mentioning: you get a family of epipolar planes, and the intersection of any pair of these planes is the baseline. These epipolar planes intersect the image planes to give you the epipolar lines, and all those epipolar lines meet at the respective epipoles, as we can see in this particular diagram.

So, you can consider that this is one plane and this is another plane. For each plane, these are the corresponding epipolar lines, and they intersect at e in this image and at e' in the other.

(Refer Slide Time: 10:12)

So, here are typical examples of images from converging cameras, where it is shown that their epipolar lines are converging. This is one example I have taken from the book of Hartley and Zisserman, Multiple View Geometry in Computer Vision, which is also cited here. You can see how epipolar lines can be drawn on those images, and you can find that all those epipolar lines are converging in this case.

(Refer Slide Time: 10:46)

Also, consider a situation when you have simply a translation of the camera centre and no rotation of the axes. In that case the epipolar lines become parallel: all epipolar lines become parallel to the direction of motion of the image planes, or of the camera. Since parallel epipolar lines meet at infinity, the epipoles in the image planes are points lying on the line at infinity. So, these are the interpretations of epipoles when you have simply a translation of the camera centre.

(Refer Slide Time: 11:30)

So, now let me provide the mathematical formulation of the different concepts involving these entities which we have defined. First, as we can see from this geometry, you have two cameras given by the centres C and C', and the projection matrices of each camera are also specified here. One of the projection matrices is given as $K[I \mid 0]$; that means it is the reference camera, and it is the camera centric coordinate system that we consider here. On the other hand, the other camera, that is, the second camera, is given by $K'[R \mid t]$, with calibration matrix K' and the rotation and translation given by R and t.

So, in a stereo setup, the reference camera is denoted in this diagram in this form. One of them is the reference camera and the other one is the second camera. By convention we usually call the reference camera the left camera and the other camera the right camera, with its coordinates expressed with respect to the reference camera. The reference camera usually has the camera centric coordinate system, but this could also be generalized.

Now, you can see that in this structure we have two image points, x and x', corresponding to the same scene point. We can actually show that, given all the points in a plane, it induces a homography. We will discuss this later on, but right now we will consider a particular type of homography; we will work out the mathematical relationships and show how that homography exists between corresponding points.

So, from the relations of camera geometry, x' can be expressed as P'X in the homogeneous coordinate system of projective geometry. Then, we can consider the ray formed by this particular image point with respect to the camera centre and consider a point which lies at infinity along that ray; that is expressed using $P^{+}x$, where $P^{+}$ plays the role of a pseudo inverse.

 K 1 
So, I should not call exactly that is pseudo inverse, but it is given in this structure  T  .
0 
You can note that the direction cosine, direction of the particular line, this can be given
as K 1 x and the any point which is lying at infinity for this line that can be expressed as
 K 1 x 
  ok. So, finally, this is a same point which we were considering which is at a
 0 
infinity and we are considering an image of that point in this particular configuration.

So, for any other image point, say y, there is also a corresponding image point y' which is again formed by a scene point lying at infinity; that plane is called the plane at infinity, and there is a particular notation for it. So, let us assume in this case that we are only considering the formation of corresponding points with respect to the plane at infinity; then we get an interesting homography between those corresponding points.

That means a homography between the point in the first image plane and the corresponding point in the second image plane formed from the same scene point on the plane at infinity; this particular homography is sometimes referred to as the homography at infinity. This discussion also we will be doing later on, but the structure of this homography at infinity is simple because, as you can see from this relation for x', it is nothing but the corresponding $H_\infty$, which is easily computed in the form $P'P^{+}$.

So, this is for scene points lying on the plane at infinity. I can simplify this matrix using its elements: $P'$ is given by $[K'R \mid K't]$, and if I multiply out, you get the homography matrix as $K'RK^{-1}$. So, this is your point x'. Now, the epipolar line is constructed, as we know in the projective space, as the cross product of e' and x', which gives you the line $l' = e' \times x'$.

Now, these are all 3-vectors, and this cross product operation can also be expressed in terms of a matrix multiplication. I form a 3 × 3 matrix from the elements of e', and I can express this particular cross product as multiplication by that matrix. So, let me provide a simple notation.

So, just to complete this discussion, we will come back to these particular constraints. Suppose there is a 3 × 3 matrix which, when multiplied with x', gives you l', performing the computation equivalent to this cross product of two points. I can then expand this operation: x' can also be expressed, using the homography relation, as $H_\infty x$, and since this is a 3 × 3 matrix and that is a 3 × 3 matrix, I can multiply these two matrices into a composite form and denote it by $F = [e']_{\times} H_\infty$.

So, this relation says that l', the epipolar line in the second camera, can be obtained from the image point in the first camera. It is a very important relationship in a stereo setup, a very fundamental relationship, and it is given by the expression $l' = Fx$. So, there exists a matrix which can be obtained from these parameters, from these values, and if I multiply the image point in the first camera, in its homogeneous coordinates, with this matrix F, which is of dimension 3 × 3, then I will get this epipolar line l'.

So, this matrix is called the fundamental matrix, and it is a characterizing matrix of this stereo setup. Its task is to convert an image point to a line, which is an epipolar line, and this kind of relationship or transformation is sometimes also called a correlation. That is a kind of misnomer, but you may find this term in the textbook as well.

(Refer Slide Time: 19:58)

So, let me also explain how this particular conversion happens, that is, how a cross product can be expressed in the form of a multiplication by a 3 × 3 matrix. Let us consider that e' is represented by the column vector $\begin{bmatrix} e'_x \\ e'_y \\ e'_z \end{bmatrix}$, and a point x' is basically a column vector $\begin{bmatrix} x \\ y \\ k \end{bmatrix}$; its coordinates are x, y and some scale value k in the homogeneous coordinate system. So, let me represent them in this form.

(Refer Slide Time: 21:05)

So, when we want to replace the cross product $e' \times x'$ by a matrix multiplication, what do we do? We first expand the cross product as a determinant:

$$e' \times x' = \begin{vmatrix} i & j & k \\ e'_x & e'_y & e'_z \\ x & y & k \end{vmatrix} = (k e'_y - y e'_z)\,i + (x e'_z - k e'_x)\,j + (y e'_x - x e'_y)\,k$$

We consider the first row as i, j, k, the second row as $e'_x, e'_y, e'_z$, and the third row as x, y, k, and expand this expression as we did earlier. So, we can write it as $(k e'_y - y e'_z)$ times i, plus $(x e'_z - k e'_x)$ times j, plus $(y e'_x - x e'_y)$ times k. So, you get this particular expression, and you would like to have a vector which gives you all these components, that is,

$$\begin{bmatrix} k e'_y - y e'_z \\ x e'_z - k e'_x \\ y e'_x - x e'_y \end{bmatrix}$$

So, since we are replacing the cross product by a matrix multiplication, I should have a matrix which, when multiplied with $\begin{bmatrix} x \\ y \\ k \end{bmatrix}$, gives me this particular expression. So, I have to find out this matrix. Now, you can see that to get the first component there is no x term, so that entry should be 0; for y it is $-e'_z$, and for k it is $e'_y$. Consider the second row: the coefficient of x is $e'_z$, then 0, and with k it is $-e'_x$. And consider the third row, where x has the coefficient $-e'_y$, then y has $e'_x$, and the last entry is 0:

$$\begin{bmatrix} 0 & -e'_z & e'_y \\ e'_z & 0 & -e'_x \\ -e'_y & e'_x & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ k \end{bmatrix}$$

So, this is the conversion of $e' \times x'$: I can replace it by this matrix multiplication. Note particularly that this matrix is a skew-symmetric matrix: all its diagonal elements are 0, and the element obtained by transposing any entry is the negative of that entry. So, this is how this is explained in this case.
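This construction is easy to capture in a small helper; a minimal sketch (the function name is mine), with a check against NumPy's own cross product:

```python
import numpy as np

def skew(e):
    """3x3 skew-symmetric matrix [e]_x such that skew(e) @ x equals the cross product e x x."""
    return np.array([[0.0, -e[2], e[1]],
                     [e[2], 0.0, -e[0]],
                     [-e[1], e[0], 0.0]])

e = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, -1.0, 0.5])
assert np.allclose(skew(e) @ x, np.cross(e, x))
```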

(Refer Slide Time: 25:05)

So, we will proceed with this particular explanation of how the cross product can be converted, as I mentioned earlier, into the form of a skew-symmetric matrix. You can note the use of the prime (') in my notation; I have used it everywhere here because these quantities are all related to e'. However, it is just a matter of notation that you need to get used to.

So, the fundamental matrix, in its expanded form, when I consider only the elements of the projection matrices and the epipole, can be written as $F = [e']_{\times} K'RK^{-1}$. So, I can express this fundamental matrix in terms of the camera matrices and the epipolar configuration, that is, from the epipole itself.
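Putting the pieces together, a sketch of this expression for P = K[I | 0] and P' = K'[R | t], where the right epipole is e' = K't (as used in the relations below); the function names are mine.

```python
import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def fundamental_from_calibrated_pair(K, K2, R, t):
    """F for P = K[I|0], P' = K'[R|t]; the right epipole is e' = K' t."""
    e2 = K2 @ t                          # image of the first camera centre in the second view
    H_inf = K2 @ R @ np.linalg.inv(K)    # homography induced by the plane at infinity
    return skew(e2) @ H_inf
```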

(Refer Slide Time: 25:56)

So, we will continue this discussion on epipolar geometry. Once again, the fundamental matrix is the matrix such that if I multiply any image point of the first camera, in the projective space, with this matrix F, then I get the epipolar line in the second camera. The interpretation is that the corresponding image point will lie on this epipolar line, which means I can apply the point containment relationship with the line. Note that the centre of the first camera is given by one expression and the centre of the second camera is given by this particular expression, $\begin{bmatrix} -R^T t \\ 1 \end{bmatrix}$.

And as we have discussed, an epipole can be considered as the image of a camera centre. So, the left epipole e can be expressed as the projection matrix multiplied with C': this gives the image coordinates of e, which is $-KR^T t$, or equivalently $KR^T t$, because the directionality does not matter here; the minus sign just denotes the directionality. Similarly, e' can be considered as the image of the first camera centre, which is $K't$. So, e is the image of C' and e' is the image of C; this is how the relationship stands, and the point containment relationship of these points with the epipolar lines is expressed here.

So, the epipolar line l' can be expressed as Fx, and if I apply the point containment relationship $x'^T l' = 0$ and expand l' as Fx, we get the relationship $x'^T F x = 0$. So, given two corresponding points, which are observable, measurable or computable in the two images, that is, which are images of the same scene point, there exists a 3 × 3 matrix which satisfies this relation.

So, this relationship is also very fundamental to epipolar geometry. In fact, if I apply the rule of matrix transposition, I can also express it as $x^T F^T x' = 0$; that means, if I instead consider the second camera as the reference camera and the first as the other camera, then with respect to that point correspondence we also have a matrix which satisfies the relationship. Those matrices are called fundamental matrices, and they are related by this relation.

(Refer Slide Time: 29:38)

So, if this camera is the reference camera and the fundamental matrix is F, then if I consider the other camera as the reference camera, the fundamental matrix will be $F^T$. That is what we get from this relationship. There is another particular term, a terminology used in epipolar geometry, when the camera is a calibrated camera.

So, we say a camera is calibrated when its calibration matrix is known to us, and then we see how this relationship can be simplified. When you know the calibration matrix, you can express any image coordinate in its normalized canonical coordinate system.

So, you can see that the relationship between the fundamental matrix and the camera parameters is given in this form: $F = [e']_{\times} P'P^{+} = [K't]_{\times} K'RK^{-1}$. The two cameras are given as P and P' as shown here, and suppose P is given simply in the normalized canonical form $[I \mid 0]$, with the calibration matrix factored out; that means we have applied all the necessary transformations to bring it into the standard pinhole camera geometry, with the basic convention of the camera centric coordinate system.

So, if I once again use the representation of P' in this form, then this relation can be expressed accordingly; you can see that P' can also be written as

$$P' = K'[R \mid t] = [K'R \mid K't] = [M \mid m]$$

This is another representation of the camera matrix, and if the calibration matrices are known, I can always convert to it by performing the necessary transformations.

Then this particular structure reduces to a form involving only the translation parameter t and the rotation R, namely $[t]_{\times}R$. So, this is how the fundamental matrix is expressed in that form, and the fundamental matrix in this particular form is called the essential matrix.

So, with this, I will stop this lecture here, and we will continue this discussion.

Thank you for listening to this lecture.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 17
Stereo Geometry Part – II

We will continue our discussion on epipolar geometry. We have been discussing the fundamental matrix and how this matrix is related to the parameters of the projection matrices. The first thing is that the fundamental matrix is a matrix such that if we multiply any image point with it, it gives you the corresponding epipolar line in the second image plane.

(Refer Slide Time: 00:50)

And this can be computed from the camera parameters themselves; those relations, which we discussed earlier, are shown here, and they can take various forms. Here you can consider that the reference camera is in the standard canonical representation, say [I|0], and the other camera is given in this representation; with respect to this configuration you can compute the fundamental matrix by this expression.

So, this is one kind of relationship that we discussed, and we also observed that the fundamental matrix can be simplified into the form of an essential matrix if I use calibrated cameras, which means I know the calibration matrices. Then I can make the necessary transformations on the image coordinates so that effectively the calibration matrix becomes an identity matrix. In that form the fundamental matrix reduces to this particular structure when we define the camera matrices in this form: P is given as [I|0] and P' is given by a rotation matrix and the corresponding translation of the origin.

So, they are given in this form, and then the fundamental matrix can be obtained by performing this operation: it is the cross product of the translation vector with the rotation matrix. This is the relationship, and we have also seen how a cross product operation can be converted into this matrix multiplication form.

$$F = [t]_{\times} R$$

And then this fundamental matrix is called the essential matrix, and in this particular structure we denote it by another notation, E, in our discussion. So, this is what we discussed in the last lecture; now I will consider solving a problem just to illustrate how these concepts can be used to retrieve various information.
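For completeness, here is a tiny sketch of this construction, assuming a rotation R and translation t between the two calibrated cameras:

```python
import numpy as np

def skew(t):
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential_matrix(R, t):
    """E = [t]_x R for the calibrated pair P = [I | 0], P' = [R | t]."""
    return skew(np.asarray(t, dtype=float)) @ R
```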

(Refer Slide Time: 03:50)

So, take this problem; in this case it is a problem of simply computing the fundamental matrix. Look at the problem once again: the reference matrix P is given as [I|0], which is already in the canonical form, whereas P' is in a non-canonical representation; that means it is not calibrated, and its calibration parameters are not known to us; rather, we know all the elements of the projection matrix, and that is what P' is. So, with this configuration we would like to compute the fundamental matrix of the system. Let us see how we can do this for this configuration.

(Refer Slide Time: 04:34)

So, this is given that P’ as I mentioned P is also given here. So, here we will take this 3 X
3 sub matrix which is denoted here M and then we will also consider the column vector
which is denoted the m. So, these are the notations we have used earlier. So, we represent
this P‘ into this form into this notation with the sub matrices. Then we will apply the
corresponding know relations between M and m which will are configuration involving M
and m which will give us fundamental matrix.

So, first we will form the skew-symmetric matrix which performs the equivalent cross product operation with the vector. From [4 8 1] we can get this skew-symmetric matrix, and then the fundamental matrix is given by this computation: if I multiply this matrix, let me call it simply the $[m]_{\times}$ matrix, with M, then we will get a 3 × 3 matrix.

So, this is what the fundamental matrix is. As you understand, if the camera matrices are in these particular forms, then it is very easy to compute the fundamental matrix. We will also discuss how general forms of the camera matrices can be used for deriving the fundamental matrix. One property of the fundamental matrix I need to discuss here: as I mentioned, if I consider any image point x and multiply it with F, then you get the corresponding epipolar line in the right image plane.

And what is an epipolar line? An epipolar line is a line through the epipole in this configuration. So, you have this configuration: this is the centre C, this is the centre C', and this is the baseline; these intersecting points are the epipoles. So, this is e', and for a point x its corresponding image point is x'. The epipolar line is formed by these two points: one is the epipole in the right image plane, which we call the right epipole e', and the other is the image point. Now, consider the image point which is, incidentally, the left epipole e.

Now, its corresponding image point is e' itself; then what would the epipolar line be, and how do you form it? As we have seen earlier, this is actually a case of singularity: you cannot form a line by connecting a point to itself; you cannot define a line using just a single point. So, this is a geometric constraint that we understand geometrically. But mathematically, this constraint is expressed in this form: if I multiply the fundamental matrix with this epipole as the image point, what will I get?

I will get a 0, that means a zero column vector. So, this is how the constraint is expressed, and it means that the fundamental matrix has a null vector; from linear algebra it follows that the fundamental matrix is rank deficient. So, if I take the determinant of this matrix, you should get 0; you can also check that with this result.
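A quick numerical check of this rank deficiency for the construction F = [m]_x M; the 3 × 3 block below is a made-up placeholder (only m = [4, 8, 1] is taken from the example), so the point is the property, not the values.

```python
import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

m = np.array([4.0, 8.0, 1.0])            # fourth column of P' (from the example)
M = np.array([[2.0, 0.0, 3.0],           # placeholder 3x3 block of P' (not the slide's values)
              [1.0, 4.0, 0.0],
              [0.0, 1.0, 1.0]])

F = skew(m) @ M
print(np.isclose(np.linalg.det(F), 0.0))  # True: F is always rank deficient
```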

(Refer Slide Time: 09:19)

Now, we will proceed with the next example, and in this case we will consider a more general scenario of projection matrices, where we would like to compute the fundamental matrix and epipolar lines. For an epipolar line you definitely also need to compute the epipoles, and there is also the interesting concept of the end image points of the epipolar line. So, let us discuss the solution.
(Refer Slide Time: 09:57)

So, you would like to compute the fundamental matrix between P and P'. P, once again, is expressed in the form [M p4], and similarly the 3 × 3 block of P' is M'. First we need to compute the homography at infinity which we discussed earlier; that means we need to find the homography between corresponding points, that is, between a point and the corresponding point of the same scene point lying on the plane at infinity.

So, let us discuss this issue. We have a scene point, and suppose this is the plane at infinity. For convenience I am showing it in the diagram; as you understand, the plane at infinity is not physically realizable, but mathematically such a plane exists. And how is a point on the plane at infinity represented? You consider a particular direction and then use the scale factor 0; this is how a point at infinity is represented. So, this is my scene point, say X, and in a stereo setup I will be considering the corresponding images of this point.

So, this is an image point x, and this is my camera centre; the other image is x'. We are considering the homography between these two points, which we will express as $H_\infty$. So, how are they related? Let us see:

$$x = P\begin{bmatrix} d \\ 0 \end{bmatrix}, \qquad x' = P'\begin{bmatrix} d \\ 0 \end{bmatrix}$$

So, $x' = (M'M^{-1})\,Md$.

You note that $M^{-1}M$ is an identity matrix inserted in between, so this gives you the result: $(M'M^{-1})$ can be considered a 3 × 3 matrix $H_\infty$, and $Md$ is x. So, you find there is a homography between these corresponding points. This is what we will use here, because we know how the fundamental matrix can be obtained once we know the right epipole.

If the right epipole is e', then the fundamental matrix is given by $[e']_{\times}H_\infty$; that is what we discussed previously. Earlier we used the expansion of $H_\infty$ to get all those forms, but here we can simply compute $H_\infty$ directly. So, in this problem what we need to do is: first compute e'; e' is nothing but the image of the camera centre of the first camera, so we have to compute that camera centre and then take its image in the second camera, which gives e'. Then we need to compute $H_\infty$ from M and M' by performing the operation $M'M^{-1}$, and by combining these we can compute the fundamental matrix. So, let us carry out this computation as discussed.

(Refer Slide Time: 14:49)

So, we compute the camera centre in this form: we need to perform the $M^{-1}$ operation, and then the camera centre is $\tilde{C} = -M^{-1}p_4$, which is given in this form in the projective space. Then compute the right epipole as $P'C$, which is given in this form: it is the image of the camera centre, giving you the right epipole. Now you have to form $[e']_{\times}$, and you should compute the homography at infinity, as we discussed, which is $M'M^{-1}$.

And if I perform those operations, we will get this homography matrix. So, the fundamental matrix is $[e']_{\times}H_\infty$, which gives you this particular form. $[e']_{\times}$ is this one: once you find e', you can write down $[e']_{\times}$. I am ignoring the scale factor here, since the scale value is not important, and then you get the answer in this form. Note that the fundamental matrix is also an element of a projective space, so if I scale these values it still denotes the same fundamental matrix.
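The whole pipeline used in this solution can be sketched generically as follows; any finite cameras P and P' would do, and the numerical matrices from the slide are not reproduced here.

```python
import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def fundamental_from_projections(P, P2):
    """F for two finite cameras P = [M | p4] and P' = [M' | p4'], via e' = P'C and H_inf = M'M^-1."""
    M, p4 = P[:, :3], P[:, 3]
    C = np.append(-np.linalg.inv(M) @ p4, 1.0)   # centre of the first camera (homogeneous)
    e2 = P2 @ C                                  # right epipole: image of that centre in camera 2
    H_inf = P2[:, :3] @ np.linalg.inv(M)         # homography at infinity
    return skew(e2) @ H_inf
```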

(Refer Slide Time: 16:10)

So, the next problem is: if I give you an image point, say [15 20], in the reference camera, then I need to compute the corresponding epipolar line, that is, this line in the second image.

So, what do I need to compute? I will multiply this point with F, and then I will get the epipolar line; that is the relation by which the fundamental matrix takes an image point to its epipolar line.

(Refer Slide Time: 17:13)

So, this is what we need: the figure is just showing all the points lying along this ray. As we have discussed, there is a homography between these points, but finally the point lying at infinity is the limiting point. So, you see that the concept of the vanishing point is also present in this kind of epipolar line constraint: no other image point will exist, there will not be any intersection beyond this point. So, you have an epipolar line with a very finite representation.

At one end there is the right epipole, and at the other end there is the image point corresponding to the homography of the plane at infinity, that is, the corresponding point of the scene point lying on the plane at infinity. This is what we sometimes represent in this notation, and this diagram explains that part. So, this is how the epipolar line is computed: I take the point [15 20], whose representation is [15 20 1] in the homogeneous coordinate space, and multiply it with F. I get the epipolar line in this form, which I can reduce into this form by taking the third coordinate equal to 1.

So, this means the equation of this line is $-2.53x + 2.42y + 1 = 0$. To get $x_\infty$, because in this problem you also wanted to know the end points of the epipolar line, we can use the homography at infinity which we have already computed: simply multiply it with this point, and applying this we get that end point. So, we know the epipole in the right image plane and also the corresponding end point of that epipolar line in this way.
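A minimal helper that mirrors this computation; F and H_inf are whatever 3 × 3 matrices were obtained above, and the function name is mine.

```python
import numpy as np

def epipolar_line_and_far_point(F, H_inf, x_img):
    """l' = F x and the far end point H_inf x, for an image point (x, y) in the reference view."""
    x = np.append(np.asarray(x_img, dtype=float), 1.0)   # homogeneous coordinates
    l2 = F @ x
    x_inf = H_inf @ x
    return l2 / l2[2], x_inf / x_inf[2]

# e.g. epipolar_line_and_far_point(F, H_inf, [15, 20]) for the point discussed above.
```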

(Refer Slide Time: 19:24)

So, now let me summarize the properties of the fundamental matrix. We have discussed and also shown how you can compute the fundamental matrix given the projection matrices, how you can compute the epipoles given the projection matrices, and also how the homographies, particularly the homography at infinity, are induced in this case and how they are related to the elements of the projection matrices.

So, now let us consider the properties of the fundamental matrix; we will summarize some of them, which we have already highlighted. The very basic property is that if you have two corresponding image points of the same scene point, then they are related by the relationship $x'^T F x = 0$, where the reference image point is x and the second image point is x'.

If I consider it the other way, I can also write this relation as $x^T F^T x' = 0$, which means x' is now the reference image point. So, you note that the matrices F and $F^T$ are related in this fashion: F is the fundamental matrix of this configuration, and $F^T$ is the fundamental matrix of the stereo configuration when we exchange the reference camera.

And this is true for any pair of corresponding image points; that is the first property. Next is the transposition property that you have observed: F is the fundamental matrix of the camera setup (P, P'). We will denote a stereo setup by a pair of camera matrices, the first one in that couplet denoting the camera matrix of the reference camera.

So, P, P’ if F is the fundamental matrix for that setup and for P’ P then FT will be its
fundamental matrix. Then the properties with epipolar lines, that if we have an image point
x then epipolar line in the second camera is expressed as l’=F x. Similarly, if you have an
image point in the other plane x’ its epipolar line in the reference plane would be l=FTx’.

And this is what I mentioned earlier: this relation is called a correlation because it converts a point into a line; this kind of transformation is known as a correlation. F is also rank deficient; we discussed this while discussing the solution of a problem, which means that its inverse does not exist. Then, regarding how the epipoles are related: the epipoles always lie on the epipolar lines, and they are the intersections of all the epipolar lines, which is expressed by $e'^T F x = 0$; these are the conditions.

So, this should be equal to 0, i.e. e’ᵀFx = 0, and likewise eᵀFᵀx’ = 0, which is the same as (Fe)ᵀx’ = 0. It is just denoting the point-containment relationship of the epipoles on their respective epipolar lines. The other thing we made a very brief mention about is that Fe = 0, which means e is the right null vector of F. Similarly, e’ᵀF = 0, and e’ is the left null vector of F in the same way. So, these are some important properties of the fundamental matrix, and another property is that the determinant of F is 0.

Because, it is rank deficient we have mentioned we have discussed that and particularly it
is interesting to note that how many independent parameters are there. So, one thing I
already mentioned that there is a scale factor associated with F which means if I scale the
elements still it will give me the same, it will express all those relationships; you can note
from the relationship itself. So, for example, you take the first fundamental relationship
between the corresponding points say x’TF x = 0, if I multiply F with an scalar value k still
this relationship holds.

So, you can check with any other relation that would be the case and then this so, this is
one particular constraint that the scale is there is a scale factor. So, which means F has a 9
elements and it reduces one particular parameter by that constraint. The other thing is that
its a rank deficient, its determinant is 0 that is the second constraint. So, there will be 7
independent parameters out of those 9 elements so, its degree of freedom is 7.

(Refer Slide Time: 25:15)

So, we will be using this property to discuss about know to you know solve a problem
here. Suppose, you are given only fundamental matrix, in our previous problems we have
given you the projection matrices and from there you could very easily compute the
epipoles. Because, by applying that property that epipoles are images of camera centers
and given projection matrices you know how to compute camera centers. But, now if I just
give you simply fundamental matrix; can you compute its epipoles?

So, this is what we would like to see. In this problem you are also given an additional task, which is to compute the epipolar lines. So, the first part is about computation of epipolar lines: suppose you have two image points, (5, 8) and (7, -5), in the left image, and you have to compute the corresponding epipolar lines in the right image. Then you compute the right epipole, and then the left epipole. As I suggested previously, you should pause the video here and work out this problem yourself. So, let me discuss the solution of this problem.

(Refer Slide Time: 26:28)

So, consider this is the fundamental matrix and for the image point [5 6] its epipolar line
is given in this form. If I multiply with F, I will get its epipolar line l1 which is given by
this vector, similarly I can get the epipolar line for the other image points [7 - 5 1]. So, you
have two epipolar line so, that is giving me the answer for the first part. But, one interesting
thing I should note here you should note here that suppose this is your image plane where
epipolar lines are computed. That means, is second cameras, image plane of a second
camera; say this is your l1 and say this is your l2.

So, you know that all epipolar lines intersect at the epipole. So, simply by finding the intersection of l1 and l2 I can get the right epipole. What do I need to do? I only need to perform l1 X l2. If I perform that cross product, I will get the epipole. The numbers are large, but you can always choose the scale to make those values smaller, which is shown here as [-1 3 1].

(Refer Slide Time: 28:04)

So, the coordinate of the epipole in the right plane is [-1 3] in the conventional two-dimensional real geometry. To compute the left epipole, I could have taken any two arbitrary image points in the right image plane, found their epipolar lines, and applied the same technique. But instead I will find the right zero (the right null vector) of F, because that directly gives me the left epipole.

(Refer Slide Time: 28:44)

The right zero means that if I multiply the left epipole with F, it should give me a 0 vector, in 3-vector form. This is how the computation is shown here. Since it is a rank deficient matrix, I can ignore one of the rows; there are only two independent rows. Then I convert this equation, assuming that the coordinate of the right zero is given in the form [eL1 eL2 1]. So, I take the inverse of this part and the negative of that part; I can reduce it using the sub-matrix operations, and you can show that this is nothing but the computation of this particular configuration. You will get the right zero of F as [-3 5]; it already gives you the coordinates in the image coordinate plane, in the non-homogeneous coordinate system, because the scale factor has already been assumed as one.
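
The same computations are easy to script. Below is a minimal numpy sketch of the two recipes used in this problem: epipolar lines as l = Fx, the right epipole as the intersection (cross product) of two epipolar lines, and the left epipole as the right null vector of F. The matrix F used here is a made-up example (the one on the slide is not reproduced), while the two points follow the problem statement.

    import numpy as np

    # Hypothetical fundamental matrix, only for illustration.
    F = np.array([[ 0.0, -1.0,  3.0],
                  [ 1.0,  0.0, -1.0],
                  [-3.0,  1.0,  0.0]])

    x1 = np.array([5.0,  8.0, 1.0])        # left-image points in homogeneous form
    x2 = np.array([7.0, -5.0, 1.0])
    l1, l2 = F @ x1, F @ x2                # epipolar lines in the right image

    e_right = np.cross(l1, l2)             # intersection of the two epipolar lines
    e_right = e_right / e_right[2]         # fix the scale so the third coordinate is 1

    _, _, Vt = np.linalg.svd(F)            # right null vector of F = left epipole
    e_left = Vt[-1] / Vt[-1][2]

    print(e_right[:2], e_left[:2])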

(Refer Slide Time: 30:05)

I will show you also that actually the right epipole what you have computed as an
intersection of two epipolar lines we can compute it also a right zero of FT. So, you take
the FT so, just transpose the matrix F and once again apply the similar technique; that
means, ignore the third row and convert the equations, you form that these two equations
or set of equations in this form. And then you solve for the corresponding right epipoles
and you will get again you see that you are getting the same answer [-1 3]. So, with this
let me stop this lecture at this point. We will continue this discussion in my next lecture.

Thank you for listening this lecture.

Keywords: Fundamental matrix, essential matrix, right epipole, left epipole

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture -18
Stereo Geometry Part- III

We will continue the discussion on Stereo Geometry and we discussed the concepts related
in this geometry like epipoles, epipolar lines, epipolar planes, then the fundamental matrix
and how fundamental matrix converts an image point to an epipolar line in the other plane
where the corresponding image point lies. We also discussed about the fundamental matrix
of calibrated camera setup when this matrix is called essential matrix. Now in this lecture,
we will further elaborate the parameters or elaborate the properties of this essential matrix.

(Refer Slide Time: 01:08).

So, in brief, it can be called as stereo geometry for calibrated cameras. So, in this case, we
represent say the projection matrix P=K[I|0]. So, since the calibration matrix I can convert
it into a canonical form of [I|0] and also the other matrix I can also convert it in its canonical
form, because just by applying rotation and translation parameter itself all the image
coordinates are represented in normalized image coordinate system.

So, I can also convert it as [R|t]. This is how, in the canonical form, the coordinates can be converted just by multiplying with the inverse of the calibration matrix. You see that in the first camera x is converted to xc, which is the coordinate in the canonical setup of the camera. Similarly, x’ is converted to xc’ by applying the inverse of the respective calibration matrix. So, this is the expression.

So, now, if I apply the relationships of fundamental matrix in these setup, then that
fundamental matrix is called essential matrix or E, but the relationship remains the same.

𝑥𝑐′𝑇 𝐸𝑥𝑐 = 0

You note that in general it is x’ᵀFx = 0 when we do not have any idea of K and K’ from the projection matrix. So, how is the uncalibrated fundamental matrix, that is, the fundamental matrix of an uncalibrated stereo setup, related to that of a calibrated stereo setup? Their relationship can be expressed using the calibration matrices themselves.

As you can see with this simple derivation that

𝑥 ′𝑇 𝐹𝑥 = 0,

then substitute x = Kxc and x’ = K’xc’, i.e. x’ᵀ = xc’ᵀK’ᵀ, and consider the middle part of this composite expression, which is a 3 X 3 matrix. This matrix is nothing but the essential matrix E. So, the relationship between fundamental matrix and essential matrix can be expressed using these calibration matrices as

𝐸 = 𝐾 ′𝑇 𝐹𝐾

K’ is the calibration matrix of the second camera and K is the calibration matrix of the first
camera.

Similarly you can also get fundamental matrix from essential matrix by using those
calibration matrix. So, the calibration matrix, it reduces into this from because if I consider
once again you note that relationship; if you remember it. Suppose I have [I|0] and then
[R|t] see these are the 2 camera matrices, this is P and this is P’. So, if you remember that
this structure was [M | m] in general and the fundamental matrix was given as this [𝑚]𝑋
matrix, which is a skew symmetric matrix into M. So, that is how the fundamental matrix
can be derived.

So, in this particular setup using the same relationship I can write

𝐸 = [𝑡]𝑋 𝑅

So, this is how the essential matrix can be represented or can we derive. So, I will replace
that value. So, here you note that there are only 6 parameters involved in this particular
operations like 3 parameters related to the translation of camera center. There is a
translation parameter and rotation of the orientations rotation of the axis, so that is also 3.

So, there are 6 parameters which are involved in constructing the essential matrix and since
scaled is involved; that means, if I multiply essential matrix with a scale value K still all
this relation holds. So, out of them one again 5 are independent and also the epipoles in its
canonical representation, so it also has those relationships holds; that means,

𝐸𝑒𝑐 = 𝑒𝑐′𝑇 𝐸 = 0

Note that a transpose is needed in the second expression, because ec’ enters as a left null vector, so we take the transpose there. The rank of E is also 2, like F; so det(E)=0 and, as I mentioned, the d.o.f. of E is 5, because a scale factor is associated with E and it has 6 parameters. So, it should be 5.
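
As a quick illustration of these relationships, here is a small numpy sketch that builds E = [t]_x R from a hypothetical rotation and translation, derives F for hypothetical calibration matrices, and checks that E = K’ᵀFK and that both matrices have rank 2. All numbers here are made up.

    import numpy as np

    def skew(v):
        # [v]_x such that skew(v) @ a == np.cross(v, a)
        return np.array([[0.0, -v[2], v[1]],
                         [v[2], 0.0, -v[0]],
                         [-v[1], v[0], 0.0]])

    theta = np.deg2rad(10.0)              # hypothetical rotation about the y axis
    R = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
                  [ 0.0,           1.0, 0.0          ],
                  [-np.sin(theta), 0.0, np.cos(theta)]])
    t = np.array([1.0, 0.2, 0.0])         # hypothetical translation

    E = skew(t) @ R                       # essential matrix E = [t]_x R

    K  = np.diag([800.0, 800.0, 1.0])     # hypothetical calibration matrices
    Kp = np.diag([700.0, 700.0, 1.0])
    F = np.linalg.inv(Kp).T @ E @ np.linalg.inv(K)

    print(np.linalg.matrix_rank(E), np.linalg.matrix_rank(F))   # 2 2
    print(np.allclose(Kp.T @ F @ K, E))                         # True: E = K'^T F K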

(Refer Slide Time: 06:44).

Now, consider a special kind of stereo setup where you have a pure translation of the camera centre; pure in the sense that there is no rotation of the axes. In that case we can very easily express the projection matrices with a simple structure: for a pure translation, they can be written as P = K[I|0], representing it again in the uncalibrated form, and P’ = K’[I|t]. So, it is a simple translation, and the term K’t comes from that translation of the camera.

So, your fundamental matrix is expressed in this particular setup is in this form

𝐹 = [𝑒′]𝑋 𝐾′𝐼𝐾 −1

You remembered that if it is in general case it should be rotation matrix R. So, in this case
rotation matrix itself is an identity matrix because there is no rotation and then this can be
reduced as

𝐹 = [𝑒′]𝑋 𝐾′𝐾 −1

That is a structure of fundamental matrix and if I consider both the calibration matrix is a
same which means they are the same camera. They are using the same camera for taking
the images or it has been fabricated in such way that all other intrinsic parameters they
remain the same. So, then fundamental matrix takes a very simple form of [𝑒′]𝑋

So, just from the epipole you can compute the fundamental matrix itself. Whereas, [𝑒′]𝑋
is given in this form;

[e']_X = |  0    -e_z   e_y |
         |  e_z    0   -e_x |
         | -e_y   e_x    0  |

This is the definition of [e']_X. If you consider a special kind of translation, parallel to the x axis, then e’ is the vanishing point of the x axis, which is given by [1 0 0]. So, there are only two nonzero elements and the structure is very simple. If I apply this relationship to the corresponding points, we can show that the y coordinates actually remain the same.

So, in this representation there we need to highlight few things that x’ here is a vector. So,
x’ is a vector and in my representation we consider for example, this points are say

x’ = [x’  y’  1]ᵀ   and   x = [x  y  1]ᵀ

F = | 0  0   0 |
    | 0  0  -1 |
    | 0  1   0 |

𝑥 ′𝑇 𝐹𝑥 = 0

x’ᵀ F x = [x’  y’  1] [0  -1  y]ᵀ = -y’ + y = 0,   i.e.   y’ = y
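
This is easy to verify numerically. The sketch below builds [e']_x for e’ = [1 0 0] and checks that the resulting F forces corresponding points to share the same y coordinate; the two test points are made up for illustration.

    import numpy as np

    def skew(v):
        # [v]_x, the 3x3 skew-symmetric matrix of v
        return np.array([[0.0, -v[2], v[1]],
                         [v[2], 0.0, -v[0]],
                         [-v[1], v[0], 0.0]])

    e_prime = np.array([1.0, 0.0, 0.0])   # vanishing point of the x axis
    F = skew(e_prime)                     # F = [e']_x for identical calibration matrices
    print(F)                              # [[0,0,0],[0,0,-1],[0,1,0]]

    x  = np.array([12.0, 7.0, 1.0])       # a left-image point (hypothetical)
    xp = np.array([20.0, 7.0, 1.0])       # a match shifted only along x
    print(xp @ F @ x)                     # 0.0, i.e. -y' + y = 0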

(Refer Slide Time: 11:20).

Let us discuss how we can compute depth under pure translation. Consider these two
cameras in the stereo setup as we have already mentioned that we have the corresponding
camera matrices P and P’ and where P is given by the K[I|0] which means the center of
the first camera would be at in the origin and also this translation between these two
cameras is given by t = [t_x  0  0]ᵀ.

So, you are considering that there is also no rotation. So, it means the horizontal direction
it is a pure translation and there is no rotation matrix R. As we can see in the case of second
camera matrix here, we have same identity sub matrix here in this part instead of a rotation
matrix of or we can say rotation equal to the identity matrix which means no rotation and
this is a translation part and this is a second camera whose calibration matrix is also
different which is K’ and our objective is that to compute the depth here.

So, in this case, since it is in canonical form, the principal plane is the xy plane of the coordinate system, that is, of the first camera's camera-centric coordinate system. This is your canonical form, and this is the z direction, so your depth will be the z coordinate in this case. Let us see how we can compute this Z considering the corresponding points x and x’ in the stereo view. Now, the camera centre of the second camera should be (−t_x, 0, 0). The translation t in this form gives that result because, as we know, for the camera centre C’ (I am using a tilde to denote that it is expressed in world coordinates),

𝑡 = −𝑅𝐶̃

R = I. So,𝐶̃ = −𝑡. So, which will give you (-tx, 0, 0) that is the camera center in this form
in this kind of configuration and then we can compute the corresponding world coordinate
of the same point. So, here same point is X and in our convention when I am using the
world coordinate in nonhomogeneous coordinate convention I will be using tilde on top of
it.

So, as you can see that what we have done here.

𝑋̃ = 𝑍𝐾 −1 𝑥 → 𝑋̃ + 𝑡 = 𝑍𝐾 ′−1 𝑥′

Similarly if I consider the other camera, there also we can use the same convention, but
since there is a translation involved here.

It is because here also there is no rotation. So, your depth your principal plane is also the
same xy plane and you can use Z coordinate as the corresponding depth. So, now, if I
consider K=K’; that means, two camera have the same calibration matrix then this
relationship can be further simplified. It takes a very simple form and as we can see from
here itself if we subtract this equation, if you subtract see this equation from this one then
you will get this relationship; that means, you get

ZK⁻¹(x’ − x) = (X̃ + t) − X̃ = t   →   x’ = x + Kt/Z

That is how we get x’, the image coordinate of the same scene point in the second camera. So, this is the relationship between x and x’: there is a shift in the horizontal direction, the amount of shift is given by Kt/Z, and from this shift you can determine the depth

Z = Kt / (x’ − x)

So, this is how we can compute depth and this relationship can be further simplified once
we take the take a simple form of K.

(Refer Slide Time: 17:55).

So, in the calibration matrix we consider only focal length is a parameter and all other
parameters they are initialized to 0, which means principal point is also the center of the
image coordinates there is no skew and under this situation you have a very simple
calibration matrix. So, this relationships; that means,

Z = Kt / (x’ − x),   with   t = [t_x  0  0]ᵀ

which reduces to

Z = f·t_x / (x’ − x)

So, this is one of the familiar equations for computing depth in a stereo setup where the optical axes of the two cameras are parallel and both cameras have the same focal length, i.e. identical calibration matrices, and the shift between the two cameras is tx. Incidentally, the value x’-x is the shift of the corresponding point along the horizontal direction, computed from the scalar x coordinates. You should note once again that earlier x and x’ denoted vectors, whereas in this final computation x’-x uses the scalar x coordinates of those two points.

So, this is also called disparity. So, in a stereo setup we can compute the shift or disparity
and f and tx could be the parameter of the stereo imaging system. So, only using image,
computations over those stereo images you need to compute disparity at every point and
that is how you can obtain the depth.
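
As a small sketch of this depth-from-disparity relation: the helper below just evaluates Z = f·t_x/(x’ − x); the numbers used to call it are illustrative and happen to match the worked example later in this lecture.

    def depth_from_disparity(x_left, x_right, f, tx):
        # Z = f * tx / (x' - x) for a parallel stereo rig (same K, pure x-translation)
        disparity = x_right - x_left
        return f * tx / disparity

    print(depth_from_disparity(6.0, 9.33, f=6.0, tx=10.0 / 6.0))   # ~3.0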

(Refer Slide Time: 20:09).

This could be further generalized when we have a general or arbitrary configuration of two cameras instead of having a parallel stereo configuration. Let us
consider that you have a general projection matrix for the second camera, whereas the first
camera is a reference camera and there we follow the same canonical structure with the
calibration matrix K and in the second camera we have these general structure where you
have the calibration matrix K prime which is a different camera. It could have different
focal length etcetera and it has rotation and translation parameters as well, which means
you know they are not separated by simple horizontal motions C’ or the center of the
camera is rotated and translated with respect to C. It has a general configurations.

So, under this situation, how can we compute the depth? That is what we would like to discuss. There is a simple way by which you can map this problem to a parallel stereo problem and use the previous relationships that we derived. What can we do? We can consider an imaginary camera which has the same centre C but is related by only the rotation, with the same rotation matrix R, and then we can compute the homography between these two image planes. We know that this establishes a homography; we already discussed it for two-dimensional perspective projection during projective transformation, and we will also see how this homography can be obtained here itself.

So, once we have these things for example, he you consider originally you have two point
say x this is the reference camera. So, we will be considering this is x and another point
say x’. So, these are the say 2 corresponding point of the same. Now due to rotational
homography we bring x to the new imaginary camera setup where will be we will be
considering this point is a x 2. Now these 2 since we have exactly applied the same
rotational matrix R,

Now this image plane and this image plane they have they are they have that parallel
stereo configuration. So, now, you can apply the parallel stereo results under the same;
that means, what we need to do, we need to consider now the corresponding points under
this 2 camera setups and x’ and use them use this corresponding pair of corresponding
points to compute the depth. So, let us elaborate this computations further. So, let us first
compute the rotational homography in this case.

So assume that you have the second camera let me rub this thing just to make it more
clear. So, consider that the second camera matrix as P 2 that is a imaginary camera as I
mentioned which we considered whose image plane should be parallel to the image plane
P prime after performing this rotational transformations on the image planes. So, P 2 since
there is it is related with the original reference camera by the rotational matrix R and also.
I would like we would like to have the same calibration matrix of the second camera K
prime.

So, the projection matrix of P2 is given in this way: it has the calibration matrix K’, the rotation matrix R as its parameter, and its centre is still at the origin 0, the same as the reference camera. Under this configuration, what should be the homography between these two image planes? Here we can see that x2, being the image of the same scene point X, can be written with K’ — the K’ which is not very
visible from this part. So, x2 = K’[R|0]X; this projection P2X is what x2 is. Simplifying it further, we can take R outside.

(Refer Slide Time: 25:40).

So, it becomes K’R[I|0]X. Now this can be expanded as K’RK⁻¹·K[I|0]X, and K[I|0]X can be recognized as x, the image of the scene point in the reference camera. So, once again, x2 is x mapped by a homography H. Originally x and x’ were the corresponding points; now x2 and x’ form the corresponding pair. So, the relationship between x2 and x is given in this form.

𝑥 (2) = 𝐾′𝑅𝐾 −1 𝑥

So, the rotational homography under this situation it will become 𝐾′𝑅𝐾 −1 . So, this is a
rotational homography. So, these are rotational homography. So, once you have obtained
this now you can apply the stereo results between x2 and x’, which means that I can write

𝐾′𝑡 𝐾′𝑡
𝑥 ′ = 𝑥 (2) + = 𝐾 ′ 𝑅𝐾 −1 𝑥 +
𝑍 𝑍

So, that is how x’ and x is related. Now it is a simple mathematical manipulations algebraic
manipulations.

Z = K’t / (x’ − K’RK⁻¹x)

So, from here we can get this relationship. So, in this way we can obtain the depth or Z
coordinate under this situation of the same of this kind of arbitrary general motion of
camera.
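
A direct transcription of this two-step recipe (de-rotate with the homography, then use the parallel-stereo shift) is sketched below. It assumes, as the derivation above does, that after removing the rotation the remaining motion is a sideways translation, i.e. t has no component along the optical axis; the function name is mine.

    import numpy as np

    def depth_after_derotation(x, x_prime, K, K_prime, R, t):
        # x, x_prime: homogeneous image points [x, y, 1] in the first and second camera
        H_rot = K_prime @ R @ np.linalg.inv(K)   # rotational homography K' R K^-1
        x2 = H_rot @ np.asarray(x, dtype=float)
        x2 = x2 / x2[2]                          # image of the point in the rotated camera
        shift = np.asarray(x_prime, dtype=float) - x2
        Kt = K_prime @ np.asarray(t, dtype=float)
        return Kt[0] / shift[0]                  # Z = K't / (x' - K' R K^-1 x), x component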

(Refer Slide Time: 28:06).

So, let us work out a problem to understand these relationships, that is, how the depth or Z coordinate can be computed. Consider a stereo setup with the projection matrices P and P’ given here as shown, and suppose you are given the two image coordinates of a 3-D point, (6, 8) and (9.33, 8), in the left and right cameras.

So, you should note that the constraint of keeping the y coordinate the same is already satisfied, because that is what we discussed as a property of this particular scenario. You can find that the sub-matrix M remains the same in both projection matrices and only the last column vectors differ. This means it is a case of cameras related only by a translation of origins, because the camera centre is given by minus M inverse times that column vector, and since M is the same there is no relative rotation between the two systems.

So, this is a case of pure translation. In this case, as we have already seen, the y coordinates of the corresponding points should remain the same, because it is a special case of stereo and the fundamental matrix is given simply by [e’]_X, as we have seen earlier.

So now, in this particular setup, let us compute the depth of the scene point.

(Refer Slide Time: 30:14).

So, this is the scenario your P = K[I|0] and P’ = K [I|t] where

K = | 6  0  0 |
    | 0  6  0 |    and    t = [10/6  0  0]ᵀ
    | 0  0  1 |

So, this is K and this is t. Note that t needs to be scaled by the value 6, because in the previous relation the term was Kt; only after multiplying by K do you get back the earlier column. So,

x’ = x + Kt/Z   →   Z = (6 · 10/6)/(x’ − x) = 10/(9.33 − 6) ≈ 3

So, with this let me stop my lecture here and we will continue this discussion in my next lecture.

Keywords: Essential matrix, Fundamental matrix, Projection matrix, computation of depth

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 19
Stereo Geometry Part – IV

We are discussing Stereo Geometry, and we discussed how the fundamental matrix of this geometry characterizes the system. In this lecture we will consider how to estimate a fundamental matrix given a pair of images of a stereo imaging system. So, consider the relationship between the fundamental matrix and corresponding pairs of points.

(Refer Slide Time: 00:53)

And as we have already discussed that you have two pair of points that is x’; that is in the
right stereo image and x in the left which is a reference stereo image. Then the relationships
with the fundamental matrix between and also this pair of points can be described in this
form; that is 𝑥 ′𝑇 𝐹𝑥 = 0; if I consider two images of a stereo system as I mentioned. And
consider a point this is x; this is the reference camera system or reference image plane of
the stereo system and its corresponding point x’.

Now, the role of the fundamental matrix, or rather its property, is that it transforms a point into an epipolar line. If I apply Fx, it gives the line on which x’ must occur. So, this line is given by the expression Fx, and applying the point-containment relationship of projective space, since x’ lies on this line, x’ᵀFx = 0; that is how we derived this relationship.

So now we will see that using this property how we can estimate a fundamental matrix
given a pair of image points; so for that we need to know say a few corresponding pairs of
points, so this is a problem now.

(Refer Slide Time: 03:16)

So, this is the expansion of this particular equation. If I consider the element of the
fundamental matrix is given in this form; it is a matrix element and using the notation of
index indexing notations of an element of a matrix that element of an ith row and jth
column is given by fij.

F = | f11  f12  f13 |
    | f21  f22  f23 |
    | f31  f32  f33 |

Also consider the column vector notation of a point, for example x’ = [x’ y’ 1]. In our convention the point is denoted x’, and its x coordinate is the scalar quantity also written x’; then come the y coordinate and 1. So, this is the representation of the point x’. We can thus write [x’ y’ 1], which is x’ transpose, multiply it with the fundamental matrix F, and multiply with the corresponding point x in the reference image, which is given by the column vector [x y 1].

So, now consider the multiplication of these matrices; if you expand it, you get this equation. Again, it is convenient to represent it in matrix form, because then a set of linear equations can be expressed through matrix operations. A single equation is given in this way, and in the matrix notation our objective is to compute the elements of the fundamental matrix; those are the unknown variables, or unknown parameters, in our case.

The observed values, i.e. the point coordinates, are what is given to you; we will discuss later, in my next topic, how these corresponding pairs can themselves be computed. For the time being let us assume these points are given to you; even through manual observation, in a crude way, you can get a few such matching points. So, these coordinates are given, and this equation in matrix form is written here.

(Refer Slide Time: 06:42)

Let me write it in a proper notation which is more convenient to understand. So, we will
see that these are the coefficients of those equations.

𝑥 ′ 𝑥𝑓11 + 𝑥 ′ 𝑦𝑓12 + 𝑥 ′ 𝑓13 + 𝑦 ′ 𝑥𝑓21 + 𝑦 ′ 𝑦𝑓22 + 𝑦 ′ 𝑓23 + 𝑥𝑓31 + 𝑦𝑓32 + 𝑓33 = 0

So, I am converting this equation in the matrix form which is written here also. So, because
this is for the sake of convenience a column vector in this slide is shown by a row vector
by using transpose operation; so, you are computing all these elements of fundamental
matrix. So, you can see that all the rows are again represented as a column vector and they
are; they are concatenated in this form.

[A]_(nX9) [f]_(9X1) = [0]_(nX1)

So, this equation is formed with only one pair of corresponding points, as I explained.

Suppose you know a few more points so; that means, I can have; I can represent all these
equations in a general matrix form. So, all the coefficients of those equations will come in
this matrix. So, this is for the first pair of points; for the next pair of points once again I
will put these elements observations and I can write this equation. So, in this way if I know
n pair of points; n pairs of points each one will give me one equation. So, as you can see
what is the number of columns in this case? You have here 9 unknowns because
fundamental matrix the dimension is 3 X 3; so, there are 9 unknowns. So, in this form we
have I mean there are 9 columns and number of rows would be n and there are 9 columns.

So, this vector can be shown as a column vector f and this matrix let me write it as an A.
So, [𝐴]𝑛𝑋9 [𝑓⃗]9𝑋1 = [0]𝑛𝑋1 ; so 0 is also a column vector because each equation is giving
me 0. So, if there are n equations; so 0 will be a column vector of dimension n X 1 and as
we know this should be 9 X 1 and this dimension should be n X 9. So, given n pairs of
points corresponding points; you can form these set of equations.

So, now your task would be to solve this equation to get f and you can see this is a
homogeneous equation; a set of homogeneous equations, again you can apply the least
square error estimate to solve this; however, you can make this equation as a non
homogeneous set. Suppose I consider the; I said this f33=1; like we did earlier also. So,
one of the value let us consider say to certain particular value because I; I mentioned that
though there are 9 elements; number of independent parameters in a fundamental matrix
is 7.

First thing that it is an element of a projective space. So, if I multiply all the elements by
is constant k still this relationships it holds; that means, that is also a solution of this system.
So, fundamental matrix multiplied by any scalar constant they are all equivalent that we
discussed earlier. So, one of the element could be considered as a scale factor say if I take

this equal to 1; this would be taken care of. So, in proportion to that value all other values
are expressed that is one particular feature property; so, that it is at least one reduction of
the number of independent parameters.

The other constraint in a fundamental matrix that we also discussed that it is a singular
matrix. So, if I take the determinant of this matrix; it will be 0, so you will get another
equation. So, if I expand determinants you will get equations in terms of all these elements
and that should be equal to 0; so that is another constraint. So, that is why there are number
of independent parameter is 7.

Now, for the time being let us consider that now we will be solving for 8 parameters. We
will later on we will see that how that properties of singularity could be enforced; for the
time being let us let us approximately estimate fundamental matrix where the singularity
constraint is waived. So, in that case as I mentioned that one of the parameter you can set
to some known value; some given value. So, let us check to this value 1 and then you can
convert this equation in this form. So, let me rub this part.

(Refer Slide Time: 13:24)

So, once again; I can write this equation as

[x’x, x’y, x’, y’x, y’y, y’, x, y, 1][f11, f12, f13, f21, f22, f23, f31, f32, f33]ᵀ = 0

This is the same relation written as a product of a row vector and a column vector. Now, this equation becomes a non-homogeneous one once we fix f33, because the right-hand side is then not 0. In the same way, for all the n corresponding pairs you get n such equations.

So, let me call that reduced matrix A’ and the reduced unknown vector f’; these are the dimensionally reduced versions because of this conversion, and the product should be equal to minus f33. For example, if I set f33 equal to 1, this is the equation we get; actually you get a column vector of minus 1, because every equation will have minus 1 on the right.

So, the dimension of this should be n cross 8; for dimension of this vector would be 8 cross
1 because there are 8 unknowns and this is n cross 1. So, applying the solution for system
of non homogeneous equations; you can perform the least square estimate and you know
that. But if I consider this column vector as say b; so I would be solving

𝐴̃𝑓̃ = 𝑏̃

And since there are more number of equations and usually; you should have more number
of equations at least minimally you require 8 equations to solve it then exactly you get a
solution. Because this matrix would be 8 X 8 and if it is not a singular matrix you perform
𝑓̃ = 𝐴̃−1 𝑏̃ that would give you; so, when just you have 8 corresponding pairs of points I
can simply write 𝑓̃ = 𝐴̃−1 𝑏̃

So, that would be the solution in that case and that is a requirement you should have 8
numbers of corresponding points that is the minimum requirement. And when it is n greater
than 8; then what you need to do? You need to perform a least square estimate; that means,
you will have to minimize the error.

So, error is defined in this way

||Af − b||²,   which is to be minimized,

so the deviation of the vector b from vector Af. These dimension of Af would be also n X
1 and b would be n X 1. So, if I take the norm of this vector which is nothing, but the sum
of the square of its elements and you can take the square root also or sum of the square of

its elements. And that should be I mean you can use also square just to show that square
of the norm; that should be equal to that you have to minimize this.

So, you have to find out that f which will minimize this norm, can be solved by least square
error method. There is a nice derivation we can perform by applying almost like matrix
operations. But we should use the theory of minimization in this case by taking the
derivatives of these expression and equating with 0; we have to solve these equations then
you will get these solutions.

Just for the sake of easy reference, the resulting solution is

f̃ = (ÃᵀÃ)⁻¹ Ãᵀ b̃

So, now you see that this becomes a square matrix. So, this is this square matrix is in the
form of 8 X 8; A is n X 8; so a transpose is 8 X n into n X 8. And of course, the whole
thing this would be 8 X 1.

So, this is your least square error solution that is for the set of non homogeneous equations.
So, let me summarize these expressions because I have also the slide set for this
representation.
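
The whole estimation step is easy to put together in a few lines of numpy. The sketch below builds one row per correspondence, fixes f33 = 1 as above, and solves the non-homogeneous least-squares problem (no rank-2 enforcement yet); the function name and array layout are my own.

    import numpy as np

    def estimate_F_f33_fixed(pts_left, pts_right):
        # pts_left, pts_right: (n, 2) arrays of matching (x, y) image coordinates, n >= 8
        pts_left = np.asarray(pts_left, dtype=float)
        pts_right = np.asarray(pts_right, dtype=float)
        x, y = pts_left[:, 0], pts_left[:, 1]
        xp, yp = pts_right[:, 0], pts_right[:, 1]

        # one row per pair: coefficients of [f11 f12 f13 f21 f22 f23 f31 f32]
        A = np.column_stack([xp * x, xp * y, xp, yp * x, yp * y, yp, x, y])
        b = -np.ones(len(x))                 # f33 = 1 moved to the right-hand side

        f, *_ = np.linalg.lstsq(A, b, rcond=None)   # least-squares solution of A f = b
        return np.append(f, 1.0).reshape(3, 3)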

(Refer Slide Time: 20:26)

So, as I mentioned this is how we represent the system of equations given n pairs of
corresponding points. And in our discussion we also mentioned that the elements of f; they
can be retrieved up to scale; that means, whatever solution you get if you multiply by any
scalar constant; that would be also a solution of this equations. And also we discussed how
minimum 8 point correspondences are required. There is a technique which we will discuss
later on that even you can do its 7 point correspondences; just to mention that because as
I mentioned there are 7 independent parameters. So, we will discuss this technique in my
next slides.

(Refer Slide Time: 21:27)

So, this is a representation of the set of equations with n pairs of corresponding points and
you can see this is that matrix A; I was referring at. So, this is matrix A and this is the f
that is a column vector of consisting of elements of fundamental matrix; that is the
unknown and this is a set of equations.

(Refer Slide Time: 22:06)

One of the problem with these presentation is that the coordinates of these points; they
have a very large range. And as you are taking products in some cases; you understand
product of two large numbers will be also very large; so, the dynamic range of these
coefficients of this equation of the unknown parameters; that widely vary.

You can see that for the products the ranges could be in the tune of 10000; those are the magnitudes over which they vary. They could in principle be negative or positive, but since image coordinates are usually all positive, let us assume all values are positive. So, the product terms range up to about 10000, whereas the coefficients with single variables would have a smaller range of about 100, assuming the image dimensions are of the order of 100. So, for a 100 X 100 image all the single coordinates stay within 100.

So, that is why since the orders of magnitude to difference it is; there is a lot of difference
in that orders of magnitude. So, numerical stability is less in these computations and it
yields poor results. So, we will see how to handle that part let us proceed.

(Refer Slide Time: 23:47)

So, one of the technique to take care of this particular fact that why not you transform the
coordinates within a smaller; within this unit range of [-1, 1], then you can see even the
products of coordinates also they have equivalent equal ranges of the range of the single
coordinates; all the coefficients they will have this.

So, one of the technique is that you can transform these coordinates between [-1, 1] X [-1,
1]. That means, suppose you have your image coordinates it varies; if I assume say this is
0, 0 and say this is the width and height of the image say (xmax, ymax). So, these are the
coordinates within this ranges all the coordinates are represented. So, now we I would like
to transform this coordinate.

So, the variations are there; that means, exactly in the middle you have the centre point is
represented by origin (0, 0); so this is how coordinate is transformed. And you can see that
now if with transformations you get transformations in both the images independently, still
you can recover the fundamental matrix that also let us understand. Say you have an image
as I mentioned say any point x; now it is transformed to a point, this transformation is
given by this in the say left image; that means, a reference image.

For the right image; let us consider that another transformation we are performing; so we
are expressing in this. So, these are linear transformation;

(𝑇 ′ 𝑥 ′ )𝑇 𝐹(𝑇𝑥) = 0

So, if I consider the equation of the fundamental matrix; involving fundamental matrix;
so, this is the equation that we are getting; so now you consider apply this transformation;
so this is T’.

So, what will I do? I will write the relation in terms of the transformed points: if the transformed points are y and y’, then y = Tx and y’ = T’x’, so x = T⁻¹y. The requirement is that the transformations should be invertible, and x’ = T’⁻¹y’; now the relation can be rewritten.

(Refer Slide Time: 28:13)

So, now x’ is written as T’⁻¹y’ and x as T⁻¹y, giving (T’⁻¹y’)ᵀF(T⁻¹y) = 0. Performing the transposition, this becomes y’ᵀ T’⁻ᵀ F T⁻¹ y = 0, where T’⁻ᵀ denotes (T’⁻¹)ᵀ.

Now, you can see that T’⁻ᵀFT⁻¹ is also a 3 X 3 matrix; this is the matrix you get after the transformation, and it is the fundamental matrix associated with the transformed points y and y’. So, what you can do is compute the fundamental matrix F̃ using the corresponding transformed points, whose coordinates all lie in these small ranges, and then transform it back to get the original fundamental matrix. The relationship we use is T’⁻ᵀFT⁻¹ = F̃.

So, now we multiply on both sides: on the left by T’ᵀ and on the right by T. The inverse factors become identities, and we get F = T’ᵀF̃T. This is how you can get back the original fundamental matrix. So, with this let me stop here; we will continue this discussion of estimation of fundamental matrix in my next lecture.

Thank you very much.

Keywords: Fundamental matrix, least square solution, normalized algorithm

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 20
Stereo Geometry Part – V

We are continuing our discussion on estimation of fundamental matrix and in the last
lecture we discussed how the equations can be formed using the pairs of corresponding
points in relation to the elements of fundamental matrix. And, by solving those equations
you can get those elements of fundamental matrix. One of the things we are discussing
that we require to perform coordinate transformation as the dynamic ranges of coefficients
of those equations they widely vary because of the nature of the equations.

(Refer Slide Time: 01:00)

So, we will continue that discussion and as we discussed that we will map the coordinate
points to these ranges and bring them into these ranges. An example is shown here. Usually
that is a very common case that no image coordinates vary between say 0 to 700 in this
example and 0 to 500 there are other values. So, it is a typical range and there you can see
product of those coordinates will have a very large dynamical range compared to single
variables of coordinate points, single points. So, the example of a transformation will
convert these coordinates into these ranges as given here.

𝑇𝑟𝑎𝑛𝑠𝑓𝑜𝑟𝑚 𝑖𝑚𝑎𝑔𝑒 𝑡𝑜 ~ [−1,1]𝑋[−1,1]

You can see that this transformation is given in this form, that if I multiply this matrix with
x’, I will get the transformed point.

So, let me denote the transform point as say xt coordinate; that means,

x_t = (2/700)·x − 1

y_t = (2/500)·y − 1

So, a given (x, y) is converted into these values; this translates the origin to the centre of the image and maps the coordinates into [-1, 1] in this fashion. Anyway, as discussed, you then have to apply the corresponding computation to transform back whatever you get as a fundamental matrix out of this estimation; the inverse transformations are applied to get the original fundamental matrix in the respective coordinates of the image planes.
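
A small sketch of this normalization and of undoing it afterwards is given below, using the 700 x 500 image size of the example; the helper name is mine, and F_tilde stands for whatever matrix the estimation on the normalized points returns.

    import numpy as np

    def normalizing_transform(width, height):
        # maps pixel coordinates of a width x height image into [-1, 1] x [-1, 1]
        return np.array([[2.0 / width, 0.0, -1.0],
                         [0.0, 2.0 / height, -1.0],
                         [0.0, 0.0, 1.0]])

    T  = normalizing_transform(700, 500)   # left (reference) image
    Tp = normalizing_transform(700, 500)   # right image (could differ in general)

    # ... estimate F_tilde from the transformed correspondences (T x, T' x') ...
    F_tilde = np.eye(3)                    # placeholder for that estimate

    F = Tp.T @ F_tilde @ T                 # undo the normalization: F = T'^T F_tilde T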

(Refer Slide Time: 03:28)

We will now continue, we will see what other aspects are there. So, as we mentioned that
this is a paper where actually this particular technique is discussed and it has been shown
that least square error method yields good results when we perform this transformation.

(Refer Slide Time: 03:45)

So, now we will consider the other aspect of this estimation, I mentioned that in the
previous case that we have not applied the constraint of singularity on the fundamental
matrix. So, the solution what you get that is not truly giving you a fundamental matrix
unless it satisfies that singularity constraints. So, how to enforce that singularity
constraint? So, that is what we will be discussing here. So, here you see that this constraint
is expressed in the form of that the determinant of the matrix F should be equal to 0. And,
also we know since its rank deficient, it should be less than 3 and in this case it would be
2.

det 𝐹 = 0 𝑎𝑛𝑑 𝑟𝑎𝑛𝑘(𝐹) = 2

So, there is 1 zero particularly of this fundamental matrix we discussed also, that there is
a epipole that is left a epipole. Epipole in the reference image reference plane image plane
that is a 0 of this fundamental matrix that is a right 0. Similarly, there is a left 0 which is a
right epipole; Now if I perform singular value decomposition of a fundamental matrix. So,
we present singular value decomposition in this form. So, you see that U is a matrix which
is orthogonal or we can make it also orthonormal which means if I take the column vectors.

𝑈 = [𝑢1 𝑢2 𝑢3 ]

So, let me represent U as a set of column vectors in this form, I will use this notation. So,
each vector is a column vector which is of 3 cross 1 dimension; that means, U is a 3x3
matrix.

And, in singular value decomposition all these column vectors would be orthogonal or you
can make it orthonormal also as I mentioned. Similarly, V is also another matrix whose
column vectors are also orthogonal. So, you have V1, V2, V3 so, this is also orthogonal.
So, any square matrix or even any matrix not only square can be decomposed into this
form, into this particular structure in using singular value decomposition. So, in this case
particularly I will discuss with respect to square matrix only for in this context so, it can
be decomposed into this form.

𝐹 = 𝑢1 𝜎1 𝑉1 𝑇 + 𝑢2 𝜎2 𝑉2 𝑇 + 𝑢3 𝜎3 𝑉3 𝑇

So, U as I mentioned 3x3 then there is a diagonal matrix, and then V which is also a matrix,
but we have to take transpose of this V. So, I can write also like [u1 u2 u3] and this diagonal
matrix there in this case it is also a 3x3 matrix. You can see that if I multiply I will get a 3
x3 matrix. So, diagonal matrix as we know that all off-diagonal elements will be 0 except
the diagonal elements which should be non-zero not necessarily, that it should be non-
zero. It could be 0 also and we can apply no sign laws it is, it we can adjust the signs in
both the sides so, that no one can make it all positive.

So, if I shuffle these columns similarly shuffle the rows of [v1 v2 v3] in the same order
still we will get the same matrix F. So, I will shuffle in such a way that

𝜎1 > 𝜎2 > 𝜎3

which means I will take those column vector those pairing U1 and V1 whose sigma is
those singular value. So, 𝜎1 this is called as singular values whose singular value is the
maximum here, then the next maximum then the minimum. So, this is how the singular
value decomposition works and this is how you can see the matrix can be considered as a
super positions of 3 rank 1 matrices.

So, you are just summing up 3 rank 1 matrices so, each one is a 3x3 matrix. So now, what
you can do that you have to estimate a singular 3x3 matrix which should be very close to
your estimated F matrix, now that is a computational problem. So, F you have estimated

from the data, but that is not singular. So, your objective is to get that solution F’ which
should be very close to your estimated F which means you have to minimise the Frobenius
norm.

That means, element wise if you take the square of their differences and add them and take
the square root of the sum that is a Frobenius norm. It is as good as saying that this is L2
norm when you are representing a matrix in the form of a vector by concatenating all its
rows or its columns in whatever way. So, the solution lies in the fact that this should be
minimum that theory says that when you set the minimum singular value to 0; that means,
you are setting to 0 so, you are ignoring this part. So, once you set it 0, now then this F
prime is a singular matrix because that is one of the conditions one of the signature of a
singular matrix.

That if you perform singular value decomposition, then at least one of the singular value
should be 0 and if I reorder them all those singular values in order of their increasing
values. And, as I mentioned that we enforce that singular values should be positive we can
enforce positive point 0 so, the minimum values should be 0 in that set. So, in this case
number of 0s or number of non-zero singular value will also provide us the rank of that
matrix. So, since the rank of this fundamental matrix is 2 so, we are keeping two non-zero
singular values and only setting the minimum singular value to 0.

So, this becomes your solutions now and you have applied the singular constraint in this
fashion. So, this is that we are computing singular value decomposition.

(Refer Slide Time: 13:11)

Closest rank 2 approximation:   F’ = u₁σ₁V₁ᵀ + u₂σ₂V₂ᵀ   (the σ₃ term is dropped)

And this is the closest rank 2 approximation from that estimated F matrix.
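
In code this enforcement is only a couple of lines. The first helper below zeroes the smallest singular value of an estimated F; the second applies the extra condition used a little later for the essential matrix, where the two remaining singular values are replaced by their average. Both are small sketches of the procedure just described.

    import numpy as np

    def enforce_rank2(F):
        # closest rank-2 matrix in the Frobenius norm: drop the smallest singular value
        U, s, Vt = np.linalg.svd(F)
        s[2] = 0.0
        return U @ np.diag(s) @ Vt

    def enforce_essential(E):
        # essential matrix: two equal non-zero singular values, third one zero
        U, s, Vt = np.linalg.svd(E)
        sigma = 0.5 * (s[0] + s[1])
        return U @ np.diag([sigma, sigma, 0.0]) @ Vt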

(Refer Slide Time: 13:17)

So, one of the nice example that I could get from the book of you know Hartley and
Zisserman that is “Multiple view geometry in computer vision”. You can see there are the
effect of singular F and non-singular F by showing how epipolar lines behave in a
particular image plane. So, this is a right image plane with respect to a left image plane
which is a reference plane and we are drawing the epipolar lines of all the points
corresponding points of the left image planes in the right image plane.

It is expected that all epipolar lines should be converging to a single point and that is the
right epipole of that stereo system. So, you can see that if it is non-singular then they are
not converging. Only their convergence is ensured if it is a singular. So, that is why
geometrically also this property needs to be satisfied and that is why this constraint is so
important in this respect.

(Refer Slide Time: 14:34)

Now, if I consider the estimation of essential matrix there is another interesting feature, I
told earlier also the number of independent parameters in essential matrix is 5. So, which
means there are few more constraints here. So, let us find out what are those constraints.

det(𝐸) = 0

So, just remember that essential matrix is the same fundamental matrix when your camera
is calibrated which means the calibration matrix of the camera. And, you can convert all
the image coordinates into the canonical form and express the fundamental matrix just by
using the rotation and translation of the word coordinate to camera coordinate system those
parameters. So, extrinsic parameters are the only constraints and since there are 6
parameters and also the scale is another constraint so, there will be 5 independent
parameters in this case. So, suppose you have a calibrated camera; that means, you know
the calibration matrix. So, you can take and make that conversions and then you apply the
same estimation technique. And, whatever fundamental matrix you have estimated that is
essentially an essential matrix because, you get this canonical coordinates and you get in

that form of the fundamental matrix. So, there also we need to perform singular value
decomposition to enforce the constraint of singularity.

But, there is one interesting property of essential matrix that the two singular values that
you get here should be also equal.

Ê = U D̂ Vᵀ,    D̂ = diag( (a+b)/2, (a+b)/2, 0 )

So, the way you can enforce is that you consider the average of those two values (a, and
b) and set them to the 𝜎1 and 𝜎2 in my previous technique whatever we discussed. You
just make these modifications when you are estimating an essential matrix.

(Refer Slide Time: 16:56)

Let us discuss now how to estimate fundamental matrix using 7 point correspondences
which is the minimum number that is required as we mentioned earlier. So, if you have 7
point correspondences there would be 7 such equations involving the elements of the
fundamental matrix as we also discussed in our previous slide. So, we can see that each
row is also formed in the same way what we considered earlier. And, there are 7 such row
corresponding to each of the 7 pairs of corresponding points. So now, if you would like to
solve this as you can see this is a set of homogenous equations and since the number of
elements in F is 9 and you have 7 equations.

So, the rank of this matrix is 7 and here we can use the singular value decomposition of
this matrix A. So, we consider this matrix is A and if I perform a singular value
decomposition then we will be naturally expecting there will be two 0s in this case. And,
the U and V you can consider know their dimensions in this form and V there will be
corresponding 0 vectors of this particular matrix A. So, those 0 vectors are the last two
columns of V and they can give you the solutions of the equations. Because, 0s are the
solutions of A are they are the solutions of these equations and they are the corresponding
fundamental matrix.

But since there are two such zero vectors so, actually the solution is a linear combination
of these two zero vectors. So, we can express the solution as a linear combination of two
zero vectors F 1 and F 2 as it is shown here and this linear combination is given in this
form. So, where lambda is any scalar constant, but you have to determine exactly which
linear combination which lambda value.

|𝐹1 + 𝜆𝐹2 | = 0

Because, the fact is that the determinant of this should be equal to 0, it should be rank
deficient matrix which means if I take the determinant of this matrix, if I take the
determinant that should be equal to 0. And, since the determinant it is not automatically
rank 2 we have to solve for lambda which will give you the 0. And, it will be a cubic
polynomial because it is a 3x3 matrix and in cubic polynomial we would like to get a real
valued solutions of lambda. And, there are two situations either there will be one real
valued solutions or there will be three such solutions. So, a maximum we have three such
possibilities of fundamental matrices in this case that is a technique we discussed.
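
A compact numpy sketch of this seven-point procedure is given below. It assumes A is the 7 x 9 coefficient matrix built exactly as before; the cubic det(F1 + lam*F2) is recovered by sampling it at four values and fitting a degree-3 polynomial. The function name is mine.

    import numpy as np

    def seven_point_candidates(A):
        _, _, Vt = np.linalg.svd(A)
        F1 = Vt[-1].reshape(3, 3)          # the two right null vectors of A
        F2 = Vt[-2].reshape(3, 3)

        lam_samples = np.array([-1.0, 0.0, 1.0, 2.0])
        dets = [np.linalg.det(F1 + lam * F2) for lam in lam_samples]
        coeffs = np.polyfit(lam_samples, dets, 3)   # cubic in lam

        candidates = []
        for lam in np.roots(coeffs):
            if np.isreal(lam):                      # keep the real roots (one or three)
                candidates.append(F1 + np.real(lam) * F2)
        return candidates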

(Refer Slide Time: 20:13)

So, the essence is that you have to perform singular value decompositions or rather eigen
decomposition of the AT matrix here. And, and solve for the lambda which is the
coefficient of linear combination of the eigen vectors by enforcing the determinant of the
combination linear combination should be equal to 0. So, let us consider now the other
aspect of the fundamental matrix. So, how fundamental matrix can be represented in
different parametric form. So, we have already seen how many number of independent
parameters should be there, but not necessarily your representation will have that intrinsic
characteristics.

So, it you can represent with more number of parameters, but then you have to enforce the
constraints; that means, the independencies among those parameters needed to be enforced
there. So, in the estimation process and this is not always ensured and a post estimation
those operations are taken care of as we discussed that how we can make the determinant
0 for F. So, over parameterization representation could be in this form; that means,

𝐹 = [𝑡]𝑋 𝑀 → {𝑡, 𝑀} → 12 𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠

𝑒 = [𝛼 𝛽 − 1]𝑇

where t is a epipole. And, in this form you see that there are more number of parameters
like you have epipoles t and M.

F = | a   b   αa + βb |
    | c   d   αc + βd |
    | e   f   αe + βf |

So, M is 9 and t is a 3x1 so, we can write 12 parameters. Now, we know essentially there
are seven independent parameters. One of the interesting representations which follows
this parameterization is this one, you can see that how many independent parameters are
there a c e b d f 6 then alpha and beta. So, there are 8 independent parameters. The
interesting part is that the singularity constraint on f; that means, the determinant that is
equal to 0 here because if I take the linear combination of these two columns; the third
column is a linear combination of these two columns.

F = | a           b           αa + βb                     |
    | c           d           αc + βd                     |
    | α’a + β’c   α’b + β’d   αα’a + βα’b + αβ’c + ββ’d   |

So, we know that in this case this is a rank deficient matrix and one of the column is
dependent of others. So, also its determinant should be equal to 0 that is the property of
determinant. And, if I multiply this F say with[−𝛼 − 𝛽 1], you will find that you are
getting 0. So, you perform minus alpha a minus beta b plus alpha a beta b that is 0 and in
this way. So, what does it mean? It means that the left epipole; that means, 0 of F is this 1
so, which means it is an epipole. So, alpha and beta has an interpretation in this form or
you can see alpha beta minus 1 so, it is because you are multiplying F e should be equal to
0.

So, I think this is fine. So, it should not be right epipole e, it should be the left epipole e it
should be this one ok. So, let us consider the next one that is both epipoles as parameters.
So, you have this number of parameters a b c d e f alpha beta. Now, this is another
representation you can see that how many number of parameters you are having here. You
have a b c d then alpha prime beta prime and alpha and beta; that means, here also you
have 8 parameters a b c d alpha prime beta prime and alpha beta. So, a b c d alpha prime
beta prime alpha beta.

Now, in this case also you can find that if I multiply F with [α  β  −1]^T, I get 0. Here too the third column is a linear combination of the first two columns: alpha times the first plus beta times the second gives the third column. In addition, the third row is a linear combination of the first two rows: alpha prime times the first row plus beta prime times the second row gives the third row. So, F [α  β  −1]^T = 0 like the previous one. And then, if I perform the multiplication the other way, [α'  β'  −1] F, that should also be equal to 0; that means [α'  β'  −1]^T becomes your right epipole and [α  β  −1]^T your left epipole in this setup.

(Refer Slide Time: 26:12)

So, once again we have the same labelling problem in the slide. Please note this correction: e should be [α  β  −1]^T and e' should be [α'  β'  −1]^T, as per my diagram.

So, let me stop here and we will continue our discussion in the next lecture.

Thank you for listening.

Keywords: fundamental matrix, singularity constraint, essential matrix, parametric representation.

345
Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 21
Stereo Geometry Part – VI

We are discussing the fundamental matrix of a stereo geometry, and we have seen how the fundamental matrix plays a very distinctive role in characterizing that geometry. In the last lecture we discussed how fundamental matrices could be estimated given observations of corresponding pairs of points, and also some parametric forms of representation of the fundamental matrix, which sometimes help us in formulating the estimation problem.

(Refer Slide Time: 00:55)

Now, in this lecture, we will discuss how we can retrieve the camera matrices from the fundamental matrix F. Note that in our previous problems we have not considered estimation of camera matrices. If we know the camera matrices, then computation of the fundamental matrix is very direct; it's very simple. We will see here that the reverse is not as simple.

So, let us consider these aspects: the fundamental matrix depends only on the projective properties of P and P'. Given a pair (P, P'), you get a unique F, and the interesting part is that it is independent of the choice of world frame, that is, of the world coordinate system. That means, if I have a stereo setup, I can move the setup while keeping its relative orientation the same, that is, the relative orientation of the image axes the same.

I can move this stereo setup to any point of the world frame, whatever my world coordinate convention is. With a fixed world coordinate system I can place the stereo system at any point, and still we will get the same fundamental matrix, which is not true for the projection matrix. If the world coordinates are transformed, the correspondence between image coordinates and world coordinate points will change; because the world coordinates have been transformed, your projection matrix will also get transformed. But this is not true for the fundamental matrix: it depends solely on the image coordinates.

So, if you have the same pairs of image coordinates for the corresponding pairs of points, we will get the same fundamental matrix. Even if you move the cameras as a whole to any point, the coordinates will vary, but the fundamental matrix will remain the same. This is one of its interesting properties. So, (P, P') gives a fundamental matrix F, and it is unique.

So, the question is whether we can get the projection matrices P and P' from a fundamental matrix F. There is a property which we can easily verify and which shows that we do not have a unique solution in this case. That means, though a pair of projection matrices P and P' gives a unique F, there could be other pairs of projection matrices which give the same fundamental matrix F.

So, consider the situation where there is a 4x4 homography, which is a linear transformation of the 3D projective space. Suppose I multiply P with such a 4x4 nonsingular, invertible matrix H.

𝐼𝑓 (𝑃, 𝑃′ ) → 𝐹, 𝑡ℎ𝑒𝑛, (𝑃𝐻, 𝑃′ 𝐻) → 𝐹

There is no problem recovering P from PH; I can simply multiply with the inverse of H. But the point is that if I multiply both P and P' with H, then the pair (PH, P'H) gives the same F. That is the claim.

So, the proof is as follows. If I consider a particular scene point X in this stereo system, then the corresponding pair of image points will be PX and P'X. Now, consider a transformation of the scene point by this 4x4 homography, that is, by a 4x4 nonsingular matrix, a linear transformation. You know that X is a 4x1 column vector, so if I multiply it with a 4x4 matrix it still remains a 4x1 column vector. Rather than H, let me use H^-1 here; H^-1 is also a 4x4 matrix because H is invertible.

So, if I consider another projection matrix PH and multiply it with H^-1 X, we will get PX, the same point. Similarly, P'H multiplied with H^-1 X gives P'X, which is the same corresponding pair of points. So, the same corresponding pairs of points admit these solutions: they could equally come from the projection matrices (PH, P'H) with corresponding scene points H^-1 X. That is the proof, and I can summarize it once again just to make a clean display.

(Refer Slide Time: 07:13)

So, this is just what we derived; this is the scenario. The summary is that F does not map uniquely to a pair (P, P').
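A quick numerical check of this ambiguity is sketched below (an added illustration, not from the slides). It takes arbitrary random cameras P, P', an arbitrary 4x4 matrix H (assumed invertible), and an arbitrary scene point X, and verifies that (PH, P'H) applied to H^-1 X reproduces the image points of (P, P') applied to X.

import numpy as np

rng = np.random.default_rng(0)
P  = rng.normal(size=(3, 4))            # arbitrary projection matrices (illustrative)
Pp = rng.normal(size=(3, 4))
H  = rng.normal(size=(4, 4))            # arbitrary 4x4 homography (assumed invertible)
X  = rng.normal(size=4)                 # arbitrary scene point in homogeneous coordinates

Hin = np.linalg.inv(H)
x1, x2 = P @ X, Pp @ X                                # image points from (P, P') and X
y1, y2 = (P @ H) @ (Hin @ X), (Pp @ H) @ (Hin @ X)    # from (PH, P'H) and H^-1 X

print(np.allclose(x1, y1), np.allclose(x2, y2))       # True, True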

348
(Refer Slide Time: 07:25)

So, we continue this discussion. Let us consider this particular scenario; we discussed earlier that the projection matrix of the reference camera is in the canonical form, which is given by this:

P = [I | 0],  P' = [M | m]  →  F = [m]_× M

Then we know that, given this pair of projection matrices, the fundamental matrix of the stereo system can be computed using this expression.

Note that this m is the right epipole; that is, this column vector is the right epipole e'. And this M provides you the homography at infinity. So, [e']_× times the homography at infinity gives the fundamental matrix; that is the relationship we discussed here.

So, we discussed that if two camera systems give the same F, then there should be a relationship between the projection matrices of these two camera systems. This corroborates the theory we discussed in the previous slide. And because of this constraint, we can explain how the degrees of freedom of F are related to the degrees of freedom of the projection matrices.

Say we know that the number of independent parameters of a projection matrix is 11: it has 12 elements, but it is defined only up to scale, so the number of unknowns, or independent parameters, is 11. Since a stereo system involves two projection matrices, the maximum degree of freedom could be 22. But they are related by this 4x4 homography, and the 4x4 homography matrix is also a projective element; that means it is described up to scale.

So, one of its elements accounts for the scale, and there are 15 degrees of freedom in H. That is the constraint imposed: given a fundamental matrix, the projection matrices should be related through H, given any reference projection matrix. So, 22 - 15 gives you the degrees of freedom of F. This justifies why the degree of freedom of a fundamental matrix is 7.

(Refer Slide Time: 10:46)

There is another interesting fact that we would like to discuss: the fundamental matrix F corresponds to a pair of projection matrices P and P' if and only if P'^T F P is a skew-symmetric matrix. That means it checks compatibility. So, if I give you a fundamental matrix and I give you P and P', immediately I can check with this relationship whether they are compatible or not.

I need not compute F from P and P' to check it (where I would also have to handle the scale equivalence); in this case I simply have to check whether P'^T F P is skew-symmetric.

350
So, what is skew-symmetric? Just to remind you, a skew-symmetric matrix is one whose transpose is its negative, and whose diagonal elements are 0:

𝐴𝑇 = −𝐴

(Refer Slide Time: 12:07)

So, to prove this fact, one interesting property of skew-symmetric matrices is exploited: for any skew-symmetric matrix S, the following holds:
𝑋 𝑇 𝑆𝑋 = 0 𝑓𝑜𝑟 𝑎𝑙𝑙 𝑋

Here X must have the proper dimension, and a skew-symmetric matrix is necessarily square. If S is skew-symmetric this holds for all X, because of the negation property (the transpose is the negative); and conversely, if X^T S X = 0 for all X, then S is skew-symmetric.

So, now, we will be showing that if I perform this operation,

X^T P'^T F P X = (P'X)^T F (PX) = x'^T F x = 0

You might think this is true only for corresponding pairs of points, but it holds for any scene point X in three-dimensional space, or in the four-dimensional homogeneous coordinate space. Since it is true for every X, the quantity P'^T F P is a skew-symmetric matrix.

351
(Refer Slide Time: 13:52)

Then another relationship we should know: F corresponds to P = [I | 0] and P' given in the form [S F | e'], where S is a skew-symmetric matrix and F is the fundamental matrix.

This relation gives me a particular pair of projection matrices; it helps us to form at least one pair of matrices which will give the corresponding F. I will not give the details of the proof, but these results are interesting, so we will write them here. One good choice of the skew-symmetric matrix, as we have already discussed, is [e']_×.

We know [e']_× is a skew-symmetric matrix. If I represent e' as

e' = \begin{bmatrix} e'_x \\ e'_y \\ e'_z \end{bmatrix}

then [𝑒 ′ ]𝑋 is given by a skew symmetric matrix in this way.

[e']_\times = \begin{bmatrix} 0 & -e'_z & e'_y \\ e'_z & 0 & -e'_x \\ -e'_y & e'_x & 0 \end{bmatrix}

352
You can see that this is a skew-symmetric matrix. So, from F we can get at least one pair of projection matrices: one of them is the canonical form [I | 0] and the other one is [[e']_× F | e']. This is at least one solution you can get from here.
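The recipe can be written out directly; the sketch below (added for illustration) obtains e' as the left null vector of a given rank-2 fundamental matrix F via SVD and forms the pair P = [I | 0], P' = [[e']_× F | e'].

import numpy as np

def skew(v):
    # matrix [v]_x such that [v]_x u = v x u
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def cameras_from_F(F):
    # e' is the left null vector of F: e'^T F = 0
    U, S, Vt = np.linalg.svd(F)
    e_prime = U[:, -1]
    P  = np.hstack([np.eye(3), np.zeros((3, 1))])        # canonical camera [I | 0]
    Pp = np.hstack([skew(e_prime) @ F, e_prime.reshape(3, 1)])
    return P, Pp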

(Refer Slide Time: 16:18)

So, this is what I just mentioned: given a fundamental matrix, you can get at least one pair of projection matrices in this form. But, as we mentioned, there should be a whole family of pairs of projection matrices which give the same fundamental matrix, and this family is also shown here. Once again we will discuss the results, but not the derivation.

Here we can see the equivalent pairs of projection matrices which give the same fundamental matrix; this is the family for one fundamental matrix. What is the definition here? v is any 3-vector and k is a scalar constant. So, ke' is the epipole up to the scale k, and e' v^T is a 3x3 matrix which appears in the expression.
353
(Refer Slide Time: 17:21)

For the essential matrix, we can also derive the camera matrices. As you can see, the essential matrix is given in the form E = [t]_× R, where [t]_× is a skew-symmetric matrix and R is a rotation matrix, which is an orthonormal matrix. So, this is a kind of decomposition: given an essential matrix, if I obtain a decomposition into a skew-symmetric matrix and a rotation (orthogonal) matrix, then I should be able to get at least one pair of projection matrices, namely P = [I | 0] and P' = [R | t].

And the other fact we have already mentioned: two of its singular values are equal and the third one is 0. So, the matrix decomposition of E should give me the corresponding pair of matrices.

354
(Refer Slide Time: 18:24)

Now, there is a technique which produces this decomposition. We have to perform the singular value decomposition of the essential matrix in the form given here, and there are two possible decompositions of the essential matrix into a skew-symmetric matrix and a rotation matrix: E = SR, with S given in the form UZU^T,

where U comes from the singular value decomposition, and R is UWV^T or UW^T V^T. The forms of Z and W are:

Z = \begin{bmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}

W = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}

So, this is one particular solution. You need to perform the singular value decomposition; then Z and W are well defined and you can get the corresponding matrices.

Some more elaboration: any skew-symmetric matrix S can be decomposed into the form

𝑆 = 𝑘𝑈𝑍𝑈 𝑇

355
as you can show; the scale k is allowed because, as I mentioned, any fundamental or essential matrix can be multiplied by a scale. You can also show that W is orthogonal and that Z = diag(1, 1, 0) W up to sign. So, there is a relationship between Z and W; we will now see the motivation for explaining these properties.

(Refer Slide Time: 20:12)

So, by exploiting those properties, these are the possible configurations of the projection matrix of the second camera. The first camera's projection matrix is already taken as [I | 0] in the canonical form. As you can see, there are four such options:

𝑃′ = [𝑈𝑊𝑉 𝑇 |𝑢3 ] 𝑜𝑟 [𝑈𝑊𝑉 𝑇 |−𝑢3 ] 𝑜𝑟 [𝑈𝑊 𝑇 𝑉 𝑇 |𝑢3 ] 𝑜𝑟 [𝑈𝑊 𝑇 𝑉 𝑇 |−𝑢3 ]

Note the definition of u3 here: it is the last column of the matrix U given by the singular value decomposition, which means u3 corresponds to the singular value 0 of E.

But out of these four, only one is valid for viewing a point from both the cameras. We have already discussed how, given a camera matrix, you can decide whether a point is in front of the camera. Only one of these projection matrices will place a reconstructed point in front of both cameras, consistent with [I | 0], and you should choose the matrix which satisfies this. This is one way of obtaining the projection matrices from the essential matrix.

(Refer Slide Time: 21:34)

So, let us consider an example using those properties. This problem says that you have been given a fundamental matrix and also the projection matrices P and P' of a stereo vision system, and you have to check whether they are compatible or not. We exploit the property just discussed: P'^T F P should be skew-symmetric.

(Refer Slide Time: 22:18)

357
So, we need to compute this particular entity, and if I perform the matrix multiplications, I will get the matrix shown. You can observe that it is not a skew-symmetric matrix, which readily tells us that this fundamental matrix is not compatible with these cameras.
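This compatibility test is a one-liner in code; the sketch below (added here for illustration) simply checks whether P'^T F P equals the negative of its transpose up to a tolerance.

import numpy as np

def is_compatible(F, P, Pp, tol=1e-9):
    # F is compatible with (P, P') iff P'^T F P is skew-symmetric
    M = Pp.T @ F @ P
    return np.allclose(M, -M.T, atol=tol)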

(Refer Slide Time: 22:47)

We will now consider another computational problem: given a set of pairs of corresponding points, how can you compute the scene points? We sometimes refer to this collectively as the structure of an object, because if you know the three-dimensional coordinates of the object points, we consider that we have recovered the structure. So, essentially, we would like to compute the i-th scene point Xi whose corresponding image points are given by xi and xi'.

One strategy could be that I first compute the fundamental matrix F, and then I find P and P'. These may not be unique, but at least they give a possible structure, and later I may do post-processing to get the proper combination of P and P'. But right now, let us consider only that, whatever P and P' we get, we need to compute the corresponding scene points.

Then we can apply triangulation. Let me explain this method by which you can get the scene points. The triangulation method says: suppose I have the scene point Xi, and an image plane corresponding to a camera whose centre is C, with image point xi. Given the projection matrix P, I can always form the back-projection ray, that is, the equation of the projection ray, as we have discussed earlier.

Similarly, for the other camera I know its camera centre, and from its image point I can form the other projection ray. This means I get the equations of two straight lines in three-dimensional space. Ideally, they should intersect at my scene point; but there will be errors due to observation and errors due to computation, so they may not exactly intersect. So, what I should consider is the closest point to what would have been the intersecting point.

Consider the common perpendicular from one line to the other. That is a line segment; take the middle of that line segment and we will consider that as my scene point, that is the solution. This is what we compute: the "intersection" of the rays through C, xi and C', xi', and for that we need to compute the segment perpendicular to both rays and take its midpoint.

We should note that this computation is not projective invariant. As we mentioned, the camera matrices could equally have been (PH, P'H); in that case the reconstruction should have been H^-1 X. But with this mid-point construction you will not get exactly H^-1 X if you use (PH, P'H); that compatibility is not there. The result will be close to it, but theoretically it is not ensured that you will get H^-1 X.

(Refer Slide Time: 27:09)

359
Another method is to estimate the scene point by minimizing a certain objective function involving the projections of the estimated scene point obtained by applying the projection matrices. In this case, you start with an initial estimate and then compute the corresponding image points; those image points should be close to your observed image points.

So, one set is the image points obtained from computation and the other is the observed ones. If your estimate is good enough, this error should be very small, and the estimated points should also satisfy the epipolar constraint. So, this is a constrained optimization problem, a non-linear optimization. There are various non-linear optimization techniques by which you can solve it and obtain your estimate, and this method is projective invariant. That means, if you take (PH, P'H), the solution obtained using this method will correspondingly be H^-1 X.

(Refer Slide Time: 28:27)

And now the linear triangulation method. In this method we will be using algebraic manipulation. In the previous case of triangulation, we considered only the geometric method of computing the intersection of two lines; here we will formulate the problem as the solution of a set of linear equations. We will form a set of linear equations, and as in previous cases you will find the similarity of this formulation.

360
Since we have a pair of corresponding points x and x', we can form equations as follows. PX gives me the same image point x, but since it is a projective element it could be a scaled version of x; the direction as a vector remains the same. So, if I take the cross product x × (PX), that should be equal to 0.

This gives me a set of linear equations satisfying the constraint. Similarly, x' × (P'X) = 0 gives me another set of equations. If I expand this particular form, I can write it in this way: say

P = \begin{bmatrix} r_1^T \\ r_2^T \\ r_3^T \end{bmatrix} \quad \text{and} \quad PX = \begin{bmatrix} r_1^T X \\ r_2^T X \\ r_3^T X \end{bmatrix}

You can see that PX is a 3-vector, and similarly the image point x can be represented in this form:

\vec{x} = \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}

So, I have to take the cross product of these two vectors: (x, y, 1)^T and (r_1^T X, r_2^T X, r_3^T X)^T. If I expand this cross product, you will find that

y r_3^T X - r_2^T X = 0, \quad r_1^T X - x r_3^T X = 0, \quad x r_2^T X - y r_1^T X = 0

We will find that the third equation can actually be derived as a linear combination of the first two.

Multiply the first by x and the second by y, and if you add them you get the third equation. This means there are only two independent equations, so this constraint gives you only 2 equations. Similarly, the constraint from the second camera gives you another two equations. So, given a pair of corresponding points you have 4 equations. And how many unknowns do you have for X? Only 3, because these are the world coordinates X, Y, Z; the fourth homogeneous coordinate is the scale. So, there are 4 equations and 3 unknowns, and once again you can apply the least-squares error estimation technique that we used before.

361
(Refer Slide Time: 34:00)

So, just to summarize this computation: there are 4 equations and 3 unknowns, as we mentioned, and this can be expressed as a set of linear equations in matrix form. Since it is a set of homogeneous equations, we minimize the norm of AX subject to the norm of X being equal to 1; that is the constraint we impose.

You can apply the direct linear transform that we discussed earlier for estimation of the projective transformation matrix, or homography matrix; note that this solution is not projective invariant. We can generalize this construct, and that is the power of this technique: if you have point correspondences over three views, that means a three-camera system, say P1, P2, P3, each view gives a similar form of equations, that is, 2 equations each. So, there are 6 equations, but again only 3 unknowns, and you can solve using the same technique.
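A minimal NumPy sketch of this linear (DLT) triangulation is given below as an added illustration; it stacks the two independent equations from every view and takes the right singular vector of the smallest singular value, so it works for two or more cameras exactly as described.

import numpy as np

def triangulate_dlt(points, cameras):
    # points: list of (x, y) image coordinates; cameras: list of 3x4 matrices
    rows = []
    for (x, y), P in zip(points, cameras):
        rows.append(y * P[2] - P[1])   # y r3^T X - r2^T X = 0
        rows.append(P[0] - x * P[2])   # r1^T X - x r3^T X = 0
    A = np.vstack(rows)                # 2n x 4 homogeneous system A X = 0
    X = np.linalg.svd(A)[2][-1]        # minimizes ||AX|| subject to ||X|| = 1
    return X / X[3]                    # scene point (X, Y, Z, 1)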

So, I think with this let me stop here; we will continue this discussion of structure recovery in our next lecture.

Thank you very much.

Keywords: Camera matrix, Projection matrix, fundamental matrix, triangulation.

362
Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 22
Stereo Geometry Part – VII

We are discussing recovering structure from a set of corresponding pairs of points of a stereo system, and we have discussed different approaches for computing a scene point given a pair of corresponding points. Just to illustrate this process, we will work through a problem of computing the scene point.

(Refer Slide Time: 00:47)

So, consider this particular example: you have two camera matrices P and P' for the left and right cameras, and the images of a scene point are formed at (0, 3.5) and (2/3, -1/3) respectively. We have to compute the 3D coordinates of the scene point. The projection matrices of the two cameras P and P' are also given here.

363
(Refer Slide Time: 01:30)

So, we will discuss the solution. Let us consider how we are going to compute the scene point. You have projection matrices P and P':

𝑃 = [𝑀|𝑚] 𝑃′ = [𝑀′ |𝑚′]

And we denote the pair of corresponding points as x and x'. We will be applying the geometric approach of triangulation; that means, we will form the back-projection rays for these two cameras and then try to compute their intersection. Due to noisy observations you will not get an exact intersection even though ideally the rays should intersect, so what you need to find is a close estimate of the intersecting point. You should consider the segment perpendicular to both lines and then take its midpoint; that was the solution.

Let us see how this computation proceeds. C is the centre of the first camera, C' is the centre of the second camera, and we can compute the direction cosines of the projection ray given the image point x, and similarly the direction cosines of the projection ray of the second camera given the image point x'. So, -M^-1 m is the centre C, -M'^-1 m' is the centre of the second camera, M^-1 x gives the direction cosines of the projection ray of the first camera, and M'^-1 x' those of the second camera.

With these computations you know the full three-dimensional description of the straight lines: a point through which each line passes and its direction cosines.

Now, you apply three-dimensional coordinate geometry to find the common perpendicular segment and its midpoint; there lies the solution. One property of this perpendicular segment is that its direction cosines are obtained by the cross product of dx and dx', because it is perpendicular to both.

If I write a point lying on the projection ray through the image point x and C as C + s dx, then s gives a parametric representation of that straight line: any point on the line is obtained as s varies. Similarly, C' + t dx' is a point on the other ray. What we need to find is exactly which s and t satisfy the constraint that the line joining those two points is perpendicular to the directions of both rays. So, we need to solve for s and t by applying this constraint.

𝐶 = −𝑀−1 𝑚

𝐶 ′ = −𝑀′−1 𝑚′

𝑑𝑥 = 𝑀 −1 𝑥

𝑑𝑥′ = 𝑀′−1 𝑥′

direction cosines of the perpendicular segment = d_x × d_x'

(Refer Slide Time: 05:27)

365
These are the computational steps we discussed, and here is the summary of all the results. As you can see, we have first computed the centres by applying these relations, then computed the direction cosines, and then the direction of the perpendicular segment, which is perpendicular to both of these rays.

(Refer Slide Time: 06:02)

And then we solve for s and t by applying the constraint that the segment formed by the two points is perpendicular to both the directions dx and dx'.

You actually get two equations. Note the direction cosines of dx = [0.32, 0.47, 0.69]; in this slide you can see that the dot product of that direction with the segment expression is set equal to 0. Similarly, the segment should be orthogonal to the direction dx', so the dot product between these two vectors should also be equal to 0. This gives me two linear equations in terms of s and t; solving them you get s and t, and hence the point.

366
(Refer Slide Time: 07:16)

So, this is how we can solve this problem and get the structure. Let us consider the other problem of line reconstruction. Consider a situation where you have been given a straight line L in space, and you can get its two images, which will again be straight lines, one in the first image plane and one in the second image plane.

Incidentally, as you can see, there will also be the corresponding epipolar lines. So, this is the image of the 3-dimensional line in the first image plane, and similarly the image of the same line in the second image plane; they are given by l and l'. What we can do here is: given the image line l and the projection matrix P, we know that the equation of the back-projected plane is P^T l; this is the equation of that plane.

\pi = P^T l, \qquad \pi' = P'^T l', \qquad L = \begin{bmatrix} \pi^T \\ \pi'^T \end{bmatrix}

Similarly, the equation of the plane from the second view is P'^T l'. Now you have two planes, one is π and the other is π'. When two planes intersect they form a line, and that is the solution: the three-dimensional line is exactly what you get if you compute this intersection, as I just mentioned. Finally, your solution is in this form, which is one convenient way of representing a three-dimensional line.

367
(Refer Slide Time: 09:17)

Now, we will discuss a very important property, an important feature, of a stereo system. One can show how projective transformations among the image points exist in a stereo geometry. The homography, or projective transformation, is established among pairs of corresponding points whose scene points lie on a particular plane. We call this kind of homography a plane induced homography.

Take the particular scenario shown in this diagram: you again have two cameras with projection matrices P and P'. P is given by [I|0], the projection matrix of the first camera, and P' is given in the representation [A|a].

And we have a plane given by this algebraic representation; as you can see, this is the equation of a plane, represented in the homogeneous coordinate system of the three-dimensional projective space. The 3-vector v gives the direction cosines of the normal of the plane and the scale value is kept as 1, which means the equation of the plane, if v is given as a column vector

v = \begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix},

is v_1 x + v_2 y + v_3 z + 1 = 0.

368
Consider a point on that plane, denoted x_π here, and its corresponding image points x and x'. We can show that there exists a homography between these two image points. You can take another point on the plane and form its images, say y and y', another corresponding pair. There exists a homography H such that Hx gives x' and Hy gives y', and this homography is called the plane induced homography. We can show this; let us check the derivation (let me clean this diagram).

(Refer Slide Time: 12:44)

We have already discussed earlier that points on a plane and their corresponding image points in a single camera are related by a homography, because points on a plane can be represented by a 3-vector with two independent parameters. Let that homography for the first camera be H_1π. Similarly, you have a homography H_2π for the second camera, and from them you get the homography between the two images, given by H_2π H_1π^-1.

This plane induced homography can be related to the parameters of the plane which induces it and also to the parameters of the projection matrices. Here we show a proof of how it is related; the basic theorem is that this homography is given by H = A - a v^T. That means the plane induced homography can be expressed in this form, where A is the 3x3 submatrix of the projection matrix of the second camera P', the vector a is the fourth column of that projection matrix, and v, as you know, gives the direction ratios of the normal of the plane π.

Let me just provide the outline of this proof as it is given here. You can see that x', which is the image of the scene point X, is given by P'X, and x is just PX. Now, since P = [I|0], any scene point on the back-projection ray of x can be written as X(ρ) = (x^T, ρ)^T; as you know, in the homogeneous coordinate system the scale factor ρ picks out the corresponding point, and the interpretation with respect to the real three-dimensional space is that it could be any point on the projection ray through the image point x.

Note that at ρ = 1 you have the image point itself, and at ρ = 0 you have the point at infinity, represented as (x^T, 0)^T. Exactly at what value of ρ the scene point lies on the plane can be determined by the point containment relationship on the plane: π^T X(ρ) should be equal to 0, and the value of ρ comes out as -v^T x; that we will now use in the expression of x'.

So, the scene point X is related to the image point x together with the scale factor ρ = -v^T x computed from the point containment in the plane. If I carry out the corresponding sub-matrix multiplication, x' = P'X(ρ) = Ax + aρ = Ax - a v^T x, so this can be written as x' = (A - a v^T)x. As you see, x' is related to x by this linear transformation, which is nothing but a projective transformation, and that is the homography H. This gives the direct relationship.
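The derivation is easy to check numerically; the sketch below (added, not from the slides) builds H = A - a v^T for arbitrary A, a and a plane (v^T, 1)^T, takes a scene point on the plane, projects it with P = [I|0] and P' = [A|a], and verifies that Hx is proportional to x'.

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))                 # arbitrary second-camera submatrix
a = rng.normal(size=3)                      # fourth column of P'
v = rng.normal(size=3)                      # plane: v1 X + v2 Y + v3 Z + 1 = 0

P  = np.hstack([np.eye(3), np.zeros((3, 1))])       # reference camera [I|0]
Pp = np.hstack([A, a.reshape(3, 1)])                # second camera [A|a]
H  = A - np.outer(a, v)                             # plane induced homography

# pick a scene point on the plane: choose X, Y and solve the plane equation for Z
Xw, Yw = 0.3, -0.8
Zw = -(1.0 + v[0] * Xw + v[1] * Yw) / v[2]
X = np.array([Xw, Yw, Zw, 1.0])

x, xp = P @ X, Pp @ X
print(np.allclose(np.cross(H @ x, xp), 0.0, atol=1e-9))   # True: H x is proportional to x'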

370
(Refer Slide Time: 17:22)

Using this fact we can say, the other way round, that a transformation H between two stereo images is a plane induced homography if the fundamental matrix can be decomposed into the form F = [e']_× H, where e' is the right epipole. As you know, the cross product with the right epipole is a matrix operation: [e']_× is the equivalent matrix representing that operation, and multiplied with H it gives you F.

Earlier we discussed that this H could be any such homography; previously we had taken only the homography of the plane at infinity. So, if we decompose F in this form [e']_× H, then P and P' can be expressed in a corresponding form; within the family of pairs of projection matrices this is one such option.

We can also say that if you have two projection matrices, P in the canonical form [I|0] and P' given as [A|a], together with a plane induced homography H, then the plane can be recovered by solving the corresponding equations, because we have seen that the homography is given by A - a v^T. Naturally, this equality cannot be used directly unless we know at what scale it holds.

That is why we write the equation with a scale factor, kH = A - a v^T. What are the unknowns here? You have the unknown k and the unknown v. Since H has 8 independent parameters, you get a set of linear equations for the unknowns k and v, which you can solve using this homography. This is how you can get the equation of the plane.
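The linear solve for k and v can be sketched as below (an added illustration; H, A and a are assumed known). Each of the nine element-wise equations k H_ij + a_i v_j = A_ij is linear in the four unknowns (k, v1, v2, v3), so a least-squares solution suffices.

import numpy as np

def plane_from_homography(H, A, a):
    # solve k*H = A - a v^T in the least-squares sense for k and v
    rows, rhs = [], []
    for i in range(3):
        for j in range(3):
            r = np.zeros(4)
            r[0] = H[i, j]        # coefficient of k
            r[1 + j] = a[i]       # coefficient of v_j
            rows.append(r)
            rhs.append(A[i, j])
    sol = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)[0]
    k, v = sol[0], sol[1:]
    return k, v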

(Refer Slide Time: 20:15)

Another feature of the stereo geometry is that for a plane induced homography you can find out whether it is compatible with a fundamental matrix or not. As you have observed, we discussed how the compatibility of a fundamental matrix with a pair of projection matrices can be tested by checking the skew-symmetry property: P'^T F P should be skew-symmetric. Similarly, here you can see that H^T F should be skew-symmetric. Recall the feature of a skew-symmetric matrix that its transpose is its negation, so if you add the matrix and its transpose you should get 0.

The proof can be shown as follows: since x'^T F x = 0 and x' = Hx, we have (Hx)^T F x = x^T H^T F x = 0, and this is true for every image point x on the inducing plane. So, H^T F is a skew-symmetric matrix, from the property of skew-symmetric matrices.

(Refer Slide Time: 21:44)

These are certain facts of the stereo geometry once you have a plane induced homography, that is, when the corresponding pairs of points all come from scene points on a plane. Consider any plane; if it is not parallel to the baseline, the baseline intersects that plane. For that plane, if you know its induced homography H, then e' should be equal to He.

Similarly, we have mentioned corresponding epipolar lines, which lie on the same epipolar plane. Points along the intersection of the inducing plane with an epipolar plane also follow the plane induced homography, since they lie on the inducing plane; so the relation is satisfied for any pair of corresponding epipolar lines.

There is another constraint: Hx lies on the epipolar line of x. You can see that if you have a scene point, the epipolar line of its corresponding point in the second image is obtained by multiplying F with x, that is, l' = Fx.

373
(Refer Slide Time: 24:01)

That means I can obtain the epipolar line as Fx, and if I apply the homography transformation to x, the point Hx will also lie on it; that means I can write (Hx)^T (Fx) = 0. That is the essence of this particular computation.

(Refer Slide Time: 24:28)

We will now exploit these facts for estimating the fundamental matrix under certain constrained scenarios, because that simplifies the computations. Suppose you would like to compute the fundamental matrix from 6 point correspondences, out of which 4 are coplanar. Using those 4 coplanar points, we know that they establish a homography; that is, there is a homography relationship among their corresponding points.

So, we can compute the homography matrix H by using those 4 correspondences, and then you can find the epipole by computing epipolar lines. That means you consider the actual observed correspondences: you have x5 corresponding to x5', x6 corresponding to x6', and also x1 to x1' through x4 to x4'. First you compute the plane induced homography from x1 to x4; you require a minimum of 4 point correspondences for it.

Now if I transform x5 using this homography, the transformed point will lie on the epipolar line Fx5; or I should say the epipolar line can be formed even though F is not known, as l1 = (H x5) × x5'. That is what is shown: x5' and the plane-induced point H x5 together define an epipolar line.

In the same way, using the other point you form the second epipolar line: take x6 and x6' (the point may be somewhere else in the image), apply the homography transformation to get H x6, and this again forms an epipolar line, l2 = (H x6) × x6'. The intersection of the two epipolar lines l1 and l2 is e'; that is expressed in this form.

So, if I take the cross product of l1 and l2, I get e'. Now the fundamental matrix can be obtained as F = [e']_× H_π, so you can compute it. Using the 6 points you can therefore compute the fundamental matrix.
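A compact sketch of this six-point procedure is given below (added here for illustration; homography_dlt is a standard four-point DLT, and the six correspondences are assumed to be given in homogeneous coordinates with the first four coplanar).

import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def homography_dlt(xs, xps):
    # plane induced homography from (at least) 4 correspondences, x' x (H x) = 0
    rows = []
    for x, xp in zip(xs, xps):
        rows.append(np.concatenate([np.zeros(3), -xp[2] * x,  xp[1] * x]))
        rows.append(np.concatenate([ xp[2] * x,  np.zeros(3), -xp[0] * x]))
    h = np.linalg.svd(np.vstack(rows))[2][-1]
    return h.reshape(3, 3)

def F_from_six_points(xs, xps):
    # xs[0..3] coplanar; xs[4] and xs[5] off the plane
    H  = homography_dlt(xs[:4], xps[:4])
    l1 = np.cross(H @ xs[4], xps[4])    # epipolar line through x5' and H x5
    l2 = np.cross(H @ xs[5], xps[5])    # epipolar line through x6' and H x6
    e_prime = np.cross(l1, l2)          # epipole e' = intersection of the two lines
    return skew(e_prime) @ H            # F = [e']_x H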

375
(Refer Slide Time: 27:38)

Now, here the problem is the reverse: you have been given a fundamental matrix and 3 point correspondences, and you need to compute H. The first method says that from the fundamental matrix you can always estimate the camera matrices P and P' using the method we discussed. Then you can obtain the plane: you reconstruct the 3 scene points from the 3 correspondences and fit the plane through them, and from there you can compute the plane induced homography, that is, H = A - a v^T given the parameters in this form.

The second method is that, besides the 3 point correspondences, you require one more point correspondence to get this homography, because a minimum of 4 point correspondences is required. What is that additional pair of corresponding points? It is the pair of epipoles. From F you can compute the epipoles, that is, its left zero and right zero; those are the epipoles and they are also corresponding points. Now you have 4 corresponding points from which you can compute the homography. Another interesting feature is that any 3 points bipartition the image with respect to the plane formed by them.

376
(Refer Slide Time: 29:10)

We discussed the infinite homography earlier as well; let us discuss it in the context of the plane induced homography. Consider the general plane induced homography result shown at the top: this is the general form of the homography, given your projection matrices and the three-dimensional plane in the representations shown.

In this case the plane is given in the form (n^T, d), so that we can relate it: if you scale the normal vector n by d, that gives v = n/d. With this, the plane induced homography can be written in the corresponding form. Here your projection matrices are P = K[I|0] and P' = K'[R|t], so the submatrix A is related to K'R. And e' is the image of the first camera centre, which is at the origin; if I take its image in the second camera, it comes out through t.

So, the epipole is at the image of t: e' = K't. And where does K^-1 come from? Because P = K[I|0], we can bring the first camera to the canonical form by transforming its image coordinates by K^-1.

377
(Refer Slide Time: 31:18)

So, these are the quantities: v equals n/d and e' is K't. Then H_π, following this formulation, can be written accordingly: since v = n/d, the correction term is t n^T / d, and finally H_π = K'(R - t n^T / d) K^-1. Now, as d tends to infinity, H_∞ = K' R K^-1.

This is the formula we discussed earlier: the homography at infinity can be expressed in terms of these camera parameters. We also discussed earlier how, in a general configuration, the corresponding pairs of points are related through this parameter; this is what we get using the infinite homography. As Z tends to infinity, x' is the image of a point on π_∞; this shows that it is the vanishing point on the epipolar line, as you are projecting points at infinity.

378
(Refer Slide Time: 32:42)

So, this is a summary of the homography at infinity, H_∞, and vanishing points. It maps vanishing points between two images, and it can be computed by identifying three non-collinear pairs of corresponding vanishing points given F, or from 4 pairs of vanishing points. We discussed earlier that given three point correspondences and the fundamental matrix, you can compute the homography.

So, if you have three corresponding vanishing points you can compute H_∞ in that case, or with 4 pairs of vanishing points you can again compute H_∞. And if your camera matrices P and P' are given in the general form [M|m] and [M'|m'], and a point at infinity is given in the usual form, then, as we discussed earlier, H_∞ = M' M^-1.

We discussed this relationship while solving a problem: given a general camera representation, the homography at infinity, that is, the plane induced homography when the plane at infinity is the inducing plane, is expressed as H_∞ = M' M^-1. With this let me stop this lecture at this point; we will continue our discussion on stereo geometry in subsequent lectures.

Thank you very much.

Keywords: Plane induced homography, line reconstruction, epipolar constraints, infinite homography

379
Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 23
Stereo Geometry Part - VIII

We will discuss a special case of epipolar geometry, under the same topic of stereo geometry, and that is affine epipolar geometry. We discussed earlier how an affine camera differs from a general projective camera: in an affine camera the centre of projection, or camera centre, lies at infinity, which means that the projection rays are all parallel to a certain direction, and the basic imaging mechanism is that parallel projection.
(Refer Slide Time: 00:58)

So, let us consider a stereo system formed by two affine cameras. You can see in this diagram that there are two such cameras; a parallel projection ray forms images in the second camera, and similarly there is a direction of rays for the first camera. The camera centres lie at infinity: with parallel projection, the intersection of the projection rays is at infinity, and that defines the camera centre, which effectively is the direction of the rays itself.

How are the epipolar lines found? The epipolar lines are all parallel lines. Here we can see that, for one projection ray of the first camera, if I take the projections of the points lying on that ray, all those image points form an epipolar line. And since all the rays of the first camera are parallel, the epipolar lines should also be parallel.

And the form of the fundamental matrix gets simplified when you have this kind of geometry; you can see the structure here, with only 5 non-zero elements. For an epipole at infinity, the corresponding skew-symmetric matrix is

[e']_\times = \begin{bmatrix} 0 & 0 & e'_2 \\ 0 & 0 & -e'_1 \\ -e'_2 & e'_1 & 0 \end{bmatrix} = \begin{bmatrix} 0 & b \\ -b^T & 0 \end{bmatrix}, \quad \text{where } b = (e'_2, -e'_1)^T

See, all 4 elements in the upper-left part are 0, and the other elements are non-zero, so there are only 5 parameters; but since the representation is again up to scale, there are 4 independent parameters. Here projection rays are parallel, and epipolar lines and epipolar planes are also parallel. That is the characteristic of affine epipolar geometry.

The other thing is that, as epipolar lines are parallel, epipoles are of the form (e_1, e_2, 0). We know this because the intersection of the epipolar lines is a point at infinity, which means the third coordinate, representing the scale, should be 0.

(Refer Slide Time: 03:34)

381
Let us again elaborate this property. Consider this affine stereo setup: x corresponds to x', y corresponds to y', and there exists a homography between them when all these points lie on a particular plane π. That is the plane induced homography:

H_A = \begin{bmatrix} A & t \\ 0^T & 1 \end{bmatrix}

The epipole, the point of intersection, is represented as (e'_1, e'_2, 0); we have seen that this is its structure. Following the same construction as before, the epipolar line is given by l' = [e']_× H_A x, because H_A x is the homographic transfer of x, which lies on the epipolar line, and the epipole also lies on the epipolar line. Since x'^T l' = 0, F_A should be [e']_× H_A. And since the epipoles have this structure, the upper-left 2x2 submatrix of the affine fundamental matrix becomes 0.

l' = e' \times (H_A x) = [e']_\times H_A x

F_A = [e']_\times H_A

In particular, you notice that the right epipole appears in the third column: e'_2 comes in as a and -e'_1 as b. Similarly, the left epipole can be read from the third row as (-d, e, 0), and the right epipole is (-b, a, 0). So, from the structure of the affine fundamental matrix the epipoles are very easily determined.

382
(Refer Slide Time: 06:55)

F_A = \begin{bmatrix} 0 & 0 & a \\ 0 & 0 & b \\ e & d & c \end{bmatrix}

𝑙 ′ = 𝐹𝐴 𝑥 = [𝑎 𝑏 𝑒𝑥 + 𝑑𝑦 + 𝑐]𝑇

𝑙 = 𝐹𝐴𝑇 𝑥 ′ = [𝑒 𝑑 𝑎𝑥 ′ + 𝑏𝑦 ′ + 𝑐]𝑇

For estimating this affine fundamental matrix we use the epipolar constraint on corresponding points. The equation x'^T F_A x = 0 becomes a much simpler linear form, and you can represent the set of linear equations again in matrix form and solve using the direct linear transform.

Since there are 4 independent parameters, you require only 4 point correspondences to get F_A, and the singularity is satisfied by the structure itself: all elements of the upper 2x2 submatrix are 0, so if you take the determinant you can check that it is 0 for this general structure. You do not have to perform any special operation to impose singularity.

383
(Refer Slide Time: 08:24)

So, the first approach used a set of corresponding points from which you can form linear equations exploiting the epipolar constraint, and you have seen how to solve those equations, which become simpler in the case of affine fundamental matrix estimation.

Now, we can use another approach for estimating the fundamental matrix. Here we first compute the homography induced by a plane by using 3 point correspondences; minimally we require 4 point correspondences in total. We have already seen that this homography is an affine homography, and 3 point correspondences are sufficient to compute it. So, we can do that, and then we need to compute an epipolar line using the fourth correspondence.

In this case it is sufficient to compute the epipolar line l', because from it we can get the epipole directly. The direction of the epipolar line itself gives the epipole, since for a line l the point at infinity along it is (l_2, -l_1, 0). So, the epipole is the point at infinity along that direction, which is the direction of the line. In this way the epipole is obtained.

l' = (H_A x_4) \times x'_4

𝐹𝐴 = [𝑒′]𝑥 𝐻𝐴

384
(Refer Slide Time: 10:12)

So, we have come to the end of this topic of stereo geometry. Let me summarize the different features that we discussed. First we discussed epipolar geometry in a stereo imaging system, and there we have seen that the epipoles, the scene point, the corresponding image points and the camera centres all lie on the same plane. Then, the fundamental matrix characterizes the stereo system and is unique to the stereo setup. It has various properties; transformation invariance is one of them: under an appropriate transformation of the image points you get back the same fundamental matrix, so you can use the transformed points to estimate it.

It is a 3x3 singular matrix with 7 degrees of freedom, that is, the number of independent parameters is 7. Since it is singular its determinant has to be 0 and its rank has to be less than 3; for a fundamental matrix the rank is 2. Functionally, it transforms an image point into its epipolar line in the second image plane: you multiply the fundamental matrix with the image point in the homogeneous coordinate representation, and you get a line in the homogeneous coordinate representation of the second image plane.

And for any pair of corresponding points you get this relationship:

𝑥′𝑇 𝐹𝑥 = 0

If we consider the epipoles e and e', then these are the relations:

F e = 0,  e'^T F = 0

So, they are the zeros of the fundamental matrix: the epipoles are the zeros of F. e is the epipole in the reference camera plane, so it is the right zero of F, and e' is the epipole in the second camera plane and is the left zero of the fundamental matrix. On the other hand, if I swap the convention of the reference camera, then F transpose becomes the fundamental matrix, and all these results are consistent with that representation.

Given camera matrices P and P', F is unique; that we discussed. In particular, we should note that if the projection matrices are in the form P = [I|0], the canonical form, and P' = [M|m], then the fundamental matrix is given as [m]_× M. We discussed the definitions of these notations in our lecture.

Or consider the very generic representation: I have given you the expression, which I did not really expand upon in my lectures, but please go through it and note it. From a general P = [M|m] and P' = [M'|m'] we compute the epipole, and the fundamental matrix is given by

𝐹 = [𝑚′ − 𝑀′𝑀 −1 𝑚]𝑥 𝑀′𝑀−1

Here the epipole is given by m' - M'M^-1 m; that is, you take the image of the centre of the first camera in the second camera, which comes out in this form. And the homography at infinity is given by M'M^-1. That is how we get this fundamental matrix.

386
(Refer Slide Time: 14:31)

Then we discussed how projection matrices are related to the fundamental matrix: though a pair of projection matrices gives a unique fundamental matrix, a fundamental matrix can lead to a family of pairs of projection matrices, because if you have a homography of three-dimensional space, which is a 4x4 non-singular matrix, a homography of the 3D projective space, then

(𝑃, 𝑃′ ) → 𝐹, (𝑃𝐻, 𝑃′ 𝐻) → 𝐹

So, given a fundamental matrix F there exists a family of stereo setups, or pairs of camera matrices; that we need to note. This is one example of such a family, taking the reference camera always in the canonical form [I|0]; you can then express the family, for that fundamental matrix, in terms of its epipole, an arbitrary 3-vector and an arbitrary scalar constant, as you can see in the expression shown here. Then, given camera matrices (P, P') and a corresponding pair of image points (x, x'), it is possible to reconstruct the respective 3D scene point X.

387
(Refer Slide Time: 15:58)

We discussed 3 approaches for this: one was using the triangulation geometry, the second one was minimizing the geometric reprojection error with a non-linear optimization technique, and the third one was the algebraic (linear) form of the triangulation approach. That is how we can derive the structure.

Then, let us also summarize the facts regarding the essential matrix, which is the fundamental matrix of calibrated cameras, and it has this structure: if I consider a stereo system where the reference camera is the canonical matrix [I | 0] and the second camera is related by a rotation R and a translation t, that is P' = [R | t], then the essential matrix is given by those parameters only, E = [t]_× R. One of the properties of the essential matrix is that two of its singular values are equal and the third one is 0.

388
(Refer Slide Time: 17:32)

Then you can compute its components [t]_× and R by a matrix decomposition of E such that E = SR, where S is a skew-symmetric matrix and R is orthogonal. If you have this decomposition, then you can also compute the projection matrices shown here. Apply the singular value decomposition of the essential matrix; there are two possible decompositions of the essential matrix, shown here in this form, which use the singular value decomposition matrices U and V and the two special matrices Z and W shown earlier.

Then the expression of S and R is given as you can see.

𝑆 = 𝑈𝑍𝑈 𝑇 , 𝑅 = 𝑈𝑊𝑉 𝑇 𝑜𝑟 𝑈𝑊 𝑇 𝑉 𝑇

And from there we can compute P'. Note that u3 here is the third column of the matrix U of
the singular value decomposition, that is, the column vector corresponding to the singular
value 0. Only one of the above choices is valid, namely the one for which a point is seen
in front of both the cameras.
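
The following is a minimal sketch (my own illustration, not the lecture's code) of this decomposition: the SVD of E gives U and V, the two rotations come from the special matrix W, the translation direction is u3, and the four candidate second cameras are returned; the cheirality test that keeps the one valid choice is only indicated in a comment.

import numpy as np

def decompose_essential(E):
    U, _, Vt = np.linalg.svd(E)
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    # Keep proper rotations (determinant +1); flip the sign if needed.
    if np.linalg.det(R1) < 0: R1 = -R1
    if np.linalg.det(R2) < 0: R2 = -R2
    t = U[:, 2]                       # u3: column for the zero singular value
    # Four candidates [R | +-t]; the valid one places triangulated points
    # in front of both cameras (cheirality check, not shown here).
    return [np.hstack([R, s * t.reshape(3, 1)]) for R in (R1, R2) for s in (1.0, -1.0)]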

389
(Refer Slide Time: 18:32)

Then we also discussed estimation of the fundamental matrix. Given a set of pairs of
corresponding points, it is possible to estimate the fundamental matrix; a minimum of 7
pairs of points is required. We discussed one particular method where we have seen how the
eigenvector corresponding to the zero (or smallest) eigenvalue of the resulting system of
equations is used.

Then a parametric representation of a fundamental matrix is given in this form. Note that
this form also contains the epipoles. You should note that there is a little ambiguity in
the slide in representing e' and e: e' should be e here, and e should be e' there.

390
(Refer Slide Time: 19:43)

And given a set of pairs of corresponding points, it is possible to estimate the camera
matrices and the scene points only up to a projective ambiguity, that is, up to a 4 x 4
homography matrix H, since

$PX = (PH)(H^{-1}X), \qquad P'X = (P'H)(H^{-1}X)$

Then, given a pair of corresponding lines l and l' and camera matrices P and P', it is
possible to reconstruct the respective 3D line L. We discussed that also. In this case you
can compute the plane formed by the line l and the camera centre of P, which we know is
$P^T l$, and also the plane formed by l' and the camera centre of P', which is $P'^T l'$.
These two planes intersect in a line, which is the corresponding three-dimensional line, so
that also you can reconstruct. So, L is the intersection of the planes $P^T l$ and $P'^T l'$.

391
(Refer Slide Time: 21:18)

Then, we discussed the plane-induced homography: a plane induces a homography between
corresponding image points in a stereo setup. This was the expression for the homography
for a particular description of the camera matrices and the plane. The plane is described
by $(v^T, 1)^T$, where v gives the direction of the normal of the plane, and the camera
matrices are given by [I|0], the canonical camera matrix of the reference camera, and
[A|a], a general camera matrix. Then the homography induced by that plane is given in this
form, namely H = A - a$v^T$.

We also discussed affine epipolar geometry, which simplifies the structure of the
fundamental matrix:

$F_A = \begin{bmatrix} 0 & 0 & a \\ 0 & 0 & b \\ e & d & c \end{bmatrix}$

The epipoles are given by the null vectors of $F_A$. The slide labels the left epipole as
[-b a 0] and the right epipole as [-d e 0], but there we are having the same confusion as
before: since [-b a 0]$F_A$ = 0, the vector [-b a 0] is actually the right epipole, and
similarly, since $F_A$[-d e 0]$^T$ = 0, the vector [-d e 0] is actually the left epipole.
So, please note this correction once again here also.

392
The left epipole is the epipole on the left image plane, and the right epipole is the
epipole on the right image plane. With this we have come to the end of this particular
topic, and we will be discussing our next topic, feature detection, in the next lecture.

Thank you very much.

Keywords: Affine epipolar geometry, fundamental matrix, left epipole, right epipole

393
Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 24
Feature Detection and Description Part - 1

In this lecture we will talk about Feature Detection and Description.

(Refer Slide Time: 00:23)

So far we have discussed camera geometry and stereo geometry, where we considered how to
obtain the projection matrix of a single-view camera, the homography between two views of a
scene, or the fundamental matrix between two images of the same scene in a stereo imaging
setup.

Now, there we assumed that the corresponding points of the images are given to us, and
using those corresponding points we obtained those quantities. But in this lecture what we
would like to explore is how to automate the process of detecting and obtaining the set of
corresponding pairs of scene points. That is primarily the issue we will be considering in
this particular topic.

So, here, for example, you can see two images of the same scene: there is a structure of an
ancient temple, and we have taken the image from two different views. The question is, how
do you match the scene points? For example, we know that this particular part of one image
and the corresponding part of the other image correspond to each other. But how do you
precisely define the points of correspondence even within those regions?

So, detecting the regions where they match approximately or crudely is one part, but
precisely locating the same scene points is also a requirement, and there are several
complexities in this problem. As we can see, the images are transformed: in this case you
get a view from a different three-dimensional perspective, but even for a two-dimensional
image there could be various kinds of transformation.

For example, the same image could be translated, it could be rotated, and even the scale
can vary, which means that you can get a shrunken version of the image, or you can get the
image from a distant viewpoint where objects look small; but still, in that case also, you
need to identify the corresponding points of the structures.

So, these are the challenges in detecting these points, and there are several issues
regarding these computations. Some of these issues are highlighted here. The first is
detection: even before you describe the structural points, you have to consider which
landmark points are easily detectable even after transformation. So, there is the problem
of detection; then you need to uniquely characterize those landmark points.

So, you need to describe the point by its neighbouring statistics, by looking at the
texture around it in the image, and finally you need to match them. There are several
candidates for such landmark points, and out of them we must decide which pairs correspond
to the same point in the scene. In this lecture we will be considering several such issues;
particularly, detection and description form the primary theme of this topic. Later on we
will also consider the computation of matching of pairs of points.

395
(Refer Slide Time: 04:46)

So, while we are considering detecting a feature, the idea is that you need to characterize
a feature point so that it has some uniqueness with respect to others.

That uniqueness should be preserved even after different kinds of transformation, as we
mentioned earlier, like translation, rotation, scaling, etc. In this case there are various
mathematical techniques by which we would like to define this uniqueness. We would like to
consider the local statistics around a point, we call this a local measure, and we desire
that this property be invariant with respect to transformation. So, in this diagram we are
trying to show, with the green square blocks, the regions of interest, say the central
point of each green square.

We consider that this is a point of interest and we are trying to find the statistics
around it; suppose we move this particular window around the same point and around
different image points. What kind of structural property would show some variation in the
measures if we move it? Consider this particular region, which is a flat region.

So, even if we move the window in different directions, the local statistics would still
look almost similar; it is a kind of uniform distribution of intensities in this particular
example. But when you consider the other case, where we are moving on an edge, then if I
move along the direction of the edge you will not get any change, the distribution would
look almost the same spatially; but if I move along the perpendicular direction of this
edge, we find that there is a change as we move. However, the significant change that you
will get even if you move in any direction occurs in this kind of structure, where two
edges are meeting.

In fact, this kind of structure is a corner structure, and even a slight movement of your
window will disturb the local distribution, and that can be reflected by some measure, some
local statistics; we will see later on how we can define these statistics.

So, the summary of this discussion, or the highlight of this particular example, is that
some structures have a certain uniqueness in describing their neighbourhoods, or they can
be conveniently characterized by some local statistics. These structures are mostly the
corners that we can see in this kind of 2-dimensional image. So, let us proceed to
understand what kind of statistics we can define.

(Refer Slide Time: 08:47)

So, consider a window in the same example that we discussed in the previous slide, and the
kind of measure which will change when we shift the window in different directions. We will
be considering the intensity distribution around a point, the intensity distribution within
the window, and we would like to see how the intensity values change because of the
shifting of this window. There is a particular measure we can use: during the shifting we
can find the difference between the intensity values at the corresponding points.

397
So, when there is just a translatory motion, every pixel is shifted by a constant vector in
a particular direction, say (u, v), as has been shown here; and if I take the difference
between the intensity values of those two pixels, in the shifted window and in the original
window, and take the sum of squares, then what is expected is that for a uniform or flat
region this sum of squared differences will be very small, almost 0. When it is an edge,
there will be some difference at the edge points; but when it is a corner, this difference
will be very prominent.

$E(u,v) = \sum_{(x,y)\in W} [I(x+u, y+v) - I(x,y)]^2$

So, you take the difference between these two values, square it, and consider all the
pixels in the window; for this measure you accumulate these squared differences over all
the pixels in the window. That is the measure described here.
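
A tiny sketch of this measure (my own illustration; it assumes a grayscale image stored as a NumPy array and a shift that keeps the window inside the image):

import numpy as np

def ssd_response(I, x0, y0, u, v, half=8):
    # E(u, v) for the square window of half-size `half` centred at (x0, y0).
    win = I[y0 - half:y0 + half + 1, x0 - half:x0 + half + 1].astype(float)
    shifted = I[y0 - half + v:y0 + half + 1 + v, x0 - half + u:x0 + half + 1 + u].astype(float)
    return np.sum((shifted - win) ** 2)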

(Refer Slide Time: 12:12)

So, we will express this particular measure in terms of differential geometric quantities
and see how you can compute it in a very general situation.

$I(x+u, y+v) \approx I(x,y) + \frac{\partial I}{\partial x}u + \frac{\partial I}{\partial y}v = I(x,y) + [I_x \; I_y]\begin{bmatrix}u\\v\end{bmatrix}$

398
So, for small u and v the higher order terms, that is, the terms involving higher-order
derivatives, can be ignored. So, I can simply write it in this form; we will use only the
first order changes.

And then this can be conveniently represented by this expression. As you can see this is
just a representation of the same equation here.

$I_x = \frac{\partial I}{\partial x}$

(Refer Slide Time: 13:57)

Expanding it further, we can see how to write this sum of squared differences. We can write
it conveniently in this particular notation, using only the first order changes along
certain directions. You can see that only the differential changes are needed for computing
this; you do not require the absolute pixel values in those windows, we can eliminate them
through these simple expressions.

In fact, as you can see, this quantity can be expressed as a quadratic expression using
matrix representations. There is a typical representation, which is shown here, and you can
see that this can be written as

$E(u,v) = \sum_{(x,y)\in W} \left([I_x \; I_y]\begin{bmatrix}u\\v\end{bmatrix}\right)^2 = \sum_{(x,y)\in W} X^T X, \quad \text{where } X = [I_x \; I_y]\begin{bmatrix}u\\v\end{bmatrix}$

399
And you can see this is that quadratic expression what I was talking about.

So, this quantity shown here actually reflects the local statistics. There are three
particular measures that characterize the point locally: the square of the derivative along
the x direction, the square of the derivative along the y direction (these are also denoted
as the u and v directions), and the product of the derivatives along x and y. What you
should note here is that this measure is an aggregation of the local statistics; it is not
a measure at that particular point alone. It is an aggregation. So, we should consider it
as a distribution, and as for any typical distribution, the expectations, or averages, of
those values are what would be used to form this matrix.

Now, this matrix characterizes the local statistics of the point and we will see how this
matrix will help us in characterizing a feature point.

(Refer Slide Time: 17:02)

So, this is the summary of this particular expression. Once again it is written in a more
prominent form here as we can see that this is the matrix I was talking about.

$H = \begin{bmatrix} I_x^2 & I_xI_y \\ I_yI_x & I_y^2 \end{bmatrix}$

We will denote this matrix as H.

400
(Refer Slide Time: 17:22)

We will continue with this representation. The sum of squared differences, represented by
the function E, can be represented in this form, where locally around a pixel you need to
measure the differential changes along the x and y directions and take their averages.

You need to take the averages over the squares of those changes along the x and y
directions, and also the average of the product of the changes along x and y. So, the big
question once again is: if the centre of the green window moves anywhere on the blue unit
circle, how does this particular quantity change? That is the question we need to ask, and
we need to find out which are the directions for which these changes would be the largest
and the smallest.

Mathematically, this can be found out if I perform an eigen analysis of the particular
matrix I have referred to here; we denote this matrix by H.

So, we will make an eigen analysis of this matrix, and the eigenvectors will give us those
directions: the one corresponding to the larger eigenvalue gives us the direction of the
largest change, and the one corresponding to the smaller eigenvalue gives us the direction
of the smallest change. From the properties of linear algebra, the eigenvectors of a
symmetric matrix are orthogonal, so there are two orthogonal directions that you get.

401
(Refer Slide Time: 19:25)

So, here is a quick overview of the computation of eigenvalues and eigenvectors. In this
particular case we have a very simple situation because our matrix is just a 2x2 matrix.
As you know, the definition of an eigenvector of a matrix A is that if I transform a vector
x in the same dimensional space with the transformation A, I should get a vector in the
same direction, with only a change of magnitude.

So, it is a scaled vector that you get, the scale value is the eigenvalue, and the vector
that does not change its direction is called the eigenvector. One of the ways you can find
these eigenvalues is as follows:

𝐴𝑥 = 𝜆𝑥

$\det(A - \lambda I) = 0, \qquad \det\begin{pmatrix} h_{11}-\lambda & h_{12} \\ h_{21} & h_{22}-\lambda \end{pmatrix} = 0$

So, there will be two values of lambda, and in particular, if the matrix is symmetric, you
will get real values. Let us see a summary of this computation; the computation of the
determinant in this case is shown here.

Now, there will be a coefficient of $\lambda^2$, a coefficient of $\lambda$, and a constant
term, so it is a quadratic equation of the form $Ax^2 + Bx + C = 0$, and then you know what
the solutions are.

402
(Refer Slide Time: 22:28)

In fact, I will just show you the solution here in terms of the elements of the matrix.
This is the solution of the equation, and in this case we denote the two values as lambda
plus, the larger one, and lambda minus, the smaller one:

$\lambda_{\pm} = \frac{1}{2}\left[(h_{11}+h_{22}) \pm \sqrt{4h_{12}h_{21} + (h_{11}-h_{22})^2}\right]$

To get the eigenvector we need to solve the corresponding homogeneous system; one of the
two equations is redundant, so you can rather set the y-component equal to some constant
value (assuming it is not 0) and then find the corresponding eigenvector.
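
A short sketch of this closed-form computation (my own illustration, not from the slides), for a symmetric 2x2 matrix with entries h11, h12, h21, h22:

import numpy as np

def eig_2x2(h11, h12, h21, h22):
    tr = h11 + h22
    disc = np.sqrt(4.0 * h12 * h21 + (h11 - h22) ** 2)
    lam_plus, lam_minus = 0.5 * (tr + disc), 0.5 * (tr - disc)
    # Eigenvector for lam_minus: solve (h11 - lam) x + h12 y = 0 with y = 1.
    if h11 != lam_minus:
        v_minus = np.array([-h12 / (h11 - lam_minus), 1.0])
    else:
        v_minus = np.array([1.0, 0.0])
    return lam_plus, lam_minus, v_minus / np.linalg.norm(v_minus)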

403
(Refer Slide Time: 23:37)

So, once you know lambda, you find the eigenvector by solving the remaining equation. This
is the review of the computation of eigenvalues and eigenvectors for this kind of matrix.
Now let us see how these quantities help us in characterizing a point. Once again, we have
shown here the matrix H; we can compute the larger eigenvalue $\lambda_+$ and the smaller
eigenvalue $\lambda_-$, and the corresponding eigenvectors are denoted here as x+ and x-.

So, suppose you want to find the shifts with the smallest and the largest change of the E
value: x+ is the direction of largest increase in E, $\lambda_+$ is the amount of increase
in direction x+, x- is the direction of smallest increase in E, and $\lambda_-$ is the
amount of increase in direction x-.

404
(Refer Slide Time: 24:29)

So, our objective is to define a feature point where the value of E(u, v) is large for
small shifts in all directions, because that is how we are trying to characterize a point.
We said that the local measure should be disturbed even by a small shift, which means this
accumulated difference should be large; E(u, v) should be large for every shift direction,
which means we should consider the directions of the largest and the smallest change.

Just to ensure that it is large in every direction, even the smallest change should be very
high. That is why the smaller eigenvalue is the more important one for identifying or
characterizing a feature point. So, the minimum of E(u, v) over all unit shift vectors
should be large, and this minimum is given by the smaller eigenvalue $\lambda_-$ of H, as
we discussed.

405
(Refer Slide Time: 25:37)

Some examples shown here are taken from the Weizmann Institute slides by Darya Frolova and
Denis Simakov; in fact, all the preceding slides are adapted from those lecture slides,
just to acknowledge them.

Now, as we can see in this very nice example, there is a chessboard pattern, and if I
perform the eigen analysis of that sum of squared differences at every point, then we can
get the distribution of the larger eigenvalue $\lambda_+$ over this space, which is shown
here.

You can see that in all the uniform regions this larger eigenvalue is almost 0, which is
why they appear black, whereas on the edges it is quite prominent. In fact, it should be
even more prominent at the intersection points, which are the corners; just because of an
optical illusion they look a little darker here. It becomes clearer if I plot the smaller
eigenvalue instead: then we see that those intersection points are actually highlighted,
while even for edges the smaller eigenvalue is almost 0.

So, only the intersection points of the edges, which we can call the corner points of those
black and white squares, are prominently highlighted, and this is how we can detect those
corner points in this particular example.

406
(Refer Slide Time: 27:24)

So, based on this discussion we can now design an algorithm for detecting features, which
are actually corner points in an image. What we can do is compute the gradient at each
point in the image, and then obtain the matrix H from the entries of the gradient, as we
discussed. In fact, this is how the elements of H are defined. As I mentioned, you need to
compute the gradient along the x direction and the y direction, and also the squares of the
gradients, because when you take the averages you need to take the averages over the
squares of the gradients; you should not take the average of the gradient along x first and
then square it.

$H = \begin{bmatrix} I_x^2 & I_xI_y \\ I_yI_x & I_y^2 \end{bmatrix}$

You have to note this particular point, because when you implement this algorithm you have
to be careful; otherwise (if you average the gradients first and then take products of the
averages) the determinant of H would always be 0, and that would be a problem for
characterizing the point.

407
(Refer Slide Time: 28:45)

Then we should compute the eigenvalues of H as we discussed, and locate the points with a
large response of the minimum eigenvalue. We can define a large response through some
empirically chosen threshold: if the response is greater than the threshold we can consider
the point as a candidate. But that is not a very precise characterization, because even the
neighbouring points of a particular corner point can have this large response.

So, just to precisely locate the corner point, we need to consider the local maxima of
those responses. That would give us the feature points.
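
A compact sketch of the whole procedure (my own illustration; it assumes a grayscale float image and that SciPy is available, and the window size and threshold are arbitrary choices):

import numpy as np
from scipy.ndimage import uniform_filter, maximum_filter

def corner_points(I, win=5, thresh=0.01):
    Iy, Ix = np.gradient(I.astype(float))      # gradients along y and x
    # Window averages of the products of gradients (average the squares,
    # not the square of the averages, as emphasised above).
    Sxx = uniform_filter(Ix * Ix, win)
    Syy = uniform_filter(Iy * Iy, win)
    Sxy = uniform_filter(Ix * Iy, win)
    # Smaller eigenvalue of H = [[Sxx, Sxy], [Sxy, Syy]] at every pixel.
    tr = Sxx + Syy
    disc = np.sqrt((Sxx - Syy) ** 2 + 4.0 * Sxy ** 2)
    lam_min = 0.5 * (tr - disc)
    # Keep pixels that are both strong and local maxima of the response.
    mask = (lam_min > thresh * lam_min.max()) & (lam_min == maximum_filter(lam_min, 3))
    return np.argwhere(mask)                    # (row, col) feature points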

408
(Refer Slide Time: 29:30)

This is once again a very nice example from those slides. In these computations, what
appears as a very tiny dot at the original image resolution, if I zoom into it and look at
the distribution of the pixels in a small neighbourhood at a larger display resolution,
effectively shows a distribution of intensity values: a kind of star pattern, as you can
see in the central part. In fact, there lies a local maximum.

That is the corner point, and that is how a particular feature point is precisely defined;
we will be considering that local maximum. This is the importance of the computation of
local maxima: points where $\lambda_-$ is a local maximum should be taken as the feature
points.

409
(Refer Slide Time: 30:35)

This computation gives us the famous Harris operator. In this case we considered
$\lambda_-$ as the measure, but there is a variant for computing this measure: instead of
computing $\lambda_-$ directly, we can compute a similar measure which gives a proportional
quantity. One such measure, which has been shown here by the symbol f, considers the
product of the eigenvalues normalized by the sum of the eigenvalues.

$f = \frac{\lambda_1 \lambda_2}{\lambda_1 + \lambda_2} = \frac{\mathrm{determinant}(H)}{\mathrm{trace}(H)}$

You can see that, since both eigenvalues should be large, this product should be very
large, and it is then normalized with respect to the sum of the eigenvalues. These
computations can be carried out conveniently without any square root operations: computing
the determinant of the matrix H gives the product of the eigenvalues, and the sum of the
eigenvalues is given by the trace of the matrix H we defined. Using this measure we can
look for those points. This is called the Harris operator, because by computing this
measure and its local maxima we can detect the features.

This is the reason why the Harris corner detector is more efficient than computing the
eigenvalues directly. There could be many other detectors, but this is one of the most
popular detectors in the literature.
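
As a brief follow-on sketch (again my own illustration), the same response can be obtained without any square root, using the windowed gradient products Sxx, Syy, Sxy from the previous sketch:

def harris_response(Sxx, Syy, Sxy, eps=1e-12):
    det_H = Sxx * Syy - Sxy * Sxy    # lambda1 * lambda2
    trace_H = Sxx + Syy              # lambda1 + lambda2
    return det_H / (trace_H + eps)   # f = det(H) / trace(H)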

410
(Refer Slide Time: 32:32)

This just shows how the two operations are almost equivalent. At the top, the pair of
images shows the distribution of the functional values using the Harris operator; it looks
a little flatter, but the local maxima are still retained, while the sharper one is of
course the smaller eigenvalue. But we have also discussed the advantage of the Harris
operator: here you do not require any square root operation; you simply compute the
determinant and the trace of the matrix and take their ratio.

(Refer Slide Time: 33:09)

411
Some examples of this detector are shown here: an input image is shown, then the
distribution of the f value, and if I take the local maxima we get precise locations of
those points, which are like corners; in the intensity distribution there are sharp changes
at those particular points.

(Refer Slide Time: 33:32)

So, let us stop here for this particular lecture. We will continue this discussion of
feature detection and description, and the motivation for obtaining the descriptor, namely
feature matching, in the next lectures.

Thank you very much.

Keywords: Harris corner detector, local maxima, match scene points, feature detection.

412
Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 25
Feature Detection and Description Part - II

We are discussing the topic of Feature Detection and Description. In the last lecture we
discussed how the interesting points, or key points, which remain mostly invariant under
different transformations, can be detected; those are the corner points. So, the question
now is: how do you characterize those corner points?

(Refer Slide Time: 00:39)

For example, take these two images of the same view, which I showed at the beginning of
this topic. Here, as you can see, there are many interesting points: if I perform the same
computation of the Harris operator, then we can obtain the corresponding feature values,
and you can see the corner points, which look a little faded in this display.

These are the corner points, and there are many of them. Out of them, which pairs of corner
points, or which pairs of key points, correspond to the same scene point? That is the
question. How do you get those pairs of points and how do you establish those
correspondences? This is the problem of feature matching that we will be considering. So,
let us consider what the stages are.

413
(Refer Slide Time: 01:47)

These are the three stages of the computation of matching. First, we need to detect the
feature points in both images; for example, we have applied the Harris operator and taken
the extracted key points as the feature points.

But we do not know which point corresponds to which one in the other image. So, to uniquely
describe every point, we can describe it by some local statistics; we will discuss how
these statistics can be computed. That is, we have to build a descriptor for each feature
point, which we call the feature descriptor. Then, of course, using those descriptions we
have to find the corresponding pairs of points; this task is known as matching.

414
(Refer Slide Time: 02:37)

So, again, the main issue is: can you select the same features under various
transformations? As we saw while detecting features, this is a major concern. Even after
detecting the key points, can you detect the same true pairs of key points under various
transformations? That means your description should also be invariant to those
transformations, for example, rotation, change of illumination, variation of scale, and
many other such transformations.

(Refer Slide Time: 03:23)

415
So, one key idea for detecting the corners is that we need to find an appropriate scale;
one of the major concerns in this detection is scale. Making the description scale
independent requires a lot of tricks: usually you can make descriptions rotation
independent or translation independent, which is easier to do, but scale is much trickier.

To get scale independent features, what we can do is consider the image representation at
multiple scales, which means multiple resolutions, and observe how the local measure 'f'
that we are considering varies with scale. At the proper scale it is expected that this
value will be very high compared to the other scales, which are not appropriate for
reflecting that measure. We will continue and understand this process.

So, it is a local maximum in both position and scale that we will be considering, and there
are various kinds of such measurements, like the Laplacian measurement, which is a second
derivative operation over the images. We will define this measure mathematically soon;
alternatively, we can consider differences between two Gaussian-filtered images with
different scales, that is, different standard deviations. The standard deviation of a
Gaussian mask is also called the scale in image processing jargon, so you will hear similar
terms.

(Refer Slide Time: 05:15)

We discussed this particular thing; let us elaborate on it. Say you have two images I1 and
I2, and say I2 is a transformed version of I1. As I mentioned, there are different kinds of
transformations: not only translation, rotation and scale, but also non-uniform scaling,
illumination changes, view changes, reflection, and so on; so you need transformation
invariance.

Transformational invariance of this measure means we have to detect the same features
regardless of the transformation; and not only the detection, the description should also
be unique, so that your matching is successful even after transformation.

(Refer Slide Time: 06:11)

So, both the detection and the description should be invariant, and both should be ensured.
As we have discussed, the Harris measure is invariant to translation and rotation, but we
did not consider variation over scale. We will see how this can be ensured; as discussed in
the previous slide, we can use a multi-resolution representation of images.

Just to explain a little bit about this multi-resolution representation: suppose you have
an image with an object, and it is given in its original resolution; that is, the camera
resolution, the number of pixels you have got from the sensor, and the spatial resolution.
That particular number of pixels gives you the highest resolution for that imaging setup.

417
But then what you can do is sub-sample, or down-sample, these pixels and get smaller-sized
images. In some cases the object can become so small that it may appear like a dot. So, you
can see that the structural information varies with the resolution; it may not retain the
same structural information. Sometimes you have a very tiny object: at a high resolution it
can be detected, whereas at a lower resolution it would be lost.

On the other hand, if you have a very large object in an image, then using local measures
to get an overall idea of its shape is difficult; it is hard to comprehend and analyze. But
if I take a lower resolution version of the image, then even a small window will be able to
capture that particular feature, and the local measure corresponding to that representation
should give a higher value. That is the idea of having a multi-resolution representation:
you vary the resolution, but the highest resolution is already constrained by the imaging
system.

The only thing you can do is generate lower resolution versions and see whether some of the
structures in those lower resolutions become more prominent and easily detectable. This
idea has been used in particular for feature detection.

(Refer Slide Time: 09:27)

418
So, this is one approach: you can compute features at multiple scales using a Gaussian
pyramid, which means you smoothen the image using a Gaussian mask of increasing standard
deviation; the scale increases iteratively, and you also down-sample the image, so you get
a pyramidal structure of the representation. Let me show it in this form: at the bottom is
the higher resolution image.

In the next version, after smoothing this image using a Gaussian mask and sub-sampling, you
get the next resolution; you then apply further smoothing, which effectively means using a
larger mask over the original image, getting a coarser distribution, and sub-sampling it
again. So, you get the next level of resolution.

This kind of representation is useful. Since the shape looks like a pyramid if I stack
those images in this particular vertical order (it is just a visualization), there is a
very popular term used for this multi-resolution representation: we call it a Gaussian
pyramid representation, which means every image is convolved with a Gaussian mask and then
subsampled, and you get multiple representations.

So, for a single image you get 'n' images of varying sizes by this process; that is what a
Gaussian pyramid representation is. In fact, there is a very efficient and effective method
by which you can compute the best scale for feature detection, and I will be discussing it
as part of a method called SIFT, the scale invariant feature transform. So, we will discuss
that. The basic idea is that a feature descriptor should be transformation invariant.

419
(Refer Slide Time: 11:53)

It should capture the information in a region around the detected feature point; that
should be the property. For example, we can consider a histogram of gradient directions in
a square window centred at a feature point. So, these are the two steps: one is detection,
which should be invariant to transformations including scale, and the other is description;
you should compute the description at that scale.

So, whatever local statistics you collect, you should collect them from the image which has
been transformed through that multi-resolution processing, that is, convolved with a
Gaussian mask of that scale; then consider the point detected at that scale and collect the
neighbouring statistics at that scale. These are the two policies which are used to get a
transformation invariant, scale invariant description.

420
(Refer Slide Time: 12:51)

This diagram explains what I wanted to show: at different resolutions the local
descriptions vary. At an appropriate resolution the letter 'a' is visible, but if you look
too closely the 'a' gets missed. So, if there is any measure of the interestingness at a
particular resolution, then there is an appropriate resolution in the middle where you get
the most interestingness, and that is what is shown here, hypothetically, by this
particular curve.

(Refer Slide Time: 13:31)

421
For scale invariant detection, one of the major tasks is to determine the appropriate
scale, and for that we need to convolve the image with suitable kernels. We now have a
3-dimensional space, because besides the 2-dimensional spatial locations along the x and y
directions, there is a direction along scale.

So, you observe the measures in both position and scale, which gives the three-dimensional
space. There are different kernels which we are going to define here, like Laplacian
kernels and differences of Gaussians; these are masks, as we defined earlier, which need to
be convolved with the image to give a measure, and you would like to find the local maxima
of that measure.

(Refer Slide Time: 14:33)

So, just to define it, first let us consider the definition of a Gaussian function. As you
can see, this is a 2-dimensional Gaussian function, uniformly scaled along the directions,
which means the standard deviation is the same in all directions, in particular along the
two principal directions x and y; and it is a continuous function.

$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2+y^2}{2\sigma^2}}$

422
So, for discrete processing you need a discrete representation of this function over a
mask, or window. If I choose a mask size of, say, 10 x 10, then, considering the centre of
the mask as the origin, you compute the functional values at the other locations.

The mask size depends upon the value of sigma; one criterion could be that it should be
about $2\sqrt{2}\sigma$, which gives a fairly large mask. Next, this is the definition of
the Laplacian mask, which we have defined using the Gaussian function:

𝐿 = 𝜎 2 (𝐺𝑥𝑥 (𝑥, 𝑦, 𝜎) + 𝐺𝑦𝑦 (𝑥, 𝑦, 𝜎))

As you can see, these are the second derivatives of the Gaussian function; it is the sum of
the second derivatives along the x direction and the y direction, normalized by multiplying
with sigma squared to make it a scale invariant description. The difference of Gaussian
descriptor is defined in this fashion:

𝐷𝑜𝐺 = (𝐺 (𝑥, 𝑦, 𝑘𝜎) − 𝐺 (𝑥, 𝑦, 𝜎))

From the nomenclature itself it is understood that it is a mask defined as the difference
of two Gaussian functions at two different scales: one scale is $k\sigma$ and the other is
$\sigma$.

(Refer Slide Time: 16:25)

423
These are the shapes of the kernels, that is, the shapes of those functions shown in 1-D;
in 2-D you just rotate them about the axis of symmetry at the centre, that is, about the y
axis, and then you get the 2-dimensional mask, whose values can be visualized as a
3-dimensional surface over the plane. You can see from this plot that the difference of
Gaussian and the Laplacian are quite similar.

(Refer Slide Time: 17:13)

They are quite similar; in fact, mathematically one can also show it. Take the Gaussian
function and take its partial derivative with respect to sigma; then you can show that it
is equal to sigma times the Laplacian of G, the Laplacian of that Gaussian mask.

$\frac{\partial G}{\partial \sigma} = \sigma \nabla^2 G, \qquad \nabla^2 G = \frac{\partial^2 G}{\partial x^2} + \frac{\partial^2 G}{\partial y^2}$

So, this can be shown, and you can see that the difference of Gaussian mask is proportional
to the corresponding scale-normalized Laplacian mask; the factor (k - 1) is kept constant
across scales.

$G(x,y,k\sigma) - G(x,y,\sigma) \approx (k-1)\sigma^2 \nabla^2 G$

424
So, it does not influence extreme locations.
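
A quick numerical illustration of this proportionality (my own sketch; the grid size, sigma and k used here are arbitrary choices):

import numpy as np

sigma, k = 2.0, 1.2
y, x = np.mgrid[-15:16, -15:16].astype(float)
r2 = x ** 2 + y ** 2

def G(s):
    return np.exp(-r2 / (2 * s * s)) / (2 * np.pi * s * s)

dog = G(k * sigma) - G(sigma)
# Analytic Laplacian of the Gaussian at scale sigma.
log = (r2 - 2 * sigma ** 2) / (2 * np.pi * sigma ** 6) * np.exp(-r2 / (2 * sigma ** 2))
err = np.abs(dog - (k - 1) * sigma ** 2 * log).max() / np.abs(dog).max()
print(f"relative deviation: {err:.3f}")   # small when k is close to 1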

(Refer Slide Time: 18:17)

Figuratively, this shows how the computation proceeds. You convolve with a series of
Gaussian masks, at different layers you sometimes down-sample, you produce the functional
distribution in a three-dimensional space, and then you find the local extrema.

So, that is what you will be computing. In this particular figure the order of computation
has been shown. This is the original image; you perform the Gaussian convolution, so this
is a Gaussian-smoothed image, and then you take the subtraction, which is a difference of
Gaussian. This is the first difference-of-Gaussian layer. Again you smooth using a Gaussian
convolution and subtract this one from the previous one, so you get the second layer of
difference of Gaussian images.

So, in this way you are producing the representation in a 3 dimensional space of scale and
positions and then you have to find out the extrema in 3 dimensional DoG space or
Difference of Gaussian space.
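
A compact sketch of this procedure (my own illustration; it assumes SciPy and a grayscale float image, and sigma0, k, the number of scales and the threshold are arbitrary choices):

import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_extrema(I, sigma0=1.6, k=2 ** 0.5, n_scales=5, thresh=0.03):
    I = I.astype(float)
    blurred = [gaussian_filter(I, sigma0 * k ** i) for i in range(n_scales)]
    dog = np.stack([blurred[i + 1] - blurred[i] for i in range(n_scales - 1)])
    # Keep points that are extrema of their 3x3x3 neighbourhood in (scale, y, x)
    # and whose response is sufficiently strong.
    is_max = dog == maximum_filter(dog, size=3)
    is_min = dog == minimum_filter(dog, size=3)
    strong = np.abs(dog) > thresh * np.abs(dog).max()
    return np.argwhere((is_max | is_min) & strong)   # rows of (scale, y, x)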

425
(Refer Slide Time: 19:37)

So, just to summarize, we have discussed two different kinds of detectors. The difference
of Gaussian detector is used in the SIFT descriptor, which I will discuss next; it was
proposed by David Lowe and has been found to be very popular. The other is the
Harris-Laplacian detector, that is, the Harris operator applied over a Laplacian pyramid
representation with varying scale; you take the local maxima of the corner response over
space and scale, and that also gives you transformation invariance. These are the two major
detectors which are popular in the literature.

(Refer Slide Time: 20:27)

426
So, to summarize scale invariant detection: given two images of the same scene with a large
scale difference between them, we have to find the same interest points independently in
each image, and the solution is to search for maxima of suitable functions in scale and in
space over the image. These are the two methods I mentioned: Harris-Laplacian maximizes the
Laplacian over scale together with the Harris measure of corner response over space in that
scale representation, and SIFT maximizes the difference of Gaussians over scale and space.

(Refer Slide Time: 21:17)

Now, the thing is that out of this process you get a lot of key points. Some of these key
points are not so important, and some of them may not be structurally very robust, because
a small disturbance can displace them.

We would like to keep only those key points which are more robust to transformation and
whose locations can be defined very precisely. For example, many key points will lie on
edges, and as we have discussed, corners are more robust than edges. Even some edge points
can give a very high response and can be local maxima, but we need to eliminate those edge
points.

427
(Refer Slide Time: 22:11)

So, there are certain operations you can perform over the key points. For example, you can
apply this particular operator, whose entries are the double derivatives of the difference
of Gaussian functional values, and it is expected to give you the curvature values.

So, the eigenvalues of this particular matrix

$H = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix}$

will give the principal curvatures. For an edge point the curvature is expected to be large
across the edge but small along it, whereas for a good key point both principal curvatures
should be large. This is one characterization by which you can eliminate the edge points.

$\frac{\mathrm{Tr}(H)^2}{\mathrm{Det}(H)} < \frac{(r+1)^2}{r} \quad \text{for a chosen value of } r \text{ (say } r = 10\text{)}$

So, you compute the eigenvalues; both should be large and they should not differ too much
from each other.

This is how the computation is carried out; it is equivalently done by computing the trace
and determinant of the matrix H defined here. Note that this H is different from the one we
discussed for the Harris operator, because its entries are the double derivatives of the
difference of Gaussian function, as you can see from the definition. The ratio of the
square of the trace to the determinant becomes very high for edge-like points, so the ratio
should stay below the threshold above; eliminate a key point if the ratio is greater than
this threshold.
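
A tiny sketch of this test (my own illustration), where Dxx, Dyy, Dxy are the second derivatives of the difference-of-Gaussian function at the key point:

def is_edge_like(Dxx, Dyy, Dxy, r=10.0):
    tr_sq = (Dxx + Dyy) ** 2
    det = Dxx * Dyy - Dxy * Dxy
    # A negative determinant means the curvatures have opposite signs: reject.
    return det <= 0 or tr_sq / det >= (r + 1) ** 2 / r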

(Refer Slide Time: 24:19)

Now the question is: how do you characterize a key point? There are some local statistics
we need to consider. One attribute we will consider is the major orientation in its
neighbourhood. We can assign that orientation because, if you know the major orientation,
the local statistics can be made orientation independent: you perform a transformation so
that your reference axis is aligned to that major orientation, and then you aggregate the
local statistics after performing that transformation; that is how you make it rotation
invariant.

So, computation of the orientation is important. What we can do is locally collect the
gradient directions in the neighbourhood, compute a histogram, and bin them. You can see in
this particular example that the binning of the directions is shown by these angles,
because each direction can be measured by the angle it forms with respect to the reference
x axis, which is what is of interest. You can discretize the range of these angles, varying
from 0 to 2π, into some intervals and then put each direction into one of those bins.

That is what is known as binning of the directions into a histogram; then find which bin is
the prominent one out of these discrete options, and we can assign that direction to that
key point. So, assign the canonical orientation at the peak of the smoothed histogram: you
can even perform smoothing of this histogram, then compute the peak of that function, and
that peak gives you the orientation.

In this way a key point is described by its position, scale and orientation: as I
mentioned, the scale is determined where you get the local maximum in position and scale,
so the scale and position are obtained there, and the orientation is computed in this
fashion. There could be some situations where you have two major orientations; then you may
have to use both, that is, multiple descriptions of the same key point.
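
A small sketch of this orientation assignment (my own illustration; the patch size, the number of bins and the smoothing are arbitrary choices):

import numpy as np

def dominant_orientation(patch, n_bins=36):
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)              # angles in [0, 2*pi)
    hist, edges = np.histogram(ang, bins=n_bins, range=(0, 2 * np.pi), weights=mag)
    hist = np.convolve(hist, np.ones(3) / 3.0, mode='same')  # simple smoothing
    peak = np.argmax(hist)
    return 0.5 * (edges[peak] + edges[peak + 1])              # orientation in radians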

(Refer Slide Time: 27:05)

These are some examples which are again taken from those slides; in fact, they also appear
in the book by Hartley and Zisserman. There are examples showing key points after the
gradient threshold and key points after the ratio threshold; this example shows how the
number of key points is reduced by those different kinds of processing.
430
Initially you have 832 key points on the image, with different orientations shown; the
position of each is shown, and a scale is associated with it too, though scale is difficult
to show in this particular diagram.

Then, after applying a gradient threshold, the number is reduced to 729; and after applying
the ratio threshold, which uses the curvature analysis of the Hessian matrix of the
difference of Gaussian function, the number of key points is reduced further.

(Refer Slide Time: 28:33)

So, this is the summary of the key point characterization: it has a location, a scale and
an orientation. Next we need to discuss how to compute a descriptor for the local image
region about each key point; it should be highly distinctive and also as invariant as
possible to variations such as changes in viewpoint and illumination. We will continue this
discussion in the next lecture.

Thank you very much for listening.

Keywords: Feature matching, detectors, descriptors, keypoints

431
Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 26
Feature Detection and Description Part – III

We are discussing the detection of feature points and also how to describe those feature
points uniquely, so that both detection and description become transformation invariant.

(Refer Slide Time: 00:33)

In the last lecture, we discussed how a feature point can be very distinctive and robust to
transformations such as scaling, rotation and translation, and how such points are
characterized by their position, scale and orientation.

Now, we will consider a descriptor which is very popular and widely used in image
processing techniques; this descriptor is called the scale invariant feature transform. It
was proposed by David Lowe in 1999. In fact, these slides are adapted from his
presentation, and we will consider how this descriptor is built.

Consider that we have detected a key point. Around that key point we take a 16x16 square
window. You should note that it is orientation corrected; that is, once you have the
dominant orientation around the key point, you perform a rotation to align it with your
reference axis, for example the x axis, and then in that rotated image you take the 16x16
square window. Let us see how the descriptor, the local statistics, is accumulated; there
are some diagrams and I will explain them.
there are some diagrams I will be explaining them.

What do we need to do? We need to compute the edge orientation, that is, the angle of the
gradient minus 90 degrees; that is how the edge orientation is defined. As I discussed in
the very first lecture on gradient operations, you compute the derivative along the x
direction and the derivative along the y direction; these two components give a direction,
which is the angle of the gradient, and if I subtract 90 degrees from it I get the edge
orientation for each pixel.

Then you can throw out weak edges, that is, apply a threshold over the gradient magnitude,
and create a histogram of the surviving edge orientations, which is shown figuratively in
those diagrams. You can see in this picture, for a smaller window of course, how the
directions are shown at every pixel of the 16x16 window.

At every pixel you compute the gradient, and these are the directions; some of them are
thrown out if their magnitudes are weak. Then you bin those directions into a histogram, as
I discussed earlier: since directions range from 0 to 2π radians, you discretize that range
into intervals, and each interval is called a bin. In this case, as you can see from the
diagram itself, there are 8 directions.

So, there are 8 such intervals of 45 degrees (π/4) each, and within them you bin the
directions, and then you get the histogram. The length of each vector in this particular
diagram shows the magnitude: it can be the number of pixels, or you can even add the
gradient magnitudes to get the magnitude of the corresponding bin of the angle histogram.

433
(Refer Slide Time: 04:34)

Next, you divide this 16x16 window into a 4x4 grid of cells (in this particular diagram a
2x2 grid is shown as a typical example), and for each cell you compute an orientation
histogram. So, the histogram I discussed is not computed over the whole 16x16 window; it is
computed for each cell of the 4x4 grid, and each one has 8 bins, so each cell gives an
8-dimensional vector. Since there are 16 cells, if you finally concatenate all those
vectors in some order, that gives you a 128-dimensional descriptor. That is your SIFT
descriptor.
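
The following is a compact sketch of such a descriptor (my own simplified illustration, not Lowe's reference implementation; it omits the Gaussian weighting and interpolation used in the original method). It takes a 16x16 orientation-corrected patch, splits it into a 4x4 grid of 4x4-pixel cells, and concatenates an 8-bin orientation histogram per cell.

import numpy as np

def sift_like_descriptor(patch16):
    assert patch16.shape == (16, 16)
    gy, gx = np.gradient(patch16.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    desc = []
    for r in range(0, 16, 4):
        for c in range(0, 16, 4):
            h, _ = np.histogram(ang[r:r + 4, c:c + 4], bins=8,
                                range=(0, 2 * np.pi),
                                weights=mag[r:r + 4, c:c + 4])
            desc.append(h)
    d = np.concatenate(desc)                  # 16 cells x 8 bins = 128 values
    return d / (np.linalg.norm(d) + 1e-12)    # normalised for illumination robustness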

(Refer Slide Time: 05:26)

434
The properties of the SIFT descriptor are that it is capable of handling changes in
viewpoint and significant changes in illumination. To quantify that, it has been observed
that it can handle up to about 60 degrees of out-of-plane rotation, and it is found to be
invariant even to the illumination variations between day and night.

(Refer Slide Time: 05:55)

That is about the SIFT descriptor. But there are various other descriptors, and I will
briefly summarize them in this lecture. One that followed the SIFT descriptor is speeded up
robust features, or the SURF descriptor, which was proposed in 2006. This descriptor is
also very popular because various modifications were introduced in its computation so that
it can be computed efficiently; that means the computational speed increases significantly
using this descriptor.

In the SIFT descriptor, the major computation we need to carry out is the Gaussian
convolution. In the SURF descriptor, instead of Gaussian convolutions, we perform box type
convolutions; that means we compute the same gradients and the same double derivatives, but
the detectors used in this case are convolved using box type filters. I will explain what
is meant by box type filters.

In this case the convolution masks have only weights of, say, 1 and -1, and there are
patches where all the weights are 1 and patches where all are -1. Those patches are
rectangular, and that is why this kind of filter is called a box type filter. These
computations can be carried out using an integral image, which again I will explain in this
discussion.

The key point detection with this descriptor is performed using the Hessian operator:
instead of the difference of Gaussian operator, we consider the Hessian of the image
itself. The Hessian consists of the double derivative operations: for an intensity image,
you take the double derivatives along the x direction, along x and y, along y and x, and
along the y direction. That is the Hessian operator, and you look for the local maxima of
the determinant of this Hessian.

Now you can perform these operations over a multi-resolution representation, where, instead
of convolving with Gaussian masks, the Hessian operator itself, that is, the computation of
the double derivatives, is carried out with masks of different sizes. With increasing mask
size you obtain larger scales: you are equivalently performing more smoothing of the image,
which means a lower resolution representation, and in that representation you try to
capture the local maxima.

(Refer Slide Time: 09:43)

We also have to accumulate orientation corrected Haar wavelet responses in this case. In
SIFT we accumulated orientation corrected gradient directions; here the gradient
computation is carried out using Haar wavelets, which can be easily implemented by box
filters. I will elaborate in subsequent slides.

(Refer Slide Time: 10:09)

So, this is the Hessian operator: the convolution of the image with Gaussian second order
derivatives, but this is actually replaced by a box filter in the SURF case. A key point is
a maximum of the determinant of this over space and scale.

(Refer Slide Time: 10:35)

This is the approximation. You can see in this picture how the box filters look. In this
representation the brighter values are shown by weights of 1 and the darker values by
weights of -1; it is not only 1 and -1, there could be other integer values also.

det(𝐻𝑎𝑝𝑝𝑟𝑜𝑥 ) = 𝐷𝑥𝑥 𝐷𝑦𝑦 − (𝑤𝐷𝑥𝑦 )2

Here the value of 'w' is taken as 0.9, and in particular the 9x9 box filters are an
approximation of the Gaussian second derivatives at scale 1.2; so, I should say, they
approximate convolutions with double derivatives of Gaussian masks.

(Refer Slide Time: 11:59)

The use of integral images is also explained nicely here in this slide. What is an integral
image? The value of the integral image at a point is the cumulative sum of the pixel values
over the rectangular region from the origin up to that point.

$I_{\Sigma}(x, y) = \sum_{i=0}^{i \le x} \sum_{j=0}^{j \le y} I(i, j), \qquad \Sigma = A - B - C + D$

Over this image, suppose I would like to compute the integral value at a point: I have to
compute the sum of the pixel values in the rectangular region from the origin to that
point, and that value is placed at that location.
438
These computations can be performed efficiently in one scan, because each sum can be
accumulated from integral values which have already been computed; so, in one scan we can
compute the integral image. Given that integral image, I can compute the sum over any
rectangular zone using these operations, that is, using the integral image values at the
four corners of the rectangle: take the value covering the whole region, subtract the two
values that remove the unwanted strips, and then add back the remaining corner value,
because that part has been subtracted twice. If I do that, I actually get the sum over that
rectangle.

So, when you are applying box filter over a space. So, what you are doing? You are simply
performing either addition accumulating those value addition of the pixel values along that
region and then in subsequently you are using this value for subtraction weighted, you can
multiply it by that weight then you are using this value for subtraction or addition. You
can see the box filtered implementations earlier for computing the determinant of the
Hessian matrix what we discussed. And so every kind of computations of the sum of these
pixel values it requires only 3 additions and 4 memory access, and that is how efficiently
or it increases the speed of computations.
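
As a concrete illustration of this idea, here is a minimal sketch in Python with NumPy (the names and interface are my own, not taken from any particular library) of building an integral image and reading off an arbitrary rectangular sum with four look-ups:

```python
import numpy as np

def integral_image(img):
    """I_sum(x, y): cumulative sum of img over the rectangle from the origin to (x, y).
    A leading row/column of zeros is added so that boundary cases need no special handling."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1+1, c0:c1+1] from the integral image ii: A - B - C + D,
    i.e. four memory accesses and three additions/subtractions."""
    return ii[r1 + 1, c1 + 1] - ii[r0, c1 + 1] - ii[r1 + 1, c0] + ii[r0, c0]

# example: the sum over any rectangle costs the same, independent of its size
img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
assert box_sum(ii, 1, 1, 2, 2) == img[1:3, 1:3].sum()
```

The rectangle sum is obtained in constant time regardless of the rectangle size, which is exactly what makes box filtering at large scales so cheap.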

(Refer Slide Time: 14:57)

So, as I said, the computation of the Hessian and the detection of key points as local maxima of its determinant are implemented using box filters.

Now, for describing the key point, here also, as in SIFT, you accumulate statistics of orientations around its neighbourhood. But in this case the orientations are computed using Haar wavelets, which is essentially a computation of gradients at different scales. These are the two particular filters shown here, and they can be implemented by box filters.

And to get the orientation, what can you do? You can rotate the window, apply these filters, and find the direction in which you get the maximal response; that gives the dominant direction. This too can be implemented by box filters: as we discussed earlier, you sum all the pixels under the white region and all the pixels under the dark region, and then subtract one sum from the other.

For example, if I apply this particular convolution, then summing up the pixels in one half takes 3 operations with the integral image, and summing the other half takes another 3 operations. So, in total 6 operations, and then you also require a subtraction, one more operation. With those 6 operations you compute the two partial sums; so it is 6 operation units for computing the two box sums of a filter using the integral image, plus one more operation for the final subtraction.

That is, you take the sum over region A, then the sum over region B, and then you perform a subtraction between them. This is how you compute such a filter response: 6 operations for the two sums (3 plus 3), plus one additional operation.
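
To connect this with the operation count, here is a hedged sketch (reusing the box_sum helper from the earlier integral-image example; the exact window conventions used in SURF may differ) of horizontal and vertical Haar-wavelet-style responses computed as the difference of two box sums:

```python
def haar_x(ii, r, c, s):
    """Horizontal Haar response in a 2s x 2s window near (r, c): right half (weight +1)
    minus left half (weight -1).  Each half is one box_sum call (3 additions, 4 accesses);
    one extra subtraction combines them."""
    left  = box_sum(ii, r - s, c - s, r + s - 1, c - 1)
    right = box_sum(ii, r - s, c,     r + s - 1, c + s - 1)
    return right - left

def haar_y(ii, r, c, s):
    """Vertical Haar response: bottom half minus top half."""
    top    = box_sum(ii, r - s, c - s, r - 1,     c + s - 1)
    bottom = box_sum(ii, r,     c - s, r + s - 1, c + s - 1)
    return bottom - top
```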

(Refer Slide Time: 17:43)

And to get the final descriptor, what do we do? We do not collect a histogram of orientations as I did in the case of SIFT. Instead, after performing the orientation correction, we partition the square window around the key point into 4x4 sub-regions, and in each sub-region we compute particular quantities. That means, after applying the Haar wavelets, we take the summation of the responses and also the summation of the magnitudes of the responses along the x direction, and similarly those two quantities along the y direction.

So, as you can see, each cell of this 4x4 grid gives me a 4-dimensional vector. The size of the window is 20 times the scale, and since there are 4x4 square sub-regions, each sub-region contributes such a 4-dimensional vector.

There are 5x5 regularly spaced sample points in each sub-region, each sub-region yields a 4-dimensional vector, and if you concatenate all of them you get a 64-dimensional vector; that is the SURF descriptor. The reference of the paper is given here; for the details you should read that paper, where you can get the full description.
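
The aggregation into the 64-dimensional vector can be sketched as follows (a simplified illustration of my own, assuming the orientation-corrected Haar responses dx and dy have already been sampled on a 20x20 grid around the key point; the Gaussian weighting used in the actual SURF implementation is omitted):

```python
import numpy as np

def surf_like_descriptor(dx, dy):
    """Build the 64-D vector (sum dx, sum |dx|, sum dy, sum |dy|) per 5x5 sub-region
    of a 20x20 grid of Haar responses (4x4 sub-regions in total)."""
    desc = []
    for i in range(4):
        for j in range(4):
            bx = dx[5*i:5*i+5, 5*j:5*j+5]
            by = dy[5*i:5*i+5, 5*j:5*j+5]
            desc += [bx.sum(), np.abs(bx).sum(), by.sum(), np.abs(by).sum()]
    v = np.array(desc)
    # the descriptor is commonly normalised to unit length for contrast invariance
    return v / (np.linalg.norm(v) + 1e-12)
```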

(Refer Slide Time: 19:44)

So, we have discussed two such descriptors, SIFT and SURF, but they are still computationally quite intensive. There are other descriptors, proposed later, which are also becoming popular in various applications, particularly where you require fast computation in real-time applications. One such detector is called the FAST detector. The name implies that it detects feature points very quickly, but the acronym stands for Features from Accelerated Segment Test. I will give a brief description of that method.

There are also descriptors like BRIEF (Binary Robust Independent Elementary Features) and ORB, which makes the description rotation invariant by considering an oriented FAST detection. FAST by itself is not a rotation-invariant detector, but in ORB that limitation is overcome by some additional operations which I will also discuss. ORB likewise uses a rotated BRIEF, since BRIEF is not an orientation-independent descriptor either; in ORB both of these are taken care of.

(Refer Slide Time: 21:10)

So, this is the principle of detecting a key point or feature point in FAST. There is a test called the 12-point test, which considers a fixed set of locations around a pixel; they are shown in the diagram by the thick squared edges. You can say that the mask in this case is roughly a 7x7 mask centred on the pixel under test. This is the central pixel, and around it the locations are numbered 1, 2, 3 and so on.

If I consider these locations, there are in fact 16 of them shown here. If, among them, there exist 12 consecutive points which are brighter than the central pixel, then we consider that pixel an interesting point, a key point; that is the principle. It has been shown that this works well for detecting various key points.
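
A sketch of the segment test is given below (my own illustration; the 16 offsets follow the usual radius-3 circle, and the threshold t and required run length n are parameters, with n = 12 for the 12-point test):

```python
import numpy as np

# offsets of the 16 circle positions (radius-3 circle) around the candidate pixel
CIRCLE = [(-3, 0), (-3, 1), (-2, 2), (-1, 3), (0, 3), (1, 3), (2, 2), (3, 1),
          (3, 0), (3, -1), (2, -2), (1, -3), (0, -3), (-1, -3), (-2, -2), (-3, -1)]

def fast_is_keypoint(img, r, c, t=20, n=12):
    """Segment test: (r, c) is a corner if n contiguous circle pixels are all brighter
    than I(c)+t or all darker than I(c)-t."""
    centre = int(img[r, c])
    vals = np.array([int(img[r + dr, c + dc]) for dr, dc in CIRCLE])
    brighter = vals > centre + t
    darker = vals < centre - t
    for flags in (brighter, darker):
        doubled = np.concatenate([flags, flags])   # duplicate to handle wrap-around
        run, best = 0, 0
        for f in doubled:
            run = run + 1 if f else 0
            best = max(best, run)
        if best >= n:
            return True
    return False
```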

There are also modifications of this strategy which improve its performance. For instance, we can train a decision tree on these Boolean conditions, given labelled data. You can find the details in the paper shown here, published in the IEEE Transactions on Pattern Analysis and Machine Intelligence in 2010 and written by Rosten et al.; you can look at this paper for the details.

(Refer Slide Time: 23:20)

Just a little more elaboration on this point. For the decision tree, the points on the circle are partitioned and used as features. Each point is characterized with respect to the central pixel: it is either darker than the central pixel, almost similar to it, or brighter. So it is a ternary logic.

That means there are 3 possible states for every one of those 16 locations, and based on these values you can train a decision tree. Given training samples of interesting key points and their neighbourhoods, you use that decision tree later to identify key points. So, the problem is the classification of each circle point into 3 classes, creating 3 partitions per point, and on this you train a decision tree.

S_{p→x} = d (darker),   if I_{p→x} ≤ I_p − t
          s (similar),  if I_p − t < I_{p→x} < I_p + t
          b (brighter), if I_p + t ≤ I_{p→x}

(Refer Slide Time: 24:27)

Now let me discuss a descriptor computed around a key point. The previous method, FAST, is a replacement for detector operators such as the Harris operator, the local maximum of the difference of Gaussians, or the local maximum of the Laplacian measure; FAST replaces them, and as you can see its computation is very simple, based on those heuristics and principles.

Now let us see how a descriptor can also be computed without very complex computation. The principle here is that we randomly choose a set of n_d pairs of locations, denoted (x_i, y_i) for the i-th pair, within a patch centred on the key point. Note that x_i is not an x coordinate: x_i is a point and y_i is a point, and they are coupled together as a pair.

τ(p; x, y) = 1, if p(x) < p(y)
             0, otherwise

So, suppose you have some central pixel ‘c’, and around its neighbourhood you have randomly defined certain pairs of points. For each pair of positions the test gives a binary value, so you get a binary string. There are n_d such pairs, so you get a binary string of length n_d; in fact, that string is your feature descriptor.

(Refer Slide Time: 26:32)

f_{n_d}(p) = Σ_{1≤i≤n_d} 2^{i−1} τ(p; x_i, y_i)

So, it is an n_d-dimensional binary string, and as you can see its computation is very fast; it has been found that this also gives a feature descriptor around a key point which is quite discriminative. Since it is a binary representation, you can also convert it into a decimal value. The typical dimensions used in experiments are 128, 256 or 512, so it can be quite a large dimension.
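
A minimal sketch of this construction is given below (hypothetical names; the published BRIEF also smooths the patch before sampling, which is omitted here). The pairs are drawn once and reused for every key point so that descriptors remain comparable:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_brief_pairs(n_d=256, patch=31):
    """Sample n_d point pairs (x_i, y_i) once; each row is (x_row, x_col, y_row, y_col)
    as offsets inside a patch of the given size."""
    half = patch // 2
    return rng.integers(-half, half + 1, size=(n_d, 4))

def brief_descriptor(img, r, c, pairs):
    """Binary string: bit i = 1 if p(x_i) < p(y_i), else 0."""
    bits = np.empty(len(pairs), dtype=np.uint8)
    for i, (xr, xc, yr, yc) in enumerate(pairs):
        bits[i] = 1 if img[r + xr, c + xc] < img[r + yr, c + yc] else 0
    return bits

pairs = make_brief_pairs()
# two descriptors d1, d2 are compared with the Hamming distance:
# np.count_nonzero(d1 != d2)
```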

(Refer Slide Time: 27:10)

But the problem, as we can see, is that there is no mechanism here for making the descriptor rotation invariant or scale invariant. FAST, too, has no notion of rotation or scale invariance. In ORB, which is another descriptor, these issues are taken care of.

So, again a multi-resolution representation of the image is used, and orientation compensation is done in a very simple and clever way: you consider an intensity centroid. What is an intensity centroid? Say you have a key point given by c; that is the geometric centre of the patch.

Now consider the centroid of the intensity values, that is, the intensity-weighted centre of the patch; it is not a plain average. We treat the intensity as a weight on each location and take the intensity-weighted central location, which is in general a different point. The vector from the geometric centre to this centroid gives me an orientation; it is something that tries to simulate the gradient direction. So, this is the orientation.

Now, you perform all the descriptor operations after aligning your reference axis to this orientation; that is how you make it rotation invariant. So, you rotate the patch (or, equivalently, the sampling pattern) by this angle and then compute BRIEF, and this is what is called steered BRIEF.
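
A sketch of the orientation computation from the intensity centroid (an assumption-laden illustration using the image moments m10 and m01 over the patch, not the exact ORB code) could look like this:

```python
import numpy as np

def patch_orientation(img, r, c, radius=15):
    """Orientation from the intensity centroid: the angle of the vector joining the
    geometric centre to the intensity-weighted centroid, theta = atan2(m01, m10)."""
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    patch = img[r - radius:r + radius + 1, c - radius:c + radius + 1].astype(np.float64)
    m10 = (xs * patch).sum()
    m01 = (ys * patch).sum()
    return np.arctan2(m01, m10)   # rotate the BRIEF sampling pattern by this angle
```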

(Refer Slide Time: 28:56)

So, let me stop here at this point. We have discussed various descriptors: besides SURF and SIFT, which are computationally quite intensive, there are also descriptors like BRIEF and ORB, and the FAST detector, which are computationally efficient.

Next we will continue this discussion to understand how these descriptors can be used in matching pairs of key points; there are also other kinds of descriptors which are useful in various other image processing and analysis tasks, and we will discuss those as well. So, let us stop at this point for this particular lecture.

Thank you very much.

Keywords: SIFT, SURF, FAST, BRIEF

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 27
Feature Detection and Description Part – IV

We are discussing about techniques for Detecting Features and also Describing those
feature points, so that we can use them later for matching with respect to various tasks.

(Refer Slide Time: 00:34)

So, let us briefly consider what is meant by matching and how you can compute it. In our next topic we will elaborate much more on these computations, but to understand the motivation behind feature detection and description let us also go briefly through this computation.

One of the things we discussed is that we can represent a key point by a feature vector. Suppose we have two related images of the same scene, and we would like to find the corresponding pairs of points which come from the same scene point.

So, the key points arising from the same scene point need to be identified, and the descriptors, the feature vectors, are what is relevant in that case. You can then use different distance functions or similarity measures to find out how close these points are, and accordingly take a decision.

For example, a key point can be represented by a feature vector; in this slide the components of such vectors are shown. There are different distance functions: if I consider two such feature vectors, several norms are defined mathematically as follows. This is the L_1 norm, which you can see is the sum of absolute differences between the components of the two feature vectors:

L_1(f, g) = Σ_i |f_i − g_i|

The L_2 norm is the Euclidean distance between the two points, or between the two vectors:

L_2(f, g) = (Σ_i |f_i − g_i|^2)^{1/2}

L_p(f, g) = (Σ_i |f_i − g_i|^p)^{1/p}

You can see that special cases of the L_p norm are the L_1 norm when p = 1 and the L_2 norm when p = 2. Using such a distance function, you can measure the proximity between feature vectors in the two images, and the closest feature vector, the most proximal one, can be assigned as the best match for a given feature vector.
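
A small sketch of such matching (brute-force nearest neighbour under an L_p norm; illustrative only) is shown below:

```python
import numpy as np

def lp_distance(f, g, p=2):
    """L_p norm of the difference of two feature vectors (p=1 gives L1, p=2 gives L2)."""
    return np.sum(np.abs(f - g) ** p) ** (1.0 / p)

def match_descriptors(desc1, desc2, p=2):
    """For every descriptor in image 1, return the index of the closest descriptor
    in image 2 under the chosen norm."""
    matches = []
    for f in desc1:
        dists = [lp_distance(f, g, p) for g in desc2]
        matches.append(int(np.argmin(dists)))
    return matches
```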

(Refer Slide Time: 03:37)

So, this is an overview of how matching can be performed using key point descriptors to compute corresponding points. Now let us consider different kinds of descriptors. It is not only point descriptors: sometimes, to identify or detect an object, we need to describe a region instead of a point.

We call those descriptors patch descriptors; or the region may be characterized by its texture, giving texture descriptors; or we may describe some part of the image, a sub-image, representing its whole content, which acts like a global descriptor with respect to that region. We will discuss some of the techniques by which we can derive these descriptors.

(Refer Slide Time: 04:30)

One very popular technique for patch descriptors is the histogram of oriented gradients (HoG) representation. This method was proposed in 2005, as shown in the reference paper here, and you can go through that paper for the details; I will give you a brief overview of the technique. You compute the horizontal and vertical gradients, without any smoothing here.

So, you collect gradient statistics as we did in the SURF descriptor, but here you do it over a patch. You partition that patch, and in each part you compute the horizontal and vertical gradients; that means you have to compute the gradient orientation and magnitude, and if you have a colour image you pick the colour channel giving the highest gradient magnitude and use its direction. Typically, if you have a 64x128 image, you divide it into 16x16 blocks with 50 percent overlap, and then you have 105 blocks in total.

(Refer Slide Time: 05:59)

Then each block is partitioned into 2x2 cells of size 8x8; this is what is typically done in the paper, and these statistics are taken from that paper. It has been found to work well for certain object recognition tasks, and also in other techniques. Then you quantize the gradient orientation into 9 bins and vote with the gradient magnitude; that means each orientation bin is attributed the sum of the gradient magnitudes along those directions.

So, it is like forming a histogram where, instead of counting each vector as one unit, we use its magnitude as the weight for its direction, and the accumulation of those magnitudes gives the weights of the histogram bins. Votes are interpolated between neighbouring bin centres, so the histograms are smoothed, and they can be weighted with a Gaussian to down-weight pixels near the edges of the block. Then you concatenate the histograms. This gives the feature dimension: as you saw, there are 105 blocks, each block has 4 cells, and each histogram has 9 bins, so finally you get a feature dimension of 3780 for a 64x128 patch.
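
A much-simplified sketch of the cell-level computation is given below (illustrative only; the block grouping with 50 percent overlap, vote interpolation and per-block normalisation used in the full HoG pipeline are omitted):

```python
import numpy as np

def hog_cells(img, cell=8, nbins=9):
    """Simplified HoG: per 8x8 cell, a 9-bin histogram of gradient orientation
    (0..180 degrees), each pixel voting with its gradient magnitude."""
    gy, gx = np.gradient(img.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    bin_idx = np.minimum((ang / (180.0 / nbins)).astype(int), nbins - 1)
    H, W = img.shape
    hists = np.zeros((H // cell, W // cell, nbins))
    for i in range(H // cell):
        for j in range(W // cell):
            b = bin_idx[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            hists[i, j] = np.bincount(b, weights=m, minlength=nbins)
    return hists.reshape(-1)   # concatenate all cell histograms into one vector
```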

(Refer Slide Time: 07:26)

So, how do you use this descriptor? There are various examples of applications of this kind of descriptor. In that paper it was applied to detecting pedestrians walking on a road.

Other examples are character recognition in text documents, and various other applications where HoG descriptors, with certain modifications, have been found useful. As you can see, these problems are essentially classification problems.

So, instead of matching with a distance function, we can consider training a classifier. You can collect labelled sample feature descriptors and train a classifier; you can also use a distance function by treating the labelled samples as representative samples of the classes, and then find how close your unknown feature vector is to those representative samples. There are also various machine learning techniques which can be used for training such classifiers, like support vector machines, decision trees and random forests. Using such a classifier, once it is trained you can label an unknown patch from its descriptor.

(Refer Slide Time: 09:00)

One operation that is often needed in this case is non-maximal suppression of the detections. When you detect a particular patch as belonging to a certain class, the neighbouring patches will usually also give a high score; but, as you understand, only a single object occupies that area, so those extra matches are duplicates, and many of them will give you unnecessary false positives or redundant detections. The wise thing to do is to ignore those matches.

That is the suppression: just as we did for key point detection, where we took the maximal response in a neighbourhood, here also we consider the maximal response over neighbouring patches and select the patch with the locally maximal score. A greedy approach, as discussed here, is to select the best-scoring window, suppress the windows that are too close to that selected window, and then repeat the search on the remaining windows outside that region.
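
A sketch of this greedy procedure (using the distance between window centres as the "too close" criterion; implementations often use window overlap instead) is:

```python
def non_max_suppression(detections, min_dist=32):
    """Greedy NMS: repeatedly keep the highest-scoring window and drop all remaining
    windows whose centres are too close to it.  detections: list of (score, row, col)."""
    kept = []
    for score, r, c in sorted(detections, reverse=True):   # best score first
        if all((r - kr) ** 2 + (c - kc) ** 2 >= min_dist ** 2 for _, kr, kc in kept):
            kept.append((score, r, c))
    return kept
```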

(Refer Slide Time: 10:24)

That was a typical example of a patch descriptor; now we will consider another kind of image description, namely texture descriptors. Texture also describes a region: in this particular image you can see sunflower beds giving a typical texture, and not only that, the texture of the sky and the texture of the road are also quite distinct.

Those patches or regions can be labelled with such classes. In brief, texture is a spatial arrangement of the colours or intensities in an image; we can define a quantitative measure of such arrangements, and that measure itself distinctly identifies a particular texture.

(Refer Slide Time: 11:19)

So, we will see how texture descriptors are designed. There are various techniques for describing texture regions: edge density and direction, techniques using local binary patterns, co-occurrence matrices, and also Laws' texture energy features; I will elaborate on all of these. Laws is a researcher who proposed these energy features long back, and they have been found very effective in identifying textures.

(Refer Slide Time: 11:55)

First, edge density and direction. In this method we compute the gradient at each pixel and then form normalized histograms of the magnitudes and directions of the gradients over a region. So you get two histograms, one for the magnitudes and one for the directions, and you have to normalize them. Here normalization means dividing by the area of the histogram, so it is like obtaining a probability density function.

With the area made one, the representation consists of these two normalized histograms, of magnitudes and of directions, as I mentioned. Typically the number of bins in each histogram is kept small, for example 10, and then you can use some distance function such as the L1 norm between feature vectors to find the label of a texture. If you have a library of textures, you can use this distance function; we will elaborate on this kind of matching later.
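
A minimal sketch of this descriptor (with illustrative choices of bin counts and ranges) is:

```python
import numpy as np

def edge_histograms(region, nbins=10):
    """Two normalised histograms over the region, one of gradient magnitudes and one
    of gradient directions, concatenated into a single feature vector."""
    gy, gx = np.gradient(region.astype(np.float64))
    mag = np.hypot(gx, gy).ravel()
    ang = np.arctan2(gy, gx).ravel()
    h_mag, _ = np.histogram(mag, bins=nbins)
    h_ang, _ = np.histogram(ang, bins=nbins, range=(-np.pi, np.pi))
    h_mag = h_mag / max(h_mag.sum(), 1)   # normalise so each histogram sums to 1
    h_ang = h_ang / max(h_ang.sum(), 1)
    return np.concatenate([h_mag, h_ang])
```

Two such vectors can then be compared with the L1 norm, as mentioned above.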

(Refer Slide Time: 13:21)

There is another, very popular, method by which textures are represented, called the local binary pattern. Let me define this particular feature.

(Refer Slide Time: 13:50)

As shown in this image, you consider a central pixel, and around it are the positions numbered here. These are the neighbouring pixels of the pixel ‘c’, and they have been given ordinal position numbers.

3 2 1
4 c 0
5 6 7

What we are doing is comparing the pixel values at two locations: the central pixel and the pixel at each neighbouring location.

LBP(c) = Σ_{i=0}^{7} b(i) 2^i

If the neighbouring pixel value is greater than that of ‘c’, we get a value 1; otherwise, if it is smaller, we get a value 0, so each of these values is a binary value. That is the local binary pattern, and it can be represented in an aggregated form simply by the value of that binary string, as given by the above equation: it is simply the decimal value of the binary string.

That binary value itself describes the feature, and as you understand, it ranges from 0 to 255 in this particular case. So, this is the local binary pattern at a pixel of an image, and it can be shown that this local description is invariant to illumination and contrast.

This is the definition of b(i): as I mentioned, it is either 1 or 0 depending on whether the pixel at location i is greater than the central pixel ‘c’ or not. This is a typical example; you may find a different ordering of the neighbours in other work, and accordingly the code value would change, but the values would still range from 0 to 255. For a texture region, what you then do is obtain a normalized histogram over the region: you compute this value at each pixel, collect the statistics over the region in the form of a histogram, and normalize that histogram. You should note that this description is not rotation invariant.
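
A sketch of the basic 3x3 LBP and its histogram over a region, following the neighbour ordering shown in the slide, is given below (illustrative, not an optimised implementation):

```python
import numpy as np

def lbp_3x3(img, r, c):
    """8-bit LBP code at (r, c): neighbours thresholded against the centre,
    bits read as a decimal value in 0..255 (ordering as in the slide)."""
    offsets = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]
    centre = img[r, c]
    code = 0
    for i, (dr, dc) in enumerate(offsets):
        if img[r + dr, c + dc] > centre:
            code |= 1 << i
    return code

def lbp_histogram(region):
    """Normalised 256-bin histogram of LBP codes over a region (the texture feature)."""
    h = np.zeros(256)
    for r in range(1, region.shape[0] - 1):
        for c in range(1, region.shape[1] - 1):
            h[lbp_3x3(region, r, c)] += 1
    return h / max(h.sum(), 1)
```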

(Refer Slide Time: 16:04)

There are different variations of the local binary pattern that make it rotation invariant. In fact, a reference paper is cited here, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns", published in the IEEE Transactions on Pattern Analysis and Machine Intelligence in 2002.

I will briefly describe this method; you can go through the paper for the details. How do you make it rotation invariant? You consider a circular neighbourhood of radius R with P pixels at equal angular intervals. So, instead of having a fixed 3x3 neighbourhood, let us extend the definition.

(Refer Slide Time: 17:09)

You select P pixels in that neighbourhood: suppose I have a central pixel ‘c’, I draw a circle of some radius around it, and at equal angular intervals I select P pixels on it, so the total number of sampled pixels is P. Then you compute the local binary pattern at those positions.

Since the image is on a discrete grid, a sampling position may not coincide with a pixel location; in that case you interpolate its value from the neighbouring pixel locations. After obtaining those values you get a local binary pattern value, characterized by two parameters: the number of pixels P taken on the circular neighbourhood and the radius R.

The ordinary local binary pattern defined in the previous case corresponds to a radius of 1 with the number of pixels kept as 8. Now, the rotation-invariant LBP is defined by considering rotations of this binary string over those P locations: you perform circular right-rotation operations, compute the decimal value obtained at every rotation, and choose the minimum one.

Finally, as you can see, if you rotate the image the same value results, because you are always taking the minimum over all possible rotations; that is how you make it rotation invariant, by performing circular bitwise right-shift operations for the rotation. It has been shown that there are 36 distinct values in this rotation-invariant representation when the radius is 1 and P is 8.
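
The rotation-invariant mapping can be sketched as below (a small helper of my own; it returns the minimum decimal value over all circular right shifts of a P-bit code):

```python
def rotation_invariant_code(code, p=8):
    """Map an LBP code to its rotation-invariant value: the minimum over all
    circular bit-wise right rotations of the P-bit string."""
    best = code
    for _ in range(p - 1):
        code = ((code >> 1) | ((code & 1) << (p - 1))) & ((1 << p) - 1)
        best = min(best, code)
    return best
```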

(Refer Slide Time: 19:23)

There is another variation: instead of using all possible local binary patterns, in this case we consider only those patterns which are called uniform, and the rest of them are all put together in one separate class.

LBP_{P,R}^{riu2}(c) = Σ_{i=0}^{P−1} b(i),   if U(LBP_{P,R}(c)) ≤ 2
                      P + 1,                otherwise

A uniform pattern is defined as a pattern in which there are not more than two spatial transitions (0-to-1 or 1-to-0 changes) in the circular bit sequence, so for uniform patterns there are at most P + 1 distinct class values. In the first example below the number of transitions is 0, so it is a uniform pattern.

𝑈(11111111) = 0

In the second case there are only a 1-to-0 and a 0-to-1 transition, so the number of transitions is 2 and this is also a uniform pattern.

U(11101111) = 2

Whereas, if you consider the third example, you find a 1-to-0 transition at the beginning, then 0-to-1 and 1-to-0 in the middle, and 0-to-1 at the end. There are four transitions, so this is not a uniform pattern.

𝑈(10001001) = 4

So, how is the rotation-invariant LBP defined in this case? You consider only those patterns which are uniform and take their rotation-invariant value, that is, you perform circular right-shift operations (the ROR operator) and take the minimum; otherwise you assign the pattern to a separate class with the value P + 1. In this way there can be at most P + 2 distinct values.

(Refer Slide Time: 21:13)

It has been found that there are 9 uniform patterns, and the numbers inside them correspond to their unique codes; these are the examples shown in that paper itself. These are the uniform patterns; the rest are non-uniform, and you get 9 + 1, so in total there are 10 distinct classes. If I take the histogram of these local binary patterns, there are only 10 bins in this case.

(Refer Slide Time: 21:47)

var_{P,R}(c) = (1/P) Σ_{p=0}^{P−1} (g_p − µ)^2,   where µ = (1/P) Σ_{p=0}^{P−1} g_p

This can be further augmented by other rotation-invariant measures, or by measures based on the local intensity variation. The local variance of intensities can be considered for the uniform patterns; this is the definition of the variance over those sampled values.

That is, you take the variance of the pixel intensities at those circle positions, and you consider this local variance mainly for the uniform patterns. Then you also obtain the normalized histogram of the local variances, which is another feature.

So, you can take the normalized histogram of LBP values, or the normalized histogram of local variances, and in fact you can get another robust representation by taking their ratio. These are different representations by which a region can be described: as a histogram of rotation-invariant local binary patterns as we defined, or as a histogram of local variances as defined here, or even as a histogram of the ratios of these two quantities.

(Refer Slide Time: 23:15)

Let us understand what is meant by a co-occurrence matrix, as already shown in the slide. It is a matrix where every element contains the count of how many times its indices, that is, the pixel values x and y, occur together under a certain spatial relationship.

Note that in this notation we have used a subscript ‘r’; here ‘r’ denotes that spatial relationship. For example, consider two pixels ‘p’ and ‘q’, where q is shifted from p by a translation vector (a, b). Then ‘x’ is the intensity value at location ‘p’ and ‘y’ is the intensity value at location ‘q’, and such an occurrence of the pair (x, y) is what gets counted.

Now you do this for every pixel location throughout the image. Let me explain once again: suppose we have an image, and suppose this is ‘p’ and this is ‘q’, with a horizontal and vertical shift (a, b) as the translation between them; that is how p and q are located. If I take another pixel, say l, then its corresponding pixel l' is the one shifted by the same amount. In that case we consider the intensity values at l and l', and this pairing is counted as well; together these counts give you the statistics of how many times each pair has occurred. So it is a frequency distribution over all such pairs.

Suppose the image has 256 grey levels; then there are 256 squared possible pairs, and for each combination you have to find out how many times it occurs throughout the image. When you have the matrix in this form, its (i, j)-th element denotes how many times pixel value i and pixel value j occur at a pair of locations satisfying this spatial relationship, over the whole image. This gives you the frequency distribution, and that is how the co-occurrence matrix is defined.

(Refer Slide Time: 26:48)

This is also written here: if I consider a relationship ‘r’ given by a translation of (a, b) as we denoted, then C_r(x, y) is the number of cases in the image where I(p) = x and I(p + t) = y, with t being the translation vector (a, b).

(Refer Slide Time: 27:20)

Let us work out a simple case to understand this relationship better. Consider a small 4x4 image with only two levels, 0 and 1.

If I consider the spatial relationship given by the shift (0, 1), then for a particular location p, its corresponding location q is the pixel defined by that shift, and we count the pair formed by the value at p (the reference location) and the value at q (the second location of the couple).

So, you go through the image and, for every location where the pair (1, 0) occurs under this relationship, you add a count for (1, 0); similarly you count the occurrences of (0, 0), (0, 1) and (1, 1) by visiting every valid pair of locations in the image.

Now you count how many times each pair has occurred: in this example (0, 0) occurs 4 times, (0, 1) occurs 2 times, (1, 0) occurs 2 times and (1, 1) occurs 4 times. So the co-occurrence matrix for this relationship, C_{(0,1)}, looks like this:

[ 4 2 ]
[ 2 4 ]

Similarly, you can compute the frequency distributions for other relationships and obtain the corresponding values.

(Refer Slide Time: 30:52)

For every different positional relation we get a different co-occurrence matrix; depending on how the coupled pixel locations are defined, you get a different matrix, and this set of co-occurrence matrices represents a texture feature.

(Refer Slide Time: 31:22)

You can normalize a co-occurrence matrix by dividing its entries by the sum of all frequencies in the matrix.

(Refer Slide Time: 31:30)

𝑆𝑟 (𝑥, 𝑦) = 𝐶𝑟 (𝑥, 𝑦) + 𝐶−𝑟 (𝑥, 𝑦)

Or you can modify the definition by using a symmetric co-occurrence matrix; that means you also consider the opposite direction, the shift (-a, -b) in addition to (a, b), for example (0, 1) and (0, -1), find the counts for each case, and aggregate those counts element-wise to obtain the symmetric co-occurrence matrix.
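
A sketch of computing such a co-occurrence matrix (with optional symmetrisation and normalisation; the pair (a, b) is taken here as a row/column shift, which is an assumption about the convention) is:

```python
import numpy as np

def cooccurrence(img, a, b, levels=256, symmetric=False, normalize=False):
    """Grey-level co-occurrence matrix for the relation 'shift by (a, b)':
    C[x, y] counts pixel pairs (p, p + (a, b)) with values x and y."""
    H, W = img.shape
    C = np.zeros((levels, levels), dtype=np.float64)
    for r in range(H):
        for c in range(W):
            r2, c2 = r + a, c + b
            if 0 <= r2 < H and 0 <= c2 < W:
                C[int(img[r, c]), int(img[r2, c2])] += 1
    if symmetric:
        C = C + C.T            # equivalent to adding the counts for the shift (-a, -b)
    if normalize:
        C = C / max(C.sum(), 1)
    return C

# for the worked example above, a 4x4 binary image with (a, b) = (0, 1) and levels=2
# would yield a 2x2 matrix such as [[4, 2], [2, 4]].
```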

(Refer Slide Time: 32:00)

Correlation = [ Σ_x Σ_y (x − µ_x)(y − µ_y) N_r(x, y) ] / (σ_x σ_y)

From this co-occurrence matrix you can define different features, and those features can describe a region. The co-occurrence matrix itself is not the final descriptor; from it, different measures can be defined, such as the correlation measure. You can see that this is a normalized correlation, where each term is weighted by the corresponding value N_r(x, y) of the normalized co-occurrence matrix.

(Refer Slide Time: 33:22)

In this way you can define many other features and represent them by a feature vector. The last technique I will discuss for texture representation is Laws' texture energy features. I have already cited the paper, which you can go through; it is a very old paper from 1980, but these features have been found to be very effective.

How are these features computed? There is a set of 9 masks of size 5x5 used to compute the texture image, so let us see how these 9 masks are derived. At the base level there are 4 one-dimensional filters, each with 5 elements, which is why they are abbreviated L5, E5, S5 and R5; each one has its own purpose.

For one-dimensional filtering, you can see that L5 (level) simply performs a weighted smoothing. E5 (edge) performs a gradient computation: [-1 -2 0 2 1] computes the gradient in one dimension. S5 (spot) detects a spot, which means the central weight is the higher one and the surround is subtracted; it is a centre-surround model, where the surrounding responses are subtracted from the central response. And R5 (ripple) detects ripples, and is likewise given by its particular mask.

This is for the one-dimensional computation, but for two dimensions you take the outer product of any pair of these filters. If each filter is represented as a column vector, the outer product is that column vector multiplied by a row vector.

For example, the outer product of E5 and L5 gives you the mask E5 L5ᵀ. This gives you a 5x5 mask, which you then convolve with the texture image. So, this is one typical example of a 5x5 mask which computes a texture response.

(Refer Slide Time: 36:12)

And from there we compute the energy, which is essentially the accumulation of the squared or absolute responses over a window; we will come to the details. So, this is a set of 9 masks of size 5x5. In fact 16 masks are possible, because there are four filters and four times four pairings, but we combine symmetric pairs to reduce them to 9 masks. The list is given here, and you can see that pairs such as L5E5 and E5L5 are combined to give a symmetric measure.

In this way, 12 of the masks are combined in 6 pairs, making 6 channels, and the symmetric masks E5E5, S5S5 and R5R5 provide the other 3 channels. The paired masks are combined by taking the average of the responses of the 2 masks.

(Refer Slide Time: 37:13)

Finally, these are the steps of the representation: you take the input image, subtract the local average at each pixel, convolve with the 16 masks, compute an energy map for each channel, and then combine the symmetric pairs to obtain 9 channels.

So, you have a nine-dimensional feature space; that means every pixel in the textured region has a 9-dimensional vector representation. The local average is taken over a 15x15 window, the energy is the sum of absolute values in a 15x15 window, and these are the masks which are combined.
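
A sketch of this pipeline could look like the following (using SciPy for the convolutions; the 1-D filters other than E5 are the values usually quoted for Laws' masks and are an assumption here, and the windowed energy is computed as a local average of absolute responses, which is proportional to the windowed sum):

```python
import numpy as np
from scipy.ndimage import convolve, uniform_filter

# the four 1-D Laws filters (level, edge, spot, ripple) -- assumed standard values
L5 = np.array([ 1,  4, 6,  4,  1], dtype=np.float64)
E5 = np.array([-1, -2, 0,  2,  1], dtype=np.float64)
S5 = np.array([-1,  0, 2,  0, -1], dtype=np.float64)
R5 = np.array([ 1, -4, 6, -4,  1], dtype=np.float64)
FILTERS = {'L5': L5, 'E5': E5, 'S5': S5, 'R5': R5}

def laws_energy(img, window=15):
    """Subtract the local mean, convolve with the 16 outer-product masks, take windowed
    energy of the responses, and combine symmetric pairs into 9 channels per pixel."""
    img = img.astype(np.float64)
    img = img - uniform_filter(img, size=window)          # remove local average
    energy = {}
    for n1, f1 in FILTERS.items():
        for n2, f2 in FILTERS.items():
            mask = np.outer(f1, f2)
            energy[n1 + n2] = uniform_filter(np.abs(convolve(img, mask)), size=window)
    keep = ['E5E5', 'S5S5', 'R5R5']                       # L5L5 is not among the 9 channels
    pairs = [('L5E5', 'E5L5'), ('L5S5', 'S5L5'), ('L5R5', 'R5L5'),
             ('E5S5', 'S5E5'), ('E5R5', 'R5E5'), ('S5R5', 'R5S5')]
    channels = [energy[k] for k in keep] + [(energy[a] + energy[b]) / 2 for a, b in pairs]
    return np.stack(channels, axis=-1)                    # a 9-D vector per pixel
```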

(Refer Slide Time: 37:49)

Some of the uses of texture descriptors are the detection of object patches represented by textured patterns, segmentation of images, and classification or matching. Usually the problem is posed as a classification problem, as with object detection using region descriptors. Here also we can generate a library of labelled feature descriptors and then detect the classes using those labels. You can use different classifiers, or you can match to the nearest texture descriptor using some distance function.

(Refer Slide Time: 38:33)

With this let me stop here; we have discussed different kinds of region descriptors. We will continue this discussion further, where we will see how an image can also be represented globally by a feature descriptor. So, let us stop at this point.

Thank you very much.

Keywords: Descriptors, matching, detection, feature

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 28
Feature Detection and Description Part – V

We are discussing descriptors, different kinds of feature descriptors in an image.

In our previous lectures we discussed how key points can be described, and then we considered the description of patches or regions. After that we also discussed a similar kind of descriptor for regions, where textures are described. Today we will discuss another kind of image description: when the whole image is required to be described, the whole visual content of the image needs to be represented. In that scenario, let us see what techniques exist for image description. We will discuss two popular techniques: one is known as the bag of visual words, and then we will consider the vector of locally aggregated descriptors.

(Refer Slide Time: 01:21)

So, we will see those kinds of descriptors. In the bag of visual words representation, what do we do? First we compute key point based feature descriptors over a library of images. We need to construct visual words from a set of images and use those visual words; it is a kind of dictionary, a dictionary of visual words, and then any image is described in terms of those visual words, whether they are present in the image or not, and in what number they are present.

The analogy for this approach comes from natural language processing, or more specifically from document retrieval. In textual documents there are words, and some of those words play a distinctive role in characterizing the nature of the document; a document can be represented by a feature vector in which the presence of those words is accounted for.

A similar concept is extended when we try to describe the visual content of an image, but the problem here is that we do not have any precisely defined set of words as we have for documents. For text, dictionaries are independently created by linguists; for any language there is a dictionary of words, so we can use it to represent any word, and those words can be used to represent the document.

But for images, the first task is to create a dictionary of words, and it has to depend on the visual content, on the local visual features of images. This locality is addressed by considering the key points of an image, because key points, as we understand, play a very important role in defining transformation-invariant features; they are landmark locations which are easily detectable even after the image is transformed. Key points are those positions in an image where we take such landmarks, and the descriptors of those key points are the candidate visual words.

That is the philosophy behind this step. You consider a set of images and for each image compute the key points using the detectors we discussed earlier: a SIFT detector, a detector based on SURF, or other detectors like FAST, which give you the locations. Also, as you understand, scale invariance is considered in the detection of key points in some of these techniques. Then, around these locations, you derive a descriptor which should also have the property of transformation invariance.

But the problem is that you get so many varieties of descriptors, a very large number of them, that it is difficult to represent an image with so much variation. So, what we need to do next is to put all these descriptors through some quantization scheme; that means we would like to get representatives of these visual words. Note that it is not the key point but its descriptor which we consider here: a descriptor corresponds to a candidate visual word in our dictionary, but there can be a lot of variation among them.

So, we would like to put similar visual words into the same bucket and choose one representative out of each. In the vector space this task is called quantization; it is also called clustering, when you group similar visual words, similar feature vectors, together and choose a representative for each group. There are different algorithms for clustering, and there is a technique called K-means clustering by which you can easily perform this kind of grouping and choose the representatives of the groups.

If I consider the set of feature vectors in a group, then the mean of those feature vectors is used as the representative of that set. The parameter K denotes how many such representatives, how many groups, you are going to create; in fact, that gives you the number of visual words in your dictionary, and using that fixed number of visual words we are going to describe an image. This has the advantage that the dimension of your feature description is determined by the number K and remains the same for every image under consideration.

Once again, just to summarize the steps: you take a set of images, which you can call a library of images, from which you form a dictionary of visual words. For each image you compute the key points, and at each key point you compute its feature descriptor. Then you pool the feature descriptors from all the images and apply a clustering algorithm to that collection, grouping them into a finite number of clusters, for example using the K-means clustering algorithm, which I will discuss later in the topic on clustering and classification in this course.

Right now, let us just understand what it does: it partitions the descriptors into K groups. From each partition you take the mean as the representative of that cluster, and these representatives form the visual words; so you have K visual words out of this process. Next, once your dictionary is prepared, you consider any image: you again compute its key points, take the descriptor of each key point, find the nearest visual word for that feature descriptor, and associate the descriptor with that word.

Here you are counting how many times each representative visual word occurs in the image. It is the same strategy as counting how many times each keyword from a dictionary occurs in a document; similarly we count how many times each visual word of the dictionary occurs in a particular image. That can be represented by a histogram, because the histogram gives the frequency distribution of the visual words of your dictionary, and the number of bins in this histogram is fixed by the number K. So, you have a K-dimensional feature representation for an image, and this is the bag of visual words representation.
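
A compact sketch of the two stages, vocabulary construction and per-image histogram, using scikit-learn's KMeans (an illustrative choice of clustering tool), is given below:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, k=200):
    """Cluster the pooled key-point descriptors of a library of images into K visual
    words; the cluster centres form the dictionary."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_descriptors)

def bovw_histogram(descriptors, vocab):
    """Represent one image as a normalised K-bin histogram of the visual words
    assigned to its key-point descriptors."""
    k = vocab.n_clusters
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=k).astype(np.float64)
    return hist / max(hist.sum(), 1)
```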

(Refer Slide Time: 10:35)

Next we will consider another kind of descriptor which is an extension of the bag of visual words description; the concept of visual words is used here also, but the summarization of the visual content, the nature of the description, is a bit different, and we would like to see what kind of variation there is. This technique is known as the Vector of Locally Aggregated Descriptors, or by the acronym VLAD. The first step is similar: you have to form a codebook of visual words, as you did in the bag of visual words representation.

That means, once again, you take a set of images, extract the key points of each image, and form the feature descriptor at each key point; you get a set of feature descriptors. Then you perform K-means clustering on those feature descriptors and get K visual words; that is your bag of K visual words, the dictionary derived in the first step. Suppose these representative visual words, the entries of the codebook, are denoted C1, C2, ..., Ck; each is a feature vector.

They are just the cluster centres, of dimension D in this case, where D is determined by the dimension of the feature descriptor. For example, we know that the SIFT feature vector has dimension 128 and the SURF feature vector has dimension 64; likewise other feature descriptors define other dimensions. These are all conventions, as you understand; you can have your own feature descriptor and determine this dimension accordingly. After that, we consider an aggregation operation with respect to these centres. Let us understand this mathematical operation: consider a local descriptor ‘x’ in an image, which is assigned to one of these visual words.

We accumulate the differences with respect to the corresponding cluster centre; this operation is described by the summation

v_i = Σ_{x assigned to C_i} (x − C_i)

In this representation x is a local descriptor and C_i is the cluster centre such that the distance from x to C_i is the minimum among all cluster centres; that is how x is assigned to C_i. Suppose we use some distance function, denoted ‘d’; then x is assigned to C_i when, for every cluster centre C_j,

d(x, C_j) ≥ d(x, C_i)

Equality can occur in a degenerate case: all the other distances must be greater than or equal to this minimum distance, and if several centres are at the same minimal distance you simply choose one of them. Then what do you do? You aggregate the differences of the assigned descriptors with respect to the cluster centre Ci. Geometrically, suppose we have Ci and another cluster centre Cj, and the feature vectors of the image are, say, x1, x2, x3 near one centre and y1, y2, y3 near the other.

These are all feature vectors of the same image. Say we use Euclidean distances. If I draw the perpendicular bisector of the line joining the two cluster centres, that hyperplane separates the space into two halves: the feature vectors on one side are closer to Ci and those on the other side are closer to Cj. Then what do you do? Each difference x − Ci is a vector, and you simply add them up, so vi is the resultant of all these difference vectors with respect to Ci, pointing in some resultant direction.

Similarly, for Cj you take the resultant of its assigned difference vectors. That is your feature descriptor vi: you accumulate the differences with respect to the corresponding cluster centre, and then you concatenate all these vectors. The dimension is quite large, as you can see: if Ci has dimension D, then vi also has dimension D, and concatenating the K such D-dimensional vectors finally gives a feature dimension of K times D. Note that the dimension of the feature representation in VLAD grows with the number of visual words; but in practice K is kept small in this representation, so it can compete with the bag of visual words representation in terms of the length of the feature vector, and it has been found to discriminate images efficiently.

The VLAD descriptor is the normalized version of this vector of concatenated aggregated differences; that is how it is defined. You see that the vector is divided by its magnitude, the L2 norm of the vector, and dividing by it gives the normalized representation. So, this is how another kind of representation is obtained. Let us try to understand the motivation for this kind of representation: what are the applications?
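
A sketch of the VLAD computation, given the codebook of cluster centres and the local descriptors of one image (an illustrative implementation of the steps described above), is:

```python
import numpy as np

def vlad(descriptors, centres):
    """VLAD: assign each local descriptor to its nearest visual word, accumulate the
    residuals (x - C_i) per word, concatenate the K D-dimensional sums and L2-normalise."""
    K, D = centres.shape
    v = np.zeros((K, D))
    # nearest centre for every descriptor (squared Euclidean distance)
    d2 = ((descriptors[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    for i in range(K):
        assigned = descriptors[nearest == i]
        if len(assigned):
            v[i] = (assigned - centres[i]).sum(axis=0)
    v = v.reshape(-1)                       # K*D-dimensional vector
    return v / (np.linalg.norm(v) + 1e-12)  # divide by the L2 norm
```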

(Refer Slide Time: 18:57)

(Refer Slide Time: 19:01)

One application of this kind of global image descriptor is content-based image retrieval. Let us try to understand what is meant by content-based image retrieval. It is an image search, performed by a search engine, where your query is an image and it searches for similar images in your library of images, from your database or from repositories scattered over the web.

For that, what do you need to do? You represent all the candidate images, the images of your library, using one of these descriptors, and given any query image you also convert it to its descriptor. Then you match the query descriptor against the descriptors of the library images and find which is closest, or rather the set of images which are closer than the others; you can even rank them: top match, next best match, and so on. In this way you can rank the images, find the list of, say, the top 50 or top 5 images, and report them. That is how content-based image retrieval works.

So, it is an image search based on visual content, and one example of the operation is shown here. Suppose this is the query image and you have a library of images; as I mentioned, the descriptors of all the images in that library are stored in your database. Your objective is then to match the descriptor of the query image against the descriptors of the images in your database. If I perform this kind of operation and report a few top-ranking matches, we get results like these. This operation is shown from one particular content-based image retrieval system; you can see the image of a chariot with a wheel, a very famous image from the heritage site of Hampi, captured at the Vithala temple. The library also contains several such instances taken from different views, and some of those instances are retrieved by the search and shown here.

The idea is that when you create an image database, the images can also carry various other metadata and descriptions. Suppose you have newly captured this image while travelling and would like to know what this particular object is called; you give a search and the associated images are shown, including their descriptions.

This is one kind of very useful application which is presently available in different systems. I can give you another example: these are the retrieved images from a database. You see that one of them is not really a match; this happens because the images are all represented in a feature space and there are ambiguities in that representation, so many images which are different can still have similar feature vectors. This other example is from a place called Bishnupur in West Bengal, where there are very famous terracotta temples. This is the image of one such temple, and using our image search technique we could get these retrievals.

(Refer Slide Time: 23:19)

So, in this way we have seen that there are different kinds of descriptions of an image,
including descriptors for key points, descriptors for regions, and even descriptors for the
image globally. In this topic we discussed all such techniques of description and also the
detection of key points of images in particular. So, we will summarize all these
different issues and techniques that we discussed in the lectures on this topic.

We have discussed scale and transformation invariant feature detection. Some of the
methods are the Harris corner measure or Laplacian maxima; those are methods of
feature detection. You can also use difference of Gaussian maxima, or rather extrema,
since they could be minima as well, and we have also used intensity weighted measures,
which give detection that is rotation invariant.

We have also discussed different kinds of feature descriptors that are
transformation invariant descriptions; typically descriptors like the scale
invariant feature transform, SIFT, or SURF or ORB. So, these are the different kinds of
descriptors that we discussed.

(Refer Slide Time: 25:13)

We also discussed region and texture descriptors, which are another kind of description
related to images. Here you consider a patch or a region instead of a point, and usually
those patches or regions are of rectangular shape. There are different kinds of descriptors
that we discussed for them, like the histogram of gradients (HoG). And there are different
texture descriptors: edge density based descriptors, local binary patterns and their
variations, where you can make them rotation invariant. You can also handle noise and
make them more robust; beyond rotational invariance there is a concept called the uniform
pattern which is used to make them more robust for handling noise. Then there
are co-occurrence matrix based descriptions. For local binary pattern descriptors, you
create histograms of these patterns over the particular region, and that gives you a descriptor.

In co-occurrence matrix based descriptors you have to form a co-occurrence matrix
depending upon the coupling of pixel values; that means pairs of pixel values with respect
to predefined spatial relationships among the paired pixels. In this way you can define a
frequency distribution of different pairs of values, and that gives you the co-occurrence
matrix. From the co-occurrence matrix, again, different kinds of measurements and features
can be defined, which can be used for the description of a region with a certain texture.

Laws' texture energy is another technique which we discussed, and finally we
discussed the global descriptor of an image. We have seen its application to
content based image retrieval or image based search, and two techniques were particularly
discussed. One is the bag of visual words representation; the other is the vector of locally
aggregated descriptors, the VLAD technique. So, these are the two techniques we discussed.
With this, of course, there are a few applications that we also considered; let
us look at them as well.

For key point descriptors, we know that they could be used for matching, to get
corresponding points of a scene in a multi view imaging system, particularly stereo
imaging systems. And they could be used for obtaining different kinds of characteristic
matrices of those imaging systems, like the fundamental matrix of a stereo imaging system.
Then for region descriptors, object detection could be one application; using
HoG there are applications of pedestrian detection, character recognition, etcetera.

And for global descriptors we have discussed an application of image retrieval. So, with
this let me stop this particular lecture, and we also conclude this topic with this
summarization. We will move over to our next topic, where we will be discussing matching
and model fitting. Some parts of matching of feature descriptors we have discussed in this
topic as well, but we will elaborate more in the next topic.

Thank you very much for listening.

Keywords: VLAD, visual words, image descriptor, co-occurrence matrix, local binary
pattern.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 29
Feature Matching and Model Fitting – Part I

In this lecture we will start on a new topic on Feature Matching and Model Fitting.

(Refer Slide Time: 00:24)

Let us consider a typical problem where you require this kind of matching of features for
example, you would like to compute the 3-D structure of a scene. So, we have already
gone through the computational aspects of this particular problem and we can design an
algorithm with some computational steps to solve it. So, what we can do? We can get a set
of pairs of corresponding points, that is the first thing we should get and in fact, this is a
step, where we will see that matching is very much required.

But just to give a holistic view of the solution of this problem: you need a set of pairs
of corresponding points, then you compute the fundamental matrix, then you derive the
camera matrices, and then you can solve for the 3-D coordinates of scene points for each
pair of corresponding points. So, you see that even the initial assumption of having a set
of pairs of corresponding points itself requires some introspection. In our earlier
computations we assumed they are available, for example by visually finding out which
point corresponds to which. So, let us consider this particular structure.
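
To make these steps concrete, here is a minimal sketch in Python using OpenCV; the calls (cv2.SIFT_create, cv2.BFMatcher, cv2.findFundamentalMat, cv2.recoverPose, cv2.triangulatePoints) are standard OpenCV functions, but the known intrinsic matrix K and the overall wiring are assumptions made only for illustration, not a prescribed implementation.

import cv2
import numpy as np

def reconstruct_sketch(img1, img2, K):
    # Detect and describe feature points in both views.
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Step 1: a set of pairs of corresponding points via descriptor matching.
    matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Step 2: fundamental matrix from the correspondences.
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)

    # Step 3: camera matrices (here via the essential matrix, assuming known K).
    E = K.T @ F @ K
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])

    # Step 4: solve for the 3-D coordinates of the scene points.
    X = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    return (X[:3] / X[3]).T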

(Refer Slide Time: 01:51)

So, what we can do, at least if we would like to automate this process, is first find
interesting points in those images, as in the feature detection we have discussed. We can
apply transformation invariant feature detectors like the SIFT detector or the Harris corner
detector, and carry out the detection in a scale invariant and transformation invariant
manner.

So, consider this particular operation, where you define a measure like the Harris corner
measure, and then from there you get the feature locations by finding the local maxima of
this measure; those points define your feature points. Similarly, for the other view you
perform the same operations.

Now, since you expect that these detected points are transformation invariant, which means
that even after a change of view those points will still be retained after detection, we can
try to find out the correspondences, that is, which point corresponds to which point of the
other image. This means we need to find relationships where a point in the left image
corresponds to another point in the right image, and the relation is that they are images of
the same scene point.

So, by precisely locating the feature points we are trying to get a correspondence, and in
this way we would like to get the set of corresponding points. The computation involved
in pairing these points, or getting the pairs of corresponding points, is what is known as
feature matching. So, we will be discussing some of the techniques for feature matching.

(Refer Slide Time: 04:27)

So, this is the summary of the computation that we discussed: first you have to detect
feature points in both images, and then you should describe them by local statistics. This
is how these points are to be matched, because just their locations will not give any
particular information about the nature of the point.

So, you need to at least look at the neighboring statistics, the distribution of intensity
values, or of some function of them, in a particular neighborhood, which you expect would
also remain similar in the other image around the same landmark or key point; exploiting
that similarity we try to match them.

So, you have to describe them by local statistics, and we have discussed different feature
description techniques, like SIFT descriptors and other kinds of descriptors which have the
property of transformation invariance; then you have to find corresponding pairs. So, we
will be considering what computation is involved: suppose we are given those descriptions,
then how do you decide that a pair of points are corresponding points?

(Refer Slide Time: 05:53)

So, we have already discussed this in the previous topic, while explaining the motivations
for describing a key point, since one of the computations considered there was matching in
this representation. The same discussion is repeated here: you can represent a key point by
a feature vector. Suppose we have a feature representation as a vector; here it is an
(n + 1)-dimensional vector $[f_0, f_1, \ldots, f_n]^T$. Then you can use some distance function to define
the proximity or similarity between two vectors; other than a distance function, you can
also directly use a similarity measure.

So, some examples of such distance functions are given here; you can use the L1 norm, the
L2 norm, or more generally the Lp norm:

$$L_1(\vec{f}, \vec{g}) = \sum_{i=0}^{n} |f_i - g_i|$$

$$L_2(\vec{f}, \vec{g}) = \left(\sum_{i=0}^{n} |f_i - g_i|^2\right)^{1/2}$$

$$L_p(\vec{f}, \vec{g}) = \left(\sum_{i=0}^{n} |f_i - g_i|^p\right)^{1/p}$$

So, you can use these distance functions to define the proximity between key points, and
you can apply certain strategies, which we will discuss later, to declare that two key points
from the two images correspond to the same scene point.
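
As a small illustration, these norms can be computed directly on descriptor vectors with NumPy; this is only a sketch of the definitions above, with arbitrary example vectors.

import numpy as np

def lp_distance(f, g, p=2):
    # Lp distance between two feature vectors f and g.
    f, g = np.asarray(f, dtype=float), np.asarray(g, dtype=float)
    return np.sum(np.abs(f - g) ** p) ** (1.0 / p)

f = np.array([0.2, 0.7, 0.1])
g = np.array([0.3, 0.5, 0.4])
print(lp_distance(f, g, p=1))   # L1 distance
print(lp_distance(f, g, p=2))   # L2 (Euclidean) distance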

(Refer Slide Time: 08:21)

Sometimes you can also have weighted distance functions. In our previous examples of
distance functions, we considered every component with a uniform weight; we did not
distinguish which component is more reliable and which is less reliable. Here, reliability
means the contribution of that component to the discrimination between two images: if a
component is more reliable we give more weight to its difference, and if it contributes less
we give it less weight.

$$d_w(\vec{f}, \vec{g}) = \sqrt{(\vec{f} - \vec{g})^T A (\vec{f} - \vec{g})}$$

This can be represented in the mathematical form above, where the column vector
representation of a feature vector is used and the distance is computed using matrix
operations. In this expression, note particularly that A is a positive semidefinite matrix,
which means that it is a symmetric matrix satisfying

$$v^T A v \geq 0, \quad \text{for all } v.$$

So, its value is non-negative for any vector. A typical example of A is a diagonal matrix,
which we can write as

$$A = \mathrm{Diag}(w_0, w_1, \ldots, w_{n-1}), \quad w_i \geq 0,$$

where, unlike the previous example indexed from 0 to n in an (n + 1)-dimensional space,
here the components are indexed from 0 to n − 1. So, only the diagonal entries are nonzero
and all the off-diagonal entries are 0. If I use this diagonal matrix in the expression above,
then the expression simply boils down to this form:

$$d_w(\vec{f}, \vec{g}) = \sqrt{\sum_{i=0}^{n-1} w_i (f_i - g_i)^2}$$

One typical example could be that the components are weighted by the inverse of the
variance of the i-th component of the feature vectors. So, if I consider

$$w_i = \frac{1}{\sigma_i^2},$$

where $\sigma_i$ is the standard deviation of the i-th component computed over all the candidate
feature vectors, then the weight reflects the variability of that component: if it has more
variability we give it less weight, and if it has less variability we give it more weight in
defining the distance. So, this is one such policy for defining the weights: a typical example
of a weighted distance function uses $w_i = 1/\sigma_i^2$, with $\sigma_i$ the standard deviation of the
statistical distribution of the i-th component of these vectors.
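
A small sketch of this weighting policy follows; the diagonal weights are the inverse variances estimated from a set of candidate descriptors, and all names and data here are illustrative.

import numpy as np

def inverse_variance_weighted_distance(f, g, candidates):
    # Weighted distance with A = Diag(1/sigma_i^2), where sigma_i is estimated
    # from the candidate feature vectors (one vector per row).
    sigma2 = np.var(candidates, axis=0) + 1e-12   # small epsilon avoids division by zero
    w = 1.0 / sigma2
    diff = np.asarray(f, dtype=float) - np.asarray(g, dtype=float)
    return np.sqrt(np.sum(w * diff ** 2))

candidates = np.random.rand(100, 8)   # 100 descriptors of dimension 8
f, g = candidates[0], candidates[1]
print(inverse_variance_weighted_distance(f, g, candidates))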

(Refer Slide Time: 12:19)

There could be other kinds of similarity measures. With a distance, the smaller the distance
between two feature vectors, the greater the chance of declaring them similar, or the
smallest distance among the candidate feature vectors may indicate the matching pair; for
a similarity measure the relationship is just the inverse.

In a similarity measure, the higher the similarity value, the higher the chance that the two
vectors are matching candidates. Two such similarity measures are used here: one is called
the normalized cross correlation measure, and the other one is the cosine similarity. So, let
us consider their mathematical forms.

This is the normalized cross correlation measure; you can see that it is defined as the
correlation between the two vectors, where the components of each vector are treated as
samples of a distribution:

$$\rho(\vec{f}, \vec{g}) = \frac{\mathrm{cov}(\vec{f}, \vec{g})}{\mathrm{s.d.}(\vec{f}) \times \mathrm{s.d.}(\vec{g})} = \frac{\frac{1}{n}\sum_{i=0}^{n-1}(f_i - \bar{f})(g_i - \bar{g})}{\sqrt{\frac{1}{n}\sum_{i=0}^{n-1}(f_i - \bar{f})^2}\;\sqrt{\frac{1}{n}\sum_{i=0}^{n-1}(g_i - \bar{g})^2}}$$

The expanded form is also given in the definition; as I mentioned, here we are considering
the values of the components across the vectors and comparing their variations between
the two vectors. If they are very similar, this cross correlation value should be high; it
actually ranges over [-1, 1], so if they are highly similar it should be close to 1. The other
function, the cosine similarity, is defined in this way:

$$\text{Cosine similarity} = \frac{\vec{f} \cdot \vec{g}}{\|\vec{f}\|\,\|\vec{g}\|}$$

As you understand, here also if the vectors are very similar the angle between them should
be near 0, which means the cosine of that angle should be close to 1. So, here also the
values range over [-1, 1], and for vectors which are very similar the value should be high,
close to 1.
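
Both measures can be written in a few lines of NumPy; this is just a direct transcription of the formulas above, with arbitrary example vectors.

import numpy as np

def normalized_cross_correlation(f, g):
    f, g = np.asarray(f, dtype=float), np.asarray(g, dtype=float)
    fc, gc = f - f.mean(), g - g.mean()   # subtract the means of the components
    return np.sum(fc * gc) / (np.sqrt(np.sum(fc ** 2)) * np.sqrt(np.sum(gc ** 2)))

def cosine_similarity(f, g):
    f, g = np.asarray(f, dtype=float), np.asarray(g, dtype=float)
    return np.dot(f, g) / (np.linalg.norm(f) * np.linalg.norm(g))

f = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([2.0, 4.1, 6.2, 8.0])
print(normalized_cross_correlation(f, g))   # close to 1 for similar profiles
print(cosine_similarity(f, g))              # close to 1 for aligned vectors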

(Refer Slide Time: 15:09)

So, let us now consider the different matching criteria that could be used. I will mostly
consider distance based matching criteria here, and you can extend these ideas to similarity
based matching. In distance based matching, the smaller the distance, the better the match
between two vectors; that is the way this policy is considered. One policy of distance based
matching is that you can use a fixed threshold, which means you report all matches within
that threshold value.

So, it is not declaring just a single pair of feature vectors; given a feature vector, it provides
you a set of candidate feature vectors, and then you may have to do some post processing
to select which one is the fittest or which one actually corresponds to the query feature. If
you would like to precisely define the corresponding feature vector, the nearest neighbor
definition is more useful, because it does not use a fixed threshold.

It simply considers which feature vector is nearest to the query. The limitation is that, by
itself, it does not consider whether the distance between them is greater than a certain
threshold value; you can of course combine the two, but by definition the nearest neighbor
is simply the closest feature vector.

So, if the candidate feature vectors are [x1, x2, …, xn] and you have a feature vector y from
the other image, what you do is compute the distance between y and each candidate feature
vector and find which one is the smallest; taking the minimizer, the arg min operation,
gives you the corresponding feature vector, that is, the $x_{i^*}$ which is the nearest neighbor
of y.

But it may happen that even this smallest distance is greater than the fixed threshold you
have considered, so you can apply that condition as well while selecting the nearest
neighbor. This is the strategy when you use the nearest neighbor principle for matching a
pair of feature vectors.

(Refer Slide Time: 18:22)

The other strategy, which is found to be more robust, is called the nearest neighbor distance
ratio (NNDR). In this case it considers the ratio of the distance to the nearest neighbor and
the distance to the second nearest neighbor. That means, if the nearest neighbor is distinctly
closest and the next neighbor is quite far away, we should accept it as a reliable match
between the two feature vectors; in other words, the ratio of the nearest neighbor distance
to the second nearest neighbor distance should be very small.

So, if this ratio is very small, you can consider it a good match. Here also we are not giving
an absolute criterion; it is relative distances that we are comparing, a relative comparison
between two distances, one with the nearest neighbor and the other with the second nearest
neighbor.
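
A minimal sketch of nearest neighbor matching with this distance ratio test is given below; the ratio threshold of 0.8 is a commonly used illustrative value, not something prescribed here, and the random descriptors are only placeholders.

import numpy as np

def match_nndr(desc1, desc2, ratio=0.8):
    # Return (i, j) index pairs where desc1[i] matches desc2[j]
    # under the nearest neighbor distance ratio test.
    matches = []
    for i, f in enumerate(desc1):
        d = np.linalg.norm(desc2 - f, axis=1)   # distances to all candidates
        j1, j2 = np.argsort(d)[:2]              # nearest and second nearest
        if d[j1] / (d[j2] + 1e-12) < ratio:     # accept only distinct matches
            matches.append((i, j1))
    return matches

desc1 = np.random.rand(50, 128)   # e.g. 128-dimensional SIFT-like descriptors
desc2 = np.random.rand(60, 128)
print(len(match_nndr(desc1, desc2)))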

(Refer Slide Time: 19:26)

So, there is a figure by which you can understand this particular process. In this example
you can see that the feature vectors which are actually close in the space are colored with
the same color. For example, if I use a fixed threshold th, then all the feature vectors within
this circular region around the query are reported; here we are showing a two dimensional
space.

For a three dimensional feature vector it would be a sphere, and for an n dimensional
feature vector it would be a hypersphere. Anyway, any feature vector within this region
would have been reported, but here the true nearest neighbor of this query is just outside
the circle; so if you use a fixed threshold you will be missing it.

If we use the nearest neighbor principle, then this point will also be selected, which means
you get a true match in this configuration, and it also matches the ground truth, since these
two points are the corresponding pair. But if I consider the nearest neighbor principle for
the other query point, you see that its nearest neighbor is not the corresponding matching
point; it is shown with a different color in this particular case.

So, though it is the nearest neighbor, it should not be considered, and that is where the
nearest neighbor distance ratio rule comes in. If I apply the nearest neighbor distance ratio
rule, then for the first point this is the first nearest neighbor and this is the second nearest
neighbor.

So, if I take the ratio between these two distances, that ratio is expected to be small
according to the diagram, and so we accept the match. Which means that even under the
principle of nearest neighbor distance ratio, shown here with the acronym NNDR, this
match is accepted. Whereas, if I consider the other situation, you can see that with respect
to this point, this one is the first nearest neighbor and this is the second nearest neighbor.

And the second nearest neighbor is also quite close, as you can see, because there is an
ambiguity of descriptors in this case. If I take the ratio d3/d4, this ratio is expected to be
high, and then with appropriate thresholding you can reject the match.

So, the NNDR policy also rejects the match in this case, which is desirable, because using
the plain nearest neighbor principle this pair would have been matched, which is not
desirable according to the given data. That is why the nearest neighbor distance ratio policy
is found to be more robust than the other policies.

(Refer Slide Time: 23:12)

So, we have discussed matching of key points using their feature descriptors; what about
regions, or even whole images, where the feature descriptor is in the form of a histogram
of some measurable quantity? There you require matching of histograms.

Now, histograms can also be considered as feature vectors, with each bin representing a
component of the feature vector. So, you can use the usual distance functions that we
discussed previously, like Lp norms, computed over the corresponding bins; usually the L1
norm is found to be used most often for histograms, but you can use any other norm.
However, there are other special distance functions or measures by which you can describe
the differences or similarities between two histograms.

One of these measures is called the Kullback-Leibler divergence measure; given two
probability distributions, it tells you how close or how different they are. The measure is
defined in this way:

$$D_{KL}(P\,\|\,Q) = \sum_x P(x) \ln\!\left(\frac{P(x)}{Q(x)}\right)$$

There is also another distance function, called the earth mover's distance, which I will
elaborate a bit more.
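
Before that, a tiny sketch of comparing two normalized histograms with the L1 distance and the KL divergence defined above; the small epsilon is only an implementation convenience to avoid division by zero.

import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    # D_KL(P || Q) for two normalized histograms, as defined above.
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    return np.sum(P * np.log((P + eps) / (Q + eps)))

P = np.array([0.1, 0.4, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.3, 0.2])
print(np.sum(np.abs(P - Q)))   # L1 distance between the histograms
print(kl_divergence(P, Q))     # KL divergence (note: not symmetric in P and Q)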

(Refer Slide Time: 25:16)

In this case, the idea is that you have two histograms P and Q, and you consider
transforming P into Q by transferring masses from a bin of P to bins of Q. Every bin has a
certain amount in it; that is the definition of a histogram, a frequency distribution, and the
frequency content of a bin is treated as its mass.

If you transfer a portion of that mass to a bin of Q, so that histogram P gets transformed
into Q, this transfer has a cost, and the cost is defined as the product of the transferred mass
and the distance between the bins.

For the i-th bin and the j-th bin, the difference of the bin locations itself can be taken as
the distance, the absolute difference between bin locations. You take the product of this
distance and the mass m being transferred from the i-th bin to the j-th bin. The transfer
operation itself updates the bins, for example,

$$P[i] \leftarrow P[i] - m \quad \text{and} \quad Q[j] \leftarrow Q[j] + m.$$

By doing this, what you are trying to achieve is to transform histogram P into Q, and in
this context we consider the total mass of P and Q to be the same.

After doing all these transfers, you expect that the distribution of masses in P has been
converted into the form of Q. The accumulated cost of the transfers, normalized by the
total transferred mass, is the measure of interest; the minimum such cost gives you the
distance, and that is what is called the earth mover's distance, because transferring mass is
something like digging earth from one place and placing that amount in another bin.

(Refer Slide Time: 28:26)

The name comes from that analogy. To provide a bit more mathematical formulation of
this computational problem, consider two normalized histograms; as assumed, their masses
should be the same, and the safest way to ensure this is to normalize both histograms so
that each sums to 1, which is the total mass each has. Let the i-th bin of histogram P be
$p_i$ and the j-th bin of histogram Q be $q_j$, let there be n bins, let $m_{ij}$ denote the mass
transferred from the i-th bin of P to the j-th bin of Q, and let $d_{ij}$ denote the distance
between the locations of the i-th and j-th bins.

So, the earth mover's distance is the minimum normalized work required for transforming
P into Q, which means you are trying to compute the following.

So, this is an optimization problem in which you have to minimize the following quantity;
the numerator is the total cost of transferring masses $m_{ij}$ from the i-th bins to the j-th
bins, and the denominator is the total mass transferred, which is the normalization of the
work:

$$EMD(P, Q) = \min_{M=\{m_{ij}\}} \left(\frac{\sum_{i,j} m_{ij}\, d_{ij}}{\sum_{i,j} m_{ij}}\right), \quad m_{ij} \geq 0$$

So, among all sets of transfers that convert the distribution P into Q, you look for the one
that gives the minimum normalized cost.

This is your computational problem, and since you are dealing with mass there are certain
constraints to consider. One constraint is that every transfer should be non-negative; if
there is a transfer, it has to be a positive amount. Another is that the total mass transferred
out of the i-th bin of P should not exceed the content of that bin; that is the capacity it has
for transferring mass to the bins of Q. Similarly, what you transfer into the j-th bin of Q
should not exceed the content of that j-th bin, because the aim is to transform histogram P
into histogram Q. Overall, the total transferred mass should not exceed the total mass of
the smaller of the two histograms. These constraints can be written as

$$\sum_j m_{ij} \leq p_i, \qquad \sum_i m_{ij} \leq q_j, \qquad \sum_{i,j} m_{ij} \leq \min\Big(\sum_i p_i, \sum_j q_j\Big).$$

In fact, this framework does not really require that the two histograms have the same mass,
but in our context we have considered that they do.

So, in this case the total transferred mass should be at most 1, because we have assumed
that the total mass of each normalized histogram is 1. This is the computational problem,
and I am not going to discuss its solution, which is outside the scope of this course; but if
you are interested you can go through a paper in which an efficient solution has been
presented, published in the IEEE Transactions on Pattern Analysis and Machine
Intelligence in 2007, and you can find an algorithm there. So, with this let me stop this
lecture at this point, and we will continue this discussion of matching of feature descriptors,
as well as the other topic of model fitting, in this series of lectures.

Thank you very much.

Keywords: Feature matching, nearest neighbor, earth mover’s distance, matching criteria

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 30
Feature Matching and Model Fitting Part - II

We are discussing Feature Matching, and now we will consider how this computation could
be made more efficient.

(Refer Slide Time: 00:25)

So, we have seen that in feature matching we are required to find similarities between two
feature vectors, or to compute the distance between a pair of feature vectors. The nature of
this computation is that, given a feature vector, you have to find its corresponding feature
vector in a candidate set. It is a kind of range query, which means that instead of a single
feature vector I may get a set of possible candidates in its proximity; that is the kind of
answer I may like to get.

Later on, I would like to find out which one of them is closest, or satisfies some other
criterion, by which I can declare it as the match. So, this is the nature of the query: typically
I am not interested in only one particular feature; I may be interested in a set of features,
or in a range or interval in which the feature vectors should lie. A special case, of course,
is that you consider only the nearest neighbor of the query; otherwise, as I mentioned, for
a range query you can report all the features within a distance from the query, as we have
seen in the strategy where you use a fixed distance threshold and report all features which
lie within the region centered around the query feature vector.

But the issue is that when you perform this computation, you have to compute the distances
to all the feature vectors in your candidate set; if there are n such feature vectors, you have
to compute n distances. Ordinarily, if I apply this brute force method of computing the
distance to each of them, the operation takes O(n), that is, linear time complexity.

So, our objective is to make this computation more efficient. As you might know, for this
kind of search problem, if you have a set of candidate vectors or candidate elements, you
can keep them in some organized manner so that your search becomes more efficient, and
in many cases you can perform it in sub-linear time. This operation of organizing the
candidate set in a particular form is also called indexing, particularly with respect to
databases; you might have heard this term.

In this case, n is the number of features in the target image or database, and to make the
search efficient you can use techniques similar to indexing. You can even use techniques
like hashing, where you store your candidate set by generating hash values, and elements
having the same hash value are kept together.

The idea is that a query is expected to have a similar hash value, so it should go to the
particular group where its candidate matches are expected to be, and then you compute the
distances or similarities only within that group. In this particular context there are two
techniques which are very popular for this kind of feature matching.

You can see that these feature vectors are multi-dimensional representations, and you can
use a structure called the K-dimensional tree, or K-D tree, where K is the dimension of the
feature vector. If the feature vector is two dimensional we call it a 2-D tree, if it is three
dimensional a 3-D tree, and if it is one dimensional a 1-D tree, which happens to be a binary
search tree. So, the K-D tree is a concept extended from the binary search tree, and it is
used for searching over a multi-dimensional feature representation. We will discuss the
principle of representation using a K-D tree and also the search that we perform over a
K-D tree. For hashing, there is a technique called locality sensitive hashing; as you can see,
in this case the geometry is also important.

Because you want to get feature vectors in a certain locality of your query feature vector,
the geometry of the space has to be considered. In general hashing, elements are grouped
essentially at random by the same hash value; but here we also need the candidates in a
group to be similar, that is, closely spaced, having the neighboring property. To ensure that
property there is a technique which is very popular, called locality sensitive hashing, and
we will also discuss this technique.

(Refer Slide Time: 06:45)

So, let us first consider the principles behind the K-D tree. As I mentioned, the K-D tree is
a kind of extension of the binary search tree for multi-dimensional feature search. It is
effectively a binary tree, and every node contains a key feature vector, instead of the single
key value of a one-dimensional binary search tree. As I mentioned, each node behaves like
a node of a binary search tree with respect to the values of one particular dimension.

That dimension is predefined for the position of the node. Say you start from the root; at
the root we use only one dimension. Suppose you have a two dimensional feature
representation, with dimension 1 as x and dimension 2 as y. Then at the root the rule used
for partitioning the values, or partitioning the space, is based on only one dimension, say
dimension x; I will explain this subsequently. We call this dimension the cut dimension:
the dimension based on which the partitioning takes place, and it alternates periodically on
the nodes along any path. That is how a key value is placed.

So, just as in a binary search tree the order of values determines the tree structure, here also
as the feature vectors arrive you keep them in this structure, and when you are placing a
vector at a node you need to find out which dimension is to be used for deciding whether
it goes into a particular sub-tree or node. That is what places a key feature vector: you
follow the rules of the binary search tree, comparing on the value of the cut dimension. Let
me explain the formation of a K-D tree.

(Refer Slide Time: 09:29)

Let us consider a set of two-dimensional feature vectors, say (25, 75), (−30, 70),
(−60, −50), (40, 90), and (15, 30). So, I am assuming that I have five feature vectors
forming a candidate set, and I have to organize these feature vectors into a K-D tree by
inserting one element at a time. I start from the root, where the cut dimension is x; the first
element, (25, 75), is inserted at the root, and its x value is what the next insertions are
compared against.

So, the next value is (−30, 70); the x-dimension is compared at the root, and as −30 is less
than 25, according to the binary search tree rule it is placed as the left child. Next comes
(−60, −50): starting from the root, −60 is compared on the x-dimension, and since −60 is
less than 25 we go to the left child.

Now −50 is compared, because in this node the y-dimension is the cut dimension; −50 is
less than 70, so it goes into the left child of (−30, 70). This is the node (−60, −50), where
the y-dimension was compared, and at this node the cut dimension becomes x again.
Consider the next element, (40, 90).

Starting from the root, 40 is greater than 25, so it should go to the right, which is vacant;
so (40, 90) is placed there, and at this node the cut dimension is y. Next is (15, 30): from
the root, 15 is compared with 25 and we go to the left node; then 30 is compared with 70,
and since 30 is less than 70 we move to its left child; now x is compared again, and 15 is
more than −60, so (15, 30) becomes the right child of (−60, −50), and at this node the cut
dimension is y. In this way, as you can see, the tree is built; as you include more elements,
the tree increases its levels and expands its nodes.

In this structure, suppose you have a query, say (10, −3), and you would like to get its
nearest neighbor. You cannot simply apply the binary search tree rule, because even if you
move into a certain sub-tree, the other sub-tree may also contain the nearest neighbor.

To see why, note how the partitions of the space are made. Let me consider a space where
the values range from (−100, −100) to (100, 100). At the root the value 25 is used for the
x-dimension, so the line x = 25 splits the space, with one axis as x and the other as y.

All the two dimensional points whose x value is more than 25 belong to the subspace on
one side of this line; the points in the right sub-tree should lie in that box, and all the other
points, those in the left sub-tree, should lie in the other part. That is the idea. Then consider
(−30, 70); in this case you have to use the y coordinate, so within the left sub-tree the value
70 is used and the line y = 70 further splits that region. This region represents the sub-tree
of (−30, 70), bounded by x = 25 in the x direction and by y = 70; and if I consider (40, 90),
that point likewise defines its own region. So, these are the two regions.

So, this should be the left sub-tree of this node and this the right sub-tree; similarly, this is
the left partition of this node and this is the right partition. When you have (−60, −50), as
it lies in the left portion, the cut dimension there is x, so the line x = −60 splits that cell into
a left part and a right part; and when (15, 30) arrives, the y-dimension is cut, so it splits its
cell accordingly. You can see which cell corresponds to the right sub-tree and which to the
left sub-tree of each node.

In this way you are subdividing the region into different cells, and a query point gets placed
into one of these cells; you expect that its neighboring points should also be in that region.
But the problem is that some points in the bordering regions could also be neighbors, so it
is not sufficient to search just this cell; you have to search the neighboring cells as well.
Therefore, you have to follow a search strategy which is not exactly the same as the binary
search strategy to find the nearest neighbor. Let us discuss that here.
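
To make the insertion rule explicit, here is a small sketch of K-D tree construction with an alternating cut dimension; the class and function names are illustrative, not taken from any particular library.

class KDNode:
    def __init__(self, point, cut_dim):
        self.point = point          # the key feature vector stored at this node
        self.cut_dim = cut_dim      # the dimension compared at this node
        self.left = None
        self.right = None

def insert(root, point, k=2, depth=0):
    # Insert a k-dimensional point, alternating the cut dimension with depth.
    if root is None:
        return KDNode(point, depth % k)
    d = root.cut_dim
    if point[d] < root.point[d]:    # binary-search-tree rule on the cut dimension
        root.left = insert(root.left, point, k, depth + 1)
    else:
        root.right = insert(root.right, point, k, depth + 1)
    return root

# Builds the tree for the example points used above.
root = None
for p in [(25, 75), (-30, 70), (-60, -50), (40, 90), (15, 30)]:
    root = insert(root, p)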

(Refer Slide Time: 18:25)

So, we will now consider nearest neighbor search using a K-D tree. We have seen how the
K-D tree partitions the space and locates a query in a particular cell, although its
neighboring cells also have to be considered for finding the nearest neighbor. The strategy
is that, as you walk through the tree nodes, you should always keep the current smallest
distance and the corresponding key feature.

While traversing the nodes in search of a key feature, you should not ignore any sub-tree
unless you can make sure that all the key feature vectors within that sub-tree are at a
distance greater than the minimum distance found so far. That is why you have to store the
current smallest distance and the respective key feature. You can then prune sub-trees by
comparing the minimum distance with the corner points of the hyper-volume, here the
bounding boxes that have been shown.

So, you can note down those corner points and compare the distances to them; if you find
that all the distances to the corner points of a cell are greater than the current minimum
distance, then you need not search any key values in that region. In this way you can prune.

While deciding which sub-tree to move into next, you should search the sub-tree that
maximizes the chance of pruning, which means go first to the sub-tree which is closer to
the query. You are not immediately discarding any sub-tree from your search space, as we
do in a binary search tree; in a binary search tree, at every node we take a decision and go
either left or right, but in a K-D tree we cannot do that.

We may have to traverse both sub-trees, unless we make sure that, for the cell represented
by a sub-tree, all the corner points are at a distance greater than the smallest distance found
so far, in which case we can prune it. You proceed in this way, and finally, whatever sub-trees
remain at the end, you compare all the feature vectors in those sub-trees to find the nearest
neighbor.

(Refer Slide Time: 21:11)

Regarding the time complexity, just as the worst case time complexity of a binary search
tree is linear, in a K-D tree it is also linear in the worst case; but in practice you may get

$$O(\log N + 2^d), \quad \text{where } d \text{ is the dimension of the feature space,}$$

with $\log N$ to find the cell near the query point and $2^d$ for searching the cells in the
neighborhood. As I mentioned, it is not just that cell; you have to consider the neighboring
cells as well.
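
In practice one rarely implements this search by hand; for instance, SciPy's cKDTree provides both the construction and the nearest neighbor query, as in this small illustration.

import numpy as np
from scipy.spatial import cKDTree

points = np.array([(25, 75), (-30, 70), (-60, -50), (40, 90), (15, 30)], dtype=float)
tree = cKDTree(points)                  # builds the K-D tree over the candidate set

dist, idx = tree.query([10, -3], k=1)   # nearest neighbor of the query point
print(points[idx], dist)                # closest candidate and its distance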

(Refer Slide Time: 21:45)

The other technique for performing this efficient search is the locality sensitive hashing
technique, and I will elaborate a bit more on it, providing the principles without going into
much detail. The required property here is that the hash function should preserve the
locality of feature vectors with respect to the distance function. That means, if I consider a
hash function h and two feature vectors x and y, shown here in bold font, then the
probability that both hash values are the same should be very high when the distance
between the vectors is small, that is, when they are closely spaced. If I can ensure this
property, then such hash functions will be useful in placing similar features into the same
group, the same bucket.

That is the principle we will be using. Let us consider an example of such a hash function.
Just as the probability of equal hash values should be high when the distances are small,
the probability that the hash values differ should be high when the distances are large; it
should satisfy that property as well.

(Refer Slide Time: 23:19)

A typical example: consider a random unit vector r of dimension n. For example, you can
form this vector by independently generating a value in each dimension following a
normal distribution; doing this n times gives you a random vector of dimension n. Then for
any vector x of dimension n, you can define a hash function

$$h_r(x) = \begin{cases} 1 & \text{if } x \cdot r > 0 \\ 0 & \text{otherwise.} \end{cases}$$

So, it is a two valued hash function; there are only two values, either 1 or 0. The interesting
fact is that it can be shown that for any two vectors x and y, the probability that the two
hash values are equal is given by

$$\Pr[h_r(x) = h_r(y)] = 1 - \frac{\theta(x, y)}{\pi},$$

where the angle $\theta(x, y)$ between the vectors is expressed in radians. In fact, this function
behaves closely like a cosine function, which means that when the angle is very small, the
feature vectors are quite near and the probability of having the same hash value, either both
0 or both 1, is close to 1. That is the particular significance of this property.

(Refer Slide Time: 25:29)

Using this property, you can design a scheme of multi-dimensional bucketing. Instead of
having a single random vector, you take k random vectors and independently generate k
hash values. Together they give a k-dimensional binary representation, which itself can be
the index of a bucket, a multi-dimensional bucket, and you place your input vector into one
of these buckets. With k binary hash values there could be $2^k$ buckets.

Whenever input vectors have the same bucket number, you place them in that bucket; in
this way you perform the hashing, and this is how a particular bucket is characterized. You
can extend this further: instead of having a single multi-dimensional bucketing scheme,
you can repeat the process with L such families of hash functions, keeping L independent
instances. Instead of keeping a feature vector in only one bucket, you keep it in L hash
tables; by doing this multiple times, you make the probability of finding the neighboring
feature vector higher.

So, you repeat this process L times; for example, you can generate such multi-dimensional
hash indices L times, so you have L such hash tables, and that is the scheme. Then, when
you perform a nearest neighbor search for a given query, you compute its L buckets and
search for the nearest neighbor in all of them. That makes locality sensitive hashing more
effective, and the chance of missing a nearest neighbor becomes less.
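
The whole scheme fits in a few lines of NumPy; the sketch below uses random-projection sign bits as the k hash values and L independent tables, with all class and variable names purely illustrative.

import numpy as np

class RandomProjectionLSH:
    # Sketch of locality sensitive hashing with k sign-bit hashes and L tables.
    def __init__(self, dim, k=8, L=4, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(L, k, dim))   # L sets of k random vectors r
        self.tables = [dict() for _ in range(L)]

    def _key(self, l, x):
        bits = (self.planes[l] @ x > 0).astype(int)  # h_r(x) = 1 if x.r > 0 else 0
        return tuple(bits)                           # k-bit bucket index

    def add(self, idx, x):
        for l in range(len(self.tables)):
            self.tables[l].setdefault(self._key(l, x), []).append(idx)

    def query(self, x, data):
        # Gather candidates from the query's bucket in each of the L tables,
        # then return the closest candidate by exact distance.
        cand = {i for l in range(len(self.tables))
                  for i in self.tables[l].get(self._key(l, x), [])}
        if not cand:
            return None
        return min(cand, key=lambda i: np.linalg.norm(data[i] - x))

data = np.random.rand(1000, 64)
lsh = RandomProjectionLSH(dim=64)
for i, v in enumerate(data):
    lsh.add(i, v)
print(lsh.query(data[17] + 0.01, data))   # very likely returns index 17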

So, with this let me stop this lecture at this point, and we will continue this discussion. In
this topic we have two parts: one is matching, and the other is model fitting, which is
related, because after matching the points you use those points to derive a model which
explains the data. We will discuss some of these model fitting techniques in the continuing
lectures.

Thank you very much.

Keywords: K-D tree, nearest search, locality sensitive hashing, bucketing

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 31
Feature Matching and Model Fitting Part - III

We are discussing Feature Matching and Model Fitting. In the last two lectures, we
discussed techniques for matching feature vectors and also efficient computation for that
problem. In this lecture, we will consider the issue of model fitting.

(Refer Slide Time: 00:35)

So, the computational problem of model fitting is that, given a set of data points, we need
to fit a model to establish the relationship among the data points. To pose the computational
problem, we first need to have a set of data points; so we should obtain data points, and
then we need to fit an appropriate model to explain those data points, that is, as mentioned,
to establish the relationship among them.

In the last few lectures we discussed how we can obtain these data points by detecting
feature points and then matching them across multiple views; in different images you can
establish correspondences, and there are various problems in which such correspondences
are required for fitting a model. Some of the examples, also discussed in our previous
topics, are the following. For example, if between corresponding pixels in two images there
exists a homography, how do we compute the homography matrix? That itself is a model
fitting problem, and we have already discussed its solution. In this lecture we will consider
somewhat more general issues involved in this kind of computational problem.

Another example could be the computation of the fundamental matrix from corresponding
points in two stereo images. Even if I give you a set of 2D points, how can you fit a straight
line, a parabolic curve, a circle, or a higher degree polynomial curve passing through them?
There has to be some knowledge about this kind of trend or relationship, about the model,
and using that knowledge you try to get the precise model that fits these points.

For obtaining the data points we need to apply various image processing techniques to
generate data points from the images, such as the various pre-processing techniques and
the feature detection, description, and matching that we have already discussed in previous
lectures.

(Refer Slide Time: 03:29)

As I was saying, knowledge of models is crucial for defining a computational problem:
what should be the mathematical form of the class or family of models? For example, we
know the mathematical form of the homography relationship among image points between
two scenes, or two images of the same three-dimensional scene; we know that for a
two-dimensional projective transformation the transformation matrix must be a 3 x 3
non-singular matrix. Using that knowledge we have solved that problem, as already
discussed in a previous topic.

Similarly, we know the form of the fundamental matrix in stereo geometry and its role, and
applying those relationships we can derive the fundamental matrix from a set of
corresponding points. We should have the knowledge of the structure of the fundamental
matrix, its properties, such as the fact that it is a singular matrix, and also how this matrix
relates the corresponding points; that is what needs to be used while fitting the model.

We have also seen how the form of the projection matrix of a camera can be derived, as a
3 x 4 matrix which maps a three dimensional scene point to an image point in the projective
space, or in homogeneous coordinates. So, given a set of correspondences between
three-dimensional scene points and their image points, we can derive a projection matrix
by using these model fitting techniques; this has also been discussed in a previous topic.

(Refer Slide Time: 05:53)

So, the choice of a model is very important. It comes from knowledge or analysis of the
particular system, of how the data are generated or obtained; from there you can get that
information and choose an appropriate model. But there are situations where you may not
be able to precisely define the structure of the model.

So, you may have to guess; there are intuitive or intelligent guesses we can make, but still
you do not know precisely how many independent parameters should play a role in that
model. So, how do you ascertain, when you choose a model using such intuitive and
intelligent guesswork, whether that model is appropriate for your data fitting?

There are certain checks and balances for that. Particularly, you need to consider the error
of fitting. There are various kinds of errors in use; we have used the mean square error
between the predicted value and the observed value in our previous model fitting
techniques for fitting homography matrices, fundamental matrices, or projection matrices.
We can define the mean square error similarly in various other contexts.

Just to explain what is meant by mean square error, though it may be very clear to you from
our different exercises: suppose you have some data, and your measurement is a scalar
quantity y which depends upon a feature vector. So, corresponding to a feature vector you
have a measurement y, and you postulate that there exists a functional relationship between
the feature vector and your measurement; that is the model you consider. You suppose the
form of this function and apply different techniques to derive it.

Now, when you actually apply a model, the value you get is the model predicted value,
while the other is the observed value, the observed data. The error between observation
and prediction can be defined as the squared error; since I have assumed in this particular
case that the value is a scalar, I can simply take the difference. You can extend this concept
when your observation is also a vector, by considering the norm of the difference.

For the mean square error, suppose you have N observations, each a pair of a feature vector
and a measurement. For each observation you find the corresponding predicted value, sum
all the squared errors, and take the mean; that is the mean square error:

$$MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

In fact, you have used this error in obtaining the homography matrix, fundamental matrix,
or projection matrix in previous computational problems; there the mean square error is
expressed as the mean of the norms of difference vectors. This is one particular example
of mean square error; let us now consider other aspects.

(Refer Slide Time: 11:21)

We can also evaluate the strength of a model by computing the likelihood of the data given
the model, that is, the probability of occurrence of the data given the model. This measure
should be high; as you know it is a probabilistic measure, so the value lies between 0 and 1,
and if it is a very high value your model fitting is good. That is one kind of evaluation of
models, and it determines how good a model is; when you have competing models,
computing their likelihoods and their relative ratios can tell you which model should be
chosen. There are theories for that; I am just providing the intuitive reasoning here.

The size of a model is also important. In brief, the size of a model is determined by the
number of independent parameters involved in the model; that is just a short way of
defining size. If you have a larger number of independent parameters, your model is more
complex; that is the rough idea. So, you have to choose a particular size.

Suppose you have a simple linear relationship between your measurements and the
observed feature vectors, and a feature vector has n components. In that case a linear model
can be written in the form

$$y = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n,$$

where the feature vector is an n dimensional element represented in an n dimensional space
in this fashion, and you can also include a constant term in the model. So, we can see that
in this linear model there are n + 1 parameters, and they are the independent parameters.

In a given problem you have to establish whether they are actually independent or not, but
let us assume they are all independent. So, this is one notion of the size of a model. If you
consider only a subset of the feature components, that is, only certain dimensions are
related to the observations, then your model size gets reduced; or, if you want to use
non-linear forms, then there will be coefficients for the non-linear terms as well, and your
model size will increase, since there will be more parameters. So, depending upon your
model description the number of independent parameters varies, and it determines, in one
way, the complexity of the model.
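
As a small illustration of fitting such a linear model and measuring its error, a least-squares solution can be obtained with NumPy; the data here are synthetic and only for demonstration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                        # 200 feature vectors, n = 3
true_a = np.array([2.0, -1.0, 0.5])
y = X @ true_a + 3.0 + 0.1 * rng.normal(size=200)    # linear model plus noise

# Augment with a column of ones for the constant term: n + 1 parameters in all.
Xa = np.hstack([X, np.ones((200, 1))])
a_hat, *_ = np.linalg.lstsq(Xa, y, rcond=None)       # least-squares fit

y_pred = Xa @ a_hat
mse = np.mean((y - y_pred) ** 2)                     # mean square error of the fit
print(a_hat, mse)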

The question is what the form of the model should be when, as I mentioned, you are not
sure about its precise structure. In the previous cases of determining the homography
matrix, fundamental matrix, or projection matrix, the structures were precisely defined and
their properties were known to us, and applying them we derived the models. But in some
situations you may not have that; then you have to apply intuitive or intelligent guesses
about the structure, observe the error of fitting or the likelihood of the data under that
model, and decide whether to accept the model or not.

(Refer Slide Time: 16:09)

So, regarding this, when you are evaluating the performance of a model there are two particular types of errors that you should note. One is the training error, the other is the test error. These terms come from the machine learning perspective. When we fit a model using a set of data, that is a kind of training operation, and when we test that model with another set of data which has been kept outside this training set, but for which we know the actual, ground-truth values, we compare those values with the predicted values; the error obtained from those comparisons is what we call the test error.

Now, by observing the amounts of training error and test error, we can also qualify the model fit. For example, if your training error is very large, that itself reflects that your model is weak; it does not explain the data properly. It is an under-fitting case, which means the number of parameters in the model may have to be increased. If there is a low training error, meaning in training you get a reasonably small error term, but while testing you find that you are actually getting a large test error, it means your model is very data specific, and we call that over-fitting. It tries to minimize the error by considering just that training data.

The moment you want to generalize this model over other data sets which were not used in fitting it, your model does not perform well. So, this is an over-fitting case, and we have to reconsider what kind of model structure you should take. Most likely you have taken too many parameters, and you have to work with fewer parameters in that case. The ideal situation is that you get a low training error and also a low test error. Then, that gives you the confidence of having a good fit.

(Refer Slide Time: 18:39)

Let us look at some examples of model fitting observed in imaging. We have discussed examples of model fitting regarding homography computation, fundamental matrix computation or projection (camera) matrix computation, but for geometric relationships in a two-dimensional image there are also some simpler models. As I have shown, given a set of points you may have to find the lines which could be formed by those points.

So, the model is that a set of points lies on a particular line, and you describe that line by its parametric form. It could also be a circle, or it could be any arbitrary shape. Consider the scenario where the boundary of an object has been described by such a shape, and in another image we would like to see whether that object is present or not. Using a polygonal model, or the boundary described by that kind of shape, we try to find out whether it exists in the other image or not. But a model for an arbitrary shape is a somewhat complicated model.

So, the question is how to decide on an appropriate parametric model, or an appropriate form of model, to relate the data points, or to establish relationships among these data points.

(Refer Slide Time: 20:25)

So, there are various issues involved in model fitting. Your observations could be noisy; there could be errors. For example, even when you are obtaining data you are applying different computational techniques, different kinds of estimation, and this estimation will have some error. So, feature locations may not be precisely found; there could be some error in deciding those locations. There could also be data which is not generated by the process that the model describes; such data are called clutter or outliers.

Say you are assuming that between the corresponding points of two images there exists a homography, and you know that a planar scene induces a homography. So, the assumption is that those points are images of scene points lying on the same plane. But it may happen that in your observations, in your measurements or in your experiments, when you establish correspondences and obtain those data points, some of them are out of the plane, images of scene points which are not lying on the same plane. If you then try to estimate the homography matrix, they would introduce erroneous measurements for the model, and since you are trying to optimise the overall error, that would create a problem.

Then there could be multiple lines. Your model is for a single line, but actually there are multiple lines, and then the model for a single line will not be applicable. Some data could be missing also, for example due to occlusion; you may get only partial information in your data points, and it will not give you the full picture of the model in that case.

(Refer Slide Time: 22:41)

Now, for this lecture we will consider a particular type of model fitting problem, to give you some idea of the issues we discussed and some approaches to handle them; these approaches can be extended to fitting more complex models as well. We will consider only a very simple model, fitting a straight line over two-dimensional points, and we will discuss some of the techniques shown here: least squares, total least squares, random sample consensus, and then the Hough transform technique, mentioned here as the Hough voting technique.

(Refer Slide Time: 23:29)

So, actually these techniques are applicable in varying contexts. Suppose you have points belonging to a line and you have to find the optimal line parameters; then least squares techniques are applicable, very useful and very appropriate. But suppose there are outliers; in that case we should think about a technique like random sample consensus, or in short RANSAC. We will see that it is a very generic approach for fitting different kinds of models where we expect outliers, or clutter, in the data, as we mentioned earlier.

If there are many lines and you would like to fit a model, then voting methods such as the Hough transform are appropriate. So, in this lecture we will be discussing these different kinds of techniques.

(Refer Slide Time: 24:35)

So, let us first consider the technique of least squares for line fitting. The problem, as I have shown here, is that you have data given in the form of a set of two-dimensional points. There are n points, and their coordinates are denoted in this form; that is, the i-th point has coordinates x_i and y_i. The relationship between these coordinates, in the model, is a straight line equation. Strictly speaking this is not a linear relationship; as you know, a straight line with an intercept gives what is called an affine relationship. Colloquially we still call it a linear fit, but actually it is an affine relationship.

Anyhow, we know how a line is represented in a two-dimensional coordinate system, and that is the familiar representation y = mx + c; that is the model. For a particular instance, the i-th instance, we write it as y_i = m x_i + c. Then the error of fit can be expressed in this form,

E = \sum_{i=1}^{n} (y_i - m x_i - c)^2

So, you can see in this particular diagram that if this is the straight line given by the equation y = mx + c, then for a given x_i this is the value predicted by the line and this is your observed data y_i. The deviation is this vertical difference, so we call it the vertical error, and the sum of the squares of these deviations gives the squared error. That is how the error term is defined, and we call this vertical least squares because we are trying to minimize this vertical error. So, we have to find the straight line, that is the m and c, which minimizes the sum of squares of these errors; the error can be expressed in this form, and then we need to solve this problem: find m and c to minimize it.

So, let me stop here for this lecture. We have understood what problem we need to solve. We will continue this discussion in the next lecture.

Thank you very much for your attention.

Keywords: Model knowledge, mean square error, fitting curves, occlusions, RANSAC

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 32
Feature Matching and Model Fitting Part – IV

(Refer Slide Time: 00:24)

We are discussing least squares line fitting. As we have seen, we are given a set of data points, and the form of the model in this case is a straight line given by the equation y = mx + c. The error term is defined as the sum of the squares of the vertical deviations of the observed values from the predicted values, which is given by the equation

E = \sum_{i=1}^{n} (y_i - m x_i - c)^2

So, we have to minimize this error with appropriate values of m and c; that is the problem: find m and c to minimize this error. We can write the expression of the error in a short matrix notation of the data, as you can see here; to elaborate how this representation is obtained,

E = \sum_{i=1}^{n}\left(y_i - [x_i \;\; 1]\begin{bmatrix} m \\ c \end{bmatrix}\right)^2 = \left\|\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} - \begin{bmatrix} x_1 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix}\begin{bmatrix} m \\ c \end{bmatrix}\right\|^2 = \|Y - XC\|^2

In the first row of this column vector we get y_1 - m x_1 - c, in the second row y_2 - m x_2 - c, and in this way the last row is y_n - m x_n - c. If you take the norm of this vector and square it, that exactly gives you the previous expression of the error; you can verify that. So, this is nothing but the same error represented in a short form: the error E is written in matrix notation using Y - XC, where Y is this vector of observations, X is this matrix of data rows, and C is the vector of parameters. So, this is how the notation has been derived; let me wipe out these writings and proceed further.

(Refer Slide Time: 03:05)

So, the error E can now be expressed in matrix notation, and since the squared norm of a vector can be expanded, we can write it in this form

E = (Y - XC)^T (Y - XC) = Y^T Y - 2(XC)^T Y + (XC)^T (XC)

Just to explain once again: if you have a vector represented as a column vector, the square of its magnitude is its transpose multiplied by itself, which, as you understand, is a scalar quantity.

Now you expand each term using matrix algebra and perform the multiplications. Since matrix multiplication is a linear operation, you can carry out the derivatives entirely in matrix form, and you can extend the analysis we know for the ordinary single-variable, single-dimensional function case.

(Refer Slide Time: 04:45)

So, in this case, if I take the derivative with respect to C, we get the following equation, which means you are independently taking the derivative of E with respect to each component of C,

\frac{dE}{dC} = 2X^T X C - 2X^T Y = 0

This relationship can be derived using matrix algebra and its derivatives; in short, we have written it in this form. As you can see here, one term is not related to C, it is a kind of constant term, while these are the terms which are related to C.

You can intuitively extend your knowledge of differential calculus for an ordinary single-dimensional variable, or single-dimensional function space, to the matrix notation; you just need some practice on that part to derive this in a very short and fast way. Otherwise you can do it component-wise, and you will find there is a one-to-one relationship with this kind of expression. Once you obtain this, you perform the algebraic manipulations with this relation once again.

(Refer Slide Time: 06:40)

C = (X^T X)^{-1} X^T Y

This is the famous pseudo-inverse relationship which we also discussed previously. To give you the picture: I am trying to fit a model of the form Y = XC, so I am trying to get C, given Y and X, which are the given data. That is exactly what we have derived here; it is the result of the minimization of the error.
(Refer Slide Time: 07:41)

So, just to expand this relationship at a more granular level of the data elements, that is, to get a solution for m and c, we get the following form; m can be expressed as the ratio

m = \frac{cov(X, Y)}{var(X)} = \frac{\frac{1}{n}\sum_i x_i y_i - \bar{x}\,\bar{y}}{\frac{1}{n}\sum_i x_i^2 - \bar{x}^2}

Anyway, there is a simpler way of deriving this relationship: take the partial derivatives of E with respect to m and with respect to c, set them to zero, and solve the resulting two equations for m and c. These are the equations we get here, where x-bar is the mean of the x_i's and y-bar is the mean of the y_i's. The error can then be estimated by substituting the values of m and c back into the expression, and eventually you can find that it comes out like this

E = n\left(var(y) - m^2\, var(x)\right)

Regarding the goodness of fit: as I mentioned, for any model fitting you should observe this error. The goodness of fit is related to this error and can be expressed by a quantity called

R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \;\Rightarrow\; R^2 = 1 - \frac{E}{n \cdot var(y)}

It expresses what part of the variance is explained by the fit; it compares the variability in the measurements not explained by the model to the total variability in the measurements.

So, if the R-squared value is very high, then your model fit is good. This value should lie between 0 and 1, and as shown it is related to E; it is called the coefficient of determination. Once you fit a model you should also compute R-squared, and the value should be quite high, say about 0.8 or more; that would be considered a good fit in that case.
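A small self-contained sketch of this check (illustrative data; np.polyfit is used here just to obtain m and c):

```python
import numpy as np

def r_squared(x, y, m, c):
    """Coefficient of determination: R^2 = 1 - E / (n * var(y))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    E = np.sum((y - (m * x + c)) ** 2)          # residual sum of squared errors
    return 1.0 - E / (y.size * np.var(y))

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
m, c = np.polyfit(x, y, 1)                      # vertical least-squares line
print(r_squared(x, y, m, c))                    # close to 1.0 indicates a good fit
```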

You should also note that this technique is not rotation invariant and it fails completely for vertical lines, because if the line is really vertical, say x = constant, the slope m would have to be infinite. The variance of x tends to be very, very small and m goes to infinity, so the method fails in that case.

(Refer Slide Time: 11:17)

So, there are techniques which take care of this situation, where the objective criterion for fitting the model is also posed in a different way; this technique is called total least squares. It is suitable for any such situation, whether the line is vertical or not. In this particular diagram it is shown what kind of error measurement we are considering here: instead of the vertical deviation, we consider the deviation of the point from the line itself, where deviation means the perpendicular distance between the point and the line.

Those are the deviations, and the sum of the squares of those deviations defines the error. We assume here that the line is given in the form px + qy = d, so that the normal direction of the line is given by (p, q); that is the property, as you can see. The distance between a point and the line, as I mentioned, is the perpendicular distance, which is algebraically given as

|p x_i + q y_i - d|, \quad \text{given } p^2 + q^2 = 1

So, you have to use this normalized representation; you can always bring the straight line equation to this form by choosing d appropriately and scaling the coefficients.
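A minimal sketch of this distance computation, assuming the line coefficients may be given at an arbitrary scale:

```python
import numpy as np

def point_line_distance(p, q, d, x, y):
    """Perpendicular distance of the point (x, y) from the line px + qy = d.

    The coefficients are rescaled so that p^2 + q^2 = 1 before evaluation."""
    s = np.hypot(p, q)
    p, q, d = p / s, q / s, d / s
    return abs(p * x + q * y - d)

print(point_line_distance(3.0, 4.0, 10.0, 1.0, 1.0))   # distance of (1, 1) from 3x + 4y = 10
```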

So, the error is given in this form, the sum over the points of the square of p x_i + q y_i - d, and if I take the derivative with respect to d, then

\frac{dE}{dd} = \sum_{i=1}^{n} -2(p x_i + q y_i - d) = 0 \;\Rightarrow\; d = p\bar{x} + q\bar{y}

E = \sum_{i=1}^{n} \left(p(x_i - \bar{x}) + q(y_i - \bar{y})\right)^2

By substituting d, the error is now basically a function of p and q, and the constraint is that p squared plus q squared should be equal to 1. So, you have to choose the p and q which minimize this error E, subject to the constraint p^2 + q^2 = 1; that is the problem you need to solve.

(Refer Slide Time: 13:34)

Once again we can use matrix notation to represent the sum of squared errors in a short form; that is, it is the squared norm of this particular vector. To elaborate once more: the first row gives p(x_1 - x̄) + q(y_1 - ȳ), and in this way each row is formed, so the last row is p(x_n - x̄) + q(y_n - ȳ). If you take the squared norm of this column vector, you get this expression.

So, this is how it is represented in this form, and you would like to make E as small as possible, ideally 0, to get a solution. This can also be expressed as

E = (UN)^T (UN)

So, this is how this short form is explained; let us continue with this representation.

(Refer Slide Time: 15:16)

As I mentioned, this is U and this is N. We need to solve this problem, minimizing this error by choosing an appropriate N. If I differentiate E with respect to N, we get the relationship

\frac{dE}{dN} = 2(U^T U)N = 0, \quad \text{subject to } \|N\| = 1

where the constraint means p^2 + q^2 should be equal to 1, because that is the constraint we have put. As you can see, this gives a set of homogeneous equations.

So, you have to find the null vector of U^T U, and you can do it in various ways. One way is to compute the eigenvectors and eigenvalues; ideally there will be one eigenvalue equal to 0, but because of measurement noise and so on it will not be exactly 0, so you should take the eigenvector corresponding to the smallest eigenvalue of the matrix U^T U. That should be your solution in this case: the eigenvector of U^T U corresponding to the smallest eigenvalue is what you should consider.

(Refer Slide Time: 16:53)

We can understand this result better by applying a notion of geometry, as summarised in the analysis. You can see that the direction of the normal is given by N; it points in the direction perpendicular to the line you have fitted, in the form px + qy = d, and the interpretation of the parameter d is that it is the perpendicular distance from the origin to that line.

The structure of U^T U is also interesting to note. You can see that it is a 2 x 2 matrix; there will be two eigenvalues, of which you have to choose the minimum, and the eigenvectors are of dimension 2 x 1. The diagonal elements correspond to the variances of the x coordinates and the y coordinates, whereas the off-diagonal elements correspond to the covariance of x and y, and it is a symmetric matrix. So, once again, the eigenvector of U^T U corresponding to the smallest eigenvalue is the solution of this problem.

U^T U = \begin{bmatrix} \sum_{i=1}^{n}(x_i - \bar{x})^2 & \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \\ \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) & \sum_{i=1}^{n}(y_i - \bar{y})^2 \end{bmatrix}
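A minimal sketch of this eigenvector-based solution (illustrative data; the near-vertical example shows the advantage over vertical least squares):

```python
import numpy as np

def fit_line_total_ls(x, y):
    """Total least squares line px + qy = d with p^2 + q^2 = 1.

    (p, q) is the eigenvector of U^T U for its smallest eigenvalue, where the
    rows of U are the mean-centred data points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    U = np.column_stack([x - x.mean(), y - y.mean()])
    w, V = np.linalg.eigh(U.T @ U)          # eigenvalues in ascending order
    p, q = V[:, 0]                          # eigenvector of the smallest eigenvalue
    d = p * x.mean() + q * y.mean()
    return p, q, d

# Works even for a nearly vertical line, e.g. points close to x = 2
x = np.array([2.00, 2.01, 1.99, 2.00])
y = np.array([0.0, 1.0, 2.0, 3.0])
print(fit_line_total_ls(x, y))              # p close to +-1, q close to 0, d close to +-2
```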

(Refer Slide Time: 18:18)

Let us now consider the problem of line fitting when there is clutter in the observations, that is, when there are outliers. In fact, although we will discuss this technique with respect to line fitting, it provides a general framework for model fitting in the presence of outliers. You can apply it to our previous cases of finding the homography matrix, fundamental matrix or projection matrix, which we discussed in earlier topics.

The outline of this technique is as follows: choose a small subset of points uniformly at random, fit a model to that subset, then find all remaining points that are close to the model and reject the rest as outliers; do this many times and choose the best model. This is what the technique does, and this general line of approach is what we will discuss with respect to a straight line. Let me give you a general outline first and explain it.

(Refer Slide Time: 19:26)

(Refer Slide Time: 19:33)

Following the approach I mentioned earlier, consider that you have a set of points to which a straight line is to be fitted. There are some points which deviate, and we consider them outliers. You may consider that this could be a good fit of a straight line among these points, but if I apply the least squares method directly, the outliers will pull the estimate towards a straight line which does not really fit; it may come out something like this, minimizing the least squares error, which is not desirable. So, the objective is to first find a set of reliable points, which we call inlier points, and then apply model fitting on those points only.

(Refer Slide Time: 20:35)

So, in that case, once again using the example, there are some points which form a good set of inliers for fitting a straight line, and there are two outlier points, as I mentioned. What you do is take some initial set of points chosen arbitrarily, say minimally these two points; then you fit a straight line and find out how many points lie in the vicinity of that straight line by looking at the perpendicular distance, using a threshold to declare whether each point lies as an inlier or an outlier with respect to that model.

Now, if this number is quite high, then you can say that this is a good model, and then again use all those additional inlier points to refine the fit of your model. But as you can see in this scenario, there are only, say, two additional points, four points in total, and you can set your parameters in such a way as to decide that this is not a very good fit. So, we will try another set of points.

So, let us consider now that you have randomly chosen, say, this point and this point, giving an initial set of inlier points, and then you again perform the test; that is, you find out which points lie close to this line by comparing the distances, and you consider all the points which fall outside this threshold as outliers.

Now you can see that a good many points lie close to this straight line, and if this number is high, you may choose this set of points as inlier points and refine your model.

So, unless you get a good set of inlier points, you go on doing this trial, repeating the operation a number of times, and if after a certain number of trials you do not get a good set of inlier points in any trial, you may drop the idea of fitting this model, because it may happen that the data is not good enough for fitting it. Otherwise, if you get a good set of inlier points, you can simply use them and refine your model. We will elaborate this process in the next lecture; for the time being let us stop here.

Thank you for your attention.

Keywords: Least squares, line fitting, RANSAC.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 33
Feature Matching and Model Fitting Part – V

(Refer Slide Time: 00:19)

We are discussing the technique of random sample consensus. In the previous lecture, I gave a general idea of how this technique works while fitting a straight line. Now we will elaborate its algorithm with respect to line fitting.

(Refer Slide Time: 00:37)

The steps are as follows. As I mentioned, we have to perform the operation N times, each time choosing a set of points for fitting an initial model, and then, based on that initial model, perform refinement over the resulting set of inlier points. The first step is that you select s points uniformly at random.

In the previous example I chose two points, the minimum for fitting a straight line, but it could be more points. Then apply a fitting technique, for example any least squares estimation method which we discussed earlier, and then find the inliers to this line: any point whose distance from the line is less than a threshold value t is declared an inlier point. If you have a sufficient number of inlier points, for example at least another parameter d of them, then you should accept the line and refine your model by refitting it with those points only.

(Refer Slide Time: 01:51)

There are various parameters involved in this process, and they affect your estimation of the straight line parameters, or the model fit in general. First, the initial number of points: as I mentioned, you choose a small number of points, sampled randomly from your data. Typically you should choose at least the minimum number of points needed to fit the model.

Sometimes you may work with that minimum number of sample points and apply an exact model fit over those samples. Once you get a model, you have to compute the distance of the other points from your straight line. For the distance threshold t, you should choose the threshold in such a way that the probability that a true inlier point falls within that threshold is high.

For example, 0.95 could be your target probability. In that case, if I assume that the error in the data is zero-mean Gaussian noise with standard deviation σ, then the threshold should be 1.96σ; with this policy you can ensure that the probability of capturing an inlier is 0.95.

Similarly, for the number of trials, that is, how many times you should repeat these operations to get a model before terminating the process if you do not get any good set of inlier points: you have to choose this number in such a way that the probability that in at least one trial all the sampled points are inliers is very high. This depends on the data conditions; suppose you know that a certain fraction of your data consists of outliers and you have an estimate of that fraction, the outlier ratio, denoted by the symbol e here, which could be, say, 0.1 or 0.2.

Then the consensus set size d is also important: how many points declared as inliers are sufficient for accepting a model. That too is decided by the outlier ratio; it should match the expected number of inliers. That means, if a fraction e of the points are outliers, then there are (1 - e)N inlier points, and your d should be close to that value; it should not be significantly less than this expected number. These are certain rules of thumb by which the parameters are selected; here N is the number of data points.

(Refer Slide Time: 05:12)

One particular issue I will discuss in detail here is how to estimate the number of trials and how it is related to the outlier ratio e, which indicates the probability that a sample is an outlier. I can make a probabilistic analysis in this way.

Suppose we draw s samples and require that all of them be inliers. The probability that all s samples are inliers is (1 - e)^s, and the key relation we will arrive at is

(1 - (1 - e)^s)^N = 1 - p

Here, the probability that an individual sample is an inlier is (1 - e), and the probability that all s samples together are inliers is the product, (1 - e)^s. Consequently, the probability that there exists at least one outlier in your sample of s points is 1 - (1 - e)^s. We consider a trial unsuccessful if there exists at least one outlier in it, because minimally we need all the points of the sample to be inliers. So, the probability that one trial is unsuccessful, that is, that there is an outlier in the set of s points, is as shown here.

Then, for N trials, the probability that none of them is successful is the product of this probability N times, that is, this quantity raised to the power N; and 1 minus that gives the probability that there is at least one trial in which all the sampled points are inliers. We want that probability, the probability that at least one random trial is free from any outlier, to be a desired value p of the algorithm, which should be very high, say 0.99. So, N is related to that value in this fashion.

So, the probability of failing in all N trials is (1 - (1 - e)^s)^N = 1 - p. From this, the theoretical estimate of N can be found as

N = \frac{\log(1 - p)}{\log\left(1 - (1 - e)^s\right)}

given the data conditions, where the outlier ratio is known, and the other algorithm parameters are specified.

(Refer Slide Time: 07:59)

If we look at typical values of N for a very high probability of 0.99, we can see that as the sample size s varies, the number of trials also varies; if the outlier ratio is very small you need few trials, but if it is large you need many more. In fact, it rises exponentially with the rising fraction of outliers.

(Refer Slide Time: 08:33)

There are certain advantages and disadvantages of the RANSAC algorithm. It is very simple and it is general: as you can see, the technique can be extended to other model estimations, for example the homography matrix or the fundamental matrix. There too, when selecting a set of data points for fitting a model, you should apply this strategy of choosing a set of inlier data points by rejecting those points which are not well fitted by the initially estimated model.

Then look at the fraction of data points which are validated by the model, and refit it with the expanded set of inlier points. That is the general approach, and you can apply it to any other kind of model fitting on data points. It is applicable to many different problems and often works well in practice, but there are some disadvantages also.

First, as we have seen, there are many parameters to tune; I have already mentioned four. Tuning these parameters becomes a tricky and non-trivial job as their number increases, and you may not get a good initialisation of the model based on the minimum number of samples. Choosing the number of samples s is itself critical, and it influences the performance of the algorithm. Sometimes too many iterations are required, which means you have a lot of outliers and N could be a large number; that is why it is not appropriate for a low inlier ratio.

(Refer Slide Time: 10:21)

Let us now discuss another technique of line fitting, for the case where you have multiple lines. In this approach we will consider voting schemes for extracting a line from the data points, and in fact multiple instances of similar models from the same data points; in this case they are all straight lines.

The general idea of this voting scheme is to let each feature vote for all the models that are compatible with it. If a feature is noisy, its votes will not be consistent; only the votes from the less noisy features will be consistent, and the overall effect is that the most voted models will consistently describe the good set of points. Missing data can also be handled in this case, because although you do not get any votes from the missing data, you still get votes from the other data, and by accumulating these votes you may get a good model.

(Refer Slide Time: 11:40)

Let me describe this particular technique with respect to line fitting using the Hough transform, which follows this approach. Consider a set of points, shown here in the image space by circular dots; as we can understand, there is in fact a straight line passing through these dots.

What we are trying to do here is, for every point, to see what are the possible straight lines that could pass through that point. This means that if I go to the space of straight line representations, the parametric space of m and b, I ask which pairs of m and b describe straight lines passing through that point. To elaborate the concept, consider a particular point, say this one.

With respect to this point, there are straight lines passing through it, say described in the parametric space by (m1, b1), (m2, b2) and so on. In the parametric space, suppose this cell is m1 and this is b1; then this cell gets a vote for this point. Similarly, say this cell is m2 and this is b2; this also gets a vote. In this way, in the parametric space you find all possible combinations of m and b which provide straight lines through the point.

We will see mathematically how this can be computed, but this is the idea. You accumulate the votes from all the points in the set, and you get a distribution of votes in the parametric space. The parameter values where the votes are high are the possible descriptions of a model through these data points. So, this is the idea, and that is what we will be considering.

(Refer Slide Time: 13:55)

These are the operations needed when you perform this computation. First, we have to discretize the parameter space into bins, because the parameters come from a continuous domain, but in your computations you have to discretize them so that you can accumulate votes for each discretized cell.

Then, for each feature point in the image, put a vote in every bin in the parameter space that could have generated this point; that is what I explained here. Finally, we have to find the bins that have the most votes.

548
(Refer Slide Time: 14:33)

Let me explain this further, namely how a straight line in the image corresponds to a point in Hough space. If in the image you have a straight line, as shown here, y = m1x + b1, it can be represented exactly by the point (m1, b1) in the Hough space; we assume it corresponds to some discretized bin of (m, b).

Similarly, if I consider a point in the image, then in the Hough space it is represented by a line, because that is the relationship: if I consider the line in the Hough space with parameters b and m expressed as b = -x1 m + y1, we can see that these combinations of b and m give all the lines passing through (x1, y1).

With respect to our previous discussion, this means that any point on this line in the Hough space, say the values (m*, b*), corresponds to a straight line in the image space, and there exists a straight line equation y = m*x + b* which passes through the point (x1, y1). That is the implication of this particular analysis.

549
(Refer Slide Time: 16:03)

Consider another point (x2, y2); it also gives another straight line in the Hough parameter space, shown here in green for the green point and given by b = -x2 m + y2.

Now you can understand the importance of the voting. Observe that the intersection point of these two straight lines in the Hough parameter space is the common parameter pair of a straight line passing through both points, which geometrically can only be a single straight line.

This means that if I use the intersection of these two lines as the parameters of a straight line, then that straight line joins the two points, and that is how voting becomes important. If I collect the votes, this intersection point will have more votes, and since it has more votes you may choose that point as the solution. This is the general idea.

550
(Refer Slide Time: 17:05)

There are some problems with this parametric space representation. We are discretizing m and b, but the ranges of m and b are unbounded, particularly for m, and vertical lines require an infinite m; that is a problem. So, there are other representations of a straight line which are more convenient, like the polar representation, which is given in this form.

Here ρ is the perpendicular distance of the straight line from the origin, and you can write the line as

x \cos\theta + y \sin\theta = \rho

So, here you have the parameters θ and ρ instead of m and b, and the ranges of θ and ρ are finite: θ varies from 0 to 180 degrees for an image, and ρ varies from 0 to the length of the diagonal of the image grid. So, we have a finite range; discretization becomes easier within this finite range and you can design an algorithm.

551
(Refer Slide Time: 18:10)

The algorithm is as follows. Consider the discretized space of the parameters ρ and θ, represented by an array of cells; let us name this array the accumulator array.

You initialize the accumulator A to all 0s. Then, for each edge point (x, y) in the image, you increase the counts of those accumulator cells which denote a straight line passing through (x, y); in the (θ, ρ) space this is itself a curve, given in this form: for θ ranging from 0 to 180 degrees, ρ = x cos θ + y sin θ. In that way ρ can be found for every θ, and A(θ, ρ) is the cell whose value should be incremented, that is, accumulated.

Finally, you need to find the values of θ and ρ which are local maxima, because that shows there are more votes in those cells compared with their surroundings, and each such cell is one possible description of a straight line passing through some set of points in your image. In this way you can detect multiple lines passing through this set of data points. The detected line in the image is given by this equation.
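A minimal sketch of this accumulation (the bin counts, the signed-ρ convention and the toy points are illustrative assumptions):

```python
import numpy as np

def hough_lines(edge_points, img_diag, n_theta=180, n_rho=200):
    """Accumulate votes in the discretized (theta, rho) space for a set of edge
    points; peaks of the accumulator correspond to detected lines."""
    thetas = np.deg2rad(np.arange(n_theta) * 180.0 / n_theta)
    rho_bins = np.linspace(-img_diag, img_diag, n_rho)
    A = np.zeros((n_theta, n_rho), dtype=int)          # accumulator array, all zeros
    for x, y in edge_points:
        rhos = x * np.cos(thetas) + y * np.sin(thetas) # rho for every theta
        idx = np.digitize(rhos, rho_bins) - 1
        A[np.arange(n_theta), idx] += 1                # one vote per (theta, rho) cell
    return A, thetas, rho_bins

# Points on the line y = x, i.e. x cos(135 deg) + y sin(135 deg) = 0
pts = [(i, i) for i in range(20)]
A, thetas, rhos = hough_lines(pts, img_diag=30.0)
ti, ri = np.unravel_index(A.argmax(), A.shape)
print(np.rad2deg(thetas[ti]), rhos[ri])                # peak near theta = 135, rho = 0
```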

552
(Refer Slide Time: 19:50)

Here is an illustration of how these votes look. If you have a very precise straight line passing through all the points, a single straight line only, then there is only one point of very high brightness; the brightness value here shows the amount of voting. Since it is a single line, you expect a very sharp peak in the parametric space of the Hough transform, so you get a very sharp point there.

(Refer Slide Time: 20:22)

553
But if you have a more complicated image, that is, there are many lines, each will give a local maximum, since many votes will come from the points lying on each straight line. For example, the line shown in green corresponds to one of these peaks in this case.

(Refer Slide Time: 20:45)

If you have noise, then the sharpness of the local peak is lost; it gets blurred, and we only get approximate local maxima. There will be a cluster of local maxima in a small region; that is the nature of the accumulator in this case.

(Refer Slide Time: 21:03)

554
These are some typical situations, and one of the major issues in applying the Hough transform is how to deal with noise. You require an appropriate resolution of discretization of the grid: if the grid in the parametric space is too coarse, you get large votes in a cell because too many different lines fall into a single bucket; if it is too fine, you may miss lines, as some points that are not exactly collinear vote in different buckets.

(Refer Slide Time: 21:38)

So, you should use an appropriate resolution. Beyond this, you should smooth the accumulator array to reduce the effect of ripples, that is, of spurious local maxima, and sometimes you try to get rid of irrelevant features, the outliers which may cause problems. In that case you take only the significant edge points, that is, points with a strong gradient magnitude, and use them for the fitting.

555
(Refer Slide Time: 22:13)

Here we come to the end of this particular topic; let me summarise what we discussed under feature matching and model fitting. In feature matching we discussed different distance functions and similarity measures, and we also considered different matching policies: you can use a fixed threshold value to report the matching points within the neighbourhood of a point; you can choose precisely the nearest neighbour as the matching feature; or you can use the nearest neighbour with distance ratio policy, computing this measure from the query point to decide a matching feature point.

We discussed indexing and hashing to make the computation efficient; in particular, with these feature vectors you can exploit the geometry and use indexing schemes like the K-D tree, or, for hashing, locality sensitive hashing schemes. For comparing histograms we discussed some special distance functions other than the Lp norms over corresponding histogram bin values. We can use special measures like the Kullback-Leibler divergence, treating a histogram as a probability density function; this measure compares two distributions, and if they are close its value should be small.

We also discussed the earth mover's distance, which is a special distance function whose computation is quite involved.

556
(Refer Slide Time: 24:14)

In the model fitting approaches, we considered here only the problem of fitting straight lines to two-dimensional points, but we discussed some of the general issues of model fitting, such as the usefulness of prior knowledge of the model and the need to consider the goodness of fit when you fit a model. We discussed different techniques for line fitting: least squares, total least squares, the random sample consensus technique or RANSAC, and also the Hough transform technique. With this, let me conclude this lecture and this topic. Our next topic will be colour processing.

Thank you very much for your attention.

Keywords: RANSAC, outliers, hough transform, parametric space.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 34
Color Fundamentals and Processing Part – I

In this lecture, we will start a new topic on Color Fundamentals and Processing.

(Refer Slide Time: 00:21)

So, let us try to understand what is meant by color and how we perceive color. Color is a psychological property of our visual experience when we look at objects and lights; it is not a physical property of those objects or lights. It is the result of an interaction between the physical light in the environment and our visual system.

(Refer Slide Time: 00:47)

Consider the electromagnetic spectrum, the broad range starting from the very long wavelengths of radio waves up to gamma rays. You can find that there is a very narrow interval in this spectrum, at the higher-frequency end, which constitutes the visible spectrum; those are the frequencies of light, or colors, that we perceive.

It is shown here pictorially how these wavelengths correspond to different colors. You may recall from our school textbooks, when we learnt about these phenomena of color in different optical bands, the seven particular colors of the rainbow; from violet to red, the famous acronym VIBGYOR, you can identify those spots in this particular picture.

Interestingly, our sensitivity to this part of the spectrum is very high. If we see the curve of the luminance sensitivity function, which is a function of wavelength, we find that it is maximally sensitive in the green zone, as shown here, and as we move either towards the higher-frequency violet zone or the lower-frequency infrared zone, this luminance sensitivity function slowly decays.

(Refer Slide Time: 02:42)

The light which causes the sensation may not have a uniform energy distribution over all wavelengths. So, we consider what is called the relative spectral power, which is energy per unit time; for example, this could be the signature of a light source, where you have a relative spectral power distribution over the wavelengths. This is how a source of light is described: it is the amount of energy emitted per unit time, that is, the power, at each wavelength from 400 to 700 nanometres, which is the visible band of the electromagnetic spectrum.

(Refer Slide Time: 03:36)

Some examples of such light sources: a ruby laser is a very pure electromagnetic radiator at about 700 nanometres, which is in the red region, so its color is red. Other examples are a gallium phosphide crystal, or normal daylight, whose spectrum is more or less uniform, as you can see here.

(Refer Slide Time: 04:10)

One particular characterisation of a luminous source is the blackbody radiator, which is an ideal energy emitter in the sense that it only absorbs from the environment and has its own emission spectrum.

One example of constructing a blackbody could be as follows: take a hot body with near-zero albedo; that is why it is a blackbody, because if it does not reflect light from the environment, we cannot see it, and that is the concept of blackness in this case. The easiest way to build one is to take a hollow metal object with a tiny hole in it and look at the hole; the inside then behaves as an ideal blackbody radiator.

The spectral power distribution of a blackbody radiator is a simple function of temperature. The relationship for the energy emitted per unit wavelength is shown here; it is proportional to the following factor, where h is Planck's constant and k is the Boltzmann constant (they are not defined here). It shows the kind of functional relationship between the temperature and the wavelength, that is, the distribution of energy with respect to wavelength for a given temperature.

E(\lambda) \propto \frac{1}{\lambda^5}\left(\frac{1}{\exp\!\left(\frac{hc}{k\lambda T}\right) - 1}\right)

This leads to the notion of color temperature: if a light source has its own spectral power distribution, or energy emission distribution over the wavelengths, and it closely resembles the blackbody radiation of a particular temperature, we consider that source equivalent to a blackbody source of that color temperature. So, we can tag any luminant source with a color temperature, and from that alone we can get its spectral power distribution.

So, the color temperature of a source is the temperature of a blackbody that would look the same.

(Refer Slide Time: 06:56)

Let us now consider the phenomenon of reflection in the environment, because reflection is the major phenomenon by which we sense the environment and see objects; we have discussed this from the very beginning, when in our very first slide we considered reflected light and the projection of 3-dimensional scene points onto a 2-dimensional plane.

In our visual system too, we receive reflections from a 3-dimensional scene, and they fall onto the retina of our eyes, which I will discuss in the next slides. But the nature of that reflected energy also depends upon the kind of surface you have. Every surface absorbs a part of the incident energy and reflects the rest, and it does this differently at different wavelengths. So, the percentage of light reflected across wavelengths characterises a material property of the surface with respect to this particular task.

For example, you can see the picture of a tomato here, which has a red surface, so it is expected that the energy reflected from the surface points will have a high content of red wavelengths, that is, the longer wavelengths in the optical range form a large part of the reflected energy. For a banana, which is yellow, the spectrum looks like this: this is the percentage of light reflected at different wavelengths, and it has high reflectance within this zone; the yellow zone is very prominent, including the red zone, as you can see here.

These are blueberries, and naturally it is expected that the reflectance in the blue range of the visible spectrum will be high. This other object has a purple color, and there we can see that the reflectance in the middle interval is very low, whereas in the blue zone and also in the red zone it is quite high. It is interesting that you perceive purple, yet the reflected light has strong components in both the red and blue zones; there are explanations for this behaviour of our visual response.

(Refer Slide Time: 09:33)

Finally, as we can see, it is the interaction of the light source and the material property of the surface that results in our visual sensation; that is what we observe as colour. So, consider a light source with its relative energy distribution across wavelengths, and also consider the reflectance values, that is, the percentage of energy reflected from the surface points at different wavelengths.

For every wavelength, the energy that we receive in our visual system is simply the product of these two factors: the relative power of the illumination at that wavelength, multiplied by the corresponding reflectance value at that wavelength. That gives the relative energy of that particular wavelength in the received signal; that is how the received sensory signal, the exciting signal, is characterized.
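A minimal sketch of this wavelength-wise product (the illuminant and reflectance curves below are made-up illustrative values, not measured data):

```python
import numpy as np

wavelengths  = np.arange(400, 701, 10)                     # nm, visible band
illumination = np.ones_like(wavelengths, dtype=float)      # e.g. a flat, daylight-like source
reflectance  = np.where(wavelengths > 600, 0.8, 0.1)       # e.g. a reddish surface
received     = illumination * reflectance                  # relative energy reaching the eye

print(received[wavelengths == 650], received[wavelengths == 450])   # strong red, weak blue
```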

(Refer Slide Time: 11:13)

Now let us consider the function of the eye in our visual system; it acts like a camera, as you can see in its construction. There is a lens, which projects 3-dimensional points in the environment onto a 2-dimensional surface; it is not really a plane, it is more like a semi-spherical screen onto which the points are projected, and that is called the retina. From there the sensation is carried through our nervous system to the brain, which gives us the perception of color and the understanding of what is in front of us.

The components here have a very direct analogy with the components of a camera. You can see the lens in this part. Similarly, the retina acts like the image plane, a kind of screen, and it contains photoreceptor cells which transduce the incident energy into electrical signals. So, the retina consists of these photoreceptor cells, and from there the sensation is carried through our nervous system to the brain.

There are different kinds of photoreceptor cells in our retina; they are called rods and cones, and I will give more details in the next slides. There is also the iris, a colored annulus with radial muscles, and the pupil, the hole or aperture, which acts like the aperture of a camera through which light enters the system and passes through the lens; its size is controlled by the iris.

(Refer Slide Time: 13:34)

About the rods and cones: there are two types of receptors in the retina, and the shapes of a cone and a rod can be seen here. They are cells attached to the corresponding optic nerves, which carry the sensation to the brain through the nervous system along a dedicated pathway. Each cell is attached to the retinal surface, and the shape of this attachment gives rise to the nomenclature of these kinds of cells: since this one has a conical shape we call it a cone, and this one has a cylindrical shape, so we call it a rod.

Now, the distribution of these rods and cones is not uniform over the retina. You can see that the cones are highly concentrated in one place: the curve shown here is roughly the distribution of the cones, and the other curve is the distribution of the rods. The cones are very much concentrated in a particular region, a very small region of high visual acuity, which is called the fovea.

This region of the retina is called the fovea, and the visual angle it subtends is very narrow, about 1 to 2 degrees. Functionally, the rods are responsible for intensity and the cones are responsible for color, and the fovea, as I mentioned, is the small region of 1 or 2 degrees at the center of the visual field containing the highest density of cones.

This is how the structure is organised; in the periphery you have less visual acuity, as many rods are wired to the same neuron. That is the kind of structure.

(Refer Slide Time: 16:15)

Let us try to understand how rods and cones function in sensing light, and how their sensitivities vary. Rod vision is sensitive in a dimly illuminated environment, whereas cone vision is more sensitive in a bright, highly illuminated environment.

So, you can see from this figure that there is one zone corresponding to rod vision and another zone corresponding to cone vision. The scale shows the intensity of light reflected from objects, expressed in lamberts; the unit is the lambert. So, what is a lambert? One lambert is defined as 1/π candela per square centimetre, where the candela is in turn a unit of luminous intensity.

One candela is the luminous intensity, in a given direction, of a source that emits monochromatic radiation at a frequency of 540 terahertz and has a radiant intensity in that direction of 1/683 watt per steradian. So, if a source emits that amount of energy per steradian at that frequency, its luminous intensity in that direction is one candela; and if a surface emits 1/π candela per square centimetre, then the luminance of that surface is 1 lambert. So, that is the definition.
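As a quick worked conversion using only this definition, one lambert expressed in the more common SI unit of luminance, the candela per square metre, is

$1\ \text{lambert} = \tfrac{1}{\pi}\ \text{cd/cm}^2 = \tfrac{10^4}{\pi}\ \text{cd/m}^2 \approx 3183\ \text{cd/m}^2$,

so values quoted on this lambert scale can be compared directly with everyday luminance levels quoted in candela per square metre.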

And, high intensity object points are perceived through the cones, whereas low intensity scenes are perceived through the rods. Visual acuity is higher in the cone-sensitive zone, and that is why we are unable to read in low illumination: the visual acuity of rod vision is much less.

567
(Refer Slide Time: 18:50)

So, with this background, let us now understand the physiology of color vision. As we have seen, there are three kinds of cones in our visual system, and you can see three response curves here. One is shown as S, the short-wavelength response, which lies in the blue zone of the spectrum. There is a medium-wavelength response shown as M, and there is a long-wavelength response shown as L. So, the cones fall into three categories, and they have different wavelength responses.

And the relative numbers of long-, medium- and short-wavelength cones also vary widely. There are many more long-wavelength cones, which respond in the red zone: the ratio is roughly 10 for long-wavelength cones to 5 for medium-wavelength cones to 1 for short-wavelength cones, and there are almost no S cones in the center of the fovea. In fact, there is a nice picture which shows the distribution of the different types of cones as colored dots in a two-dimensional plot.

So, if we flatten the retinal surface near the fovea in particular, we can see that in the central part there are hardly any blue or short-wavelength cones; mostly there are red (long-wavelength) and green (medium-wavelength) cones, as we can see in this picture.

568
(Refer Slide Time: 20:35)

So, to model this color perception we can consider a simple mathematical model. Let us consider the incident energy at our retina, having a relative spectral power distribution across the wavelengths as shown here. The same stimulus is received by the three different categories of cones. We will be explaining color perception using cone vision only; we are not considering rod vision in perceiving colors.

So, color perception is a phenomenon due to our sensations through the cone cells, and there the same incident energy is received by the different kinds of cones, but they produce different responses because their wavelength responses are different. For example, this is the filter response of the long-wavelength cells, this is the filter response of the medium-wavelength cells, and this is the filter response of the short-wavelength cells.

So, in this simple model, if the received energy is denoted E(λ) and the filter response of the long-wavelength cones is L(λ), then the energy passed by a long-wavelength cone at wavelength λ is E(λ) L(λ). What is finally received, the output of the long-wavelength cone, is the integration of E(λ) L(λ) over λ, and that is the response from the long-wavelength cones:

569
$C_L = \int_\lambda E(\lambda)\, L(\lambda)\, d\lambda$

$C_M = \int_\lambda E(\lambda)\, M(\lambda)\, d\lambda$

$C_S = \int_\lambda E(\lambda)\, S(\lambda)\, d\lambda$

So, this is what we obtain from a long-wavelength cone; mathematically this is the operation that is carried out there.

Similarly, for the medium-wavelength cone, whose filter response over wavelength is M(λ), the output is the integral of E(λ) M(λ); and for the short-wavelength cone, in the same way, it is the integral of E(λ) S(λ).
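These integrals are exactly what one would evaluate numerically if the spectrum and the cone responses were sampled at discrete wavelengths. A minimal sketch of that computation in Python is given below; the wavelength range, the sample spacing and the Gaussian shapes used for E(λ), L(λ), M(λ) and S(λ) are illustrative assumptions, not the actual physiological curves.

import numpy as np

# Wavelength samples in nanometres; the 5 nm spacing is an arbitrary choice.
lam = np.arange(380, 781, 5, dtype=float)

def bump(center, width):
    # Crude Gaussian stand-in for a smooth spectral curve.
    return np.exp(-0.5 * ((lam - center) / width) ** 2)

# Assumed cone filter responses L, M, S and an assumed incident spectrum E.
L = bump(565, 45)
M = bump(540, 40)
S = bump(445, 25)
E = bump(600, 80)

# Discretized versions of C_L, C_M, C_S = integral over lambda of E(lambda) * response(lambda).
dlam = lam[1] - lam[0]
C_L = np.sum(E * L) * dlam
C_M = np.sum(E * M) * dlam
C_S = np.sum(E * S) * dlam

print(C_L, C_M, C_S)   # the three numbers that encode the color sensation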

So, now you can see that a particular energy distribution from the environment, received by all these cones, can finally be coded into three factors: one from the long-wavelength cones, another from the medium-wavelength cones, and another from the short-wavelength cones. Those sensations are carried to our brain through the nervous system, and they are perceived as a color.

So, finally, the perception of the color of a particular light can be considered the combined effect of these three sensations, and this actually leads to the trichromatic theory of color, which I will be describing next.

570
(Refer Slide Time: 25:08)

So, just to summarize once again: in this model of color perception, the entire spectrum of reflected energy from an object, or of the energy of an illuminant, is represented by three numbers, and I have explained how these numbers can be computed mathematically. This also tells us that different spectra can have the same representation, because integrating the product of the cone responses and the spectral power over the wavelengths may lead to the same triplet for two different spectra.

Such spectra become indistinguishable to us, and they are called metamers.
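In symbols, using the cone responses defined earlier, two spectra E1(λ) and E2(λ) are metamers precisely when all three of their integrated responses coincide even though the spectra themselves differ:

$\int_\lambda E_1(\lambda)\, L(\lambda)\, d\lambda = \int_\lambda E_2(\lambda)\, L(\lambda)\, d\lambda$, and similarly with $M(\lambda)$ and $S(\lambda)$, while $E_1(\lambda) \neq E_2(\lambda)$ at some wavelengths.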

571
(Refer Slide Time: 26:07)

So, we can see some examples of metamers here. Consider these relative power spectra for two different kinds of reflected light. One curve is the spectrum of a white petal, and the other, the pink curve, is that of a white flower; our sensation of both the white petal and the white flower is the same white sensation. But as you can see, there is a lot of variation in the distribution of relative power over the wavelengths between these two sources, yet we still perceive both as white, and this is an example of a metamer.

(Refer Slide Time: 27:01)

572
There could be another example: here one spectrum has a widely varying relative power over the wavelengths, whereas the other varies smoothly. But the final effect of integrating the product of our cone responses with these two power distributions may be the same, and that is how this could be another pair of metamers.

So, with this let me stop at this point for this lecture; we will continue this discussion of color perception. Thank you very much for your attention.

Keywords: Electromagnetic spectrum, black-body radiators, color perception, rods and cones.

573
Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 35
Color Fundamentals and Processing Part – II

We are discussing the fundamentals of color perception, and in the previous lecture we discussed how this perception results from the interaction of the energy received at our retina with the responses of the corresponding retinal cells. This could be factored into three numbers, which characterize the spectral distribution of the incident light in our visual system.

(Refer Slide Time: 00:59)

So, that is how we experience color. And our next objective is to standardize this experience; that means, to calibrate it with respect to known sources and then understand color perception under any arbitrary color source. For that, we need to do color matching experiments.

So, let me discuss what is meant by color matching and how these experiments are carried out. Here is an example of a setup for performing color matching. There are three primary sources of light of three different pure wavelengths, and we call them primaries because, as we

574
have seen, a color can finally be represented by a combination of three color components in the red, green and blue zones of the spectrum. So, we have chosen primary color sources in that way, and we consider a test light T which needs to be calibrated as a combination of these three primary sources.

The test light is, say, once again a pure monochromatic color source of a particular wavelength, and it is displayed on a screen. The screen is designed in this way: there is a circular zone, with a surrounding field to provide contrast; the test light is projected on one half, and the combination of red, green and blue is projected on the other half.

So, you can vary the relative strengths of the primary sources and note down in what proportion they are mixed and superimposed. That proportion itself can then represent the test light as a combination of these primary colors. This is known as color matching: you vary the proportions in such a way that there is no visible distinction between the two halves.

(Refer Slide Time: 03:38).

So, we can describe a hypothetical color matching experiment diagrammatically; it shows the same setup that I described earlier. This is the test light, and if I project it, it produces, for example, a color like this. Consider the three primary sources of red, green and blue, and suppose a certain combination produces this color. Say this is the

575
combination of the primary sources; it shows the relative proportions of the amounts of energy emitted by the primaries, and when superimposed they produce that color.

Now, if I vary one of them, for example p2, I can see by visual observation that this color moves closer to the test color, and a further increase may cause an exact match between the two. Then we say that this relative distribution of the three primary colors represents the particular test color. That is how we standardize the representation of this color with respect to these primaries through this experiment; these are the primary color amounts needed for a match. Now consider repeating this experiment for light sources of every wavelength.

(Refer Slide Time: 05:00)

There is another scenario that I should also explain here; we can call it the second type of this experiment. Here you can see the color of a test light of a particular wavelength, and again the color produced by the superposition of the primary colors in certain proportions.

576
(Refer Slide Time: 05:22)

(Refer Slide Time: 05:30)

Consider the proportion producing this color; as you can see, the amount of primary p2 is very small, and in fact if we reduce it further we find this color becoming nearer to the test color, but it is still not the same. What we can do then is, instead of projecting the primary p2 on this side, add it on the other side, on top of the test light.

577
(Refer Slide Time: 05:52)

If I add the primary p2 on the test-light side and these two colors become the same, it is as if we are subtracting p2, in that proportion, from the combination of p1 and p3; subtracting p2 from that mixture produces the same sensation.

So, we say a negative amount of p2 is needed to make this match. This is another kind of experiment, where you see that it is not only additive superposition but also negative matching, that is, addition on the other side; theoretically you can obtain the equivalent sensation in this way and record proportions like this. You can represent it in this fashion: p2 needed to be subtracted from the combination of p1 and p3. So, in the color matching experiments you note down all such proportions, including whether they are added or subtracted, for every wavelength; you get a chart for the different wavelengths, and that gives you the color matching chart.

578
(Refer Slide Time: 07:03)

So, by standardizing this chart, we can have a trichromatic representation of any color, which means a representation using three primary colors, and this feature of color representation is known as trichromacy.

It is the fact that most people can match any given light with three primaries. The primaries must be independent, which means that none of them can be produced from the other two by addition or subtraction in any proportion. This definition of independence is analogous to linear independence among vectors. And for the same light and the same primaries, most people select the same weights; that is also observed experimentally.

So, the uniqueness of the color representation across varying human visual perceptual systems is established through this kind of empirical study. Of course, there are exceptions in the case of color blindness, where the perception is different.

So, we are talking about normal vision, where, given the primaries, a color is uniquely represented by the same proportions. This gives us the trichromatic theory of color, which means that three numbers seem to be sufficient for encoding color; it is a very old theory, proposed by Thomas Young at the beginning of the nineteenth century.

579
(Refer Slide Time: 08:39)

And as we can see, a color space has a lot of similarity with a linear vector space. Grassmann proposed his laws of color matching using this property of linearity. The first law says that if two test lights can be matched with the same set of weights, then they match each other. Say there are three primaries P1, P2, P3, and the amounts needed to produce a color A are u1, u2, u3, so that the representation is

$A = u_1 P_1 + u_2 P_2 + u_3 P_3$.

If there is another color which is represented by the same proportions, the same values u1, u2, u3, then these two colors must be the same.

The second law says that if we mix two test lights, then mixing their matches will match the result; that is, if two colors A and B have their own proportions of the primaries and we superimpose them, the resulting color is represented by the sum of those proportions. So the property of linearity also holds under superposition of two colors.

The third law tells us that if we scale the test light, then the match gets scaled by the same amount; we see a similar color, but its intensity varies. These are the three laws of Grassmann, which tell us that a color can be represented in a linear space.
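Written compactly with the weight notation used above (this is only a restatement of the three laws just described), if a test light A is matched by the weights (u1, u2, u3) and another test light B by (v1, v2, v3), then

$A \equiv (u_1, u_2, u_3),\; B \equiv (v_1, v_2, v_3) \;\Rightarrow\; A + B \equiv (u_1 + v_1,\; u_2 + v_2,\; u_3 + v_3), \qquad kA \equiv (k u_1,\; k u_2,\; k u_3) \text{ for } k \ge 0.$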

580
(Refer Slide Time: 10:33)

So, that is how the concept of a linear color space arises: it is defined by a choice of three primaries. There can be different color spaces for different sets of three primaries, because we get different vectorial representations. RGB is one common choice, but other primaries are possible.

The coordinates of a color are given by the weights of the primaries used to match it; that is the representation, and for it we need the matching functions, that is, a standardization that gives, for every monochromatic light source, its three-vector coordinate representation.

Then we can apply the linear superposition principle. Suppose you would like to mix two colors: if we represent them by their three-vectors, then any convex combination of the two is a color represented by a point on the straight line joining them.

And if I consider three primaries, any combination of them can be represented by a point within this triangle. Here all the weights are positive and the sum of the weights equals 1; it is a convex combination we are considering, and therefore the point must lie within this triangle.

581
(Refer Slide Time: 12:16)

So, we have described the experimental technique of computing the color match for any set of primary colors. Now we will consider the representation of a color signal, because so far we have represented only light of a single monochromatic wavelength, that is, a pure color, by its three-vector.

But a general color signal can be considered a linear combination of different wavelengths in different proportions. For each wavelength we can read off its representation from the color chart; that is, for a wavelength λ, the three-vector gives the amount of primary p1, the amount of primary p2 and the amount of primary p3. So, we can measure the amount of each

primary needed to match a monochromatic light.

In this way, whatever relative amounts of energy are present in the spectrum of a light source, the corresponding three-vectors are mixed in the same proportions, and that represents the color signal. So, a color signal is also finally represented in a three-vector form.

582
(Refer Slide Time: 13:44)

And using this linear superposition principle we can compute the representation, if we know the standard three-vector representations of these wavelengths; that is what we will describe here. A monochromatic light of wavelength λ can be matched by amounts, say, c1(λ), c2(λ) and c3(λ) of the three primaries. Consider any spectral signal: it can be thought of as a linear combination of many such monochromatic lights, and this linear combination can be represented in terms of a vector.

So, t(λ1) is the amount of energy present at the wavelength λ1 in that light source, and consider wavelengths from λ1 to λN; we discretize the range of wavelengths and record the amount of energy at each wavelength present in that color signal. This is a representation of the color signal in terms of its spectral power distribution, discretized in this form.

Now, each wavelength λi is in turn represented by the amounts of the primary colors in the normalized matching functions, so each energy value should be multiplied by the corresponding c1(λi), c2(λi) and c3(λi).

583
(Refer Slide Time: 15:20)

So, we can consider a matrix C that stores the color matching functions, and a vector t that is the representation of the color signal; as I was mentioning, you multiply this matrix

by this vector. The matrix C is 3 × N, as you can see, and t is N × 1.

So, finally, what do you get? You get a 3 × 1 vector. For each wavelength you get a contribution to each component; for example, t(λ1) contributes c1(λ1) · t(λ1) to the c1 component, c2(λ1) · t(λ1) to the c2 component, and c3(λ1) · t(λ1) to the c3 component.

For every wavelength you find these contributions and then add them component-wise; that is simply matrix multiplication. In this way you can apply Grassmann's law to get the representation of a color signal in three-vector form, as three components.
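A minimal numerical sketch of this matrix product is given below; the matching functions and the spectrum are random placeholders, since only the shapes (3 × N and N × 1) matter for the illustration.

import numpy as np

N = 81                        # number of discretized wavelengths (an assumed value)
rng = np.random.default_rng(0)

C = rng.random((3, N))        # rows hold c1, c2, c3 sampled at each wavelength
t = rng.random(N)             # spectral power distribution of the color signal

tristimulus = C @ t           # 3-component representation of the signal
print(tristimulus.shape)      # (3,)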

584
(Refer Slide Time: 16:34)

So, we will now consider the production, or rendition, of colors in technology, and you can see that there are two different types of technologies available. As we have discussed, a color can be represented as a superposition of primary colors, but this superposition does not always mean additive mixing; there can also be a subtractive component.

The limitation is that even with today's technology a single device cannot have both properties: its principle is either additive mixing or subtractive mixing. So, there are two types of systems for color rendition: additive systems and subtractive systems.

This slide in particular shows an additive system, where red, green and blue are the primary colors. When these colors are added in different proportions, all of them positively, with no subtraction, you produce different kinds of colors: you can have yellow here, cyan here, and magenta here.

On the other hand, in a subtractive system the primary colors are cyan, magenta and yellow. A cyan pigment absorbs the red component of the incident light, a magenta pigment absorbs the green component, and a yellow pigment absorbs

585
the blue component; by combining these absorptions you can produce different kinds of colors.

So, this is the subtractive principle by which such colors are produced. An example of this kind of system is the printing system: when we print colors on white pages, we use the subtractive principle, whereas when we display colors on a computer monitor, any display screen or a cathode ray tube in a normal television, we work with the principle of additive superposition.

(Refer Slide Time: 19:10)

Now we can talk about different linear color spaces. The most common is the RGB color space, because this is the space in which cameras capture the color information of a point: we use a red filter, a green filter and a blue filter in the camera, applied to the optical energy incident on the lens. So you get only the energy in the red, green or blue zone of the spectrum accordingly, and these three components are captured independently for the same incident energy; that is how a color is represented.

That space is the natural red, green, blue or RGB space. The RGB matching functions are displayed here as functions of wavelength, and you can also see the wavelengths of the red, green and blue primaries marked here.

586
One thing you should observe here is that the matching is not additive for every wavelength. For example, the red matching function is negative for some wavelengths, that is, it is subtractive there. That is why, in the RGB color space, not every color can be produced by additive mixing alone.

This is the reason why an additive display cannot produce all colors: there are sets of colors which cannot be rendered, because they require subtraction, which is not possible with that technology. So, the thing to note is that for some wavelengths the matching is subtractive.

(Refer Slide Time: 21:20)

So, just to summarize, the RGB model can be represented in normalized form, where 1 is the highest intensity value and 0 is the lowest, and every color vector lies within that normalization. It is an additive model; an image consists of three bands, one for each primary color; and it is appropriate for image displays.

587
(Refer Slide Time: 21:49)

In the other, subtractive, model the primary colors are cyan, magenta and yellow, and you can see that they are the complementary colors of red, green and blue respectively; the complementation is expressed as

$\begin{bmatrix} C \\ M \\ Y \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} - \begin{bmatrix} R \\ G \\ B \end{bmatrix}$

So, if the red, green and blue components are normalized to the range 0 to 1, the corresponding cyan, magenta and yellow components of that color in the subtractive model can be obtained from them. Because in the subtractive model, producing this amount of cyan means that this amount of red will be absorbed.
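A small sketch of this complementation for normalized values is shown below; it is purely illustrative, and real printing pipelines additionally use a black (K) channel and device profiles, which are not covered here.

import numpy as np

def rgb_to_cmy(rgb):
    # rgb: three values in [0, 1]; the subtractive components are CMY = 1 - RGB
    return 1.0 - np.asarray(rgb, dtype=float)

def cmy_to_rgb(cmy):
    # the complementation is its own inverse
    return 1.0 - np.asarray(cmy, dtype=float)

print(rgb_to_cmy([1.0, 0.0, 0.0]))   # pure red -> [0. 1. 1.]: no cyan, full magenta and yellow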

588
(Refer Slide Time: 22:33)

The CMY model is appropriate for printing on paper. There is another standard color space, a chromaticity model, which has been provided by an international standards body for color known as the Commission Internationale de l'Eclairage; this is the French name, and my pronunciation may not be correct.

In short we call it the CIE, and I will refer to it by that name. This body standardized the space in 1931, and it defines three standard primary colors. The objective is that in this space you do not require any subtraction; you can produce every color using addition alone. The catch is that it is a hypothetical space: you cannot have physical light sources which produce these primary colors, but mathematically we consider them as three hypothetical primaries. In this case the Y component was chosen such that its color matching function matches the combined luminance response of the human cones.

$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} 0.6067 & 0.1736 & 0.2001 \\ 0.2988 & 0.5868 & 0.1143 \\ 0.0000 & 0.0661 & 1.1149 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix}$

So, this is a linear transformation that takes any vector in RGB space to X, Y, Z, which are the new primaries. The first row gives the combination of R, G, B that produces X, the second row the combination that produces Y, and the third row the combination that

589
produces Z. As you can see, one of the coefficients is greater than 1, which is not physically realizable because we have normalized all RGB components to the range 0 to 1.

So, these primary colors are not physically realizable, but mathematically the transformation is well defined. The advantage of this transformation is that if you take the same matching functions obtained from the RGB matching experiments and transform them using this matrix, you will find that all of them become positive.

Even though some of the RGB matching function values are negative in our experiments, after the transformation all of them become positive. You should note that this transformation is invertible: just as you can transform an RGB vector to the XYZ space, which we call the CIE XYZ space or simply the XYZ color space, you can transform it back to RGB space.

$\begin{bmatrix} R \\ G \\ B \end{bmatrix} = \begin{bmatrix} 1.9107 & -0.5326 & -0.2883 \\ -0.9843 & 1.9984 & 0.0283 \\ 0.0583 & -0.1185 & 0.8986 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}$

So, the inverse transformation is shown above: if I multiply this matrix with the XYZ vector, I get back the corresponding RGB components. Let me take a break here for this lecture; we will continue this topic of color fundamentals, and later processing of color images, in our subsequent lectures.

Thank you very much for your attention.

590
Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 36
Color Fundamentals and Processing (Part III)

We are discussing the trichromatic representation of a color according to our perception, that is, the three-vector form of a color. In the previous lecture, we discussed how color can be represented in its natural RGB space, natural in the sense that we capture the color information in these three components, red, green and blue; but the problem is that not all possible colors can be produced by an additive mixture of these three components. So, we want a convenient representation of color where all the vectors have positive components.

For this there is a hypothetical color space designed by the international color body, known by the acronym CIE, and that space is called the XYZ space. Using the transformation recommended by that body, you can convert any color represented in RGB space to the XYZ space, and since it is a linear transformation, this is also a linear color space.

(Refer Slide Time: 01:41)

591
So, we call this space the CIE XYZ color space, and here, as I mentioned, the primaries are imaginary but the matching functions are everywhere positive; that is the advantage we have. If I again plot those matching functions across the wavelengths, we can see that all the values are positive; this curve represents X(λ), this one Y(λ), and this one Z(λ).

Now, we can also consider a two-dimensional representation of color, because, as we saw using Grassmann's law, if I scale the vector I still get the same sensation of color, only its intensity changes. So, if I can separate out the intensity from the other quantities, I can still represent the color.

If we normalize the color representation with respect to intensity, the normalized representation can be shown in a two-dimensional space; that is the motivation for representing color in two dimensions. The normalization of the XYZ components is done as x = X/(X + Y + Z) and y = Y/(X + Y + Z), giving the normalized x and normalized y.

So, here I am showing you a normalized representation of all colors, converted to their respective x, y, z coordinates for the given wavelengths, and they can be represented in this two-dimensional space. We call this the normalized x-y space, and each point represents a color; the colored dots indicate the color at the corresponding coordinates.

This space is called the CIE x-y chromaticity space. The intensity information is missing here; we are interested only in the relative values of two components, which are sufficient to identify a color uniquely in this space. So, this is the normalized representation.

592
(Refer Slide Time: 04:13)

So, to understand this representation, we follow a model which we call the CIE chromaticity model. As I mentioned, it uses the normalized x, y, z, and since their sum equals 1, it is sufficient to represent any point using x and y. In this model the pure colors lie on the boundary; this boundary represents the pure spectral colors. In fact, this end is the red zone at 700 nanometres, and as you proceed along the boundary the wavelength decreases, which means the frequency of the electromagnetic wave increases, passing through green to the blue zone, where the wavelength is about 360 nanometres. So, these are the pure colors.

In between lie the other, mixed colors, which we will define again. The white point is where all the components have equal magnitude; this point is (1/3, 1/3, 1/3) in the three-dimensional form, but in the two-dimensional form it is simply (1/3, 1/3).

In this diagram we are also showing an interesting curve: along it the color sensation remains almost the same, and only the purity of the color varies. It is a locus of nearly the same chromatic sensation as the pure color shown at the

593
periphery; even as the point moves along this curve, that hue sensation is carried over, and only the whiteness of the color differs.

(Refer Slide Time: 06:13)

So, this is called the spectral locus of monochromatic lights. It shows how light sources of different wavelengths are represented here. If I consider the representation of sunlight, the sun at noon is equivalent to a blackbody radiator of about 4870 Kelvin, which is almost a white light source; but at sunrise or sunset it becomes more reddish, and you can see that trend captured in this particular curve.

594
(Refer Slide Time: 06:57)

So, the CIE chromaticity chart provides a simple representation of color using two particular components called saturation and hue; I will define them. Any color which is achromatic, meaning it has no hue and gives only a white sensation, is represented, as I mentioned, by the point (1/3, 1/3); whereas the saturated colors, meaning the pure spectral colors, lie on the boundary.

(Refer Slide Time: 07:37)

595
So, if I consider a straight line from the white point to any point on the periphery, then, in this simple model, every point on that line represents the same hue, the one given by that boundary wavelength, while only the whiteness of the color varies: as you move towards the periphery the color becomes more pure, which means the whiteness is lessened, and as it moves inward the whiteness increases. This purity is called saturation.

So, less whiteness in the color means it is more saturated. Hue, in this representation, is given by the direction from the white point towards a particular point on the periphery; we can also represent it by an angle with respect to a reference direction.

(Refer Slide Time: 08:35)

Saturation, on the other hand, is the relative proportion of two lengths: how far the point is from the white point, compared with the total length of the radial segment from the white point to the periphery of the chromaticity chart. In other words, how far the point is shifted towards the spectral color. This ratio a/b gives the saturation, and as the point moves outward the ratio increases.

596
On this scale, saturation varies from 0 to 1 under this definition; the definition may vary, and this is one simple form of it. A value equal to 1 implies a spectral color with maximum saturation.

(Refer Slide Time: 09:21)

So, as I was mentioning, color reproducibility depends upon the choice of primaries, because once you choose a set of primaries we can represent them as three points in the x-y chromaticity chart itself. These form a triangle, and any additive combination of the primaries is represented by a point in the interior of this triangle or on its edges. So, this triangle is the limit of color representation by that set of primaries; that is the fact described here.

If I project the 3D color gamut, I get this triangle, which we also call the 2D color gamut. The color gamut is the set of reproducible colors for a given set of primaries. This set can be larger if the primaries are more saturated, because more saturated primaries move towards the periphery, and you can see that the area of the triangle increases in that case.

597
(Refer Slide Time: 10:45)

So, various primaries are used, which broadens the color gamut available for displaying colors. In fact, there are some kinds of displays where you can have more than three primaries; in that case you no longer have an independent set of primaries, but a set of mutually dependent color sources whose combinations are considered, and in that way you can increase the range of color reproducibility.

One such example is shown here: a multiple laser source DLP projection system where you have multiple primaries. Strictly, we should not call them primary colors in that sense; we should call them color sources whose linear combinations produce the other colors, and these vertices show those source colors.

598
(Refer Slide Time: 11:49)

So, one of the interesting operations from the processing point of view is the saturation and desaturation operation. Let us understand this operation. Given a color image, the color of any pixel can be represented by a point which lies within the color gamut, where the primaries of the corresponding color system are the points shown here as R, G and B.

Say p is the true color, the color of a pixel of the source image. Using our model, I draw a line connecting the white point W, which is at the coordinate (1/3, 1/3), or (0.33, 0.33) as shown here, to the point p and extend it. Where it intersects the edge of the color gamut is the maximum possible reproduction of a color of the same hue in the same direction; this is the maximally saturated color in this representation.

We can compute this point and then convert it back to RGB to get the corresponding RGB components; in this way you get the maximally saturated color. Moreover, any point on this straight line represents a color having the same hue, that is, the hue corresponding to the wavelength in that direction.

599
There is another operation called desaturation, because saturation may not always give a pleasing sensation; it is not true that more saturated colors are always more pleasing to our perception. We feel comfortable with the right balance of color saturation, as we are accustomed to it from perceiving natural scenes. So, sometimes it is required to reduce the saturation and move inwards towards the white point along the same line; that is called the desaturation operation.

In this diagram, p is the color of a pixel; if I connect W and p and extend the line, where it intersects the edge BG is the maximal saturation point, and if I move inward I get the desaturation point d. There is also a circle shown here; before explaining the algorithm for saturation and desaturation, note that the colors within this circle are mostly whitish, so it is not visually pleasing to disturb these points.

So, for the processing of saturation and desaturation, we often leave out the points which are very close to white. With some empirically chosen threshold, we model this as a circle of threshold radius around the point W and leave those points out of the processing.

(Refer Slide Time: 15:33)

This operation of saturation and desaturation is used for enhancing color images. Before describing that algorithm, let me explain the desaturation process.

600
There is a work which proposed an intuitive way of obtaining the desaturated point as a weighted combination. Given the white point W and a color point S in the x-y chromaticity space, you get a desaturation point between them by following a property called the centre of gravity law: it is a weighted combination in which the weights are $Y_W / y_W$ for the white point and $Y_S / y_S$ for the color point.

Multiplying the chromaticity coordinates by these weights and taking the weighted average gives $x_d$, and similarly $y_d$:

$x_d = \dfrac{(Y_W / y_W)\, x_W + (Y_S / y_S)\, x_S}{Y_W / y_W + Y_S / y_S}, \qquad y_d = \dfrac{(Y_W / y_W)\, y_W + (Y_S / y_S)\, y_S}{Y_W / y_W + Y_S / y_S} = \dfrac{Y_W + Y_S}{Y_W / y_W + Y_S / y_S}$

Note that the numerator term $(Y_W / y_W)\, y_W$ is nothing but $Y_W$, and likewise $(Y_S / y_S)\, y_S$ is $Y_S$.

So, you get the weighted combination $(x_d, y_d)$, whereas the intensity part $Y_d$ is defined as the sum of the two luminances, $Y_d = Y_W + Y_S$; and the luminance assigned to the white point can be estimated as a fraction k of the average luminance of the image, $Y_W = k\, Y_{avg}$, where k is chosen empirically; for example, you can choose k equal to 1.
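A small sketch of this desaturation step is given below. It assumes the chromaticity (xs, ys) and luminance Ys of the color are already available (the conversion from RGB to x-y is as described earlier), and it simply applies the centre of gravity law with the weights quoted above; the variable names are mine, not from the slides.

def desaturate(xs, ys, Ys, xw=1/3, yw=1/3, k=1.0, Yavg=1.0):
    # Mix a color (xs, ys, Ys) with the white point (xw, yw) by the centre of gravity law.
    Yw = k * Yavg                 # luminance assigned to the white point
    w_white = Yw / yw             # weight of the white point
    w_color = Ys / ys             # weight of the color point
    xd = (w_white * xw + w_color * xs) / (w_white + w_color)
    yd = (w_white * yw + w_color * ys) / (w_white + w_color)
    Yd = Yw + Ys                  # luminance of the mixture
    return xd, yd, Yd

# Example: pull a fairly saturated chromaticity towards white
print(desaturate(xs=0.55, ys=0.35, Ys=0.6, k=1.0, Yavg=0.5))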

(Refer Slide Time: 17:21)

One example of the saturation process is shown here. There is an image which we named Alps, because it is an image of a particular mountain peak in the Alps, and this is the original image. If I plot the colors of its

601
pixels in the x-y chromaticity chart, you can see this distribution, and from the distribution itself the triangular gamut is perceptible, because all the points lie within the triangle of the primaries, whose values have been assumed from the properties of the display.

(Refer Slide Time: 18:09)

If I saturate the image, leaving out the near-white points as I mentioned and saturating the other points, we find that they now lie at the edges of the gamut, and this is the saturated image. Comparing with the previous one, you can see that some of the reddish points have become more reddish in the saturated image.

602
(Refer Slide Time: 18:39)

And following the desaturation operation using that centre of gravity law, you get a desaturated image like this, with the distribution of colors in the chromaticity chart shown in this form.

(Refer Slide Time: 18:55)

That was just desaturation; if you first perform saturation and then desaturation, it looks like this.

603
(Refer Slide Time: 19:05)

$Y_W = k\, Y_{avg}$

There are some variations you can make. Recall that the luminance at the white point is $k\, Y_{avg}$; k is a multiplication factor applied to the $Y_{avg}$ value, and that gives $Y_W$. This is how the intensity at the white point is determined.

If this component is negative, that means you do not take its modulus; you are in fact subtracting it. So, if I use a negative k, the definition has to be modified so that it does not take the modulus, and then we get these kinds of images. This is the effect.

604
(Refer Slide Time: 20:13)

1 1
If I shift the white instead of  ,  as the white point, the same process we carry out
3 3
same computation you carry out by considering your white point is (0.5, 0.2), which is
towards say reddish zone or and then now you can see that the effect, effect of reddish.

(Refer Slide Time: 20:37)

Similarly, if I shift the white point towards the greenish zone, this is the effect, and towards the bluish zone, this is the effect.

605
(Refer Slide Time: 20:45)

So, different effects are possible by varying this white point.

(Refer Slide Time: 20:57)

So, this is one example of processing color images using the normalized CIE chromaticity chart. Let me now discuss one exercise on computing this chromaticity point; it will show you the computational steps, so try to solve this problem. The problem statement is as follows: consider the following transformation matrix.

606
This is the RGB to XYZ transformation matrix; given any RGB color you can convert it to XYZ. First, given the color value (100, 80, 200) in RGB space, compute its corresponding point in the normalized x-y chromaticity space. This is what we would like to solve.

(Refer Slide Time: 21:59)

This is very straightforward: you simply multiply the RGB vector by the transformation matrix to get the corresponding XYZ vector, and then normalize using the definition. These are the X, Y, Z components, and the normalized x is X divided by the sum of the components, similarly for y. So, you get x = 0.2864 and, similarly, y = 0.2134. This solves the first part of the problem.
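The computation can be sketched as below. The matrix M is a placeholder for the transformation matrix given in the problem, which is not reproduced in this text; the CIE matrix quoted in the previous lecture is used here only as a stand-in, so with the problem's actual matrix the numbers work out to the quoted (0.2864, 0.2134).

import numpy as np

# Stand-in RGB-to-XYZ matrix; replace with the matrix given in the problem statement.
M = np.array([[0.6067, 0.1736, 0.2001],
              [0.2988, 0.5868, 0.1143],
              [0.0000, 0.0661, 1.1149]])

rgb = np.array([100.0, 80.0, 200.0])

XYZ = M @ rgb                                    # step 1: linear transformation to XYZ
x, y = XYZ[0] / XYZ.sum(), XYZ[1] / XYZ.sum()    # step 2: normalize to chromaticity
print(x, y)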

607
(Refer Slide Time: 22:39)

Next, we need to find the maximally saturated point of that particular color, for which we need the vertices of the triangular gamut. Let me read the problem out once again: given the coordinates, in the normalized x-y chromaticity space, of the three primary colors as (2/3, 1/3) for R, (1/5, 3/4) for G, and (1/6, 1/10) for B,

we will use these three points in our discussion. What we need to compute is the corresponding maximally saturated color in the RGB space, preserving the same hue and intensity, for the point obtained above.

608
(Refer Slide Time: 23:35)

To perform this operation, this figure explains what should be done. You have the red, green and blue points in the chromaticity chart, which specify the triangular gamut; consider the white point w at (1/3, 1/3) and the point q which you have already computed for that color vector in the x-y chromaticity chart. You need to find the maximally saturated point, which lies where the ray from w through q meets a triangular edge.

Now, it may be difficult to say in advance which edge the ray meets: the line may intersect one edge at a point that actually lies outside the gamut triangle, or it may intersect another edge. So, you need to consider all these options, and you have to find in which direction the point q lies, that is, consider the corresponding angle, and only then choose the edge in that direction for the maximally saturated point.

So, there are several options. In our case we will consider only the intersections with two of the edges, because by looking at the coordinate positions we can identify these as the two possible candidates, and we will solve it in that fashion. We have to find the point of intersection between the line formed by an edge of the triangle and the line wq. So, let us proceed with this computation.

609
You should note that whatever intersection you obtain, you have to check it, as I mentioned, before declaring it the maximally saturated point.

(Refer Slide Time: 25:49)

For computing this intersection we will use projective space concepts: we have already studied how to compute the intersection of lines in a two-dimensional coordinate space by extending it to the projective space representation. First, check for the intersection of BG and wq. Let me draw it once again: we have the CIE chromaticity chart, the gamut triangle with vertices R, G and B, the white point w, and the point q.

We are trying to find the intersection of the line wq with the corresponding triangular edges. With respect to BG, the line can be computed as the cross product of the two vertex vectors. The point B is (1/6, 1/10), which in the homogeneous coordinate system is (1/6, 1/10, 1), and G is (1/5, 3/4), that is, (1/5, 3/4, 1). Taking their cross product gives the equation of the line BG.

Similarly, I get the equation of wq: w is (1/3, 1/3), whose homogeneous representation is (1/3, 1/3, 1), and q is (0.2864, 0.2134), as

610
we computed earlier, with homogeneous representation (0.2864, 0.2134, 1). Taking the cross product of these two gives the vector representing the straight line wq.

(Refer Slide Time: 28:55)

Now, the cross product of the lines BG and wq gives their point of intersection, again in homogeneous coordinate representation; in non-homogeneous form it is this point. You can observe that one coordinate is negative, whereas, as you know, all chromaticity coordinates must be positive and lie in the range 0 to 1. So, this point is not within the x-y chromaticity space.

611
(Refer Slide Time: 29:29)

So, we should not consider that point. Now we consider the other edge, BR, and find its intersection with wq. We have already computed wq, shown here, and BR is obtained in the same way, as the cross product of the two homogeneous points representing B and R; that gives the straight line in the projective space.

Now, the cross product of the two lines BR and wq gives the intersection point. In the homogeneous coordinate representation it is this, and in the non-homogeneous coordinate representation it is (0.2592, 0.1429). Finally, this is the answer: the maximally saturated point in x-y. This is an example of how the maximally saturated point can be computed using the x-y chromaticity chart, given any point in the RGB color space.
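The whole line-intersection step can be sketched with a few cross products, exactly as in the projective geometry lectures earlier in this course. The numbers below are the ones used in this exercise, and with them the code reproduces approximately (0.259, 0.143), matching the quoted result up to rounding.

import numpy as np

def to_h(p):
    # lift a 2D chromaticity point to homogeneous coordinates
    return np.array([p[0], p[1], 1.0])

def dehomogenize(p):
    return p[:2] / p[2]

w = to_h((1/3, 1/3))           # white point
q = to_h((0.2864, 0.2134))     # chromaticity of the given color
B = to_h((1/6, 1/10))
R = to_h((2/3, 1/3))

line_wq = np.cross(w, q)       # line through w and q
line_BR = np.cross(B, R)       # edge BR of the gamut triangle

p = np.cross(line_wq, line_BR) # intersection of the two lines
print(dehomogenize(p))         # approximately [0.259  0.143]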

So, with this, let me stop here. We will continue this discussion of Color Fundamentals and Processing in our subsequent lectures.

Thank you very much for your attention.

612
Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 37
Color Fundamentals and Processing (Part IV)

(Refer Slide Time: 00:30)

We continue our discussion on Color Fundamentals and Processing of color images. In the last few lectures I have discussed how colors can be represented as a trichromatic vector, and how the perceptual attributes of color, hue and saturation, can be obtained from the calibrated chart proposed by the international body of color, the CIE. From the CIE chromaticity chart you can represent a color in the normalized x-y chromaticity space.

As you can see in this figure, the colors are shown in a semi-elliptical region whose peripheral curve marks the different wavelengths. For example, this point is 700 nanometres, this point is 520 nanometres, and this point is 380 nanometres; as you traverse the periphery from the first point to the last, the wavelengths decrease in this order. Each wavelength corresponds to a particular pure color, the one we sense when electromagnetic energy of that wavelength is received.

613
In the chromaticity space, any color represented as a superposition of three primary colors, in particular when captured in terms of red, green and blue, can be converted to a point in the two-dimensional chromaticity space we have discussed, the x-y chromaticity space. For example, this is the x axis of that space and this is the y axis, and the color at each point is also shown.

That is how this visualization has been created: the graphics also show the color of each chromaticity point. You can see that there are large regions of almost similar color sensation, and the regions within which colors are almost indistinguishable can be approximated by ellipses.

One example of such an ellipse, a larger one, is shown here in this region; I am drawing it just for your convenience. Centred on this point, all the chromaticity points covered by the ellipse generate the same sensation. These ellipses are known as MacAdam ellipses, and they are also called just-noticeable-difference ellipses, or JND ellipses, which is a very common acronym.

So, for points within a JND ellipse, the perception of color is almost the same in our human visual system. Now, the problem in this x-y space is that the sizes of these ellipses are non-uniform: you can see that the ellipse at this location is quite large, whereas near the red region the ellipses shown are very small. That is a problem with the representation of color in the x-y chromaticity space.

So, there is a proposal for another color space in which the sizes of these ellipses are more or less uniform. They will not be exactly uniform everywhere, but we try to minimize the variation in their sizes, and this color space is called a uniform color space; the adjective uniform comes from the fact that we are trying to make those sizes uniform by applying a transformation to the (x, y) points themselves. So, this is one example of such a transformation; just to make it clear, let me rub out these points.
You can see the transformation I have displayed here. This space is called the u'v' space, or UV in short, and it is a uniform color space. Again it is a chromaticity space,

614
representing only the chromatic components, and the transformation from the x, y, z coordinates of the color is shown here.
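For reference, the standard CIE 1976 u'v' chromaticity transformation, which is presumably the one shown on the slide, is the projective mapping

$u' = \dfrac{4X}{X + 15Y + 3Z}, \qquad v' = \dfrac{9Y}{X + 15Y + 3Z}$,

or equivalently, in terms of the normalized chromaticities, $u' = \dfrac{4x}{-2x + 12y + 3}$ and $v' = \dfrac{9y}{-2x + 12y + 3}$.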

(Refer Slide Time: 06:13)

So, the point to note here is that it is a projective transformation of (x, y) that makes the ellipses more uniform; consider this diagram. Here you can see that the differences between the sizes of the ellipses in this zone, which corresponds to the green zone, and those in the red or blue zones are reduced dramatically. So, this space is now more suitable when we are trying to compare colors in terms of defining certain distances.

They are still ellipses and not circles, so the Euclidean distance is not entirely appropriate in this space. But you can consider that the neighbourhoods are at least of similar size, so they would be more appropriate, or more effective, for distinguishing colours when they are represented in this space.

(Refer Slide Time: 07:24)

So, let us continue this discussion on color representation and see what other spaces are there. With respect to the discrimination of colours there is another very effective space, also proposed by the international color body CIE, and this space is called the Lab space; we also write it as the L*a*b* space. Here distances are more meaningful: the Euclidean distance is more effective because the just-distinguishable ellipses, if converted into the "ab" space, that is the "ab" chromatic space, become more circular in shape.

So, in this space we have one luminance channel and two color channels a and b; I refer to them as a and b, and they are sometimes written as a* and b*. The transformation is also shown here, you can see it in this part. So, we have L*a*b*, and you can see it is a non-linear transformation which is again defined using the x y z representation of the colour. Note that all the x y z values are normalized by the respective values of a reference white.

So for example, the reference white is represented as X_n, Y_n, Z_n. Usually white is the color which, in the x y chromaticity chart, corresponds to the point near (1/3, 1/3) in the (x, y) coordinates. Almost all the components of x, y, z would be equal, but they may not be exactly equal, so this is a kind of calibration you need to do with respect to your system.

So, find out what the definition of white is in your system; it is a kind of white balancing operation in the color representation. Once you have these values, you normalize the x y z coordinates as X/X_n, Y/Y_n, Z/Z_n, and then use this function to derive L*a*b*. In this model the color differences that we perceive correspond to Euclidean distances in CIE Lab. So, it is very effective, and the problem of distinguishing colors, which remains even in the uniform color space, becomes much reduced; we can mitigate that problem to a great extent if we use L*a*b* for this purpose.

Just note how the a axis extends, as shown in the figure. This is the a axis; this is the negative direction and this is the positive direction. So, −a is green, this green portion, and +a is red. Similarly for the b axis: +b is yellow, this part is yellow, and −b is blue.

So, blue to yellow and green to red, that is the kind of variation of hue in this space along these principal axes, and along the L axis, which is shown here as a vertical line, it is the intensity that varies along that axis.
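To make the computation concrete, here is a minimal Python sketch of the XYZ to L*a*b* conversion discussed above. The D65 reference white used as the default and the cube-root/linear split are assumptions following the commonly quoted CIE definition; calibrate the reference white for your own system as mentioned.

```python
import numpy as np

def xyz_to_lab(X, Y, Z, white=(0.9505, 1.0, 1.089)):
    """Convert CIE XYZ values to CIE L*a*b* (D65 white assumed by default)."""
    Xn, Yn, Zn = white

    def f(t):
        # Cube root above a small threshold, linear portion below it,
        # as in the standard CIE definition.
        delta = 6.0 / 29.0
        return np.where(t > delta ** 3,
                        np.cbrt(t),
                        t / (3.0 * delta ** 2) + 4.0 / 29.0)

    fx, fy, fz = f(X / Xn), f(Y / Yn), f(Z / Zn)
    L = 116.0 * fy - 16.0   # lightness channel
    a = 500.0 * (fx - fy)   # green (-) to red (+) axis
    b = 200.0 * (fy - fz)   # blue (-) to yellow (+) axis
    return L, a, b

# Mid-grey under the reference white maps to L* of about 76 with a* = b* = 0.
print(xyz_to_lab(0.9505 * 0.5, 0.5, 1.089 * 0.5))
```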

(Refer Slide Time: 11:01)

There is another color space called the YIQ color space, and its respective transformation is shown here; it is a linear transformation, and it has the property of better compressibility of information. Usually the chromatic components I and Q require less bandwidth to represent the color variations. So, if you consider a color television signal, and the color is converted from RGB into this form, then the I and Q components of that signal will require less bandwidth.

That is why this particular conversion is used in color television signalling, and it can also be used in digital television signals. The luminance Y is encoded using more bits because it requires more bandwidth, whereas the chrominance values I and Q require less bandwidth. The idea is that humans are more sensitive to the intensity Y. In fact, a black and white television receiver can simply use Y, and the signalling is encoded such that the presence of I and Q does not affect the black and white receiver; but that is a different point which we are not going to elaborate here. So, the fact is that the luminance alone is used by black and white TVs and all three values are used by color TVs. Thus YIQ is also a very good model for transmitting information by factorizing it into these three channels.
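As a rough illustration, the RGB to YIQ conversion can be written as a single matrix multiplication. The coefficients below are the ones commonly quoted for the NTSC system and may differ slightly from the matrix on the slide; treat them as approximate.

```python
import numpy as np

# Approximate NTSC RGB -> YIQ matrix (commonly quoted coefficients).
RGB_TO_YIQ = np.array([[0.299,  0.587,  0.114],
                       [0.596, -0.274, -0.322],
                       [0.211, -0.523,  0.312]])

def rgb_to_yiq(rgb):
    """rgb: array of shape (..., 3) with values in [0, 1]."""
    return rgb @ RGB_TO_YIQ.T

print(rgb_to_yiq(np.array([1.0, 1.0, 1.0])))  # white -> Y = 1, I and Q about 0
```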

(Refer Slide Time: 12:52)

The YCbCr space we might have already mentioned when we discussed image transforms. This is the color space used for image compression in the JPEG standard and in many other subsequent standards as well. You can see the transformation here: there is a linear part and then there is a translatory part, which makes this transformation an affine transformation.

Note that it also has better compression properties, which is why it is used in different compression schemes. Here Y represents the luminance, and Cb and Cr represent the chrominance parts; Cb is called complementary blue and Cr is called complementary red, those are the full terms. It is not a linear transform but an affine transform, because there is an additive or translatory component at the end of the transformation, this particular component, which makes the transformation affine.

Otherwise, if I leave aside this part and consider the modified components Cb − 128 and Cr − 128, then you have a linear transformation from RGB to those values. So, that is also something used in color representation. The translation is done with the motivation that, without it, these values would range into the negative zone.

To avoid this inconvenience in digital storage, we make all those values positive, so that you can represent them in an unsigned byte representation of the pixel. Assuming the red, green and blue values vary from 0 to 255, the values of Cb and Cr after the transformation will vary from 0 to 240. That is one of the motivations why we make this kind of affine transformation.
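A small sketch of this affine RGB to YCbCr conversion is given below. The JPEG/JFIF-style coefficients and the +128 offset are assumptions based on common usage rather than copied from the slide.

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Affine RGB -> YCbCr conversion (JPEG/JFIF style coefficients assumed).

    rgb: array of shape (..., 3) with 8-bit values in [0, 255].
    The +128 offset is the translatory part that makes the transform affine
    and keeps Cb, Cr non-negative for unsigned byte storage.
    """
    M = np.array([[ 0.299,     0.587,     0.114   ],
                  [-0.168736, -0.331264,  0.5     ],
                  [ 0.5,      -0.418688, -0.081312]])
    offset = np.array([0.0, 128.0, 128.0])
    return rgb @ M.T + offset

print(rgb_to_ycbcr(np.array([128.0, 128.0, 128.0])))  # mid grey -> (128, 128, 128)
```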

(Refer Slide Time: 15:03)

There are other, non-linear color spaces where again hue and saturation can be factored out from the achromatic information such as intensity. This figure shows one such example, where you can see a conically shaped representation of the space: the intensity varies as we move along the vertical direction.

In the horizontal direction you have the direction of saturation, and the hue is represented by the angle. So, it is a kind of polar representation that is shown here. Depending upon the angle, with red as the reference direction, as you increase the angle there is a transition through yellow, green, cyan, blue and magenta back to red, as you can find in this circular or hexagonal representation.

(Refer Slide Time: 16:16)

So, we can elaborate this transformation further; we will see the mathematical form. The interpretation here is that HSV stands for hue, saturation and value, where value stands for intensity. Hue relations are naturally expressed on a circle, as I have shown here, and these are the expressions. The intensity, represented here by I, is just the average of the red, green and blue values, and the saturation expression is also very simple: it is one minus the ratio of the minimum component to the intensity.

$$I = \frac{R + G + B}{3} \qquad S = 1 - \frac{\min(R, G, B)}{I}$$

So, the fraction is defined as the ratio of the minimum of R, G, B to the intensity value, and the saturation is one minus that fraction, whereas the hue is expressed as an angle, as interpreted in that space.

$$H = \cos^{-1}\left( \frac{\tfrac{1}{2}\left[(R - G) + (R - B)\right]}{\sqrt{(R - G)^2 + (R - B)(G - B)}} \right) \quad \text{if } B \le G$$

$$H = 360^{\circ} - \cos^{-1}\left( \frac{\tfrac{1}{2}\left[(R - G) + (R - B)\right]}{\sqrt{(R - G)^2 + (R - B)(G - B)}} \right) \quad \text{if } B > G$$
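The expressions above can be coded directly; the following sketch assumes R, G, B are normalised to [0, 1], returns the hue in degrees, and adds a small epsilon only to avoid division by zero.

```python
import numpy as np

def rgb_to_hsi(R, G, B, eps=1e-9):
    """Hue, saturation and intensity from the expressions on the slide."""
    I = (R + G + B) / 3.0
    S = 1.0 - min(R, G, B) / (I + eps)
    num = 0.5 * ((R - G) + (R - B))
    den = np.sqrt((R - G) ** 2 + (R - B) * (G - B)) + eps
    theta = np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))
    H = theta if B <= G else 360.0 - theta
    return H, S, I

print(rgb_to_hsi(1.0, 0.0, 0.0))   # pure red  -> H = 0
print(rgb_to_hsi(0.0, 0.0, 1.0))   # pure blue -> H = 240
```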

(Refer Slide Time: 17:06)

So, the implication is that the HSV model is approximately uniform, that is, equal steps in it give the same perceived color changes. It has a fairly uniform representation of the neighbourhoods of non-distinguishable colours: in this three-dimensional representation, the region of indistinguishable colors around a point is close to a small spherical ball, so it has that effect.

Though you should note that the region is a double cone, as shown in this part of the diagram; around a point you take a small spherical neighbourhood, and that would be roughly uniform here. Saturation is the distance to the vertical axis, ranging from 0 to 1, and intensity lies along the vertical axis; these are all normalized representations. So, those are the ranges of saturation and intensity, and hue is an angle, so it ranges from 0 to 2π.

(Refer Slide Time: 18:26)

The other interesting color space is supported by the theory of opponent color processing, the opponent processing of colours that goes on in our visual system; this space has been proposed based on that theory. The theory says that the signals from cones and rods are processed in an antagonistic manner.

So, they act antagonistically: rods are inhibited when cones are very active, and when rods are active the cones are inhibited. Now consider the representation in the overlapping spectral zones of the three types of cones. We discussed that there are three types of cones: those whose spectral response is shifted towards the longer wavelengths are called L, the medium wavelength ones are called M, and the short wavelength ones are called S.

So, L, M and S types of cones are there in our visual system, and this model considers the differences between the responses of the cones rather than the individual responses of each type of cone. When we sense a color signal, we receive the energy in terms of the three stimulations of red, green and blue, the three primary forms. It is not red, green and blue that are directly processed for the interpretation of the color, but rather differences such as red minus green and yellow minus blue; we will see the model of these differences in the next slides.

This is supported by the psycho-visual phenomenon that there is hardly any perception of reddish green: we cannot talk about reddish green or bluish yellow in our color perception. From there, these opponent color processing theories were developed.

(Refer Slide Time: 20:29)

So, you can see what kind of representation is there. This is the L cone, the M cone and the S cone, and you see that L and M participate in an antagonistic manner, providing you C1, which is the R − G component; green opposes the red excitation. The S cone participates in Y − B, where Y (yellow) is basically represented by R + G.

So, that is what is opposed, and all three cones together with the rods participate in producing the brightness value. You now have three factors: the brightness, which is the achromatic part, and R − G and Y − B as the chromatic parts. So, this is how the color is represented, and mathematically we can model it in this fashion.

(Refer Slide Time: 21:31)

So, we have three opponent channels: red versus green, represented as G − R; blue versus yellow, represented as B − Y, where Y denotes yellow and is the additive composition of red and green, that is R + G; and luminance, black versus white, represented as (R + G + B)/3. If you look at the YCbCr conversion and consider only its linear part, you will find that it somewhat follows this principle, because you can see it involves red, green and blue.

The Y part represents the luminance, which is a weighted average of red, green and blue. The Cb, the complementary blue part, represents this component: from the blue you subtract a combination of red and green, so that is a complementary blue representation. And Cr, this part, is built around red: from the red you subtract mainly green and also some amount of blue in this model.

So, we can see that the YCbCr model, in fact, is also motivated by this representation of opponent color processing, and it is used heavily in various applications. Here Cb' and Cr' are shown to denote Cb − 128 and Cr − 128 in this slide.
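A simple sketch of this opponent decomposition, with yellow taken as R + G as described above, could look like the following; the function name and array layout are illustrative choices.

```python
import numpy as np

def opponent_channels(rgb):
    """Opponent-colour decomposition following the lecture's model.

    rgb: array of shape (..., 3). Returns (luminance, red-green, blue-yellow).
    """
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    luminance = (R + G + B) / 3.0   # black vs. white (achromatic) channel
    rg = G - R                      # red vs. green channel
    by = B - (R + G)                # blue vs. yellow channel, with yellow = R + G
    return luminance, rg, by
```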

(Refer Slide Time: 23:19)

So, now we will discuss another phenomenon observed when we perceive color: how the variation of intensities and of the environmental illumination affects the colors we perceive. When we take photographs, or when we capture images under varying illumination, the red, green and blue components vary widely.

As you can see, the superpositions of those primary colors reproduce quite different colors in the images, though they are of the same scene. For example, in this case this image is taken with a flash light and this image is taken under a tungsten lamp, and you can see the differences. The interesting property is that, as human beings, we can adapt to this illumination variation.

We can even recognize the true red under largely varying illuminations, which means the variation does not have much effect on our color perception. But when you take the photographs there is a large variation in the red, green and blue components. So, we will see how we can handle this phenomenon during processing and how it actually affects our understanding of color. The summary is that the lighting conditions of the scene have a large effect on the colors recorded.

(Refer Slide Time: 24:49)

This is another example where the same scene has been captured under different illuminations, different lighting conditions. When you record the intensity values, or the red, green and blue components, there is wide variation; but as human beings we do not see much difference, we can adapt to these variations.

(Refer Slide Time: 25:09)

So, knowing just the RGB value is not enough to know everything about the image; that is a challenge when we interpret a color pixel. Also, the RGB primaries used by different devices are usually different, which is another consideration. For scientific work we need to calibrate the camera and the light source, and for multimedia applications this is more difficult to organize, so this factor has to be addressed there. There exist algorithms for estimating the illumination colour.

(Refer Slide Time: 25:49)

This phenomenon, the adaptation of human perception to the same or true color under varying illumination, is called color constancy. When we try to address it computationally, we call that the computation of color constancy. Now, to understand once again how colours are perceived or produced, and what components are present when we record the corresponding color signal, we have to consider two types of reflection occurring on object surfaces.

So, there are two components of reflected light. One component is the diffuse component, which is the product of the spectral power density and the reflectance curve, as we have discussed; the other is the specular component, which is like reflection by a shiny surface. You get almost the same color as the illuminant: those surface points act like mirror points, so the specular component essentially gives you the color of the illumination, the color of the light source itself.

So, when the same light falls on a diffuse surface you get the color of the surface, and when it falls on a specular surface, where you get specular reflection, you get the color of the illumination, the color of the light source. These are the two different kinds of color we perceive from the same scene. In diffuse reflection, as shown, the incident energy is diffused, meaning it is reflected in all possible directions with equal energy on an ideal diffuse surface, and the amount of reflected energy depends on the angle of incidence; in this case it is proportional to the cosine of the angle of incidence.

Specular reflection, on the other hand, is like a mirror reflection: the reflected energy goes in one particular direction, and if your sensors lie in that direction they will receive the reflected energy. The specular reflection takes place through a very narrow range of directions, almost following the law of mirror reflection. So, when we record color, the intensity values or received energy at a point, it is actually a weighted sum of these two components that we are recording.

(Refer Slide Time: 28:28)

So, let us assume this model, the diffuse + specular model; it is a kind of summarization of what I described in the previous slides. For the specular reflection you perceive, and record, the color of the light; on dielectric objects and on metals you can get this kind of reflection. The diffuse reflection, on the other hand, depends on both the illuminant and the surface: it is the product of the intensity received from the illuminant and the reflectance property, the reflection coefficient, of the surface.

As I stated, human vision is capable of adjusting the perception of color under varying illumination, and this particular phenomenon is known as color constancy.

(Refer Slide Time: 29:27)

So, we would like to compute, or represent, colors in an illumination-invariant representation; that is what the computation of color constancy is. Now, if we plot the color points in the RGB space, we can observe a kind of dog-legged structure in the histogram of receptor responses.

So, if I just plot these points in the RGB space, we will find this dog-legged structure in the plot itself. You can see it in this picture describing the presentation of these colours. Consider a diffuse surface and all the colours produced by it: all those colours would have more or less the same hue and saturation, so their color vectors would be similar, only scaled by different amounts of intensity.

So, in the RGB space they will have a particular direction: they lie on a linear segment passing through the origin. On top of that, if there is a shiny patch, that is, specular reflection, then those colours will be related to the color of the illuminant; they give a direction along the illuminant color, and since it is a superposition the points get shifted.

So, you will find another direction, another set of points in the space on top of the first one, and the overall structure is a kind of dog-legged structure. We observe this structure when we plot the colors in the RGB space. So, for a patch of diffuse surface we get one color multiplied by different scaling constants, and for a specular patch we get a new color, and as a result you get the dog leg in this diagram.

(Refer Slide Time: 31:42)

So, now we discuss the computation of color constancy. We see that there are three factors in forming the image, that is, in producing the sensory information from the energy of the environment reflected from a surface point. These factors are: what objects are present in the scene, the spectral energy of the light source, that is its spectral power distribution, and the spectral sensitivity of the sensors.

We can represent this as follows: E(λ) represents the spectral power distribution, R^X(λ) is the surface reflectance spectrum at the object point x, and S(λ) is the spectral response of a sensor. The output of the sensor is the response accumulated over all wavelengths of the spectrum, and that is what is expressed as I(x); if there are three different sensors with three different spectral responses, you will have three such components.

$$I(x) = \int_{\lambda} E(\lambda)\, R^{X}(\lambda)\, S(\lambda)\, d\lambda$$
(Refer Slide Time: 33:05)

That is how the color is formed; here the same scene is captured under different illuminations, and the problem of color constancy is: can we transfer the colours from one illumination to another?
(Refer Slide Time: 33:19)

So, this is the problem statement: we need to derive an illumination-independent representation of color. What is required for that? We are required to estimate the spectral power density of the light source, that is the first thing, and then normalize the colors we get as the sensor responses with respect to the spectral power density (SPD) of the light source.

The factor which performs that normalization is known as color correction, and for estimation of the spectral power density you need to compute the three-vector representation of the color of the illuminant. One method is known as diagonal correction, and it is given in the following form. Suppose you have a target illuminant for the color-invariant representation: you would like to transfer all the colors as if they were illuminated by this illuminant, whose color is represented by (R_d, G_d, B_d).

The actual color of the illuminant in the scene, the color of the light, is given by (R_s, G_s, B_s). So, for any color (R, G, B) at a pixel, obtained under the scene illumination, how can you modify it so that the same color would be obtained under the illuminant with the target color vector? That is the correction you are going to do.

RG B R Gd Bd
f  kr  d kg  kb 
k r R  k g G  kb B Rs Gs Bs

So, first you find the proportional factors k_r, k_g, k_b between the primary components of the color vectors of the target illuminant and the source illuminant; you also compute the factor f, which normalizes the intensity so that you have the same intensity value even after the modification. Then your updated color value should be as follows; I write the updated values with primes.

$$R' = f k_r R \qquad G' = f k_g G \qquad B' = f k_b B$$

So, R' is the updated value, which is f k_r R, then G' = f k_g G and B' = f k_b B. This is how you can perform this kind of color correction. Let me take a break here, and we will continue this discussion in the next lecture.

Thank you very much for listening.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 38
Color Fundamentals and Processing (Part V)

We are discussing about computation of color constancy and the objective is to get color
representation which would be invariant to the color of the illuminant.

(Refer Slide Time: 00:29)

And for that there are two tasks that we have identified: one task is to estimate the color of the illuminant, and the next task is to transform the colors of the source image into a target color image under a desired target illumination, that is, as if the scene were illuminated by the target illuminant. So, now we will discuss different approaches for estimating the color of the illuminant.

Consider first a technique which assumes that the world is gray, which means that if I average all the colors over the red, green and blue components, that average gives the color of the illuminant. We can describe this in terms of red, green and blue: taking the averages of the red, green and blue components gives the estimate
$$(R, G, B) = (R_{avg}, G_{avg}, B_{avg}).$$

This type of technique is very simple: you just perform averaging of the color vectors and you get the estimate of the color of the illuminant. The other assumption is called the white world assumption; in this case the assumption is that the color of the illuminant is given by the maxima of the red, green and blue components,
$$(R, G, B) = (R_{max}, G_{max}, B_{max}).$$

The color is almost like white and you just consider the maximum of their red green blue
component. And here you need to just compute the maximum of red components of all
the pixels, maximum of green components of all the pixels, and maximum of blue
components of all the pixels.
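Both assumptions lead to one-line estimates, as in the sketch below; the function and argument names are illustrative.

```python
import numpy as np

def estimate_illuminant(image, method="grey_world"):
    """Estimate the illuminant colour from an RGB image (H x W x 3 array).

    "grey_world": the per-channel average is taken as the illuminant colour.
    "white_world" (max-RGB): the per-channel maximum is taken instead.
    """
    pixels = image.reshape(-1, 3).astype(float)
    if method == "grey_world":
        return pixels.mean(axis=0)
    elif method == "white_world":
        return pixels.max(axis=0)
    raise ValueError("unknown method")
```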

(Refer Slide Time: 02:19)

These two are very simple techniques; there are other techniques which use more involved computations, and the assumptions are a bit different. Here it is considered that the pixels lying on edges carry more information about the color of the illuminant.

The idea is that when colors are reflected at the boundaries of objects, you get a kind of specularity, and as we discussed, specular reflection contains the color of the illuminant; so from there itself we try to estimate it. One such method, shown here, extends the pixel-based methods to incorporate derivative information.

That means you again accumulate, over all such pixels, the intensity values or the derivatives of those intensity values, and this aggregated form itself gives you the proportional color representation in the three-vector form (Refer Time: 03:25). So, you consider the following computation.

Here we are considering the n-th order derivative of the intensity, which is represented by the function f(x), and c denotes the channel. So, n is the order of the derivative, p is the type of norm, as this is like an L_p norm, also called the Minkowski norm in the continuous domain, and σ is the scale, meaning the image is smoothed by a Gaussian mask before the derivative is taken. You then compute this L_p (Minkowski) norm:

$$\left( \int \left| \frac{\partial^{n} f_{c,\sigma}(x)}{\partial x^{n}} \right|^{p} dx \right)^{1/p} = k\, e^{c}_{n,p,\sigma}$$

For a particular channel, if you do this for the red component, for example, over all the pixels, you get a value which is proportional to the illuminant for that channel. You carry out this computation for every channel, green and blue as well; you get three such values, and their relative proportions represent the color of the illuminant. You can then adjust the intensity of that illuminant and transfer the colors using the color correction discussed in the previous lecture.
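A rough discrete sketch of this derivative-based estimate is shown below. The choice of n, p and sigma, the iterated gradient used for the n-th order derivative, and the final normalisation are all illustrative assumptions rather than the exact formulation on the slide.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def grey_edge_illuminant(image, n=1, p=6, sigma=2.0):
    """Derivative-based (grey-edge type) illuminant estimate, as a sketch.

    For each channel c, the Gaussian-smoothed n-th order derivative magnitude
    is pooled with a Minkowski (L_p) norm; the three pooled values, after
    normalisation, are taken as the illuminant colour.
    """
    est = np.zeros(3)
    for c in range(3):
        channel = gaussian_filter(image[..., c].astype(float), sigma=sigma)
        d = channel
        for _ in range(n):                        # n-th order derivative, crudely iterated
            gy, gx = np.gradient(d)
            d = np.hypot(gx, gy)
        est[c] = (np.abs(d) ** p).mean() ** (1.0 / p)
    return est / (np.linalg.norm(est) + 1e-12)    # unit-length illuminant colour vector
```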

(Refer Slide Time: 04:49)

Another kind of approach, instead of accumulating the responses of the perceived specularities in pixel reflections, is a data-driven approach: we have models of different canonical illuminants in terms of their RGB distributions, distributions in a color space, which could be the RGB space or the xy space.

Then, given an image, we try to find what kind of distribution it produces in that space. The similarity between the two distributions can indicate whether the corresponding scene has been illuminated by the given illuminant whose distribution is known. So, if I have a set of such illuminants, we consider the illuminant whose distribution is nearest.

We attach to the scene the illuminant whose distribution is the closest approximation of the observed one; in that way we select it from a set of canonical illuminants. We observe the distribution of points in the 2D chromaticity space, or even in a 3D chromatic space, that is, including intensities, and then we assign the spectral power density of the nearest illuminant.

There are approaches like gamut mapping: as we have discussed, the xy chromaticity chart gives you a gamut, that is, a distribution of points in the space lying within a triangle, and from there you can compare that distribution with the distributions of the canonical illuminants.

So, the existence of chromatic points is important. Color by correlation is another approach; it depends on what kind of similarity you compute, and here the relative strength over the distribution is considered. You can also use nearest neighbour approaches: for example, use the mean and covariance matrix of those distributions and a distance function such as the Mahalanobis distance. In this way you can select one of the canonical illuminants.

(Refer Slide Time: 07:23)

Some examples of these techniques: consider this image, which is illuminated with a reddish hue in the illuminant, and that is why its colors tend towards red. Now suppose I perform color correction, first estimating the color of the illuminant using the grey world assumption.

That means you take the averages of the red, green and blue components and use that as the illuminant color of this image. If I use the target illuminant as simple white light, say with the value (255, 255, 255), and then perform color correction following the diagonal correction rule, you get this image.

So, this is for the grey world method; for the max world method you get this image, and gamut mapping gives you this image. I have not described gamut mapping in detail; you can go through some of the papers. In our course, in this particular respect, we will consider the simple approaches of estimating the illuminant color using, say, the grey world assumption, the max world assumption, etc.

(Refer Slide Time: 08:50)

Let me now solve a problem just to show how color correction can be performed using the source illumination and the target illumination. Consider this exercise, and then we can understand the computational steps better. Suppose the color of the source illuminant is represented by the RGB vector (200, 240, 180), which means 200 is the red component, 240 is the green component and 180 is the blue component.

The target color is given by (240, 240, 240); in this example we are trying to get a white illuminant with this combination, that is the target. Given a color vector (100, 150, 200) in the source image, compute the color-corrected vector using the diagonal correction rule; this is the problem statement.

(Refer Slide Time: 09:55)

Let me show how we can perform this computation; it is quite simple and direct if you know the expressions I mentioned. We need to first compute the proportional factors of the target and source illuminant colors for each primary component, that is, k_r, k_g and k_b, and we also need to compute the corresponding factor f for normalizing the colors so that the intensity value remains the same.

Then the modified colors are given in this form; since they are modified, I am once again using the primed notation. If I perform these computations with what is given, you can see that (R_d, G_d, B_d) is given and the components of the source are also given, so if you compute the proportional factors using this information, those values are as shown here. Then this color has to be converted into a form as if it were illuminated by the target illuminant (R_d, G_d, B_d), that is, (240, 240, 240).

The computation of f gives the factor shown here, and the color-corrected vector follows from the expressions above. You can see R becomes 101; it was earlier 100 and after modification it is 101, green becomes 126 and blue becomes 223, so these are the modified, color-corrected values. You should note that the intensity in this computation, if I add (R + G + B), still comes out the same, which in this case is 450.

So, if I add 101, 126 and 223, I still get 450; in this correction you are not changing the intensity value of the modified pixel, but you are changing the relative proportions of the three primary components, because it is corrected for the target illuminant. That is how we perform the color correction.
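The whole computation fits in a few lines; the sketch below reproduces the worked example above up to rounding.

```python
def diagonal_correction(rgb, source_illum, target_illum):
    """Diagonal colour correction as in the worked example above."""
    R, G, B = rgb
    kr = target_illum[0] / source_illum[0]
    kg = target_illum[1] / source_illum[1]
    kb = target_illum[2] / source_illum[2]
    f = (R + G + B) / (kr * R + kg * G + kb * B)   # keeps R + G + B unchanged
    return f * kr * R, f * kg * G, f * kb * B

# Source illuminant (200, 240, 180), target (240, 240, 240), pixel (100, 150, 200):
print(diagonal_correction((100, 150, 200), (200, 240, 180), (240, 240, 240)))
# prints approximately (100.6, 125.8, 223.6); the intensity sum remains 450 as noted above.
```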

(Refer Slide Time: 12:27)

So, this is the final answer for this problem, and as I mentioned, the sums of the original vector and the corrected vector remain the same.

(Refer Slide Time: 12:33)

We will now discuss a more general form of color correction, an operation called color transfer. In this operation there is a source image, which has its own color distribution, and there is another image which defines a target color distribution; the source image would like to borrow the target distribution in its own rendition.

An example can be cited to explain this operation. Consider this source image: you can see it has a somewhat reddish illumination, and all the objects inside it are affected by that illumination.

Consider another image, a completely different image having different types of objects, a different scenario and context, but it defines a target illumination or target color distribution, depicted by this target image. It is an image of the Alps, a peak of the Alps, and this distribution should be transferred to the content, to the objects, of the source image, to its display, so that it looks like this.

So, your image under this transferred illumination model, or, say, with the transferred color characteristics, would look like this. Here you can see the colors are quite different, and the distribution is quite akin to that of the target: it is as if the scene were illuminated by the illumination of the target image.

(Refer Slide Time: 15:06)

Let us discuss the processing involved in this operation. What we do here is process all the pixels in a different color space where we can separate the chromatic components from the luminance component. In this particular technique the processing is done as follows. First, RGB is the color space in which the pixel color values are captured.

We then convert them into another space called the LMS cone space, which tries to model the retinal sensitivity of the different cones. This model tries to provide, as it were, the sensations that would be generated in our cones, and that is what is captured by this mathematical model; you can see it is a linear transformation.

Just to remind you, L stands for long wavelength, M stands for medium wavelength and S stands for short wavelength, which means they correspond to the red, green and blue zones of the spectral domain. Once you get the LMS values, you process them further by taking them into the log domain, because in some perceptual models it is considered that our perception is proportional to the logarithm of the received sensory signal, the received energy.

That kind of heuristic has been used here: we use the logarithmic values, and we use them to separate the chroma components from the luminance component. You can see once again another transformation, which is also a linear transformation, applied to the logarithmic values of the cone responses considered in our model.

We call it LMS once again, but it is shown in a bold font, and you can see the color space given by $(l, \alpha, \beta)$. These three components are similar to an opponent color space model; you can see it particularly in this row. This row corresponds to the intensity component, which is the addition of L, M and S, and this one is (L + M − 2S), where L corresponds to red and M corresponds to green.

So, that is like a yellow minus blue opponent color model, and this one is L − M, which is the red minus green opponent color model we have already discussed earlier. It is a variant of the opponent color space processing, and we perform this kind of separation. In the next step we will perform certain processing over the chromatic components so that you can transfer the distribution of colors in the target image to the source image; let us see how we can do it.
(Refer Slide Time: 19:19)

So, for modifying the chromatic component and also the luminance component, first
what we can do that, we subtract the mean of all those components from the source
image. So, you can see here this symbols ls ,  s ,  s they denote the
corresponding l ,  ,  components of source image. And then you are taking their mean
which is denoted by this angle bracket operations I mean this is the symbol just.

So, your subtracting those means and then up the subtracted values they are scaled up, so
that the standard deviation of the target distribution can be transferred. I mean your
source distribution also should follow the standard should be similar to this variances or
standard divisions of the target distribution. So, you perform a proportional scaling- a
scaling proportion in proportion to their ratios of standard deviations that could make
 tl *  t *  t *
these operations. l '  l ,  '   ,  '  
 sl  s  s

Here, this is the standard deviation of the luminance component of the target image, and this is the standard deviation of the luminance component of the source image. If I multiply the mean-subtracted luminance value by this ratio, this proportional factor, then I am making the standard deviation of the modified distribution the same as the standard deviation of the target distribution.

That is what you are doing by this operation, and you do it for every component in the same way: the corresponding ratios of the standard deviations of each component in the target and source are multiplied with the mean-subtracted values. This transfers the standard deviation as well; the slide just defines the corresponding symbols, as I mentioned, the standard deviations of the source and target distributions and their ratios.

Once you have done this, you also have to add the mean of the target distribution's components, that is, for the luminance component, the α component and the β component. In this way, in the transferred color space, the transferred color values of the source image have had the mean transferred.

That distribution now has a mean equal to the mean of the target distribution and also the standard deviation of the target distribution, and that is in the l α β space. The rest of the job is to perform the inverse transformation to take all these components back to the RGB domain; for that, you should follow the inverse operations.

(Refer Slide Time: 23:33)

That means, after modifying the l, α, β components, you take those values back into the L M S space; but this is the logarithmic L M S cone space, the logarithm of the L M S cone color values, so from there you have to transfer back to the original L M S cone color space.

You perform exponentiation over those logarithmic values and then perform the inverse transformation once again to go from L M S to the R G B values. In this way you get the color-transferred image from the source image using the characteristics, the distribution, of the target image. A compact sketch of this whole pipeline is given below; after that, let me give you another example of this operation.
example of this operation.

(Refer Slide Time: 24:51)

In this case we show the source and target images and see the effect. The target illumination is defined by the image with the reddish illumination in the environment, and the source image is one captured in broad daylight on a very sunny day; its content, as shown, is a mountain peak which is very prominent, with a green valley also displayed. Under the color transfer operation, let us see how it looks.

We can see that the scene is now perceived as if it had been captured under some reddish illumination, which for natural scenery may be considered a bit unnatural, in the sense that it should not be so reddish. But that is what the color transfer operation did; it would be like a near-sunset kind of rendering. So, this is just an imagined scene through which you can perceive this kind of image.

(Refer Slide Time: 26:22)

The next type of processing that we are going to discuss is the color demosaicing operation; in this problem we use a color filter array. First, let me explain the principle of the single-chip camera. For a normal camera with full color reception, we know that we require three types of filters: there is a color signal, and you have to receive that signal with sensors.

(Refer Slide Time: 27:00)

The same signal should be received by three different sensors, say red, green and blue, and there are different techniques by which the same optical signal is guided to them, divided into three different parts, and then sensed by the three sensors independently. But it has to come from the same coherent source, and that coherency has to be maintained so that the color of a particular object is represented by the same signal.

Now, this kind of technology is quite complex; there are different ways people do it, for example with an optical divider using prisms, etc. The manufacture of this kind of camera requires high technological sophistication and it is quite costly. Since you are using three different sensors for each pixel, we call it a three-chip color camera, or three-chip sensor.

But there is a cost-effective technology for sensing color images where, instead of three sensors, only one is used. That means at each location I sense only one of red, green or blue: whenever the camera senses the signal coming from a particular point it captures only one component, but it does this in an interleaved fashion. From the image perspective, this means recording, at different pixels, the received energies from different color channels.

At this point I will have only red, whereas at the next point I will have, say, green, and this point is blue. So, you can see that I lose spatial resolution, but if I assume my spectral resolution requirement is much coarser, I can associate this blue also with this pixel and this green with this pixel; that is, I can consider all these pixels to have the same R G B values. This is a very crude way of estimating the colors.

That means assigning full color information to each pixel, estimating the full color information at these pixels. As you can see, this is very crude, because I am simply transferring the values of the missing components from the neighboring pixels, relying on the spatial correlation of the spectral channels.

I have just given you one particular solution to this problem, but there could be many solutions: it is an interpolation problem, which means I have missing components at each point. For example, if at this pixel I am recording only red, then green and blue are missing; but in the neighborhood I have some pixels where those values are available, some green, some blue.

I also have other red samples, so the question is: considering this neighborhood pattern, this neighborhood information, how best can I estimate the missing components? That is the problem of color interpolation. When we sense the image in this interleaved fashion, with one spectral component per pixel, we call that kind of image a color filter array. Let me go back to the slides to summarize this computational problem; this kind of computation is known as color demosaicing or color interpolation.

The term demosaicing comes from treating the interleaved pattern of different color components as a mosaic pattern. In this problem you are using the color filter array of a single-chip CCD camera, and from it you generate dense pixel maps of all the color components. You have sparse data, only one component at every point, but you would like to get all the color components for every point.

As for the motivation, as I mentioned, it is technologically simpler to design and manufacture this kind of camera, because ensuring the spectral coherency of the reception is not required here. You can simply use color filters, arranged in a pattern, in front of the sensor behind your lens, and for each pixel you receive energy only in that spectral band.

Different patterns have been proposed for designing these filters; that is why the array is called a color filter array, because the pixels are sensed from the incident light through these filters. Examples of such patterns are shown here. This pattern is called the Bayer pattern, a very popular pattern; in most of the literature you will find people have worked on it. You can see that in this pattern green and red are interleaved in one row, green, red, green, red.

In another row blue and green are interleaved, blue, green, blue, green; so green is sampled in the color filter array at twice the rate of red and blue. One consideration is that we are more sensitive to the green channel, so loss of information in the green channel would affect us more than in the other two components; that is why we have this kind of sampling pattern.

But this is not the only pattern; we can have other kinds of patterns, for example a pattern adopted by Kodak for its famous cameras, where green, blue and red are interleaved in the same row. In the Bayer pattern, on the other hand, we can identify a red row, because green is sampled in every row, but in one row red pixels are sampled and in the next row blue pixels are sampled.

So, for the purpose of describing this pattern and the computations, rows can be denoted as red rows or blue rows, whereas columns can be green or non-green, where non-green means either red or blue. In this way you can identify a location in the Bayer pattern; you should also note the starting pixel location, the starting row and column and their states. If I provide you that, together with this rule, I can describe the whole pattern.

In the color demosaicing problem the pattern remains fixed, so you already know, for a given color filter array, which spectral component each pixel of the image carries. For every pixel I know which spectral component it is, whether green, red or blue, and by following this pattern, if I tell you the component of one particular pixel, the spectral components of all the remaining pixels can be determined; a small sketch of such a lookup is given below.
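For instance, a tiny lookup of this kind can be written as follows; the starting arrangement of the 2 x 2 tile (here "GRBG", a layout whose first row alternates green and red) is an assumption and should be set to the actual sensor layout.

```python
def bayer_channel(row, col, pattern="GRBG"):
    """Return which spectral component the CFA samples at (row, col)."""
    tile = {"GRBG": [["G", "R"], ["B", "G"]],
            "RGGB": [["R", "G"], ["G", "B"]]}[pattern]
    return tile[row % 2][col % 2]

# Example: in the GRBG layout, even rows are "red rows" and odd rows "blue rows".
print([bayer_channel(0, c) for c in range(4)])  # ['G', 'R', 'G', 'R']
print([bayer_channel(1, c) for c in range(4)])  # ['B', 'G', 'B', 'G']
```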

This is the advantage of using this kind of regular pattern to describe a color filter array. Our objective here is to interpolate the pixel values in such a way that you get all the color components, that is, all three components at every pixel; that is the computation of color demosaicing. Let me stop here; in the next lecture we will discuss different algorithms for performing this color demosaicing. So, let us take a break at this point.

Thank you very much for your attention.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 39
Color Fundamentals and Processing (Part VI)

We continue our discussion on color demosaicing and in the last lecture I have defined
what is the problem of color interpolation.

(Refer Slide Time: 00:28)

In this problem you are given a color filter array, where there is a pattern by which you sense the color image. In this pattern only one spectral component is captured at every pixel, so the information of only one spectral channel is available. For example, at this pixel only the green value is available, while in its neighboring pixels red and blue are available.

The problem of interpolation is to compute the missing components at every pixel by taking care of the neighboring distributions. We will be discussing different methodologies for that. In the given problem, what is already provided is that the interleaved pattern of channels is standardized with respect to the pixel coordinates, so at every pixel of the CFA I know which spectral component is available at that array location.

You should note that these positions can be encoded by a simple indexing scheme: in this interleaved pattern, one row has alternating green and red values, and the next row has alternating blue and green values. We can call the row where red samples are available along with green the red row, and the row where blue samples are available along with green the blue row, whereas a column can be denoted as either green or non-green.

So, in this way in an alternate in a periodic fashion we can determine for any pixel
locations in the CFA in which type of row it is lying and which type of column it is lying
and from that information itself we can find out that which spectral component is
available there.

So, for example, if the position is in a red row and a green column, then it is a green pixel location; at any position in a green column a green sample is available. Whereas if it is in a blue row and a non-green column, this is one example, then the blue component is available. With this background, let me now discuss different methodologies of interpolation.

(Refer Slide Time: 03:49)

There are two key observations we should make here. One is that there is a high correlation between the red, green and blue channels, which means that if you have the information of any of these channels, you can exploit this correlation to derive information about the other channels; in particular, they are very likely to have the same texture and edge locations.

In particular, cross-channel gradient estimation is useful in this respect; we will elaborate when we discuss the different methodologies. The other observation, from the CFA arrangement, I mentioned in the previous lecture as well: the green channel is sampled at a higher rate, and in the Bayer pattern in particular it is sampled at twice the rate of the red and blue channels.

Usually the green channel is more akin to representing the intensity values, while the chromatic information is more embedded in red and blue, although green also has its part there. That is why, in the jargon of color interpolation, green is sometimes considered synonymous with the luminance channel: if you want to estimate luminance, you first interpolate green and use that value as a kind of luminance in subsequent processing.

These are some of the advantages of having a higher sampling rate in the green channel: because our vision system is very sensitive to the green component, you would like a higher quality of reconstruction of the green channel, and that is why its sampling rate is high. That is also why it is less likely to be aliased and its details are preserved better than in the other two chroma channels.

(Refer Slide Time: 06:08)

First we will discuss the very simple technique of bilinear interpolation. Since this is a two-dimensional distribution, we consider interpolation in both directions, that is, linear interpolation in both directions, which is quite simple and direct, as we can see here.

For example, to interpolate the green values, let us observe what kind of arrangement we have. Interpolating green means computing this information at locations where green samples are not available, for example a location like this, shown here as B_8, or a location like this, shown here as R_14. These are the locations where you need to compute the missing green component.

Now, you can observe that the neighboring green samples are available in the form of four neighbors. In a discrete grid, the 4-neighborhood definition is quite straightforward: if a pixel location is denoted as (x, y), then its 4 neighbors are the pixels at (x − 1, y), (x + 1, y), (x, y − 1) and (x, y + 1). So, if I consider, for example, the computation of green at this location, I can see these are its 4 neighbors.

This one, this one, this pixel and this pixel are the 4 neighbors, and I can simply take the average of those pixels. That is how we interpolate green, because wherever a green sample is missing you will always find true green values available in its 4 neighbors; so you simply take their average.

You can see the example of the average of 4 neighbors for the pixel location B_8, where the green sample is missing. This is the location, and you consider its neighboring green samples such as G_7, G_9 and G_13, sum the four neighbors and divide by 4 to get the average.

Similarly, when you are going to interpolate the red and blue values, you should again consider only the neighboring samples which contain true pixel values of the corresponding channel. Suppose I have a pixel where the red component is missing: there are two types of such pixels, those where the green sample is available and those where the blue sample is available. These are the scenarios, and this is another case.

So, in this green sample particularly note this location. So, here you observe that these
are the red samples, two red samples are available. there is no other red samples in its
neighborhood. So, you simply take the average of this two. So, what you need to do?
You need to identify the type of pixel in a CFA that be which type of row and which type
of column it is.

For example, this is a blue row and green column; that is how you designate this pixel
position, and for that case you have to consider the top neighbor and the bottom neighbor
and then take the average. Whereas if it is in a red row and green column, then you
have to consider this one and this one, that means the left neighbor and the right
neighbor, and take the average.

If it is a blue column and blue row, that means this scenario, then you should
consider all the diagonal neighbors of this pixel and take the average. So,
these are the three cases for interpolating red; similarly you carry out
the same operations for blue. So, what do we get? Let us see these expressions.

R_7 = \frac{R_2 + R_{12}}{2}, \qquad R_8 = \frac{R_2 + R_4 + R_{12} + R_{14}}{4}

So, as I mentioned, if I consider the location G7, say this is the location G7; this is a
blue row and green column location. So, you take the top neighbor and bottom
neighbor, R2 and R12, and divide by 2; that is averaging those two values.

Similarly, consider this case. Here you have R8; it is just the location beside the
previous one, and now the red samples are available in its diagonal
neighbors. So, you can see that they are R2, R4, R12 and R14, and you take their average. This
is the second case for interpolating red.

B_7 = \frac{B_6 + B_8}{2}, \qquad B_{12} = \frac{B_6 + B_8 + B_{16} + B_{18}}{4}

The third case, as I mentioned — sorry, the B7 and B12 expressions above are simply the blue
channel handled in the same way — the other case of red that I was
mentioning, the one not shown here, let me tell you: it is basically the red row and green
column case. So, you consider this particular location; the missing component is red, and
with the location indexing convention we are following we call it R13. Then I should
take the average of the left neighbor and the right neighbor:

R_{13} = \frac{R_{12} + R_{14}}{2}

And for the blue case also, there is the blue row and green column case. One example is this
location, for which I can write, say, B19; this is just a typical
example using the indexes of those pixels. Anywhere that
satisfies this characterization of locations in the pattern, that is, the
type of row and the type of column, you should apply the same technique.

So, in this case you again have the left blue neighbor and the right blue neighbor.
So, that should be

B_{19} = \frac{B_{18} + B_{20}}{2}

This is how bilinear interpolation is carried out.
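To make the bilinear procedure concrete, here is a small illustrative sketch in Python (NumPy and SciPy assumed available). The function name, and the assumed Bayer layout with the top-left pixel green and the first row a red row as in the examples above, are my own choices for illustration, not code from the lecture; each kernel simply realises the neighbour averages described above.

import numpy as np
from scipy.ndimage import convolve

def bilinear_demosaic(cfa):
    # cfa: 2-D array of raw CFA samples; assumed Bayer layout (as in the
    # examples above): row 0 = G R G R ..., row 1 = B G B G ..., repeating.
    cfa = cfa.astype(float)
    h, w = cfa.shape
    rows, cols = np.mgrid[0:h, 0:w]
    green = ((rows + cols) % 2 == 0)            # green sample positions
    red   = (rows % 2 == 0) & (cols % 2 == 1)   # red rows are the even rows
    blue  = (rows % 2 == 1) & (cols % 2 == 0)   # blue rows are the odd rows

    # 4-neighbour average for green; cross/diagonal averages for red and blue.
    k_g  = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]], float) / 4.0
    k_rb = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], float) / 4.0

    G = convolve(cfa * green, k_g)
    R = convolve(cfa * red,  k_rb)
    B = convolve(cfa * blue, k_rb)
    return np.dstack([R, G, B])

At a pixel where a channel is already present the kernel reproduces the original sample, and where it is missing it produces exactly the 2- or 4-neighbour averages written above.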

(Refer Slide Time: 14:53)

So, next we will be discussing a technique where we would like to exploit the
correlation between the green and blue and the green and red samples. What this technique
considers is that at every point you can estimate a hue (the ratio of blue or red to green),
and you average over the corresponding hues.

First, interpolate the green pixels using the simple bilinear interpolation. In that case,
at every pixel you have the green value: at all the pixels where green samples
were missing, you have computed them through bilinear interpolation. Once you have
done that, you are able to compute the blue hue and the red hue at the
respective positions: wherever a blue sample is available you can compute a blue hue, and
wherever a red sample is available you can compute a red hue.

So, while interpolating the blue sample at a pixel where blue is missing,
you consider its neighboring blue hues, average them, and then multiply with the
respective green value at that pixel; that gives you the corresponding blue sample.
This is the modification it requires; similarly you do it for the red hue. So, it is nothing but
bilinear interpolation of hues once you have computed green, and then you convert back to
a pixel value by multiplying the respective hue with the green component.

So, let me describe it further. As you can see, the interpolation of the green samples is the
same as in the bilinear interpolation technique: you are simply averaging
the 4 neighbors of every non-green pixel location, where the 4 neighbors are always green
neighbors, as we can observe from this Bayer pattern.

Now we will discuss the interpolation of the red and blue pixels. There you find these
situations; we will consider the interpolation of blue pixels in
this case. As you can see here, the location where you are interpolating is
this (G7) location. In this location the green sample is available.
Its neighbors: since it is a blue row, in its left and right neighbors the two blue
samples are available. Since you have already interpolated green, we also have the
estimates of G6 and G8 available.

B_7 = \frac{G_7}{2}\left(\frac{B_6}{G_6} + \frac{B_8}{G_8}\right), \qquad B_{13} = \frac{G_{13}}{2}\left(\frac{B_8}{G_8} + \frac{B_{18}}{G_{18}}\right)

So, I can compute the blue hue at those neighboring locations and take the
average of them: you divide by two and then multiply with G7; that is how you
get the blue value at that location. Let us find out what you do at the other
locations. Consider the location of B13 as you can see here. This is a location where blue
is missing and we need to compute the blue component; once again, instead of the left and
right neighbors we have to consider the top and bottom neighbors, because this is in a red
row and green column.

So, in its top and bottom neighbors two blue samples are available. You can estimate
the blue hues, take their average, and multiply with the green value available at that
location. This is the second case. And the third case is shown here, where you have the
blue values in all the diagonal locations. You can consider this particular case; let
me use a different colour just to mark this location where the blue sample is missing
and you are going to estimate it. Once again, the diagonal neighbors contain true blue
values, and their greens are already estimated.

B_{12} = \frac{G_{12}}{4}\left(\frac{B_6}{G_6} + \frac{B_8}{G_8} + \frac{B_{16}}{G_{16}} + \frac{B_{18}}{G_{18}}\right)

So, you take the average of the hues at those locations; that means you sum those hues and
divide by 4, because there are 4 instances of estimated blue hue, and then you
multiply with the green sample. Here green is also an estimate, because this was a red
pixel location; the green value estimated in the green interpolation stage is used
to get the missing blue sample. This is how you compute the blue components
wherever they are not present in the color filter array. In the same way you can do it for the red
components; you perform the same operations for the red pixels.
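As a small illustrative sketch of this hue-averaging step (my own code, not the lecturer's), the blue value at a green pixel in a blue row can be computed from its two horizontal blue neighbours exactly as just described; the variable names are assumptions for illustration.

def blue_at_green_pixel(G_here, B_left, G_left_est, B_right, G_right_est):
    # Average the two neighbouring blue hues (B/G ratios), then scale by the
    # green value at this pixel to convert the average hue back to a blue value.
    hue_avg = 0.5 * (B_left / G_left_est + B_right / G_right_est)
    return G_here * hue_avg

# For the B7 case of the slide this is B7 = (G7/2) * (B6/G6 + B8/G8); the
# vertical and diagonal cases only change which neighbours are averaged.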

(Refer Slide Time: 20:37)

So, now I will discuss another very effective technique, which exploits gradient
information. Let me first discuss a particular property of interpolation which is
very commonly used: when you interpolate pixels, you should interpolate along
the direction of least gradient. You can see this particularly when you have
edges; let me give you an example through a sketch. Suppose you have a region
where on this side you have a brightness value, say I_0, and on this side you have a brightness
value, say I_1.

Suppose this pixel value is not available and these pixel values are available. Now, if I
interpolate along this direction, that is the direction of maximum gradient
change. As you can see, the interpolated value will lie between I_0 and I_1, and
there will be blurriness in the corresponding edges. Whereas, if I interpolate
along this direction, we will be preserving the edges, because those values are
similar; that is the direction of least gradient
change in this case.

So, we can consider that this should give you a better interpolation. There is a
principle by which you perform this interpolation along the directions of low
gradients. We have two natural directions in this pattern, by the
definition of the neighborhood: the horizontal and vertical gradient directions; of course,
you also have diagonal directions. So, other directions are also available, but depending
upon the availability of your samples, whether they are horizontal neighbors or vertical
neighbors or diagonal neighbors, you compute the corresponding gradients along
those directions.

So, whenever there is a possibility of choosing samples among those directions — for
example, in this particular case where we would like to compute green — you can take
the average of these two values in the horizontal direction, or you can take the
average of these two values in the vertical direction. Which one should you
choose? If you find that the vertical gradient is significantly smaller than the
horizontal gradient, which means there exists an edge, then you should prefer the vertical
direction.

And if you find there is not much of a difference, then you can use all
the neighbors. If it is the diagonal directions — we will come to that situation — there
we have to consider again two diagonal directions, which are also
perpendicular, and follow the same principle. So, estimation of the gradient is important here;
in addition, what this algorithm does is that this estimation is further refined by considering
the second order derivative of the distribution. I will now discuss this
algorithm in more detail.

As I mentioned, first you have to compute the green interpolation. We are
considering, for example, this pixel location. The gradient value in the horizontal
direction is the difference between these two green samples, and you take its absolute value,
because that is what is of interest.

H | G4  G6 |  | B5  B3  B5  B7 |

And then you can find that this term is also added to it, because what we are doing is
computing the difference of the differences, which means the second order
derivative. That means, if I take the difference B_7 - B_5, and then
take the difference B_5 - B_3, and subtract one from the other, (B_7 - B_5) - (B_5 - B_3),
then on taking the magnitude you will get the
same value as here. This is how you estimate it, but you are estimating the second
order derivative from the blue channel; as we mentioned at the very beginning of
today’s lecture, there is a high correlation in the higher order derivatives across the
channels.

So, we are assuming that the same derivative should also be observed in the green channel
when we estimate it from the blue channel. This is how the gradient is estimated
along the horizontal direction; similarly you can perform the estimation vertically. This
is just explaining the second order derivative of a function that I have just mentioned and
how you can compute it.

V | G2  G8 |  | B5  B1  B5  B9 |

The vertical gradient can be estimated in this form: you can see that the vertical green
neighbors of B5 are considered for the first order part of the vertical gradient, and the
second order derivative is also computed from the vertical column of blue
components.

Now, the algorithm is that you have to find out in which direction the gradient is
minimum. So, you have to compute G5 as follows: if \Delta H is less than \Delta V, then
you should consider the interpolation using the horizontal pixels. But what it does is this:
I could simply have taken the value (G_4 + G_6)/2, but it is further refined by a correction
from a cross-channel estimation of the Laplacian derivative:

if \Delta H < \Delta V:

G_5 = \frac{G_4 + G_6}{2} + \frac{B_5 - B_3 + B_5 - B_7}{4}

So, I will show you the final expression. This is the expression; you can see that the
estimate is refined by this second order derivative, or you may call it the Laplacian value, which
is estimated from the blue channel, the other channel available at this pixel, and it
is subtracted from the value of (G_4 + G_6)/2 (taking the Laplacian as B_3 - 2B_5 + B_7, which
is the same as adding the correction term written above).

This is one interesting thing, because if you use a Taylor series expansion, then you will
see that this value should actually be added during interpolation, but that performs
more poorly than this particular correction. This is the technique which has been
proposed in this particular work, and it is a very effective technique, as we have
found out.

So, this is the work that is referred to here, and it has been observed that the quality of
interpolation it gives is very good. The same procedure is then repeated at the other
locations: if the horizontal gradient is greater than the vertical gradient, then you should
use the vertical direction for interpolating green.

And otherwise, if they are equal, use all possible green values; that means all four
neighbors (not the diagonal ones in this case) of the non-green location. And then also use the
corresponding estimates of the derivatives along both the vertical and horizontal
directions and refine the estimate.

(Refer Slide Time: 29:18)

So, this is how the green pixel is interpolated, and the blue and red pixels are
interpolated in the same fashion once you already have green everywhere. Since you have
green values at every location, your cross-channel estimation
of the Laplacian should now be through green only. Once again you have the cases
for interpolating a red pixel. If it is at the location here, you can see this is the location
where the missing red is computed, and only the
vertical neighbors carry true red samples.

So, there is no question of comparing the horizontal and vertical gradients in this case,
because information is available in only one direction, and that is more reliable than the
other values. We will use those samples for the interpolation, and we will also
refine this estimate using the cross-channel Laplacian estimate, as mentioned. The
other case is when the horizontal neighbors are available at the missing red
location.

At this particular location you see only horizontal neighbors are available for
red; vertically there are blues. Again you do not do any comparison of horizontal and
vertical gradients; you use a similar kind of estimation process.

And the third case would be where the neighboring red samples are in the diagonal
directions. For example, this is the case where you have the neighboring red samples in the
CFA. Here you would like to apply the same technique of doing
the interpolation along the least gradient direction.

The only thing is that your estimation of the gradient should be along these two perpendicular
diagonal directions, and those estimates once again can be refined using the
Laplacian correction. What we define here, as you can see, are two directional measures;
one is called \Delta N. Along this direction, \Delta N uses R_1 - R_9, and you are also
considering the Laplacian estimate along it, using G_5, G_1 and G_9.

N | R1  R9 |  | G5  G1  G5  G9 | P | R3  R7 |  | G5  G3  G5  G7 |

Using these values you perform the Laplacian estimate here, and in the other
direction you do the same, and then you choose whichever direction gives the smaller value.
So, if \Delta N is less than \Delta P, then you perform the interpolation using R_1,
R_9 and the corresponding Laplacian correction. If \Delta N is greater than \Delta P, then you do
the same with R_3, R_7; and if they are equal, then in the same way you use all the diagonal neighbors,
take the average, and also apply the Laplacian correction.

If \Delta N < \Delta P:

R_5 = \frac{R_1 + R_9}{2} + \frac{G_5 - G_1 + G_5 - G_9}{4}

Else if \Delta N > \Delta P:

R_5 = \frac{R_3 + R_7}{2} + \frac{G_5 - G_3 + G_5 - G_7}{4}

Else:

R_5 = \frac{R_1 + R_9 + R_3 + R_7}{4} + \frac{G_5 - G_1 + G_5 - G_9 + G_5 - G_3 + G_5 - G_7}{8}

So, this is how this particular interpolation technique is carried out. So, let me give a
break at this point, we will continue this discussion in the next lecture.

Thank you very much for your attention.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 40
Color Fundamentals and Processing (Part Vll)

(Refer Slide Time: 00:25)

So, we continue our discussion on color interpolation. In the last lecture, I discussed
three particular techniques of interpolation: one is the bilinear technique, then averaging of
red and blue hues, and then the Laplacian-corrected edge-correlated technique. Their
acronyms are shown here as ARBH and LCEC for the latter two, and the bilinear one is BI.

Now, here I have shown a few examples of demosaicked results, just to explain how these
algorithms have been applied. What we did is that from the original color image, we
extracted the CFA image using the Bayer pattern, as I have described; so the color filter
array is simulated here from the original image, and we can compare against it. Then you
carry out the interpolation: you apply bilinear interpolation and observe
this particular result; this one is the averaging of red and blue hues; and this is the
Laplacian-corrected edge-correlated technique, LCEC.

So, you can see that visually they look almost similar, though on close examination
you may find there are some color artifacts and blurriness in some cases, and there is a
measure by which the quality of these techniques can be evaluated. Suppose you
consider this image, the reconstructed image, and this is your original image. You
take a pixel value at any location and the same location of
the original. Say this location is (x, y); the reconstructed
image has three components, \hat{I}_R, \hat{I}_G, \hat{I}_B.

So, these are the three color components of the reconstruction, and the original color image has the
corresponding components. What we consider is the error between the
color channels. For red we compute, say, (I_R(x, y) - \hat{I}_R(x, y)); if I take the
square of that error, that gives me the red component's error at that pixel, and the overall red
error is obtained by summing over all the x and y values.

That is the error of red, and you compute this error for all the channels, for blue
and for green as well. Over all the channels I take the average, which means that if there are
N pixels per channel and three channels, I will write this expression mathematically with a
channel index C, where C could be either red, green or blue.

So, what I will do is take the average of this error; that means every sample, every
instance, is considered for the squared deviation of the true value from the predicted value,
and we divide by the number of instances, which is 3N in this case; N is the size of the
image for every channel. That is,

E = \frac{1}{3N} \sum_{C \in \{R,G,B\}} \sum_{x,y} \left( I_C(x,y) - \hat{I}_C(x,y) \right)^2

So, this is the error, and then we define the peak signal to noise ratio. We
take the signal to noise ratio as \frac{255^2}{E}. This is the peak signal to noise ratio
expressed as a ratio; you can see that the signal strength has been taken as a
constant, which is a very convenient way of expressing it.

It gives a somewhat inflated measure, but it is very popular in image processing, and then
you express it in dB, which is 10 \log_{10}\left(\frac{255^2}{E}\right). When you express it in this
form, you can see that, taking the root mean square error, it is
sometimes also written as 20 \log_{10}\frac{255}{\sqrt{E}} dB.

So, this is the PSNR value. You can compute this PSNR value (it is not shown here in
my slides), and we find that LCEC gives the highest PSNR, which
shows that the LCEC technique is better than the other techniques in terms of
PSNR.
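As an illustrative sketch (not from the lecture), the error measure and the PSNR described above can be computed as follows; the function name is my own.

import numpy as np

def psnr(original, reconstructed, peak=255.0):
    # Mean squared error over all pixels and all three channels (3N terms in total).
    err = (original.astype(float) - reconstructed.astype(float)) ** 2
    mse = err.mean()
    # PSNR in dB: 10 log10(peak^2 / MSE), equivalently 20 log10(peak / RMSE).
    return 10.0 * np.log10(peak ** 2 / mse)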

(Refer Slide Time: 06:08)

This is another example of reconstruction. Once again the same experiments have been
carried out; we have only shown another example image.

(Refer Slide Time: 06:19)

So, let me discuss one particular exercise, once again to improve your understanding
of the estimation of missing spectral samples. Let us do this exercise. It
says: consider the following color filter array, whose first row corresponds to a red
row and the first column of that row corresponds to a green pixel of the Bayer pattern.
That means the leftmost and topmost position of the array, in this particular
display, corresponds to a red row and green column in our convention of
denoting any location.

Then you are asked to answer the following: what are the missing
components of the central pixel, which is shown in bold font and whose value
is 45 here; and then you should compute those missing components using bilinear
interpolation and averaging of red and blue hues.

So, these are the two parts of this exercise. This is the central pixel, as I mentioned.

(Refer Slide Time: 07:33)

So, first let us consider what the missing components of the central pixel are. In
this particular diagram, I have just displayed the corresponding pixel types, that is, which
spectral samples are available at the locations corresponding to the pixels. As
we have seen, in the first row we have a red row and green column.

Which means the pixel value available at that location is green, and then, periodically,
in that row in an interleaved fashion, green, red, green, red, in that way
the pixel samples are available. So, if I follow the Bayer pattern, then the central pixel,
which we have highlighted, is also a green column pixel and it is also in
a red row.

Since the type of column there is green, we consider
that only the green value is available. So, the missing components in this case
are red and blue. This is the central pixel and the missing components are red and
blue.

(Refer Slide Time: 08:50)

Now, we have to compute the missing components red and blue. First we apply
the bilinear interpolation technique. Here I am showing the part of the array, the pixel
samples, used for our computation, and I am also indicating those samples by color: the
red samples are shown in red and the blue samples are shown in blue.

So, for the missing red sample you simply take the average of 45 and 48, and for
blue you take the average of 25 and 27. That is what you need to do to get red
and blue with the bilinear interpolation technique.
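Working these out explicitly, the bilinear estimates at the central pixel are

R = \frac{45 + 48}{2} = 46.5, \qquad B = \frac{25 + 27}{2} = 26.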

(Refer Slide Time: 09:33)

Now, let us consider the other technique, averaging of red and blue hues. Here, once
again, I am showing those patterns, highlighting the pixel values which are
required for computing the corresponding red and blue interpolations, but we need
to compute the hue at those locations also; that means at this location you need to
compute the red hue, which means I should also know the
value of the estimated green there.

Similarly, what is the value of the estimated green here? So, I should know the green values
before computing the hues; that is what we require, and for the estimated green, once again, you
apply the same bilinear interpolation technique. The steps would be: first I estimate
the green values, and then I apply the averaging of red and blue hues. So, we compute
the missing G values at the respective B and R pixels.

For the B pixels these values are computed: for the top neighbor it is 44.25 and for
the bottom it is 46.25; and for the red pixels the green is computed using the
corresponding left and right neighbors.

(Refer Slide Time: 11:01)

So, (Refer Time: 11:03) just to show you what those values are, you can find them here; this
is the estimated green at those locations, following those rules. Now you can compute the
corresponding hues: compute the average of the hues and multiply with the green value. For
the missing blue, you can see these are the hues we are computing: 25/44.25,
corresponding to this location, and 27/46.25. Then you take the average by
dividing by 2, and then multiply by the corresponding green pixel value; finally
you obtain the missing blue component at this position, whose value is 25.847.
In the same way you compute red by considering its two neighbors; that value comes out as 47.

So, in this way you can get this result.
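Spelling out the blue computation with the numbers quoted above:

B = 45 \times \frac{1}{2}\left(\frac{25}{44.25} + \frac{27}{46.25}\right) \approx 45 \times 0.5744 \approx 25.85,

which agrees with the value 25.847 stated in the lecture up to rounding.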

(Refer Slide Time: 12:16)

So, there are two major problems with the reconstruction. I talked about blurriness of edges:
even though we can use the least-gradient directions, you can still have the edges blurred.
I have shown this by zooming into the interpolated image displayed earlier,
the dome of the Taj Mahal in that picture; you can see that the edges are quite
blurred in the zoomed portion, and there is also the appearance of false colors.

(Refer Slide Time: 12:48)

It is particularly severe when you have very high frequency transitions, as with the paints here,
and the color is white.

Since the color is white, all of red, green and blue have to be estimated very
accurately to represent white. Even a slight change in those estimations, a slight
error, will cause the appearance of different kinds of colors. This is called the false
color artifact, and those artifacts are quite visible when you have this kind of image.
This particular image is known for showing these artifacts, and it is also
heavily used for testing algorithms and evaluating the quality of interpolation
results.

So, these are the two particular concerns. For removing the blurring there are different
filtering techniques and various other post-processing techniques, which I am not going to
discuss here, since they require somewhat more involved concepts. Whereas for removing false
color, I can describe a very simple and effective technique, namely median
filtering.

(Refer Slide Time: 14:08)

But what are you doing in this case? You are performing median filtering in a
different color space: you are separating out the luminance and chroma components.
You could have done this in other color spaces as well, where they can be
separated, but I have used U and V, which are nothing but modified
complementary-blue and complementary-red components of the YCbCr transformation, a
linear transformation. There you perform a median filter on each chroma component, and after
that you transform the result back to RGB. The
interesting point is that, since you are filtering, some of the true pixel values
also get modified.

So, you should project the true samples back at those locations; that is another thing you
should do, and that is how you get the output image.
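A minimal sketch of this post-processing step is given below (Python with NumPy/SciPy; all names are mine). The transform matrix used here is a standard BT.601-style YUV separation, which only stands in for the modified chroma components mentioned in the lecture; writing the true CFA samples back afterwards is indicated by a comment rather than implemented.

import numpy as np
from scipy.ndimage import median_filter

# Assumed luminance/chroma transform; any invertible linear separation of
# luma and chroma would serve the same purpose.
M = np.array([[ 0.299,  0.587,  0.114],   # Y
              [-0.147, -0.289,  0.436],   # U (scaled B - Y)
              [ 0.615, -0.515, -0.100]])  # V (scaled R - Y)

def suppress_false_color(rgb, size=3):
    # rgb: H x W x 3 demosaicked image (float).
    yuv = rgb @ M.T
    # Median-filter only the chroma planes; leave the luminance untouched.
    for c in (1, 2):
        yuv[..., c] = median_filter(yuv[..., c], size=size)
    out = yuv @ np.linalg.inv(M).T
    # A full implementation would write the original CFA samples back into
    # their true positions here, as mentioned in the lecture.
    return out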

(Refer Slide Time: 15:10)

If you perform these operations, I can show you the effect here. You can see that this was
the image reconstructed using the LCEC technique which we have discussed. If I apply a
3×3 median mask, then you can see these artifacts getting reduced, and with a 5×5 mask they are
reduced as well, but there is some disadvantage in having a larger mask: it will blur the edges
too. So, there is a trade-off between these.

This is one example. In fact, we can also show how the noise gets reduced in the
corresponding U and V spaces.

(Refer Slide Time: 15:53)

This is a typical plot; it corresponds to the reconstruction of the lighthouse image
shown before. The errors in the reconstructed U and V samples
obtained with the original interpolation technique LCEC are plotted against the pixel
locations, as surface plots. The top one shows the surface plot of the error for the U
component, and the bottom one shows the surface plot of the error for the V component. When
we apply the median filter, we can see the magnitude of this error getting reduced.

So, this is the plot for the U component and this is for V; you can
see that it is significantly reduced. If you observe this part, it is significantly reduced, and
even this part is significantly reduced; for the U component it is not so prominent, but it is still
reduced, and there is a remarkable improvement in the PSNR value, which I am not quoting here;
an improvement of around 3 to 4 dB is reported.

(Refer Slide Time: 17:38)

With this I will be concluding our topic on color fundamentals and color processing.
Just to summarize the issues we have covered in this topic and the take-home
information: color is an important source of information for interpreting images and videos,
and an understanding of its representation is very important.

We have seen how color can be represented: it is captured in the RGB color
space, but that is not suitable for direct interpretation of color components such as hue and
saturation, and that is why there are spaces where you can separate them out. In
particular, the CIE, the international standards body for color, has recommended a
chart called the chromaticity chart, which represents colors in a 2-D space according
to the tristimulus model of color representation. So, they have standardized color
representation, and it is capable of providing the gamut triangle for reproducing colors.

There are various other color spaces that we discussed which are used for processing; in the
previous example I showed you processing in the YUV color space, but there are other spaces
too. Using the chromaticity chart itself, we have given some examples of processing;
that means you can convert from RGB to XYZ, from XYZ
to the normalized xy chromaticity space, and there you can process the hue and saturation
to get different kinds of information.

(Refer Slide Time: 19:32)

The last topic we discussed is color demosaicing. This is required when color
images are captured using a color filter array; most of today's widely used,
inexpensive digital cameras follow this principle of imaging, and an interpolation
process is built into the camera system itself.

So, the image needs to be interpolated to provide full color information if you get it in the
form of a color filter array. We have also discussed a few problems in color
processing. One of them is color enhancement through the saturation-desaturation
operation, and we have seen how it can be done with the help of the CIE chromaticity
chart; that is one operation. Then we also discussed computing color constancy, which has two
particular steps: estimation of the color of the illuminant, and correcting colors using the
estimated color.

Then, color transfer is a more general computation of transferring colors: the colors of a
source image are modified as if the scene had been illuminated by the illuminant of a
target image, where that target illumination is described by the target image itself.
So, there is the problem of color transfer; and, as the last example of processing, we
discussed color interpolation. These are the four typical kinds of processing of
color images that we discussed in this topic. With this, let me conclude my lecture on
color processing and color fundamentals; we will go to the next topic of range image
analysis in my next lecture.

Thank you very much for your attention.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 41
Range Image Processing – Part I

In this lecture we will start a new topic to discuss; this is on Range Image Processing.

(Refer Slide Time: 00:23)

So, let us first understand what is meant by a range image. Range data is a 2½-D (two and a
half D) or three-dimensional representation of the scene. It is called two and a half D
because you have only the surface information, in the form of discretized
points on the surface, represented as an image. The
representation is something like this: consider a function, say d(i, j), which
records the distance of the corresponding scene point. So, in the array representation of
an image, the value we get at a pixel of the range image is
not the intensity but the distance of the surface point.

So, this value d is recorded at that point, and this distribution of
surface points over the discretized space, as a functional distribution, gives me the
range data. Sometimes we also call this data depth data; that is another
nomenclature used for range data. It could also be provided as a set of 3-D scene
points, or a point cloud.

Just to explain once again: suppose the array index position is
(i, j) and the corresponding functional value is f(i, j); then the x, y, z coordinates of
that pixel are given as (i, j, f(i, j)) in a discrete 3-dimensional space; this is how the
corresponding location is given. If I consider the collection of all these points in this
representation, then we get a set of 3-D scene points.
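As a small illustrative sketch (NumPy assumed; the function name is mine), converting such a depth array into a point cloud of (i, j, f(i, j)) triples can be done as follows.

import numpy as np

def depth_to_point_cloud(depth):
    # depth: 2-D array where depth[i, j] = f(i, j), the recorded distance.
    h, w = depth.shape
    i, j = np.mgrid[0:h, 0:w]
    # Stack (i, j, f(i, j)) for every pixel into an N x 3 array of scene points.
    return np.column_stack([i.ravel(), j.ravel(), depth.ravel()])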

(Refer Slide Time: 02:18)

A range image is shown here: on the left side the corresponding intensity image is shown,
and on the right side the range data captured using the Microsoft Kinect sensor is
shown.

(Refer Slide Time: 02:35)

There are various principles on which range imaging works. We have already discussed the
stereo imaging system, and we discussed how 3-dimensional scene
information can be computed using it. That kind of system is a
passive imaging system, and it can also provide a range image; but there are
other kinds of systems where you get a high resolution range image, and those are active
range sensors, or active range sensing mechanisms.

Active in the sense that here you need an energy source apart from the scene: for
measuring the depth you have to project a ray from a particular source
of energy, and the computation of the depth is based on that illumination. There are three
different types of active sensing mechanisms: time of flight sensors, triangulation
based sensors, and structured light based sensors. Let me explain those principles one by one.

(Refer Slide Time: 03:50)

In a time of flight range sensor there is a light source, usually a laser, and you
transmit a pulsed laser from that point; the reflection gives you back
the corresponding laser signal. By detecting those pulses you
can measure the time of flight, the duration between transmission and reception of the
signal. If you know the velocity of light, which we all know, then
you simply multiply by that time, and that gives you twice the distance.

That is the principle; you have both source and detector co-located, because at the point
from which you are transmitting you are also detecting. So, you
directly get the shortest path from that location, since light travels along the shortest
path in that direction, and from that you get the depth. The relationship is very simple:
if you can measure this time, then you multiply by the speed of light to get
twice the distance, and divide by two to get the actual distance.
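Written as a formula, with \Delta t the measured round-trip time and c the speed of light, the distance is simply

d = \frac{c \, \Delta t}{2}.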

This light source, as shown here, is a pulsed laser, so it is a laser-based time of
flight range sensor. There are different kinds of such sensors, like LiDAR (light
detection and ranging) or LADAR sensors, and these laser-based sensors are quite
popular. And you are measuring the depth of only one surface point along a given direction at a time.

So, you have to scan the whole surface by maneuvering the direction of
the laser beam, and there are moving mirrors you can use to scan the
beam. There are mechanisms by which the mirror angles can be varied and,
accordingly, the direction of the laser beam varied; along a given predetermined
path you can find out the depth from where each reflection came, and that is
how you get the surface points.

There is a limitation of this kind of sensing: you are limited by the minimum
observation time, because the mechanism by which you detect a laser pulse
has a finite pulse duration, that is one thing, and another
thing is the sampling interval at which the pulsed laser is emitted.

These limit your observation time, and thus the minimum
distance that is observable is also limited by that minimum
observation time.

(Refer Slide Time: 07:08)

The other kind of sensor, the other principle, is the triangulation based sensor. You can see
in this particular diagram that there is a source of a laser beam; it could be any
light source, but usually lasers are used in this case also. You project that light on
a particular surface point, and then the reflection of that light is captured by a
camera.

In the camera images you can get the point of the reflected ray. If both
the camera and the light source are calibrated, then you can get the equations of these two
lines, as we have discussed in the previous lectures on stereo geometry: if I can get
the directions of these 2 rays in 3 dimensions, then the intersection of these two lines
gives the corresponding three-dimensional point.

This imaging system calibrates everything. From the camera calibration we can
find out which pixel in the camera's image plane is illuminated, that is, from where
the reflected light is coming to the camera. So, from the camera
parameters you can get the equation of this ray, its direction, in the
three-dimensional coordinate system.

Similarly, by knowing the coordinate location of the laser source and, in the calibrated
setup, by knowing the value of (i, j) with which you are transmitting this laser or light, the
direction itself is also encoded, or known to the system. Then you can get
the equation of that particular ray, and by solving the two equations you can get
the three-dimensional scene information.

There are various ways by which you can determine the direction along which this ray has
been transmitted; those variations exist. So, this is what is done: for a
predetermined scanning path of the beam, the illuminated point is observed in the camera, and
you apply triangulation, solving the equations, to get that 3-D point.
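To make the triangulation step concrete, here is a small illustrative sketch (my own code, not from the lecture) that intersects two calibrated rays; since noisy rays rarely intersect exactly, it returns the midpoint of the shortest segment joining them.

import numpy as np

def triangulate(p1, d1, p2, d2):
    # p1, d1: origin and direction of the projector ray; p2, d2: camera ray.
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    # Closest points on the two lines p1 + t1*d1 and p2 + t2*d2.
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    w = p1 - p2
    denom = a * c - b * b              # zero only for parallel rays
    t1 = (b * (d2 @ w) - c * (d1 @ w)) / denom
    t2 = (a * (d2 @ w) - b * (d1 @ w)) / denom
    # Midpoint of the common perpendicular approximates the 3-D scene point.
    return 0.5 * ((p1 + t1 * d1) + (p2 + t2 * d2))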

(Refer Slide Time: 09:55)

Now, this same principle is used repeatedly because, as I mentioned, at a particular time you
get information only for one surface point along a given direction. You have to vary
the direction: you have to scan the ray projected from the light source along a
predetermined, calibrated path, and at every instant you should know the
coordinates (i, j) in the projector's plane, which give you the corresponding equation of this
ray as a straight line. You also observe the image point in the camera due to this
illumination, and since the illuminated point is easy to discriminate, you can easily identify
this particular point and apply the triangulation
law.

(Refer Slide Time: 10:50)

Now, the technique of triangulation using the principles I discussed is
quite slow, because you have to shoot a ray for every point of the
projector's plane, what we may call its image plane. This is the plane where
you identify the particular directions; it acts like the image plane of a camera, but for
the light source. So, instead of projecting a single ray, what can you do? You can
project a vertical light stripe on the surface.

Which means that all the directions along this vertical line are encoded
into the stripe, and when you sense it, it may appear distorted,
depending upon the surface; it may not be very straight. From this point, which
is calibrated, you know that this corresponds to this one, which has been
predetermined and is known from the imaging system itself, and
this point corresponds to this one. And if you interpolate along this path, then
every point corresponds to a particular point on that light stripe.

So, you need not project multiple times; from a single projection you can get
information about all the surface points lying on this light
stripe. This is called a structured light range sensing system, because the light itself has
been structured into a pattern; vertical stripes are a very familiar
pattern. In this pattern the 3-D position of the projected ray is already encoded,
and those encodings are used to determine the surface points. In summary, you
get the 3-D positions of all the scene points lying on the projected stripe by this
mechanism.

(Refer Slide Time: 13:11)

There is another way by which structured lighting can be better utilized. Instead of
projecting single vertical stripes one at a time and scanning over the surface, which
requires a larger number of projections, you can encode the
position of a vertical stripe in terms of m projected patterns.

Each pattern identifies a zone of the surface, and finally, by observing
whether a pixel is illuminated or not illuminated by the corresponding
projection rays in the m patterns, you can distinguish 2^m stripes. In
this way, using just m patterns, you can have 2^m distinguishable stripes on the projected
surface, and for each one you can again apply the law of triangulation, as we discussed
for structured lighting.

Just as an example: the number of stripes in the pattern usually gets doubled
with every new pattern; this is one particular type of structured light
pattern, and each light point is associated with an m-bit binary code. Its image is
observed in the camera, and from there you can solve the
triangulation equations.
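A tiny illustrative sketch of this decoding idea (my own code): from the m binary observations of a camera pixel across the m patterns, the stripe index is just the integer encoded by that bit string.

def stripe_index(bits):
    # bits: the m observations (1 = illuminated, 0 = dark) for one pixel,
    # ordered here from the coarsest pattern to the finest (an assumed convention).
    # The m-bit code word identifies one of 2**m stripes.
    index = 0
    for b in bits:
        index = (index << 1) | int(b)
    return index

# Example: the code word 1, 0, 1 mentioned below maps to stripe 5 out of 8.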

(Refer Slide Time: 14:55)

This diagram shows how any particular stripe is
encoded using the m patterns. As you can see here, consider this ray and consider a
vertical stripe here. This stripe is encoded as follows: in this pattern, say, it is
illuminated, which is 1; whereas in this pattern this particular stripe is 0; and once again this
is 1.

So, if you are using m patterns you will have m such observations for a stripe, which gives you
a binary string of length m, and so you get 2^m such discriminable codes. The code
word could be 101, and so on, depending upon the kind of patterns you place there,
and that identifies the corresponding stripe. From the stripe, once again using the
triangulation rule, you can get the depth information for all the scene points
lying on that particular stripe on the surface.

(Refer Slide Time: 16:10)

Now, there are other kinds of codification of the direction of the
projected ray, in which you do not project multiple times. What you can do
is simply illuminate only once, but with a variation of the spatial pattern over the grid of
projected rays, and each variation appears uniquely at a particular location. Which means,
if I consider an image, suppose you have a projected ray, and as I was
mentioning, you are trying to identify the direction of the ray from a calibrated light source.

This location is modulated by different shapes of patterns through which the light
is projected. Those will be visible on the surface; visible means they can be
detected — they may not be visible to the naked eye, but they can be detected by the camera
sensor. These patterns vary in shape, in colour and in size,
and the pattern uniquely identifies, for example, the central point. At
other locations you will have other kinds of variation, so each pattern at a
location uniquely identifies that location.

There is a limit to the spatial resolution at which you can encode these directions; that
is the limitation of this technique. However, the advantage is that a single
projection can give you all the encoded rays illuminating the object. On the
object you have all the encoded rays simultaneously, with their patterns projected,
and you observe those patterns wherever they appear in your image. You just match each
observed pattern against your expected pattern library; then you know that this is the image
of a particular location, and then you can apply the triangulation.

(Refer Slide Time: 18:53)

This technique is very effective: it makes the imaging very fast, and your imaging
system also becomes cheaper following this technology, although the resolution of the images
will be lower, as you can understand. As I mentioned, the pattern could be spatially
arranged dots varying in size and color.

This is one example of a spatial pattern which uniquely characterizes a
neighborhood. For example, if I consider this location, the arrangement around it
may be unique; you would not get any other location which has a similar
arrangement of colors or colored dots.

There are tricky ways by which you can design these patterns, and when you detect the
corresponding pattern in the camera you know
that it is the image of this location. This location is already encoded: there is a particular
location, or direction of the ray, given once again by two indices of the plane
of projection, where the rays are discretized, and using that you can
apply triangulation to solve it.

So, a code word is associated with each calibrated light point, and it does not
require multiple projections over the object; that is the advantage. This code word is
obtained from a neighborhood of the point around it, as I explained while
discussing its principle.

(Refer Slide Time: 20:37)

Once again I am showing the same example of a range image here, because this particular
sensor, the Microsoft Kinect sensor, uses infrared laser
light with a speckle pattern to obtain the surface points and to encode the corresponding
light points, as I mentioned, using spatial patterns. You can see that both images are
captured by the same device: it has the corresponding optical image, since the Kinect also has
an optical camera, and along with it there is a depth sensor which captures range
following this principle. That is why the images captured by the
Kinect are known as RGBD images; here RGB stands for the red, green and blue
components of the optical image and D stands for depth, for the range image.

(Refer Slide Time: 21:39)

When we discussed these techniques, we mentioned that the light source mostly used for
this kind of projection is the laser. You know the full form of LASER:
Light Amplification by Stimulated Emission of Radiation. Now, why are lasers
used? You could have used an ordinary light source for the structured light and for
the triangulation schemes; of course, for time of flight you require a coherent
source, where lasers have to be used.

But in the other cases also, lasers are mostly used in preference to ordinary
light. The reasons are, first, that they easily generate bright beams with lightweight
sources; that is one advantage of lasers. They can also generate infrared beams; infrared
lasers are used for generating those beams, which can be used
unobtrusively, meaning that a viewer will not be disturbed by the speckle
patterns. The viewer will not get distracted, normal activities will not be hampered, and the
patterns will not obstruct the viewing of the objects in the scene.

That is why infrared beams are mostly used for generating these patterns
nowadays, particularly in the Microsoft Kinect, and in activities like gaming the
Kinect camera can be used very easily without disturbing the participant or viewer.

Another good feature of lasers is that they focus well to give narrow beams.
Also, the laser light is of a single frequency and is easier to detect; that is
one advantage. It does not disperse under refraction as much as full-spectrum sources, which
is another advantage. And there are semiconductor devices which easily generate
short pulses.

These are the reasons why lasers are used instead of any other ordinary light source in
range sensors; even in the sensors where there was the option of using
other light sources, such as the structured light based sensors we discussed, lasers are still used.

So, let me take a break here. So far we have discussed different
sensing mechanisms, and from the next lecture we will be discussing the theory and
fundamentals for processing range data. In particular, we will have a brief
discussion of different concepts of differential geometry which are useful for
characterizing the surface points of range images. With this let me stop here.

Thank you very much for your attention.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 42
Range Image Processing – Part II

(Refer Slide Time: 00:23)

We are discussing Range Image Analysis; in the last lecture we discussed different
mechanisms for sensing range images. In this lecture we will be discussing different
concepts of differential geometry which are useful in processing range data. Let
me first explain how parametric curves in 2D are represented mathematically. You consider
a parameter whose values lie in an interval of the real line, and you map this parameter
value to a coordinate of a two dimensional space.

So, it is a mapping R → R^2, and in parametric form we represent this
parameter as, say, t, a variable which takes values from an interval of the real line,
for example the interval between 0 and 1. For each parameter
value in that interval you get a two dimensional coordinate, denoted here
as the point X(t), with its x coordinate u(t), a function of the parameter, and v(t), the
y coordinate, which is also a function of this parameter.

This is a parametric description, a very simple way to describe a curve as a set of
points; it is a continuous curve when the parameter varies continuously and these
functions are also continuous in t. So, a continuous curve can be
represented in this fashion. Now, because of the continuity we can
compute derivatives at those points, provided the curve is differentiable at least to order
one; it could be differentiable to order two or higher as well. When it
is a curve of order at least one, you can compute the tangent, which means
mathematically you can easily get the tangent information at those
locations from the corresponding derivatives of the parametric functions.
This gives you a vector, actually the direction of the tangent with respect to
that point, and you can also compute
the curvature at that point by performing these computations: you
need to compute the first and second derivatives of the parametric curve and
then perform the operations shown.

You can see that the numerator is nothing but a determinant derived from the
first and second derivatives of the parametric curve, which is shown here as the
determinant

\begin{vmatrix} \dfrac{\partial u}{\partial t} & \dfrac{\partial v}{\partial t} \\ \dfrac{\partial^2 u}{\partial t^2} & \dfrac{\partial^2 v}{\partial t^2} \end{vmatrix} = u'(t)\,v''(t) - v'(t)\,u''(t)

Given a particular value of t, each entry is a scalar, and then you take the
determinant of this matrix. And the denominator involves the magnitude of the tangent vector.

So, if your tangent vector is (u'(t), v'(t)), as shown here, its magnitude is
\sqrt{u'(t)^2 + v'(t)^2}; that is the magnitude of X'. When you
raise it to the power three, you get the cube of this amount,
\left(\sqrt{u'(t)^2 + v'(t)^2}\right)^3. This is the quantity in the denominator, and this
is how the curvature can be computed at any parameter value:

|X'|^3 = \left(\sqrt{u'(t)^2 + v'(t)^2}\right)^3
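As a small numerical check (my own illustrative Python, not from the lecture), the curvature formula can be evaluated with finite-difference derivatives; for a circle of radius r it should return approximately 1/r everywhere.

import numpy as np

def curvature_2d(u, v, t):
    # u, v: samples of a parametric curve (u(t), v(t)) at the parameter values t.
    du, dv = np.gradient(u, t), np.gradient(v, t)        # first derivatives
    ddu, ddv = np.gradient(du, t), np.gradient(dv, t)    # second derivatives
    return np.abs(du * ddv - dv * ddu) / (du**2 + dv**2) ** 1.5

# Check on a circle of radius 2: the curvature should be close to 1/2.
t = np.linspace(0.0, 2.0 * np.pi, 2001)
print(curvature_2d(2.0 * np.cos(t), 2.0 * np.sin(t), t)[1000])   # ~0.5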

(Refer Slide Time: 05:06)

(Refer Slide Time: 05:07)

So, that was simply explaining that part. Now, let me consider the parametric
representation of a curve in three dimensions. Like in two dimensions, all the
notations are simply extended in this context, where the parametric
curve representation is a mapping from a one dimensional real space to a three
dimensional real space, R → R^3.

Again, the parameter varies within an interval I of the real line, and you have an
additional coordinate, the z coordinate, which is shown as the function w(t) here, in
addition to the u and v functions for the corresponding x and y coordinates. Similarly, the
tangent, which in 2D is computed by taking the derivatives of the respective functions, is
obtained in the same way here. And for the curvature also, you see that the expression remains
the same; only its interpretation is a bit different: you have the cross product of the vectors X'
and X''.

X ' X ' '


k (t )  3
X'

So, this is cross product of three vectors you take the double derivatives which means in
 u ' (t ) 
this case you are taking  v' (t )  . So, that is the vector notation, so I am using column
 w' (t )

vector notations in this lecture slide everything is shown as row vector, but throughout I
have used column vector notation (u ' (t ), v' (t ).w' (t ))T . So, you should considered you
know in some cases I will be using column vector notations for doing the exercises. So,
let us considered this is this vector this is what is X ' (t ) and X ' ' (t ) which is given as the
double derivative with of this function.

 u ' (t )   u ' ' (t ) 


X ' (t )   v' (t )  X ' ' (t )   v' ' (t ) 
 w' (t )  w' ' (t )

So, given a value of t you can get this three vectors and then you perform the cross
3
product that is the X ' interpretations. And similarly X prime cube magnitude we have

shown in the two dimensional case you have to extend that concept of these vector. So,
take the magnitude of these vector and raise it to the power 3 that would give you the
denominator that is how this curvature is completed.
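A similar minimal sketch for the three dimensional case (Python/NumPy); the helix used for the check is my own example, not from the slides:

```python
import numpy as np

def curvature_3d(dX, ddX):
    """Curvature k(t) of a space curve X(t) = (u(t), v(t), w(t)),
    given the first derivative vector dX = X'(t) and the second derivative ddX = X''(t)."""
    dX, ddX = np.asarray(dX, float), np.asarray(ddX, float)
    return np.linalg.norm(np.cross(dX, ddX)) / np.linalg.norm(dX) ** 3

# Check on a helix X(t) = (a cos t, a sin t, b t); its curvature is a / (a^2 + b^2).
a, b, t = 1.0, 0.5, 0.3
dX  = [-a*np.sin(t),  a*np.cos(t), b]
ddX = [-a*np.cos(t), -a*np.sin(t), 0.0]
print(curvature_3d(dX, ddX))  # ~0.8 = 1 / 1.25
```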

(Refer Slide Time: 07:48)

So, now we will discuss a parametric curve that lies on a surface. In the previous case we considered an arbitrary parametric curve; but before relating it to a surface, note that if we look at the curve independently, our previous analysis and representations are all still valid. We represent it as a parametric curve with three functions of the parameter for the x, y, z coordinates, given as u(t), v(t), w(t).

Similarly, its tangent is given by the first derivatives, and the curvature is computed at every point from the two terms $X'$ and $X''$: take their cross product and then the magnitude of the cross product.

To repeat, after computing the cross product you take its magnitude, and in the denominator you compute the tangent by taking the first derivative, take its magnitude, and raise it to the power three. Dividing these two terms gives the curvature; that is what we discussed.

So now, let us see what other kinds of concepts arise with respect to a surface. Consider a particular point of a curve and the tangent direction at that point, which is given by T(t) in our notation, and consider points very near to it.

So, take a point which is infinitesimally close to it on the curve. This tangent and the neighbouring portion of the curve together define a plane. It is a kind of tangential plane in which both the tangent and the curve locally lie, and it is actually called the osculating plane. We will come to the nomenclature later; first let me define these concepts.

First, in that plane where the tangent lies and the curve locally lies at that particular point, you can define the normal, which also lies in that plane: the normal is perpendicular to the tangent and lies in the osculating plane.

(Refer Slide Time: 10:25)

So, this is the normal of the curve: this point is P, this is the tangent I mentioned, and this is the normal. You can also consider another plane which is perpendicular to the tangent plane; the tangent plane is the osculating plane, as I mentioned, and this is the normal as I have shown here. The red colored plane is the plane perpendicular to the osculating plane, and it is called the normal plane; we will come to that. The curvature at that point is indicated here by this quantity: the radius of the osculating circle is the inverse of the curvature.

The centre of that circle, which is determined at that point by the curvature, is shown, and it must lie on the normal. This plane is the normal plane, as I was mentioning: the normal plane is the plane perpendicular to the osculating plane. So, you have these two planes; now you can have another plane which is perpendicular to both of them, and that is called the rectifying plane. These are the three planes defined with respect to a point P by considering the direction of the tangent, the osculating plane, and the normal of the curve at that point.

So, the binormal is the direction which is perpendicular to both the normal and the tangent of the curve. The binormal lies in the rectifying plane and also in the normal plane; it is the intersection of the normal and rectifying planes. So, you can see that at every point you have effectively defined a coordinate system: three mutually perpendicular axes that define a local coordinate system.

This is called the moving trihedron or Frenet frame. In this configuration the relationships between the tangent, normal and binormal can be expressed through differential geometric operations. Here all are unit vectors. If you take the derivative of the tangent you get a vector along the normal of the curve whose magnitude is the curvature; it is not a unit normal vector, but a vector along the curve normal with magnitude equal to the curvature. Similarly, if I take the derivative of the binormal, the vector you get is once again along the normal of the curve, but its magnitude is governed by a quantity $-\tau$, where $\tau$ is called the torsion of the curve. And the derivative of the normal is a linear combination of the directions of the tangent and binormal; you can see that it lies in the rectifying plane.

So, it is some direction in the rectifying plane, and that is how it is expressed. These are the three, I should say, very fundamental relationships between the change of directions of the tangent, binormal and normal and the curvature and torsion of the curve.
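For reference, the three fundamental relationships just described are the Frenet–Serret formulas; written with respect to the arc length parameter s (a standard result, stated here for convenience rather than copied from the slide), they read:

$$\frac{dT}{ds} = \kappa N, \qquad \frac{dB}{ds} = -\tau N, \qquad \frac{dN}{ds} = -\kappa T + \tau B$$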

(Refer Slide Time: 14:20)

So, we discussed the representation of a parametric curve and several concepts: the osculating plane, normal plane, rectifying plane, normal, binormal, curvature and torsion of the curve. Let us now consider the representation of surfaces, because range data, as you understand, is mostly surface data; of course a curve would lie on a surface, and you can also try to treat the surface in totality around a point. For a parametric surface representation, once again, you require two parameters.

A surface is a two dimensional entity, so it is a mapping from a two dimensional real space to a three dimensional real space ($R^2 \rightarrow R^3$). Once again the parameters may vary within certain finite ranges, so you can consider a mapping from a subset of the two dimensional real space to the three dimensional real space. A surface point is represented by its x, y, z coordinates, and each one is a function of the two parameters u and v.

So, let us consider a curve lying on the surface; at a particular point I can get a direction, as shown here. Suppose this is a point P and I consider a curve which varies over u only, while v is kept constant. This is how a curve can be defined: when one parameter varies and the other is constant you get a single-parameter description of points, and it becomes a curve on the surface. Then consider another curve passing through that same point, where v varies and u is constant; for that you get another tangent direction.

So now, consider the normal at the surface point. $X_u$ and $X_v$ are both tangents, and there is a plane touching the surface at that point in which all these tangents lie. That is called the tangent plane of the surface, and the normal to that tangent plane is the normal at that point, which is called the surface normal.

(Refer Slide Time: 17:08)

So, if I denote it by N, you can see how the surface normal can be computed using these two tangents, because it is perpendicular to the vectors $X_u$ and $X_v$. You can simply take their cross product to get the direction of the normal, and if you normalize that vector you get the unit vector along the normal. That is how the normal at this point is computed, and it gives you the unit normal vector:

$$\hat{N} = \frac{X_u \times X_v}{\left\| X_u \times X_v \right\|}$$
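A minimal Python/NumPy sketch of this surface normal computation; the paraboloid tangents used below are an illustrative example of my own:

```python
import numpy as np

def surface_normal(Xu, Xv):
    """Unit surface normal from the two tangent vectors X_u and X_v."""
    n = np.cross(np.asarray(Xu, float), np.asarray(Xv, float))
    return n / np.linalg.norm(n)

# Example: the paraboloid X(u, v) = (u, v, u^2 + v^2) at (u, v) = (1, 0),
# where X_u = (1, 0, 2u) and X_v = (0, 1, 2v).
print(surface_normal([1, 0, 2], [0, 1, 0]))   # ~[-0.894, 0, 0.447]
```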

(Refer Slide Time: 17:37)

So, these are the things we discussed. Now, how can we represent a curve on the surface in parameterized form? I mentioned that if you keep one parameter constant and vary the other, you get a curve. But in general we can let both parameters u and v be described as functions of a single parameter.

So, let us consider another variable t, which again maps values from an interval of the real line to the two dimensional parameter space, giving coordinates that are combinations of u and v. This curve $\epsilon(t) = (u(t), v(t))$ itself gives you a parametric curve on the surface, because using these values of u and v for a particular t you get a point of a curve on the surface. That is how you can represent any general curve on the surface: you need not keep one parameter constant while the other varies; you can vary both parameters. But there is a functional relationship in that variation, and that relationship is denoted by the function $\epsilon(t)$, which defines the parametric curve.

So, if I expand this, corresponding to $\epsilon(t)$ we get a curve on the surface whose x, y, z coordinates are given in this fashion. The tangent vector to this curve $\epsilon(t)$ is obtained by a familiar differential operation, the chain rule: you take the partial derivative of X with respect to u and multiply by the derivative of u with respect to t; similarly, you take the partial derivative of X with respect to v and multiply by the derivative of v with respect to t:

$$X'(t) = X_u\, u'(t) + X_v\, v'(t)$$

Each partial derivative gives you a vector: $X_u$ is a vector and $X_v$ is a vector, whereas $u'(t)$ and $v'(t)$ are scalar functions of t, so at any particular value of t they are scalar values. So, the tangent is a weighted combination of these two vectors. When you perform the partial derivative you get $X_u = (x_u(u,v),\, y_u(u,v),\, z_u(u,v))$, and similarly for $X_v$; in this way you get vectors. The resultant of these two weighted tangent vectors $X_u$ and $X_v$ gives you the tangent along that parametric curve.

(Refer Slide Time: 21:03)

So, these are some of the key concepts of differential geometry characterizing surfaces, related to gradients and curvatures. The first fundamental form relates to the tangent directions at a surface point and measures the magnitude of change along the tangent at that point.

This is a bilinear form that associates two vectors in the tangent plane through a dot product. The definition of the first fundamental form is: if I consider two vectors u and v in the tangent plane and take their dot product, that gives the first fundamental form. With respect to a parametric curve $\epsilon(t)$, it can be expressed as the dot product $t \cdot t$ of the tangent with itself; note that here t denotes the tangent of the curve, not the parameter.

So, you take the dot product of these two tangents; we have already discussed how the tangents are computed, so expanding them you get

$$I(t, t) = E\,u'(t)^2 + 2F\,u'(t)\,v'(t) + G\,v'(t)^2$$

and this quantity is the first fundamental form. Note that the parameters E, F and G, which characterize the first fundamental form, are expressed as dot products of the partial derivatives of the surface point with respect to u and v:

$$E = X_u \cdot X_u, \qquad F = X_u \cdot X_v, \qquad G = X_v \cdot X_v$$

Just to expand: you have $X(u, v)$, represented as $X(u, v) = (x(u, v),\, y(u, v),\, z(u, v))$. If I take the partial derivatives of all these component functions at a given (u, v), I get a vector; that is how $X_u$ is a vector, and so is $X_v$. E is the dot product of $X_u$ with itself, i.e. its squared magnitude; similarly you compute $X_v$ and take the dot products to obtain F and G. So, the interpretation of the first fundamental form is that it gives the squared magnitude of the tangent vector.
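As an illustration (not from the lecture slides), a minimal Python/NumPy sketch of computing E, F, G from the two tangent vectors; the paraboloid used as a test surface is my own example:

```python
import numpy as np

def first_fundamental_form(Xu, Xv):
    """Coefficients E, F, G of the first fundamental form from the tangents X_u and X_v."""
    Xu, Xv = np.asarray(Xu, float), np.asarray(Xv, float)
    return np.dot(Xu, Xu), np.dot(Xu, Xv), np.dot(Xv, Xv)

# Example: paraboloid X(u, v) = (u, v, u^2 + v^2) at (u, v) = (1, 0),
# where X_u = (1, 0, 2) and X_v = (0, 1, 0).
E, F, G = first_fundamental_form([1, 0, 2], [0, 1, 0])
print(E, F, G)  # 5.0 0.0 1.0
```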

(Refer Slide Time: 24:11)

The other concept used in characterizing surface points and the local surface geometry is the second fundamental form, and it is related to curvature; let us see the definition.

As we have already discussed, we defined the surface normal as the normal to the tangent plane, which can be derived from the tangents $X_u$ and $X_v$ of the two curves obtained by varying only u with v constant and varying only v with u constant. In this particular diagram it is shown that if I have a curve along which the point varies, the normal changes its direction as we move; that change of normal is expressed by a vector dN, which is what is shown here. Let us continue.

The definition of the second fundamental form is that it is a function of two tangential directions u and v at the point, given by the dot product of the vector u with the vector dN(v), the change of the surface normal as you move along v.

Here u can be the tangent of another curve; it need not be the same direction. Now, if I take both u and v to be the tangent t of the curve, we get the following relationship. Since the tangent is perpendicular to the surface normal, we already know $t \cdot N = 0$; taking the derivative with respect to v gives

$$\frac{dt}{dv} \cdot N + t \cdot dN(v) = 0$$

so this is the derivative relation we obtain by differentiating with respect to v.

And we have already seen that if I take the derivative of the tangent, I get a vector along the normal of the curve whose magnitude is the curvature at that point. Substituting this in the first term, the remaining term is exactly the second fundamental form, and from there I can derive that the second fundamental form can be expressed in terms of the curvature k and an angle $\theta$; since the vectors involved are unit vectors, their dot product is $\cos(\theta)$:

$$\frac{dt}{dv} \cdot N + t \cdot dN(v) = 0$$
$$k\,\hat{n} \cdot N + t \cdot dN(v) = 0$$
$$II(t, t) = -k \cos(\theta)$$

So, this is how the second fundamental form is interpreted; note that the curve normal $\hat{n}$ and the surface normal N are unit vectors, so $\theta$ is the angle between the curve normal and the surface normal.

If you have a normal section, that is, when your curve is the intersection of the surface with the normal plane itself, then this angle becomes 0. So, it is something like this: you have a normal section, the curve lying in the normal plane, and this is the tangent plane. The curve normal and the surface normal are then at either $0^\circ$ or $180^\circ$, depending upon the topology at that point, and their dot product is either 1 or -1. In this way you get the curvature of the corresponding surface curve, and this is called the normal curvature. With this, let me stop here for this lecture; we will continue our discussion on differential geometry in the context of analyzing range images and its applications for characterizing surfaces in the subsequent lectures.

Thank you very much for your attention.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 43
Range Image Processing – Part III

We continue our discussion on differential geometry based analysis of local surface geometry.

(Refer Slide Time: 00:29)

In the last lecture we discussed how the normal at a surface point can be computed, and we defined two particular entities describing the local surface characteristics: the first fundamental form and the second fundamental form. The first fundamental form essentially measures the magnitude of change at the point along the tangent vector, and the second fundamental form is related to the computation of curvature. We will continue our discussion on this aspect.

(Refer Slide Time: 01:11)

So, we are using the parametric description of a surface point, say with u and v as the parameters, and the tangent at a surface point can be computed in this form, where u and v can again be expressed in terms of a parametric curve; this is how we induce a curve on the surface using this parametric representation. So, $\epsilon(t)$ once again defines the surface point through $(u(t), v(t))$ for t within a certain range; it could be 0 to 1, or any other real interval.

With this description, at a particular value of t we can compute the tangent of the curve by taking the derivative of $\epsilon(t)$. The derivatives are computed using the chain rule, as we have seen: X is differentiated with respect to u and then u with respect to t, and similarly for v. This is how the tangent vector is computed at the point; we have already discussed this earlier.

Then the second fundamental form is defined in this fashion, as $t \cdot dN(t)$; that means, as we move along the tangent, we measure how the normal changes its direction. That changing vector represents the change of normal, and you take the dot product of the tangent with it to get the second fundamental form, which is a measure of local curvature. From the functional point of view it is defined as follows: with the same descriptions here you can see e, f and g, which are again three parameters related to these functions, and this is the definition of e, f, g, the parameters of the second fundamental form.

They are dot products of the normal at the surface point with second derivative vectors: for e, the second derivative along the constant-v curve ($X_{uu}$); for f, the mixed derivative, first with respect to u and then with respect to v ($X_{uv}$); and for g, the second derivative with respect to v ($X_{vv}$). We understand this kind of interpretation from differential calculus; that is the definition. The normal curvature then has to be normalized with respect to the magnitude of the tangent; it is the second fundamental form that gets normalized.

(Refer Slide Time: 04:45)

And these are the expressions. This is the normal curvature, which is the second fundamental form normalized by the first; as you can see, this is the expression for the normal curvature at a surface point for any particular curve described by the parameter t, along the tangent direction of that curve:

$$\kappa_n = \frac{e\,u'^2 + 2f\,u'v' + g\,v'^2}{E\,u'^2 + 2F\,u'v' + G\,v'^2}$$

For any tangent direction this is the definition; e, f and g are the quantities defined earlier, and E, F and G are related to the first fundamental form.

Recall that $E = X_u \cdot X_u$, $F = X_u \cdot X_v$ and $G = X_v \cdot X_v$, where X is the surface point. E and G are the squared magnitudes of the tangents along the constant-v and constant-u curves respectively, and F is the dot product of these two tangents. These are the interpretations from the functional point of view. So, let us continue with how this can be further used for computing curvatures locally.
for computing curvatures locally.

(Refer Slide Time: 06:51)

So, there is a concept called the linear map built from these parameters, and this linear map contains all the information necessary for the computation of curvature. It is defined as the matrix of second fundamental form parameters multiplied by the inverse of the matrix of first fundamental form parameters:

$$\text{Linear map} = \begin{bmatrix} e & f \\ f & g \end{bmatrix} \begin{bmatrix} E & F \\ F & G \end{bmatrix}^{-1}$$

Here the matrix with E, F, G is related to the first fundamental form parameters and the matrix with e, f, g to the second fundamental form; their definitions are as given earlier. So, how does the linear map characterize the surface? The eigenvalues and eigenvectors of the linear map provide the principal curvatures and the principal directions corresponding to those principal curvatures.

These are interesting: one of them is the dominant curvature, whose value is greater, but note that a sign is involved. In the computation we call them the two principal curvatures, but since signs are involved, when we consider their relative strength the magnitudes should be compared.

Earlier we saw this particular feature when we detected the (Refer Time: 08:39) and considered the magnitudes of curvatures of a 2 dimensional function; the surface here is also a 2 dimensional function, so that analysis extends to this scenario as well, except that there it was not a surface but a brightness distribution or some other functional distribution.

Anyway, coming back to the topic of surface geometry: curvature is a very critical piece of information for understanding the local surface topology or geometry. There are two entities which are intrinsic; one, the Gaussian curvature, is a very intrinsic property of the surface geometry and can be computed as the determinant of the linear map, which is the product of the two principal curvatures. The mean curvature is half of the trace of the linear map, which is the mean of the two principal curvatures.

These two can serve as alternative descriptions of local curvature, other than the principal curvatures. The expressions for the mean curvature and the Gaussian curvature can be obtained using the parameters of the first and second fundamental forms, as given here:

$$H = \frac{Eg + Ge - 2Ff}{2(EG - F^2)} = \frac{k_1 + k_2}{2} \qquad\qquad K = \frac{eg - f^2}{EG - F^2} = k_1 k_2$$

Note that these are familiar expressions which can be related to the theory of quadratic equations. We can define a quadratic equation whose solution is shown here, so you can also recover the principal curvatures given the mean curvature and the Gaussian curvature:

$$k^2 - 2Hk + K = 0 \qquad\qquad k_{1,2} = H \pm \sqrt{H^2 - K}$$
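A small Python/NumPy sketch of these relations (illustrative only; the unit-sphere check is my own example, and the signs of H and the principal curvatures depend on the orientation and sign convention chosen for e, f, g):

```python
import numpy as np

def curvatures(E, F, G, e, f, g):
    """Gaussian (K), mean (H) and principal curvatures from the coefficients of
    the first (E, F, G) and second (e, f, g) fundamental forms."""
    K = (e * g - f * f) / (E * G - F * F)
    H = (E * g + G * e - 2 * F * f) / (2 * (E * G - F * F))
    disc = np.sqrt(max(H * H - K, 0.0))   # guard against tiny negative values from round-off
    return K, H, H + disc, H - disc

# Example: a unit sphere parameterized by angles, at a point where
# E = 1, F = 0, G = sin^2(theta) and e = -1, f = 0, g = -sin^2(theta):
theta = 1.0
print(curvatures(1, 0, np.sin(theta)**2, -1, 0, -np.sin(theta)**2))
# K = 1, H = -1, principal curvatures -1 and -1 (signs depend on the normal orientation)
```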

(Refer Slide Time: 10:39)

The usefulness of this analysis can be understood through this particular technique, which determines the topology of the local surface from curvature. You can see various kinds of local shape that can be attributed to the signs of, in this case, the Gaussian curvature and the mean curvature.

If the mean curvature is negative and the Gaussian curvature is positive, then it is a peak-like surface. To draw your attention to the analogous fact from school calculus: when you have a function f(x) of one variable x and it has a peak, at that point the maximum is characterized by its second derivative, namely $f''(x) < 0$,

whereas if it is a minimum then $f''(x) > 0$. We already know this feature, and it can be explained from differential calculus; from there we can explain why the signs here should be negative and positive respectively. We also know that the maximum and minimum are characterized by the gradient being equal to 0. The curvature is, moreover, roughly proportional to this second derivative of the functional value.

On a surface, again, we can consider any particular curve which has this kind of shape, so we expect the principal curvature values to be negative here, in both of the two principal directions. If both are negative you get this kind of peak shape, which is what is expected here, and you know that the mean curvature is defined as the mean of these two values,

$$H = \frac{k_1 + k_2}{2}$$

Since they are negative, the mean also has to be negative, and the Gaussian curvature K, defined as the product of the two values, $K = k_1 k_2$, has to be positive because both factors are negative. That is how the characterization works: if the mean curvature is less than 0 and the Gaussian curvature is greater than 0, you can characterize it as a peak surface.

Similarly for the pit surface, which is like the minimum of a one dimensional function: in this scenario we expect both principal curvatures to be positive, so the mean curvature should be positive as well as the Gaussian curvature, which is what is shown here. You can extend this analysis to a flat surface, where the curvatures are 0; we know that the radius of curvature is infinite for a flat surface, so the value of curvature is 0 in any direction, and hence both the Gaussian and mean curvatures are 0.

We can extend this kind of observation to various other shapes of a 3 dimensional surface, and you can explain the features or observations provided in this particular slide. In fact, there is a paper which discusses these properties, by Paul J. Besl and Ramesh Jain, published in 1986; it is a pioneering paper in characterizing, or finding out, the local topology of a range image from the range data.

So, this shows how we can compute these local properties, and you can use this feature later for your purposes, for example segmenting surfaces, etc.

(Refer Slide Time: 15:57)

Just to summarize, we can characterize the local topology by the signs of curvature; it could be the signs of the principal curvatures, or the signs of the Gaussian and mean curvatures. You can see that for the principal curvatures, as I mentioned, when both are negative we have a peak and when both are positive we have a pit, and then there are the combinations negative-positive and positive-negative, which give almost symmetric relationships.

You have, say, ridge, ridge, saddle, saddle, because it does not matter in which direction it is negative or positive; the local surface classes have a symmetry that can be described in the form of a symmetric matrix, you could say. For Gaussian and mean curvatures, on the other hand, the characterization is a bit more elaborate, as you can see here, and that is the reason why in the work of Besl and Jain they used Gaussian and mean curvatures; it is interesting how they identified these characteristics.

There are region types like the saddle ridge, which could not be characterized earlier, and also the saddle valley and the minimal surface. These are a few other kinds of topologies they have considered in their analysis.
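A compact sketch (Python) of this sign-based labelling; the function name, the zero threshold eps and the exact label strings are my own choices following the classification just described:

```python
def surface_type(H, K, eps=1e-6):
    """Classify local surface topology from the signs of mean (H) and Gaussian (K)
    curvature; values within eps of zero are treated as zero."""
    h = 0 if abs(H) < eps else (1 if H > 0 else -1)
    k = 0 if abs(K) < eps else (1 if K > 0 else -1)
    table = {
        (-1,  1): "peak",          ( 1,  1): "pit",
        (-1,  0): "ridge",         ( 1,  0): "valley",
        ( 0,  0): "flat",          (-1, -1): "saddle ridge",
        ( 1, -1): "saddle valley", ( 0, -1): "minimal surface",
        ( 0,  1): "impossible (K > 0 requires H != 0)",
    }
    return table[(h, k)]

print(surface_type(-0.3, 0.05))  # peak
print(surface_type(0.0, 0.0))    # flat
```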

(Refer Slide Time: 17:47)

So, we will consider a special kind of surface description which is akin to our range data; such surfaces are called surface patches. Here the parameters u and v themselves describe the x coordinate and y coordinate, and the z coordinate is a function of those two parameters, shown as h(u, v). This is the familiar range data representation we discussed in our previous lectures, and one example is shown here figuratively.

So, this is a surface patch; u and v are equivalent to the x and y axes, and you get the corresponding height of the surface point above that plane, which is the z direction. We already know that for any description of a parametric surface we can compute the surface normal and then all the elements of the first and second fundamental forms in this form. Note that these are the second fundamental form elements; they involve double derivatives and are related to curvature:

$$e = -N \cdot X_{uu}, \qquad f = -N \cdot X_{uv}, \qquad g = -N \cdot X_{vv}$$

So, those are the elements of the second fundamental form, and

E  xu  xu
F  xu  xv
G  xv  xv

these are the elements of the first fundamental form; they are all first derivatives, related to the magnitudes of the tangents, except F, which is the dot product of the tangents along two different directions. These quantities can now be computed easily, because they have a special form in this case. If I compute $X_u$, taking the derivative and representing it as a column vector, I get

$$X_u = \begin{bmatrix} 1 \\ 0 \\ h_u \end{bmatrix}$$

where $h_u = \frac{\partial}{\partial u} h(u, v)$ is the partial derivative of the function h(u, v) with respect to u. Similarly, we can compute

$$X_v = \begin{bmatrix} 0 \\ 1 \\ h_v \end{bmatrix}$$

From these two vectors we can compute E. What should E be? $E = X_u \cdot X_u$, which gives $1 + h_u^2$; then $F = X_u \cdot X_v = h_u h_v$, and $G = X_v \cdot X_v = 1 + h_v^2$. So much for the elements of the first fundamental form.

What about the elements of the second fundamental form? From this description, let me first compute the normal vector, because their computation requires the normal, and for that I need to take the cross product of $X_u$ and $X_v$. I will use the familiar determinant form of the cross product and expand it:

$$N = \begin{vmatrix} i & j & k \\ 1 & 0 & h_u \\ 0 & 1 & h_v \end{vmatrix} = -h_u\, i - h_v\, j + 1\cdot k = \begin{bmatrix} -h_u \\ -h_v \\ 1 \end{bmatrix}$$

So, this vector N is $(-h_u, -h_v, 1)$, and I have to normalize it. The unit normal is given by

$$\hat{N} = \frac{(-h_u,\, -h_v,\, 1)}{\sqrt{1 + h_u^2 + h_v^2}}$$

and from there you can now compute the rest. If I consider $X_{uu}$, which means I take the derivative once again with respect to u, I get

$$X_{uu} = \begin{bmatrix} 0 \\ 0 \\ h_{uu} \end{bmatrix}, \qquad X_{uv} = \begin{bmatrix} 0 \\ 0 \\ h_{uv} \end{bmatrix}, \qquad X_{vv} = \begin{bmatrix} 0 \\ 0 \\ h_{vv} \end{bmatrix}$$

So, the value of e is the (negated) dot product of N and $X_{uu}$, which picks out only the z component. You get

$$e = \frac{-h_{uu}}{\sqrt{1 + h_u^2 + h_v^2}}, \qquad f = \frac{-h_{uv}}{\sqrt{1 + h_u^2 + h_v^2}}, \qquad g = \frac{-h_{vv}}{\sqrt{1 + h_u^2 + h_v^2}}$$

Note the minus signs. So, this is how these elements are computed, as I have just described, and let us see the result.

(Refer Slide Time: 24:59)

So, you see that what I have described is what is mentioned here, and once you have computed these you can compute the Gaussian curvature and mean curvature using these elements. The linear map is given by

$$\begin{bmatrix} e & f \\ f & g \end{bmatrix} \begin{bmatrix} E & F \\ F & G \end{bmatrix}^{-1}$$

You can compute these quantities, then take half of the trace of the matrix and the determinant of the matrix to get the mean curvature and the Gaussian curvature respectively, and from there you can also obtain the principal curvatures. I will show you the expressions for the Gaussian curvature and mean curvature resulting from this operation.

(Refer Slide Time: 26:01)

So, this is how you get the mean curvature; this is the expression:

$$H = \frac{-\,h_{uu}(1 + h_v^2) - h_{vv}(1 + h_u^2) + 2\,h_{uv}\,h_u h_v}{2\,(1 + h_u^2 + h_v^2)^{3/2}}$$

and also the Gaussian curvature; this is the expression:

$$K = \frac{h_{uu}\,h_{vv} - h_{uv}^2}{(1 + h_u^2 + h_v^2)^2}$$

So, when you have range data you can compute these curvatures using these expressions, because range data, as I said, is a simplified form of surface description where the parameters are the x and y coordinates themselves.

The depth value is a function of the x and y coordinates, namely the z coordinate, and you can compute its derivatives using the masks we discussed earlier in our lectures. In the very first lecture we discussed how derivatives, including double derivatives and other gradients, can be computed for a two dimensional function.

So, use those masks to compute the partial derivatives and double derivatives, and from them compute the mean curvature and Gaussian curvature; then look at the signs of those curvatures, and that gives you the topology of the local surface. With this, let me stop here; we will continue this discussion of local surface geometry in the next lecture.
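A short Python/NumPy sketch of this procedure, using np.gradient as simple derivative masks; the synthetic paraboloid test surface and the grid-spacing handling are illustrative assumptions, and the sign of H follows the e = -N·X_uu convention used above:

```python
import numpy as np

def range_image_curvatures(Z, spacing=1.0):
    """Mean (H) and Gaussian (K) curvature maps of a range image Z = h(u, v),
    using central-difference masks for the partial derivatives.
    A minimal sketch; real range data would normally be smoothed first to reduce noise."""
    hv, hu = np.gradient(Z, spacing)        # first derivatives along rows (v) and columns (u)
    huv, huu = np.gradient(hu, spacing)     # second derivatives of h_u
    hvv, _ = np.gradient(hv, spacing)       # second derivative of h_v along v
    w = 1.0 + hu**2 + hv**2
    H = (-huu * (1 + hv**2) - hvv * (1 + hu**2) + 2 * huv * hu * hv) / (2 * w**1.5)
    K = (huu * hvv - huv**2) / w**2
    return H, K

# Example: a synthetic paraboloid bowl z = x^2 + y^2 sampled on a 64x64 grid.
y, x = np.mgrid[-1:1:64j, -1:1:64j]
H, K = range_image_curvatures(x**2 + y**2, spacing=x[0, 1] - x[0, 0])
print(K[32, 32])   # ~4 > 0: an elliptic point at the centre
print(H[32, 32])   # magnitude ~2; its sign depends on the adopted sign convention
```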

Thank you very much for your attention.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 44
Range Image Processing – Part IV

(Refer Slide Time: 00:26)

We continue our discussion on computing local features of the surface geometry of range data. To understand this process, let us discuss a particular problem in the computation of these features and solve it; I hope that will give you a better understanding of the computation.

Consider this exercise: a parametric surface described in this form. Note that in the problem statement I have used brief notation based on matrix algebra. All the component functions of u and v are defined independently, and the parametric functions for the coordinates are separable into a function of u times a function of v.

That is how it is given. What we need to do is compute the surface normal and the Gaussian and mean curvatures at the parameter values (0.5, 0.5). This is the surface point at which we need to compute these values. Note that the coordinate corresponding to these parameter values can be computed using these functions.

So, you replace u by 0.5 and v by 0.5, and then you can get the x coordinate from this function; similarly you can get the other coordinates. Let us proceed with this computation. We will exploit the matrix notation to perform the derivatives, first and second, because differentiation is a linear operation, so these derivative operations also become simple in this structure.

(Refer Slide Time: 02:35)

So, let us consider the representation of the functions of u and v in the form of column vectors, given as F(u); this column vector is

$$F(u) = \begin{bmatrix} f_1(u) \\ f_2(u) \\ f_3(u) \end{bmatrix}$$

and similarly G(v). The advantage of this notation is that I can easily compute the first derivatives of all these functions and express them as a vector as well: I simply take the derivatives of the corresponding elements, which are functions of u in this case. As you can see, $F'(u)$ is the first derivative of the function F(u); similarly you compute the derivatives of the function G(v). Putting in the value 0.5, you get the values of the functions and of their derivatives at u = 0.5 and v = 0.5.

So, at u = v = 0.5 the functional values are given by this; it computes everything together, $f_1(0.5)$, $f_2(0.5)$ and $f_3(0.5)$, and these are the corresponding values: the value 5.75 is actually $f_1(0.5)$, and so on. That is the interpretation of this computation. Similarly we can get G(0.5), and we can also get the derivative $F'(u)$ with respect to u evaluated at 0.5.

and similarly the value of the derivative of G at 0.5. Now, this is the vector giving you $X_u$, the derivative of the surface point with respect to u. Since each component is the product of a function of u and a function of v, what you do is multiply element-wise: $X_u$ is

$$X_u = \begin{bmatrix} f_1'(u)\, g_1(v) \\ f_2'(u)\, g_2(v) \\ f_3'(u)\, g_3(v) \end{bmatrix}$$

that is, the element-wise multiplication of $F'(u)$ with the column vector $G(v) = (g_1(v), g_2(v), g_3(v))^T$, which is what this notation expresses.

(Refer Slide Time: 05:34)

Similarly, you can compute $X_v$ in this fashion.

(Refer Slide Time: 05:39)

So, from these two we can get the normal, because we have the two tangents along the u curve and the v curve. Take their cross product and normalize it, and you get the surface normal:

$$\hat{n} = \frac{X_u \times X_v}{\left\| X_u \times X_v \right\|}$$

The surface normal at this point is given by this value; it is the unit normal vector at that point. Now, to compute the mean curvature and Gaussian curvature, as I have already discussed, you have to compute the elements of the linear map and then the linear map matrix itself; half of the trace of that matrix gives you the mean curvature, and its determinant gives you the Gaussian curvature.

So, we compute these entities here. We have already seen how the first derivatives of the surface point with respect to u and v are computed; similarly we will compute the second derivatives. This is the linear map, and once you have its elements you can get the principal curvatures as the eigenvalues of the linear map, or, as I mentioned, the Gaussian curvature and the mean curvature as above.

So, let us compute the matrix $\begin{bmatrix} e & f \\ f & g \end{bmatrix}$, which is related to the second fundamental form. Here also we use the convenient form of taking derivatives on the column vectors of the u and v functions: $F''(u)$, the second derivative of F(u), multiplied element-wise with G(v), gives you $X_{uu}$. Similarly, for these computations you first compute $F''(u)$ and the second derivative $G''(v)$, and then form the element-wise products of the respective elements.

(Refer Slide Time: 07:42)

Then, using the values you obtain, you get the elements of the second fundamental form in this way. Similarly, we have already obtained the first fundamental form values, and by using them you can construct the linear map matrix. This is the linear map matrix. If you take the trace and halve it, that is, add the diagonal elements and take the average, you get the mean curvature.

The Gaussian curvature is the determinant, which means you take the determinant of this matrix and it comes out like this, while half the trace comes out as 0.086, which is the mean curvature. I think that covers the problem statement, and this is how we solve it.
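A sketch (Python/NumPy) of this whole computation for a separable surface X(u, v) = F(u) ⊙ G(v); the specific component functions below are hypothetical stand-ins, not the ones on the lecture slide:

```python
import numpy as np

def separable_surface_curvatures(F, dF, ddF, G, dG, ddG, u, v):
    """Normal, Gaussian and mean curvature of a separable parametric surface
    X(u, v) = F(u) * G(v) (element-wise), given callables returning the
    3-vectors F, F', F'', G, G', G''."""
    Xu  = dF(u)  * G(v)          # element-wise products give the partial derivatives
    Xv  = F(u)   * dG(v)
    Xuu = ddF(u) * G(v)
    Xuv = dF(u)  * dG(v)
    Xvv = F(u)   * ddG(v)
    n = np.cross(Xu, Xv); n /= np.linalg.norm(n)
    E, Fc, Gc = Xu @ Xu, Xu @ Xv, Xv @ Xv
    e, f, g = -n @ Xuu, -n @ Xuv, -n @ Xvv      # sign follows the convention used in the lecture
    S = np.array([[e, f], [f, g]]) @ np.linalg.inv(np.array([[E, Fc], [Fc, Gc]]))
    return n, np.linalg.det(S), 0.5 * np.trace(S)   # normal, K, H

# Hypothetical example: F(u) = (u, 1, u^2), G(v) = (1, v, v) at (u, v) = (0.5, 0.5).
F   = lambda u: np.array([u, 1.0, u*u]);  dF = lambda u: np.array([1.0, 0.0, 2*u])
ddF = lambda u: np.array([0.0, 0.0, 2.0])
G   = lambda v: np.array([1.0, v, v]);    dG = lambda v: np.array([0.0, 1.0, 1.0])
ddG = lambda v: np.array([0.0, 0.0, 0.0])
print(separable_surface_curvatures(F, dF, ddF, G, dG, ddG, 0.5, 0.5))
```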

(Refer Slide Time: 08:38)

Next we will discuss another kind of processing of range images: computing the edges in a range image. The difference between an intensity image and a range image is that in an intensity image the edges are primarily sharp discontinuities, sharp changes of brightness value at the edges.

I should say it is almost a zeroth order discontinuity that occurs at these edges, and we call them step edges: a very sharp discontinuity. In a range image, on the other hand, there are also edges where the functional value changes continuously but its derivative changes; the discontinuity is not so sharp, and it is planar in nature, the intersection of two planes in the functional map, which is also how the depth varies.

Those kinds of edges are called roof edges. If I draw them for simplicity: a sharp jump of the value as we move across is a step edge, whereas if I consider a configuration where two planes meet and you observe the depth (the height), you see that the value changes continuously; this kind of edge is called a roof edge.

In this diagram it is also shown on this bottle: this transition is an example of a step edge, whereas this transition, from here to here, is an example of a roof edge. So, in a range image you have both types of edges, and that is why the characterization of these edges is important: they are a bit different and their processing should be different.

(Refer Slide Time: 11:05)

We will consider the analysis of these two cases. I will not go into the details of this analysis; I will provide the results, a summary that helps us in developing an algorithm for extracting edges, so I will describe the characterizations and properties. This is an example of how these edges are modelled: you can see the step edge, where there is a sharp jump of the functional value; the first function shown here is a step edge and it shows this jump.

So, this is the jump shown here, a definite jump, whereas for the roof edge there is no jump. There is an edge here and a continuous gradation, a continuous decrease of the value in a linear, proportional form as you move along x. With these functional definitions of the edges, the analysis is done with respect to a one dimensional function.
(Refer Slide Time: 12:31)

You can characterize these two situations in this way. Suppose you smooth these functions using a Gaussian mask and then take derivatives; you can take higher order derivatives, including first and second derivatives, of the smoothed signal.

It is possible to compute them because this is one of the techniques, or tricks, used for computing gradients: instead of computing the gradient directly, we compute gradients of a smoothed signal to handle noise. First you smooth the signal so that the noise is reduced, and then you perform the derivative operations; otherwise spurious changes will cause large errors in the computation of gradients.

There are some advantages to using a Gaussian mask: either you smooth with the Gaussian mask and then take the derivatives, or instead you take the derivatives of the Gaussian mask and then apply the convolution to get the resulting operation. That is how you get the second derivative operation.

What is very characteristic in this analysis? The authors observed that the ratios of the second and first derivatives of curvature behave differently across scales; that is, as you vary the smoothing factor sigma, the scale of the Gaussian mask, these quantities have certain invariance properties, certain interesting properties, as the analysis shows. This analysis is given in the paper cited here.

If you would like to go through the details you should read this paper; I will just describe the results here. Let me elaborate once again. The curvature of the Gaussian smoothed function is given in this form; it is the usual definition of curvature for a one dimensional function of a single variable:

$$z''_\sigma(x) = \frac{\partial^2 G(x;\sigma)}{\partial x^2} * z(x), \qquad k(x) = \frac{z''_\sigma}{\left(1 + {z'_\sigma}^2\right)^{3/2}}$$

As you can see, it is a ratio: in the numerator there is the second derivative, and in the denominator we have one plus the square of the first derivative, raised to the power 3/2. This is the standard expression for the curvature of a function of a single variable.

Now, if I take the ratio of the second and first derivatives of the curvature, this ratio can also be computed from the function z itself, as the ratio of the fourth and third derivatives of the function. This is because the curvature is related to the original function: the curvature is roughly proportional to the second derivative, so the second derivative of the curvature is proportional to the fourth derivative of the function, and the first derivative of the curvature is proportional to the third derivative of the function.

So, the ratio can be computed directly from the function by taking the ratio of its fourth and third derivatives, all smoothed at the scale sigma. The characterization of a step edge is that this ratio remains roughly the same across scales, whereas for roof edges it is inversely proportional to the scale. So, when there is a real edge point and we observe these properties, we can say that the edge point is genuine, and we can delete spurious edge points which do not hold this property.

This is one of the key findings or observations of the researchers, and they used it; however, to detect the candidate edge points we have to use the curvatures.
(Refer Slide Time: 17:02)

For step edges, a zero crossing of one of the principal curvatures gives a candidate step edge point. Then you observe across scales whether that point preserves the invariance property of the ratio of the two quantities I mentioned, the second derivative of curvature over the first derivative of curvature; it should remain roughly constant across scales.

A roof edge, on the other hand, is characterized by a local maximum of curvature, sought in the direction of the dominant principal curvature; note that dominant here means dominant in magnitude. For a roof edge the ratio multiplied by the scale remains roughly constant across scales, since, as noted earlier, the ratio itself is inversely proportional to the scale.

These four observations can be combined and used in developing an edge detection algorithm.

(Refer Slide Time: 18:23)

So, the algorithm is: compute a set of Gaussian smoothed images at multiple scales; compute the principal directions and curvatures at each point of the smoothed images; compute the zero crossings of the Gaussian curvature and the extrema of the dominant principal curvature in the corresponding principal direction. The zero crossings give you the candidate step edge points and the extrema give you the candidate roof edge points. Then use the analytical models across scales to select the final points for the respective edge types.
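A rough Python sketch of the first steps of this algorithm (building the smoothed scale stack and its curvature maps); it assumes SciPy for the Gaussian smoothing and reuses the range_image_curvatures() routine sketched earlier, and it does not implement the candidate selection or the cross-scale tracking:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def multiscale_curvatures(Z, sigmas=(2.0, 4.0, 8.0)):
    """Scale stack of Gaussian-smoothed range images with their mean (H) and
    Gaussian (K) curvature maps, the first step of the multi-scale edge detection."""
    stack = []
    for s in sigmas:
        Zs = gaussian_filter(Z, s)           # smooth at scale sigma
        H, K = range_image_curvatures(Zs)    # curvature maps of the smoothed image
        stack.append((s, Zs, H, K))
    return stack

# Candidate step edges would then be sought at zero crossings of K, and candidate
# roof edges at local extrema of the dominant principal curvature, per the algorithm.
```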

(Refer Slide Time: 19:05)

This analysis gives you a mechanism for multi-scale edge tracking. What it says is that you need to track the edge pixels at different scales, that is, in images smoothed by Gaussian functions of different scales.

If we observe that the property is retained at higher scales as well, meaning the ratio of the second derivative of curvature to the first derivative of curvature remains constant across scales for a step edge point, or the scale multiplied by this ratio remains roughly constant across scales for a roof edge point, then we select and retain those points. But the locations we retain for those points are the ones given at the finest scale, because shifts occur due to the smoothing operations.

(Refer Slide Time: 20:08)

This is one example, again results reported in that paper. You can see the edge points of differently smoothed images shown here for scale 20, scale 40, scale 60 and scale 80; as you move to higher scales you find that edge points get filtered out, and the ones that remain are the points which still maintain the invariance properties we noted.

Finally, when you take the result at the locations from the finest scale, these are the edge points you get from this operation.

(Refer Slide Time: 20:57)

That was about the detection of edges in range images; now I will consider another kind of processing of range images, namely segmentation. There are different algorithms for segmenting range images, but I will restrict ourselves to segmentation into planar patches: if we fit small planar patches and then integrate them, we can obtain larger surfaces. That is one kind of approach, and one such approach is given here; of course, it considers only planar facets.

What it does is this: first you locally fit a very small patch with the equation of a plane. Then each patch forms a node of a graph, with arcs between all neighboring patches. Let me explain: say you have fitted these patches, and these are the neighboring patches. You also consider the cost of fitting the merged patch; this cost is the average distance to the plane best fitting them.

So, this is the cost of fitting: if I merge these patches, there is a cost of fitting given by that average. Let me explain how a plane can be fitted; we have already discussed this under model fitting, and I will simply bring that technique here. Suppose the problem is given as follows.

You are given a set of N points, the set $S = \{(x_i, y_i, z_i)\}_{i=1,2,\ldots,N}$, and your objective is to fit a plane to them. A plane can be expressed in the form of the equation $z = a_1 x + a_2 y + a_3$; you know the familiar equation $ax + by + cz + d = 0$, and I am just expressing it in this form, where z is a function of x and y, which is the usual form of range data.

I can write this in matrix form as

$$\begin{bmatrix} x & y & 1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} = z$$

If I am given many such points I can write a set of linear equations, $a_1 x_1 + a_2 y_1 + a_3 = z_1$, $a_1 x_2 + a_2 y_2 + a_3 = z_2$, and so on up to $a_1 x_n + a_2 y_n + a_3 = z_n$, or, in matrix form,

$$\begin{bmatrix} x_1 & y_1 & 1 \\ x_2 & y_2 & 1 \\ \vdots & \vdots & \vdots \\ x_n & y_n & 1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{bmatrix} \qquad\qquad XA = Z$$
XA  Z

So, what is your problem your objective is to compute a1 , a2 , a3 to give a plane equation
and what you need to minimize. So, let us consider this matrix is Z this matrix is X and
this is A. So, you have to minimize this norm Z  AX , you have to compute find out

that A which will minimize this particular norm. This is that least square error estimate
and we know that we can solve it using non homogeneous method of solving least square
error estimate. And many a time I have used this particular analysis say I will write it as
AX = Z and I would like to compute sorry it is not AX, it is XA. I.e., E  Z  XA

So, I will write it as XA equals Z and then A should be equal to X transpose X inverse X
transpose Z. A  ( X T X ) 1 X T Z So, this is a equation this is a pseudo inverse and the
error would be you replace A by this form and you will get the corresponding error. So,
error can be computed as so, Z minus X then X transpose X inverse X transpose Z and
you can simplify these expressions. So, you will get error. E || Z  X ( X T X 1 ) X T Z ||

So, this error is the arc cost; this error or planar fit is a arc cost between two nodes. So,
for every for this graph, you have nodes you have say four nodes and they have

733
corresponding edges all neighboring edges means your fitting them with planes and the
arc cost is this one. So, in this way you form a graph and as you can see, it is we will
describe it later on that in this method the edge which is having minimum cost that is
considered and those nodes are fitted and now you transform this graph into this form.
And you continue go on doing this work.
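A brief Python/NumPy sketch of this least-squares plane fit and its fitting error; the noisy synthetic plane is just test data of my own:

```python
import numpy as np

def fit_plane(points):
    """Least-squares fit of z = a1*x + a2*y + a3 to an (N, 3) set of points.
    Returns the coefficients A and the fitting error ||Z - XA||."""
    P = np.asarray(points, float)
    X = np.column_stack([P[:, 0], P[:, 1], np.ones(len(P))])
    Z = P[:, 2]
    A, *_ = np.linalg.lstsq(X, Z, rcond=None)   # same solution as (X^T X)^-1 X^T Z
    E = np.linalg.norm(Z - X @ A)               # divide by sqrt(N) for an average-distance cost
    return A, E

# Noisy samples of the plane z = 2x - y + 3.
rng = np.random.default_rng(0)
xy = rng.uniform(-1, 1, size=(100, 2))
z = 2*xy[:, 0] - xy[:, 1] + 3 + rng.normal(0, 0.01, 100)
A, E = fit_plane(np.column_stack([xy, z]))
print(A, E)   # ~[2, -1, 3], small error
```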

(Refer Slide Time: 28:03)

Just to summarize this operation: the idea is that you iteratively merge pairs of planar regions so as to minimize the average distance to the plane best fitting them. We apply a greedy approach: we select the minimum cost arc, merge the nodes, and repeat iteratively until there are no more patches to be merged, because there is a threshold on the error; we should not merge when the error is too large.

This is an example of an image whose range data is available; if I perform these operations, you get this kind of planar fit. This is taken from the paper itself, on the representation, recognition and locating of 3D objects, by Faugeras and Hebert, published in 1986. So, this is one technique for segmenting range images. We will continue this discussion; I will discuss another technique, which is quite fast and effective, in the next lecture. So, let me stop here.
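A compact sketch of the greedy merging loop (Python); the data structures (regions as point lists keyed by an id, adjacency as a set of pairs) are my own illustrative choices, and fit_plane is the least-squares routine sketched above:

```python
def greedy_planar_merge(regions, neighbors, fit_plane, max_error):
    """Greedy merging of planar regions, a sketch of the approach described above.
    regions   : dict region_id -> list of (x, y, z) points
    neighbors : set of frozenset({id1, id2}) adjacency pairs
    fit_plane : function returning (coefficients, fitting_error) for a point list
    max_error : stop when the cheapest merge would exceed this threshold"""
    while True:
        # cost of merging each adjacent pair = error of the plane fitted to their union
        costs = {pair: fit_plane(regions[a] + regions[b])[1]
                 for pair in neighbors for a, b in [tuple(pair)]}
        if not costs:
            break
        best = min(costs, key=costs.get)
        if costs[best] > max_error:
            break
        a, b = tuple(best)
        regions[a] = regions[a] + regions.pop(b)            # merge b into a
        neighbors = {frozenset({a if r == b else r for r in p})
                     for p in neighbors if p != best}
        neighbors = {p for p in neighbors if len(p) == 2}   # drop collapsed self-pairs
    return regions
```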

Thank you very much for your listening.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 45
Range Image Processing – Part V

We continue our discussion on segmentation of range images. In the previous lecture, we saw how a greedy approach can perform the segmentation of planar faces in range images.

(Refer Slide Time: 00:20)

Today we will discuss another approach, a morphological analysis based approach which uses information about the local orientation at a point; we will find that this approach can also be used effectively to extract planar segments in range images.

In this approach the concepts of a digital neighborhood plane and a neighborhood plane set are introduced; I will explain these concepts in this lecture. The advantage of working with this approach is that the computation is very fast and easy: it performs checks on local neighborhood arrangements, takes a decision on the local orientation, and then aggregates pixels of similar orientation to form the segments.

Essentially, it computes a set of neighborhood planes, which I will discuss shortly; this set itself acts as a feature, and it induces a unique partition under the equivalence relation of equality of NPS, from which you get the segments. These planar segments are of course approximate, and they can be formed by region growing, as I mentioned, by considering the equality of the local features defined by the neighborhood plane set, or NPS.

(Refer Slide Time: 02:17)

So, first let me explain how the 3 dimensional neighborhood of a point is described. As you
know, in a discrete space a 3 dimensional neighborhood can be described by
extending the notion of 2 dimensions itself. In a discrete voxel space we will have
a 3 × 3 × 3 rectangular tessellation of the 3 dimensional space, and those points are
shown here. So, in this case this is the central point, whose
neighborhood is described. So, this is a point P and you can see that there are 3 cross
sections if I move along the x, y and z directions in our conventional
notations.

So, we can consider, say, this is y, this is x, and this is z, and with
respect to this point this is the front plane, this is the middle section and this is the back
plane. And the variables are also identified by considering the positions of the neighboring
pixels. So, if I assume this is the origin, then with respect to that, for example, this voxel
(in 3 dimensions we call the elements of the discrete grid voxels), say this variable
has been named n0, and from the corresponding position we can see this is the right neighbor
of the point p in the middle section.

And similarly, if I consider only a 2 dimensional cross section then this is
defining an 8 neighbor configuration; but when it is 3 dimensional it is a 26 neighbor
configuration, and the nomenclature of the variables is also shown here. So, in the middle
plane we have seen all the variables are named using the notation n with subscripts
from 0 to 7. Similarly, in the back plane the variables are denoted as v with subscripts
0 to 7. These variables mean that if there is a point then the value of the variable is 1,
and if it is empty then the value is 0.

So, this is a representation of a point set in a discretized space, you can say. And as I
mentioned, a range image is nothing but a set of points in 3 dimensions; they should lie on a
surface, but in 3 dimensions you can describe them also as a point set. So, with respect
to any point in the range image, you can have a 3 dimensional neighborhood in the
corresponding space.
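
For illustration only, a small Python sketch of this 26-neighbourhood (the binary volume and names
are my own assumptions; the point is assumed not to lie on the border of the volume):

import numpy as np

def neighbourhood_26(volume, p):
    """Return the 3 x 3 x 3 block of a binary voxel volume centred at p = (x, y, z);
    the 26 neighbours are all cells of the block except the centre."""
    x, y, z = p
    return volume[x - 1:x + 2, y - 1:y + 2, z - 1:z + 2].copy()

# the 26 offsets themselves, if one prefers an explicit list
offsets_26 = [(dx, dy, dz)
              for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1)
              if (dx, dy, dz) != (0, 0, 0)]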

(Refer Slide Time: 05:30)

Now, let me define what is meant by digital neighborhood planes. We assume that
the point lies on a surface; then the neighboring points should also be lying on
some of the neighborhood planes. They are defined in these directions, in these
configurations; mostly, if it is a planar face, in the discrete orientations you expect
those points to lie in one of these planes.

It is once again it is an extension of the corresponding directions, discretized directions
of 2 dimensional space; where you have a 0 to 7 or 1 to 8 directions and used in chain
codes in 2 dimensional images when we describe the sequence of points in a contour.
But in 3 dimension you have surface points here, and there we have to look at the
corresponding 3 dimensional configuration of neighboring points.

So, in this description we have shown what kind of configurations the neighborhood
planes can have. So, if I have all the points in the middle plane, then this is one kind of
configuration. There are certain indexes by which these planes are referred to; here, for
example, the index is 3. So, there are nine such configurations, and all of
them, as you can see if I consider this cubic face, are formed either by the
principal planes parallel to the faces of the cube or by the
diagonal planes which connect the corresponding diagonal edges, as shown in these
planes.

So, the numbers here, you can see, run from 1 to 9. So, there are 9 such neighborhood
planes which have been defined in this configuration. And in each neighborhood plane
there would be once again 9 points, including the point p which is the central point. Let me
show you those points.

(Refer Slide Time: 07:50)

So, these are the sets of points which are shown here, and the
corresponding planes, their indexes or identities, are shown by
the corresponding numbers 1, 2, 3; these are the principal planes. And if you check with
the variables that we defined earlier for the 3 × 3 × 3
neighborhood, those variables are listed in these planes; those are the corresponding variables.

So, this was the middle plane, and you can find out all the variables in
this plane; they correspond to the variables of the third plane. And
similarly, if I consider the columns of these cross sections along these directions, that
will give me, I hope, the second plane (this is v2, v8, v6), and this is the first plane. So, these are the first plane
directions, and if I use these, these are the second plane
directions. So, in this way you can always form this point set by observing the
neighborhood variables.

(Refer Slide Time: 09:22)

So, in the same way we can define all other diagonal planes also, I am not pointing out
the corresponding configurations you can do it yourselves by observing the name of the
variables and corresponding configurations of those diagonal planes.

(Refer Slide Time: 09:40)

And this is the plane of 7th, 8th, 9th those planes are also described here.

(Refer Slide Time: 09:46)

So, just to have an idea of how those planes look in a discretized grid with voxels, when all
the points of the plane are present: you can see the corresponding shapes of
those planes in the digital grid; they are shown here with their numbers.

(Refer Slide Time: 10:05)

So, now let me define the neighborhood plane set, or in acronym NPS. Consider Pi, the
i-th DNP, that is, the set of points assigned to the i-th DNP. Now, we consider that Pi
could be an element of the neighborhood plane set of a 3 dimensional point P if its
neighborhood contains a sufficient number of points lying on that Pi itself.

So, that is the condition for inclusion of Pi. In this way you can check for the other
planes also, and the set of all such planes which satisfy this property, that the neighboring
points of that point P lie sufficiently on those planes, is the neighborhood plane set.
So, mathematically we can define the neighborhood plane set in this way; it is the set of
plane indices i.

So, i denotes the i-th digital neighborhood plane, such that the neighborhood points of P,
denoted as a set N(p), intersected with the set of object points A and with the set of points
Pi of the plane, contain at least k points. Here A denotes the object points, that is, only
the points whose value is 1 and which define the object; this is just making it more precise
that we are considering neighboring points which are not empty, which are object points.

[p]_k = { i : |N(p) ∩ A ∩ P_i| ≥ k },  k ≥ 3

So, these are the points which lie on a particular plane with a sufficient number, and that
number is k. Usually we take the threshold k ≥ 3; that means at least three neighboring
points should lie on the plane, that is, at least four points including the point P itself.
But you can use any other threshold; maybe 4 or 5 is sufficient and is mostly used in our
cases. So, k is a parameter as you can see, and it is the threshold number of points required
for accepting a digital neighborhood plane in the neighborhood.
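
A minimal sketch of this check (my own illustration, assuming the nine DNP masks are available as
precomputed 3 × 3 × 3 binary arrays, and counting only the neighbours of p, not p itself):

import numpy as np

def neighbourhood_plane_set(block, dnp_masks, k=3):
    """block: 3 x 3 x 3 binary occupancy around p (object points = 1);
    dnp_masks: dict mapping plane index i -> 3 x 3 x 3 binary mask P_i;
    a plane enters the NPS when at least k occupied neighbours lie on it."""
    nps = set()
    for i, mask in dnp_masks.items():
        count = int(np.count_nonzero(np.logical_and(block, mask)))
        if block[1, 1, 1] and mask[1, 1, 1]:
            count -= 1                       # exclude the central point p itself
        if count >= k:                       # |N(p) ∩ A ∩ P_i| >= k
            nps.add(i)
    return frozenset(nps)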

(Refer Slide Time: 12:16)

Now, the question is that this definition, what I have discussed, is meant for a 3 dimensional
grid where you assume an idealized grid: there is no noise, and then you can perfectly
associate neighborhood planes with the point P. But the problem is that if we have noise,
then even some of the points which are not lying exactly in the 3 × 3 × 3 neighborhood, or
26 neighborhood, of the point P should still be considered, because they are
deviated due to noise and they could be possible candidates for forming a DNP.

So, that is the case when we are trying to handle a range image, because a range image is an
image which has been taken from a real life scenario, and it is expected that there
would be noise in those imaging systems. So, in a range image D(x, y) we can define a 3
dimensional point as (x, y, D(x, y)). Then, to handle this
kind of tolerance of deviations of points which may lie on a neighborhood plane, what
we consider is an extended neighborhood around p, of size, for example, a × b × c. Minimally
each of them should be 3, so all of them should be at least 3. That is the extended
neighborhood, and for maintaining symmetry usually they are all odd numbers.

So, what we can do here, and this is the trick which has been used in this technique,
is that this tolerance is accounted for by the fact that we map sets of points of this
extended neighborhood to the points, that is, to the variables, of the 3 × 3 × 3 neighborhood.
So, a particular variable of the 3 × 3 × 3 neighborhood would be true if one of the
points in the set which has been mapped to it is true.

So, that is how we can make a simple adaptation of this concept of DNP in
this case. And then all the definitions of digital neighborhood plane and neighborhood
plane set remain the same; only this mapping will take care of those tolerances,
giving tolerance to those deviated points or to those noisy points.

(Refer Slide Time: 15:04)

So, before giving an example, let me discuss what properties should be
maintained when we define such a mapping function. There could be various
possibilities for mapping a set of points to a neighborhood point of N(p); but we should
consider those mappings which maintain certain consistency and which are
helpful in solving our problem.

So, we are considering a mapping function, in this case named a
neighborhood mapping function, that is, a mapping from an extended neighborhood of
a × b × c to a neighborhood of 3 × 3 × 3. The first thing is that the function should be total
and onto; which means that for every point in the extended neighborhood there exists a
unique point in N(p).

So, for every point you have to have a mapping; that is the
total property. And for every point in the 3 × 3 × 3 neighborhood there exists at least
one point in the extended neighborhood which has been mapped to it; that is the
property of an onto function.

And then this function should induce connected partitions in N_abc(p); that
means, when we are making these mappings, the points of the extended neighborhood
which are mapped to the same point in the 3 × 3 × 3 neighborhood should be
connected; that is the connected partition in that case.

And, as a result, the digital neighborhood planes, as they look in the extended
neighborhood of the point p, should also be connected. So, the induced digital
neighborhood planes also should be connected. And moreover, for good quality of
segmentation, they should have strong structural similarity with the respective DNPs
defined in N(p).

(Refer Slide Time: 17:19)

So, one such possible neighborhood mapping function is described here. In this case
you can see that we are extending the neighborhood to 3 × 3 × 5; which means there is
an extension along the z direction in both the forward and backward directions, and
there is an additional 3 × 3 plane in front of and behind the point p.

So, you can see here, this is the additional 3 × 3
cross section which has been added to the neighborhood definition of p. Now, if this
is the extended neighborhood, then let us see how the mapping is carried out to maintain
the properties which I mentioned.

So, the first thing, as you can see, is that we have mapped all these points to the central
cross section of the 3 × 3 × 3 neighborhood; which means, if I consider say these points, all
of them together are mapped to n3. So, if any one of them
is true, then the variable n3 becomes true. It is an OR logic by which this variable is
related to those points.

So, in this way the extension has been done, and in the same way the
other neighboring variables are also mapped. So, the cross section which is
behind by two units is mapped to the variables of the back plane of the
original 3 × 3 × 3 neighborhood definition. And the cross section which is in front by two
units is again mapped to the corresponding variables of
the front cross section of the 3 × 3 × 3 unit.

So, in this way, as you can see, we can extend the neighborhood definition and map the points
to the variables. And finally we are working with only the 26 variables around p
with this kind of definition. Then the rest of the definitions of digital neighborhood
plane and neighborhood plane set remain the same.
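
Purely as an illustration, one simplified reading of such a mapping in Python; the lecture's
function F differs in detail, and here the outer slices are simply OR-ed into the adjacent faces:

import numpy as np

def collapse_extended(block_335):
    """Collapse a 3 x 3 x 5 binary occupancy block (z index 0..4, p at z = 2)
    to a 3 x 3 x 3 block: a variable of N(p) becomes true if any voxel mapped
    to it is occupied."""
    b = np.asarray(block_335, dtype=bool)
    block_333 = np.zeros((3, 3, 3), dtype=bool)
    block_333[:, :, 0] = b[:, :, 0] | b[:, :, 1]   # back face
    block_333[:, :, 1] = b[:, :, 2]                # middle section
    block_333[:, :, 2] = b[:, :, 3] | b[:, :, 4]   # front face
    return block_333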

(Refer Slide Time: 19:54)

So, let us just look at the digital neighborhood planes which result from these operations,
or the configurations you can have. These are the digital neighborhood planes
which have been induced by this function F. You may note that the central plane, the
plane 3 which I have shown earlier, is actually accepting any point in this volume.

So, all these points in that volume correspond to plane 3 because
we have given this tolerance. In some of these planes also more
points are there, and there is a particular rule by which, of course, you have to check
the planarity of this set of points.

So, the first thing is that the induced DNPs are connected. You can see that all the digital
neighborhood planes here are connected, and they are structurally similar. Let me
show you once again the configurations of the neighborhood planes, and you can see
that the corresponding shapes are similar; say this is plane 2 and this is plane 2 which has
been defined, and in this way we can find out the
similarity of shapes.

(Refer Slide Time: 21:24)

So, it satisfies the properties which we wanted to have in this neighborhood mapping
function. And then, by applying the definition of the neighborhood plane set, we can form
the segments. So, the algorithm goes like this: you compute the neighborhood
plane set at each pixel, then you compute connected components having the same
neighborhood plane set, remove small connected components from them, and then
smooth a region by assigning its label to spurious unlabeled pixels within it.
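
As an illustration only (not the lecture's code), a sketch of this pipeline in Python, assuming the NPS
at each pixel has already been encoded as a single integer id in nps_map and using a hypothetical
min_size threshold for small components:

import numpy as np
from scipy import ndimage

def segment_by_nps(nps_map, min_size=50):
    """Pixels form a segment when they are connected and share the same NPS id."""
    segments = np.zeros(nps_map.shape, dtype=int)
    next_label = 1
    for nps_id in np.unique(nps_map):
        comp, n = ndimage.label(nps_map == nps_id)      # connected components
        for c in range(1, n + 1):
            region = comp == c
            if region.sum() >= min_size:                 # drop small components
                segments[region] = next_label
                next_label += 1
    # spurious unlabelled pixels inside a region could then be smoothed by
    # assigning them the label of the surrounding segment (not shown here)
    return segments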

One example of range image segmentation is shown here; the upper one is a
display of a range image. In this case, the higher the brightness value, the nearer the
pixel, and the darker points are farther behind. That is how this display has been made,
and the corresponding segmentation results are also shown here.

(Refer Slide Time: 22:30)

So, now let me discuss another kind of processing with range data, and this is registration
of range data. We know the registration problem: if I give you two images and
they are related by a certain transformation, we need to compute the transformation, so
that you can obtain one image from the other image by applying this transformation.

In the case of range data also, we may consider that the data has been captured from different
views. So, they are related by the corresponding coordinate transformation between
these two views, and that is a kind of rigid body transformation. The assumptions are,
first, that the surfaces belong to the same object; that means you are viewing the
same points.

They are captured from different viewing directions, the coordinates of corresponding points
are related by a rigid body transformation, and we assume the same scales for the
coordinate axes. So, all other things are similar; the only thing is that the coordinate
transformation involves a translation and a rotation between the corresponding coordinate
systems. So, the computational problem is the estimation of those rotation and
translation parameters.
translation parameters.

(Refer Slide Time: 24:00)

So, we are considering this problem of computing the parameters of a rigid body
transformation. Let us consider that there are two corresponding point sets {mi} and {di}.
{mi} corresponds to the point set of one view, and {di} corresponds to the point set of the
other view; and we would like to get a transformation from {mi} to {di}. As I
mentioned, it is a rigid body transformation, related by a rotation and a translation,
so we can express them in these relationships. All our points are 3
dimensional vectors and they are expressed in a non-homogeneous coordinate system. So,
it is written as d_i = R m_i + T + v_i, just to explain.

Here v_i is the noise, R is a 3 × 3 rotation matrix and T is a 3 × 1 translation
vector; and we know that the points are described by 3 × 1 vectors d_i and m_i. So, these
relationships are established. Now, our objective is to compute the parameters R
and T in the presence of the noise. So, we would like to minimize the error of the
model fitting, and this is how the error of model fitting is expressed here; it is
the same sum of squared error that we discussed earlier also, as you can see. The
estimated rotation matrix and translation vector are denoted here with a hat over them
in my slides.

E = ∑_{i=1}^{N} || d_i − R̂ m_i − T̂ ||²

(Refer Slide Time: 26:06)

And one thing which you should know is that it is a constrained optimization, in the sense
that there is a property of the rotation matrix, and that property is that it is an orthonormal
matrix; which means that this matrix should satisfy the condition that R̂ᵀR̂ should be equal to
I, or Rᵀ should be equal to R⁻¹. So, subject to that we have to solve this minimization.


So, it is not a simple least squared error estimation method or a simple unconstrained
optimization method; it is a constrained optimization method that we need to consider here.

So, we apply the strategy of first removing the translation part. What do we do? We
take partial derivatives with respect to T and then we get these equations; if I take
the derivative, we can find that the translation term appears with just a coefficient of minus one,
and you can apply once again matrix algebra to reduce the derivative into this form. So,
from there we can get the estimate of the translation in terms of the average of the d_i's and
the average of the m_i's, given R, that is, T̂ = d̄ − R̂ m̄.

So, still we cannot compute T, but we know the relation; if we can get the estimate of the
rotation matrix R, then we can get the 3 × 1 translation
parameters, because we can always compute the average of d and the average of m. That is what
is meant by the means of the d_i's and m_i's.

(Refer Slide Time: 27:34)

Let us proceed with that. So, let us make a coordinate transformation, so that we
try to remove the translation part from this expression. We perform this coordinate
transformation, working with the centered coordinates d_ci = d_i − d̄ and m_ci = m_i − m̄,
and we can see that with this the translation part can be removed, because the relationship
between the translation and the rotation has already been established in the estimation.
So, the translation parameters can be removed.

So, now the minimization problem becomes that you have to compute R, with the
constraint that RᵀR should be equal to the identity matrix, such that this function gets
minimized. Now, there is a particular type of solution; if I expand it in this form,
you have to minimize this particular part. Once again, it is just matrix
algebra to show that this expression can be written as (d_ci − R̂ m_ci)ᵀ(d_ci − R̂ m_ci).

So, if I perform the corresponding multiplications, like simple algebra, which is
all applicable with matrix algebra also, as you can check by
dimensional matching and because of the linearity of matrix multiplication
and addition, you will find that finally it reduces to the expression which has been
shown here.

(Refer Slide Time: 29:24)

So, you can see here that if you are going to minimize this thing, it means I need
to maximize this particular part. So, what we are doing is that we need to maximize this
part, ∑_i 2 d_ciᵀ R̂ m_ci; and the solution for that is that we have to maximize the trace of
R̂H. I am not going to discuss how these particular
relationships have been derived, but this is the solution of the constrained optimization
which I have mentioned here. The definition of H is that H is given in this form,
H = ∑_{i=1}^{N} m_ci d_ciᵀ. So, it is basically the covariance between the corresponding
coordinates translated by their means.

So, we also call it a correlation matrix. And one of the solutions for maximizing this is
that H can be decomposed using the singular value decomposition
H = U D Vᵀ, and then R̂ = V Uᵀ; you can check that this is an orthonormal
matrix.

So, this is the solution, and then T is obtained as before. But the fact is that this will give
you a particular solution for the given point sets; there will be a fitting error, as we have
seen, but we can perform iterative fitting of these points and refine these
solutions, because there could be outliers among the corresponding points, and we can remove
those outliers by an iterative process.
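
As an illustrative sketch of this estimation (my own Python/NumPy rendering of the SVD-based
solution; the reflection check is a standard extra safeguard, not something stated in the lecture):

import numpy as np

def rigid_fit(m, d):
    """Least-squares rigid-body fit: m and d are N x 3 arrays of corresponding
    points; returns R_hat, T_hat with each d_i ≈ R_hat @ m_i + T_hat."""
    m_bar, d_bar = m.mean(axis=0), d.mean(axis=0)
    mc, dc = m - m_bar, d - d_bar                  # centred coordinates
    H = mc.T @ dc                                  # correlation matrix, sum m_ci d_ci^T
    U, S, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T                                 # R_hat = V U^T
    if np.linalg.det(R) < 0:                       # guard against a reflection
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    T = d_bar - R @ m_bar                          # T_hat from the means
    return R, T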

(Refer Slide Time: 31:03)

So, there is a technique which is called the iterative closest point registration algorithm; in
this technique the removal of those outliers is considered in this way. What does it do
first? It computes initial registration parameters, denoted as R0 and T0; and then we

perform the following steps iteratively: apply the transformation to the source
point cloud, and compute the closest pairs between source and target.

After applying the transformation we can find out the closest pairs; that means, for every
point in the source what is its nearest neighbor, and for every point in the target what
is its nearest neighbor. If they correspond to each other then we take those points,
and in that way we can get a new data set and use those data sets. So, we can again re-
compute the registration parameters by applying the same technique which I discussed, and
go on doing these things till we get a good registration, till it converges.
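
A minimal sketch of that loop, reusing the rigid_fit function above and SciPy's KD-tree for the
nearest-neighbour search (my own illustration; the outlier rejection and convergence test mentioned
in the lecture are omitted for brevity):

import numpy as np
from scipy.spatial import cKDTree

def icp(source, target, iterations=30):
    """Iterative closest point: repeatedly match closest pairs and re-estimate R, T."""
    R, T = np.eye(3), np.zeros(3)                  # initial registration parameters
    tree = cKDTree(target)
    for _ in range(iterations):
        moved = source @ R.T + T                   # apply current transform
        dist, idx = tree.query(moved)              # closest target point per source point
        R, T = rigid_fit(source, target[idx])      # re-compute parameters from the pairs
    return R, T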

(Refer Slide Time: 32:02)

So, this is one example of a registration result, and this has been taken from a paper by
Besl and McKay which was published in 1992. So far,
these are the few techniques of processing of range images that we have discussed.

(Refer Slide Time: 32:27)

So, let me conclude this topic with the summary of different things what we discussed
under this topic of range image analysis; one is that we have discussed about different
types of range sensors. So, we considered stereo imaging.

Then time of flight based sensors, triangulation through scanning beams, then structured
light. Then, we have considered use of differential geometry in extracting local features
of a pixel or a point in a range image. These are the features which we could extract like,
surface normal, principal curvature, Gaussian curvature, mean curvature; signs of
curvature characterize the local topology of the surface.

We also discussed the characterization of step and roof edges. A step edge could
be detected by detecting zero crossings of the Gaussian curvature, and a roof edge could be
detected by considering the extrema of the dominant curvature along its direction. And
further, multi-scale tracking of edge points can refine the results.

(Refer Slide Time: 33:42)

We have also discussed about segmentation of range images into planar patches. We
discussed a greedy algorithm by fitting local surface patches and merging them; and then
a morphological processing based approach by computing neighborhood planes and local
orientation.

And the last topic of this range image analysis is
registration of range images. So, we discussed how rigid body registration parameters
could be computed by a technique where you can use least square error estimation for
the rotation and translation transformation matrices; it is a constrained optimization problem.
And then we can perform iterative refinement from initial estimates by computing
nearest neighboring pairs in the two images. So, these are the things we have covered in this
particular topic.

Thank you very much for listening.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 46
Clustering and Classification Part – I

We start a new topic in this course and that is on Clustering and Classification.

(Refer Slide Time: 00:22)

So, classification is the task of assigning a known category or class to an object. For
example, take this image; you are asked to classify which regions
of the image contain a human and whether a given region is human or non-human. So, this is the
classification task. Suppose I give you this particular
region of the image and I ask you to classify whether it contains a human or not.

And in this way, in this system, I can give various patches from this
image and we can ask it to solve this problem. There could be other kinds of
problems also, like detection of a pedestrian in an image patch, recognition of an alphabet
given a 2-D pattern, assigning a pixel of an image to its foreground or background. So,
these are different classification problems, and in this way you can define infinitely
many types of problems. So, this is the nature of a classification problem.

(Refer Slide Time: 01:42)

Whereas clustering is the task of organizing objects into groups whose members are
similar in some way. A cluster is a collection of objects which are similar to each
other, but dissimilar to the objects belonging to other clusters. You can consider as examples
of clustering finding out regions of homogeneity in an image, which
means you are deriving segments; this is also a problem similar to clustering.

Or grouping of similar components; an example we can see here: suppose I am given an
image of a mushroom, and in the background there is a humus-like substance, and we would like
to cluster the similar pixels or similar regions in this case. One result could be in this
form: here you can see that there are primarily two types of clusters,
shown by the green contours and the red contours. Again, there are different regions
within the green contours and red contours which can also be treated as segments.

(Refer Slide Time: 03:01)

So, what is the difference between a class and a cluster? A class is a well studied group of
objects identified by their common properties or characteristics, whereas a cluster is a
group with loosely defined similarity among the objects; it has the potential to form a class.
(Refer Slide Time: 03:20)

So, what are the motivations for clustering? I have given one example of its application,
segmenting images; we will see different other motivations. You can find representatives
for homogeneous groups, and this would reduce your total data representation: you can
represent the data by a smaller set of representative samples which bear the characteristics of
the data. Then, discovering natural groups or categories is also another motivation, so
that we can describe them by their unknown properties, and finding relevant groups.

Segmentation is one such example, where we would try to draw our attention to relevant
groups in the distribution, that is, in the images. So, it is the major groups in the given
context that we would like to identify, like segments of an image. Then, detecting unusual data
objects, like outliers in data; those also can be detected using this kind of
clustering techniques.

(Refer Slide Time: 04:25)

So, in the context of our image and video processing and computer vision,
how are clustering and classification placed in this particular
processing of information? As you can see, it is a very high level processing of
information. At the lower level, from the images we would like to derive a certain
representation of the data in terms of feature extraction.

So, it should pass through the process of pre-processing and feature extraction, and then you
can represent the objects or image patches by feature descriptors or
feature vectors, which we have discussed already in previous lectures. And we know
different techniques for deriving the points of importance and relevance in an image
and then describing their neighbourhoods; also, different patches or images can be
described by their corresponding feature vectors.

These feature vectors are the representatives of the corresponding classes or groups or
objects, and they are used for the purpose of classification and clustering. And then
the task is that you have to assign them to certain known or unknown groups,
which means they should be similar. For clustering, groups are not well defined, as
we mentioned. So, in the example, with different colors we are trying to represent those
different kinds of classes or groups.

(Refer Slide Time: 06:02)

So, we can see the approaches of clustering and classification, which are essentially
learning problems. Now, there are two approaches, supervised and unsupervised learning,
when you consider it as a learning problem. Here we are learning about
groups or categories in data. When we say it is unsupervised learning, then it learns
in the absence of almost any prior knowledge of groups; almost, I said, because sometimes
the number of groups is provided as information.

Now, this problem is like clustering. So, for clustering we use unsupervised learning.
Supervised learning exploits knowledge about the classification problem, such
as example instances of classes. In this case, training samples with class labels are
provided for solving the problem, and it finds features suitable for predicting classes. So,
supervised learning is used in solving the classification problem.

(Refer Slide Time: 07:19)

Now, there are other variations in this learning framework; you have semi-supervised
learning and reinforcement learning. In semi-supervised learning, it learns by making
use of unlabeled data for training in conjunction with a small amount of labeled data.
Usually the size of the unlabeled data is large and the labeled data is kept small.

And in this case this learning mechanism falls between unsupervised and
supervised learning. In reinforcement learning, it is learning by feedback from a teacher
or a critic in the form of reward or punishment, yes or no, true or false, etcetera.

(Refer Slide Time: 08:03)

So, we will first discuss methods of clustering and then we will discuss the classification
techniques. In clustering there are three major components: first, there should be
a distance measure which measures the similarity between two data samples; then
there should be some criteria function to evaluate clusters; and then, of course, the
methodology, the algorithm by which you should solve the problem.

So, for defining similarity different distances could be used, and some examples of
distances we have already seen in various other applications in this particular course,
on various topics. For example, the L1 norm, L2 norm and generalized Lp norm could be
used. For the criteria function to evaluate clusters, there are two particular properties
which are looked for in a good clustering solution.

One is that there should be good intra cluster cohesion, which means the members
of a cluster should have a good homogeneity property. One of the measures in
this case could be the sum of squares of deviations from that property. The other is that there
should be inter cluster separation; that means the groups are well separated, well discriminated.

About the clustering algorithms: in this particular lecture, or in this course, we will be learning
three different clustering algorithms: one is K means, then K medoids, and the mixture of
Gaussians technique. But there could be various other approaches, like
hierarchical clustering techniques, graph based approaches, etcetera; we will be
considering only these three in this course.

(Refer Slide Time: 10:02)

The homogeneity and separation principles, which we discussed when considering the
evaluation of clusters, are used to evaluate a clustering technique. So,
homogeneity, as I mentioned, means that elements within a cluster should be close to each
other. For example, you can compute the average distance of the elements from the
cluster center; if this distance is small then the cluster is good, it has a good
homogeneity property.

Whereas the separation property of clusters means that the elements in different clusters
should be farther apart from each other. In this case we may compute the average distance of
pairs of cluster centers, and those centers should be placed apart.
So, this distance should be large for having a good separation property.
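
For illustration, these two measures can be computed as follows (a sketch under my own naming
assumptions):

import numpy as np

def homogeneity(cluster, centre):
    """Average distance of the members of one cluster from its centre (smaller is better)."""
    return np.mean(np.linalg.norm(cluster - centre, axis=1))

def separation(centres):
    """Average pairwise distance between cluster centres (larger is better)."""
    c = np.asarray(centres)
    dists = [np.linalg.norm(c[i] - c[j])
             for i in range(len(c)) for j in range(i + 1, len(c))]
    return np.mean(dists)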

(Refer Slide Time: 10:58)

There could be several choices of clustering or partitioning, so it is a non-trivial
partitioning problem in that sense. Suppose I give you this data distribution; it is a set of 2
dimensional data points which are shown by the circles, and the locations of the data are given
in a 2 dimensional coordinate space. So, these are the elements.

So, one kind of partition could be like this, and the other kind of partitioning
could be like this. Now, by considering the two properties, homogeneity and separation, the
first one is a bad clustering example, whereas the second one is a good clustering
example.

(Refer Slide Time: 11:40)

So, the first technique that I will be discussing here is called the K means clustering
technique, and in this case the problem is that if I give you n data points then you need to
compute k partitions, where the partitions are actually clusters, such that they minimize the
sum of squares of distances between each data point and the center of its respective partition
or cluster.

So, this is essentially an optimization problem, where you can consider this mathematical
formulation of the optimization function. You see that the optimization function which you
require to minimize is actually the sum of squares of deviations, or distances, from the mean
of a partition, which is denoted here by c_k, to the points which are included in that
partition. And since there are k partitions, you perform this job k times; I mean you are
summing all those components over the k partitions.

E = ∑_k ∑_{x ∈ C_k} ||x − c_k||²

Now, this problem is an NP-complete problem provided K is greater than 1, because if
K is 1 it is just the center of the data; computing the center of the data itself
will give you the solution.

(Refer Slide Time: 12:56)

So, the algorithm for K means clustering is a very famous algorithm, which is
known as Lloyd's algorithm, by the name of the inventor of this algorithm. It works with
k initial centers: first, you can choose k initial
centers randomly, and then, based on that, you can partition the points by assigning each point
to the nearest center.

So, a point is assigned to that cluster whose center is nearest to it, the
closest among them. Then we can update the centers of the partitions;
the centers get updated because once you get a partition you can compute its center,
which would generally be different from what you had earlier. And you can iterate these two
steps till the centers do not change their position: a very simple approach, but very
effective.

So, what it is trying to do is to minimize the energy function defined by the sum
of deviations of each cluster from its center; it is the error function which I discussed in
the previous slide. This is the function which we are trying to minimize. But for this method
the convergence is not guaranteed, though it works well in practice, and it may get stuck at
local minima; that is another problem.
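
A minimal sketch of Lloyd's algorithm as just described (my own illustration in Python/NumPy; the
initial centres are chosen randomly from the data):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """X is an N x d array; returns centres and the cluster index of each point."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]   # k initial centres
    for _ in range(iters):
        # assignment step: nearest centre for every point
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: each centre becomes the mean of its partition
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        if np.allclose(new_centres, centres):                 # centres stopped moving
            break
        centres = new_centres
    return centres, labels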

(Refer Slide Time: 14:25)

So, let me show you figuratively how this approach works, the essence of this computation.
We consider once again a distribution of points in a 2 dimensional space, and the
points are shown by circles; their locations are given by the centers of the circles.
So, what we can do is to choose initial cluster centers.

For example, say these are the two initial, hypothesized cluster centers. Then what
should you do? You have to now compute the partitions, which means you have to consider
the points which are closer to a particular cluster center. So, the corresponding cluster label
should be given to those points. Here the labels are shown by colors for the purpose of
visualization.

(Refer Slide Time: 15:20)

So, I will show the labeling by color. You can see these are the set of points which are
found to be close to the cluster shown by the color pink, and the other ones by the color
green. So, now we have a two-way partition based on those centers, and now you
should again update the centers. So, you can update these centers, which means now
the centers should move inward because of the configuration here. And these are
the updated positions; again you perform the partitioning with these two new positions.

(Refer Slide Time: 15:49)

So, compute new partitions with the updated centers. Now you will find that these are
the sets of points which are being clustered, which are closer to the corresponding centers,
and they are shown by their colors. You repeat these operations: you update the centers,
the cluster centers move, once again you perform the
partitioning, and you will find that you get new partitions here; you
should then update the centers, and they will be moved further.

(Refer Slide Time: 16:39)

And in this way you will find that after few iterations there is no change or a very little
change in cluster centers and then you should stop.

(Refer Slide Time: 16:52)

So, this is how the K means algorithm works. There is a more conservative approach than this
particular algorithm, because Lloyd's algorithm is a fast algorithm, but it does not
necessarily give better convergence. A more conservative approach is to move one
data point at a time, provided the overall cost gets reduced. So, it is a greedy approach, and
the principle of this approach is that it chooses the transfer of a data point from
a class i to another class j which causes the best cost reduction, that is, the maximal cost
reduction at that step.

(Refer Slide Time: 17:27)

So, we can describe this computation in algorithmic steps: select an arbitrary
partition P into k clusters, like in the K means clustering algorithm, and repeat
the steps till convergence of the cost. Let us say we keep track of the reduction
at each step; if there is no reduction over some iterations, then we
can stop.

So, initially the reduction is initialized with 0, and then for every cluster C, and for every
element which is not in C, you perform these operations. We compute the reduction, that is, the
difference between the cost of the partition P which was there earlier, before the transfer, and
the cost after the transfer; that means you have recomputed once again the centers and
partitions.

Now, this can be done efficiently by simply considering the clusters which are affected by
this transfer and recomputing their means. The labels of the rest of the
assignment remain the same, and then you can quickly compute this cost. If there is a
reduction, and if this reduction is greater than the maximum reduction so far, then
you update C and the cluster containing that element.

So, this means that at every iteration we are considering, for all elements, what is the
reduction of cost, and we choose that transfer which gives the maximum; at that iteration only
that transfer is applied, and we go on doing this. So, it is a very slow process, not as fast as
Lloyd's algorithm, but its convergence is better; which means, as I mentioned, that the K means
algorithm may get stuck at local minima, and similarly for this algorithm also it is not
guaranteed that we will get a global minimum, but it should get a better local minimum by this
process.

(Refer Slide Time: 19:44)

So, that was about K means clustering; a variation of K means clustering is called k
medoids clustering. So, let us understand what is defined by a medoid.

medoid(C_k) = argmin_{x_j ∈ C_k} (1/(n−1)) ∑_{x_i ∈ C_k, i ≠ j} ||x_i − x_j||

A medoid is the representative element of a set of data points with minimal average
dissimilarity to the other data points in the set; you consider the data points to
be vectors here. So, it is nothing but the median vector as defined
conventionally in the signal processing community.

So, this definition can be represented mathematically in this way: given a set X, its
medoid is obtained by computing, for each point, the average of the distances from that point
to the other points, and taking the element of the set which has the minimum such average
distance. That is how the medoid is defined.

The advantage here is that it is like the median of a set. The median of a
set is one of the elements of the set; similarly, the medoid is also one of the members of the
data set. So, the cost to minimize in k medoids clustering has almost the same structure;
instead of the cluster means, we have now replaced the means by the cluster medoids in the
expression. You can see this is an almost similar expression, where these are the medoids
of the clusters k.
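
As a small illustration of computing a medoid (my own sketch; since dividing by n − 1 does not
change the argmin, the plain sum of distances is used):

import numpy as np

def medoid(X):
    """Medoid of an N x d array: the member with minimal average distance to the others."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    return X[d.sum(axis=1).argmin()]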

(Refer Slide Time: 21:12)

So, k medoids clustering can be almost similar to K means clustering; as you can see,
this is the same algorithm except for the fact that instead of updating cluster centers or means
we are updating the medoids. So, iterate the above two steps till the medoids do not change
their position. The corresponding changes in this algorithm are highlighted by the
red color.

But the problem is that this updating of the medoid is computationally very expensive; it is
not as simple as updating centers, as we have seen, because for every cluster we have to compute
the corresponding median vector by solving that problem. So, we can
have some other variations to make this computation faster; one of the approaches is that we
can consider computing the medoids of only two clusters instead of all.

So, randomly choose an element of a different cluster and swap it with the medoid element
of one of the clusters, and update the medoids; only if the cost decreases do you accept
the swap, and then you continue these operations till it converges. This algorithm
is also known as partitioning around medoids, or PAM.

(Refer Slide Time: 22:35)

The last technique of the clustering methods which we will be discussing in this particular
subject is the use of a mixture of Gaussians, that is, analyzing the probability density
function of the data by representing it as a mixture of Gaussians. At the outset
we can consider some similarities with the K means clustering technique. Say you consider
that a cluster center is augmented by a covariance matrix; the center is the mean of the
data, but let us consider also using its covariance matrix, and at every
iteration you are updating the corresponding means and covariance matrices.

d(x, µ_k; ⅀_k) = (x − µ_k)ᵀ ⅀_k⁻¹ (x − µ_k)

And a distance function like the Mahalanobis distance function could be used. This is the
expression for the Mahalanobis distance function; as we can see, it takes into
consideration the covariance matrix also, not only the mean; this is the
cluster center and this is the covariance matrix. So, you could apply something like the K means
algorithm, but of course this is not yet the technique of mixture of Gaussians, though it has
some similarity. What can we do instead? We can refine it by computing probabilities of
belonging to a cluster.

N(x | µ_k, ⅀_k) = (1 / √((2π)ⁿ |⅀_k|)) exp(−½ d(x, µ_k; ⅀_k))

So, instead of crisply defining the membership of a cluster, for every element at
every iteration we maintain a probability of belongingness, and we continue doing this
till we get a good probability density function which consistently describes
the corresponding distribution of the data.

So, this parametric probability density function is described by a mixture of Gaussian
distributions. This is the description: here you can see that the probability density function
of a data point x is given as a mixture of normal distributions, so a weighted
sum of normal distributions, p(x) = ∑_k π_k N(x | µ_k, ⅀_k).

The weights are given by the parameters π_k, and each normal distribution has its own
center and covariance matrix; that means they are denoted by µ_k and ⅀_k. Just for
completeness, I am also providing you the expression for the normal distribution in multi
dimensional space. And the mixing coefficients here are the π_k.

(Refer Slide Time: 25:30)

So, there is an algorithm which is known as the expectation maximization algorithm, by
which you can estimate this probability density function and obtain those Gaussian
components. Each Gaussian component is representing a
cluster here, and the elements which have a higher probability due to a particular Gaussian
component are assigned to that cluster. So, that is the process.

So, what we can do is start with an initial set of these parameters; any arbitrary
parameters you can choose. Then, in the expectation stage, we compute the
likelihood of the data for a particular, say the k-th, Gaussian cluster. For
each cluster we compute its likelihood, which means I have to compute simply
the probability given that distribution, which has to be multiplied by the mixing coefficient.

So, this is the likelihood which we need to compute, and you can see that we are computing
the probability of x_i belonging to this distribution, which is given
by the probability density function; it needs to be multiplied by the mixture coefficient
π_k, and Z_i is the normalizing coefficient, because you are considering k classes.

z_ik = (1 / Z_i) π_k N(x_i | µ_k, ⅀_k)

So, we are considering the i-th pixel, and this is how we compute the likelihood. Then
we can assign this i-th pixel, or x_i, to the m-th cluster whose likelihood
is maximum.

However, this assignment is an optional step; it should not be taken at every
iteration, only at the final stage; when you have computed all the likelihoods of
every point, then only you assign the cluster which has the maximum likelihood.
So, this is the expectation stage.

So, you are computing the likelihoods, and from there you get a redistribution
of the data. In the maximization stage, once again with these likelihoods of the
data you re-estimate the parameters of the Gaussian distributions. And you continue this
process. So, let me explain this re-estimation process in a more elaborate way.

(Refer Slide Time: 28:24)

So, as we have seen, we can compute the likelihood of the data; this is the expression:
we compute the probability using the probability density function of the k-th Gaussian
component, multiply it with the corresponding mixing
coefficient, and then normalize with respect to all the classes or component clusters.

µ_k = (1 / N_k) ∑_i z_ik x_i

⅀_k = (1 / N_k) ∑_i z_ik (x_i − µ_k)(x_i − µ_k)ᵀ

π_k = N_k / N

So, if I have these likelihoods, we can consider their total strength, N_k = ∑_i z_ik, as if it
is representing the number of points in a cluster. It is not a discrete number; as you can see,
it is the sum of the likelihoods of all pixels, which is not an integer, but it is an
estimate of the number, a probabilistic estimate of the number.

So, if the likelihoods were only 1 and 0, then you would get integer numbers; but in this case
we get an estimate of that number, and this is how it is defined. So, it is the expected
number of pixels in class k, which can have a fractional component in this
framework. Now you compute the mean of the cluster k by considering
the corresponding likelihoods as the weights for the particular data points.

So, it is a weighted mean of the data points belonging to that cluster; the
belongingness is defined by the likelihood. That is how you get the mean of the cluster, and in
the same way you get the covariance matrix of the cluster. This is how the covariance
matrix is also computed. So, now you see that from here you get two sets of
parameters, the means and the covariances, and finally the mixture coefficients π_k, which
are found as a fraction of the total N; here N is the total number of pixels in the image,
or rather the total number of elements in the data set, since it is not only images but general
data points. So, N_k by N will give you the corresponding prior
probability of a class k; it is acting like a mixing coefficient. You iterate this
process, and after convergence, as I
mentioned, you assign to each pixel the cluster, represented by the corresponding Gaussian
component, to which it has the highest likelihood.
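
Purely as an illustration of the EM iteration described above (a sketch with my own naming; the
small diagonal term added to the covariances is a numerical safeguard, not part of the lecture;
a final hard assignment can be taken as z.argmax(axis=1) after convergence):

import numpy as np

def gmm_em(X, k, iters=100, seed=0):
    """X is N x d; returns mixing coefficients pi, means mu, covariances sigma
    and the soft responsibilities z (N x k)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=k, replace=False)]
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: z_ik proportional to pi_k * N(x_i | mu_k, sigma_k)
        z = np.empty((n, k))
        for j in range(k):
            diff = X - mu[j]
            inv = np.linalg.inv(sigma[j])
            mahal = np.einsum('ij,jk,ik->i', diff, inv, diff)
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma[j]))
            z[:, j] = pi[j] * np.exp(-0.5 * mahal) / norm
        z /= z.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi_k, mu_k, sigma_k from the responsibilities
        Nk = z.sum(axis=0)
        mu = (z.T @ X) / Nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (z[:, j, None] * diff).T @ diff / Nk[j] + 1e-6 * np.eye(d)
        pi = Nk / n
    return pi, mu, sigma, z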

So, this is how the mixture of Gaussians method could be used for clustering. So,
let me stop here, and we will continue our discussion on classification in the next lecture.

Thank you very much for your attention.

Keywords: Clustering, K-means clustering, K-medoids, expectation maximization.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 47
Clustering and Classification Part – II

We continue our discussion on Clustering and Classification of data. And, in the last
lecture we discussed various algorithms for clustering of data points and in this lecture I
will start discussing on different classification approaches.

(Refer Slide Time: 00:37)

Now, classification is the task of assigning a known category or class to an object. There are
various approaches, of which a few are shown here. It could be a probabilistic approach,
a distance based approach, a discriminant analysis based approach, and you can also use an
artificial neural network for classifying data points.

In this lecture, or in this particular topic under this course, I will be considering a few
examples of each of these approaches. For the probabilistic approach we will be
considering Bayesian classification techniques; particularly, we will discuss the Naive
Bayes classifier. For the distance based approach we will consider the K nearest neighbour
classifier, and for discriminant analysis we will discuss linear discriminant analysis. And
then, for the artificial neural network, it is a feed forward neural network that we will be
discussing.

(Refer Slide Time: 01:42)

So, let us start with the problem definition of a classification task. Here you have a labeled
data set given in the form of (yi, xi). Note here that xi is the data point; this is an n
dimensional vector, as every object is represented in an abstract way as a feature vector in an
n dimensional real space. So, xi is such a vector, or such a point, in an n
dimensional real space. And yi is its class; it could be any categorical data. It could be
a name of a class, or there could be a numerical representation of those classes.

So, the numeric values distinctly identify these classes; however, we consider, of course,
a finite set of classes. So, yi is an element of a finite set of classes, and we can design a
classifier C; that is the problem statement, that you need to design a classifier C
which assigns the class yi to xi and which is supported by its
data set. As I mentioned, for a two class problem, say, yi could be denoted as +1
or -1.

(Refer Slide Time: 03:10)

There is a risk associated with classification, particularly misclassification; as you can see,
the risk is defined in terms of the number of misclassified data points: it is the fraction of
misclassified data, that is how the risk is defined. Incidentally, the Bayesian classification
which we are going to discuss next minimizes this risk.

(Refer Slide Time: 03:36)

So, let us consider this particular thing, Bayesian classification. The name comes
from a theoretical framework which was proposed by Bayes, and there is a Bayes
theorem. We will discuss this theorem, from which the classification rules are set
and the classifier is designed. It is a probabilistic prediction of belongingness to a class,
and an example of a simple Bayesian classifier is
the Naive Bayes classifier, which we will be considering in particular in this lecture.

Each training example in this case can increase or decrease the probability of the
predicted class. So, you can learn the classifier incrementally; that is the advantage of this
classification technique, and prior knowledge can be combined with the observed data.

(Refer Slide Time: 04:37)

So, let us consider Bayes theorem and the inferencing that could be carried out by
applying this theorem. First, we know the definition of conditional probability: from
the joint probability of A and B, we can say that either B occurs given A, or A
occurs given B. So, the probability of A and B can be expressed as the probability of A
multiplied by the probability of B given A, or the probability of B times the probability of A
given B. They are all equivalent; they should give you the same probability value.

Now, Bayes theorem is derived from this observation: it can be shown that the
probability of a hypothesis H given data X can be expressed as the ratio of these two quantities.

P(H|X) = P(H) P(X|H) / P(X)

So, the probability of H is the prior probability of the hypothesis, based on our previous
knowledge, whereas the probability of X given H is called the likelihood function: if
the particular hypothesis is true, what is the likelihood that X will occur under this
hypothesis. And the probability of X is the unconditional probability of the data.

So, it is a normalizing constant in this particular case, which ensures that the posterior

probabilities sum to 1; that means, if X occurs, there could be different hypotheses
which may generate this X, and the occurrence of X is conditional on those hypotheses.
Now, if I sum all those possibilities, that should be equal to the probability of X. And
P(H|X) is the posterior probability, the probability of the hypothesis given the data.

(Refer Slide Time: 06:43)

So, given training data X, the posterior probability of a hypothesis H can be computed by following this Bayes theorem. We will see that it has significance and relevance in terms of designing a classifier, because here our objective is to maximize the probability of a class given the data, and a class represents a hypothesis.

So, you need to compute this posterior probability here. That is the Bayesian (or Bayes) classification rule: it assigns Ci to X if and only if P(Ci|X) is the highest among the posterior probabilities of all the classes. There are challenges, of course, in this computation: it requires prior knowledge of the probabilities of classes and their distributions in a multidimensional feature space.

(Refer Slide Time: 07:39)

So, just to summarize the problem statement once again in a more concrete form: the input to the classifier is a training set of tuples and their associated class labels, and each tuple is represented by an n dimensional attribute vector X. The names of the attributes, or their representation, have been shown here. They are the data points I am referring to in an n dimensional space, and let us consider that there are m classes. As we discussed in the previous slide, our objective is to derive the maximum posterior probability.

So, the class which gives the maximum posterior probability given the data should be assigned to the data. We can compute this posterior probability using Bayes theorem, i.e. the conditional probability rule, for which we need the prior of class Ci. This prior probability can be obtained from the data itself, or some prior information can be obtained from other knowledge from a different source. Given the data distribution among the different classes, we can count how many class instances occur in each class.

So, that fraction is an estimate of this prior, and by observing the data distribution within the class itself, we can also compute the probability density function, i.e. the probability of X given Ci. This term is called the likelihood. You can see that the probability of X actually need not be computed, because while comparing the values of P(Ci|X), i.e. the posterior probabilities, this denominator is constant for every class given the data X.

So, it is sufficient if I compute just the numerator, i.e. these two quantities; that would give me the basis of comparison. I am not computing the posterior probability in an absolute sense, but I can compare the classes relatively by computing those factors. So, that is the summary: only the product of the probability of Ci and the probability of X given Ci needs to be maximized. That is the problem statement.

(Refer Slide Time: 10:33)

So, now we will be discussing the Naive Bayes classifier. It works on a simplifying assumption that the attributes are conditionally independent, which means that there is no dependence relationship between attributes given the class. If that is true, then you can write the likelihood of X given a class Ci as the product of the likelihood probabilities of the individual attributes, because the attributes are independent. That is the advantage: you can compute the probability simply, and all likelihood functions can be defined in one dimension instead of a multidimensional representation of the probability density function. That is the simplicity of the Naive Bayes classifier.

Handling of multidimensional data becomes simpler for modelling the probability density function. So, there is a significant reduction of the computational cost because of this, and it requires only the class distribution, that too in a single dimensional feature space. So, it is convenient to estimate the probability of every attribute given a particular class. For example, for categorical or discrete variables you can count how many times a value of an attribute has occurred, out of all the occurrences of all values of that attribute within the class.

Then you can find the fraction of its occurrences over all occurrences, and that gives an estimate of this probability. If it is a continuous variable, then you can model it by a probability density function; any kind of parametric modelling of the density function can be done, for example a Gaussian distribution, and by estimating the parameters of the Gaussian distribution you can compute this probability. You know that a Gaussian distribution over a one dimensional feature space has only two parameters, its mean and standard deviation.

(Refer Slide Time: 12:46)

So, the likelihood is expressed in this form:

P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \dots \times P(x_n|C_i)

This is the class conditional probability of each attribute x_k, and as I just mentioned, these are the methods by which you can estimate it. Here, this is just elaborating how parametric modelling of a Gaussian distribution could be done in a one dimensional feature space.

This is the probability density function of a one dimensional random variable x which follows a Gaussian distribution, and you can see that there are two parameters here. One parameter is mu, which is called the expectation of this distribution or, in layman's terms, the mean, and the other is sigma, which is the standard deviation of the distribution. Given the data, these parameters can always be estimated as the sample mean and standard deviation.
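As a minimal sketch of this parametric estimation (not from the lecture; the function names and the use of NumPy are my own choices), the per-class, per-attribute Gaussian parameters and the corresponding likelihoods could be computed as follows:

```python
import numpy as np

def fit_gaussian_naive_bayes(X, y):
    """Estimate per-class mean/std for each attribute (one-dimensional Gaussians)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), Xc.std(axis=0) + 1e-9)  # small offset avoids zero std
    return params

def gaussian_likelihood(x, mu, sigma):
    """g(x; mu, sigma) evaluated attribute-wise."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def class_likelihood(x, params, c):
    """P(X|C=c) as the product of the one-dimensional attribute likelihoods."""
    mu, sigma = params[c]
    return np.prod(gaussian_likelihood(np.asarray(x, dtype=float), mu, sigma))
```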

(Refer Slide Time: 14:07)

And so, what we are considering is the product of all these one dimensional Gaussian likelihoods; in this case, it should actually be written as

P(X|C_i) = \prod_{k=1}^{n} g(x_k;\, \mu_{ik}, \sigma_{ik})

where g is the one dimensional Gaussian density and \mu_{ik}, \sigma_{ik} are the parameters of the k-th attribute for class C_i. So, this is the probability of the n dimensional feature vector given the class C_i; this is how it should be written. There is some problem with the expression as printed on the slide.

(Refer Slide Time: 14:50)

So, let me consider an example to elaborate this computation. A data set has been shown in this particular example. In this data set, you have the statistics of computer buyers, i.e. whether a person buys a computer or not. The person's age is shown here, then the income, whether the person is a student, and the credit rating, and with that background, whether that person has bought a computer or not. There are a few distinct values for each variable.

So, the attributes are age, income, student and credit rating, and computer buying is the class; you can see that there are two classes, one where buys computer has the value yes and the other where buys computer has the value no. The problem is that we are given a new person with this kind of attribute data, and we have to decide, not just the probability that the person buys a computer, but in which class category we can put this person, i.e. whether this person should be in the class of buys computer = yes or not. So, this is the classification problem.

(Refer Slide Time: 16:43)

So, first we need to compute the class priors in this case, and you can see that I have shown with colour how many instances there are of a person buying a computer, and how many of a person not buying a computer. There are 5 instances in which a person did not buy a computer, and if I count the total number of records, there are 14, which means there are 9 instances for yes.

So, the prior probability of the class buys computer = yes would be 9/14; that fraction is how this probability is estimated, and similarly, the prior probability for the other class would be 5/14. These are the two prior probabilities that we have estimated from this data; next, we will compute the likelihood of each attribute value for each class.

(Refer Slide Time: 18:11)

So, now we will consider the likelihood of age given the class of buying a computer, and also of not buying a computer. For each class, we have to look at the distribution of the attribute values; that means, the fraction of times age <= 30 occurs out of all the records of that class.

Counting the highlighted rows, at first it appears that age <= 30 occurs 4 times among the 5 instances of the class no, which would give a likelihood of 4/5, and that only 1 of the 9 instances of the class yes has age <= 30, i.e. a likelihood of 1/9. However, on re-checking the colours, one of the rows counted for no actually belongs to yes, so the correct counts are 2 out of 9 for yes and 3 out of 5 for no. So, the likelihood of age <= 30 is 2/9, which is about 0.222, given the class yes, and 3/5 = 0.6 given the class no. Let us proceed to the other attributes.

(Refer Slide Time: 20:30)

So, now we are considering income = medium; once again, we will see how many times medium has occurred. From the colours, we can find that for the class no, medium has occurred twice, and we know that there are 5 instances of the class no, i.e. persons who are not buying a computer.

So, the likelihood should be 2/5; that is the probability of this attribute value given the class no. For the class yes, income = medium has occurred 4 times, and there are 9 instances of yes, so it should be 4/9. Let us check once again whether any mistake was made in this case as well.

(Refer Slide Time: 21:48)

Yes, 4/9 and 2/5; these are the likelihood values.

(Refer Slide Time: 21:58)

We proceed further; this time the attribute is student = yes, which means the person is a student. Here we observe that in only one instance of the class no is the person a student, which means the likelihood of student = yes given the class no, i.e. not buying a computer, is 1/5. Let me write it in short: P(student = yes | no) = 1/5.

For the other class, we count how many of the persons who have bought computers are students; there are 9 persons who have bought computers in this data, and out of them 6 are students, so P(student = yes | yes) = 6/9. This is how the likelihood of the attribute value student = yes given the classes can be computed.

(Refer Slide Time: 23:42)

The next attribute we will consider is credit rating, where for the given data we have credit rating = fair. Counting the instances where persons are not buying a computer and have a fair credit rating, we get 2/5 for fair given the class no, and for yes we count 6 such instances, so it is 6/9 for credit rating = fair given that the person buys a computer. These are the likelihood estimates: 6/9 and 2/5.

(Refer Slide Time: 24:47)

So, we have computed all the likelihood values of age <= 30, income = medium, student = yes and credit rating = fair. The likelihood of X given any particular class Ci, for example the class of a person buying a computer, is the product of the likelihoods of all those attribute values given that class. In this way, the likelihood of this data given the two classes can be computed.

We find that the value is 0.044 when the class is yes, that is buying a computer, and 0.019 when the class buys computer is no. Then, finally, the posterior has to be computed, which means each likelihood has to be multiplied by its prior. Then you get the posterior, or rather a value proportional to it; of course, we are not dividing by the probability of X, so it is not the absolute posterior value, but you can get the proportional posterior value.

(Refer Slide Time: 25:51)

So, this is how we get it; in this case, it is not exactly the posterior but proportional to it, since the constant P(X) is the same for a given X. The value for yes is the product of the prior and the likelihood, that is 9/14 x 0.044 = 0.028, and the other value is 5/14 x 0.019 = 0.007. Naturally, 0.028 is the greater value; out of the two classes, it is the maximum, which means that I can assign this data point to the computer buying class. That is the inference; therefore, X belongs to the class buys computer = yes.
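The whole worked example can be reproduced with a short script; a sketch is given below, with the priors and likelihoods taken from the counts above (the variable names are my own):

```python
# Priors and likelihoods estimated from the 14-record buys_computer data set.
prior = {"yes": 9/14, "no": 5/14}

likelihood = {
    "yes": {"age<=30": 2/9, "income=medium": 4/9, "student=yes": 6/9, "credit=fair": 6/9},
    "no":  {"age<=30": 3/5, "income=medium": 2/5, "student=yes": 1/5, "credit=fair": 2/5},
}

x = ["age<=30", "income=medium", "student=yes", "credit=fair"]

score = {}
for c in prior:
    p = prior[c]
    for attr in x:
        p *= likelihood[c][attr]          # naive independence assumption
    score[c] = p                          # proportional to the posterior P(c|X)

print(score)                              # {'yes': ~0.028, 'no': ~0.007}
print(max(score, key=score.get))          # 'yes' -> assign buys_computer = yes
```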

(Refer Slide Time: 26:52)

One problem with the computation of this likelihood is that there may be a zero probability: suppose there is no instance at all of a particular attribute value for a particular class in your data; then the likelihood becomes zero. To avoid this, we note that these are observations, and observations are noisy, so we do not expect the likelihood to be absolutely zero for a class; it should just be low.

For example, consider a data set with 1000 tuples, where income = low occurs 0 times, income = medium 990 times and income = high 10 times. We use a correction, called the Laplacian correction or Laplacian estimator, by adding one to each count; so instead of 0 we consider the value 1. You then get a very low but nonzero probability, and we avoid the problem of zero probability in computing the likelihood.
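A minimal sketch of this Laplacian correction on the income example above (the helper name is my own):

```python
def laplace_smoothed_likelihoods(counts, alpha=1):
    """Laplacian correction: add `alpha` (here 1) to every count before normalizing."""
    total = sum(counts.values()) + alpha * len(counts)
    return {value: (c + alpha) / total for value, c in counts.items()}

income_counts = {"low": 0, "medium": 990, "high": 10}   # 1000 tuples
print(laplace_smoothed_likelihoods(income_counts))
# low -> 1/1003 ~ 0.001, medium -> 991/1003 ~ 0.988, high -> 11/1003 ~ 0.011
```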

(Refer Slide Time: 28:00)

So, what are the pros and cons of the Naive Bayes classifier? The advantage is that it is easy to implement, and good results are obtained in most cases. The disadvantage is that the assumption of class conditional independence may not be true for realistic data, and if it is not true, we lose accuracy. In real life, dependencies exist among the attributes or variables.

For example, in hospital data, a patient's profile (age, family history, etc.), symptoms (fever, cough, etc.) and disease (lung cancer, diabetes) are all related, and such dependencies cannot be modelled by the Naive Bayes classifier. So, let me take a break here; we will continue this discussion on different approaches to classification in the following lectures.

Thank you very much for listening.

Keywords: Classification, Bayesian inference, likelihood estimation, clustering

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 48
Clustering and Classification Part - III

We continue our discussion on classification of data points. In the last lecture, we discussed a classifier called the Naive Bayes classifier.

(Refer Slide Time: 00:25)

In this lecture, we will be discussing another approach, the nearest neighbor classification scheme. Here the classification approach is quite simple, as we can see; in particular, the learning algorithm is very simple: what you need to do is simply store the training data. Compare this to Bayesian classification, where you have seen that you need to process the data and perform parametric modelling to get the probability distributions.

In nearest neighbor classification, you simply store the training examples. The prediction algorithm then goes like this: to classify an example, you find the training example that is nearest to x, and whichever one is the nearest, you assign its class to that sample.

(Refer Slide Time: 01:20)

One particular method in nearest neighbor classification is known as the K nearest neighbor method. Instead of observing a single nearest neighbor, we observe a set of close neighbors, and the K closest ones are called the K nearest neighbors. In this case, to classify a new input vector x, we examine the K closest training data points to x, and then assign the object to the most frequently occurring class among them.

(Refer Slide Time: 01:54)

Here the approach is very simple; let me illustrate it using this diagram. You can see there are two classes, with class labels shown by the two colours pink and green. Suppose you have a query data point in this 2 dimensional feature space, shown here.

Then, for computing the nearest neighbors, you have to compute the distances of the query from all the training samples, sort them, and find, say, the three closest neighbors.

(Refer Slide Time: 02:45)

So, you will find that these are the 3 nearest neighbors; that is the technical term used here.

(Refer Slide Time: 02:56)

According to the rule I mentioned, we assign the class which has the maximum number of nearest neighbors (NNs); we can see that there are two members from the pink class and one member from the green class. So, we should assign the pink class to that particular query point. This is the approach, which is very simple to implement and to understand; a small sketch of the procedure is given below.
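A minimal sketch of this K nearest neighbor rule, assuming Euclidean distance and NumPy arrays (the names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=3):
    """Assign the majority class among the k training points closest to x_query."""
    X_train = np.asarray(X_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    nn_idx = np.argsort(dists)[:k]                  # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in nn_idx)
    return votes.most_common(1)[0][0]               # most frequently occurring class
```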

(Refer Slide Time: 03:19)

But there is an interesting observation: it has a sound theoretical framework, though the approach is very simple. We can show that it actually has a close relation with the Bayesian classification rule. It can be seen as a non-parametric estimation of the probability density at x given K neighbors. Suppose there are N training samples, and assume that the volume in the n dimensional space, call it the hyper-volume, containing the K neighbors is V. Let the probability that a data point falls in this volume be P.

The sampling can be modelled as follows: we draw a point, and with probability P it falls in the volume V; if we do this sampling N times, it follows a binomial distribution. The expected number of data points in the volume is N times P, and that should be equal to K. So, the estimation of P is very simple: you just take the ratio K/N, i.e. the fraction of times it has occurred out of N trials.

(Refer Slide Time: 04:51)

So, this is the probability that you can estimate here. Then the probability density at x, if I consider a continuous space, is this probability divided by the volume, i.e. p(x) = K/(N V). Now consider a class w1 which contains n1 of the K neighbors.

Then the joint density of x and w1 is given by p(x, w1) = n1/(N V), since n1 samples of class w1 have occurred within that volume out of the N samples. So, the probability of w1 given x, which is the posterior probability and which is defined from the Bayesian rule as the joint density divided by the density of x, is n1/K.

Now, the interpretation is that if you have n1 nearest neighbors of class w1 out of K neighbors, then n1/K gives the posterior probability of that class given the data point. So, if you assign the class for which the maximum number of neighbors occurred, it is actually the same as maximizing this posterior probability; it is following the Bayesian classification rule, as you can see.

(Refer Slide Time: 06:22)

So, this is how it follows the Bayesian inferencing rule.

(Refer Slide Time: 06:27)

So, when should you consider nearest neighbor classification? When the instances map to points in an n dimensional space, when you have less than about 20 attributes per instance, and when there is a lot of training data; those are the situations in which nearest neighbor classification is more appropriate. Its advantages: there is essentially no training, apart from storing the data, so there is no training cost.

It can learn complex target functions: the method is very simple, but the functions effectively implemented through this method are not simple. If you do a more detailed analysis, you will see that the decision boundaries of K nearest neighbors are quite complex in shape. And it does not lose any information.

The disadvantages: it is slow at query time and very computation intensive, because you have to compute the distances to all the data points, so your query would be very slow; the cost is linearly proportional to the number of data points. If your data set is very large, then it is prohibitively difficult, and it is easily fooled by irrelevant attributes: if there are outliers, or attributes which are not related, you are still establishing relationships through the nearness relation.

(Refer Slide Time: 08:17)

So, these are the issues: you have to choose a distance measure, the most common being the Euclidean distance. Increasing K reduces variance and increases bias, which means the class prior dominates in that case. For a high dimensional space, the nearest neighbor may not be close at all; that is another difficulty of using it in a high dimensional space.

That is why it is advisable to use this technique with not more than about 20 attributes, i.e. 20 dimensions; this is just an empirical observation. It is a memory based technique, which means it must make a pass through the data points for each classification. So, these are several issues of nearest neighbors, and as I mentioned, it is prohibitive for a large data set.

(Refer Slide Time: 09:10)

So, that is the nearest neighbor approach. The other approach I will be discussing now is classification using discriminant functions. In this case, we will consider two class problems to illustrate it. There are extensions to multi class problems, but in most cases discriminant functions are easiest to use for two class problems.

We can extend the discussion to multi class problems, but in this particular course we will only be considering two class problems. So, let me revisit the problem statement with respect to a two class problem. We have a labelled data set with N training samples, each sample xi having a label yi.

As I mentioned, xi is an n dimensional feature vector, and there are two classes; we will represent the class by the value of yi being +1 or -1. The problem of designing a classifier using a discriminant function g is that this function needs to be designed such that if I evaluate the function at xi, the sign of that value, either plus or minus, should be the class identity. That is the definition of this problem. You should note that the decision boundary is given by g(x) = 0.

So, there is a boundary which geometrically partitions the space into two regions; on one part, all values of the discriminant function should be positive, i.e. +1, and on the other part negative, i.e. -1. Usually, that is the scenario of discriminant functions, and only on the boundary are these values 0.

We call this analysis linear discriminant analysis when the function g is in linear form, and quadratic discriminant analysis when g is quadratic. The meaning of g being in linear form is that it is a polynomial in x of degree 1: x is a multidimensional vector, and for each attribute taking part in the expression of the function, the polynomial degree of that attribute should not exceed 1.

(Refer Slide Time: 12:13)

So, it is a polynomial of degree 1; for the quadratic case, it is a polynomial of x of degree 2, so the degree of each attribute should not exceed 2. Let us now discuss the derivation of a linear discriminant function from the Bayesian classification approach.

We call it linear discriminant analysis from the Bayes classifier. The Bayes classification steps, as we have seen: given an input x, we need to estimate the posterior probability P(yi|x), where yi in this case could be either +1 or -1 since there are two classes. Then we assign the value yk, that is, the k-th class (in a two class problem there are just two values, y1 and y2), to x if its posterior is maximum.

Now, as we have already discussed, it is difficult to estimate this posterior directly; what we should do is compute the class likelihood and class prior from the data, as shown here. P(x|yi) is the class likelihood given the i-th class yi, and the class prior is the probability of that i-th class itself.

This is how the posterior is related to them, and once again, just to stress, we are not required to compute the probability of x; computation of those two factors is sufficient to perform this classification.

(Refer Slide Time: 13:53)

Now, let us assume that these likelihood distributions are normal. For the Naive Bayes classifier we discussed likelihood distributions over discrete and categorical attribute domains.

N(x\,|\,m_k, S) = \frac{1}{\sqrt{(2\pi)^n |S|}}\; e^{-\frac{1}{2}(x - m_k)^T S^{-1} (x - m_k)}

But if we consider attribute values in a continuous domain, we can use parametric modelling. Let us assume that the n dimensional feature vector itself, i.e. the combination of all n attributes, follows a normal distribution.

Here, the independence of the attributes is not assumed. This is how a normal distribution in a multidimensional space is represented. Note that there are two parameters, though the number of elements in those parameters is not a single value as it was in one dimension. One is mk, an n dimensional vector, which is the mean of the k-th class, and Sk is the covariance matrix, an n x n matrix which is symmetric.

This is how the probability density function in a multidimensional space looks when it is a Gaussian distribution. Just to simplify the expressions, we consider the denominator as the normalizing part of this probability density function; in this case it is a constant. We are assuming that every class has the same covariance matrix; that is an assumption here.

So, we assume the covariance matrices are the same, i.e. Sk = S for all k; only then can we do this job. Now, we will express this probability using logarithm operations, because the advantage of the logarithm is that comparisons of values in the original domain can be carried out in the log domain: the logarithm is monotonic, so the ordering of the log values is the same as the ordering of the original values.

As you can see, one factor of the normal distribution is an exponentiation operation. If you take the logarithm, the exponent of that exponentiation is recovered, and we can turn the expression into a linear form, or rather a polynomial form, as it is not always linear.

That is the advantage of applying logarithm operations to the probability values and computing the log probabilities or log likelihood values, as these measures are variously named. So, let us take the logarithm of this expression of the posterior.

The posterior p(k|x) is proportional to the prior probability p_k times the likelihood N(x|m_k, S). Once I take the log of this product, log(p_k) comes here, and from the normalizing factor we get a constant C, so we obtain

\log(p_k) - \log(C) - \frac{1}{2}(x - m_k)^T S^{-1} (x - m_k)

So, I write -log(C) here; C is the denominator, but it does not matter in our expression because it is not dependent on the class, and we can ignore it for the comparison, since we have to find the maximum of all these log probabilities and then assign the class which has the maximum value.

In comparing those values, these constant terms will not matter. The last term shown here comes from the logarithm of the exponentiation operation; on the slide it appears with a plus sign.

(Refer Slide Time: 19:13)

As I mentioned, we ignore the constant while comparing; also note the correction here: that plus should be a minus according to this representation. Finally, it is sufficient to compare these values for finding the class giving the maximum posterior probability, and we should assign the class having the maximum value of this expression.

(Refer Slide Time: 19:42)

So, we will see how this actually gives you a function in linear form. We need to maximize this value; I will show some of the intermediate steps.

\text{maximize over } k:\quad \log(p_k) - \frac{1}{2}(x - m_k)^T S^{-1} (x - m_k)

We can see that there are two components in the quantity we need to maximize: one is the prior component log(p_k); the other comes from the likelihood part, ignoring the constant terms.

If we expand them, performing the usual algebraic operations of matrix multiplication, these are the expansions of this term; you can see that in the multiplications only very standard algebraic operations are used.

The term x^T S^{-1} x is independent of the class, because the covariance matrix is the same for all classes and x is given, so once again we can ignore it for the comparison; that is why in this step we do not have that particular term, and only three factors are left. But since S is symmetric, the two cross terms are actually the same, so we can simply add them.

So, the combined cross term becomes m_k^T S^{-1} x, and when multiplied by the half, the factor of two cancels and it is simply this value; that is the algebraic operation. You can see that these two terms can be combined into a single term because they are the same, we simply add them, and finally we get this expression.

In this expression, it is possible to see that these are all class dependent factors, but from the discriminant function's point of view, x is the variable in this discriminant function, and the function is in a linear form in x. So, this is how we can derive a linear discriminant function using the Bayesian classification approach.

So, this is the discriminant function g_k(x) for class k, and the linear discriminant function for the two class problem can be considered as the difference of the two, because you have to consider the sign of g(x):

Linear discriminant analysis: g(x) = g_1(x) - g_2(x)

So, when g_1(x) is greater than g_2(x), the sign is positive and we assign the sample to class 1; when g_2(x) is greater, the sign is negative and we assign it to class 2. So, your linear discriminant function is g(x), which, as we can see, is in a linear form.

(Refer Slide Time: 23:12)

So, just to summarize how a linear discriminant function can be derived following this Bayesian classification approach: first, you need to estimate the class priors p_k. For estimating a class prior, you take the ratio N_k/N, where N_k is the number of instances in class k and N is the total number of instances.

Then, given the training data, estimate the means of the classes and the covariance matrix of the data: m_k is the mean of the samples in class k, and the covariance can be computed from the whole data. This is one possible approach for estimating the covariance; there are various other refinements as well. I am just presenting it for the sake of simplicity of computation.

Note that m here is defined as the overall mean of the data, not the class means. You then obtain the discriminant function g(x) as we discussed and use it for classification. This is how you can perform linear discriminant analysis given the data points for a two class problem; a small sketch of these steps appears below.
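A minimal sketch of these steps for a two class problem, under the shared covariance assumption (NumPy; the function names and the simple covariance estimate are my own choices):

```python
import numpy as np

def fit_lda(X, y):
    """Two-class LDA with a shared covariance matrix, following the steps above."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)                 # expected: exactly two classes
    N = len(X)
    priors = {c: np.mean(y == c) for c in classes}          # p_k = N_k / N
    means = {c: X[y == c].mean(axis=0) for c in classes}    # m_k
    m = X.mean(axis=0)                                      # overall mean
    S = (X - m).T @ (X - m) / N                             # shared covariance estimate
    S_inv = np.linalg.inv(S)

    def g(x, c):                           # class-wise discriminant g_k(x)
        mk = means[c]
        return np.log(priors[c]) + mk @ S_inv @ x - 0.5 * mk @ S_inv @ mk

    def classify(x):                       # sign of g(x) = g_1(x) - g_2(x)
        x = np.asarray(x, dtype=float)
        return classes[0] if g(x, classes[0]) - g(x, classes[1]) > 0 else classes[1]

    return classify
```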

(Refer Slide Time: 24:46)

The next type of classification is also a kind of linear classifier, but a special kind; we will see that there are relationships with what we discussed, yet you can have a different perspective on it. Let us try to understand this particular classifier, which is called the perceptron. You can see there is a node, and a computational block shown as an ellipse.

This computational block takes a feature vector as input, an n dimensional feature vector with attributes x1 to xn. Each attribute xi has a weight wi, and we take a weighted combination of the attribute values, plus a constant term which can be considered as a bias w0. So, this is the functional form.

As you can see, this function is also in a linear form. For classification, once again you take the sign of it, so it is acting like a discriminant function: this z is acting like a discriminant function g(x), and the sign of it gives you y.

So, it is nothing but the same problem I discussed in the previous example of deriving a linear discriminant analysis; this is also a linear classifier. But it has been shown with this perspective, which will become clear when I elaborate on this aspect later on. Let me take a break at this point, and we will continue this discussion in the next lecture.

Thank you very much for your attention.

Keywords: Nearest neighbor, linear discriminant analysis, perceptron.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 49
Clustering and Classification Part - IV

(Refer Slide Time: 00:25)

We are discussing different classification schemes, and today I will discuss a classifier which is a linear classifier from a different perspective: the perceptron classifier. In a perceptron classifier, you can see here that there is a node and edge representation. There is a particular node which represents the node of the classifier, and it has an input; in this case, you can see that there are n-dimensional inputs, and in fact the input is also augmented by another dimension of value one.

So, this is your input, and these are the weights to the particular node. What we are doing is a weighted sum of this input, plus a bias. You can also consider it as if the input is augmented by another dimension whose value is 1, and then the whole thing can also be considered a linear classifier.

z = \sum_{i=1}^{n} w_i x_i + w_0

We will discuss these things in subsequent slides. The fact is that this weighted sum of the input, added with the bias, is used in a function, and the function value finally provides the output of the classifier.

Depending upon the function value, the classification output will vary. In the normal linear classification scheme we discussed in previous lectures, this function was chosen as the signum or sign function, which means that if z is positive, you consider the class label as 1; if z is negative, you consider the class label as -1; and 0 is a little ambiguous, so for completeness of the function you can take the class label as 1 when z is 0. We will see how a perceptron classifier can be understood from this perspective.

(Refer Slide Time: 02:45)

As I was mentioning, considering the augmented input, this could be put in a linear form. We consider the input as an (n+1) dimensional vector, where the last, (n+1)-th, dimension is always kept as 1, and the weight is also an (n+1) dimensional vector where the bias is included as the (n+1)-th component w0. As you can see, it then becomes simply a linear operation.

So, it can be expressed as W^T X, which from vector operations you can consider as the dot product of the weight and the augmented input X. This is used in a function, and finally, the output is the function value of this particular net input z; in this case, we are using, for example, sign(z).

The problem of this classification scheme is that given (yi, Xi), where yi is the label of the input and Xi is the corresponding input, we have to compute an optimum W which minimizes the classification error. We have already discussed this type of problem in the previous lecture when we discussed linear discriminant analysis; now, with respect to this perceptron classification scheme, let us revisit that problem.

(Refer Slide Time: 04:36)

In this case, consider that W^T X = 0 is a hyperplane, and it separates two samples X and X' of classes 1 and 2, as you can see. In my diagram, this is the hyperplane, and the normal to this hyperplane is W. Note that because of this form, the hyperplane has to pass through the origin of the corresponding space; for simplicity I am showing it in two dimensions, but you should consider it in n dimensions, any arbitrary dimension.

Now, consider a sample X which lies on one side of the hyperplane, and another sample X' which lies on the other side; the hyperplane divides the space into two partitions. If W defines the direction of the normal vector, we can take the dot product of W with X.

The dot product is the projection of this vector onto the direction of W: if you consider the unit vector along W, this projection is X.W/||W||, and it will be positive if X lies in the same partition towards which the normal vector points; then this dot product is greater than 0. The interpretation is that you are measuring the signed distance of X from this particular hyperplane.

The distance, from analytical coordinate geometry, is given in this standard form, W^T X / ||W||. Now, if I consider the other sample X', which lies on the other side of the hyperplane and, as I mentioned, is of class 2, then this dot product will be negative, because now you are taking the dot product with W pointing away from it, so W^T X' < 0.
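A minimal sketch of this side-of-hyperplane test on an augmented sample (the names are illustrative):

```python
import numpy as np

def signed_distance(W, X):
    """Signed distance of the augmented sample X from the hyperplane W.X = 0;
    positive on the side the normal W points to, negative on the other side."""
    return np.dot(W, X) / np.linalg.norm(W)

def predict(W, X):
    """Class label from the sign of W.X (taking 0 as +1 for completeness)."""
    return 1 if np.dot(W, X) >= 0 else -1
```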

(Refer Slide Time: 07:25)

So, this is the ideal case: the hyperplane can discriminate these two samples of two different classes. The condition for a classifier should therefore be that the samples of class 1 lie on one side of the hyperplane, whereas the samples of class 2 lie on the other side. If your input is such that there exists a hyperplane which partitions the input in this form, then we say these classes are linearly separable, and you can get a solution hyperplane. In that case, our objective is to find a hyperplane separating the data points of the two classes. If such a solution exists, the classes are called linearly separable.

(Refer Slide Time: 08:24)

So, we are trying to design an error function. In this case, consider a data normalization: we will do an interesting manipulation of the data so that the problem gets much simpler. We keep an input sample belonging to class 1 as X, whereas if a sample is in class 2, we negate it to -X. In that way, intentionally, we bring even the samples of class 2 into the same partition of the hyperplane.

With this kind of normalization, all input samples should satisfy the property that they lie on one side of the hyperplane, and we have brought them towards the positive side, that is, the partition in which the normal vector lies. With these operations, the condition becomes that every normalized sample Y should give a positive value of W^T Y.

With that data normalization, we can define an error function of the form of a sum, over misclassified samples Y, of -W^T Y. We compute this value only if Y is misclassified, which means the value is always positive, because for a misclassified sample W^T Y is negative, so its negation is positive. So, this error is always positive, and using this error function we will design a classifier.
will be will be designed we will be designing a classifier.

(Refer Slide Time: 10:41)

Let us see how we can do it. For correct classification, W^T Y has to be always positive, as we mentioned, and the error function we have defined is always non-negative. Our problem is to obtain the weight vector which minimizes this particular error function.

(Refer Slide Time: 11:05)

We will be using a gradient descent method for iterative optimization, and the steps go like this: we start with an initial vector, say W(0). Then we compute the gradient vector, denoted by this operation; we will see how to compute this gradient. We evaluate the gradient of this function at the initial point W(0), and then we should move closer to the minimum by updating W.

Following the rules of the gradient descent method, you have to move opposite to the steepest gradient direction; that means, you should move along -∇J(W(0)), i.e. the negative gradient -∇J(W) computed at W(0); that is the meaning of this particular symbol.

While moving, we simply consider a certain step size: we move by an amount proportional to that gradient value. This proportionality factor is called the positive scale factor, or learning rate, in our weight updating.

(Refer Slide Time: 12:45)

Let me further elaborate this process of iterative optimization using gradient descent. We start with W(0), and then W is updated in this form:

W^{(i)} = W^{(i-1)} - \eta(i)\, \nabla J_p(W)

As you can see, the new W is the result of the previous iteration moved opposite to the steepest gradient direction, scaled by this factor. We can derive this gradient analytically very easily: since J_p(W) is a sum of -W^T Y over the misclassified samples, its gradient with respect to W is simply the sum of -Y over those samples.

So, this expression becomes W^{(i)} = W^{(i-1)} + \eta(i) \sum_{Y misclassified} Y, where Y is once again a misclassified (normalized) sample. The computation is very simple in this regard: you start with any W, then use the classifier to see which samples are getting misclassified, and the sum of those samples gives you the corresponding update; you move the weight by adding the sum of those samples scaled by this proportionality factor.

Now, this learning rate may vary over the steps, or you can use some heuristic to keep it constant and small. You should continue this operation till it converges; the learning rate may simply be taken as a constant.

(Refer Slide Time: 14:30)

There could be other forms of the criterion function. For example, in another function, instead of taking -W^T Y we consider (W^T Y)^2. The previous error function I mentioned is the sum of -W^T Y; note that only the misclassified Ys take part, you are defining the function over those values only.

So, if Y is not misclassified, it does not come into this computation. The squared function is continuous, but it has some issues: it is very smooth near the boundary, which means that if you apply gradient descent you can get an arbitrary result, you can get stuck at some location; also, its value is dominated by long Ys, so some kind of normalization with respect to Y should be there. So, we can choose another error function with a certain relaxation criterion, where this normalization is considered; you can see that this function uses (W^T Y - b)^2.

In this case, the hyperplane is effectively moved further towards the positive side by a distance b, so it is a more stringent criterion for satisfying the linear separability of the classes. The factor of half is used just to simplify the computation: on taking the derivative of the square term, the 2 and the 1/2 cancel. We will also see its analytical form; it imposes a stronger linear separability requirement, as I mentioned.

The gradient is computed in this way: as you can see, the 2 from the square term and the half get cancelled, and then you can apply the usual concepts of differentiation. What you did for one-dimensional functions can be extended to many cases; in particular, when you are using matrix operations in a linear form, you will see that almost the same rules apply here. So, this is a function of W, as you can see:

\nabla J_r(W) = \sum_{Y\ \text{misclassified}} \frac{Y\,(W^T Y - b)}{\|Y\|^2}

When differentiating this expression, the term Y comes out, just as in ordinary calculus for a scalar function: if I take the derivative of (wy - b)^2 with respect to w, the result is 2(wy - b)y. When you translate it into matrix form, you get the analogous expression, 2 Y (W^T Y - b). Of course, the ||Y||^2 in the denominator is a constant with respect to W, so it simply stays there; that is why the gradient is computed in this form.

(Refer Slide Time: 18:52)

This particular diagram shows that because of the use of the offset b, you are moving the separating hyperplane further towards the positive partition by a distance b, and that puts a stringent criterion on the linear separability. Just to note here, the linear support vector machine is also a kind of linear classifier, and it maximizes this margin b between the linearly separable data points of the two classes; that is the property of a support vector machine.

(Refer Slide Time: 19:38)

Let me summarize this algorithm, which we denote as batch relaxation with margin. We initialize W to W(0), then iterate till convergence. In each iteration, we compute the set of misclassified samples, i.e. those where the classification condition is violated: because of the data normalization, all W^T Y values should be positive, or rather, for this more stringent classifier, they should be greater than b.

So, if W^T Y <= b, we consider the sample misclassified. We collect those samples Y, use them to compute the gradient as shown in the computational expression above, and then update W by following this rule. You should go on doing this till your weights converge.
convergence of your weights.

(Refer Slide Time: 20:48)

Batch relaxation means you perform the classification of every sample in your batch and then compute the gradient. There is a simple simplification of this computation, which has been found to give the same performance but is much simpler and faster: instead of considering the whole batch at a time, we can update W immediately from a single sample whenever it is misclassified. So, whenever a sample is misclassified, you immediately update the weight W by following the same rule, the same classification condition and the same gradient computation.

You perform the update of W by considering the samples one by one in every iteration: if a sample is misclassified, you update W; you go through all the samples and observe when W has converged. The expression shown here is the single sample update of W, and you should stop when there is very little change in the updates at the end of an iteration.
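A minimal sketch of this single sample relaxation rule, assuming the samples have already been augmented and normalized as before (the margin b, the learning rate and the names are illustrative):

```python
import numpy as np

def relaxation_with_margin(Y, b=1.0, eta=0.5, max_epochs=100, tol=1e-6):
    """Single-sample relaxation: Y is the (N, n+1) array of normalized samples,
    so every row should satisfy W.Y > b after convergence."""
    Y = np.asarray(Y, dtype=float)
    W = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        change = 0.0
        for y in Y:                                  # take the samples one by one
            if W @ y <= b:                           # misclassified w.r.t. the margin
                step = eta * (b - W @ y) / (y @ y) * y
                W = W + step                         # move opposite to the gradient
                change += np.linalg.norm(step)
        if change < tol:                             # very little change in this epoch
            break
    return W
```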

(Refer Slide Time: 22:12)

Actually, the perceptron classifier also models a neuron in its biological interpretation. As you can see, in this classifier weighted inputs are taken, and the output depends on a function of that weighted input. If I consider a biological neuron, you will see that a neuron has several synapses at its input, which can be considered as the exciting synapses from which the input is taken.

The weighted input is then processed, and the output is propagated through the corresponding path of the nervous system to other neurons; the excitation is propagated through this path, and other neurons are connected in the same way. So, you can consider the perceptron node to be like a single neuron: it collects excitations from different synapses at its input and generates an output response which propagates along this path and can be connected further.

This particular model is a very strong model. You can consider a network of these neurons or perceptrons, and a system in which there is an input and, through this network of neurons, the output response is finally generated via the excitations of the network; this model is known as the neural network or artificial neural network model. Various kinds of input-output relationships can be modelled using this generalized model; it is a very powerful model in that respect.

(Refer Slide Time: 24:38)

This is what we see here in an artificial neural network: it is a network of perceptrons where the input is a vector and the output is also a vector, or it could be a scalar; that means you can have several output neurons providing the output, or you can consider only a single output neuron, in which case the output becomes a scalar.

(Refer Slide Time: 25:09)

One special case of this kind of network is when there is no feedback or loop in the network; then we call it a feed-forward network. A loop means that from an output you get a feedback to one of its input neurons. When this kind of loop is not present in the network, we call it a feed-forward neural network.

(Refer Slide Time: 25:42)

A very general form of feed-forward neural network is the multilayered feed-forward neural network, as shown here: there is an input layer which accepts the input, then there are several intermediate layers, called hidden layers (in this case we have shown only one hidden layer), and then there is an output layer which generates the output.

You can see that in this particular form every layer has a certain number of neurons, which may vary. From a particular layer to the next higher layer, every neuron is connected to all the neurons of the next higher layer in this model. This connectivity means that there are weights on the responses of a particular neuron, and this weighted response is the input of the neurons of the next layer: every neuron in the next layer gets a weighted sum of the outputs of the previous layer, and that provides its net input.

Finally, as you understand, there is also a non-linear function which processes that net input, i.e. the weighted sum of the inputs to that particular neuron, and produces its corresponding output or response, which in turn excites the neurons of the next layer. Since all the neurons are fully connected to the neurons of the previous layer, we also call this kind of network a fully connected network.

As you can see, it is layer-wise processing: the i-th layer takes input from the (i-1)-th layer and forwards its output to the input of the next layer, and this network is called a fully connected feed-forward network.

(Refer Slide Time: 28:18)

So, let me take a break here. We will be discussing the mathematical description of this particular model and its analysis, so that we can solve the problem of designing classifiers using a neural network; that we will do in the next lecture.

Thank you very much for listening to this particular lesson.

Keywords: Perceptron classifier, linearly separable, gradient descent, batch relaxation, feed-forward

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 50
Clustering and Classification Part – V

(Refer Slide Time: 00:22)

We are discussing artificial neural networks, and in particular the multilayer feed-forward neural network; this particular network will be used to model a classifier. So, let us see how we can model a multilayer feed-forward neural network. We will first concentrate on modeling a particular neuron of a layer, which means let us consider the jth neuron of the ith layer.

(Refer Slide Time: 00:47)

And so, to model this neuron, it has its input weights. Consider the weight as a vector, since the neuron is connected to all the outputs of all the neurons of the previous layer. If the number of such outputs is n_{i-1}, where n_{i-1} is the number of neurons at the (i-1)th layer, and the bias is denoted w_{j0}^{(i)}, then n_{i-1} is the dimension of the input to this neuron, which for a fully connected network is also the number of neurons whose outputs feed it.

And, as I mentioned, the dimension of the output at the ith layer is the number of neurons at the ith layer. The output of the neuron can be described mathematically in this way: it first takes a net input from its inputs. That means X^{(i-1)} is the vector generated by the (i-1)th layer, of dimension n_{i-1}. You take the weighted combination of this input, then it is added with a biasing term, which has been considered here, and that becomes the net input to the function, a non-linear function, which generates the output response.
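
To make this concrete, here is a minimal sketch, in Python with NumPy, of the computation performed by a single neuron: the net input is the weighted combination of the previous layer's outputs plus the bias, and a sigmoid non-linearity produces the response. This is an illustration added for this text, not code from the lecture, and all values are made up.

import numpy as np

def sigmoid(z):
    # non-linear activation f(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(w, b, x_prev):
    # net input: weighted combination of the previous layer's outputs plus bias
    z = np.dot(w, x_prev) + b
    # response of the neuron after the non-linearity
    return sigmoid(z)

# example: a neuron receiving 3 inputs from the (i-1)th layer
w = np.array([0.2, -0.5, 0.1])    # weight vector of the neuron
b = 0.05                          # bias term
x_prev = np.array([1.0, 0.3, -0.7])
print(neuron_output(w, b, x_prev))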

(Refer Slide Time: 02:38)

So, if I consider the input-output relation of the ith layer, it can be described in this form: in the ith layer you have n_i output neurons, and for each one there is a weight vector connecting it to all the neurons of the previous layer. This is how the relationship can be expressed: each input is multiplied by its weight vector, making a weighted sum of the components of the previous layer, and this generates the output of the ith layer.

Note that the dimension of this output is n_i, and these are the biases. In brief, we can denote this as a weight matrix at the ith layer and the corresponding bias vector at the ith layer; these are the two parameters related to the ith layer.

(Refer Slide Time: 04:00)

So, the input-output relation can be described in this form, as explained here in a block diagram. Given an input, from the input layer you use the parameters W^{(1)}, b^{(1)} as described and generate the output for the next layer. In this way the output is propagated, layer by layer, up to the input of the output layer, and finally you get the output. At each layer the description of the parameters is in this form (there is a small correction here: the dimension should be n_i, to be consistent with our description), and the relation at the ith layer is

$Y^{(i)} = f\big(W^{(i)} X^{(i-1)} + b^{(i)}\big)$

(Refer Slide Time: 04:47)

So, in short, the model can be characterized by the whole set of parameters; the single symbol used here represents the whole collection of parameters. In a model you have the parameters defined in this way, and it has its corresponding input-output relation following the functions of the neurons as we discussed before; at each functional block, this is how it processes the input.
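
A compact sketch of this layer-wise processing is given below, assuming, purely for illustration, that the whole parameter set is stored as a list of weight matrices and a list of bias vectors, one pair per layer, with a sigmoid as the non-linear function. It is not the lecture's own code.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(Ws, bs, x):
    """Propagate input x through a fully connected feed-forward network.
    Ws[i] has shape (n_i, n_{i-1}); bs[i] has shape (n_i,)."""
    y = x
    for W, b in zip(Ws, bs):
        # Y^(i) = f(W^(i) X^(i-1) + b^(i))
        y = sigmoid(W @ y + b)
    return y

# example: 3 inputs -> 4 hidden neurons -> 2 outputs
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
bs = [np.zeros(4), np.zeros(2)]
print(forward(Ws, bs, np.array([0.5, -1.0, 2.0])))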

(Refer Slide Time: 05:31)

So, we now consider an optimization problem: we are trying to train the neural network in such a way that it satisfies the input-output specifications. That means, if I give you input-output specifications in the form of pairs (Xi, Oi), with N such samples, you have to find W such that the network produces Oi given input Xi, for all i. This is the optimization problem, and we can use the same gradient descent method to solve it. So, we define an error function, which is defined in this form here.

You can see that it is basically the mean square error between the actual output and the output predicted by the model, which is expressed as F(X; W). So, this is the predicted output and this is the actual output; the square of this deviation for a single observation, averaged over all the observations, gives the mean square error. This error has to be minimized, so you have to find the W which minimizes it. Once again, we can apply the same gradient descent technique.

$\text{Minimize: } J_N(W) = \frac{1}{N}\sum_{i=1}^{N} \| O_i - F(X_i; W) \|^2$
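
In code, this objective can be written as a small function; the sketch below assumes the network outputs F(Xi; W) have already been computed (for example with the forward-pass sketch above) and are stacked row-wise together with the targets. This is an added illustration.

import numpy as np

def mse_loss(outputs, targets):
    # outputs, targets: arrays of shape (N, output_dim)
    # J_N(W) = (1/N) * sum_i || O_i - F(X_i; W) ||^2
    diffs = targets - outputs
    return np.mean(np.sum(diffs**2, axis=1))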

(Refer Slide Time: 07:04)

So, that is what we can do here. The procedure is similar: we can start with an initial weight W^{(0)} and then update W iteratively. Here "weight" means the whole collection of parameters as we have described, that is, the weights and biases of every layer of the artificial neural network. The corresponding update scheme can be shown in this form:

$W_i = W_{i-1} + \eta(i)\,\big(O_i - F(X_i; W_{i-1})\big)\,\nabla F(X_i; W_{i-1})$

So, η(i) takes care of all the scaling factors here. As we can see, when you take the derivative of this function with respect to W, you of course have to compute the gradient of F with respect to W, and you can once again apply stochastic gradient descent instead of considering all the samples together.

That means, instead of considering the sum, you can simply use a single sample, like what we have done previously, immediately update the weights, and continue doing this for every sample. Finally, when the weights converge, that is, when there is little change in the values of the weights after an update, we can stop.

(Refer Slide Time: 08:30)

So, computation of the gradient itself is a task that we need to carry out here, and let me first discuss it with respect to a single neuron or single perceptron. We can apply the chain rule very effectively. We know what the chain rule is; let me explain with this corresponding functional description. So, this is the input here and you have the corresponding weights.

This is the net input, obtained after taking the weighted combination and adding a bias term; in the next part this becomes the input to a non-linear function f, which then produces the output o. If your target response is t, then the error function in this case is considered as (t - o)^2. So, this is the squared error that we are considering.

So now, you need to compute the derivative of E with respect to the individual weights, because we are trying to update the weights. We need to find out how a change in each weight affects this error, that is, the gradient.

You should move in the direction which minimizes the error, so let us look at the gradient directions. How do you compute it? By applying the chain rule, we can compute

$\nabla(W) = \left(\frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n}\right)$

$\frac{\partial E}{\partial w_i} = -2\,(t - o)\,f'(z)\,x_i$

So, that is how the chain rule is applied; let me give you the corresponding summary here also. As we mentioned, we need to compute all these gradients.

(Refer Slide Time: 11:44)

And as I was telling you, we have to compute all these components. You just multiply these three factors, and you get the corresponding derivative:

$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial o}\,\frac{\partial o}{\partial z}\,\frac{\partial z}{\partial w_i}$

Now, let us consider a particular form of f(z), the sigmoid function. If I take the derivative with respect to z, it will look like this:

$f'(z) = \frac{1}{1+e^{-z}}\left(1 - \frac{1}{1+e^{-z}}\right)$

Now, incidentally, this can be expressed in this form, which is nothing but f(z)(1 - f(z)).
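
This identity is easy to verify numerically; the following small check, added here as an illustration, compares f(z)(1 - f(z)) with a finite-difference estimate of the derivative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.7
analytic = sigmoid(z) * (1.0 - sigmoid(z))           # f'(z) = f(z)(1 - f(z))
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2*h)  # central difference
print(analytic, numeric)   # the two values agree to high precision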

So, this is one interesting simplification if you use the sigmoid function. The sigmoid function has this property, and in many cases in neural networks the sigmoid function is used and this particular property is effectively exploited in the computation. If I want to compute $\partial E / \partial x_i$ (we will see later on that in some cases we need to compute that as well), then also we can apply the chain rule.

You will find the corresponding expressions here. The interesting part is that the whole computation can be done using analytical methods; you do not have to use numerical simulation to compute it. Sometimes, for a complicated function, what we do is give a small change to the input, observe the change in the output, and then use the ratio of those changes as the derivative.

But in this case we can compute directly by plugging in the values at that point. Only the values of the inputs x and the weights w are required to compute this particular function. Once you compute the functional value, you can also compute the derivative using those same values. So, you can compute everything at a point; that is the advantage of this particular method.

(Refer Slide Time: 14:22)

So, when you would like to compute the gradient of a feed-forward network, a multilayered feed-forward network, we apply the same chain rule. But in this case, since the computation is layered, the chain rule has to be applied following that layered architecture, which means that we should compute from the output towards the input. You should compute all the derivatives from the output end and then proceed towards the input, computing the successive derivatives.

This means going from the output layer to the input layer, and we can compute the partial derivatives with respect to the weights at the (i-1)th layer from the corresponding derivatives of the ith layer. We will see how we can do it.

(Refer Slide Time: 15:18)

So, let us consider this particular case here; we would like to compute the derivative with respect to this weight. Note that in my notation the superscript 2 denotes the layer, so this is the second layer, and the subscript 3 denotes a neuron of the previous layer. So, this weight w_{31} is the connectivity from the third neuron of the previous layer to the first neuron of the second layer; that is the notation we are using here.

Similarly, the y's shown here are outputs; I have shown all the outputs which are affected by a change of this weight, since this response is affected by it. This output in turn becomes an input to the neurons of the next layer, so it acts like an input, but any change of this weight will affect the responses downstream. Which means I need to compute, for example, ∂E/∂o, and since any change here affects the outputs, I also need to compute ∂o/∂y_1^{(3)}, that is, the derivative with respect to this response.

Similarly, I need to compute the derivative with respect to this one, and then the derivative of this with respect to that; in between there is a functional block, so we will have to take it into account. In this way only these derivatives need to be computed, and then you can find out the required gradient.

(Refer Slide Time: 17:24)

So, this is the relationship with respect to this, and note that the computation gives

$\frac{\partial y_1^{(2)}}{\partial w_{31}^{(2)}} = f'\big(z_1^{(2)}\big)\, y_3^{(1)}$

This is just the single-neuron computation, where y_1^{(2)} is the output.

(Refer Slide Time: 18:46)

So, we will continue this computation once again. At the top it shows the layers 1, 2, 3, 4, 5, and then the computation is carried out in this form to obtain the weight update $\Delta w_{31}^{(2)}$.

(Refer Slide Time: 19:49)

We will be discussing a particular simple rule by which the gradients can be propagated from the output towards the input direction, and that rule is called the delta rule. Let me explain this particular rule, by which we can organize the computations very nicely. We measure the error, as we discussed, as the square of the deviation of the response o given by the network (for a particular input and particular values of the weight parameters of the network) from the target t.

So, (t - o)^2 is the error, and we would like to compute the derivative of the error E with respect to some parameter. For example, in this diagram we are considering the parameter w_{31}. Note that in my notation this is a weight: 3 refers to the third neuron in the previous layer and 1 to the first neuron of the current layer; here the previous layer is the first layer and the current one is the second layer.

So, our convention is to denote the weight connecting the output of the third neuron of the previous layer to the current layer in this form. Similarly, the output of the third neuron of the previous layer, which is layer 1, is denoted here as y_3^{(1)}, and the output of the first neuron of layer 2 is denoted in this form. In the same way, the outputs of the first and second neurons of layer 3 are also denoted in this form.

Finally, this is of course the fourth layer, with only one neuron, and the fifth layer is just the output response, which is redundant, so we do not denote it with any symbol other than o. This is the convention we are following, and we are showing all the responses which are affected by a change of the weight w_{31}. So, when we compute the gradient of the error E with respect to this weight w_{31}, these are the variables which play a role in determining the gradient.

So, let us find out what this gradient depends on. You can see that we would like to measure this value, which will give the corresponding update of the weight parameter w_{31}. Applying the chain rule, we first compute the gradient of E with respect to the output o, then the gradient of o with respect to y_1^{(3)}, and then subsequently the gradient with respect to y_1^{(2)}.

So, this is that value; similarly, these two terms are added here because of the linear operations. Finally, when you are computing the gradient with respect to w_{31}^{(2)}, it is the derivative of y_1^{(2)} with respect to w_{31}^{(2)}. These are the expressions we need to find, given the particular responses at this instant. In this case we have already discussed how to compute the gradient of y_1^{(3)} with respect to y_1^{(2)}.

Suppose we know the gradient of the output o with respect to y_1^{(3)}; we call that value δ_1^{(3)}. That means it is the accumulated gradient up to this point, up to the output of the first neuron of the third layer, which we denote in this fashion. Similarly, the gradient at this other point will be denoted δ_2^{(3)}; that is the gradient at that point, that is my notation, and that is the definition of delta.

So, delta is the accumulated gradient up to the output of a certain layer, with the corresponding convention for writing it in this fashion. Similarly, you can also write the gradient along particular edges; we will see how to do it. Let us now expand this particular quantity, the gradient of y_1^{(3)} with respect to y_1^{(2)}.

(Refer Slide Time: 26:02)

So, this is how they are related: in between you have the corresponding non-linearity f(z), and that plays its role. We can write this factor as f'(z), and then we know that there is also a weight connected here, since it is a linear combination. As per our convention this weight is w_{11}^{(3)}, because it connects the first neuron of layer 2 to the first neuron of layer 3.

So, according to our convention this is the weight, and the relation is linear: y_1^{(2)} is scaled by this weight and contributes to the net input, which in turn determines y_1^{(3)}. We can therefore write ∂y_1^{(3)}/∂y_1^{(2)} = f'(z_1^{(3)}) w_{11}^{(3)}. This is the expansion; similarly you can do this expansion for the other term, and later on we will see how the chain rule is organized with respect to this.

(Refer Slide Time: 27:59)

So, let me proceed. As I mentioned, we compute this part and similarly the other one, and this expands. This term is expanded into this form and that term into this form; this one comes from here and this one comes from here. Now, we define the deltas as I was mentioning: like the delta defined earlier, this is δ_1^{(3)}, and along this edge it is δ_{11}^{(3)}, which gets multiplied with w_{11}^{(3)}.

Similarly, at this stage this is δ_{12}^{(3)}, and that gets multiplied with w_{12}^{(3)}. We can see how this happens: this is δ_{11}^{(3)}, as I have mentioned, getting multiplied with w_{11}^{(3)}, and this is δ_{12}^{(3)}, getting multiplied with w_{12}^{(3)}. Adding them gives δ_1^{(2)}; this is the delta rule. So, we can define the delta rule in this fashion, as I was mentioning, for δ_1^{(2)}.

That is, δ_1^{(2)} is equal to the weighted sum of δ_{11}^{(3)} and δ_{12}^{(3)}, where the weights are those of the corresponding edges. That will be clearer when I show it here; this is the delta rule. Once you get δ_1^{(2)}, in the same way you can also get the delta value here: it has to be multiplied with f'(z_1^{(2)}), and then you are computing the derivative at this point, that is, ∂y_1^{(2)}/∂w_{31}^{(2)}.

So, it is this multiplied by f'(z_1^{(2)}) and by y_3^{(1)}, because when you compute the gradient at this point, the multiplication factor with respect to this parameter is that response. That is what we write here: it is δ_1^{(2)} multiplied with f'(z_1^{(2)}) along that path, and then you multiply it by y_3^{(1)}; that gives you this quantity, which is used for updating this weight.

(Refer Slide Time: 31:27)

I think this diagram explains it a bit more clearly. Let me explain what it is doing: as I mentioned, you compute δ_1^{(i)} at this point, which is the accumulated gradient value from the output to this point, and similarly δ_j^{(i)}. You compute all these accumulated gradient values and then propagate them; along each edge the delta is multiplied with f'(z) and the corresponding weight, for the first neuron as well as the jth.

You then take the weighted sum of all of these, and that gives you the accumulated gradient value from the output down to this point. Next, you propagate further in the same fashion. The update of a weight is ∂E/∂o multiplied by the factor accumulated from the output down to that weight, because the chain runs from the weight to the output response. This is our update; in this way the weight updates are computed at every branch, and this is how the gradient is computed.

(Refer Slide Time: 32:47)

So, in this way you compute the gradient of the error with respect to the corresponding weight, and that is how we get the gradient vector. The algorithm then follows in the same fashion: for each training sample you compute the functional value of each neuron in the forward pass, then update the weights of each link starting from the output layer using back propagation, and continue until it converges.
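
Putting the pieces together, the sketch below shows one training step of this procedure for a network with a single hidden layer and sigmoid activations: a forward pass, the delta rule to propagate the gradients backwards, and a gradient-descent update of every weight and bias. It is a minimal illustration with arbitrary sizes and learning rate, not the exact implementation used in the course.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(W1, b1, W2, b2, x, t, eta=0.1):
    # forward pass: compute responses of each layer
    z1 = W1 @ x + b1; y1 = sigmoid(z1)
    z2 = W2 @ y1 + b2; o = sigmoid(z2)
    # backward pass: accumulated gradients (deltas), starting at the output
    delta2 = -2.0 * (t - o) * o * (1 - o)        # dE/dz2, with E = (t - o)^2
    delta1 = (W2.T @ delta2) * y1 * (1 - y1)     # delta rule: weighted sum of next-layer deltas
    # gradient-descent updates of weights and biases
    W2 -= eta * np.outer(delta2, y1); b2 -= eta * delta2
    W1 -= eta * np.outer(delta1, x);  b1 -= eta * delta1
    return W1, b1, W2, b2

# toy usage: repeatedly fit one 2-D sample to a single target output
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
for _ in range(1000):
    W1, b1, W2, b2 = train_step(W1, b1, W2, b2, np.array([0.2, 0.9]), np.array([0.3]))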

(Refer Slide Time: 33:21)

One thing we would like to mention here is whether this model is a classification or a regression model. Primarily it is a regressor: it builds a model to predict a functional value F(x) given input x. But you can also use this model as a classifier by appropriate encoding of the classes; that means your output vector would be the encoding of the classes.

For example, if you have a two-class problem, you can consider a binary encoding, either 0 or 1, or you can consider one-hot encoding; that means there are two output neurons, one of them 1 and the other 0 for one class, and the reverse for the other class. The idea of one-hot encoding is that if there are, say, n classes, they can be represented by n such binary variables (n bits), and only one of them is 1 for a particular class; the rest are 0.
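
One-hot encoding of class labels can be done, for instance, as in this small illustrative snippet:

import numpy as np

def one_hot(label, n_classes):
    # exactly one output neuron is set to 1 for the given class; the rest stay 0
    v = np.zeros(n_classes)
    v[label] = 1.0
    return v

print(one_hot(2, 4))   # [0. 0. 1. 0.]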

(Refer Slide Time: 34:36)

When you design a classifier, it is necessary to evaluate it, particularly for a supervised classification problem. So, we will discuss some methods and measures of evaluation. Consider a two-class problem with a positive and a negative class. There are several possibilities after classification; in this table I am showing what has been predicted by the model.

So, these are the predicted positive and predicted negative columns; the model can predict either positive or negative. It could happen that the sample is actually positive and the model also predicts positive, which gives a true positive, or the sample is negative but the model predicts positive, which is a false positive. When the sample is actually positive but the model predicts negative, it is a false negative, and the other way round, when it is actually negative and the prediction is also negative, it is a true negative.

This is a desirable situation. True positives and true negatives are desirable outcomes, so those numbers should be high, whereas the other two numbers should be low. Different measures use these counts; for example, the accuracy of a classification considers the total number of predictions which are correct, either positive or negative, so it is (true positive + true negative) divided by the total number of samples tested. There are other measures like precision and recall.

In precision you see what fraction of the predicted positives are true, and recall is the fraction of the actual positives that are detected. There are other measures like sensitivity, which is the same as recall and is used particularly in the medical world; it can also be considered the true positive rate, true positives divided by actual positives. Specificity is the true negative rate, true negatives divided by actual negatives.

And you can combine these two scores into one F score, which is the harmonic mean of precision and recall, as defined here. The higher the F score, the better the classification, when you consider them in combination.
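
All of these measures follow directly from the four counts of the table; the snippet below is a small illustrative sketch of the formulas just described.

def evaluate(tp, fp, fn, tn):
    accuracy    = (tp + tn) / (tp + fp + fn + tn)
    precision   = tp / (tp + fp)          # fraction of predicted positives that are true
    recall      = tp / (tp + fn)          # fraction of actual positives detected (sensitivity)
    specificity = tn / (tn + fp)          # true negative rate
    f_score     = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, specificity, f_score

print(evaluate(tp=40, fp=10, fn=5, tn=45))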

(Refer Slide Time: 37:14)

For a multi-class problem we can use a confusion matrix, where, as shown here, these are the true classes and these are the predicted classes. If the numbers in the diagonal terms are high, those are the desirable outcomes. All other terms indicate some kind of error, because the sample belongs to actual class ω1 but the prediction is ω2 or ω3. The accuracy measure is then the sum of the diagonal divided by the total.

(Refer Slide Time: 37:43)

There are methods of testing the performance of a classifier. One method is called cross validation, and once again it is applicable for supervised classification. What we do is separate training and test data; then we train the network using the training data and evaluate it using the test data.

(Refer Slide Time: 38:08)

So, for k-fold cross validation, we divide the data into k sets of equal size, train using k-1 sets, and test with the remaining one. We do this with every set as the test set, take the average, and report the average performance.
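
A bare-bones version of k-fold cross validation is sketched below; train and test_accuracy are placeholder callables standing for whatever classifier is being evaluated, so they are assumptions for illustration rather than part of the lecture.

import numpy as np

def k_fold_cv(X, y, k, train, test_accuracy):
    """Split the data into k equal folds; train on k-1 folds and test on the remaining one."""
    idx = np.arange(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train(X[train_idx], y[train_idx])
        scores.append(test_accuracy(model, X[test_idx], y[test_idx]))
    # report the average performance over all k folds
    return np.mean(scores)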

(Refer Slide Time: 38:27)

So, here we come to the end of my talk on this particular topic. Let me summarize what we have discussed under the topic of classification and clustering. As you know, classification is the task of assigning a known category or class to an object, whereas clustering is the task of organizing objects into groups whose members are similar in some way. We discussed several clustering techniques, like K-means, K-medoids and the Gaussian mixture model.

(Refer Slide Time: 39:04)

And for classification techniques we considered the Naive Bayesian classification scheme, the K-nearest-neighbor classification scheme, linear discriminant analysis and finally artificial neural network models. With this let me stop here, and we will continue our discussion in the lectures on the next topic.

Thank you very much for listening to this talk.

Keywords: feed forward network, chain rule, back propagation, cross validation.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 51
Dimension Reduction and Sparse Representation Part - 1

(Refer Slide Time: 00:24)

In this lecture, we will discuss Dimension Reduction and Sparse Representation. So, let us first understand what is meant by the dimension of data. Consider a set of data points as shown in the slide, a set where each xi is a data point in an n-dimensional real space. We can consider it a vector in that real space, so the dimension of that space is naturally n.

Now, does it mean the dimension of the set S is also n? Let us take this example: it is a 3-dimensional space and there are 4 data points. They may be arranged in such a way that they lie on a plane. When they lie on a plane, we can always define a coordinate system within that plane and use that coordinate convention to represent every point.

In that case, all the points could be represented as a set of points in a 2D real space. So, it is not necessary that the dimension of the data is the same as the dimension of the space; it could be lower than that number, as the example shown here illustrates. Principal component analysis is a method which finds the minimum-dimensional subspace for representing data. We will learn in this lecture how this analysis can be done; the basic idea is that it computes a new set of orthogonal axes, and using that new set of orthogonal axes you define new coordinates for the representation.

(Refer Slide Time: 02:34)

So, it is a coordinate transformation in one sense. What does principal component analysis do? It maximizes the variance of a component; let me explain that. Consider a feature vector representation of a data point X; since it is represented in n-dimensional space, we have n components or n fields of the vector, shown here as {x1, x2, .., xn}. This is the convention we are using for representing a data point.

Now, the variance of a particular component, say the ith component xi, is defined using this mathematical expression. This should be clear: if there are N such data points, then for the ith component we consider its value in the jth data point, x_ij.

And you consider the mean of that component; this is the standard definition of variance. Out of all these n components, the component whose variance is the maximum is called the dominant component. Now, PCA maximizes the variance of the dominant component. Let us understand what this means. Consider a unit vector W; since, as I mentioned, there is a coordinate transformation involved in PCA, we can adopt certain coordinate conventions: the origin of the coordinates can be taken as the mean of these feature vectors.

So, let us represent that mean by S̄: you have N feature vectors, say {X1, X2, .., XN}, and S̄ is

$\bar{S} = \frac{1}{N}\sum_{i=1}^{N} X_i$

This is how the mean is computed. Then, for every Xj, translate it to the mean vector and compute the component along W. Suppose these are the data points and this is S̄; this is the original coordinate frame of the space. First we shift the origin to S̄, take any particular direction, say this direction W, and take the component, which means the dot product with W.

W should be a unit vector, and the dot product gives the component; that is how you compute it. So, this is yj:

$y_j = (X_j - \bar{S}) \cdot W$

Since W is a unit vector, taking this dot product gives you the component yj.

Now, consider the variance of this component. The problem of PCA is to find at least one such direction which maximizes the variance of these projections of the data points, centred at their mean.

(Refer Slide Time: 07:05)

So, let me continue with that representation. We have a set of data points, and we represent them in this way: if Xj is the jth vector, it has n components, and each component is a variable indexed so that the ith component of Xj is x_ij; that is how we are representing it.

The mean vector, as we discussed, is S̄, the vector of the means of each component; we defined it earlier also. Then we perform this transformation: we take the component along a vector W, which is a unit vector, and you can see that every vector is first translated towards the mean. For all N vectors there are N such numbers; for every vector we translate it and take the corresponding component along the unit vector.

So, finally you get N observations, or projections, of all these data points along W, centred at the mean of the vectors. Now consider the variance of these y's; we want to compute the W which maximizes

$\frac{1}{N}\, Y^T Y$

Actually, since we have translated towards the mean, the mean of the y's is 0, so it is sufficient to simply maximize the mean of the squared magnitudes of the projections, subject to the constraint

$W^T W = 1$

(Refer Slide Time: 09:47)

So, the squared magnitude of this vector, averaged over the N projections, is

$\frac{1}{N}\sum_{i=1}^{N} y_i^2$

and taking this mean gives the variance of these values, because the mean of the y's is

$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i = 0$

Effectively, we are computing the variance of the projected components, and we would like to get the W which maximizes this quantity. There is a constraint on W, as already mentioned: it has to be a unit vector, so the norm of W should be equal to 1.

(Refer Slide Time: 10:50)

And if I expand $Y^T Y$, since Y has been shown to be $\tilde{X}^T W$, this can be written in the form

$\frac{1}{N}\,(\tilde{X}^T W)^T (\tilde{X}^T W)$

This quantity is interesting because of what it measures: it involves the covariance of X. Of course, you have to consider the averaging of those values; the division by N is the term which averages it out.
(Refer Slide Time: 11:36)

So, we have to compute the W which maximizes this quantity, and, as I mentioned, the inner matrix here is nothing but the covariance matrix C, whose (k, l)th element is the covariance between the kth component and the lth component of the data points; that is how the covariance matrix is defined.

So, the objective function for maximizing the variance is a function of the unit direction W, and it maximizes this quantity, the same one as above, but there is a constraint. This constraint can be incorporated into the objective function using a Lagrange multiplier; that is this particular term, with λ the Lagrange multiplier.

If I take the derivative with respect to λ, it gives W^T W = 1, which enforces the constraint we want while obtaining a solution. If I take the partial derivative with respect to the unit direction W, we get this system of equations; as I mentioned earlier, the rules of differentiation for a one-dimensional variable can be extended to a multi-dimensional space when you are using matrix operations, linear operations.

So, you can consider $W^T C W$; it is a quadratic form.

$L(W) = W^T C W - \lambda\,(W^T W - 1)$

$\frac{\partial L}{\partial W} = 0 \;\Rightarrow\; 2CW - 2\lambda W = 0 \;\Rightarrow\; CW = \lambda W$

So, λ is an eigenvalue, and since W has to be a unit vector, we consider the unit eigenvector in this case. Since we would like to maximize the variance, we take the eigenvector corresponding to the maximum eigenvalue.

(Refer Slide Time: 14:47)

So, what you get is the dominant principal component: the eigenvector corresponding to the maximum eigenvalue of C. Now, what about the other eigenvectors? As the covariance matrix is a symmetric n x n matrix, you will get n eigenvectors.

In fact, it can be shown that the eigenvectors provide the directions of maximum variance of the residuals, one after another. So, the solution of this analysis is that the set of eigenvectors corresponding to decreasing eigenvalues provides the principal components.

Suppose we represent this set as n eigenvectors {e1, e2, .., en} such that the corresponding eigenvalues are in decreasing order: e1 corresponds to the maximum eigenvalue, e2 to the second maximum, and the minimum eigenvalue λn corresponds to the eigenvector en. Note that all vectors are normalized here; we are only considering unit eigenvectors.

(Refer Slide Time: 16:09)

So, the ith principal component is defined as the projection along the ith eigenvector, after centring at S̄, the mean of the data points. We can write it mathematically as

$y_i = (X - \bar{S}) \cdot e_i$

The dimension reduction can then be done by ignoring the eigenvectors with small eigenvalues; that means, for a data point we can reduce the dimension by ignoring those components whose eigenvalues are very small.

There is an interpretation of the eigenvalues: they represent the variances of the residuals. So, suppose all the eigenvectors up to the kth eigenvalue are retained for representing the data; then the data can be approximately represented by a k-dimensional representation.

(Refer Slide Time: 17:09)

So, we had n-dimensional data, but as I mentioned, the dimension of the data is not necessarily the dimension of the space; the data can lie on a subspace, a k-dimensional subspace in this case. Using principal component analysis we have performed that coordinate transformation: now your coordinate axes are given by these eigenvectors, the origin of your coordinates becomes the centre of the data points, and the projections of the data points along each eigenvector give you the components.

So, the first k components in decreasing order may be sufficient to represent the data, that is, sufficient to capture the variance of the data. Thus we have a k-dimensional vector, and as you understand, k has to be less than or equal to n. The total variance of the data can be represented as the sum of the variances of each component.

There are n components, so you take the variance of each component and that gives the total variance of the data. It can be shown that this total variance is nothing but the sum of the eigenvalues. That is why an eigenvalue which is very small, negligible, does not contribute to the data variance and we can ignore those components. The ratio of the sum of the first k eigenvalues to the total sum, the total variance of the data, is the fraction of variance accounted for. In dimension reduction this is what we consider: the higher this fraction, the better the information content of the data is retained.

This fraction, as you understand, varies from 0 to 1, and we will be considering a very high value of it, nearly 1, for representing the data. Mathematically, we represent the complementary statistic as

$R^2 = \frac{\sum_{j=k+1}^{n} \lambda_j}{V}$

As you can see, this is not the fraction that is retained; it is the fraction of the variance which is rejected.

(Refer Slide Time: 19:50)

So, it is the sum of the variances of those components which are not accounted for in your representation, and this R² should be as small as possible. To summarize, the input of the PCA algorithm is a set of data points Xj, each a point in n-dimensional space, and the output is a set of k eigenvectors, where k is determined by a threshold on the fraction of variance that we are accounting for.

The algorithm goes like this: compute the mean of the data points, then translate all data points to their mean. Compute the covariance matrix of the set, then compute the eigenvectors and eigenvalues, order them by decreasing eigenvalue, and choose k such that the fraction of variance accounted for is more than a threshold; that threshold is also a parameter of the algorithm. A typical value could be 0.95, which means 95 percent of the variance is taken care of by this representation; then use those k components for representing any data point.
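
These steps translate almost line by line into NumPy; the following sketch is an illustrative implementation of the algorithm just summarized, with the variance-fraction threshold as a parameter. It is not the course's own code.

import numpy as np

def pca(X, threshold=0.95):
    """X: data matrix with one data point per column (n x N).
    Returns the mean, the chosen eigenvectors and the projected data."""
    mean = X.mean(axis=1, keepdims=True)           # mean of the data points
    Xc = X - mean                                  # translate points to the mean
    C = (Xc @ Xc.T) / X.shape[1]                   # covariance matrix
    vals, vecs = np.linalg.eigh(C)                 # eigenvalues in increasing order
    vals, vecs = vals[::-1], vecs[:, ::-1]         # reorder: decreasing eigenvalues
    frac = np.cumsum(vals) / vals.sum()            # fraction of variance accounted for
    k = int(np.searchsorted(frac, threshold)) + 1  # smallest k exceeding the threshold
    E = vecs[:, :k]                                # first k principal directions
    Y = E.T @ Xc                                   # k-dimensional representation of the data
    return mean, E, Y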

(Refer Slide Time: 21:05)

So, let me elaborate these computations using an example: consider this data set, and we want to do PCA on it for reducing the dimension of the data.

(Refer Slide Time: 21:23)

So, all these data points are now represented as the column vectors of a matrix X; there are 5 data points, and all of them are in three-dimensional space. Each column vector is a data point, so these are the coordinates of points in that three-dimensional space.

If I take the mean of those data points, that is, the mean of those column vectors, the mean is computed in this form. Now you compute the vectors translated towards the mean: you simply subtract the mean from each column vector, and you get this particular matrix X̃, which represents the vectors translated around the mean. Then you compute the covariance matrix, which is $\frac{1}{5}\tilde{X}\tilde{X}^T$.

If I perform these matrix computations, this is the covariance matrix you get. For principal component analysis, what do you need to do? You need to find the eigenvectors and eigenvalues of this covariance matrix. Since it is a symmetric 3 x 3 covariance matrix, you have 3 eigenvalues, which may or may not be distinct, and correspondingly there will be 3 eigenvectors.

(Refer Slide Time: 22:57)

In this particular example, also note the diagonal elements of this covariance matrix; the diagonal elements represent the variances of the components. The variance of the first component is 1.04 (computed around the mean), that of the second component is 41.6, and that of the third component is 42.24.

Note that the maximum variance is in the third component for this data set in its original form, and also that there are high correlations between the components, because if you look at the corresponding off-diagonal terms you will find they are quite significant. The total variance is the sum of all the diagonal terms.

If I add the diagonal terms, you can see that the total variance is 84.88. The eigenvalues of the covariance matrix can be computed; one of the eigenvalues is 0, which is the minimum, while the maximum is 83.3238, which is quite high and captures almost all the variance of the data. The second eigenvalue is 1.5562, which is much less than the first, but still has some significance.

So, what can we do? Because the third eigenvalue is 0, we can represent this data using only two components. You should also note that the sum of these eigenvalues is the same as the total variance of the data. The respective eigenvectors are: e1 corresponds to the eigenvalue 83.3238, and e2 and e3 similarly correspond to the eigenvalues 1.5562 and 0. So, we can perform the dimension reduction by considering projections along e1 and e2 only.

(Refer Slide Time: 25:12)

So, what do we do? We consider these as the new basis vectors, with e1, e2, e3 as the columns; this is almost the same as what we learned for image transforms. You have new basis vectors, and the components of the original data points with respect to these new basis vectors can be computed; of course, for principal component analysis you have to translate towards the mean first.

So, if I consider the components of the data points translated to the mean of the original data points, these computations give each one of them: for each data point, we compute the translation and then the corresponding dot products along e1, e2 and e3. Now this is your new, transformed data point, and you can see that one of the components is 0, which means I can now represent it in a two-dimensional space.

So, I can represent it: this is one data point, this is another, and so on, ignoring the third dimension, which is redundant. In fact, if you look at the point set I have given, the points lie in the plane X + Y + Z = 10. Since they lie on a two-dimensional plane, through principal component analysis you could find the plane on which they lie; in fact, the eigenvector of the third (zero) eigenvalue gives you the normal to that plane.

(Refer Slide Time: 27:11)

That is why any projection along that direction is 0, while e1 and e2 give two directions, vectors lying on that plane; these are the new axes, and within that plane you can once again express the coordinates of each data point. This is how, using principal component analysis, you can find the lower-dimensional subspace on which the data points lie.
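
As a check of this kind of computation, the sketch below runs the same procedure on a hypothetical set of five points chosen to lie on the plane x + y + z = 10 (these are not the actual points of the slide) and verifies that the smallest eigenvalue of the covariance matrix is zero and that its eigenvector is the normal to the plane.

import numpy as np

# hypothetical 3-D points chosen to satisfy x + y + z = 10 (columns of X)
X = np.array([[1.0, 2.0, 4.0, 3.0, 5.0],
              [2.0, 6.0, 1.0, 4.0, 3.0],
              [7.0, 2.0, 5.0, 3.0, 2.0]])

mean = X.mean(axis=1, keepdims=True)
Xc = X - mean
C = (Xc @ Xc.T) / X.shape[1]          # 3 x 3 covariance matrix
vals, vecs = np.linalg.eigh(C)

print(np.round(vals, 6))              # the smallest eigenvalue is (numerically) 0
print(np.round(vecs[:, 0], 4))        # its eigenvector lies along (1,1,1)/sqrt(3), the plane normal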

So, let me stop here and we will continue this topic in the next lectures.

Thank you very much for listening to my talk.

Keywords: Principal component analysis, dimensionality reduction, maximizing variance.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 52
Dimension Reduction and Sparse Representation Part - II

(Refer Slide Time: 00:22)

In the previous lecture, we discussed the computation of principal component analysis and how we can find lower-dimensional representations of data points which are represented in a particular space. Now, let us consider the different kinds of applications that are possible using PCA. I will consider three major applications here.

One of them is data compression. As you have already seen, you do not actually require that many dimensions to represent the data; that itself gives an efficiency of representation, requiring less storage. PCA provides an optimum set of orthonormal basis vectors for a set of data points. We have seen in the case of image transforms that orthonormal basis vectors are very convenient for transforming data points into another space, and if the basis vectors are properly chosen, you can reduce the redundancy of the representation.

In PCA you have that advantage: it is an optimum set of orthonormal basis vectors giving that kind of transformation, but it is data dependent, and that is one issue here. For every set of data points you need to perform this analysis and compute a new set of orthonormal basis vectors, which is not very convenient from the point of view of data compression, because along with the compressed data you also have to convey the information about the orthonormal basis vectors, which is an overhead.

These basis vectors are called the 'Karhunen-Loeve' basis, and the transform is called the Karhunen-Loeve transform, which is an optimal representation of data, as I mentioned. In fact, many standard basis vectors can be shown to be eigenvectors of certain statistical representations of signals or images.

For example, type-2 DCT basis vectors can be shown to be approximately the eigenvectors of a matrix with (j, k)th entries r^|j-k|, as shown here. This entry represents the correlation between samples j and k; if j and k are far apart, the adjacency is less and the correlation should be less.

Because r is a value less than 1, as the magnitude of j - k increases this value decreases, so you have less correlation, whereas adjacent samples are expected to be very highly correlated. With this kind of model, it has been shown that the eigenvectors are almost the same as the type-2 DCT basis vectors, and that is why the type-2 DCT is so efficient for representing a large class of signals and images.

So, as I mentioned, this is the covariance matrix of a very useful class of signals, where r is a measure of correlation between adjacent samples and its value is close to 1. It tries to model natural images and natural signals with these kinds of statistics.
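
This connection can be illustrated numerically: the sketch below builds the matrix with entries r^|j-k| for a value of r close to 1 and computes its eigenvectors, which turn out to be very close (up to sign) to type-2 DCT basis vectors. The size and the value of r are arbitrary choices for illustration.

import numpy as np

n, r = 8, 0.95
j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
R = r ** np.abs(j - k)                 # correlation matrix with entries r^|j-k|

vals, vecs = np.linalg.eigh(R)
vecs = vecs[:, ::-1]                   # order eigenvectors by decreasing eigenvalue

def dct2_basis(u, n):
    # type-2 DCT basis vector of index u, normalized, for comparison
    v = np.cos((2*np.arange(n) + 1) * u * np.pi / (2*n))
    return v / np.linalg.norm(v)

print(np.round(vecs[:, 1], 3))
print(np.round(dct2_basis(1, n), 3))   # close (up to sign) to the second eigenvector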

(Refer Slide Time: 04:02)

The other advantage of PCA, used in different applications, is that it de-correlates the components. We have already seen in the example that the covariance matrix shows high correlation between different components of the original data, but after performing PCA those correlations are largely reduced; ideally they would be 0.

That is how it de-correlates the components. One application of this property is finding a new color space: when color images are represented in the RGB color space, the components are highly correlated. There is a work, referenced here, done by Ohta, Kanade and Sakai. What they did was take a large number of color images, perform PCA using different blocks of the color images, and find the eigenvalues and eigenvectors; those eigenvectors give you the new transformation space.

In fact, they found that the color components transform using those eigenvectors approximately as (R+G+B)/3 for the first component, R - B for the second component and (2G-R-B)/2 for the third component; these are the principal components from their principal component analysis. As you can see, this is nothing but a color transformation where the first component indicates the intensity value and the other two are chromatic components, and this has been obtained through PCA itself. The major applications of PCA arise when you have a larger number of components in the images.
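
A pixel-wise sketch of the Ohta et al. transform described above is given below, assuming only that the RGB image is stored as a floating-point array with the last axis holding the three channels (an illustration, not code from the lecture).

import numpy as np

def ohta_components(rgb):
    """rgb: array of shape (..., 3) holding R, G, B values."""
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    I1 = (R + G + B) / 3.0        # intensity-like component
    I2 = R - B                    # first chromatic component
    I3 = (2.0 * G - R - B) / 2.0  # second chromatic component
    return np.stack([I1, I2, I3], axis=-1)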

Color images have only three components, but if I consider remote sensing images, the number of components can be very large; there are different kinds of remote sensing images depending on the electromagnetic wavelength bands they use: multispectral, hyperspectral and ultraspectral remote sensing images. There can be many bands: multispectral images can have tens of bands, hyperspectral hundreds of bands, and ultraspectral thousands of bands.

With so many bands, how do we get an efficient data representation? At every pixel, if the number of bands is N, the data dimension is N.

So these kinds of dimension reduction become very useful: after reduction we keep only the de-correlated components and use them for information analysis, for further processing, or to correlate with different ground truth or ground-level information.

So, PCA is useful here to highlight the de-correlated information.

(Refer Slide Time: 07:41)

This is one example where I am showing the PCA components of a hyperspectral image. You can see that it starts from this one, the component corresponding to the maximum variance, the maximum eigenvalue, where you have lots of information, a lot of detail, and this one corresponds to the minimum eigenvalue shown, in this case actually the 12th eigenvalue.

The number of bands in this case is not specified; I suppose it is around 20. This is the minimum one shown, and in fact almost no detail is visible there. That is the advantage of PCA: you can prioritize, or give preference to, those components which carry more information.

Say this is PCA component 1, this is PCA component 2, then the third, and so on; as you progressively go over the PCA components, from top to bottom and left to right in this kind of scan, you will find that the details slowly die out and the image becomes almost a smooth region. So, you can use the first few components to analyze the information, since they represent most of it, while the later components carry little detail; this removes the data redundancy in the representation.

(Refer Slide Time: 09:29)

The third application of PCA is factor analysis, where it highlights the de-correlated factors. You can find the underlying factors; even in the color example I have shown, it finds the intensity factor and the chromatic factors. This factor analysis is very useful for classification. For example, consider eigenvectors for representing human faces.

What we do here is consider a large set of images of human faces, cropped to the same size following certain rules; it is not just simple cropping, you try to keep the different parts of the face at similar distances from the top. Then you perform PCA, and any arbitrary face can be expressed as a linear combination of eigenfaces; that means the eigenvectors you get from PCA are called eigenfaces here, and the coefficients of the linear combination represent an arbitrary face.

(Refer Slide Time: 10:49)

This is one example showing four eigenfaces. So, any arbitrary face can be represented, at least approximately, by a four-dimensional vector using this PCA, using these factors, and subsequently we can use it for classification.
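
Since eigenfaces are just PCA applied to vectorized face images, the representation can be sketched as below, with random arrays standing in for actual face data (purely illustrative).

import numpy as np

rng = np.random.default_rng(0)
faces = rng.random((100, 32 * 32))        # 100 hypothetical faces, each flattened to a vector

mean_face = faces.mean(axis=0)
A = faces - mean_face                     # faces translated to the mean
# eigenfaces = principal directions of the face set (right singular vectors of A)
_, _, Vt = np.linalg.svd(A, full_matrices=False)
eigenfaces = Vt[:4]                       # keep the first 4 eigenfaces

new_face = rng.random(32 * 32)
coeffs = eigenfaces @ (new_face - mean_face)   # 4-D representation of the face
print(coeffs)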

(Refer Slide Time: 11:08)

So, this is another application: we use factor analysis and use that representation for classification or other high-level processing. This is the pipeline: you perform factor analysis, and then that representation can be used for classification.

(Refer Slide Time: 11:31)

I will now discuss another type of dimension reduction; here the objective is more for the purpose of classification. In the previous slide I have shown how PCA can be used for factor analysis and how those factors can be used for classification. But it is not necessarily true that the representation by those factors is efficient for classification, particularly if you are considering linear discriminant functions.

So, we will consider the kind of dimension reduction where this linear discrimination becomes simple; in fact, it is a very simple discrimination, since you are reducing the data to just a one-dimensional component. This linear discriminant is known as Fisher's linear discriminant, after its inventor Ronald Fisher, a very famous statistician. The point here is that PCA captures the direction of maximum variance of the dataset; if the dataset is labeled, this direction may not capture the maximum separation between the groups of data points with different labels, and that is what Fisher's linear discriminant aims at.

I will explain this with respect to this diagram: consider two groups of labeled data points, one shown by triangles and the other by ellipses of a different color.

Now, the direction of maximum variance is the dotted line; along that direction the projections of all these data points have maximum variance. But, if I take the projections, you can see here that within these intervals the projected data points are intermingled. So, they are not well separated; they cannot be segregated using a simple interval rule, they cannot be discriminated by that rule.

Whereas, if I consider another direction, say this new direction, and take the projections, now you can see that all the projected points of this group lie within this interval and all the projected points of the other group lie within that interval, along this particular direction of projection. So, they are well separated.

So, this shows that the direction of the principal component is not really providing you a good separation between the data points. As we mentioned, the data is well separable, but not along the direction of the principal component.

(Refer Slide Time: 14:51)

So, let me explain the computational problem that is involved in this particular analysis, Fisher's linear discriminant analysis.

So, consider a set of data points as we have considered in the previous cases also. Out of those, there are N1 data points which are in class w1, and there are N2 data points which are in class w2. So, naturally we consider N1+N2=N, which is the total number of data points. Now, consider a line with direction u, because our objective is to get a direction along which the separation is maximum. So, we should consider the projection of a data point xi on u.

The projection operation can be expressed as the dot product of the two vectors xi and u, or in matrix representation as the matrix multiplication xiTu. That would give you the projected data point yi; it is nothing but a 1-dimensional subspace which represents the data, so all the projected points lie on that line.

(Refer Slide Time: 16:18)

Now, we are trying to measure how the projected data is separated, because with that measure we can then formulate a problem of maximizing the measure to get well separated data. In this case let us consider that the mean of the data points in class w1 is m1, and the mean of the data points in class w2 is m2. The projections of these means along the direction u can be computed as m1Tu and m2Tu.

So, figuratively we can show them here as my1 and my2, and one measure of separation could be the separation of these two projected means; that means, how far apart they are. We want this value to be large for two well separated groups.

So, if you consider the absolute difference between these two values, that would give you a separation measure; but the problem here is that it is not capturing the variance of the data. Some data could be very widely spread over an area and some could be very closely spaced after projection. Ideally you would like the projected data points of each class to be closely spaced around their mean, and the two groups to be widely separated; those are the desirable cases for this projection, but that is not captured simply by the value D.

(Refer Slide Time: 18:13)

So, what we do in that case is normalize this D by a factor proportional to the class variance. In fact, we call this factor the scatter. The scatter of data belonging to class C is defined in this way on the projected space:

s_C^2 = \sum_{y \in C} (y - m_C)^2

So, it is nothing but the sum of squared deviations of the projected points from the mean of the class. And, it is proportional to the variance, because it is nothing but the product of the class variance and the number of samples. As mentioned, m_C here represents the mean of class C in the projected space and small s squared is the scatter.

So, a good measure of separation could be

J(u) = \frac{D^2}{s_1^2 + s_2^2}

And, our objective is to maximize this value J(u) with respect to u, to get a direction such that the separation of the projected means is large and the scatter of the projected samples is small.

(Refer Slide Time: 19:44)

Let us also define the scatter matrix of the original data. The scatter matrix of the samples of class C in the original space is defined in the same spirit. But here the space is multidimensional; earlier we had only a one-dimensional subspace for the projections, so we used the standard variance-like definition with squared deviations. Now you have to use the corresponding outer products to define a scatter matrix, so this is represented as

S_C = \sum_{x \in C} (x - m_C)(x - m_C)^T

(Refer Slide Time: 20:33)

So, the within-class scatter matrix is defined as the sum of the scatter matrices S1 and S2 of the two classes. For S1 you consider once again that definition; that means, with m1 as the mean of class w1, you sum the outer products (x - m_1)(x - m_1)^T over the N1 samples of the class.

So, we can show how the projected scatter is related with this within-class scatter matrix. You can see that u^T x is the projected sample of x along the direction u, and similarly u^T m_1 is the projected mean of the samples along direction u. So, the projected scatter itself can be written as

s_1^2 = \sum_{x \in w_1} u^T (x - m_1)(x - m_1)^T u

So, you can simplify this algebraic form and write it in this way. Note that since each term is one-dimensional, whether you put the transpose on the first factor or the second does not matter; we have taken the convenient form for the sake of the final derivation. From here you can take u^T and u outside the sum, and what remains is the scatter matrix of the original samples of class w1.

For the same reason, instead of taking u^T (x - m_1) here, you could have taken the transposed form; that is also correct. But the advantage of this notation is that we can then bring the scatter matrix nicely within u^T and u.

And so, from here we can see that the small s_1^2 is related with the scatter matrix of that class: s_1^2 is nothing but u^T S_1 u. Similarly, s_2^2 would be u^T S_2 u, and if I add them, s_1^2 + s_2^2 can be written as u^T S_W u, where S_W is the within-class scatter matrix as we have defined.

(Refer Slide Time: 24:08)

The between-class scatter matrix is another definition. This is the scatter matrix formed by the means of the classes: S_B = (m1-m2)(m1-m2)^T, the outer product of the difference of the two mean vectors. So, the difference vector of the means defines this scatter matrix, and here also we can show that the separation measure is related with it.

So, D^2, which is nothing but the squared deviation of the two projected means, can be written in this form. Once again we have conveniently used the outer-product matrix representation for the square of a one-dimensional term:

D^2 = (m_1^T u - m_2^T u)^2 = u^T (m_1 - m_2)(m_1 - m_2)^T u = u^T S_B u

The middle part is nothing but S_B, so you can write it in this way, and that is how S_B is related with D^2. So, finally, we can rewrite the optimization function: we have to maximize the factor

J(u) = \frac{D^2}{s_1^2 + s_2^2} = \frac{u^T S_B u}{u^T S_W u}

(Refer Slide Time: 25:51)

So, this is the optimization problem: you have to maximize it, and once again u should be a unit vector, that is the constraint.

So, it can be shown that u should satisfy

S_W^{-1} S_B u = \lambda u, assuming S_W is invertible

I am not giving you the derivation, I am providing you the final solution: u is an eigenvector of this particular matrix S_W^{-1} S_B. And, since you want maximization, you have to consider the eigenvector corresponding to the maximum eigenvalue.

And one interesting thing is that S_B u is already a vector along (m_1 - m_2); this can be shown easily by considering the expansion of S_B. In S_B u = (m_1 - m_2)(m_1 - m_2)^T u, the part (m_1 - m_2)^T u is the dot product of the difference vector with u, which is some scalar value, say k.

So, finally, this expression is nothing but a scalar times the vector (m_1 - m_2); hence the required eigenvector direction reduces to this difference vector. In that case the solution is

u = S_W^{-1} (m_1 - m_2)

and, in this way you get the direction along which you get the maximum separation of the projected samples; this is how Fisher's linear discriminant works.
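To make the computation concrete, here is a minimal numpy sketch of this closed-form solution, assuming each class is stored as an array with one sample per row; the function name and data layout are my own choices, not from the lecture.

import numpy as np

def fisher_direction(X1, X2):
    # X1, X2: (samples x features) arrays for class w1 and class w2
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class scatter matrices: sum of (x - m)(x - m)^T over each class
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    Sw = S1 + S2
    # closed-form optimum of J(u): u proportional to Sw^{-1}(m1 - m2), assuming Sw invertible
    u = np.linalg.solve(Sw, m1 - m2)
    return u / np.linalg.norm(u)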

(Refer Slide Time: 27:47)

So, let me explain it with respect to an example. Consider a set of data points, and now we are considering a labeled set of data points. I have given you two such sets, X1 and X2, and we have to perform linear discriminant analysis, actually Fisher's linear discriminant analysis (FLD), to get the optimum direction.

And, we should also check the separability along the line of projection. As an alternative we will perform PCA on the whole data set, because PCA does not consider class information; it considers the whole data set, gets the dominant principal direction, and we check the separability of the projected points on it.

(Refer Slide Time: 28:38)

So, for linear discriminant analysis, here Fisher's linear discriminant, we consider this representation of the data as two matrices: X1, one group of data where all column vectors are data points, and X2, where again the column vectors are the data points of group 2.

So, the mean of X1 is given here, that is m1, and the mean of X2 is given here, that is m2 in the notation we discussed earlier. This is the scatter matrix for X1, so S1 is the scatter matrix for X1; similarly, you compute the scatter matrix S2 for X2, and the within-class scatter matrix can then be formed. This is the definition of S1 which we have already discussed. So, the within-class scatter matrix is S1+S2, which is given here. And, now for the solution you have to take SW-1 and multiply it with m1-m2; that would give you the direction.

So, this is your u, SW-1(m1-m2), and it is found that the u value is as given here. It could be a unit vector or any scaled vector, because it is just a projection, so the magnitude of the vector does not matter.

(Refer Slide Time: 30:08)

So, if you would like to study the separability, you take Y1 = X1Tu; that is, you take the projections of all data points in X1 with respect to u, and you can perform all these operations simultaneously using this matrix multiplication. Similarly, you take the projections of all data points of X2 with respect to u.

So, you will get them as column vectors; there are three data points here. These are the projections for Y1 and these are the projections for Y2, and you can see that their intervals are well separated, because the range of Y1 is 19.31 to 22.2 whereas the range of Y2 is 5.43 to 9.55.

(Refer Slide Time: 30:55)

Whereas, if we perform principal component analysis on the whole data set, which is represented here by the matrix X whose column vectors are the data points, we have considered the union of the data points of X1 and X2. For principal component analysis, as we did earlier, first we compute the mean vector, then we compute the covariance matrix of the data.

So, the sum of the diagonal elements gives you the total variance of the data. Then we compute the eigenvalues and eigenvectors, shown here in order of their values. The maximum eigenvalue is 72.96 and the corresponding eigenvectors are e1, e2, e3; we are interested in the direction of the maximum eigenvalue, or maximum variance. So, the direction of the principal component is given by e1.
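A minimal numpy sketch of this alternative, reusing fisher_direction from the earlier sketch; the labeled arrays X1 and X2 below are made-up data chosen by me, so the printed projection intervals are only for illustration, not the numbers on the slide.

import numpy as np

# hypothetical labeled 3-D data, three samples per class (rows are samples)
X1 = np.array([[4.0, 2.0, 9.0], [5.0, 3.0, 8.0], [4.5, 2.5, 9.5]])
X2 = np.array([[1.0, 7.0, 2.0], [0.5, 6.0, 3.0], [1.5, 8.0, 2.5]])
X = np.vstack([X1, X2])                 # pooled data; PCA ignores the labels

C = np.cov(X, rowvar=False)             # covariance matrix of the pooled data
vals, vecs = np.linalg.eigh(C)          # eigenvalues returned in ascending order
e1 = vecs[:, -1]                        # principal direction = largest-eigenvalue eigenvector

u = fisher_direction(X1, X2)            # Fisher direction from the earlier sketch
print(np.sort(X1 @ e1), np.sort(X2 @ e1))   # projection intervals along the PCA direction
print(np.sort(X1 @ u), np.sort(X2 @ u))     # projection intervals along the Fisher direction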

(Refer Slide Time: 31:58)

So, if I take that direction and check the separability of the projected samples, performing the same projections on that direction, these are the values we get. We will see that, approximately, the groups are well separated; but compared to the previous case, there is at least one sample, 13.98, which lies in, or very close to, the interval of the projected data points of the second group.

So, the separation is not that good; it is not really overlapping here, but it is very close to that interval. That shows the utility of Fisher's discriminant analysis. So, with that let me stop here and I will continue this topic in the next lectures.

Thank you very much for listening to my talk.

Keywords: Fisher linear discriminant, between-class variance, scatter matrix, eigenfaces.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 53
Dimension Reduction and Sparse Representation Part – III

In previous lectures we discussed about dimension reduction of data points.

(Refer Slide Time: 00:23)

Now, I will be discussing another issue related to data representation, which is called sparse representation. So, let us understand the problem statement involved in this particular issue. Consider we have a dictionary of N elementary n-dimensional vectors; we call it a dictionary, and it is something like a set of basis vectors.

But in a set of basis vectors we would like to have only those basis vectors which are sufficient to represent the data, whereas here we consider a redundant set of basis vectors; that means, we have more basis vectors than would be required to represent the data.

And the term we will be using, instead of set of basis vectors, is that we will call this particular collection a dictionary, as if you have a lot of redundant vectors and you would like to conveniently use them for representing your data or signal.

Now, the elements of this dictionary are called atoms; that is the terminology we will be using. Consider any arbitrary n-dimensional vector. The problem here is to compute its best linear approximation using a subset of D as basis vectors. As I was mentioning, D is an over-complete set of basis vectors, which means there are many redundant vectors and it is not required that all the vectors be used for representing the data.

So, the linear combination over a subset of D can be expressed mathematically in this form: X, your input vector, should be very close to a linear combination given as the sum of each chosen dictionary element multiplied by a scalar coefficient a_j, where the d_j's are the elements of the set S considered for representing X, and S has to be a subset of the dictionary.

In an n-dimensional space it is sufficient to have n linearly independent basis vectors; you can then use them to represent any arbitrary vector X. So, here you have an over-complete representation, and you desire that the number of vectors used should be small; that is the sparsity we are considering.

You require only a few basis vectors from D to represent X; we do not require N basis vectors, and this number should be much smaller than n. That is what we are trying to achieve with this kind of representation, so the idea is that the number of atoms used should be minimum. So, for this particular problem statement you would like to have that subset S whose cardinality is minimum and whose reconstruction is as close as possible. So, it is an approximation.

(Refer Slide Time: 03:34)

So, to make it more precise, regarding how close it should be: the first option is exact reconstruction. You find the minimum cardinality of S, the minimum number of atoms, which gives you the exact reconstruction. So, this could be one problem, and it is a very precise problem statement.

Or, what you can do is keep the number of atoms fixed and ask what is the best representation; that means, the error between the approximation of X and the original X should be minimum, so the approximation should be as close as possible.

(Refer Slide Time: 04:11)

So, to summarize, this is the problem of approximating a signal with the best linear combination of elements from a redundant dictionary. It should give an optimal or near-optimal representation, and the computation should be fast. These are the desirable properties. The dictionary also should be optimal, and sometimes there is a joint optimization problem: you would also like to have a dictionary of an optimal size, so that it gives the optimal sparse representation of the data.

(Refer Slide Time: 04:46)

So, to elaborate this optimization task, we can consider this mathematical formulation: we have to minimize the approximation error using the L2 norm, using m terms. You know the L2 norm is basically the Euclidean distance in the n-dimensional space. You have a dictionary D which is a set of di's, each one an atom. Just to show what the L2 norm means, suppose a vector is represented as x1, x2, ..., xn.

So, the L2 norm is

\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}

Consider the error vector between X and its approximation; the optimization is

\min_{S \subset D,\ |S| = m}\ \min_{a_k}\ \left\| X - \sum_{k=1}^{m} a_k d_{i_k} \right\|_2, \quad d_{i_k} \in S

So, the problem is that you should find that set of coefficients, with the total number of atoms exactly equal to m, such that you get the minimum error, the minimum L2 norm of this difference.

(Refer Slide Time: 06:24)

So, this slide explains the corresponding components of this expression: this is the data vector, this is the linear combination, and this is the constraint fixing the number of atoms.

(Refer Slide Time: 06:38)

Suppose I have given you a set of atoms which is a subset of the dictionary D. Now, how can I get the best reconstruction in terms of a linear combination of these atoms for an arbitrary vector X? We can see that we can pose it simply as a least square error estimation problem. We construct a matrix B, which relates to that set of basis vectors, where each element of S is a vector.

So, this B is formed in this way: each column vector of B is a corresponding atom of S. The dimension of B should be n x m, as the dimension of each atom of D is n. Each column vector is of length n, so the number of rows is n, and since there are m atoms, we have m columns.

So, the dimension is n x m, and the linear combination of these atoms can be conveniently represented in this form. Here a1, a2, ... are all scalar quantities, and you form a_1 d_{i_1} + a_2 d_{i_2} + ... + a_m d_{i_m}; proceeding like this you get the linear combination.

So, this set of coefficients forms a column vector; let me denote it as Y. Then the linear combination can be represented as the multiplication of B and Y, where B is the set of basis vectors taken from the dictionary, each basis vector being a column of B, and Y is the corresponding set of coefficients, one coefficient with respect to each atom. So, Y is the representation of X given this dictionary B.

So, Y is the transform of X, and this should be sparse; we would like to have Y as sparse as possible. Now, how to get the best approximation for m elements? As you can see, you can convert it to a least squared error estimation, because B is given and X is given, and what you have to find out is Y.

So, minimize the norm ||X - BY||^2, and that Y will give you the optimal representation of X with respect to these constraints. We know this solution, we have derived it a number of times: multiplying by B transpose, Y can be written as

Y = (B^T B)^{-1} B^T X

So, this is the solution of this particular problem; this is how you can get Y out of this process.
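A minimal numpy sketch of this step, assuming the selected atoms are the columns of B; np.linalg.lstsq computes the same least-squares solution as the closed form above, just more stably. The function name is my own.

import numpy as np

def best_coefficients(B, x):
    # B: n x m matrix whose columns are the m selected atoms; x: n-vector to approximate
    # closed form Y = (B^T B)^{-1} B^T x; lstsq solves the same problem numerically
    y, *_ = np.linalg.lstsq(B, x, rcond=None)
    return y            # coefficients of the best linear combination, B @ y ~ x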

(Refer Slide Time: 10:56)

So, if I give you the subset, you can then use the least square error estimation method to get a representation Y for a particular X. But which subset should you consider; which set of m atoms should you choose from the dictionary? That is the basic problem. If I know the set of atoms which gives the minimum reconstruction error, then for our sparse representation I can easily get the solution.

So, there are various approaches to select those atoms which will derive this kind of sparse representation. These approaches are called pursuit approaches, and there are two major ones: one is orthogonal matching pursuit, the other is basis pursuit.

(Refer Slide Time: 11:50)

In orthogonal matching pursuit, or in short OMP, the algorithm is an iterative greedy algorithm; what it does is select at each step the dictionary element which is best correlated with the residual part of the input vector. We will see what this means: it is successively refining the representation, like successive approximation.

You take the original input vector X and find the dictionary element along which its component is maximum; that means, that vector represents the maximum part of the signal. So, you choose that dictionary element, and that component along it is your representation so far.

Next, you subtract the part already represented by that vector; what remains is called the residual. Then you carry out the same operation with the residual: with the residual you again find out along which direction you get the maximum component.

Then you add that element to your selected set and use the whole set for a least squared error estimation; as I mentioned, if I give you a set of basis vectors I can easily derive the best linear combination representing the vector. So, use that technique to derive the best representation, and continue doing this for m atoms, or for as long as you want, to represent the data.

So, it produces a new approximation by projecting the input onto the span of the dictionary elements that have already been selected, as I described. Every time you consider the residual part of the input vector, you project it on the dictionary elements which have not been used already.

And the dictionary element which gives you the maximum value of projection should be included into your set S; then the set S is used to compute the representation. It extends the trivial greedy algorithm that succeeds for an orthonormal system; that is its characteristic.

(Refer Slide Time: 14:14)

The other approach, which is called basis pursuit, is a more sophisticated approach. What it does is recast the sparse approximation problem as a linear programming problem; I will elaborate that problem statement.

(Refer Slide Time: 14:29)

Before describing the orthogonal matching pursuit algorithm, let me describe a similar algorithm which came before it: the simple matching pursuit algorithm. The idea here is that at every step you just look for that element of the dictionary along which you get the maximum residual component, and use that element itself to extend the representation of the vector.

So, the objective is to minimize the approximation error using the L2 norm with m terms. Since it is an iterative process, at the kth iteration you have the residue r_k = X - a_k, where a_k is the approximation built so far.

So, the initialization is that the residue is the whole input vector itself, r_0 = X, and so far you do not have any approximation, a_0 = 0. Then at every kth step, as I mentioned, you find that element of the dictionary along which you get the maximum projected component of the residual.

That is, you compute the projection of the residue along each dictionary element and consider only that dictionary element for which this value is maximum. Say i* is that index; now use it for representing the signal: the projected value is the magnitude, and d_{i*} is the direction of the vector being added.

So, the signal represented by a_k, the approximate representation at the kth iteration, is built incrementally: to the previous approximate representation you add this vector. The projection value is the magnitude, the component present in the residue, and d_{i*} gives the direction; that is (assuming unit-norm atoms),

a_k = a_{k-1} + (r_{k-1}^T d_{i*}) d_{i*}, \qquad r_k = r_{k-1} - (r_{k-1}^T d_{i*}) d_{i*}

So, you add it to a_{k-1}, the previous approximation, and get the new approximate representation; the residue r_k at the kth step follows, and this residue is used in the next iteration for finding the next refinement of the approximation. You repeat this process till you get a good approximation, and you should note that this is equivalent to the computation r_k = X - a_k.

So, I can either subtract the approximate representation from the total signal to get the residue at this stage, or I can build up the residue incrementally. Earlier I had the residue up to the (k-1)th stage, and out of that, this part is now accounted for: the projected component of the residue r_{k-1} along d_{i*} has already been taken care of, it has been transferred to the approximation part.

So, the residual part gets reduced; the residue slowly decreases and tends to 0. When it becomes 0 you have an exact representation; otherwise, whatever part remains is the error of the approximation, and it remains as the residue. Now, the problem of matching pursuit, as you can see, is that in this process it may select the same dictionary element again and again.

We have not put any restriction; we are searching over the whole dictionary every time. So, at some stage you can again get the same dictionary element, and if I am trying to restrict to m terms, I also have to keep a separate count of how many distinct elements are presently taken care of in my representation of this signal.
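Here is a minimal sketch of plain matching pursuit in numpy, assuming the dictionary atoms are the unit-norm columns of D; the function name and interface are my own.

import numpy as np

def matching_pursuit(x, D, m):
    # D: n x N matrix with unit-norm atoms as columns; m: number of greedy steps
    r = x.astype(float).copy()          # r_0 = X
    a = np.zeros_like(r)                # a_0 = 0
    for _ in range(m):
        proj = D.T @ r                  # projection of the residual on every atom
        i = int(np.argmax(np.abs(proj)))
        a = a + proj[i] * D[:, i]       # move the explained part into the approximation
        r = r - proj[i] * D[:, i]       # shrink the residual; note r = x - a throughout
    return a, r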

(Refer Slide Time: 19:02)

In the orthogonal matching pursuit algorithm we take care of that factor, so that elements of the dictionary do not appear repeatedly. Because we will be performing least square estimation here, the residue will always be orthogonal to the selected set of dictionary atoms; that is the property we ensure by doing least square error estimation. The process goes like this: first you initialize, r0 = X and a0 = 0.

So, this is the same initialization, and you also maintain the set of dictionary elements which will reconstruct the signal. At the kth step you first find the direction along which you get the maximum component of the residue, in the same way as we did for matching pursuit, and once you have chosen it, you put it into the set of atoms that will be used for representing the input vector X.

Then we find the best linear approximation using the least square error estimation method. That is the difference from matching pursuit, and by doing it you ensure that all the atoms selected in the set Sk at the kth iteration are orthogonal to the residue rk at that stage. The residue is computed like this, and as I mentioned, this minimization can be performed with standard least square techniques.

So, this is the residue, and we have ensured through this step, by enforcing least square error estimation, that rk is orthogonal to all the elements of Sk. That is why in the next iteration you do not need to bother about the already selected elements.

You only choose among the remaining elements of the dictionary which are not within Sk, and you go on doing this till the mth term for the best m-term approximation; this is what orthogonal matching pursuit is. So, as I mentioned, OMP selects an atom only once, as the residual is always orthogonal to the selected set.
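A minimal numpy sketch of OMP along these lines, again assuming unit-norm atoms in the columns of D; it reuses the least-squares re-fit shown earlier, and the names are my own.

import numpy as np

def orthogonal_matching_pursuit(x, D, m):
    # D: n x N dictionary with unit-norm atoms as columns; m: number of atoms to select
    r, S = x.astype(float).copy(), []
    for _ in range(m):
        i = int(np.argmax(np.abs(D.T @ r)))          # atom best correlated with the residual
        if i not in S:
            S.append(i)
        B = D[:, S]                                  # selected atoms so far
        y, *_ = np.linalg.lstsq(B, x, rcond=None)    # best coefficients for those atoms
        r = x - B @ y                                # residual orthogonal to span of selected atoms
    return S, y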

(Refer Slide Time: 22:03)

The basis pursuit problem is different from the orthogonal matching pursuit problem. It minimizes the approximation error using an L1 norm, and that L1 norm is on the coefficients of the representation, that is, the coefficients of the linear combination. This gives a convex function and hence it can be minimized in polynomial time; as you can see in the problem statement, you are trying to minimize the sum of the absolute values of the coefficients representing the signal.

By that you achieve a sparse representation, and there exist different approaches to solve this problem. We are not going to discuss any such solution; it involves complex mathematics to argue about those solutions. So, in this particular case we will be considering only orthogonal matching pursuit for getting a sparse approximate representation of a signal using a dictionary.

(Refer Slide Time: 23:13)

So, let me solve some exercises to elaborate these computations. Consider this problem: you are given a set of three basis vectors, and you have to show that they form an orthonormal set of basis vectors; then, given this orthonormal set, represent the vector (1, 2, 3) as a linear combination of the above set.

(Refer Slide Time: 23:46)

Now, for checking an orthonormal set of basis vectors, you take any pair of distinct vectors and their dot product should be 0, and the magnitude of every vector should be equal to 1. That is how we can test it. Let me give you one example: suppose I have taken this vector and this vector.

So, the dot product of these vectors is (1/√3)(-1/√6) + (-1/√3)(1/√6) + (1/√3)(2/√6). You can see that the two negative terms give -2/(√3 √6) and the last term gives +2/(√3 √6), so they sum to 0.

In this way you can choose any pair of vectors and find that their dot product is 0. And if you take the magnitude of any of these vectors, say this one: the squares of its components are 1/6, 1/6 and 4/6, their sum is 1, and the square root of that is 1. This is how you can show that these vectors are orthonormal.

(Refer Slide Time: 25:52)

Consider the other exercise: how to represent an arbitrary vector, here (1, 2, 3), in this basis. What we do is take the component of the vector (1, 2, 3) along each of these orthonormal basis vectors.

So, the component is 2/√3 along this vector, 7/√6 along this vector, and 3/√2 along this vector. Finally, the linear combination can be expressed as you see: these dot products provide the coefficients of the linear combination. We have discussed this theory in my lecture on image transforms, so you can revisit that lecture once again.

So, the point is that if I give you an orthonormal set of basis vectors, then you do not require any of the complex operations that we discussed for matching pursuit. Simply take the dot products with the arbitrary vector, and those values themselves give you the coefficients of the linear combination.
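A small numpy check of this exercise; the basis vectors below are read off the worked arithmetic above (so they are my reconstruction of the slide), stacked as the columns of U.

import numpy as np

U = np.column_stack([
    np.array([1.0, -1.0, 1.0]) / np.sqrt(3),   # first basis vector
    np.array([-1.0, 1.0, 2.0]) / np.sqrt(6),   # second basis vector
    np.array([1.0, 1.0, 0.0]) / np.sqrt(2),    # third basis vector
])
x = np.array([1.0, 2.0, 3.0])

print(np.allclose(U.T @ U, np.eye(3)))   # True: the columns are orthonormal
coeffs = U.T @ x                          # dot products give the coefficients directly
print(coeffs)                             # ~[2/sqrt(3), 7/sqrt(6), 3/sqrt(2)]
print(U @ coeffs)                         # reconstructs [1, 2, 3]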

(Refer Slide Time: 27:11)

Let me take another exercise. Consider a dictionary in 3-D space consisting of these atoms; you can see there are four atoms in this dictionary of 3-D space, so it is an over-complete dictionary, because in 3-D space three linearly independent vectors are already sufficient to represent any arbitrary vector. Derive the best representation of the vector (1, 2, 3) using two atoms of the above dictionary, following orthogonal matching pursuit; that is our problem statement.

(Refer Slide Time: 27:49)

So, we will apply the orthogonal matching pursuit algorithm. First, the selection of an atom: find the atom along which you get the maximum dot product, or component, value. You take the dot product with this atom and get 6, and that is the maximum; you can check over the other three atoms also. So, you select this atom for the representation; this value gives the magnitude and the direction is the vector (1, 1, 1).

So, the residue would be (1, 2, 3) minus this component along (1, 1, 1) transpose; for the convenience of representing it on one slide I am using the row-vector representation of these vectors. So, your residue at this stage becomes this vector. Now, the second selection: from the remaining set of atoms, you again perform the dot products with each of them.

And you can observe that minus 1 minus 1 1 that is giving you the maximum value of
these dot products and that value is also 6 here. So, now, you have this in your dictionary
you have to atoms 1, 1, 1 and minus 1 minus 1 1. So, use this a dictionary to obtain the
best linear combination representation of the vector and you can use as you can what you
can do that you from that best set of this is vector is B as we have done.

So, this is what is B and this is what is A in our representation and this is input this is what
was X in our representation. And then oh here actually there is a some confusion of this,
we are we are considering these values A these matrix is A here instead of N.
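Before the numbers on the next slide, here is a small numpy version of this least-squares step with the two selected atoms; the coefficients printed below come from my own computation and are shown only to illustrate the mechanics, not as the slide's figures.

import numpy as np

B = np.array([[1.0, -1.0],
              [1.0, -1.0],
              [1.0,  1.0]])              # columns: the selected atoms (1,1,1) and (-1,-1,1)
x = np.array([1.0, 2.0, 3.0])

y, *_ = np.linalg.lstsq(B, x, rcond=None)   # y = (B^T B)^{-1} B^T x
print(y)            # ~[2.25, 0.75] by this computation
print(B @ y)        # best two-atom approximation of (1, 2, 3), here (1.5, 1.5, 3.0)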

(Refer Slide Time: 30:06)

So, now you get the solution: you compute (A^T A)^{-1}, and using this pseudo-inverse you can compute the coefficients of the linear combination. Finally, the best linear combination of the dictionary elements using two atoms for (1, 2, 3) is as given here; this is your required solution, these are the linear combinations you are getting. So, with this let me stop at this point, and we will continue this discussion of sparse representation in the next lecture.

Thank you very much for listening.

Keywords: Sparse approximation, orthogonal matching pursuit, basis pursuit.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 54
Dimension Reduction and Sparse Representation Part – IV

We are discussing about Sparse Representation of signal or input data vector. And in the
last lecture we have discussed given a dictionary how we can obtain sparse representation
of a signal using different pursuit algorithms particularly we have discussed about
orthogonal matching pursuit algorithms.

(Refer Slide Time: 00:31)

Today I will discuss about another problem of sparse representation, here we would like
to learn the dictionary itself. So, in the previous case we have considered the dictionary is
given and given the dictionary of say K atoms we wanted to get a sparse representation
given any input signal of dimension n.

Now, in this case the problem is that we would like to learn a dictionary specific to a set
of data. So, given a set of data points here we have considered the symbol X as a set and
xi which is representing n dimensional data vector. Suppose you have N number of such
data vectors in that set, then what should be a dictionary D of K atoms so that it would
provide best possible sparse representation for each member of the set.

(Refer Slide Time: 01:41)

There could be several motivations for learning a dictionary. In particular, you would like to learn a dictionary adaptive to specific classes of signals or data of interest. That would make the dictionary suitable for a certain application and would give you relatively better performance compared to any standard fixed dictionary. The dictionary is learned from exemplars, and it also ensures that the sparse representation properties hold for this data.

Sometimes it may happen that if we use a fixed dictionary, a sparse representation of a given input may not be possible with that set of atoms. But, if we can tune the dictionary to a specific class of data, then most likely we will get a better sparse representation of the data. So, this is another motivation.

(Refer Slide Time: 02:58)

So, let us define the problem statement precisely for this dictionary learning. We consider each data point as an n-dimensional vector, and a set X which consists of N such data points; let us represent it in matrix form. So, we consider a matrix X where each column represents a data point and each column is of dimension n.

Then we consider a dictionary of K atoms, where each atom is also of dimension n; it has to be the same as the dimension of the data, since we would like a linear combination of these atoms to represent the data. That linear combination should be sparse, and the coefficients of the linear combination are represented by the vectors yi.

So, Y is another matrix where each column of Y is the sparse representation of the corresponding data point; each column of Y is a K-dimensional vector, because the data point gets a K-dimensional representation in this case. Our problem is that we would like to obtain a sparse Y in the K-dimensional real space such that X equals DY.

This expresses the linear combination of dictionary atoms for each data point, with the coefficients of the linear combination coming from the columns of Y; or it could be an approximate equality. Let us make a dimensional check of this fact: we are trying to factorize the input data matrix X into two factors, where one factor gives the dictionary of K atoms, and the other gives the N sparse representations for the N data points.

So, the dimension of X is n x N; the dimension of D is n x K, since n is the size of the column vectors, so there are n rows, and K is the number of atoms in the dictionary D, so there are K columns. Similarly for X, n is the number of rows because n is the dimension of a data point, and N, the number of data points, is the number of columns of X.

Similarly, Y should be K x N. K is once again the dimension of the sparse representation of a data point; that means, only a few of the elements in each column vector of Y should be nonzero and the rest should be 0. That is the interpretation of sparsity: very few of them should be nonzero.

But this dimension K could be quite large, even larger than n, the original dimension of the data, and N is the number of sparsely represented data vectors. So, you can see the multiplication DY means an n x K matrix multiplied by a K x N matrix; the dimensions match, because you get an n x N data matrix.

Let us also understand how this representation is to be interpreted. We consider the data vectors [x1, x2, ..., xN], the dictionary atoms [d1, d2, ..., dK], and write each column of Y, say y1, as the coefficients a11, a12, ..., a1K, and so on for the other columns.

So, x1 is a vector and y1 is the K-dimensional vector that represents it:

x_1 = \sum_{i=1}^{K} a_{1i} \, d_i

In this way, for all N vectors you have one column of Y. This is how these vectors are represented, and this is how the linear combinations of each column vector of X are represented by the corresponding matrix multiplication of D and Y.
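A tiny numpy check of this interpretation, with hypothetical sizes n = 8, K = 20, N = 100 chosen by me; the codes are dense random numbers here, only to verify the shapes and the per-column linear combination.

import numpy as np

n, K, N = 8, 20, 100
D = np.random.randn(n, K)      # dictionary: one atom per column
Y = np.random.randn(K, N)      # one code per column (not sparse here, shapes only)
X = D @ Y                      # (n x K)(K x N) -> n x N data matrix

# the i-th data vector is the linear combination of atoms weighted by column i of Y
i = 0
print(np.allclose(X[:, i], D @ Y[:, i]))   # True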

(Refer Slide Time: 10:10)

So, you can apply various sparsity constraints. As I mentioned, the K-dimensional representations of the input vectors, given in the matrix Y, should be sparse, which means there should be only a few nonzero elements. When I write a representation, say x1 as the sum of a1i di for i = 1 to K, only a few of these coefficients should be nonzero and the rest should be 0.

Some examples of these constraints are shown here. One is the 0th norm of y, where the 0th norm means the number of nonzero elements of y, and that should be minimum: your choice of dictionary atoms and of the sparse representation should be such that the number of nonzero elements of each sparse represented vector is kept as small as possible.

If I consider the whole matrix Y, you can count how many nonzero elements are there in that matrix; that can also be taken as a measure, and we would like to minimize it. We can consider exact reconstruction, X = DY, or approximate reconstruction: you may not get exact reconstruction of all the input vectors, so there could be a tolerance of epsilon in the reconstruction in the sense of the L2 norm, and there also you would like to minimize the 0th norm of the columns of Y.

(Refer Slide Time: 12:30)

We will discuss a particular type of dictionary learning algorithm which is called K-SVD, or K singular value decomposition. It addresses the same problem statement: given a set of training signals, obtain a dictionary of K elements that leads to the best possible representation for each member of this set under strict sparsity constraints.

To the sparsity constraints we mentioned, I would like to add that in this case, instead of always considering minimization of the 0th norm, we can place a bound: the 0th norm should be less than or equal to some constant T0, which is a small number. And, keeping this constraint, you would like to minimize the L2 norm of the reconstruction error.

(Refer Slide Time: 13:44)

So, the principle of this algorithm, K singular value decomposition or K-SVD as it is popularly known, is that it generalizes the K-means clustering problem. We know that in K-means clustering we get representatives of K groups of data vectors, and each representative acts like an atom in this context. You then get an extreme sparse representation if any member of a group is represented by that single atom only.

We will see how such a vector looks in terms of the Y matrix when we consider this extreme sparse representation. In K-SVD, instead of the single-atom representation of K-means clustering, we consider a sparse linear combination of the K atoms. So, what it does is: it chooses a dictionary of K atoms, then it obtains a sparse representation, then it updates the dictionary atoms to get a better representation, and repeats steps two and three till convergence.

So, once it obtains a sparse representation using these atoms, it checks whether a better sparse representation, with a better approximation of the input signals, is possible by updating some of the dictionary atoms. If it is possible, it updates and repeats the process till no such improvement is possible.

(Refer Slide Time: 15:41)

So, let us revisit the K-means clustering algorithm in the context of this particular discussion. Say you have a set of atoms which form a dictionary; you remember that we have to initialize the K cluster centers, and those initial centers are considered here as the atoms of the dictionary.

Then what we do is assign the training examples to their nearest neighbor in the dictionary D; that is what we did in K-means. Given an assignment, we then update D to better fit the examples. For K-means clustering we have used the L2 norm to compute the distance between a sample vector and the atoms, that is, the centers of the clusters.

Whichever is nearest, we assign the input vector to that cluster; once the assignment is over, you again have K partitions, and then you recompute the centers. That is the update process: update the mean of each partition of the assignment. We start with any initial set of distinct atoms, and after convergence those are the learned dictionary atoms.

(Refer Slide Time: 17:13)

So, your codebook here is the set of K cluster centers; those are the dictionary atoms. You have these training examples, and the representation of each is an extreme sparse vector: the representation of the ith vector would be of this form, with all elements 0 except a single 1. The 1 denotes the atom assigned to that input vector, that is, the input vector belongs to the group whose center is that atom.

That is why it is called extreme sparse: it has only one nonzero element and the rest are 0. So, in your sparse representation you have only this type of vector, an extreme sparse vector, and your optimization problem is the one we considered for K-means. Once again you are minimizing the square of the L2 norm, and your update policy is: if the ith input vector is closest to the rth dictionary atom, then assign to it the extreme sparse vector whose single nonzero element, 1, is at the rth location.

Once that assignment is made for all the vectors, you update each partition, which means you take the mean of those input vectors which have the same assignment; for example, those with the same assignment e_j, the sparse vector representation for the jth group, and you compute their mean, which becomes the updated dictionary atom.

You do this operation iteratively, and when it converges you stop; that is the dictionary you have learned, and the sparse representations have also already been derived as extreme sparse vectors for each input vector. As mentioned, it is the Frobenius norm of the reconstruction error that you should consider in this problem.
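A minimal numpy sketch of K-means viewed this way, with the data as columns of X and the atoms as columns of D; the initialization and loop length are arbitrary choices of mine.

import numpy as np

def kmeans_dictionary(X, K, iters=20, seed=0):
    # X: n x N data matrix (one sample per column); returns D (n x K) and the assignments
    rng = np.random.default_rng(seed)
    D = X[:, rng.choice(X.shape[1], size=K, replace=False)].astype(float)
    for _ in range(iters):
        # assignment step: nearest atom in L2 distance (the extreme sparse code)
        d2 = ((X[:, None, :] - D[:, :, None]) ** 2).sum(axis=0)   # K x N squared distances
        labels = d2.argmin(axis=0)
        # update step: each atom becomes the mean of the samples assigned to it
        for k in range(K):
            if np.any(labels == k):
                D[:, k] = X[:, labels == k].mean(axis=1)
    return D, labels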

(Refer Slide Time: 19:46)

So, that was K-means clustering; let us now consider how this concept is generalized in the K singular value decomposition approach, the K-SVD algorithm. In this case also we start with some initial codebook; we will discuss how this codebook can be derived.

You have the training examples, N data points, which can be represented as a data matrix X, and you would like to get a sparse representation, again a matrix of size K x N whose columns are the sparse representations of the data vectors. Y provides, for each data point, a linear combination with at most T0 nonzero terms, as I mentioned while discussing the sparsity constraint involved in the K-SVD algorithm.

So, the optimization problem is to minimize the Frobenius norm of X - DY (a generalization of the squared L2 norm), subject to the number of nonzero elements of each sparse representation being less than or equal to T0:

\min_{D, Y} \| X - DY \|_F^2 \quad \text{subject to} \quad \| y_i \|_0 \le T_0 \ \text{for all } i

(Refer Slide Time: 21:14)

This computation is carried out by a nice manipulation of the optimization function, which we discuss here. The optimization function is the Frobenius norm of X - DY, and we want to write it in a different form; for that, consider the jth row of Y. If you remember the representation of Y, each column is a sparse representation: I represented the first column y1 by the coefficients a11, a12, ..., a1K.

So, y1 is the sparse representation of x1. Similarly, y2 can be written as the coefficients a21, a22, ..., a2K, and in this way any particular column yk is ak1, ak2, ..., akK, where K is the dimension; finally, there are N such columns because there are N data points.

Now, consider any particular row, say the jth row, which is a1j, a2j, ..., aNj; this is the jth row vector of Y. The expression X - DY can then be conveniently rewritten, because DY is a sum of rank-1 matrices.

If I write D as [d1, d2, ..., dK] and multiply by the matrix Y, I can write DY as d1 times the first row of Y, which is y_T^1 in our notation, plus the similar terms for the other atoms. The dimension of d1 is n x 1 and of y_T^1 is 1 x N, because a row of Y has N elements; so each such product gives a matrix of size n x N, and X is a matrix of size n x N as well.

The sum of all these rank-1 matrices should reproduce X; more precisely, we are trying to find D and Y whose product is as close as possible to X, with

DY = \sum_{j=1}^{K} d_j \, y_T^j
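A two-line numpy check of this rank-1 decomposition, with arbitrary small sizes chosen by me.

import numpy as np

n, K, N = 6, 10, 40
D, Y = np.random.randn(n, K), np.random.randn(K, N)
# DY as a sum of K rank-1 matrices: atom d_j times the j-th row of Y
rank_one_sum = sum(np.outer(D[:, j], Y[j, :]) for j in range(K))
print(np.allclose(D @ Y, rank_one_sum))   # True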

So, this is the trick: you consider every atom dj and find out its contribution to the reconstruction DY, the whole multiplied factor. The advantage is that, if you do this, you can separate out the contribution of a single atom, say dk.

(Refer Slide Time: 26:05)

Here dk y_T^k is the contribution of the atom dk, and the other term is the contribution of the rest of the atoms towards forming DY, which we would like to be as close as possible to X. Since we have separated them out, we can observe the effect of a single atom.

So, keeping everything else constant, what are the best dk and y_T^k that approximate this particular factor, let me call it E? We can compute E, because everything is in hand at this stage: you have the input data vectors, you have the current set of atoms, say from the previous iteration, and you also have the previous sparse representation.

So, you can find the value of this error of approximation using all atoms except the kth one; the error matrix Ek is this difference. If I can update dk and y_T^k such that dk y_T^k is close to Ek, then my error will be minimized. That is the idea, and that is what this algorithm does, keeping the other terms fixed: it computes Ek.

To do that, we can perform the SVD of Ek. With the singular value decomposition you get sets of orthonormal column vectors U and V; consider the first column of U, corresponding to the maximum singular value, and the corresponding column of V.

So, you take that column of U as the updated dk (it is already of unit norm), and the maximum singular value times the corresponding column of V gives you y_T^k. This is one possible close approximation; it is not exactly equal to Ek, but it is the best rank-1 approximation, from the theory of matrix approximation using singular value decomposition.

Now, this is the best rank-1 representation, but the problem is that it does not ensure that the coefficients y_T^k will be sparse. As I mentioned, we can consider the first column of U as dk, and the first column of V multiplied by the maximum singular value d(1,1) as y_T^k, but this row vector may not be sparse. So, what should we do in that case?

(Refer Slide Time: 29:28)

\left\| \left( X - \sum_{j \ne k} d_j \, y_T^j \right) - d_k \, y_T^k \right\|_F^2

So, here is the trick: we can enforce sparsity by considering only those samples of the input which have a nonzero contribution from dk, that is, a nonzero coefficient. Choose only the samples from X which have a nonzero component along dk, and restrict the error matrix Ek to those columns, getting EkR. You are considering only the input samples related to that atom, and with them you can rewrite the equation: you get EkR and the reduced row yRk, then once again apply singular value decomposition to EkR and update dk and y_T^k from there.

You repeat this for all the dj's, obtain the updated D and Y in this way, and repeat the whole process till convergence. Actually, as with K-means clustering, the problem here is that it can get stuck at a local optimum. Since you are performing the singular value decomposition K times, once for each of the K atoms, in every iteration, we call this algorithm K-SVD.
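A minimal numpy sketch of this single-atom update, restricted to the samples that actually use atom k so that the sparsity pattern of Y is preserved; the function name and the in-place update are my own choices.

import numpy as np

def ksvd_atom_update(X, D, Y, k):
    # X: n x N data, D: n x K dictionary, Y: K x N sparse codes; updates atom k in place
    used = np.nonzero(Y[k, :])[0]          # samples with a nonzero coefficient for atom k
    if used.size == 0:
        return D, Y
    # restricted error matrix E_kR: residual of those samples without atom k's contribution
    E = X[:, used] - D @ Y[:, used] + np.outer(D[:, k], Y[k, used])
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, k] = U[:, 0]                      # updated atom: first left singular vector
    Y[k, used] = s[0] * Vt[0, :]           # updated coefficients for those samples only
    return D, Y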

(Refer Slide Time: 30:53)

So, finally, to give an overview of this algorithm from a total perspective: your input is N data points, represented by the set X, which can also be arranged as a data matrix X, and your output is the K atoms of the dictionary D along with the sparse representation Y of each input sample.

The first step is to form an initial dictionary of K atoms; any method can be used, and K-means clustering is one option. Then obtain an initial sparse representation Y using any sparse coding algorithm, for example orthogonal matching pursuit. Then you iterate, updating the j-th atom and the sparse representation associated with that atom, and you continue these iterations until convergence.

(Refer Slide Time: 31:59)

There are several applications of the K-SVD algorithm. It can be applied to data compression and to various image processing operations such as denoising. The idea is that, since you have factorized the input data, you retain only the important factors and reject those whose coefficients are negligible. By doing this you can compress data, denoise data, and deblur data. Another kind of application is that you can learn dictionaries at different levels of representation of a signal.

For example, a high-level representation and a low-level representation; if you establish a mapping between the two dictionaries, it can give you an algorithm for increasing the resolution of a signal. This task is called super-resolution, and mapping of learned dictionaries can be used for it. Inpainting is another image processing task, where you remove some area and fill it in by looking at the content of the image, so that the discontinuity left by extracting an object from the image is not noticeable. There also this kind of dictionary learning and mapping can be used.

(Refer Slide Time: 33:23)

Let me summarize the topics we discussed under the title of dimension reduction and sparse representation. In dimension reduction we discussed the technique of principal component analysis and also Fisher's linear discriminant with respect to the classification task.

For principal component analysis, the objective is to represent data in a minimal subspace; we have seen that it involves a coordinate transformation, it chooses directions that maximize the variance of the dominant components, and it decorrelates the data across dimensions. Whereas in Fisher's linear discriminant the task was to project the data onto a 1-D subspace and then use it for classification, and this projection itself gives Fisher's linear discriminant function.

(Refer Slide Time: 34:19)

In sparse representation we discussed different pursuit algorithms such as matching pursuit and orthogonal matching pursuit; for basis pursuit we gave just the problem statement but did not discuss the algorithmic steps. We also discussed a technique for dictionary learning and how to use it to derive a sparse representation, in particular the K-SVD algorithm. With this let me stop; this is the end of this particular topic, and we will start a new topic in my next lecture.

Thank you very much.

Keywords: K-SVD, K-means algorithm, denoising, deblurring.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 55
Deep Neural Architecture and Applications Part – I

We will discuss a new topic today and that is on Deep Neural Architecture and
Applications.

(Refer Slide Time: 00:21)

So, let us understand the difference between a classical neural architecture and a deep neural architecture. We have already discussed the artificial neural network, which is a network of neural nodes or perceptron nodes, where you have the input at one end.

It is a feed-forward network: each node takes the weighted input vector, the net sum of those weighted inputs is passed through a non-linear function, and the output is the response of that non-linear function. The network of all these nodes gives you the artificial neural network, and finally there are input and output layers in this architecture.

Now, the fact is that classical artificial neural networks had only a few hidden layers; at most three layers could be seen in many applications, but 1 or 2 layers are more common. Whereas a deep neural architecture has many more layers, possibly more than 100 hidden layers, and that is one of the major differences.

(Refer Slide Time: 01:41)

There are different applications of this deep neural architecture. Before describing the features of the architecture, let me give a brief overview of the applications reported with this kind of architecture. One application is image classification, and there is the ImageNet Large Scale Visual Recognition Challenge dataset, which was created for testing and improving classification algorithms. For a long time researchers have been working on this dataset and reporting their performance on it.

In this dataset you have 1000 object classes and about 1.4 million images. The major algorithms found to provide good performance are deep-feature based, which means they use a deep neural architecture to classify the data.

Here is a brief summary of the performance of different algorithms. As you can see, there are traditional computer vision algorithms and there are deep learning algorithms, and the performance figures of the deep learning algorithms are much higher than those of the traditional algorithms. As we move ahead in time (results are shown up to about 2015, and performance has improved after that as well), the trend continues and you get better and better deep architectures with this kind of classification performance.

(Refer Slide Time: 03:35)

Another application we can discuss is object recognition and localization. This is one example of how the problem is solved and what the response looks like. You can see that in this scene the system tries to localize an object, for example a car, and there is a pink rectangular box around where the car is shown; that is the localization. It also recognizes what the object is, a car, with a certain confidence level or probability value.

It is definitely a car because the score is 1.00, whereas you can see that the horse class gets 0.993, so there is some confusion with the horse. There is also another dataset on which these recognition and localization tasks are performed, and you can see that deep convolutional neural networks (we will discuss what a convolutional neural network is under this topic) have given better performance in this case.

(Refer Slide Time: 04:51)

Object segmentation is another task; in the context of deep neural architectures it is called semantic segmentation, because you are not only segmenting the objects but also telling what each object is. It is almost similar to localization, but now you are making pixel-level declarations of which type of object each segment belongs to. So, this is another kind of application.

(Refer Slide Time: 05:19)

Pose estimation is another; you can see in this diagram that the poses of different persons are shown by skeleton figures. A skeleton is 3-dimensional information, a description of the human body and limbs as a graph with 3D coordinate vertices connected by edges, with joints at the different limbs representing the human skeleton.

And from the image of a person, we try to guess what skeleton configuration could produce that kind of stance or pose. That is the estimation, and it gives you an idea of the way the person is standing or what activity they are doing.

(Refer Slide Time: 06:17)

Another interesting application is image captioning. It is a cross-disciplinary problem: we are not only understanding images and videos, we are also trying to describe them using natural language. You can see that for different scenes there is some description; consider this particular image where a girl is jumping, and it is written here "girl in pink dress is jumping in air". This kind of description is generated by such a system.

(Refer Slide Time: 06:55)

Similarly, dense image captioning is further complex. You are not only describing the main theme of the scene, but also describing every part of the image: that is an elephant, the tusk of an elephant, it is a football, there is a man sitting on top of an elephant, the man is wearing a red shirt, and so on. This kind of captioning can be done using this kind of architecture.

(Refer Slide Time: 07:31)

Another application is super-resolution. Here examples are given: the original image is shown in the rightmost column, and the classical bicubic interpolation algorithm produces the super-resolved or interpolated image shown next to it. Whereas using a deep neural architecture you get the results in the middle two columns, and you can actually see that the visual quality of these two interpolated or super-resolved images is much better than what was obtained by bicubic interpolation.

(Refer Slide Time: 08:15)

Image art generation and style transfer is another very interesting application. In fact, using a deep neural architecture you can synthesize and render images. Here in particular you can see a photograph; this is the original photograph taken by a camera, and you consider the style of a painter, say a Van Gogh painting, I think it is The Starry Night.

Using that Van Gogh painting, the photograph has been regenerated in that painting style. You can still observe all the buildings and so on, but they have a very high similarity with the style in which the painting was made. Similarly, you can see examples where the same scene is rendered in the styles of different painters.

(Refer Slide Time: 09:11)

There are many other applications; deep neural architectures are used not only in computer vision but also in other areas: machine translation and text synthesis, speech recognition and synthesis, navigating autonomous vehicles, and playing different kinds of games, even chess and Go (AlphaGo).

(Refer Slide Time: 09:31)

One question that naturally arises is: if this architecture gives so much higher performance compared to classical techniques, why did it take so long to reach this stage? Because the concepts used in deep neural computation are very old; it is the same neural network computation, the same neural network model, introduced in the 80's, and the basic principles remain the same.

There are two major reasons why it has arrived today, and they are essentially historical: the progress of technology and science is what leads to every new technological advancement. The first is that we now have the capacity to generate large-scale annotated data, and therefore such data is available.

So, you can train large networks to develop models that work in a very general setting and solve general, difficult problems. The second reason is the advancement of computing power, because this training requires heavy computation and has to be done within a feasible time. GPU computing has provided this advancement and has been used extensively for these kinds of computations.

The availability of large-scale annotated data is possible because of the penetration of the internet and smartphones, the spread of social networking, online shopping, and so on; and the advancement of computing power, as I mentioned, high-throughput GPU computing is one such example.

(Refer Slide Time: 11:35)

Now, let us understand the difference between classical image classification techniques and deep-feature based (deep neural computation based) classification techniques. In classical image classification you have an image and you would like to represent that image by some handcrafted features. That means we have studied which distinct features characterize the image and developed computational algorithms to extract those features; that is what handcrafted features means. The features are designed by us, the algorithm is designed accordingly, and then we extract those features.

For example edges, SIFT and SURF key points, HOG regional features, and motion features; we have discussed some of these in my previous lectures. Then this feature representation is the input to a classification algorithm. There are different classifiers such as the Bayesian classifier, linear discriminant analysis, support vector machines, and the KNN algorithm. Any such classifier can be used, and then we produce the output, something like whether the image is a tiger, cat, or lion. It is a standard classification problem, and this is the classical approach.

(Refer Slide Time: 13:07)

The problem here is that it is very tedious and costly to develop handcrafted features to handle the various challenges, because there is a large variation among input images of the same object. This variation arises from different causes: viewpoint variation, deformation, occlusion, intra-class variation, illumination, clutter, instances, and scale; some nice examples are shown here with respect to the tiger. You can see one by one what kinds of variation there are, and that is why it is very difficult to develop a very general algorithm; it depends on one application and is not easily transferable to other applications.

(Refer Slide Time: 13:59)

In deep-feature based algorithms we learn the feature representation first; the feature learning is automated. The mechanism is such that the features are extracted depending on the problem statement and the application.

We will discuss how even the feature extraction is learned rather than handcrafted; the neural network itself learns to represent the features. There are different levels of feature representation when you do deep feature representation.

You will have low-level features, and as you increase the level of abstraction you have mid-level features and high-level features, and then you train the classifier using these features. The whole thing is done in one shot; that means it is coupled. In the classical case the classification scheme is independent of the feature representation scheme, but now feature representation and classification go hand in hand, in one shot, and then you get the output.

(Refer Slide Time: 15:23)

It utilizes a large amount of training data to learn features, and it learns rich hierarchical representations, as I have mentioned, from low-level and mid-level to high-level feature representations. Multiple stages of feature learning are involved, and it learns features adaptively, so the features are learned in the context of your application.

(Refer Slide Time: 15:47)

Mostly we will be considering supervised learning in the deep neural architecture, particularly the classification problem. As we have discussed, you have the training samples, you train the network, and then you classify. This is the usual supervised classification setup, and we also call it supervised learning.

The problem setting is that you have data x and label y; I am just revisiting this, as we also discussed it in the context of classification previously. x is the data, y is the label, and our goal is to learn a function f which maps the data to the label. Different kinds of problems such as classification, regression, object detection, semantic segmentation, and image captioning can all be mapped to this learning framework.

(Refer Slide Time: 16:39)

Let us consider image classification; the problem is to determine which class an image belongs to, and we have already understood this problem statement. In this case we would like to learn a parametric function f, composed of the weight parameters of the model, to classify image x as class label y.

(Refer Slide Time: 17:03)

It is a data-driven approach to learn the model in three steps. First you have to define the model, that is, the prediction of class y given the input and the model weights W. The artificial neural network we discussed is one example of a model, and as I mentioned, a deep neural architecture is nothing but the same artificial neural network model, only with more hidden layers. So, you have the input data as image pixels, you have the model weights and the model structure, and this gives the predicted output or image label.

(Refer Slide Time: 17:43)

The first step is to define the model; the next step is to collect data. For supervised learning it is important that you have annotated data: x_i is an instance and y_i is its label. This is the training input, y_i is the true output, and then you learn the model.

(Refer Slide Time: 18:03)

Here we consider that there is some mapping function that you have learned; we ask how good that mapping function is and what the gap is, and then we try to reduce that gap by updating the weights. Given the current weights, what is the performance? If the loss is high, we need to update the weights so that the loss is minimized or at least reduced.

We use a loss function to assess the performance of the model, represented here by this notation. The loss function is defined in terms of the predicted output and the ground truth. For a set of training samples we accumulate this loss, and there is also a regularization term, because the model should not be too complex; we will discuss this part as well.

Let us see: f(x_i, w) is the predicted output and l is the loss function; we would like to minimize the average loss over the training set. R(w) is called the regularizer or penalty, which penalizes complex models, and w* denotes the learned weights. The total loss can be considered as the data loss plus the regularization loss: the data loss is given by the function l and the regularization loss by R(w).

w^* = \arg\min_w \frac{1}{N} \sum_{i=1}^{N} l(f(x_i, w), y_i) + R(w)

(Refer Slide Time: 19:39)

A loss function tells how good the current classifier is; the data loss measures how well the model prediction matches the training data, that is, what the deviation is. There are different examples of data loss functions; for a multiclass support vector machine we can use the hinge loss. Here each wrongly classified sample produces a loss, based on the difference between the score of the predicted (incorrect) class and the score of the true class.

You can see that functional form here. The softmax loss, or multinomial logistic regression loss, is given in this form: we estimate the probability of the true label given the input data. If this probability is very high then the loss is small, and by taking the negative logarithm we make it a positive function.

This probability is computed with respect to the responses of all the classes. In the output layer, suppose there are N classes and the response (score) of each class is s; if y_i is the true class, we consider s_{y_i}, the response of the true class.
The probability of that class is given by this softmax expression: you exponentiate the score and normalize by the sum of all the exponentiated scores. That gives you the probability value, and it actually models this particular term, the probability of y given the data.

L_i = -\log P(Y = y_i \mid X = x_i) \;\Rightarrow\; L_i = -\log\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)

(Refer Slide Time: 21:57)

The cross-entropy loss is another form of the softmax loss. In the 2-class case you have either class 1 or class 2, say an object class and a background class, for example.

-\big(y\log(p) + (1-y)\log(1-p)\big); \quad p: \text{Prob}(y = 1 \mid o)

When the label y is 1, that is, the true class of the sample is the object class, you use log(p); otherwise you use log(1-p).

In the multiclass case you extend this computation with the expression -\sum_c y_{o,c} \log(p_{o,c}), where y_{o,c} is a binary indicator which is 1 if the sample o belongs to class c, and p_{o,c} is the estimated probability of o belonging to class c; here o is the input sample. In a more general framework you may also have a probability distribution over the true class, so the indicator is not always binary; you then use the true probability of o belonging to c in the cross-entropy loss.

(Refer Slide Time: 23:39)

In the regularization loss we would like to keep the model simple, which means keeping the magnitude of the weights W as small as possible. The weights are not uniquely determined by the data loss alone, and there are different kinds of regularization functions which put a constraint on the values of W.

For example, you may use one of the following regularization functions:

L_2 \text{ regularization: } R(W) = \sum_k \sum_j W_{k,j}^2

L_1 \text{ regularization: } R(W) = \sum_k \sum_j |W_{k,j}|

\text{Elastic net } (L_1 + L_2): R(W) = \sum_k \sum_j \left( \beta W_{k,j}^2 + |W_{k,j}| \right)
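These regularizers are straightforward to compute; a minimal NumPy sketch, given here only as an illustration, is:

import numpy as np

def l2_reg(W):
    return np.sum(W ** 2)                 # sum of squared entries

def l1_reg(W):
    return np.sum(np.abs(W))              # sum of absolute entries

def elastic_net(W, beta=0.5):
    return np.sum(beta * W ** 2 + np.abs(W))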

(Refer Slide Time: 24:27)

Finally, for training we would like to solve this optimization problem: given the objective function, which has a data loss term and a regularization loss term, we would like to minimize it with respect to the weights. What is the best weight that minimizes it? Let g(w) be the sum of these losses; we would like to minimize g(w).

One thing we could do is use random search: choose some initial guesses of w, pick points here and there, check the loss each time, and keep the minimum, but this is very inefficient. The standard algorithm is gradient descent, and again the same backpropagation algorithm can be used here.

(Refer Slide Time: 25:29)

We have discussed the gradient descent algorithm earlier; let us revisit it in the context of this topic. We initialize w randomly, then compute the gradient of g at the current point, expressed as ∇g(w), and then move a small step downhill; that is the update process.

We move in the direction opposite to the steepest increase, which means along the steepest descent of the gradient, so the update is w ← w − α∇g(w). The weights are updated in this way at each iteration, and α is the learning rate, the step size, which needs to be decided.

(Refer Slide Time: 26:17)

We also discussed the backpropagation algorithm; it has two passes, a forward pass and a backward pass. In the forward pass we run the graph forward to compute the loss, and in the backward pass we run the graph backward to compute the gradients of the loss. We know that this is an efficient way to compute gradients for big, complex models.

(Refer Slide Time: 26:41)

Supervised learning can also be posed as a linear regression problem, where the input is a vector and the output is also a vector. In linear regression the model is linear.

\text{Linear regression: } f(x, W) = Wx

\text{Learning problem: } W^* = \arg\min_W \frac{1}{2N} \sum_{i=1}^{N} \|W x_i - y_i\|_2^2 + \lambda \|W\|_{fro}^2

If your input is x, the prediction is a linear combination of the inputs, Wx: you multiply by a matrix W, which gives you another vector, so this is a linear model.

The dimension of W should be D_out × D_in, because the output vector is of size D_out. The model is just a matrix multiplication, and you can use the Euclidean distance, the norm of the deviation, as the loss. The learning problem here is a linear regression problem, coupled with regularization; the corresponding objective function has been shown above, where the regularization uses the Frobenius norm of the matrix, that is, the sum of squares of its entries.

(Refer Slide Time: 27:51)

For supervised learning you can also use a neural network, which can be considered as a sequence of linear layers. It is a new model, but the learning problem remains almost the same: if you use a multi-layer network of purely linear layers, the first layer multiplies by W1 and the next by W2, but W = W2 W1 is again a single linear map, so it remains the same learning problem. Even if you increase the number of layers, it does not increase the capacity of the model; the power of problem solving remains the same.

(Refer Slide Time: 28:37)

By introducing non-linearity here, you add complexity to the model and increase its power to solve the problem. So, let me take a break here, and we will continue this topic in the next lecture.

Thank you very much for listening to this lecture.

Keywords: Deep architecture, loss function, regularization loss, supervised learning.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 56
Deep Neural Architecture and Applications Part – II

We are discussing Deep Neural Architectures. In the last lecture we discussed how deep computation differs from classical computation, and then we considered different types of supervised learning problems, or different facets of the supervised learning problem, in the context of deep learning architectures.

(Refer Slide Time: 00:42)

In a neural network there is a nonlinearity function, as we know; these are also called activation functions. We have seen in particular the sigmoid function, which is used very often in artificial neural networks. The property of the sigmoid function is that it squashes numbers into the range 0 to 1, and it can kill gradients: as you can see, it saturates when the input becomes large or small, and there the gradient becomes almost 0. It is well suited for learning logical functions, that is, functions on binary inputs, but it is not good for image networks, and it is not zero-centered either.

(Refer Slide Time: 01:23)

There is another activation function called the hyperbolic tangent, tanh(x). It ranges from -1 to 1 and it is zero-centered, which is desirable. Still, it has the problem that it can kill gradients when it saturates, and it is not as good for binary functions.

(Refer Slide Time: 01:49)

This particular function, the rectified linear unit (ReLU), is very popular in convolutional neural networks and other deep architectures. The functional form is shown here: for the positive half of the input it is linear, and for the negative half the value is 0. So there is a non-linearity, and there is a first-order discontinuity at 0.

The advantage is that it does not saturate in the positive region, it converges faster than sigmoid and tanh (for example, about 6 times faster), and it is computationally very efficient. However, it is not suitable for modelling logical functions or for control in recurrent nets, and its output is not zero-centered.

If the input goes into the red region, the negative part, then the unit does not activate and no gradient flows; it stops there, so there is a dead zone in this case.

(Refer Slide Time: 03:00)

To avoid that, we can use the leaky ReLU, which has a very small slope in the negative part as well, giving a small gradient there. It has the advantages that it does not saturate, it converges faster, and the gradients will not die.

(Refer Slide Time: 03:20)

Another ReLU-like activation function is the exponential linear unit (ELU). In this graph, the blue curve is the exponential linear unit, which corresponds to this function. All the benefits of ReLU are there, and its mean output is closer to zero.

(Refer Slide Time: 03:47)

Another non-linear activation function is the maxout neuron; it takes the maximum of two linear combinations of the inputs, w_1^T x + b_1 and w_2^T x + b_2. It does not have the basic form of a single dot product followed by a nonlinearity; it is non-linear, it generalizes ReLU and leaky ReLU, it does not saturate and does not die, but the problem is that it doubles the number of parameters per neuron.
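For reference, here is a minimal NumPy sketch of the activation functions mentioned in this lecture; it is my own illustration, not code from the course.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                  # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                                # squashes to (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)                        # linear for x > 0, 0 otherwise

def leaky_relu(x, a=0.01):
    return np.where(x > 0, x, a * x)                 # small slope for x < 0

def elu(x, a=1.0):
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))

def maxout(x, w1, b1, w2, b2):
    return np.maximum(w1 @ x + b1, w2 @ x + b2)      # max of two linear responses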

(Refer Slide Time: 04:22)

We have seen this standard architecture: a neural network with one hidden layer, that is, a two-layer neural network, and another with a three-layer architecture.

(Refer Slide Time: 04:35)

Sometimes this is called a multi-layer perceptron or fully connected network, and the hidden layers are learned feature representations of the input; these are deep features. With this I have talked about the general features of neural networks and deep neural architectures, but let me now discuss a specific deep neural architecture which is very popular in various applications: the convolutional neural network.
neural network.

(Refer Slide Time: 05:05)

In a convolutional neural network, the convolutional layer is the hidden layer we need to understand. In a convolution layer the input is an image, say a 32 × 32 × 3 image; that means, for example, it could be a 3-channel RGB image or any other three components.

There is a filter, sometimes also called a kernel, which performs the convolution operation over this image. A typical filter size is, say, 5 × 5 × 3. The first thing to note is that the filter extends over the full depth of the input volume; 3 is the depth in this example.

So the filter has to span the depth, and what it does is take a dot product. It convolves with the image, which means the filter slides over every point and computes a convolution, that is, a dot product between the input and the filter weights, and produces the output at the central pixel location of the mask.

(Refer Slide Time: 06:25)

In this example you have a 5 × 5 × 3 filter w, and it is placed at a pixel of the image. It takes a dot product and produces a value, which is the output at that point. For example, in this case you take a 5 × 5 × 3 chunk of the image, so you are doing a 75-dimensional dot product.

As in a neuron, there is a dot product and a bias term; you add the bias term, and that is the output of the convolution layer at that location. One feature of this convolution is locality, which means objects tend to have local spatial support, and this can be exploited by the convolution.

(Refer Slide Time: 07:22)

And the other feature is that weight sharing.

(Refer Slide Time: 07:27)

Since you are convolving over all spatial locations, the same weights are used everywhere, and finally you generate the output. If we consider the boundary condition that output is taken only at pixel sites where the convolution mask is fully embedded within the image, then the size of the activation map gets reduced.

As you can see the filter is of size 5 5 ; that means, along height and along width you
have to leave out two pixels each and at each edges. Because, those are not those pixels
are not fully embedded. So, finally, size would be 28 28 and along depth wise also
other planes are not coming into picture because they are not fully embedded. Only the
central pixels convolutions for the central plane those are only considered here. Because
those for those pixels only this pool embedding is available.

Another feature of this computation is translation invariance, which means the object's appearance is independent of its location.

(Refer Slide Time: 08:43)

Now consider that there are other filters. There is a second, say green, filter of the same size, but its weights could be different, and it produces another output, which we call another channel.

(Refer Slide Time: 08:59)

If there are 6 such filters, you have 6 separate activation maps, and you can stack them to get a new image of size 28 × 28 × 6, corresponding to the 6 convolutional filters.

(Refer Slide Time: 09:20)

So the features of the CONV (convolution) layer are: locality, the property of having local spatial support for objects; translation invariance, where object appearance is independent of location; and weight sharing, which means units connected to different locations have the same weights.

Equivalently, each unit (filter) is applied at all locations and the filter weights are invariant. This keeps the number of parameters small for a convolutional neural network; of course, since there are many such layers, the total number of parameters can still be quite large. Each unit or filter output is connected to a local rectangular area of the input, which is considered its receptive field.

(Refer Slide Time: 10:13)

As there is a non-linear function in each neuron, after the convolution layer the non-linearity applied at every point is itself considered a layer of the system, whose input is the output of the convolution layer; it then passes through the non-linear stage at every pixel. It is a point-wise non-linearity, and it increases the non-linearity of the entire architecture without affecting the receptive fields of the convolutional layer, which generalizes the model further.

(Refer Slide Time: 10:50)

ReLU is commonly used, as I have mentioned. So what is a CNN? It is a sequence of convolution layers and nonlinearities: you have the input, then a convolutional layer together with a ReLU, which makes one layer; then another convolutional layer and ReLU as the second layer, and so on. A CNN is a sequence of such convolution layers and nonlinearities.

(Refer Slide Time: 11:23)

Let us also see the parameters involved in a convolutional layer. Consider an input of size W1 × H1 × D1; in my previous example we took a 32 × 32 × 3 input, so W1 was 32, H1 was 32 and D1 was 3. Consider K filters; in the previous example K was 6. The size of each filter is Fw × Fh × D1.

In the previous example we took 5 × 5 × 3 filters, which means Fw was 5, Fh was 5, and the depth is kept at D1 = 3. Note that depth-wise the filter has to have the same number of channels, or depth slices, as the input, because the idea of the convolution layer is that the filter should be fully embedded within the input volume.

That is why the depth has to be the same, and each filter then provides a single-channel output. There is another parameter called the stride, which concerns the points where you perform the convolutions: how far apart those points are on the grid. If they are adjacent, the stride is 1; that was the previous case.

Suppose you skip one point in between; then the stride would be 2. The input can also be zero-padded on both sides to keep the size the same: if you would like to include the edge or boundary pixels of the image, pad the input with zeros and then perform the computation. This produces an output volume of size W2 × H2 × D2. What should this output volume size be? These values are all related to the input sizes.

The values of W1, H1, D1, Fw, Fh, Sw, Sh and Pw, Ph (D1 is not very important here) determine the size of the output. For example, W_2 = (W_1 - F_w + 2P_w)/S_w + 1, and similarly H_2 = (H_1 - F_h + 2P_h)/S_h + 1; these expressions take care of the stride, the zero padding, the filter size, and the complete embedding of the filter within the input.

Observe that the number of depth slices of the output is the number of filters, K. And what is the number of parameters? The parameters are the weights of the filters, and since the size of each filter is Fw × Fh × D1, that many weights are there per filter.

So, it is Fw  Fh  D1 . And if there are K filters you have to multiply with K. So, we are
assuming here that in the convolution layer all the filters of same size and all the filters
of same size. So, the parameters as you can see Fw * Fh * D1 * K that many weights and

each neuron could have a bias so; you can have K biases.

To summarize this operation: the d-th depth slice of the output is the result of convolving the d-th filter over the padded input volume with the given stride, offset by the d-th bias.

(Refer Slide Time: 15:28)

There is another kind of layer in a CNN, called the pooling layer. The pooling layer progressively reduces the spatial size of the representation; that is its task. It reduces the number of parameters and the computation in the network, and it helps control overfitting. Pooling partitions the input image into a set of non-overlapping rectangles.

For each sub-region it outputs an aggregated value of the features in that region. There are two types of aggregation: max pooling, which takes the maximum value, and average pooling, which takes the average value. It operates over each activation map independently.

(Refer Slide Time: 16:13)

This is an example of a pooling layer: the input is 224 × 224 × 64, and max pooling with 2 × 2 filters and stride 2 (leaving out one sample in between) halves the spatial size. So for a 224 × 224 input the output is 112 × 112, but the depth remains the same.

The example here shows max pooling: first we partition the input into 2 × 2 rectangular blocks, then from each block we choose the maximum value, forming one output pixel per block.

(Refer Slide Time: 17:01)

Let us observe the parameters involved in pooling. The input volume size is again W1 × H1 × D1, and the pool size is Fw × Fh with stride (Sw, Sh). The output volume size is W2 × H2 × D2, related to the input parameters by W_2 = (W_1 - F_w)/S_w + 1 and H_2 = (H_1 - F_h)/S_h + 1, and the depth remains the same as the input, D2 = D1.

What about the number of parameters? There is no weight and no bias; you can perform this computation without specifying anything, so there is actually no parameter involved in this operation. It is just a fixed computation providing the output, and it does not depend on any parameter. Also, it is very uncommon to use zero-padding in a pooling layer.
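A minimal sketch of 2 × 2 max pooling with stride 2 on a single channel, given purely as an illustration:

import numpy as np

def max_pool_2x2(x):
    # x: (H, W) activation map with even H and W; no parameters are involved.
    H, W = x.shape
    blocks = x.reshape(H // 2, 2, W // 2, 2)
    return blocks.max(axis=(1, 3))            # one output pixel per 2x2 block

x = np.arange(16.0).reshape(4, 4)
print(max_pool_2x2(x))                        # 2x2 map of block maxima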

(Refer Slide Time: 18:12)

Next we discuss the fully connected (FC) layer, which is also coupled with the convolutional layers in a CNN. It contains neurons that connect to the entire input volume, which is why it is fully connected, just as in ordinary neural networks. The input volume to an FC layer can be treated as deep features; the FC layer output can either serve as deep features itself or as the feature representation for a classifier. These are the two roles of the FC layer.

(Refer Slide Time: 18:42)

There are various operations required for efficient training and testing of a CNN. One of these operations is known as batch normalization. It conditions the input and the intermediate responses: it normalizes the input activation map to a layer by considering its distribution over a batch of training samples.

For example, you can assume a Gaussian model for the activation maps. Using batch normalization you improve the gradient flow through the network, and it also allows higher learning rates, so convergence becomes faster. It reduces the strong dependence on initialization, and it also acts as a form of regularization of the network. Batch normalization is usually inserted after fully connected or convolution layers and before the non-linearity.

(Refer Slide Time: 19:46)

Let me elaborate a little on this computation. As I mentioned, it normalizes the activation responses of a channel from the previous layer. How does this normalization work? It subtracts the mean of the responses over a batch and divides by their standard deviation, and then transforms the result by a scaling and a translation with parameters a and b, which are also learned by the gradient descent algorithm.

During test time, running averages and standard deviations of the activation maps are used, along with the learned parameters for each channel at a layer.
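A minimal sketch of batch normalization for one channel, using only training-time statistics; the scale and shift parameters a and b follow the lecture's naming and would be learned by gradient descent (illustration only):

import numpy as np

def batch_norm(x, a=1.0, b=0.0, eps=1e-5):
    # x: activations of one channel over a mini-batch (any shape).
    mu = x.mean()
    sigma = x.std()
    x_hat = (x - mu) / (sigma + eps)            # zero mean, unit variance over the batch
    return a * x_hat + b                        # learned scale and shift

batch = np.random.randn(8, 28, 28) * 3.0 + 5.0  # hypothetical channel activations
out = batch_norm(batch)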

(Refer Slide Time: 20:25)

That is how batch normalization is carried out in CNNs. There is another operation which is also quite common and which improves the generalization of the model, called dropout. What does it do? It randomly drops out nodes of the network, at hidden or visible layers, during training.

Dropout temporarily removes a node from the network along with all its incoming and outgoing connections. It regulates overfitting, it is more effective for smaller datasets, and it simulates learning a sparse representation in the hidden layers. Dropout can be implemented by retaining the output of a node with a probability p; typically p lies between 0.5 and 1 at hidden layers and 0.8 and 1 at the visible layers, which means the input or output layers. A minimal sketch is given below.
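Here is a minimal sketch of dropout with retention probability p, written in the "inverted dropout" style where the scaling is done during training so that nothing needs to change at test time; this is my own illustration, not the course's code.

import numpy as np

def dropout(x, p=0.5, train=True):
    # Retain each activation with probability p during training.
    if not train:
        return x                               # nothing to do at test time (inverted dropout)
    mask = (np.random.rand(*x.shape) < p) / p  # scale kept units by 1/p during training
    return x * mask

h = np.random.randn(4, 10)                     # hypothetical hidden-layer activations
h_train = dropout(h, p=0.5, train=True)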

(Refer Slide Time: 21:32)

One effect of dropout is that the weights become larger: whatever weights you learn, their effective values are larger, so you need to scale them down at the end of training. There is a simple heuristic: if the outgoing weights of a unit are retained with probability p during training,

then you should multiply those weights by p at test time. Alternatively, you can apply this scaling during training itself at each weight update; then you are not required to do it during testing and can use the same weights as learned. So, with this let me take a break, and we will continue this discussion in the next lecture, where we will be

discussing different types of convolutional neural networks, different kinds of
architectures in my next lecture.

Thank you very much.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 57
Deep Neural Architecture and Applications Part – III

We are discussing convolutional neural networks, and in this lecture I will provide some examples of CNN architectures which are used particularly for image and object classification.

(Refer Slide Time: 00:32)

The very first network introduced for image classification applications, in fact for document image processing and character recognition, is called LeNet. It is described in "Gradient-based learning applied to document recognition", and the architecture uses two convolutional layers, two pooling layers, and two fully connected layers.

It was reported in 1998, quite long ago, and it is a pioneering work in that sense. Its number of parameters is around 60,000, the number of floating point operations per inference is about 341 k (341,000), and it uses the sigmoid as the nonlinearity function.

(Refer Slide Time: 01:38)

This diagram shows schematically what the architecture looks like. You can see the convolutional layers and then the sub-sampling, which means the pooling layers; as discussed there are two convolutional layers, two pooling layers, and two FC layers, and finally you get the output of the process.

(Refer Slide Time: 02:08)

Then, in 2012, another network provided very good results for ImageNet classification; it is called AlexNet. This network uses a normalization of the responses, local response normalization, and its architecture consists of 5 convolutional layers and 3 pooling layers. You can see that the first pooling layers are adjacent to their convolutional layers, but the last pooling layer operates after the data has been processed through the third, fourth, and fifth convolutional layers; that is how it is designed.

There are normalization operations after the first and second convolution-pooling layers, and there are three fully connected layers, the last of which is the output layer. The number of weights in this case is 61 million and the number of floating point operations is 724 million, so it is a very large network compared to the much smaller LeNet reported earlier. Here they used ReLU as the non-linearity, which was one of the major changes in the non-linear functions used in neural networks.

(Refer Slide Time: 03:42)

A schematic diagram of AlexNet is shown here; it also gives some information about the sizes. It processes 224 × 224 inputs, takes filter sizes of 11 × 11 and then 5 × 5, and uses max pooling. Finally, you can see that 4096 is the dimension of the feature representation, which is then classified over the 1000 output nodes. This is how the classifier is used.

(Refer Slide Time: 04:28)

If you would like to count the parameters of AlexNet, let us consider how this can be done. Consider an input image size of 227 × 227 × 3 (that is the input size, not 224). In the first convolution layer there are 96 filters, each of size 11 × 11, applied with a stride of 4; the stride of 4 means the convolution skips 3 samples and operates at every fourth sample.

What should the output volume size be? As we know, it is related to the input width and height and to the filter size. Along the width and height it is (227 - 11)/4 + 1, where 11 is the filter size and 4 is the stride; this is the expression we discussed earlier. So 55 is the width and the height, and the depth is the number of filters, 96. The output volume is therefore 55 × 55 × 96.

Similarly, what is the number of parameters in this layer? The parameters comprise the filter weights and the biases. Each filter has size 11 × 11 × 3, so the number of weights is 11 × 11 × 3 for one filter, and since there are 96 filters it is 96 × 11 × 11 × 3 in total; that is without biases. If each filter has one bias, you add another 96 biases.

That number, 96 × 11 × 11 × 3 = 34,848, is around 35 K, plus the 96 biases. Then consider the second layer, POOL1, which has 3 × 3 filters applied at stride 2. What is the output volume size in this case? Once again it is (55 - 3)/2 + 1 = 27 along the width and the height, so the output volume is 27 × 27 × 96, because the depth of the input was 96 and it remains the same.

That is the output volume, and what is the number of parameters in this layer? As we mentioned, a pooling layer involves no parameters; no weights or biases are required, so the number of parameters is 0. In this way you can count the parameters from the filter sizes, and you can determine the output size from the filter size, the input size, and the number of filters. A small sketch of this counting is given below.
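The CONV1 and POOL1 computations above can be reproduced with a few lines of Python; the numbers below are only this back-of-the-envelope check, using the sizes quoted in the lecture.

def conv_out(size, f, stride, pad=0):
    return (size - f + 2 * pad) // stride + 1

# CONV1: 227x227x3 input, 96 filters of 11x11x3, stride 4
w1 = conv_out(227, 11, 4)                      # 55
conv1_params = 96 * 11 * 11 * 3 + 96           # 34,848 weights + 96 biases

# POOL1: 3x3 pooling with stride 2 on the 55x55x96 volume
w2 = conv_out(55, 3, 2)                        # 27, depth stays 96, 0 parameters

print(w1, conv1_params, w2)                    # 55 34944 27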

(Refer Slide Time: 08:35)

Another network that is quite distinct is the ZF network; I am basically giving you the historical evolution of CNNs. The ZF network came from AlexNet with some modifications. What does it do? It uses filters of smaller size: earlier the first layer used 11 × 11 filters with stride 4, and in the ZF network they use 7 × 7 with stride 2. They also use more filters; for example, in CONV 3, 4, 5 AlexNet has 384, 384, and 256 filters,

whereas in ZFNet you have 512, 1024, and 512, and this improved the accuracy of image classification on the ImageNet dataset. So the filter size is smaller, but more filters are used per layer. A schematic diagram is shown here; it gives the sizes of the different processed outputs and of the filters, among other things.

(Refer Slide Time: 09:57)

The next stage in the evolution of CNN architectures was VGG, and in this architecture you have a good number of deep layers. The idea is to use even smaller filters than ZFNet and to increase the number of layers: not only more filters, but also more layers. By doing that they could improve the image classification performance on the same dataset.

The VGG net comes in variants with different numbers of layers: 13 layers (VGG-13), 16 layers (VGG-16), or 19 layers (VGG-19). Only 3 × 3 convolution filters are used, with stride 1 and pad 1, and 2 × 2 max pooling with stride 2, so the filter specification is very simple for the VGG network. The observation here is that the deeper the network, the better the accuracy.

In a typical VGG network (I am not sure which variant this figure refers to), there are 138 million weights, and the number of floating point operations per inference is about 15.2 billion.

(Refer Slide Time: 11:48)

One example of a VGG network is shown here; you can see the typical sizes, with the input starting from 224 × 224 × 3 and then being downsampled to 7 × 7. Finally the feature representation goes to 1 × 1 × 4096, and then you use softmax probabilities at the output layer for 1000-class classification.

(Refer Slide Time: 12:21)

It has been observed in the VGG network that a stack of three 3 × 3 convolutional layers has the same effective receptive field as one 7 × 7 convolutional layer. Let me explain: you have an image, you apply a 3 × 3 convolution to it and produce another image, then you again apply a 3 × 3 convolution, and so on. Doing this successively three times, the effect on the receptive field is the same as a 7 × 7 convolution in a single stage, and this is one advantage.

That is why you have so many layers but the same effective receptive field, with more nonlinearities introduced, and it also gives fewer parameters: with the stack of three 3 × 3 layers you have 3 × (3^2 C^2) parameters, assuming C channels per layer with C constant, whereas for a 7 × 7 filter it is 7^2 C^2 = 49 C^2, which is larger than the 3 × 3^2 = 27 times C^2 of the stack.

(Refer Slide Time: 13:57)

Then we came to GoogLeNet. GoogLeNet has more layers (a depth of 21 weighted layers, 57 layers in total) and introduces inception modules, which is the novelty of this architecture. Instead of using filters of a single size in a particular layer, it concatenates the outputs of filters of different sizes, which increases the power of the model. It has only one fully connected layer, and the number of weights is 7 million.

The number of floating point operations is 1.43 billion. In the example architecture there are 9 inception modules, as I mentioned; there are 3 convolution layers and 2 pooling layers at the beginning, followed by the inception modules interleaved with pooling. GoogLeNet also provides very good performance, a 6.7 percent top-5 error.

(Refer Slide Time: 15:12)

This is a one-shot diagram of GoogLeNet showing what it looks like, and you can see that these layers are inception layers: they use filters of different sizes and then concatenate their outputs.

(Refer Slide Time: 15:39)

There is an enlarged diagram here. It shows that from the input there are 1 × 1 filters;
1 × 1 means the filtering is along the depth only, so the filter weights are along the depth.
Then there are 3 × 3 convolution filters, 5 × 5 filters and 3 × 3 pooling. The number of
filters of the different sizes also varies, say 128 filters of 1 × 1, 192 filters of 3 × 3 and 96
filters of 5 × 5. And then of course there is the 3 × 3 pooling, which does not require any
filters and gives you the same depth as the input.

So, finally, if I use 128 filters of 1 × 1, the output would be 28 × 28 × 128. Similarly for the
3 × 3 convolution: by performing zero padding you keep the output size the same as the
input size, which is also done in the inception module, so you have a 28 × 28 × 192 output
from that branch. So, this branch gives 28 × 28 × 128 because there are 128 filters, this one
gives 28 × 28 × 192, and this one gives 28 × 28 × 96 because of the zero padding.

So, you keep the input and output the same size, and the 3 × 3 pooling has stride 1, so it also
gives 28 × 28 × 256. The total output size would then be 28 × 28 × (the sum of all these
channel counts). This is how the inception module works: it increases the number of
channels, and sometimes it reduces the number of channels, like here where instead of 256
the number of channels becomes 128.
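Just to make the bookkeeping of this concatenation explicit, here is a minimal sketch, assuming the branch sizes quoted above (128 filters of 1 × 1, 192 of 3 × 3, 96 of 5 × 5, and a 3 × 3 pool that keeps the input depth of 256); every branch preserves the 28 × 28 spatial size and the outputs are stacked along the channel axis:

    # Output depth of one inception module = sum of the branch depths.
    H, W, in_depth = 28, 28, 256
    branch_depths = [128, 192, 96, in_depth]   # 1x1 conv, 3x3 conv, 5x5 conv, 3x3 pool
    out_depth = sum(branch_depths)
    print((H, W, out_depth))                   # (28, 28, 672)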

(Refer Slide Time: 17:54)

So, those were the modifications done in GoogLeNet, and they have been found to give
good performance. The next kind of deeper model is the residual network, or ResNet. The
problem is that if we increase the number of layers, it has been observed that the network
overfits and becomes harder to optimize, because the gradients vanish. Since the gradients
vanish, the optimization cannot be driven effectively as we go deeper.

980
We will discuss the ResNet architecture. First, this is the problem: you can have a good
training error, but your test error would be quite high as you increase the number of
iterations, if you go for more layers. Here, this is a plain CNN with fewer layers and this is
the CNN with more layers; you will have this problem that the error of the deeper network
will be higher.

(Refer Slide Time: 18:59)

So, what do we learn in the residual network, or ResNet? We try to learn the residual. It fits
a residual mapping instead of directly fitting the underlying mapping itself. As you can see
in this unit, what you are learning is the residual part F(x). There is a skip connection here,
called the identity or skip connection; it is added to the residual part, and only then do you
get the output H(x). So, what you are learning is not H(x) directly: F(x) = H(x) − x is the
residual part, and that is what you are learning.
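To make the skip connection concrete, here is a minimal sketch of a residual block in PyTorch (an assumption of this sketch is that the numbers of input and output channels are equal, so the skip can be a plain identity; real ResNets also use projection shortcuts and batch normalization):

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
            self.relu = nn.ReLU()

        def forward(self, x):
            f = self.conv2(self.relu(self.conv1(x)))   # F(x): the residual mapping
            return self.relu(f + x)                    # H(x) = F(x) + x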

981
(Refer Slide Time: 19:48)

So, this is a schematic diagram where a ResNet is shown. As you can see, the skip
connections appear regularly and the number of layers can be very large; it can be more
than 100 now, and this has improved the classification performance to a great extent.

(Refer Slide Time: 20:08)

There are other examples of networks also; a list is provided here, and I am not going to
read all of them, so you may go through these references. In particular, let me point out one
network called MobileNet, which is used for low-end applications; its convolutions are
factored into depthwise filters and 1 × 1 pointwise filters, i.e., depthwise separable
convolutions.

(Refer Slide Time: 20:36)

There are a few things we would like to discuss about training such a network. The first is
that we would like to pre-process the training data set; different kinds of pre-processing are
required. We need to normalize the data with respect to its variances. Decorrelation may
also be required, which means diagonalization of the covariance matrix; we know that
decorrelation of data is done by principal component analysis, which we have learned, and
whitening of data makes the covariance matrix the identity.

So, during normalization you divide by the variances, normalize the ranges, and/or subtract
the mean. These are the different kinds of pre-processing operations that may be involved
while using the data for computation.
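As an illustration, here is a minimal NumPy sketch of these steps (mean subtraction, variance normalization, PCA decorrelation and whitening), assuming X is an (n_samples, n_features) array of flattened training images; the epsilon is only for numerical stability:

    import numpy as np

    def preprocess(X, eps=1e-5):
        X = X - X.mean(axis=0)                  # subtract the mean (zero-center)
        X = X / (X.std(axis=0) + eps)           # normalize by the variances
        cov = np.cov(X, rowvar=False)           # covariance matrix
        U, S, _ = np.linalg.svd(cov)
        X_decorr = X @ U                        # decorrelate (PCA): diagonal covariance
        X_white = X_decorr / np.sqrt(S + eps)   # whiten: (approximately) identity covariance
        return X_white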

983
(Refer Slide Time: 21:34)

Then you may need to generalize your data set, because the labeled data set you have may
not capture all the variations within a class. The remedy is called data augmentation,
because by simulation you can generate more data. For example, you can perform different
kinds of geometric operations on the images, say horizontal flips, random crops on a scaled
input, and color jitter.

You can add noise, add different kinds of color variations, distortions and transformations.
So, there are different kinds of simulations you can perform on the given input data set to
generate more data and use it for training. Then you need to initialize the weights; there are
different policies for weight initialization, which I will discuss, and then you train the
network by updating the weight parameters.
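A minimal augmentation pipeline along these lines might look as follows, assuming torchvision is available; the crop size and jitter strengths here are illustrative choices, not values from the lecture:

    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomHorizontalFlip(),                     # horizontal flips
        transforms.RandomResizedCrop(224),                     # random crops on scaled input
        transforms.ColorJitter(brightness=0.2, contrast=0.2,
                               saturation=0.2),                # color jitter
        transforms.ToTensor(),
    ])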

984
(Refer Slide Time: 22:44)

There are a few training tips I would like to provide here. We should start with small
regularization and find a learning rate that makes the loss go down. The objective is a
linear combination of a regularization term and a data loss, for example the cross-entropy
term; start with the regularization component kept small and try to find a learning rate that
makes the loss go down. You can also deliberately overfit a very small portion of the
training data, training the first few epochs with few samples, to initialize the
hyper-parameters.

First you have to introspect how the training is behaving and select those parameters which
give you good directions of convergence. You need to consider those parameters because
training is very time consuming; once you fix the parameters, you again observe empirically
how it behaves. These are tips by which you can reduce your experimentation time. If you
find a big gap between training accuracy and validation accuracy, then it is an overfitting
case; in that case you should increase the regularization term, or if there is no gap you may
increase the model capacity.

985
(Refer Slide Time: 24:18)

One of the policies for initializing the weights is called transfer learning. In this case, even
if you have a small data set you can work within the deep learning paradigm; you do not
require a lot of data to train the CNN, although usually deep learning requires a lot of data.
What you can do is use a pre-trained network, which has been trained on a very large data
set for some other application, and use that feature representation in your own application.
So, pre-trained models can be used to initialize CNNs at the early stage of training.

(Refer Slide Time: 24:59)

986
In the example here, say you train a particular network on the ImageNet data set. We have
shown that there are a few convolution layers and then fully connected layers given the
image, along with the number of channels, etc. Now, for your small data set you fix this
portion; this is your feature representation. The whole feature representation is fixed, so
you can freeze it, and you can transfer the weights learnt in that network directly into this
one, because the architecture is kept the same.

That is a constraint: you have to keep the same architecture. You transfer those weights and
use them to compute the features, and then you train only the output layer; that means you
reinitialize and train only the final classifier for the classification task of your particular
problem. If you have a bigger data set, then you can have other kinds of policies; for
example, you can consider all the fully connected layers to be trained, keeping only the
feature representation up to this point frozen.

You can consider that, and finally you can also fine tune: once you get convergence with
some weights, you can fine tune again. You can train with your own input data samples
using a lower learning rate, something like one-tenth of the original learning rate; that is a
good starting point, and then you can vary it.
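A minimal transfer-learning sketch of the first policy (freeze the feature extractor, retrain only the classifier) could look like this, assuming a torchvision ResNet-18 pre-trained on ImageNet and a new 2-class problem; the backbone and the number of classes are illustrative choices, not from the lecture:

    import torch.nn as nn
    from torchvision import models

    model = models.resnet18(weights="IMAGENET1K_V1")   # pre-trained feature representation
    for p in model.parameters():
        p.requires_grad = False                        # freeze the transferred weights
    model.fc = nn.Linear(model.fc.in_features, 2)      # new output layer, trained from scratch

    # Only model.fc.parameters() would be given to the optimizer; for fine-tuning with a
    # larger data set, unfreeze more layers and use about 1/10 of the original learning
    # rate, as suggested above.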

So, these are different pragmatic policies and some of the intuitions and heuristics used
while training a good deep neural network, which takes a lot of time and a lot of resources.
You should understand that arriving at a good network that gives you good results involves
a lot of experimentation.

So, with this let me stop here and we will continue this discussion on deep neural
architecture in my next lectures.

Thank you very much for listening.

987
Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture - 58
Deep Neural Architecture and Applications Part – IV

We continue our discussion on Deep Neural Architecture and Applications. In the last
lecture we discussed convolutional neural networks and also considered different
architectures of convolutional neural networks.

(Refer Slide Time: 00:35)

Let us consider a problem at this stage: to find the number of parameters and also to
determine the sizes of the output of each layer, given the specification of a convolutional
neural network. Consider this exercise: a deep CNN architecture takes an image of size
240 × 240 × 3 as input.

So, the input has 3 channels, the task is classification into two classes, and the architecture
has three convolution layers, where the filter sizes are 7 × 7, 5 × 5 and 3 × 3 and the
numbers of channels in these layers are 64, 32 and 16. These layers are specified from the
input towards the output, and there are two max pool layers, one between every pair of
consecutive convolutional layers.

988
That means two max pool layers of filter size 2 × 2 with stride 2 in both directions, between
the first and second, and the second and third, convolutional layers. After the third
convolutional layer we have two fully connected layers with 50 and 20 neurons respectively,
and the activation function of each neuron used in this network is the ReLU, or rectified
linear unit.

(Refer Slide Time: 02:09)

So, the problem is to provide a schematic diagram of the architecture of the network, give
the sizes of the outputs from each convolutional layer and compute the number of
parameters to be learned in each layer. Let us see how we can compute what has been
asked.

989
(Refer Slide Time: 02:33)

Let us consider a schematic diagram of the architecture itself. As you can see, there is an
input layer which goes to the first convolutional layer of filter size 7 × 7; since there are 64
output channels, it is labelled 7 × 7 × 64. It is followed by a max pool layer of 2 × 2 with
stride 2.

Then the output from the max pool layer is fed to another convolution layer labelled
5 × 5 × 32, because there are 32 output channels from that convolution layer. It is again
followed by a max pool layer of size 2 × 2, and then comes the third and final convolution
layer of our architecture.

The output from this max pool layer is fed to the third convolution layer, where the filter
size is 3 × 3 and there are 16 output channels; so it is labelled 3 × 3 × 16. Note that
3 × 3 × 16 is not exactly the filter size: the last 16 indicates the 16-channel output, and the
spatial filter size is 3 × 3, as we will see. This is then fed to the fully connected layers, the
first of which has 50 neurons.

The 16-channel output from the third convolution layer is flattened and considered as a
one-dimensional feature vector. That is fed to the fully connected layer of 50 neurons; the
output from this layer is fed to another fully connected layer of 20 neurons, and those are
connected to the output neurons. The output layer has 2 neurons, used for classification
because it is a 2-class classification problem. This is a schematic diagram of the
architecture.

Let us now determine the sizes of the outputs from each layer. The input, as specified, has
size 240 × 240 × 3. For the output, consider the 7 × 7 filter; as you know, the output comes
only from those pixels of the input where the filter is fully embedded. The filter size is
7 × 7 × 3; the number of channels of the input determines the depth of the filter, and 7 × 7
is its width and height.

So, the filter size is 7 × 7 × 3, not 7 × 7 × 64, if I mentioned that wrongly earlier; 64 is the
number of output channels in this architecture. The 7 × 7 × 3 filter has to be fully
embedded, and only from those pixels is the output produced, so you have to leave out the
boundary pixels where it cannot be fully embedded.

So, you basically subtract 6 pixels from the width and height, and only one plane comes out
of each filter. You get 234 × 234, and since the output has 64 channels there are 64 such
filters, so the output is 234 × 234 × 64. This is the output size from the first convolution
layer. It then goes to the max pool layer of size 2 × 2 and stride 2, which means the output
of the max pool layer should be half the input size, and the number of channels remains the
same, giving 117 × 117 × 64.

Then it goes to the next convolution layer, where the filter is 5 × 5; since there are 64 input
channels, the filter size is 5 × 5 × 64, 64 being the depth of the filter. Once again you get
only one channel of output from each filter, and there are 32 such filters.

So, you have to leave aside the boundary pixels where the filter is not fully embedded: from
117 you subtract 4, so it would be 113 × 113, and since there are 32 channels it is
113 × 113 × 32. This again goes to the next max pool layer of 2 × 2; the size has to be
divided by 2, and since the max pool window must be fully embedded, it would be only the
lower integer part, i.e., ⌊113/2⌋ = 56.

991
So, it is 56 × 56 × 32; the number of channels remains the same. Then comes the third
convolution layer, labelled 3 × 3 × 16; the actual filter size is 3 × 3 × 32 and there are 16
such filters. You subtract 2 from 56, and the size will be 54 × 54 × 16 because there are 16
channels. Now, all these two-dimensional output arrays are flattened into a one-dimensional
vector of size 54 × 54 × 16, and that is the input to the fully connected layer of 50 neurons.

Each of those neurons will have that many weights. The output of the 50-neuron fully
connected layer is of size 50, the output of the 20-neuron fully connected layer is of size 20,
and then you have the 2-class output.

This is not written here, but there are 2 outputs from the final layer. These are the sizes of
the outputs from the different layers; next we are going to count the number of parameters.
I have indicated by arrows the layers whose parameters we will be counting. Consider the
first convolution layer: the filter size is 7 × 7 × 3 because the input has 3 channels, so one
filter has 7 × 7 × 3 weights, but there are 64 such filters.

So, there are 64 × 7 × 7 × 3 weights, and each of the 64 filters also has a bias, so plus 64
biases. In green I am showing this count of parameters; this is the number of parameters in
the first convolution layer.

Similarly, for the max pool layer: since the max pool layer does not require any weights (it
is fully specified by its size), the number of parameters there is 0. The next convolution
layer has 5 × 5 filters with depth 64, and there are 32 filters, so it should be 5 × 5 × 64 × 32
weights, plus 32 biases for the 32 filters.

That is the number of parameters from this layer. Again there is a max pool layer, so we
have 0 there. Next it is fed to the 3 × 3 filters with depth 32, because the input to that layer
has 32 channels; so each filter is 3 × 3 × 32 and there are 16 filters, giving 3 × 3 × 32 × 16
weights plus 16 biases, i.e., (3 × 3 × 32 × 16) + 16. Next it goes to the fully connected
layer; as I mentioned, the size of the input here is 54 × 54 × 16, each input has a weight to
every neuron, and there are 50 neurons, so there are 50 biases. It should be
(50 × 54 × 54 × 16) + 50, and the output size here is 50.

Then 50 weights are connected to each of the 20 neurons, and there are 20 biases, so it is
(50 × 20) + 20. You still require more weights for the output layer: its input size is 20, so
each output neuron has 20 weights and 1 bias, giving (20 × 2) + 2. This is the total count of
parameters involved in the architecture specified in the problem statement. Understand that
this is how the parameters and sizes can be determined, given the specification of a
particular CNN.
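The whole bookkeeping can also be scripted. Here is a minimal sketch that reproduces the numbers above (valid convolutions, 2 × 2 stride-2 max pools with floor division, and the three fully connected layers); the helper names are only for illustration:

    def conv(size, in_ch, k, out_ch):
        h, w = size
        params = (k * k * in_ch) * out_ch + out_ch    # weights + biases
        return (h - k + 1, w - k + 1), out_ch, params

    def pool(size):
        return (size[0] // 2, size[1] // 2), 0         # no parameters

    size, ch = (240, 240), 3
    size, ch, p1 = conv(size, ch, 7, 64)   # 234 x 234 x 64
    size, _ = pool(size)                   # 117 x 117 x 64
    size, ch, p2 = conv(size, ch, 5, 32)   # 113 x 113 x 32
    size, _ = pool(size)                   # 56 x 56 x 32
    size, ch, p3 = conv(size, ch, 3, 16)   # 54 x 54 x 16
    flat = size[0] * size[1] * ch          # 46656
    p4 = flat * 50 + 50                    # first fully connected layer
    p5 = 50 * 20 + 20                      # second fully connected layer
    p6 = 20 * 2 + 2                        # output layer
    print(p1, p2, p3, p4, p5, p6)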

(Refer Slide Time: 12:22)

Let us now consider another kind of problem which can again be solved using a
convolutional neural network. This problem is called object recognition and localization.
Here, given an image, not only do you have to identify the label of the object, you also have
to identify the portion of the image where the object lies, in the form of a rectangular block.

Any rectangular block can be defined by two of its corners in pixel coordinates; that is how
the rectangular block will be specified.

993
(Refer Slide Time: 13:05)

There are different techniques; this is a very classical problem, and even before deep neural
architectures there were methods by which people tried to solve it. In the pre-CNN
approach, the region of the image containing the object needs to be identified first. We call
that task region proposal, which means you would like to find a probable region which
contains an object, and then you try to determine what the object in that region is.

So, there are stages in this approach: first you propose the region, and then you describe
that region by features. You extract features and use them to classify, deciding whether the
region contains one of your target classes; otherwise you may say that the region does not
contain any object. Various methods of this kind have been reported.

One method is exhaustive search over every rectangular block, which is a combinatorially
hard problem, so you need to use some heuristics there; and there are methods that propose
those regions using certain heuristics. Then, for feature extraction you can use handcrafted
features. By handcrafted features I mean that, by considering the classes of objects, you
predetermine what kind of features will be computed.

994
For example, you can use SIFT, shape descriptors or other kinds of key point detectors and
descriptors, and then aggregate them to define a feature. Then there are different
classification algorithms you can use, like linear classifiers. For deep neural architecture
based techniques there are three such variations; the very first technique proposed was
called RCNN.

What does RCNN do? The region proposals still come from the conventional methods, and
only the feature extraction part is done using a CNN, so that you do not have to use any
handcrafted design. Features appropriate to your problem are learnt and extracted, and then
a classifier is used to identify the object.

For example, a linear SVM, which is a classical classifier, can be used. Another
improvement, a variation on top of RCNN, is one where both feature extraction and
classification are done by a deep neural network, whereas the region proposal is still carried
out using conventional methods. In the third variation, the whole thing is done within a
deep neural architecture.

(Refer Slide Time: 16:29)

This diagram explains the flow of the computation. You have an input image and then
regions are proposed; these yellow boxes are regions proposed by conventional methods.
You extract them one by one and feed them to the CNN; since the CNN takes a fixed input
size, you need to warp those regions, and then you compute the features.

The last layer of the CNN gives you a feature descriptor, which is used in a classifier. Once
again this classifier is a conventional classifier, and you decide the class based on the target
classes of this classifier.

(Refer Slide Time: 17:13)

As I mentioned, this particular method uses a conventional method for proposing regions.
There is a method based on selective search, which uses hierarchical grouping based on
color, texture and size, and it is a bottom-up method: it performs bottom-up segmentation,
merges regions at multiple scales, and then, based on the region properties, decides whether
a region contains an object or not. This figure tries to summarize the process.

It considers very small segments at the bottom layer, then merges them into bigger
segments and tries to identify which of them are considered to contain an object. Then you
use the rectangular box enclosing those particular segments.

996
(Refer Slide Time: 18:12)

The problem with this method is that the training objective of the CNN is not clearly
defined. The CNN is used as a feature extractor, but what should its objective function be?
So far we have discussed training a CNN based on some classification task, where even the
features are learned from the end objective of classification; but in this case the classifier is
a conventional one and is designed independently.

So, this is a problem: you have to consider some intermediate objective to train the CNN,
and a kind of ad hoc flavour remains. You can use a pre-trained network, but then you
again need to fine tune it for your problem domain. You also need to train the SVMs
(Support Vector Machines) on those features, and since the feature descriptors are usually
quite large, this takes a lot of time. And there is yet another training stage, because you
have to refine the regions.

Though the regions are proposed, you need to do some fine tuning: a regression based on
the proposed regions, which also requires training. So, there are many different kinds of
training involved, which is why the method is very slow and also takes a lot of disk space.
Inference is also slow, since it needs to run a full forward pass of the CNN for each region
proposal.

997
That is because for each region proposal you are extracting features independently, and you
need to iterate this process that many times.

(Refer Slide Time: 20:05)

That is why its inference time is very large. What are the time components here? There is
a term called prop time, or proposal time, which is the time taken for generating all the
proposals. Then, depending on the number of proposals generated at this stage, each
proposal or region has to be fed to the CNN.

The CNN extracts the features, so this contributes the number of proposals times the time
taken for computing features using the CNN. Finally, each one has to be classified, so there
is also a classification time for identifying the object in the image. This gives the total
inference time for RCNN.

998
(Refer Slide Time: 21:06)

An improvement over this approach is the Fast RCNN network. Here the idea is that,
instead of using a separate SVM, classification and feature extraction can be carried out in
the same pass; that means every region passes through both feature extraction and
classification.

It computes the CNN feature map for the whole image only once, so it is not required to
compute a feature map separately for each region; you just process the feature map within
the ROIs. The only extra step is that it pools the ROIs and then passes them to the
classification stage, so the classification is no longer a completely separate conventional
pipeline.

999
(Refer Slide Time: 22:12)

So, this is what it does: it uses a single convolutional neural network. Given the image, it
computes the whole feature map; then, using the region proposals, it pools those features
and uses a classifier to classify the objects. The time here still includes the proposal time,
but since all the features are computed only once, the convolution network is run only one
time, i.e., 1 × conv time; every proposal is still classified separately, so that part is the
number of proposals × fc time.

(Refer Slide Time: 22:59)

1000
In Fast RCNN, each pooled feature vector is fed to a sequence of fully connected layers
ending in two sibling output layers: one layer produces softmax probability estimates over
the K object classes plus a catch-all background class, and the other layer produces four
real-valued numbers for each of the K object classes, which is effectively a regression of the
regions.

(Refer Slide Time: 23:25)

This just shows the architecture of RCNN: you can see that every proposed region is passed
separately through the convolution layers, and the outputs are then used for classification
with SVMs and for refining the regions with a regression module; every region is processed
in this way.

1001
(Refer Slide Time: 23:55)

Whereas in Fast RCNN the convolutional neural network is used only once to generate all
the features, and the proposed regions are then used on that feature map: the features within
those regions are pooled and fed to the fully connected layers.

The classifier, which could also be a neural network classifier such as an ANN, is then used
for classifying the object and also for finding the regions, that is, fine tuning the region
estimates using a regression branch.

(Refer Slide Time: 24:41)

1002
The loss used at the classification stage, as I have shown, is for a fully connected head
which simultaneously classifies the object and detects the box, the rectangular area in which
the object is contained. There are two components of this loss: one component due to
classification and the other due to the regression.

The regression loss is considered only for samples belonging to object classes. Here u is the
ground-truth class and p is the predicted class scores; only if u is an object class do we
consider the loss due to the regression, and λ is a weighting parameter. The classification
part is the log loss, which is the cross-entropy loss.

The regression part is the smooth L1 loss, the function used for measuring the deviation of
the localization, i.e., the deviation between the true rectangle and the predicted rectangle;
t^u and v are the two sets of box coordinates (one true, one predicted). This is how the loss
function is defined.
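For reference, the multi-task loss described above can be written compactly, following the standard Fast R-CNN formulation (a sketch; the indicator [u ≥ 1] is 1 only for object classes and λ balances the two terms):

    L(p, u, t^u, v) = L_cls(p, u) + λ [u ≥ 1] L_loc(t^u, v),   with   L_cls(p, u) = −log p_u

where L_loc is the smooth L1 loss summed over the four box coordinates.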

(Refer Slide Time: 26:10)

The improvement over even Fast RCNN is Faster RCNN, where the whole task is done
within a convolutional network; even the region proposal is done using a convolutional
network, which we call a region proposal network. Using the features generated by the
convolutional layers, the same features are used for generating region proposals.

1003
And, from the region proposals you pool the features lying within each proposed region and
use them for classification.

(Refer Slide Time: 26:53)

That is why it becomes so fast. You can see that the first stage generates a feature map; this
feature map is then used both to propose regions and to classify and regress the rectangular
boxes.

(Refer Slide Time: 27:07)

So, this is Faster RCNN. In the region proposal network, the task is to slide a small window
over the feature map and build a small network for classifying object versus non-object and
also regressing the bounding box locations; this gives the initial localization.

These positions are then used, and the final localization is obtained at the last stage of
classification and regression, where box regression performs the final localization. As you
can see, Faster RCNN has no separate proposal time: it is 1 × the convolution time plus the
number of proposals × the time required for classification.
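Putting the three timing models side by side, here is a toy sketch; the per-stage timings are illustrative placeholders, not measured numbers:

    # Inference-time models quoted in these slides (times in arbitrary units).
    def rcnn_time(prop, conv, fc, n):       return prop + n * conv + n * fc
    def fast_rcnn_time(prop, conv, fc, n):  return prop + 1 * conv + n * fc
    def faster_rcnn_time(conv, fc, n):      return 1 * conv + n * fc   # proposals reuse the conv features

    n = 2000                                 # number of proposals
    print(rcnn_time(1.0, 0.05, 0.001, n))    # dominated by n * conv
    print(fast_rcnn_time(1.0, 0.05, 0.001, n))
    print(faster_rcnn_time(0.05, 0.001, n))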

(Refer Slide Time: 27:57)

This slide describes the region proposal network. You take a sliding window over the
feature map, which gives a 256-dimensional feature descriptor; then you classify it as object
or non-object and regress the box location. This gives the initial box locations of the object,
and in the last stage, when you classify the objects, you can fine tune those locations.

1005
(Refer Slide Time: 28:25)

For the region proposal network loss, note that at each sliding window position windows of
different shapes are also considered, which may contain an object; these are the k anchor
boxes. Each anchor is tested for whether it contains an object, and the box itself provides
the coordinates that are output for that region. The loss again has two components: a
classification loss and a regression loss, and these two components are combined in the loss
function.

For the 2-class loss, the standard cross-entropy loss is used, because now you are deciding
whether the anchor contains an object or not. If your predicted region overlaps about 0.7,
i.e., 70 percent, of the true object, then we label it 1, otherwise 0; the measure used is
intersection over union.

It is the ratio of the intersection of the predicted region and the true region divided by the
union of the predicted region and the true region. If this ratio is more than 0.7 we label the
anchor 1, if it is less than 0.3 we label it 0, and otherwise it does not contribute to the loss.
discussion in the next lecture.

Thank you very much for listening.

1006
Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 59
Deep Neural Architecture and Applications Part – V

We are discussing the different kinds of tasks that we can perform using deep neural
architectures. In the last two lectures we discussed the task of object classification, and also
the task of classification together with localization of objects. In this lecture we will discuss
the segmentation of images.

(Refer Slide Time: 00:41)

This kind of segmentation is called semantic segmentation, and the task here is that we
would like to label each pixel in the image with a category label. You can see it is akin to
classification, but instead of classifying an object or a group of pixels, we are trying to
classify each pixel. With that you also achieve segmentation of the image; we understand
segmentation as grouping pixels of the same class or category.

And it does not differentiate instances; it only cares about pixels. That means if there are,
say, 3 human profiles and they are all connected, it is sufficient to denote the whole
connected segment as human. But there can also be instance-level semantic segmentation,
where the task is to differentiate even the individual humans within that connected segment
of human profiles.
differentiate, even individual humans in that connected segment of human profiles.

(Refer Slide Time: 01:47)

In semantic segmentation, the building blocks use a convolutional neural network. Some of
the computations carried out are standard convolutions; then there is a down-sampling
stage, because the network tries to learn features by aggregating at different resolutions.

As we know, down sampling can be performed by different kinds of processes: you can do
pooling, max pooling or average pooling, or even use strided convolutions, which means
you skip some of the pixels and do not convolve there; in the output you only consider the
pixels where the image was convolved, and that itself down samples the image. Then there
is an up-sampling stage because, finally, in segmentation the input and output sizes must be
the same, since every pixel of the input has to receive some semantic label. The learned
features are then projected into that semantic space.

This projection is also learnt, and that is the up-sampling task. There are operations like
unpooling and up-convolution; we already understand max pooling, pooling and the
down-sampling operations, and in this lecture we will discuss what unpooling and
up-convolution are.

1008
(Refer Slide Time: 03:25)

Let me first explain the max unpooling operation. Consider the example image shown here:
it is a 4 × 4 image, and we would like to perform max pooling with 2 × 2 filters. While
doing max pooling, you also record the index of the position where the maximum occurred.

See, in this 2 × 2 block the maximum value is 5, so the output takes 5 from this block, and
the position of that value within the 2 × 2 block also has to be stored; that information is
kept so that during unpooling we can use it.

As you can see, in this block 6 is the maximum, in this block 7 is the maximum (these are
shown in bold font), and 8 is the maximum in the last block. So, the max pool operation
gives you 5, 6, 7, 8, as shown here. The positions can be stored in a binary mask, where the
positions from which the values were taken are indicated by 1 and all other positions are 0.

This is a kind of indicator function, which we call the pooling indices. These two things go
hand in hand: you are not only computing the max pool output, you are also recording the
pooling indices, and these pooling indices are used during the unpooling operation. How
are they used? The input to your unpooling operation is the set of values coming from the
max pool output, accompanied by the pooling indices.

If I consider max unpooling, then this is your input, and the pooling indices are also part of
your input, because you will use those indices to generate the output. The unpooling
operation may come several layers later in the processing pipeline, so you should tag the
pooling indices to that layer; after a few layers in the network these values will have
changed, they are not the same values.

Suppose these values are now denoted by the variables a, b, c, d. Then the max unpooling
operation works like this: using the pooling indices, we put each value back at the
corresponding location. So, a is positioned here, guided by the stored index of a, and
similarly for the position of b; with respect to this block, you can see that b is put here. In
this way we do max unpooling, and this is what is shown.
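In PyTorch, the same pair of operations can be sketched as follows (a minimal illustration; the numeric values below are made up so that the block maxima are 5, 6, 7 and 8 as in the example, since the exact matrix on the slide is not reproduced in this text):

    import torch
    import torch.nn as nn

    x = torch.tensor([[[[1., 2., 6., 3.],
                        [3., 5., 2., 1.],
                        [1., 2., 2., 1.],
                        [7., 3., 4., 8.]]]])
    pool = nn.MaxPool2d(2, stride=2, return_indices=True)
    unpool = nn.MaxUnpool2d(2, stride=2)

    y, idx = pool(x)      # y holds the maxima (5, 6, 7, 8), idx the pooling indices
    z = unpool(y, idx)    # maxima go back to their stored positions, all other entries are 0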

(Refer Slide Time: 07:15)

Now let us understand how we can perform up sampling with up-convolution. Consider the
input to the up-sampling block, say a 2 × 2 input whose entries are shown in colors, and a
weight filter. What we do is multiply the filter by each input value; the idea is that the
output contains copies of this filter weighted by the input, summed wherever they overlap
in the output.

Let me explain with an example. Suppose your input is

    1 2
    3 4

and for each input value we have a 3 × 3 weight filter

    x y z
    u v w
    p q r

with v at the central position. For the input value 1, we generate the output x, y, z, u, v, w,
p, q, r placed around the position of 1. Then consider 2: we place the filter around the
position of 2 and, from there, proceed in the same way.

For the value 2, its contributions are 2x, 2y, 2z on the top row, 2u, 2v, 2w on the middle row
and 2p, 2q, 2r on the bottom row, shifted one position to the right; where there is no overlap
(the new rightmost positions, which were 0 earlier) the value is just the weighted filter
entry, and where there is overlap the contributions add, giving y + 2x, z + 2y, then v + 2u,
w + 2v, and q + 2p, r + 2q, and so on. This is how the 2 is used. Similarly for 3 we perform
the same job, considering the position of 3.

After placing the copies for 1 and 2, the affected part of the output becomes

    x    y + 2x    z + 2y    2z
    u    v + 2u    w + 2v    2w
    p    q + 2p    r + 2q    2r

The central position of 3 is one row below that of 1; you place the filter with its centre at
the position of 3 and add the contributions at each location accordingly. When you place
the filter at the position of 3, the entries u, v, w get 3x, 3y, 3z added to them, and the entries
p, q, r get 3u, 3v, 3w added, with the bottom row receiving 3p, 3q, 3r.

After also placing the copy for 3, the top three rows of the output become

    x         y + 2x         z + 2y         2z
    u + 3x    v + 2u + 3y    w + 2v + 3z    2w
    p + 3u    q + 2p + 3v    r + 2q + 3w    2r

That covers 3; a further bottom row is also produced, which I am not showing here.
Similarly, 4 goes one position to the right of 3: it adds 4x, 4y, 4z to the second-row entries,
4u, 4v, 4w to the third-row entries, and so on, again with a bottom row that I am not
showing. In this way, what you are doing is copying the filter weighted by each input value.

The top three rows of the final output are then

    x         y + 2x              z + 2y              2z
    u + 3x    v + 2u + 3y + 4x    w + 2v + 3z + 4y    2w + 4z
    p + 3u    q + 2p + 3v + 4u    r + 2q + 3w + 4v    2r + 4w

So, you place all the weighted copies at their locations and, at the overlapping locations,
you sum them up; that is your output. I will do one exercise later on, so you can see how it
is carried out and understand this part fully. This kind of operation is also called transpose
convolution, fractionally strided convolution, backwards strided convolution or
deconvolution; there are several other names.
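A minimal NumPy sketch of this "copy the filter weighted by each input and sum the overlaps" view (stride 1, so a 2 × 2 input and a 3 × 3 filter give a 4 × 4 output) could be:

    import numpy as np

    def up_convolve(x, h):
        n, k = x.shape[0], h.shape[0]
        out = np.zeros((n + k - 1, n + k - 1))
        for i in range(n):
            for j in range(n):
                out[i:i + k, j:j + k] += x[i, j] * h   # weighted copy, summed where it overlaps
        return out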

(Refer Slide Time: 11:58)

The construction I have just described can also be expressed in a different, mathematically
precise form. If I consider one-dimensional convolution, it becomes clearer what exactly
we are doing, and we can show why it is called transpose convolution. A convolution in
one dimension can be expressed very easily using matrix multiplication; in two dimensions,
if the filter is separable, you can also perform it using matrix multiplications.

1012
But let me first consider one-dimensional convolution for convenience of discussion.
Consider a filter of size 3 with stride 1, meaning the computation is performed at every
sample, and padding 1, so the input is padded with a single 0 at each end. Since these are
one-dimensional, let me call them samples rather than pixels. Now consider this
convolution operation.

The input is taken as the padded sequence of values (0, a, b, c, d, 0); this is the input, and
you are performing a convolution whose impulse response is given by (x, y, z). So,
considering the input a, b, c, d padded with 0s, you know how convolution is performed:
the mask (x, y, z) is multiplied with the aligned values and the central value is replaced by
the result.

For the first position the result is ay + bz (the left tap falls on the zero padding). Similarly,
the mask x, y, z is shifted one step, and the central value b is replaced, after convolution, by
ax + by + cz. Exactly this is what is being shown in the computation: we can construct a
matrix in which each row contains a shifted copy of the impulse response x, y, z, and when
that matrix is multiplied with the input you get the output of the convolution.

So, convolution can be considered as a matrix multiplication. We can also show the
up-convolution this way, which is nothing but what I explained: you multiply the filter by
each input value in the correspondingly up-sampled array and then add the values at the
overlapping locations. That can be shown equivalently; this is what convolution looks like.

You just transpose this convolution matrix and multiply it with the input a, b, c, d, and you
get the up-sampled values. Once again, the sample a contributes ax, then ay at the central
location, then az; similarly b contributes bx, by, bz, c contributes cx, cy, cz, and so on, and
you add them up. That is how you get all these values, and in fact that operation is nothing
but taking the transpose of the convolution matrix and multiplying it with the input; that is
what you get here.
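A minimal NumPy sketch of this view, assuming the filter (x, y, z), stride 1 and padding 1 as above (the numeric values are illustrative only):

    import numpy as np

    x_, y_, z_ = 1.0, 2.0, 3.0
    C = np.array([[y_, z_, 0.,  0.],     # row for a: y*a + z*b  (left zero-pad)
                  [x_, y_, z_,  0.],     # row for b: x*a + y*b + z*c
                  [0., x_, y_, z_],
                  [0., 0., x_, y_]])     # row for d (right zero-pad)
    v = np.array([4.0, 5.0, 6.0, 7.0])   # the input a, b, c, d
    conv_out = C @ v                     # ordinary convolution as a matrix product
    upconv_out = C.T @ v                 # up-convolution: multiply by the transpose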

(Refer Slide Time: 16:19)

Even for stride two we can compute in the same way. In that case you compute the output
only at a and then at c, giving fewer samples, and the transpose operation correspondingly
gives you up sampling with these stride-two operations.

(Refer Slide Time: 16:37)

1014
To explain these computations, let me solve one exercise. Suppose you have the following
pooling indices, and you are to max-unpool by 2 × 2 the input block X; the pooling indices
and the input block are both shown here. What you need to do, although it is not written
here explicitly, is perform the max unpooling of the values X using those indices.

(Refer Slide Time: 17:19)

We have already discussed how unpooling is done. You have the pooling indices and the
input X; you need to place the value 2 here, the value 6 here, the value 30 here and the
value 45 here, and the rest of the values should be 0. So, from the 2 × 2 input you get a
4 × 4 image with these values; this is how you get the unpooling of the input.

1015
(Refer Slide Time: 17:57)

Let us consider another exercise: we need to up-convolve the input x using a filter h. As
you can see, the origin of the filter is also given here. The filter is a 3 × 3 filter, and the
2 × 2 input of this toy example is

    5 6
    7 8

You need to solve this problem; as I have mentioned, each input value has to be multiplied
with the filter and placed in the up-sampled block.

(Refer Slide Time: 18:34)

1016
So, we need to solve this problem. What should I do here? I have shown the contributions
to the up-sampled values with colors, one color per input pixel. For example, 5 contributes
a 3 × 3 block of filter values centred at its position, and likewise for the other inputs; let me
show it on the grid itself, where you can see all these contributions.

All the red values are contributed by 5, all the green values by 6, similarly the blue ones by
7 and the violet ones by 8. At each cell you add all of the contributions falling there; this is
how you do the up sampling, and when you add them you get this result.

(Refer Slide Time: 19:40)

So, this is the output: the up-sampled values you get from the input 5, 6, 7, 8.

1017
(Refer Slide Time: 19:53)

For performing semantic segmentation we use a somewhat different form of neural
architecture, called a fully convolutional neural network. In this case we do not have fully
connected layers, as we do in the traditional CNNs used particularly for image
classification; here all the layers are convolutional layers.

So, finally, the output is also an image with a certain number of channels, and usually the
number of pixels of the output is the same as the number of pixels of the input. At every
pixel we predict the class to which the pixel belongs, which is why this kind of prediction is
called dense prediction. The class membership determines the semantics associated with
that pixel, and through this you can perform the segmentation.

Now, let us understand how this kind of prediction is carried out at the pixel level. In the
process of computation, the input image is processed by several convolution layers, and
there are also downsampling operations, either by max pooling, by other pooling operations
or by strided convolutions. At a certain stage you get the downsampled processed tensors,
and those have to be up sampled again.

1018
This brings the result back to the size of the input of the full network. We then have the
same spatial size, but the number of channels at the layer just before the prediction layer
may differ, so let us consider an example. Suppose that before the dense prediction we have
d channels, each of size M × N, that is, M rows and N columns. What we would like to do
here is predict the label of each pixel. We perform a convolution with a filter which, along
the depth, is a one-dimensional filter of length d; in convolutional neural network
terminology we specify this as a 1 × 1 filter of depth d. This is one filter, and it gives one
response value. If I consider k such filters, then for each pixel you get k responses, and each
filter is associated with a certain class, say the i-th filter with the i-th class.

By using the softmax principle we can then compute, for each pixel, the probability of each
class that could be assigned to that pixel; it is the same principle that we have followed
before. These are the parameters that need to be learned, so there should be k × d
parameters learned at this output layer.

As we mentioned, the number of channels at the output layer, say k, should be equal to the
number of classes; that is a restriction we need to follow here, because this output is meant
for the semantic labeling of the pixels. For the optimization we use a loss function, and
since we are classifying each pixel, the cross-entropy loss is a natural choice in this case.

This is how we can use a fully convolutional neural network to label every pixel at its
output, and thereby perform semantic segmentation. Let us now see what the different
variations of this architecture are.
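A minimal PyTorch sketch of this dense-prediction head, assuming d feature channels, spatial size M × N and k classes (all four numbers below are placeholders):

    import torch
    import torch.nn as nn

    d, k, M, N = 64, 21, 128, 128
    head = nn.Conv2d(d, k, kernel_size=1)       # k filters of 1 x 1 x d: k*d weights (+ k biases)
    loss_fn = nn.CrossEntropyLoss()             # per-pixel softmax + cross-entropy

    features = torch.randn(1, d, M, N)          # output of the up-sampling stages
    labels = torch.randint(0, k, (1, M, N))     # ground-truth class label per pixel
    logits = head(features)                     # shape (1, k, M, N)
    loss = loss_fn(logits, labels)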

1019
(Refer Slide Time: 26:08)

Let us consider this particular work, which also performs semantic segmentation. As you
can see, in the process of computation it performs convolution operations along with the
corresponding downsampling operations, and after that it goes through up-sampling
operations at certain stages. The result is brought to the same size as the input image, and
then it performs pixel-wise prediction, as we have discussed.

For the up sampling, this work initializes the up-sampling filters with the weights of
bilinear interpolation filters; bilinear interpolation is a very standard interpolation whose
filter weights are predetermined. After this initialization you run the optimization, and those
weights get updated. Next we will see what operations this particular network performs.

1020
(Refer Slide Time: 27:37)

What it does is combine predictions: it not only takes a prediction from the last layer, it also
considers predictions from some previous layers. Consider this example: while down
sampling the image at the different stages of convolution layers, it always down samples by
a factor of 2. So, here the result is down sampled by a factor of 2, here by a factor of 4,
which means you require up sampling by a factor of 2 or 4 respectively to bring it back to
the size of the input; then this one is a factor of 8, this a factor of 16 and this a factor of 32.

When trying to bring the final layer back to the size of the input, we would like to perform
an up sampling by a factor of 32; we call the prediction made on this 32× up-sampled
output, before the classification operation, FCN-32s. The prediction can be done from here,
as I mentioned, using k filters of size 1 × 1 × (number of channels).

The number of channels is not mentioned here; suppose there are d channels at this final
layer. Then we predict using 1 × 1 × d filters, and for k semantic classes you use k such
filters. So, we can perform the prediction simply from this layer itself using these
operations; but it has been found that the performance can actually be improved.

We can also use the information from the lower layers. For example, from here I can
perform a 16× up sampling; the label 2× conv7 on this layer means the conv7 output is up
sampled by 2, and overall you are doing a 16× up sampling. At that stage you again
consider predicting using k filters. Similarly, you perform a 4× up sampling from an even
lower layer.

For all those values, what we are doing is not predicting separately here; we are simply
adding the corresponding values and then performing the prediction using k filters of size
1 × 1 × d. These operations can be arranged in various ways; in this work, it appears the
scoring has been applied independently k times, the results added, and then the softmax
principle used.

You are learning these weights at all these stages, and finally the loss function is computed
after fusing those values using the softmax principle. This gives the different variations,
and it has been found that this has improved the performance of the network for the
semantic labeling of pixels. So, this is one kind of architecture, one strategy that has been
followed; there are other architectures and variations as well, and the architectures are
mostly similar.
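A minimal sketch of this kind of skip fusion, along the lines of the FCN-8s style variant (an assumption of this sketch: the three score maps have already been reduced to k channels with 1 × 1 convolutions and come from layers at strides 32, 16 and 8):

    import torch.nn.functional as F

    def fuse_and_upsample(score32, score16, score8):
        s = F.interpolate(score32, scale_factor=2, mode="bilinear", align_corners=False) + score16
        s = F.interpolate(s, scale_factor=2, mode="bilinear", align_corners=False) + score8
        return F.interpolate(s, scale_factor=8, mode="bilinear", align_corners=False)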

(Refer Slide Time: 31:45)

This architecture is an encoder-decoder network, and it is called SegNet. At the encoder, it
takes an input image and generates a high-dimensional feature representation, and through
this process it aggregates features at multiple levels.

1022
At the decoder, it takes that high-dimensional feature representation and generates a
semantic segmentation mask.

The difference in this network is that, instead of up sampling with learned up-sampling
filters, the unpooling operation is used, as we discussed earlier. The idea is that by
unpooling you preserve the locations of the high responses observed during the encoding
stages.

That information is passed to the decoder, and it has been observed that this helps improve
performance. The decoder decodes the features aggregated by the encoder at multiple
levels; it is the reverse operation. As you can see, it projects the discriminative features
learned by the encoder to obtain a dense classification in the pixel space of the original
image.

(Refer Slide Time: 33:23)

You can see the processing here: these are the encoder stages, and each encoding stage
consists of convolutional layers; there are actually 3 convolution operations and then a max
pooling operation, and after every max pooling there is naturally a down sampling.

At every pooling, the pooling indices are passed to the corresponding decoding stage at the decoder, and in this case they are used for unpooling. So, finally, in the same way, you get the output at the final level, which becomes the same size as the input. And you use 1 × 1 filters, as many as the number of labels that you would like to use for segmenting the image.

For example, in this case we can see around 1, 2, 3, 4, 5, 6, say around 6 labels shown here, but it may happen that there are many other labels which are not reflected here because of the content of the image. We mentioned earlier that if there are k labels, then at the output you should have k filters of size 1 × 1 × (the number of channels used at the output).

This is how we generate the segmentation output from the decoder. However, there could be various decoding strategies, which we would like to see. When we are performing semantic segmentation, there are two stages: first is a learning stage, where you are trying to represent the input as features and learning those features; next, you are doing the up-convolutions to project it to the semantic domain.

So, there is a stage, as you can see, of upsampling the learned low-resolution semantic feature maps, and that is done using up-convolutions. During up-convolution, the filter weights are initialized with bilinear interpolation weights, and then you are predicting pixel-wise. The whole thing is trained in this way: your input-output specification is there, and you are training the network through downsampling and upsampling using this kind of structure.
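A hedged sketch of such an up-convolution with bilinear filter-weight initialization (PyTorch assumed): the helper below builds the standard bilinear kernel and is an illustrative construction, not code from the lecture.

```python
# Transposed convolution initialized as a bilinear interpolation filter.
import torch
import torch.nn as nn

def bilinear_kernel(channels, kernel_size):
    """Build a (channels, channels, k, k) weight tensor that performs bilinear upsampling."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = torch.arange(kernel_size).float()
    filt = 1 - torch.abs(og - center) / factor
    kernel2d = filt[:, None] * filt[None, :]
    weight = torch.zeros(channels, channels, kernel_size, kernel_size)
    for c in range(channels):
        weight[c, c] = kernel2d            # each channel upsampled independently
    return weight

k_classes = 21                             # illustrative number of classes
upconv = nn.ConvTranspose2d(k_classes, k_classes, kernel_size=4, stride=2,
                            padding=1, bias=False)   # 2x up-convolution
with torch.no_grad():
    upconv.weight.copy_(bilinear_kernel(k_classes, 4))
# The weights start as bilinear interpolation and are then learned end to end.
```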

(Refer Slide Time: 36:24)

So, in the encoder, it takes an input image and generates a high-dimensional feature vector; the encoder aggregates features at multiple levels. In the decoder, it takes the high-dimensional feature vector and generates a semantic segmentation mask; it decodes the features aggregated by the encoder at multiple levels. Semantically, it projects the discriminative features learned by the encoder, as I mentioned earlier, to get a dense classification.

So, the basic thing here is that in this semantic segmentation, the encoder takes the input image, processes it, and generates high-dimensional feature vectors, using features at multiple levels. In the decoder, the input is that high-dimensional feature, and it tries to project those features into the semantic domain. It tries to correlate those features with the labels, and that is how it gets a dense classification out of this process.

So, this diagram shows the flow of computation. All the convolutional layers and pooling layers are shown; on the left side you can see the encoder part and on the right side the decoder part. And you can see, as I mentioned, that the pooling indices are provided for the unpooling operations in the corresponding layer.

So, unpooling at this layer should be done using the pooling indices of this one; that is why it is shown by these arrows. These are all pooling indices, and this is the flow. You can see that there is actually no fully connected layer in this particular architecture; so, we also call it a fully convolutional neural network architecture.

(Refer Slide Time: 38:23)

For decoding there could be various options also. For example, you can use the VGG network, which is a convolutional neural network: use the unpooling operations and then perform stages of convolutions. Or you can use GoogLeNet, where the inception modules are used; once again unpooling has to be there, and as you know, GoogLeNet concatenates the outputs of different filter sizes but keeps the output size from each filter the same by using zero padding.

And you can use this kind of decoder. You can also use a ResNet structure, where you will only be learning the residual errors. So, several such options, architectures, and configurations are possible.

(Refer Slide Time: 39:25)

There is one particular architecture which is very popular with respect to this segmentation, called the U-Net architecture, and it also uses similar concepts. So, let me take a break here, and then we will discuss this architecture in particular. We will also show some results from this architecture in the next lecture.

Thank you very much for listening to this lecture.

Computer Vision
Prof. Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

Lecture – 60
Deep Neural Architecture and Applications Part – VI

We are discussing semantic segmentation using deep neural architectures, and in the last lecture I referred to a very popular segmentation architecture called the U-Net architecture.

(Refer Slide Time: 00:31)

So, in this lecture I will give you a brief overview of that architecture. In the U-Net architecture the computation is carried out in two successive stages. Like the previous architectures for semantic segmentation, there is a path for downsampling and upsampling. Here also we have a path for downsampling and a path for upsampling; we call them the contracting path and the expansive path.

So, in the contracting path we have a repeated application of two 3 × 3 convolutions, each of them followed by a rectified linear unit, and then a 2 × 2 max pooling operation with stride 2. At each downsampling we also double the number of feature channels. That is how this architecture is specified; you can get the details of this architecture from the paper shown here, which is available on the internet.
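A minimal sketch of one step of this contracting path (PyTorch assumed, channel counts illustrative): two unpadded 3 × 3 convolutions, each followed by ReLU, then a 2 × 2 max pooling with stride 2, with the number of channels doubled at the next level.

```python
# Hedged sketch of a U-Net contracting-path block.
import torch
import torch.nn as nn

def contracting_block(in_ch, out_ch):
    # out_ch is typically 2 * in_ch as we go down the contracting path
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3),   # unpadded 3x3 convolution
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3),
        nn.ReLU(inplace=True),
    )

down = nn.MaxPool2d(kernel_size=2, stride=2)
block1 = contracting_block(1, 64)      # e.g. grayscale input -> 64 channels
block2 = contracting_block(64, 128)    # channels double at the next level

x = torch.randn(1, 1, 572, 572)        # illustrative input size
f1 = block1(x)                         # (1, 64, 568, 568)
f2 = block2(down(f1))                  # (1, 128, 280, 280)
```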

(Refer Slide Time: 01:22)

About the expansive path, as I was mentioning, repeated upsampling is applied to the feature map, using a 2 × 2 up-convolution. Whereas in the contracting path the number of channels is doubled at every downsampling, in the expansive path the number of feature channels is halved at every upsampling. There is also a concatenation with the corresponding contracting path output, as we will show in the diagram.

So, in the expansive path we are using the upsampled features, and the features computed in the contracting path are concatenated with them. Half of the channels are coming from there, and using them you pass it to the next stage. After the upsampling, there are two 3 × 3 convolutions, and it goes through the stages. Finally, the size of the output becomes the same as the size of the input, and cropping is required to keep this constraint of size.
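A hedged sketch of one expansive-path step (PyTorch assumed, shapes illustrative): a 2 × 2 up-convolution that halves the channels, a center crop of the contracting-path feature map, concatenation, and two 3 × 3 convolutions.

```python
# Illustrative U-Net expansive step with skip concatenation and cropping.
import torch
import torch.nn as nn

def center_crop(feat, target_h, target_w):
    """Crop a contracting-path feature map to match the upsampled map's size."""
    _, _, h, w = feat.shape
    top, left = (h - target_h) // 2, (w - target_w) // 2
    return feat[:, :, top:top + target_h, left:left + target_w]

up = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)   # 2x2 up-conv halves channels
conv = nn.Sequential(
    nn.Conv2d(256, 128, kernel_size=3), nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, kernel_size=3), nn.ReLU(inplace=True),
)

x = torch.randn(1, 256, 24, 24)        # feature map from the layer below
skip = torch.randn(1, 128, 56, 56)     # feature map from the contracting path

x = up(x)                              # (1, 128, 48, 48)
skip = center_crop(skip, x.shape[2], x.shape[3])
x = torch.cat([skip, x], dim=1)        # (1, 256, 48, 48): half of the channels from each path
x = conv(x)                            # followed by two 3x3 convolutions
```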

(Refer Slide Time: 02:32)

And at the final layer, when you have generated all the pixels at the output, you classify each pixel into a number of classes. Every pixel at the final layer is represented by a 64-dimensional feature vector, and those 64-dimensional feature vectors become the input to the per-pixel classifier.

So, in total there are 23 convolutional layers. In this particular work, it has been applied to medical image segmentation, that is, segmentation of microscopic images, and data augmentation is used to increase the data pool; for example, they have used simulated deformations of the training images.

(Refer Slide Time: 03:23)

So, this is a schematic diagram taken from that paper itself. It is the U-Net architecture, and you can see how the contracting path downsamples the images to different sizes and the number of channels gets doubled there. Then in the expansive path, you have the upsampled images.

And not only that, as I was mentioning, there is a concatenation of features. These features are concatenated at this stage: half of them are coming from this stage and half from here, and this once again becomes the input to the next two convolutional layers. Then again it is upsampled, and this is performed in the same fashion.

So, this is the operation being done in the U-Net architecture. Using the input-output specification you train it, and once the training is over you can use it for the purpose of semantic segmentation.

(Refer Slide Time: 04:23)

Some of the results from that paper are reproduced here. You can see that these are microscopic images. The top one shows the segmentation of objects, which are cells; the yellow border is the manually segmented ground truth and the coloured part is the result. We can see that it is very close.

In the bottom rows, pairs of images are shown: the original image and the segmented image. It is quite interesting; it shows that the network captures the corresponding segments very nicely, and the result corroborates the ground truth and is very close to it.

(Refer Slide Time: 05:11)

Next, we will discuss another kind of architecture and consider other types of problems which could be solved using neural architectures; these networks are called Recurrent Neural Networks.

Let us first understand the difference between recurrent neural networks and conventional ones. In a conventional neural network, including a CNN, there is no feedback; this is shown on the left side by the green NN block, where the input is x and the output is y, and there are no connections from the output back towards the input. Whereas in a recurrent neural network you have feedback: there are connections from the output part back to the input of the block.

So, a conventional neural network is feed-forward and there is no feedback, whereas in a recurrent neural network you have feedback from the output of a neuron. A conventional neural network handles fixed-size input and output; in a recurrent neural network the input could be of variable size. There are sequences in the input, in the output, or, in the most general case, both.

A conventional neural network has a fixed number of layers, and since there is feedback, a recurrent network can actually simulate an infinite feed-forward network; it can unfold into an infinite number of layers. So, these are the characteristics of recurrent neural networks.

(Refer Slide Time: 06:47)

Let us quickly visit some of the applications of recurrent neural networks. One is semantic labeling of a sequence, where the network classifies the input as a sequence. Another is prediction in a sequence: using previous video frames to inform the understanding of the present frame is one example, or you can consider a language model, which tries to predict the next word based on the previous ones. And there is sequence translation and generation, where the input is a sequence and the output is also a sequence.

(Refer Slide Time: 07:25)

So, the basic feature of the recurrent neural network is that it introduces cycles and a notion of time. Consider this particular diagram, where the red square represents a recurrent neural network node. The input at time $t$ is $x_t$ and the output is $y_t$, but the intermediate output from the network is also given as feedback. So, you can see that the input is not only $x_t$: the previous time instance's intermediate output $h_{t-1}$ is also an input to this block.

So, it is as if there is a one-step delay, and this is how it works. This is used to design networks that process sequences of data, and it can produce sequences of outputs. Because of this feature, you are introducing a notion of sequence or time, and there is a kind of delay in the processing. So, when you are giving input, the input can come as a sequence, and when you are producing output, the output will also be a sequence.

(Refer Slide Time: 08:47)

So, to analyze a recurrent neural network, we have to unroll or unfold the network, and the number of stages of unrolling depends upon the length of the input sequence.

Suppose you have a sequence of length three, $x_0, x_1, x_2$, so $x_0$ is the past. Note the difference from a digital filter, where the input comes in from the left side and is shifted towards the right. Here, the past is always shown at the top of the figure and the subsequent samples follow successively, as you proceed in the time domain, because the network unfolds over the time instances.

So, the temporally current sample is always shown at the bottom layer, and in the successive layers the temporally following inputs come in the sequence. For example, if the sequence length is 3, you can consider that there are three such unfoldings.

(Refer Slide Time: 10:12)

You can also draw it in this fashion: you provide input $x_0$ in the first stage and get output $y_0$. It also generates an intermediate output $h_0$, which could be equal to $y_0$ or could be some function of $y_0$. That becomes an input to the next stage, the next unfolded block, which is the same RNN, but at that time instant it operates with input $x_1$ and the intermediate input $h_0$, and generates $y_1$.

At the next time stamp the inputs are $x_2$ and $h_1$, the intermediate output of the previous time instance, and it produces output $y_2$. It also generates an intermediate output $h_2$, but since there is no further input we may not observe it. So, now your sequence $y_0, y_1, y_2$ has been generated from $x_0, x_1, x_2$. That is a kind of explanation of the working of an RNN. As for the learning algorithm, if I unfold the network in this way, the same back-propagation algorithm can be used.
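A minimal sketch of this unfolding over a length-3 sequence (plain Python, with a placeholder step function standing in for the learned RNN block): the same block is applied at every stage, taking the current input and the previous intermediate output, and producing the output and the next intermediate output.

```python
# Illustrative unrolling of an RNN; the step function is a stand-in, not a trained network.
def step(h_prev, x):
    h = h_prev + x          # placeholder recurrence; a real RNN learns this mapping
    y = 2 * h               # placeholder output map
    return h, y

xs = [1.0, 2.0, 3.0]        # x0, x1, x2
h = 0.0                     # initial intermediate output
ys = []
for x in xs:                # one unfolded stage per time step, same block reused
    h, y = step(h, x)
    ys.append(y)            # y0, y1, y2
print(ys)                   # [2.0, 6.0, 12.0]
```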

(Refer Slide Time: 11:24)

So, this is trying to elaborate that particular aspect: the relationship between the intermediate output, the present input, and the past intermediate output. You can see that $y_t$ is a function of $h_t$, and $h_t = f_W(h_{t-1}, x_t)$ for that RNN block, where $W$ is the parameter.

So, $h_{t-1}$ is the old state, $x_t$ is the input vector at some time step, $f_W$ is a function with parameter $W$, and $h_t$ is the new state. The same function and the same set of parameters are used at every time step. And $y_t$ can be considered as a function of $h_t$; it is a linear function, where $W_{hy}$ is the matrix of weights connecting the intermediate output of the neuron to the final output.

(Refer Slide Time: 12:26)

So, one form of $h_t$ uses the hyperbolic tangent function, and it is shown with this weight set:
$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$$
Here $W_{hh}$ is a matrix multiplying the vector $h_{t-1}$, and $W_{xh} x_t$ brings in the input. The relationship between $y_t$ and $h_t$ is $y_t = W_{hy} h_t$, a linear combination of the components of $h_t$.
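A hedged NumPy sketch of the recurrence written above, with small illustrative dimensions (these sizes are not taken from the lecture):

```python
# h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = W_hy h_t
import numpy as np

hidden_dim, input_dim, output_dim = 3, 4, 4
W_hh = np.random.randn(hidden_dim, hidden_dim)
W_xh = np.random.randn(hidden_dim, input_dim)
W_hy = np.random.randn(output_dim, hidden_dim)

def rnn_step(h_prev, x_t):
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)   # new state from old state and input
    y_t = W_hy @ h_t                            # output is a linear map of the state
    return h_t, y_t

h = np.zeros(hidden_dim)
x = np.random.randn(input_dim)
h, y = rnn_step(h, x)                           # same weights reused at every time step
```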

(Refer Slide Time: 12:57)

So, one particular example is shown here; in this case the example is drawn from natural language processing, or string processing. What we are trying to predict is the next character of a word. So, there are training sequences.

Say "hello" is a training sequence. As you can see, the RNN can be unfolded over this input: the length of the string is five and the number of predictions is four. So, we consider unfolding of the RNN over four stages of input characters; this is how the input proceeds with the temporal timestamp, and this is the structure. These are the weights at each stage, and this is the weight for $h_{t-1}$; a one-dimensional input is shown here.

Now, the hidden layer has three nodes, a three-dimensional vector. Accordingly, you should have the weight matrix $W_{hh}$, and also $W_{xh} x_t$ should come here, and then you are using the tanh layer. So, there are three such nodes here, and from there your output is also a vector, a four-dimensional vector.

Now, you are using one-hot encoding here; you can see the encoding which is used. For 'h' we are using a four-dimensional one-hot encoding: this position of one shows 'h', this is 'e', this is 'l', this is 'l', and these are the outputs being produced here. The specified ground truth here should have been $(0, 1, 0, 0)^T$, but instead you are producing this output. So, the loss function will be determined in this fashion.

Similarly, this is the prediction of 'l'; I mean it should have been 'l', and the ground truth should have been $(0, 0, 1, 0)^T$, but it produces this output instead. In this way the loss function is defined.
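A minimal sketch of the one-hot encoding used in this "hello" example (NumPy): the vocabulary {h, e, l, o} gives 4-dimensional vectors, and each input character is paired with the next character as its ground truth.

```python
# One-hot encoding of the "hello" training sequence.
import numpy as np

vocab = ['h', 'e', 'l', 'o']
char_to_idx = {c: i for i, c in enumerate(vocab)}

def one_hot(c):
    v = np.zeros(len(vocab))
    v[char_to_idx[c]] = 1.0
    return v

word = "hello"
inputs  = [one_hot(c) for c in word[:-1]]   # h, e, l, l  (four input stages)
targets = [one_hot(c) for c in word[1:]]    # e, l, l, o  (ground truths for the loss)
print(inputs[0])    # [1. 0. 0. 0.]  -> 'h'
print(targets[0])   # [0. 1. 0. 0.]  -> 'e', the specified ground truth at the first step
```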

(Refer Slide Time: 15:19)

These are some examples of possible configurations and of how different kinds of neural networks, different architectures, could be used for solving different types of problems. Here, you can see that we have given some numbers; each rectangle is a vector, arrows represent functions, input vectors are shown in red, and output vectors are in blue.

(Refer Slide Time: 15:47)

So, let me just explain these different configurations. The first one is the one-to-one configuration, which means the input is a fixed-size vector and the output is also of fixed size. An example is image classification; this is conventional neural network processing, and you do not require an RNN for it. Whereas in the second one, the structure is such that the input is of fixed size and the output is a sequence.

Here you require an RNN, and since the output is a sequence, the example could be image captioning, which takes an image and outputs a sentence of words.

(Refer Slide Time: 16:38)

In the third one, the input is a sequence and the output is fixed. A typical example could be sentiment analysis, where a given sentence is classified as expressing a positive or negative sentiment. In the fourth one, the input is a sequence and the output is also a sequence; an example could be machine translation, which reads a sentence in English and outputs the sentence in French, a typical example of such a problem, and here also an RNN should be used.

So, only the first one is a conventional network; for all these kinds of sequence processing we need to use an RNN. Consider the fifth one: you have a synchronized sequence input and sequence output, and it could be video classification, labelling each frame of the video.

(Refer Slide Time: 17:46)

So, these are some examples of the use of RNNs in different applications. Sometimes you can even convert a pixel input into a sequence by following a certain predetermined order of scanning the input data.

We call this sequential processing of non-sequence data; for example, an image can be scanned from left to right and top to bottom. That itself will provide you a sequence of regions or blocks, which can then be input to an RNN, and we can process them for various purposes.
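A hedged sketch of this conversion (NumPy, block size and image size illustrative): blocks are read left to right, top to bottom, and the resulting list of patches can then be fed to an RNN one step at a time.

```python
# Turning non-sequence data (an image) into a sequence by a predetermined scan order.
import numpy as np

image = np.random.rand(64, 64)      # illustrative grayscale image
block = 16
patches = []
for r in range(0, image.shape[0], block):        # top to bottom
    for c in range(0, image.shape[1], block):    # left to right
        patches.append(image[r:r + block, c:c + block].flatten())
print(len(patches), patches[0].shape)            # 16 patches, each a 256-d vector
```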

(Refer Slide Time: 18:20)

So, this is showing one kind of such processing: the blocks are moving in a sequence, and we would like to classify images by taking a series of glimpses.

(Refer Slide Time: 18:33)

Another example: classifying images by taking a series of glimpses.

(Refer Slide Time: 18:38)

One very popular application of this kind is image captioning, where we can also scan image regions. But in this case, what it does is represent the image as a feature vector using a CNN, and that is the input to an RNN, which outputs the description one word at a time. Once again, depending upon the length of the output sequence, we decide the length of the unfolding of the RNN that has to be there.

For example, in this particular figure the image could be described as "straw hat", and that is the sentence; it has a start and it has an end. With that kind of specification you are training the RNN in this way: you use the convolutional neural network to describe the feature, followed by the recurrent neural network to extract the caption.

(Refer Slide Time: 19:42)

So, this just shows the workflow: you use a convolutional neural network, where the image is the input, it goes through several layers of the convolutional neural network, and finally you represent it as a feature vector. In fact, the last fully connected layers are mostly used for classification; you can remove that classification layer and use only the representation of the last fully connected layer, FC-4096, which means the feature dimension is 4096 at the last layer in this case.

So, that could be the input to your RNN, and another input is a start symbol from where you are starting the sentence generation; the sample generated at each step itself becomes the input towards the next intermediate step.
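A hedged sketch of this captioning pipeline (PyTorch assumed): a CNN feature such as the FC-4096 output conditions an RNN that emits one word at a time starting from a START token. The module sizes, the START symbol, and the greedy decoding loop are illustrative, not the lecture's exact model.

```python
# Illustrative CNN-feature -> RNN caption generation loop.
import torch
import torch.nn as nn

feat_dim, hid, vocab = 4096, 512, 10000
img_to_h = nn.Linear(feat_dim, hid)        # image feature initializes the hidden state
embed = nn.Embedding(vocab, hid)           # embedding of the previously generated word
rnn_cell = nn.RNNCell(hid, hid)
to_word = nn.Linear(hid, vocab)            # scores over the vocabulary

START = 0                                  # hypothetical index of the start symbol
img_feat = torch.randn(1, feat_dim)        # e.g. an FC-4096 feature from the CNN
h = torch.tanh(img_to_h(img_feat))
word = torch.tensor([START])
caption = []
for _ in range(10):                        # unfold the RNN up to a maximum caption length
    h = rnn_cell(embed(word), h)
    word = to_word(h).argmax(dim=1)        # greedy choice; the sample feeds the next step
    caption.append(word.item())
```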

(Refer Slide Time: 20:39)

So, these are samples and it goes like this. In the same way (Refer Time: 20:44) the RNN operates, and then it can generate a sentence. As you train on a number of images, then depending upon the image content it tries to predict those labels on which the network has been trained.

(Refer Slide Time: 20:59)

There is a dataset called Microsoft COCO; it has about 120K images, and for each image there are 5 sentences, which are used for training, and there is an example. So, you provide this as the training set to the network, and then you can generate different kinds of image captions.

(Refer Slide Time: 21:18)

Some examples are shown here; consider the top row and the leftmost image, where you have "man in black shirt is playing guitar", and another one is shown as "construction worker in orange safety (Refer Time: 21:39) is working on road". These are very interesting results, and presently a lot of work is going on in video summarization, image captioning, and many other things using this kind of network.

(Refer Slide Time: 21:55)

So, let me summarize the topics we have covered under deep neural architectures and their applications. We have seen that a deep architecture works on the same principle as an artificial neural network, but it has a large number of hidden layers and a large number of weights.

One very popular deep neural architecture is the convolutional neural network, because it can learn filter weights and share those weights. There are usually two types of layers in this process: one is the convolutional layer and the other is the pooling layer, used for feature description. Two stages are involved, feature extraction and classification, and in the classification stage we usually use fully connected neural networks.

(Refer Slide Time: 22:50)

So, there are variations of convolutional neural networks: say ResNet, which processes residual errors; then R-CNN, which has a region proposal network and localizes objects; and also the fully convolutional CNN, which is used for image segmentation. Then there is the recurrent neural network, which is a different kind of neural architecture, quite different from the convolutional neural network, because it has a feedback loop and is not a feed-forward network.

Recurrent neural networks are mainly used for processing a sequence. So, with this let me stop here. This is the end of this particular topic and this lecture.

Thank you very much for listening to my lecture.
