Seminar Project: Face Recognition
Programme
Seminar Report
BCA Sem VI
AY 2022-23
Project Guide: Prof. Nehal Patel
Acknowledgement
The success and final outcome of this project required a great deal of guidance and assistance from many people, and we are extremely fortunate to have received this throughout the completion of our project work. Whatever we have achieved is only possible because of such supervision and assistance, and we would not forget to thank them.
We thank I/C Principal Dr. Aditi Bhatt, Head of Department Dr. Vaibhav Desai, our project guide Prof. Nehal Patel, and all other assistant professors of SDJ International College, who took keen interest in our project work and guided us until its completion.
We are extremely grateful to our guide for providing such kind support and guidance despite her busy schedule managing college affairs.
We are thankful and fortunate to have received support and guidance from everyone who helped us along the way.
Abstract
Face recognition has been one of the most interesting and important research fields over the past two decades. The reasons come from the need for automatic recognition and surveillance systems, the interest in how the human visual system recognizes faces, and the design of human-computer interfaces. This research draws on knowledge and researchers from disciplines such as neuroscience, psychology, computer vision, pattern recognition, image processing, and machine learning. Many papers have been published to overcome different factors (such as illumination, expression, scale, and pose) and to achieve better recognition rates, yet there is still no technique that is robust against uncontrolled practical cases, which may involve several of these factors simultaneously. In this report, we go through the general ideas and structure of recognition, important issues and factors of human faces, critical techniques and algorithms, and finally give a comparison and conclusion.
Table of Contents:
(1) Introduction to face recognition: Structure and Procedure
(2) Fundamentals of face pattern recognition
(3) Issues and factors of human faces
(4) Techniques and algorithms on face detection
(5) Techniques and algorithms on face feature extraction and face recognition
(6) Comparison and Conclusion
1. Introduction to Face Recognition: Structure and Procedure
In this report, we focus on image-based face recognition. Given a picture taken from a digital camera, we would like to know whether there is any person inside, where his or her face is located, and who he or she is. Towards this goal, we generally separate the face recognition procedure into three steps: Face Detection, Feature Extraction, and Face Recognition.
Face Detection:
The main function of this step is to determine (1) whether human faces appear in a given image, and (2) where these faces are located. The expected outputs of this step are patches containing each face in the input image. In order to make the subsequent face recognition system more robust and easier to design, face alignment is performed to normalize the scales and orientations of these patches. Besides serving as the pre-processing step for face recognition, face detection can also be used on its own for applications such as region-of-interest detection.
Feature Extraction:
After the face detection step, human-face patches are extracted from images. Directly using these patches for face recognition has some disadvantages. First, each patch usually contains over 1000 pixels, which is too large to build a robust recognition system. Second, face patches may be taken from different camera alignments, with different facial expressions and illuminations, and may suffer from occlusion and clutter. To overcome these drawbacks, feature extraction is performed to achieve information packing, dimension reduction, salience extraction, and noise cleaning. After this step, a face patch is usually transformed into a vector with a fixed dimension or into a set of fiducial points and their corresponding locations. We will talk about this step in more detail in Section 2. In some literature, feature extraction is either included in face detection or in face recognition.
Face Recognition:
After formulating the representation of each face, the last step is to recognize the identities of these faces. Based on a database of enrolled faces, the features of an input face are compared with the stored feature vectors of each person, and the most similar class is reported as the identity, as illustrated in fig. 2.
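To make the three-step flow concrete, the sketch below outlines such a pipeline in Python. It is only a minimal illustration, not the method of any cited paper: detection uses a pretrained OpenCV Haar cascade, the "feature" is just the normalized raw patch, and the file name and enrolled database are hypothetical.

```python
import cv2
import numpy as np

def detect_faces(image_gray, cascade):
    """Step 1: face detection -- return cropped, size-normalized face patches."""
    boxes = cascade.detectMultiScale(image_gray, scaleFactor=1.1, minNeighbors=5)
    return [cv2.resize(image_gray[y:y + h, x:x + w], (64, 64)) for (x, y, w, h) in boxes]

def extract_features(patch):
    """Step 2: feature extraction -- a stand-in that flattens and normalizes the patch."""
    v = patch.astype(np.float32).ravel()
    return (v - v.mean()) / (v.std() + 1e-8)

def recognize(feature, database):
    """Step 3: recognition -- nearest neighbour over stored feature vectors."""
    names = list(database)
    dists = [np.linalg.norm(feature - database[n]) for n in names]
    return names[int(np.argmin(dists))]

# Usage sketch (paths and database contents are illustrative only):
# cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
# img = cv2.imread("group_photo.jpg", cv2.IMREAD_GRAYSCALE)
# db = {"alice": extract_features(alice_patch), "bob": extract_features(bob_patch)}
# print([recognize(extract_features(p), db) for p in detect_faces(img, cascade)])
```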
Figure 2: An example of how the three steps work on an input image. (a) The input image and the result of face detection (the red rectangle). (b) The extracted face patch. (c) The feature vector after feature extraction. (d) Comparing the input vector with the stored vectors in the database by classification techniques and determining the most probable class (the red rectangle).
2. Fundamentals of Face Pattern Recognition
Before going into the details of face recognition techniques and algorithms, we would like to make a digression here to talk about pattern recognition. The discipline of pattern recognition includes all kinds of recognition tasks, such as speech recognition, object recognition, data analysis, and face recognition. In this section, we will not discuss those specific applications, but introduce the basic structure, general ideas, and general concepts behind them.
The general structure of pattern recognition is shown in fig. 3. In order to build a system for recognition, we always need data sets for building categories and for comparing the similarities between the test data and each category. A test sample is usually called
a "query" in the image retrieval literature, and we will use this term throughout this report. From fig. 3, we can easily notice the symmetric structure. Starting from the data-set side, we first perform dimension reduction on the stored raw data. The methods of dimension reduction can be categorized into data-driven methods and domain-knowledge methods. After dimension reduction, each raw data item in the data set is transformed into a set of features, and the classifier is mainly trained on these feature representations. When a query comes in, we perform the same dimension reduction procedure on it and feed its features into the trained classifier. The output of the classifier will be the optimal class label (sometimes with the classification confidence) or a rejection note (return to manual classification).
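The following minimal scikit-learn sketch mirrors this symmetric structure, with PCA as a stand-in data-driven dimension reduction and a nearest-neighbour classifier; the random data is purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy data standing in for the stored raw data (rows = samples, columns = raw measurements).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 400))   # 200 training samples, 400 raw dimensions
y_train = rng.integers(0, 5, size=200)  # 5 categories

# Data-driven dimension reduction (PCA) followed by a classifier trained on the features.
model = make_pipeline(PCA(n_components=20), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)

# A query goes through the same dimension reduction before classification.
query = rng.normal(size=(1, 400))
print(model.predict(query))             # predicted class label for the query
```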
Notation
There are several conventional notations in the literature of pattern recognition and machine learning. We usually denote a matrix with an upper-case character and a vector with a lower-case one.
If the label of each training sample is known, what we try to learn is the relation between the feature vectors and their corresponding labels, and this kind of learning is called supervised learning. On the other hand, if the label of each training sample is unknown, then what we try to learn is the distribution of the possible categories of feature vectors in the training data set, and this kind of learning is called unsupervised learning. In fact, there is another kind of learning called semi-supervised learning, in which only part of the training data has labels; this kind of learning is beyond the scope of this report.
Evaluation Methods
Besides the choice of pattern recognition methods, we also need to evaluate the performance of the experiments. There are two main evaluation plots: the ROC (receiver operating characteristic) curve and the PR (precision and recall) curve. The ROC curve examines the relation between the true positive rate and the false positive rate, while the PR curve examines the relation between the detection rate (recall) and the detection precision. In the two-class recognition case (for example, face versus non-face), the true positive rate is the portion of face images detected by the system, while the false positive rate is the portion of non-face images wrongly detected as faces. The term true positive rate here has the same meaning as detection rate and recall, and we give a detailed description in table 1 and table 2. In fig. 5, we show examples of the PR curve. In addition to using curves for evaluation, there are some frequently used values for performance judgment, and we summarize them in table 3.
The threshold used to decide between positive and negative for a given case plays an important role in pattern recognition. With a low threshold, we achieve a high true positive rate but also a high false positive rate, and vice versa. Note that each point on the ROC curve or PR curve corresponds to a specific threshold.
The terms positive and negative reveal the asymmetric condition in detection tasks, where one class is the desired pattern class and the other class is its complement. In tasks where each class has equal importance or similar meaning (for example, each class denotes one kind of object), the error rate is preferred instead.
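The short sketch below makes the threshold / evaluation relationship concrete: each threshold gives one (FPR, TPR) point on the ROC curve and one (recall, precision) point on the PR curve. The scores and labels are toy values for illustration only.

```python
import numpy as np

def confusion_rates(scores, labels, threshold):
    """Evaluation values for one decision threshold.
    scores: detector outputs (higher = more face-like); labels: 1 = face, 0 = non-face."""
    predicted = scores >= threshold
    tp = np.sum(predicted & (labels == 1))   # faces correctly detected
    fp = np.sum(predicted & (labels == 0))   # non-faces wrongly detected as faces
    fn = np.sum(~predicted & (labels == 1))
    tn = np.sum(~predicted & (labels == 0))
    tpr = tp / (tp + fn)                     # true positive rate = detection rate = recall
    fpr = fp / (fp + tn)                     # false positive rate
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    return tpr, fpr, precision

# Sweeping the threshold traces out the ROC and PR curves.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])
for t in (0.25, 0.5, 0.75):
    print(t, confusion_rates(scores, labels, t))
```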
Fig 5: An example of the PR curve. This is the experimental result of a video fingerprinting technique, where five different methods are compared. The horizontal axis indicates the recall and the vertical axis indicates the precision.
Table 1 (excerpt):
Term — Definition
Recall (R) — # of true positives / # of all desired patterns in the validation set
Conclusion
The tasks and cases discussed in the previous sections give an overview of pattern recognition. To gain more insight into the performance of pattern recognition techniques, we need to take care of some important factors. In template matching, the number of templates for each class and the adopted distance metric directly affect the recognition result. In statistical pattern recognition, there are four important factors: the size of the training data N, the dimensionality of each feature vector d, the number of classes C, and the complexity of the classifier h; we summarize their meanings and relations in table 4 and table 5. In the syntactic approach, we expect that the more rules are considered, the higher the recognition performance we can achieve, although the system also becomes more complicated; and sometimes it is hard to transfer and organize human knowledge into algorithms. Finally, in neural networks, the number of layers, the number of perceptrons (neurons) used, the dimensionality of the feature vectors, and the number of classes all affect the recognition performance. More interestingly, neural networks have been shown to have close relationships with statistical pattern recognition techniques [5].
Holistic-based or feature-based
This is another interesting argument in psychophysics / neuroscience as well as in algorithm design. The holistic-based viewpoint claims that humans recognize faces by their global appearance, while the feature-based viewpoint believes that important features such as the eyes, nose, and mouth play dominant roles in identifying and remembering a person. The design of face recognition algorithms also applies these perspectives, as discussed in Section 5.
Thatcher Illusion
The Thatcher illusion is an excellent example showing how face alignment affects human recognition of faces. In the illusion shown in fig. 6, the eyes and mouth of an expressive face are excised and inverted, and the result looks grotesque in an upright face. However, when the face is shown inverted, it looks fairly normal in appearance, and the inversion of the internal features is not readily noticed.
Appearance-variant factors
There are six main factors of human-face appearance variation: (1) illumination, (2) face pose, (3) face expression, (4) RST (rotation, scale, and translation) variation, (5) cluttered background, and (6) occlusion. Table 6 lists the details of each factor:
Pose — The pose variation is caused by the relative position of the face and the camera during the image acquisition process. This variation changes the spatial relations among facial features and causes serious distortion for traditional appearance-based face recognition algorithms such as eigenfaces and fisherfaces. An example of pose variation is shown in fig. 8.
Expression — Humans use different facial expressions to express their feelings or tempers. The expression variation results not only in spatial-relation changes but also in facial-feature shape changes.
RST variation — The RST (rotation, scaling, and translation) variation is also caused by the relative position of the face and the camera during acquisition.
Occlusion — Occlusion occurs frequently in practical face recognition and face detection. It means that some parts of human faces are unobserved, especially the facial features.
Figure 7: Face-patch changes under different illumination conditions. We can easily see how strongly the illumination can affect the face appearance.
Figure 8: Face-patch changes under different pose conditions. When the head pose changes, the spatial relations (distance, angle, etc.) among fiducial points (eyes, mouth, etc.) also change and result in serious distortion for the traditional appearance representation.
Design issues
When designing a face detection and face recognition system, in addition to considering the aspects from psychophysics and neuroscience and the factors of human appearance variation, there are still some design issues to be taken into account.
First, the execution speed of the system determines the possibility of on-line service and the ability to handle large amounts of data. Some earlier methods could accurately detect human faces and determine their identities using complicated algorithms, but they require from a few seconds to a few minutes per input image and cannot be used in practical applications. By contrast, several types of digital cameras now have the function to detect and focus on human faces, and this detection process usually takes less than 0.5 second. In recent pattern recognition research, many published papers concentrate on how to speed up existing algorithms and how to handle large amounts of data simultaneously, and new techniques also include the
execution time in their experimental results as a point of comparison against other techniques.
Second, the training data size is another important issue in algorithm design. It is clear that the more data are included, the more information we can exploit and the better performance we can achieve. In practical cases, however, the database size is usually limited due to the difficulty of data acquisition and human privacy concerns. Under the condition of limited data size, the designed algorithm should not only capture information from the training data but also include some prior knowledge, or try to predict and interpolate the missing and unseen data. In the comparison between the eigenface and the fisherface, it has been shown that under limited data size the eigenface can achieve better performance than the fisherface.
Finally, how to bring these algorithms into uncontrolled conditions is still an unsolved problem. In Section 3.2, we mentioned six types of appearance-variant factors; to our knowledge, there is still no technique that handles all of these factors well simultaneously. For future research, besides designing new algorithms, we would try to combine existing algorithms and modify the weights and relationships among them to see if face detection and recognition can be extended to uncontrolled conditions.
4. Face detection
From this section on, we start to talk about the technical and algorithmic aspects of face recognition. We follow the three-step procedure depicted in fig. 1 and introduce each step in order: face detection is introduced in this section, and feature extraction and face recognition are introduced in the next section. In the survey by Yang et al. [7], face detection algorithms are classified into four categories: knowledge-based, feature-invariant, template-matching, and appearance-based methods. We follow their idea, describe each category, and present excellent examples in the following subsections. Note that there are generally two face detection settings: one is based on gray-level images, and the other is based on color images.
Knowledge-based methods
These rule-based methods encode human knowledge of what constitutes a typi-
cal face. Usually, the rules capture the relationships between facial features. These
methods are designed mainly for face localization, which aims to determine the im-
age position of a single face. In this subsection, we introduce two examples based on
hierarchical knowledge-based method and vertical / horizontal projection.
This method uses a fairly simple image processing technique: the horizontal and vertical projections of image intensity. Local minima of the projection profiles are used to locate the face boundary and the rows of the facial features,
which together constitute a face candidate. Finally, each face candidate is validated by further detection rules such as the presence of eyebrows and nostrils. As shown in fig. 10, this method is sensitive to complicated backgrounds and cannot be used on images with multiple faces.
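The following short sketch computes the horizontal and vertical intensity projections described above. It only illustrates the idea; the actual rule set of the cited method (and its exact use of the profiles) is not reproduced here.

```python
import numpy as np

def intensity_projections(gray):
    """Horizontal and vertical projections of image intensity: rows or columns containing
    dark facial features show up as dips in the profiles."""
    horizontal = gray.sum(axis=1)   # one value per row
    vertical = gray.sum(axis=0)     # one value per column
    return horizontal, vertical

# Toy usage: candidate feature rows are local minima of the horizontal profile.
gray = np.random.default_rng(1).integers(0, 256, size=(120, 100)).astype(float)
h, v = intensity_projections(gray)
feature_rows = [i for i in range(1, len(h) - 1) if h[i] < h[i - 1] and h[i] < h[i + 1]]
print(len(feature_rows), "candidate rows")
```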
In this work, Hsu et al. [10] proposed to combine several features for face detection. They used color information for skin-color detection to extract candidate face regions. In order to deal with different illumination conditions, they extracted the 5% brightest pixels and used their mean color for lighting compensation. After skin-color detection and skin-region segmentation, they proposed to detect invariant facial features for region verification. The eyes and mouth are selected as the most significant features of faces, and two detection schemes are designed based on chrominance contrast and morphological operations, called the "eyes map" and the "mouth map". Finally, a triangle is formed between the two eyes and the mouth and is verified based on (1) the luminance variations and average gradient orientations of the eye and mouth blobs, (2) the geometry and orientation of the triangle, and (3) the presence of a face boundary around the triangle. The regions that pass the verification are labeled as faces, and the Hough transform is performed to extract the best-fitting ellipse around each face.
This work gives a good example of how to combine several different techniques in a cascade fashion. The lighting compensation process does not have a solid theoretical background, but it introduces the idea that, instead of modeling all kinds of illumination conditions with complicated probability or classifier models, we can design an illumination-adaptive model which modifies its detection threshold based on the illumination and chrominance properties of the present image.
Figure 11: The flowchart of the face detection algorithm proposed by Hsu et al. [10]
The eyes map and the mouth map show great performance with fairly simple operations, and in our recent work we also adopt their framework and try to design more robust maps.
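The sketch below illustrates the spirit of this stage: a "reference white" lighting compensation based on the top 5% brightest (luma) pixels, followed by a simple chrominance-range skin mask. It is an assumption-laden simplification; the numeric skin-color bounds are common illustrative values, not the thresholds of Hsu et al.

```python
import cv2
import numpy as np

def lighting_compensation(bgr):
    """Rescale color channels so the mean color of the top 5% brightest pixels maps to white.
    A sketch of the 'reference white' idea, not the exact model of the cited paper."""
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    luma = ycrcb[:, :, 0].astype(np.float32)
    thresh = np.percentile(luma, 95)
    ref = bgr[luma >= thresh].reshape(-1, 3).mean(axis=0)   # mean color of brightest pixels
    gain = 255.0 / np.maximum(ref, 1.0)
    return np.clip(bgr.astype(np.float32) * gain, 0, 255).astype(np.uint8)

def skin_mask(bgr):
    """Simple chrominance-range skin detection in YCrCb (illustrative bounds)."""
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    lower = np.array((0, 133, 77), np.uint8)
    upper = np.array((255, 173, 127), np.uint8)
    return cv2.inRange(ycrcb, lower, upper)

# img = cv2.imread("photo.jpg"); mask = skin_mask(lighting_compensation(img))
```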
Figure 14: The locations of the missing features are estimated from two feature points. The ellipses show the areas which with high probability include the missing features. [11]
Figure 15: An example of the ASM for resistor shapes. In (a), the shape variation is summarized and several discrete points are extracted from the shape boundaries for shape learning, as shown in (b). From (c) to (e), the effects of changing the weight of the first three principal components are presented, and we can see the relationship between these components and the shape variation. [15]
The ASM model can only deal with shape variation, not texture variation. Following this work, many studies have tried to combine shape and texture variation; for example, Edwards et al. proposed first matching an ASM to boundary features in the image, and then using a separate eigenface model (a texture model based on the PCA) to reconstruct the texture in a shape-normalized frame. This approach is not, however, guaranteed to give an optimal fit of the appearance (shape boundary and texture) model to the image, because small errors in the match of the shape model can result in a shape-normalized texture map that cannot be reconstructed correctly using the eigenface model. To match shape and texture simultaneously, Cootes et al. proposed the well-known active appearance model (AAM) [19][20].
The active appearance model requires a training set of annotated images where corresponding points have been marked on each example. In fig. 16, we show that to build a facial model, the main features of human faces must be marked manually (each face image is labeled as a shape vector x). The ASM is then applied to align these shapes and build a statistical shape model; in addition,
each training face is warped so that its points match those of the mean shape, obtaining a shape-free patch. These shape-free patches are further represented as vectors and undergo an intensity normalization process (each normalized patch is denoted as g). By applying the PCA to the intensity-normalized data we obtain a linear model that captures the possible texture variation. We summarize the process that has been done so far for the AAM as follows:

x = x̄ + P_s b_s,    g = ḡ + P_g b_g,

where P_s contains the orthonormal bases of the ASM and b_s is the set of shape parameters for each training face. The matrix P_g contains the orthonormal bases of the texture variation and b_g is the set of texture parameters for each intensity-normalized shape-free patch. The details and process of the PCA are described in Section 5.
To capture the correlation between shape and texture variation, a further PCA is applied to the data as follows. For each training example we generate the concatenated vector

b = ( W_s b_s ; b_g ),

where W_s is a diagonal matrix of weights for each shape parameter, allowing for the difference in units between the shape and texture models. The PCA is applied on these vectors to generate a further model:

b = Q c.
An example image can be synthesized for a given c by generating the shape-free texture patch first and then warping it to the suitable shape.
In the training phase for face detection, we learn the mean vectors of shape and texture, the projection matrices P_s and P_g, and Q to generate a facial AAM. In the face detection phase,
we modify the vector c and the location and scale of the model to minimize the difference between the synthesized appearance and the current location and scale in the input image. After reaching a local minimum difference, we compare it with a pre-defined threshold to determine the existence of a face. Fig. 17 illustrates the difference-minimization process. The parameter modification is a rather complicated optimization problem, and in their work they combined the genetic algorithm with a pre-defined parameter-refinement matrix to facilitate the convergence process. These techniques are beyond the scope of this report, and readers who are interested in them can refer to the original papers [19].
Figure 16: A labeled training image gives a shape-free patch and a set of points. [19]
Figure 17: The fitting procedure of the active appearance model after specific numbers of iterations. [19]
Appearance-based methods
In contrast to template matching, here the models (or templates) are learned from a set of training images which should capture the representative variability of facial appearance. These learned models are then used for detection. These methods are designed mainly for face detection, and two highly cited works are introduced in the following sections. More significant techniques are included in [7][24][25][26].
Fast face detection based on the Haar features and the Adaboost algorithm
Appearance-based methods usually have better performance than feature-invariant methods because they scan all possible locations and scales in the image, but this exhaustive searching procedure also results in considerable computation. In order to speed up this procedure, Viola et al. [22][23] proposed the combination of Haar features and the Adaboost classifier [18][28]. The Haar features are used to capture the significant characteristics of human faces, especially the contrast features. Fig. 19 shows the four adopted feature shapes, where each feature is labeled by its width, length, type, and contrast value.
Figure 18: The modeling procedure of the distribution of face and non-face samples. A window size of 19x19 is used for representing the canonical human frontal face. In the top row, a six-component Gaussian mixture model is trained to capture the distribution of face samples, while in the bottom row a six-component model is trained for non-face samples. The centroids of each component are shown on the right side of the figure. [21]
The contrast value of a feature is calculated as the average intensity in the black region minus the average intensity in the white region. A 19x19 window typically contains more than one thousand Haar features, which results in a huge computational cost, yet many of them do not contribute to the classification between face and non-face samples because both face and non-face samples exhibit these contrasts. To efficiently apply the large number of Haar features, the Adaboost algorithm is used to perform feature selection, and only those features with higher discriminative ability are chosen. Fig. 19 also shows two significant Haar features which have the highest discriminative ability. For further speedup, the chosen features are utilized in a cascade fashion, where the features with higher discriminative ability are tested in the first few stages and the image windows passing these tests are fed into the later stages for more detailed tests. The cascade procedure can quickly filter out many non-face regions by testing only a few features at each stage, bringing significant computational savings.
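To make the feature definition concrete, the sketch below evaluates a simple two-rectangle Haar-like feature exactly as described above (mean of the black region minus mean of the white region). It is only an illustration; real detectors evaluate thousands of such features efficiently, typically with integral images, which are not shown here.

```python
import numpy as np

def two_rect_haar_feature(gray, x, y, w, h):
    """Two-rectangle Haar-like feature at (x, y) with size (w, h):
    average intensity of the left (black) half minus the right (white) half."""
    left = gray[y:y + h, x:x + w // 2]
    right = gray[y:y + h, x + w // 2:x + w]
    return float(left.mean() - right.mean())

# Canonical 19x19 window filled with toy values.
window = np.random.default_rng(2).integers(0, 256, size=(19, 19)).astype(float)
print(two_rect_haar_feature(window, 0, 0, 18, 19))
```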
The key concept of the cascade procedure is to keep a sufficiently high true positive rate at each stage, and this can be reached by modifying the decision threshold of the boosted classifier at each stage. Although raising the true positive
rate in this way also increases the false positive rate, this effect can be attenuated by the cascade procedure. For example, a classifier with a 99% true positive rate and a 20% false positive rate is not sufficient for practical use, while cascading this performance five times results in about a 95% true positive rate and a 0.032% false positive rate, which is a surprising improvement. During the training phase of the cascade procedure, we set a lower bound on the true positive rate and an upper bound on the false positive rate for each stage and for the whole system. We train each stage in turn to achieve the desired bounds, and add a new stage if the bound for the whole system has not yet been reached.
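The numbers above follow directly from multiplying the per-stage rates, since a window is accepted only if it passes every stage; a tiny arithmetic check:

```python
# Per-stage performance of a weak but fast classifier.
tpr_stage = 0.99   # true positive rate per stage
fpr_stage = 0.20   # false positive rate per stage
stages = 5

# A window is accepted only if it passes every stage, so the rates multiply.
tpr_cascade = tpr_stage ** stages   # 0.99**5 ~= 0.951  -> about 95% detection
fpr_cascade = fpr_stage ** stages   # 0.20**5  = 0.00032 -> about 0.032% false positives
print(f"cascade TPR = {tpr_cascade:.3f}, cascade FPR = {fpr_cascade:.5f}")
```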
In the face detection phase, several window scales and locations are chosen to extract possible face patches from the image, and we test each patch with the trained cascade; the patches which pass all the stages are labeled as faces. Many later works are based on this framework, such as [32].
Figure 19: The Haar features and their ability to capture the significant contrast features of the human face. [23]
Figure 20: The cascade procedure during the training phase. At each stage, only a portion of the patches are denoted as faces and passed to the following stage for further verification; the patches denoted as non-faces are rejected immediately.
Part-based methods
With the development of the graphical model framework [33] and point-of-interest detectors such as the difference-of-Gaussian detector [34] (used in SIFT) and the Hessian-affine detector [35], part-based methods have recently attracted more attention. We would like to introduce two outstanding examples: one is based on a generative model and one is based on the support vector machine (SVM) classifier.
R. Fergus et al. [36] proposed to learn and recognize object models from unlabeled and unsegmented cluttered scenes in a scale-invariant manner. Objects are modeled as flexible constellations of parts, and only the topic of each image needs to be given (for example, cars, people, or motorbikes). The object model is built on a probabilistic representation, and each object is denoted by the parts detected by an entropy-based feature detector. Aspects including the appearance, scale, shape, and occlusion of each part and of the object are considered and modeled by the probabilistic representation to deal with possible object variations.
Given an image, the entropy-based feature detector is first applied to detect the top P parts (including locations and scales) with the largest entropies, and then these parts are fed into the probabilistic model for object recognition. The probabilistic object model is composed of N interesting parts (N < P) and is denoted as follows:

p(X, S, A | θ) = Σ_{h ∈ H} p(X, S, A, h | θ),

where X denotes the part locations, S denotes the scales, and A denotes the appearances. The indexing variable h is a hypothesis that determines the attribute of each detected part (whether or not it belongs to the N interesting parts of the object) and the possible occlusion of each interesting part (if no detected part is assigned to an interesting part, that interesting part is occluded in the image). Note that P regions are detected in the image, while we assume that only N parts are characteristic of the object and the other parts belong to the background.
Figure 21: An example of face detection based on the generative model framework. (Upper left) The averaged location and the location variance of each interesting part of the face. (Upper right) Sample appearances of the six interesting parts and the background part (the bottom row). (Bottom) Examples of faces and the corresponding interesting parts. [36]
A component-based approach has several potential advantages. First, it can exploit characteristic object parts and their geometrical relation. Second, the patterns of some object parts might vary less under pose changes than the pattern of the whole object. Third, a component-based approach might be more robust against partial occlusions than a global approach. The two main problems of a component-based approach are how to choose the set of discriminatory object parts and how to model their geometrical configuration.
Fig 22: In (a), the system overview of the component-based classifier using four components is presented. On the first level, windows of the size of the components (solid line boxes) are shifted over the face image and classified by the component classifiers. On the second level, the maximum outputs of the component classifiers within predefined search regions (dotted line boxes) and the positions of the components are fed into the geometrical configuration classifier. In (b), the fourteen learned components are denoted by the black boxes with the corresponding centers marked by crosses. [37]
Figure 24: The results after (a) the eyes map and (b) the mouth map.
Figure 26: The facial feature pair verification process. In (a) we show a positive pair, and (b)-(c) are two negative pairs.
5. Face Feature Extraction and Face Recognition
Assuming that the face of a person has been located, segmented from the image, and aligned into a face patch, in this section we talk about how to extract useful and compact features from face patches. The reason to combine the feature extraction and face recognition steps is that sometimes the type of classifier corresponds to the specific features adopted. In this section, we separate the feature extraction techniques into four categories: holistic-based methods, feature-based methods, template-based methods, and part-based methods. The first three categories are frequently discussed in the literature, while the fourth category is a newer idea used in recent computer vision and object recognition.
Holistic-based methods
Holistic-based methods are also called appearance-based methods, which means we use the whole information of a face patch and perform some transformation on this patch to get a compact representation for recognition. To distinguish them more clearly from feature-based methods, we can say that feature-based methods directly extract information from some detected fiducial points (such as the eyes, nose, and lips; these fiducial points are usually determined from domain knowledge) and discard the other information, while appearance-based methods perform transformations on the whole patch to obtain the feature vectors, and these transformation bases are usually obtained from statistics.
During the past twenty years, holistic-based methods have attracted the most attention compared with the other categories, so we will focus more on this category. In the following subsections, we talk about the famous eigenface [39] (based on the PCA), the fisherface (based on the LDA), and some other transformation bases such as independent component analysis (ICA), nonlinear dimension reduction techniques, and the over-complete database (based on compressive sensing). More interesting techniques can be found in [42][43].
The idea of the eigenface is rather simple. Given a face data set (say N faces), we first scale each face patch to a constant size (for example, 100x100) and transform each patch into a vector representation (a 100-by-100 matrix becomes a 10000-by-1 vector). Based on these N D-dimensional vectors (D = 10000 in this case), we can apply principal component analysis (PCA) [17][18] to obtain suitable bases (each is a D-dimensional vector) for dimension reduction. Assume we choose M projection bases;
projecting each face vector onto these bases gives a new M-dimensional coefficient vector. This process achieves our goal of dimension reduction.
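A compact sketch of this computation follows, using an SVD of the centered data so the D-by-D covariance matrix is never formed. The random "faces" and the choice N = 10, M = 9 mirror the toy setup of fig. 27; they are illustrative only.

```python
import numpy as np

def compute_eigenfaces(faces, M):
    """faces: N x D matrix, one flattened face patch per row (e.g. D = 100*100 = 10000).
    Returns the mean face and the M leading eigenfaces (principal components)."""
    mean_face = faces.mean(axis=0)
    centered = faces - mean_face
    # SVD of the centered data gives the PCA bases without forming the covariance matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean_face, vt[:M]            # each row of vt is a D-dimensional eigenface

def project(face, mean_face, eigenfaces):
    """Dimension reduction: a D-dimensional face becomes an M-dimensional coefficient vector."""
    return eigenfaces @ (face - mean_face)

# Toy usage with random "faces" (N = 10, D = 10000) and M = 9 bases, as in fig. 27.
rng = np.random.default_rng(0)
faces = rng.random((10, 100 * 100))
mean_face, eig = compute_eigenfaces(faces, M=9)
print(project(faces[0], mean_face, eig).shape)   # (9,)
```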
Figure 27: (a) We generate a database with only 10 faces, each face patch of size 100-by-100. Through the computation of the PCA bases, we get (b) a mean face and (c) 9 eigenfaces (the eigenfaces are ordered from highest eigenvalue to lowest, from left to right and from top to bottom).
The eigenfaces have the advantage of dimension reduction as well as preserving the most energy and the largest variation after projection, but they do not exploit the face labels included in the database. Besides, several studies have shown that illumination differences result in serious appearance variations, which means that the first several eigenfaces may capture the illumination variation of faces rather than the variations in face structure, while some detailed structural differences may have small eigenvalues, so their corresponding eigenfaces are probably dropped when only the M largest eigenvectors are preserved.
Instead of calculating the projection bases from the whole training data without labels (without human identities, which corresponds to unsupervised learning), Belhumeur et al. [40] proposed to use linear discriminant analysis (LDA) [17] to find the bases. The objective of applying the LDA is to perform dimension reduction for discrimination purposes: to find projection bases that minimize the intra-class variation while preserving the inter-class variation.
Figure 28: The reconstruction process based on the eigenface representation. (a) The original face in the database can be reconstructed from its eigenface representation and the set of projection vectors (lossless if we use all PCA projection vectors; otherwise the reconstruction is lossy). (b) The reconstruction process with different numbers of bases used: from left to right, and from top to bottom, we add one projection vector at a time with its corresponding projection value. The bottom-right picture, using 9 projection vectors and the mean vector, is the perfect reconstruction result.
They did not explicitly model the sources of intra-class variation; instead, the projection is chosen in a manner which discounts those regions of the face with large intra-class deviation. Fig. 29 shows the difference between applying the PCA and the LDA to the same labeled training data. The circled data points indicate samples from class 1 and the crossed ones samples from class 2. As you can see, the PCA basis preserves the largest variation after projection, but the projection result is not suitable for recognition. On the other hand, the LDA finds the best projection basis for discrimination purposes; although it does not preserve as much energy as the PCA does, the projection result clearly separates the two classes with just a simple threshold. Fig. 30 also depicts the importance of choosing suitable projection bases.
In the two-class problem, the LDA is also called the Fisher linear discriminant. Given a training set with two classes, where X_i denotes the set of samples of class i and m_i its mean vector, we look for a projection vector w that best separates the two classes after projection, as formalized below.
Figure 29: The comparison between Fisher linear discriminant analysis and principal component analysis. [40]
Figure 30: Using different bases for projection. With suitable bases, the dimension-reduced result can preserve the discriminative nature of the original data. [17]
The between-class and within-class scatter matrices are defined as

S_B = (m_1 - m_2)(m_1 - m_2)^T,    S_W = Σ_{i=1,2} Σ_{x ∈ X_i} (x - m_i)(x - m_i)^T,

and the Fisher criterion is

J(w) = (w^T S_B w) / (w^T S_W w),

where w is the projection vector to be calculated and X_i is the training set of class i. This criterion J(w) is well known in mathematical physics as the generalized Rayleigh quotient, and a vector w that maximizes J(w) must satisfy the generalized eigenvalue problem

S_B w = λ S_W w.

If S_W is nonsingular, the solution is w = S_W^{-1}(m_1 - m_2); if S_W is singular, we can first reduce the dimensionality with the PCA or find the bases from the nullspace of S_W.
To solve the multiclass problem, given a training set with C classes, the intra-class variation (within-class scatter) and inter-class variation (between-class scatter) can be rewritten as

S_W = Σ_{i=1}^{C} Σ_{x ∈ X_i} (x - m_i)(x - m_i)^T,    S_B = Σ_{i=1}^{C} N_i (m_i - m)(m_i - m)^T,

where m is the mean vector of all the samples in the training set and N_i is the number of samples in class i. W is the projection matrix W = [w_1, ..., w_{C-1}], and each w_k should satisfy S_B w_k = λ_k S_W w_k. In order to deal with the singularity problem of S_W, Belhumeur et al. used the PCA method described below, where all the samples in the training data are first projected to a lower-dimensional subspace by the PCA before the LDA is applied.
The within-class scatter matrix S_W has rank at most N - C, where N is the size of the training set, so we need to reduce the dimensionality of the samples down to N - C or less before applying the LDA. In their experimental results, the LDA bases outperform the PCA bases, especially in cases with illumination changes. The LDA can also be applied to other recognition problems. For example, fig. 31 shows the projection basis for the with-glasses / without-glasses case; as you can see, this basis captures the glasses shape around the eyes rather than the facial differences of the people in the training set.
Figure 31: The recognition case of human faces with or without glasses. (a) An example of faces with glasses. (b) The projection basis obtained by the LDA. [40]
Following the projection and basis-finding ideas above, the next three subsections use different criteria to find the bases or decompositions of the training set. Because these criteria involve many mathematical and statistical theorems and background, here we briefly describe the ideas behind them without going into the details of the equations and theorems.
The PCA exploits the second-order statistical property of the training set (the covariance matrix) and yields projection bases that make the projected samples uncorrelated with each other. The second-order property only depends on the pairwise relationships between pixels, while some important information for face recognition may be contained in the higher-order relationships among pixels. Independent component analysis (ICA) [18][31] is a generalization of the PCA which is sensitive to these higher-order statistics. Fig. 32 shows the difference between the PCA bases and the ICA bases.
In the work of Bartlett et al. [44], the ICA bases are derived from the principle of optimal information transfer through sigmoidal neurons. In addition, they proposed two architectures for the dimension-reduction decomposition: one treats the images as random variables and the pixels as outcomes, and the other treats the
pixels as random variables and the images as outcomes. Architecture I, depicted in fig. 33, finds n "source" images of the pixels, each with an appearance as shown in the column U illustrated in fig. 34, and a human face can be decomposed into a weight vector as in fig. 35. This architecture finds a set of statistically independent basis images, each of which captures features of human faces such as the eyes, eyebrows, and mouth.
Architecture II finds basis images with appearances similar to those produced by the PCA, as shown in fig. 36, and has the decomposition shown in fig. 37. This architecture uses the ICA to find a representation in which the coefficients used to code images are statistically independent. Generally speaking, the first architecture finds spatially local basis images for the faces, while the second architecture produces a factorial face code. In their experimental results, both representations were superior to the PCA-based representation for recognizing faces across days and changes in expression.
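The sketch below shows how an off-the-shelf ICA implementation can be applied to a face matrix in the spirit of Architecture I; the mapping onto Bartlett's two architectures is only approximate, and the random data stands in for real face patches.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Toy data: N flattened face patches as rows.
rng = np.random.default_rng(0)
faces = rng.random((50, 32 * 32))

# Architecture-I flavour: treat images as observations and look for statistically
# independent basis images; the rows of components_ play the role of the "source" images.
ica = FastICA(n_components=10, random_state=0, max_iter=500)
codes = ica.fit_transform(faces)       # 50 x 10 coefficient vectors (one per face)
basis_images = ica.components_         # 10 x 1024 independent basis images
print(codes.shape, basis_images.shape)
```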
Figure 33: Two architectures for performing the ICA on images. (a) Architecture I for finding statistically independent basis images; performing source separation on the face images produces IC images in the rows of U. (b) The gray values at pixel location i are plotted for each face image; ICA in Architecture I finds weight vectors in the directions of statistical dependencies among the pixel locations. (c) Architecture II for finding a factorial code; performing source separation on the pixels produces a factorial code in the columns of the output matrix U. (d) Each face image is plotted according to the gray values taken on at each pixel location; the ICA in Architecture II finds weight vectors in the directions of statistical dependencies among the face images. [44]
Figure 35: The independent basis image representation consists of the coefficients b for the linear combination of independent basis images u that comprise each face image x. [44]
Figure 36: Image synthesis model for Architecture II. Each image in the dataset is considered to be a linear combination of underlying basis images in the matrix A. The basis images are each associated with a set of independent "causes," given by a vector of coefficients s. The basis images are estimated from the learned ICA weight matrix. [44]
Figure 37: The factorial code representation consists of the independent coefficients u for the linear combination of basis images in A that comprise each face image x. [44]
Instead of using linear projection to obtain the representation vector of each face image, some researchers claim that nonlinear projection may yield better representations.
Wright et al. [48] proposed to use the sparse signal representation for face recognition. They used an over-complete database as the projection basis and applied an L1-minimization algorithm to find the representation vector for a human face. They claimed that if sparsity in the recognition problem is properly harnessed, the choice of features is no longer critical. What is crucial, however, is whether the number of features is sufficiently large and whether the sparse representation is correctly computed. This framework can handle errors due to occlusion and corruption uniformly by exploiting the fact that these errors are often sparse with respect to the standard (pixel) basis. Fig. 40 shows the overview of their algorithm.
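A minimal sketch of sparse-representation-based classification follows; it uses an off-the-shelf Lasso solver as a stand-in for the L1-minimization in [48] and random data in place of real face features, so it illustrates the classification rule rather than reproducing the original algorithm.

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_identify(y, A, labels):
    """Sparse-representation sketch: L1-regularized regression of the test face y against the
    over-complete dictionary A of training faces (columns), then pick the class whose
    coefficients best reconstruct y."""
    lasso = Lasso(alpha=0.01, max_iter=10000)
    lasso.fit(A, y)
    x = lasso.coef_
    best_class, best_err = None, np.inf
    for c in np.unique(labels):
        xc = np.where(labels == c, x, 0.0)       # keep only class-c coefficients
        err = np.linalg.norm(y - A @ xc)
        if err < best_err:
            best_class, best_err = c, err
    return best_class

# Toy usage: 40 training faces (columns) of 4 people, 256-dimensional features.
rng = np.random.default_rng(0)
A = rng.standard_normal((256, 40))
labels = np.repeat(np.arange(4), 10)
y = A[:, 7] + 0.01 * rng.standard_normal(256)    # a noisy copy of a class-0 face
print(src_identify(y, A, labels))                # expected: 0
```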
Fig. 40: The sparse representation technique represents a test image (left), which is (a) potentially occluded or corrupted. Red (darker) coefficients correspond to training images of the correct individual. [48]
Feature-based methods
We have briefly compared the differences between holistic-based methods and feature-based methods based on the information they use from a given face patch; from another point of view, we can say that holistic-based methods rely more on
statistical learning and analysis, while feature-based methods exploit more ideas from image processing, computer vision, and human domain knowledge. In this section, we discuss two outstanding features for face recognition: the Gabor wavelet feature and the local binary pattern.
The application of Gabor wavelets to face recognition was pioneered by the work of Lades et al. [49]. In their work, the elastic graph matching framework is used for finding feature points, building the face model, and performing distance measurement, while the Gabor wavelets are used to extract local features at these feature points; the set of complex Gabor wavelet coefficients at each point is called a jet. Graph-matching-based methods normally require two stages to build the graph for a face image I and compute its similarity with a model graph. During the first stage, the model graph is shifted within the input image to find the optimal global offset while keeping its shape rigid. Then, in the second stage, each vertex is shifted within a topological constraint to compensate for the local distortions due to rotations in depth or expression variations. It is actually the deformation of the vertices that makes the graph matching procedure elastic. To carry out these two stages, a cost function S(·) must be defined, and the two stages terminate when this function reaches its minimum value.
Lades et al. [49] used a simple rectangular graph to model faces in the database, where each vertex has no direct semantic meaning on the face. In the database building stage, the deformation process mentioned above is not included; the rectangular graph is manually placed on each face and the features are extracted at the individual vertices. When a new face I comes in, the distances between it and all the faces in the database have to be computed, which means that if there are N face models in the database in total, we have to build N graphs for I, one based on each face model. This matching process is computationally very expensive, especially for large databases. Fig. 41 shows an example of a model graph and a deformed graph based on it, and the cost function is defined as

S(G^I, G^M) = Σ_e (Δx_e^I - Δx_e^M)² - λ Σ_n S_J(J_n^I, J_n^M),

where λ determines the relative importance of the jet similarity term and the topography term, Δx_e is the distance vector of the labeled edge e between two vertices, J_n is the set of jets at vertex n, and S_J is the distance measure between two jets, based on the magnitudes of the Gabor wavelet coefficients.
In [50], object-adaptive graphs are employed to model faces in the database, which means the vertices of a graph refer to specific facial landmarks, enhancing the distortion-tolerant ability (see fig. 42). The distance measure function here not only counts on the magnitude information but also takes the phase information of the feature jets into account. The most important improvement is the use of the face bunch graph (FBG), which is composed of several face models to cover a wide range of possible variations in the appearance of faces, such as differently shaped eyes, mouths, or noses. A bunch is the set of jets taken from the same vertex (the same landmark) across different face models, and fig. 43 shows the FBG structure. The cost function is redefined as

S(G^I, B) = (1/N) Σ_n max_m S_φ(J_n^I, J_n^{B_m}) - (λ/E) Σ_e (Δx_e^I - Δx_e^B)² / (Δx_e^B)²,

where B is the FBG representation, N and E are the total numbers of vertices and edges in the FBG, B_m denotes the m-th face model of B, and S_φ is the newly defined distance measure function which takes the phase of the jets into account. To build the database, an FBG is first generated, and models for individual faces are then generated by the elastic graph matching procedure based on the FBG. When a new face comes in, the same elastic graph matching procedure based on the FBG is executed to generate a new face model, and this model can be directly compared with the face models in the database without re-modeling. The FBG serves as the general representation of faces and reduces the computation needed for face modeling.
Besides these two symbolic examples of the elastic graph matching framework, a number of variant versions have been proposed in the literature, and readers can find a brief introduction in [51].
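The sketch below extracts a Gabor "jet" at a fiducial point and compares two jets with a magnitude-based similarity, in the spirit of the methods above. The kernel parameters (sizes, wavelengths, sigma equal to the wavelength) are simplifying assumptions, not the settings of [49] or [50].

```python
import cv2
import numpy as np

def gabor_jet(gray, point, n_orientations=8, n_scales=5):
    """Jet at a fiducial point: magnitudes of a bank of Gabor filter responses
    over several orientations and scales (illustrative parameters)."""
    x, y = point
    jet = []
    for s in range(n_scales):
        lambd = 4.0 * (2 ** (s / 2.0))                 # wavelength grows with scale
        for o in range(n_orientations):
            theta = np.pi * o / n_orientations
            kernel = cv2.getGaborKernel((31, 31), lambd, theta, lambd, 1.0, 0)
            response = cv2.filter2D(gray.astype(np.float32), -1, kernel)
            jet.append(abs(response[y, x]))
    return np.array(jet)

def jet_similarity(j1, j2):
    """Magnitude-based jet similarity (normalized dot product), as used by Lades et al."""
    return float(j1 @ j2 / (np.linalg.norm(j1) * np.linalg.norm(j2) + 1e-8))

gray = np.random.default_rng(0).integers(0, 256, (128, 128)).astype(np.uint8)  # stand-in image
print(jet_similarity(gabor_jet(gray, (40, 50)), gabor_jet(gray, (42, 50))))
```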
Figure 41: The graphical models of face images. The model graph (a) is built to represent a face stored in the database, and features are directly extracted at the vertices of the rectangular graph. When a new face comes in and we want to recognize this person, a deformed graph (b) is generated based on the two-stage process. [49]
Figure 43: The face bunch graph serves as the general representation of faces. As shown in this figure, there are nine vertices in the graph, and each of them contains a bunch of jets to cover the variations in facial appearance. The edges are represented by the averaged distance vectors calculated from all the face models used to build the FBG. [50]
Binary features
Besides applying the Gabor wavelet features with elastic-graph-matching-based methods, Ahonen et al. [52] proposed to extract local binary pattern (LBP) histograms with spatial information as the face feature, and to use a nearest neighbor classifier with the Chi-square metric as the dissimilarity measure.
The LBP operator labels each pixel by thresholding the P sampling points on a circle of radius R around it against the center value and reading the result as a binary number; the notation LBP_{P,R}^{u2} denotes the operator in a (P, R) neighborhood, where the superscript u2 stands for using only uniform patterns and labeling all remaining patterns with a single label. A histogram of the labeled image f_l(x, y) can be defined as

H_i = Σ_{x,y} I{f_l(x, y) = i},   i = 0, ..., n-1,

in which n is the number of different labels produced by the LBP operator and I{A} equals 1 if A is true and 0 otherwise. This histogram contains information about the distribution of local micro-patterns, such as edges, spots, and flat areas, over the whole image. For efficient face representation, one should also retain spatial information. For this purpose, the image is divided into several regions R_0, ..., R_{m-1}, as shown in fig. 45, and the spatially enhanced histogram is defined as

H_{i,j} = Σ_{x,y} I{f_l(x, y) = i} I{(x, y) ∈ R_j}.
In the face recognition phase, the nearest neighbor classifier is adopted to compare the distance between the input face and the database. Several metrics can be applied for the distance calculation, such as the histogram intersection, the log-likelihood statistic, and the Chi-square statistic

χ²(S, M) = Σ_i (S_i - M_i)² / (S_i + M_i).

When the image has been divided into regions, it can be expected that some of the regions contain more useful information than others for distinguishing between people. For example, the eyes seem to be an important cue in human face recognition. To take advantage of this, a weight can be set for each region based on the importance of the information it contains. For example, the weighted Chi-square statistic becomes

χ²_w(S, M) = Σ_{i,j} w_j (S_{i,j} - M_{i,j})² / (S_{i,j} + M_{i,j}),

in which w_j is the weight for region j, and S and M denote the two feature vectors (histograms) to be compared. Fig. 45 also shows the weights applied in their experimental results. Later on, several recent works used this feature for face recognition, such as [55].
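The following sketch builds a spatially enhanced LBP histogram and compares two such histograms with the Chi-square statistic. The grid size and (P, R) values are illustrative, the region weights are omitted, and random images stand in for real face patches.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def spatial_lbp_histogram(gray, grid=(7, 7), P=8, R=2):
    """Uniform LBP codes are histogrammed inside each grid region and the per-region
    histograms are concatenated, as in the spatially enhanced histogram above."""
    codes = local_binary_pattern(gray, P, R, method="uniform")   # labels 0 .. P+1
    n_labels = P + 2
    rows = np.array_split(np.arange(gray.shape[0]), grid[0])
    cols = np.array_split(np.arange(gray.shape[1]), grid[1])
    hists = []
    for r in rows:
        for c in cols:
            region = codes[np.ix_(r, c)]
            h, _ = np.histogram(region, bins=np.arange(n_labels + 1))
            hists.append(h / max(h.sum(), 1))
    return np.concatenate(hists)

def chi_square(s, m, eps=1e-10):
    """Chi-square dissimilarity between two concatenated histograms."""
    return float(np.sum((s - m) ** 2 / (s + m + eps)))

gray = np.random.default_rng(0).integers(0, 256, (70, 70)).astype(np.uint8)
print(chi_square(spatial_lbp_histogram(gray), spatial_lbp_histogram(gray.T)))
```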
Figure 44: The circular (8,2) neighborhood. The pixel values are bilinearly interpolated whenever a sampling point does not fall in the center of a pixel.
Template-based methods
Recognition systems based on the two categories of methods introduced above usually perform feature extraction for all face images stored in the database, and then train classifiers or define some metric to compute the similarity of a test face patch with each person class. To overcome variations of faces, these methods enlarge their databases to accommodate many more samples and expect that the trained transformation bases or the defined distance metric can attenuate the intra-class variation while maintaining the inter-class variation. Traditional template matching is much like using a distance metric for face recognition: a set of representative templates is selected for each class (person), the similarity measurement is computed between a test image and each class, and the class with the highest similarity score is selected as the correct match. Recently, deformable template techniques have been proposed [31]. In contrast to implicitly modeling intra-class variations (for example, by increasing the database), deformable template methods explicitly model possible variations of human faces from the training data and are expected to deal with much more severe variations. In this section, we introduce the face recognition technique based on the ASM and the AAM described in Section 4.3.1.
From the introduction in Section 4.3.1, we know that during the face detection process the AAM generates a parameter vector c which can synthesize the face appearance that best fits the face shown in the image. Then, if we have a well-chosen database which contains several significant views, poses, and expressions of each person, we can obtain a set of AAM parameter vectors to represent each identity. To compare the input face with the database, Edwards et al. [56] proposed to use the Mahalanobis distance measure for each class and derive a class-specific discriminant for identification.
Part-based methods
Following the ideas presented in Section 4.5, in recent years several studies have exploited information from characteristic facial parts, or from parts that are robust against pose or illumination variation, for face recognition. To distinguish them from the feature-based category: part-based methods detect significant parts of the face image and combine the part appearances with machine learning tools for recognition, while feature-based methods extract features from facial feature points or from the whole face and compare these features to achieve recognition. In this subsection, we introduce two techniques: one is an extension of the system described in Section 4.5.2, and the other is based on SIFT (scale-invariant feature transform) features extracted from the face image.
Based on the face detection algorithm described in Section 4.5.2, Heisele et al. [57] compared the performance of component-based face recognition against global approaches. In their work, they built three different face recognition structures based on the SVM classifier: a component-based algorithm based on the output of the component-based face detection algorithm, a global algorithm directly fed with the detected face appearance, and finally a global approach which takes the view variation into account.
Given the detected face patches, the two global approaches differ only in whether the view variation of the detected face is considered. The algorithm without this consideration directly builds one SVM classifier per person based on all possible views, while the one with this consideration first divides the training images of a person into several view-specific clusters and then trains one SVM classifier for each of them. The SVM classifier was originally developed for binary classification; to extend it to multi-class tasks, the one-versus-all and pairwise approaches are described in [58]. The view-specific clustering procedure is depicted in fig. 46.
The component-based SVM classifier is cascaded behind the component-based face detection algorithm. After a face is detected in the image, they choose 10 of the 14 detected parts, normalize them in size, and combine their gray values into a single feature vector. Then a one-versus-all multi-class structure with a linear SVM for each person is trained for face recognition. In fig. 47, we show the face detection procedure, and in fig. 48, we present the detected face and the corresponding 10 components fed into the face recognition algorithm.
In the experimental results, the component system outperforms the global systems (for recognition rates larger than 60%) because the information fed into the classifiers captures more specific facial features. In addition, the clustering leads to a significant improvement of the global method, because clustering generates view-specific clusters that have smaller intra-class variations than the whole set of images of a person. Based on these results, they claimed that a combination of weak classifiers trained on properly chosen subsets of the data can outperform a single, more powerful classifier trained on the whole data.
The scale-invariant feature transform (SIFT) proposed by Lowe et al. [34] has been widely and successfully applied to object detection and recognition. In the work of Luo et al. [59], they proposed to use person-specific SIFT features and a simple non-statistical matching strategy, combining local and global similarity on keypoint clusters, to solve the face recognition problem.
The SIFT is composed of two functions: the interest-point detector and the region descriptor. Lowe et al. used the difference-of-Gaussian (DoG) algorithm to detect interest points in a scale-invariant fashion, and generated the descriptor from the orientation and gradient information calculated in a scale-specific region around each detected point. Fig. 49 shows the SIFT features extracted on sample faces and some corresponding matching points in two face images. In each face image, the number and positions of the features selected by the SIFT point detector are different, so these features are person-specific. In order to compare only the feature pairs with similar physical meaning between the faces in the database and the input face, the same number of sub-regions is constructed in each face image; the similarity between each pair of sub-regions is computed based on the features inside them, and finally the average similarity value is obtained. They proposed to use a K-means clustering scheme to construct the sub-regions automatically based on the locations of the features
in the training samples.
After constructing the sub-regions on the face images, when testing a new image, all the SIFT features extracted from the image are assigned to the corresponding sub-regions based on their locations. The construction of five sub-regions is illustrated in fig. 50, and it can be seen that the centers of the regions (denoted by crosses) correspond to the two eyes, the nose, and the two mouth corners, which agrees with face recognition experience, as these areas are the most discriminative parts of face images. Based on the constructed sub-regions, a local-and-global combined matching strategy is used for face recognition. The details of this matching scheme can be found in [59].
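The sketch below shows the two ingredients just described: extracting SIFT keypoints from a face image and assigning them to sub-regions. The five region centres are fixed illustrative positions; in the cited work they would come from K-means clustering of training keypoint locations, and the random image is only a stand-in.

```python
import cv2
import numpy as np

sift = cv2.SIFT_create()
face = np.random.default_rng(0).integers(0, 256, (128, 128)).astype(np.uint8)  # stand-in face image

# Person-specific features: keypoints and descriptors differ from face to face.
keypoints, descriptors = sift.detectAndCompute(face, None)
locations = np.array([kp.pt for kp in keypoints]) if keypoints else np.empty((0, 2))

# Assign each keypoint to the nearest of five sub-region centres (illustrative positions
# roughly at the eyes, nose, and mouth corners).
centres = np.array([[40, 45], [88, 45], [64, 75], [45, 100], [83, 100]], dtype=float)
if len(locations):
    assignment = np.argmin(np.linalg.norm(locations[:, None, :] - centres[None], axis=2), axis=1)
    print(np.bincount(assignment, minlength=len(centres)))   # keypoints per sub-region
```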
Figure 49: SIFT features on sample images and feature matches between faces with expression variation. [59]
6. Comparison and Conclusion
In this section, we give summaries of the face detection and face recognition techniques of the past twenty years, as well as of the popular face data sets used for experiments and their characteristics.
Table 7: Summary of face detection techniques
Method — Category — Main idea
Face detection based on random labeled graph matching — Feature-based — Combining simple features with statistical learning
Component-based with SVM [37] — Part-based — Learning global and local SVMs for detection
Table 9: The summary of popular databases used for detection and recognition tasks. The (*) marks the most frequently used databases. Image variations are indicated by (i) illumination, (p) pose, (e) expression, (o) indoor/outdoor conditions, and (t) time delay.
References
[1] R. Chellappa, C. L. Wilson, and S. Sirohey, "Human and machine recognition of faces: a survey," Proc. IEEE, vol. 83, no. 5, pp. 705-740, 1995.
[2] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, "Face recognition: a literature survey," Technical Report CAR-TR-948, Center for Automation Research, University of Maryland, 2002.
[3] A. F. Abate, M. Nappi, D. Riccio, and G. Sabatino, "2D and 3D face recognition: a survey," Pattern Recognition Letters, vol. 28, no. 14, pp. 1885-1906, 2007.
[4] M. Grgic and K. Delac, "Face recognition homepage." [Online]. Available: http://www.face-rec.org/general-info. [Accessed May 27, 2010].
[5] A. K. Jain, R. P. W. Duin, and J. C. Mao, "Statistical pattern recognition: a review," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4-37, 2000.
[6] [Online]. Available: http://www.michaelbach.de/ot/fcs_thompson-thatcher/index.html. [Accessed May 27, 2010].
[7] M. H. Yang, D. J. Kriegman, and N. Ahuja, "Detecting faces in images: a survey," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 1, pp. 34-58, 2002.