
VIETNAM GENERAL CONFEDERATION OF LABOR

TON DUC THANG UNIVERSITY


FACULTY OF INFORMATION TECHNOLOGY

TRẦN ĐÌNH QUANG VINH – 521H0333


LÊ TRÍ –

FINAL ESSAY

INTRODUCTION TO COMPUTER
VISION

HO CHI MINH CITY, 2025



ACKNOWLEDGEMENT

We would like to express our sincere gratitude to Mr. Pham Anh Huy, our instructor and mentor, for his valuable guidance and support throughout this final essay for the Introduction to Computer Vision course. He has been very helpful and patient in providing us with constructive feedback and suggestions to improve our work, and he has encouraged us to explore new technologies and techniques to enhance our system's functionality and performance. We have learned a great deal from his expertise and experience, and we are honored and privileged to have him as our teacher and supervisor.

Ho Chi Minh City, 5 January 2025.


Author
(Signature and full name)
Vinh
Trần Đình Quang Vinh

DECLARATION OF AUTHORSHIP
We hereby declare that this is our own project, carried out under the guidance of Mr. Pham Van Huy. The research content and results presented herein are our own and have not been published in any form before. The data in the tables used for analysis, comments, and evaluation were collected by the authors from different sources, which are clearly stated in the reference section.

In addition, the project also uses some comments, assessments, and data from other authors and organizations, with citations and annotated sources.

If any violation is found, we will take full responsibility for the content of our project. Ton Duc Thang University is not involved in any copyright infringement caused by us during the implementation process (if any).

Ho Chi Minh city, 5 January 2025


Author
(Signature and full name)
Vinh
Trần Đình Quang Vinh
TABLE OF CONTENTS
CHAPTER 1. INTRODUCTION
1.1 Computer Vision
1.2 OpenCV — Evolution in Computer Vision
1.3 Deep Learning
1.4 Face Recognition
CHAPTER 2. THEORETICAL BASIS
2.1 Dataset
2.2 Test Image
2.3 Output
2.4 build_dataset
2.5 encode_face
2.6 recognize_faces_image
2.7 recognize_faces_video
2.8 encodings.pickle
2.9 How to Do It
CHAPTER 3. CODE EXPLANATION
3.1 build_dataset
3.2 encode_face
3.3 recognize_faces_image
3.4 recognize_faces_video
CHAPTER 4. RESULT
CHAPTER 5. CONCLUSION
REFERENCES
LIST OF ABBREVIATIONS
ML – Machine Learning
HOG – Histogram of Oriented Gradients
R-CNN – Regions with Convolutional Neural Networks
CNNs – Convolutional Neural Networks

CHAPTER 1. INTRODUCTION
1.1 Computer Vision

In computer science and artificial intelligence, computer vision is a crucial field that focuses on giving machines the ability to "see" and comprehend image or video data in a way similar to humans. Although it has not yet attained the visual capabilities of the human eye, this technology has already produced many useful applications in daily life.

Seymour Papert and Marvin Minsky's "Summer Vision Project" launched the field in the 1960s, and it made tremendous progress when Yann LeCun introduced Convolutional Neural Networks (CNNs) in the 1980s. Although CNNs transformed image analysis, they initially faced difficulties because of their large data and processing resource requirements.

These days, computer vision is the basis for many applications, such as augmented reality (AR), medical image analysis, driverless cars, and facial recognition. Thanks to developments in big data, AI, and computing power, the field is still thriving.

1.2 OpenCV — Evolution in Computer Vision

OpenCV contains implementations of over 2500 algorithms! It is freely available for commercial as well as academic use, and the library has interfaces for many languages, including Python, Java, and C++.
Here are some applications of OpenCV:
- Resizing Images
- Image Rotation/Flipping
- Blending Images
- Creating a Region of Interest (ROI)
- Image Thresholding
- Blurring and Smoothing
+ Average Blurring
+ Gaussian Blurring
+ Median Blurring
- Edge Detection
+ Depth Discontinuities
+ Orientation Discontinuities
- Image Contours
- Face Detection
OpenCV is very effective at detecting faces using a Haar cascade based object detection algorithm. Haar cascades are trained machine learning classifiers that compute features such as lines, contours, and edges. The trained models that detect faces, eyes, and other objects are open-sourced in the OpenCV repositories on GitHub, and you can also train your own Haar cascade for any object. A minimal face detection sketch is shown below.
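For illustration, a minimal Haar cascade face detection sketch with OpenCV might look like the following; the image path and the detectMultiScale parameters are assumptions, not values taken from this project.

import cv2

# load the pre-trained frontal face Haar cascade shipped with OpenCV
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

# read an image (hypothetical path) and convert it to grayscale for detection
image = cv2.imread("person.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# detect faces; scaleFactor and minNeighbors are typical starting values
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# draw a rectangle around each detected face and save the result
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("person_detected.jpg", image)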

1.3 Deep Learning

Computer vision is one of the most significant fields where deep learning is
being used to enable machines to perceive and comprehend visual stimuli. Deep
learning has opened up new possibilities in computer vision, accelerating
technological developments and changing sectors, from identifying objects in
photos to allowing safe navigation by autonomous vehicles.

1.3.1 Neural Networks

Deep learning relies heavily on neural networks, which are designed to replicate how the human brain interprets information. Layers of interconnected nodes, or "neurons," make up a neural network, and each layer performs basic calculations on the input data. Usually, these layers fall into one of three categories:
Input Layer: the neural network's entrance, where raw data is fed into the model.
Hidden Layers: intermediate layers that perform complex transformations on the data and extract increasingly abstract features.
Output Layer: the last layer, which generates the network's prediction or classification.
The technique used to train neural networks is called backpropagation; it adjusts the connection weights according to the discrepancy between the expected and actual outputs. This iterative procedure is repeated until the model performs as expected.

1.3.2 Convolutional Neural Networks (CNNs)

Convolutional Neural Networks (CNNs) are a type of neural network that are
designed specifically for processing structured grid data, such as images. They are
highly effective in capturing spatial hierarchies and patterns in visual data. CNNs
consist of several key components:
Convolutional Layers: These layers apply convolution operations to the input image, using filters (or kernels) to detect local patterns such as edges, textures, and shapes. Each filter produces a feature map that highlights particular aspects of the image.
Pooling Layers: By reducing the spatial dimensions of the feature maps, pooling layers preserve important information while simplifying computations. Both average and max pooling are frequently employed.
Fully Connected Layers: After a number of convolutional and pooling layers, the network usually ends with fully connected layers that interpret the extracted features and generate the final predictions.
CNNs' exceptional accuracy in segmentation, object detection, and image classification has transformed computer vision tasks. Their capacity to learn hierarchical representations makes them very effective in visual recognition. A minimal example of this layer structure is sketched below.
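To make these layer types concrete, here is a minimal illustrative CNN written with Keras; the input size and the number of classes are arbitrary assumptions and do not come from this essay.

from tensorflow.keras import layers, models

# a tiny CNN: convolution + pooling blocks followed by fully connected layers
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),  # convolutional layer
    layers.MaxPooling2D((2, 2)),                                            # pooling layer
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),                                    # fully connected layer
    layers.Dense(10, activation="softmax"),                                 # output layer (10 hypothetical classes)
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()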

1.3.3 Transfer Learning

Transfer learning is a method that reuses networks trained on one task for new, related tasks to improve the effectiveness and performance of deep learning models. Through transfer learning, models can make use of the knowledge gathered from prior training rather than starting from scratch, which would require a significant quantity of data and computing power.
Pre-trained Models: These models have already learned to extract valuable features from images after being trained on huge benchmark datasets like ImageNet. Some well-known pre-trained models are Inception, ResNet, and VGG.
Fine-tuning: In transfer learning, the weights of the previously trained model are adjusted to make it more suitable for the current task. This entails keeping the features learned from the original dataset while training the model on a smaller, task-specific dataset.
Feature Extraction: An alternative use case is to treat the pre-trained model as a fixed feature extractor. In this method, only the fully connected layers are retrained for the new task, while the convolutional layers of the pre-trained model extract features from the input images.
Transfer learning significantly reduces the time and data required to achieve high performance on new computer vision tasks. It is especially valuable in scenarios with limited labeled data and helps in rapidly deploying models in practical applications.
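As a hedged sketch of the feature-extraction variant described above, a VGG16 backbone pre-trained on ImageNet can be frozen and given a new head in Keras; the input size and class count below are assumptions for illustration only.

from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# load VGG16 pre-trained on ImageNet, without its original classification head
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # feature extraction: freeze the convolutional layers

# attach a new, task-specific head (5 classes is an arbitrary assumption)
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])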

1.4 Face Recognition

The Face Recognition library is an effective tool for identifying faces in images or videos in computer vision. The principles and key elements of this library are explained below:

1.4.1 Overview of Face Recognition

Face Detection: Locate faces in images or videos.


Face Recognition: Compare and determine the identity of faces based on
existing databases.
Face Verification: Check whether two images are of the same person.
The face_recognition library (for Python) is built on deep learning models, using the Convolutional Neural Network (CNN) architecture, and is based on the well-known dlib library.

1.4.2 Main components of the library

a. Face Detection
To locate faces, the library uses either a CNN model or the HOG (Histogram of Oriented Gradients) model. In particular:
- HOG-based detection: identifies corners and edges using image features; faster but less accurate.
- CNN-based detection: more computationally intensive, but more accurate.
b. Features of the Face
The library recognizes 68 facial landmarks, including features of the mouth, nose, eyes, and face outline. Prior to recognition, the face is aligned using these points.
c. Encoding Faces
Facial encoding is the technique of turning a face into a 128-dimensional numerical vector (feature vector). Each face's distinctive characteristics are represented by this vector.
d. Facial Recognition
The encoding vector is compared against the known encodings; Euclidean distance or cosine similarity is typically used to measure how close two vectors are. A minimal sketch of this comparison is shown below.
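A minimal sketch of this comparison, using random vectors and a commonly used threshold purely as assumptions:

import numpy as np

# two hypothetical 128-dimensional face encodings
known_encoding = np.random.rand(128)
candidate_encoding = np.random.rand(128)

# Euclidean distance between the two encodings
distance = np.linalg.norm(known_encoding - candidate_encoding)

# encodings closer than a tolerance (0.6 is the face_recognition default) count as the same person
tolerance = 0.6
is_match = distance <= tolerance
print(distance, is_match)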

1.4.3 Working process

- Data input: Image or video.


- Face detection: Determine the location of the face in the image.
- Feature tagging: Get the landmark points on the face.
- Normalization and encoding: Normalize the face and create the encoding
vector.
- Matching and recognition: Compare the encoding vector with the database. (A compact end-to-end sketch of this pipeline is shown below.)
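A compact end-to-end sketch of this pipeline with the face_recognition library; the file names are hypothetical and the tolerance is the library default.

import face_recognition

# load a reference image and a query image (hypothetical file names)
known_image = face_recognition.load_image_file("known_person.jpg")
query_image = face_recognition.load_image_file("query.jpg")

# detect faces and compute their 128-dimensional encodings
known_encodings = face_recognition.face_encodings(known_image)
query_locations = face_recognition.face_locations(query_image, model="hog")
query_encodings = face_recognition.face_encodings(query_image, query_locations)

# compare each detected face against the reference encodings
for encoding in query_encodings:
    matches = face_recognition.compare_faces(known_encodings, encoding, tolerance=0.6)
    print("Match:", any(matches))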

1.4.4 Application
- Security system (identity authentication).
- Video analysis (behavior tracking).
- Virtual reality or augmented reality (AR/VR) applications.
- Personalization of services (customer recognition).

1.4.5 Advantages and limitations

Advantages:
- Easy to use with a simple API.
- Highly effective in face recognition and matching.
- Can handle both still images and videos.

Limitations:
- Performance depends on image quality and lighting.
- Accuracy may be affected if the face is partially obscured or rotated too much.

CHAPTER 2. THEORETICAL BASIS


2.1 Dataset
The dataset contains images of individuals, with each person’s images organized into separate folders. To ensure proper management, especially in cases where two or more people share the same name, it is essential to assign a unique ID to each person. This unique ID acts as a key for identifying individuals unambiguously.
For optimal training and higher accuracy, it is recommended to include as many images of each person as possible. These images should capture a variety of facial angles, lighting conditions, and facial expressions. Such diversity improves the model’s ability to generalize and enhances its robustness when dealing with real-world scenarios. Moreover, ensuring consistent labeling and organizing metadata for each ID can further streamline the data preparation process.

2.2 Test Image

The test_images folder contains images specifically designated for testing purposes. These images are distinct and do not overlap with those in the training dataset.
By ensuring no duplication between the training dataset and the test set, the
evaluation process becomes more reliable, providing an accurate measure of how
well the model generalizes to unseen data. The test images should ideally include a
wide range of variations, such as different lighting conditions, facial expressions,
and angles, to effectively assess the model's robustness and real-world performance.
Additionally, maintaining a well-organized structure for the test set and
providing metadata (such as labels, unique IDs, or conditions under which the
images were captured) can further enhance the interpretability of test results and aid
in fine-tuning the model based on specific shortcomings identified during
evaluation.

2.3 Output

Save the video after it has been processed through face recognition.
This involves capturing and processing video frames to detect and recognize
faces in real-time or during post-processing. The resulting video will include the
overlays, annotations, or markers indicating the detected faces, along with
additional metadata such as timestamps, detected IDs, or confidence scores if
applicable.
Storing the output video allows for further analysis, validation, or reporting.
It is particularly useful in applications like surveillance, access control, or research
where recorded evidence of face recognition results is essential. The output format
should ideally retain high quality while incorporating all the processed information
effectively.

2.4 build_dataset

The build_dataset.py script is designed to create and structure a dataset efficiently. It automates the process of organizing images, making it easier to manage and prepare data for training machine learning models. It allows for the creation of a well-structured dataset where images are categorized and stored systematically, often in separate folders based on predefined labels, such as individual names or unique IDs.
Additionally, the script can be extended to include functionality for assigning
unique IDs to each category (e.g., individuals), ensuring clear identification and
avoiding conflicts in cases where names or labels might overlap. This feature is
particularly crucial when dealing with datasets involving multiple individuals who
may have similar or identical names.
To enhance the quality of the dataset, the script can also be configured to
include preprocessing steps, such as resizing images, normalizing color values, or
augmenting data by generating variations of the original images. This ensures the
dataset is diverse and ready for effective training, accounting for different angles,
lighting conditions, and facial expressions. Such a robust dataset can significantly
improve the accuracy and reliability of any machine learning model it is used to
train.
2.5 encode_face

The encode_faces.py script is designed to generate facial encodings, specifically 128-dimensional vectors, for the faces in a dataset. These encodings are unique numerical representations of each face, capturing the essential features that differentiate one individual from another.
The script typically works by processing a collection of face images,
detecting the facial region, and then extracting the 128-dimensional embeddings
using a pre-trained deep learning model (such as a FaceNet or dlib-based model).
These embeddings serve as compact and efficient descriptors, enabling tasks like
face recognition, clustering, and verification.
By encoding faces into numerical vectors, the script transforms raw image
data into a format that is easier for machine learning models to process. This step is
crucial for building reliable and scalable facial recognition systems. It is also
common to save these encodings alongside metadata (e.g., name or ID) to facilitate
matching and retrieval during deployment.

2.6 recognize_faces_image

The recognize_faces_image.py script performs facial recognition on images based on the encodings stored in the dataset. A key feature of this script is its ability to check for matches between the input face and the dataset using the following line:
matches = face_recognition.compare_faces(data["encodings"], encoding, 0.4)
The last parameter of compare_faces (set here to 0.4; the library default is 0.6) is the tolerance level, which determines how closely the input face must match the stored encodings to be considered a match. If the recognition results are inaccurate, such as identifying faces not in the dataset as existing people, you can adjust this tolerance parameter. Lowering the value makes the recognition stricter, reducing false positives, while increasing it makes the model more lenient, potentially capturing more matches but risking false positives.
For best results, experiment with this parameter based on the quality of your
dataset and the specific use case. A well-prepared dataset with diverse, high-quality
images will generally lead to more accurate recognition even with moderate
tolerance settings.
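As an illustrative sketch (not part of the project's scripts), face_recognition.face_distance can be used to inspect the raw distances behind compare_faces while tuning the tolerance; the image paths below are hypothetical.

import face_recognition

# hypothetical images: one known person and one query photo
known_image = face_recognition.load_image_file("dataset/person_a/00001.png")
query_image = face_recognition.load_image_file("test_images/1.png")

# assumes exactly one face is found in each image
known_encoding = face_recognition.face_encodings(known_image)[0]
query_encoding = face_recognition.face_encodings(query_image)[0]

# raw Euclidean distance between the two 128-dimensional encodings
distance = face_recognition.face_distance([known_encoding], query_encoding)[0]

# the same comparison at two different tolerance values
strict = face_recognition.compare_faces([known_encoding], query_encoding, tolerance=0.4)
lenient = face_recognition.compare_faces([known_encoding], query_encoding, tolerance=0.6)
print(distance, strict, lenient)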

2.7 recognize_faces_video

recognize_faces_video.py is a script designed to recognize faces from a webcam video stream. With minor modifications, this script can also be adapted to process and recognize faces from pre-recorded video files.
By extending its functionality, you can allow the script to handle a variety of
video input sources, such as files in formats like .mp4, .avi, or others. This makes
the script more versatile, enabling it to be used not only for real-time recognition
but also for offline analysis of recorded footage. Such a feature is particularly useful
in scenarios like analyzing surveillance video, processing training datasets, or
performing retrospective facial recognition on archived media.
These enhancements can make the script a valuable tool for both real-time
and batch face recognition tasks.

2.8 encodings.pickle
The file encodings.pickle is used to store the facial encodings generated by the encode_faces.py script. These encodings represent the unique features of each face in a format that can be processed by machine learning models.
Once the encodings are generated, they are saved to disk in the encodings.pickle file, ensuring they can be efficiently loaded later without the need to regenerate them from scratch. This approach significantly reduces computational overhead during subsequent operations, such as face recognition or verification.
The encodings.pickle file serves as a crucial component in the pipeline, acting as a bridge between data preprocessing and the application of face recognition models. It ensures that all encoded features are consistently formatted and easily accessible for tasks like matching, classification, or clustering.
For optimal results, it's important to generate encodings from a diverse
dataset, ensuring the inclusion of images captured under varying conditions, such as
different lighting, angles, and facial expressions. This enhances the reliability of the
encodings stored in the file, contributing to more accurate and robust face
recognition systems.
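A minimal sketch of how such a file can be written and read back with Python's pickle module; the dictionary layout mirrors the one used by the scripts later in this report, and the sample data here is an assumption.

import pickle
import numpy as np

# hypothetical data: two 128-dimensional encodings and the matching names
data = {
    "encodings": [np.random.rand(128), np.random.rand(128)],
    "names": ["person_a", "person_b"],
}

# serialize the encodings to disk
with open("encodings.pickle", "wb") as f:
    pickle.dump(data, f)

# later, load them back without recomputing anything
with open("encodings.pickle", "rb") as f:
    loaded = pickle.load(f)
print(loaded["names"])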

2.9 How to do it

Step 1: Creating the Datasets

To create the dataset, we use the build_dataset.py script. In the datasets directory, there are subdirectories for each individual, named using their name (plus an ID if necessary). Inside each subdirectory, you will store images of that person’s face.

Note: Each image should ideally contain only one face of the designated
person. If multiple faces appear in a single image, the implementation becomes
more complex as you would need to identify which face belongs to the target
person.

For this example, the dataset is created using a webcam. Position your face at different distances from the webcam, with various angles, expressions, and lighting conditions. Run the build_dataset.py script and press the k key to save images for each individual. To achieve high model accuracy, ensure each person has at least 10–20 images.

This script does not include any face detection methods, such as Haar
cascades, or bounding boxes to help the user align their face. The goal is to capture
images in diverse real-world conditions to train a more robust model. Including
more images under different scenarios will improve the system’s reliability in
practical applications.
Besides creating a dataset via webcam, you can also build it manually or use
a Search API like Bing or Google. Once the dataset is created with build_dataset.py,
you’ll run encode_faces.py to generate the embeddings.

Step 2: Encoding Faces in the Dataset

After creating the dataset, the next step is to generate encodings (or embeddings) for the faces. The first task is to extract the Face Regions of Interest (ROIs). Avoid using the entire image, because background noise can negatively affect model quality. To detect and extract faces, you can use methods like Haar cascades, HOG + Linear SVM, or a deep learning-based face detector.

Once the Face ROIs are extracted, they are passed through a neural network
to obtain the encodings.

Generating Face Encodings

Instead of training an encoding model from scratch, we use a pre-trained model (available in the dlib library and integrated with the face_recognition library for ease of use) to create the face embeddings.

In this step, the encode_faces.py script is used to save the encodings and
names (or IDs, if necessary). You can refer to the script for detailed explanations, as
it contains clear comments for each part. The encodings and names are saved in the
encodings.pickle file for later use.

Step 3: Face Recognition in Images

Recognizing Faces in Images

With the encodings generated from the datasets (via the pre-trained model
using dlib and face_recognition), we can now perform face recognition.

Run the recognize_faces_image.py script to recognize faces in images. If you want to recognize faces in videos, use the recognize_faces_video.py script instead.
When running face recognition on a CPU or embedded devices like the
Raspberry Pi, it’s recommended to set the detection method to hog in the
recognize_faces_image.py script. For the initial encoding step, however, you can
use the CNN method for greater accuracy (although it takes longer to process).
CHAPTER 3. CODE EXPLANATION
3.1 build_dataset

This sets up a command-line argument parser. The -o or --output argument specifies the path where captured images will be saved; the path is passed by the user when running the script. args = vars(ap.parse_args()) converts the parsed arguments into a dictionary (args), which stores the directory path for saving images under the key "output".

ap = argparse.ArgumentParser()
ap.add_argument("-o", "--output", required=True, help=r"C:\Users\DELL\OneDrive\Desktop\ComputerVision\data_new\Face-Recognition-with-OpenCV-Python-DL-master\dataset")
args = vars(ap.parse_args())

The loop continuously captures frames from the webcam (video.read()) and stores the image in frame.
cv2.imshow("video", frame) displays the captured frame in a window named "video".
cv2.waitKey(1) & 0xFF waits briefly for a key press and returns the key code masked to 8 bits (255 when no key is pressed).

while True:
    ret, frame = video.read()
    cv2.imshow("video", frame)
    key = cv2.waitKey(1) & 0xFF

If the key k is pressed (key == ord("k")), the program saves the current frame as an image:
The file path p is constructed by joining the output directory path (args["output"]) and the filename (total), formatted to always have five digits (e.g., 00001.png, 00002.png, etc.).
cv2.imwrite(p, frame) writes the captured frame to the specified path.
The total counter is incremented by 1 to prepare for the next image.

    if key == ord("k"):
        # pad the counter with leading zeros so the filename always has five digits
        p = os.path.sep.join([args["output"], "{}.png".format(str(total).zfill(5))])
        cv2.imwrite(p, frame)
        total += 1
    # press q to quit
    elif key == ord("q"):
        break
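The excerpt above omits the setup before the loop and the cleanup after it; a plausible completion, assumed from standard OpenCV usage rather than quoted from the script, would be:

import argparse
import os
import cv2

# assumed setup: open the default webcam and start the image counter used above
video = cv2.VideoCapture(0)
total = 0

# ... the capture loop shown above runs here ...

# assumed cleanup: release the camera and close the preview window
video.release()
cv2.destroyAllWindows()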

3.2 encode_face

-i (dataset): The path to the directory containing the face images. It must be provided by the user.
-e (encodings): The file path where the facial encodings and names will be saved (in pickle format).
-d (detection_method): Specifies the method used for face detection. Options are "cnn" (more accurate but slower) or "hog" (faster but less accurate).
The args variable stores the parsed command-line arguments in a dictionary.

ap = argparse.ArgumentParser()
ap.add_argument("-i", "--dataset", required=True, help=r"C:\Users\DELL\OneDrive\Desktop\ComputerVision\data_new\Face-Recognition-with-OpenCV-Python-DL-master\dataset")
# the encodings and names are saved to this file
ap.add_argument("-e", "--encodings", required=True, help=r"C:\Users\DELL\OneDrive\Desktop\ComputerVision\data_new\Face-Recognition-with-OpenCV-Python-DL-master\dataset")
# faces must be detected before they can be encoded (this step is always required in face recognition) - choose the detection method
ap.add_argument("-d", "--detection_method", type=str, default="cnn", help="face detector to use: cnn")
args = vars(ap.parse_args())

These lists will store the facial encodings and the names of the individuals associated with those encodings.

# initialize the lists of known encodings and known names (test images will be compared against them)
# they hold the encodings and names of the images in the dataset
knownEncodings = []
knownNames = []

The loop goes through each image path in imagePaths.
The name variable is extracted from the directory structure, assuming that the image is stored in a subfolder named after the person’s name.

for (i, imagePath) in enumerate(imagePaths):
    # extract the person's name from the image path
    print("[INFO] processing image {}/{}".format(i + 1, len(imagePaths)))
    name = imagePath.split(os.path.sep)[-2]

The face_recognition.face_locations function detects faces in the image and returns a list of bounding boxes (coordinates of detected faces). The model argument specifies which face detection method to use (either "cnn" or "hog").
face_recognition.face_encodings generates facial encodings (embeddings) for each face detected in the image. The boxes (bounding box coordinates) are passed to the function to specify where the faces are located.

    # for each image: detect the face, extract the face ROI, and convert it to an encoding
    # returns an array of bounding boxes of faces, using dlib as in the face detection exercise
    # model="cnn" is more accurate but slower, "hog" is faster but less accurate
    boxes = face_recognition.face_locations(rgb, model=args["detection_method"])
    # compute the facial embedding for the face
    # this computes an encoding for every face detected in the image (there may be several)
    # ideally each image should contain only one face, belonging to the labeled person
    encodings = face_recognition.face_encodings(rgb, boxes)

The code loops over all the face encodings detected in the image (there might be multiple faces). The encoding and corresponding name are appended to the knownEncodings and knownNames lists.

    # loop over the encodings
    # an image may contain several faces, but there is only one name here,
    # so make sure each image in the original dataset contains only one face
    # (ideally each image has exactly one face and therefore one encoding)
    for encoding in encodings:
        # save the encoding and name into the lists above
        knownEncodings.append(encoding)
        knownNames.append(name)
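The excerpt stops before anything is written to disk; a plausible final step, based on common usage rather than quoted from the script, is to serialize both lists into the encodings file:

# assumed final step: bundle the encodings with their names and write them to the encodings file
import pickle

data = {"encodings": knownEncodings, "names": knownNames}
with open(args["encodings"], "wb") as f:
    f.write(pickle.dumps(data))
print("[INFO] encodings saved to {}".format(args["encodings"]))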

3.3 recognize_faces_image

-e (encodings): Path to the file containing pre-saved facial encodings (created in the previous step).
-i (image): Path to the test image in which faces will be recognized.
-d (detection_method): Specifies the face detection model (cnn or hog). The cnn method is more accurate but slower, while hog is faster but less accurate.
args stores the parsed arguments from the command line.

ap = argparse.ArgumentParser()
# path to the saved encodings file
ap.add_argument("-e", "--encodings", required=True, help=r"C:\Users\DELL\OneDrive\Desktop\ComputerVision\data_new\Face-Recognition-with-OpenCV-Python-DL-master\encodings.pickle")
ap.add_argument("-i", "--image", required=True, help=r"C:\Users\DELL\OneDrive\Desktop\ComputerVision\data_new\Face-Recognition-with-OpenCV-Python-DL-master\test_images\1.png")
# on a CPU or embedded device use hog; when creating the encodings, cnn is still used for accuracy
ap.add_argument("-d", "--detection_method", type=str, default="cnn", help="face detection model to use: cnn or hog")
args = vars(ap.parse_args())

pickle.load(f) loads the facial encodings and associated names from the specified file (encodings.pickle). The data object contains the encodings and names in a dictionary format.

# load the known faces and encodings
print("[INFO] loading encodings...")
with open(args["encodings"], "rb") as f:
    data = pickle.load(f)

face_recognition.face_locations detects face locations (bounding boxes) in the image.
face_recognition.face_encodings generates facial embeddings (encodings) for the detected faces.

# do the same for the test image: detect faces, extract the face ROIs, and convert them to encodings
# finally, a kNN-style comparison is used to recognize each face
print("[INFO] recognizing faces...")
boxes = face_recognition.face_locations(rgb, model=args["detection_method"])
encodings = face_recognition.face_encodings(rgb, boxes)

face_recognition.compare_faces compares the detected face encodings with the stored encodings (from the data["encodings"] list) and returns a list of True or False values, indicating whether each stored encoding matches the current face.
tolerance=0.4: A lower tolerance makes the recognition stricter, while a higher tolerance makes it more lenient.
matchedIdxs stores the indices of the matched encodings.
counts counts how many times each name is matched (in case there are multiple matches for the same face).
name = max(counts, key=counts.get): The name with the most matches is selected.

names = []
# loop over the encodings of the faces detected in the test image
for encoding in encodings:
    # match the encoding of each detected face against the known encodings (from the dataset)
    # compare_faces returns a list of True/False values, one per known encoding,
    # indicating whether that known encoding matches the encoding being checked
    # internally it computes the Euclidean distance and compares it with the tolerance
    # (0.6 by default): smaller means a match, larger means a different person
    matches = face_recognition.compare_faces(data["encodings"], encoding, tolerance=0.4)  # the last parameter can be tuned
    name = "Unknown"  # placeholder; replaced below if a match is found

    # check whether this encoding matches any of the known encodings
    if True in matches:
        # store the indices where the encoding matched a known encoding (b == True)
        matchedIdxs = [i for (i, b) in enumerate(matches) if b]
        # dictionary used to count how many times each name is matched
        counts = {}
        # loop over the matched indices and count each name
        for i in matchedIdxs:
            name = data["names"][i]  # name of the known encoding that matched the checked encoding
            counts[name] = counts.get(name, 0) + 1  # start at 1 if not in the dict yet, otherwise add 1
        # take the name with the most counts (the name whose encodings matched the checked encoding most often)
        # another option is new_dic = sorted(dic.items(), key=lambda x: x[1], reverse=True),
        # which returns a list of tuples, then name = new_dic[0][0]
        name = max(counts, key=counts.get)

    names.append(name)

The boxes and names are zipped together to draw bounding boxes and label each
detected face.
cv2.rectangle draws a rectangle around each detected face.
cv2.putText adds the recognized name next to the face.
for ((top, right, bottom, left), name) in zip(boxes, names):
    cv2.rectangle(image, (left, top), (right, bottom), (0, 255, 0), 2)
    y = top - 15 if top - 15 > 15 else top + 15
    cv2.putText(image, name, (left, y), cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 1)
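The script would normally end by showing the annotated image; a minimal assumed continuation (not quoted from the project) could be:

# display the annotated test image until a key is pressed
cv2.imshow("Image", image)
cv2.waitKey(0)
cv2.destroyAllWindows()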

3.4 recognize_faces_video

-e (encodings): The path to the file containing the known face encodings and names.
-o (output): Path to the output video file (optional).
-y (display): Whether or not to display the output frame (1 to display, 0 to hide).
-d (detection_method): Face detection method, either "cnn" or "hog". "cnn" is more accurate but slower.

ap = argparse.ArgumentParser()
# path to the saved encodings file
ap.add_argument("-e", "--encodings", required=True, help=r"C:\Users\DELL\OneDrive\Desktop\ComputerVision\data_new\Face-Recognition-with-OpenCV-Python-DL-master\encodings.pickle")
# used if you want to save the video from the webcam
ap.add_argument("-o", "--output", type=str, help="path to the output video")
ap.add_argument("-y", "--display", type=int, default=1, help="whether or not to display output frame to screen")
# on a CPU or embedded device use hog; when creating the encodings, cnn is still used for accuracy
# without a GPU, hog is the practical choice, but feel free to experiment
ap.add_argument("-d", "--detection_method", type=str, default="cnn", help="face detection model to use: cnn")
args = vars(ap.parse_args())

cv2.VideoCapture(0) starts capturing video from the default webcam.
writer = None initializes the video writer, which is only created if an output video path is specified.

# initialize the video stream and the pointer to the output video file, and let the camera warm up
print("[INFO] starting video stream...")
video = cv2.VideoCapture(0)  # another camera can be selected by changing the source index
writer = None
time.sleep(2.)

The video frame is read from the webcam.
The frame is converted from BGR to RGB and resized to speed up processing.
r is the scaling factor from the resized image back to the original frame size.

while True:
    ret, frame = video.read()
    if not ret:
        break
    # convert the frame from BGR to RGB and resize it to speed up processing
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    rgb = imutils.resize(rgb, width=500)
    # scale factor from the resized rgb image back to the original frame, used further below
    r = frame.shape[1] / float(rgb.shape[1])

face_recognition.face_locations detects faces in the resized image (rgb).
face_recognition.face_encodings generates facial encodings for the detected faces.

    print("[INFO] recognizing faces...")
    boxes = face_recognition.face_locations(rgb, model=args["detection_method"])
    encodings = face_recognition.face_encodings(rgb, boxes)
    # initialize the list of names for the detected faces
    # remember that a single frame may contain several faces
    names = []

For each detected face, compare_faces is used to compare the face encoding with the known encodings (data["encodings"]). The name corresponding to the most frequent match is chosen.

    for encoding in encodings:
        # match the encoding of each detected face against the known encodings (from the dataset)
        # compare_faces returns a list of True/False values, one per known encoding,
        # indicating whether that known encoding matches the encoding being checked
        # internally it computes the Euclidean distance and compares it with the tolerance
        # (0.6 by default): smaller means a match, larger means a different person
        matches = face_recognition.compare_faces(data["encodings"], encoding)
        name = "Unknown"  # placeholder; replaced below if a match is found

        # check whether this encoding matches any of the known encodings
        if True in matches:
            # store the indices where the encoding matched a known encoding (b == True)
            matchedIdxs = [i for (i, b) in enumerate(matches) if b]
            # dictionary used to count how many times each name is matched
            counts = {}
            # loop over the matched indices and count each name
            for i in matchedIdxs:
                name = data["names"][i]  # name of the known encoding that matched the checked encoding
                counts[name] = counts.get(name, 0) + 1  # start at 1 if not in the dict yet, otherwise add 1
            # take the name with the most counts (the name whose encodings matched the checked encoding most often)
            # another option is new_dic = sorted(dic.items(), key=lambda x: x[1], reverse=True),
            # which returns a list of tuples, then name = new_dic[0][0]
            name = max(counts, key=counts.get)

        names.append(name)

Bounding boxes are drawn around each detected face, and the corresponding name is displayed on the frame.

    for ((top, right, bottom, left), name) in zip(boxes, names):
        # the detections were made on the resized rgb image, so the coordinates must be
        # rescaled back to the original frame and converted to int
        top = int(top * r)
        right = int(right * r)
        bottom = int(bottom * r)
        left = int(left * r)
        cv2.rectangle(frame, (left, top), (right, bottom), (0, 255, 0), 2)
        y = top - 15 if top - 15 > 15 else top + 15
        cv2.putText(frame, name, (left, y), cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 1)
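The -o/--output option implies that annotated frames can also be written to a video file; the writer logic is not shown in the excerpt, but a plausible sketch inside the same loop, assumed from common OpenCV usage, is:

    # lazily create the video writer once the first frame size is known (assumed logic)
    if writer is None and args["output"] is not None:
        fourcc = cv2.VideoWriter_fourcc(*"MJPG")
        writer = cv2.VideoWriter(args["output"], fourcc, 20, (frame.shape[1], frame.shape[0]), True)
    # append the annotated frame to the output video, if one was requested
    if writer is not None:
        writer.write(frame)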


CHAPTER 4. RESULT
CHAPTER 5. CONCLUSION
The Python script presented in this project demonstrates a comprehensive real-time face recognition system using a webcam and the face_recognition library, coupled with OpenCV. It integrates several crucial components, such as face detection, encoding, and recognition, to identify faces in live video feeds. By combining CNN- and HOG-based face detectors with dlib's pre-trained encoding model, the system efficiently detects and compares faces against a dataset of known encodings.
The script provides flexibility, offering options to either display the output
frame on the screen or save the video with recognized faces into an output file. The
inclusion of a customizable detection method allows users to fine-tune the
performance based on hardware capabilities, ensuring better results when running
the system on devices with limited resources, like embedded devices or CPUs. The
script uses a combination of Euclidean distance calculations and a k-Nearest
Neighbors (k-NN) approach to match the detected faces with known encodings,
which is ideal for a variety of applications.
This project could serve as a foundation for building more advanced facial
recognition systems, such as for security, attendance tracking, or automated
systems. The ability to scale and integrate the recognition system with a larger
database of face encodings opens up possibilities for various practical uses in
industries like retail, corporate environments, and public security. Furthermore, by
adjusting the tolerance settings, users can control the sensitivity and accuracy of the
recognition process, making it adaptable to different real-world conditions.
Additionally, the combination of OpenCV for image processing,
face_recognition for encoding and recognition, and Python’s simplicity allows
developers to expand this basic framework with additional features such as emotion
detection, age and gender classification, or even real-time alerts for unauthorized
access. This makes the system not just a facial recognition tool but a platform that
can evolve with further development.

In summary, this face recognition solution is a solid starting point for building real-time identification systems, with the flexibility to adapt to various scenarios and a reliable method for face detection and recognition in diverse applications.
REFERENCES
1. https://medium.com/analytics-vidhya/introduction-to-computer-vision-opencv-in-python-fb722e805e8b
2. https://www.geeksforgeeks.org/deep-learning-for-computer-vision/
3. https://github.com/ageitgey/face_recognition/blob/master/face_recognition/api.py#L213
4. https://www.pyimagesearch.com/2018/06/18/face-recognition-with-opencv-python-and-deep-learning/
