FINAL ESSAY
INTRODUCTION TO COMPUTER VISION
ACKNOWLEDGEMENT
We would like to express our sincere gratitude to Mr. Pham Anh Huy, our
instructor and mentor, for his valuable guidance and support throughout this final
essay for the Introduction to Computer Vision course. He has been very helpful and
patient in providing us with constructive feedback and suggestions to improve our
work. He has also encouraged us to explore new technologies and techniques to
enhance our system's functionality and performance. We have learned a lot from his
expertise and experience. We are honored and privileged to have him as our teacher
and supervisor.
DECLARATION OF AUTHORSHIP
We hereby declare that this is our own project, carried out under the guidance of
Mr. Pham Van Huy. The research content and results contained herein are truthful
and have not been published in any form before. The data in the tables used for
analysis, comments, and evaluation were collected by the authors from different
sources, which are clearly stated in the reference section.
Computer vision is one of the most significant fields where deep learning is
being used to enable machines to perceive and comprehend visual stimuli. Deep
learning has opened up new possibilities in computer vision, accelerating
technological developments and changing sectors, from identifying objects in
photos to allowing safe navigation by autonomous vehicles.
Deep learning relies heavily on neural networks, which are made to replicate how
the human brain interprets information. Layers of interconnected nodes, or
"neurons," make up a neural network. Each layer performs basic calculations on the
input data. Usually, these layers fall into one of three categories:
Input Layer: The network's entry point, where raw data is fed into the model.
Hidden Layers: Intermediate layers that perform complex feature transformations on the data.
Output Layer: The final layer, which generates the network's prediction or classification.
The technique used to train neural networks is called backpropagation; it adjusts
the connection weights according to the discrepancy between predicted and actual
outputs. This iterative procedure continues until the model performs as expected.
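As a minimal illustration (not part of the project code), the sketch below trains a tiny two-layer network with backpropagation on a toy XOR problem using numpy; the architecture, learning rate, and toy data are all invented for illustration.

import numpy as np

# toy data: 4 samples, 2 features, binary labels (XOR problem)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)   # input layer -> hidden layer
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)   # hidden layer -> output layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for epoch in range(5000):
    # forward pass through the layers
    h = sigmoid(X @ W1 + b1)          # hidden layer activations
    out = sigmoid(h @ W2 + b2)        # output layer prediction
    # backpropagation: gradients of the squared error w.r.t. each weight
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # gradient-descent weight updates
    W2 -= 0.5 * h.T @ d_out
    b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h
    b1 -= 0.5 * d_h.sum(axis=0)

print(out.round(3))   # predictions approach [0, 1, 1, 0] as training converges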
Convolutional Neural Networks (CNNs) are a type of neural network that are
designed specifically for processing structured grid data, such as images. They are
highly effective in capturing spatial hierarchies and patterns in visual data. CNNs
consist of several key components:
Convolutional Layers: These layers apply convolution operations to the input
image, using filters (or kernels) to detect local patterns such as edges, textures, and
shapes. Each filter produces a feature map that highlights particular aspects of the
image.
Pooling Layers: By lowering the spatial dimensions of feature maps, pooling
layers preserve important data while simplifying computations. Both average and
maximum pooling are frequently employed.
Fully Connected Layers: After a number of convolutional and pooling layers, the
network typically ends with fully connected layers that interpret the extracted
features and generate the final predictions.
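To make these three component types concrete, here is a minimal sketch of a small CNN written with PyTorch; PyTorch is not used elsewhere in this project, and the layer sizes are arbitrary choices for illustration.

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # convolutional layers: learn local patterns such as edges and textures
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # pooling layer: halves the spatial size
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # fully connected layer: interprets the extracted features and makes the prediction
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):                          # x: batch of 3x32x32 images
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

logits = TinyCNN()(torch.randn(1, 3, 32, 32))      # -> tensor of shape (1, 10)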
The exceptional accuracy of CNNs in segmentation, object detection, and image
classification has transformed computer vision. Their capacity to learn hierarchical
representations makes them highly effective for visual recognition.
Transfer learning is a method that reuses networks previously trained on related
tasks to improve the effectiveness and performance of deep learning models.
Through transfer learning, models can make use of the knowledge gathered during
prior training rather than starting from scratch, which would require a large amount
of data and computing power.
Pre-trained Models: These models have already learned how to extract
valuable features from images after being trained on huge benchmark datasets like
ImageNet. Some well-known pre-trained models are Inception, ResNet, and VGG.
Fine-tuning: In transfer learning, the weights of the previously trained
model are adjusted to make it more suitable for the current task. This entails
keeping the features learned from the original dataset while training the model
on a smaller, task-specific dataset.
Feature Extraction: Alternatively, the pre-trained model can be used as a
fixed feature extractor. In this approach, only the fully connected layers are
retrained for the new task, while the pre-trained convolutional layers extract
features from the input images.
Transfer learning significantly reduces the time and data required to achieve high
performance on new computer vision tasks. It is especially valuable in scenarios
with limited labeled data and helps in rapidly deploying models in practical
applications.
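As an illustrative sketch (assuming torchvision is available; it is not part of this project's code), the snippet below shows both strategies on a pre-trained ResNet-18; the number of classes is a hypothetical value.

import torch.nn as nn
from torchvision import models

# load a ResNet-18 pre-trained on ImageNet (the weights argument is for torchvision >= 0.13;
# older versions use pretrained=True instead)
model = models.resnet18(weights="IMAGENET1K_V1")

# feature extraction: freeze the convolutional layers so only the new head is trained
for param in model.parameters():
    param.requires_grad = False

# replace the final fully connected layer with one sized for the new task
num_classes = 5                      # hypothetical number of classes in the new dataset
model.fc = nn.Linear(model.fc.in_features, num_classes)

# for fine-tuning instead, leave (some of) the backbone unfrozen and train the whole
# network with a small learning rate, e.g. torch.optim.Adam(model.parameters(), lr=1e-4)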
An effective tool for identifying faces in images or videos is the Face
Recognition library. Its principles and key components are explained below:
a. Face Detection
To detect faces, the library uses either a CNN-based model or the HOG
(Histogram of Oriented Gradients) model. Specifically:
- HOG-based detection: Uses gradient-based image features (edges and corners) to locate faces; fast but less accurate.
- CNN-based detection: More computationally intensive, but more accurate.
b. Facial Landmarks
The library locates 68 facial landmarks, including points around the mouth, nose,
eyes, and face contour. Before recognition, the face is aligned using these points.
c. Face Encoding
The process of converting a face into a 128-dimensional numeric vector (feature
vector) is known as face encoding. This vector represents the distinctive
characteristics of each face.
d. Face Recognition
The encoding vector is compared against the known encodings; Euclidean distance
or cosine similarity is typically used to measure the similarity between vectors.
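A minimal sketch of these steps with the face_recognition API (the image paths are placeholders):

import face_recognition

# load a reference image and an unknown image (paths are placeholders)
known_image = face_recognition.load_image_file("known_person.jpg")
unknown_image = face_recognition.load_image_file("unknown.jpg")

# detect faces ("hog" is faster, "cnn" is more accurate) and compute 128-d encodings
known_boxes = face_recognition.face_locations(known_image, model="hog")
known_encoding = face_recognition.face_encodings(known_image, known_boxes)[0]   # assumes a face was found
unknown_encodings = face_recognition.face_encodings(unknown_image)

for encoding in unknown_encodings:
    # compare_faces thresholds the Euclidean distance between vectors (default tolerance 0.6)
    match = face_recognition.compare_faces([known_encoding], encoding)[0]
    distance = face_recognition.face_distance([known_encoding], encoding)[0]
    print(match, distance)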
1.4.4 Application
- Security system (identity authentication).
- Video analysis (behavior tracking).
- Virtual reality or augmented reality (AR/VR) applications.
- Personalization of services (customer recognition).
Advantages: Can handle both still images and videos.
Limitations: Accuracy degrades when the face is partially obscured or rotated too much.
2.3 output
Save the video after it has been processed through face recognition.
This involves capturing and processing video frames to detect and recognize
faces in real-time or during post-processing. The resulting video will include the
overlays, annotations, or markers indicating the detected faces, along with
additional metadata such as timestamps, detected IDs, or confidence scores if
applicable.
Storing the output video allows for further analysis, validation, or reporting.
It is particularly useful in applications like surveillance, access control, or research
where recorded evidence of face recognition results is essential. The output format
should ideally retain high quality while incorporating all the processed information
effectively.
2.4 build_dataset
2.6 recognize_faces_image
2.7 recognize_faces_video
2.8 encoding.pickle
The file encoding.pickle is used to store the facial encodings generated by
the encode_faces.py script. These encodings represent the unique features of each
face in a format that can be processed by machine learning models.
Once the encodings are generated, they are saved to disk in the
encoding.pickle file, ensuring they can be efficiently loaded later without the need
to regenerate them from scratch. This approach significantly reduces computational
overhead during subsequent operations, such as face recognition or verification.
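For example, the stored file can be loaded and inspected as follows (assuming the dictionary layout with "encodings" and "names" keys used later in this report):

import pickle

# quick check of what encoding.pickle contains: a dict with "encodings"
# (one 128-d vector per face) and "names" (the matching person labels)
with open("encoding.pickle", "rb") as f:
    data = pickle.loads(f.read())

print(len(data["encodings"]), "encodings for", len(set(data["names"])), "people")
print(len(data["encodings"][0]))   # each encoding has 128 dimensions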
2.9 How to do it
Note: Each image should ideally contain only one face of the designated
person. If multiple faces appear in a single image, the implementation becomes
more complex as you would need to identify which face belongs to the target
person.
For this example, the dataset is created using a webcam. Position your face
at different distances from the webcam, with various angles, expressions, and
lighting conditions. Run the build_dataset.py script and press the k key to save
images for each individual. To achieve high model accuracy, ensure each person
has at least 10–20 images.
This script does not include any face detection methods, such as Haar
cascades, or bounding boxes to help the user align their face. The goal is to capture
images in diverse real-world conditions to train a more robust model. Including
more images under different scenarios will improve the system’s reliability in
practical applications.
Besides creating a dataset via webcam, you can also build it manually or use
a Search API like Bing or Google. Once the dataset is created with build_dataset.py,
you’ll run encode_faces.py to generate the embeddings.
After creating the dataset, the next step is to generate encodings (or
embeddings) for the faces. The first task is to extract the Face Regions of Interest
(ROIs). Avoid using the entire image because background noise can negatively
affect model quality. To detect and extract faces, you can use methods like Haar
cascades, HOG + Linear SVM, or a deep learning-based face detector.
Once the Face ROIs are extracted, they are passed through a neural network
to obtain the encodings.
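A minimal sketch of this detect-then-encode step using the face_recognition API (the image path is a placeholder):

import cv2
import face_recognition

image = cv2.imread("dataset/person_a/00001.png")           # placeholder path
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# detect the face ROIs first ("hog" or "cnn"), then encode only those regions
boxes = face_recognition.face_locations(rgb, model="hog")
rois = [rgb[top:bottom, left:right] for (top, right, bottom, left) in boxes]
encodings = face_recognition.face_encodings(rgb, boxes)     # one 128-d vector per ROI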
In this step, the encode_faces.py script is used to save the encodings and
names (or IDs, if necessary). You can refer to the script for detailed explanations, as
it contains clear comments for each part. The encodings and names are saved in the
encodings.pickle file for later use.
With the encodings generated from the datasets (via the pre-trained model
using dlib and face_recognition), we can now perform face recognition.
3.1 build_dataset
args = vars(ap.parse_args())
The loop continuously captures frames from the webcam (video.read()) and stores
the image in frame.
cv2.imshow("video", frame) displays the captured frame in a window named
"video".
cv2.waitKey(1) & 0xFF waits briefly for a key press and returns the code of the
pressed key masked to an 8-bit value; cv2.waitKey itself returns -1 when no key is pressed.
while True:
    ret, frame = video.read()       # grab the next frame from the webcam
    cv2.imshow("video", frame)      # display it in a window named "video"
    key = cv2.waitKey(1) & 0xFF     # wait briefly for a key press
If the key k is pressed (key == ord("k")), the program saves the current frame as an
image:
The file path p is constructed by joining the output directory path (args["output"])
and the filename (total), formatted to always have five digits (e.g., 00001.png,
00002.png, etc.).
cv2.imwrite(p, frame) writes the captured frame to the specified path.
The total counter is incremented by 1 to prepare for the next image.
if key == ord("k"):
    # zfill pads with zeros on the left so the filename always has 5 digits
    p = os.path.sep.join([args["output"], "{}.png".format(str(total).zfill(5))])
    cv2.imwrite(p, frame)
    total += 1
# press q to quit
elif key == ord("q"):
    break
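Putting the fragments above together, a minimal sketch of the whole build_dataset.py capture loop could look as follows (the webcam index, window handling, and directory creation are assumptions):

import argparse
import os
import cv2

ap = argparse.ArgumentParser()
ap.add_argument("-o", "--output", required=True, help="directory to store the captured images")
args = vars(ap.parse_args())

os.makedirs(args["output"], exist_ok=True)
video = cv2.VideoCapture(0)        # open the default webcam
total = 0

while True:
    ret, frame = video.read()
    if not ret:
        break
    cv2.imshow("video", frame)
    key = cv2.waitKey(1) & 0xFF
    if key == ord("k"):            # press k to save the current frame
        p = os.path.sep.join([args["output"], "{}.png".format(str(total).zfill(5))])
        cv2.imwrite(p, frame)
        total += 1
    elif key == ord("q"):          # press q to quit
        break

video.release()
cv2.destroyAllWindows()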
3.2 encode_faces
-i (dataset): The path to the directory containing the face images. It must be
provided by the user.
-e (encodings): The file path where the facial encodings and names will be saved (in
pickle format).
-d (detection_method): Specifies the method used for face detection. Options are
"cnn" (more accurate but slower) or "hog" (faster but less accurate).
The args variable stores the parsed command-line arguments in a dictionary.
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--dataset", required=True, help="path to the input directory of face images")
ap.add_argument("-e", "--encodings", required=True, help="path to the output file of facial encodings (pickle)")
# before encoding a face it must be detected first (this is always the first step in face recognition) - choose the detection method
ap.add_argument("-d", "--detection-method", type=str, default="cnn", help="face detection model to use: 'cnn' or 'hog'")
args = vars(ap.parse_args())
These lists will store the facial encodings and the names of the individuals
associated with those encodings.
# initialize the lists that hold the known encodings and known names (for the test images to compare against)
knownEncodings = []
knownNames = []
# extract the person's name from the image path
name = imagePath.split(os.path.sep)[-2]
# returns an array of bounding boxes of faces, using dlib as in the face detection example
# model="cnn" is more accurate but slower, "hog" is faster but less accurate
# (the image at imagePath is assumed to have been loaded and converted from BGR to RGB into rgb)
boxes = face_recognition.face_locations(rgb, model=args["detection_method"])
# compute the encodings for each face detected in the image (there may be multiple faces)
# ideally, each image should contain only one face, the target person's
encodings = face_recognition.face_encodings(rgb, boxes)
The code loops over all the face encodings detected in the image (there might be
multiple faces).
The encoding and corresponding name are appended to the knownEncodings and
knownNames lists.
# loop over the encodings
# make sure each image in the original dataset contains only one face
# ideally each image has exactly one face and therefore exactly one encoding
for encoding in encodings:
    knownEncodings.append(encoding)
    knownNames.append(name)
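After the loop, the collected encodings and names are serialized to disk; a minimal sketch, assuming the lists and the command-line arguments defined above:

import pickle

# serialize the encodings and names so they only need to be computed once
data = {"encodings": knownEncodings, "names": knownNames}
with open(args["encodings"], "wb") as f:
    f.write(pickle.dumps(data))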
3.3 recognize_faces_image
# path to the saved encodings file
# on a CPU or an embedded device use "hog"; when creating the encodings, "cnn" is still used for accuracy
args = vars(ap.parse_args())
pickle.load(f) loads the facial encodings and associated names from the specified
file (encodings.pickle).
The data object contains the encodings and names in a dictionary format.
# load the known faces and encodings
with open(args["encodings"], "rb") as f:
    data = pickle.load(f)
# finally, a k-NN style comparison is used to recognize each face
# detect the faces in the input image and compute their encodings
# (rgb is the input image, loaded earlier and converted from BGR to RGB)
boxes = face_recognition.face_locations(rgb, model=args["detection_method"])
encodings = face_recognition.face_encodings(rgb, boxes)
names = []
# loop over the encodings of the faces detected in the image
for encoding in encodings:
    # match the encoding of each detected face against the known encodings (from the dataset);
    # compare_faces returns a list of True/False indicating whether each known encoding matches
    matches = face_recognition.compare_faces(data["encodings"], encoding)
    name = "Unknown"  # placeholder; replaced below if a match is found
    # check whether this encoding matches any of the known encodings
    if True in matches:
        # store the indices where the encoding matches the known encodings (i.e. where b == True)
        matchedIdxs = [i for (i, b) in enumerate(matches) if b]
        counts = {}
        for i in matchedIdxs:
            name = data["names"][i]  # name of the known encoding that matched the checked encoding
            counts[name] = counts.get(name, 0) + 1  # start at 1 if the name is not in the dict yet, otherwise add 1
        # take the name with the highest count (the name whose encodings matched the checked encoding most often)
        name = max(counts, key=counts.get)
    names.append(name)
The boxes and names are zipped together to draw bounding boxes and label each
detected face.
cv2.rectangle draws a rectangle around each detected face.
cv2.putText adds the recognized name next to the face.
for ((top, right, bottom, left), name) in zip(boxes, names):
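A sketch of the complete drawing loop (the variable image is assumed to be the loaded input image; the color and font are arbitrary choices):

for ((top, right, bottom, left), name) in zip(boxes, names):
    # draw the bounding box around the detected face
    cv2.rectangle(image, (left, top), (right, bottom), (0, 255, 0), 2)
    # place the label above the box, or below it if the box touches the top of the image
    y = top - 15 if top - 15 > 15 else top + 15
    cv2.putText(image, name, (left, y), cv2.FONT_HERSHEY_SIMPLEX, 0.75, (0, 255, 0), 2)

cv2.imshow("Image", image)
cv2.waitKey(0)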
3.4 recognize_faces_video
-e (encodings): The path to the file containing the known face encodings and names.
-o (output): Path to the output video file (optional).
# path to the saved encodings file
# optional path for saving the video captured from the webcam
# on a CPU or an embedded device use "hog"; when creating the encodings, "cnn" is still used for accuracy
args = vars(ap.parse_args())
writer = None
time.sleep(2.0)  # give the camera sensor time to warm up

# inside the frame-reading loop:
ret, frame = video.read()
if not ret:
    break
# convert the frame from BGR to RGB and resize it to speed up processing
rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
rgb = cv2.resize(rgb, (750, int(rgb.shape[0] * 750 / rgb.shape[1])))
# scale factor from the original frame to rgb, needed again below when drawing the boxes
r = frame.shape[1] / float(rgb.shape[1])
# detect faces in the rgb frame and compute their encodings
boxes = face_recognition.face_locations(rgb, model=args["detection_method"])
encodings = face_recognition.face_encodings(rgb, boxes)
# initialize the list that holds the names of the detected faces
names = []
For each detected face, compare_faces is used to compare the face encoding with
the known encodings (data["encodings"]).
The name corresponding to the most frequent match is chosen.
for encoding in encodings:
    # match the encoding of each detected face against the known encodings (from the dataset)
    matches = face_recognition.compare_faces(data["encodings"], encoding)
    name = "Unknown"  # placeholder; replaced below if a match is found
    # check whether this encoding matches any of the known encodings
    if True in matches:
        # store the indices where the encoding matches the known encodings (i.e. where b == True)
        matchedIdxs = [i for (i, b) in enumerate(matches) if b]
        counts = {}
        # loop over the matched indices and count the matches per name
        for i in matchedIdxs:
            name = data["names"][i]
            counts[name] = counts.get(name, 0) + 1  # start at 1 if the name is not in the dict yet, otherwise add 1
        # take the name with the highest count (the name whose encodings matched the checked encoding most often)
        name = max(counts, key=counts.get)
    names.append(name)
Bounding boxes are drawn around each detected face, and the corresponding name
is displayed on the frame.
for ((top, right, bottom, left), name) in zip(boxes, names):
    # rescale the face coordinates from the resized rgb frame back to the original frame
    top = int(top * r)
    right = int(right * r)
    bottom = int(bottom * r)
    left = int(left * r)
    # draw the bounding box and the recognized name on the frame
    cv2.rectangle(frame, (left, top), (right, bottom), (0, 255, 0), 2)
    y = top - 15 if top - 15 > 15 else top + 15
    cv2.putText(frame, name, (left, y), cv2.FONT_HERSHEY_SIMPLEX, 0.75, (0, 255, 0), 2)
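To actually save the processed video (the output described in section 2.3), the frames can be written out with cv2.VideoWriter; the following sketch, placed inside the same frame loop, assumes the codec (MJPG) and the frame rate (20 FPS):

# if an output path was given, lazily create the writer and save each processed frame
if args["output"] is not None and writer is None:
    fourcc = cv2.VideoWriter_fourcc(*"MJPG")
    writer = cv2.VideoWriter(args["output"], fourcc, 20, (frame.shape[1], frame.shape[0]), True)
if writer is not None:
    writer.write(frame)

cv2.imshow("Frame", frame)
if cv2.waitKey(1) & 0xFF == ord("q"):  # press q to stop
    break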
REFERENCES
2. https://www.geeksforgeeks.org/deep-learning-for-computer-vision/
3. https://github.com/ageitgey/face_recognition/blob/master/face_recognition/api.py#L213
4. https://www.pyimagesearch.com/2018/06/18/face-recognition-with-opencv-python-and-deep-learning/