Face Transformer For Recognition
Abstract—Recently there has been a growing interest in the Transformer not only in NLP but also in computer vision. We wonder whether the Transformer can be used in face recognition and whether it is better than CNNs. Therefore, we investigate the performance of Transformer models in face recognition. Considering that the original Transformer may neglect inter-patch information, we modify the patch generation process so that tokens are generated from sliding patches that overlap with each other. The models are trained on the CASIA-WebFace and MS-Celeb-1M databases and evaluated on several mainstream benchmarks, including the LFW, SLLFW, CALFW, CPLFW, TALFW, CFP-FP, AgeDB and IJB-C databases. We demonstrate that Face Transformer models trained on a large-scale database, MS-Celeb-1M, achieve performance comparable to CNNs with a similar number of parameters and MACs. To facilitate further research, Face Transformer models and code are available at https://github.com/zhongyy/Face-Transformer.

Index Terms—Face Recognition, Neural networks, Transformer.

The authors are with the Pattern Recognition and Intelligent System Laboratory, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China (e-mail: zhongyaoyao@bupt.edu.cn; whdeng@bupt.edu.cn).

I. INTRODUCTION

Recently it has become a popular trend to apply the Transformer to different computer vision tasks, including image classification [1], object detection [2], video processing [3] and so on. Although the inner workings of the Transformer are not yet fully understood, researchers keep proposing new ways to apply it [4], [5], [6] because of its strong representation ability.

Based on large-scale training databases [7] and effective loss functions [8], [9], [10], convolutional neural networks (CNNs), from VGGNet [11] to ResNet [12], have achieved great success in face recognition over the past few years [10]. DeepFace [13] first uses a 9-layer CNN for face recognition and obtains 97.35% accuracy on the LFW database. FaceNet [14] adopts GoogleNet [15], assisted by a private large-scale dataset, achieving state-of-the-art performance (99.63% on LFW) at that time. SphereFace [8] adopts a 64-layer ResNet [12] with a large-margin loss function, achieving 99.42% accuracy on the LFW database. ArcFace [10] develops ResNet [12] with an IR block and achieves new state-of-the-art performance on several benchmarks.

Despite the success of CNNs, we still wonder whether the Transformer can be used in face recognition and whether it is better than ResNet-like CNNs. The Transformer has shown excellent performance when combined with large-scale databases [1], and there are already many large-scale training databases in face recognition, so it is interesting to observe how the combination of the Transformer and large-scale face training databases performs. Perhaps the Transformer is well placed to challenge the hegemony of CNNs in face recognition. It is known that the efficiency bottleneck of Transformer models lies in their key component, the self-attention mechanism, which incurs $O(n^2)$ complexity with respect to the sequence length [16]. Efficiency is of course important for face recognition models, but in this paper we mainly examine the feasibility of applying Transformer models to face recognition and leave their potential efficiency problems aside.

We first experiment with a standard Transformer [17], as ViT [1] did. However, the original ViT directly flattens the image into non-overlapping patches, which may neglect inter-patch information, since some important facial features are partitioned into different tokens. To better describe the inter-patch information, we slightly modify the token generation method of ViT so that the image patches overlap, which improves performance compared with the original ViT without increasing the computational cost. Face Transformer models are trained on a large-scale training database, MS-Celeb-1M [7], supervised with CosFace [9], and evaluated on several face recognition benchmarks including the LFW [18], SLLFW [19], CALFW [20], CPLFW [21], TALFW [22], CFP-FP [23], AgeDB-30 [24], and IJB-C [25] databases. Finally, we demonstrate that Transformer models trained on a large-scale database obtain performance comparable to CNNs with a similar number of parameters and MACs. In addition, we find that the Transformer models attend to the face area, as expected.

The contribution of our work is that we show the feasibility of Transformer models in face recognition and report promising experimental results. How to further improve the performance and efficiency of Transformer models in face recognition is a promising direction for future research.

II. FACE TRANSFORMER

In this paper, following the open-set face recognition pipeline [8], the Face Transformer is trained on face databases (with images $\boldsymbol{X}$ and labels $y$) in a supervised manner: face images are encoded by a well-designed network, and the output face image embeddings are supervised by an elaborate loss function [8], [9], [10] for better discriminative ability, as shown in Figure 1.
Fig. 1. The overall architecture of the Face Transformer. The face images are split into multiple patches and fed as tokens to the Transformer encoder. To better describe the inter-patch information, we modify the token generation method of ViT [1] so that the image patches overlap slightly, which improves the performance compared with the original ViT. The Transformer encoder is basically a standard Transformer model [17]. Eventually, the face image embeddings can be used for the loss functions [9], [10]. The illustration is inspired by ViT [1].
A. Network Architecture

The Face Transformer model follows the architecture of ViT [1], which applies the original Transformer [17]. The only difference is that we modify the token generation method of ViT to generate tokens from sliding patches, i.e., to make the image patches overlap, for a better description of the inter-patch information, as shown in Figure 1. Specifically, we extract sliding patches from the image $\boldsymbol{X} \in \mathbb{R}^{W \times W \times C}$ with patch size $P$ and stride $S$ (with implicit zero padding on both sides of the input), and finally obtain a sequence of flattened 2D patches $\boldsymbol{X}_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$. Here, $(W, W)$ is the resolution of the original image and $(P, P)$ is the resolution of each image patch. The effective sequence length is the number of patches $N$, with $\lfloor (W + 2 \times p - (P - 1) - 1)/S + 1 \rfloor$ patches along each side of the image, where $p$ is the amount of zero padding.
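As a minimal illustration of this token generation step (a sketch with assumed shapes and a hypothetical module name, not the released implementation; see the linked repository for the official code), overlapping patches can be extracted with a strided unfold operation and then linearly projected to the model dimension $D$:

import torch
import torch.nn as nn

class OverlappingPatchEmbed(nn.Module):
    # Extract sliding (possibly overlapping) patches and project them to dimension D.
    # The example values below (P=12, S=8, p=2 on a 112x112x3 face crop) are assumptions,
    # not settings taken from the paper.
    def __init__(self, patch_size=12, stride=8, padding=2, in_chans=3, dim=512):
        super().__init__()
        # nn.Unfold slides a P x P window with the given stride and zero padding and
        # returns each patch flattened to length P*P*C.
        self.unfold = nn.Unfold(kernel_size=patch_size, stride=stride, padding=padding)
        self.proj = nn.Linear(patch_size * patch_size * in_chans, dim)  # the projection E

    def forward(self, x):                  # x: (B, C, W, W)
        patches = self.unfold(x)           # (B, P*P*C, N)
        patches = patches.transpose(1, 2)  # (B, N, P*P*C); N follows the floor formula above
        return self.proj(patches)          # (B, N, D) patch embeddings X_p E

tokens = OverlappingPatchEmbed()(torch.randn(2, 3, 112, 112))
print(tokens.shape)  # torch.Size([2, 196, 512]); 14 x 14 = 196 overlapping 12x12 patches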
As in ViT, a trainable linear projection maps the flattened patches $\boldsymbol{X}_p$ to the model dimension $D$ and outputs the patch embeddings $\boldsymbol{X}_p\boldsymbol{E}$. A class token, i.e., a learnable embedding ($\boldsymbol{X}_{class} = \boldsymbol{z}_0^0$), is concatenated to the patch embeddings, and its state at the output of the Transformer encoder ($\boldsymbol{z}_L^0$) is taken as the final face image embedding (Equation 2). Position embeddings are then added to the patch embeddings to retain positional information. The final embedding

$$\boldsymbol{z}_0 = [\boldsymbol{X}_{class};\, \boldsymbol{X}_p^1\boldsymbol{E};\, \boldsymbol{X}_p^2\boldsymbol{E};\, \ldots;\, \boldsymbol{X}_p^N\boldsymbol{E}] + \boldsymbol{E}_{pos}, \qquad (1)$$

serves as input to the Transformer,

$$\boldsymbol{z}'_l = \mathrm{MSA}(\mathrm{LN}(\boldsymbol{z}_{l-1})) + \boldsymbol{z}_{l-1}, \quad l = 1, \ldots, L,$$
$$\boldsymbol{z}_l = \mathrm{MLP}(\mathrm{LN}(\boldsymbol{z}'_l)) + \boldsymbol{z}'_l, \quad l = 1, \ldots, L, \qquad (2)$$
$$\boldsymbol{x} = \mathrm{LN}(\boldsymbol{z}_L^0),$$

which consists of multi-headed self-attention (MSA) and MLP blocks, with LayerNorm (LN) applied before each block and residual connections after each block, as shown in Figure 1. In Equation 2, the output $\boldsymbol{x}$ is the final output of the Transformer model.
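For concreteness, Equations 1 and 2 can be sketched with standard PyTorch modules as follows (an illustrative approximation: the class name SimpleViTEncoder and the use of nn.TransformerEncoderLayer are assumptions, and details such as the MLP activation may differ from the released model):

import torch
import torch.nn as nn

class SimpleViTEncoder(nn.Module):
    # Eq. (1): prepend the class token and add position embeddings.
    # Eq. (2): L pre-norm Transformer blocks (MSA + MLP with residual connections).
    def __init__(self, num_patches, dim=512, depth=20, heads=8, mlp_dim=2048):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                # X_class
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # E_pos
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=mlp_dim,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_embeddings):    # (B, N, D), i.e. the patch embeddings X_p E
        b = patch_embeddings.size(0)
        cls = self.cls_token.expand(b, -1, -1)
        z0 = torch.cat([cls, patch_embeddings], dim=1) + self.pos_embed  # Eq. (1)
        zL = self.blocks(z0)                                             # Eq. (2)
        return self.norm(zL[:, 0])          # x = LN(z_L^0), the face image embedding

# Continuing the previous sketch:
# embedding = SimpleViTEncoder(num_patches=tokens.size(1))(tokens)  # (B, 512)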
One of the key blocks of the Transformer, MSA, is composed of $k$ parallel self-attention (SA) heads,

$$[\boldsymbol{q}, \boldsymbol{k}, \boldsymbol{v}] = \boldsymbol{z}\boldsymbol{U}_{qkv},$$
$$\mathrm{SA}(\boldsymbol{z}) = \mathrm{softmax}\!\left(\boldsymbol{q}\boldsymbol{k}^T/\sqrt{D_h}\right)\boldsymbol{v}, \qquad (3)$$

where $\boldsymbol{z} \in \mathbb{R}^{(N+1)\times D}$ is an input sequence, $\boldsymbol{U}_{qkv} \in \mathbb{R}^{D\times 3D_h}$ is the weight matrix for the linear transformation, and $\boldsymbol{A} = \mathrm{softmax}(\boldsymbol{q}\boldsymbol{k}^T/\sqrt{D_h})$ is the attention map. The output of MSA is the concatenation of the $k$ attention head outputs,

$$\mathrm{MSA}(\boldsymbol{z}) = [\mathrm{SA}_1(\boldsymbol{z});\, \mathrm{SA}_2(\boldsymbol{z});\, \ldots;\, \mathrm{SA}_k(\boldsymbol{z})]\,\boldsymbol{U}_{msa}, \qquad (4)$$

where $\boldsymbol{U}_{msa} \in \mathbb{R}^{k \cdot D_h \times D}$.
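A minimal multi-head self-attention sketch matching Equations 3 and 4 (illustrative only; the module and variable names are ours, and the per-head dimension is assumed to be $D_h = D/k$):

import torch
import torch.nn as nn

class MSA(nn.Module):
    # k parallel self-attention heads, Eqs. (3)-(4).
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.heads = heads
        self.dh = dim // heads                            # D_h
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)    # U_qkv
        self.out = nn.Linear(dim, dim, bias=False)        # U_msa

    def forward(self, z):                                 # z: (B, N+1, D)
        b, n, d = z.shape
        q, k, v = self.qkv(z).chunk(3, dim=-1)            # [q, k, v] = z U_qkv
        # split into heads: (B, heads, N+1, D_h)
        q, k, v = (t.view(b, n, self.heads, self.dh).transpose(1, 2) for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.dh ** 0.5, dim=-1)  # attention map A
        heads_out = attn @ v                              # SA_i(z) for each head
        heads_out = heads_out.transpose(1, 2).reshape(b, n, d)  # [SA_1; ...; SA_k]
        return self.out(heads_out)                        # MSA(z)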
B. Loss Function

The output $\boldsymbol{x}$ of Equation 2, i.e., the final output of the Transformer model, is supervised by an elaborate loss function [8], [9], [10] for better discriminative ability,

$$L = -\log P_y = -\log \frac{e^{\boldsymbol{W}_y^T\boldsymbol{x} + b_y}}{\sum_{j=1}^{C} e^{\boldsymbol{W}_j^T\boldsymbol{x} + b_j}}, \qquad (5)$$

where $y$ is the label, $P_y$ is the predicted probability of assigning $\boldsymbol{x}$ to class $y$, $C$ is the number of identities, $\boldsymbol{W}_j$ is the $j$-th column of the weight of the last fully connected layer, and $\boldsymbol{b} \in \mathbb{R}^{C}$ is the bias. Softmax-based loss functions [26], [8], [9], [10] remove the bias term, transform $\boldsymbol{W}_j^T\boldsymbol{x}$ into $s\cos\theta_j$, and incorporate a large margin in the $\cos\theta_{y_i}$ term [8], [9], [10]. Therefore, softmax-based loss functions can be formulated as

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log P_{y_i} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s f(\theta_{y_i})}}{e^{s f(\theta_{y_i})} + \sum_{j=1,\, j\neq y_i}^{C} e^{s\cos\theta_j}}, \qquad (6)$$

where $f(\theta_{y_i}) = \cos\theta_{y_i} - m$ in CosFace [9].
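A hedged sketch of the margin loss of Equations 5 and 6 with $f(\theta_{y_i}) = \cos\theta_{y_i} - m$, using the $s = 64$ and $m = 0.35$ reported later in the paper (the feature and weight normalization follow the usual CosFace formulation; the class name CosFaceLoss and the initialization are assumptions, not the authors' code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class CosFaceLoss(nn.Module):
    # Eq. (6) with f(theta_y) = cos(theta_y) - m, the CosFace formulation.
    def __init__(self, dim=512, num_classes=93431, s=64.0, m=0.35):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_classes, dim))  # one weight vector W_j per identity
        self.s, self.m = s, m

    def forward(self, x, labels):            # x: (B, dim) embeddings, labels: (B,)
        cos = F.linear(F.normalize(x), F.normalize(self.W))   # cos(theta_j), shape (B, num_classes)
        margin = torch.zeros_like(cos)
        margin.scatter_(1, labels.unsqueeze(1), self.m)       # subtract m only at the target class
        logits = self.s * (cos - margin)
        return F.cross_entropy(logits, labels)                # -(1/N) sum_i log P_{y_i}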
III. EXPERIMENT

A. Implementation Details

We use two training databases, CASIA-WebFace and MS-Celeb-1M [7]. CASIA-WebFace is a widely used training database containing 0.49M images from 10,575 celebrities, which can be seen as relatively small compared with million-scale databases [7]. MS-Celeb-1M is a popular large-scale training database in face recognition, and we use the clean version refined by insightface [10], which contains 5.3M images of 93,431 celebrities. We choose CosFace [9] ($s = 64$ and $m = 0.35$) as the loss function for better convergence and recognition performance. The face images are aligned to 112 × 112. Horizontal flipping with a probability of 50% is used for training data augmentation.
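With torchvision, the training-time augmentation described above amounts to something like the following sketch (the face alignment itself depends on the detection toolchain and is not shown; the normalization constants are assumptions, not values from the paper):

import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),   # horizontal flip with 50% probability
    T.ToTensor(),
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # assumed normalization
])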
For comparison, the CNN architecture used in our work is a modified ResNet-100 [12] proposed in the first version of the ArcFace paper [10], which uses IR blocks (BN-Conv-BN-PReLU-Conv-BN) and applies the “BN [27]-Dropout [28]-FC-BN” structure to obtain the final 512-D embedding feature. We also experiment with the recently proposed T2T-ViT [5]. The number of parameters, MACs, and inference speed (Tesla V100, Intel Xeon E5-2698 v4) of these face recognition models are listed in Table I. Details are as follows. For the ViT models, the number of layers is 20, the number of heads is 8, the hidden size is 512, and the MLP size is 2048. For the Token-to-Token part of the T2T-ViT model, the depth is 2, the hidden dimension is 64, and the MLP size is 512; for the backbone, the number of layers is 24, the number of heads is 8, the hidden size is 512, and the MLP size is 2048. Note that “ViT-P10S8” denotes a ViT model with a 10 × 10 patch size and stride $S = 8$, while “ViT-P8S8” has no overlap between tokens.
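To make the naming concrete, the patch size and stride determine the token count through the sequence-length formula of Section II-A; a small sanity check (with assumed zero-padding amounts, which are free choices here and not specified values) might look like:

def num_tokens(W=112, P=8, S=8, p=0):
    # Total number of patch tokens, following the floor formula in Section II-A
    # (squared because the W x W image is tiled along both spatial dimensions).
    per_side = (W + 2 * p - (P - 1) - 1) // S + 1
    return per_side * per_side

print(num_tokens(P=8, S=8, p=0))    # ViT-P8S8: non-overlapping 8x8 patches -> 196 tokens
print(num_tokens(P=12, S=8, p=2))   # ViT-P12S8: overlapping 12x12 patches, padding assumed -> 196 tokens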
TABLE I
NUMBER OF PARAMETERS, MACs AND INFERENCE SPEED OF FACE RECOGNITION MODELS.

As shown in Table II, Face Transformer models trained on CASIA-WebFace perform noticeably worse than ResNet-100. Actually, we find that the accuracy of Face Transformer models trained on CASIA-WebFace can reach a level as high as that of ResNet-100 during training, while the models do not generalize well to the test databases, which indicates that the scale of CASIA-WebFace may not be sufficient for Transformer models.

Things change when we use a much larger training database, MS-Celeb-1M: the Face Transformer models demonstrate promising results on this large-scale face training database. The performance of the Face Transformer is competitive with that of ResNet-100 with a similar number of parameters and MACs. Compared with “ViT-P8S8”, “ViT-P10S8” and “ViT-P12S8” achieve better performance, which demonstrates that overlapping patches help to some degree. T2T-ViT also obtains good performance; owing to limited computing resources, more hyper-parameter settings for the T2T block remain to be explored. Another interesting point is that the Transformer models obtain somewhat higher accuracy on the TALFW database, which contains transferable adversarial noise. Since the TALFW database is generated using CNNs as surrogate models, this does not indicate any particular adversarial robustness of the Transformer models. It would be interesting to explore the combination of Face Transformer models and adversarial training.
TABLE II
PERFORMANCE ON LFW [18], SLLFW [19], CALFW [20], CPLFW [21], TALFW [22], CFP-FP [23] AND AGEDB-30 [24] DATABASES.

Training Data    Models           LFW    SLLFW  CALFW  CPLFW  TALFW  CFP-FP  AgeDB-30
CASIA-WebFace    ResNet-100 [12]  99.55  98.65  94.13  90.93  53.17  96.30   95.50
CASIA-WebFace    ViT-P8S8 [1]     97.32  90.78  86.78  80.78  83.05  86.60   81.48
CASIA-WebFace    ViT-P12S8        97.42  90.07  87.35  81.60  84.00  85.56   81.48
MS-Celeb-1M      ResNet-100 [12]  99.82  99.67  96.27  93.43  64.88  96.93   98.27
MS-Celeb-1M      ViT-P8S8 [1]     99.83  99.53  95.92  92.55  74.87  96.19   97.82
MS-Celeb-1M      T2T-ViT [5]      99.82  99.63  95.85  93.00  71.93  96.59   98.07
MS-Celeb-1M      ViT-P10S8        99.77  99.63  95.95  92.93  72.95  96.43   97.83
MS-Celeb-1M      ViT-P12S8        99.80  99.55  96.18  93.08  70.13  96.77   98.05
TABLE III
COMPARISON OF DIFFERENT MODELS TRAINED ON MS-CELEB-1M ON THE IJB-C DATABASE [25].