Face Transformer for Recognition


Yaoyao Zhong, Weihong Deng

arXiv:2103.14803v2 [cs.CV] 13 Apr 2021

Abstract—Recently there has been growing interest in the Transformer, not only in NLP but also in computer vision. We wonder whether the Transformer can be used for face recognition and whether it is better than CNNs. Therefore, we investigate the performance of Transformer models in face recognition. Considering that the original Transformer may neglect inter-patch information, we modify the patch generation process and generate tokens from sliding patches that overlap with each other. The models are trained on the CASIA-WebFace and MS-Celeb-1M databases and evaluated on several mainstream benchmarks, including the LFW, SLLFW, CALFW, CPLFW, TALFW, CFP-FP, AgeDB and IJB-C databases. We demonstrate that Face Transformer models trained on a large-scale database, MS-Celeb-1M, achieve performance comparable to CNNs with a similar number of parameters and MACs. To facilitate further research, Face Transformer models and code are available at https://github.com/zhongyy/Face-Transformer.

Index Terms—Face Recognition, Neural Networks, Transformer.

The authors are with the Pattern Recognition and Intelligent System Laboratory, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China (e-mail: zhongyaoyao@bupt.edu.cn; whdeng@bupt.edu.cn).

I. INTRODUCTION

Recently it has become a popular trend to apply the Transformer to different computer vision tasks, including image classification [1], object detection [2], video processing [3], and so on. Although the inner workings of the Transformer are not yet well understood, researchers keep coming up with ideas to apply it in different ways [4], [5], [6] because of its strong representation ability.

Based on large-scale training databases [7] and effective loss functions [8], [9], [10], convolutional neural networks (CNNs), from VGGNet [11] to ResNet [12], have achieved great success in face recognition over the past few years [10]. DeepFace [13] first uses a 9-layer CNN for face recognition and obtains 97.35% accuracy on the LFW database. FaceNet [14] adopts GoogleNet [15], assisted by a private large-scale dataset, achieving state-of-the-art performance (99.63% on LFW) at that time. SphereFace [8] adopts a 64-layer ResNet [12] with a large-margin loss function, achieving 99.42% accuracy on the LFW database. ArcFace [10] develops ResNet [12] with an IR block and achieves new state-of-the-art performance on several benchmarks.

Despite the success of CNNs, we still wonder whether the Transformer can be used in face recognition and whether it is better than ResNet-like CNNs. The Transformer has shown excellent performance when combined with large-scale databases [1], and there are already many large-scale training databases in face recognition, so it is interesting to observe the performance of the combination of the Transformer and large-scale face training databases. Perhaps the Transformer is just the right candidate to challenge the hegemony of CNNs over the face recognition task. It is known that the efficiency bottleneck of Transformer models is precisely their key component, the self-attention mechanism, which incurs a complexity of $O(n^2)$ with respect to the sequence length [16]. Of course efficiency is important for face recognition models, but in this paper we mainly examine the feasibility of applying Transformer models to face recognition and leave their potential efficiency problems aside.

We first experiment with a standard Transformer [17], as ViT [1] did. However, the original ViT directly flattens the image into patches, which may neglect inter-patch information, since some important facial features are partitioned into different tokens. To better describe the inter-patch information, we slightly modify the token generation method of ViT to make the image patches overlap, which improves the performance compared with the original ViT and does not increase the computing cost. Face Transformer models are trained on a large-scale training database, the MS-Celeb-1M [7] database, supervised with CosFace [9], and evaluated on several face recognition benchmarks including the LFW [18], SLLFW [19], CALFW [20], CPLFW [21], TALFW [22], CFP-FP [23], AgeDB-30 [24], and IJB-C [25] databases. Finally, we demonstrate that Transformer models trained on a large-scale database obtain performance comparable to CNNs with a similar number of parameters and MACs. In addition, we find, as expected, that the Transformer models attend to the face area.

The contribution of our work is that we show the feasibility of Transformer models in face recognition and report promising experimental results. How to further improve the performance and efficiency of Transformer models in face recognition is a promising task for future research.

II. FACE TRANSFORMER

In this paper, following the open-set face recognition pipeline [8], the Face Transformer is trained on face databases (with images $\boldsymbol{X}$ and labels $y$) in a supervised manner: face images are encoded by a well-designed network, and the output face image embeddings are supervised by an elaborate loss function [8], [9], [10] for better discriminative ability, as shown in Figure 1.

A. Network Architecture

The Face Transformer model follows the architecture of ViT [1], which applies the original Transformer [17]. The only difference is that we modify the token generation method of ViT to generate tokens with sliding patches, i.e., to make the image patches overlap, for a better description of the inter-patch information, as shown in Figure 1.
Fig. 1. The overall architecture of the Face Transformer. The face images are split into multiple patches and input as tokens to the Transformer encoder. To better describe the inter-patch information, we modify the token generation method of ViT [1] so that the image patches overlap slightly, which improves the performance compared with the original ViT. The Transformer encoder is basically a standard Transformer model [17]. Eventually, the face image embeddings can be used by the loss functions [9], [10]. The illustration is inspired by ViT [1].

Specifically, we extract sliding patches from the image $\boldsymbol{X} \in \mathbb{R}^{W \times W \times C}$ with patch size $P$ and stride $S$ (with implicit zero-padding on both sides of the input), and finally obtain a sequence of flattened 2D patches $\boldsymbol{X}_p \in \mathbb{R}^{N \times (P^2 \times C)}$. Here $(W, W)$ is the resolution of the original image and $(P, P)$ is the resolution of each image patch. The effective sequence length is the number of patches, $N = \lfloor \frac{W + 2p - (P-1) - 1}{S} + 1 \rfloor$, where $p$ is the amount of zero-padding.
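To make the patch generation concrete, below is a minimal PyTorch sketch of the sliding-patch tokenization described above; it is an illustration rather than the code in the linked repository. The use of nn.Unfold, the embedding dimension of 512, and the zero-padding of 2 for the $P=12$, $S=8$ setting are our assumptions. With these values, and reading the formula for $N$ as the per-side patch count, the overlapping variant keeps the same $14 \times 14 = 196$ tokens as the non-overlapping ViT-P8S8 setting, consistent with the claim that the overlap does not increase the computing cost.

```python
# Minimal sketch (not the official implementation) of overlapping sliding-patch
# tokenization: extract P x P patches with stride S and project them to dimension D.
import torch
import torch.nn as nn


class SlidingPatchEmbed(nn.Module):
    def __init__(self, img_size=112, patch_size=8, stride=8, padding=0,
                 in_chans=3, embed_dim=512):
        super().__init__()
        # nn.Unfold extracts sliding local blocks, i.e., the (possibly overlapping) patches.
        self.unfold = nn.Unfold(kernel_size=patch_size, stride=stride, padding=padding)
        self.proj = nn.Linear(patch_size * patch_size * in_chans, embed_dim)
        # Patches per side, following N = floor((W + 2p - (P - 1) - 1) / S + 1).
        self.num_side = (img_size + 2 * padding - (patch_size - 1) - 1) // stride + 1

    def forward(self, x):                   # x: (B, C, W, W)
        patches = self.unfold(x)            # (B, P*P*C, num_side**2)
        patches = patches.transpose(1, 2)   # (B, num_side**2, P*P*C)
        return self.proj(patches)           # (B, num_side**2, D)


if __name__ == "__main__":
    imgs = torch.randn(2, 3, 112, 112)
    p8s8 = SlidingPatchEmbed(patch_size=8, stride=8, padding=0)    # "ViT-P8S8": no overlap
    p12s8 = SlidingPatchEmbed(patch_size=12, stride=8, padding=2)  # "ViT-P12S8": overlap (padding assumed)
    print(p8s8(imgs).shape)    # torch.Size([2, 196, 512])
    print(p12s8(imgs).shape)   # torch.Size([2, 196, 512])
```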
As in ViT, a trainable linear projection maps the flattened patches $\boldsymbol{X}_p$ to the model dimension $D$ and outputs the patch embeddings $\boldsymbol{X}_p\boldsymbol{E}$. The class token, i.e., a learnable embedding $\boldsymbol{X}_{class} = \boldsymbol{z}_0^0$, is concatenated to the patch embeddings, and its state at the output of the Transformer encoder, $\boldsymbol{z}_L^0$, is the final face image embedding, as in Equation 2. Then, position embeddings are added to the patch embeddings to retain positional information. The final embedding

$\boldsymbol{z}_0 = [\boldsymbol{X}_{class};\ \boldsymbol{X}_p^1\boldsymbol{E};\ \boldsymbol{X}_p^2\boldsymbol{E};\ \ldots;\ \boldsymbol{X}_p^N\boldsymbol{E}] + \boldsymbol{E}_{pos}$,   (1)

serves as the input to the Transformer,

$\boldsymbol{z}'_l = \mathrm{MSA}(\mathrm{LN}(\boldsymbol{z}_{l-1})) + \boldsymbol{z}_{l-1}, \quad l = 1, \ldots, L$,
$\boldsymbol{z}_l = \mathrm{MLP}(\mathrm{LN}(\boldsymbol{z}'_l)) + \boldsymbol{z}'_l, \quad l = 1, \ldots, L$,   (2)
$\boldsymbol{x} = \mathrm{LN}(\boldsymbol{z}_L^0)$,

which consists of multi-headed self-attention (MSA) and MLP blocks, with LayerNorm (LN) before each block and residual connections after each block, as shown in Figure 1. In Equation 2, the output $\boldsymbol{x}$ is the final output of the Transformer model.

One of the key blocks of the Transformer, MSA, is composed of $k$ parallel self-attention (SA) heads,

$[\boldsymbol{q}, \boldsymbol{k}, \boldsymbol{v}] = \boldsymbol{z}\boldsymbol{U}_{qkv}$,
$\mathrm{SA}(\boldsymbol{z}) = \mathrm{softmax}(\boldsymbol{q}\boldsymbol{k}^T/\sqrt{D_h})\,\boldsymbol{v}$,   (3)

where $\boldsymbol{z} \in \mathbb{R}^{(N+1) \times D}$ is the input sequence, $\boldsymbol{U}_{qkv} \in \mathbb{R}^{D \times 3D_h}$ is the weight matrix of the linear transformation, and $\boldsymbol{A} = \mathrm{softmax}(\boldsymbol{q}\boldsymbol{k}^T/\sqrt{D_h})$ is the attention map. The output of MSA is the concatenation of the $k$ attention head outputs,

$\mathrm{MSA}(\boldsymbol{z}) = [\mathrm{SA}_1(\boldsymbol{z});\ \mathrm{SA}_2(\boldsymbol{z});\ \ldots;\ \mathrm{SA}_k(\boldsymbol{z})]\,\boldsymbol{U}_{msa}$,   (4)

where $\boldsymbol{U}_{msa} \in \mathbb{R}^{k D_h \times D}$.
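For readers who prefer code to notation, here is a minimal PyTorch sketch of one pre-LN encoder block implementing Equations 2-4. The head count (8), hidden size (512) and MLP size (2048) follow the ViT configuration reported in Section III-A; the remaining details (bias-free projections, GELU activation) are assumptions, and this is not the official implementation.

```python
# Minimal sketch of Equations 2-4: pre-LN multi-headed self-attention (MSA) and MLP
# blocks with residual connections. Not the official implementation.
import torch
import torch.nn as nn


class MSA(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.heads, self.dh = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)   # U_qkv of Eq. 3
        self.proj = nn.Linear(dim, dim, bias=False)      # U_msa of Eq. 4

    def forward(self, z):                                # z: (B, N+1, D)
        B, T, D = z.shape
        q, k, v = self.qkv(z).chunk(3, dim=-1)
        # Split into k heads of dimension D_h each.
        q, k, v = (t.view(B, T, self.heads, self.dh).transpose(1, 2) for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.dh ** 0.5, dim=-1)  # A of Eq. 3
        out = (attn @ v).transpose(1, 2).reshape(B, T, D)  # concatenate the head outputs
        return self.proj(out)


class EncoderBlock(nn.Module):
    def __init__(self, dim=512, heads=8, mlp_dim=2048):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.msa = MSA(dim, heads)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, z):
        z = z + self.msa(self.ln1(z))   # first line of Eq. 2
        z = z + self.mlp(self.ln2(z))   # second line of Eq. 2
        return z
```

Stacking $L = 20$ such blocks, prepending the class token of Equation 1, and applying a final LayerNorm to its output state yields the face image embedding $\boldsymbol{x}$.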
B. Loss Function

The output $\boldsymbol{x}$ of Equation 2, i.e., the final output of the Transformer model, is supervised by an elaborate loss function [8], [9], [10] for better discriminative ability,

$L = -\log P_y = -\log \frac{e^{\boldsymbol{W}_y^T\boldsymbol{x} + b_y}}{\sum_{j=1}^{C} e^{\boldsymbol{W}_j^T\boldsymbol{x} + b_j}}$,   (5)

where $y$ is the label, $P_y$ is the predicted probability of assigning $\boldsymbol{x}$ to class $y$, $C$ is the number of identities, $\boldsymbol{W}_j$ is the $j$-th column of the weight matrix of the last fully connected layer, and $\boldsymbol{b} \in \mathbb{R}^C$ is the bias. Softmax-based loss functions [26], [8], [9], [10] remove the bias term, transform $\boldsymbol{W}_j^T\boldsymbol{x}$ into $s\cos\theta_j$, and incorporate a large margin in the $\cos\theta_{y_i}$ term [8], [9], [10]. Therefore, softmax-based loss functions can be formulated as

$L = -\frac{1}{N}\sum_{i=1}^{N}\log P_{y_i} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s f(\theta_{y_i})}}{e^{s f(\theta_{y_i})} + \sum_{j=1, j\neq y_i}^{C} e^{s\cos\theta_j}}$,   (6)

where $f(\theta_{y_i}) = \cos\theta_{y_i} - m$ in CosFace [9] (here $N$ is the number of training samples in a batch, not the number of patches).
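Below is a minimal PyTorch sketch of the CosFace loss of Equation 6 with the settings $s = 64$ and $m = 0.35$ used in the experiments; the feature and weight normalization follow the CosFace paper [9], and this is an illustration rather than the authors' implementation.

```python
# Minimal sketch of the CosFace (large margin cosine) loss of Equation 6,
# with f(theta) = cos(theta) - m, s = 64, m = 0.35. Not the official implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CosFaceLoss(nn.Module):
    def __init__(self, embed_dim=512, num_classes=93431, s=64.0, m=0.35):
        super().__init__()
        self.s, self.m = s, m
        # One weight vector W_j per identity; the bias term is removed.
        self.weight = nn.Parameter(torch.randn(num_classes, embed_dim))

    def forward(self, x, labels):            # x: (B, 512) embeddings, labels: (B,)
        # cos(theta_j) = <W_j / ||W_j||, x / ||x||>
        cos = F.linear(F.normalize(x), F.normalize(self.weight))
        # Subtract the margin m from the target-class cosine only.
        margin = torch.zeros_like(cos).scatter_(1, labels.view(-1, 1), self.m)
        return F.cross_entropy(self.s * (cos - margin), labels)
```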
III. EXPERIMENT

A. Implementation Details

We use two training databases, CASIA-WebFace and MS-Celeb-1M [7]. CASIA-WebFace is a widely used training database containing 0.49M images from 10,575 celebrities, which can be seen as relatively small-scale compared with million-scale databases [7]. MS-Celeb-1M is a popular large-scale training database in face recognition; we use the clean version refined by InsightFace [10], which contains 5.3M images of 93,431 celebrities. We choose CosFace [9] ($s = 64$ and $m = 0.35$) as the loss function for better convergence and recognition performance.
The face images are aligned and resized to 112 × 112. Horizontal flipping with a probability of 50% is used for training data augmentation.

For comparison, the CNN architecture used in our work is the modified ResNet-100 [12] proposed in the first version of the ArcFace paper [10], which uses IR blocks (BN-Conv-BN-PReLU-Conv-BN) and applies the "BN [27]-Dropout [28]-FC-BN" structure to obtain the final 512-D embedding feature. We also experiment with the recently proposed T2T-ViT [5]. The number of parameters, MACs and inference speed (Tesla V100, Intel Xeon E5-2698 v4) of these face recognition models are listed in Table I. Details are as follows. For the ViT models, the number of layers is 20, the number of heads is 8, the hidden size is 512, and the MLP size is 2048. For the Token-to-Token part of the T2T-ViT model, the depth is 2, the hidden dimension is 64, and the MLP size is 512; for the backbone, the number of layers is 24, the number of heads is 8, the hidden size is 512, and the MLP size is 2048. Note that "ViT-P10S8" denotes a ViT model with 10 × 10 patches and stride $S = 8$, while "ViT-P8S8" denotes the model with no overlap between tokens.

TABLE I
Number of Parameters, MACs and Inference Speed of Face Recognition Models.

Models          | Params (M) | MACs (G) | Img/Sec
ResNet-100 [12] | 65.1       | 12.1     | 41.73
ViT-P8S8 [1]    | 63.2       | 12.4     | 41.72
T2T-ViT [5]     | 63.5       | 12.7     | 38.08
ViT-P10S8       | 63.3       | 12.4     | 44.59
ViT-P12S8       | 63.3       | 12.4     | 42.45
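For reference, the following is a minimal sketch of the IR block and the "BN-Dropout-FC-BN" output head described above. The channel widths, dropout rate (0.4) and 7 × 7 final feature map are illustrative assumptions, not the exact configuration of the ResNet-100 baseline.

```python
# Minimal sketch of the IR residual block (BN-Conv-BN-PReLU-Conv-BN) and the
# "BN-Dropout-FC-BN" embedding head used by the ResNet-100 baseline from ArcFace [10].
# Hyperparameters here are assumptions for illustration.
import torch
import torch.nn as nn


class IRBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.Conv2d(in_ch, out_ch, 3, 1, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.PReLU(out_ch),
            nn.Conv2d(out_ch, out_ch, 3, stride, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection shortcut when the shape changes, identity otherwise.
        self.shortcut = (nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                                       nn.BatchNorm2d(out_ch))
                         if stride != 1 or in_ch != out_ch else nn.Identity())

    def forward(self, x):
        return self.body(x) + self.shortcut(x)


def embedding_head(channels=512, spatial=7, embed_dim=512, p_drop=0.4):
    # "BN-Dropout-FC-BN" head producing the final 512-D embedding feature.
    return nn.Sequential(nn.BatchNorm2d(channels), nn.Dropout(p_drop), nn.Flatten(),
                         nn.Linear(channels * spatial * spatial, embed_dim),
                         nn.BatchNorm1d(embed_dim))
```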
We use AdamW [29] and cosine learning rate decay following DeiT [4]. The models are trained from scratch without pre-training. With 1 warmup epoch, the initial learning rate is set to 3e-4, and we lower it to 1e-4 when the training accuracy reaches a stable stage (after about 20 epochs).
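A minimal sketch of this optimization setup (AdamW with one warmup epoch followed by cosine learning rate decay from 3e-4) is shown below; the weight decay value, total epoch count and per-iteration scheduling are assumptions, and the manual drop to 1e-4 described above is not reproduced.

```python
# Minimal sketch of the training schedule: AdamW, linear warmup for one epoch,
# then cosine decay of the learning rate (initial value 3e-4). Assumed values:
# weight decay 0.05 and 30 total epochs; not the official training script.
import math
import torch


def build_optimizer(model, steps_per_epoch, epochs=30, warmup_epochs=1, base_lr=3e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch

    def lr_lambda(step):
        if step < warmup_steps:                      # linear warmup over the first epoch
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler   # call optimizer.step() and scheduler.step() each iteration
```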

B. Results on Mainstream Benchmarks

We mainly report the recognition performance of the models on several mainstream benchmarks, including the LFW [18], SLLFW [19], CALFW [20], CPLFW [21], TALFW [22], CFP-FP [23], AgeDB-30 [24], and IJB-C [25] databases. The LFW database contains 13,233 face images from 5,749 different identities and is a classic benchmark for unconstrained face verification. The Similar-Looking LFW (SLLFW), Cross-Age LFW (CALFW), Cross-Pose LFW (CPLFW) and Transferable Adversarial LFW (TALFW) databases are constructed based on the LFW database to emphasize, respectively, the similar-looking challenge, the cross-age challenge, the cross-pose challenge, and the adversarial robustness of face recognition. The CFP-FP database is built for evaluating large pose variations, and the AgeDB-30 database is a manually collected cross-age database. The IJB-C database contains both still images and video frames to address unconstrained face recognition.

The experimental results are shown in Table II and Table III. We first find in Table II that Face Transformer models trained on the CASIA-WebFace database perform much worse than ResNet-100. In fact, the accuracy of Face Transformer models trained on CASIA-WebFace can reach a level as high as ResNet-100 during training, while the models cannot generalize well to the test databases, which indicates that the scale of CASIA-WebFace may not be enough for Transformer models.

Things change when we use a much larger training database, MS-Celeb-1M: Face Transformer models demonstrate promising results on large-scale face training databases, and their performance is competitive with ResNet-100 at a similar number of parameters and MACs. Compared with "ViT-P8S8", "ViT-P10S8" and "ViT-P12S8" perform better, which demonstrates that the overlapping patches help to some degree. T2T-ViT also obtains good performance, although, with limited computing resources, more hyper-parameter settings for the T2T block remain to be tried. Another interesting point is that the Transformer models obtain somewhat higher accuracy on the TALFW database, which contains transferable adversarial noise. Since the TALFW database is generated using CNNs as surrogate models, there seems to be no significant peculiarity of the Transformer in terms of adversarial robustness. It would be interesting to explore the combination of Face Transformer models and adversarial training.

TABLE II
Performance on the LFW [18], SLLFW [19], CALFW [20], CPLFW [21], TALFW [22], CFP-FP [23] and AgeDB-30 [24] databases.

Training Data | Models          | LFW   | SLLFW | CALFW | CPLFW | TALFW | CFP-FP | AgeDB-30
CASIA-WebFace | ResNet-100 [12] | 99.55 | 98.65 | 94.13 | 90.93 | 53.17 | 96.30  | 95.50
CASIA-WebFace | ViT-P8S8 [1]    | 97.32 | 90.78 | 86.78 | 80.78 | 83.05 | 86.60  | 81.48
CASIA-WebFace | ViT-P12S8       | 97.42 | 90.07 | 87.35 | 81.60 | 84.00 | 85.56  | 81.48
MS-Celeb-1M   | ResNet-100 [12] | 99.82 | 99.67 | 96.27 | 93.43 | 64.88 | 96.93  | 98.27
MS-Celeb-1M   | ViT-P8S8 [1]    | 99.83 | 99.53 | 95.92 | 92.55 | 74.87 | 96.19  | 97.82
MS-Celeb-1M   | T2T-ViT [5]     | 99.82 | 99.63 | 95.85 | 93.00 | 71.93 | 96.59  | 98.07
MS-Celeb-1M   | ViT-P10S8       | 99.77 | 99.63 | 95.95 | 92.93 | 72.95 | 96.43  | 97.83
MS-Celeb-1M   | ViT-P12S8       | 99.80 | 99.55 | 96.18 | 93.08 | 70.13 | 96.77  | 98.05

TABLE III
Comparison of different models trained on MS-Celeb-1M on the IJB-C database [25] (Verification 1:1 TAR@FAR).

Models          | 1e-4  | 1e-3  | 1e-2  | 1e-1
ResNet-100 [12] | 96.36 | 97.36 | 98.41 | 99.13
ViT-P8S8 [1]    | 95.96 | 97.28 | 98.22 | 98.99
T2T-ViT [5]     | 95.67 | 97.10 | 98.14 | 98.90
ViT-P10S8       | 96.06 | 97.45 | 98.23 | 98.96
ViT-P12S8       | 96.31 | 97.49 | 98.38 | 99.04

C. Discussion

1) Attention Area Analysis: Since the key of Transformer models is the self-attention mechanism, we study how the Transformer models concentrate on face images by analyzing the ViT-P12S8 model trained on MS-Celeb-1M. Specifically, we use the Attention Rollout [30] method, which recursively multiplies the modified attention matrices $0.5\boldsymbol{A} + 0.5\boldsymbol{I}$ of all layers, where $\boldsymbol{A} = \mathrm{softmax}(\boldsymbol{q}\boldsymbol{k}^T/\sqrt{D_h})$ is the attention map of Equation 3. We find that the Transformer models attend to the face area as expected, as shown in Figure 2.

Fig. 2. With the help of the Attention Rollout [30] technique, we analyze how the Transformer models (MS-Celeb-1M, ViT-P12S8) concentrate on face images, and find that Face Transformer models attend to the face area as expected.
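A minimal sketch of the Attention Rollout computation described above follows; averaging over the $k$ heads and re-normalizing the rows are details taken from the original method [30] and are assumptions here, since the text only specifies the recursive multiplication of $0.5\boldsymbol{A} + 0.5\boldsymbol{I}$.

```python
# Minimal sketch of Attention Rollout [30]: mix each layer's attention map with the
# identity (0.5*A + 0.5*I) and multiply the results across layers. Head averaging and
# row re-normalization follow [30] and are assumptions here.
import torch


def attention_rollout(attn_maps):
    """attn_maps: list of per-layer attention tensors of shape (heads, N+1, N+1)."""
    tokens = attn_maps[0].shape[-1]
    rollout = torch.eye(tokens)
    for attn in attn_maps:
        a = attn.mean(dim=0)                       # average over the k heads
        a = 0.5 * a + 0.5 * torch.eye(tokens)      # modified attention matrix
        a = a / a.sum(dim=-1, keepdim=True)        # keep each row a distribution
        rollout = a @ rollout                      # recursive multiplication over layers
    # Row 0 (the class token) shows how much each patch contributes to the embedding.
    return rollout[0, 1:]
```

Reshaping the returned vector to the patch grid and upsampling it to the input resolution gives heat maps of the kind shown in Figure 2.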
2) Attention Matrices Visualization: To further understand the Transformer models (MS-Celeb-1M, ViT-P12S8), we visualize the attention matrices of different layers and calculate the mean attention distance in the image space, which can be regarded as the analogue of the receptive field in CNNs [1], as shown in Figure 3. We find that although the deepest layers attend to long-distance relationships, the attention distance of the lowest layers in Face Transformer models appears to be longer than in the original ViT [1].

Fig. 3. (1) Visualization of the attention matrices of different layers. (2) Mean attention distance of the attended area by head and network depth.
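A minimal sketch of the mean attention distance computation is given below, under the assumption that it is the attention-weighted average pixel distance between patch centers, as defined for ViT [1]; the 14 × 14 grid and 8-pixel patch spacing correspond to the ViT-P12S8 tokenization sketched earlier.

```python
# Minimal sketch of mean attention distance: for each head, average the pixel distance
# between query and key patch centers, weighted by the attention map (class token dropped).
import torch


def mean_attention_distance(attn, grid=14, stride=8):
    """attn: (heads, N, N) attention over the N = grid*grid patch tokens."""
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float() * stride  # patch centers in pixels
    dist = torch.cdist(coords, coords)                # (N, N) pairwise pixel distances
    return (attn * dist).sum(dim=-1).mean(dim=-1)     # (heads,) mean distance per head
```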
3) Occlusion Robustness: The key of Face Transformer models is the self-attention mechanism, and they seem to concentrate more on the whole face; we therefore wonder whether they are more robust at classifying partially occluded face images. To explore the occlusion robustness of Face Transformer models, we apply random occlusion (zero values) to the face images of several test datasets and test the recognition performance of the models as the occlusion area increases. The experimental results are shown in Figure 4. We find that the performance of Face Transformer models decreases more than that of ResNet-100, which indicates that Face Transformer models behave no better than CNNs in terms of occlusion robustness.

Fig. 4. The recognition performance of the Face Transformer model and ResNet-100 as the occlusion area increases.
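A minimal sketch of this occlusion protocol is given below; using a single square region and sweeping a fixed set of area ratios are our assumptions, since the text only specifies random zero-valued occlusion with increasing area.

```python
# Minimal sketch of the occlusion test: zero out a randomly placed square region
# covering a given fraction of the aligned 112x112 face image before evaluation.
# The square shape and the swept ratios are assumptions for illustration.
import torch


def random_occlude(img, area_ratio, generator=None):
    """img: (C, H, W) tensor; area_ratio: fraction of pixels to set to zero."""
    c, h, w = img.shape
    side = min(int((area_ratio * h * w) ** 0.5), h, w)   # side length of the square mask
    top = torch.randint(0, h - side + 1, (1,), generator=generator).item()
    left = torch.randint(0, w - side + 1, (1,), generator=generator).item()
    occluded = img.clone()
    occluded[:, top:top + side, left:left + side] = 0.0
    return occluded

# Example sweep (evaluate() and pairs are placeholders for a verification protocol):
# for ratio in (0.1, 0.2, 0.3, 0.4, 0.5):
#     acc = evaluate(model, [(random_occlude(a, ratio), random_occlude(b, ratio))
#                            for a, b in pairs])
```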

4) Abortive Attempts and Observations: In addition to the reported models, we would like to share some of our abortive attempts and observations. Note that these observations may not be rigorous enough to draw conclusions, but they may be helpful for readers.

(1) We first tried SGD, as in previous works [9], [10], to train Face Transformer models, but the models could not converge, so we finally applied AdamW, which has proved to be an effective optimizer for Transformer models.

(2) We tried removing $\boldsymbol{X}_{class}$ ($\boldsymbol{z}_0^0$) and using the mean pooling of the other token outputs instead. Compared with using $\boldsymbol{z}_L^0$ as the output, the recognition performance decreases slightly, while the accuracy on the TALFW database increases to more than 85%.

(3) We tried removing the MLP block to improve efficiency, but found that the training accuracy could not increase to a normal level, which indicates that the MLP block is essential for Face Transformer models.

IV. CONCLUSION

In this paper, we investigate the feasibility of applying Transformer models to face recognition. We have demonstrated that Face Transformer models cannot work well with a relatively small database, CASIA-WebFace, while they can obtain promising performance on the large-scale face training database MS-Celeb-1M. In addition, we have provided some analyses for a better understanding of the Face Transformer models.
REFERENCES

[1] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[2] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in European Conference on Computer Vision. Springer, 2020, pp. 213–229.
[3] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong, "End-to-end dense video captioning with masked transformer," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8739–8748.
[4] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, "Training data-efficient image transformers & distillation through attention," arXiv preprint arXiv:2012.12877, 2020.
[5] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, F. E. Tay, J. Feng, and S. Yan, "Tokens-to-token vit: Training vision transformers from scratch on imagenet," arXiv preprint arXiv:2101.11986, 2021.
[6] K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, and Y. Wang, "Transformer in transformer," arXiv preprint arXiv:2103.00112, 2021.
[7] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, "Ms-celeb-1m: A dataset and benchmark for large-scale face recognition," in European Conference on Computer Vision. Springer, 2016, pp. 87–102.
[8] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, "Sphereface: Deep hypersphere embedding for face recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 212–220.
[9] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, "Cosface: Large margin cosine loss for deep face recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5265–5274.
[10] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, "Arcface: Additive angular margin loss for deep face recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4690–4699.
[11] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[12] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[13] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "Deepface: Closing the gap to human-level performance in face verification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1701–1708.
[14] F. Schroff, D. Kalenichenko, and J. Philbin, "Facenet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
[15] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[16] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu et al., "A survey on visual transformer," arXiv preprint arXiv:2012.12556, 2020.
[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[18] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," University of Massachusetts, Amherst, Tech. Rep. 07-49, October 2007.
[19] W. Deng, J. Hu, N. Zhang, B. Chen, and J. Guo, "Fine-grained face verification: Fglfw database, baselines, and human-dcmn partnership," Pattern Recognition, vol. 66, pp. 63–73, 2017.
[20] T. Zheng, W. Deng, and J. Hu, "Cross-age LFW: A database for studying cross-age face recognition in unconstrained environments," arXiv:1708.08197, 2017.
[21] T. Zheng and W. Deng, "Cross-pose LFW: A database for studying cross-pose face recognition in unconstrained environments," Beijing University of Posts and Telecommunications, Tech. Rep. 18-01, February 2018.
[22] Y. Zhong and W. Deng, "Towards transferable adversarial attack against deep face recognition," IEEE Transactions on Information Forensics and Security, vol. 16, pp. 1452–1466, 2020.
[23] S. Sengupta, J.-C. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs, "Frontal to profile face verification in the wild," in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2016, pp. 1–9.
[24] S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, I. Kotsia, and S. Zafeiriou, "Agedb: the first manually collected, in-the-wild age database," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 51–59.
[25] B. Maze, J. Adams, J. A. Duncan, N. Kalka, T. Miller, C. Otto, A. K. Jain, W. T. Niggel, J. Anderson, J. Cheney et al., "Iarpa janus benchmark-c: Face dataset and protocol," in 2018 International Conference on Biometrics (ICB). IEEE, 2018, pp. 158–165.
[26] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille, "Normface: L2 hypersphere embedding for face verification," in Proceedings of the 25th ACM International Conference on Multimedia. ACM, 2017, pp. 1041–1049.
[27] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[28] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[29] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," arXiv preprint arXiv:1711.05101, 2017.
[30] S. Abnar and W. Zuidema, "Quantifying attention flow in transformers," arXiv preprint arXiv:2005.00928, 2020.
