Face Transformer For Recognition
Abstract—Recently there has been a growing interest in the Transformer not only in NLP but also in computer vision. We wonder whether the Transformer can be used in face recognition and whether it is better than CNNs. Therefore, we investigate the performance of Transformer models in face recognition. Considering that the original Transformer may neglect inter-patch information, we modify the patch generation process so that tokens are generated from sliding patches that overlap with each other. The models are trained on the CASIA-WebFace and MS-Celeb-1M databases and evaluated on several mainstream benchmarks, including the LFW, SLLFW, CALFW, CPLFW, TALFW, CFP-FP, AgeDB and IJB-C databases. We demonstrate that Face Transformer models trained on a large-scale database, MS-Celeb-1M, achieve performance comparable to CNNs with a similar number of parameters and MACs. To facilitate further research, Face Transformer models and code are available at https://github.com/zhongyy/Face-Transformer.

Index Terms—Face Recognition, Neural networks, Transformer.

The authors are with the Pattern Recognition and Intelligent System Laboratory, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China (e-mail: zhongyaoyao@bupt.edu.cn; whdeng@bupt.edu.cn).

I. INTRODUCTION

Recently it has become a popular trend to apply the Transformer to different computer vision tasks, including image classification [1], object detection [2], video processing [3] and so on. Although the inner workings of the Transformer are not yet fully understood, researchers keep proposing new ways to apply it [4], [5], [6] because of its strong representation ability.

Based on large-scale training databases [7] and effective loss functions [8], [9], [10], convolutional neural networks (CNNs), from VGGNet [11] to ResNet [12], have achieved great success in face recognition over the past few years [10]. DeepFace [13] first uses a 9-layer CNN for face recognition and obtains 97.35% accuracy on the LFW database. FaceNet [14] adopts GoogleNet [15], assisted by a private large-scale dataset, achieving state-of-the-art performance (99.63% on LFW) at that time. SphereFace [8] adopts a 64-layer ResNet [12] with a large-margin loss function, achieving 99.42% accuracy on the LFW database. ArcFace [10] develops ResNet [12] with an IR block and achieves new state-of-the-art performance on several benchmarks.

Despite the success of CNNs, we still wonder whether the Transformer can be used in face recognition and whether it is better than ResNet-like CNNs. The Transformer has shown excellent performance when combined with large-scale databases [1], and there are already many large-scale training databases in face recognition, so it is interesting to observe how the combination of the Transformer and large-scale face training databases performs. Perhaps the Transformer is well placed to challenge the hegemony of CNNs in face recognition. It is known that the efficiency bottleneck of Transformer models lies in their key component, the self-attention mechanism, which incurs $O(n^2)$ complexity with respect to the sequence length [16]. Efficiency is of course important for face recognition models, but in this paper we mainly examine the feasibility of applying Transformer models to face recognition and leave their potential efficiency problems aside.

We first experiment with a standard Transformer [17], as ViT [1] did. However, the original ViT directly flattens the image into non-overlapping patches, which may neglect inter-patch information, since some important facial features are partitioned into different tokens. To better describe the inter-patch information, we slightly modify the token generation method of ViT so that the image patches overlap, which improves performance compared with the original ViT without increasing the computational cost. Face Transformer models are trained on a large-scale training database, MS-Celeb-1M [7], supervised with CosFace [9], and evaluated on several face recognition benchmarks including the LFW [18], SLLFW [19], CALFW [20], CPLFW [21], TALFW [22], CFP-FP [23], AgeDB-30 [24], and IJB-C [25] databases. Finally, we demonstrate that Transformer models trained on a large-scale database obtain performance comparable to CNNs with a similar number of parameters and MACs. In addition, we find that the Transformer models attend to the face area, as expected.

The contribution of our work is that we show the feasibility of Transformer models in face recognition and report promising experimental results. How to further improve the performance and efficiency of Transformer models in face recognition is a promising direction for future research.

II. FACE TRANSFORMER

In this paper, following the open-set face recognition pipeline [8], the Face Transformer is trained on face databases (with images $\boldsymbol{X}$ and labels $y$) in a supervised manner: face images are encoded by a well-designed network, and the output face image embeddings are supervised by an elaborate loss function [8], [9], [10] for better discriminative ability, as shown in Figure 1.
Fig. 1. The overall architecture of the Face Transformer. The face images are split into multiple patches and fed as tokens to the Transformer encoder. To better describe the inter-patch information, we modify the token generation method of ViT [1] so that the image patches overlap slightly, which improves the performance compared with the original ViT. The Transformer encoder is basically a standard Transformer model [17]. Eventually, the face image embeddings can be used for the loss functions [9], [10]. The illustration is inspired by ViT [1].
A. Network Architecture

The Face Transformer model follows the architecture of ViT [1], which applies the original Transformer [17]. The only difference is that we modify the token generation method of ViT to generate tokens from sliding patches, i.e., to make the image patches overlap, for a better description of the inter-patch information, as shown in Figure 1. Specifically, we extract sliding patches from the image $\boldsymbol{X} \in \mathbb{R}^{W \times W \times C}$ with patch size $P$ and stride $S$ (with implicit zero padding on both sides of the input), and finally obtain a sequence of flattened 2D patches $\boldsymbol{X}_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$. Here, $(W, W)$ is the resolution of the original image and $(P, P)$ is the resolution of each image patch. The effective sequence length is the number of patches $N$, with $\lfloor (W + 2 \times p - (P - 1) - 1)/S + 1 \rfloor$ patches along each side of the image, where $p$ is the amount of zero padding.
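As a minimal illustration of this token generation step (a sketch with assumed shapes and a hypothetical module name, not the released implementation; see the linked repository for the official code), overlapping patches can be extracted with a strided unfold operation and then linearly projected to the model dimension $D$:

import torch
import torch.nn as nn

class OverlappingPatchEmbed(nn.Module):
    # Extract sliding (possibly overlapping) patches and project them to dimension D.
    # The example values below (P=12, S=8, p=2 on a 112x112x3 face crop) are assumptions,
    # not settings taken from the paper.
    def __init__(self, patch_size=12, stride=8, padding=2, in_chans=3, dim=512):
        super().__init__()
        # nn.Unfold slides a P x P window with the given stride and zero padding and
        # returns each patch flattened to length P*P*C.
        self.unfold = nn.Unfold(kernel_size=patch_size, stride=stride, padding=padding)
        self.proj = nn.Linear(patch_size * patch_size * in_chans, dim)  # the projection E

    def forward(self, x):                  # x: (B, C, W, W)
        patches = self.unfold(x)           # (B, P*P*C, N)
        patches = patches.transpose(1, 2)  # (B, N, P*P*C); N follows the floor formula above
        return self.proj(patches)          # (B, N, D) patch embeddings X_p E

tokens = OverlappingPatchEmbed()(torch.randn(2, 3, 112, 112))
print(tokens.shape)  # torch.Size([2, 196, 512]); 14 x 14 = 196 overlapping 12x12 patches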
As in ViT, a trainable linear projection maps the flattened patches $\boldsymbol{X}_p$ to the model dimension $D$ and outputs the patch embeddings $\boldsymbol{X}_p\boldsymbol{E}$. A class token, i.e., a learnable embedding ($\boldsymbol{X}_{class} = \boldsymbol{z}_0^0$), is concatenated to the patch embeddings, and its state at the output of the Transformer encoder ($\boldsymbol{z}_L^0$) is taken as the final face image embedding (Equation 2). Position embeddings are then added to the patch embeddings to retain positional information. The final embedding

$$\boldsymbol{z}_0 = [\boldsymbol{X}_{class};\, \boldsymbol{X}_p^1\boldsymbol{E};\, \boldsymbol{X}_p^2\boldsymbol{E};\, \ldots;\, \boldsymbol{X}_p^N\boldsymbol{E}] + \boldsymbol{E}_{pos}, \qquad (1)$$

serves as input to the Transformer,

$$\boldsymbol{z}'_l = \mathrm{MSA}(\mathrm{LN}(\boldsymbol{z}_{l-1})) + \boldsymbol{z}_{l-1}, \quad l = 1, \ldots, L,$$
$$\boldsymbol{z}_l = \mathrm{MLP}(\mathrm{LN}(\boldsymbol{z}'_l)) + \boldsymbol{z}'_l, \quad l = 1, \ldots, L, \qquad (2)$$
$$\boldsymbol{x} = \mathrm{LN}(\boldsymbol{z}_L^0),$$

which consists of multi-headed self-attention (MSA) and MLP blocks, with LayerNorm (LN) applied before each block and residual connections after each block, as shown in Figure 1. In Equation 2, the output $\boldsymbol{x}$ is the final output of the Transformer model.
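For concreteness, Equations 1 and 2 can be sketched with standard PyTorch modules as follows (an illustrative approximation: the class name SimpleViTEncoder and the use of nn.TransformerEncoderLayer are assumptions, and details such as the MLP activation may differ from the released model):

import torch
import torch.nn as nn

class SimpleViTEncoder(nn.Module):
    # Eq. (1): prepend the class token and add position embeddings.
    # Eq. (2): L pre-norm Transformer blocks (MSA + MLP with residual connections).
    def __init__(self, num_patches, dim=512, depth=20, heads=8, mlp_dim=2048):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                # X_class
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # E_pos
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=mlp_dim,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_embeddings):    # (B, N, D), i.e. the patch embeddings X_p E
        b = patch_embeddings.size(0)
        cls = self.cls_token.expand(b, -1, -1)
        z0 = torch.cat([cls, patch_embeddings], dim=1) + self.pos_embed  # Eq. (1)
        zL = self.blocks(z0)                                             # Eq. (2)
        return self.norm(zL[:, 0])          # x = LN(z_L^0), the face image embedding

# Continuing the previous sketch:
# embedding = SimpleViTEncoder(num_patches=tokens.size(1))(tokens)  # (B, 512)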
One of the key blocks of the Transformer, MSA, is composed of $k$ parallel self-attention (SA) heads,

$$[\boldsymbol{q}, \boldsymbol{k}, \boldsymbol{v}] = \boldsymbol{z}\boldsymbol{U}_{qkv},$$
$$\mathrm{SA}(\boldsymbol{z}) = \mathrm{softmax}\!\left(\boldsymbol{q}\boldsymbol{k}^T/\sqrt{D_h}\right)\boldsymbol{v}, \qquad (3)$$

where $\boldsymbol{z} \in \mathbb{R}^{(N+1)\times D}$ is an input sequence, $\boldsymbol{U}_{qkv} \in \mathbb{R}^{D\times 3D_h}$ is the weight matrix for the linear transformation, and $\boldsymbol{A} = \mathrm{softmax}(\boldsymbol{q}\boldsymbol{k}^T/\sqrt{D_h})$ is the attention map. The output of MSA is the concatenation of the $k$ attention head outputs,

$$\mathrm{MSA}(\boldsymbol{z}) = [\mathrm{SA}_1(\boldsymbol{z});\, \mathrm{SA}_2(\boldsymbol{z});\, \ldots;\, \mathrm{SA}_k(\boldsymbol{z})]\,\boldsymbol{U}_{msa}, \qquad (4)$$

where $\boldsymbol{U}_{msa} \in \mathbb{R}^{k \cdot D_h \times D}$.
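A minimal multi-head self-attention sketch matching Equations 3 and 4 (illustrative only; the module and variable names are ours, and the per-head dimension is assumed to be $D_h = D/k$):

import torch
import torch.nn as nn

class MSA(nn.Module):
    # k parallel self-attention heads, Eqs. (3)-(4).
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.heads = heads
        self.dh = dim // heads                            # D_h
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)    # U_qkv
        self.out = nn.Linear(dim, dim, bias=False)        # U_msa

    def forward(self, z):                                 # z: (B, N+1, D)
        b, n, d = z.shape
        q, k, v = self.qkv(z).chunk(3, dim=-1)            # [q, k, v] = z U_qkv
        # split into heads: (B, heads, N+1, D_h)
        q, k, v = (t.view(b, n, self.heads, self.dh).transpose(1, 2) for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.dh ** 0.5, dim=-1)  # attention map A
        heads_out = attn @ v                              # SA_i(z) for each head
        heads_out = heads_out.transpose(1, 2).reshape(b, n, d)  # [SA_1; ...; SA_k]
        return self.out(heads_out)                        # MSA(z)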
B. Loss Function

The output $\boldsymbol{x}$ of Equation 2, i.e., the final output of the Transformer model, is supervised by an elaborate loss function [8], [9], [10] for better discriminative ability,

$$L = -\log P_y = -\log \frac{e^{\boldsymbol{W}_y^T\boldsymbol{x} + b_y}}{\sum_{j=1}^{C} e^{\boldsymbol{W}_j^T\boldsymbol{x} + b_j}}, \qquad (5)$$

where $y$ is the label, $P_y$ is the predicted probability of assigning $\boldsymbol{x}$ to class $y$, $C$ is the number of identities, $\boldsymbol{W}_j$ is the $j$-th column of the weight of the last fully connected layer, and $\boldsymbol{b} \in \mathbb{R}^{C}$ is the bias. Softmax-based loss functions [26], [8], [9], [10] remove the bias term, transform $\boldsymbol{W}_j^T\boldsymbol{x}$ into $s\cos\theta_j$, and incorporate a large margin in the $\cos\theta_{y_i}$ term [8], [9], [10]. Therefore, softmax-based loss functions can be formulated as

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log P_{y_i} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s f(\theta_{y_i})}}{e^{s f(\theta_{y_i})} + \sum_{j=1,\, j\neq y_i}^{C} e^{s\cos\theta_j}}, \qquad (6)$$

where $f(\theta_{y_i}) = \cos\theta_{y_i} - m$ in CosFace [9].
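A hedged sketch of the margin loss of Equations 5 and 6 with $f(\theta_{y_i}) = \cos\theta_{y_i} - m$, using the $s = 64$ and $m = 0.35$ reported later in the paper (the feature and weight normalization follow the usual CosFace formulation; the class name CosFaceLoss and the initialization are assumptions, not the authors' code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class CosFaceLoss(nn.Module):
    # Eq. (6) with f(theta_y) = cos(theta_y) - m, the CosFace formulation.
    def __init__(self, dim=512, num_classes=93431, s=64.0, m=0.35):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_classes, dim))  # one weight vector W_j per identity
        self.s, self.m = s, m

    def forward(self, x, labels):            # x: (B, dim) embeddings, labels: (B,)
        cos = F.linear(F.normalize(x), F.normalize(self.W))   # cos(theta_j), shape (B, num_classes)
        margin = torch.zeros_like(cos)
        margin.scatter_(1, labels.unsqueeze(1), self.m)       # subtract m only at the target class
        logits = self.s * (cos - margin)
        return F.cross_entropy(logits, labels)                # -(1/N) sum_i log P_{y_i}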
III. EXPERIMENT

A. Implementation Details

We use two training databases, CASIA-WebFace and MS-Celeb-1M [7]. CASIA-WebFace is a widely used training database containing 0.49M images from 10,575 celebrities, which can be seen as relatively small compared with million-scale databases [7]. MS-Celeb-1M is a popular large-scale training database in face recognition, and we use the clean version refined by insightface [10], which contains 5.3M images of 93,431 celebrities. We choose CosFace [9] ($s = 64$ and $m = 0.35$) as the loss function for better convergence and recognition performance. The face images are aligned to 112 × 112. Horizontal flipping with a probability of 50% is used for training data augmentation.
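With torchvision, the training-time augmentation described above amounts to something like the following sketch (the face alignment itself depends on the detection toolchain and is not shown; the normalization constants are assumptions, not values from the paper):

import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),   # horizontal flip with 50% probability
    T.ToTensor(),
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # assumed normalization
])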
For comparison, the CNN architecture used in our work is a modified ResNet-100 [12] proposed in the first version of the ArcFace paper [10], which uses IR blocks (BN-Conv-BN-PReLU-Conv-BN) and applies the “BN [27]-Dropout [28]-FC-BN” structure to obtain the final 512-D embedding feature. We also experiment with the recently proposed T2T-ViT [5]. The number of parameters, MACs, and inference speed (Tesla V100, Intel Xeon E5-2698 v4) of these face recognition models are listed in Table I. Details are as follows. For the ViT models, the number of layers is 20, the number of heads is 8, the hidden size is 512, and the MLP size is 2048. For the Token-to-Token part of the T2T-ViT model, the depth is 2, the hidden dimension is 64, and the MLP size is 512; for the backbone, the number of layers is 24, the number of heads is 8, the hidden size is 512, and the MLP size is 2048. Note that “ViT-P10S8” denotes a ViT model with a 10 × 10 patch size and stride $S = 8$, while “ViT-P8S8” has no overlap between tokens.
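To make the naming concrete, the patch size and stride determine the token count through the sequence-length formula of Section II-A; a small sanity check (with assumed zero-padding amounts, which are free choices here and not specified values) might look like:

def num_tokens(W=112, P=8, S=8, p=0):
    # Total number of patch tokens, following the floor formula in Section II-A
    # (squared because the W x W image is tiled along both spatial dimensions).
    per_side = (W + 2 * p - (P - 1) - 1) // S + 1
    return per_side * per_side

print(num_tokens(P=8, S=8, p=0))    # ViT-P8S8: non-overlapping 8x8 patches -> 196 tokens
print(num_tokens(P=12, S=8, p=2))   # ViT-P12S8: overlapping 12x12 patches, padding assumed -> 196 tokens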
TABLE I
NUMBER OF PARAMETERS, MACs AND INFERENCE SPEED OF FACE RECOGNITION MODELS.

As shown in Table II, Face Transformer models trained on CASIA-WebFace perform noticeably worse than ResNet-100. Actually, we find that the accuracy of Face Transformer models trained on CASIA-WebFace can reach a level as high as that of ResNet-100 during training, while the models do not generalize well to the test databases, which indicates that the scale of CASIA-WebFace may not be sufficient for Transformer models.

Things change when we use a much larger training database, MS-Celeb-1M: the Face Transformer models demonstrate promising results on this large-scale face training database. The performance of the Face Transformer is competitive with that of ResNet-100 with a similar number of parameters and MACs. Compared with “ViT-P8S8”, “ViT-P10S8” and “ViT-P12S8” achieve better performance, which demonstrates that overlapping patches help to some degree. T2T-ViT also obtains good performance; owing to limited computing resources, more hyper-parameter settings for the T2T block remain to be explored. Another interesting point is that the Transformer models obtain somewhat higher accuracy on the TALFW database, which contains transferable adversarial noise. Since the TALFW database is generated using CNNs as surrogate models, this does not indicate any particular adversarial robustness of the Transformer models. It would be interesting to explore the combination of Face Transformer models and adversarial training.
TABLE II
PERFORMANCE ON LFW [18], SLLFW [19], CALFW [20], CPLFW [21], TALFW [22], CFP-FP [23] AND AGEDB-30 [24] DATABASES.

Training Data    Models           LFW    SLLFW  CALFW  CPLFW  TALFW  CFP-FP  AgeDB-30
CASIA-WebFace    ResNet-100 [12]  99.55  98.65  94.13  90.93  53.17  96.30   95.50
CASIA-WebFace    ViT-P8S8 [1]     97.32  90.78  86.78  80.78  83.05  86.60   81.48
CASIA-WebFace    ViT-P12S8        97.42  90.07  87.35  81.60  84.00  85.56   81.48
MS-Celeb-1M      ResNet-100 [12]  99.82  99.67  96.27  93.43  64.88  96.93   98.27
MS-Celeb-1M      ViT-P8S8 [1]     99.83  99.53  95.92  92.55  74.87  96.19   97.82
MS-Celeb-1M      T2T-ViT [5]      99.82  99.63  95.85  93.00  71.93  96.59   98.07
MS-Celeb-1M      ViT-P10S8        99.77  99.63  95.95  92.93  72.95  96.43   97.83
MS-Celeb-1M      ViT-P12S8        99.80  99.55  96.18  93.08  70.13  96.77   98.05
TABLE III
COMPARISON OF DIFFERENT MODELS TRAINED ON MS-CELEB-1M ON THE IJB-C DATABASE [25].