
OpenFace 2.0: Facial Behavior Analysis Toolkit

Tadas Baltrušaitis (1,2), Amir Zadeh (2), Yao Chong Lim (2), and Louis-Philippe Morency (2)
(1) Microsoft, Cambridge, United Kingdom
(2) Carnegie Mellon University, Pittsburgh, United States of America

Abstract— Over the past few years, there has been an increased interest in automatic facial behavior analysis and understanding. We present OpenFace 2.0 — a tool intended for computer vision and machine learning researchers, the affective computing community, and people interested in building interactive applications based on facial behavior analysis. OpenFace 2.0 is an extension of the OpenFace toolkit (created by Baltrušaitis et al. [11]) and is capable of more accurate facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation. The computer vision algorithms which represent the core of OpenFace 2.0 demonstrate state-of-the-art results in all of the above mentioned tasks. Furthermore, our tool is capable of real-time performance and is able to run from a simple webcam without any specialist hardware. Finally, unlike a lot of modern approaches or toolkits, the OpenFace 2.0 source code for training models and running them is freely available for research purposes.

[Fig. 1: OpenFace 2.0 is a framework that implements modern facial behavior analysis algorithms including: facial landmark detection, head pose tracking, eye gaze and facial action unit recognition. The framework takes an image, video, or webcam feed as input, runs the core algorithms, and its outputs (landmarks, gaze and head orientation, action units, facial appearance, and non-rigid face parameters) can be written to disk, consumed by an application, or sent over a network.]

I. INTRODUCTION

Recent years have seen an increased interest in machine analysis of faces [58], [45]. This includes understanding and recognition of affective and cognitive mental states, and interpretation of social signals. As the face is a very important channel of nonverbal communication [23], [20], facial behavior analysis has been used in different applications to facilitate human-computer interaction [47], [50]. More recently, there have been a number of developments demonstrating the feasibility of automated facial behavior analysis systems for better understanding of medical conditions such as depression [28], post-traumatic stress disorder [61], schizophrenia [67], and suicidal ideation [40]. Other uses of automatic facial behavior analysis include the automotive industry [14], education [49], and entertainment [55].

In our work we define facial behavior as consisting of: facial landmark location, head pose, eye gaze, and facial expressions. Each of these behaviors plays an important role, together and individually. Facial landmarks allow us to understand facial expression motion and its dynamics; they also allow for face alignment for various tasks such as gender detection and age estimation. Head pose plays an important role in emotion and social signal perception and expression [63], [1]. Gaze direction is important when evaluating things like attentiveness, social skills and mental health [65], as well as intensity of emotions [39]. Facial expressions reveal intent, display affection, express emotion, and help regulate turn-taking during conversation [3], [22].

Past years have seen huge progress in automatic analysis of the above mentioned behaviors [20], [58], [45]. However, very few tools are available to the research community that can recognize all of them (see Table I). There is a large gap between state-of-the-art algorithms and freely available toolkits. This is especially true when real-time performance is wanted — a necessity for interactive systems.

OpenFace 2.0 is an extension of the OpenFace toolkit [11]. While OpenFace is able to perform the above mentioned tasks, it struggles when the faces are non-frontal or occluded and in low illumination conditions. OpenFace 2.0 is able to cope with such conditions through the use of a new Convolutional Neural Network based face detector and a new and optimized facial landmark detection algorithm. This leads to improved accuracy for facial landmark detection, head pose tracking, AU recognition, and eye gaze estimation.

This research is based upon work supported in part by the Yahoo! InMind project and the Intelligence Advanced Research Projects Activity (IARPA), via IARPA 2014-14071600011. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

978-1-5386-2335-0/18/$31.00 © 2018 IEEE
Tool Approach Landmark Head pose Expression Gaze Train Test Binary Real-time Free
COFW[13] RCPR[13] X X X X X
FaceTracker CLM[57] X X X X X X
dlib [37] [35] X X X X X
Chehra [5] X X X X X
Menpo [2] AAM, CLM, SDM1 X X X 2 X
CFAN [77] [77] X X X X
[73] Reg. For [73] X X X X X X X
TCDCN CNN [81] X X X X X
WebGazer.js [54] X X X X X
EyeTab [71] X N/A X X X X
OKAO unknown X X X X X
Affdex unknown X X X X X
Tree DPM [85] [85] X X X X
OpenPose [15] Part affinity Fields [15] X X X X X X3 X
CFSS [83] CFSS [83] X X X X
iCCR [56] iCCR [56] X X X X
LEAR LEAR [46] X X X X
TAUD TAUD [33] X X X
OpenFace [8], [7] X X X X X X X X X
OpenFace 2.0 [70], [75], [78] X X X X X X X X X

TABLE I: Comparison of facial behavior analysis tools. Free indicates that the tool is freely available for research purposes, Train the availability of model training source code, Test the availability of model fitting/testing/runtime source code, Binary the availability of a model fitting/testing/runtime executable. Note that most tools only provide binary versions (executables) rather than the source code for model training and fitting. Notes: (1) the implementation differs from the originally proposed one based on the used features, (2) the algorithms implemented are capable of real-time performance but the tool does not provide it, (3) requires GPU support.

The main contributions of OpenFace 2.0 are: 1) a new and improved facial landmark detection system; 2) distribution of ready to use trained models; 3) real-time performance, without the need of a GPU; 4) cross-platform support (Windows, OSX, Ubuntu); 5) code available in C++ (runtime), Matlab (runtime and model training), and Python (model training). Our work is intended to bridge the gap between existing state-of-the-art research and easy to use out-of-the-box solutions for facial behavior analysis. We believe our tool will stimulate the community by lowering the bar of entry into the field and enabling new and interesting applications (https://github.com/TadasBaltrusaitis/OpenFace).

II. PREVIOUS WORK

A full review of prior work in facial landmark detection, head pose, eye gaze, and action unit recognition is outside the scope of this paper; we refer the reader to recent reviews in these respective fields [18], [31], [58], [17]. As our contribution is a toolkit, we provide an overview of available tools for accomplishing the individual facial behavior analysis tasks. For a summary of available tools see Table I.

Facial landmark detection – there exist a number of freely available tools that perform facial landmark detection in images or videos, in part thanks to the availability of recent good quality datasets and challenges [60], [76]. However, very few of them provide the source code and instead only provide runtime binaries, or thin wrappers around library files. Binaries only allow for certain predefined functionality (e.g. only visualizing the results), are very rarely cross-platform, and do not allow for bug fixes — an important consideration when the project is no longer actively supported. Further, lack of training code makes the reproduction of experiments on different datasets very difficult. Finally, a number of tools expect face detections (in the form of bounding boxes) to be provided by an external tool; in contrast, OpenFace 2.0 comes packaged with a modern face detection algorithm [78].

Head pose estimation has not received the same amount of interest as facial landmark detection. An early example of a dedicated head pose estimation toolkit is the Watson system [52]. There also exists a random forest based framework that allows for head pose estimation using depth data [24]. While some facial landmark detectors include head pose estimation capabilities [4], [5], most ignore this important behavioral cue. A more recent toolkit for head (and the rest of the body) pose estimation is OpenPose [15]; however, it is computationally demanding and requires GPU acceleration to achieve real-time performance.

Facial expression is often represented using facial action units (AUs), which objectively describe facial muscle activations [21]. There are very few freely available tools for action unit recognition (see Table I). However, there are a number of commercial systems that among other functionality perform action unit recognition, such as: Affdex (http://www.affectiva.com/solutions/affdex/), Noldus FaceReader (http://www.noldus.com/human-behavior-research/products/facereader), and OKAO (https://www.omron.com/ecb/products/mobile/). Such systems face a number of drawbacks: sometimes prohibitive cost, unknown algorithms, often unknown training data, and no public benchmarks.
[Fig. 2: OpenFace 2.0 facial behavior analysis pipeline, including: landmark detection, head pose and eye gaze estimation, and facial action unit recognition. The pipeline runs face detection, 3D facial landmark detection, eye gaze estimation, and head pose estimation, followed by face alignment, appearance extraction with dimensionality reduction, person normalization, feature fusion, and action unit recognition. The outputs from all of these systems (indicated in green) can be saved to disk or sent via network in real-time.]

Furthermore, some tools are inconvenient to use by being restricted to a single machine (due to MAC address locking or requiring USB dongles). Finally, and most importantly, the commercial product may be discontinued, leading to impossible to reproduce results due to lack of product transparency (this is illustrated by the discontinuation of FACET, FaceShift, and IntraFace).

Gaze estimation – there are a number of tools and commercial systems for gaze estimation; however, the majority of them require specialized hardware such as infrared or head mounted cameras [19], [42], [62]. There also exist a couple of commercial systems available for webcam based gaze estimation, such as xLabs (https://xlabsgaze.com/) and EyesDecide (https://www.eyesdecide.com/), but they suffer from the same issues as the previously mentioned commercial facial expression analysis systems. There exist several recent free webcam eye gaze tracking projects [27], [71], [54], [68], but they struggle in real-world scenarios and often require cumbersome manual calibration steps.

In contrast to other available tools (both free and commercial), OpenFace 2.0 provides both training and testing code, allowing for modification, reproducibility, and transparency. Furthermore, our system shows competitive results on real world data and does not require any specialized hardware. Finally, our system runs in real-time with all of the facial behavior analysis modules working together.

III. OPENFACE 2.0 PIPELINE

In this section we outline the core technologies used by OpenFace 2.0 for facial behavior analysis (see Figure 2 for a summary). First, we provide an explanation of how we detect and track facial landmarks, together with novel speed enhancements that allow for real-time performance. We then provide an outline of how these features are used for head pose estimation and eye gaze tracking. Finally, we describe our facial action unit intensity and presence detection system.

A. Facial landmark detection and tracking

OpenFace 2.0 uses the recently proposed Convolutional Experts Constrained Local Model (CE-CLM) [75] for facial landmark detection and tracking. The two main components of CE-CLM are: a Point Distribution Model (PDM), which captures landmark shape variations, and patch experts, which model local appearance variations of each landmark. For more details about the algorithm refer to Zadeh et al. [75]; example landmark detections can be seen in Figure 3.

1) OpenFace 2.0 novelties: Our C++ implementation of CE-CLM in OpenFace 2.0 includes a number of speed optimizations that enable real-time performance. These include deep model simplification, smart multiple hypotheses, and sparse response map computation.

Deep model simplification – The original implementation of CE-CLM used deep networks for patch experts with ≈ 180,000 parameters each (for 68 landmarks at 4 scales and 7 views). We retrained the patch experts for the first two scales using simpler models by narrowing the deep network to half the width, leading to ≈ 90,000 parameters each. We chose the final model size after exploring a large range of alternatives, picking the smallest model that still retains competitive accuracy. This reduces the model size and improves the speed by 1.5 times, with minimal loss in accuracy. Furthermore, we only store half of the patch experts, by relying on mirrored views for response computation (e.g. we store only the left eye recognizer, instead of both eyes). This reduces the model size by a half. Both of these improvements reduced the model size from ≈ 1,200MB to ≈ 400MB.

Smart multiple hypotheses – In the case of landmark detection in difficult in-the-wild and profile images, CE-CLM uses multiple initialization hypotheses (11 in total) at different orientations. During fitting it selects the model with the best converged likelihood. However, this slows down the approach. In order to speed this up we perform an early hypothesis termination, based on the current model likelihood. We start by evaluating the first scale (out of four different scales) for each initialization hypothesis sequentially. If the current likelihood is above a threshold τi (good enough), we do not evaluate further hypotheses. If none of the hypotheses are above τi, we pick the three hypotheses with the highest likelihood for evaluation in further scales and pick the best resulting one. We determine the τi values that lead to small fitting errors for each view on training data. This leads to a 4 times performance improvement of landmark detection in images and for initializing tracking in videos.
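To make the early hypothesis termination concrete, the following Python sketch mirrors the control flow described above. fit_at_scale and refine_over_scales are hypothetical placeholders for the actual CE-CLM fitting routines (each assumed to return a (parameters, likelihood) pair), and tau stands for the per-view thresholds learned on training data; none of this is taken from the OpenFace 2.0 code base.

def detect_landmarks(image, init_hypotheses, tau, n_keep=3, n_scales=4):
    """Early-terminating multi-hypothesis fitting (control-flow sketch only)."""
    coarse_fits = []
    for view, init_params in enumerate(init_hypotheses):   # 11 orientations in OpenFace 2.0
        params, likelihood = fit_at_scale(image, init_params, scale=0)  # hypothetical fitter
        if likelihood > tau[view]:                          # "good enough": stop evaluating hypotheses
            return refine_over_scales(image, params, range(1, n_scales))
        coarse_fits.append((likelihood, params))

    # No hypothesis cleared its threshold: refine only the three most likely ones
    # over the remaining scales and keep the best converged result.
    coarse_fits.sort(key=lambda lp: lp[0], reverse=True)
    candidates = [refine_over_scales(image, p, range(1, n_scales))
                  for _, p in coarse_fits[:n_keep]]
    return max(candidates, key=lambda fit: fit[1])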
Sparse response maps – An important part of CE-CLM is the computation of response maps for each facial landmark. Typically this is calculated on a dense grid around the current landmark estimate (e.g. a 15×15 pixel grid). However, instead of computing the response map on a dense grid, we can do it on a sparse grid by skipping every other pixel, followed by a bilinear interpolation to map it back to a dense grid. This leads to a 1.5 times improvement in model speed on images and videos with minimal loss of accuracy.

2) Implementation details: The PDM used in OpenFace 2.0 was trained on two datasets — the LFPW [12] and Helen [41] training sets. This resulted in a model with 34 non-rigid and 6 rigid shape parameters. For training the CE-CLM patch experts we used: Multi-PIE [29], the LFPW [12] and Helen [41] training sets, and Menpo [76]. We trained a separate set of patch experts for seven views and four scales (leading to 28 sets in total). We found optimal results are achieved when the face is at least 100 pixels ear to ear. Training on different views allows us to track faces with out-of-plane motion and to model self-occlusion due to head rotation. We first pretrained our model on the Multi-PIE, LFPW, and Helen datasets and finished training on the Menpo dataset, as this leads to better results [75].

To initialize our CE-CLM model we use our implementation of the Multi-task Convolutional Neural Network (MTCNN) face detector [78]. The face detector we use was trained on the WIDER FACE [74] and CelebA [43] datasets. This is in contrast to OpenFace, which used a dlib face detector [37] that is not able to detect profile or highly occluded faces. We learned a simple linear mapping from the bounding box provided by the MTCNN detector to the one surrounding the 68 facial landmarks. When tracking landmarks in videos we initialize the CE-CLM model based on the landmark detection in the previous frame.
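Such a bounding-box correction can be estimated with ordinary least squares from paired boxes. The sketch below is illustrative only: the box data is synthetic, and the exact parameterization used by OpenFace 2.0 may differ.

import numpy as np

rng = np.random.default_rng(0)
det = rng.uniform(0, 200, size=(1000, 4))                        # stand-in MTCNN boxes (x, y, w, h)
gt = det @ np.diag([1.0, 1.0, 1.1, 1.2]) + [0.0, 10.0, 0.0, 0.0]  # stand-in landmark-tight boxes

# Fit an affine correction per output coordinate: gt ≈ [det, 1] @ W
X = np.hstack([det, np.ones((len(det), 1))])
W, *_ = np.linalg.lstsq(X, gt, rcond=None)

def correct_box(box):
    """Map a raw detector box to a box suitable for landmark model initialization."""
    return np.append(box, 1.0) @ W

print(correct_box(np.array([50.0, 60.0, 100.0, 120.0])))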
To prevent tracking drift, we implement a simple four layer CNN that reports whether tracking has failed, based on the currently detected landmarks. If our CNN validation module reports that tracking failed, we reinitialize the model using the MTCNN face detector.

To optimize the matrix multiplications required for patch expert computation and face detection we use OpenBLAS (http://www.openblas.net), which allows for computation optimized for the specific CPU architecture. This allows us to use Convolutional Neural Network (CNN) based patch expert computation and face detection without sacrificing real-time performance on devices without dedicated GPUs. This led to a 2-5 times (based on CPU architecture) performance improvement when compared to OpenCV matrix multiplication.

All of the above mentioned performance improvements, together with a C++ implementation, allow CE-CLM landmark detection to achieve 30-40Hz frame rates on a quad core 3.5GHz Intel i7-2700K processor, and 20Hz frame rates on a Surface Pro 3 laptop with a 1.7GHz dual core Intel Core i7-4650U processor, without any GPU support, when processing 640 × 480 px videos. This is 30 times faster than the original Matlab implementation of CE-CLM [75].

[Fig. 3: Example landmark detections from OpenFace 2.0; note the ability to deal with profile faces and occlusion.]

[Fig. 4: Sample gaze estimations on video sequences; green lines represent the estimated eye gaze vectors, the blue boxes a 3D bounding box around the head.]

[Fig. 5: Sample eye registrations on the 300-W dataset.]

B. Head pose estimation

Our model is able to extract head pose (translation and orientation) in addition to facial landmark detection. We are able to do this as CE-CLM internally uses a 3D representation of facial landmarks and projects them to the image using orthographic camera projection. This allows us to accurately estimate the head pose once the landmarks are detected by solving the n point in perspective (PnP) problem [32]; see examples of bounding boxes illustrating head pose in Figure 4.
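To illustrate how pose can be recovered once 2D landmarks and their 3D model positions are known, the following sketch uses OpenCV's solvePnP. The landmark arrays are random stand-ins for the PDM shape and the CE-CLM detections, and the camera intrinsics are an assumed generic webcam calibration, not values prescribed by OpenFace 2.0.

import cv2
import numpy as np

# 3D landmark positions in the model frame (random stand-ins for the PDM shape)
model_points = np.random.rand(68, 3).astype(np.float64)
# 2D detections in the image (random stand-ins for the landmark detector output)
image_points = (np.random.rand(68, 2) * [640, 480]).astype(np.float64)

# Assumed pinhole intrinsics for a 640x480 webcam (fx = fy = 500, principal point at center)
camera_matrix = np.array([[500.0, 0.0, 320.0],
                          [0.0, 500.0, 240.0],
                          [0.0, 0.0, 1.0]])
dist_coeffs = np.zeros(4)  # assume no lens distortion

ok, rvec, tvec = cv2.solvePnP(model_points, image_points, camera_matrix, dist_coeffs)
rotation_matrix, _ = cv2.Rodrigues(rvec)  # head orientation as a 3x3 matrix
print("head translation (model units):", tvec.ravel())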
C. Eye gaze estimation

In order to estimate eye gaze, we use a Constrained Local Neural Field (CLNF) landmark detector [9], [70] to detect the eyelids, iris, and pupil. For training the landmark detector we used the SynthesEyes training dataset [70]. Some sample registrations can be seen in Figure 5. We use the detected pupil and eye location to compute the eye gaze vector individually for each eye. We fire a ray from the camera origin through the center of the pupil in the image plane and compute its intersection with the eye-ball sphere. This gives us the pupil location in 3D camera coordinates. The vector from the 3D eyeball center to the pupil location is our estimated gaze vector. This is a fast and accurate method for person independent eye-gaze estimation in webcam images.
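The ray casting and sphere intersection described above amount to a few lines of geometry. The sketch below assumes a known camera matrix, eyeball center, and eyeball radius; the numbers are illustrative placeholders, whereas in OpenFace 2.0 the eyeball parameters come from the fitted 3D eye landmarks.

import numpy as np

def estimate_gaze(pupil_px, camera_matrix, eyeball_center, eyeball_radius=0.012):
    """Gaze direction from a 2D pupil detection (geometric sketch, units in meters)."""
    # Back-project the pupil pixel into a camera-frame ray through the origin.
    ray = np.linalg.inv(camera_matrix) @ np.array([pupil_px[0], pupil_px[1], 1.0])
    ray /= np.linalg.norm(ray)

    # Intersect the ray with the eyeball sphere: ||t * ray - c||^2 = r^2.
    b = -2.0 * ray @ eyeball_center
    c = eyeball_center @ eyeball_center - eyeball_radius ** 2
    disc = b * b - 4.0 * c
    if disc < 0:
        return None                        # ray misses the sphere
    t = (-b - np.sqrt(disc)) / 2.0         # nearer intersection = visible pupil surface
    pupil_3d = t * ray

    gaze = pupil_3d - eyeball_center       # vector from eyeball center to 3D pupil
    return gaze / np.linalg.norm(gaze)

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
print(estimate_gaze((360.0, 262.0), K, eyeball_center=np.array([0.03, 0.02, 0.4])))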
D. Facial expression recognition

OpenFace 2.0 recognizes facial expressions by detecting facial action unit (AU) intensity and presence. We use a method based on a recent AU recognition framework by Baltrušaitis et al. [8] that uses linear kernel Support Vector Machines. OpenFace 2.0 contains a direct implementation with a couple of changes that adapt it to work better on natural video sequences, using person specific normalization and prediction correction [8], [11]. While initially this may appear to be a simple and outdated model for AU recognition, our experiments demonstrate how competitive it is even when compared to recent deep learning methods (see Table VI), while retaining a distinct speed advantage.

As features we use the concatenation of dimensionality-reduced HOGs [26] from a similarity-aligned 112 × 112 pixel face image and facial shape features (from CE-CLM). In order to account for personal differences when processing videos, the median value of the features is subtracted from the current frame. To correct for person specific bias in AU intensity prediction, we take the lowest nth percentile (learned on validation data) of the predictions on a specific person and subtract it from all of the predictions [11].

Our models are trained on the DISFA [48], SEMAINE [51], BP4D [80], UNBC-McMaster [44], Bosphorus [59] and FERA 2011 [66] datasets. Where the AU labels overlap across multiple datasets we train on them jointly. This leads to OpenFace 2.0 recognizing the AUs listed in Table II.

TABLE II: List of AUs in OpenFace 2.0. We predict the intensity and presence of all AUs, except for AU28, for which only presence predictions are made.

AU1   Inner brow raiser
AU2   Outer brow raiser
AU4   Brow lowerer
AU5   Upper lid raiser
AU6   Cheek raiser
AU7   Lid tightener
AU9   Nose wrinkler
AU10  Upper lip raiser
AU12  Lip corner puller
AU14  Dimpler
AU15  Lip corner depressor
AU17  Chin raiser
AU20  Lip stretcher
AU23  Lip tightener
AU25  Lips part
AU26  Jaw drop
AU28  Lip suck
AU45  Blink
IV. EXPERIMENTAL EVALUATION

In this section, we evaluate each of our OpenFace 2.0 subsystems: facial landmark detection, head pose estimation, eye gaze estimation, and facial action unit recognition. For each of our experiments we also include comparisons with a number of recently proposed approaches for tackling the same problems (although none of them tackle all of them at once). In all cases, except for facial action units (due to the lack of overlapping AU categories across datasets), we perform cross-dataset experiments, allowing us to better judge the generalization of our toolkit.

A. Landmark detection

We evaluate our OpenFace 2.0 toolkit on a facial landmark detection task and compare it to a number of recent baselines in a cross-dataset evaluation setup. For all of the baselines, we used the code or executables provided by the authors.

Datasets – The facial landmark detection capability was evaluated on two publicly available datasets: IJB-FL [36] and the 300VW [60] test set. IJB-FL [36] is a landmark-annotated subset of IJB-A [38] — a face recognition benchmark. It contains labels for 180 images (128 frontal and 52 profile faces). This is a challenging subset containing images in non-frontal pose, with heavy occlusion and poor image quality. The 300VW [60] test set contains 64 videos labeled for 68 facial landmarks in every frame. The test videos are categorized into three types: 1) laboratory and naturalistic well-lit conditions; 2) unconstrained conditions such as varied illumination, dark rooms and overexposed shots; 3) completely unconstrained conditions including illumination and occlusions such as occlusions by hand.

Baselines – We compared our approach to other facial landmark detection algorithms whose implementations are available online and which have been trained to detect the same facial landmarks (or their subsets). CFSS [83] — Coarse to Fine Shape Search — is a recent cascaded regression approach. PO-CR [64] is another recent cascaded regression approach that updates the shape model parameters rather than predicting landmark locations directly in a projected-out space. CLNF [9] is an extension of the Constrained Local Model that uses Continuous Conditional Neural Fields as patch experts; this model is included in the OpenFace toolbox. DRMF [5] — Discriminative Response Map Fitting — performs regression on patch expert response maps directly rather than using optimization over the parameter space. 3DDFA [84] — 3D Dense Face Alignment — has shown great performance on facial landmark detection in profile images. CFAN [77] — Coarse-to-Fine Auto-encoder Network — uses cascaded regression on auto-encoder visual features. SDM [72] — Supervised Descent Method — is a very popular cascaded regression approach. iCCR [56] is a facial landmark tracking approach for videos that adapts to the particular person it tracks.

Results of the IJB-FL experiment can be seen in Figure 7, while results on 300VW are shown in Figure 6. Note how OpenFace 2.0 outperforms all of the baselines in both of the experiments.

[Fig. 6: Fitting on the 300VW dataset using OpenFace 2.0 and recently proposed landmark detection approaches, with panels (a) Category 1, (b) Category 2, and (c) Category 3. We only report performance on 49 landmarks as that allows us to compare to more baselines. All of the methods except for iCCR were not trained or validated on the 300VW dataset.]

[Fig. 7: Fitting on IJB-FL using OpenFace 2.0 and comparing against recent landmark detection methods, reported on (a) 68 landmarks and (b) 49 landmarks. None of the approaches were trained on IJB-FL, allowing us to evaluate the ability to generalize.]

B. Head pose estimation

To measure performance on the head pose estimation task we used two publicly available datasets with existing ground truth head pose data: BU [16] and ICT-3DHP [10]. For comparison, we report the results of using the Chehra framework [5], CLM [57], CLM-Z [10], Regression Forests [24], and OpenFace [8]. The results can be seen in Table III and Table IV. It can be seen that our approach demonstrates state-of-the-art performance on both of the datasets.

TABLE III: Head pose estimation results on the BU dataset, measured in mean absolute degree error. Note that the BU dataset only contains RGB images, so no comparison against CLM-Z and Regression Forests was performed.

Method        Yaw  Pitch  Roll  Mean  Median
CLM [57]      3.0  3.5    2.3   2.9   2.0
Chehra [5]    3.8  4.6    2.8   3.8   2.5
OpenFace      2.8  3.3    2.3   2.8   2.0
OpenFace 2.0  2.4  3.2    2.4   2.6   1.8

TABLE IV: Head pose estimation results on ICT-3DHP, measured in mean absolute degree error.

Method             Yaw   Pitch  Roll  Mean
Reg. forests [25]  7.2   9.4    7.5   8.0
CLM-Z [10]         5.1   3.9    4.6   4.6
CLM [57]           4.8   4.2    4.5   4.5
Chehra [5]         13.9  14.7   10.3  13.0
OpenFace           3.6   3.6    3.6   3.6
OpenFace 2.0       3.1   3.5    3.1   3.2

C. Eye gaze estimation

We evaluated the ability of OpenFace 2.0 to estimate eye gaze vectors on the challenging MPIIGaze dataset [79], intended to evaluate appearance-based gaze estimation. MPIIGaze was collected in realistic laptop use scenarios and poses a challenging and practically-relevant task for eye gaze estimation. Sample images from the dataset can be seen in the right column of Figure 4. We evaluated our approach on a 750 face image subset of the dataset. We performed our experiments in a cross-dataset fashion and compared to baselines not trained on the MPIIGaze dataset. We compared our model to a CNN proposed by Zhang et al. [79], to the EyeTab geometry based model [71], and to a k-NN approach based on the UnityEyes dataset [69]. The error rates of our model can be seen in Table V. It can be seen that our model shows state-of-the-art performance on the task of cross-dataset eye gaze estimation.

TABLE V: Results comparing our method to previous work for cross-dataset gaze estimation on MPIIGaze [79], measured in mean absolute degree error.

Model                         Gaze error
EyeTab [71]                   47.1
CNN on UT [79]                13.91
CNN on SynthesEyes [70]       13.55
CNN on SynthesEyes + UT [70]  11.12
OpenFace                      9.96
UnityEyes [69]                9.95
OpenFace 2.0                  9.10

D. Action unit recognition

We evaluate our model for AU prediction against a set of recent baselines, and demonstrate the benefits of such a simple approach. As there are no recent free tools we could compare our system to (and commercial tools do not allow for public comparisons), we compare against the general methods used, instead of toolkits.

Baselines – The Continuous Conditional Neural Fields (CCNF) model is a temporal approach for AU intensity estimation [6] based on non-negative matrix factorization features around facial landmark points. Iterative Regularized Kernel Regression (IRKR) [53] is a recently proposed kernel learning method for AU intensity estimation; it is an iterative nonlinear feature selection method with a Lasso-regularized version of Metric Regularized Kernel Regression. A generative latent tree (LT) model was proposed by Kaltwang et al. [34].
The LT model demonstrates good performance under noisy input. Finally, we included two recent Convolutional Neural Network (CNN) baselines: the shallow four-layer model proposed by Gudi et al. [30], and a deeper CNN model used by Zhao et al. [82] (called ConvNet in their work). The model of Gudi et al. [30] consists of three convolutional layers; the Zhao et al. D-CNN model uses five convolutional layers followed by two fully-connected layers and a final linear layer. SVR-HOG is the method used in OpenFace 2.0. For all methods we report results from the relevant papers, except for the CNN and D-CNN models, which we re-implemented. In the case of SVR-HOG, CNN, and D-CNN we used 5-fold person-independent testing.

Results can be found in Table VI. It can be seen that the SVR-HOG approach employed by OpenFace 2.0 outperforms the more complex and recent approaches for AU detection on this challenging dataset.

TABLE VI: Comparing our model to baselines on the DISFA dataset; results reported as Pearson Correlation Coefficient. Notes: (1) used a different fold split, (2) used 9-fold testing, (3) used leave-one-person-out testing.

Method                  AU1   AU2   AU4   AU5   AU6   AU9   AU12  AU15  AU17  AU20  AU25  AU26  Mean
IRKR [53] (1)           0.70  0.68  0.68  0.49  0.65  0.43  0.83  0.34  0.35  0.21  0.86  0.62  0.57
LT [34] (2)             0.41  0.44  0.50  0.29  0.55  0.32  0.76  0.11  0.31  0.16  0.82  0.49  0.43
CNN [30]                0.60  0.53  0.64  0.38  0.55  0.59  0.85  0.22  0.37  0.15  0.88  0.60  0.53
D-CNN [82]              0.49  0.39  0.62  0.44  0.53  0.55  0.85  0.25  0.41  0.19  0.87  0.59  0.51
CCNF [6] (3)            0.48  0.50  0.52  0.48  0.45  0.36  0.70  0.41  0.39  0.11  0.89  0.57  0.49
OpenFace 2.0 (SVR-HOG)  0.64  0.50  0.70  0.67  0.59  0.54  0.85  0.39  0.49  0.22  0.85  0.67  0.59

We also compare OpenFace 2.0 with OpenFace for AU detection accuracy. The average concordance correlation coefficient (CCC) on the DISFA validation set across 12 AUs is 0.70 for OpenFace, while for OpenFace 2.0 it is 0.73.
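For reference, the concordance correlation coefficient used in this comparison follows the standard definition below; this is a generic implementation, not code from the toolkit.

import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient between two 1-D arrays."""
    mu_t, mu_p = np.mean(y_true), np.mean(y_pred)
    var_t, var_p = np.var(y_true), np.var(y_pred)
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)

# Per-AU CCC values would then be averaged across the 12 DISFA AUs.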
V. INTERFACE

OpenFace 2.0 is an easy to use toolbox for the analysis of facial behavior. There are two main ways of using OpenFace 2.0: a Graphical User Interface (for Windows), and the command line (for Windows, Ubuntu, and Mac OS X). As the source code is available, it is also possible to integrate it in any C++, C#, or Matlab based project. To make the system easier to use we provide sample Matlab scripts that demonstrate how to extract, save, read and visualize each of the behaviors.

OpenFace 2.0 can operate on real-time video feeds from a webcam, recorded video files, image sequences and individual images. It is possible to save the outputs of the processed data as CSV files in the case of facial landmarks, shape parameters, head pose, action units, and gaze vectors.
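As a usage illustration, the per-frame CSV output can be inspected with standard tools. The file path and the column names used below (success, pose_Ry, AU12_r, gaze_0_x, ...) are assumptions based on the common OpenFace output naming convention; check the header of your own output file before relying on them.

import pandas as pd

# Assumed output location; OpenFace 2.0 writes one CSV per processed video.
df = pd.read_csv("processed/video.csv")
df.columns = df.columns.str.strip()       # some versions pad column names with spaces

success = df["success"] == 1                            # frames where tracking succeeded
head_yaw = df.loc[success, "pose_Ry"]                   # head rotation around the vertical axis
au12_intensity = df.loc[success, "AU12_r"]              # lip corner puller intensity (0-5)
gaze = df.loc[success, ["gaze_0_x", "gaze_0_y", "gaze_0_z"]]  # left eye gaze vector

print(au12_intensity.describe())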
VI. CONCLUSION

In this paper we presented OpenFace 2.0 — an extension to the OpenFace real-time facial behavior analysis system. OpenFace 2.0 is a useful tool for the computer vision, machine learning and affective computing communities and will stimulate research in facial behavior analysis and understanding. Furthermore, development of the tool will continue, and it will attempt to incorporate the newest and most reliable approaches for the problem at hand while releasing the source code and retaining its real-time capacity. We hope that this tool will encourage other researchers in the field to share their code.

REFERENCES

[1] A. Adams, M. Mahmoud, T. Baltrušaitis, and P. Robinson. Decoupling facial expressions and head motions in complex emotions. In ACII, 2015.
[2] J. Alabort-i-Medina, E. Antonakos, J. Booth, and P. Snape. Menpo: A Comprehensive Platform for Parametric Image Alignment and Visual Deformable Models. 2014.
[3] N. Ambady and R. Rosenthal. Thin Slices of Expressive Behavior as Predictors of Interpersonal Consequences: a Meta-Analysis. Psychological Bulletin, 111(2):256–274, 1992.
[4] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic. Robust discriminative response map fitting with constrained local models. In CVPR, 2013.
[5] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic. Incremental Face Alignment in the Wild. In CVPR, 2014.
[6] T. Baltrušaitis, P. Robinson, and L.-P. Morency. Continuous Conditional Neural Fields for Structured Regression. In ECCV, 2014.
[7] T. Baltrušaitis, N. Banda, and P. Robinson. Dimensional affect recognition using continuous conditional random fields. In FG, 2013.
[8] T. Baltrušaitis, M. Mahmoud, and P. Robinson. Cross-dataset learning and person-specific normalisation for automatic Action Unit detection. In Facial Expression Recognition and Analysis Challenge, FG, 2015.
[9] T. Baltrušaitis, L.-P. Morency, and P. Robinson. Constrained local neural fields for robust facial landmark detection in the wild. In ICCVW, 2013.
[10] T. Baltrušaitis, P. Robinson, and L.-P. Morency. 3D Constrained Local Model for Rigid and Non-Rigid Facial Tracking. In CVPR, 2012.
[11] T. Baltrušaitis, P. Robinson, and L.-P. Morency. OpenFace: an open source facial behavior analysis toolkit. In IEEE WACV, 2016.
[12] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar. Localizing parts of faces using a consensus of exemplars. In CVPR, 2011.
[13] X. P. Burgos-Artizzu, P. Perona, and P. Dollar. Robust face landmark estimation under occlusion. In ICCV, 2013.
[14] C. Busso and J. J. Jain. Advances in Multimodal Tracking of Driver Distraction. In DSP for in-Vehicle Systems and Safety, 2012.
[15] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
[16] M. L. Cascia, S. Sclaroff, and V. Athitsos. Fast, Reliable Head Tracking under Varying Illumination: An Approach Based on Registration of Texture-Mapped 3D Models. TPAMI, 22(4), 2000.
[17] G. G. Chrysos, E. Antonakos, P. Snape, A. Asthana, and S. Zafeiriou. A Comprehensive Performance Evaluation of Deformable Face Tracking "In-the-Wild". International Journal of Computer Vision, 2016.
[18] B. Czupryński and A. Strupczewski. High accuracy head pose tracking survey. LNCS, 2014.
[19] E. S. Dalmaijer, S. Mathôt, and S. V. D. Stigchel. PyGaze: an open-source, cross-platform toolbox for minimal-effort programming of eye-tracking experiments. Behavior Research Methods, 2014.
[20] F. De la Torre and J. F. Cohn. Facial Expression Analysis. In Guide to Visual Analysis of Humans: Looking at People, 2011.
[21] P. Ekman and W. V. Friesen. Manual for the Facial Action Coding System. Palo Alto: Consulting Psychologists Press, 1977.
[22] P. Ekman, W. V. Friesen, and P. Ellsworth. Emotion in the Human Face. Cambridge University Press, second edition, 1982.
[23] P. Ekman, W. V. Friesen, M. O'Sullivan, and K. R. Scherer. Relative importance of face, body, and speech in judgments of personality and affect. Journal of Personality and Social Psychology, 1980.
[24] G. Fanelli, J. Gall, and L. V. Gool. Real time head pose estimation with random regression forests. In CVPR, 2011.
[25] G. Fanelli, T. Weise, J. Gall, and L. van Gool. Real time head pose estimation from consumer depth cameras. In DAGM, 2011.
[26] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object Detection with Discriminatively Trained Part Based Models. IEEE TPAMI, 32, 2010.
[27] O. Ferhat and F. Vilariño. A Cheap Portable Eye-tracker Solution for Common Setups. 3rd International Workshop on Pervasive Eye Tracking and Mobile Eye-Based Interaction, 2013.
[28] J. M. Girard, J. F. Cohn, M. H. Mahoor, S. Mavadati, and D. P. Rosenwald. Social risk and depression: Evidence from manual and automatic facial expression analysis. In FG, 2013.
[29] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. IVC, 2010.
[30] A. Gudi, H. E. Tasli, T. M. D. Uyl, and A. Maroulis. Deep Learning based FACS Action Unit Occurrence and Intensity Estimation. In Facial Expression Recognition and Analysis Challenge, FG, 2015.
[31] D. W. Hansen and Q. Ji. In the eye of the beholder: a survey of models for eyes and gaze. TPAMI, 2010.
[32] J. A. Hesch and S. I. Roumeliotis. A Direct Least-Squares (DLS) method for PnP. In ICCV, 2011.
[33] B. Jiang, M. F. Valstar, and M. Pantic. Action unit detection using sparse appearance descriptors in space-time video volumes. FG, 2011.
[34] S. Kaltwang, S. Todorovic, and M. Pantic. Latent trees for estimating intensity of facial action units. In CVPR, Boston, MA, USA, 2015.
[35] V. Kazemi and J. Sullivan. One Millisecond Face Alignment with an Ensemble of Regression Trees. CVPR, 2014.
[36] K. Kim, T. Baltrušaitis, A. Zadeh, L.-P. Morency, and G. Medioni. Holistically Constrained Local Model: Going Beyond Frontal Poses for Facial Landmark Detection. In BMVC, 2016.
[37] D. E. King. Max-margin object detection. CoRR, 2015.
[38] B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, P. Grother, A. Mah, M. Burge, and A. K. Jain. Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A. CVPR, 2015.
[39] C. L. Kleinke. Gaze and eye contact: a research review. Psychological Bulletin, 1986.
[40] E. Laksana, T. Baltrušaitis, and L.-P. Morency. Investigating facial behavior indicators of suicidal ideation. In FG, 2017.
[41] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang. Interactive facial feature localization. In ECCV, 2012.
[42] M. Lidegaard, D. W. Hansen, and N. Krüger. Head mounted device for point-of-gaze estimation in three dimensions. ETRA, 2014.
[43] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep Learning Face Attributes in the Wild. In ICCV, pages 3730–3738, 2015.
[44] P. Lucey, J. F. Cohn, K. M. Prkachin, P. E. Solomon, and I. Matthews. Painful data: The UNBC-McMaster shoulder pain expression archive database. FG, 2011.
[45] B. Martinez and M. Valstar. Advances, challenges, and opportunities in automatic facial expression recognition. 2016.
[46] B. Martinez, M. F. Valstar, X. Binefa, and M. Pantic. Local evidence aggregation for regression based facial point detection. TPAMI, 2013.
[47] Y. Matsuyama, A. Bhardwaj, R. Zhao, O. J. Romero, S. A. Akoju, and J. Cassell. Socially-Aware Animated Intelligent Personal Assistant Agent. SIGDIAL Conference, 2016.
[48] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, and J. F. Cohn. DISFA: A Spontaneous Facial Action Intensity Database. TAFFC, 2013.
[49] B. McDaniel, S. D'Mello, B. King, P. Chipman, K. Tapp, and A. Graesser. Facial Features for Affective State Detection in Learning Environments. 29th Annual Meeting of the Cognitive Science Society, pages 467–472, 2007.
[50] D. McDuff, R. el Kaliouby, D. Demirdjian, and R. Picard. Predicting online media effectiveness based on smile responses gathered over the internet. In FG, 2013.
[51] G. McKeown, M. F. Valstar, R. Cowie, and M. Pantic. The SEMAINE corpus of emotionally coloured character interactions. In IEEE International Conference on Multimedia and Expo, 2010.
[52] L.-P. Morency, J. Whitehill, and J. R. Movellan. Generalized Adaptive View-based Appearance Model: Integrated Framework for Monocular Head Pose Estimation. In FG, 2008.
[53] J. Nicolle, K. Bailly, and M. Chetouani. Real-time facial action unit intensity prediction with regularized metric learning. IVC, 2016.
[54] A. Papoutsaki, P. Sangkloy, J. Laskey, N. Daskalova, J. Huang, and J. Hays. WebGazer: Scalable Webcam Eye Tracking Using User Interactions. IJCAI, pages 3839–3845, 2016.
[55] P. Mavromoustakos Blom, S. Bakkes, C. T. Tan, S. Whiteson, D. Roijers, R. Valenti, and T. Gevers. Towards Personalised Gaming via Facial Expression Recognition. AIIDE, 2014.
[56] E. Sánchez-Lozano, B. Martinez, G. Tzimiropoulos, and M. Valstar. Cascaded Continuous Regression for Real-time Incremental Face Tracking. In ECCV, 2016.
[57] J. Saragih, S. Lucey, and J. Cohn. Deformable Model Fitting by Regularized Landmark Mean-Shift. IJCV, 2011.
[58] E. Sariyanidi, H. Gunes, and A. Cavallaro. Automatic analysis of facial affect: A survey of registration, representation and recognition. IEEE TPAMI, 2014.
[59] A. Savran, N. Alyüz, H. Dibeklioğlu, O. Çeliktutan, B. Gökberk, B. Sankur, and L. Akarun. Bosphorus database for 3D face analysis. Lecture Notes in Computer Science, 5372:47–56, 2008.
[60] J. Shen, S. Zafeiriou, G. G. Chrysos, J. Kossaifi, G. Tzimiropoulos, and M. Pantic. The First Facial Landmark Tracking in-the-Wild Challenge: Benchmark and Results. ICCVW, 2015.
[61] G. Stratou, S. Scherer, J. Gratch, and L.-P. Morency. Automatic nonverbal behavior indicators of depression and PTSD: Exploring gender differences. In ACII, 2013.
[62] L. Świrski, A. Bulling, and N. A. Dodgson. Robust real-time pupil tracking in highly off-axis images. In Proceedings of ETRA, 2012.
[63] J. L. Tracy and D. Matsumoto. The spontaneous expression of pride and shame: Evidence for biologically innate nonverbal displays. Proceedings of the National Academy of Sciences, 2008.
[64] G. Tzimiropoulos. Project-Out Cascaded Regression with an application to Face Alignment. In CVPR, 2015.
[65] A. Vail, T. Baltrušaitis, L. Pennant, E. Liebson, J. Baker, and L.-P. Morency. Visual attention in schizophrenia: Eye contact and gaze aversion during clinical interactions. In ACII, 2017.
[66] M. F. Valstar, B. Jiang, M. Mehu, M. Pantic, and K. R. Scherer. The First Facial Expression Recognition and Analysis Challenge. In IEEE FG, 2011.
[67] S. Vijay, T. Baltrušaitis, L. Pennant, D. Öngür, J. Baker, and L.-P. Morency. Computational study of psychosis symptoms and facial expressions. In Computing and Mental Health Workshop at CHI, 2016.
[68] E. Wood, T. Baltrušaitis, L.-P. Morency, P. Robinson, and A. Bulling. A 3D morphable eye region model for gaze estimation. In ECCV, 2016.
[69] E. Wood, T. Baltrušaitis, L.-P. Morency, P. Robinson, and A. Bulling. Learning an appearance-based gaze estimator from one million synthesized images. In Eye-Tracking Research and Applications, 2016.
[70] E. Wood, T. Baltrušaitis, X. Zhang, Y. Sugano, P. Robinson, and A. Bulling. Rendering of eyes for eye-shape registration and gaze estimation. In ICCV, 2015.
[71] E. Wood and A. Bulling. EyeTab: Model-based gaze estimation on unmodified tablet computers. In Proceedings of ETRA, Mar. 2014.
[72] X. Xiong and F. De la Torre. Supervised descent method and its applications to face alignment. In CVPR, 2013.
[73] H. Yang and I. Patras. Sieving Regression Forest Votes for Facial Feature Detection in the Wild. In ICCV, 2013.
[74] S. Yang, P. Luo, C. C. Loy, and X. Tang. WIDER FACE: A face detection benchmark. In CVPR, 2016.
[75] A. Zadeh, T. Baltrušaitis, and L.-P. Morency. Convolutional experts constrained local model for facial landmark detection. In CVPRW, 2017.
[76] S. Zafeiriou, G. Trigeorgis, G. Chrysos, J. Deng, and J. Shen. The Menpo Facial Landmark Localisation Challenge: A step towards the solution. In CVPR Workshops, 2017.
[77] J. Zhang, S. Shan, M. Kan, and X. Chen. Coarse-to-Fine Auto-encoder Networks (CFAN) for Real-time Face Alignment. In ECCV, 2014.
[78] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multi-task cascaded convolutional networks. 2016.
[79] X. Zhang, Y. Sugano, M. Fritz, and A. Bulling. Appearance-based gaze estimation in the wild. June 2015.
[80] X. Zhang, L. Yin, J. F. Cohn, S. Canavan, M. Reale, A. Horowitz, P. Liu, and J. M. Girard. BP4D-Spontaneous: a high-resolution spontaneous 3D dynamic facial expression database. IVC, 2014.
[81] Z. Zhang, P. Luo, C.-C. Loy, and X. Tang. Facial Landmark Detection by Deep Multi-task Learning. ECCV, 2014.
[82] K. Zhao, W. Chu, and H. Zhang. Deep Region and Multi-Label Learning for Facial Action Unit Detection. In CVPR, 2016.
[83] S. Zhu, C. Li, C. C. Loy, and X. Tang. Face Alignment by Coarse-to-Fine Shape Searching. In CVPR, 2015.
[84] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. Face alignment across large poses: A 3D solution. In CVPR, 2016.
[85] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In CVPR, 2012.