
End-to-End Learning of Geometry and Context for Deep Stereo Regression

Alex Kendall Hayk Martirosyan Saumitro Dasgupta Peter Henry


Ryan Kennedy Abraham Bachrach Adam Bry
Skydio Inc.
{alex,hayk,saumitro,peter,ryan,abe,adam}@skydio.com
arXiv:1703.04309v1 [cs.CV] 13 Mar 2017

Abstract

We propose a novel deep learning architecture for regressing disparity from a rectified pair of stereo images. We leverage knowledge of the problem's geometry to form a cost volume using deep feature representations. We learn to incorporate contextual information using 3-D convolutions over this volume. Disparity values are regressed from the cost volume using a proposed differentiable soft argmin operation, which allows us to train our method end-to-end to sub-pixel accuracy without any additional post-processing or regularization. We evaluate our method on the Scene Flow and KITTI datasets and on KITTI we set a new state-of-the-art benchmark, while being significantly faster than competing approaches.

1. Introduction

Accurately estimating three dimensional geometry from stereo imagery is a core problem for many computer vision applications, including autonomous vehicles and UAVs [2]. In this paper we are specifically interested in computing the disparity of each pixel between a rectified stereo pair of images. To achieve this, the core task of a stereo algorithm is computing the correspondence of each pixel between two images. This is very challenging to achieve robustly in real-world scenarios. Current state-of-the-art stereo algorithms often have difficulty with textureless areas, reflective surfaces, thin structures and repetitive patterns. Many stereo algorithms aim to mitigate these failures with pooling or gradient based regularization [15, 23]. However, this often requires a compromise between smoothing surfaces and detecting detailed structures.

In contrast, deep learning models have been successful in learning powerful representations directly from the raw data in object classification [28], detection [17] and semantic segmentation [31, 3]. These examples demonstrate that deep convolutional neural networks are very effective for understanding semantics. They excel at classification tasks when supervised with large training datasets. We observe that a number of these challenging problems for stereo algorithms would benefit from knowledge of global semantic context, rather than relying solely on local geometry. For example, given a reflective surface of a vehicle's windshield, a stereo algorithm is likely to be erroneous if it relies solely on the local appearance of the reflective surface to compute geometry. Rather, it would be advantageous to understand the semantic context of this surface (that it belongs to a vehicle) to infer the local geometry. In this paper we show how to learn a stereo regression model which can be trained end-to-end, with the capacity to understand wider contextual information.

Stereo algorithms which leverage deep learning representations have so far been largely focused on using them to generate unary terms [48, 32]. Applying cost matching on the deep unary representations performs poorly when estimating pixel disparities [32, 48]. Traditional regularization and post processing steps are still used, such as semi-global block matching and left-right consistency checks [23]. These regularization steps are severely limited because they are hand-engineered, shallow functions, which are still susceptible to the aforementioned problems.

This paper asks the question: can we formulate the entire stereo vision problem with deep learning, using our understanding of stereo geometry? The main contribution of this paper is an end-to-end deep learning method to estimate per-pixel disparity from a single rectified image pair. Our architecture is illustrated in Figure 1. It explicitly reasons about geometry by forming a cost volume, while also reasoning about semantics using a deep convolutional network formulation. We achieve this with two key ideas:

• We learn to incorporate context directly from the data, employing 3-D convolutions to learn to regularize the cost volume over height × width × disparity dimensions,
• We use a soft argmin function, which is fully differentiable, and allows us to regress sub-pixel disparity values from the disparity cost volume.

[Figure 1 diagram. Pipeline labels: Input Stereo Images → 2D Convolution (shared weights) → Cost Volume → Multi-Scale 3D Convolution → 3D Deconvolution → Soft ArgMax → Disparities, over height × width × disparity dimensions.]

Figure 1: Our end-to-end deep stereo regression architecture, GC-Net (Geometry and Context Network).

Section 3 introduces this model and illustrates these components in more detail. In Section 4 we evaluate our model on the synthetic Scene Flow dataset [36] and set a new state-of-the-art benchmark on the KITTI 2012 and 2015 datasets [14, 35]. Finally, in Section 4.3 we present evidence that our model has the capacity to learn semantic reasoning and contextual information.

2. Related Work

The problem of computing depth from stereo image pairs has been studied for quite some time [5]. A survey by Scharstein and Szeliski [39] provides a taxonomy of stereo algorithms as performing some subset of: matching cost computation, cost support aggregation, disparity computation and optimization, or disparity refinement. This survey also described the first Middlebury dataset and associated evaluation metrics, using structured light to provide ground truth. The KITTI dataset [14, 35] is a larger dataset from data collected from a moving vehicle with ground truth supplied by LIDAR. These datasets first motivated improved hand-engineered techniques for all components of stereo, of which we mention a few notable examples.

The matching cost is a measure of pixel dissimilarity for potentially corresponding image locations [25], of which absolute differences, squared differences, and truncated differences are examples. Local descriptors based on gradients [16] or binary patterns, such as CENSUS [45] or BRIEF [7, 22], can be employed. Instead of aggregating neighboring pixels equally as patch-based matching costs do, awareness of the image content can more heavily incorporate neighboring pixels possessing similar appearance, under the assumption that they are more likely to come from the same surface and disparity. A survey of these techniques is provided by Tombari et al. [43]. Local matching costs may also be optimized within a global framework, usually minimizing an energy function combining a local data term and a pairwise smoothness term. Global optimization can be accomplished using graph cuts [27] or belief propagation [26], which can be extended to slanted surfaces [6]. A popular and effective approximation to global optimization is the Semi-Global Matching (SGM) of Hirschmüller [24], where dynamic programming optimizes a pathwise form of the energy function in many directions.

In addition to providing a basis for comparing stereo algorithms, the ground truth depth data from these datasets provides the opportunity to use machine learning for improving stereo algorithms in a variety of ways. Zhang and Seitz [52] alternately optimized disparity and Markov random field regularization parameters. Scharstein and Pal [38] learn conditional random field (CRF) parameters, and Li and Huttenlocher [29] train a non-parametric CRF model using the structured support vector machine. Learning can also be employed to estimate the confidence of a traditional stereo algorithm, such as the random forest approach of Haeusler et al. [19]. Such confidence measures can improve the result of SGM as shown by Park and Yoon [37].

Deep convolutional neural networks can be trained to match image patches [46]. A deep network trained to match 9×9 image patches, followed by non-learned cost aggregation and regularization, was shown by Žbontar and LeCun [47, 49] to produce then state-of-the-art results. Luo et al. presented a notably faster network for computing local matching costs as a multi-label classification of disparities using a Siamese network [33]. A multi-scale embedding model from Chen et al. [9] also provided good local matching scores. Also noteworthy is the DeepStereo work of Flynn et al. [12], which learns a cost volume combined with a separate conditional color model to predict novel viewpoints in a multi-view stereo setting.

Mayer et al. created a large synthetic dataset to train a network for disparity estimation (as well as optical flow) [34], improving the state-of-the-art. As one variant of the network, a 1-D correlation was proposed along the disparity line, which is a multiplicative approximation to the stereo cost volume. In addition, this volume is concatenated with convolutional features from a single image and succeeded by a series of further convolutions. In contrast, our work does not collapse the feature dimension when computing the cost volume, and uses 3-D convolutions to incorporate context.

Though the focus of this work is on binocular stereo, it is worth noting that the representational power of deep convolutional networks also enables depth estimation from a single monocular image [10]. Deep learning is combined with a continuous CRF by Liu et al. [30]. Instead of supervising training with labeled ground truth, unlabeled stereo pairs can be used to train a monocular model [13].

In our work, we apply no post-processing or regularization. Our network can explicitly reason about geometry by forming a fully differentiable cost volume. Our network learns to incorporate context from the data with a 3-D convolutional architecture. We don't learn a probability distribution, cost function, or classification result. Rather, our network is able to directly regress a sub-pixel estimate of disparity from a stereo image pair.

Layer | Description | Output Tensor Dim.
 | Input image | H × W × C
Unary features (section 3.1)
1 | 5×5 conv, 32 features, stride 2 | 1/2 H × 1/2 W × F
2 | 3×3 conv, 32 features | 1/2 H × 1/2 W × F
3 | 3×3 conv, 32 features | 1/2 H × 1/2 W × F
 | add layer 1 and 3 features (residual connection) | 1/2 H × 1/2 W × F
4-17 | (repeat layers 2, 3 and residual connection) × 7 | 1/2 H × 1/2 W × F
18 | 3×3 conv, 32 features (no ReLU or BN) | 1/2 H × 1/2 W × F
Cost volume (section 3.2)
 | Cost Volume | 1/2 D × 1/2 H × 1/2 W × 2F
Learning regularization (section 3.3)
19 | 3-D conv, 3×3×3, 32 features | 1/2 D × 1/2 H × 1/2 W × F
20 | 3-D conv, 3×3×3, 32 features | 1/2 D × 1/2 H × 1/2 W × F
21 | From cost volume: 3-D conv, 3×3×3, 64 features, stride 2 | 1/4 D × 1/4 H × 1/4 W × 2F
22 | 3-D conv, 3×3×3, 64 features | 1/4 D × 1/4 H × 1/4 W × 2F
23 | 3-D conv, 3×3×3, 64 features | 1/4 D × 1/4 H × 1/4 W × 2F
24 | From 21: 3-D conv, 3×3×3, 64 features, stride 2 | 1/8 D × 1/8 H × 1/8 W × 2F
25 | 3-D conv, 3×3×3, 64 features | 1/8 D × 1/8 H × 1/8 W × 2F
26 | 3-D conv, 3×3×3, 64 features | 1/8 D × 1/8 H × 1/8 W × 2F
27 | From 24: 3-D conv, 3×3×3, 64 features, stride 2 | 1/16 D × 1/16 H × 1/16 W × 2F
28 | 3-D conv, 3×3×3, 64 features | 1/16 D × 1/16 H × 1/16 W × 2F
29 | 3-D conv, 3×3×3, 64 features | 1/16 D × 1/16 H × 1/16 W × 2F
30 | From 27: 3-D conv, 3×3×3, 128 features, stride 2 | 1/32 D × 1/32 H × 1/32 W × 4F
31 | 3-D conv, 3×3×3, 128 features | 1/32 D × 1/32 H × 1/32 W × 4F
32 | 3-D conv, 3×3×3, 128 features | 1/32 D × 1/32 H × 1/32 W × 4F
33 | 3×3×3, 3-D transposed conv, 64 features, stride 2 | 1/16 D × 1/16 H × 1/16 W × 2F
 | add layer 33 and 29 features (residual connection) | 1/16 D × 1/16 H × 1/16 W × 2F
34 | 3×3×3, 3-D transposed conv, 64 features, stride 2 | 1/8 D × 1/8 H × 1/8 W × 2F
 | add layer 34 and 26 features (residual connection) | 1/8 D × 1/8 H × 1/8 W × 2F
35 | 3×3×3, 3-D transposed conv, 64 features, stride 2 | 1/4 D × 1/4 H × 1/4 W × 2F
 | add layer 35 and 23 features (residual connection) | 1/4 D × 1/4 H × 1/4 W × 2F
36 | 3×3×3, 3-D transposed conv, 32 features, stride 2 | 1/2 D × 1/2 H × 1/2 W × F
 | add layer 36 and 20 features (residual connection) | 1/2 D × 1/2 H × 1/2 W × F
37 | 3×3×3, 3-D transposed conv, 1 feature (no ReLU or BN) | D × H × W × 1
Soft argmin (section 3.4)
 | Soft argmin | H × W

Table 1: Summary of our end-to-end deep stereo regression architecture, GC-Net. Each 2-D or 3-D convolutional layer represents a block of convolution, batch normalization and ReLU non-linearity (unless otherwise specified).

3. Learning End-to-end Disparity Regression

Rather than design any step of the stereo algorithm by hand, we would like to learn an end-to-end mapping from an image pair to disparity maps using deep learning. We hope to learn a more optimal function directly from the data. Additionally, this approach promises to reduce much of the engineering design complexity. However, our intention is not to naively construct a machine learning architecture as a black box to model stereo. Instead, we advocate the use of the insights from many decades of multi-view geometry research [20] to guide architectural design. Therefore, we form our model by developing differentiable layers representing each major component in traditional stereo pipelines [39]. This allows us to learn the entire model end-to-end while leveraging our geometric knowledge of the stereo problem.

Our architecture, GC-Net (Geometry and Context Network), is illustrated in Figure 1, with a more detailed layer-by-layer definition in Table 1. In the remainder of this section we discuss each component in detail. Later, in Section 4.1, we present quantitative results justifying our design decisions.

3.1. Unary Features

First we learn a deep representation to use to compute the stereo matching cost. Rather than compute the stereo matching cost using raw pixel intensities, it is common to use a feature representation. The motivation is to compare a descriptor which is more robust to the ambiguities in photometric appearance and can incorporate local context.

In our model we learn a deep representation through a number of 2-D convolutional operations. Each convolutional layer is followed by a batch normalization layer and a rectified linear non-linearity. To reduce computational demand, we initially apply a 5×5 convolutional filter with stride of two to subsample the input. Following this layer, we append eight residual blocks [21] which each consist of two 3×3 convolutional filters in series. Our final model architecture is shown in Table 1. We form the unary features by passing both left and right stereo images through these layers. We share the parameters between the left and right towers to more effectively learn corresponding features.
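To make the layer definitions in Table 1 concrete, the following is a minimal PyTorch sketch of this unary tower (the paper's implementation uses TensorFlow; the class name is ours and only the filter sizes, F=32 and the residual structure follow Table 1):

```python
import torch
import torch.nn as nn

class UnaryTower(nn.Module):
    """Sketch of the 2-D feature tower (layers 1-18 in Table 1).
    Applying the same instance to both images shares the weights."""
    def __init__(self, in_channels=3, F=32, num_res_blocks=8):
        super().__init__()
        # layer 1: 5x5 conv, stride 2, halves the spatial resolution
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, F, kernel_size=5, stride=2, padding=2),
            nn.BatchNorm2d(F),
            nn.ReLU(inplace=True),
        )
        # layers 2-17: eight residual blocks of two 3x3 convolutions
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(F, F, kernel_size=3, padding=1),
                nn.BatchNorm2d(F),
                nn.ReLU(inplace=True),
                nn.Conv2d(F, F, kernel_size=3, padding=1),
                nn.BatchNorm2d(F),
                nn.ReLU(inplace=True),
            )
            for _ in range(num_res_blocks)
        ])
        # layer 18: final conv with no ReLU or batch norm
        self.head = nn.Conv2d(F, F, kernel_size=3, padding=1)

    def forward(self, image):
        x = self.stem(image)
        for block in self.blocks:
            x = x + block(x)  # residual connection
        return self.head(x)  # N x F x H/2 x W/2
```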

3.2. Cost Volume

We use the deep unary features to compute the stereo matching cost by forming a cost volume. While a naive approach might simply concatenate the left and right feature maps, forming a cost volume allows us to constrain the model in a way which preserves our knowledge of the geometry of stereo vision. For each stereo image, we form a cost volume of dimensionality height × width × (max disparity + 1) × feature size. We achieve this by concatenating each unary feature with their corresponding unary from the opposite stereo image across each disparity level, and packing these into the 4D volume.
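A sketch of this concatenation-based construction, assuming NCHW feature maps from the shared unary tower and disparities measured in feature-map pixels:

```python
import torch

def build_cost_volume(left_feat, right_feat, max_disp):
    """Sketch of the Section 3.2 cost volume: concatenate each left
    unary feature with the right-image feature at every candidate
    disparity, giving an N x 2F x (max_disp+1) x H x W volume."""
    n, f, h, w = left_feat.shape
    volume = left_feat.new_zeros(n, 2 * f, max_disp + 1, h, w)
    for d in range(max_disp + 1):
        if d == 0:
            volume[:, :f, d] = left_feat
            volume[:, f:, d] = right_feat
        else:
            # a left pixel at column x corresponds to the right pixel at x - d
            volume[:, :f, d, :, d:] = left_feat[:, :, :, d:]
            volume[:, f:, d, :, d:] = right_feat[:, :, :, :-d]
    return volume
```

Columns with x < d have no valid right-image correspondence and are simply left as zeros here; how to treat them is an implementation detail the paper does not specify.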

Crucially, we retain the feature dimension through this operation, unlike previous work which uses a dot product style operation which decimates the feature dimension [32]. This allows us to learn to incorporate context which can operate over feature unaries (Section 3.3). We find that forming a cost volume with concatenated features improves performance over subtracting features or using a distance metric. Our intuition is that by maintaining the feature unaries, the network has the opportunity to learn an absolute representation (because it is not a distance metric) and carry this through to the cost volume. This gives the architecture the capacity to learn semantics. In contrast, using a distance metric restricts the network to only learning relative representations between features, and cannot carry absolute feature representations through to the cost volume.

3.3. Learning Context

Given this disparity cost volume, we would now like to learn a regularization function which is able to take into account context in this volume and refine our disparity estimate. The matching costs between unaries can never be perfect, even when using a deep feature representation. For example, in regions of uniform pixel intensity (for example, sky) the cost curve will be flat for any features based on a fixed, local context. We find that regions like this can cause multi-modal matching cost curves across the disparity dimension. Therefore we wish to learn to regularize and improve this volume.

We propose to use three-dimensional convolutional operations to filter and refine this representation. 3-D convolutions are able to learn feature representations from the height, width and disparity dimensions. Because we compute the cost curve for each unary feature, we can learn convolutional filters from this representation. In Section 4.1 we show the importance of these 3-D filters for learning context and significantly improving stereo performance.

The difficulty with 3-D convolutions is that the additional dimension is a burden on the computational time for both training and inference. Deep encoder-decoder architectures designed for dense prediction tasks get around their computational burden by encoding sub-sampled feature maps, followed by up-sampling in a decoder [3]. We extend this idea to three dimensions. By sub-sampling the input with stride two, we also reduce the 3-D cost volume size by a factor of eight. We form our 3-D regularization network with four levels of sub-sampling. As the unaries are already sub-sampled by a factor of two, the features are sub-sampled by a total factor of 32. This allows us to explicitly leverage context with a wide field of view. We apply two 3×3×3 convolutions in series for each encoder level. To make dense predictions with the original input resolution, we employ a 3-D transposed convolution to up-sample the volume in the decoder. The full architecture is described in Table 1.

Sub-sampling is useful to increase each feature's receptive field while reducing computation. However, it also reduces spatial accuracy and fine-grained details through the loss of resolution. For this reason, we add each higher resolution feature map before up-sampling. These residual layers have the benefit of retaining higher frequency information, while the up-sampled features provide an attentive feature map with a larger field of view.

Finally, we apply a single 3-D transposed convolution (deconvolution), with stride two and a single feature output. This layer is necessary to make dense prediction in the original input dimensions because the feature unaries were sub-sampled by a factor of two. This results in the final, regularized cost volume with size H×W×D.
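The sketch below compresses this encoder-decoder idea to a single down/up level (Table 1 uses four sub-sampling levels with a skip connection at each scale; the class name, layer counts and channel widths here are simplified assumptions):

```python
import torch
import torch.nn as nn

def conv3d_bn_relu(c_in, c_out, stride=1):
    return nn.Sequential(
        nn.Conv3d(c_in, c_out, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm3d(c_out),
        nn.ReLU(inplace=True),
    )

class CostRegularization(nn.Module):
    """One-level sketch of the 3-D encoder-decoder of Section 3.3."""
    def __init__(self, F=32):
        super().__init__()
        self.pre = nn.Sequential(conv3d_bn_relu(2 * F, F), conv3d_bn_relu(F, F))
        self.down = nn.Sequential(conv3d_bn_relu(F, 2 * F, stride=2),
                                  conv3d_bn_relu(2 * F, 2 * F))
        self.up = nn.Sequential(
            nn.ConvTranspose3d(2 * F, F, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.BatchNorm3d(F),
            nn.ReLU(inplace=True),
        )
        # final transposed conv: single feature, no ReLU or batch norm;
        # restores full resolution since the unaries were at half scale
        self.out = nn.ConvTranspose3d(F, 1, kernel_size=3, stride=2,
                                      padding=1, output_padding=1)

    def forward(self, volume):                 # N x 2F x D/2 x H/2 x W/2
        skip = self.pre(volume)
        x = self.up(self.down(skip)) + skip    # residual skip before up-sampling
        return self.out(x).squeeze(1)          # N x D x H x W regularized costs
```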
3.4. Differentiable ArgMin

Typically, stereo algorithms produce a final cost volume from the matching cost unaries. From this volume, we may estimate disparity by performing an argmin operation over the cost volume's disparity dimension. However, this operation has two problems:

• it is discrete and is unable to produce sub-pixel disparity estimates,
• it is not differentiable and therefore unable to be trained using back-propagation.

To overcome these limitations, we define a soft argmin¹ which is both fully differentiable and able to regress a smooth disparity estimate. First, we convert the predicted costs, c_d (for each disparity, d), from the cost volume to a probability volume by taking the negative of each value. We normalize the probability volume across the disparity dimension with the softmax operation, σ(·). We then take the sum of each disparity, d, weighted by its normalized probability. A graphical illustration is shown in Figure 2 and it is defined mathematically in (1):

$$\text{soft argmin} := \sum_{d=0}^{D_{\max}} d \times \sigma(-c_d) \qquad (1)$$

This operation is fully differentiable and allows us to train and regress disparity estimates. We note that a similar function was first introduced by [4] and referred to as a soft-attention mechanism. Here, we show how to apply it for the stereo regression problem.

However, compared to the argmin operation, its output is influenced by all values. This leaves it susceptible to multi-modal distributions, as the output will not take the most likely value. Rather, it will estimate a weighted average of all modes. To overcome this limitation, we rely on the network's regularization to produce a disparity probability distribution which is predominantly unimodal. The network can also pre-scale the matching costs to control the peakiness (sometimes called temperature) of the normalized post-softmax probabilities (Figure 2). We explicitly omit batch normalization from the final convolution layer in the unary tower to allow the network to learn this from the data.

¹Note that if we wished for our network to learn probabilities, rather than cost, this function could easily be adapted to a soft argmax operation.
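In code, Equation (1) is only a few lines; a sketch operating on an N x D x H x W cost volume:

```python
import torch
import torch.nn.functional as F

def soft_argmin(costs):
    """Equation (1): probability-weighted sum of disparity indices.
    costs: N x D x H x W matching costs c_d; returns N x H x W disparities."""
    prob = F.softmax(-costs, dim=1)  # sigma(-c_d), normalized over disparity
    disparities = torch.arange(costs.size(1), device=costs.device,
                               dtype=costs.dtype).view(1, -1, 1, 1)
    return (prob * disparities).sum(dim=1)  # sub-pixel, fully differentiable
```

Because both the softmax and the weighted sum are differentiable, gradients of the disparity loss flow back through this layer into the cost volume.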

[Figure 2 plots: (a) Soft ArgMin; (b) Multi-modal distribution; (c) Multi-modal distribution with prescaling. Each column shows a cost curve over disparities 0-60 px, its softmax probabilities, the softmax-weighted disparity indices, and the resulting soft argmin against the true argmin.]

Figure 2: A graphical depiction of the soft argmin operation (Section 3.4) which we propose in this work. It is able to take a cost curve along each disparity line and output an estimate of the argmin by summing the product of each disparity's softmax probability and its disparity index. (a) demonstrates that this very accurately captures the true argmin when the curve is uni-modal. (b) demonstrates a failure case when the data is bi-modal with one peak and one flat region. (c) demonstrates that this failure may be avoided if the network learns to pre-scale the cost curve, because the softmax probabilities will tend to be more extreme, producing a uni-modal result.

3.5. Loss

We train our entire model end-to-end from a random initialization. We train our model with supervised learning using ground truth depth data. In the case of using LIDAR to label ground truth values (e.g. KITTI dataset [14, 35]) these labels may be sparse. Therefore, we average our loss over the labeled pixels, N. We train our model using the absolute error between the ground truth disparity, d_n, and the model's predicted disparity, d̂_n, for pixel n. This supervised regression loss is defined in (2):

$$\text{Loss} = \frac{1}{N} \sum_{n=1}^{N} \left\lVert d_n - \hat{d}_n \right\rVert_1 \qquad (2)$$

In the following section we show that formulating our model as a regression problem allows us to regress with sub-pixel accuracy and outperform classification approaches. Additionally, formulating a regression model makes it possible to leverage unsupervised learning losses based on photometric reprojection error [13].
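A direct translation of Equation (2), with a validity mask for the sparse LIDAR labels mentioned above (the mask convention is our assumption):

```python
import torch

def disparity_loss(pred, gt, valid):
    """Equation (2): mean absolute error over the N labeled pixels.
    valid is a boolean mask marking pixels that have ground truth."""
    return (pred - gt).abs()[valid].mean()
```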
4. Experimental Evaluation

In this section we present qualitative and quantitative results on two datasets, Scene Flow [36] and KITTI [14, 35]. Firstly, in Section 4.1 we experiment with different variants of our model and justify a number of our design choices using the Scene Flow dataset [36]. In Section 4.2 we present results of our approach on the KITTI dataset and set a new state-of-the-art benchmark. Finally, we measure our model's capacity to learn context in Section 4.3.

For the experiments in this section, we implement our architecture using TensorFlow [1]. All models are optimized end-to-end with RMSProp [42] and a constant learning rate of 1×10⁻³. We train with a batch size of 1 using a 256×512 randomly located crop from the input images. Before training we normalize each image such that the pixel intensities range from −1 to 1. We trained the network (from a random initialization) on Scene Flow for approximately 150k iterations, which takes two days on a single NVIDIA Titan-X GPU. For the KITTI dataset we fine-tune the models pre-trained on Scene Flow for a further 50k iterations. For our experiments on Scene Flow we use F=32, H=540, W=960, D=192 and on the KITTI dataset we use F=32, H=388, W=1240, D=192 for feature size, image height, image width and maximum disparity, respectively.
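For illustration, a hypothetical training step matching the setup just described; GCNet here stands for a model composed of the component sketches above, and sample_crop for a data loader, neither of which appears in the paper:

```python
import torch

model = GCNet(F=32, max_disp=192)   # hypothetical composition of the sketches above
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)  # constant rate

for step in range(150_000):  # roughly the reported Scene Flow schedule
    # hypothetical loader: 256x512 random crops, intensities scaled to [-1, 1]
    left, right, gt, valid = sample_crop(height=256, width=512)
    loss = disparity_loss(model(left, right), gt, valid)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```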
4.1. Model Design Analysis

In Table 2 we present an ablation study to compare a number of different model variants and justify our design choices.

Model | >1 px | >3 px | >5 px | MAE (px) | RMS (px) | Param. | Time (s)
1. Comparison of architectures
Unaries only (omitting all 3-D conv layers 19-36) w/ Regression Loss | 97.9 | 93.7 | 89.4 | 36.6 | 47.6 | 0.16M | 0.29
Unaries only (omitting all 3-D conv layers 19-36) w/ Classification Loss | 51.9 | 24.3 | 21.7 | 13.1 | 36.0 | 0.16M | 0.29
Single scale 3-D context (omitting 3-D conv layers 21-36) | 34.6 | 24.2 | 21.2 | 7.27 | 20.4 | 0.24M | 0.84
Hierarchical 3-D context (all 3-D conv layers) | 16.9 | 9.34 | 7.22 | 2.51 | 12.4 | 3.5M | 0.95
2. Comparison of loss functions
GC-Net + Classification loss | 19.2 | 12.2 | 10.4 | 5.01 | 20.3 | 3.5M | 0.95
GC-Net + Soft classification loss [32] | 20.6 | 12.3 | 10.4 | 5.40 | 25.1 | 3.5M | 0.95
GC-Net + Regression loss | 16.9 | 9.34 | 7.22 | 2.51 | 12.4 | 3.5M | 0.95
GC-Net (final architecture with regression loss) | 16.9 | 9.34 | 7.22 | 2.51 | 12.4 | 3.5M | 0.95

Table 2: Results on the Scene Flow dataset [36], which contains 35,454 training and 4,370 testing images of size 960×540 px from an array of synthetic scenes. We compare different architecture variants to justify our design choices. The first experiment shows the importance of the 3-D convolutional architecture. The second experiment shows the gain in performance we get from using a regression loss.

We wish to evaluate the importance of the key ideas in this paper: using a regression loss over a classification loss, and learning 3-D convolutional filters for cost volume regularization. We use the synthetic Scene Flow dataset [36] for these experiments, which contains 35,454 training and 4,370 testing images. We use this dataset for two reasons. Firstly, we know perfect, dense ground truth from the synthetic scenes, which removes any discrepancies due to erroneous labels. Secondly, the dataset is large enough to train the model without over-fitting. In contrast, the KITTI dataset only contains 200 training images, and we observe that the model is susceptible to over-fitting to this very small dataset. With tens of thousands of training images we do not have to consider over-fitting in our evaluation.

The first experiment in Table 2 shows that including the 3-D filters performs significantly better than learning unaries only. We compare our full model (as defined in Table 1) to a model which uses only unary features (omitting all 3-D convolutional layers 19-36) and a model which omits the hierarchical 3-D convolution (omitting layers 21-36). We observe that the 3-D filters are able to regularize and smooth the output effectively, while learning to retain sharpness and accuracy in the output disparity map. We find that the hierarchical 3-D model outperforms the vanilla 3-D convolutional model by aggregating a much larger context, without significantly increasing computational demand.

The second experiment in Table 2 compares our regression loss function to baselines which classify disparities using hard or soft classification as proposed in [32]. Hard classification trains the network to classify disparities in the cost volume as probabilities using cross entropy loss with a 'one hot' encoding. Soft classification (used by [32]) smooths this encoding to learn a Gaussian distribution centered around the correct disparity value. In Table 2 we observe that our regression approach outperforms both hard and soft classification. This is especially noticeable for the pixel accuracy metrics and the percentage of pixels which are within one pixel of the true disparity, because the regression loss allows the model to predict with sub-pixel accuracy.

Figure 3 plots validation error during training for each of the networks compared in this section. We observe that the classification loss converges faster; however, the regression loss performs best overall.

[Figure 3 plot: validation error (% < 1 px disparity error) versus training iterations for Soft Classification, Hard Classification and GC-Net.]

Figure 3: Validation error (percentage of disparities with error less than 1 px) during training with the Scene Flow dataset. Classification loss trains faster, however using a regression loss results in better performance.

4.2. KITTI Benchmark

In Table 3 we evaluate the performance of our model on the KITTI 2012 and 2015 stereo datasets [14, 35]. These consist of challenging and varied road scene imagery collected from a test vehicle. Ground truth depth maps for training and evaluation are obtained from LIDAR data. KITTI is a prominent dataset for benchmarking stereo algorithms. The downside is that it only contains 200 training images, which handicaps learning algorithms.

(a) KITTI 2012 test data qualitative results. From left: left stereo input image, disparity prediction, error map.

(b) KITTI 2015 test data qualitative results. From left: left stereo input image, disparity prediction, error map.

(c) Scene Flow test set qualitative results. From left: left stereo input image, disparity prediction, ground truth.

Figure 4: Qualitative results. By learning to incorporate wider context our method is often able to handle challenging
scenarios, such as reflective, thin or texture-less surfaces. By explicitly learning geometry in a cost volume, our method
produces sharp results and can also handle large occlusions.
Method | >2 px Non-Occ / All | >3 px Non-Occ / All | >5 px Non-Occ / All | Mean Error Non-Occ / All | Runtime (s)
SPS-st [44] | 4.98 / 6.28 | 3.39 / 4.41 | 2.33 / 3.00 | 0.9 px / 1.0 px | 2
Deep Embed [8] | 5.05 / 6.47 | 3.10 / 4.24 | 1.92 / 2.68 | 0.9 px / 1.1 px | 3
Content-CNN [32] | 4.98 / 6.51 | 3.07 / 4.29 | 2.03 / 2.82 | 0.8 px / 1.0 px | 0.7
MC-CNN [50] | 3.90 / 5.45 | 2.43 / 3.63 | 1.64 / 2.39 | 0.7 px / 0.9 px | 67
PBCP [40] | 3.62 / 5.01 | 2.36 / 3.45 | 1.62 / 2.32 | 0.7 px / 0.9 px | 68
Displets v2 [18] | 3.43 / 4.46 | 2.37 / 3.09 | 1.72 / 2.17 | 0.7 px / 0.8 px | 265
GC-Net (this work) | 2.71 / 3.46 | 1.77 / 2.30 | 1.12 / 1.46 | 0.6 px / 0.7 px | 0.9

(a) KITTI 2012 test set results [14]. This benchmark contains 194 train and 195 test gray-scale image pairs.

Method | All Pixels D1-bg / D1-fg / D1-all | Non-Occluded Pixels D1-bg / D1-fg / D1-all | Runtime (s)
MBM [11] | 4.69 / 13.05 / 6.08 | 4.33 / 12.12 / 5.61 | 0.13
ELAS [15] | 7.86 / 19.04 / 9.72 | 6.88 / 17.73 / 8.67 | 0.3
Content-CNN [32] | 3.73 / 8.58 / 4.54 | 3.32 / 7.44 / 4.00 | 1.0
DispNetC [34] | 4.32 / 4.41 / 4.34 | 4.11 / 3.72 / 4.05 | 0.06
MC-CNN [50] | 2.89 / 8.88 / 3.89 | 2.48 / 7.64 / 3.33 | 67
PBCP [40] | 2.58 / 8.74 / 3.61 | 2.27 / 7.71 / 3.17 | 68
Displets v2 [18] | 3.00 / 5.56 / 3.43 | 2.73 / 4.95 / 3.09 | 265
GC-Net (this work) | 2.21 / 6.16 / 2.87 | 2.02 / 5.58 / 2.61 | 0.9

(b) KITTI 2015 test set results [35]. This benchmark contains 200 training and 200 test color image pairs. The qualifier 'bg' refers to background pixels which contain static elements, 'fg' refers to dynamic object pixels, while 'all' is all pixels (fg+bg). The results show the percentage of pixels which have greater than three pixels or 5% disparity error from all 200 test images.

Table 3: Comparison to other stereo methods on the test sets of the KITTI 2012 and 2015 benchmarks [14, 35]. Our method sets a new state-of-the-art on these two competitive benchmarks, outperforming all other approaches.

For this reason, we pre-train our model on the large synthetic dataset, Scene Flow [36]. This helps to prevent our model from over-fitting the very small KITTI training dataset. We hold out 40 image pairs as our validation set.

Tables 3a and 3b compare our method, GC-Net (Geometry and Context Network), to other approaches on the KITTI 2012 and 2015 datasets, respectively². Our method achieves state-of-the-art results for both KITTI benchmarks, by a notable margin. We improve on the state-of-the-art by 9% and 22% for KITTI 2015 and 2012 respectively. Our method is also notably faster than most competing approaches, which often require expensive post-processing. In Figure 4 we show qualitative results of our method on KITTI 2012, KITTI 2015 and Scene Flow.

²Full leaderboard: www.cvlibs.net/datasets/kitti/

Our approach outperforms previous deep learning patch based methods [48, 32] which produce noisy unary potentials and are unable to predict with sub-pixel accuracy. For this reason, these algorithms do not use end-to-end learning and typically post-process the unary output with SGM regularization [11] to produce the final disparity maps.

The closest method to our architecture is DispNetC [34], which is an end-to-end regression network pre-trained on Scene Flow. However, our method outperforms this architecture by a notable margin for all test pixels. DispNetC uses a 1-D correlation layer along the disparity line as an approximation to the stereo cost volume. In contrast, our architecture more explicitly leverages geometry by formulating a full cost volume, using 3-D convolutions and a soft argmin layer, resulting in an improvement in performance.

4.3. Model Saliency

In this section we present evidence which shows our model can reason about local geometry using wider contextual information. In Figure 5 we show some examples of the model's saliency with respect to a predicted pixel's disparity. Saliency maps [41] show the sensitivity of the output with respect to each input pixel. We use the method from [51], which plots the predicted disparity as a function of systematically occluding the input images. We offset the occlusion in each stereo image by the point's disparity.
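A sketch of this occlusion-based saliency probe, under assumptions about patch size, stride and fill value that the paper does not state:

```python
import torch

def occlusion_saliency(model, left, right, px, py, patch=16, stride=8):
    """Saliency for the disparity predicted at pixel (px, py), in the
    spirit of [51]: slide an occluding patch over the images and record
    how much the prediction changes. The patch in the right image is
    shifted left by the predicted disparity so the same scene point is
    occluded in both views."""
    with torch.no_grad():
        base = model(left, right)            # N x H x W disparities
        d0 = base[0, py, px].item()
        h, w = base.shape[-2:]
        saliency = torch.zeros(h, w)
        for y in range(0, h - patch + 1, stride):
            for x in range(0, w - patch + 1, stride):
                l, r = left.clone(), right.clone()
                l[..., y:y + patch, x:x + patch] = 0
                xr = max(0, x - int(round(d0)))  # disparity-offset occlusion
                r[..., y:y + patch, xr:xr + patch] = 0
                d = model(l, r)[0, py, px].item()
                saliency[y:y + patch, x:x + patch] += abs(d - d0)
    return saliency
```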

These results show that the disparity prediction for a given point is dependent on a wide contextual field of view. For example, the disparity on the front of the car depends on the input pixels of the car and the road surface below. This demonstrates that our model is able to reason about wider context, rather than simply 9×9 local patches like previous deep learning patch-similarity stereo methods [50, 32].

[Figure 5 panels: (a) Left stereo input image; (b) Predicted disparity map; (c) Saliency map (red = stronger saliency); (d) What the network sees (input attenuated by saliency).]

Figure 5: Saliency map visualization which shows the model's effective receptive field for a selected output pixel (indicated by the white cross). This shows that our architecture is able to learn to regress stereo disparity with a large field of view and significant contextual knowledge of the scene, beyond the local geometry and appearance. For example, in the example on the right we observe that the model considers contextual information from the vehicle and surrounding road surface to estimate disparity.

5. Conclusions

We propose a novel end-to-end deep learning architecture for stereo vision. It is able to learn to regress disparity without any additional post-processing or regularization. We demonstrate the efficacy of our method on the KITTI dataset, setting a new state-of-the-art benchmark.

We show how to efficiently learn context in the disparity cost volume using 3-D convolutions. We show how to formulate it as a regression model using a soft argmin layer. This allows us to learn disparity as a regression problem, rather than classification, improving performance and enabling sub-pixel accuracy. We demonstrate that our model learns to incorporate wider contextual information.

For future work we are interested in exploring a more explicit representation of semantics to improve our disparity estimation, and reasoning under uncertainty with Bayesian convolutional neural networks.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] M. Achtelik, A. Bachrach, R. He, S. Prentice, and N. Roy. Stereo vision and laser odometry for autonomous helicopters in GPS-denied indoor environments. In SPIE Defense, Security, and Sensing, pages 733219–733219. International Society for Optics and Photonics, 2009.
[3] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561, 2015.
[4] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
[5] S. T. Barnard and M. A. Fischler. Computational stereo. ACM Computing Surveys, 14(4):553–572, 1982.
[6] M. Bleyer, C. Rhemann, and C. Rother. PatchMatch stereo - stereo matching with slanted support windows. BMVC, i(1):14.1–14.11, 2011.
[7] M. Calonder, V. Lepetit, and C. Strecha. BRIEF: Binary robust independent elementary features. In European Conference on Computer Vision (ECCV), 2010.
[8] Z. Chen, X. Sun, L. Wang, Y. Yu, and C. Huang. A deep visual correspondence embedding model for stereo matching costs. In Proceedings of the IEEE International Conference on Computer Vision, pages 972–980, 2015.
[9] Z. Chen, X. Sun, L. Wang, Y. Yu, and C. Huang. A deep visual correspondence embedding model for stereo matching costs. In Proceedings of the IEEE International Conference on Computer Vision, pages 972–980, 2016.
[10] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. NIPS, pages 1–9, 2014.
[11] N. Einecke and J. Eggert. A multi-block-matching approach for stereo. In 2015 IEEE Intelligent Vehicles Symposium (IV), pages 585–592. IEEE, 2015.
[12] J. Flynn, I. Neulander, J. Philbin, and N. Snavely. DeepStereo: Learning to predict new views from the world's imagery. CVPR, 2016.
[13] R. Garg, V. Kumar BG, and I. Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. ECCV, pages 1–16, 2016.
[14] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[15] A. Geiger, M. Roser, and R. Urtasun. Efficient large-scale stereo matching. In Asian Conference on Computer Vision, pages 25–38. Springer, 2010.
[16] A. Geiger, M. Roser, and R. Urtasun. Efficient large-scale stereo matching. Computer Vision - ACCV 2010, (1):25–38, 2010.
[17] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.

[18] F. Guney and A. Geiger. Displets: Resolving stereo ambiguities using object knowledge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4165–4175, 2015.
[19] R. Haeusler, R. Nair, and D. Kondermann. Ensemble learning for confidence measures in stereo vision. Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 305–312, 2013.
[20] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[21] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2016.
[22] P. Heise, B. Jensen, S. Klose, and A. Knoll. Fast dense stereo correspondences by binary locality sensitive hashing. ICRA, pages 1–6, 2015.
[23] H. Hirschmüller. Accurate and efficient stereo processing by semi-global matching and mutual information. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, pages 807–814. IEEE, 2005.
[24] H. Hirschmüller. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):328–341, 2008.
[25] H. Hirschmüller and D. Scharstein. Evaluation of cost functions for stereo matching. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[26] A. Klaus, M. Sormann, and K. Karner. Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure. Proceedings - International Conference on Pattern Recognition, 3:15–18, 2006.
[27] V. Kolmogorov and R. Zabih. Computing visual correspondences with occlusions using graph cuts. In International Conference on Computer Vision (ICCV), 2001.
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[29] Y. Li and D. P. Huttenlocher. Learning for stereo vision using the structured support vector machine. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[30] F. Liu, C. Shen, G. Lin, and I. Reid. Learning depth from single monocular images using deep convolutional neural fields. Pattern Analysis and Machine Intelligence, page 15, 2015.
[31] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[32] W. Luo, A. G. Schwing, and R. Urtasun. Efficient deep learning for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5695–5703, 2016.
[33] W. Luo, A. G. Schwing, and R. Urtasun. Efficient deep learning for stereo matching. CVPR, 2016.
[34] N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. CoRR, abs/1512.02134, 2015.
[35] M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[36] N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2016. arXiv:1512.02134.
[37] M. G. Park and K. J. Yoon. Leveraging stereo matching with learning-based confidence measures. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 07-12-June:101–109, 2015.
[38] D. Scharstein and C. Pal. Learning conditional random fields for stereo. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007.
[39] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1):7–42, 2002.
[40] A. Seki and M. Pollefeys. Patch based confidence prediction for dense disparity map. In British Machine Vision Conference (BMVC), 2016.
[41] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
[42] T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 2012.
[43] F. Tombari, S. Mattoccia, L. Di Stefano, and E. Addimanda. Classification and evaluation of cost aggregation methods for stereo correspondence. 26th IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2008.
[44] K. Yamaguchi, D. McAllester, and R. Urtasun. Efficient joint segmentation, occlusion labeling, stereo and flow estimation. In European Conference on Computer Vision, pages 756–771. Springer, 2014.
[45] R. Zabih and J. Woodfill. Non-parametric local transforms for computing visual correspondence. In Proceedings of European Conference on Computer Vision, pages 151–158, 1994.
[46] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 07-12-June(i):4353–4361, 2015.
[47] J. Žbontar and Y. LeCun. Computing the stereo matching cost with a convolutional neural network. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 07-12-June(1):1592–1599, 2015.
[48] J. Žbontar and Y. LeCun. Computing the stereo matching cost with a convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1592–1599, 2015.
[49] J. Žbontar and Y. LeCun. Stereo matching by training a convolutional neural network to compare image patches. CoRR, abs/1510.05970, 2015.
[50] J. Žbontar and Y. LeCun. Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research, 17:1–32, 2016.
[51] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.
[52] L. Zhang and S. M. Seitz. Estimating optimal parameters for MRF stereo from a single image pair. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(2):331–342, 2007.
