Object Segmentation Based On Saliency Extraction and Bounding Box
Jian Ma
College of Environmental Science and Engineering, Ocean University of China, Qingdao, China, 266100
yinff3@126.com

Bo Yin
College of Information Science and Engineering, Ocean University of China, Qingdao, China, 266100
ybfirst@126.com
Abstract—Object segmentation is desirable in many practical applications, e.g., object classification. However, due to the wide variety of object appearances and shapes and to confusing backgrounds, effective object segmentation is still a challenging issue. In this paper, a novel object segmentation algorithm based on saliency extraction and bounding boxes is proposed. The segmentation performance is significantly improved by introducing saliency extraction into the segmentation scheme. Firstly, bounding boxes are acquired by object detection algorithms, and the foreground and background models are constructed from the bounding boxes. Then, a saliency extraction procedure is introduced, and an adaptive weight for each pixel is computed from the extracted saliency. Finally, an undirected graph incorporating the adaptive weights is constructed, and graph cuts are applied to obtain the segmentation results. Comprehensive and comparative experiments demonstrate that the proposed algorithm achieves promising performance on a challenging publicly available dataset.

Keywords - object segmentation; saliency extraction; graph cuts; bounding box; adaptive weight

I. INTRODUCTION

Object segmentation in static images is an important and challenging issue for understanding images, and many approaches have been proposed to solve this problem. Interactive methods focus on image segmentation with prior foreground and background seeds, which are often labeled manually. These methods can achieve good segmentation results but are not suitable for practical applications. To avoid the interactive operation, researchers have proposed automatic segmentation approaches that use object detection techniques to obtain rough object regions instead of manual labels. However, due to the variety of object appearances and shapes and to confusing backgrounds, object segmentation remains a challenging problem.

There is plenty of previous work related to object segmentation. Generally, interactive methods [1][2][3] can achieve promising results, but they are not suitable for practical applications that require automatic operation; therefore, automatic schemes for object segmentation have been proposed. The automatic methods use specified foreground and background regions as input. In general, the foreground regions are specified as bounding boxes [4][5], and segmentation algorithms based on energy minimization are used to obtain the result from the bounding boxes. To get more accurate segmentation results from a detected bounding box, Lempitsky et al. [6] adopted the graph cuts algorithm with the constraint that the desired segmentation should have parts sufficiently close to each side of the bounding box, which is reasonable on the condition that the bounding box is accurate. Yang et al. [7] developed adaptive edgelet features as the unary term of a Conditional Random Field model to exploit the spatial coherence of the labels of neighboring pixels. These methods employ appearance models for the foreground and background that are estimated through object detection algorithms, and they achieve segmentation by solving energy minimization problems.

However, all the above methods that use a bounding box as input assume that every pixel in the bounding box contributes equally to the foreground model construction. This leads to a false foreground prior, especially when the structure of the object is not compact. Therefore, in this paper, we propose a novel object segmentation method based on saliency extraction and bounding boxes: using the saliency of the pixels, we give each pixel a specific weight for constructing the foreground and background models.

Our contribution is a novel object segmentation method that introduces saliency extraction into the bounding-box-based scheme. With the saliency extraction results, an adaptive weight for each pixel is computed and used in the construction of the foreground and background models.
II. PROPOSED METHOD

A. System Overview

The framework of the proposed method is illustrated in Figure 1. Generally, our method consists of three stages: foreground and background model construction, adaptive weight determination, and undirected graph construction followed by graph cuts.

Figure 1. The framework of the proposed method.
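To make the three stages concrete, the following Python sketch wires them together. It is an overview only: detect_boxes and compute_saliency are hypothetical stand-ins for the part-based detector [5][8] and the saliency extractor [9], and the remaining helpers are sketched after Sections II.B, II.C, and II.D below.

```python
import numpy as np

def segment_object(image):
    """Overview sketch of the proposed three-stage pipeline."""
    # Stage 1: detect the object and fit the appearance models.
    box = detect_boxes(image)                       # hypothetical detector [5][8]
    fg_gmm, bg_gmm = fit_fg_bg_gmms(image, box)     # Section II.B

    # Stage 2: extract saliency and derive per-pixel adaptive weights.
    saliency = compute_saliency(image)              # hypothetical extractor [9]
    h, w = image.shape[:2]
    x0, y0, x1, y1 = box
    box_mask = np.zeros((h, w), dtype=bool)
    box_mask[y0:y1, x0:x1] = True
    weights = adaptive_weights(saliency, box_mask)  # Section II.C

    # Stage 3: build the weighted graph and run graph cuts.
    return segment(image, fg_gmm, bg_gmm, weights)  # Section II.D
```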
B. Foreground and background model construction

In order to obtain the foreground and background models, an object detection method is applied first. In our scheme, we use part-based methods to get the bounding boxes [5][8]. Then, the foreground and background models are constructed as Gaussian mixture models (GMMs). The probability density function of a GMM for an observation $\mathbf{x}$ can be written as

$p(\mathbf{x};\Theta)=\sum_{i=1}^{g}\pi_i\,p_i(\mathbf{x};\theta_i)=\sum_{i=1}^{g}\frac{\pi_i}{(2\pi)^{d/2}|\Sigma_i|^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^{T}\Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)\right\}$  (1)

where $p_i(\mathbf{x};\theta_i)$ is the p.d.f. of the $i$-th Gaussian component $G_i$, $\mathbf{x}$ is an RGB color vector, $\Theta$ consists of the elements of the mean vectors $\boldsymbol{\mu}_i$ and the covariance matrices $\Sigma_i$, $d$ is the dimension of $\mathbf{x}$ ($d=3$ in our algorithm), $\theta_i=(\boldsymbol{\mu}_i,\Sigma_i)$, and $\pi_i$ is the mixing weight of the $i$-th Gaussian component of the GMM.
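As a concrete illustration, the following is a minimal sketch of this step, assuming scikit-learn's GaussianMixture, an RGB image array, and a detected box given as (x0, y0, x1, y1); the function name fit_fg_bg_gmms and the component count are ours, not the paper's.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_fg_bg_gmms(image, box, n_components=5):
    """Fit foreground/background GMMs from a detected bounding box.

    image: HxWx3 RGB array; box: (x0, y0, x1, y1) from the detector.
    Pixels inside the box seed the foreground model and pixels
    outside seed the background model; the adaptive weights of
    Section II.C would refine these contributions (omitted here).
    """
    h, w, _ = image.shape
    mask = np.zeros((h, w), dtype=bool)
    x0, y0, x1, y1 = box
    mask[y0:y1, x0:x1] = True

    fg_pixels = image[mask].astype(float)
    bg_pixels = image[~mask].astype(float)

    fg_gmm = GaussianMixture(n_components, covariance_type='full').fit(fg_pixels)
    bg_gmm = GaussianMixture(n_components, covariance_type='full').fit(bg_pixels)
    return fg_gmm, bg_gmm
```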
C. Adaptive weights determination

In order to use the prior appearance for foreground and background model construction in an effective way, we introduce an adaptive weight for each pixel. Image saliency is one of the hot research issues in the image processing domain. Cheng et al. [9] proposed a fast regional-contrast-based saliency extraction algorithm and obtained promising results, and in this paper we introduce it into our segmentation scheme. The image saliency of a region $r_k$ is defined as [9]

$S(r_k)=\sum_{r_k\neq r_i}\exp\left(-\frac{D_s(r_k,r_i)}{\sigma_s^{2}}\right)w(r_i)\,D_r(r_k,r_i)$  (2)

where $D_r(r_1,r_2)=\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}f(c_{1,i})\,f(c_{2,j})\,D(c_{1,i},c_{2,j})$, $D(c_{1,i},c_{2,j})$ is the color distance metric between pixel colors $c_{1,i}$ and $c_{2,j}$, and $D_r(r_1,r_2)$ is the color distance between regions $r_1$ and $r_2$. $w(r_i)$ is the weight of region $r_i$ [9], $D_s(r_k,r_i)$ is the spatial distance between regions $r_k$ and $r_i$, and $\sigma_s$ controls the strength of the spatial weighting: larger values of $\sigma_s$ reduce the strength of the spatial weighting. $f(c_{k,i})$ is the frequency of the $i$-th color $c_{k,i}$ among all $n_k$ colors in the $k$-th region $r_k$, with $k\in\{1,2\}$. For the details of the saliency extraction algorithm, we refer the reader to Ref. [9].
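As a concrete reading of the color distance term, $D_r$ can be computed directly from the two regions' quantized color histograms; the sketch below uses plain Euclidean distance in RGB as a stand-in for the metric $D$ (Ref. [9] works in the Lab color space).

```python
import numpy as np

def region_color_distance(colors1, freqs1, colors2, freqs2):
    """D_r(r1, r2) from Eq. (2): frequency-weighted sum of pairwise
    color distances between two regions' quantized colors.

    colors*: (n_k, 3) arrays of quantized colors in region r_k;
    freqs*:  (n_k,) arrays of the corresponding frequencies f.
    """
    # Pairwise Euclidean color distances, shape (n_1, n_2).
    d = np.linalg.norm(colors1[:, None, :] - colors2[None, :, :], axis=2)
    # sum_i sum_j f(c1_i) f(c2_j) D(c1_i, c2_j)
    return freqs1 @ d @ freqs2
```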
With the saliency extraction algorithm, the saliency map is acquired. Figure 2 gives some examples; the gray level of a pixel indicates its saliency: the whiter, the more salient. With the saliency results, we compute the adaptive weight of each pixel as follows:

$\omega_{(i,j)}=1-\frac{s_{(i,j)}-\tau}{10\,\tau},\quad \text{if } s_{(i,j)}>\tau$  (4)

where $\omega_{(i,j)}$ is the adaptive weight of the pixel at location $(i,j)$, $s_{(i,j)}$ is the saliency of pixel $(i,j)$, and $\tau$ is the threshold that gives a 95% recall rate on the training images, chosen empirically. For pixels inside the bounding boxes we use Eq. (3), and for pixels outside the bounding boxes we use Eq. (4).
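A minimal NumPy transcription of this weighting rule, assuming a saliency map scaled to [0, 255] (consistent with $\tau=40$ in Section III) and a boolean box mask, might look as follows; since Eq. (3) is not reproduced here, pixels inside the boxes are simply left at weight 1.0 in this sketch.

```python
import numpy as np

def adaptive_weights(saliency, box_mask, tau=40.0):
    """Per-pixel adaptive weights following Eq. (4).

    saliency: HxW saliency map (assumed scaled to [0, 255]);
    box_mask: HxW boolean mask, True inside the detected boxes.
    Eq. (4) down-weights salient pixels *outside* the boxes so they
    contribute less to the background model; pixels inside the
    boxes would be weighted by Eq. (3), not shown in the paper's
    extracted text, and stay at 1.0 here.
    """
    weights = np.ones_like(saliency, dtype=float)
    sel = (~box_mask) & (saliency > tau)
    weights[sel] = 1.0 - (saliency[sel] - tau) / (10.0 * tau)
    return weights
```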
Figure 2. Samples of saliency extraction. The first row shows the input images, and the second row shows the corresponding saliency extraction results.

D. Undirected graph construction and graph cuts

Graph cuts are well-known methods that have been successfully used for seeded image segmentation. Representing the image as an array $z=(z_1,\ldots,z_n,\ldots,z_N)$, with $z_n$ the color or gray value of pixel $n$, the undirected graph $G=(V,E)$ is constructed with the image pixels as the nodes ($V$) and the neighborhood relationships between pixels (e.g., a 4-neighborhood) as the edges ($E$). There are also two specially designated terminal nodes, "F" and "B", which represent the "foreground" and "background" labels. Edges between pixels are called neighborhood links (n-links), and edges connecting pixels to the terminal nodes are called terminal links (t-links). The image segmentation then corresponds to a partitioning of the nodes of the graph $G$. Defining an array of "opacity" values $\alpha=\{\alpha_1,\ldots,\alpha_n,\ldots,\alpha_N\}$ for all pixels, where $\alpha_n\in\{0,1\}$, with 0 for the background and 1 for the foreground, the image segmentation can also be expressed as inferring the unknown variables $\alpha$ from a given image $z$. Finally, the globally optimal solution for $\alpha$ is obtained by minimizing a Gibbs energy function $E(\alpha,\theta,z)$:

$E(\alpha,\theta,z)=U(\alpha,\theta,z)+V(\alpha,z)$  (5)

where $U(\alpha,\theta,z)=\sum_{n}-\log p(z_n;\theta_{\alpha_n})$ and $V(\alpha,z)=\sum_{(m,n)\in C}B_{\{m,n\}}\,[\alpha_n\neq\alpha_m]$, with $C$ the set of pairs of neighboring pixels. $U(\alpha,\theta,z)$ is the region term, which defines the cost of the t-links; $V(\alpha,z)$ is the boundary term, which defines the cost of the n-links.

The traditional graph cuts method computes the cost of the t-links with the same weight for all pixels. In our scheme, we incorporate the adaptive weight of each pixel into the cost computation of the t-links. Then $E(\alpha,\omega,\theta,z)$ is defined as

$E(\alpha,\omega,\theta,z)=U(\alpha,\omega,\theta,z)+V(\alpha,z)$  (6)

where $\omega=[\omega_1,\omega_2,\ldots,\omega_n,\ldots,\omega_N]$.

With the adaptive weights and the foreground and background models, the undirected graph can be constructed, and the segmentation result is obtained through the graph cuts method. In our scheme, we use the optimized version in Ref. [2].
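To sketch how the adaptively weighted t-links can be wired up, the following assumes the PyMaxflow library and the helpers from the earlier sketches; the contrast-sensitive boundary term $B_{\{m,n\}}$ of GrabCut [2] is simplified to a constant, so this illustrates the weighting idea rather than the paper's exact implementation.

```python
import numpy as np
import maxflow  # PyMaxflow

def segment(image, fg_gmm, bg_gmm, weights, pairwise=1.0):
    """Weighted graph-cut segmentation sketch (one reading of Eq. (6))."""
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)

    # Region term: negative log-likelihoods under each GMM.
    d_fg = -fg_gmm.score_samples(pixels).reshape(h, w)
    d_bg = -bg_gmm.score_samples(pixels).reshape(h, w)

    # Shift so all capacities are non-negative (required for min-cut;
    # a per-pixel constant on both t-links leaves the optimum
    # unchanged), then scale the region term by the adaptive weights.
    shift = min(d_fg.min(), d_bg.min())
    d_fg = (d_fg - shift) * weights
    d_bg = (d_bg - shift) * weights

    g = maxflow.Graph[float]()
    nodeids = g.add_grid_nodes((h, w))
    g.add_grid_edges(nodeids, pairwise)      # n-links, 4-neighborhood
    g.add_grid_tedges(nodeids, d_bg, d_fg)   # t-links: source=FG, sink=BG
    g.maxflow()
    # PyMaxflow marks sink-side nodes True; source side is foreground.
    return ~g.get_grid_segments(nodeids)
```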
III. EXPERIMENTAL RESULTS

In this section, we conduct comprehensive evaluations of our method. The dataset, baseline algorithms, and evaluation metrics are described first.

A. Dataset, baseline algorithms, evaluation metrics

To evaluate the effectiveness of our method, we test it on a challenging public dataset: the Parse dataset from [10]. The Parse dataset contains 305 full-body images covering a wide variety of activities, ranging from standing and walking to dancing and performing exercises. The dataset includes a standard train/test split, and we use the 205 testing images to evaluate our method. We compared our method with the optimized graph cuts method of [2] (GrabCut) and the saliency-based segmentation method of [9]. These two baselines are denoted Grabcut-Boundingbox and Grabcut-Saliency: the Grabcut-Boundingbox method uses the bounding box as input, the same as ours, while the Grabcut-Saliency method uses the saliency extraction results as input.

The F-metric, similar to that of Ref. [11], is used to evaluate the performance of our method:

$F\text{-metric}=\frac{\sum_{n}\left(r_s(n)\wedge r_g(n)\right)}{\sum_{n}\left(r_s(n)\vee r_g(n)\right)}$  (7)

where $r_s$ and $r_g$ denote the segmented binary body and the ground truth, respectively, $n$ indexes the pixels, and the operators $\wedge$ and $\vee$ perform pixel-wise AND and OR, respectively.
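For reference, Eq. (7) transcribes directly to NumPy (a minimal sketch; seg and gt are assumed to be boolean masks of equal shape):

```python
import numpy as np

def f_metric(seg, gt):
    """Eq. (7): pixel-wise AND over pixel-wise OR of the segmented
    binary body `seg` and the ground truth `gt`."""
    seg = seg.astype(bool)
    gt = gt.astype(bool)
    return np.logical_and(seg, gt).sum() / np.logical_or(seg, gt).sum()
```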
B. Comparison with other methods

As described in Section II.C, the threshold $\tau$ is obtained from the training images. A larger $\tau$ means that more pixels are assigned as important ones for foreground model construction. In our experiments, $\tau=40$, determined as the threshold that gives a 95% recall rate on the training images.

Figure 3 plots the comparison with the baseline methods, where the red line represents our method, and the yellow and blue lines correspond to the Grabcut-Boundingbox [2] and Grabcut-Saliency [9] methods, respectively. From the results we can see that our method performs best among the compared methods.

To quantitatively illustrate the performance of our method, the mean and standard deviation of the F-metric, which have been used in previous work [11], are employed. Table 1 provides the values for each compared method. From the table, we can see that our method outperforms the other methods by more than 5%.
Beyond these quantitative comparisons, we highlight the qualitative improvement in Figure 4, from which we can see that our method obtains more accurate segmentation results than the other methods. In our experiments, we find that the Grabcut-Boundingbox method gives the worst performance, because directly using the rectangular region as the prior foreground may include background and lead to a bad segmentation. In contrast, our method is more robust and promising, since we introduce adaptive weights for each pixel in the undirected graph construction, which alleviates the influence of the background regions inside the bounding boxes.

IV. CONCLUSIONS AND FUTURE WORK

We have proposed a novel method of object segmentation. Unlike the other methods, we give each pixel an adaptive weight by introducing a saliency extraction algorithm. The adaptive weights are used in the construction of the undirected graph for graph cuts.
Figure 3. F-metric over the 205 testing images for Grabcut-BoundingBox, Grabcut-Saliency, and our method.
Figure 4. Performance comparisons for different methods. (a) Input images, (b) bounding boxes, (c) saliency extraction results, (d)-(f) segmentation results: (d) our method, (e) Grabcut-Boundingbox, (f) Grabcut-Saliency.

REFERENCES

[1] Y. Y. Boykov and M.-P. Jolly, "Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images," in Proc. of IEEE International Conference on Computer Vision, 2001, vol. 1, pp. 105–112.
[2] C. Rother, V. Kolmogorov, and A. Blake, "GrabCut: Interactive foreground extraction using iterated graph cuts," ACM Transactions on Graphics (TOG), vol. 23, pp. 309–314, 2004.
[3] A. Denecke, H. Wersing, J. J. Steil, and E. Körner, "Online figure–ground segmentation with adaptive metrics in generalized LVQ," Neurocomputing, vol. 72, no. 7, pp. 1470–1482, 2009.
[4] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2005, vol. 1, pp. 886–893.
[5] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
[6] V. Lempitsky, P. Kohli, C. Rother, and T. Sharp, "Image segmentation with a bounding box prior," in Proc. of IEEE International Conference on Computer Vision, 2009, pp. 277–284.
[7] B. Yang, C. Huang, and R. Nevatia, "Segmentation of objects in a detection window by nonparametric inhomogeneous CRFs," Computer Vision and Image Understanding, vol. 115, no. 11, pp. 1473–1482, 2011.
[8] Y. Yang and D. Ramanan, "Articulated pose estimation with flexible mixtures-of-parts," in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 1385–1392.
[9] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu, "Global contrast based salient region detection," in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 409–416.
[10] D. Ramanan, "Learning to parse images of articulated bodies," Advances in Neural Information Processing Systems, vol. 19, p. 1129, 2007.
[11] S. Li, H. Lu, and L. Zhang, "Arbitrary body segmentation in static images," Pattern Recognition, vol. 45, no. 9, pp. 3402–3413, Sep. 2012.