
978-1-4673-0024-7/10/$26.00 ©2012 IEEE


2012 9th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2012)
Segmentation of Depth Image using Graph cut
Jiangming Yu Jieyu Zhao
Research Institute of Computer Science and Technology, Ningbo University, Ningbo, China, 315211



Abstract - A large number of tasks in computer vision involve
finding a target in a background image, also known as the
foreground/background discrimination problem. Various methods
have been developed to solve this problem [1, 2, 10, 11, 12].
Recently developed general-purpose object extraction techniques
use both color and edge information for segmentation. In this
paper, we apply graph cut methods to images with depth
information from a Kinect camera. We also combine the approach
with a pyramid representation, which greatly reduces the time
used by the iterative graph cut method. We estimate the
statistical model at the bottom of the pyramid and run graph cut
at the top of the pyramid. This speeds up the whole segmentation
process while keeping good segmentation quality. When the color
of the object resembles that of the background but the depth
differs, our method can still achieve a good result.
Keywords - graph cut; depth image; pyramid representation;
Gaussian mixture model
I. INTRODUCTION
Interactive image segmentation is becoming more and more popular
as a way to alleviate the problems inherent to fully automatic
segmentation, which never seems to be perfect. The ultimate goal
is to extract an object from the background of an input image
with as few user interactions as possible.
Foreground/background discrimination aims to separate an image
into two distinct parts. To achieve good quality, the information
that can be used includes the color, texture, depth, etc. of the
pixels in a segment. Some prior on the segmentation is usually
needed to obtain a good segment. The prior is usually given as
appearance models [1, 2] that better distinguish the foreground
from the background. Intuitively, the foreground and background
priors provide constraints on what the user intends to segment.
A good segmentation should be smooth within regions while
preserving the sharp discontinuities that exist at object
boundaries.
A good way to use both the color (texture) information and the
contrast (edge) information is the graph cut method [1, 2, 3].
Graph cut was first used in computer vision by Greig, Porteous
and Seheult [4] of Durham University. In the Bayesian statistical
context of smoothing noisy (or corrupted) images, they showed how
the maximum a posteriori estimate of a binary image can be
obtained exactly by maximizing the flow through an associated
network. Before that, approximate techniques such as simulated
annealing (as proposed by the Geman brothers [5]) or iterated
conditional modes (a type of greedy algorithm suggested by
Julian Besag [6]) were used to solve such image smoothing
problems.
To use graph cut, the first step is to build a graph. The graph
has two kinds of nodes. Neighbor nodes correspond to the pixels
of the image. Terminal nodes are the other kind; only two exist
in the graph, the object terminal and the background terminal.
The links between nodes are also of two kinds. A neighbor link
connects two neighbor nodes and corresponds to the prior term in
the energy function. A terminal link connects a neighbor node to
a terminal node and corresponds to the likelihood term in the
energy function. To estimate the neighbor links, the neighbor
nodes are usually treated as a probability field, typically a
Markov random field (MRF). The neighbor links preserve sharp
boundaries. The terminal links are estimated as the likelihood
under the priors that the user gives, subject to the hard
constraints of the user's interaction. The two kinds of links are
the penalties for segmenting some regions as object and others as
background. Once the graph is built, a fast segmentation can be
computed with a new max-flow algorithm [7]. In brief, the process
starts with some user interaction that provides hard constraints
for the segmentation. The graph is then built with these
constraints on the MRF, and graph cut is used to find the
globally optimal segmentation of the image. After the minimum cut
is found, the object/background regions are naturally defined by
the cut. The result gives the best balance of boundary and region
properties among all segmentations satisfying the constraints.
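The graph construction described above can be illustrated on a toy
problem. The following is a minimal sketch, not the max-flow
algorithm of [7]: it labels a one-dimensional chain of pixels with a
plain Edmonds-Karp max-flow. The function name, the cost values and
the terminal-link convention (the link to the object terminal carries
the background penalty, so that cutting it pays for a background
label) are assumptions of this sketch.

```python
from collections import deque

def min_cut_labels(fg_cost, bg_cost, pair_weight):
    """Label a 1-D chain of pixels with an s-t minimum cut.

    fg_cost[i] / bg_cost[i]: penalty for labeling pixel i object / background.
    pair_weight[i]: penalty for giving pixels i and i+1 different labels.
    Returns a list of labels, 0 = object, 1 = background.
    """
    n = len(fg_cost)
    source, sink = n, n + 1          # object terminal, background terminal
    cap = [dict() for _ in range(n + 2)]

    def add_edge(u, v, c_uv, c_vu):
        cap[u][v] = cap[u].get(v, 0.0) + c_uv
        cap[v][u] = cap[v].get(u, 0.0) + c_vu

    for i in range(n):
        add_edge(source, i, bg_cost[i], 0.0)   # terminal link, cut => background
        add_edge(i, sink, fg_cost[i], 0.0)     # terminal link, cut => object
    for i in range(n - 1):
        add_edge(i, i + 1, pair_weight[i], pair_weight[i])  # neighbor link

    def reachable_from_source():
        # Breadth-first search over edges with residual capacity left.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable_from_source()
        if sink not in parent:
            break                               # no augmenting path is left
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
    # Pixels still connected to the object terminal are labeled object.
    return [0 if i in parent else 1 for i in range(n)]
```

With weak neighbor links the cut follows the terminal costs, while
strong neighbor links force a single smooth region, which is the
balance of boundary and region properties mentioned above.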
Boykov and Jolly [1] derive a general-purpose interactive
segmentation technique that divides an image into two segments.
They impose two kinds of constraints, which they call hard
constraints and soft constraints. The hard constraints provide
clues on what the user intends to segment. The rest of the image
is segmented automatically by computing a global optimum among
all segmentations satisfying the hard constraints. One main
advantage of their method is that it applies to N-dimensional
segmentation and that its cost function is clearly defined. They
note that many previous techniques do not have a clear cost
function at all [8], and some compute only an approximate
solution. In contrast, any imperfections of their globally
optimal solution are directly related to the definition of the
cost function. Their cost function is derived from the one in [3]
in the context of MAP-MRF estimation. Their technique is based on
powerful graph cut algorithms from combinatorial optimization
[9, 10]. They apply their method only to gray-scale images.
Rother, Kolmogorov, and Blake [2] extend the method of Boykov and
Jolly [1]. They use Gaussian mixture models (GMMs) to construct
the statistical model in RGB color space, following a practice
already used for soft segmentation [13, 14]. They use two GMMs,
one for the foreground and one for the background, and develop an
iterative version of the optimization. They extend the graph cut
approach in three respects. First, they apply graph cut
iteratively, which lets them reduce the user's interaction to
dragging a rectangle around the desired object. Second, they call
this incomplete labeling, because not all pixels in the rectangle
belong to the foreground; their method adapts itself during the
iterative process. Third, when the segmentation is done, they use
a matting strategy to refine the contour. The problem is that
their method is time-consuming, and it fails when the object and
the background are similar in color space.
More recently, Vicente, Kolmogorov and Rother [10] use graph cut
with high-order MRFs. They perform graph cut based image
segmentation with connectivity priors, and formulate several
versions of the connectivity constraint added to the two-term
energy function. However, minimizing their energy function is
NP-hard. Other methods also consider high-order MRFs, see
[11, 12], but they cannot guarantee a globally optimal
segmentation.
In this paper, we extend the iterative graph cut method of
Rother, Kolmogorov, and Blake [2]. The user drags a rectangle to
cover the object in a given image. We also use the depth
information obtained from a Kinect camera. We use the depth
information in both terms of the graph cut energy function: the
data term and the spatial coherency term. Our paper makes two
main contributions.
First, we revise the two terms of the energy function to
accommodate the additional depth information from the Kinect
camera. We add a single Gaussian model to the foreground GMM and
a uniform distribution to the background GMM. Because the depth
information is independent of the color information, the
probability model of the depth can be combined with the color
model by multiplication. The foreground probability model is
therefore a GMM times a single Gaussian model of the depth
channel, and the background probability model is a GMM times a
uniform distribution over the depth channel. We model the
background depth with a uniform distribution because background
pixels are usually complex, so their depth values are spread
roughly uniformly. We furthermore add the depth information to
the second term of the energy function, the prior term. This
means that the boundary of our segmentation must follow both
sharp color discontinuities and depth discontinuities. This
overcomes the drawback that arises when the colors of the
background and the object are similar. With our energy function,
even if we cannot distinguish the object from the background
using the image color alone, the added depth term still preserves
the depth discontinuity and extracts the object.
Second, we notice that although we use a new max-flow algorithm
[7], it still takes most of the time spent on the whole graph cut
procedure, even when the image is small (about 300×300 pixels).
Since the time spent on graph cut is dominated by the max-flow
algorithm, we use a trick that changes the result only slightly
while keeping good quality. We represent the image at different
scales. Our method resembles the pyramid method but is simpler
and uses fewer layers: we use only two layers of the pyramid. We
estimate the probability model on the bottom layer and then run
graph cut on the top layer. The boundary obtained on the top
layer is projected onto the bottom layer and a new cycle begins.
In our experiments, this takes less time with roughly the same
quality.
This paper is organized as follows: Section II presents the
mathematical formulation of the segmentation problem. Section III
describes our two-layer iterative graph cut algorithm. The
implementation details and experimental results are presented
in Section IV.
II. PROBLEM FORMULATION

The problem of segmenting an image can be seen as a labeling
problem. We assign a label $l_i$ ($i = 1, \dots, T$) to every
pixel in the image, where $T$ is the number of pixels in the
image and $l_i$ specifies the assignment of pixel $i$. We use
$L = (l_1, \dots, l_i, \dots, l_T)$, $l_i \in \{0, 1\}$, to
denote the labeling of the whole image. Each $l_i$ is either 0 or
1, where 0 denotes the object and 1 denotes the background. The
vector $L$ defines the segmentation. Furthermore, we use $o_i$ to
denote the observation at pixel $i$. The observation in a color
image is RGB; in our problem we simultaneously get the depth
information from the Kinect sensor, so the observation data is
RGBD. We use $O = (o_1, \dots, o_i, \dots, o_T)$ to denote the
whole image data. We can now solve the segmentation problem in a
probabilistic framework, that is,
$\hat{l} = \arg\max_l p(l \mid o)$: we maximize the posterior
$p(l \mid o)$ to get the optimal contours. This posterior can be
written as $p(l \mid o) \propto p(o \mid l)\, p(l)$. The first
term is the observation likelihood (calculated from the hard
constraints of the user's interaction) and the second term is the
prior (calculated in the Markov random field). Because this
formulation cannot be computed directly, we rewrite it as
follows:
$$p(l \mid o) \propto \exp(-E) \tag{1}$$
where
$$E = \lambda A(l, o) + B(l) \tag{2}$$
$$A(l, o) = \sum_i f_i(o_i \mid l_i) \tag{3}$$
$$B(l) = \sum_{(i,j) \in N} g_i(l_i, l_j) \tag{4}$$
The maximization problem can now be computed as the minimization
of the energy $E$. The first term $A(l, o)$ in equation (2) is
known as the regional term; it gives the individual penalties for
assigning pixel $i$ to the object or to the background. This term
is calculated from the hard constraints of the user's
interaction. If the RGBD data of a pixel is close to the
probability model constructed from the user's interaction, the
penalty for labeling this pixel as object or background is small;
otherwise the penalty is large. The term $B(l)$ comprises the
boundary properties of the segmentation $L$; it is interpreted as
a penalty for a discontinuity between $i$ and $j$. This penalty
is large if the RGBD observations of two neighboring pixels in
the MRF neighborhood system are similar, and small if they are
very different. The coefficient $\lambda$ specifies the relative
importance of the region properties term $A(l, o)$ versus the
boundary properties term $B(l)$.
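The step from the posterior to the energy can be written out as a
short derivation. The sketch below assumes, as is common for MAP-MRF
models but not stated explicitly in the text, that the observations
are conditionally independent given the labels and that the prior is
a Gibbs distribution over neighboring pairs; the weighting
coefficient between the two terms is omitted for brevity.

```latex
\begin{aligned}
-\ln p(l \mid o)
  &= -\ln p(o \mid l) - \ln p(l) + \text{const} \\
  &= \sum_i \bigl(-\ln p(o_i \mid l_i)\bigr)
     + \sum_{(i,j) \in N} g_i(l_i, l_j) + \text{const}
\end{aligned}
```

Under these assumptions, maximizing $p(l \mid o)$ is the same as
minimizing a sum of per-pixel penalties plus pairwise penalties,
which is the form of $E$.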
The likelihood term $A(l, o)$ measures how well the observation
data fit the foreground and background probability models. From
the form $f_i(o_i \mid l_i)$ we can see that, once the label
$l_i$ is given the value 0 or 1, the term expresses how well the
observation $o_i$ fits the foreground or the background model.
The main problem in this term is how to construct the
foreground/background probability models. We let the user drag a
rectangle covering the object in the given image, which reduces
the user's interaction drastically. This approach is used in [2],
but without depth information in the model. They use two GMMs to
estimate the foreground/background probability models in RGB
color space. Their formulation is:
$$f(o_i \mid \theta_f) = \sum_{j=1}^{K} \pi_j \, n(c_i; \mu_j, \Sigma_j) \tag{5}$$
$$f(o_i \mid \theta_b) = \sum_{j=1}^{K} \pi_j \, n(c_i; \mu_j, \Sigma_j) \tag{6}$$
The two terms $\theta_f$ and $\theta_b$ in the equations
represent the foreground model and the background model
respectively (each with its own mixture parameters).
$n(c_i; \mu_j, \Sigma_j)$ is a normal distribution, also called a
single Gaussian model, and $K$ is the number of single Gaussian
models. $\pi_j$ is a coefficient that represents the proportion
of the specific single Gaussian model in the GMM (Gaussian
mixture model). In our paper, we add the depth information to
equations (5) and (6): a single Gaussian model for the foreground
and a uniform distribution for the background. The reason is that
the object pixels are always close together in the depth channel,
while the background is complex enough to be fitted by a uniform
distribution. The form is as follows:
$$p(o_i \mid \theta_f) = \sum_{j=1}^{K} \pi_j \, n(c_i; \mu_j, \Sigma_j) \, n(d_i; \mu_d, \sigma_d) \tag{7}$$
$$p(o_i \mid \theta_b) = T^{-1} \sum_{j=1}^{K} \pi_j \, n(c_i; \mu_j, \Sigma_j) \tag{8}$$
Here $c_i$ is the RGB color data observed in the image and $d_i$
is the depth sensed by the Kinect camera; $\mu_d$ and $\sigma_d$
are the parameters of the single Gaussian model of the foreground
depth. $T$ is the number of pixels in the image, as mentioned
before.
More specifically, we use GMMs in the RGBD space to set the
region penalties in $A(l, o)$ as negative log-likelihoods:
$$f_i(o_i \mid l_i = 0) = -\ln p(o_i \mid \theta_f) \tag{9}$$
$$f_i(o_i \mid l_i = 1) = -\ln p(o_i \mid \theta_b) \tag{10}$$
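The region penalties of equations (7) to (10) can be sketched in a
few lines. This is an illustrative sketch only: the function names,
the diagonal covariance used for the color Gaussians and all
parameter values are assumptions, not details given in the paper.

```python
import math

def gaussian_1d(x, mean, std):
    """Single Gaussian model of the depth channel."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def gaussian_diag(c, mean, var):
    """Color Gaussian with a diagonal covariance (a simplifying assumption)."""
    density = 1.0
    for x, m, v in zip(c, mean, var):
        density *= math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
    return density

def gmm_density(c, components):
    """components: list of (weight, mean, var) triples, weights summing to 1."""
    return sum(w * gaussian_diag(c, m, v) for w, m, v in components)

def region_penalties(color, depth, fg_gmm, bg_gmm, depth_mean, depth_std, num_pixels):
    """Return (f(o | l=0), f(o | l=1)) as negative log-likelihoods."""
    eps = 1e-300                                   # guards against log(0)
    p_fg = gmm_density(color, fg_gmm) * gaussian_1d(depth, depth_mean, depth_std)
    p_bg = gmm_density(color, bg_gmm) / num_pixels  # uniform factor 1/T
    return -math.log(max(p_fg, eps)), -math.log(max(p_bg, eps))
```

With identical color models for foreground and background, a pixel
whose depth is near `depth_mean` gets the smaller object penalty,
while a pixel far from it gets the smaller background penalty, so
the depth factor alone can decide the label.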
For the term $B(l)$, we use the MRF neighborhood system, here the
four-neighborhood system: the label of a pixel, 0 or 1, depends
only on its four neighboring pixels. The term is nonzero only
when two neighboring pixels have different labels. The penalty is
calculated as follows:
$$p(o_i, o_j) = \exp\left(-\frac{\lVert o_i - o_j \rVert^2}{2\sigma^2}\right) \tag{11}$$
where $o_i$ and $o_j$ are neighbors in the MRF neighborhood
system. The penalty is large when the two neighboring pixels have
similar values, $\lVert o_i - o_j \rVert < \sigma$. On the other
hand, if the pixels are very different,
$\lVert o_i - o_j \rVert > \sigma$, the penalty is small.
Intuitively, this function corresponds to the distribution of
noise among neighboring pixels of an image. We can further refine
the term into the form:
$$g_i(l_i, l_j) = |l_i - l_j| \, p(o_i, o_j) \tag{12}$$
This form means that only the links across contours are
penalized. It defines the soft constraints used to compute the
global optimum of the boundaries.
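As an illustration of equations (4), (11) and (12), the sketch below
evaluates the boundary term for a given labeling on a
four-neighborhood grid. The function names, the use of nested lists
and the assumption that color and depth channels are on comparable
scales are choices made for this example only.

```python
import math

def neighbor_penalty(o_i, o_j, sigma):
    """Eq. (11): exp(-||o_i - o_j||^2 / (2 sigma^2)) for two RGBD vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(o_i, o_j))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def boundary_energy(labels, image, sigma):
    """Sum of |l_i - l_j| * p(o_i, o_j) over all 4-neighbor pairs.

    labels and image are 2-D lists of the same shape; image holds RGBD tuples.
    """
    rows, cols = len(labels), len(labels[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):      # right and down: each pair once
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    differs = abs(labels[r][c] - labels[rr][cc])
                    total += differs * neighbor_penalty(image[r][c], image[rr][cc], sigma)
    return total
```

A contour placed along a depth discontinuity costs almost nothing
even when the colors on both sides match, whereas a contour through
a homogeneous region pays a penalty close to 1 per cut link.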
III. ALGORITHM
In this section, we introduce our two-layer iterative graph cut
algorithm for RGBD images. The core of our algorithm is to use
the depth information to handle the situation where the
probability models of the foreground and the background are
similar. With the additional depth channel sensed by the Kinect
camera we can easily extract the object from the background. We
also find that graph cut is somewhat time-consuming when
computing the minimum energy, so we use an image pyramid and
apply the new fast max-flow algorithm at the upper layer of the
pyramid; this achieves relatively good quality in less time. We
also tried more layers of the image pyramid, but the result was
not as good as with two layers.
To start our algorithm, the user first drags a rectangle to cover
the object in the image. We then use k-means to estimate the GMMs
of the foreground/background models, with an additional single
Gaussian model and a uniform distribution for the depth channel.
After that, we calculate the two terms of the energy function.
This step differs from the formulation given above because we use
an image pyramid. We keep one pixel out of every four neighboring
pixels in the image, so the number of pixels passed to the
max-flow algorithm decreases by three quarters. The two energy
terms are computed on this downscaled image, and the graph for
the max-flow algorithm is constructed on it too. The optimal
boundary is obtained from the min-cut found by the max-flow
algorithm. This boundary is projected onto the bottom layer of
the image pyramid, and a new iteration begins. The key steps of
our algorithm are listed as follows:
1. Given an image observation $o_{i,j}$
($i \in \{1, \dots, I\}$, $j \in \{1, \dots, J\}$,
$I \times J = T$), the number of iterations $IterNum$, the
foreground/background observation stacks and the parameter
$\lambda$.
2. Read the color and depth data from the Kinect camera and set
$IterNum = 0$ and the number of GMM components $K = 5$. Extract
the top-layer image $o_{p,q}$ ($p = [i/2]$, $q = [j/2]$).
3. If $IterNum = 0$, use the user's rectangle to fill the stacks:
pixels inside the rectangle are put into the foreground stack,
the others into the background stack. If $IterNum \neq 0$, pixels
of the bottom-layer image that belong to the foreground are put
into the foreground stack, the others into the background stack.
Set $IterNum = IterNum + 1$.
4. Use the k-means algorithm [15] to estimate the GMMs of the
foreground/background models on the bottom image layer. Compute
the values of the links with the formulas
$-\ln \sum_{j=1}^{K} \pi_j \, n(c_{p,q}; \mu_j, \Sigma_j) \, n(d_{p,q}; \mu_d, \sigma_d)$
for the link of the object terminal,
$-\ln \left( T^{-1} \sum_{j=1}^{K} \pi_j \, n(c_{p,q}; \mu_j, \Sigma_j) \right)$
for the link of the background terminal, and
$\exp\left(-\lVert o_{p,q} - o_{p_1,q_1} \rVert^2 / (2\sigma^2)\right)$
for the neighbor link, where $(p_1, q_1)$ is a coordinate
neighboring $(p, q)$.
5. Use the new fast max-flow algorithm [7] to find the optimum
boundary in the top image and project it to the bottom layer.
6. If the result satisfies the user, stop; otherwise go to step 3.
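The two-layer bookkeeping in the steps above (extracting the top
layer, projecting the labels back, and refilling the stacks) can be
sketched as follows. The function names and the choice of keeping
the pixels at even row and column indices (0-based) are assumptions
of this sketch; the GMM estimation and the max-flow step are left
out.

```python
def top_layer(image):
    """Keep one pixel out of every 2x2 block (even rows and columns)."""
    return [row[::2] for row in image[::2]]

def project_to_bottom(top_labels, rows, cols):
    """Copy each top-layer label to the bottom-layer pixels it represents."""
    return [[top_labels[r // 2][c // 2] for c in range(cols)] for r in range(rows)]

def split_stacks(image, labels):
    """Refill the foreground/background stacks from bottom-layer labels."""
    foreground, background = [], []
    for pixel_row, label_row in zip(image, labels):
        for pixel, label in zip(pixel_row, label_row):
            (foreground if label == 0 else background).append(pixel)
    return foreground, background
```

For a 4x5 bottom layer the top layer has 2x3 pixels, so the graph
passed to max-flow has roughly a quarter of the nodes, while the
stacks used to re-estimate the models still contain every
bottom-layer pixel.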
IV. EXPERIMENT RESULT

In this section, we apply our method to several images and
compare it with other state-of-the-art methods. We evaluate our
algorithm in two steps. First, we use the two-layer iterative
graph cut on the standard starfish image.


(a) first iteration (b) second iteration (c) third iteration
Figure 1. Our two-layer iterative graph cut method on the
standard starfish image.

(a) first iteration (b) second iteration (c) third iteration
Figure 2. Ordinary iterative graph cut method on the standard
starfish image.

The upper part of Figure 1 shows the segmentation result obtained
on the top layer of the image. The blue line in the lower part
shows the contour obtained on the bottom layer of the image. The
results of our two-layer iterative graph cut are almost the same
as those of the ordinary method. Table 1 lists the time used in
each iteration:

Table 1: Time (ms) used to get the results in Figure 1 and Figure 2
Method                          Iteration 1   Iteration 2   Iteration 3
Two-layer iterative graph cut   3235          1813          1563
Ordinary iterative graph cut    4422          2203          1859

Table 1 shows that the two-layer part of our method reduces the
running time in every iteration while preserving the quality of
the segmentation.
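The size of the saving can be read off Table 1 with a little
arithmetic. The snippet below is only a convenience calculation on
the reported timings; the variable and function names are
illustrative.

```python
# Timings in milliseconds, copied from Table 1.
two_layer_ms = [3235, 1813, 1563]
ordinary_ms = [4422, 2203, 1859]

def relative_saving(fast, slow):
    """Fraction of the slower running time that is saved."""
    return 1.0 - fast / slow

per_iteration = [relative_saving(f, s) for f, s in zip(two_layer_ms, ordinary_ms)]
overall = relative_saving(sum(two_layer_ms), sum(ordinary_ms))
```

Over the three iterations the two-layer variant takes 6611 ms
against 8484 ms, a saving of about 22%, with the largest gain in the
first iteration.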
Next, we add the depth information to the graph cut method; both
energy terms must be changed to accommodate the depth channel. We
compare the results obtained using only the color information
with those of our method, which uses both color and depth
information. In particular, our method outperforms the ordinary
iterative graph cut method when the object in the image resembles
the background in RGB color space. Our results are listed as
follows:


(a) RGB image (b) depth image
Figure 3. The test RGB image containing a red book and a red
can of coke, and the corresponding depth image. The depth
information is encoded in the blue (higher 8 bits) and green
(lower 8 bits) channels.
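The depth encoding described in the caption of Figure 3 can be
decoded with two bit operations. This is a small sketch with
hypothetical function names; the unit of the resulting depth value
is not stated in the paper.

```python
def decode_depth(blue, green):
    """Combine the high byte (blue) and the low byte (green) into one value."""
    return (blue << 8) | green

def encode_depth(depth):
    """Split a 16-bit depth value into (blue, green) bytes."""
    return (depth >> 8) & 0xFF, depth & 0xFF
```

For example, a pixel with blue = 3 and green = 32 decodes to
3 * 256 + 32 = 800.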


(a) first iteration (b) second iteration (c) third iteration
Figure 4. The segmentation results of the ordinary iterative
graph cut method without depth information. It fails to separate
the red can from the book behind.

(a) first iteration (b) second iteration (c) third iteration
Figure 5. The segmentation results of our method with the
additional depth information sensed by the Kinect camera.

With the depth information added to the two terms of the energy
function, we can easily extract the object from the background
even when the foreground and the background are similar in color
space.
V. CONCLUSION
In this paper, we extend the iterative graph cut method in two
ways. First, we construct the statistical model on the bottom
layer of a two-layer pyramid and compute the min-cut on the top
layer. This greatly reduces the running time while achieving the
required quality. Second, we add depth information to the graph
cut method to handle the situation where the object and the
background have similar color distributions. The energy terms are
revised to take the additional depth information of the image
into account. Experimental results show the efficiency of our
method.

ACKNOWLEDGMENT
This work is supported by the Twelfth Five-Year Hi-Tech project
of the Ministry of Science and Technology, the discipline project
of Ningbo University (xkl09154), the Natural Science Foundation
of Zhejiang (D1080807), and the Scientific Research Foundation of
Ningbo University (G11JA017).

REFERENCES
[1] Y. Boykov and M.-P. Jolly. Interactive graph cuts for optimal
boundary & region segmentation of objects in N-D images. In ICCV,
2001.
[2] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: interactive
foreground extraction using iterated graph cuts. SIGGRAPH, August 2004.
[3] D. Greig, B. Porteous, and A. Seheult. Exact maximum a posteriori
estimation for binary images. J. of the Royal Statistical Society,
Series B, 51(2):271-279, 1989.
[4] D. M. Greig, B. T. Porteous and A. H. Seheult. Exact maximum a
posteriori estimation for binary images. Journal of the Royal
Statistical Society, Series B, 51, 271-279, 1989.
[5] D. Geman and S. Geman. Stochastic relaxation, Gibbs distributions
and the Bayesian restoration of images. IEEE Trans. Pattern Anal.
Mach. Intell., 6, 721-741, 1984.
[6] J. E. Besag. On the statistical analysis of dirty pictures (with
discussion). Journal of the Royal Statistical Society, Series B, 48,
259-302, 1986.
[7] Y. Boykov and V. Kolmogorov. An experimental comparison of
min-cut/max-flow algorithms for energy minimization in vision. In 3rd
Intnl. Workshop on Energy Minimization Methods in Computer Vision
and Pattern Recognition (EMMCVPR). Springer-Verlag, September
2001, to appear.
[8] R. M. Haralick and L. G. Shapiro. Computer and Robot Vision.
Addison-Wesley Publishing Company, 1992.
[9] A. Goldberg and R. Tarjan. A new approach to the maximum flow
problem. Journal of the Association for Computing Machinery,
35(4):921-940, October 1988.
[10] S. Vicente, V. Kolmogorov and C. Rother. Graph cut based image
segmentation with connectivity priors. In CVPR, 2008.
[11] S. Vicente, V. Kolmogorov and C. Rother. Joint optimization of
segmentation and appearance models. In ICCV, 2009.
[12] O. J. Woodford, C. Rother and V. Kolmogorov. A global perspective
on MAP inference for low-level vision. Microsoft Research Technical
Report, 2009.
[13] M. Ruzon and C. Tomasi. Alpha estimation in natural images. In
Proc. IEEE Conf. Comp. Vision and Pattern Recog., 2000.
[14] Y.-Y. Chuang, B. Curless, D. Salesin, and R. Szeliski. A Bayesian
approach to digital matting. In Proc. IEEE Conf. Computer Vision and
Pattern Recog., 2001.
[15] M. Inaba, N. Katoh and H. Imai. Applications of weighted Voronoi
diagrams and randomization to variance-based k-clustering.
Proceedings of 10th ACM Symposium on Computational Geometry, pp.
332-339, 1994.
