Escolar Documentos
Profissional Documentos
Cultura Documentos
4111
supervised setting. The initial annotations generated by SPN He, and Sun 2015). To build an association between coarse
are then given to learn DecoupledNet (Hong, Noh, and Han labels and pixel-level fine annotations, they often employ
2015), which is the second network in our approach for the auxiliary objectives such as a classification loss to obtain
final semantic segmentation. By decoupling semantic seg- coarse localization of semantic categories, which are of-
mentation into classification and segmentation tasks, Decou- ten refined by Multiple Instance Learning (MIL) (Pinheiro
pledNet learns class-agnostic shape prior from the initial an- and Collobert 2015; Pathak et al. 2015) or Expectation-
notations effectively. Both SPN and DecoupledNet substan- Maximization (EM) (Papandreou et al. 2015). However,
tially improve segmentation accuracy according to our ob- these approaches tend to localize only few discriminative
servation. Our framework outperforms the existing state-of- parts of object because of the missing supervision on seg-
the-art techniques in weakly supervised semantic segmenta- mentation, and perform much worse than fully-supervised
tion with only a single round training, and its performance methods. To mitigate this issue, (Hong et al. 2016) proposed
is further improved by additional rounds, where annotations a transfer-learning approach that exploits segmentation an-
are generated by DecoupledNet of the previous iterations. notations of other object classes, and (Pinheiro and Collobert
The contribution of our approach is three-fold: 2015) investigated the use of various shape priors such as su-
• We propose a novel DNN architecture to exploit super- perpixels and object proposals. Also, (Pathak, Krähenbühl,
pixels as a pooling layout for generating segmentation and Darrell 2015) showed that even a very little supervision
annotations with image-level labels only. The proposed such as one-bit information about object size improves seg-
SPN is naturally combined with existing DecoupledNet mentation performance significantly.
for weakly supervised semantic segmentation. Our approach attempts to bridge the gap between dis-
criminative localization and shape estimation, which is at-
• To construct robust initial segmentation labels for train- tained by the use of superpixel as a unit for shape estima-
ing, we introduce techniques for segmentation label sani- tion (SPN) and integration of a shared segmentation network
tization and reliable image identification, which are useful across multiple object classes (DecoupledNet).
to improve the final segmentation performance.
• The proposed algorithm demonstrates impressive perfor- Superpixel Pooling Network
mance, and achieves the state-of-the-art accuracy com-
pared to existing approaches with significant margins. This section describes the architecture of SPN including the
details of the superpixel pooling layer, and the learning strat-
Related Work egy for SPN. It also discusses how to compute initial seg-
mentation annotations of training images using SPN.
The state-of-the-art algorithms on semantic segmentation
rely on DNN (Long, Shelhamer, and Darrell 2015; Noh,
Hong, and Han 2015; Zheng et al. 2015; Chen et al. 2015;
Architecture
Badrinarayanan, Handa, and Cipolla 2015). These algo- SPN is composed of three parts: 1) a feature encoder fenc ,
rithms are built with a classification network pre-trained on 2) an upsampling module composed of a feature upsampler
a large-scale image collection (Russakovsky et al. 2015). fups and the superpixel pooling layer, and 3) two classifica-
The standard framework of semantic segmentation is to tion modules that classify feature vectors obtained from the
train the encoder and/or decoder networks for pixel-level encoder and the upsampling module. The entire network is
classification based on ground-truth segmentation masks. learned with the two separate classification losses computed
Long et al. (Long, Shelhamer, and Darrell 2015) pro- by the last component. Overall architecture of SPN is illus-
posed an efficient end-to-end learning framework based trated in Fig. 1.
on Fully-Convolutional Network (FCN). Later approaches The encoder of SPN computes a convolutional feature
have improved the FCN-style architecture by applying fully- map z = fenc (x) of input image x. As a part of the en-
connected CRF (Chen et al. 2015) as post-processing or in- coder, we adopt the VGG16 network (Simonyan and Zis-
tegrating it as a network component (Zheng et al. 2015). serman 2015) pre-trained on ImageNet excluding its fully-
Other alternatives have built deep decoding networks based connected layers. The parameters of the VGG16 network is
on deconvolution network (Noh, Hong, and Han 2015; fixed throughout. To adapt the encoder to the target task, we
Badrinarayanan, Handa, and Cipolla 2015) to preserve ac- add an additional convolutional layer, which is learned from
curate object boundaries. Although these approaches have scratch, on the top of the convolutional layers of the VGG16
achieved substantial improvement in performance, there is a network. The encoder is thus fully convolutional and outputs
critical bottleneck in collecting a large amount of segmen- a feature map that contains a spatial information.
tation annotations to learn their DNNs, which limits their Given the convolutional feature map z, a straightforward
practicality for semantic segmentation in the wild. way to estimate object area is to classify every spatial loca-
Weakly supervised approaches (Pinheiro and Collobert tion of the feature map—akin to the sliding-window method
2015; Papandreou et al. 2015; Pathak et al. 2015; Hong, in object detection. To learn such classifier jointly with other
Noh, and Han 2015) have been proposed to resolve the data network parameters, global average pooling (Zhou et al.
deficiency issues. In this setting, models for semantic seg- 2016) or global max pooling (Oquab et al. 2015) is applied
mentation is trained with only image-level labels (Pinheiro to the feature map to convert it into a single feature vec-
and Collobert 2015; Papandreou et al. 2015; Pathak et al. tor, which is then used to learn classification layers based
2015), scribbles (Lin et al. 2016a), or bounding box (Dai, on classification loss. However, the activation map obtained
4112
ͷͳʹ
ͷͳʹ
ȋͷͳʹʹͲȌ
݂ୣ୬ୡ ݂୳୮ୱ
ܢ ࣦ
ܢො
ȋͷͳʹʹͲȌ
Figure 1: Overall architecture of SPN. SPN takes two inputs for inference: an image and its superpixel map. Given an input
image, our network extracts high-resolution feature maps using encoder fenc followed by several upsampling layers fups , and the
superpixel pooling layer aggregates features inside of each superpixel by exploiting an input superpixel map as pooling layout.
Then relevance of superpixels to semantic categories is obtained by training a model with discriminative loss in a similar
way to (Zhou et al. 2016). Note that we put additional branch of global average pooling for regularization, which prevents
undesirable training noises introduced by superpixels.
by this approach is not sufficient for semantic segmentation work from being spoiled by the SP layer and keeping dis-
since it assigns high scores to only few discriminative parts criminative parts with high activation scores.
of an object and its resolution is too low to recover object
shape accurately. Forward and Backward Propagations via SP Layer
We design the Superpixel-Pooling (SP) layer to resolve
This section derives forward and backward propagations
the above issue. Unlike traditional pooling layers, the pool-
through the SP layer. Let pi = {pki }k=1,...,Ki be the i-th
ing layout of the SP layer is not pre-defined but determined
by superpixels of the input image. Through the SP layer, we superpixel of an image, where pki indicates individual pixels
can aggregate feature vectors spatially aligned with super- belonging to pi and Ki denotes the number of pixels in pi .
pixels by average-pooling. The output of the SP layer then The results of the forward propagation through the SP
becomes an N × K matrix, where N means the number of layer are feature vectors, each of which is average-pooled
superpixels in the input image and K indicates the number from the area of the associated superpixel. Let ẑ = fups (z)
of channels in the feature map (K = 512 in the current SPN be the upsampled feature map, which is the input of the SP
architecture). Analogous to (Zhou et al. 2016), the N ×K su- layer. The pooled feature vector of the i-th superpixel is then
perpixel features are then averaged over superpixels to build given by
a single 1 × K vector, which will be classified by the follow- 1 j
ing fully-connected layer to compute and backpropagate the z̄i = I(pki ∈ rj ) ẑj = hi ẑj , (1)
classification loss. Ki j j
k
The simplest way to utilize the SP layer is to directly
connect the feature map and the SP layer, but this is not where rj and ẑj represent the receptive field and the feature
appropriate since the resolution of the feature map is too vector of the j-th location in ẑ, respectively. I(pki ∈ rj ) is an
low; a feature map location may be associated with a large indicator function that is 1 if the center of rj is closer to pki
number of superpixels, which leads to overly-smoothed fea- than those of any other receptive fields are, and 0 otherwise.
ture vectors of superpixels. We thus add a non-linear up- The average of indicator functions over the superpixel is de-
sampling module fups between the feature map and the noted by hji = k I(pki ∈ rj )/Ki , which represents how
SP layer. This module consists of two deconvolution lay- much area of the i-th superpixel is occupied by the recep-
ers (Long, Shelhamer, and Darrell 2015) and one unpooling tive field rj (i.e., the responsibility of rj for the superpixel
layer (Zeiler and Fergus 2014) followed by another two de- pi ). Through global average pooling over all superpixels, we
convolution layers. A batch normalization layer (Ioffe and obtain a single feature vector for the input image, which is
Szegedy 2015) and a rectified linear unit (Nair and Hinton given by
2010) are attached after every deconvolution layer. We em-
ploy a shared pooling switch (Noh, Hong, and Han 2015) 1 1 j j
z̃ = z̄i = h ẑ , (2)
between the last pooling layer of the encoder and the un- N i N i j i
pooling layer, which is known to be useful to reconstruct
object structure in the semantic segmentation scenario. All where N indicates the number of superpixels. The image-
the parameters of the upsampling module are trained from level feature vector z̃ is then classified by the fully connected
scratch. layer following the SP layer, and the result is fed to the loss
Finally, SPN has a branch that directly applies global av- function L.
erage pooling to the feature map z and classifies the aggre- In the backward propagation, the gradient of ẑj of the in-
gated feature vector. This branch aims at preventing the net- put feature map is derived as
4113
∂L ∂L ∂z̃ 1 ∂L ∂ i j hji ẑj 1 ∂L j
= = = h .
∂ẑj ∂z̃ ∂ẑj N ∂z̃ ∂ẑj N ∂z̃ i i
(3)
,QSXWLPDJH &$0 63&$0
Note that although individual input images in a mini-batch
may have different numbers of superpixels, the forward and
backward propagations do not depend on the number of su-
perpixels due to the average pooling over all superpixels per
image.
*URXQGWUXWK 6HJPHQWDWLRQ 6HJPHQWDWLRQ
Learning SPN E\&$0 E\63&$0
Loss function. SPN is learned with classification losses as
only image-level class labels are available in our setup. Since Figure 2: Qualitative comparison of CAM (Zhou et al. 2016)
multiple objects of different classes may appear in an image, and our SP-CAM with initial segmentation results obtained
given C object classes, our loss function is thus defined as by thresholding them. CAM is computed by classifying ev-
the sum of C binary classification losses as shown in ery location of the feature map z with the fully-connected
layer directly connected to z. CAM could not cover the en-
L(f (x), y) tire object area due to its localized activations on limited res-
C olution. Class activations in SP-CAM preserves object shape
1 efc (x) 1 more accurately by employing superpixel as a unit for shape
= yc log + (1 − y c ) log ,
C c=1 1 + efc (x) 1 + efc (x) estimation.
(4)
where fc (x) and yc ∈ {0, 1} are the network output and the to the aggregated tensor as Superpixel-Pooled Class Activa-
ground-truth label for a single class c, respectively. Note that tion Map (SP-CAM).
the SPN outputs two class score vectors at the end of the two SP-CAM is motivated by Class Activation Map (CAM)
branches (Fig. 1), and they are treated independently by the of (Zhou et al. 2016), but has a critical advantage. Unlike
identical loss function. CAM, where each location of the feature map is activated in-
dependently, SP-CAM assigns class activations to individual
Multi-scale learning. Objects are depicted with different superpixels, which allows to generate class activation scores
scales in images. To better model such scale variations of ob- in the original resolution with image structures preserved.
jects, we follow the approach of (Oquab et al. 2015) that ran- This property of SP-CAM is particularly useful for semantic
domly resizes images in the input mini-batch during train- segmentation as illustrated in Fig. 2.
ing. Specifically, we first make images square by padding
zeros to its shorter dimension, and rescale them randomly to
one of the 6 predefined sizes: 2502 , 3002 , 3502 , 4002 , 4502 , Generating initial annotation with SP-CAM. We obtain
and 5002 pixels. Note that our SPN consists only of con- initial segmentation annotations of training images through
volutional and global pooling layers except the SP layer, so their SP-CAMs. The activation map of each class in SP-
the network is naturally fit to this multi-scale approach if the CAM is thresholded by 50% of the maximum score of the
SP layer could work with images of different sizes. To this map; in other words, pixels whose activation scores are be-
end, we compute the responsibilities of the receptive field low the threshold are ignored. Furthermore, we disregard the
activation maps of the classes unmatched with the image-
(i.e., hji ’s in Eq. (1)) for all 6 possible input resolutions in
level ground-truth labels. We call this step annotation saniti-
advance per image. We observed empirically that this multi-
zation. Note that using image-level labels is not unfair since
scale approach helps to estimate object area more accurately.
our goal is not the inference of segmentation labels but the
Generating Initial Annotations with SPN construction of segmentation annotations of training images
given the labels for image-level classification. The segmen-
Superpixel-pooled class activation map. SPN assigns a tation annotation of the SP-CAM is then given by searching
feature vector to each superpixel through the SP layer as de- for the label with the maximum activation score in every
scribed in Eq. (1). During inference, the feature vector of pixel. If the activation scores of a pixel are below a pre-
each superpixel is first given to the fully-connected classifi- defined threshold in all the C channels, it is considered as
cation layer following the SP layer, and the class scores of background. An example of initial annotation is illustrated
the individual superpixels are computed. As a result, we ob- in Fig. 2.
tain a tensor in RW ×H×C , where W and H are the width
and height of the input image, respectively, and each chan-
nel corresponds to an activation map for the associated class. Iterative Learning of DecoupledNet
As in training, an input image is rescaled to the 6 predefined The output of SPN already produces decent segmentation
sizes, and 6 activation tensors are computed accordingly. Fi- results thanks to the use of superpixel as a unit for shape es-
nally, the 6 tensors are aggregated by max-pooling. We refer timation. However, SPN is still not free from the limitation
4114
of discriminative learning; it tends to exaggerate the class for weakly supervised semantic segmentation. As an eval-
scores of discriminative superpixels, thus some parts of an uation metric, we adopt segmentation accuracy defined by
object could be lost when initial annotations are generated intersection over union between ground-truth and predicted
by the procedure described above. A solution to resolve this segmentation.
issue would be to get hints from annotations of other images,
and this idea can be realized by learning class-agnostic seg- Implementation Details
mentation knowledge from a number of initial segmentation Both of SPN and DecoupledNet are trained on PASCAL
annotations. We adopt DecoupledNet (Hong, Noh, and Han VOC 2012 dataset (Everingham et al. 2010). Besides the
2015) as the network that learns such segmentation knowl- provided image sets for the semantic segmentation task, we
edge from the initial annotations and refine them iteratively. employ additional images used in (Hariharan et al. 2011) to
DecoupledNet decomposes the semantic segmentation enlarge training set. In total, 10,582 images are used to train
problem into two separate tasks of classification and seg- the networks, and the validation set of 1,449 images is kept
mentation, and is composed of two decoupled networks to for evaluating our approach. Superpixels of the images are
handle the two tasks. Following (Hong, Noh, and Han 2015), computed by (Zitnick and Dollár 2014).
the classification network of DecoupledNet is trained with SPN is implemented in Torch7 (Collobert, Kavukcuoglu,
image-level labels and serves as a feature encoder in our and Farabet 2011). The network parameters are optimized
framework. Unlike the original setting, however, its segmen- by Adam (Ba and Kingma 2015) with an initial learning rate
tation network is now trained with generated noisy annota- of 0.001. The optimization converges after approximately
tions instead of ground-truth segmentations. Since the bi- 9K iterations with mini-batches of 12 images. The training
nary segmentation network is shared across different object procedure takes about 3 hours on a single Nvidia TITAN X
classes, DecoupledNet can learn class-agnostic segmenta- GPU with 12Gb RAM in our experiment.
tion knowledge from annotations of multiple object classes, We learn DecoupledNet for two rounds. When learn-
which allows it to generate more accurate annotations for the ing DecoupledNets, we follow the original setup described
next round of training. in (Hong, Noh, and Han 2015), and the number of iterations
In each round of our algorithm, DecoupledNet is learned is set to 9K for both rounds. As a training set, top 300 most
from generated annotations, which are provided by SPN at reliable annotations are selected per class at each round.
the first round and by DecoupledNet trained in the previ-
ous iteration from the second round. We learn the segmenta- Evaluation on PASCAL VOC Benchmark
tion network of DecoupledNet from scratch at every round Analysis of generated annotation. We first analyze and
to avoid annotation biases of the previous round. Note that evaluate annotations generated by our algorithm at each
only a subset of reliable annotations are used to learn the round. Details of the results are summarized in Fig. 3. It
network to minimize undesirable effects by incomplete and turns out that both of the sanitization and reliable subset se-
noisy annotations. To this end, we define the reliability of an lection are essential to improve the quality of annotations
annotation based on the degree of scatter, which is the ra- and in turn learn a better segmentation network.
tio of the squared perimeter to the area of segmentations in We also compare SPN with CAM (Zhou et al. 2016) in
the annotation. Only a subset of least scattered annotations terms of the annotation quality. Without reliable subset se-
are then selected for training, while assuming that less scat- lection, annotations generated by SPN achieves 43.8% in av-
tered annotations are more reliable. Note that DecoupledNet erage accuracy while those by CAM shows 30.5%. This re-
is well suited to this reliable subset selection since the net- sult demonstrates the effectiveness of the superpixel pooling
work has shown superior performance even with only a lim- layer of SPN.
ited number of segmentation annotations given in training.
We observed empirically that DecoupledNet learned with a Comparison to other methods. Our algorithm is compared
subset of reliable annotations consistently outperforms its with up-to-date weakly supervised approaches. In addition
counterpart learned with all annotations. The trained Decou- to our two segmentation networks (Ours:Rnd1, Ours:Rnd2),
pledNet is in turn used to generates annotations for the next we also evaluate SPN as a semantic segmentation network
round. As in the case of SPN, the generated annotations are (Ours:SPN). SPN predicts semantic segmentation in the
sanitized by image-level labels to remove irrelevant or unre- same way as it generates initial annotations, but makes use
liable segments in the annotations. of its classification scores instead of image labels to ig-
The above procedure is repeated until the predefined num- nore irrelevant classes in its segmentation result. Specifi-
ber of iterations is reached. The DecoupledNet trained at the cally, SPN disregards object classes whose activation scores
final round is considered as our model for semantic segmen- averaged over all superpixels are below 0. We also report
tation. the accuracy of the DecoupledNet trained with all available
ground-truth annotations (Upper-bound) as the upper-bound
Experiments of our algorithm.
The segmentation results on PASCAL VOC 2012 valida-
This section describes implementation details, and demon- tion set are quantified and compared in Table 1. SPN by
strates the effectiveness of our approach in PASCAL itself outperforms all the previous approaches except MIL
VOC 2012 segmentation benchmark (Everingham et al. using segmentation proposals (MIL+Seg), which is the best
2010) with comparisons to the-state-of-the-art techniques one among them. Note that segmentation proposals used in
4115
60 60 60
Annotation accuracy (%)
Without sanitization All annotations
45 40 40
35 35
40
2000 4000 6000 8000 10000 30 30
The number of selected annotations Round1 (SPN) Round2 Round1 Round2
(a) (b) (c)
Figure 3: Analysis of our annotation generation technique on PASCAL VOC 2012 segmentation benchmark. (a) Empirical
justification for our reliability metric. We show the average segmentation accuracy of annotations selected in the decreasing
order of reliability for training images. The average accuracy consistently decreases by increasing the number of selected
annotations, which indicates that the reliability metric is well correlated with the actual quality of annotation. (b) Effects of
the annotation sanitization and reliable subset selection. In both rounds, the accuracy of annotations is enhanced by the
sanitization, and further improved by selecting 300 most reliable annotations per class in training set. (c) Effect of the reliable
subset selection. We trained two networks, one with all of the sanitized annotations and the other with 300 most reliable
annotations per class, and compare their segmentation prediction accuracy in validation set. The network learned with the
reliable subset of annotations was better than its counterpart in both rounds.
Table 1: Accuracy on PASCAL VOC 2012 validation set Table 2: Accuracy on PASCAL VOC 2012 test set
4116
,QSXWLPDJH *URXQGWUXWK ,QLW DQQRWDWLRQ 5RXQG 5RXQG neural network for semi-supervised semantic segmentation.
In NIPS.
Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accel-
erating deep network training by reducing internal covariate
shift. In ICML.
Lin, D.; Dai, J.; Jia, J.; He, K.; and Sun, J. 2016a. Scribble-
sup: Scribble-supervised convolutional networks for seman-
tic segmentation. In CVPR.
Lin, G.; Shen, C.; van dan Hengel, A.; and Reid, I. 2016b.
Efficient piecewise training of deep structured models for
semantic segmentation. In CVPR.
Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convo-
lutional networks for semantic segmentation. In CVPR.
Nair, V., and Hinton, G. E. 2010. Rectified linear units
improve restricted boltzmann machines. In ICML.
Noh, H.; Hong, S.; and Han, B. 2015. Learning deconvolu-
tion network for semantic segmentation. In ICCV.
Oquab, M.; Bottou, L.; Laptev, I.; and Sivic, J. 2015. Is ob-
ject localization for free? - weakly-supervised learning with
convolutional neural networks. In CVPR.
Papandreou, G.; Chen, L. C.; Murphy, K. P.; and Yuille,
A. L. 2015. Weakly-and semi-supervised learning of a deep
convolutional network for semantic image segmentation. In
ICCV.
Pathak, D.; Shelhamer, E.; Long, J.; and Darrell, T. 2015.
Fully convolutional multi-class multiple instance learning.
In ICLR Workshop.
Pathak, D.; Krähenbühl, P.; and Darrell, T. 2015. Con-
strained convolutional neural networks for weakly super-
vised segmentation. In ICCV.
Figure 4: Results on PASCAL VOC 2012 validation set. Pinheiro, P. O., and Collobert, R. 2015. From image-level to
pixel-level labeling with convolutional networks. In CVPR.
Qi, G.-J. 2016. Hierarchically gated deep networks for se-
for robust semantic pixel-wise labelling. arXiv preprint mantic segmentation. In CVPR.
arXiv:1505.07293. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.;
Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; and Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.;
Yuille, A. L. 2015. Semantic image segmentation with deep Berg, A. C.; and Fei-Fei, L. 2015. ImageNet Large Scale
convolutional nets and fully connected CRFs. In ICLR. Visual Recognition Challenge. IJCV 1–42.
Collobert, R.; Kavukcuoglu, K.; and Farabet, C. 2011. Simonyan, K., and Zisserman, A. 2015. Very deep convolu-
Torch7: A matlab-like environment for machine learning. In tional networks for large-scale image recognition. In ICLR.
BigLearn, NIPS Workshop. Vemulapalli, R.; Tuzel, O.; Liu, M.-Y.; and Chellapa, R.
Dai, J.; He, K.; and Sun, J. 2015. Boxsup: Exploiting bound- 2016. Gaussian conditional random field network for se-
ing boxes to supervise convolutional networks for semantic mantic segmentation. In CVPR.
segmentation. In ICCV. Zeiler, M. D., and Fergus, R. 2014. Visualizing and under-
Everingham, M.; Van Gool, L.; Williams, C. K. I.; Winn, J.; standing convolutional networks. In ECCV.
and Zisserman, A. 2010. The pascal visual object classes Zheng, S.; Jayasumana, S.; Romera-Paredes, B.; Vineet, V.;
(voc) challenge. IJCV 88(2):303–338. Su, Z.; Du, D.; Huang, C.; and Torr, P. 2015. Conditional
Hariharan, B.; Arbelaez, P.; Bourdev, L.; Maji, S.; and Ma- random fields as recurrent neural networks. In ICCV.
lik, J. 2011. Semantic contours from inverse detectors. In Zhou, B.; Khosla, A.; A., L.; Oliva, A.; and Torralba, A.
ICCV. 2016. Learning deep features for discriminative localization.
Hong, S.; Oh, J.; Lee, H.; and Han, B. 2016. Learning trans- In CVPR.
ferrable knowledge for semantic segmentation with deep Zitnick, C. L., and Dollár, P. 2014. Edge boxes: Locating
convolutional neural network. In CVPR. object proposals from edges. In ECCV.
Hong, S.; Noh, H.; and Han, B. 2015. Decoupled deep
4117