
Semi-supervised Vocabulary-informed Learning

Yanwei Fu and Leonid Sigal


Disney Research
y.fu@qmul.ac.uk, lsigal@disneyresearch.com

Abstract

Despite significant progress in object categorization in recent years, a number of important challenges remain; mainly, the ability to learn from limited labeled data and the ability to recognize object classes within a large, potentially open, set of labels. Zero-shot learning is one way of addressing these challenges, but it has only been shown to work with limited-sized class vocabularies and typically requires separation between supervised and unsupervised classes, allowing the former to inform the latter but not vice versa. We propose the notion of semi-supervised vocabulary-informed learning to alleviate the above-mentioned challenges and address the problems of supervised, zero-shot and open set recognition using a unified framework. Specifically, we propose a maximum margin framework for semantic manifold-based recognition that incorporates distance constraints from (both supervised and unsupervised) vocabulary atoms, ensuring that labeled samples are projected closer to their correct prototypes, in the embedding space, than to others. We show that the resulting model yields improvements in supervised, zero-shot, and large open set recognition, with up to a 310K class vocabulary, on the AwA and ImageNet datasets.

Figure 1. Illustration of the semantic embeddings learned (left) using support vector regression (SVR) and (right) using the proposed semi-supervised vocabulary-informed (SS-Voc) approach. In both cases, t-SNE visualization is used to illustrate samples from 4 source/auxiliary classes (denoted by ×) and 2 target/zero-shot classes (denoted by ◦) from the ImageNet dataset. Decision boundaries, illustrated by dashed lines, are drawn by hand for visualization. Note that (i) large margin constraints in SS-Voc, both among the source/target classes and the external vocabulary atoms (denoted by arrows and words), and (ii) fine-tuning of the semantic word space, lead to a better embedding with more compact and separated classes (e.g., see truck and car or unicycle and tricycle).

1. Introduction
Object recognition, and more specifically object categorization, has seen unprecedented advances in recent years with the development of convolutional neural networks (CNNs) [23]. However, most successful recognition models, to date, are formulated as supervised learning problems, in many cases requiring hundreds, if not thousands, of labeled instances to learn a given concept class [10]. This exuberant need for large labeled datasets has limited recognition models to domains with 100's to few 1000's of classes. Humans, on the other hand, are able to distinguish beyond 30,000 basic level categories [5]. What is more impressive is the fact that humans can learn from few examples, by effectively leveraging information from other object category classes, and even recognize objects without ever seeing them (e.g., by reading about them on the Internet). This ability has spawned research in few-shot and zero-shot learning.

Zero-shot learning (ZSL) has now been widely studied in a variety of research areas including neural decoding from fMRI images [31], character recognition [26], face verification [24], object recognition [25], and video understanding [17, 45]. Typically, zero-shot learning approaches aim to recognize instances from unseen or unknown testing target categories by transferring information, through intermediate-level semantic representations, from known observed source (or auxiliary) categories for which many labeled instances exist. In other words, supervised classes/instances are used as context for recognition of classes that contain no visual instances at training time, but that can be put in some correspondence with supervised classes/instances. As such, a general experimental setting of ZSL is that the classes in the target and source (auxiliary) datasets are disjoint; typically, learning is done on the source dataset and then information is transferred to the target dataset, with performance measured on the latter.
This setting has a few important drawbacks: (1) it assumes that target classes cannot be mis-classified as source classes and vice versa; this greatly and unrealistically simplifies the problem; (2) the target label set is often relatively small, between ten [25] and several thousand unknown labels [14], compared to at least 30,000 entry-level categories that humans can distinguish; (3) large amounts of data in the source (auxiliary) classes are required, which is problematic as it has been shown that most object classes have only few instances (long-tailed distribution of objects in the world [40]); and (4) the vast open set vocabulary from semantic knowledge, defined as part of ZSL [31], is not leveraged in any way to inform the learning or source class recognition.

A few works recently looked at resolving (1) through class-incremental learning [38, 39], which is designed to distinguish between seen (source) and unseen (target) classes at testing time and apply an appropriate model – supervised for the former and ZSL for the latter. However, (2)–(4) remain largely unresolved. In particular, while (2) and (3) are artifacts of the ZSL setting, (4) is more fundamental. For example, consider learning about a car by looking at image instances in Fig. 1. Not knowing that other motor vehicles exist in the world, one may be tempted to call anything that has 4 wheels a car. As a result, the zero-shot class truck may have large overlap with the car class (see Fig. 1 [SVR]). However, imagine knowing that there also exist many other motor vehicles (trucks, mini-vans, etc.). Even without having visually seen such objects, the very basic knowledge that they exist in the world and are closely related to a car should, in principle, alter the criterion for recognizing an instance as a car (making the recognition criterion stricter in this case). Encoding this in our [SS-Voc] model results in better separation among classes.

To tackle the limitations of ZSL and towards the goal of generic open set recognition, we propose the idea of semi-supervised vocabulary-informed learning. Specifically, assuming we have few labeled training instances and a large open set vocabulary/semantic dictionary (along with textual sources from which statistical semantic relations among vocabulary atoms can be learned), the task of semi-supervised vocabulary-informed learning is to learn a model that utilizes the semantic dictionary to help train better classifiers for observed (source) classes and unobserved (target) classes in supervised, zero-shot and open set image recognition settings. Different from standard semi-supervised learning, we do not assume unlabeled data is available to help train the classifier, and only the vocabulary over the target classes is known.

Contributions: Our main contribution is to propose a novel paradigm for potentially open set image recognition: semi-supervised vocabulary-informed learning (SS-Voc), which is capable of utilizing vocabulary over unsupervised items, during training, to improve recognition. A unified maximum margin framework is used to encode this idea in practice. Particularly, classification is done through nearest-neighbor distance to class prototypes in the semantic embedding space, and we encode a set of constraints ensuring that labeled images project into the semantic space such that they end up closer to the correct class prototypes than to incorrect ones (whether those prototypes are part of the source or target classes). We show that word embedding (word2vec) can be used effectively to initialize the semantic space. Experimentally, we illustrate that through this paradigm we can achieve competitive supervised (on source classes) and ZSL (on target classes) performance, as well as open set image recognition performance with a large number of unobserved vocabulary entities (up to 300,000); effective learning with few samples is also illustrated.

2. Related Work

One-shot Learning: While most machine learning-based object recognition algorithms require large amounts of training data, one-shot learning [12] aims to learn object classifiers from one, or only few, examples. To compensate for the lack of training instances and enable one-shot learning, knowledge must be transferred from other sources, for example, by sharing features [3], semantic attributes [17, 25, 34, 35], or contextual information [41]. However, none of the previous works has used the open set vocabulary to help learn the object classifiers.

Zero-shot Learning: ZSL aims to recognize novel classes with no training instances by transferring knowledge from source classes. ZSL was first explored with the use of attribute-based semantic representations [11, 15, 17, 18, 24, 32]. This required pre-defined attribute vector prototypes for each class, which is costly for a large-scale dataset. Recently, semantic word vectors were proposed as a way to embed any class name without human annotation effort; they can therefore serve as an alternative semantic representation [2, 14, 19, 30] for ZSL. Semantic word vectors are learned from large-scale text corpora by language models, such as word2vec [29] or GloVe [33]. However, most previous work only uses word vectors as semantic representations in the ZSL setting, and has neither (1) utilized semantic word vectors explicitly for learning better classifiers, nor (2) extended the ZSL setting towards open set image recognition. A notable exception is [30], which aims to recognize 21K zero-shot classes given a modest vocabulary of 1K source classes; we explore vocabularies that are up to an order of magnitude larger – 310K.
Open-set Recognition: The term "open set recognition" was initially defined in [37, 38] and formalized in [4, 36]; it mainly aims at identifying whether an image belongs to a seen or unseen class. It is also known as class-incremental learning. However, none of these methods can further identify classes for unseen instances. An exception is [30], which augments zero-shot (unseen) class labels with source (seen) labels in some of its experimental settings. Similarly, we define open set image recognition as the problem of recognizing the class name of an image from a potentially very large open set vocabulary (including, but not limited to, source and target labels). Note that methods like [37, 38] are orthogonal but potentially useful here – it is still worth identifying seen or unseen instances to be recognized with different label sets, as shown in the experiments. Conceptually similar, but different in formulation and task, open-vocabulary object retrieval [20] focused on retrieving objects using natural language open-vocabulary queries.
Visual-semantic Embedding: Mapping between visual features and semantic entities has been explored in two ways: (1) directly learning the embedding by regressing from visual features to the semantic space using Support Vector Regressors (SVR) [11, 25] or a neural network [39]; (2) projecting visual features and semantic entities into a common new space, such as SJE [2], WSABIE [44], ALE [1], DeViSE [14], and CCA [16, 18]. In contrast, our model trains a better visual-semantic embedding from only few training instances with the help of a large amount of open set vocabulary items (using a maximum margin strategy). Our formulation is inspired by the unified semantic embedding model of [21]; however, unlike [21], our formulation is built on a word vector representation, contains a data term, and incorporates constraints to unlabeled vocabulary prototypes.

3. Vocabulary-informed Learning

Assume a labeled source dataset D_s = {x_i, z_i}_{i=1}^{N_s} of N_s samples, where x_i ∈ R^p is the image feature representation of image i and z_i ∈ W_s is a class label taken from a set of English words or phrases W; consequently, |W_s| is the number of source classes. Further, suppose another set of class labels for target classes W_t, such that W_s ∩ W_t = ∅, for which no labeled samples are available. We note that potentially |W_t| >> |W_s|. Given a new test image feature vector x*, the goal is then to learn a function z* = f(x*), using all available information, that predicts a class label z*. Note that the form of the problem changes drastically depending on which label set is assumed for z*: Supervised learning: z* ∈ W_s; Zero-shot learning: z* ∈ W_t; Open set recognition: z* ∈ {W_s, W_t} or, more generally, z* ∈ W.

We posit that a single unified f(x*) can be learned for all three cases. We formalize the definition of semi-supervised vocabulary-informed learning (SS-Voc) as follows:

Definition 3.1. Semi-supervised Vocabulary-informed Learning (SS-Voc) is a learning setting that makes use of complete vocabulary data (W) during training. Unlike a more traditional ZSL that typically makes use of the vocabulary (e.g., semantic embedding) at test time, SS-Voc utilizes exactly the same data during training. Notably, SS-Voc requires no additional annotations or semantic knowledge; it simply shifts the burden from testing to training, leveraging the vocabulary to learn a better model.

The vocabulary W can come from a semantic embedding space learned by word2vec [29] or GloVe [33] on a large-scale corpus; each vocabulary entity w ∈ W is represented as a distributed semantic vector u ∈ R^d. The semantics of the embedding space help with knowledge transfer among classes, and allow ZSL and open set image recognition. Note that such semantic embedding spaces are equivalent to the "semantic knowledge base" for ZSL defined in [31], which makes it appropriate to use SS-Voc in the ZSL setting.

Assuming we can learn a mapping g : R^p → R^d from image features to this semantic space, recognition can be carried out using simple nearest neighbor distance, e.g., f(x*) = car if g(x*) is closer to u_car than to any other word vector; u_j in this context can be interpreted as the prototype of class j. Thus the core question is how to learn the mapping g(x) and what form of inference is optimal in the semantic space. For learning, we propose a discriminative maximum margin criterion that ensures that labeled samples x_i project closer to their corresponding class prototypes u_{z_i} than to any other prototype u_i in the open set vocabulary, i ∈ W \ z_i.

3.1. Learning Embedding

Our maximum margin vocabulary-informed embedding learns the mapping g(x) : R^p → R^d from low-level features x to the semantic word space by utilizing a maximum margin strategy. Specifically, consider g(x) = W^T x, where^1 W ∈ R^{p×d}. Ideally we want to estimate W such that u_{z_i} = W^T x_i for all labeled instances in D_s (we would obviously want this to hold for instances belonging to unobserved classes as well, but we cannot enforce this explicitly in the optimization as we have no labeled samples for them).

Data Term: The easiest way to enforce the above objective is to minimize the Euclidean distance between sample projections and the appropriate prototypes in the embedding space^2:

    D(x_i, u_{z_i}) = \| W^T x_i - u_{z_i} \|_2^2 .    (1)

We need to minimize this term with respect to each instance (x_i, u_{z_i}), where z_i is the class label of instance x_i in D_s. To prevent overfitting, we further regularize the solution:

    L(x_i, u_{z_i}) = D(x_i, u_{z_i}) + \lambda \| W \|_F^2 ,    (2)

where \|·\|_F denotes the Frobenius norm. The solution to Eq.(2) can be obtained through ridge regression.

^1 Generalizing to a kernel version is straightforward, see [43].
^2 Eq.(1) is also called a data embedding [21] / compatibility function [2].
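For concreteness, the ridge-regression solution of Eq.(2) can be sketched as follows; this is a minimal illustration assuming the features and per-instance class prototypes are stacked into numpy arrays X (N_s × p) and U (N_s × d) – the function name and array layout are ours, not the paper's.

    import numpy as np

    def ridge_embedding(X, U, lam):
        # Closed-form minimizer of sum_i ||W^T x_i - u_{z_i}||^2 + lam * ||W||_F^2.
        # X: (N, p) image features; U: (N, d) prototype of each instance's class.
        p = X.shape[1]
        # Normal equations: (X^T X + lam * I) W = X^T U
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ U)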
Nevertheless, to make the embedding more comparable to support vector regression (SVR), we employ a maximal margin strategy – ε-insensitive smooth SVR (ε-SSVR) [27] – to replace the least squares term in Eq.(2). That is,

    \mathcal{L}(x_i, u_{z_i}) = L_\epsilon(x_i, u_{z_i}) + \lambda \| W \|_F^2 ,    (3)

where L_\epsilon(x_i, u_{z_i}) = \mathbf{1}^T |\xi|_\epsilon^2 and λ is a regularization coefficient; (|\xi|_\epsilon)_j = \max\{0, |W_{\star j}^T x_i - (u_{z_i})_j| - \epsilon\}, with |\xi|_\epsilon ∈ R^d, (·)_j indicating the j-th value of the corresponding vector, and W_{\star j} the j-th column of W. The conventional ε-SVR is formulated as a constrained minimization problem, i.e., a convex quadratic programming problem, while ε-SSVR employs quadratic smoothing [47] to make Eq.(3) differentiable everywhere; thus ε-SSVR can be solved as an unconstrained minimization problem directly^3.

^3 We found Eq.(2) and Eq.(3) have similar results, on average, but the formulation in Eq.(3) is more stable and has lower variance.
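As a reading aid, a sketch of this smoothed loss in numpy follows; the exact form of the ε-insensitive term is our reconstruction of Eq.(3) above, so treat it as illustrative rather than a reference implementation.

    import numpy as np

    def eps_ssvr_loss(W, X, U, eps, lam):
        # Eq.(3), sketched: per-dimension eps-insensitive residuals, squared
        # (quadratic smoothing makes the loss differentiable everywhere),
        # plus the Frobenius-norm regularizer.
        R = X @ W - U                          # rows are W^T x_i - u_{z_i}
        xi = np.maximum(0.0, np.abs(R) - eps)  # (|xi|_eps)_j per dimension
        return np.sum(xi ** 2) + lam * np.sum(W ** 2)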
Pairwise Term: The data term above only ensures that labelled samples project close to their correct prototypes. However, since it is doing so for many samples and over a number of classes, it is unlikely that all the data constraints can be satisfied exactly. Specifically, consider the following case: if u_{z_i} is in a part of the semantic space where no other entities live (i.e., the distance from u_{z_i} to any other prototype in the embedding space is large), then projecting x_i further away from u_{z_i} is asymptomatic, i.e., will not result in misclassification. However, if u_{z_i} is close to other prototypes, then a minor error in regression may result in misclassification.

To embed this intuition into our learning, we enforce more discriminative constraints in the learned semantic embedding space. Specifically, the distance D(x_i, u_{z_i}) should not only be as small as possible, but should also be smaller than the distance D(x_i, u_a), ∀a ≠ z_i. Formally, we define the vocabulary pairwise maximal margin term^4:

    M_V(x_i, u_{z_i}) = \frac{1}{2} \sum_{a=1}^{A_V} \left[ C + \frac{1}{2} D(x_i, u_{z_i}) - \frac{1}{2} D(x_i, u_a) \right]_+^2 ,    (4)

where a ∈ W_t is selected from the open vocabulary and C is the margin gap constant. Here, [·]_+^2 indicates the quadratically smoothed hinge loss [47], which is convex and has a gradient at every point. To speed up computation, we use the closest A_V target prototypes to each source/auxiliary prototype u_{z_i} in the semantic space. We also define similar constraints for the source prototype pairs:

    M_S(x_i, u_{z_i}) = \frac{1}{2} \sum_{b=1}^{B_S} \left[ C + \frac{1}{2} D(x_i, u_{z_i}) - \frac{1}{2} D(x_i, u_b) \right]_+^2 ,    (5)

where b ∈ W_s is selected from the source/auxiliary dataset vocabulary. This term enforces that D(x_i, u_{z_i}) should be smaller than the distance D(x_i, u_b), ∀b ≠ z_i. To facilitate the computation, we similarly use the B_S prototypes that are closest to each prototype u_{z_i} in the source classes. Our complete pairwise maximum margin term is:

    M(x_i, u_{z_i}) = M_V(x_i, u_{z_i}) + M_S(x_i, u_{z_i}) .    (6)

We note that the form of the rank hinge loss in Eq.(4) and Eq.(5) is similar to DeViSE [14], but DeViSE only considers the loss with respect to source/auxiliary data and prototypes.

^4 The Crammer and Singer loss [42, 8] is an upper bound of Eq.(4) and (5), which we use to tolerate variants of u_{z_i} (e.g., 'pigs' vs. 'pig' in Fig. 2) and is thus better suited for our tasks.
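To make the margin structure concrete, below is a minimal numpy sketch of Eq.(4)/(5) for a single instance; the helper name and the pre-selected matrix of competing prototypes are our own illustration.

    import numpy as np

    def pairwise_margin(Wx, u_z, U_near, C):
        # Wx: (d,) projection W^T x_i; u_z: (d,) correct prototype;
        # U_near: (A, d) the A_V (or B_S) nearest competing prototypes;
        # C: margin gap constant.
        d_true = np.sum((Wx - u_z) ** 2)             # D(x_i, u_{z_i})
        d_comp = np.sum((Wx - U_near) ** 2, axis=1)  # D(x_i, u_a) for each a
        # quadratically smoothed hinge [.]_+^2 with the 1/2 factors of Eq.(4)
        h = np.maximum(0.0, C + 0.5 * d_true - 0.5 * d_comp)
        return 0.5 * np.sum(h ** 2)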
Vocabulary-informed Embedding: The complete combined objective can now be written as:

    W = \underset{W}{\operatorname{argmin}} \sum_{i=1}^{n_T} \left( \alpha L_\epsilon(x_i, u_{z_i}) + (1 - \alpha) M(x_i, u_{z_i}) \right) + \lambda \| W \|_F^2 ,    (7)

where n_T denotes the number of labeled training instances and α ∈ [0, 1] is the ratio coefficient between the two terms. One practical advantage is that the objective function in Eq.(7) is an unconstrained minimization problem which is differentiable and can be solved with L-BFGS. W is initialized with all zeros and converges in 10–20 iterations.
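A sketch of minimizing Eq.(7) with an off-the-shelf L-BFGS solver is shown below. It reuses the pairwise_margin helper sketched above and lets scipy approximate gradients numerically purely for brevity (an analytic gradient would be used in practice); all names and the driver structure are illustrative, not the paper's code.

    import numpy as np
    from scipy.optimize import minimize

    def fit_vocab_informed(X, U_true, U_open, alpha, lam, eps, C, A_V=5):
        # X: (N, p) features; U_true: (N, d) prototype of each instance's
        # own class; U_open: (M, d) open-vocabulary prototypes.
        N, p = X.shape
        d = U_true.shape[1]
        # Pre-select, per instance, the A_V open prototypes closest to its
        # class prototype (the speed-up described in the text).
        nn = np.array([np.argsort(np.sum((U_open - u) ** 2, axis=1))[:A_V]
                       for u in U_true])

        def objective(w_flat):
            W = w_flat.reshape(p, d)
            total = lam * np.sum(W ** 2)
            for i in range(N):
                Wx = X[i] @ W
                data = np.sum(np.maximum(0.0, np.abs(Wx - U_true[i]) - eps) ** 2)
                total += alpha * data
                total += (1 - alpha) * pairwise_margin(Wx, U_true[i],
                                                       U_open[nn[i]], C)
            return total

        # W initialized with all zeros, as in the paper.
        res = minimize(objective, np.zeros(p * d), method="L-BFGS-B")
        return res.x.reshape(p, d)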
Fine-tuning Word Vector Space: The above formulation works well assuming the semantic space is well laid out and a linear mapping is sufficient. However, we posit that the word vector space itself is not necessarily optimal for visual discrimination. Consider the following case: two visually similar categories may appear far away in the semantic space. In such a case, it would be difficult to learn a linear mapping that matches instances with category prototypes properly. Inspired by this intuition, which has also been expressed in natural language models [6], we propose to fine-tune the word vector representation for better visual discriminability.

One can potentially fine-tune the representation by optimizing u_i directly, in an alternating optimization (e.g., as in [21]). However, this is only possible for source/auxiliary class prototypes and would break regularities in the semantic space, reducing the ability to transfer knowledge from source/auxiliary to target classes. Alternatively, we propose optimizing a global warping, V, of the word vector space:

    \{W, V\} = \underset{W, V}{\operatorname{argmin}} \sum_{i=1}^{n_T} \left( \alpha L_\epsilon(x_i, u_{z_i} V) + (1 - \alpha) M(x_i, u_{z_i} V) \right) + \lambda \| W \|_F^2 + \mu \| V \|_F^2 ,    (8)

where μ is a regularization coefficient. Eq.(8) can still be solved using L-BFGS, with V initialized as an identity matrix. The algorithm first updates W and then V; typically the step of updating V converges within 10 iterations, and the corresponding class prototypes used for final classification are updated to u_{z_i} V.
3.2. Maximum Margin Embedding Recognition

Once the embedding model is learned, recognition in the semantic space can be done in a variety of ways. We explore a simple alternative to classify a testing instance x*:

    z^* = \underset{i}{\operatorname{argmin}} \| W^T x^* - \phi(u_i, V, W, x^*) \|_2^2 .    (9)

A Nearest Neighbor (NN) classifier directly measures the distance between predicted semantic vectors and the prototypes in the semantic space, i.e., φ(u_i, V, W, x*) = u_i V. We further employ the k-nearest neighbors (KNN) of testing instances to average the predictions, i.e., φ(·) averages the KNN instances of the predicted semantic vectors.^5

^5 This strategy is known as the Rocchio algorithm in information retrieval. The Rocchio algorithm is a method for relevance feedback that uses more relevant instances to update the query instances for better recall and possibly precision in vector space (Chap. 14 in [28]). It was first suggested for use in ZSL in [17]; more sophisticated algorithms [16, 34] are also possible.
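A minimal sketch of the nearest-prototype decision of Eq.(9), in the plain NN case where φ(u_i, V, W, x*) = u_i V; names are illustrative.

    import numpy as np

    def predict_label(x, W, U_vocab, V=None):
        # x: (p,) test feature; W: (p, d) embedding matrix;
        # U_vocab: (M, d) one prototype per vocabulary entry;
        # V: optional (d, d) warping matrix from Eq.(8).
        U = U_vocab if V is None else U_vocab @ V  # warped prototypes u_i V
        g = x @ W                                  # g(x) = W^T x
        return int(np.argmin(np.sum((U - g) ** 2, axis=1)))

The returned index ranges over the whole vocabulary; restricting U_vocab to W_s or W_t recovers the supervised and zero-shot settings respectively.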
4. Experiments

Datasets. We conduct our experiments on the Animals with Attributes (AwA) dataset and the ImageNet 2012/2010 dataset. AwA consists of 50 classes of animals (30,475 images in total). In [25], a standard split into 40 source/auxiliary classes (|W_s| = 40) and 10 target/test classes (|W_t| = 10) is introduced. We follow this split for supervised and zero-shot learning. We use OverFeat features (downloaded from [19]) on AwA to make the results more easily comparable to the state-of-the-art. The ImageNet 2012/2010 dataset is a large-scale dataset. We use the 1000 (|W_s| = 1000) classes of ILSVRC 2012 as the source/auxiliary classes and the 360 (|W_t| = 360) classes of ILSVRC 2010 that are not used in ILSVRC 2012 as target data. We use a pre-trained VGG-19 model [7] to extract deep features for ImageNet. On both datasets, we use few instances from the source dataset to mimic human performance of learning from few examples and the ability to generalize.

Recognition tasks. We consider three different settings in a variety of experiments (in each experiment we carefully denote which setting is used):

SUPERVISED recognition, where learning is on source classes and we assume test instances come from the same classes, with W_s as the recognition vocabulary;

ZERO-SHOT recognition, where learning is on source classes and we assume test instances come from the target dataset, with W_t as the recognition vocabulary;

OPEN-SET recognition, where we use the entirely open vocabulary with |W| ≈ 310K and use test images from both source and target splits.

Competitors. We compare the following models:

SVM: An SVM classifier trained directly on the training instances of source data, without the use of a semantic embedding. This is the standard (SUPERVISED) learning setting, and the learned classifier can only predict the labels in the testing data of source classes.

SVR-Map: SVR is used to learn W, and the recognition is done in the resulting semantic manifold. This corresponds to only using Eq.(3) to learn W.

DeViSE, ConSE, AMP: To compare with state-of-the-art large-scale zero-shot learning approaches, we implement DeViSE [14] and ConSE [30]^6. ConSE uses a multi-class logistic regression classifier for predicting class probabilities of source instances; the parameter T (number of top-T nearest embeddings for a given instance) was selected from {1, 10, 100, 1000} to give the best results. The ConSE method in the supervised setting works the same as SVR. We use the AMP code provided on the authors' webpage [19].

SS-Voc: We test three different variants of our method:

  closed is a variant of our maximum margin learning of W with the vocabulary-informed constraints only from known classes (i.e., the closed set W_s).

  W corresponds to our full model with maximum margin constraints coming from both W_s and W_t (or W). We compute W using Eq.(7), but without optimizing V.

  full further fine-tunes the word vector space by also optimizing V using Eq.(8).

^6 Code for [14] and [30] is not publicly available.

Open set vocabulary. We use Google word2vec to learn the open set vocabulary from a large text corpus of around 7 billion words: UMBC WebBase (3 billion words), the latest Wikipedia articles (3 billion words) and other web documents (1 billion words). Some rare (low frequency) words and high-frequency stop words were pruned from the vocabulary: we remove words with frequency < 300 or > 10 million. The result is a vocabulary of around 310K words/phrases with openness ≈ 1, which is defined as openness = 1 − \sqrt{(2 × |W_s|) / |W|} [38].
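A two-line check of this openness definition reproduces the value used in the experiments:

    import math

    def openness(n_source, n_vocab):
        # openness = 1 - sqrt(2 * |Ws| / |W|), as defined in [38]
        return 1.0 - math.sqrt(2.0 * n_source / n_vocab)

    print(round(openness(40, 310_000), 4))  # 0.9839 for AwA, as quoted in Fig. 2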
Computational and parameter selection and scalability. All experiments are repeated 10 times, to avoid noise due to small training set size, and we report an average across all runs. For all the experiments, the mean accuracy is reported, i.e., the mean of the diagonal of the confusion matrix on the prediction of testing data. We fix the parameters µ and λ to 0.01 and α = 0.6 in our experiments when only few training instances are available for AwA (5 instances per class) and ImageNet (3 instances per class). Varying the values of λ, µ and α leads to < 1% variance on AwA and < 0.2% variance on the ImageNet dataset; the experimental conclusions still hold. Cross-validation is conducted when more training instances are available. A_V and B_S are set to 5 to balance computational cost and the efficiency of the pairwise constraints.
Setting    | Aux | Targ. | Total | Vocab | Chance | SVM  | SVR       | SS-Voc closed | SS-Voc W  | SS-Voc full
SUPERVISED |  ✓  |       |  40   |  40   |  2.5   | 52.1 | 51.4/57.1 | 52.9/58.2     | 53.6/58.6 | 53.9/59.1
ZERO-SHOT  |     |  ✓    |  10   |  10   |  10    |  -   | 52.1/58.0 | 58.6/60.3     | 59.5/68.4 | 61.1/68.9

Table 1. Classification accuracy (%) on the AwA dataset for the SUPERVISED and ZERO-SHOT settings, for 100/1000-dim word2vec representations.
To solve Eq.(8) at scale, one can use Stochastic Gradient Descent (SGD), which makes great progress initially but is often slow when approaching a solution. In contrast, the L-BFGS method mentioned above can achieve steady convergence, at the cost of computing the full objective and gradient at each iteration. L-BFGS can usually achieve better results than SGD with good initialization, but it is computationally expensive. To leverage the benefits of both of these methods, we utilize a hybrid method to solve Eq.(8) on large-scale datasets: the solver is initialized with few instances to approximate the gradients using SGD first, then gradually more instances are used, and a switch to L-BFGS is made over iterations. This solver is motivated by Friedlander et al. [13], who theoretically analyzed and proved the convergence of such hybrid optimization methods. In practice, we use the L-BFGS and Hybrid algorithms for AwA and ImageNet respectively. The hybrid algorithm can save between 20∼50% of training time as compared with L-BFGS.
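The following sketch illustrates this SGD-to-L-BFGS schedule; the batch sizes, step counts and the f_grad interface are our assumptions for illustration, not values from the paper.

    import numpy as np
    from scipy.optimize import minimize

    def hybrid_solve(f_grad, w0, n_train, stages=(64, 256, 1024),
                     sgd_steps=100, lr=1e-3, seed=0):
        # f_grad(idx, w) is assumed to return (value, gradient) of the
        # objective restricted to the training instances indexed by idx.
        rng = np.random.default_rng(seed)
        w = w0.copy()
        for batch in stages:              # gradually use more instances
            for _ in range(sgd_steps):
                idx = rng.choice(n_train, size=batch, replace=False)
                _, g = f_grad(idx, w)
                w = w - lr * g            # cheap SGD steps for fast initial progress
        full = np.arange(n_train)         # then switch to full-batch L-BFGS
        res = minimize(lambda v: f_grad(full, v), w, jac=True,
                       method="L-BFGS-B")
        return res.x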
Methods               | S. Sp | Features      | Acc.
SS-Voc:full           | W     | CNN_OverFeat  | 78.3
  800 instances       | W     | CNN_OverFeat  | 74.4
  200 instances       | W     | CNN_OverFeat  | 68.9
Akata et al. [2]      | A+W   | CNN_GoogLeNet | 73.9
TMV-BLP [16]          | A+W   | CNN_OverFeat  | 69.9
AMP (SR+SE) [19]      | A+W   | CNN_OverFeat  | 66.0
DAP [25]              | A     | CNN_VGG19     | 57.5
PST [34]              | A+W   | CNN_OverFeat  | 54.1
DAP [25]              | A     | CNN_OverFeat  | 53.2
DS [35]               | W/A   | CNN_OverFeat  | 52.7
Jayaraman et al. [22] | A     | low-level     | 48.7
Yu et al. [46]        | A     | low-level     | 48.3
IAP [25]              | A     | CNN_OverFeat  | 44.5
HEX [9]               | A     | CNN_DECAF     | 44.2
AHLE [1]              | A     | low-level     | 43.5

Table 2. Zero-shot comparison on AwA. We compare the state-of-the-art ZSL results using different semantic spaces (S. Sp), including word vectors (W) and attributes (A). A 1000-dimensional word2vec dictionary is used for SS-Voc. (Chance level = 10%.) Different types of CNN and hand-crafted low-level features are used by the different methods. Except for SS-Voc (200/800), all instances of the source data (24,295 images) are used for training. As a general reference, the classification accuracy on ImageNet: CNN_DECAF < CNN_OverFeat < CNN_VGG19 < CNN_GoogLeNet.

4.1. Experimental results on AwA dataset

We report AwA experimental results in Tab. 1, which uses 100/1000-dimensional word2vec representations (i.e., d = 100/1000). We highlight the following observations: (1) SS-Voc variants have better classification accuracy than SVM and SVR. This validates the effectiveness of our model. Particularly, the results of our SS-Voc:full are 1.8/2% and 9/10.9% higher than those of SVR/SVM on supervised and zero-shot recognition respectively. Note that though the results of SVM/SVR are good for supervised recognition tasks (52.1 and 51.4/57.1 respectively), we can further improve on them, which we attribute to the more discriminative classification boundary informed by the vocabulary. (2) SS-Voc:W significantly, by up to 8.1%, improves the zero-shot recognition results of SS-Voc:closed. This validates the importance of information from the open vocabulary. (3) SS-Voc benefits more from the open set vocabulary than from word vector space fine-tuning. The results of supervised and zero-shot recognition of SS-Voc:full are 1/0.9% and 2.5/8.6% higher than those of SS-Voc:closed.

Comparing to state-of-the-art on ZSL: We compare our results with the state-of-the-art ZSL results on the AwA dataset in Tab. 2. We compare SS-Voc:full trained with all source instances, 800 instances (20 instances / class), and 200 instances (5 instances / class). Our model achieves 78.3% accuracy, which is remarkably higher than all previous methods. This is particularly impressive taking into account the fact that we use only a semantic space and no additional attribute representations that many other competitor methods utilize. Further, our results with 800 training instances, a small fraction of the 24,295 instances used to train all other methods, already outperform all other approaches. We argue that much of our success and improvement comes from the more discriminative information obtained using an open set vocabulary and the corresponding large margin constraints, rather than from the features, since our method improved 25.1% as compared with DAP [25], which uses the same OverFeat features. Note that our SS-Voc:full result is 4.4% higher than the closest competitor [2]; this improvement is statistically significant. Compared with our work, [2] not only used more powerful visual features (GoogLeNet vs. OverFeat), but also employed more semantic embeddings (attributes, GloVe^7 and WordNet-derived similarity embeddings, as compared to our word2vec).

^7 GloVe [33] can be taken as an improved version of word2vec.
Large-scale open set recognition: Here we focus on the OPEN-SET_310K setting with the large vocabulary of approximately 310K entities; as such, the chance performance of the task is much lower. In addition, to study performance as a function of the open vocabulary set, we also conduct two additional experiments with different label sets: (1) OPEN-SET_1K-NN: the 1000 labels in the nearest neighbor set of the ground-truth class prototypes are selected from the complete dictionary of 310K labels; this corresponds to open set fine-grained recognition; (2) OPEN-SET_1K-RND: 1000 label names randomly sampled from the 310K set. The results are shown in Fig. 2. Also note that we did not fine-tune the word vector space (i.e., V is an identity matrix) in the OPEN-SET_310K setting, since Eq.(8) can optimize a better visual discriminability only on a relatively small subset as compared with the 310K vocabulary. While our OPEN-SET variants do not assume that test data comes from either the source/auxiliary domain or the target domain, we split the two cases to mimic SUPERVISED and ZERO-SHOT scenarios for easier analysis.

AwA Dataset     | Aux. (left panel) / Targ. (right panel) | Vocab
OPEN-SET_1K-NN  | 40 / 10                                 | 1000*
OPEN-SET_1K-RND | 40 / 10                                 | 1000†
OPEN-SET_310K   | 40 / 10                                 | 310K

[Figure 2 shows two panels of accuracy (%) vs. Hit@k (k = 0..20) curves, left for the auxiliary classes and right for the target classes, comparing SVR-Map and SS-Voc:W on the OPEN SET 1K-NN, 1K-RND and 310K label sets.] Figure 2. Open set recognition results on the AwA dataset: Openness = 0.9839. Chance = 3.2e−4%. The ground truth label is extended with its variants; for example, we count a correct label if a 'pig' image is labeled as 'pigs'. *, †: different 1000-label settings.
In the SUPERVISED-like setting, Fig. 2 (left), our accuracy is better than that of SVR-Map on all three different label sets and at all hit rates. The better results are largely due to the better embedding matrix W learned by enforcing maximum margins between training class names and the open set vocabulary on source training data.

In the ZERO-SHOT-like setting, our method still has a notable advantage over the SVR-Map method in Top-k (k > 5) accuracy, again thanks to the better embedding W learned by Eq.(7). However, we notice that our top-1 accuracy in the ZERO-SHOT-like setting is lower than that of the SVR-Map method. We find that our method tends to label some instances from target data with their nearest classes from within the source label set. For example, "humpback whale" from testing data is more likely to be labeled as "blue whale". However, when considering Top-k (k > 5) accuracy, our method still has advantages over the baselines.
4.2. Experimental results on ImageNet dataset

We further validate our findings on the large-scale ImageNet 2012/2010 dataset; a 1000-dimensional word2vec representation is used here since this dataset has a larger number of classes than AwA. We highlight that our results are still better than those of the two baselines – SVR-Map and SVM – in the (SUPERVISED) and (ZERO-SHOT) settings respectively, as shown in Tab. 3. The open set image recognition results are shown in Fig. 4. In both the SUPERVISED-like and ZERO-SHOT-like settings, our framework clearly still has advantages over the baseline, which directly matches the nearest neighbors from the vocabulary using the predicted semantic word vectors of each testing instance.

We note that the SUPERVISED SVM result (34.61%) on ImageNet is lower than the 63.30% reported in [7], despite using the same features. This is because only few samples, 3 per class, are used to train our models, to mimic human performance of learning from few examples and illustrate the ability of our model to learn with little data. However, our semi-supervised vocabulary-informed learning can improve the recognition accuracy in all settings. In open set image recognition, the performance drops from 37.12% (SUPERVISED) and 8.92% (ZERO-SHOT) to around 9% and 1% respectively (Fig. 4). This drop is caused by the intrinsic difficulty of the open set image recognition task (≈ 300× increase in vocabulary) on a large-scale dataset. However, our performance is still better than the SVR-Map baseline, which in turn is significantly better than the chance level.

We also evaluated our model with a larger number of training instances (> 3 per class). We observe that for the standard supervised learning setting, the improvements achieved using vocabulary-informed learning tend to somewhat diminish as the number of training instances substantially grows. With a large number of training instances, the mapping between low-level image features and semantic words, g(x), becomes better behaved, and the effect of the additional constraints, due to the open vocabulary, becomes less pronounced.

Comparing to state-of-the-art on ZSL. We compare our results to several state-of-the-art large-scale zero-shot recognition models. Our results, SS-Voc:full, are better than those of ConSE, DeViSE and AMP on both T-1 and T-5 metrics with a very significant margin (the improvement over the best competitor, ConSE, is 3.43 percentage points, or nearly 62%, with 3,000 training samples). The poor results of DeViSE with 3,000 training instances are largely due to the inefficient learning of the visual-semantic embedding matrix. The AMP algorithm also relies on the embedding matrix from DeViSE, which explains the similarly poor performance of AMP with 3,000 training instances. In contrast, our SS-Voc:full can leverage discriminative information from the open vocabulary and max-margin constraints, which helps improve performance. For DeViSE with all ImageNet instances, we confirm the observation in [30] that the results of ConSE are much better than those of DeViSE. Our results are a further significant improvement over ConSE.
[Figure 3 shows three t-SNE panels – SS-Voc:full, SS-Voc:closed and SVR-Map – of the 10 AwA testing classes (persian cat, hippopotamus, leopard, humpback whale, seal, chimpanzee, rat, giant panda, pig, raccoon). The top-10 open set predictions displayed in the figure for one example image are:
SS-Voc:full: persian_cat, siamese_cat, hamster, weasel, rabbit, monkey, zebra, owl, anthropomorphized, cat
SS-Voc:closed: hamster, persian_cat, siamese_cat, rabbit, monkey, weasel, squirrel, anteater, cat, stuffed_toy
SVR-Map: hamster, squirrel, rabbit, raccoon, kitten, siamese_cat, stuffed_toy, persian_cat, ladybug, puppy]
Figure 3. t-SNE visualization of the AwA 10 testing classes. Please refer to the Supplementary material for a larger figure.
Setting    | Aux | Targ. | Total | Vocab | Chance | SVM  | SVR  | SS-Voc closed | SS-Voc W | SS-Voc full
SUPERVISED |  ✓  |       | 1000  | 1000  | 0.1    | 33.8 | 25.6 | 34.2          | 36.3     | 37.1
ZERO-SHOT  |     |  ✓    | 360   | 360   | 0.278  |  -   | 4.1  | 8.0           | 8.2      | 8.9

Table 3. Classification accuracy (%) on the ImageNet 2012/2010 dataset in the SUPERVISED and ZERO-SHOT settings.
ImageNet Data | Aux. (left panel) / Targ. (right panel) | Vocab
OPEN-SET_310K | 1000 / 360                              | 310K

[Figure 4 shows two panels of accuracy (%) vs. Hit@k (k = 0..20) curves, left for the auxiliary classes and right for the target classes, comparing SVR-Map and SS-Voc:W on the OPEN SET 310K label set.] Figure 4. Open set recognition results on the ImageNet 2012/2010 dataset: Openness = 0.9839. Chance = 3.2e−4%. We use the synsets of each class – a set of synonymous (word or phrase) terms – as the ground truth names for each instance.

Methods      | S. Sp | Feat.        | T-1     | T-5
SS-Voc:full  | W     | CNN_OverFeat | 8.9/9.5 | 14.9/16.8
ConSE [30]   | W     | CNN_OverFeat | 5.5/7.8 | 13.1/15.5
DeViSE [14]  | W     | CNN_OverFeat | 3.7/5.2 | 11.8/12.8
AMP [19]     | W     | CNN_OverFeat | 3.5/6.1 | 10.5/13.1
Chance       | –     | –            | 2.78e-3 | –

Table 4. ImageNet comparison to the state-of-the-art on ZSL: We compare the results of using 3,000/all training instances for all methods; T-1 (top 1) and T-5 (top 5) classification in % is reported.
4.3. Qualitative results of open set image recognition

t-SNE visualizations of the 10 AwA target testing classes are shown in Fig. 3. We compare our SS-Voc:full with SS-Voc:closed and SVR. We note that (1) the distributions of the 10 classes obtained using SS-Voc are more centered and more separable than those of SVR (e.g., rat, persian cat and pig), due to the data and pairwise maximum margin terms that help improve the generalization of the learned g(x); (2) the distributions of the different classes obtained using the full model SS-Voc:full are also more separable than those of SS-Voc:closed, e.g., rat, persian cat and raccoon. This can be attributed to the addition of the open-vocabulary-informed constraints during the learning of g(x), which further improves generalization. For example, we show an open set recognition example image of "persian cat", which is wrongly classified as a "hamster" by SS-Voc:closed. A partial illustration of the embeddings learned for the ImageNet 2012/2010 dataset is given in Figure 1, where 4 source/auxiliary and 2 target/zero-shot classes are shown. Again, the better separation among classes is largely attributed to the open-set max-margin constraints introduced in our SS-Voc:full model. Additional examples of mis-classified instances are available in the supplemental material.
5. Conclusion and Future Work

This paper introduces the problem of semi-supervised vocabulary-informed learning, which utilizes an open set semantic vocabulary to help train better classifiers for observed and unobserved classes in supervised learning, ZSL and open set image recognition settings. We formulate semi-supervised vocabulary-informed learning in a maximum margin framework. Extensive experimental results illustrate the efficacy of such a learning paradigm. Strikingly, it achieves competitive performance with only few training instances and is relatively robust to a large open set vocabulary of up to 310,000 class labels.

We rely on word2vec to transfer information between observed and unobserved classes. In the future, other linguistic or visual semantic embeddings could be explored instead, or in combination, as part of vocabulary-informed learning.
References

[1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for attribute-based classification. In CVPR, 2013.
[2] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, 2015.
[3] E. Bart and S. Ullman. Cross-generalization: learning novel classes from a single example by feature replacement. In CVPR, 2005.
[4] A. Bendale and T. Boult. Towards open world recognition. In CVPR, 2015.
[5] I. Biederman. Recognition by components - a theory of human image understanding. Psychological Review, 1987.
[6] S. R. Bowman, C. Potts, and C. D. Manning. Learning distributed word representations for natural logic reasoning. CoRR, abs/1410.4176, 2014.
[7] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
[8] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2001.
[9] J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, and H. Adam. Large-scale object classification using label relation graphs. In ECCV, 2014.
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[11] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, 2009.
[12] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE TPAMI, 2006.
[13] M. P. Friedlander and M. Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM J. Scientific Computing, 2012.
[14] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.
[15] Y. Fu, T. Hospedales, T. Xiang, and S. Gong. Attribute learning for understanding unstructured social activity. In ECCV, 2012.
[16] Y. Fu, T. M. Hospedales, T. Xiang, Z. Fu, and S. Gong. Transductive multi-view embedding for zero-shot recognition and annotation. In ECCV, 2014.
[17] Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong. Learning multi-modal latent attributes. IEEE TPAMI, 2013.
[18] Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong. Transductive multi-view zero-shot learning. IEEE TPAMI, 2015.
[19] Z. Fu, T. Xiang, E. Kodirov, and S. Gong. Zero-shot object recognition by semantic manifold distance. In CVPR, 2015.
[20] S. Guadarrama, E. Rodner, K. Saenko, N. Zhang, R. Farrell, J. Donahue, and T. Darrell. Open-vocabulary object retrieval. In Robotics Science and Systems (RSS), 2014.
[21] S. J. Hwang and L. Sigal. A unified semantic embedding: relating taxonomies and attributes. In NIPS, 2014.
[22] D. Jayaraman and K. Grauman. Zero shot recognition with unreliable attributes. In NIPS, 2014.
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[24] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In ICCV, 2009.
[25] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE TPAMI, 2013.
[26] H. Larochelle, D. Erhan, and Y. Bengio. Zero-data learning of new tasks. In AAAI, 2008.
[27] Y.-J. Lee, W.-F. Hsieh, and C.-M. Huang. ε-SSVR: A smooth support vector machine for ε-insensitive regression. IEEE TKDE, 2005.
[28] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2009.
[29] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[30] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. In ICLR, 2014.
[31] M. Palatucci, G. Hinton, D. Pomerleau, and T. M. Mitchell. Zero-shot learning with semantic output codes. In NIPS, 2009.
[32] D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011.
[33] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[34] M. Rohrbach, S. Ebert, and B. Schiele. Transfer learning in a transductive setting. In NIPS, 2013.
[35] M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, and B. Schiele. What helps where – and why? Semantic relatedness for knowledge transfer. In CVPR, 2010.
[36] H. Sattar, S. Muller, M. Fritz, and A. Bulling. Prediction of search targets from fixations in open-world settings. In CVPR, 2015.
[37] W. J. Scheirer, L. P. Jain, and T. E. Boult. Probability models for open set recognition. IEEE TPAMI, 2014.
[38] W. J. Scheirer, A. Rocha, A. Sapkota, and T. E. Boult. Towards open set recognition. IEEE TPAMI, 2013.
[39] R. Socher, M. Ganjoo, H. Sridhar, O. Bastani, C. D. Manning, and A. Y. Ng. Zero-shot learning through cross-modal transfer. In NIPS, 2013.
[40] A. Torralba, R. Fergus, and W. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE TPAMI, 2008.
[41] A. Torralba, K. P. Murphy, and W. T. Freeman. Using the forest to see the trees: Exploiting context for visual object detection and localization. Commun. ACM, 2010.
[42] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 2005.
[43] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE TPAMI, 2011.
[44] J. Weston, S. Bengio, and N. Usunier. WSABIE: Scaling up to large vocabulary image annotation. In IJCAI, 2011.
[45] Z. Wu, Y. Fu, Y.-G. Jiang, and L. Sigal. Harnessing object and scene semantics for large-scale video understanding. In CVPR, 2016.
[46] F. X. Yu, L. Cao, R. S. Feris, J. R. Smith, and S.-F. Chang. Designing category-level attributes for discriminative visual recognition. In CVPR, 2013.
[47] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In ICML, 2004.
