these three aspects. The challenge lies in finding a good representation for these aspects and a way of combining them that preserves the distinctiveness of each aspect, rather than averaging over them. However, flower species often have multiple values for an aspect. For example, despite their names, violets can be white as well as violet in colour, and bluebells can be pink. This is quite exasperating, but indicates that any class representation will need to be multi-modal.
1.1. Overview and performance measure

In the rest of this paper we develop a nearest neighbour classifier. The classifier involves a number of stages, starting with representing the three aspects as histograms of occurrences of visual words (a separate vocabulary is developed for each aspect) (section 2), and then combining the histograms (section 3) into a single vocabulary. SIFT descriptors [10] on a regular grid are used to describe shape, HSV-values to describe colour, and MR-filters [16] to describe texture. Each is vector quantized to provide the visual words for that aspect. Each stage is separately optimized (in the manner of [4]).
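As a concrete illustration of the vocabulary stage, clustering descriptors into visual words and then histogramming word occurrences per image might be sketched as follows. This is a minimal sketch with a toy k-means, not the paper's implementation; all function names are illustrative:

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    """Cluster descriptors with a tiny k-means; the centres are the visual words."""
    rng = np.random.default_rng(seed)
    X = np.asarray(descriptors, dtype=float)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest centre, then recompute centres
        labels = np.linalg.norm(X[:, None] - centres[None], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return centres

def word_histogram(descriptors, centres):
    """n(w|I): normalized histogram of visual-word occurrences in one image."""
    X = np.asarray(descriptors, dtype=float)
    labels = np.linalg.norm(X[:, None] - centres[None], axis=2).argmin(axis=1)
    counts = np.bincount(labels, minlength=len(centres)).astype(float)
    return counts / counts.sum()
```

In practice a separate vocabulary would be built for each aspect (shape, colour, texture) from its own descriptors.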
Since we are mainly interested in being able to retrieve a short list of correct matches, we optimize a performance cost to reflect this. Given a test image $I_i^{test}$, the classifier returns a ranked list of training images $I_j^{train}$, $j = 1, 2, \ldots, M$, with $j = 1$ being the highest ranked. Suppose the highest ranked correct classification is at $j = p$; then the performance score for $I_i^{test}$ is $w_p$ if $p \le S$ and $0$ otherwise, and the overall performance is

$$f(\alpha) = \frac{1}{N} \sum_{i=1}^{N} \begin{cases} w_{p_i} & \text{if } p_i \le S \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
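The performance measure (1) can be sketched in code; the weights $w_p$, cut-off $S$ and the toy labels below are illustrative, not values from the paper:

```python
def performance_score(ranked_labels, true_labels, weights, S):
    """Mean score over test images: w_p if the first correct training
    label appears at rank p <= S, else 0 (weights[p-1] gives w_p)."""
    total = 0.0
    for ranking, truth in zip(ranked_labels, true_labels):
        # p is the 1-based rank of the highest-ranked correct match
        p = next((j + 1 for j, lab in enumerate(ranking) if lab == truth), None)
        if p is not None and p <= S:
            total += weights[p - 1]
    return total / len(true_labels)

# toy example: three test images, each with a ranked list of training labels
ranked = [["daisy", "iris"], ["iris", "pansy"], ["pansy", "daisy"]]
truth = ["daisy", "pansy", "tulip"]
print(performance_score(ranked, truth, weights=[1.0, 0.5], S=2))  # 0.5
```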
[Figure 6: three plots of performance (roughly 80–92) against vocabulary size (200–1800), SIFT support radius (10–70 pixels) and grid spacing (10–70 pixels).]
Figure 6. Performance (1) using shape vocabulary. Varying the number of clusters ($V_s$), the radius (in pixels) of the SIFT support window (R), and the spacing (in pixels) of the measurement grid (M).
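The grid parameters varied in Figure 6 (spacing M and support radius R) might be generated as in this sketch; descriptor computation itself is omitted and the function name is illustrative:

```python
def grid_keypoints(height, width, spacing, radius):
    """Centres of a regular measurement grid, keeping each keypoint's
    support window of the given radius fully inside the image."""
    return [(x, y)
            for y in range(radius, height - radius + 1, spacing)
            for x in range(radius, width - radius + 1, spacing)]

# toy example: a 100x100 image, 30-pixel spacing, 10-pixel radius
print(len(grid_keypoints(100, 100, spacing=30, radius=10)))  # 9
```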
[Figure 9: performance (65–100) against vocabulary size (100–1000), one curve per filter size (3, 7, 11 and 15).]
Figure 9. Performance (1) for the texture features on segmented images. Best results are obtained with filter size = 11 and 700 clusters.
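The rotation invariance of the MR8 texture responses comes from keeping, at each pixel, only the maximum response of a filter over its orientations; a minimal numpy sketch (the toy response values are illustrative):

```python
import numpy as np

def max_over_orientations(responses):
    """responses: array (n_orientations, H, W) of one filter at several
    orientations; collapse to a single rotation-invariant response map."""
    return responses.max(axis=0)

# toy: responses of a bar filter at 4 orientations on a 2x2 image
resp = np.array([[[0.1, 0.9], [0.2, 0.0]],
                 [[0.5, 0.1], [0.1, 0.3]],
                 [[0.2, 0.4], [0.6, 0.2]],
                 [[0.0, 0.2], [0.3, 0.1]]])
print(max_over_orientations(resp))  # the per-pixel maximum over the 4 maps
```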
Figure 7. Two images from the Daffodil category and examples of regions described by the same words. All the circles of one colour correspond to the same word. The blue/dashed word represents petal intersections and the red word rounded petal ends. Note that the words detect similar petal parts in the same image (intra-image grouping) and also between flowers.

… (figure 8), or more subtle in the form of characteristic veins in the petals. The subtle patterns are sometimes difficult to distinguish due to illumination conditions, a problem that also affects the appearance of more distinctive patterns.

Figure 8. Flowers with distinctive patterns. From left to right: Pansy with distinctive stripes, fritillary with distinctive checks and tiger-lily with distinctive dots.

We describe the texture by convolving the images with the MR8 filter bank introduced by Varma and Zisserman [16]. The filter bank contains filters at multiple orientations. Rotation invariance is obtained by choosing the maximum response over orientations. We optimize over filters with square support regions of size $s = 3$ to $19$ pixels. A vocabulary is created by clustering the descriptors and the frequency histograms $n(w_t|I)$ are obtained. Classification is done in the same way as with the colour features. Figure 9 shows how the performance varies with the number of clusters for different filter sizes. The best performance is obtained using 700 clusters and a filter of size 11. The recognition rate for the first hypothesis is 56.0% and for the fifth hypothesis it is 84.3%.

3. Combined Vocabulary

The discriminative power of each aspect varies for different species. Table 1 shows the confusion matrices for the different aspects for the consistent viewpoint set. Not surprisingly, it shows that some flowers are clearly distinguished by shape, e.g. daisies, some by colour, e.g. fritillaries, and some by texture, e.g. colts feet and fritillaries. It also shows that some aspects are too similar for certain flowers, e.g. buttercups and daffodils get confused by colour, colts feet and dandelions get confused by shape, and buttercups and irises get confused by texture. By combining the different aspects in a flexible manner one could expect to achieve improved performance. We combine the vocabularies for each aspect into a joint flower vocabulary, to obtain a joint frequency histogram $n(w|I)$. However, we have some freedom here because we do not need to give equal weight to each aspect: consider if one aspect had many more words than another; then on average the one with more words would dominate the distance in the nearest neighbour comparisons. We introduce a weight vector $\alpha$, so that the combined histogram is:

$$n(w|I) = \begin{pmatrix} \alpha_s\, n(w_s|I) \\ \alpha_c\, n(w_c|I) \\ \alpha_t\, n(w_t|I) \end{pmatrix} \qquad (3)$$

Since the final histogram is normalized there are only two independent parameters, which represent two of the ratios in $\alpha_s : \alpha_c : \alpha_t$.

We learn the weights, $\alpha$, on the consistent viewpoint set by maximizing the performance score of (1), here $f(\alpha)$, on the validation set. The performance is evaluated on the test set.

We start by combining the two aspects which are most useful for classifying the flowers, i.e. the shape and texture. Figure 10 shows $f(\alpha)$ for varying $\alpha_t$. We keep $\alpha_s = 1$ fixed. Best performance is achieved with $\alpha_t = 0.8$. This means that the performance is best when texture has almost the same influence as shape. Combining shape and colour, however, leads to a superior performance. This is because the colour and shape complement each other better, whilst shape and texture often have the same confusions. The best performance for combining shape and colour is achieved when $\alpha_s = 1$ and $\alpha_c = 0.4$, i.e. when colour has less than half the influence of shape. The best performance is achieved by combining all aspects with $\alpha_s = 1.0$, $\alpha_c = 0.4$ and $\alpha_t = 1.0$. These results indicate that we have successfully combined the vocabularies: the joint performance exceeds the best performance of each of the separate vocabularies, i.e. we are not simply averaging over their separate classifications (which would deliver a performance somewhere between the best (shape) and worst (colour) aspect). Figure 11 shows an instance where both shape and colour misclassify an object but their combination classifies it correctly.

Table 1. Confusion matrices for the first hypothesis of the different aspects on the consistent viewpoint dataset. The recognition rate is 75.3% for shape, 56.0% for texture and 49.0% for colour, compared to a chance rate of 10%.

Colour:
buttercup 33.33 30.00 10.00 3.33 10.00 3.33 10.00
colts foot 6.67 60.00 26.67 3.33 3.33
daffodil 10.00 10.00 30.00 40.00 3.33 6.67
daisy 56.67 6.67 36.67
dandelion 3.33 40.00 13.33 30.00 13.33
fritillary 3.33 93.33 3.33
iris 6.67 6.67 13.33 13.33 36.67 10.00 13.33
pansy 3.33 13.33 3.33 23.33 46.67 10.00
sunflower 6.67 30.00 20.00 43.33
windflower 30.00 10.00 60.00

Shape:
buttercup 70.00 3.33 13.33 13.33
colts foot 3.33 63.33 10.00 23.33
daffodil 3.33 60.00 13.33 16.67 6.67
daisy 96.67 3.33
dandelion 3.33 16.67 3.33 3.33 73.33
fritillary 90.00 3.33 6.67
iris 16.67 6.67 70.00 3.33 3.33
pansy 16.67 6.67 16.67 53.33 6.67
sunflower 10.00 3.33 3.33 83.33
windflower 3.33 3.33 93.33

Texture:
buttercup 46.67 10.00 10.00 23.33 10.00
colts foot 86.67 13.33
daffodil 6.67 60.00 6.67 6.67 13.33 6.67
daisy 6.67 10.00 50.00 20.00 3.33 10.00
dandelion 33.33 3.33 46.67 3.33 6.67 3.33 3.33
fritillary 3.33 13.33 80.00 3.33
iris 13.33 3.33 13.33 6.67 50.00 3.33 3.33 6.67
pansy 10.00 13.33 3.33 6.67 10.00 50.00 6.67
sunflower 10.00 13.33 10.00 13.33 6.67 6.67 40.00
windflower 13.33 6.67 10.00 6.67 3.33 3.33 6.67 50.00

Discussion: The problem of combining classifications based on each aspect is similar to that of combining independent classifiers [11]. The $\alpha$ weighting gives a linear combination of distance functions (as used in [11]). To see this, consider the form of the nearest neighbour classifier (2). Since in our case $d(\alpha x, \alpha y) = \alpha\, d(x, y)$, (2) becomes

$$c = \arg\min_j \{\, \alpha_s\, d(n(w_s|I^{test}), n(w_s|I_j^{train})) + \alpha_c\, d(n(w_c|I^{test}), n(w_c|I_j^{train})) + \alpha_t\, d(n(w_t|I^{test}), n(w_t|I_j^{train})) \,\}$$

It is likely that learning weights for each class would increase the performance of the classification system, for example by learning a confusion matrix over all classes for each aspect. However, as the number of classes increases this becomes computationally intensive.

4. Results

In this section we present the results on the full 1360 image dataset consisting of 80 images for each of 17 categories.
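The weighted nearest-neighbour rule of section 3, a weighted sum of per-aspect distances followed by a nearest-neighbour decision, might be sketched as follows; the $\chi^2$ helper, the dictionary layout and the toy histograms are illustrative assumptions, not the paper's code:

```python
import numpy as np

def chi2(x, y, eps=1e-10):
    """Chi-squared distance between two normalized histograms."""
    return 0.5 * np.sum((x - y) ** 2 / (x + y + eps))

def classify(test_hists, train_hists, train_labels, alphas):
    """Nearest neighbour under a weighted sum of per-aspect distances.

    test_hists: dict aspect -> histogram for the test image.
    train_hists: dict aspect -> list of histograms, one per training image."""
    best_j = min(
        range(len(train_labels)),
        key=lambda j: sum(a * chi2(test_hists[asp], train_hists[asp][j])
                          for asp, a in alphas.items()),
    )
    return train_labels[best_j]

# toy example with two aspects and two training images
train_hists = {"shape": [np.array([1.0, 0.0]), np.array([0.0, 1.0])],
               "colour": [np.array([0.5, 0.5]), np.array([0.9, 0.1])]}
test_hists = {"shape": np.array([0.9, 0.1]), "colour": np.array([0.6, 0.4])}
print(classify(test_hists, train_hists, ["daisy", "iris"],
               alphas={"shape": 1.0, "colour": 0.4}))  # daisy
```

Scaling each aspect's histogram by $\alpha$ before a single $\chi^2$ comparison gives the same ranking, which is the linearity the discussion above relies on.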
[…] distinguishable by texture is very small, and the texture features also suffer due to large scale variations. We achieve our best performance by combining shape and colour with $\alpha_s = 1$ and $\alpha_c = 1$. The performance according to (1) is 81.3%, a very respectable value for a dataset of this difficulty. Although the texture aspect has become redundant, the final classifier clearly demonstrates that a more robust system is achieved by combining aspects. Figure 13 shows a typical misclassification illustrating the difficulty of the task and figure 14 shows a few examples of correctly classified flowers.

We compare the classifier to a baseline algorithm using RGB colour histograms computed for each 10 × 10 pixel region in an image. Classification is again by nearest neighbours using the $\chi^2$ distance measure for the histograms. The baseline performance is 55.7%, substantially below that of the combined aspect classifier.

[Figure 10: performance (90–100) against weight value (0.1–0.9), three curves: shape+colour ($\alpha_s$=1), shape+texture ($\alpha_s$=1), shape+colour+texture ($\alpha_s$=1, $\alpha_c$=0.4).]
Figure 10. Performance for combining shape, colour and texture. The blue/dashed curve shows performance when varying $\alpha_t$ for a combination of shape and texture only, and the red/solid curve shows performance when varying $\alpha_c$ for a combination of shape and colour only. The best performance is obtained by combining all features (green/dot-dashed curve) with $\alpha_s = 1.0$, $\alpha_c = 0.4$ and $\alpha_t = 1.0$.

[Plot residue: performance (55–80) for shape, colour and texture on segmented images; panel header: Test Image, 1st Hyp Colour, 1st Hyp Shape, 1st Hyp Combined.]
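The RGB-histogram baseline might be sketched as follows; the 10-pixel patch size matches the text, while the bin count is an illustrative assumption (the paper does not specify its binning):

```python
import numpy as np

def patch_rgb_histograms(image, patch=10, bins=4):
    """Concatenate a normalized RGB colour histogram for each
    patch x patch region of the image (H, W, 3, values 0-255)."""
    H, W, _ = image.shape
    feats = []
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            region = image[y:y + patch, x:x + patch].reshape(-1, 3)
            # joint 3-D histogram over the R, G and B channels
            hist, _ = np.histogramdd(region, bins=(bins, bins, bins),
                                     range=((0, 256),) * 3)
            feats.append(hist.ravel() / hist.sum())
    return np.concatenate(feats)
```

Classification then proceeds exactly as before: nearest neighbour under the $\chi^2$ distance between these concatenated histograms.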