Alexei Yavlinsky
Abstract
This work investigates the application of Support Vector Machines (SVMs) and a number of feature selection methods to Content Based Image Retrieval (CBIR), based on the feature generation scheme proposed by Tieu and Viola in 2000. The study builds largely on the work published by Pickering M. and Rüger S., which investigates a number of approaches to CBIR. Experiments are carried out, and the results obtained are compared with the published results. Conclusions drawn from this comparison identify potential areas for future research.
1 Background
We assume that the reader is familiar with the basic concepts of machine learning, such as training and classification. For a general introduction to machine learning with emphasis on the underlying theory, "Machine Learning" [B2] by Tom Mitchell is recommended. "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations" [B3] by Witten I. and Frank E. is also recommended; it concentrates on the practical aspects of machine learning, such as evaluation methods and feature selection.

Content Based Image Retrieval is the process in which the user retrieves images of interest from a (usually large) database collection by providing the system with example images of what he/she wants to retrieve. For an in-depth analysis of the field, "Content based image retrieval at the end of the early years" by Smeulders A., Worring M., Santini S., Gupta A. and Jain R. (2000) [18] is recommended. We also assume that the reader is familiar with the paper on which this study is based, "Evaluation of Key-frame Based Retrieval Techniques for Video" by Pickering M. and Rüger S. [14].
Alexei Yavlinsky
13/01/03
classifier which images the user does not want. The images from that database are then evaluated by the classifier as belonging to either the positive or the negative class of images, and the positive images are returned to the user first. The returned positive images are ranked by the degree of certainty that they are, in fact, positive. This machine-learning problem is unusual in that it has a very low ratio of positive examples to negative ones. In most cases, the user provides between one and six positive images, whilst the system uses about 100 negative images picked at random from the database. Such a set-up makes it non-trivial to build classifiers that generalise well to unseen data, because there is little information on what features to look for in a positive image, and very disparate information is provided on what constitutes a negative image. The database is, however, sufficiently large to minimise the risk of picking a false negative example.
examples are not required in this case; images in the database are ranked using the positive examples only. The variant used in [14] uses Manhattan distance as a distance measure between two feature vectors.
In essence, for a given image this method ranks it by how much closer that image is to the positive images and how much further away it is from negative images, compared to other images being evaluated.
1.3.3 AdaBoost
The adaptation of the AdaBoost algorithm [3] attempts to select the most relevant features by finding the features that generate the least error when classifying the training set. This approach is also used by [19] for ranking images. The idea is to build a strong classifier from a combination of a series of weak classifiers, each of which is determined by one iteration of the algorithm. In this case, a weak classifier is a single feature, along which positive examples are separated from negative examples. Each iteration determines the weak classifier that generates the least error on the training set, i.e. the one that best separates the positive and negative examples in the training set.

Each example is assigned an initial weight, such that the sums of the weights of the positive and of the negative examples are equal, and the weights are evenly distributed within these two groups. Weights are updated at each iteration according to whether the example was correctly classified at the previous iteration, so that when choosing the next best weak classifier, more attention is paid to images which have been incorrectly classified up to this point. When the desired number of weak classifiers has been chosen, their hypotheses are combined to give a classifier with a final strong hypothesis for classification.

A weak classifier's hypothesis is the two-class Gaussian model derived from the projection of the positive and negative training examples onto the feature which the classifier represents (Fig. 1-1). Given any image vector, a weak hypothesis will tell us the probability of it belonging to the negative or the positive class.
Figure 1-1: A weak classifier's hypothesis (projections of the positive and negative examples onto the feature component, with the positive and negative probability plotted against feature i).
The algorithm used in [14] and [19] is as follows:

We are given example images $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i$ is the $i$th feature vector and $y_i \in \{0, 1\}$ for negative and positive examples respectively.

For the first iteration ($t = 1$), the image weights $w_{t,i}$ are initialised:

$$w_{1,i} = \frac{1}{2q} \text{ for } y_i = 0; \qquad w_{1,i} = \frac{1}{2p} \text{ for } y_i = 1$$

where $q$ and $p$ are the numbers of negative and positive examples respectively.

for $t := 1$ to $T$ do

    Generate one hypothesis $h_j$ for each feature $j$ using the probability distribution $w_t$, with error $\epsilon_j = \Pr_i^{w_t}[h_j(x_i) \neq y_i]$.

    Choose the hypothesis $h_k$ with the least error in the training set. Assign $\epsilon_t = \epsilon_k$ (and correspondingly $h_t = h_k$).

    Update: $w_{t+1,i} = w_{t,i}\,\beta_t^{1-e_i}$, where $e_i \in \{0, 1\}$ for example $x_i$ classified correctly or incorrectly respectively, and $\beta_t = \dfrac{\epsilon_t}{1 - \epsilon_t}$.

    Normalise: $w_{t+1,i} \leftarrow \dfrac{w_{t+1,i}}{\sum_{j=1}^{n} w_{t+1,j}}$.
It can be seen that the weight of a correctly classified example decreases: the lower the error of the best hypothesis on the training set, the greater the decrease. The weight of a misclassified example remains the same. Therefore, after normalisation, misclassified examples will have a larger weight than correctly classified ones. The strong classifier is constructed with the hypothesis:
$$h(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$$

such that

$$h(x) \geq \frac{1}{2}\sum_{t=1}^{T} \alpha_t \text{ for positive examples,} \qquad h(x) < \frac{1}{2}\sum_{t=1}^{T} \alpha_t \text{ for negative examples,}$$

where $\alpha_t = \log\dfrac{1}{\beta_t}$.

$\frac{1}{2}\sum_{t=1}^{T} \alpha_t$ is known as the AdaBoost threshold, and is effectively the decision boundary for the strong hypothesis. The ranking can be established as:

$$R(x) = h(x) - \frac{1}{2}\sum_{t=1}^{T} \alpha_t.$$

[14] and [19] use 20 weak classifiers to derive the strong classifier.
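The boosting loop described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in [14] or [19]: the weak hypothesis is assumed to threshold a weighted two-class Gaussian fit on a single feature, the variance floor and the clamping of the error are added only for numerical safety, both classes are assumed to be present, and the names `adaboost`, `make_weak` and `rank_score` are hypothetical.

```python
import math

def fit_gaussian(values, weights):
    """Weighted mean and variance of one feature for one class."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, max(var, 1e-6)  # variance floor: an assumption of this sketch

def log_density(v, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)

def make_weak(X, y, w, j):
    """Weak hypothesis on feature j: 1 if the positive Gaussian is more likely."""
    pos = fit_gaussian([x[j] for x, t in zip(X, y) if t == 1],
                       [wi for wi, t in zip(w, y) if t == 1])
    neg = fit_gaussian([x[j] for x, t in zip(X, y) if t == 0],
                       [wi for wi, t in zip(w, y) if t == 0])
    return lambda x: 1 if log_density(x[j], *pos) >= log_density(x[j], *neg) else 0

def adaboost(X, y, T):
    """Return a list of (alpha_t, h_t) pairs, one per boosting round."""
    p = sum(y)           # number of positive examples
    q = len(y) - p       # number of negative examples
    w = [1 / (2 * p) if t == 1 else 1 / (2 * q) for t in y]
    chosen = []
    for _ in range(T):
        total = sum(w)
        w = [wi / total for wi in w]              # normalise to a distribution
        best = None
        for j in range(len(X[0])):                # one hypothesis per feature
            h = make_weak(X, y, w, j)
            err = sum(wi for wi, x, t in zip(w, X, y) if h(x) != t)
            if best is None or err < best[0]:
                best = (err, h)
        err, h = best
        err = min(max(err, 1e-10), 1 - 1e-10)     # keep beta finite (assumption)
        beta = err / (1 - err)
        # e_i = 0 (correct): weight is multiplied by beta; e_i = 1: unchanged
        w = [wi * (beta if h(x) == t else 1.0) for wi, x, t in zip(w, X, y)]
        chosen.append((math.log(1 / beta), h))
    return chosen

def rank_score(chosen, x):
    """R(x) = h(x) minus the AdaBoost threshold."""
    threshold = 0.5 * sum(a for a, _ in chosen)
    return sum(a * h(x) for a, h in chosen) - threshold
```

A positive `rank_score` places an image on the positive side of the AdaBoost threshold, and larger values rank it higher.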
2 Image Retrieval
2.1 Support Vector Classification
2.1.1 Motivation
In this problem, real vectors represent images that we try to classify as belonging, or not, to the same class as the positive examples provided by the user. The essence of Support Vector Machines (SVMs) is to classify vectors in real vector spaces, which makes them suitable for this problem. Furthermore, SVMs are capable of classifying linearly inseparable data, which could fit this problem well, since individual queries may not form linearly separable vector spaces. Tieu and Viola also suggest in [19] that an SVM is a competitive alternative to boosting for their feature generation method. They justify boosting as being more suitable because it selects a very small subset of features (20) that generate the least error when classifying the training data, and that subset is used for the classification of new images, i.e. feature selection and classification are done in one step. However, given this hint in the literature, it is interesting to consider SVMs as an alternative.
2.1.2 Introduction
SVMs were developed by Vapnik in 1979, but have become popular relatively recently due to good empirical performance. They have been successfully applied in the fields of handwriting recognition and text categorisation. The main aim of SVMs is to achieve good generalisation performance rather than good performance on the supplied training set alone. SVMs achieve this by implementing the Structural Risk Minimisation (SRM) criterion, which aims to minimise generalisation errors. In contrast, many other machine learning algorithms, such as Artificial Neural Networks, use Empirical Risk Minimisation (ERM), which reduces the error on the training data.

The core idea of the SVM is to map the input space into a higher-dimensional space and to optimise the classification function with respect to the SRM criterion. The SVM classification function is linear, but by means of a non-linear mapping into a higher-dimensional space a linear function can be used to classify linearly non-separable data sets. Such a mapping would normally be computationally expensive, but the linear classifier is constructed entirely from inner products of the data, so the transformation only needs to be applied to the inner products rather than explicitly to the entire input space. The other advantage of implicit mapping is that a space of infinite dimension can be used without falling into the trap of the curse of dimensionality.

The process of constructing the classifier can be explained informally as follows: pair-wise similarity between data is used by an oracle to tell us whether two data points are in the same class or not. The algorithm uses the oracle to compare new points to each training point, and outputs the label of the most similar point.
The algorithm is to satisfy the SRM criterion by deriving the maximally separating hyperplane between the positive and negative examples (the oracle) in the training data (see Fig. 2-1). The similarity measure is then the inner product, which we use to compute the distance of new data points from the hyperplane, and thus deduce to which class it belongs. The sign of the distance will tell us the class of the new data point, and absolute value of the distance can be interpreted as the certainty of the data point belonging to that class.
(2.1) (2.2)
where

$$w = \sum_i \alpha_i y_i x_i$$

where $x_i$ is the $i$th training point and $y_i = f(x_i)$. This is the dual form of the SVM. Another property of SVMs is that one can substitute the inner products with kernels in the dual form, and thus transform the input space implicitly.
Figure 2-1: Optimal Hyperplane and Maximum Margin, with the support vectors marked. Distance of the nearest vector to the hyperplane, d = 1
The optimal hyperplane is a hyperplane that separates all training data correctly, and data points either side of it are maximally distant from it compared to all other
possible correctly separating hyperplanes, i.e. its margin is maximised (see Figure 2-1). Let us consider the problem of separating the set of training points belonging to two separate classes, $D = \{(x_1, y_1), \ldots, (x_l, y_l)\}$, $x \in \mathbb{R}^n$, $y \in \{-1, 1\}$, with a canonical hyperplane $\langle w, x \rangle + b = 0$. The canonical form states that the norm of the weight vector should be equal to the inverse of the distance of the nearest point in the data set to the hyperplane. A separating hyperplane in canonical form must satisfy the constraints $y_i(\langle w, x_i \rangle + b) \geq 1$, which in words states that the hyperplane must separate the data correctly, and that the minimum distance between it and any data point should not be less than 1. This in effect defines the minimum width of the separating margin, and the training data are normalised so as to always satisfy this condition. The distance $d(w, b; x)$ of a data point $x$ from the hyperplane $(w, b)$ is

$$d(w, b; x) = \frac{|\langle w, x \rangle + b|}{\|w\|} \qquad (2.3)$$

The margin $\rho(w, b)$ is then

$$\rho(w, b) = \min_{x_i : y_i = -1} \frac{|\langle w, x_i \rangle + b|}{\|w\|} + \min_{x_i : y_i = 1} \frac{|\langle w, x_i \rangle + b|}{\|w\|} \qquad (2.4)$$

$$= \frac{1}{\|w\|}\left( \min_{x_i : y_i = -1} |\langle w, x_i \rangle + b| + \min_{x_i : y_i = 1} |\langle w, x_i \rangle + b| \right) = \frac{2}{\|w\|}. \qquad (2.5)$$

Thus to maximise the margin we must minimise the function

$$\frac{1}{2}\langle w, w \rangle = \frac{1}{2}\|w\|^2 \qquad (2.6)$$

subject to the constraint

$$y_i(\langle w, x_i \rangle + b) \geq 1. \qquad (2.7)$$
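This optimisation is a quadratic programming problem and can be solved using Lagrange multipliers and the dual form of SVMs. The Lagrangian functional of the problem is

$$\Phi(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i \left( y_i(\langle w, x_i \rangle + b) - 1 \right) \qquad (2.8)$$

and has to be minimised with respect to $w$, $b$ and maximised with respect to $\alpha \geq 0$. The optimisation problem becomes

$$\max_{\alpha} W(\alpha) = \max_{\alpha} \min_{w, b} \Phi(w, b, \alpha). \qquad (2.9)$$

The minimum of $\Phi$ with respect to $w$ and $b$ can be found by:

$$\frac{\partial \Phi}{\partial b} = 0 \Rightarrow \sum_{i=1}^{l} \alpha_i y_i = 0, \qquad \frac{\partial \Phi}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{l} \alpha_i y_i x_i. \qquad (2.10)$$

Hence the dual representation for this problem is

$$\max_{\alpha} W(\alpha) = \max_{\alpha} \left( \sum_{k=1}^{l} \alpha_k - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \right) \qquad (2.11)$$

and the solution is:

$$\alpha^* = \arg\min_{\alpha} \left( -\sum_{k=1}^{l} \alpha_k + \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \right) \qquad (2.12)$$

with constraints:

$$\alpha_i \geq 0 \quad \text{and} \quad \sum_{i=1}^{l} \alpha_i y_i = 0. \qquad (2.13)$$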
The terms $\alpha_i$ are called the Lagrange multipliers. A point $x_i$ which has a non-zero Lagrange multiplier $\alpha_i$ is a support vector of the margin. The optimal hyperplane is then given by

$$w^* = \sum_{i=1}^{l} \alpha_i y_i x_i \qquad (2.14)$$

$$b^* = -\frac{1}{2}\langle w^*, x_r + x_s \rangle \qquad (2.15)$$

where $x_r, x_s \in \{x_i\}_{i = 1..l}$ with $\alpha_r, \alpha_s > 0$, $y_r = -1$, $y_s = 1$.
As can be seen from the above equations, only support vectors are required in order to formulate the optimal hyperplane. If the data is linearly separable, all support vectors will lie on the margin. In such case the number of support vectors can be quite small and hence the hyperplane can be formulated with a small subset of the training data.
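To make the role of the support vectors concrete, the sketch below builds the hyperplane from a set of Lagrange multipliers using the expansions for $w^*$ and $b^*$ given above. It does not solve the quadratic programme: the multipliers for the toy data set are assumed to be known (they were worked out by hand for this example), and the function names are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def hyperplane_from_multipliers(alphas, ys, xs):
    """Build (w*, b*) from Lagrange multipliers, labels in {-1, 1} and points."""
    dim = len(xs[0])
    # w* is a weighted sum of training points; zero multipliers contribute nothing
    w = [sum(a * y * x[d] for a, y, x in zip(alphas, ys, xs)) for d in range(dim)]
    # pick one support vector (non-zero multiplier) from each class
    x_r = next(x for a, y, x in zip(alphas, ys, xs) if a > 0 and y == -1)
    x_s = next(x for a, y, x in zip(alphas, ys, xs) if a > 0 and y == 1)
    b = -0.5 * dot(w, [r + s for r, s in zip(x_r, x_s)])
    return w, b

def decision_value(w, b, x):
    """Sign gives the class; magnitude grows with distance from the hyperplane."""
    return dot(w, x) + b

# Toy data: the third point lies well outside the margin, so its multiplier is 0.
xs = [(-1.0, 0.0), (1.0, 0.0), (3.0, 1.0)]
ys = [-1, 1, 1]
alphas = [0.5, 0.5, 0.0]   # assumed known; satisfies sum(alpha_i * y_i) = 0
w, b = hyperplane_from_multipliers(alphas, ys, xs)
```

Only the two points with non-zero multipliers determine `w` and `b`; removing the third point from the training set would leave the hyperplane unchanged.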
The following kernels are often used in SVMs:

Linear:

$$K(x, y) = \langle x, y \rangle$$

The linear kernel is the inner product itself, thus there will be no implicit transformation of the training data and the algorithm will only work well with linearly separable data.

Gaussian Radial Basis (width $\sigma$):

$$K(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right)$$

Classical machine learning algorithms which have utilised Radial Basis Functions rely on methods of determining a subset of centres of data clusters. In SVMs, this selection is implicit, with each support vector generating one local Gaussian function, centred at that vector. A global basis function width is then found by deriving the maximum margin in the transformed space.

Polynomial (order $d$):

$$K(x, y) = (\langle x, y \rangle + b)^d$$

This method is popular for non-linear fitting, and different orders can be good depending on the nature of the problem.

It can be shown that all of the above can be expressed as $\langle \phi(x_1), \phi(x_2) \rangle$.
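The three kernels can be written down directly, as in the sketch below. The function names and default parameter values are illustrative assumptions. The last function is an explicit feature map for the order-2 polynomial kernel with $b = 1$ on two-dimensional inputs, included only to illustrate the claim that a kernel is an inner product in a transformed space.

```python
import math

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def gaussian_rbf_kernel(x, y, sigma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))

def polynomial_kernel(x, y, b=1.0, d=2):
    return (linear_kernel(x, y) + b) ** d

def phi_poly2(x):
    """Explicit map for (<x, y> + 1)^2 when x has exactly two components."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return (x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0)
```

For two-dimensional inputs, `polynomial_kernel(x, y)` and `linear_kernel(phi_poly2(x), phi_poly2(y))` agree up to rounding, but the kernel never constructs the six-dimensional vectors.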
This technique was not evaluated in this study, as it required extensive modifications of the existing SVM software and the time constraints did not allow them to be undertaken. The authors claim that the one-class SVM with a Radial Basis Gaussian kernel performs well; however, they do not provide the details of their experimental set-up or an extensive comparison with other methods. It would be interesting to validate this approach on Tieu and Viola's feature vectors and compare its performance to the standard two-class SVM.
The first requirement is obvious: we want to train the classifier on well-separated data so that it is able to discriminate positive images from others with high confidence. However, it is necessary to satisfy the second requirement because we wish to provide the classifier with maximal information about the diversity of the image database, so that it is able to discriminate true positives from as many different negative images as possible. This requirement is one of the reasons that necessitate such a large number of random negative images for training in [14].

The Gram-Schmidt algorithm provides an analytical solution to these requirements for real valued vectors. Given a set of vectors, the Gram-Schmidt algorithm will select a subset of vectors that are least co-linear to each other. If we assume that the vectors of similar images are more co-linear than those of dissimilar ones, or in other words, that the data points of similar images are clustered together, then Gram-Schmidt will select the most diverse subset of images from a given set.

The adapted version of this algorithm for this problem is as follows. Let $P$ be the number of positive images, and $N$ be the total number of both positive and negative images. Then $\{v_n\}$ s.t. $0 \leq n \leq P$ are the feature vectors of positive images and $\{v_n\}$ s.t. $P < n \leq N$ are the feature vectors of negative images. We start building the orthonormal basis by processing the positive examples iteratively; we have no choice here and must incorporate all positive examples into the basis.

$$e_0 = \frac{v_0}{\|v_0\|} \qquad (2.16)$$

$$e_{n+1} = \frac{v_{n+1} - \sum_{i=0}^{n} \langle v_{n+1}, e_i \rangle e_i}{\left\| v_{n+1} - \sum_{i=0}^{n} \langle v_{n+1}, e_i \rangle e_i \right\|} \qquad (2.17)$$

for all $0 \leq n \leq P$. At each iteration, $e_{n+1}$ is added to the basis. After iterating over all positive images, the basis contains projections of all positive feature vectors, all orthonormal to each other, due to (2.17). Next we start adding negative examples iteratively to the basis:
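$$v_{n+1} = \arg\max_{v_j} \left\| v_j - \sum_{i=0}^{n} \langle v_j, e_i \rangle e_i \right\| \qquad (2.18)$$

$$e_{n+1} = \frac{v_{n+1} - \sum_{i=0}^{n} \langle v_{n+1}, e_i \rangle e_i}{\left\| v_{n+1} - \sum_{i=0}^{n} \langle v_{n+1}, e_i \rangle e_i \right\|} \qquad (2.19)$$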
The norm in (2.18) is the length of the component of a candidate vector that is orthogonal to the current basis; we maximise it in (2.18) to select the new feature vector of a negative image, $v_{n+1}$, that is least co-linear to the projections of the chosen vectors already in the basis. We add orthonormal projections of positive examples to the basis in (2.17) so that the negative examples selected later are also maximally distant from the positive examples.

This approach has been used in [17] for selecting negative examples in text document retrieval. The problem in [17] has a similar set-up to this one: the training set has 4 positive examples and there is an abundance of documents from which negative examples can be chosen.

Owing to the inner product form of Gram-Schmidt, it is also possible to introduce kernels and make the algorithm work in higher-dimensional spaces. This feature of Gram-Schmidt could be beneficial for combining the algorithm with SVMs, as the negative example selection is then carried out in the same space in which SVM training takes place, provided the same kernel is used in both. The basic version of Gram-Schmidt was implemented, but it was very slow when selecting a sufficiently large number of negative examples, so it was not included in the large-scale tests.
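The selection procedure of (2.16) to (2.19) can be sketched as follows. This is an illustrative reading of the equations rather than the implementation used in the study: it works directly on the inner product (no kernel), skips vectors that are numerically dependent on the basis (an added safeguard), and the function names are hypothetical. Each round rescans all remaining candidates, which is consistent with the slowness noted above.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def residual(v, basis):
    """Component of v orthogonal to every vector of the orthonormal basis."""
    r = list(v)
    for e in basis:
        c = dot(v, e)
        r = [ri - c * ei for ri, ei in zip(r, e)]
    return r

def select_negatives(positives, negatives, k):
    """Return indices of up to k negatives, least co-linear with the basis first."""
    basis = []
    for v in positives:                      # all positives enter the basis
        r = residual(v, basis)
        length = norm(r)
        if length > 1e-12:                   # skip dependent vectors (assumption)
            basis.append([ri / length for ri in r])
    chosen = []
    candidates = list(range(len(negatives)))
    while len(chosen) < k and candidates:
        # the candidate with the largest orthogonal component is selected
        best = max(candidates, key=lambda j: norm(residual(negatives[j], basis)))
        r = residual(negatives[best], basis)
        length = norm(r)
        if length <= 1e-12:                  # nothing new left to add
            break
        basis.append([ri / length for ri in r])
        chosen.append(best)
        candidates.remove(best)
    return chosen
```

For example, with one positive vector along the first axis, a negative vector along that same axis is never preferred over one pointing in a new direction.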
3 Feature Selection
3.1 Motivation
The selection of relevant features, and the elimination of irrelevant ones, is one of the major problems in machine learning. The success of machine learning algorithms usually depends on the quality of the data on which they operate. If the data contains redundant or irrelevant features, the learning algorithm may produce a less accurate or a less understandable result. Feature selection attempts to identify and remove as much irrelevant and redundant information as possible prior to learning. Learning on a reduced number of features often benefits from an increase in classification accuracy and learning speed.

The formal description of the feature selection process is as follows. A data instance is described to the system as an assignment of values $f = (f_1, \ldots, f_n)$ to a set of features $F = (F_1, \ldots, F_n)$. Given $G$ as a subset of $F$, instead of using $f$ we shall use $f_G$, which is the projection of the feature vector $f$ onto the variables in $G$. We wish to find a $G$ for which the classification performance of our machine learning algorithm improves.
For Viola feature vectors, feature selection may be of interest to us. On the one hand, the feature vectors have an extremely high number of dimensions, which can complicate the training process; on the other, the successive filter convolution nature of the feature generation process suggests that there can be many features which are highly correlated and can therefore be made redundant. Let us look at an extreme case from an informal perspective. Consider a convolution filter which, for a given set of images, produces a feature map that sums to zero on the red, green and blue channels. A subsequent convolution with any of the 25 filters will also produce a feature map that sums to zero, and therefore the third convolution with any of the filters will also produce zero. This implies that there are 3 × 25 × 25 = 1875 features which are correlated with the result of convolving just one filter with the image, and those features are therefore redundant. Obviously we would like to detect such a correlation, and leave only one feature that relates to the application of this particular filter for this set of images.

The curse of dimensionality is another motivation for reducing the number of generated features for images. It turns out that any two randomly picked feature vectors in a high-dimensional space will tend to have a constant distance from each other, no matter what the distance measure is, provided they are independent of each other. [16] gives a justification of this phenomenon on a distribution of text documents for text retrieval purposes. This means that even if most of the features are not correlated, the task of a classifier could be complicated by the fact that the distances between positive examples are similar to the distances between positive and negative examples. Again, feature selection could alleviate this problem.
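The distance concentration effect can be illustrated with a small simulation. The sketch below is not taken from [16]; it assumes independent, uniformly distributed components and the Manhattan distance, and the function name and parameters are hypothetical. It reports the standard deviation of the pairwise distances relative to their mean, which shrinks as the number of dimensions grows.

```python
import math
import random

def relative_spread(dim, n_points=50, seed=0):
    """Std/mean of pairwise Manhattan distances between random points."""
    rng = random.Random(seed)              # fixed seed keeps the run repeatable
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    distances = [
        sum(abs(a - b) for a, b in zip(p, q))
        for i, p in enumerate(points)
        for q in points[i + 1:]
    ]
    mean = sum(distances) / len(distances)
    variance = sum((d - mean) ** 2 for d in distances) / len(distances)
    return math.sqrt(variance) / mean

low = relative_spread(2)       # distances vary widely in two dimensions
high = relative_spread(1000)   # distances are nearly constant in 1000 dimensions
```

Under these assumptions the relative spread falls roughly with the square root of the number of dimensions, so in a space of tens of thousands of features almost all pairs of independent vectors look equally far apart.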
Occam's razor dictates that we should always seek a simpler hypothesis to fit our data, and this also applies to the number of features we provide for our data instances. Lastly, there is a practical consideration: running the SVM on the full feature set causes the software we use to run out of memory. For our problem, we would like to select a subset of features based on the training images, and then classify the rest of the database using only that subset. We hope that the training images provide enough information to determine which features help us differentiate positive images from negative ones for this query, and that this feature selection will improve the classification power of the algorithm.
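For a discrete variable $Y$ and the corresponding class $X$, the entropy of the variable $Y$ is measured as

$$H(Y) = -\sum_{i=1}^{c} p_i(Y) \log_2(p_i(Y)) \qquad (3.1)$$

where $c$ is the number of discrete states the variable $Y$ can have, and $p_i(Y)$ is the proportion of $Y$ being in state $i$ across the dataset. The entropy of the variable $Y$ after observing $X$ is measured as

$$H(Y|X) = -\sum_{i=1}^{c} p_i(X) \sum_{j=1}^{c} p_j(Y | X = i) \log_2(p_j(Y | X = i)) \qquad (3.2)$$

where the outer sum runs over the states of $X$ and the inner sum over the states of $Y$. The Information Gain is the resulting reduction in entropy:

$$\text{InformationGain} = H(Y) - H(Y|X). \qquad (3.3)$$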
Variables that obtain a higher gain score bear more information about predicting the class and are therefore considered more useful. In the case where Y is a continuous feature, it is converted into discrete by binning (the process known as discretisation). The features are ranked by the amount of information they provide about the class.
Gain Ratio is a modified version of the Information Gain measure, and tells us the information gain of the variable relative to the entropy of the class:

$$\text{GainRatio} = \frac{\text{InformationGain}}{H(X)}. \qquad (3.4)$$

Symmetrical Uncertainty is the measure of mutual prediction ability between the variables:

$$\text{SymmetricalUncertainty} = \frac{2 \times \text{InformationGain}}{H(X) + H(Y)}. \qquad (3.5)$$
Both Gain Ratio and Symmetrical Uncertainty return values between 0 and 1. A value of 0 for either measure implies that there is no correlation between X and Y. A value of 1 for the Gain Ratio implies that Y completely predicts X. A value of 1 for Symmetrical Uncertainty implies that X and Y completely predict each other. Both algorithms rank features according to the values they produce; a higher value indicates a higher rank. It is interesting to consider and compare these three variations of the Information Gain concept on the feature vectors, and how they improve the learning performance of the SVM. They are evaluated in the experimental part of this study.
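The three measures can be computed from discretised data as in the sketch below, where `y` holds the (binned) feature values and `x` the class labels. The function names are hypothetical and this is not the Weka implementation. Symmetrical Uncertainty is computed here with the standard factor of 2 in the numerator, which is what lets it reach 1 when the two variables predict each other completely; the class is assumed to take more than one value, otherwise $H(X) = 0$ and the ratios are undefined.

```python
import math
from collections import Counter

def entropy(values):
    """H of a discrete variable, in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(y, x):
    """H(Y | X): entropy of y within each state of x, weighted by p(x)."""
    n = len(x)
    total = 0.0
    for state, count in Counter(x).items():
        subset = [yi for yi, xi in zip(y, x) if xi == state]
        total += (count / n) * entropy(subset)
    return total

def information_gain(y, x):
    return entropy(y) - conditional_entropy(y, x)

def gain_ratio(y, x):
    return information_gain(y, x) / entropy(x)

def symmetrical_uncertainty(y, x):
    return 2 * information_gain(y, x) / (entropy(x) + entropy(y))
```

A feature whose bins coincide with the classes scores 1 on both normalised measures, while a feature whose bins are spread evenly across the classes scores 0.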
3.2.2 ReliefF
ReliefF [11] is a feature estimation algorithm, and is a modified version of the Relief algorithm [9]. The original Relief algorithm estimates the quality of features according to how well their values distinguish among instances that are near each other. For that purpose
Relief searches, for a given instance, for its two nearest neighbours: one from the same class (called the nearest hit) and the other from a different class (called the nearest miss). The quality measure of feature F is:

    w(F) = P(different value of F | nearest instance from different class)
         - P(different value of F | nearest instance from the same class)

Intuitively it is a very simple calculation, which measures the sensitivity of the class to a change in F's value. The more sensitive it is, the more important is the feature F. The training algorithm for Relief is as follows:

    set all weights W(F) := 0;
    for i := 1 to m do
    begin
        randomly select an instance R;
        find nearest hit H and nearest miss M;
        for F := 1 to all attributes do
            W(F) := W(F) - diff(F, R, H) / m + diff(F, R, M) / m        (3.6)
    end;
where diff(Feature, Instance1, Instance2) calculates the difference between the values of Feature for two instances. For real valued features diff returns the actual difference, normalised to the range [0,1]. m is the number of training examples the algorithm evaluates before producing a ranking value for each attribute. A simplification that is often used is to set m to the number of training examples and run the loop over all training instances.

The original Relief algorithm uses only one neighbour from each class to compute the probabilities. The first extension to Relief is the use of k nearest neighbours for both classes. This was intended to overcome the effect of noisy and erroneous data, from which the original algorithm suffered. The extended version of the algorithm, called RELIEF-A, averages the contribution of the k nearest hits/misses. Further extensions to the algorithm are the ability to cope with missing attribute values and with multiple classes. Neither is of interest to us in this problem: there are no missing features, since feature vectors are generated artificially, and there are at most two classes for learning.

What differentiates this method from the methods described in 3.2.1 is the ability to cope with features that are highly interdependent. Information Gain based algorithms are good at detecting irrelevant features, but cannot pinpoint features which are relevant but redundant [11]. ReliefF could be useful for finding good features for SVM learning, if the intuitions about the high degree of correlation in Viola feature vectors discussed in 3.1 do turn out to be true. It is evaluated in the experimental part of this study.
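The basic Relief update can be sketched as below. This is an illustration of the original single-neighbour algorithm with the common simplification of looping once over every training instance, not of ReliefF with k neighbours and not the Weka code. It assumes real-valued features, at least two instances per class, and a neighbour search based on the sum of normalised differences; the function name is hypothetical.

```python
def relief(X, y):
    """Return one weight per feature; higher means more useful for the class."""
    n, n_feat = len(X), len(X[0])
    lo = [min(x[f] for x in X) for f in range(n_feat)]
    hi = [max(x[f] for x in X) for f in range(n_feat)]

    def diff(f, a, b):
        # difference on feature f, normalised to [0, 1] by the feature's range
        span = hi[f] - lo[f]
        return abs(a[f] - b[f]) / span if span > 0 else 0.0

    def distance(a, b):
        # neighbour search uses the sum of normalised differences (assumption)
        return sum(diff(f, a, b) for f in range(n_feat))

    m = n                                  # loop over all training instances
    W = [0.0] * n_feat
    for r in range(m):
        R = X[r]
        hits = [X[k] for k in range(n) if k != r and y[k] == y[r]]
        misses = [X[k] for k in range(n) if y[k] != y[r]]
        H = min(hits, key=lambda x: distance(R, x))     # nearest hit
        M = min(misses, key=lambda x: distance(R, x))   # nearest miss
        for f in range(n_feat):
            W[f] += (diff(f, R, M) - diff(f, R, H)) / m
    return W
```

On a toy set where the first feature separates the classes and the second does not, the first feature receives a positive weight and the second a negative one.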
3.2.3 CFS
Correlation-Based Feature Selection (CFS) [5], by Mark Hall, is a heuristic algorithm which evaluates the quality, or the merit, of a subset of features. The heuristic function takes into account the usefulness of individual features for predicting the class of the data item, and the level of inter-correlation between the features themselves. CFS is based on the hypothesis that good feature subsets contain features highly correlated with the class, yet uncorrelated with each other. The formal basis of the algorithm is:

$$\text{Merit}_s = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k - 1)\,\overline{r_{ff}}}} \qquad (3.7)$$

where $\text{Merit}_s$ is the quality measure of the subset $s$, $k$ is the number of features, $\overline{r_{cf}}$ is the average feature-class correlation and $\overline{r_{ff}}$ is the average feature-feature correlation. In (3.7), the numerator can be considered a measure of how predictive this subset is, and the denominator a measure of redundancy within the subset. Equation (3.7) is a modified version of Pearson's Correlation Coefficient, often used in statistics. This measure handles irrelevant features, as they are poor predictors of the class, and redundant features, since they are highly correlated with one another. For continuous data the measure of correlation between two features is

$$r_{XY} = \frac{\sum xy}{n\,\sigma_x \sigma_y} \qquad (3.8)$$

where $X$ and $Y$ are two continuous variables expressed in terms of deviation. This is the standard linear correlation formula due to Pearson.

This method is different from 3.2.1 and 3.2.2, since it evaluates entire subsets of features rather than ranking them individually. This is more computationally expensive, since for $n$ possible features there are $2^n$ possible subsets. CFS uses a heuristic search approach to search the space of subsets. It first constructs feature-feature and feature-class correlation matrices from the training data, and then searches them using a best-first search mechanism.

In our case, this method is also expensive in terms of storage. The feature-feature correlation matrix is of size $n^2$. If we consider Viola feature vectors, the size required to keep the feature-feature correlation matrix is 46875 × 46875 × 4 bytes (the size of a floating point number) = 8789062500 bytes ≈ 8.8 Gigabytes. Since the correlation values are symmetrical, we can keep only the lower triangular half of the matrix, which reduces the required space to about 4.4 Gigabytes.
However, this reduction is still not sufficient to enable the algorithm to operate from main memory on most computers; the correlation matrix must be cached onto the hard disk, with sophisticated data access functions implemented for fast access to any part of the matrix. Given the time constraints of this study, implementation of CFS was deemed impractical. However,
CFS is an interesting choice for future research, since it can give us a quantitative measure of how inter-correlated the features are in Viola feature vectors.
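The merit heuristic (3.7) and the storage estimate for the correlation matrix are easy to check numerically. The sketch below is illustrative only; the function names are hypothetical, and the average correlations are passed in as plain numbers rather than computed from data.

```python
import math

def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Merit of a subset of k features, following (3.7)."""
    numerator = k * mean_feature_class_corr
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator

def correlation_matrix_bytes(n_features, bytes_per_value=4, triangular=False):
    """Storage needed for the feature-feature correlation matrix."""
    if triangular:
        cells = n_features * (n_features + 1) // 2   # lower triangle with diagonal
    else:
        cells = n_features * n_features
    return cells * bytes_per_value

full = correlation_matrix_bytes(46875)                   # 8789062500 bytes
half = correlation_matrix_bytes(46875, triangular=True)  # 4394625000 bytes
```

With four features that each correlate 0.5 with the class, the merit is 1.0 when the features are mutually uncorrelated and drops to 0.5 when they are perfectly correlated with each other, which is the redundancy penalty the heuristic is designed to apply.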
4 Experimental Set-up
4.1 Retrieval Methods
The experimental part of this study was set up so as to be comparable to the results published in [14]. The same images, feature vectors and categories were used to compute equivalent mean average precision values. The core learning algorithm was chosen to be the two-class SVM, as described in 3.2.2. Three different kernels were applied to the SVM for evaluation: Linear, Polynomial of order 2, and Radial Basis Gaussian with σ = 1 (see 2.1.5). The software used for SVM classification was the Java implementation of SVMlight by Matthew Pocock.

Five different feature selection methods were evaluated for preprocessing the training data before SVM learning: Information Gain, Gain Ratio and Symmetrical Uncertainty, as in 3.2.1, with the training examples discretised, and ReliefF, as in 3.2.2, with k = 10 neighbours; these four methods, from the Weka 3.2.2 Java libraries, were interfaced with the SVM software. Finally, random feature selection was used as a contrast to the other four feature selection methods. Each of the feature selection algorithms was evaluated on feature subsets of the 1000 and the 100 best features to keep. Thus, the following thirty combined methods were prepared for evaluation:
| Kernel | Feature Selection | No. Best Features Kept |
|---|---|---|
| Linear | InfoGain | 100 |
| Linear | InfoGain | 1000 |
| Linear | GainRatio | 100 |
| Linear | GainRatio | 1000 |
| Linear | SymmetricalUncertainty | 100 |
| Linear | SymmetricalUncertainty | 1000 |
| Linear | ReliefF | 100 |
| Linear | ReliefF | 1000 |
| Linear | Random | 100 |
| Linear | Random | 1000 |
| Polynomial o. 2 | InfoGain | 100 |
| Polynomial o. 2 | InfoGain | 1000 |
| Polynomial o. 2 | GainRatio | 100 |
| Polynomial o. 2 | GainRatio | 1000 |
| Polynomial o. 2 | SymmetricalUncertainty | 100 |
| Polynomial o. 2 | SymmetricalUncertainty | 1000 |
| Polynomial o. 2 | ReliefF | 100 |
| Polynomial o. 2 | ReliefF | 1000 |
| Polynomial o. 2 | Random | 100 |
| Polynomial o. 2 | Random | 1000 |
| Gaussian Radial Basis σ = 1 | InfoGain | 100 |
| Gaussian Radial Basis σ = 1 | InfoGain | 1000 |
| Gaussian Radial Basis σ = 1 | GainRatio | 100 |
| Gaussian Radial Basis σ = 1 | GainRatio | 1000 |
| Gaussian Radial Basis σ = 1 | SymmetricalUncertainty | 100 |
| Gaussian Radial Basis σ = 1 | SymmetricalUncertainty | 1000 |
| Gaussian Radial Basis σ = 1 | ReliefF | 100 |
| Gaussian Radial Basis σ = 1 | ReliefF | 1000 |
| Gaussian Radial Basis σ = 1 | Random | 100 |
| Gaussian Radial Basis σ = 1 | Random | 1000 |

Figure 4-1 Retrieval methods
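The thirty combinations are simply the product of three kernels, five feature selection methods and two subset sizes. A minimal sketch of how such a grid might be enumerated is given below; the constant and function names are hypothetical and are not taken from the software used in the study.

```python
from itertools import product

KERNELS = ["Linear", "Polynomial o. 2", "Gaussian Radial Basis"]
SELECTORS = ["InfoGain", "GainRatio", "SymmetricalUncertainty", "ReliefF", "Random"]
FEATURES_KEPT = [100, 1000]

def retrieval_methods():
    """Enumerate every (kernel, feature selection, features kept) combination."""
    return list(product(KERNELS, SELECTORS, FEATURES_KEPT))

methods = retrieval_methods()
```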
For each category, there were 25 n-image queries, n ranging from 1 to 5, with 5 queries for each value of n. Each query had a set of 100 negative images to go with it, picked randomly, but so as to not contain images from the same category.
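The construction of a single query can be sketched as follows. The sketch assumes that the database is available as a list of (image id, category) pairs and that the positive examples are themselves drawn at random from the query category; both are assumptions made for illustration, and the function name is hypothetical.

```python
import random

def build_query(images, category, n_positive, n_negative=100, rng=None):
    """Return (positive ids, negative ids) for one query on the given category."""
    rng = rng or random.Random()
    same = [image_id for image_id, cat in images if cat == category]
    other = [image_id for image_id, cat in images if cat != category]
    positives = rng.sample(same, n_positive)
    # negatives are picked at random but never from the query category
    negatives = rng.sample(other, n_negative)
    return positives, negatives
```

Repeating this five times for each n from 1 to 5 would give the 25 queries per category described above.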
4.3 Tests
Methods in Figure 4-1 were run on the queries in 4.2, and a MAP value was calculated for each method using the same calculation tools as in [14]. The methods were ranked according to their MAP values, and neighbouring methods in the ranking table were compared pairwise using a paired, one-sided T-Test with the confidence level α = 0.05 (where α is the probability of the two samples coming from the same distribution). The two top-ranking methods which were significantly different according to the T-Tests, and which had different feature selection algorithms, were then evaluated on the full database, as in [14]. The motivation behind using T-Tests is that methods with different MAP values may come from the same distribution, and thus may not be worth treating separately.
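The statistic behind a paired T-Test is straightforward to compute from the per-query scores of two methods, as the sketch below shows. It is an illustration rather than the tool used in the study: the function name is hypothetical, and the conversion of the statistic to a one-sided p-value is left out because it needs the Student t distribution with n − 1 degrees of freedom, which the Python standard library does not provide (a statistics package such as SciPy would normally be used for that step).

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic for paired samples a and b (same queries, two methods)."""
    differences = [x - y for x, y in zip(a, b)]
    standard_error = stdev(differences) / math.sqrt(len(differences))
    return mean(differences) / standard_error

# Hypothetical average precision values of two methods on the same four queries.
method_a = [0.5, 0.6, 0.7, 0.9]
method_b = [0.4, 0.4, 0.6, 0.7]
t = paired_t_statistic(method_a, method_b)
```

Pairing the samples by query removes the variation between easy and hard queries, so the test responds only to the consistent difference between the two methods.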
5 Results
5.1 Subsampled MAP
5.1.1 Figures and Observations
Mean average precision figures were calculated on a reduced set of queries as specified in 4.2.
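For readers unfamiliar with the measure, the sketch below shows the standard (non-interpolated) definition of average precision and its mean over queries. It is an assumption that the calculation tools of [14] follow this definition; the function names and the toy rankings are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at each rank where a relevant image occurs."""
    relevant = set(relevant_ids)
    hits = 0
    precision_sum = 0.0
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this rank
    # Relevant images that were never retrieved contribute zero.
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two toy queries (hypothetical image identifiers).
runs = [
    (["a", "x", "b", "y"], {"a", "b"}),   # relevant at ranks 1 and 3
    (["x", "c"], {"c"}),                  # relevant at rank 2
]
print(mean_average_precision(runs))
```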
Feature Selection    Linear    Polynomial (order 2)    Radial Basis (parameter = 1)
Gain Ratio 100       0.0624    0.0709                  0.0464
Gain Ratio 1000      0.0857    0.0928                  0.0486
Info Gain 100        0.0608    0.0716                  0.0470
Info Gain 1000       0.0850    0.0930                  0.0488
Random 100           0.0799    0.0983                  0.0783
Random 1000          0.0963    0.1130                  0.0827
ReliefF 100          0.0530    0.0654                  0.0536
ReliefF 1000         0.0830    0.0946                  0.0646
S. Unc. 100          0.0614    0.0712                  0.0464
S. Unc. 1000         0.0847    0.0920                  0.0487
Subsampled MAP for each kernel and feature selection method.
It is possible to make a number of observations from these results.
1) The Polynomial kernel has the best performance for any feature selection method.
2) The Linear kernel almost always outperforms the Radial Basis Function kernel.
3) Keeping 1000 features gives significantly better performance than 100 features, except for Radial Basis Function kernels.
4) For equal kernels and number of features, all information-based feature selection methods have almost identical performance.
5) Random feature selection has the best performance!
To test observations 1 and 2, T-Tests were used to compare the performance of kernels:
Method                                                    T-Test (1 Tailed, Paired)
Polynomial Random 1000 vs. Linear Random 1000             7.88963E-11
Polynomial Random 100 vs. Linear Random 100               2.82209E-10
Polynomial Info Gain 1000 vs. Linear Info Gain 1000       2.85897E-09
Polynomial Info Gain 100 vs. Linear Info Gain 100         1.66131E-08
Polynomial Gain Ratio 1000 vs. Linear Gain Ratio 1000     1.36616E-06
Polynomial Gain Ratio 100 vs. Linear Gain Ratio 100       1.03041E-06
Polynomial Sym. Unc. 1000 vs. Linear Sym. Unc. 1000       3.26793E-07
Polynomial Sym. Unc. 100 vs. Linear Sym. Unc. 100         2.75001E-07
Polynomial ReliefF 1000 vs. Linear ReliefF 1000           1.90379E-07
Polynomial ReliefF 100 vs. Linear ReliefF 100             1.56638E-08

Method                                                    T-Test (1 Tailed, Paired)
Linear Random 1000 vs. Radial Random 1000                 0.000133668
Linear Random 100 vs. Radial Random 100                   0.27543914
Linear Info Gain 1000 vs. Radial Info Gain 1000           2.98551E-23
Linear Info Gain 100 vs. Radial Info Gain 100             4.53236E-08
Linear Gain Ratio 1000 vs. Radial Gain Ratio 1000         1.23188E-22
Linear Gain Ratio 100 vs. Radial Gain Ratio 100           4.47466E-10
Linear Sym. Unc. 1000 vs. Radial Sym. Unc. 1000           1.59742E-22
Linear Sym. Unc. 100 vs. Radial Sym. Unc. 100             4.94676E-09
Linear ReliefF 1000 vs. Radial ReliefF 1000               6.03545E-09
Linear ReliefF 100 vs. Radial ReliefF 100                 0.396701174
From Table 5-2 we can see that in all cases, distributions produced by the polynomial kernel are significantly different from those produced by the linear kernel, and in most cases distributions produced by the linear kernel are significantly different from those produced by the radial basis kernel. Only for Linear Random 100 vs. Radial Random 100 and for Linear ReliefF 100 vs. Radial ReliefF 100 is the probability greater than 0.05. Therefore we can trust observations 1 and 2. To test observation 3, a further set of T-Tests was carried out:
Method                                                      T-Test (1 Tailed, Paired) probability
Linear Random 1000 vs. Linear Random 100                    2.03322E-10
Linear Info Gain 1000 vs. Linear Info Gain 100              4.02901E-22
Linear Gain Ratio 1000 vs. Linear Gain Ratio 100            5.72288E-21
Linear Sym. Unc. 1000 vs. Linear Sym. Unc. 100              1.78708E-20
Linear ReliefF 1000 vs. Linear ReliefF 100                  4.53313E-23
Polynomial Random 1000 vs. Polynomial Random 100            1.73075E-07
Polynomial Info Gain 1000 vs. Polynomial Info Gain 100      1.33139E-20
Polynomial Gain Ratio 1000 vs. Polynomial Gain Ratio 100    2.75735E-19
Polynomial Sym. Unc. 1000 vs. Polynomial Sym. Unc. 100      3.28553E-18
Polynomial ReliefF 1000 vs. Polynomial ReliefF 100          2.63859E-25
Radial Random 1000 vs. Radial Random 100                    0.071752436
Radial Info Gain 1000 vs. Radial Info Gain 100              0.120533657
Radial Gain Ratio 1000 vs. Radial Gain Ratio 100            0.063508008
Radial Sym. Unc. 1000 vs. Radial Sym. Unc. 100              0.052840906
Radial ReliefF 1000 vs. Radial ReliefF 100                  5.5E-08
The T-Tests support observation 3. Only Radial Basis Function kernel methods have a probability greater than 0.05 (all of them except ReliefF). Finally, to confirm observation 4 and to choose methods for the full evaluation, the methods were ranked by MAP value, and a paired T-Test was carried out between neighbouring methods in the table. For example, the T-Test value in the first row is the pairwise comparison between the first method and the second one, and so on.
Kernel Type    Feature Selection    Subsampled MAP    T-Test (1 Tailed, Paired)
Polynomial     Random 1000          0.1130376         1.73075E-07
Polynomial     Random 100           0.09826416        0.241013899
Linear         Random 1000          0.09629664        0.277284224
Polynomial     ReliefF 1000         0.09460992        0.288952767
Polynomial     Info Gain 1000       0.0929648         0.424023855
Polynomial     Gain Ratio 1000      0.0927824         0.139036838
Polynomial     S. Unc. 1000         0.09198096        4.00858E-05
Linear         Gain Ratio 1000      0.08566576        0.275050518
Linear         Info Gain 1000       0.08503744        0.353589525
Linear         S. Unc. 1000         0.08468032        0.272101682
Linear         ReliefF 1000         0.08301648        0.466369424
Radial         Random 1000          0.08266976        0.223117191
Linear         Random 100           0.07994864        0.27543914
Radial         Random 100           0.07826496        0.014443636
Polynomial     Info Gain 100        0.07160336        0.352347942
Polynomial     S. Unc. 100          0.07119216        0.339434996
Polynomial     Gain Ratio 100       0.0708768         0.009783152
Polynomial     ReliefF 100          0.06542416        0.364811636
Radial         ReliefF 1000         0.06457632        0.226515422
Linear         Gain Ratio 100       0.06244496        0.108486103
Linear         S. Unc. 100          0.06140352        0.348246139
Linear         Info Gain 100        0.06082768        0.000955275
Radial         ReliefF 100          0.05359264        0.396701174
Linear         ReliefF 100          0.05298544        0.064342801
Radial         Info Gain 1000       0.04879984        0.417843752
Radial         S. Unc. 1000         0.04866288        0.32984478
Radial         Gain Ratio 1000      0.04855104        0.171238465
Radial         Info Gain 100        0.04696896        0.313587742
Radial         Gain Ratio 100       0.04641696        0.47824034
Radial         S. Unc. 100          0.0463928
From Table 5-4 we can see that the difference between Information Gain, Gain Ratio and Symmetrical Uncertainty is insignificant, i.e. for each combination of kernel and number of selected features, the T-Test probability is greater than 0.05. Hence we can trust observation 4. From this table, we also select the two top methods that are significantly different for full evaluation. The SVM with the Polynomial kernel of order 2 and random selection of 1000 features came top. The next significantly different method happens to be the Polynomial kernel of order 2 with 100 random features. We thought that it would not be interesting to make a full comparison of the same feature selection method with different numbers of features, so we instead decided to discard it and find the best method with a different feature selection method. That method is the Polynomial kernel of order 2 with 1000 features selected by ReliefF. The evaluation of these two methods is presented in the next section.
5.1.2 Hypotheses
We can form some hypotheses based on the above observations:
Observation 1 suggests that queries of images taken from the same category require a non-linear classifier for more effective retrieval. However, it is not clear what order of the polynomial function is most suitable, or whether there are other non-linear kernels which work better, so this area must be researched further.
Observation 2 tells us little other than that both the Linear and the Radial Basis Function kernels are less suitable. The Radial Basis Function kernel has a parameter (set to 1 in these tests), the choice of which could affect the performance of this kernel, so different values should be investigated.
Observation 3 does not tell us much about how many features we should keep. In this set of tests, we compared 1000 features to 100, and selecting 1000 features gives better performance in most cases. However, it is not clear what the optimal number of features to keep is. It could be between 100 and 1000, greater than 1000 or less than 100. This area must therefore be investigated further.
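For reference, the three kernel families compared in observations 1 and 2 can be written in their common textbook forms, as sketched below. The exact parametrisation used by the SVM implementation in this study is not stated here, so the `+ 1` offset in the polynomial kernel and the `sigma` form of the radial basis kernel are assumptions, as are the function names.

```python
import math

def linear_kernel(x, y):
    """Plain dot product."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, order=2):
    """(x . y + 1) ** order; the +1 offset is an assumed convention."""
    return (linear_kernel(x, y) + 1.0) ** order

def rbf_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 sigma^2)); the sigma parametrisation is assumed."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

x, y = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))
```

Under this form, the radial basis kernel value falls towards zero as the squared distance grows relative to the parameter, which is one reason the parameter value may matter when feature vectors are long; whether that explains the results above is not established here.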
Observation 4 suggests that there is no real difference between any of the information-gain based algorithms (as discussed in 3.2.1). Since the MAP values are slightly different, it can be seen that the algorithms do not rank features in exactly the same way. However, there is no real benefit in using one of these methods over the other information-gain based algorithms. After observing this phenomenon, the literature was consulted again to clarify the operation of these algorithms. From [6], it was discovered that Gain Ratio and Symmetrical Uncertainty improve significantly over the original Information Gain algorithm when there is an uneven number of states across different discrete variables. They do so by overcoming the bias of the Information Gain algorithm towards variables with more states. The discretisation algorithm for continuous values in Weka is adaptive, and it is not known exactly how real values from the feature vectors were turned into discrete ones. This means that these results are by no means conclusive, and further research must be done into how exactly Weka pre-processed the data before applying the information-gain based algorithms. It is interesting to note, however, that the ReliefF algorithm, which works directly with continuous features, produced results similar to those of the information-gain based algorithms. Different values of k neighbours could be researched in the future to establish the optimal performance of this algorithm.
Observation 5 is the most counter-intuitive and the most interesting. Intended as a control test for the performance of the other feature selection methods, random feature selection seems to be doing better than all of them! An attempt to justify this is presented in the next section.
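The bias mentioned under observation 4 can be illustrated with a small sketch using the usual definitions for discrete features: Information Gain is H(C) - H(C|A), Gain Ratio divides it by H(A), and Symmetrical Uncertainty is 2 * IG / (H(A) + H(C)). This is not Weka's implementation; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def info_gain(feature, labels):
    """H(labels) - H(labels | feature)."""
    n = len(labels)
    conditional = 0.0
    for state in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == state]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

def gain_ratio(feature, labels):
    h = entropy(feature)
    return info_gain(feature, labels) / h if h > 0 else 0.0

def symmetrical_uncertainty(feature, labels):
    denom = entropy(feature) + entropy(labels)
    return 2.0 * info_gain(feature, labels) / denom if denom > 0 else 0.0

# Toy data: 2 positive and 6 negative examples.
labels = [1, 1, 0, 0, 0, 0, 0, 0]
two_state = ["hi", "hi", "lo", "lo", "lo", "lo", "lo", "lo"]  # predicts the class
id_like = list(range(8))                                      # one state per example

for name, feat in (("two_state", two_state), ("id_like", id_like)):
    print(name, info_gain(feat, labels), gain_ratio(feat, labels),
          symmetrical_uncertainty(feat, labels))
```

Both toy features receive the same Information Gain, because each determines the class completely, but the eight-state feature is penalised by Gain Ratio and Symmetrical Uncertainty. If the discretisation produced a similar number of states for every feature, the three measures would be expected to rank features almost identically, which would be consistent with observation 4.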
Table 5-5 MAP for the Random and ReliefF feature selection algorithms, used with the polynomial kernel and 1000 features kept.
For a polynomial kernel SVM with 1000 features, random feature selection outperforms ReliefF, which is consistent with the subsampled MAP results. Furthermore, the MAP values for random feature selection are on par with, and in some instances slightly better than, the values obtained by the AdaBoost algorithm on the same set of feature vectors in [14] (Table 5-6). The overall MAP for random feature selection slightly exceeds that of AdaBoost. This result is promising: it implies that we can use 1/47th of the feature vector and still get good performance, comparable to the AdaBoost method proposed in [19]. It also puts in question the claim in [19] that Viola feature vectors are highly selective. The SVM seems to be capable of classifying and ranking images well given any random subprojection of the feature vectors, which supports our intuition that successive convolutions generate redundant features (see 3.1). In this light, it is interesting that ReliefF does not perform better than random feature selection on average, given its author's claim that it can cope well with redundant features [11].
Queries     AdaBoost
1 image     0.0568
2 images    0.0781
3 images    0.0873
4 images    0.0941
5 images    0.1023
6 images    0.1029
Overall     0.0811
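Random feature selection, as discussed above, amounts to fixing one random set of feature indices and projecting every feature vector onto it. The sketch below illustrates this under stated assumptions: the function names are hypothetical, and the full vector length of 46,875 is taken from the feature scheme of [19] (25 filters chained three deep over 3 colour channels) rather than from this section, which only states the 1/47th ratio.

```python
import random

def random_feature_subset(n_features, n_keep, seed=None):
    """Pick n_keep distinct feature indices uniformly at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_features), n_keep))

def project(vector, indices):
    """Keep only the selected components of a feature vector."""
    return [vector[i] for i in indices]

N_FEATURES = 46_875   # assumed full vector length: 25**3 * 3
N_KEEP = 1000         # roughly 1/47th of the full vector

indices = random_feature_subset(N_FEATURES, N_KEEP, seed=0)

# The SAME index set must be applied to every image, positive or negative.
rng = random.Random(1)
image_vector = [rng.random() for _ in range(N_FEATURES)]   # stand-in feature vector
reduced = project(image_vector, indices)
print(len(reduced), round(N_FEATURES / N_KEEP, 1))
```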
The observation that random feature selection performs best suggests that it would be worth investigating the application of feature selection algorithms within random feature subsets. If average precision improves, this could imply that the reason for their poor performance is the very high initial number of features. Average precision is not consistent across all categories, however. Tables 5-7 and 5-8 show the difference in MAP for different categories, and Figures 5-2 and 5-3 show example images from those categories.
Category                      Random    ReliefF
Museum Easter Eggs            0.2525    0.2142
Ornamental Designs            0.4177    0.5310
Cards                         0.5199    0.4632
Fireworks                     0.4056    0.3923
Prehistoric World             0.2851    0.2049
International Fireworks       0.3215    0.2029
Rainy Nights                  0.2789    0.2330
Highways and Street Signs     0.2404    0.2353

Category                      Random    ReliefF
Waterscapes                   0.0225    0.0187
Insects                       0.0309    0.0284
Alien Landscapes              0.0199    0.0153
Wading Birds                  0.0236    0.0192
Landscapes II                 0.0343    0.0288
Backyard Wildlife             0.0356    0.0300
Coastal Landscapes            0.0234    0.0204
Sand and Wind                 0.0167    0.0163
[Figure: example images from the categories Cards, Rainy Nights, International Fireworks, Ornamental Designs, Fireworks and Prehistoric World]
[Figure: example images from the categories Alien Landscapes, Landscapes II, Backyard Wildlife, Waterscapes, Wading Birds, Insects and Coastal Landscapes]
It is possible to make a number of observations from the differences in MAP values for different types of images:
1) Images with a clear difference in brightness between the object and the background have a good retrieval precision rate.
2) Images containing objects with well-defined edges have a good retrieval precision rate.
3) Images consisting of large areas of different colours with similar intensity (such as landscapes) are not recognised well.
These observations are likely to be due to the ability of Viola's feature space to detect the global structure of the image based on edges. Images in which the background has a clearly different intensity from the object being recognised have well-defined edges, and so retrieval is more precise. Where edges are less conspicuous, as in the third case, retrieval is less successful. The observations are similar to the ones made in [14], where the authors found that convolution vectors perform best when the structure of the image is the most important feature. This implies that the algorithms evaluated in this section do not improve the overall functionality of the Viola feature vectors.
6 Future Work
The results of this study show potential for further research. It is a promising result that an SVM can use an arbitrary projection of the Viola feature vectors, just 1/47th of the original vector size, and still achieve retrieval performance comparable with the published results in [14]. This suggests that it might be worth investigating feature selection strategies other than the ones presented in this work to improve the performance of Viola feature vectors further. One such approach could be to use the AdaBoost algorithm adaptation [19] to generate weak hypotheses, and to use the underlying features for SVM classification. Since this algorithm performs well using just 20 features, it would be interesting to see whether those are sufficient for the SVM to achieve the same or better performance than random feature selection in Table 5-5, which uses 1000 features. Additionally, it is important to find out why the feature selection algorithms applied in this work failed to perform better than random feature selection.
To alleviate the problem of selecting negative examples, it might be worth investigating the use of the Gram-Schmidt algorithm (see 2.3) for selecting negative examples, applying the same kernel as the SVM that carries out the classification, to see if this improves overall performance. Additionally, a one-class SVM [1] could be developed to see whether it is possible to avoid the selection of negative examples altogether. An SVM with uneven margins [13] could also be developed to investigate its effect on image retrieval.
Results in [14] show that for some categories where Viola feature vectors performed badly, various colour-based feature vectors work well. Therefore, it might be worth developing a system that pre-processes the user's query and detects which feature sets would perform better at retrieving similar images. This might mean running the algorithm on different feature sets at the same time and combining the rankings in some intelligent way. Alternatively, it might mean creating a dynamic feature representation that incorporates information from different feature sets.
To investigate the non-linear nature of the Viola feature space, a further range of SVM kernels can be investigated and their performance compared. To investigate the redundancy of features in this feature space, CFS subset selection [5] can be implemented and the results observed. Additionally, the optimal number of random features for use with SVMs should be investigated. All in all, this study lays the foundations for a possible longer-term project.
7 Conclusion
The use of Support Vector Machines and a number of feature selection algorithms was evaluated on the test collection created by Pickering M. and Rüger S., and the results of this evaluation were compared with the published results. An unusual property of the feature space was discovered, whereby a very small, randomly selected subset of features is sufficient for the SVM to retrieve images on par with some of the existing methods. The Polynomial kernel consistently outperformed the other kernels, which suggests that classifying images in this feature space is a non-linear problem. The results obtained and the supporting background research have allowed interesting areas for future investigation to be identified.
8 Acknowledgements
This work was supervised by Dr. Stefan Rüger. Help with setting up retrieval tests was kindly provided by Marcus Pickering.
9 References
[1] Chen Y., Zhou X., Huang T., One Class SVM for Learning In Image Retrieval, IEEE International Conference on Image Processing, Thessaloniki, Greece (2001)
[2] Cristianini N., Slides of the course on Advanced Topics in Machine Learning, University of California, Berkeley
[3] Freund Y., Schapire R., A Short Introduction to Boosting, Journal of Japanese Society for Artificial Intelligence 14(5) pp771-780 (1999)
[4] Gunn S., Support Vector Machines for Classification and Regression, Technical Report, Image Speech and Intelligent Systems Research Group, University of Southampton (1997)
[5] Hall M., Correlation Based Feature Selection for Discrete and Numeric Class Machine Learning, Proceedings of the Seventeenth International Conference on Machine Learning, Stanford University, CA (2000)
[6] Hall M., Smith L., Practical Feature Subset Selection for Machine Learning, Proceedings of the 21st Australian Computer Science Conference, Springer, pp181-191 (1998)
[7] Heesch D., Rüger S., Relevance Feedback for Content-Based Image Retrieval: What Can Three Mouse Clicks Do?, ACM (2002)
[8] Hong P., Tian Q., Huang T., Incorporate Support Vector Machines To Content Based Image Retrieval With Relevant Feedback, Image Processing, 2000 Proceedings, pp750-753 (2000)
[9] Kira K., Rendell L., The feature selection problem: traditional methods and new algorithm, Proceedings of AAAI 1992, San Jose, CA, pp129-134 (1992)
[10] Koller D., Sahami M., Toward Optimal Feature Selection, International Conference on Machine Learning (1996)
[11] Kononenko I., Estimating Attributes: Analysis and Extensions of RELIEF, European Conference on Machine Learning, pp171-182 (1994)
[12] Langley P., Selection of Relevant Features in Machine Learning, Proceedings of the AAAI Fall Symposium on Relevance (1994)
[13] Li Y., Zaragoza H., Herbrich R., Shawe-Taylor J., Kandola J., The Perceptron Algorithm with Uneven Margins, Proceedings of the International Conference of Machine Learning (2002)
[14] Pickering M., Rüger S., Evaluation of Key-frame Based Retrieval Techniques for Video, Elsevier Science (2002)
[15] Platt J., Fast Training of Support Vector Machines using Sequential Minimal Optimization, in B. Scholkopf, C. J. C. Burges and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pp185-208, Cambridge, MA, MIT Press (1998)
[16] Rüger S., Gauch S., Feature Reduction for Document Clustering and Classification, Technical Report 2000/8, Department of Computing, Imperial College London (2000)
[17] Shawe-Taylor J., Li Y., The Results of The Trec2002 Adaptive Filtering from RHUL, Presentation at London Meeting of European KerMIT (2002)
[18] Smeulders A., Worring M., Santini S., Gupta A., Jain R., Content based image retrieval at the end of the early years, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12) pp1349-1380 (2000)
[19] Tieu K., Viola P., Boosting Image Retrieval, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2000)
10 Bibliography
[B1] Cristianini N., Shawe-Taylor J., An Introduction To Support Vector Machines, Cambridge University Press (2000)
[B2] Mitchell T., Machine Learning, McGraw Hill (1997)
[B3] Witten I., Frank E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann (2000)