

Evaluation of SVM, RVM and SMLR for Accurate


Image Classification With Limited Ground Data
Mahesh Pal and Giles M. Foody, Senior Member, IEEE

Abstract—The accuracy of a conventional supervised classification is in part a function of the training set used, notably impacted by the quantity and quality of the training cases. Since it can be costly to acquire a large number of high quality training cases, recent research has focused on methods that allow accurate classification from small training sets. Previous work has shown the potential of support vector machine (SVM) based classifiers. Here, the potential of the relevance vector machine (RVM) and sparse multinomial logistic regression (SMLR) approaches is evaluated relative to SVM classification. With both airborne and spaceborne multispectral data sets, the RVM and SMLR were able to derive classifications of similar accuracy to the SVM but required considerably fewer training cases. For example, from a training set comprising 600 cases acquired with a conventional stratified random sampling design from an airborne thematic mapper (ATM) data set, the RVM produced the most accurate classification, 93.75%, and needed only 7.33% of the available training cases. In comparison, the SVM yielded a classification that had an accuracy of 92.50% and needed 4.5 times more useful training cases. Similarly, with a Landsat ETM+ (Littleport, Cambridgeshire, UK) data set, the SVM required 4.0 times more useful training cases than the RVM. For each data set, however, the accuracies of the classifications derived by the classifiers were similar, differing by no more than 1.25%. Finally, for both the ATM and ETM+ (Littleport) data sets, the useful training cases used by the SVM and RVM had distinct and potentially predictable characteristics. Support vectors were generally atypical but lay in the boundary region between classes in feature space, while the relevance vectors were atypical but anti-boundary in nature. The SMLR also tended to mostly, but not always, use extreme cases that lay away from the class boundary. The results, therefore, suggest a potential to design classifier-specific intelligent training data acquisition activities for accurate classification from small training sets, especially with the SVM and RVM.
Index Terms—Ground truth, relevance vector machines, sparse multinomial logistic regression, support vector machines, training data, typicality.

I. INTRODUCTION

Land cover mapping is one of the most common applications of remote sensing. Land cover maps are produced to meet the needs of a diverse array of users and are typically derived via some form of image classification analysis, which is one of the main pattern recognition techniques applied in remote sensing. Although remote sensing offers the potential to acquire imagery of large areas inexpensively, there are still major costs to be incurred in a mapping programme. One major cost to be met in a mapping application is associated with ground reference data [1]–[3].

Manuscript received September 30, 2011; revised February 12, 2012; accepted August 02, 2012. Date of publication October 16, 2012; date of current version November 14, 2012. This work was supported in part by the Association of Commonwealth Universities (ACU), London, through a fellowship to M. Pal.
M. Pal is with the Department of Civil Engineering, NIT Kurukshetra, Haryana, 136119 India (e-mail: mpce_pal@yahoo.co.uk).
G. M. Foody is with the School of Geography, University of Nottingham, Nottingham, NG7 2RD, U.K.
Digital Object Identifier 10.1109/JSTARS.2012.2215310
Ground data requirements may vary from study to study but it is common to find that ground data are required to train a supervised classification analysis and to evaluate map accuracy. The training data set should be classifier specific. A maximum likelihood classifier might need a large sample acquired with a random sampling design to provide accurate information about the mean and variance of the classes, while a SVM may need only a smaller training set of spectrally extreme cases that lie close to the decision boundaries [1]. Given that ground reference data are expensive and difficult to acquire, many have sought to reduce the ground reference data requirements. While it may sometimes be possible to reduce the ground data requirements in the testing stage for accuracy assessment [4], most attention has focused on the training stage. For example, strategies adopted include the use of unlabeled cases in training [2], [5]–[9], adoption of pre-processing methods such as feature reduction to reduce training data set requirements [10], [11], use of intelligent training selection strategies to focus on the acquisition of informative training samples [1], [12], [13] and strategies to reduce training set size when attention is focused on a specific class [12], [14], [15]. This article develops aspects of previous work and focuses on the potential for accurate classification with small training sets through the use of contemporary machine learning classifiers that may theoretically require only a few training samples.
The support vector machine (SVM) has been extensively used as a state-of-the-art supervised classifier with remote sensing data [16]–[21]. A key reason behind its popularity is its ability to yield highly accurate classifications, often more accurate than those from other contemporary approaches such as neural networks and decision trees [20], [22]–[24]. Moreover, of particular concern to this article, research has shown that the SVM may be used to produce an accurate classification from a small number of useful training cases lying close to the decision boundary [1] and that the financial savings to a mapping project derived from this feature can be large. For example, [13] show a reduction in the total cost of a mapping project by focussing attention on the most informative training samples for classification with a SVM.
The SVM based approach to classification is, however, not problem-free. Concerns include the need to define a set of parameters [25], [26], an inability to form a full confusion matrix in some strategies for multi-class classification [27] and a lack of information on per-case classification uncertainty [28].
Other classifiers may sometimes be attractive alternatives to the SVM, especially with regard to the aforementioned concerns with SVM-based classification. Recently, for example, [29]–[31] showed that a Bayesian extension of the SVM, called the relevance vector machine (RVM; [32]), can be used as an alternative to the SVM for image classification and has the ability to provide per-case uncertainty data in the form of posterior probabilities of class membership. Moreover, comparative studies suggest that a RVM may require fewer training cases than a SVM in order to classify a data set [29]. It has been suggested that the useful training cases for classification by a RVM are anti-boundary in nature while those for use in classification by a SVM tend to lie near the boundary between classes [32]. The potential to use a small training set and derive per-case uncertainty information is also offered by the use of the sparse multinomial logistic regression (SMLR; [33]) for image classification.
The aim of this study was to evaluate the potential of the RVM and SMLR classifiers for accurate classification from small training sets relative to the SVM, which has been evaluated previously [1]. The key focus in the evaluation was on the accuracy with which data sets may be classified and the number of training cases required. Many studies have shown that only a small proportion of a training set acquired by conventional sampling methods is actually required for accurate classification by classifiers such as the SVM, RVM and SMLR [1], [24], [29], [33]. A key challenge is finding these useful training cases in a way that allows accurate and efficient classification. Sometimes researchers have acquired a large training sample by conventional methods and then from this identified the useful training cases [20], [29], [31], [33]–[35]. Such approaches can be inefficient, notably in relation to the effort required to collect redundant training samples. A popular alternative is to adopt approaches such as active learning in which useful training sites are identified in an iterative analysis of the image [8], [9]. While attractive, such approaches have limitations [36]. One key concern is that this type of method can only be applied post-image acquisition and can be costly and inefficient in terms of ground data acquisition as sites for labeling are identified iteratively. The realization of the full potential of classification methods such as the SVM, RVM and SMLR requires an ability to identify the useful training cases and predict their location on the ground in advance of the classification [1], [12], [13]. This would allow an intelligent training programme [1], [13] to be defined. For this, it is necessary for the characteristics of useful training sites to be predictable. Thus, in addition to the accuracy with which data may be classified, a key focus of this article is the nature of the training set required for an accurate classification and especially the characterization of the useful training cases, to act as a guide to their predictability. The remainder of this article is structured such that the three classification algorithms used are briefly outlined in Section II before presenting the data sets they are applied to in Section III. The results of the analyses are presented in Section IV and key conclusions drawn in Section V.


II. CLASSIFICATION ALGORITHMS


Three classification algorithms were used: SVM, RVM and
SMLR. All three use the training cases to define the location
of classification decision boundaries to partition the data space
such that cases of unknown class membership may be allocated
to a class. The way the training data are used and nature of
the classifiers differs, however, and so a brief summary of each
algorithm is given below. In each discussion the focus is on
the training of the classifier. Particular attention is paid to the
training cases that are used to form the decision boundaries.
A subset of the available cases for training is typically used
in classification by each of the three selected algorithms. These
useful training cases are the support vectors, relevance vectors and retained kernel basis functions in classifications by the
SVM, RVM and SMLR respectively. In the discussion below, a training set of $n$ cases, represented by $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$, where $\mathbf{x}_i \in \mathbb{R}^{d}$ is the input vector with $d$ input features (wavebands) and $y_i \in \{1, \ldots, c\}$ is the class label drawn from $c$ classes, is available to the classifiers.
A. SVM
The SVM is based on statistical learning theory and has the
aim of determining the location of decision boundaries that produce the optimal separation of the classes [37]. In the case of
a two-class pattern recognition problem in which the classes
are linearly separable, the SVM selects from among the infinite
number of linear decision boundaries the one that minimises the
generalisation error. Thus, the selected decision boundary will
be one that leaves the greatest margin between the two classes,
where the margin is defined as the sum of the distances to the
hyperplane from the closest points of the two classes [37]. The
problem of maximising the margin can be solved using standard
quadratic programming optimisation techniques. The training
cases that are closest to the hyperplane are used to measure the
margin and these training cases are termed support vectors.
Only the support vectors are needed to form the classification
decision boundaries and these typically represent a very small
proportion of the total training set. If regions likely to furnish
support vectors can be predicted then only a small training set,
comprising the support vectors, may be acquired for a classification [1], [13].
For a 2-class classification problem (i.e., $y_i \in \{-1, +1\}$), the training cases are linearly separable if there exists a weight vector $\mathbf{w}$ (determining the orientation of a discriminating plane) and a scalar $b$ (determining the offset of the discriminating plane from the origin) such that $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1$ for all $i$, and the hypothesis space can be defined by the set of functions given by

$$f(\mathbf{x}) = \operatorname{sgn}(\mathbf{w} \cdot \mathbf{x} + b). \tag{1}$$

The SVM finds the separating hyperplane for which the distance between the classes, measured along a line perpendicular to the hyperplane, is maximised. This can be achieved by solving the following constrained optimisation problem:

$$\min_{\mathbf{w}, b} \; \frac{1}{2}\|\mathbf{w}\|^{2} \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1, \; i = 1, \ldots, n. \tag{2}$$


If the two classes are not linearly separable, the SVM tries to find the hyperplane that maximises the margin while, at the same time, minimising a quantity proportional to the number of misclassification errors. The restriction that all training cases of a given class lie on the same side of the optimal hyperplane can be relaxed by the introduction of a slack variable $\xi_i \geq 0$, and the trade-off between margin and misclassification error is controlled by a positive user-defined constant $C$ [38]. Thus, for non-separable data, (2) can be written as:

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^{2} + C\sum_{i=1}^{n}\xi_i \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0. \tag{3}$$

The SVM can also be extended to handle non-linear decision surfaces. [39] propose a method of projecting the input data onto a high-dimensional feature space through some nonlinear mapping $\Phi$ and formulating a linear classification problem in that feature space. Kernel functions are used to reduce the computational cost of dealing with the high-dimensional feature space [37]. A kernel function is defined as $K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$ and, with the use of a kernel function, (1) becomes:

$$f(\mathbf{x}) = \operatorname{sgn}\left(\sum_{i=1}^{n} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\right) \tag{4}$$

where $\alpha_i$ is a Lagrange multiplier. Further and more detailed discussion on the SVM can be found in [37], [40], [41].
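To make the sparseness of the solution concrete, the following minimal sketch (an illustration, not the LIBSVM/BSVM implementation used in this study) trains an RBF-kernel SVM on two synthetic Gaussian classes and reports how few of the training cases become support vectors; the class means and the C and gamma values are assumed for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two hypothetical classes in a two-band feature space
X = np.vstack([rng.normal((60.0, 90.0), 5.0, (100, 2)),
               rng.normal((80.0, 70.0), 5.0, (100, 2))])
y = np.repeat([0, 1], 100)

svm = SVC(kernel="rbf", C=10.0, gamma=0.1)  # values would normally be tuned
svm.fit(X, y)

# Only the support vectors are needed to define the decision boundary;
# they are typically a small fraction of the training set.
print("support vectors per class:", svm.n_support_)
print("fraction of training set retained:", svm.support_.size / len(y))
```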

B. RVM
The RVM is a recent development in kernel based machine learning approaches and can be used as an alternative to the SVM for image classification. The RVM is a probabilistic counterpart to the SVM, based on a Bayesian formulation of a linear model with an appropriate prior that results in a sparser representation than that achieved by the SVM. The RVM is based on a hierarchical prior, where an independent Gaussian prior is defined on the weight parameters in the first level, and an independent Gamma hyperprior is used for the variance parameters in the second level, which leads to model sparseness [32]. An algorithm produces sparse results when, among all the coefficients defining the model, only a few are non-zero. This property helps in fast model evaluation and provides a potential for accurate classification from small training sets. Key advantages of the RVM over the SVM include a reduced sensitivity to the hyperparameter settings, an ability to use non-Mercer kernels, the provision of a probabilistic output, no need to define the parameter $C$, and often a requirement for fewer relevance vectors than support vectors for a particular analysis [31], [32].
In a two-class classification by RVM, the aim is, essentially, to predict the posterior probability of membership of one of the classes for a given input. A case may then be allocated to the class with which it has the greatest likelihood of membership. Using a Bernoulli distribution, the likelihood function for the analysis would be:

$$P(\mathbf{y}\,|\,\mathbf{w}) = \prod_{i=1}^{n} \sigma\{f(\mathbf{x}_i; \mathbf{w})\}^{y_i}\,[1 - \sigma\{f(\mathbf{x}_i; \mathbf{w})\}]^{1 - y_i} \tag{5}$$

where $f(\mathbf{x}; \mathbf{w}) = \sum_{j=1}^{n} w_j K(\mathbf{x}, \mathbf{x}_j) + w_0$ and $\mathbf{w}$ is a set of adjustable weights. For multiclass classification, (5) can be written as:

$$P(\mathbf{y}\,|\,\mathbf{w}) = \prod_{i=1}^{n} \prod_{k=1}^{c} \sigma\{f_k(\mathbf{x}_i; \mathbf{w}_k)\}^{y_{ik}} \tag{6}$$

where $\sigma(\cdot)$ is the logistic sigmoid function:

$$\sigma(y) = \frac{1}{1 + \exp(-y)} \tag{7}$$

and an iterative method is used to obtain $\mathbf{w}$. Let $\boldsymbol{\alpha}_{MP}$ denote the maximum-a-posteriori estimate of the hyperparameter vector $\boldsymbol{\alpha}$. The maximum-a-posteriori estimate of the weights ($\mathbf{w}_{MP}$) can be obtained by maximizing the following objective function:

$$J(\mathbf{w}) = \sum_{i=1}^{n} \log P(y_i\,|\,\mathbf{w}) + \sum_{i=1}^{n} \log P(w_i\,|\,\alpha_{i,MP}) \tag{8}$$

where the first summation term corresponds to the likelihood of the class labels and the second term corresponds to the prior on the parameters $\mathbf{w}$. In the resulting solution, the gradient of $J$ with respect to $\mathbf{w}$ is calculated and only those training cases having non-zero coefficients $w_i$, which are called relevance vectors, will contribute to the generation of the decision function. The posterior is approximated around $\mathbf{w}_{MP}$ by a Gaussian approximation with covariance

$$\Sigma = (\Phi^{T} B \Phi + A)^{-1}$$

where $\Phi^{T} B \Phi + A$ is the Hessian of $J$, the matrix $\Phi$ has elements $K(\mathbf{x}_i, \mathbf{x}_j)$, $A = \operatorname{diag}(\alpha_0, \ldots, \alpha_n)$ and $B$ is a diagonal matrix with elements defined by $\sigma\{f(\mathbf{x}_i; \mathbf{w}_{MP})\}[1 - \sigma\{f(\mathbf{x}_i; \mathbf{w}_{MP})\}]$.

An iterative analysis is followed to find the set of weights that maximizes the value of (8), in which the hyperparameters $\alpha_i$ associated with each weight are updated. During training, the hyperparameters for a large number of training cases will attain very large values and the associated weights will be reduced to zero. Thus, the training process applied to a typical training set acquired following standard methods will make most of the training cases "irrelevant" and leave only the useful training cases. As a result, only a small number of training cases are required for the final classification. The assignment of an individual hyperparameter to each weight is the ultimate reason for the sparse property of the RVM. Further details on the RVM are given by [32].
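To illustrate the training process sketched above, the following simplified binary RVM trainer implements the standard hyperparameter update $\alpha_i = \gamma_i / w_{MP,i}^2$ with $\gamma_i = 1 - \alpha_i \Sigma_{ii}$ from [32]. It is a hedged re-implementation for a two-class problem, not the multiclass RVM code used in this study; the kernel parameter, iteration counts and pruning threshold are assumptions.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    # Pairwise squared distances, then the RBF kernel
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def train_rvm(X, y, gamma=0.5, n_outer=30, prune_at=1e6):
    """Simplified binary RVM classification; y holds 0/1 labels."""
    n = len(y)
    Phi = np.hstack([np.ones((n, 1)), rbf_kernel(X, X, gamma)])  # bias + one basis per case
    alpha = np.ones(n + 1)           # one hyperparameter per weight
    w = np.zeros(n + 1)
    keep = np.arange(n + 1)          # indices of weights not yet pruned
    for _ in range(n_outer):
        P, a, wk = Phi[:, keep], alpha[keep], w[keep]
        for _ in range(10):          # inner Newton steps for the MAP weights, cf. (8)
            p = sigmoid(P @ wk)
            B = p * (1.0 - p)
            H = P.T @ (B[:, None] * P) + np.diag(a)   # Hessian of the objective
            g = P.T @ (y - p) - a * wk                # gradient of the objective
            wk = wk + np.linalg.solve(H, g)
        Sigma = np.linalg.inv(H)
        gi = np.clip(1.0 - a * np.diag(Sigma), 1e-8, 1.0)
        alpha[keep] = gi / (wk ** 2 + 1e-12)          # hyperparameter update
        w[keep] = wk
        keep = keep[alpha[keep] < prune_at]           # drop "irrelevant" cases
    return w, keep  # surviving columns: the bias and the relevance vectors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((60, 90), 5.0, (50, 2)), rng.normal((80, 70), 5.0, (50, 2))])
y = np.repeat([0.0, 1.0], 50)
w, keep = train_rvm(X, y)
print("weights retained:", keep.size, "of", len(y) + 1)
```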
C. SMLR
The sparse multinomial logistic regression (SMLR) algorithm [33] utilises a Laplacian prior on the weights of the linear combination of functions to enforce sparseness. This prior favours a few large weights, with many of the others set to exactly zero. The SMLR algorithm learns a multi-class classifier based on multinomial logistic regression. The method simultaneously performs feature selection, to identify a small subset of the most relevant features, and the learning of the classification decision rules.


TABLE I
THE MEAN AND STANDARD DEVIATION VALUES OF THE SYNTHETIC DATA

If $\mathbf{w}_k$ is the weight vector associated with class $k$, then the probability that a given training case belongs to class $k$ is given by

$$P(y_i = k\,|\,\mathbf{x}_i, \mathbf{w}) = \frac{\exp(\mathbf{w}_k^{T}\mathbf{x}_i)}{\sum_{j=1}^{c} \exp(\mathbf{w}_j^{T}\mathbf{x}_i)}. \tag{9}$$

Usually a maximum likelihood estimation procedure is used to obtain the components of $\mathbf{w}$ from the training data by maximizing the log-likelihood function [42]:

$$l(\mathbf{w}) = \sum_{i=1}^{n} \log P(y_i\,|\,\mathbf{x}_i, \mathbf{w}). \tag{10}$$

In order to achieve the sparsity, a Laplacian prior ($p(\mathbf{w})$) is incorporated while estimating $\mathbf{w}$. [33] propose to use a maximum a posteriori (MAP) criterion for multinomial logistic regression. The estimate of $\mathbf{w}$ is then given by:

$$\hat{\mathbf{w}} = \arg\max_{\mathbf{w}}\,[\,l(\mathbf{w}) + \log p(\mathbf{w})\,] \tag{11}$$

in which $p(\mathbf{w}) \propto \exp(-\lambda\|\mathbf{w}\|_{1})$ is the Laplacian prior on $\mathbf{w}$, which means that $\log p(\mathbf{w}) = -\lambda\|\mathbf{w}\|_{1} + \text{const}$, where $\lambda$ is a user-defined parameter that affects the level of sparsity with SMLR. Thus, similar to the SVM and RVM, the SMLR uses a small number of training cases, called retained kernel basis functions, in model creation. Further details about SMLR and modified SMLR may be found in [10], [33], [43]–[45].
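Because a Laplacian prior is mathematically equivalent to an L1 penalty on the weights, the sparsifying effect of (11) can be illustrated with an off-the-shelf L1-penalised multinomial logistic regression fitted to kernel basis functions. The sketch below is an analogue of SMLR under assumed data and regularisation values, not the SMLR package used in this study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Three hypothetical classes in a two-band feature space
X = np.vstack([rng.normal(m, 5.0, (100, 2)) for m in [(40, 60), (60, 85), (80, 55)]])
y = np.repeat([1, 2, 3], 100)

# One RBF kernel basis function per training case, as in kernel SMLR
gamma = 0.05
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Phi = np.exp(-gamma * d2)

# L1 penalty == Laplacian prior; smaller C means a larger lambda, more sparsity
clf = LogisticRegression(penalty="l1", solver="saga", C=0.05, max_iter=10000)
clf.fit(Phi, y)

# Training cases whose basis function keeps a non-zero weight in any class
# play the role of the "retained kernel basis functions".
retained = np.unique(np.nonzero(clf.coef_)[1])
print("retained basis functions:", retained.size, "of", len(y))
```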

III. DATA SETS AND METHODS


The three classifiers were used to undertake a series of classifications to highlight the potential for accurate classification
from small training sets. The support vectors, relevance vectors
and retained kernel basis functions that are central to the classification by SVM, RVM and SMLR algorithms respectively will
be referred to as useful training cases in all classifications.
Four data sets were used. First, a simple simulated data set
was used to aid understanding and interpretation of the useful
training cases. This data set comprised three classes generated
randomly from Gaussian normal distributions in two wavebands
(Table I). Here, a training sample of 100 cases of each class was randomly generated and made available to each of the classifiers, and the analyses were undertaken using ten-fold cross validation.
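The simulated-data protocol can be sketched as follows; since the class statistics of Table I are not reproduced in this extraction, the means below are placeholders, and the SVM stands in for any of the three classifiers.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(42)
# Placeholder class means; the actual values are those listed in Table I
X = np.vstack([rng.normal(m, 5.0, (100, 2)) for m in [(40, 60), (60, 85), (80, 55)]])
y = np.repeat([1, 2, 3], 100)

# Ten-fold cross validation, as applied to the simulated data set
scores = cross_val_score(SVC(kernel="rbf", C=10.0, gamma=0.1), X, y, cv=10)
print("mean ten-fold CV accuracy: %.3f" % scores.mean())
```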
Second, a data set acquired in bands 1 and 5 of the ETM+ for a test site near Boston in Lincolnshire, UK, was used. Attention focused on three classes that were abundant at the test site:
wheat, sugar beet and oilseed rape. One hundred cases of each
class selected at random were used for training and testing all
three classifiers. These data were used only to extend the evaluation of the characterisation of the useful training cases from the simulated data to a real data set. As with the simulated data set, ten-fold cross validation was used with the ETM+ (Boston) data set.
More extensive analyses were undertaken with the remaining
two data sets with the accuracy of the resulting classifications
evaluated against ground data.
The third data set was obtained by a Daedalus 1268 airborne
thematic mapper (ATM) for an agricultural test site near
Feltwell, UK. The ATM data were acquired in 3 spectral wavebands, with a spatial resolution of 5 m [46]. The ATM data
were used to classify six different crop types: sugar beet, wheat,
barley, carrot, potato and grass. A map depicting the crop type
planted in each field produced near the time of the ATM data
acquisition was used as ground data to inform the training and
testing of the classifications. The training set comprised 100 randomly selected pixels of each class for the analyses of
the ATM data set. The testing set comprised 320 pixels drawn
at random from the test site.
The fourth and final data set used was acquired by the
Landsat ETM+ for an agricultural area near Littleport in
Cambridgeshire, UK. The data in the six non-thermal spectral wavebands with a 30 m spatial resolution were used to classify seven agricultural land cover types: wheat, sugar beet, potato, onion, peas, lettuce and beans [47]. A map depicting the crop
type planted in each field produced near the time of the ETM+
(Littleport) data acquisitions was used as ground data. For
each class, 100 randomly selected pixels were used to train the
classifiers. The accuracy of the classifications was evaluated
using an independent testing set that comprised 1,400 randomly
selected pixels.
For each classification undertaken with the ATM and ETM+
(Littleport) data sets, accuracy was assessed with the aid of a
confusion matrix and expressed as the percentage of the testing
cases correctly allocated. As the potential for accurate classification by the SVM from small training sets has been demonstrated, a desire was to determine if the RVM and SMLR approaches were at least as accurate as the SVM classification,
which may be assessed by a test of non-inferiority. For both the
RVM and SMLR methods, this was evaluated by using the confidence interval of the difference in accuracy obtained from that
observed with the SVM in a test of non-inferiority, which focuses on the lower limit of the defined confidence interval [48],
[49]. In this evaluation it was assumed that the zone of indifference was 2.00%; this value was selected arbitrarily but ensures
that small differences in accuracy are inconsequential. For all
experiments, a personal computer with a Pentium IV processor
and 3 GB of RAM was used.
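As a rough illustration of the non-inferiority logic, the sketch below builds a 95% confidence interval on the difference between two accuracies and checks its lower limit against the 2.00% zone of indifference. It uses a simple unpaired normal approximation, which may differ from the exact interval construction of [48], [49]; the example values are those later reported for the ATM data set.

```python
import math

def noninferiority(acc_new, acc_ref, n, margin=0.02, z=1.96):
    """95% CI on acc_new - acc_ref for accuracies estimated on n test cases."""
    diff = acc_new - acc_ref
    se = math.sqrt(acc_new * (1 - acc_new) / n + acc_ref * (1 - acc_ref) / n)
    lower, upper = diff - z * se, diff + z * se
    return lower, upper, lower > -margin   # True -> non-inferior at this margin

# RVM (93.75%) vs. SVM (92.50%) on the 320-pixel ATM testing set
print(noninferiority(0.9375, 0.9250, 320))
```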
SVMs were initially designed for binary classification problems and a range of methods have been suggested for multi-class classification [20], [50], [51]. Here, the one-against-rest approach was used with the ATM data set [17], [24] and the one-against-one approach with the simulated and ETM+ data sets [51]. Throughout, a radial basis function kernel with a kernel-specific parameter ($\gamma$) was used with the SVM, RVM and SMLR algorithms. The LIBSVM and BSVM software packages [50], [52] were used to implement the SVM, whereas the SMLR software [33] was used to implement the sparse multinomial logistic regression classifier. A multiclass implementation of the original RVM code [32], [53] was used to implement the RVM classifier. Similar to the parameters required


TABLE II
USER DEFINED PARAMETERS WITH ALL FOUR DATASETS USED IN THIS STUDY

TABLE III
MEAN MAHALANOBIS DISTANCE MEASURES COMPUTED OVER ALL USEFUL TRAINING CASES FOR A CLASS BASED ON ANALYSES OF THE SIMULATED DATA

in the design of the SVM classifier, the values of the user-defined parameters influence the accuracy of classifications by the RVM and SMLR algorithms. In order to find a suitable value for each of the user-defined parameters with the different classification algorithms, cross validation and trial-and-error methods were used. Specifically, five-fold cross validation was used with the SVM, while trial-and-error procedures were used with the RVM and SMLR to find suitable values of the user-defined parameters for the classifications of both the simulated and real remote sensing data sets. For classification by the RVM, the trials involved varying the two user-defined values, one from 0.1 to 2.0 with a step size of 0.1, together with a search over a range of values for the second. For classification by the SMLR, the two values were varied from 0.1 to 15.0 and from 0.1 to 2.5 respectively, each with a step size of 0.1. For the analyses of all four data sets, the optimal values of the user-defined parameters are provided in Table II.
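For the SVM, the five-fold cross validation described above corresponds to a standard grid search; a hedged sketch follows, with an illustrative grid rather than the ranges actually searched in this study.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(m, 5.0, (100, 2)) for m in [(40, 60), (60, 85), (80, 55)]])
y = np.repeat([1, 2, 3], 100)

# Hypothetical search grid over C and the RBF kernel parameter
grid = {"C": [0.1, 1.0, 10.0, 100.0], "gamma": [0.01, 0.1, 0.5, 1.0]}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)   # five-fold CV as in the text
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```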
The position of the useful training cases in feature space was
evaluated visually and quantitatively characterised with measures based on their Mahalanobis distance to the centroid of
each class. The Mahalanobis distance between a case and a class
centroid is inversely related to the typicality of the case to the
class [54], [55]. Thus, a low distance indicates that the case lies
close to the class centroid and so is typical of the class while
a large distance indicates that the case is atypical of the class.
As well as providing a simple guide to the typicality of a case
to a class, the set of Mahalanobis distances computed over all

classes for a case may be used to provide a simple descriptor


of the location of the case relative to the class centroids and,
more critically, the decision boundaries. For example, a decision
boundary may be expected to lie between two class centroids
and so at a similar Mahalanobis distance from each centroid.
Thus, if the difference between the two smallest Mahalanobis
distances computed for a case was small this would indicate
that the case lies close to the border region between two classes
and near the location of a decision boundary [56]. Conversely, if
the difference between the two smallest Mahalanobis distances
was large the case lies away from the border region between
two classes and the decision boundary that separates them [56].
Here, the Mahalanobis distances and the difference between the
two smallest Mahalanobis distances for each case were computed to indicate the typicality of each case to a class and its
position relative to inter-class transition regions respectively.
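The two measures described above are straightforward to compute. The sketch below (an illustration, not the authors' code) returns, for each case, its Mahalanobis distance to every class centroid and the difference between its two smallest distances; a small difference flags a boundary case and a large difference an anti-boundary case.

```python
import numpy as np

def mahalanobis_measures(X, y):
    """Distances of each case to every class centroid, plus a boundary indicator."""
    classes = np.unique(y)
    d = np.empty((len(X), len(classes)))
    for j, k in enumerate(classes):
        Xk = X[y == k]
        mu = Xk.mean(axis=0)
        VI = np.linalg.inv(np.cov(Xk, rowvar=False))   # inverse class covariance
        diff = X - mu
        d[:, j] = np.sqrt(np.einsum("ij,jk,ik->i", diff, VI, diff))
    d_sorted = np.sort(d, axis=1)
    # Small gap between the two smallest distances -> near a decision boundary;
    # large gap -> anti-boundary location in feature space.
    return d, d_sorted[:, 1] - d_sorted[:, 0]
```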
IV. RESULTS
The classifications of the simulated data set allowed the general characteristics of the useful training cases for each classifier to be determined. Two major attributes of the useful training
cases were apparent. First, all three classifiers used only a small
proportion of the available training data set in classifying the
data (Table III). The total number of useful training cases ranged
from 6 for the RVM to 76 for the SMLR, representing 2.00%
and 25.00% of the total sample size respectively. Second, the


Fig. 1. Location of the useful training cases for classifications of the simulated
data by (a) SVM, (b) RVM and (c) SMLR.

useful training cases were distributed in feature space in a relatively systematic fashion (Fig. 1). The location of the useful
training cases, however, varied between the three classifiers.
The trends were visually most apparent for class 2. For this
class, the support vectors were a set of extreme cases that lay at
the edge of the class distribution and between the distributions
of the other classes (Fig. 1(a)). As expected, the support vectors, therefore, lay in a region close to where a classification decision boundary would be fitted.

Fig. 2. Location of the useful training cases for classifications of the ETM+ (Boston) data by (a) SVM, (b) RVM and (c) SMLR.

With the RVM, the relevance vectors were also extreme cases but located away from the boundary region (Fig. 1(b)). Note that for all three classes the support vectors have a relatively large Mahalanobis distance to the actual


TABLE IV
MEAN MAHALANOBIS DISTANCE MEASURES COMPUTED OVER ALL USEFUL TRAINING CASES
FOR A CLASS BASED ON ANALYSES OF THE ETM+ (BOSTON) DATA SET

class of membership and a small difference between the two


smallest Mahalanobis distances, which indicate that they are
extreme cases located in the region of a classification decision
boundary (Table III). The atypical nature of the support vectors
is perhaps most apparent if the Mahalanobis distance to the actual class of membership is expressed as a typicality probability
[54], with the mean typicality of the support vectors for each
class being 0.05. The relevance vectors also show a relatively
large Mahalanobis distance to the actual class of membership
but a large difference between the two smallest Mahalanobis
distances, indicating that they are extreme but anti-boundary
in nature (Table III). Again, for each class, the mean Mahalanobis distance to the actual class for the selected relevance
vectors equated to typicality probabilities of 0.05 or less. With
the SMLR, the location of the retained kernel basis functions
varied from relatively typical (class 2) to atypical and near/anti
decision boundary (classes 1 and 3) (Fig. 1(c)). While the Mahalanobis distance to the actual class were generally smaller than
for the SVM and RVM, the useful training cases for classes 1
and 3 were still highly atypical, with very low mean typicality probabilities. These trends are also apparent in the Mahalanobis distance based metrics that characterise the location of
the useful training cases (Table III).
Given the uncertainty in the location of the useful training cases identified by the SMLR classifier with the simulated data set, further analyses with the ETM+ (Boston) data set were undertaken. The results summarised in Table IV and Fig. 2 suggest similar trends to those observed with the simulated
dataset with regard to the location of support vectors and
relevance vectors. The total number of useful training cases
with this data set varied from 7 for the RVM to 19 for the
SVM representing about 2.00% and 6.00% of the total training
sample respectively. The results indicate that useful training
cases with SMLR are located away from the class boundary
for all three classes (Fig. 2). A comparison of the Mahalanobis distances and the differences between the two smallest Mahalanobis distances (Table IV) suggests that they are extreme cases lying away from the class boundary.
Similar trends were observed with the classifications of the
ATM data set. Of the 600 training cases available, the SVM,
RVM and SMLR used only 202, 44 and 101 respectively, representing between 7.33% and 33.66% of the total set. In the case of
the analyses of the ETM+ (Littleport) data set, the SVM, RVM
and SMLR used 314, 79 and 172 training cases representing between 11.29% and 44.90% of the total set of 700 training cases.
The difference between the two smallest Mahalanobis distances
was generally small for the support vectors but generally large
for the relevance vectors and the retained kernel basis functions
for both the ATM (Table V) and ETM+ (Table VI) data sets. The
results again suggest that useful training cases for the SVM are
atypical and lie in the border region between classes while for
the RVM and SMLR the useful training cases are atypical but
located away from the border region.
Together, the two attributes of the useful training cases, their
small number and systematic location in feature space, indicate
a potential to use small training sets for classification by each
of the three classifiers. Critically, the systematic nature of their
location in feature space suggests a potential to predict their location on the ground in advance. That is, the systematic location
of the useful training cases in feature space can be re-projected
into geographical space to allow intelligent training [1], [13].
This has been achieved with SVM, for example by deliberately
focusing training data collection activities on extreme cases that
are expected to have most spectral similarity to other classes
[1], [13]. It should also be possible, however, to design intelligent training data acquisition programmes in ways to focus
on potentially useful training cases for the RVM and SMLR.
For example, like the SVM, attention might focus on spectrally
extreme cases, but not those in the border region, when using
the RVM and SMLR classifiers.

TABLE V
MEAN MAHALANOBIS DISTANCE MEASURES COMPUTED OVER ALL USEFUL TRAINING CASES FOR A CLASS BASED ON ANALYSES OF THE ATM DATA

The operation of an intelligent training scheme requires moving between feature and geographical space. For example, the approach used in [13] was
based on using fundamental knowledge of the variables that influence the spectral response to aid the selection of training sites
on the ground that would be expected to lie at extreme positions in feature space. For example, with a crop, extreme cases
might be expected to occur in regions of differing growth stage
and cover as well as with differing soil backgrounds. Moreover,
different extremities can be defined. For example, sites of extremely high and low plant cover would be expected to lie in different locations in feature space. Similarly, crops grown on different soil types or perhaps growing on wet and dry soils would
be expected to lie in different, potentially predictable, locations
of feature space [13], [57]. The precise approach will depend
on the specific data sets used but provided the useful training
cases have a potentially predictable nature an intelligent training
scheme should be feasible. Finally, it is apparent that the results
also highlight that training data collection programmes should
be designed in a classifier-specific manner. Note, for example
with both the ATM and ETM+ (Littleport) data sets, that few of
the training cases selected as useful by one classifier were also
selected as useful by another classifier (Table VII).
The results above indicate that all three classifiers use mostly
different training cases and so point to a desire for classifier-specific training data acquisition programmes. The importance of
this can be seen in the results of classifications of the ATM data

derived using a classifier trained upon data useful for another


classifier. For example, the useful training cases for classification by a SVM (support vectors) were used to train the RVM and
SMLR classifiers. The resulting classifications had an accuracy
of 91.00% and 85.00% for the RVM and SMLR respectively;
both less than the accuracy of 92.50% derived when the support vectors identified from the entire training set were used.
Similarly, the SMLR and SVM yielded classifications with an
accuracy of 42.50% and 70.31% when trained with the useful
training cases defined for the RVM; both substantially less than
the 93.75% obtained when relevance vectors identified from the
entire training set were used. Lastly, when trained with the useful training cases for a SMLR classification, the accuracies of the SVM and RVM classifications were 87.18% and 78.00% respectively; again both substantially less than the 92.81% obtained when the retained kernel basis functions identified from
the entire training set were used. These results indicate a decline
in classification accuracy, by all three classification algorithms,
when trained with useful training cases defined for another classifier. Thus, a training set defined for one classifier and able to
yield an accurate classification may yield a low accuracy if used
with a different classifier. Taken together, these results highlight
the impact of the training data on the accuracy of a classification
and the desire for classifier specific training data acquisition.
The potential to characterise useful training sites and so to
design an intelligent training data collection programme offers


TABLE VI
MEAN MAHALANOBIS DISTANCE MEASURES COMPUTED OVER ALL USEFUL TRAINING CASES
FOR A CLASS BASED ON ANALYSES OF THE ETM+ (LITTLEPORT) DATA

TABLE VII
NUMBER OF COMMON USEFUL TRAINING CASES

attractive benefits relative to alternative approaches for efficient


training such as active learning. The ideal training set is acquired at or close to the time of image acquisition. Intelligent
training allows the location of potentially useful training sites to

be predicted in advance of an analysis, as in [13], and so allows close temporal coincidence of ground and image data acquisitions. As noted above, methods such as active learning can only
be applied post-image acquisition. In some situations the time
gap between the acquisition of ground and image data may be a
source of error and uncertainty (e.g., the image may have been
acquired just before crop harvesting and so show the presence
of a mature crop but post-image acquisition ground data surveys
might find bare fields etc.). Additionally, as the active learning
methods highlight useful training sites iteratively, their use could require a series of ground data collection programmes. Such a situation does not allow efficient design of ground data collection, with multiple, possibly overlapping, field programmes required.
factors such as class temporal variability and the source of the
ground data. It should also be noted that the various approaches
to efficient training may be complementary and so could perhaps be usefully combined. For example, intelligently selected
training sites could perhaps act as seeds or starting points for the
selection of other potentially useful but unlabeled pixels.
TABLE VIII
CONFUSION MATRICES FOR THE CLASSIFICATIONS OF THE ATM DATA (A) SVM, (B) RVM AND (C) SMLR. THE OVERALL ACCURACY OF THE CLASSIFICATIONS WAS 93.75% FOR RVM, 92.50% FOR SVM AND 92.81% FOR SMLR. PER-CLASS ACCURACY (%) SHOWN FROM USER'S AND PRODUCER'S PERSPECTIVES

TABLE IX
NON-INFERIORITY TEST RESULTS RELATIVE TO SVM BASED ON 95% CONFIDENCE INTERVAL ON THE ESTIMATED DIFFERENCE IN ACCURACY. NOTE THAT THE DIFFERENCES IN ACCURACY WERE ALL VERY SMALL AND INSIDE THE DEFINED ZONE OF INDIFFERENCE

A key result is that smaller training sets than those required for the SVM may be used by the RVM and SMLR with the ATM and ETM+ (Littleport) data sets, making them attractive alternatives


to the established SVM for image classification. The value of this
attribute, however, is a function of the accuracy and computational cost of the classifications. In terms of classification
accuracy, all three classifiers produced highly accurate classifications of the ATM data set (Table VIII). Critically, the lower
limit of the derived 95% confidence interval for the difference in
accuracy from the SVM classification was above 0 for both the
RVM and SMLR classifications and the entire interval lay within
the zone of indifference, indicating that the RVM and SMLR classifications were statistically non-inferior to that from the SVM
at the 97.5% level of confidence with the ATM data set. Indeed, their
estimated accuracies of 93.75% and 92.81% respectively were
marginally higher than that from the SVM (92.50%; Table IX).

TABLE X
VARIATION OF CLASSIFICATION ACCURACY AND NUMBER OF RELEVANCE VECTORS WITH THE VALUE OF THE KERNEL PARAMETER, USING THE ATM DATASET

It is evident that the RVM produced the highest accuracy yet


required the smallest training set, although it should be noted that the results of the trial analyses highlighted variation in the number of relevance vectors needed and in classification accuracy with the value of the kernel parameter; at large parameter values the entire set of available training cases was required, highlighting the importance of careful parameter value selection (Table X). In the case of the analyses with the ETM+ (Littleport) data set, the classifications were of similar accuracy, with the classification obtained by the RVM (80.21%) slightly lower than those from the SVM (81.36%) and SMLR (81.71%), though critically the RVM and SMLR classifications were not inferior to the

TABLE XI
COMPUTATIONAL COST AND THE NUMBER OF USEFUL TRAINING CASES USED
BY THE CLASSIFIERS.

SVM with the confidence intervals lying within the zone of


indifference. Additionally, the SVM required approximately
4.0 and 1.8 times the useful training cases used by RVM and
SMLR respectively. In terms of the training and testing times used by the SVM, RVM and SMLR, precise values for the computational cost cannot be compared exactly because all three algorithms were implemented using different programming languages.
Nevertheless, a comparison of computational cost suggests that
the RVM and, to a lesser degree, the SMLR were computationally more demanding than the SVM (Table XI), which may be
a concern for analyses of large data sets.
V. CONCLUSIONS
The potential of SVM for accurate classification from small
training sets has been established in previous research. Other
classifiers such as RVM and SMLR, however, offer additional
features, such as information on per-case classification uncertainty that may sometimes be useful. Here, it has been shown
that the RVM and SMLR are able to classify data to similar
accuracies to the SVM. Moreover, both RVM and SMLR require fewer training cases than a SVM when used with remotely
sensed data. Additionally, the useful training cases for SVM and
RVM classifiers have different but well-defined characteristics
which may make them easily predictable. The training cases
for the SMLR were also mostly well characterised, being of an
extreme nature and lying away from class boundaries. Consequently, it may be possible to predict potentially useful training
sites, especially for the SVM and RVM.
ACKNOWLEDGMENT
Dr. Pal wishes to thank the Association of Commonwealth
Universities for this fellowship. The authors thank the School
of Geography, University of Nottingham, for use of computing
facilities. The ATM data were acquired as part of the European AgriSAR campaign. The LIBSVM and BSVM packages for the SVM were made available by C.-J. Lin of National Taiwan University, the SMLR package was provided by A. Hartemink, Duke University, and the multiclass RVM was provided by Y.-F. Mao, Electronics and Information Department, SCUT, Guangzhou, China.
The authors are also grateful to the editors and the referees for
their helpful comments on the original manuscript.

REFERENCES
[1] G. M. Foody and A. Mathur, "Toward intelligent training of supervised image classifications: Directing training data acquisition for SVM classification," Remote Sens. Environ., vol. 93, no. 1–2, pp. 107–117, Oct. 2004.
[2] M. Chi and L. Bruzzone, "A semilabeled-sample-driven bagging technique for ill-posed classification problems," IEEE Geosci. Remote Sens. Lett., vol. 2, no. 1, pp. 69–73, Jan. 2005.
[3] P. Mantero, G. Moser, and S. B. Serpico, "Partially supervised classification of remote sensing images through SVM-based probability density estimation," IEEE Trans. Geosci. Remote Sens., vol. 43, no. 3, pp. 559–570, Mar. 2005.
[4] G. M. Foody, "Assessing the accuracy of land cover change with imperfect ground reference data," Remote Sens. Environ., vol. 114, no. 10, pp. 2271–2285, Oct. 2010.
[5] L. Bruzzone, M. Chi, and M. Marconcini, "A novel transductive SVM for the semisupervised classification of remote-sensing images," IEEE Trans. Geosci. Remote Sens., vol. 44, no. 11, pp. 3363–3373, Nov. 2006.
[6] M. Marconcini, G. Camps-Valls, and L. Bruzzone, "A composite semisupervised SVM for classification of hyperspectral images," IEEE Geosci. Remote Sens. Lett., vol. 6, no. 2, pp. 234–238, Apr. 2009.
[7] L. Bruzzone and C. Persello, "A novel context-sensitive semisupervised SVM classifier robust to mislabeled training samples," IEEE Trans. Geosci. Remote Sens., vol. 47, no. 7, pp. 2142–2154, Jul. 2009.
[8] S. Rajan, J. Ghosh, and M. M. Crawford, "An active learning approach to hyperspectral data classification," IEEE Trans. Geosci. Remote Sens., vol. 46, no. 4, pp. 1231–1242, Apr. 2008.
[9] D. Tuia, F. Ratle, F. Pacifici, M. F. Kanevski, and W. J. Emery, "Active learning methods for remote sensing image classification," IEEE Trans. Geosci. Remote Sens., vol. 47, no. 7, pp. 2218–2232, Jul. 2009.
[10] P. Zhong, P. Zhang, and R. Wang, "Dynamic learning of SMLR for feature selection and classification of hyperspectral data," IEEE Geosci. Remote Sens. Lett., vol. 5, no. 2, pp. 280–284, Apr. 2008.
[11] M. Pal and G. M. Foody, "Feature selection for classification of hyperspectral data by SVM," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 5, pp. 2297–2307, May 2010.
[12] G. M. Foody and A. Mathur, "The use of small training sets containing mixed pixels for accurate hard image classification: Training on mixed spectral responses for classification by a SVM," Remote Sens. Environ., vol. 103, no. 2, pp. 179–189, Jul. 2006.
[13] A. Mathur and G. M. Foody, "Crop classification by support vector machine with intelligently selected training data for an operational application," Int. J. Remote Sens., vol. 29, no. 8, pp. 2227–2240, Apr. 2008.
[14] C. Sanchez-Hernandez, D. S. Boyd, and G. M. Foody, "One-class classification for mapping a specific land cover class: SVDD classification of fenland," IEEE Trans. Geosci. Remote Sens., vol. 45, no. 4, pp. 1061–1073, Apr. 2007.
[15] W. Li, Q. Guo, and C. Elkan, "A positive and unlabeled learning algorithm for one-class classification of remote-sensing data," IEEE Trans. Geosci. Remote Sens., vol. 49, no. 2, pp. 717–725, Feb. 2011.
[16] J. A. Gualtieri and R. F. Cromp, "Support vector machines for hyperspectral remote sensing classification," in Proc. 27th AIPR Workshop: Advances in Computer Assisted Recognition, Washington, DC, Oct. 27, 1998, pp. 221–232.
[17] C. Huang, L. S. Davis, and J. R. G. Townshend, "An assessment of support vector machines for land cover classification," Int. J. Remote Sens., vol. 23, no. 4, pp. 725–749, Feb. 2002.
[18] G. Zhu and D. G. Blumberg, "Classification using ASTER data and SVM algorithms; The case study of Beer Sheva, Israel," Remote Sens. Environ., vol. 80, no. 5, pp. 233–240, May 2002.
[19] M. Pal and P. M. Mather, "Assessment of the effectiveness of support vector machines for hyperspectral data," Future Gen. Comput. Syst., vol. 20, no. 7, pp. 1215–1225, Oct. 2004.
[20] F. Melgani and L. Bruzzone, "Classification of hyperspectral remote sensing images with support vector machines," IEEE Trans. Geosci. Remote Sens., vol. 42, no. 8, pp. 1778–1790, Aug. 2004.
[21] D. Lu and Q. Weng, "A survey of image classification methods and techniques for improving classification performance," Int. J. Remote Sens., vol. 28, no. 5, pp. 823–870, Mar. 2007.

[22] B. Waske and J. A. Benediktsson, "Fusion of support vector machines for classification of multisensor data," IEEE Trans. Geosci. Remote Sens., vol. 45, no. 12, pp. 3858–3866, Dec. 2007.
[23] M. Pal and P. M. Mather, "Some issues in the classification of DAIS hyperspectral data," Int. J. Remote Sens., vol. 27, no. 14, pp. 2895–2916, Jul. 2006.
[24] G. M. Foody and A. Mathur, "A relative evaluation of multiclass image classification by support vector machines," IEEE Trans. Geosci. Remote Sens., vol. 42, no. 6, pp. 1335–1343, Jun. 2004.
[25] M. Pal, "Kernel methods in remote sensing: A review," ISH J. Hydraul. Eng. (Special Issue), vol. 15, no. 1, pp. 194–215, May 2009.
[26] G. Mountrakis, J. Im, and C. Ogole, "Support vector machines in remote sensing: A review," ISPRS J. Photogramm. Remote Sens., vol. 66, no. 3, pp. 247–259, May 2011.
[27] A. Mathur and G. M. Foody, "Multiclass and binary SVM classification: Implications for training and classification users," IEEE Geosci. Remote Sens. Lett., vol. 5, no. 2, pp. 241–245, Feb. 2008.
[28] J. Platt, "Probabilistic outputs for support vector machines and comparison to regularized likelihood methods," in Advances in Large Margin Classifiers, A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, Eds. Cambridge, MA: MIT Press, 2000, pp. 61–74.
[29] B. Demir and S. Ertürk, "Hyperspectral image classification using relevance vector machines," IEEE Geosci. Remote Sens. Lett., vol. 4, no. 4, pp. 586–590, Apr. 2007.
[30] G. M. Foody, "RVM-based multi-class classification of remotely sensed data," Int. J. Remote Sens., vol. 29, no. 6, pp. 1817–1823, Mar. 2008.
[31] F. A. Mianji and Y. Zhang, "Robust hyperspectral classification using relevance vector machine," IEEE Trans. Geosci. Remote Sens., vol. 49, no. 6, pp. 2100–2112, Jun. 2011.
[32] M. E. Tipping, "Sparse Bayesian learning and the relevance vector machine," J. Mach. Learn. Res., vol. 1, pp. 211–244, Jun. 2001.
[33] B. Krishnapuram, L. Carin, M. A. T. Figueiredo, and A. J. Hartemink, "Sparse multinomial logistic regression: Fast algorithms and generalization bounds," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 6, pp. 957–968, Jun. 2005.
[34] G. Camps-Valls, L. Gómez-Chova, J. Calpe-Maravilla, J. D. Martín-Guerrero, E. Soria-Olivas, L. Alonso-Chordá, and J. Moreno, "Robust support vector method for hyperspectral data classification and knowledge discovery," IEEE Trans. Geosci. Remote Sens., vol. 42, no. 7, pp. 1530–1542, Jul. 2004.
[35] G. Camps-Valls and L. Bruzzone, "Kernel-based methods for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 43, no. 6, pp. 1351–1362, Jun. 2005.
[36] D. Tuia, M. Volpi, L. Copa, M. Kanevski, and J. Muñoz-Marí, "A survey of active learning algorithms for supervised remote sensing image classification," IEEE J. Sel. Topics Signal Process., vol. 5, no. 3, pp. 606–617, Jun. 2011.
[37] V. N. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
[38] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, Mar. 1995.
[39] B. E. Boser, I. M. Guyon, and V. Vapnik, "A training algorithm for optimal margin classifiers," in Proc. 5th Annual Workshop on Computational Learning Theory (COLT '92), New York, Jul. 27–29, 1992, pp. 144–152.
[40] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines. Cambridge, UK: Cambridge University Press, 2000.
[41] G. Camps-Valls and L. Bruzzone, Eds., Kernel Methods for Remote Sensing Data Analysis. Chichester, UK: Wiley, 2009.
[42] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag, 2001.
[43] J. Borges, J. Bioucas-Dias, and A. Marçal, "Fast sparse multinomial regression applied to hyperspectral data," presented at the Int. Conf. Image Analysis and Recognition (ICIAR 2006), Póvoa de Varzim, Portugal, 2006.
[44] J. Borges, J. Bioucas-Dias, and A. Marçal, "Bayesian hyperspectral image segmentation with discriminative class learning," IEEE Trans. Geosci. Remote Sens., vol. 49, no. 6, pp. 2151–2164, Jun. 2011.
[45] J. Li, J. M. Bioucas-Dias, and A. Plaza, "Hyperspectral image segmentation using a new Bayesian approach with active learning," IEEE Trans. Geosci. Remote Sens., vol. 49, no. 10, pp. 3947–3960, Oct. 2011.


[46] G. M. Foody and M. K. Arora, "An evaluation of some factors affecting the accuracy of classification by an artificial neural network," Int. J. Remote Sens., vol. 18, no. 4, pp. 799–810, Mar. 1997.
[47] M. Pal and P. M. Mather, "An assessment of the effectiveness of decision tree methods for land cover classification," Remote Sens. Environ., vol. 86, no. 4, pp. 554–565, Oct. 2003.
[48] J. L. Fleiss, B. Levin, and M. C. Paik, Statistical Methods for Rates and Proportions, 3rd ed. New York: Wiley-Interscience, 2003.
[49] G. M. Foody, "Classification accuracy comparison: Hypothesis tests and the use of confidence intervals in evaluations of difference, equivalence and non-inferiority," Remote Sens. Environ., vol. 113, no. 8, pp. 1658–1663, Aug. 2009.
[50] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multi-class support vector machines," IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 415–425, Feb. 2002.
[51] M. Pal, "Multiclass approaches for support vector machine based land cover classification," in Proc. 8th Annual Int. Conf. Map India 2005, 2005.
[52] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 1–27, Apr. 2011.
[53] X.-M. Xu, Y.-F. Mao, J.-N. Xiong, and F.-L. Zhou, "Classification performance comparison between RVM and SVM," in Proc. IEEE Int. Workshop on Anti-Counterfeiting, Security, Identification, Xiamen, China, Apr. 16–18, 2007, pp. 208–211.
[54] G. M. Foody, N. A. Campbell, N. M. Trodd, and T. F. Wood, "Derivation and applications of probabilistic measures of class membership from the maximum-likelihood classification," Photogramm. Eng. Remote Sens., vol. 58, no. 9, pp. 1335–1341, Sep. 1992.
[55] N. A. Campbell, "Some aspects of allocation and discrimination," in Multivariate Statistical Methods in Physical Anthropology, G. N. Van Vark and W. W. Howells, Eds. Dordrecht, The Netherlands: Reidel, 1984, pp. 177–192.
[56] G. M. Foody, "The significance of border training patterns in classification by a feed-forward neural network using back propagation learning," Int. J. Remote Sens., vol. 20, no. 18, pp. 3549–3562, Dec. 1999.
[57] G. M. Foody, "On training and evaluation of SVM for remote sensing applications," in Kernel Methods for Remote Sensing Data Analysis, G. Camps-Valls and L. Bruzzone, Eds. Chichester, UK: Wiley, 2009, pp. 85–109.

Mahesh Pal received the Ph.D. degree from the University of Nottingham, U.K., in 2002.
He is presently an Associate Professor in the
Department of Civil Engineering, NIT Kurukshetra,
Haryana, India. His major research areas are land
cover classification, feature selection and the application of artificial intelligence techniques in various civil engineering applications.
Dr. Pal is on the editorial board of Remote Sensing Letters. Part of the research work reported in this paper was carried out when Dr. Pal was on a Commonwealth fellowship at the University of Nottingham during the period October 2008–March 2009.

Giles M. Foody (M'01) received the B.Sc. and


Ph.D. degrees in geography from the University
of Sheffield, Sheffield, U.K., in 1983 and 1986,
respectively.
He is currently Professor of Geographical Information Science at the University of Nottingham, U.K.
His main research interests focus on the interface between remote sensing, ecology and informatics.
Dr. Foody is currently Editor-in-Chief of the
International Journal of Remote Sensing and of the
recently launched journal Remote Sensing Letters,
holds editorial roles with Landscape Ecology and Ecological Informatics, and
serves on the editorial board of several other journals. He was awarded the
Remote Sensing and Photogrammetry Society's Award, its highest award, for
services to remote sensing in 2009.
