
When Is "Nearest Neighbor" Meaningful?

Kevin Beyer Jonathan Goldstein Raghu Ramakrishnan Uri Shaft


CS Dept., University of Wisconsin-Madison

1210 W. Dayton St., Madison, WI 53706


email: {beyer,jgoldst,raghu,uri}@cs.wisc.edu

Abstract

We explore the effect of dimensionality on the "nearest neighbor" problem. We show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point. To provide a practical perspective, we present empirical results on both real and synthetic data sets that demonstrate that this effect can occur for as few as 10-15 dimensions. These results should not be interpreted to mean that high-dimensional indexing is never meaningful; we illustrate this point by identifying some high-dimensional workloads for which this effect does not occur. However, our results do emphasize that the methodology used almost universally in the database literature to evaluate high-dimensional indexing techniques is flawed, and should be modified. In particular, most such techniques proposed in the literature are not evaluated versus simple linear scan, and are evaluated over workloads for which nearest neighbor is not meaningful. Often, even the reported experiments, when analyzed carefully, show that linear scan would outperform the techniques being proposed on the workloads studied in high (10-15) dimensionality!

* This work was partially supported by a "David and Lucile Packard Foundation Fellowship in Science and Engineering", a "Presidential Young Investigator" award, NASA research grant NAGW-3921, ORD contract 144-ET33, and NSF grant 144-GN62.

1 Introduction

In recent years, many researchers have focused on finding efficient solutions to the nearest neighbor (NN) problem, defined as follows: Given a collection of data points and a query point in an m-dimensional metric space, find the data point that is closest to the query point. Particular interest has centered on solving this problem in high dimensional spaces, which arise from techniques that approximate (e.g., see [24]) complex data, such as images (e.g. [15, 27, 28, 21, 28, 23, 25, 18, 3]), sequences (e.g. [2, 1]), video (e.g. [15]), and shapes (e.g. [15, 29, 25, 22]), with long "feature" vectors. Similarity queries are performed by taking a given complex object, approximating it with a high dimensional vector to obtain the query point, and determining the data point closest to it in the underlying feature space.

This paper makes the following three contributions:

1) We show that under certain broad conditions (in terms of data and query distributions, or workload), as dimensionality increases, the distance to the nearest neighbor approaches the distance to the farthest neighbor. In other words, the contrast in distances to different data points becomes nonexistent. The conditions we have identified in which this happens are much broader than the independent and identically distributed (IID) dimensions assumption that other work assumes. Our result characterizes the problem itself, rather than specific algorithms that address the problem. In addition, our observations apply equally to the k-nearest neighbor variant of the problem. When one combines this result with the observation that most applications of high dimensional NN are heuristics for similarity in some domain (e.g. color histograms for image similarity), serious questions are raised as to the validity of many mappings of similarity problems to high dimensional NN problems.

2) To provide a practical perspective, we present empirical results based on synthetic distributions showing that the distinction between nearest and farthest neighbors may blur with as few as 15 dimensions. In addition, we performed experiments on data from a real image database that indicate that these dimensionality effects occur in practice (see [13]). Our observations suggest that high-dimensional feature vector representations for multimedia similarity search must be used with caution. In particular, one must check that the workload yields a clear separation between nearest and farthest neighbors for typical queries (e.g., through sampling). We also identify special workloads for which the concept of nearest neighbor continues to be meaningful in high dimensionality, to emphasize that our observations should not be misinterpreted as saying that NN in high dimensionality is never meaningful.

3) Our results underscore the point that evaluation of a technique for nearest-neighbor search should be based on meaningful workloads. We observe that the database literature on nearest neighbor processing techniques fails to compare new techniques to linear scans. Furthermore, we can infer from their data that a linear scan almost always out-performs their techniques in high dimensionality on the examined data sets. This is unsurprising as the workloads used to evaluate these techniques are in the class of "badly behaving" workloads identified by our results; the proposed methods may well be effective for appropriately chosen workloads, but this is not examined in their performance evaluation.

In summary, our results suggest that more care be taken when thinking of nearest neighbor approaches and high dimensional indexing algorithms; we supplement our theoretical results with experimental data and a careful discussion.

2 On the Significance of "Nearest Neighbor"

[Figure 1: Query point and its nearest neighbor.]

The NN problem involves determining the point in a data set that is nearest to a given query point (see Figure 1). It is frequently used in Geographical Information Systems (GIS), where points are associated with some geographical location (e.g., cities). A typical NN query is: "What city is closest to my current location?"

While it is natural to ask for the nearest neighbor, there is not always a meaningful answer. For instance, consider the scenario depicted in Figure 3. Even though there is a well-defined nearest neighbor, the difference in distance between the nearest neighbor and any other point in the data set is very small. Since the difference in distance is so small, the utility of the answer in solving concrete problems (e.g. minimizing travel cost) is very low. Furthermore, consider the scenario where the position of each point is thought to lie in some circle with high confidence (see Figure 2). Such a situation can come about either from numerical error in calculating the location, or "heuristic error", which derives from the algorithm used to deduce the point (e.g. if a flat rather than a spherical map were used to determine distance). In this scenario, the determination of a nearest neighbor is impossible with any reasonable degree of confidence!

While the scenario depicted in Figure 3 is very contrived for a geographical database (and for any practical two dimensional application of NN), we show that it is the norm for a broad class of data distributions in high dimensionality. To establish this, we will examine the number of points that fall into a query sphere enlarged by some factor ε (see Figure 4). If few points fall into this enlarged sphere, it means that the data point nearest to the query point is separated from the rest of the data in a meaningful way. On the other hand, if many (let alone most!) data points fall into this enlarged sphere, differentiating the "nearest neighbor" from these other data points is meaningless if ε is small. We use the notion of instability for describing this phenomenon.

Definition 1 A nearest neighbor query is unstable for a given ε if the distance from the query point to most data points is less than (1 + ε) times the distance from the query point to its nearest neighbor.

We show that in many situations, for any fixed ε > 0, as dimensionality rises, the probability that a query is unstable converges to 1. Note that the points that fall in the enlarged query region are the valid answers to the approximate nearest neighbors problem (described in [6]).
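As an illustration of Definition 1 (our sketch, not part of the original text; it assumes a Euclidean distance and NumPy, and the function name is ours), the following code measures the fraction of data points that fall inside the (1 + ε)-enlarged query sphere; a query is unstable for that ε when this fraction exceeds one half:

import numpy as np

def enlarged_sphere_fraction(data, query, eps):
    # Fraction of data points whose distance to the query is at most
    # (1 + eps) times the nearest-neighbor distance.
    dists = np.linalg.norm(data - query, axis=1)
    dmin = dists.min()
    return np.mean(dists <= (1.0 + eps) * dmin)

# Example with IID uniform data: the fraction is tiny in 2 dimensions
# and grows toward 1 as dimensionality increases.
rng = np.random.default_rng(0)
for m in (2, 20, 100):
    data = rng.random((100_000, m))
    query = rng.random(m)
    print(m, enlarged_sphere_fraction(data, query, eps=0.1))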

3 NN in High-Dimensional Spaces

This section contains our formulation of the problem, our formal analysis of the effect of dimensionality on the meaning of the result, and some formal implications of the result that enhance understanding of our primary result.

3.1 Notational Conventions

We use the following notation in the rest of the paper:

- A vector: $\vec{x}$.
- Probability of an event e: P[e].
- Expectation of a random variable X: E[X].
- Variance of a random variable X: var(X).
- IID: Independent and identically distributed. (This phrase is used with reference to the values assigned to a collection of random variables.)
- $\vec{X} \sim \mathcal{F}$: A random variable $\vec{X}$ that takes on values following the distribution $\mathcal{F}$.
[Figure 2: The data points are approximations. Each circle denotes a region where the true data point is supposed to be.]

[Figure 3: Another query point and its nearest neighbor.]

[Figure 4: Illustration of query region and enlarged region. (DMIN is the distance to the nearest neighbor, and DMAX the distance to the farthest data point; the enlarged region has radius (1+ε)DMIN.)]

3.2 Some Results from Probability Theory

Definition 2 A sequence of random vectors (all vectors have the same arity) A_1, A_2, ... converges in probability to a constant vector c if for all ε > 0 the probability of A_m being at most ε away from c converges to 1 as m → ∞. In other words:

\[
\forall \varepsilon > 0, \quad \lim_{m \to \infty} P\left[ \left\| \vec{A}_m - \vec{c} \right\| \le \varepsilon \right] = 1 .
\]

We denote this property by A_m →_p c. We also treat random variables that are not vectors as vectors with arity 1.

Lemma 1 If B_1, B_2, ... is a sequence of random variables with finite variance and lim_{m→∞} E[B_m] = b and lim_{m→∞} var(B_m) = 0, then B_m →_p b.

A version of Slutsky's theorem Let A_1, A_2, ... be random variables (or vectors) and g be a continuous function. If A_m →_p c and g(c) is finite, then g(A_m) →_p g(c).

Corollary 1 (to Slutsky's theorem) If X_1, X_2, ... and Y_1, Y_2, ... are sequences of random variables such that X_m →_p a and Y_m →_p b ≠ 0, then X_m/Y_m →_p a/b.

3.3 Nearest Neighbor Formulation

Given a data set and a query point, we want to analyze how much the distance to the nearest neighbor differs from the distance to other data points. We do this by evaluating the number of points that are no farther away than a factor ε larger than the distance between the query point and the NN, as illustrated in Figure 4. When examining this characteristic, we assume nothing about the structure of the distance calculation.

We will study this characteristic by examining the distribution of the distance between query points and data points as some variable m changes. Note that eventually, we will interpret m as dimensionality. However, nowhere in the following proof do we rely on that interpretation. One can view the proof as a convergence condition on a series of distributions (which we happen to call distance distributions) that provides us with a tool to talk formally about the "dimensionality curse".

We now introduce several terms used in stating our result formally.

Definition 3:
- m is the variable that our distance distributions may converge under (m ranges over all positive integers).
- F^data_1, F^data_2, ... is a sequence of data distributions.
- F^query_1, F^query_2, ... is a sequence of query distributions.
- n is the (fixed) number of samples (data points) from each distribution.
- For all m, P_{m,1}, ..., P_{m,n} are n independent data points per m such that P_{m,i} ~ F^data_m.
- Q_m ~ F^query_m is a query point chosen independently from all P_{m,i}.
- 0 < p < ∞ is a constant.
- For all m, d_m is a function that takes a data point from the domain of F^data_m and a query point from the domain of F^query_m and returns a non-negative real number as a result.
- DMIN_m = min{ d_m(P_{m,i}, Q_m) | 1 ≤ i ≤ n }.
- DMAX_m = max{ d_m(P_{m,i}, Q_m) | 1 ≤ i ≤ n }.

3.4 Instability Result
Our main theoretical tool is presented below. In essence, it states that assuming the distance distribution behaves a certain way as m increases, the difference in distance between the query point and all data points becomes negligible (i.e., the query becomes unstable). Future sections will show that the necessary behavior described in this section identifies a large class (larger than any other classes we are aware of for which the distance result is either known or can be readily inferred from known results) of workloads. More formally, we show:

Theorem 1 Under the conditions in Definition 3, if

\[
\lim_{m \to \infty} \operatorname{var}\!\left( \frac{(d_m(P_{m,1}, Q_m))^p}{E\left[(d_m(P_{m,1}, Q_m))^p\right]} \right) = 0 \tag{1}
\]

then for every ε > 0

\[
\lim_{m \to \infty} P\left[\, DMAX_m \le (1+\varepsilon)\, DMIN_m \,\right] = 1 .
\]

Proof. Let μ_m = E[(d_m(P_{m,i}, Q_m))^p]. (Note that the value of this expectation is independent of i since all P_{m,i} have the same distribution.) Let V_m = (d_m(P_{m,1}, Q_m))^p / μ_m.

Part 1: We'll show that V_m →_p 1. It follows that E[V_m] = 1 (because V_m is a random variable divided by its expectation). Trivially, lim_{m→∞} E[V_m] = 1. The condition of the theorem (Equation 1) means that lim_{m→∞} var(V_m) = 0. This, combined with lim_{m→∞} E[V_m] = 1, enables us to use Lemma 1 to conclude that V_m →_p 1.

Part 2: We'll show that if V_m →_p 1 then lim_{m→∞} P[DMAX_m ≤ (1 + ε) DMIN_m] = 1. Let

\[
\vec{X}_m = \left( \frac{(d_m(P_{m,1}, Q_m))^p}{\mu_m}, \ldots, \frac{(d_m(P_{m,n}, Q_m))^p}{\mu_m} \right)
\]

(a vector of arity n). Since each component of the vector X_m has the same distribution as V_m, it follows that X_m →_p (1, ..., 1). Since min and max are continuous functions we can conclude from Slutsky's theorem that min(X_m) →_p min(1, ..., 1) = 1, and similarly, max(X_m) →_p 1. Using Corollary 1 on max(X_m) and min(X_m) we get

\[
\frac{\max(\vec{X}_m)}{\min(\vec{X}_m)} \;\to_p\; \frac{1}{1} = 1 .
\]

Note that (DMIN_m)^p = μ_m min(X_m) and (DMAX_m)^p = μ_m max(X_m). So

\[
\left( \frac{DMAX_m}{DMIN_m} \right)^{p} = \frac{\mu_m \max(\vec{X}_m)}{\mu_m \min(\vec{X}_m)} = \frac{\max(\vec{X}_m)}{\min(\vec{X}_m)} ,
\]

and therefore, applying Slutsky's theorem once more (to the continuous function x ↦ x^{1/p}),

\[
\frac{DMAX_m}{DMIN_m} \;\to_p\; 1 .
\]

By definition of convergence in probability we have that for all ε > 0,

\[
\lim_{m \to \infty} P\left[ \left| \frac{DMAX_m}{DMIN_m} - 1 \right| \le \varepsilon \right] = 1 .
\]

Also,

\[
P\left[ DMAX_m \le (1+\varepsilon)\, DMIN_m \right]
= P\left[ \frac{DMAX_m}{DMIN_m} - 1 \le \varepsilon \right]
= P\left[ \left| \frac{DMAX_m}{DMIN_m} - 1 \right| \le \varepsilon \right]
\]

(P[DMAX_m ≥ DMIN_m] = 1, so the absolute value in the last term has no effect). Thus,

\[
\lim_{m \to \infty} P\left[ DMAX_m \le (1+\varepsilon)\, DMIN_m \right]
= \lim_{m \to \infty} P\left[ \left| \frac{DMAX_m}{DMIN_m} - 1 \right| \le \varepsilon \right] = 1 . \qquad \square
\]

In summary, the above theorem says that if the precondition holds (i.e., if the distance distribution behaves a certain way as m increases), all points converge to the same distance from the query point. Thus, under these conditions, the concept of nearest neighbor is no longer meaningful.

We may be able to use this result by directly showing that V_m →_p 1 and using Part 2 of the proof. (For example, for IID distributions, V_m →_p 1 follows readily from the Weak Law of Large Numbers.) Later sections demonstrate that our result provides us with a handy tool for discussing scenarios resistant to analysis using law of large numbers arguments. From a more practical standpoint, there are two issues that must be addressed to determine the theorem's impact:

- How restrictive is the condition

\[
\lim_{m \to \infty} \operatorname{var}\!\left( \frac{(d_m(P_{m,1}, Q_m))^p}{E\left[(d_m(P_{m,1}, Q_m))^p\right]} \right)
= \lim_{m \to \infty} \frac{\operatorname{var}\!\left((d_m(P_{m,1}, Q_m))^p\right)}{\left(E\left[(d_m(P_{m,1}, Q_m))^p\right]\right)^2} = 0 \tag{2}
\]

which is necessary for our results to hold? In other words, it says that as we increase m and examine the resulting distribution of distances between queries and data, the variance of the distance distribution scaled by the overall magnitude of the distance converges to 0. To provide a better understanding of the restrictiveness of this condition, Sections 3.5 and 4 discuss scenarios that do and do not satisfy it.

- For situations in which the condition is satisfied, at what rate do distances between points become indistinct as dimensionality increases? In other words, at what dimensionality does the concept of "nearest neighbor" become meaningless? This issue is more difficult to tackle analytically. We therefore performed a set of simulations that examine the relationship between m and the ratio of minimum and maximum distances with respect to the query point. The results of these simulations are presented in Section 5 and in [13].
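Both questions can also be probed numerically before attempting any analysis. The sketch below (an illustration of ours, assuming IID uniform(0,1) data and queries, the L_p metric, and NumPy; the function name is ours) estimates the relative variance of Equation 2 by Monte Carlo and shows it shrinking as m grows:

import numpy as np

def relative_variance(m, p=2, n_pairs=20_000, seed=0):
    # Monte Carlo estimate of var(d^p) / E[d^p]^2 for independent
    # uniform(0,1) data and query points under the L_p metric.
    rng = np.random.default_rng(seed)
    x = rng.random((n_pairs, m))             # data points
    q = rng.random((n_pairs, m))             # independent query points
    dp = np.sum(np.abs(x - q) ** p, axis=1)  # d_m(x, q)^p
    return dp.var() / dp.mean() ** 2

for m in (1, 10, 100, 1000):
    print(m, relative_variance(m))           # decays roughly like 1/m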
3.5 Application of Our Theoretical Result

This section analyses the applicability of Theorem 1 in formally defined situations. This is done by determining, for each scenario, whether the condition in Equation 2 is satisfied. Due to space considerations, we do not give a proof whether the condition in Equation 2 is satisfied or not. [13] contains a full analysis of each example.

All of these scenarios define a workload and use an L_p distance metric over multidimensional query and data points with dimensionality m. (This makes the data and query points vectors with arity m.) It is important to notice that this is the first section to assign a particular meaning to d_m (as an L_p distance metric), p (as the parameter to L_p), and m (as dimensionality). Theorem 1 did not make use of these particular meanings.

We explore some scenarios that satisfy Equation 2 and some that do not. We start with basic IID assumptions and then relax these assumptions in various ways. We start with two "sanity checks": we show that distances converge with IID dimensions (Section 3.5.1), and we show that Equation 2 is not satisfied when the data and queries fall on a line (Section 3.5.2). We then discuss examples involving correlated attributes and differing variance between dimensions, to illustrate scenarios where the Weak Law of Large Numbers cannot be applied (Sections 3.5.3, 3.5.4, and 3.5.5).

3.5.1 IID Dimensions with Query and Data Independence

Assume the following:
- The data distribution and query distribution are IID in all dimensions.
- All the appropriate moments are finite (i.e., up to the ⌈2p⌉-th moment).
- The query point is chosen independently of the data points.

The conditions of Theorem 1 are satisfied under these assumptions. While this result is not original, it is a nice "sanity check." (In this very special case we can prove Part 1 of Theorem 1 by using the weak law of large numbers. However, this is not true in general.) The assumptions of this example are by no means necessary for Theorem 1 to be applicable. Throughout this section, there are examples of workloads which cannot be discussed using the Weak Law of Large Numbers. While there are innumerable slightly stronger versions of the Weak Law of Large Numbers, Section 3.5.5 contains an example which meets our condition, and for which the Weak Law of Large Numbers is inapplicable.
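For concreteness, here is the one-line calculation behind this sanity check (our addition; the full treatment of each example is in [13]). Under the IID assumptions, write the L_p distance raised to the power p as a sum of m IID per-dimension terms with mean μ > 0 and variance σ² (both finite by the moment assumption, and P_{m,1,i}, Q_{m,i} denoting i-th coordinates):

\[
(d_m(P_{m,1}, Q_m))^p = \sum_{i=1}^{m} \left| P_{m,1,i} - Q_{m,i} \right|^p ,
\qquad
\frac{\operatorname{var}\!\left((d_m(P_{m,1}, Q_m))^p\right)}{\left(E\!\left[(d_m(P_{m,1}, Q_m))^p\right]\right)^2}
= \frac{m\,\sigma^2}{m^2\,\mu^2}
= \frac{\sigma^2}{m\,\mu^2}
\;\longrightarrow\; 0 ,
\]

so Equation 2 holds.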
3.5.2 Identical Dimensions with no Independence

We use the same notation as in the previous example. In contrast to the previous case, consider the situation where all dimensions of both the query point and the data points follow identical distributions, but are completely dependent (i.e., value for dimension 1 = value for dimension 2 = ...). Conceptually, the result is a set of data points and a query point on a diagonal line. No matter how many dimensions are added, the underlying query can actually be converted to a one-dimensional nearest neighbor problem.

It is not surprising to find that the condition of Theorem 1 is not satisfied.
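The failure is easy to see directly (our addition, for concreteness). Writing X_i and Q_i for the i-th coordinates of the data point and the query point, complete dependence makes every dimension contribute the identical term:

\[
(d_m(P_{m,1}, Q_m))^p = \sum_{i=1}^{m} \left|X_i - Q_i\right|^p = m\,\left|X_1 - Q_1\right|^p ,
\qquad
\frac{\operatorname{var}\!\left((d_m(P_{m,1}, Q_m))^p\right)}{\left(E\!\left[(d_m(P_{m,1}, Q_m))^p\right]\right)^2}
= \frac{\operatorname{var}\!\left(\left|X_1 - Q_1\right|^p\right)}{\left(E\!\left[\left|X_1 - Q_1\right|^p\right]\right)^2} ,
\]

a positive constant that does not depend on m, so the relative variance in Equation 2 does not converge to 0.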
3.5.3 Unique Dimensions with Correlation Between All Dimensions

In this example, we intentionally break many assumptions underlying the IID case. Not only is every dimension unique, but all dimensions are correlated with all other dimensions and the variance of each additional dimension increases. The following is a description of the problem.

We generate an m dimensional data point (or query point) X_m = (X_1, ..., X_m) as follows:
- First we take independent random variables U_1, ..., U_m such that U_i ~ Uniform(0, √i).
- We define X_1 = U_1.
- For all 2 ≤ i ≤ m define X_i = U_i + (X_{i-1}/2).

The condition of Theorem 1 is satisfied.

3.5.4 Variance Converging to 0

This example illustrates that there are workloads that meet the preconditions of Theorem 1, even though the variance of the distance in each added dimension converges to 0. One would expect that only some finite number of the earlier dimensions would dominate the distance. Again, this is not the case.

Suppose we choose a point X_m = (X_1, ..., X_m) such that the X_i's are independent and X_i ~ N(0, 1/i). Then the condition of Theorem 1 is satisfied.
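To experiment with these two workloads, the following sketch (our addition; the generator names are ours, and it assumes NumPy and a Euclidean metric) produces sample points from each and reports the shrinking contrast between the farthest and nearest distances:

import numpy as np

def recursive_point(m, rng):
    # One point from the workload of Section 3.5.3:
    # U_i ~ Uniform(0, sqrt(i)), X_1 = U_1, X_i = U_i + X_{i-1}/2.
    u = rng.uniform(0.0, np.sqrt(np.arange(1, m + 1)))
    x = np.empty(m)
    x[0] = u[0]
    for i in range(1, m):
        x[i] = u[i] + x[i - 1] / 2.0
    return x

def shrinking_variance_point(m, rng):
    # One point from the workload of Section 3.5.4: X_i ~ N(0, 1/i),
    # so the variance of each added dimension converges to 0.
    return rng.normal(0.0, 1.0 / np.sqrt(np.arange(1, m + 1)))

rng = np.random.default_rng(1)
for m in (2, 20, 200):
    data = np.array([recursive_point(m, rng) for _ in range(5_000)])
    query = recursive_point(m, rng)
    d = np.linalg.norm(data - query, axis=1)
    print(m, d.max() / d.min())   # the ratio drops toward 1 as m grows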

3.5.5 Marginal Data and Query Distributions Change with Dimensionality

In this example, the marginal distributions of data and queries change with dimensionality. Thus, the distance distribution as dimensionality increases cannot be described as the distance in a lower dimensionality plus some new component from the new dimension. As a result, the weak law of large numbers, which implicitly is about sums of increasing size, cannot provide insight into the behavior of this scenario. The distance distributions must be treated, as our technique suggests, as a series of random variables whose variance and expectation can be calculated and examined in terms of dimensionality.

Let the m dimensional data space S_m be the boundary of an m dimensional unit hyper-cube (i.e., S_m = [0,1]^m \ (0,1)^m). In addition, let the distribution of data points be uniform over S_m. In other words, every point in S_m has equal probability of being sampled as a data point. Lastly, the distribution of query points is identical to the distribution of data points.

Note that the dimensions are not independent. Even in this case, the condition of Theorem 1 is satisfied.
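One way to sample this workload (an illustrative sketch of ours, assuming NumPy; every face of the hyper-cube has equal (m-1)-dimensional volume, so picking a face uniformly and filling the remaining coordinates uniformly is uniform over the boundary):

import numpy as np

def sample_hypercube_boundary(n, m, rng):
    # n points uniform on the boundary of [0,1]^m: pin one coordinate
    # to 0 or 1, draw the rest uniformly in (0,1).
    x = rng.random((n, m))
    face = rng.integers(0, m, size=n)                  # which coordinate is pinned
    side = rng.integers(0, 2, size=n).astype(float)    # pinned to the 0-face or 1-face
    x[np.arange(n), face] = side
    return x

rng = np.random.default_rng(2)
pts = sample_hypercube_boundary(100_000, 30, rng)
q = sample_hypercube_boundary(1, 30, rng)[0]
d = np.linalg.norm(pts - q, axis=1)
print(d.max() / d.min())   # contrast shrinks as m grows, as Theorem 1 predicts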
4 Meaningful Applications of High Dimensional Indexing
In this section, we place Theorem 1 in perspective, and observe that it should not be interpreted to mean that high-dimensional indexing is never meaningful. We do this by identifying scenarios that arise in practice and that are likely to have good separation between nearest and farthest neighbors.

4.1 Classification and Approximate Matching

To begin with, exact match and approximate match queries can be reasonable. For instance, if there is dependence between the query point and the data points such that there exists some data point which matches the query point exactly, then DMIN_m = 0. Thus, assuming that most of the data points aren't duplicates, a meaningful answer can be determined. Furthermore, if the problem statement is relaxed to require that the query point be within some small distance δ of a data point (instead of being required to be identical to a data point), we can still call the query meaningful. Note, however, that staying within some δ becomes more and more difficult as m increases since we are adding terms to the sum in the distance metric. For this version of the problem to remain meaningful as dimensionality increases, the query point must be increasingly closer to some data point.

[Figure 5: Nearest neighbor query in clustered data.]

We can generalize the situation further as follows: The data consists of a set of randomly chosen points together with additional points distributed in clusters of some radius δ around one or more of the original points, and the query is required to fall within one of the data clusters (see Figure 5). This situation is the perfectly realized classification problem, where data naturally falls into discrete classes or clusters in some potentially high dimensional feature space. Figure 6 depicts a typical distance distribution in such a scenario. There is a cluster (the one into which the query point falls) that is closer than the others, which are all, more or less, indistinguishable in distance. Indeed, the proper response to such a query is to return all points within the closest cluster, not just the nearest point (which quickly becomes meaningless compared to other points in the cluster as dimensionality increases).

Observe, however, that if we don't guarantee that the query point falls within some cluster, then the cluster from which the nearest neighbor is chosen is subject to the same meaningfulness limitations as the choice of nearest neighbor in the original version of the problem; Theorem 1 then applies to the choice of the "nearest cluster".

[Figure 6: Probability density function of distance between random clustered data and query points.]

4.2 Implicitly Low Dimensionality

Another possible scenario where high dimensional nearest neighbor queries are meaningful occurs when the underlying dimensionality of the data is much lower than the actual dimensionality. There has been recent work on identifying these situations (e.g. [17, 8, 16]) and determining the useful dimensions (e.g. [20], which uses principal component analysis to identify meaningful dimensions). Of course, these techniques are only useful if NN in the underlying dimensionality is meaningful.

5 Experimental Studies of NN

Theorem 1 only tells us what happens when we take the dimensionality to infinity. In practice, at what dimensionality do we anticipate nearest neighbors to become unstable? In other words, Theorem 1 describes some convergence but does not tell us the rate of convergence. We addressed this issue through empirical studies. Due to lack of space, we present only three synthetic workloads and one real data set. [13] includes additional synthetic workloads along with workloads over a second real data set.

[Figure 7: Correlated distributions, one million tuples. (Average DMAX/DMIN on a log scale versus dimensionality m, for the "uniform", "two degrees of freedom", and "recursive" workloads.)]

We ran experiments with one IID uniform(0,1) workload and two different correlated workloads. Figure 7 shows the average DMAX_m/DMIN_m over 1000 query points, as dimensionality increases, on synthetic data sets of one million tuples. The workload for the "recursive" line (described in Section 3.5.3) has correlation between every pair of dimensions, and every new dimension has a larger variance. The "two degrees of freedom" workload generates query and data points on a two dimensional plane, and was generated as follows:

- Let a_1, a_2, ... and b_1, b_2, ... be constants in (-1,1).
- Let U_1, U_2 be independent uniform(0,1).
- For all 1 ≤ i ≤ m let X_i = a_i U_1 + b_i U_2.

This last workload does not satisfy Equation 2. Figure 7 shows that the "two degrees of freedom" workload behaves similarly to the (one or) two dimensional uniform workload, regardless of the dimensionality. However, the recursive workload (as predicted by our theorem) was affected by dimensionality. More interestingly, even with all the correlation and changing variances, the recursive workload behaved almost the same as the IID uniform case!
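The "two degrees of freedom" generator is straightforward to reproduce. The sketch below (our addition, at a much smaller scale than the paper's one-million-tuple, 1000-query experiments; function name and constants are ours) mirrors the construction above and shows that the contrast between farthest and nearest distances does not collapse as m grows:

import numpy as np

rng = np.random.default_rng(3)

def make_two_dof_workload(m, rng):
    # Fixed constants a_i, b_i in (-1, 1); every point is
    # X_i = a_i*U_1 + b_i*U_2 for its own U_1, U_2 ~ Uniform(0, 1).
    a = rng.uniform(-1.0, 1.0, size=m)
    b = rng.uniform(-1.0, 1.0, size=m)
    def sample(n):
        u = rng.random((n, 2))
        return np.outer(u[:, 0], a) + np.outer(u[:, 1], b)
    return sample

for m in (2, 20, 100):
    sample = make_two_dof_workload(m, rng)
    data, query = sample(100_000), sample(1)[0]
    d = np.linalg.norm(data - query, axis=1)
    print(m, d.max() / d.min())   # stays large: the intrinsic dimensionality is 2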
This graph demonstrates that our geometric intuition for nearest neighbor, which is based on one, two and three dimensions, fails us at an alarming rate as dimensionality increases. The distinction between nearest and farthest points, even at ten dimensions, is a tiny fraction of what it is in one, two, or three dimensions. For one dimension, DMAX_m/DMIN_m for "uniform" is on the order of 10^7, providing plenty of contrast between the nearest object and the farthest object. At 10 dimensions, this contrast is already reduced by 6 orders of magnitude! By 20 dimensions, the farthest point is only 4 times the distance to the closest point. These empirical results suggest that NN can become unstable with as few as 10-20 dimensions.

[Figure 8: 64-D color histogram data. (Cumulative percent of queries versus median distance / k distance, for k = 1, 10, 100, 1000.)]

Figure 8 shows results for experiments done on a real data set. The data set was a 256 dimensional color histogram data set (one tuple per image) that was reduced to 64 dimensions by principal components analysis. There were approximately 13,500 tuples in the data set. We examine k-NN rather than NN because this is the traditional application of data sets from image databases.

To determine the quality of answers for NN queries, we examined the percentage of queries in which at least half the data points were within some factor of the nearest neighbor. Examine the graph at median distance/k distance = 3. The graph says that for k = 1 (the normal NN problem), 15% of the queries had at least half the data within a factor of 3 of the distance to the NN. For k = 10, 50% of the queries had at least half the data within a factor of 3 of the distance to the 10th nearest neighbor. It is easy to see that the effect of changing k on the quality of the answer is most significant for small values of k.

Does this data set provide meaningful answers to the 1-NN problem? the 10-NN problem? the 100-NN problem? Perhaps, but keep in mind that intuitively, most people would expect the median distance/k distance ratios (on the X-axis) to be more in the range of 1000-10,000.

6 Analyzing the Performance of a NN Processing Technique

In this section, we discuss the ramifications of our results when evaluating techniques to solve the NN problem; in particular, many high-dimensional indexing techniques have been motivated by the NN problem. An important point that we make is that all future performance evaluations of high dimensional NN queries must include a comparison to linear scans as a sanity check.

First, our results indicate that while there exist situations in which high dimensional nearest neighbor queries are meaningful, they are very specific in nature and are quite different from the "independent dimensions" basis that most studies in the literature (e.g., [30, 19, 14, 10, 11]) use to evaluate techniques in a controlled manner. In the future, these NN technique evaluations should focus on those situations in which the results are meaningful. For instance, answers are meaningful when the data consists of small, well-formed clusters, and the query is guaranteed to land in or very near one of these clusters.
This point is further enhanced by a corollary to Theorem 1 which shows that, in the most common situations, the average performance of index structures that use convex regions to describe collections of points will deteriorate to a scan of the entire index if the condition of Theorem 1 is satisfied. (Due to lack of space, we do not include the formulation and proof of the corollary in this paper.)

In terms of comparisons between NN techniques, most papers do not compare against the trivial linear scan algorithm. Given our results, which suggest that most of the data must be examined as dimensionality increases, it is not surprising to discover that at relatively few dimensions, linear scan handily beats these (complicated) indexing structures. (Linear scan of a set of sequentially arranged disk pages is much faster than unordered retrieval of the same pages; so much so that secondary indexes are ignored by query optimizers unless the query is estimated to fetch less than 10% of the data pages. Fetching a large number of data pages through a multi-dimensional index usually results in unordered retrieval.)
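A rough back-of-the-envelope version of this comparison (our illustration; it simply reuses the 10-15% cost ratios quoted in this section): if a sequentially read page costs C_seq and a randomly fetched page costs C_rand, an index that touches a fraction f of the N data pages beats a full scan only when

\[
f \cdot N \cdot C_{\mathrm{rand}} \;<\; N \cdot C_{\mathrm{seq}}
\quad\Longleftrightarrow\quad
f \;<\; \frac{C_{\mathrm{seq}}}{C_{\mathrm{rand}}} ,
\]

so with C_seq/C_rand on the order of 0.1-0.15 the index must touch fewer than roughly 10-15% of the pages, while an unstable workload forces it to touch far more.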

For instance, the performance study of the parallel solution to the k-nearest neighbors problem presented in [10] indicates that their solution scales more poorly than a parallel scan of the data, and never beats a parallel scan in any of the presented data.

[30] provides us with information on the performance of both the SS tree and the R* tree in finding the 20 nearest neighbors. Conservatively assuming that linear scans cost 15% of a random examination of the data pages, linear scan outperforms both the SS tree and the R* tree at 10 dimensions in all cases. In [19], linear scan vastly outperforms the SR tree in all of the cases reported in that paper for the 16 dimensional synthetic data set. For a 16 dimensional real data set, the SR tree performs similarly to linear scan in a few experiments, but is usually beaten by linear scan. In [14], performance numbers are presented for NN queries where bounds are imposed on the radius used to find the NN. While the performance in high dimensionality looks good in some cases, in trying to duplicate their results we found that the radius was such that few, if any, queries returned an answer.

While performance of these structures in high dimensionality looks very poor, it is important to keep in mind that all the reported performance studies examined situations in which the difference in distance between the query point and the nearest neighbor differed little from the distance to other data points. Ideally, they should be evaluated for meaningful workloads. These workloads include low dimensional spaces and clustered data/queries as described in Section 4. Some of the existing structures may, in fact, work well in appropriate situations.

7 Related Work

7.1 The Curse of Dimensionality

The term dimensionality curse is often used as a vague indication that high dimensionality causes problems in some situations. The term was first used by Bellman in 1961 [7] for combinatorial estimation of multivariate functions. An example from statistics: in [26] it is used to note that multivariate density estimation is very problematic in high dimensions.

In the area of the nearest neighbors problem it is used for indicating that a query processing technique performs worse as the dimensionality increases.

In [11, 5] it was observed that in some high dimensional cases, the estimate of NN query cost (using some index structure) can be very poor if "boundary effects" are not taken into account. The boundary effect is that the query region (i.e., a sphere whose center is the query point) is mainly outside the hyper-cubic data space. When one does not take into account the boundary effect, the query cost estimate can be much higher than the actual cost. The term dimensionality curse was also used to describe this phenomenon.

In this paper, we discuss the meaning of the nearest neighbor query and not how to process such a query. Therefore, the term dimensionality curse (as used by the NN research community) is only relevant to Section 6, and not to the main results in this paper.

7.2 Computational Geometry

The nearest neighbor problem has been studied in computational geometry (e.g., [4, 5, 6, 9, 12]). However, the usual approach is to take the number of dimensions as a constant and find algorithms that behave well when the number of points is large enough. They observe that the problem is hard and define the approximate nearest neighbor problem as a weaker problem. In [6] there is an algorithm that retrieves an approximate nearest neighbor in O(log n) time for any data set. In [9] there is an algorithm that retrieves the true nearest neighbor in constant expected time under the IID dimensions assumption. However, the constants for those algorithms are exponential in dimensionality. In [6] they recommend not using the algorithm in more than 12 dimensions. It is impractical to use the algorithm in [9] when the number of points is much lower than exponential in the number of dimensions.

7.3 Fractal Dimensions

In [17, 8, 16] it was suggested that real data sets usually have fractal properties (self-similarity, in particular) and that fractal dimensionality is a good tool in determining the performance of queries over the data set.

The following example illustrates that the fractal dimensionality of the data space from which we sample the data points may not always be a good indicator for the utility of nearest neighbor queries. Suppose the data points are sampled uniformly from the vertices of the unit hypercube. The data space is 2^m points (in m dimensions), so its fractal dimensionality is 0. However, this situation is one of the worst cases for nearest neighbor queries. (This is actually the IID Bernoulli(1/2) case, which is even worse than IID uniform.) When the number of data points in this scenario is close to 2^m, nearest neighbor queries become stable, but this is impractical for large m.

However, are there real data sets for which the (estimated) fractal dimensionality is low, yet there is no separation between nearest and farthest neighbors? This is an intriguing question which we intend to explore in future work.

We used the technique described in [8] on two real data sets (described in [13]). However, the fractal dimensionality of those data sets could not be estimated (when we divided the space once in each dimension, most of the data points occupied different cells). We used the same technique on an artificial 100 dimensional data set that has known fractal dimensionality 2 and about the same number of points as the real data sets (generated like the "two degrees of freedom" workload in Section 5, but with less data). The estimate we got for the fractal dimensionality is 1.6 (which is a good estimate). Our conclusion is that the real data sets we used are inherently high dimensional; another possible explanation is that they do not exhibit fractal behavior.

8 Conclusions

In this paper, we studied the effect of dimensionality on NN queries. In particular, we identified a broad class of workloads for which the difference in distance between the nearest neighbor and other points in the data set becomes negligible. This class of distributions includes distributions typically used to evaluate NN processing techniques. Many applications use NN as a heuristic (e.g., feature vectors that describe images). In such cases, query instability is an indication of a meaningless query.

To find the dimensionality at which NN breaks down, we performed extensive simulations. The results indicated that the distinction in distance decreases fastest in the first 20 dimensions, quickly reaching a point where the difference in distance between a query point and the nearest and farthest data points drops below a factor of four. In addition to simulated workloads, we also examined two real data sets that behaved similarly (see [13]).

In addition to providing intuition and examples of distributions in that class, we also discussed situations in which NN queries do not break down in high dimensionality. In particular, the ideal data sets and workloads for classification/clustering algorithms seem reasonable in high dimensionality. However, if the scenario is deviated from (for instance, if the query point does not lie in a cluster), the queries become meaningless.

The practical ramifications of this paper are for the following two scenarios:
Evaluating a NN workload. Make sure that the distance distribution (between a random query point and a random data point) allows for enough contrast for your application. If the distance to the nearest neighbor is not much different from the average distance, the nearest neighbor may not be useful (or the most "similar").

Evaluating a NN processing technique. When evaluating a NN processing technique, test it on meaningful workloads. Examples of such workloads are given in Section 4. Also, one should ensure that a new processing technique outperforms the most trivial solutions (e.g., sequential scan).

References

[1] R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. In Proc. 4th Inter. Conf. on FODO, pages 69-84, 1993.

[2] S. F. Altschul, W. Gish, W. Miller, E. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215:403-410, 1990.

[3] Y. H. Ang, Zhao Li, and S. H. Ong. Image retrieval based on multidimensional feature properties. In SPIE Vol. 2420, pages 47-57, 1995.

[4] S. Arya. Nearest Neighbor Searching and Applications. PhD thesis, Univ. of Maryland at College Park, 1995.

[5] S. Arya, D. M. Mount, and O. Narayan. Accounting for boundary effects in nearest neighbors searching. In Proc. 11th ACM Symposium on Computational Geometry, pages 336-344, 1995.

[6] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for nearest neighbor searching. In Proc. 5th ACM-SIAM Symposium on Discrete Algorithms, pages 573-582, 1994.

[7] R. E. Bellman. Adaptive Control Processes. Princeton University Press, 1961.

[8] A. Belussi and C. Faloutsos. Estimating the selectivity of spatial queries using the 'correlation' fractal dimension. In Proc. VLDB, pages 299-310, 1995.

[9] J. L. Bentley, B. W. Weide, and A. C. Yao. Optimal expected-time algorithms for closest point problems. ACM Transactions on Mathematical Software, 6(4):563-580, 1980.

[10] S. Berchtold, C. Böhm, B. Braunmüller, D. A. Keim, and H.-P. Kriegel. Fast parallel similarity search in multimedia databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 1-12, 1997.

[11] S. Berchtold, C. Böhm, D. A. Keim, and H.-P. Kriegel. A cost model for nearest neighbor search in high-dimensional data space. In Proc. 16th ACM SIGACT-SIGMOD-SIGART Symposium on PODS, pages 78-86, 1997.

[12] M. Bern. Approximate closest point queries in high dimensions. Information Processing Letters, 45:95-99, 1993.

[13] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is nearest neighbors meaningful? Technical Report TR1377, Computer Sciences Dept., Univ. of Wisconsin-Madison, June 1998.

[14] T. Bozkaya and M. Ozsoyoglu. Distance-based indexing for high-dimensional metric spaces. In Proc. 16th ACM SIGACT-SIGMOD-SIGART Symposium on PODS, pages 357-368, 1997.

[15] C. Faloutsos et al. Efficient and effective querying by image content. Journal of Intelligent Information Systems, 3(3):231-262, 1994.

[16] C. Faloutsos and V. Gaede. Analysis of n-dimensional quadtrees using the Hausdorff fractal dimension. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1996.

[17] C. Faloutsos and I. Kamel. Beyond uniformity and independence: Analysis of R-trees using the concept of fractal dimension. In Proc. 13th ACM SIGACT-SIGMOD-SIGART Symposium on PODS, pages 4-13, 1994.

[18] U. M. Fayyad and P. Smyth. Automated analysis and exploration of image databases: Results, progress and challenges. Journal of Intelligent Information Systems, 4(1):7-25, 1995.

[19] N. Katayama and S. Satoh. The SR-tree: An index structure for high-dimensional nearest neighbor queries. In Proc. 16th ACM SIGACT-SIGMOD-SIGART Symposium on PODS, pages 369-380, 1997.

[20] K.-I. Lin, H. V. Jagadish, and C. Faloutsos. The TV-Tree: An index structure for high-dimensional data. VLDB Journal, 3(4):517-542, 1994.

[21] B. S. Manjunath and W. Y. Ma. Texture features for browsing and retrieval of image data. IEEE Trans. on Pattern Analysis and Machine Intelligence, 18(8):837-842, 1996.

[22] R. Mehrotra and J. E. Gary. Feature-based retrieval of similar shapes. In 9th Data Engineering Conference, pages 108-115, 1992.

[23] H. Murase and S. K. Nayar. Visual learning and recognition of 3D objects from appearance. International Journal of Computer Vision, 14(1):5-24, 1995.

[24] S. A. Nene and S. K. Nayar. A simple algorithm for nearest neighbor search in high dimensions. IEEE Trans. on Pattern Analysis and Machine Intelligence, 18(8):989-1003, 1996.

[25] A. Pentland, R. W. Picard, and S. Sclaroff. Photobook: Tools for content-based manipulation of image databases. In SPIE Vol. 2185, pages 34-47, 1994.

[26] D. W. Scott. Multivariate Density Estimation. Wiley Interscience, 1992.

[27] M. J. Swain and D. H. Ballard. Color indexing. International Journal of Computer Vision, 7(1):11-32, 1991.

[28] D. L. Swets and J. Weng. Using discriminant eigenfeatures for image retrieval. IEEE Trans. on Pattern Analysis and Machine Intelligence, 18(8):831-836, 1996.

[29] G. Taubin and D. B. Cooper. Recognition and positioning of rigid objects using algebraic moment invariants. In SPIE Vol. 1570, pages 318-327, 1991.

[30] D. A. White and R. Jain. Similarity indexing with the SS-Tree. In ICDE, pages 516-523, 1996.

