
Spatial Modeling of Habitat Preferences of Biological Species using Markov


Random Fields

Article  in  Journal of Applied Statistics · September 2007


DOI: 10.1080/02664760701240782 · Source: RePEc


C. Díaz-Avalos
Universidad Nacional Autónoma de México


Publisher: Routledge
Informa Ltd Registered in England and Wales Registered Number: 1072954
Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Journal of Applied Statistics


Publication details, including instructions for authors and subscription information:
http://www.informaworld.com/smpp/title~content=t713428038
Spatial Modeling of Habitat Preferences of Biological
Species using Markov Random Fields
Carlos Díaz Avalos a
a
Departmento de Probabilidad y Estadística, Universidad Nacional Autónoma de
México, Mexico

Online Publication Date: 01 September 2007


To cite this Article: Avalos, Carlos Díaz (2007) 'Spatial Modeling of Habitat
Preferences of Biological Species using Markov Random Fields', Journal of Applied
Statistics, 34:7, 807 - 821
To link to this article: DOI: 10.1080/02664760701240782
URL: http://dx.doi.org/10.1080/02664760701240782

Journal of Applied Statistics
Vol. 34, No. 7, 807–821, September 2007
Downloaded By: [UNAM - IIMAS] At: 19:45 14 September 2007

Spatial Modeling of Habitat Preferences of Biological Species using Markov Random Fields

CARLOS DÍAZ AVALOS


Departmento de Probabilidad y Estadística, Universidad Nacional Autónoma de México, Mexico

ABSTRACT Spatial modeling has gained interest in ecology during the past two decades, especially
in the area of biodiversity, where reliable distribution maps are required. Several methods have
been proposed to construct distribution maps, most of them acknowledging the presence of spatial
interactions. In many cases, a key problem is the lack of true absence data. We present here a model
suitable for use when true absence data are missing. The quality of the estimates obtained from the
model is evaluated using ROC curve analysis as well as a quadratic cost function, computed from
the false positive and false negative error rates. The model is also tested under random and clustered
scattering of the presence records. We also present an application of the model to the construction of
distribution maps of two endemic bird species in México.

KEY WORDS: Biodiversity maps, Markov random fields, spatial modeling, autologistic model,
species distribution

Introduction
The construction of distribution maps for biological species has gained considerable attention
in the last two decades due, in part, to the recognition of the need to preserve biological
diversity. In Mexico, the official agency in charge of preserving biological diversity
is the National Commission for the Knowledge and Use of Biodiversity (CONABIO).
CONABIO is also the agency responsible for maintaining and updating the national inventory
of biodiversity. To accomplish this task, CONABIO produces distribution maps for
many of the species reported as present in Mexico. Such distribution maps are constructed
by fitting habitat preference models to records of geographical observations of the target
species. Habitat preference models relate the locations where the records were reported
to physical and ecological covariates. Such models are an important tool for investigating
the habitat requirements of species and for understanding the patterns of biodiversity
(Austin & Meyers, 1996; Jarvis & Robertson, 1999; Stockwell & Peterson, 2002). Further,
the geographical distribution of different taxa must be assessed if priorities for conservation
action are to be established (Peterson et al., 2002). Thus, modeling habitat preference is
essential to ensure consistency of the distribution maps, while reducing the time and costs
of large-scale studies of biodiversity.

Correspondence Address: Carlos Díaz Avalos, Departmento de Probabilidad y Estadística, Universidad Nacional
Autónoma de México, Apartado Postal 20-726, México D.F. C.P. 01000. Email: carlos@sigma.iimas.unam.mx
0266-4763 Print/1360-0532 Online/07/070807–15 © 2007 Taylor & Francis
DOI: 10.1080/02664760701240782
The majority of biological inventory data are incomplete, concentrated near roads and
areas of easy access and scarce in areas too difficult to reach. As in many countries, most of
the survey records in Mexico report the presence of species, but rarely if ever do they include
information about surveyed localities where the target species was searched but not found,
resulting in a lack of true absence data. This non-systematic sampling makes a
detailed statistical analysis of the data difficult (Peterson et al., 1998, 2002) and precludes the use of
straightforward methods such as logistic regression (McCullagh & Nelder, 1989) or spatially
corrected versions such as those of Díaz-Avalos et al. (2001) or Gumpertz et al. (1997).
Those difficulties have led to the development of alternative methods to construct distribution
maps for biological species, which acknowledge the relationship between the presence of
the species and variables related to climate or, more properly, variables related to the habitat
needs of the species.
Owing to the binary nature of the data it is more relevant in principle to construct maps
showing the probability that the target species includes a location within its distribution
range. We will refer to these probabilities as ‘probability of presence’ and they will be
denoted by π . Estimation of π may be based on logistic regression (Osborne & Tigar, 1992;
Buckland & Elston, 1993), although other approaches use climatic envelopes (Busby, 1991),
decision trees (Stockwell et al., 1990; Moore et al., 1991), genetic algorithms (Stockwell &
Noble, 1992) and Markov random fields (Augustin et al., 1996; Hoeting et al., 2001).
These methods have been used with relative success to construct probability of presence
maps and are based on the same principle: given a region $D$ divided into $N$ subregions and
inventory data at a set of subregions $\{s_i, i = 1, \ldots, K\}$, $K \ll N$, estimate the probability of
presence $\pi$ using information on $p$ covariates $z_1, \ldots, z_p$. For decision-making processes, the
probabilities are converted to presence-absence areas using a binary classification. Usually,
the user chooses a threshold value $\pi^*$ such that areas of presence are those where $\hat{\pi}_i > \pi^*$.
The resulting maps are used by agencies such as CONABIO to decide on issues such as
land use in areas not yet opened to human activities.
As with every classification process, there is always a fraction of items that are classified
erroneously. In the case of distribution maps, wrong classifications occur either because the
species is classified as missing in an area where it is present (false negative) or because an
area where the species is missing is classified as inhabited by it (false positive). The true
distribution range for many biological species is unknown, and the calibration of classification
methods has been approached in two ways. One consists of splitting the data set
into two subsets, one used to fit the model and the other used to compare the model results.
This method is known as cross-validation and is widely used in the geostatistical context
(Armstrong, 1984; Chilès & Delfiner, 1999), and in the construction of distribution maps
by methods such as FLORAMAP (Busby, 1991) and GARP (Stockwell, 1993). The other
method to calibrate a classifier is by constructing an artificial distribution map, sampling
presence records from it and using those data to reconstruct the artificial distribution map.
Because the true distribution is known in this case, it is possible to measure the quality of
the map using ROC analysis such as is done in Hoeting et al. (2001).
In this paper, we present a model to construct maps of probability of presence for biolog-
ical species, based only on records of presence. The model is based on ideas proposed by
Heikkinen & Högmander (1994) and Hoeting et al. (2001). The first two authors propose
the use of auxiliary information provided by surveys on widespread species in order to
quantify the search intensity of the target species. Hoeting et al. (2001) include the use of
covariate information as well as information on whether a pixel was surveyed or not for the
target species. In both approaches there is information on true absence of the target species.
Because information in Mexico is mostly based on recordings found in museums and few
isolated surveys, we lack information about pixel surveys, so the methods of Heikkinen &
Högmander (1994) and Hoeting et al. (2001) are not directly applicable. The model pro-
posed here may be considered a hybrid of both methods, suited for use with species and
areas for which surveys were not made in a systematic way or where information is too
fragmented, resulting in a complete lack of information about the absence of the target
species. We present a comparison of the classification performance of our model, using
ROC curve analysis (Egan, 1975) and a cost function assuming the existence of a misclassification
cost. For this evaluation we use a distribution map for a hypothetical species, for
which the true geographical distribution is completely known. Unlike comparisons based
on partitioning of real data, the use of this hypothetical species allows the quantification of
the true misclassification rates and the evaluation of cost functions computed from those
true rates. We also show an application of the model to real data from two endemic bird
species in Mexico.

Theory
Consider a region $D$ partitioned into $N$ subregions. Those subregions may represent pixels
in a digital image or the counties of a state, for example. Regardless of their shape or size, we
will refer to those subregions hereafter as pixels. The spatial distribution of the target species
will be assumed to be a realization $x$ of a stochastic process $X$ defined on a lattice, where
$x = (x_1, \ldots, x_N) \in \{0, 1\}^N$ denotes the true distribution of the target species. Of this
realization $x$, we observe $y_1 = 1, \ldots, y_K = 1$, $K \ll N$, i.e. the information pertains only
to the presence of the species in a very small fraction of $D$. The goal is to reconstruct
$x$ based on the data $y$, where $y_i = 1$ indicates that the $i$th pixel was visited and that the
species has been recorded there. Note that $y_i = 0$ does not necessarily imply $x_i = 0$, nor
that such a pixel was visited. Our problem is to classify the pixels where $y_i = 0$, and this
problem will be tackled in two steps. The first step is to estimate the probability of presence
$\{\pi_i; i = 1, \ldots, N - K\}$ of the target species; that is, we first obtain a map of the probability
of presence of the target species. In the second step, pixels where $y_i = 0$ are classified as
presence ($x_i = 1$) if the estimate of the probability of presence $\hat{\pi}_i$ is above a threshold
value $\pi^*$, and as absence ($x_i = 0$) otherwise. In the next subsection we describe a model to
obtain the $\hat{\pi}_i$ using the spatial information contained in the data as well as spatial information
from relevant covariates. The model is based on proposals from Besag (1986), Heikkinen &
Högmander (1994), and Hoeting et al. (2001).

Hierarchical Spatial Model


Let $x \in \{0, 1\}^N$ denote the unknown true presence–absence map of the target species and
let $y \in \{0, 1\}^K$ denote the observations. The value $y_i = 1$ implies that the species is present
in pixel $i$, and in this particular case we will assume that $x_i = 1$. The value $y_i = 0$ is attached
to pixels for which the value of $x_i$ is unknown, either because the pixel was visited and the species
was not detected (because the species was absent or because it is too shy), or because the
pixel was not visited; that is,

\[
y_i = \begin{cases} 1 & \text{if the species has been recorded at pixel } i \\ 0 & \text{otherwise} \end{cases}
\]

As mentioned before, there is a chance that the species is not detected, so it may happen that
$y_i = 0$ when $x_i = 1$; we denote the probability of this event by $\delta = P[Y_i = 0 \mid x_i = 1]$.
The density of the observations $y_i$ is a function of $x_i$ and $\delta$, i.e.

\[
f(y_i \mid x_i, \delta) = \delta^{x_i (1 - y_i)} (1 - \delta^{x_i})^{y_i}
\]

Assuming that the observations $y_i$ are conditionally independent given $x$ and $\delta$, the
likelihood is

\[
L(x, \delta; y) = \prod_{i=1}^{N} \delta^{x_i (1 - y_i)} (1 - \delta^{x_i})^{y_i} \tag{1}
\]
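As a minimal sketch of the observation model, the per-pixel density and the likelihood of equation (1) can be evaluated as follows. The function names `obs_density` and `likelihood` are illustrative choices, not part of the original FORTRAN implementation, and presence maps are represented here simply as flat lists of 0/1 values.

```python
def obs_density(y_i, x_i, delta):
    """f(y_i | x_i, delta) = delta^{x_i (1 - y_i)} * (1 - delta^{x_i})^{y_i}."""
    return delta ** (x_i * (1 - y_i)) * (1.0 - delta ** x_i) ** y_i


def likelihood(x, y, delta):
    """Product of the per-pixel densities over all pixels, as in equation (1)."""
    value = 1.0
    for x_i, y_i in zip(x, y):
        value *= obs_density(y_i, x_i, delta)
    return value
```

Note that a recorded presence at a pixel where the species is absent (`y_i = 1`, `x_i = 0`) receives density zero, which matches the assumption that $y_i = 1$ implies $x_i = 1$.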

For the first step, our interest is the estimation of the probability of presence, using the data
$y = (y_1, \ldots, y_K)$. We will consider making inferences on $x$ from its conditional distribution
given $y$ using

\[
p(x \mid y) \propto p(x, y) = p(x) L(x, \delta; y)
\]

where $p(x)$ is the prior distribution of the true but unknown presence–absence map. Note
that inferences on $x$ could be made, in principle, using maximum likelihood by maximizing
equation (1) over $\delta$ and $x$, but such maximization is awkward because of the high
number of possible configurations for $x$. An alternative is to use Bayesian estimation, as we
describe next.

Bayesian estimation of x
We will assume that the spatial distribution of the target species is associated with a set
of $p$ biological and physical covariates $z = (z_1, \ldots, z_p)$. These covariates relate to $p(x)$
through a linear predictor $\alpha = z^T \beta$. Another part of the structure of $p(x)$ is given by the
interaction between pixels. Based on these assumptions, we model a priori the status of $x_i$
given $\{x_j, j \neq i\}$ as an autologistic process (Besag, 1974)

\[
p(x_i \mid \beta, \gamma, x_{-i}) = \frac{\exp\{x_i [z_i^T \beta + \gamma s(x_i)]\}}{1 + \exp\{z_i^T \beta + \gamma s(x_i)\}} \tag{2}
\]

where $x_{-i}$ denotes all the pixels except for the $i$th one, $s(x_i) = \#\{x_j : x_j = 1, i \sim j\} -
\#\{x_j : x_j = 0, i \sim j\}$, $i \sim j$ denotes that pixels $i$ and $j$ are neighbours, and $\gamma$ is a parameter
associated with the interaction between neighboring pixels. We will assume a $N_p(0, \lambda^{-1} I)$
prior distribution for $\beta$ and $\Gamma(c_1, d_1)$ and $\Gamma(c_2, d_2)$ priors for $\gamma$ and $\lambda$ respectively. Recall
that $p$ is the number of covariates used in the analysis. The parameter $\gamma$ is assumed to be positive
a priori because the spatial distribution of many bird and mammal species tends to be aggregated
at the spatial scale we will use. For $\delta$ we assume a Beta$(a, b)$ prior distribution. We will
further assume that $\lambda$, $\beta$, $\delta$ and $\gamma$ are mutually independent. Under this formulation, the
joint distribution of all the entities involved in the model is proportional to

\[
p(x)\, \lambda^{p/2} \exp\left\{ -\frac{\lambda}{2} \beta^T \beta \right\} \gamma^{c_1} \exp\{-d_1 \gamma\}\, \lambda^{c_2} \exp\{-d_2 \lambda\}\, \delta^{a} (1 - \delta)^{b}
\]
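The autologistic prior of equation (2) can be illustrated with a short sketch. The text does not state which neighbourhood structure was used, so the four-pixel (rook) neighbourhood on a rectangular grid below is an assumption, as are the function names and the representation of the map as a list of lists.

```python
import math


def neighbour_stat(x, r, c):
    """s(x_i): number of neighbours equal to 1 minus number equal to 0.

    Assumes a 4-pixel neighbourhood; pixels outside the grid are ignored.
    """
    s = 0
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < len(x) and 0 <= cc < len(x[0]):
            s += 1 if x[rr][cc] == 1 else -1
    return s


def prior_presence_prob(x, z, beta, gamma, r, c):
    """P(x_i = 1 | beta, gamma, x_{-i}) obtained from equation (2) with x_i = 1."""
    linear = sum(b * z_k for b, z_k in zip(beta, z[r][c]))  # z_i^T beta
    eta = linear + gamma * neighbour_stat(x, r, c)
    return 1.0 / (1.0 + math.exp(-eta))
```

With a positive $\gamma$, a pixel surrounded by occupied neighbours gets a higher prior probability of presence than the covariates alone would give, which is the aggregation effect described above.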
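From here, the full conditional distributions for the $x_i$ are

\[
p(x_i \mid x_{-i}, \gamma, \delta, y_i) = \frac{\delta^{x_i (1 - y_i)} (1 - \delta^{x_i})^{y_i} \exp\{x_i (\alpha_i + \gamma s(x_i))\}}{1 + \delta^{x_i (1 - y_i)} (1 - \delta^{x_i})^{y_i} \exp\{\alpha_i + \gamma s(x_i)\}}
\]

and, in particular,

\[
p(x_i \mid x_{-i}, \gamma, \delta, y_i) = \frac{\delta^{x_i} \exp\{x_i (\alpha_i + \gamma s(x_i))\}}{1 + \delta^{x_i} \exp\{\alpha_i + \gamma s(x_i)\}} \tag{3}
\]

is the full conditional distribution for those sites where $y_i = 0$, which are the main target
for inferences. For the rest of the parameters, the full conditionals are

\[
p(\beta \mid \cdot) \propto \left[ \prod_{i=1}^{N} \frac{\exp\{x_i z_i^T \beta\}}{1 + \exp\{z_i^T \beta + \gamma s(x_i)\}} \right] \exp\left\{ -\frac{\lambda}{2} \beta^T \beta \right\} \tag{4}
\]

\[
p(\gamma \mid \cdot) \propto \left[ \prod_{i=1}^{N} \frac{\exp\{x_i \gamma \sum_{i \sim j} x_j\}}{1 + \exp\{z_i^T \beta + \gamma s(x_i)\}} \right] \gamma^{c_1} \exp\{-d_1 \gamma\} \tag{5}
\]

\[
p(\lambda \mid \cdot) \propto \lambda^{p/2 + c_2} \exp\left\{ -\left( d_2 + \frac{1}{2} \sum_{j=1}^{p} \beta_j^2 \right) \lambda \right\} \sim \Gamma\left( \frac{p}{2} + c_2,\; d_2 + \frac{1}{2} \sum_{j=1}^{p} \beta_j^2 \right) \tag{6}
\]

\[
p(\delta \mid \cdot) \propto \left[ \prod_{i=1}^{N} \delta^{x_i (1 - y_i)} (1 - \delta^{x_i})^{y_i} \right] \delta^{a} (1 - \delta)^{b} \propto \mathrm{Beta}\left( a + \sum_{\{i: y_i = 0\}} x_i,\; b + \sum_{\{i: y_i = 1\}} x_i \right) \tag{7}
\]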
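The two conjugate updates, (6) and (7), can be sketched with Python's standard `random` module. This is an illustration under stated assumptions rather than the original code: the kernels $\delta^{a}(1-\delta)^{b}$ and $\lambda^{c}\exp\{-d\lambda\}$ used in the text correspond, in the usual parameterization, to Beta$(a+1, b+1)$ and Gamma with shape $c+1$ and rate $d$, so the sketch adds one to each exponent. The rate for $\lambda$ includes the factor $1/2$ that follows from the $N_p(0, \lambda^{-1} I)$ prior. All function names are hypothetical.

```python
import random


def delta_posterior_params(x, y, a, b):
    """Standard Beta parameters for the full conditional of delta, kernel (7)."""
    s0 = sum(x_i for x_i, y_i in zip(x, y) if y_i == 0)  # present but unrecorded
    s1 = sum(x_i for x_i, y_i in zip(x, y) if y_i == 1)  # present and recorded
    return a + s0 + 1.0, b + s1 + 1.0


def draw_delta(x, y, a, b, rng):
    """One draw of delta from its Beta full conditional."""
    alpha_post, beta_post = delta_posterior_params(x, y, a, b)
    return rng.betavariate(alpha_post, beta_post)


def draw_lambda(beta, c2, d2, rng):
    """One draw of lambda from its Gamma full conditional, kernel (6)."""
    shape = len(beta) / 2.0 + c2 + 1.0
    rate = d2 + 0.5 * sum(b_j * b_j for b_j in beta)
    # random.gammavariate takes a scale parameter, i.e. 1 / rate.
    return rng.gammavariate(shape, 1.0 / rate)
```

A call such as `draw_delta(x, y, 1.0, 1.0, random.Random(0))` returns a single value in (0, 1); within a sampler it would be called once per iteration with the current map `x`.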

Parameter estimates were obtained via the Gibbs sampler by sampling from the full
conditionals (3)–(7). Except for the conjugate distributions (6) and (7),
sampling from the full conditionals was done using the Metropolis–Hastings algorithm
(Gilks et al., 1996). Code in FORTRAN was written to implement the MCMC sampling
algorithm with random scanning of the parameters. Because at each iteration we obtained a
realization of $(x \mid \cdot)$, we computed the estimates $(\hat{\pi}_1, \ldots, \hat{\pi}_N)$ as the proportion of
post-burn-in iterations in which $x_i = 1$. Specifically,
we ran the MCMC for 10,000 iterations, with a burn-in of 3000 iterations, so

\[
\hat{\pi}_i = \hat{P}[x_i = 1] = \frac{1}{7000} \sum_{j=3001}^{10000} x_i^{\{j\}} \tag{8}
\]
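The averaging in equation (8) can be sketched as a Gibbs sampler over the map $x$ alone. This is a simplified illustration, not the author's FORTRAN code: it holds $\alpha_i$, $\gamma$ and $\delta$ fixed instead of updating them, sweeps the pixels in a fixed order rather than by random scan, and assumes a four-pixel neighbourhood. For pixels with $y_i = 0$ it uses the normalized form of (3), $P(x_i = 1 \mid \cdot) = \delta e^{\eta_i} / (1 + \delta e^{\eta_i})$ with $\eta_i = \alpha_i + \gamma s(x_i)$.

```python
import math
import random


def neighbour_stat(x, r, c):
    """s(x_i) on a rectangular grid with an assumed 4-pixel neighbourhood."""
    s = 0
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < len(x) and 0 <= cc < len(x[0]):
            s += 1 if x[rr][cc] == 1 else -1
    return s


def gibbs_presence_map(y, alpha, gamma, delta, n_iter=10000, burn_in=3000, seed=0):
    """Estimate pi_hat_i as the post-burn-in frequency of x_i = 1, as in (8)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(y), len(y[0])
    x = [row[:] for row in y]  # start from the recorded presences
    counts = [[0] * n_cols for _ in range(n_rows)]
    for it in range(n_iter):
        for r in range(n_rows):
            for c in range(n_cols):
                if y[r][c] == 1:
                    x[r][c] = 1  # a recorded presence fixes x_i = 1
                    continue
                w = delta * math.exp(alpha[r][c] + gamma * neighbour_stat(x, r, c))
                x[r][c] = 1 if rng.random() < w / (1.0 + w) else 0
        if it >= burn_in:
            for r in range(n_rows):
                for c in range(n_cols):
                    counts[r][c] += x[r][c]
    kept = n_iter - burn_in
    return [[counts[r][c] / kept for c in range(n_cols)] for r in range(n_rows)]
```

In a full sampler, the draws of $\beta$, $\gamma$, $\lambda$ and $\delta$ would be interleaved with these sweeps, and `alpha` would be recomputed from the current $\beta$ at each iteration.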

Although the main task of the model is to obtain estimates of the probability of presence,
it is also possible to draw inferences about parameters such as β, which are related to the
effect of physical and biological variables on the odds of presence. This is an advantage that
models such as GARP (Stockwell & Noble, 1992) do not have. As in Hoeting et al. (2001),
here we will concentrate on the quality of the classification of pixels as presence or absence
of the target species.

Simulation Study
In order to evaluate the performance of the model described in the previous subsection, we
constructed an artificial distribution map for a hypothetical species. The map was built
using covariate information from digital maps of potential vegetation coverage (VPOTR),
precipitation (ISOYT), isothermality (ISOTM) and elevation (ELEV). VPOTR is a categorical
variable with 10 levels. All the digital maps had a pixel size of about 4 square kilometres
and were provided by CONABIO. The whole Mexican inland territory was covered
with 108,000 such pixels. With these covariate maps, we constructed the distribution map
for the hypothetical species using an autologistic model with the parameter settings shown
in Table 1. The resulting distribution map $x$, for which the $x_i$ values are completely known,
is shown in Figure 1. We will refer to this configuration of $x$ as the 'reference map'. From
the reference map, eight data sets were drawn by thinning a homogeneous Poisson process
over the areas where $x_i = 1$. The data sets had sizes 975, 490, 240, 176, 50, 30, 10 and 5
points, which represent less than 1% of positive records. The pixels including the points
were set to the presence of the species, thus simulating artificial presence records of the
hypothetical species. The remaining pixels were set to zero, resulting in eight artificial data
sets $y_{975}$, $y_{490}$, $y_{240}$, $y_{176}$, $y_{50}$, $y_{30}$, $y_{10}$ and $y_{5}$.

Table 1. Parameter values of the autologistic model used to obtain the geographic
distribution of a hypothetical species

Parameter         Value
β0                0.20
β1 (ISOTM)        0.37
β2 (ISOYT)        0.24
β3 (ELEV)         0.001
β11 (VPOTR 8)     0.50
γ                 0.40
δ                 0.90

Figure 1. Spatial distribution of a hypothetical species. See text for details
Because of the wide distribution range of the hypothetical species, the different numbers
of records serve to explore the effect of survey intensity on the performance of the model.
The geographic location of the points $y_i = 1$ for some of the data sets is shown in
Figure 2. It may also occur that, for a given species, the number of records is high, but
the records are clumped around some areas. To explore the effect of clumped records, we
simulated eight data sets using a Neyman–Scott process. In a first step we simulated $N_1$ clump
centers using a homogeneous Poisson process defined over the presence areas of Figure 1,
and from each center we generated $N_2$ points from a bivariate normal distribution centered
at that clump center, until we had 975, 490, 240, 176, 50, 30, 10 and 5 points. Some of the
resulting data sets are shown in Figure 3. These data sets were used to construct posterior
probability maps for the hypothetical species, using the posterior distribution (3).
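The two sampling schemes can be sketched on a pixel grid as follows. This is a rough illustration under several assumptions: presence pixels are given as a set of (row, column) pairs, random scattering is approximated by sampling presence pixels uniformly without replacement, and the clustered scheme picks a clump center at random for each new point instead of generating a fixed $N_2$ points per center. The function names and the `spread` parameter (standard deviation in pixel units) are hypothetical.

```python
import random


def random_records(presence, k, rng):
    """k presence pixels scattered uniformly at random over the presence area."""
    return set(rng.sample(sorted(presence), k))


def clustered_records(presence, k, n_centres, spread, rng):
    """k presence pixels clumped around randomly placed centers.

    Offsets from each center are bivariate normal with independent
    coordinates; points falling outside the presence area are discarded.
    """
    centres = rng.sample(sorted(presence), n_centres)
    records = set()
    while len(records) < k:
        centre_r, centre_c = rng.choice(centres)
        r = round(rng.gauss(centre_r, spread))
        c = round(rng.gauss(centre_c, spread))
        if (r, c) in presence:
            records.add((r, c))
    return records
```

The observation map $y$ is then obtained by setting $y_i = 1$ at the returned pixels and $y_i = 0$ elsewhere.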

Figure 2. Spatial distribution of presence records of the hypothetical species, under random scattering

Figure 3. Spatial distribution of presence records of the hypothetical species, under clustered scattering

Classificatory Performance
In many ecological applications the resulting probability map $\{\hat{\pi}_i : i = 1, \ldots, N\}$ is
sufficient for the goals of the study. From it, ecologists and conservationists may get
insight about the habitat preferences of the target species and define potential distribution
areas. In other cases, however, the map is used in a decision-making process regarding issues
such as future land usage or the introduction of non-native species into specific areas. In
those cases, the user has to decide at which level of $\hat{\pi}$ a pixel will be classified as a part
of the distribution range for the target species. As with any classifying process, there exists
the risk of making a wrong decision, and two error types are recognized (Egan, 1975). A
false positive error ($E_{01}$) occurs when a pixel where $x = 0$ is classified as presence ($\hat{x} = 1$),
whilst a false negative error ($E_{10}$) occurs when a pixel where $x = 1$ is classified as absence
($\hat{x} = 0$). A classifier may be evaluated in terms of the true positive (TPR) and false positive
rates (FPR), defined as

\[
\mathrm{FPR}(\pi) = \frac{\sum_{i=1}^{N} I[\hat{\pi}_i > \pi]\, I[x_i = 0]}{\sum_{i=1}^{N} (1 - x_i)}, \qquad
\mathrm{TPR}(\pi) = \frac{\sum_{i=1}^{N} I[\hat{\pi}_i > \pi]\, I[x_i = 1]}{\sum_{i=1}^{N} x_i}
\]

where the product of the indicators $I[\hat{\pi}_i > \pi]\, I[x_i = 0]$ corresponds to a false positive error. Note
that TPR and FPR depend on the cut-off value $\pi$.
Plotting TPR versus FPR for different values of π produces the ROC curve. For a random
classifier, the ROC curve corresponds to a straight line joining the points (0, 0) and (1, 1).
The higher the separation of a ROC curve from such a line, the better the performance of
a classifier. The ROC curve provides important information about the ability of different
classification methods to distinguish between presence and non-presence locations. The
upper part of the ROC curve provides information that allows the identification of presence
locations (high TPR), while minimizing the proportion of false alarms (low FPR). However,
a shortcoming of ROC analysis when comparing different classifiers is that one cannot tell
if a classifier is the best unless its ROC curve dominates the others over the entire FPR
range.
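The TPR and FPR definitions translate directly into code. The sketch below assumes the estimated probabilities and the reference map are available as flat lists of equal length; the function names are illustrative.

```python
def roc_point(pi_hat, x_true, threshold):
    """Return (FPR, TPR) when pixels with pi_hat > threshold are called presence."""
    false_pos = sum(1 for p, x in zip(pi_hat, x_true) if p > threshold and x == 0)
    true_pos = sum(1 for p, x in zip(pi_hat, x_true) if p > threshold and x == 1)
    n_absent = sum(1 for x in x_true if x == 0)
    n_present = sum(1 for x in x_true if x == 1)
    return false_pos / n_absent, true_pos / n_present


def roc_curve(pi_hat, x_true, thresholds):
    """ROC curve as a list of (FPR, TPR) points, one per threshold."""
    return [roc_point(pi_hat, x_true, t) for t in thresholds]
```

For instance, `roc_curve(pi_hat, x_true, [i / 100 for i in range(101)])` traces the curve on a grid of 101 thresholds. The computation requires the true map, which is why it is only possible for the simulated species.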
In the Bayesian estimation context we are using, cost functions provide a way to make the
best decision in the sense of minimizing costs. In species management, false positive and
false negative errors have different consequences, so estimates such as the marginal
posterior modes or the maximum a posteriori estimate, which do not distinguish between
the two error types, are not adequate. In most instances,
the decision maker has a rough idea of the relative importance of the two error types, so
we consider a cost function of the type

\[
L(\pi^*) = \{\alpha E_{01}(\pi^*)\}^2 + \{(1 - \alpha) E_{10}(\pi^*)\}^2 \tag{9}
\]

where $E_{01}(\pi^*)$ and $E_{10}(\pi^*)$ are the false positive and false negative errors respectively,
when the threshold value $\pi^*$ is used, and $\alpha \in [0, 1]$. For a given value of $\alpha$, the optimal
decision is to classify the pixels as presence if $\hat{\pi}_i > \pi^*_{opt}$, where $\pi^*_{opt}$ is the value of $\pi^*$ that
minimizes equation (9).
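A small sketch of how the threshold minimizing equation (9) could be located by a grid search. It assumes that $E_{01}$ and $E_{10}$ are the false positive and false negative rates (the abstract describes the cost as computed from these error rates) and that a finite grid of candidate thresholds is adequate; the function names are hypothetical.

```python
def error_rates(pi_hat, x_true, threshold):
    """Return (E01, E10): false positive and false negative rates."""
    false_pos = sum(1 for p, x in zip(pi_hat, x_true) if p > threshold and x == 0)
    false_neg = sum(1 for p, x in zip(pi_hat, x_true) if p <= threshold and x == 1)
    n_absent = sum(1 for x in x_true if x == 0)
    n_present = sum(1 for x in x_true if x == 1)
    return false_pos / n_absent, false_neg / n_present


def cost(alpha, e01, e10):
    """Quadratic cost of equation (9)."""
    return (alpha * e01) ** 2 + ((1.0 - alpha) * e10) ** 2


def optimal_threshold(pi_hat, x_true, alpha, grid):
    """Threshold in `grid` with the smallest cost; ties go to the first one."""
    return min(grid, key=lambda t: cost(alpha, *error_rates(pi_hat, x_true, t)))
```

With a small $\alpha$, false negatives dominate the cost and the selected threshold moves down; with a large $\alpha$ it moves up.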

Results
Probability of Presence Maps
The maps of probability of presence $\hat{\pi}$ for K = 975, K = 240, K = 30 and K = 5 are shown
in Figure 4 for records scattered at random and in Figure 5 for clustered records. For random
scattering of records, only the map of $\hat{\pi}$ for K = 5 does not resemble the presence pattern
of the true distribution map of Figure 1. For this number of records, the resulting $\hat{\pi}$ map
underestimates the presence of the hypothetical species in the north-central part of the true
distribution range, despite the two positive records in that area. As the number of records
increases, the resemblance to the true distribution improves, and for K = 975 the $\hat{\pi}$ map
does not show any noticeable difference from Figure 1. For the case of clustered records,
the effect of the number of positive records on the quality of the $\hat{\pi}$ maps becomes more
noticeable (Figure 5). As K decreases, the resulting $\hat{\pi}$ maps underestimate areas of true
presence and overestimate areas of true absence. This poorer performance becomes more
evident for a small number of clustered records (K = 30 and K = 5), and is a consequence of
the reduced effectiveness of clustered samples in a spatial setting. Note that for K = 30 with
random records and for K = 5 with clustered and random records, the $\hat{\pi}_i$ maps have positive
values near the Yucatan peninsula, because a few covariate values in that zone are similar to
those at the $y_i = 1$ records.

Figure 4. Posterior probability of presence for the hypothetical species, under random scattering of the
presence records

Figure 5. Posterior probability of presence for the hypothetical species, under clustered scattering of
the presence records

Comparison of Classificatory Performance


The classification of pixels as presence or as absence of the target species is highly dependent
on the chosen threshold value $\pi^*$. Different threshold values produce different classifications
and hence different shapes and estimated distributions of the presence areas. Figure 6 shows
the plots of the cost function (9) for large and small record numbers, for both the random
(solid lines) and clustered (dashed lines) scattering of the records, when α = 0.3, i.e. when
the cost of misclassifying a true presence area as an absence area receives more weight.
The ROC curves for the same probability of presence maps are shown in Figure 7. Except
for the case K = 30, the ROC curves from random records dominate the curves from
clustered records. This shows that better classifications are obtained if the presence records
are scattered at random inside the distribution range. This result is confirmed by the curves
of the cost function, where the minimum of the loss curve from random records is always
smaller than the minimum for the clustered case. The quality of a classifier in
terms of the ROC curve may be measured as the distance to the ideal point (0, 1). Because
both TPR and FPR depend on $\pi^*$, it is possible to obtain the value of $\pi^*$ that gives the best
classification from the ROC point of view.

Figure 6. Cost function curves for random scattering of presence records (solid line) and clustered
scattering (dashed line). Error rates were computed from the probability maps shown in Figures 4 and 5
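The ROC-based criterion can be sketched as a search for the threshold whose (FPR, TPR) point lies closest to the ideal point (0, 1). Euclidean distance is assumed here, since the text does not specify the metric, and the function names are illustrative.

```python
import math


def roc_distance(pi_hat, x_true, threshold):
    """Euclidean distance from the ROC point at `threshold` to the ideal (0, 1)."""
    false_pos = sum(1 for p, x in zip(pi_hat, x_true) if p > threshold and x == 0)
    true_pos = sum(1 for p, x in zip(pi_hat, x_true) if p > threshold and x == 1)
    fpr = false_pos / sum(1 for x in x_true if x == 0)
    tpr = true_pos / sum(1 for x in x_true if x == 1)
    return math.hypot(fpr, 1.0 - tpr)


def best_roc_threshold(pi_hat, x_true, grid):
    """Threshold in `grid` whose ROC point is closest to (0, 1)."""
    return min(grid, key=lambda t: roc_distance(pi_hat, x_true, t))
```

Unlike the cost function (9), this criterion weights false positives and false negatives equally, so the two approaches need not select the same threshold.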
Table 2 shows the optimum values of $\pi^*$ for the random and clustered cases using the
cost function and the ROC curve criteria. The optimal threshold values change if we shift
the values of α in the cost function. Under the cost function and the ROC curve, the optimal
threshold values $\pi^*_{opt}$ tend to be smaller if the records are clustered. This is because clustered
records contain less information about the species distribution range, resulting in higher error
rates. Because the cost function was evaluated assuming a higher weight for false negative
errors (α = 0.3), the best decision is to classify as presence pixels with $\hat{\pi}$ values above
very small thresholds. These numbers are different if we reverse the misclassification cost
criterion to α = 0.7. In that case, the optimum values increase because now the weight is
higher for false positive errors. Such changes in $\pi^*$ as α changes are to be
expected under different management strategies for natural resources.

Figure 7. ROC curves for random scattering of presence records (solid line) and clustered scattering
(dashed line). Error rates were computed from the probability maps shown in Figures 4 and 5

Table 2. Optimum threshold values for the probability of presence of a
hypothetical species, for clustered and randomly distributed presence records,
and for different numbers of records

Number of     Loss function           ROC curve
records       Random    Clustered     Random    Clustered
975           0.363     0.565         0.373     0.565
490           0.616     0.616         0.596     0.606
240           0.878     0.818         0.788     0.828
176           0.979     0.789         0.980     0.808
50            0.535     0.010         0.980     0.192
30            0.444     0.010         0.393     0.010
10            0.363     0.010         0.353     0.191
5             0.0303    0.0101        0.686     0.010

Case study: Probability of Presence for Bird Species in Mexico


The red warbler (Ergaticus ruber) and the orange-breasted bunting (Passerina leclancherii)
are two endemic bird species reported by CONABIO as distributed over a wide range
of ecological and biogeographic conditions, and subject to capture by animal traders.
Because conservation of these bird species is important to preserve biodiversity in Mexico,
several organizations have conducted studies on their habitat requirements as well as their
geographic distribution. The data consist only of presence records for each species;
we lack true absence records.
There were 127 records for Ergaticus ruber and 131 for Passerina leclancherii, gathered
from different sources of information. With these data, we used the model proposed in this
paper to obtain maps of the probability of presence for each species.
The geographic distribution of the records and the resulting probability of presence maps
are shown in Figure 8. The resulting $\hat{\pi}$ maps cover with high probability of presence the
convex hull of the presence records, and suggest a wider distribution range for both species.
For P. leclancherii the $\hat{\pi}$ map shows high probability of presence in the Yucatan peninsula
as well as in some isolated areas along the border with the US and in Baja California. For
E. ruber, the $\hat{\pi}$ values in the central part of the country are high despite the lack of presence
records for that region. This is a result of the presence records in the NW and NE parts of
the country and of the central area having covariate values similar to those of the presence
records in the north.

Figure 8. Presence records for two endemic bird species (left panel) and probability of presence
estimates

Figure 9. Distribution maps for Passerina leclancherii and Ergaticus ruber, obtained under different
threshold values π ∗ . The right panel corresponds to the Posterior Mode estimate

Because the true geographic distribution is unknown, it is not possible to obtain an
optimum threshold value for decision making. However, because the records for both species
are clustered and the number of records is relatively close to 176, we may guess
that the optimum threshold is close to that obtained when the cost function (9) is evaluated
using the clustered y176 data (π ∗ = 0.788). An alternative is to classify as presence any
pixel having π̂ > 0.5, which corresponds to using the maximum posterior mode (Besag,
1986). The two maps for the two species are shown in Figure 9, where, as expected, there
are differences in pixel classification. However, such differences are not very noticeable and
it is likely that, if the cost function were computable for these maps, the difference would be
small. Most of the discrepancies between the classification with the guessed π ∗ value and that
with the posterior mode occur in areas far from locations with observations, where high
probability of presence is due to covariate values. The maps with the posterior mode tend to be
more conservative from the environmentalist point of view, because they suggest that larger areas
are candidates for protection. Again, these figures are highly dependent on the weight given
to the false positive and false negative errors. The shape of the estimated distribution areas
agrees with the hand-made maps published by skilled ornithologists (e.g. Blake, 1977).
The resulting maps show that the model presented here is able to reproduce the distribution
pattern estimated by experts. Although this is not a formal quality test, it suggests that the
model is reliable when used with real data sets.
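
The comparison between the two classification rules can be sketched in a few lines of Python.
This is only an illustration: the 3 x 4 grid of π̂ values and the helper names are hypothetical,
and only the two thresholds (0.788 and 0.5) come from the text.

```python
def classify(pi_map, threshold):
    """Label a pixel 1 (presence) when its estimate exceeds the threshold."""
    return [[1 if p > threshold else 0 for p in row] for row in pi_map]


def count_presence(label_map):
    """Number of pixels labelled as presence."""
    return sum(sum(row) for row in label_map)


def disagreement(map_a, map_b):
    """Number of pixels labelled differently by the two maps."""
    return sum(a != b for ra, rb in zip(map_a, map_b) for a, b in zip(ra, rb))


# Hypothetical grid of estimated presence probabilities (not real data).
pi_map = [
    [0.95, 0.85, 0.60, 0.20],
    [0.90, 0.75, 0.55, 0.10],
    [0.82, 0.65, 0.40, 0.05],
]

guessed = classify(pi_map, 0.788)       # guessed optimum threshold
posterior_mode = classify(pi_map, 0.5)  # posterior mode rule

print(count_presence(guessed), count_presence(posterior_mode))
print(disagreement(guessed, posterior_mode))
```

Because 0.5 is the lower threshold, every pixel labelled as presence under the guessed π ∗ is
also labelled as presence under the posterior mode rule, so the posterior mode map can only
enlarge the estimated presence area.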

Concluding Remarks
The use of predictive models for the construction of presence–absence maps is becoming
common in conservation biology. In the case of biodiversity studies, the lack of true absence
data precludes the evaluation of predictive models using statistical measures such as the
deviance. Further, the small number of presence records available in most practical situations
makes the use of evaluation methods such as the jackknife doubtful. We have presented here
a model formulated under the lack of true absence data. The performance tests made under
the ROC curve analysis and under the use of a cost function show that the model performs
reasonably well for a moderately low number of records. Although model performance is not
adequate for a small number of presence records, this is a weakness shared by most statistical
models. In addition, model performance is better if the presence records are randomly
scattered across the distribution range of the target species. However, random distribution
of the records is difficult to obtain, mainly because for many biological species the potential
presence areas are too difficult to reach. Another problem is that the distribution range of
the species is seldom known a priori, so random sampling will not be easy to implement.
When applied to real data, the model was able to reproduce the distribution pattern of the
two species analyzed. Because the true distribution range of P. leclancherii and E. ruber is
unknown, it is not possible to give a quantitative evaluation of model performance. Further
research is needed to develop evaluation methods when the true distribution is unknown.

Acknowledgements

This research was supported by projects D.A.J.-J002/0728/99 and D.A.J.-J002/0798/2000
of the Sistema Nacional de Información sobre Biodiversidad (SNIB).

References
Armstrong, M. (1984) Basic Linear Geostatistics (New York: Springer Verlag).
Augustin, N.H., Mugglestone, M.A. & Buckland, S.T. (1996) An autologistic model for the spatial distribution of
wildlife, Journal of Applied Ecology, 33, pp. 339–347.
Austin, M.P. & Meyers, J.A. (1996) Current approaches to modeling the environmental niche of eucalypts:
Implication for management of forest biodiversity, Forest Ecology and Management, 85(1–3), pp. 95–106.
Besag, J.E. (1974) Spatial interaction and the statistical analysis of lattice systems (with discussion), Journal of
the Royal Statistical Society Series B, 36, pp. 192–236.
Besag, J.E. (1986) On the analysis of dirty pictures (with discussion), Journal of the Royal Statistical Society
Series B, 48, pp. 259–302.
Blake, E.R. (1977) Birds of Mexico: A Guide for Field Identification (Chicago IL: University of Chicago).
Buckland, S.T. & Elston, D.A. (1993) Empirical models for the spatial distribution of wildlife, Journal of Applied
Ecology, 30, pp. 478–495.
Busby, J.R. (1991) BIOCLIM: A bioclimatic and prediction system, in: C.R. Margules & M.P. Augustin Biological
Conservation: Cost Effective Biological Surveys and Analysis, pp. 64–68 (East Melbourne, Australia: CSIRO
Publications).
Chilès, J.P. & Delfiner, P. (1999) Geostatistics: Modeling Spatial Uncertainty (New York: Wiley Interscience).
Diaz-Avalos, C., Peterson, D., Alvarado, E., Ferguson, S. & Besag, J.E. (2001) Space-time modelling of
lightning-caused ignitions in the Blue Mountains, Oregon, Canadian Journal of Forest Research, 31, pp. 1579–1593.
Egan, J.P. (1975) Signal Detection Theory and ROC Analysis (New York: Academic Press).
Gilks, W.R., Richardson, S. & Spiegelhalter, D.J. (Eds) (1996) Markov Chain Monte Carlo in Practice (New York:
Chapman & Hall).
Gumpertz, M.L., Graham, J.M. & Ristaino, J.B. (1997) Autologistic model of spatial pattern of Phytophthora
epidemic in bell pepper: Effects of soil variables on disease presence, Journal of Agricultural, Biological, and
Environmental Statistics, 2, pp. 131–156.
Heikkinen, J. & Högmander, H. (1994) Fully Bayesian approach to image restoration with an application to
biogeography, Applied Statistics, pp. 569–582.
Hoeting, J., Leecaster, M. & Bowden, D. (2001) An improved model for spatially correlated binary responses,
Journal of Agricultural, Biological, and Environmental Statistics, 5, pp. 102–114.
Jarvis, A.M. & Robertson, A. (1999) Predicting population sizes and priority conservation areas for 10 endemic
Namibian bird species, Biological Conservation, 88(1), pp. 121–131.
McCullagh, P. & Nelder, J. (1989) Generalized Linear Models (London: Chapman & Hall).
Moore, D.M., Lees, B.G. & Davey, S.M. (1991) A new method for predicting vegetation distributions using
decision tree analysis in a geographic information system, Environmental Management, 15(1), pp. 59–71.
Osborne, P.E. & Tigar, B.J. (1992) Interpreting bird atlas data using logistic models: an example from Lesotho,
southern Africa, Journal of Applied Ecology, 29, pp. 55–62.
Peterson, A.T., Navarro-Siguenza, A.G. & Benitez-Diaz, H. (1998) The need for continued scientific collecting; a
geographic analysis of Mexican bird specimens, Ibis, 140(2), pp. 288–294.
Peterson, A.T., Ball, L. & Cohoon, K. (2002) Predicting distributions of Mexican birds using ecological niche
modelling methods, Ibis, 144, pp. E27–E32.
Stockwell, D.R.B. (1993) LBS: Bayesian learning system for rapid expert system development, Expert Systems
With Applications, 6, pp. 137–147.
Stockwell, D.R.B. & Noble, I.R. (1992) Induction of sets of rules from animal distribution data: a robust and
informative method of data analysis, Mathematics and Computers in Simulation, 33, pp. 385–390.
Stockwell, D.R.B. & Peterson, T. (2002) Effects of sample size on accuracy of species distribution models,
Ecological Modelling, 148, pp. 1–13.
Stockwell, D.R.B., Davey, S.M., Davis, J.R. & Noble, I.R. (1990) Using induction of decision trees to predict
Greater Glider density, AI Applications in Natural Resource Management, 4(4), pp. 33–43.
