ITK Questions?
Classification
Features
Loosely stated, a feature is a value describing something about your data points (e.g., for pixels: intensity, local gradient, distance from a landmark, etc.). Multiple (n) features are put together to form a feature vector, which defines a data point's location in n-dimensional feature space.
Feature Space
The theoretical n-dimensional space occupied by n input raster objects (features). Each feature represents one dimension, and its values represent positions along one of the orthogonal coordinate axes in feature space. The set of feature values belonging to a data point defines a vector in feature space.
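As a rough illustration (a sketch in NumPy, not ITK code; the feature values are made up), per-pixel features can be stacked into feature vectors like so:

```python
import numpy as np

# Hypothetical per-pixel features: intensity, local gradient magnitude,
# and distance from a landmark (all values made up for illustration).
intensity     = np.array([0.20, 0.80, 0.55])
gradient      = np.array([0.10, 0.40, 0.90])
landmark_dist = np.array([3.00, 1.50, 0.70])

# Each row is one pixel's feature vector: a point in 3-D feature space.
feature_vectors = np.stack([intensity, gradient, landmark_dist], axis=1)
print(feature_vectors.shape)  # (3, 3): 3 pixels, 3 features
```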
Statistical Notation
Class probability distribution: $p(\vec{x} \mid y = i)$, the distribution of feature vectors $\vec{x}$ belonging to class $i$
In the text, they choose to concentrate on methods that use Gaussians to model class densities
Class Modeling
We model the class distributions as multivariate Gaussians:
$\vec{x} \sim N(\mu_0, \Sigma_0)$ for $y = 0$
$\vec{x} \sim N(\mu_1, \Sigma_1)$ for $y = 1$
Priors are based on training data, or a distribution can be chosen that is expected to fit the data well (e.g., a Bernoulli distribution for a coin flip)
Calculating Posteriors
Use Bayes' Rule:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

In this case:

$$P(y = i \mid \vec{x}) = \frac{p(\vec{x} \mid y = i)\, P(y = i)}{p(\vec{x})}$$
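A minimal sketch of this calculation for the two-class Gaussian model above, using SciPy's multivariate normal density; the means, covariances, and priors are made-up values:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up class models: x ~ N(mu_i, Sigma_i) for class y = i.
mu    = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
sigma = [np.eye(2), 1.5 * np.eye(2)]
prior = [0.6, 0.4]  # P(y = 0), P(y = 1), e.g. estimated from training data

x = np.array([1.2, 0.9])  # query feature vector

# Bayes rule: P(y = i | x) = p(x | y = i) P(y = i) / p(x)
likelihood = [multivariate_normal.pdf(x, mean=mu[i], cov=sigma[i]) for i in (0, 1)]
evidence = sum(l * p for l, p in zip(likelihood, prior))  # p(x)
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]
print(posterior)  # sums to 1; classify x as the argmax over classes
```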
Clustering
Basic Clustering Problem:
Distribute data into k different groups such that data points similar to each other are in the same group. Similarity between points is defined in terms of some distance metric.
Dimensionality Reduction
High-dimensional data is replaced with a group (cluster) label
Distance Metrics
Euclidean distance in some space (for our purposes, probably a feature space). A distance metric $d$ must fulfill three properties:
Non-negativity: $d(\vec{x}, \vec{y}) \ge 0$, with equality if and only if $\vec{x} = \vec{y}$
Symmetry: $d(\vec{x}, \vec{y}) = d(\vec{y}, \vec{x})$
Triangle inequality: $d(\vec{x}, \vec{z}) \le d(\vec{x}, \vec{y}) + d(\vec{y}, \vec{z})$
Distance Metrics
Common simple metrics:
Euclidean: $d(\vec{x}, \vec{y}) = \sqrt{\sum_i (x_i - y_i)^2}$
Manhattan: $d(\vec{x}, \vec{y}) = \sum_i |x_i - y_i|$
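A minimal sketch of both metrics in NumPy, following the formulas above:

```python
import numpy as np

def euclidean(x, y):
    # Square root of the summed squared coordinate differences.
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # Sum of the absolute coordinate differences.
    return np.sum(np.abs(x - y))

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
```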
Clustering Algorithms
k-Nearest Neighbor
k-Means
Parzen Windows
k-Nearest Neighbor
In essence, a classifier
Requires input parameter k
In this algorithm, k indicates the number of neighboring points to take into account when classifying a data point
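A minimal k-Nearest Neighbor sketch, assuming Euclidean distance and majority vote (the training data are made up):

```python
import numpy as np

def knn_classify(query, train_X, train_y, k):
    # Distance from the query to every training point.
    dists = np.linalg.norm(train_X - query, axis=1)
    # Labels of the k nearest training points.
    nearest = train_y[np.argsort(dists)[:k]]
    # Majority vote among those k neighbors.
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

# Made-up, well-classified training data: two 2-D classes.
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
train_y = np.array([0, 0, 1, 1])
print(knn_classify(np.array([0.8, 0.9]), train_X, train_y, k=3))  # 1
```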
k-Nearest Neighbor
Advantages:
Simple
General (can work for any distance measure you want)
Disadvantages:
Requires well-classified training data
Can be sensitive to the k value chosen
All attributes are used in classification, even ones that may be irrelevant
Inductive bias: we assume that a data point should be classified the same as points near it
k-Means
Suitable only when data points have continuous values
Groups are defined in terms of cluster centers (means)
Requires input parameter k
In this algorithm, k indicates the number of clusters to be created
k-Means Algorithm
Algorithm:
1. Randomly initialize k mean values
2. Repeat next two steps until no change in means:
   1. Partition the data using a similarity measure according to the current means
   2. Move the means to the center of the data in the current partition
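A minimal sketch of these steps, assuming Euclidean distance as the similarity measure; an empty cluster simply keeps its previous mean:

```python
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly initialize the k means (here: k distinct data points).
    means = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # 2a. Partition: assign each point to its nearest current mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # 2b. Move each mean to the center of its partition
        #     (an empty cluster keeps its old mean).
        new_means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                              else means[i] for i in range(k)])
        if np.allclose(new_means, means):  # no change in means: done
            return means, labels
        means = new_means

# Made-up data: two well-separated 2-D groups.
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
means, labels = kmeans(X, k=2)
print(means, labels)
```

Note that the seeded random initialization matters: as the next slide points out, the result is sensitive to where the initial means are placed.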
k-Means
Advantages:
Simple
General (can work for any distance measure you want)
Requires no training phase
Disadvantages:
Result is very sensitive to initial mean placement
Can perform poorly on overlapping regions
Doesn't work on features with non-continuous values (can't compute cluster means)
Inductive bias: we assume that a data point should be classified the same as points near it
Parzen Windows
Similar to k-Nearest Neighbor, but instead of using the k closest training data points, it uses all points within a kernel (window), weighting their contribution to the classification based on the kernel. As with our classification algorithms, we will consider a Gaussian kernel as the window.
Parzen Windows
Assume a region defined by a d-dimensional Gaussian of scale $\sigma$. We can define a window density function:

$$p(\vec{x}, \sigma) = \frac{1}{|S|} \sum_{j=1}^{|S|} G\!\left(\vec{x} - \vec{s}^{(j)}, \sigma\right)$$

where $S$ is the training set and $\vec{s}^{(j)}$ is its j-th point.
Note that we consider all points in the training set, but if a point lies outside the kernel's effective support, its weight is essentially zero, negating its influence.
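A minimal sketch of this density estimate with an isotropic d-dimensional Gaussian kernel (the training set S is made up):

```python
import numpy as np

def gaussian_kernel(diff, sigma):
    # Isotropic d-dimensional Gaussian G(x - s, sigma).
    d = diff.shape[-1]
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    return np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2)) / norm

def parzen_density(x, S, sigma):
    # p(x, sigma) = (1/|S|) * sum_j G(x - s_j, sigma)
    return gaussian_kernel(x - S, sigma).mean()

# Made-up training set; evaluate the window density at a query point.
S = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]])
print(parzen_density(np.array([0.4, 0.6]), S, sigma=0.3))
```

For classification, one would estimate this density separately from each class's training points and assign the query to the class with the highest (prior-weighted) density.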
Parzen Windows
Advantages:
More robust than k-Nearest Neighbor
Excellent accuracy and consistency
Disadvantages:
How to choose the size of the window?
Alone, kernel density estimation techniques provide little insight into data or problems