Module 3
Principal Component Analysis – PCA
Module 4
Cluster analysis
PCA:
https://www.youtube.com/watch?v=oiR3k9H-7K0&ab_channel=LorenAraujo
https://en.wikipedia.org/wiki/Euclidean_distance#Squared_Euclidean_distance
https://en.wikipedia.org/wiki/Canberra_distance
https://en.wikipedia.org/wiki/Chebyshev_distance
Ignacio Puertas García-Figueras UC3M
> data("USArrests")
PC1 is both the direction of maximum variance and the line of best fit
to the data: it minimises the sum of squared perpendicular distances
from each point to the PC1 axis.
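The two properties of PC1 can be checked numerically. The following is a minimal sketch, not part of the original notes: it assumes two standardised columns of USArrests (Murder and Assault, chosen only for illustration) and a hypothetical helper `perp_ss()` that returns the sum of squared perpendicular distances to a line through the origin.

```r
data("USArrests")

# Two standardised variables, so the data are centred at the origin
X <- scale(USArrests[, c("Murder", "Assault")])

pca <- prcomp(X)
v1 <- pca$rotation[, 1]  # PC1 direction (unit vector)

# Hypothetical helper: sum of squared perpendicular distances from the
# rows of X to the line through the origin with direction v
perp_ss <- function(v) {
  v <- v / sqrt(sum(v^2))
  proj <- X %*% v              # signed length of each projection
  sum(X^2) - sum(proj^2)       # Pythagoras: |x|^2 = proj^2 + perp^2
}

# Try a grid of directions (one per degree) and keep the best one
angles <- seq(0, pi, length.out = 181)
ss <- sapply(angles, function(a) perp_ss(c(cos(a), sin(a))))
best <- angles[which.min(ss)]

c(cos(best), sin(best))  # best grid direction
v1                       # PC1 from prcomp (may differ in sign)
perp_ss(v1)              # no grid direction should do better than this
```

Because the total sum of squares is fixed, minimising the perpendicular part is the same as maximising the variance of the projections.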
Cluster analysis
What is Cluster Analysis? Cluster analysis or clustering is the
task of grouping a set of objects in such a way that objects in
the same group are more similar to each other than to those in
other groups.
Partitioning Algorithms
Partitioning method: construct a partition of a database D of n
objects into a set of k clusters. Given k, find the partition into
k clusters that optimizes the chosen partitioning criterion.
– Global optimum: exhaustively enumerate all partitions
– Heuristic methods: the k-means and k-medoids algorithms
– k-means (MacQueen '67): each cluster is represented by the
centre (mean) of the cluster
– k-medoids or PAM (Partitioning Around Medoids) (Kaufman &
Rousseeuw '87): each cluster is represented by one of the
objects in the cluster
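The difference between the two heuristics can be seen side by side. This is a minimal sketch, assuming the standardised USArrests data, k = 2 and an arbitrary seed, none of which are prescribed by the text above; `kmeans()` comes from base R and `pam()` from the `cluster` package.

```r
library(cluster)  # provides pam()

data("USArrests")
X <- scale(USArrests)  # standardise so that no variable dominates the distances

set.seed(123)  # arbitrary seed; k-means starts from random centres
km <- kmeans(X, centers = 2, nstart = 25)
pm <- pam(X, k = 2)

km$centers            # cluster means: usually not actual observations
pm$medoids            # medoids: actual rows of X
rownames(pm$medoids)  # the states acting as cluster representatives

# Cross-tabulate the two partitions (cluster labels are arbitrary)
table(kmeans = km$cluster, pam = pm$clustering)
```

Since medoids are real observations, PAM tends to be less sensitive to outliers than k-means, at a higher computational cost.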
Rotation (n x k) = (4 x 4):
                PC1        PC2        PC3         PC4
Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
****************************************************************
> fviz_nbclust(resnumclust)
Among all indices:
===================
* 2 proposed 0 as the best number of clusters
* 13 proposed 2 as the best number of clusters
* 2 proposed 3 as the best number of clusters
* 1 proposed 4 as the best number of clusters
* 1 proposed 5 as the best number of clusters
* 7 proposed 6 as the best number of clusters
* 1 proposed 9 as the best number of clusters
* 3 proposed 10 as the best number of clusters
Conclusion
=========================
* According to the majority rule, the best number of clusters is 2 .
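The majority rule tallies the number of clusters preferred by many different indices. As a sketch of what a single such index does, the code below computes the average silhouette width for k = 2, ..., 10. It is not the call that produced the output above: the data (standardised USArrests), the range of k, the seed and the use of k-means are all assumptions made for illustration.

```r
library(cluster)  # provides silhouette()

data("USArrests")
X <- scale(USArrests)
d <- dist(X)  # Euclidean distances between states

set.seed(123)  # arbitrary seed
ks <- 2:10
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(X, centers = k, nstart = 25)
  # silhouette() returns one row per observation; average the widths
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- ks

round(avg_sil, 3)
best_k <- as.integer(names(which.max(avg_sil)))
best_k  # number of clusters preferred by this one index
```

A single index is only one vote; the tally above combines many of them, which is why it can disagree with any individual criterion.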
Cluster means:
     Murder    Assault   UrbanPop       Rape
1  1.004934  1.0138274  0.1975853  0.8469650
2 -0.669956 -0.6758849 -0.1317235 -0.5646433

Clustering vector:
Alabama         Alaska          Arizona         Arkansas        California
1               1               1               2               1
Colorado        Connecticut     Delaware        Florida         Georgia
1               2               2               1               1
Hawaii          Idaho           Illinois        Indiana         Iowa
2               2               1               2               2
Kansas          Kentucky        Louisiana       Maine           Maryland
2               2               1               2               1
Massachusetts   Michigan        Minnesota       Mississippi     Missouri
2               1               2               1               1
Montana         Nebraska        Nevada          New Hampshire   New Jersey
2               2               1               2               2
New Mexico      New York        North Carolina  North Dakota    Ohio
1               1               1               2               2
Oklahoma        Oregon          Pennsylvania    Rhode Island    South Carolina
2               2               2               2               1
South Dakota    Tennessee       Texas           Utah            Vermont
2               1               1               2               2
Virginia        Washington      West Virginia   Wisconsin       Wyoming
2               2               2               2               2
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
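The components listed above are linked by the decomposition of the total sum of squares into a within-cluster and a between-cluster part. A minimal sketch, assuming the fit was obtained on the standardised USArrests data with two centres (the seed and `nstart` are arbitrary choices, not taken from the notes):

```r
data("USArrests")
X <- scale(USArrests)

set.seed(123)  # arbitrary seed
km <- kmeans(X, centers = 2, nstart = 25)

km$size      # number of states in each cluster
km$withinss  # within-cluster sum of squares, one value per cluster

# Total SS = total within-cluster SS + between-cluster SS
all.equal(km$totss, km$tot.withinss + km$betweenss)

# Share of the total variability explained by the partition
km$betweenss / km$totss
```

For standardised data with n = 50 observations and p = 4 variables, `totss` equals (n - 1) * p = 196, so the ratio can be read directly as a proportion of explained variability.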
> data("USArrests")
> prcomp(USArrests)
Standard deviations (1, .., p=4):
[1] 83.732400 14.212402  6.489426  2.482790

Rotation (n x k) = (4 x 4):
                PC1         PC2         PC3         PC4
Murder   0.04170432 -0.04482166  0.07989066 -0.99492173
Assault  0.99522128 -0.05876003 -0.06756974  0.03893830
UrbanPop 0.04633575  0.97685748 -0.20054629 -0.05816914
Rape     0.07515550  0.20071807  0.97408059  0.07232502
Numerically:
> summary(prcomp(USArrests))
Importance of components:
                           PC1      PC2    PC3     PC4
Standard deviation     83.7324 14.21240 6.4894 2.48279
Proportion of Variance  0.9655  0.02782 0.0058 0.00085
Cumulative Proportion   0.9655  0.99335 0.9991 1.00000
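The "Proportion of Variance" row can be reproduced by hand from the standard deviations, and the raw variances suggest why PC1 is almost entirely Assault when the data are not scaled. A short sketch using only base R:

```r
data("USArrests")
pca <- prcomp(USArrests)  # unscaled, as in the output above

# Variance of each component = squared standard deviation
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
round(prop_var, 5)         # compare with the "Proportion of Variance" row
round(cumsum(prop_var), 5) # compare with the "Cumulative Proportion" row

# Raw variances of the original variables: the variable with the
# largest variance dominates an unscaled PCA
apply(USArrests, 2, var)
```

This is the usual argument for standardising the variables (`scale. = TRUE`) when they are measured in different units.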
Rotation (n x k) = (4 x 4):
                PC1        PC2        PC3         PC4
Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
> plot(prcomp(USArrests, scale. = TRUE))
> summary(prcomp(USArrests, scale. = TRUE))
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.5749 0.9949 0.59713 0.41645
Proportion of Variance 0.6201 0.2474 0.08914 0.04336
Cumulative Proportion  0.6201 0.8675 0.95664 1.00000
> biplot(prcomp(USArrests, scale. = TRUE))
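PCA on standardised variables is equivalent to an eigendecomposition of the correlation matrix. The sketch below, which is an illustration and not part of the original notes, checks this with base R: the square roots of the eigenvalues match the standard deviations, and the eigenvectors match the rotation matrix up to sign.

```r
data("USArrests")

pca <- prcomp(USArrests, scale. = TRUE)
e <- eigen(cor(USArrests))  # eigenvalues and eigenvectors of the correlation matrix

# Standard deviations of the PCs are the square roots of the eigenvalues
all.equal(sqrt(e$values), pca$sdev)

# Loadings agree up to the sign of each column, which is arbitrary
all.equal(abs(unname(e$vectors)), abs(unname(pca$rotation)))

# The eigenvalues add up to the number of variables (here 4)
sum(e$values)
```

Because each standardised variable has variance 1, the total variance is 4, and dividing each eigenvalue by 4 gives the proportions of variance.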
> summary(pca.Europa2)
Importance of components:
                         PC1    PC2    PC3    PC4     PC5    PC6     PC7
Standard deviation     1.796 1.0896 1.0311 0.8777 0.67665 0.4107 0.35446
Proportion of Variance 0.461 0.1696 0.1519 0.1100 0.06541 0.0241 0.01795
Cumulative Proportion  0.461 0.6306 0.7825 0.8925 0.95795 0.9820 1.00000
> x11()
> plot(pca.Europa2)
> biplot(pca.Europa2)
  name           description
1 "$eig"         "eigenvalues"
2 "$var"         "results for the variables"
3 "$var$coord"   "coord. for the variables"
4 "$var$cor"     "correlations variables - dimensions"
5 "$var$cos2"    "cos2 for the variables"
6 "$var$contrib" "contributions of the variables"
7 "$ind"         "results for the individuals"
  Name       Description
1 "$coord"   "Coordinates for the variables"
2 "$cor"     "Correlations between variables and dimensions"
3 "$cos2"    "Cos2 for the variables"
4 "$contrib" "contributions of the variables"
> library(corrplot)
corrplot 0.91 loaded
> corrplot(var$cos2, is.corr = FALSE)
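The quantities in the table (coord, cos2, contrib) can be rebuilt from a plain `prcomp()` fit. This is a sketch under assumptions: it uses the standardised USArrests data rather than the `res.pca` object of the notes, and it follows the usual definitions (coordinates as loadings times standard deviations, cos2 as squared coordinates, contributions as percentages of each component's cos2 total).

```r
data("USArrests")
pca <- prcomp(USArrests, scale. = TRUE)

# Variable coordinates: loadings scaled by the component standard deviations.
# With standardised data these are the variable-component correlations.
coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# Quality of representation of each variable on each component
cos2 <- coord^2

# Contribution (in %) of each variable to each component
contrib <- sweep(cos2, 2, colSums(cos2), `/`) * 100

round(cos2, 3)
round(contrib, 2)

rowSums(cos2)     # 1 for every variable when all components are kept
colSums(contrib)  # 100 for every component
```

A variable with cos2 close to 1 on the first two components is well represented on the corresponding factor map; a low value means its position there should be interpreted with caution.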
> fviz_pca_var(res.pca, col.var = "cos2", repel = TRUE)
> fviz_contrib(res.pca,choice="var",axes=2,top=10)
> fviz_pca_ind(res.pca)
Good example:
https://aulaglobal.uc3m.es/pluginfile.php/4792883/mod_resource/content/2/heptathlon-pca-ade_engl.pdf
Conclusion
=========================
* According to the majority rule, the best number of clusters is 9.
Medoids:
              ID     Murder    Assault   UrbanPop         Rape
Alabama        1  1.2425641  0.7828393 -0.5209066 -0.003416473
Alaska         2  0.5078625  1.1068225 -1.2117642  2.484202941
New Mexico    31  0.8292944  1.3708088  0.3081225  1.160319648
California     5  0.2782682  1.2628144  1.7589234  2.067820292
Massachusetts 21 -0.7778653 -0.2611064  1.3444088 -0.526563903
Oklahoma      36 -0.2727580 -0.2371077  0.1699510 -0.131534211
South Dakota  41 -0.9156219 -1.0170672 -1.4190215 -0.900240639
Illinois      13  0.5997002  0.9388312  1.2062373  0.295524916
New Hampshire 29 -1.3059321 -1.3650491 -0.6590781 -1.252564419
Clustering vector:
Alabama         Alaska          Arizona         Arkansas        California      Colorado
1               2               3               1               4               4
Connecticut     Delaware        Florida         Georgia         Hawaii          Idaho
5               6               3               1               5               7
Illinois        Indiana         Iowa            Kansas          Kentucky        Louisiana
8               6               9               6               6               1
Maine           Maryland        Massachusetts   Michigan        Minnesota       Mississippi
9               3               5               3               9               1
Missouri        Montana         Nebraska        Nevada          New Hampshire   New Jersey
6               7               6               4               9               5
New Mexico      New York        North Carolina  North Dakota    Ohio            Oklahoma
3               8               1               9               6               6
Oregon          Pennsylvania    Rhode Island    South Carolina  South Dakota    Tennessee
6               6               5               1               7               1
Texas           Utah            Vermont         Virginia        Washington      West Virginia
8               5               7               6               6               7
Wisconsin       Wyoming
9               6

Objective function:
    build      swap
0.7439615 0.7435288
Available components:
[1] "medoids" "id.med" "clustering" "objective" "isolation" "clusinfo" "silinfo"
[8] "diss" "call" "data"
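A PAM fit of this kind can be inspected through the components listed above. The following is a minimal sketch, assuming the standardised USArrests data and k = 9 as suggested by the majority rule; it is not guaranteed to be the exact call behind the printed output.

```r
library(cluster)  # provides pam()

data("USArrests")
X <- scale(USArrests)

pm <- pam(X, k = 9)

pm$id.med               # row indices of the medoids
rownames(X)[pm$id.med]  # the states acting as medoids
pm$medoids              # their standardised coordinates (actual rows of X)

table(pm$clustering)    # cluster sizes
pm$objective            # objective after the build and swap phases
pm$silinfo$avg.width    # average silhouette width of the partition
```

The swap value is never larger than the build value, since the swap phase only accepts exchanges that improve the objective.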