Check whether the file iris.arff follows this structure: How many samples of each type of iris does it contain? How many attributes describe each sample? From this point on, we can start working with the data in Weka. Remember that the goal of ECOFLORA (your company) is to identify different types of iris automatically from the samples available in the .arff file. This case study will be assessed on the answers given to the questions posed in the following sections:

1. First we analyze the attributes. Load the iris.arff database in Weka Explorer > Preprocess. For each attribute you can see its statistics (e.g., min, max, mean) and the histogram of the data with respect to that attribute. The histogram is very informative because it lets us visualize the correlation between the class labels and the attribute value. The x-axis represents the attribute value and the y-axis is the number of instances whose value for that attribute falls within the interval.
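The binning behind such a histogram can be sketched in a few lines of Python. This is only an illustration, not Weka's own code: it assumes equal-width bins, a sepal-length range of 4.3 to 7.9 cm (the range of the standard iris data), and 7 bins, a count chosen because it reproduces the interval [4.3, 4.814] quoted in the answer to the first question. The function name bin_interval is hypothetical.

```python
def bin_interval(value, lo, hi, n_bins):
    """Return the (lower, upper) edges of the equal-width bin containing value."""
    width = (hi - lo) / n_bins
    # Clamp the index so that value == hi falls in the last bin.
    index = min(int((value - lo) / width), n_bins - 1)
    return lo + index * width, lo + (index + 1) * width


# Assumed range and bin count (see the note above); 4.4 cm is the queried sepal length.
low, high = bin_interval(4.4, lo=4.3, hi=7.9, n_bins=7)
print(round(low, 3), round(high, 3))
```

Once the bin is known, the answer to the question comes from reading off which class colors appear in that bar of the histogram.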
Using the attribute histograms, answer the following: if the sepal length of an iris is 4.4 cm, which type of iris do you think it is?

Histogram legend: blue = Iris-setosa, red = Iris-versicolor, cyan = Iris-virginica.

The value falls in the interval [4.3, 4.814], which the histogram shows as belonging to the Iris-setosa class.

And if the sepal width is 2.7 cm?

In my view it could be any of the three types of iris, although Iris-versicolor and Iris-virginica have more instances within that interval.

2. In this section we assume that the collected samples contain no information about the type of iris they belong to. That is, we address an unsupervised clustering task (see the theory notes) on the iris data. Consider all attributes except the labels that identify the three subspecies. Select Weka Explorer > Cluster.

a) Group the data into k groups (varying k from 2 to 10) using the k-means algorithm (SimpleKMeans in Weka). Examine the results using the visualization tools that Weka provides.

We select the k-means algorithm.
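As a rough illustration of what the k-means step does, here is a minimal pure-Python sketch of the standard assign-and-update loop. It is not Weka's implementation: the initial centroids are passed in explicitly rather than drawn with a random seed (the -S option in the Weka scheme), the attributes are not normalized, and the six sample points are made up for the example. The max_iter default mirrors the -I 500 setting used in the runs.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def k_means(points, initial_centroids, max_iter=500):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(
                tuple(sum(col) / len(members) for col in zip(*members)) if members else old
            )
        if updated == centroids:
            break
        centroids = updated
    labels = assign(points, centroids)
    sse = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, sse


# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
centroids, labels, sse = k_means(points, initial_centroids=[points[0], points[3]])
print(labels, round(sse, 3))
```

The result depends on the starting centroids, which is why the Weka runs fix the seed so that they can be reproduced.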
K=2

=== Run information ===

Scheme:       weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:     iris
Instances:    150
Attributes:   5
              sepallength
              sepalwidth
              petallength
              petalwidth
Ignored:
              class
Test mode:    evaluate on training data

=== Model and evaluation on training set ===

kMeans
======

Number of iterations: 7
Within cluster sum of squared errors: 12.143688281579722
Missing values globally replaced with mean/mode

Cluster centroids:
                      Cluster#
Attribute    Full Data         0         1
                 (150)     (100)      (50)
==========================================
sepallength     5.8433     6.262     5.006
sepalwidth       3.054     2.872     3.418
petallength     3.7587     4.906     1.464
petalwidth      1.1987     1.676     0.244
K=3
=== Run information ===

Scheme:       weka.clusterers.SimpleKMeans -N 3 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:     iris
Instances:    150
Attributes:   5
              sepallength
              sepalwidth
              petallength
              petalwidth
Ignored:
              class
Test mode:    evaluate on training data

=== Model and evaluation on training set ===

kMeans
======

Number of iterations: 6
Within cluster sum of squared errors: 6.998114004826762
Missing values globally replaced with mean/mode

Cluster centroids:
                      Cluster#
Attribute    Full Data         0         1         2
                 (150)      (61)      (50)      (39)
====================================================
sepallength     5.8433    5.8885     5.006    6.8462
sepalwidth       3.054    2.7377     3.418    3.0821
petallength     3.7587    4.3967     1.464    5.7026
petalwidth      1.1987     1.418     0.244    2.0795

Clustered Instances

0       61
1       50
2       39
K=4
=== Run information ===

Scheme:       weka.clusterers.SimpleKMeans -N 4 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:     iris
Instances:    150
Attributes:   5
              sepallength
              sepalwidth
              petallength
              petalwidth
Ignored:
              class
Test mode:    evaluate on training data

=== Model and evaluation on training set ===

kMeans
======

Number of iterations: 4
Within cluster sum of squared errors: 5.532831003081898
Missing values globally replaced with mean/mode

Cluster centroids:
                      Cluster#
Attribute    Full Data         0         1         2         3
                 (150)      (42)      (29)      (29)      (50)
==============================================================
sepallength     5.8433      6.25    5.5828    6.9586     5.006
sepalwidth       3.054       2.9     2.569    3.1345     3.418
petallength     3.7587    4.8738    4.0034    5.8552     1.464
petalwidth      1.1987    1.6405     1.231    2.1724     0.244
K=5
=== Run information ===

Scheme:       weka.clusterers.SimpleKMeans -N 5 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:     iris
Instances:    150
Attributes:   5
              sepallength
              sepalwidth
              petallength
              petalwidth
Ignored:
              class
Test mode:    evaluate on training data

=== Model and evaluation on training set ===

kMeans
======

Number of iterations: 9
Within cluster sum of squared errors: 5.130784647061167
Missing values globally replaced with mean/mode

Cluster centroids:
                      Cluster#
Attribute    Full Data         0         1         2         3         4
                 (150)      (27)      (26)      (27)      (50)      (20)
========================================================================
sepallength     5.8433    6.0296      5.55    6.9667     5.006      6.55
sepalwidth       3.054    2.7556    2.5808     3.137     3.418      3.05
petallength     3.7587    4.9444    3.9269    5.8852     1.464     4.805
petalwidth      1.1987    1.7037       1.2       2.2     0.244      1.55
K=6
=== Run information ===

Scheme:       weka.clusterers.SimpleKMeans -N 6 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:     iris
Instances:    150
Attributes:   5
              sepallength
              sepalwidth
              petallength
              petalwidth

kMeans
======

Number of iterations: 7
Within cluster sum of squared errors: 4.687015166064156
Missing values globally replaced with mean/mode

Cluster centroids:
                      Cluster#
Attribute    Full Data         0         1         2         3         4         5
                 (150)      (22)      (19)      (25)      (50)      (16)      (18)
==================================================================================
sepallength     5.8433    6.1273    5.4842     7.012     5.006    6.6313    5.8778
sepalwidth       3.054       2.7    2.4684     3.164     3.418    3.0375    2.9556
petallength     3.7587    5.1318    3.8632     5.908     1.464    4.8938      4.35
petalwidth      1.1987    1.8364    1.1684     2.204     0.244    1.5563    1.3889
K=7

=== Run information ===

Scheme:       weka.clusterers.SimpleKMeans -N 7 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:     iris
Instances:    150
Attributes:   5
              sepallength
              sepalwidth
              petallength
              petalwidth
Ignored:
              class
Test mode:    evaluate on training data

kMeans
======

Number of iterations: 7
Within cluster sum of squared errors: 3.757589923861278
Missing values globally replaced with mean/mode

Cluster centroids:
                      Cluster#
Attribute    Full Data         0         1         2         3         4         5         6
                 (150)      (22)      (19)      (25)      (14)      (16)      (18)      (36)
============================================================================================
sepallength     5.8433    6.1273    5.4842     7.012    5.3786    6.6313    5.8778    4.8611
sepalwidth       3.054       2.7    2.4684     3.164    3.8786    3.0375    2.9556    3.2389
petallength     3.7587    5.1318    3.8632     5.908    1.5071    4.8938      4.35    1.4472
petalwidth      1.1987    1.8364    1.1684     2.204    0.2786    1.5563    1.3889    0.2306

Clustered Instances

0       22 ( 15%)
1       19 ( 13%)
2       25 ( 17%)
3       14 (  9%)
4       16 ( 11%)
5       18 ( 12%)
6       36 ( 24%)

K=8

=== Run information ===

Scheme:       weka.clusterers.SimpleKMeans -N 8 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:     iris
Instances:    150
Attributes:   5
              sepallength
              sepalwidth
              petallength
              petalwidth
Ignored:
              class
Test mode:    evaluate on training data
kMeans
======

Number of iterations: 7
Within cluster sum of squared errors: 3.4079099202793466
Missing values globally replaced with mean/mode

Cluster centroids:
                      Cluster#
Attribute    Full Data         0         1         2         3         4         5         6         7
                 (150)      (22)      (19)      (25)      (13)      (16)      (18)      (20)      (17)
======================================================================================================
sepallength     5.8433    6.1273    5.4842     7.012    5.3692    6.6313    5.8778     5.045    4.6824
sepalwidth       3.054       2.7    2.4684     3.164    3.9077    3.0375    2.9556      3.43    3.0294
petallength     3.7587    5.1318    3.8632     5.908    1.5231    4.8938      4.35     1.465    1.4176
petalwidth      1.1987    1.8364    1.1684     2.204    0.2846    1.5563    1.3889      0.27    0.1824

Clustered Instances

0       22 ( 15%)
1       19 ( 13%)
2       25 ( 17%)
3       13 (  9%)
4       16 ( 11%)
5       18 ( 12%)
6       20 ( 13%)
7       17 ( 11%)
K=9

=== Run information ===

Scheme:       weka.clusterers.SimpleKMeans -N 9 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:     iris
Instances:    150
Attributes:   5
              sepallength
              sepalwidth
              petallength
              petalwidth
Ignored:
              class
Test mode:    evaluate on training data

kMeans
======

Number of iterations: 7
Within cluster sum of squared errors: 3.240418626354077
Missing values globally replaced with mean/mode

Cluster centroids:
                      Cluster#
Attribute    Full Data         0         1         2         3         4         5         6         7         8
                 (150)      (18)      (16)      (25)      (13)      (16)      (12)      (20)      (17)      (13)
================================================================================================================
sepallength     5.8433    6.1278      5.85     7.012    5.3692    6.6438    5.9417     5.045    4.6824    5.3385
sepalwidth       3.054    2.6333     2.775     3.164    3.9077    3.0188    3.0583      3.43    3.0294    2.4077
petallength     3.7587    5.1611    4.1875     5.908    1.5231     4.925      4.65     1.465    1.4176    3.7231
petalwidth      1.1987    1.8333      1.25     2.204    0.2846    1.5813       1.6      0.27    0.1824    1.1538

Clustered Instances

0       18 ( 12%)
1       16 ( 11%)
2       25 ( 17%)
3       13 (  9%)
4       16 ( 11%)
5       12 (  8%)
6       20 ( 13%)
7       17 ( 11%)
8       13 (  9%)
K=10

=== Run information ===

Scheme:       weka.clusterers.SimpleKMeans -N 10 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:     iris
Instances:    150
Attributes:

=== Model and evaluation on training set ===

kMeans
======

Number of iterations: 7
Within cluster sum of squared errors: 3.192318466613457
Missing values globally replaced with mean/mode

Cluster centroids:
                      Cluster#
Attribute    Full Data         0         1         2         3         4         5         6         7         8         9
                 (150)      (18)      (16)      (25)      (13)      (16)      (12)      (20)       (9)      (13)       (8)
==========================================================================================================================
sepallength     5.8433    6.1278      5.85     7.012    5.3692    6.6438    5.9417     5.045    4.8333    5.3385    4.5125
sepalwidth       3.054    2.6333     2.775     3.164    3.9077    3.0188    3.0583      3.43    2.9667    2.4077       3.1
petallength     3.7587    5.1611    4.1875     5.908    1.5231     4.925      4.65     1.465    1.4667    3.7231    1.3625
petalwidth      1.1987    1.8333      1.25     2.204    0.2846    1.5813       1.6      0.27    0.1778    1.1538    0.1875

Clustered Instances

0       18 ( 12%)
1       16 ( 11%)
2       25 ( 17%)
3       13 (  9%)
4       16 ( 11%)
5       12 (  8%)
6       20 ( 13%)
7        9 (  6%)
8       13 (  9%)
9        8 (  5%)
b) Plot the SSE (the sum of squared errors) as the number of groups varies from 2 to 10. Describe the behavior of this curve in one or two sentences. Also compute the mean and standard deviation of each group. You should select as the best clustering the one that yields the lowest SSE. SSE is a measure of the quality of the clustering obtained.
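The per-group mean and standard deviation requested here could be computed along the following lines once each instance has a cluster label. This is a sketch under stated assumptions: the petal lengths and labels below are made-up illustration values, not the Weka assignments, the helper name per_cluster_stats is hypothetical, and the sample standard deviation (n - 1 denominator) is used.

```python
from collections import defaultdict
from statistics import mean, stdev


def per_cluster_stats(values, labels):
    """Return {cluster: (mean, sample standard deviation)} for one attribute."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    # stdev needs at least two values; report 0.0 for singleton clusters.
    return {
        c: (mean(v), stdev(v) if len(v) > 1 else 0.0)
        for c, v in groups.items()
    }


# Hypothetical petal lengths (cm) and cluster labels, for illustration only.
petal_length = [1.4, 1.5, 1.3, 4.7, 4.5, 4.9]
cluster = [0, 0, 0, 1, 1, 1]

stats = per_cluster_stats(petal_length, cluster)
for c in sorted(stats):
    m, s = stats[c]
    print(f"cluster {c}: mean={m:.3f}, std={s:.3f}")
```

The same function would be applied once per attribute to fill in a table of means and deviations for each value of k.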
The SSE is high when the number of groups is small and decreases by progressively smaller amounts as the number of groups grows; I think it tends to level off.
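To make the shape of the curve concrete, the short sketch below tabulates the "Within cluster sum of squared errors" values reported in the runs for k = 2 to 10 and the drop obtained by adding each extra group. The numbers are copied from the Weka output; the variable names are illustrative only.

```python
# SSE reported by Weka SimpleKMeans for each k (copied from the runs).
sse_by_k = {
    2: 12.143688281579722,
    3: 6.998114004826762,
    4: 5.532831003081898,
    5: 5.130784647061167,
    6: 4.687015166064156,
    7: 3.757589923861278,
    8: 3.4079099202793466,
    9: 3.240418626354077,
    10: 3.192318466613457,
}

ks = sorted(sse_by_k)
# Reduction in SSE when going from k-1 to k groups.
drops = {k: sse_by_k[prev] - sse_by_k[k] for prev, k in zip(ks, ks[1:])}

for k in ks[1:]:
    print(f"k={k}: SSE={sse_by_k[k]:.3f}, drop={drops[k]:.3f}")

largest_drop_k = max(drops, key=drops.get)
print("largest drop when moving to k =", largest_drop_k)
```

By far the largest drop occurs when moving from 2 to 3 groups, and by k = 10 the gain is below 0.05. Because SSE keeps decreasing as k grows, choosing the lowest SSE alone would always favor the largest k.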