02 Classical Machine Learning

Deep Learning for Computer Vision
Dr. María de la Paz Rico Fdz

INTRODUCTION TO ARTIFICIAL INTELLIGENCE
Image taken from the NVIDIA Developer Blog
MACHINE LEARNING VS CLASSICAL PROGRAMMING
Machine Learning
Field of study that gives computers the ability to learn without being explicitly programmed.
Arthur Samuel (1959)
In ML, the data and the expected answers are fed in, and the output is the rules.

Classical programming
Rules are programmed and the data are processed according to those rules, producing an output or answer.

[Diagram: in classical programming, rules and data go in and an output (answer) comes out; in machine learning, data and outputs go in and a model comes out; the computer then applies the model to new data to produce an output.]
SCOPE OF MACHINE LEARNING
ADVANTAGES
- Works well with small datasets
- Results are easy to interpret
- Development does not require high-capacity hardware
DISADVANTAGES
- Difficulty learning from large datasets
- Requires feature engineering
- Difficulty learning complex functions
SCOPE OF MACHINE LEARNING
CHALLENGES IN PERCEPTION TASKS
- Recognizing images
- Natural language
- Interacting with the real world (exploration, localization, recognition)
Machine Learning
⚫ What do we need for ML?
[Diagram: data and answers feed machine learning, which outputs rules (models); "Performance" appears as a separate label.]
⚫ ML pipeline
Data acquisition → Preprocessing → Processing → Measure accuracy → Improve / Maintain / Update
Stages: DATA ACQUISITION, TRAIN THE MODEL, DEPLOY THE MODEL

Data acquisition: understand the problem, identify (labeled) data sources, and highlight possible problems with the data.

KEY STEPS FOR AN ML PROJECT
(ML PIPELINE)
⚫ Data acquisition
Examples of datasets

Machine Learning
DATA TYPES
Data
- Numerical
  - Discrete: measurements are integers (e.g. age, n° of students)
  - Continuous: measurements can take on any value, usually within a range (e.g. temperature, weight)
- Categorical
  - Nominal: no natural order between categories (e.g. gender, states / districts, color names)
  - Ordinal: an order between categories (e.g. T-shirt size (S, M, L), grades (A, B, C), time of the day (morning, afternoon, evening))
- Strings
- Sequence data
  - time series (time order)
  - sequences of strings (text data)
- Location data
(ML PIPELINE)
⚫ Data acquisition
Datasets can be acquired by:
1) Manual labeling
2) From observed data
3) Downloading them from websites or partners

(ML PIPELINE)
⚫ Data acquisition and its problems
1) Noisy data in, noisy estimates out
2) Problems in the data
- Incorrect labels
- Missing data
3) Several types of data
- Structured: data tables
- Unstructured: images, audio, video, text

Preprocessing: clean the data, transform them, reduce them, and split them into training and evaluation sets.
Processing: select the algorithm or algorithms.
⚫ Types of ML systems
• Supervised -> data and labels
• Unsupervised -> data
• Semi-supervised
• Reinforcement learning
(ML PIPELINE)
⚫ Train the model
[Flowchart: START → Do you have answers (labels)?
  No → UNSUPERVISED LEARNING: PCA, SVD, K-means, C-means
  Yes → SUPERVISED LEARNING → Are they numeric?
    Yes → Linear Regression, Decision Tree Regression, Random Forest, Neural Network
    No → Logistic Regression, Decision Tree, SVM, Random Forest, Neural Network]
Machine Learning
[Figure: example labeled "Dog".] From ‘Python Data Science Handbook’ by Jake VanderPlas.

Machine Learning
- Supervised learning: regression, classification
- Unsupervised learning: dimensionality reduction, association
LINEAR REGRESSION
In regression, the labels are continuous data.
Linear
The linear model is formed as the weighted sum of the variables (features), plus a bias (intercept) term:
$y = ax + b$
Regressions can be linear or polynomial, depending on whether the relationship in the data is linear or not.
Polynomial
$y = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_3 + \dots \;\rightarrow\; y = a_0 + a_1 x + a_2 x^2 + \dots$
Notebook
https://drive.google.com/file/d/1ck9sBcHeJbn7SgUF9TDI40mGhgUTs1DM/view?usp=sharing
Penalties
https://drive.google.com/file/d/1lgzTIzDoVBwYXVIzr3wuNNAlbV30Ymwl/view?usp=sharing
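As a minimal sketch of the difference between a linear and a polynomial fit, the snippet below uses NumPy's `polyfit` on synthetic data generated from a known quadratic. The data, the coefficients and the helper `mse` are assumptions chosen for illustration; they are not taken from the course notebooks.

```python
import numpy as np

# Synthetic, noise-free data from a known quadratic: y = 1 + 2x + 3x^2 (assumed)
x = np.linspace(-2.0, 2.0, 21)
y = 1.0 + 2.0 * x + 3.0 * x**2

# Linear model y = a*x + b (degree 1); polyfit returns the highest power first
a, b = np.polyfit(x, y, deg=1)

# Polynomial model y = a0 + a1*x + a2*x^2 (degree 2)
a2, a1, a0 = np.polyfit(x, y, deg=2)

def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return float(np.mean((y_true - y_pred) ** 2))

mse_linear = mse(y, a * x + b)
mse_poly = mse(y, a0 + a1 * x + a2 * x**2)
print(f"linear fit MSE: {mse_linear:.3f}, quadratic fit MSE: {mse_poly:.3e}")
```

On data with a curved relationship, the straight line leaves a large residual error while the degree-2 model recovers the generating coefficients.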
MULTIPLE REGRESSION
When there are two or more predictor variables, the model is called a multiple regression model. The general form of a multiple regression model is
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$
where $y$ is the variable to be forecast and $x_1, \dots, x_k$ are the k predictor variables. Each of the predictor variables must be numeric.
The data shown are series taken from a study by Shumway of the possible effects of temperature and pollution on weekly mortality in Los Angeles.
$M_t = \beta_1 (T_t - \bar{T}) + \beta_2 P_t + \varepsilon$
Notebook
https://colab.research.google.com/drive/1ESOdWY3CfsKjXfMRzVbt5ui-2LbvAENX?usp=sharing
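A multiple regression of this kind can be sketched with ordinary least squares in NumPy. The example below is only an illustration: the two predictors, the "true" coefficients and the noise level are made-up assumptions, not the Shumway mortality data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two hypothetical numeric predictors (synthetic values)
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Assumed coefficients, used only to generate the data: intercept, x1, x2
beta_true = np.array([1.5, -2.0, 0.5])
y = beta_true[0] + beta_true[1] * x1 + beta_true[2] * x2 + 0.1 * rng.normal(size=n)

# Design matrix with a column of ones for the intercept term
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares estimate of the coefficients
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

With enough data and little noise, the estimated coefficients should land close to the ones used to generate the series.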
PROPHET
⚫ INTRODUCTION
Time series analysis is an approach for extracting meaningful characteristics from time series data and generating other useful insights that can be applied in a business setting.
Generally, time series data are a sequence of observations stored in time order.
Time series data often arise when tracking business metrics, monitoring industrial processes, etc.
Classical time series forecasting techniques are based on statistical models that require a lot of effort to tune, as well as expertise in both the data and the industry.
PATTERNS IN TIME SERIES
Forecast
A forecast is composed mainly of a pattern plus a deviation (bias) or randomness.
Patterns + Randomness = Forecast
The goal is to maximize the ability to find the patterns and minimize the unexplained variance.
The main patterns are: trend, seasonality, cyclic behavior, and white noise.

⚫ PATTERNS IN TIME SERIES
[Figure panels: Trend; Seasonality; Trend + Seasonality; White noise]
⚫ FACEBOOK PROPHET
Facebook developed Prophet, an open-source forecasting tool available in both Python and R.
It provides intuitive parameters that are easy to tune.
Highlights of Facebook Prophet
1. Very fast, since it is built on Stan, a programming language for statistical inference written in C++.
2. An additive regression model in which non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects.
3. Prophet automatically detects changes in trends by selecting change points from the data.
4. A yearly seasonal component modeled using Fourier series.
5. A user-provided list of important holidays.
6. Robust to missing data and shifts in the trend, and it typically handles outliers well.
7. A straightforward procedure for modifying and tuning the forecast while adding domain knowledge or business insights.
⚫ PROPHET MODEL
Prophet uses a time series model with three main components: trend, seasonality, and holidays.
They are combined in the following equation:
Prophet = Trend + Seasonality + Holidays + error
• The trend models non-periodic changes in the value of the time series.
• Seasonality captures periodic changes, such as daily, weekly, or yearly seasonality.
• The holiday effect occurs on irregular schedules over a day or a period of days.
• The error term is what the model does not explain.
Using time as a regressor, Prophet tries to fit several linear and non-linear functions of time as components.
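The idea of an additive model that uses time as its only regressor can be sketched with plain least squares. The snippet below is not Prophet's implementation (Prophet fits piecewise trends with change points and holiday terms through Stan); it is a simplified illustration on synthetic data with an assumed linear trend and a single weekly Fourier pair, and it leaves the holiday component out.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(140, dtype=float)   # 140 daily observations (20 weeks)
period = 7.0                      # weekly seasonality

# Synthetic series: linear trend + weekly sine + noise (assumed for illustration)
y = 10.0 + 0.05 * t + 2.0 * np.sin(2 * np.pi * t / period) + 0.1 * rng.normal(size=t.size)

# Every regressor is a function of time: intercept, linear trend, one Fourier pair
X = np.column_stack([
    np.ones_like(t),
    t,
    np.sin(2 * np.pi * t / period),
    np.cos(2 * np.pi * t / period),
])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

trend = X[:, :2] @ coef[:2]          # intercept + slope * t
seasonality = X[:, 2:] @ coef[2:]    # periodic component
residual = y - (trend + seasonality) # what the model does not explain
print("slope:", coef[1], "residual std:", residual.std())
```

The fitted series is the sum of the trend and seasonality arrays, and the residual plays the role of the error term in the equation above.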
Saturation
Set a carrying capacity (cap) to specify the maximum achievable value given business scenarios or constraints: market size, total population size, maximum budget, etc.
A saturating minimum can also be set; it is specified with a floor column in the same way that the cap column specifies the maximum.
Change points
The model could overfit or underfit when working with the trend component. To mitigate these effects, change points have been built into Prophet to allow a flexible fit.
In the Prophet plot, the original data are shown as black dots and the blue line is the forecast model. The light blue area is the confidence interval.
Using the add_changepoints_to_plot function adds the red lines: the vertical dashed lines are the change points that Prophet identified where the trend changed, and the solid red line is the trend without the seasonality.
Holidays
Holidays and events can cause changes in a time series.
We can create a custom holiday list for Prophet by building a data frame with two columns, ds and holiday, with one row for each occurrence of the holiday.
In addition, the lower_window and upper_window parameters extend the holiday to days around the date.
If we want to include the day before National Avocado Day and Guacamole Day, we set lower_window: -1 upper_window: 0
If we wanted to use the day after the holiday, we would set lower_window: 0 upper_window: 1
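A holiday table with these columns can be built with pandas along the following lines. This is a sketch only: the holiday names and the dates are illustrative assumptions and should be replaced with the real event dates for the series being modeled.

```python
import pandas as pd

# Dates and holiday names are illustrative assumptions
avocado_day = pd.DataFrame({
    "holiday": "national_avocado_day",
    "ds": pd.to_datetime(["2017-07-31", "2018-07-31"]),
    "lower_window": -1,   # also include the day before the holiday
    "upper_window": 0,    # do not extend past the holiday itself
})
guacamole_day = pd.DataFrame({
    "holiday": "national_guacamole_day",
    "ds": pd.to_datetime(["2017-09-16", "2018-09-16"]),
    "lower_window": -1,
    "upper_window": 0,
})

# One row per occurrence of each holiday
holidays = pd.concat([avocado_day, guacamole_day], ignore_index=True)
print(holidays)
```

Such a frame is typically handed to Prophet through its holidays argument; check the Prophet documentation for the version in use.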
⚫ PERFORMANCE EVALUATION
FIXED PARTITIONING
We usually want to split the time series into a training period, a validation period, and a test period. This is called fixed partitioning.
If the time series has some seasonality, you generally want to make sure that each period contains a whole number of seasons.
It is also common to simply split it into two parts: training and evaluation.
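Fixed partitioning can be sketched as a chronological slice, with the boundaries expressed in whole seasons. The function name `fixed_partition`, its parameters and the 260-point stand-in series are assumptions made for this example.

```python
def fixed_partition(series, season_length, train_seasons, valid_seasons):
    """Split a time-ordered series into train/validation/test without shuffling.

    Boundaries are multiples of season_length, so the training and validation
    periods each contain a whole number of seasons.
    """
    train_end = train_seasons * season_length
    valid_end = train_end + valid_seasons * season_length
    return series[:train_end], series[train_end:valid_end], series[valid_end:]

# Stand-in for 5 years of weekly observations (52 points per yearly season)
series = list(range(260))
train, valid, test = fixed_partition(series, season_length=52,
                                     train_seasons=3, valid_seasons=1)
print(len(train), len(valid), len(test))   # 156 52 52
```

Because the data are never shuffled, every validation point comes after every training point, and every test point comes after every validation point.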
AVOCADO DATASET
https://colab.research.google.com/drive/1qiR5HtoV9mqo4hnPZ8WTzm-RCM5_TCc2?usp=sharing
https://peerj.com/preprints/3190/
⚫ Supervised ML
DECISION TREE
- White-box algorithm
- Handles classification and regression tasks, including multiclass tasks
- Decision trees ask a series of questions and perform a sequence of branching operations based on comparisons of some quantities.
- They predict the value of a target variable by learning simple decision rules inferred from the features of the data.
- They can estimate the probability that an instance belongs to a particular class k.
Decision Trees
Root node: depth 0, at the top.

Leaf node: does not have any children nodes, it doesn’t ask
any questions.
Samples attribute: counts how many training instances it applies to.

Value attribute: how many training instances of each class the node applies to.
From ‘Hands-On Machine Learning with Scikit-Learn and TensorFlow’ by Aurélien Géron.
Decision Trees
Gini attribute: measures the impurity.

A node is pure (gini=0) if all the training instances it applies to belong
to the same class. It is a measure of inequality between nodes.
Gini score in depth-2 left node:
$1 - (0/54)^2 - (49/54)^2 - (5/54)^2 \approx 0.168$
$G_i = 1 - \sum_{k=1}^{n} p_{i,k}^2$
$p_{i,k}$: ratio of class k instances among the training instances in the ith node.
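The Gini computation can be checked with a few lines of Python. The helper name `gini` and its input format (a list with the number of training instances per class) are assumptions for this sketch.

```python
def gini(class_counts):
    """Gini impurity of a node, given the number of training instances per class."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Depth-2 left node from the example: 0, 49 and 5 instances per class
print(round(gini([0, 49, 5]), 3))   # 0.168

# A pure node (all instances in one class) has zero impurity
print(gini([50, 0, 0]))             # 0.0
```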
Decision Trees
- Decision Trees are white box models: their decisions are easy to interpret, and it is easy to explain
how the predictions were made.
- Decision Trees can also estimate the probability that an instance belongs to a particular class k.
Decision Trees
CART (Classification and Regression Trees) Algorithm
- CART algorithm produces only binary trees (the one used by Scikit Learn).
- For classification, while training the model on observations, at each step, pick a feature and split
the dataset into two parts based on how best to reduce node impurities at the next lower level.
- As the number of features in each observation increases, it gets more difficult to find and select
the right feature, and the right value to split on.
- For regression, the predicted value of the dependent variable in a leaf node is computed as the
average of the target values of the training instances that fall in that leaf.
- Since the instances in a leaf are close to each other along the features that matter the most, this is
similar in spirit to fitting a separate local (piecewise) regression on related data (grouped by purity)
rather than performing a single global regression.
Decision Trees
To reduce variance, a stopping criterion may be used:
1- stop growing the tree at a specific number of leaf nodes (pruning)
2- when leaves contain a certain minimum number of observations.
CART (Classification and Regression Trees) algorithm first splits the training set into two
subsets using a single feature k and a threshold tk (e.g. petal length < 2.45 cm ).
It searches for the pair (k, tk) that produces the purest subsets (weighted by their size).
It tries to minimize a cost function:
$J(k, t_k) = \frac{m_{left}}{m} G_{left} + \frac{m_{right}}{m} G_{right}$
where:
$G_{left/right}$ measures the impurity of the left/right subset
$m_{left/right}$ is the number of instances in the left/right subset
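The search for the pair (k, t_k) can be sketched for a single feature as follows. This is a simplified illustration, not Scikit-Learn's implementation: the toy petal lengths and labels are made up, candidate thresholds are taken as midpoints between consecutive distinct values, and the helper names `gini` and `best_split` are assumptions.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    m = len(labels)
    if m == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / m) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, cost) minimizing J = m_left/m * G_left + m_right/m * G_right."""
    m = len(values)
    best = (None, float("inf"))
    candidates = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct feature values
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x < t]
        right = [y for x, y in zip(values, labels) if x >= t]
        cost = len(left) / m * gini(left) + len(right) / m * gini(right)
        if cost < best[1]:
            best = (t, cost)
    return best

# Toy data (made up): petal lengths in cm and their class labels
petal_length = [1.4, 1.5, 1.3, 4.7, 4.5, 5.1, 5.9, 6.0]
species = ["setosa", "setosa", "setosa",
           "versicolor", "versicolor", "versicolor",
           "virginica", "virginica"]
threshold, cost = best_split(petal_length, species)
print(threshold, round(cost, 3))   # 3.0 0.3
```

A full CART implementation would repeat this search over every feature and then recurse on the left and right subsets.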
Decision Trees
The CART algorithm is a greedy algorithm:

- it greedily searches for an optimum split at the top level, then repeats the process at each
level
- it doesn’t check whether or not the split will lead to the lowest possible impurity several
levels down
- it produces a good solution, but it’s not guaranteed that it’s the optimal one
Decision Trees
Gini Impurity and Entropy
- A set’s entropy is zero when it contains instances of only one class.
$H_i = -\sum_{k=1,\; p_{i,k} \neq 0}^{n} p_{i,k} \log(p_{i,k})$
$p_{i,k}$: ratio of class k instances among the training instances in the ith node.
- Most of the time there is no big difference between Gini and entropy.
- Gini impurity is slightly faster to compute.
- When Gini and entropy differ, entropy tends to produce slightly more balanced trees while
Gini tends to isolate the most frequent class in its own branch of the tree.
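Both impurity measures can be compared on the same node with a short sketch. The base-2 logarithm is an assumption here (the formula above does not state the base), and the helper names are illustrative.

```python
import math

def gini(probs):
    """Gini impurity from class ratios p_k."""
    return 1.0 - sum(p ** 2 for p in probs)

def entropy(probs):
    """Entropy from class ratios p_k; terms with p = 0 are skipped. Base 2 is assumed."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Class ratios of a node holding 0, 49 and 5 instances of three classes
node = [0 / 54, 49 / 54, 5 / 54]
print(round(gini(node), 3), round(entropy(node), 3))   # 0.168 0.445
```

Both measures are zero for a pure node and grow as the classes become more mixed, which is why they usually lead to similar trees.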
Decision Trees
Regularization Hyperparameters
- Nonparametric model: Decision Trees make few assumptions about the training data,
the number of parameters is not determined prior to training.
- Regularization: to avoid overfitting the training data, it is necessary to restrict the

freedom during training.
- Regularization: restrict the maximum depth of the Decision Tree (e.g. Scikit Learn
max_depth hyperparameter).
Decision Trees
Regression
- It predicts a value in each node instead of a class.

- The predicted value for each region is always the average target value of the instances in
that region.
- The CART algorithm splits the training set in a way that minimizes the MSE.
The cost function is:
$J(k, t_k) = \frac{m_{left}}{m} MSE_{left} + \frac{m_{right}}{m} MSE_{right}$
where:
$MSE_{node} = \sum_{i \in node} (\hat{y}_{node} - y^{(i)})^2$
$\hat{y}_{node} = \frac{1}{m_{node}} \sum_{i \in node} y^{(i)}$
Ensemble, voting & Random Forest
⚫ Ensemble classifiers
Hard voting
Soft voting (probabilities)
⚫ Bagging and Pasting
Bagging = ?
Pasting = ?
⚫ Supervised ML
RANDOM FOREST
• Aimed at classification or regression tasks.
• Each tree can be trained on a different random subset of the training set.
• Predictions are obtained from the decisions of the individual trees: the class that gets the majority of the votes (or the average, for regression) is the one chosen.
• Instead of searching for the best feature when splitting a node, it searches for the best feature among a random subset of features.
IT IS NO LONGER A WHITE BOX, SO THE REASONS BEHIND ITS DECISIONS ARE NOT EASY TO INTERPRET.
Decision Trees - Ensemble Learning and Random Forest
- If you aggregate the predictions of a group of predictors, better predictions than
with the best individual predictor are often obtained.
- A group of predictors is called an ensemble.
- Random Forest: a group of Decision Trees classifiers (or regressors), each trained on
a different random subset of the training set.
- For predictions you obtain the predictions of all individual trees and then predict the
class that gets majority of votes (or the average for regression).
- Random Forest: each predictor will be trained on a random subset of the input
features.
- Random Forest: instead of searching for the very best feature when splitting a node,
it searches for the best feature among a random subset of features.
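The aggregation step described above can be sketched in a few lines. The individual tree outputs are made up, and the helper names `majority_vote` and `average_prediction` are assumptions for this illustration.

```python
from collections import Counter

def majority_vote(predictions):
    """Classification: return the class predicted by most of the individual predictors."""
    return Counter(predictions).most_common(1)[0][0]

def average_prediction(predictions):
    """Regression: average the individual predictions."""
    return sum(predictions) / len(predictions)

tree_votes = ["dog", "cat", "dog", "dog", "cat"]   # made-up outputs of five trees
print(majority_vote(tree_votes))                    # dog
print(average_prediction([2.0, 3.0, 4.0]))          # 3.0
```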
Decision Trees - Ensemble Learning and Random Forest
Random Forest
- At each node only a random subset of the features is considered for splitting.
- Each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap
sample) from the training set.
- Feature importance: Random Forest measures the relative importance of each feature.
In Scikit-Learn: how much the tree nodes that use that feature reduce impurity on average
across all trees in the forest. It is a weighted average, where each node's weight is equal to the
number of training samples that are associated with it (feature_importances_ variable).
Supervised ML
SVM
SVM performs:
- Linear and non-linear regression
- Classification and outlier detection.
- An SVM model represents the examples as points in space, so that the examples of the
separate categories are divided by a gap that is as wide as possible.
There are linear and kernel versions.
- Linear SVM: the line that maximizes this margin is chosen as the optimal model. Support vectors
are the training points that touch the margin.
- Kernel SVM: data that are not linearly separable are projected into a higher-dimensional space
so that non-linear relationships can be fit with a linear classifier.
NOTEBOOKS
DECISION TREES
https://drive.google.com/file/d/1Gi_IBvRdSXQALJJ_F99-L-KRiww-IwED/view?usp=sharing
RANDOM FOREST
https://drive.google.com/file/d/1aDoZqWmpMHq_xcHDcjotKzgrnbFUeP81/view?usp=sharing
(if only text appears, choose the option to open it with Google Colab)
Imbalanced datasets
WHAT TO DO WITH IMBALANCED DATA?
It is important to have a metric that gives us a realistic view of what is happening in the data.
Confusion matrix
         0        1
0  699,991        0
1      602        0
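A quick calculation shows why plain accuracy is misleading for this matrix. The sketch assumes that rows are the actual classes and columns the predicted classes, which the slide does not state.

```python
# Assumption: rows are the actual classes, columns are the predicted classes
confusion = [[699_991, 0],
             [602, 0]]
tn, fp = confusion[0]
fn, tp = confusion[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall_minority = tp / (tp + fn)   # fraction of class-1 cases that were found
print(f"accuracy = {accuracy:.4f}, recall of class 1 = {recall_minority:.2f}")
```

Under that assumption the model is right more than 99.9% of the time, yet it never identifies a single instance of class 1.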
⚫ Precision-recall
EXAMPLE: BANK MARKETING
USING A PREDICTIVE MODEL FOR A NEW CREDIT CARD
Almost no ML model is perfect, so:
• there will be customers we contact because the model predicted they would accept, but who actually do not (False Positive [FP]).
• there will also be customers we do not contact because the model predicted they would not accept, but who actually would have (False Negative [FN]).
The machine learning model will also get things right (hopefully often). In practical terms this means that:
• there will be customers we contact because the model predicted they would accept, and who actually do (True Positive [TP]).
• there will be customers we do not contact because the model predicted they would not accept the offer, and who actually do not (True Negative [TN]).
Precision
The precision metric measures the quality of the machine learning model's positive predictions in classification tasks. In the example, precision is the answer to the question: what percentage of the customers we contact will be interested?
That is, only 33% of the customers we contact will actually be interested. This means the model in the example will be wrong about 67% of the time when it predicts that a customer will be interested.
Recall
The recall metric tells us how many of the actual positive cases the machine learning model is able to identify. In the example, recall is the answer to the question: what percentage of the customers who are interested are we able to identify?
That is, the model is only able to identify 25% of the customers who would be interested in acquiring the product. This means the model in the example can only identify 1 out of every 4 customers who would accept the offer.
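Both metrics follow directly from the TP, FP and FN counts. The counts below are hypothetical, chosen only so that they reproduce the 33% and 25% figures quoted in the example; the slides do not give the underlying numbers.

```python
def precision(tp, fp):
    """Share of contacted customers (predicted positives) who are really interested."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of truly interested customers that the model manages to identify."""
    return tp / (tp + fn)

# Hypothetical counts (not from the source)
tp, fp, fn = 5, 10, 15
print(f"precision = {precision(tp, fp):.0%}")   # 33%
print(f"recall    = {recall(tp, fn):.0%}")      # 25%
```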
WHAT TO DO WITH IMBALANCED DATA?
1) Collect more data
2) Balance the data
3) Penalize mistakes on the minority class
4) Evaluate more algorithms
Collect more data
• Is the sample we have significant?
• Have we cleaned the data properly?
• Have we normalized the data properly?
• Are we introducing any bias?
Balance the data
Use the same number of examples from each class for training.
PENALIZE MISTAKES ON THE MINORITY CLASS
This is done by giving each class a weight, so that errors on the rarer class count more:
w_j = n_samples / (n_classes * n_samples_j)
• w_j is the weight of each class (j denotes the class)
• n_samples is the total number of samples, i.e. rows in the dataset.
• n_classes is the total number of unique classes in the labels.
• n_samples_j is the total number of rows in class j.
• Example:
n_samples = 43400, n_classes = 2 (0 and 1), n_samples_0 = 42617, n_samples_1 = 783
Weight for class 0:
w_0 = 43400 / (2 * 42617) = 0.509
Weight for class 1:
w_1 = 43400 / (2 * 783) = 27.714
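The weight formula can be reproduced in plain Python with the class counts from the example. The function name `balanced_class_weights` and its dictionary input are assumptions for this sketch.

```python
def balanced_class_weights(class_counts):
    """w_j = n_samples / (n_classes * n_samples_j) for each class j."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Class counts from the example: 42617 rows of class 0 and 783 rows of class 1
weights = balanced_class_weights({0: 42617, 1: 783})
print({label: round(w, 3) for label, w in weights.items()})   # {0: 0.509, 1: 27.714}
```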
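Option with scikit-learn
import numpy as np
from sklearn.utils import class_weight

# train_batches is assumed to be defined earlier (e.g. a Keras data iterator);
# its .classes attribute holds the integer label of every training sample.
labels = train_batches.classes
classes = np.unique(labels)
# Recent scikit-learn versions require keyword arguments here.
weights = class_weight.compute_class_weight(class_weight='balanced', classes=classes, y=labels)
# Map each class label to its weight (renamed so it does not shadow the imported module).
class_weights = dict(zip(classes, weights))
print(class_weights)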
⚫ PRACTICE
⚫ https://colab.research.google.com/drive/12lNeaaCD_ytzX8BCUIH4V-Mfkf5kc7iF?usp=sharing
Machine Learning
Unsupervised Learning
k-Means
- k-Means algorithm searches for a pre-determined number of

clusters within an unlabeled multidimensional dataset.
- The cluster center is the arithmetic mean of all the points
belonging to the cluster.
- The partitions try to minimize the within-cluster sum of squares
(inertia).
- Each point is closer to its own cluster center than to other cluster
centers.
From ‘Python Data Science Handbook’ by Jake Vander Plas.
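The procedure can be sketched with a minimal NumPy version of Lloyd's algorithm. It is an illustration only: the random initialization, the stopping rule and the two synthetic blobs are assumptions, and library implementations such as Scikit-Learn's KMeans add smarter initialization and multiple restarts.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct points drawn from the data
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two well-separated synthetic blobs (made-up data)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
centers, labels = k_means(X, k=2)

# Inertia: within-cluster sum of squared distances to the assigned center
inertia = float(((X - centers[labels]) ** 2).sum())
print(centers.round(2), round(inertia, 2))
```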

⚫ https://colab.research.google.com/drive/1DJx6PaFEklZrjT0vE85qAXIh3Lk-opZq?usp=sharing
⚫ Exercise:
⚫ https://colab.research.google.com/drive/1P7_2tJ5OeAiB9Zxlm_jjiWYrMNt9O2aO?usp=sharing
⚫ EXTRAS:
⚫ Detailed explanation of the k-means procedure:
⚫ https://github.com/jakevdp/PythonDataScienceHandbook/blob/8a34a4f653bdbdc01415a94dc20d4e9b97438965/notebooks/05.11-K-Means.ipynb
⚫ Useful applications:
⚫ https://github.com/ageron/handson-ml3/blob/main/09_unsupervised_learning.ipynb
Machine Learning
Principal Component Analysis (PCA)
- Dimensionality reduction: when there are many features (e.g. thousands or millions) for each
training instance it makes training slow and it could be hard to find a good solution.
- PCA:
- identifies the hyperplane that lies closest to the data
- projects the data onto the hyperplane
- selects the projection that preserves the maximum amount of variance
Dimensionality Reduction
- When there are many features (e.g. thousands or millions) for each training instance it
makes training slow and it could be hard to find a good solution.
- Reducing dimensionality of the training set before training a model speeds up training.
- Reducing dimensionality does lose some information.
- It is useful for data visualization.

The Curse of Dimensionality
- High-dimensional datasets are at risk of being very sparse: most training instances are far
away from each other, making predictions less reliable since they will be based on larger
extrapolations.
- Sparsity is a problem for statistical significance: the amount of data needed to support the
result often grows exponentially with the dimensionality.

Approaches for Dimensionality Reduction
1- Projection
- In real-world problems, training instances are not spread out uniformly across all dimensions.
- Then, training instances lie within a lower-dimensional subspace of the high-dimensional space.
Before projection After projection
Principal Component Analysis(PCA)
- Unsupervised method for dimensionality reduction of the data

- Identifies the hyperplane that lies closest to the data.
- Then, it projects the data onto it.
- Selects the projection that preserves the maximum amount of variance (the axis that
minimizes the mean squared distance between the original dataset and its projection onto
that axis).
$X_{d\text{-proj}} = X W_d$
X: matrix containing the training set
W_d: matrix containing the first d principal components
This projects the training set onto the space defined by the principal components.
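The projection can be sketched with NumPy's SVD. The synthetic 3-D data (constructed to lie close to a 2-D plane) and the variable names are assumptions for this illustration; Scikit-Learn's PCA class wraps the same steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3-D data that lies close to a 2-D plane (made up for illustration)
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.0, 1.0, 0.3]])
X = latent @ mixing + 0.01 * rng.normal(size=(200, 3))

# PCA assumes centered data
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal components (unit vectors), ordered by variance
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

d = 2
W_d = Vt[:d].T                 # columns are the first d principal components
X_d_proj = X_centered @ W_d    # projection of the training set onto d dimensions

explained_variance_ratio = s**2 / np.sum(s**2)
print(X_d_proj.shape, explained_variance_ratio.round(4))
```

Since the data were built around a plane, the first two components should capture nearly all of the variance.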
Exercise notebook 1
https://colab.research.google.com/drive/1n4LHgBOMgoDWHpoQZs8BUL76L2TbC2YT?usp=sharing
https://colab.research.google.com/drive/1V8u6wDNiAnWjMtOBEEJ4TIXEvTLKhbEs?usp=sharing
Principal Components
- PCA identifies the axis that accounts for the largest amount of variance in the training set.
- PCA finds a second axis, orthogonal to the first one, that accounts for the largest amount
of remaining variance.
- PCA would also find a third axis, orthogonal to both previous axes, etc.
- The unit vector that identifies the ith axis is called the ith principal component.
- PCA assumes that the dataset is centered around the origin. Scikit-Learn’s PCA classes
center the data for you.
Scikit-Learn
- After fitting the PCA transformer to the dataset, the principal components can be accessed using the
components_ variable (the first one: pca.components_.T[:, 0]).
- Explained variance ratio of each principal component: the proportion of the dataset’s variance
that lies along the axis of each principal component, explained_variance_ratio_ variable.
PCA (Principal Component Analysis)
Choosing the Right Number of Dimensions
- It is generally useful to choose the number of dimensions that add up to a large proportion of
the variance (e.g. 95%).
- In case of data visualization the dimensionality is usually reduced to 2 or 3 dimensions.
- Scikit-Learn: set n_components to a float between 0.0 and 1.0, indicating the ratio of
variance to preserve, e.g. PCA(n_components=0.95).
- Plot the explained variance as a function of the number of dimensions.
[Figure: explained variance versus n° of components / dimensions.] From ‘Python Data Science Handbook’ by Jake VanderPlas.
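Choosing the number of dimensions from the cumulative explained variance can be sketched as follows. The helper name `n_components_for_variance` and the synthetic data are assumptions; with Scikit-Learn, passing a float to n_components performs the equivalent selection.

```python
import numpy as np

def n_components_for_variance(X, target=0.95):
    """Smallest number of principal components whose cumulative explained variance >= target."""
    X_centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    ratios = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(ratios)
    # argmax returns the first index where the condition holds
    return int(np.argmax(cumulative >= target) + 1)

# Made-up data: five features with standard deviations 10, 5, 1, 0.5 and 0.1
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])
print(n_components_for_variance(X, target=0.95))
```

In this synthetic case the two high-variance features dominate, so two components are enough to pass the 95% threshold.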
PCA for Compression
- After dimensionality reduction, the training set takes up less space.
- Speeds up an algorithm like SVM.
- It’s possible to decompress the reduced dataset by applying the inverse transformation
of the PCA.
PCA (Principal Component Analysis)
Disadvantages
- PCA tends to be highly affected by outliers in the data.
- PCA assumes that the principal components are a linear combination of the original features.
- PCA assumes that the principal components are orthogonal.
- PCA uses variance as the measure of how important a particular dimension is.
- High variance axes are treated as principal components.
- Low variance axes are treated as noise.

Machine Learning
t-Distributed Stochastic Neighbor
Embedding (t-SNE)
- Dimensionality reduction: tries to keep similar instances close and dissimilar instances apart.
- It is useful for visualization.
https://towardsdatascience.com/an-introduction-to-t-sne-with-python-example-5a3a293108d1
