
Industria
Isabella Castillo, Danna Cardona, Brian Carreño
11/12/2020

Industria series
We begin by loading the libraries needed throughout the document and importing the dataset to be analyzed.

library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(lmtest)

## Loading required package: zoo

##
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
##
##     as.Date, as.Date.numeric

library(readxl)
library(tsoutliers)

## Warning: package 'tsoutliers' was built under R version 4.0.3

library(tseries)
library(car)

## Loading required package: carData

library(readr)
series <- read_delim("C:/Users/brian.carreno/Downloads/series.csv",
";", escape_double = FALSE, trim_ws = TRUE)


## Parsed with column specification:
## cols(
## Mes = col_character(),
## IND = col_number(),
## ENE = col_number(),
## CAN = col_number()
## )

Plot of the original series

The variable is converted to a time-series object and the original series is plotted.

library(readxl)
serieORI <- read_excel("C:/Users/brian.carreno/Downloads/series (1).xlsx")
VarTS<- serieORI$IND
VarTS <- ts(VarTS,start=c(2000,01),frequency=12)
plot(VarTS)

Box-Cox
To assess whether a Box-Cox transformation is needed, we inspect the results from the forecast package using the "guerrero" and "loglik" methods. We choose λ = 2, the value suggested by the "loglik" method.
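For reference, for a positive series the transformation applied by forecast::BoxCox is the standard Box-Cox form,

$$y^{(\lambda)} = \begin{cases} (y^{\lambda}-1)/\lambda, & \lambda \neq 0,\\ \log y, & \lambda = 0,\end{cases}$$

so with λ = 2 the transformed series is essentially a rescaled square of the original.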


BoxCox.lambda(VarTS, method = "guerrero", lower = 0, upper = 2)

## [1] 0.429797

BoxCox.lambda(VarTS,method="loglik",lower=0)

## [1] 2

#print("Re-check Box-Cox on the already transformed variable")

logVar=forecast::BoxCox(VarTS,lambda=2)
BoxCox.lambda(logVar,method="loglik",lower=0)

## [1] 1.05

plot(logVar)

After transforming the variable and re-checking whether another transformation is needed, a λ of approximately 1 is suggested, so we keep the transformation with λ = 2 and continue with the modeling procedure.


Ordinary differencing
To assess whether differencing is needed, we run the augmented Dickey-Fuller test.
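As a reminder (standard formulation; tseries::adf.test fits the regression with a constant and a linear trend), the test is based on

$$\Delta y_t = \alpha + \beta t + \gamma\, y_{t-1} + \sum_{i=1}^{k} \delta_i\, \Delta y_{t-i} + \varepsilon_t,$$

with H0: γ = 0 (unit root) against H1: γ < 0, so small p-values favor stationarity.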

# Augmented Dickey-Fuller test

acf(logVar)

tseries::adf.test(logVar) # The slowly decaying ACF suggests a unit root in the series

##
## Augmented Dickey-Fuller Test
##
## data: logVar
## Dickey-Fuller = -3.5606, Lag order = 5, p-value = 0.03812
## alternative hypothesis: stationary

# Check whether a second difference is needed

dlogVar=diff(logVar)
tseries::adf.test(dlogVar,k=12) # No further differencing is needed


##
## Augmented Dickey-Fuller Test
##
## data: dlogVar
## Dickey-Fuller = -3.824, Lag order = 12, p-value = 0.01884
## alternative hypothesis: stationary

plot(dlogVar)


Seasonal differencing

After the ordinary differencing, we check whether a seasonal difference is also needed and apply it if so.
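In backshift notation, the two differences used below are

$$\nabla y_t = (1-B)\,y_t = y_t - y_{t-1}, \qquad \nabla_{12}\, y_t = (1-B^{12})\,y_t = y_t - y_{t-12},$$

so dlogVar is the ordinary difference of the transformed series and DdlogVar applies the seasonal difference on top of it.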

# Assess seasonal behavior

monthplot(dlogVar)


nsdiffs(dlogVar)

## [1] 0

# The month plots clearly show a varying mean, so we take a seasonal difference

DdlogVar=diff(dlogVar,lag=12)
par(mfrow=c(2,1))
plot(dlogVar)
plot(DdlogVar)


# Check for another seasonal root

par(mfrow=c(1,1))
monthplot(DdlogVar)


nsdiffs(DdlogVar)

## [1] 0

The month plots of the seasonally differenced series show that no further seasonal differencing is needed, so we move on to the autocorrelation functions.

acf(DdlogVar,lag.max = 48, ci.type='ma')


pacf(DdlogVar,lag.max = 60)


Model fitting

From the ACF and PACF plots, a first candidate is modelo1, SARIMA(p = 3, d = 1, q = 1) × (P = 0, D = 1, Q = 1) with s = 12, and a second candidate, modelo2, SARIMA(p = 2, d = 1, q = 1) × (P = 0, D = 1, Q = 1) with s = 12, is also considered. Both models are fitted and the best fit is selected.
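In backshift notation, a SARIMA(p, d, q) × (P, D, Q)_s model can be written as

$$\phi_p(B)\,\Phi_P(B^{s})\,(1-B)^{d}\,(1-B^{s})^{D}\, y_t = \theta_q(B)\,\Theta_Q(B^{s})\,\varepsilon_t,$$

where here y_t is the Box-Cox-transformed series (λ = 2), s = 12, and ε_t is white noise.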

modelo1 = forecast::Arima(VarTS, c(3, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12), lambda = 2, method = c("CSS-ML"))
coeftest(modelo1)

##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -1.38673 0.55729 -2.4883 0.01283 *
## ar2 -0.92963 0.50122 -1.8548 0.06363 .
## ar3 -0.26316 0.25819 -1.0193 0.30808
## ma1 0.46535 0.56162 0.8286 0.40733
## sma1 -0.95812 0.14958 -6.4055 1.498e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


modelo2 = forecast::Arima(VarTS, c(2, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12), lambda = 2, method = c("CSS-ML"))
coeftest(modelo2)

##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.862175 0.119224 -7.2316 4.775e-13 ***
## ar2 -0.446563 0.088798 -5.0290 4.931e-07 ***
## ma1 -0.050161 0.129608 -0.3870 0.6987
## sma1 -0.966022 0.182143 -5.3036 1.135e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

modelo3 = forecast::Arima(VarTS, c(2, 1, 0), seasonal = list(order = c(0, 1, 1), period = 12), lambda = 2, method = c("CSS-ML"))
coeftest(modelo3)

##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.900803 0.061635 -14.6152 < 2.2e-16 ***
## ar2 -0.469875 0.061576 -7.6308 2.332e-14 ***
## sma1 -0.972106 0.218806 -4.4428 8.881e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

AIC(modelo1);AIC(modelo2);AIC(modelo3)

## [1] 2841.896

## [1] 2840.258

## [1] 2838.406

Comparing modelo1 and modelo2, we choose the second model based on the significance of its estimates and on the AIC. Dropping the MA order from modelo2 gives modelo3, whose AIC is lower still and whose parameters are all significant.
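Plugging in the rounded estimates (with R's sign convention, in which the fitted AR polynomial is 1 − φ₁B − φ₂B², so φ̂₁ = −0.90 and φ̂₂ = −0.47 enter with positive signs), modelo3 corresponds approximately to

$$(1 + 0.90B + 0.47B^{2})\,(1-B)(1-B^{12})\, y_t^{(\lambda)} = (1 - 0.97B^{12})\,\varepsilon_t.$$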

Residual analysis and verification of assumptions


residuales <- modelo3$residuals
plot(residuales)


acf(residuales)


pacf(residuales)


We now run the Jarque-Bera and Ljung-Box tests and draw the CUSUM and CUSUMSQ charts.
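For reference, the two test statistics in their usual forms are

$$JB = \frac{n}{6}\left(S^{2} + \frac{(K-3)^{2}}{4}\right) \sim \chi^{2}_{2}, \qquad Q(h) = n(n+2)\sum_{k=1}^{h}\frac{\hat{\rho}_k^{2}}{n-k} \sim \chi^{2}_{h-\mathrm{fitdf}},$$

where S and K are the sample skewness and kurtosis of the residuals and fitdf discounts the number of estimated ARMA parameters (here fitdf = 3).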

par(mfrow=c(1,1))
jarque.bera.test(residuales) # p-value = 5.693e-05

##
## Jarque Bera Test
##
## data: residuales
## X-squared = 19.547, df = 2, p-value = 5.693e-05

# Normality is rejected, but let us check whether the tails are very heavy

qqPlot(residuales) # For now the normality assumption fails, but this may be corrected by removing the detected outliers


## [1] 199 99

# Autocorrelation test

acf(residuales,lag.max = 30)


Box.test(residuales,lag =24, type = "Ljung-Box", fitdf = 3)

##
## Box-Ljung test
##
## data: residuales
## X-squared = 29.368, df = 21, p-value = 0.1054


### CUSUM statistics

res=residuales
cum=cumsum(res)/sd(res)
N=length(res)
cumq=cumsum(res^2)/sum(res^2)
Af=0.948 # 95% quantile for the CUSUM statistic
co=0.14422 # Approximate CUSUMSQ quantile at n/2
LS=Af*sqrt(N)+2*Af*c(1:length(res))/sqrt(N)
LI=-LS
LQS=co+(1:length(res))/N
LQI=-co+(1:length(res))/N
par(mfrow=c(2,1))
plot(cum,type="l",ylim=c(min(LI),max(LS)),xlab="t",ylab="",main="CUSUM")
lines(LS,type="S",col="red")
lines(LI,type="S",col="red")

#CUSUM Square

plot(cumq,type="l",xlab="t",ylab="",main="CUSUMSQ")
lines(LQS,type="S",col="red")

lines(LQI,type="S",col="red")
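
For reference, the statistics plotted above are, matching the code,

$$\mathrm{CUSUM}_t = \frac{1}{\hat{\sigma}_e}\sum_{i=1}^{t} e_i, \qquad \mathrm{CUSUMSQ}_t = \frac{\sum_{i=1}^{t} e_i^{2}}{\sum_{i=1}^{N} e_i^{2}},$$

with 95% bands ±(0.948√N + 2 · 0.948 · t/√N) for CUSUM and t/N ± 0.14422 for CUSUMSQ.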


# Outlier analysis

coef = coefs2poly(modelo3)
outliers = locate.outliers(residuales,coef, cval = 3.5)
outliers

## type ind coefhat tstat
## 1 AO 159 -625.2645 -3.848949
## 2 AO 199 -774.9885 -4.765641
## 3 LS 97 -519.0753 -3.832791
## 5 TC 94 579.9554 4.040323
## 6 TC 200 637.3344 4.428757

n <- length(VarTS)
xreg = outliers.effects(outliers,n)

Five outliers are detected: two additive outliers (AO) at observations 159 and 199, a level shift (LS) at 97, and two transitory changes (TC) at 94 and 200. The procedure is repeated to check whether more outliers are found once the model is adjusted for those already detected.
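In the intervention notation used by tsoutliers, with I_t(τ) an indicator pulse at time τ, the three detected types act on the underlying series z_t as

$$\text{AO: } z_t + \omega\, I_t(\tau), \qquad \text{LS: } z_t + \frac{\omega}{1-B}\, I_t(\tau), \qquad \text{TC: } z_t + \frac{\omega}{1-\delta B}\, I_t(\tau),$$

where δ (0.7 by default in tsoutliers) controls how quickly the transitory change decays.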

modelSO = Arima(VarTS, c(2, 1, 0), seasonal = list(order = c(0, 1, 1), period = 12), lambda = 2, method = c("CSS-ML"), xreg = xreg)
resi_analisis = modelSO$residuals
coef_analisis = coefs2poly(modelSO)
outliers_analisis = locate.outliers(resi_analisis,coef_analisis)
outliers_analisis

## [1] type ind coefhat tstat
## <0 rows> (or 0-length row.names)

coeftest(modelo3)

##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.900803 0.061635 -14.6152 < 2.2e-16 ***
## ar2 -0.469875 0.061576 -7.6308 2.332e-14 ***
## sma1 -0.972106 0.218806 -4.4428 8.881e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

coeftest(modelSO)


##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.930075 0.060870 -15.2797 < 2.2e-16 ***
## ar2 -0.501418 0.060767 -8.2515 < 2.2e-16 ***
## sma1 -0.878631 0.069627 -12.6191 < 2.2e-16 ***
## AO159 -645.010772 186.982019 -3.4496 0.0005614 ***
## AO199 -710.718604 196.988691 -3.6079 0.0003087 ***
## LS97 -292.657105 173.508641 -1.6867 0.0916610 .
## TC94 453.273201 182.805075 2.4795 0.0131551 *
## TC200 485.822450 169.618683 2.8642 0.0041806 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

AIC(modelo3); AIC(modelSO)

## [1] 2838.406

## [1] 2800.336

# No further outliers were found

Running the outlier-detection procedure again finds no additional atypical observations, so we proceed with the residual analysis of modelSO. Note also that, by the AIC criterion, modelSO is the better model.

par(mfrow=c(1,1))
jarque.bera.test(resi_analisis ) #p-value = 0.6431

##
## Jarque Bera Test
##
## data: resi_analisis
## X-squared = 0.8829, df = 2, p-value = 0.6431

qqPlot(resi_analisis) # Normality is not rejected and no heavy tails are observed


## [1] 99 87

# Autocorrelation test

acf(resi_analisis ,lag.max = 30)


Box.test(resi_analisis, lag = 24, type = "Ljung-Box", fitdf = 8) # p-value = 0.1948; the null hypothesis is not rejected

##
## Box-Ljung test
##
## data: resi_analisis
## X-squared = 20.591, df = 16, p-value = 0.1948


### CUSUM statistics

res2=resi_analisis
cum=cumsum(res2)/sd(res2)
N=length(res2)
cumq=cumsum(res2^2)/sum(res2^2)
Af=0.948 # 95% quantile for the CUSUM statistic
co=0.14422 # Approximate CUSUMSQ quantile at n/2
LS=Af*sqrt(N)+2*Af*c(1:length(res2))/sqrt(N)
LI=-LS
LQS=co+(1:length(res2))/N
LQI=-co+(1:length(res2))/N
par(mfrow=c(2,1))
plot(cum,type="l",ylim=c(min(LI),max(LS)),xlab="t",ylab="",main="CUSUM")
lines(LS,type="S",col="red")
lines(LI,type="S",col="red")

#CUSUM Square
plot(cumq,type="l",xlab="t",ylab="",main="CUSUMSQ")
lines(LQS,type="S",col="red")
lines(LQI,type="S",col="red")

The residual analysis is repeated taking the previously detected outliers into account. We find that the normality assumption now holds and that the model shows no autocorrelation. Likewise, the CUSUMSQ chart of marginal variance homogeneity looks good compared with the chart for the model that ignores the outliers. Although one lag falls outside the bands in the autocorrelation function, the Ljung-Box test shows no serial autocorrelation.
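The model is then evaluated with rolling one-step-ahead forecasts: at each origin the model, with its parameters held fixed, is re-applied to the data available up to that point, the next value is forecast, and the mean squared error over the m validation points is

$$ECM = \frac{1}{m}\sum_{t}\left(y_{t+1} - \hat{y}_{t+1\mid t}\right)^{2}.$$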

train <- window(VarTS,start=c(2000,01),end=c(2016,08))
test <- window(VarTS,start=c(2016,09),end=c(2017,11))

h <- 1
n <- length(test) - h + 1
fc <- ts(numeric(n), start=c(2016,09), freq=12)
fitmodelo <- Arima(VarTS, c(2, 1, 0), seasonal = list(order = c(0, 1, 1), period = 12), lambda = 2, method = c("CSS-ML"), xreg = xreg)
for(i in 1:n){
  x <- window(VarTS, end = c(2016, 08+(i-1)))
  refit <- Arima(x, model = fitmodelo, xreg = xreg[1:(length(train)+(i-1)),])
  fc[i] <- forecast::forecast(refit, h=h, xreg = xreg[1:(length(train)+(i-1)),])$mean[h]
}

plot(cbind(test, fc),
plot.type = "single",
col = c("red", "blue"))

dife=(test-fc)^2
ecm=(1/(length(test)))*sum(dife)
ecm


## [1] 11.97132

The mean squared error of our SARIMA model is 11.97; compared with the tree-based model it is the better model, since the mean squared error of that method was 17.8. That error is much larger still when the 17 covariates initially suggested for the tree model are used.

Parcial1

December 11, 2020

1 Parcial 2 (practical part)

Danna Cardona, Brian Carreño, Isabella Castillo
Industry time series, measured monthly from January 2000 to November 2017.
[2]: import pandas as pd
from pandas import Series
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
import numpy as np
import scipy as sp
import matplotlib.pylab as plt
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6
from matplotlib import pyplot

1.1 Descriptive analysis

We start by importing the data and plotting the series and its autocorrelation function up to 20 lags.

[10]: data = pd.read_excel("C:/Users/ASUS/Documents/Universidad Nacional/Septimo Semestre/Series de Tiempo/series.xlsx")

# Convert the dataset into a time series
data['Mes'] = pd.to_datetime(data['Mes'])
indice = data.set_index('Mes')
ts = indice['IND']
# Set the index frequency
indice.index.freq='MS'
ts.index.freq='MS'

# Plot the series
plt.plot(ts)
plt.title('Industria')

[10]: Text(0.5, 1.0, 'Industria')

[21]: from statsmodels.graphics.tsaplots import plot_acf

plot_acf(ts,lags=20)
pyplot.show()

The ACF shows that the series has an annual cycle: over a 12-month period the mean value depends on the month considered. For the same reason, there is a well-defined autocorrelation structure, with values outside the confidence bands.
The series is not stationary.
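For reference, the sample autocorrelation plotted here is

$$\hat{\rho}_k = \frac{\sum_{t=1}^{n-k}(y_t-\bar{y})(y_{t+k}-\bar{y})}{\sum_{t=1}^{n}(y_t-\bar{y})^{2}},$$

with approximate 95% confidence bands at ±1.96/√n.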

1.2 Box-Cox transformation
We now analyze whether a Box-Cox transformation is needed by checking the lambda recommended by the scipy library.

[12]: import scipy.stats

print(sp.stats.boxcox(ts,alpha=0.05)[1])

2.40614965681135
As the output above shows, a λ ≈ 2.4 is suggested. R chose λ = 2, but here the procedure follows Python's suggestion.

[43]: logCANbx = sp.stats.boxcox(data['IND'],lmbda=2.40614965681135)


data = data.assign(logCANbx=logCANbx)

indice2 = data.set_index('Mes')
logCAN = indice2['logCANbx']
# Set the index frequency
logCAN.index.freq='MS'

logCAN.plot()
pyplot.show()

The plot above shows the Box-Cox-transformed series, which changes slightly relative to the untransformed case.

1.3 Lags of the variable to use as covariates

Based on the ACF plot, we propose using the first 17 lags as covariates for fitting a decision-tree model, since these lags capture the annual cycle of the series and show the strongest autocorrelation.

1.4 Decision trees
We begin by creating the lagged variables.

[44]: from pandas import DataFrame

df1 = DataFrame()
CANdf = pd.DataFrame(logCAN)

# Covariates

for i in range(17,0,-1):
    df1[['t-'+str(i)]] = CANdf.shift(i)

print(df1)

t-17 t-16 t-15 t-14 \


Mes
2000-01-01 NaN NaN NaN NaN
2000-02-01 NaN NaN NaN NaN
2000-03-01 NaN NaN NaN NaN
2000-04-01 NaN NaN NaN NaN
2000-05-01 NaN NaN NaN NaN
... ... ... ... ...
2017-07-01 27731.316673 28375.494769 29963.134836 30760.882836
2017-08-01 28375.494769 29963.134836 30760.882836 30808.562354
2017-09-01 29963.134836 30760.882836 30808.562354 26246.608474
2017-10-01 30760.882836 30808.562354 26246.608474 35655.596812
2017-11-01 30808.562354 26246.608474 35655.596812 34656.168082

t-13 t-12 t-11 t-10 \


Mes
2000-01-01 NaN NaN NaN NaN
2000-02-01 NaN NaN NaN NaN
2000-03-01 NaN NaN NaN NaN
2000-04-01 NaN NaN NaN NaN
2000-05-01 NaN NaN NaN NaN
... ... ... ... ...
2017-07-01 30808.562354 26246.608474 35655.596812 34656.168082
2017-08-01 26246.608474 35655.596812 34656.168082 33767.960888
2017-09-01 35655.596812 34656.168082 33767.960888 34259.994248
2017-10-01 34656.168082 33767.960888 34259.994248 35283.895278
2017-11-01 33767.960888 34259.994248 35283.895278 25111.497251

t-9 t-8 t-7 t-6 \


Mes
2000-01-01 NaN NaN NaN NaN
2000-02-01 NaN NaN NaN NaN

2000-03-01 NaN NaN NaN NaN
2000-04-01 NaN NaN NaN NaN
2000-05-01 NaN NaN NaN NaN
... ... ... ... ...
2017-07-01 33767.960888 34259.994248 35283.895278 25111.497251
2017-08-01 34259.994248 35283.895278 25111.497251 25585.302073
2017-09-01 35283.895278 25111.497251 25585.302073 31763.375649
2017-10-01 25111.497251 25585.302073 31763.375649 24904.091216
2017-11-01 25585.302073 31763.375649 24904.091216 30307.993441

t-5 t-4 t-3 t-2 \


Mes
2000-01-01 NaN NaN NaN NaN
2000-02-01 NaN NaN NaN NaN
2000-03-01 NaN NaN NaN 8834.807750
2000-04-01 NaN NaN 8834.807750 10028.258952
2000-05-01 NaN 8834.807750 10028.258952 12107.160159
... ... ... ... ...
2017-07-01 25585.302073 31763.375649 24904.091216 30307.993441
2017-08-01 31763.375649 24904.091216 30307.993441 29885.893791
2017-09-01 24904.091216 30307.993441 29885.893791 30405.357158
2017-10-01 30307.993441 29885.893791 30405.357158 33135.698008
2017-11-01 29885.893791 30405.357158 33135.698008 33046.466470

t-1
Mes
2000-01-01 NaN
2000-02-01 8834.807750
2000-03-01 10028.258952
2000-04-01 12107.160159
2000-05-01 9561.499369
... ...
2017-07-01 29885.893791
2017-08-01 30405.357158
2017-09-01 33135.698008
2017-10-01 33046.466470
2017-11-01 33561.100872

[215 rows x 17 columns]

[45]: # Column t
df1['t'] = logCAN.values
print(df1.head(20))

t-17 t-16 t-15 t-14 \


Mes
2000-01-01 NaN NaN NaN NaN
2000-02-01 NaN NaN NaN NaN

2000-03-01 NaN NaN NaN NaN
2000-04-01 NaN NaN NaN NaN
2000-05-01 NaN NaN NaN NaN
2000-06-01 NaN NaN NaN NaN
2000-07-01 NaN NaN NaN NaN
2000-08-01 NaN NaN NaN NaN
2000-09-01 NaN NaN NaN NaN
2000-10-01 NaN NaN NaN NaN
2000-11-01 NaN NaN NaN NaN
2000-12-01 NaN NaN NaN NaN
2001-01-01 NaN NaN NaN NaN
2001-02-01 NaN NaN NaN NaN
2001-03-01 NaN NaN NaN 8834.807750
2001-04-01 NaN NaN 8834.807750 10028.258952
2001-05-01 NaN 8834.807750 10028.258952 12107.160159
2001-06-01 8834.807750 10028.258952 12107.160159 9561.499369
2001-07-01 10028.258952 12107.160159 9561.499369 12517.574662
2001-08-01 12107.160159 9561.499369 12517.574662 13276.126246

t-13 t-12 t-11 t-10 \


Mes
2000-01-01 NaN NaN NaN NaN
2000-02-01 NaN NaN NaN NaN
2000-03-01 NaN NaN NaN NaN
2000-04-01 NaN NaN NaN NaN
2000-05-01 NaN NaN NaN NaN
2000-06-01 NaN NaN NaN NaN
2000-07-01 NaN NaN NaN NaN
2000-08-01 NaN NaN NaN NaN
2000-09-01 NaN NaN NaN NaN
2000-10-01 NaN NaN NaN NaN
2000-11-01 NaN NaN NaN 8834.807750
2000-12-01 NaN NaN 8834.807750 10028.258952
2001-01-01 NaN 8834.807750 10028.258952 12107.160159
2001-02-01 8834.807750 10028.258952 12107.160159 9561.499369
2001-03-01 10028.258952 12107.160159 9561.499369 12517.574662
2001-04-01 12107.160159 9561.499369 12517.574662 13276.126246
2001-05-01 9561.499369 12517.574662 13276.126246 12642.260410
2001-06-01 12517.574662 13276.126246 12642.260410 15710.987025
2001-07-01 13276.126246 12642.260410 15710.987025 14017.013371
2001-08-01 12642.260410 15710.987025 14017.013371 15853.341808

t-9 t-8 t-7 t-6 \


Mes
2000-01-01 NaN NaN NaN NaN
2000-02-01 NaN NaN NaN NaN
2000-03-01 NaN NaN NaN NaN
2000-04-01 NaN NaN NaN NaN

2000-05-01 NaN NaN NaN NaN
2000-06-01 NaN NaN NaN NaN
2000-07-01 NaN NaN NaN 8834.807750
2000-08-01 NaN NaN 8834.807750 10028.258952
2000-09-01 NaN 8834.807750 10028.258952 12107.160159
2000-10-01 8834.807750 10028.258952 12107.160159 9561.499369
2000-11-01 10028.258952 12107.160159 9561.499369 12517.574662
2000-12-01 12107.160159 9561.499369 12517.574662 13276.126246
2001-01-01 9561.499369 12517.574662 13276.126246 12642.260410
2001-02-01 12517.574662 13276.126246 12642.260410 15710.987025
2001-03-01 13276.126246 12642.260410 15710.987025 14017.013371
2001-04-01 12642.260410 15710.987025 14017.013371 15853.341808
2001-05-01 15710.987025 14017.013371 15853.341808 17267.822305
2001-06-01 14017.013371 15853.341808 17267.822305 13362.070607
2001-07-01 15853.341808 17267.822305 13362.070607 10284.621104
2001-08-01 17267.822305 13362.070607 10284.621104 10540.649510

t-5 t-4 t-3 t-2 \


Mes
2000-01-01 NaN NaN NaN NaN
2000-02-01 NaN NaN NaN NaN
2000-03-01 NaN NaN NaN 8834.807750
2000-04-01 NaN NaN 8834.807750 10028.258952
2000-05-01 NaN 8834.807750 10028.258952 12107.160159
2000-06-01 8834.807750 10028.258952 12107.160159 9561.499369
2000-07-01 10028.258952 12107.160159 9561.499369 12517.574662
2000-08-01 12107.160159 9561.499369 12517.574662 13276.126246
2000-09-01 9561.499369 12517.574662 13276.126246 12642.260410
2000-10-01 12517.574662 13276.126246 12642.260410 15710.987025
2000-11-01 13276.126246 12642.260410 15710.987025 14017.013371
2000-12-01 12642.260410 15710.987025 14017.013371 15853.341808
2001-01-01 15710.987025 14017.013371 15853.341808 17267.822305
2001-02-01 14017.013371 15853.341808 17267.822305 13362.070607
2001-03-01 15853.341808 17267.822305 13362.070607 10284.621104
2001-04-01 17267.822305 13362.070607 10284.621104 10540.649510
2001-05-01 13362.070607 10284.621104 10540.649510 13565.751196
2001-06-01 10284.621104 10540.649510 13565.751196 11123.639014
2001-07-01 10540.649510 13565.751196 11123.639014 13970.570576
2001-08-01 13565.751196 11123.639014 13970.570576 12447.652844

t-1 t
Mes
2000-01-01 NaN 8834.807750
2000-02-01 8834.807750 10028.258952
2000-03-01 10028.258952 12107.160159
2000-04-01 12107.160159 9561.499369
2000-05-01 9561.499369 12517.574662
2000-06-01 12517.574662 13276.126246

2000-07-01 13276.126246 12642.260410
2000-08-01 12642.260410 15710.987025
2000-09-01 15710.987025 14017.013371
2000-10-01 14017.013371 15853.341808
2000-11-01 15853.341808 17267.822305
2000-12-01 17267.822305 13362.070607
2001-01-01 13362.070607 10284.621104
2001-02-01 10284.621104 10540.649510
2001-03-01 10540.649510 13565.751196
2001-04-01 13565.751196 11123.639014
2001-05-01 11123.639014 13970.570576
2001-06-01 13970.570576 12447.652844
2001-07-01 12447.652844 12648.096722
2001-08-01 12648.096722 13303.593183

[46]: # Drop the NaNs in the first 17 rows


df1_CAN = df1[17:]
print(df1_CAN.head(17))

t-17 t-16 t-15 t-14 \


Mes
2001-06-01 8834.807750 10028.258952 12107.160159 9561.499369
2001-07-01 10028.258952 12107.160159 9561.499369 12517.574662
2001-08-01 12107.160159 9561.499369 12517.574662 13276.126246
2001-09-01 9561.499369 12517.574662 13276.126246 12642.260410
2001-10-01 12517.574662 13276.126246 12642.260410 15710.987025
2001-11-01 13276.126246 12642.260410 15710.987025 14017.013371
2001-12-01 12642.260410 15710.987025 14017.013371 15853.341808
2002-01-01 15710.987025 14017.013371 15853.341808 17267.822305
2002-02-01 14017.013371 15853.341808 17267.822305 13362.070607
2002-03-01 15853.341808 17267.822305 13362.070607 10284.621104
2002-04-01 17267.822305 13362.070607 10284.621104 10540.649510
2002-05-01 13362.070607 10284.621104 10540.649510 13565.751196
2002-06-01 10284.621104 10540.649510 13565.751196 11123.639014
2002-07-01 10540.649510 13565.751196 11123.639014 13970.570576
2002-08-01 13565.751196 11123.639014 13970.570576 12447.652844
2002-09-01 11123.639014 13970.570576 12447.652844 12648.096722
2002-10-01 13970.570576 12447.652844 12648.096722 13303.593183

t-13 t-12 t-11 t-10 \


Mes
2001-06-01 12517.574662 13276.126246 12642.260410 15710.987025
2001-07-01 13276.126246 12642.260410 15710.987025 14017.013371
2001-08-01 12642.260410 15710.987025 14017.013371 15853.341808
2001-09-01 15710.987025 14017.013371 15853.341808 17267.822305
2001-10-01 14017.013371 15853.341808 17267.822305 13362.070607
2001-11-01 15853.341808 17267.822305 13362.070607 10284.621104
2001-12-01 17267.822305 13362.070607 10284.621104 10540.649510

2002-01-01 13362.070607 10284.621104 10540.649510 13565.751196
2002-02-01 10284.621104 10540.649510 13565.751196 11123.639014
2002-03-01 10540.649510 13565.751196 11123.639014 13970.570576
2002-04-01 13565.751196 11123.639014 13970.570576 12447.652844
2002-05-01 11123.639014 13970.570576 12447.652844 12648.096722
2002-06-01 13970.570576 12447.652844 12648.096722 13303.593183
2002-07-01 12447.652844 12648.096722 13303.593183 13713.464888
2002-08-01 12648.096722 13303.593183 13713.464888 15530.336931
2002-09-01 13303.593183 13713.464888 15530.336931 15264.207302
2002-10-01 13713.464888 15530.336931 15264.207302 13053.761205

t-9 t-8 t-7 t-6 \


Mes
2001-06-01 14017.013371 15853.341808 17267.822305 13362.070607
2001-07-01 15853.341808 17267.822305 13362.070607 10284.621104
2001-08-01 17267.822305 13362.070607 10284.621104 10540.649510
2001-09-01 13362.070607 10284.621104 10540.649510 13565.751196
2001-10-01 10284.621104 10540.649510 13565.751196 11123.639014
2001-11-01 10540.649510 13565.751196 11123.639014 13970.570576
2001-12-01 13565.751196 11123.639014 13970.570576 12447.652844
2002-01-01 11123.639014 13970.570576 12447.652844 12648.096722
2002-02-01 13970.570576 12447.652844 12648.096722 13303.593183
2002-03-01 12447.652844 12648.096722 13303.593183 13713.464888
2002-04-01 12648.096722 13303.593183 13713.464888 15530.336931
2002-05-01 13303.593183 13713.464888 15530.336931 15264.207302
2002-06-01 13713.464888 15530.336931 15264.207302 13053.761205
2002-07-01 15530.336931 15264.207302 13053.761205 10644.393826
2002-08-01 15264.207302 13053.761205 10644.393826 10669.290257
2002-09-01 13053.761205 10644.393826 10669.290257 10571.785556
2002-10-01 10644.393826 10669.290257 10571.785556 13494.193187

t-5 t-4 t-3 t-2 \


Mes
2001-06-01 10284.621104 10540.649510 13565.751196 11123.639014
2001-07-01 10540.649510 13565.751196 11123.639014 13970.570576
2001-08-01 13565.751196 11123.639014 13970.570576 12447.652844
2001-09-01 11123.639014 13970.570576 12447.652844 12648.096722
2001-10-01 13970.570576 12447.652844 12648.096722 13303.593183
2001-11-01 12447.652844 12648.096722 13303.593183 13713.464888
2001-12-01 12648.096722 13303.593183 13713.464888 15530.336931
2002-01-01 13303.593183 13713.464888 15530.336931 15264.207302
2002-02-01 13713.464888 15530.336931 15264.207302 13053.761205
2002-03-01 15530.336931 15264.207302 13053.761205 10644.393826
2002-04-01 15264.207302 13053.761205 10644.393826 10669.290257
2002-05-01 13053.761205 10644.393826 10669.290257 10571.785556
2002-06-01 10644.393826 10669.290257 10571.785556 13494.193187
2002-07-01 10669.290257 10571.785556 13494.193187 14040.933317
2002-08-01 10571.785556 13494.193187 14040.933317 11861.528307

2002-09-01 13494.193187 14040.933317 11861.528307 13094.569510
2002-10-01 14040.933317 11861.528307 13094.569510 13228.566918

t-1 t
Mes
2001-06-01 13970.570576 12447.652844
2001-07-01 12447.652844 12648.096722
2001-08-01 12648.096722 13303.593183
2001-09-01 13303.593183 13713.464888
2001-10-01 13713.464888 15530.336931
2001-11-01 15530.336931 15264.207302
2001-12-01 15264.207302 13053.761205
2002-01-01 13053.761205 10644.393826
2002-02-01 10644.393826 10669.290257
2002-03-01 10669.290257 10571.785556
2002-04-01 10571.785556 13494.193187
2002-05-01 13494.193187 14040.933317
2002-06-01 14040.933317 11861.528307
2002-07-01 11861.528307 13094.569510
2002-08-01 13094.569510 13228.566918
2002-09-01 13228.566918 13384.900198
2002-10-01 13384.900198 15977.801285
Now we split the dataset. First we separate the lagged variables from the column t defined above, to distinguish covariates from the response variable.

[47]: CANsplit = df1_CAN.values

X1 = CANsplit[:, 0:-1]  # Select covariates
y1 = CANsplit[:,-1]  # Separate column 't'

Now that y and X are distinguished, both sets are split into a training set and a validation set, holding out 15 observations for validation. The variables are named test for convenience, but they refer to the validation set.

[48]: # Target Train-Val split

Y1 = y1
traintarget_size = 183
train_target, test_target = Y1[0:traintarget_size], Y1[traintarget_size:len(Y1)]

print('Observations for Target: %d' % (len(Y1)))


print('Training Observations for Target: %d' % (len(train_target)))
print('Validation Observations for Target: %d' % (len(test_target)))

Observations for Target: 198


Training Observations for Target: 183
Validation Observations for Target: 15

[49]: # Features Train-Val split

trainfeature_size = 183
train_feature, test_feature = X1[0:trainfeature_size], X1[trainfeature_size:len(X1)]

print('Observations for feature: %d' % (len(X1)))


print('Training Observations for feature: %d' % (len(train_feature)))
print('Validation Observations for feature: %d' % (len(test_feature)))

Observations for feature: 198


Training Observations for feature: 183
Validation Observations for feature: 15
We confirm that the training and validation sets have the same sizes for the labels y and the regressors X, so we can proceed to build the decision-tree regression model.
[50]: # Decision Tree Regression Model

from sklearn.tree import DecisionTreeRegressor

# Create a decision tree regression model with default arguments


decision_tree_CAN = DecisionTreeRegressor()

# Fit the model to the training features(covariables) and targets(respuesta)


decision_tree_CAN.fit(train_feature, train_target)

# Check the score on train and test


print(decision_tree_CAN.score(train_feature, train_target))
print(decision_tree_CAN.score(test_feature,test_target))

1.0
-0.4330516704949361

[51]: for d in [2, 3, 4, 5, 7, 8, 10]:
    # Create the tree and fit it
    decision_tree_CAN = DecisionTreeRegressor(max_depth=d)
    decision_tree_CAN.fit(train_feature, train_target)

    # Print out the scores on train and test
    print('max_depth=', str(d))
    print(decision_tree_CAN.score(train_feature, train_target))
    print(decision_tree_CAN.score(test_feature, test_target), '\n')

max_depth= 2
0.8152166896655462
-0.647021873889613

max_depth= 3
0.8648190081409888
-0.46795376865845983

max_depth= 4
0.9061342324349833
-0.47664554983452745

max_depth= 5
0.9426969754330095
-0.5774614769318107

max_depth= 7
0.9834000035229826
-0.222779408711977

max_depth= 8
0.9927172936249665
-0.519032710271299

max_depth= 10
0.9992364226890428
-0.4912073654114555

Earlier we had R² = −0.433, which is quite poor, but as the output above shows it can improve: on the validation set the coefficient reaches about −0.22 with a maximum depth of 7. The model is refitted with this parameter.
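The score reported by sklearn's DecisionTreeRegressor.score is the coefficient of determination,

$$R^{2} = 1 - \frac{\sum_i (y_i-\hat{y}_i)^{2}}{\sum_i (y_i-\bar{y})^{2}},$$

which is negative whenever the model predicts the validation set worse than a constant equal to the mean of the validation targets.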

[52]: from matplotlib import pyplot as plt

# Set max_depth to 7, as discussed above


decision_tree_CAN = DecisionTreeRegressor(max_depth=7)
decision_tree_CAN.fit(train_feature, train_target)

# Predict values for train and test


train_prediction = decision_tree_CAN.predict(train_feature)
test_prediction = decision_tree_CAN.predict(test_feature)

1.5 Mean squared error


[54]: from sklearn.metrics import mean_squared_error
mean_squared_error(test_target, test_prediction)

[54]: 17363308.844464015

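A note on the next cell: targetjoint, predictionjoint and indicetrian_test are presumably the observed values, the train-plus-validation predictions, and the corresponding date index, built in a cell not shown here.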
[55]: d = {'observado': targetjoint, 'Predicción': predictionjoint}
ObsvsPred=pd.DataFrame(data=d,index=indicetrian_test)
ObsvsPred.head(10)

[55]: observado Predicción


Mes
2001-06-01 2628.495072 2700.668598
2001-07-01 2663.635012 2700.668598
2001-08-01 2777.903968 3182.598827
2001-09-01 2848.870050 2700.668598
2001-10-01 3159.307008 3182.598827
2001-11-01 3114.235664 3182.598827
2001-12-01 2734.466841 2700.668598
2002-01-01 2307.829458 2700.668598
2002-02-01 2312.316072 2700.668598
2002-03-01 2294.734505 2700.668598

[57]: ax = ObsvsPred.plot( marker="o", figsize=(12,8))


ax.axvline(x=indicetrian_test[183].date(),color='red')

[57]: <matplotlib.lines.Line2D at 0x13435bd3f60>

Clearly the model is not very good when the significant lags are taken as covariates.

As an additional exercise we propose a second option: taking the first 2 lags and the last 2 (lags 11 and 12). The steps are not commented one by one as in the previous case, but the procedure is analogous.

[67]: df1 = DataFrame()


CANdf = pd.DataFrame(logCAN)
# Covariates
for i in range(2,0,-1):
    df1[['t-'+str(i)]] = CANdf.shift(i)
for i in range(12,10,-1):
    df1[['t-'+str(i)]] = CANdf.shift(i)
print(df1)

# Column t
df1['t'] = ts.values
print(df1.head(14))

# Drop the NaNs in the first 12 rows


df1_CAN = df1[12:]
print(df1_CAN.head(14))

CANsplit = df1_CAN.values
X1= CANsplit[:, 0:-1] #Seleccionar covariables
y1 = CANsplit[:,-1] #Separar columna 't'

# Target Train-Val split


Y1 = y1
traintarget_size = 183
train_target, test_target = Y1[0:traintarget_size], Y1[traintarget_size:len(Y1)]
print('Observations for Target: %d' % (len(Y1)))
print('Training Observations for Target: %d' % (len(train_target)))
print('Validation Observations for Target: %d' % (len(test_target)))

trainfeature_size = 183
train_feature, test_feature = X1[0:trainfeature_size], X1[trainfeature_size:len(X1)]

print('Observations for feature: %d' % (len(X1)))


print('Training Observations for feature: %d' % (len(train_feature)))
print('Validation Observations for feature: %d' % (len(test_feature)))

# Decision Tree Regression Model


from sklearn.tree import DecisionTreeRegressor
# Create a decision tree regression model with default arguments
decision_tree_CAN = DecisionTreeRegressor()
# Fit the model to the training features(covariables) and targets(respuesta)
decision_tree_CAN.fit(train_feature, train_target)
# Check the score on train and test
print(decision_tree_CAN.score(train_feature, train_target))

print(decision_tree_CAN.score(test_feature,test_target))

for d in [2, 3, 4, 5, 7, 8, 10]:
    # Create the tree and fit it
    decision_tree_CAN = DecisionTreeRegressor(max_depth=d)
    decision_tree_CAN.fit(train_feature, train_target)
    # Print out the scores on train and test
    print('max_depth=', str(d))
    print(decision_tree_CAN.score(train_feature, train_target))
    print(decision_tree_CAN.score(test_feature, test_target), '\n')

t-2 t-1 t-12 t-11


Mes
2000-01-01 NaN NaN NaN NaN
2000-02-01 NaN 8834.807750 NaN NaN
2000-03-01 8834.807750 10028.258952 NaN NaN
2000-04-01 10028.258952 12107.160159 NaN NaN
2000-05-01 12107.160159 9561.499369 NaN NaN
... ... ... ... ...
2017-07-01 30307.993441 29885.893791 26246.608474 35655.596812
2017-08-01 29885.893791 30405.357158 35655.596812 34656.168082
2017-09-01 30405.357158 33135.698008 34656.168082 33767.960888
2017-10-01 33135.698008 33046.466470 33767.960888 34259.994248
2017-11-01 33046.466470 33561.100872 34259.994248 35283.895278

[215 rows x 4 columns]


t-2 t-1 t-12 t-11 t
Mes
2000-01-01 NaN NaN NaN NaN 62.883
2000-02-01 NaN 8834.807750 NaN NaN 66.283
2000-03-01 8834.807750 10028.258952 NaN NaN 71.681
2000-04-01 10028.258952 12107.160159 NaN NaN 64.983
2000-05-01 12107.160159 9561.499369 NaN NaN 72.681
2000-06-01 9561.499369 12517.574662 NaN NaN 74.480
2000-07-01 12517.574662 13276.126246 NaN NaN 72.981
2000-08-01 13276.126246 12642.260410 NaN NaN 79.879
2000-09-01 12642.260410 15710.987025 NaN NaN 76.180
2000-10-01 15710.987025 14017.013371 NaN NaN 80.179
2000-11-01 14017.013371 15853.341808 NaN NaN 83.078
2000-12-01 15853.341808 17267.822305 NaN 8834.807750 74.680
2001-01-01 17267.822305 13362.070607 8834.807750 10028.258952 66.982
2001-02-01 13362.070607 10284.621104 10028.258952 12107.160159 67.670
t-2 t-1 t-12 t-11 t
Mes
2001-01-01 17267.822305 13362.070607 8834.807750 10028.258952 66.982
2001-02-01 13362.070607 10284.621104 10028.258952 12107.160159 67.670
2001-03-01 10284.621104 10540.649510 12107.160159 9561.499369 75.151

2001-04-01 10540.649510 13565.751196 9561.499369 12517.574662 69.201
2001-05-01 13565.751196 11123.639014 12517.574662 13276.126246 76.075
2001-06-01 11123.639014 13970.570576 13276.126246 12642.260410 72.512
2001-07-01 13970.570576 12447.652844 12642.260410 15710.987025 72.995
2001-08-01 12447.652844 12648.096722 15710.987025 14017.013371 74.544
2001-09-01 12648.096722 13303.593183 14017.013371 15853.341808 75.490
2001-10-01 13303.593183 13713.464888 15853.341808 17267.822305 79.496
2001-11-01 13713.464888 15530.336931 17267.822305 13362.070607 78.927
2001-12-01 15530.336931 15264.207302 13362.070607 10284.621104 73.959
2002-01-01 15264.207302 13053.761205 10284.621104 10540.649510 67.946
2002-02-01 13053.761205 10644.393826 10540.649510 13565.751196 68.012
Observations for Target: 203
Training Observations for Target: 183
Validation Observations for Target: 20
Observations for feature: 203
Training Observations for feature: 183
Validation Observations for feature: 20
1.0
0.11216668980585531
max_depth= 2
0.8481719715059304
-0.9581347741106794

max_depth= 3
0.891688075801554
-0.5951970739075925

max_depth= 4
0.920812675121631
0.08116032962489794

max_depth= 5
0.9451022315852008
0.18316002016104027

max_depth= 7
0.9759733895208629
0.2188407609561649

max_depth= 8
0.9896871745563044
0.2498700930937643

max_depth= 10
0.9980536395131769
0.0780408301254435

With depth = 8 we obtain a better R² than in the case with 17 covariates. Observe the final results:
[68]: from matplotlib import pyplot as plt
# Set max_depth to 8, as discussed above
decision_tree_CAN = DecisionTreeRegressor(max_depth=8)
decision_tree_CAN.fit(train_feature, train_target)
# Predict values for train and test
train_prediction = decision_tree_CAN.predict(train_feature)
test_prediction = decision_tree_CAN.predict(test_feature)

from sklearn.metrics import mean_squared_error


mean_squared_error(test_target, test_prediction)

[68]: 17.8187084455139

The mean squared error is clearly reduced considerably (note that in this second fit the target t is the untransformed series, so the error is expressed in the original units).
