Select one or more predictors to add to your model and repeat steps 1-3. Is this
model significantly better than the model with a single predictor?
Reading the data
In [1]: # read the data from an .rds file
states.data <- readRDS("states.rds")
state State
Alabama South 4041000 52423 77.08 67.4 1.11 393 10.5 27.86 ⋯ 30
Alaska West 550000 570374 0.96 41.1 0.91 991 7.2 37.41 ⋯
Arizona West 3665000 113642 32.25 79.0 0.79 258 9.7 19.65 ⋯ 13
Arkansas South 2351000 52075 45.15 40.1 0.85 330 8.9 24.60 ⋯ 25
California West 29760000 155973 190.80 95.7 1.51 246 8.7 3.26 ⋯ 50
Correlation analysis
In [7]: # correlation between metro and energy
cor(sts.me.en)
metro energy
metro 1 NA
energy NA 1
The correlation is undefined because of the NA values in the data. We must remove them first.
In [8]: # drop the rows that contain NA
states.data1 <- na.omit(states.data)
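The NA result above can be reproduced on a small scale. The sketch below uses a made-up data frame (`toy` and its values are hypothetical, not taken from states.rds) and shows two ways to get a defined correlation: dropping incomplete rows with `na.omit()`, as in the cell above, or asking `cor()` to skip them through its `use` argument.

```r
# Toy data frame with missing values (hypothetical numbers, for illustration only)
toy <- data.frame(
  metro  = c(67.4, 41.1, 79.0, NA, 95.7, 81.5),
  energy = c(393, 991, 258, 330, NA, 273)
)

# With NA present, the off-diagonal correlation is NA
cor_raw <- cor(toy)

# Option 1: remove incomplete rows first, then correlate
toy_clean <- na.omit(toy)
cor_clean <- cor(toy_clean)

# Option 2: let cor() ignore incomplete rows itself
cor_complete <- cor(toy, use = "complete.obs")

print(cor_raw)
print(cor_clean)
print(cor_complete)
```

Note that `na.omit()` applied to a whole data frame drops every row with an NA in any column, so it can discard rows that are complete for the two variables of interest.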
Univariable regression
In [12]: # first, fit a univariable regression
ener.mod1 <- lm(energy ~ metro,data=states.data1)
Call:
lm(formula = energy ~ metro, data = states.data1)
Residuals:
Min 1Q Median 3Q Max
-179.17 -54.21 -21.64 15.07 448.02
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 449.8382 50.4472 8.917 1.37e-11 ***
metro -1.6526 0.7428 -2.225 0.031 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Multivariable analysis
In [21]: # define a sub data frame with 7 variables
states.data2 <- subset(states.data1, select =
c("metro", "miles", "toxic", "green", "income"
To use the corrplot package, we must first install it by running the following in the Anaconda console:
conda install -c conda-forge r-corrplot
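Once the package is installed, a correlation matrix can be drawn with `corrplot::corrplot()`. The sketch below is only an illustration: the data frame `toy` is simulated rather than taken from the states data, and the plot is attempted only if corrplot is actually available.

```r
# Simulated stand-in for a numeric subset (hypothetical data)
set.seed(1)
n <- 40
toy <- data.frame(metro = runif(n, 20, 100), miles = rnorm(n, 9, 1.5))
toy$income <- 25 + 0.15 * toy$metro + rnorm(n, 0, 3)

# Correlation matrix of all numeric columns
cor_mat <- cor(toy)
print(round(cor_mat, 2))

# Draw the matrix only when corrplot is installed
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_mat, method = "circle")
}
```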
Call:
lm(formula = energy ~ metro + miles + income, data = states.data1)
Residuals:
Min 1Q Median 3Q Max
-138.41 -51.76 -10.61 28.60 379.66
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 145.894 194.707 0.749 0.4577
metro 1.205 1.010 1.193 0.2391
miles 42.693 15.445 2.764 0.0083 **
income -8.063 3.282 -2.457 0.0180 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
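One way to judge whether a larger model like the one above improves significantly on the metro-only fit is an F test on the two nested models with `anova()`. The sketch below uses simulated data (the data frame `toy` and its generating formula are assumptions made for illustration), so its numbers say nothing about the states data.

```r
# Simulated data (hypothetical; not the states data)
set.seed(42)
n <- 48
toy <- data.frame(
  metro  = runif(n, 20, 100),
  miles  = rnorm(n, 9, 1.5),
  income = rnorm(n, 33, 6)
)
toy$energy <- 300 - 1.5 * toy$metro + 30 * toy$miles -
  5 * toy$income + rnorm(n, 0, 60)

# Smaller model nested inside the larger one
mod_small <- lm(energy ~ metro, data = toy)
mod_large <- lm(energy ~ metro + miles + income, data = toy)

# F test: the null hypothesis is that the added coefficients are all zero
cmp <- anova(mod_small, mod_large)
print(cmp)
```

A small p-value in the `Pr(>F)` column suggests that the added predictors improve the fit. Both models must be fitted to the same rows, which is one reason to work from the NA-free data frame throughout.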
Call:
lm(formula = energy ~ toxic * green + I(toxic^2) + I(green^2),
data = states.data1)
Residuals:
Min 1Q Median 3Q Max
-110.48 -31.16 -1.39 24.44 204.35
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 201.942874 32.059022 6.299 1.47e-07 ***
toxic -2.269437 1.712104 -1.326 0.19216
green 5.252933 1.821773 2.883 0.00618 **
I(toxic^2) 0.002473 0.016743 0.148 0.88326
I(green^2) -0.029078 0.018292 -1.590 0.11942
toxic:green 0.142213 0.061854 2.299 0.02654 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
lm(formula = energy ~ toxic + green + toxic:green + I(toxic^2) +
I(green^2), data = states.data1)
Residuals:
Min 1Q Median 3Q Max
-110.48 -31.16 -1.39 24.44 204.35
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 201.942874 32.059022 6.299 1.47e-07 ***
toxic -2.269437 1.712104 -1.326 0.19216
green 5.252933 1.821773 2.883 0.00618 **
I(toxic^2) 0.002473 0.016743 0.148 0.88326
I(green^2) -0.029078 0.018292 -1.590 0.11942
toxic:green 0.142213 0.061854 2.299 0.02654 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
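The two calls above give identical estimates because, in R's formula syntax, `toxic * green` expands to `toxic + green + toxic:green`. The squared terms are wrapped in `I()` so that `^` is read as arithmetic rather than as a formula operator. The sketch below checks the equivalence on simulated data (`toy` is hypothetical).

```r
# Simulated data (hypothetical; for illustration only)
set.seed(7)
n <- 48
toy <- data.frame(toxic = runif(n, 1, 100), green = runif(n, 5, 60))
toy$energy <- 200 + 3 * toy$green +
  0.1 * toy$toxic * toy$green + rnorm(n, 0, 40)

# Compact form: '*' means main effects plus their interaction
fit_star <- lm(energy ~ toxic * green + I(toxic^2) + I(green^2), data = toy)

# Expanded form: every term written out
fit_long <- lm(energy ~ toxic + green + toxic:green + I(toxic^2) + I(green^2),
               data = toy)

# Compare the coefficients by name
print(names(coef(fit_star)))
same <- all.equal(coef(fit_star), coef(fit_long)[names(coef(fit_star))])
print(same)
```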
In other words:
The results show that the quadratic terms are not significant, so the
regression would take the form:
Call:
lm(formula = energy ~ green + toxic:green, data = states.data1)
Residuals:
Min 1Q Median 3Q Max
-144.986 -32.174 -3.178 24.981 205.762
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 224.10982 14.76351 15.180 < 2e-16 ***
green 2.92910 0.62254 4.705 2.44e-05 ***
green:toxic 0.08822 0.01361 6.482 5.99e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
lm(formula = energy ~ region, data = states.data1)
Residuals:
Min 1Q Median 3Q Max
-143.13 -50.13 -23.62 17.36 418.82
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 367.18 33.25 11.044 2.87e-14 ***
regionN. East -118.07 49.56 -2.382 0.0216 *
regionSouth 12.94 43.19 0.300 0.7658
regionMidwest -23.18 46.03 -0.504 0.6170
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The previous regression used the default contrasts to encode the regions, with the
first level as the reference. We can change the reference level or use a different coding with the
function C.
        N. East South Midwest
West          0     0       0
N. East       1     0       0
South         0     1       0
Midwest       0     0       1
Call:
lm(formula = energy ~ C(region, base = 2), data = states.data1)
Residuals:
Min 1Q Median 3Q Max
-143.13 -50.12 -23.62 17.36 418.82
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 249.11 36.76 6.777 2.43e-08 ***
C(region, base = 2)1 118.07 49.56 2.382 0.0216 *
C(region, base = 2)3 131.01 45.95 2.851 0.0066 **
C(region, base = 2)4 94.89 48.63 1.951 0.0574 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
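The two region fits above describe the same model under different codings: with N. East as the reference, the intercept 249.11 is the earlier intercept plus the N. East coefficient (367.18 - 118.07), and the other coefficients shift accordingly. The sketch below illustrates this on simulated data (`toy` is hypothetical). It also uses `relevel()`, an alternative to `C()` that the notebook does not use.

```r
# Simulated data with the same four region levels (hypothetical values)
set.seed(3)
lev <- c("West", "N. East", "South", "Midwest")
toy <- data.frame(
  region = factor(rep(lev, each = 6), levels = lev),
  energy = rnorm(24, 300, 50)
)

# Default treatment contrasts: the first level (West) is the reference
print(contrasts(toy$region))

fit_default <- lm(energy ~ region, data = toy)
fit_base2   <- lm(energy ~ C(region, base = 2), data = toy)
fit_relevel <- lm(energy ~ relevel(region, ref = "N. East"), data = toy)

# In each coding the intercept is the mean of the reference group
group_means <- tapply(toy$energy, toy$region, mean)
print(group_means)
print(coef(fit_default))
print(coef(fit_base2))
```

Changing the reference level alters what each coefficient is compared against, but the fitted values and residuals stay the same.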