Proyecto de Cálculo de Abandono (Churn) en Una Compañía de Telecomunicaciones PDF

24/12/2020 Proyecto de Cálculo de Abandono (Churn) en una compañía de telecomunicaciones
Proyecto de Cálculo de Abandono

(Churn) en una compañía de
telecomunicaciones
Por Daniel Bajo Lema
22/05/2020
Vamos a llevar a cabo un proyecto en R para predecir el abandono (Churn) de los clientes de una compañía
de telecomunicaciones y permitir tomar medidas encaminadas a reducirlo
1. Importación y muestreo de los datos

Primero vamos a definir el formato de salida en R
knitr::opts_chunk$set(echo = TRUE)
options(scipen=999)
Y los paquetes que tenemos que cargar en R
#lista de paquetes que vamos a usar

paquetes <- c('data.table',#para leer y escribir datos de forma rapida
'dplyr',#para manipulacion de datos
'tidyr',#para manipulacion de datos
'ggplot2',#para graficos
'randomForest',#para crear los modelos
'ROCR',#para evaluar modelos
'purrr',#para usar la funcion map que aplica la misma funciona a varios compone
ntes de un dataframe
'smbinning',#para calcular la para importancia de las variables
'rpart',#para crear arboles de decision
'rpart.plot',#para el grafico del arbol
'e1071', #para la modelización con Bayes
'caret', #para calcular la matrix de confusión
'h2o' #para h2o
)
instalados <- paquetes %in% installed.packages()
if(sum(instalados == FALSE) > 0) {
install.packages(paquetes[!instalados])
}
lapply(paquetes,require,character.only = TRUE)
## Loading required package: data.table
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
https://rstudio-pubs-static.s3.amazonaws.com/618170_c4a4de58d3ec4af38c6b326e3f593b6d.html 1/116
## The following objects are masked from 'package:data.table':

##
## between, first, last
## The following objects are masked from 'package:stats':

##
## filter, lag
## The following objects are masked from 'package:base':

##
## intersect, setdiff, setequal, union
## Loading required package: tidyr
## Loading required package: ggplot2
## Loading required package: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':

##
## margin
## The following object is masked from 'package:dplyr':

##
## combine
## Loading required package: ROCR
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':

##
## lowess
## Loading required package: purrr
##
## Attaching package: 'purrr'
## The following object is masked from 'package:data.table':

##
## transpose
## Loading required package: smbinning
## Loading required package: sqldf
## Loading required package: gsubfn
## Loading required package: proto
## Warning in system2("/usr/bin/otool", c("-L", shQuote(DSO)), stdout = TRUE):

## comando ejecutado ''/usr/bin/otool' -L '/Library/Frameworks/R.framework/
## Resources/library/tcltk/libs//tcltk.so'' tiene estatus 1
## Loading required package: RSQLite
## Loading required package: partykit
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
## Loading required package: Formula
## Loading required package: rpart
## Loading required package: rpart.plot
## Loading required package: e1071
## Loading required package: caret
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':

##
## lift
## Loading required package: h2o
##
## ----------------------------------------------------------------------
##
## Your next step is to start H2O:
## > h2o.init()
##
## For H2O package documentation, ask for help:
## > ??h2o
##
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
##
## ----------------------------------------------------------------------
##
## Attaching package: 'h2o'
## The following objects are masked from 'package:data.table':

##
## hour, month, week, year
## The following objects are masked from 'package:stats':

##
## cor, sd, var
## The following objects are masked from 'package:base':

##
## &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
## colnames<-, ifelse, is.character, is.factor, is.numeric, log,
## log10, log1p, log2, round, signif, trunc
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] TRUE
##
## [[6]]
## [1] TRUE
##
## [[7]]
## [1] TRUE
##
## [[8]]
## [1] TRUE
##
## [[9]]
## [1] TRUE
##
## [[10]]
## [1] TRUE
##
## [[11]]
## [1] TRUE
##
## [[12]]
## [1] TRUE
##
## [[13]]
## [1] TRUE
Cargamos el fichero de datos y vemos qué pinta tienen con un análisis exploratorio
df <- fread('Telco-Customer-Churn.csv')
glimpse(df)
## Observations: 7,043
## Variables: 21
## $ customerID <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOC…
## $ gender <chr> "Female", "Male", "Male", "Male", "Female", "Female"…
## $ SeniorCitizen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Partner <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Ye…
## $ Dependents <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No…
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, …
## $ PhoneService <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No",…
## $ MultipleLines <chr> "No phone service", "No", "No", "No phone service", …
## $ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber op…
## $ OnlineSecurity <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", …
## $ OnlineBackup <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "…
## $ TechSupport <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Ye…
## $ StreamingTV <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Y…
## $ StreamingMovies <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Ye…
## $ Contract <chr> "Month-to-month", "One year", "Month-to-month", "One…
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No",…
## $ PaymentMethod <chr> "Electronic check", "Mailed check", "Mailed check", …
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 194…
## $ Churn <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "…
str(df)
## Classes 'data.table' and 'data.frame': 7043 obs. of 21 variables:

## $ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : chr "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr "Yes" "No" "No" "No" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr "No phone service" "No" "No" "No phone service" ...
## $ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr "No" "No" "No" "Yes" ...
## $ StreamingTV : chr "No" "No" "No" "No" ...
## $ StreamingMovies : chr "No" "No" "No" "No" ...
## $ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer
(automatic)" ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : chr "No" "No" "Yes" "No" ...
## - attr(*, ".internal.selfref")=<externalptr>
as.data.frame(sort(names(df)))
## sort(names(df))
## 1 Churn
## 2 Contract
## 3 customerID
## 4 Dependents
## 5 DeviceProtection
## 6 gender
## 7 InternetService
## 8 MonthlyCharges
## 9 MultipleLines
## 10 OnlineBackup
## 11 OnlineSecurity
## 12 PaperlessBilling
## 13 Partner
## 14 PaymentMethod
## 15 PhoneService
## 16 SeniorCitizen
## 17 StreamingMovies
## 18 StreamingTV
## 19 TechSupport
## 20 tenure
## 21 TotalCharges
Vemos que es un base de datos de 7043 registros con 21 variables incluyendo la variable que tenemos que
predecir ‘Churn’. Se trata sobre todo de variables de texto (character) salvo las variables de cargos mensuales
y totales (MonthlyCharges y TotalCharges) que son numéricas y la antigüeda (tenure) que está formada por
números enteros. Las siguientes variables de carácter o de texto habrá que transformarlas en factores:
1. Gender
2. SeniorCitizen
3. Partner
4. Dependents
5. PhoneService
6. MultipleLines
7. InternetService
8. OnlineSecurity
9. OnlineBackup
10. DeviceProtection
11. TechSupport
12. StreamingTV
13. StreamingMovies
14. Contract
15. PaperlessBilling
16. PaymentMethod
17. Churn
Lo convertiremos en factores más adelante pero las agrupamos en un vector con su nombre
a_factores <- c('gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'Multiple

Lines', 'InternetService','OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport'
, 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Chur
n')
2. Calidad de datos
2.1. Estadísticos básicos

Vemos los estadísticos básicos de la base de datos
lapply(df,summary)
## $customerID
## Length Class Mode
## 7043 character character
##
## $gender
##
## $SeniorCitizen
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1621 0.0000 1.0000
##
## $Partner
##
## $Dependents
##
## $tenure
## 0.00 9.00 29.00 32.37 55.00 72.00
##
## $PhoneService
##
## $MultipleLines
##
## $InternetService
##
## $OnlineSecurity
##
## $OnlineBackup
##
## $DeviceProtection
##
## $TechSupport
##
## $StreamingTV
##
## $StreamingMovies

##
## $Contract
##
## $PaperlessBilling
##
## $PaymentMethod
##
## $MonthlyCharges
## 18.25 35.50 70.35 64.76 89.85 118.75
##
## $TotalCharges
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.8 401.4 1397.5 2283.3 3794.7 8684.8 11
##
## $Churn
Parece que hay muchos ceros en senior citizen (la mediana es 0 y la media es muy baja) por lo que podría ser
una variable a eliminar pero también puede ser debido a que hay pocos seniors en la base de datos en una
variable dicotómica (1=Seniors / 0=No Seniors). Tenure parece ser una variable que mida la antigüedad o el
tiempo que lleva un usuario siendo cliente de la compañía, probablemente medido en meses.
2.2. Nulos y ceros

Vemos si hay nulos en las variables
data.frame(colSums(is.na(df)))
## colSums.is.na.df..
## customerID 0
## gender 0
## SeniorCitizen 0
## Partner 0
## Dependents 0
## tenure 0
## PhoneService 0
## MultipleLines 0
## InternetService 0
## OnlineSecurity 0
## OnlineBackup 0
## DeviceProtection 0
## TechSupport 0
## StreamingTV 0
## StreamingMovies 0
## Contract 0
## PaperlessBilling 0
## PaymentMethod 0
## MonthlyCharges 0
## TotalCharges 11
## Churn 0
Sólo hay 11 nulos en Total Charges
Vamos a ver si además hay algunos que se han convertido en ceros y contamos los ceros para cada una de
las variables.
contar_ceros <- function(variable) {

temp <- transmute(df,if_else(variable==0,1,0))
sum(temp)
}
num_ceros <- sapply(df,contar_ceros)

num_ceros <- data.frame(VARIABLE=names(num_ceros),CEROS = as.numeric(num_ceros),stringsAsFact
ors = F)
num_ceros <- num_ceros %>%
arrange(desc(CEROS)) %>%
mutate(PORCENTAJE = CEROS / nrow(df) * 100)
num_ceros
## VARIABLE CEROS PORCENTAJE

## 1 SeniorCitizen 5901 83.7853188
## 2 tenure 11 0.1561834
## 3 customerID 0 0.0000000
## 4 gender 0 0.0000000
## 5 Partner 0 0.0000000
## 6 Dependents 0 0.0000000
## 7 PhoneService 0 0.0000000
## 8 MultipleLines 0 0.0000000
## 9 InternetService 0 0.0000000
## 10 OnlineSecurity 0 0.0000000
## 11 OnlineBackup 0 0.0000000
## 12 DeviceProtection 0 0.0000000
## 13 TechSupport 0 0.0000000
## 14 StreamingTV 0 0.0000000
## 15 StreamingMovies 0 0.0000000
## 16 Contract 0 0.0000000
## 17 PaperlessBilling 0 0.0000000
## 18 PaymentMethod 0 0.0000000
## 19 MonthlyCharges 0 0.0000000
## 20 Churn 0 0.0000000
## 21 TotalCharges NA NA
Hay 5901 ceros (el 84% del total) en Senior Citizen pero como ya dijimos parece un factor en el que el 0
representa a los clientes “no senior” por lo que podría tener sentido.
Además hay 11 casos con ceros en tenure que podríamos analizar si coinciden con los 11 casos con NA en
Total Charges y si merece la pena eliminarnos en el caso de que estén incompletos aunque podrían ser
clientes nuevos con una antigüedad inferior al mes.
2.3. Análisis longitudinal de las variables numéricas

Podemos hacer un gráfico de las variables continuas y de las enteras como TotalCharges, MonthlyCharges y
Tenure
boxplot(df$TotalCharges,border=9, col=3, main="Distribución de la Facturación Total de los Cl

ientes",cex.main=1)
boxplot(df$MonthlyCharges,border=9, col=4, main="Distribución de la Facturación Mensual de lo

s Clientes",cex.main=1)
boxplot(df$tenure,border=9, col=8, main="Distribución de la Antigüedad de los Clientes en mes

es",cex.main=1)
Convertimos la variable tenure de entero a numérica para poder ver sus valores más altos
df$tenure <-as.numeric(df$tenure)
Para todas las variables continuas de la base de datos (antigüedad, pagos totales, pagos mensuales y tenure)
vamos a ver cuales son sus valores más altos
out <- function(variable){

t(t(head(sort(variable,decreasing = T),20)))
}
lapply(df,function(x){
if(is.double(x)) out(x)
})
## $customerID
## NULL
##
## $gender
## NULL
##
## $SeniorCitizen
## NULL
##
## $Partner
## NULL
##
## $Dependents
## NULL
##
## $tenure
## [,1]
## [1,] 72
## [2,] 72
## [3,] 72
## [4,] 72
## [5,] 72
## [6,] 72
## [7,] 72
## [8,] 72
## [9,] 72
## [10,] 72
## [11,] 72
## [12,] 72
## [13,] 72
## [14,] 72
## [15,] 72
## [16,] 72
## [17,] 72
## [18,] 72
## [19,] 72
## [20,] 72
##
## $PhoneService
## NULL
##
## $MultipleLines
## NULL
##
## $InternetService
## NULL
##
## $OnlineSecurity
## NULL
##
## $OnlineBackup
## NULL
##
## $DeviceProtection
## NULL
##
## $TechSupport
## NULL
##
## $StreamingTV
## NULL
##
## $StreamingMovies
## NULL
##
## $Contract
## NULL
##
## $PaperlessBilling
## NULL
##
## $PaymentMethod
## NULL
##
## $MonthlyCharges
## [,1]
## [1,] 118.75
## [2,] 118.65
## [3,] 118.60
## [4,] 118.60
## [5,] 118.35
## [6,] 118.20
## [7,] 117.80
## [8,] 117.60
## [9,] 117.50
## [10,] 117.45
## [11,] 117.35
## [12,] 117.20
## [13,] 117.15
## [14,] 116.95
## [15,] 116.85
## [16,] 116.80
## [17,] 116.75
## [18,] 116.60
## [19,] 116.60
## [20,] 116.55
##
## $TotalCharges
## [,1]
## [1,] 8684.80
## [2,] 8672.45
## [3,] 8670.10
## [4,] 8594.40
## [5,] 8564.75
## [6,] 8547.15
## [7,] 8543.25
## [8,] 8529.50
## [9,] 8496.70
## [10,] 8477.70
## [11,] 8477.60
## [12,] 8476.50
## [13,] 8468.20
## [14,] 8456.75
## [15,] 8443.70
## [16,] 8436.25
## [17,] 8425.30
## [18,] 8425.15
## [19,] 8424.90
## [20,] 8405.00
##
## $Churn
## NULL
Los mayores valores se corresponden con una antigüedad de 72 meses (6 años), 118€ de cargos mensuales
y 8.684€ de cargos totales.
Mi hipótesis es que los cargos totales es el resultado de multiplicar los cargos mensuales por la antigüedad ya
que si dividimos los cargos totales máximos entre la antigüedad máxima obtenemos 120€ mensuales, un valor
muy próximo al valor más alto de facturación mensual (119€).
summary(df$MonthlyCharges)

## 18.25 35.50 70.35 64.76 89.85 118.75
summary(df$TotalCharges)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's

## 18.8 401.4 1397.5 2283.3 3794.7 8684.8 11
summary(df$tenure)

## 0.00 9.00 29.00 32.37 55.00 72.00
Analizando los estadísticos básicos vemos que la media se sitúa en 65€ de cargos mensuales, 2283€ de
cargos totales y una antigüedad de 32 meses.
Parece que puede haber algún problema con la variable antigüedad (tenure) que tiene algunos ceros. Vamos
a ver cuanto facturan estos clientes para entender si son clientes sin antigüedad pero con tarifa mensual y
comprobar sus datos en otras variables clave como servicios de TV en streaming o películas.
df %>%
filter(tenure == 0) %>%
select(tenure, TotalCharges, MonthlyCharges, StreamingTV, StreamingMovies)
## tenure TotalCharges MonthlyCharges StreamingTV StreamingMovies

## 1 0 NA 52.55 Yes No
## 2 0 NA 20.25 No internet service No internet service
## 3 0 NA 80.85 Yes Yes
## 5 0 NA 56.05 Yes No
## 10 0 NA 73.35 Yes No
## 11 0 NA 61.90 No No
Son 11 casos con una antigüedad de 0. Deben ser clientes nuevos ya que tienen valores de NA en
TotalCharges (todavía no han facturado) pero sí una tarifa mensual por lo que podríamos eliminarlos o
asignarles un valor 0 en TotalCharges
Vemos gráficamente como se representa la distribución de la antigüedad
hist(df$tenure,main="Distribución de la Antigüedad de los clientes en la compañía",cex.main=1

,col="#934962",breaks = 100,xlab="Antigüedad de los clientes en meses agrupados",ylab="Nº de
clientes")
En general salvo al principio y al final del gráfico los clientes se reparten de manera bastante equilibrada en
todos los niveles. Que haya bastantes clientes nuevos parece correcto pero probablemente los de más de 72
meses (6 años) son clientes más antiguos pero 72 meses es el nivel máximo disponible en el sistema. Habrá
por tanto que discretizar la variable para evitar esos problemas
2.4. Coherencia entre variables

Para comprobar si TotalCharges es una variable compuesta de los ingresos mensuales (Monthly Charges) y la
antigüedad en meses (tenure) creamos una nueva variable (TotalChargesv2) resultado de multiplicar los
ingresos mensuales (Monthly Charges) por la antigüedad en meses (tenure) y comparamos el resultado con
los ingresos totales (TotalCharges) en algunos de sus valores
head(df) %>%
select(MonthlyCharges,TotalCharges,tenure) %>%
mutate(TotalChargesv2=MonthlyCharges*tenure)
## MonthlyCharges TotalCharges tenure TotalChargesv2

## 1 29.85 29.85 1 29.85
## 2 56.95 1889.50 34 1936.30
## 3 53.85 108.15 2 107.70
## 4 42.30 1840.75 45 1903.50
## 5 70.70 151.65 2 141.40
## 6 99.65 820.50 8 797.20
Vemos que efectivamente parece que TotalCharges es el resultado de multiplicar las otras 2 variables
(tenencia y cargos mensuales) por lo que la podríamos eliminar y quedarnos con las otras dos que aportan la
misma información. De hecho vamos a ver la correlación de TotalCharges con el resultado de multiplicar las
otras dos variables tanto numérica como gráficamente
dfcorta <- df %>%

select(MonthlyCharges,tenure,TotalCharges) %>%
mutate(TotalChargesv2=MonthlyCharges*tenure)
cor(dfcorta$TotalCharges,dfcorta$TotalChargesv2,use = 'complete.obs')
## [1] 0.9995599
ggplot(data=dfcorta,aes(x = TotalCharges, y=TotalChargesv2)) + geom_point(colour = "lightblu

e", size=4) + geom_smooth() + labs(x = "Total Charges",y = "Monthly Charges X Tenure")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 11 rows containing non-finite values (stat_smooth).
## Warning: Removed 11 rows containing missing values (geom_point).
La correlación entre ambas variables es el 99.95599% y visualmente vemos que coinciden por lo que vamos
a eliminar la variable Total Charges y quedarnos con las otras dos variables que nos pueden ayudar a
predecir mejor el Churn
Además voy a comprobar que aquellos que no tienen servicio de Internet “No Internet service” obtienen los
mismos resultados en las otras seis variables relacionadas con Internet. Para ello tenemos que transformar
todas estas variables en factores y ver con summary los valores de sin servicio de internet (No internet
service)
df$InternetService <- as.factor(df$InternetService)

df$StreamingTV <- as.factor(df$StreamingTV)
df$StreamingMovies <- as.factor(df$StreamingMovies)
df$OnlineSecurity <- as.factor(df$OnlineSecurity)
df$OnlineBackup <- as.factor(df$OnlineBackup)
df$DeviceProtection <- as.factor(df$DeviceProtection)
df$TechSupport <- as.factor(df$TechSupport)
summary(df$InternetService)
## DSL Fiber optic No

## 2421 3096 1526
summary(df$StreamingTV)
## No No internet service Yes

## 2810 1526 2707
summary(df$StreamingMovies)

## 2785 1526 2732
summary(df$OnlineSecurity)

## 3498 1526 2019
summary(df$OnlineBackup)

## 3088 1526 2429
summary(df$DeviceProtection)

## 3095 1526 2422
summary(df$TechSupport)

## 3473 1526 2044
Hay 1526 clientes sin servicio de Internet y este valor es el mismo en las seis variables relacionadas con los
servicios de Internet.
Vamos también a comprobar que el resultado “No internet service” en StreamingTV, StreamingMovies,
OnlineSecurity, OnlineBackup, DeviceProtection y TechSupport se corresponde con un No en la variable
InternetService en esos mismos registros
df %>%
filter(row_number() <= 30) %>%
filter(InternetService == 'No') %>%
select(InternetService,StreamingTV,StreamingMovies,OnlineSecurity,OnlineBackup,DeviceProtec
tion, TechSupport)
## InternetService StreamingTV StreamingMovies OnlineSecurity

## 1 No No internet service No internet service No internet service
## OnlineBackup DeviceProtection TechSupport
## 1 No internet service No internet service No internet service
Vemos que en principio es correcto y que cuando no hay servicio en internet en la variable InternetService se
refleja correctamente en las otras seis variables y por tanto no hay televisión, películas por internet, seguridad
online, backup, protección del dispositivo y servicio técnico para lo que es necesario tener internet.
Tenemos que hacer la misma comprobación con otras dos variables correlacionadas: la presencia del servicio
de Teléfono (PhoneService) y la existencia de varias líneas de teléfono (MultipleLines)
df$PhoneService <- as.factor(df$PhoneService)

df$MultipleLines <- as.factor(df$MultipleLines)
summary(df$PhoneService)
## No Yes
## 682 6361
summary(df$MultipleLines)
## No No phone service Yes

## 3390 682 2971
Vemos que 682 clientes no tienen servicio de teléfono y el número coincide en ambas variables
Voy también a comprobar que el resultado “No phone service” en MultipleLines se corresponde con un No en
la variable PhoneService en algunos de los registros
df %>%
filter(row_number() <= 30) %>%
filter(PhoneService == 'No') %>%
select(PhoneService,MultipleLines)
## PhoneService MultipleLines
## 1 No No phone service
Efectivamente el no en PhoneService se corresponde con un No Phone Service en MultipleLines
En cualquier caso y para evitar correlaciones entre las variables relacionadas con Internet y con el servicio de
teléfono tendremos que transformar el “no internet” o “no phone service” en simplemente un “no”
3. Transformación de datos
Acciones resultado del analisis de calidad de datos y exploratorio:
Transformar los resultados de la variable target (Churn) de si/no a 1 y 0 para poder funcionar como
variable target
Transformar los valores “No Internet Service” de las variables relacionadas con internet
(OnlineSecurity, OnlineBackup, DeviceProtection, StreamingMovies y StreamingTV) y “No Phone
Line” de la variable relacionada con la línea de teléfono (MultiplesLines) en un simple “No” para
evitar correlaciones entre las mismas y dejarlas en variables dicotómicas
Transformar a factor las variables de ‘a_factores’ que no habíamos transformado previamente. Ya
hemos transformado tenure en una variable numérica (era una variable entera)
Discretizar la variable tenencia (tenure) para corregir los valores elevados en a partir de los 6 años
Eliminar la variable TotalCharges que es resultado de multiplicar las otras dos y así evitamos las NA
que se producen cuando la tenencia es cero (clientes nuevos)
3.1. Variable target

Transformamos la variable Churn y cambiamos el Yes/No por unos y ceros y la convertimos en un factor
df <- df %>%
mutate(Churn = ifelse(Churn == 'Yes', 1, 0))
df$Churn <- as.factor(df$Churn)
3.2. Variables independientes

Para evitar correlaciones en las variables independientes sustituimos “No internet service” y “No phone
service” por un simple “No”
df$OnlineSecurity[df$OnlineSecurity=="No internet service"] <- "No"

df$OnlineBackup[df$OnlineBackup=="No internet service"] <- "No"
df$DeviceProtection[df$DeviceProtection=="No internet service"] <- "No"
df$StreamingMovies[df$StreamingMovies=="No internet service"] <- "No"
df$StreamingTV[df$StreamingTV=="No internet service"] <- "No"
df$TechSupport[df$TechSupport=="No internet service"] <- "No"
df$MultipleLines[df$MultipleLines=="No phone service"] <- "No"
Y transformamos las variables independientes que habíamos incluido en la lista a_factores en factores
df <- df %>%
mutate_at(a_factores,list(factor))
Creamos una lista larga con todas las variables independientes. Eliminamos de esa lista el CustomerID y Total
Charges además de la variable Churn que es la variable target
ind_larga<-names(df)
no_usar <- c('customerID','Churn', 'TotalCharges')
ind_larga<-setdiff(ind_larga,no_usar)
No hace falta crear una muestra para trabajar porque sólo tenemos 7000 observaciones y por tanto no
manejamos muchos datos.
3.3. Preselección de variables

Vamos a hacer un preseleccion de las variables con RandomForest y seleccionar sólo aquellas que tienen
más impacto para predecir la variable target Churn
pre_rf <- randomForest(formula = reformulate(ind_larga,'Churn'), data= df,mtry=2,ntree=50, im

portance = T)
imp_rf <- importance(pre_rf)[,4]
imp_rf <- data.frame(VARIABLE = names(imp_rf), IMP_RF = imp_rf)
imp_rf <- imp_rf %>% arrange(desc(IMP_RF)) %>% mutate(RANKING_RF = 1:nrow(imp_rf))
imp_rf
## VARIABLE IMP_RF RANKING_RF

## 1 tenure 272.67423 1
## 2 Contract 204.28593 2
## 3 MonthlyCharges 188.10371 3
## 4 PaymentMethod 133.30097 4
## 5 InternetService 109.34978 5
## 6 PaperlessBilling 47.73792 6
## 7 OnlineSecurity 46.60165 7
## 8 TechSupport 44.32185 8
## 9 SeniorCitizen 31.95836 9
## 10 OnlineBackup 29.27913 10
## 11 Partner 27.86869 11
## 12 MultipleLines 26.14952 12
## 13 StreamingTV 26.10560 13
## 14 Dependents 25.57337 14
## 15 StreamingMovies 25.19977 15
## 16 gender 25.10330 16
## 17 DeviceProtection 23.30836 17
## 18 PhoneService 11.56446 18
Las cinco variables mas importates con Random Forest son:
La antigüedad (tenure)
El cargo mensual (MonthlyCharges)
La duración del contrato (Contract)
Si tiene o no contratado internet (InternetService)
El método de pago (PaymentMethod)
Vamos a hacer lo mismo con Information Value (IV)
df2 <- mutate(df,Churn = as.numeric(as.character(Churn)))

imp_iv <- smbinning.sumiv(df2[c(ind_larga,'Churn')],y="Churn")
##
##
|
| | 0%
|
|--- | 5%
|
|----- | 11%
|
|-------- | 16%
|
|----------- | 21%
|
|------------- | 26%
|
|---------------- | 32%
|
|------------------ | 37%
|
|--------------------- | 42%
|
|------------------------ | 47%
|
|-------------------------- | 53%
|
|----------------------------- | 58%
|
|-------------------------------- | 63%
|
|---------------------------------- | 68%
|
|------------------------------------- | 74%
|
|--------------------------------------- | 79%
|
|------------------------------------------ | 84%
|
|--------------------------------------------- | 89%
|
|----------------------------------------------- | 95%
|
|--------------------------------------------------| 100%
##
imp_iv <- imp_iv %>% mutate(Ranking = 1:nrow(imp_iv)) %>% select(-Process)

names(imp_iv) <- c('VARIABLE','IMP_IV','RANKING_IV')
imp_iv
## VARIABLE IMP_IV RANKING_IV

## 1 Contract 1.2386 1
## 2 tenure 0.8659 2
## 3 InternetService 0.6179 3
## 4 MonthlyCharges 0.4842 4
## 5 PaymentMethod 0.4571 5
## 6 PaperlessBilling 0.2030 6
## 7 OnlineSecurity 0.1719 7
## 8 TechSupport 0.1575 8
## 9 Dependents 0.1555 9
## 10 Partner 0.1188 10
## 11 SeniorCitizen 0.1057 11
## 12 OnlineBackup 0.0359 12
## 13 DeviceProtection 0.0230 13
## 14 StreamingTV 0.0202 14
## 15 StreamingMovies 0.0191 15
## 16 MultipleLines 0.0082 16
## 17 PhoneService 0.0008 17
## 18 gender 0.0004 18
Las cinco variables mas importates en base a la IV son:

El servicio de Internet (InternetService)
Los pagos mensuales (MonthlyCharges)
Son, por tanto, muy similares en ambos casos
Vamos a hacer la preseleccion final teniendo en cuenta ambos procedimientos
imp_final <- inner_join(imp_rf,imp_iv,by='VARIABLE') %>%

select(VARIABLE,IMP_RF,IMP_IV,RANKING_RF,RANKING_IV) %>%
mutate(RANKING_TOT = RANKING_RF + RANKING_IV) %>%
arrange(RANKING_TOT)
## Warning: Column `VARIABLE` joining factors with different levels, coercing to

## character vector
imp_final
## VARIABLE IMP_RF IMP_IV RANKING_RF RANKING_IV RANKING_TOT

## 1 tenure 272.67423 0.8659 1 2 3
## 2 Contract 204.28593 1.2386 2 1 3
## 3 MonthlyCharges 188.10371 0.4842 3 4 7
## 4 InternetService 109.34978 0.6179 5 3 8
## 5 PaymentMethod 133.30097 0.4571 4 5 9
## 6 PaperlessBilling 47.73792 0.2030 6 6 12
## 7 OnlineSecurity 46.60165 0.1719 7 7 14
## 8 TechSupport 44.32185 0.1575 8 8 16
## 9 SeniorCitizen 31.95836 0.1057 9 11 20
## 10 Partner 27.86869 0.1188 11 10 21
## 11 OnlineBackup 29.27913 0.0359 10 12 22
## 12 Dependents 25.57337 0.1555 14 9 23
## 13 StreamingTV 26.10560 0.0202 13 14 27
## 14 MultipleLines 26.14952 0.0082 12 16 28
## 15 StreamingMovies 25.19977 0.0191 15 15 30
## 16 DeviceProtection 23.30836 0.0230 17 13 30
## 17 gender 25.10330 0.0004 16 18 34
## 18 PhoneService 11.56446 0.0008 18 17 35
Las cinco variables mas importantes en base tanto al Random Forest y el IV son
Los cargos mensuales (MonthlyCharges)
El servicio de Internet (InternetService)
Vamos a comprobar si los dos métodos nos han dado resultados similares por lo que calculamos su
correlación
cor(imp_final$IMP_RF,imp_final$IMP_IV,use = 'complete.obs')
## [1] 0.9035112
Una correlación de 90.35112% es un dato elevado por lo que podemos dar como válidos los datos
Limitamos las variables explicativas a aquellas que hayan salido en el Top 10 en alguna de las dos cálculos y
vemos cuáles son
ind_corta <- imp_final %>%

filter(RANKING_RF <= 10 | RANKING_IV <= 10) %>%
select(VARIABLE)
ind_corta <- as.character(ind_corta$VARIABLE)
ind_corta
## [1] "tenure" "Contract" "MonthlyCharges" "InternetService"

## [5] "PaymentMethod" "PaperlessBilling" "OnlineSecurity" "TechSupport"
## [9] "SeniorCitizen" "Partner" "OnlineBackup" "Dependents"
Una vez que ya hemos identificado las variables independientes más relevantes tenemos que volver añadir la
variable target y el código de cliente que eliminamos previamente para generar un nuevo dataframe.
finales <- union(ind_corta,c('customerID','Churn'))

df <- df %>%
select(one_of(finales))
Eliminamos todo de nuestro entorno salvo la base de datos
ls()
## [1] "a_factores" "contar_ceros" "df" "df2" "dfcorta"

## [6] "finales" "imp_final" "imp_iv" "imp_rf" "ind_corta"
## [11] "ind_larga" "instalados" "no_usar" "num_ceros" "out"
## [16] "paquetes" "pre_rf"
rm(list=setdiff(ls(),'df'))
Guardamos el cache temporal
saveRDS(df,'cacheDan1.rds')
3.4. Discretización de variables

Vamos a llevar a cabo las transformaciones necesarias en las variables continuas (antigüedad y cargos
mensuales) para agrupar los datos en niveles
Primero cargamos el cache temporal y definimos de nuevo las variables target e independientes
df <- readRDS(file = 'cacheDan1.rds')

target <- 'Churn'
indep <- setdiff(names(df),c(target,'customerID'))
Creamos la función para discretizar
discretizar <- function(vi,target){

temp_df <- data.frame(vi = vi, target = target)
temp_df$target <- as.numeric(as.character(temp_df$target))
disc <- smbinning(temp_df, y = 'target', x = 'vi')
return(disc)
}
Discretizamos las dos variables numéricas: antigüedad (tenure) y cargos mensuales (MonthlyCharges)
#tenure
disc_temp_tenure <- discretizar(df$tenure,df$Churn)
df_temp <- select(df,tenure,Churn)
df_temp <- smbinning.gen(df_temp,disc_temp_tenure,chrname = 'Tenure_DISC')
df <- cbind(df,df_temp[3])
#MonthlyCharges
disc_temp_MonthlyCharges <- discretizar(df$MonthlyCharges,df$Churn)
df_temp <- select(df,MonthlyCharges,Churn)
df_temp <- smbinning.gen(df_temp,disc_temp_MonthlyCharges,chrname = 'MonthlyCharges_DISC')
df <- cbind(df,df_temp[3])
Vamos a ver visualmente como queda Tenure discretizada
ggplot(df,aes(Tenure_DISC)) + scale_y_continuous(name="Número de clientes") + scale_x_discret

e(name="Antigüedad por niveles") + geom_bar(color = "white", fill = "darkblue")
Parece que los niveles quedan algo descompensados ya que el mayor número de clientes se sitúan entre 22
y 49 meses de antigüedad y en este corte se incluyen 27 meses mientras que, por ejemplo, en el nivel anterior
se incluyen sólo 4 meses.
Y hacemos lo mismo con la variable MonthlyCharges
ggplot(df,aes(MonthlyCharges_DISC)) + scale_y_continuous(name="Número de clientes") + scale_x

_discrete(name="Cargos Mensuales por niveles") + geom_bar(color = "white", fill = "darkgreen"
)
Del mismo modo que en la variable antigüedad en MonthlyCharges hay un grupo muy descompensado que es
el 4 ya que incluye más de 3000 clientes al incluir casi 40 € en el rango vs los 13 € del grupo anterior.
Incluimos la penetración del target para la variable antigüedad (tenure)
ggplot(df,aes(Tenure_DISC,fill=Churn)) + scale_y_continuous(name="% de Churn") + scale_x_disc

rete(name="Antigüedad por niveles") + geom_bar(position='fill') + scale_fill_manual(values=c(
"DarkBlue", "DarkGreen"))
Podemos ver que para Tenure la penetración del target es completamente monótona por lo que la dejamos tal
cual y eliminamos la variable sin discretizar
df <- df %>%
select(-tenure)
Del mismo modo incluimos la penetración del target para la variable MonthlyCharges
ggplot(df,aes(MonthlyCharges_DISC,fill=Churn)) + scale_y_continuous(name="% de Churn") + sca

le_x_discrete(name="Cargos Mensuales por niveles") + geom_bar(position='fill') + scale_fill_m
anual(values=c("DarkBlue", "DarkGreen"))
En este caso la variable no es monótona por lo que vamos a probar con la discretización manual con rangos
de 20€ hasta los 100€
df <- df %>% mutate(MonthlyCharges = as.factor(case_when(

MonthlyCharges <= 20 ~ '01_MENOR_20',
MonthlyCharges > 20 & MonthlyCharges <= 40 ~ '02_DE_20_A_40',
MonthlyCharges > 100 ~ '07_MAYOR_100',
TRUE ~ '00_ERROR'))
)
Y comprobamos ahora si la variable de cargos mensuales es monotona
ggplot(df,aes(MonthlyCharges,fill=Churn)) + scale_y_continuous(name="% de Churn") + scale_x_d

iscrete(name="Cargos Mensuales por niveles") + geom_bar(position='fill') + scale_fill_manual
(values=c("DarkBlue", "DarkGreen")) + theme (axis.text.x = element_text(face="bold", size=rel
(0.8)))
Ahora sí parece que la variable sigue una tendencia ascendente aunque en el último grupo cae ligeramente
por lo que vamos a dejar esta variable con la discretización manual
Eliminamos la variable discretizada de manera automática que ya no nos sirve y cambiamos el nombre de
MonthlyCharges que habíamos discretizado por el de MonthlyCharges_DISC
df <- df %>%
select(-MonthlyCharges_DISC)
df <- df %>%
rename(MonthlyCharges_DISC = MonthlyCharges)
Vamos a ver visualmente todas las variables incluida la variable target Churn
df %>%
select_if(is.factor) %>%
gather() %>%
ggplot(aes(value)) +
geom_bar(color = "white", fill = "darkgreen") +
facet_wrap(~ key, scales = "free") +
theme(axis.text=element_text(size=5))
## Warning: attributes are not identical across measure variables;

## they will be dropped
Y guardamos las discretizaciones realizadas
discretizaciones <- list(

disc_temp_tenure = disc_temp_tenure,
disc_temp_MonthlyCharges = disc_temp_MonthlyCharges
)
saveRDS(discretizaciones,'CortesDiscretizaciones.rds')
Vamos a ver como ha quedado nuestro fichero antes de pasar a la fase de modelizacion
glimpse(df)
## Variables: 14
## $ Contract <fct> Month-to-month, One year, Month-to-month, One yea…
## $ MonthlyCharges_DISC <fct> 02_DE_20_A_40, 03_DE_40_A_60, 03_DE_40_A_60, 03_D…
## $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fib…
## $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Ban…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes…
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes,…
## $ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No,…
## $ SeniorCitizen <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No…
## $ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, N…
## $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No…
## $ customerID <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-C…
## $ Churn <fct> 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0…
## $ Tenure_DISC <fct> 01 <= 1, 05 <= 49, 02 <= 5, 05 <= 49, 02 <= 5, 03…
Ahora todas las variables salvo el CustomerID son factoriales incluidas las que eran continuas y que hemos
discretizado
Limpiamos el entorno de cualquier cosa que no sea el dataframe
a_borrar <- setdiff(ls(),'df')

rm(list=c(a_borrar,'a_borrar'))
Guardamos otro cache temporal
4. Modelización manual
Vamos a probar manualmente cuatro algoritmos diferentes para ver cual funciona mejor y elegir aquel que
tiene mayor capacidad predictora de la variable target, el Churn. En concreto vamos a modelizar con:
Regresión Logística
Árboles de decisión
Random Forest
Bayes naive
Para ello primero cargamos el cache temporal con los datos a emplear
df <- readRDS('cacheDan2.rds')
4.1. Funciones de apoyo

Vamos a crear las funciones que vamos a necesitar
Función para crear una matriz de confusión
confusion<-function(real,scoring,umbral){
conf<-table(real,scoring>=umbral)
if(ncol(conf)==2) return(conf) else return(NULL)
}
Función para calcular las métricas de los modelos: acierto, precisión, cobertura y F1
metricas<-function(matriz_conf){
acierto <- (matriz_conf[1,1] + matriz_conf[2,2]) / sum(matriz_conf) *100
precision <- matriz_conf[2,2] / (matriz_conf[2,2] + matriz_conf[1,2]) *100
cobertura <- matriz_conf[2,2] / (matriz_conf[2,2] + matriz_conf[2,1]) *100
F1 <- 2*precision*cobertura/(precision+cobertura)
salida<-c(acierto,precision,cobertura,F1)
return(salida)
}
Función para probar distintos umbrales y ver el efecto sobre precisión y cobertura
umbrales<-function(real,scoring){
umbrales<-data.frame(umbral=rep(0,times=19),acierto=rep(0,times=19),precision=rep(0,times=1
9),cobertura=rep(0,times=19),F1=rep(0,times=19))
cont <- 1
for (cada in seq(0.05,0.95,by = 0.05)){
datos<-metricas(confusion(real,scoring,cada))
registro<-c(cada,datos)
umbrales[cont,]<-registro
cont <- cont + 1
}
return(umbrales)
}
Funciones que calculan la curva ROC y el AUC
roc<-function(prediction){
r<-performance(prediction,'tpr','fpr')
plot(r)
}
auc<-function(prediction){
a<-performance(prediction,'auc')
return(a@y.values[[1]])
}
4.2. Definición de muestras y de variables Target e

Independientes
Para testar los diferente algoritmos primero creamos las particiones de la muestra en dos: training (70%) y test
(30%) y establecemos una semilla para que los resultados sean replicables
set.seed(12345)
df$random<-sample(0:1,size = nrow(df),replace = T,prob = c(0.3,0.7))
Creamos los dos dataframes, el de entrenamiento y el de test
train<-filter(df,random==1)
test<-filter(df,random==0)
df$random <- NULL
Identificamos las variables independientes y target
independientes <- setdiff(names(df),c('customerID','Churn'))

target <- 'Churn'
Creamos la fórmula para usar en el modelo
formula <- reformulate(independientes,target)
4.3. Modelo de Regresión Logística

Primero vamos a hacer el modelo de regresión logística con todas las variables independientes
formula_rl <- formula

rl<- glm(formula_rl,train,family=binomial(link='logit'))
summary(rl)
##
## Call:
## glm(formula = formula_rl, family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0713 -0.6856 -0.2592 0.5854 3.2868
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.78785 0.33544 2.349
## ContractOne year -0.77413 0.12989 -5.960
## ContractTwo year -1.71221 0.23222 -7.373
## MonthlyCharges_DISC02_DE_20_A_40 -0.20159 0.24525 -0.822
## MonthlyCharges_DISC07_MAYOR_100 0.40096 0.38782 1.034
## InternetServiceFiber optic 0.69193 0.18995 3.643
## InternetServiceNo -1.40874 0.24841 -5.671
## PaymentMethodCredit card (automatic) -0.06755 0.13599 -0.497
## PaymentMethodElectronic check 0.24167 0.11389 2.122
## PaymentMethodMailed check -0.15405 0.13856 -1.112
## PaperlessBillingYes 0.36767 0.08996 4.087
## OnlineSecurityYes -0.35766 0.10282 -3.479
## TechSupportYes -0.23628 0.10626 -2.224
## SeniorCitizen1 0.19190 0.10049 1.910
## PartnerYes 0.18824 0.09325 2.019
## OnlineBackupYes -0.12575 0.09389 -1.339
## DependentsYes -0.19271 0.10828 -1.780
## Tenure_DISC02 <= 5 -0.94728 0.14977 -6.325
## Tenure_DISC03 <= 16 -1.36954 0.14592 -9.386
## Tenure_DISC04 <= 22 -1.85791 0.18507 -10.039
## Tenure_DISC05 <= 49 -2.20999 0.15849 -13.944
## Tenure_DISC06 <= 59 -2.20157 0.20889 -10.539
## Tenure_DISC07 <= 70 -2.56079 0.23163 -11.056
## Tenure_DISC08 > 70 -3.80441 0.47957 -7.933
## Pr(>|z|)
## (Intercept) 0.018837 *
## ContractOne year 0.00000000252176384 ***
## ContractTwo year 0.00000000000016671 ***
## MonthlyCharges_DISC02_DE_20_A_40 0.411097
## MonthlyCharges_DISC07_MAYOR_100 0.301190
## InternetServiceFiber optic 0.000270 ***
## InternetServiceNo 0.00000001419561110 ***
## PaymentMethodCredit card (automatic) 0.619404
## PaymentMethodElectronic check 0.033835 *
## PaymentMethodMailed check 0.266224
## PaperlessBillingYes 0.00004368168984892 ***
## OnlineSecurityYes 0.000504 ***
## TechSupportYes 0.026177 *
## SeniorCitizen1 0.056180 .
## PartnerYes 0.043526 *
## OnlineBackupYes 0.180485
## DependentsYes 0.075131 .
## Tenure_DISC02 <= 5 0.00000000025346651 ***
## Tenure_DISC03 <= 16 < 0.0000000000000002 ***
## Tenure_DISC04 <= 22 < 0.0000000000000002 ***
## Tenure_DISC05 <= 49 < 0.0000000000000002 ***
## Tenure_DISC06 <= 59 < 0.0000000000000002 ***
## Tenure_DISC07 <= 70 < 0.0000000000000002 ***
## Tenure_DISC08 > 70 0.00000000000000214 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5711.2 on 4940 degrees of freedom
## Residual deviance: 4033.4 on 4914 degrees of freedom
## AIC: 4087.4
##
## Number of Fisher Scoring iterations: 6
Revisamos la significatividad y mantenemos las cinco variables que tienen tres estrellas en alguno de sus
valores:
Contrato
Servicio de Internet
Seguridad online
Factura sin papel
Antigüedad
a_mantener <- c(
'Contract',
'InternetService',
'OnlineSecurity',
'PaperlessBilling',
'Tenure_DISC'
)
Volvemos a modelizar con las cinco variables significativas
formula_rl <- reformulate(a_mantener,target)

rl<- glm(formula_rl,train,family=binomial(link='logit'))
summary(rl)
##
## Call:
## glm(formula = formula_rl, family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9958 -0.6801 -0.2767 0.5417 3.2894
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.31357 0.13506 2.322 0.020246 *
## ContractOne year -0.83068 0.12553 -6.617 0.0000000000366039 ***
## ContractTwo year -1.88549 0.22619 -8.336 < 0.0000000000000002 ***
## InternetServiceFiber optic 1.11043 0.09176 12.102 < 0.0000000000000002 ***
## InternetServiceNo -1.13010 0.14872 -7.599 0.0000000000000299 ***
## OnlineSecurityYes -0.36569 0.10010 -3.653 0.000259 ***
## PaperlessBillingYes 0.42099 0.08858 4.752 0.0000020105130901 ***
## Tenure_DISC02 <= 5 -0.95112 0.14868 -6.397 0.0000000001584892 ***
## Tenure_DISC03 <= 16 -1.34002 0.14214 -9.427 < 0.0000000000000002 ***
## Tenure_DISC04 <= 22 -1.81540 0.17975 -10.099 < 0.0000000000000002 ***
## Tenure_DISC05 <= 49 -2.09241 0.14712 -14.223 < 0.0000000000000002 ***
## Tenure_DISC06 <= 59 -1.99484 0.19422 -10.271 < 0.0000000000000002 ***
## Tenure_DISC07 <= 70 -2.29195 0.20982 -10.923 < 0.0000000000000002 ***
## Tenure_DISC08 > 70 -3.46783 0.46581 -7.445 0.0000000000000972 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5711.2 on 4940 degrees of freedom
## Residual deviance: 4096.1 on 4927 degrees of freedom
## AIC: 4124.1
##
## Number of Fisher Scoring iterations: 6
Vemos que ahora todas las variables seleccionadas (Contract, InternetService, OnlineSecurity,
PaperlessBilling y Tenure_DISC) tienen 3 estrellas de significacion en alguno de sus niveles
En relación con el signo de los coeficientes y su valor vemos que la relación tiene sentido de negocio:
1. La mayor antigüedad genera mayores coeficientes y de signo negativo, es decir que el Churn o el
abandono baja a mayor antigüedad
2. El tener fibra óptica genera mayor probabilidad de Churn mientras que no tener Servicio de
Internet tiene signo negativo, es decir, reduce el Churn. Quizás esto puede estar debido a la mayor
actividad promocional en este servicio o al perfil del usuario de Internet que es más proclive a cambiar
de compañía de telecomunicaciones vs aquel que tiene sólo línea de teléfono
3. Disponer del servicio de seguridad online también disminuye la probabilidad de Churn
4. No tener factura en papel incrementa la probabilidad de Churn. Probablemente porque estos
clientes son los más digitales y los que más buscan opciones de cambio
5. Como era previsible el tener algún tipo de contrato anual (uno o dos años) vs el mensual reduce la
probabilidad de Churn y cuanto más largo es el contrato más se reduce
Calculamos el pseudo R cuadrado:
pr2_rl <- 1 -(rl$deviance / rl$null.deviance)

pr2_rl
## [1] 0.2827973
El R cuadrado es bajo ya que sólo explica de manera correcta el 28.27973% de la variabilidad de la variable
target y se sitúa por debajo del 50%. En cualquier caso vamos a analizar el modelo con el resto de variables
para entender si es válido.
Aplico el modelo al conjunto de test, generando un vector con las probabilidades
rl_predict<-predict(rl,test,type = 'response')
Lo vemos visualmente
plot(rl_predict~test$Churn,border=9, col=13, main="Comparación Churn real vs Scoring - Modelo

s Regresión Logística",cex.main=0.9,xlab="Churn real en la muestra test", ylab="Scoring del C
hurn", cex.axis=0.7, cex.lab=0.8)
Aunque en general parece que como cabría esperar los valores medios de scoring en aquellos que tienen un
Churn positivo son más altos (alrededor de un 0.4) que cuando el Churn es 0 (0.1) vemos que algunos valores
de Churn real tienen niveles elevados de probabilidad en el scoring
Y podemos ver esos valores que tienen un Churn de 0 pero que en su predicción dan un scoring elevado (por
encima del 60%)
test_rlpredict <-
cbind(rl_predict,test)
test_rlpredict %>%
filter(rl_predict >= '0.6', Churn=='0') %>%
select(Churn,rl_predict)
## Churn rl_predict
## 1 0 0.6758052
## 2 0 0.8059653
## 3 0 0.6236269
## 4 0 0.7096887
## 5 0 0.6236269
## 6 0 0.6236269
## 7 0 0.7096887
## 8 0 0.6236269
## 9 0 0.6236269
## 10 0 0.6236269
## 11 0 0.6236269
## 12 0 0.8059653
## 13 0 0.6236269
## 14 0 0.6236269
## 15 0 0.6290584
## 16 0 0.6758052
## 17 0 0.6160669
## 18 0 0.6236269
## 19 0 0.7096887
## 20 0 0.6290584
## 21 0 0.6758052
## 22 0 0.6758052
## 23 0 0.8635377
## 24 0 0.6758052
## 25 0 0.8059653
## 26 0 0.6236269
## 27 0 0.6758052
## 28 0 0.6160669
## 29 0 0.7096887
## 30 0 0.7096887
## 31 0 0.7096887
## 32 0 0.6236269
## 33 0 0.6236269
## 34 0 0.7096887
## 35 0 0.6236269
## 36 0 0.8144662
## 37 0 0.6236269
## 38 0 0.8635377
## 39 0 0.7096887
## 40 0 0.6236269
## 41 0 0.6236269
## 42 0 0.6160669
## 43 0 0.6758052
## 44 0 0.6758052
## 45 0 0.7096887
## 46 0 0.6236269
## 47 0 0.6758052
## 48 0 0.7096887
## 49 0 0.6236269
## 50 0 0.6236269
## 51 0 0.7096887
## 52 0 0.7096887
## 53 0 0.6160669
## 54 0 0.8635377
## 55 0 0.6236269
## 56 0 0.6236269
## 57 0 0.8059653
## 58 0 0.6236269
## 59 0 0.7096887
## 60 0 0.6236269
## 61 0 0.7096887
## 62 0 0.6160669
## 63 0 0.8635377
## 64 0 0.6758052
## 65 0 0.6236269
## 66 0 0.6236269
## 67 0 0.6236269
## 68 0 0.6236269
## 69 0 0.6236269
## 70 0 0.6236269
## 71 0 0.6160669
## 72 0 0.8635377
## 73 0 0.6236269
## 74 0 0.6160669
## 75 0 0.6290584
## 76 0 0.6290584
## 77 0 0.6236269
## 78 0 0.6236269
Son 78 casos con una probabilidad o scoring superior al 60% pese a que su Churn real es 0. En una muestra
de alrededor 2000 casos no son tampoco muchos así que vamos a transformar la probabilidad en un scoring
de si el cliente va a cancelar o no el servicio
Con la funcion umbrales probamos diferentes cortes
umb_rl<-umbrales(test$Churn,rl_predict)
umb_rl
## umbral acierto precision cobertura F1

## 1 0.05 51.80780 35.36269 97.326203 51.87648
## 2 0.10 61.03711 40.27149 95.187166 56.59777
## 3 0.15 67.17412 44.35696 90.374332 59.50704
## 4 0.20 70.26641 46.91715 86.809269 60.91307
## 5 0.25 72.50238 49.11734 84.313725 62.07349
## 6 0.30 75.40438 52.60047 79.322638 63.25515
## 7 0.35 76.87916 54.96689 73.975045 63.06991
## 8 0.40 78.73454 58.96226 66.844920 62.65664
## 9 0.45 78.49667 62.70396 47.950089 54.34343
## 10 0.50 79.30542 65.75000 46.880570 54.73465
## 11 0.55 79.63844 70.71651 40.463458 51.47392
## 12 0.60 79.68601 73.10345 37.789661 49.82374
## 13 0.65 77.40247 77.92208 21.390374 33.56643
## 14 0.70 77.11703 81.25000 18.538324 30.18868
## 15 0.75 75.49952 84.84848 9.982175 17.86284
## 16 0.80 75.49952 84.84848 9.982175 17.86284
## 17 0.85 75.21408 90.00000 8.021390 14.72995
## 18 0.90 0.90000 0.90000 0.900000 0.90000
## 19 0.95 0.95000 0.95000 0.950000 0.95000
Seleccionamos el umbral que maximiza la F1
umbral_final_rl<-umb_rl[which.max(umb_rl$F1),1]
umbral_final_rl
## [1] 0.3
Ese nivel es el 0.3
Evaluamos la matriz de confusion y las metricas con el umbral optimizado
confusion(test$Churn,rl_predict,umbral_final_rl)
##
## real FALSE TRUE
## 0 1140 401
## 1 116 445
rl_metricas<-filter(umb_rl,umbral==umbral_final_rl)
rl_metricas

## 1 0.3 75.40438 52.60047 79.32264 63.25515
La precisión tiene niveles algo bajos ya que predice como positivos valores que son negativos. La
cobertura es más alta ya que funciona a la inversa que la precisión y nos indica que % de positivos
reales identifica.
En este caso parece que tiene más sentido dar prioridad a la cobertura para asegurar que somos
capaces de reducir el Churn aunque en ocasiones generemos falsos negativos y eso suponga un
coste añadido (si llevamos a cabo una campaña para reducir el Churn ofreciendo mejores condiciones a
algunos clientes que en realidad no tienen pretensión de cambiar de compañía) El acierto también es elevado
Evaluamos la ROC
rl_prediction<-prediction(rl_predict,test$Churn)
roc(rl_prediction)
Sacamos las metricas definitivas incluyendo el AUC
rl_metricas<-cbind(rl_metricas,AUC=round(auc(rl_prediction),2)*100)
print(t(rl_metricas))
## [,1]
## umbral 0.30000
## acierto 75.40438
## precision 52.60047
## cobertura 79.32264
## F1 63.25515
## AUC 84.00000
En cualquier caso la métrica más relevante, el AUC, alcanza un nivel elevado (84.04825)
4.4. Modelo de Arboles de decision

Vamos a modelizar usando árboles de decisión
formula_ar <- formula

ar<-rpart(formula_ar, train, method = 'class', parms = list(
split = "information"),
control = rpart.control(cp = 0.00001))
Revisamos donde el error de validacion cruzada empieza a crecer
printcp(ar)
##
## Classification tree:
## rpart(formula = formula_ar, data = train, method = "class", parms = list(split = "informat
ion"),
## control = rpart.control(cp = 0.00001))
##
## Variables actually used in tree construction:
## [1] Contract Dependents InternetService
## [4] MonthlyCharges_DISC OnlineBackup OnlineSecurity
## [7] PaperlessBilling Partner PaymentMethod
## [10] SeniorCitizen TechSupport Tenure_DISC
##
## Root node error: 1308/4941 = 0.26472
##
## n= 4941
##
## CP nsplit rel error xerror xstd
## 1 0.06982671 0 1.00000 1.00000 0.023709
## 2 0.00535168 3 0.79052 0.79052 0.021861
## 3 0.00382263 7 0.76529 0.78058 0.021759
## 4 0.00305810 13 0.74159 0.79052 0.021861
## 5 0.00267584 15 0.73547 0.78899 0.021845
## 6 0.00254842 17 0.73012 0.78670 0.021822
## 7 0.00229358 21 0.71942 0.78440 0.021798
## 8 0.00191131 25 0.71024 0.78746 0.021830
## 9 0.00152905 29 0.70260 0.78976 0.021853
## 10 0.00127421 41 0.68196 0.79587 0.021915
## 11 0.00101937 44 0.67813 0.79969 0.021954
## 12 0.00076453 50 0.67202 0.81498 0.022105
## 13 0.00057339 73 0.65443 0.80199 0.021977
## 14 0.00050968 77 0.65214 0.81040 0.022060
## 15 0.00038226 83 0.64908 0.80963 0.022053
## 16 0.00025484 89 0.64679 0.81651 0.022120
## 17 0.00015291 98 0.64450 0.82492 0.022202
## 18 0.00012742 103 0.64373 0.82569 0.022210
## 19 0.00001000 109 0.64297 0.82492 0.022202
plotcp(ar)
Parece que minimiza aprox en 0.0022 de complejidad y generamos un nuevo arbol con ese parametro
Además vamos a incluir un nuevo parámetro para que el árbol no tenga más de 7 niveles
ar<-rpart(formula, train, method = 'class', parms = list(

split = "information"),
control = rpart.control(cp = 0.0022,maxdepth = 7))
Revisamos de nuevo la complejidad
printcp(ar)
##
## Classification tree:
## rpart(formula = formula, data = train, method = "class", parms = list(split = "informatio
n"),
## control = rpart.control(cp = 0.0022, maxdepth = 7))
##
## Variables actually used in tree construction:
## [1] Contract InternetService MonthlyCharges_DISC
## [4] OnlineBackup OnlineSecurity PaperlessBilling
## [7] PaymentMethod SeniorCitizen Tenure_DISC
##
## Root node error: 1308/4941 = 0.26472
##
## n= 4941
##
## CP nsplit rel error xerror xstd
## 1 0.0698267 0 1.00000 1.00000 0.023709
## 2 0.0053517 3 0.79052 0.79052 0.021861
## 3 0.0030581 7 0.76529 0.78593 0.021814
## 4 0.0022936 8 0.76223 0.78823 0.021837
## 5 0.0022000 18 0.73471 0.79434 0.021900
plotcp(ar)
Parece bastante estable así que vamos a crear el gráfico del árbol para analizarlo
rpart.plot(ar,type=2,extra = 7, under = TRUE,under.cex = 0.8,fallen.leaves=F,gap = 0,cex=0.4,

yesno = 2,box.palette = "GnYlRd",branch.lty = 3)
Y vamos a sacar las reglas que podrian ser utilizadas por ejemplo para hacer una implantacion del arbol
rpart.rules(ar,style = 'tall',cover = T)
## Churn is 0.07 with cover 45% when

## Contract is One year or Two year
##
## Contract is Month-to-month
## Tenure_DISC is 03 <= 16 or 04 <= 22 or 05 <= 49 or 06 <= 59 or 07 <= 70 or 08 > 70
## InternetService is DSL or No
##
## Tenure_DISC is 04 <= 22 or 06 <= 59
## InternetService is Fiber optic
## PaymentMethod is Electronic check
## OnlineSecurity is Yes
## OnlineBackup is Yes
##
## Tenure_DISC is 05 <= 49 or 07 <= 70 or 08 > 70
##
## Tenure_DISC is 01 <= 1 or 02 <= 5
## InternetService is No
##
## Tenure_DISC is 04 <= 22 or 05 <= 49 or 06 <= 59 or 07 <= 70 or 08 > 70
## PaymentMethod is Bank transfer (automatic) or Credit card (automatic) or Mailed check
##
## Tenure_DISC is 02 <= 5 or 03 <= 16
## PaymentMethod is Credit card (automatic) or Mailed check
## SeniorCitizen is 0
## MonthlyCharges_DISC is 04_DE_60_A_80
##
## Tenure_DISC is 07 <= 70 or 08 > 70
## OnlineSecurity is No
##
## Tenure_DISC is 02 <= 5
## InternetService is DSL
##
## Tenure_DISC is 04 <= 22 or 05 <= 49 or 06 <= 59

## PaperlessBilling is No
##
## PaperlessBilling is No
##
## Tenure_DISC is 04 <= 22 or 05 <= 49 or 06 <= 59
## PaperlessBilling is Yes
##
## Tenure_DISC is 02 <= 5 or 03 <= 16
## PaymentMethod is Bank transfer (automatic) or Electronic check
##
## PaperlessBilling is Yes
##
## Tenure_DISC is 02 <= 5 or 03 <= 16
## PaymentMethod is Credit card (automatic) or Mailed check
##
## Tenure_DISC is 04 <= 22 or 06 <= 59
## OnlineBackup is No
##
## Tenure_DISC is 02 <= 5 or 03 <= 16
## MonthlyCharges_DISC is 05_DE_80_A_100 or 07_MAYOR_100
##
## Tenure_DISC is 01 <= 1 or 02 <= 5
##
Por ejemplo el primer criterio sería tener un contrato de uno o dos años que bajaría notablemente el Churn, el
segundo criterio se base en un contrato mensual, con una antigüedad superior a los 5 meses pero sin servicio
de internet, etc…
Podemos llevarnos el nodo final de cada cliente a un data.frame para poder hacer una explotación posterior
ar_numnodos<-rpart.predict(ar,test,nn = T)
head(ar_numnodos)
## 0 1 nn
## 1 0.3888889 0.61111111 219
## 2 0.9347531 0.06524694 2
## 3 0.6025641 0.39743590 108
## 4 0.9347531 0.06524694 2
## 5 0.4653061 0.53469388 239
## 6 0.9347531 0.06524694 2
Vamos a calcular los scorings y evaluar el modelo
ar_predict<-predict(ar,test,type = 'prob')[,2]
Vemos que pinta tiene
plot(ar_predict~test$Churn, border=9, col=22, main="Comparación Churn real vs Scoring - Model

o Arbol de decision",cex.main=0.9,xlab="Churn real en la muestra test", ylab="Scoring del Chu
rn", cex.axis=0.7, cex.lab=0.8)
Aquí vemos que predice en línea a como lo hacía el modelo anterior de la regresión logística pero como en
este caso seguimos teniendo valores con un elevado scoring pero un Churn real de 0
umb_ar<-umbrales(test$Churn,ar_predict)
umb_ar

## 1 0.05 0.05000 0.05000 0.05000 0.05000
## 2 0.10 65.74691 43.03243 87.52228 57.69683
## 3 0.15 65.74691 43.03243 87.52228 57.69683
## 4 0.20 74.78592 51.81712 78.78788 62.51768
## 5 0.25 75.16651 52.35792 77.18360 62.39193
## 6 0.30 76.92674 55.00000 74.50980 63.28539
## 7 0.35 79.63844 60.88380 66.31016 63.48123
## 8 0.40 79.87631 63.42412 58.11052 60.65116
## 9 0.45 80.44719 65.30612 57.04100 60.89439
## 10 0.50 80.54234 66.37931 54.90196 60.09756
## 11 0.55 79.87631 69.49153 43.85027 53.77049
## 12 0.60 79.01998 71.42857 35.65062 47.56243
## 13 0.65 78.78211 74.05858 31.55080 44.25000
## 14 0.70 75.68982 80.48780 11.76471 20.52877
## 15 0.75 75.68982 80.48780 11.76471 20.52877
## 16 0.80 75.54710 85.07463 10.16043 18.15287
## 17 0.85 75.54710 85.07463 10.16043 18.15287
## 18 0.90 0.90000 0.90000 0.90000 0.90000
## 19 0.95 0.95000 0.95000 0.95000 0.95000
Seleccionamos automaticamente el mejor umbral
umbral_final_ar<-umb_ar[which.max(umb_ar$F1),1]
umbral_final_ar
## [1] 0.35
En este caso el mejor umbral es el 0.35
Evaluamos la matriz de confusion y las metricas con el umbral optimizado
confusion(test$Churn,ar_predict,umbral_final_ar)
##
## real FALSE TRUE
## 0 1302 239
## 1 189 372
ar_metricas<-filter(umb_ar,umbral==umbral_final_ar)
ar_metricas

## 1 0.35 79.63844 60.8838 66.31016 63.48123
Evaluamos la ROC
ar_prediction<-prediction(ar_predict,test$Churn)
roc(ar_prediction)
ar_metricas<-cbind(ar_metricas,AUC=round(auc(ar_prediction),2)*100)
print(t(ar_metricas))
## [,1]
## umbral 0.35000
## acierto 79.63844
## F1 63.48123
## AUC 82.00000
Este modelo tiene mayor precisión que la regresión logística pero como es lógico la cobertura es más baja ya
que ambas variables funcionan a la inversa. El acierto también es más elevado que la regresión logística.
La AUC consigue niveles por debajo de la regresión logística (81.86006)
4.5. Modelo Random Forest

Creamos el modelo
formula_rf <- formula

rf<-randomForest(formula_rf,train,importance=T)
rf
##
## Call:
## randomForest(formula = formula_rf, data = train, importance = T)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 20.38%
## Confusion matrix:
## 0 1 class.error
## 0 3216 417 0.1147812
## 1 590 718 0.4510703
Visualizamos las variables mas importantes
varImpPlot(rf, main='Importancia de las variables en Random Forest')
Vemos que la antigüedad es la variable más importante en Random Forest en los dos criterios pero el resto de
posiciones el orden difiere bastante.
Como hay dos criterios vamos a crear una unica variable agregada y visualizarla para tener una mejor idea de
la importancia de cada variable
importancia <- importance(rf)[,3:4]

importancia_norm <- as.data.frame(scale(importancia))
importancia_norm <- importancia_norm %>% mutate(
Variable = rownames(importancia_norm),
Imp_tot = MeanDecreaseAccuracy + MeanDecreaseGini) %>%
mutate(Imp_tot = Imp_tot + abs(min(Imp_tot))) %>%
arrange(desc(Imp_tot)) %>%
select(Variable,Imp_tot,MeanDecreaseAccuracy,MeanDecreaseGini)
ggplot(importancia_norm, aes(reorder(Variable,-Imp_tot),Imp_tot)) + geom_bar(stat = "identit
y", fill = "darkblue") + theme(axis.text.x = element_text(angle = 90,size = 9)) + ggtitle ("I
mportancia de las variables en Random Forest") + labs(x = "Variables ordenadas por importanci
a",y = "% de la importancia explicada")
importancia_norm
## Variable Imp_tot MeanDecreaseAccuracy MeanDecreaseGini

## 1 Tenure_DISC 6.075961670 1.7450521 2.4080930
## 2 Contract 4.763763723 1.5537060 1.2872411
## 3 InternetService 3.447610611 1.2673087 0.2574853
## 4 MonthlyCharges_DISC 2.300133943 0.1116773 0.2656400
## 5 PaymentMethod 1.789829143 -0.6124475 0.4794601
## 6 OnlineSecurity 1.163624137 -0.1503411 -0.6088514
## 7 TechSupport 1.113902796 -0.1535495 -0.6553643
## 8 OnlineBackup 0.860064018 -0.3890937 -0.6736589
## 9 PaperlessBilling 0.824246251 -0.5192477 -0.5793226
## 10 SeniorCitizen 0.727765723 -0.4579296 -0.7371213
## 11 Partner 0.006897105 -1.2286168 -0.6873027
## 12 Dependents 0.000000000 -1.1665184 -0.7562982
Las variables que mejor explican la variabilidad en el Churn son:
La antigüedad
La presencia de contrato
El servicio de Internet
Los cargos mensuales
El sistema de pago
La caída es bastante gradual, asi que no hay corte claro. Podemos seleccionar aquellas variables que están
por encima de 1.0
a_mantener <- importancia_norm %>%

filter(Imp_tot > 1.0) %>%
select(Variable)
a_mantener <- as.character((a_mantener$Variable))
Creamos de nuevo el modelo con las nuevas variables
formula_rf <- reformulate(a_mantener,target)

rf<-randomForest(formula_rf,train,importance=T)
rf
##
## Call:
## randomForest(formula = formula_rf, data = train, importance = T)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 20.52%
## Confusion matrix:
## 0 1 class.error
## 0 3260 373 0.1026700
## 1 641 667 0.4900612
Aplicamos el modelo al conjunto de test, generando un vector con las probabilidades
rf_predict<-predict(rf,test,type = 'prob')[,2]
plot(rf_predict~test$Churn, border=9, col=11, main="Comparación Churn real vs Scoring - Model

o Random Forest",cex.main=0.9,xlab="Churn real en la muestra test", ylab="Scoring del Churn",
cex.axis=0.7, cex.lab=0.8)
Parece que se generan bastantes valores con un Churn negativo pero scoring elevados en el modelo.
Además los valores con un Churn positivo tiene una dispersión muy amplia.
umb_rf<-umbrales(test$Churn,rf_predict)
umb_rf

## 1 0.05 68.22074 44.96707 85.20499 58.86700
## 2 0.10 73.50143 50.22831 78.43137 61.23869
## 3 0.15 75.64225 53.19426 72.72727 61.44578
## 4 0.20 78.02093 57.37705 68.62745 62.50000
## 5 0.25 79.01998 59.64630 66.13191 62.72189
## 6 0.30 79.63844 61.44578 63.63636 62.52189
## 7 0.35 79.63844 62.52354 59.18004 60.80586
## 8 0.40 79.78116 63.60000 56.68449 59.94345
## 9 0.45 80.11418 64.68172 56.14973 60.11450
## 10 0.50 80.16175 67.22488 50.08913 57.40552
## 11 0.55 80.30447 68.42105 48.66310 56.87500
## 12 0.60 79.97146 69.33702 44.74153 54.38787
## 13 0.65 79.68601 72.94521 37.96791 49.94138
## 14 0.70 79.59087 75.58140 34.75936 47.61905
## 15 0.75 79.49572 77.08333 32.97683 46.19226
## 16 0.80 79.44814 77.21519 32.62032 45.86466
## 17 0.85 79.01998 80.30303 28.34225 41.89723
## 18 0.90 78.30637 80.34682 24.77718 37.87466
## 19 0.95 77.92578 80.12422 22.99465 35.73407
De nuevo parece que el umbral que maximiza el F1 es el mismo queen la regresión logística pero lo
seleccionamos automáticamente
umbral_final_rf<-umb_rf[which.max(umb_rf$F1),1]
umbral_final_rf
## [1] 0.25
El umbral que mayor F1 consigue es 0.25
Evaluamos la matriz de confusión y las métricas con el umbral optimizado
confusion(test$Churn,rf_predict,umbral_final_rf)
##
## real FALSE TRUE
## 0 1290 251
## 1 190 371
rf_metricas<-filter(umb_rf,umbral==umbral_final_rf)
rf_metricas

## 1 0.25 79.01998 59.6463 66.13191 62.72189
Evaluamos la ROC
rf_prediction<-prediction(rf_predict,test$Churn)
roc(rf_prediction)
rf_metricas<-cbind(rf_metricas,AUC=round(auc(rf_prediction),2)*100)
print(t(rf_metricas))
## [,1]
## umbral 0.25000
## acierto 79.01998
## F1 62.72189
## AUC 83.00000
La precisión es más elevada que en la regresión logística y los árboles de decisión pero la cobertura al
contrario es baja.
El acierto es el más elevado de los tres. La AUC es superior a los Árboles de decisión pero por debajo
del que alcanza la regresión logística (83.36879)
4.6. Modelo Bayesiano

El último modelo que vamos a probar es el modelo Modelo Bayesiano ingenuo o naive
Primero tenemos que eliminar de las dos muestras de entrenamiento y de test el código de cliente y dejar sólo
la variable target y las independientes
train_Bayes <- select(train, -customerID)
Creamos el modelo y vemos como clasifica en las dos niveles de la variable independiente Churn
Bayes=naiveBayes(Churn~., data=train_Bayes)
print(Bayes)
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.7352763 0.2647237
##
## Conditional probabilities:
## Contract
## Y Month-to-month One year Two year
## 0 0.43214974 0.24552711 0.32232315
## 1 0.88990826 0.08868502 0.02140673
##
## MonthlyCharges_DISC
## Y 01_MENOR_20 02_DE_20_A_40 03_DE_40_A_60 04_DE_60_A_80 05_DE_80_A_100
## 0 0.11312964 0.19763281 0.16240022 0.19157721 0.21277181
## 1 0.03440367 0.08486239 0.14908257 0.25382263 0.34556575
## MonthlyCharges_DISC
## Y 07_MAYOR_100
## 0 0.12248830
## 1 0.13226300
##
## InternetService
## Y DSL Fiber optic No
## 0 0.38370493 0.34434352 0.27195156
## 1 0.24617737 0.68807339 0.06574924
##
## PaymentMethod
## Y Bank transfer (automatic) Credit card (automatic) Electronic check
## 0 0.2477291 0.2491054 0.2496559
## 1 0.1360856 0.1269113 0.5619266
## PaymentMethod
## Y Mailed check
## 0 0.2535095
## 1 0.1750765
##
## PaperlessBilling
## Y No Yes
## 0 0.4668318 0.5331682
## 1 0.2515291 0.7484709
##
## OnlineSecurity
## Y No Yes
## 0 0.6672172 0.3327828
## 1 0.8425076 0.1574924
##
## TechSupport
## Y No Yes
## 0 0.6625378 0.3374622
## 1 0.8279817 0.1720183
##
## SeniorCitizen
## Y 0 1
## 0 0.8720066 0.1279934
## 1 0.7423547 0.2576453
##
## Partner
## Y No Yes
## 0 0.4750895 0.5249105
## 1 0.6269113 0.3730887
##
## OnlineBackup
## Y No Yes
## 0 0.6344619 0.3655381
## 1 0.7232416 0.2767584
##
## Dependents
## Y No Yes
## 0 0.6559317 0.3440683
## 1 0.8233945 0.1766055
##
## Tenure_DISC
## Y 01 <= 1 02 <= 5 03 <= 16 04 <= 22 05 <= 49 06 <= 59
## 0 0.048444811 0.077071291 0.141756124 0.069914671 0.283787503 0.113404900
## 1 0.213302752 0.190366972 0.226299694 0.069571865 0.187308869 0.060397554
## Tenure_DISC
## Y 07 <= 70 08 > 70
## 0 0.162124966 0.103495734
## 1 0.048165138 0.004587156
##
## random
## Y [,1] [,2]
## 0 1 0
## 1 1 0
Vemos por ejemplo como el contrato de 2 años genera bajas probabilidades de Churn o disponer el servicio
de seguridad online baja el Churn.
Aplicamos el modelo al conjunto de test, generando un vector con los resultados. En este caso no nos
devuelve probabilidades sino directamente 0 y 1 por lo que no necesitamos trabajar con umbrales
Bayes_predict <- predict(Bayes, test)
## Warning in data.matrix(newdata): NAs introducidos por coerción
plot(Bayes_predict~test$Churn, border=9, main="Comparación Churn real vs Churn previsto - Mod

elo Bayesiano", xlab="Churn real en la muestra test", ylab="Previsión de Churn", col=c('Darkb
lue','DarkGreen'))
En este caso no obtenemos probabilidades sino que clasifica los resultados en 1 y 0 (Churn sí o no). En
general parece que clasifica bien aunque a algunos valores con Churn negativo en realidad los clasifica como
con Churn y a la inversa.
Sacamos la matriz de confusión y las principales métricas
BayesMatrizConfusion <- table(test$Churn, Bayes_predict, dnn = c("Realidad", "Predicción"))

confusionMatrix(BayesMatrizConfusion)
## Confusion Matrix and Statistics

##
## Predicción
## Realidad 0 1
## 0 1263 278
## 1 157 404
##
## Accuracy : 0.7931
## 95% CI : (0.7751, 0.8102)
## No Information Rate : 0.6755
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.5051
##
## Mcnemar's Test P-Value : 0.000000008738
##
## Sensitivity : 0.8894
## Specificity : 0.5924
## Pos Pred Value : 0.8196
## Neg Pred Value : 0.7201
## Prevalence : 0.6755
## Detection Rate : 0.6009
## Detection Prevalence : 0.7331
## Balanced Accuracy : 0.7409
##
## 'Positive' Class : 0
##
Y lo convertimos en un vector y le damos nombre
Bayes_metricas<-c(acierto <- (BayesMatrizConfusion[1,1] + BayesMatrizConfusion[2,2]) / sum(Ba

yesMatrizConfusion) *100,
precision <- BayesMatrizConfusion[2,2] / (BayesMatrizConfusion[2,2] + BayesMatrizConfusion[
1,2]) *100,
cobertura <- BayesMatrizConfusion[2,2] / (BayesMatrizConfusion[2,2] + BayesMatrizConfusion[
2,1]) *100,
F1 <- 2*precision*cobertura/(precision+cobertura))
Bayes_metricas
## [1] 79.30542 59.23754 72.01426 65.00402
Evaluamos la ROC pero tenemos que transformar los resultados de factor a numérico
Bayes_predict<-as.numeric(Bayes_predict)
Bayes_prediction<-prediction(Bayes_predict,test$Churn)
roc(Bayes_prediction)
Sacamos las metricas definitivas incluyendo el AUC e incluimos el dato de umbral 0 para que se pueda
comparar con el resto de modelos ya Bayes no necesita umbrales. El AUC es el más bajo de los cuatro
modelos evaluados (76.98701)
Bayes_metricas<-append(c(NA),Bayes_metricas)
Bayes_metricas <- t(Bayes_metricas)
Bayes_metricas<-cbind(Bayes_metricas,round(auc(Bayes_prediction),2)*100)
print(t(Bayes_metricas))
## [,1]
## [1,] NA
## [2,] 79.30542
## [3,] 59.23754
## [4,] 72.01426
## [5,] 65.00402
## [6,] 77.00000
Lo convertimos en un data frame para poder comparar con los otros 3 modelos
Bayes_metricas <- data.frame(matrix(unlist(Bayes_metricas), ncol=length(Bayes_metricas)))

names (Bayes_metricas) = c("umbral","acierto","precision","cobertura","F1","AUC")
Bayes_metricas
## umbral acierto precision cobertura F1 AUC

## 1 NA 79.30542 59.23754 72.01426 65.00402 77
Vemos que con Bayes obtenemos una menor cobertura pero un nivel de acierto elevado. Sin embargo el AUC
es el más bajo de los modelos empleados
4.7. Evaluación de los modelos manuales y de sus resultados

Comparamos los 4 métodos utilizando todas las métricas
ComparativaModelos1 <- rbind(rl_metricas,ar_metricas,rf_metricas,Bayes_metricas)

rownames(ComparativaModelos1) <- c('Regresion Logistica Manual','Arbol Decision','Random Fore
st','Bayes')
t(ComparativaModelos1)
## Regresion Logistica Manual Arbol Decision Random Forest Bayes

## umbral 0.30000 0.35000 0.25000 NA
## acierto 75.40438 79.63844 79.01998 79.30542
## precision 52.60047 60.88380 59.64630 59.23754
## cobertura 79.32264 66.31016 66.13191 72.01426
## F1 63.25515 63.48123 62.72189 65.00402
## AUC 84.00000 82.00000 83.00000 77.00000
Conclusión: teniendo en cuenta el AUC los dos modelos que obtienen niveles más elevados son los de
regresion logistica y Random Forest. Si además damos prioridad a la cobertura aunque tengamos una
menor precisión optaríamos por el modelo de Regresión Logística
Escribimos el scoring final del modelo de Regresión Logística en el dataset y guardamos el modelo
aunque vamos a hacer también modelos automáticos
df$SCORING_CHURN <- predict(rl,df,type = 'response')

saveRDS(rl,'Modelo_Churn_manual_final.rds')
Podemos ver la media del Scoring vs el Churn real y comprobar que la media es significativamente más baja
en el caso del 0 (0.18) vs el 1 (0.48) aunque la Desviación es elevada en ambos casos
group_by(df, Churn) %>% summarize(mean_Scoring = mean(SCORING_CHURN), Desv_Scoring = sd(SCORI

NG_CHURN))
## # A tibble: 2 x 3
## Churn mean_Scoring Desv_Scoring
## <fct> <dbl> <dbl>
## 1 0 0.184 0.194
## 2 1 0.485 0.222
Y lo vemos en un gráfico
ggplot(df, aes(x=Churn, y=SCORING_CHURN, fill=Churn)) + geom_boxplot(color = 'darkblue') + xl

ab("Churn real") + ylab("Scoring Regresion Logística") + ggtitle('Comparación Churn vs Scorin
g Modelo Ganador Regresión Logística') + theme (plot.title = element_text(size=rel(0.9), face
="bold")) + theme(axis.title.x = element_text(face="bold", size=rel(0.8))) +
theme(axis.title.y = element_text(face="bold",size=rel(0.8)))
En general parece que identifica bien el Churn aunque como vimos previamente algunos valores que tienen
un elevado scoring pese a tener un Churn negativo
Finalmente guardamos el fichero antes de hacer modelos con AutoMl
5. Modelización con AutoMl

5.1. Preparación del entorno AutoMl
Vamos a modelizar ahora usando modelos automáticos y comparar los resultados obtenidos con los modelos
manuales que hemos realizado previamente
Cargamos los datos
df <- readRDS('cacheDan3.rds')
Inicializamos el cluster de H2O
h2o.init()
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 16 minutes 35 seconds
## H2O cluster timezone: Europe/Madrid
## H2O data parsing timezone: UTC
## H2O cluster version: 3.30.0.1
## H2O cluster version age: 1 month and 18 days
## H2O cluster name: H2O_started_from_R_DanielB_qzt318
## H2O cluster total nodes: 1
## H2O cluster total memory: 0.47 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder,
Core V4
## R Version: R version 3.6.3 (2020-02-29)
Preparamos los datos y los cargamos en el cluster h2o
df_h2o <- as.h2o(df)
##
|
| | 0%
|
|======================================================================| 100%
Creamos las particiones en los datos entre la muestra de entrenamiento y la de validación
split <- h2o.splitFrame(df_h2o)

train_h2o <- split[[1]]
valid_h2o <- split[[2]]
split <- NULL
Definimos los roles de las variables, tanto de la variable target (Churn) como de las dependientes dejando
fuera el scoring obtenido anteriormente y el CustomerID
y <- 'Churn'
x <- setdiff(names(df_h2o),c('customerID','SCORING_CHURN','Churn',y))
5.2. Modelización con Regresión Logística

Vamos a crear un primer modelo de regresión logística con h2o
rl <- h2o.glm(
y = y,
x = x,
training_frame = train_h2o,
validation_frame = valid_h2o,
nfolds = 5,
family = 'binomial',
lambda = 0,
compute_p_values = TRUE
)
##
|
| | 0%
|
|======= | 11%
|
|======================================================================| 100%
Sacamos la significación de los coeficientes
rl@model$coefficients_table %>%
as.data.frame() %>%
mutate(coefficients = round(coefficients,2),
p_value = round(p_value,2)) %>%
select(names,coefficients,p_value)
## names coefficients p_value

## 1 Intercept 0.30 0.36
## 2 Tenure_DISC.02 <= 5 -0.88 0.00
## 3 Tenure_DISC.03 <= 16 -1.38 0.00
## 4 Tenure_DISC.04 <= 22 -1.82 0.00
## 5 Tenure_DISC.05 <= 49 -2.08 0.00
## 6 Tenure_DISC.06 <= 59 -2.25 0.00
## 7 Tenure_DISC.07 <= 70 -2.61 0.00
## 8 Tenure_DISC.08 > 70 -3.64 0.00
## 9 MonthlyCharges_DISC.02_DE_20_A_40 0.21 0.40
## 10 MonthlyCharges_DISC.03_DE_40_A_60 -0.07 0.83
## 11 MonthlyCharges_DISC.04_DE_60_A_80 -0.03 0.94
## 12 MonthlyCharges_DISC.05_DE_80_A_100 0.35 0.31
## 13 MonthlyCharges_DISC.07_MAYOR_100 0.88 0.02
## 14 PaymentMethod.Credit card (automatic) 0.02 0.89
## 15 PaymentMethod.Electronic check 0.35 0.00
## 16 PaymentMethod.Mailed check -0.10 0.45
## 17 Contract.One year -0.72 0.00
## 18 Contract.Two year -1.41 0.00
## 19 InternetService.Fiber optic 0.69 0.00
## 20 InternetService.No -1.28 0.00
## 21 OnlineSecurity.Yes -0.36 0.00
## 22 TechSupport.Yes -0.40 0.00
## 23 SeniorCitizen.1 0.26 0.01
## 24 Partner.Yes 0.06 0.51
## 25 OnlineBackup.Yes -0.20 0.03
## 26 Dependents.Yes -0.21 0.05
## 27 PaperlessBilling.Yes 0.47 0.00
De nuevo la antigüedad es la variable con los coeficientes más elevados y signo negativo (a mayor
antigüedad menor es el abandono). También los cargos mensuales, el servicio de internet y el tipo de contrato
tienen de nuevo elevados coeficientes.
Aunque algunas variables tienen bajos coeficientes alcanzan un elevado p_value así que optamos por no
eliminar ninguna variable
Evaluamos los resultados del modelo
rl@model$validation_metrics
## H2OBinomialMetrics: glm
## ** Reported on validation data. **
##
## MSE: 0.134352
## RMSE: 0.3665406
## LogLoss: 0.4136125
## Mean Per-Class Error: 0.2374537
## AUC: 0.8398253
## AUCPR: 0.6443712
## Gini: 0.6796507
## R^2: 0.2899401
## Residual Deviance: 1439.371
## AIC: 1493.371
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 1012 287 0.220939 =287/1299
## 1 112 329 0.253968 =112/441
## Totals 1124 616 0.229310 =399/1740
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.321889 0.622517 198
## 2 max f2 0.085079 0.728617 324
## 3 max f0point5 0.466177 0.614895 131
## 4 max accuracy 0.466177 0.805172 131
## 5 max precision 0.922682 1.000000 0
## 6 max recall 0.011503 1.000000 389
## 7 max specificity 0.922682 1.000000 0
## 8 max absolute_mcc 0.419579 0.485793 152
## 9 max min_per_class_accuracy 0.300670 0.755966 209
## 10 max mean_per_class_accuracy 0.321889 0.762546 198
## 11 max tns 0.922682 1299.000000 0
## 12 max fns 0.922682 439.000000 0
## 13 max fps 0.002539 1299.000000 399
## 14 max tps 0.011503 441.000000 389
## 15 max tnr 0.922682 1.000000 0
## 16 max fnr 0.922682 0.995465 0
## 17 max fpr 0.002539 1.000000 399
## 18 max tpr 0.011503 1.000000 389
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>,
valid=<T/F>, xval=<T/F>)`
h2o.auc(rl@model$validation_metrics)
## [1] 0.8398253
Obtenemos un buen AUC (83.98253)
Incluimos los principales resultados de la matriz de confusión
matrizconfusionh2orl <- h2o.confusionMatrix(rl)

matrizconfusionh2orl
## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.3493648
54584978:
## 0 1 Error Rate
## 0 3122 753 0.194323 =753/3875
## 1 380 1048 0.266106 =380/1428
## Totals 3502 1801 0.213653 =1133/5303
Extraemos las principales variables de la matriz de confusión para poder comparar con los modelos anteriores
aunque no tengamos el dato del umbral
h2orl_metricas<-c(NA, acierto <- (matrizconfusionh2orl[1,1] + matrizconfusionh2orl[2,2]) / (m

atrizconfusionh2orl[1,1] + matrizconfusionh2orl[1,2] + matrizconfusionh2orl[2,1] + matrizconf
usionh2orl[2,2]) *100,
precision <- matrizconfusionh2orl[2,2] / (matrizconfusionh2orl[2,1] + matrizconfusionh2orl[
2,2]) *100,
cobertura <- matrizconfusionh2orl[2,2] / (matrizconfusionh2orl[2,2] + matrizconfusionh2orl[
1,2]) *100,
F1 <- 2*precision*cobertura/(precision+cobertura),
h2o.auc(rl@model$validation_metrics)*100)
names (h2orl_metricas) = c("umbral","acierto","precision","cobertura","F1","AUC")
h2orl_metricas

## NA 78.63474 73.38936 58.18989 64.91174 83.98253
Obtenemos un buen nivel de acierto y de precisión pero con una cobertura baja, sobre todo en
comparación a la obtenida en los modelos manuales
5.2. Modelización con Random forest

Para modelizar en Random Forest hemos de definir manualmente los 2 parámetros principales: el número de
variables a emplear (mtries) y de árboles a generar (ntrees)
rf <- h2o.randomForest(
y = y,
x = x,
nfolds = 5,
ntrees = 50,
mtries = 5
)
##
|
| | 0%
|
|===== | 7%
|
|========== | 14%
|
|============== | 20%
|
|===================== | 30%
|
|============================== | 43%
|
|========================================= | 58%
|
|================================================ | 68%
|
|==================================================== | 74%
|
|======================================================= | 78%
|
|======================================================== | 80%
|
|========================================================== | 83%
|
|============================================================== | 88%
|
|================================================================ | 92%
|
|================================================================== | 95%
|
|===================================================================== | 98%
|
|======================================================================| 100%
Evaluamos
rf@model$validation_metrics
## H2OBinomialMetrics: drf
##
## MSE: 0.1453532
## RMSE: 0.3812521
## LogLoss: 0.5207393
## AUC: 0.8159547
## AUCPR: 0.6046925
## Gini: 0.6319094
## R^2: 0.2317983
##
## 0 1 Error Rate
## 0 1022 277 0.213241 =277/1299
## 1 135 306 0.306122 =135/441
## Totals 1157 583 0.236782 =412/1740
##
## 1 max f1 0.353526 0.597656 202
## 2 max f2 0.135037 0.718796 305
## 3 max f0point5 0.585723 0.583384 104
## 4 max accuracy 0.589244 0.794253 103
## 5 max precision 0.970243 1.000000 0
## 6 max recall 0.000004 1.000000 399
## 7 max specificity 0.970243 1.000000 0
## 8 max absolute_mcc 0.430041 0.443000 166
## 11 max tns 0.970243 1299.000000 0
## 12 max fns 0.970243 440.000000 0
## 13 max fps 0.000004 1299.000000 399
## 14 max tps 0.000004 441.000000 399
## 15 max tnr 0.970243 1.000000 0
## 16 max fnr 0.970243 0.997732 0
## 17 max fpr 0.000004 1.000000 399
## 18 max tpr 0.000004 1.000000 399
##
h2o.auc(rf@model$validation_metrics)
## [1] 0.8159547
Vemos que obtenemos un AUC más bajo que el obtenido con la regresión logística (81.59547)
Creamos en cualquier caso la matriz de confusión
matrizconfusionh2orf <- h2o.confusionMatrix(rf)

matrizconfusionh2orf
14293592:
## 0 1 Error Rate
## 0 2618 1257 0.324387 =1257/3875
## 1 275 1153 0.192577 =275/1428
## Totals 2893 2410 0.288893 =1532/5303
Extraemos las principales variables para poder comparar con los modelos anteriores
h2orf_metricas<-c(NA, acierto <- (matrizconfusionh2orf[1,1] + matrizconfusionh2orf[2,2]) / (m

atrizconfusionh2orf[1,1] + matrizconfusionh2orf[1,2] + matrizconfusionh2orf[2,1] + matrizconf
usionh2orf[2,2]) *100,
precision <- matrizconfusionh2orf[2,2] / (matrizconfusionh2orf[2,1] + matrizconfusionh2orf[
2,2]) *100,
cobertura <- matrizconfusionh2orf[2,2] / (matrizconfusionh2orf[2,2] + matrizconfusionh2orf[
1,2]) *100,
h2o.auc(rf@model$validation_metrics)*100)
names (h2orf_metricas) = c("umbral","acierto","precision","cobertura","F1","AUC")
h2orf_metricas

## NA 71.11069 80.74230 47.84232 60.08338 81.59547
Obtenemos valores inferiores a los de la regresión logísitica en todas las variables
Vamos a probar el modelo de Random Forest pero con otros parámetros diferentes a través del Grid Search y
el Random Search
5.3. Modelización con Grid Search

Para crear un modelo con Grid Search primero tenemos que crear las diferentes opciones a evaluar: número
de variables (mtries) y de árboles a generar (ntrees). El modelo va a testar todas las combinaciones posibles
rf_params <- list(

ntrees = c(30,50,100),
mtries = c(3,5,10)
)
Y luego va a generar el modelo usando la función h2o.grid
rf_grid <- h2o.grid(

algorithm = 'randomForest',
grid_id = 'rf_grid',
y = y,
x = x,
nfolds = 5,
hyper_params = rf_params
)
##
|
| | 0%
|
|======================================================================| 100%
Evaluamos los resultados y ordenamos en base al AUC obtenido
rf_grid_results <- h2o.getGrid(

grid_id = 'rf_grid',
sort_by = 'auc',
decreasing = T
)
rf_grid_results
## H2O Grid Details

## ================
##
## Grid ID: rf_grid
## Used hyper parameters:
## - mtries
## - ntrees
## Number of models: 18
## Number of failed models: 0
##
## Hyper-Parameter Search Summary: ordered by decreasing auc
## mtries ntrees model_ids auc
## 1 3 100 rf_grid_model_16 0.836894822445107
## 2 3 50 rf_grid_model_13 0.8346178729556338
## 3 3 100 rf_grid_model_7 0.8312668391929622
## 4 3 50 rf_grid_model_4 0.8287834344568387
## 5 3 30 rf_grid_model_10 0.8270483419174122
## 6 3 30 rf_grid_model_1 0.82694188216581
## 7 5 50 rf_grid_model_14 0.8213142676425409
## 8 5 100 rf_grid_model_17 0.8155409776813952
## 9 5 30 rf_grid_model_2 0.8137329649450095
## 10 5 30 rf_grid_model_11 0.8135400740941539
## 11 5 100 rf_grid_model_8 0.8134685512067796
## 12 5 50 rf_grid_model_5 0.8115931273631471
## 13 10 50 rf_grid_model_15 0.8026234751965303
## 14 10 100 rf_grid_model_18 0.8021274058010301
## 15 10 30 rf_grid_model_12 0.7979981024667931
## 16 10 100 rf_grid_model_9 0.7972993321643904
## 17 10 30 rf_grid_model_3 0.7907336416573018
## 18 10 50 rf_grid_model_6 0.7849960522121447
Seleccionamos el mejor modelo en base a su AUC
rf_gridwinner <- h2o.getModel(rf_grid_results@model_ids[[1]])
Evaluamos ese modelo que tiene el mayor AUC en la muestra de validación
rf_gridwinner@model$validation_metrics
##
## MSE: 0.1396304
## RMSE: 0.3736715
## LogLoss: 0.4298243
## AUC: 0.8260986
## AUCPR: 0.6034246
## Gini: 0.6521971
## R^2: 0.2620436
##
## 0 1 Error Rate
## 0 993 306 0.235566 =306/1299
## 1 117 324 0.265306 =117/441
## Totals 1110 630 0.243103 =423/1740
##
## 1 max f1 0.327651 0.605042 192
## 2 max f2 0.155712 0.737523 285
## 3 max f0point5 0.504837 0.594679 108
## 4 max accuracy 0.560279 0.797701 84
## 5 max precision 0.896253 1.000000 0
## 6 max recall 0.003087 1.000000 394
## 7 max specificity 0.896253 1.000000 0
## 8 max absolute_mcc 0.428719 0.452990 143
## 11 max tns 0.896253 1299.000000 0
## 12 max fns 0.896253 440.000000 0
## 13 max fps 0.000434 1299.000000 399
## 14 max tps 0.003087 441.000000 394
## 15 max tnr 0.896253 1.000000 0
## 16 max fnr 0.896253 0.997732 0
## 17 max fpr 0.000434 1.000000 399
## 18 max tpr 0.003087 1.000000 394
##
h2o.auc(rf_gridwinner@model$validation_metrics)
## [1] 0.8260986
Obtiene un AUC del 82.60986, superior al modelo anterior pero inferior al de regresión logística
Creamos la matriz de confusión
matrizconfusionh2orfgridwinner <- h2o.confusionMatrix(rf_gridwinner)

matrizconfusionh2orfgridwinner
01225369:
## 0 1 Error Rate
## 0 2809 1066 0.275097 =1066/3875
## 1 294 1134 0.205882 =294/1428
## Totals 3103 2200 0.256459 =1360/5303
h2orfgridwinner_metricas<-c(NA, acierto <- (matrizconfusionh2orfgridwinner[1,1] + matrizconfu

sionh2orfgridwinner[2,2]) / (matrizconfusionh2orfgridwinner[1,1] + matrizconfusionh2orfgridwi
nner[1,2] + matrizconfusionh2orfgridwinner[2,1] + matrizconfusionh2orfgridwinner[2,2]) *100,
precision <- matrizconfusionh2orfgridwinner[2,2] / (matrizconfusionh2orfgridwinner[2,1] + m
atrizconfusionh2orfgridwinner[2,2]) *100,
cobertura <- matrizconfusionh2orfgridwinner[2,2] / (matrizconfusionh2orfgridwinner[2,2] + m
atrizconfusionh2orfgridwinner[1,2]) *100,
h2o.auc(rf_gridwinner@model$validation_metrics)*100)
names (h2orfgridwinner_metricas) = c("umbral","acierto","precision","cobertura","F1","AUC")
h2orfgridwinner_metricas

## NA 74.35414 79.41176 51.54545 62.51378 82.60986
En la matriz de confusión obtenemos valores similares a los de la regresión logísitica aunque algo por
debajo en acierto. La cobertura que sigue siendo baja
5.4. Modelización con Random Search

Vamos ahora a modelizar con Random Search y probar 5 valores por parámetro y 3 parámetros distintos.
Esto equivale a crear 53 combinaciones, es decir, 125 en total por 3 particiones de validación cruzada
supondría crear 375 modelos aunque el algoritmo sólo probará alguna de las opciones posibles de manera
aleatoria.
Comenzamos definiendo el espacio de hyperparámetros
rf_params <- list(

ntrees = c(10,30,50,80,120),
mtries = c(3,5,8,12,15),
max_depth = c(5,10,20,25,30)
)
Y tenemos que definir ahora también los parámetros de búsqueda aleatoria definiendo un tiempo máximo, en
este caso 3 minutos o 180 segundos
search_params <- list(

strategy = "RandomDiscrete",
max_runtime_secs = 180
)
Volvemos a generar el modelo usando la función h2o.grid
rf_grid_rand <- h2o.grid(

algorithm = 'randomForest',
grid_id = 'rf_grid_rand',
y = y,
x = x,
nfolds = 3,
hyper_params = rf_params,
search_criteria = search_params
)
##
|
| | 0%
|
| | 1%
|
|= | 1%
|
|= | 2%
|
|== | 2%
|
|== | 3%
|
|== | 4%
|
|=== | 4%
|
|=== | 5%
|
|==== | 5%
|
|==== | 6%
|
|===== | 7%
|
|===== | 8%
|
|====== | 8%
|
|====== | 9%
|
|======= | 10%
|
|======== | 11%
|
|======== | 12%
|
|========= | 12%
|
|========= | 13%
|
|========== | 14%
|
|========== | 15%
|
|=========== | 15%
|
|=========== | 16%
|
|============ | 17%
|
|============ | 18%
|
|============= | 18%
|
|============= | 19%
|
|============== | 20%
|
|=============== | 21%
|
|=============== | 22%
|
|================ | 23%
|
|================= | 24%
|
|================== | 25%
|
|================== | 26%
|
|=================== | 27%
|
|==================== | 28%
|
|==================== | 29%
|
|===================== | 30%
|
|====================== | 31%
|
|====================== | 32%
|
|======================= | 33%
|
|======================== | 34%
|
|======================== | 35%
|
|========================= | 35%
|
|========================= | 36%
|
|========================== | 37%
|
|========================== | 38%
|
|=========================== | 38%
|
|=========================== | 39%
|
|============================ | 40%
|
|============================= | 41%
|
|============================= | 42%
|
|============================== | 42%
|
|============================== | 43%
|
|=============================== | 44%
|
|=============================== | 45%
|
|================================ | 45%
|
|================================ | 46%
|
|================================= | 47%
|
|================================= | 48%
|
|================================== | 48%
|
|================================== | 49%
|
|=================================== | 49%
|
|=================================== | 50%
|
|=================================== | 51%
|
|==================================== | 51%
|
|==================================== | 52%
|
|===================================== | 52%
|
|===================================== | 53%
|
|====================================== | 54%
|
|====================================== | 55%
|
|======================================= | 55%
|
|======================================= | 56%
|
|======================================== | 57%
|
|======================================== | 58%
|
|========================================= | 58%
|
|========================================= | 59%
|
|========================================== | 59%
|
|========================================== | 60%
|
|========================================== | 61%
|
|=========================================== | 61%
|
|=========================================== | 62%
|
|============================================ | 62%
|
|============================================ | 63%
|
|============================================= | 64%
|
|============================================= | 65%
|
|============================================== | 65%
|
|============================================== | 66%
|
|=============================================== | 67%
|
|=============================================== | 68%
|
|================================================ | 68%
|
|================================================ | 69%
|
|================================================= | 70%
|
|================================================== | 71%
|
|================================================== | 72%
|
|=================================================== | 72%
|
|=================================================== | 73%
|
|==================================================== | 74%
|
|===================================================== | 75%
|
|===================================================== | 76%
|
|====================================================== | 77%
|
|====================================================== | 78%
|
|======================================================= | 78%
|
|======================================================= | 79%
|
|======================================================== | 79%
|
|======================================================== | 80%
|
|======================================================== | 81%
|
|========================================================= | 81%
|
|========================================================= | 82%
|
|========================================================== | 82%
|
|========================================================== | 83%
|
|=========================================================== | 84%
|
|=========================================================== | 85%
|
|============================================================ | 85%
|
|============================================================ | 86%
|
|============================================================= | 87%
|
|============================================================= | 88%
|
|============================================================== | 88%
|
|============================================================== | 89%
|
|=============================================================== | 90%
|
|================================================================ | 91%
|
|================================================================ | 92%
|
|================================================================= | 92%
|
|================================================================= | 93%
|
|================================================================== | 94%
|
|=================================================================== | 95%
|
|=================================================================== | 96%
|
|==================================================================== | 97%
|
|===================================================================== | 98%
|
|===================================================================== | 99%
|
|======================================================================| 100%
## Warning in h2o.getGrid(grid_id = grid_id): Some models were not built due to a

## failure, for more details run `summary(grid_object, show_stack_traces = TRUE)`
Evaluamos los resultados y los ordenamos en función de su AUC
rf_grid_results_rand <- h2o.getGrid(

grid_id = 'rf_grid_rand',
sort_by = 'auc',
decreasing = T
)
## Warning in h2o.getGrid(grid_id = "rf_grid_rand", sort_by = "auc", decreasing

## = T): Some models were not built due to a failure, for more details run
## `summary(grid_object, show_stack_traces = TRUE)`
rf_grid_results_rand
## H2O Grid Details

## ================
##
## Grid ID: rf_grid_rand
## Used hyper parameters:
## - max_depth
## - mtries
## - ntrees
## Number of models: 36
## Number of failed models: 18
##
## Hyper-Parameter Search Summary: ordered by decreasing auc
## max_depth mtries ntrees model_ids auc
## 1 5 3 80 rf_grid_rand_model_12 0.8407059514384359
##
## ---
## max_depth mtries ntrees model_ids auc
## Failed models
## -------------
## max_depth mtries ntrees status_failed
## 20 12 50 FAIL
## 30 12 10 FAIL
## 25 15 50 FAIL
## 30 12 120 FAIL
## 25 12 30 FAIL
## 30 15 30 FAIL
## 20 12 120 FAIL
## 20 12 10 FAIL
## 5 15 10 FAIL
## 20 12 80 FAIL
## 20 15 120 FAIL
## 30 15 50 FAIL
## 10 15 50 FAIL
## 5 12 10 FAIL
## 10 12 120 FAIL
## 5 12 50 FAIL
## 10 15 30 FAIL
## 25 12 10 FAIL
##
msgs_failed
##
"Illegal argument(s) for DRF model: rf_grid_rand_model_15. Details: ERRR on field: _mtries:
Computed mtries should be -1 or -2 or in interval [1,12[ but it is 12\n"
##
## "Illegal argument(s) for DRF model: rf_grid_rand_model_17. Details: ERRR on field: _mtri
es: Computed mtries should be -1 or -2 or in interval [1,12[ but it is 15\nERRR on field: _mt
ries: Computed mtries should be -1 or -2 or in interval [1,13[ but it is 15\nERRR on field: _
mtries: Computed mtries should be -1 or -2 or in interval [1,13[ but it is 15\nERRR on field:
_mtries: Computed mtries should be -1 or -2 or in interval [1,13[ but it is 15\n"
##
"Illegal argument(s) for DRF model: rf_grid_rand_model_4. Details: ERRR on field: _mtries: C
omputed mtries should be -1 or -2 or in interval [1,12[ but it is 12\n"
##
##
##
##
##
##
##
##
Algunos modelos nos dan error pero disponemos suficientes modelos para poder elegir y seleccionamos el
mejor en base a su AUC
rf_randwinner <- h2o.getModel(rf_grid_results_rand@model_ids[[1]])
Y lo evaluamos
rf_randwinner@model$validation_metrics
##
## MSE: 0.1297066
## RMSE: 0.3601481
## LogLoss: 0.4016776
## AUC: 0.8549033
## AUCPR: 0.6449117
## Gini: 0.7098067
## R^2: 0.3006686
##
## 0 1 Error Rate
## 0 1085 227 0.173018 =227/1312
## 1 119 309 0.278037 =119/428
## Totals 1204 536 0.198851 =346/1740
##
## 1 max f1 0.372960 0.641079 148
## 2 max f2 0.264981 0.742754 212
## 3 max f0point5 0.438884 0.613861 112
## 4 max accuracy 0.488570 0.810345 93
## 5 max precision 0.745260 1.000000 0
## 6 max recall 0.020651 1.000000 391
## 7 max specificity 0.745260 1.000000 0
## 8 max absolute_mcc 0.372960 0.512061 148
## 11 max tns 0.745260 1312.000000 0
## 12 max fns 0.745260 426.000000 0
## 13 max fps 0.013749 1312.000000 399
## 14 max tps 0.020651 428.000000 391
## 15 max tnr 0.745260 1.000000 0
## 16 max fnr 0.745260 0.995327 0
## 17 max fpr 0.013749 1.000000 399
## 18 max tpr 0.020651 1.000000 391
##
h2o.auc(rf_randwinner@model$validation_metrics)
## [1] 0.8549033
Obtiene un buen nivel de AUC del 85.49033
matrizconfusionh2orfrandwinner <- h2o.confusionMatrix(rf_randwinner)

matrizconfusionh2orfrandwinner
83779214:
## 0 1 Error Rate
## 0 3060 802 0.207664 =802/3862
## 1 396 1045 0.274809 =396/1441
## Totals 3456 1847 0.225910 =1198/5303
h2orfrandwinner_metricas<-c(NA, acierto <- (matrizconfusionh2orfrandwinner[1,1] + matrizconfu

sionh2orfrandwinner[2,2]) / (matrizconfusionh2orfrandwinner[1,1] + matrizconfusionh2orfrandwi
nner[1,2] + matrizconfusionh2orfrandwinner[2,1] + matrizconfusionh2orfrandwinner[2,2]) *100,
precision <- matrizconfusionh2orfrandwinner[2,2] / (matrizconfusionh2orfrandwinner[2,1] + m
atrizconfusionh2orfrandwinner[2,2]) *100,
cobertura <- matrizconfusionh2orfrandwinner[2,2] / (matrizconfusionh2orfrandwinner[2,2] + m
atrizconfusionh2orfrandwinner[1,2]) *100,
h2o.auc(rf_randwinner@model$validation_metrics)*100)
names (h2orfrandwinner_metricas) = c("umbral","acierto","precision","cobertura","F1","AUC")
h2orfrandwinner_metricas

## NA 77.40901 72.51908 56.57823 63.56448 85.49033
Obtenemos valores similares a los de la regresión logísitica en la matriz de confusión
5.4. Modelización automática con AutoMl

En el caso de la modelización automática creamos una regla de parada a los 5 minutos (300 segundos) y
definimos la variable con la que vamos a priorizar los modelos, en este caso el AUC como hemos hecho
anteriormente
automl_simple <- h2o.automl(

y = y,
x = x,
nfolds = 3,
max_runtime_secs = 300,
sort_metric = 'AUC'
)
##
|
| | 0%
## 16:25:07.835: User specified a validation frame with cross-validation still enabled. Pleas
e note that the models will still be validated using cross-validation only, the validation fr
ame will be used to provide purely informative validation metrics on the trained models.
|
|== | 2%
|
|== | 3%
|
|=== | 4%
|
|==== | 5%
|
|==== | 6%
|
|======== | 12%
|
|========= | 13%
|
|========== | 15%
|
|=========== | 15%
|
|============ | 17%
|
|============= | 18%
|
|============= | 19%
|
|============== | 20%
|
|================ | 22%
|
|================ | 23%
|
|================= | 24%
|
|================== | 25%
|
|================== | 26%
|
|=================== | 27%
|
|==================== | 28%
|
|==================== | 29%
|
|===================== | 30%
|
|====================== | 31%
|
|====================== | 32%
|
|======================= | 33%
|
|======================== | 34%
|
|======================== | 35%
|
|========================= | 35%
|
|========================== | 37%
|
|========================== | 38%
|
|=========================== | 39%
|
|============================ | 40%
|
|============================= | 41%
|
|============================= | 42%
|
|============================== | 42%
|
|============================== | 44%
|
|=============================== | 45%
|
|================================ | 46%
|
|================================= | 47%
|
|================================== | 48%
|
|================================== | 49%
|
|=================================== | 50%
|
|==================================== | 51%
|
|==================================== | 52%
|
|===================================== | 53%
|
|====================================== | 55%
|
|======================================= | 56%
|
|======================================== | 57%
|
|========================================= | 58%
|
|========================================= | 59%
|
|========================================== | 60%
|
|=========================================== | 61%
|
|=========================================== | 62%
|
|============================================ | 63%
|
|============================================= | 64%
|
|============================================= | 65%
|
|============================================== | 66%
|
|=============================================== | 67%
|
|=============================================== | 68%
|
|================================================ | 69%
|
|================================================= | 70%
|
|================================================== | 71%
|
|=================================================== | 72%
|
|=================================================== | 73%
|
|==================================================== | 75%
|
|===================================================== | 76%
|
|====================================================== | 77%
|
|====================================================== | 78%
|
|======================================================= | 79%
|
|======================================================== | 80%
|
|========================================================= | 81%
|
|========================================================== | 83%
|
|=========================================================== | 84%
|
|============================================================ | 86%
|
|============================================================= | 87%
|
|============================================================== | 88%
|
|=============================================================== | 90%
|
|================================================================ | 91%
|
|================================================================ | 92%
|
|================================================================== | 94%
|
|==================================================================== | 97%
|
|===================================================================== | 98%
|
|======================================================================| 99%
|
|======================================================================| 100%
Vemos todos los modelos que ha generado
automl_simple@leaderboard
## model_id auc logloss

## 1 GLM_1_AutoML_20200522_162507 0.8504900 0.4128789
## 2 StackedEnsemble_AllModels_AutoML_20200522_162507 0.8499799 0.4216140
## 3 StackedEnsemble_BestOfFamily_AutoML_20200522_162507 0.8496940 0.4215125
## 4 GBM_grid__1_AutoML_20200522_162507_model_2 0.8465655 0.4179362
## 5 GBM_5_AutoML_20200522_162507 0.8453562 0.4188286
## aucpr mean_per_class_error rmse mse
## 1 0.6638599 0.2394998 0.3664130 0.1342585
## 2 0.6626032 0.2354303 0.3686021 0.1358675
## 3 0.6620552 0.2397578 0.3686256 0.1358848
## 4 0.6596606 0.2310877 0.3684499 0.1357553
## 5 0.6553794 0.2333326 0.3690750 0.1362164
## 6 0.6551156 0.2337379 0.3691187 0.1362486
##
## [31 rows x 7 columns]
print(automl_simple@leaderboard, n = nrow(automl_simple@leaderboard))
## model_id auc logloss

## 1 GLM_1_AutoML_20200522_162507 0.8504900 0.4128789
## 2 StackedEnsemble_AllModels_AutoML_20200522_162507 0.8499799 0.4216140
## 3 StackedEnsemble_BestOfFamily_AutoML_20200522_162507 0.8496940 0.4215125
## 5 GBM_5_AutoML_20200522_162507 0.8453562 0.4188286
## 9 XGBoost_grid__1_AutoML_20200522_162507_model_1 0.8438421 0.4207890
## 14 DeepLearning_1_AutoML_20200522_162507 0.8381488 0.4304889
## 15 GBM_1_AutoML_20200522_162507 0.8375663 0.4280813
## 17 XGBoost_2_AutoML_20200522_162507 0.8358072 0.4318807
## 18 GBM_3_AutoML_20200522_162507 0.8352424 0.4307389
## 19 GBM_2_AutoML_20200522_162507 0.8347736 0.4323694
## 20 DeepLearning_grid__1_AutoML_20200522_162507_model_1 0.8342938 0.4428556
## 21 XGBoost_1_AutoML_20200522_162507 0.8322641 0.4379713
## 22 XRT_1_AutoML_20200522_162507 0.8320064 0.4336864
## 23 GBM_4_AutoML_20200522_162507 0.8300310 0.4381179
## 25 DRF_1_AutoML_20200522_162507 0.8264530 0.4591243
## 28 XGBoost_3_AutoML_20200522_162507 0.8055172 0.5889526
## aucpr mean_per_class_error rmse mse
## 1 0.6638599 0.2394998 0.3664130 0.1342585
## 2 0.6626032 0.2354303 0.3686021 0.1358675
## 3 0.6620552 0.2397578 0.3686256 0.1358848
## 4 0.6596606 0.2310877 0.3684499 0.1357553
## 5 0.6553794 0.2333326 0.3690750 0.1362164
## 6 0.6551156 0.2337379 0.3691187 0.1362486
## 7 0.6533386 0.2368490 0.3693051 0.1363862
## 8 0.6548040 0.2380673 0.3696621 0.1366501
## 9 0.6532578 0.2292481 0.3701250 0.1369925
## 10 0.6559734 0.2423391 0.3698225 0.1367687
## 11 0.6493833 0.2377241 0.3716807 0.1381466
## 12 0.6464204 0.2426016 0.3722240 0.1385507
## 13 0.6445449 0.2426781 0.3727629 0.1389521
## 14 0.6442401 0.2343556 0.3736331 0.1396017
## 15 0.6507668 0.2342434 0.3731149 0.1392147
## 16 0.6424777 0.2398738 0.3732130 0.1392879
## 17 0.6361131 0.2389562 0.3752550 0.1408163
## 18 0.6460849 0.2435787 0.3736880 0.1396427
## 19 0.6406430 0.2489392 0.3751698 0.1407524
## 20 0.6322726 0.2404733 0.3778280 0.1427540
## 21 0.6345590 0.2492198 0.3780416 0.1429154
## 22 0.6382730 0.2471585 0.3766871 0.1418931
## 23 0.6385071 0.2560873 0.3775092 0.1425132
## 24 0.6180467 0.2492756 0.3811432 0.1452701
## 25 0.6355137 0.2545207 0.3785703 0.1433155

## 26 0.6006138 0.2394053 0.3860855 0.1490620
## 27 0.5958053 0.2588096 0.3944704 0.1556069
## 28 0.5748241 0.2596577 0.4459975 0.1989138
## 29 0.5801328 0.2677071 0.4027106 0.1621758
## 30 0.5838910 0.2698538 0.4092412 0.1674783
## 31 0.5201450 0.2901355 0.4329406 0.1874376
##
## [31 rows x 7 columns]
Podemos crear una visualización para que nos ayude a elegir
as.data.frame(automl_simple@leaderboard) %>%
select(model_id,auc) %>%
ggplot(aes(x = auc, y = reorder(model_id,auc))) +
geom_point() +
geom_label(aes(label = round(auc,3),label.size=0.9,color = auc),hjust = 'left') +
expand_limits(x = c(0.936,0.947)) + labs(x = "AUC",y = "Modelo Ganador Automl")
## Warning: Ignoring unknown aesthetics: label.size
theme_bw()
## List of 65
## $ line :List of 6
## ..$ colour : chr "black"
## ..$ size : num 0.5
## ..$ linetype : num 1
## ..$ lineend : chr "butt"
## ..$ arrow : logi FALSE
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_line" "element"
## $ rect :List of 5
## ..$ fill : chr "white"
## ..$ size : num 0.5
## ..$ linetype : num 1
## ..- attr(*, "class")= chr [1:2] "element_rect" "element"
## $ text :List of 11
## ..$ family : chr ""
## ..$ face : chr "plain"
## ..$ size : num 11
## ..$ hjust : num 0.5
## ..$ vjust : num 0.5
## ..$ angle : num 0
## ..$ lineheight : num 0.9
## ..$ margin : 'margin' num [1:4] 0pt 0pt 0pt 0pt
## .. ..- attr(*, "valid.unit")= int 8
## .. ..- attr(*, "unit")= chr "pt"
## ..$ debug : logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.title.x :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : num 1
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : 'margin' num [1:4] 2.75pt 0pt 0pt 0pt
## .. ..- attr(*, "unit")= chr "pt"
## ..$ debug : NULL
## $ axis.title.x.top :List of 11
## ..$ face : NULL
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ angle : NULL
## ..$ margin : 'margin' num [1:4] 0pt 0pt 2.75pt 0pt
## .. ..- attr(*, "unit")= chr "pt"

## ..$ debug : NULL
## $ axis.title.y :List of 11
## ..$ face : NULL
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ angle : num 90
## ..$ margin : 'margin' num [1:4] 0pt 2.75pt 0pt 0pt
## .. ..- attr(*, "unit")= chr "pt"
## ..$ debug : NULL
## $ axis.title.y.right :List of 11
## ..$ face : NULL
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ angle : num -90
## ..$ margin : 'margin' num [1:4] 0pt 0pt 0pt 2.75pt
## .. ..- attr(*, "unit")= chr "pt"
## ..$ debug : NULL
## $ axis.text :List of 11
## ..$ face : NULL
## ..$ colour : chr "grey30"
## ..$ size : 'rel' num 0.8
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## $ axis.text.x :List of 11
## ..$ face : NULL
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ angle : NULL
## .. ..- attr(*, "unit")= chr "pt"

## ..$ debug : NULL
## $ axis.text.x.top :List of 11
## ..$ face : NULL
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ angle : NULL
## .. ..- attr(*, "unit")= chr "pt"
## ..$ debug : NULL
## $ axis.text.y :List of 11
## ..$ face : NULL
## ..$ size : NULL
## ..$ hjust : num 1
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ margin : 'margin' num [1:4] 0pt 2.2pt 0pt 0pt
## .. ..- attr(*, "unit")= chr "pt"
## ..$ debug : NULL
## $ axis.text.y.right :List of 11
## ..$ face : NULL
## ..$ size : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ margin : 'margin' num [1:4] 0pt 0pt 0pt 2.2pt
## .. ..- attr(*, "unit")= chr "pt"
## ..$ debug : NULL
## $ axis.ticks :List of 6
## ..$ size : NULL
## ..$ linetype : NULL
## ..$ lineend : NULL
## $ axis.ticks.length : 'unit' num 2.75pt
## ..- attr(*, "valid.unit")= int 8

## ..- attr(*, "unit")= chr "pt"
## $ axis.ticks.length.x : NULL
## $ axis.ticks.length.x.top : NULL
## $ axis.ticks.length.x.bottom: NULL
## $ axis.ticks.length.y : NULL
## $ axis.ticks.length.y.left : NULL
## $ axis.ticks.length.y.right : NULL
## $ axis.line : list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## $ axis.line.x : NULL
## $ axis.line.y : NULL
## $ legend.background :List of 5
## ..$ fill : NULL
## ..$ colour : logi NA
## ..$ size : NULL
## $ legend.margin : 'margin' num [1:4] 5.5pt 5.5pt 5.5pt 5.5pt
## $ legend.spacing : 'unit' num 11pt
## $ legend.spacing.x : NULL
## $ legend.spacing.y : NULL
## $ legend.key :List of 5
## ..$ size : NULL
## $ legend.key.size : 'unit' num 1.2lines
## ..- attr(*, "unit")= chr "lines"
## $ legend.key.height : NULL
## $ legend.key.width : NULL
## $ legend.text :List of 11
## ..$ face : NULL
## ..$ size : 'rel' num 0.8
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ debug : NULL
## $ legend.text.align : NULL
## $ legend.title :List of 11
## ..$ face : NULL
## ..$ size : NULL

## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ debug : NULL
## $ legend.title.align : NULL
## $ legend.position : chr "right"
## $ legend.direction : NULL
## $ legend.justification : chr "center"
## $ legend.box : NULL
## $ legend.box.margin : 'margin' num [1:4] 0cm 0cm 0cm 0cm
## ..- attr(*, "unit")= chr "cm"
## $ legend.box.background : list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## $ legend.box.spacing : 'unit' num 11pt
## $ panel.background :List of 5
## ..$ size : NULL
## $ panel.border :List of 5
## ..$ fill : logi NA
## ..$ size : NULL
## $ panel.spacing : 'unit' num 5.5pt
## $ panel.spacing.x : NULL
## $ panel.spacing.y : NULL
## $ panel.grid :List of 6
## ..$ size : NULL
## $ panel.grid.minor :List of 6
## ..$ size : 'rel' num 0.5
## $ panel.ontop : logi FALSE
## $ plot.background :List of 5
## ..$ fill : NULL

## ..$ colour : chr "white"
## ..$ size : NULL
## $ plot.title :List of 11
## ..$ face : NULL
## ..$ size : 'rel' num 1.2
## ..$ angle : NULL
## .. ..- attr(*, "unit")= chr "pt"
## ..$ debug : NULL
## $ plot.subtitle :List of 11
## ..$ face : NULL
## ..$ size : NULL
## ..$ angle : NULL
## .. ..- attr(*, "unit")= chr "pt"
## ..$ debug : NULL
## $ plot.caption :List of 11
## ..$ face : NULL
## ..$ size : 'rel' num 0.8
## ..$ angle : NULL
## .. ..- attr(*, "unit")= chr "pt"
## ..$ debug : NULL
## $ plot.tag :List of 11
## ..$ face : NULL
## ..$ size : 'rel' num 1.2
## ..$ hjust : num 0.5
## ..$ vjust : num 0.5
## ..$ angle : NULL

## ..$ debug : NULL
## $ plot.tag.position : chr "topleft"
## $ plot.margin : 'margin' num [1:4] 5.5pt 5.5pt 5.5pt 5.5pt
## $ strip.background :List of 5
## ..$ fill : chr "grey85"
## ..$ size : NULL
## $ strip.placement : chr "inside"
## $ strip.text :List of 11
## ..$ face : NULL
## ..$ size : 'rel' num 0.8
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ margin : 'margin' num [1:4] 4.4pt 4.4pt 4.4pt 4.4pt
## .. ..- attr(*, "unit")= chr "pt"
## ..$ debug : NULL
## $ strip.text.x : NULL
## $ strip.text.y :List of 11
## ..$ face : NULL
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : num -90
## ..$ debug : NULL
## $ strip.switch.pad.grid : 'unit' num 2.75pt
## $ strip.switch.pad.wrap : 'unit' num 2.75pt
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi TRUE
## - attr(*, "validate")= logi TRUE
Extraemos el mejor modelo que es el GLM
automl_simple_winner <- automl_simple@leader
Y lo evaluamos
automl_simple_winner@model$validation_metrics
## H2OBinomialMetrics: glm
##
## MSE: 0.1343104
## RMSE: 0.3664839
## LogLoss: 0.4135095
## AUC: 0.8396429
## AUCPR: 0.6442732
## Gini: 0.6792858
## R^2: 0.29016
## Residual Deviance: 1439.013
## AIC: 1517.013
##
## 0 1 Error Rate
## 0 1069 230 0.177059 =230/1299
## 1 138 303 0.312925 =138/441
## Totals 1207 533 0.211494 =368/1740
##
## 1 max f1 0.371949 0.622177 170
## 2 max f2 0.104322 0.728679 308
## 3 max f0point5 0.469174 0.616032 123
## 4 max accuracy 0.469174 0.805747 123
## 5 max precision 0.917766 1.000000 0
## 6 max recall 0.011659 1.000000 389
## 7 max specificity 0.917766 1.000000 0
## 8 max absolute_mcc 0.415660 0.486298 146
## 11 max tns 0.917766 1299.000000 0
## 12 max fns 0.917766 439.000000 0
## 13 max fps 0.003412 1299.000000 399
## 14 max tps 0.011659 441.000000 389
## 15 max tnr 0.917766 1.000000 0
## 16 max fnr 0.917766 0.995465 0
## 17 max fpr 0.003412 1.000000 399
## 18 max tpr 0.011659 1.000000 389
##
h2o.auc(automl_simple_winner@model$validation_metrics)
## [1] 0.8396429
El AUC es de 83.96429, un valor elevado y similar al de la regresión logística
Podemos comprobar con qué parámetros se ha generado el modelo
automl_simple_winner@allparameters
## $model_id
## [1] "GLM_1_AutoML_20200522_162507"
##
## $training_frame
## [1] "automl_training_RTMP_sid_9ee3_3"
##
## $validation_frame
## [1] "RTMP_sid_9ee3_5"
##
## $nfolds
## [1] 3
##
## $seed
## [1] "4955756915282139044"
##
## $keep_cross_validation_models
## [1] FALSE
##
## $keep_cross_validation_predictions
## [1] TRUE
##
## $keep_cross_validation_fold_assignment
## [1] FALSE
##
## $fold_assignment
## [1] "Modulo"
##
## $ignore_const_cols
## [1] TRUE
##
## $score_each_iteration
## [1] FALSE
##
## $family
## [1] "binomial"
##
## $tweedie_variance_power
## [1] 0
##
## $tweedie_link_power
## [1] 1
##
## $theta
## [1] 0.0000000001
##
## $solver
## [1] "COORDINATE_DESCENT"
##
## $alpha
## [1] 0.0 0.2 0.4 0.6 0.8 1.0
##
## $lambda
## [1] 8.95655041253 5.56216954709 3.45420152241 2.14511766613 1.33215441302
## [6] 0.82729045970 0.51376139134 0.31905452811 0.19813826734 0.12304722086
## [11] 0.07641440881 0.04745464247 0.02947013695 0.01830145432 0.01136551320
## [16] 0.00705817625 0.00438324703 0.00272207067 0.00169045200 0.00104979933
## [21] 0.00065194317 0.00040486775 0.00025142973 0.00015614212 0.00009696690
## [26] 0.00006021809
##
## $lambda_search
## [1] TRUE
##
## $early_stopping
## [1] TRUE
##
## $nlambdas
## [1] 30
##
## $standardize
## [1] TRUE
##
## $missing_values_handling
## [1] "MeanImputation"
##
## $compute_p_values
## [1] FALSE
##
## $remove_collinear_columns
## [1] FALSE
##
## $intercept
## [1] TRUE
##
## $non_negative
## [1] FALSE
##
## $max_iterations
## [1] 300
##
## $objective_epsilon
## [1] 0.0001
##
## $beta_epsilon
## [1] 0.0001
##
## $gradient_epsilon
## [1] 0.000001
##
## $link
## [1] "logit"
##
## $calc_like
## [1] FALSE
##
## $HGLM
## [1] FALSE
##
## $prior
## [1] -1
##
## $lambda_min_ratio
## [1] 0.000001
##
## $max_active_predictors
## [1] 5000
##
## $obj_reg
## [1] 0.0001885725
##
## $balance_classes
## [1] FALSE
##
## $max_after_balance_size
## [1] 5
##
## $max_confusion_matrix_size
## [1] 20
##
## $max_hit_ratio_k
## [1] 0
##
## $max_runtime_secs
## [1] 0
##
## $x
## [1] "Tenure_DISC" "MonthlyCharges_DISC" "PaymentMethod"
## [4] "Contract" "InternetService" "OnlineSecurity"
## [7] "TechSupport" "SeniorCitizen" "Partner"
## [10] "OnlineBackup" "Dependents" "PaperlessBilling"
##
## $y
## [1] "Churn"
Y podemos ver las variables que están resultando más importantes
h2o.varimp(automl_simple@leader)
## variable relative_importance
## 1 Tenure_DISC.01 <= 1 1.700969129
## 2 Tenure_DISC.08 > 70 1.563755883
## 3 InternetService.No 1.074071061
## 4 Tenure_DISC.02 <= 5 0.852526605
## 5 InternetService.Fiber optic 0.832620959
## 6 Tenure_DISC.07 <= 70 0.791501079
## 7 Contract.Month-to-month 0.741176041
## 8 Contract.Two year 0.719720396
## 9 MonthlyCharges_DISC.07_MAYOR_100 0.609053016
## 10 Tenure_DISC.06 <= 59 0.453122949
## 11 Tenure_DISC.03 <= 16 0.367075026
## 12 Tenure_DISC.05 <= 49 0.311786435
## 13 MonthlyCharges_DISC.03_DE_40_A_60 0.289515780
## 14 PaymentMethod.Electronic check 0.271825049
## 15 MonthlyCharges_DISC.01_MENOR_20 0.265946543
## 17 PaperlessBilling.No 0.238327312
## 18 PaperlessBilling.Yes 0.227138515
## 19 TechSupport.Yes 0.210028992
## 20 OnlineSecurity.Yes 0.204212929
## 21 TechSupport.No 0.182363289
## 22 PaymentMethod.Mailed check 0.181130586
## 23 OnlineSecurity.No 0.160424645
## 24 InternetService.DSL 0.148546487
## 25 SeniorCitizen.0 0.130674281
## 26 SeniorCitizen.1 0.122001297
## 28 OnlineBackup.No 0.105799666
## 29 Dependents.Yes 0.104187322
## 30 OnlineBackup.Yes 0.100511934
## 31 Dependents.No 0.097864806
## 32 PaymentMethod.Bank transfer (automatic) 0.090413734
## 33 PaymentMethod.Credit card (automatic) 0.074054005
## 34 Tenure_DISC.04 <= 22 0.064530411
## 36 Partner.Yes 0.024560960
## 37 Partner.No 0.020222667
## 38 Contract.One year 0.001422562
## scaled_importance percentage
## 1 1.0000000000 0.1237662960
## 2 0.9193323128 0.1137823551
## 3 0.6314465338 0.0781517986
## 4 0.5012005160 0.0620317314
## 5 0.4894979837 0.0605833523
## 6 0.4653236001 0.0575913784
## 7 0.4357375029 0.0539296168
## 8 0.4231237260 0.0523684563
## 9 0.3580623573 0.0443160517
## 10 0.2663910483 0.0329702333
## 11 0.2158034614 0.0267091951
## 12 0.1832992907 0.0226862743
## 13 0.1702063696 0.0210658119
## 14 0.1598059863 0.0197785950
## 15 0.1563500116 0.0193508618
## 16 0.1480303308 0.0183211657
## 17 0.1401126616 0.0173412251
## 18 0.1335347665 0.0165271034
## 19 0.1234760752 0.0152821765
## 20 0.1200568109 0.0148589868
## 21 0.1072114043 0.0132691584
## 22 0.1064866982 0.0131794642
## 23 0.0943136722 0.0116728539
## 24 0.0873305017 0.0108085727
## 25 0.0768234292 0.0095081513
## 26 0.0717245805 0.0088770857
## 27 0.0655589363 0.0081139867
## 28 0.0621996393 0.0076982190
## 29 0.0612517421 0.0075809012
## 30 0.0590909809 0.0073134718
## 31 0.0575347336 0.0071208609
## 32 0.0531542474 0.0065787043
## 33 0.0435363606 0.0053883341
## 34 0.0379374381 0.0046953762
## 35 0.0276337659 0.0034201288
## 36 0.0144393920 0.0017871101
## 37 0.0118889091 0.0014714462
## 38 0.0008363244 0.0001035088
h2o.varimp_plot(automl_simple@leader)
Como hemos visto previamente en otros modelos las variables más importantes son:
La antigüedad
La presencia o no de servicio de internet
El tipo de contrato
matrizconfusionoautoml <- h2o.confusionMatrix(automl_simple_winner)

matrizconfusionoautoml
18948918:
## 0 1 Error Rate
## 0 3102 773 0.199484 =773/3875
## 1 370 1058 0.259104 =370/1428
## Totals 3472 1831 0.215538 =1143/5303
automlwinner_metricas<-c(NA, acierto <- (matrizconfusionoautoml[1,1] + matrizconfusionoautoml

[2,2]) / (matrizconfusionoautoml[1,1] + matrizconfusionoautoml[1,2] + matrizconfusionoautoml[
2,1] + matrizconfusionoautoml[2,2]) *100,
precision <- matrizconfusionoautoml[2,2] / (matrizconfusionoautoml[2,1] + matrizconfusionoa
utoml[2,2]) *100,
cobertura <- matrizconfusionoautoml[2,2] / (matrizconfusionoautoml[2,2] + matrizconfusionoa
utoml[1,2]) *100,
h2o.auc(automl_simple_winner@model$validation_metrics)*100)
names (automlwinner_metricas) = c("umbral","acierto","precision","cobertura","F1","AUC")
automlwinner_metricas

## NA 78.44616 74.08964 57.78263 64.92789 83.96429
Obtenemos valores similares en la matriz de confusión a los de la regresión logísitica: un buen nivel
de acierto y de precisión pero con una cobertura algo baja
5.5. Evaluación de todos los modelos generados con AutoMl

Podemos comparar todos los modelos generados con AutoMl
ComparativaModelos2 <- rbind(h2orl_metricas,h2orf_metricas,h2orfgridwinner_metricas,h2orfrand

winner_metricas,automlwinner_metricas)
rownames(ComparativaModelos2) <- c('Regresion Logistica h2o','Random Forest','Grid Search','R
andom Search','Automl')
t(ComparativaModelos2)
## Regresion Logistica h2o Random Forest Grid Search Random Search

## umbral NA NA NA NA
## acierto 78.63474 71.11069 74.35414 77.40901
## precision 73.38936 80.74230 79.41176 72.51908
## cobertura 58.18989 47.84232 51.54545 56.57823
## F1 64.91174 60.08338 62.51378 63.56448
## AUC 83.98253 81.59547 82.60986 85.49033
## Automl
## umbral NA
## acierto 78.44616
## F1 64.92789
## AUC 83.96429
Hay varios modelos con elevados niveles de AUC, sobre todo el de Regresión Logística y el de Automl
que además son los que obtienen la mayor cobertura que era la variable a dar prioridad desde un punto
de vista de negocio.
Por su simplicidad optaremos por el modelo de regresión logística
6. Evaluación final de todos los modelos

Calculamos el scoring con el modelo de Automl en h2o y ya podemos compararlo con el scoring obtenido con
el modelo manual ganador, el de regresión logística
SCORING_RL_H2O <- as.data.frame(h2o.predict(rl,df_h2o)[3])
##
|
| | 0%
|
|======================================================================| 100%
df$SCORING_RL_H2O <- as.numeric(SCORING_RL_H2O$p1)

df = rename(df, c(SCORING_RL_MANUAL="SCORING_CHURN"))
glimpse(df)
## Variables: 16
## $ Contract <fct> Month-to-month, One year, Month-to-month, One yea…
## $ MonthlyCharges_DISC <fct> 02_DE_20_A_40, 03_DE_40_A_60, 03_DE_40_A_60, 03_D…
## $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fib…
## $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Ban…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes…
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes,…
## $ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No,…
## $ SeniorCitizen <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No…
## $ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, N…
## $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No…
## $ customerID <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-C…
## $ Churn <fct> 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0…
## $ Tenure_DISC <fct> 01 <= 1, 05 <= 49, 02 <= 5, 05 <= 49, 02 <= 5, 03…
## $ SCORING_RL_MANUAL <dbl> 0.67580519, 0.04855858, 0.35841520, 0.04855858, 0…
## $ SCORING_RL_H2O <dbl> 0.76646574, 0.04607989, 0.30235211, 0.03476043, 0…
Vemos la correlación entre ambos scoring, el obtenido ahora y el que sacamos con el modelo de regresión
logística manualmente
cor(df$SCORING_RL_MANUAL,df$SCORING_RL_H2O)
## [1] 0.9703489
La correlación de ambos scoring es de un 97.03489%. Eso implica que los scoring obtenidos en ambos
modelos son similares y nos da confianza en la calidad de los datos obtenidos por ambos modelos.
Podemos verlo visualmente
ggplot(df, aes(SCORING_RL_MANUAL, SCORING_RL_H2O)) + geom_point(color = 'orange') + xlab("Sco

ring Regresion Logística Manual") + ylab("Scoring Regresion Logística h2o") + ggtitle('Compar
ación Scoring Regresión Lineal Manual vs Regresion Logística h2o') + geom_smooth(method=lm)
Y comparamos todos los modelos para hacer la selección final
ComparativaModelosFinal <- rbind(ComparativaModelos1,ComparativaModelos2)

t(ComparativaModelosFinal)
## Regresion Logistica Manual Arbol Decision Random Forest Bayes

## umbral 0.30000 0.35000 0.25000 NA
## acierto 75.40438 79.63844 79.01998 79.30542
## precision 52.60047 60.88380 59.64630 59.23754
## cobertura 79.32264 66.31016 66.13191 72.01426
## F1 63.25515 63.48123 62.72189 65.00402
## AUC 84.00000 82.00000 83.00000 77.00000
## Regresion Logistica h2o Random Forest1 Grid Search Random Search
## umbral NA NA NA NA
## acierto 78.63474 71.11069 74.35414 77.40901
## precision 73.38936 80.74230 79.41176 72.51908
## cobertura 58.18989 47.84232 51.54545 56.57823
## F1 64.91174 60.08338 62.51378 63.56448
## AUC 83.98253 81.59547 82.60986 85.49033
## Automl
## umbral NA
## acierto 78.44616
## F1 64.92789
## AUC 83.96429
Hay varios modelos con un elevado AUC. Sin embargo, teniendo en cuenta las necesidades de negocio
vamos a seleccionar el modelo de regresión logística manual ya que es el que alcanza una mayor
cobertura
Eliminamos por tanto el Scoring generado en la fase de h2o, guardamos el modelo con el Scoring obtenido
con el primero modelo y la matriz con el resumen del los resultados de todos los modelos.
Ese scoring es el que vamos a emplear para definir el Churn de los cientes de la empresa de
telecomunicaciones y poder hacer acciones comerciales si fuera necesario
ls()
## [1] "a_mantener" "acierto"

## [3] "ar" "ar_metricas"
## [5] "ar_numnodos" "ar_predict"
## [7] "ar_prediction" "auc"
## [9] "automl_simple" "automl_simple_winner"
## [11] "automlwinner_metricas" "Bayes"
## [13] "Bayes_metricas" "Bayes_predict"
## [15] "Bayes_prediction" "BayesMatrizConfusion"
## [17] "cobertura" "ComparativaModelos1"
## [19] "ComparativaModelos2" "ComparativaModelosFinal"
## [21] "confusion" "df"
## [23] "df_h2o" "F1"
## [25] "formula" "formula_ar"
## [27] "formula_rf" "formula_rl"
## [29] "h2orf_metricas" "h2orfgridwinner_metricas"
## [31] "h2orfrandwinner_metricas" "h2orl_metricas"
## [33] "importancia" "importancia_norm"
## [35] "independientes" "matrizconfusionh2orf"
## [37] "matrizconfusionh2orfgridwinner" "matrizconfusionh2orfrandwinner"
## [39] "matrizconfusionh2orl" "matrizconfusionoautoml"
## [41] "metricas" "pr2_rl"
## [43] "precision" "rf"
## [45] "rf_grid" "rf_grid_rand"
## [47] "rf_grid_results" "rf_grid_results_rand"
## [49] "rf_gridwinner" "rf_metricas"
## [51] "rf_params" "rf_predict"
## [53] "rf_prediction" "rf_randwinner"
## [55] "rl" "rl_metricas"
## [57] "rl_predict" "rl_prediction"
## [59] "roc" "SCORING_RL_H2O"
## [61] "search_params" "split"
## [63] "target" "test"
## [65] "test_rlpredict" "train"
## [67] "train_Bayes" "train_h2o"
## [69] "umb_ar" "umb_rf"
## [71] "umb_rl" "umbral_final_ar"
## [73] "umbral_final_rf" "umbral_final_rl"
## [75] "umbrales" "valid_h2o"
## [77] "x" "y"
rm(list=setdiff(ls(),c('df','ComparativaModelosFinal')))
df <- select(df, -SCORING_RL_H2O)
saveRDS(df,'Modelo_Churn_final.rds')
```

Proyecto de Cálculo de Abandono (Churn) en Una Compañía de Telecomunicaciones PDF

Cargado por

Información del documento

Título original

Derechos de autor

Formatos disponibles

Compartir este documento

Compartir o incrustar documentos

Opciones para compartir

¿Le pareció útil este documento?

¿Este contenido es inapropiado?

Copyright:

Formatos disponibles

Proyecto de Cálculo de Abandono (Churn) en Una Compañía de Telecomunicaciones PDF

Cargado por

Copyright:

Formatos disponibles

24/12/2020 Proyecto de Cálculo de Abandono (Churn) en una compañía de telecomunicaciones

Proyecto de Cálculo de Abandono

1. Importación y muestreo de los datos

Y los paquetes que tenemos que cargar en R

#lista de paquetes que vamos a usar

## Loading required package: data.table

## Loading required package: dplyr

## The following objects are masked from 'package:data.table':

## The following objects are masked from 'package:stats':

## The following objects are masked from 'package:base':

## Loading required package: tidyr

## Loading required package: ggplot2

## Loading required package: randomForest

## Type rfNews() to see new features/changes/bug fixes.

## The following object is masked from 'package:ggplot2':

## The following object is masked from 'package:dplyr':

## Loading required package: ROCR

## Loading required package: gplots

## The following object is masked from 'package:stats':

## Loading required package: purrr

## The following object is masked from 'package:data.table':

## Loading required package: smbinning

## Loading required package: sqldf

## Loading required package: gsubfn

## Loading required package: proto

## Warning in system2("/usr/bin/otool", c("-L", shQuote(DSO)), stdout = TRUE):

## Loading required package: RSQLite

## Loading required package: partykit

## Loading required package: grid

## Loading required package: libcoin

## Loading required package: mvtnorm

## Loading required package: Formula

## Loading required package: rpart

## Loading required package: rpart.plot

## Loading required package: e1071

## Loading required package: caret

## Loading required package: lattice

## The following object is masked from 'package:purrr':

## Loading required package: h2o

## The following objects are masked from 'package:data.table':

## The following objects are masked from 'package:stats':

## The following objects are masked from 'package:base':

## Classes 'data.table' and 'data.frame': 7043 obs. of 21 variables:

a_factores <- c('gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'Multiple

2.1. Estadísticos básicos

## Length Class Mode

2.2. Nulos y ceros

Sólo hay 11 nulos en Total Charges

contar_ceros <- function(variable) {

num_ceros <- sapply(df,contar_ceros)

## VARIABLE CEROS PORCENTAJE

2.3. Análisis longitudinal de las variables numéricas

boxplot(df$TotalCharges,border=9, col=3, main="Distribución de la Facturación Total de los Cl

boxplot(df$MonthlyCharges,border=9, col=4, main="Distribución de la Facturación Mensual de lo

boxplot(df$tenure,border=9, col=8, main="Distribución de la Antigüedad de los Clientes en mes

out <- function(variable){

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## tenure TotalCharges MonthlyCharges StreamingTV StreamingMovies

Vemos gráficamente como se representa la distribución de la antigüedad

hist(df$tenure,main="Distribución de la Antigüedad de los clientes en la compañía",cex.main=1

2.4. Coherencia entre variables