Dr. Patricio Adrian Zapata Morin

2022
Contents
1 Stage 1
1.1 Installing packages
1.2 First steps
1.3 Variables in R
1.4 Working with variables in R
1.5 True-False in R
1.6 The while loop in R
1.7 The for loop in R
1.8 The if-else loop in R
1.9 Exercise: loops in R
1.10 Assignment 1 in R
1.11 Vectors in R
1.12 Operations with vectors in R
1.13 Exercise: sequence aligner in R
1.14 Functions in R
1.14.1 Packages in R
1.15 Exercise: accounting in R

2 Stage 2
2.1 Matrix manipulation in R
2.2 Exercises 1 and 2: matrix manipulation in R
2.3 Exercise: matrix manipulation in R
2.3.1 Subsets
2.4 Matplot in R
2.5 Functions in R
2.5.1 Using functions
2.6 Working with data frames in R, vol. 1
2.6.1 Exploring the data
2.6.2 Basic operations with data frames
2.7 Working with data frames in R, vol. 2
2.8 Visualizing data frames in R
2.8.1 Visualizing only what we need
2.8.2 Enriching data frames in R
2.8.3 Enriching data frames
2.8.4 Visualizing with a new split
2.9 Exercise: working with data frames in R
2.9.1 Filtering information from data frames, T - F
2.10 Plotting in R
2.10.1 Appearance
2.10.2 Plotting in layers
2.10.3 Overriding the plot aesthetics
2.10.4 Mapping vs. setting
2.10.5 Histograms and density plots
2.11 Exercise: plotting in R
2.12 Exercise: structuring data in R
2.13 Exercise 2: structuring data in R
2.14 Introduction to data cleaning in R
2.15 Follow-up on data cleaning in R
2.15.1 Replacing missing information: fact-based analysis

3 Stage 3
3.1 Exercise: data handling in R
3.2 Exercise: data transformation in R
3.3 Graphical visualization in R: VGchartz results
1 Stage 1
1.1 Installing packages
Packages used in the Data Science course
Block 2 packages: main package, ggplot2
    install.packages("ggplot2")
Packages that ggplot2 requires in order to work
    install.packages("assertthat")
    install.packages("bio3d")
    install.packages("cli")
    install.packages("colorspace")
    install.packages("fansi")
    install.packages("glue")
    install.packages("gtable")
    install.packages("labeling")
    install.packages("lazyeval")
    install.packages("munsell")
    install.packages("pillar")
    install.packages("plyr")
    install.packages("RColorBrewer")
    install.packages("reshape2")
    install.packages("scales")
    install.packages("stringi")
    install.packages("stringr")
    install.packages("tibble")
    install.packages("utf8")
    install.packages("viridisLite")
    install.packages("withr")
Block 3 packages: main packages, ggplot2 and gridExtra
    install.packages("ggplot2")
    install.packages("gridExtra")
Block 4 packages: main packages, ggplot2 and stringr
    install.packages("ggplot2")
    install.packages("stringr")
Block 5 packages: main packages, ggplot2 and shiny
    install.packages("ggplot2")
    install.packages("shiny")
Packages that shiny requires in order to work
    install.packages("BH")
    install.packages("crayon")
    install.packages("digest")
    install.packages("htmltools")
    install.packages("httpuv")
    install.packages("jsonlite")
    install.packages("later")
    install.packages("magrittr")
    install.packages("mime")
    install.packages("promises")
    install.packages("R6")
    install.packages("Rcpp")
    install.packages("rlang")
    install.packages("sourcetools")
    install.packages("xtable")
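Installing every package one by one works, but it re-downloads packages that are already present. The following is a minimal sketch, not part of the original course material, that checks which of the main packages are still missing before installing anything. The helper name `paquetes_faltantes` is hypothetical; `installed.packages()` and `setdiff()` are base R functions. By default `install.packages()` also installs the dependencies a package requires.

```r
# Hypothetical helper: returns the packages from `pkgs` that are not installed yet.
paquetes_faltantes <- function(pkgs) {
  instalados <- rownames(installed.packages())
  setdiff(pkgs, instalados)
}

# Main packages named in the course blocks.
principales <- c("ggplot2", "gridExtra", "stringr", "shiny")
faltantes <- paquetes_faltantes(principales)
print(faltantes)

# To install only what is missing, uncomment the next line:
# if (length(faltantes) > 0) install.packages(faltantes)
```

The install call is left commented out so that the sketch can be run without network access.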

1.2 First steps
Unlike in many other programming languages, in R you only need to put what you want printed to the terminal in quotation marks (" "). R is an intuitive programming language, and it becomes a highly efficient tool that is simple to use once you master the basics.
    "Hello world!"

Hello world!

This is equivalent to calling "print", as you would in most other programming languages
    print("Hello world!")

Hello world!

    "Hello world!"
    "Hello world!"
    "Hello world!"

Hello world! Hello world! Hello world!
1.3 Variables in R
Integer
    x <- 2L
    typeof(x) # typeof reports the type of the variable

integer

Double
    y <- 2.5
    typeof(y)

double

Complex
    z <- 3+2i
    typeof(z)

complex

Character
    a <- "h"
    b <- "2"
    typeof(a)

character

    typeof(b)

character

Logical, T = true (logical T = TRUE)
    q1 <- T
    typeof(q1)

logical

Logical, F = false (logical F = FALSE)
    q2 <- F
    typeof(q2)

logical

T and F are shorthand for TRUE and FALSE, the two values of the logical type, also known as Boolean
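One detail worth knowing about the shorthand: TRUE and FALSE are reserved words, while T and F are ordinary variables that simply start out holding those values. The sketch below, which is an illustration added here and not part of the original exercises, shows what happens if a script accidentally reuses the name T. The variable names are hypothetical.

```r
# TRUE and FALSE are reserved words; T and F are ordinary variables
# that merely start out bound to TRUE and FALSE.
q1 <- TRUE
q2 <- FALSE
tipos <- c(typeof(q1), typeof(q2))

# Hypothetical accident: a script reuses the name T for something else.
T <- 0
t_sigue_siendo_verdadero <- isTRUE(T)  # T now holds 0, so this is FALSE

# Removing our variable makes the built-in value visible again.
rm(T)
t_restaurado <- isTRUE(T)
```

For this reason many style guides suggest writing TRUE and FALSE in full in scripts.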

1.4 Working with variables in R
A variable can hold any number, letter, array, function, and so on. It is essentially a space in the computer's memory where you store one or more values. The arrow <- indicates that a value is being stored in the variable ("=" can also be used)
    A <- 5
    B <- 5
    A = 10
    B = 10

    C <- A + B
    C

20

A variable name can combine letters, numbers, dots and underscores, as long as it does not start with a number. We strongly recommend names that are simple and intuitive enough that you do not lose track of the operations and functions you will run later on.

Variable 1
    var1 <- 2.5

"typeof" is a function, essentially an "algorithm" that lets us work with variables in a certain way. Depending on the instruction, it returns a particular result.
    typeof(var1)

double

Variable 2
    var2 <- 4

What matters is not what we are doing right now, but what becomes possible as we keep defining variables
    resultado <- var1 / var2
    resultado

0.625
sqrt() is a function that, as its English abbreviation (square root) suggests, calculates the square root of whatever we place inside the "()"
    resp <- sqrt(var2)
    resp

2
    saludo <- "Hola"
    nombre <- "Bob"
    Mensaje <- paste(saludo, nombre)
    # "paste" is a function that joins two or more elements into a single string
    ?paste
    Mensaje

Hola Bob

By default "paste" inserts a space between the elements, but changing the "sep" argument changes the result
    Mensaje2 <- paste(saludo, nombre, sep = "")
    Mensaje2

HolaBob
    Mensaje3 <- paste(saludo, nombre, sep = "@")
    Mensaje3

Hola@Bob

R is a very versatile language. Its design lets you write code intuitively, without a strong prior background in programming.
    Mensaje4 <- paste(paste(saludo, nombre, sep = ""), "gmail.com", sep = "@")
    Mensaje4

HolaBob@gmail.com
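As a complement to the nested call, base R also offers `paste0()`, which behaves like `paste()` with `sep = ""`. The short sketch below is an added illustration (the variable `correo` is a hypothetical name) that builds the same address in a single call and checks that both approaches agree.

```r
saludo <- "Hola"
nombre <- "Bob"

# Nested paste(), as in the text.
anidado <- paste(paste(saludo, nombre, sep = ""), "gmail.com", sep = "@")

# paste0() joins its arguments with no separator at all.
correo <- paste0(saludo, nombre, "@", "gmail.com")

print(correo)
print(identical(anidado, correo))
```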

1.5 True-False in R
Logical arguments are the backbone of many of the functions we will use throughout the course routines.

Boolean = logical:
TRUE (abbreviated T)
FALSE (abbreviated F)

    4 < 5

TRUE

    10 > 100

FALSE

    4 == 5

FALSE

== equal to
!= not equal to
< less than
> greater than
<= less than or equal to
>= greater than or equal to
! not
| or
& and
isTRUE(x) tests whether x is TRUE

By combining variable assignment (<-) with logical arguments, you can build very interesting comparisons between two or more values
    res <- 4 < 5
    res

TRUE

    typeof(res)

logical

The "!" operator negates the logical value that follows it
    res2 <- !TRUE
    res2

FALSE

    res3 <- !(4 < 5)
    res3

FALSE

The "|" operator returns TRUE if at least one of the two elements is TRUE
    res | res2

TRUE

The "&" operator returns FALSE if at least one of the two elements is FALSE
    res & res2

FALSE

    isTRUE(res)

TRUE

    isTRUE(4 > 5)

FALSE

    isTRUE(4 < 5)

TRUE

    incognita <- !((4+5 == 9) | (3+6 < 8))
    incognita

FALSE

These operators let you define the logic to follow inside "loops"
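The rules for "!", "|" and "&" can be summarized as a truth table. The sketch below is an added illustration, not part of the original exercises: it uses the base function `expand.grid()` to list every combination of two logical values and applies the operators to whole columns at once. The column names are hypothetical.

```r
valores <- c(TRUE, FALSE)

# Every combination of p and q: (T,T), (F,T), (T,F), (F,F).
tabla <- expand.grid(p = valores, q = valores)

tabla$p_o_q <- tabla$p | tabla$q   # TRUE when at least one is TRUE
tabla$p_y_q <- tabla$p & tabla$q   # TRUE only when both are TRUE
tabla$no_p  <- !tabla$p            # the opposite of p

print(tabla)
```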

1.6 The while loop in R
While means "as long as". It implements the following idea: while (this is true) run this; the moment the condition becomes false, stop running it.

Loops in R are written much like a function call:
loopType(condition) { body of the code to run }
    while (FALSE) {
      print("Hola")
    } # "Hola" is never printed because the condition is always false

    while (TRUE) {
      print("Hola")
    } # Prints "Hola" indefinitely because the condition is always true; interrupt it manually to stop

    conteo <- 1
    while (conteo < 12) {
      print(conteo)
    } # "conteo" never changes, so it is always less than 12 and the 1 is printed indefinitely

    while (conteo < 12) {
      print(conteo)
      conteo <- conteo + 1
    } # Each cycle adds 1 to "conteo"; once the variable is no longer < 12 the loop stops

1 2 3 4 5 6 7 8 9 10 11

What would need to change for 12 to be included in the result?
    conteo <- 1
    while (conteo <= 12) {
      print(conteo)
      conteo <- conteo + 1
    }

1 2 3 4 5 6 7 8 9 10 11 12
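A `while (TRUE)` loop does not have to run forever: the `break` statement leaves the loop from inside its body. The following sketch is an added illustration under that assumption (the vector `impresos` is a hypothetical name). It reproduces the count from 1 to 12 and stores the values instead of only printing them.

```r
conteo <- 1
impresos <- numeric(0)   # will collect every value the loop visits

while (TRUE) {
  if (conteo > 12) {
    break                # leave the loop once 12 has been stored
  }
  impresos <- c(impresos, conteo)
  conteo <- conteo + 1
}

print(impresos)
```

Placing the stop condition inside the body is mostly useful when the decision to stop depends on something computed during the iteration.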

1.7 The for loop in R
"for" loops run an instruction once for each element in a given vector, list, or table.

In this particular example the "for" walks through 5 positions, running the body of the loop at each one
    for (i in 1:5) { # i is the variable that takes each value of the numeric sequence 1:5 (1, 2, 3, 4, 5)
      print("Hola R") # printed 5 times, once per value in the sequence from 1 to 5
    }

Hola R
Hola R
Hola R
Hola R
Hola R

    typeof(i)

integer

    for (i in 5:10) { # the sequence 5:10 (5, 6, 7, 8, 9, 10) has one more element than 1:5, so this loop runs 6 times instead of 5
      print("Hola R")
    }

Hola R
Hola R
Hola R
Hola R
Hola R
Hola R

    for (i in 5:10) {
      print(i)
    }

5
6
7
8
9
10

This simple exercise shows that i is a variable that we can use inside the body of the loop.

In the next example we create a vector called "fruta":
    fruta <- c('Manzana', 'Naranja', 'Fresa', 'Platano')
    fruta # c() combines elements into a vector

Manzana Naranja Fresa Platano

In the next example the "for" loop variable "i" takes each element of the vector "fruta" in turn.
    for (i in fruta) {
      i <- paste(i, "es una fruta", sep = ", ")
      print(i) # paste() joins two or more strings, separated by "sep"
    }

Manzana, es una fruta
Naranja, es una fruta
Fresa, es una fruta
Platano, es una fruta

Square brackets "[ ]" give access to the elements of the vector; the number specifies the position we want to access.

    fruta[1]
    fruta[4]
    fruta[5] # Not an error: position 5 does not exist, so R returns NA

Manzana
Platano
NA

For the next example we first create an empty vector
    lista <- c()

With the "for" loop we will fill the variable "lista", using the function "seq()", which generates a sequence (from, to, in steps of "by")
    seq(1, 4, by=1)

1 2 3 4

    for (i in seq(1, 4, by=1)) {
      lista[i] <- i * i
    }
    print(lista)

1 4 9 16

    for (i in seq(1, 4, by=1)) {
      lista[i] <- i * i
      print(lista[i])
    }

1
4
9
16
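Growing an empty vector inside a loop works for small cases. A common alternative, shown here as an added sketch rather than the method used in the course, is to create the vector at its final length first and then fill it. The names `n` and `cuadrados` are hypothetical; `numeric()` and `seq_len()` are base R functions.

```r
n <- 4

# A vector of n zeros, ready to be filled.
cuadrados <- numeric(n)

# seq_len(n) generates 1, 2, ..., n.
for (i in seq_len(n)) {
  cuadrados[i] <- i * i
}

print(cuadrados)

# The same result without a loop, because "^" works on whole vectors.
print(seq_len(n)^2)
```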

1.8 The if-else loop in R
rnorm = generator of random numbers that follow a normal distribution
rnorm(1) = generates one random number; changing the argument to n generates n random numbers as a vector.
-3 ---- -2 ---- -1 ---- 0 ---- 1 ---- 2 ---- 3

    rnorm(1) # The result is random; with the default settings about 99.7% of values fall between -3 and 3
    ?rnorm

-0.5551363
rm = removes a variable from the global environment

    rm(respuesta)

We store the result of rnorm in the variable x
    x <- rnorm(1)

"if" is grouped with the loops in this course, although strictly speaking it is a conditional statement that runs its body at most once. On its own it has a wide range of uses, but its real strength comes from combining it with "else".

"if" is one of the most versatile and easiest constructs to use, thanks to its simple structure.
Logic: if (this condition is true) {run this code} else {run this instead}

    if (x > 1) { # Whether the "if" or the "else" branch runs depends on the value x took in "x <- rnorm(1)"
      respuesta <- "mayor que 1"
      # If the "if" condition is false, the "else" branch runs instead
    } else {
      respuesta <- "menor que 1"
    }

Nested "if"
    if (x > 1) {
      respuesta <- "mayor que 1"
    } else { # a nested "if" inside the "else" defines a new rule or condition
      if (x >= -1) {
        respuesta <- "Entre -1 y 1"
      } else { # a final "else" covers whatever case remains
        respuesta <- "menor que -1"
      }
    }

Chaining conditions
    rm(respuesta)
    x <- rnorm(1)
    if (x > 1) {
      respuesta <- "Mayor que 1"
      # "else if" is the equivalent of "otherwise, if (this) happens {then run this}"
    } else if (x >= -1) { # "else if" is a more elegant, single-line way to write the nested else { if {} }
      respuesta <- "Entre -1 y 1"
    } else { # If neither option 1 nor option 2 applied, then run this
      respuesta <- "Menor que -1"
    }
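Because x comes from rnorm(1), each run of the chain above may take a different branch. To see all three branches on purpose, the chain can be wrapped in a function and called with fixed values. This is an added sketch; the function name `clasificar` is hypothetical and is not defined anywhere in the course material.

```r
# Hypothetical wrapper around the if / else if / else chain.
clasificar <- function(x) {
  if (x > 1) {
    "Mayor que 1"
  } else if (x >= -1) {
    "Entre -1 y 1"
  } else {
    "Menor que -1"
  }
}

# Fixed inputs chosen so that every branch runs once.
print(clasificar(2))
print(clasificar(0))
print(clasificar(-1.5))
```

Note that the boundary value -1 falls in the middle branch because the condition uses ">=".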

1.9 Exercise: loops in R
Problem: I want to see how many times the rnorm function lands between -1 and 1 over N draws
    N <- 10000
    rnorm(N) # Generates N random numbers, most of them between -3 and 3

a = variable that will hold the number of times rnorm gives a result between -1 and 1

    a <- 0

Which loop can visit each element of a vector to carry out a task?
Answer: "for"

To define the rule I want to track in my result, which construct can I use?
Answer: "if"
    for (i in rnorm(N)) { # the loop runs N times, and "i" takes the value of each generated number in turn
      if (i > -1 & i < 1) {
        a <- a + 1
      }
    }

    z <- a / N
    print(z) # z is the fraction of the generated values that fall between -1 and 1

Is the result close to reality? The probability of generating a result between -1 and 1 is about 68.2%, according to the standard normal distribution that "rnorm()" samples from.

What would happen if you increased N from 10000 to 100000 or to 1000000?
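One way to explore that question is sketched below. It is an added illustration, not the course's solution: it replaces the loop with a comparison over the whole vector, and it obtains the theoretical probability from the base function `pnorm()`, the cumulative distribution function of the normal distribution. The seed value is arbitrary and only makes the run repeatable.

```r
set.seed(1)          # arbitrary seed, so the example is repeatable
N <- 100000
x <- rnorm(N)

# The comparison returns a logical vector; its mean is the fraction of TRUE values.
z <- mean(x > -1 & x < 1)

# Theoretical probability of a standard normal value falling between -1 and 1.
teorico <- pnorm(1) - pnorm(-1)

print(z)
print(teorico)
```

As N grows, the observed fraction z tends to move closer to the theoretical value.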

1.10 Assignment 1 in R
I want to evaluate how accurately the model used by the programming language generates normally distributed random numbers

0.1%--2.1%-----13.6%--------68.2%-------13.6%-----2.1%--0.1%
<-3---- -2 -------- -1 --- 0 --- 1 -------- 2 ---- 3>

How could I do that?
    N <- 1000
    rnorm(N)

    a <- 0
    b <- 0
    c <- 0
    d <- 0
    e <- 0

    for (i in rnorm(N)) {
      if (i > -1 & i < 1) { # Value between -1 and 1 (expected 68.2%)
        print("Entre -1 y 1")
        a <- a + 1
      } else if (i < -1 & i > -2) { # Value between -2 and -1 (expected 13.6%)
        print("Entre -2 y -1")
        b <- b + 1
      } else if (i < -2 & i > -3) { # Value between -3 and -2 (expected 2.1%)
        print("Entre -3 y -2")
        c <- c + 1
      } else if (i > 1 & i < 2) { # Value between 1 and 2 (expected 13.6%)
        print("Entre 1 y 2")
        d <- d + 1
      } else if (i > 2 & i < 3) { # Value between 2 and 3 (expected 2.1%)
        print("Entre 2 y 3")
        e <- e + 1
      }
    }

    # Dividing each count by the total N gives the fraction of each event
    UnoMenosUno <- a / N
    MenosUnoMenosDos <- b / N
    MenosDosMenosTres <- c / N
    UnoDos <- d / N
    DosTres <- e / N

    # Multiplying the result by 100 gives the percentage of each event
    UnoMenosUno * 100 # Expected 68.2%
    MenosUnoMenosDos * 100 # Expected 13.6%
    UnoDos * 100 # Expected 13.6%
    MenosDosMenosTres * 100 # Expected 2.1%
    DosTres * 100 # Expected 2.1%

Entre 1 y 2
Entre -1 y 1
Entre -1 y 1
Entre -1 y 1
Entre -1 y 1
...
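The same count can be sketched without writing one branch per interval. The example below is an added alternative, not the assignment's expected solution: it uses the base functions `cut()` to assign each value to an interval, `table()` to count them, and `pnorm()` to compute the theoretical percentages for comparison. The names `cortes`, `observado` and `teorico`, and the seed, are assumptions made for this illustration.

```r
set.seed(42)         # arbitrary seed, so the example is repeatable
N <- 100000
x <- rnorm(N)

# Interval limits, including the two open-ended tails.
cortes <- c(-Inf, -3, -2, -1, 1, 2, 3, Inf)

# Observed percentage of values in each interval.
observado <- table(cut(x, breaks = cortes)) / N * 100

# Theoretical percentage of each interval under the standard normal distribution.
teorico <- diff(pnorm(cortes)) * 100

print(round(observado, 1))
print(round(teorico, 1))
```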

1.11 Vectors in R
c(*,*,*,*) = combines numbers or "characters" into a vector. One restriction of vectors in R is that all elements must be of the same type; if you mix types, R converts them to a common one.

    MiPrimerVector <- c(13, 28, 11, 696)
    MiPrimerVector

There are functions that tell you what type of element a vector contains. is.*(MiPrimerVector) reports the result as true or false, as appropriate (basically yes (T) or no (F))

Integer = whole number
    is.numeric(MiPrimerVector)

TRUE

    is.integer(MiPrimerVector)

FALSE

    is.double(MiPrimerVector) # double = a number that can carry a fractional part, e.g. 45.4

TRUE

An L after a number defines it as an integer
    X2 <- c(3L, 12L, 243L, 0L)
    is.numeric(X2)
    is.integer(X2)
    is.double(X2)

TRUE
TRUE
FALSE

What will happen to the 7 in the vector X3?
    X3 <- c("Pedro", "Z3", "Hola", 7)
    X3
    is.character(X3)
    is.numeric(X3)
    is.integer(X3)

TRUE
FALSE
FALSE
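The result for X3 follows from the single-type rule: when types are mixed, R converts every element to the most general type present. The sketch below is an added illustration of that conversion (the variable `tipos` is a hypothetical name); it uses only `c()` and `typeof()`.

```r
X3 <- c("Pedro", "Z3", "Hola", 7)
print(X3[4])         # the 7 is now stored as the text "7"

# Mixing types: logical with integer, integer with double, double with character.
tipos <- c(
  typeof(c(TRUE, 2L)),
  typeof(c(2L, 2.5)),
  typeof(c(2.5, "a"))
)
print(tipos)
```

Each combination is promoted to the more general of the two types, with character being the most general of those shown.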

":" generates a vector whose length is determined by the values start:end
seq() and ":" in R do almost the same thing
rep() means replicate (an element x, as many times N as you specify)

    seq(1,15)
    1:15

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

seq() accepts a step size, which ":" does not; it is basically (from, to, in steps of)
    seq(1,15,2)
    z <- seq(1,15,4)
    z

    d <- rep(3, 50)
    d

    rep("a", 5)

    x <- c(40, 15)
    y <- rep(x, 10)
    y

1 3 5 7 9 11 13 15
1 5 9 13
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
"a" "a" "a" "a" "a"
40 15 40 15 40 15 40 15 40 15 40 15 40 15 40 15 40 15 40 15

    x <- c(1,123,534,13,4) # Combine
    y <- seq(201,250,11) # Sequence
    z <- rep("Hola!",3) # Replicate

    w <- c("a","b","c","d","e")
    w

"a" "b" "c" "d" "e"

Square brackets "[ ]" in R let you select one of the elements of the vector. The number is the index of the position in the vector that you want to access
    w[1]
    w[2]
    w[3]

"a"
"b"
"c"

    w[-1] # shows every element except the one at position 1

"b" "c" "d" "e"

    w[-3] # shows every element except the one at position 3

    v <- w[-3] # creates a new vector without the element at index 3
    v

"a" "b" "d" "e"

w[1:3] shows elements 1 to 3 of the vector. Remember that ":" is similar to the "seq()" function, so it generates a vector that goes from 1 to 3. Placed inside "[ ]", that vector tells R which elements to show.
    w[1:3]
    w[3:5]

"a" "b" "c"
"c" "d" "e"

Being able to select and isolate information with "[ ]" is very powerful. For example, "w[c(1,3,5)]" selects elements 1, 3 and 5 and produces a new vector with them. Unlike ":", it does not limit us to consecutive positions.
    w[c(1,3,5)]
    w[c(-2,-4)] # remember that in R a "-" inside "[ ]" removes the elements at the given indices

"a" "c" "e"
"a" "c" "e"

    w[-3:-5] # Shows the elements, leaving out positions 3 to 5

"a" "b"

    w[1:2] # Shows elements 1 to 2

"a" "b"
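The last two results are the same because -3:-5 expands to the vector c(-3, -4, -5), so positions 3, 4 and 5 are dropped and only positions 1 and 2 remain. The sketch below is an added check of that reasoning; the variable `indices` is a hypothetical name.

```r
w <- c("a", "b", "c", "d", "e")

# -3:-5 is itself a sequence of negative indices.
indices <- -3:-5
print(indices)

# Dropping positions 3 to 5 leaves the same elements as keeping positions 1 to 2.
print(identical(w[indices], w[1:2]))

# Dropping positions 2 and 4 leaves the same elements as keeping 1, 3 and 5.
print(identical(w[c(-2, -4)], w[c(1, 3, 5)]))
```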

1.12 Operations with vectors in R
"rnorm()" is a function that generates random values that follow a normal distribution
    N <- 10000
    a <- rnorm(N)
    b <- rnorm(N)

Vectorized approach in R
    c <- a * b

In most languages other than R (or Python, when arrays are handled with the NumPy library), you would have to use the "non-vectorized approach".
That means first multiplying each pair of corresponding elements of the two vectors (a and b) and storing each result in a new third vector (d)
    d <- numeric(N) # creates a vector of N zeros to hold the results
    d

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
...

The loop goes from 1 to N, multiplies element i of each vector (a and b), and stores the result in position i of the variable d, which we already created with N slots
    for (i in 1:N) {
      d[i] <- a[i] * b[i]
    }
    d[5000:5005]

0.4348447 0.4700353 -0.4305244 -0.2285815 -0.1196912 0.4232871

By design, R is a high-level language: many of its operations call compiled C, C++ and Fortran routines to do the actual work, and the result is reported back in the console (for example in RStudio). This is what makes vectorized code in R very efficient.
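The two approaches should give the same numbers. The sketch below is an added check, not part of the original text: it builds both results from the same inputs and compares them with the base function `all.equal()`. The name `c_vec` is used instead of `c` only to avoid reusing the name of the `c()` function; the seed is arbitrary.

```r
set.seed(123)        # arbitrary seed, so both approaches see the same data
N <- 10000
a <- rnorm(N)
b <- rnorm(N)

# Vectorized approach: one expression over the whole vectors.
c_vec <- a * b

# Non-vectorized approach: one multiplication per position.
d <- numeric(N)
for (i in seq_len(N)) {
  d[i] <- a[i] * b[i]
}

iguales <- all.equal(c_vec, d)
print(iguales)
```

Wrapping each approach in `system.time()` is one way to compare how long they take on a given machine.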

1.13 Exercise: sequence aligner in R
We define the variables holding the sequences we are going to compare
    sec1 <- paste0("ATGAAGTATAGTTTGCTCCTCTTCCTTGCTCCGCTTGGAGTATGGAGCCG",
                   "TGCCTGTACATGCGGGCAGGCAAATCAGAATGGCGCCTATTCGAGAAATG")
    sec2 <- paste0("ATGAAGTATAGTTTCCTCCTCTTCCTTGCTCCGCTTGGAGTATGGAGCCG",
                   "TGCCTATACATACGGGCAGGCAAATCCGAATGGCGCCGATTCGAGAAATG")
    sec3 <- paste0("ATGAAGTTAAGTTTCCTCCTCTTCCTTGCTCCGCTTGCTGTATGGAGCCG",
                   "TGCCTATACATACGGGCAGGCAAATCCGAATGGCGCCGATTCGAGAACTG")

The function "strsplit()" splits the elements of a character vector into pieces. In this particular case "split = NULL" separates every element into its individual characters, and the result is a list.
    sec1split <- strsplit(sec1, split = NULL) # vectors are usually easier to work with than lists

    sec1splitvector <- unlist(sec1split, use.names = FALSE) # "unlist()" converts a list into a vector

    sec2split <- strsplit(sec2, split = NULL)
    sec2splitvector <- unlist(sec2split, use.names = FALSE)

    sec3split <- strsplit(sec3, split = NULL)
    sec3splitvector <- unlist(sec3split, use.names = FALSE)

The function "rm()" removes a defined variable

    rm(sec1)
    rm(sec2)
    rm(sec3)
    rm(sec1split)
    rm(sec2split)
    rm(sec3split)

At this point we are asking the following question: is vector 1 equal to vector 2, position by position?

    ID <- sec1splitvector == sec2splitvector
    ID # ID stores the record of the comparison, which is basically yes (T) or no (F). The result is a vector of Booleans
    ID2 <- sec2splitvector == sec3splitvector
    ID3 <- sec1splitvector == sec3splitvector

TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

The function "which()" returns the index or indices of the elements of a vector that meet a desired condition.
Basically you are asking "which (of these elements satisfy this rule)"
    Dist1F <- which(ID == FALSE)
    Dist1F

15 56 62 77 88
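To see what "which()" does on something small enough to check by eye, the sketch below applies it to a short logical vector. This is an added illustration with made-up values (`ID_demo` is a hypothetical name), not data from the exercise. It also shows that `!ID_demo` is an equivalent way to write `ID_demo == FALSE`.

```r
# Hypothetical comparison result for five positions.
ID_demo <- c(TRUE, TRUE, FALSE, TRUE, FALSE)

# Positions where the comparison was FALSE.
posiciones <- which(ID_demo == FALSE)
print(posiciones)

# "!" flips every value, so this asks the same question.
print(identical(posiciones, which(!ID_demo)))

# sum() counts TRUE values, giving the number of mismatches directly.
print(sum(!ID_demo))
```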

    Dist2F <- which(ID2 == FALSE)
    Dist3F <- which(ID3 == FALSE)

    Dist1T <- which(ID == TRUE)
    Dist1T

1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 17 18 19 20 21 22 23 24 25 26
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
49 50 51 52 53 54 55 57 58 59 60 61 63 64 65 66 67 68 69 70 71 72
73 74 75 76 78 79 80 81 82 83 84 85 86 87 89 90 91 92 93 94 95 96
97 98 99 100

    Dist2T <- which(ID2 == TRUE)
    Dist3T <- which(ID3 == TRUE)

If we want to know which nucleotides of the sequence are different, we can use "[ ]"
    x1 <- character(0) # empty character vector that will hold the nucleotides
    y <- integer(0)
    for (i in Dist1F) {
      print(sec1splitvector[i])
      x1 <- append(x1, sec1splitvector[i])
      y <- append(y, i)
    }

G
G
G
A
T

    x1
    Dist1F
    y

"G" "G" "G" "A" "T"
15 56 62 77 88
15 56 62 77 88

    x2 <- character(0)
    for (i in Dist2F) {
      print(sec2splitvector[i])
      x2 <- append(x2, sec2splitvector[i])
    }

A
T
G
A
A

    x3 <- character(0)
    for (i in Dist3F) {
      print(sec3splitvector[i])
      x3 <- append(x3, sec3splitvector[i])
    }

T
A
C
C
T
A
A
C
G
C

    Dist1F
    x1
    Dist2F
    x2
    Dist3F
    x3

15 56 62 77 88
"G" "G" "G" "A" "T"
8 9 38 39 98
"A" "T" "G" "A" "A"
8 9 15 38 39 56 62 77 88 98
"T" "A" "C" "C" "T" "A" "A" "C" "G" "C"

The function "length()" gives the length of a given vector, that is, how many elements it contains
    length(sec1splitvector)

100

If I divide "length(Dist1T)" by "length(sec1splitvector)", I am dividing the number of TRUE records by the total, which gives the identity fraction
    IdenT <- length(Dist1T) / length(sec1splitvector)

Applying the same idea to "Dist1F" gives the difference fraction
    IdenF <- length(Dist1F) / length(sec1splitvector)
    IdenT
    IdenF

0.95
0.05

What strategy could we follow if we did not have the "which()" function?

Basically, if what we want to know is how many T and F values a vector has, we would have to look at each of its elements and count them accordingly.
    a <- 0
    b <- 0
    for (i in ID) {
      if (i == TRUE) {
        a <- a + 1
      } else if (i == FALSE) {
        b <- b + 1
      }
    }
    Id1enT <- a / length(sec1splitvector)
    Id1enF <- b / length(sec1splitvector)
    print(Id1enT)
    print(Id1enF)

0.95
0.05

    c <- 0
    d <- 0
    for (i in ID2) {
      if (i == TRUE) {
        c <- c + 1
      } else if (i == FALSE) {
        d <- d + 1
      }
    }
    Id2enT <- c / length(sec1splitvector)
    Id2enF <- d / length(sec1splitvector)
    print(Id2enT)
    print(Id2enF)

0.95
0.05

    e <- 0
    f <- 0
    for (i in ID3) {
      if (i == TRUE) {
        e <- e + 1
      } else if (i == FALSE) {
        f <- f + 1
      }
    }
    Id3enT <- e / length(sec1splitvector)
    Id3enF <- f / length(sec1splitvector)
    print(Id3enT)
    print(Id3enF)

0.9
0.1
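The three loops above repeat the same steps for each pair of sequences. As an added sketch, and only one of several possible ways to organize it, the steps can be collected into a single function. The function name `identidad` and the short test sequences are hypothetical; the function relies on the fact that the mean of a logical vector is the fraction of TRUE values.

```r
# Hypothetical helper: fraction of positions at which two sequences of equal length match.
identidad <- function(s1, s2) {
  v1 <- strsplit(s1, split = "")[[1]]   # [[1]] takes the character vector out of the list
  v2 <- strsplit(s2, split = "")[[1]]
  stopifnot(length(v1) == length(v2))   # this sketch only handles equal lengths
  mean(v1 == v2)
}

# Short made-up sequences: three of the four positions match.
print(identidad("ATGC", "ATGA"))

# The difference fraction is the complement of the identity fraction.
print(1 - identidad("ATGC", "ATGA"))
```

Calling the function once per pair of sequences would replace the three separate loops.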

1.14 Functions in R
"rnorm()" is a function. What it does and how it does it depend on the parameters we give it.

rnorm(n = sample size, mean = mean, sd = standard deviation)

    ?rnorm
    rnorm(5, 10, 8) # Results are random with mean 10; about 99.7% of values fall between -14 and 34 (mean plus or minus 3 sd)

12.3455375 20.2996977 11.6449661 0.2456556 7.4741601

    rnorm(n=5, mean=10, sd=8)

-1.158309 12.556061 10.046840 14.804029 17.264450

    rnorm(n=5, sd=8, mean=10) # When the parameters are named, their order does not matter

12.774692 13.817467 5.575749 20.046214 19.255221

    rnorm(n=5, sd=8) # Parameters that are not given take their default value (mean = 0)

-9.8921348 -13.8261297 0.8221266 13.0884852 6.1612500
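Because every call to rnorm() draws new random numbers, the outputs above cannot be compared directly. The sketch below is an added illustration that fixes the random seed with the base function `set.seed()` before each call, so that positional and named arguments can be shown to produce exactly the same values. The seed and the variable names are arbitrary.

```r
# Positional arguments: n, mean, sd in that order.
set.seed(2022)
por_posicion <- rnorm(5, 10, 8)

# Named arguments in a different order, starting from the same seed.
set.seed(2022)
por_nombre <- rnorm(n = 5, sd = 8, mean = 10)

print(identical(por_posicion, por_nombre))

# Leaving out mean uses its default of 0; leaving out sd uses its default of 1.
set.seed(2022)
por_defecto <- rnorm(5)
set.seed(2022)
explicito <- rnorm(5, mean = 0, sd = 1)

print(identical(por_defecto, explicito))
```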

"c()" combines elements into a vector

    ?c
    c()

NULL

    x <- c("a", "b", "c")
    x

"a" "b" "c"

"seq()" generates a sequence from one value to another, either in steps of a given size or with a given number of values

    ?seq
    seq(from=10, to=20, by=3)

10 13 16 19

    seq(from=10, to=20, length.out=100) # "length.out": total number of values there will be from 10 to 20

10.00000 10.10101 10.20202 10.30303 10.40404 10.50505 10.60606


10.70707 10.80808 10.90909 11.01010 11.11111 11.21212 11.31313
11.41414 11.51515 11.61616 11.71717 11.81818 11.91919 12.02020
12.12121 12.22222 12.32323 12.42424 12.52525 12.62626 12.72727
12.82828 12.92929 13.03030 13.13131 13.23232 13.33333 13.43434
13.53535 13.63636 13.73737 13.83838 13.93939 14.04040 14.14141
14.24242 14.34343 14.44444 14.54545 14.64646 14.74747 14.84848
14.94949 15.05051 15.15152 15.25253 15.35354 15.45455 15.55556
15.65657 15.75758 15.85859 15.95960 16.06061 16.16162 16.26263
16.36364 16.46465 16.56566 16.66667 16.76768 16.86869 16.96970
17.07071 17.17172 17.27273 17.37374 17.47475 17.57576 17.67677
17.77778 17.87879 17.97980 18.08081 18.18182 18.28283 18.38384
18.48485 18.58586 18.68687 18.78788 18.88889 18.98990 19.09091
19.19192 19.29293 19.39394 19.49495 19.59596 19.69697 19.79798
19.89899 20.00000

    seq(from=10, to=20, along.with=x) # along.with: spreads the values from 10 to 20 over as many positions as the vector "x" has

10 15 20
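The three ways of calling seq() can be put side by side. The sketch below is an added comparison using only the arguments shown above (`by`, `length.out` and `along.with`); the variable names are hypothetical.

```r
x <- c("a", "b", "c")

# Fixed step: stops at the last value that does not go past "to".
por_paso <- seq(from = 10, to = 20, by = 3)

# Fixed number of values, spread evenly between "from" and "to".
por_longitud <- seq(from = 10, to = 20, length.out = length(x))

# Same idea, but the number of values is taken from the length of x.
por_vector <- seq(from = 10, to = 20, along.with = x)

print(por_paso)
print(por_longitud)
print(por_vector)
```

With `by`, the end point 20 is not reached because 19 + 3 would go past it; with the other two arguments, the end point is always included.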

"rep()" repeats an element or vector a given number of times

    ?rep
    rep(5, 10)
    rep(5:6, times=10)
    rep(x, times=5)

5 5 5 5 5 5 5 5 5 5
5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6
"a" "b" "c" "a" "b" "c" "a" "b" "c" "a" "b" "c" "a" "b" "c"

    rep(5:6, each=10) # each = repeat every element that many times before moving to the next
    rep(x, each=5)

5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6
"a" "a" "a" "a" "a" "b" "b" "b" "b" "b" "c" "c" "c" "c" "c"

    print(x)

"a" "b" "c"

    is.numeric(x)

FALSE

    is.integer(x)

FALSE

    is.double(x)

FALSE

    is.character(x)

TRUE

    typeof(x)

character

    A <- seq(from=10, to=20, along.with=x)
    A
    sqrt(A)

10 15 20
3.162278 3.872983 4.472136

    B <- sqrt(A)
    B
    paste(B, A, sep=" + ")

3.16227766016838 3.87298334620742 4.47213595499958
"3.16227766016838 + 10" "3.87298334620742 + 15" "4.47213595499958 + 20"

1.14.1 Packages in R
"install.packages()" downloads and installs a package (library), which bundles a large number of related functions; "library()" then loads it into the session.
    install.packages("ggplot2")
    library(ggplot2)

    ?qplot
    ?ggplot
    ?diamonds
    View(diamonds)

    qplot(data = diamonds, carat, price,
          colour = clarity, facets = . ~ clarity)

Another example:
install.packages("bio3d")
library(bio3d)  # bio3d is a set of functions for manipulating
                # nucleotide sequences

getwd()

C:/Users/Username/Documents  * The path (starting with "C:") varies
from computer to computer

The "read.fasta()" function lets R interpret the information stored in
.fasta files, such as Ali.fasta.
The "as.fasta()" function generates "fasta"-format data from a vector; we
create the vector with "c(E)". In the original example we programmed, this
function would have been useful for combining c(sec1, sec2, sec3).
setwd("C:/Users/Username/Documents")  # The working directory can be changed here
getwd()

E <- read.fasta("Ali.fasta")
Fasta <- as.fasta(c(E))
seqidentity(E, normalize = TRUE, similarity = FALSE, ncore = 1,
            nseg.scale = 1)

Ej1 Ej2
Ej1 1.00 0.91
Ej2 0.91 1.00

1.15 Accounting Exercise in R
Data
# Monthly revenue
ingresos <- c(14574.49, 7606.46, 8611.41, 9175.41, 8058.65,
              8105.44, 11496.28, 9766.09, 10305.32, 14379.96, 10713.97,
              15433.50)
# Monthly expenses
gastos <- c(12051.82, 5695.07, 12319.20, 12089.72, 8658.57,
            840.20, 3285.73, 5821.12, 6976.93, 16618.61, 10054.37,
            3803.96)

We want to compute the profit. How could we do it?

All we need to do is subtract the expenses from the revenue:
ganancia <- ingresos - gastos
ganancia

2522.67 1911.39 -3707.79 -2914.31 -599.92 7265.24 8210.55


3944.97 3328.39 -2238.65 659.60 11629.54

We want to compute the VAT (IVA, 16%) on the profit. How could we do it?

Rounding to 2 decimal places is recommended to simplify the results. The
"round()" function rounds a number to the desired number of decimals.
?round
iva <- round(0.16 * ganancia, 2)
iva

403.63 305.82 -593.25 -466.29 -95.99 1162.44 1313.69 631.20


532.54 -358.18 105.54 1860.73

We want to compute the profit remaining after VAT. How could we do it?

ganancia.despues.iva <- ganancia - iva
ganancia.despues.iva

2119.04 1605.57 -3114.54 -2448.02 -503.93 6102.80 6896.86


3313.77 2795.85 -1880.47 554.06 9768.81

We want to compute the profit margin (as a percentage of revenue), already
net of VAT. How could we do it?

Using the "round()" function to limit the result to 2 decimals is recommended.
margen.ganancia <- round(ganancia.despues.iva / ingresos, 2) * 100
margen.ganancia

15 21 -36 -27 -6 75 60 34 27 -13 5 63

We want to compute the median after-VAT profit over the 12 months. How could
we do it?

R includes functions with statistical uses; "median()" is a clear example. It
computes the median of the vector we pass to it (the arithmetic mean would be
computed with "mean()" instead).
?median
median_pat <- median(ganancia.despues.iva)
median_pat

1862.305
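
Because the median and the mean are easy to confuse, the sketch below computes both for the after-VAT profits. The vector is copied from the output shown earlier in this exercise; the names "mediana" and "media" are illustrative and not part of the original code.

```r
# After-VAT profits, copied from the printed output above
ganancia.despues.iva <- c(2119.04, 1605.57, -3114.54, -2448.02, -503.93,
                          6102.80, 6896.86, 3313.77, 2795.85, -1880.47,
                          554.06, 9768.81)

# Median: the middle value once the months are sorted
mediana <- median(ganancia.despues.iva)

# Mean: the sum divided by the number of months
media <- mean(ganancia.despues.iva)

print(mediana)
print(media)
```

With these figures the mean is higher than the median, since a few very good months pull the average upward.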

We want to find the months in which the after-VAT profit was above the
median. How could we do it?
Meses.buenos <- ganancia.despues.iva > median_pat
Meses.buenos

TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE


TRUE FALSE FALSE TRUE

How would we find the bad months?
Meses.Malos <- !Meses.buenos
Meses.Malos

FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE


FALSE TRUE TRUE FALSE

How would we find the best month after VAT?

The "max()" function returns the maximum value in a vector; comparing every
element against it flags the best month.

mejor.mes <- ganancia.despues.iva == max(ganancia.despues.iva)
mejor.mes

FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE


FALSE FALSE FALSE TRUE

How would we find the worst month after VAT?

The "min()" function returns the minimum value in a vector; comparing every
element against it flags the worst month.

peor.mes <- ganancia.despues.iva == min(ganancia.despues.iva)
peor.mes

FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE


FALSE FALSE FALSE FALSE
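
As an alternative sketch, base R also offers "which.max()" and "which.min()", which return the position of the extreme value directly instead of a logical vector. The data are copied from the after-VAT output above; "mejor" and "peor" are illustrative names.

```r
# After-VAT profits, copied from the printed output above
ganancia.despues.iva <- c(2119.04, 1605.57, -3114.54, -2448.02, -503.93,
                          6102.80, 6896.86, 3313.77, 2795.85, -1880.47,
                          554.06, 9768.81)

# Position (month number) of the largest and smallest value
mejor <- which.max(ganancia.despues.iva)
peor <- which.min(ganancia.despues.iva)

print(c(mejor, peor))

# The positions can be used to pull out the values themselves
print(ganancia.despues.iva[c(mejor, peor)])
```

The logical-vector approach used in the exercise is still useful when the flags will be stored next to the other monthly figures.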

How could we convert the figures to units of thousands of pesos?

ingresos.1000 <- round(ingresos / 1000, 0)
gastos.1000 <- round(gastos / 1000, 0)
ganancia.1000 <- round(ganancia / 1000, 0)
ganancia.despues.iva.1000 <- round(ganancia.despues.iva / 1000, 0)

Print the results
ingresos.1000

15 8 9 9 8 8 11 10 10 14 11 15

gastos.1000

12 6 12 12 9 1 3 6 7 17 10 4

ganancia.1000

3 2 -4 -3 -1 7 8 4 3 -2 1 12

ganancia.despues.iva.1000

2 2 -3 -2 -1 6 7 3 3 -2 1 10

margen.ganancia

15 21 -36 -27 -6 75 60 34 27 -13 5 63

Meses.buenos

TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
TRUE FALSE FALSE TRUE

Meses.Malos

FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
FALSE TRUE TRUE FALSE

mejor.mes

FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE TRUE

peor.mes

FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE

which(Meses.buenos == TRUE)

1 6 7 8 9 12

which(Meses.Malos == TRUE)

2 3 4 5 10 11

which(mejor.mes == TRUE)

12

which(peor.mes == TRUE)

3

M <- rbind(
  ingresos.1000,
  gastos.1000,
  ganancia.1000,
  ganancia.despues.iva.1000,
  margen.ganancia,
  Meses.buenos,
  Meses.Malos,
  mejor.mes,
  peor.mes
)

M

                          [,1] [,2] [,3] [,4] [,5] [,6] [,7]
ingresos.1000               15    8    9    9    8    8   11
gastos.1000                 12    6   12   12    9    1    3
ganancia.1000                3    2   -4   -3   -1    7    8
ganancia.despues.iva.1000    2    2   -3   -2   -1    6    7
margen.ganancia             15   21  -36  -27   -6   75   60
Meses.buenos                 1    0    0    0    0    1    1
Meses.Malos                  0    1    1    1    1    0    0
mejor.mes                    0    0    0    0    0    0    0
peor.mes                     0    0    1    0    0    0    0
                          [,8] [,9] [,10] [,11] [,12]
ingresos.1000               10   10    14    11    15
gastos.1000                  6    7    17    10     4
ganancia.1000                4    3    -2     1    12
ganancia.despues.iva.1000    3    3    -2     1    10
margen.ganancia             34   27   -13     5    63
Meses.buenos                 1    1     0     0     1
Meses.Malos                  0    0     1     1     0
mejor.mes                    0    0     0     0     1
peor.mes                     0    0     0     0     0
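
The logical rows appear as 1 and 0 in the matrix because a matrix can hold only one data type, so "rbind()" converts TRUE/FALSE to numbers when they are combined with numeric rows. A minimal sketch with hypothetical vectors "nums" and "flags":

```r
# Hypothetical rows: one numeric, one logical
nums <- c(15, 8, 9)
flags <- c(TRUE, FALSE, TRUE)

# Combining them forces a single common type for the whole matrix
m <- rbind(nums, flags)

print(m)
print(typeof(m))

# The logical values can be recovered with a comparison
print(m["flags", ] == 1)
```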

2 Stage 2
2.1 Matrix Manipulation in R
Matrices in R are essentially tables that hold vector data arranged in rows
and columns: a vector with 2 dimensions.
?matrix

?rbind  # Arranges vectors as rows

?cbind  # Arranges vectors as columns

Matrix
my.data <- 1:20
my.data

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

nrow = number of rows, ncol = number of columns

A <- matrix(my.data, nrow = 4, ncol = 5)
A

[,1] [,2] [,3] [,4] [,5]


[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20

A[2, 3]

10

The forms "[, j]" and "[i, ]" return the corresponding column or row as a
vector.
A[, 3]

9 10 11 12

A[2, ]

2 6 10 14 18

If the matrix has more cells than the vector has elements, the vector's
values are recycled to fill it. For example:
Z <- matrix(my.data, nrow = 5, ncol = 6)
Z

[,1] [,2] [,3] [,4] [,5] [,6]


[1,] 1 6 11 16 1 6
[2,] 2 7 12 17 2 7
[3,] 3 8 13 18 3 8
[4,] 4 9 14 19 4 9
[5,] 5 10 15 20 5 10
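
Recycling is easier to see on a smaller case. The sketch below uses a hypothetical three-element vector to fill a matrix with six cells, so the vector is used exactly twice.

```r
# Hypothetical short vector: 3 values for a matrix with 6 cells
short <- 1:3

# The values are reused from the start once they run out
recycled <- matrix(short, nrow = 3, ncol = 2)

print(recycled)

# Both columns end up holding the same values
print(identical(recycled[, 1], recycled[, 2]))
```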

The argument "byrow = TRUE" fills the matrix row by row instead of column by
column, which is the default.
B <- matrix(my.data, nrow = 4, ncol = 5, byrow = TRUE)
A

[,1] [,2] [,3] [,4] [,5]


[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20

B

[,1] [,2] [,3] [,4] [,5]


[1,] 1 2 3 4 5
[2,] 6 7 8 9 10
[3,] 11 12 13 14 15
[4,] 16 17 18 19 20

B[2, 5]

10

The "rbind()" function joins vectors as rows.

r1 <- c("I", "am", "Happy")
r2 <- c("what", "a", "day")
r3 <- c(1, 2, 3)
C <- rbind(r1, r2, r3)
C

   [,1]   [,2] [,3]
r1 "I"    "am" "Happy"
r2 "what" "a"  "day"
r3 "1"    "2"  "3"
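
Note that the numbers in "r3" are printed with quotation marks: once text and numbers share a matrix, everything is stored as character. A short sketch of that conversion and of how to get numbers back, using illustrative names:

```r
# Illustrative rows: one of text, one of numbers
words <- c("I", "am", "Happy")
numbers <- c(1, 2, 3)

mixed <- rbind(words, numbers)

# The whole matrix is now character, including the numeric row
print(typeof(mixed))

# as.numeric() converts the second row back before doing arithmetic
print(sum(as.numeric(mixed["numbers", ])))
```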

The "cbind()" function joins vectors as columns.

c1 <- 1:5
c2 <- -1:-5
D <- cbind(c1, c2)
D

c1 c2
[1,] 1 -1
[2,] 2 -2
[3,] 3 -3
[4,] 4 -4
[5,] 5 -5

Naming Vectors
Javier <- 1:5
Javier

1 2 3 4 5

Give names to the positions of the vector.
names(Javier) <- c("a", "b", "c", "d", "e")
Javier

a b c d e
1 2 3 4 5

Javier["d"]

d
4

Javier[4]

d
4

names(Javier)

"a" "b" "c" "d" "e"

Assigning "NULL" removes the names from the vector.

names(Javier) <- NULL
Javier

1 2 3 4 5
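
Names can also be given when the vector is created, and removed without modifying the original. This is a small sketch with a hypothetical vector "scores":

```r
# Hypothetical vector with names assigned at creation
scores <- c(a = 10, b = 20, c = 30)

# Elements can be selected by one or several names
print(scores["b"])
print(scores[c("a", "c")])

# unname() returns a copy without names and leaves "scores" untouched
plain <- unname(scores)
print(names(plain))
print(names(scores))
```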

Naming dimension 1 of the Matrix

vec.temp <- rep(c("a", "b", "zZ"), each = 3)
vec.temp

a a a b b b zZ zZ zZ

Recall that matrix() expects the vector, the number of rows, and the number
of columns.
Bravo <- matrix(vec.temp, 3, 3)
Bravo

[,1] [,2] [,3]


[1,] a b zZ
[2,] a b zZ
[3,] a b zZ

This is very similar to naming the positions of a one-dimensional vector, but
now by row:
rownames(Bravo)

NULL

rownames(Bravo) <- c("How", "are", "you")
Bravo

[,1] [,2] [,3]


[How] a b zZ
[are] a b zZ
[you] a b zZ

By column:
colnames(Bravo)

NULL

colnames(Bravo) <- c("X", "Y", "Z")
Bravo

[X] [Y] [Z]


[How] a b zZ
[are] a b zZ
[you] a b zZ

Bravo[2, 2]

Bravo["are", "Y"]

An assignment "<-" placed after "[i, j]" replaces the value stored in that
cell. Remember that character values must always be written inside
quotation marks "".

Bravo["are", "Y"] <- 0
Bravo

[X] [Y] [Z]


[How] a b zZ
[are] a 0 zZ
[you] a b zZ

rownames(Bravo) <- NULL
Bravo

[X] [Y] [Z]


[1,] a b zZ
[2,] a 0 zZ
[3,] a b zZ
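
One detail worth checking is what happens to the 0 assigned above: since the matrix holds character data, the number is stored as the text "0". A minimal sketch with a hypothetical 2 x 2 matrix "m", using the "dimnames" argument of "matrix()" to name rows and columns in one step:

```r
# Hypothetical character matrix with row and column names set at creation
m <- matrix(c("a", "b", "c", "d"), nrow = 2, ncol = 2,
            dimnames = list(c("r1", "r2"), c("X", "Y")))

# Assigning a number into a character matrix stores it as text
m["r2", "Y"] <- 0

print(m)
print(typeof(m))
print(m["r2", "Y"])
```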

2.2 Exercises 1 and 2: Matrix Manipulation in R
Exercise 1
getwd()

Examenes <- matrix(1:5, 9, 5, byrow = TRUE)
View(Examenes)

V1 V2 V3 V4 V5
1 1 2 3 4 5
2 1 2 3 4 5
3 1 2 3 4 5
4 1 2 3 4 5
5 1 2 3 4 5
6 1 2 3 4 5
7 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5

colnames(Examenes) <- c("Lunes 24", "Martes 25", "Miercoles 26",
                        "Jueves 27", "Viernes 28")
rownames(Examenes) <- c("7:00am", "8:00am", "9:00am", "10:00am",
                        "11:00am", "12:00pm", "1:00pm", "2:00pm", "3:00pm")
Examenes[, ] <- ""
Examenes[c("11:00am", "12:00pm"), "Martes 25"] <- "GeF_461_Bs101"
Examenes[c("11:00am", "12:00pm"), "Jueves 27"] <- "GeC_461_Bs101"
Examenes[c("7:00am", "8:00am"), "Miercoles 26"] <- "GeF_462_Bs102"
Examenes[c("11:00am", "12:00pm"), "Miercoles 26"] <- "GeC_462_Bs102"

install.packages("gridExtra")
library(gridExtra)

pdf("Examen.pdf", height = 11, width = 8.5)
grid.table(Examenes)
dev.off()

View(Examenes)

        Lunes 24  Martes 25      Miercoles 26   Jueves 27      Viernes 28
7:00am                           GeF_462_Bs102
8:00am                           GeF_462_Bs102
9:00am
10:00am
11:00am           GeF_461_Bs101  GeC_462_Bs102  GeC_461_Bs101
12:00pm           GeF_461_Bs101  GeC_462_Bs102  GeC_461_Bs101
1:00pm
2:00pm
3:00pm

Exercise 2
getwd()

Asist <- matrix(1:5, 8, 5, byrow = TRUE)
View(Asist)
V1 V2 V3 V4 V5
1 1 2 3 4 5
2 1 2 3 4 5
3 1 2 3 4 5
4 1 2 3 4 5
5 1 2 3 4 5
6 1 2 3 4 5
7 1 2 3 4 5
8 1 2 3 4 5

colnames(Asist) <- c("Sabado 1", "Sabado 8", "Sabado 15",
                     "Sabado 22", "Sabado 29")
rownames(Asist) <- c("1", "2", "3", "4", "5", "6", "7", "8")
Asist[, ] <- ""

Equipo <- c("Hipolito", "Aza", "Cristian", "Abelardo", "Dante",
            "Paloma", "Cristina", "Sam")
RowNo <- c("1", "2", "3", "4", "5", "6", "7", "8")

Asist[RowNo, "Sabado 1"] <- Equipo
Asist[RowNo, "Sabado 8"] <- Equipo
Asist[RowNo, "Sabado 15"] <- Equipo
Asist[RowNo[1:6], "Sabado 22"] <- Equipo[c(1:3, 5:6, 8)]
Asist[RowNo[1:6], "Sabado 29"] <- Equipo[c(-4, -7)]

install.packages("gridExtra")
library(gridExtra)

pdf("Asist.pdf", height = 11, width = 8.5)
grid.table(Asist)
dev.off()

View(Asist)

  Sabado 1  Sabado 8  Sabado 15  Sabado 22  Sabado 29
1 Hipolito  Hipolito  Hipolito   Hipolito   Hipolito
2 Aza       Aza       Aza        Aza        Aza
3 Cristian  Cristian  Cristian   Cristian   Cristian
4 Abelardo  Abelardo  Abelardo   Dante      Dante
5 Dante     Dante     Dante      Paloma     Paloma
6 Paloma    Paloma    Paloma     Sam        Sam
7 Cristina  Cristina  Cristina
8 Sam       Sam       Sam

2.3 Exercise: Matrix Manipulation in R
Copyright: www.superdatascience.com.
Comments:
Seasons are labeled based on the first year in the season, e.g. the 2012-2013
season is presented as simply 2012.
Notes and Corrections to the data:
• Kevin Durant: 2006 - College Data Used

• Kevin Durant: 2005 - Proxied With 2006 Data

• Derrick Rose: 2012 - Did Not Play

• Derrick Rose: 2007 - College Data Used

• Derrick Rose: 2006 - Proxied With 2007 Data

• Derrick Rose: 2005 - Proxied With 2007 Data


Seasons
Seasons <- c("2005", "2006", "2007", "2008", "2009", "2010", "2011",
             "2012", "2013", "2014")

Players
Players <- c("KobeBryant", "JoeJohnson", "LeBronJames",
             "CarmeloAnthony", "DwightHoward", "ChrisBosh", "ChrisPaul",
             "KevinDurant", "DerrickRose", "DwayneWade")

Salaries
KobeBryant_Salary <- c(15946875, 17718750, 19490625, 21262500, 23034375,
                       24806250, 25244493, 27849149, 30453805, 23500000)
JoeJohnson_Salary <- c(12000000, 12744189, 13488377, 14232567, 14976754,
                       16324500, 18038573, 19752645, 21466718, 23180790)
LeBronJames_Salary <- c(4621800, 5828090, 13041250, 14410581, 15779912,
                        14500000, 16022500, 17545000, 19067500, 20644400)
CarmeloAnthony_Salary <- c(3713640, 4694041, 13041250, 14410581, 15779912,
                           17149243, 18518574, 19450000, 22407474, 22458000)
DwightHoward_Salary <- c(4493160, 4806720, 6061274, 13758000, 15202590,
                         16647180, 18091770, 19536360, 20513178, 21436271)
ChrisBosh_Salary <- c(3348000, 4235220, 12455000, 14410581, 15779912,
                      14500000, 16022500, 17545000, 19067500, 20644400)
ChrisPaul_Salary <- c(3144240, 3380160, 3615960, 4574189, 13520500,
                      14940153, 16359805, 17779458, 18668431, 20068563)
KevinDurant_Salary <- c(0, 0, 4171200, 4484040, 4796880, 6053663,
                        15506632, 16669630, 17832627, 18995624)
DerrickRose_Salary <- c(0, 0, 0, 4822800, 5184480, 5546160, 6993708,
                        16402500, 17632688, 18862875)
DwayneWade_Salary <- c(3031920, 3841443, 13041250, 14410581, 15779912,
                       14200000, 15691000, 17182000, 18673000, 15000000)

Matrix

Recall that "rbind()" builds a matrix from vectors, and "rm()" removes the
vectors held in the listed variables.

Salary <- rbind(KobeBryant_Salary, JoeJohnson_Salary, LeBronJames_Salary,
                CarmeloAnthony_Salary, DwightHoward_Salary, ChrisBosh_Salary,
                ChrisPaul_Salary, KevinDurant_Salary, DerrickRose_Salary,
                DwayneWade_Salary)
rm(KobeBryant_Salary, JoeJohnson_Salary, CarmeloAnthony_Salary,
   DwightHoward_Salary, ChrisBosh_Salary, LeBronJames_Salary,
   ChrisPaul_Salary, DerrickRose_Salary, DwayneWade_Salary,
   KevinDurant_Salary)
colnames(Salary) <- Seasons
rownames(Salary) <- Players

Games
KobeBryant_G <- c(80, 77, 82, 82, 73, 82, 58, 78, 6, 35)
JoeJohnson_G <- c(82, 57, 82, 79, 76, 72, 60, 72, 79, 80)
LeBronJames_G <- c(79, 78, 75, 81, 76, 79, 62, 76, 77, 69)
CarmeloAnthony_G <- c(80, 65, 77, 66, 69, 77, 55, 67, 77, 40)
DwightHoward_G <- c(82, 82, 82, 79, 82, 78, 54, 76, 71, 41)
ChrisBosh_G <- c(70, 69, 67, 77, 70, 77, 57, 74, 79, 44)
ChrisPaul_G <- c(78, 64, 80, 78, 45, 80, 60, 70, 62, 82)
KevinDurant_G <- c(35, 35, 80, 74, 82, 78, 66, 81, 81, 27)
DerrickRose_G <- c(40, 40, 40, 81, 78, 81, 39, 0, 10, 51)
DwayneWade_G <- c(75, 51, 51, 79, 77, 76, 49, 69, 54, 62)

Matrix
Games <- rbind(KobeBryant_G, JoeJohnson_G, LeBronJames_G,
               CarmeloAnthony_G, DwightHoward_G, ChrisBosh_G, ChrisPaul_G,
               KevinDurant_G, DerrickRose_G, DwayneWade_G)
rm(KobeBryant_G, JoeJohnson_G, CarmeloAnthony_G, DwightHoward_G,
   ChrisBosh_G, LeBronJames_G, ChrisPaul_G, DerrickRose_G,
   DwayneWade_G, KevinDurant_G)
colnames(Games) <- Seasons
rownames(Games) <- Players

Minutes Played
KobeBryant_MP <- c(3277, 3140, 3192, 2960, 2835, 2779, 2232, 3013, 177, 1207)
JoeJohnson_MP <- c(3340, 2359, 3343, 3124, 2886, 2554, 2127, 2642, 2575, 2791)
LeBronJames_MP <- c(3361, 3190, 3027, 3054, 2966, 3063, 2326, 2877, 2902, 2493)
CarmeloAnthony_MP <- c(2941, 2486, 2806, 2277, 2634, 2751, 1876, 2482, 2982, 1428)
DwightHoward_MP <- c(3021, 3023, 3088, 2821, 2843, 2935, 2070, 2722, 2396, 1223)
ChrisBosh_MP <- c(2751, 2658, 2425, 2928, 2526, 2795, 2007, 2454, 2531, 1556)
ChrisPaul_MP <- c(2808, 2353, 3006, 3002, 1712, 2880, 2181, 2335, 2171, 2857)
KevinDurant_MP <- c(1255, 1255, 2768, 2885, 3239, 3038, 2546, 3119, 3122, 913)
DerrickRose_MP <- c(1168, 1168, 1168, 3000, 2871, 3026, 1375, 0, 311, 1530)
DwayneWade_MP <- c(2892, 1931, 1954, 3048, 2792, 2823, 1625, 2391, 1775, 1971)

Matrix
MinutesPlayed <- rbind(KobeBryant_MP, JoeJohnson_MP, LeBronJames_MP,
                       CarmeloAnthony_MP, DwightHoward_MP, ChrisBosh_MP,
                       ChrisPaul_MP, KevinDurant_MP, DerrickRose_MP,
                       DwayneWade_MP)
rm(KobeBryant_MP, JoeJohnson_MP, CarmeloAnthony_MP, DwightHoward_MP,
   ChrisBosh_MP, LeBronJames_MP, ChrisPaul_MP, DerrickRose_MP,
   DwayneWade_MP, KevinDurant_MP)
colnames(MinutesPlayed) <- Seasons
rownames(MinutesPlayed) <- Players

Field Goals
KobeBryant_FG <- c(978, 813, 775, 800, 716, 740, 574, 738, 31, 266)
JoeJohnson_FG <- c(632, 536, 647, 620, 635, 514, 423, 445, 462, 446)
LeBronJames_FG <- c(875, 772, 794, 789, 768, 758, 621, 765, 767, 624)
CarmeloAnthony_FG <- c(756, 691, 728, 535, 688, 684, 441, 669, 743, 358)
DwightHoward_FG <- c(468, 526, 583, 560, 510, 619, 416, 470, 473, 251)
ChrisBosh_FG <- c(549, 543, 507, 615, 600, 524, 393, 485, 492, 343)
ChrisPaul_FG <- c(407, 381, 630, 631, 314, 430, 425, 412, 406, 568)
KevinDurant_FG <- c(306, 306, 587, 661, 794, 711, 643, 731, 849, 238)
DerrickRose_FG <- c(208, 208, 208, 574, 672, 711, 302, 0, 58, 338)
DwayneWade_FG <- c(699, 472, 439, 854, 719, 692, 416, 569, 415, 509)

Matrix
FieldGoals <- rbind(KobeBryant_FG, JoeJohnson_FG, LeBronJames_FG,
                    CarmeloAnthony_FG, DwightHoward_FG, ChrisBosh_FG,
                    ChrisPaul_FG, KevinDurant_FG, DerrickRose_FG,
                    DwayneWade_FG)
rm(KobeBryant_FG, JoeJohnson_FG, LeBronJames_FG, CarmeloAnthony_FG,
   DwightHoward_FG, ChrisBosh_FG, ChrisPaul_FG, KevinDurant_FG,
   DerrickRose_FG, DwayneWade_FG)
colnames(FieldGoals) <- Seasons
rownames(FieldGoals) <- Players

Field Goal Attempts

KobeBryant_FGA <- c(2173, 1757, 1690, 1712, 1569, 1639, 1336, 1595, 73, 713)
JoeJohnson_FGA <- c(1395, 1139, 1497, 1420, 1386, 1161, 931, 1052, 1018, 1025)
LeBronJames_FGA <- c(1823, 1621, 1642, 1613, 1528, 1485, 1169, 1354, 1353, 1279)
CarmeloAnthony_FGA <- c(1572, 1453, 1481, 1207, 1502, 1503, 1025, 1489, 1643, 806)
DwightHoward_FGA <- c(881, 873, 974, 979, 834, 1044, 726, 813, 800, 423)
ChrisBosh_FGA <- c(1087, 1094, 1027, 1263, 1158, 1056, 807, 907, 953, 745)
ChrisPaul_FGA <- c(947, 871, 1291, 1255, 637, 928, 890, 856, 870, 1170)
KevinDurant_FGA <- c(647, 647, 1366, 1390, 1668, 1538, 1297, 1433, 1688, 467)
DerrickRose_FGA <- c(436, 436, 436, 1208, 1373, 1597, 695, 0, 164, 835)
DwayneWade_FGA <- c(1413, 962, 937, 1739, 1511, 1384, 837, 1093, 761, 1084)

Matrix
FieldGoalAttempts <- rbind(KobeBryant_FGA, JoeJohnson_FGA, LeBronJames_FGA,
                           CarmeloAnthony_FGA, DwightHoward_FGA,
                           ChrisBosh_FGA, ChrisPaul_FGA, KevinDurant_FGA,
                           DerrickRose_FGA, DwayneWade_FGA)
rm(KobeBryant_FGA, JoeJohnson_FGA, LeBronJames_FGA, CarmeloAnthony_FGA,
   DwightHoward_FGA, ChrisBosh_FGA, ChrisPaul_FGA, KevinDurant_FGA,
   DerrickRose_FGA, DwayneWade_FGA)
colnames(FieldGoalAttempts) <- Seasons
rownames(FieldGoalAttempts) <- Players

Points
KobeBryant_PTS <- c(2832, 2430, 2323, 2201, 1970, 2078, 1616, 2133, 83, 782)
JoeJohnson_PTS <- c(1653, 1426, 1779, 1688, 1619, 1312, 1129, 1170, 1245, 1154)
LeBronJames_PTS <- c(2478, 2132, 2250, 2304, 2258, 2111, 1683, 2036, 2089, 1743)
CarmeloAnthony_PTS <- c(2122, 1881, 1978, 1504, 1943, 1970, 1245, 1920, 2112, 966)
DwightHoward_PTS <- c(1292, 1443, 1695, 1624, 1503, 1784, 1113, 1296, 1297, 646)
ChrisBosh_PTS <- c(1572, 1561, 1496, 1746, 1678, 1438, 1025, 1232, 1281, 928)
ChrisPaul_PTS <- c(1258, 1104, 1684, 1781, 841, 1268, 1189, 1186, 1185, 1564)
KevinDurant_PTS <- c(903, 903, 1624, 1871, 2472, 2161, 1850, 2280, 2593, 686)
DerrickRose_PTS <- c(597, 597, 597, 1361, 1619, 2026, 852, 0, 159, 904)
DwayneWade_PTS <- c(2040, 1397, 1254, 2386, 2045, 1941, 1082, 1463, 1028, 1331)

Matrix
Points <- rbind(KobeBryant_PTS, JoeJohnson_PTS, LeBronJames_PTS,
                CarmeloAnthony_PTS, DwightHoward_PTS, ChrisBosh_PTS,
                ChrisPaul_PTS, KevinDurant_PTS, DerrickRose_PTS,
                DwayneWade_PTS)
rm(KobeBryant_PTS, JoeJohnson_PTS, LeBronJames_PTS, CarmeloAnthony_PTS,
   DwightHoward_PTS, ChrisBosh_PTS, ChrisPaul_PTS, KevinDurant_PTS,
   DerrickRose_PTS, DwayneWade_PTS)
colnames(Points) <- Seasons
rownames(Points) <- Players

Games

2005 2006 2007 2008 2009 2010 2011 2012
KobeBryant 80 77 82 82 73 82 58 78
JoeJohnson 82 57 82 79 76 72 60 72
LeBronJames 79 78 75 81 76 79 62 76
CarmeloAnthony 80 65 77 66 69 77 55 67
DwightHoward 82 82 82 79 82 78 54 76
ChrisBosh 70 69 67 77 70 77 57 74
ChrisPaul 78 64 80 78 45 80 60 70
KevinDurant 35 35 80 74 82 78 66 81
DerrickRose 40 40 40 81 78 81 39 0
DwayneWade 75 51 51 79 77 76 49 69
2013 2014
KobeBryant 6 35
JoeJohnson 79 80
LeBronJames 77 69
CarmeloAnthony 77 40
DwightHoward 71 41
ChrisBosh 79 44
ChrisPaul 62 82
KevinDurant 81 27
DerrickRose 10 51
DwayneWade 54 62

rownames(Games)

KobeBryant JoeJohnson LeBronJames CarmeloAnthony
DwightHoward ChrisBosh ChrisPaul KevinDurant DerrickRose
DwayneWade

colnames(Games)

2005 2006 2007 2008 2009 2010 2011 2012 2013 2014

Games["LeBronJames", "2012"]

76

FieldGoals

2005 2006 2007 2008 2009 2010 2011 2012


KobeBryant 978 813 775 800 716 740 574 738
JoeJohnson 632 536 647 620 635 514 423 445
LeBronJames 875 772 794 789 768 758 621 765
CarmeloAnthony 756 691 728 535 688 684 441 669
DwightHoward 468 526 583 560 510 619 416 470
ChrisBosh 549 543 507 615 600 524 393 485
ChrisPaul 407 381 630 631 314 430 425 412
KevinDurant 306 306 587 661 794 711 643 731
DerrickRose 208 208 208 574 672 711 302 0
DwayneWade 699 472 439 854 719 692 416 569
2013 2014
KobeBryant 31 266
JoeJohnson 462 446
LeBronJames 767 624
CarmeloAnthony 743 358
DwightHoward 473 251
ChrisBosh 492 343
ChrisPaul 406 568
KevinDurant 849 238
DerrickRose 58 338
DwayneWade 415 509

FieldGoals / Games

2005 2006 2007 2008
KobeBryant 12.225000 10.558442 9.451220 9.756098
JoeJohnson 7.707317 9.403509 7.890244 7.848101
LeBronJames 11.075949 9.897436 10.586667 9.740741
CarmeloAnthony 9.450000 10.630769 9.454545 8.106061
DwightHoward 5.707317 6.414634 7.109756 7.088608
ChrisBosh 7.842857 7.869565 7.567164 7.987013
ChrisPaul 5.217949 5.953125 7.875000 8.089744
KevinDurant 8.742857 8.742857 7.337500 8.932432
DerrickRose 5.200000 5.200000 5.200000 7.086420
DwayneWade 9.320000 9.254902 8.607843 10.810127
2009 2010 2011 2012
KobeBryant 9.808219 9.024390 9.896552 9.461538
JoeJohnson 8.355263 7.138889 7.050000 6.180556
LeBronJames 10.105263 9.594937 10.016129 10.065789
CarmeloAnthony 9.971014 8.883117 8.018182 9.985075
DwightHoward 6.219512 7.935897 7.703704 6.184211
ChrisBosh 8.571429 6.805195 6.894737 6.554054
ChrisPaul 6.977778 5.375000 7.083333 5.885714
KevinDurant 9.682927 9.115385 9.742424 9.024691
DerrickRose 8.615385 8.777778 7.743590 NaN
DwayneWade 9.337662 9.105263 8.489796 8.246377
2013 2014
KobeBryant 5.166667 7.600000
JoeJohnson 5.848101 5.575000
LeBronJames 9.961039 9.043478
CarmeloAnthony 9.649351 8.950000
DwightHoward 6.661972 6.121951
ChrisBosh 6.227848 7.795455
ChrisPaul 6.548387 6.926829
KevinDurant 10.481481 8.814815
DerrickRose 5.800000 6.627451
DwayneWade 7.685185 8.209677
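
The NaN for DerrickRose in 2012 comes from dividing 0 field goals by 0 games. Dividing one matrix by another works cell by cell, so a 0/0 cell yields NaN while the rest are unaffected. The sketch below reproduces this on two small hypothetical matrices, "goals" and "games", and shows one possible way to skip the NaN when summarizing.

```r
# Hypothetical data: player P2 did not play in 2012 (0 goals, 0 games)
goals <- matrix(c(10, 0, 6, 8), nrow = 2,
                dimnames = list(c("P1", "P2"), c("2012", "2013")))
games <- matrix(c(5, 0, 3, 4), nrow = 2, dimnames = dimnames(goals))

# Element-wise division: each cell is divided by the matching cell
per_game <- goals / games
print(per_game)

# 0 / 0 is not a number
print(is.nan(per_game["P2", "2012"]))

# na.rm = TRUE leaves the NaN out of the row averages
print(rowMeans(per_game, na.rm = TRUE))
```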

round(FieldGoals / Games)

2005 2006 2007 2008 2009 2010 2011 2012
KobeBryant 12 11 9 10 10 9 10 9
JoeJohnson 8 9 8 8 8 7 7 6
LeBronJames 11 10 11 10 10 10 10 10
CarmeloAnthony 9 11 9 8 10 9 8 10
DwightHoward 6 6 7 7 6 8 8 6
ChrisBosh 8 8 8 8 9 7 7 7
ChrisPaul 5 6 8 8 7 5 7 6
KevinDurant 9 9 7 9 10 9 10 9
DerrickRose 5 5 5 7 9 9 8 NaN
DwayneWade 9 9 9 11 9 9 8 8
2013 2014
KobeBryant 5 8
JoeJohnson 6 6
LeBronJames 10 9
CarmeloAnthony 10 9
DwightHoward 7 6
ChrisBosh 6 8
ChrisPaul 7 7
KevinDurant 10 9
DerrickRose 6 7
DwayneWade 8 8

round(FieldGoals / Games, 1)

2005 2006 2007 2008 2009 2010 2011 2012
KobeBryant 12.2 10.6 9.5 9.8 9.8 9.0 9.9 9.5
JoeJohnson 7.7 9.4 7.9 7.8 8.4 7.1 7.0 6.2
LeBronJames 11.1 9.9 10.6 9.7 10.1 9.6 10.0 10.1
CarmeloAnthony 9.4 10.6 9.5 8.1 10.0 8.9 8.0 10.0
DwightHoward 5.7 6.4 7.1 7.1 6.2 7.9 7.7 6.2
ChrisBosh 7.8 7.9 7.6 8.0 8.6 6.8 6.9 6.6
ChrisPaul 5.2 6.0 7.9 8.1 7.0 5.4 7.1 5.9
KevinDurant 8.7 8.7 7.3 8.9 9.7 9.1 9.7 9.0
DerrickRose 5.2 5.2 5.2 7.1 8.6 8.8 7.7 NaN
DwayneWade 9.3 9.3 8.6 10.8 9.3 9.1 8.5 8.2
2013 2014
KobeBryant 5.2 7.6
JoeJohnson 5.8 5.6
LeBronJames 10.0 9.0
CarmeloAnthony 9.6 8.9
DwightHoward 6.7 6.1
ChrisBosh 6.2 7.8
ChrisPaul 6.5 6.9
KevinDurant 10.5 8.8
DerrickRose 5.8 6.6
DwayneWade 7.7 8.2

MinutesPlayed / Games

2005 2006 2007 2008 2009
KobeBryant 40.96250 40.77922 38.92683 36.09756 38.83562
JoeJohnson 40.73171 41.38596 40.76829 39.54430 37.97368
LeBronJames 42.54430 40.89744 40.36000 37.70370 39.02632
CarmeloAnthony 36.76250 38.24615 36.44156 34.50000 38.17391
DwightHoward 36.84146 36.86585 37.65854 35.70886 34.67073
ChrisBosh 39.30000 38.52174 36.19403 38.02597 36.08571
ChrisPaul 36.00000 36.76562 37.57500 38.48718 38.04444
KevinDurant 35.85714 35.85714 34.60000 38.98649 39.50000
DerrickRose 29.20000 29.20000 29.20000 37.03704 36.80769
DwayneWade 38.56000 37.86275 38.31373 38.58228 36.25974
2010 2011 2012 2013 2014
KobeBryant 33.89024 38.48276 38.62821 29.50000 34.48571
JoeJohnson 35.47222 35.45000 36.69444 32.59494 34.88750
LeBronJames 38.77215 37.51613 37.85526 37.68831 36.13043
CarmeloAnthony 35.72727 34.10909 37.04478 38.72727 35.70000
DwightHoward 37.62821 38.33333 35.81579 33.74648 29.82927
ChrisBosh 36.29870 35.21053 33.16216 32.03797 35.36364
ChrisPaul 36.00000 36.35000 33.35714 35.01613 34.84146
KevinDurant 38.94872 38.57576 38.50617 38.54321 33.81481
DerrickRose 37.35802 35.25641 NaN 31.10000 30.00000
DwayneWade 37.14474 33.16327 34.65217 32.87037 31.79032

round(MinutesPlayed / Games)

2005 2006 2007 2008 2009 2010 2011 2012
KobeBryant 41 41 39 36 39 34 38 39
JoeJohnson 41 41 41 40 38 35 35 37
LeBronJames 43 41 40 38 39 39 38 38
CarmeloAnthony 37 38 36 34 38 36 34 37
DwightHoward 37 37 38 36 35 38 38 36
ChrisBosh 39 39 36 38 36 36 35 33
ChrisPaul 36 37 38 38 38 36 36 33
KevinDurant 36 36 35 39 40 39 39 39
DerrickRose 29 29 29 37 37 37 35 NaN
DwayneWade 39 38 38 39 36 37 33 35
2013 2014
KobeBryant 30 34
JoeJohnson 33 35
LeBronJames 38 36
CarmeloAnthony 39 36
DwightHoward 34 30
ChrisBosh 32 35
ChrisPaul 35 35
KevinDurant 39 34
DerrickRose 31 30
DwayneWade 35 32

2.3.1 Subsets

x <- c("a", "b", "c", "d", "e")
x

"a" "b" "c" "d" "e"

With the form "x[c(i, j)]" we can isolate specific subsets of the data.

x[c(1, 5)]

"a" "e"

x[1]

"a"

Games

2005 2006 2007 2008 2009 2010 2011 2012


KobeBryant 80 77 82 82 73 82 58 78
JoeJohnson 82 57 82 79 76 72 60 72
LeBronJames 79 78 75 81 76 79 62 76
CarmeloAnthony 80 65 77 66 69 77 55 67
DwightHoward 82 82 82 79 82 78 54 76
ChrisBosh 70 69 67 77 70 77 57 74
ChrisPaul 78 64 80 78 45 80 60 70
KevinDurant 35 35 80 74 82 78 66 81
DerrickRose 40 40 40 81 78 81 39 0
DwayneWade 75 51 51 79 77 76 49 69
2013 2014
KobeBryant 6 35
JoeJohnson 79 80
LeBronJames 77 69
CarmeloAnthony 77 40
DwightHoward 71 41
ChrisBosh 79 44
ChrisPaul 62 82
KevinDurant 81 27
DerrickRose 10 51
DwayneWade 54 62

Recall that "num:num" produces a sequence of numbers.

Games[1:3, 6:10]

2010 2011 2012 2013 2014


KobeBryant 82 58 78 6 35
JoeJohnson 72 60 72 79 80
LeBronJames 79 62 76 77 69

Games[c(1, 10), ]

2005 2006 2007 2008 2009 2010 2011 2012
KobeBryant 80 77 82 82 73 82 58 78
DwayneWade 75 51 51 79 77 76 49 69
2013 2014
KobeBryant 6 35
DwayneWade 54 62

Games[, c("2008", "2009")]

2008 2009
KobeBryant 82 73
JoeJohnson 79 76
LeBronJames 81 76
CarmeloAnthony 66 69
DwightHoward 79 82
ChrisBosh 77 70
ChrisPaul 78 45
KevinDurant 74 82
DerrickRose 81 78
DwayneWade 79 77

Games[1, ]

2005 2006 2007 2008 2009 2010 2011 2012


KobeBryant 80 77 82 82 73 82 58 78
2013 2014
KobeBryant 6 35

Games[1, 5]

73

When a single row is subset from a matrix, R returns the result as a vector:
is.matrix(Games[1, ])

FALSE

is.vector(Games[1, ])

TRUE

To keep the result as a matrix, add the argument "drop = FALSE" (often
abbreviated as "drop = F"):
Games[1, , drop = FALSE]

2005 2006 2007 2008 2009 2010 2011 2012


KobeBryant 80 77 82 82 73 82 58 78
2013 2014
KobeBryant 6 35

Games[1, 5, drop = FALSE]

2009
KobeBryant 73
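
The effect of "drop" can be checked through the dimensions of the result. This is a minimal sketch on a hypothetical 2 x 3 matrix "m" rather than on the Games data.

```r
# Hypothetical small matrix with named rows and columns
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

# Default behaviour: a single row collapses to a plain vector
row_vec <- m[1, ]
print(dim(row_vec))

# drop = FALSE keeps both dimensions, giving a 1 x 3 matrix
row_mat <- m[1, , drop = FALSE]
print(dim(row_mat))
print(rownames(row_mat))
```

Keeping the matrix form matters when later code relies on row names or on functions that expect two dimensions.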

2.4 Matplot in R
The "matplot()" function plots the columns of a matrix as separate series
against the row index.
?matplot

The "type" parameter sets the kind of plot (p = points, l = lines, b = both
lines and points, h = histogram-like vertical lines, etc.). "pch" sets the
symbols used to draw the points, and "col" sets the colors to use.
matplot(t(FieldGoals / Games), type = "b", pch = 15:18,
        col = c(1:4, 6))

The "legend()" function adds a legend built from the information in a vector.
legend("bottomleft", inset = 0.01, legend = Players, col = c(1:4, 6),
       pch = 15:18, horiz = FALSE)
?legend
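
The two calls can be tried together on a small made-up matrix before applying them to the full data set. In this sketch "scores" and its player names are hypothetical, and "pdf(NULL)" is used only so the code also runs without a screen; remove that line and "dev.off()" to see the plot interactively.

```r
# Hypothetical data: one row per player, one column per season
scores <- matrix(c(3, 5, 4,
                   6, 2, 7),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("PlayerA", "PlayerB"),
                                 c("2005", "2006", "2007")))

pdf(NULL)  # off-screen device, an assumption for non-interactive runs

# Transposing puts seasons in rows, so each player becomes one line
matplot(t(scores), type = "b", pch = 15:16, col = 1:2,
        xlab = "Season index", ylab = "Score")

# The legend reuses the same symbols and colours, in the same order
legend("bottomleft", inset = 0.01, legend = rownames(scores),
       col = 1:2, pch = 15:16)

dev.off()
```

Keeping "pch" and "col" identical in both calls is what makes the legend match the plotted lines.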

FieldGoals

2005 2006 2007 2008 2009 2010 2011 2012
KobeBryant 978 813 775 800 716 740 574 738
JoeJohnson 632 536 647 620 635 514 423 445
LeBronJames 875 772 794 789 768 758 621 765
CarmeloAnthony 756 691 728 535 688 684 441 669
DwightHoward 468 526 583 560 510 619 416 470
ChrisBosh 549 543 507 615 600 524 393 485
ChrisPaul 407 381 630 631 314 430 425 412
KevinDurant 306 306 587 661 794 711 643 731
DerrickRose 208 208 208 574 672 711 302 0
DwayneWade 699 472 439 854 719 692 416 569
2013 2014
KobeBryant 31 266
JoeJohnson 462 446
LeBronJames 767 624
CarmeloAnthony 743 358
DwightHoward 473 251
ChrisBosh 492 343
ChrisPaul 406 568
KevinDurant 849 238
DerrickRose 58 338
DwayneWade 415 509

The "t()" function transposes a matrix, swapping rows and columns. By applying
"t()" before plotting, each series changes from all players within one season
to one player across all seasons, which makes more sense for the plot.
t(FieldGoals / Games)

KobeBryant JoeJohnson LeBronJames
2005 12.225000 7.707317 11.075949
2006 10.558442 9.403509 9.897436
2007 9.451220 7.890244 10.586667
2008 9.756098 7.848101 9.740741
2009 9.808219 8.355263 10.105263
2010 9.024390 7.138889 9.594937
2011 9.896552 7.050000 10.016129
2012 9.461538 6.180556 10.065789
2013 5.166667 5.848101 9.961039
2014 7.600000 5.575000 9.043478
CarmeloAnthony DwightHoward ChrisBosh
2005 9.450000 5.707317 7.842857
2006 10.630769 6.414634 7.869565
2007 9.454545 7.109756 7.567164
2008 8.106061 7.088608 7.987013
2009 9.971014 6.219512 8.571429
2010 8.883117 7.935897 6.805195
2011 8.018182 7.703704 6.894737
2012 9.985075 6.184211 6.554054
2013 9.649351 6.661972 6.227848
2014 8.950000 6.121951 7.795455
ChrisPaul KevinDurant DerrickRose DwayneWade
2005 5.217949 8.742857 5.200000 9.320000
2006 5.953125 8.742857 5.200000 9.254902
2007 7.875000 7.337500 5.200000 8.607843
2008 8.089744 8.932432 7.086420 10.810127
2009 6.977778 9.682927 8.615385 9.337662
2010 5.375000 9.115385 8.777778 9.105263
2011 7.083333 9.742424 7.743590 8.489796
2012 5.885714 9.024691 NaN 8.246377
2013 6.548387 10.481481 5.800000 7.685185
2014 6.926829 8.814815 6.627451 8.209677
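
To see why the transpose matters, recall that matplot() draws one line per column of the matrix it receives. The following is a minimal sketch on a small made-up matrix (the names PlayerA and PlayerB and all values are illustrative assumptions, not the course data):

```r
# Toy matrix: 2 players (rows) x 3 seasons (columns); values are invented
m <- matrix(c(10, 20, 30,
              40, 50, 60),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("PlayerA", "PlayerB"),
                            c("2005", "2006", "2007")))

tm <- t(m)   # seasons become rows, players become columns

dim(m)       # 2 rows, 3 columns
dim(tm)      # 3 rows, 2 columns
tm["2006", "PlayerB"]   # same cell as m["PlayerB", "2006"]
```

After transposing, each player is a column, so matplot(tm, type = "b") would draw one line per player across the seasons.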

FieldGoals / Games

2005 2006 2007
KobeBryant 12.225000 10.558442 9.451220
JoeJohnson 7.707317 9.403509 7.890244
LeBronJames 11.075949 9.897436 10.586667
CarmeloAnthony 9.450000 10.630769 9.454545
DwightHoward 5.707317 6.414634 7.109756
ChrisBosh 7.842857 7.869565 7.567164
ChrisPaul 5.217949 5.953125 7.875000
KevinDurant 8.742857 8.742857 7.337500
DerrickRose 5.200000 5.200000 5.200000
DwayneWade 9.320000 9.254902 8.607843
2008 2009 2010
KobeBryant 9.756098 9.808219 9.024390
JoeJohnson 7.848101 8.355263 7.138889
LeBronJames 9.740741 10.105263 9.594937
CarmeloAnthony 8.106061 9.971014 8.883117
DwightHoward 7.088608 6.219512 7.935897
ChrisBosh 7.987013 8.571429 6.805195
ChrisPaul 8.089744 6.977778 5.375000
KevinDurant 8.932432 9.682927 9.115385
DerrickRose 7.086420 8.615385 8.777778
DwayneWade 10.810127 9.337662 9.105263
2011 2012 2013
KobeBryant 9.896552 9.461538 5.166667
JoeJohnson 7.050000 6.180556 5.848101
LeBronJames 10.016129 10.065789 9.961039
CarmeloAnthony 8.018182 9.985075 9.649351
DwightHoward 7.703704 6.184211 6.661972
ChrisBosh 6.894737 6.554054 6.227848
ChrisPaul 7.083333 5.885714 6.548387
KevinDurant 9.742424 9.024691 10.481481
DerrickRose 7.743590 NaN 5.800000
DwayneWade 8.489796 8.246377 7.685185
2014
KobeBryant 7.600000
JoeJohnson 5.575000
LeBronJames 9.043478
CarmeloAnthony 8.950000
DwightHoward 6.121951
ChrisBosh 7.795455
ChrisPaul 6.926829
KevinDurant 8.814815
DerrickRose 6.627451
DwayneWade 8.209677
matplot(t(FieldGoals / FieldGoalAttempts), type = "b", pch = 15:18, col = c(1:4, 6))
legend("bottomleft", inset = 0.01, legend = Players, col = c(1:4, 6), pch = 15:18, horiz = FALSE)

Suppose we wanted to isolate and plot the minutes played for a single row of the matrix (here the first one, KobeBryant).
MinutesPlayed

2005 2006 2007 2008 2009
KobeBryant 40.96250 40.77922 38.92683 36.09756 38.83562
JoeJohnson 40.73171 41.38596 40.76829 39.54430 37.97368
LeBronJames 42.54430 40.89744 40.36000 37.70370 39.02632
CarmeloAnthony 36.76250 38.24615 36.44156 34.50000 38.17391
DwightHoward 36.84146 36.86585 37.65854 35.70886 34.67073
ChrisBosh 39.30000 38.52174 36.19403 38.02597 36.08571
ChrisPaul 36.00000 36.76562 37.57500 38.48718 38.04444
KevinDurant 35.85714 35.85714 34.60000 38.98649 39.50000
DerrickRose 29.20000 29.20000 29.20000 37.03704 36.80769
DwayneWade 38.56000 37.86275 38.31373 38.58228 36.25974
2010 2011 2012 2013 2014
KobeBryant 33.89024 38.48276 38.62821 29.50000 34.48571
JoeJohnson 35.47222 35.45000 36.69444 32.59494 34.88750
LeBronJames 38.77215 37.51613 37.85526 37.68831 36.13043
CarmeloAnthony 35.72727 34.10909 37.04478 38.72727 35.70000
DwightHoward 37.62821 38.33333 35.81579 33.74648 29.82927
ChrisBosh 36.29870 35.21053 33.16216 32.03797 35.36364
ChrisPaul 36.00000 36.35000 33.35714 35.01613 34.84146
KevinDurant 38.94872 38.57576 38.50617 38.54321 33.81481
DerrickRose 37.35802 35.25641 NaN 31.10000 30.00000
DwayneWade 37.14474 33.16327 34.65217 32.87037 31.79032

Data <- MinutesPlayed[1, ]  # Problem: a single row is returned as a vector, not a matrix
Data

2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
3277 3140 3192 2960 2835 2779 2232 3013 177 1207

Recall that extracting a single row from a matrix essentially converts it into a vector.
matplot(t(Data), type = "b", pch = 15:18, col = c(1:4, 6))
legend("bottomleft", inset = 0.01, legend = Players[1], col = c(1:4, 6), pch = 15:18, horiz = FALSE)

Data2 <- MinutesPlayed[1, , drop = FALSE]
matplot(t(Data2), type = "b", pch = 15:18, col = c(1:4, 6))
legend("bottomleft", inset = 0.01, legend = Players[1], col = c(1:4, 6), pch = 15:18, horiz = FALSE)

Data2

2005 2006 2007 2008 2009 2010 2011 2012
KobeBryant 3277 3140 3192 2960 2835 2779 2232 3013
2013 2014
KobeBryant 177 1207
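
The difference between the two extractions can be checked directly. This is a minimal sketch on a made-up matrix (names and values are assumptions for illustration only); it shows that without drop = FALSE the row becomes a plain vector, and that t() of a vector is a 1-row matrix, which is why the first plot does not come out as one line across the seasons:

```r
# Toy matrix: 2 players x 3 seasons, invented values
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("PlayerA", "PlayerB"),
                            c("2005", "2006", "2007")))

as_vector <- m[1, ]                 # dimensions are dropped
as_matrix <- m[1, , drop = FALSE]   # stays a 1 x 3 matrix

is.matrix(as_vector)   # FALSE
is.matrix(as_matrix)   # TRUE

dim(t(as_vector))      # 1 3: a vector is treated as a column, so t() gives one row
dim(t(as_matrix))      # 3 1: one column with one value per season
```

Since matplot() draws one series per column, the 3 x 1 result is the shape that yields a single line with one point per season.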

2.5 Functions in R
With the "function(){}" construct you can create your own functions. This is recommended whenever a task will be run multiple times, so that you do not have to copy, paste, and modify the code each time.

Here we store the subset “MinutesPlayed[2:3,,drop=F]” in the variable Data2, which gives us the information for those two players across the seasons.
myplot <- function() {
  Data2 <- MinutesPlayed[2:3, , drop = FALSE]
  matplot(t(Data2), type = "b", pch = 15:18, col = c(1:4, 6))
  legend("bottomleft", inset = 0.01, legend = Players[2:3], col = c(1:4, 6), pch = 15:18, horiz = FALSE)
}

myplot()

The real power of functions emerges when we start defining variables for their parameters.

function(parameters){task to run, controlled by those parameters}

myplot2 <- function(rows) {
  Data2 <- MinutesPlayed[rows, , drop = FALSE]
  matplot(t(Data2), type = "b", pch = 15:18, col = c(1:4, 6))
  legend("bottomleft", inset = 0.01, legend = Players[rows], col = c(1:4, 6), pch = 15:18, horiz = FALSE)
}

myplot2(rows = 1:5)

myplot2(1:10)

myplot2(c(1, 3))

The more variables we expose as parameters of the function, the more useful it becomes.
myplot3 <- function(data, rows) {
  Data2 <- data[rows, , drop = FALSE]
  matplot(t(Data2), type = "b", pch = 15:18, col = c(1:4, 6))
  legend("bottomleft", inset = 0.01, legend = Players[rows], col = c(1:4, 6), pch = 15:18, horiz = FALSE)
}

myplot3(Salary, 1:2)

myplot3(Salary, c(1, 3))

Parameters can be given default values in the function definition, and those defaults can be overridden in the call.
myplot4 <- function(data, rows = 1:10) {
  Data2 <- data[rows, , drop = FALSE]
  matplot(t(Data2), type = "b", pch = 15:18, col = c(1:4, 6))
  legend("bottomleft", inset = 0.01, legend = Players[rows], col = c(1:4, 6), pch = 15:18, horiz = FALSE)
}

myplot4(Salary)

myplot4(MinutesPlayed)

myplot4(MinutesPlayed / Games)

myplot4(MinutesPlayed / Games, 3)

myplot4(MinutesPlayed / Games, 3:6)
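
How default values interact with positional and named arguments can be tried without any plotting. The sketch below uses a hypothetical helper, pick_rows(), and a made-up matrix; neither is part of the course material:

```r
# Hypothetical helper: returns the selected rows of a matrix, optionally rescaled
pick_rows <- function(data, rows = 1:2, multiplier = 1) {
  data[rows, , drop = FALSE] * multiplier
}

# Toy matrix with invented values
m <- matrix(1:6, nrow = 3,
            dimnames = list(c("A", "B", "C"), c("2005", "2006")))

defaults   <- pick_rows(m)                   # rows = 1:2, multiplier = 1
positional <- pick_rows(m, 3)                # second argument is matched to rows
by_name    <- pick_rows(m, multiplier = 10)  # rows keeps its default

rownames(defaults)    # "A" "B"
rownames(positional)  # "C"
```

Naming an argument, as in multiplier = 10, lets you skip the parameters that come before it and keep their defaults, which is what myplot5(Salary, cols = 1:7) relies on.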

2.5.1 Using Functions
To get the most out of a function, it is recommended to expose as parameters the variables it depends on. In the particular case shown next, we define the data set, the rows, and the columns to be plotted. In essence, this function lets you explore, in a single line, any of the data sets you have in your global environment.
myplot5 <- function(data, rows = 1:10, cols = 1:10) {
  Data2 <- data[rows, cols, drop = FALSE]
  matplot(t(Data2), type = "b", pch = 15:18, col = c(1:4, 6))
  legend("bottomleft", inset = 0.01, legend = Players[rows], col = c(1:4, 6), pch = 15:18, horiz = FALSE)
}

Salary
myplot5(Salary)
myplot5(Salary / Games)
myplot5(Salary / FieldGoals)
myplot5(Salary, cols = 1:7)
myplot5(Salary / Games, cols = 1:7)
myplot5(Salary / FieldGoals, cols = 1:7)

In-Game Metrics
myplot5(MinutesPlayed)
myplot5(Points)

In-Game Metrics Normalized
myplot5(FieldGoals / Games)
myplot5(FieldGoals / FieldGoalAttempts)
myplot5(FieldGoalAttempts / Games)
myplot5(Points / Games)

Interesting Observation
myplot5(MinutesPlayed / Games)
myplot5(Games)

Time is value
myplot5(FieldGoals / MinutesPlayed)

Player Style
myplot5(Points / FieldGoals)

2.6 Working with Data Frames in R, vol. 1
install.packages("dplyr")
library(dplyr)

Importing a file
?read.csv

Method 1: select the file manually
estadisticas <- read.csv(file.choose())  # From the working folder, select the file DemographicData.csv
estadisticas

Method 2: set your working directory (WD) for reading and saving files
getwd()

• Windows

setwd("E:/User/PathToFolder/Día 5")

• Mac

setwd("/User/PathToFolder/Día 5")
getwd()

rm(estadisticas)
estadisticas <- read.csv("DemographicData.csv")

2.6.1 Exploring the Data

The function “nrow()” returns the number of rows in the data frame.
nrow(estadisticas)

195

The function “ncol()” returns the number of columns in the data frame.
ncol(estadisticas)

“dim()” returns both dimensions: the number of rows and the number of columns.
dim(estadisticas)

195 5

The function “head()” shows the first six rows of the data frame.
head(estadisticas)

Country.Name Country.Code Birth.rate Internet.users
1 Aruba ABW 10.244 78.9
2 Afghanistan AFG 35.253 5.9
3 Angola AGO 45.985 19.1
4 Albania ALB 12.877 57.2
5 United Arab Emirates ARE 11.044 88.0
6 Argentina ARG 17.716 59.9
Income.Group
1 High income
2 Low income
3 Upper middle income
4 Upper middle income
5 High income
6 High income

The “n=” argument changes the number of rows shown.
head(estadisticas, n = 8)

Country.Name Country.Code Birth.rate Internet.users
1 Aruba ABW 10.244 78.9
2 Afghanistan AFG 35.253 5.9
3 Angola AGO 45.985 19.1
4 Albania ALB 12.877 57.2
5 United Arab Emirates ARE 11.044 88.0
6 Argentina ARG 17.716 59.9
7 Armenia ARM 13.308 41.9
8 Antigua and Barbuda ATG 16.447 63.4
Income.Group
1 High income
2 Low income
3 Upper middle income
4 Upper middle income
5 High income
6 High income
7 Lower middle income
8 High income

The function “tail()” shows the last six rows of the data frame.
tail(estadisticas)

Country.Name Country.Code Birth.rate Internet.users
190 Samoa WSM 26.172 15.3
191 Yemen, Rep. YEM 32.947 20.0
192 South Africa ZAF 20.850 46.5
193 Congo, Dem. Rep. COD 42.394 2.2
194 Zambia ZMB 40.471 15.4
195 Zimbabwe ZWE 35.715 18.5
Income.Group
190 Lower middle income
191 Lower middle income
192 Upper middle income
193 Low income
194 Lower middle income
195 Low income

The function “str()” shows the structure of the columns of the data frame.

Factor: a column of text values that repeat, which R groups into “levels” so that they can be treated as categories.
str(estadisticas)

'data.frame': 195 obs. of 5 variables:
 $ Country.Name  : Factor w/ 195 levels "Afghanistan",..: 8 1 4 2 183 6 7 5 9 10 ...
 $ Country.Code  : Factor w/ 195 levels "ABW","AFG","AGO",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Birth.rate    : num 10.2 35.3 46 12.9 11 ...
 $ Internet.users: num 78.9 5.9 19.1 57.2 88 ...
 $ Income.Group  : Factor w/ 4 levels "High income",..: 1 2 4 4 1 1 3 1 1 1 ...
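
Whether text columns appear as Factor depends on how the file is read: since R 4.0.0, read.csv() defaults to stringsAsFactors = FALSE, so the output above corresponds to reading with stringsAsFactors = TRUE or to an older R version. A minimal sketch, using a three-row inline CSV that is an illustrative stand-in for DemographicData.csv:

```r
# Inline CSV text; a tiny stand-in for the real file
csv_text <- "Country.Name,Income.Group
Aruba,High income
Angola,Upper middle income
Austria,High income"

as_text   <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_text$Income.Group)        # "character"
class(as_factor$Income.Group)      # "factor"
levels(as_factor$Income.Group)     # the distinct categories, sorted alphabetically
as.integer(as_factor$Income.Group) # the level index stored for each row
```

If levels(estadisticas$Income.Group) returns NULL on your machine, the column was read as character; re-reading the file with stringsAsFactors = TRUE, or converting the column with factor(), would restore the behavior described here.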

The function “summary()” presents a summary of the values found in each column.
summary(estadisticas)

Country.Name Country.Code Birth.rate
Afghanistan : 1 ABW : 1 Min. : 7.90
Albania : 1 AFG : 1 1st Qu.:12.12
Algeria : 1 AGO : 1 Median :19.68
Angola : 1 ALB : 1 Mean :21.47
Antigua and Barbuda : 1 ARE : 1 3rd Qu.:29.76
Argentina : 1 ARG : 1 Max. :49.66
(Other) :189 (Other):189

Internet.users Income.Group
Min. : 0.90 High income :67
1st Qu.:14.52 Low income :30
Median :41.00 Lower middle income:50
Mean :42.08 Upper middle income:48
3rd Qu.:66.22
Max. :96.55

The power of the "$" sign in R

head(estadisticas, n = 8)

Country.Name Country.Code Birth.rate Internet.users
1 Aruba ABW 10.244 78.9
2 Afghanistan AFG 35.253 5.9
3 Angola AGO 45.985 19.1
4 Albania ALB 12.877 57.2
5 United Arab Emirates ARE 11.044 88.0
6 Argentina ARG 17.716 59.9
7 Armenia ARM 13.308 41.9
8 Antigua and Barbuda ATG 16.447 63.4
Income.Group
1 High income
2 Low income
3 Upper middle income
4 Upper middle income
5 High income
6 High income
7 Lower middle income
8 High income
estadisticas[3, 3]

45.985

estadisticas[3, "Birth.rate"]

45.985

With "estadisticas$" followed by a column name we can directly access every element of that column.
"$" returns a vector.

estadisticas$Internet.users

78.90000 5.90000 19.10000 57.20000 88.00000 59.90000 41.90000
63.40000 83.00000 80.61880 58.70000 1.30000 82.17020 4.90000
9.10000 6.63000 53.06150 90.00004 (......) 72.00000 57.79000
54.17000 33.60000 95.30000 36.94000 43.90000 11.30000 46.60000
15.30000 20.00000 46.50000 2.20000 15.40000 18.50000

To keep the result as a data frame, you have to ask for it explicitly.
as.data.frame(estadisticas$Internet.users)

1 78.90000
2 5.90000
3 19.10000
4 57.20000
5 88.00000
6 59.90000
... ...
193 2.20000
194 15.40000
195 18.50000

Another option is to use the dplyr library.
install.packages("dplyr")
library(dplyr)
select(estadisticas, Internet.users)

1 78.90000
2 5.90000
3 19.10000
4 57.20000
5 88.00000
6 59.90000
... ...
193 2.20000
194 15.40000
195 18.50000

A subsequent [] isolates a single value from the column, while indexing with [ , "Internet.users"] returns the whole column as a vector.

estadisticas$Internet.users[2]
estadisticas[, "Internet.users"]

5.9
78.90000 5.90000 19.10000 57.20000 88.00000 59.90000 41.90000
63.40000 83.00000 80.61880 58.70000 1.30000 82.17020 4.90000
9.10000 6.63000 53.06150 90.00004 (......) 72.00000 57.79000
54.17000 33.60000 95.30000 36.94000 43.90000 11.30000 46.60000
15.30000 20.00000 46.50000 2.20000 15.40000 18.50000
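
The equivalence between these ways of reaching a column can be verified on a small example. This is a sketch on a three-row data frame built by hand from the first rows shown above, not on the full file:

```r
# Small stand-in for estadisticas (first three rows of the printed output)
df <- data.frame(Country.Code   = c("ABW", "AFG", "AGO"),
                 Internet.users = c(78.9, 5.9, 19.1))

by_dollar  <- df$Internet.users        # whole column as a vector
by_bracket <- df[, "Internet.users"]   # same vector, matrix-style indexing
by_double  <- df[["Internet.users"]]   # same vector, list-style indexing

identical(by_dollar, by_bracket)       # TRUE
identical(by_dollar, by_double)        # TRUE

second <- df$Internet.users[2]         # a single value: second element of the column
```

The [[ ]] form is convenient when the column name is stored in a variable, since df[[col_name]] works where df$col_name would look for a column literally called col_name.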

The degree of control you get with the "$" sign or with dplyr is a large part of what makes R well suited to handling large amounts of data.

levels(estadisticas$Income.Group)

2.6.2 Basic Operations with Data Frames

estadisticas[1:10, ]  # Works like head(estadisticas, 10)
estadisticas[3:9, ]
estadisticas[c(4, 100), ]

Country.Name Country.Code Birth.rate Internet.users Income.Group


1 Aruba ABW 10.244 78.9000 High income
2 Afghanistan AFG 35.253 5.9000 Low income
3 Angola AGO 45.985 19.1000 Upper middle income
4 Albania ALB 12.877 57.2000 Upper middle income
5 United Arab Emirates ARE 11.044 88.0000 High income
6 Argentina ARG 17.716 59.9000 High income
7 Armenia ARM 13.308 41.9000 Lower middle income
8 Antigua and Barbuda ATG 16.447 63.4000 High income
9 Australia AUS 13.200 83.0000 High income
10 Austria AUT 9.400 80.6188 High income

Country.Name Country.Code Birth.rate Internet.users Income.Group
3 Angola AGO 45.985 19.1 Upper middle income
4 Albania ALB 12.877 57.2 Upper middle income
5 United Arab Emirates ARE 11.044 88.0 High income
6 Argentina ARG 17.716 59.9 High income
7 Armenia ARM 13.308 41.9 Lower middle income
8 Antigua and Barbuda ATG 16.447 63.4 High income
9 Australia AUS 13.200 83.0 High income

Country.Name Country.Code Birth.rate Internet.users Income.Group


4 Albania ALB 12.877 57.2 Upper middle income
100 Liberia LBR 35.521 3.2 Low income

Recall how [] works

Here we isolate one row, and that row still has “colnames”, which is why R keeps it as a data frame.
is.data.frame(estadisticas[1, ])
estadisticas[, 1]
is.data.frame(estadisticas[, 1])  # What do you think this call returns?

TRUE

[1] "Aruba" "Afghanistan" "Angola"


[4] "Albania" "United Arab Emirates" "Argentina"
[7] "Armenia" "Antigua and Barbuda" "Australia"
[10] "Austria" "Azerbaijan" "Burundi"
...
[187] "Vietnam" "Vanuatu" "West Bank and Gaza"
[190] "Samoa" "Yemen, Rep." "South Africa"
[193] "Congo, Dem. Rep." "Zambia" "Zimbabwe"

FALSE

“is.data.frame(estadisticas[,1])” returns FALSE because this time we are extracting a single column, with a single name and many rows, and R automatically simplifies it to a vector.
Recall that adding the option “drop=F” is the way to keep the result as a data frame.
estadisticas[, 1, drop = FALSE]
is.data.frame(estadisticas[, 1, drop = FALSE])

Country.Name
1 Aruba
2 Afghanistan
3 Angola
...
193 Congo, Dem. Rep.
194 Zambia
195 Zimbabwe

TRUE
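
The three cases discussed here can be checked side by side. A minimal sketch on a hand-built three-row data frame (a stand-in for estadisticas):

```r
# Small stand-in for estadisticas
df <- data.frame(Country.Name = c("Aruba", "Afghanistan", "Angola"),
                 Birth.rate   = c(10.244, 35.253, 45.985))

one_row      <- df[1, ]                 # a row keeps its column names
one_col      <- df[, 1]                 # a single column is simplified
one_col_kept <- df[, 1, drop = FALSE]   # simplification switched off

is.data.frame(one_row)        # TRUE
is.data.frame(one_col)        # FALSE
is.data.frame(one_col_kept)   # TRUE
```

With drop = FALSE the single column is still a data frame with one column and three rows, so functions that expect a data frame keep working on it.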

Operations on data frame columns with $

estadisticas$Birth.rate * estadisticas$Internet.users
estadisticas$Birth.rate + estadisticas$Internet.users
estadisticas$Birth.rate / estadisticas$Internet.users

Adding a column is as simple as assigning to a new name in the data frame we are working with. The column name goes after the $ sign.
estadisticas$MyCalcPeso <- estadisticas$Birth.rate * estadisticas$Internet.users

Or with the mutate function from dplyr
estadisticas <- mutate(estadisticas, MyCalcdplyr = Birth.rate * Internet.users)

Quiz

What do you think happens with this statement?
estadisticas$xyz <- 1:5
head(estadisticas, n = 12)

Country.Name Country.Code Birth.rate Internet.users Income.Group


1 Aruba ABW 10.244 78.9000 High income
2 Afghanistan AFG 35.253 5.9000 Low income
3 Angola AGO 45.985 19.1000 Upper middle income
4 Albania ALB 12.877 57.2000 Upper middle income
5 United Arab Emirates ARE 11.044 88.0000 High income
6 Argentina ARG 17.716 59.9000 High income
7 Armenia ARM 13.308 41.9000 Lower middle income
8 Antigua and Barbuda ATG 16.447 63.4000 High income
9 Australia AUS 13.200 83.0000 High income
10 Austria AUT 9.400 80.6188 High income
11 Azerbaijan AZE 18.300 58.7000 Upper middle income
12 Burundi BDI 44.151 1.3000 Low income
xyz
1 1
2 2
3 3
4 4
5 5
6 1
7 2
8 3
9 4
10 5
11 1
12 2
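
The repeating 1 to 5 pattern is R recycling the short vector to fill the column. For data frame columns this only works when the number of rows is an exact multiple of the vector length (195 is a multiple of 5). A minimal sketch on a made-up ten-row data frame:

```r
# Toy data frame with 10 rows
df <- data.frame(id = 1:10)

df$xyz <- 1:5      # 10 is a multiple of 5, so the vector is recycled twice
df$xyz

# 10 is not a multiple of 3, so this assignment is rejected with an error
outcome <- tryCatch({
  df$bad <- 1:3
  "assigned"
}, error = function(e) "error")

outcome
```

Because the failed assignment stops before the column is created, df is left with only the id and xyz columns.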

Removing columns

Assigning “NULL” to a column removes it from the data frame, much as “rm()” removes an object from the environment.
estadisticas$MyCalcPeso <- NULL
estadisticas$MyCalcdplyr <- NULL
estadisticas$xyz <- NULL

2.7 Working with Data Frames in R, vol. 2
head(estadisticas)
estadisticas$Internet.users < 2

Country.Name Country.Code Birth.rate Internet.users Income.Group


1 Aruba ABW 10.244 78.9 High income
2 Afghanistan AFG 35.253 5.9 Low income
3 Angola AGO 45.985 19.1 Upper middle income
4 Albania ALB 12.877 57.2 Upper middle income
5 United Arab Emirates ARE 11.044 88.0 High income
6 Argentina ARG 17.716 59.9 High income

FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

This simple comparison behaves like a “for” loop that walks through every entry of the column and checks whether the condition is TRUE or FALSE. The resulting logical vector can then be used to select rows.
filter <- estadisticas$Internet.users < 2
estadisticas[filter, ]

Country.Name Country.Code Birth.rate Internet.users Income.Group


12 Burundi BDI 44.151 1.3 Low income

53 Eritrea ERI 34.800 0.9 Low income
56 Ethiopia ETH 32.925 1.9 Low income
65 Guinea GIN 37.337 1.6 Low income
118 Myanmar MMR 18.119 1.6 Lower middle income
128 Niger NER 49.661 1.7 Low income
155 Sierra Leone SLE 36.729 1.7 Low income
157 Somalia SOM 43.891 1.5 Low income
173 Timor-Leste TLS 35.755 1.1 Lower middle income
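
The mechanics of the logical filter are easier to follow on a few rows. The sketch below builds a four-row data frame from values printed in this section (Aruba, Burundi, Eritrea, Angola); the object names are illustrative:

```r
# Four rows taken from the printed output above
df <- data.frame(Country.Name   = c("Aruba", "Burundi", "Eritrea", "Angola"),
                 Internet.users = c(78.9, 1.3, 0.9, 19.1))

keep <- df$Internet.users < 2   # one TRUE/FALSE per row, no explicit loop

keep          # FALSE TRUE TRUE FALSE
sum(keep)     # how many rows satisfy the condition
which(keep)   # their row positions

low_access <- df[keep, ]   # rows where keep is TRUE, all columns
```

sum() works here because TRUE counts as 1 and FALSE as 0, which gives the same row count that dim(estadisticas[filter, ]) reports.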

In this way we can isolate the information that matters for a targeted analysis.

dim(estadisticas[filter, ])

9 5

Taking our control over the data frame one level further, we can state conditions on the characteristics we care about and pick them out of a sea of information. For example, we can select rows by condition:
estadisticas[estadisticas$Birth.rate > 40, ]
estadisticas[estadisticas$Birth.rate > 40 & estadisticas$Internet.users < 2, ]
estadisticas[estadisticas$Income.Group == "High income", ]
estadisticas[estadisticas$Country.Name == "Malta", ]

12 5

Country.Name Country.Code Birth.rate Internet.users Income.Group


3 Angola AGO 45.985 19.1 Upper middle income
12 Burundi BDI 44.151 1.3 Low income
15 Burkina Faso BFA 40.551 9.1 Low income
66 Gambia, The GMB 42.525 14.0 Low income
116 Mali MLI 44.138 3.5 Low income
128 Niger NER 49.661 1.7 Low income

129 Nigeria NGA 40.045 38.0 Lower middle income
157 Somalia SOM 43.891 1.5 Low income
168 Chad TCD 45.745 2.3 Low income
179 Uganda UGA 43.474 16.2 Low income
193 Congo, Dem. Rep. COD 42.394 2.2 Low income
194 Zambia ZMB 40.471 15.4 Lower middle income

Country.Name Country.Code Birth.rate Internet.users Income.Group


12 Burundi BDI 44.151 1.3 Low income
128 Niger NER 49.661 1.7 Low income
157 Somalia SOM 43.891 1.5 Low income

Country.Name Country.Code Birth.rate Internet.users Income.Group


1 Aruba ABW 10.244 78.90000 High income
5 United Arab Emirates ARE 11.044 88.00000 High income
6 Argentina ARG 17.716 59.90000 High income
8 Antigua and Barbuda ATG 16.447 63.40000 High income
9 Australia AUS 13.200 83.00000 High income
10 Austria AUT 9.400 80.61880 High income
13 Belgium BEL 11.200 82.17020 High income
18 Bahrain BHR 15.040 90.00004 High income
19 Bahamas, The BHS 15.339 72.00000 High income
23 Bermuda BMU 10.400 95.30000 High income
26 Barbados BRB 12.188 73.00000 High income
27 Brunei Darussalam BRN 16.405 64.50000 High income
31 Canada CAN 10.900 85.80000 High income
32 Switzerland CHE 10.200 86.34000 High income
33 Chile CHL 13.385 66.50000 High income
43 Cayman Islands CYM 12.500 74.10000 High income
44 Cyprus CYP 11.436 65.45480 High income
45 Czech Republic CZE 10.200 74.11040 High income
46 Germany DEU 8.500 84.17000 High income
48 Denmark DNK 10.000 94.62970 High income
54 Spain ESP 9.100 71.63500 High income
55 Estonia EST 10.300 79.40000 High income
57 Finland FIN 10.700 91.51440 High income
59 France FRA 12.300 81.91980 High income
62 United Kingdom GBR 12.200 89.84410 High income

68 Equatorial Guinea GNQ 35.362 16.40000 High income
69 Greece GRC 8.500 59.86630 High income
71 Greenland GRL 14.500 65.80000 High income
73 Guam GUM 17.389 65.40000 High income
75 Hong Kong SAR, China HKG 7.900 74.20000 High income
77 Croatia HRV 9.400 66.74760 High income
79 Hungary HUN 9.200 72.64390 High income
82 Ireland IRL 15.000 78.24770 High income
85 Iceland ISL 13.400 96.54680 High income
86 Israel ISR 21.300 70.80000 High income
87 Italy ITA 8.500 58.45930 High income
90 Japan JPN 8.200 89.71000 High income
96 Korea, Rep. KOR 8.600 84.77000 High income
97 Kuwait KWT 20.575 75.46000 High income
103 Liechtenstein LIE 9.200 93.80000 High income
106 Lithuania LTU 10.100 68.45290 High income
107 Luxembourg LUX 11.300 93.77650 High income
108 Latvia LVA 10.200 75.23440 High income
109 Macao SAR, China MAC 11.256 65.80000 High income
117 Malta MLT 9.500 68.91380 High income
127 New Caledonia NCL 17.000 66.00000 High income
131 Netherlands NLD 10.200 93.95640 High income
132 Norway NOR 11.600 95.05340 High income
134 New Zealand NZL 13.120 82.78000 High income
135 Oman OMN 20.419 66.45000 High income
141 Poland POL 9.600 62.84920 High income
142 Puerto Rico PRI 10.800 73.90000 High income
143 Portugal PRT 7.900 62.09560 High income
145 French Polynesia PYF 16.393 56.80000 High income
146 Qatar QAT 11.940 85.30000 High income
148 Russian Federation RUS 13.200 67.97000 High income
150 Saudi Arabia SAU 20.576 60.50000 High income
153 Singapore SGP 9.300 81.00000 High income
162 Slovak Republic SVK 10.100 77.88260 High income
163 Slovenia SVN 10.200 72.67560 High income
164 Sweden SWE 11.800 94.78360 High income
166 Seychelles SYC 18.600 50.40000 High income
175 Trinidad and Tobago TTO 14.590 63.80000 High income

181 Uruguay URY 14.374 57.69000 High income
182 United States USA 12.500 84.20000 High income
185 Venezuela, RB VEN 19.842 54.90000 High income
186 Virgin Islands (U.S.) VIR 10.700 45.30000 High income

Country.Name Country.Code Birth.rate Internet.users Income.Group


117 Malta MLT 9.5 68.9138 High income

The same can be achieved with dplyr

filter(estadisticas, Birth.rate > 40)
filter(estadisticas, Birth.rate > 40, Internet.users < 2)
filter(estadisticas, Income.Group == "High income")
filter(estadisticas, Country.Name == "Malta")
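
If dplyr is not available, base R offers subset(), which also lets you write conditions with bare column names. This is offered as an alternative sketch, not as the method used in the course; the four rows are copied from the output printed above:

```r
# Four rows taken from the printed output above
df <- data.frame(Country.Name   = c("Angola", "Burundi", "Niger", "Malta"),
                 Birth.rate     = c(45.985, 44.151, 49.661, 9.5),
                 Internet.users = c(19.1, 1.3, 1.7, 68.9138))

# Bracket form: every column has to be prefixed with df$
with_brackets <- df[df$Birth.rate > 40 & df$Internet.users < 2, ]

# subset() form: columns are referred to by name, conditions joined with &
with_subset <- subset(df, Birth.rate > 40 & Internet.users < 2)

with_subset$Country.Name   # "Burundi" "Niger"
```

In dplyr's filter() the comma between conditions plays the role of &, so filter(df, Birth.rate > 40, Internet.users < 2) expresses the same selection.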

2.8 Visualizing Data Frames in R
install.packages("ggplot2")
library(ggplot2)
?qplot

The function “qplot()” is a very versatile plotting tool. To produce a plot you only need to define the following parameters:
• data= where qplot takes the information from
• x= what to plot on the x axis
• y= what to plot on the y axis, against x
• size= size of the points in the plot
• colour= color of the points
• geom= geometry (the type of plot)

qplot(data = estadisticas, x = Internet.users)
qplot(data = estadisticas, x = Income.Group, y = Birth.rate, size = I(10))
qplot(data = estadisticas, x = Income.Group, y = Birth.rate, size = I(3))
qplot(data = estadisticas, x = Income.Group, y = Birth.rate, size = I(3),
      colour = I("blue"))
qplot(data = estadisticas, x = Income.Group, y = Birth.rate, geom = "boxplot")

2.8.1 Visualizing Only What We Need

qplot(data = estadisticas, x = Internet.users, y = Birth.rate)
qplot(data = estadisticas, x = Internet.users, y = Birth.rate,
      size = I(4))
qplot(data = estadisticas, x = Internet.users, y = Birth.rate,
      colour = I("red"), size = I(4))

What is interesting about the “colour” argument is that we can map it to a group of values. In our case, these can be the “levels” of one of the factor columns of our data frame.
qplot(data = estadisticas, x = Internet.users, y = Birth.rate,
      colour = Income.Group, size = I(5))

2.8.2 Enriching Data Frames in R
setwd("E:/Usuario/PathToFolder/Día 5")
# estadisticas <- read.csv(file.choose())

Countries_2012_Dataset <- c("Aruba","Afghanistan","Angola","Albania","United Arab Emirates",
  "Argentina","Armenia","Antigua and Barbuda","Australia","Austria","Azerbaijan",
  "Burundi","Belgium","Benin","Burkina Faso","Bangladesh","Bulgaria","Bahrain",
  "Bahamas, The","Bosnia and Herzegovina","Belarus","Belize","Bermuda","Bolivia",
  "Brazil","Barbados","Brunei Darussalam","Bhutan","Botswana","Central African Republic",
  "Canada","Switzerland","Chile","China","Cote d'Ivoire","Cameroon","Congo, Rep.",
  "Colombia","Comoros","Cabo Verde","Costa Rica","Cuba","Cayman Islands","Cyprus",
  "Czech Republic","Germany","Djibouti","Denmark","Dominican Republic","Algeria",
  "Ecuador","Egypt, Arab Rep.","Eritrea","Spain","Estonia","Ethiopia","Finland","Fiji",
  "France","Micronesia, Fed. Sts.","Gabon","United Kingdom","Georgia","Ghana","Guinea",
  "Gambia, The","Guinea-Bissau","Equatorial Guinea","Greece","Grenada","Greenland",
  "Guatemala","Guam","Guyana","Hong Kong SAR, China","Honduras","Croatia","Haiti",
  "Hungary","Indonesia","India","Ireland","Iran, Islamic Rep.","Iraq","Iceland","Israel",
  "Italy","Jamaica","Jordan","Japan","Kazakhstan","Kenya","Kyrgyz Republic","Cambodia",
  "Kiribati","Korea, Rep.","Kuwait","Lao PDR","Lebanon","Liberia","Libya","St. Lucia",
  "Liechtenstein","Sri Lanka","Lesotho","Lithuania","Luxembourg","Latvia",
  "Macao SAR, China","Morocco","Moldova","Madagascar","Maldives","Mexico",
  "Macedonia, FYR","Mali","Malta","Myanmar","Montenegro","Mongolia","Mozambique",
  "Mauritania","Mauritius","Malawi","Malaysia","Namibia","New Caledonia","Niger",
  "Nigeria","Nicaragua","Netherlands","Norway","Nepal","New Zealand","Oman","Pakistan",
  "Panama","Peru","Philippines","Papua New Guinea","Poland","Puerto Rico","Portugal",
  "Paraguay","French Polynesia","Qatar","Romania","Russian Federation","Rwanda",
  "Saudi Arabia","Sudan","Senegal","Singapore","Solomon Islands","Sierra Leone",
  "El Salvador","Somalia","Serbia","South Sudan","Sao Tome and Principe","Suriname",
  "Slovak Republic","Slovenia","Sweden","Swaziland","Seychelles","Syrian Arab Republic",
  "Chad","Togo","Thailand","Tajikistan","Turkmenistan","Timor-Leste","Tonga",
  "Trinidad and Tobago","Tunisia","Turkey","Tanzania","Uganda","Ukraine","Uruguay",
  "United States","Uzbekistan","St. Vincent and the Grenadines","Venezuela, RB",
  "Virgin Islands (U.S.)","Vietnam","Vanuatu","West Bank and Gaza","Samoa","Yemen, Rep.",
  "South Africa","Congo, Dem. Rep.","Zambia","Zimbabwe")
Codes_2012_Dataset <- c("ABW","AFG","AGO","ALB","ARE","ARG","ARM","ATG","AUS","AUT",
  "AZE","BDI","BEL","BEN","BFA","BGD","BGR","BHR","BHS","BIH","BLR","BLZ","BMU","BOL",
  "BRA","BRB","BRN","BTN","BWA","CAF","CAN","CHE","CHL","CHN","CIV","CMR","COG","COL",
  "COM","CPV","CRI","CUB","CYM","CYP","CZE","DEU","DJI","DNK","DOM","DZA","ECU","EGY",
  "ERI","ESP","EST","ETH","FIN","FJI","FRA","FSM","GAB","GBR","GEO","GHA","GIN","GMB",
  "GNB","GNQ","GRC","GRD","GRL","GTM","GUM","GUY","HKG","HND","HRV","HTI","HUN","IDN",
  "IND","IRL","IRN","IRQ","ISL","ISR","ITA","JAM","JOR","JPN","KAZ","KEN","KGZ","KHM",
  "KIR","KOR","KWT","LAO","LBN","LBR","LBY","LCA","LIE","LKA","LSO","LTU","LUX","LVA",
  "MAC","MAR","MDA","MDG","MDV","MEX","MKD","MLI","MLT","MMR","MNE","MNG","MOZ","MRT",
  "MUS","MWI","MYS","NAM","NCL","NER","NGA","NIC","NLD","NOR","NPL","NZL","OMN","PAK",
  "PAN","PER","PHL","PNG","POL","PRI","PRT","PRY","PYF","QAT","ROU","RUS","RWA","SAU",
  "SDN","SEN","SGP","SLB","SLE","SLV","SOM","SRB","SSD","STP","SUR","SVK","SVN","SWE",
  "SWZ","SYC","SYR","TCD","TGO","THA","TJK","TKM","TLS","TON","TTO","TUN","TUR","TZA",
  "UGA","UKR","URY","USA","UZB","VCT","VEN","VIR","VNM","VUT","PSE","WSM","YEM","ZAF",
  "COD","ZMB","ZWE")
Regions_2012_Dataset <- c("The Americas","Asia","Africa","Europe","Middle East",
  "The Americas","Asia","The Americas","Oceania","Europe","Asia","Africa","Europe",
  "Africa","Africa","Asia","Europe","Middle East","The Americas","Europe","Europe",
  "The Americas","The Americas","The Americas","The Americas","The Americas","Asia",
  "Asia","Africa","Africa","The Americas","Europe","The Americas","Asia","Africa",
  "Africa","Africa","The Americas","Africa","Africa","The Americas","The Americas",
  "The Americas","Europe","Europe","Europe","Africa","Europe","The Americas","Africa",
  "The Americas","Africa","Africa","Europe","Europe","Africa","Europe","Oceania",
  "Europe","Oceania","Africa","Europe","Asia","Africa","Africa","Africa","Africa",
  "Africa","Europe","The Americas","The Americas","The Americas","Oceania",
  "The Americas","Asia","The Americas","Europe","The Americas","Europe","Asia","Asia",
  "Europe","Middle East","Middle East","Europe","Middle East","Europe","The Americas",
  "Middle East","Asia","Asia","Africa","Asia","Asia","Oceania","Asia","Middle East",
  "Asia","Middle East","Africa","Africa","The Americas","Europe","Asia","Africa",
  "Europe","Europe","Europe","Asia","Africa","Europe","Africa","Asia","The Americas",
  "Europe","Africa","Europe","Asia","Europe","Asia","Africa","Africa","Africa","Africa",
  "Asia","Africa","Oceania","Africa","Africa","The Americas","Europe","Europe","Asia",
  "Oceania","Middle East","Asia","The Americas","The Americas","Asia","Oceania","Europe",
  "The Americas","Europe","The Americas","Oceania","Middle East","Europe","Europe",
  "Africa","Middle East","Africa","Africa","Asia","Oceania","Africa","The Americas",
  "Africa","Europe","Africa","Africa","The Americas","Europe","Europe","Europe","Africa",
  "Africa","Middle East","Africa","Africa","Asia","Asia","Asia","Asia","Oceania",
  "The Americas","Africa","Europe","Africa","Africa","Europe","The Americas",
  "The Americas","Asia","The Americas","The Americas","The Americas","Asia","Oceania",
  "Middle East","Oceania","Middle East","Africa","Africa","Africa","Africa")
(c) Kirill Eremenko, www.superdatascience.com
mydatfr <- data.frame(Countries_2012_Dataset, Codes_2012_Dataset,
                      Regions_2012_Dataset)
head(mydatfr)
colnames(mydatfr) <- c("Country", "Code", "Region")
head(mydatfr)
rm(mydatfr)

mydatfr <- data.frame(Country = Countries_2012_Dataset,
                      Code = Codes_2012_Dataset,
                      Region = Regions_2012_Dataset)

head(mydatfr)
tail(mydatfr)
summary(mydatfr)

2.8.3 Enriching Data Frames

head(estadisticas)
head(mydatfr)

The “merge()” function joins two data frames on a column whose values they
share.

by.x = name of the key column in the first data frame

by.y = name of the key column in the second data frame

This is particularly useful when you want to enrich a study with additional
variables.
union <- merge(estadisticas, mydatfr, by.x = "Country.Code", by.y = "Code")
head(union)
?merge

union$Country <- NULL
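
The matching that “merge()” performs can be seen on a small scale. The following is a minimal sketch with two made-up data frames; the names stats_demo and regions_demo and all their values are illustrative assumptions, not the course data.

```r
# Two toy data frames that share a country code under different column names
stats_demo <- data.frame(
  Country.Code = c("MEX", "USA", "FRA"),
  Birth.rate   = c(19.1, 12.5, 12.3),   # made-up illustrative values
  stringsAsFactors = FALSE
)
regions_demo <- data.frame(
  Code   = c("FRA", "MEX", "JPN"),
  Region = c("Europe", "The Americas", "Asia"),
  stringsAsFactors = FALSE
)

# Default behaviour (inner join): only codes present in both inputs are kept
merged_demo <- merge(stats_demo, regions_demo,
                     by.x = "Country.Code", by.y = "Code")
print(merged_demo)

# all.x = TRUE keeps every row of the first data frame and fills gaps with NA
merged_left <- merge(stats_demo, regions_demo,
                     by.x = "Country.Code", by.y = "Code", all.x = TRUE)
print(merged_left)
```

The key column keeps the name given in by.x, and rows without a partner are dropped unless all.x or all.y is set.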

2.8.4 Visualizing with a new grouping

qplot(data = union, x = Internet.users, y = Birth.rate)

qplot(data = union, x = Internet.users, y = Birth.rate, colour = Region)

1 Shapes
qplot(data = union, x = Internet.users, y = Birth.rate, colour = Region,
      size = I(5), shape = I(17))
qplot(data = union, x = Internet.users, y = Birth.rate, colour = Region,
      size = I(5), shape = I(2))

2 Transparency
qplot(data = union, x = Internet.users, y = Birth.rate, colour = Region,
      size = I(5), shape = I(19), alpha = I(0.6))

3 Title
qplot(data = union, x = Internet.users, y = Birth.rate, colour = Region,
      size = I(5), shape = I(19), alpha = I(0.6),
      main = "Birth rate vs Internet Users")

2.9 Exercise: Handling Data Frames in R
library(dplyr)

getwd()

setwd("E:/Usuario/PathToFolder/Día 5")

stats <- read.csv("updowngeneid.csv")

nrow(stats)

353

ncol(stats)

16

dim(stats)

353 16

View(stats)

str(stats)

353 obs. of 16 variables:

$ Tags : Factor w/ 2 levels "[DOWN]","[UP]": 2 1 2 2 1 2 2 2 1 2 ...
$ Name : Factor w/ 353 levels "UMAG_00037","UMAG_00056",..: 1 2 3 4 5 6 7 8 9 10 ...
$ logFC : num 2.07 -2.57 1.08 1.45 -1.43 ...
$ logCPM : num 8.87 3.64 6.59 6.28 4.16 ...
(..........................................)
$ InterPro_GO_IDs_C..F..P._func: Factor w/ 55 levels "biological process",..: 43 54 54 54 54 22 21 33 3 55 ...

$ InterPro_GO_Names : Factor w/ 118 levels "aldehyde-lyase activity; carbohydrate metabolic process; catalytic activity; metabolic process",..: 80 58 58 58 58 83 17 66 110 59 ...

summary(stats)

Tags Name logFC
[DOWN]:191 XXXX_00037: 1 Min. :-7.7076
[UP] :162 XXXX_00056: 1 1st Qu.:-1.5731
XXXX_00081: 1 Median :-1.0440
XXXX_00082: 1 Mean :-0.2834
XXXX_00132: 1 3rd Qu.: 1.2732
XXXX_00156: 1 Max. : 6.4324
(Other) :347
(............................................)
carbohydrate metabolic process; hydrolase activity,
hydrolyzing O-glycosyl compounds: 9
catalytic activity
oxidoreductase activity; oxidation-reduction process
(Other)

2.9.1 Filtering Data Frame rows with TRUE/FALSE

filter <- stats$logFC < 1

filter

[1] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
[10] FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE
[19] FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
[28] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
(...............)
[334] TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
[343] TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE
[352] FALSE TRUE

stats[stats$logFC < 1, ]

Tags Name logFC logCPM FDR


2 [DOWN] XXXX_00056 -2.571603 3.6359615 3.740000e-37
5 [DOWN] XXXX_00132 -1.433062 4.1628088 2.740000e-15
9 [DOWN] XXXX_00222 -1.279568 8.4966749 1.800000e-73

11 [DOWN] XXXX_00254 -2.141944 3.8140550 1.060000e-32
12 [DOWN] XXXX_00262 -1.151004 6.1700215 3.110000e-38
15 [DOWN] XXXX_00336 -1.354724 9.3118920 1.600000e-16
16 [DOWN] XXXX_00337 -1.788649 1.4505150 4.500000e-05
18 [DOWN] XXXX_00374 -1.595040 6.6360524 1.250000e-46
(......................................................)
123 [DOWN] XXXX_03115 -5.367095 3.8136448 5.370000e-70
124 [DOWN] XXXX_03116 -6.081473 3.4447581 3.550000e-54
125 [DOWN] XXXX_03117 -7.459759 1.6833833 9.170000e-21
127 [DOWN] XXXX_03148 -1.278143 0.9691055 1.608155e-02

PValue Description
2 9.180000e-39 putative neutral amino acid permease
5 1.840000e-16 L-aminoadipate-semialdehyde dehydrogenase
9 1.680000e-75 hypothetical protein
11 3.110000e-34 hypothetical protein
12 7.340000e-40 hypothetical protein
15 1.000000e-17 hypothetical protein
16 9.080000e-06 hypothetical protein
18 2.330000e-48 putative allantoate and ureidosuccinate
(.....................................................)
123 5.560000e-72 transmembrane transport
124 5.350000e-56 no GO terms
125 4.440000e-22 transferase activity
127 5.931740e-03 ATPase activity;

or with dplyr
library(dplyr)
filter(stats, logFC < 1)

Tags Name logFC logCPM FDR


2 [DOWN] UMAG_00056 -2.571603 3.6359615 3.740000e-37
5 [DOWN] UMAG_00132 -1.433062 4.1628088 2.740000e-15
9 [DOWN] UMAG_00222 -1.279568 8.4966749 1.800000e-73
11 [DOWN] UMAG_00254 -2.141944 3.8140550 1.060000e-32
12 [DOWN] UMAG_00262 -1.151004 6.1700215 3.110000e-38

15 [DOWN] UMAG_00336 -1.354724 9.3118920 1.600000e-16
16 [DOWN] UMAG_00337 -1.788649 1.4505150 4.500000e-05
18 [DOWN] UMAG_00374 -1.595040 6.6360524 1.250000e-46
(......................................................)
123 [DOWN] UMAG_03115 -5.367095 3.8136448 5.370000e-70
124 [DOWN] UMAG_03116 -6.081473 3.4447581 3.550000e-54
125 [DOWN] UMAG_03117 -7.459759 1.6833833 9.170000e-21
127 [DOWN] UMAG_03148 -1.278143 0.9691055 1.608155e-02

PValue Description
2 9.180000e-39 putative neutral amino acid permease
5 1.840000e-16 L-aminoadipate-semialdehyde dehydrogenase
9 1.680000e-75 hypothetical protein
11 3.110000e-34 hypothetical protein
12 7.340000e-40 hypothetical protein
15 1.000000e-17 hypothetical protein
16 9.080000e-06 hypothetical protein
18 2.330000e-48 putative allantoate and ureidosuccinate
(.....................................................)
123 5.560000e-72 transmembrane transport
124 5.350000e-56 no GO terms
125 4.440000e-22 transferase activity
127 5.931740e-03 ATPase activity;

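Difference between & and |. Both operators work in base R subsetting and
inside dplyr's “filter()”. In “filter()”, conditions separated by commas are
combined as with ‘‘&’’ (and), so an ‘‘|’’ (or) has to be written explicitly
within a single condition.
TengaAyB <- stats[stats$logFC < 1 & stats$PValue < .005, ]
TengaAoB <- stats[stats$logFC < 1 | stats$PValue < .005, ]
View(stats)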
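
A small sketch may make the difference between the two operators concrete. The data frame genes below is a made-up stand-in for the gene table (its names and values are assumptions for illustration only).

```r
# Toy stand-in for the gene table (values are invented for illustration)
genes <- data.frame(
  Name   = c("g1", "g2", "g3", "g4"),
  logFC  = c(-2.5, 1.8, 0.4, 2.2),
  PValue = c(0.001, 0.020, 0.300, 0.0001),
  stringsAsFactors = FALSE
)

# & keeps rows where BOTH conditions are TRUE
both <- genes[genes$logFC < 1 & genes$PValue < 0.005, ]

# | keeps rows where AT LEAST ONE condition is TRUE
either <- genes[genes$logFC < 1 | genes$PValue < 0.005, ]

print(both$Name)
print(either$Name)
```

The result of | always contains every row selected by &, plus the rows that satisfy only one of the two conditions.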

C <- stats[stats$GO_IDs_C..F..P. == "C", ]
f <- stats[stats$GO_IDs_C..F..P. == "F", ]
P <- stats[stats$GO_IDs_C..F..P. == "P", ]

The “grep()” function finds the elements that contain a given pattern of
characters

C <- stats[grep("C", stats$GO_IDs_C..F..P.), ]
f <- stats[grep("F", stats$GO_IDs_C..F..P.), ]
P <- stats[grep("P", stats$GO_IDs_C..F..P.), ]
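
The two subsetting styles are not equivalent: == requires the whole value to be equal, while grep() matches any value that contains the pattern. A minimal sketch, assuming a hypothetical vector of category labels (go_class is not a column of the course file):

```r
go_class <- c("C", "F", "C; P", "P", "F; P")  # hypothetical category labels

# == demands that the whole string equals "P"
exact_rows <- which(go_class == "P")

# grep() returns the positions of every element that CONTAINS "P"
grep_rows <- grep("P", go_class)

# grepl() gives the same information as a TRUE/FALSE vector
grep_flags <- grepl("P", go_class)

print(exact_rows)
print(grep_rows)
print(grep_flags)
```

Either result can be used as a row index, for example stats[grep_rows, ] or stats[grep_flags, ].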

stats[stats$Tags == "[UP]", ]

Tags Name logFC logCPM ...


1 [UP] UMAG_00037 2.069412 8.8705848 ...
3 [UP] UMAG_00081 1.084135 6.5932902 ...
4 [UP] UMAG_00082 1.449829 6.2789183 ...
6 [UP] UMAG_00156 1.025664 4.4145420 ...
7 [UP] UMAG_00197 1.138476 6.9619722 ...
8 [UP] UMAG_00204 1.155848 5.5536841 ...

stats[stats$Tags == "[DOWN]", ]

Tags Name logFC logCPM ...


2 [DOWN] UMAG_00056 -2.571603 3.6359615 ...
5 [DOWN] UMAG_00132 -1.433062 4.1628088 ...
9 [DOWN] UMAG_00222 -1.279568 8.4966749 ...
11 [DOWN] UMAG_00254 -2.141944 3.8140550 ...
12 [DOWN] UMAG_00262 -1.151004 6.1700215 ...
15 [DOWN] UMAG_00336 -1.354724 9.3118920 ...
16 [DOWN] UMAG_00337 -1.788649 1.4505150 ...
18 [DOWN] UMAG_00374 -1.595040 6.6360524 ...
(......................................................)

UP <- stats[stats$Tags == "[UP]", ]
DOWN <- stats[stats$Tags == "[DOWN]", ]

View(DOWN)

write.csv(DOWN, file = "downgeneid.csv")

2.10 Plotting in R
getwd()

setwd("E:/User/PathToFolder/Dia 5")

movies <- read.csv("Movie-Ratings.csv")

View(movies)

colnames(movies) <- c("Film", "Genre", "CriticRating", "AudienceRating",
                      "BudgetMillions", "Year")
str(movies)

’data.frame’: 562 obs. of 6 variables:


$ Film : Factor w/ 562 levels "(500) Days of Summer
",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Genre : Factor w/ 7 levels "Action","Adventure",..:
3 2 1 2 3 1 3 5 3 3 ...
$ CriticRating : int 87 9 30 93 55 39 40 50 43 93 ...
$ AudienceRating: int 81 44 52 84 70 63 71 57 48 93 ...
$ BudgetMillions: int 8 105 20 18 20 200 30 32 28 8 ...
$ Year : int 2009 2008 2009 2010 2009 2009 2008
2007 2011 2011 ...

levels(movies$Genre)

Action Adventure Comedy Drama Horror Romance Thriller

summary(movies)
Film Genre CriticRating
(500) Days of Summer :1 Action:154 Min.: 0.0
10,000 B.C. : 1 Adventure: 29 1st Qu.:25.0
12 Rounds : 1 Comedy :172 Median :46.0
127 Hours : 1 Drama :101 Mean :47.4
17 Again : 1 Horror : 49 3rd Qu.:70.0
2012 : 1 Romance : 21 Max. :97.0
(Other) :556 Thriller : 36

AudienceRating BudgetMillions Year


Min. : 0.00 Min. : 0.0 Min. :2007
1st Qu.:47.00 1st Qu.: 20.0 1st Qu.:2008
Median :58.00 Median : 35.0 Median :2009
Mean :58.83 Mean : 50.1 Mean :2009
3rd Qu.:72.00 3rd Qu.: 65.0 3rd Qu.:2010
Max. :96.00 Max. :300.0 Max. :2011

The “factor()” function turns a column into a categorical variable, grouping
its repeated values into levels
factor(movies$Year)

2009 2008 2009 2010 2009 2009 2008 2007 2011 2011 2007 2011
2010 2009 2011 2011 2007 2009 2011 2010 2007 2009 2009 2010
2009 2007 2009 2011 2011 2008 2009 2011 2008
(......................................................) [529] 2007 2010 2008 2011
2011 2009 2011 2011 2007 2008 2008 2011 2009 2010 2009 2009
2009 2010 2007 2010 2009 2011 2009 2008 2010 2010 2008 2010
2011 2009 2008 2007 2009 2011 Levels: 2007 2008 2009 2010 2011

movies$Year <- factor(movies$Year)

summary(movies)

Film Genre CriticRating
(500) Days of Summer :1 Action:154 Min.: 0.0
10,000 B.C. : 1 Adventure: 29 1st Qu.:25.0
12 Rounds : 1 Comedy :172 Median :46.0
127 Hours : 1 Drama :101 Mean :47.4
17 Again : 1 Horror : 49 3rd Qu.:70.0
2012 : 1 Romance : 21 Max. :97.0
(Other) :556 Thriller : 36

AudienceRating BudgetMillions Year


Min. : 0.00 Min. : 0.0 Min. :2007
1st Qu.:47.00 1st Qu.: 20.0 1st Qu.:2008
Median :58.00 Median : 35.0 Median :2009
Mean :58.83 Mean : 50.1 Mean :2009
3rd Qu.:72.00 3rd Qu.: 65.0 3rd Qu.:2010
Max. :96.00 Max. :300.0 Max. :2011

str(movies)

’data.frame’: 562 obs. of 6 variables:


$ Film : Factor w/ 562 levels "(500) Days of Summer
",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Genre : Factor w/ 7 levels "Action","Adventure",..:
3 2 1 2 3 1 3 5 3 3 ...
$ CriticRating : int 87 9 30 93 55 39 40 50 43 93 ...
$ AudienceRating: int 81 44 52 84 70 63 71 57 48 93 ...
$ BudgetMillions: int 8 105 20 18 20 200 30 32 28 8 ...
$ Year : Factor w/ 5 levels "2007","2008",..: 3 2 3
4 3 3 2 1 5 5 ...

2.10.1 Aesthetics

library(ggplot2)

“aes()” = how the variables are mapped to the visual properties of the plot

ggplot(data = movies, aes(x = CriticRating, y = AudienceRating))

Add the geometry
ggplot(data = movies, aes(x = CriticRating, y = AudienceRating)) +
  geom_point()

Add statistical layers to the plot (straight-line fit)
ggplot(data = movies, aes(x = CriticRating, y = AudienceRating)) +
  geom_point() + geom_smooth(method = "lm")

lm_fit <- lm(CriticRating ~ AudienceRating, data = movies)

summary(lm_fit)

Call: lm(formula = CriticRating ~ AudienceRating, data = movies)

Residuals:
    Min      1Q  Median      3Q     Max
-50.684 -14.377  -0.024  14.222  64.305

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    -13.0007     3.0654  -4.241  2.6e-05 ***
AudienceRating   1.0268     0.0501  20.494  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19.97 on 560 degrees of freedom
Multiple R-squared: 0.4286, Adjusted R-squared: 0.4276
F-statistic: 420 on 1 and 560 DF, p-value: < 2.2e-16
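
The individual numbers of a fit like this can also be read programmatically instead of from the printed summary. A minimal sketch on synthetic data (the data frame toy and its two columns are invented for illustration and are not the movie ratings):

```r
# Synthetic data with a known linear relation plus noise (not the movie data)
set.seed(1)
audience <- runif(100, min = 0, max = 100)
critic   <- -13 + 1.0 * audience + rnorm(100, sd = 5)
toy <- data.frame(audience, critic)

fit <- lm(critic ~ audience, data = toy)

# coef() returns the intercept and the slope of the fitted line
print(coef(fit))

# summary()$r.squared is the "Multiple R-squared" shown in the printed summary
print(summary(fit)$r.squared)

# predict() evaluates the fitted line at new x values
new_points <- predict(fit, newdata = data.frame(audience = c(20, 80)))
print(new_points)
```

Because the synthetic data were generated with a slope of 1, the estimated slope should land close to that value.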

Add colour
ggplot(data = movies, aes(x = CriticRating, y = AudienceRating,
                          colour = Genre)) + geom_point()

Add size
ggplot(data = movies, aes(x = CriticRating, y = AudienceRating,
                          colour = Genre, size = Genre)) + geom_point()

Mapping colour and size to different variables can make the information
easier to read
ggplot(data = movies, aes(x = CriticRating, y = AudienceRating,
                          colour = Genre, size = Year)) + geom_point()

ggplot(data = movies, aes(x = CriticRating, y = AudienceRating,
                          colour = Genre, size = BudgetMillions)) + geom_point()

ggplot(data = movies, aes(x = CriticRating, y = AudienceRating,
                          colour = BudgetMillions, size = Genre)) + geom_point()

2.10.2 Plotting in layers

p <- ggplot(data = movies, aes(x = CriticRating, y = AudienceRating,
                               colour = Genre, size = BudgetMillions))

Add a point layer to the plot stored in p
p + geom_point()

or a line layer
p + geom_line()

Add multiple layers
p + geom_point() + geom_line()

p + geom_line() + geom_point()

2.10.3 Overriding the aesthetics of the plot

q <- ggplot(data = movies, aes(x = CriticRating, y = AudienceRating,
                               colour = Genre, size = BudgetMillions))

Add a geom layer

q + geom_point()

Override aes
Example 1:
q + geom_point(aes(size = Genre))

Example 2:
q + geom_point(aes(size = Year)) + labs(size = "Year")

q itself remains unchanged
q + geom_point()

Example 3
The “xlab” function changes the text of the x axis
q + geom_point(aes(x = BudgetMillions)) + xlab("Budget Millions $$$")

Example 4
q + geom_line() + geom_point()

Reduce the line width

q + geom_line(size = 1) + geom_point()

2.10.4 Mapping vs. Setting

r <- ggplot(data = movies, aes(x = CriticRating, y = AudienceRating))
r + geom_point()

Adding colour
1. Mapping
The colour is taken from the “levels” of the column used as reference
r + geom_point(aes(colour = Genre))

2. Setting
Writing colour = "Green" inside “aes()” is a mistake: ggplot2 then treats
"Green" as a variable to map rather than as the colour to draw with.
r + geom_point(colour = "Green")
# mistake -> r + geom_point(aes(colour = "Green"))

Adding size
1. Mapping
r + geom_point(aes(size = BudgetMillions))

2. Setting
r + geom_point(size = 10)
# mistake -> r + geom_point(aes(size = 10))

2.10.5 Histograms and Density Plots
A histogram only needs the x aesthetic; “geom_histogram()” computes the counts
for the y axis itself.
“binwidth” should always be set explicitly so that the bins suit the range of
the data.
s <- ggplot(data = movies, aes(x = BudgetMillions))
s + geom_histogram(binwidth = 10)

Adding colour: Setting vs. Mapping
s + geom_histogram(binwidth = 10, fill = "Green")

s + geom_histogram(binwidth = 10, aes(fill = Genre))

Adding a border
“colour” in “geom_histogram()” sets the colour of the bar borders
s + geom_histogram(binwidth = 10, aes(fill = Genre), colour = "Black")

s + geom_histogram(binwidth = 5, aes(fill = Genre), colour = "Green")

s + geom_histogram(binwidth = 20, aes(fill = Genre), colour = "Black")

In some cases density plots are more useful:

s + geom_density(aes(fill = Genre))

position = "stack" stacks the density curves on top of one another instead of
overlapping them
s + geom_density(aes(fill = Genre), position = "stack")

Every type of plot has a large number of options that can be set to customise
the result.
?geom_density

2.11 Exercise: Plotting in R
setwd("E:/User/PathToFolder/Dia 6")

getwd()

Data <- read.csv("Fcb-uanlNew.csv")

# Data <- read.csv(file.choose())

View(Data)

summary(Data)

DayOfWeek Month Day Time start
Fri:40 Nov :24 Min. : 1.00 7:00 AM: 11
Mon:40 May :22 1st Qu.: 8.00 6:56 AM: 9
Sat: 2 Aug :21 Median :16.00 6:59 AM: 9
Sun: 1 Jan :19 Mean :15.84 7:01 AM: 9
Thu:41 Mar :19 3rd Qu.:23.00 6:51 AM: 8
Tue:39 Feb :18 Max. :31.00 6:54 AM: 8
Wed:39 (Other):79 (Other):148

Time end PM Hours


3:40 PM: 4 Min. : 0.600
4:13 PM: 4 1st Qu.: 7.285
5:19 PM: 4 Median : 8.630
3:16 PM: 3 Mean : 8.190
3:33 PM: 3 3rd Qu.: 9.453
3:41 PM: 3 Max. :14.280
(Other):181
Sum.WrkHr <- sum(Data$Hours)

How can I count the work weeks I have completed so far? Each Friday in the
log is counted as one completed week.
Option 1
a <- 0
for (i in Data$DayOfWeek) {
  if (i == "Fri") {
    a <- a + 1
  }
}
a

40

Option 2
b <- length(which(Data$DayOfWeek == "Fri"))
b

40

Option 3
summary(Data$DayOfWeek)

Fri Mon Sat Sun Thu Tue Wed
 40  40   2   1  41  39  39

How can I find out how many overtime hours I have worked?

Reg.wrkW <- a * 40
Ext.time <- Sum.WrkHr - Reg.wrkW
Ext.time

54.38
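
The counting and the overtime calculation can be condensed with vectorised functions. A minimal sketch on a made-up log (log_demo and its values are assumptions for illustration), keeping the same convention that one Friday stands for one 40-hour week:

```r
# Toy log of worked days (values invented for illustration)
log_demo <- data.frame(
  DayOfWeek = c("Mon", "Tue", "Fri", "Mon", "Fri", "Sat"),
  Hours     = c(8.5, 9.0, 7.5, 10.0, 8.0, 3.0),
  stringsAsFactors = FALSE
)

# A comparison yields TRUE/FALSE values; sum() counts the TRUEs
n_fridays <- sum(log_demo$DayOfWeek == "Fri")

# table() counts every day of the week at once
day_counts <- table(log_demo$DayOfWeek)

# Overtime = total hours minus 40 regular hours per counted week
overtime <- sum(log_demo$Hours) - n_fridays * 40

print(n_fridays)
print(day_counts)
print(overtime)
```

A negative result, as in this toy log, simply means fewer hours were logged than the regular 40 per week.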

library(ggplot2)

How could I make a subset of the hours I have worked for each day of the week
over the whole period?
mon <- Data[Data$DayOfWeek == "Mon", ]
Tue <- Data[Data$DayOfWeek == "Tue", ]
Wed <- Data[Data$DayOfWeek == "Wed", ]
Thu <- Data[Data$DayOfWeek == "Thu", ]
Fri <- Data[Data$DayOfWeek == "Fri", ]
Sat <- Data[Data$DayOfWeek == "Sat", ]
Sun <- Data[Data$DayOfWeek == "Sun", ]

rm(mon, Tue, Wed, Thu, Fri, Sat, Sun)

# The same subsets, built in a loop and stored in a named list
day <- levels(Data$DayOfWeek)
Day_of_week <- list()
for (i in day) {
  Day_of_week[[i]] <- Data[Data$DayOfWeek == i, ]
}

Day_of_week[["Fri"]]

How could I make a subset of the hours I have worked for each month over the
whole period?

Nov <- Data[Data$Month == "Nov", ]
Dec <- Data[Data$Month == "Dec", ]
Jan <- Data[Data$Month == "Jan", ]
Feb <- Data[Data$Month == "Feb", ]
Mar <- Data[Data$Month == "Mar", ]
Apr <- Data[Data$Month == "Apr", ]
May <- Data[Data$Month == "May", ]
Jun <- Data[Data$Month == "Jun", ]
Jul <- Data[Data$Month == "Jul", ]
Aug <- Data[Data$Month == "Aug", ]
Sep <- Data[Data$Month == "Sep", ]
Oct <- Data[Data$Month == "Oct", ]

rm(Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, Jan, Feb)

# The same subsets, built in a loop and stored in a named list
mnt <- levels(Data$Month)
Month_of_year <- list()
for (i in mnt) {
  Month_of_year[[i]] <- Data[Data$Month == i, ]
}

Month_of_year[["Nov"]]
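
Base R also offers split(), which builds the same kind of named list in a single call. A minimal sketch on a made-up log (log_demo and its values are illustrative assumptions):

```r
# Toy log (values invented for illustration)
log_demo <- data.frame(
  Month = c("Nov", "Dec", "Nov", "Jan", "Dec"),
  Hours = c(8, 9, 7.5, 6, 10),
  stringsAsFactors = FALSE
)

# split() returns a named list with one data frame per month
by_month <- split(log_demo, log_demo$Month)
print(names(by_month))
print(by_month[["Nov"]])

# sapply() then summarises each piece, e.g. total hours per month
hours_per_month <- sapply(by_month, function(d) sum(d$Hours))
print(hours_per_month)
```

With a character column the list elements come out in alphabetical order; with a factor they follow the order of its levels.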

How could I make a subset of the hours I have worked for each month and each
day of the week over the whole period?
Nov.mon <- Data[Data$DayOfWeek == "Mon" & Data$Month == "Nov", ]
Nov.tue <- Data[Data$DayOfWeek == "Tue" & Data$Month == "Nov", ]
Nov.wed <- Data[Data$DayOfWeek == "Wed" & Data$Month == "Nov", ]
Nov.thu <- Data[Data$DayOfWeek == "Thu" & Data$Month == "Nov", ]
Nov.fri <- Data[Data$DayOfWeek == "Fri" & Data$Month == "Nov", ]
Nov.sat <- Data[Data$DayOfWeek == "Sat" & Data$Month == "Nov", ]
Nov.sun <- Data[Data$DayOfWeek == "Sun" & Data$Month == "Nov", ]

Ago.mon <- Data[Data$DayOfWeek == "Mon" & Data$Month == "Aug", ]
Ago.tue <- Data[Data$DayOfWeek == "Tue" & Data$Month == "Aug", ]
Ago.wed <- Data[Data$DayOfWeek == "Wed" & Data$Month == "Aug", ]
Ago.thu <- Data[Data$DayOfWeek == "Thu" & Data$Month == "Aug", ]
Ago.fri <- Data[Data$DayOfWeek == "Fri" & Data$Month == "Aug", ]
Ago.sat <- Data[Data$DayOfWeek == "Sat" & Data$Month == "Aug", ]
Ago.sun <- Data[Data$DayOfWeek == "Sun" & Data$Month == "Aug", ]

DayOfWeek <- levels(Data$DayOfWeek)
Month <- levels(Data$Month)

# Set the level order explicitly so that plots follow the calendar
Data$Month <- factor(Data$Month, levels = c("Nov", "Dec", "Jan", "Feb", "Mar",
                     "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct"))
Data$DayOfWeek <- factor(Data$DayOfWeek, levels = c("Mon", "Tue", "Wed",
                         "Thu", "Fri", "Sat", "Sun"))
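
The effect of the levels argument can be checked on a short vector. A minimal sketch (the vector months_chr is made up for illustration):

```r
months_chr <- c("Jan", "Nov", "Dec", "Jan", "Oct")

# Default: levels are sorted alphabetically
alpha <- factor(months_chr)
print(levels(alpha))

# Explicit levels fix the order used by table(), sort() and plot axes
ordered_months <- factor(months_chr, levels = c("Nov", "Dec", "Jan", "Oct"))
print(levels(ordered_months))

month_counts <- table(ordered_months)
print(month_counts)
```

Any value that is missing from the levels argument becomes NA, so the list of levels has to cover every value present in the column.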

p <- ggplot(data = Data, aes(x = Month, y = Hours, size = DayOfWeek,
                             colour = Month))
p + geom_jitter() + geom_boxplot(size = 1.2, alpha = 0.5)

v <- ggplot(data = Data, aes(x = Hours))
v + geom_histogram(binwidth = .5, colour = "Black")

v + geom_histogram(binwidth = .5, aes(fill = DayOfWeek), colour = "Black") +
  facet_grid(. ~ Month, scales = "free")

2.12 Exercise: Structuring data in R
install.packages("stringr")
library(stringr) # load the stringr library

setwd("E:/PathToFolder/Dia 6")

dat <- readLines("addr.txt")

View(dat)

The goal is to build a “Data Frame” by splitting each line of the file into
independent “strings” that become its columns. “strsplit” splits each line
into its fields, “do.call” calls “rbind” on the resulting list to stack the
pieces as rows, and “as.data.frame” turns the resulting matrix into a data
frame.

The “split” argument gives the separator as a regular expression; here
" {2,10}" matches runs of 2 to 10 blank spaces. The option
“stringsAsFactors = FALSE” keeps the text columns from being converted to
factors.

dat <- as.data.frame(do.call(rbind, strsplit(dat, split = " {2,10}")),
                     stringsAsFactors = FALSE)
head(dat)

V1 V2 V3 V4
Bania Thomas M. 725 Commonwealth Ave. Boston
Barnaby David 373 W. Geneva St. Wms. Bay
Bausch Judy 373 W. Geneva St. Wms. Bay
Bolatto Alberto 725 Commonwealth Ave. Boston
Carlstrom John 933 E. 56th St. Chicago
Chamberlin Richard A. 111 Nowelo St. Hilo
V5 V6
MA O2215
WI 53191
WI 53191
MA O2215
IL 60637
HI 96720
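
The chain of strsplit, do.call(rbind, ...) and as.data.frame can be followed step by step on a few lines. A minimal sketch, assuming three shortened lines in the style of the address file (the vector raw is typed in by hand and is not read from addr.txt):

```r
# Three lines whose fields are separated by two or more spaces
raw <- c("Bania  Thomas M.  Boston",
         "Barnaby   David   Wms. Bay",
         "Carlstrom  John  Chicago")

# strsplit() returns a list with one character vector per line;
# single spaces (as in "Thomas M.") do not match " {2,10}" and are kept
parts <- strsplit(raw, split = " {2,10}")

# do.call(rbind, parts) is the same as rbind(parts[[1]], parts[[2]], parts[[3]])
mat <- do.call(rbind, parts)

df <- as.data.frame(mat, stringsAsFactors = FALSE)
names(df) <- c("LastName", "FirstName", "city")
print(df)
```

This only produces a clean table when every line splits into the same number of fields; otherwise rbind recycles values and the columns no longer line up.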

The “names()” function is used to name the columns of the “Data Frame”.
“rownames()” and “colnames()” can be used in the same way.
names(dat) <- c("LastName", "FirstName", "address", "city", "state", "zip")
head(dat)

LastName FirstName address city state
Bania Thomas M. 725 Commonwealth Ave. Boston MA
Barnaby David 373 W. Geneva St. Wms. Bay WI
Bausch Judy 373 W. Geneva St. Wms. Bay WI
Bolatto Alberto 725 Commonwealth Ave. Boston MA
Carlstrom John 933 E. 56th St. Chicago IL
Chamberlin Richard A. 111 Nowelo St. Hilo HI
zip
O2215
53191
53191
O2215
60637
96720

tail(dat)

LastName FirstName address city


Thoma Mark 373 W. Geneva St. Wms. Bay
Walker Chris 933 N. Cherry St. Tucson
Wehrer Cheryl 5000 Forbes Ave. Pittsburgh
Wirth Jesse 373 W. Geneva St. Wms. Bay
Wright Greg 791 Holmdel-Keyport Rd. Holmdel
Zingale Michael 5640 S. Ellis Ave. Chicago
state zip
WI 53191
AZ 85721
PA 15213
WI 53191
NY O7733-1988
IL 60637

View(dat)

The steps above organised the information only partially; the “Data Frame”
still has to be cleaned and presented. The next statement separates the street
number from the street name; “as.numeric” converts the extracted values from
character to numeric.

The “gsub()” function replaces whatever a regular expression matches. Here
“[0-9]{1,4}” matches a run of 1 to 4 digits and the parentheses capture it as
a group, “ .*” matches the space and everything after it (“.” stands for any
character and “*” for any number of repetitions), and ‘‘\\1’’ in the
replacement stands for the captured group. Assigning the result to a new
column, “dat$streetno”, stores the numbers we isolated.
dat$streetno <- as.numeric(gsub("([0-9]{1,4}) .*", "\\1", dat$address))
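
The role of the parentheses and of "\\1" is easier to see when the same addresses are processed in both directions. A minimal sketch on three addresses taken from the listing (the vector address is typed in by hand for illustration):

```r
address <- c("725 Commonwealth Ave.", "373 W. Geneva St.", "5000 Forbes Ave.")

# Capture the digits: the whole match is replaced by group 1, the number
street_no <- as.numeric(gsub("([0-9]{1,4}) .*", "\\1", address))

# Capture what follows the digits and the space: group 1 is now the name
street_name <- gsub("[0-9]{1,4} (.*)", "\\1", address)

print(street_no)
print(street_name)
```

In both calls the pattern matches the entire address; only the position of the parentheses decides which part survives the replacement.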

To finish cleaning the "dat$address" column, this time we capture everything
after the number with "(.*)" and write only that part to a new column called
"dat$streetname"

dat$streetname <- gsub("[0-9]{1,4} (.*)", "\\1", dat$address)
dat$streetno2 <- paste(dat$streetname, dat$streetno, sep = " ")

View(dat)

dat$streetno2 <- NULL

The file contains the letter O in place of the digit 0 in some entries, which
can cause problems when the information is processed.
dat$zip <- gsub("O", "0", dat$zip)

The "str_trim" function removes the white space at the start and end of a
string.

dat$zip <- str_trim(dat$zip)

dat <- dat[, c(1:2, 7:8, 4:6)]

View(dat)

The “data.entry(dat)” function lets you view and edit the “Data Frame”
directly
data.entry(dat)

2.13 Exercise 2: Structuring data in R
setwd("E:/User/PathToFolder/Dia 6")
library(stringr)

dat <- read.csv("Horario.csv")

head(dat)

Horario.UANL X X.1 X.2


Date Shift Start Shift End Hours
Thu, Nov 1 11:46 AM 5:15 PM 5.48
Fri, Nov 2 7:52 AM 4:23 PM 8.52
Mon, Nov 5 6:59 AM 4:07 PM 9.13
Tue, Nov 6 6:58 AM 2:40 PM 7.7
Wed, Nov 7 7:03 AM 5:28 PM 10.42

tail(dat)

Horario.UANL X X.1 X.2


Date Shift Start Shift End Hours
Thu,Oct 4 7:00 AM 4:49 PM 9.82
Mon, Oct 7 6:56 AM 4:30 PM 9.57
Tue, Oct 8 6:55 AM 3:48 PM 8.88
Wed, Oct 9 10:49 AM 5:25 PM 6.6
Thu, Oct 10 10:40 AM 5:13 PM 6.55
Totals: 1679.57

dat <- read.csv("Horario.csv")
head(dat)

nrow(dat)

204

The first and the last row of the data frame always have to be removed.

“nrow(dat)” always gives the index of the last row of the data frame. Recall
that in R a “-” in front of an index inside the brackets excludes that element
from the result.

dat <- dat[-c(1, nrow(dat)), ]

colnames(dat) <- c("DayOfWeek_Month_Day", "Time_start", "Time_end_PM", "Hours")

View(dat)

“Thu, Nov 1” can be split with gsub
dat$DayOfWeek <- gsub("(.*),.*", "\\1", dat$DayOfWeek_Month_Day)
View(dat)

“str_extract” extracts the first piece of each element of the vector that
matches a pattern. “[0-9]+” matches digits only.
dat$Day <- as.numeric(str_extract(dat$DayOfWeek_Month_Day, "[0-9]+"))
View(dat)

[A-Z]+[a-z]+ matches the upper- and lower-case letters that follow the
characters “, ”
dat$DayOF_Month <- str_extract(dat$DayOfWeek_Month_Day, ", [A-Z]+[a-z]+")
View(dat)

We get rid of the “,”

dat$Month <- str_extract(dat$DayOF_Month, "[A-Z]+[a-z]+")
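
The same three pieces can also be pulled out with base R alone. The following is a sketch of an alternative, not the stringr approach used above: it swaps in sub(), regexpr() and regmatches(), and the vector dates is typed in by hand from values shown in the listing.

```r
dates <- c("Thu, Nov 1", "Fri, Nov 2", "Thu,Oct 4", "Thu, Oct 10")

# Everything from the comma onwards is removed, leaving the day of the week
day_of_week <- sub(",.*", "", dates)

# regexpr() locates the first run of digits and regmatches() extracts it
# (this assumes every element contains at least one digit)
day <- as.numeric(regmatches(dates, regexpr("[0-9]+", dates)))

# " *" tolerates a missing space after the comma, as in "Thu,Oct 4"
month <- sub(".*, *([A-Z][a-z]+).*", "\\1", dates)

print(day_of_week)
print(day)
print(month)
```

The optional space in the month pattern matters for entries such as "Thu,Oct 4", where a pattern that requires ", " would find no match.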

We remove the columns we no longer need and reorder the remaining ones.
dat$DayOfWeek_Month_Day <- NULL
dat$DayOF_Month <- NULL
dat <- dat[, c(4, 6, 5, 1:3)]
rownames(dat) <- NULL
View(dat)

summary(dat)

DayOfWeek Month Day
Length:202 Length:202 Min. : 1.00
Class :character Class :character 1st Qu.: 8.00
Mode :character Mode :character Median :16.00
Mean :15.84
3rd Qu.:23.00
Max. :31.00
Time start Time end PM Hours
7:00 AM: 11 3:40 PM: 4 8.65 : 6
6:56 AM: 9 4:13 PM: 4 8.45 : 4
6:59 AM: 9 5:19 PM: 4 8.55 : 4
7:01 AM: 9 3:16 PM: 3 8.7 : 4
6:51 AM: 8 3:33 PM: 3 10.42 : 3
6:54 AM: 8 3:41 PM: 3 8: 3
(Other):148 (Other):181 (Other):178

str(dat)

’data.frame’: 202 obs. of 6 variables:


$ DayOfWeek : chr "Thu" "Fri" "Mon" "Tue" ...
$ Month : chr "Nov" "Nov" "Nov" "Nov" ...
$ Day : num 1 2 5 6 7 8 9 12 13 14 ...
$ Time_start : Factor w/ 88 levels "","1:01 PM","10:02 AM",..: 15 74 45 44 49 7 3 38 51 46 ...
$ Time_end_PM: Factor w/ 143 levels "1:18 PM","1:19 PM",..:
117 90 78 36 126 128 89 58 6 109 ...
$ Hours : Factor w/ 152 levels "0.6","1.12","1.22",..:
49 105 128 82 14 69 61 113 64 6 ...

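Converting a Factor to numeric is less intuitive than it seems: “as.numeric()”
applied directly to a factor returns the internal level codes, so the values
have to be converted to character first.

dat$Hours <- as.numeric(as.character(dat$Hours))

summary(dat)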

DayOfWeek          Month              Day
Length:202         Length:202         Min.   : 1.00
Class :character   Class :character   1st Qu.: 8.00
Mode  :character   Mode  :character   Median :16.00
                                      Mean   :15.84
                                      3rd Qu.:23.00
                                      Max.   :31.00
Time start    Time end PM   Hours
7:00 AM: 11   3:40 PM: 4    Min.   : 0.600
6:56 AM: 9    4:13 PM: 4    1st Qu.: 7.285
6:59 AM: 9    5:19 PM: 4    Median : 8.630
7:01 AM: 9    3:16 PM: 3    Mean   : 8.190
6:51 AM: 8    3:33 PM: 3    3rd Qu.: 9.453
6:54 AM: 8    3:41 PM: 3    Max.   :14.280
(Other):148   (Other):181
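
Why the detour through as.character() is needed can be seen on a short factor. A minimal sketch (hours_f is built by hand from a few values that appear in the Hours column):

```r
hours_f <- factor(c("5.48", "8.52", "10.42", "8.52"))

# The levels are sorted as text, so "10.42" comes before "5.48"
print(levels(hours_f))

# Wrong: as.numeric() on a factor returns the internal level codes
codes <- as.numeric(hours_f)
print(codes)

# Right: convert to character first to recover the printed values
values <- as.numeric(as.character(hours_f))
print(values)
```

The codes are valid numbers, so the mistake raises no error or warning; it only shows up when the results are inspected.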

Oneto4Hr <- dat[dat$Hours > 1 & dat$Hours < 4, ]
Over5Hr <- dat[dat$Hours > 5 & dat$Hours < 6, ]
Over6Hr <- dat[dat$Hours > 6 & dat$Hours < 7, ]
Over7Hr <- dat[dat$Hours > 7 & dat$Hours < 8, ]
Over8Hr <- dat[dat$Hours > 8 & dat$Hours < 9, ]
Over9Hr <- dat[dat$Hours > 9 & dat$Hours < 10, ]
Over10Hr <- dat[dat$Hours > 10, ]

Oneto4Hr

DayOfWeek Month Day Time start Time end PM Hours
Fri Dec 7 8:20 AM 10:52 AM 2.53
Tue Dec 18 10:45 AM 2:31 PM 3.77
Thu Dec 20 11:43 AM 12:50 PM 1.12
Mon Jan 28 6:00 AM 7:13 AM 1.22
Fri Mar 8 8:31 AM 12:08 PM 3.62
Fri Apr 12 12:29 PM 2:40 PM 2.18
Fri Jul 19 9:36 AM 12:03 PM 2.45
Fri Jul 26 9:49 AM 1:27 PM 3.63

summary(dat$DayOfWeek)

Length Class Mode


202 character character

dat$DayOfWeek <- factor(dat$DayOfWeek)

dat$DayOfWeek <- factor(dat$DayOfWeek, levels = c("Mon", "Tue", "Wed", "Thu",
                        "Fri", "Sat", "Sun"))
dat$Month <- factor(dat$Month)
dat$Month <- factor(dat$Month, levels = c("Nov", "Dec", "Jan", "Feb", "Mar",
                    "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct"))

library(ggplot2)
library(reshape2)

One way to choose the bin width for a histogram: divide the range of the
values into equal intervals

min(dat$Hours)

0.6

max(dat$Hours)

14.28

x <- seq(min(dat$Hours), max(dat$Hours), length.out = 10)
x

0.60 2.12 3.64 5.16 6.68 8.20 9.72 11.24 12.76 14.28

y <- x[2] - x[1]

y

1.52
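
The value 1.52 follows directly from the range of the data: ten equally spaced points define nine intervals, so the width is (14.28 - 0.6) / 9. A minimal sketch that reproduces this with a handful of values from the Hours column (the short vector hours is an assumption standing in for the full column, but it includes the same minimum and maximum):

```r
hours <- c(0.6, 5.48, 8.52, 9.13, 7.7, 10.42, 14.28)

# 10 equally spaced points from min to max define 9 intervals
breaks <- seq(min(hours), max(hours), length.out = 10)
bin_width <- breaks[2] - breaks[1]

# Equivalent closed form: range divided by (number of points - 1)
bin_width_direct <- diff(range(hours)) / (10 - 1)

print(bin_width)
print(bin_width_direct)
```

Changing length.out changes the number of bins, so the resulting width is a starting point for exploration rather than a uniquely correct value.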

p <- ggplot(data = dat, aes(x = Hours))
p + geom_histogram(binwidth = y, aes(fill = DayOfWeek), colour = "Black")

p <- ggplot(data = dat, aes(x = Hours))
p + geom_histogram(binwidth = y, aes(fill = DayOfWeek), colour = "Black") +
  facet_grid(. ~ Month, scales = "free")

2.14 Introduction to Data Cleaning in R
# Install installr if it is missing, then load it and update R
if (!require(installr)) {
  install.packages("installr")
  require(installr)
}
updateR()

getwd()

setwd("E:/User/PathToFolder/Dia 6")

getwd()

("E:/User/PathToFolder/Dia 6")

# The factor columns shown by str() below assume stringsAsFactors = TRUE
# (the default before R 4.0.0)
fin <- read.csv("Future-500.csv")
head(fin)

Name Industry Inception Employees


Over-Hex Software 2006 25
Unimattax IT Services 2009 36
Greenfax Retail 2012 NA
Blacklane IT Services 2011 66
Yearflex Software 2013 45
Indigoplanet IT Services 2013 60

State City Revenue Expenses


TN Franklin 9,684,527 1,130,700 Dollars
PA Newtown Square 14,016,543 804,035 Dollars
SC Greenville 9,746,272 1,044,375 Dollars
CA Orange 15,359,369 4,631,808 Dollars
WI Madison 8,567,910 4,374,841 Dollars
NJ Manalapan 12,805,452 4,626,275 Dollars

Profit Growth
8553827 19
13212508 20
8701897 16
10727561 19
4193069 19
8179177 22

tail(fin)

Name Industry Inception Employees


Rawfishcomplete Financial Services 2012 124
Buretteadmirable IT Services 2009 93
Inventtremendous Construction 2009 24
Overviewparrot Retail 2011 7124
Belaguerra IT Services 2010 140
Allpossible IT Services 2011 24

State City Revenue


CA Los Angeles 10,624,949
ME Portland 15,407,450
MN Woodbury 9,144,857
TX Fort Worth 11,134,728
MI Troy 17,387,130
CA Los Angeles 11,949,706

Expenses Profit Growth


2,951,178 Dollars 7673771 22
2,833,136 Dollars 12574314 25
4,755,995 Dollars 4388862 11
5,152,110 Dollars 5982618 12
1,387,784 Dollars 15999346 23
689,161 Dollars 11260545 24
str(fin)

’data.frame’: 500 obs. of 11 variables:


$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : Factor w/ 500 levels "Abstractedchocolat",..:
297 451 168 40 485 199 435 339 242 395 ...
$ Industry : Factor w/ 8 levels "","Construction",..: 8 6 7
6 8 6 3 2 6 3 ...
$ Inception: int 2006 2009 2012 2011 2013 2013 2009 2013
2009 2010 ...
$ Employees: int 25 36 NA 66 45 60 116 73 55 25 ...
$ State : Factor w/ 43 levels "","AL","AZ","CA",..: 37
34 36 4 42 28 23 30 4 9 ...

$ City : Factor w/ 297 levels "Addison","Alexandria",..:
94 181 105 195 151 154 53 295 232 26 ...
$ Revenue : Factor w/ 499 levels "","$1,614,585",..: 480 195
486 247 403 142 309 1 97 118 ...
$ Expenses : Factor w/ 498 levels "","1,026,548 Dollars",..:
7 486 4 249 228 248 58 1 403 496 ...
$ Profit : int 8553827 13212508 8701897 10727561 4193069
8179177 3259485 NA 5274553 11412916 ...
$ Growth : Factor w/ 33 levels "","-2%","-3%",..: 15 17 12
15 15 19 13 1 27 17 ...

summary(fin)

ID Name
Min. : 1.0 Abstractedchocolat: 1
1st Qu.:125.8 Abusivebong : 1
Median :250.5 Acclaimedcirl : 1
Mean :250.5 Admitruppell : 1
3rd Qu.:375.2 Admonishbadelynge : 1
Max. :500.0 Ahemparticular : 1
(Other) :494
Industry Inception Employees

IT Services :146 Min. :1999 Min. : 1.00


Health : 86 1st Qu.:2009 1st Qu.: 27.25
Software : 64 Median :2011 Median : 56.00
Financial Services : 54 Mean :2010 Mean : 148.61
Construction : 50 3rd Qu.:2012 3rd Qu.: 126.00
Government Services: 50 Max. :2014 Max. :7125.00
(Other) : 50 NA’s :1 NA’s :2

State City Revenue


CA : 57 San Diego : 13 : 2
VA : 50 New York : 11 1,614,585 : 1
TX : 47 Reston : 10 1,835,717 : 1
FL : 34 Houston : 9 10,064,297: 1
MD : 25 Austin : 8 10,067,223: 1
NY : 25 Minneapolis: 8 10,072,452: 1
(Other):262 (Other) :441 (Other) :493

Expenses Profit Growth


: 3 Min. : 12434 20 : 39
1,026,548 Dollars: 1 1st Qu.: 3272074 19 : 35
1,040,662 Dollars: 1 Median : 6513366 17 : 27
1,044,375 Dollars: 1 Mean : 6539474 6 : 25
1,097,353 Dollars: 1 3rd Qu.: 9303951 12 : 24
1,117,206 Dollars: 1 Max. :19624534 16 : 24
(Other) :492 NA’s :2 (Other):326

Converting a variable to a factor
fin$ID

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52.................................... 464 465 466 467 468 469
470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485
486 487 488 489 490 491 492 493 494 495 496 497 498 499 500

factor(fin$ID)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 ....................................... 465 466 467 468 469 470 471 472
473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488
489 490 491 492 493 494 495 496 497 498 499 500 500 Levels: 1 2
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ... 500

fin$ID <- factor(fin$ID)
str(fin)

’data.frame’: 500 obs. of 11 variables:


$ ID : Factor w/ 500 levels "1","2","3","4",..:
1 2 3 4 5 6 7 8 9 10 ...
$ Name : Factor w/ 500 levels "Abstractedchocolat",..:
297 451 168 40 485 199 435 339 242 395 ...
$ Industry : Factor w/ 8 levels "","Construction",..: 8
6 7 6 8 6 3 2 6 3 ...
$ Inception: int 2006 2009 2012 2011 2013 2013 2009 2013
2009 2010 ...
$ Employees: int 25 36 NA 66 45 60 116 73 55 25 ...
$ State : Factor w/ 43 levels "","AL","AZ","CA",..:
37 34 36 4 42 28 23 30 4 9 ...
$ City : Factor w/ 297 levels "Addison","Alexandria"
,..: 94 181 105 195 151 154 53 295 232 26 ...
$ Revenue : Factor w/ 499 levels "","$1,614,585",..:
480 195 486 247 403 142 309 1 97 118 ...

$ Expenses : Factor w/ 498 levels "","1,026,548 Dollars"
,..: 7 486 4 249 228 248 58 1 403 496 ...
$ Profit : int 8553827 13212508 8701897 10727561
4193069 8179177 3259485 NA 5274553 11412916 ...
$ Growth : Factor w/ 33 levels "","-2%","-3%",..:
15 17 12 15 15 19 13 1 27 17 ...

fin$Inception <- factor(fin$Inception)
summary(fin)

Name .... .... Growth


Abstractedchocolat: 1 .... .... .... 20 : 39
Abusivebong : 1 ... .... 19: 35
Acclaimedcirl : 1 .... .... 17: 27
Admitruppell : 1 .... .... 6: 25
Admonishbadelynge : 1 .... .... 12: 24
Ahemparticular : 1 .... .... 16: 24
(Other) :494 .... .... (Other):326

str(fin)

’data.frame’: 500 obs. of 11 variables:


$ ID : Factor w/ 500 levels "1","2","3","4",..:
1 2 3 4 5 6 7 8 9 10 ...
$ Name : Factor w/ 500 levels "Abstractedchocolat",..:
297 451 168 40 485 199 435 339 242 395 ...
$ Industry : Factor w/ 8 levels "","Construction",..:
8 6 7 6 8 6 3 2 6 3 ...
$ Inception: Factor w/ 16 levels "1999","2000",..:
8 11 14 13 15 15 11 15 11 12 ...
$ Employees: int 25 36 NA 66 45 60 116 73 55 25 ...
$ State : Factor w/ 43 levels "","AL","AZ","CA",..:
37 34 36 4 42 28 23 30 4 9 ...
$ City : Factor w/ 297 levels "Addison","Alexandria",..:
94 181 105 195 151 154 53 295 232 26 ...
$ Revenue : Factor w/ 499 levels "","$1,614,585",..:
480 195 486 247 403 142 309 1 97 118 ...
$ Expenses : Factor w/ 498 levels "","1,026,548 Dollars",..:

7 486 4 249 228 248 58 1 403 496 ...
$ Profit : int 8553827 13212508 8701897 10727561 4193069
8179177 3259485 NA 5274553 11412916 ...
$ Growth : Factor w/ 33 levels "","-2%","-3%",..:
15 17 12 15 15 19 13 1 27 17 ...

A common mistake when working with factors arises when you want to remove the factor property from a vector.

a <- c("12", "13", "14", "12", "12")
a

12 13 14 12 12

typeof(a)

character

b <- as.numeric(a)
b

12 13 14 12 12

typeof(b)

double

# Naming a variable "c" masks nothing permanently, but it is easy to confuse with c()
c <- factor(a)
c

12 13 14 12 12 Levels: 12 13 14

typeof(c)

integer

y <- as.numeric(c)

Mistake: this returns the internal integer codes of the factor levels (the position of each level) instead of the original numbers.

y

1 2 3 1 1

typeof(y)

The correct way

x <- as.numeric(as.character(c))
x

12 13 14 12 12

typeof(x)

double
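
The difference between the two conversions can be sketched side by side. This is a minimal illustration using the same values as above; the variable names `f`, `codes`, `values` and `values_alt` are chosen here for the example. The last line shows an alternative idiom that converts only the levels and then indexes them by the factor.

```r
f <- factor(c("12", "13", "14", "12", "12"))

# Wrong: yields the internal level codes, not the original numbers.
codes <- as.numeric(f)

# Right: go through character first.
values <- as.numeric(as.character(f))

# Alternative: convert the levels once, then index them by the factor codes.
values_alt <- as.numeric(levels(f))[f]

print(codes)
print(values)
```

The alternative form does less work on long vectors with few levels, because only the levels are parsed as numbers.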

Where we could make this mistake

head(fin)

Industry Inception .... Profit Growth


Over-Hex Software .... 8553827 19
Unimattax IT Services .... 13212508 20
Greenfax Retail .... 8701897 16
Blacklane IT Services .... 10727561 19
Yearflex Software .... 4193069 19
Indigoplanet IT Services .... 8179177 22
str(fin)

’data.frame’: 500 obs. of 11 variables:


$ ID : Factor w/ 500 levels "1","2","3","4",..:
1 2 3 4 5 6 7 8 9 10 ...
$ Name : Factor w/ 500 levels "Abstractedchocolat",..:
297 451 168 40 485 199 435 339 242 395 ...

$ Industry : Factor w/ 8 levels "","Construction",..:
8 6 7 6 8 6 3 2 6 3 ...
.........................................................
$ Profit : int 8553827 13212508 8701897 10727561 4193069
8179177 3259485 NA 5274553 11412916 ...
$ Growth : Factor w/ 33 levels "","-2%","-3%",..:
15 17 12 15 15 19 13 1 27 17 ...

fin$Profit2 <- factor(fin$Profit)
head(fin)

Industry Inception ..... ..... Growth Profit2


Over-Hex Software ..... ..... 19 8553827
Unimattax IT Services ..... ..... 20 13212508
Greenfax Retail ..... ..... 16 8701897
Blacklane IT Services ..... ..... 19 10727561
Yearflex Software ..... ..... 19 4193069
Indigoplanet IT Services ..... ..... 22 8179177
str(fin)

’data.frame’: 500 obs. of 12 variables:


$ ID : Factor w/ 500 levels "1","2","3","4",..:
1 2 3 4 5 6 7 8 9 10 ...
$ Name : Factor w/ 500 levels "Abstractedchocolat",..:
297 451 168 40 485 199 435 339 242 395 ...
$ Industry : Factor w/ 8 levels "","Construction",..:
8 6 7 6 8 6 3 2 6 3
......................................................
$ Growth : Factor w/ 33 levels "","-2%","-3%",..:
15 17 12 15 15 19 13 1 27 17 ...
$ Profit2 : Factor w/ 498 levels "12434","46851",..:
342 476 348 420 150 321 125 NA 195 446 ...

summary(fin)

ID Name ..... ..... Profit2
1: 1 Abstractedchocolat: 1 ..... ..... 12434 : 1
2: 1 Abusivebong : 1 ..... ..... 46851 : 1
3: 1 Acclaimedcirl : 1 ..... ..... 53681 : 1
4: 1 Admitruppell : 1 ..... ..... 68862 : 1
5: 1 Admonishbadelynge : 1 ..... ..... 73350 : 1
6: 1 Ahemparticular : 1 ..... ..... (Other):493
(Other):494 (Other) :494 ..... ..... NA’s : 2
# Mistake on purpose: this stores the level codes, not the profit values
fin$Profit2 <- as.numeric(fin$Profit2)
str(fin)

’data.frame’: 500 obs. of 12 variables:


$ ID : Factor w/ 500 levels "1","2","3","4",..:
1 2 3 4 5 6 7 8 9 10 ...
$ Name : Factor w/ 500 levels "Abstractedchocolat",..:
297 451 168 40 485 199 435 339 242 395 ...
$ Industry : Factor w/ 8 levels "","Construction",..:
8 6 7 6 8 6 3 2 6 3 ...
(..............................................)
$ Growth : Factor w/ 33 levels "","-2%","-3%",..:
15 17 12 15 15 19 13 1 27 17 ...
$ Profit2 :
num 342 476 348 420 150 321 125 NA 195 446 ...

head(fin)

ID Name Industry ..... ..... Growth Profit2


1 Over-Hex Software ..... ..... 19 342
2 Unimattax IT Services ..... ..... 20 476
3 Greenfax Retail ..... ..... 16 348
4 Blacklane IT Services ..... ..... 19 420
5 Yearflex Software ..... ..... 19 150
6 Indigoplanet IT Services ..... ..... 22 321
fin$Profit2 <- NULL

Converting to numeric a factor that contains “chr” elements

“gsub” returns a character vector, so the factor property is removed.
gsub(“what to look for”, “what to replace it with”, “where to look”)

View(fin)

fin$Expenses <- gsub(" Dollars", "", fin$Expenses)
fin$Expenses <- gsub(",", "", fin$Expenses)
head(fin)
ID Name Industry ..... Profit Growth
1 Over-Hex Software ..... 8553827 19
2 Unimattax IT Services ..... 13212508 20
3 Greenfax Retail ..... 8701897 16
4 Blacklane IT Services ..... 10727561 19
5 Yearflex Software ..... 4193069 19
6 Indigoplanet IT Services ..... 8179177 22
$ is a special character in regular expressions, so it must be escaped as \\$

fin$Revenue <- gsub("\\$", "", fin$Revenue)
fin$Revenue <- gsub(",", "", fin$Revenue)
head(fin)
ID Name Industry ..... Profit Growth
1 Over-Hex Software ..... 8553827 19
2 Unimattax IT Services ..... 13212508 20
3 Greenfax Retail ..... 8701897 16
4 Blacklane IT Services ..... 10727561 19
5 Yearflex Software ..... 4193069 19
6 Indigoplanet IT Services ..... 8179177 22
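
The gsub() steps above can be sketched on a few sample strings. This is a minimal illustration, assuming made-up vectors that mimic the formats in the file; `clean_number` is a hypothetical helper written for this example and is not part of the course code. The last line shows that passing `fixed = TRUE` is another way to match a literal "$" without escaping it.

```r
revenue <- c("$9,684,527", "$14,016,543")
expenses <- c("1,130,700 Dollars", "804,035 Dollars")
growth <- c("19%", "-3%")

# Hypothetical helper: strip currency symbols, units, separators and percent signs.
clean_number <- function(x) {
  x <- gsub("\\$", "", x)        # "$" is a regex metacharacter, so it is escaped
  x <- gsub(" Dollars", "", x)
  x <- gsub(",", "", x)
  x <- gsub("%", "", x)
  as.numeric(x)
}

revenue_num <- clean_number(revenue)
expenses_num <- clean_number(expenses)
growth_num <- clean_number(growth)

# With fixed = TRUE the pattern is taken literally, so no escaping is needed.
literal <- gsub("$", "", "$5", fixed = TRUE)
```

Because gsub() always returns character, the final as.numeric() call is what produces numbers.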

str(fin)

’data.frame’: 500 obs. of 11 variables:


$ ID : Factor w/ 500 levels "1","2","3","4",..:
1 2 3 4 5 6 7 8 9 10 ...
$ Name : Factor w/ 500 levels "Abstractedchocolat",..:
297 451 168 40 485 199 435 339 242 395 ...
$ Industry : Factor w/ 8 levels "","Construction",..:
8 6 7 6 8 6 3 2 6 3 ...
$ Inception: Factor w/ 16 levels "1999","2000",..:
8 11 14 13 15 15 11 15 11 12 ...
$ Employees: int 25 36 NA 66 45 60 116 73 55 25 ...
$ State : Factor w/ 43 levels "","AL","AZ","CA",..:
37 34 36 4 42 28 23 30 4 9 ...
$ City : Factor w/ 297 levels "Addison","Alexandria",..:
94 181 105 195 151 154 53 295 232 26 ...
$ Revenue :
chr "9684527" "14016543" "9746272" "15359369" ...
$ Expenses :
chr "1130700" "804035" "1044375" "4631808" ...
$ Profit :
int 8553827 13212508 8701897 10727561 4193069 8179177
3259485 NA 5274553 11412916 ...
$ Growth :
Factor w/ 33 levels "","-2%","-3%",..: 15 17 12 15 15 19
13 1 27 17 ...

fin$Growth <- gsub("%", "", fin$Growth)
head(fin)

ID Name Industry ..... Profit Growth


1 Over-Hex Software ..... 8553827 19
2 Unimattax IT Services ..... 13212508 20
3 Greenfax Retail ..... 8701897 16
4 Blacklane IT Services ..... 10727561 19
5 Yearflex Software ..... 4193069 19
6 Indigoplanet IT Services ..... 8179177 22
str(fin)

’data.frame’: 500 obs. of 11 variables:
$ ID : Factor w/ 500 levels "1","2","3","4",..:
1 2 3 4 5 6 7 8 9 10 ...
$ Name : Factor w/ 500 levels "Abstractedchocolat",..:
297 451 168 40 485 199 435 339 242 395 ...
$ Industry : Factor w/ 8 levels "","Construction",..:
8 6 7 6 8 6 3 2 6 3 ...
$ Inception: Factor w/ 16 levels "1999","2000",..:
8 11 14 13 15 15 11 15 11 12 ...
$ Employees: int 25 36 NA 66 45 60 116 73 55 25 ...
$ State : Factor w/ 43 levels "","AL","AZ","CA",..:
37 34 36 4 42 28 23 30 4 9 ...
$ City : Factor w/ 297 levels "Addison","Alexandria",..:
94 181 105 195 151 154 53 295 232 26 ...
$ Revenue :
chr "9684527" "14016543" "9746272" "15359369" ...
$ Expenses :
chr "1130700" "804035" "1044375" "4631808" ...
$ Profit :
int 8553827 13212508 8701897 10727561 4193069 8179177
3259485 NA 5274553 11412916 ...
$ Growth : chr "19" "20" "16" "19" ...

Essentially, we are now just converting “chr” to “num”

fin$Expenses <- as.numeric(fin$Expenses)
fin$Revenue <- as.numeric(fin$Revenue)
fin$Growth <- as.numeric(fin$Growth)
str(fin)

’data.frame’: 500 obs. of 11 variables:


$ ID : Factor w/ 500 levels "1","2","3","4",..:
1 2 3 4 5 6 7 8 9 10 ...
$ Name : Factor w/ 500 levels "Abstractedchocolat",..:
297 451 168 40 485 199 435 339 242 395 ...
$ Industry : Factor w/ 8 levels "","Construction",..:
8 6 7 6 8 6 3 2 6 3 ...

$ Inception: Factor w/ 16 levels "1999","2000",..:
8 11 14 13 15 15 11 15 11 12 ...
$ Employees: int 25 36 NA 66 45 60 116 73 55 25 ...
$ State : Factor w/ 43 levels "","AL","AZ","CA",..:
37 34 36 4 42 28 23 30 4 9 ...
$ City : Factor w/ 297 levels "Addison",
"Alexandria",..: 94 181 105 195 151 154 53 295 232 26 ...
$ Revenue :
num 9684527 14016543 9746272 15359369 8567910 ...
$ Expenses :
num 1130700 804035 1044375 4631808 4374841 ...
$ Profit : int 8553827 13212508 8701897 10727561
4193069 8179177 3259485 NA 5274553 11412916 ...
$ Growth : num 19 20 16 19 19 22 17 NA 30 20 ...

summary(fin)

ID Name ..... Growth


1: 1 Abstractedchocolat: 1 ..... Min. :-3.00
2: 1 Abusivebong : 1 ..... 1st Qu.: 8.00
3: 1 Acclaimedcirl : 1 ..... Median :15.00
4: 1 Admitruppell : 1 ..... Mean :14.38
5: 1 Admonishbadelynge : 1 ..... 3rd Qu.:20.00
6: 1 Ahemparticular : 1 ..... Max. :30.00
(Other):494 (Other) :494 ..... NA’s :1

2.15 Data Cleaning in R, Continued
View(fin)
head(fin, 24)

“complete.cases” returns TRUE for the rows that have no missing values. Prefixing it with “!” selects the opposite: the rows that contain at least one NA.

fin[!complete.cases(fin), ]

complete.cases does not flag rows whose missing values were read as empty strings rather than NA. One fix is to re-import the file so that empty strings (“”) are read as NA, using the na.strings argument.

fin <- read.csv("Future-500.csv", na.strings = c(""))
fin[!complete.cases(fin), ]

ID Name Industry ... Inception Profit Growth


3 Greenfax Retail ... 2012 8701897 1
8 Rednimdox Construction ... 2013 NA <NA>
11 Canecorporation Health ... 2012 3005820 7
14 Techline <NA> ... 2006 8427816 23
15 Cityace <NA> ... 2010 3005116 6
17 Ganzlax IT Services ... 2011 11901180 18
22 Lathotline Health ... NA 1851070 2
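
The effect of na.strings can be sketched without the course file. This is a minimal illustration, assuming a tiny made-up CSV passed through the `text` argument of read.csv(); the column names and values are hypothetical. An empty text field is only counted as missing once it is read as NA.

```r
# Hypothetical three-row CSV: one empty text field and one empty numeric field.
csv_text <- paste(
  "Name,Industry,Employees",
  "Alpha,Retail,10",
  "Beta,,25",
  "Gamma,Health,",
  sep = "\n"
)

raw <- read.csv(text = csv_text)                         # "" stays an empty string
fixed <- read.csv(text = csv_text, na.strings = c(""))   # "" becomes NA

incomplete_raw <- sum(!complete.cases(raw))
incomplete_fixed <- sum(!complete.cases(fixed))

print(fixed[!complete.cases(fixed), ])
```

Blank numeric fields are read as NA either way, which is why only text columns hide their missing values.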

str(fin)

’data.frame’: 500 obs. of 11 variables:


$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : Factor w/ 500 levels "Abstractedchocolat",..:
297 451 168 40 485 199 435 339 242 395 ...
$ Industry : Factor w/ 7 levels "Construction",..:
7 5 6 5 7 5 2 1 5 2 ...
$ Inception:
int 2006 2009 2012 2011 2013 2013 2009 2013 2009 2010

.........................................................
$ Profit : int 8553827 13212508 8701897 10727561 4193069
8179177 3259485 NA 5274553 11412916 ...
$ Growth : Factor w/ 32 levels "-2%","-3%","0%",..:
14 16 11 14 14 18 12 NA 26 16 ...

head(fin)

Name Industry ... Profit Growth


Over-Hex Software ... 8553827 19
Unimattax IT Services ... 13212508 20
Greenfax Retail ... 8701897 16
Blacklane IT Services ... 10727561 19
Yearflex Software ... 4193069 19
Indigoplanet IT Services ... 8179177 22
fin[fin$Employees == 45, ]

ID Name ... Profit Growth


NA <NA> ... NA <NA>
5 Yearflex ... 4193069 19
137 Toughcare ... 6633554 14
183 Ittech ... 4589251 20
200 Lalane ... 7527175 14
208 Countslovenly ... 166462 10
245 Peskyevaluate ... 8727201 23
NA <NA> ... NA <NA>
360 Remembergabbro ... 6363466 12
380 Pickyfive ... 10368276 26
435 Lucrepickled ... 9382538 17
487 Genusequ ... 2756691 11

Filtering: “which()” returns only the positions where the condition is TRUE, so rows where the comparison yields NA are left out.

fin[which(fin$Employees == 45), ]

ID Name Industry ... Profit Growth
5 Yearflex Software ... 4193069 19
137 Toughcare Retail ... 6633554 14
183 Ittech IT Services ... 4589251 20
200 Lalane Retail ... 7527175 14
208 Countslovenly Construction ... 166462 10
245 Peskyevaluate IT Services ... 8727201 23
360 Remembergabbro Construction ... 6363466 12
380 Pickyfive IT Services ... 10368276 26
435 Lucrepickled IT Services ... 9382538 17
487 Genusequ Construction ... 2756691 11
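
The difference between the two filters can be sketched on a small data frame. This is a minimal illustration with made-up values; `df`, `with_na` and `without_na` are names chosen for the example. A logical index containing NA produces all-NA rows, while which() drops those positions.

```r
df <- data.frame(ID = 1:5, Employees = c(45, NA, 30, 45, NA))

# The comparison yields TRUE, NA, FALSE, TRUE, NA; each NA becomes an all-NA row.
with_na <- df[df$Employees == 45, ]

# which() keeps only the positions where the comparison is TRUE.
without_na <- df[which(df$Employees == 45), ]

print(with_na)
print(without_na)
```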
head(fin, 24)

ID Name Industry ... Profit Growth


1 Over-Hex Software ... 8553827 19
2 Unimattax IT Services ... 13212508 20
3 Greenfax Retail ... 8701897 16
4 Blacklane IT Services ... 10727561 19
5 Yearflex Software ... 4193069 19
6 Indigoplanet IT Services ... 8179177 22
7 Treslam Financial Services ... 3259485 17
8 Rednimdox Construction ... NA <NA>
9 Lamtone IT Services ... 5274553 30
10 Stripfind Financial Services ... 11412916 20
Error: comparing against NA with == yields NA for every element.
fin$Expenses == NA

NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA (.....................................................) NA NA NA
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA

fin[fin$Expenses == NA, ]

ID Name Industry Inception Employees ... Profit Growth
NA NA NA NA NA ... NA NA
NA NA NA NA NA ... NA NA
NA NA NA NA NA ... NA NA
NA NA NA NA NA ... NA NA
NA NA NA NA NA ... NA NA
NA NA NA NA NA ... NA NA
NA NA NA NA NA ... NA NA
NA NA NA NA NA ... NA NA
NA NA NA NA NA ... NA NA
NA <NA> <NA> NA NA ... NA NA

Filtering: using “is.na()” to find missing information.

a <- c(1, 24, 543, NA, 76, 45, NA)
is.na(a)

FALSE FALSE FALSE TRUE FALSE FALSE TRUE

This produces a logical vector that is TRUE wherever the condition holds. It is very similar to what we do with “which()”, but in this case it only detects NA elements.

is.na(fin$Expenses)

FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE


FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE TRUE FALSE FALSE FALSE (...............)
FALSE FALSE FALSE FALSE FALSE FALSE
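
The contrast between `== NA` and is.na() can be sketched on the short vector used above. This is a minimal illustration; `eq_result`, `na_mask` and the derived names are chosen for the example. The mask returned by is.na() can be counted, located, or negated to keep the known values.

```r
a <- c(1, 24, 543, NA, 76, 45, NA)

eq_result <- a == NA      # every element is NA, so this cannot be used as a filter
na_mask <- is.na(a)       # TRUE exactly at the missing positions

n_missing <- sum(na_mask)       # how many values are missing
positions <- which(na_mask)     # where they are
known <- a[!na_mask]            # the values that are present

print(positions)
```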

fin[is.na(fin$Expenses), ]

ID Name Industry Inception
8 Rednimdox Construction 2013
17 Ganzlax IT Services 2011
44 Ganzgreen Construction 2010
Employees State City Revenue
73 NY Woodside <NA>
75 NJ Iselin 14,001,180
224 TN Franklin <NA>
Expenses Profit Growth
<NA> NA <NA>
<NA> 11901180 18
<NA> NA 9

fin[is.na(fin$State), ]

ID Name Industry Inception


11 Canecorporation Health 2012
84 Drilldrill Software 2010
267 Circlechop Software 2010
379 Stovepuck Retail 2013
Employees State City Revenue
6 NA New York 10,597,009
30 NA San Francisco 7,800,620
73 NA New York 13,814,975
14 NA San Francisco 9,067,070
Expenses Profit Growth
7,591,189 Dollars 3005820 7
2,785,799 Dollars 5014821 17
5,904,502 Dollars 7910473 20
5,929,828 Dollars 3137242 10

Removing observations for which information is missing.

fin_backup <- fin   # keep a backup copy before deleting rows
fin <- fin_backup   # run this line to restore the backup if needed

This command returns the number of rows that contain at least one missing value, as a consequence of the “!”.

nrow(fin[!complete.cases(fin), ])

12

fin[is.na(fin$Industry), ]

ID Name Industry Inception Employees


14 Techline <NA> 2006 65
15 Cityace <NA> 2010 25
State City Revenue Expenses
CA San Ramon 13,898,119 5,470,303 Dollars
CO Louisville 9,254,614 6,249,498 Dollars
Profit Growth
8427816 23
3005116 6

This command returns the number of rows in which Industry is not NA, as a consequence of the “!”.

nrow(fin[!is.na(fin$Industry), ])

498

Keep only the rows in which the Industry column is not NA.

fin <- fin[!is.na(fin$Industry), ]
fin

ID Name Industry Inception
1 Over-Hex Software 2006
2 Unimattax IT Services 2009
3 Greenfax Retail 2012
4 Blacklane IT Services 2011
... ... ... ...

Employees State City Revenue


25 TN Franklin 9,684,527
36 PA Newtown Square 14,016,543
NA SC Greenville 9,746,272
66 CA Orange 15,359,369
45 WI Madison 8,567,910
... ... ... ...

Expenses Profit Growth


1,130,700 Dollars 8553827 19
804,035 Dollars 13212508 20
1,044,375 Dollars 8701897 16
4,631,808 Dollars 10727561 19
4,374,841 Dollars 4193069 19
... ... ...
nrow(fin[!complete.cases(fin), ])

10

fin[!complete.cases(fin), ]

ID Name Industry Inception
3 Greenfax Retail 2012
8 Rednimdox Construction 2013
11 Canecorporation Health 2012
17 Ganzlax IT Services 2011
22 Lathotline Health NA
44 Ganzgreen Construction 2010
84 Drilldrill Software 2010
267 Circlechop Software 2010
332 Westminster Financial Services 2010
379 Stovepuck Retail 2013

Employees State City Revenue


NA SC Greenville 9,746,272
73 NY Woodside NA
6 NA New York 10,597,009
75 NJ Iselin 14,001,180
103 VA McLean 9,418,303
224 TN Franklin NA
30 NA San Francisco 7,800,620
14 NA San Francisco 9,067,070
NA MI Troy 11,861,652
73 NA New York 13,814,975

Expenses Profit Growth


1,044,375 Dollars 8701897 16
NA NA NA
7,591,189 Dollars 3005820 7
NA 11901180 18
7,567,233 Dollars 1851070 2
NA NA 9
2,785,799 Dollars 5014821 17
5,929,828 Dollars 3137242 20
5,245,126 Dollars 6616526 15
5,904,502 Dollars 7910473 10

Resetting the row index of the data frame.

Option 1: not as intuitive.

rownames(fin) <- 1:nrow(fin)

Option 2: the recommended one.

rownames(fin) <- NULL

fin

ID Name Industry Inception


1 Over-Hex Software 2006
2 Unimattax IT Services 2009
3 Greenfax Retail 2012
4 Blacklane IT Services 2011
... ... ... ...

Employees State City Revenue


25 TN Franklin 9,684,527
36 PA Newtown Square 14,016,543
NA SC Greenville 9,746,272
66 CA Orange 15,359,369
45 WI Madison 8,567,910
... ... ... ...

Expenses Profit Growth


1,130,700 Dollars 8553827 19
804,035 Dollars 13212508 20
1,044,375 Dollars 8701897 16
4,631,808 Dollars 10727561 19
4,374,841 Dollars 4193069 19
... ... ...
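
The effect of resetting the row names can be sketched on a small data frame. This is a minimal illustration with made-up values; `df`, `kept`, `before` and `after` are names chosen for the example. After rows are dropped, the surviving rows keep their old labels until the row names are reset.

```r
df <- data.frame(
  ID = c(1, 2, 3, 4),
  Industry = c("Retail", NA, "Health", NA)
)

# Dropping rows leaves gaps in the row names.
kept <- df[!is.na(df$Industry), ]
before <- rownames(kept)

# Assigning NULL renumbers the rows consecutively from 1.
rownames(kept) <- NULL
after <- rownames(kept)

print(before)
print(after)
```

This matters whenever rows are later addressed by position, since position and old label no longer agree after a deletion.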

2.15.1 Replacing Missing Data: Factual Analysis

fin[!complete.cases(fin), ]

ID Name Industry Inception


3 Greenfax Retail 2012
8 Rednimdox Construction 2013
11 Canecorporation Health 2012
17 Ganzlax IT Services 2011
22 Lathotline Health NA
44 Ganzgreen Construction 2010
84 Drilldrill Software 2010
267 Circlechop Software 2010
332 Westminster Financial Services 2010
379 Stovepuck Retail 2013

Employees State City Revenue


NA SC Greenville 9,746,272
73 NY Woodside NA
6 NA New York 10,597,009
75 NJ Iselin 14,001,180
103 VA McLean 9,418,303
224 TN Franklin NA
30 NA San Francisco 7,800,620
14 NA San Francisco 9,067,070
NA MI Troy 11,861,652
73 NA New York 13,814,975

Expenses Profit Growth


1,044,375 Dollars 8701897 16
NA NA NA
7,591,189 Dollars 3005820 7
NA 11901180 18
7,567,233 Dollars 1851070 2
NA NA 9
2,785,799 Dollars 5014821 17
5,929,828 Dollars 3137242 20
5,245,126 Dollars 6616526 15
5,904,502 Dollars 7910473 10

From the “City” column, we can easily work out the “State” each row belongs to.

fin[is.na(fin$State), ]

ID Name Industry Inception Employees


11 Canecorporation Health 2012 6
84 Drilldrill Software 2010 30
267 Circlechop Software 2010 14
379 Stovepuck Retail 2013 73

State City Revenue


NA New York 10,597,009
NA San Francisco 7,800,620
NA San Francisco 9,067,070
NA New York 13,814,975

Expenses Profit Growth


7,591,189 Dollars 3005820 7
2,785,799 Dollars 5014821 17
5,929,828 Dollars 3137242 20
5,904,502 Dollars 7910473 10
fin[is.na(fin$State) & fin$City == "New York", ]

ID Name Industry Inception Employees


11 Canecorporation Health 2012 6
379 Stovepuck Retail 2013 73

State City Revenue


NA New York 10,597,009
NA New York 13,814,975
Expenses Profit Growth
7,591,189 Dollars 3005820 7
5,904,502 Dollars 7910473 10

Adding “State” after the comma accesses that column specifically, so assigning a value to this selection overwrites it.

fin[is.na(fin$State) & fin$City == "New York", "State"] <- "NY"

fin[c(11, 377), ]  # Checking the positions that had the NA

ID Name Industry Inception Employees


11 Canecorporation Health 2012 6
379 Stovepuck Retail 2013 73

State City Revenue


NY New York 10,597,009
NY New York 13,814,975
Expenses Profit Growth
7,591,189 Dollars 3005820 7
5,904,502 Dollars 7910473 10

We repeat the above, now with “San Francisco”.

fin[is.na(fin$State) & fin$City == "San Francisco", "State"] <- "CA"

fin[c(82, 265), ]

ID Name Industry Inception Employees


84 Drilldrill Software 2010 30
267 Circlechop Software 2010 14

State City Revenue Expenses


CA San Francisco 7,800,620 2,785,799 Dollars
CA San Francisco 9,067,070 5,929,828 Dollars

Profit Growth
5014821 17
3137242 20
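
The two city-by-city assignments above follow one pattern, which can be sketched with a lookup vector. This is a minimal illustration on made-up rows; `city_to_state` is a hypothetical lookup introduced for the example, and the columns are kept as character (`stringsAsFactors = FALSE`) so that any state code can be assigned.

```r
df <- data.frame(
  City = c("New York", "San Francisco", "New York", "Austin"),
  State = c(NA, NA, "NY", "TX"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup from city name to state code.
city_to_state <- c("New York" = "NY", "San Francisco" = "CA")

# Fill only the rows where State is missing and the city is in the lookup.
missing <- is.na(df$State) & df$City %in% names(city_to_state)
df$State[missing] <- unname(city_to_state[df$City[missing]])

print(df)
```

With factor columns, the assigned value must already be one of the existing levels; otherwise R stores NA and issues a warning.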
nrow(fin[!complete.cases(fin), ])

Replacing missing data: the median imputation method for “Employees” and “Growth”

fin[!complete.cases(fin$Employees), ]

ID Name Industry Inception


3 Greenfax Retail 2012
332 Westminster Financial Services 2010

Employees State City Revenue


NA SC Greenville 9,746,272
NA MI Troy 11,861,652
Expenses Profit Growth
1,044,375 Dollars 8701897 16
5,245,126 Dollars 6616526 15

[row, column]: compute the median of fin[rows where “Industry” == “Retail”, column “Employees”].
na.rm = TRUE leaves the NA values in the column out of the calculation.

med_empl_retail <- median(fin[fin$Industry == "Retail", "Employees"], na.rm = TRUE)
med_empl_retail

28

mean(fin[fin$Industry == "Retail", "Employees"], na.rm = TRUE)

209.2766

fin[is.na(fin$Employees) & fin$Industry == "Retail", ]

ID Name Industry Inception


3 Greenfax Retail 2012
Employees State City Revenue
NA SC Greenville 9,746,272
Expenses Profit Growth
1,044,375 Dollars 8701897 16

fin[is.na(fin$Employees) & fin$Industry == "Retail", "Employees"] <- med_empl_retail

fin[3, ]

ID Name Industry Inception


3 Greenfax Retail 2012
Employees State City Revenue
28 SC Greenville 9,746,272
Expenses Profit Growth
1,044,375 Dollars 8701897 16
med_empl_Fservices <- median(fin[fin$Industry == "Financial Services", "Employees"], na.rm = TRUE)
med_empl_Fservices

80

fin[is.na(fin$Employees) & fin$Industry == "Financial Services", ]

ID Name Industry Inception


332 Westminster Financial Services 2010

Employees State City Revenue


NA MI Troy 11,861,652

Expenses Profit Growth


5,245,126 Dollars 6616526 15
fin[is.na(fin$Employees) & fin$Industry == "Financial Services", "Employees"] <- med_empl_Fservices

fin[330, ]

ID Name Industry Inception


332 Westminster Financial Services 2010

Employees State City Revenue


80 MI Troy 11,861,652

Expenses Profit Growth


5,245,126 Dollars 6616526 15
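
The per-industry median replacement above can be sketched as a loop over industries. This is a minimal illustration on made-up rows, assuming the Industry column itself has no NA; `df`, `in_group` and `med` are names chosen for the example. Each missing Employees value is replaced by the median of its own industry.

```r
df <- data.frame(
  Industry = c("Retail", "Retail", "Retail", "Health", "Health"),
  Employees = c(10, NA, 30, 100, NA)
)

for (ind in unique(df$Industry)) {
  in_group <- df$Industry == ind
  # Median of the known values in this industry only.
  med <- median(df$Employees[in_group], na.rm = TRUE)
  # Overwrite only the missing entries of this industry.
  df$Employees[in_group & is.na(df$Employees)] <- med
}

print(df)
```

The median is used rather than the mean because a few very large companies can pull the mean far from a typical value, as the Retail figures above show.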

nrow(fin[!complete.cases(fin), ])

fin$Growth <- gsub("%", "", fin$Growth)
fin$Growth <- as.numeric(fin$Growth)
med_growth_constr <- median(fin[fin$Industry == "Construction", "Growth"], na.rm = TRUE)
med_growth_constr

10

fin[is.na(fin$Growth) & fin$Industry == "Construction", ]

ID Name Industry Inception Employees


State City Revenue Expenses Profit
Growth
fin[is.na(fin$Growth) & fin$Industry == "Construction", "Growth"] <- med_growth_constr

fin[8, ]

ID Name Industry Inception


8 Rednimdox Construction 2013
Employees State City Revenue
73 NY Woodside NA
Expenses Profit Growth
NA NA 10

fin[!complete.cases(fin), ]

ID Name Industry Inception
8 Rednimdox Construction 2013
17 Ganzlax IT Services 2011
22 Lathotline Health NA
44 Ganzgreen Construction 2010

Employees State City Revenue


73 NY Woodside NA
75 NJ Iselin 14,001,180
103 VA McLean 9,418,303
224 TN Franklin NA

Expenses Profit Growth


NA NA 10
NA 11901180 18
7,567,233 Dollars 1851070 2
NA NA 9

To fill in “Revenue” when “Expenses” and “Profit” are not available either, the median can be used as an estimate.

fin$Revenue <- gsub("\\$", "", fin$Revenue)
fin$Revenue <- gsub(",", "", fin$Revenue)
fin$Revenue <- as.numeric(as.character(fin$Revenue))

med_rev_constr <- median(fin[fin$Industry == "Construction", "Revenue"], na.rm = TRUE)
med_rev_constr

9055059

fin[is.na(fin$Revenue) & fin$Industry == "Construction", ]

ID Name Industry Inception
8 Rednimdox Construction 2013
44 Ganzgreen Construction 2010

Employees State City Revenue


73 NY Woodside NA
224 TN Franklin NA

Expenses Profit Growth


NA NA 10
NA NA 9
fin[is.na(fin$Revenue) & fin$Industry == "Construction", "Revenue"] <- med_rev_constr

fin[!complete.cases(fin), ]

ID Name Industry Inception


8 Rednimdox Construction 2013
17 Ganzlax IT Services 2011
22 Lathotline Health NA
44 Ganzgreen Construction 2010

Employees State City Revenue


73 NY Woodside 9055059
75 NJ Iselin 14,001,180
103 VA McLean 9,418,303
224 TN Franklin 9055059

Expenses Profit Growth


NA NA 10
NA 11901180 18
7,567,233 Dollars 1851070 2
NA NA 9

Be careful: we only want to replace the values for which both “Expenses” and “Profit” are NA. How can we specify this with df[ , ]?

fin$Expenses <- gsub(" Dollars", "", fin$Expenses)
fin$Expenses <- gsub(",", "", fin$Expenses)
fin$Expenses <- as.numeric(fin$Expenses)

med_exp_constr <- median(fin[fin$Industry == "Construction", "Expenses"], na.rm = TRUE)
med_exp_constr

4506976

fin[is.na(fin$Expenses) & fin$Industry == "Construction" & is.na(fin$Profit), ]

ID Name Industry Inception


8 Rednimdox Construction 2013
44 Ganzgreen Construction 2010

Employees State City Revenue


73 NY Woodside 9055059
224 TN Franklin 9055059

Expenses Profit Growth


NA NA 10
NA NA 9
fin[is.na(fin$Expenses) & fin$Industry == "Construction" & is.na(fin$Profit), "Expenses"] <- med_exp_constr

fin[!complete.cases(fin), ]

ID Name Industry Inception
8 Rednimdox Construction 2013
17 Ganzlax IT Services 2011
22 Lathotline Health NA
44 Ganzgreen Construction 2010

Employees State City Revenue


73 NY Woodside 9055059
75 NJ Iselin 14,001,180
103 VA McLean 9,418,303
224 TN Franklin 9055059

Expenses Profit Growth


4506976 NA 10
NA 11901180 18
7,567,233 Dollars 1851070 2
4506976 NA 9
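
The combined row filter used above can be tried on a small made-up data frame. This is only a minimal sketch: the `toy` data frame and its values are invented for illustration and are not part of the course data set. It fills `Expenses` with the industry median only in rows where both `Expenses` and `Profit` are missing.

```r
# Toy data (hypothetical values, not the course file)
toy <- data.frame(
  Industry = c("Construction", "Construction", "Construction", "IT Services"),
  Expenses = c(100, NA, 300, NA),
  Profit   = c(10, NA, 30, 50)
)

# Median of the known Construction expenses
med_exp <- median(toy[toy$Industry == "Construction", "Expenses"], na.rm = TRUE)

# Rows where Expenses AND Profit are both NA, restricted to Construction
rows <- is.na(toy$Expenses) & is.na(toy$Profit) & toy$Industry == "Construction"
toy[rows, "Expenses"] <- med_exp

toy
```

The IT Services row keeps its NA on purpose: its Profit is known, so its Expenses is better derived from the other columns than imputed with a median.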

Replacing values: values that can be derived from the available information
Revenue - Expenses = Profit
Expenses = Revenue - Profit

fin[is.na(fin$Profit), "Profit"] <- fin[is.na(fin$Profit), "Revenue"] - fin[is.na(fin$Profit), "Expenses"]
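
The two identities above can be checked on a tiny example. The sketch below assumes a made-up data frame `toy` (hypothetical values) with one missing Profit and one missing Expenses, and fills each from the other two columns.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  Revenue  = c(1000, 2000, 3000),
  Expenses = c(400, NA, 1200),
  Profit   = c(NA, 500, 1800)
)

# Profit = Revenue - Expenses, only where Profit is missing
no_profit <- is.na(toy$Profit)
toy[no_profit, "Profit"] <- toy[no_profit, "Revenue"] - toy[no_profit, "Expenses"]

# Expenses = Revenue - Profit, only where Expenses is missing
no_exp <- is.na(toy$Expenses)
toy[no_exp, "Expenses"] <- toy[no_exp, "Revenue"] - toy[no_exp, "Profit"]

toy
```

Storing the logical index in a variable (`no_profit`, `no_exp`) avoids recomputing `is.na()` three times per assignment and guarantees that the same rows are used on both sides.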

fin[c(8, 42), ]

ID Name Industry Inception


8 Rednimdox Construction 2013
44 Ganzgreen Construction 2010

Employees State City Revenue


73 NY Woodside 9055059
224 TN Franklin 9055059

Expenses Profit Growth


4506976 4548083 10
4506976 4548083 9

fin[!complete.cases(fin), ]

ID Name Industry Inception


17 Ganzlax IT Services 2011
22 Lathotline Health NA
Employees State City Revenue
75 NJ Iselin 14,001,180
103 VA McLean 9,418,303
Expenses Profit Growth
NA 11901180 18
7,567,233 Dollars 1851070 2
fin[is.na(fin$Expenses), "Expenses"] <- fin[is.na(fin$Expenses), "Revenue"] - fin[is.na(fin$Expenses), "Profit"]

Checking the positions that had the NA

fin[15, ]

ID Name Industry Inception


17 Ganzlax IT Services 2011
Employees State City Revenue
75 NJ Iselin 14,001,180
Expenses Profit Growth
NA 11901180 18
fin[!complete.cases(fin), ]

ID Name Industry Inception


22 Lathotline Health NA
Employees State City Revenue
103 VA McLean 9,418,303
Expenses Profit Growth
7,567,233 Dollars 1851070 2
fin <- fin[!is.na(fin$Inception), ]
fin[!complete.cases(fin), ]

ID Name Industry Inception Employees


State City Revenue Expenses Profit
Growth

3 Stage 3
3.1 Data Handling Exercise in R
setwd("E:/User/PathToFolder/Dia 6")

Auto <- read.csv("Automobile_data.csv")

1.- Print the first and last 5 rows of the file

head(Auto, 5)

index company body-style wheel-base length engine-type


0 0 alfa-romero convertible 88.6 168.8 dohc
1 1 alfa-romero convertible 88.6 168.8 dohc
2 2 alfa-romero hatchback 94.5 171.2 ohcv
3 3 audi sedan 99.8 176.6 ohc
4 4 audi sedan 99.4 176.6 ohc

num-of-cylinders horsepower average mileage price


0 four 111 21 13495.0
1 four 111 21 16500.0
2 six 154 19 16500.0
3 four 102 24 13950.0
4 five 115 18 17450.0
tail(Auto, 5)

index company body-style wheel-base length engine-type


56 81 volkswagen sedan 97.3 171.7 ohc
57 82 volkswagen sedan 97.3 171.7 ohc
58 86 volkswagen sedan 97.3 171.7 ohc
59 87 volvo sedan 104.3 188.8 ohc
60 88 volvo wagon 104.3 188.8 ohc

num-of-cylinders horsepower average mileage price


56 four 85 27 7975.0
57 four 52 37 7995.0
58 four 100 26 9995.0
59 four 114 23 12940.0
60 four 114 23 13415.0

2.- Find the row of the most expensive car

Auto[which(Auto$price == max(Auto$price, na.rm = TRUE)), ]

index company body style wheel.base length


36 47 mercedes-benz hardtop 112 199.2

engine.type num.of.cylinders horsepower average.mileage


36 ohcv eight 184 14

price
36 45400
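
The `which(x == max(x, na.rm = TRUE))` idiom returns every position that holds the maximum, while `which.max()` returns only the first one. A minimal sketch with made-up prices (the vector below is hypothetical) shows the difference when there is a tie and a missing value:

```r
price <- c(13495, 16500, NA, 16500, 9995)  # hypothetical prices with a tie and an NA

all_max   <- which(price == max(price, na.rm = TRUE))  # every position holding the maximum
first_max <- which.max(price)                          # first position only, NA is ignored

all_max
first_max
```

Which form is preferable depends on whether tied rows should all be reported, as in the exercise, or only one of them.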

3.- Find the rows that belong to the "toyota" brand

Auto[Auto$company == "toyota", ]

index company body style wheel base length


49 66 Toyota hatchback 95.7 158.7
50 67 Toyota hatchback 95.7 158.7
51 68 Toyota hatchback 95.7 158.7
52 69 Toyota wagon 95.7 169.7
53 70 Toyota wagon 104.3 169.7

engine.type num.of.cylinders horsepower average mileage price


49 ohc four 62 35 5348
50 ohc four 62 31 6338
51 ohc four 62 31 6488
52 ohc four 62 31 6918
53 ohc four 62 27 7898

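4.- Count the total number of cars per company

# levels() only works on a factor; since R 4.0.0 read.csv() returns character
# columns unless stringsAsFactors = TRUE, so the column is converted first
Auto$company <- as.factor(Auto$company)
levels(Auto$company)

table(Auto$company)
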
alfa-romero audi bmw chevrolet
3 4 6 3
dodge honda isuzu jaguar
2 3 3 3
mazda mercedes-benz mitsubishi nissan
5 4 4 5
porsche toyota volkswagen volvo
3 7 4 2
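
A short sketch of how `table()` and `levels()` behave on character versus factor data, using a made-up `company` vector (hypothetical values). It may help when the column arrives as character, which is the `read.csv()` default since R 4.0.0.

```r
company <- c("audi", "bmw", "audi", "volvo", "bmw", "audi")  # hypothetical values

counts <- table(company)     # counts per distinct value, works on character too
counts[["audi"]]

levels(company)              # NULL: a character vector has no levels
levels(factor(company))      # the distinct values, sorted
sort(unique(company))        # same names without converting to factor
```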

5.- Find the most expensive car for each company

Option 1: subsets

alfa_romero <- Auto[Auto$company == "alfa-romero", ]
audi <- Auto[Auto$company == "audi", ]
bmw <- Auto[Auto$company == "bmw", ]
chevrolet <- Auto[Auto$company == "chevrolet", ]
dodge <- Auto[Auto$company == "dodge", ]
honda <- Auto[Auto$company == "honda", ]
isuzu <- Auto[Auto$company == "isuzu", ]
jaguar <- Auto[Auto$company == "jaguar", ]
mazda <- Auto[Auto$company == "mazda", ]
mercedes_benz <- Auto[Auto$company == "mercedes-benz", ]
mitsubishi <- Auto[Auto$company == "mitsubishi", ]
nissan <- Auto[Auto$company == "nissan", ]
porsche <- Auto[Auto$company == "porsche", ]
toyota <- Auto[Auto$company == "toyota", ]
volkswagen <- Auto[Auto$company == "volkswagen", ]
volvo <- Auto[Auto$company == "volvo", ]

MaxpriceAr <- alfa_romero[which(alfa_romero$price == max(alfa_romero$price, na.rm = TRUE)), ]
MaxpriceAu <- audi[which(audi$price == max(audi$price, na.rm = TRUE)), ]
MaxpriceBw <- bmw[which(bmw$price == max(bmw$price, na.rm = TRUE)), ]
MaxpriceCh <- chevrolet[which(chevrolet$price == max(chevrolet$price, na.rm = TRUE)), ]
MaxpriceDo <- dodge[which(dodge$price == max(dodge$price, na.rm = TRUE)), ]
MaxpriceHo <- honda[which(honda$price == max(honda$price, na.rm = TRUE)), ]
MaxpriceIs <- isuzu[which(isuzu$price == max(isuzu$price, na.rm = TRUE)), ]
MaxpriceJa <- jaguar[which(jaguar$price == max(jaguar$price, na.rm = TRUE)), ]
MaxpriceMa <- mazda[which(mazda$price == max(mazda$price, na.rm = TRUE)), ]
MaxpriceMe <- mercedes_benz[which(mercedes_benz$price == max(mercedes_benz$price, na.rm = TRUE)), ]
MaxpriceMi <- mitsubishi[which(mitsubishi$price == max(mitsubishi$price, na.rm = TRUE)), ]
MaxpriceNi <- nissan[which(nissan$price == max(nissan$price, na.rm = TRUE)), ]
MaxpricePo <- porsche[which(porsche$price == max(porsche$price, na.rm = TRUE)), ]
MaxpriceTo <- toyota[which(toyota$price == max(toyota$price, na.rm = TRUE)), ]
MaxpriceVk <- volkswagen[which(volkswagen$price == max(volkswagen$price, na.rm = TRUE)), ]
MaxpriceVl <- volvo[which(volvo$price == max(volvo$price, na.rm = TRUE)), ]

PriceDF <- rbind(MaxpriceAr, MaxpriceAu, MaxpriceBw, MaxpriceCh, MaxpriceDo, MaxpriceHo,
                 MaxpriceIs, MaxpriceJa, MaxpriceMa, MaxpriceMe, MaxpriceMi, MaxpriceNi,
                 MaxpricePo, MaxpriceTo, MaxpriceVk, MaxpriceVl)

PriceDF

index company body-style wheel-base length engine-type
2 1 alfa-romero convertible 88.6 168.8 dohc
3 2 alfa-romero hatchback 94.5 171.2 ohcv
7 6 audi wagon 105.8 192.7 ohc
12 14 bmw sedan 103.5 193.8 ohc
16 18 chevrolet sedan 94.5 158.8 ohc
17 19 dodge hatchback 93.7 157.3 ohc
20 28 honda sedan 96.5 175.4 ohc
22 30 isuzu sedan 94.3 170.7 ohc
27 35 jaguar sedan 102.0 191.7 ohcv
32 43 mazda sedan 104.9 175.0 ohc
36 47 mercedes-benz hardtop 112.0 199.2 ohcv
40 52 mitsubishi sedan 96.3 172.4 ohc
45 57 nissan sedan 100.4 184.6 ohcv
47 62 porsche convertible 89.5 168.9 ohcf
55 79 toyota wagon 104.5 187.8 dohc
59 86 volkswagen sedan 97.3 171.7 ohc
61 88 volvo wagon 104.3 188.8 ohc
num-of-cylinders horsepower average mileage price
2 four 111 21 16500.0
3 six 154 19 16500.0
7 five 110 19 18920.0
12 six 182 16 41315.0
16 four 70 38 6575.0
17 four 68 31 6377.0
20 four 101 24 12945.0
22 four 78 24 6785.0
27 twelve 262 13 36000.0
32 four 72 31 18344.0
36 eight 184 14 45400.0
40 four 88 25 8189.0
45 six 152 19 13499.0
47 six 207 17 37028.0
55 six 156 19 15750.0
59 four 100 26 9995.0
61 four 114 23 13415.0

Option 2: loop

comp_auto <- levels(Auto$company)
PriceDF2 <- data.frame()
for (i in comp_auto) {
  Temp <- Auto[Auto$company == i, ]
  Temp <- Temp[which(Temp$price == max(Temp$price, na.rm = TRUE)), ]
  PriceDF2 <- rbind(PriceDF2, Temp)
}
PriceDF2

View(PriceDF)
View(PriceDF2)
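
The same "most expensive row per group" result can also be sketched without an explicit loop by splitting the data frame and recombining the pieces. This is an alternative outline, not the course's method; the `cars` data frame and its values are hypothetical.

```r
# Toy data (hypothetical values)
cars <- data.frame(
  company = c("audi", "audi", "bmw", "bmw", "bmw"),
  price   = c(13950, 17450, 16430, NA, 41315)
)

# One data frame per company
by_company <- split(cars, cars$company)

# Keep the row(s) with the maximum price in each piece
top_rows <- lapply(by_company, function(d) {
  d[which(d$price == max(d$price, na.rm = TRUE)), ]
})

# Stack the pieces back into a single data frame
top <- do.call(rbind, top_rows)
top
```

Collecting the pieces in a list and calling `rbind` once avoids growing a data frame inside the loop, which matters little for 16 companies but scales better for many groups.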

6.- Find the average mileage for each car manufacturer. The code below reports the most frequent value (the mode) of the average.mileage column for each company.
Option 1: subsets

alfa_romeroTable <- table(alfa_romero$average.mileage)

View(alfa_romeroTable)

Var1 Freq
1 19 1
2 21 2

alfa_romeroKmMode <- names(alfa_romeroTable)[which(alfa_romeroTable == max(alfa_romeroTable))]
alfa_romeroKmMode

[1] "21"

Repeat the process for every company.

Option 2: loop
AutoKmMode <- c()
comp_auto <- levels(Auto$company)
for (i in comp_auto) {
  Temp <- Auto[Auto$company == i, ]
  Temp <- table(Temp$average.mileage)
  Temp <- names(Temp)[which(Temp == max(Temp))]
  Temp <- paste(i, Temp, sep = " = ")
  AutoKmMode <- append(AutoKmMode, Temp)
}
AutoKmMode

[[’alfa-romero = [21]’],
[’audi = [19]’],
[’bmw = [16]’],
[’chevrolet = [38]’],
[’dodge = [31]’],
[’honda = [24]’],
[’isuzu = [38]’],
[’jaguar = [15]’],
[’mazda = [31]’],
[’mercedes-benz = [14]’],
[’mitsubishi = [25]’],
[’nissan = [31]’],
[’porsche = [17]’],
[’toyota = [31]’],
[’volkswagen = [37]’],
[’volvo = [23]’]]
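
For comparison, a per-company mode and a per-company mean can both be sketched with `tapply()`. The helper `mode_value()` and the `cars` data frame below are hypothetical names and values introduced only for this illustration.

```r
# Toy data (hypothetical values)
cars <- data.frame(
  company = c("audi", "audi", "audi", "bmw", "bmw"),
  average.mileage = c(19, 21, 21, 16, 16)
)

# Most frequent value; returns a character string, as names(table(x)) does
mode_value <- function(x) {
  counts <- table(x)
  names(counts)[which(counts == max(counts))]  # may return several values on ties
}

km_mode <- tapply(cars$average.mileage, cars$company, mode_value)
km_mean <- tapply(cars$average.mileage, cars$company, mean)

km_mode
km_mean
```

The mode and the mean answer different questions, so it is worth deciding which one the exercise statement actually asks for before repeating the computation for all companies.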

7.- Sort the cars by the price column

o <- order(Auto$price, na.last = TRUE, decreasing = TRUE)
AutoOrderPrice <- Auto[o, ]

# Equivalent one-line form
AutoOrderPrice <- Auto[order(Auto$price, na.last = TRUE, decreasing = TRUE), ]

AutoOrderPrice

index company body-style wheel-base length engine-type


35 47 mercedes-benz hardtop 112.0 199.2 ohcv
11 14 bmw sedan 103.5 193.8 ohc
34 46 mercedes-benz sedan 120.9 208.1 ohcv
46 62 porsche convertible 89.5 168.9 ohcf
12 15 bmw sedan 110.0 197.0 ohc
26 35 jaguar sedan 102.0 191.7 ohcv
25 34 jaguar sedan 113.0 199.6 dohc
45 61 porsche hardtop 89.5 168.9 ohcf
24 33 jaguar sedan 113.0 199.6 dohc
10 13 bmw sedan 103.5 189.0 ohc
33 45 mercedes-benz wagon 110.0 190.9 ohc
32 44 mercedes-benz sedan 110.0 190.9 ohc
9 11 bmw sedan 101.2 176.8 ohc
6 6 audi wagon 105.8 192.7 ohc
31 43 mazda sedan 104.9 175.0 ohc
4 4 audi sedan 99.4 176.6 ohc
8 10 bmw sedan 101.2 176.8 ohc
2 2 alfa-romero hatchback 94.5 171.2 ohcv
1 1 alfa-romero convertible 88.6 168.8 dohc
7 9 bmw sedan 101.2 176.8 ohc
54 79 toyota wagon 104.5 187.8 dohc
5 5 audi sedan 99.8 177.3 ohc
3 3 audi sedan 99.8 176.6 ohc
44 57 nissan sedan 100.4 184.6 ohcv
0 0 alfa-romero convertible 88.6 168.8 dohc
60 88 volvo wagon 104.3 188.8 ohc
19 28 honda sedan 96.5 175.4 ohc
59 87 volvo sedan 104.3 188.8 ohc
30 39 mazda hatchback 95.3 169.0 rotor

20 29 honda sedan 96.5 169.1 ohc
58 86 volkswagen sedan 97.3 171.7 ohc
53 71 toyota wagon 95.7 169.7 ohc
39 52 mitsubishi sedan 96.3 172.4 ohc
57 82 volkswagen sedan 97.3 171.7 ohc
56 81 volkswagen sedan 97.3 171.7 ohc
52 70 toyota wagon 95.7 169.7 ohc
55 80 volkswagen sedan 97.3 171.7 ohc
43 56 nissan wagon 94.5 170.2 ohc
18 27 honda wagon 96.5 157.1 ohc
40 53 nissan sedan 94.5 165.3 ohc
38 51 mitsubishi sedan 96.3 172.4 ohc
51 69 toyota wagon 95.7 169.7 ohc
42 55 nissan sedan 94.5 165.3 ohc
29 38 mazda hatchback 93.1 159.1 ohc
21 30 isuzu sedan 94.3 170.7 ohc
41 54 nissan sedan 94.5 165.3 ohc
15 18 chevrolet sedan 94.5 158.8 ohc
50 68 toyota hatchback 95.7 158.7 ohc
16 19 dodge hatchback 93.7 157.3 ohc
49 67 toyota hatchback 95.7 158.7 ohc
14 17 chevrolet hatchback 94.5 155.9 ohc
17 20 dodge hatchback 93.7 157.3 ohc
37 50 mitsubishi hatchback 93.7 157.3 ohc
28 37 mazda hatchback 93.1 159.1 ohc
36 49 mitsubishi hatchback 93.7 157.3 ohc
48 66 toyota hatchback 95.7 158.7 ohc
27 36 mazda hatchback 93.1 159.1 ohc
13 16 chevrolet hatchback 88.4 141.1 l
22 31 isuzu sedan 94.5 155.9 ohc
23 32 isuzu sedan 94.5 155.9 ohc
47 63 porsche hatchback 98.4 175.7 dohcv

num-of-cylinders horsepower average mileage price
35 eight 184 14 45400.0
11 six 182 16 41315.0
34 eight 184 14 40960.0
46 six 207 17 37028.0
12 six 182 15 36880.0
26 twelve 262 13 36000.0
25 six 176 15 35550.0
45 six 207 17 34028.0
24 six 176 15 32250.0
10 six 182 16 30760.0
33 five 123 22 28248.0
32 five 123 22 25552.0
9 six 121 21 20970.0
6 five 110 19 18920.0
31 four 72 31 18344.0
4 five 115 18 17450.0
8 four 101 23 16925.0
2 six 154 19 16500.0
1 four 111 21 16500.0
7 four 101 23 16430.0
54 six 156 19 15750.0
5 five 110 19 15250.0
3 four 102 24 13950.0
44 six 152 19 13499.0
0 four 111 21 13495.0
60 four 114 23 13415.0
19 four 101 24 12945.0
59 four 114 23 12940.0
30 two 101 17 11845.0
20 four 100 25 10345.0
58 four 100 26 9995.0
53 four 62 27 8778.0
39 four 88 25 8189.0
57 four 52 37 7995.0
56 four 85 27 7975.0
52 four 62 27 7898.0
55 four 52 37 7775.0

43 four 69 31 7349.0
18 four 76 30 7295.0
40 four 55 45 7099.0
38 four 88 25 6989.0
51 four 62 31 6918.0
42 four 69 31 6849.0
29 four 68 31 6795.0
21 four 78 24 6785.0
41 four 69 31 6649.0
15 four 70 38 6575.0
50 four 62 31 6488.0
16 four 68 31 6377.0
49 four 62 31 6338.0
14 four 70 38 6295.0
17 four 68 31 6229.0
37 four 68 31 6189.0
28 four 68 31 6095.0
36 four 68 37 5389.0
48 four 62 35 5348.0
27 four 68 30 5195.0
13 three 70 38 NaN
23 four 70 38 NaN
47 eight 288 17 NaN
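
The role of `na.last` in `order()` can be seen on a short vector. The prices below are made up for the illustration.

```r
price <- c(6575, NA, 45400, 13495)  # hypothetical prices

idx_keep <- order(price, decreasing = TRUE, na.last = TRUE)  # NA positions go to the end
idx_drop <- order(price, decreasing = TRUE, na.last = NA)    # NA positions are removed

idx_keep
idx_drop
price[idx_keep]
```

With `na.last = TRUE` the rows without a price stay in the sorted table, at the bottom, which matches the three NaN rows at the end of the listing above.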

3.2 Data Transformation Exercise in R
setwd("E:/User/PathToFolder/Dia 7")

a <- read.csv("TriplicadoALL.csv")

How can the first column be used as the row names?

Here the column is read as a factor, so it is converted to character before being assigned as the row names.


str(a$Sample)

Factor w/ 27 levels "1_R1","1_R2",..: 16 17 18 13 14 15 25 26


27 22 ...

a$Sample <- as.character(a$Sample)

rownames(a) <- a$Sample
a$Sample <- NULL

To automate applying a conversion function to multiple columns, the apply or lapply functions can be used.

# lapply version: converts column by column
a[] <- lapply(a, function(x) as.numeric(as.character(x)))
# apply version over columns (MARGIN = 2); either line on its own is enough
a[] <- apply(a, 2, function(x) as.numeric(as.character(x)))
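
The reason for the `as.numeric(as.character(x))` pattern is that `as.numeric()` on a factor returns the internal level codes, not the values. A minimal sketch with made-up values:

```r
f <- factor(c("10.5", "3", "10.5", "7"))  # hypothetical values stored as a factor

codes  <- as.numeric(f)                # internal level codes, not the values
values <- as.numeric(as.character(f))  # the actual numbers

codes
values

# Same conversion over every column, keeping the data frame structure with d[]
d <- data.frame(x = factor(c("1", "2")), y = factor(c("3.5", "4")))
d[] <- lapply(d, function(col) as.numeric(as.character(col)))
str(d)
```

Assigning into `d[]` rather than `d` keeps the object a data frame with its row and column names, even though `lapply()` returns a plain list.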

To determine the fraction that each value of the data frame represents relative to the row or the column it belongs to, apply can be used as well.

anorm <- t(apply(a, 1, function(i) i / sum(i)))  # fractions by row
anorm2 <- apply(a, 2, function(i) i / sum(i))    # fractions by column
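
The `t()` in the row-wise version is needed because `apply()` over rows returns each row's result as a column. A small sketch with a made-up data frame `m` (hypothetical values) shows both normalizations and a quick check that the fractions add up to 1:

```r
m <- data.frame(A = c(1, 2), B = c(3, 6), row.names = c("s1", "s2"))  # hypothetical data

by_row <- t(apply(m, 1, function(i) i / sum(i)))  # each row sums to 1
by_col <- apply(m, 2, function(i) i / sum(i))     # each column sums to 1

by_row
rowSums(by_row)
colSums(by_col)

# Base R also offers prop.table() for the same row-wise fractions
prop.table(as.matrix(m), 1)
```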

Multiplying the fraction by 100 gives the percentage.

anormPorcentaje <- anorm * 100
anormPorcentaje2 <- anorm2 * 100

Summing all the values of a row (or of a column, for the column-wise version) checks that the operation worked as expected.

sum(anormPorcentaje[1, ])

100

sum(anormPorcentaje2[, 1])

100

To transform the data, several operations can be applied; here a base-2 logarithm is used.

anormlog <- log2(anorm)
anormlog <- as.data.frame(anormlog)

str(anormlog)

'data.frame': 27 obs. of 43 variables:


$ ALA : num -2.93 -2.92 -2.94 -2.87 -2.86 ...
$ ARG : num -3.4 -3.42 -3.4 -3.42 -3.41 ...
$ CIT : num -4.62 -4.58 -4.62 -5.3 -5.3 ...
$ GLY : num -1.89 -1.87 -1.85 -2.06 -2.12 ...
$ LEU : num -2.91 -2.94 -2.94 -2.74 -2.69 ...
$ MET : num -6.97 -7.1 -7.11 -7.1 -7.07 ...
$ ORN : num -4.35 -4.38 -4.38 -4.36 -4.34 ...
$ PHE : num -4.63 -4.65 -4.64 -4.14 -4.1 ...
$ PRO : num -3.72 -3.71 -3.71 -4.36 -4.4 ...
$ SA : num -11.8 -11.6 -11.5 -12 -11.7 ...
$ TYR : num -5.2 -5.24 -5.26 -5.06 -4.98 ...
$ VAL : num -3.09 -3.07 -3.09 -2.87 -2.87 ...
$ C0 : num -7.56 -7.52 -7.6 -5.94 -5.92 ...
$ C2 : num -7.71 -7.75 -7.75 -6.86 -6.82 ...
$ C3 : num -12.6 -12.7 -12.6 -12.4 -12.4 ...
$ C3DC.C4OH: num -15.6 -15.2 -15.5 -14.9 -14.6 ...
$ C4 : num -11.6 -11.6 -11.5 -12.3 -12.4 ...

$ C4DC.C5OH: num -12.9 -12.9 -13 -12.5 -12.5 ...
$ C5 : num -14.4 -14.5 -14.6 -13.8 -13.5 ...
$ C5.1 : num -17.9 -18 -18 -17.7 -17.6 ...
$ C5DC.C6OH: num -14.8 -14.7 -14.7 -13.8 -13.8 ...
$ C6 : num -16.9 -17 -17 -16.1 -15.6 ...
$ C6DC : num -14.2 -14.4 -14.3 -14.7 -14.4 ...
$ C8 : num -16.9 -17 -17 -16.7 -16 ...
$ C8.1 : num -Inf -18 -18 -17.7 -17.6 ...
$ C10 : num -16.9 -17 -17 -16.7 -16.6 ...
$ C10.1 : num -Inf -Inf -Inf -17.7 -16.6 ...
$ C10.2 : num -Inf -Inf -Inf -Inf -Inf ...
$ C12 : num -16.9 -17 -17 -16.7 -16.6 ...
$ C12.1 : num -16.9 -17 -17 -Inf -16.6 ...
$ C14 : num -15.4 -15.7 -15.5 -16.7 -16.6 ...
$ C14.1 : num -15.9 -16 -16 -16.1 -16 ...
$ C14.2 : num -16.9 -18 -17 -16.7 -16.6 ...
$ C14OH : num -Inf -18 -Inf -Inf -Inf ...
$ C16 : num -12.1 -12.1 -12 -14.1 -13.9 ...
$ C16.1 : num -14.6 -14.7 -14.6 -16.1 -16 ...
$ C16.1OH : num -15.9 -15.7 -16 -16.7 -16.6 ...
$ C16OH : num -16.9 -16.4 -16.5 -16.7 -16 ...
$ C18 : num -13.6 -13.6 -13.9 -14 -13.8 ...
$ C18.1 : num -11.9 -11.8 -11.9 -11.2 -11.1 ...
$ C18.1OH : num -15.9 -16 -16 -16.7 -16.6 ...
$ C18.2 : num -13 -13 -13.1 -12.1 -12 ...
$ C18OH : num -16.9 -17 -17 -Inf -17.6 ...

The "-Inf" result, which log2() returns for a value of 0, can significantly affect any analysis later run on these data.

View(anormlog)

To automate replacing the "-Inf" values, a function can be written for the task and combined with a function from the apply family.

changetoinf <- function(colnum, df) {
  col <- df[, colnum]
  if (is.numeric(col)) {  # check that the column is numeric
    col[col == -Inf] <- NA
  }
  return(col)
}

df <- data.frame(sapply(1:43, changetoinf, anormlog))

colnames(df) <- colnames(anormlog)
rownames(df) <- rownames(anormlog)

anormlog <- df

View(anormlog)
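
A more compact variant of the same replacement can be sketched with `lapply()` and `is.infinite()`, which avoids hard-coding the number of columns and keeps the row and column names. This is an alternative outline under stated assumptions, not the course's function; the data frame `x` and its values are hypothetical.

```r
x <- data.frame(a = c(8, 0, 2), b = c(0, 0, 4))  # hypothetical values containing zeros

logx <- log2(x)  # log2(0) is -Inf

# Replace -Inf by NA in every numeric column; logx[] keeps the data frame structure
logx[] <- lapply(logx, function(col) {
  if (is.numeric(col)) col[is.infinite(col) & col < 0] <- NA
  col
})

logx
```

Whether NA is the right replacement depends on the later analysis; adding a small pseudo-count before taking the logarithm is another common choice, not shown here.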

write.csv(anormlog, "FoldTriplicado.csv")

3.3 Graphical Visualization in R: VGChartz Results

install.packages("dplyr")
install.packages("ggplot2")
install.packages("gganimate")
install.packages("gifski")

library(dplyr)
library(ggplot2)
library(gganimate)
library(gifski)

Pre-processing is done with the preprocessing() function, which formats the data according to the years during which each game has been available. It builds a data frame that spans every year from the release of the earliest game to the last year in the data.
preprocessing <- function(df) {
  # Uses the global `games` data frame to get the full range of years
  df1 <- as.data.frame(seq(min(games$Year), max(games$Year)))
  colnames(df1) <- "Years"

  for (i in seq_along(df1$Years)) {
    df1[i, "Rank"] <- df$Rank
    df1[i, "Name"] <- df$Name
    # The rank and the name are added to the new data frame

    # Subtract the release year from the last year available in the data frame.
    # This difference is used to spread the sales over the years.
    if (max(df1$Years) != df$Year) {
      difference <- max(df1$Years) - df$Year
    } else {
      # If the release year equals the last available year, that year is used
      difference <- max(df1$Years)
    }

    # Comparison that assigns the number of copies sold per year
    if (as.numeric(df1[i, "Years"]) < as.numeric(df$Year)) {
      df1[i, "TotalShipped"] <- 0  # Years before the release get 0
    } else if (as.numeric(df1[i, "Years"]) == as.numeric(df$Year)) {
      # From the release year onward, a share of the game's sales is assigned
      a <- df$Total_Shipped / difference  # Total copies divided by the difference
      df1[i, "TotalShipped"] <- a         # The release year gets the result of the division
    } else if (as.numeric(df1[i, "Years"]) > as.numeric(df$Year)) {
      # Later years get the yearly share plus the previous year's total
      df1[i, "TotalShipped"] <- a + df1[i - 1, "TotalShipped"]
    }
  }
  return(df1)
}

assign_rank <- function(df) {
  # Loops that assign the rank according to the highest sales figure
  # available in each year

  # Assign Rank == 1 in each year
  for (i in levels(df$Years)) {
    b <- df %>% filter(Years == i)
    df$Rank[which(df$Name == b$Name[which(b$TotalShipped == max(b$TotalShipped))]
                  & df$Years == i)] <- 1
  }

  # Assign Rank == 2 in each year
  for (i in levels(df$Years)) {
    b <- df %>% filter(Years == i & Rank != 1)
    df$Rank[which(df$Name == b$Name[which(b$TotalShipped == max(b$TotalShipped))]
                  & df$Years == i)] <- 2
  }

  # Assign Rank == 3 in each year
  for (i in levels(df$Years)) {
    b <- df %>% filter(Years == i & Rank != 1 & Rank != 2)
    df$Rank[which(df$Name == b$Name[which(b$TotalShipped == max(b$TotalShipped))]
                  & df$Years == i)] <- 3
  }

  # Assign Rank == 4 in each year
  for (i in levels(df$Years)) {
    b <- df %>% filter(Years == i & Rank != 1 & Rank != 2 & Rank != 3)
    df$Rank[which(df$Name == b$Name[which(b$TotalShipped == max(b$TotalShipped))]
                  & df$Years == i)] <- 4
  }

  # Assign Rank == 5 in each year
  for (i in levels(df$Years)) {
    b <- df %>% filter(Years == i & Rank != 1 & Rank != 2 & Rank != 3 & Rank != 4)
    df$Rank[which(df$Name == b$Name[which(b$TotalShipped == max(b$TotalShipped))]
                  & df$Years == i)] <- 5
  }

  # Assign Rank == 6 in each year
  for (i in levels(df$Years)) {
    b <- df %>% filter(Years == i & Rank != 1 & Rank != 2 & Rank != 3 & Rank != 4
                       & Rank != 5)
    df$Rank[which(df$Name == b$Name[which(b$TotalShipped == max(b$TotalShipped))]
                  & df$Years == i)] <- 6
  }

  # Assign Rank == 7 in each year
  for (i in levels(df$Years)) {
    b <- df %>% filter(Years == i & Rank != 1 & Rank != 2 & Rank != 3 & Rank != 4
                       & Rank != 5 & Rank != 6)
    df$Rank[which(df$Name == b$Name[which(b$TotalShipped == max(b$TotalShipped))]
                  & df$Years == i)] <- 7
  }

  # Assign Rank == 8 in each year
  for (i in levels(df$Years)) {
    b <- df %>% filter(Years == i & Rank != 1 & Rank != 2 & Rank != 3 & Rank != 4
                       & Rank != 5 & Rank != 6 & Rank != 7)
    df$Rank[which(df$Name == b$Name[which(b$TotalShipped == max(b$TotalShipped))]
                  & df$Years == i)] <- 8
  }

  # Assign Rank == 9 in each year
  for (i in levels(df$Years)) {
    b <- df %>% filter(Years == i & Rank != 1 & Rank != 2 & Rank != 3 & Rank != 4
                       & Rank != 5 & Rank != 6 & Rank != 7 & Rank != 8)
    df$Rank[which(df$Name == b$Name[which(b$TotalShipped == max(b$TotalShipped))]
                  & df$Years == i)] <- 9
  }

  # Assign Rank == 10 in each year
  for (i in levels(df$Years)) {
    b <- df %>% filter(Years == i & Rank != 1 & Rank != 2 & Rank != 3 & Rank != 4
                       & Rank != 5 & Rank != 6 & Rank != 7 & Rank != 8 & Rank != 9)
    df$Rank[which(df$Name == b$Name[which(b$TotalShipped == max(b$TotalShipped))]
                  & df$Years == i)] <- 10
  }

  return(df)
}

sv_anim <- function(data, name) {
  final_animation <- animate(data, 100, fps = 20, duration = 30, width = 950,
                             height = 750, renderer = gifski_renderer())
  assign("final_animation", final_animation, envir = globalenv())
  filename <- getwd()
  filename <- paste(filename, "/", name, ".gif", sep = "")
  anim_save(filename, animation = final_animation)
}
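
The ten ranking loops in assign_rank() follow one pattern: within each year, the largest remaining TotalShipped gets the next rank. A compact base-R sketch of that idea is shown below. It is an alternative outline under stated assumptions, not the course's function: the `sales` data frame, its values and `top_n` are hypothetical, and ties are broken by order of appearance, which may differ from how the loops above treat equal values.

```r
# Toy data (hypothetical values): three games over two years
sales <- data.frame(
  Years = c(2000, 2000, 2000, 2001, 2001, 2001),
  Name  = c("A", "B", "C", "A", "B", "C"),
  TotalShipped = c(5, 9, 0, 12, 10, 3)
)
top_n <- 2  # hypothetical cut-off; the document ranks the top 10

# rank() on the negated sales gives 1 to the largest value within each year
sales$Rank <- ave(-sales$TotalShipped, sales$Years,
                  FUN = function(x) rank(x, ties.method = "first"))

# Rows outside the top N keep rank 0, as in the document's convention
sales$Rank[sales$Rank > top_n] <- 0

sales
```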

VGChartz Phase 1: Reading the .csv file

The .csv file containing the data is read and assigned to an object.

games <- read.csv("complete_vgchartz.csv")

Only the columns needed for the analysis are selected.

games <- games %>% select(Rank, Name, Total_Shipped, Year)

Rank Name Total Shipped Year
1 1 Pokemon 362.06 1998
2 2 Super Mario 354.51 1983
3 3 Call of Duty 300.00 2003
4 4 Grand Theft Auto 300.00 1998
5 5 FIFA 282.40 1993
6 6 The Sims 200.00 2000
7 7 Minecraft 180.00 2011
8 8 Tetris 171.00 1984
9 9 Need for Speed 150.00 1994
10 10 Final Fantasy 149.00 1987
11 11 Mario Kart 142.34 1992
12 12 Assassin’s Creed 140.00 2007
13 13 Madden NFL 130.00 1988
14 14 Wii Sports 115.99 2006
15 15 Pro Evolution Soccer 106.80 1995
16 16 The Legend of Zelda 105.81 1987
17 17 Lego 104.30 1997
18 18 Resident Evil 95.00 1996
19 19 NBA 2K 90.00 1999
20 20 Wii Sports 82.88 2006
21 21 Gran Turismo 80.40 1998
22 22 Dragon Quest 80.00 1986
23 23 Battlefield 78.90 2002
24 24 Sonic the Hedgehog 76.64 1991
25 25 Tomb Raider 75.00 1996
26 26 Halo 71.00 2001
27 27 Just Dance 70.00 2009
28 28 WWE 2K 70.00 2000
29 29 Counter-Strike 65.00 2000
30 30 The Oregon Trail 65.00 1971
31 31 Donkey Kong 62.55 1981
32 32 Monster Hunter 60.00 2004
33 33 PUBG 60.00 2017
34 34 Super Smash Bros. 58.88 1999
35 35 The Elder Scrolls 58.62 1994
36 36 Borderlands 56.00 2009
37 37 Dragon Ball 56.00 1986
38 38 Metal Gear 55.00 1987
39 39 Far Cry 52.50 2004
40 40 Mario Party 50.94 1999
This is a small cheat: according to VGChartz, "Call of Duty" and "Grand Theft Auto" have the same sales figure, so "Call of Duty" is given slightly more copies sold to break the tie.

games$Total_Shipped[which(games$Name == "Call of Duty")] <- as.numeric(300.10)
games$Rank <- rep(0, length(games$Rank))  # Set every rank to 0; ranks are assigned from sales later

[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[33] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

VGChartz Phase 2: Shaping a data frame that works correctly with gganimate.

Build a data frame that includes every year since each game first went on sale.

gamesdb <- as.data.frame(NULL)  # Create the empty data frame

data frame with 0 columns and 0 rows

This loop goes through the games one at a time by index.

for (i in seq_along(games$Name)) {
  game <- games[i, ]            # This single game row is the input to the next function
  a <- preprocessing(game)      # The function's result is stored in a
  gamesdb <- rbind(gamesdb, a)  # The results are appended to the "gamesdb" data frame
}

Years Rank Name Total Shipped


1 1971 0 Pokemon 0.00000
. . . . .
. . . . .
250 1985 0 The Sims 0.00000
[ reached 'max' / getOption("max.print") -- omitted 3510 rows ]
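
The per-year values that preprocessing() builds row by row can also be written as one vectorized expression, which makes the rule easier to see: 0 before the release year, then one more yearly share for each year since release. The sketch below is a hypothetical helper, `yearly_shipped()`, not part of the document's code; the 1971 to 2017 range is an assumption taken from the earliest and latest Year values in the table above.

```r
# Hypothetical helper: same per-year rule as the loop, written with vectors
yearly_shipped <- function(total, release, years) {
  last <- max(years)
  # Same branch as in preprocessing(): years between release and the last year
  difference <- if (last != release) last - release else last
  per_year <- total / difference
  # 0 before release, then (years since release + 1) yearly shares
  ifelse(years < release, 0, (years - release + 1) * per_year)
}

years <- 1971:2017  # assumed range of games$Year
pokemon <- yearly_shipped(total = 362.06, release = 1998, years = years)

round(pokemon[years == 1999], 6)
```

For Pokemon this gives the same 1999 value, 38.111579, that the document lists in its filtered result.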

The class of "Years" is changed from numeric to factor so that its levels can be obtained.

gamesdb$Years <- as.factor(gamesdb$Years)
gamesdb <- assign_rank(gamesdb)  # Reassign the result of the function

A final filter keeps only the games that were in the Top 10 at some point, dropping the years in which they were not yet on sale.

finalgame <- gamesdb %>% filter(Rank >= 1 & TotalShipped != 0)
finalgame

Years Rank Name Total Shipped


1 1999 10 Pokemon 38.111579
2 2000 6 Pokemon 57.167368
. . . . .
. . . . .
249 1987 8 The Legend of Zelda 3.527000
250 1988 7 The Legend of Zelda 7.054000
[ reached 'max' / getOption("max.print") -- omitted 770 rows ]

VGChartz Phase 3: Building the Static Plot

The static plot is built with the following functions.

p <- ggplot(data = finalgame, aes(x = Rank, group = as.character(Name),
                                  color = Name, fill = Name)) +
  # geom_tile() draws the bars
  geom_tile(aes(y = TotalShipped / 2, height = TotalShipped,
                width = 0.9), alpha = 0.8, color = NA) +
  geom_text(aes(y = 0, label = paste(Name, " ")), vjust = 0.2, hjust = 1) +
  # coord_flip() flips the axes of the plot
  coord_flip(clip = "off", expand = TRUE) +
  # scale_y_continuous() makes the axis labels follow the scale values
  scale_y_continuous(labels = scales::comma) +
  # scale_x_reverse() reverses the direction of the X axis
  scale_x_reverse() +
  # "none" replaces the deprecated FALSE for hiding the legends
  guides(color = "none", fill = "none") +
  # theme_minimal() removes the default axis formatting of the static plot
  theme_minimal() +
  # Formatting of the plot titles
  theme(
    plot.title = element_text(size = 20, hjust = 0.5, face = "bold", colour = "grey", vjust = -1),
    plot.subtitle = element_text(size = 18, hjust = 0.5, face = "italic", color = "grey"),
    plot.caption = element_text(size = 8, hjust = 0.5, face = "italic", color = "grey"),
    axis.ticks.y = element_blank(),
    axis.text.y = element_blank(),
    axis.title.y = element_blank(),
    plot.margin = margin(1, 1, 1, 4, "cm")
  )
p
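
The `y = TotalShipped / 2, height = TotalShipped` pair in geom_tile() is what makes each tile behave like a bar that starts at zero: a tile is centered on `y` and extends half its height in each direction. The arithmetic can be checked with made-up values (the vector below is hypothetical):

```r
TotalShipped <- c(38.1, 57.2, 7.05)  # hypothetical bar values

y      <- TotalShipped / 2  # tile center, as in aes(y = TotalShipped / 2)
height <- TotalShipped      # tile height, as in aes(height = TotalShipped)

ymin <- y - height / 2      # lower edge of each tile
ymax <- y + height / 2      # upper edge of each tile

cbind(ymin, ymax)
```

Every tile therefore runs from 0 to its TotalShipped value, so the tiles can change size smoothly between states while staying anchored at zero.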

It is completely normal for the names to overlap here, since they move in the dynamic plot.

VGChartz Phase 4: Building the Dynamic Plot

The static plot is used as the base and a few functions are added.

plt <- p +
  # transition_states() produces the animation
  transition_states(states = Years, transition_length = 4, state_length = 1) +
  # ease_aes() sets cubic easing for the aesthetics of the static plot
  ease_aes("cubic-in-out") +
  # Sets the titles of each element of the plot
  labs(title = "Top 10 Most Successful Videogames: {closest_state}",
       subtitle = "Millions Units Sold",
       caption = "Source: VGChartz",
       y = "Total Copies Sold")
plt

In the animation the bars change gradually.

VGChartz Phase 5: Saving the animation
The arguments are the dynamic plot and the name to give to the animation.

sv_anim(plt, "vgchartzcomplete")
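
Inside sv_anim() the output path is built with `paste()`. A sketch of the same path built with `file.path()`, which inserts the separator itself, is shown below; the variable names are introduced only for this comparison.

```r
name <- "vgchartzcomplete"

by_paste     <- paste(getwd(), "/", name, ".gif", sep = "")  # as in sv_anim()
by_file_path <- file.path(getwd(), paste0(name, ".gif"))     # separator added by file.path()

identical(by_paste, by_file_path)
```

Both forms give the same string here; `file.path()` mainly makes the intent clearer and avoids forgetting the separator.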
