We will use a "semi-clean" version of the Titanic dataset. If you use the dataset hosted directly on Kaggle, you may need to perform additional cleaning that is not shown in this notebook.
Importing Libraries
Let's import some libraries to get started!
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
The Data
Let's start by reading the titanic_train.csv file into a pandas DataFrame.
In [2]:
train = pd.read_csv('titanic_train.csv')
localhost:8888/notebooks/Desktop/03RegresionLogistica/03RegresionLogistica/01RegresionLogisticaConPython.ipynb#Curvas-R# 1/25
26/12/22, 13:12 01RegresionLogisticaConPython - Jupyter Notebook
In [3]:
train.head()
Out[3]:
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  ...
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500  ...
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833  ...
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250  ...
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  ...
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500  ...
In [4]:
train.info()
<class 'pandas.core.frame.DataFrame'>
Missing Data
We can use Seaborn to create a simple heatmap to see where we are missing data!
In [5]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[5]:
<AxesSubplot:>
Roughly 20 percent of the Age data is missing. That proportion is probably small enough for a reasonable replacement with some form of imputation. Looking at the Cabin column, however, too much of that data is missing to do anything useful at a basic level. We will probably drop it later, or convert it into another feature such as "Cabin Known: 1 or 0".
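As a sketch of the "Cabin Known: 1 or 0" idea (on a hypothetical mini-frame, not the actual Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for the Titanic data
df = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, np.nan, 35.0],
    'Cabin': ['C85', None, None, 'C123', None],
})

# Fraction of missing values per column
missing_frac = df.isnull().mean()
print(missing_frac['Age'])    # 0.4

# Instead of dropping Cabin, keep a "cabin known" indicator
df['CabinKnown'] = df['Cabin'].notnull().astype(int)
print(df['CabinKnown'].tolist())  # [1, 0, 0, 1, 0]
```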
In [6]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',data=train,palette='RdBu_r')
Out[6]:
<AxesSubplot:xlabel='Survived', ylabel='count'>
In [7]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')
Out[7]:
<AxesSubplot:xlabel='Survived', ylabel='count'>
In [8]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Pclass',data=train,palette='rainbow')
Out[8]:
<AxesSubplot:xlabel='Survived', ylabel='count'>
In [9]:
sns.displot(train['Age'].dropna(),kde=True,color='darkblue',bins=30)
Out[9]:
<seaborn.axisgrid.FacetGrid at 0x19e7d40e760>
In [10]:
train['Age'].hist(bins=30,color='darkred',alpha=0.5)
Out[10]:
<AxesSubplot:>
In [11]:
sns.countplot(x='SibSp',data=train)
Out[11]:
<AxesSubplot:xlabel='SibSp', ylabel='count'>
In [12]:
sns.countplot(x='Parch',data=train)
Out[12]:
<AxesSubplot:xlabel='Parch', ylabel='count'>
In [13]:
train['Fare'].hist(color='green',bins=40,figsize=(8,4))
Out[13]:
<AxesSubplot:>
Data Cleaning
We want to fill in the missing age data instead of simply dropping the rows where it is missing. One way to do this is to fill them with the mean age of all passengers (imputation). However, we can be smarter about it and check the average age by passenger class. For example:
In [14]:
plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass',y='Age',data=train,palette='winter')
Out[14]:
<AxesSubplot:xlabel='Pclass', ylabel='Age'>
In [15]:
fil_no_null = pd.notnull(train['Age'])
edades=train[fil_no_null]
In [16]:
fil_1ra = edades['Pclass'] == 1
pas_1ra = edades[fil_1ra]
print('promedio',pas_1ra['Age'].mean())
print('Percentiles')
print(pas_1ra[['Age']].quantile([0.25,0.50,0.75]))
promedio 38.233440860215055
Percentiles
Age
0.25 27.0
0.50 37.0
0.75 49.0
In [17]:
fil_2da = edades['Pclass'] == 2
pas_2da = edades[fil_2da]
print('promedio',pas_2da['Age'].mean())
print('Percentiles')
print(pas_2da[['Age']].quantile([0.25,0.50,0.75]))
promedio 29.87763005780347
Percentiles
Age
0.25 23.0
0.50 29.0
0.75 36.0
In [18]:
fil_3ra = edades['Pclass'] == 3
pas_3ra = edades[fil_3ra]
print('promedio',pas_3ra['Age'].mean())
print('Percentiles')
print(pas_3ra[['Age']].quantile([0.25,0.50,0.75]))
promedio 25.14061971830986
Percentiles
Age
0.25 18.0
0.50 24.0
0.75 32.0
We can see that the wealthier passengers in the higher classes tend to be older, which makes sense. We will use the median age for each Pclass to impute Age.
In [19]:
def imputar_edad(cols):
    Age = cols[0]
    Pclass = cols[1]
    if pd.isnull(Age):
        # Impute the median age of the passenger's class
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age
In [20]:
train['Age'] = train[['Age','Pclass']].apply(imputar_edad,axis=1)
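An equivalent, more idiomatic way to do the same per-class median imputation (a sketch on a toy frame; the notebook's hard-coded medians 37/29/24 come from the quantile cells above) uses groupby/transform:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train[['Age', 'Pclass']]
df = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3],
    'Age':    [30.0, np.nan, 20.0, np.nan, 10.0, np.nan],
})

# Fill each missing Age with the median Age of that passenger's class
df['Age'] = df['Age'].fillna(df.groupby('Pclass')['Age'].transform('median'))
print(df['Age'].tolist())  # [30.0, 30.0, 20.0, 20.0, 10.0, 10.0]
```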
In [21]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[21]:
<AxesSubplot:>
In [22]:
train.info()
<class 'pandas.core.frame.DataFrame'>
In [23]:
In [24]:
train.head()
Out[24]:
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  ...
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500  ...
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833  ...
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250  ...
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  ...
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500  ...
In [25]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[25]:
<AxesSubplot:>
In [26]:
train.dropna(inplace=True)
In [27]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[27]:
<AxesSubplot:>
In [28]:
train.info()
<class 'pandas.core.frame.DataFrame'>
In [29]:
Out[29]:
In [30]:
Out[30]:
In [31]:
train.info()
<class 'pandas.core.frame.DataFrame'>
In [32]:
train.head()
#[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'male', 'Q','S']]
Out[32]:
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  ...
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500  ...
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833  ...
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250  ...
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  ...
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500  ...
In [33]:
sex = pd.get_dummies(train['Sex'],drop_first=True)
embark = pd.get_dummies(train['Embarked'],drop_first=True)
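pd.get_dummies with drop_first=True keeps k-1 indicator columns, which avoids perfectly collinear dummies (the "dummy variable trap"). A minimal illustration on a toy Series:

```python
import pandas as pd

s = pd.Series(['male', 'female', 'female', 'male'])

# drop_first=True drops the first category ('female'),
# leaving k-1 indicator columns
dummies = pd.get_dummies(s, drop_first=True)
print(list(dummies.columns))                 # ['male']
print(dummies['male'].astype(int).tolist())  # [1, 0, 0, 1]
```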
In [34]:
train.drop(['Sex','Embarked','Name','Ticket','PassengerId'],axis=1,inplace=True)
In [35]:
train.head()
Out[35]:
   Survived  Pclass   Age  SibSp  Parch     Fare
0         0       3  22.0      1      0   7.2500
1         1       1  38.0      1      0  71.2833
2         1       3  26.0      0      0   7.9250
3         1       1  35.0      1      0  53.1000
4         0       3  35.0      0      0   8.0500
In [36]:
train = pd.concat([train,sex,embark],axis=1)
In [37]:
train.head()
Out[37]:
   Survived  Pclass   Age  SibSp  Parch     Fare  male  Q  S
0         0       3  22.0      1      0   7.2500     1  0  1
1         1       1  38.0      1      0  71.2833     0  0  0
2         1       3  26.0      0      0   7.9250     0  0  1
3         1       1  35.0      1      0  53.1000     0  0  1
4         0       3  35.0      0      0   8.0500     1  0  1
In [38]:
train.to_csv("titanic_procesado.csv")
In [39]:
In [40]:
In [41]:
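The cells that build X_train, X_test, y_train and y_test did not survive the export. A typical split would look like the sketch below; the test_size and random_state values are assumptions, not recovered from the notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small stand-in frame; in the notebook this is the processed train DataFrame
data = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0, 0, 1],
    'Pclass':   [3, 1, 3, 3, 1, 2, 3, 2],
    'Age':      [22.0, 38.0, 26.0, 35.0, 35.0, 27.0, 54.0, 2.0],
})

X = data.drop('Survived', axis=1)  # features
y = data['Survived']               # target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101)  # assumed parameters
print(len(X_train), len(X_test))  # 5 3
```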
Standardization
In [42]:
In [43]:
from sklearn.preprocessing import StandardScaler  # import needed for the line below
escala = StandardScaler()
In [44]:
escala.fit(X_train)
Out[44]:
StandardScaler()
In [45]:
X_train_escala = escala.transform(X_train)
In [46]:
X_test_escala = escala.transform(X_test)
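Note that the scaler is fit on the training data only, and its statistics are then reused on the test set, which avoids leaking test information into preprocessing. A tiny sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0]])

scaler = StandardScaler().fit(X_train)    # statistics come from training data only
print(scaler.mean_[0])                    # 2.0
X_test_scaled = scaler.transform(X_test)  # reuse train statistics on the test set
print(X_test_scaled[0, 0])                # 0.0
```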
Training and Prediction
In [47]:
In [48]:
from sklearn.linear_model import LogisticRegression  # import needed for the line below
logmodel = LogisticRegression(max_iter=1000)
In [49]:
#logmodel.get_params()
In [50]:
help(LogisticRegression)
Training
In [51]:
logmodel.fit(X_train,y_train)
Out[51]:
LogisticRegression(max_iter=1000)
In [52]:
predictions = logmodel.predict(X_test)
In [53]:
y_test[0:5]
Out[53]:
511 0
613 0
615 1
337 1
718 0
In [54]:
X_test[0:5]
Out[54]:
In [55]:
predictions[0:5]
Out[55]:
Standardized Model
In [57]:
In [58]:
logmodelStd = LogisticRegression(max_iter=1000)
In [59]:
logmodelStd.fit(X_train_escala,y_train)
Out[59]:
LogisticRegression(max_iter=1000)
In [60]:
predictionsStd = logmodelStd.predict(X_test_escala)
In [61]:
y_test[0:5]
Out[61]:
511 0
613 0
615 1
337 1
718 0
In [62]:
X_test[0:5]
Out[62]:
In [63]:
predictionsStd[0:5]
Out[63]:
In [64]:
In [65]:
from sklearn.metrics import confusion_matrix  # import needed for the line below
confusion_matrix(y_test,predictions)
Out[65]:
array([[150, 13],
In [66]:
In [67]:
from sklearn.metrics import plot_confusion_matrix  # deprecated since sklearn 1.0
plot_confusion_matrix(logmodel,X_test,y_test,colorbar=False)
C:\Users\Jimmy\anaconda3\lib\site-packages\sklearn\utils\deprecation.py:87: FutureWarning: Function `plot_confusion_matrix` is deprecated in 1.0 and will be removed in 1.2. Use one of the class methods: ConfusionMatrixDisplay.from_predictions or ConfusionMatrixDisplay.from_estimator.
  warnings.warn(msg, category=FutureWarning)
Out[67]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x19e7f537f70>
In [68]:
ROC Curves
Sensitivity: the probability that the model predicts a positive outcome for an observation when the outcome truly is positive.
Specificity: the probability that the model predicts a negative outcome for an observation when the outcome truly is negative.
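With a confusion matrix in hand, both rates are one-line computations. A sketch on toy labels:

```python
from sklearn.metrics import confusion_matrix

# Toy labels just to illustrate the two rates
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # true positive rate (recall)
specificity = tn / (tn + fp)  # true negative rate
print(sensitivity, specificity)  # 0.75 0.75
```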
In [69]:
In [70]:
from sklearn.metrics import plot_roc_curve  # deprecated since sklearn 1.0
plot_roc_curve(logmodel,X_test,y_test,pos_label=0)
C:\Users\Jimmy\anaconda3\lib\site-packages\sklearn\utils\deprecation.py:87: FutureWarning: Function `plot_roc_curve` is deprecated in 1.0 and will be removed in 1.2. Use one of the class methods: RocCurveDisplay.from_predictions or RocCurveDisplay.from_estimator.
  warnings.warn(msg, category=FutureWarning)
Out[70]:
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x19e7f612670>
In [71]:
plot_roc_curve(logmodel,X_test,y_test,pos_label=1)
C:\Users\Jimmy\anaconda3\lib\site-packages\sklearn\utils\deprecation.py:87: FutureWarning: Function `plot_roc_curve` is deprecated in 1.0 and will be removed in 1.2. Use one of the class methods: RocCurveDisplay.from_predictions or RocCurveDisplay.from_estimator.
  warnings.warn(msg, category=FutureWarning)
Out[71]:
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x19e7f722580>
In [72]:
confusion_matrix(y_test,predictionsStd)
Out[72]:
array([[150, 13],
In [73]:
plot_confusion_matrix(logmodelStd,X_test_escala,y_test,colorbar=False)
C:\Users\Jimmy\anaconda3\lib\site-packages\sklearn\utils\deprecation.py:87: FutureWarning: Function `plot_confusion_matrix` is deprecated in 1.0 and will be removed in 1.2. Use one of the class methods: ConfusionMatrixDisplay.from_predictions or ConfusionMatrixDisplay.from_estimator.
  warnings.warn(msg, category=FutureWarning)
Out[73]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x19e7ed5ae50>
In [74]:
ROC Curves (Standardized Model)
In [78]:
plot_roc_curve(logmodelStd,X_test_escala,y_test,pos_label=0)
C:\Users\Jimmy\anaconda3\lib\site-packages\sklearn\utils\deprecation.py:87: FutureWarning: Function `plot_roc_curve` is deprecated in 1.0 and will be removed in 1.2. Use one of the class methods: RocCurveDisplay.from_predictions or RocCurveDisplay.from_estimator.
  warnings.warn(msg, category=FutureWarning)
Out[78]:
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x19e00178af0>
In [79]:
plot_roc_curve(logmodelStd,X_test_escala,y_test,pos_label=1)
C:\Users\Jimmy\anaconda3\lib\site-packages\sklearn\utils\deprecation.py:87: FutureWarning: Function `plot_roc_curve` is deprecated in 1.0 and will be removed in 1.2. Use one of the class methods: RocCurveDisplay.from_predictions or RocCurveDisplay.from_estimator.
  warnings.warn(msg, category=FutureWarning)
Out[79]:
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x19e0002ca30>
Conclusions
Some of the metrics (recall, macro avg, precision, etc.) of the standardized model are lower by 0.01 than those of the unstandardized model, so the unstandardized model gives better results only by a minimal margin. In other words, standardization had essentially no impact for this case study.
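The metrics cited here (precision, recall, macro avg) are the ones printed by sklearn's classification_report; the reporting cells did not survive the export, so this is a minimal illustration on toy labels, not the notebook's actual numbers:

```python
from sklearn.metrics import classification_report

# Toy labels; the notebook's real report cells were lost in export
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

report = classification_report(y_true, y_pred, digits=2)
print(report)  # includes per-class precision/recall plus macro and weighted averages
```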