
Section 1: Getting Started

1: Python Setup
2: Python Arithmetic
3: Basic Data Types
Section 2: Data Structures
4: Variables
5: Lists
6: Tuples and Strings
7: Dictionaries and Sets
8: Numpy Arrays
9: Pandas DataFrames
10: Reading and Writing Data
Section 3: Programming Constructs
11: Control Flow
12: Defining Functions
13: List and Dictionary Comprehensions
Section 4: Data Exploration and Cleaning
14: Initial Data Exploration and Preparation
15: Working With Text Data
16: Preparing Numeric Data
17: Dealing With Dates
18: Merging Data
19: Frequency Tables
20: Plotting with pandas
Section 5: Basic Statistics
21: Descriptive Statistics
22: Probability Distributions
23: Point Estimates and Confidence Intervals
Section 6: Inferential Statistics
24: Hypothesis Testing and the T-Test
25: Chi-Squared Tests
26: Analysis of Variance (ANOVA)
Section 7: Predictive Modeling
27: Linear Regression
28: Logistic Regression
29: Decision Trees
30: Random Forests

Python for Data Analysis Part 1: Setup


I recently completed an introductory guide to R programming aimed at teaching the basic tools necessary to
use R for data analysis and predictive modeling. R is a great language for statistics and data analysis
because the language was built with that goal in mind. Python is the only language that rivals R's popularity for
data analysis. Unlike R, Python is a general-purpose language that isn't designed for any particular task. It is
a jack-of-all-trades language with clean syntax and a vibrant ecosystem of data science software libraries
that extend its base functionality, making it an excellent first language to learn and a data science
powerhouse.
This guide does not assume prior programming experience and focuses on using Python as a tool for data
analysis. We won't spend much time digging into low level details of the language or functionality that is not
needed to use Python for data analysis. Since Python is a general-purpose language, however, it will take
several lessons to build the basic Python proficiency necessary to start using Python for data analysis. If you
already have basic Python proficiency, you may want to skip ahead. After Part 1, this guide will not spend
much time comparing Python and R, as it does not assume R knowledge. In my experience, it is easier to
learn how to program in Python, but it is easier to get started with data analysis in R because all the tools
you need are either baked in or one simple download away. If you're just getting into data science for the
first time, you can't go wrong with either language and it is a good idea to learn both eventually.
Perhaps the biggest downside of Python as a language is that it has two major branches, Python 2.7 and
Python 3.X, which are not fully compatible with one another. This means code written for 2.7 generally
doesn't work in Python 3 and vice versa, so certain software libraries might only be available for one or the
other. As a result, managing add-on libraries and their dependencies in Python can be troublesome. The
differences between Python 2.7 and 3.X won't really affect our learning of the language itself, however, as
the basic syntax is mostly the same between both versions.
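One of the most visible differences, sketched briefly below, is that print is a statement in Python 2 but a function in Python 3:

# Python 2 syntax (a statement; this line is a SyntaxError in Python 3):
# print "Hello"

# Python 3 syntax (a function call):
print("Hello")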
This guide will use Python version 3.4. All code presented should work with Python version 3.4 or later.
Most if not all of the code will also work on Python 2.7, so you can still follow along if you are using Python
2.7.
Since Python package management can be difficult, I do not recommend installing Python and its data
analysis libraries individually. It is easiest to download the Anaconda Python distribution from Continuum
Analytics. Anaconda bundles Python with dozens of popular data analysis libraries and it comes with a nice
integrated development environment (a fancy code editor) called Spyder. Simply go to the Continuum
Analytics download page, click the download link appropriate for your operating system and Python version
and then run the installer to set up the Anaconda Python environment.
A Brief Intro to Spyder
After installing Anaconda, open the Continuum Analytics app Launcher and click the "launch" button next to
the Spyder app, or simply find Spyder in your program list and launch it directly. Spyder is a code development
tool geared toward data analysis. When you first open Spyder, you'll see an application window separated
into several panes, each with one or more tabs. The arrangement of the panes and tabs is customizable:
simply click and hold on the edge of a pane and drag it to a different part of the Spyder window to reorganize
panes. Select a tab and click the window icon in the upper right corner of the pane to cause the tab to pop out
into its own pane that you can drag around or drop into an existing pane. When you open the editor for the
first time, certain useful panes might be turned off. You can turn panes on and off under the "view -> panes"
menu. My Spyder editor in this intro has the following panes turned on: Editor, Console, IPython Console,
Variable Explorer, Object Inspector, File Explorer and History Log. I organized my editor into a 4-pane layout
that mirrors R's popular RStudio code editor:
The upper left pane is a code editor that contains a tabbed list of code files. This is where you write code
you want to save and run. To run code written in your code editor, highlight the code you want to run, hold
shift and press enter. You can also click the green run button (looks like a play button) to run the entire code
file.
The pane in the upper right consists of two tabs: the variable explorer tab and history tab. The history tab
shows a list of commands you've run and the variable explorer shows a summary of the variables and data
structures you've defined.
The pane in the bottom left corner is the interactive Python console. The console is where you enter Python
code and view its output. When you run code from the code editor, the output appears in the console. You
can also type code directly into the console and run it by pressing the enter key.
The pane in the bottom right consists of two tabs: the object inspector and the file explorer. The object
inspector lets you view help information on objects in the console by typing the object's name into
the search bar or placing your cursor in front of the object in the console and pressing control + I. The file
explorer lets you navigate your computer's file system.
For demonstration purposes, I added some code to my editor and ran it. The code in the editor window is:
In [1]:
# Let's make a list!

my_list = [1,2,3,4,5,6,7,8,9,10]

print ( len(my_list) )
10
*Note: Code in this guide consists of blocks of input labeled "In" and the corresponding output appears
below the input block (in this case, the number 10).
The first line of the code starts with a pound symbol "#". In Python, # defines a comment: a bit of text that
the coder adds to explain something about the program that is not actually a part of the code that is
executed.
The second line defines a new variable my_list.
Finally the third line prints the length of the my_list variable.
Notice that upon running the file, the number 10 appears in the console, but no other output appears.
Comments and variable definitions produce no output, so the only output we see is the result of the print
statement: the length of my_list, which is 10.
Also note that the variable my_list has appeared in the variable explorer pane. The pane shows the
variable's type, size and a summary of its values. You can double click on a variable in the explorer window
to get a more detailed view of the variable and even edit individual values it contains:
Finally notice the search for "list" in the bottom right object inspector pane, which pulled up
a short description of the list() function:

Spyder has a lot of similarities to R's popular RStudio code editor, which makes it a little bit easier to
transition from one language to the other than it might be if you used a different editor. That said, this guide
doesn't assume you are using any particular editor: you can use anything you like as long as you can run
Python code.
Looking Ahead
Now that you have Python installed, you are ready to start learning Python for data analysis. We'll start slow,
but by the end of this guide you'll have all the tools necessary to load, clean, explore, analyze and create
predictive models from data.
Python for Data Analysis Part 2: Python Arithmetic
In this lesson, we'll explore Python's ability to perform basic computations. Although Python is a fully-fledged
programming language, the console can serve as a powerful calculator and interactive tool for exploring and
analyzing data. To follow along with this lesson, you'll need an open Python console, so launch Spyder (or
your other favorite Python IDE) and use its console.
We'll start by going over the basic mathematical operators built into Python:
In [1]:
# Use + for addition:

4 + 9
Out[1]:
13
In [2]:
# Use - for subtraction:

5 - 10
Out[2]:
-5
In [3]:
# Use * for multiplication:

8 * 4
Out[3]:
32
In [4]:
# Use / for division:

100 / 3
Out[4]:
33.333333333333336
In [5]:
# Use // for floor division (division rounded down to the nearest whole number):

100 // 3
Out[5]:
33
In [6]:
# Use ** for exponentiation

2**4
Out[6]:
16
Math expressions in Python follow the normal order of operations, so * and / are executed before + and -, and
exponentiation with ** is executed before multiplication and division.
In [7]:
# These operations are executed in reverse order of appearance.

2+3*5**2
Out[7]:
77
You can use parentheses in your math expressions to ensure that operations are carried out in the correct
order. Operations within parentheses are carried out before operations that are external to the parentheses,
just like you'd expect.
In [8]:
# Now the addition comes first and the exponentiation comes last.

((2+3) * 5 )**2
Out[8]:
625
If you're new to programming, you may not be familiar with the modulus operator, but it is another common
math symbol that returns the remainder you'd get when dividing two numbers. Use the percent symbol to
take the modulus:
In [9]:
# Use % for modulus

100 % 30
Out[9]:
10
Take the absolute value of a number with the abs() function:
In [10]:
abs(-30)
Out[10]:
30
Beyond abs() and the built in symbolic operators, Python doesn't have many math functions available by
default. Instead, many common math functions are contained in libraries you can load into your project as
necessary. The "math" module contains many additional functions of interest. Let's import the math module
and explore some of its functions:
In [11]:
import math # Load the math module
In [12]:
# math.log() takes the natural logarithm of its argument:

math.log(2.7182)
Out[12]:
0.9999698965391098
In [13]:
# Add a second argument to specify the log base:

math.log(100, 10) # Take the log base 10 of 100


Out[13]:
2.0
In [14]:
math.log(64, 2) # Take the log base 2 of 64
Out[14]:
6.0
In [15]:
# math.exp() raises e to the power of its argument

math.exp(10)
Out[15]:
22026.465794806718
In [16]:
# If you ever need the constants e or pi you can use:

math.e # Get the constant e


Out[16]:
2.718281828459045
In [17]:
math.pi # Get the constant pi
Out[17]:
3.141592653589793
In [18]:
# Use math.sqrt() to take the square root of a number:

math.sqrt(64)
Out[18]:
8.0
In [19]:
# Use round() to round a number to the nearest whole number:
In [20]:
round(233.234)
Out[20]:
233
In [21]:
# Add a second argument to round to a specified decimal place

round(233.234, 1) # round to 1 decimal place


Out[21]:
233.2
In [22]:
# Enter a negative number to round to the left of the decimal

round(233.234, -1) # round to the 10's place


Out[22]:
230.0
In [23]:
# Round down to the nearest whole number with math.floor()

math.floor(2.8)
Out[23]:
2
In [24]:
# Round up with math.ceil()

math.ceil(2.2)
Out[24]:
3
Common trigonometric functions are also available in the math module. The trig functions assume your
argument is expressed in terms of radians, not degrees.
In [25]:
math.cos(0) # Cosine
Out[25]:
1.0
In [26]:
math.sin(math.pi/2) # Sine
Out[26]:
1.0
In [27]:
math.tan(math.pi/4) # Tangent
Out[27]:
0.9999999999999999
In [28]:
math.acos(1) # Inverse Cosine
Out[28]:
0.0
In [29]:
math.asin(1) # Inverse Sine
Out[29]:
1.5707963267948966
In [30]:
math.atan(1) # Inverse Tangent
Out[30]:
0.7853981633974483
Convert between radians and degrees with math.radians() and math.degrees():
In [31]:
math.radians(180) # Convert degrees to radians
Out[31]:
3.141592653589793
In [32]:
math.degrees(math.pi) # Convert radians to degrees
Out[32]:
180.0
Wrap Up
Any time you need to perform a common mathematical operation in Python, it is probably available in a
library that is one import statement away. Python's Anaconda distribution comes with most of the libraries
we'll use in this guide, but there are many more that can extend Python's functionality even further. When in
doubt, try Google. Helpful blog posts and answers posted on programming sites like stackoverflow can often
save you a lot of time and help you learn better ways of doing things.
Python for Data Analysis Part 3: Basic Data Types
In the last lesson we learned that Python can act as a powerful calculator, but numbers are just one of many
basic data types you'll encounter in data analysis. A solid understanding of basic data types is essential for
working with data in Python.
Integers
Integers or "ints" for short, are whole-numbered numeric values. Any positive or negative number (or 0)
without a decimal is an integer in Python. Integer values have unlimited precision, meaning an integer is
always exact. You can check the type of a Python object with the type() function. Let's run type() on an
integer:
In [1]:
type(12)
Out[1]:
int
Above we see that 12 is of type "int". You can also use the function isinstance() to check
whether an object is an instance of a given type:
In [2]:
# Check if 12 is an instance of type "int"

isinstance(12, int)
Out[2]:
True
The code output True confirms that 12 is an int.
Integers support all the basic math operations we covered last time. If a math operation involving integers
would result in a non-integer (decimal) value, the result becomes a float:
In [3]:
1/3 # A third is not a whole number*
Out[3]:
0.3333333333333333
In [4]:
type(1/3) # So the type of the result is not an int
Out[4]:
float
*Note: In Python 2, integer division performs floor division instead of converting the ints to floats as we see
here in Python 3, so 1/3 would return 0 instead of 0.3333333.
Floats
Floating point numbers or "floats" are numbers with decimal values. Unlike integers, floating point numbers
don't have unlimited precision, because many decimal values (not just irrational numbers with infinitely many
digits) cannot be stored exactly in the computer's binary format. Instead, the computer stores an approximation,
so there can be small rounding errors in floats. This error is so minuscule it usually isn't of concern to us, but it can add up in
certain cases when making many repeated calculations.
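A quick illustration of this kind of rounding error:

0.1 + 0.2   # Evaluates to 0.30000000000000004 rather than exactly 0.3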
Every number in Python with a decimal point is a float, even if there are no non-zero numbers after the
decimal:
In [5]:
type(1.0)
Out[5]:
float
In [6]:
isinstance(0.33333, float)
Out[6]:
True
The arithmetic operations we learned last time work on floats as well as ints. If you use both floats and ints
in the same math expression the result is a float:
In [7]:
5 + 1.0
Out[7]:
6.0
You can convert a float to an integer using the int() function:
In [8]:
int(6.0)
Out[8]:
6
You can convert an integer to a float with the float() function:
In [9]:
float(6)
Out[9]:
6.0
Floats can also take on a few special values: Inf, -Inf and NaN. Inf and -Inf stand for infinity and negative
infinity respectively and NaN stands for "not a number", which is sometimes used as a placeholder for
missing or erroneous numerical values.
In [10]:
type ( float ("Inf") )
Out[10]:
float
In [11]:
type ( float ("NaN") )
Out[11]:
float
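The math module (introduced in the last lesson) includes helpers for detecting these special values; a brief sketch:

import math

math.isnan(float("NaN"))   # True: NaN is not equal to anything (even itself), so use isnan() to detect it
math.isinf(float("Inf"))   # True: detects both Inf and -Inf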
*Note: Python contains a third, uncommon numeric data type "complex" which is used to store complex
numbers.
Booleans
Booleans or "bools" are true/false values that result from logical statements. In Python, booleans start with
the first letter capitalized so True and False are recognized as bools but true and false are not. We've
already seen an example of booleans when we used the isinstance() function above.
In [12]:
type(True)
Out[12]:
bool
In [13]:
isinstance(False, bool) # Check if False is of type bool
Out[13]:
True
You can create boolean values with logical expressions. Python supports all of the standard logic operators
you'd expect:
In [14]:
# Use > and < for greater than and less than:

20>10
Out[14]:
True
In [15]:
20<5
Out[15]:
False
In [16]:
# Use >= and <= for greater than or equal and less than or equal:

20>=20
Out[16]:
True
In [17]:
30<=29
Out[17]:
False
In [18]:
# Use == (two equal signs in a row) to check equality:

10 == 10
Out[18]:
True
In [19]:
"cat" == "cat"
Out[19]:
True
In [20]:
True == False
Out[20]:
False
In [21]:
40 == 40.0 # Equivalent ints and floats are considered equal
Out[21]:
True
In [22]:
# Use != to check inequality. (think of != as "not equal to")

1 != 2
Out[22]:
True
In [23]:
10 != 10
Out[23]:
False
In [24]:
# Use the keyword "not" for negation:

not False
Out[24]:
True
In [25]:
not (2==2)
Out[25]:
False
In [26]:
# Use the keyword "and" for logical and:

(2 > 1) and (10 > 9)


Out[26]:
True
In [27]:
False and True
Out[27]:
False
In [28]:
# Use the keyword "or" for logical or:

(2 > 3) or (10 > 9)


Out[28]:
True
In [29]:
False or True
Out[29]:
True
Similar to math expressions, logical expressions have a fixed order of operations. In a logical statement,
"not" is executed first, followed by "or" and finally "and". Equalities and inequalities are executed last. Use
parentheses to enforce the desired order of operations.
In [30]:
2 > 1 or 10 < 8 and not True
Out[30]:
True
In [31]:
((2 > 1) or (10 < 8)) and not True
Out[31]:
False
You can convert numbers into boolean values using the bool() function. All numbers other than 0 convert to
True:
In [32]:
bool(1)
Out[32]:
True
In [33]:
bool(-12.5)
Out[33]:
True
In [34]:
bool(0)
Out[34]:
False
Strings
Text data in Python is known as a string or str. Surround text with single or double quotation marks to create
a string:
In [35]:
type("cat")
Out[35]:
str
In [36]:
type('1')
Out[36]:
str
In [37]:
isinstance("hello!", str)
Out[37]:
True
You can define a multi-line string using triple quotes:
In [38]:
print( """This string spans
multiple lines """ )
This string spans
multiple lines
You can convert numbers from their integer or float representation to a string representation and vice versa
using the int(), float() and str() functions:
In [39]:
str(1) # Convert an int to a string
Out[39]:
'1'
In [40]:
str(3.333) # Convert a float to a string
Out[40]:
'3.333'
In [41]:
int('1') # Convert a string to an int
Out[41]:
1
In [42]:
float('3.333') # Convert a string to a float
Out[42]:
3.333
A pair of quotation marks with nothing in between them (such as '' or "") is known as the empty string. The
empty string often represents a missing text value.
Numeric data and logical data are generally well-behaved, but strings of text data can be very messy and
difficult to work with. Cleaning text data is often one of the most laborious steps in preparing real data sets
for analysis. We will revisit strings and functions to help you clean text data in a future lesson.
None
In Python, "None" is a special data type that is often used to represent a missing value. For example, if you
define a function that doesn't return anything (does not give you back some resulting value) it will return
"None" by default.
In [43]:
type(None)
Out[43]:
NoneType
In [44]:
# Define a function that prints the input but returns nothing*

def my_function(x):
    print(x)

my_function("hello") == None # The output of my_function equals None


hello
Out[44]:
True
*Note: We will cover defining custom functions in detail in a future lesson.
Wrap Up
This lesson covered the most common basic data types in Python, but it is not an exhaustive list of Python
data objects or their functions. The Python language's official documentation has a more thorough summary
of built-in types, but it is a bit more verbose and detailed than is necessary when you are first getting started
with the language.
Now that we know about the basic data types, it would be nice to know how to save values to use them
later. We'll cover that in the next lesson.
Python for Data Analysis Part 4: Variables
In this short lesson, we'll learn how to assign variables in Python. A variable is a name you assign to an
object. In Python, everything is an object, from the basic data types we learned about last time, to complex
data structures and functions. After assigning an object to a variable name, you can access the object using
that name.
You assign variables in Python using the single equals sign "=" operator:
In [1]:
# Assign some variables

x = 10
y = "Life is Study"
z = (2+3)**2

# Print the variables to the screen:*

print(x)
print(y)
print(z)
10
Life is Study
25
*Note: assigning a variable does not produce any output.
It is good practice to put a space between the variable, assignment operator and value for readability:
In [2]:
p=8 # This works, but it looks messy.
print(p)

p = 10 # Use spaces instead


print(p)
8
10
As shown above, you can reassign a variable after creating it by assigning the variable name a new value.
After assigning variables, you can perform operations on the objects assigned to them using their names:
In [3]:
x + z + p
Out[3]:
45
You can assign the same object to multiple variables with a multiple assignment statement.
In [4]:
n = m = 4
print(n)
print(m)
4
4
You can also assign several different variables at the same time using a comma separated sequence of
variable names followed by the assignment operator and a comma separated sequence of values inside
parentheses:
In [5]:
# Assign 3 variables at the same time:

x, y, z = (10, 20 ,30)

print(x)
print(y)
print(z)
10
20
30
This method of extracting variables from a comma separated sequence is known as "tuple unpacking."
You can swap the values of two variables using a similar syntax:
In [6]:
(x, y) = (y, x)

print(x)
print(y)
20
10
We'll learn more about tuples in the next lesson, but these are very common and convenient methods of
assigning and altering variables in Python.
When you assign a variable in Python, the variable is a reference to a specific object in the computer's
memory. Reassigning a variable simply switches the reference to a different object in memory. If the object a
variable refers to in memory is altered in some way, the value of the variable corresponding to the altered
object will also change. All of the basic data types we've seen thus far are immutable, meaning they cannot
be changed after they are created. If you perform some operation that appears to alter an immutable object,
it is actually creating a totally new object in memory, rather than changing the original immutable object.
Consider the following example:
In [7]:
x = "Hello" # Create a new string
y = x # Assign y the same object as x
y = y.lower() # Assign y the result of y.lower()
print(x)
print(y)
Hello
hello
In the case above, we first assign x the value "Hello", a string object stored somewhere in memory. Next we
use the string method lower() to make the string assigned to y lowercase. Since strings are immutable,
Python creates an entirely new string, "hello" and stores it somewhere in memory separate from the original
"Hello" object. As a result, x and y refer to different objects in memory and produce different results when
printed to the console.
By contrast, lists are a mutable data structure that can hold multiple objects. If you alter a list, Python doesn't
make an entirely new list in memory: it changes the actual list object itself. This can lead to seemingly
inconsistent and confusing behavior:
In [8]:
x = [1,2,3] # Create a new list
y = x # Assign y the same object as x
y.append(4) # Add 4 to the end of list y
print(x)
print(y)
[1, 2, 3, 4]
[1, 2, 3, 4]
In this case, x and y still both refer to the same original list, so both x and y have the same value, even
though it may appear that the code only added the number 4 to list y.
Wrap Up
Variables are a basic coding construct used across all programming languages and applications. Many data
applications involve assigning data to some variables and then passing those variables on to functions
that perform various operations on the data.

This lesson briefly introduced the concept of tuples and lists, which are sequence data types that can hold
several values. In the next lesson, we'll dig deeper into these sorts of compound data types.
Python for Data Analysis Part 5: Lists
Most of the individual data values you work with will take the form of one of the basic data types we learned
about in lesson 3, but data analysis involves working with sets of related records that need to be grouped
together. Sequences in Python are data structures that hold objects in an ordered array. In this lesson, we'll
learn about lists, one of the most common sequence data types in Python.
List Basics
A list is a mutable, ordered collection of objects. "Mutable" means a list can be altered after it is created. You
can, for example, add new items to a list or remove existing items. Lists are heterogeneous, meaning they
can hold objects of different types.
Construct a list with a comma separated sequence of objects within square brackets:
In [1]:
my_list = ["Lesson", 5, "Is Fun?", True]

print(my_list)
['Lesson', 5, 'Is Fun?', True]
Alternatively, you can construct a list by passing some other iterable into the list() function.
An iterable describes an object you can look through one item at a time, such as lists, tuples, strings and
other sequences.
In [2]:
second_list = list("Life is Study") # Create a list from a string

print(second_list)
['L', 'i', 'f', 'e', ' ', 'i', 's', ' ', 'S', 't', 'u', 'd', 'y']
A list with no contents is known as the empty list:
In [3]:
empty_list = []

print( empty_list )
[]
You can add an item to an existing list with the list.append() function:
In [4]:
empty_list.append("I'm no longer empty!")

print(empty_list)
["I'm no longer empty!"]
Remove a matching item from a list with list.remove():
In [5]:
my_list.remove(5)

print(my_list)
['Lesson', 'Is Fun?', True]
*Note: Remove deletes the first matching item only.
Join two lists together with the + operator:
In [6]:
combined_list = my_list + empty_list
print(combined_list)
['Lesson', 'Is Fun?', True, "I'm no longer empty!"]
You can also add a sequence to the end of an existing list with the list.extend() function:
In [7]:
combined_list = my_list

combined_list.extend(empty_list)

print(combined_list)
['Lesson', 'Is Fun?', True, "I'm no longer empty!"]
Check the length, maximum, minimum and sum of a list with the len(), max(), min() and sum() functions,
respectively.
In [8]:
num_list = [1, 3, 5, 7, 9]
print( len(num_list)) # Check the length
print( max(num_list)) # Check the max
print( min(num_list)) # Check the min
print( sum(num_list)) # Check the sum
print( sum(num_list)/len(num_list)) # Check the mean*
5
9
1
25
5.0
*Note: Python does not have a built in function to calculate the mean, but the numpy library we will introduce
in upcoming lessons does.
You can check whether a list contains a certain object with the "in" keyword:
In [9]:
1 in num_list
Out[9]:
True
Add the keyword "not" to test whether a list does not contain an object:
In [10]:
1 not in num_list
Out[10]:
False
Count the occurrences of an object within a list using the list.count() function:
In [11]:
num_list.count(3)
Out[11]:
1
Other common list functions include list.sort() and list.reverse():
In [12]:
new_list = [1, 5, 4, 2, 3, 6] # Make a new list

new_list.reverse() # Reverse the list


print("Reversed list", new_list)

new_list.sort() # Sort the list


print("Sorted list", new_list)
Reversed list [6, 3, 2, 4, 5, 1]
Sorted list [1, 2, 3, 4, 5, 6]
List Indexing and Slicing
Lists and other Python sequences are indexed, meaning each position in the sequence has a corresponding
number called the index that you can use to look up the value at that position. Python sequences are zero-
indexed, so the first element of a sequence is at index position zero, the second element is at index 1 and so
on. Retrieve an item from a list by placing the index in square brackets after the name of the list:
In [13]:
another_list = ["Hello","my", "bestest", "old", "friend."]

print (another_list[0])
print (another_list[2])
Hello
bestest
If you supply a negative number when indexing into a list, it accesses items starting from the end of the list (-
1) going backward:
In [14]:
print (another_list[-1])
print (another_list[-3])
friend.
bestest
Supplying an index outside of a list's range will result in an IndexError:
In [15]:
print (another_list[5])
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-15-a9a01b5b15dc> in <module>()
----> 1 print (another_list[5])

IndexError: list index out of range


If your list contains other indexed objects, you can supply additional indexes to get items contained within
the nested objects:
In [17]:
nested_list = [[1,2,3],[4,5,6],[7,8,9]]

print (nested_list[0][2])
3
You can take a slice (sequential subset) of a list using the syntax [start:stop:step] where start and stop are
the starting and ending indexes for the slice and step controls how frequently you sample values along the
slice. The default step size is 1, meaning you take all values in the range provided, starting from the first, up
to but not including the last:
In [18]:
my_slice = another_list[1:3] # Slice index 1 and 2
print(my_slice )
['my', 'bestest']
In [19]:
# Slice the entire list but use step size 2 to get every other item:

my_slice = another_list[0:6:2]
print(my_slice )
['Hello', 'bestest', 'friend.']
You can leave the starting or ending index blank to slice from the beginning or up to the end of the list
respectively:
In [20]:
slice1 = another_list[:4] # Slice everything up to index 4
print(slice1)
['Hello', 'my', 'bestest', 'old']
In [21]:
slice2 = another_list[3:] # Slice everything from index 3 to the end
print(slice2)
['old', 'friend.']
If you provide a negative number as the step, the slice steps backward:
In [22]:
# Take a slice starting at index 4, backward to index 2

my_slice = another_list[4:2:-1]
print(my_slice )
['friend.', 'old']
If you don't provide a start or ending index, you take a slice of the entire list:
In [23]:
my_slice = another_list[:] # This slice operation copies the list
print(my_slice)
['Hello', 'my', 'bestest', 'old', 'friend.']
Using a step of -1 without a starting or ending index slices the entire list in reverse, providing a shorthand to
reverse a list:
In [24]:
my_slice = another_list[::-1] # This slice operation reverses the list
print(my_slice)
['friend.', 'old', 'bestest', 'my', 'Hello']
You can use indexing to change the values within a list or delete items in a list:
In [25]:
another_list[3] = "new" # Set the value at index 3 to "new"

print(another_list)

del(another_list[3]) # Delete the item at index 3

print(another_list)
['Hello', 'my', 'bestest', 'new', 'friend.']
['Hello', 'my', 'bestest', 'friend.']
You can also remove items from a list using the list.pop() function. pop() removes the final item in a list and
returns it:
In [26]:
next_item = another_list.pop()

print(next_item)
print(another_list)
friend.
['Hello', 'my', 'bestest']
Notice that the list resizes itself dynamically as you delete or add items to it. Appending items to lists and
removing items from the end of a list with list.pop() are very fast operations. Deleting items at the front of a list
or within the body of a list is much slower.
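Incidentally, list.pop() also accepts an index argument, so you can remove an item from any position, with the caveat that popping from the front of a long list is one of those slower operations. A brief sketch:

another_list.pop(0)   # Removes and returns the first item ('Hello' at this point)
print(another_list)   # ['my', 'bestest']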
Copying Lists
In the code above, we saw that we can slice an entire list using the [:] indexing operation. You can also copy a
list using the list.copy() function:
In [27]:
list1 = [1,2,3] # Make a list

list2 = list1.copy() # Copy the list

list1.append(4) # Add an item to list 1

print("List1:", list1) # Print both lists


print("List2:", list2)
List1: [1, 2, 3, 4]
List2: [1, 2, 3]
As expected, the copy was not affected by the append operation we performed on the original list. The copy
function (and slicing an entire list with [:]) creates what is known as a "shallow copy." A shallow copy makes
a new list where each list element refers to the object at the same position (index) in the original list. This is
fine when the list contains immutable objects like ints, floats and strings, since they cannot change.
Shallow copies can, however, have undesired consequences when copying lists that contain mutable
container objects, such as other lists.
Consider the following copy operation:
In [28]:
list1 = [1,2,3] # Make a list

list2 = ["List within a list", list1] # Nest it in another list

list3 = list2.copy() # Shallow copy list2

print("Before appending to list1:")


print("List2:", list2)
print("List3:", list3, "\n")
list1.append(4) # Add an item to list1
print("After appending to list1:")
print("List2:", list2)
print("List3:", list3)
Before appending to list1:
List2: ['List within a list', [1, 2, 3]]
List3: ['List within a list', [1, 2, 3]]

After appending to list1:


List2: ['List within a list', [1, 2, 3, 4]]
List3: ['List within a list', [1, 2, 3, 4]]
Notice that when we use a shallow copy on list2, the second element of list2 and its copy both refer to list1.
Thus, when we append a new value into list1, the second element of list2 and the copy, list3, both change.
When you are working with nested lists, you have to make a "deepcopy" if you want to truly copy nested
objects in the original to avoid this behavior of shallow copies.
You can make a deep copy using the deepcopy() function in the copy module:
In [29]:
import copy # Load the copy module

list1 = [1,2,3] # Make a list

list2 = ["List within a list", list1] # Nest it in another list

list3 = copy.deepcopy(list2) # Deep copy list2

print("Before appending to list1:")


print("List2:", list2)
print("List3:", list3, "\n")

list1.append(4) # Add an item to list1


print("After appending to list1:")
print("List2:", list2)
print("List3:", list3)
Before appending to list1:
List2: ['List within a list', [1, 2, 3]]
List3: ['List within a list', [1, 2, 3]]

After appending to list1:


List2: ['List within a list', [1, 2, 3, 4]]
List3: ['List within a list', [1, 2, 3]]
This time list3 is not changed when we append a new value into list1 because the second element in list3 is
a copy of list1 rather than a reference to list1 itself.
Wrap Up
Lists are one of the most ubiquitous data structures in Python, so it is important to be familiar with them,
even though specialized data structures are often better suited for data analysis tasks. Despite some quirks
like the shallow vs. deep copy issue, lists are very useful as simple data containers.
In the next lesson, we'll cover a couple more built in sequence objects.
Python for Data Analysis Part 6: Tuples and Strings
In the last lesson we learned about lists, Python's jack-of-all trades sequence data type. In this lesson we'll
take a look at 2 more Python sequences: tuples and strings.
Tuples
Tuples are an immutable sequence data type that are commonly used to hold short collections of related
data. For instance, if you wanted to store latitude and longitude coordinates for cities, tuples might be a good
choice, because the values are related and not likely to change. Like lists, tuples can store objects of
different types.
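For example, a latitude and longitude pair stored as a tuple might look like this (the coordinates are approximate and only for illustration):

paris_coordinates = (48.86, 2.35)   # (latitude, longitude) of Paris, approximately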
Construct a tuple with a comma separated sequence of objects within parentheses:
In [1]:
my_tuple = (1,3,5)
print(my_tuple)
(1, 3, 5)
Alternatively, you can construct a tuple by passing an iterable into the tuple() function:
In [2]:
my_list = [2,3,1,4]

another_tuple = tuple(my_list)

another_tuple
Out[2]:
(2, 3, 1, 4)
Tuples generally support the same indexing and slicing operations as lists and they also support some of the
same functions, with the caveat that tuples cannot be changed after they are created. This means we can do
things like find the length, max or min of a tuple, but we can't append new values to them or remove values
from them:
In [3]:
another_tuple[2] # You can index into tuples
Out[3]:
1
In [4]:
another_tuple[2:4] # You can slice tuples
Out[4]:
(1, 4)
In [5]:
# You can use common sequence functions on tuples:

print( len(another_tuple))
print( min(another_tuple))
print( max(another_tuple))
print( sum(another_tuple))
4
1
4
10
In [6]:
another_tuple.append(1) # You can't append to a tuple
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-6-26174f458975> in <module>()
----> 1 another_tuple.append(1) # You can't append to a tuple

AttributeError: 'tuple' object has no attribute 'append'


In [7]:
del another_tuple[1] # You can't delete from a tuple
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-c547ee9ba53d> in <module>()
----> 1 del another_tuple[1] # You can't delete from a tuple

TypeError: 'tuple' object doesn't support item deletion


You can sort the objects in tuple using the sorted() function, but doing so creates a new list containing the
result rather than sorting the original tuple itself like the list.sort() function does with lists:
In [8]:
sorted(another_tuple)
Out[8]:
[1, 2, 3, 4]
Although tuples are immutable themselves, they can contain mutable objects like lists. This means that the
contents of a shallow copy of a tuple containing a list will change if the nested list changes:
In [9]:
list1 = [1,2,3]

tuple1 = ("Tuples are Immutable", list1)


tuple2 = tuple1[:] # Make a shallow copy

list1.append("But lists are mutable")

print( tuple2 ) # Print the copy


('Tuples are Immutable', [1, 2, 3, 'But lists are mutable'])
To avoid this behavior, make a deepcopy using the copy library:
In [11]:
import copy

list1 = [1,2,3]

tuple1 = ("Tuples are Immutable", list1)

tuple2 = copy.deepcopy(tuple1) # Make a deep copy

list1.append("But lists are mutable")

print( tuple2 ) # Print the copy


('Tuples are Immutable', [1, 2, 3])
Strings
We already learned a little bit about strings in the lesson on basic data types, but strings are technically
sequences: immutable sequences of text characters. As sequences, they support indexing operations where
the first character of a string is index 0. This means we can get individual letters or slices of letters with
indexing:
In [12]:
my_string = "Hello world"
In [13]:
my_string[3] # Get the character at index 3
Out[13]:
'l'
In [14]:
my_string[3:] # Slice from the third index to the end
Out[14]:
'lo world'
In [15]:
my_string[::-1] # Reverse the string
Out[15]:
'dlrow olleH'
In addition, certain sequence functions like len() and count() work on strings:
In [16]:
len(my_string)
Out[16]:
11
In [17]:
my_string.count("l") # Count the l's in the string
Out[17]:
3
As immutable objects, you can't change a string itself: every time you transform a string with a function,
Python makes a new string object, rather than actually altering the original string that exists in your
computer's memory.
Strings have many associated functions. Some basic string functions include:
In [18]:
# str.lower()

my_string.lower() # Make all characters lowercase


Out[18]:
'hello world'
In [19]:
# str.upper()
my_string.upper() # Make all characters uppercase
Out[19]:
'HELLO WORLD'
In [20]:
# str.title()

my_string.title() # Make the first letter of each word uppercase


Out[20]:
'Hello World'
Find the index of the first occurrence of a substring within a string using str.find(). If the substring does not appear,
find() returns -1:
In [21]:
my_string.find("W")
Out[21]:
-1
Notice that since strings are immutable, we never actually changed the original value of my_string with any
of the code above, but instead generated new strings that were printed to the console. This means "W" does
not exist in my_string even though our call to str.title() produced the output 'Hello World'. The original
lowercase "w" still exists at index position 6:
In [22]:
my_string.find("w")
Out[22]:
6
Find and replace a target substring within a string using str.replace()
In [23]:
my_string.replace("world", # Substring to replace
"friend") # New substring
Out[23]:
'Hello friend'
Split a string into a list of substrings based on a given separating character with str.split():
In [24]:
my_string.split() # str.split() splits on spaces by default
Out[24]:
['Hello', 'world']
In [25]:
my_string.split("l") # Supply a substring to split on other values
Out[25]:
['He', '', 'o wor', 'd']
Split a multi-line string into a list of lines using str.splitlines():
In [26]:
multiline_string = """I am
a multiline
string!
"""

multiline_string.splitlines()
Out[26]:
['I am', 'a multiline ', 'string!']
Strip leading and trailing characters from both ends of a string with str.strip().
In [27]:
# str.strip() removes whitespace by default

" white space ".strip()


Out[27]:
'white space'
Override the default by supplying a string containing all characters you'd like to strip as an argument to the
function:
In [28]:
"xXxxBuyNOWxxXx".strip("xX")
Out[28]:
'BuyNOW'
You can strip characters from the left or right sides only with str.lstrip() and str.rstrip() respectively:
In [29]:
" white space ".lstrip()
Out[29]:
'white space '
In [30]:
" white space ".rstrip()
Out[30]:
' white space'
You can join (concatenate) two strings with the plus (+) operator:
In [31]:
"Hello " + "World"
Out[31]:
'Hello World'
Convert a list of strings into a single string separated by a given delimiter with str.join():
In [32]:
" ".join(["Hello", "World!", "Join", "Me!"])
Out[32]:
'Hello World! Join Me!'
Although the + operator works for string concatenation, things can get messy if you start trying to join more
than a couple values together with pluses.
In [33]:
name = "Joe"
age = 10
city = "Paris"

"My name is " + name + " I am " + str(age) + " and I live in " + "Paris"
Out[33]:
'My name is Joe I am 10 and I live in Paris'
For complex string operations of this sort, it is preferable to use the str.format() function. str.format() takes in a
template string with curly braces as placeholders for values you provide to the function as the arguments.
The arguments are then filled into the appropriate placeholders in the string:
In [34]:
template_string = "My name is {} I am {} and I live in {}"

template_string.format(name, age, city)


Out[34]:
'My name is Joe I am 10 and I live in Paris'
Read more about string formatting in the Python documentation.
Wrap Up
Basic sequences like lists, tuples and strings appear everywhere in Python code, so it is essential to
understand the basics of how they work before we can start using Python for data analysis. We're almost
ready to dive into data structures designed specifically for data analysis, but before we do, we need to cover
two more useful built in Python data structures: dictionaries and sets.
Python for Data Analysis Part 7: Dictionaries and Sets
Sequence data types like lists, tuples and strings are ordered. Ordering can be useful in some cases, such
as if your data is sorted or has some other natural sense of ordering, but it comes at a price. When you
search through sequences like lists, your computer has to go through each element one at a time to find an
object you're looking for.
Consider the following code:
In [1]:
my_list = [1,2,3,4,5,6,7,8,9,10]

0 in my_list
Out[1]:
False
When running the code above, Python has to search through the entire list, one item at a time before it
returns that 0 is not in the list. This sequential searching isn't much of a concern with small lists like this one,
but if you're working with data that contains thousands or millions of values, it can add up quickly.
Dictionaries and sets are unordered Python data structures that solve this issue using a technique
called hashing. We won't go into the details of their implementation, but dictionaries and sets let you check
whether they contain objects without having to search through each element one at a time, at the cost of
having no order and using more system memory.
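As a rough illustration (exact timings vary by machine), the standard library's timeit module shows the gap between a membership check on a list and on a set:

import timeit

setup = "data = list(range(100000)); data_set = set(data)"

timeit.timeit("99999 in data", setup=setup, number=1000)      # Scans the whole list each time
timeit.timeit("99999 in data_set", setup=setup, number=1000)  # Hash lookup: much faster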
Dictionaries
A dictionary or dict is an object that maps a set of named indexes called keys to a set of corresponding
values. Dictionaries are mutable, so you can add and remove keys and their associated values. A
dictionary's keys must be immutable objects, such as ints, strings or tuples, but the values can be anything.
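Before we get to the creation syntax below, here is a quick sketch of the immutable-key rule: a tuple works as a key, but a mutable object like a list does not:

coordinates_dict = {(48.86, 2.35): "Paris"}   # A tuple key is fine

# {[48.86, 2.35]: "Paris"}                    # A list key raises TypeError: unhashable type: 'list'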
Create a dictionary with a comma-separated list of key: value pairs within curly braces:
In [2]:
my_dict = {"name": "Joe",
"age": 10,
"city": "Paris"}

print(my_dict)
{'age': 10, 'city': 'Paris', 'name': 'Joe'}
Notice that in the printed dictionary, the items don't appear in the same order as when we defined it, since
dictionaries are unordered. Index into a dictionary using keys rather than numeric indexes:
In [3]:
my_dict["name"]
Out[3]:
'Joe'
In [4]:
my_dict["age"]
Out[4]:
10
Add new items to an existing dictionary with the following syntax:
In [5]:
my_dict["new_key"] = "new_value"

print(my_dict)
{'new_key': 'new_value', 'age': 10, 'city': 'Paris', 'name': 'Joe'}
Delete existing key: value pairs with del:
In [6]:
del my_dict["new_key"]

print(my_dict)
{'age': 10, 'city': 'Paris', 'name': 'Joe'}
Check the number of items in a dict with len():
In [7]:
len(my_dict)
Out[7]:
3
Check whether a certain key exists with "in":
In [8]:
"name" in my_dict
Out[8]:
True
You can access all the keys, all the values or all the key: value pairs of a dictionary with the keys(), values()
and items() functions respectively:
In [9]:
my_dict.keys()
Out[9]:
dict_keys(['age', 'city', 'name'])
In [10]:
my_dict.values()
Out[10]:
dict_values([10, 'Paris', 'Joe'])
In [11]:
my_dict.items()
Out[11]:
dict_items([('age', 10), ('city', 'Paris'), ('name', 'Joe')])
Real world data often comes in the form of tables of rows and columns, where each column specifies a
different data feature like name or age and each row represents an individual record. We can encode this
sort of tabular data in a dictionary by assigning each column label a key and then storing the column values
as a list.
Consider the following table:
name     age    city
Joe      10     Paris
Bob      15     New York
Harry    20     Tokyo

We can store this data in a dictionary like so:


In [12]:
my_table_dict = {"name": ["Joe", "Bob", "Harry"],
"age": [10,15,20] ,
"city": ["Paris", "New York", "Tokyo"]}
Certain data formats like XML and JSON have a non-tabular, nested structure. Python dictionaries can
contain other dictionaries, so they can mirror this sort of nested structure, providing a convenient interface
for working with these sorts of data formats in Python. (We'll cover loading data into Python in a future
lesson.)
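A small sketch of a nested dictionary that mirrors a JSON-like record (the field names here are just illustrative):

person = {"name": "Joe",
          "address": {"city": "Paris",
                      "country": "France"}}

person["address"]["city"]   # Index into the nested dictionary: returns 'Paris'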
Sets
Sets are unordered, mutable collections of objects that cannot contain duplicates. Sets are useful for storing
and performing operations on data where each value is unique.
Create a set with a comma separated sequence of values within curly braces:
In [13]:
my_set = {1,2,3,4,5,6,7}

type(my_set)
Out[13]:
set
Add and remove items from a set with add() and remove() respectively:
In [14]:
my_set.add(8)

my_set
Out[14]:
{1, 2, 3, 4, 5, 6, 7, 8}
In [15]:
my_set.remove(7)

my_set
Out[15]:
{1, 2, 3, 4, 5, 6, 8}
Sets do not support indexing, but they do support basic sequence functions like len(), min(), max() and
sum(). You can also check membership and non-membership as usual with in:
In [16]:
6 in my_set
Out[16]:
True
In [17]:
7 in my_set
Out[17]:
False
The main purpose of sets is to perform set operations that compare or combine different sets. Python sets
support many common mathematical set operations like union, intersection, difference and checking
whether one set is a subset of another:
In [18]:
set1 = {1,3,5,6}
set2 = {1,2,3,4}

set1.union(set2) # Get the union of two sets


Out[18]:
{1, 2, 3, 4, 5, 6}
In [19]:
set1.intersection(set2) # Get the intersection of two sets
Out[19]:
{1, 3}
In [20]:
set1.difference(set2) # Get the difference between two sets
Out[20]:
{5, 6}
In [21]:
set1.issubset(set2) # Check whether set1 is a subset of set2
Out[21]:
False
You can convert a list into a set using the set() function. Converting a list to a set drops any duplicate
elements in the list. This can be a useful way to strip unwanted duplicate items or count the number of
unique elements in a list:
In [22]:
my_list = [1,2,2,2,3,3,4,5,5,5,6]

set(my_list)
Out[22]:
{1, 2, 3, 4, 5, 6}
In [23]:
len(set(my_list))
Out[23]:
6
Wrap up
Dictionaries are general-purpose data structures capable of encoding both tabular and non-tabular data. As
basic built in Python data structures, however, they lack many of the conveniences we'd like when working
with tabular data, like the ability to look at summary statistics for each column and transform the data quickly
and easily. In the next two lessons, we'll look at data structures available in Python packages designed for
data analysis: numpy arrays and pandas DataFrames.
Python for Data Analysis Part 8: Numpy Arrays
Python's built in data structures are great for general-purpose programming, but they lack specialized
features we'd like for data analysis. For example, adding rows or columns of data in an element-wise fashion
and performing math operations on two dimensional tables (matrices) are common tasks that aren't readily
available with Python's base data types. In this lesson we'll learn about ndarrays, a data structure available
in Python's numpy library that implements a variety of useful functions for analyzing data.
Numpy and ndarray Basics
The numpy library is one of the core packages in Python's scientific software stack. Many other Python data
analysis libraries require numpy as a prerequisite, because they use its ndarray data structure as a building
block. The Anaconda Python distribution we installed in part 1 comes with numpy.

Numpy implements a data structure called the N-dimensional array or ndarray. ndarrays are similar to lists in
that they contain a collection of items that can be accessed via indexes. On the other hand, ndarrays are
homogeneous, meaning they can only contain objects of the same type and they can be multi-dimensional,
making it easy to store 2-dimensional tables or matrices.

To work with ndarrays, we need to load the numpy library. It is standard practice to load numpy with the
alias "np" like so:
In [1]:
import numpy as np
The "as np" after the import statement lets us access the numpy library's functions using the shorthand "np."
Create an ndarray by passing a list to np.array() function:
In [2]:
my_list = [1, 2, 3, 4] # Define a list

my_array = np.array(my_list) # Pass the list to np.array()

type(my_array) # Check the object's type


Out[2]:
numpy.ndarray
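Because ndarrays are homogeneous, mixing types in the input list makes numpy upcast everything to a common type; a quick illustration:

np.array([1, 2, 2.5])   # The ints are upcast to floats: array([1. , 2. , 2.5])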
To create an array with more than one dimension, pass a nested list to np.array():
In [3]:
second_list = [5, 6, 7, 8]

two_d_array = np.array([my_list, second_list])

print(two_d_array)
[[1 2 3 4]
[5 6 7 8]]
An ndarray is defined by the number of dimensions it has, the size of each dimension and the type of data it
holds. Check the number and size of dimensions of an ndarray with the shape attribute:
In [4]:
two_d_array.shape
Out[4]:
(2, 4)
The output above shows that this ndarray is 2-dimensional, since there are two values listed, and the
dimensions have length 2 and 4. Check the total size (total number of items) in an array with the size
attribute:
In [5]:
two_d_array.size
Out[5]:
8
Check the type of the data in an ndarray with the dtype attribute:
In [6]:
two_d_array.dtype
Out[6]:
dtype('int32')
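If you want to control the data type yourself, you can pass a dtype argument when creating the array; a brief sketch:

np.array([1, 2, 3], dtype="float64")   # array([1., 2., 3.])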
Numpy has a variety of special array creation functions. Some handy array creation functions include:
In [7]:
# np.identity() to create a square 2d array with 1's across the diagonal

np.identity(n = 5) # Size of the array


Out[7]:
array([[ 1., 0., 0., 0., 0.],
[ 0., 1., 0., 0., 0.],
[ 0., 0., 1., 0., 0.],
[ 0., 0., 0., 1., 0.],
[ 0., 0., 0., 0., 1.]])
In [8]:
# np.eye() to create a 2d array with 1's across a specified diagonal

np.eye(N = 3, # Number of rows


M = 5, # Number of columns
k = 1) # Index of the diagonal (main diagonal (0) is default)
Out[8]:
array([[ 0., 1., 0., 0., 0.],
[ 0., 0., 1., 0., 0.],
[ 0., 0., 0., 1., 0.]])
In [9]:
# np.ones() to create an array filled with ones:

np.ones(shape= [2,4])
Out[9]:
array([[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.]])
In [10]:
# np.zeros() to create an array filled with zeros:

np.zeros(shape= [4,6])
Out[10]:
array([[ 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0.]])
Array Indexing and Slicing
Numpy ndarrays offer numbered indexing and slicing syntax that mirrors the syntax for Python lists:
In [11]:
one_d_array = np.array([1,2,3,4,5,6])

one_d_array[3] # Get the item at index 3


Out[11]:
4
In [12]:
one_d_array[3:] # Get a slice from index 3 to the end
Out[12]:
array([4, 5, 6])
In [13]:
one_d_array[::-1] # Slice backwards to reverse the array
Out[13]:
array([6, 5, 4, 3, 2, 1])
If an ndarray has more than one dimension, separate indexes for each dimension with a comma:
In [14]:
# Create a new 2d array
two_d_array = np.array([one_d_array, one_d_array + 6, one_d_array + 12])

print(two_d_array)
[[ 1 2 3 4 5 6]
[ 7 8 9 10 11 12]
[13 14 15 16 17 18]]
In [15]:
# Get the element at row index 1, column index 4

two_d_array[1, 4]
Out[15]:
11
In [16]:
# Slice elements starting at row index 1 and column index 4

two_d_array[1:, 4:]
Out[16]:
array([[11, 12],
[17, 18]])
In [17]:
# Reverse both dimensions (180 degree rotation)

two_d_array[::-1, ::-1]
Out[17]:
array([[18, 17, 16, 15, 14, 13],
[12, 11, 10, 9, 8, 7],
[ 6, 5, 4, 3, 2, 1]])
Reshaping Arrays
Numpy has a variety of built in functions to help you manipulate arrays quickly without having to use
complicated indexing operations.
Reshape an array into a new array with the same data but different structure with np.reshape():
In [18]:
np.reshape(a=two_d_array, # Array to reshape
newshape=(6,3)) # Dimensions of the new array
Out[18]:
array([[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9],
[10, 11, 12],
[13, 14, 15],
[16, 17, 18]])
Unravel a multi-dimensional array into 1 dimension with np.ravel():
In [19]:
np.ravel(a=two_d_array,
order='C') # Use C-style unraveling (by rows)
Out[19]:
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18])
In [20]:
np.ravel(a=two_d_array,
order='F') # Use Fortran-style unraveling (by columns)
Out[20]:
array([ 1, 7, 13, 2, 8, 14, 3, 9, 15, 4, 10, 16, 5, 11, 17, 6, 12,
18])
Alternatively, use ndarray.flatten() to flatten a multi-dimensional array into 1 dimension and return a copy of the
result:
In [21]:
two_d_array.flatten()
Out[21]:
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18])
Get the transpose of an array with ndarray.T:
In [22]:
two_d_array.T
Out[22]:
array([[ 1, 7, 13],
[ 2, 8, 14],
[ 3, 9, 15],
[ 4, 10, 16],
[ 5, 11, 17],
[ 6, 12, 18]])
Flip an array vertically or horizontally with np.flipud() and np.fliplr() respectively:
In [23]:
np.flipud(two_d_array)
Out[23]:
array([[13, 14, 15, 16, 17, 18],
[ 7, 8, 9, 10, 11, 12],
[ 1, 2, 3, 4, 5, 6]])
In [24]:
np.fliplr(two_d_array)
Out[24]:
array([[ 6, 5, 4, 3, 2, 1],
[12, 11, 10, 9, 8, 7],
[18, 17, 16, 15, 14, 13]])
Rotate an array 90 degrees counter-clockwise with np.rot90():
In [25]:
np.rot90(two_d_array,
k=1) # Number of 90 degree rotations
Out[25]:
array([[ 6, 12, 18],
[ 5, 11, 17],
[ 4, 10, 16],
[ 3, 9, 15],
[ 2, 8, 14],
[ 1, 7, 13]])
Shift elements in an array along a given dimension with np.roll():
In [26]:
np.roll(a= two_d_array,
shift = 2, # Shift elements 2 positions
axis = 1) # In each row
Out[26]:
array([[ 5, 6, 1, 2, 3, 4],
[11, 12, 7, 8, 9, 10],
[17, 18, 13, 14, 15, 16]])
Leave the axis argument empty to shift on a flattened version of the array (shift across all dimensions):
In [27]:
np.roll(a= two_d_array,
shift = 2)
Out[27]:
array([[17, 18, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15, 16]])
Join arrays along an axis with np.concatenate():
In [28]:
array_to_join = np.array([[10,20,30],[40,50,60],[70,80,90]])

np.concatenate( (two_d_array,array_to_join), # Arrays to join


axis=1) # Axis to join upon
Out[28]:
array([[ 1, 2, 3, 4, 5, 6, 10, 20, 30],
[ 7, 8, 9, 10, 11, 12, 40, 50, 60],
[13, 14, 15, 16, 17, 18, 70, 80, 90]])
Array Math Operations
Creating and manipulating arrays is nice, but the true power of numpy arrays is the ability to perform
mathematical operations on many values quickly and easily. Unlike built in Python objects, you can use
math operators like +, -, / and * to perform basic math operations with ndarrays:
In [29]:
two_d_array + 100 # Add 100 to each element
Out[29]:
array([[101, 102, 103, 104, 105, 106],
[107, 108, 109, 110, 111, 112],
[113, 114, 115, 116, 117, 118]])
In [30]:
two_d_array - 100 # Subtract 100 from each element
Out[30]:
array([[-99, -98, -97, -96, -95, -94],
[-93, -92, -91, -90, -89, -88],
[-87, -86, -85, -84, -83, -82]])
In [31]:
two_d_array * 2 # Multiply each element by 2
Out[31]:
array([[ 2, 4, 6, 8, 10, 12],
[14, 16, 18, 20, 22, 24],
[26, 28, 30, 32, 34, 36]])
In [32]:
two_d_array / 2 # Divide each element by 2
Out[32]:
array([[ 0.5, 1. , 1.5, 2. , 2.5, 3. ],
[ 3.5, 4. , 4.5, 5. , 5.5, 6. ],
[ 6.5, 7. , 7.5, 8. , 8.5, 9. ]])
In [33]:
two_d_array ** 2 # Square each element
Out[33]:
array([[ 1, 4, 9, 16, 25, 36],
[ 49, 64, 81, 100, 121, 144],
[169, 196, 225, 256, 289, 324]])
In [34]:
two_d_array % 2 # Take modulus of each element
Out[34]:
array([[1, 0, 1, 0, 1, 0],
[1, 0, 1, 0, 1, 0],
[1, 0, 1, 0, 1, 0]], dtype=int32)
Beyond operating on each element of an array with a single scalar value, you can also use the basic math
operators on two arrays with the same shape. When operating on two arrays, the basic math operators
function in an element-wise fashion, returning an array with the same shape as the original:
In [35]:
small_array1 = np.array([[1,2],[3,4]])

small_array1 + small_array1
Out[35]:
array([[2, 4],
[6, 8]])
In [36]:
small_array1 - small_array1
Out[36]:
array([[0, 0],
[0, 0]])
In [37]:
small_array1 * small_array1
Out[37]:
array([[ 1, 4],
[ 9, 16]])
In [38]:
small_array1 / small_array1
Out[38]:
array([[ 1., 1.],
[ 1., 1.]])
In [39]:
small_array1 ** small_array1
Out[39]:
array([[ 1, 4],
[ 27, 256]], dtype=int32)
Numpy also offers a variety of named math functions for ndarrays. There are too many to cover in detail
here, so we'll just look at a selection of the most useful ones for data analysis:
In [40]:
# Get the mean of all the elements in an array with np.mean()

np.mean(two_d_array)
Out[40]:
9.5
In [41]:
# Provide an axis argument to get means across a dimension

np.mean(two_d_array,
axis = 1) # Get means of each row
Out[41]:
array([ 3.5, 9.5, 15.5])
In [42]:
# Get the standard deviation of all the elements in an array with np.std()

np.std(two_d_array)
Out[42]:
5.1881274720911268
In [43]:
# Provide an axis argument to get standard deviations across a dimension
np.std(two_d_array,
axis = 0) # Get stdev for each column
Out[43]:
array([ 4.89897949, 4.89897949, 4.89897949, 4.89897949, 4.89897949,
4.89897949])
In [44]:
# Sum the elements of an array across an axis with np.sum()

np.sum(two_d_array,
axis=1) # Get the row sums
Out[44]:
array([21, 57, 93])
In [45]:
np.sum(two_d_array,
axis=0) # Get the column sums
Out[45]:
array([21, 24, 27, 30, 33, 36])
In [46]:
# Take the log of each element in an array with np.log()

np.log(two_d_array)
Out[46]:
array([[ 0. , 0.69314718, 1.09861229, 1.38629436, 1.60943791,
1.79175947],
[ 1.94591015, 2.07944154, 2.19722458, 2.30258509, 2.39789527,
2.48490665],
[ 2.56494936, 2.63905733, 2.7080502 , 2.77258872, 2.83321334,
2.89037176]])
In [47]:
# Take the square root of each element with np.sqrt()

np.sqrt(two_d_array)
Out[47]:
array([[ 1. , 1.41421356, 1.73205081, 2. , 2.23606798,
2.44948974],
[ 2.64575131, 2.82842712, 3. , 3.16227766, 3.31662479,
3.46410162],
[ 3.60555128, 3.74165739, 3.87298335, 4. , 4.12310563,
4.24264069]])
Take the dot product of two arrays with np.dot(). This function performs an element-wise multiply and then a
sum for 1-dimensional arrays (vectors) and matrix multiplication for 2-dimensional arrays.
In [48]:
# Take the vector dot product of row 0 and row 1

np.dot(two_d_array[0,0:],   # Slice row 0
       two_d_array[1,0:])   # Slice row 1
Out[48]:
217
In [49]:
# Do a matrix multiply

np.dot(small_array1, small_array1)
Out[49]:
array([[ 7, 10],
[15, 22]])
The package includes a variety of more advanced linear algebra functions, should you need to do things like
computing eigenvectors and eigenvalues or inverting matrices.
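As a brief illustration (a minimal sketch using a small made-up matrix, not part of the notebook output above), the np.linalg submodule provides these operations:
small_matrix = np.array([[2, 1],
                         [1, 2]])      # A small invertible matrix (made up for illustration)

np.linalg.inv(small_matrix)            # Matrix inverse
np.linalg.eig(small_matrix)            # Eigenvalues and eigenvectors
np.linalg.det(small_matrix)            # Determinant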
Wrap Up
Numpy's ndarray data structure provides many desirable features for working with data, such as element-
wise math operations and a variety of functions that work on 2D arrays. Since numpy was built with data
analysis in mind, its math operations are optimized for that purpose and generally faster than what could be
achieved if you hand-coded functions to carry out similar operations on lists.
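As a rough illustration of that speed difference, here is a minimal timing sketch (not part of the original lesson; the exact timings will vary from machine to machine):
import time
import numpy as np

data_list = list(range(1000000))              # A plain Python list
data_array = np.arange(1000000)               # The equivalent numpy array

start = time.perf_counter()
squares_list = [x ** 2 for x in data_list]    # Hand-coded loop over the list
list_time = time.perf_counter() - start

start = time.perf_counter()
squares_array = data_array ** 2               # Vectorized numpy operation
numpy_time = time.perf_counter() - start

print(list_time, numpy_time)                  # The numpy version is typically much faster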
Numpy's arrays are great for performing calculations on numerical data, but most data sets you encounter in
real life aren't homogeneous. Many data sets include a mixture of data types including numbers, text and
dates, so they can't be stored in a single numpy array. In the next lesson we'll conclude our study of Python
data structures with Pandas DataFrames, a powerful data container that mirrors the structure of data tables
you'd find in databases and spreadsheet programs like Microsoft Excel.
Python for Data Analysis Part 9: Pandas DataFrames

Numpy's ndarrays are well-suited for performing math operations on one and two-dimensional arrays of numeric
values, but they fall short when it comes to dealing with heterogeneous data sets. To store data from an
external source like an Excel workbook or database, we need a data structure that can hold different data types.
It is also desirable to be able to refer to rows and columns in the data by custom labels rather than numbered
indexes.
The pandas library offers data structures designed with this in mind: the series and the DataFrame. Series are 1-
dimensional labeled arrays similar to numpy's ndarrays, while DataFrames are labeled 2-dimensional structures
that essentially function as spreadsheet tables.
Pandas Series
Before we get into DataFrames, we'll take a brief detour to explore pandas series. Series are very similar to
ndarrays: the main difference between them is that with series, you can provide custom index labels and then
operations you perform on series automatically align the data based on the labels.
To create a new series, first load the numpy and pandas libraries (pandas is preinstalled with the Anaconda
Python distribution.).
In [1]:
import numpy as np
import pandas as pd
*Note: It is common practice to import pandas with the shorthand "pd".
Define a new series by passing a collection of homogeneous data like ndarray or list, along with a list of
associated indexes to pd.Series():
In [2]:
my_series = pd.Series( data = [2,3,5,4], # Data
index= ['a', 'b', 'c', 'd']) # Indexes

my_series
Out[2]:
a 2
b 3
c 5
d 4
dtype: int64
You can also create a series from a dictionary, in which case the dictionary keys act as the labels and the values
act as the data:
In [3]:
my_dict = {"x": 2, "a": 5, "b": 4, "c": 8}

my_series2 = pd.Series(my_dict)

my_series2
Out[3]:
a 5
b 4
c 8
x 2
dtype: int64
Similar to a dictionary, you can access items in a series by the labels:
In [4]:
my_series["a"]
Out[4]:
2
Numeric indexing also works:
In [5]:
my_series[0]
Out[5]:
2
If you take a slice of a series, you get both the values and the labels contained in the slice:
In [6]:
my_series[1:3]
Out[6]:
b 3
c 5
dtype: int64
As mentioned earlier, operations performed on two series align by label:
In [7]:
my_series + my_series
Out[7]:
a 4
b 6
c 10
d 8
dtype: int64
If you perform an operation with two series that have different labels, the unmatched labels will return a value
of NaN (not a number.).
In [8]:
my_series + my_series2
Out[8]:
a 7
b 7
c 13
d NaN
x NaN
dtype: float64
Other than labeling, series behave much like numpy's ndarrays. A series is even a valid argument to many of the
numpy array functions we covered last time:
In [9]:
np.mean(my_series) # numpy array functions generally work on series
Out[9]:
3.5
In [10]:
np.dot(my_series, my_series)
Out[10]:
54
DataFrame Creation and Indexing
A DataFrame is a 2D table with labeled columns that can each hold different types of data. DataFrames are
essentially a Python implementation of the types of tables you'd see in an Excel workbook or SQL database.
DataFrames are the defacto standard data structure for working with tabular data in Python; we'll be using them
a lot throughout the remainder of this guide.
You can create a DataFrame out of a variety of data sources like dictionaries, 2D numpy arrays and series using the
pd.DataFrame() function. Dictionaries provide an intuitive way to create DataFrames: when passed to
pd.DataFrame(), a dictionary's keys become column labels and the values become the columns themselves:
In [11]:
# Create a dictionary with some different data types as values

my_dict = {"name" : ["Joe","Bob","Frans"],
           "age" : np.array([10,15,20]),
           "weight" : (75,123,239),
           "height" : pd.Series([4.5, 5, 6.1],
                                index=["Joe","Bob","Frans"]),
           "siblings" : 1,
           "gender" : "M"}

df = pd.DataFrame(my_dict) # Convert the dict to DataFrame

df # Show the DataFrame


Out[11]:
age gender height name siblings weight
Joe 10 M 4.5 Joe 1 75
Bob 15 M 5.0 Bob 1 123
Frans 20 M 6.1 Frans 1 239
3 rows × 6 columns
Notice that values in the dictionary you use to make a DataFrame can be a variety of sequence objects, including
lists, ndarrays, tuples and series. If you pass in singular values like a single number or string, that value is
duplicated for every row in the DataFrame (in this case gender is set to "M" for all records and siblings is set to
1.).
Also note that in the DataFrame above, the rows were automatically given indexes that align with the indexes of
the series we passed in for the "height" column. If we did not use a series with index labels to create our
DataFrame, it would be given numeric row index labels by default:
In [12]:
my_dict2 = {"name" : ["Joe","Bob","Frans"],
"age" : np.array([10,15,20]),
"weight" : (75,123,239),
"height" :[4.5, 5, 6.1],
"siblings" : 1,
"gender" : "M"}

df2 = pd.DataFrame(my_dict2) # Convert the dict to DataFrame

df2 # Show the DataFrame


Out[12]:
age gender height name siblings weight
0 10 M 4.5 Joe 1 75
1 15 M 5.0 Bob 1 123
2 20 M 6.1 Frans 1 239
3 rows × 6 columns
You can provide custom row labels when creating a DataFrame by adding the index argument:
In [13]:
df2 = pd.DataFrame(my_dict2,
index = my_dict["name"] )

df2
Out[13]:
age gender height name siblings weight
Joe 10 M 4.5 Joe 1 75
Bob 15 M 5.0 Bob 1 123
Frans 20 M 6.1 Frans 1 239
3 rows × 6 columns
A DataFrame behaves like a dictionary of Series objects that each have the same length and indexes. This means
we can get, add and delete columns in a DataFrame the same way we would when dealing with a dictionary:
In [14]:
# Get a column by name

df2["weight"]
Out[14]:
Joe 75
Bob 123
Frans 239
Name: weight, dtype: int32
Alternatively, you can get a column by label using "dot" notation:
In [15]:
df2.weight
Out[15]:
Joe 75
Bob 123
Frans 239
Name: weight, dtype: int32
In [16]:
# Delete a column

del df2['name']
In [17]:
# Add a new column

df2["IQ"] = [130, 105, 115]

df2
Out[17]:
age gender height siblings weight IQ
Joe 10 M 4.5 1 75 130
Bob 15 M 5.0 1 123 105
Frans 20 M 6.1 1 239 115
3 rows × 6 columns
Inserting a single value into a DataFrame causes it to be duplicated across all the rows:
In [18]:
df2["Married"] = False
df2
Out[18]:
age gender height siblings weight IQ Married
Joe 10 M 4.5 1 75 130 False
Bob 15 M 5.0 1 123 105 False
Frans 20 M 6.1 1 239 115 False
3 rows × 7 columns
When inserting a Series into a DataFrame, rows are matched by index. Unmatched rows will be filled with NaN:
In [19]:
df2["College"] = pd.Series(["Harvard"],
index=["Frans"])

df2
Out[19]:
age gender height siblings weight IQ Married College
Joe 10 M 4.5 1 75 130 False NaN
Bob 15 M 5.0 1 123 105 False NaN
Frans 20 M 6.1 1 239 115 False Harvard
3 rows × 8 columns
You can select both rows and columns by label with df.loc[row, column]:
In [20]:
df2.loc["Joe"] # Select row "Joe"
Out[20]:
age 10
gender M
height 4.5
siblings 1
weight 75
IQ 130
Married False
College NaN
Name: Joe, dtype: object
In [21]:
df2.loc["Joe","IQ"] # Select row "Joe" and column "IQ"
Out[21]:
130
In [22]:
df2.loc["Joe":"Bob" , "IQ":"College"] # Slice by label
Out[22]:
IQ Married College
Joe 130 False NaN
Bob 105 False NaN
2 rows × 3 columns
Select rows or columns by numeric index with df.iloc[row, column]:
In [23]:
df2.iloc[0] # Get row 0
Out[23]:
age 10
gender M
height 4.5
siblings 1
weight 75
IQ 130
Married False
College NaN
Name: Joe, dtype: object
In [24]:
df2.iloc[0, 5] # Get row 0, column 5
Out[24]:
130
In [25]:
df2.iloc[0:2, 5:8] # Slice by numeric row and column index
Out[25]:
IQ Married College
Joe 130 False NaN
Bob 105 False NaN
2 rows × 3 columns
Select rows or columns based on a mixture of both labels and numeric indexes with df.ix[row, column]:
In [26]:
df2.ix[0] # Get row 0
Out[26]:
age 10
gender M
height 4.5
siblings 1
weight 75
IQ 130
Married False
College NaN
Name: Joe, dtype: object
In [27]:
df2.ix[0, "IQ"] # Get row 0, column "IQ"
Out[27]:
130
In [28]:
df2.ix[0:2, ["age", "IQ", "weight"]] # Slice rows and get specific columns
Out[28]:
age IQ weight
Joe 10 130 75
Bob 15 105 123
2 rows × 3 columns
You can also select rows by passing in a sequence of boolean (True/False) values. Rows where the corresponding
boolean is True are returned:
In [29]:
boolean_index = [False, True, True]

df2[boolean_index]
Out[29]:
age gender height siblings weight IQ Married College
Bob 15 M 5.0 1 123 105 False NaN
Frans 20 M 6.1 1 239 115 False Harvard
2 rows × 8 columns
This sort of logical True/False indexing is useful for subsetting data when combined with logical operations. For
example, say we wanted to get a subset of our DataFrame with all persons who are over 12 years old. We can do
it with boolean indexing:
In [30]:
# Create a boolean sequence with a logical comparison
boolean_index = df2["age"] > 12

# Use the index to get the rows where age > 12


df2[boolean_index]
Out[30]:
age gender height siblings weight IQ Married College
Bob 15 M 5.0 1 123 105 False NaN
Frans 20 M 6.1 1 239 115 False Harvard
2 rows × 8 columns
You can do this sort of indexing all in one operation without assigning the boolean sequence to a variable.
In [31]:
df2[ df2["age"] > 12 ]
Out[31]:
age gender height siblings weight IQ Married College
Bob 15 M 5.0 1 123 105 False NaN
Frans 20 M 6.1 1 239 115 False Harvard
2 rows × 8 columns
Exploring DataFrames
Exploring data is an important first step in most data analyses. DataFrames come with a variety of functions to
help you explore and summarize the data they contain.
First, let's load in a data set to explore: the mtcars data set. The mtcars data set comes with the ggplot library, a
port of a popular R plotting library called ggplot2. ggplot does not come with Anaconda, but you can install it by
opening a console (cmd.exe) and running "pip install ggplot" (close Spyder and other programs before installing
new libraries).
Now we can import the mtcars data from ggplot:
In [32]:
from ggplot import mtcars

type(mtcars)
Out[32]:
pandas.core.frame.DataFrame
Notice that mtcars is loaded as a DataFrame. We can check the dimensions and size of a DataFrame with
df.shape:
In [33]:
mtcars.shape # Check dimensions
Out[33]:
(32, 12)
The output shows that mtcars has 32 rows and 12 columns.
We can check the first n rows of the data with the df.head() function:
In [34]:
mtcars.head(6) # Check the first 6 rows
Out[34]:
name mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
5 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
6 rows × 12 columns
Similarly, we can check the last few rows with df.tail()
In [35]:
mtcars.tail(6) # Check the last 6 rows
Out[35]:
name mpg cyl disp hp drat wt qsec vs am gear carb
26 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
27 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
28 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
29 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
30 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
31 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
6 rows × 12 columns
With large data sets, head() and tail() are useful to get a sense of what the data looks like without printing
hundreds or thousands of rows to the screen. Since each row specifies a different car, let's set the row indexes
equal to the car names. You can access and assign new row indexes with df.index:
In [36]:
print(mtcars.index, "\n") # Print original indexes

mtcars.index = mtcars["name"] # Set index to car name


del mtcars["name"] # Delete name column

print(mtcars.index) # Print new indexes


Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
30, 31], dtype='int64')

Index(['Mazda RX4', 'Mazda RX4 Wag', 'Datsun 710', 'Hornet 4 Drive', 'Hornet Sportabout', 'Valiant', 'Duster 360',
'Merc 240D', 'Merc 230', 'Merc 280', 'Merc 280C', 'Merc 450SE', 'Merc 450SL', 'Merc 450SLC', 'Cadillac
Fleetwood', 'Lincoln Continental', 'Chrysler Imperial', 'Fiat 128', 'Honda Civic', 'Toyota Corolla', 'Toyota Corona',
'Dodge Challenger', 'AMC Javelin', 'Camaro Z28', 'Pontiac Firebird', 'Fiat X1-9', 'Porsche 914-2', 'Lotus Europa',
'Ford Pantera L', 'Ferrari Dino', 'Maserati Bora', 'Volvo 142E'], dtype='object')
You can access the column labels with df.columns:
In [37]:
mtcars.columns
Out[37]:
Index(['mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb'], dtype='object')
Use the df.describe() command to get a quick statistical summary of your data set. The summary includes the
mean, median, min, max and a few key percentiles for numeric columns:
In [38]:
mtcars.ix[:,:6].describe() # Summarize the first 6 columns
Out[38]:
mpg cyl disp hp drat wt
count 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000
mean 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250
std 6.026948 1.785922 123.938694 68.562868 0.534679 0.978457
min 10.400000 4.000000 71.100000 52.000000 2.760000 1.513000
25% 15.425000 4.000000 120.825000 96.500000 3.080000 2.581250
50% 19.200000 6.000000 196.300000 123.000000 3.695000 3.325000
75% 22.800000 8.000000 326.000000 180.000000 3.920000 3.610000
max 33.900000 8.000000 472.000000 335.000000 4.930000 5.424000
8 rows × 6 columns
Since the columns of a DataFrame are series and series are closely related to numpy's arrays, many ndarray
functions work on DataFrames, operating on each column of the DataFrame:
In [39]:
np.mean(mtcars,
axis=0) # Get the mean of each column
Out[39]:
mpg 20.090625
cyl 6.187500
disp 230.721875
hp 146.687500
drat 3.596563
wt 3.217250
qsec 17.848750
vs 0.437500
am 0.406250
gear 3.687500
carb 2.812500
dtype: float64
In [40]:
np.sum(mtcars,
axis=0) # Get the sum of each column
Out[40]:
mpg 642.900
cyl 198.000
disp 7383.100
hp 4694.000
drat 115.090
wt 102.952
qsec 571.160
vs 14.000
am 13.000
gear 118.000
carb 90.000
dtype: float64
Wrap Up
Pandas DataFrames are the workhorse data structure for data analysis in Python. They provide an intuitive
structure that mirrors the sorts of data tables we're used to seeing in spreadsheet programs and indexing
functionality that follows the same pattern as other Python data structures. This brief introduction only
scratches the surface; DataFrames offer a host of other indexing options and functions, many of which we will
see in future lessons.
Python for Data Analysis Part 10: Reading and Writing Data

Reading data into pandas DataFrames is often the first step when conducting data analysis in Python. The
pandas package comes equipped with several data reading and writing functions that let you read data directly
from common file formats like comma separated values files (CSV) and Microsoft Excel files. This lesson will
focus on reading and writing data from these common file formats, but Python has packages available to work
with just about every data format you encounter.
Python Working Directory and File Paths
Before we can jump into reading and writing data, we need to learn a little bit about Python's working directory
and file paths. When you launch Python, it starts in a default location in your computer's file system known as
the working directory. You can check your current working directory by importing the os module and then using
os.getcwd():
In [1]:
import os

os.getcwd()
Out[1]:
'C:\\Users\\Greg'
The working directory acts as your starting point for accessing files on your computer from within Python. To
load a data set from your hard drive, you either need to put the file in your working directory, change your
working directory to the folder containing the data or supply the data file's file path to the data reading
function.
You can change your working directory by supplying a new file path in quotes to the os.chdir() function:
In [2]:
os.chdir('C:\\Users\\Greg\\Desktop\\intro_python10')

os.getcwd() # Check the working directory again


Out[2]:
'C:\\Users\\Greg\\Desktop\\intro_python10'
You can list all of the objects in a directory by passing the file path to the os.listdir( ) function:
In [3]:
os.listdir('C:\\Users\\Greg\\Desktop\\intro_python10')
Out[3]:
['draft2015.csv', 'draft2015.tsv', 'draft2015.xlsx']
Notice my intro_python10 folder has 3 files named "draft2015" in different file formats. Let's load them into
DataFrames. (Download the data files here: csv, tsv, xlsx)
Reading CSV and TSV Files
Data is commonly stored in simple flat text files consisting of values delimited (separated) by a special character
like a comma (CSV) or tab (TSV).
You can read CSV files into a pandas DataFrame using the pandas function pd.read_csv():
In [4]:
import pandas as pd

draft1 = pd.read_csv('draft2015.csv') # Supply the file name (path)

draft1.head(6) # Check the first 6 rows


Out[4]:
Player Draft_Express CBS CBS_2 CBS_3 BleacherReport SI
0 Karl-Anthony Towns 1 1 1 1 1 1
1 Jahlil Okafor 2 2 2 2 2 2
2 Emmanuel Mudiay 7 6 6 6 7 6
3 D'Angelo Russell 3 3 4 4 3 3
4 Kristaps Porzingis 6 5 3 3 4 4
5 Mario Hezonja 4 7 8 7 6 7
6 rows × 7 columns
To load a TSV file, use pd.read_table():
In [5]:
draft2 = pd.read_table('draft2015.tsv') # Read a tsv into a DataFrame

draft2.head(6) # Check the first 6 rows


Out[5]:
Player Draft_Express CBS CBS_2 CBS_3 BleacherReport SI
0 Karl-Anthony Towns 1 1 1 1 1 1
1 Jahlil Okafor 2 2 2 2 2 2
2 Emmanuel Mudiay 7 6 6 6 7 6
3 D'Angelo Russell 3 3 4 4 3 3
4 Kristaps Porzingis 6 5 3 3 4 4
5 Mario Hezonja 4 7 8 7 6 7
6 rows × 7 columns
The read_table() function is a general file reading function that reads TSV files by default, but you can use it to
read flat text files separated by any delimiting character by setting the "sep" argument to a different character.
Read more about the options it offers here.
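For instance, here is a quick sketch (using the same draft2015.csv file as above; the draft_alt name is just for illustration) that reads a CSV file with read_table() by overriding the default separator:
draft_alt = pd.read_table('draft2015.csv',   # Same CSV file as before
                          sep=',')           # Treat commas as the delimiter

draft_alt.head(6)                            # Same result as pd.read_csv()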
Reading Excel Files
Microsoft Excel is a ubiquitous enterprise spreadsheet program that stores data in its own format with the
extension .xls or .xlsx. Although you can save Excel files as CSV from within Excel and then load them into Python
with the functions we covered above, pandas is capable of loading data directly from Excel file formats.
To load data from an Excel file you need the "xlrd" module. This module comes with the Python Anaconda
distribution. If you don't have it installed, you can get it by opening a command console and running "pip install
xlrd" (without quotes.).
Load data from an Excel file to a DataFrame with pd.read_excel(), supplying the file path and the name of the
worksheet you want to load:
In [6]:
draft3 = pd.read_excel('draft2015.xlsx', # Path to Excel file
sheetname = 'draft2015') # Name of sheet to read from

draft3.head(6) # Check the first 6 rows


Out[6]:
Player Draft_Express CBS CBS_2 CBS_3 BleacherReport SI
0 Karl-Anthony Towns 1 1 1 1 1 1
1 Jahlil Okafor 2 2 2 2 2 2
Player Draft_Express CBS CBS_2 CBS_3 BleacherReport SI
2 Emmanuel Mudiay 7 6 6 6 7 6
3 D'Angelo Russell 3 3 4 4 3 3
4 Kristaps Porzingis 6 5 3 3 4 4
5 Mario Hezonja 4 7 8 7 6 7
6 rows × 7 columns
Reading Web Data
The Internet gives you access to more data than you could ever hope to analyze. Data analysis often begins with
getting data from the web and loading it into Python. Websites that offer data for download usually let you
download data as CSV, TSV or Excel files. Perhaps the easiest way to load web data is to simply download the data
to your hard drive and then use the functions we discussed earlier to load it into a DataFrame.
Reading from the clipboard is another quick and dirty option for reading web data and other tabular data. To
read data from the clipboard, highlight the data you want to copy and use the appropriate copy function on your
keyboard (typically control+C) as if you were going to copy and paste the data. Next, use the pd.read_clipboard()
function with the appropriate separator to load the data into a pandas DataFrame:
In [7]:
# Go to http://www.basketball-reference.com/leagues/NBA_2015_totals.html
# click the CSV button and then copy some data to the clipboard

BB_reference_data = pd.read_clipboard(sep=",") # Read data from the clipboard

BB_reference_data.ix[:, 0:10].head(5) # Check 5 rows (10 columns only)


Out[7]:
Rk Player Pos Age Tm G GS MP FG FGA
0 1 Quincy Acy PF 24 NYK 68 22 1287 152 331
1 2 Jordan Adams SG 20 MEM 30 0 248 35 86
2 3 Steven Adams C 21 OKC 70 67 1771 217 399
3 4 Jeff Adrien PF 28 MIN 17 0 215 19 44
4 5 Arron Afflalo SG 29 TOT 78 72 2502 375 884
5 rows × 10 columns
Pandas also comes with a read_html() function to read data directly from web pages. To use read_html() you
need the html5lib package. Install it by opening a command console and running "pip install html5lib"
(without quotes). Note that HTML can have all sorts of nested structures and formatting quirks, which makes
parsing it to extract data troublesome. The read_html() function does its best to draw out tabular data in web
pages, but the results aren't always perfect. Let's read html directly from basketball-reference.com to get the
same data we loaded from the clipboard:
In [8]:
url = "http://www.basketball-reference.com/leagues/NBA_2015_totals.html"

BB_data = pd.read_html(url) # Read data from the specified url

BB_data[0].ix[:, 0:10].head(5) # Check 5 rows (10 columns only)*


Out[8]:
Rk Player Pos Age Tm G GS MP FG FGA
0 1 Quincy Acy PF 24 NYK 68 22 NaT 152 331
1 2 Jordan Adams SG 20 MEM 30 0 NaT 35 86
Rk Player Pos Age Tm G GS MP FG FGA
2 3 Steven Adams C 21 OKC 70 67 1771-01-01 217 399
3 4 Jeff Adrien PF 28 MIN 17 0 NaT 19 44
4 5 Arron Afflalo SG 29 TOT 78 72 NaT 375 884
5 rows × 10 columns
*Note: read_html() returns a list of DataFrames, regardless of the number of tables on the web page you read
from. In the code above, BB_data[0] gets the first DataFrame from the list and then .ix[:, 0:10].head(5) slices the
first 10 columns and checks the head.
Data comes in all sorts of formats other than the ones we've discussed here. The pandas library has several
other data reading functions to work with data in other common formats, like JSON, SAS and Stata files, and SQL
databases.
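As a brief sketch of one of those readers, pd.read_json() loads JSON data into a DataFrame; here it reads a small made-up JSON string from memory rather than a file:
from io import StringIO

json_string = '[{"Player": "Karl-Anthony Towns", "Rank": 1}, {"Player": "Jahlil Okafor", "Rank": 2}]'

json_df = pd.read_json(StringIO(json_string))   # Read JSON into a DataFrame

json_df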
Writing Data
Each of the data reading functions in pandas has a corresponding writer function that lets you write data back
into the format it came from. Most of the time, however, you'll probably want to save your data in an easy-to-
use format like CSV. Write a DataFrame to CSV in the working directory by passing the desired file name to the
df.to_csv() function:
In [9]:
BB_reference_data.to_csv("bb_data.csv")

os.listdir('C:\\Users\\Greg\\Desktop\\intro_python10')
Out[9]:
['bb_data.csv', 'draft2015.csv', 'draft2015.tsv', 'draft2015.xlsx']
Notice 'bb_data.csv' now exists in the folder.
Wrap Up
The pandas library makes it easy to read data into DataFrames and export it back into common data formats like
CSV files.
Now that we know how to load data into Python we're almost ready to start doing data analysis, but before we
do, we need to learn some basic Python programming constructs.
Python for Data Analysis Part 11: Control Flow

Although Python is a popular tool for data analysis, it is a general-purpose programming language that
wasn't designed specifically for that task. It is important to know some basic Python programming constructs
even though you can go a long way using the functions built into Python and its data analysis libraries.
When you run code in Python, each statement is executed in the order in which it appears. Programming
languages like Python let you change the order in which code executes, allowing you to skip statements or
run certain statements over and over again. Programming constructs that let you alter the order in which
code executes are known as control flow statements.
If, Else and Elif
The most basic control flow statement in Python is the "if" statement. An if statement checks whether some
logical expression evaluates to true or false and then executes a code block if the expression is true.
In Python, an if statement starts with if, followed by a logical expression and a colon. The code to execute if
the logical expression is true appears on the next line, indented from the if statement above it by 4 spaces:
In [1]:
x = 10 # Assign some variables
y = 5

if x > y:                            # If statement
    print("x is greater than y")
x is greater than y
In the code above, the logical expression was true--x is greater than y--so the print statement inside the if block
was executed.
If statements are often accompanied by else statements. Else statements come after if statements and
execute code in the event that logical expression checked by an if statement is false:
In [2]:
y = 25
x = 10

if x > y:
    print("x is greater than y")
else:
    print("y is greater than x")
y is greater than x
In this case the logical expression after the if statement is false, so the print statement inside the if block is
skipped and the print statement inside the else block is executed instead.
You can extend this basic if/else construct to perform multiple logical checks in a row by adding one or more
"elif" (else if) statements between the opening if and closing else. Each elif statement performs an additional
logical check and executes its code if the check is true:
In [3]:
y = 10

if x > y:
    print("x is greater than y")
elif x == y:
    print("x and y are equal!")
else:
    print("y is greater than x")
x and y are equal!
For Loops
For loops are a programming construct that let you go through each item in a sequence and then perform
some operation on each one. For instance, you could use a for loop to go through all the values in a list,
tuple, dictionary or series and check whether each conforms to some logical expression or print the value to
the console.
Create a for loop using the following syntax:
In [4]:
my_sequence = list(range(0,101,10)) # Make a new list

for number in my_sequence:    # Create a new for loop over the specified items
    print(number)             # Code to execute
0
10
20
30
40
50
60
70
80
90
100
In each iteration of the loop, the variable "number" takes on the value of the next item in the sequence.
For loops support a few special keywords that help you control the flow of the loop: continue and break.
The continue keyword causes a for loop to skip the current iteration and go to the next one:
In [5]:
for number in my_sequence:
    if number < 50:
        continue              # Skip numbers less than 50
    print(number)
50
60
70
80
90
100
The "break" keyword halts the execution of a for loop entirely. Use break to "break out" of a loop:
In [6]:
for number in my_sequence:
    if number > 50:
        break                 # Break out of the loop if number > 50
    print(number)
0
10
20
30
40
50
In the for loop above, substituting the "continue" keyword for break would actually result in the exact same
output but the code would take longer to run because it would still go through each number in the list instead
of breaking out of the for loop early. It is best to break out of loops early if possible to reduce execution time.
While Loops
While loops are similar to for loops in that they allow you to execute code over and over again. A for loop
executes its contents at most once for each item in the sequence you are looping over. While loops, on the
other hand, keep executing their contents as long as a logical expression you supply remains true:
In [7]:
x = 5
iters = 0

while iters < x:        # Execute the contents as long as iters < x
    print("Study")
    iters = iters + 1   # Increment iters by 1 each time the loop executes
Study
Study
Study
Study
Study
While loops can get you into trouble because they keep executing until the logical statement provided is
false. If you supply a logical statement that will never become false and don't provide a way to break out of
the while loop, it will run forever. For instance, if the while loop above didn't include the statement
incrementing the value of iters by 1, the logical statement would never become false and the code would run
forever. Infinite while loops are a common cause of program crashes.
The continue and break statements work inside while loops just like they do in for loops. You can use the
break statement to escape a while loop even if the logical expression you supplied is true. Consider the
following while loop:
In [8]:
while True:             # True is always true!
    print("Study")
    break               # But we break out of the loop here
Study
It is important to make sure while loops contain a logical expression that will eventually be false or a break
statement that will eventually be executed to avoid infinite loops.
Although you can use a while loop to do anything a for loop can do, it is best to use for loops whenever you
want to perform a specific number of operations, such as when running some code on each item in a
sequence. While loops should be reserved for cases where you don't know how many times you will need to
execute a loop.
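For example, here is a small sketch (not from the original lesson) of a task where a while loop fits naturally because the number of iterations isn't known ahead of time:
import random

total = 0
rolls = 0

while total < 10:                   # Keep rolling until the running total reaches 10
    total += random.randint(1, 6)   # Add a random dice roll
    rolls += 1

print(rolls, total)                 # The number of rolls needed varies from run to run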
The np.where() Function
Although it is important to be able to create your own if/else statements and loops when you need to,
numpy's vectorized nature means you can often avoid using such programming constructs. Whenever you
want to perform the same operation to each object in a numpy or pandas data structure, there's often a way
to do it efficiently without writing your own loops and if statements.
For example, imagine you have a sequence of numbers and you want to set all the negative values in the
sequence to zero. One way to do it is to use a for loop with an inner if statement:
In [9]:
import numpy as np
my_data = np.random.uniform(-1,1,25) # Draw 25 random numbers from -1 to 1

for index, number in enumerate(my_data):
    if number < 0:
        my_data[index] = 0    # Set numbers less than 0 to 0

print(my_data)
[ 0.85775482 0.23345796 0.12481305 0.80818166 0.39084045 0. 0.
0.8885405 0.89332358 0.16834338 0. 0. 0.72302815
0. 0.43432967 0.26056279 0.82086754 0. 0.82894075
0.92541317 0. 0.194373 0. 0.3360512 0. ]
*Note: The function enumerate() takes a sequence and turns it into a sequence of (index, value) tuples;
enumerate() lets you loop over the items in a sequence while also having access to each item's index.
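A quick sketch of what enumerate() produces:
list(enumerate(["a", "b", "c"]))    # Returns [(0, 'a'), (1, 'b'), (2, 'c')]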
Using a for loop to perform this sort of operation requires writing quite a bit of code and for loops are not
particularly fast because they have to operate on each item in a sequence one at a time.
Numpy includes a function called where() that lets you perform an if/else check on a sequence with less
code:
In [10]:
my_data = np.random.uniform(-1,1,25) # Generate new random numbers

my_data = np.where(my_data < 0,   # A logical test
                   0,             # Value to set if the test is true
                   my_data)       # Value to set if the test is false

print(my_data)
[ 0.52262476 0. 0.31698457 0. 0. 0.59368824
0. 0. 0.06379209 0.26576472 0.75626607 0.
0.06003758 0. 0.37269663 0. 0. 0. 0.
0. 0.72700802 0.62098044 0. 0. 0.58293886]
Not only is np.where() more concise than a for loop, it is also much more computationally efficient because
numpy arrays are able to operate on all the values they contain at the same time instead of going through
each value one at a time.
Wrap Up
Control flow statements are the basic building blocks of computer programs. Python and its libraries offer
a vast number of functions, but general-use functions can't cover every situation. Sooner or later, you'll need
to write custom code to perform a task unique to your specific project or data set. Next time we'll learn how
to package control flow statements into reusable functions.
Python for Data Analysis Part 12: Defining Functions

The functions built into Python and its libraries can take you a long way, but general-purpose functions aren't
always applicable to the specific tasks you face when analyzing data. The ability to create user-defined
functions gives you the flexibility to handle situations where pre-made tools don't cut it.
Defining Functions
Define a function using the "def" keyword followed by the function's name, a tuple of function arguments and
then a colon:
In [1]:
def my_function(arg1, arg2):   # Defines a new function
    return arg1 + arg2         # Function body (code to execute)
After defining a function, you can call it using the name you assigned to it, just like you would with a built-in
function. The "return" keyword specifies what the function produces as its output. When a function reaches a
return statement, it immediately exits and returns the specified value. The function we defined above takes
two arguments and then returns their sum:
In [2]:
my_function(5, 10)
Out[2]:
15
You can give function arguments a default value that is used automatically unless you override it. Set a
default value with the argument_name = argument_value syntax:
In [3]:
def sum_3_items(x, y, z, print_args=False):
    if print_args:
        print(x, y, z)
    return x + y + z
In [4]:
sum_3_items(5,10,20) # By default the arguments are not printed
Out[4]:
35
In [5]:
sum_3_items(5,10,20,True) # Changing the default prints the arguments
5 10 20
Out[5]:
35
A function can be set up to accept any number of named or unnamed arguments. Accept extra unnamed
arguments by including *args in the argument list. The unnamed arguments are accessible within the
function body as a tuple:
In [6]:
def sum_many_args(*args):
    print(type(args))
    return sum(args)

sum_many_args(1, 2, 3, 4, 5)
<class 'tuple'>
Out[6]:
15
Accept additional keyword arguments by putting **kwargs in the argument list. The keyword arguments are
accessible as a dictionary:
In [7]:
def sum_keywords(**kwargs):
    print(type(kwargs))
    return sum(kwargs.values())

sum_keywords(mynum=100, yournum=200)
<class 'dict'>
Out[7]:
300
Function Documentation
If you are writing a function that you or someone else is going to use in the future, it can be useful to supply
some documentation that explains how the function works. You can include documentation below the
function definition statement as a multi-line string. Documentation typically includes a short description of
what the function does, a summary of the arguments and a description of the value the function returns:
In [8]:
import numpy as np

def rmse(predicted, targets):
    """
    Computes root mean squared error of two numpy ndarrays

    Args:
        predicted: an ndarray of predictions
        targets: an ndarray of target values

    Returns:
        The root mean squared error as a float
    """
    return (np.sqrt(np.mean((targets-predicted)**2)))
*Note: root mean squared error (rmse) is a common evaluation metric in predictive modeling.
Documentation should provide enough information that the user doesn't have to read the code in the body of
the function to use the function.
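Once a docstring is in place, you (or anyone using your function) can view it with the built-in help() function:
help(rmse)      # Prints the docstring of the rmse() function defined above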
Lambda Functions
Named functions are great for code that you are going to reuse several times, but sometimes you only need
to use a simple function once. Python provides a shorthand for defining unnamed (anonymous) functions,
called lambda functions, which are typically used in situations where you only plan to use a function in one
part of your code.
The syntax for creating lambda functions looks like this:
In [9]:
lambda x, y: x + y
Out[9]:
<function __main__.<lambda>>
In the function above, the keyword "lambda" is similar to "def" in that it signals the definition of a new lambda
function. The values x, y are the arguments of the function and the code after the colon is the value that the
function returns.
You can assign a lambda function a variable name and use it just like a normal function:
In [10]:
my_function2 = lambda x, y: x + y

my_function2(5,10)
Out[10]:
15
Although you can assign a name to lambda function, their main purpose is for use in situations where you
need to create an unnamed function on the fly, such as when using functions that take other functions as
input. For example, consider the Python built in function map(). map() takes a function and an iterable like a
list as arguments and applies the function to each item in the iterable. Instead of defining a function and then
passing that function to map(), we can define a lambda function right in the call to map():
In [11]:
# Example of using map() without a lambda function

def square(x):                 # Define a function
    return x**2

my_map = map(square, [1, 2, 3, 4, 5])    # Pass the function to map()

for item in my_map:
    print(item)
1
4
9
16
25
In [12]:
# Example of using map() with a lambda function

my_map = map(lambda x: x**2, [1, 2, 3, 4, 5])

for item in my_map:
    print(item)
1
4
9
16
25
The lambda function version of this code is shorter and, more importantly, it avoids creating a named
function that won't be used anywhere else in our code.
Wrap Up
Python's built in functions can take you a long way in data analysis, but sometimes you need to define your
own functions to perform project-specific tasks.
In our final lesson on Python programming constructs, we'll cover several miscellaneous Python functions
and conveniences including the ability to construct lists and dictionaries quickly and easily with list and dict
comprehensions.
Python for Data Analysis Part 13: List and Dictionary Comprehensions

Python prides itself on its clean, readable code and making it as simple as possible for you to do the things
you want to do. Although basic control flow statements and functions have enough power to express virtually
any program, Python includes many convenience functions and constructs to let you do things faster and
with less code.
Populating lists and dictionaries is a common task that can be achieved with the loops we learned about
in lesson 11. For instance, if we wanted to populate a list with the numbers 0 through 100, we could initialize
an empty list as a container, run a for loop over the range of numbers from 0 to 100, and append each
number to the list:
In [1]:
my_list = []

for number in range(0, 101):
    my_list.append(number)

print(my_list)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,
65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85,
86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]
*Note: range() creates a sequence of numbers from some specified starting number up to but not including
an ending number. It also takes an optional argument for the step (counting increment) which defaults to 1.
The code above works, but it is unnecessarily verbose. List comprehensions provide a way to do these sorts
of constructions efficiently with less code.
List Comprehensions
List comprehensions let you populate lists in one line of code by taking the logic you would normally put in a for
loop and moving it inside the list brackets. We can construct the same list as the one above using the
following list comprehension:
In [2]:
my_list2 = [number for number in range(0, 101)]

print(my_list2)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,
65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85,
86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]
In a list comprehension, the value that you want to append to the list comes first, in this case "number",
followed by a for statement that mirrors the one we used in the for loop version of the code. You can
optionally include if clauses after the for clause to filter the results based on some logical check. For
instance, we could add an if statement to filter out odd numbers:
In [3]:
my_list3 = [number for number in range(0, 101) if number % 2 == 0]

print(my_list3)
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42,
44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84,
86, 88, 90, 92, 94, 96, 98, 100]
In the code above we take all the numbers in the range for which the number modulus 2 (the remainder
when divided by 2) is equal to zero, which returns all the even numbers in the range.
*Note: You could get the even numbers in a range more simply by including a step argument equal to 2, such as
range(0, 101, 2)
It is possible to put more than one for loop in a list comprehension, such as to construct a list from two
different iterables. For instance, if we wanted to make a list of each combination of two letters in two different
strings we could do it with a list comprehension over the two strings with two for clauses:
In [4]:
combined = [a+b for a in "life" for b in "study"]

print (combined)
['ls', 'lt', 'lu', 'ld', 'ly', 'is', 'it', 'iu', 'id', 'iy', 'fs', 'ft', 'fu',
'fd', 'fy', 'es', 'et', 'eu', 'ed', 'ey']
You also can nest one list comprehension inside of another:
In [5]:
nested = [letters[1] for letters in [a+b for a in "life" for b in "study"]]

print(nested)
['s', 't', 'u', 'd', 'y', 's', 't', 'u', 'd', 'y', 's', 't', 'u', 'd', 'y', 's',
't', 'u', 'd', 'y']
Notice that while you can nest list comprehensions to achieve a lot in a single line of code, doing so can lead
to long, verbose and potentially confusing code. Avoid the temptation to create convoluted "one-liners" when
a series of a few shorter, more readable operations will yield the same result:
In [6]:
combined = [a+b for a in "life" for b in "study"]
non_nested = [letters[1] for letters in combined]

print (non_nested)
['s', 't', 'u', 'd', 'y', 's', 't', 'u', 'd', 'y', 's', 't', 'u', 'd', 'y', 's',
't', 'u', 'd', 'y']
Dictionary Comprehensions
You can create dictionaries quickly in one line using a syntax that mirrors list comprehensions. Consider the
following dictionary that sets words as keys and their lengths as values:
In [7]:
words = ["life","is","study"]

word_length_dict = {}

for word in words:
    word_length_dict[word] = len(word)

print(word_length_dict)
{'study': 5, 'life': 4, 'is': 2}
We could make the same dictionary using a dictionary comprehension, where the key and value come first in
the form key:value, followed by a for clause that loops over some sequence:
In [8]:
words = ["life","is","study"]
word_length_dict2 = {word:len(word) for word in words}

print(word_length_dict2)
{'study': 5, 'life': 4, 'is': 2}
It is common to create a dictionary from the items in two different ordered sequences, where one sequence
contains the keys you want to use and the other sequence contains the corresponding values. You can pair
the items in two sequences into tuples using the built in Python function zip():
In [9]:
words = ["life","is","study"]
word_lengths = [4, 2, 5]
pairs = zip(words, word_lengths)

for item in pairs:
    print(item)
('life', 4)
('is', 2)
('study', 5)
Using zip inside a dictionary comprehension lets you extract key:value pairs from two sequences:
In [10]:
words = ["life","is","study"]
word_lengths = [4, 2, 5]

word_length_dict3 = {key:value for (key, value) in zip(words, word_lengths)}

print( word_length_dict3 )
{'study': 5, 'life': 4, 'is': 2}
Wrap Up
List and dictionary comprehensions provide a convenient syntax for creating lists and dictionaries more
efficiently and with less code than standard loops. Once you have data loaded into numpy arrays and
pandas DataFrames, however, you can often avoid looping constructs altogether by using functions
available in those packages that operate on data in a vectorized manner.
Now that we know the basics of Python's data structures and programming constructs, the remainder of this
guide will focus on data analysis. In the next lesson, we'll use Python to explore a real-world data set:
records of passengers who rode aboard the RMS Titanic on its fateful maiden voyage.
Python for Data Analysis Part 14: Initial Data Exploration and Preparation

The first part of any data analysis or predictive modeling task is an initial exploration of the data. Even if you
collected the data yourself and you already have a list of questions in mind that you want to answer, it is
important to explore the data before doing any serious analysis, since oddities in the data can cause bugs and
muddle your results. Before exploring deeper questions, you have to answer many simpler ones about the form
and quality of the data. That said, it is important to go into your initial data exploration with a big-picture question in
mind since the goal of your analysis should inform how you prepare the data.
This lesson aims to raise some of the questions you should consider when you look at a new data set for the first
time and show how to perform various Python operations related to those questions. We are going to cover a
lot of ground in this lesson, touching briefly on many topics from data cleaning to graphing to feature
engineering. We will cover many of these topics in future lessons in greater detail.
In this lesson, we will explore the Titanic disaster training set available from Kaggle.com, a website dedicated to
data science competitions and education. You need to create a Kaggle account and accept the rules for the
Titanic competition to download the data set. The data set consists of 889 passengers who rode aboard the
Titanic.
Exploring The Variables
The first step in exploratory analysis is reading in the data and then exploring the variables. It is important to get
a sense of how many variables and cases there are, the data types of the variables and the range of values they
take on.
We'll start by reading in the data:
In [1]:
%matplotlib inline
In [2]:
import numpy as np
import pandas as pd
import os
In [3]:
os.chdir('C:\\Users\\Greg\\Desktop\\Kaggle\\titanic') # Set working directory

titanic_train = pd.read_csv("titanic_train.csv") # Read the data


It's a good idea to start off by checking the dimensions of your data set with df.shape and the variable data types
with df.dtypes.
In [4]:
titanic_train.shape # Check dimensions
Out[4]:
(889, 12)
In [5]:
titanic_train.dtypes
Out[5]:
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
The output shows us that we're working with a set of 889 records and 12 columns. Several of the column
variables are encoded as numeric data types (ints and floats) but a few of them are encoded as "object". Let's
check the head of the data to get a better sense of what the variables look like:
In [6]:
print(titanic_train.head(5)) # Check the first 5 rows
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3

Name Sex Age SibSp \


0 Braund, Mr. Owen Harris male 22 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1
2 Heikkinen, Miss. Laina female 26 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
4 Allen, Mr. William Henry male 35 0

Parch Ticket Fare Cabin Embarked


0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
It appears we have a mixture of numeric columns and columns with text data. In data analysis, variables that
split records into a fixed number of unique categories, such as Sex, are known as categorical variables. Pandas
will attempt to interpret categorical variables as such when you load data, but you can specifically convert a
variable to categorical if necessary, as we'll see later.
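For instance, one way to explicitly convert a column to a categorical type is with astype() (a minimal sketch; the sex_categorical name is just for illustration and the conversion isn't required for the rest of this lesson):
sex_categorical = titanic_train["Sex"].astype("category")   # Explicitly convert to a categorical type

sex_categorical.dtype                                        # The new data type is "category"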
Note that if you're using Spyder as your Python editor, you will see the data in the variable explorer pane,
showing its type (DataFrame) and size. If you double click on the data in the variable explorer, you can see the
data in a spreadsheet-like view that lets you sort by columns and edit values directly. You shouldn't rely too
much on the variable explorer, however, since it doesn't work well when dealing with large data sets.
After getting a sense of the data's structure, it is a good idea to look at a statistical summary of the variables
with df.describe():
In [7]:
print( titanic_train.describe() )
PassengerId Survived Pclass Age SibSp \
count 889.000000 889.000000 889.000000 712.000000 889.000000
mean 446.000000 0.382452 2.311586 29.642093 0.524184
std 256.998173 0.486260 0.834700 14.492933 1.103705
min 1.000000 0.000000 1.000000 0.420000 0.000000
25% 224.000000 0.000000 2.000000 20.000000 0.000000
50% 446.000000 0.000000 3.000000 28.000000 0.000000
75% 668.000000 1.000000 3.000000 38.000000 1.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000

Parch Fare
count 889.000000 889.000000
mean 0.382452 32.096681
std 0.806761 49.697504
min 0.000000 0.000000
25% 0.000000 7.895800
50% 0.000000 14.454200
75% 0.000000 31.000000
max 6.000000 512.329200
Notice that non-numeric columns are dropped from the statistical summary provided by df.describe().
We can get a summary of the categorical variables by passing only those columns to describe():
In [8]:
categorical = titanic_train.dtypes[titanic_train.dtypes == "object"].index
print(categorical)

titanic_train[categorical].describe()
Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')
Out[8]:
Name Sex Ticket Cabin Embarked
count 889 889 889 201 889
unique 889 2 680 145 3
top Shutes, Miss. Elizabeth W male CA. 2343 C23 C25 C27 S
freq 1 577 7 4 644
The categorical variable summary shows the count of non-NaN records, the number of unique categories, the
most frequently occurring value and the number of occurrences of the most frequent value.
Although describe() gives a concise overview of each variable, it does not necessarily give us enough information
to determine what each variable means. Certain variables like "Age" and "Fare" are self-explanatory, while
others like "SibSp" and "Parch" are not. Whoever collects or provides data for download should also provide a
list of variable descriptions. In this case, Kaggle provides a list of descriptions on the data download page:
In [9]:
# VARIABLE DESCRIPTIONS:
# survival Survival
# (0 = No; 1 = Yes)
# pclass Passenger Class
# (1 = 1st; 2 = 2nd; 3 = 3rd)
# name Name
# sex Sex
# age Age
# sibsp Number of Siblings/Spouses Aboard
# parch Number of Parents/Children Aboard
# ticket Ticket Number
# fare Passenger Fare
# cabin Cabin
# embarked Port of Embarkation
# (C = Cherbourg; Q = Queenstown; S = Southampton)
After looking at the data for the first time, you should ask yourself a few questions:
1. Do I need all of the variables?
2. Should I transform any variables?
3. Are there NA values, outliers or other strange values?
4. Should I create new variables?
For the rest of this lesson we will address each of these questions in the context of this data set.
Do I Need All of the Variables?
Getting rid of unnecessary variables is a good first step when dealing with any data set, since dropping variables
reduces complexity and can make computation on the data faster. Whether you should get rid of a variable or
not will depend on the size of the data set and the goal of your analysis. With a data set as small as the Titanic data,
there's no real need to drop variables from a computing perspective (we have plenty of memory and processing
power to deal with such a small data set) but it can still be helpful to drop variables that will only distract from
your goal.
This data set is provided in conjunction with a predictive modeling competition where the goal is to use the
training data to predict whether passengers of the Titanic listed in a second data set survived or not. We won't
be dealing with the second data set (known as the test set) right now, but we will revisit this competition and make
predictions in a future lesson on predictive modeling.
Let's go through each variable and consider whether we should keep it or not in the context of predicting
survival:
"PassengerId" is just a number assigned to each passenger. It is nothing more than an arbitrary identifier; we
could keep it for identification purposes, but let's remove it anyway:
In [10]:
del titanic_train["PassengerId"] # Remove PassengerId
"Survived" indicates whether each passenger lived or died. Since predicting survival is our goal, we definitely
need to keep it.
Features that describe passengers numerically or group them into a few broad categories could be useful for
predicting survival. The variables Pclass, Sex, Age, SibSp, Parch, Fare and Embarked appear to fit this description,
so let's keep all of them.
We have 3 more features to consider: Name, Ticket and Cabin.
"Name" appears to be a character string of the name of each passenger. Let's look at name a little closer:
In [11]:
sorted(titanic_train["Name"])[0:15] # Check the first 15 sorted names
Out[11]:
['Abbing, Mr. Anthony',
'Abbott, Mr. Rossmore Edward',
'Abbott, Mrs. Stanton (Rosa Hunt)',
'Abelson, Mr. Samuel',
'Abelson, Mrs. Samuel (Hannah Wizosky)',
'Adahl, Mr. Mauritz Nils Martin',
'Adams, Mr. John',
'Ahlin, Mrs. Johan (Johanna Persdotter Larsson)',
'Aks, Mrs. Sam (Leah Rosen)',
'Albimona, Mr. Nassef Cassem',
'Alexander, Mr. William',
'Alhomaki, Mr. Ilmari Rudolf',
'Ali, Mr. Ahmed',
'Ali, Mr. William',
'Allen, Miss. Elisabeth Walton']
In [12]:
titanic_train["Name"].describe()
Out[12]:
count 889
unique 889
top Shutes, Miss. Elizabeth W
freq 1
Name: Name, dtype: object
From the output above, we see that the Name variable has 889 unique values. Since there are 889 rows in the
data set we know each name is unique. It appears that married women have their maiden names listed in
parentheses. In general, a categorical variable that is unique to each case isn't useful for prediction. We could
extract last names to try to group family members together, but even then the number of categories would be
very large. In addition, the Parch and SibSp variables already contain some information about family
relationships, so from the perspective of predictive modeling, the Name variable could be removed. On the
other hand, it can be nice to have some way to uniquely identify particular cases and names are interesting from
a personal and historical perspective, so let's keep Name, knowing that we won't actually use it in any predictive
models we make.
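Just to illustrate the last-name idea (a quick sketch we won't actually use), the text before the comma in each name could be pulled out with pandas string methods, which we'll cover in more depth in a later lesson:

last_names = titanic_train["Name"].str.split(",").str[0]   # Text before the first comma

last_names.describe()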
Next, let's look closer at "Ticket":
In [13]:
titanic_train["Ticket"][0:15] # Check the first 15 tickets
Out[13]:
0 A/5 21171
1 PC 17599
2 STON/O2. 3101282
3 113803
4 373450
5 330877
6 17463
7 349909
8 347742
9 237736
10 PP 9549
11 113783
12 A/5. 2151
13 347082
14 350406
Name: Ticket, dtype: object
In [14]:
titanic_train["Ticket"].describe()
Out[14]:
count 889
unique 680
top CA. 2343
freq 7
Name: Ticket, dtype: object
Ticket has 680 unique values: almost as many as there are passengers. Categorical variables with almost as many
levels as there are records are generally not very useful for prediction. We could try to reduce the number of
levels by grouping certain tickets together, but the ticket numbers don't appear to follow any logical pattern we
could use for grouping. Let's remove it:
In [15]:
del titanic_train["Ticket"] # Remove Ticket
Finally let's consider the "Cabin" variable:
In [16]:
titanic_train["Cabin"][0:15] # Check the first 15 tickets
Out[16]:
0 NaN
1 C85
2 NaN
3 C123
4 NaN
5 NaN
6 E46
7 NaN
8 NaN
9 NaN
10 G6
11 C103
12 NaN
13 NaN
14 NaN
Name: Cabin, dtype: object
In [17]:
titanic_train["Cabin"].describe() # Check number of unique cabins
Out[17]:
count 201
unique 145
top C23 C25 C27
freq 4
Name: Cabin, dtype: object
Cabin also has 145 unique values, which indicates it may not be particularly useful for prediction. On the other
hand, the names of the levels for the cabin variable seem to have a regular structure: each starts with a capital
letter followed by a number. We could use that structure to reduce the number of levels to make categories
large enough that they might be useful for prediction. Let's keep Cabin for now.
As you might have noticed, removing variables is often more of an art than a science. It is easiest to start simple:
don't be afraid to remove (or simply ignore) confusing, messy or otherwise troublesome variables temporarily
when you're just getting started with an analysis or predictive modeling task. Data projects are iterative
processes: you can start with a simple analysis or model using only a few variables and then expand later by
adding more and more of the other variables you initially ignored or removed.
Should I Transform Any Variables?
When you first load a data set, some of the variables may be encoded as data types that don't fit well with what
the data really is or what it means.
For instance, Survived is just an integer variable that takes on the value 0 or 1 depending on whether a
passenger died or survived respectively. Variables that indicate a state or the presence or absence of something
with the numbers 0 and 1 are sometimes called indicator variables or dummy variables (0 indicates absence and 1 indicates presence). Indicator variables are essentially just a shorthand for encoding a categorical variable
with 2 levels. We could instead encode Survived as a categorical variable with more descriptive categories:
In [18]:
new_survived = pd.Categorical(titanic_train["Survived"])
new_survived = new_survived.rename_categories(["Died","Survived"])

new_survived.describe()
Out[18]:
counts freqs
categories
Died 549 0.617548
Survived 340 0.382452
Survived looks a little nicer as a categorical variable with appropriate category names, but even so, we're not
going to change it. Why not? If you remember, our goal with this data set is predicting survival for the Kaggle
competition. It turns out that when submitting predictions for the competition, the predictions need to be
encoded as 0 or 1. It would only complicate things to transform Survived, only to convert it back to 0 and 1 later.
There's one more variable that has a questionable data encoding: Pclass. Pclass is an integer that indicates a
passenger's class, with 1 being first class, 2 being second class and 3 being third class. Passenger class is a
category, so it doesn't make a lot of sense to encode it as a numeric variable. What's more, 1st class would be considered "above" or "higher" than 2nd class, but when encoded as an integer, 1 comes before (is less than) 2, so the numeric ordering runs opposite to the class hierarchy. We can fix this by transforming Pclass into an ordered categorical variable:
In [19]:
new_Pclass = pd.Categorical(titanic_train["Pclass"],
ordered=True)

new_Pclass = new_Pclass.rename_categories(["Class1","Class2","Class3"])

new_Pclass.describe()
Out[19]:
counts freqs
categories
Class1 214 0.240720
Class2 184 0.206974
Class3 491 0.552306
In [20]:
titanic_train["Pclass"] = new_Pclass
Now it's time to revisit the Cabin variable. It appears that each Cabin is in a general section of the ship indicated
by the capital letter at the start of each factor level:
In [21]:
titanic_train["Cabin"].unique() # Check unique cabins
Out[21]:
array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6', 'C23 C25 C27',
'B78', 'D33', 'B30', 'C52', 'C83', 'F33', 'F G73', 'E31', 'A5',
'D10 D12', 'D26', 'C110', 'B58 B60', 'E101', 'F E69', 'D47', 'B86',
'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4', 'A32', 'B4', 'B80',
'A31', 'D36', 'D15', 'C93', 'C78', 'D35', 'C87', 'B77', 'E67',
'B94', 'C125', 'C99', 'C118', 'D7', 'A19', 'B49', 'D', 'C22 C26',
'C106', 'C65', 'E36', 'C54', 'B57 B59 B63 B66', 'C7', 'E34', 'C32',
'B18', 'C124', 'C91', 'E40', 'C128', 'D37', 'B35', 'E50', 'C82',
'B96 B98', 'E10', 'E44', 'A34', 'C104', 'C111', 'C92', 'E38', 'D21',
'E12', 'E63', 'A14', 'B37', 'C30', 'D20', 'B79', 'E25', 'D46',
'B73', 'C95', 'B38', 'B39', 'B22', 'C86', 'C70', 'A16', 'C101',
'C68', 'A10', 'E68', 'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50',
'A26', 'D48', 'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5',
'B20', 'F G63', 'C62 C64', 'E24', 'C90', 'C45', 'E8', 'B101', 'D45',
'C46', 'D30', 'E121', 'D11', 'E77', 'F38', 'B3', 'D6', 'B82 B84',
'D17', 'A36', 'B102', 'B69', 'E49', 'C47', 'D28', 'E17', 'A24',
'C50', 'B42', 'C148'], dtype=object)
If we grouped cabin just by this letter, we could reduce the number of levels while potentially extracting some
useful information.
In [22]:
char_cabin = titanic_train["Cabin"].astype(str) # Convert data to str
new_Cabin = np.array([cabin[0] for cabin in char_cabin]) # Take first letter

new_Cabin = pd.Categorical(new_Cabin)

new_Cabin.describe()
Out[22]:
counts freqs
categories
A 15 0.016873
B 45 0.050619
C 59 0.066367
D 33 0.037120
E 32 0.035996
F 13 0.014623
G 4 0.004499
n 688 0.773903
The output of describe() shows we succeeded in condensing Cabin into a handful of broader categories, but we also discovered something interesting: 688 of the records have a Cabin value of "n", which is the first letter of "nan". In other words, more than 2/3 of the passengers do not have a cabin listed at all. Discovering and deciding how to handle these sorts of peculiarities is an important part of working with data and there often isn't a single correct answer.
Since there are so many missing values, the Cabin variable might be devoid of useful information for prediction.
On the other hand, a missing cabin variable could be an indication that a passenger died: after all, how would we
know what cabin a passenger stayed in if they weren't around to tell the tale?
Let's keep the new cabin variable:
In [23]:
titanic_train["Cabin"] = new_Cabin
This is as far as we'll go with transformations right now, but know that the transformations we've covered here
are just the tip of the iceberg.
Are there NA Values, Outliers or Other Strange Values?
Data sets are often littered with missing data, extreme data points called outliers and other strange values.
Missing values, outliers and strange values can negatively affect statistical tests and models and may even cause
certain functions to fail.
In Python, you can detect missing values with the pd.isnull() function:
In [24]:
dummy_vector = pd.Series([1,None,3,None,7,8])

dummy_vector.isnull()
Out[24]:
0 False
1 True
2 False
3 True
4 False
5 False
dtype: bool
Detecting missing values is the easy part: it is far more difficult to decide how to handle them. In cases where
you have a lot of data and only a few missing values, it might make sense to simply delete records with missing
values present. On the other hand, if you have more than a handful of missing values, removing records with
missing values could cause you to get rid of a lot of data. Missing values in categorical data are not particularly
troubling because you can simply treat NA as an additional category. Missing values in numeric variables are
more troublesome, since you can't just treat a missing value as a number. As it happens, the Titanic dataset has
some NA's in the Age variable:
In [25]:
titanic_train["Age"].describe()
Out[25]:
count 712.000000
mean 29.642093
std 14.492933
min 0.420000
25% 20.000000
50% 28.000000
75% 38.000000
max 80.000000
Name: Age, dtype: float64
Notice that the count of Age (712) is less than the total row count of the data set (889). This indicates missing data.
We can get the row indexes of the missing values with np.where():
In [26]:
missing = np.where(titanic_train["Age"].isnull() == True)
missing
Out[26]:
(array([ 5, 17, 19, 26, 28, 29, 31, 32, 36, 42, 45, 46, 47,
48, 55, 63, 64, 75, 76, 81, 86, 94, 100, 106, 108, 120,
125, 127, 139, 153, 157, 158, 165, 167, 175, 179, 180, 184, 185,
195, 197, 200, 213, 222, 228, 234, 239, 240, 249, 255, 259, 263,
269, 273, 276, 283, 294, 297, 299, 300, 302, 303, 305, 323, 329,
333, 334, 346, 350, 353, 357, 358, 363, 366, 367, 374, 383, 387,
408, 409, 410, 412, 414, 419, 424, 427, 430, 443, 450, 453, 456,
458, 463, 465, 467, 469, 474, 480, 484, 489, 494, 496, 501, 506,
510, 516, 521, 523, 526, 530, 532, 537, 546, 551, 556, 559, 562,
563, 567, 572, 577, 583, 588, 592, 595, 597, 600, 601, 610, 611,
612, 628, 632, 638, 642, 647, 649, 652, 655, 666, 668, 673, 679,
691, 696, 708, 710, 717, 726, 731, 737, 738, 739, 759, 765, 767,
772, 775, 777, 782, 789, 791, 792, 814, 824, 825, 827, 830, 835,
837, 844, 847, 857, 861, 866, 876, 886], dtype=int64),)
In [27]:
len(missing[0])
Out[27]:
177
With 177 missing values it's probably not a good idea to throw all those records away. Here are a few ways we
could deal with them:
1. Replace the null values with 0s
2. Replace the null values with some central value like the mean or median
3. Impute values (estimate values using statistical/predictive modeling methods).
4. Split the data set into two parts: one set with where records have an Age value and another set where
age is null.
Setting missing values in numeric data to zero makes sense in some cases, but it doesn't make any sense here
because a person's age can't be zero. Setting all ages to some central number like the median is a simple fix but
there's no telling whether such a central number is a reasonable estimate of age without looking at the
distribution of ages. For all we know each age is equally common. We can quickly get a sense of the distribution
of ages by creating a histogram of the age variable with df.hist():
In [28]:
titanic_train.hist(column='Age', # Column to plot
figsize=(9,6), # Plot size
bins=20) # Number of histogram bins
Out[28]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0000000008412710>]], dtype=object)

From the histogram, we see that ages between 20 and 30 are the most common, so filling in missing values with
a central number like the mean or median wouldn't be entirely unreasonable. Let's fill in the missing values with
the median value of 28:
In [29]:
new_age_var = np.where(titanic_train["Age"].isnull(), # Logical check
28, # Value if check is true
titanic_train["Age"]) # Value if check is false

titanic_train["Age"] = new_age_var

titanic_train["Age"].describe()
Out[29]:
count 889.000000
mean 29.315152
std 12.984932
min 0.420000
25% 22.000000
50% 28.000000
75% 35.000000
max 80.000000
Name: Age, dtype: float64
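As an aside, the same fill can be done more directly with pandas' fillna() method; this one-liner (a sketch only, since we've already filled in the values above) would be equivalent:

titanic_train["Age"] = titanic_train["Age"].fillna(titanic_train["Age"].median())  # Fill NaN with the median age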
Since we just added a bunch of 28s to Age, let's look at the histogram again for a sanity check. The bar representing 28 should be much taller this time.
In [30]:
titanic_train.hist(column='Age', # Column to plot
figsize=(9,6), # Plot size
bins=20) # Number of histogram bins
Out[30]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000086F59B0>]], dtype=object)

Some of the ages we assigned are probably way off, but it might be better than throwing entire records away. In
practice, imputing the missing data (estimating age based on other variables) might have been a better option,
but we'll stick with this for now.
Next, let's consider outliers. Outliers are extreme numerical values: values that lie far away from the typical
values a variable takes on. Creating plots is one of the quickest ways to detect outliers. For instance, the
histogram above shows that 1 or 2 passengers were near age 80. Ages near 80 are uncommon for this data set,
but looking at the general shape of the data, seeing one or two 80 year olds doesn't seem particularly surprising.
Now let's investigate the "Fare" variable. This time we'll use a boxplot, since boxplots are designed to show the
spread of the data and help identify outliers:
In [31]:
titanic_train["Fare"].plot(kind="box",
figsize=(9,9))
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x890ba58>
In a boxplot, the central box represents 50% of the data and the central bar represents the median. The dotted
lines with bars on the ends are "whiskers" which encompass the great majority of the data and points beyond
the whiskers indicate uncommon values. In this case, we have some uncommon values that are so far away from
the typical value that the box appears squashed in the plot: this is a clear indication of outliers. Indeed, it looks
like one passenger paid almost twice as much as any other passenger. Even the passengers that paid between 200 and 300 paid far more than the vast majority of the other passengers.
For interest's sake, let's check the name of this high roller:
In [32]:
index = np.where(titanic_train["Fare"] == max(titanic_train["Fare"]) )

titanic_train.loc[index]
Out[32]:
     Survived  Pclass  Name                                Sex     Age  SibSp  Parch  Fare      Cabin  Embarked
257  1         Class1  Ward, Miss. Anna                    female  35   0      0      512.3292  n      C
678  1         Class1  Cardeza, Mr. Thomas Drake Martinez  male    36   0      1      512.3292  B      C
736  1         Class1  Lesurer, Mr. Gustave J              male    35   0      0      512.3292  B      C
In the graph there appears to be one passenger who paid more than all the others, but the output above shows that there were actually three passengers who all paid the same high fare.
Similar to NA values, there's no single cure for outliers. You can keep them, delete them or transform them in
some way to try to reduce their impact. Even if you decide to keep outliers unchanged, it is still worth identifying them, since they can have a disproportionately large influence on your results. Let's keep the three high rollers unchanged.
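Even when keeping outliers, it can be useful to flag them programmatically. Here is a minimal sketch using the common 1.5 * IQR rule (the cutoff is a conventional choice, not something dictated by this data set):

q1 = titanic_train["Fare"].quantile(0.25)     # First quartile
q3 = titanic_train["Fare"].quantile(0.75)     # Third quartile
iqr = q3 - q1                                 # Interquartile range
upper_bound = q3 + 1.5*iqr                    # Conventional outlier cutoff

print(titanic_train[titanic_train["Fare"] > upper_bound].shape[0])  # Number of flagged fares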
Data sets can have other strange values beyond missing values and outliers that you may need to address.
Sometimes data is mislabeled or simply erroneous; bad data can corrupt any sort of analysis so it is important to
address these sorts of issues before doing too much work.
Should I Create New Variables?
The variables present when you load a data set aren't always the most useful variables for analysis. Creating new
variables that are derivations or combinations of existing ones is a common step to take before jumping into an
analysis or modeling task.
For example, imagine you are analyzing web site auctions where one of the data fields is a text description of the
item being sold. A raw block of text is difficult to use in any sort of analysis, but you could create new variables
from it such as a variable storing the length of the description or variables indicating the presence of certain
keywords.
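To make this concrete with data we already have loaded, here is a quick sketch deriving a text-length variable from the Titanic Name column (purely for illustration; we won't keep it):

name_lengths = titanic_train["Name"].str.len()   # Length of each passenger's name

name_lengths.head(5)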
Creating a new variable can be as simple as taking one variable and adding, multiplying or dividing by another.
Let's create a new variable, Family, that combines SibSp and Parch to indicate the total number of family
members (siblings, spouses, parents and children) a passenger has on board:
In [33]:
titanic_train["Family"] = titanic_train["SibSp"] + titanic_train["Parch"]
For interest's sake, let's find out who had the most family members on board:
In [34]:
most_family = np.where(titanic_train["Family"] == max(titanic_train["Family"]))

titanic_train.ix[most_family]
Out[34]:
     Survived  Pclass  Name                               Sex     Age  SibSp  Parch  Fare   Cabin  Embarked  Family
158  0         Class3  Sage, Master. Thomas Henry         male    28   8      2      69.55  n      S         10
179  0         Class3  Sage, Miss. Constance Gladys       female  28   8      2      69.55  n      S         10
200  0         Class3  Sage, Mr. Frederick                male    28   8      2      69.55  n      S         10
323  0         Class3  Sage, Mr. George John Jr           male    28   8      2      69.55  n      S         10
791  0         Class3  Sage, Miss. Stella Anna            female  28   8      2      69.55  n      S         10
844  0         Class3  Sage, Mr. Douglas Bullen           male    28   8      2      69.55  n      S         10
861  0         Class3  Sage, Miss. Dorothy Edith "Dolly"  female  28   8      2      69.55  n      S         10
There were 7 people on board with 8 siblings/spouses and 2 parents/children--they were probably all siblings of
one another. Tragically, all 7 of them passed away. The 8th sibling is likely in the test data for which we are supposed to make predictions. Would you predict that the final sibling survived or died?
Wrap Up
In this lesson, we covered several general questions you should address when you first inspect a data set. Your
first goal should be to explore the structure of the data to clean it and prepare the variables for your analysis.
Once your data is in the right form, you can move from exploring structure to investigating relationships
between variables.
Python for Data Analysis Part 15: Working With Text Data

Last lesson we learned that there are a lot of questions to consider when you first look at a data set, including
whether you should clean or transform the data. We touched briefly on a few basic operations to prepare data
for analysis, but the Titanic data set was pretty clean to begin with. Data you encounter in the wild won't always
be so friendly. Text data in particular can be extremely messy and difficult to work with because it can contain all
sorts of characters and symbols that may have little meaning for your analysis. This lesson will cover some basic
techniques and functions for working with text data in Python.
To start, we'll need some text data that is a little messier than the names in the Titanic data set. As it happens,
Kaggle launched a data exploration competition recently, giving users access to a database of comments made
on Reddit.com during the month of May 2015. Since the Minnesota Timberwolves are my favorite basketball
team, I extracted the comments from the team's fan subreddit from the database. You can get the data
file (comments.csv) here.
Let's start by loading the data and checking its structure and a few of the comments:
In [1]:
import numpy as np
import pandas as pd
import os
In [2]:
os.chdir('C:\\Users\\Greg\\Desktop\\Kaggle\\misc')

comments = pd.read_csv("t_wolves_reddit_may2015.csv")

comments = comments["body"] # Convert from df to series

print (comments.shape)

print( comments.head(8))
(4166,)
0 Strongly encouraging sign for us. The T-Wolve...
1 [My reaction.](http://4.bp.blogspot.com/-3ySob...
2 http://imgur.com/gallery/Zch2AWw
3 Wolves have more talent than they ever had rig...
4 Nah. Wigg is on the level of KG but where's ou...
5 2004 was a pretty damn talented team dude.
6 :')
7 *swoon*
Name: body, dtype: object
The text in these comments is pretty messy. We see everything from long paragraphs to web links to text
emoticons. We already learned about a variety of basic string processing functions in lesson 6; pandas extends the built-in string functions with vectorized versions that operate on entire series of strings.
Pandas String Functions
String functions in pandas mirror built in string functions and many have the same name as their singular
counterparts. For example, str.lower() converts a single string to lowercase, while series.str.lower() converts all
the strings in a series to lowercase:
In [3]:
comments[0].lower() # Convert the first comment to lowercase
Out[3]:
"strongly encouraging sign for us. the t-wolves management better not screw this up and they better surround
wiggins with a championship caliber team to support his superstar potential or else i wouldn't want him to sour
his prime years here in minnesota just like how i felt with garnett.\n\ntl;dr: wolves better not fuck this up."
In [4]:
comments.str.lower().head(8) # Convert all comments to lowercase
Out[4]:
0 strongly encouraging sign for us. the t-wolve...
1 [my reaction.](http://4.bp.blogspot.com/-3ysob...
2 http://imgur.com/gallery/zch2aww
3 wolves have more talent than they ever had rig...
4 nah. wigg is on the level of kg but where's ou...
5 2004 was a pretty damn talented team dude.
6 :')
7 *swoon*
Name: body, dtype: object
Pandas also supports str.upper() and str.len():
In [5]:
comments.str.upper().head(8) # Convert all comments to uppercase
Out[5]:
0 STRONGLY ENCOURAGING SIGN FOR US. THE T-WOLVE...
1 [MY REACTION.](HTTP://4.BP.BLOGSPOT.COM/-3YSOB...
2 HTTP://IMGUR.COM/GALLERY/ZCH2AWW
3 WOLVES HAVE MORE TALENT THAN THEY EVER HAD RIG...
4 NAH. WIGG IS ON THE LEVEL OF KG BUT WHERE'S OU...
5 2004 WAS A PRETTY DAMN TALENTED TEAM DUDE.
6 :')
7 *SWOON*
Name: body, dtype: object
In [6]:
comments.str.len().head(8) # Get the length of all comments
Out[6]:
0 329
1 101
2 32
3 53
4 145
5 42
6 3
7 7
Name: body, dtype: int64
The string splitting and stripping functions also have pandas equivalents:
In [7]:
comments.str.split(" ").head(8) # Split comments on spaces
Out[7]:
0 [Strongly, encouraging, sign, for, us., , The,...
1 [[My, reaction.](http://4.bp.blogspot.com/-3yS...
2 [http://imgur.com/gallery/Zch2AWw]
3 [Wolves, have, more, talent, than, they, ever,...
4 [Nah., Wigg, is, on, the, level, of, KG, but, ...
5 [2004, was, a, pretty, damn, talented, team, d...
6 [:')]
7 [*swoon*]
dtype: object
In [8]:
comments.str.strip("[]").head(8) # Strip leading and trailing brackets
Out[8]:
0 Strongly encouraging sign for us. The T-Wolve...
1 My reaction.](http://4.bp.blogspot.com/-3ySobv...
2 http://imgur.com/gallery/Zch2AWw
3 Wolves have more talent than they ever had rig...
4 Nah. Wigg is on the level of KG but where's ou...
5 2004 was a pretty damn talented team dude.
6 :')
7 *swoon*
Name: body, dtype: object
Combine all the strings in a series together into a single string with series.str.cat():
In [9]:
comments.str.cat()[0:500] # Check the first 500 characters
Out[9]:
"Strongly encouraging sign for us. The T-Wolves management better not screw this up and they better surround
Wiggins with a championship caliber team to support his superstar potential or else I wouldn't want him to sour
his prime years here in Minnesota just like how I felt with Garnett.\n\nTL;DR: Wolves better not fuck this up.[My
reaction.](http://4.bp.blogspot.com/-
3ySobv38ihc/U6yxpPwsbzI/AAAAAAAAIPo/IO8Z_wbTIVQ/s1600/2.gif)http://imgur.com/gallery/Zch2AWwWolve
s have more talent than they ever"
You can slice each string in a series and return the result in an element-wise fashion with series.str.slice():
In [10]:
comments.str.slice(0, 10).head(8) # Slice the first 10 characters
Out[10]:
0 Strongly e
1 [My reacti
2 http://img
3 Wolves hav
4 Nah. Wigg
5 2004 was a
6 :')
7 *swoon*
Name: body, dtype: object
Alternatively, you can use indexing after series.str to take slices:
In [11]:
comments.str[0:10].head(8) # Slice the first 10 characters
Out[11]:
0 Strongly e
1 [My reacti
2 http://img
3 Wolves hav
4 Nah. Wigg
5 2004 was a
6 :')
7 *swoon*
Name: body, dtype: object
Replace a slice with a new substring using str.slice_replace():
In [12]:
comments.str.slice_replace(5, 10, " Wolves Rule! " ).head(8)
Out[12]:
0 Stron Wolves Rule! ncouraging sign for us. Th...
1 [My r Wolves Rule! on.](http://4.bp.blogspot.c...
2 http: Wolves Rule! ur.com/gallery/Zch2AWw
3 Wolve Wolves Rule! e more talent than they eve...
4 Nah. Wolves Rule! is on the level of KG but w...
5 2004 Wolves Rule! pretty damn talented team ...
6 :') Wolves Rule!
7 *swoo Wolves Rule!
Name: body, dtype: object
Replace the occurrences of a given substring with a different substring using str.replace():
In [13]:
comments.str.replace("Wolves", "Pups").head(8)
Out[13]:
0 Strongly encouraging sign for us. The T-Pups ...
1 [My reaction.](http://4.bp.blogspot.com/-3ySob...
2 http://imgur.com/gallery/Zch2AWw
3 Pups have more talent than they ever had right...
4 Nah. Wigg is on the level of KG but where's ou...
5 2004 was a pretty damn talented team dude.
6 :')
7 *swoon*
Name: body, dtype: object
A common operation when working with text data is to test whether character strings contain a certain
substring or pattern of characters. For instance, if we were only interested in posts about Andrew Wiggins, we'd
need to match all posts that make mention of him and avoid matching posts that don't mention him. Use
series.str.contains() to get a series of true/false values that indicate whether each string contains a given
substring:
In [14]:
logical_index = comments.str.lower().str.contains("wigg|drew")

comments[logical_index].head(10) # Get first 10 comments about Wiggins


Out[14]:
0 Strongly encouraging sign for us. The T-Wolve...
4 Nah. Wigg is on the level of KG but where's ou...
9 I FUCKING LOVE YOU ANDREW
10 I LOVE YOU WIGGINS
33 Yupiii!!!!!! Great Wiggins celebration!!!!! =D...
44 Wiggins on the level of KG?!
45 I'm comfortable with saying that Wiggins is as...
62 They seem so Wiggins. Did he help design them?
63 The more I think about this the more I can und...
64 I dig these a lot. Like the AW logo too with t...
Name: body, dtype: object
For interest's sake, let's also calculate the ratio of comments that mention Andrew Wiggins:
In [15]:
len(comments[logical_index])/len(comments)
Out[15]:
0.06649063850216035
It looks like about 6.6% of comments make mention of Andrew Wiggins. Notice that the string pattern argument we supplied to str.contains() wasn't just a simple substring. Posts about Andrew Wiggins could use any number of different names to refer to him--Wiggins, Andrew, Wigg, Drew--so we needed something a little more flexible than a single substring to match all the posts we're interested in. The pattern we supplied is a simple example of a regular expression.
Regular Expressions
Pandas has a few more useful string functions, but before we go any further, we need to learn about regular
expressions. A regular expression or regex is a sequence of characters and special meta characters used to
match a set of character strings. Regular expressions allow you to be more expressive with string matching
operations than just providing a simple substring. A regular expression lets you define a "pattern" that can
match strings of different lengths, made up of different characters.
In the str.contains() example above, we supplied the regular expression: "wigg|drew". In this case, the vertical
bar | is a metacharacter that acts as the "or" operator, so this regular expression matches any string that
contains the substring "wigg" or "drew".
When you provide a regular expression that contains no metacharacters, it simply matches the exact substring.
For instance, "Wiggins" would only match strings containing the exact substring "Wiggins." Metacharacters let
you change how you make matches. Here is a list of basic metacharacters and what they do:
"." - The period is a metacharacter that matches any character other than a newline:
In [16]:
my_series = pd.Series(["will","bill","Till","still","gull"])

my_series.str.contains(".ill") # Match any substring ending in ill


Out[16]:
0 True
1 True
2 True
3 True
4 False
dtype: bool
"[ ]" - Square brackets specify a set of characters to match:
In [17]:
my_series.str.contains("[Tt]ill") # Matches T or t followed by "ill"
Out[17]:
0 False
1 False
2 True
3 True
4 False
dtype: bool
Regular expressions include several special character sets that allow you to quickly specify certain common character types. They include:
[a-z] - match any lowercase letter
[A-Z] - match any uppercase letter
[0-9] - match any digit
[a-zA-Z0-9] - match any letter or digit
Adding the "^" symbol inside the square brackets matches any characters NOT in the set:
[^a-z] - match any character that is not a lowercase letter
[^A-Z] - match any character that is not an uppercase letter
[^0-9] - match any character that is not a digit
[^a-zA-Z0-9] - match any character that is not a letter or digit
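For example, here is a brief sketch with a made-up series (the values are only for illustration):

codes = pd.Series(["abc123", "ABC", "123", "!!!"])

codes.str.contains("[a-z]")         # True where a lowercase letter is present
codes.str.contains("[^a-zA-Z0-9]")  # True only for "!!!", the one non-alphanumeric string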
Python regular expressions also include a shorthand for specifying common sequences:
\d - match any digit
\D - match any non digit
\w - match a word character
\W - match a non-word character
\s - match whitespace (spaces, tabs, newlines, etc.)
\S - match non-whitespace
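Here is a quick sketch of a couple of these shorthand classes on another made-up series:

my_strings = pd.Series(["file 1", "file_2", "file3!"])

my_strings.str.contains(r"\d")   # Every string contains a digit, so all True
my_strings.str.contains(r"\s")   # Only "file 1" contains whitespace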
"^" - outside of square brackets, the caret symbol searches for matches at the beginning of a string:
In [18]:
ex_str1 = pd.Series(["Where did he go", "He went to the mall", "he is good"])

ex_str1.str.contains("^(He|he)") # Matches He or he at the start of a string


Out[18]:
0 False
1 True
2 True
dtype: bool
"$" - searches for matches at the end of a string:
In [19]:
ex_str1.str.contains("(go)$") # Matches go at the end of a string
Out[19]:
0 True
1 False
2 False
dtype: bool
"( )" - parentheses in regular expressions are used for grouping and to enforce the proper order of operations
just like they are in math and logical expressions. In the examples above, the parentheses let us group the or
expressions so that the "^" and "$" symbols operate on the entire or statement.
"*" - an asterisk matches zero or more copies of the preceding character
"?" - a question mark matches zero or 1 copy of the preceding character
"+" - a plus matches 1 more copies of the preceding character
In [20]:
ex_str2 = pd.Series(["abdominal","b","aa","abbcc","aba"])

# Match 0 or more a's, a single b, then 1 or more characters


ex_str2.str.contains("a*b.+")
Out[20]:
0 True
1 False
2 False
3 True
4 True
dtype: bool
In [21]:
# Match 1 or more a's, an optional b, then 1 or more a's
ex_str2.str.contains("a+b?a+")
Out[21]:
0 False
1 False
2 True
3 False
4 True
dtype: bool
"{ }" - curly braces match a preceding character for a specified number of repetitions:
"{m}" - the preceding element is matched m times
"{m,}" - the preceding element is matched m times or more
"{m,n}" - the preceding element is matched between m and n times
In [22]:
ex_str3 = pd.Series(["aabcbcb","abbb","abbaab","aabb"])

ex_str3.str.contains("a{2}b{2,}") # Match 2 a's then 2 or more b's


Out[22]:
0 False
1 False
2 False
3 True
dtype: bool
"\" - backslash let you "escape" metacharacters. You must escape metacharacters when you actually want to
match the metacharacter symbol itself. For instance, if you want to match periods you can't use "." because it is
a metacharacter that matches anything. Instead, you'd use "." to escape the period's metacharacter behavior
and match the period itself:
In [23]:
ex_str4 = pd.Series(["Mr. Ed","Dr. Mario","Miss\Mrs Granger."])

ex_str4.str.contains("\. ") # Match a single period and then a space


Out[23]:
0 True
1 True
2 False
dtype: bool
If you want to match the escape character backslash itself, you either have to use four backslashes "\\\\" or encode the string as a raw string of the form r"mystring" and then use double backslashes ("\\"). Raw strings are an alternate string representation in Python that simplify some oddities in performing regular expressions on normal strings. Read more about them here.
In [24]:
ex_str4.str.contains(r"\\") # Match strings containing a backslash
Out[24]:
0 False
1 False
2 True
dtype: bool
Raw strings are often used for regular expression patterns because they avoid issues that may arise when dealing with special string characters.
There are more regular expression intricacies we won't cover here, but combinations of the few symbols we've
covered give you a great amount of expressive power. Regular expressions are commonly used to perform tasks
like matching phone numbers, email addresses and web addresses in blocks of text.
To use regular expressions outside of pandas, you can import the regular expression library with: import re.
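For instance, here is a minimal sketch using the standalone re module (the example strings below are made up purely for illustration):

import re

match = re.search(r"[Ww]olves", "Go Wolves!")                   # Search a single string for a pattern
print(match.group())                                            # Prints the matched text: "Wolves"

print(re.findall(r"\d+", "Scored 102 points in 48 minutes"))    # Prints ['102', '48']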
Pandas has several string functions that accept regex patterns and perform an operation on each string in a series.
We already saw two such functions: series.str.contains() and series.str.replace(). Let's go back to our basketball
comments and explore some of these functions.
Use series.str.count() to count the occurrences of a pattern in each string:
In [25]:
comments.str.count(r"[Ww]olves").head(8)
Out[25]:
0 2
1 0
2 0
3 1
4 0
5 0
6 0
7 0
Name: body, dtype: int64
Use series.str.findall() to get each matched substring and return the result as a list:
In [26]:
comments.str.findall(r"[Ww]olves").head(8)
Out[26]:
0 [Wolves, Wolves]
1 []
2 []
3 [Wolves]
4 []
5 []
6 []
7 []
Name: body, dtype: object
Getting Posts with Web Links
Now it's time to use some of the new tools we have in our toolbox on the Reddit comment data. Let's say we are
only interested in posts that contain web links. If we want to narrow down comments to only those with web
links, we'll need to match comments that agree with some pattern that expresses the textual form of a web link.
Let's try using a simple regular expression to find posts with web links.
Web links begin with "http:" or "https:" so let's make a regular expression that matches those substrings:
In [27]:
web_links = comments.str.contains(r"https?:")

posts_with_links = comments[web_links]

print( len(posts_with_links))

posts_with_links.head(5)
216
Out[27]:
1 [My reaction.](http://4.bp.blogspot.com/-3ySob...
2 http://imgur.com/gallery/Zch2AWw
25 [January 4th, 2005 - 47 Pts, 17 Rebs](https://...
29 [You're right.](http://espn.go.com/nba/noteboo...
34 https://www.youtube.com/watch?v=K1VtZht_8t4\n\...
Name: body, dtype: object
It appears the comments we've returned all contain web links. It is possible that a post could contain the string
"http:" without actually having a web link. If we wanted to reduce this possibility, we'd have to be more specific
with our regular expression pattern, but in the case of a basketball-themed forum, it is pretty unlikely.
Now that we've identified posts that contain web links, let's extract the links themselves. Many of the posts
contain both web links and a bunch of text the user wrote. We want to get rid of the text and keep only the web links. We can do this with series.str.findall():
In [28]:
only_links = posts_with_links.str.findall(r"https?:[^ \n\)]+")

only_links.head(10)
Out[28]:
1 [http://4.bp.blogspot.com/-3ySobv38ihc/U6yxpPw...
2 [http://imgur.com/gallery/Zch2AWw]
25 [https://www.youtube.com/watch?v=iLRsJ9gcW0Y, ...
29 [http://espn.go.com/nba/notebook/_/page/ROY141...
34 [https://www.youtube.com/watch?v=K1VtZht_8t4]
40 [https://www.youtube.com/watch?v=mFEzW1Z6TRM]
69 [https://instagram.com/p/2HWfB3o8rK/]
76 [https://www.youtube.com/watch?v=524h48CWlMc&a...
93 [http://i.imgur.com/OrjShZv.jpg]
95 [http://content.sportslogos.net/logos/6/232/fu...
Name: body, dtype: object
The pattern we used to match web links may look confusing, so let's go over it step by step.
First the pattern matches the exact characters "http", an optional "s" and then ":".
Next, with [^ \n\)], we create a set of characters to match. Since our set starts with "^", we are actually matching the negation of the set. In this case, the set is the space character, the newline character "\n" and the closing parenthesis character ")". We had to escape the closing parenthesis character by writing "\)". Since we are matching the negation, this set matches any character that is NOT a space, newline or closing parenthesis.
Finally, the "+" at the end matches this set 1 or more times.
To summarize, the regex matches http: or https: at the start and then any number of characters until it
encounters a space, newline or closing parenthesis. This regex isn't perfect: a web address could contain
parentheses and a space, newline or closing parenthesis might not be the only characters that mark the end of a
web link in a comment. It is good enough for this small data set, but for a serious project we would probably
want something a little more specific to handle such corner cases.
Complex regular expressions can be difficult to write and confusing to read. Sometimes it is easiest to simply
search the web for a regular expression to perform a common task instead of writing one from scratch. You can
test and troubleshoot Python regular expressions using this online tool.
*Note: If you copy a regex written for another language it might not work in Python without some
modifications.
Wrap Up
In this lesson, we learned several functions for dealing with text data in Python and introduced regular
expressions, a powerful tool for matching substrings in text. Regular expressions are used in many programming
languages and although the syntax for regex varies a bit from one language to another, the basic constructs are
similar across languages.
Next time we'll turn our attention to cleaning and preparing numeric data.
Python for Data Analysis Part 16: Preparing Numeric Data

Numeric data tends to be better-behaved than text data. There are only so many symbols that appear in numbers and they have well-defined values. Despite its relative cleanliness, there are a variety of preprocessing tasks you
should consider before using numeric data. In this lesson, we'll learn some common operations used to prepare
numeric data for use in analysis and predictive models.
Centering and Scaling
Numeric variables are often on different scales and cover different ranges, so they can't be easily compared.
What's more, variables with large values can dominate those with smaller values when using certain modeling
techniques. Centering and scaling is a common preprocessing task that puts numeric variables on a common
scale so no single variable will dominate the others.
The simplest way to center data is to subtract the mean value from each data point. Subtracting the mean
centers the data around zero and sets the new mean to zero. Let's try zero-centering the mtcars dataset that
comes with the ggplot library:
In [1]:
%matplotlib inline # This line lets me show plots
In [2]:
import numpy as np
import pandas as pd
from ggplot import mtcars
In [3]:
print (mtcars.head() )

mtcars.index = mtcars.name # Set row index to car name


del mtcars["name"] # Drop car name column

colmeans = mtcars.sum()/mtcars.shape[0] # Get column means

colmeans
name mpg cyl disp hp drat wt qsec vs am gear \
0 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4
1 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4
2 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4
3 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3
4 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3

carb
0 4
1 4
2 1
3 1
4 2
Out[3]:
mpg 20.090625
cyl 6.187500
disp 230.721875
hp 146.687500
drat 3.596563
wt 3.217250
qsec 17.848750
vs 0.437500
am 0.406250
gear 3.687500
carb 2.812500
dtype: float64
With the column means in hand, we just need to subtract the column means from each row in an element-wise
fashion to zero center the data. Pandas performs math operations involving DataFrames and columns on an
element-wise row-by-row basis by default, so we can simply subtract our column means series from the data set
to center it:
In [4]:
centered = mtcars-colmeans

print(centered.describe())
mpg cyl disp hp drat \
count 3.200000e+01 32.000000 3.200000e+01 32.000000 3.200000e+01
mean 3.996803e-15 0.000000 -4.618528e-14 0.000000 -5.967449e-16
std 6.026948e+00 1.785922 1.239387e+02 68.562868 5.346787e-01
min -9.690625e+00 -2.187500 -1.596219e+02 -94.687500 -8.365625e-01
25% -4.665625e+00 -2.187500 -1.098969e+02 -50.187500 -5.165625e-01
50% -8.906250e-01 -0.187500 -3.442188e+01 -23.687500 9.843750e-02
75% 2.709375e+00 1.812500 9.527812e+01 33.312500 3.234375e-01
max 1.380938e+01 1.812500 2.412781e+02 188.312500 1.333437e+00

wt qsec vs am gear carb


count 3.200000e+01 3.200000e+01 32.000000 32.000000 32.000000 32.0000
mean 4.440892e-16 -2.609024e-15 0.000000 0.000000 0.000000 0.0000
std 9.784574e-01 1.786943e+00 0.504016 0.498991 0.737804 1.6152
min -1.704250e+00 -3.348750e+00 -0.437500 -0.406250 -0.687500 -1.8125
25% -6.360000e-01 -9.562500e-01 -0.437500 -0.406250 -0.687500 -0.8125
50% 1.077500e-01 -1.387500e-01 -0.437500 -0.406250 0.312500 -0.8125
75% 3.927500e-01 1.051250e+00 0.562500 0.593750 0.312500 1.1875
max 2.206750e+00 5.051250e+00 0.562500 0.593750 1.312500 5.1875
With zero-centered data, negative values are below average and positive values are above average.
Now that the data is centered, we'd like to put it all on a common scale. One way to put data on a common scale
is to divide by the standard deviation. Standard deviation is a statistic that describes the spread of numeric data.
The higher the standard deviation, the further the data points tend to be spread away from the mean value. You
can get standard deviations with df.std():
In [5]:
column_deviations = mtcars.std(axis=0) # Get column standard deviations

centered_and_scaled = centered/column_deviations

print(centered_and_scaled.describe())
mpg cyl disp hp drat \
count 3.200000e+01 3.200000e+01 3.200000e+01 3.200000e+01 3.200000e+01
mean 6.661338e-16 -2.775558e-17 -3.330669e-16 2.775558e-17 -1.110223e-15
std 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
min -1.607883e+00 -1.224858e+00 -1.287910e+00 -1.381032e+00 -1.564608e+00
25% -7.741273e-01 -1.224858e+00 -8.867035e-01 -7.319924e-01 -9.661175e-01
50% -1.477738e-01 -1.049878e-01 -2.777331e-01 -3.454858e-01 1.841059e-01
75% 4.495434e-01 1.014882e+00 7.687521e-01 4.858679e-01 6.049193e-01
max 2.291272e+00 1.014882e+00 1.946754e+00 2.746567e+00 2.493904e+00

wt qsec vs am gear \
count 3.200000e+01 3.200000e+01 32.000000 3.200000e+01 3.200000e+01
mean 4.163336e-16 -1.443290e-15 0.000000 5.551115e-17 -2.775558e-17
std 1.000000e+00 1.000000e+00 1.000000 1.000000e+00 1.000000e+00
min -1.741772e+00 -1.874010e+00 -0.868028 -8.141431e-01 -9.318192e-01
25% -6.500027e-01 -5.351317e-01 -0.868028 -8.141431e-01 -9.318192e-01
50% 1.101223e-01 -7.764656e-02 -0.868028 -8.141431e-01 4.235542e-01
75% 4.013971e-01 5.882951e-01 1.116036 1.189901e+00 4.235542e-01
max 2.255336e+00 2.826755e+00 1.116036 1.189901e+00 1.778928e+00

carb
count 3.200000e+01
mean 2.775558e-17
std 1.000000e+00
min -1.122152e+00
25% -5.030337e-01
50% -5.030337e-01
75% 7.352031e-01
max 3.211677e+00
Notice that after dividing by the standard deviation, every variable now has a standard deviation of 1. At this point, all the columns are centered on zero and have the same scale of spread about the mean.
Manually centering and scaling as we've done is a good exercise, but it is often possible to perform common
data preprocessing automatically using functions built into Python libraries. The Python library scikit-learn, a
popular package for predictive modeling and data analysis, has preprocessing tools including a scale() function
for centering and scaling data:
In [6]:
from sklearn import preprocessing
In [7]:
scaled_data = preprocessing.scale(mtcars) # Scale the data*

scaled_cars = pd.DataFrame(scaled_data, # Remake the DataFrame


index=mtcars.index,
columns=mtcars.columns)

print(scaled_cars.describe() )
mpg cyl disp hp drat \
count 3.200000e+01 3.200000e+01 3.200000e+01 32.000000 3.200000e+01
mean -4.996004e-16 2.775558e-17 1.665335e-16 0.000000 -3.053113e-16
std 1.016001e+00 1.016001e+00 1.016001e+00 1.016001 1.016001e+00
min -1.633610e+00 -1.244457e+00 -1.308518e+00 -1.403130 -1.589643e+00
25% -7.865141e-01 -1.244457e+00 -9.008917e-01 -0.743705 -9.815764e-01
50% -1.501383e-01 -1.066677e-01 -2.821771e-01 -0.351014 1.870518e-01
75% 4.567366e-01 1.031121e+00 7.810529e-01 0.493642 6.145986e-01
max 2.327934e+00 1.031121e+00 1.977904e+00 2.790515 2.533809e+00

wt qsec vs am gear \
count 3.200000e+01 3.200000e+01 32.000000 32.000000 3.200000e+01
mean 5.551115e-17 -1.471046e-15 0.000000 0.000000 -2.775558e-17
std 1.016001e+00 1.016001e+00 1.016001 1.016001 1.016001e+00
min -1.769642e+00 -1.903996e+00 -0.881917 -0.827170 -9.467293e-01
25% -6.604034e-01 -5.436944e-01 -0.881917 -0.827170 -9.467293e-01
50% 1.118844e-01 -7.888899e-02 -0.881917 -0.827170 4.303315e-01
75% 4.078199e-01 5.977084e-01 1.133893 1.208941 4.303315e-01
max 2.291423e+00 2.871986e+00 1.133893 1.208941 1.807392e+00

carb
count 3.200000e+01
mean -2.775558e-17
std 1.016001e+00
min -1.140108e+00
25% -5.110827e-01
50% -5.110827e-01
75% 7.469671e-01
max 3.263067e+00
*Note: preprocessing.scale() returns ndarrays so we have to convert it back into a DataFrame.
Notice that the values are almost the same as those we calculated manually, but not exactly the same. The small differences arise because preprocessing.scale() divides by the population standard deviation (denominator n), while the pandas std() method we used divides by the sample standard deviation (denominator n-1).
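If you want to confirm this for yourself, a quick sketch: redo the manual scaling using the population standard deviation (ddof=0) and compare the result with scale()'s output.

pop_deviations = mtcars.std(ddof=0)    # Population standard deviations (divide by n)
rescaled = centered/pop_deviations     # Redo the manual centering-and-scaling

print(rescaled.describe())             # Should now closely match scaled_cars.describe()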
Dealing With Skewed Data
The distribution of data--its overall shape and how it is spread out--can have a significant impact on analysis and
modeling. Data that is roughly evenly spread around the mean value--known as normally distributed data--tends
to be well-behaved. On the other hand, some data sets exhibit significant skewness or asymmetry. To illustrate, let's generate a few distributions:
In [8]:
normally_distributed = np.random.normal(size=10000) # Generate normal data*

normally_distributed = pd.DataFrame(normally_distributed) # Convert to DF

normally_distributed.hist(figsize=(8,8), # Plot histogram


bins=30)
Out[8]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000000B349EB8>]], dtype=object)

*Note: We will cover probability distributions and generating random data in a future lesson.
Notice how the normally distributed data looks roughly symmetric with a bell-shaped curve. Now let's generate
some skewed data:
In [9]:
skewed = np.random.exponential(scale=2, # Generate skewed data
size= 10000)

skewed = pd.DataFrame(skewed) # Convert to DF

skewed.hist(figsize=(8,8), # Plot histogram


bins=50)
Out[9]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000000B3E5B70>]], dtype=object)

Data with a long tail that goes off to the right is called positively skewed or right skewed. When you have a
skewed distribution like the one above, the extreme values in the long tail can have a disproportionately large
influence on whatever test you perform or models you build. Reducing skew may improve your results. Taking
the square root of each data point or taking the natural logarithm of each data point are two simple
transformations that can reduce skew. Let's see their effects on the skewed data:
In [10]:
sqrt_transformed = skewed.apply(np.sqrt) # Get the square root of data points*

sqrt_transformed.hist(figsize=(8,8), # Plot histogram


bins=50)
Out[10]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000000B7A5B38>]], dtype=object)

*Note: The df.apply() function applies a given function to each row or column of the DataFrame. In this case we
pass in np.sqrt to get the square root of each value.
Now let's look at a log transformation:
In [11]:
log_transformed = (skewed+1).apply(np.log) # Get the log of the data

log_transformed.hist(figsize = (8,8), # Plot histogram


bins=50)
Out[11]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000000B8735F8>]], dtype=object)

*Note: Adding 1 before taking the log ensures we don't end up with negative values (and avoids taking the log of zero). Also note that neither of these transformations works on data containing negative values. To make them work on data with negative values, add a constant to each value that is large enough to make all the data greater than or equal to 1 (such as adding the absolute value of the smallest number plus 1).
Both the sqrt() and log() transforms reduced the skew of the data. It's still not quite normally distributed, but the
amount of extreme data in the tails has been reduced to the point where we might not be so worried about it
having a large influence on our results.
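If your data does contain negative values, here is a sketch of the shift-then-log approach described in the note above (the generated data is just for illustration):

data_with_negatives = pd.DataFrame(np.random.normal(loc=0, scale=2, size=10000))

shifted = data_with_negatives - data_with_negatives.min() + 1   # Shift so the minimum value is 1
log_shifted = shifted.apply(np.log)                             # The log transform is now safe to apply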
Highly Correlated Variables
In predictive modeling, each variable you use to construct a model would ideally represent some unique feature
of the data. In other words, you want each variable to tell you something different. In reality, variables often
exhibit collinearity--a strong correlation or tendency to move together, typically due to some underlying
similarity or common influencing factor. Variables with strong correlations can interfere with one another when
performing modeling and muddy results.
You can check the pairwise correlations between numeric variables using the df.corr() function:
In [12]:
mtcars.ix[:,0:6].corr() # Check the pairwise correlations of 6 variables
Out[12]:
mpg cyl disp hp drat wt
mpg 1.000000 -0.852162 -0.847551 -0.776168 0.681172 -0.867659
cyl -0.852162 1.000000 0.902033 0.832447 -0.699938 0.782496
disp -0.847551 0.902033 1.000000 0.790949 -0.710214 0.887980
hp -0.776168 0.832447 0.790949 1.000000 -0.448759 0.658748
drat 0.681172 -0.699938 -0.710214 -0.448759 1.000000 -0.712441
wt -0.867659 0.782496 0.887980 0.658748 -0.712441 1.000000
A positive correlation implies that when one variable goes up the other tends to go up as well. Negative
correlations indicate an inverse relationship: when one variable goes up the other tends to go down. A
correlation near zero indicates low correlation while a correlation near -1 or 1 indicates a large negative or
positive correlation.
Inspecting the data table, we see that the number of cylinders a car has (cyl) and its weight (wt) have fairly strong negative correlations with gas mileage (mpg). This indicates that heavier cars and cars with more cylinders tend to get lower gas mileage.
A scatter plot matrix can be a helpful visual aid for inspecting collinearity. We can create one with the pandas scatter_matrix() function located in the pandas.tools.plotting module:
In [13]:
from pandas.tools.plotting import scatter_matrix
In [14]:
scatter_matrix(mtcars.ix[:,0:6], # Make a scatter matrix of 6 columns
figsize=(10, 10), # Set plot size
diagonal='kde') # Show distribution estimates on diagonal
Out[14]:

A scatter plot matrix creates pairwise scatter plots that let you visually inspect the relationships between pairs
of variables. It can also help identify oddities in the data, such as variables like cyl that only take on values in a
small discrete set.
If you find highly correlated variables, there are a few things you can do including:
1. Leave them be
2. Remove one or more variables
3. Combine them in some way
Reducing the number of variables under consideration, either by removing some or by combining them in some way, is known as "dimensionality reduction." How you choose to handle correlated variables is ultimately a
subjective decision that should be informed by your goal.
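As one illustration of combining variables (an aside, not something we use later in this lesson), scikit-learn's PCA can compress correlated columns into a smaller number of uncorrelated components. A minimal sketch applied to the scaled data from earlier:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)                    # Ask for 2 components
components = pca.fit_transform(scaled_cars)  # scaled_cars from the scaling section above

print(pca.explained_variance_ratio_)         # Share of variance each component captures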
Imputing with Sklearn
In the lesson on initial data exploration, we explored Titanic survivor data and found that several passengers had missing values listed for Age. Missing values in numeric data are troublesome because you can't simply treat them as a category: you have to either remove them or fill them in.
Imputation describes filling in missing data with estimates based on the rest of the data set. When working with the Titanic data set, we set all the missing Age values to the median age for the data set. Other simple imputation methods include setting missing values to the mean or most common value (mode). The scikit-learn library offers an Imputer tool that can automatically carry out these imputations for us. Let's start by loading the Imputer and introducing some missing values into the mpg column of the mtcars data:
In [15]:
from sklearn.preprocessing import Imputer
In [16]:
# The following line sets a few mpg values to None
mtcars["mpg"] = np.where(mtcars["mpg"]>22, None, mtcars["mpg"])

mtcars["mpg"] # Confirm that missing values were added


Out[16]:
name
Mazda RX4 21
Mazda RX4 Wag 21
Datsun 710 None
Hornet 4 Drive 21.4
Hornet Sportabout 18.7
Valiant 18.1
Duster 360 14.3
Merc 240D None
Merc 230 None
Merc 280 19.2
Merc 280C 17.8
Merc 450SE 16.4
Merc 450SL 17.3
Merc 450SLC 15.2
Cadillac Fleetwood 10.4
Lincoln Continental 10.4
Chrysler Imperial 14.7
Fiat 128 None
Honda Civic None
Toyota Corolla None
Toyota Corona 21.5
Dodge Challenger 15.5
AMC Javelin 15.2
Camaro Z28 13.3
Pontiac Firebird 19.2
Fiat X1-9 None
Porsche 914-2 None
Lotus Europa None
Ford Pantera L 15.8
Ferrari Dino 19.7
Maserati Bora 15
Volvo 142E 21.4
Name: mpg, dtype: object
Now let's use the Imputer to fill in missing values based on the mean:
In [17]:
imp = Imputer(missing_values='NaN', # Create imputation model
strategy='mean', # Use mean imputation
axis=0) # Impute by column

imputed_cars = imp.fit_transform(mtcars) # Use imputation model to get values

imputed_cars = pd.DataFrame(imputed_cars, # Remake DataFrame with new values


index=mtcars.index,
columns = mtcars.columns)

imputed_cars.head(10)
Out[17]:
mpg cyl disp hp drat wt qsec vs am gear carb
name
Mazda RX4 21.000000 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.000000 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 17.065217 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.400000 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.700000 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.100000 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.300000 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 17.065217 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 17.065217 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.200000 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Currently the Imputer only supports the "mean", "median" and "most_frequent" (mode) as strategies for
imputation.
Other imputation methods include filling in values based on "similar" or "neighboring" records (K-nearest-
neighbors imputation) and filling in values based on regression models. Using predictive models to fill in missing
values adds an extra layer of complexity to an analysis and can significantly increase processing time, although it
may result in better predictive performance. We'll revisit predictive modeling in a future lesson.
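As a rough sketch of the "similar records" idea using only pandas (not the Imputer), we could fill each missing mpg value with the median mpg of cars that have the same number of cylinders; this is an illustration, not necessarily a better choice here:

mpg_numeric = mtcars["mpg"].astype(float)                               # The Nones become NaN
group_medians = mpg_numeric.groupby(mtcars["cyl"]).transform("median")  # Median mpg within each cylinder count
mpg_by_group = mpg_numeric.fillna(group_medians)                        # Fill each NaN with its group's median

mpg_by_group.head(10)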
Wrap Up
In the past two lessons, we covered a variety of methods for preparing text data and numeric data. The majority
of data you encounter will likely fall in one of these two categories, but there is one other type of data that
appears with enough frequency that you will have to deal with it sooner or later: dates.
Python for Data Analysis Part 17: Dealing With Dates

In the last two lessons, we learned a variety of methods for working with text and numeric data, but many data sets also contain dates that don't fit nicely into either category. Common date formats contain numbers and
sometimes text as well to specify months and days. Getting dates into a friendly format and extracting features
of dates like month and year into new variables can be useful preprocessing steps.
For this lesson I've created some dummy date data in a few different formats. To read the data, copy the table
of dates below and then use pd.read_clipboard() with the tab character as the separator and the index column
set to 0:
In [68]:
import numpy as np
import pandas as pd
In [75]:
dates = pd.read_clipboard(sep="\t", # Read data from clipboard
index_col=0)
In [70]:
dates # Check the dates
Out[70]:
month_day_year day_month_year date_time year_month_day
1 4/22/1996 22-Apr-96 Tue Aug 11 09:50:35 1996 2007-06-22
2 4/23/1996 23-Apr-96 Tue May 12 19:50:35 2016 2017-01-09
3 5/14/1996 14-May-96 Mon Oct 14 09:50:35 2017 1998-04-12
4 5/15/1996 15-May-96 Tue Jan 11 09:50:35 2018 2027-07-22
5 5/16/2001 16-May-01 Fri Mar 11 07:30:36 2019 1945-11-15
6 5/17/2002 17-May-02 Tue Aug 11 09:50:35 2020 1942-06-22
7 5/18/2003 18-May-03 Wed Dec 21 09:50:35 2021 1887-06-13
8 5/19/2004 19-May-04 Tue Jan 11 09:50:35 2022 1912-01-25
9 5/20/2005 20-May-05 Sun Jul 10 19:40:25 2023 2007-06-22
When you load data with Pandas, dates are typically loaded as strings by default. Let's check the type of data in
each column:
In [76]:
for col in dates:
print (type(dates[col][1]))
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
The output confirms that all the date data is currently in string form. To work with dates, we need to convert
them from strings into a data format built for processing dates. The pandas library comes with a Timestamp data
object for storing and working with dates. You can instruct pandas to automatically convert a date column in
your data into Timestamps when you read your data by adding the "parse_dates" argument to the data reading
function with a list of column indices indicating the columns you wish to convert to Timestamps. Let's re-read the
data with parse_dates turned on for each column:
In [85]:
dates = pd.read_clipboard(sep="\t",
index_col=0,
parse_dates=[0,1,2,3]) # Convert cols to Timestamp
Now let's check the data types again:
In [86]:
for col in dates:
print (type(dates[col][1]))
<class 'pandas.tslib.Timestamp'>
<class 'pandas.tslib.Timestamp'>
<class 'pandas.tslib.Timestamp'>
<class 'str'>
The output shows that 3 out of 4 of the date columns were successfully parsed and translated into Timestamps.
The default date parser works on many common date formats, but dates can come in a lot of different forms. If a
date column is not converted to Timestamp by the default date parser, you can attempt to convert the column
to Timestamp using the function pd.to_datetime(). Let's use it to convert column 3:
In [88]:
dates["year_month_day"] = pd.to_datetime(dates["year_month_day"] )
In [140]:
for col in dates:
print (type(dates[col][1]))
<class 'pandas.tslib.Timestamp'>
<class 'pandas.tslib.Timestamp'>
<class 'pandas.tslib.Timestamp'>
<class 'pandas.tslib.Timestamp'>
If you have oddly formatted date time objects, you might have to specify the exact format to get it to convert
correctly into a Timestamp. For instance, consider a date format that gives date times of the form
hour:minute:second year-day-month:
In [120]:
odd_date = "12:30:15 2015-29-11"
The default to_datetime parser will fail to convert this date because it expects dates in the form year-month-
day. In cases like this, specify the date's format to convert it to Timestamp:
In [123]:
pd.to_datetime(odd_date,
format= "%H:%M:%S %Y-%d-%m")
Out[123]:
Timestamp('2015-11-29 12:30:15')
As seen above, date formatting uses special formatting codes for each part of the date. For instance, %H
represents hours and %Y represents the four digit year. A full list of formatting codes is available in the Python strftime/strptime documentation.
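As a small illustrative sketch (not part of the original lesson's output), the same codes also work in the other direction: a Timestamp's strftime() method formats a date back into a string using whatever codes you supply:
ts = pd.to_datetime("22-Apr-96", format="%d-%b-%y") # %b matches abbreviated month names

ts.strftime("%Y-%m-%d") # Format the Timestamp back into a string: '1996-04-22'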
Once you have your dates in the Timestamp format, you can extract a variety of properties like the year, month
and day. Converting dates into several simpler features can make the data easier to analyze and use in
predictive models. Access date properties from a Series of Timestamps with the syntax: Series.dt.property. To
illustrate, let's extract some features from the first column of our date data and put them in a new DataFrame:
In [151]:
column_1 = dates.ix[:,0]

pd.DataFrame({"year": column_1.dt.year,
"month": column_1.dt.month,
"day": column_1.dt.day,
"hour": column_1.dt.hour,
"dayofyear": column_1.dt.dayofyear,
"week": column_1.dt.week,
"weekofyear": column_1.dt.weekofyear,
"dayofweek": column_1.dt.dayofweek,
"weekday": column_1.dt.weekday,
"quarter": column_1.dt.quarter,
})
Out[151]:
day dayofweek dayofyear hour month quarter week weekday weekofyear year
1 22 0 113 0 4 2 17 0 17 1996
2 23 1 114 0 4 2 17 1 17 1996
3 14 1 135 0 5 2 20 1 20 1996
4 15 2 136 0 5 2 20 2 20 1996
5 16 2 136 0 5 2 20 2 20 2001
6 17 4 137 0 5 2 20 4 20 2002
7 18 6 138 0 5 2 20 6 20 2003
8 19 2 140 0 5 2 21 2 21 2004
9 20 4 140 0 5 2 20 4 20 2005
In addition to extracting date features, you can use the subtraction operator on Timestamp objects to determine
the amount of time between two different dates:
In [160]:
print(dates.ix[1,0])
print(dates.ix[3,0])
print(dates.ix[3,0]-dates.ix[1,0])
1996-04-22 00:00:00
1996-05-14 00:00:00
22 days 00:00:00
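The result of subtracting two Timestamps is a Timedelta object. As a quick illustrative sketch, you can pull out its components, such as the number of whole days:
diff = dates.ix[3,0] - dates.ix[1,0] # Timedelta between the two dates

diff.days # Number of whole days in the difference: 22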
Pandas includes a variety of more advanced date and time functionality beyond the basics covered in this lesson,
particularly for dealing with time series data (data consisting of many periodic measurements over time). You can
read more about date and time functionality in the pandas time series documentation.
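For example, here is a minimal sketch of pd.date_range(), one of the time series helpers pandas provides, which generates a sequence of evenly spaced Timestamps:
pd.date_range(start="2016-01-01", periods=5, freq="D") # Five consecutive daily Timestamps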
Wrap Up
Pandas makes it easy to convert date data into the Timestamp data format and extract basic date features like
day of the year, month and day of week. Simple date features can be powerful predictors because data often
exhibit cyclical patterns over different time scales.
Cleaning and preprocessing numeric, character and date data is sometimes all you need to do before you start a
project. In some cases, however, your data may be split across several tables such as different worksheets in an
Excel file or different tables in a database. In these cases, you might have to combine two tables together before
proceeding with your project. In the next lesson, we'll explore how to merge data sets.
Python for Data Analysis Part 18: Merging Data

Data you use for your projects won't always be confined to a single table in a CSV or Excel file. Data is often split
across several tables that you need to combine in some way. Data frames can be joined together if they have
columns in common. Joining tables in various ways is a common operation when working with databases but
you can also join data frames in Python using functions included with pandas.
First, let's import some libraries and load two tables of related data. You can load the data into your own
environment by copying each table below and then using pd.read_clipboard(sep="\t")
In [1]:
import numpy as np
import pandas as pd
import os
In [2]:
table1 = pd.read_clipboard(sep="\t")

table1
Out[2]:
Unnamed: 0 P_ID gender height weight
0 1 1 male 71 175
1 2 2 male 74 225
2 3 3 female 64 130
3 4 4 female 64 125
4 5 5 female 66 165
5 6 6 male 69 160
6 7 7 female 62 115
7 8 8 male 72 250
In [3]:
table2 = pd.read_clipboard(sep="\t")

table2
Out[3]:
Unnamed: 0 P_ID sex visits checkup follow_up illness surgery ER
0 1 1 male 1 1 0 0 0 0
1 2 2 male 2 1 0 0 0 1
2 3 4 female 4 1 1 2 0 0
3 4 5 female 12 1 2 7 2 0
4 5 7 female 2 1 0 1 0 0
5 6 8 male 2 1 0 1 0 0
6 7 9 male 1 0 0 0 0 1
7 8 10 female 1 0 0 0 0 1
Let's delete the unnamed column:
In [4]:
del table1["Unnamed: 0"]
del table2["Unnamed: 0"]
Both data frames contain the column "P_ID" but the other columns are different. A unique identifier like an ID is
usually a good key for joining two data frames together. You can combine two data frames by a common column
with merge():
In [5]:
combined1 = pd.merge(table1, # First table
table2, # Second table
how="inner", # Merge method
on="P_ID") # Column(s) to join on

combined1
Out[5]:
P_ID gender height weight sex visits checkup follow_up illness surgery ER
0 1 male 71 175 male 1 1 0 0 0 0
1 2 male 74 225 male 2 1 0 0 0 1
2 4 female 64 125 female 4 1 1 2 0 0
3 5 female 66 165 female 12 1 2 7 2 0
4 7 female 62 115 female 2 1 0 1 0 0
5 8 male 72 250 male 2 1 0 1 0 0
Inspecting the new combined data frame, we can see that the number of records dropped from 8 in the original
tables to 6 in the combined table. If we inspect the P_ID column closely, we see that the original data tables
contain some different values for P_ID. Note that inside the merge function we set the argument "how" to
"inner". An inner join only merges records that appear in both columns used for the join. Since patients 3 and 6
only appear in table1 and patients 9 and 10 only appear in table2, those four patients were dropped when we
merged the tables together.
Inner joins ensure that we don't end up introducing missing values in our data. For instance, if we kept patients
3 and 6 in the combined data frame, those patients would end up with a lot of missing values because they
aren't present in table2. If you want to keep more of your data and don't mind introducing some missing
values, you can use merge to perform other types of joins, such as left joins, right joins and outer joins:
In [6]:
# A left join keeps all key values in the first(left) data frame

left_join = pd.merge(table1, # First table
table2, # Second table
how="left", # Merge method
on="P_ID") # Column(s) to join on

left_join
Out[6]:
P_ID gender height weight sex visits checkup follow_up illness surgery ER
0 1 male 71 175 male 1 1 0 0 0 0
1 2 male 74 225 male 2 1 0 0 0 1
2 3 female 64 130 NaN NaN NaN NaN NaN NaN NaN
3 4 female 64 125 female 4 1 1 2 0 0
4 5 female 66 165 female 12 1 2 7 2 0
5 6 male 69 160 NaN NaN NaN NaN NaN NaN NaN
6 7 female 62 115 female 2 1 0 1 0 0
7 8 male 72 250 male 2 1 0 1 0 0
In [7]:
# A right join keeps all key values in the second(right) data frame

right_join = pd.merge(table1, # First table
table2, # Second table
how="right", # Merge method
on="P_ID") # Column(s) to join on

right_join
Out[7]:
P_ID gender height weight sex visits checkup follow_up illness surgery ER
0 1 male 71 175 male 1 1 0 0 0 0
1 2 male 74 225 male 2 1 0 0 0 1
2 4 female 64 125 female 4 1 1 2 0 0
3 5 female 66 165 female 12 1 2 7 2 0
4 7 female 62 115 female 2 1 0 1 0 0
5 8 male 72 250 male 2 1 0 1 0 0
6 9 NaN NaN NaN male 1 0 0 0 0 1
7 10 NaN NaN NaN female 1 0 0 0 0 1
In [8]:
# An outer join keeps all key values in both data frames

outer_join = pd.merge(table1, # First table
table2, # Second table
how="outer", # Merge method
on="P_ID") # Column(s) to join on

outer_join
Out[8]:
P_ID gender height weight sex visits checkup follow_up illness surgery ER
0 1 male 71 175 male 1 1 0 0 0 0
1 2 male 74 225 male 2 1 0 0 0 1
2 3 female 64 130 NaN NaN NaN NaN NaN NaN NaN
3 4 female 64 125 female 4 1 1 2 0 0
4 5 female 66 165 female 12 1 2 7 2 0
5 6 male 69 160 NaN NaN NaN NaN NaN NaN NaN
6 7 female 62 115 female 2 1 0 1 0 0
7 8 male 72 250 male 2 1 0 1 0 0
8 9 NaN NaN NaN male 1 0 0 0 0 1
9 10 NaN NaN NaN female 1 0 0 0 0 1
By this point, you may have noticed that the two data frames contain a second column in common. The first
table contains the column "gender" while the second contains the column "sex", both of which record the same
information. We can solve this issue by first renaming one of the two columns so that their names are the same
and then supplying that column's name as a second column to merge upon:
In [9]:
table2.rename(columns={"sex":"gender"}, inplace=True) # Rename "sex" column

combined2 = pd.merge(table1, # First data frame
table2, # Second data frame
how="outer", # Merge method
on=["P_ID","gender"]) # Column(s) to join on

combined2
Out[9]:
P_ID gender height weight visits checkup follow_up illness surgery ER
0 1 male 71 175 1 1 0 0 0 0
1 2 male 74 225 2 1 0 0 0 1
2 3 female 64 130 NaN NaN NaN NaN NaN NaN
3 4 female 64 125 4 1 1 2 0 0
4 5 female 66 165 12 1 2 7 2 0
5 6 male 69 160 NaN NaN NaN NaN NaN NaN
6 7 female 62 115 2 1 0 1 0 0
7 8 male 72 250 2 1 0 1 0 0
8 9 male NaN NaN 1 0 0 0 0 1
9 10 female NaN NaN 1 0 0 0 0 1
By renaming and merging on the gender column, we've managed to eliminate some NA values in the outer join.
Although outer joins can introduce NA values, they can also be helpful for discovering patterns in the data.
For example, in our combined data, notice that the two patients who did not have values listed for height and
weight only made visits to the ER. It could be that the hospital did not have patients 9 and 10 on record
previously and that it does not take height and weight measurements for ER visits. Using the same type of
intuition, it could be that patients 3 and 6 have height and weight measurements on file from visits in the past,
but perhaps they did not visit the hospital during the time period for which the visit data was collected.
Wrap Up
The pandas function merge() can perform common joins to combine data frames with matching columns. For
some projects, you may have to merge several tables into one to get the most out of your data.
Now that we know how to prepare and merge data, we're ready to learn more about two of the most common
tools for exploring data sets: frequency tables and plots.
Python for Data Analysis Part 19: Frequency Tables

Discovering relationships between variables is the fundamental goal of data analysis. Frequency tables are a
basic tool you can use to explore data and get an idea of the relationships between variables. A frequency table
is just a data table that shows the counts of one or more categorical variables.
To explore frequency tables, we'll revisit the Titanic training set from Kaggle that we studied in lesson 14. We
will perform a couple of the same preprocessing steps we did in lesson 14:
In [1]:
import numpy as np
import pandas as pd
import os
In [2]:
os.chdir('C:\\Users\\Greg\\Desktop\\Kaggle\\titanic') # Set working directory

titanic_train = pd.read_csv("titanic_train.csv") # Read the data

char_cabin = titanic_train["Cabin"].astype(str) # Convert cabin to str

new_Cabin = np.array([cabin[0] for cabin in char_cabin]) # Take first letter

titanic_train["Cabin"] = pd.Categorical(new_Cabin) # Save the new cabin var


One-Way Tables
Create frequency tables (also known as crosstabs) in pandas using the pd.crosstab() function. The function takes
one or more array-like objects as indexes or columns and then constructs a new DataFrame of variable counts
based on the supplied arrays. Let's make a one-way table of the survived variable:
In [3]:
my_tab = pd.crosstab(index=titanic_train["Survived"], # Make a crosstab
columns="count") # Name the count column
my_tab
Out[3]:
col_0 count
Survived
0 549
1 340
In [4]:
type(my_tab) # Confirm that the crosstab is a DataFrame
Out[4]:
pandas.core.frame.DataFrame
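As an aside (not shown in the original lesson), Series.value_counts() produces similar one-way counts directly from a column, which can be handy for quick checks:
titanic_train["Survived"].value_counts() # Quick one-way counts: 549 zeros, 340 ones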
Let's make a couple more crosstabs to explore other variables:
In [5]:
pd.crosstab(index=titanic_train["Pclass"], # Make a crosstab
columns="count") # Name the count column
Out[5]:
col_0 count
Pclass
1 214
2 184
3 491
In [6]:
pd.crosstab(index=titanic_train["Sex"], # Make a crosstab
columns="count") # Name the count column
Out[6]:
col_0 count
Sex
female 312
male 577
In [7]:
cabin_tab = pd.crosstab(index=titanic_train["Cabin"], # Make a crosstab
columns="count") # Name the count column

cabin_tab
Out[7]:
col_0 count
Cabin
A 15
B 45
C 59
D 33
E 32
F 13
G 4
n 688
Even these simple one-way tables give us some useful insight: we immediately get a sense of the distribution of
records across the categories. For instance, we see that males outnumbered females by a significant margin and
that there were more third class passengers than first and second class passengers combined.
If you pass a variable with many unique values to pd.crosstab(), such as a numeric variable, it will still produce a
table of counts for each unique value, but the counts may not be particularly meaningful.
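One workaround, sketched below, is to bin the numeric variable first with pd.cut() and then make a table of the bins. This sketch assumes the Titanic data's "Age" column and uses arbitrary bin edges:
age_bins = pd.cut(titanic_train["Age"], bins=[0, 18, 40, 60, 100]) # Bin a numeric column

pd.crosstab(index=age_bins, columns="count") # Counts per age bin instead of per unique value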
Since the crosstab function produces DataFrames, the DataFrame operations we've learned work on crosstabs:
In [8]:
print (cabin_tab.sum(), "\n") # Sum the counts

print (cabin_tab.shape, "\n") # Check number of rows and cols

cabin_tab.iloc[1:7] # Slice rows 1-6


col_0
count 889
dtype: int64

(8, 1)

Out[8]:
col_0 count
Cabin
B 45
C 59
D 33
E 32
F 13
G 4
One of the most useful aspects of frequency tables is that they allow you to extract the proportion of the data
that belongs to each category. With a one-way table, you can do this by dividing each table value by the total
number of records in the table:
In [9]:
cabin_tab/cabin_tab.sum()
Out[9]:
col_0 count
Cabin
A 0.016873
B 0.050619
C 0.066367
D 0.037120
E 0.035996
F 0.014623
G 0.004499
n 0.773903
Two-Way Tables
Two-way frequency tables, also called contingency tables, are tables of counts with two dimensions where each
dimension is a different variable. Two-way tables can give you insight into the relationship between two
variables. To create a two way table, pass two variables to the pd.crosstab() function instead of one:
In [10]:
# Table of survival vs. sex
survived_sex = pd.crosstab(index=titanic_train["Survived"],
columns=titanic_train["Sex"])

survived_sex.index= ["died","survived"]

survived_sex
Out[10]:
Sex female male
died 81 468
survived 231 109
In [11]:
# Table of survival vs passenger class
survived_class = pd.crosstab(index=titanic_train["Survived"],
columns=titanic_train["Pclass"])

survived_class.columns = ["class1","class2","class3"]
survived_class.index= ["died","survived"]

survived_class
Out[11]:
class1 class2 class3
died 80 97 372
survived 134 87 119
You can get the marginal counts (totals for each row and column) by including the argument margins=True:
In [12]:
# Table of survival vs passenger class
survived_class = pd.crosstab(index=titanic_train["Survived"],
columns=titanic_train["Pclass"],
margins=True) # Include row and column totals

survived_class.columns = ["class1","class2","class3","rowtotal"]
survived_class.index= ["died","survived","coltotal"]
survived_class
Out[12]:
class1 class2 class3 rowtotal
died 80 97 372 549
survived 134 87 119 340
coltotal 214 184 491 889
To get the total proportion of counts in each cell, divide the table by the grand total:
In [13]:
survived_class/survived_class.ix["coltotal","rowtotal"]
Out[13]:
class1 class2 class3 rowtotal
died 0.089989 0.109111 0.418448 0.617548
survived 0.150731 0.097863 0.133858 0.382452
coltotal 0.240720 0.206974 0.552306 1.000000
To get the proportion of counts along each column (in this case, the survival rate within each passenger class)
divide by the column totals:
In [14]:
survived_class/survived_class.ix["coltotal"]
Out[14]:
class1 class2 class3 rowtotal
died 0.373832 0.527174 0.757637 0.617548
survived 0.626168 0.472826 0.242363 0.382452
coltotal 1.000000 1.000000 1.000000 1.000000
To get the proportion of counts along each row divide by the row totals. The division operator functions on a
row-by-row basis when used on DataFrames by default. In this case we want to divide each column by the
rowtotals column. To get division to work on a column by column basis, use df.div() with the axis set to 0 (or
"index"):
In [15]:
survived_class.div(survived_class["rowtotal"],
axis=0)
Out[15]:
class1 class2 class3 rowtotal
died 0.145719 0.176685 0.677596 1
survived 0.394118 0.255882 0.350000 1
coltotal 0.240720 0.206974 0.552306 1
Alternatively, you can transpose the table with df.T to swap rows and columns and perform row by row division
as normal:
In [16]:
survived_class.T/survived_class["rowtotal"]
Out[16]:
died survived coltotal
class1 0.145719 0.394118 0.240720
class2 0.176685 0.255882 0.206974
class3 0.677596 0.350000 0.552306
rowtotal 1.000000 1.000000 1.000000
Higher Dimensional Tables
The crosstab() function lets you create tables out of more than two categories. Higher dimensional tables can be
a little confusing to look at, but they can also yield finer-grained insight into interactions between multiple
variables. Let's create a 3-way table inspecting survival, sex and passenger class:
In [17]:
surv_sex_class = pd.crosstab(index=titanic_train["Survived"],
columns=[titanic_train["Pclass"],
titanic_train["Sex"]],
margins=True) # Include row and column totals

surv_sex_class
Out[17]:
Pclass 1 2 3 All
Sex female male female male female male
Survived
0 3 77 6 91 72 300 549
1 89 45 70 17 72 47 340
All 92 122 76 108 144 347 889
Notice that by passing a second variable to the columns argument, the resulting table has columns categorized
by both Pclass and Sex. The outermost index (Pclass) returns sections of the table instead of individual columns:
In [18]:
surv_sex_class[2] # Get the subtable under Pclass 2
Out[18]:
Sex female male
Survived
0 6 91
1 70 17
All 76 108
The secondary column index, Sex, can't be used as a top level index, but it can be used within a given Pclass:
In [19]:
surv_sex_class[2]["female"] # Get female column within Pclass 2
Out[19]:
Survived
0 6
1 70
All 76
Name: female, dtype: int64
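As a sketch of another way to slice hierarchical columns, df.xs() can pull out all columns at a given inner level. This assumes the column level is named "Sex", which pd.crosstab sets from the name of the input Series:
surv_sex_class.xs("female", level="Sex", axis=1) # Female columns across all passenger classes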
Due to the convenient hierarchical structure of the table, we can still use a single division to get the proportion of
survival across each column:
In [20]:
surv_sex_class/surv_sex_class.ix["All"] # Divide by column totals
Out[20]:
Pclass 1 2 3 All
Sex female male female male female male
Survived
0 0.032609 0.631148 0.078947 0.842593 0.5 0.864553 0.617548
1 0.967391 0.368852 0.921053 0.157407 0.5 0.135447 0.382452
All 1.000000 1.000000 1.000000 1.000000 1.0 1.000000 1.000000
Here we see something quite interesting: over 90% of women in first class and second class survived, but only
50% of women in third class survived. Men in first class also survived at a greater rate than men in lower classes.
Passenger class seems to have a significant impact on survival, so it would likely be useful to include as a feature
in a predictive model.
Wrap Up
Frequency tables are a simple yet effective tool for exploring relationships between variables that take on few
unique values. Tables do, however, require you to inspect numerical values and proportions closely and it is not
always easy to quickly convey insights drawn from tables to others. Creating plots is a way to visually investigate
data, which takes advantage of our innate ability to process and detect patterns in images.
Python for Data Analysis Part 20: Plotting with Pandas

Visualizations are one of the most powerful tools at your disposal for exploring data and communicating data
insights. The pandas library includes basic plotting capabilities that let you create a variety of plots from
DataFrames. Plots in pandas are built on top of a popular Python plotting library called matplotlib, which comes
with the Anaconda Python distribution.
Let's start by loading some packages:
In [2]:
import numpy as np
import pandas as pd
import matplotlib
from ggplot import diamonds

matplotlib.style.use('ggplot') # Use ggplot style plots*


*Note: If you have not installed ggplot, you can do so by opening a console and running "pip install ggplot"
(without quotes).
In this lesson, we're going to look at the diamonds data set that comes with the ggplot library. Let's take a
moment to explore the structure of the data before going any further:
In [3]:
diamonds.shape # Check data shape
Out[3]:
(53940, 10)
In [4]:
diamonds.head(5)
Out[4]:
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
The output shows that the data set contains 10 features of 53940 different diamonds, including both numeric and
categorical variables.
Histograms
A histogram is a univariate plot (a plot that displays one variable) that groups a numeric variable into bins and
displays the number of observations that fall within each bin. A histogram is a useful tool for getting a sense of
the distribution of a numeric variable. Let's create a histogram of diamond carat weight with the df.hist()
function:
In [5]:
diamonds.hist(column="carat", # Column to plot
figsize=(8,8), # Plot size
color="blue") # Plot color
Out[5]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000000B25E2E8>]], dtype=object)

We see immediately that the carat weights are positively skewed: most diamonds are around 1 carat or below
but there are extreme cases of larger diamonds.
The plot above has fairly wide bins and there doesn't appear to be any data beyond a carat size of 3.5. We can
try to get more out of our histogram by adding some additional arguments to control the size of the bins
and limits of the x-axis:
In [6]:
diamonds.hist(column="carat", # Column to plot
figsize=(8,8), # Plot size
color="blue", # Plot color
bins=50, # Use 50 bins
range= (0,3.5)) # Limit x-axis range
Out[6]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000000B499EF0>]], dtype=object)

This histogram gives us a better sense of some subtleties within the distribution, but we can't be sure that it
contains all the data. Limiting the X-axis to 3.5 might have cut out some outliers with counts so small that they
didn't show up as bars on our original chart. Let's check to see if any diamonds are larger than 3.5 carats:
In [7]:
diamonds[diamonds["carat"] > 3.5]
Out[7]:
carat cut color clarity depth table price x y z
23644 3.65 Fair H I1 67.1 53 11668 9.53 9.48 6.38
25998 4.01 Premium I I1 61.0 61 15223 10.14 10.10 6.17
25999 4.01 Premium J I1 62.5 62 15223 10.02 9.94 6.24
26444 4.00 Very Good I I1 63.3 58 15984 10.01 9.94 6.31
26534 3.67 Premium I I1 62.4 56 16193 9.86 9.81 6.13
27130 4.13 Fair H I1 64.8 61 17329 10.00 9.85 6.43
27415 5.01 Fair J I1 65.5 59 18018 10.74 10.54 6.98
27630 4.50 Fair J I1 65.8 58 18531 10.23 10.16 6.72
27679 3.51 Premium J VS2 62.5 59 18701 9.66 9.63 6.03
It turns out that 9 diamonds are bigger than 3.5 carats. Should cutting these diamonds out concern us? On one
hand, these outliers have very little bearing on the shape of the distribution. On the other hand, limiting the X-
axis to 3.5 implies that no data lies beyond that point. For our own exploratory purposes this is not an issue but
if we were to show this plot to someone else, it could be misleading. Including a note that 9 diamonds lie
beyond the chart range could be helpful.
Boxplots
Boxplots are another type of univariate plot for summarizing distributions of numeric data graphically. Let's
make a boxplot of carat using the df.boxplot() method:
In [8]:
diamonds.boxplot(column="carat")

As we learned in lesson 14, the central box of the boxplot represents the middle 50% of the observations, the
central bar is the median and the bars at the end of the dotted lines (whiskers) encapsulate the great majority of
the observations. Circles that lie beyond the end of the whiskers are data points that may be outliers.
In this case, our data set has over 50,000 observations and we see many data points beyond the top whisker.
We probably wouldn't want to classify all of those points as outliers, but the handful of diamonds at 4 carats and
above are definitely far outside the norm.
One of the most useful features of a boxplot is the ability to make side-by-side boxplots. A side-by-side boxplot
takes a numeric variable and splits it based on some categorical variable, drawing a different boxplot for each
level of the categorical variable. Let's make a side-by-side boxplot of diamond price split by diamond clarity:
In [9]:
diamonds.boxplot(column="price", # Column to plot
by= "clarity", # Column to split upon
figsize= (8,8)) # Figure size
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0xb3534a8>

The boxplot above is curious: we'd expect diamonds with better clarity to fetch higher prices and yet diamonds
on the highest end of the clarity spectrum (IF = internally flawless) actually have lower median prices than low
clarity diamonds! What gives? Perhaps another boxplot can shed some light on this situation:
In [10]:
diamonds.boxplot(column="carat", # Column to plot
by= "clarity", # Column to split upon
figsize= (8,8)) # Figure size
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0xba4c7f0>

The plot above shows that diamonds with low clarity ratings also tend to be larger. Since size is an important
factor in determining a diamond's value, it isn't too surprising that low clarity diamonds have higher median
prices.
Density Plots
A density plot shows the distribution of a numeric variable with a continuous curve. It is similar to a histogram,
but without discrete bins a density plot can give a better picture of the underlying shape of a distribution. Create a
density plot with series.plot(kind="density"):
In [11]:
diamonds["carat"].plot(kind="density", # Create density plot
figsize=(8,8), # Set figure size
xlim= (0,5)) # Limit x axis values
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0xb7f6588>

Barplots
Barplots are graphs that visually display counts of categorical variables. We can create a barplot by creating a
table of counts for a certain variable using the pd.crosstab() function and then passing the counts to
df.plot(kind="bar"):
In [12]:
carat_table = pd.crosstab(index=diamonds["clarity"], columns="count")
carat_table
Out[12]:
col_0 count
clarity
I1 741
IF 1790
SI1 13065
SI2 9194
VS1 8171
VS2 12258
VVS1 3655
VVS2 5066
In [13]:
carat_table.plot(kind="bar",
figsize=(8,8))
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0xba242b0>

You can use a two dimensional table to create a stacked barplot. Stacked barplots show the distribution of a
second categorical variable within each bar:
In [14]:
carat_table = pd.crosstab(index=diamonds["clarity"],
columns=diamonds["color"])

carat_table
Out[14]:
color D E F G H I J
clarity
I1 42 102 143 150 162 92 50
IF 73 158 385 681 299 143 51
SI1 2083 2426 2131 1976 2275 1424 750
SI2 1370 1713 1609 1548 1563 912 479
VS1 705 1281 1364 2148 1169 962 542
VS2 1697 2470 2201 2347 1643 1169 731
VVS1 252 656 734 999 585 355 74
VVS2 553 991 975 1443 608 365 131
In [15]:
carat_table.plot(kind="bar",
figsize=(8,8),
stacked=True)
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0xc2981d0>
A grouped barplot is an alternative to a stacked barplot that gives each stacked section its own bar. To make a
grouped barplot, do not include the stacked argument (or set stacked=False):
In [16]:
carat_table.plot(kind="bar",
figsize=(8,8),
stacked=False)
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0xbce8208>

Scatterplots
Scatterplots are bivariate (two variable) plots that take two numeric variables and plot data points on the x/y
plane. We saw an example of scatterplots in lesson 16 when we created a scatter plot matrix of the mtcars data
set. To create a single scatterplot, use df.plot(kind="scatter"):
In [17]:
diamonds.plot(kind="scatter", # Create a scatterplot
x="carat", # Put carat on the x axis
y="price", # Put price on the y axis
figsize=(10,10),
ylim=(0,20000))
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0xbd35f98>

Although the scatterplot above has many overlapping points, it still gives us some insight into the relationship
between diamond carat weight and price: bigger diamonds are generally more expensive.
Line Plots
Line plots are charts used to show the change in a numeric variable based on some other ordered variable. Line
plots are often used to plot time series data to show the evolution of a variable over time. Line plots are the
default plot type when using df.plot() so you don't have to specify the kind argument when making a line plot in
pandas. Let's create some fake time series data and plot it with a line plot:
In [18]:
# Create some data
years = [y for y in range(1950,2016)]

readings = [(y+np.random.uniform(0,20)-1900) for y in years]

time_df = pd.DataFrame({"year":years,
"readings":readings})

# Plot the data


time_df.plot(x="year",
y="readings",
figsize=(9,9))
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0xbe3bf60>

Saving Plots
If you want to save plots for later use, you can export the plot figure (plot information) to a file. First get the plot
figure with plot.get_figure() and then save it to a file with figure.savefig("filename"). You can save plots to a
variety of common image file formats, such as png, jpeg and pdf.
In [19]:
my_plot = time_df.plot(x="year", # Create the plot and save to a variable
y="readings",
figsize=(9,9))

my_fig = my_plot.get_figure() # Get the figure

my_fig.savefig("line_plot_example.png") # Save to file


Wrap Up
Pandas plotting functions let you visualize and explore data quickly. Pandas plotting functions don't offer all the
features of dedicated plotting packages like matplotlib or ggplot, but they are often enough to get the job done.

Now that we have developed some tools to explore data, the remainder of this guide will focus on statistics and
predictive modeling in Python.
Python for Data Analysis Part 21: Descriptive Statistics

Descriptive statistics are measures that summarize important features of data, often with a single number.
Producing descriptive statistics is a common first step to take after cleaning and preparing a data set for analysis.
We've already seen several examples of descriptive statistics in earlier lessons, such as means and medians. In
this lesson, we'll review some of these functions and explore several new ones.
Measures of Center
Measures of center are statistics that give us a sense of the "middle" of a numeric variable. In other words,
centrality measures give you a sense of a typical value you'd expect to see. Common measures of center include
the mean, median and mode.
The mean is simply an average: the sum of the values divided by the total number of records. As we've seen in
previous lessons we can use df.mean() to get the mean of each column in a DataFrame:
In [1]:
%matplotlib inline
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ggplot import mtcars
In [3]:
mtcars.index = mtcars["name"]
mtcars.mean() # Get the mean of each column
Out[3]:
mpg 20.090625
cyl 6.187500
disp 230.721875
hp 146.687500
drat 3.596563
wt 3.217250
qsec 17.848750
vs 0.437500
am 0.406250
gear 3.687500
carb 2.812500
dtype: float64
We can also get the means of each row by supplying an axis argument:
In [4]:
mtcars.mean(axis=1) # Get the mean of each row
Out[4]:
name
Mazda RX4 29.907273
Mazda RX4 Wag 29.981364
Datsun 710 23.598182
Hornet 4 Drive 38.739545
Hornet Sportabout 53.664545
Valiant 35.049091
Duster 360 59.720000
Merc 240D 24.634545
Merc 230 27.233636
Merc 280 31.860000
Merc 280C 31.787273
Merc 450SE 46.430909
Merc 450SL 46.500000
Merc 450SLC 46.350000
Cadillac Fleetwood 66.232727
Lincoln Continental 66.058545
Chrysler Imperial 65.972273
Fiat 128 19.440909
Honda Civic 17.742273
Toyota Corolla 18.814091
Toyota Corona 24.888636
Dodge Challenger 47.240909
AMC Javelin 46.007727
Camaro Z28 58.752727
Pontiac Firebird 57.379545
Fiat X1-9 18.928636
Porsche 914-2 24.779091
Lotus Europa 24.880273
Ford Pantera L 60.971818
Ferrari Dino 34.508182
Maserati Bora 63.155455
Volvo 142E 26.262727
dtype: float64
The median of a distribution is the value where 50% of the data lies below it and 50% lies above it. In essence,
the median splits the data in half. The median is also known as the 50th percentile since 50% of the observations
are found below it. As we've seen previously, you can get the median using the df.median() function:
In [5]:
mtcars.median() # Get the median of each column
Out[5]:
mpg 19.200
cyl 6.000
disp 196.300
hp 123.000
drat 3.695
wt 3.325
qsec 17.710
vs 0.000
am 0.000
gear 4.000
carb 2.000
dtype: float64
Again, we could get the row medians across each row by supplying the argument axis=1.
Although the mean and median both give us some sense of the center of a distribution, they aren't always the
same. The median always gives us a value that splits the data into two halves while the mean is a numeric
average so extreme values can have a significant impact on the mean. In a symmetric distribution, the mean and
median will be the same. Let's investigate with a density plot:
In [6]:
norm_data = pd.DataFrame(np.random.normal(size=100000))

norm_data.plot(kind="density",
figsize=(10,10))

plt.vlines(norm_data.mean(), # Plot black line at mean
ymin=0,
ymax=0.4,
linewidth=5.0)

plt.vlines(norm_data.median(), # Plot red line at median
ymin=0,
ymax=0.4,
linewidth=2.0,
color="red")
Out[6]:
<matplotlib.collections.LineCollection at 0xbf49208>

In the plot above the mean and median are both so close to zero that the red median line lies on top of the
thicker black line drawn at the mean.
In skewed distributions, the mean tends to get pulled in the direction of the skew, while the median tends to
resist the effects of skew:
In [7]:
skewed_data = pd.DataFrame(np.random.exponential(size=100000))

skewed_data.plot(kind="density",
figsize=(10,10),
xlim=(-1,5))

plt.vlines(skewed_data.mean(), # Plot black line at mean
ymin=0,
ymax=0.8,
linewidth=5.0)

plt.vlines(skewed_data.median(), # Plot red line at median
ymin=0,
ymax=0.8,
linewidth=2.0,
color="red")
Out[7]:
<matplotlib.collections.LineCollection at 0xb33cdd8>

The mean is also influenced heavily by outliers, while the median resists the influence of outliers:
In [8]:
norm_data = np.random.normal(size=50)
outliers = np.random.normal(15, size=3)
combined_data = pd.DataFrame(np.concatenate((norm_data, outliers), axis=0))
combined_data.plot(kind="density",
figsize=(10,10),
xlim=(-5,20))

plt.vlines(combined_data.mean(), # Plot black line at mean
ymin=0,
ymax=0.2,
linewidth=5.0)

plt.vlines(combined_data.median(), # Plot red line at median
ymin=0,
ymax=0.2,
linewidth=2.0,
color="red")
Out[8]:
<matplotlib.collections.LineCollection at 0xc4bbc88>

Since the median tends to resist the effects of skewness and outliers, it is known as a "robust" statistic. The median
generally gives a better sense of the typical value in a distribution with significant skew or outliers.
The mode of a variable is simply the value that appears most frequently. Unlike mean and median, you can take
the mode of a categorical variable and it is possible to have multiple modes. Find the mode with df.mode():
In [9]:
mtcars.mode()
Out[9]:
name mpg cyl disp hp drat wt qsec vs am gear carb
0 NaN 10.4 8 275.8 110 3.07 3.44 17.02 0 0 3 2
1 NaN 15.2 NaN NaN 175 3.92 NaN 18.90 NaN NaN NaN 4
2 NaN 19.2 NaN NaN 180 NaN NaN NaN NaN NaN NaN NaN
3 NaN 21.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN 21.4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN 22.8 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 NaN 30.4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
The columns with multiple modes (multiple values with the same count) return multiple values as the mode.
Columns with no mode (no value that appears more than once) return NaN.
Measures of Spread
Measures of spread (dispersion) are statistics that describe how data varies. While measures of center give us an
idea of the typical value, measures of spread give us a sense of how much the data tends to diverge from the
typical value.
One of the simplest measures of spread is the range. Range is the distance between the maximum and minimum
observations:
In [10]:
max(mtcars["mpg"]) - min(mtcars["mpg"])
Out[10]:
23.5
As noted earlier, the median represents the 50th percentile of a data set. A summary of several percentiles can
be used to describe a variable's spread. We can extract the minimum value (0th percentile), first quartile (25th
percentile), median, third quartile(75th percentile) and maximum value (100th percentile) using the quantile()
function:
In [11]:
five_num = [mtcars["mpg"].quantile(0),
mtcars["mpg"].quantile(0.25),
mtcars["mpg"].quantile(0.50),
mtcars["mpg"].quantile(0.75),
mtcars["mpg"].quantile(1)]

five_num
Out[11]:
[10.4,
15.425000000000001,
19.199999999999999,
22.800000000000001,
33.899999999999999]
Since these values are so commonly used to describe data, they are known as the "five number summary". They
are the same percentile values returned by df.describe():
In [12]:
mtcars["mpg"].describe()
Out[12]:
count 32.000000
mean 20.090625
std 6.026948
min 10.400000
25% 15.425000
50% 19.200000
75% 22.800000
max 33.900000
Name: mpg, dtype: float64
The interquartile range (IQR) is another common measure of spread. IQR is the distance between the 3rd quartile
and the 1st quartile:
In [13]:
mtcars["mpg"].quantile(0.75) - mtcars["mpg"].quantile(0.25)
Out[13]:
7.375
The boxplots we learned to create in the lesson on plotting are just visual representations of the five number
summary and IQR:
In [14]:
mtcars.boxplot(column="mpg",
return_type='axes',
figsize=(8,8))

plt.text(x=0.74, y=22.25, s="3rd Quartile")


plt.text(x=0.8, y=18.75, s="Median")
plt.text(x=0.75, y=15.5, s="1st Quartile")
plt.text(x=0.9, y=10, s="Min")
plt.text(x=0.9, y=33.5, s="Max")
plt.text(x=0.7, y=19.5, s="IQR", rotation=90, size=25)
Out[14]:
<matplotlib.text.Text at 0xb7c9f98>

Variance and standard deviation are two other common measures of spread. The variance of a distribution is the
average of the squared deviations (differences) from the mean. Use df.var() to check variance:
In [15]:
mtcars["mpg"].var()
Out[15]:
36.324102822580642
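Note that pandas computes the sample variance, dividing by n - 1 rather than n. As a quick sketch, the value above can be reproduced by hand:
deviations = mtcars["mpg"] - mtcars["mpg"].mean() # Deviations from the mean

(deviations ** 2).sum() / (len(mtcars) - 1) # Sample variance (n - 1 denominator): ~36.32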
The standard deviation is the square root of the variance. Standard deviation can be more interpretable than
variance, since the standard deviation is expressed in terms of the same units as the variable in question while
variance is expressed in terms of units squared. Use df.std() to check the standard deviation:
In [16]:
mtcars["mpg"].std()
Out[16]:
6.0269480520891037
Since variance and standard deviation are both derived from the mean, they are susceptible to the influence of
data skew and outliers. Median absolute deviation is an alternative measure of spread based on the median,
which inherits the median's robustness against the influence of skew and outliers. It is the median of the
absolute value of the deviations from the median:
In [17]:
abs_median_devs = abs(mtcars["mpg"] - mtcars["mpg"].median())

abs_median_devs.median() * 1.4826
Out[17]:
5.411490000000001
*Note: The MAD is often multiplied by a scaling factor of 1.4826 so that it is comparable to the standard deviation of normally distributed data.
Skewness and Kurtosis
Beyond measures of center and spread, descriptive statistics include measures that give you a sense of the
shape of a distribution. Skewness measures the skew or asymmetry of a distribution while kurtosis measures the
"peakedness" of a distribution. We won't go into the exact calculations behind skewness and kurtosis, but they
are essentially just statistics that take the idea of variance a step further: while variance involves squaring
deviations from the mean, skewness involves cubing deviations from the mean and kurtosis involves raising
deviations from the mean to the 4th power.
Pandas has built in functions for checking skewness and kurtosis, df.skew() and df.kurt() respectively:
In [18]:
mtcars["mpg"].skew() # Check skewness
Out[18]:
0.6723771376290919
In [19]:
mtcars["mpg"].kurt() # Check kurtosis
Out[19]:
-0.022006291424083859
To explore these two measures further, let's create some dummy data and inspect it:
In [20]:
norm_data = np.random.normal(size=100000)
skewed_data = np.concatenate((np.random.normal(size=35000)+2,
np.random.exponential(size=65000)),
axis=0)
uniform_data = np.random.uniform(0,2, size=100000)
peaked_data = np.concatenate((np.random.exponential(size=50000),
np.random.exponential(size=50000)*(-1)),
axis=0)

data_df = pd.DataFrame({"norm":norm_data,
"skewed":skewed_data,
"uniform":uniform_data,
"peaked":peaked_data})
In [21]:
data_df.plot(kind="density",
figsize=(10,10),
xlim=(-5,5))
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0xc170be0>

Now let's check the skewness of each of the distributions. Since skewness measures asymmetry, we'd expect to
see low skewness for all of the distributions except the skewed one, because all the others are roughly
symmetric:
In [22]:
data_df.skew()
Out[22]:
norm 0.005802
peaked -0.007226
skewed 0.982716
uniform 0.001460
dtype: float64
Now let's check kurtosis. Since kurtosis measures peakedness, we'd expect the flat (uniform) distribution to have
low kurtosis while the distributions with sharper peaks should have higher kurtosis.
In [23]:
data_df.kurt()
Out[23]:
norm -0.014785
peaked 2.958413
skewed 1.086500
uniform -1.196268
dtype: float64
As we can see from the output, the normally distributed data has a kurtosis near zero, the flat distribution has
negative kurtosis and the two pointier distributions have positive kurtosis.
Wrap Up
Descriptive statistics help you explore features of your data, like center, spread and shape by summarizing them
with numerical measurements. Descriptive statistics help inform the direction of an analysis and let you
communicate your insights to others quickly and succinctly. In addition, certain values, like the mean and
variance, are used in all sorts of statistical tests and predictive models.
In this lesson, we generated a lot of random data to illustrate concepts, but we haven't actually learned much
about the functions we've been using to generate random data. In the next lesson, we'll learn about probability
distributions, including how to draw random data from them.
Python for Data Analysis Part 22: Probability Distributions

Many statistical tools and techniques used in data analysis are based on probability. Probability measures how
likely it is for an event to occur on a scale from 0 (the event never occurs) to 1 (the event always occurs). When
working with data, variables in the columns of the data set can be thought of as random variables: variables that
vary due to chance. A probability distribution describes how a random variable is distributed; it tells us which
values a random variable is most likely to take on and which values are less likely.
In statistics, there are a range of precisely defined probability distributions that have different shapes and can be
used to model different types of random events. In this lesson we'll discuss some common probability
distributions and how to work with them in Python.
The Uniform Distribution
The uniform distribution is a probability distribution where each value within a certain range is equally likely to
occur and values outside of the range never occur. If we make a density plot of a uniform distribution, it appears
flat because no value is any more likely (and hence has any more density) than another.
Many useful functions for working with probability distributions in Python are contained in the scipy.stats
library. Let's load in some libraries, generate some uniform data and plot a density curve:
In [1]:
%matplotlib inline
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
In [3]:
uniform_data = stats.uniform.rvs(size=100000, # Generate 100000 numbers
loc = 0, # From 0
scale=10) # To 10
In [4]:
pd.DataFrame(uniform_data).plot(kind="density", # Plot the distribution
figsize=(9,9),
xlim=(-1,11))
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x8c59080>

*Note: the plot above is an approximation of the underlying distribution, since it is based on a sample of
observations.
In the code above, we generated 100,000 data points from a uniform distribution spanning the range 0 to 10. In
the density plot, we see that the density of our uniform data is essentially level meaning any given value has the
same probability of occurring. The area under a probability density curve is always equal to 1.
Probability distributions in scipy come with several useful functions for generating random data and extracting
values of interest:
-stats.distribution.rvs() generates random numbers from the specified distribution. The arguments to rvs() will
vary depending on the type of distribution you're working with; in the case of the uniform distribution, we have
to specify the starting and ending points and the size (number of random points to generate.).
-stats.distribution.cdf() is used to determine the probability that an observation drawn from a distribution falls
below a specified value (known as the cumulative distribution function.). In essence, cdf() gives you the area
under the distribution's density curve to the left of a certain value on the x axis. For example, in the uniform
distribution above, there is a 25% chance that an observation will be in the range 0 to 2.5 and a 75% chance it
will fall in the range 2.5 to 10. We can confirm this with cdf():
In [5]:
stats.uniform.cdf(x=2.5, # Cutoff value (quantile) to check
loc=0, # Distribution start
scale=10) # Distribution end
Out[5]:
0.25
-stats.distribution.ppf() is the inverse of cdf(): it returns the x axis cutoff value (quantile) associated with a given
probability. For instance, if we want to know the cutoff value for which we have a 40% chance of drawing an
observation below that value, we can use ppf():
In [6]:
stats.uniform.ppf(q=0.4, # Probability cutoff
loc=0, # Distribution start
scale=10) # Distribution end
Out[6]:
4.0
-stats.distribution.pdf() gives you the probability density (height of the distribution) at a given x value. Since the
uniform distribution is flat, all x values within its range will have the same probability density and x values
outside the range have a probability density of 0:
In [7]:
for x in range(-1,12,3):
print("Density at x value " + str(x))
print( stats.uniform.pdf(x, loc=0, scale=10) )
Density at x value -1
0.0
Density at x value 2
0.1
Density at x value 5
0.1
Density at x value 8
0.1
Density at x value 11
0.0
Probability distribution functions in scipy also support median(), mean(), var() and std().
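For example, here is a minimal sketch for the uniform distribution used above:
stats.uniform.mean(loc=0, scale=10) # Expected value of Uniform(0, 10): 5.0

stats.uniform.std(loc=0, scale=10) # Standard deviation: 10/sqrt(12), about 2.89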
Generating Random Numbers and Setting The Seed
When you need to generate random real numbers in a range with equal probability you can draw numbers from
a uniform distribution using stats.distribution.rvs(). Python also comes with a library called "random" that lets
you perform various operations that involve randomization. Let's look at a few functions in the random library:
In [8]:
import random
In [9]:
random.randint(0,10) # Get a random integer in the specified range
Out[9]:
8
In [10]:
random.choice([2,4,6,9]) # Get a random element from a sequence
Out[10]:
2
In [11]:
random.random() # Get a real number between 0 and 1
Out[11]:
0.46190204420877423
In [12]:
random.uniform(0,10) # Get a real in the specified range
Out[12]:
0.3716846408759311
Notice that the random library also lets you generate random uniform numbers. Regardless of the method you
use to generate random numbers, however, the result of a random process can differ from one run to the next.
Having results vary each time you run a function is often not desirable. For example, if you want a colleague to
be able to reproduce your results exactly, you can run into problems when you use randomization. You can
ensure that your results are the same each time you use a function that involves randomness by setting the
random number generator's seed value to initialize it prior to running the function. Set the random seed with
random.seed():
In [13]:
random.seed(12) # Set the seed to an arbitrary value

print([random.uniform(0,10) for x in range(4)])

random.seed(12) # Set the seed to the same value

print([random.uniform(0,10) for x in range(4)])


[4.7457067868854805, 6.574725026572553, 6.664104711248381, 1.4260035292536777]
[4.7457067868854805, 6.574725026572553, 6.664104711248381, 1.4260035292536777]
Notice that we generated the exact same numbers with both calls to random.uniform() because we set the
same seed before each call. If we had not set the seed, we would have gotten different numbers. This
reproducibility illustrates the fact that these random numbers aren't truly random, but rather "pseudorandom".
Many functions in Python's libraries that use randomness have an optional random seed argument built in so
that you don't have to set the seed outside of the function. For instance, the rvs() function has an optional
argument random_state, that lets you set the seed.

* Note: The Python standard library "random" has a separate internal seed from the numpy library. When using
functions from numpy and libraries built on top of numpy (pandas, scipy, scikit-learn) use np.random.seed() to
set the seed.
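For instance, a minimal sketch of both approaches with the uniform distribution:
np.random.seed(12) # Seed numpy's global generator
print( stats.uniform.rvs(size=3, loc=0, scale=10) )

np.random.seed(12) # Re-seed with the same value to reproduce the draw
print( stats.uniform.rvs(size=3, loc=0, scale=10) )

print( stats.uniform.rvs(size=3, loc=0, scale=10, random_state=12) ) # Or pass the seed directly to rvs()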
The Normal Distribution
The normal or Gaussian distribution is a continuous probability distribution characterized by a symmetric bell-
shaped curve. A normal distribution is defined by its center (mean) and spread (standard deviation.). The bulk of
the observations generated from a normal distribution lie near the mean, which lies at the exact center of the
distribution: as a rule of thumb, about 68% of the data lies within 1 standard deviation of the mean, 95% lies
within 2 standard deviations and 99.7% lies within 3 standard deviations.
The normal distribution is perhaps the most important distribution in all of statistics. It turns out that many real
world phenomena, like IQ test scores and human heights, roughly follow a normal distribution, so it is often
used to model random variables. Many common statistical tests assume distributions are normal.
The scipy nickname for the normal distribution is norm. Let's investigate the normal distribution:
In [14]:
prob_under_minus1 = stats.norm.cdf(x= -1,
loc = 0,
scale= 1)

prob_over_1 = 1 - stats.norm.cdf(x= 1,
loc = 0,
scale= 1)

between_prob = 1-(prob_under_minus1+prob_over_1)

print(prob_under_minus1, prob_over_1, between_prob)


0.158655253931 0.158655253931 0.682689492137
The output shows that roughly 16% of the data generated by a normal distribution with mean 0 and standard
deviation 1 is below -1, 16% is above 1 and 68% lies between -1 and 1, which agrees with the 68, 95, 99.7 rule.
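As a quick check of the rule of thumb, the proportion of the distribution within 2 standard deviations of the mean is:
1 - 2 * stats.norm.cdf(x=-2) # Area within 2 standard deviations: about 0.954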
Let's plot the normal distribution and inspect areas we calculated:
In [15]:
# Plot normal distribution areas*

plt.rcParams["figure.figsize"] = (9,9)

plt.fill_between(x=np.arange(-4,-1,0.01),
y1= stats.norm.pdf(np.arange(-4,-1,0.01)) ,
facecolor='red',
alpha=0.35)

plt.fill_between(x=np.arange(1,4,0.01),
y1= stats.norm.pdf(np.arange(1,4,0.01)) ,
facecolor='red',
alpha=0.35)

plt.fill_between(x=np.arange(-1,1,0.01),
y1= stats.norm.pdf(np.arange(-1,1,0.01)) ,
facecolor='blue',
alpha=0.35)

plt.text(x=-1.8, y=0.03, s= round(prob_under_minus1,3))


plt.text(x=-0.2, y=0.1, s= round(between_prob,3))
plt.text(x=1.4, y=0.03, s= round(prob_over_1,3))
Out[15]:
<matplotlib.text.Text at 0x8f60e80>

*Note: This lesson uses some plotting code we did not cover in the plotting lesson in order to make plots for
explanatory purposes.
The plot above shows the bell shape of the normal distribution, the area below and above one standard
deviation and the area within 1 standard deviation of the mean.
Finding quantiles of the normal distribution is a common task when performing statistical tests. You can check
normal distribution quantiles with stats.norm.ppf():
In [16]:
print( stats.norm.ppf(q=0.025) ) # Find the quantile for the 2.5% cutoff

print( stats.norm.ppf(q=0.975) ) # Find the quantile for the 97.5% cutoff


-1.95996398454
1.95996398454
The quantile output above confirms that roughly 5% of the data lies more than 2 standard deviations from the
mean.
*Note: a mean of 0 and standard deviation of 1 are default values for the normal distribution.
The Binomial Distribution
The binomial distribution is a discrete probability distribution that models the outcomes of a given number of
random trials of some experiment or event. The binomial is defined by two parameters: the probability of
success in any given trial and the number of trials. The binomial distribution tells you how likely it is to achieve a
given number of successes in n trials of the experiment. For example, we could model flipping a fair coin 10
times with a binomial distribution where the number of trials is set to 10 and the probability of success is set to
0.5. In this case the distribution would tell us how likely it is to get zero heads, 1 head, 2 heads and so on.
The scipy name for the binomial is binom. Let's generate and investigate some binomial data:
In [17]:
fair_coin_flips = stats.binom.rvs(n=10, # Number of flips per trial
p=0.5, # Success probability
size=10000) # Number of trials

print( pd.crosstab(index="counts", columns= fair_coin_flips))

pd.DataFrame(fair_coin_flips).hist(range=(-0.5,10.5), bins=11)
col_0 0 1 2 3 4 5 6 7 8 9 10
row_0
counts 8 111 422 1181 1975 2453 2073 1224 450 94 9
Out[17]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0000000008D7EA58>]], dtype=object)

Note that since the binomial distribution is discrete, it only takes on integer values so we can summarize
binomial data with a frequency table and its distribution with a histogram. The histogram shows us that a
binomial distribution with a 50% probability of success is roughly symmetric, with the most likely outcomes lying
at the center. This is reminiscent of the normal distribution, but if we alter the success probability, the
distribution won't be symmetric:
In [18]:
biased_coin_flips = stats.binom.rvs(n=10, # Number of flips per trial
p=0.8, # Success probability
size=10000) # Number of trials

# Print table of counts


print( pd.crosstab(index="counts", columns= biased_coin_flips))

# Plot histogram
pd.DataFrame(biased_coin_flips).hist(range=(-0.5,10.5), bins=11)
col_0 2 3 4 5 6 7 8 9 10
row_0
counts 1 4 53 258 834 1997 3076 2689 1088
Out[18]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000000A448B70>]], dtype=object)

The cdf() function lets us check the probability of achieving a number of successes within a certain range:
In [19]:
stats.binom.cdf(k=5, # Probability of k = 5 successes or less
n=10, # With 10 flips
p=0.8) # And success probability 0.8
Out[19]:
0.032793497599999964
In [20]:
1 - stats.binom.cdf(k=8, # Probability of k = 9 successes or more
n=10, # With 10 flips
p=0.8) # And success probability 0.8
Out[20]:
0.37580963840000003
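As an aside (this snippet is an addition to the lesson), scipy distributions also provide a survival function,
sf(), which returns 1 - cdf() directly, so the calculation above can be written without the subtraction:

stats.binom.sf(k=8, # Probability of more than 8 successes (9 or more)
               n=10, # With 10 flips
               p=0.8) # And success probability 0.8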
For continuous probability density functions, you use pdf() to check the probability density at a given x value. For
discrete distributions like the binomial, use stats.distribution.pmf() (probability mass function) to check the mass
(proportion of observations) at a given number of successes k:
In [21]:
stats.binom.pmf(k=5, # Probability of k = 5 successes
n=10, # With 10 flips
p=0.5) # And success probability 0.5
Out[21]:
0.24609375000000025
In [22]:
stats.binom.pmf(k=8, # Probability of k = 8 successes
n=10, # With 10 flips
p=0.8) # And success probability 0.8
Out[22]:
0.30198988799999998
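Because the binomial only takes integer values from 0 to n, we can also evaluate pmf() across the whole range of
outcomes at once; the probabilities should sum to 1. This short sketch is an addition to the lesson:

all_outcomes = np.arange(0, 11)   # Possible head counts for 10 flips

probabilities = stats.binom.pmf(k=all_outcomes, n=10, p=0.5)

print(probabilities.round(3))   # Probability of each outcome
print(probabilities.sum())      # The probabilities sum to 1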
The Geometric and Exponential Distributions
The geometric and exponential distributions model the time it takes for an event to occur. The geometric
distribution is discrete and models the number of trials it takes to achieve a success in repeated experiments
with a given probability of success. The exponential distribution is a continuous analog of the geometric
distribution and models the amount of time you have to wait before an event occurs given a certain occurrence
rate.
The scipy nickname for the geometric distribution is "geom". Let's use the geom functions to model the number
of trials it takes to get a success (heads) when flipping a fair coin:
In [23]:
random.seed(12)
flips_till_heads = stats.geom.rvs(size=10000, # Generate geometric data
p=0.5) # With success prob 0.5

# Print table of counts


print( pd.crosstab(index="counts", columns= flips_till_heads))

# Plot histogram
pd.DataFrame(flips_till_heads).hist(range=(-0.5,max(flips_till_heads)+0.5)
, bins=max(flips_till_heads)+1)
col_0 1 2 3 4 5 6 7 8 9 10 11 14 16
row_0
counts 5002 2537 1243 614 300 133 80 53 27 5 4 1 1
Out[23]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000000A915780>]], dtype=object)

The distribution looks similar to what we'd expect: it is very likely to get a heads in 1 or 2 flips, while it is very
unlikely for it to take more than 5 flips to get a heads. In the 10,000 trials we generated, the longest it took to
get a heads was 16 flips.
Let's use cdf() to check the probability of needing 6 flips or more to get a success:
In [24]:
first_five = stats.geom.cdf(k=5, # Prob of success in first 5 flips
p=0.5)

1 - first_five
Out[24]:
0.03125
Use pmf() to check the probability of seeing a specific number of flips before a success:
In [25]:
stats.geom.pmf(k=2, # Prob of needing exactly 2 flips to get first success
p=0.5)
Out[25]:
0.25
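The result matches the textbook geometric formula p*(1-p)^(k-1): one failure followed by one success. A quick
hand check (an addition to the lesson):

p = 0.5   # Success probability
k = 2     # Flips needed for the first success

p * (1 - p)**(k - 1)   # 0.5 * 0.5 = 0.25, matching stats.geom.pmf()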
The scipy name for the exponential distribution is "expon". Let's investigate the exponential distribution:
In [26]:
# Get the probability of waiting more than 1 time unit before a success

prob_1 = stats.expon.cdf(x=1,
scale=1) # Scale = 1/arrival rate

1 - prob_1
Out[26]:
0.36787944117144233
*Note: The average arrival time for the exponential distribution is equal to 1/arrival_rate.
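Because the exponential survival probability has the closed form e^(-x/scale), we can verify the result above by
hand (this check is an addition to the lesson):

import math

math.exp(-1/1)   # e^(-x/scale) with x=1 and scale=1; roughly 0.3679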
Let's plot this exponential distribution to get an idea of its shape:
In [27]:
plt.fill_between(x=np.arange(0,1,0.01),
y1= stats.expon.pdf(np.arange(0,1,0.01)) ,
facecolor='blue',
alpha=0.35)

plt.fill_between(x=np.arange(1,7,0.01),
y1= stats.expon.pdf(np.arange(1,7,0.01)) ,
facecolor='red',
alpha=0.35)

plt.text(x=0.3, y=0.2, s= round(prob_1,3))


plt.text(x=1.5, y=0.08, s= round(1 - prob_1,3))
Out[27]:
<matplotlib.text.Text at 0xaa3e5c0>

Similar to the geometric distribution, the exponential starts high and has a long tail that trails off to the right
that contains rare cases where you have to wait much longer than average for an arrival.
The Poisson Distribution
The Poisson distribution models the probability of seeing a certain number of successes within a time interval,
where the time it takes for the next success is modeled by an exponential distribution. The Poisson distribution
can be used to model traffic, such as the number of arrivals a hospital can expect in an hour's time or the number
of emails you'd expect to receive in a week.
The scipy name for the Poisson distribution is "poisson". Let's generate and plot some data from a Poisson
distribution with an arrival rate of 1 per time unit:
In [28]:
random.seed(12)

arrival_rate_1 = stats.poisson.rvs(size=10000, # Generate Poisson data


mu=1 ) # Average arrivals per time unit = 1

# Print table of counts


print( pd.crosstab(index="counts", columns= arrival_rate_1))

# Plot histogram
pd.DataFrame(arrival_rate_1).hist(range=(-0.5,max(arrival_rate_1)+0.5)
, bins=max(arrival_rate_1)+1)
col_0 0 1 2 3 4 5 6
row_0
counts 3644 3771 1793 622 128 32 10
Out[28]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000000AA0F198>]], dtype=object)

The histogram shows that when arrivals are relatively infrequent, it is rare to see more than a couple of arrivals
in each time period. When the arrival rate is high, it becomes increasingly rare to see a low number of arrivals
and the distribution starts to look more symmetric:
In [29]:
random.seed(12)

arrival_rate_10 = stats.poisson.rvs(size=10000, # Generate Poisson data


mu=10 ) # Average arrivals per time unit = 10

# Print table of counts


print( pd.crosstab(index="counts", columns= arrival_rate_10))

# Plot histogram
pd.DataFrame(arrival_rate_10).hist(range=(-0.5,max(arrival_rate_10)+0.5)
, bins=max(arrival_rate_10)+1)
col_0 1 2 3 4 5 6 7 8 9 10 ... 15 16 17 \
row_0 ...
counts 8 22 69 171 375 615 930 1119 1233 1279 ... 364 223 130
col_0 18 19 20 21 22 23 24
row_0
counts 80 38 18 3 7 1 3

[1 rows x 24 columns]
Out[29]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000000A9D69B0>]], dtype=object)

As with other discrete probability distributions, we can use cdf() to check the probability of achieving more or
less than a certain number of successes and pmf() to check the probability of obtaining a specific number of
successes:
In [30]:
stats.poisson.cdf(k=5, # Check the probability of 5 arrivals or less
mu=10) # With arrival rate 10
Out[30]:
0.067085962879031888
In [31]:
stats.poisson.pmf(k=10, # Check the prob of exactly 10 arrivals
mu=10) # With arrival rate 10
Out[31]:
0.12511003572113372
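As with the binomial, the survival function sf() gives right-tail probabilities directly, so we can check the
chance of seeing more than 10 arrivals (a small addition to the lesson):

more_than_10 = stats.poisson.sf(k=10, # Probability of more than 10 arrivals
                                mu=10) # With arrival rate 10

print(more_than_10)
print(1 - stats.poisson.cdf(k=10, mu=10)) # Same result via 1 - cdf()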
Wrap Up
Python's scipy library contains functions that make it easy to work with a wide range of probability distributions,
including many that we did not discuss in this lesson. Probability distribution functions are useful for generating
random data, modeling random events and aiding with statistical tests and analysis.
In the next few lessons, we'll learn how to carry out common statistical tests with Python.
Python for Data Analysis Part 23: Point Estimates and Confidence Intervals

To this point, this guide has focused on the functions and syntax necessary to manipulate, explore and describe
data. Data cleaning and exploratory analysis are often preliminary steps toward the end goal of extracting
insight from data through statistical inference or predictive modeling. The remainder of this guide will focus on
methods for analyzing data and tools for carrying out analyses in Python.
Statistical inference is the process of analyzing sample data to gain insight into the population from which the
data was collected and to investigate differences between data samples. In data analysis, we are often
interested in the characteristics of some large population, but collecting data on the entire population may be
infeasible. For example, leading up to U.S. presidential elections it could be very useful to know the political
leanings of every single eligible voter, but surveying every voter is not feasible. Instead, we could poll some
subset of the population, such as a thousand registered voters, and use that data to make inferences about the
population as a whole.
Point Estimates
Point estimates are estimates of population parameters based on sample data. For instance, if we wanted to
know the average age of registered voters in the U.S., we could take a survey of registered voters and then use
the average age of the respondents as a point estimate of the average age of the population as a whole. The
average of a sample is known as the sample mean.
The sample mean is usually not exactly the same as the population mean. This difference can be caused by many
factors including poor survey design, biased sampling methods and the randomness inherent to drawing a
sample from a population. Let's investigate point estimates by generating a population of random age data and
then drawing a sample from it to estimate the mean:
In [1]:
%matplotlib inline
In [2]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import random
import math
In [3]:
np.random.seed(10)
population_ages1 = stats.poisson.rvs(loc=18, mu=35, size=150000)
population_ages2 = stats.poisson.rvs(loc=18, mu=10, size=100000)
population_ages = np.concatenate((population_ages1, population_ages2))

population_ages.mean()
Out[3]:
43.002372000000001
In [4]:
np.random.seed(6)
sample_ages = np.random.choice(a= population_ages,
size=500) # Sample 500 values

print ( sample_ages.mean() ) # Show sample mean

population_ages.mean() - sample_ages.mean() # Check difference between means


42.388
Out[4]:
0.61437200000000303
Our point estimate based on a sample of 500 individuals underestimates the true population mean by 0.6 years,
but it is close. This illustrates an important point: we can get a fairly accurate estimate of a large population by
sampling a relatively small subset of individuals.
Another point estimate that may be of interest is the proportion of the population that belongs to some
category or subgroup. For example, we might like to know the race of each voter we poll, to get a sense of the
overall demographics of the voter base. You can make a point estimate of this sort of proportion by taking a
sample and then checking the ratio in the sample:
In [5]:
random.seed(10)
population_races = (["white"]*100000) + (["black"]*50000) +\
(["hispanic"]*50000) + (["asian"]*25000) +\
(["other"]*25000)

demo_sample = random.sample(population_races, 1000) # Sample 1000 values

for race in set(demo_sample):
    print( race + " proportion estimate:" )
    print( demo_sample.count(race)/1000 )
hispanic proportion estimate:
0.192
white proportion estimate:
0.379
other proportion estimate:
0.099
black proportion estimate:
0.231
asian proportion estimate:
0.099
Notice that the proportion estimates are close to the true underlying population proportions.
Sampling Distributions and The Central Limit Theorem
Many statistical procedures assume that data follows a normal distribution, because the normal distribution has
nice properties like symmetry, with the majority of the data clustered within a few standard deviations
of the mean. Unfortunately, real world data is often not normally distributed and the distribution of a sample
tends to mirror the distribution of the population. This means a sample taken from a population with a skewed
distribution will also tend to be skewed. Let's investigate by plotting the data and sample we created earlier and
by checking the skew:
In [6]:
pd.DataFrame(population_ages).hist(bins=58,
range=(17.5,75.5),
figsize=(9,9))

print( stats.skew(population_ages) )
-0.12008483603917186

The distribution has low skewness, but the plot reveals the data is clearly not normal: instead of one symmetric
bell curve, it has a bimodal distribution with two high-density peaks. The sample we drew from this population
should have roughly the same shape and skew:
In [7]:
pd.DataFrame(sample_ages).hist(bins=58,
range=(17.5,75.5),
figsize=(9,9))

print( stats.skew(sample_ages) )
-0.056225282585406065

The sample has roughly the same shape as the underlying population. This suggests that we can't apply
techniques that assume a normal distribution to this data set, since it is not normal. In reality, we can, thanks
to the central limit theorem.
The central limit theorem is one of the most important results of probability theory and serves as the foundation
of many methods of statistical analysis. At a high level, the theorem states the distribution of many sample
means, known as a sampling distribution, will be normally distributed. This rule holds even if the underlying
distribution itself is not normally distributed. As a result, we can treat the sample mean as if it were drawn
from a normal distribution.
To illustrate, let's create a sampling distribution by taking 200 samples from our population and then making
200 point estimates of the mean:
In [8]:
np.random.seed(10)

point_estimates = [] # Make empty list to hold point estimates

for x in range(200): # Generate 200 samples
    sample = np.random.choice(a= population_ages, size=500)
    point_estimates.append( sample.mean() )

pd.DataFrame(point_estimates).plot(kind="density", # Plot sample mean density


figsize=(9,9),
xlim=(41,45))
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0xa664f98>
The sampling distribution appears to be roughly normal, despite the bimodal population distribution that the
samples were drawn from. In addition, the mean of the sampling distribution approaches the true population
mean:
In [9]:
population_ages.mean() - np.array(point_estimates).mean()
Out[9]:
-0.084407999999996264
The more samples we take, the better our estimate of the population parameter is likely to be.
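A closely related point (this sketch is an addition to the lesson) is that larger individual samples produce
point estimates that cluster more tightly around the true mean, so the spread of the sampling distribution
shrinks as the sample size grows:

np.random.seed(10)

small_sample_means = [np.random.choice(a= population_ages, size=100).mean()
                      for x in range(200)]

large_sample_means = [np.random.choice(a= population_ages, size=2000).mean()
                      for x in range(200)]

print(np.std(small_sample_means))   # Wider spread with samples of 100
print(np.std(large_sample_means))   # Narrower spread with samples of 2000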
Confidence Intervals
A point estimate can give you a rough idea of a population parameter like the mean, but estimates are prone to
error and taking multiple samples to get improved estimates may not be feasible. A confidence interval is a
range of values above and below a point estimate that captures the true population parameter at some
predetermined confidence level. For example, if you want to have a 95% chance of capturing the true
population parameter with a point estimate and a corresponding confidence interval, you'd set your confidence
level to 95%. Higher confidence levels result in wider confidence intervals.
Calculate a confidence interval by taking a point estimate and then adding and subtracting a margin of error to
create a range. Margin of error is based on your desired confidence level, the spread of the data and the size of
your sample. The way you calculate the margin of error depends on whether you know the standard deviation of
the population or not.
If you know the standard deviation of the population, the margin of error is equal to:

z * ( σ / √n )
Where σ (sigma) is the population standard deviation, n is sample size, and z is a number known as the z-critical
value. The z-critical value is the number of standard deviations you'd have to go from the mean of the normal
distribution to capture the proportion of the data associated with the desired confidence level. For instance, we
know that roughly 95% of the data in a normal distribution lies within 2 standard deviations of the mean, so we
could use 2 as the z-critical value for a 95% confidence interval (although it is more exact to get z-critical values
with stats.norm.ppf().).
Let's calculate a 95% confidence interval for our mean point estimate:
In [10]:
np.random.seed(10)

sample_size = 1000
sample = np.random.choice(a= population_ages, size = sample_size)
sample_mean = sample.mean()

z_critical = stats.norm.ppf(q = 0.975) # Get the z-critical value*

print("z-critical value:") # Check the z-critical value


print(z_critical)

pop_stdev = population_ages.std() # Get the population standard deviation

margin_of_error = z_critical * (pop_stdev/math.sqrt(sample_size))

confidence_interval = (sample_mean - margin_of_error,


sample_mean + margin_of_error)

print("Confidence interval:")
print(confidence_interval)
z-critical value:
1.95996398454
Confidence interval:
(41.703064068826833, 43.342935931173173)
*Note: We use stats.norm.ppf(q = 0.975) to get the desired z-critical value instead of q = 0.95 because the
distribution has two tails.
Notice that the confidence interval we calculated captures the true population mean of 43.0023.
Let's create several confidence intervals and plot them to get a better sense of what it means to "capture" the
true mean:
In [11]:
np.random.seed(12)

sample_size = 1000

intervals = []
sample_means = []

for i in range(25): # Draw 25 samples and build an interval for each
    sample = np.random.choice(a= population_ages, size = sample_size)
    sample_mean = sample.mean()
    sample_means.append(sample_mean)

    z_critical = stats.norm.ppf(q = 0.975) # Get the z-critical value*

    pop_stdev = population_ages.std() # Get the population standard deviation

    margin_of_error = z_critical * (pop_stdev/math.sqrt(sample_size))

    confidence_interval = (sample_mean - margin_of_error,
                           sample_mean + margin_of_error)

    intervals.append(confidence_interval)
In [12]:
plt.figure(figsize=(9,9))

plt.errorbar(x=np.arange(0.1, 25, 1),


y=sample_means,
yerr=[(top-bot)/2 for top,bot in intervals],
fmt='o')

plt.hlines(xmin=0, xmax=25,
y=43.0023,
linewidth=2.0,
color="red")
Out[12]:
<matplotlib.collections.LineCollection at 0xa7166a0>

Notice that in the plot above, all but one of the 95% confidence intervals overlap the red line marking the true
mean. This is to be expected: since a 95% confidence interval captures the true mean 95% of the time, we'd
expect our interval to miss the true mean 5% of the time.
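We can also count how many of the 25 intervals above actually capture the true population mean (this check is an
addition to the lesson):

true_mean = population_ages.mean()

captures = [low < true_mean < high for low, high in intervals]

sum(captures)   # Per the plot above, all but one interval captures the mean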
If you don't know the standard deviation of the population, you have to use the standard deviation of your
sample as a stand-in when creating confidence intervals. Since the sample standard deviation may not match the
population parameter, the interval will have more error when you don't know the population standard deviation.
To account for this error, we use what's known as a t-critical value instead of the z-critical value. The t-critical
value is drawn from what's known as a t-distribution--a distribution that closely resembles the normal
distribution but that gets wider and wider as the sample size falls. The t-distribution is available in scipy.stats
with the nickname "t" so we can get t-critical values with stats.t.ppf().
Let's take a new, smaller sample and then create a confidence interval without the population standard
deviation, using the t-distribution:
In [13]:
np.random.seed(10)

sample_size = 25
sample = np.random.choice(a= population_ages, size = sample_size)
sample_mean = sample.mean()

t_critical = stats.t.ppf(q = 0.975, df=24) # Get the t-critical value*

print("t-critical value:") # Check the t-critical value


print(t_critical)

sample_stdev = sample.std() # Get the sample standard deviation

sigma = sample_stdev/math.sqrt(sample_size) # Standard deviation estimate


margin_of_error = t_critical * sigma

confidence_interval = (sample_mean - margin_of_error,


sample_mean + margin_of_error)

print("Confidence interval:")
print(confidence_interval)
t-critical value:
2.06389856163
Confidence interval:
(37.757112737010608, 48.002887262989397)
*Note: when using the t-distribution, you have to supply the degrees of freedom (df). For this type of test, the
degrees of freedom is equal to the sample size minus 1. If you have a large sample size, the t-distribution
approaches the normal distribution.
Notice that the t-critical value is larger than the z-critical value we used for the 95% confidence interval. This allows
the confidence interval to cast a wider net to make up for the variability caused by using the sample standard
deviation in place of the population standard deviation. The end result is a much wider confidence interval (an
interval with a larger margin of error).
If you have a large sample, the t-critical value will approach the z-critical value so there is little difference
between using the normal distribution vs. the t-distribution:
In [14]:
# Check the difference between critical values with a sample size of 1000

stats.t.ppf(q=0.975, df= 999) - stats.norm.ppf(0.975)


Out[14]:
0.0023774765933946007
Instead of calculating a confidence interval for a mean point estimate by hand, you can calculate it using the
Python function stats.t.interval():
In [15]:
stats.t.interval(alpha = 0.95, # Confidence level
df= 24, # Degrees of freedom
loc = sample_mean, # Sample mean
scale = sigma) # Standard deviation estimate
Out[15]:
(37.757112737010608, 48.002887262989397)
We can also make a confidence interval for a point estimate of a population proportion. In this case, the margin
of error equals:

z * √( p(1−p) / n )
Where z is the z-critical value for our confidence level, p is the point estimate of the population proportion and n
is the sample size. Let's calculate a 95% confidence interval for Hispanics according to the sample proportion we
calculated earlier (0.192):
In [16]:
z_critical = stats.norm.ppf(0.975) # Record z-critical value

p = 0.192 # Point estimate of proportion

n = 1000 # Sample size

margin_of_error = z_critical * math.sqrt((p*(1-p))/n)

confidence_interval = (p - margin_of_error, # Calculate the interval


p + margin_of_error)

confidence_interval
Out[16]:
(0.16758794241348748, 0.21641205758651252)
The output shows that the confidence interval captured the true population parameter of 0.2. Similar to our
population mean point estimates, we can use the scipy stats.distribution.interval() function to calculate a
confidence interval for a population proportion for us. In this case we're working with z-critical values, so we
want to work with the normal distribution instead of the t distribution:
In [17]:
stats.norm.interval(alpha = 0.95, # Confidence level
loc = 0.192, # Point estimate of proportion
scale = math.sqrt((p*(1-p))/n)) # Scaling factor
Out[17]:
(0.16758794241348748, 0.21641205758651252)
Wrap Up
Estimating population parameters through sampling is a simple, yet powerful form of inference. Point estimates
combined with error margins let us create confidence intervals that capture the true population parameter with
high probability.
Next time we'll expand on the concepts in this lesson by learning about statistical hypothesis testing.
Python for Data Analysis Part 24: Hypothesis Testing and the T-Test

Point estimates and confidence intervals are basic inference tools that act as the foundation for another
inference technique: statistical hypothesis testing. Statistical hypothesis testing is a framework for determining
whether observed data deviates from what is expected. Python's scipy.stats library contains an array of
functions that make it easy to carry out hypothesis tests.
Hypothesis Testing Basics
Statistical hypothesis tests are based on a statement called the null hypothesis that assumes nothing interesting is
going on between whatever variables you are testing. The exact form of the null hypothesis varies from one type
of test to another: if you are testing whether groups differ, the null hypothesis states that the groups are the same.
For instance, if you wanted to test whether the average age of voters in your home state differs from the
national average, the null hypothesis would be that there is no difference between the average ages.
The purpose of a hypothesis test is to determine whether the null hypothesis is likely to be true given sample
data. If there is little evidence against the null hypothesis given the data, you accept the null hypothesis. If the
null hypothesis is unlikely given the data, you might reject the null in favor of the alternative hypothesis: that
something interesting is going on. The exact form of the alternative hypothesis will depend on the specific test
you are carrying out. Continuing with the example above, the alternative hypothesis would be that the average
age of voters in your state does in fact differ from the national average.
Once you have the null and alternative hypothesis in hand, you choose a significance level (often denoted by the
Greek letter α.). The significance level is a probability threshold that determines when you reject the null
hypothesis. After carrying out a test, if the probability of getting a result as extreme as the one you observe due
to chance is lower than the significance level, you reject the null hypothesis in favor of the alternative. This
probability of seeing a result as extreme or more extreme than the one observed is known as the p-value.
The t-test is a statistical test used to determine whether a numeric data sample differs significantly from the
population or whether two samples differ from one another.
One-Sample T-Test
A one-sample t-test checks whether a sample mean differs from the population mean. Let's create some dummy
age data for the population of voters in the entire country and a sample of voters in Minnesota, and test
whether the average age of voters in Minnesota differs from the population:
In [1]:
%matplotlib inline
In [2]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import math
In [3]:
np.random.seed(6)

population_ages1 = stats.poisson.rvs(loc=18, mu=35, size=150000)


population_ages2 = stats.poisson.rvs(loc=18, mu=10, size=100000)
population_ages = np.concatenate((population_ages1, population_ages2))

minnesota_ages1 = stats.poisson.rvs(loc=18, mu=30, size=30)


minnesota_ages2 = stats.poisson.rvs(loc=18, mu=10, size=20)
minnesota_ages = np.concatenate((minnesota_ages1, minnesota_ages2))

print( population_ages.mean() )
print( minnesota_ages.mean() )
43.000112
39.26
Notice that we used a slightly different combination of distributions to generate the sample data for Minnesota,
so we know that the two means are different. Let's conduct a t-test at a 95% confidence level and see if it
correctly rejects the null hypothesis that the sample comes from the same distribution as the population. To
conduct a one-sample t-test, we can use the stats.ttest_1samp() function:
In [4]:
stats.ttest_1samp(a= minnesota_ages, # Sample data
popmean= population_ages.mean()) # Pop mean
Out[4]:
Ttest_1sampResult(statistic=-2.5742714883655027, pvalue=0.013118685425061678)
The test result shows the test statistic "t" is equal to -2.574. This test statistic tells us how much the sample
mean deviates from the null hypothesis. If the t-statistic lies outside the quantiles of the t-distribution
corresponding to our confidence level and degrees of freedom, we reject the null hypothesis. We can check the
quantiles with stats.t.ppf():
In [5]:
stats.t.ppf(q=0.025, # Quantile to check
df=49) # Degrees of freedom
Out[5]:
-2.0095752344892093
In [6]:
stats.t.ppf(q=0.975, # Quantile to check
df=49) # Degrees of freedom
Out[6]:
2.0095752344892088
We can calculate the chances of seeing a result as extreme as the one we observed (known as the p-value) by
passing the t-statistic in as the quantile to the stats.t.cdf() function:
In [7]:
stats.t.cdf(x= -2.5742, # T-test statistic
df= 49) * 2 # Multiply by two for a two-tailed test*
Out[7]:
0.013121066545690117
*Note: The alternative hypothesis we are checking is whether the sample mean differs from (is not equal to) the
population mean. Since the sample could differ in either the positive or negative direction, we multiply the
p-value by two.
Notice this value is the same as the p-value listed in the original t-test output. A p-value of 0.01311 means we'd
expect to see data as extreme as our sample due to chance about 1.3% of the time if the null hypothesis was
true. In this case, the p-value is lower than our significance level α (equal to 1-conf.level or 0.05) so we should
reject the null hypothesis. If we were to construct a 95% confidence interval for the sample, it would not capture
the population mean of 43:
In [8]:
sigma = minnesota_ages.std()/math.sqrt(50) # Sample stdev/sample size

stats.t.interval(0.95, # Confidence level


df = 49, # Degrees of freedom
loc = minnesota_ages.mean(), # Sample mean
scale= sigma) # Standard dev estimate
Out[8]:
(36.369669080722176, 42.15033091927782)
On the other hand, since there is a 1.3% chance of seeing a result this extreme due to chance, it is not significant
at the 99% confidence level. This means if we were to construct a 99% confidence interval, it would capture the
population mean:
In [9]:
stats.t.interval(alpha = 0.99, # Confidence level
df = 49, # Degrees of freedom
loc = minnesota_ages.mean(), # Sample mean
scale= sigma) # Standard dev estimate
Out[9]:
(35.405479940921069, 43.114520059078927)
With a higher confidence level, we construct a wider confidence interval and increase the chances that it
captures the true mean, thus making it less likely that we'll reject the null hypothesis. In this case, the p-value of
0.013 is greater than our significance level of 0.01 and we fail to reject the null hypothesis.
Two-Sample T-Test
A two-sample t-test investigates whether the means of two independent data samples differ from one another.
In a two-sample test, the null hypothesis is that the means of both groups are the same. Unlike the one-sample
test where we test against a known population parameter, the two-sample test only involves sample means. You
can conduct a two-sample t-test by passing both samples to the stats.ttest_ind() function. Let's generate a sample of voter
age data for Wisconsin and test it against the sample we made earlier:
In [10]:
np.random.seed(12)
wisconsin_ages1 = stats.poisson.rvs(loc=18, mu=33, size=30)
wisconsin_ages2 = stats.poisson.rvs(loc=18, mu=13, size=20)
wisconsin_ages = np.concatenate((wisconsin_ages1, wisconsin_ages2))

print( wisconsin_ages.mean() )
42.8
In [11]:
stats.ttest_ind(a= minnesota_ages,
b= wisconsin_ages,
equal_var=False) # Assume samples have equal variance?
Out[11]:
Ttest_indResult(statistic=-1.7083870793286842, pvalue=0.090731043439577483)
The test yields a p-value of 0.0907, which means there is a 9% chance we'd see sample data this far apart if the
two groups tested are actually identical. If we were using a 95% confidence level we would fail to reject the null
hypothesis, since the p-value is greater than the corresponding significance level of 5%.
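Setting equal_var=True would run the classic pooled-variance version of the test instead of Welch's t-test; for
samples with similar spreads the two versions usually give similar results. This comparison is an addition to the
lesson:

stats.ttest_ind(a= minnesota_ages,
                b= wisconsin_ages,
                equal_var=True) # Pool the variances of the two samples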
Paired T-Test
The basic two sample t-test is designed for testing differences between independent groups. In some cases, you
might be interested in testing differences between samples of the same group at different points in time. For
instance, a hospital might want to test whether a weight-loss drug works by checking the weights of the same
group of patients before and after treatment. A paired t-test lets you check whether the means of samples from
the same group differ.
We can conduct a paired t-test using the scipy function stats.ttest_rel(). Let's generate some dummy patient
weight data and do a paired t-test:
In [12]:
np.random.seed(11)

before= stats.norm.rvs(scale=30, loc=250, size=100)

after = before + stats.norm.rvs(scale=5, loc=-1.25, size=100)

weight_df = pd.DataFrame({"weight_before":before,
"weight_after":after,
"weight_change":after-before})

weight_df.describe() # Check a summary of the data


Out[12]:
weight_after weight_before weight_change
count 100.000000 100.000000 100.000000
mean 249.115171 250.345546 -1.230375
std 28.422183 28.132539 4.783696
min 165.913930 170.400443 -11.495286
25% 229.148236 230.421042 -4.046211
50% 251.134089 250.830805 -1.413463
75% 268.927258 270.637145 1.738673
max 316.720357 314.700233 9.759282
The summary shows that patients lost about 1.23 pounds on average after treatment. Let's conduct a paired t-
test to see whether this difference is significant at a 95% confidence level:
In [13]:
stats.ttest_rel(a = before,
b = after)
Out[13]:
Ttest_relResult(statistic=2.5720175998568284, pvalue=0.011596444318439857)
The p-value in the test output shows that the chances of seeing this large a difference between samples due
to chance are just over 1%.
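A paired t-test is equivalent to a one-sample t-test on the within-pair differences, so we can reproduce the
result with stats.ttest_1samp() (this check is an addition to the lesson; the sign of the statistic flips because
we take after minus before):

stats.ttest_1samp(a = after - before, # Weight change for each patient
                  popmean = 0) # Null hypothesis: no average change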
Type I and Type II Error
The result of a statistical hypothesis test and the corresponding decision of whether to reject or accept the null
hypothesis is not infallible. A test provides evidence for or against the null hypothesis and then you decide
whether to accept or reject it based on that evidence, but the evidence may lack the strength to arrive at the
correct conclusion. Incorrect conclusions made from hypothesis tests fall in one of two categories: type I error
and type II error.
Type I error describes a situation where you reject the null hypothesis when it is actually true. This type of error
is also known as a "false positive" or "false hit." The type I error rate is equal to the significance level α, so
setting a higher confidence level (and therefore lower alpha) reduces the chances of getting a false positive.
Type II error describes a situation where you fail to reject the null hypothesis when it is actually false. Type II
error is also known as a "false negative" or "miss". The higher your confidence level, the more likely you are to
make a type II error.
Let's investigate these errors with a plot:
In [14]:
plt.figure(figsize=(12,10))

plt.fill_between(x=np.arange(-4,-2,0.01),
y1= stats.norm.pdf(np.arange(-4,-2,0.01)) ,
facecolor='red',
alpha=0.35)

plt.fill_between(x=np.arange(-2,2,0.01),
y1= stats.norm.pdf(np.arange(-2,2,0.01)) ,
facecolor='white',
alpha=0.35)

plt.fill_between(x=np.arange(2,4,0.01),
y1= stats.norm.pdf(np.arange(2,4,0.01)) ,
facecolor='red',
alpha=0.5)

plt.fill_between(x=np.arange(-4,-2,0.01),
y1= stats.norm.pdf(np.arange(-4,-2,0.01),loc=3, scale=2) ,
facecolor='white',
alpha=0.35)

plt.fill_between(x=np.arange(-2,2,0.01),
y1= stats.norm.pdf(np.arange(-2,2,0.01),loc=3, scale=2) ,
facecolor='blue',
alpha=0.35)

plt.fill_between(x=np.arange(2,10,0.01),
y1= stats.norm.pdf(np.arange(2,10,0.01),loc=3, scale=2),
facecolor='white',
alpha=0.35)

plt.text(x=-0.8, y=0.15, s= "Null Hypothesis")


plt.text(x=2.5, y=0.13, s= "Alternative")
plt.text(x=2.1, y=0.01, s= "Type 1 Error")
plt.text(x=-3.2, y=0.01, s= "Type 1 Error")
plt.text(x=0, y=0.02, s= "Type 2 Error")
Out[14]:
<matplotlib.text.Text at 0x91ff3c8>

In the plot above, the red areas indicate type I errors assuming the alternative hypothesis is not different from
the null for a two-sided test with a 95% confidence level.
The blue area represents type II errors that occur when the alternative hypothesis is different from the null, as
shown by the distribution on the right. Note that the Type II error rate is the area under the alternative
distribution within the quantiles determined by the null distribution and the confidence level. We can calculate
the type II error rate for the distributions above as follows:
In [15]:
lower_quantile = stats.norm.ppf(0.025) # Lower cutoff value
upper_quantile = stats.norm.ppf(0.975) # Upper cutoff value

# Area under alternative, to the left of the lower cutoff value


low = stats.norm.cdf(lower_quantile,
loc=3,
scale=2)

# Area under alternative, to the left of the upper cutoff value


high = stats.norm.cdf(upper_quantile,
loc=3,
scale=2)

# Area under the alternative, between the cutoffs (Type II error)


high-low
Out[15]:
0.29495606111232298
With the normal distributions above, we'd fail to reject the null hypothesis about 30% of the time because the
distributions are close enough together that they have significant overlap.
Wrap Up
The t-test is a powerful tool for investigating the differences between sample and population means. T-tests
operate on numeric variables; in the next lesson, we'll discuss statistical tests for categorical variables.
Python for Data Analysis Part 25: Chi-Squared Tests

Last lesson we introduced the framework of statistical hypothesis testing and the t-test for investigating
differences between numeric variables. In this lesson, we turn our attention to a common statistical test for
categorical variables: the chi-squared test.
Chi-Squared Goodness-Of-Fit Test
In our study of t-tests, we introduced the one-sample t-test to check whether a sample mean differs from an
expected (population) mean. The chi-squared goodness-of-fit test is an analog of the one-sample t-test for
categorical variables: it tests whether the distribution of sample categorical data matches an expected
distribution. For example, you could use a chi-squared goodness-of-fit test to check whether the race
demographics of members at your church or school match that of the entire U.S. population or whether the
computer browser preferences of your friends match those of Internet users as a whole.
When working with categorical data, the values of the observations themselves aren't of much use for statistical
testing because categories like "male," "female," and "other" have no mathematical meaning. Tests dealing with
categorical variables are based on variable counts instead of the actual value of the variables themselves.
Let's generate some fake demographic data for the U.S. and Minnesota and walk through the chi-squared
goodness-of-fit test to check whether they are different:
In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
In [2]:
national = pd.DataFrame(["white"]*100000 + ["hispanic"]*60000 +\
["black"]*50000 + ["asian"]*15000 + ["other"]*35000)

minnesota = pd.DataFrame(["white"]*600 + ["hispanic"]*300 + \


["black"]*250 +["asian"]*75 + ["other"]*150)

national_table = pd.crosstab(index=national[0], columns="count")


minnesota_table = pd.crosstab(index=minnesota[0], columns="count")

print( "National")
print(national_table)
print(" ")
print( "Minnesota")
print(minnesota_table)
National
col_0 count
0
asian 15000
black 50000
hispanic 60000
other 35000
white 100000

Minnesota
col_0 count
0
asian 75
black 250
hispanic 300
other 150
white 600
Chi-squared tests are based on the so-called chi-squared statistic. You calculate the chi-squared statistic with the
following formula:

sum( (observed − expected)² / expected )
In the formula, observed is the actual observed count for each category and expected is the expected count
based on the distribution of the population for the corresponding category. Let's calculate the chi-squared
statistic for our data to illustrate:
In [3]:
observed = minnesota_table

national_ratios = national_table/len(national) # Get population ratios

expected = national_ratios * len(minnesota) # Get expected counts

chi_squared_stat = (((observed-expected)**2)/expected).sum()
print(chi_squared_stat)
col_0
count 18.194805
dtype: float64
*Note: The chi-squared test assumes none of the expected counts are less than 5.
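It is worth confirming that assumption holds for our table before trusting the result (a small check added to the
lesson):

expected.min() # Smallest expected count per column; all are well above 5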
Similar to the t-test where we compared the t-test statistic to a critical value based on the t-distribution to
determine whether the result is significant, in the chi-square test we compare the chi-square test statistic to a
critical value based on the chi-square distribution. The scipy library shorthand for the chi-square distribution is
chi2. Let's use this knowledge to find the critical value for 95% confidence level and check the p-value of our
result:
In [4]:
crit = stats.chi2.ppf(q = 0.95, # Find the critical value for 95% confidence*
df = 4) # Df = number of variable categories - 1

print("Critical value")
print(crit)

p_value = 1 - stats.chi2.cdf(x=chi_squared_stat, # Find the p-value


df=4)
print("P value")
print(p_value)
Critical value
9.48772903678
P value
[ 0.00113047]
*Note: we are only interested in the right tail of the chi-squared distribution.
Since our chi-squared statistic exceeds the critical value, we'd reject the null hypothesis that the two
distributions are the same.
You can carry out a chi-squared goodness-of-fit test automatically using the scipy function
scipy.stats.chisquare():
In [5]:
stats.chisquare(f_obs= observed, # Array of observed counts
f_exp= expected) # Array of expected counts
Out[5]:
Power_divergenceResult(statistic=array([ 18.19480519]), pvalue=array([ 0.00113047]))
The test results agree with the values we calculated above.
Chi-Squared Test of Independence
Independence is a key concept in probability that describes a situation where knowing the value of one variable
tells you nothing about the value of another. For instance, the month you were born probably doesn't tell you
anything about which web browser you use, so we'd expect birth month and browser preference to be independent.
On the other hand, your month of birth might be related to whether you excelled at sports in school, so month
of birth and sports performance might not be independent.
The chi-squared test of independence tests whether two categorical variables are independent. The test of
independence is commonly used to determine whether variables like education, political views and other
preferences vary based on demographic factors like gender, race and religion. Let's generate some fake voter
polling data and perform a test of independence:
In [6]:
np.random.seed(10)

# Sample data randomly at fixed probabilities


voter_race = np.random.choice(a= ["asian","black","hispanic","other","white"],
p = [0.05, 0.15 ,0.25, 0.05, 0.5],
size=1000)
# Sample data randomly at fixed probabilities
voter_party = np.random.choice(a= ["democrat","independent","republican"],
p = [0.4, 0.2, 0.4],
size=1000)

voters = pd.DataFrame({"race":voter_race,
"party":voter_party})

voter_tab = pd.crosstab(voters.race, voters.party, margins = True)

voter_tab.columns = ["democrat","independent","republican","row_totals"]

voter_tab.index = ["asian","black","hispanic","other","white","col_totals"]

observed = voter_tab.iloc[0:5, 0:3] # Get table without totals for later use


voter_tab
Out[6]:
democrat independent republican row_totals
asian 21 7 32 60
black 65 25 64 154
hispanic 107 50 94 251
other 15 8 15 38
white 189 96 212 497
col_totals 397 186 417 1000
Note that we did not use the race data to inform our generation of the party data so the variables are
independent.
For a test of independence, we use the same chi-squared formula that we used for the goodness-of-fit test. The
main difference is we have to calculate the expected counts of each cell in a 2-dimensional table instead of a 1-
dimensional table. To get the expected count for a cell, multiply the row total for that cell by the column total
for that cell and then divide by the total number of observations. We can quickly get the expected counts for all
cells in the table by taking the row totals and column totals of the table, performing an outer product on them
with the np.outer() function and dividing by the number of observations:
In [7]:
expected = np.outer(voter_tab["row_totals"][0:5],
voter_tab.ix["col_totals"][0:3]) / 1000

expected = pd.DataFrame(expected)

expected.columns = ["democrat","independent","republican"]
expected.index = ["asian","black","hispanic","other","white"]

expected
Out[7]:
democrat independent republican
asian 23.820 11.160 25.020
black 61.138 28.644 64.218
hispanic 99.647 46.686 104.667
other 15.086 7.068 15.846
white 197.309 92.442 207.249
Now we can follow the same steps we took before to calculate the chi-square statistic, the critical value and the
p-value:
In [8]:
chi_squared_stat = (((observed-expected)**2)/expected).sum().sum()

print(chi_squared_stat)
7.16932128016
*Note: We call .sum() twice: once to get the column sums and a second time to add the column sums together,
returning the sum of the entire 2D table.
In [9]:
crit = stats.chi2.ppf(q = 0.95, # Find the critical value for 95% confidence*
df = 8) # *

print("Critical value")
print(crit)

p_value = 1 - stats.chi2.cdf(x=chi_squared_stat, # Find the p-value


df=8)
print("P value")
print(p_value)
Critical value
15.5073130559
P value
0.518479392949
*Note: The degrees of freedom for a test of independence equals the product of the number of categories in
each variable minus 1. In this case we have a 5x3 table so df = 4x2 = 8.
As with the goodness-of-fit test, we can use scipy to conduct a test of independence quickly. Use
stats.chi2_contingency() function to conduct a test of independence automatically given a frequency table of
observed counts:
In [10]:
stats.chi2_contingency(observed= observed)
Out[10]:
(7.1693212801620589,
0.51847939294884204,
8,
array([[ 23.82 , 11.16 , 25.02 ],
[ 61.138, 28.644, 64.218],
[ 99.647, 46.686, 104.667],
[ 15.086, 7.068, 15.846],
[ 197.309, 92.442, 207.249]]))
The output shows the chi-square statistic, the p-value and the degrees of freedom followed by the expected
counts.
As expected, given the high p-value, the test result does not detect a significant relationship between the
variables.
Wrap Up
Chi-squared tests provide a way to investigate differences in the distributions of categorical variables with the
same categories and the dependence between categorical variables. In the next lesson, we'll learn about a third
statistical inference test, the analysis of variance, that lets us compare several sample means at the same time.
Python for Data Analysis Part 26: Analysis of Variance (ANOVA)

In lesson 24 we introduced the t-test for checking whether the means of two groups differ. The t-test works well
when dealing with two groups, but sometimes we want to compare more than two groups at the same time. For
example, if we wanted to test whether voter age differs based on some categorical variable like race, we have to
compare the means of each level or group of the variable. We could carry out a separate t-test for each pair of
groups, but when you conduct many tests you increase the chances of false positives. The analysis of
variance or ANOVA is a statistical inference test that lets you compare multiple groups at the same time.
One-Way ANOVA
The one-way ANOVA tests whether the mean of some numeric variable differs across the levels of one
categorical variable. It essentially answers the question: do any of the group means differ from one another? We
won't get into the details of carrying out an ANOVA by hand as it involves more calculations than the t-test, but
the process is similar: you go through several calculations to arrive at a test statistic and then you compare the
test statistic to a critical value based on a probability distribution. In the case of the ANOVA, you use the "f-
distribution".
The scipy library has a function for carrying out one-way ANOVA tests called scipy.stats.f_oneway(). Let's
generate some fake voter age and demographic data and use the ANOVA to compare average ages across the
groups:
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
In [3]:
np.random.seed(12)

races = ["asian","black","hispanic","other","white"]

# Generate random data


voter_race = np.random.choice(a= races,
p = [0.05, 0.15 ,0.25, 0.05, 0.5],
size=1000)

voter_age = stats.poisson.rvs(loc=18,
mu=30,
size=1000)

# Group age data by race


voter_frame = pd.DataFrame({"race":voter_race,"age":voter_age})
groups = voter_frame.groupby("race").groups

# Extract individual groups


asian = voter_age[groups["asian"]]
black = voter_age[groups["black"]]
hispanic = voter_age[groups["hispanic"]]
other = voter_age[groups["other"]]
white = voter_age[groups["white"]]

# Perform the ANOVA


stats.f_oneway(asian, black, hispanic, other, white)
Out[3]:
F_onewayResult(statistic=1.7744689357289216, pvalue=0.13173183202014213)
The test output yields an F-statistic of 1.774 and a p-value of 0.1317, indicating that there is no significant
difference between the means of each group.
Now let's make new age data where the group means do differ and run a second ANOVA:
In [4]:
np.random.seed(12)

# Generate random data


voter_race = np.random.choice(a= races,
p = [0.05, 0.15 ,0.25, 0.05, 0.5],
size=1000)

# Use a different distribution for white ages


white_ages = stats.poisson.rvs(loc=18,
mu=32,
size=1000)

voter_age = stats.poisson.rvs(loc=18,
mu=30,
size=1000)

voter_age = np.where(voter_race=="white", white_ages, voter_age)

# Group age data by race


voter_frame = pd.DataFrame({"race":voter_race,"age":voter_age})
groups = voter_frame.groupby("race").groups

# Extract individual groups


asian = voter_age[groups["asian"]]
black = voter_age[groups["black"]]
hispanic = voter_age[groups["hispanic"]]
other = voter_age[groups["other"]]
white = voter_age[groups["white"]]

# Perform the ANOVA


stats.f_oneway(asian, black, hispanic, other, white)
Out[4]:
F_onewayResult(statistic=10.164699828384288, pvalue=4.5613242114167168e-08)
The test result suggests the groups don't have the same sample means in this case, since the p-value is
significant at a 99% confidence level. We know that it is the white voters who differ because we set it up that
way in the code, but when testing real data, you may not know which group(s) caused the test to throw a
positive result. To check which groups differ after getting a positive ANOVA result, you can perform a follow up
test or "post-hoc test".
One post-hoc test is to perform a separate t-test for each pair of groups. You can perform a t-test between all
pairs by running each pair through the stats.ttest_ind() function we covered in the lesson on t-tests:
In [5]:
# Get all race pairs
race_pairs = []

for race1 in range(4):
    for race2 in range(race1+1,5):
        race_pairs.append((races[race1], races[race2]))

# Conduct t-test on each pair

for race1, race2 in race_pairs:
    print(race1, race2)
    print(stats.ttest_ind(voter_age[groups[race1]],
                          voter_age[groups[race2]]))
asian black
Ttest_indResult(statistic=0.83864469097479799, pvalue=0.4027281369339345)
asian hispanic
Ttest_indResult(statistic=-0.42594691924932293, pvalue=0.67046690042407264)
asian other
Ttest_indResult(statistic=0.97952847396359999, pvalue=0.32988775000951509)
asian white
Ttest_indResult(statistic=-2.3181088112522881, pvalue=0.020804701566400217)
black hispanic
Ttest_indResult(statistic=-1.9527839210712925, pvalue=0.051561971719525937)
black other
Ttest_indResult(statistic=0.28025754367057176, pvalue=0.77957701111176592)
black white
Ttest_indResult(statistic=-5.3793038812818352, pvalue=1.039421216662395e-07)
hispanic other
Ttest_indResult(statistic=1.5853626170340225, pvalue=0.11396630528484335)
hispanic white
Ttest_indResult(statistic=-3.5160312714115376, pvalue=0.00046412986490666839)
other white
Ttest_indResult(statistic=-3.7638093220778721, pvalue=0.00018490576317593065)
The p-values for each pairwise t-test suggest the mean age of white voters is likely different from that of the other groups, since
the p-value for each t-test involving the white group is below 0.05. Using unadjusted pairwise t-tests can
overestimate significance, however, because the more comparisons you make, the more likely you are to come
across an unlikely result due to chance. We can adjust for this multiple comparison problem by dividing the
statistical significance level by the number of comparisons made. In this case, if we were looking for a
significance level of 5%, we'd be looking for p-values of 0.05/10 = 0.005 or less. This simple adjustment for
multiple comparisons is known as the Bonferroni correction.
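As a rough sketch (an addition to the lesson), we can apply that Bonferroni-adjusted threshold to the pairwise
comparisons computed above:

bonferroni_alpha = 0.05 / len(race_pairs) # 0.05 / 10 = 0.005

for race1, race2 in race_pairs:
    p_value = stats.ttest_ind(voter_age[groups[race1]],
                              voter_age[groups[race2]]).pvalue
    print(race1, race2, "significant:", p_value < bonferroni_alpha)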
The Bonferroni correction is a conservative approach to account for the multiple comparisons problem that may
end up rejecting results that are actually significant. Another common post-hoc test is Tukey's test. You can carry
out Tukey's test using the pairwise_tukeyhsd() function in the statsmodels.stats.multicomp library:
In [6]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey = pairwise_tukeyhsd(endog=voter_age, # Data


groups=voter_race, # Groups
alpha=0.05) # Significance level

tukey.plot_simultaneous() # Plot group confidence intervals


plt.vlines(x=49.57,ymin=-0.5,ymax=4.5, color="red")

tukey.summary() # See test summary


Out[6]:
Multiple Comparison of Means - Tukey HSD,FWER=0.05
group1 group2 meandiff lower upper reject
asian black -0.8032 -3.4423 1.836 False
asian hispanic 0.4143 -2.1011 2.9297 False
asian other -1.0645 -4.2391 2.11 False
asian white 1.9547 -0.4575 4.3668 False
black hispanic 1.2175 -0.386 2.821 False
black other -0.2614 -2.7757 2.253 False
black white 2.7579 1.3217 4.194 True
hispanic other -1.4789 -3.863 0.9053 False
hispanic white 1.5404 0.3468 2.734 True
other white 3.0192 0.7443 5.2941 True

The output of the Tukey test shows the average difference, a confidence interval as well as whether you should
reject the null hypothesis for each pair of groups at the given significance level. In this case, the test suggests we
reject the null hypothesis for 3 pairs, with each pair including the "white" category. This suggests the white
group is likely different from the others. The 95% confidence interval plot reinforces the results visually: only 1
other group's confidence interval overlaps the white group's confidence interval.
Wrap Up
The ANOVA test lets us check whether a numeric response variable varies according to the levels of a categorical
variable. Python's scipy library makes it easy to perform an ANOVA without diving too deep into the details of
the procedure.
Next time, we'll move on from statistical inference to the final topic of this guide: predictive modeling.
Python for Data Analysis Part 27: Linear Regression

In the last few lessons we learned about statistical inference techniques including the t-test, chi-squared test
and ANOVA, which let you analyze differences between data samples. Predictive modeling--using data samples
to make predictions about unseen data, such as data that has yet to be generated--is another common data
analytics task. Predictive modeling is a form of machine learning, which describes using computers to automate
the process of finding patterns in data.
Machine learning is the driving force behind all kinds of modern conveniences and automation systems like
ATMs that can read handwritten text, smartphones that translate speech to text and self-driving cars. The
methods used in such cutting-edge applications are more advanced than anything we'll cover in this
introduction, but they are all based on the principles of taking data and applying some learning algorithm to it to
arrive at some sort of prediction.
This lesson is intended to provide a high level overview of linear regression and how to begin using it in Python.
Linear Regression Basics
Linear regression is a predictive modeling technique for predicting a numeric response variable based on one or
more explanatory variables. The term "regression" in predictive modeling generally refers to any modeling task
that involves predicting a real number (as opposed to classification, which involves predicting a category or class).
The term "linear" in the name linear regression refers to the fact that the method models data with linear
combination of the explanatory variables. A linear combination is an expression where one or more variables are
scaled by a constant factor and added together. In the case of linear regression with a single explanatory
variable, the linear combination used in linear regression can be expressed as:

response = intercept + constant * explanatory
The right side of the equation defines a line with a certain y-intercept and a slope applied to the explanatory variable.
In other words, linear regression in its most basic form fits a straight line to the response variable. The model is
designed to fit a line that minimizes the squared differences (also called errors or residuals). We won't go into
all the math behind how the model actually minimizes the squared errors, but the end result is a line intended
to give the "best fit" to the data. Since linear regression fits data with a line, it is most effective in cases where
the response and explanatory variable have a linear relationship.
Let's revisit the mtcars data set and use linear regression to predict vehicle gas mileage based on vehicle weight.
First, let's load some libraries and look at a scatterplot of weight and mpg to get a sense of the shape of the
data:
In [1]:
%matplotlib inline
In [2]:
import numpy as np
import pandas as pd
from ggplot import mtcars
import matplotlib
import matplotlib.pyplot as plt
import scipy.stats as stats
matplotlib.style.use('ggplot')
In [3]:
mtcars.plot(kind="scatter",
x="wt",
y="mpg",
figsize=(9,9),
color="black")
Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0xb257198>

The scatterplot shows a roughly linear relationship between weight and mpg, suggesting a linear regression
model might work well.
Python's scikit-learn library contains a wide range of functions for predictive modeling. Let's load its linear
regression training function and fit a line to the mtcars data:
In [4]:
from sklearn import linear_model
In [5]:
# Initialize model
regression_model = linear_model.LinearRegression()

# Train the model using the mtcars data


regression_model.fit(X = pd.DataFrame(mtcars["wt"]),
y = mtcars["mpg"])

# Check trained model y-intercept


print(regression_model.intercept_)

# Check trained model coefficients


print(regression_model.coef_)
37.2851261673
[-5.34447157]
The output above shows the model intercept and coefficients used to create the best fit line. In this case the y-
intercept term is set to 37.2851 and the coefficient for the weight variable is -5.3445. In other words, the model
fit the line mpg = 37.2851 - 5.3445*wt.
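To make the fitted line concrete (this example is an addition to the lesson), we can predict the mpg of a
hypothetical car weighing 3.5 thousand pounds both by hand and with the trained model:

by_hand = regression_model.intercept_ + regression_model.coef_[0] * 3.5

model_pred = regression_model.predict(X = pd.DataFrame({"wt": [3.5]}))[0]

print(by_hand)    # Roughly 18.58 mpg
print(model_pred) # Matches the by-hand calculation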
We can get a sense of how much of the variance in the response variable is explained by the model using the
model.score() function:
In [6]:
regression_model.score(X = pd.DataFrame(mtcars["wt"]),
y = mtcars["mpg"])
Out[6]:
0.75283279365826461
The output of the score function for linear regression is "R-squared", a value that ranges from 0 to 1 which
describes the proportion of variance in the response variable that is explained by the model. In this case, car
weight explains roughly 75% of the variance in mpg.
The R-squared measure is based on the residuals: differences between what the model predicts for each data
point and the actual value of each data point. We can extract the model's residuals by making predictions with
the model on the data and then subtracting each prediction from the actual value:
In [7]:
train_prediction = regression_model.predict(X = pd.DataFrame(mtcars["wt"]))

# Actual - prediction = residuals


residuals = mtcars["mpg"] - train_prediction

residuals.describe()
Out[7]:
count 3.200000e+01
mean -5.107026e-15
std 2.996352e+00
min -4.543151e+00
25% -2.364709e+00
50% -1.251956e-01
75% 1.409561e+00
max 6.872711e+00
Name: mpg, dtype: float64
R-squared is calculated as 1 - (SSResiduals/SSTotal), where SSResiduals is the sum of the squares of the model
residuals and SSTotal is the sum of the squares of the difference between each data point and the mean of the
data. We could calculate R-squared by hand like this:
In [8]:
SSResiduals = (residuals**2).sum()

SSTotal = ((mtcars["mpg"] - mtcars["mpg"].mean())**2).sum()

# R-squared
1 - (SSResiduals/SSTotal)
Out[8]:
0.75283279365826461
Now that we have a linear model, let's plot the line it fits on our scatterplot to get a sense of how well it fits the
data:
In [9]:
mtcars.plot(kind="scatter",
x="wt",
y="mpg",
figsize=(9,9),
color="black",
xlim = (0,7))

# Plot regression line


plt.plot(mtcars["wt"], # Explanitory variable
train_prediction, # Predicted values
color="blue")
Out[9]:
[<matplotlib.lines.Line2D at 0xb9d67b8>]
The regression line looks like a reasonable fit and it follows our intuition: as car weight increases we would
expect fuel economy to decline.
Outliers can have a large influence on linear regression models: since regression deals with minimizing squared
residuals, large residuals have a disproportionately large influence on the model. Plotting the result helps us
detect influential outliers. In this case, there do not appear to be any influential outliers. Let's add an outlier--a
super heavy, fuel-efficient car--and plot a new regression model:
In [10]:
mtcars_subset = mtcars[["mpg","wt"]]

super_car = pd.DataFrame({"mpg":50,"wt":10}, index=["super"])

new_cars = mtcars_subset.append(super_car)

# Initialize model
regression_model = linear_model.LinearRegression()

# Train the model using the new_cars data


regression_model.fit(X = pd.DataFrame(new_cars["wt"]),
y = new_cars["mpg"])

train_prediction2 = regression_model.predict(X = pd.DataFrame(new_cars["wt"]))

# Plot the new model


new_cars.plot(kind="scatter",
x="wt",
y="mpg",
figsize=(9,9),
color="black", xlim=(1,11), ylim=(10,52))

# Plot regression line


plt.plot(new_cars["wt"], # Explanatory variable
train_prediction2, # Predicted values
color="blue")
Out[10]:
[<matplotlib.lines.Line2D at 0xb9d6748>]

Although this is an extreme, contrived case, the plot above illustrates how much influence a single outlier can
have on a linear regression model.
In a well-behaved linear regression model, we'd like the residuals to be roughly normally distributed. That is,
we'd like a roughly even spread of error above and below the regression line. We can investigate the normality
of residuals with a Q-Q (quantile-quantile) plot. Make a Q-Q plot by passing the residuals to the stats.probplot()
function in the scipy.stats library:
In [11]:
plt.figure(figsize=(9,9))

stats.probplot(residuals, dist="norm", plot=plt)


Out[11]:
((array([-2.02511189, -1.62590278, -1.38593914, -1.20666642, -1.05953591,
-0.93235918, -0.81872017, -0.71478609, -0.6180591 , -0.52680137,
-0.43973827, -0.35589149, -0.27447843, -0.19484777, -0.11643566,
-0.03873405, 0.03873405, 0.11643566, 0.19484777, 0.27447843,
0.35589149, 0.43973827, 0.52680137, 0.6180591 , 0.71478609,
0.81872017, 0.93235918, 1.05953591, 1.20666642, 1.38593914,
1.62590278, 2.02511189]),
array([-4.54315128, -3.90536265, -3.72686632, -3.46235533, -3.20536265,
-2.97258623, -2.78093991, -2.61100374, -2.28261065, -2.08595212,
-1.88302362, -1.10014396, -1.0274952 , -0.9197704 , -0.69325453,
-0.20014396, -0.0502472 , 0.152043 , 0.29985604, 0.35642633,
0.86687313, 1.17334959, 1.20105932, 1.29734994, 1.74619542,
2.10328764, 2.34995929, 2.46436703, 4.16373815, 5.98107439,
6.42197917, 6.87271129])),
(3.032779748945897, -4.8544962703334722e-15, 0.97566744517913173))

When residuals are normally distributed, they tend to lie along the straight line on the Q-Q plot. In this case, the
residuals appear to follow a slightly non-linear pattern: the residuals are bowed a bit away from the normality
line on each end. This is an indication that a simple straight line might not be sufficient to fully describe the
relationship between weight and mpg.
After making model predictions, it is useful to have some sort of metric to evaluate how well the model
performed. R-squared is one useful measure, but it only applies to the regression model itself: we'd
like some universal evaluation metric that lets us compare the performance of different types of models. Root
mean squared error (RMSE) is a common evaluation metric for predictions involving real numbers. Root mean
squared error is the square root of the average of the squared errors (residuals). If you recall, we wrote a function
to calculate RMSE back in lesson 12:
In [12]:
def rmse(predicted, targets):
    """
    Computes root mean squared error of two numpy ndarrays

    Args:
        predicted: an ndarray of predictions
        targets: an ndarray of target values

    Returns:
        The root mean squared error as a float
    """
    return np.sqrt(np.mean((targets - predicted)**2))

rmse(train_prediction, mtcars["mpg"])
Out[12]:
2.9491626859550282
Instead of defining your own RMSE function, you can use the scikit-learn library's mean squared error function
and take the square root of the result:
In [13]:
from sklearn.metrics import mean_squared_error

RMSE = mean_squared_error(train_prediction, mtcars["mpg"])**0.5

RMSE
Out[13]:
2.9491626859550282
Polynomial Regression
Variables often exhibit non-linear relationships that can't be fit well with a straight line. In these cases, we can
use linear regression to fit a curved line to the data by adding extra higher order terms (squared, cubic, etc.) to the
model. A linear regression that involves higher order terms is known as "polynomial regression."
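As an aside, scikit-learn also offers a preprocessing helper, PolynomialFeatures, that can generate the higher order columns for you; a minimal sketch (assuming the imports above) might look like the following. Below, we build the squared term by hand instead so each step stays explicit.

from sklearn.preprocessing import PolynomialFeatures

# Generate wt and wt squared columns automatically (degree=2, no bias column)
poly_transform = PolynomialFeatures(degree=2, include_bias=False)
wt_poly = poly_transform.fit_transform(pd.DataFrame(mtcars["wt"]))
print(wt_poly[:3, :])    # First three rows: [wt, wt**2]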
In [14]:
# Initialize model
poly_model = linear_model.LinearRegression()

# Make a DataFrame of predictor variables


predictors = pd.DataFrame([mtcars["wt"], # Include weight
mtcars["wt"]**2]).T # Include weight squared

# Train the model using the new_cars data


poly_model.fit(X = predictors,
y = mtcars["mpg"])

# Check trained model y-intercept


print("Model intercept")
print(poly_model.intercept_)

# Check trained model coefficients (scaling factor given to "wt")


print("Model Coefficients")
print(poly_model.coef_)

# Check R-squared
poly_model.score(X = predictors,
y = mtcars["mpg"])
Model intercept
49.9308109495
Model Coefficients
[-13.38033708 1.17108689]
Out[14]:
0.81906135813840941
The output shows us that including the weight squared term appears to improve the model's performance
because the R-squared increased from 0.75 to 0.8190. It should be noted, however, that adding more variables
to a linear regression model can never cause R-squared to decrease, so we only want to add variables if there is
a substantial improvement in performance.
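One common adjustment that accounts for this is the adjusted R-squared, which penalizes each additional predictor. As a sketch, the standard formula can be computed by hand for the quadratic model above, where n is the number of observations and p the number of predictors:

n = len(mtcars)                  # 32 observations
p = predictors.shape[1]          # 2 predictors: wt and wt squared
r_squared = poly_model.score(X = predictors, y = mtcars["mpg"])

# Adjusted R-squared penalizes the extra predictor terms
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(adjusted_r_squared)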
Let's plot the curved line defined by the new model to see if the fit looks better than the old one. To start off,
let's create a range of x values, build the corresponding first and second order predictor columns and then use the
model to get the fitted y values corresponding to those x values.
In [15]:
# Plot the curve from 1.5 to 5.5
poly_line_range = np.arange(1.5, 5.5, 0.1)

# Get first and second order predictors from range


poly_predictors = pd.DataFrame([poly_line_range,
poly_line_range**2]).T

# Get corresponding y values from the model


y_values = poly_model.predict(X = poly_predictors)

mtcars.plot(kind="scatter",
x="wt",
y="mpg",
figsize=(9,9),
color="black",
xlim = (0,7))

# Plot curve line


plt.plot(poly_line_range, # X-axis range
y_values, # Predicted values
color="blue")
Out[15]:
[<matplotlib.lines.Line2D at 0xbb855c0>]

The quadratic function seems to fit the data a little better than the linear one. Let's investigate further by using
the new model to make predictions on the original data and check the root mean squared error:
In [16]:
preds = poly_model.predict(X=predictors)

rmse(preds , mtcars["mpg"])
Out[16]:
2.5233004724610795
Since the RMSE of the quadratic model is lower than the old one and the R-squared is higher, it is
probably a better model. We do, however, have to be careful about overfitting the training data.
Overfitting describes a situation where our model fits the data we use to create it (training data) too closely,
resulting in poor generalization to new data. This is why we generally don't want to use training data to evaluate
a model: it gives us a biased, usually overly optimistic evaluation. One of the strengths of first and second order
linear regression models is that they are so simple that they are unlikely to overfit the data very much. The more complex the
model we create and the more freedom it has to fit the training data, the greater risk we run of overfitting. For
example, we could keep including more polynomial terms in our regression model to fit the training data more
closely and achieve lower RMSE scores against the training set, but this would almost certainly not generalize
well to new data. Let's illustrate this point by fitting a 10th order model to the mtcars data:
In [17]:
# Initialize model
poly_model = linear_model.LinearRegression()

# Make a DataFrame of predictor variables


predictors = pd.DataFrame([mtcars["wt"],
mtcars["wt"]**2,
mtcars["wt"]**3,
mtcars["wt"]**4,
mtcars["wt"]**5,
mtcars["wt"]**6,
mtcars["wt"]**7,
mtcars["wt"]**8,
mtcars["wt"]**9,
mtcars["wt"]**10]).T

# Train the model using the new_cars data


poly_model.fit(X = predictors,
y = mtcars["mpg"])

# Check trained model y-intercept


print("Model intercept")
print(poly_model.intercept_)

# Check trained model coefficients (scaling factor given to "wt")


print("Model Coefficients")
print(poly_model.coef_)

# Check R-squared
poly_model.score(X = predictors,
y = mtcars["mpg"])
Model intercept
-14921.1212436
Model Coefficients
[ 6.45813570e+04 -1.20086131e+05 1.26931928e+05 -8.46598449e+04
3.73155196e+04 -1.10334755e+04 2.16590403e+03 -2.70730543e+02
1.94974161e+01 -6.15515433e-01]
Out[17]:
0.87021066028375949
Notice the R-squared score has increased substantially from our quadratic model. Let's plot the best fit line to
investigate what the model is doing:
In [18]:
p_range = np.arange(1.5, 5.45, 0.01)

poly_predictors = pd.DataFrame([p_range, p_range**2, p_range**3,


p_range**4, p_range**5, p_range**6, p_range**7,
p_range**8, p_range**9, p_range**10]).T

# Get corresponding y values from the model


y_values = poly_model.predict(X = poly_predictors)

mtcars.plot(kind="scatter",
x="wt",
y="mpg",
figsize=(9,9),
color="black",
xlim = (0,7))

# Plot curve line


plt.plot(p_range, # X-axis range
y_values, # Predicted values
color="blue")
Out[18]:
[<matplotlib.lines.Line2D at 0xbbf11d0>]

Notice how the 10th order polynomial model curves wildly in some places to fit the training data. While this
model happens to yield a closer fit to the training data, it will almost certainly fail to generalize well to new data,
as it leads to absurd predictions such as a car getting less than 0 mpg if it weighs 5,000 lbs.
Multiple Linear Regression
When faced with a predictive modeling task, you'll often have several variables in your data that may help
explain variation in the response variable. You can include more explanatory variables in a linear regression
model by including more columns in the data frame you pass to the model training function. Let's make a new
model that adds the horsepower variable to our original model:
In [19]:
# Initialize model
multi_reg_model = linear_model.LinearRegression()

# Train the model using the mtcars data


multi_reg_model.fit(X = mtcars.loc[:,["wt","hp"]],
y = mtcars["mpg"])

# Check trained model y-intercept


print(multi_reg_model.intercept_)

# Check trained model coefficients (scaling factor given to "wt")


print(multi_reg_model.coef_)

# Check R-squared
multi_reg_model.score(X = mtcars.loc[:,["wt","hp"]],
y = mtcars["mpg"])
37.2272701164
[-3.87783074 -0.03177295]
Out[19]:
0.8267854518827914
The improved R-squared score suggests that horsepower helps explain additional variation in mpg. Let's investigate its
relationship with mpg with a plot:
In [20]:
mtcars.plot(kind="scatter",
x="hp",
y="mpg",
figsize=(9,9),
color="black")
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0xbbd5198>

While mpg does tend to decline with horsepower, the relationship appears more curved than linear, so adding
polynomial terms to our multiple regression model could yield a better fit:
In [21]:
# Initialize model
multi_reg_model = linear_model.LinearRegression()

# Include squared terms


poly_predictors = pd.DataFrame([mtcars["wt"],
mtcars["hp"],
mtcars["wt"]**2,
mtcars["hp"]**2]).T

# Train the model using the mtcars data


multi_reg_model.fit(X = poly_predictors,
y = mtcars["mpg"])

# Check R-squared
print("R-Squared")
print( multi_reg_model.score(X = poly_predictors ,
y = mtcars["mpg"]) )

# Check RMSE
print("RMSE")
print(rmse(multi_reg_model.predict(poly_predictors),mtcars["mpg"]))
R-Squared
0.890727954967
RMSE
1.96091081342
The higher R-squared and lower RMSE suggest this is a better model than any we made previously, and we
wouldn't be too concerned about overfitting since it only includes two explanatory variables and their squared terms. Note that
when working with multidimensional models, it becomes difficult to visualize results, so you rely heavily on
numeric output.
We could continue adding more explanatory variables in an attempt to improve the model. Adding variables
that have little relationship with the response, however, or including variables that are too closely related to one
another can hurt your results when using linear regression. You should also be wary of numeric variables that take on
few unique values since they often act more like categorical variables than numeric ones.
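One quick, informal way to spot closely related predictors is to check their pairwise correlations; in the sketch below, disp is just an illustrative extra mtcars column, and values near 1 or -1 flag variables that largely duplicate one another:

# Pairwise correlations between candidate predictors
print(mtcars[["wt", "hp", "disp"]].corr())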
Wrap Up
Linear regression is one of the most common techniques for making real-numbered predictions from data. It is a
good place to start any time you need to make a numeric prediction. Next time we'll revisit the Titanic survival
data set and focus on classification: assigning observations to categories.
Python for Data Analysis Part 28: Logistic Regression

In the last lesson, we introduced linear regression as a predictive modeling method to estimate numeric
variables. Now we turn our attention to classification: prediction tasks where the response variable is categorical
instead of numeric. In this lesson we will learn how to use a common classification technique known as logistic
regression and apply it to the Titanic survival data we used in lesson 14.
Logistic Regression Basics
Logistic regression is a classification method built on the same concept as linear regression. With linear
regression, we take a linear combination of explanatory variables plus an intercept term to arrive at a prediction.
For example, last time, our simplest linear regression model was:

mileage = intercept + constant * CarWeight
Linear regression determines which constants minimize the error that this linear combination produces on the input
data.
In classification problems, the response variable is categorical. The simplest case of classification is where the
response variable is binary, meaning it can only take one of two values, such as true or false. Logistic regression
takes a linear combination of explanatory variables plus an intercept term just like linear regression, but then it
takes the result and passes it through the "logistic" function. The logistic or sigmoid function is defined as:

S(t) = 1 / (1 + e^(-t))
where t is the same linear combination of variables used in linear regression. The logistic function looks like an
elongated S when plotted:
In [1]:
%matplotlib inline
In [2]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import scipy.stats as stats
matplotlib.style.use('ggplot')
In [3]:
plt.figure(figsize=(9,9))

def sigmoid(t):  # Define the sigmoid function
    return 1 / (1 + np.e**(-t))

plot_range = np.arange(-6, 6, 0.1)

y_values = sigmoid(plot_range)

# Plot curve
plt.plot(plot_range, # X-axis range
y_values, # Predicted values
color="red")
Out[3]:
[<matplotlib.lines.Line2D at 0x8cae8d0>]

The sigmoid function is bounded below by 0 and bounded above by 1. In logistic regression, the output is
interpreted as a probability: the probability that an observation belongs to the second of the two categories
being modeled. When the linear combination of variables produces positive numbers, the resulting probability is
greater than 0.5 and when it produces negative numbers, the probability is less than 0.5.
We won't go deeper into the details behind how logistic regression works, but instead focus on how to use it in
Python. The most important thing to know is that logistic regression outputs probabilities that we can use to
classify observations.
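As a quick sanity check on that threshold behavior, we can evaluate the sigmoid helper defined above at a few values of the linear combination (the inputs are just illustrative):

# Negative inputs fall below 0.5, zero maps to exactly 0.5, positive inputs rise above 0.5
print(sigmoid(np.array([-2, 0, 2])))    # Roughly [0.12, 0.5, 0.88]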
Revisiting the Titanic
For the remainder of the lesson we'll be working with the Titanic survival training data from Kaggle that we saw
in lesson 14. We'll start by loading the data and then carrying out a few of the same preprocessing tasks we did
in lesson 14:
In [4]:
import os
In [5]:
os.chdir('C:\\Users\\Greg\\Desktop\\Kaggle\\titanic') # Set working directory

titanic_train = pd.read_csv("titanic_train.csv") # Read the data

char_cabin = titanic_train["Cabin"].astype(str) # Convert cabin to str

new_Cabin = np.array([cabin[0] for cabin in char_cabin]) # Take first letter

titanic_train["Cabin"] = pd.Categorical(new_Cabin) # Save the new cabin var

# Impute median Age for NA Age values


new_age_var = np.where(titanic_train["Age"].isnull(), # Logical check
28, # Value if check is true
titanic_train["Age"]) # Value if check is false

titanic_train["Age"] = new_age_var
Now we are ready to use a logistic regression model to predict survival. The scikit-learn library has a logistic
regression function in its linear_model module. Let's make a logistic regression model that only uses the Sex
variable as a predictor. Before creating a model with the Sex variable, we need to convert it to a real number
because sklearn's machine learning functions only deal with real numbers. We can convert a categorical
variable like Sex into a number using the sklearn preprocessing function LabelEncoder():
In [6]:
from sklearn import linear_model
from sklearn import preprocessing
In [7]:
# Initialize label encoder
label_encoder = preprocessing.LabelEncoder()

# Convert Sex variable to numeric


encoded_sex = label_encoder.fit_transform(titanic_train["Sex"])

# Initialize logistic regression model


log_model = linear_model.LogisticRegression()
# Train the model
log_model.fit(X = pd.DataFrame(encoded_sex),
y = titanic_train["Survived"])

# Check trained model intercept


print(log_model.intercept_)

# Check trained model coefficients


print(log_model.coef_)
[ 0.99180245]
[[-2.42172401]]
The logistic regression model coefficients look similar to the output we saw for linear regression. We can see the
model produced a positive intercept value and a weight of -2.421 on the encoded Sex variable. Let's use the model to
make predictions on the training data:
In [8]:
# Make predictions
preds = log_model.predict_proba(X= pd.DataFrame(encoded_sex))
preds = pd.DataFrame(preds)
preds.columns = ["Death_prob", "Survival_prob"]

# Generate table of predictions vs Sex


pd.crosstab(titanic_train["Sex"], preds.ix[:, "Survival_prob"])
Out[8]:
Survival_prob 0.193110906347 0.729443792051
Sex
female 0 312
male 577 0
*Note: Use model.predict_proba() to get the predicted class probabilities. Use model.predict() to get the
predicted classes.
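To see how the two relate for this binary model, a small sketch: the class labels from predict() should match the result of thresholding the predicted survival probability at 0.5.

class_preds = log_model.predict(X = pd.DataFrame(encoded_sex))    # 0/1 class labels
thresholded = (preds["Survival_prob"] > 0.5).astype(int)          # Probability above 0.5 -> 1
print((class_preds == thresholded).all())                         # Expect True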
The table shows that the model predicted a survival chance of roughly 19% for males and 73% for females. If we
used this simple model to predict survival, we'd end up predicting that all women survived and that all men
died. Let's make a more complicated model that includes a few more variables from the titanic training set:
In [9]:
# Convert more variables to numeric
encoded_class = label_encoder.fit_transform(titanic_train["Pclass"])
encoded_cabin = label_encoder.fit_transform(titanic_train["Cabin"])

train_features = pd.DataFrame([encoded_class,
encoded_cabin,
encoded_sex,
titanic_train["Age"]]).T

# Initialize logistic regression model


log_model = linear_model.LogisticRegression()

# Train the model


log_model.fit(X = train_features ,
y = titanic_train["Survived"])

# Check trained model intercept


print(log_model.intercept_)

# Check trained model coefficients


print(log_model.coef_)
[ 3.29559054]
[[-0.93337465 -0.06151651 -2.42710335 -0.02679381]]
Next, let's make class predictions using this model and then compare the predictions to the actual values:
In [10]:
# Make predictions
preds = log_model.predict(X= train_features)

# Generate table of predictions vs actual


pd.crosstab(preds,titanic_train["Survived"])
Out[10]:
Survived 0 1
row_0
0 467 103
1 82 237
The table above shows the classes our model predicted vs. true values of the Survived variable. This table of
predicted vs. actual values is known as a confusion matrix.
The Confusion Matrix
The confusion matrix is a common tool for assessing the results of classification. Each cell tells us something
different about our predictions versus the true values. The bottom right cell indicates the true positives:
people the model predicted to survive who actually did survive. The bottom left cell indicates the false positives:
people for whom the model predicted survival who did not actually survive. The top left cell indicates the true
negatives: people correctly identified as non-survivors. Finally, the top right cell shows the false negatives:
passengers the model identified as non-survivors who actually did survive.
We can calculate the overall prediction accuracy from the matrix by adding the total number of correct
predictions and dividing by the total number of predictions. In the case of our model, the prediction accuracy is:
In [11]:
(467+237)/889
Out[11]:
0.7919010123734533
You can also get the accuracy of a model using the scikit-learn model.score() function:
In [12]:
log_model.score(X = train_features ,
y = titanic_train["Survived"])
Out[12]:
0.79190101237345334
Overall prediction accuracy is just one of many quantities you can use to assess a classification model.
Oftentimes accuracy is not the best metric for assessing a model.
Consider a model made to predict the occurrence of a disease that only occurs in 0.01% of people. A model that
never predicts that anyone has the disease would be 99.99% accurate, but it also wouldn't help save lives. In this
case, we might be more interested in the model's sensitivity (recall): the proportion of positive cases that the
model correctly identifies as positive.
Relying only on sensitivity can also be a problem. Consider a new model that predicts everyone has the disease.
This new model would achieve a sensitivity score of 100% since it would correctly label everyone who has the
disease as having the disease. In this case the model's precision--the proportion of positive predictions that turn
out to be true positives--would be very low.
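For our Titanic model, both quantities can be read straight off the confusion matrix shown earlier; a quick hand calculation using those counts looks like this:

TP, FP, FN = 237, 82, 103        # Counts taken from the confusion matrix above

sensitivity = TP / (TP + FN)     # Proportion of actual survivors the model identified
precision = TP / (TP + FP)       # Proportion of predicted survivors who actually survived

print(sensitivity)               # Roughly 0.70
print(precision)                 # Roughly 0.74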
We won't discuss all the different evaluation metrics that fall out of the confusion matrix, but it is a good idea to
consider accuracy as well as sensitivity and precision when assessing model performance. We can view the
confusion matrix, as well as various classification metrics, using sklearn's metrics library:
In [13]:
from sklearn import metrics
In [14]:
# View confusion matrix
metrics.confusion_matrix(y_true=titanic_train["Survived"], # True labels
y_pred=preds) # Predicted labels
Out[14]:
array([[467, 82],
[103, 237]])
In [15]:
# View summary of common classification metrics
print(metrics.classification_report(y_true=titanic_train["Survived"],
y_pred=preds) )
             precision    recall  f1-score   support

          0       0.82      0.85      0.83       549
          1       0.74      0.70      0.72       340

avg / total       0.79      0.79      0.79       889

For the Titanic competition, accuracy is the scoring metric used to judge the competition, so we don't have to
worry too much about other metrics.
As a final exercise, let's use our logistic regression model to make predictions for the Titanic test set. First, we
need to load and prepare the test data using the same steps we used to prepare the training data:
In [16]:
# Read and prepare test data
titanic_test = pd.read_csv("titanic_test.csv") # Read the data

char_cabin = titanic_test["Cabin"].astype(str) # Convert cabin to str

new_Cabin = np.array([cabin[0] for cabin in char_cabin]) # Take first letter

titanic_test["Cabin"] = pd.Categorical(new_Cabin) # Save the new cabin var

# Impute median Age for NA Age values


new_age_var = np.where(titanic_test["Age"].isnull(), # Logical check
28, # Value if check is true
titanic_test["Age"]) # Value if check is false

titanic_test["Age"] = new_age_var
In [17]:
# Convert test variables to match model features
encoded_sex = label_encoder.fit_transform(titanic_test["Sex"])
encoded_class = label_encoder.fit_transform(titanic_test["Pclass"])
encoded_cabin = label_encoder.fit_transform(titanic_test["Cabin"])

test_features = pd.DataFrame([encoded_class,
encoded_cabin,
encoded_sex,
titanic_test["Age"]]).T
In [18]:
# Make test set predictions
test_preds = log_model.predict(X=test_features)

# Create a submission for Kaggle


submission = pd.DataFrame({"PassengerId":titanic_test["PassengerId"],
"Survived":test_preds})

# Save submission to CSV


submission.to_csv("tutorial_logreg_submission.csv",
index=False) # Do not save index values
It turns out that upon submission, this logistic regression model has an accuracy score of 0.75598, which is
actually worse than the accuracy of the simplistic women survive, men die model (0.76555). This goes to show
that adding extra variables to a model doesn't necessarily improve performance.
Wrap Up
Logistic regression is a common tool for generating class probabilities and predictions. Although logistic
regression models are simple and often insufficient to fully capture relationships between variables in many
predictive modeling tasks, they are a good starting point because simple models tend not to overfit the data.
Next time we will explore another predictive modeling technique for classification: decision trees.
Python for Data Analysis Part 29: Decision Trees

In the last lesson we introduced logistic regression as a predictive modeling technique for classification tasks.
While logistic regression can serve as a low variance baseline model, other models often yield better predictive
performance. Decision trees are another relatively simple classification model that have more expressive
power than logistic regression.
Decision Trees
If you've ever had to diagnose a problem with an appliance, car or computer, there's a good chance you've
encountered a troubleshooting flowchart. A flowchart is a tree-like structure of yes/no questions that guides
you through some process based on your specific situation. A decision tree is essentially a flow chart for deciding
how to classify an observation: it consists of a series of yes/no or if/else decisions that ultimately assign each
observation to a certain probability or class. The series of yes/no decisions can be depicted as a series of
branches that lead to decisions or "leaves" at the bottom of the tree.
When working with the Titanic survival prediction data last time, we suggested a simple model that classifies all
women as survivors and all men as non-survivors. This model is an example of a simple decision tree with only
one branch or split.
Let's create the gender-based model on the Titanic training set using decision trees in Python. First we'll load
some libraries and preprocess the Titanic data:
In [2]:
import numpy as np
import pandas as pd
import os
In [3]:
# Load and prepare Titanic data
os.chdir('C:\\Users\\Greg\\Desktop\\Kaggle\\titanic') # Set working directory

titanic_train = pd.read_csv("titanic_train.csv") # Read the data

# Impute median Age for NA Age values


new_age_var = np.where(titanic_train["Age"].isnull(), # Logical check
28, # Value if check is true
titanic_train["Age"]) # Value if check is false
titanic_train["Age"] = new_age_var
Next, we need to load and initialize scikit-learn's decision tree model and then train it using the Sex
variable:
In [4]:
from sklearn import tree
from sklearn import preprocessing
In [5]:
# Initialize label encoder
label_encoder = preprocessing.LabelEncoder()

# Convert Sex variable to numeric


encoded_sex = label_encoder.fit_transform(titanic_train["Sex"])

# Initialize model
tree_model = tree.DecisionTreeClassifier()

# Train the model


tree_model.fit(X = pd.DataFrame(encoded_sex),
y = titanic_train["Survived"])
Out[5]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
Note the list of default arguments included in the model above. You can read more about them in the scikit-learn documentation for DecisionTreeClassifier.
Now let's view a visualization of the tree the model created:
In [6]:
# Save tree as dot file
with open("tree1.dot", 'w') as f:
f = tree.export_graphviz(tree_model,
feature_names=["Sex"],
out_file=f)
In [7]:
from IPython.display import Image
Image("tree1.png") # Display image*
Out[7]:

*Note: I converted the saved dot file to a png using an external editor.
The tree's graph shows us that it consists of only one decision node that splits the data on the variable Sex. All
312 females end up in one leaf node and all 577 males end up in a different leaf node.
Let's make predictions with this tree and view a table of the results:
In [8]:
# Get survival probability
preds = tree_model.predict_proba(X = pd.DataFrame(encoded_sex))

pd.crosstab(preds[:,0], titanic_train["Sex"])
Out[8]:
Sex female male
row_0
0.259615 312 0
0.811092 0 577
The table shows that the decision tree managed to create the simple gender-based model where all females
survive and all males perish.
Let's create a new decision tree that adds the passenger class variable and see how it changes the resulting
predictions:
In [9]:
# Make data frame of predictors
predictors = pd.DataFrame([encoded_sex, titanic_train["Pclass"]]).T

# Train the model


tree_model.fit(X = predictors,
y = titanic_train["Survived"])
Out[9]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
Now let's look at the graph of the new decision tree model:
In [10]:
with open("tree2.dot", 'w') as f:
f = tree.export_graphviz(tree_model,
feature_names=["Sex", "Class"],
out_file=f)
In [11]:
Image("tree2.png")
Out[11]:

Notice that by adding one more variable, the tree is considerably more complex. It now has 6 decision nodes, 6
leaf nodes and a maximum depth of 3.
Let's make predictions and view a table of the results:
In [12]:
# Get survival probability
preds = tree_model.predict_proba(X = predictors)

# Create a table of predictions by sex and class


pd.crosstab(preds[:,0], columns = [titanic_train["Pclass"],
titanic_train["Sex"]])
Out[12]:
Pclass 1 2 3
Sex female male female male female male
row_0
0.032609 92 0 0 0 0 0
0.078947 0 0 76 0 0 0
0.500000 0 0 0 0 144 0
0.631148 0 122 0 0 0 0
0.842593 0 0 0 108 0 0
0.864553 0 0 0 0 0 347
Notice that the more complex model still predicts a higher survival rate for women than men, but women in
third class only have a 50% predicted death probability while women in first class are predicted to die less than
5% of the time.
The more variables you add to a decision tree, the more yes/no decisions it can make, resulting in a deeper,
more complex tree. Adding too much complexity to a decision tree, however, makes it prone to overfitting the
training data, which can lead to poor generalization to unseen data. Let's investigate by creating a larger tree
with a few more variables:
In [13]:
predictors = pd.DataFrame([encoded_sex,
titanic_train["Pclass"],
titanic_train["Age"],
titanic_train["Fare"]]).T

# Initialize model with maximum tree depth set to 8


tree_model = tree.DecisionTreeClassifier(max_depth = 8)

tree_model.fit(X = predictors,
y = titanic_train["Survived"])
Out[13]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=8,
max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
In [14]:
with open("tree3.dot", 'w') as f:
f = tree.export_graphviz(tree_model,
feature_names=["Sex", "Class","Age","Fare"],
out_file=f)
In [15]:
Image("tree3small.png")
Out[15]:

The image above illustrates how complex decision trees can become when you start adding more explanatory
variables. You can control the complexity of the tree by altering some of the decision tree function's default
parameters. For example, when we made the tree above, we set max_depth = 8, which limited the tree to a
depth of 8 (if we hadn't done this the tree would have been much larger!).
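As an aside (we will keep using the max_depth = 8 model below), several other parameters also constrain tree complexity; here is an illustrative, untuned sketch:

# A more constrained tree: limit depth, require larger leaves and cap the leaf count
limited_tree = tree.DecisionTreeClassifier(max_depth = 8,
                                           min_samples_leaf = 5,   # At least 5 rows per leaf
                                           max_leaf_nodes = 20)    # At most 20 leaves

limited_tree.fit(X = predictors,
                 y = titanic_train["Survived"])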
For interest's sake, let's check the accuracy of this decision tree model on the training data:
In [16]:
tree_model.score(X = predictors,
y = titanic_train["Survived"])
Out[16]:
0.88751406074240724
The model is almost 89% accurate on the training data, but how does it do on unseen data? Let's load the test
data, make some predictions and submit them to Kaggle to find out:
In [17]:
# Read and prepare test data
titanic_test = pd.read_csv("titanic_test.csv") # Read the data

# Impute median Age for NA Age values


new_age_var = np.where(titanic_test["Age"].isnull(), # Logical check
28, # Value if check is true
titanic_test["Age"]) # Value if check is false

titanic_test["Age"] = new_age_var
In [18]:
# Convert test variables to match model features
encoded_sex_test = label_encoder.fit_transform(titanic_test["Sex"])

test_features = pd.DataFrame([encoded_sex_test,
titanic_test["Pclass"],
titanic_test["Age"],
titanic_test["Fare"]]).T
In [19]:
# Make test set predictions
test_preds = tree_model.predict(X=test_features)

# Create a submission for Kaggle


submission = pd.DataFrame({"PassengerId":titanic_test["PassengerId"],
"Survived":test_preds})

# Save submission to CSV


submission.to_csv("tutorial_dectree_submission.csv",
index=False) # Do not save index values
Upon submission the model scores 0.78469 accuracy, which is slightly better than the simple gender-based
model, but far worse than the accuracy the model achieved on the training data itself. This underscores the fact
that predictive performance on the training data is a poor barometer of predictive performance on new data.
Holdout Validation and Cross Validation
When creating a predictive model, we'd like to get an accurate sense of its ability to generalize to unseen data
before actually going out and using it on unseen data. As we saw earlier, generating predictions on the training
data itself to check the model's accuracy does not work very well: a complex model may fit the training data
extremely closely but fail to generalize to new, unseen data. We can get a better sense of a model's expected
performance on unseen data by setting a portion of our training data aside when creating a model, and then
using that set aside data to evaluate the model's performance. This technique of setting aside some of the
training data to assess a model's ability to generalize is known as validation.
Holdout validation and cross validation are two common methods for assessing a model before using it on test
data. Holdout validation involves splitting the training data into two parts, a training set and a validation set,
building a model with the training set and then assessing performance with the validation set. In theory, model
performance on the hold-out validation set should roughly mirror the performance you'd expect to see on
unseen test data. In practice, holdout validation is fast and it can work well, especially on large data sets, but it
has some pitfalls.
Reserving a portion of the training data for a holdout set means you aren't using all the data at your disposal to
build your model in the validation phase. This can lead to suboptimal performance, especially in situations
where you don't have much data to work with. In addition, if you use the same holdout validation set to assess
too many different models, you may end up finding a model that fits the validation set well due to chance that
won't necessarily generalize well to unseen data. Despite these shortcomings, it is worth learning how to use a
holdout validation set in Python.
You can create a holdout validation set using the train_test_split() function in sklearn's cross_validation library:
In [20]:
from sklearn.cross_validation import train_test_split
In [21]:
v_train, v_test = train_test_split(titanic_train, # Data set to split
test_size = 0.25, # Split ratio
random_state=1, # Set random seed
stratify = titanic_train["Survived"]) #*

# Training set size for validation


print(v_train.shape)
# Test set size for validation
print(v_test.shape)
(666, 12)
(223, 12)
*Note: When performing classification, it is desirable for each class in the target variable to have roughly the
same proportion across each split of the data. The stratify argument lets you specify a target variable to spread
evenly across the train and test splits.
The output above shows that we successfully created a new training set with roughly 75% of the original data
and a validation test set with 25% of the data. We could proceed by building models with this new training set
and making predictions on the validation set to assess the models.
Cross validation is a popular alternative to holdout validation that involves splitting the training data into two or
more partitions and creating a model for each partition where the partition acts as the validation set and the
remaining data acts as the training data. A common form of cross validation is "k-fold" cross validation, which
randomly splits the data into some number k (a user-specified parameter) of partitions and then creates k models,
each tested against one of the partitions and trained on the remaining data. The evaluation scores from the k
folds are then averaged to estimate how the model will perform on unseen data.
The primary advantage of cross validation is that every observation gets used both for training and for assessment. The
main drawback is that building and testing several models can be computationally expensive, so it tends to take
much longer than holdout validation. You can create k cross validation splits of the data using the KFold()
function in sklearn's cross_validation library:
In [22]:
from sklearn.cross_validation import KFold

cv = KFold(n=len(titanic_train),    # Number of elements
           n_folds=10,              # Desired number of cv folds
           random_state=12)         # Set a random seed
After creating a cross validation object, you can loop over each fold and train and evaluate your model on each
one:
In [23]:
fold_accuracy = []

titanic_train["Sex"] = encoded_sex

for train_fold, valid_fold in cv:
    train = titanic_train.loc[train_fold]    # Extract train data with cv indices
    valid = titanic_train.loc[valid_fold]    # Extract valid data with cv indices

    model = tree_model.fit(X = train[["Sex","Pclass","Age","Fare"]],
                           y = train["Survived"])
    valid_acc = model.score(X = valid[["Sex","Pclass","Age","Fare"]],
                            y = valid["Survived"])
    fold_accuracy.append(valid_acc)

print("Accuracy per fold: ", fold_accuracy, "\n")


print("Average accuracy: ", sum(fold_accuracy)/len(fold_accuracy))
Accuracy per fold: [0.7191011235955056, 0.84269662921348309, 0.7528089887640449, 0.7752808988764045,
0.8651685393258427, 0.7865168539325843, 0.8089887640449438, 0.7415730337078652,
0.8651685393258427, 0.84090909090909094]

Average accuracy: 0.79982124617


Model accuracy can vary significantly from one fold to the next, especially with small data sets, but the average
accuracy across the folds gives you an idea of how the model might perform on unseen data.
As with holdout validation, we'd like the target variable's classes to have roughly the same proportion across
each fold when performing cross validation for a classification problem. To perform stratified cross validation,
use the StratifiedKFold() function instead of KFold().
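A minimal sketch using the same older sklearn.cross_validation API as above (newer versions of scikit-learn move this to model_selection with a slightly different interface):

from sklearn.cross_validation import StratifiedKFold

# Stratify the folds on the Survived labels so each fold keeps a similar class balance
stratified_cv = StratifiedKFold(titanic_train["Survived"],
                                n_folds=10,
                                random_state=12)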
You can also score a model with stratified cross validation in a single function call using the cross_val_score()
function:
In [24]:
from sklearn.cross_validation import cross_val_score
In [28]:
scores = cross_val_score(estimator= tree_model, # Model to test
X= titanic_train[["Sex","Pclass", # Train Data
"Age","Fare"]],
y = titanic_train["Survived"], # Target variable
scoring = "accuracy", # Scoring metric
cv=10) # Cross validation folds

print("Accuracy per fold: ")


print(scores)
print("Average accuracy: ", scores.mean())
Accuracy per fold:
[ 0.74157303 0.83146067 0.75280899 0.85393258 0.87640449 0.78651685
0.83146067 0.76404494 0.85393258 0.84090909]
Average accuracy: 0.813304392237
Notice that the average accuracy across the folds is higher than in the non-stratified K-fold example. The
cross_val_score function is useful for testing models and tuning model parameters (finding optimal values for
arguments like maximum tree depth that affect model performance).
Wrap Up
Decision trees are an easily interpretable yet surprisingly expressive form of predictive model. A decision tree of
limited depth can provide a good starting point for classification tasks, and model complexity is easily adjustable.
For our final lesson, we'll learn about random forests, an extension of decision trees that perform very well on a
wide range of classification tasks.
Python for Data Analysis Part 30: Random Forests

For the final lesson in this guide, we'll learn about random forest models. As we saw last time, decision trees
are a conceptually simple predictive modeling technique, but when you start building deep trees, they
become complicated and likely to overfit your training data. In addition, decision trees are constructed greedily:
branch splits are always made on the variable that appears most significant at that point, even if
those splits do not lead to optimal outcomes as the tree grows. Random forests are an extension of decision
trees that address these shortcomings.
Random Forest Basics
A random forest model is a collection of decision tree models that are combined together to make
predictions. When you make a random forest, you have to specify the number of decision trees you want to
use to make the model. The random forest algorithm then takes random samples of observations from your
training data and builds a decision tree model for each sample. The random samples are typically drawn
with replacement, meaning the same observation can be drawn multiple times. The end result is a bunch of
decision trees that are created with different groups of data records drawn from the original training data.
The decision trees in a random forest model are a little different than the standard decision trees we made
last time. Instead of growing trees where every single explanatory variable can potentially be used to make a
branch at any level in the tree, random forests limit the variables that can be used to make a split in the
decision tree to some random subset of the explanatory variables. Limiting the splits in this fashion helps
avoid the pitfall of always splitting on the same variables and helps random forests create a wider variety of
trees to reduce overfitting.
Random forests are an example of an ensemble model: a model composed of some combination of several
different underlying models. Ensemble models often yield better results than single models because
different models may detect different patterns in the data and combining models tends to dull the tendency
of complex single models to overfit the data.
Random Forests on the Titanic
Python's sklearn package offers a random forest model that works much like the decision tree model we
used last time. Let's use it to train a random forest model on the Titanic training set:
In [1]:
import numpy as np
import pandas as pd
import os
In [2]:
# Load and prepare Titanic data
os.chdir('C:\\Users\\Greg\\Desktop\\Kaggle\\titanic') # Set working directory

titanic_train = pd.read_csv("titanic_train.csv") # Read the data

# Impute median Age for NA Age values


new_age_var = np.where(titanic_train["Age"].isnull(), # Logical check
28, # Value if check is true
titanic_train["Age"]) # Value if check is false

titanic_train["Age"] = new_age_var
In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
In [4]:
# Set the seed
np.random.seed(12)

# Initialize label encoder


label_encoder = preprocessing.LabelEncoder()

# Convert some variables to numeric


titanic_train["Sex"] = label_encoder.fit_transform(titanic_train["Sex"])
titanic_train["Embarked"] = label_encoder.fit_transform(titanic_train["Embarked"])

# Initialize the model


rf_model = RandomForestClassifier(n_estimators=1000, # Number of trees
max_features=2, # Num features considered
oob_score=True) # Use OOB scoring*

features = ["Sex","Pclass","SibSp","Embarked","Age","Fare"]

# Train the model


rf_model.fit(X=titanic_train[features],
y=titanic_train["Survived"])

print("OOB accuracy: ")


print(rf_model.oob_score_)
OOB accuracy:
0.81664791901
Since random forest models involve building trees from random subsets or "bags" of data, model
performance can be estimated by making predictions on the out-of-bag (OOB) samples instead of using
cross validation. You can use cross validation on random forests, but OOB validation already provides a
good estimate of performance, and building several random forest models to conduct k-fold cross validation
can be computationally expensive.
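If you did want to double-check the OOB estimate, the same cross_val_score() call we used for decision trees works for random forests too; a sketch (note that fitting 1,000 trees per fold makes this slow):

from sklearn.cross_validation import cross_val_score

rf_scores = cross_val_score(estimator = rf_model,              # Model to test
                            X = titanic_train[features],       # Training features
                            y = titanic_train["Survived"],     # Target variable
                            scoring = "accuracy",              # Scoring metric
                            cv = 10)                           # Cross validation folds

print(rf_scores.mean())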
The random forest classifier assigns an importance value to each feature used in training. Features with
higher importance were more influential in creating the model, indicating a stronger association with the
response variable. Let's check the feature importance for our random forest model:
In [5]:
for feature, imp in zip(features, rf_model.feature_importances_):
    print(feature, imp)
Sex 0.266812848384
Pclass 0.0892556347506
SibSp 0.0523628494934
Embarked 0.0320938468195
Age 0.2743081392
Fare 0.285166681353
Feature importance can help identify useful features and eliminate features that don't contribute much to the
model.
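For a quicker read, you could wrap the importances in a pandas Series and sort them; a small convenience sketch:

# Rank features from most to least influential
importances = pd.Series(rf_model.feature_importances_, index=features)
print(importances.sort_values(ascending=False))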
As a final exercise, let's use the random forest model to make predictions on the titanic test set and submit
them to Kaggle to see how our actual generalization performance compares to the OOB estimate:
In [6]:
# Read and prepare test data
titanic_test = pd.read_csv("titanic_test.csv") # Read the data

# Impute median Age for NA Age values


new_age_var = np.where(titanic_test["Age"].isnull(),
28,
titanic_test["Age"])

titanic_test["Age"] = new_age_var

# Convert some variables to numeric


titanic_test["Sex"] = label_encoder.fit_transform(titanic_test["Sex"])
titanic_test["Embarked"] = label_encoder.fit_transform(titanic_test["Embarked"])
In [7]:
# Make test set predictions
test_preds = rf_model.predict(X= titanic_test[features])

# Create a submission for Kaggle


submission = pd.DataFrame({"PassengerId":titanic_test["PassengerId"],
"Survived":test_preds})

# Save submission to CSV


submission.to_csv("tutorial_randomForest_submission.csv",
index=False) # Do not save index values
Upon submission, the random forest model achieves an accuracy score of 0.75120, which is actually worse
than the decision tree model and even the simple gender-based model. What gives? Is the model overfitting
the training data? Did we choose bad variables and model parameters? Or perhaps our simplistic imputation
of filling in missing age data using median ages is hurting our accuracy. Data analyses and predictive
models often don't turn out how you expect, but even a "bad" result can give you more insight into your
problem and help you improve your analysis or model in a future iteration.
Python for Data Analysis Conclusion
In this introduction to Python for data analysis series, we built up slowly from the most basic rudiments of the
Python language to building predictive models that you can apply to real-world data. Although Python is a
beginner-friendly programming language, it was not built specifically for data analysis, so we relied heavily
upon libraries to extend base Python's functionality when doing data analysis. As a series focused on
practical tools and geared toward beginners, we didn't always take the time to dig deep into the details of the
language or the statistical and predictive models we covered. My hope is that some of the lessons in this
guide piqued your interest and equipped you with the tools you need to dig deeper on your own.
If you're interested in learning more about Python, there are many ways to proceed. If you learn well with
some structure, consider an online data science course that uses Python, like the Intro to Machine
Learning course on Udacity, the Machine Learning specialization on Coursera or one of the many other data
science offerings on those sites or edX. If you like hands-on learning, try tackling some Kaggle competitions or
finding a data set to analyze.
One of the hardest parts of learning a new skill is getting started. If any part of this guide helped you get
started, it has served its purpose.
*Final Note: If you are interested in learning R, I have a 30-part introduction to R guide that covers most of
the same topics as this Python guide and recreates many of the same examples in R.
