
Business Analytics

Data

Pristine www.edupristine.com
Agenda
Introduction

Data

Predictive modeling using Linear Regression

2.Data
I. Population vs. Sample

II. Types of Data Variables

III. Summarizing data

IV. Describe measure of central tendency/measure of location of data set

V. Describe spread/variability of data set

VI. Symmetry and Skewness for the distribution of a data set

VII. Data Collection

VIII. Data Dictionary

IX. Outlier Treatment

X. Missing Value Imputation

2.a. Population vs. Sample

Population
- Not to be confused with the literal meaning of "population", i.e. the number of people living in a defined geographical region.
- The "population" in statistics includes all members of a defined group that we are studying or collecting information on for data-driven decisions.
- Examples:
  - Current inflation rates of EU countries.
  - All the votes cast in an electoral poll.

Sample
- It is a part of the "population".
- Can be biased or unbiased (an unbiased sample is also known as a random sample).
- Examples:
  - Current inflation rates of EU countries having a per capita income of less than 20000 Euros per annum.
  - A portion of votes collected to predict the election outcome through an "Exit Poll".

[Figure: a Population containing three subsets labelled Sample1, Sample2 and Sample3]
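The difference between a random sample and a biased one can be sketched in a few lines of Python. The population of 1000 customer IDs, the sample size of 50 and the seed are assumptions made purely for illustration; they do not come from the slides.

```python
import random

# Hypothetical population: customer IDs 1..1000 (illustrative only)
population = list(range(1, 1001))

# Simple random sample: every member has the same chance of being drawn
rng = random.Random(42)  # fixed seed so the draw is repeatable
random_sample = rng.sample(population, k=50)  # drawn without replacement

# A biased sample for contrast: only the first 50 IDs can ever be selected
biased_sample = population[:50]

print("random sample range:", min(random_sample), "to", max(random_sample))
print("biased sample range:", min(biased_sample), "to", max(biased_sample))
```

Both samples are parts of the same population, but only the first one gives every member an equal chance of selection.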
2.b. Case: Types of Data variables

Romanov, an Analytics consultant, works with Credit One bank. His manager gave him a list
of the names of the bank's customers and asked him to pull the information pertaining to
these customers from the bank's database. The information relates to the credit cards
issued by the bank. He needs to define the variable types and the type of value each
variable will contain. Romanov, who has just started his professional career, does not have
a good idea about the different variable types.

Now, suppose that after extracting the data he approaches you and asks for your help in
categorizing the different variables. Help Romanov in variable categorization.
2.b. Case: Types of Data variables

Information to be extracted by Romanov.

Variable Name | Value Stored | Variable Type | Remarks
Name of Customer | ? | ? |
Customer ID | ? | ? |
Number of Credit Cards | ? | ? |
Age of Customer Last Birthday | ? | ? |
Gender of Customer | ? | ? |
Marital Status of Customer | ? | ? |
Annual Salary | ? | ? |
Monthly Credit Card Usage | ? | ? |
2.b. Case: Types of Data variables (Data snapshot)

Sl # | Name of Customer | Customer ID | Number of Credit Cards | Age of Customer Last Birthday | Gender of the Customer | Marital Status of the Customer | Annual Salary (in USD) | Monthly Credit Card Usage
1 | Josh | 111669 | 5 | 42 | F | Never Married | 88,001 | Low
2 | Janice | 146861 | 6 | 25 | F | Married | 592,489 | Low
3 | Dandre | 171690 | 3 | 50 | M | Divorced | 272,304 | Low
4 | Aiden | 161721 | 6 | 37 | M | Married | 726,593 | Low
5 | Celine | 170359 | 7 | 50 | F | Never Married | 612,075 | Low
6 | Emilio | 175646 | 5 | 41 | M | Never Married | 490,356 | Low
7 | Joaquin | 180732 | 2 | 62 | F | Divorced | 164,732 | Low
8 | Justus | 113136 | 7 | 26 | F | Never Married | 510,321 | Low
9 | Chaya | 169254 | 4 | 24 | M | Never Married | 358,534 | Low
10 | Justyn | 149771 | 4 | 35 | M | Married | 140,400 | Low
11 | Jadon | 166226 | 7 | 36 | M | Never Married | 105,259 | Low
2.b. Case: Types of Data variables

Variable Name | Value Stored | Variable Type | Remarks
Name of Customer | Name of the individual customer | ? |
Customer ID | Unique identifier | ? |
Number of Credit Cards | 1, 2, 3 | ? |
Age of Customer Last Birthday | 18, 19, 20 | ? |
Gender of Customer | Male / Female | ? |
Marital Status of Customer | Married / Divorced / Never Married | ? |
Annual Salary | Amount | ? |
Monthly Credit Card Usage | Low(<25%) / Medium(<50%) / High(<75%) / Very High(>75%) | ? |
2.b. Types of Data Variables
Data consists of a combination of "variables", which actually contain the values.
Variables at a high level are of two types depending on the kind of values they store:
Numerical
Categorical

Numerical variables

Discrete
- Arises from counting
- Can take only a set of particular values, including negative and fractional values
- Examples: Credit score, number of credit cards owned by a person, number of states in a country, charge on electron etc.

Continuous
- Arises from measuring
- Can take any value within a specified range
- Examples: Height, Amount of money, Age etc.

Categorical variables

Binary (or Dichotomous)
- Has only two categories
- Examples: yes/no, male/female, pass/fail etc.

Nominal
- Has several unordered categories
- Examples: Type of bank account, type of insurance policy etc.

Ordinal
- Has several ordered categories
- Examples: questionnaire responses such as "strongly in favour / ... / strongly against".
2.b. Types of Data Variables - Summary

Data (Consists of Variables)
- Numerical
  - Continuous: Arises from measuring
  - Discrete: Arises from counting
- Categorical
  - Dichotomous or Binary: Only two categories
  - Nominal: Several unordered categories
  - Ordinal: Several ordered categories
2.b. Case: Types of Data variables (Revisited)

Variable Name | Value Stored | Variable Type | Remarks
Name of Customer | Name of the individual customer | -- | Identifier
Customer ID | Unique identifier | -- | Identifier
Number of Credit Cards | 1, 2, 3 | Numerical (Discrete) | Arises from counting. Takes certain discrete values in a given range
Age of Customer Last Birthday | 18, 19, 20 | Numerical (Discrete) | Arises from counting. Takes certain discrete values in a given range
Gender of Customer | Male / Female | Categorical (Binary) | Only two categories
Marital Status of Customer | Married / Divorced / Never Married | Categorical (Nominal) | Several unordered categories
Annual Salary | Amount | Numerical (Continuous) | Takes many values in a given range
Monthly Credit Card Usage | Low(<25%) / Medium(<50%) / High(<75%) / Very High(>75%) | Categorical (Ordinal) | Several ordered categories
2.c. Case: Summarizing Data

Romanov, an Analytics consultant, works with Credit One bank. His manager gave him some
credit card data: the number of credit cards issued to a set of customers and the credit
limit of the cards. He has been tasked to summarize the data in a presentable form and
prepare a report. Romanov, who has just started his professional career, has never worked
with this kind of data, so he is unfamiliar with the different summarizing techniques.

Now, suppose he approaches you and asks for your help in preparing the report. Help Romanov
in summarizing the data and preparing the report.
2.c. Comments: Summarizing Data

There are various ways to summarize data. Some of them are


1. Frequency distribution
2. Grouped frequency distribution
3. Cumulative frequency distribution
4. Stem leaf diagram
5. Line plots

2.c. Summarizing Data - Frequency distribution
A technique to summarize discrete data
A simple process which involves counting of distinct discrete values
The representation can be either tabular or graphical
Example: Number of credit cards owned in a sample of 3000 individuals

Tabular representation

Number of Credit Cards    # Customers
 1                        150
 2                        300
 3                        450
 4                        660
 5                        540
 6                        300
 7                        240
 8                        150
 9                        120
10                         90

Graphical representation - Bar Chart
[Figure: bar chart "Freq Distribution- #Cards vs. # Customers"; horizontal axis: # Cards (1 to 10), vertical axis: # Customers (0 to 700)]
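The counting step behind a frequency distribution can be sketched in Python as follows. The twelve observations are hypothetical and only stand in for granular data; the slides work with 3000 individuals.

```python
from collections import Counter

# Hypothetical granular data: number of credit cards held by 12 customers
cards_owned = [3, 2, 4, 4, 1, 5, 4, 3, 2, 5, 4, 3]

# Count how often each distinct discrete value occurs
freq = Counter(cards_owned)

# Tabular representation plus a crude text bar chart
print("# Cards  # Customers")
for n_cards in sorted(freq):
    print(f"{n_cards:>7}  {freq[n_cards]:>11}  {'#' * freq[n_cards]}")
```

Each row pairs a distinct value with its count, which is the same structure as the table above.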
2.c. Summarizing Data - Frequency distribution (Using MS Excel)
[Screenshots: Excel steps 1 to 7, going from a column of "Number of Credit Cards" values to a "# Customers" bar chart with # Cards (1 to 10) on the horizontal axis and # Customers (0 to 700) on the vertical axis]
4. Press Ctrl+Shift+Enter
2.c. Summarizing Data - Grouped Frequency distribution
A technique to summarize continuous data, or discrete data having a large number of observations
and an extended range
A simple process which involves counting of values falling under the different intervals (groups)
Example and illustration 2.2: Number of customers falling under different Salary groups

Graphical representation - Bar Chart
[Figure: bar chart "Freq Distribution- Salary Band vs. # Customers"; horizontal axis: Salary Band, vertical axis: #Customers (up to 120)]
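A grouped frequency distribution can be sketched by mapping each value to its band and counting per band. The salaries below are the eleven values from the data snapshot earlier in this section; the band width of 100,000 is an assumption chosen to keep the output short (the Excel illustration uses narrower bands).

```python
from collections import Counter

# Annual salaries (in USD) of the 11 customers in the data snapshot
salaries = [88001, 592489, 272304, 726593, 612075, 490356,
            164732, 510321, 358534, 140400, 105259]

BAND_WIDTH = 100_000  # assumed band width, not taken from the slides

def salary_band(salary, width=BAND_WIDTH):
    """Return the (lower, upper) band containing a positive salary,
    e.g. 105259 -> (100001, 200000)."""
    lower = ((salary - 1) // width) * width
    return (lower + 1, lower + width)

# Count the number of customers falling into each band
grouped = Counter(salary_band(s) for s in salaries)

for (low, high), count in sorted(grouped.items()):
    print(f"{low}-{high}: {count}")
```

The band labels follow the "lower+1 to upper" pattern visible on the chart axis (for example 100001-125000).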
2.c. Summarizing Data - Grouped Frequency distribution (Using MS Excel)
[Screenshots: Excel steps 1 to 5, ending in "# Customers" bar charts (vertical axis 0 to 120) with salary bands from 0-75000 to 950001-975000 on the horizontal axis]
1. Press Ctrl+Shift+Enter
4. From Edit select the salary bands as horizontal axis
5. Observe the difference between the horizontal axes of the two charts
2.c. Summarizing Data - Cumulative Frequency distribution
Cumulative frequencies are obtained by accumulating the frequencies to give the total number of
observations up to and including the value or group in question.
Example and illustration 2.3: Cumulative number of cards in the sample of 3000 individuals

Tabular representation

Number of Credit Cards Up to    Cumulative # Customers
 1                               150
 2                               450
 3                               900
 4                              1560
 5                              2100
 6                              2400
 7                              2640
 8                              2790
 9                              2910
10                              3000

Graphical representation
[Figure: line chart "Cumulative # Customers"; horizontal axis: # Cards (0 to 10), vertical axis: Cumulative # Customers (0 to 3000)]
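The accumulation can be sketched in Python with a running sum over the frequency table of the 3000-individual sample; the variable names are illustrative.

```python
from itertools import accumulate

# Frequency table from the credit card example (3000 individuals)
cards = list(range(1, 11))
customers = [150, 300, 450, 660, 540, 300, 240, 150, 120, 90]

# Running total: number of customers holding up to and including k cards
cumulative = list(accumulate(customers))

for k, total in zip(cards, cumulative):
    print(f"up to {k:>2} cards: {total:>4}")
```

The last cumulative entry equals the total number of observations, which is a quick sanity check on the table.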
2.c. Summarizing Data - Cumulative Frequency distribution (Using MS Excel)
[Screenshots: Excel steps 1 to 5, ending in a "Cumulative # Customers" chart]
3. Observe the last entry. It is equal to the total number of observations.
2.c. Summarizing Data - Stem-leaf diagram
Stem-leaf diagram
Not suitable for large data. Hence, not extensively used in industry.
Illustration: Given the age of 20 individuals in years, represent them using a stem-leaf diagram

Sl #    Age    Age (Sorted)
 1      23     21
 2      33     23
 3      23     24
 4      33     27
 5      34     30
 6      21     31
 7      54     33
 8      52     34
 9      34     35
10      36     36
11      52     39
12      51     40
13      48     43
14      35     48
15      40     49
16      43     51
17      49     52
18      54     53
19      27     54
20      39     57

Stem    Leaf
20      1 3 4 7
30      0 1 3 4 5 6 9
40      0 3 8 9
50      1 2 3 4 7
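A stem-leaf diagram can be built by splitting each value into its tens (the stem) and its units digit (the leaf). The sketch below takes the "Age (Sorted)" column as its input; the stem width of 10 is the one used in the diagram above.

```python
from collections import defaultdict

# The "Age (Sorted)" column from the illustration
ages = [21, 23, 24, 27, 30, 31, 33, 34, 35, 36,
        39, 40, 43, 48, 49, 51, 52, 53, 54, 57]

# Stem = tens (20, 30, ...), leaf = units digit
stems = defaultdict(list)
for age in sorted(ages):
    stems[age // 10 * 10].append(age % 10)

print("Stem | Leaf")
for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem:>4} | {leaves}")
```

Because the input is sorted first, the leaves within each stem come out in increasing order.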
2.c. Summarizing Data - Line Plots
Line plot diagram
Not suitable for large data. Hence, not extensively used in industry.
Illustration: Given the test scores of 20 students, represent them using a line plot diagram

Sl #    Score    Score (Sorted)
 1      50       20
 2      20       20
 3      50       20
 4      50       30
 5      50       30
 6      30       30
 7      30       30
 8      40       30
 9      30       40
10      40       40
11      30       40
12      20       40
13      50       40
14      40       50
15      20       50
16      30       50
17      40       50
18      40       50
19      50       50
20      50       50
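A line plot places one mark above a number line for every occurrence of a value. A minimal text rendering of the 20 test scores, assuming an "X" per observation, could look like this:

```python
from collections import Counter

# Test scores of the 20 students, in the original (unsorted) order
scores = [50, 20, 50, 50, 50, 30, 30, 40, 30, 40,
          30, 20, 50, 40, 20, 30, 40, 40, 50, 50]

counts = Counter(scores)
values = sorted(counts)

# Print the marks from the top row down, then the number line
for level in range(max(counts.values()), 0, -1):
    print(" ".join(" X" if counts[v] >= level else "  " for v in values))
print(" ".join(f"{v:>2}" for v in values))
```

The height of each column of marks is the frequency of that score, so the plot doubles as a frequency distribution.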
2.c. Case: Measure of Central Tendency/Location

After Romanov presented the summarized data to his manager at Credit One, he was asked to
produce the various measures of Central Tendency of the Credit Card data.

Now, Romanov being unaware of the term "central tendency" again approached you and asked
your help in calculating the central tendency of the data in question. Help Romanov in carrying
out his task.

2.d. Measure of Central Tendency/Location
There are a number of different quantities, which can be used to estimate the central point of a
sample.

These are called measures of central tendency, or measures of location.

Just different ways of calculating the "average" value of dataset

These are:

Mean

Median

Mode

2.d. Measure of Central Tendency/Location - Mean
By far the most common measure for describing the location of a set of data is the mean.
For a set of observations denoted by x1, x2, ..., xn the mean is defined by
$$\langle x \rangle = \frac{x_1 + x_2 + \dots + x_n}{n}$$
(also denoted by x-bar, i.e. $\bar{x}$).
For a frequency distribution with values x1, x2, ..., xn and corresponding frequency values f1, f2,
..., fn it is defined as
$$\langle x \rangle = \frac{f_1 x_1 + f_2 x_2 + \dots + f_n x_n}{f_1 + f_2 + \dots + f_n}$$
Illustration 2.4: Calculating mean for sample of 3000 individuals having credit cards.
[Screenshots: 1. Using Excel function for granular data; 2. Using Excel function for frequency distribution table]
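As a cross-check on the frequency-table formula, the mean number of cards in the 3000-individual sample can be computed with a short Python sketch (variable names are illustrative):

```python
# Frequency table from the credit card example (3000 individuals)
cards = list(range(1, 11))
customers = [150, 300, 450, 660, 540, 300, 240, 150, 120, 90]

n = sum(customers)                                    # f1 + f2 + ... + fn
weighted_sum = sum(f * x for x, f in zip(cards, customers))  # f1*x1 + ... + fn*xn
mean_cards = weighted_sum / n

print(n, weighted_sum, mean_cards)  # 3000 14100 4.7
```

On average, an individual in the sample holds 4.7 credit cards.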
2.d. Measure of Central Tendency/Location - Median
Another useful measure of location.

The median is a value, which splits the data set into two equal halves.

So that half the observations are less than the median and half are greater than the median.

If n is odd, then the median is the middle observation, i.e. the ((n + 1) / 2)th observation.

If n is even, then the median is the midpoint of the middle two observations, i.e. the average
of the (n / 2)th and the (n / 2 + 1)th observations.

One of the potential advantages of the median for certain data sets is that it is robust or resistant
to the effects of extreme observations.

Illustration 2.5: Calculating median for sample of 3000 individuals having credit cards along with
demonstration of extreme observations.
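The median of the credit card sample, and its resistance to an extreme observation, can be sketched as follows. The extreme value of 1000 cards is a hypothetical data-entry error added purely for the demonstration.

```python
from statistics import mean, median

# Frequency table from the credit card example (3000 individuals)
cards = list(range(1, 11))
customers = [150, 300, 450, 660, 540, 300, 240, 150, 120, 90]

# Expand the table back into granular (one value per individual) data
granular = [x for x, f in zip(cards, customers) for _ in range(f)]

print(median(granular), mean(granular))

# Add one hypothetical extreme observation and compare
with_extreme = granular + [1000]
print(median(with_extreme), mean(with_extreme))
```

The median stays at 4 cards after the extreme value is added, while the mean moves from 4.7 to just over 5.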

2.d. Measure of Central Tendency/Location - Median

[Screenshots: 1. Using Excel function for granular data; 2. For summarized data in form of frequency table (median # cards)]
2.d. Measure of Central Tendency/Location - Mode
A third measure of location is the mode.

Defined as the value which occurs with the greatest frequency, i.e. the most typical value.
Illustration 2.6: Finding the mode for sample of 3000 individuals having credit cards.
Excel has an inbuilt function, Mode, for granular data
For summarized data it can be found easily by visual inspection

Tabular representation

Number of Credit Cards    # Customers
 1                        150
 2                        300
 3                        450
 4                        660
 5                        540
 6                        300
 7                        240
 8                        150
 9                        120
10                         90

Mode = 4, i.e. the highest number of individuals have 4 cards
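For a frequency table, the "visual inspection" amounts to picking the value with the largest frequency. A minimal Python sketch (it returns only one value, so ties would need separate handling):

```python
# Frequency table from the credit card example (3000 individuals)
cards = list(range(1, 11))
customers = [150, 300, 450, 660, 540, 300, 240, 150, 120, 90]

# Pick the (value, frequency) pair with the largest frequency
mode_cards, mode_freq = max(zip(cards, customers), key=lambda pair: pair[1])

print(mode_cards, mode_freq)  # 4 660
```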
2.d. Case: Measure of Spread

After Romanov presented the summarized data along with "measures of Central tendency" to
his manager at Credit One, he was further asked to add the various measures of spread to the
report.

Now, Romanov being unaware of the term "measures of spread" again approached you and
asked for your help. Help Romanov in carrying out his task.

2.d. Measure of Spread
The central tendency of a data set is usually the main feature of interest.

Another feature of interest is the spread (or variability or dispersion or scatter)

Meaning how widely spread the data are about the mean (or other measure of location).

The different measures of spread are:

Variance and Standard Deviation

The Range

The Inter quartile range

2.d. Measure of Spread - Variance and Standard Deviation
The most commonly used measure of spread is the standard deviation.
Essentially it is a measure of how far on average the observations are from the mean.
For a data set having values x1, x2, ..., xn (or xi where i = 1, 2, ..., n) and mean <x>, the variance is
calculated as
For granular data:
$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \langle x \rangle)^2}{n}$$
For summarized frequency table:
$$\sigma^2 = \frac{\sum_{i} f_i \, (x_i - \langle x \rangle)^2}{n}, \qquad n = \sum_{i} f_i$$
The standard deviation is the positive square root of the variance, denoted by $\sigma$.
For a sample, the variance is calculated as
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \langle x \rangle)^2}{n - 1}$$
Dividing by (n - 1) makes the sample variance an unbiased estimator of the population variance.
We will look into the details of it in a later part of the course.
Illustration 2.7: Calculating variance and standard deviation for sample of 3000 individuals having
credit cards
Exercise: Do the algebra to make sure that the above mentioned formulae of variance are
equivalent.
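The frequency-table form of the variance can be checked numerically on the credit card sample. The sketch below computes both the divide-by-n and the divide-by-(n - 1) versions; variable names are illustrative.

```python
import math

# Frequency table from the credit card example (3000 individuals)
cards = list(range(1, 11))
customers = [150, 300, 450, 660, 540, 300, 240, 150, 120, 90]

n = sum(customers)
mean_cards = sum(f * x for x, f in zip(cards, customers)) / n

# Sum of squared deviations from the mean, weighted by frequency
squared_dev = sum(f * (x - mean_cards) ** 2 for x, f in zip(cards, customers))

variance = squared_dev / n               # divide by n
sample_variance = squared_dev / (n - 1)  # divide by n - 1
std_dev = math.sqrt(variance)

print(round(variance, 4), round(sample_variance, 4), round(std_dev, 4))
```

With n = 3000 the two variances differ only in the third decimal place; the choice of divisor matters much more for small samples.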
2.d. Measure of Spread - Variance and Standard Deviation
(Using MS Excel)
[Screenshots: 1. Using Excel function for granular data; 2. For summarized data in form of frequency table]
2.d. Measure of Spread - Range
The range is a very simple measure of spread defined, as its name suggests, by the difference
between the largest and smallest observations in the data set.

Range = max(xi) - min(xi)

It is a poor measure of the spread of the data, as it relies on the extreme values, which
aren't necessarily representative of the data as a whole.

Illustration 2.8: Calculating Range for sample of 3000 individuals having credit cards
[Screenshots: Excel steps 1 to 3]
2.d. Measure of Spread - Inter quartile Range
Similar to the Range, but not affected by the data extremes.
Just as the median divides a set of data into two halves, the quartiles divide a set of data into four
quarters. They are denoted by Q1, Q2 and Q3.
Q2 is just the median, while Q1 is called the lower quartile and Q3 the upper quartile.
Q1 can be defined to be the ((n + 2) / 4)th observation counting from below and Q3 as the same counting
from above, with relevant interpolation if needed.
The Inter quartile range is defined as Q3 - Q1.
Illustration 2.9: Calculating Inter quartile Range for sample of 3000 individuals having credit cards
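The range and the inter quartile range of the credit card sample can be sketched in Python using the (n + 2) / 4 convention stated above. The helper `observation` is an illustrative name; it interpolates linearly between neighbouring observations when the position is fractional.

```python
import math

# Frequency table from the credit card example (3000 individuals)
cards = list(range(1, 11))
customers = [150, 300, 450, 660, 540, 300, 240, 150, 120, 90]
n = sum(customers)

def value_at(k):
    """Value of the k-th observation (1-based) in the sorted data."""
    running = 0
    for x, f in zip(cards, customers):
        running += f
        if k <= running:
            return x
    raise ValueError("position beyond the data")

def observation(position):
    """Observation at a possibly fractional position, with linear interpolation."""
    lo, hi = math.floor(position), math.ceil(position)
    return value_at(lo) + (position - lo) * (value_at(hi) - value_at(lo))

data_range = max(cards) - min(cards)

q1 = observation((n + 2) / 4)          # counting from below
q3 = observation(n + 1 - (n + 2) / 4)  # the same position counting from above
iqr = q3 - q1

print(data_range, q1, q3, iqr)
```

For this sample the range is 9 cards, while the middle half of the individuals sits between 3 and 6 cards, giving an inter quartile range of 3.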
2.d. Case: Symmetry and skewness of data

Romanov was appreciated after he presented the summarized data along with the "measures of
central tendency" and "measures of spread" to his manager at Credit One. He was then asked
to create an illustration around symmetry and skewness of data, and after that to carry out
the analysis of the credit card data.

Now, Romanov, being unaware of the term "symmetry and skewness", again approached you
and asked for your help. In return he promised to gift you a bottle of Champagne. Help
Romanov in carrying out his task.
2.d. Symmetry and skewness
It deals with the shape of the distribution of a data set, that is, whether it is symmetric or skewed
to one side or the other.

The approximate shape of a distribution can be determined by looking at a histogram.

Illustration 2.9: Calculating mean, median, mode and variance for symmetric and skewed data.

[Figure: three histograms titled "Symmetrical", "Positively Skewed" and "Negatively Skewed"]
2.d. Symmetry and skewness

Symmetrical: Mean = Median = Mode
Positively Skewed: Mean > Median > Mode
Negatively Skewed: Mean < Median < Mode
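The relationships above suggest a rough check of the shape of a distribution: compare the mean with the median. The sketch below applies that rule of thumb to the credit card sample; the function name and the tolerance are assumptions, and the rule is a heuristic rather than a formal skewness measure.

```python
from statistics import mean, median

def describe_shape(data, tolerance=1e-9):
    """Rule-of-thumb classification based on mean versus median."""
    m, med = mean(data), median(data)
    if abs(m - med) <= tolerance:
        return "symmetrical"
    return "positively skewed" if m > med else "negatively skewed"

# Credit card sample expanded from its frequency table
cards = list(range(1, 11))
customers = [150, 300, 450, 660, 540, 300, 240, 150, 120, 90]
granular = [x for x, f in zip(cards, customers) for _ in range(f)]

print(describe_shape(granular))          # mean 4.7 vs median 4
print(describe_shape([1, 2, 3, 4, 5]))   # mean 3 vs median 3
print(describe_shape([1, 5, 5, 5, 5]))   # mean 4.2 vs median 5
```

For the credit card data the mean (4.7) exceeds the median (4), which points to a longer tail on the right.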
2.d. Case: Data Collection and Management Framework

After Romanov presented the summarized data along with

Measure of central tendency
Measure of dispersion and
Skewness

he was appreciated for his work. As the next step, his manager asked him to put a data
collection and management framework in place.

Let's help Romanov in putting up the framework.
2.d. Comments: Data Collection and Management Framework

At a high level, from an analyst's perspective data collection and management framework
will involve following components
Data collection mechanism
Maintaining a data dictionary
Missing value imputation
Outlier treatment

2.e. Data Collection - quick background

1. Identify Data Needs
- Start with Business Question
- Determine data need for delivering desired outcome
Illustration:
- Business Question: How to match the most profitable credit product with each new customer?
- Solution: Using Credit & Payment History and Financial Statement data to predict account performance for different products.
- Data Request: A representative sample of customers from each product with usage and payment data for sufficient no. of months along with their credit and financial history prior to acquisition.

2. Data Mapping
Before preparing a data request, it is necessary to become as familiar as possible with the data sources and their content that might be available to address the business question to be answered.
This Data Mapping has basically three major components:
- Interview clients
- Obtain & study data layouts
- Obtain & evaluate data samples
Note: The results of each step may require us to repeat one or more previous steps.

3. Data Request Plan
Identify & Assess Available Population Coverage
- Data availability constraints; viz. archives time span
- Population sizing by key characteristics like credit history
- Discuss with client unexpected size limitations
Identify & Assess Alternate Data Sources
- Choose between alternatives
- Identify master data source
- Ensure that link keys work between sources chosen (beware of key length or encryption differences)
Plan to Optimize Client Resource Use
- Minimize workload for client IT department; even if it makes more work at our end to link files, convert media, reformat etc.

4. Data Request Prep.
Be as specific as possible!
- Accurate file names
- Specify selection criteria with respect to actual field names and value formats (e.g. "Values of the field STATE_CD in the subset = (IN,MI)" rather than "Records from Indiana and Michigan")
- Specify required or acceptable file formats
- Give detailed randomization and/or stratification instructions
Note: In case of Account x Transaction level data, random sampling of records is not the same as random sampling of accounts.
Prepare the driver file if requesting data that needs to match another source - test the driver file to ensure that it can be linked back to the data source you already have

5. Quality Check and Merge Step
Always examine results before acceptance!
For each data file received,
- Compare basic statistics (no. of records, no. of fields, range of values in each field) to expectations and resolve any discrepancies
- Ensure that delimiters, file format and record format meet requirements
- Ensure that the data dictionary matches the file exactly
- Enter file into data inventory, recording basic descriptive information (file name, date received, file size, record length, programmer source)
While merging files,
- Watch out for identical merge-key field name with different meanings in two files
- Beware of the consequences of merging two datasets with few identically named non-key fields
- Specify a distinct output file for sorting
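The "compare basic statistics to expectations" check from the quality-check step can be sketched as below. Everything in it is hypothetical: the in-memory file contents, the field names other than STATE_CD, and the expected values are placeholders for what a real data request would specify.

```python
import csv
import io

# Hypothetical received file (an in-memory stand-in for a delivered extract)
received_file = io.StringIO(
    "CUSTOMER_ID,STATE_CD,NUM_CARDS\n"
    "111669,IN,5\n"
    "146861,MI,6\n"
    "171690,IN,3\n"
)

# Hypothetical expectations taken from the data request
expected = {
    "n_records": 3,
    "fields": ["CUSTOMER_ID", "STATE_CD", "NUM_CARDS"],
    "allowed_states": {"IN", "MI"},
    "num_cards_range": (1, 10),
}

reader = csv.DictReader(received_file)
rows = list(reader)

discrepancies = []
if len(rows) != expected["n_records"]:
    discrepancies.append(f"record count {len(rows)} != {expected['n_records']}")
if reader.fieldnames != expected["fields"]:
    discrepancies.append(f"unexpected fields: {reader.fieldnames}")

low, high = expected["num_cards_range"]
for row in rows:
    if row["STATE_CD"] not in expected["allowed_states"]:
        discrepancies.append(f"unexpected STATE_CD {row['STATE_CD']!r}")
    if not low <= int(row["NUM_CARDS"]) <= high:
        discrepancies.append(f"NUM_CARDS out of range: {row['NUM_CARDS']}")

print(discrepancies or "file matches expectations")
```

Any entry in the discrepancy list is something to resolve with the data provider before the file is accepted.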
2.f. Data Dictionary
A comprehensive data dictionary should be maintained and updated as and when any new information is gathered.
USE: It can go a long way in helping us understand the data better. For instance, it can help us to revisit old information and see what our initial
hypothesis was and how it is changing with the new updated information.
Things To Include In The Data Dictionary:
- Meaning of all Potential Predictors:
  - Maintain labels of as many variables as possible
  - If possible, one should also try to capture the business sense of these variables
  - Wherever things are not clear, it should be noted down so that it can be clarified with the client later on
- Clear Definition of Unique Identifier and its Meaning:
  - Ascertain the level at which data is to be rolled up / down. For instance,
    - Individual level
    - Individual x Account level
    - Individual x Month level
    - Individual x Account x Month level, etc.
  - Identify the unique key of every dataset. A few examples below:
    - Payment data may be at transaction level
    - Demographic data at individual level
    - Census data at zip code level
- Dependent Variable Definition and Meaning: This is a very crucial step in a modeling exercise, as a wrong definition can lead to completely
  wrong conclusions. In the absence of a clear definition at this stage, it may be defined later after some actual data analysis.
- Variable Classification: If not already given, one should always try and classify the variables like
  - Demographic variables, e.g. age, gender
  - Performance variables, e.g. spend, number of transactions
  - Credit Attributes, e.g. total credit line, FICO score
  - Census level, e.g. population, location attributes such as income levels
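One possible way to hold such a data dictionary in code is a list of small records, one per variable. The structure, the field names and the example entries below are all assumptions made for illustration; they loosely reuse the variables from Romanov's credit card case.

```python
from dataclasses import dataclass

@dataclass
class DictionaryEntry:
    name: str                  # field name in the dataset (hypothetical)
    label: str                 # meaning / business sense of the variable
    variable_type: str         # e.g. "Numerical (Discrete)"
    classification: str        # e.g. "Demographic", "Performance"
    is_unique_key: bool = False
    open_question: str = ""    # anything to clarify with the client later

data_dictionary = [
    DictionaryEntry("CUSTOMER_ID", "Unique identifier of the customer",
                    "Identifier", "Key", is_unique_key=True),
    DictionaryEntry("AGE", "Age of customer at last birthday",
                    "Numerical (Discrete)", "Demographic"),
    DictionaryEntry("GENDER", "Gender of the customer",
                    "Categorical (Binary)", "Demographic"),
    DictionaryEntry("ANNUAL_SALARY", "Annual salary (in USD)",
                    "Numerical (Continuous)", "Demographic",
                    open_question="Gross or net salary?"),
    DictionaryEntry("CARD_USAGE", "Monthly credit card usage band",
                    "Categorical (Ordinal)", "Performance"),
]

# Which field identifies a record, and what still needs clarification?
unique_keys = [e.name for e in data_dictionary if e.is_unique_key]
to_clarify = [e.name for e in data_dictionary if e.open_question]
print(unique_keys, to_clarify)
```

Keeping open questions next to each variable makes it easy to list what must be clarified with the client.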
2.g. Missing Value Imputation
There are a variety of techniques for missing value imputation; but these should be considered more
as scenario-specific than just being a set of pure alternative choices.
Missing Value Imputation Techniques
A. Impute Missing Values with ZERO
B. Impute Missing Values with MEDIAN
C. Impute Missing Values with MEAN
D. Impute Missing Values with MODE
E. Information based Segmentation
F. Non-Missing Dummy Creation
G. Imputation and Non-Missing Dummy Creation
H. Impute based on Bivariate Graphs
I. Impute using Regression on other Non-Missing Predictors
J. DNI
K. Multiple Imputation
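Techniques A to D and F can be sketched in a few lines of Python. The data are hypothetical, `None` marks a missing value, and the non-missing dummy is read here as a 0/1 flag recording whether the original value was present; that reading is an assumption, since the slide only names the technique.

```python
from statistics import mean, median, mode

# Hypothetical data with missing values marked as None
salaries = [88001, None, 272304, 726593, None, 490356]
marital_status = ["Married", None, "Never Married", "Never Married"]

def impute(values, fill):
    """Replace every missing value (None) with the given fill value."""
    return [fill if v is None else v for v in values]

observed = [s for s in salaries if s is not None]

zero_imputed = impute(salaries, 0)                     # A. ZERO
median_imputed = impute(salaries, median(observed))    # B. MEDIAN
mean_imputed = impute(salaries, mean(observed))        # C. MEAN

observed_status = [m for m in marital_status if m is not None]
mode_imputed = impute(marital_status, mode(observed_status))  # D. MODE

# F. Non-missing dummy: 1 if the original value was present, else 0
non_missing_dummy = [0 if s is None else 1 for s in salaries]

print(median_imputed)
print(mode_imputed)
print(non_missing_dummy)
```

Which fill value is appropriate depends on the scenario, as noted above; the dummy keeps the information that a value was missing even after it has been filled in.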
2.h. Outlier Treatment
An outlier is a single observation "far away" from the rest of the data.
[Figure: two plots, each with an outlying point labelled "Outlier"]

Reasons for outliers:
- Errors
  - Data errors
  - Sampling error
  - Standardization failure
  - Faulty distributional assumptions
  - Human Error
- Genuine Outliers

Why do we care about outliers?
- Outliers are BAD
  The presence of outliers can lead to inflated error rates and substantial distortions of results that can lead to wrong conclusions and
  inferences.
- Outliers are GOOD
  The outliers can provide useful information in the data, for example, a spike in spend behavior of some customers may prove to be the
  deciding factor in marketing response campaigns. So care should be taken while dealing with outliers.
In short, outliers are important and hence should not be ignored.

Techniques for outlier detection / treatment:
- Capping and Flooring Technique
- Exponential Smoothing Technique
- Sigma Approach
- Robust Regression Technique
- Mahalanobis Distance Technique
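Two of the listed techniques, capping and flooring and the sigma approach, can be sketched as follows. The spend values, the cap and floor limits, and the 3-sigma threshold are all assumptions chosen for illustration; in practice the limits are usually derived from percentiles or business rules.

```python
from statistics import mean, pstdev

def cap_and_floor(values, lower, upper):
    """Pull values below `lower` up to it and values above `upper` down to it."""
    return [min(max(v, lower), upper) for v in values]

def sigma_outliers(values, k=3.0):
    """Return the values lying more than k standard deviations from the mean."""
    m, s = mean(values), pstdev(values)
    return [v for v in values if abs(v - m) > k * s]

# Hypothetical monthly spend with one extreme observation
spend = [120, 135, 128, 140, 131, 125, 138, 129, 133, 127, 136, 130, 2500]

flagged = sigma_outliers(spend)                       # detection
treated = cap_and_floor(spend, lower=125, upper=140)  # treatment

print(flagged)
print(treated)
```

Detection only flags the observation; whether to cap it, keep it or investigate it further depends on whether it is an error or a genuine outlier.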
Thank you!
