
Day 2 & 3

EduPristine www.edupristine.com

EduPristine | Business Analytics

Agenda

Basic Statistics

I. The Central Limit Theorem

Sampling

A probability sample is a sample selected such that each item or person in the population being studied

has a known likelihood of being included in the sample.

The sampling distribution of the sample mean is a probability distribution consisting of all possible sample

means of a given sample size selected from a population.

Why sample instead of studying the whole population?

The cost of studying all the items in a population is often prohibitive.

The sample results are usually adequate.

Contacting the whole population would often be time-consuming.

Central Limit Theorem

For a population with mean μ and variance σ², the sampling distribution of the means of all possible samples of size n generated from the population will be approximately normally distributed.

The mean of the sampling distribution is equal to μ and its variance is equal to σ²/n.

How is variance related to standard error?

The sampling distribution becomes almost normal regardless of the shape of the population.


Central Limit Theorem

Introduction:

It is perhaps one of the most important results in statistics.

It provides the basis for large-sample inference about a population mean when the population

distribution is unknown.

It also provides the basis for large-sample inference about a population proportion, for example, in

opinion polls and surveys.

Definition:

If X1, X2, ..., Xn is a sequence of independent, identically distributed (iid) random variables with finite mean μ and finite (non-zero) variance σ², then the distribution of (X̄ − μ)/(σ/√n) approaches the standard normal distribution, N(0,1), as n → ∞.

μ is the population mean from which X1, X2, ..., Xn have been drawn.

X̄ is the sample mean, calculated as X̄ = (1/n) Σ_{i=1}^{n} Xi

For large n, (X̄ − μ)/(σ/√n) and (Σ Xi − nμ)/√(nσ²) have the N(0,1) distribution

OR

X̄ ~ N(μ, σ²/n)

Σ Xi ~ N(nμ, nσ²)
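The convergence above can be checked numerically. The following sketch (the exponential population, seed, and sample counts are illustrative assumptions, not from the slides) draws many samples of size n from a skewed population and confirms that the sample means cluster around μ with spread close to σ/√n.

```python
import random
import statistics

random.seed(42)

# Illustrative population: exponential with mean mu = 2 (so sigma = 2 as well).
mu, sigma, n, trials = 2.0, 2.0, 50, 20000

# Draw many samples of size n and record each sample mean.
sample_means = [
    statistics.fmean(random.expovariate(1 / mu) for _ in range(n))
    for _ in range(trials)
]

# The CLT predicts mean(sample_means) ~ mu and stdev(sample_means) ~ sigma / sqrt(n).
print(round(statistics.fmean(sample_means), 2))   # close to 2.0
print(round(statistics.stdev(sample_means), 2))   # close to 2 / 50**0.5, i.e. about 0.28
```

Even though the exponential population is strongly skewed, the distribution of the sample means is already close to normal at n = 50.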

The Central Limit Theorem

Example:

It is assumed that the number of claims arriving at an insurance company per working day has a

mean of 40 and a standard deviation of 12. A survey was conducted over 50 working days. Find the

probability that the sample mean number of claims arriving per working day was less than 35.

Solution:

We have μ = 40, σ = 12, n = 50.

The central limit theorem states that X̄ ~ N(40, 12²/50).

We want P(X̄ < 35):

P(X̄ < 35) = P(Z < (35 − 40)/(12/√50))
= P(Z < −2.946) = 1 − P(Z < 2.946)
= 1 − 0.9984 = 0.0016
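The arithmetic of this example can be reproduced with the standard normal CDF written in terms of the error function; this is a minimal Python check, not part of the original solution.

```python
from math import erf, sqrt

def normal_cdf(z):
    # Standard normal CDF expressed via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu, sigma, n = 40, 12, 50
z = (35 - mu) / (sigma / sqrt(n))   # standardize the sample mean
p = normal_cdf(z)

print(round(z, 3))   # -2.946
print(round(p, 4))   # 0.0016
```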

Sampling Error

The sampling error is the difference between a sample statistic and its corresponding population parameter. It is found by subtracting the value of the parameter from the value of the statistic.

For example, suppose a poll is conducted in a school where the population is all students and the sample is one class. If the sample mean GPA is 3.4 and the population mean GPA is 3.2, then the sampling error is 0.2.


Standard Error of sample mean

It is the standard deviation of the distribution of the sample means

When the standard deviation of the population is known, the standard error of the sample mean is calculated as the population standard deviation divided by the square root of the sample size (n):

σx̄ = σ/√n

Example: The mean hourly wage for Houston farm workers is $13.50, with a population standard deviation of $2.90. Calculate and interpret the standard error of the sample mean for a sample size of 30.

Answer: Because the population standard deviation is known, the standard error of the sample mean is σx̄ = $2.90/√30 = $0.53
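As a quick check of the arithmetic (a Python sketch using only the numbers given above):

```python
from math import sqrt

sigma, n = 2.90, 30    # population standard deviation, sample size
se = sigma / sqrt(n)   # standard error of the sample mean

print(round(se, 2))    # 0.53
```

Interpretation: sample means from samples of size 30 vary around the population mean with a standard deviation of about $0.53.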


Point Estimate & Confidence Interval

Point estimates: These are the single (sample) values used to estimate population parameters

Confidence interval: It is a range of values in which the population parameter is expected to lie

The confidence interval takes the following form when n ≥ 30:

CI = μ ± Z·σ

This is true for a population distribution,

where μ is the mean of the population and
σ is the standard deviation of the population.

For a sample mean:

Point estimate ± (reliability factor × standard error)

CI = X̄ ± Z·(σ/√n)
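Putting the pieces together, a 95% interval around the sample mean in the earlier farm-worker example might be computed as follows; treating $13.50 as the point estimate and using Z = 1.96 are illustrative assumptions.

```python
from math import sqrt

x_bar, sigma, n, z = 13.50, 2.90, 30, 1.96   # point estimate, sigma, n, 95% reliability factor
se = sigma / sqrt(n)                          # standard error of the sample mean
ci = (x_bar - z * se, x_bar + z * se)         # point estimate +/- reliability factor * SE

print(round(ci[0], 2), round(ci[1], 2))       # 12.46 14.54
```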

Hypothesis Testing

A statistical hypothesis test is a method of making statistical decisions from and about

experimental data.

It asks how well the findings fit the possibility that chance factors alone might be responsible.

Key steps in Hypothesis Testing

Null Hypothesis (H0): The hypothesis that the researcher wants to reject

Alternative Hypothesis (Ha): The hypothesis that is accepted when we reject the null hypothesis

Test Statistic

Rejection/Critical Region

Conclusion

Launching a niche course for MBA students?

Sam, a brand manager for a leading financial training center, wants to introduce a new niche finance course for MBA

students. He met some industry stalwarts and found that with the skills acquired by attending such a course, the

students would be able to land a good job.

He meets a random sample of 100 students and discovers the following characteristics of the market

Mean household income = $20,000

Interest level in students = high

Current knowledge of students for the niche concepts = low

Sam strongly believes the course would be adequately profitable if the students have the buying power for the

course. They would be able to afford the course only if the mean household income is greater than $19,000.

Would you advise Sam to introduce the course?

What should be the hypothesis?

o Hint: What is the point at which the decision changes (19,000 or 20,000)?

o What about the alternate hypothesis?

What other information do you need to ensure that the right decision is arrived at?

o Hint: confidence intervals/ significance levels?

o Hint: Is there any other factor apart from mean, which is important? How do I move from population parameters to

standard errors?

What is the risk still remaining, when you take this decision?

o Hint: Type-I/II errors?

o Hint: P-value

Criterion for Decision Making

To reach a final decision, Sam has to make a general inference (about the population) from the

sample data.

Criterion: Mean income across all households in the market area under consideration.

If the mean population household income is greater than $19,000, then Sam should introduce the course in the new market.

The population mean household income in the new market area is greater than $19,000

The term one-tailed signifies that all z-values that would cause Sam to reject H0, are in just one

tail of the sampling distribution

μ -> Population Mean

H0: μ ≤ $19,000

Ha: μ > $19,000


Identifying the Critical Sample Mean Value Sampling Distribution

[Figure: sampling distribution centered at μ = $19,000, with the critical value (x̄c) marked in the right tail]

Sample mean values greater than $19,000 (that is, x̄ values on the right-hand side of the sampling distribution centered on μ = $19,000) suggest that H0 may be false.

More importantly, the farther to the right x̄ lies, the stronger the evidence against H0.


Computing the Criterion Value

The standard deviation for the sample of 100 households is $4,000. The standard error of the mean (sx̄) is given by:

sx̄ = s/√n = $4,000/√100 = $400

The critical mean household income x̄c is found through the following two steps:

Determine the critical z-value, zc. For α = 0.05 (one-tailed): zc = 1.645.

Substitute the values of zc, sx̄, and μ (under the assumption that H0 is "just" true) into the critical value formula:

x̄c = μ + zc·sx̄ = $19,000 + 1.645 × $400 = $19,658

In this case, since the observed sample statistic ($20,000) is greater than the critical value ($19,658), the null hypothesis is rejected.

Decision Rule

If the sample mean household income is greater than $19,658, reject the null hypothesis and

introduce the new course
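The decision rule above can be reproduced numerically (a Python sketch of the two steps):

```python
mu0, s, n, zc = 19_000, 4_000, 100, 1.645   # H0 mean, sample std dev, sample size, one-tailed z
se = s / n ** 0.5            # standard error of the mean
x_c = mu0 + zc * se          # critical sample mean value
x_bar = 20_000               # observed sample mean

print(round(se))      # 400
print(round(x_c))     # 19658
print(x_bar > x_c)    # True -> reject H0 and introduce the course
```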


Test Statistic

The value of the test statistic is simply the z-value corresponding to x̄ = $20,000:

Z = (x̄ − μ)/sx̄ = ($20,000 − $19,000)/$400 = 2.5

[Figure: sampling distribution centered at μ = $19,000 with α = 0.05; the rejection region lies to the right of x̄c = $19,658 (Zc = 1.645), and the observed x̄ = $20,000 falls inside it]

There is a significant difference between the hypothesized population parameter and the observed sample mean. Mean income > $19,000 => launch the course.


Errors in Estimation

Please note: You are inferring about a population based only on a sample.

This is no proof that your decision is correct; it is just a hypothesis. There is still a chance that your inference is wrong.

How do I quantify the probability of error in inference?

Type I and Type II Errors:

A Type I error occurs if the null hypothesis is rejected when it is true.
A Type II error occurs if the null hypothesis is not rejected when it is false.

Inference \ Actual   H0 is True                      H0 is False
Do not reject H0     Correct Decision                Type-II Error
                     (Confidence Level = 1 − α)      (P(Type-II Error) = β)
Reject H0            Type-I Error                    Correct Decision
                     (Significance Level = α)        (Power = 1 − β)

Significance Level:

α -> significance level: the upper-bound probability of a Type I error
1 − α -> confidence level: the complement of the significance level

The power of a test (1 − β) is the probability of correctly rejecting the null.

P-Value (Actual Significance Level)

The p-value is the smallest level of significance at which the null hypothesis can be rejected.

P-value = probability of observing a value of x̄ (from the sample) as high as $20,000 or more when the actual population mean (μ) is only $19,000 = 0.00621

It is the calculated probability of rejecting the null hypothesis (H0) when that hypothesis is true (Type I error).

[Figure: sampling distribution centered at μ = $19,000 (Z = 0), with α = 0.05; the p-value of 0.00621 is the tail area to the right of x̄ = $20,000]

The actual significance level of 0.00621 in this case means that the odds are less than 62 out of 10,000 that a sample mean income of $20,000 would have occurred entirely due to chance (when the population mean income is $19,000).
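The p-value quoted above follows directly from the test statistic; a short Python check using the error-function form of the normal CDF:

```python
from math import erf, sqrt

def normal_cdf(z):
    # Standard normal CDF expressed via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

z = (20_000 - 19_000) / 400     # test statistic = 2.5
p_value = 1.0 - normal_cdf(z)   # right-tail area beyond the observed sample mean

print(z)                    # 2.5
print(round(p_value, 5))    # 0.00621
```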


Some variations in the Z-Test

What if Sam surveyed the market and found that the student behavior is estimated to be:

They would find the training too expensive if their household income is < US$19,000 and hence would not

have the buying power for the course?

They would perceive the training to be of inferior quality, if their household income is > US$19,000 and hence

not buy the training?

How would the decision criteria change? What should be the testing strategy?

Hint: From the question wording infer: Two tailed testing

Appropriately modify the significance value and other parameters

Use the Z-test

Appropriate changes in the decision-making and testing process:

Students will not attend the course if:

The household income >$19,000 and the students perceive the course to be inferior

The household income is <$19,000

This becomes a two-tailed test wherein the student will join the course only when the household income lies between particular boundaries, i.e. the household income should be neither very high nor very low.

Two-Tailed Test

A two-tailed test signifies that the z-values that would cause Sam to reject H0 lie in both tails of the sampling distribution.

μ -> Population Mean

H0: μ = $19,000
Ha: μ ≠ $19,000

Since we are checking for a significant difference on both ends, it is a two-tailed test, with α/2 = 0.025 in each tail.

[Figure: sampling distribution centered at μ = $19,000 (Z = 0), with rejection regions of area α/2 = 0.025 in each tail]

The boundaries are:

μ ± Z(α/2)·sx̄ = $19,000 ± 1.96 × $400, i.e. $18,216 and $19,784

Conclusion: If the household income lies between $18,216 and $19,784, then the student will attend the course, at 95% confidence.
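The two-tailed boundaries can be verified in a couple of lines (Python sketch; Z(α/2) = 1.96 for 95% confidence):

```python
mu0, se, z_half = 19_000, 400, 1.96   # H0 mean, standard error, two-tailed z at alpha/2 = 0.025
lower = mu0 - z_half * se
upper = mu0 + z_half * se

print(lower, upper)   # 18216.0 19784.0
```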


Business Analytics

Predictive Modeling using Linear Regression


Agenda

Basic Statistics

4. Correlation and Regression

I. Covariance and Correlation coefficient

II. Regression

4a. Correlation

I. Covariance and Correlation coefficient

i. Definition

ii. Sample and population correlation

iii. Illustrative example

iv. Statistical significance test for sample correlation coefficient

4a. Covariance and Correlation Coefficient

Covariance is a statistical measure of the degree to which the two variables move together.

The sample covariance is calculated as:

cov_xy = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / (n − 1)

Correlation coefficient

It is a measure of the strength of the linear relationship between two variables

The correlation coefficient is given by:

ρ_xy = cov_xy / (σ_x · σ_y)

Population correlation is denoted by ρ (rho).

Sample correlation is denoted by r. It is an estimate of ρ, in the same way as
s² (sample variance) is an estimate of σ² (population variance) and
X̄ (sample mean) is an estimate of μ (population mean).

Features of ρ and r

Unit free and ranges between -1 and 1

The closer to -1, the stronger the negative linear relationship

The closer to 1, the stronger the positive linear relationship

The closer to 0, the weaker the linear relationship

4a. Example: Covariance and Correlation of the S&P 500 and

NASDAQ Returns given a sample

Date S&P 500 NASDAQ

12/2/2011 1,244.28 2,626.93

12/5/2011 1,257.08 2,655.76

12/7/2011 1,261.01 2,649.21

12/8/2011 1,234.35 2,596.38

12/9/2011 1,255.19 2,646.85

12/12/2011 1,236.47 2,612.26

4a. Solution: Covariance and Correlation of the S&P 500 and

NASDAQ Returns given a sample

Date         S&P 500    NASDAQ     Xi       Yi       Xi − X̄   Yi − Ȳ   (Xi − X̄)·(Yi − Ȳ)
12/2/2011    1,244.28   2,626.93   –        –        –         –        –
12/5/2011    1,257.08   2,655.76   1.03%    1.10%    1.14%     1.20%    0.0137%
12/7/2011    1,261.01   2,649.21   0.31%    -0.25%   0.43%     -0.15%   -0.0006%
12/8/2011    1,234.35   2,596.38   -2.11%   -1.99%   -2.00%    -1.89%   0.0378%
12/9/2011    1,255.19   2,646.85   1.69%    1.94%    1.80%     2.05%    0.0369%
12/12/2011   1,236.47   2,612.26   -1.49%   -1.31%   -1.38%    -1.21%   0.0166%

Here Xi and Yi are the daily returns of the S&P 500 and NASDAQ respectively.

Standard Deviation: s_x = 0.01630504, s_y = 0.01633798
Covariance = 0.000261013
Correlation = 0.979811179
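The table's covariance and correlation can be reproduced from the raw closing prices; the following Python sketch recomputes the daily returns and the sample statistics.

```python
from math import sqrt

# Closing prices from the table above.
sp500  = [1244.28, 1257.08, 1261.01, 1234.35, 1255.19, 1236.47]
nasdaq = [2626.93, 2655.76, 2649.21, 2596.38, 2646.85, 2612.26]

def returns(prices):
    # Simple one-period returns between consecutive closing prices.
    return [p1 / p0 - 1 for p0, p1 in zip(prices, prices[1:])]

x, y = returns(sp500), returns(nasdaq)
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Sample covariance and standard deviations (n - 1 in the denominator).
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
sx = sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
sy = sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
r = cov / (sx * sy)

print(round(cov, 6))   # 0.000261
print(round(r, 4))     # 0.9798
```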

4a. Examples of Approximate r Values

[Figure: five scatter plots of y against x illustrating r = −1, r = −0.6, r = 0, r = +0.3, and r = +1]

4a. Testing the significance of the correlation coefficient

Test whether the correlation between the two variables in the population is equal to zero.

Null hypothesis, H0: ρ = 0

Assuming that the two populations are normally distributed, we can use a t-test to determine

whether the null hypothesis should be rejected.

The test statistic is computed using the sample correlation, r, with n − 2 degrees of freedom (df):

t = r·√(n − 2) / √(1 − r²)

Calculated test statistic is compared with the critical t-value for the appropriate degrees of

freedom and level of significance

Reject H0 if t > t_critical or t < −t_critical

Example: Correlation of the S&P 500 and NASDAQ Returns given a sample

n = 5, r = 0.979811179, df = 5 − 2 = 3

Calculated t = 8.4885

t_critical at the 95% confidence level (df = 3) = 2.3534

Since t > t_critical, reject H0 at the 95% confidence level.
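A quick check of the test statistic (Python sketch; small differences from the slide's 8.4885 are rounding):

```python
from math import sqrt

n, r = 5, 0.979811179
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)   # t-test for H0: rho = 0, with n - 2 df

print(round(t, 2))   # 8.49
```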

4b. Regression

II. State the usual simple regression model (with a single explanatory variable).

III. State and explain the least squares estimates of the slope and intercept parameters in a simple

linear regression model.

IV. Calculate R2 (coefficient of determination) and describe its use to measure the goodness of fit of

a linear regression model.

V. Use a fitted linear relationship to predict a mean response or an individual response with

confidence limits.

VI. Use residuals to check the suitability and validity of a linear regression model.

VII. State the usual multiple linear regression model

VIII. Discuss issues in linear regression

i. Heteroskedasticity

ii. Multicollinearity

IX. Detailed case study on multivariate regression by using

I. MS Excel

II. R software

4.b. The Million Dollar Question

4.b. The Population

[Figure: scatter plot of Marks in Test (0 to 120) against Hours of Study (0 to 90) for the population]

4.b. Introduction to Regression Analysis

Predict the value of a dependent variable based on the value of at least one independent

variable

Explain the impact of changes in an independent variable on the dependent variable

Dependent variable: the variable we wish to explain usually denoted by Y

Independent variable: the variable used to explain the dependent variable. Denoted by X

4.b. Simple Linear Regression Model

Relationship between x and y is described by a linear function

Changes in y are assumed to be caused by changes in x

4.b. Assumptions

1. A linear relationship exists between the dependent and the independent variable.

The expected value of the residual term is zero: E(ε) = 0

4. The variance of the residual term is constant for all observations (homoskedasticity): E(εi²) = σ²

5. The residual term is independently distributed; that is, the residual for one observation is not correlated with that of another observation: E(εi·εj) = 0, j ≠ i

6. The residual term is normally distributed.

4.b. Types of Regression Models

4.b. Population Linear Regression

Y = β0 + β1·X + u

[Figure: population regression line with intercept β0 and slope β1; for an individual person's marks at xi, the observed value of Y differs from the line by the error ui]

4.b. Population Regression Function

Y = β0 + β1·X + u

where Y is the dependent variable, β0 the population y-intercept, β1 the population slope coefficient, X the independent variable, and u the random error term (residual).

β0 + β1·X is the linear component; u is the random error component.

If yes, what information will we need?

4.b. Information that we actually have

4.b. Sample Regression Function

ŷ = b0 + b1·x + e

[Figure: sample regression line with intercept b0 and slope b1; ei is the difference between the observed value of y for xi and the value fitted by the line]

4.b. Sample Regression Function

ŷ = b0 + b1·x, where b0 is the regression intercept, b1 the regression slope, and x the independent variable.

4.b. The error term (residual)

It represents the influence of all the variables which we have not accounted for in the equation.

It is the difference between the actual y values and the y values predicted by the Sample Regression Line.

Wouldn't it be good if we were able to reduce this error term?

What are we trying to achieve by Sample Regression?

4.b. Our Objective

PRF: Y = β0 + β1·X + u

SRL: ŷi = b0 + b1·xi

4.b. One method to find b0 and b1

b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared

residuals

Σe² = Σ(y − ŷ)² = Σ(y − (b0 + b1·x))²

Why don't we take the sum?

Why don't we take absolute values instead?

4.b. OLS Regression Properties

The sum of the residuals from the least squares regression line is 0: Σ(y − ŷ) = 0

The sum of the squared residuals, Σ(y − ŷ)², is a minimum.

The simple regression line always passes through the mean of the y variable and the mean of

the x variable
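These properties can be seen directly by fitting a line with the closed-form least squares estimates b1 = Σ(x − x̄)(y − ȳ)/Σ(x − x̄)² and b0 = ȳ − b1·x̄. The hours-of-study data below are invented for illustration only.

```python
# Hypothetical hours-of-study vs. marks data, invented for illustration.
x = [10, 20, 30, 40, 50, 60]   # hours of study
y = [25, 35, 50, 55, 70, 80]   # marks in test

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Least squares estimates: b1 = cov(x, y) / var(x), b0 = y-bar - b1 * x-bar.
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print(round(b1, 3))                  # 1.1
print(round(b0, 3))                  # 14.0
print(abs(sum(residuals)) < 1e-9)    # True -- OLS residuals sum to zero
```

The fitted line also passes through (x̄, ȳ) = (35, 52.5), since 14.0 + 1.1 × 35 = 52.5.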

4.b. Interpretation of the Slope and the Intercept

b0 is the estimated average value of y when the value of x is zero. More often than not it does

not have a physical interpretation

b1 is the estimated change in the average value of y as a result of a one-unit change in x

[Figure: fitted line Ŷ = b0 + b1·X, with intercept b0 and slope b1]

4.b. Hypothesis Testing: Two Variable Model

How do we know whether the values of b0 and b1 that we have found are actually meaningful?

Is it actually possible that our sample was a random sample and it has given us a totally wrong

regression line?

We do know a lot about the sample error term "e" but what do we know about the error terms

"u" of the Population Regression Function?

How do we proceed from here?

4.b. Assumptions about "u"

The underlying relationship between the X variable and the Y variable is linear.

For a given value of Xi, the sum of the error terms is equal to 0.

The error term is uncorrelated with the explanatory variable X.

Error values are normally distributed for any given value of X: the probability distribution of the errors for a given Xi is normal.

The probability distribution of the errors for different Xi has constant variance (homoscedasticity).

Error values u for given Xi are statistically independent; their covariance is zero: Cov(u_x1, u_x2) = 0.

[Figure: normal error distributions with equal variance centered on the regression line at x1 and x2]

Once we make these assumptions about "u" we are able to estimate the variance and standard errors of b0 and b1; this is possible because of the properties of the OLS method (beyond the scope of this lecture).

4.b. Standard Error of Estimate (SEE)

The standard deviation of the variation of observations around the regression line is estimated by:

s_u = √( RSS / (n − k − 1) )

Where:
RSS = Residual Sum of Squares (Σe²)
n = sample size
k = number of independent variables in the model

Note: When k = 1:

s_u = √( RSS / (n − 2) ) = sample standard error of the estimate

4.b. Comparing Standard Errors

Left: the variation of observed y values around the regression line (a small s_u means the points lie close to the line). Right: the variation in the slope of regression lines fitted to different possible samples (a small s_b1 means the slope is estimated precisely).

[Figure: paired scatter plots contrasting small and large s_u, and small and large s_b1]

4.b. Inference about the Slope: t-Test

t-test for a population slope

Is there a linear relationship between x and y?

Null and alternative hypotheses

H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)

Test statistic:

t = (b1 − β1) / s_b1, with d.f. = n − 2

The null hypothesis can be rejected if either of the following is true:

t > tc or t < −tc

where:
b1 = sample regression slope coefficient
β1 = hypothesized slope
s_b1 = estimator of the standard error of the slope
tc = the critical t-value

4.b. Confidence Interval for 'y'

The confidence interval for the predicted value of Y is given by:

Ŷ ± (tc · s_f)

where:
Ŷ = predicted Y value (dependent variable)
n − 2 = degrees of freedom
tc = the critical t-value
s_f = the standard error of the forecast

4.b. The Confidence Interval for a Regression Coefficient

The confidence interval for the regression coefficient b1 is given by:

b1 ± (tc · s_b1)

where:
b1 = estimated slope coefficient
n − 2 = degrees of freedom
tc = the critical t-value
s_b1 = the standard error of the regression coefficient

4.b. Explained and Unexplained Variation

[Figure: for an observation (xi, yi), the total deviation of yi from ȳ splits into the explained part (ŷi − ȳ) and the error part (yi − ŷi)]

SST = Total Sum of Squares = Σ(yi − ȳ)²
SSE = Sum of Squared Errors = Σ(yi − ŷi)²
RSS = Regression Sum of Squares = Σ(ŷi − ȳ)²

4.b. Explained and Unexplained Variation (Cont)

SST measures the variation of the yi values around their mean ȳ.

SSE = sum of squared errors: the variation attributable to factors other than the relationship between x and y.

SSR = regression sum of squares: the explained variation attributable to the relationship between x and y.

4.b. Explained and Unexplained Variation (Cont)

SST = Σ(y − ȳ)²   (Total Sum of Squares)
SSR = Σ(ŷ − ȳ)²   (Regression Sum of Squares, also known as the Sum of Squares due to Regression)
SSE = Σ(y − ŷ)²   (Sum of Squared Errors)

Where:
ȳ = average value of the dependent variable
y = observed values of the dependent variable
ŷ = estimated value of y for the given x value

4.b. Coefficient of Determination, R2

The coefficient of determination is the portion of the total variation in the dependent variable

that is explained by variation in the independent variable

The coefficient of determination is also called R-squared and is denoted as R2

R² = SSR / SST, where 0 ≤ R² ≤ 1

4.b. Coefficient of Determination, R2 (Cont)

Coefficient of determination:

R² = SSR / SST = (sum of squares explained by regression) / (total sum of squares)

For simple linear regression: R² = r²

Where:
R² = coefficient of determination
r = simple correlation coefficient

4.b. Examples of Approximate R2 Values

[Figure: scatter plots where all points lie exactly on a line: a perfect linear relationship, R² = 1]

4.b. Examples of Approximate R2 Values (Cont)

[Figure: scatter plot with a weaker linear relationship, 0 < R² < 1: some, but not all, of the variation in y is explained by variation in x]

4.b. Examples of Approximate R2 Values (Cont)

[Figure: scatter plot with no linear relationship, R² = 0: none of the variation in y is explained by variation in x]

4.b. Limitations of Regression Analysis

Parameter Instability - This happens in situations where correlations change over a period of

time. This is very common in financial markets where economic, tax, regulatory, and political

factors change frequently.

Public knowledge of a specific regression relation may cause a large number of people to react in

a similar fashion towards the variables, negating its future usefulness.

If any regression assumptions are violated, predicted dependent variables and hypothesis tests

will not hold valid.

4.b. General Multiple Linear Regression Model

In simple linear regression, the dependent variable was assumed to be dependent on only one

variable (independent variable)

In the General Multiple Linear Regression model, the dependent variable derives its value from two or more variables.

General Multiple Linear Regression model take the following form:

Yi = b0 + b1·X1i + b2·X2i + ... + bk·Xki + εi

where:

Yi = ith observation of dependent variable Y

Xki = ith observation of kth independent variable X

b0 = intercept term

bk = slope coefficient of kth independent variable

εi = error term of ith observation

n = number of observations

k = total number of independent variables

4.b. Estimated Regression Equation

As we calculated the intercept and the slope coefficient in case of simple linear regression by

minimizing the sum of squared errors, similarly we estimate the intercept and slope coefficient in

multiple linear regression.

The Sum of Squared Errors, Σ_{i=1}^{n} εi², is minimized and the slope coefficients are estimated.

The fitted value is:

Ŷi = b0 + b1·X1i + b2·X2i + ... + bk·Xki

Now the error in the ith observation can be written as:

εi = Yi − Ŷi = Yi − (b0 + b1·X1i + b2·X2i + ... + bk·Xki)

4.b. Interpreting the Estimated Regression Equation

Intercept term (b0): It's the value of the dependent variable when the value of all independent variables becomes zero:

b0 = value of Y when X1 = X2 = ... = Xk = 0

Slope coefficient (bk): It's the change in the dependent variable from a unit change in the

corresponding independent (Xk) variable keeping all other independent variables constant.

In reality when the value of the independent variable changes by one unit, the change in the dependent

variable is not equal to the slope coefficient but depends on the correlation among the independent

variables as well.

Therefore, the slope coefficients are also called partial slope coefficients.

4.b. Assumptions of Multiple Regression Model

There exists a linear relationship between the dependent and independent variables.

The expected value of the error term, conditional on the independent variables is zero.

The error terms are homoskedastic, i.e. the variance of the error terms is constant for all the

observations.

The expected value of the product of error terms is always zero, which implies that the error

terms are uncorrelated with each other.

The independent variables do not have any linear relationship with each other.

4.b. Hypothesis Testing of Coefficients

The values of the slope coefficients do not tell us anything about their significance in explaining the dependent variable.

Even an unrelated variable, when regressed, would give some value for its slope coefficient.

To exclude the cases where the independent variables do not significantly explain the dependent variable, we need hypothesis tests on the coefficients, to check whether they contribute significantly to explaining the dependent variable or not.

The t-statistic is used to check the significance of the coefficients.

The t-statistic used for the hypothesis testing is same as used in the hypothesis testing of

coefficient of simple linear regression.

Following are the hypothesis and alternative hypothesis to check the statistical significance of b k:

Hypothesis (H0): bk = 0
Alternative Hypothesis (Ha): bk ≠ 0

The t-statistic, with (n − k − 1) degrees of freedom, for the hypothesis test of the coefficient bk:

t = (b̂k − bk) / s_bk

If the value of t-statistic lies within the confidence interval, H0 can't be rejected

4.b. Confidence Interval for the Population Value

The confidence interval for a regression coefficient is given by:

b_j ± (tc · s_bj)

Where:
tc is the critical t-value, and
s_bj is the standard error of b_j

4.b. Predicted Dependent Variable

The regression equation can be used for making predictions about the dependent variable by

using forecasted values of the independent variables.

Ŷi = b0 + b1·X1i + b2·X2i + ... + bk·Xki

Where:
Ŷi is the predicted value of the dependent variable
bk is the estimated partial slope coefficient for the kth independent variable
Xki is the forecasted value of the kth independent variable

4.b. Analysis of Variance (ANOVA)

Analysis of variance is a statistical method for analyzing the variability of the data by breaking the

variability into its constituents.

A typical ANOVA table looks like:

Source of Variability     DoF         Sum of Squares    Mean Sum of Squares
Regression (Explained)    k           RSS               MSR = RSS/k
Error (Unexplained)       n − k − 1   SSE               MSE = SSE/(n − k − 1)
Total                     n − 1       SST = RSS + SSE

Standard Error of Estimate (SEE) = √MSE = √( SSE / (n − k − 1) )

Coefficient of determination (R²) = (Total Variation (SST) − Unexplained Variation (SSE)) / Total Variation (SST)
                                  = Explained Variation (RSS) / Total Variation (SST)
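The table's relationships can be illustrated numerically; the sums of squares, n, and k below are invented for illustration only.

```python
from math import sqrt

# Hypothetical decomposition for n = 20 observations and k = 2 independent variables.
n, k = 20, 2
rss, sse = 80.0, 40.0      # explained (regression) and unexplained (error) sums of squares
sst = rss + sse            # total sum of squares

msr = rss / k              # mean square due to regression
mse = sse / (n - k - 1)    # mean square error
see = sqrt(mse)            # standard error of estimate
r2 = rss / sst             # coefficient of determination
f = msr / mse              # F-statistic with (k, n - k - 1) degrees of freedom

print(msr, round(mse, 4))  # 40.0 2.3529
print(round(see, 4))       # 1.5339
print(round(r2, 4))        # 0.6667
print(round(f, 2))         # 17.0
```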

4.b. F-Statistic

An F-test explains how well the dependent variable is explained by the independent variables

collectively.

With multiple independent variables, the F-test tells us whether the independent variables, taken collectively, explain a significant part of the variation in the dependent variable.

F = MSR / MSE = (RSS / k) / ( SSE / (n − k − 1) )

Where:
n = number of observations
k = number of independent variables


4.b. F-statistic contd.

Decision rule for F-test: Reject H0 if the F-statistic > Fc (Critical Value)

The numerator of F-statistic has degrees of freedom of "k" and the denominator has the degrees

of freedom of "n-k-1"

If H0 is rejected, then at least one of the two independent variables is significantly different from zero.

This implies that at least one of household income (independent variable) or household expenses (independent variable) explains the variation in Edward's pocket money.

The null hypothesis of the F-test is that the coefficients are simultaneously equal to zero.


4.b. Coefficient of determination (R2) and Adjusted R2

Coefficient of determination(R2) can also be used to test the significance of the coefficients

collectively apart from using F-test.

R² = (SST − SSE) / SST = RSS / SST = (Sum of Squares explained by regression) / (Total Sum of Squares)

The drawback of using the coefficient of determination is that its value always increases as the number of independent variables increases, even if the marginal contribution of the incoming variable is statistically insignificant.

To take care of the above drawback, coefficient of determination is adjusted for the number of

independent variables taken. This adjusted measure of coefficient of determination is called

adjusted R2

Adjusted R2 is given by the following formula:

Ra² = 1 − [ (n − 1) / (n − k − 1) ] × (1 − R²)

where:
n = Number of Observations
k = Number of Independent Variables
Ra² = Adjusted R²
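A minimal sketch of the formula; the R² value and sample sizes are made-up numbers, chosen to show how the penalty grows with k.

```python
def adjusted_r2(r2, n, k):
    # Penalizes R^2 for the number of independent variables k.
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

# Same R^2 = 0.80 and n = 20, but increasing numbers of regressors.
print(round(adjusted_r2(0.80, 20, 2), 4))   # 0.7765
print(round(adjusted_r2(0.80, 20, 5), 4))   # 0.7286
```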

4.b. Representing Qualitative Factors

How can we represent Qualitative factors in a regression equation?

By using 'dummy variables': variables that take values of either 1 or 0, depending on whether a condition is true or false.

If we wanted to consider the spike in soft drink sales in the summer, we may have a regression

equation:

Rev(t) = 10,000 + 2,000·t + 50,000·S

Here,

S = 1 if it's summer, 0 if it's not summer

If there are n mutually exclusive and exhaustive classes, they can be represented by n-1 dummy

variables. This is derived from the concept of degrees of freedom.

For example, to represent the 4 stages of the business cycle, we can use 3 dummy variables.

The fourth stage would be represented by zeros for all three dummy variables.

We do not use 4 dummy variables, as that would create an exact linear relationship among the variables.
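A minimal sketch of the n − 1 dummy encoding described above; the stage names and the choice of baseline class are illustrative assumptions.

```python
# Encode 4 mutually exclusive, exhaustive business-cycle stages with 3 dummies.
# "trough" is the omitted (baseline) class, represented by all zeros.
STAGES = ["expansion", "peak", "contraction", "trough"]

def encode(stage):
    # One dummy per non-baseline class.
    return [1 if stage == s else 0 for s in STAGES[:-1]]

print(encode("expansion"))    # [1, 0, 0]
print(encode("contraction"))  # [0, 0, 1]
print(encode("trough"))       # [0, 0, 0] -- baseline: all dummies zero
```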

4.b. Multicollinearity

Another significant problem faced in the Regression Analysis is when the independent variables or

the linear combinations of the independent variables are correlated with each other.

This correlation among the independent variables is called Multicollinearity which creates

problems in conducting t-statistic for statistical significance.

Multicollinearity is evident when the t-test concludes that the coefficients are not statistically

different from zero but the F-test is significant and the coefficient of determination (R2) is high.

High correlation among the independent variables suggests the presence of multicollinearity, but low correlation values do not rule out its presence.

The most common method of correcting multicollinearity is by systematically removing the

independent variable until multicollinearity is minimized.
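A simple way to screen for the high pairwise correlations mentioned above is to compute correlation coefficients between candidate independent variables; a small Python sketch (the data values are made up for illustration):

```python
import math

def pearson(x, y):
    # Pearson correlation coefficient between two equal-length series
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Age and years of driving experience move almost in lockstep,
# so their correlation is close to 1 -> consider dropping one of them.
age = [18, 25, 33, 41, 52, 60]
exp = [2, 8, 16, 25, 35, 44]
pearson(age, exp)   # close to 1
```

Note that pairwise correlations only catch two-variable collinearity; the R demo later in the deck uses VIFs, which also catch collinear linear combinations.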

4.b. Model Misspecifications

Apart from checking the previously discussed problems in the regression, we should check for the

correct form of the regression as well.

Following 3 misspecification can be present in the regression model:

Functional form of regression is misspecified:

The important variables could have been omitted from the regression model

Some regression variables may need the transformation (like conversion to the logarithmic scale)

Pooling of data from incorrect pools

The variables can be correlated with the error term in time-series models:

Lagged dependent variables are used as independent variables with serially correlated errors

A function of dependent variables is used as an independent variable because of incorrect dating of the

variables

Independent variables are sometimes measured with error

Other Time-Series Misspecification which leads to the nonstationarity of the variables:

Existence of relationships in time-series that results in patterns

Random walk relationships among the time series

These misspecifications in the regression model result in biased and inconsistent regression coefficients, which in turn lead to incorrect confidence intervals and therefore to Type I or Type II errors.

Nonstationarity means that the properties (like mean and variance) of the variables are not constant over time.

4.b. The Economic meaning of a Regression Model

Consider the equation:

Rev_Growth = 4% + 0.75 * GDP_Growth + 0.5 * WPI_Infl

The economic meaning for this equation is given by the partial slopes or coefficients of the

variables.

If the GDP Growth rate was 1% higher, it translates into a 0.75% higher Revenue growth.

Similarly, if the WPI Inflation figures were 1% higher, it translates into a 0.5% higher revenue

growth.
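The partial-slope interpretation above can be checked numerically; a tiny Python sketch of the same equation:

```python
def rev_growth(gdp_growth, wpi_infl):
    # Rev_Growth = 4% + 0.75 * GDP_Growth + 0.5 * WPI_Infl
    return 0.04 + 0.75 * gdp_growth + 0.5 * wpi_infl

base = rev_growth(0.05, 0.03)
# A 1% higher GDP growth rate adds 0.75% to revenue growth,
# holding WPI inflation fixed:
delta = rev_growth(0.06, 0.03) - base   # -> 0.0075 (i.e. 0.75%)
```

Each coefficient is a marginal effect holding the other variable constant, which is exactly what the difference above isolates.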

4.b. Case- Multivariate Linear Regression (Revisited)

Adam, an Analytics consultant, works with First Auto Insurance company. His manager gave him data containing loss amounts and policy-related information, and asked him to identify and quantify the factors responsible for losses in a multivariate fashion. Adam has no knowledge of running a multivariate regression.

Now suppose he approaches you and requests your help to complete the assignment. Let's help Adam carry out the multivariate regression.

Case- Multivariate Linear Regression (Rules of Thumb)

In due course of helping Adam to complete his task, we will walk him through following steps:

Variable identification

Identifying the dependent (response) variable.

Identifying the independent (explanatory) variables.

Variable categorization (e.g. Numeric, Categorical, Discrete, Continuous etc.)

Creation of Data Dictionary

Response variable exploration

Distribution analysis

Percentiles

Variance

Frequency distribution

Outlier treatment

Identify the outliers/threshold limit

Cap/floor the values at the thresholds

Independent variables analyses

Identify the prospective independent variables (that can explain response variable)

Bivariate analysis of response variable against independent variables

Variable treatment /transformation

Grouping of distinct values/levels

Mathematical transformation e.g. log, splines etc.
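The outlier cap/floor step above can be sketched in a few lines (Python; the thresholds are illustrative — the demo later in the deck caps losses at 1,200):

```python
def cap_floor(values, lower, upper):
    # Cap/floor every observation at the outlier thresholds
    return [min(max(v, lower), upper) for v in values]

losses = [13, 355, 981, 3500]
cap_floor(losses, 13, 1200)   # -> [13, 355, 981, 1200]
```

Capping (rather than dropping) the extreme observations keeps the sample size intact while limiting their leverage on the fit.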

Case- Multivariate Linear Regression (Rules of Thumb)

Heteroskedasticity

Check in a univariate manner by individual variables

Easy for univariate linear regression. Can be done manually.

Too cumbersome to do manually for multivariate case

The tools (R, SAS etc.) have in-built features to tackle it.

Fitting the regression

Check for correlation between independent variables

This is to take care of Multicollinearity

Fix Heteroskedasticty

By suitable transformation of the response variable (a bit tricky)

Using inbuilt features of statistical packages like R

Variable selection

Check for the most suitable transformed variable

Select the transformation giving the best fit

Reject the statistically insignificant variables

Fitting the regression

Analysis of results

Model comparison

Model performance check

R2

Lift/Gains chart and Gini coefficient

Actual vs Predicted comparison


Multivariate Linear Regression- Data

Snapshot of the data

Data description (known facts):

Auto insurance policy data

Contains policy holders and loss amount

information (variables)

Policy Number

Age

Years of Driving Experience

Number of Vehicles

Gender

Married

Vehicle Age

Fuel Type

Losses (Dependent/Response Variable)

Next step

Create the Data Dictionary

Multivariate Linear Regression- Data Dictionary

#  Variable                     Description                                   Values  Type
3  Years of Driving Experience  Driving experience of the Policy holder       ?       ?
4  Number of Vehicles           Number of Vehicles insured under the policy   ?       ?
7  Vehicle Age                  Age of the vehicle insured under the policy   ?       ?
9  Losses                       Loss amount incurred under the policy         ?       ?

Multivariate Linear Regression- Data Dictionary

#  Variable                     Description                                   Values                Type
1  Policy Number                Unique value identifying the policy           Unique Policy Number  Identifier
3  Years of Driving Experience  Driving experience of the Policy holder       0, 1, ..., 53         Numerical (Discrete)
4  Number of Vehicles           Number of Vehicles insured under the policy   1, 2, 3, 4            Numerical (Discrete)
6  Married                      Marital status of the Policy holder           Married, Single       Categorical (binary)
7  Vehicle Age                  Age of the vehicle insured under the policy   0, 1, ..., 15         Numerical (Discrete)
9  Losses                       Loss amount incurred under the policy         Range: 13-3,500       Numerical (Continuous)

Multivariate Linear Regression- Response Variable (Losses)

Distribution of Losses:

Percentile  Value
Min         13
5th         67
10th        122
25th        226
50th        355
75th        399
90th        685
97.5th      981
99th        1,204
99.5th      1,366
Max         3,500
Mean        390
Stddev      254

[Scatter plot of Losses (0-4,000) across observations; points above roughly 1,200 flagged as outliers]

Multivariate Linear Regression- Response Variable (Losses)

Distribution:

Percentile  Losses  Capped Losses
Min         13      13
5th         67      67
10th        122     122
50th        355     355
75th        399     399
90th        685     685
97.5th      981     981
99.5th      1,366   1,200
Max         3,500   1,200
Mean        390     386
Stddev      254     229

[Scatter plot of Capped Losses (capped at 1,200) across observations, and grouped frequency distribution of Capped Losses]

Multivariate Linear Regression- Bivariate Profiling

Age:
[Chart: Average Loss and Average Capped Losses vs. Age (0-80), with % Obs on the secondary axis]

Age Band:
[Chart: Average Loss and Average Capped Loss by Age Band (16-25, 26-59, 60+), with % Policies on the secondary axis]

Multivariate Linear Regression- Bivariate Profiling

[Chart: Average Loss vs. Years of Driving Experience (0-52), with % Policies on the secondary axis]

Multivariate Linear Regression- Bivariate Profiling

Gender:
[Chart: Average Loss and Average Capped Losses by Gender (F, M), with % Obs on the secondary axis]

Marital Status:
[Chart: Average Loss and Average Capped Losses by Marital Status (Married, Single), with % Obs on the secondary axis]

Multivariate Linear Regression- Bivariate Profiling

Vehicle Age:
[Chart: Average Loss and Average Capped Losses vs. Vehicle Age (0-15), with % Obs on the secondary axis]

Vehicle Age Band:
[Chart: Average Loss and Average Capped Losses by Vehicle Age Band (0-5, 6-10, 11+), with % Obs on the secondary axis]

Multivariate Linear Regression- Bivariate Profiling

# Vehicles:
[Chart: Average Losses and Average Capped Losses by Number of Vehicles (1-4), with % Obs on the secondary axis]

Fuel Type:
[Chart: Average Loss by Fuel Type (D, P), with % Policies on the secondary axis]

Linear Regression- Preparing MS Excel


Linear Regression- Using MS Excel (Demo.)


Multivariate Linear Regression- Variable Selection

Variable selection to be done on the basis of

Multicollinearity (correlation between independent variables)

Banding of variables e.g. whether to use Age or Age Band (also called custom bands)

Statistical significance of variables tested after performing above two steps

List of independent variables:

1. Age

2. Age Band

3. Years of Driving Experience

4. Number of Vehicles

5. Gender

6. Married

7. Vehicle Age

8. Vehicle Age Band

9. Fuel Type

Multivariate Linear Regression- Variable Selection (Multicollinearity)

Age and Years of Driving Experience are highly correlated (Correlation Coefficient = 0.9972). We can

use either of the variables in regression

Q: Which one to use and which one to reject?

Sol: Fit two separate models using either variable, one at a time. Check the goodness of fit (R2 in this case). The variable producing the higher R2 gets accepted.

Regression Statistics  (Age)      (Years of Driving Experience)
Multiple R             0.475766   0.475273
R Square               0.226354   0.225885
Adjusted R Square      0.226303   0.225834
Standard Error         201.2306   201.2916
Observations           15290      15290

Reject Years of Driving Experience
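The rule applied above — fit one model per correlated candidate and keep the one with the higher R2 — can be sketched as a tiny Python helper:

```python
def pick_by_r2(r2_by_variable):
    # Keep the candidate variable whose single-variable model fits best
    return max(r2_by_variable, key=r2_by_variable.get)

# R-squares from the two single-variable fits above
pick_by_r2({"Age": 0.226354, "Years_Drv_Exp": 0.225885})   # -> "Age"
```

With correlations this high, either variable carries nearly the same information, so the R2 margin deciding the tie is small.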

Multivariate Linear Regression- Custom Bands

Investigate whether to use Age or Age band

Fit regression independently using Age and Age Band

Before fitting regression, Age Band needs to be converted to numerical form from categorical. Replace

Age Band values with Average Age for the particular band.

Age Band Sum of Age # Policies Average Age

16-25 93,770.0 4,563.0 20.6

26-59 270,793.0 6,384.0 42.4

60+ 282,636.0 4,343.0 65.1

Regressions results using Age and Average Age

Regression Statistics  (Age)      (Average Age)
Multiple R             0.475766   0.509969
R Square               0.226354   0.260068
Adjusted R Square      0.226303   0.260020
Standard Error         201.2306   196.7971
Observations           15290      15290

Select Average Age
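The band-to-average-age conversion above is a simple ratio per band; as a sketch, using the figures from the table:

```python
def band_average_age(sum_of_age, n_policies):
    # Average Age for a band = Sum of Age / # Policies in the band
    return sum_of_age / n_policies

round(band_average_age(93770.0, 4563.0), 1)    # 16-25 band -> 20.6
round(band_average_age(270793.0, 6384.0), 1)   # 26-59 band -> 42.4
round(band_average_age(282636.0, 4343.0), 1)   # 60+  band -> 65.1
```

Replacing the categorical band with its average age gives the regression a numeric variable while preserving the banding.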


Multivariate Linear Regression- Custom Bands

Investigate whether to use Vehicle Age or Vehicle Age band

Fit regression independently using Vehicle Age and Vehicle Age Band

Before fitting regression, Vehicle Age Band needs to be converted to numerical form from categorical.

Replace Vehicle Age Band values with Vehicle Average Age for the particular band.

Vehicle Age Band Sum of Vehicle Age # Policies Average Vehicle Age

0-5 9,229 3,688 2.50

6-10 44,298 5,523 8.02

11+ 78,819 6,079 12.97

Regressions results using Vehicle Age and Average Vehicle Age

Regression Statistics  (Vehicle Age)  (Average Vehicle Age)
Multiple R             0.289431       0.303099
R Square               0.083770       0.091869
Adjusted R Square      0.083711       0.091810
Standard Error         218.9903       218.0203
Observations           15290          15290

Select Average Vehicle Age


Multivariate Linear Regression- Variable Selection

List of shortlisted variables:

1. Age Band in the form of Average Age of the band (selected out of Age and Age Band). Also got selected over

Years of Driving Experience.

2. Number of Vehicles

3. Gender

4. Married

5. Vehicle Age Band in the form of Average Vehicle Age of the band (selected out of Vehicle Age and Vehicle

Age Band).

6. Fuel Type

We will run the regression in a multivariate fashion and then select the final list of variables, taking statistical significance into consideration.

Multivariate Linear Regression- Categorical variable conversion

Categorical variables in Binary form need to be converted to their numerical equivalent (0, 1)

1. Gender (F = 0 and M = 1)
2. Fuel Type (P = 0, D = 1)

Snapshot of the final data on which we will run the multivariate regression

Multivariate Linear Regression- Output

SUMMARY OUTPUT

Regression Statistics

Multiple R 0.865972274

R Square 0.749907979

Adjusted R Square 0.749809794

Standard Error 114.4310136

Observations 15290

ANOVA

df SS MS F Significance F

Regression 6 600073213.5 100012202.3 7637.751088 0

Residual 15283 200122584.4 13094.45688

Total 15289 800195798

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%

Intercept 624.56529 5.29192 118.02233 0.00000 614.19249 634.93809 614.19249 634.93809

Avg Age -5.55974 0.06546 -84.93889 0.00000 -5.68804 -5.43144 -5.68804 -5.43144

Number of Vehicles 0.17875 0.97039 0.18420 0.85386 -1.72333 2.08082 -1.72333 2.08082

Gender Dummy 50.88326 1.89081 26.91084 0.00000 47.17705 54.58947 47.17705 54.58947

Married Dummy 78.39837 1.92148 40.80106 0.00000 74.63204 82.16469 74.63204 82.16469

Avg Vehicle Age -15.14220 0.26734 -56.63987 0.00000 -15.66623 -14.61818 -15.66623 -14.61818

Fuel Type Dummy 267.93559 2.74845 97.48614 0.00000 262.54830 273.32287 262.54830 273.32287


Multivariate Linear Regression- Output

1. Coefficients table:
Independent Vars | Coefficients (b) | Standard Error (σ) | t Stat (b/σ) | P-value (t-dist table) | Lower 95% (b - 1.96σ) | Upper 95% (b + 1.96σ)

X1 Avg Age b1 -5.560 0.065 -84.939 0.000 -5.688 -5.431 -5.688 -5.431

X2 Number of Vehicles b2 0.179 0.970 0.184 0.854 -1.723 2.081 -1.723 2.081

X3 Gender Dummy b3 50.883 1.891 26.911 0.000 47.177 54.589 47.177 54.589

X4 Married Dummy b4 78.398 1.921 40.801 0.000 74.632 82.165 74.632 82.165

X5 Avg Vehicle Age b5 -15.142 0.267 -56.640 0.000 -15.666 -14.618 -15.666 -14.618

X6 Fuel Type Dummy b6 267.936 2.748 97.486 0.000 262.548 273.323 262.548 273.323

Number of Vehicles is insignificant (P-value 0.854)

2. ANOVA:
Source      Sum of Squares               df              SS           MS (SS/df)   F (MSReg/MSRes)  Significance F (from F-dist table)
Regression  Σ(y_predicted - y_mean)²     p = 6           600073213.5  100012202.3  7637.75          0
Residual    Σ(y_actual - y_predicted)²   n-p-1 = 15283   200122584.4  13094.457
Total       Σ(y_actual - y_mean)²        n-1 = 15289     800195798

3. Regression Statistics:
Multiple R        = SquareRoot(R Square)             = 0.8659723
R Square          = SS Regression / SS Total         = 0.7499080
Adjusted R Square = R2 - (1 - R2) * {p/(n-p-1)}      = 0.7498098
Standard Error    = SquareRoot{SS Residual/(n-p-1)}  = 114.4310136
Observations      = n                                = 15290


Multivariate Linear Regression- Output (Significance Test)

1. Coefficients table:
Independent Vars | Coefficients (b) | Standard Error (σ) | t Stat (b/σ) | P-value (t-dist table) | Lower 95% (b - 1.96σ) | Upper 95% (b + 1.96σ)

X1 Avg Age b1 -5.560 0.065 -84.939 0.000 -5.688 -5.431 -5.688 -5.431

X2 Number of Vehicles b2 0.179 0.970 0.184 0.854 -1.723 2.081 -1.723 2.081

X3 Gender Dummy b3 50.883 1.891 26.911 0.000 47.177 54.589 47.177 54.589

X4 Married Dummy b4 78.398 1.921 40.801 0.000 74.632 82.165 74.632 82.165

X5 Avg Vehicle Age b5 -15.142 0.267 -56.640 0.000 -15.666 -14.618 -15.666 -14.618

X6 Fuel Type Dummy b6 267.936 2.748 97.486 0.000 262.548 273.323 262.548 273.323

H0: b is no different than 0 (i.e. 0 is the coefficient when the variable is not included in the regression)
H1: b is different than 0

Test statistic, Z = (b - 0)/σ (at a 95% two-tailed confidence level, critical Z = 1.96)
Confidence interval = (b - 1.96σ, b + 1.96σ)
For a variable to be significant, the interval must not contain 0.

Avg Age: Confidence interval = (-5.560 - 1.96*0.065, -5.560 + 1.96*0.065) = (-5.688, -5.431)
No zero in the interval. Hence significant.

Number of Vehicles: Confidence interval = (0.179 - 1.96*0.970, 0.179 + 1.96*0.970) = (-1.723, 2.080)
Zero is present in the interval. Hence insignificant.

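The interval check above can be wrapped as a small Python helper (a sketch, using the large-sample critical value 1.96 as on the slide):

```python
def coef_significance(b, se, z=1.96):
    # 95% two-tailed confidence interval: (b - 1.96*se, b + 1.96*se).
    # The coefficient is significant if the interval excludes zero.
    lo, hi = b - z * se, b + z * se
    return {"t_stat": b / se, "ci": (lo, hi), "significant": not (lo <= 0 <= hi)}

coef_significance(-5.560, 0.065)   # Avg Age: significant
coef_significance(0.179, 0.970)    # Number of Vehicles: not significant
```

The same conclusion can be read directly off the P-values (0.000 vs. 0.854) in the output table.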

Multivariate Linear Regression- Output (Significance Test)

1. Coefficients table:
Independent Vars | Coefficients (b) | Standard Error (σ) | t Stat (b/σ) | P-value (t-dist table) | Lower 95% (b - 1.96σ) | Upper 95% (b + 1.96σ)

X1 Avg Age b1 -5.560 0.065 -84.939 0.000 -5.688 -5.431 -5.688 -5.431

X2 Number of Vehicles b2 0.179 0.970 0.184 0.854 -1.723 2.081 -1.723 2.081

X3 Gender Dummy b3 50.883 1.891 26.911 0.000 47.177 54.589 47.177 54.589

X4 Married Dummy b4 78.398 1.921 40.801 0.000 74.632 82.165 74.632 82.165

X5 Avg Vehicle Age b5 -15.142 0.267 -56.640 0.000 -15.666 -14.618 -15.666 -14.618

X6 Fuel Type Dummy b6 267.936 2.748 97.486 0.000 262.548 273.323 262.548 273.323

b / StdErr(b) ~ t(n-2)

H0: b is no different than 0 (i.e. 0 is the coefficient when the variable is not included in the regression)
H1: b is different than 0

At a 95% two-tailed confidence level and df greater than 120, t = 1.96
Confidence interval = (b - 1.96σ, b + 1.96σ)
For a variable to be significant, the interval must not contain 0.

Avg Age: Confidence interval = (-5.560 - 1.96*0.065, -5.560 + 1.96*0.065) = (-5.688, -5.431)
No zero in the interval. Hence significant.

Number of Vehicles: Confidence interval = (0.179 - 1.96*0.970, 0.179 + 1.96*0.970) = (-1.723, 2.080)
Zero is present in the interval. Hence insignificant.


Multivariate Linear Regression- Output at 95% Confidence Interval

SUMMARY OUTPUT

Regression Statistics  Without "Number of Vehicles"  With "Number of Vehicles"
Multiple R             0.865971953                   0.865972274
R Square               0.749907424                   0.749907979
Adjusted R Square      0.749825608                   0.749809794   <- Adjusted R-square improved after dropping the variable
Standard Error         114.4273971                   114.4310136
Observations           15290                         15290

ANOVA

df SS MS F Significance F

Regression

Residual 15284 200123028.7 13093.6292

Total 15289 800195798

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%

Intercept 625.005 4.723 132.333 0.00 615.7474 634.2625 615.7474 634.2625

Avg Age -5.560 0.065 -84.942 0.00 -5.6879 -5.4314 -5.6879 -5.4314

Gender Dummy 50.883 1.891 26.912 0.00 47.1768 54.5890 47.1768 54.5890

Married Dummy 78.402 1.921 40.806 0.00 74.6356 82.1677 74.6356 82.1677

Avg Vehicle Age -15.142 0.267 -56.641 0.00 -15.6660 -14.6180 -15.6660 -14.6180

Fuel Type Dummy 267.935 2.748 97.489 0.00 262.5480 273.3223 262.5480 273.3223

Multivariate Linear Regression- Regression Equation

Predicted Losses = 625.004932715948
- 5.5596551344537 * Avg Age + 50.8828923910091 * Gender Dummy +
78.4016899779131 * Married Dummy - 15.1420259903571 * Avg Vehicle Age +
267.935139741526 * Fuel Type Dummy

Interpretation:

Coefficients     Value    Sign of Coefficient  Inference
Intercept        625.005
Avg Age          -5.560   -ve                  Higher is the age, lower is the loss
Gender Dummy     50.883   +ve                  Average Loss for Males is higher than for Females
Married Dummy    78.402   +ve                  Average Loss for Singles is higher than for Married
Avg Vehicle Age  -15.142  -ve                  Older is the vehicle, lower are the losses
Fuel Type Dummy  267.935  +ve                  Losses are higher for Fuel Type D
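The fitted equation can be wrapped in a small scoring function (a Python sketch, with coefficients rounded from the output above):

```python
def predicted_losses(avg_age, gender_dummy, married_dummy,
                     avg_vehicle_age, fuel_type_dummy):
    # Predicted Losses from the fitted multivariate model (rounded coefficients)
    return (625.005
            - 5.560 * avg_age
            + 50.883 * gender_dummy
            + 78.402 * married_dummy
            - 15.142 * avg_vehicle_age
            + 267.935 * fuel_type_dummy)

# Negative Avg Age coefficient: older drivers -> lower predicted losses,
# holding the other variables fixed.
predicted_losses(20.6, 1, 0, 2.5, 1) > predicted_losses(65.1, 1, 0, 2.5, 1)   # True
```

Scoring new policies is then just evaluating this function on their (transformed) variable values.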

Multivariate Linear Regression- Residual Plot

Residual plot:
Residuals calculated as Actual Capped Losses - Predicted Capped Losses
Residuals should have a uniform distribution, else there's some bias in the model
Except for a few observations (circled in red), residuals are uniformly distributed

[Scatter plot of residuals (roughly -400 to 1,200) across observations]

Multivariate Linear Regression- Gains Chart and Gini

The Gains chart is used to represent the effectiveness of a model's predictions, which is quantified by means of the Gini Coefficient.

Methodology illustrated using MS Excel:

Bin  # Policies  Predicted Loss  Actual Loss  Cumulative Actual Loss  % Obs  Cumulative % Obs  % Cumulative Actual Loss  Area Under Gains Curve
0    0           0               0            0                       0      0                 0                         0
1    1528        1,167,070       1,230,474    1,230,474               10%    10%               20.87%                    0.0104
2    1529        1,046,034       991,944      2,222,418               10%    20%               37.69%                    0.0293
3    1529        757,330         746,854      2,969,272               10%    30%               50.36%                    0.0440
4    1529        589,366         552,534      3,521,806               10%    40%               59.73%                    0.0550
5    1529        531,160         553,919      4,075,725               10%    50%               69.12%                    0.0644
6    1529        485,428         477,284      4,553,009               10%    60%               77.22%                    0.0732
7    1529        432,934         385,411      4,938,420               10%    70%               83.75%                    0.0805
8    1529        385,595         423,814      5,362,234               10%    80%               90.94%                    0.0873
9    1529        308,050         310,846      5,673,081               10%    90%               96.21%                    0.0936
10   1530        193,465         223,351      5,896,432               10%    100%              100.00%                   0.0981

Gini Coefficient = 0.27177

[Charts: Predicted Loss and Actual Loss by bins of equal # policies, and the cumulative gains curve (% Cumulative Actual Loss vs. Cumulative % Obs)]
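The area and Gini calculation in the table can be reproduced with the trapezoid rule; a Python sketch, where the cumulative shares are the "% Cumulative Actual Loss" column from the table:

```python
def gini_from_gains(cum_loss_shares):
    # cum_loss_shares: cumulative share of actual losses per equal-size bin,
    # bins ordered by descending predicted loss. Area under the gains curve
    # is computed by the trapezoid rule; Gini = 2 * (AUC - 0.5), where 0.5
    # is the area under the 45-degree "random" line.
    n = len(cum_loss_shares)
    points = [0.0] + list(cum_loss_shares)
    auc = sum((points[i] + points[i + 1]) / 2 * (1.0 / n) for i in range(n))
    return 2 * (auc - 0.5)

shares = [0.2087, 0.3769, 0.5036, 0.5973, 0.6912,
          0.7722, 0.8375, 0.9094, 0.9621, 1.0000]
gini_from_gains(shares)   # ~0.2718, matching the table's 0.27177
```

A random ranking gives a straight 45-degree curve and a Gini of 0; the further the gains curve bows above that line, the higher the Gini.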

Heteroskedasticity

When the requirement of a constant variance is violated, we have a condition of

heteroskedasticity.

[Plot of Error u against Predicted y]

Unconditional and Conditional Heteroskedasticity

Presence of heteroskedasticity in the data is the violation of the assumption about the constant

variance of the residual term.

Heteroskedasticity takes the following two forms, unconditional and conditional.

Unconditional Heteroskedasticity is present when the variance of the residual terms is not related to the values of the independent variables.

Unconditional Heteroskedasticity doesn't pose any problem in regression analysis, as the variance doesn't change systematically.

Conditional Heteroskedasticity poses problems in regression analysis, as the residuals are systematically related to the independent variables.

Y = b0 + b1X

[Plot: fitted line Y = b0 + b1X, with low variance of residual terms at low X and high variance of residual terms at high X]


Detecting Heteroskedasticity

Heteroskedasticity can be detected either by viewing the scatter plots as discussed in the previous

case or by Breusch-Pagan chi-square test.

In the Breusch-Pagan chi-square test, the squared residuals are regressed on the independent variables to check whether the independent variables explain a significant proportion of the squared residuals or not.

If the independent variables explain a significant proportion of the squared residuals then we

conclude that the conditional heteroskedasticity is present otherwise not.

Breusch-Pagan test statistic follows a chi-square distribution with k degrees of freedom, where k is

the number of independent variables.

BP Chi-Square Test Statistic = n * R²_resid

where:
n: number of observations
R²_resid: coefficient of determination when the squared residuals are regressed on the independent variables
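The BP statistic itself is just n times the auxiliary R²; a Python sketch, with the decision made against a chi-square critical value supplied by the caller:

```python
def bp_statistic(n, r2_resid):
    # Breusch-Pagan chi-square statistic = n * R^2 of the auxiliary
    # regression of squared residuals on the independent variables
    # (df = number of independent variables, k).
    return n * r2_resid

def bp_reject(n, r2_resid, chi2_critical):
    # Reject homoskedasticity if the statistic exceeds the critical value
    return bp_statistic(n, r2_resid) > chi2_critical

# Illustrative numbers (not from the slides): n = 100, auxiliary R^2 = 0.09,
# chi-square critical value at 5% with k = 1 is 3.84
bp_reject(100, 0.09, 3.84)   # True -> conditional heteroskedasticity
```

In practice the auxiliary R² comes from actually fitting the residual regression; statistical packages (e.g. the lmtest package in R) automate the whole test.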

Correcting for Heteroskedasticity

There are two methods for correcting the effects of conditional heteroskedasticity

Robust Standard Errors (also called heteroskedasticity-consistent standard errors)

Correct the standard errors of the linear regression model's estimated coefficients to account for conditional heteroskedasticity

Generalized Least Squares

Modifies the original equation in an attempt to eliminate heteroskedasticity.

Statistical packages are available for computing robust standard errors.

Multivariate Linear Regression- Heteroskedasticity (Age)

Age Band  Mean  Variance
16-25     514   72,460
26-59     414   24,652
60+       208   21,778

Showing heteroskedasticity.

Multivariate Linear Regression- Heteroskedasticity (Gender)

Gender  Mean  Variance
F       342   35,277
M       431   65,878

Showing heteroskedasticity.

Multivariate Linear Regression- Heteroskedasticity (Married)

Married  Mean  Variance
Married  323   30,380
Single   451   66,779

Showing heteroskedasticity.

Multivariate Linear Regression- Heteroskedasticity (Vehicle Age)

Vehicle Age Band  Mean  Variance
0-5               509   65,688
6-10              369   39,155
11+               325   43,066

Showing heteroskedasticity.

Multivariate Linear Regression- Heteroskedasticity (Fuel Type)

Fuel  Mean  Variance
D     706   33,862
P     286   16,400

Showing heteroskedasticity.
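The per-segment mean/variance tables above come from a simple group-by; a Python sketch with made-up sample values (the real tables use the full policy data):

```python
from collections import defaultdict

def group_mean_variance(pairs):
    # pairs: (group label, response value) -> {group: (mean, population variance)}
    groups = defaultdict(list)
    for label, value in pairs:
        groups[label].append(value)
    stats = {}
    for label, values in groups.items():
        m = sum(values) / len(values)
        var = sum((v - m) ** 2 for v in values) / len(values)
        stats[label] = (m, var)
    return stats

# Widely different group variances hint at heteroskedasticity
sample = [("D", 600), ("D", 800), ("P", 280), ("P", 292)]
group_mean_variance(sample)   # {"D": (700.0, 10000.0), "P": (286.0, 36.0)}
```

Comparing the variance column across groups is the quick visual check applied in each of the tables above.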

Multivariate Linear Regression- Fixing Heteroskedasticity

Univariate scenario:

Find the Standard Deviation of response variable for the different levels of independent variable

Divide the individual values of the response variable by the respective standard deviation

The scaled values become the new response variable

E.g. if variable is Fuel Type

If Fuel Type is D, divide Capped Losses by SquareRoot(33862) = 184

If Fuel Type is P, divide Capped Losses by SquareRoot(16400) = 128

Multivariate scenario:

Create all possible unique combinations of independent variables

For each of the combinations, find Standard Deviations

Divide the individual values of the response variable by the respective standard deviation

Too cumbersome to do manually using MS Excel. Also the process is iterative.

More convenient to do using Statistical packages like R.

Course approach

First fit a multivariate regression without fixing heteroskedasticity to get a final set of significant variables

Then do manual adjustment and re-fit the regression using MS Excel. This will be just for demonstration, as manual adjustment is always questionable.

Demonstrate linear regression using R
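The univariate fix above — scale each response value by its segment's standard deviation — is a one-liner; a Python sketch using the fuel-type variances from the earlier slide:

```python
import math

# Variance of Capped Losses by fuel type (from the earlier slide)
segment_variance = {"D": 33862, "P": 16400}

def standardized_loss(capped_loss, fuel_type):
    # Divide the response by the std dev of its segment:
    # sqrt(33862) ~ 184 for D, sqrt(16400) ~ 128 for P
    return capped_loss / math.sqrt(segment_variance[fuel_type])

standardized_loss(706, "D")   # ~3.84
standardized_loss(286, "P")   # ~2.23
```

After scaling, each segment's standardized response has unit variance, which is exactly the constant-variance condition the regression assumes.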

Multivariate Linear Regression- Fixing Heteroskedasticity (Demo.)

1. Segment the data by the independent variables - Avg Age, Gender Dummy, Married Dummy, Avg Vehicle Age and Fuel Type Dummy.

2. Find the Standard Deviation of Capped Losses for the segments. (Detailed methodology explained in MS Excel.)

3. Calculate Standardized Capped Losses as Capped Losses / Segment Std Dev. This becomes the new response variable.

Manually doing this kind of exercise can be flawed, as some of the segments could be sparsely populated.
This demo is just to explain the underlying technique/methodology.
Statistical packages like SAS and R have in-built capability to take care of this.

Multivariate Linear Regression- Fixing Heteroskedasticity (Demo.)

SUMMARY OUTPUT

Regression Statistics

Multiple R 0.359167467

R Square 0.129001269

Adjusted R Square 0.128716331

Standard Error 4.77078689

Observations 15290

Note: the Fuel Type Dummy is now insignificant, which is questionable as D and P have significantly different mean losses.

ANOVA

df SS MS F Significance F

Regression 5 51522.10 10304.42 452.73 0

Residual 15284 347870.07 22.76

Total 15289 399392.17

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%

Intercept 12.476 0.197 63.374 0.000 12.091 12.862 12.091 12.862

Avg Age -0.086 0.003 -31.554 0.000 -0.091 -0.081 -0.091 -0.081

Gender Dummy 0.213 0.079 2.702 0.007 0.058 0.368 0.058 0.368

Married Dummy -0.204 0.080 -2.552 0.011 -0.361 -0.047 -0.361 -0.047

Avg Vehicle Age -0.376 0.011 -33.770 0.000 -0.398 -0.354 -0.398 -0.354

Fuel Type Dummy 0.136 0.115 1.188 0.235 -0.088 0.361 -0.088 0.361

Multivariate Linear Regression- Using R

Step1: Download and install R software from http://www.r-project.org/

Step2: Convert the data to R readable format e.g. *.csv.

D:\Business Analytics\Linear Regression\Data.csv

Writing R code for

Reading the data

Fitting the Linear Regression

setwd("D:/Business Analytics/Linear Regression")

DefaultData<-read.csv("Data.csv")

Code Output

DefaultData

Code Output

View(DefaultData)

Code Output

head(DefaultData)

tail(DefaultData)

Code Output

summary(DefaultData)

Code Output

plot(DefaultData$Losses)

Code Output

quantile(DefaultData$Losses, c(0,0.05,0.1,0.25,0.5,0.75,0.90,0.95,0.99,0.995,1))

DefaultData$CappedLosses<-ifelse(DefaultData$Losses>1200,1200,DefaultData$Losses)

summary(DefaultData)

Code Output

DefaultData3<-DefaultData[,-c(9)]

names(DefaultData3)

Code Output

#Generate plots to see the relation between the independent variables and the dependent variable

plot(DefaultData3$Age,DefaultData3$CappedLosses)

plot(DefaultData3$Years.of.Driving.Experience,DefaultData3$CappedLosses)

plot(DefaultData3$Number.of.Vehicles,DefaultData3$CappedLosses)

plot(DefaultData3$Gender,DefaultData3$CappedLosses)

plot(DefaultData3$Married,DefaultData3$CappedLosses)

plot(DefaultData3$Vehicle.Age,DefaultData3$CappedLosses)

plot(DefaultData3$Fuel.Type,DefaultData3$CappedLosses)

Code Output

Code Output

#Need to see the Losses distribution across all independent variables. Pivot table in R with melt and cast

install.packages("reshape")

library("reshape")

#First look at the Data names, by which we want to create pivot table

names(DefaultData3)

#Melt the data: melt will identify items to be summed/averaged (called measures) and separate out the id variables by which we want to add/average

data.m<-melt(DefaultData3, id=c(1:8), measure=c(9))

#Let's look at the different values of our new object

head(data.m)

#CappedLosses have been melted

Code Output

cast(data.m, Age~variable,fun.aggregate=sum)

data.c<-cast(data.m, Age~variable,mean)

data.c

data.c<-cast(data.m, Age~variable,c(sum,mean))

data.c

Code Output

DefaultData3$AgeBand<-ifelse(DefaultData3$Age<=25,"16-25",

ifelse(DefaultData3$Age>=60,"60+","26-59"))

head(DefaultData3)

tail(DefaultData3)

Code Output

data.ageband<-aggregate(Age~AgeBand,data=DefaultData3,mean)

data.ageband

DefaultData3<-merge(DefaultData3,data.ageband, by="AgeBand")

View(DefaultData3)

#We can export data from R to excel

write.csv(DefaultData3,"Data1.csv")

#Similarly we can convert Vehicle Age to Vehicle Age Band

Code Output

DefaultData3$GenderDummy<-ifelse(DefaultData3$Gender=="M",1,0)

#Similarly we can covert other categorical variables to Dummy Variables

names(DefaultData3)

summary(DefaultData3)

Code Output

#We will use the data which has been converted into bands and dummy variables. Read the final Data

DefaultData4<-read.csv("Linear_Reg_Sample_Data.csv")

names(DefaultData4)

Code Output

install.packages("car")

names(DefaultData4)

library("car")

Code Output

vif_data<-lm(Capped_Losses~Years_Drv_Exp+Number_Vehicles+Average_Age+Gender_Dummy+Married_Dummy+Avg_Veh_Age+Fuel_Type_Dummy,data=DefaultData4)

vif(vif_data)

Code Output

#Compare R-square of Average_Age and Years_Drv_Exp to check which performs better

age1<-lm(Capped_Losses~Average_Age,data=DefaultData4)

drv1<-lm(Capped_Losses~Years_Drv_Exp,data=DefaultData4)

summary(age1)

summary(drv1)

#keep Average_Age and remove Years_Drv_Exp

Code Output

#Run Linear Regression w/o Years_Drv_Exp

lin_r1<-lm(Capped_Losses~Number_Vehicles+Average_Age+Gender_Dummy+Married_Dummy+Avg_Veh_Age+Fuel_Type_Dummy,data=DefaultData4)

summary(lin_r1)

#Remove Number_Vehicles

Not Significant

Code Output

#Run Linear Regression w/o Number_Vehicles

lin_r2<-lm(Capped_Losses~Average_Age+Gender_Dummy+Married_Dummy+Avg_Veh_Age+Fuel_Type_Dummy,data=DefaultData4)

summary(lin_r2)

Code Output

#Variance Covariance Matrix

install.packages("sandwich")

library("sandwich")

vcovHC(lin_r2,omega=NULL, type="HC4")

Code Output

#Fixing Heteroskedasticity using Variance-Covariance matrix

install.packages("lmtest")

library("lmtest")

coeftest(lin_r2,df=Inf,vcov=vcovHC(lin_r2,type="HC4"))

coeftest() fixes the standard errors by taking the variance-covariance matrix into account.
There is negligible change in the coefficients.

Thank you!

help@edupristine.com

www.edupristine.com/ca


- Chapter 25 - Discriminant AnalysisCargado porUmar Farooq Attari
- Study Notes ResearchCargado porNipun Goyal
- An Introduction to Regression AnalysisCargado porNipun Goyal

- Prob & Random Process Q.B (IV Sem)Cargado porvasu6842288
- Week5Assignment_Lima-Gonzalez,C. (extension).docxCargado porConstance Lima-gonzalez
- lesson5.pdfCargado pormovic
- Sample ResultCargado porneln
- retail service qualityCargado porHCGSJ
- customer satisfaction towards big bazarCargado porAnsh
- 87-ExpatriateFailure (1).pdfCargado porAnonymous 4RgSOoVV
- Quality Assurance Standards PublishedCargado porCharles Guandaru Kamau
- Penman FullCargado porHe Nry So Ediarko
- AN EVALUATION STUDY OF GENERAL SOFTWARE PROJECT RISK BASEDON SOFTWARE PRACTITIONERS EXPERIENCESCargado porAnonymous Gl4IRRjzN
- Ho_correlation t PhiCargado porjuandostres
- Correlation CoefficeintCargado porKiran Basu
- HW1_PIYUSH_DOCUMENT.docxCargado porPiyush Chaudhari
- Influence of Social Technical Factors on ICT Readiness for Primary Schools in Bungoma County, KenyaCargado porIsaac Batoya
- 17_RubashviliCargado porDIta
- Summary of Formula - StatisticsCargado porEzekiel D. Rodriguez
- an imparical study of the iso 9000 standards contribution towards total quality managementCargado porNoman Haq
- SAS Studio 3.5 User GuideCargado poraslam844
- ASL [Compatibility Mode]Cargado porHarold John
- Vasicek a Series Expansion for the Bivariate Normal IntegralCargado porginovainmona
- Roychowduri Earnings Management Through Real Activities ManipulationCargado porarfankaf
- DPA PresentationCargado porNgoc Quy Tran
- Assignment 4Cargado porSoonang Rai
- libro pag. 211-329.pdfCargado porLindaBravo
- Use of Windows for Harmonic Analysis With DFTCargado porNatália Cardoso
- lec5Cargado porToulouse18
- Determinants of Academic Performance Among Senior h Igh School Shs Students in the Ashanti Mampong Municipality of GhanaCargado porNix
- AUK-CONDOR Instructions for AuthorsCargado porAlexChang
- Mb0034 AssignmentCargado pormoshiurrah
- Correlation and RegressionCargado porKumar Swami