
Chapter 14 An Introduction to Multiple Linear Regression

In our previous example, we assumed that sales were linearly related to the no. of emails sent. When we ran a regression model based on a sample of customers, we found that there was a weak positive linear relationship, and we could not conclude that there was a positive linear relationship in our population. Suppose that, when we ran our model, we forgot to take into account that different customers were offered different discounts when ordering from this company, and that these discounts, along with the original data, were as follows:

customer   no. of emails (x1)   % discount (x2)   sales (y)
    1              2                  15              70
    2              4                   5              30
    3              6                  10              80
    4              8                   5              20
    5             10                  15             110
    6             12                  10             100
    7             14                   5              54
    8             16                  10             120

With this additional data, we may believe that there is a linear relationship between sales and both the no. of emails and the % discount, and that this true relationship looks like:

$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$

where,
$\beta_0$ = y-intercept
$\beta_1$ = true effect that the no. of emails has on sales
$\beta_2$ = true effect that the % discount has on sales
$\varepsilon$ = a symbol that represents random fluctuations
As with simple linear regression, our initial objective is to find the best estimate of this linear relationship,

$\hat{y}_i = b_0 + b_1 x_{1i} + b_2 x_{2i}$

and use this best estimate to, among other things:
- measure the strength of this linear relationship in our sample
- test for the usefulness of this model
- test the effects of specific X's on Y

Mathematically, it becomes time-consuming and sometimes difficult to do the necessary calculations manually, so we rely on computer packages to do them. For this particular example, Excel produced the following output:
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.9360
R Square            0.8761
Adjusted R Square   0.8265
Standard Error      15.2509
Observations        8

ANOVA
             df    SS          MS          F         Significance F
Regression    2    8221.0554   4110.5277   17.6729   0.0054
Residual      5    1162.9446    232.5889
Total         7    9384

                 Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept        -36.1373       19.1622          -1.8859   0.1180    -85.3954    13.1208
no. of emails      4.7450        1.1950           3.9706   0.0106      1.6731     7.8168
discount           7.0861        1.4030           5.0506   0.0039      3.4795    10.6928

From this output, we see that:

$\hat{y}_i = -36.1373 + 4.7450(\text{no. of emails}) + 7.0861(\text{\% discount})$

and that:

$R^2 = \frac{SSR}{SST} = .8761$

or, that 87.61% of the variation in sales in our sample can be explained by its linear relationship to the no. of emails sent and/or the % discounts offered.
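These notes use Excel, but the same estimates can be reproduced with a few lines of code. The following is a minimal sketch in Python (an illustration added to these notes, assuming numpy is available) that fits the two-variable model by least squares on the data above:

```python
import numpy as np

# Data from the table above
emails   = np.array([2, 4, 6, 8, 10, 12, 14, 16], dtype=float)          # x1
discount = np.array([15, 5, 10, 5, 15, 10, 5, 10], dtype=float)         # x2
sales    = np.array([70, 30, 80, 20, 110, 100, 54, 120], dtype=float)   # y

# Design matrix: a column of ones for the intercept, then x1 and x2
X = np.column_stack([np.ones_like(emails), emails, discount])

# Least-squares estimates of (b0, b1, b2)
b, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(b)  # approx [-36.1373, 4.7450, 7.0861], matching the Excel output

# R^2 = SSR / SST = 1 - SSE / SST
fitted = X @ b
sst = np.sum((sales - sales.mean()) ** 2)  # total sum of squares (9384)
sse = np.sum((sales - fitted) ** 2)        # residual sum of squares
print(1 - sse / sst)                       # approx 0.8761
```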
Comparing this to our original model, where

$\hat{y}_i = 39.7855 + 3.6905(\text{no. of emails})$

and

$R^2 = .2438$

two observations are obvious:


- the new intercept is different from the original intercept, and the estimated effect that the no. of emails has on sales is also different
- the R² with the new model is larger than the R² with the original model
These observations lead us to the following:
Notes:
- Multiple linear regression is not restricted to only two independent variables. There could be any number of independent variables, k, as long as $k \le n - 2$. In general, the true linear relationship can be expressed as:

$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i$

and the estimated model as:

$\hat{y}_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \dots + b_k x_{ki}$

- Changing a model by either adding or deleting independent variables will most likely result in different estimates of the y-intercept and different estimates of the effects that particular X's have on Y. (In this example, the y-intercept has changed dramatically while the estimated effect that the no. of emails has on sales has increased by about 29%.)
- Adding variables to an existing model will always increase R². By how much R² increases depends on how strong the effects of these additional variables on Y are. (In this example the increase in R² is substantial.)
Cautions:
- Because different models will most likely result in different estimated effects and possibly different inferences, it is very important that we estimate the correct model before we reach conclusions based on the estimated effects and make decisions based on these conclusions.
- Because we can always increase R² by adding independent variables even though they may be useless in explaining Y, we can calculate the adjusted R², which may either increase, stay the same, or decrease depending on how useful these additional variables are in explaining Y.
The formula for the adjusted R² is

$R^2_{adj} = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}$

where k = number of independent variables. In this example,

$R^2_{adj} = 1 - (1 - .8761)\,\frac{7}{5} = .8265$

which is substantially higher than the $R^2_{adj}$ from the original model ($R^2_{adj} = .1178$). This large increase is a strong indication that the % discount is useful in explaining sales.
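As a quick check of the arithmetic above, here is a small Python sketch (an illustration added to these notes, not from the original) that computes the adjusted R² for both models:

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.8761, n=8, k=2))  # approx 0.8265 (new model)
print(adjusted_r2(0.2438, n=8, k=1))  # approx 0.1178 (original model)
```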

Now that we have estimated the linear relationship and determined its strength based on the sample data, we can proceed to make inferences concerning the model. As previously mentioned, we can make inferences concerning:
- the usefulness of the model
- the effects that individual independent variables have on Y
Usefulness of the Model
As mentioned in the previous set of notes, the ANOVA table produced by Excel allows us
to test the usefulness of the model.
In general,

$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$   (model is useless)
$H_1$: not all of the above $\beta$'s = 0   (model is useful)

Decision rule

$R: F_{calc} > F_\alpha$, where $v_1 = k$ and $v_2 = n - k - 1$

or,

$R$: p-value $< \alpha$   (where p-value = Significance F in the ANOVA table)

Test statistic

$F_{calc} = \frac{MSR}{MSE}$

Suppose in this example we want to test, at $\alpha = .05$, whether the no. of emails and/or the % discount is useful in explaining sales. Using Excel's ANOVA table:

ANOVA
             df    SS          MS          F         Significance F
Regression    2    8221.0554   4110.5277   17.6729   0.0054
Residual      5    1162.9446    232.5889
Total         7    9384

and the test would look like:

$H_0: \beta_1 = \beta_2 = 0$   (model is useless)
$H_1: \beta_1 \ne 0$ and/or $\beta_2 \ne 0$   (model is useful)

$\alpha = .05$

Decision rule

$R: F_{calc} > F_{.05} = 5.79$, where $v_1 = k = 2$ and $v_2 = n - k - 1 = 5$

or,

$R$: p-value $< \alpha$

Test statistic

$F_{calc} = \frac{MSR}{MSE} = 17.6729$

As 17.6729 > 5.79 (or as .0054 < .05), we reject $H_0$ at $\alpha = .05$.


This model is useful in explaining sales.
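If a statistics package is available, the critical value and p-value for this F test need not be looked up in a table. Here is a sketch using Python's scipy.stats (an assumption added to these notes; any stats package would do):

```python
from scipy import stats

msr, mse = 4110.5277, 232.5889   # MS values from the ANOVA table
f_calc = msr / mse               # approx 17.6729

k, n = 2, 8                      # 2 independent variables, 8 observations
f_crit = stats.f.ppf(0.95, dfn=k, dfd=n - k - 1)    # F_.05 with (2, 5) df, approx 5.79
p_value = stats.f.sf(f_calc, dfn=k, dfd=n - k - 1)  # approx 0.0054

print(f_calc > f_crit, p_value < 0.05)  # True True -> reject H0: the model is useful
```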
Depending on the purpose of estimating our relationship, we may also want to test whether a specific X has an effect on Y, a positive effect on Y, or a negative effect on Y. Or, we may want to test whether a specific X has a specific effect on Y (e.g. $H_1: \beta_2 > 3$).
Testing a Specific β
Testing a specific effect using the results of multiple linear regression is essentially the same as testing that specific effect within a simple linear regression model, if the various statistics are calculated by some computer package such as Excel. As before,

If $H_0: \beta_j = \beta_j^0$

the test statistic is:

$t_{calc} = \frac{b_j - \beta_j^0}{s_{b_j}}$, with df = $n - k - 1$

and the rejection region is either:

$R: t_{calc} > t_\alpha$   if $H_1: \beta_j > \beta_j^0$
$R: t_{calc} < -t_{\alpha/2}$ or $t_{calc} > t_{\alpha/2}$   if $H_1: \beta_j \ne \beta_j^0$
$R: t_{calc} < -t_\alpha$   if $H_1: \beta_j < \beta_j^0$
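These three rejection regions can be packaged into a small helper. The sketch below is illustrative Python, not part of the original notes; the function name and its "alternative" labels are invented for this example, and scipy.stats is assumed for the t critical values:

```python
from scipy import stats

def reject_h0(t_calc: float, df: int, alpha: float, alternative: str) -> bool:
    """Rejection region for H0: beta_j = beta_j0 against the stated H1."""
    if alternative == "greater":      # H1: beta_j > beta_j0
        return t_calc > stats.t.ppf(1 - alpha, df)
    if alternative == "two-sided":    # H1: beta_j != beta_j0
        return abs(t_calc) > stats.t.ppf(1 - alpha / 2, df)
    if alternative == "less":         # H1: beta_j < beta_j0
        return t_calc < stats.t.ppf(alpha, df)
    raise ValueError("alternative must be 'greater', 'two-sided' or 'less'")
```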

To answer a hypothesis testing question (or an estimation question), we would use the
following table:

                 Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept        -36.1373       19.1622          -1.8859   0.1180    -85.3954    13.1208
no. of emails      4.7450        1.1950           3.9706   0.0106      1.6731     7.8168
discount           7.0861        1.4030           5.0506   0.0039      3.4795    10.6928

where the appropriate test statistic would be found under the t Stat column if $H_0: \beta_j = 0$. If $H_0: \beta_j = \beta_j^0$ where $\beta_j^0 \ne 0$, we would have to calculate $t_{calc} = \frac{b_j - \beta_j^0}{s_{b_j}}$, where the appropriate $b_j$ is found under the Coefficients column and the appropriate $s_{b_j}$ is found under the Standard Error column. As before, if $H_0: \beta_j = 0$, the appropriate two-tailed p-value can be found under the P-value column.


Using our example, we will answer the following questions, both using $\alpha = .01$:
1. Is there sufficient evidence to indicate that the no. of emails sent has an effect on sales?
2. Is there sufficient evidence to indicate that an additional 1% discount will result in more than a $5 increase in sales?
To answer the first question:

$H_0: \beta_1 = 0$
$H_1: \beta_1 \ne 0$

$\alpha = .01$

Decision rule

$R: t_{calc} < -t_{.005} = -4.032$ or $t_{calc} > t_{.005} = 4.032$, with df = $n - k - 1 = 5$

or,

$R$: p-value $< \alpha = .01$

Test statistic

$t_{calc} = \frac{b_1 - 0}{s_{b_1}} = \frac{4.7450}{1.1950} = 3.9706$

As 3.9706 falls between -4.032 and 4.032 (or as .0106 > .01), we don't reject $H_0$ at $\alpha = .01$.
There is insufficient evidence to indicate that the number of emails has an effect on sales.
To answer the second question:

$H_0: \beta_2 = 5$
$H_1: \beta_2 > 5$

$\alpha = .01$

Decision rule

$R: t_{calc} > t_{.01} = 3.365$, with df = $n - k - 1 = 5$

Test statistic

$t_{calc} = \frac{b_2 - \beta_2^0}{s_{b_2}} = \frac{7.0861 - 5}{1.4030} = 1.487$

As 1.487 < 3.365, we don't reject $H_0$ at $\alpha = .01$.


There is insufficient evidence to indicate that an additional 1% discount will result in
more than a $5 increase in sales.
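Both questions can be checked numerically. Here is a short Python sketch (added for illustration, assuming scipy is available), using the coefficients and standard errors from the table above:

```python
from scipy import stats

df = 5  # n - k - 1 = 8 - 2 - 1

# Question 1: H0: beta1 = 0 vs H1: beta1 != 0 at alpha = .01 (two-sided)
t1 = (4.7450 - 0) / 1.1950                  # approx 3.9706
t_crit_2s = stats.t.ppf(1 - 0.01 / 2, df)   # approx 4.032
print(abs(t1) > t_crit_2s)                  # False -> don't reject H0

# Question 2: H0: beta2 = 5 vs H1: beta2 > 5 at alpha = .01 (one-sided)
t2 = (7.0861 - 5) / 1.4030                  # approx 1.487
t_crit_1s = stats.t.ppf(1 - 0.01, df)       # approx 3.365
print(t2 > t_crit_1s)                       # False -> don't reject H0
```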
Observations:
- This last table also gives us the 95% confidence intervals for each of the $\beta$'s. For example, the 95% confidence interval for $\beta_1$ is [1.6731, 7.8168], which can be interpreted as: we are 95% confident that each additional email sent out increases sales by between $1.6731 and $7.8168, with the best estimate being $4.7450, as long as the number of emails ranges between 2 and 16. (Without the % discount being included in the model, as was the case in our original model, the interval was [-2.8017, 10.1827].) The narrower interval with the new model is another indication that the new model is better than the original model. (A sketch of this interval calculation appears after these observations.)

- In this example, we concluded that the model was useful at $\alpha = .05$ but, based on the p-value of .0054, we would have also concluded that the model was useful at $\alpha = .01$.
- It is possible that we could have concluded that the model was useful at a particular $\alpha$ and not have concluded that any individual variables had an effect on Y. Conversely, we could have concluded that some specific variables had an effect on Y without concluding that the model was useful.
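The confidence interval mentioned in the first observation can be reconstructed from the coefficient, its standard error, and the t critical value. A minimal Python sketch (illustrative, assuming scipy):

```python
from scipy import stats

b1, s_b1, df = 4.7450, 1.1950, 5    # from the coefficient table
t_crit = stats.t.ppf(0.975, df)     # t_.025 with 5 df, approx 2.5706

lower, upper = b1 - t_crit * s_b1, b1 + t_crit * s_b1
print(lower, upper)                 # approx 1.6731, 7.8168
```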
Caution:
When testing several pairs of hypotheses, it is more likely that at least one of these hypotheses would be rejected just by chance. (An analogy: if a person were to pick a number from 1 to 20 and you, claiming to be a mind reader, tried to guess his/her number, eventually you would guess the correct number without being able to read the person's mind.) Therefore, for your conclusions to be credible, the number of tests conducted should be kept to a minimum. Only test what is necessary.
One last comment:
Suppose we believe that there is a qualitative variable that affects Y; we can take that into account in our model. In our example, suppose we believe that a person's citizenship affects sales, and the two citizenships in this case are Canadian and American. We can incorporate citizenship in our model by including a dummy variable as one of the independent variables. A dummy variable can assume two possible values: 0 or 1. Which value we assign to Canadians and which value we assign to Americans is arbitrary, as long as we don't assign the same value to both.
Suppose in this example the odd-numbered customers are Canadians and the even-numbered customers are Americans, and we assign the value 1 to Canadians and the value 0 to Americans. Our data set would now look like:
customer   no. of emails (x1)   % discount (x2)   citizenship (x3)   $ sales (y)
    1              2                  15                 1               70
    2              4                   5                 0               30
    3              6                  10                 1               80
    4              8                   5                 0               20
    5             10                  15                 1              110
    6             12                  10                 0              100
    7             14                   5                 1               54
    8             16                  10                 0              120

And the output looks like:

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.9416
R Square            0.8866
Adjusted R Square   0.8015
Standard Error      16.3122
Observations        8

ANOVA
             df    SS          MS          F         Significance F
Regression    3    8319.6510   2773.2170   10.4222   0.0232
Residual      4    1064.3490    266.0872
Total         7    9384

                 Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept        -35.4228       20.5293          -1.7255   0.1595    -92.4214    21.5758
no. of emails      4.6225        1.2939           3.5725   0.0233      1.0300     8.2150
% discount         7.5597        1.6904           4.4723   0.0111      2.8665    12.2529
citizenship       -8.1040       13.3133          -0.6087   0.5756    -45.0675    28.8595
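As before, this output can be reproduced in code. A sketch in Python with numpy (an illustration added to these notes), where the dummy variable is just another column in the design matrix:

```python
import numpy as np

emails   = np.array([2, 4, 6, 8, 10, 12, 14, 16], dtype=float)
discount = np.array([15, 5, 10, 5, 15, 10, 5, 10], dtype=float)
canadian = np.array([1, 0, 1, 0, 1, 0, 1, 0], dtype=float)  # 1 = Canadian, 0 = American
sales    = np.array([70, 30, 80, 20, 110, 100, 54, 120], dtype=float)

X = np.column_stack([np.ones_like(emails), emails, discount, canadian])
b, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(b)  # approx [-35.4228, 4.6225, 7.5597, -8.1040]

# R^2 and adjusted R^2 for the three-variable model
fitted = X @ b
sst = np.sum((sales - sales.mean()) ** 2)
sse = np.sum((sales - fitted) ** 2)
r2 = 1 - sse / sst
n, k = len(sales), 3
print(r2, 1 - (1 - r2) * (n - 1) / (n - k - 1))  # approx 0.8866, 0.8015
```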

Observations:
- The estimated intercept and the estimated effects that the no. of emails and the % discount have on sales are different compared to our previous model, but not that different in this case.
- The coefficient for citizenship is -8.1040. Since this variable has a value of 1 for Canadians and 0 for Americans, our best estimate is that, all other things being equal (i.e. the same number of emails and the same % discount), Canadians spend $8.1040 less than Americans, although the confidence interval is very wide ([-45.0675, 28.8595]), making this best estimate rather useless.
- If we tested whether citizenship had an effect on sales, that test's high p-value (.5756) would also not allow us to conclude that citizenship had an effect on sales.
- The R² for this model was slightly higher than the R² for our previous model, but the adjusted R² was lower, leading us to also believe that citizenship does not have an effect on sales. This is further substantiated by looking at the Significance F, or p-value, for testing the usefulness of the model. Since this value increased to .0232 from .0054, this is also a strong indication that citizenship has no effect on sales.
Definitely the last comment:
This brief overview of multiple linear regression hasn't discussed various issues that would need to be addressed if you were to apply multiple linear regression in complex real-world business situations. A brief discussion of multiple linear regression analysis would not be able to do justice to these issues.
