In our previous example, we assumed that sales were linearly related to no. of emails
sent. When we ran a regression model based on a sample of customers, we found that
there was a weak positive linear relationship and that we could not conclude that there
was a positive linear relationship in our population. Suppose that, when we ran our
model, we forgot to take into account that different customers were offered different
discounts when ordering from this company and that these discounts along with the
original data were determined to be:
customer   no. of emails   % discount   sales
                x1             x2         y
   1             2             15         70
   2             4              5         30
   3             6             10         80
   4             8              5         20
   5            10             15        110
   6            12             10        100
   7            14              5         54
   8            16             10        120
With this additional data, we may believe that there is a linear relationship between sales
and both the no. of emails and % discount and that this true relationship looks like:
yi = β0 + β1x1i + β2x2i + εi

where,

β0 = y-intercept
β1 = true effect that no. of emails has on sales
β2 = true effect that the % discount has on sales
εi = a symbol that represents random fluctuations
As with simple linear regression, our initial objective is to find the best estimate of this
linear relationship,
ŷi = b0 + b1x1i + b2x2i
and use this best estimate to, among other things:
- measure the strength of this linear relationship in our sample
- test for the usefulness of this model
- test the effects of specific Xs on Y
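The least-squares estimates reported in the Excel output that follows can be reproduced with a few lines of code. A minimal sketch in Python using NumPy (the variable names are ours, for illustration only):

```python
import numpy as np

# Sample data from the notes: no. of emails (x1), % discount (x2), sales (y)
x1 = np.array([2, 4, 6, 8, 10, 12, 14, 16], dtype=float)
x2 = np.array([15, 5, 10, 5, 15, 10, 5, 10], dtype=float)
y = np.array([70, 30, 80, 20, 110, 100, 54, 120], dtype=float)

# Design matrix: a column of ones (for the intercept) plus the two Xs
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares estimates b0, b1, b2
b, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = b  # approximately -36.1373, 4.7450, 7.0861

# R^2 = 1 - SSE/SST (equivalently SSR/SST)
sse = ((y - X @ b) ** 2).sum()
sst = ((y - y.mean()) ** 2).sum()
r2 = 1 - sse / sst  # approximately .8761
```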
ANOVA
             df        SS           MS          F       Significance F
Regression    2     8221.0554   4110.5277   17.6729        0.0054
Residual      5     1162.9446    232.5889
Total         7     9384

                Coefficients   Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept         -36.1373        19.1622       -1.8859     0.1180     -85.3954      13.1208
no. of emails       4.7450         1.1950        3.9706     0.0106       1.6731       7.8168
discount            7.0861         1.4030        5.0506     0.0039       3.4795      10.6928

R² = SSR/SST = 8221.0554/9384 = .8761

or, that 87.61% of the variation in sales in our sample can be explained by its linear relationship to no. of emails sent and/or % discounts offered.

Comparing this to our original model, where

ŷi = 39.7855 + 3.6905(no. of emails)

and

R² = .2438

two observations are obvious:
- the new intercept is different from the original intercept, and the estimated effect that no. of emails has on sales is also different
- the R² of the new model is larger than the R² of the original model
These observations lead us to the following:
Notes:
Multiple linear regression is not restricted to only two independent variables. There could be any number of independent variables, k, as long as k ≤ n - 2.
In general, the true linear relationship can be expressed as:

yi = β0 + β1x1i + β2x2i + ... + βkxki + εi

And the estimated model as:

ŷi = b0 + b1x1i + b2x2i + ... + bkxki
Changing a model by either adding or deleting independent variables will most likely
result in different estimates of the y-intercept and different estimates of the effects that
particular Xs have on Y. (In this example, the y-intercept has changed dramatically
while the estimated effect that no. of emails has on sales has increased by about 29%.)
Adding variables to an existing model will always increase R². How much R² increases depends on how strong the effects that these additional variables have on Y are. (In this example the increase in R² is substantial.)
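For comparison, the original one-variable model can be refit the same way; a minimal sketch in Python using NumPy (variable names are ours):

```python
import numpy as np

x1 = np.array([2, 4, 6, 8, 10, 12, 14, 16], dtype=float)
y = np.array([70, 30, 80, 20, 110, 100, 54, 120], dtype=float)

# Simple linear regression of sales on no. of emails alone
X = np.column_stack([np.ones_like(x1), x1])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1 = b  # approximately 39.7857 and 3.6905

# R^2 for the one-variable model
sse = ((y - X @ b) ** 2).sum()
sst = ((y - y.mean()) ** 2).sum()
r2 = 1 - sse / sst  # approximately .2438
```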
Cautions:
Because different models will most likely result in different estimated effects and
possibly different inferences, it is very important that we estimate the correct model
before we reach conclusions based on the estimated effects and make decisions based
on these conclusions.
Because we can always increase R² by adding independent variables, even though they may be useless in explaining Y, we can calculate the Adjusted R², which may increase, stay the same, or decrease depending on how useful these additional variables are in explaining Y.
Adjusted R²

The formula for the adjusted R² is:

R²adj = 1 - (1 - R²)(n - 1)/(n - k - 1)

In this example,

R²adj = 1 - (1 - .8761)(8 - 1)/(8 - 2 - 1) = .8265

which is substantially higher than the R²adj from the original model (R²adj = .1178). This large increase is a strong indication that the % discount is useful in explaining sales.
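The adjusted R² values above can be checked with a few lines of Python (a sketch; the function name is ours):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# New model: R^2 = .8761 with n = 8, k = 2
adj_new = adjusted_r2(0.8761, 8, 2)  # approximately .8265

# Original model: R^2 = .2438 with n = 8, k = 1
adj_old = adjusted_r2(0.2438, 8, 1)  # approximately .1178
```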
Now that we have estimated the linear relationship and determined its strength based on
the sample data, we can now proceed to make inferences concerning the model. As
previously mentioned, we can make inferences concerning:
- the usefulness of the model
- the effects that individual independent variables have on Y
Usefulness of the Model
As mentioned in the previous set of notes, the ANOVA table produced by Excel allows us
to test the usefulness of the model.
In general,

H0: β1 = β2 = ... = βk = 0   (model is useless)
H1: not all of the above βs = 0   (model is useful)

Decision rule

R: Fcalc > Fα   (ν1 = k, ν2 = n - k - 1)

or,

R: p-value < α   (where p-value = Significance F in the ANOVA table)

Test statistic

Fcalc = MSR/MSE
Suppose, in this example, we want to test at α = .05 whether no. of emails and/or % discount is useful in explaining sales. Using Excel's ANOVA table:
ANOVA
             df        SS           MS          F       Significance F
Regression    2     8221.0554   4110.5277   17.6729        0.0054
Residual      5     1162.9446    232.5889
Total         7     9384

H0: β1 = β2 = 0   (model is useless)
H1: not both βs = 0   (model is useful)

Decision rule

R: Fcalc > F.05 = 5.79   (ν1 = k = 2, ν2 = n - k - 1 = 5)

or,

R: p-value < .05

Test statistic

Fcalc = MSR/MSE = 17.6729

As 17.6729 > 5.79 (or as .0054 < .05), we reject H0 at α = .05 and conclude that the model is useful.
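The critical value and p-value for this F test can be verified directly; a sketch in Python, assuming SciPy is available:

```python
from scipy import stats

msr, mse = 4110.5277, 232.5889
f_calc = msr / mse  # approximately 17.6729

# Critical value F_alpha with alpha = .05, v1 = k = 2, v2 = n - k - 1 = 5
f_crit = stats.f.ppf(0.95, 2, 5)  # approximately 5.79

# p-value = Significance F in the ANOVA table
p_value = stats.f.sf(f_calc, 2, 5)  # approximately .0054
```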
Effects of Individual Independent Variables on Y

To test H0: βj = βj0, the test statistic is:

tcalc = (bj - βj0)/sbj   with df = n - k - 1

and the rejection regions are:

if H1: βj ≠ βj0,  R: tcalc > tα/2 or tcalc < -tα/2
if H1: βj > βj0,  R: tcalc > tα
if H1: βj < βj0,  R: tcalc < -tα
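The t critical values used in the worked tests that follow come from a t distribution with n - k - 1 = 5 degrees of freedom; with SciPy they can be obtained as (a sketch):

```python
from scipy import stats

df = 8 - 2 - 1  # n - k - 1 = 5

# Two-tailed test at alpha = .01: reject H0 if t_calc > t_{alpha/2} or t_calc < -t_{alpha/2}
t_two = stats.t.ppf(1 - 0.01 / 2, df)  # approximately 4.032

# One-tailed test at alpha = .01: reject H0 if t_calc > t_alpha
t_one = stats.t.ppf(1 - 0.01, df)  # approximately 3.365
```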
To answer a hypothesis testing question (or an estimation question), we would use the following table:

                Coefficients   Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept         -36.1373        19.1622       -1.8859     0.1180     -85.3954      13.1208
no. of emails       4.7450         1.1950        3.9706     0.0106       1.6731       7.8168
discount            7.0861         1.4030        5.0506     0.0039       3.4795      10.6928

where the appropriate test statistic would be found under the t Stat column if H0: βj = 0. If H0: βj = βj0 where βj0 ≠ 0, we would have to calculate tcalc = (bj - βj0)/sbj, where the appropriate bj is found under the Coefficients column and the appropriate sbj is found under the Standard Error column. As before, if H0: βj = 0, the appropriate two-tail p-value is found under the P-value column.

Suppose we want to test, at α = .01, whether the no. of emails has an effect on sales:

H0: β1 = 0
H1: β1 ≠ 0

Decision rule

R: tcalc > tα/2 = 4.032 or tcalc < -4.032   (df = n - k - 1 = 5)

or,

R: p-value < .01

Test statistic

tcalc = (b1 - 0)/sb1 = 4.7450/1.1950 = 3.9706

As 3.9706 is neither greater than 4.032 nor less than -4.032 (or as .0106 > .01), we don't reject H0 at α = .01. There is insufficient evidence to indicate that the number of emails has an effect on sales.
To answer the second question (testing, at α = .01, whether the effect that the % discount has on sales is greater than 5):

H0: β2 = 5
H1: β2 > 5

Decision rule

R: tcalc > tα = 3.365   (df = n - k - 1 = 5)

Test statistic

tcalc = (b2 - β20)/sb2 = (7.0861 - 5)/1.4030 = 1.487

As 1.487 is not greater than 3.365, we don't reject H0 at α = .01. There is insufficient evidence to indicate that the effect of the % discount on sales is greater than 5.
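Both test statistics are simple arithmetic on the Excel output; in Python:

```python
# First test: H0: beta1 = 0 vs H1: beta1 != 0
b1, sb1 = 4.7450, 1.1950
t1 = (b1 - 0) / sb1  # approximately 3.97, between -4.032 and 4.032: don't reject H0

# Second test: H0: beta2 = 5 vs H1: beta2 > 5
b2, sb2 = 7.0861, 1.4030
t2 = (b2 - 5) / sb2  # approximately 1.487, less than 3.365: don't reject H0
```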
In this example, we concluded that the model was useful at α = .05 but, based on the p-value of .0054, we would have also concluded that the model was useful at α = .01.

It is possible that we could have concluded that the model was useful at a particular α and not have concluded that any individual variables had an effect on Y. Conversely, we could have concluded that some specific variables had an effect on Y without concluding that the model was useful.
Caution:
When testing several pairs of hypotheses, it is more likely that at least one of these
hypotheses would be rejected just by chance. (An analogy to this would be if a person
was to pick a number from 1 to 20 and you claimed that you were a mind reader and tried
to guess his/her number, eventually you would guess the correct number without being
able to read the person's mind.) Therefore, for your conclusions to be credible, the
number of tests conducted should be kept to a minimum. Only test what is necessary.
One last comment:
Suppose we believe that there is a qualitative variable that affects Y; we can take that into account in our model. In our example, suppose we believe that a person's citizenship affects sales and that the two citizenships in this case are Canadian and American. We can incorporate citizenship in our model by including a dummy variable as one of the independent variables. A dummy variable can assume two possible values: 0 or 1. Which value we assign to Canadians and which value we assign to Americans is arbitrary, as long as we don't assign the same value to both.

Suppose, in this example, the odd-numbered customers are Canadians and the even-numbered customers are Americans, and we assign the value 1 to Canadians and the value 0 to Americans. Our data set would now look like:
customer   no. of emails   % discount   citizenship   $ sales
                x1             x2           x3            y
   1             2             15            1            70
   2             4              5            0            30
   3             6             10            1            80
   4             8              5            0            20
   5            10             15            1           110
   6            12             10            0           100
   7            14              5            1            54
   8            16             10            0           120
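The dummy-variable model can be fit the same way as before; a minimal sketch in Python using NumPy (variable names are ours):

```python
import numpy as np

x1 = np.array([2, 4, 6, 8, 10, 12, 14, 16], dtype=float)
x2 = np.array([15, 5, 10, 5, 15, 10, 5, 10], dtype=float)
x3 = np.array([1, 0, 1, 0, 1, 0, 1, 0], dtype=float)  # dummy: 1 = Canadian, 0 = American
y = np.array([70, 30, 80, 20, 110, 100, 54, 120], dtype=float)

X = np.column_stack([np.ones_like(x1), x1, x2, x3])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2, b3 = b  # approximately -35.4228, 4.6225, 7.5597, -8.1040
```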
SUMMARY OUTPUT

Regression Statistics
Multiple R             0.9416
R Square               0.8866
Adjusted R Square      0.8015
Standard Error        16.3122
Observations                8

ANOVA
             df        SS           MS          F       Significance F
Regression    3     8319.6510   2773.2170   10.4222        0.0232
Residual      4     1064.3490    266.0872
Total         7     9384

                Coefficients   Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept         -35.4228        20.5293       -1.7255     0.1595     -92.4214      21.5758
no. of emails       4.6225         1.2939        3.5725     0.0233       1.0300       8.2150
% discount          7.5597         1.6904        4.4723     0.0111       2.8665      12.2529
citizenship        -8.1040        13.3133       -0.6087     0.5756     -45.0675      28.8595
Observations:
- The estimated intercept and the estimated effects that no. of emails and % discount have on sales are different compared to our previous model, but not that different in this case.
- The coefficient for citizenship is -8.1040. Since this variable has a value of 1 for Canadians and 0 for Americans, our best estimate is that, all other things being equal (i.e., same number of emails and same % discount), Canadians spend $8.1040 less than Americans, although the confidence interval is very wide ([-45.0675, 28.8595]), making this best estimate rather useless.
- If we tested whether citizenship had an effect on sales, that test's high p-value (.5756) would also not allow us to conclude that citizenship had an effect on sales.
- The R² for this model is slightly higher than the R² for our previous model, but the adjusted R² is lower, also leading us to believe that citizenship does not have an effect on sales. This is further substantiated by the Significance F, or p-value, for testing the usefulness of the model: since this value increased to .0232 from .0054, this is another strong indication that citizenship has no effect on sales.
Definitely the last comment:
This brief overview of multiple linear regression hasn't discussed various issues that would need to be addressed if you were to apply multiple linear regression in complex real-world business situations. A brief discussion of multiple linear regression analysis would not be able to do justice to these issues.