If a multiple regression model includes categorical predictor variables, these are not
measured on a numerical scale and their values need to be coded in order for them to
be included in the regression model. The coded variables are called either dummy
variables or indicator variables and in the case of a categorical predictor variable
taking only two values, the coding is usually with the values 0 and 1. The level of
the categorical predictor assigned the value 0 is often called the base level. In a
multiple regression model with a dichotomous dummy variable x1 and numerical predictors x2, . . . , xk, the model takes the form

y = β0 + β1 x1 + Σ_{i=2}^{k} βi xi + ε
The data below, from Neter et al. (1979), give the sales volume ($ thousands) for a chain of fast–food restaurants together with the number of households (thousands) in the restaurant's trading area and the location of the restaurant (X1 = 0 for restaurants on a highway and X1 = 1 for restaurants in a shopping mall).
Households (X2)   Location (X1)   Sales (Y)
      155               0           135.27
      178               1           179.86
      215               1           220.14
       93               0            72.74
      128               0           114.95
      114               0           102.93
      172               1           179.64
      158               0           131.77
      207               1           207.82
       95               1           113.51
      183               0           160.91
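The model can be fitted by ordinary least squares. The source presents only standard regression output, but as a minimal sketch the normal equations can be solved directly for the restaurant data; the following Python is illustrative, not the source's method:

```python
# Fit Y = b0 + b1*X1 + b2*X2 by least squares via the normal equations.
# Data rows: (households X2 in thousands, location dummy X1, sales Y in $ thousands)
data = [
    (155, 0, 135.27), (178, 1, 179.86), (215, 1, 220.14), (93, 0, 72.74),
    (128, 0, 114.95), (114, 0, 102.93), (172, 1, 179.64), (158, 0, 131.77),
    (207, 1, 207.82), (95, 1, 113.51), (183, 0, 160.91),
]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# Design matrix rows (1, X1, X2); form X'X and X'y, then solve.
rows = [(1.0, x1, x2) for (x2, x1, _) in data]
ys = [y for (_, _, y) in data]
XtX = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
b0, b1, b2 = solve(XtX, Xty)
print(round(b1, 3))  # coefficient of the shopping-mall dummy, about 29.645
```

The coefficient of the dummy variable reproduces the $29,645 location effect quoted below the analysis-of-variance output.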
CATEGORICAL PREDICTORS Dichotomous Dummy Variables
The model fitted is

Yi = β0 + β1 X1 + β2 X2 + ε

so that for highway restaurants (X1 = 0) the mean sales volume is

E(Y) = β0 + β1(0) + β2 X2 = β0 + β2 X2

while for shopping–mall restaurants (X1 = 1) it is

E(Y) = β0 + β1 + β2 X2
Analysis of Variance
Source          DF      SS      MS        F      P
Regression       2   21200   10600   523.40  0.000
Residual Error   8     162      20
Total           10   21362
The estimated coefficient of the location dummy is β1 = 29.645, and we can conclude that for an area with a given number of households, sales volume would be $29,645 greater if the restaurant were in a shopping mall rather than on a highway.
CATEGORICAL PREDICTORS Polytomous Dummy Variables
If a categorical predictor variable has more than two categories, several dummy variables are required. Assigning the values 0, 1 and 2 to a categorical variable with three categories would suggest not only that the categories are physically ordered but also that the effect of changing from category 0 to category 1 has the same magnitude as that of changing from category 1 to category 2. Neither the assumption of ordering nor the assumption of equal spacing of categories is justifiable. For a categorical predictor variable with three levels, the three levels are defined by two dummy variables, say X1 and X2, where the chosen base level of the variable is defined by X1 = X2 = 0, another level is defined by X1 = 1, X2 = 0 and the remaining level is defined by X1 = 0, X2 = 1.
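This coding rule can be written as a small function. The sketch below is not from the source; the function name and level labels are illustrative:

```python
def dummy_code(values, base):
    """Code a categorical variable as 0/1 dummy variables, one per
    non-base level; the base level gets the all-zeros pattern."""
    levels = sorted(set(values) - {base})
    return [[1 if v == lev else 0 for lev in levels] for v in values]

# Coding matches the cargo-type example: X1 flags fragile, X2 semifragile.
codes = dummy_code(["Fragile", "Semifragile", "Durable", "Durable"], base="Durable")
print(codes)  # [[1, 0], [0, 1], [0, 0], [0, 0]]
```

With k levels the function produces k − 1 dummy columns, so the design matrix never encodes a spurious ordering or spacing between categories.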
Example 4.9 in the textbook seeks to model shipping costs (Y) on the basis of cargo type, where the categorical predictor variable cargo type has three levels: fragile, semifragile and durable. The three cargo types are coded by two dummy variables, X1 and X2, as shown below, where durable is taken as the base level.
Cost (Y)   Cargo Type    X1   X2
  17.20    Fragile        1    0
  11.10    Fragile        1    0
  12.00    Fragile        1    0
  10.90    Fragile        1    0
  13.80    Fragile        1    0
   6.50    Semifragile    0    1
  10.00    Semifragile    0    1
  11.50    Semifragile    0    1
   7.00    Semifragile    0    1
   8.50    Semifragile    0    1
   2.10    Durable        0    0
   1.30    Durable        0    0
   3.40    Durable        0    0
   7.50    Durable        0    0
   2.00    Durable        0    0
Analysis of Variance
Source          DF       SS       MS      F      P
Regression       2   238.25   119.13  20.61  0.000
Residual Error  12    69.37     5.78
Total           14   307.62
These results give the mean shipping cost of durable goods as β0, or $3.26, the mean shipping cost of semifragile goods as β0 + β2, or $8.70, and the mean shipping cost of fragile goods as β0 + β1, or $13.00. The coefficients of the dummy variables also have direct interpretations: for example, the coefficient β1 is the difference in mean shipping cost between fragile and durable goods, $9.74.
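Because the only predictors here are the two dummy variables, the fitted values are simply the group means, so the coefficient estimates can be checked directly from the data. A quick sketch (not from the source), assuming the data as tabulated above:

```python
# Shipping costs by cargo type, taken from the table above.
costs = {
    "Fragile":     [17.20, 11.10, 12.00, 10.90, 13.80],
    "Semifragile": [6.50, 10.00, 11.50, 7.00, 8.50],
    "Durable":     [2.10, 1.30, 3.40, 7.50, 2.00],
}
means = {k: sum(v) / len(v) for k, v in costs.items()}
b0 = means["Durable"]            # base-level mean
b1 = means["Fragile"] - b0       # fragile vs durable difference
b2 = means["Semifragile"] - b0   # semifragile vs durable difference
print(round(b0, 2), round(b1, 2), round(b2, 2))  # 3.26 9.74 5.44
```

The group means reproduce β0 = 3.26, β1 = 9.74 and β2 = 5.44 exactly, which is the intercept-plus-dummy interpretation stated above.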
If a model involves more than one categorical predictor variable, each predictor
variable will be represented by the appropriate number of dummy variables.
CATEGORICAL PREDICTORS Comparing Regression Lines
Consider a simple linear regression where the observations can be grouped into m classes, where for each class the correct regression model is

Y = β0k + β1k X + ε     k = 1, 2, . . . , m.

Four models can be compared.
Model 1 (the general model): all the parameters are different and we have m distinct regression lines.
Model 2: β11 = β12 = . . . = β1m, so the lines are parallel but have different intercepts.
Model 3: β01 = β02 = . . . = β0m, so the lines have the same intercept but different slopes.
Model 4: β01 = β02 = . . . = β0m and β11 = β12 = . . . = β1m, so all the lines are the same.
The following data set, originally used by Burt (1966), gives the IQ scores of identical twins, with one member of each pair raised in a foster home (Y) and the other raised by the natural parents (X). The social classes of the natural parents are given by the indicator variables, with I1 indicating high class, I2 middle class and I3 low class. In order to compare the four models above, three extra variables are added, namely Z1 = I1 X, Z2 = I2 X and Z3 = I3 X.
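The indicator and product columns can be built mechanically from the raw class labels. A sketch (not from the source; only the first few rows of the twin data are used, for illustration):

```python
# Build indicator columns I1..I3 and product columns Z1..Z3 = Ik * X
# from (X, class) pairs; a few rows of the twin data as an illustration.
pairs = [(82, "high"), (90, "high"), (78, "middle"), (68, "low")]
classes = ["high", "middle", "low"]   # this order fixes which Ik is which

rows = []
for x, cls in pairs:
    I = [1 if cls == c else 0 for c in classes]   # exactly one 1 per row
    Z = [i * x for i in I]                        # X where the class matches, else 0
    rows.append(I + Z)
print(rows[0])  # [1, 0, 0, 82, 0, 0]
```

Each row matches the corresponding (I1, I2, I3, Z1, Z2, Z3) entries in the table below, e.g. the first high-class twin with X = 82 gives I1 = 1 and Z1 = 82.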
Y X I1 I2 I3 Z1 Z2 Z3
82 82 1 0 0 82 0 0
80 90 1 0 0 90 0 0
88 91 1 0 0 91 0 0
108 115 1 0 0 115 0 0
116 115 1 0 0 115 0 0
117 129 1 0 0 129 0 0
132 131 1 0 0 131 0 0
71 78 0 1 0 0 78 0
75 79 0 1 0 0 79 0
93 82 0 1 0 0 82 0
95 97 0 1 0 0 97 0
88 100 0 1 0 0 100 0
111 107 0 1 0 0 107 0
63 68 0 0 1 0 0 68
77 73 0 0 1 0 0 73
86 81 0 0 1 0 0 81
83 85 0 0 1 0 0 85
93 87 0 0 1 0 0 87
97 87 0 0 1 0 0 87
87 93 0 0 1 0 0 93
94 94 0 0 1 0 0 94
96 95 0 0 1 0 0 95
112 97 0 0 1 0 0 97
113 97 0 0 1 0 0 97
106 103 0 0 1 0 0 103
107 106 0 0 1 0 0 106
98 111 0 0 1 0 0 111
Model 1

Y = β01 I1 + β02 I2 + β03 I3 + β11 Z1 + β12 Z2 + β13 Z3 + ε

with no overall intercept term. The residual sum of squares in this model (RSS1) has degrees of freedom (df1) equal to the sum of the degrees of freedom for the residual terms in the three separate regressions of Y on X for each social class, df1 = n − 2m.

Model 2

Y = β01 I1 + β02 I2 + β03 I3 + β1 X + ε

which gives estimates of the individual intercepts and the common slope β1. The residual sum of squares in this model (RSS2) has degrees of freedom (df2) equal to n − m − p, where p is the number of numerical predictors (here p = 1).

Model 3

Y = β0 + β11 Z1 + β12 Z2 + β13 Z3 + ε

which gives the common intercept β0 and the individual slopes. The residual sum of squares in this model (RSS3) has degrees of freedom (df3) equal to n − mp − 1.

Model 4

Y = β0 + β1 X + ε

where the residual sum of squares (RSS4) has degrees of freedom (df4) equal to n − 2.
We can test models 2, 3 or 4 against the general model using an F–test, with the F–test statistic for testing model i against the general model given by

Fi = ((RSSi − RSS1)/(dfi − df1)) / (RSS1/df1)

and we accept the hypothesis that model i is as good as the general model if the F–statistic is less than the critical point of the F(dfi − df1, df1) distribution.
From performing the four regressions we obtain the residual sums of squares RSS1 = 1317, RSS2 = 1318, RSS3 = 1326 and RSS4 = 1494, with df1 = n − 2m = 21.
F2 = ((1318 − 1317)/2) / (1317/21) = 0.01
F3 = ((1326 − 1317)/2) / (1317/21) = 0.08
F4 = ((1494 − 1317)/4) / (1317/21) = 0.71
all of which are non–significant and so the most restrictive model, model 4, is as good
as the general model, model 1.
CATEGORICAL PREDICTORS Example
People who work in call centres responsible for telephone selling receive bonus payments based on the number and type of sales they make. The bonus payments depend on productivity indices which are calculated at the end of each week. It is thought that people who have worked in the call centre for longer would be more familiar with the job and with the techniques that increase sales, and so would gain a higher productivity index. People working in two call centres are interviewed, and the number of weeks each has worked in the call centre and their most recent productivity index are recorded, with the results shown below. The call centre each person works in is also recorded.
Weeks   Index   Centre
  10      29       0
  22      40       0
  22       5       1
  37      39       0
  43      10       1
  49      52       0
  50      29       1
  60      54       0
  66      68       0
  75      25       1
  80      67       0
  82      39       1
 100      44       1
 110      58       1
If the regression is repeated including the categorical predictor for the centre of
employment, the output shows that the productivity index does increase with the
period of employment but that the productivity index achieved for the same period
of employment is not the same for the two centres. A person working for a given
period of time in the centre coded 1 would achieve a productivity index 32.5 points
lower than a person working in the centre coded 0 for the same period of time.
Regression Analysis (productivity index on weeks of employment only)

Analysis of Variance
Source          DF       SS       MS      F      P
Regression       1   1318.9   1318.9   4.56  0.054
Residual Error  12   3468.1    289.0
Total           13   4786.9
Regression Analysis (including the call–centre dummy variable)

Analysis of Variance
Source          DF       SS       MS       F      P
Regression       2   4440.5   2220.3   70.50  0.000
Residual Error  11    346.4     31.5
Total           13   4786.9
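As a check on the quoted 32.5-point difference between the centres, the two-predictor model can be fitted by least squares. A minimal sketch in Python (not the source's software) solving the normal equations for the data above:

```python
# Fit index = b0 + b1*weeks + b2*centre by least squares.
data = [  # (weeks worked, productivity index, centre)
    (10, 29, 0), (22, 40, 0), (22, 5, 1), (37, 39, 0), (43, 10, 1),
    (49, 52, 0), (50, 29, 1), (60, 54, 0), (66, 68, 0), (75, 25, 1),
    (80, 67, 0), (82, 39, 1), (100, 44, 1), (110, 58, 1),
]
rows = [(1.0, w, c) for (w, _, c) in data]
ys = [y for (_, y, _) in data]

# Normal equations X'X b = X'y, solved by Gaussian elimination.
A = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
v = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
for col in range(3):
    p = max(range(col, 3), key=lambda r: abs(A[r][col]))
    A[col], A[p] = A[p], A[col]
    v[col], v[p] = v[p], v[col]
    for r in range(col + 1, 3):
        f = A[r][col] / A[col][col]
        A[r] = [a - f * b_ for a, b_ in zip(A[r], A[col])]
        v[r] -= f * v[col]
b = [0.0, 0.0, 0.0]
for r in (2, 1, 0):
    b[r] = (v[r] - sum(A[r][c] * b[c] for c in range(r + 1, 3))) / A[r][r]
print(round(b[2], 1))  # coefficient of the centre dummy, about -32.5
```

The centre coefficient is about −32.5, matching the statement that a person in the centre coded 1 achieves a productivity index 32.5 points lower for the same period of employment, while the positive coefficient on weeks confirms that the index rises with experience.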