
CATEGORICAL PREDICTORS Dichotomous Dummy Variables

If a multiple regression model includes categorical predictor variables, these are not
measured on a numerical scale and their values need to be coded in order for them to
be included in the regression model. The coded variables are called either dummy
variables or indicator variables and in the case of a categorical predictor variable
taking only two values, the coding is usually with the values 0 and 1. The level of
the categorical predictor assigned the value 0 is often called the base level. In a
multiple regression model

y = β0 + β1 x1 + β2 x2 + · · · + βk xk + ε

where the variable x1 is a categorical predictor, the coefficient β1 is interpreted as the difference between the mean response for the level assigned the value 1 and the mean response for the base level.

The data below from Neter et al. (1979) provide the sales volume ($
thousands) for a chain of fast–food restaurants together with the number of households
(thousands) in the restaurant’s trading area and the location of the restaurant (X1 = 0
for restaurants on the highway and X1 = 1 for restaurants in a shopping mall).

Number of Households (X2)   Location (X1)   Sales Volume (Y)

155 0 135.27
178 1 179.86
215 1 220.14
93 0 72.74
128 0 114.95
114 0 102.93
172 1 179.64
158 0 131.77
207 1 207.82
95 1 113.51
183 0 160.91


The regression model for the two predictor variables is

Yi = β0 + β1 X1 + β2 X2 + ε

with the response function for restaurants on a highway being

E(Y) = β0 + β1(0) + β2 X2 = β0 + β2 X2

and the response function for restaurants in a shopping mall being

E(Y) = β0 + β1(1) + β2 X2 = (β0 + β1) + β2 X2

Fitting the regression gives

The regression equation is


Sales Volume = - 2.42 + 29.6 Location + 0.882 Number of Households

Predictor Coef StDev T P VIF


Constant -2.418 5.502 -0.44 0.672
Location 29.645 3.022 9.81 0.000 1.2
Number o 0.88216 0.03745 23.56 0.000 1.2

S = 4.500 R-Sq = 99.2% R-Sq(adj) = 99.1%

Analysis of Variance

Source DF SS MS F P
Regression 2 21200 10600 523.40 0.000
Residual Error 8 162 20
Total 10 21362

and we can conclude that for an area with a given number of households, sales volume
would be $29,645 greater if the restaurant were in a shopping mall rather than on a
highway.
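The same fit can be reproduced by entering the 0/1 dummy as an ordinary numerical column in any regression routine. A minimal sketch in Python with statsmodels (an assumed tool choice; the output above is from Minitab), using the data listed earlier:

import numpy as np
import statsmodels.api as sm

# Data from the table above: 0 = highway, 1 = shopping mall
households = np.array([155, 178, 215, 93, 128, 114, 172, 158, 207, 95, 183])
location = np.array([0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0])
sales = np.array([135.27, 179.86, 220.14, 72.74, 114.95, 102.93,
                  179.64, 131.77, 207.82, 113.51, 160.91])

# The dummy enters the design matrix exactly like a numerical predictor
X = sm.add_constant(np.column_stack([location, households]))
print(sm.OLS(sales, X).fit().params)  # approx [-2.42, 29.6, 0.882]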

CATEGORICAL PREDICTORS Polytomous Dummy Variables

If a categorical predictor variable has more than two categories, several dummy
variables are required. Assigning the values 0, 1 and 2 to a categorical variable
with three categories would suggest not only that the categories are ordered
but also that the effect of changing from category 0 to category 1 has the same
magnitude as the effect of changing from category 1 to category 2. Neither the
assumption of ordering nor the assumption of equal spacing of categories is
justifiable. For a categorical predictor variable with three levels, the three
levels are defined by two dummy variables, say X1 and X2, where the chosen base
level of the variable is defined by X1 = X2 = 0, another level is defined by
X1 = 1, X2 = 0 and the remaining level is defined by X1 = 0, X2 = 1.

Example 4.9 in the textbook seeks to model shipping costs (Y ) on the basis of cargo
type where the categorical predictor variable cargo type has three levels; fragile,
semifragile and durable. The three cargo types are coded by two dummy variables,
X1 and X2 as shown below where durable is taken as the base level.

Cost, Y Cargo type X1 X2

17.20 Fragile 1 0
11.10 Fragile 1 0
12.00 Fragile 1 0
10.90 Fragile 1 0
13.80 Fragile 1 0
6.50 Semifragile 0 1
10.00 Semifragile 0 1
11.50 Semifragile 0 1
7.00 Semifragile 0 1
8.50 Semifragile 0 1
2.10 Durable 0 0
1.30 Durable 0 0
3.40 Durable 0 0
7.50 Durable 0 0
2.00 Durable 0 0

Fitting the model


Y = β0 + β1 X1 + β2 X2 + ε
which has only the constant term and the two dummy variables as predictors gives


The regression equation is


Cost = 3.26 + 9.74 X1 + 5.44 X2

Predictor Coef StDev T P VIF


Constant 3.260 1.075 3.03 0.010
X1 9.740 1.521 6.41 0.000 1.3
X2 5.440 1.521 3.58 0.004 1.3

S = 2.404 R-Sq = 77.4% R-Sq(adj) = 73.7%

Analysis of Variance

Source DF SS MS F P
Regression 2 238.25 119.13 20.61 0.000
Residual Error 12 69.37 5.78
Total 14 307.62

These results give the mean shipping cost of durable goods as β0 or $3.26, the
mean shipping cost of semifragile goods as β0 + β2 or $8.70 and the mean shipping
cost of fragile goods as β0 + β1 or $13.00. We also have interpretations of the
coefficients of the dummy variables, with, for example, the coefficient β1 being
the difference of $9.74 in mean shipping costs between fragile and durable goods.
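These relationships between the fitted coefficients and the three group means can be checked directly. A minimal sketch, again assuming Python and statsmodels, using the cargo data above:

import numpy as np
import statsmodels.api as sm

# Costs in the order fragile, semifragile, durable (five observations each)
cost = np.array([17.20, 11.10, 12.00, 10.90, 13.80,
                 6.50, 10.00, 11.50, 7.00, 8.50,
                 2.10, 1.30, 3.40, 7.50, 2.00])
x1 = np.repeat([1, 0, 0], 5)  # fragile indicator
x2 = np.repeat([0, 1, 0], 5)  # semifragile indicator; durable is the base level

b0, b1, b2 = sm.OLS(cost, sm.add_constant(np.column_stack([x1, x2]))).fit().params
print(b0, b0 + b1, b0 + b2)  # 3.26, 13.00, 8.70 -- the durable, fragile and
                             # semifragile group means respectively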

In general, if a categorical predictor variable has k levels, it will need to be
represented by k − 1 dummy variables.

If a model involves more than one categorical predictor variable, each predictor
variable will be represented by the appropriate number of dummy variables.
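Statistical software can usually generate the k − 1 dummy columns for each factor automatically. A brief sketch assuming pandas, in which drop_first=True drops one level of each factor to act as its base level (the small two-factor data frame here is purely illustrative):

import pandas as pd

df = pd.DataFrame({"cargo": ["Fragile", "Semifragile", "Durable"],
                   "centre": ["A", "B", "A"]})
# Drops the first level of each factor (alphabetically "Durable" and "A"),
# leaving k - 1 dummy columns per factor
print(pd.get_dummies(df, columns=["cargo", "centre"], drop_first=True))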

CATEGORICAL PREDICTORS Comparing Regression Lines

Consider a simple linear regression where the observations can be grouped into m
classes, where for each class the correct regression model is

Y = β0k + β1k X + ε,   k = 1, 2, . . . , m.

In this situation we can have four different models:

Model 1 – Most General

All the parameters are different and we have m distinct regression lines.

Model 2 – Parallel Regressions

In this model, β11 = β12 = . . . = β1m and so the lines are parallel but have different
intercepts.

Model 3 – Concurrent Regressions

In this model, β01 = β02 = . . . = β0m and so the lines have the same intercept but
different slopes.

Model 4 – Coincident Regressions

In this model, β01 = β02 = . . . = β0m and β11 = β12 = . . . = β1m and so all the lines
are the same.

It is often of interest to test whether models 2, 3 or 4 provide an adequate fit
when compared with less stringent models.

The following data set, originally used by Burt (1966), gives the IQ scores of
identical twins with one member of each pair raised in a foster home (Y) and the
other raised by the natural parents (X). The social classes of the natural parents
are given by the indicator variables, with I1 indicating high class, I2 middle
class and I3 low class. In order to compare the four models above, three extra
variables are added, namely Z1 = I1 X, Z2 = I2 X and Z3 = I3 X.


Y X I1 I2 I3 Z1 Z2 Z3

82 82 1 0 0 82 0 0
80 90 1 0 0 90 0 0
88 91 1 0 0 91 0 0
108 115 1 0 0 115 0 0
116 115 1 0 0 115 0 0
117 129 1 0 0 129 0 0
132 131 1 0 0 131 0 0
71 78 0 1 0 0 78 0
75 79 0 1 0 0 79 0
93 82 0 1 0 0 82 0
95 97 0 1 0 0 97 0
88 100 0 1 0 0 100 0
111 107 0 1 0 0 107 0
63 68 0 0 1 0 0 68
77 73 0 0 1 0 0 73
86 81 0 0 1 0 0 81
83 85 0 0 1 0 0 85
93 87 0 0 1 0 0 87
97 87 0 0 1 0 0 87
87 93 0 0 1 0 0 93
94 94 0 0 1 0 0 94
96 95 0 0 1 0 0 95
112 97 0 0 1 0 0 97
113 97 0 0 1 0 0 97
106 103 0 0 1 0 0 103
107 106 0 0 1 0 0 106
98 111 0 0 1 0 0 111
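The indicator and interaction columns in the table can be generated from the class labels rather than typed in. A short sketch assuming pandas, shown for the first observation of each class:

import pandas as pd

df = pd.DataFrame({"X": [82, 78, 68],
                   "class": ["high", "middle", "low"]})
for k, level in enumerate(["high", "middle", "low"], start=1):
    df[f"I{k}"] = (df["class"] == level).astype(int)  # indicator Ik
    df[f"Z{k}"] = df[f"I{k}"] * df["X"]               # interaction Zk = Ik * X
print(df)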


Model 1

Model 1 can be fitted by fitting the regression model

Y = β01 I1 + β02 I2 + β03 I3 + β11 Z1 + β12 Z2 + β13 Z3 + ε

with no overall intercept term. The residual sum of squares in this model (RSS1) has
degrees of freedom (df1) equal to the sum of the degrees of freedom for the residual
terms in the three separate regressions of Y on X for each social class, here
(7 − 2) + (6 − 2) + (14 − 2) = 21.

Model 2

Model 2 is fitted by fitting the regression model

Y = β01 I1 + β02 I2 + β03 I3 + β1 X + ε

which gives estimates of the individual intercepts and the common slope β1. The
residual sum of squares in this model (RSS2) has degrees of freedom (df2) equal to
n − m − p, where p is the number of numerical predictors; here df2 = 27 − 3 − 1 = 23.

Model 3

Model 3 is fitted by fitting the regression model

Y = β0 + β11 Z1 + β12 Z2 + β13 Z3 + ε

which gives the common intercept β0 and the individual slopes. The residual sum of
squares in this model (RSS3) has degrees of freedom (df3) equal to n − mp − 1; here
df3 = 27 − 3 − 1 = 23.

Model 4

Model 4 assumes a single regression line so it is fitted by fitting

Y = β0 + β1 X + ε

where the residual sum of squares (RSS4) has degrees of freedom (df4) equal to
n − 2; here df4 = 25.
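All four models can be fitted as ordinary least–squares regressions on the appropriate design matrices. A sketch assuming Python and statsmodels, which also collects the residual sums of squares and residual degrees of freedom used below:

import numpy as np
import statsmodels.api as sm

# Burt twin data, in the order high class (7), middle class (6), low class (14)
y = np.array([82, 80, 88, 108, 116, 117, 132, 71, 75, 93, 95, 88, 111, 63, 77,
              86, 83, 93, 97, 87, 94, 96, 112, 113, 106, 107, 98])
x = np.array([82, 90, 91, 115, 115, 129, 131, 78, 79, 82, 97, 100, 107, 68, 73,
              81, 85, 87, 87, 93, 94, 95, 97, 97, 103, 106, 111])
cls = np.repeat([0, 1, 2], [7, 6, 14])
I = (cls[:, None] == np.arange(3)).astype(float)  # indicator columns I1, I2, I3
Z = I * x[:, None]                                # interaction columns Z1, Z2, Z3

designs = {
    "Model 1": np.hstack([I, Z]),        # separate lines, no overall intercept
    "Model 2": np.column_stack([I, x]),  # separate intercepts, common slope
    "Model 3": sm.add_constant(Z),       # common intercept, separate slopes
    "Model 4": sm.add_constant(x),       # a single line
}
for name, X in designs.items():
    fit = sm.OLS(y, X).fit()
    print(name, round(fit.ssr), int(fit.df_resid))  # RSS and residual df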


We can test the models 2, 3 or 4 against the general model using the F–test with the
F–test statistic for testing model i against the general model being given by

Fi = [(RSSi − RSS1)/(dfi − df1)] / [RSS1/df1],   i = 2, 3, 4

and the hypothesis that the restricted model provides as good a fit as the general
model is not rejected if the F–statistic is less than the critical point of the
F distribution with dfi − df1 and df1 degrees of freedom.
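This partial F–test is easily wrapped as a small helper. A sketch assuming scipy, with the RSS and degrees–of–freedom values taken from the summary table below:

from scipy.stats import f

def partial_f(rss_i, df_i, rss_1, df_1, alpha=0.05):
    # F-statistic for testing a restricted model against the general model
    F = ((rss_i - rss_1) / (df_i - df_1)) / (rss_1 / df_1)
    crit = f.ppf(1 - alpha, df_i - df_1, df_1)
    return F, crit, F < crit  # True => restricted model is adequate

print(partial_f(1494, 25, 1317, 21))  # model 4 against model 1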

From performing the four regressions we get the following summary of results of the
estimates of the coefficients and the residual sums of squares.

Variable     Model 1   Model 2   Model 3   Model 4

Intercept       −         −        2.56      9.21
X               −        0.97       −        0.90
I1            −1.87     −0.61       −         −
I2             0.82      1.43       −         −
I3             7.20      5.62       −         −
Z1             0.98       −        0.94       −
Z2             0.97       −        0.95       −
Z3             0.95       −        1.00       −

RSS            1317      1318      1326      1494
d.f.             21        23        23        25

In this case, the F–statistics are

F2 = [(1318 − 1317)/2] / (1317/21) = 0.01
F3 = [(1326 − 1317)/2] / (1317/21) = 0.08
F4 = [(1494 − 1317)/4] / (1317/21) = 0.71

all of which are non–significant and so the most restrictive model, model 4, is as good
as the general model, model 1.

CATEGORICAL PREDICTORS Example

People who work in call centres responsible for telephone selling receive bonus
payments based on the number and type of sales they make. The bonus payments depend
on productivity indices which are calculated at the end of each week. It is thought
that people who have worked in the call centre for longer would be more familiar
with the job and with the techniques which increase sales, and so would gain a
higher productivity index. People working in two call centres were interviewed,
and the number of weeks each had worked in the call centre and their most recent
productivity index were recorded, with the results shown below. The call centre in
which each person works was also recorded.

Time Index Centre

10 29 0
22 40 0
22 5 1
37 39 0
43 10 1
49 52 0
50 29 1
60 54 0
66 68 0
75 25 1
80 67 0
82 39 1
100 44 1
110 58 1

A regression of productivity index against the number of weeks of employment gives
the first output below, and although the fit of the linear model is poor
(R2 = 27.6%), it suggests that the length of time worked has no significant effect
on the productivity index achieved (P = 0.054).

If the regression is repeated including the categorical predictor for the centre
of employment, the output shows that the productivity index does increase with the
period of employment but that the index achieved for the same period of employment
is not the same for the two centres: a person in the centre coded 1 would achieve
a productivity index 32.5 points lower than a person who had worked for the same
period in the centre coded 0.
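Both regressions can be reproduced from the data above. A minimal sketch assuming Python and statsmodels (the two outputs below are from Minitab):

import numpy as np
import statsmodels.api as sm

time = np.array([10, 22, 22, 37, 43, 49, 50, 60, 66, 75, 80, 82, 100, 110])
index = np.array([29, 40, 5, 39, 10, 52, 29, 54, 68, 25, 67, 39, 44, 58])
centre = np.array([0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1])

# Time alone, then time plus the centre dummy
print(sm.OLS(index, sm.add_constant(time)).fit().params)
print(sm.OLS(index, sm.add_constant(np.column_stack([time, centre]))).fit().params)
# approx [20.5, 0.337] and [24.0, 0.559, -32.5]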


Regression Analysis

The regression equation is


Index = 20.5 + 0.337 Time

Predictor Coef StDev T P


Constant 20.51 10.16 2.02 0.067
Time 0.3373 0.1579 2.14 0.054

S = 17.00 R-Sq = 27.6% R-Sq(adj) = 21.5%

Analysis of Variance

Source DF SS MS F P
Regression 1 1318.9 1318.9 4.56 0.054
Residual Error 12 3468.1 289.0
Total 13 4786.9

Regression Analysis

The regression equation is


Index = 24.0 + 0.559 Time - 32.5 Centre

Predictor Coef StDev T P


Constant 24.000 3.373 7.11 0.000
Time 0.55865 0.05667 9.86 0.000
Centre -32.467 3.261 -9.96 0.000

S = 5.612 R-Sq = 92.8% R-Sq(adj) = 91.4%

Analysis of Variance

Source DF SS MS F P
Regression 2 4440.5 2220.3 70.50 0.000
Residual Error 11 346.4 31.5
Total 13 4786.9
