
Introduction to Statistical Analysis Using SPSS® Statistics
33373-001

SPSS v17.0.1;1/2009 nm
For more information about SPSS Inc. software products, please visit our Web site at
http://www.spss.com or contact
SPSS Inc.
233 South Wacker Drive, 11th Floor
Chicago, IL 60606-6412
Tel: (312) 651-3000
Fax: (312) 651-3668
SPSS is a registered trademark and the other product names are the trademarks of SPSS Inc. for its
proprietary computer software. No material describing such software may be produced or distributed
without the written permission of the owners of the trademark and license rights in the software and the
copyrights in the published materials.
The SOFTWARE and documentation are provided with RESTRICTED RIGHTS. Use, duplication, or
disclosure by the Government is subject to restrictions as set forth in subdivision (c)(1)(ii) of The Rights in
Technical Data and Computer Software clause at 52.227-7013. Contractor/manufacturer is SPSS Inc., 233
South Wacker Drive, 11th Floor, Chicago, IL 60606-6412.
Patent No. 7,023,453
General notice: Other product names mentioned herein are used for identification purposes only and may
be trademarks or registered trademarks of their respective companies in the United States and other
countries.

Windows is a registered trademark of Microsoft Corporation.


Apple, Mac, and the Mac logo are trademarks of Apple Computer, Inc., registered in the U.S. and other
countries.

Introduction to Statistical Analysis Using SPSS Statistics


Copyright © 2009 by SPSS Inc.
All rights reserved.
Printed in the United States of America.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or
by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written
permission of the publisher.

CHAPTER 1: INTRODUCTION TO STATISTICAL ANALYSIS .... 1-1


1.1 COURSE GOALS AND METHODS .............................................................................. 1-1
1.2 BASIC STEPS OF RESEARCH PROCESS ...................................................................... 1-2
1.3 POPULATIONS AND SAMPLES .................................................................................. 1-3
1.4 RESEARCH DESIGN .................................................................................................. 1-4
1.5 INDEPENDENT AND DEPENDENT VARIABLES........................................................... 1-5
1.6 LEVELS OF MEASUREMENT AND STATISTICAL METHODS ....................................... 1-5
CHAPTER 2: DATA CHECKING.......................................................... 2-1
2.1 INTRODUCTION........................................................................................................ 2-1
2.2 VIEWING A FEW CASES ........................................................................................... 2-3
2.3 MINIMUM, MAXIMUM, AND NUMBER OF VALID CASES .......................................... 2-5
2.4 DATA VALIDATION: DATA PREPARATION ADD-ON MODULE ................................. 2-8
2.5 DATA VALIDATION RULES ...................................................................................... 2-9
2.6 WHEN DATA ERRORS ARE DISCOVERED .............................................................. 2-25
SUMMARY EXERCISES ................................................................................................ 2-27
CHAPTER 3: DESCRIBING CATEGORICAL DATA ....................... 3-1
3.1 WHY SUMMARIES OF SINGLE VARIABLES? ............................................................. 3-1
3.2 FREQUENCY ANALYSIS ........................................................................................... 3-2
3.3 STANDARDIZING THE CHART AXIS........................................................................ 3-11
3.4 PIE CHARTS ........................................................................................................... 3-16
SUMMARY EXERCISES ................................................................................................ 3-19
CHAPTER 4: EXPLORATORY DATA ANALYSIS: SCALE DATA 4-1
4.1 SUMMARIZING SCALE VARIABLES .......................................................................... 4-1
4.2 MEASURES OF CENTRAL TENDENCY AND DISPERSION ............................................ 4-2
4.3 NORMAL DISTRIBUTIONS ........................................................................................ 4-3
4.4 HISTOGRAMS AND NORMAL CURVES ...................................................................... 4-4
4.5 USING THE EXPLORE PROCEDURE: EDA................................................................. 4-7
4.6 STANDARD ERROR OF THE MEAN AND CONFIDENCE INTERVALS .......................... 4-12
4.7 SHAPE OF THE DISTRIBUTION ................................................................................ 4-12
4.8 BOXPLOTS ............................................................................................................. 4-15
4.9 APPENDIX: STANDARDIZED (Z) SCORES................................................................ 4-21
SUMMARY EXERCISES ................................................................................................ 4-25
CHAPTER 5: PROBABILITY AND INFERENTIAL STATISTICS . 5-1
5.1 THE NATURE OF PROBABILITY ................................................................................ 5-1
5.2 MAKING INFERENCES ABOUT POPULATIONS FROM SAMPLES .................................. 5-2
5.3 INFLUENCE OF SAMPLE SIZE ................................................................................... 5-2
5.4 HYPOTHESIS TESTING ........................................................................................... 5-10
5.5 TYPES OF STATISTICAL ERRORS ............................................................................ 5-11
5.6 STATISTICAL SIGNIFICANCE AND PRACTICAL IMPORTANCE .................................. 5-12


CHAPTER 6: COMPARING CATEGORICAL VARIABLES ........... 6-1


6.1 TYPICAL APPLICATIONS .......................................................................................... 6-1
6.2 CROSSTABULATION TABLES ................................................................................... 6-2
6.3 TESTING THE RELATIONSHIP: CHI-SQUARE TEST .................................................... 6-5
6.4 REQUESTING THE CHI-SQUARE TEST ...................................................................... 6-7
6.5 INTERPRETING THE OUTPUT .................................................................................... 6-8
6.6 ADDITIONAL TWO-WAY TABLES .......................................................................... 6-12
6.7 GRAPHING THE CROSSTABS RESULTS ................................................................... 6-16
6.8 ADDING CONTROL VARIABLES ............................................................................. 6-18
6.9 EXTENSIONS: BEYOND CROSSTABS....................................................................... 6-21
6.10 APPENDIX: ASSOCIATION MEASURES ................................................................. 6-21
SUMMARY EXERCISES ................................................................................................ 6-28
CHAPTER 7: MEAN DIFFERENCES BETWEEN GROUPS: T TEST .......... 7-1
7.1 INTRODUCTION........................................................................................................ 7-1
7.2 LOGIC OF TESTING FOR MEAN DIFFERENCES .......................................................... 7-1
7.3 EXPLORING THE GROUP DIFFERENCES .................................................................... 7-6
7.4 TESTING THE DIFFERENCES: INDEPENDENT SAMPLES T TEST ............................... 7-14
7.5 INTERPRETING THE T TEST RESULTS ..................................................................... 7-16
7.6 GRAPHING MEAN DIFFERENCES............................................................................ 7-20
7.7 APPENDIX: PAIRED T TEST.................................................................................... 7-22
7.8 APPENDIX: NORMAL PROBABILITY PLOTS ............................................................ 7-25
SUMMARY EXERCISES ................................................................................................ 7-29
CHAPTER 8: BIVARIATE PLOTS AND CORRELATIONS: SCALE VARIABLES .......... 8-1
8.1 INTRODUCTION........................................................................................................ 8-1
8.2 READING THE DATA ................................................................................................ 8-1
8.3 EXPLORING THE DATA ............................................................................................ 8-2
8.4 SCATTERPLOTS........................................................................................................ 8-6
8.5 CORRELATIONS ..................................................................................................... 8-11
SUMMARY EXERCISES ................................................................................................ 8-16
CHAPTER 9: INTRODUCTION TO REGRESSION .......................... 9-1
9.1 INTRODUCTION AND BASIC CONCEPTS .................................................................... 9-1
9.2 THE REGRESSION EQUATION AND FIT MEASURE .................................................... 9-2
9.3 RESIDUALS AND OUTLIERS ..................................................................................... 9-2
9.4 ASSUMPTIONS ......................................................................................................... 9-3
9.5 SIMPLE REGRESSION ............................................................................................... 9-3
SUMMARY EXERCISES .................................................................................................. 9-8


APPENDIX A: MEAN DIFFERENCES BETWEEN GROUPS: ONE-FACTOR ANOVA .......... A-1
A.1 INTRODUCTION ...................................................................................................... A-1
A.2 EXTENDING THE LOGIC BEYOND TWO GROUPS .................................................... A-1
A.3 EXPLORING THE DATA .......................................................................................... A-3
A.4 ONE-FACTOR ANOVA ......................................................................................... A-4
A.5 POST HOC TESTING OF MEANS............................................................................ A-10
A.6 GRAPHING THE MEAN DIFFERENCES ................................................................... A-18
A.7 APPENDIX: GROUP DIFFERENCES ON RANKS....................................................... A-20
SUMMARY EXERCISES ............................................................................................... A-23
APPENDIX B: INTRODUCTION TO MULTIPLE REGRESSION . B-1
B.1 MULTIPLE REGRESSION ......................................................................................... B-1
B.2 MULTIPLE REGRESSION RESULTS .......................................................................... B-4
B.3 RESIDUALS AND OUTLIERS .................................................................................... B-7
SUMMARY EXERCISES ............................................................................................... B-10
REFERENCES.......................................................................................... R-1
INTRODUCTORY STATISTICS BOOKS ............................................................................ R-1
ADDITIONAL REFERENCES ........................................................................................... R-1
ALTERNATIVE EXERCISES FOR CHAPTERS 8 & 9 AND APPENDIX B .......... E-1
SUMMARY EXERCISES FOR CHAPTER 8........................................................................ E-2
SUMMARY EXERCISES FOR CHAPTER 9........................................................................ E-3
SUMMARY EXERCISES FOR APPENDIX B...................................................................... E-4


Chapter 1: Introduction to Statistical Analysis
Topics:
• Course Goals and Methods
• Basic Steps of Research Process
• Populations and Samples
• Research Design
• Independent and Dependent Variables
• Level of Measurement and Statistical Methods

Data
This course uses the data file GSS2004Intro.sav for most of the examples. These data are a subset
of variables from the General Social Survey. The General Social Survey 2004 (GSS, produced by
the National Opinion Research Center, Chicago) is a survey involving demographic, attitudinal
and behavioral items that include views on government and satisfaction with various facets of
life. Approximately 2,800 U.S. adults were included in the study. However, not all questions were
asked of each respondent, so most analyses will be based on a reduced number of cases. The
survey has been administered nearly every year since 1972 (it is now administered in even years).

1.1 Course Goals and Methods


In this chapter, we begin by briefly reviewing the basic elements of quantitative research and
issues that should be considered in data analysis. We will then discuss some SPSS Statistics
facilities that can be used to check your data. In the remainder of the course, we will cover a
number of statistical procedures that SPSS Statistics performs. This is an application-oriented
course and the approach will be practical. We will discuss:

1) The situations in which you would use each technique,


2) The assumptions made by the method,
3) How to set up the analysis using SPSS Statistics,
4) Interpretation of the results.

We will not derive proofs, but rather focus on the practical matters of data analysis in support of
answering research questions. For example, we will discuss what correlation coefficients are,
when to use them, and how to produce and interpret them, but will not formally derive their
properties. This course is not a substitute for a course in statistics. You will benefit if you have
had such a course in the past, but even if not, you will understand the basics of each technique
after completion of this course.

We will cover descriptive statistics and exploratory data analysis, and then examine relationships
between categorical variables using crosstabulation tables and chi-square tests. Testing for mean
differences between groups using T Tests and analysis of variance (ANOVA) will be considered.
Correlation and regression will be used to investigate the relationships between interval scale
variables. Graphics comprise an integral part of the analyses.


This course assumes you have a working knowledge of SPSS Statistics in your computing
environment. Thus the basic use of menu systems, data definition and labeling will not be
considered in any detail. The analyses in this course will show the locations of the menu choices
and dialog boxes within the overall menu system, and the dialog box selections will be detailed.

Scenario for Analyses


We will perform many analyses on the GSS data. As we review and apply the statistical methods
to this data, it is crucial that you think about how these methods might be used with information
you collect. Although survey data is used for our examples, the same methods can be used with
experimental and archival data.

In part to simplify the course by minimizing the number of data sets, we will produce more
types of analyses on one data set than is typically done. In practice, the number and variety of
analyses performed in a study are a function of the research design: the questions that you, the
analyst, want to ask of the data.

1.2 Basic Steps of Research Process


All research projects can be broken down into a number of discrete components. These
components can be categorized in a variety of ways. We might summarize the main steps as:

1. Specify exactly the aims and objectives of the research along with the main hypotheses

2. Define the population and sample design

3. Choose a method of data collection, design the research and decide upon an appropriate
sampling strategy

4. Collect the data

5. Prepare the data for analysis

6. Analyze the data

7. Report the findings

Some of these points may seem obvious, but it is surprising how often some of the most basic
principles are overlooked, potentially resulting in data that is impossible to analyze with any
confidence. Each step is crucial for a successful research project and it is never too early in the
process to consider the methods that you intend to use for your data analysis. In order to place the
statistical techniques that we will discuss in this course in the broader framework of research
design, we will briefly review some of the considerations of the first steps. Statistics and research
design are highly interconnected disciplines and you should have a thorough grasp of both before
embarking on a research project. This introductory chapter merely skims the surface of the issues
involved in research design. If you are unfamiliar with these principles, we recommend that you
refer to the research methodology literature for more thorough coverage of the issues.


Research Objectives
It is important that a research project begin with a set of well-defined objectives. Yet, this step is
often overlooked or not well defined. The specific aims and objectives may not be addressed
because those commissioning the research do not know exactly which questions they would like
answered. This rather vague approach can be a recipe for disaster and may result in a completely
wasted opportunity as the most interesting aspects of the subject matter under investigation could
well be missed. If you do not identify the specific objectives, you may fail to collect the necessary
information or to ask the necessary questions in the correct form. You can end up with a data file that
does not contain the information that you need for the data analysis step.

For example, you may be asked to conduct a survey "to find out about alcohol consumption and
driving". This general objective could lead to a number of possible survey questions. Rather than
proceeding with this general objective, you need to uncover more specific hypotheses that are of
interest to your organization. This example could lead to a number of very specific research
questions, such as:

“What proportion of people admits to driving while above the legal alcohol limit?”

“What demographic factors (e.g. age/sex/social class) are linked with a propensity to drunk-
driving?”

“Does having a conviction for drunk-driving affect attitudes towards driving while over the
legal limit?”

These specific research questions would then define the questionnaire items. Additionally, the
research questions will affect the definition of the population and the sampling strategy. For
example, the third question above requires that the responder have a drunk-driving conviction.
Given that a relatively small proportion of the general population has such a conviction, you
would need to take that into consideration when defining the population and sampling design. For
example, a simple random sample of the general population would not be recommended, although
several other approaches, which are beyond the scope of this course, could be considered.

Therefore, it is essential to state formally the main aims and objectives at the outset of the
research so the subsequent stages can be done with these specific questions in mind.

1.3 Populations and Samples


In studies involving statistical analysis it is important to be able to characterize accurately the
population under investigation. The population is the group to which you wish to generalize your
conclusions, while the sample is the group you directly study. In some instances the sample and
population are identical or nearly identical; consider the Census of any country in this regard. In
the majority of studies, the sample represents a small proportion of the population.

In the example above, the population might be defined as those people with registered drivers'
licenses. We could select a sample from the drivers' license registration list for our survey. Other
common examples are: membership surveys in which a small percentage of members are sent
questionnaires, medical experiments in which samples of patients with a disease are given
different treatments, marketing studies in which users and non users of a product are compared,
and political polling.


The problem is to draw valid inferences from data summaries in the sample so that they apply to
the larger population. In some sense you have complete information about the sample, but you
want conclusions that are valid for the population. An important component of statistics and a
large part of what we cover in the course involves statistical tests used in making such inferences.
Because the findings can only be generalized to the population under investigation, you should
give careful thought to defining the population of interest to you and making certain that the
sample reflects this population. The survey research literature—for example Sudman (1976) or
Salant and Dillman (1994)—reviews these issues in detail. To state it in a simple way, statistical
inference provides a method of drawing conclusions about a population of interest based on
sample results.

1.4 Research Design


With specific research goals and a target population in mind, it is then possible to begin the
design stage of the research. There are many things to consider at the design stage. We will
consider a few issues that relate specifically to data analysis and statistical techniques. This is not
meant as a complete list of issues to consider. For example, for survey projects, the mode of data
collection, question selection and wording, and questionnaire design are all important
considerations. Refer to the survey research literature mentioned above as well as general
research methodology literature for discussion of these and other research design issues.

First, you must consider the type of research that will be most appropriate to the research aims
and objectives. Two main alternatives are survey research and experimental research. The data
may be recorded using either objective or subjective techniques. The former includes items
measured by an instrument and by computer such as physiological measures (e.g. heart-rate)
while the latter includes observational techniques such as recordings of a specific behavior and
responses to questionnaire surveys.

Most research goals lend themselves to one particular form of research, although there are cases
where more than one technique may be used. For example, a questionnaire survey would be
inappropriate if the aim of the research was to test the effectiveness of different levels of a new
drug to relieve high blood pressure. This type of work would be more suited to a tightly
controlled experimental study in which the levels of the drug administered could be carefully
controlled and objective measures of blood pressure could be accurately recorded. On the other
hand, this type of laboratory-based work would not be a suitable means of uncovering people’s
voting intentions.

The classic experimental design consists of two groups: the experimental group and the control
group. They should be equivalent in all respects other than that those in the former group are
subjected to an effect or treatment and the latter is not. Therefore, any differences between the
two groups can be directly attributed to the effect of this treatment. The treatment variables are
usually referred to as independent variables, and the quantity being measured as the effect is the
dependent variable. There are many other research designs, but most are more elaborate
variations on this basic theme.

In survey research, you rarely have the opportunity to implement such a rigorously controlled
design. However, the same general principles apply to many of the analyses you perform.


1.5 Independent and Dependent Variables


In general, the dependent (sometimes referred to as the outcome) variable is the one we wish to
study as a function of other variables. Within an experiment, the dependent variable is the
measure expected to change as a result of the experimental manipulation. For example, a drug
experiment designed to test the effectiveness of different sleeping pills might employ the number
of hours of sleep as the dependent variable. In surveys and other non-experimental studies, the
dependent variable is also studied as a function of other variables. However, no direct
experimental manipulation is performed; rather the dependent variable is hypothesized to vary as
a result of changes in the other (independent) variables.

Correspondingly, independent (sometimes referred to as predictor) variables are those used to
measure features manipulated by the experimenter in an experiment. In a non-experimental study,
they represent variables believed to influence or predict a dependent measure.

Thus terms (dependent, independent) reasonably applied to experiments have taken on more
general meanings within statistics. Whether such relations are viewed causally, or as merely
predictive, is a matter of belief and reasoning. As such, it is not something that statistical analysis
alone can resolve. To illustrate, we might investigate the relationship between starting salary
(dependent) and years of education, based on survey data, and then develop an equation
predicting starting salary from years of education. Here starting salary would be considered the
dependent variable although no experimental manipulation of education has been performed. One
way to think of the distinction is to ask yourself which variable is likely to influence the other. In
summary, the dependent variable is believed to be influenced by, or be predicted by, the
independent variable(s).

Finally, in some studies, or parts of studies, the emphasis is on exploring and characterizing
relationships among variables with no causal view or focus on prediction. In such situations there
is no designation of dependent and independent. For example, in crosstabulation tables and
correlation matrices the distinction between dependent and independent variables is not
necessary. It rather resides in the eye of the beholder (researcher).

1.6 Levels of Measurement and Statistical Methods


The term, levels of measurement, refers to the properties and meaning of numbers assigned to
observations for each item. Many statistical techniques are only appropriate for data measured at
particular levels or combinations of levels. Therefore, when possible, you should determine the
analyses you will be using before deciding upon the level of measurement to use for each of your
variables. For example, if you want to report and test the mean age of your sample, you will need
to ask their age in years (or year of birth) rather than asking them to choose an age group into
which their age falls.

Because measurement type is important when choosing test statistics, we briefly review the
common taxonomy of level of measurement. For an interesting discussion of level of
measurement and statistics see Velleman and Wilkinson (1993).

The four major classifications that follow are found in many introductory statistics texts. They are
presented beginning with the weakest and ending with those having the strongest measurement
properties. Each successive level can be said to contain the properties of the preceding types and
to record information at a higher level.


• Nominal — In nominal measurement each numeric value represents a category or group
identifier only. The categories cannot be ranked and have no underlying numeric value.
An example would be marital status coded 1 (Married), 2 (Widowed), 3 (Divorced), 4
(Separated) and 5 (Never Married); each number represents a category and the matching
of specific numbers to categories is arbitrary. Counts and percentages of observations
falling into each category are appropriate summary statistics. Such statistics as means
(the average marital status?) would not be appropriate, but the mode would be
appropriate (the biggest category: Married?).

• Ordinal — For ordinal measures the data values represent ranking or ordering
information. However, the differences between values along the scale are not necessarily equal.
An example would be specifying how happy you are with your life, coded 1 (Very
Happy), 2 (Happy), and 3 (Not Happy). There are specific statistics associated with
ranks; SPSS Statistics provides a number of them mostly within the Nonparametric and
Ordinal Regression procedures. The mode and median can be used as summary statistics.

• Interval — In interval measurement, a unit increase in numeric value represents the same
change in quantity regardless of where it occurs on the scale. For interval scale variables
such summaries as means and standard deviations are appropriate. Statistical techniques
such as regression and analysis of variance assume that the dependent (or outcome)
variable is measured on an interval scale. Examples might be temperature in degrees
Fahrenheit or IQ score.

• Ratio — Ratio measures have interval scale properties with the addition of a meaningful
zero point; that is, zero indicates complete absence of the characteristic measured. For
statistics such as ANOVA and regression only interval scale properties are assumed, so
ratio scales have stronger properties than necessary for most statistical analyses. Health
care researchers often use ratio scale variables (number of deaths, admissions,
discharges) to calculate rates. The ratio of two variables with ratio scale properties can
thus be directly interpreted. Money is an example of a ratio scale, so someone with
$10,000 has ten times the amount as someone with $1,000.

The distinction between the four types is summarized in Table 1.1.

Table 1.1 Level of Measurement Properties

Level of Measurement   Categories   Ranks   Equal Intervals   True Zero Point
Nominal                    ✓
Ordinal                    ✓          ✓
Interval                   ✓          ✓           ✓
Ratio                      ✓          ✓           ✓                 ✓

These four levels of measurement are often combined into two main types: categorical, consisting
of the nominal and ordinal measurement levels, and continuous (scale), consisting of the interval and
ratio measurement levels.


The measurement level variable attribute in SPSS Statistics recognizes three measurement levels:
Nominal, Ordinal and Scale. The icon indicating the measurement level is displayed preceding
the variable name or label in the variable lists of all dialog boxes. Table 1.2 shows the most
common icons used for the measurement levels. Special data types, such as Date and Time
variables, have distinct icons not shown in this table.

Table 1.2 Variable List Icons

Measurement Level   Numeric Data Type   String Data Type
Nominal             [nominal icon]      [nominal icon]
Ordinal             [ordinal icon]      Not Applicable
Scale               [scale icon]        Not Applicable
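If you work with command syntax, the measurement level attribute can also be set with the
VARIABLE LEVEL command. The following is a minimal sketch using a few variable names from
the GSS file in this course; adjust the variable lists to your own data.

* Sketch: declare measurement levels for selected variables.
VARIABLE LEVEL MARITAL (NOMINAL)
  /HAPMAR HEALTH CONFINAN (ORDINAL)
  /AGE EDUC TVHOURS (SCALE).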

Note: Rating Scales and Dichotomous Variables


A common scale used in surveys and market research is an ordered rating scale usually consisting
of five- or seven-point scales. Such ordered scales are also called Likert scales and might be coded
1 (Strongly Agree, or Very Satisfied), 2 (Agree, or Satisfied), 3 (Neither agree nor disagree, or
Neutral), 4 (Disagree, or Dissatisfied), and 5 (Strongly Disagree, or Very Dissatisfied). There is
an ongoing debate among researchers as to whether such scales should be considered ordinal or
interval. SPSS Statistics contains procedures capable of handling such variables under either
assumption. When in doubt about the measurement scale, some researchers run their analyses
using two separate methods, since each makes different assumptions about the nature of the
measurement. If the results agree, the researcher has greater confidence in the conclusion.

Dichotomous (binary) variables containing two possible responses (often coded 0 and 1) are often
considered to fall into all of the measurement levels except ratio. As we will see, this flexibility
allows them to be used in a wide range of statistical procedures.

Statistics are available for variables at all measurement levels, and it is important to match the
proper statistic to a given level of measurement. In practice your choice of statistical method
depends on the questions you are interested in asking of the data and the nature of the
measurements you make. Table 1.3 suggests which statistical techniques are most appropriate,
based on the measurement level of the dependent and independent variables. Much more
extensive diagrams and discussion are found in Andrews et al. (1981). Recall that ratio variables
can be considered as interval scale for analysis purposes. If in doubt about the measurement
properties of your variables, you can apply a statistical technique that assumes weaker
measurement properties and compare the results to methods making stronger assumptions. A
consistent answer provides greater confidence in the conclusions.


Table 1.3 Statistical Methods and Level of Measurement

Dependent          Independent Variables
Variable           Nominal                   Ordinal                     Interval/Ratio
Nominal            Crosstabs                 Crosstabs                   Discriminant,
                                                                         Logistic Regression
Ordinal            Nonparametric tests,      Nonparametric correlation   Ordinal Regression
                   Ordinal Regression
Interval/Ratio     T Test, ANOVA             Nonparametric correlation   Correlation,
                                                                         Regression


Chapter 2: Data Checking


Topics:
• Viewing a Few Cases
• Basic Data Validation: Minimum, maximum and number of cases
• Data Validation Rules
o Creating single-variable rules
o Creating cross-variable rules
o Applying validation rules
• When Data Errors are Discovered

Data
This chapter uses the data file GSS2004PreClean.sav. This data set is a version of the
GSS2004Intro.sav file into which we have introduced an out-of-range value in CONFINAN and
an inconsistent response in HAPMAR to allow us to demonstrate some data checking features of
SPSS Statistics. All other values are unchanged.

This course assumes that the training files are located in the c:\Train\Stats folder. If you are not
working in an SPSS Training center, the training files may be in a different folder structure. If
you are running SPSS Statistics Server, then these files can be located on the server or a machine
that can be accessed (mapped from) the server.

2.1 Introduction
When working with data it is important to verify their validity before proceeding with the
analysis. Web surveys are often collected using software, such as SPSS Dimensions, that
automatically checks whether an answer is valid (for example, whether the response is within an
acceptable range) and whether it is consistent with previous information. Methods such as
double-entry verification, a technique in which two people independently enter the data into a
computer and the values are compared for discrepancies, can be implemented using data entry
software such as SPSS Data Entry. If you are reading your data from some other source, you can
use the SPSS Data Preparation add-on module to construct validation rules to check the values of
single variables and consistency across variables. We will demonstrate this feature in this chapter.
You can also use some basic features of SPSS Statistics Base as a first step in examining your data
and checking for inconsistencies. Although mundane, time spent examining data in this way early
on will reduce false starts, misleading analyses, and rework later. For this reason data checking is
a critical prelude to statistical analysis.

Note about Default Startup Folder and Variable Display in Dialog Boxes
In this course, all of the files used for the demonstrations and exercises are located in the folder
c:\Train\Stats. You can set the startup folder that will appear in all Open and Save dialog boxes.
We will use this option to set the startup folder.

Click Edit...Options, then click the File Locations tab


Click the Browse button to the right of the Data Files text box


Select Train from the Look In: drop down list, then select Stats from the list of folders
and click Set button
Click the Browse button to the right of the Other Files text box
Move to the Train\Stats folder (as above) and click Set button

Figure 2.1 Set Default File Location in the Edit Options Dialog Box

Note: If the course files are stored in a different location, your instructor will give you
instructions specific to that location.
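Note: the working directory can also be changed from syntax with the CD command. A one-line
sketch, assuming the default training path used in this course:

CD 'C:\Train\Stats'.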

Either variable names or longer variable labels will appear in list boxes in dialog boxes.
Additionally, variables in list boxes can be ordered alphabetically or by their position in the file.

In this course, we will display variable names in alphabetical order within list boxes. Since the
default setting within SPSS Statistics is to display variable labels in file order, we will change this
before accessing data.

Click General tab (Not shown)


Select Display names in the Variable Lists group on the General tab
Select Alphabetical
Click OK and OK in the information box to confirm the change


2.2 Viewing a Few Cases


Often the first step in checking data previously entered on the computer is to view the first few
observations and compare their data values to the original data worksheets, survey forms, or
database records. This will detect many gross errors of data definition such as reading alphabetic
characters as numeric data fields or incorrectly formatted spreadsheet data. Viewing the first few
cases can be easily accomplished using the Data Editor window in SPSS Statistics or the Case
Summaries procedure. Below we view part of the 2004 General Social Survey data in the Data
Editor window.

Click File…Open…Data
Double-click GSS2004PreClean.sav and click Open
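The same file can be opened with syntax; a minimal sketch, assuming the default training folder:

GET FILE='C:\Train\Stats\GSS2004PreClean.sav'.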

Figure 2.2 General Social Survey 2004 Data in Data Editor Window

The first few responses can be compared to the original data source or surveys as a preliminary
test of data entry. If errors are found, corrections can be made directly within the Data Editor
window. (If you do not see the data values but labels instead, click on the Value Labels tool
button on the Toolbar.)

The Case Summaries procedure can list values of individual cases for selected variables. This
allows you to more easily check variables that may be separated in the Data Editor, or request
additional statistics.

Click Analyze…Reports…Case Summaries


Move HEALTH, RINCOME, and SIBS into the Variables list box.
Type 20 into the Limit cases to first text box


Note we limit the listing to the first 20 cases (the default is 100). The Case Summaries procedure
can also display case listings and summary statistics for groups of cases as defined by the
Grouping Variable(s).

Figure 2.3 Case Summaries Dialog Box

Click OK
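If you prefer command syntax, the Paste button in this dialog box generates a SUMMARIZE
command. The sketch below shows roughly what is produced with the settings above; exact
subcommands may vary by version, so treat it as an illustration rather than the definitive pasted
output.

SUMMARIZE
  /TABLES=HEALTH RINCOME SIBS
  /FORMAT=VALIDLIST NOCASENUM TOTAL LIMIT=20
  /TITLE='Case Summaries'
  /MISSING=VARIABLE
  /CELLS=COUNT.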


Figure 2.4 Case Summaries Listing of First Twenty Cases

Case Summaries (a)

Case   In general, how is your health?   RESPONDENTS INCOME   NUMBER OF BROTHERS AND SISTERS
1 GOOD . 3
2 . . 7
3 EXCELLENT NA 7
4 . $25000 OR MORE 10
5 . $25000 OR MORE 2
6 GOOD $15000 - 19999 3
7 EXCELLENT $25000 OR MORE 2
8 . . 5
9 GOOD . 3
10 . $25000 OR MORE 1
11 . . 2
12 . $25000 OR MORE 8
13 . $10000 - 14999 3
14 GOOD $25000 OR MORE 4
15 EXCELLENT $20000 - 24999 1
16 . $3000 TO 3999 7
17 GOOD $3000 TO 3999 18
18 EXCELLENT $15000 - 19999 5
19 . . 4
20 . $25000 OR MORE 3
Total N 9 13 20
a. Limited to first 20 cases.

By default, SPSS Statistics displays value labels in case listings; this can be modified within the
Options dialog box (click Edit…Options, then move to the Output Labels tab). We use the case
listing to look for potential problems, such as too much missing data. Looking at Figure 2.4
(reformatted for presentation), there is certainly a lot of system-missing data (shown as a period) for
General Health, but not all questions are asked of all respondents in the General Social Survey, so
this is most likely not of concern. The value of “NA” means “No Answer,” which appears as a
response to the Respondent's Income question. There are no missing data in the first 20 cases for
the number of siblings question.

2.3 Minimum, Maximum, and Number of Valid Cases


A second simple data check that can be done within SPSS Statistics is to request descriptive
statistics on all numeric variables. By default, the Descriptives procedure will report the mean,
standard deviation, minimum, maximum and number of valid cases for each numeric variable.
While the mean and standard deviation are not relevant for nominal variables (see Chapter 1), the
minimum and maximum values will signal any out-of-range data values. In addition, if the
number of valid observations is suspiciously small for a variable, it should be explored carefully.
Since Descriptives provides only summary statistics, it will not indicate which observation
contains an out-of-range value, but that can be easily determined once the data value is known.
The Data Validation procedure in the Data Preparation add-on module can be used to check for


specific values or ranges of values in a variable and list the violating cases. We will demonstrate
that procedure in the next section.

Click Analyze…Descriptive Statistics…Descriptives


Move all variables except ID into the Variable(s) list box (use shift-click to select all
variables and then ctrl-click on ID to de-select it)

Figure 2.5 Descriptives Dialog Box

Only numeric variables appear in the variable list box.

Click OK
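The equivalent syntax is the DESCRIPTIVES command. A sketch naming just a few of the
variables mentioned in this chapter (the pasted command would list all the variables you selected):

DESCRIPTIVES VARIABLES=AGE EDUC EMAILHR TVHOURS CONARMY CONFINAN
  /STATISTICS=MEAN STDDEV MIN MAX.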


Figure 2.6 Descriptives Output (Beginning)

We can see the minimum, maximum and number of valid cases for each variable in the data set.
By examining such variables as EDUC (Highest Year of School Completed), EMAILHR (Email
Hours per Week) and AGE (Age of Respondent) we can determine if there are any unexpected
values. The maximum for EMAILHR looks rather high (50) and we might want to investigate this
further. Note that all of the "Confidence" variables have a maximum value of 3 except for the
Confidence in banks & financial institutions. We will investigate this further later in the chapter.


Figure 2.7 Descriptives Output (End) Showing Valid Listwise

The valid number of observations (Valid N) is listed for each variable. The number of valid
observations listwise indicates how many observations have complete data for all variables, a
useful bit of information. Here it is zero because not all questions are asked of, nor are relevant
to, any single individual. If unusual or unexpected values are discovered in these summaries we
can locate the problem observations using data selection (Data...Select Cases on the menu) or the
Find function (under Edit menu) in the Data Editor window. Or, we can use the Data Preparation
add-on module data validation feature to define rules and clean the data.
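For example, once Descriptives suggests an out-of-range value, a quick way to locate the offending
cases in SPSS Statistics Base is a temporary case selection followed by a listing. The following is a
sketch only, and it assumes that the valid codes for CONFINAN are 1 through 3:

* List cases whose CONFINAN value falls outside the expected range.
TEMPORARY.
SELECT IF (NOT MISSING(CONFINAN) AND (CONFINAN < 1 OR CONFINAN > 3)).
LIST VARIABLES=ID CONFINAN.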

2.4 Data Validation: Data Preparation Add-On Module


The task of data checking and fixing becomes more complicated and time-consuming as data
files, and data warehouses, grow ever larger, so more automatic methods to create a “clean” data
file are helpful. The SPSS Data Preparation add-on module allows you to identify errors in data
values/variables, excessive missing data in variables or cases, or unusual data values in your data
file. Both data errors and unusual values can influence reporting and data analysis, depending on
their frequency of occurrence and actual values.

The Data Preparation module contains two procedures for data checking:

• Validate Data helps you define rules and conditions to run checks to identify invalid
values, variables, and cases.
• Anomaly Detection searches for unusual cases based on deviations from the norms (of
cluster groups). The procedure is designed to quickly detect unusual cases in the
exploratory data analysis step, prior to any inferential data analysis.

Note: The Data Preparation module also includes the Optimal Binning procedure which defines
optimal bins for one or more scale variables based on a categorical variable that “supervises” the
process. The binned variable can then be used instead of the original data values for further
analysis. This procedure is discussed in the Data Management and Manipulation with SPSS
Statistics course.


In this chapter we will use the Validate Data procedure (VALIDATEDATA in syntax), which is
the basis for all data cleaning.

As with the data transformation facilities in SPSS Statistics, Validate Data requires input
from you to be effective. You need to review the variable definitions in your file and determine
valid values. This also includes cross-variable rules (e.g., no customers should rate a product they
don’t own), or combinations of values that are commonly miscoded. You then need to create the
equivalent rules in the Validate Data dialog box and apply them to your data file. The more effort
you put into the rules, the more the payoff in cleaner data.
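One convenient starting point for reviewing variable definitions is the data dictionary itself. A
small sketch, restricted to a few variables used later in this chapter:

DISPLAY DICTIONARY /VARIABLES=MARITAL HAPMAR TVHOURS CONFINAN.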

2.5 Data Validation Rules


A rule is used to determine whether a case is valid. There are two types of validation rules:

• Single-variable rules – Single-variable rules consist of a fixed set of checks that apply to
a single variable, such as checks for out-of-range values. For single-variable rules, valid
values can be expressed as a range of values or a list of acceptable values.
• Cross-variable rules – Cross-variable rules are user-defined rules that are commonly
applied to a combination of variables. Cross-variable rules are defined by a logical
expression that flags invalid values.

There are also basic checks available which look for problems in individual variables. These
checks detect excessive missing data, a minimal amount of variation in values (small standard
deviation), or many cases with the same value (data “heaping”).

Validation rules are saved to the data dictionary of your data file. This allows you to specify a
rule once and then reuse it. Rules can also be used in other data files (through the Copy Data
Properties facility).

Creating Single-Variable Rules


We’ll open the Validate Data dialog box, review its options, and then create some single-variable
rules in this section. Validate Data is accessed from the Data menu.

Click Data…Validation

There are three menu selections under Validation. The last choice (Validate Data) opens the
complete dialog box that allows you to define rules and then apply them to the active dataset. The
first two choices (Load Predefined Rules and Define Rules) allow you to load rules from an
existing data file shipped with SPSS Statistics, or to simply define rules for a set of variables/cases
without necessarily applying them to the data. We’ll say a bit more about these near the end of
the chapter.


Figure 2.8 Validation Menu Choices

Click Validate Data

The first step in using Validate Data is to specify the variables we wish to check (Analysis
Variables:) with single variable rules and basic variable checks. Optionally, you can also select
one or more case identification variables to check for duplicate or incomplete IDs, and to label
casewise output.

Figure 2.9 Validate Data Variable Tab

In this example we’ll define rules for only a few variables, but we’ll do basic checks on all of
them.


Move all the variables except ID to the Analysis Variables list


Move the variable ID to the Case Identifier Variables: list (not shown)
Click Basic Checks tab

The Basic Checks tab allows you to select basic checks for analysis variables, case identifiers,
and whole cases. You can perform the following data checks on the variables selected on the
Variables tab.

• Maximum percentage of missing values


• Maximum percentage of cases in a single category for categorical (nominal, ordinal)
variables
• Maximum percentage of categories with a count of 1 for categorical variables
• Minimum coefficient of variation for scale variables
• Minimum standard deviation for scale variables

Additionally, if you selected any case identifier variables on the Variables tab, you can flag
incomplete IDs (values for ID variables which are missing or blank). You can also flag duplicate
IDs in the file. Finally, you can flag empty cases, where all variables are empty or blank.

Figure 2.10 Validate Data Basic Checks Tab

We’ll use the default settings for the basic checks.

Click the Single-Variable Rules tab


Figure 2.11 Validate Data Single-Variable Rules Tab

The Single-Variable Rules tab displays available single-variable validation rules and allows you
to apply them to analysis variables. There are none defined yet. The list shows analysis variables,
summarizes their distributions (with a bar chart or histogram), lists the minimum and maximum
values, and shows the number of rules applied to each variable. Note that user- and system-
missing values are not included in the summaries. The Display drop-down list controls which
variables are shown. You can choose from all variables, numeric variables, string variables, or
date variables.

We need to define some rules so we can apply them to the analysis variables, and we do this in
the Define rules dialog box.

Click Define Rules…


Figure 2.12 Validate Data Define Single-Variables Rules Dialog

When the dialog box is opened it shows a placeholder rule named SingleVarRule 1 (you can have
spaces in rule names). Rules can be defined for numeric, string, or date variables. The selections
change somewhat depending on the variable type. With the exception of the variable GENDER,
the GSS2004 data has only numeric variables, so we’ll concentrate on that type in this example.

Rules must have a unique name (including the set of cross-variable rules). Valid values can be
defined as either falling within a range, or in a specific list of values (selected from the Valid
Values: dropdown). Noninteger values are acceptable by default. Also by default, missing values
will be included as valid values. This doesn’t imply that they aren’t flagged as missing.
Otherwise, though, missing values would be flagged as invalid, which is probably inappropriate
for user-missing values. If you don’t expect any blank values in a numeric variable, you might
wish to uncheck the Allow system-missing values check box.

In practice, you would check all of your variables. To illustrate, we make checks for some of the
variables:

Number of TV hours watched per day (TVHOURS)     Logically shouldn't be above 12 hours
"Confidence" variables (CONARMY to CONSCI)       Should be 1, 2, or 3

Note: The General Social Survey has been "cleaned" so we would not expect to find coding or
entry errors.


Before proceeding, note that you should make a similar list for all the variables in the
file you wish to validate. Then you can see which rules need to be defined, and which rules can
be used for multiple variables.

We could use the Within a range choice for all these, but to demonstrate the In a list option we’ll
use that for the "Confidence" variables.

Change the Name text to TVHours Outliers


Enter 0 for the Minimum value and 12 for the Maximum value

Figure 2.13 TVHours Outliers Rule Defined

The rule is automatically stored; you don’t need to click OK to create it. We can simply define
the next rule.

Click New
Change the Name text to Confidence
Click the Valid Values dropdown and select In a list
Enter the values 1, 2, and 3 on successive rows (shown in Figure 2.14)


Figure 2.14 Confidence Rule Defined

Once you have defined all the single-variable rules you need, you must select the variables to
which each rule applies.

Click Continue

To apply the rules, select one or more variables and check all rules that you want to apply in the
Rules list in the Single-Variable Rules tab.

We see the two rules that we just defined in the Rules list. They are applied by selecting the
variable(s) to which they apply, and then clicking on the check box. More than one rule can be
applied to a variable and a rule can be applied to more than one variable. The Rules list shows
only rules that are appropriate for the selected analysis variables. These rules are available for all
of the numeric variables, but none of them will be listed for the string variable, GENDER.

We will now set the rules for the variables of interest.

Click on the variable TVHOURS in the Analysis Variables list


With TVHOURS selected, click check box for TVHours Outliers rule


Figure 2.15 TVHours Outlier Rule Applied

We want to apply the Confidence rule to the set of variables asking about confidence with various
organizations.

Select all variables from CONARMY to CONSCI


Click the Confidence rule (not shown)
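For comparison, roughly equivalent single-variable checks can be hand-rolled in SPSS Statistics
Base. The sketch below counts how many cases violate each rule; the flag variable names (tv_bad,
conf_bad) are ours and are not part of the GSS file.

* Flag values that fall outside the rules defined above.
COMPUTE tv_bad = (NOT MISSING(TVHOURS) AND (TVHOURS < 0 OR TVHOURS > 12)).
COMPUTE conf_bad = (NOT MISSING(CONFINAN) AND NOT ANY(CONFINAN, 1, 2, 3)).
FREQUENCIES VARIABLES=tv_bad conf_bad.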

Creating a Cross-Variable Rule


In most data sets certain relations must hold among variables if the data are recorded properly.
This is especially true with surveys containing filter questions or skip patterns. For example, if
someone is not currently married, then his/her happiness with marriage should have a missing
code. Such relations can be easily checked with cross-variable rules. A cross-variable rule is
defined by creating a logical expression that will evaluate to true or false (1 or 0). The expression
should evaluate to 1 for the invalid cases.

The logic of the cross-variable rule will depend on whether some of the key data are missing or
not. Figure 2.16 depicts the relationship between marital status and happiness with marriage.
Unlike normal crosstab tables, the missing data categories are also displayed. In order to display
this table, we needed to include the missing data in the crosstab. To accomplish this, we recoded
the system-missing values in HAPMAR to zero (0). To include user-missing values, you would
need to temporarily remove the property of user-missing or paste the syntax for CROSSTABS
and add the subcommand "MISSING=INCLUDE" as in:


CROSSTABS
/TABLES=MARITAL BY HAPMAR
/MISSING INCLUDE
/FORMAT= AVALUE TABLES
/CELLS= COUNT
/COUNT ROUND CELL.
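The recode of system-missing HAPMAR values to zero mentioned above could be done with a
RECODE command such as the following sketch; the new variable name HAPMAR0 and its label
are ours, introduced only for illustration.

RECODE HAPMAR (SYSMIS=0) (ELSE=COPY) INTO HAPMAR0.
VALUE LABELS HAPMAR0 0 'Not asked / system missing'.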

For ease of interpretation, we also displayed the data values along with the value labels by
changing the Edit…Options, Output Labels.

Figure 2.16 Relationship between Marital Status and Happiness with Marriage (All Cases)

The Happiness with Marriage question was only asked of a subset (682) of the married
respondents. Thus, several married respondents are also system-missing on the happiness
question because it wasn't asked. Those married respondents who were asked the question, but
didn't answer should be coded "No Answer" on the happiness question; there are six married
respondents who didn't answer. But, if a respondent is not currently married, they can not provide
a valid response to the happiness question. Note that we have introduced an error, circled on the
table; a never married person is incorrectly coded as "Very Happy" for happiness with marriage.

This cross-variable check is more easily accomplished by defining a cross-variable rule.

Click Cross-Variables Rule tab


Click Define Rules pushbutton


Figure 2.17 Validate Data: Define Cross-Variable Rules Dialog

The dialog box is similar to the Compute dialog box. Although this tab is intended for cross-variable
rules, a rule defined here can also reference only a single variable; this is useful when you need a
rule more complex than the options on the Single-Variable Rules tab allow.

The same functions available in Compute are available here. There is a placeholder rule called
CrossVarRule 1 by default. The logical expression is created in the Logical Expression text box,
either by typing it in directly or using the variable list, calculator buttons, and functions.

Change the text in the Name box to MaritalHappiness

The married respondents are coded 1 on the variable MARITAL. As shown above, married
respondents can have any value on the HAPMAR variable, but all other values of MARITAL must
be system-missing on HAPMAR. Thus the logical condition that identifies invalid cases is:
MARITAL is not equal to 1 and HAPMAR is not system-missing. To express this, we will use the
SYSMIS function, as in MARITAL ~= 1 & ~(SYSMIS(HAPMAR)), where the symbol ~ means "not".
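
If you do not have the Data Preparation add-on module, or simply want a quick check, the same
condition can also be evaluated with core transformation syntax. The following is a minimal sketch;
the flag variable name is ours, not one created by Validate Data:

* Flag cases where a non-married respondent has a non-missing HAPMAR value.
COMPUTE MaritalHappinessFlag = (MARITAL ~= 1 & ~SYSMIS(HAPMAR)).
FREQUENCIES VARIABLES=MaritalHappinessFlag.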


Click the variable MARITAL


Click Insert (Alternatively, drag and drop MARITAL to the Logical Expression text box.)
Click Not Equal button or type ~=
Leave a space and type 1

Click Ampersand (and) button or type & (be sure to leave spaces around the
ampersand)
Click Not button or type ~

Select Missing Values from the Display dropdown list in the Functions and Special
Variables area
Click Sysmis from the Functions and Special Variables list
Click Insert below the Function: list
Click the variable HAPMAR
Click Insert below the Variables: list

The dialog box should now look like Figure 2.18 below.

Figure 2.18 Cross-Variable Rule for MaritalHappiness

Click Continue

When you return to the Cross-Variable Rules tab, the rule you just defined is listed (not shown),
with the Apply check box turned on (so the rule will be applied).


We could run the procedure now, but we first examine the settings on the Output and Save tabs.

Click Output tab

Figure 2.19 Validate Data Output Tab

This tab controls what output is created in the Viewer window.

Violations can be listed for each case (the default), using both single- and cross-variable rules. In
large files, the maximum setting for the number of cases in the report should be increased above
100 (the upper limit is 1,000). Also by default, violations will be displayed by variable for all
single-variable rules. You can also, by checking Summarize violations by rule, request a summary
of violations by rule. A check box (Move cases with validation rule…) will move cases with
violations to the top of the data file so they are easier to locate. In a large file this may be
especially helpful.

We’ll use the default settings.

You can also request to create new variables that will flag cases with violations. This is requested
from the Save tab.

Click Save tab


Figure 2.20 Validate Data Save Tab

Options are available to save flag variables (coded 0 and 1) that identify cases with no data,
duplicate IDs, or incomplete IDs. Another choice creates a variable that counts the number of
rule violations (single- and cross-variable) for each case. This can be useful in detecting cases that
have major problems.

The Save indicator variables that record all validation rule violations check box will create a flag
variable for every rule you have applied. It will record whether that rule was violated for each
case in the data. Although this option can create a large number of new variables, it makes it easy
to locate cases with violations. We’ll select this option to see its effect as well as the variable of
the count of rule violations for each case.

Click Validation rule violations check box in Summary Variables list


Click Save indicator variables that record all validation rule violations check box

We are ready to validate the GSS2004 data.

Click OK

The first table of output, displayed in Figure 2.21, is seen in almost every application of Validate
Data. It simply tells us that not every possible problem was found in the data. Recall that we
asked for the basic checks, including such problems as excessive missing data (above 70%),
standard deviations of 0, and so forth.


Figure 2.21 Warning Message from Validate Data

Warnings
Some or all requested output is not displayed because all cases, variables, or data
values passed the requested checks.

The next table reports the violations of these basic variable checks.

Figure 2.22 Variable Checks Table of Basic Check Violations

Variable Checks

Categorical   Cases Missing > 70%
                Taking things all together, how would you describe your marriage? Would you say that your marriage
                Visited web site for News and current events in past 30 days
                DOES GUN BELONG TO R
                Visited web site for Travel Information in past 30 days
                How would you rate your ability to use the World Wide Web?
              Cases Constant > 95%
                A DEATH OF SPOUSE
                A DEATH OF CHILD
                A DEATH OF PARENTS
                INFERTILITY, UNABLE TO HAVE A BABY
                DRINKING PROBLEM
                CHILD ON DRUGS, DRINKING PROBLEM
Scale         Cases Missing > 70%
                EMAIL HOURS PER WEEK
                EMAIL MINUTES PER WEEK

Each variable is reported with every check it fails.

Because several questions were asked of only a subset of cases, a few variables are in violation of
the basic rule of excessive missing data (above 70%). Note that the violations of this rule are
reported separately for categorical and scale variables as you have defined them in the
Measurement Level variable property. Additionally, six variables were reported in violation of
the basic constant rule (greater than 95% of the cases in one category). These six variables asked
people to report (Yes or No) whether they had each of these events occur in the last year. You
would expect these events to be relatively rare in the general population, as reported.

The next table, Rule Description, (not shown) lists the single-variable rules we applied that had at
least one rule violation.


The Variable Summary table summarizes single-variable rule violations. We see that two
variables have rule violations. The Confidence rule was violated by one case on one variable, and
eleven cases watched TV more than 12 hours a day. The details of the violations are not listed. If
these data had not been previously cleaned, we would likely want to check these cases to determine
whether the values were misentered, or investigate whether the values make sense given the values
of other key variables for each case. You should expect more violations if your data have not been
previously checked against rules like these.

Figure 2.23 Variable Summary of Rule Violations

Variable Summary

                                            Rule               Number of Violations
CONFID IN BANKS & FINANCIAL INSTITUTIONS    Confidence                  1
                                            Total                       1
HOURS PER DAY WATCHING TV                   TVHours Outliers           11
                                            Total                      11

The next table, Cross-Variable Rules, shows that the MaritalHappiness rule was violated for one
case.

Figure 2.24 Cross-Variable Rules Violations

Cross-Variable Rules

Rule                 Number of Violations    Rule Expression
MaritalHappiness              1              MARITAL ~= 1 & ~ SYSMIS(HAPMAR)

The last table, shown in Figure 2.25 summarizes violations by case. There were 13 total rule
violations, and they all occurred on separate cases (so no case had more than one violation). The
MaritalHappiness violation, for instance, occurred on case ID 3. The Confidence rule is violated
for case ID 4 and so forth. We can now easily review those cases to see the problems.


Figure 2.25 Case Report Summary of All Rule Violations

Case Report

            Validation Rule Violations                    Identifier
Case    Single-Variable         Cross-Variable            RESPONDENT ID NUMBER (a)
3                               MaritalHappiness          3
4       Confidence (1)                                    4
114     TVHours Outliers (1)                              114
661     TVHours Outliers (1)                              661
1260    TVHours Outliers (1)                              1260
1512    TVHours Outliers (1)                              1512
1570    TVHours Outliers (1)                              1570
1604    TVHours Outliers (1)                              1604
1947    TVHours Outliers (1)                              1947
2100    TVHours Outliers (1)                              2100
2280    TVHours Outliers (1)                              2280
2401    TVHours Outliers (1)                              2401
2679    TVHours Outliers (1)                              2679
a. The number of variables that violated the rule follows each rule.

Now we can view the new variables added to the file.

Return to the Data Editor


Click the Data View tab if necessary

Scroll to the right until the end of the variables


Click on the Value Labels tool to turn off the value labels (if necessary)

As shown in Figure 2.26, eleven variables were added to the file. They are coded 0 (no violation)
or 1 (violation). We applied single-variable rules to 9 variables, so there is a new flag variable for
each of these 9. There is also a flag variable for the one cross-variable rule, and another variable
(ValidationRuleViolations) that provides a total count of the number of violations for each case.
We have marked two of the cases in violation in the figure: case 3 violates the MaritalHappiness
cross-variable rule and case 4 violates the Confidence rule on the CONFINAN variable. Each case
had only one rule violation.

In this way, you can easily locate cases with particular problems to continue data cleaning.
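
For example, to list only the cases with at least one violation, you can use a temporary case
selection. The following is a sketch; it assumes the respondent identifier variable is named ID:

TEMPORARY.
SELECT IF (ValidationRuleViolations > 0).
LIST VARIABLES=ID MARITAL HAPMAR CONFINAN TVHOURS.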


Figure 2.26 New Flag Rule Violation Variables

Applying Validation Rules


Validation rules are stored in the data dictionary of a data file. This means that they will be
available in the future if a file is saved after defining a set of rules. In addition, rules can be
applied from one file to another using the Copy Data Properties facility.

SPSS Inc. supplies a file along with the software (in the Statistics17/lang sub-directories) that
contains many predefined validation rules for common problems such as range specifications for
variables coded 0 and 1. You can load these rules into your file first, then supplement them with
additional rules you define. Or you could add additional rules that you commonly use to the
predefined file so they will be immediately available for all files. The predefined rule file is
accessed from the Data...Validation…Load Predefined Rules menu. The file name is Predefined
Validation Rules.sav.

2.6 When Data Errors Are Discovered


If errors are found, the first step is to return to the original survey or data source. Simple clerical
errors are merely corrected. In some instances, errors on the part of respondents can be corrected
based on their answers to other questions, or systematic errors can be discovered and recoded
appropriately. If your data were retrieved from an organizational database, the database
administrator is often helpful in identifying the reasons for the problem. If these approaches are
not possible, the offending items can be coded as missing responses and will be excluded from
analyses. While beyond the scope of this course, there are techniques that substitute estimated
data values for missing responses. For a discussion of such methods see Allison (2002) or Burke
and Clark (1992).
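
For instance, once the invalid HAPMAR response found earlier has been confirmed as an error that
cannot be corrected, it could be set to system-missing. A hypothetical sketch:

* Hypothetical correction: clear HAPMAR for any case that is not currently married.
IF (MARITAL ~= 1 & ~SYSMIS(HAPMAR)) HAPMAR = $SYSMIS.
EXECUTE.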


Having cleaned the data we can now move to the more interesting part of the process, data
analysis.

SPSS Missing Values Add-on Module


The SPSS Missing Values Add-on Module is very useful both for describing missing-data patterns
across cases and variables and for imputing (substituting) values for missing data. You can use
this add-on option to produce various reports and graphs describing the frequency and pattern of
missing data, and it provides methods for estimating (imputing) values for missing data. As of
SPSS Statistics 17, you can perform multiple imputation of missing values, which allows you to
use multiple variables to more accurately estimate (impute) missing data.


Summary Exercises
These exercises use the GSS2004Intro.sav data file. Open this data file and close the
GSS2004PreClean.sav file. Or, exit SPSS Statistics and read this data file before beginning the
exercises.

1. Examine the value labels (click Utilities…Variables) for a few of the variables in
GSS2004Intro.sav and compare these ranges to the results in the Descriptives table. For
example, ETHNIC has a maximum value of 97. Is this a valid value?

2. Using the Validate Data procedure, write a single-variable rule to list all cases with values
greater than 30 on the EMAILHR variable. How many cases are identified? Examine values
of other variables for these cases (use the Data Editor display). Do these values seem
reasonable?

3. Define a cross-variable rule to check that the total number of persons in the Household
(HHSIZE) is equal to the sum of HHBABIES, HHPRETEEN, HHTEENS, and HHADULTS.
Save the Rule Violation indicator variables.

HINT: Using the SUM function in the expression has different results than using arithmetic
addition. Why? And, why would you use one method versus the other? Try both!


Chapter 3: Describing Categorical Data


Topics:
• Why Summaries of Single Variables?
• Frequency Analysis
• Standardizing the Bar Chart Axis
• Pie Charts

Data
This chapter uses the data file GSS2004Intro.sav.

Scenario
We are interested in exploring relationships between some demographic variables (highest
educational degree attained, gender) and some belief/attitudinal/behavioral variables (belief in an
afterlife, use of a computer). Prior to running these two-way analyses (considered in Chapter 6)
we will look at the distribution of responses for several of these variables. This can be regarded as
a preliminary step before performing the main crosstabulation analysis of interest, or as an
analysis in its own right. There might be considerable interest in documenting what percentage of
the U.S. (non-institutionalized) adult population believes in an afterlife. In addition, we will look
at the frequency distributions of work status and marital happiness.

3.1 Why Summaries of Single Variables?


Summaries of individual variables provide the basis for more complex analyses. There are a
number of reasons for performing single variable (univariate) analyses. One would be to establish
base rates for the population sampled. These rates may be of immediate interest: What percentage
of our customers is satisfied with services this year? In addition, studying a frequency table
containing many categories might suggest ways of collapsing groups for a more succinct, striking
and statistically appropriate table. When studying relationships between variables, the base rates
of the separate variables indicate whether there is a sufficient sample size (discussed in more
detail in Chapter 5) in each group to proceed with the analysis. A second use of such summaries
would be as a data-checking device—unusual values would be apparent in a frequency table.

The Level of Measurement of a variable determines the appropriate summary statistics, tables,
and graphs to describe the data. Table 3.1 summarizes the most common summary measures and
graphs for each of the measurement levels and SPSS Statistics procedures that can produce them.


Table 3.1 Summary of Descriptive Statistics and Graphs

                          NOMINAL                   ORDINAL                      SCALE

Definition                Unordered categories      Ordered categories           Metric/numeric values

Examples                  Labor force status,       Satisfaction ratings,        Income, height, weight
                          gender, marital status    degree of education

Measures of Central       Mode                      Mode, Median                 Mode, Median, Mean
Tendency

Measures of               N/A                       Min/Max/Range,               Min/Max/Range, IQR,
Dispersion                                          InterQuartile Range (IQR)    Standard Deviation/Variance

Graph                     Pie or Bar                Pie or Bar                   Histogram, Box & Whisker,
                                                                                 Stem & Leaf

Procedures                Frequencies               Frequencies                  Frequencies, Descriptives,
                                                                                 Explore

In this chapter, we will review tables and graphs appropriate for describing categorical (nominal
and ordinal) variables. Techniques for exploring scale (interval and ratio) variables will be
reviewed in Chapter 4.

The most common technique for describing categorical data is a frequency analysis which
provides a summary table indicating the number and percentage of cases falling into each
category of a variable as well as the number of valid and missing cases. To represent this
information graphically we use bar or pie charts. In this chapter we run frequency analyses on
several questions from the General Social Survey and construct charts to accompany the tables.
We discuss the information in the tables and consider the advantages and disadvantages in
standardizing bar charts when making comparisons across charts.

3.2 Frequency Analysis


We begin by requesting frequency tables and bar charts for five variables: labor force status,
WRKSTAT; highest education degree earned, DEGREE; belief in an afterlife, POSTLIFE;
computer use, COMPUSE; and happiness with marriage, HAPMAR. Requests for bar charts can be
made from the Frequencies dialog box, or through the Graphs menu.

We begin by opening the 2004 General Social Survey data in the Data Editor:


Click File…Open…Data
Click GSS2004Intro.sav and click Open (not shown)

Click Analyze…Descriptive Statistics…Frequencies


Move WRKSTAT, DEGREE, POSTLIFE, COMPUSE, and HAPMAR into the
Variable(s): list box

Figure 3.1 Frequencies Dialog Box

Note that three of these variables, WRKSTAT, POSTLIFE, and COMPUSE are defined as nominal
variables; DEGREE and HAPMAR are defined as ordinal variables.

After placing the variables in the list box, we use the Charts option button to request bar charts
based on percentages.

Click the Charts option button


Select Bar charts in the Chart Type area
Select Percentages in the Chart Values area


Figure 3.2 Frequencies: Charts Dialog Box

Click Continue
Click the Format option button on the Frequencies dialog
Click Organize output by variables in the Multiple Variables area

Figure 3.3 Frequencies Format Dialog Box

We choose to organize the output by variables which will display the frequency table followed by
the bar chart for each variable. The default would display the frequency tables for all of the
variables followed by the bar charts for all of the variables. Other options on the Format dialog
box allow you to change the display order of the categories in the frequency tables and charts and
suppress large frequency tables for variables with more than the specified number of categories.
You also can make additional format changes to the tables and charts by using the pivot table
editor and chart editor.

Click Continue
Click OK
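
If you prefer to work with command syntax, clicking Paste instead of OK generates the equivalent
FREQUENCIES command. A minimal sketch of that syntax, assuming the dialog choices described
above, is:

FREQUENCIES VARIABLES=WRKSTAT DEGREE POSTLIFE COMPUSE HAPMAR
  /BARCHART PERCENT
  /ORDER=VARIABLE.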

We now examine the tables and charts looking for anything interesting or unusual.


Frequencies Output
We begin with a table based on labor force status.

By default, value labels appear in the first column and, if labels were not supplied, the data values
display. Tables involving nominal and ordinal variables usually benefit from the inclusion of
value labels. Without value labels we wouldn’t be able to tell from the output which number (data
value) stands for which work status category. Sometimes you want to display both the data value
and the label. You can do this under Edit…Options by changing the Pivot Table Labeling option
on the Output Labels tab.

The Frequency column contains counts, i.e. the number of occurrences, of each data value. The
Percent column shows the percentage of cases in each category relative to the number of cases in
the entire data set, including those with missing values. One case did not answer this question
(NA) and has been flagged as a user-missing value. This case is excluded from the Valid Percent
calculation. Thus the Valid Percent column contains the percentage of cases in each category
relative to the number of valid (non-missing) cases. Cumulative percentage, the percentage of
cases whose values are less than or equal to the indicated value, appears in the cumulative percent
column. With only one missing case, the Percent and Valid Percent columns appear identical at one
decimal place. Note that we can edit the frequencies pivot table to display the percentages with
greater precision and see the difference between the two percentages in the second decimal position.
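For example, for the WORKING FULLTIME row in Figure 3.4, Percent = 1466 / 2812 = 52.13%
(all cases), while Valid Percent = 1466 / 2811 = 52.15% (valid cases only); rounded to one decimal
place, these display as 52.1 and 52.2.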

Figure 3.4 Frequency of Labor Force Status

LABOR FRCE STATUS

                         Frequency    Percent    Valid Percent    Cumulative Percent
Valid WORKING FULLTIME 1466 52.1 52.2 52.2
WORKING PARTTIME 320 11.4 11.4 63.5
TEMP NOT WORKING 80 2.8 2.8 66.4
UNEMPL, LAID OFF 99 3.5 3.5 69.9
RETIRED 403 14.3 14.3 84.2
SCHOOL 115 4.1 4.1 88.3
KEEPING HOUSE 266 9.5 9.5 97.8
OTHER 62 2.2 2.2 100.0
Total 2811 100.0 100.0
Missing NA 1 .0
Total 2812 100.0

Examine the table. Note the disparate category sizes. Over half of the sample is working full time,
and four categories each contain less than 5% of the cases. Before using this variable in a
crosstabulation analysis, you should consider combining some of the categories with fewer cases.
Decisions about collapsing categories usually depend on which groups need to be kept distinct in
order to answer the research question asked, and on the sample sizes for the groups. For example,
you might combine "temporarily not working" and "unemployed, laid off," depending on the intent
of your analysis. However, if those temporarily not working are of specific interest to your study,
you would want to leave them as a separate group. What are some other meaningful ways in which
you might combine or compare categories?
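
If you do decide to combine categories, the RECODE command can create a collapsed copy of the
variable. The following is only a sketch: it assumes, for illustration, that "temporarily not working"
and "unemployed, laid off" are coded 3 and 4, and the new variable name WRKSTAT2 is ours.

* Hypothetical example: combine codes 3 and 4 into one category in a new variable.
RECODE WRKSTAT (3,4=3) (ELSE=COPY) INTO WRKSTAT2.
VARIABLE LABELS WRKSTAT2 'Labor force status (collapsed)'.
EXECUTE.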


Next we view a bar chart based on the labor force variable. Does the picture make it easier to
understand the distribution?

Figure 3.5 Bar Chart of Labor Force Status

The disparities among the labor force status categories are brought into focus by the bar chart. We
next turn to highest education degree earned.

Figure 3.6 Frequency Table of Educational Degree

RS HIGHEST DEGREE

                         Frequency    Percent    Valid Percent    Cumulative Percent
Valid LT HIGH SCHOOL 364 12.9 12.9 12.9
HIGH SCHOOL 1435 51.0 51.0 64.0
JUNIOR COLLEGE 224 8.0 8.0 72.0
BACHELOR 507 18.0 18.0 90.0
GRADUATE 281 10.0 10.0 100.0
Total 2811 100.0 100.0
Missing DK 1 .0
Total 2812 100.0


Figure 3.7 Bar Chart of Highest Educational Degree

There are some interesting peaks and valleys in the distribution of the respondent’s highest
degree. Again over half of the people fall into one category, high school graduates. Depending on
your research, you might want to collapse some of the categories. Can you think of a sensible
way of collapsing DEGREE into fewer categories? Or reasons why you would not?

Next, we look at the dichotomous variable, POSTLIFE

Figure 3.8 Frequency Table of Belief in Afterlife

BELIEF IN LIFE AFTER DEATH

                         Frequency    Percent    Valid Percent    Cumulative Percent
Valid YES 958 34.1 81.8 81.8
NO 213 7.6 18.2 100.0
Total 1171 41.6 100.0
Missing DK 154 5.5
System 1487 52.9
Total 1641 58.4
Total 2812 100.0


Figure 3.9 Bar Chart of Belief in Afterlife

The great majority of respondents (81.8%) do believe in life after death (though we suspect that
“life after death” means different things to different people). It might be interesting to look at the
relationship between this variable and educational degree: to what extent is belief in an afterlife
related to level of education? The frequency tables we are viewing display each variable
independently. To investigate the relationship between two categorical variables we will turn to
crosstabulation tables in Chapter 6.

POSTLIFE has two missing value categories. The first missing code (DK) represents a response
of "Don't Know" and is very commonly used as a response in survey questions. This question was
not asked of all respondents so they were left blank in the data and SPSS Statistics converted the
blanks to the system-missing value. So, the second missing category represents those who were
not asked the question. These codes are excluded from consideration in the “Valid Percent”
column of the frequency table, as well as from the bar chart, and would also be ignored if any
additional statistics were requested.


Figure 3.10 Frequency Table of Computer Use

R USE COMPUTER

                         Frequency    Percent    Valid Percent    Cumulative Percent
Valid YES 723 25.7 73.9 73.9
NO 255 9.1 26.1 100.0
Total 978 34.8 100.0
Missing NA 6 .2
System 1828 65.0
Total 1834 65.2
Total 2812 100.0

Figure 3.11 Bar Chart of Computer Use

Almost three-quarters of the respondents asked use the computer regularly. However, the
percentage of computer users might well be related to other demographic variables such as degree
or gender. Or others? Note that this question was asked of only 35% of the respondents. Of those
who were asked the question, 6 people did not give an answer (NA). This group is flagged as
user-missing and both groups are excluded from the bar chart.


Figure 3.12 Frequency Table for Happiness of Marriage

Taking things all together, how would you describe your marriage? Would you
say that your marriage

                         Frequency    Percent    Valid Percent    Cumulative Percent
Valid VERY HAPPY 417 14.8 61.7 61.7
PRET.HAPPY 234 8.3 34.6 96.3
NOT TOO 25 .9 3.7 100.0
Total 676 24.0 100.0
Missing NO ANSWER 6 .2
System 2130 75.7
Total 2136 76.0
Total 2812 100.0

Figure 3.13 Bar Chart for Happiness of Marriage

About two-thirds of those married are very happily married. Of the rest, most say they are pretty
happy. A very small percentage (3.7%) of respondents is not too happy. However, remember this
question was only asked of a sample of those currently married. How might this influence your
interpretation of the percentages?


3.3 Standardizing the Chart Axis


If we glance back over the last few bar charts we notice that the scale axis, which displays
percents, varies across charts. This is because the maximum value displayed in each bar chart
depends on the percentage of respondents in the most popular category. Such scaling permits
better use of the space within each bar chart but makes comparison across charts more difficult.
Percentaging is itself a form of standardization, and bar charts displaying percentages as the scale
axis were requested in our analyses. Charts can be further normed by forcing the scale axis (the
axis showing the percents) in each chart to have the same maximum value. This facilitates
comparisons across charts, but can make the details of individual charts more difficult to see.

There are at least two methods within SPSS Statistics for standardizing the scale axis.

1. Chart Template: You can edit the scale axis range and other characteristics of a chart,
save the edited chart as a chart template, then apply the chart template to existing charts
or to all charts being built.
2. Chart Builder: You can edit the properties of the scale axis within the Chart Builder
when you build the chart.

We will illustrate this by reviewing two of the previous bar charts (computer use and labor force
status) and requesting that the maximum scale value be set to 100 (100%).

Creating and Using a Chart Template


First, we edit the COMPUSE chart, save a chart template and apply it to the WRKSTAT chart.

To edit the scale axis maximum to 100%:

Double click the R Use Computer chart to open the Chart Editor
Click on the scale axis values (Y axis)
Click Edit…Properties or the tool button to open the Properties dialog (if necessary)
Click the Scale tab in the Properties dialog box
Click the Maximum checkbox to deselect it, then set its value to 100
Click Apply, then Close

The scale axis of the chart will change to the range of 0 to 100.


Figure 3.14 Bar Chart of Computer Use with Edited Scale Axis

We could edit many other elements of the scale axis, such as displaying a percent sign or
displaying decimal positions, or we could edit other elements of the chart, all of which can be
saved in a chart template. We will limit our editing to this one change and save the chart
template.

Click File…Save Chart Template


Figure 3.15 Save Chart Template Dialog Box

You specify the elements that you want to save in the chart template. In our case, we will
save just the Scale axes characteristics.

Click Reset
Select Scale axes
Click Continue

In the Save Template dialog, move to the C:\Train\Stats folder in the Look in:
dropdown list (not shown)
Type Ch3_bar100.sgt in the File Name textbox
Click Save
Close the Chart Editor window

To apply this chart template to the Labor Force Status chart,

Double click the Labor Force Status chart to open the Chart Editor
Click File…Apply Chart Template in the Chart Editor window
In the Apply Template dialog, move to the c:\Train\Stats folder, then select
Ch3_bar100.sgt
Click Open


Figure 3.16 Bar Chart of Labor Force Status with Chart Template Applied

Note that the scale axes of both bar charts are now in comparable units so we can make direct
comparisons based on the bar heights. This is the advantage of the percentage standardization.
However, it works best if the variables have a similar number of categories or size of the most
popular (largest) category. For example, multiple charts of satisfaction rating questions are best
presented with a standardized scale.

On the other hand, applying the 0 to 100 scale to the labor force variable, with its eight categories,
produces the same shape as the previous chart but shrinks the bars, so some detail is lost. Thus the
advantage of standardizing the percentage scale must be weighed against the potential loss of
detail. In practice it is usually quite easy to decide which approach is better.

Close the Chart Editor window
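
A saved chart template can also be applied as a chart is created with the legacy GRAPH command,
rather than by editing afterwards. A minimal sketch, assuming the template was saved as
C:\Train\Stats\Ch3_bar100.sgt as above:

GRAPH
  /BAR(SIMPLE)=PCT BY WRKSTAT
  /TEMPLATE='C:\Train\Stats\Ch3_bar100.sgt'.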

Using Chart Builder to Create the Bar Chart


You can use the Chart Builder (Graphs…Chart Builder) to create bar charts and set the
maximum value of the scale axis percentages to 100% initially. For example, to directly create a
bar chart with a scale axis range 0 to 100 of the COMPUSE variable, we use the Chart Builder.


Click Graphs…Chart Builder


Click OK in the Information box (if necessary)
Click Reset (if necessary)
Click Gallery tab (if it's not already selected)
Click Bar in the Choose from: list

Select the first icon for Simple Bar Chart and drag it to the Chart Preview
canvas
Drag and drop COMPUSE from the Variables: list to the X-Axis? area in the Chart
Preview canvas

In the Element Properties dialog box (Click Element Properties button if this dialog box
did not automatically open),
Select Percentage(?) from the Statistic: dropdown list
Click Apply
Select Y-Axis1 (Bar1) from the Edit Properties of: list
Uncheck the Maximum checkbox in the Scale Range area
Set the Maximum to 100
Click Apply

Figure 3.17 Chart Builder and Element Properties to Create Bar Chart

Click OK in the Chart Builder to build the bar chart


Figure 3.18 Bar Chart Created with Chart Builder

Note that the default display for the scale axis labels is to display the percents with one decimal
place and the percent sign. We could have achieved this in the chart template as well by editing
these features.

3.4 Pie Charts


Pie charts provide a second way of picturing information in a frequency table. You can produce
pie charts from the Chart Builder or in the Frequencies Chart dialog box. To create a pie chart for
labor force status from the Graphs menu:

Click Graphs…Chart Builder


Click OK in the Information box (if necessary)
Click Reset
Click Pie/Polar in the Choose from: list on the Gallery tab

Select the Pie chart icon and drag it to the Chart Preview canvas


Drag and drop WRKSTAT from the Variables: list to the Slices by? area in the Chart
Preview canvas

In the Element Properties dialog box (Click Element Properties button if this dialog box
did not automatically open),
Select Percentage(?) from the Statistic: dropdown list
Click Apply


Figure 3.19 Chart Builder and Element Properties to Create Pie Chart

Click OK in the Chart Builder to build the pie chart
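
The same pie chart can also be requested with the legacy GRAPH command; a minimal sketch of the
equivalent syntax is:

GRAPH
  /PIE=PCT BY WRKSTAT.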


Figure 3.20 Pie Chart of Labor Force Status

While the pie and bar charts are based on the same information, the structure of the pie chart
draws attention to the relation between a given slice (here a group) and the whole. On the other
hand, a bar chart leads one to make comparisons among the bars, rather than any single bar to the
total. You might keep these different emphases in mind when deciding which to use in your
presentations. See Cleveland (1994), Tufte (2001), and Few (2004) for additional
considerations in displaying data graphically.


Summary Exercises
Suppose we are interested in looking at the relationship between race (RACE) and three other
variables: HLTH1, whether you were ill enough to go to the doctor last year; NATAID, attitude
toward spending on foreign aid; and NEWS, how frequently you read the newspaper. In addition,
we wish to determine whether gender (GENDER) is related to these same beliefs.

1. First run a frequency analysis on these variables. Look at the distributions. Do you see
any difficulties using these variables in a cross tabulation analysis? If so, is there an
adjustment you would make?
2. Run Frequencies on the NATAID, NATENVIR, and NATCITY variables and request bar
charts displaying percentages. Standardize the percentage scales to 0 to 100 with
appropriate tick marks for all of the charts.

For those with extra time:

1. Run a frequency on WEBYR, year in which you began using the web. This is coded with
years in category ranges. How might you recode this variable before using it in a
crosstab?
2. Create a new variable with the collapsed categories of WEBYR. If you wish, save the
modified data in a data file named MyGSS2004.sav.


Chapter 4: Exploratory Data Analysis: Scale Data

Topics
• Summarizing Scale Variables
• Measures of Central Tendency and Dispersion
• Normal Distributions
• Histograms and Normal Curves
• Using the Explore Procedure: EDA
• Standard Error of the Mean and Confidence Intervals
• Shape of the Distribution
• Boxplots
• Appendix: Standardized (Z) Scores

Data
In this chapter, we continue to use the GSS2004Intro.sav file.

Scenario
One of the aims of our overall analysis is to compare demographic groups on hours per week
using the Internet, the number of hours worked last week, and the amount of time spent watching
TV each day. Before proceeding with the group comparisons in Chapter 7, we discuss basic
concepts of the normal distribution and statistical measures to describe the distribution and look
at summaries of these measures across the entire sample.

4.1 Summarizing Scale Variables


In Chapter 3 we used frequency tables containing counts and percentages as the appropriate
summaries for individual categorical (nominal and ordinal) variables. If the variables of interest are
interval scale, we can expand the summaries to include means, standard deviations, and other
statistical measures. Counts and percentages may still be of interest, especially when the variables can take
only a limited number of distinct values. For example, when working with a one to five point
rating scale we might be very interested in knowing the percentage of respondents who reply
“Strongly Agree.” However, as the number of possible response values increases, frequency
tables based on interval scale variables become less useful. Suppose, for example, that we asked
respondents for their family income to the nearest dollar. It is likely that each response would have a
different value, so a frequency table would be quite lengthy and not particularly helpful as a summary
of the variable. In data cleaning, you might find a frequency table useful for examining possible
clustering of cases on specific values or looking at cumulative percentages. But, beware of using
frequency tables for scale variables with many values as they can be very long. In short, while
there is nothing incorrect about a frequency table based on an interval scale variable with many
values, it is neither a very effective nor efficient summary of the variable.

For interval scale variables such statistics as means, medians and standard deviations are often
used. Several procedures within SPSS Statistics (Frequencies, Case Summaries and Explore) can


produce these summaries and other summaries of the distribution. We will define each of these
measures and use exploratory data analysis to produce them in SPSS Statistics. In addition such
graphs as histograms, stem & leaf, and box & whisker plots are designed to display information
about interval scale variables. We will see examples of each.

Exploratory data analysis (EDA) was developed by John Tukey, a statistician at Princeton and
Bell Labs. Seeing limitations in the standard set of summary statistics and plots, he devised a
collection of additional plots and graphs. These are implemented in SPSS Statistics in the Explore
procedure and we will include them in our discussion and examples.

4.2 Measures of Central Tendency and Dispersion


Measures of central tendency and dispersion are the most common measures used to summarize
the distribution of variables. We give a brief description of each of these measures below.

Measures of Central Tendency


Statistical measures of central tendency provide the single number that is most often used to
summarize the distribution of a variable; they may be referred to generically as the "average".
There are three main central tendency measures: mode, median, and mean. In addition, Tukey
devised the 5% trimmed mean.

• Mode: The mode for any variable is merely the group or class that contains the most
cases. If two or more groups contain the same highest number of cases, the distribution is
said to be 'multimodal'. This measure is more typically used on nominal or ordinal data
and can easily be determined by examining the frequency table.
• Median: If all the cases for a variable are arranged in order according to their value, the
median is the value that splits the cases into two equally sized groups. The median is the
same as the 50th percentile. Medians are resistant to extreme scores, and so are
considered robust measures of central tendency.
• Mean: The mean is the simple arithmetic average of all the values in the distribution
(i.e., the sum of the values of all cases divided by the total number of cases). It is the most
commonly reported measure of central tendency. The mean, along with the associated
measures of dispersion, is the basis for many statistical techniques.
• 5% trimmed mean: The 5% trimmed mean is the mean calculated after the extreme
upper 5% and the extreme lower 5% of the data values are dropped. Such a measure is
resistant to small numbers of extreme values.

The specific measure that you choose will depend on a number of factors, most importantly the
level of measurement of the variable. The mean is considered the most "powerful" measure of the
three classic measures. However, it is good practice to compare the median, mean, and 5%
trimmed mean to get a more complete understanding of a distribution.

Measures of Dispersion
Measures of dispersion or variability describe the degree of spread, or variability, around the
central tendency measure. You might think of this as a measure of the extent to which
observations cluster within the distribution. There are a number of measures of dispersion,
including simple measures such as the maximum, minimum, and range; common statistical
measures such as the standard deviation and variance; and the exploratory data analysis measure,
the interquartile range (IQR).


• Maximum: Simply the highest value observed for a particular variable. By itself, it tells
us nothing about the shape of the distribution, merely how 'high' the top value is.
• Minimum: The lowest value in the distribution and, like the maximum, is only useful
when reported in conjunction with other statistics.
• Range: The difference between the maximum and minimum values gives a general
impression of how broad the distribution is. It says nothing about the shape of a
distribution and can give a distorted impression of the data if just one case has an extreme
value.
• Variance: Both the variance and standard deviation provide information about the
amount of spread around the mean value; they are overall measures of how clustered
around the mean the data values are. The variance is calculated by summing the squared
difference between each case's value and the mean and dividing this quantity by the
number of cases minus 1 (see the formula following this list). If all cases had the same
value, the variance (and standard deviation) would be zero. The variance is expressed in
the squared units of the variable, which can make it difficult to interpret, so the standard
deviation is more often used. In general terms, the larger the variance, the more spread
there is in the data; the smaller the variance, the more the data values are clustered around
the mean.
• Standard Deviation: The standard deviation is the square root of the variance, which
restores the measure of variability to the units of measurement of the original variable's
values. It is therefore easier to interpret. Either the variance or the standard deviation is
often used in conjunction with the mean as a basis for a wide variety of statistical
techniques.
• Interquartile Range (IQR): This measure of variation is the range of values between
the 25th and 75th percentile values. Thus, the IQR represents the range of the middle 50
percent of the sample and is more resistant to extreme values than the standard deviation.
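
In symbols, the variance and standard deviation described in the Variance bullet above are

    s^{2} = \frac{\sum_{i=1}^{n}\left(x_{i} - \bar{x}\right)^{2}}{n-1}, \qquad s = \sqrt{s^{2}}

where x_i is a case's value, \bar{x} is the sample mean, and n is the number of cases.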

Like the measures of central tendency, these measures differ in their usefulness with variables of
different measurement levels. The variability measures, variance and standard deviation, are used
in conjunction with the mean for statistical evaluation of the distribution of a scale variable. The
other measures of dispersion, although less useful statistically, can provide useful descriptive
information about a variable.

4.3 Normal Distributions


An important statistical concept is that of the normal distribution. This is a frequency (or
probability) distribution that is symmetrical and is often referred to as the normal, bell-shaped
curve. The histogram in Figure 4.1 illustrates a normal distribution. The mean, median, and mode
exactly coincide in a perfectly normal distribution, and the proportion of cases contained within
any portion of the normal curve can be calculated exactly, either mathematically or, more usually,
from tables of the normal distribution.

Its symmetry means that 50% of cases lie to either side of the central point as defined by the
mean. Two of the other most frequently used regions are the portion lying between plus and
minus one standard deviation of the mean (containing approximately 68% of cases) and the
portion between plus and minus 1.96 standard deviations (containing approximately 95% of
cases), sometimes rounded up to 2.00 for convenience; both are shown in Figure 4.1. Thus, if a
variable is normally distributed, we expect 95% of the cases to be within roughly 2 standard
deviations of the mean.


Figure 4.1 Normal Distribution: Plus or Minus 1 SD and 1.96 SD

Many naturally occurring phenomena, such as height, weight and blood pressure, are distributed
normally. Random errors also tend to conform to this type of distribution. It is important to
understand the properties of normal distributions and how to assess the normality of particular
distributions because of their theoretical importance in many inferential statistical procedures. We
will discuss these issues later in this course. In this chapter, we will review descriptive statistics
and graphs that allow us to assess the distribution of our data in comparison to the normal
distribution.

4.4 Histograms and Normal Curves


The histogram is designed to display the distribution (range and concentration) of a scale variable
that takes many different values. A bar chart contains one bar for each distinct data value. When
there are many possible data values and few observations at any given value, a bar chart is less
useful than a histogram. In a histogram, adjacent data values are grouped together so that each bar
represents the same range of data values. SPSS Statistics automatically chooses the range of data
values for each bin, but you can specify the range of values or number of bins.

With this chart we can see the general distribution of data regardless of how many distinct data
values are present. While a bar chart is appropriate for an ordinal variable such as a one to five or
one to seven point rating scale, a bar chart of hours worked last week would contain too many
bars, some of them with few cases, and gaps in hours (values which no one worked) would not be
displayed. For these reasons, a histogram is a better choice.


You can request histogram plots from the Chart Builder or as options from the Frequencies and
Explore procedures.

Note
If you request histograms and summary statistics from the Frequencies procedure, you might
want to uncheck (turn off) Display frequency tables on the Frequencies dialog box.

We will begin by requesting a histogram on hours worked last week, HRS1 using the Chart
Builder. In addition, we will ask that a normal curve be superimposed on the histogram. Should
we expect it to be normally distributed?

Open the GSS2004Intro.sav data file (if necessary)


Click Graphs…Chart Builder
Click OK in the Information box (if necessary)
Click Reset
Click Gallery tab (if it's not already selected)
Click Histogram in the Choose from: list

Select the first icon for Simple Histogram and drag it to the Chart Preview
canvas
Drag and drop HRS1 from the Variables: list to the X-Axis? area in the Chart Preview
canvas

In the Element Properties dialog box (Click Element Properties button if this dialog box
did not automatically open),
Click (check on) Display normal curve
Click Apply

See Figure 4.2 for the completed Chart Builder dialogs. This will produce a histogram displaying
frequencies for HRS1 with the bell shaped curve for normal distribution of the HRS1 mean and
standard deviation. The other icons in the gallery request special types of histograms such as a
stacked histogram and population pyramid; both of which allow you to display histogram
distributions of a scale variable for subgroups of cases.


Figure 4.2 Chart Builder and Element Properties to Create Histogram with Normal Curve

Click OK in the Chart Builder
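
A histogram with a superimposed normal curve can also be requested directly with syntax; a
minimal sketch of the equivalent command is:

GRAPH
  /HISTOGRAM(NORMAL)=HRS1.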


Figure 4.3 Histogram with Normal Curve of Hours Worked Last Week

The mean, standard deviation, and number of valid cases for HRS1 are automatically displayed in
the chart legend. The mean hours worked last week is 42.26, slightly above the norm of 40. The
standard deviation is about 15 hours, which indicates there is a fair amount of variation among
respondents in hours worked.

Does the histogram seem useful in describing hours worked? If HRS1 were normally distributed,
the data bars would align perfectly with the normal curve. We easily see that values near 40 are
by far the most common. This is because most people who work do so full-time. Given this
clustering, we might expect that the median (50th percentile) value is 40, and we will verify that
next. The distribution is basically symmetric. Technically speaking, this distribution would be
described as not being skewed. However, we can see that the actual distribution is more "peaked"
than the normal curve. There are formal statistical measures to describe these deviations from the
normal curve which we will discuss shortly.

4.5 Using the Explore Procedure: EDA


As we mentioned earlier, John Tukey devised several statistical measures and plots designed to
reveal data features that might not be readily apparent from standard statistical summaries. In his
book describing these methods, Exploratory Data Analysis, Tukey (1977) described the work of a
data analyst to be similar to that of a detective, the goal being to discover surprising, interesting
and unusual things about the data. To further this effort, Tukey developed both plots and data
summaries. These methods, called exploratory data analysis and abbreviated EDA, have become


very popular in applied statistics and data analysis. Exploratory data analysis can be viewed either
as an analysis in its own right, or as a set of data checks and investigations performed before
applying inferential testing procedures.

These methods are best applied to variables with at least ordinal (more commonly interval) scale
properties and which can take many different values. The plots and summaries would be less
helpful for a variable that takes on only a few values (for example, a five point scale).

The Explore procedure produces many of the EDA statistical measures and plots along with the
classic statistics and histograms. We will use the Explore procedure to examine hours worked
last week, hours spent using the Internet per week, and number of hours of TV viewed per day.

To run the Explore procedure,

Click Analyze…Descriptive Statistics…Explore


Move HRS1, WWWHR, and TVHOURS to the Dependent List: box

The scale variables to be summarized appear in the Dependent list box. The Factor list box can
contain one or more categorical (for example, demographic) variables, and if used would cause
the procedure to present summaries for each category of the factor variable(s). We will use this
feature in later chapters when we look at mean differences between groups. By default, both plots
and statistical summaries will appear. While not discussed here, the Explore procedure can
produce robust mean estimates (M-estimators), percentile values, and lists of extreme values, as
well as normal probability and homogeneity plots.

Figure 4.4 Explore Dialog Box

We can request specific statistical summaries and plots using the Statistics and Plots option
buttons. We will accept the default statistics but request a histogram rather than a stem-and-leaf
plot.

Click Plots
Click off Stem-and-leaf
Click on Histogram


Figure 4.5 Plots Dialog Box in Explore

The stem & leaf plot (devised by Tukey) is modeled after the histogram but contains more
information. Although not requested here, we will briefly discuss it later. For most purposes,
the histogram is easier to interpret and more useful. By default, a boxplot will be displayed for
each scale variable.

Click Continue

Options with Missing Values


Ordinarily SPSS Statistics excludes any observations with missing values when running a
procedure like Explore. When several variables are used (as here) you have a choice as to
whether the analysis should be based on only those observations with valid values for all
variables in the analysis (called listwise deletion), or whether missing values should be excluded
separately for each variable (called pairwise deletion). When only a single variable is considered
both methods yield the same result, but they will not give identical answers when multiple
variables are analyzed in the presence of missing values.

The default method is listwise deletion. In our example, each of the variables was asked of only a
subset of cases; but a different subset for each variable. Thus, the listwise option is not
appropriate. We will specifically request the alternative pairwise method using the Options
button.

Click Options
Click Exclude cases pairwise


Figure 4.6 Missing Value Options in Explore

Rarely used, the Report values option includes cases with user-defined missing values in
frequency analyses, but excludes them from summary statistics and charts.

Click Continue
Click OK
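
For reference, a sketch of the equivalent pasted EXAMINE syntax for these choices (default
descriptive statistics, histograms and boxplots, pairwise deletion of missing values) is:

EXAMINE VARIABLES=HRS1 WWWHR TVHOURS
  /PLOT BOXPLOT HISTOGRAM
  /STATISTICS DESCRIPTIVES
  /CINTERVAL 95
  /MISSING PAIRWISE
  /NOTOTAL.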

The Explore procedure produces two tables followed by the requested charts for each variable.
The first table, a Case Processing Summary pivot table, displays the number of valid cases and
missing cases for each variable. Each variable has a considerable amount of missing data. For
example, 1763 cases (respondents) had valid values for HRS1, while 37.3% were missing.
Ordinarily such a large percentage of missing data would set off alarm bells for the analyst.
However we know that people who did not work were not asked this question. TVHOURS has
the most missing data, 68% of the cases, because it was asked of only a subset of the respondents.
Notice the large differences in the number of valid cases among the three variables.

Figure 4.7 Case Processing Summary

Case Processing Summary

                                                         Cases
                                       Valid               Missing             Total
                                       N      Percent      N      Percent      N      Percent
NUMBER OF HOURS WORKED LAST WEEK       1763   62.7%        1049   37.3%        2812   100.0%
WWW HOURS PER WEEK                     1701   60.5%        1111   39.5%        2812   100.0%
HOURS PER DAY WATCHING TV               899   32.0%        1913   68.0%        2812   100.0%

Note
The statistics for all three variables are displayed in the one Descriptive table. We edited this
pivot table, moving the "Dependent Variables" from the row to the layer dimension. We will
present and discuss the summaries and plots for each variable (layer) separately.

The Descriptives table in Figure 4.8 displays a series of descriptive statistics for HRS1. From the
previous table, we know that these statistics are based on 1763 respondents.


Figure 4.8 EDA Summaries for Hours Worked Last Week

Descriptives

NUMBER OF HOURS WORKED LAST WEEK

                                        Statistic    Std. Error
Mean                                        42.26          .358
95% Confidence Interval   Lower Bound       41.56
for Mean                  Upper Bound       42.96
5% Trimmed Mean                             42.01
Median                                      40.00
Variance                                  225.651
Std. Deviation                             15.022
Minimum                                         1
Maximum                                        89
Range                                          88
Interquartile Range                            13
Skewness                                     .275          .058
Kurtosis                                    1.099          .117

First, several measures of central tendency appear: the Mean, 5% Trimmed Mean, and Median.
As we discussed earlier, these statistics attempt to describe with a single number where data
values are typically found, or the center of the distribution. Useful information about the
distribution can be gained by comparing these values to each other. Here the mean, median and
5% trimmed mean are very close and their values (40.00 to 42.26) suggest either that there are not
many extreme scores (not true in this case), or that the number of high and low scores is balanced
(which we will see does seem to be the case). If the mean were considerably above or below the
median and trimmed mean, it would suggest a skewed or asymmetric distribution. A perfectly
symmetric distribution—the normal—would produce identical expected means, medians and
trimmed means.

The measures of central tendency are followed in the table by several measures of dispersion or
variability. As we discussed earlier, these indicate to what degree observations tend to cluster or
be widely separated. Both the standard deviation and variance (standard deviation squared)
appear. The standard deviation of 15.022 indicates a variation around the mean of 15 hours; a
modest amount of variation. The standard error is an estimate of the standard deviation of the
mean if repeated samples of the same size (here 1763) were taken. It is used in calculating the
95% confidence band for the sample mean discussed below. Also appearing is the interquartile
range (IQR) which is essentially the range between the 25th and the 75th percentile values. It is a
variability measure more resistant to extreme scores than the standard deviation. The interquartile
range of 13 indicates that the middle 50% of the sample lie within a range of 13 hours. The fact
that the IQR is lower than the standard deviation suggests that the distribution may be "peaked" in
the center. We also see the minimum, maximum and range. It is useful to check the minimum and
maximum in order to make sure no impossible data values are recorded.


4.6 Standard Error of the Mean and Confidence Intervals


As stated earlier, the standard error of the mean is an estimate of the standard deviation of the
sample mean over repeated samples and is a function of the sample standard deviation and the sample size:

Standard error of the mean = sample standard deviation / square root of sample size

The larger the sample size, the smaller the standard error given the same sample standard
deviation.
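
As a check on the figures in Figure 4.8, for hours worked the sample standard deviation is 15.022
and the sample size is 1763, so

Standard error of the mean = 15.022 / sqrt(1763) = 15.022 / 41.99 = .358 (approximately)

which matches the standard error reported in the Descriptives table.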

The standard error of the mean is used to calculate the 95% confidence interval. The 95%
confidence interval has a technical definition: if we were to repeatedly perform the study, on
average we would expect the 95% confidence bands to include the true population mean 95% of
the time. It is useful in that it combines measures of both central tendency (mean) and variation
(standard error of mean) to provide information about where we should expect the population
mean to fall.

The confidence band is based on the sample mean, plus or minus 1.96 times the standard error of
the mean. Recall from our earlier discussion about the normal distribution that 95% of the area
under a normal curve is within 1.96 standard deviations of the mean. Since the sample standard
error of the mean is simply the sample standard deviation divided by the square root of the
sample size, the 95% confidence band for the mean is equal to the sample mean plus or minus
1.96 * (sample standard deviation/(square root (sample size))). Thus if you have the sample
mean, standard deviation and number of observations, you can easily calculate the 95%
confidence band.
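
For hours worked, using the values in Figure 4.8:

95% confidence band = 42.26 ± 1.96 * .358
                    = 42.26 ± .70
                    = 41.56 to 42.96

which is the interval reported by Explore.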

As we can see in Figure 4.8, the confidence band for the mean of hours worked is very narrow
(41.56 to 42.96), so we have a fairly precise idea of the population mean for hours worked last
week, expecting the population mean to fall within this range 95% of the time.

4.7 Shape of the Distribution


In Figure 4.3, we displayed the histogram showing the distribution of hours worked last week.
This same histogram, minus the normal curve line, was requested as part of the Explore output.
The final two statistical measures, skewness and kurtosis, in the Descriptive table in Figure 4.8
provide numeric summaries about the shape of the distribution of the data. Since most analysts
are content to view histograms in order to make judgments regarding the distribution of a
variable, these measures are infrequently used.

Skewness is a measure of the symmetry of a distribution. It measures the degree to which cases
are clustered towards one end of the distribution. It is normed so that a symmetric distribution has
zero skewness. A positive skewness value indicates bunching on the left and a longer tail on the
right (for example, income distribution in the U.S.); negative skewness follows the reverse
pattern. The standard error of skewness also appears in the Descriptives table and we can use it to
determine if the data are significantly skewed. One method is to use the standard errors to
calculate the 95% confidence interval around the skewness. If zero is not in this range, we could
conclude that the distribution was skewed. A second method is to check whether the skewness value
is more than 1.96 standard errors (1.96*SE) away from zero.


Although the skewness value for hours worked (.275) is close to 0 (see Figure 4.8), using either
of the methods above we would conclude that the distribution is slightly positively skewed. The
histogram, however, shows little visible evidence of skewness.
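
To make the arithmetic of the first method explicit:

.275 ± 1.96 * .058 = .275 ± .114, or .161 to .389

Zero does not fall in this interval and, equivalently, .275 is more than 1.96 standard errors
(.114) from zero, so the skewness is statistically significant even though it is small in absolute terms.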

Kurtosis also has to do with the shape of a distribution and is a measure of how much of the data
is concentrated near the center, as opposed to the tails, of the distribution. It is normed to the
normal curve (for which kurtosis is zero). As an example, a distribution with longer tails and
more peaked in the middle than a normal is referred to as a leptokurtic distribution and would
have a positive kurtosis measure. On the other hand, a platykurtic distribution is a flattened
distribution and has negative kurtosis values. A standard error for kurtosis also appears. The same
methods used for evaluating skewness can be used to evaluate the kurtosis values.

Since the kurtosis value for hours worked is 1.099 (Figure 4.8), which is well beyond 1.96
standard errors (1.96*.117 = .229) from zero, hours worked has a leptokurtic distribution and is
not normally distributed.

The shape of the distribution can be of interest in its own right. Also, assumptions are made about
the shape of the data distribution within each group when performing significance tests on mean
differences between groups. This aspect will be covered in later chapters.

Note: Stem & Leaf Plot of Hours Worked Last Week


The stem & leaf plot (devised by Tukey) is modeled after the histogram, but is designed to
provide more information. We requested the histogram rather than a stem & leaf plot, but provide
the stem & leaf plot for hours worked last week in Figure 4.9 as an example. The overall
distribution is reflected in the length of the lines. Instead of using a standard symbol (for example,
an asterisk ‘*’ or block character) to display a case or group of cases, the stem & leaf plot uses
data values as the plot symbols on each line. Thus the shape of the distribution appears and the
plot can be read to obtain specific data values.


Figure 4.9 Stem & Leaf Plot of Number of Hours Worked Last Week

In a stem & leaf plot the stem is the vertical axis and the leaves branch horizontally from the
stem. The stem width indicates the number of units in which the stem value is measured; in this
case a stem unit represents 10 hours. This means that the stem value must be multiplied by 10 to
reproduce the original units of analysis. The leaf values in the chart indicate the value of the next
unit down from the stem. To illustrate, the third row from the bottom of the chart contains a stem
value of 6 with one leaf of 4 and four leaves of 5. Each leaf represents 6 cases, so these represent
six individuals who worked 64 hours last week and 24 respondents who worked 65 hours. Values in a
stem that did not have at least 6 cases are represented by a fractional leaf, denoted by an
ampersand (&). Notice that there are a total of 30 cases in this stem.

The first and last lines identify outliers. These are data points far enough from the center (defined
more exactly under Box & Whisker plots below) that they might merit more careful checking.
Extreme points might be data errors or possibly represent a separate subgroup. Outliers are those
cases with values less than or equal to 17 hours (there are 104 of these) and greater than or equal
to 70 hours (there are 99 of these). Thus besides viewing the shape of the distribution we can pick
out individual scores.

4.8 Boxplots
Boxplots, also referred to as box & whisker plots, are a more easily interpreted way to convey the
same information about the distribution of a variable. In addition, the boxplot graphically
identifies outliers. Below we see the boxplot for hours worked.


Figure 4.10 Boxplot of Hours Worked

The vertical axis represents the scale for the number of hours worked. The solid line inside the
box represents the median or 50th percentile. The top and bottom borders (referred to as "hinges")
of the box correspond to the 75th and 25th percentile values of hours worked and thus define the
interquartile range (IQR). In other words, the middle 50% of data values fall within the box. The
“whiskers” (vertical lines extending from the top and bottom of the box) are the last data values
that lie within 1.5 box lengths (or IQRs) of the respective hinges (borders of box). Tukey
considers data points more than 1.5 box lengths from a hinge to be "outliers". These points are
marked with a circle. Points more than 3 box lengths (IQR) from a hinge are considered by Tukey
to be “far out” points and are marked with an asterisk symbol (there are none here). This plot has
many outliers. If a single outlier exists at a data value, the case sequence number appears beside it
(an ID variable can be substituted), which aids data checking.

If the distribution were symmetric, the median would be centered within the box, the hinges and
the whiskers. In the plot above, the median is toward the bottom of the box, indicating a
positively skewed distribution.

Boxplots are particularly useful for obtaining an overall ‘feel’ for a distribution in an instant. The
median tells us the location or central tendency of the data (40 hours for hours worked). The
length of the box indicates the amount of spread within the data, and the position of the median in
relation to the box tells us something of the nature of the distribution. Boxplots are also useful
when comparing several groups, as we will see in later chapters.


Note:
Boxplots, like all charts in SPSS Statistics, can be edited. For ease of interpretation or presentation,
you might want to delete some of the outlier and extreme data points after you have studied them.
You could also change the range or tick marks on the scale axis and make other enhancements to the
chart attributes.

Exploring Hours Using the Internet


We now apply the same exploratory approach to hours spent using the Internet per week.

Figure 4.11 Exploratory Summaries of Hours Using the Internet Per Week

Descriptives

WWW HOURS PER WEEK

                                        Statistic    Std. Error
Mean                                         7.46          .255
95% Confidence Interval   Lower Bound        6.96
for Mean                  Upper Bound        7.96
5% Trimmed Mean                              5.92
Median                                       4.00
Variance                                  110.793
Std. Deviation                             10.526
Minimum                                         0
Maximum                                       130
Range                                         130
Interquartile Range                             9
Skewness                                    3.569          .059
Kurtosis                                   21.097          .119

The mean of 7.46 hours is much greater than the median (4.00). This suggests a positive skew to
the data, confirmed by the skewness statistic, which is much larger than 0. Examine the minimum
and maximum values; do they suggest data errors? You might look at other variables such as
hours worked per week for those who claim to use the internet 130 hours a week. After all, there
are only 168 hours in a week! Which other variables might you look at in order to investigate the
validity of these responses?
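
One quick way to investigate is to list the other responses of the suspicious cases with the Case
Summaries procedure, selecting them first. A sketch of such syntax appears below; the cutoff of 100
hours is arbitrary, and WWWHR and EMAILHR are assumed variable names that may differ in your file.

* List selected responses for respondents reporting very heavy Internet use.
TEMPORARY.
SELECT IF (WWWHR >= 100).
SUMMARIZE
/TABLES=WWWHR HRS1 EMAILHR
/FORMAT=VALIDLIST NOCASENUM
/CELLS=COUNT
/MISSING=VARIABLE.

Because TEMPORARY is used, the selection applies only to the SUMMARIZE procedure and the full file
remains available afterwards.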

We have valid data for 1,701 observations with 1111 missing (these numbers appear in the Case
Processing Summary table in Figure 4.7). This is a large amount of missing data, but some people
don’t use a computer (and so wouldn’t answer this question), and it is also possible that a subset
of respondents was simply not asked this and other questions about computer usage. Notice that
the standard deviation is larger than the mean, a sign of great variation in the values. The kurtosis
is also very large.


Figure 4.12 Histogram for Hours Using the Internet Per Week

The histogram shows an extremely positively skewed distribution with roughly 800 cases having
values close to zero. Although the bars for the high values are too short to see in this rendering,
there is, as we know, at least one case with a value of 130.

Finally, do you notice any pattern of peaks and valleys to the plot? For example, the
concentration of cases around 20 hours per week looks odd. This might be an example of data
heaping, when respondents can’t estimate precisely how often they do something and so choose a
convenient round number.


Figure 4.13 Box & Whisker Plot for Hours Using the Internet Per Week

Notice that all the extreme values occur at large values, unlike for hours worked last week.
Individuals who use the internet less than an hour (a value of 0) are not outliers. This is because a
value of 0 is not that far from the bulk of the observations, while values of 10 (75th percentile)
and above are.

The box is squashed because of the outliers and so is difficult to use, but it is clear how far some
of the outliers are from the bulk of cases. The positive skewness is apparent from the outliers at
the high end. Some of these are marked as extreme points (with an asterisk). While unusual
relative to the data, certainly people can work online for many hours. However, we begin to
wonder whether values above 60 or so hours are valid. If suspicious outliers appear in your data
you should check whether they are data errors. If not, you need to consider whether you wish
them included in your analysis. This is especially problematic when dealing with a small sample
(not the case here), since an outlier can substantially influence the analysis. We say more about
outliers when we discuss ANOVA and Multiple Regression in Appendices A and B.


Exploring Hours of TV Viewed


We now move to hours of TV watched per day.

The mean (2.87) is very near 3 hours, the trimmed mean is at 2.56 and the median is 2. This
suggests skewness. Do you notice anything surprising about the minimum or maximum?
Watching 20 hours of TV a day is possible, though unlikely, so perhaps it is a result of
misunderstanding the question. The trimmed mean is closer to the mean than the median,
indicating that the difference between the mean and median is not solely due to the presence of
outliers. The histogram in Figure 4.15, showing a heavy concentration of respondents at 1 and 2
hours of TV viewing, suggests why the median is at 2. There is also positive skewness and
kurtosis.

Figure 4.14 Exploratory Summaries of Daily TV Hours

Descriptives

HOURS PER DAY WATCHING TV

                                        Statistic    Std. Error
Mean                                         2.87          .087
95% Confidence Interval   Lower Bound        2.69
for Mean                  Upper Bound        3.04
5% Trimmed Mean                              2.56
Median                                       2.00
Variance                                    6.849
Std. Deviation                              2.617
Minimum                                         0
Maximum                                        20
Range                                          20
Interquartile Range                             3
Skewness                                    2.589          .082
Kurtosis                                    9.823          .163


Figure 4.15 Histogram of Daily TV Hours

This histogram identifies outliers on the high side. Other than that, it is of limited use since
TVHOURS is recorded as an integer number of hours and only a relatively small number of distinct
values occur. This same point would apply when considering use of Explore for five-point rating
scales.


Figure 4.16 Box & Whisker Plot of Daily TV Hours

In addition to the asymmetry created by the large outliers, we see the median is not centered in
the box: it is closer to the lower edge (25th percentile value). This is due to the heavy
concentration of those viewing 0 through 2 hours of TV per day.

We would not argue that something of interest always appears through use of the methods of
exploratory data analysis. However, you can quickly glance over these results and, if anything
strikes your attention, pursue it in more detail. The possibility of detecting something
unusual encourages the use of these techniques.

4.9 Appendix: Standardized (Z) Scores


The properties of the normal distribution allow us to calculate a standardized score, often
referred to as a z-score, which indicates the number of standard deviations above or below the
sample mean for each value. Standardized (Z) scores can be used to calculate the relative position
of each value in the distribution. Z-scores are most often used in statistics to standardize variables
of unequal scale units for statistical comparisons or use in multivariate procedures. We'll have
more to say about this later.

For example, if you obtain a score of 68 out of 100 on a word test, this information alone is not
enough to tell how well you did in relation to others taking the test. However, if you know the
mean score is 52.32, the standard deviation 8.00 and the scores are normally distributed, you can
calculate the proportion of people who achieved a score at least as high as your own.


The standardized score is calculated by subtracting the mean from the value of the observation in
question (68-52.32 = 15.68) and dividing by the standard deviation for the sample (15.68/8 =
1.96).

Z Score = (Case Score - Sample Mean) / Standard Deviation

Therefore, the mean of a standardized distribution is 0 and the standard deviation is 1.


In this case, your score of 68 is 1.96 standard deviations above the mean.

The histogram of the normal distribution in Figure 4.1 displays the distribution as a Z-score so the
values on the x-axis are standard deviation units. From this figure, we can see only 2.5% of the
cases are likely to have a score above 68 (1.96 standardized score). The normal distribution table
(see Table 4.1), found in an appendix of most statistics books, shows proportions for z-score
values.

Table 4.1 Normal Distribution Table

A z-score of 1.96, for example, corresponds to a value of .025 in the ‘one-tailed’ column and .050
in the ‘two-tailed’ column. The former means that the probability of obtaining a z-score at least as
large as +1.96 is .025 (or 2.5%), the latter that the probability of obtaining a z-score of more than
+1.96 or less than -1.96 is .05 (or 5%) or 2.5% at each end of the distribution. You can see these
cutoffs in the histogram displayed in Figure 4.1 Normal Distribution: Plus or Minus 1 SD and
1.96 SD.


Whether we choose a one or two-tailed probability depends upon whether we wish to consider
both ends of the distribution (two-tailed) or just one end (one-tailed). We will say more about 1-
tailed and 2-tailed probability when we discuss mean difference tests in Chapter 7.

As we mentioned, another advantage of standardized scores is that they allow for comparisons on
variables measured in different units. For example, in addition to the word test score, you might
have a mathematics test score of 150 out of 200 (or 75%). Although it appears that you did better
on the mathematics test from the percentages alone, you would need to calculate the z-score for
the mathematics test and compare the z-scores in order to answer the question.

You might want to compute z-scores for a series of variables and determine whether certain
subgroups of your sample are, on average, above or below the mean on these variables by
requesting descriptive statistics or using the Case Summaries procedure. For example, you might
want to compare a respondent's education and social economic index (SEI) values using z-scores.

The Descriptives procedure has an option to calculate standardized score variables. A new
variable containing the standardized values is calculated for the specified variables. To create z-
scores for education and socioeconomic index,

Click Analyze…Descriptive Statistics…Descriptives


Move EDUC and SEI to the Variable(s): list box
Click Save standardized values as variables
Click OK

Figure 4.17 Descriptives Dialog Box to Create Z-Scores

By default, the new variable name is the old variable name prefixed with the letter "Z". Two new
variables, zeduc and zsei, containing the z-scores of the two variables, are created at the end of the
data.


Figure 4.18 Two Z-score Variables in the Data Editor

These variables can be saved in your file and used in any statistical procedure.
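
For example, to compare men and women on the standardized education and SEI scores, you could run
something along these lines (a sketch; it assumes the new variables are named Zeduc and Zsei and
uses the GENDER variable referenced in the next chapters):

MEANS TABLES=Zeduc Zsei BY GENDER
/CELLS=MEAN COUNT STDDEV.

Group means above zero indicate that a group is above the overall sample average on that variable;
means below zero indicate the reverse.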

Note: You can assign specific names to the z-score variables by using the DESCRIPTIVES
syntax command. Paste the syntax and add the z-score variable name in parentheses after the
variable name in the VARIABLES subcommand as in:

DESCRIPTIVES
VARIABLES=EDUC (Zscore_EDUC) SEI (Zscore_SEI)
/SAVE
/STATISTICS=MEAN STDDEV MIN MAX .

This would name the new variables Zscore_EDUC and Zscore_SEI.


Summary Exercises
We will later compare different groups on the average number of children they have (CHILDS),
their age (AGE) and number of hours spent on email per week (EMAILHR).

1. In anticipation of this, run an exploratory data analysis on these variables. Use a
   histogram rather than a stem & leaf plot. Review the results. Keep in mind that this is a
   U.S. adult sample; do you see anything unusual about the age range?
2. Using Chart Builder, run a histogram of AGE with the normal curve. Looking at this
   chart and the information from Explore, how would you describe the distribution of
   AGE? Given this information, how might you group years of age into a category
   variable?
3. Use Visual Binning (Transform…Visual Binning) to create a new grouped AGE variable.
   If you wish, save the modification in a data file named MYGSS2004.sav.

For those with extra time:

1. Number of children (CHILDS) is coded 0 through 8, where 8 indicates eight or more
   children. Look at the exploratory output, or run a frequency analysis on the variable.
   Would you expect the truncation of CHILDS to have much influence on an analysis?


Chapter 5: Probability and Inferential Statistics

Topics
• The Nature of Probability
• Making Inferences about Populations from Samples
• Influence of Sample Size
• Hypothesis Testing
• Type I and Type II Statistical Errors
• Statistical Significance and Practical Importance

Data
Data files showing the same percentages or means for samples of 100, 400 and 1,600. A data file
containing 10,000 observations drawn from a normal population with mean 70 and standard
deviation of 10.

Scenario
In this chapter, we overview some basic statistical concepts and principles that are needed to
understand the assumptions and interpretation of inferential statistical techniques. We then
display a series of analyses in which only the sample size varies and see which outcome measures
change. Finally, we discuss scenarios in which statistical significance and practical importance do
not coincide.

5.1 The Nature of Probability


Up to this point, we have used descriptive statistics (that is, literally describing the data in our
sample through the use of a number of summary procedures and statistics). The statistical
methods described later in this course are termed inferential in that the data we have collected
will be used to provide more generalized conclusions. In other words, we want to infer the
results from the sample on which we have data to the population which the sample represents.
To do this, we use procedures that involve the calculation of probabilities. The fundamental issue
with inferential statistical tests concerns whether any 'effects' (i.e. relationships or differences
between groups) we have found are 'genuine' or as a result of sampling variability (in other
words, mere 'chance'). A probability value can be defined as 'the mathematical likelihood of a
given event occurring', and as such we can use such values to assess whether the likelihood that
any differences we have found are the result of random chance.

Consider the following example. Let's suppose we have conducted a study and found that there is
a slight difference between the mean blood pressure of left-handed people and the mean blood
pressure of right-handed people. We would be naive to expect both means to be exactly the same,
so has this difference occurred due to chance or does it simply reflect that in the population, there
really is a difference in the mean blood pressure of these two groups? In later chapters, we will
see just how researchers answer such a question.


5.2 Making Inferences about Populations from Samples


Ideally, we would have data about everyone we wished to study (i.e. in the whole population). In
practice, we rarely have information about all members of our population and instead collect
information from a representative sample of the population. However, our goal is to make
generalizations about various characteristics of the population based on the known facts about the
sample.

We choose the sample with the intention of using the data from that sample to make inferences
about the ‘true’ values in the population. These population measures are referred to as
parameters while the equivalent measures from samples are known as statistics. It is unlikely
that we will know the population parameters; therefore we use the sample statistics to infer what
these population values will be. As noted in section 5.1, these statistical techniques are known as
inferential in contrast to the purely descriptive analyses we have considered so far.

We have already considered a number of statistics and parameters such as means, proportions,
standard deviations, etc. An important distinction between parameters and statistics is that
parameters are fixed (although often not known) while statistics vary from one sample to another.
Due to the effects of random variability, it is unlikely that any two samples drawn from the same
population will produce the same statistics. By plotting the values of a particular statistic (e.g. the
mean) from a large number of samples, it is possible to obtain a sampling distribution of the
statistic. For small numbers of samples, the mean of the sampling distribution may not closely
resemble that of the population. However, as the number of samples taken increases, the mean of
the sampling distribution (the mean of all the means, if you like) gets closer to the population
mean. For an infinitely large number of samples, the mean will be exactly the same as
the population mean. Additionally, as the sample size increases, the amount of variability in the
distribution of sample means decreases. If you think of the variability in terms of the error made
in estimating the mean, it should be clear that the more evidence you have (i.e. the more cases in
your sample), the smaller will be the error in estimating the mean.

Of course, it is unlikely you will ever be able to take repeated samples - you usually get just the
one chance and must therefore base your conclusions on the data from this one sample.

If repeated random samples of size N are drawn from any population, then as N becomes large,
the sampling distribution of sample means approaches normality - a phenomenon known as the
Central Limit Theorem. This is an extremely useful statistical concept as it does not require that
the original population distribution is normal. In the next section, we'll take a closer look at the
influence of sample size on the precision of the statistics.

5.3 Influence of Sample Size


In statistical analysis sample size plays an important role, but one that can easily be overlooked
since a minimum sample size is not required for the most commonly used statistical tests.
Workers in some areas of applied statistics (engineering, medical research) routinely estimate the
effects of sample size on their analyses (termed power analysis). This is less frequently done in
social science and market research. The formulas for standard errors describe the effect of sample
size. Here we will demonstrate the effect in two common data analysis situations: crosstabulation
tables and mean summaries.


Precision of Percentages
Precision is strongly influenced by the sample size. In the figures below we present a series of
crosstabulation tables containing identical percentages, but with varying sample sizes. We will
observe how the test statistics change with sample size and relate this result to the precision of the
measurement. The results below assume a population of infinite size or at least one much larger
than the sample. For precision calculations involving percentages with finite populations see
Cochran (1977), Kish (1965) or other survey sampling texts.

Note
The Chi-square test of independence will be presented for each table as part of the presentation of
the effect of changing sample size. This test assumes that each sample is representative of the
entire population. A detailed discussion of the chi-square statistic, its assumptions and
interpretation, can be found in Chapter 6.

Sample Size of 100


The table below displays responses of men and women to a question asking for which candidate
they would vote. The table was constructed by adjusting case weights to reflect a sample of 100.

Figure 5.1 Crosstab Table with Sample of 100

We see that 46 percent of the men and 54 percent of the women choose candidate A, resulting in
an 8% difference. Since this sample of 100 people only imperfectly reflects the population, we turn to
the chi-square test to assess whether men differ from women in the population. (As noted above,
the chi-square test will be examined closely in Chapter 6). Here we simply note the chi-square
value (.640) and state that the significance value of .424 indicates that men and women share the
same view (do not differ significantly) concerning candidate choice. The significance value of


.424 suggests that if men and women in the population had identical attitudes toward the
candidates, with a sample of 100 we could observe a gender difference of 8 or more percentage
points about 42% of the time. Thus we are fairly likely to find such a difference (8%) in a small
sample even if there is no gender difference.

Sample Size of 400


Now we view a table with percentages identical to the previous one, but based on a sample of 400
people, four times as large as before.

Figure 5.2 Crosstabulation Table with Sample of 400

The gender difference remains at 8% with fewer men choosing Candidate A. Although the
percentages are identical, the chi-square value has increased by a factor of four (from .640 to
2.56) and the significance value is smaller (.11). This significance value of .11 suggests that if
men and women in the population had identical attitudes toward the candidates, with a sample of
400 we would observe a gender difference of 8 or more percentage points about 11% of the time.
Thus with a bigger sample, we are much less likely to find such a large (8%) percentage
difference if the men’s and women’s attitudes are identical. Since much statistical testing uses a
cutoff value of .05 when judging whether a difference is significant, this result is close to being
judged statistically significant.

Sample Size of 1,600


Finally we present the same table of percentages, but increase the sample size to 1,600; the
increase is once again by a factor of four.


Figure 5.3 Crosstabulation Table with Sample of 1,600

The percentages are identical to the previous tables and so the gender difference remains at 8%.
The chi-square value (10.24) is four times that of the previous table and sixteen times that of the
first table. Notice that the significance value is quite small (.001), indicating a statistically
significant difference between men and women. With a sample as large as 1,600 it is very
unlikely (.001 or 1 chance in 1000) that we would observe a difference of 8 or more percentage
points between men and women if they did not differ in the population.

Thus the 8% sample difference between two groups is highly significant if the sample is 1,600,
but not significant (testing at .05 level) with a sample of 100. This is because the precision with
which we measure the candidate preference increases with the sample size, and as our
measurement grows more precise the 8% sample difference looms large. This relationship is
quantified in the next section.

Sample Size and Precision


In the series of crosstabulation tables we saw that as the sample size increased we were more
likely to conclude there was a statistically significant difference between two groups when the
magnitude of the sample difference was constant (8%). This is because the precision with which
we estimate the population percentage increases with increasing sample size. This relation can be
approximated (see the note at the end of this chapter for the exact relationship) by a simple
equation: the precision of a sample proportion is approximately equal to one divided by the
square root of the sample size. Table 5.1 displays the precision for the sample sizes used in our
examples.


Table 5.1 Sample Size and Precision for Different Sample Sizes

Sample Size     Precision
100             1/sqrt(100)  = 1/10     .10 or 10%
400             1/sqrt(400)  = 1/20     .05 or 5%
1600            1/sqrt(1600) = 1/40     .025 or 2.5%

And to obtain a precision of 1%, we would need a sample of 10,000 (1/sqrt(10,000) = 1/100). We
can understand now why surveys don’t often state that the results are accurate within ±1%.

Since precision increases as the square root of the sample size, in order to double the precision we
must increase the sample size by a factor of four. This is an unfortunate and expensive fact of
survey research. In practice, samples between 500 and 1,500 are often selected for national
studies.

Precision of Means
The same basic relation—that precision increases with the square root of the sample size—
applies to sample means as well. To illustrate this we display histograms based on different
samples from a normally distributed population with mean 70 and standard deviation 10. We first
view a histogram based on a sample of 10,000 individual observations. Next we will view a
histogram of 1,000 sample means where each mean is composed of 10 observations. The third
histogram is composed of 100 sample means, but here each mean is based on 100 observations.
We will focus our attention on how the standard deviation changes when sample means are the
units of observation. To aid such comparisons the scale is kept constant across histograms.
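
If you would like to construct a similar demonstration yourself, data of this general kind can be
generated with command syntax. The sketch below is one way to do it (the variable names, seed value
and output file name are arbitrary): it creates 10,000 observations from a normal population with
mean 70 and standard deviation 10, then aggregates them into 1,000 means of 10 observations each.

* Generate 10,000 observations from a normal population (mean 70, SD 10).
SET SEED=123456.
INPUT PROGRAM.
LOOP #I = 1 TO 10000.
COMPUTE SCORE = RV.NORMAL(70, 10).
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.
* Group cases into 1,000 samples of 10 and compute each sample mean.
COMPUTE SAMPLE10 = TRUNC(($CASENUM - 1) / 10).
AGGREGATE
/OUTFILE='means10.sav'
/BREAK=SAMPLE10
/MEAN10=MEAN(SCORE).

A histogram of MEAN10 in means10.sav should have a standard deviation near 10/sqrt(10), or about
3.16, illustrating the relationship described in this section.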


A Large Sample of Individuals


Below is a histogram of 10,000 observations drawn from a normal distribution of mean 70 and
standard deviation 10.

Figure 5.4 Histogram of 10,000 Observations

We see that a sample of this size closely matches its population. The sample mean is very close to
70, the sample standard deviation is near 10, and the shape of the distribution is normal.


Means Based on Samples of 10


The second histogram displays 1,000 sample means drawn from the same population (mean 70,
standard deviation 10). Here each observation is a mean based on 10 data points. In other words
we pick samples of ten each and plot their means in the histogram below.

Figure 5.5 Histogram of Means Based on Samples of 10

The overall average of the sample means is about 70, while the standard deviation of the sample
means (standard error) is reduced to 3.11. Comparing the two histograms we see there is less
variation (standard deviation of 3.11 versus 10) among means based on groups of observations
than among the observations themselves. Recall the rule of thumb that precision is a function of
the square root of the sample size. If the population standard deviation were 10, we would expect
the standard deviation of means based on samples of 10 to be the population figure reduced by a
factor of 1/square root(N) or 1/square root(10), or .316. If we multiply this factor (.316) by the
population standard deviation (10), the theoretical value we get (3.16) is very close to what we
observe in our sample (3.11). Thus by increasing the sample size by a factor of ten (from single
observations to means of ten observations each) we reduce the imprecision (increase the
precision) by the factor (1/square root(10)). The shape of the distribution remains normal.


Means Based on Samples of 100


The next histogram is based on a sample of 100 means where each mean represents 100
observations.

Figure 5.6 Histogram of Means Based on Samples of 100

While quite compressed, the distribution still resembles a normal curve. The overall mean
remains at 70 while the standard deviation is very close to 1 (1.00). This is what we expect since
with samples of 100, the expected value of the standard deviation of the sample mean (standard
error) would be the population standard deviation divided by the square root of 100. Thus the
theoretical sample standard deviation would be 10/square root(100) or 1.00, which is quite close
to our observed value.

Thus with means as well as percents, precision increases with the square root of the sample size.

Statistical Power Analysis


With increasing precision we are better able to detect small differences that exist between groups
and small relationships between variables. Power analysis was developed to aid researchers in
determining the minimum sample size required in order to have a specified chance of detecting a
true difference or relationship of a given size. To put it more simply, power is used to quantify
your ability to reject the null hypothesis when it is false. For example, suppose a researcher hopes
to find a mean difference of .8 standard deviation units between two populations. A power
calculation can determine the sample size necessary to have a 90% chance that a significant
difference will be found between the sample means when performing a statistical test at a
specified significance level. Thus a researcher can evaluate whether the sample is large enough


for the purpose of the study. The SPSS SamplePower program performs power analysis. Also,
books by Cohen (1988) and Kraemer & Thiemann (1987) discuss power analysis and present
tables used to perform the calculation for common statistical tests. In addition specialty software
is available for such analyses. Power analysis can be very useful when planning a study, but does
require such information as the magnitude of the hypothesized effect and an estimate of the
variance.

5.4 Hypothesis Testing


Whenever we wish to make an inference about a population from our sample, we must specify a
hypothesis to test. It is common practice to state two hypotheses: the null hypothesis (also
known as H0) and the alternative hypothesis (H1). The null hypothesis being tested is
conventionally the one in which no effect is present. For example, we might be looking for
differences in mean income between males and females, but the (null) hypothesis we are testing is
that there is no difference between the groups. If the evidence is such that this null hypothesis is
unlikely to be true, the alternative hypothesis should be accepted. Another way of thinking about
the problem is to make a comparison with the criminal justice system. Here, a defendant is treated
as innocent (i.e. the null hypothesis is accepted) until there is enough evidence to suggest that
they perpetrated the crime beyond any reasonable doubt (i.e. the null hypothesis is rejected).

The alternative hypothesis is generally (although not exclusively) the one we are really interested
in and can take any form. In the above example, we might hypothesize that males will have a
higher mean income than females. When the alternative hypothesis has a ‘direction’ (i.e. we
expect a specific result), the test is referred to as one-tailed. Often, you do not know in which
direction to expect a difference and may simply wish to leave the alternative hypothesis open-
ended. This is a two-tailed test and the alternative hypothesis would simply be that the mean
incomes of males and females are different. Whichever option you choose will have implications
when interpreting the probability levels. In general, the probability of the occurrence of a
particular statistic for a one-tailed test will be half that of a two-tailed test as only one extreme of
the distribution is being considered in the former type of test. You will see an example of this
when we demonstrate the T Test procedure in Chapter 7.

Significance Criteria Level


Having formally stated your hypotheses, you must then select a criterion for acceptance or
rejection of the null hypothesis. With probability tests such as the chi-square test or the t-test, you
are testing the likelihood that a statistic of the magnitude obtained would have occurred by
chance assuming that the null hypothesis (i.e. that there is no difference in the population) is true.
In other words, we only wish to reject the null hypothesis when we can say that the result would
have been extremely unlikely under the conditions set by the null hypothesis. In this case, the
alternative hypothesis should be accepted. It is worth noting that this does not ‘prove’ the
alternative hypothesis beyond doubt; it merely tells us that the null hypothesis is unlikely to be
true.

But what criterion (or alpha level, as it is often known) should we use? Unfortunately, there is no
easy answer! Traditionally, a 5% level is chosen, indicating that a statistic of the size obtained
would only be likely to occur on 5% of occasions (or once-in-twenty) should the null hypothesis
be true. This also means that, by choosing a 5% criterion, you are accepting that you will make a
mistake in rejecting the null hypothesis 5% of the time.


The 5% cut-off point is not a hard and fast rule, however. Some prefer to choose a 10% level,
others a far more conservative 1% or even 0.1%. In this last case, a statistic would only be
accepted as significant if it was shown to occur on 0.1%, or one-in-a-thousand, of all occasions
under the null hypothesis. The level you choose will, to a large extent, depend upon the
importance of getting the answer correct. If performing more exploratory research, where the
outcome is not so critical, you may decide upon a more liberal region of rejection such as 10%.
Alternatively, if carrying out potentially life-or-death clinical trials, you will wish to be as certain
as possible that you have made the correct choice. In these cases, the more conservative the
criterion, the ‘safer’ you will be should you achieve a significant result.

5.5 Types of Statistical Errors


Recall that when performing statistical tests we are generally attempting to draw conclusions
about the larger population based on information collected in the sample. There are two major
types of errors in this process. False positives, or Type I errors, occur when no difference (or
relation) exists in the population, but the sample tests indicate there are significant differences (or
relations). Thus the researcher falsely concludes a positive result. This type of error is explicitly
taken into account when performing statistical tests. When testing for statistical significance
using a .05 criterion (alpha level), we acknowledge that if there is no effect in the population then
the sample statistic will exceed the criterion on average 5 times in 100 (.05).

Type II errors, or false negatives, are mistakes in which there is a true effect in the population
(difference or relation) but the sample test statistic is not significant, leading to a false conclusion
of no effect. To put it briefly, a true effect remains undiscovered. The probability of making this
type of error is often referred to as the beta level. Whereas you can select your own alpha levels,
beta levels are dependent upon things such as the alpha level and the size of the sample. It is
helpful to note that statistical power, the probability of detecting a true effect, equals 1 minus the
Type II error and the higher the power the better.

Table 5.2 Types of Statistical Errors in Hypothesis Testing

                                          Statistical Test Outcome
                                     Not Significant          Significant
Population   No Difference           Correct                  Type I error (α)
             (H0 is True)                                     False positive
             True Difference         Type II error (β)        Correct
             (H1 is True)            False negative

When other factors are held constant there is a tradeoff between the two types of errors; thus
Type II error can be reduced at the price of increasing Type I error. In certain disciplines, for
example in statistical quality control when destructive testing is done, the relationship between
the two error types is explicitly taken into account and an optimal balance determined based on
cost considerations. In social science research, the tradeoff is acknowledged but rarely taken into
account (the exception being power analysis); instead emphasis is usually placed on maintaining
a steady Type I error rate at some criterion level, commonly .05 (5%). This discussion merely
touches the surface of these issues; researchers working with small samples or studying small
effects should be very aware of them.


5.6 Statistical Significance and Practical Importance


A related issue involves drawing a distinction between statistical significance and practical
importance. When an effect is found to be statistically significant we conclude that the population
effect (difference or relation) is not zero. However, this allows for a statistically significant effect
that is not quite zero, yet so small as to be insignificant from a practical or policy perspective.
This notion of practical or real world importance is also called ecological significance. Recalling
our discussion of precision and sample size, very large samples yield increased precision, and in
such samples very small effects may be found to be statistically significant. In such situations, the
question arises as to whether the effects make any practical difference. For example, suppose a
company is interested in customer ratings of one of its products and obtains rating scores from
several thousand customers. Furthermore, suppose mean ratings on a 1 to 5 satisfaction scale are
3.25 for male and 3.15 for female customers, and this difference is found to be significant. Would
such a small difference be of any practical interest or use?

When sample sizes are small (say under 30), precision tends to be poor and so only relatively
large (and ecologically significant) effects are found to be statistically significant. With moderate
samples (say 50 to one or two hundred) small effects tend to show modest significance while
large effects are highly significant. For very large samples, several hundreds or thousands, small
effects can be highly significant; thus an important aspect of the analysis is to examine the effect
size and determine if it is important from a practical, policy or ecological perspective. In
summary, the statistical tests we cover in this course provide information as to whether there are
non-zero effects. Estimates of the effect size should be examined to determine whether the effects
are important.

Computational Note: Precision of Percentage Estimates


In this chapter we suggested, as a rule of thumb, that the precision of a sample proportion is
approximately equal to one divided by the square root of the sample size. Formally, for a
binomial or multinomial distribution (a variable measured on a nominal or ordinal scale), the
standard error of the sample proportion (P) is equal to

StdErr(P) = sqrt( P * (1 − P) / N )

Thus the standard error is a maximum when P = .5 and reaches a minimum of 0 when P = 0 or 1.
A 95% confidence band is usually determined by taking the sample estimate plus or minus twice
the standard error. Precision (pre) here is simply two times the standard error. Thus precision
(pre) is

pre(P) = 2 * sqrt( P * (1 − P) / N ).

If we substitute for P the value .5 which maximizes the expression (and is therefore conservative)
we have

pre(0.5) = 2 * sqrt( 0.5 * (1 − 0.5) / N )
         = 2 * sqrt( (0.5) * (0.5) / N )
         = 2 * (0.5) / sqrt(N)
         = 1 / sqrt(N)


This validates the rule of thumb used in the chapter. Since the rule of thumb employs the value of
P=.5, which maximizes the standard deviation and thus the standard error, in practice, greater
precision would be obtained when P departs from .5.
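
For example (the value P = .2 is chosen arbitrarily), with N = 1,600 and P = .2:

pre(0.2) = 2 * sqrt( 0.2 * 0.8 / 1600 ) = 2 * sqrt(.0001) = .02

or 2%, compared with the 2.5% given by the conservative P = .5 rule of thumb.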

It is important to note that this calculation assumes the population size is infinite, or as an
approximation, much larger than the sample. Formulations that take finite population values into
account can be found in Kish (1965) and other texts discussing sampling. When applied to survey
data, the calculation also assumes that the survey was carried out in a methodologically sound
manner. Otherwise, the validity of the sample proportion itself is called into question.


Chapter 6: Comparing Categorical Variables

Topics
• Typical Applications
• Crosstabulation Tables
• Testing the Relationship: Chi-Square Statistic
• Additional Two-Way Crosstabs
• Graphing the Crosstabs Results
• Adding Control Variables
• Extensions: Beyond Crosstabs
• Appendix: Measures of Association

Data
In this chapter, we continue to use the GSS2004Intro.sav file.

Scenario
Using the GSS 2004 data, our interest is in investigating whether men differ from women in their
belief in an afterlife and in their use of the computer. In addition, we will explore whether
education relates to these measures.

6.1 Typical Applications


Thus far we have examined each variable isolated from the others. A main component of most
studies is to look for relationships among variables or to compare groups on some measure. Our
choice of variables is based on our view of which questions might be interesting to investigate.
More often a study is designed to answer specific questions of interest to the researcher. These
may be theoretical as in an academic project, or quite applied as often found in market research.

The crosstabulation table is the basic technique used to examine relationships among categorical
variables. Crosstabs are used in practically all areas of research. A crosstabulation table is a co-
frequency table of counts, where each row or column is a frequency table of one variable for
observations falling within the same category of the other variable. When one of the variables
identifies groups of interest (for example, a demographic variable) crosstabulation tables permit
comparisons between groups. In survey work, two attitudinal measures are often displayed in a
crosstab to assess their relationship. While the most common tables involve two variables,
crosstabulations are general enough to handle additional variables and we will discuss a simple
three-variable analysis.

A crosstabulation table can serve several purposes. It might be used descriptively, that is, the
emphasis is on providing some information about the state of things and not on inferential
statistical testing. For example, demographic information about members of an organization
(company employees, students at a college, members of a professional group) or recipients of a
service (hospital patients, season ticket holders) can be displayed using crosstabulation tables.


Here the point is to provide summary information describing the groups and not to make explicit
comparisons that generalize to larger populations. For example, an educational institution might
publish a crosstabulation table reporting student outcome (dropout, return) for its different
divisions. For this purpose, the crosstabulation table is descriptive.

Crosstabulation tables are also used in research studies where the goal is to draw conclusions
about relationships in the population based on sample data (recall our discussion in Chapter 5).
Many survey studies and experiments have this as their goal. In order to make such inferences,
statistical tests (usually the chi-square test of independence) are applied to the tables. In this
chapter we will begin by discussing a simple table displaying gender and belief in the afterlife.
We will then outline the logic of applying a statistical test to the data, perform the test, and
interpret the results. To provide reinforcement, other two-way tables will be considered.

In addition to the statistical tests, researchers occasionally desire a numeric summary of the
strength of the association between the two variables in a crosstabulation table. We provide a
brief review of some of these measures.

Another aspect of data analysis involves graphical display of the results. We will see how bar
charts can be used to present the data in crosstabulation tables. Finally, we will explore a three-
way table and point in the direction of more advanced methods. We begin, however, with a simple
table.

6.2 Crosstabulation Tables


The Crosstabs procedure in SPSS Statistics produces crosstabulation tables on at least two
categorical variables. These tables are most useful when there are a relatively small number of
categories. As we noted in Chapter 2, you might want to combine categories, especially those
with a small number of cases, before running the crosstab. This is easily done using the Recode or
Visual Binning facilities on the Transform menu.

To request a crosstabulation table we need to specify the row variable and the column variable.
We will specify POSTLIFE as the row variable and GENDER for the column variable. Note that
multiple variables can be given for both.

Open GSS2004Intro.sav data file (if necessary)


Click Analyze…Descriptive Statistics…Crosstabs…
Move POSTLIFE into the Row(s): box
Move GENDER into the Column(s): box

A checkbox option is available to graph the crosstabulation table results as a clustered bar chart
based on counts. Rather than request a bar chart of counts now, we will later use the Graphs menu
to construct a clustered bar chart based on percents. The Suppress tables option is available if you
want to see the crosstabulation statistical measures but not the crosstabulation tables. A button
labeled Exact will appear if the SPSS Exact Tests add-on module is installed.


Figure 6.1 Crosstabs Dialog Box

Because GENDER is designated as the Column variable, each gender group will appear as a
separate column in the table, and the categories of POSTLIFE will define the rows of the table.
The Layer box can be used to build three-way and higher-order tables; we will see this feature
later in the chapter. By default the Crosstabs procedure will display only counts in the cells of the
table. For interpretive purposes we want percentages as well. The Cells option button controls the
summaries appearing in the cells of the table.

Click Cells
Click the Column check box in the Percentages area to obtain column percentages

The completed Cells dialog box is shown in Figure 6.2.


Figure 6.2 Crosstab Cell Display Dialog

Row, column and total table percentages can be requested. Row percentages are computed within
each row of the table so that the percentages across a row sum to 100%. Column percentages
would sum to 100% down each column, and total percentages sum to 100% across all cells of the
table. While we can request any or all of these percentages, the column percentage best suits our
purpose. If there is a variable that can be considered to be the independent variable (which
GENDER is here), then the appropriate table percentage is based on that dimension. Since
GENDER is our column variable, column percentages allow immediate comparison of the
percentages of men and women who believe in an afterlife, which is the question we wish to
explore. We will not request row percents because we are not directly interested in them and wish
to keep the table simple.
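
If you want to reproduce this table outside SPSS Statistics, the following is a minimal sketch in
Python, assuming the pandas library is installed (reading the .sav file also requires the pyreadstat
package) and that the variables come through with the names POSTLIFE and GENDER:

import pandas as pd

# Load the GSS data; pd.read_spss converts SPSS value labels to category labels by default
gss = pd.read_spss("GSS2004Intro.sav")

# Cell counts and column percentages, analogous to the Crosstabs request described above
counts = pd.crosstab(gss["POSTLIFE"], gss["GENDER"])
col_pct = pd.crosstab(gss["POSTLIFE"], gss["GENDER"], normalize="columns") * 100

print(counts)
print(col_pct.round(1))

Cases with a missing value on either variable are dropped from the table, as in the Crosstabs
procedure.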

Notice that Observed Counts is checked by default. The other choices (Expected Count and
Residuals) are more technical summaries and will be considered in the next example.

Click Continue
Click OK


Figure 6.3 Crosstabulation of Belief in an Afterlife by Gender

BELIEF IN LIFE AFTER DEATH * GENDER OF RESPONDENT Crosstabulation

                                                       GENDER OF RESPONDENT
                                                        Female      Male     Total
BELIEF IN LIFE    YES    Count                             541       417       958
AFTER DEATH              % within GENDER OF RESPONDENT   86.0%     76.9%     81.8%
                  NO     Count                              88       125       213
                         % within GENDER OF RESPONDENT   14.0%     23.1%     18.2%
Total                    Count                             629       542      1171
                         % within GENDER OF RESPONDENT  100.0%    100.0%    100.0%

The statistics labels appear in the row dimension of the table. The two numbers in each cell are
counts and column percentages. We see that about 77% of the men and 86% of the women said
they believe in an afterlife. The table includes only those respondents who had valid values on
both questions. Although gender is known for all respondents, the belief in life after death
question was not asked of all respondents. The Case Processing Summary table (not shown) tells
us that 58.4% (1641 cases) were excluded from the table. The row and column totals, often
referred to as marginals, show the frequency counts for each variable. The column percentages
total to 100%. The total row percentages are the percentage for each category of belief in afterlife
based on all respondents in the table. On the descriptive level we can say that most of the sample
believed in the afterlife (look at the Total row percentages in the column labeled “Total”). If we
wish to draw conclusions about the population, for example differences between men and
women, we would need to perform statistical tests.

Row percentages, if requested, would indicate what percentage of believers is male and what
percentage of believers is female. In other words, the percentages would sum to 100% across
each row. Your choice of row versus column percents determines your view of the data. In survey
research, independent variables, such as demographics, are often positioned as column variables
(or as a banner variable in the stub and banner tables of market research), and since there is much
interest in comparing these groups, column percents are displayed. If you prefer to interpret row
percentages in this context, or wish both percentages to appear, feel free to do so. The important
point is that the percentages help answer the question of interest in a direct fashion.

Having examined the basic two-way table, we move on to ask questions about the larger
population.

6.3 Testing the Relationship: Chi-Square Test


In the table viewed in Figure 6.3, 77% of the men in the sample and 86% of the women believed
in an afterlife. There is a difference in the sample of approximately 9% with a higher proportion
of women believing. Can we conclude from this that there is a population difference between men
and women on this issue (statistical significance)? And if there is a difference in the population, is
it large enough to be important to us (ecological significance)?


The difficulty we face is that the sample is an incomplete and imperfect reflection of the
population. We use statistical tests to draw conclusions about the population from the sample
data.

Basic Logic of Statistical Tests


In general, we assume that there is no effect (null hypothesis) in the population. In our example,

Ho (Null Hypothesis) assumes that gender and belief in an afterlife are independent of
each other.

We then calculate how likely it is that a sample could show as large (or larger) an effect as what
we observe (here a 9% difference), if there were truly no effect in the population. If the
probability of obtaining so large a sample effect by chance alone is very small (a cutoff of 5
chances in 100, or 5%, is often used), we reject the null hypothesis and conclude there is an effect
in the population. While this approach may seem backward (we assume no effect when we wish
to demonstrate an effect), it provides a valid method of forming conclusions about the population.
The details of how this logic is applied will vary depending on the type of data (counts, means,
other summary measures) and the question asked (differences, association). So we will use a chi-
square test in this chapter, but t and F tests later.

Logic of Chi-Square
Applying the testing logic to the crosstabulation table, we first calculate the number of people
expected to fall into each cell of the table assuming the null hypothesis (no relationship between
gender and belief in an afterlife), then compare these numbers to what we actually obtained in the
sample. If there is a close match we retain (do not reject) the null hypothesis of no effect. If the actual cell
counts differ dramatically from what is expected under the null hypothesis we will conclude there
is a gender difference in the population. The chi-square statistic summarizes the discrepancy
between what is observed and what we expect under the null hypothesis. In addition, the sample
chi-square value can be converted into a probability that can be readily interpreted by the analyst.

Assumptions of the Chi-Square Test


• Each observation is independent of all other observations, i.e. that each individual
contributed one observation only to the data set.
• Each observation can fall into only one cell in the table.
• The sample size should be large. The larger the sample size, the more accurate the
estimate. Although there is no definitive guide governing what size a sample must be to
achieve this criterion, some useful guidelines are given in the output and we discuss these
later.

The chi-square statistic is calculated by:

1. Computing the difference between the observed and expected frequencies for each cell.

2. Squaring the difference and dividing by the expected frequency of that cell.

3. Summing these values across all cells.


6.4 Requesting the Chi-Square Test


To demonstrate how this works in practice, we will rerun the same analysis as before, but request
the chi-square statistic. We will also ask that some supplementary information appear in the cells
of the table to better envision the actual chi-square calculation. In practice, you would rarely ask
for this latter information to be displayed. The chi-square statistic is requested with the Crosstabs
Statistics option button.

We return to the previous Crosstabs dialog box to request the chi-square statistic.

Click the Dialog Recall tool, and then click Crosstabs


Click the Statistics button
Click the Chi-square check box

Figure 6.4 Crosstab Statistics Dialog Box

The first choice is the chi-square test of independence of the row and column variables. Most of
the remaining statistics are association measures that attempt to assign a single number to
represent the strength of the relationship between the two variables. We will briefly discuss them
later in this chapter.

Click Continue

To illustrate the chi-square calculation we also request some technical results (expected values
and residuals) in the cells of the table. Once again, you would not typically display these
statistics. We request expected counts and unstandardized residuals using the Cells option button.

Click the Cells button


Check Expected in the Counts area and Unstandardized in the Residuals area


Figure 6.5 Cell Display Dialog Box

By displaying the expected counts we can see how many observations are expected to fall into
each cell, assuming no relationship between the row and column variables. The unstandardized
residual is the difference between the observed count and the expected count in the cell. As such
it is a measure of the discrepancy between what we expect under the null hypothesis and what we
actually observe. In this course we will not explore the other residuals listed, but note they can be
used with large tables to identify quickly cells that exhibit the greatest deviations from
independence.

Click Continue
Click OK

6.5 Interpreting the Output


As we can see in Figure 6.6, the counts and percentages are the same as before; the expected
counts and residuals will aid in explaining the calculation of the chi-square statistic. Recall that
our testing logic assumes no relation between the row and column variables (here gender and
belief in an afterlife) in the population, and then determines how consistent the data are with this
assumption. In this table there are 417 males who say they believe in an afterlife.


Figure 6.6 Crosstab with Expected Values and Residuals

BELIEF IN LIFE AFTER DEATH * GENDER OF RESPONDENT Crosstabulation

                                                       GENDER OF RESPONDENT
                                                        Female      Male     Total
BELIEF IN LIFE    YES    Count                             541       417       958
AFTER DEATH              Expected Count                  514.6     443.4     958.0
                         % within GENDER OF RESPONDENT   86.0%     76.9%     81.8%
                         Residual                          26.4     -26.4
                  NO     Count                              88       125       213
                         Expected Count                  114.4      98.6     213.0
                         % within GENDER OF RESPONDENT   14.0%     23.1%     18.2%
                         Residual                         -26.4      26.4
Total                    Count                             629       542      1171
                         Expected Count                  629.0     542.0    1171.0
                         % within GENDER OF RESPONDENT  100.0%    100.0%    100.0%

We now need to calculate how many observations should fall into this cell if gender and belief in
an afterlife were independent of each other. First, note (we calculate this from the counts in the
cells and in the margins of the table) that 46.3% (542 of 1171, or .463) of the sample is male and
81.8% (shown in the row total) of the sample believes in an afterlife. If gender is unrelated to
belief in an afterlife, the probability of picking someone from the sample who is both a male and
a believer would be the product of the probability of picking a male and the probability of picking
a believer, that is, .463*.818 or .379 (37.9%). This is based on the probability of the joint event
equaling the product of the probabilities of the separate events when the events are independent—
for example, the probability of obtaining two heads when flipping coins. Taking this a step
further, if the probability of picking a male believer is 37.9% and our sample is composed of
1171 people, then we would expect to find 443.4 male believers in the sample (.379*1171). This
number is the expected count for the male-believer cell, assuming no relation between gender and
belief. We observed 417 male believers while we expected to find 443.4, and so the discrepancy
or residual is -26.4. Small residuals indicate agreement between the data and the null hypothesis
of no relationship; large residuals suggest the data are inconsistent with the null hypothesis.

Expected counts and residuals are calculated in this manner for each cell in the table. Since
simply summing the residuals has the disadvantage of negative and positive residuals
(discrepancies) canceling each other out, the residuals are squared so all values are positive. A
second consideration is that a residual of 50 would be large relative to an expected count of 15,
but small relative to an expected count of 2,000.

To compensate for this the squared residual from each cell is divided by the expected count of the
cell. The sum of these cell summaries—((Observed count - Expected count)^2 / Expected
count)—constitutes the Pearson chi-square statistic.
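
The arithmetic just described is easy to verify outside SPSS. The following is a minimal sketch in
Python, assuming NumPy and SciPy are available; the observed counts are taken from Figure 6.6:

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[541, 417],    # YES: Female, Male
                     [ 88, 125]])   # NO:  Female, Male

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

expected = row_totals * col_totals / n      # e.g. 958 * 542 / 1171 = 443.4 male believers
residual = observed - expected              # e.g. 417 - 443.4 = -26.4
pearson = ((observed - expected) ** 2 / expected).sum()

# The same statistic and its significance from SciPy (correction=False gives the plain Pearson value)
chi2, p, dof, exp = chi2_contingency(observed, correction=False)
print(expected.round(1), residual.round(1), round(pearson, 2), round(p, 5))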


Figure 6.7 Chi-Square Test Results

The chi-square is a measure of the discrepancy between the observed cell counts and what we
expect if the row and column variables were unrelated. Clearly a chi-square of zero would
indicate perfect agreement (and no relationship between the variables); a small chi-square would
indicate agreement while a large chi-square would signal disagreement between the data and the
null hypothesis.

One final consideration is that since the chi-square statistic is the sum of positive values from
each cell in the table, other things being equal, it will have greater values in tables with more
cells. The chi-square value itself is not adjusted for this, but an accompanying statistic called
degrees of freedom, based on the number of cells (technically the number of rows minus one
multiplied by the number of columns minus one), is taken into account when the statistic is
evaluated.

In order to assess the magnitude of the sample chi-square, the calculated value is compared to the
theoretical chi-square distribution and an easily interpreted probability is returned (column
labeled Asymp. Significance (2-Sided)). The chi-square we have been discussing, Pearson chi-
square, has a significance value of .000 (which means it is less than .0005). This means that if
there were no relation between gender and belief in an afterlife in the population, the probability
of obtaining discrepancies as large (or larger) as we see in our sample (percentage differences of
9% between men and women) would be no greater than .0005 (or less than 5 chances in ten
thousand). In other words, it is quite unlikely that we would obtain this large a sample difference
between men and women if there were no differences in the population. If we consider as
significant those effects that would occur less than 5% of the time by chance alone (as many
researchers do), we would claim this is a statistically significant effect. U.S. adult women are
more likely to believe in an afterlife than men. (We can display the significance value to greater
precision by double clicking on the pivot table to open the Pivot Table editor, then double
clicking on the significance value. Or we can select the cell and format the cell value to display
greater precision.)

The Continuity correction will appear only in two-row by two-column tables when the chi-square
test is requested. In such small tables it was known that the standard chi-square calculation did
not closely approximate the theoretical distribution, which meant that the significance value was
not quite correct. A statistician named Frank Yates published an adjusted chi-square calculation
specifically for two-row by two-column tables, and it typically appears labeled as the “Continuity
correction” or as “Yates’ correction.” It was applied routinely for many years, but more recent
Monte Carlo simulation work indicates that it over-adjusts. As a result it is no longer
automatically used in two-by-two tables, but it is certainly useful to compare the two significance
values to make sure they agree (which they do here).

A more recent chi-square approximation than the standard Pearson chi-square is the likelihood
ratio chi-square test. It tests the same null hypothesis, independence of the row and column
variables, but uses a different chi-square formulation. It has some technical advantages that
largely show up when dealing with higher-order tables (three variables or more). In the vast
majority of cases, both the Pearson and likelihood ratio chi-square tests lead to identical
conclusions. In most introductory statistics courses, and when reporting results of two-variable
crosstab tables, the Pearson chi-square is commonly used. For more complex tables, and more
advanced statistical applications, the likelihood ratio chi-square is almost exclusively applied.

Fisher’s exact test will also appear for crosstabulation tables containing two rows and two
columns (a 2x2 table); exact tests are available for larger tables through the SPSS Exact Tests
add-on module. Fisher’s test calculates the proportion of all table arrangements that have more
extreme percentages than observed in the cells, while keeping the same marginal proportions.
Exact tests have the advantage of not depending on approximations (as do the Pearson and
likelihood ratio chi-square tests). Although the computational effort required to evaluate exact
tests in all but simple situations (a 2x2 table) was large, recent improvements in algorithms have
resulted in exact tests calculated more efficiently. You should consider using exact tests when
your sample size is small, or when some cells in large crosstabulation tables are empty or have
small cell counts. As the sample size increases (for all cells), exact tests and asymptotic (Pearson,
likelihood ratio) results converge.
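
For the 2x2 table discussed here, both the Yates-corrected chi-square and Fisher's exact test can
also be computed with SciPy; this is a sketch for comparison only, using the counts from Figure 6.6:

from scipy.stats import chi2_contingency, fisher_exact

table = [[541, 417],
         [ 88, 125]]

# correction=True applies the Yates continuity correction (only relevant for 2x2 tables)
chi2_yates, p_yates, _, _ = chi2_contingency(table, correction=True)

# Fisher's exact test: two-sided by default
odds_ratio, p_exact = fisher_exact(table)

print(p_yates, p_exact)   # compare with the uncorrected Pearson significance value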

Ecological Significance
While our significance test indicates that U.S. adult men and women differ in their belief in an
afterlife, we now consider the matter of practical importance. Recall that majorities of both men
and women believe and the sample difference between them was about 9%. At this point the
researcher should consider whether a 9% difference is large enough to be of practical importance.
For example, if these were dropout rates for students in two groups (no intervention, a dropout
intervention program), would a 9% difference in dropout rate justify the cost of the program?
These are the more practical and policy decisions that often have to be made during the course of
an applied statistical analysis.

Small Sample Considerations


The crosstabulation table viewed above was based on a large sample. When sample sizes are
small and expected cell counts approach zero, the chi-square approximation may break down with
the result that the probability (significance values) cannot be trusted. Although there are no
definitive answers, some rules of thumb have been developed to warn the analyst away from
potentially misleading results.

Minimum Expected Cell Count: A conservative rule of thumb is that every expected count should be 4 or 5 or greater. Studies have
shown that the minimum expected cell count could be as low as 1 or 2 without adverse results in
some situations. In the presence of many small expected cell counts, you should be concerned
that the chi-square test is no longer behaving as it should (matching its theoretical expectations).
A footnote to the Chi-square table displays the number (and percent) of cells having an expected count
less than 5 and the minimum expected count in the crosstab. In our crosstab above, none of the
cells had an expected count less than 5.


Observed cell count: You should monitor the number and proportion of zero cells (cells with no
observations). Some researchers say that no more than 20% of your observed cell counts should
be less than 5, even when your expected counts are well behaved (see above). Too
many zero cells, or a particular pattern of zero cells, can invalidate the usual interpretation of many
measures of association. Zero cells also contribute to a loss of sensitivity in your analysis: two
subgroups that might be distinguishable given enough data may not be when a small sample
leaves cell counts near zero.

What to do when rules of thumb are violated: In practice, when expected or observed counts
become small, and if it makes conceptual sense, researchers often collapse several rows or
columns together to increase the sample sizes for the now broader groups. Another possibility is
to drop a row or column category from the analysis, essentially giving up information about that
group in order to obtain stability when investigating the others. In recent years efficient
algorithms have been developed to perform “exact tests” which permit low or zero expected cell
counts in crosstabulation tables. SPSS has implemented such algorithms in the SPSS Exact Tests
add-on module.
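
If you want to check these rules of thumb yourself, the footnote SPSS prints can be approximated
with a few lines of Python, assuming SciPy; the sparse 3x3 table below is made up purely to
trigger the warning condition:

import numpy as np
from scipy.stats import chi2_contingency

sparse = np.array([[12, 3, 1],
                   [ 9, 5, 2],
                   [ 4, 2, 0]])

expected = chi2_contingency(sparse, correction=False)[3]   # expected counts under independence

print(round(expected.min(), 2))                            # minimum expected count
print(round((expected < 5).mean() * 100, 1), "% of cells have an expected count below 5")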

6.6 Additional Two-Way Tables


We will examine two additional two-variable crosstabulation tables and apply the chi-square test.
Specifically, we will look at the relationship between education degree (DEGREE) and belief in
life after death, and gender and computer use (COMPUSE). We return to the Crosstabs dialog
box to request the additional tables. We also drop the expected counts and residuals from the
tables.

Click the Dialog Recall tool and then click Crosstabs


Move DEGREE into the Column(s) box
Move COMPUSE into the Row(s) box
Click Cells button
Click to uncheck the Expected cell counts and Unstandardized Residuals (not shown)
Click Continue


Figure 6.8 Multiple Crosstab Tables Request

Multiple tables can be obtained by naming several row or column variables. Each variable in the
Row(s) box is run against each variable in the Column(s) box. Our request will produce four two-
variable tables.

Click OK


Figure 6.9 Belief in Afterlife by Education Degree

Across different education degrees the belief in an afterlife ranges from a high of 88% (Junior
College) to a low of 73% (Less than High School). The Pearson and likelihood ratio chi-squares
indicate a significant result at the .05 level, but not at the .01 level (a sample with differences this
large would occur about 26 times in 1000 (.026) by chance alone if there were no differences in
the population). No continuity correction or Fisher’s Exact Test appears because this is not a two-
row by two-column table.

The Linear by Linear chi-square (not displayed for 2x2 tables) tests the very specific hypothesis
of linear association between the row and column variables. This assumes that both variables are
interval scale measures and you are interested in testing for straight-line association. This is rarely
the case in crosstabulation tables (unless working with rating scales) and the test is not often used.

No cells have an expected count less than 5, and the minimum expected frequency for a cell is
18.74, so the sample-size guideline for the chi-square test is satisfied.


Figure 6.10 Computer Use and Gender

We see there is a 5.6% difference between men and women in the percentage who use a
computer. One could argue that the relationship is significant if we use the .05 level as the cut-off
for the chi-square significance level since the significance for the Pearson Chi-Square statistic is
.049. However, the significance of both the continuity correction chi-square and the Fisher's
Exact Test are slightly above .05. Even though there are no problems of small sample (low
expected counts), the conservative approach would be to conclude from this sample that there is
no difference between male and female computer use in the adult population of the US. Since the
results are somewhat inconclusive, this would also suggest that further study would be warranted
if this were a relationship of great interest.

We will look at the fourth table that we ran, the table of computer use with education degree, in
the appendix of this chapter.

In this set of three tables, two were statistically significant while the third would conservatively
be evaluated as not significant at the .05 level. To repeat, if a significance value were above .05,
say for example .60, it would imply that, under the null hypothesis of independence between the
row and column variables in the population, it is quite likely (60%) that we could obtain the
differences observed in our sample. In other words the sample is consistent with the assumption
of independence of the variables.


6.7 Graphing the Crosstabs Results


Percentages in a crosstabulation table can be displayed using clustered bar charts. You can
request bar charts based on counts directly from the Crosstabs dialog box, but since we wish to
display percentages, we instead use the Chart Builder. A simple rule to apply for bar charts is that
the percentages represented in the bar chart should be consistent with those displayed in the
crosstabulation table. Typically, the variable on which you based the percentages is used as the
cluster variable and the percentages are based on the categories of that variable. However, using
Chart Builder, you have a choice in how you organize the variables on the chart. As an example,
we will graph the percentages for gender and belief in life after death. Figure 6.11 shows the
completed Chart Builder.

Click Graphs…Chart Builder…


Click OK in the Information box (if necessary)
Click Reset
Click Gallery tab (if it's not already selected)
Click Bar in the Choose from: list

Select the second icon for Clustered Bar Chart and drag it to the Chart Preview
canvas
Drag and drop POSTLIFE from the Variables: list to the X-Axis? area in the Chart
Preview canvas
Drag and drop GENDER from the Variables: list to the Cluster: Set Color area in the
Chart Preview canvas

In the Element Properties dialog box (Click Element Properties button if this dialog box
did not automatically open),
Choose Bar1 in the Edit Properties of: list
Select Percentage(?) from the Statistics: dropdown list

We can now select the variable in the chart to use for the base or denominator of the percentage.
Choices are Grand Total to base the percentages on the total cases in the chart, Total for each X-
Axis Category to base the percentages on the x-axis variable and Total for Each Legend Variable
Category (same fill color) to base the percentages on the cluster (legend) variable. The
crosstabulation percentages are based on gender; so we want the bar chart percentages to be based
on the cluster variable.

Click Set Parameters button


Select Total for Each Legend Variable Category (same fill color)


Figure 6.11 Chart Builder and Element Properties for Clustered Bar Chart

Click Continue in the Element Properties: Set Parameters dialog box


Click Apply in the Element Properties dialog box
Click OK in the Chart Builder to build the chart


Figure 6.12 Bar Chart of Belief in Afterlife by Gender

We now have a direct visual comparison between the men and women to supplement the
crosstabulation table and significance tests. This graph might be useful in a final presentation or
report.

Hint
You can create a bar chart directly from the values in the crosstabs pivot table. To do so, double-
click on the crosstabs pivot table to activate the Pivot Table Editor, then select (Ctrl-click) all
table values, for example column percents except for totals, that you wish to plot. Next, right-
click and select Create Graph…Bar from the Context menu. A bar chart will be inserted in the
Viewer window, following the pivot table.
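
The same clustered percentage chart can be sketched outside SPSS, assuming pandas and
matplotlib and the gss DataFrame loaded in the earlier sketch (the percentages are again based on
gender, so each gender's bars sum to 100%):

import pandas as pd
import matplotlib.pyplot as plt

col_pct = pd.crosstab(gss["POSTLIFE"], gss["GENDER"], normalize="columns") * 100

# One cluster per POSTLIFE category, one bar per gender within each cluster
ax = col_pct.plot(kind="bar", rot=0)
ax.set_xlabel("Belief in life after death")
ax.set_ylabel("Percent within gender")
plt.show()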

6.8 Adding Control Variables


To investigate more complex interactions, you may want to explore the relationship of three or
more variables. Within the Crosstabs procedure, we can specify one or more layer variables.
Adding one layer variable produces a three-way table, which is displayed as a series of two-way
tables, one for each category of the third variable. This third variable is sometimes called the
control variable since it determines the composition of each sub-table.

We will illustrate a three-way table using the table of belief in an afterlife by degree as a basis
(see Figure 6.9). In this table, we discovered that those with less than a high school education had
the lowest belief in an afterlife and that the relationship was significant at the .05 level. Suppose
we are interested in seeing how gender might interact with this observed relationship. To explore
this question we specify sex as the control (or layer) variable in the crosstabulation analysis. In
this way we can view a table of belief in afterlife by degree separately for males and females. We
will request a chi-square test of independence for each sub table.
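
As a rough outside-of-SPSS check, the layered analysis can be sketched in Python, assuming
pandas and SciPy and the gss DataFrame from before: the data are split by GENDER and a
separate crosstab and chi-square test is computed for each subgroup.

import pandas as pd
from scipy.stats import chi2_contingency

for sex, subset in gss.groupby("GENDER"):
    table = pd.crosstab(subset["POSTLIFE"], subset["DEGREE"])
    chi2, p, dof, _ = chi2_contingency(table)
    print(sex, "chi-square =", round(chi2, 2), "df =", dof, "p =", round(p, 3))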

Click on the Dialog Recall tool and then click Crosstabs


Click the Reset pushbutton
Move POSTLIFE into the Row(s): list box
Move DEGREE into the Column(s): list box
Move GENDER into the Layer list box
Click on the Cells pushbutton and click the Column check box under Percentages
Click Continue
Click Statistics and click the Chi-square check box
Click Continue

Figure 6.13 Crosstabs Dialog Box for Three-Way Table

Click OK

As before, POSTLIFE and DEGREE are, respectively, the row and column variables; but
GENDER is added as a layer (or control) variable. Note that GENDER is in the first layer.
Additional control variables can be added as higher-level layers by clicking the Next button.
Although not shown, we asked for column percentages in the Cells dialog box and the chi-square
test from the Statistics dialog box.


Figure 6.14 Belief in Afterlife, Degree, and Gender


BELIEF IN LIFE AFTER DEATH * RS HIGHEST DEGREE * GENDER OF RESPONDENT Crosstabulation

                                                                        RS HIGHEST DEGREE
GENDER OF                                                   LT HIGH      HIGH    JUNIOR
RESPONDENT                                                   SCHOOL    SCHOOL   COLLEGE  BACHELOR  GRADUATE    Total
Female  BELIEF IN LIFE   YES  Count                              61       293        54        90        43      541
        AFTER DEATH           % within RS HIGHEST DEGREE      77.2%     85.4%     94.7%     90.0%     86.0%    86.0%
                         NO   Count                              18        50         3        10         7       88
                              % within RS HIGHEST DEGREE      22.8%     14.6%      5.3%     10.0%     14.0%    14.0%
        Total                 Count                              79       343        57       100        50      629
                              % within RS HIGHEST DEGREE     100.0%    100.0%    100.0%    100.0%    100.0%   100.0%
Male    BELIEF IN LIFE   YES  Count                              44       210        37        74        52      417
        AFTER DEATH           % within RS HIGHEST DEGREE      67.7%     78.9%     80.4%     75.5%     77.6%    76.9%
                         NO   Count                              21        56         9        24        15      125
                              % within RS HIGHEST DEGREE      32.3%     21.1%     19.6%     24.5%     22.4%    23.1%
        Total                 Count                              65       266        46        98        67      542
                              % within RS HIGHEST DEGREE     100.0%    100.0%    100.0%    100.0%    100.0%   100.0%

The crosstabulation has two sub-tables, one for females and one for males. The percentages
saying “yes” vary more across degree groups for females than for males, and the patterns are slightly different.
Among the females, there is a drop-off of 4% from the Bachelor's to the Graduate degree group, while
for the males, the percentage of believers actually increases slightly between these same two
degree groups.

Figure 6.15 Chi-Square Statistics for Three-Way Table

The chi-square results are intriguing because they indicate that the relationship between afterlife
and degree is significant for females (p=.039) at least at the .05 level but not for males (p=.382).
This difference suggests a possible interaction effect: the effect of one variable (DEGREE) on
another (POSTLIFE) depends on the value of a third variable (GENDER). However, given the
large sample and the only marginally significant result for females, we would be cautious in our interpretation.


As suggested in the next section, we could use more advanced techniques to further test for the
significance of the interaction. These new results don’t mean that the two-way table is wrong or
inaccurate. That table does present the relationship between highest educational degree and belief
in an afterlife.

The next step for the analyst would be to determine whether this newfound difference is of
substantive interest, and perhaps to look at other variables that might provide more information
about this relationship.

6.9 Extensions: Beyond Crosstabs


Decision Tree analysis is often used by data analysts who need to predict, as accurately as
possible, into which outcome group an individual will fall, based on potentially many nominal or
ordinal background variables. For example, an insurance company is interested in the
combination of demographics that best predict whether a client is likely to make a claim. Or a
direct mail analyst is interested in the combinations of background characteristics that yield the
highest return rates. Here the emphasis is less on testing a hypothesis and more on a heuristic
method of finding the optimal set of characteristics for prediction purposes. CHAID (chi-square
automatic interaction detection), a commonly used type of decision-tree technique, along with
other decision-tree methods are available in the SPSS Decision Trees add-on module.

A technique called loglinear modeling can also be used to analyze multi-way tables. This
method requires statistical sophistication and is beyond the domain of our course. SPSS Statistics
has several procedures (Genlog, Loglinear and Hiloglinear) to perform such analyses. They
provide a way of determining which variables relate to which others in the context of a multi-way
crosstab (also called contingency) table. These procedures could be used to explicitly test for the
three-way interaction suggested above. For an introduction to this methodology see Fienberg
(1977). Academic researchers often use such models to test hypotheses based on survey data.

Occasionally there is interest in testing whether a frequency table based on sample data is
consistent with a distribution specified by the analyst. This test (one sample chi-square) is
available within the SPSS Statistics Base nonparametric procedure (click
Analyze…Nonparametric Tests…Chi-Square).
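
A minimal sketch of this one-sample test in Python, assuming SciPy; the observed counts and the
hypothesized 25/50/25 split are invented purely for illustration:

from scipy.stats import chisquare

observed = [30, 52, 18]   # counts observed in three response categories (hypothetical)
expected = [25, 50, 25]   # counts implied by the distribution specified by the analyst

chi2, p = chisquare(f_obs=observed, f_exp=expected)
print(round(chi2, 2), round(p, 3))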

6.10 Appendix: Association Measures


We have discussed the chi-square test of independence and how to use it to determine whether
there is a statistically significant relationship between the row and column variables in the
population. And, we viewed the percentages in the table to describe the relation and determine the
magnitude of the differences. It would be useful to have a single number to quantify the strength
of the association between the row and column variables. Measures of Association have been
developed to allow you to compare different tables and speak of relative strength of association or
effect. They are typically normed to range between 0 (no association) and 1 (perfect association)
for variables on a nominal scale. Those assuming ordinal measurement are scaled from –1 to +1,
the extremes representing perfect negative and positive association, respectively; here zero would
indicate no ordinal association.

One reason for the large number of measures developed is that there are many ways two variables
can be related in a large crosstabulation table. In addition, depending on the level of measurement
(for example, nominal versus ordinal), different aspects of association might be relevant.
Association measures tend to be used in academic and medical research studies, less so in applied
work such as market research. In market research you typically display the crosstabulation table
for examination, rather than focus on a single summary.

We will review some general characteristics of the association measures, but not consider them in
great detail. For more involved discussion of association measures for nominal variables see
Gibbons (1993), while a more complete but technical reference is Bishop, Fienberg and Holland
(1975).

First, some general points:

• Some measures of association are based on the chi-square values; others are based on
probabilistic considerations. The latter class is usually preferred since chi-square based
values have no direct, intuitive interpretation.

• Some measures of association assume a certain level of measurement (for example,
dichotomous, nominal, or ordinal). Consider this when choosing a particular measure.

• Some measures are symmetric, that is, do not vary if the row and column variables are
interchanged. Others are asymmetric and must be interpreted in light of a causal or
predictive ordering that you conceive between your variables.

• Measures of association for crosstabulation tables are bivariate (two-variable). In general,
multivariate (three or more variables) extensions do not exist. To explore association in higher-order
tables you must turn to a method called loglinear modeling (implemented in Genlog and
Hiloglinear procedures of the SPSS Advanced Statistics add-on module: see Loglinear
choice under the Analyze menu). Such analyses were briefly mentioned in this chapter
(section 6.9), but are beyond the scope of this course.

Association Measures Available within Crosstabs


Several common measures are available in the Crosstabs procedure. They can be classified in the
following groups.

Chi-Square Based: Phi, V, and the Contingency Coefficient are measures of association based on
the chi-square value. Their early advantage was convenience: they could be readily derived from
the already calculated chi-square. Values range from 0 to 1. Their disadvantage is that there is not
a simple, intuitive interpretation of the numeric values.
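
Because these measures are simple functions of the chi-square value, they are easy to compute by
hand. Here is a sketch in Python, assuming NumPy and SciPy and using the afterlife-by-gender
counts from Figure 6.6 (for a 2x2 table, phi and Cramér's V coincide):

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[541, 417],
                     [ 88, 125]])

chi2 = chi2_contingency(observed, correction=False)[0]
n = observed.sum()
r, c = observed.shape

phi = np.sqrt(chi2 / n)
cramers_v = np.sqrt(chi2 / (n * (min(r, c) - 1)))
print(round(phi, 3), round(cramers_v, 3))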

Nominal Probabilistic Measures: Lambda and Goodman & Kruskal’s Tau (both produced by
selecting Lambda in the dialog) are probabilistic or PRE (proportional reduction in error)
measures suitable for nominal scale data. They are measures attempting to summarize the extent
to which the category value of one variable can be predicted from the value of the other. These
measures are asymmetric and are reported with each variable predicted from the other.

Ordinal Probabilistic Measures: Kendall’s Tau-b, Tau-c, Gamma and Somers’ d are all
probabilistic measures appropriate for ordinal tables. Values range from -1 to +1, and reflect the
extent to which higher categories (based on the data codes used) of one variable are associated
with higher categories of the second variable. Some of these are asymmetric (e.g., Somers’ d).


Correlations produces Pearson’s r, the standard correlation coefficient, which assumes both
variables are interval scaled. If this association were the main interest in the analysis, such
correlations can be obtained directly from the Correlation procedure.

Eta is asymmetric and assumes the dependent measure is interval scale while the independent
variable is nominal. It measures the reduction in variation of the dependent measure when the
value of the independent variable is known. It can also be produced by the Means
(Analyze…Compare Means…Means) and GLM (Analyze…General Linear Model…Univariate)
procedures.

The McNemar statistic is used to test for equality of correlated proportions, as opposed to general
independence of the row and column variables (as does the chi-square test). For example, if we
ask people, before and after viewing a political commercial, whether they would vote for
candidate A, the McNemar test would test whether the proportion choosing candidate A changed.

The Cochran’s and Mantel-Haenszel statistics test whether a dichotomous response variable is
conditionally independent of a dichotomous explanatory variable when adjusting for the control
variable. For example, is there an association between instruction method (treatment vs. control)
and exam performance (pass vs. fail), controlling for school area (urban vs. rural)?

An association measure often used when coding open-ended responses to survey questions is
Kappa, which measures the agreement between two raters of the same information. The relative
risk association measure is often used in health research; it assesses the odds of the occurrence of
some outcome in the presence of an event (the use of a drug, a medical condition). It is not
bounded as the other association measures are.

These association measures are found in the Crosstab Statistics dialog box. We will request
several measures for the computer use by education degree table. Here both nominal and ordinal
measures of association might be desirable, as we will explain.

Click on the Dialog Recall tool, then click Crosstabs


Click Reset
Move COMPUSE to the Row(s) box
Move DEGREE to the Column(s) box
Click Cells
Click to check on Column in the Percentage area
Click Continue
Click the Statistics pushbutton
Click to check Chi-square, Lambda, Gamma, and Kendall’s tau-c


Figure 6.16 Association Measures in Crosstabs

The association measures are grouped by level of measurement assumed for the variables. We
checked Lambda (which will also produce Goodman & Kruskal’s Tau) along with Kendall’s tau-c
and the Gamma statistic.

Click Continue
Click OK


Figure 6.17 Computer Use and Education Degree

The significance level of the chi-square test is well below .01, so there is a statistically significant
relationship, and respondents with higher degrees are more likely to use a computer. The majority of
the people in all of the degree groups, except the less than high school group, use the computer
and the percentage difference between that group and the Graduate degree group is well over
50%.

We move next to the association measures.


Figure 6.18 Association Measures for Computer Use and Education Degree

The association measures are in two tables, one for the nominal and the other for the ordinal
statistics. The column labeled Value contains the actual association measures. In the
first table (Directional Measures), we focus on the values for computer use as the dependent
variable, since that is our assumption. Keeping in mind that zero would be the weakest
association, the values of .2 for both measures are well above zero and statistically significant.
We have a situation in which there is a statistically significant result, but the level of association
is lower than we might expect given the differences in the percentages among the degree groups.
This is often the case with nominal measures of association.

In the second table, note that the ordinal measures (gamma and Tau-c) are larger in magnitude.
For an ordinal measure to be non-zero, the proportion of respondents using the computer needs to
increase (or decrease) as education degree increases. This is indeed the case, and gamma indicates
a strong association (–.739). The two measures have a negative sign because as education
increases, computer use percentage also increases, but a “yes” for COMPUSE is coded with a
lower value (1) than “no” (2). Thus higher data values on degree are associated with lower data
values on COMPUSE.

We have used COMPUSE, which would seem to be coded on a nominal scale, with ordinal
measures of association. Dichotomous variables, for purposes of crosstabulation (and some other
techniques), can be considered as measured on an ordinal scale.


The other columns in the two tables are somewhat technical and we will not pursue them here
(see the references cited earlier in this section). However they are used when you wish to perform
statistical significance tests on the association measures themselves to determine whether an
association measure differs from zero in the population.


Summary Exercises
We want to study the relationship between two demographic variables, race and gender, and three
attitude and behavior variables: HLTH1, whether you were ill enough to go to the doctor last year,
NATAID, attitude toward spending on foreign aid, and NEWS, how frequently you read the
newspaper. Before running the analysis, think about the variables involved in these tables. What
relations would you expect to find here and why?

1. Run crosstabulations of race (RACE) against the measures: HLTH1, whether you were ill
enough to go to the doctor last year, NATAID, attitude toward spending on foreign aid,
and NEWS, how often newspaper is read. Request the appropriate percentage within
racial categories and run the chi-square test of independence.
2. Then repeat the analysis after substituting GENDER in place of RACE. How would you
summarize each finding in a paragraph?
3. Run a three-way table of HLTH1 by GENDER by RACE and NATAID by RACE by
GENDER. Request the chi-square. Do these findings affect your summaries from above?
If so, how?
4. Create a clustered bar chart displaying the results of one of your tables.

For those with extra time:

1. Request appropriate measures of association for the HLTH1 by GENDER table and the
NEWS by RACE by GENDER table. Are the results consistent with your interpretation up
to this point? Based on either the association measures, or percentage differences, would
you say the results have practical (or ecological) significance?
2. If you created a collapsed version of the WEBYR variable in the Chapter 3 exercise, run a
crosstab with NEWS30, whether you accessed a news website in the last 30 days. Would
you expect to see a relationship?


Chapter 7: Mean Differences Between Groups: T Test
Topics
• Logic of Testing for Mean Differences
• Exploring the Group Differences
• Testing the Differences: Independent Samples T Test
• Interpreting the T Test Results
• Graphing the Mean Differences
• Appendix: Paired T Test
• Appendix: Normal Probability Plots

Data
In this chapter, we continue to use the GSS2004Intro.sav file.

Scenario
Using the GSS 2004 data, we want to investigate whether men differ from women on two types
of behavior: the hours spent watching TV every day and the number of hours each week using the
internet. Since both measures are scale variables, we will summarize the groups using means. Our
goal is to draw conclusions about population differences based on our sample.

7.1 Introduction
In Chapter 6 we performed statistical tests in order to draw conclusions about population group
differences on categorical variables using crosstabulation tables and applying the chi-square test
of independence. When our purpose is to examine group differences on interval scale outcome
measures, we turn to the mean as the summary statistic since it provides a single measure of
central tendency. Also, from a statistical perspective, the properties of sample means are well
known, which facilitates testing. For example, we will compare men and women in their mean
number of hours using the Internet each week.

In this chapter, we outline the logic involved when testing for mean differences between groups,
state the assumptions, and then perform an analysis comparing two groups. Appendix A will
generalize the method to analysis involving more than two groups.

7.2 Logic of Testing for Mean Differences


The goal of statistical tests on means is to draw conclusions about population differences based
on the observed sample means. To provide a context for this discussion, we view a series of
boxplots, such as those discussed in Chapter 4, showing three groups (A, B and C) sampled from
different populations with different distributions on a scale variable. The first boxplot in Figure
7.1 displays the case when the three groups are distinctly different in mean level on the scale
variable.


Figure 7.1 Samples from Three Very Different Populations

We see that the groups are well separated: there is no overlap between any sample group and
either of the remaining two. In this case, a statistical test is almost superfluous since the groups
are so disparate, but if performed we would find highly significant differences.

Next we turn to a case in which the groups are samples from the same population and show no
differences.


Figure 7.2 Three Samples from the Same Population

Here there is considerable overlap of the three samples; the medians (and means) and other
summaries match almost identically across the groups. If there are any true differences between
the population groups they are likely to be extremely small and not have any practical
importance.

When there are modest population differences, we might obtain the result below.

Figure 7.3 Samples from Three Modestly Different Populations


There is some overlap among the three groups, but the sample means (medians in the boxplot) are
different. In this instance a statistical test would be valuable to assess whether the sample mean
differences are large enough to justify the conclusion that the population means differ. This last
plot represents the typical situation facing a data analyst.

As we did when we performed the chi-square test, we formulate a null hypothesis and use the
data to evaluate it.

Ho (Null Hypothesis) assumes that the population means are identical.

We then determine if the differences in sample means are consistent with this assumption. If the
probability of obtaining sample means as far (or further) apart as we find in our sample is very
small (less than 5 chances in 100 or .05), assuming no population differences, we reject our null
hypothesis and conclude the populations are different.

We implement this logic by comparing the variation among sample means relative to the
variation of individuals within each sample group. The core idea is that if there were no
differences between the population means, then the only source for differences in the sample
means would be the variation among individual observations (since the samples contain different
observations), which we assume is random.

We then compute the ratio of the variance among the sample means (the between-group
variance) to the variance among individual observations within each group (the within-group
variance).

This ratio will be close to 1 if there are no population differences. If there are true population
differences, we would expect the ratio of variances to be greater than 1.

This ratio is referred to as the F value, although in the two-group comparison it is typically
reported as a t value (the square root of F).

Under the assumptions made in analysis of variance (discussed below), this variation ratio
follows a known statistical distribution, the F distribution. Thus the result of performing the test
will be a probability indicating how likely we are to obtain sample means as far apart (or further)
as we observe in our sample if the null hypothesis were true. If this probability is very small, we
reject the null hypothesis and conclude there are true population differences.

This concept of taking a ratio of between-group variation of the means to within-group variation
of the individual observations is fundamental to the statistical method called analysis of variance.
It is implicit in the simple two-group case (t test), and appears explicitly in more complex
analyses (general ANOVA).
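
The equivalence of the two-group t test and the two-group analysis of variance is easy to see with
a small simulation. This is a sketch in Python, assuming NumPy and SciPy; the two samples are
artificial and only stand in for groups such as men and women:

import numpy as np
from scipy.stats import ttest_ind, f_oneway

rng = np.random.default_rng(0)
men = rng.normal(loc=10, scale=8, size=200)    # simulated scale variable for one group
women = rng.normal(loc=8, scale=8, size=220)   # simulated scale variable for the other group

t, p_t = ttest_ind(men, women)     # pooled-variance two-group t test
F, p_F = f_oneway(men, women)      # one-way ANOVA on the same two groups

print(round(t, 3), round(F, 3), round(t ** 2, 3))   # t squared equals F; the p values match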

Assumptions
When statistical tests of mean differences are applied, i.e. t test for two group differences and F
test for the more general case, at least two assumptions are made. First, that the distribution of the
dependent measure within each population subgroup follows the normal distribution (normality).
Second, that its variation is the same within each population subgroup (homogeneity of variance).
When these assumptions are met, the t and F tests can be used to draw inferences about
population means. We will discuss each of these assumptions as it applies in practice and see
whether they hold in our data.


Normality Assumption
Normality of the dependent variable within each group is formally required when statistical tests
(t, F) involving mean differences are performed. In reality, these tests are not much influenced by
moderate departures from normality. This robustness of the significance tests holds especially
when the sample sizes are moderate to large (over 25 cases) and the dependent measure has the
same distribution (for example, skewed to the right) within each comparison group. Thus while
normality is assumed when performing the significance tests, the results are not much affected by
moderate departures from normality (for discussion and references, see Kirk (1964) and for an
opposing view see Wilcox (2004)). In practice, researchers often examine histograms and box
plots to view each group in order to make this determination. If a more formal approach is
preferred, the Explore procedure can produce more technical plots (normal probability plots) and
statistical tests of normality (see the second appendix to this chapter). In situations where the
sample sizes are small or there are gross deviations from normality, researchers often shift to
nonparametric tests. An example is given in Appendix A.

Homogeneity of Variance
The second assumption, homogeneity of variance, indicates that the variance of the dependent
variable is the same for each population subgroup. Under the null hypothesis we assume the
variation in sample means is due to the variation of individual scores, and if different groups
show disparate individual variation, it is difficult to interpret the overall ratio of between-group to
pooled within-group variation. This directly affects significance tests. Based on simulation work,
it is known that significance tests of mean differences are not much influenced by moderate lack
of homogeneity of variance if the sample sizes of the groups are about the same. If the sample
sizes are quite different, then lack of homogeneity (heterogeneity) is a problem in that the
significance test probabilities are not correct. SPSS Statistics performs a formal test, the Levene
test of homogeneity, to test the homogeneity assumption. We will discuss this with the example
results.

When comparing means from two groups (t test) and one-factor ANOVA (see Appendix A) there
are corrections for lack of homogeneity. In the more general ANOVA analysis a simple
correction does not exist. It is beyond the scope of this course, but it should be mentioned that if
there is a relationship or pattern between the group means and standard deviations (for example,
if groups with higher mean levels also have larger standard deviations), there are sometimes data
transformations that when applied to the dependent variable will result in homogeneity of
variance. Such transformations can entail additional complications, but provide a method of
meeting the homogeneity of variance requirement. The Explore procedure’s Spread & Level plot
can provide information as to whether this approach is appropriate and can suggest the optimal
data transformation to apply to the dependent measure.
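
For readers who work from a syntax window, a minimal sketch of requesting a spread & level plot from
the Explore procedure is shown below; the variable names WWWHR and GENDER are those used in the
example later in this chapter.

* Request a spread-and-level plot from Explore (the EXAMINE command).
EXAMINE VARIABLES=wwwhr BY gender
  /PLOT=SPREADLEVEL
  /STATISTICS=DESCRIPTIVES.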

To oversimplify, when dealing with moderate or large samples and testing for mean differences,
normality is not always important. Gross departures from homogeneity of variance do affect
significance tests when the sample sizes are disparate.

Sample Size
Generally speaking, tests involving comparisons of sample means do not require any specific
minimal sample size. Formally, there must be at least one observation in each group of interest
and at least one group with two or more observations in order to obtain an estimate of the within-
group variation. While these requirements are quite modest, the more important point regarding
sample size is that of statistical power: your ability to detect differences that truly exist in the

population. As your sample size increases, the precision with which means and standard
deviations are estimated increases, as does the probability of finding true population differences
(power). Thus larger samples are desirable from the perspectives of statistical power and
robustness (recall our discussion of normality), but are not formally required.

These analyses do not require that the group sizes be equal. However, analyses involving tests of
mean differences are more resistant to violation of the homogeneity of variance assumption when
the sample sizes are equal (or near equal). In the more general ANOVA analysis, equal (or
proportional) group sample sizes bring assurance that the various factors under investigation can
be looked at independently. Finally, equal sample size conveys greater statistical power when
looking for any differences among groups. In summary, equal group sample sizes are not
required, but do carry advantages. This is not to suggest that you should drop observations from
the analysis in order to obtain equal numbers in each group, since this would throw away
information. Rather, think of equal group sample size as an advantageous situation you should
strive for when possible. In experiments equal sample size is usually part of the design, while in
survey work it is rarely seen.

7.3 Exploring the Group Differences


In this analysis we wish to determine if there are population gender differences in hours using the
Internet each week and hours watching TV each day. Before performing these tests, we will use
the exploratory data analysis procedures we discussed in Chapter 4 to look at group differences
on these variables and check for violations of the assumptions above.

Click Analyze…Descriptive Statistics…Explore


Move WWWHR and TVHOURS into the Dependent List: box
Move GENDER into the Factor List: box

Figure 7.4 Explore Dialog Box Comparing Groups

Explore will perform a separate analysis on each dependent variable for each category of the
Factor variable. The Factor variable, GENDER, will produce statistical summaries separately for
males and females. We will use the Options button to request that missing values should be

treated separately for each dependent variable (Pairwise option). We mentioned in Chapter 4 that
Explore’s default is to exclude a case from analysis if it contains a missing value for any of the
dependent variables.

Click Options button


Click Exclude cases pairwise
Click Continue

Finally, we will request histograms rather than stem & leaf plots.

Click Plots button


Click off (uncheck) Stem & leaf in the Descriptives area
Click on (check) Histograms
Click Continue

Figure 7.5 Explore Options and Plots Dialog Boxes

Click OK
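
For reference, a syntax sketch that corresponds roughly to the dialog box choices above (pairwise
handling of missing values, and histograms rather than stem & leaf plots) is:

* Explore Internet and TV hours separately for females and males.
EXAMINE VARIABLES=wwwhr tvhours BY gender
  /PLOT=BOXPLOT HISTOGRAM
  /STATISTICS=DESCRIPTIVES
  /MISSING=PAIRWISE.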

Figure 7.6 Summaries of Hours of Internet Use per Week

Descriptives

WWW HOURS PER WEEK (columns are GENDER OF RESPONDENT: Female, Male)

                                           Statistic              Std. Error
                                       Female       Male       Female     Male
Mean                                     6.30       8.79         .329     .392
95% Confidence       Lower Bound         5.65       8.02
Interval for Mean    Upper Bound         6.94       9.56
5% Trimmed Mean                          4.83       7.24
Median                                   3.00       5.00
Variance                               98.341    121.877
Std. Deviation                          9.917     11.040
Minimum                                     0          0
Maximum                                   130        100
Range                                     130        100
Interquartile Range                         6          8
Skewness                                4.602      2.757         .081     .087
Kurtosis                               35.922     10.990         .162     .173

Note: Editing Descriptives Table


The original output for Figure 7.6 was edited using the Pivot Table editor to facilitate the male to
female comparisons (steps outlined below).

Double-click on the Descriptives pivot table


Click Pivot…Pivoting Trays to activate the Pivoting Trays window (if necessary)
Drag Gender of Respondent from the Row dimension tray to the Column dimension
tray below Stat Type already there
Drag Dependent Variables from the Row dimension tray to the Layer dimension tray
Close the Pivot Table Editor

In Figure 7.6, we can see that the means (male 8.79; female 6.30) are higher than both the median
and trimmed mean for each gender, which suggests some skewness to the data. This is confirmed
by the positive skewness measures and the histograms in Figure 7.7. The mean for males is about
2.5 hours greater than that for females. Also the sample standard deviation for males is larger
(11.04 to 9.92), and the IQR for males is also greater (8 to 6). Thus, it is unclear whether the
homogeneity of variance assumption has been met. Both genders have some very high maximum
values.

Figure 7.7 Histograms for Females and Males of Hours of Internet Use

Viewing the histograms with normality in mind, it is obvious that both distributions are positively
skewed and not normal. However, recalling our earlier discussion of the assumptions for statistical
testing, since both gender groups show a similar skewed pattern and the sample sizes are fairly large
(793 for males and 908 for females), we are not concerned.
These numbers are found in the Case Processing Summary table (not shown).

The box plot provides some visual confirmation of the mean (actually median) differences
between the two groups. Note that the difference would be made more apparent by editing the
range of the vertical scale of the chart. The side-by-side comparison shows that the groups have a
similar pattern of positive skewness. The height of the box (the IQR) is smaller for women
confirming the smaller dispersion for females. Outliers are identified and might be checked
against the original data for errors; we considered this issue when we performed exploratory data
analysis on Internet use for the entire sample. Based on these plots and summaries we might
expect to find a significant mean difference between men and women. Also, since the two groups
have a similar distribution of data values (positively skewed) with large samples, we feel
comfortable about the normality assumption to be made when testing for mean differences.

Figure 7.8 Boxplot of Internet Use Per Week for Males and Females

Next we turn to the summaries for the hours watching television each day.

Figure 7.9 Summaries of Hours Per Day Watching TV

Descriptives

HOURS PER DAY WATCHING TV (columns are GENDER OF RESPONDENT: Female, Male)

                                           Statistic              Std. Error
                                       Female       Male       Female     Male
Mean                                     2.99       2.72         .132     .111
95% Confidence       Lower Bound         2.73       2.51
Interval for Mean    Upper Bound         3.25       2.94
5% Trimmed Mean                          2.63       2.47
Median                                   2.00       2.00
Variance                                8.291      5.184
Std. Deviation                          2.879      2.277
Minimum                                     0          0
Maximum                                    20         14
Range                                      20         14
Interquartile Range                         3          2
Skewness                                2.696      2.112         .112     .119
Kurtosis                               10.183      5.874         .223     .238

The mean is 2.99 for females and 2.72 for males, so each gender watched about 3 hours per day.
The means are quite similar, suggesting that there may be no difference in the population between
the two genders. The means are above their respective trimmed means and the medians; the
skewness measures are several standard errors from zero, so both groups have positive skewness.
And the kurtosis values are far from zero, especially for females. All these signs indicate that the
distributions are not normal. However, the sample sizes are large.

Notice also that while the standard deviations for the two groups (2.88, 2.28) are not far apart, they
do differ, as do the IQRs. These differences hint at a possible lack of homogeneity of variance
between the groups.

The histograms in Figure 7.10 show somewhat similar distributions and have outliers at larger
positive values for both genders.

Figure 7.10 Histograms for Females and Males of Hours Watching TV

To compare the groups directly, we move to the boxplot.

Figure 7.11 Boxplot of Hours Watched TV Per Day For Males and Females

The median for males and females is the same. However, the IQR is higher (3) for females than
males (2) and the outliers are more widely spread for females. Both groups show positive
skewness. Since both groups follow a similar skewed distribution and the samples are large, the
normality assumption will not be a problem. There is some evidence that the homogeneity of variance
assumption is not met, which we will check formally in the next step of our analysis.

Having explored the data focusing on group comparisons, we now perform tests for mean
differences between the populations.

Note: Producing Stacked and Paneled Histograms with Chart Builder


The Chart Builder provides other types of histograms that you can use to compare distributions of
a scale variable on multiple groups. Both a stacked histogram and a population pyramid will show
the distribution of multiple groups in one histogram. These types of charts are available in the
Gallery of histogram icons. Or, you can produce paneled histograms, separate histograms for each
group shown on the same scale range, arranged in either columns or rows. The panel charts are
available on the Groups/Point ID tab of the Chart Builder.

Figure 7.12 shows a stacked histogram by gender of hours of Internet use.

Figure 7.12 Stacked Histogram of Hours of Internet Use by Gender (Chart Builder)

7.4 Testing the Differences: Independent Samples T Test


The t test is used to test the differences in means between two populations. If more than two
populations are involved, a generalization of this method, called analysis of variance (ANOVA),
can be used. The T Test procedure is found on the Compare Means menu.

Click Analyze…Compare Means

Figure 7.13 Compare Means Menu

There are three available t tests: one-sample t test, independent-samples t test, and paired-samples
t test. The One-Sample T Test compares a value you supply (it can be the known value for some
population, or a target value) to the sample mean in order to determine whether the population
represented by your sample differs from the specified value. The other t tests involve comparison
of two sample means. The Independent-Samples T Test applies when there are two separate
populations to compare (for example, males and females). An observation can only fall into one
of the two groups. The Paired-Samples T Test is appropriate when comparing two measures
(variables) for a single population. For example, a paired t test would be used to compare pre-
treatment to post-treatment scores in a medical study. If the same observation contributes to both
means, the paired t test takes advantage of this fact and can provide a more powerful analysis. An
example applying the paired t test to compare the formal education of the respondent’s mother
and father appears in the appendix at the end of this chapter. In our present example, an
observation (individual interviewed) can fall into only one of the two groups (male or female), so
we choose the independent-samples t test.

Click Analyze…Compare Means…Independent-Samples T Test…


Move WWWHR and TVHOURS into the Test Variable(s): box
Move GENDER into the Grouping Variable: box

We first indicate the dependent measures or “Test variable.” We specify both WWWHR and
TVHOURS, which will yield two separate analyses. The “Grouping” or independent variable is
GENDER.

Figure 7.14 Independent-Samples T Test Dialog Box

Notice the question marks following GENDER. The T Test dialog requires that you indicate
which groups are to be compared, which is usually done by providing the data values for the two
groups. Since GENDER is a string variable with females coded "F" and males coded "M", we
must supply these codes using the Define Groups dialog box. Be sure to type upper case for both
as shown in Figure 7.15.

Click the Define Groups pushbutton


Enter F as the first and M as the second group code

Figure 7.15 T Test Define Groups Dialog Box

We have identified the values defining the two groups to be compared. If the independent
variable is a numeric variable, you can specify a single cut point value to define the two groups.
Those cases less than or equal to the cut point go into the first group, and those greater than the
cut point fall into the second group. This option is rarely used though.

Click Continue

Figure 7.16 Completed T Test Dialog Box

Our specifications are complete. By default, the procedure will use all valid responses for each
dependent variable (pairwise deletion) in the analysis.

Click OK
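
The corresponding command syntax is sketched below; the group codes appear in quotes because GENDER is
a string variable, and /MISSING=ANALYSIS requests the analysis-by-analysis (pairwise) treatment of
missing values described above.

* Independent-samples t test comparing females and males.
T-TEST GROUPS=gender('F' 'M')
  /VARIABLES=wwwhr tvhours
  /CRITERIA=CI(.95)
  /MISSING=ANALYSIS.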

7.5 Interpreting the T Test Results


We will first look at the output for hours using the Internet each week.

NOTE: In the original output, the test results for both dependent variables were displayed in a
single pivot table, but for discussion purposes we present them separately and edit them for better
display.

In Figure 7.17 we can see some of the same summaries as the Explore procedure displayed:
sample sizes, means, standard deviations, and standard errors for the two groups. The mean for
males is about 2.5 hours more per week than for females. The actual sample mean difference is
displayed in the Independent Samples Test table in Figure 7.18.

Figure 7.17 Summaries for Hours of Internet Use

Group Statistics

                     GENDER OF
                     RESPONDENT        N      Mean    Std. Deviation    Std. Error Mean
WWW HOURS PER WEEK   Female          908      6.30             9.917               .329
                     Male            793      8.79            11.040               .392

Figure 7.18 T Test Output for Hours of Internet Use


Independent Samples Test: WWW HOURS PER WEEK

                                Levene's Test for
                                Equality of Variances     t-test for Equality of Means
                                                                                                              95% Confidence Interval
                                                                      Sig.        Mean        Std. Error     of the Difference
                                    F        Sig.      t        df   (2-tailed)   Difference  Difference     Lower        Upper
Equal variances assumed          15.182      .000   -4.902     1699     .000        -2.491       .508        -3.487       -1.494
Equal variances not assumed                         -4.866  1605.39     .000        -2.491       .512        -3.495       -1.487

Homogeneity of Variance Test: Levene's Test


Next, we consider the Levene test of homogeneity of variance for the two groups. With this test
we can assess whether the data meet the homogeneity assumption before examining the t test
results. There are several tests of homogeneity (Bartlett-Box, Cochran’s C, Levene). Levene’s test
has the advantage of being sensitive to lack of homogeneity, but relatively insensitive to
nonnormality. Bartlett-Box and Cochran’s C are sensitive to both lack of homogeneity and
nonnormality. Since nonnormality (recall our discussion in the assumptions section) is not
necessarily an important problem for t tests and analysis of variance, the Levene test is directed
toward the more critical issue of homogeneity.

Homogeneity tests evaluate the null hypothesis that the dependent variable’s standard deviation
is the same in the two populations. Since homogeneity of variance is assumed when performing
the t test, the analyst hopes to find this test to be nonsignificant. The probability (labeled Sig. in
the table) from Levene’s test is the probability of obtaining sample standard deviations (technically,
variances are tested) as far apart (11.0 versus 9.9) as those we observe in our data, if the standard
deviations were identical in the two populations. The probability is quite low
(Sig. value is .000). This is below the common .05 cut-off (some researchers suggest using a cut-
off of .01 for larger samples), so we conclude the standard deviations are not identical in the two
population groups, and the homogeneity requirement is not met. Since this is an important
assumption, we will need to use an adjusted t test.

If this procedure seems too complicated, some authors suggest the following simplified rules: (1)
If the sample sizes are about the same, don’t worry about the homogeneity of variance
assumption; (2) If the sample sizes are quite different, then take the ratio of the standard
deviations in the two groups and round it to the nearest whole number. If this rounded number is
1, don’t worry about lack of homogeneity of variance. Using this simplified test, the ratio of the
group standard deviations (11.04 / 9.92, or about 1.1) rounds to 1. This illustrates that, especially
with large samples, Levene's test can flag departures from homogeneity that are too small to matter in
practice.

T Test
Finally two versions of the t test appear in Figure 7.18. The row labeled “Equal variances
assumed” contains results of the standard t test, which assumes homogeneity of variance. The
second row labeled “Equal variances not assumed” contains an adjusted t test that corrects for
lack of homogeneity in the data. You would choose one or the other based on your evaluation of
the homogeneity of variance question, so we choose the bottom row. However, as we can see the
two values are very similar in this example. The actual t value and df (degrees of freedom) are
technical summaries measuring the magnitude of the group differences and a value related to the
sample sizes, respectively. The degrees of freedom, equal to the number of sample cases in the
analysis minus 2, is used in calculating the probability (significance) of the t value.

To interpret the results, move to the column labeled “Sig. (2-tailed).” This is the probability
(rounded to .000, meaning it is less than .0005) of obtaining sample means as far apart (2.49 hours) or
further by chance alone, if the two populations (males and females) actually have the same Internet use
each week. Thus the probability of obtaining such a large difference
by chance alone is quite small (less than 5 in 10,000), so we would conclude there is a significant
difference in Internet use between men and women, with men using the Internet more.

The term “2-tailed” significance indicates that we are interested in testing for any differences in
Internet use between men and women, either in the positive or negative direction (ergo the two
tails). Researchers with hypotheses that are directional—for example, that men use the Internet
more than women—can use one-tailed tests to address such questions in a more sensitive fashion.
Recall our discussion in Chapter 5. Broadly speaking, two-tailed tests look for any difference
between groups, while a one-tailed test focuses on a difference in a specific direction. Two-tailed
tests are more commonly done since the researcher is usually interested in any differences
between the groups, regardless which is higher.

If interested, you can obtain the one-tailed t test result directly from the two-tailed significance
value that is displayed. For example, suppose you wish to test the directional hypothesis that in
the population men do use the Internet more than women, the null hypothesis being that either
women use it more than men or that there is no gender difference. You would simply divide the
two-tailed significance value by 2 to obtain the one-tailed probability, and verify that the pattern
of sample means is consistent with your hypothesized direction. Thus if the two-tailed
significance value were .0005, then the one-tailed significance value would be half that value
(.00025), if the direction of the sample means violates the null hypothesis (otherwise it is 1 – p/2,
where p is the two-tailed value). To learn more about the differences and logic behind one and
two-tailed testing, see SPSS 16.0 Guide to Data Analysis (Norusis, 2008) or an introductory
statistics book.
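
In compact form, writing p for the reported two-tailed significance value, the one-tailed value is

$$p_{\text{one-tailed}} = \begin{cases} p/2 & \text{if the sample means fall in the hypothesized direction} \\ 1 - p/2 & \text{otherwise.} \end{cases}$$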

Confidence Band for Mean Difference


The T Test procedure provides an additional bit of useful information: the 95% confidence band
for the sample mean difference. Recalling our earlier discussion, the 95% confidence band for the
difference provides a measure of the precision with which we have estimated the true population
difference. In the output shown in Figure 7.18, the 95% confidence band for the mean difference
between groups is from -3.5 to -1.5 hours (use the Equal variances not assumed row). Note that
the difference values are negative because the second group ("Male") has the higher mean hours of
Internet use. Thus we expect that the population mean difference could easily be a number
like 1.9 or 2.5 hours but would not be a number as large as 5 or 6 hours. So the 95% confidence
band indicates the likely range within which we expect the population mean difference to fall.
Speaking in the technically correct fashion, if we were to continually repeat this study, we would
expect the true population difference to fall within the confidence bands 95% of the time. While
the technical definition is not illuminating, the 95% confidence band provides a useful precision
indicator of our estimate of the group difference.

Summary for Internet Use Per Week


Our analysis indicated that the assumption of homogeneity of variance is not satisfied, and that
there is a significant difference in Internet use per week between men and women. Our sample
indicates that, in the population of adults, on average men use the Internet about 2.5 hours more
than women per week and the 95% confidence band on this difference ranges from 1.5 to 3.5
hours.

T Test Results for Television Viewing


We will now compare men and women on their daily amount of television viewing.

Figure 7.19 Summaries for Hours Per Day Watching TV

Group Statistics

                            GENDER OF
                            RESPONDENT        N      Mean    Std. Deviation    Std. Error Mean
HOURS PER DAY WATCHING TV   Female          479      2.99             2.879               .132
                            Male            420      2.72             2.277               .111

Figure 7.20 T Test Output for Hours Per Day Watching TV


Independent Samples Test: HOURS PER DAY WATCHING TV

                                Levene's Test for
                                Equality of Variances     t-test for Equality of Means
                                                                                                              95% Confidence Interval
                                                                      Sig.        Mean        Std. Error     of the Difference
                                    F        Sig.      t        df   (2-tailed)   Difference  Difference     Lower        Upper
Equal variances assumed           5.620      .018    1.520      897     .129          .266       .175         -.077         .609
Equal variances not assumed                          1.543   887.775    .123          .266       .172         -.072         .604

Reviewing Figure 7.19, we see that the sample means are very close. As we discussed in Chapter
4, the standard errors are the expected standard deviations of the sample means if the study were
to be repeated with the same sample sizes. The difference (shown in Figure 7.20) between men
and women is small (.266 hours). Note that although the standard deviations of the two groups are
fairly close, the Levene test returns a probability value of .018, which is below the .05 cutoff but
above the .01 cutoff that we might consider using with our large sample. It is a good idea to
keep the sample size in mind when evaluating the homogeneity test, because with increasing
sample size there is more precise estimation of the sample standard deviations, and so smaller
differences are statistically significant. Thus if the Levene test were significant, but the sample
sizes were large and the ratio of the sample standard deviations were near 1, then the equal
variance t test should be quite adequate (and in this situation the two t values almost invariably
give the same result).

Proceeding to the t test itself, the significance value of .129 (in the equal variances assumed line)
indicates that if the null hypothesis of no gender difference in the amount of TV viewing were
true, then there is about a 13% chance of obtaining sample means as far (or further) apart as we
observe in our data. This is not significant (well above .05) and we conclude there is no evidence
of men differing from women in the number of hours watching TV each day. Notice that the 95%
confidence band of the male-female difference includes 0. This is another reflection of the fact
that we cannot conclude the populations are different.

7.6 Graphing Mean Differences


Although the T Test procedure displays the appropriate statistical test information, a summary
chart is often preferred as a way to present significant results. Bar charts displaying the group
sample means with 95% confidence bands can be produced using Chart Builder. However, many
people prefer an error bar chart instead. It is a cleaner chart that focuses more on the precision of
the estimated mean for each group than the mean itself. We will produce an error bar chart
showing the gender difference in Internet use.

Click Graphs…Chart Builder…


Click OK in the Information box (if necessary)
Click Reset
Click Gallery tab (if it's not already selected)
Click Bar in the Choose from: list

Select the icon for Simple Error Bar and drag it to the Chart Preview canvas
Drag and drop GENDER from the Variables: list to the X-Axis? area in the Chart
Preview canvas
Drag and drop WWWHR from the Variables: list to the Y-Axis? area in the Chart
Preview canvas

The completed Chart Builder is shown in Figure 7.21.

Figure 7.21 Chart Builder for Simple Error Bar Chart

By default, the sample mean will be displayed for each group with the error bars representing the
95% confidence interval for the sample means. In the Element Properties dialog box, you can
choose to display error bars representing Standard error or Standard deviation of the mean. From
the Statistics dropdown list, you can choose to display a statistic other than the mean. We will
display the default choices.

Click OK
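
When pasted, Chart Builder generates GGRAPH/GPL syntax, which is fairly lengthy. As a rough sketch, a
similar error bar chart can also be requested with the older GRAPH command:

* Error bar chart (95% confidence intervals) of Internet hours by gender.
GRAPH /ERRORBAR(CI 95)=wwwhr BY gender.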

Figure 7.22 Error Bar Chart of Internet Use Per Week

The small circle in the middle of each error bar represents the sample group mean of Internet use
per week, and the attached bars are the upper and lower limits for the 95% confidence interval on
the sample mean. Thus we can directly compare groups and view the precision with which the
group means have been estimated. Notice that the lower bound for men is well above the upper
bound for women indicating these groups are well separated and that the population difference is
statistically significant. Such charts are especially useful when more than two groups are
displayed, since one can quickly make informal comparisons between any groups of interest.

7.7 Appendix: Paired T Test


The paired t test is used to test for statistical significance between two population means when
each observation (respondent) contributes to both means. In medical research a paired t test
would be used to compare means on a measure administered both before and after some type of
treatment. Here each patient is tested twice and is used in calculating both the pre- and post-
treatment means. In market research, if a subject were to rate the product they usually purchase
and a competing product on some attribute, a paired t test would be needed to compare the mean
ratings. In an industrial experiment, the same operators might run their machines using two
different sets of guidelines in order to compare average performance scores. Again, the paired t
test is appropriate. Each of these examples differs from the independent groups t test in which an
observation falls into one and only one of the two groups. The paired t test entails a slightly
different statistical model since when a subject appears in each condition, he acts as his own
control. To the extent that an individual’s outcomes across the two conditions are related, the
paired t test provides a more powerful statistical analysis (greater probability of finding true
effects) than the independent groups t test.

To demonstrate a paired t test using the General Social Survey data, we will compare mean
education levels of the mothers and fathers of the respondents. The paired t test is appropriate
because we will obtain data from a single respondent regarding his/her parents’ education. We are
interested in testing whether there is a significant difference in education between fathers and
mothers of respondents in the population. Keep in mind that while the population we sample from
is the U.S. adult population, the questions pertain to their parents’ education. Thus the population
our conclusion directly generalizes to would be parents of U.S. adults. To test directly for
differences between men and women in the U.S. population, we could run an independent-groups
t test comparing mean education level for men and women.

While not pursued here, we would recommend running exploratory data analysis on the two
variables to be tested. The homogeneity of variance assumption does not apply since we are
dealing with one group. Normality is assumed and applies to the difference scores, obtained by
subtracting the two measures to be compared. To investigate this assumption using SPSS
Statistics, compute a new variable that is the difference between the two measures, then run
Explore on this variable to examine the descriptive statistics and plots.
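
As a sketch of that check, using the GSS variable names introduced in the next step (MAEDUC and
PAEDUC) and an arbitrarily named difference variable, you could run:

* Compute the mother-minus-father education difference and examine its distribution.
COMPUTE educdiff = maeduc - paeduc.
EXECUTE.
EXAMINE VARIABLES=educdiff
  /PLOT=HISTOGRAM NPPLOT
  /STATISTICS=DESCRIPTIVES.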

To run the paired samples t test on mother's and father's years of education,

Click Analyze…Compare Means…Paired-Samples T Test


Click on MAEDUC and drag it to the Variable1 box for Pair 1
Click on PAEDUC and drag it to the Variable2 box for Pair 1

Figure 7.23 Paired-Samples T Test Dialog Box

Click OK
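
The equivalent command syntax for this paired-samples t test is sketched below.

* Paired-samples t test of mother's versus father's years of education.
T-TEST PAIRS=maeduc WITH paeduc (PAIRED)
  /CRITERIA=CI(.95)
  /MISSING=ANALYSIS.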

Figure 7.24 Summaries of Differences in Parent’s Education

The first table displays the mean, standard deviation and standard error for each of the variables.
We see that the means for mothers and fathers are virtually the same. This might indicate very
close educational matching of people who marry. Another possibility is incorrect reporting of
parents’ formal education by the respondent, with a bias toward reporting the same value for both.
In the next table, the sample size (number of pairs) appears along with the correlation between
mother and father’s education. Correlations and their significance tests will be studied in a later
chapter, but we note that the correlation (.648) is positive, high, and statistically significant
(differs from zero in the population). This suggests that the power to detect a difference between
the two means is substantial.

Figure 7.25 Paired T Test of Differences in Parents’ Education

Paired Samples Test

                                                    Paired Differences
                                                                          95% Confidence Interval
                                            Std.        Std. Error       of the Difference                             Sig.
                                   Mean     Deviation   Mean             Lower        Upper        t       df          (2-tailed)
Pair 1  HIGHEST YEAR SCHOOL
        COMPLETED, MOTHER -
        HIGHEST YEAR SCHOOL        .049     3.159       .071             -.091        .188         .683    1976        .494
        COMPLETED, FATHER

The mean formal education difference, .049 years, is reported along with the sample standard
deviation and standard error (based on the parents’ education difference score computed for each
respondent). Not surprisingly, with this small mean difference, the significance value (.494)
indicates that if mothers and fathers in the population had the same formal education (null
hypothesis) then there is almost a 50% chance of obtaining as large (or larger) a difference as we
obtained in our sample. Using the standard cut-off probability of .05, we fail to reject the null
hypothesis and conclude that there is no evidence of a difference in formal education between mothers
and fathers.

7.8 Appendix: Normal Probability Plots


The histogram is useful for evaluating the shape of the distribution of the dependent measure
within each group. Since one of the t test assumptions is that these distributions are normal, we
implicitly compare the histograms to the well-known normal bell-shaped curve that we discussed
in Chapter 4. If a more direct assessment of normality is desired, the Explore procedure can
produce a normal probability plot and a fit test of normality. In this section we return to the
Explore dialog box and request these features.

Earlier in the chapter we used the Explore dialog box to investigate Internet use and hours
watching TV for males and females. If we return to this dialog box by clicking the Dialog Recall
tool, we see that it retains the settings from our last analysis.

Click the Dialog Recall tool, and then click Explore

To request the normal probability plots and test statistics,

Click the Plots pushbutton


Check Normality plots with tests

Figure 7.26 Explore Plots Dialog Box

As mentioned in the discussion concerning homogeneity of variance, the spread & level plot can
be used to find a variance stabilizing transformation for the dependent measure.

Click Continue
Click OK
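
Run from syntax, the request corresponds roughly to adding the NPPLOT keyword to the Explore (EXAMINE)
command used earlier in the chapter:

* Explore with normal probability plots and tests of normality.
EXAMINE VARIABLES=wwwhr tvhours BY gender
  /PLOT=BOXPLOT HISTOGRAM NPPLOT
  /MISSING=PAIRWISE.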

We ignore the other output which we have seen before and proceed to the Normal Q-Q plots.

Figure 7.27 Normal Probability Plot - Females

Figure 7.27 displays the normal probability plot of hours of Internet use for females.

To produce a normal probability plot, the data values (here hours using the Internet each week)
are first ranked in ascending order. Then the normal deviate corresponding to each rank
(compared to the sample size) is calculated (based on the standard normal curve) and plotted
against the observed value.
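
Explore builds this plot for you, but as an illustration of the idea, the following sketch computes
normal scores from the ranks (Blom's formula is one common choice; the variable name nscore is
arbitrary) and then plots them together with the observed values:

* Illustrative only: compute normal scores from ranks and plot them with the data.
RANK VARIABLES=wwwhr
  /NORMAL INTO nscore
  /FRACTION=BLOM
  /PRINT=NO.
GRAPH /SCATTERPLOT(BIVAR)=wwwhr WITH nscore.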

The vertical axis of the normal probability plot represents normal deviates (based on the rank of
the observation). The actual data values appear along the horizontal axis. The individual points
(circles) represent the data values (females only in Figure 7.27). The straight line indicates the
pattern we would see if the data were perfectly normal. In Figure 7.27, the line passes through the
point Expected Normal=0 (the center of the normal curve) and Observed Value=6.30, which
corresponds to the mean of Internet use for females. If Internet use followed a normal distribution
for females, the plotted values would closely approximate the straight line. Notice the large
deviations of the higher data values.

The advantage of a normal probability plot is that instead of comparing a histogram or stem &
leaf plot to the normal curve (more complicated), you need only compare the plot to a straight
line. The plot above confirms what we concluded earlier: that for females, Internet hours per
week does not follow a normal distribution.

Tests of Normality
Accompanying the normal probability plot is a modified version of the Kolmogorov-Smirnov test
(Lilliefors correction) and the Shapiro-Wilk test, which address whether the sample can be
viewed as originating from a population following a normal distribution. The null hypothesis is
that the sample comes from a normal population with unknown mean and variance. The
significance value is the probability that we can obtain a sample as far (or further) from the
normal as what we observe in our data, if our sample truly came from a normal population.

Figure 7.28 Tests of Normality

Tests of Normality

                               GENDER OF        Kolmogorov-Smirnov(a)             Shapiro-Wilk
                               RESPONDENT    Statistic     df     Sig.     Statistic     df     Sig.
WWW HOURS PER WEEK             Female             .263    908     .000          .578    908     .000
                               Male               .215    793     .000          .712    793     .000
HOURS PER DAY WATCHING TV      Female             .200    479     .000          .737    479     .000
                               Male               .220    420     .000          .791    420     .000

a. Lilliefors Significance Correction

For both tests the significance value is .000 (rounded to 3 decimals) in all cases. If we assume
we have sampled from a normal population, the probability of obtaining a sample as far (or
further) from a normal as what we have found is less than .0005 (or 5 chances in 10,000). So we
would conclude that for females in the population, the distribution of Internet hours per week is
not normal. Recall our earlier discussion of when normality might not be that important. Also keep in
mind that since our sample is large, we have a powerful test of
normality so relatively small departures from normality would be significant.

Detrended Normal Plot


The detrended normal probability plot (as shown in Figure 7.29) focuses attention on those areas
of the data exhibiting the greatest deviation from the normal. This plot displays the deviation of
each point in the normal probability plot from the straight line corresponding to the normal. The
vertical axis represents the difference between each point in the normal probability plot and the
straight line representing the perfect normal. The horizontal axis represents the observed value.
This serves to visually magnify the areas where there is greatest deviation from the normal. If the
data in the sample were normal, all the data points in the detrended normal plot would appear on
the horizontal line centered at 0.

Figure 7.29 shows the detrended normal plot of Internet use for females. We see that the major
deviations from the normal occur in the right tail of the distribution. The same conclusion could have
been drawn from a histogram or the normal probability plot. The detrended normal plot is
a more technical plot, which allows the researcher to focus in detail on the specific locus and
form of deviations from normality.

Figure 7.29 Detrended Normal Plot of Internet Use -Females

A normal probability plot and a detrended normal plot also appear for the males. These will not
be displayed here since our aim was to demonstrate the purpose and use of these charts, and not
to repeat the investigation of normality.

Summary Exercises
We want to see whether men and women differ in the average number of children, their use of
email, and their age.

1. Run exploratory data analysis on CHILDS, EMAILHR, and AGE by GENDER. (Hint:
Don't forget to select the "Pairwise Deletion" option.) Are any of the variables normally
distributed? What differences between males and females do you notice from the boxplot
of CHILDS?
2. Use Chart Builder to produce a paneled histogram for number of children by gender.
3. Run t tests looking at mean differences by gender for these three variables. Interpret the
results. Which variables met the homogeneity of variance assumption? Are the means of
any of the variables significantly different for males and females? Can you explain why?
4. Use Chart Builder to display an error bar chart of number of children by gender.

The analysis suggests that women have a greater number of children than men. Can you suggest
reasons for this seemingly odd result?

For those with extra time:

1. Other measures that might be of interest are the age when their first child was born
(AGEKDBRN) and number of household members (HHSIZE). Are you surprised by any
of these results?

Chapter 8: Bivariate Plots and Correlations: Scale Variables

Topics
• Scatterplots: Plotting Relationships between Scale Variables
• Types of Relationships
• The Pearson Correlation Coefficient

Data
This chapter uses the Bank.sav data file: a personnel file containing demographic, salary, and
work-related data from 474 employees at a bank in the early 1970s. The salary information has not been
converted to current dollars. Demographic variables include sex, race, age, and education in years
(edlevel). Work-related variables are job classification (jobcat), previous work experience recorded
in years (work), and time in months spent in the current job position (time). Current salary (salnow)
and starting salary (salbeg) are also available.

8.1 Introduction
In previous chapters we explored relations among categorical variables (using crosstab tables),
and between categorical variables and interval scale variables (t-test). Here we focus on studying
two interval scale measures: starting salary and formal education. We wish to determine if there is
a relationship, and if so, quantify it. Starting salary is recorded in dollars and formal education is
reported in years; thus both variables can be interval scales or stronger (actually ratio scales).
Note that education level has been defined with a nominal measurement level, so we will need to
change this prior to completing our analysis. Each variable can take on many different values. If
we tried to present these variables (beginning salary and education) in a crosstabulation table, the
table could contain hundreds of rows. In order to view the relation between these measures we
must either recode salary and education into categories and run a crosstab (the appropriate graph
is a clustered bar chart), or alternatively, present the original variables in a scatterplot. Both
approaches are valid and you would choose one depending on your interests. Since we hope to
build an equation relating amount of education to beginning salary, we will stick to the original
scales and begin with a scatterplot. But first we will take a quick look at the relevant variables
using exploratory data analysis methods.

8.2 Reading the Data


The data are stored as the SPSS Statistics file named Bank.sav.

Click File…Open…Data
Double-click Bank.sav

Figure 8.1 The Bank Data

We see the data values for several employees in the Data Editor window.

8.3 Exploring the Data


As this is the first time seeing the Bank data, we will explore the data before performing more
formal analysis (for example, regression). While the scatterplot itself provides much useful
information about each of the variables displayed, we begin by examining each variable
separately. We will run exploratory data analysis on beginning salary (salbeg) and education
(edlevel). The id variable will be used to label cases.

Click Analyze…Descriptive Statistics…Explore


Move salbeg and edlevel into the Dependent List: box
Move id into the Label Cases by: box

Figure 8.2 Explore Dialog Box

There are no Factor variables in this analysis since we are looking at the two variables over the
entire sample. Outliers in the box plot will be identified by their employee ID number. This file
has no missing data so we need not change the option for handling missing data. However, we
will suppress the stem & leaf plot and examine the boxplot.

Click Plots button


Click off (uncheck) Stem & leaf (not shown)
Click OK

This procedure will create a number of tables and charts in the Output Viewer which we discuss
below.

Figure 8.3 Statistics for Beginning Salary

The descriptive statistics for beginning salary are displayed in Figure 8.3. The mean ($6,806) is
considerably higher than the median ($6,000), suggesting a skewed distribution. This is
confirmed by the skewness value compared to its standard error. Starting salaries range from
$3,600 to $31,992 (recall that these are salaries from the 1960s and early 1970s in unadjusted
dollars).

The extreme values at the high salary end result in a skewed distribution. Since several different
job classifications are represented in this data, the skewness may be due to a relatively small
number of people in high paying jobs. The positive kurtosis is partially caused by the large peak
of values in the $6,000 salary range.

Figure 8.4 Boxplot of Beginning Salary

In the box plot above, all outliers are at the high end, and the employee numbers for some of them
can be read from the plot (changing the font size of these numbers in the Chart Editor window
would make more of them legible). It might be useful to look at the job classification (jobcat) of
some of the higher salaried individuals as a check for data errors.

Figure 8.5 Statistics for Formal Education (in years)


The mean for education is again above the median, but the skewness value is very near zero
(suggesting a symmetric distribution). Here the mean exceeding the median is not due to the
presence of outliers. We will see that there are only a few extreme observations, and they are at
high education values. The mean is above the median because of the concentration of employees
with education of 15 to 19 years.

Figure 8.6 Box & Whisker Plot of Education

In the boxplot (Figure 8.6), the median or 50th percentile (dark line within box) falls on the lower
edge of the box (25th percentile) indicating a large number of people with 12th grade education.
There are very few outliers.

Having explored each variable separately, we will now view them jointly with a scatterplot.

8.4 Scatterplots
A scatterplot displays individual observations in an area determined by a vertical and a horizontal
axis, each of which represent an interval scale variable of interest. In a scatterplot you look for a
relationship between the two variables and note any patterns or extreme points. The scatterplot
visually presents the relation between two variables, while correlation and regression summaries
quantify the relationships.

To request a simple scatterplot, we will use the Chart Builder.

Click Graphs…Chart Builder

The first step is to select a chart from the Gallery or individual elements from Basic Elements;
then drag and drop them onto the canvas. For most charts, you will want to use the Gallery as
your starting point.

Click the Gallery tab if it is not selected


Click Scatter/Dot in the Choose from list if it is not selected

There are a number of options for the scatter/dot charts. Since we are dealing with only two
variables, a simple scatterplot will suffice.

Drag the icon for Simple Scatter (the first icon) into the canvas

Next we indicate that beginning salary (salbeg) and education (edlevel) are the Y and X variables,
respectively.

Move salbeg into the Y Axis: area

Traditionally, the vertical axis is called the Y axis, while the horizontal axis is referred to as the X
axis. Also, if one of the variables is viewed as dependent and the other as independent, by
convention the dependent variable is specified as the Y axis variable.

Notice the measurement level of the variable edlevel. As seen in the dialog, it is defined as
nominal (you can see this from the three balls icon next to edlevel and also from the Categories
section below the variable list). Scatterplots are normally run on scale variables in order to
determine the type of relationship between the two variables. To change the measurement level of
Education level [edlevel], we could cancel out of the dialog and return to the Variable View of
the Data Editor or we can change it within the dialog. The change made in the dialog box affects
only this chart. The measurement level specification on the file (viewed in the Data Editor) is
unchanged.

In the Variables: list, select and right-click on edlevel


Select Scale from the resulting pop-up menu.

Figure 8.7 Changing the measurement level of Education level

We can now move the edlevel variable into the X-axis and click OK to build the chart. The
completed Chart Builder dialog is shown in Figure 8.8.

Move edlevel into the X Axis: area


Click OK
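
If pasted, Chart Builder produces GGRAPH syntax for this chart. A shorter, roughly equivalent sketch
using the legacy GRAPH command is:

* Simple scatterplot of beginning salary and education.
GRAPH /SCATTERPLOT(BIVAR)=edlevel WITH salbeg
  /MISSING=LISTWISE.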

Figure 8.8 Chart Builder with Simple Scatterplot

Figure 8.9 Scatterplot of Beginning Salary and Education

Each circle represents at least one observation. We see there are many points (fairly dense) at
several values, including 8 and 12 years of education. The plot has gaps because education is
recorded in integers. Overall, there seems to be a positive relation between the two variables,
since higher values of education are associated with higher salaries. Notice there is no one with
little education and a high salary, nor is there anyone with high education and a very low salary.
This will be explored in more detail shortly. There is one individual at a salary considerably
higher than the rest. If this were your study, you might check this observation to make sure it
wasn’t in error.

While we can describe the pattern to an interested party by saying that to some extent greater
education is associated with higher salary levels, or simply show them the chart, there would be
an advantage if we could quantify the relation using some simple function. We will pursue this
aspect later in this chapter and in the next chapter.

If we wish to overlay our plot with a best-fitting straight line, we can do so using the Chart
Editor.

Double click on the chart to open the Chart Editor


Click Elements…Fit Line at Total

Figure 8.10 Chart Editor Elements Menu

Figure 8.11 Element Properties for Fit Line

By default a straight line (linear) will be fit to the data. However, you can use the Properties
dialog box to specify lines with other fit methods such as quadratic and cubic. The Loess choice
applies a locally weighted regression technique to the data. Such methods produce a fit that is more
resistant to outliers than the traditional least-squares regression. Note that 95% confidence bands
around the best-fitting line can be added to the plot. Although not obvious, the r-square measure
will also be displayed. We will leave the defaults.

Close the Chart Editor

Figure 8.12 Scatterplot with Best Fitting Line

The straight line tracks the positive relationship between beginning salary and education. How
well do you think it describes or models the relationship? We use scatterplots to get a sense of
whether or not it is appropriate to use correlation coefficients and, later, regression. Both of these
techniques assume a linear relationship (although regression can be used to model curvilinear
relationships with suitable modifications). The other fit choices available can be used to
determine what type of nonlinearity might exist.

It would be helpful if we could quantify the strength of the relationship and, furthermore, describe
it mathematically. If a simple function (for instance a straight line) does a fair job of representing
the relationship, then we can describe that line very easily with the equation
Y = a + b*X. Here b is the slope (or average change in Y per unit change in X) and a is the
intercept. Methods are available to perform both tasks: correlation for assigning a number to the
strength of the straight-line relationship, and regression to describe the best-fitting straight line.

8.5 Correlations
A correlation coefficient can be used to quantify the strength and direction of the relationship
between variables in a scatterplot. The correlation coefficient (formally named the Pearson
product-moment correlation coefficient) is a measure of the extent to which there is a linear (or
straight line) relationship between two variables. It is normed so that a correlation of +1 indicates
that the data fall on a perfect straight line sloping upwards (positive relationship), while a
correlation of –1 would represent data forming a straight line sloping downwards (negative
relationship). A correlation of 0 indicates there is no straight-line relationship at all. Correlations
falling between 0 and either extreme indicate some degree of linear relation: the closer to +1 or –
1, the stronger the relation. In social science and market research, when straight-line relationships
are found, significant correlation values are often in the range of .3 to .6.
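
For reference, for paired observations (x_i, y_i) with sample means \bar{x} and \bar{y}, the Pearson
correlation is

$$r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2}\,\sqrt{\sum_{i}(y_i - \bar{y})^2}}$$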

Below we display four scatterplots with their accompanying correlations, all based on simulated
data following normal distributions. Four different correlations appear (1.0, .8, .4, 0). All are
positive, but represent the full range in strength of linear association (from 0 to 1). As an aid in
interpretation, a best-fitting straight line is superimposed on each chart.

Figure 8.13 Scatterplots Based on Various Correlations.

For the perfect correlation of 1.0, all points fall on the straight line trending upwards. In the
scatterplot with a correlation of .8 the strong positive relation is apparent, but there is some
variation around the line. Looking at the plot of data with correlation of .4, the positive relation is
suggested by the absence of points in the upper left and lower right of the plot area. The
association is clearly less pronounced than with the data correlating .8 (note greater scatter of
points around the line). The final chart displays a correlation of 0: there is no linear association
present. This is fairly clear to the eye (the plot most resembles a blob), and the best-fitting straight
line is a horizontal line.

While we have stressed the importance of looking at the relationships between variables using
scatterplots, you should be aware that human judgment studies indicate that people tend to
overestimate the degree of correlation when viewing scatterplots. Thus obtaining the numeric
correlation is a useful adjunct to viewing the plot. Conversely, since correlations only capture the
linear relation between variables, viewing a scatterplot allows you to detect any nonlinear
relationships that are present.

Additionally, statistical significance tests can be applied to correlation coefficients. Assuming the
variables follow normal distributions, you can test whether the correlation differs from zero (zero
indicates no linear association) in the population, based on your sample results. The significance
value is the probability that you would obtain as large (or larger in absolute value) a correlation as
you find in your sample, if there were no linear association (zero correlation) between the two
variables in the population.
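
For reference, the test statistic conventionally used for this purpose is

    t = r * sqrt(N - 2) / sqrt(1 - r²), with N - 2 degrees of freedom,

where r is the sample correlation and N is the number of cases; the reported significance value is
the corresponding two-tailed probability from the t distribution.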

In SPSS Statistics, correlations can be easily obtained along with an accompanying significance
test. If you have grossly non-normal data, or only ordinal scale data, the Spearman rank
correlation coefficient (or Spearman correlation) can be calculated. It evaluates the linear
relationship between two variables after ranks have been substituted for the original scores.
Another, less common, rank association measure is Kendall's tau-b. We will obtain the correlation (Pearson)
between beginning salary and education, and will also include age, current salary, and work
experience in the analysis.

Click Analyze…Correlate…Bivariate
Move salbeg, salnow, edlevel, age and work to the Variables: list box

Figure 8.14 Correlations Dialog Box

Notice that we simply list the variables to be analyzed; there is no designation of dependent and
independent. Correlations will be calculated on all pairs of the variables listed.

By default, Pearson correlations will be calculated, which is what we want. However, both
Kendall’s tau-b and Spearman nonparametric correlation coefficients can be requested as well. A
two-tailed significance test will be performed on each correlation. This will posit as the null
hypothesis that in the population there is no linear association between the two variables. Thus
any straight-line relationship, either positive or negative, is of interest. If you prefer a one-tailed
test, one in which you specify the direction (or sign) of the relation you expect and any relation in
the opposite direction (opposite sign) is bundled with the zero (or null) effect, you can obtain it
through the One-tailed option button. This issue was discussed earlier in Chapter 5 and in the
context of one- versus two-tailed t tests. A one-tailed test gives you greater power to detect a
correlation of the sign you propose, at the price of giving up the ability to detect a significant
correlation of the opposite sign. In practice, researchers are usually interested in all linear
relations, positive and negative, and so two-tailed tests are most common. The Flag significant
correlations check box is checked by default. When checked, asterisks appearing beside the
correlations will identify significant correlations.

The Options button opens a dialog box in which you can request a table of descriptive statistics
for the variables used in the analysis. There is also a choice for missing values. The default
missing setting is Pairwise, which means that a case is dropped from a correlation coefficient only
if it is missing a value on either of the two variables involved. However, the case will still be used
for any other pair of variables on which it has valid values. In this way, SPSS Statistics retains the valid information from all pairs
of variables. The alternative is Listwise, in which a case is dropped from the entire correlation
analysis if any of its analysis variables have missing values. Neither method provides an ideal
solution; in practice, pairwise deletion is often chosen when a large number of cases are dropped
by the listwise method. This is an area of statistics in which considerable progress has been made
in the last decade, and the SPSS Missing Values add-on module incorporates some of these
improvements.
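
For reference, the settings described above correspond to command syntax roughly like the
following sketch (the Paste button in the dialog box generates a similar command):

* Pearson correlations, two-tailed tests, pairwise deletion of missing data.
CORRELATIONS
  /VARIABLES=salbeg salnow edlevel age work
  /PRINT=TWOTAIL
  /MISSING=PAIRWISE.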

Click OK

SPSS Statistics displays the correlations, sample sizes and significance values together in a cell in
the Correlations table. Looking at the table in Figure 8.15, we see that the variable labels run
down the first column and across the top row. Each cell (intersection of a row and column) in the
matrix contains the correlation (also significance value and sample size) between the relevant row
and column variable. The correlation (Pearson Correlation) is listed first in each cell, followed by
the probability value of the significance test (“Sig. (2-tailed)”), and finally the sample size (N).

The Pearson correlation coefficient (generally abbreviated as 'r') is closely tied to the 'least-
squares' criterion used later in regression. It is calculated by standardizing each variable
(converting its values to z-scores) and then averaging the cross-products of the paired
standardized scores, so the result does not depend on the units in which either variable is
measured.

The correlation coefficient can take on values between +1 and -1 where:

• r = +1.00 if there is a perfect positive linear relationship between the two variables
• r = -1.00 if there is a perfect negative linear relationship between the two variables
• r = 0.00 if there is no linear relationship between the two variables

Note that the sign does not reveal anything about the strength of the relationship, just its
direction.
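
For two variables X and Y measured on N cases, the coefficient is conventionally defined as

    r = Σ(X - X̄)(Y - Ȳ) / sqrt[ Σ(X - X̄)² * Σ(Y - Ȳ)² ]

which is equivalent to the average cross-product of the standardized (z-score) versions of X and
Y. Dividing by the two standard deviations is what keeps r between -1 and +1 regardless of the
units in which the variables are measured.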

One extremely important consideration when using Pearson’s correlation coefficient is exactly
what is being measured by the statistic. Remember, r is simply a measure of the linear
relationship between the variables, hence a value of 0 does not necessarily mean that the two
variables are not related, simply that there is no evidence for a linear relationship.

Figure 8.15 Correlation Matrix

Note all correlations along the major (upper left to lower right) diagonal are 1. This is because
each variable perfectly correlates with itself (no significance tests are performed for these
correlations). Also, the correlation matrix is symmetric, that is, the correlation between beginning
salary and education is the same as the correlation between education and beginning salary. Thus
you need only view part of the matrix to the upper right (or to the lower left) of the diagonal to
see all the correlations.

There is, not surprisingly, a strong (.88) correlation between beginning salary and current salary.
Its significance value rounded to three decimals is .000 (thus less than .0005). This means that if
beginning salary and current salary had no linear association in the population, then the
probability of obtaining a sample with such a strong (or stronger) linear association is less than
.0005. The sample size is nearly 500, which should provide fairly sensitive (powerful) tests of the
correlations being nonzero.

Formal education and beginning salary have a substantial (.63) positive correlation, while age has
no linear association with beginning salary (correlation = –.01; probability value of .81, or 81%
chance of obtaining a sample correlation this far from zero, if it were truly zero in the
population). Do you see any other large correlations in the table, and if so can you explain them?
Also note that asterisks mark the significant correlations (at the .01 and .05 level).

A correlation provides a concise numerical summary of the degree of linear association between
pairs of variables. However, outliers can strongly influence a correlation, and the coefficient alone
gives no hint that this has happened. Often, such outliers would be visible in a scatterplot. Also, a scatterplot might suggest that a function
other than a straight line be fit to the data, whereas a correlation simply provides a measure of
straight-line fit. For these reasons, serious analysts look at scatterplots. If the number of variables
is so large as to make looking at all scatterplots impracticable, then at least view those involving
important variables.

Summary Exercises
Suppose you are interested in predicting current salary (salnow), based on age, education in years
(edlevel), minority status (minority), beginning salary (salbeg), gender (sex), and work experience
before coming to the bank (work).

1. Run frequencies on minority so you understand its distribution. Run descriptive statistics
on the other variables for the same reason.

2. Now produce correlations with all these predictors and current salary.

3. Then create scatterplots of the predictors and current salary. Which variables have strong
correlations with current salary? Did you find any potential problems with using linear
regression? Did you find any potential outliers?

Chapter 9: Introduction to Regression


Topics
• Introduction and Basic Concepts
• Regression Equation and Fit Measure
• Assumptions
• Simple Regression
• Interpreting Standard Results

Data
This chapter uses the Bank.sav data file.

9.1 Introduction and Basic Concepts


We found in Chapter 8, based on a scatterplot and correlation coefficient, that beginning salary
and education are positively related for the bank employees. We wish to further quantify this
relation by developing an equation predicting starting salary based on education. A statistical
method used to predict a variable (an interval scale dependent measure) from one or more
predictor (interval scale) variables is regression analysis. Commonly, straight lines are used,
although other forms of regression allow nonlinear functions. In this chapter we will focus on
linear regression, which models straight-line relations between variables. To
aid our discussion, let’s revisit the scatterplot of beginning salary and education.

Figure 9.1 Scatterplot of Beginning Salary and Education

Earlier we pointed out that to the eye there seems to be a positive relation between education and
beginning salary, that is, higher education is associated with greater starting salaries. This was
confirmed by the two variables having a significant positive correlation (.63). While the
correlation does provide a single numeric summary of the relation, something that would be more
useful in practice is some form of prediction equation. Specifically, if some simple function can
approximate the pattern shown in the plot, then the equation for the function would concisely
describe the relation, and could be used to predict values of one variable given knowledge of the
other. A straight line is a very simple function, and is usually what researchers start with, unless
there are reasons (theory, previous findings, or a poor linear fit) to suggest another. Also, since
the point of much research involves prediction, a prediction equation is valuable. However, the
value of the equation would be linked to how well it actually describes or fits the data, and so part
of the regression output includes fit measures.

9.2 The Regression Equation and Fit Measure


In the scatterplot (Figure 9.1), beginning salary is placed on the Y-axis and education appears
along the X axis. Since formal education is typically completed before starting at the bank, we
consider beginning salary to be the dependent variable and education the independent or predictor
variable (this assumption was more true in the 1960s than today). A straight line is superimposed
on the scatterplot; the line is represented in general form by the equation,

Y = a + b*X

where, b is the slope (the change in Y per unit change in X) and a is the intercept (the
value of Y when X is zero).

Given this, how would one go about finding a best-fitting straight line? In principle, there are
various criteria that might be used: minimizing the mean deviation, mean absolute deviation, or
median deviation. Due to technical considerations, and with a dose of tradition, the best-fitting
straight line is defined as the one that minimizes the sum of the squared deviations of each point
about the line.
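
In symbols, the least-squares line is the choice of a and b that minimizes

    Σ (Yi - (a + b*Xi))²

that is, the sum over all cases of the squared vertical distances between the observed points and
the line.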

Returning to the plot of beginning salary and education, we might wish to quantify the extent to
which the straight line fits the data. The fit measure most often used, the r-square measure, has
the dual advantages of falling on a standardized scale and having a practical interpretation. The r-
square measure (which is the correlation squared, or r², when there is a single predictor variable,
and thus its name) is on a scale from 0 (no linear association) to 1 (perfect linear prediction).
Also, the r-square value can be interpreted as the proportion of variation in one variable that can
be predicted from the other. Thus an r-square of .50 indicates that we can account for 50% of the
variation in one variable if we know values of the other. You can think of this value as a measure
of the improvement in your ability to predict one variable from the other (or others if there are
multiple independent variables).
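
Equivalently, the r-square compares the scatter that remains around the line to the total scatter
around the mean of the dependent variable:

    r-square = 1 - (sum of squared residuals) / (total sum of squares)

so an r-square of .50 means that the squared prediction errors are, in total, half as large as they
would be if we simply predicted the mean for every case.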

9.3 Residuals and Outliers


Viewing the scatterplot, we see that many points fall near the line, but some are quite a distance
from it. For each point, the difference between the value of the dependent variable and the value
predicted by the equation (value on the line) is called the residual (also known as the error).
Points above the line have positive residuals (they were under-predicted), those below the line
have negative residuals (they were over-predicted), and a point falling on the line has a residual
of zero (perfect prediction). Points having relatively large residuals are of interest because they
represent instances where the prediction line did poorly. For example, one case has a beginning
salary of about $30,000 while the predicted value (based on the line) is about $10,000, yielding a
residual, or miss, of about $20,000. If budgets were based on such predictions, this is a substantial
discrepancy. The Regression procedure can provide information about large residuals, and also
present them in standardized form. Outliers, or points far from the mass of the others, are of
interest in regression because they can exert considerable influence on the equation (especially if
the sample size is small). Also, outliers can have large residuals and would be of interest for this
reason as well. While not covered in this class, SPSS Statistics can provide influence statistics to
aid in judging whether the equation was strongly affected by an observation and, if so, to identify
the observation.

9.4 Assumptions
Regression is usually performed on data for which the dependent and independent variables are
interval scale. In addition, when statistical significance tests are performed, it is assumed that the
deviations of points around the line (residuals) follow the normal bell-shaped curve. Also, the
residuals are assumed to be independent of the predicted values, which implies that the variation
of the residuals around the line is homogeneous. SPSS Statistics can provide summaries and plots
useful in evaluating these latter issues.

One special case of the assumptions involves the interval scale nature of the independent
variable(s). A variable coded as a dichotomy (say 0 and 1) can technically be considered as an
interval scale. An interval scale assumes that a one-unit change has the same meaning throughout
the range of the scale. If a variable’s only possible codes are 0 and 1 (or 1 and 2, etc.), then a one-
unit change does mean the same change throughout the scale. Thus dichotomous variables, for
example sex, can be used as predictor variables in regression. It also permits the use of nominal
predictor variables if they are converted into a series of dichotomous variables; this technique is
called dummy coding and is considered in most regression texts (Draper and Smith, 1998; Cohen
and Cohen, 2002). The multiple regression analysis (multiple independent variables) performed
in Appendix B uses a dichotomous predictor (sex).

9.5 Simple Regression


A regression involving a single independent variable is the simplest case and is called simple
regression. We will develop a regression equation predicting beginning salary based on
education. There are a number of regression techniques available in SPSS Statistics on the
Regression menu, such as linear regression, which is used to perform simple and multiple linear
regression, and logistic regression, which is used to predict nominal outcome variables such as
purchase/not purchase. Logistic regression and many of the other regression techniques are
discussed in the Advanced Techniques: Regression course. We will select Linear to perform
simple linear regression, then name beginning salary (salbeg) as the dependent variable and
education (edlevel) as the independent variable.

Click Analyze…Regression…Linear from the menu


Move salbeg to the Dependent: box
Move edlevel to the Independent(s): box

Figure 9.2 Linear Regression Dialog Box

In this first analysis we will limit ourselves to producing the standard regression output. In the
multiple regression example in Appendix B, we will ask for residual plots and information about
cases with large residuals. Also, the Regression dialog box allows many specifications; here we
will discuss the most important features. However, if you will be running regression often, some
time spent reviewing the additional features and controls mentioned in the Help system will be
well worth it.

The Independent(s) list box will permit more than one independent variable, and so this dialog
box can be used for both simple and multiple regression. The block controls permit an analyst to
build a series of regression models with the variables entered at each stage (block), as specified
by the user.

By default, the Method is Enter, which means that all independent variables in the block will be
entered into the regression equation simultaneously. This method is chosen to run one regression
based on all variables you specify. If you wish the program to select, from a larger set of
independent variables, those that in some statistical sense are the best predictors, you can request
the Stepwise method.

The Selection Variable option permits cross-validation of regression results. Only cases whose
values meet the rule specified for a selection variable will be used in the regression analysis, yet
the resulting prediction equation will be applied to the other cases. Thus you can evaluate the
regression on cases not used in the analysis, or apply the equation derived from one subgroup of
your data to other groups.

While SPSS Statistics will present standard regression output by default, many additional (and
some of them quite technical) statistics can be requested via the Statistics dialog box. The Plots
dialog box is used to generate various diagnostic plots used in regression, including residual plots.
We will request such plots in the analysis in Appendix B. The Save dialog box permits you to add
new variables to the data file. These variables contain such statistics as the predicted values from
the regression equation, various residuals and influence measures. Finally, the Options dialog box
controls the criteria when running stepwise regression and choices in handling missing data. By
default, SPSS Statistics excludes a case from regression if it has one or more values missing for
the variables used in the analysis.

Note: The SPSS Missing Values add-on module provides more sophisticated methods for
handling missing values. This module includes procedures for displaying patterns of missing data
and imputing (estimating) missing values using multiple variable imputation methods.
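
For reference, the simple regression specified above corresponds to command syntax roughly
like the following sketch (the Paste button adds a few more default subcommands):

* Predict beginning salary from years of education using the default (Enter) method.
REGRESSION
  /DEPENDENT salbeg
  /METHOD=ENTER edlevel.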

We perform the analysis by finishing the dialog box. The output in the Output Viewer will be a
series of tables described below.

Click OK

Figure 9.3 Model Summary and Overall Significance Tests

After listing the dependent and independent variables, Regression provides several measures of
how well the model fits the data. These fit measures are displayed in the Model Summary table.
First is the multiple R, which is a generalization of the correlation coefficient. If there is a single
predictor variable (as in our case) then the multiple R is simply the unsigned (positive) correlation
between the independent and dependent variable—recall the correlation between beginning salary
and education was .63. If there are several independent variables then the multiple R represents
the unsigned (positive) correlation between the dependent measure and the optimal linear
combination of the independent variables. Thus the closer the multiple R is to 1, the better the fit.

As mentioned earlier, the R-Square measure can be interpreted as the proportion of variance of
the dependent measure that can be predicted from the independent variable(s). Here it is about
40% (.40), which is far from perfect prediction, but still substantial. The Adjusted R-Square
represents a technical improvement over the r-square in that it explicitly adjusts for the number of
predictor variables relative to the sample size, and as such is preferred by many analysts.
Generally, they are very close in value; in fact, if they differ dramatically in multiple regression,
it is a sign that you have used too many predictor variables for your sample size, and the adjusted
r-square value should be more trusted. In our results, they are very close.

The Standard Error of the Estimate is a standard deviation type summary of the dependent
variable that measures the deviation of observations around the best fitting straight line. As such
it provides, in the scale of the dependent variable, an estimate of how much variation remains to
be accounted for after the line is fit. The reference number for comparison is the original standard
deviation of the dependent variable, which measures the original amount of unaccounted
variation. Regression can display such descriptive statistics as the standard deviation, but since
we didn’t request this, we will note that the original standard deviation of beginning salary was
$3,148. Thus the uncertainty surrounding individual beginning salaries has been reduced from
$3,148 (standard deviation) to $2,439 (standard error). If the straight line perfectly fit the data, the
standard error would be 0.

While the fit measures indicate how well we can expect to predict the dependent variable or how
well the line fits the data, they do not tell whether there is a statistically significant relationship
between the dependent and independent variables. The analysis of variance table (ANOVA in the
Output Viewer) presents technical summaries (sums of squares and mean square statistics) of the
variation accounted for by the prediction equation. Our main interest is in determining whether
there is a statistically significant (non-zero) linear relation between the dependent variable and the
independent variable(s) in the population. Since in simple regression there is a single independent
variable, we are testing a single relationship; in multiple regression, we test whether any linear
relation differs from zero. The significance value accompanying the F test gives us the
probability that we could obtain one or more sample slope coefficients (which measure the
straight line relationships) as far from zero as what we obtained, if there were no linear relations
in the population. The result is highly significant (significance probability less than .0005 or 5
chances in 10,000). Now that we have established there is a significant relationship between the
beginning salary and education, and obtained fit measures, we turn to the next table, Coefficients,
to interpret the regression coefficients.

Figure 9.4 Regression Coefficients

The first column contains a list of the independent variables plus the intercept (constant term).
The column labeled B contains the estimated regression coefficients we would use in a prediction
equation. The coefficient for education level indicates that on average, each year of education was
associated with a beginning salary increase of $691.01. The constant or intercept of –2,516.39
indicates that the predicted beginning salary of someone with 0 years of education is negative
$2,516.39, so they would pay the bank to work! This is clearly impossible. This odd result stems
in part from the fact that no one in the sample had fewer than 8 years of education, so the
intercept projects well beyond the region containing data. When using regression it can be very
risky to extrapolate beyond where the data are observed; the assumption is that the same pattern
continues. Here it clearly cannot! The Standard Error (of B) column contains standard errors of
the regression coefficients. These provide a measure of the precision with which we estimate the
B coefficients. The standard errors can be used to create a 95% confidence band around the B
coefficients (available as a Statistics option). In our example, the regression coefficient is $691
and the standard error is about $39. Thus we would not be surprised if in the population the true
regression coefficient were $650 or $730 (within one standard error of our sample estimate), but
it is very unlikely that the true population coefficient would be $300 or $2,000.

Betas are standardized regression coefficients and are used to judge the relative importance of
each of several independent variables. We will use these measures when discussing multiple
regression. Finally, the t statistics provide a significance test for each B coefficient, testing
whether it differs from zero in the population. Since we have only one independent variable, this
is the same result as what the F test provided earlier. In multiple regression, the F statistic tests
whether any of the independent variables are significantly related (non-zero coefficient) to the
dependent variable, while the t statistic is used to test each independent variable separately. The
significance test on the constant assesses whether the intercept coefficient is different from zero
in the population (it is).

Thus if we wish to predict beginning salary based on education for new employees, we would use
the B coefficients in the formula: Beginning Salary = $691 * Education – $2,516. Even when
running simple regression, the analyst would probably take a look at some residual plots and
check for outliers; we will follow through on this aspect in the Appendix B example.
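
As a simple illustration of applying the equation, predicted beginning salaries could be computed
for each case with syntax along these lines (the variable name pred_salbeg is just an illustrative
choice; the Save dialog box can create predicted values automatically):

* Apply the estimated prediction equation to each case (coefficients from Figure 9.4).
COMPUTE pred_salbeg = -2516.39 + 691.01 * edlevel.
EXECUTE.

For example, an employee with 16 years of education would have a predicted beginning salary of
roughly 691 * 16 - 2,516, or about $8,540.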

Summary Exercises
1. Run a simple linear regression using beginning salary to predict current salary.

2. How well did you do in predicting current salary from this one variable? (Hint: What is
the R-square and how would you interpret it?)

3. Interpret the constant (intercept) and B values.

4. Use the prediction equation to predict the value of current salary for a person who had a
beginning salary of $6,400.

5. Run a simple linear regression using one of the other variables from the set you explored
in the last chapter. Hint: You might want to try the one with the next highest correlation
coefficient. Answer Questions 2-3 above.

Appendix A: Mean Differences Between Groups: One-Factor ANOVA
Topics
• Extending the Logic of T Test Beyond Two Groups
• Exploring the Differences
• One-Factor ANOVA
• Post Hoc Testing of Means
• Graphing the Mean Differences
• Appendix: Group Differences on Ranks

Data
In this appendix, we use the GSS2004Intro.sav file.

Scenario
Using the GSS 2004 data, we want to investigate the relation between level of education and
amount of TV viewing. One approach is to group people according to their type of degree, and
then compare these groups on average amount of daily TV watched. In the General Social Survey
the question about highest degree completed (DEGREE) contains five categories: less than high
school, high school, junior college, bachelor, and graduate. Assuming we retain these categories
we might first ask if there are any population differences in TV viewing among these groups. If
there are significant mean differences overall, we next want to know specifically which groups
differ from which others.

A.1 Introduction
Analysis of variance (ANOVA) is a general method of drawing conclusions regarding differences
in population means when two or more comparison groups are involved. The independent-groups
t test (Chapter 7) applies only to the simplest instance (two groups), while ANOVA can
accommodate more complex situations. In fact, the t test can be viewed as a special case of ANOVA,
and they yield the same result in a two-group situation (same significance value, and the t statistic
squared is equal to ANOVA’s F statistic).

We will compare five groups composed of people with different education degrees and evaluate
whether the populations they represent differ in average amount of daily TV viewing. Before
performing the analysis we will look at an exploratory data analysis plot.

A.2 Extending the Logic Beyond Two Groups


The basic logic of significance testing for comparing group means on more than two groups is the
same as that for the t test which we reviewed in Chapter 7. To summarize,

Ho (Null Hypothesis) assumes the population groups have the same means.

Determine the probability of obtaining a sample with group mean differences as large (or
larger) as what we find in our data. To make this assessment the amount of variation among
group means (between-group variation) is compared to the amount of variation among
observations within each group (within-group variation). Assuming in the population that the
group means are identical (null hypothesis), the only source of variation among sample means
would be the fact that the groups are composed of different individual observations.

Thus a ratio of the two sources of variation (between group/within group) should be about 1 when
there are no population differences. When the distribution of individual observations within each
group follows the normal curve, the statistical distribution of this ratio is known (F distribution)
and we can make a probability statement about the consistency of our data with the null
hypothesis. The final result is the probability of obtaining sample differences as large (or larger)
as what we found if there were no population differences. If this probability is sufficiently small
(usually less than 5 chances in 100, or .05) we conclude the population groups differ.

The assumptions of normality within each group and homogeneity of variance that we discussed
in Chapter 7 and considerations of sample size are applicable to all ANOVA models. Likewise,
the "rules of thumb" approaches that we considered for violation of these assumptions carry over
to this extended ANOVA model.

Factors
When performing a t test comparing two groups there is only one comparison that can be made:
group 1 versus group 2. For this reason, the groups are constructed so their members
systematically vary in only one aspect: for example, males versus females, or drug A versus drug
B. If the two groups differed on more than one characteristic (for example, males given drug A
versus females given drug B), it would be impossible to differentiate between the two effects
(gender, drug).

When the data can be partitioned into more than two groups, additional comparisons can be
made. These might involve one aspect or dimension, for example, four groups each representing a
region of the country. Or the groups might vary along several dimensions, for example eight
groups each composed of a gender (two categories) by region (four categories) combination. In
this latter case, we can ask additional questions. (1) Is there a gender difference? (2) Is there a
region difference? (3) Do gender and region interact? Each aspect or dimension the groups differ
on is called a factor. Thus one might discuss a study or experiment involving one, two, even three
or more factors. A factor is represented in the data set as a categorical (nominal) variable and
would be considered an independent variable.

SPSS Statistics allows for multiple factors to be analyzed, and has different procedures available
based on how many factors are involved and their degree of complexity. If only one factor is to
be studied, use the One-Way ANOVA procedure. This is the procedure we will demonstrate in
this appendix to study how highest degree earned relates to average daily TV viewing hours.

When you have two or more factors, the Univariate procedure under the General Linear Model
menu can be used. Other procedures on the General Linear Model menu, such as Multivariate
and Repeated Measures as well as the Linear Mixed Models procedure can be used for more
complex models. These models are beyond the scope of this course; but are covered in the
Advanced Topics: ANOVA course and to some degree in the Advanced Statistical Analysis using
SPSS course.

A.3 Exploring the Data


As in Chapter 7, we begin by applying exploratory data analysis procedures to the variables of
interest. In practice, you would check each group’s summary statistics, looking at the pattern of
the data and noting any unusual points. For brevity in our presentation we will examine only the
boxplot.

Open GSS2004Intro.sav if it is not already open


Click Analyze…Descriptive Statistics…Explore
Move TVHOURS to the Dependent List: box
Move DEGREE to the Factor List: box

Figure A.1 Explore Dialog Box to Compare TV Hours for Degree Groups

The dependent variable is the scale variable of interest and the factor variable is the nominal or
ordinal variable that defines the groups which we want to compare. Since we are comparing
different formal education degree groups, we designate DEGREE as the factor (or nominal
independent variable) and TVHOURS as the dependent variable. We accept the default output. As
we've seen in earlier chapters, you might choose to run histograms rather than stem & leaf plots or
request additional statistics.
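
For reference, this Explore specification corresponds to command syntax roughly like the
following sketch (default plots and statistics):

* Exploratory summaries and plots of TV hours within each degree group.
EXAMINE VARIABLES=TVHOURS BY DEGREE
  /PLOT=BOXPLOT STEMLEAF
  /STATISTICS=DESCRIPTIVES.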

Click OK

An exploratory analysis of TV hours will appear for each degree group. For brevity in this
presentation we move directly to the boxplot.

Figure A.2 Boxplot of TV Hours by Degree Groups

The median hours of daily TV watched appears higher for those with less than a high school
degree and lower for those with graduate degrees. Each group exhibits a positive skew that is
more exaggerated for those with a high school degree or less. Some individuals report watching
rather large amounts of daily TV; one might want to examine the original survey to check for data
errors or evidence of misunderstanding the question. Also, based on the box heights (interquartile
ranges), those with a high school degree or less show greater within-group variation than the
others. This suggests a potential problem with the homogeneity of variance assumption,
especially since the sample sizes are quite disparate (see the Case Processing Summary table).

We also note there is no apparent pattern between the median level and the interquartile range
(for example as one increases so does the other) that might suggest a data transformation to
stabilize the within-group variance. We will come back to this point after testing for homogeneity
of variance. Let’s move on to the actual ANOVA analysis.

A.4 One-Factor ANOVA


To run the analysis:

Click Analyze…Compare Means…One-Way ANOVA


Move TVHOURS to the Dependent List: box
Move DEGREE to the Factor: box

Figure A.3 One-Way ANOVA Dialog Box

We have provided the minimum information to run the basic analysis: one dependent variable and
one factor variable. You could use One-Way ANOVA on more than one dependent variable for
the same factor groups by placing multiple variables in the Dependent List. The Contrasts button
is used to request statistical tests for planned group comparisons of interest. The Post Hoc button
will produce multiple comparison tests that can compare each group mean against every other
one. Such tests facilitate determination of just which groups differ from which others and are
usually performed after the overall analysis establishes that some significant differences exist. We
will use these tests in the next section.

Finally, the Options button controls such diverse features as missing value handling and whether
to display optional output such as descriptive statistics, means plots, and homogeneity tests. We
want to display both descriptive statistics (although having just run Explore, they are not
necessary) and the homogeneity of variance test.

Click Options button


Check Descriptive check box
Check Homogeneity of variance test check box
Check Brown-Forsythe and Welch check boxes

The completed dialog box is shown in Figure A.4. As mentioned earlier, ANOVA assumes
homogeneity of within-group variance. However, when homogeneity does not hold there are
several adjustments that can be made to the F test. We request these optional statistics because
the boxplot indicates that the homogeneity of variance assumption may not hold.

Figure A.4 One-Way ANOVA Options Dialog Box

The missing value choices deal with how missing data are to be handled if you specify several
dependent variables. By default, cases with missing values on a particular dependent variable are
dropped only for the specific analysis involving that variable. Since we are looking at a single
dependent variable, the choice has no relevance to our analysis. The Means plot option will
produce a line chart displaying the group means. This is one type of chart to present the results.
However, we will request an error bar plot later because it shows more information than the
Means line plot.
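
For reference, the choices made so far correspond to command syntax roughly like the following
sketch:

* One-factor ANOVA of TV hours by degree, with descriptives, the Levene test, and robust tests.
ONEWAY TVHOURS BY DEGREE
  /STATISTICS=DESCRIPTIVES HOMOGENEITY BROWNFORSYTHE WELCH
  /MISSING=ANALYSIS.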

Click Continue
Click OK

We now turn to interpretation of the results.

One-Factor ANOVA Results


The One-way output includes the descriptive statistics table, the analysis of variance summary
table, robust tests that do not assume homogeneity of variance, and the probability value(s) we
will use to judge statistical significance. We first review the ANOVA summary table (the default
output) to determine if any of the group means differ from any of the other groups.

Figure A.5 One-Factor ANOVA Summary Table

ANOVA
HOURS PER DAY WATCHING TV

                 Sum of Squares    df   Mean Square        F     Sig.
Between Groups          483.655     4       120.914   19.075     .000
Within Groups          5667.059   894         6.339
Total                  6150.714   898

Most of the information in the ANOVA table (Figure A.5) is technical in nature and is not
directly interpreted. Rather the summaries are used to obtain the F statistic and, more importantly,
the probability value we use in evaluating the population differences.

In the first column there is a row for the between-groups and a row for within-groups variation.
The “df” column contains information about degrees of freedom, related to the number of groups
and the number of individual observations within each group. The degrees of freedom are not
interpreted directly, but are used in calculating the between-group and within-group variation
(variances). Similarly, the sums of squares are intermediate summary numbers used in calculating
the between- and within-group variances. Technically the Between Groups Sum of Squares
represents the sum of the squared deviations of the individual group means around the total
sample mean. The Within Groups Sum of Squares is the sum of the squared deviations of
individual observations around their respective sample group mean. These numbers are never
interpreted and are reported because it is traditional to do so. The Mean Squares are measures of
the between-group and within-group variation (variances). Technically, they are the Sum of
Squares divided by their respective degrees of freedom. Recall in our discussion of the logic of
testing that under the null hypothesis both variances would have the same source and the ratio of
between to within would be about 1. This ratio of the mean square values is the sample F statistic.
The F value in our example is 19.075, very far from 1!
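
Working through the table: the between-groups mean square is 483.655 / 4 = 120.914, the
within-groups mean square is 5667.059 / 894 = 6.339, and the F statistic is their ratio,
120.914 / 6.339 = 19.07 (the displayed 19.075 is computed from the unrounded mean squares).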

Finally, and most readily interpretable, the column labeled “Sig.” provides the probability of
obtaining a sample F ratio as large (or larger) than 19.075 (taking into account the number of
groups and sample size), assuming the null hypothesis that in the population all degree groups
watch the same amount of TV. The probability of obtaining an F value this large (in other words,
of obtaining sample means as far apart as we have), if the null hypothesis were true, is about .000.
This number is rounded when displayed so the actual probability is less than .0005, or less than 5
chances in 10,000 of obtaining sample mean differences so far apart by chance alone. Thus we
have a highly significant difference.

In practice, most researchers move directly to the significance value since the columns containing
the sums of squares, degrees of freedom, mean squares and F statistic are all necessary for the
probability calculation but are rarely interpreted in their own right. To interpret the results we
move to the descriptive information shown in the Descriptives table (Figure A.6) that we
requested as optional output.

Figure A.6 Descriptive Statistics for Groups


Descriptives
HOURS PER DAY WATCHING TV

                                                              95% Confidence Interval for Mean
                    N   Mean   Std. Deviation   Std. Error   Lower Bound   Upper Bound   Minimum   Maximum
LT HIGH SCHOOL    111   4.50            3.697         .351          3.80          5.19         0        20
HIGH SCHOOL       476   2.96            2.561         .117          2.73          3.19         0        20
JUNIOR COLLEGE     79   2.53            1.894         .213          2.11          2.96         0        12
BACHELOR          141   2.17            2.080         .175          1.82          2.52         0        15
GRADUATE           92   1.78            1.333         .139          1.51          2.06         0         6
Total             899   2.87            2.617         .087          2.69          3.04         0        20

The pattern of means is consistent with the boxplot in that those with less formal education watch
more TV than those with more formal education. The 95% confidence bands for the group means
gauge the precision with which we have estimated these values, and we can informally compare
groups by comparing their confidence bands. The minimum and maximum values for each group
are valuable as a data check; we again note some surprisingly large numbers.

Often at this point we are interested in making a statement about which of the five groups differ
significantly from which others. This is because the overall F statistic simply tested the null
hypothesis that all population means were the same. Typically, you now want to make more
specific statements than merely that the five groups are not identical. Post Hoc tests permit these
pairwise group comparisons and we will pursue them later. But first, we must check the
homogeneity of variance assumption by reviewing the tests that we requested under the Options.

Homogeneity of Variance and What to Do If Violated


We also requested the Levene test of homogeneity of variance. This is the same test that we saw
in Chapter 7 in the t test table and is interpreted in the same way.

Figure A.7 Homogeneity of Within-Group Variance

Test of Homogeneity of Variances
HOURS PER DAY WATCHING TV

Levene Statistic   df1   df2   Sig.
          12.015     4   894   .000

Unfortunately the null hypothesis assuming homogeneity of within-group variance is rejected at
the rounded .000 (less than .0005) level. Our sample sizes are quite disparate (see Figure A.6), so
we cannot count on robustness due to equal sample sizes. For this reason we turn to the Brown-
Forsythe and Welch tests, which test for equality of group means without assuming homogeneity
of variance. Since these results will not be calculated by default, you would request them based
on homogeneity tests done in the Explore or One-Way ANOVA procedures. These tests are shown
in Figure A.8.

Figure A.8 Robust Tests of Mean Differences

Robust Tests of Equality of Means
HOURS PER DAY WATCHING TV

                 Statistic(a)   df1       df2   Sig.
Welch                  19.259     4   269.315   .000
Brown-Forsythe         20.508     4   350.785   .000
a. Asymptotically F distributed.

Both of these measures mathematically attempt to adjust for the lack of homogeneity of variance.
When calculating the between-group to within-group variance ratio, the Brown-Forsythe test
explicitly adjusts for heterogeneity of variance by adjusting each group's contribution to the
between-group variation by a weight related to its within-group variation. The Welch test adjusts
the denominator of the F ratio so it has the same expectation as the numerator, when the null
hypothesis is true, despite the heterogeneity of within-group variance.

Both tests indicate there are highly significant differences in average TV viewing between the
education degree groups, which are consistent with the conclusions we drew from the standard
ANOVA.

These robust tests, as noted by the caption, are asymptotic tests, meaning their properties improve
as the sample size increases. Both tests do assume that the distribution is normal. Simulation
work (Brown and Forsythe, 1974) indicates the tests perform well with group sample sizes as
small as 10 and possibly even 5.

As one alternative, a statistically sophisticated analyst might attempt to apply transformations to
the dependent measure in order to stabilize the within-group variances (variance stabilizing
transforms). These are beyond the scope of this course, but interested readers might turn to
Emerson in Hoaglin, Mosteller and Tukey (1991) for a discussion from the perspective of
exploratory data analysis, and note that the spread & level plot in Explore will suggest a variance
stabilizing transform. Box, Hunter and Hunter (1978) contains a brief discussion of such
transformations, and the original (technical) paper was by Box and Cox (1964).

A second alternative would be to perform the analysis using a nonparametric statistical method
that assumes neither normality nor homogeneity of variance (recall the Brown-Forsythe and
Welch tests assume normality of error). A one-factor analysis of group differences assuming that
the dependent measure is only an ordinal (rank) variable is available as a Nonparametric Test
procedure within SPSS Statistics Base. When this analysis was run (see appendix to this chapter
if interested), the group differences were found to be highly significant. This serves as another
confirmation of our result, but corresponding nonparametric procedures are not available for all
analysis of variance models.

In situations in which robust or nonparametric equivalents are not available, many researchers
accept the ANOVA results with a caveat that the reported probability levels are not exactly
correct. In our example, since the significance value was less than .0005, even if we discount the
value by an order or two of magnitude, the result would still be significant at the .05 level. While
these approaches are not entirely satisfactory, and statisticians may disagree as to which would be
best in a given situation, they do constitute the common ways of dealing with the problem.

Having concluded that there are differences in amount of TV viewed among different educational
degree groups, we probe to find specifically which groups differ from which others.

A.5 Post Hoc Testing of Means


Post hoc tests are typically performed only after the overall F test indicates that population
differences exist, although for a broader view see Milliken and Johnson (1984). At this point
there is usually interest in discovering just which group means differ from which others. In one
aspect, the procedure is quite straightforward: every possible pair of group means is tested for
population differences and a summary table produced. However, a problem exists in that as more
tests are performed, the probability of obtaining at least one false-positive result increases. Recall
our discussion of Type I and Type II errors in Chapter 5. As an extreme example, if there are ten
groups, then 45 pairwise group comparisons (n*(n-1)/2) can be made. If we are testing at the .05
level, we would expect to obtain on average about 2 (.05 * 45) false-positive tests. In an attempt
to reduce the false-positive rate when multiple tests of this type are done, statisticians have
developed a number of methods.

Why So Many Tests?


The ideal post hoc test would demonstrate tight control of Type I (false-positive) error, have good
statistical power (probability of detecting true population differences), and be robust over
assumption violations (failure of homogeneity of variance, nonnormal error distributions).
Unfortunately, there are implicit tradeoffs involving some of these desired features (Type I error
and power) and no current post hoc procedure is best in all these areas. Add to this the facts that
pairwise tests can be based on different statistical distributions (t, F, studentized range, and
others) and that Type I error can be controlled at different levels (per individual test, per family of
tests, variations in between), and you have a large collection of post hoc tests.

We will briefly compare post hoc tests from the perspective of being liberal or conservative
regarding control of the false-positive rate (Type I error) and apply several to our data. There is a
full literature (and several books) devoted to the study of post hoc (also called multiple
comparison or multiple range tests, although there is a technical distinction between the two)
tests. Some books (Toothaker, 1991) summarize simulation studies that compare multiple
comparison tests on their power (probability of detecting true population differences) as well as
performance under different scenarios of patterns of group means, and assumption violations
(homogeneity of variance).

The existence of numerous post hoc tests suggests that there is no single approach that
statisticians agree will be optimal in all situations. In some research areas, publication reviewers
require a particular post hoc method, simplifying the researcher’s decision. For more detailed
discussion and recommendations, short books by Klockars and Sax (1986), Toothaker (1991) or
Hsu (1996) are useful. Also, for some thinking on what post hoc tests ought to be doing see
Tukey (1991) or Milliken and Johnson (1984).

Below we present some tests available within SPSS Statistics, roughly ordered from the most
liberal (greater statistical power and greater false-positive rate) to the most conservative (smaller
false-positive rate, less statistical power), and also mention some designed to adjust for lack of
homogeneity of variance.

LSD
The LSD or least significant difference method simply applies standard t tests to all possible pairs
of group means. No adjustment is made based on the number of tests performed. The argument is
that since an overall difference in group means has already been established at the selected
criterion level (say .05), no additional control is necessary. This is the most liberal of the post hoc
tests.

SNK, REGWF, REGWQ & Duncan


The SNK (Student-Newman-Keuls), REGWF (Ryan-Einot-Gabriel-Welsch F), REGWQ (Ryan-
Einot-Gabriel-Welsch Q, based on the studentized range statistic) and Duncan methods involve
sequential testing. After ordering the group means from lowest to highest, the two most extreme
means are tested for a significant difference using a critical value adjusted for the fact that these
are the extremes from a larger set of means. If these means are found not to be significantly
different, the testing stops; if they are different then the testing continues with the next most
extreme set, and so on. All are more conservative than the LSD. REGWF and REGWQ improve
on the traditionally used SNK in that they adjust for the slightly elevated false-positive rate (Type
I error) that SNK has when the set of means tested is much smaller than the full set.

Bonferroni & Sidak


The Bonferroni (also called the Dunn procedure) and Sidak (also called Dunn-Sidak) perform
each test at a stringent significance level to ensure that the family-wise (applying to the set of
tests) false-positive rate does not exceed the specified value. They are based on inequalities
relating the probability of a false-positive result on each individual test to the probability of one
or more false positives for a set of independent tests. For example, the Bonferroni is based on an
additive inequality, so the criterion level for each pairwise test is obtained by dividing the original
criterion level (say .05) by the number of pairwise comparisons made. Thus with five means, and
therefore ten pairwise comparisons, each Bonferroni test will be performed at the .05/10 or .005
level.

Tukey (b)
The Tukey (b) test is a compromise test, combining the Tukey (see next test) and the SNK
criteria, producing a test result that falls between the two.

Tukey
Tukey’s HSD (Honestly Significant Difference; also called Tukey HSD, WSD, or Tukey(a) test)
controls the false-positive rate family-wise. This means if you are testing at the .05 level, that
when performing all pairwise comparisons, the probability of obtaining one or more false
positives is .05. It is more conservative than the Duncan and SNK. If all pairwise comparisons are
of interest, which is usually the case, Tukey’s test is more powerful than the Bonferroni and
Sidak.

Scheffe
Scheffe’s method also controls the family-wise error rate. It adjusts not only for the pairwise
comparisons, but also for any possible comparison the researcher might ask. As such it is the
most conservative of the available methods (false-positive rate is least), but has less statistical
power.

Specialized Post Hoc Tests

Hochberg’s GT2 & Gabriel: Unequal Ns


Most post hoc procedures mentioned above (excepting LSD, Bonferroni & Sidak) were derived
assuming equal group sample sizes in addition to homogeneity of variance and normality of error.
When the subgroup sizes are unequal, SPSS Statistics substitutes a single value (the harmonic
mean) for the sample size. Hochberg’s GT2 and Gabriel’s post hoc test explicitly allow for
unequal sample sizes.

Waller-Duncan
The Waller-Duncan takes an approach (Bayesian) that adjusts the criterion value based on the
size of the overall F statistic in order to be sensitive to the types of group differences associated
with the F (for example, large or small). Also, you can specify the ratio of Type I (false positive)
to Type II (false negative) error in the test. This feature allows for adjustments if there are
differential costs to the two types of errors.

Unequal Variances and Unequal Ns

Tamhane T2, Dunnett’s T3, Games-Howell, Dunnett’s C


Each of these post hoc tests adjusts for unequal variances and sample sizes in the groups.
Simulation studies (summarized in Toothaker, 1991) suggest that although Games-Howell can be
too liberal when the group variances are equal and sample sizes are unequal, it is more powerful
than the others.

An approach some analysts take is to run both a liberal (say LSD) and a conservative (Scheffe or
Tukey HSD) post hoc test. Group differences that show up under both criteria are considered
solid findings, while those found different only under the liberal criterion are viewed as tentative
results.

To illustrate the differences among the post hoc tests we will request three: one liberal (LSD),
one midrange (REGWF), and one conservative (Scheffe). In addition, since homogeneity of
variance does not hold in these data, we request the Games-Howell and would pay serious
attention to its results. Ordinarily, a researcher would not run this many different tests, although
running several tests and comparing their results is often useful, as we will see. Given the
homogeneity of variance violation in our data, in practice only the Games-Howell might be run.

Click on the Dialog Recall tool, then click One-Way ANOVA


Click on the Post Hoc button
Click LSD (Least Significant Difference), R-E-G-W F (Ryan-Einot-Gabriel-Welsch F),
Scheffe and Games-Howell check boxes

The completed dialog box is shown in Figure A.9.


Figure A.9 Post Hoc Testing Dialog Box

Click Continue
Click OK

By default, statistical tests will be done at the .05 level. If you prefer to use a different alpha value
(for example, .01), you can specify it in the Significance level box.
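
If you prefer to work from command syntax, clicking Paste instead of OK in the One-Way ANOVA
dialog generates ONEWAY syntax along the lines of the sketch below. The variable names tvhours
and degree follow the data file used in this appendix; the post hoc keywords shown (FREGW for the
R-E-G-W F test, GH for Games-Howell) reflect our reading of the ONEWAY keyword list and are
worth confirming in the syntax help for your version.

   * Sketch: one-factor ANOVA of TV hours by degree with four post hoc tests.
   ONEWAY tvhours BY degree
     /STATISTICS DESCRIPTIVES HOMOGENEITY WELCH BROWNFORSYTHE
     /MISSING ANALYSIS
     /POSTHOC=LSD FREGW SCHEFFE GH ALPHA(0.05).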

The beginning part of the output contains the ANOVA table, robust test of mean differences,
descriptive statistics, and homogeneity test which we have already reviewed. We will move
directly to the post hoc test results.

Note
Some of the pivot tables shown in this section were edited (changed column widths; only one post
hoc method shown in some figures) to better display in the course guide.


Figure A.10 Least Significant Difference Post Hoc Results

Multiple Comparisons

Dependent Variable: HOURS PER DAY WATCHING TV


LSD

(I) RS HIGHEST DEGREE   (J) RS HIGHEST DEGREE   Mean Difference (I-J)   Std. Error   Sig.   95% CI Lower Bound   95% CI Upper Bound
LT HIGH SCHOOL HIGH SCHOOL 1.540* .265 .000 1.02 2.06
JUNIOR COLLEGE 1.964* .371 .000 1.24 2.69
BACHELOR 2.325* .319 .000 1.70 2.95
GRADUATE 2.713* .355 .000 2.02 3.41
HIGH SCHOOL LT HIGH SCHOOL -1.540* .265 .000 -2.06 -1.02
JUNIOR COLLEGE .424 .306 .166 -.18 1.02
BACHELOR .786* .241 .001 .31 1.26
GRADUATE 1.173* .287 .000 .61 1.74
JUNIOR COLLEGE LT HIGH SCHOOL -1.964* .371 .000 -2.69 -1.24
HIGH SCHOOL -.424 .306 .166 -1.02 .18
BACHELOR .361 .354 .307 -.33 1.06
GRADUATE .749 .386 .053 -.01 1.51
BACHELOR LT HIGH SCHOOL -2.325* .319 .000 -2.95 -1.70
HIGH SCHOOL -.786* .241 .001 -1.26 -.31
JUNIOR COLLEGE -.361 .354 .307 -1.06 .33
GRADUATE .388 .337 .251 -.27 1.05
GRADUATE LT HIGH SCHOOL -2.713* .355 .000 -3.41 -2.02
HIGH SCHOOL -1.173* .287 .000 -1.74 -.61
JUNIOR COLLEGE -.749 .386 .053 -1.51 .01
BACHELOR -.388 .337 .251 -1.05 .27
*. The mean difference is significant at the .05 level.

The rows are formed by every possible combination of groups. For example, at the top of the
pivot table the “Less than High School” group is paired with each of the other four. The column
labeled “Mean Difference (I-J)” contains the sample mean difference between each pairing of
groups. We see the “Less than High School” and Graduate groups have a mean difference of 2.71
hours of daily TV viewing. If this difference is statistically significant at the specified level after
applying the post hoc adjustments (none for LSD), then an asterisk (*) appears beside the mean
difference. Notice the actual significance value for the test appears in the column labeled “Sig.”.

Thus, the first block of LSD results indicates that in the population those with less than high
school degrees differ significantly in daily TV viewing from each of the other four groups.

In addition, the standard errors and 95% confidence intervals for each mean difference appear.
These provide information on the precision with which we have estimated the mean differences.
Note that, as you would expect, if a mean difference is not significant, the confidence interval
includes 0.

Also notice that each pairwise comparison appears twice (for example: high school - college
degree and also college degree - high school). For each such duplicate pair the significance value
is the same, but the signs are reversed for the mean difference and confidence interval values.


Summarizing the entire table in Figure A.10, we would say that the lowest degree group (less
than high school) differs in amount of TV viewed daily from all other groups, and that those with
high school degrees differ from all other groups except junior college; in general, those with higher
degrees watch less TV. The three highest degree groups do not differ from each other. Since LSD
is the most liberal of the post hoc tests, we are interested to learn whether the same results hold
under more conservative criteria.

Figure A.11 Homogeneous Subsets Results for REGWF Post Hoc Tests

The REGWF results in Figure A.11 are not presented in the same format as we saw for the LSD.
This is because for some of the post hoc test methods (for example, the sequential or multiple-
range tests) standard errors and 95% confidence intervals for all pairwise comparisons are not
defined. Rather than display pivot tables with empty columns, a different format, homogeneous
subsets, is used. A homogeneous subset is a set of groups for which no pair of group means
differs significantly (the Sig. value at the bottom of the column will be above the alpha criterion
of .05). This format is closer in spirit to the nature of the sequential tests actually performed by
REGWF. Depending on the post hoc test requested, SPSS Statistics will display a multiple
comparison table, a homogeneous subset table, or both. Recall the REGWF tests first the most
extreme, then the less extreme means, adjusting for the number of means in the comparison set.

Viewing the REGWF portion of the table, we see three homogeneous subsets (three columns).
The first is composed of the graduate, bachelor, and junior college groups; they do not differ from
one another, although each of them differs from at least one of the remaining groups. This result
is consistent with the LSD tests. The second subset is composed of the junior college and high
school groups (they do not differ significantly). This result is also consistent with the LSD results.
Notice that the third homogeneous subset contains only one group (less than high school). This is
because that group of respondents differs from each of the others on television viewing (also
consistent with the LSD results). The homogeneous subset pivot table thus displays where
population differences do not exist (and by inference, where they do).


A homogeneous subset summary appears for the Scheffe test as well (Figure A.11). The results
are similar, except for subset 2, where bachelor is added to the homogeneous group with junior
college and high school. This is consistent with Scheffe being a more conservative test (smaller
false-positive rate) than the LSD or REGWF. Thus, notice that under the Scheffe test, the high
school and bachelor populations are not found to be significantly different.

Figure A.12 Scheffe Post Hoc Results

Multiple Comparisons

Dependent Variable: HOURS PER DAY WATCHING TV


Scheffe

(I) RS HIGHEST DEGREE   (J) RS HIGHEST DEGREE   Mean Difference (I-J)   Std. Error   Sig.   95% CI Lower Bound   95% CI Upper Bound
LT HIGH SCHOOL HIGH SCHOOL 1.540* .265 .000 .72 2.36
JUNIOR COLLEGE 1.964* .371 .000 .82 3.11
BACHELOR 2.325* .319 .000 1.34 3.31
GRADUATE 2.713* .355 .000 1.62 3.81
HIGH SCHOOL LT HIGH SCHOOL -1.540* .265 .000 -2.36 -.72
JUNIOR COLLEGE .424 .306 .750 -.52 1.37
BACHELOR .786* .241 .032 .04 1.53
GRADUATE 1.173* .287 .002 .29 2.06
JUNIOR COLLEGE LT HIGH SCHOOL -1.964* .371 .000 -3.11 -.82
HIGH SCHOOL -.424 .306 .750 -1.37 .52
BACHELOR .361 .354 .903 -.73 1.45
GRADUATE .749 .386 .440 -.44 1.94
BACHELOR LT HIGH SCHOOL -2.325* .319 .000 -3.31 -1.34
HIGH SCHOOL -.786* .241 .032 -1.53 -.04
JUNIOR COLLEGE -.361 .354 .903 -1.45 .73
GRADUATE .388 .337 .858 -.65 1.43
GRADUATE LT HIGH SCHOOL -2.713* .355 .000 -3.81 -1.62
HIGH SCHOOL -1.173* .287 .002 -2.06 -.29
JUNIOR COLLEGE -.749 .386 .440 -1.94 .44
BACHELOR -.388 .337 .858 -1.43 .65
*. The mean difference is significant at the .05 level.

A careful observer will notice that the Scheffe multiple comparison results in Figure A.12 are not
completely consistent with the homogeneous subset results (Figure A.11). The multiple
comparisons results indicate that the high school group differs significantly from the bachelor
group (p=.03), while the homogeneous subset 2 indicates they do not. Here a slightly different
sample size adjustment (for homogeneous subsets, sample size is set to be the harmonic mean of
all groups, while for multiple comparison tables the default is to compute harmonic means on a
two-group (pairwise) basis) produces different conclusions.

This is not an uncommon result when doing post hoc testing, because of the different
assumptions and methods of the various tests, and it is one reason why investigators often
request multiple tests to compare and contrast the results.


Figure A.13 Games-Howell Post Hoc Results

Multiple Comparisons

Dependent Variable: HOURS PER DAY WATCHING TV


Games-Howell

(I) RS HIGHEST DEGREE   (J) RS HIGHEST DEGREE   Mean Difference (I-J)   Std. Error   Sig.   95% CI Lower Bound   95% CI Upper Bound
LT HIGH SCHOOL HIGH SCHOOL 1.540* .370 .001 .52 2.56
JUNIOR COLLEGE 1.964* .411 .000 .83 3.10
BACHELOR 2.325* .392 .000 1.24 3.41
GRADUATE 2.713* .377 .000 1.67 3.76
HIGH SCHOOL LT HIGH SCHOOL -1.540* .370 .001 -2.56 -.52
JUNIOR COLLEGE .424 .243 .411 -.25 1.10
BACHELOR .786* .211 .002 .21 1.36
GRADUATE 1.173* .182 .000 .67 1.67
JUNIOR COLLEGE LT HIGH SCHOOL -1.964* .411 .000 -3.10 -.83
HIGH SCHOOL -.424 .243 .411 -1.10 .25
BACHELOR .361 .276 .685 -.40 1.12
GRADUATE .749* .254 .031 .05 1.45
BACHELOR LT HIGH SCHOOL -2.325* .392 .000 -3.41 -1.24
HIGH SCHOOL -.786* .211 .002 -1.36 -.21
JUNIOR COLLEGE -.361 .276 .685 -1.12 .40
GRADUATE .388 .224 .416 -.23 1.00
GRADUATE LT HIGH SCHOOL -2.713* .377 .000 -3.76 -1.67
HIGH SCHOOL -1.173* .182 .000 -1.67 -.67
JUNIOR COLLEGE -.749* .254 .031 -1.45 -.05
BACHELOR -.388 .224 .416 -1.00 .23
*. The mean difference is significant at the .05 level.

The Games-Howell multiple comparison test (Figure A.13) adjusts for both unequal variances
(determined to be present by the Levene test earlier) and unequal sample sizes. Its results largely
agree with those of the other tests, with one notable exception: Games-Howell finds the junior
college and graduate groups to be statistically distinct (p=.031), a difference the other tests did
not detect.

These results are surprisingly consistent given the lack of homogeneity and the unequal cell sizes,
but we do have a large sample. When results differ, as they often do (and often more than in our
example), what is the true situation? We don't know. Your original choice of a post hoc test
should be based on how you want to balance power against the false-positive rate. Here, under
the tests that assume equal variances (even the liberal LSD) we would conclude that there is no
significant difference between the junior college and graduate groups in the amount of TV they
watch. But only Games-Howell adjusts for the unequal variances across degree groups, and it
found that these two groups differed. As a consequence, some researchers would report the junior
college versus graduate difference as a tentative result; others, preferring the Games-Howell test,
would state that there is a difference. On the other hand, there are other comparisons in which we
can have more confidence. All four tests found that those with a bachelor's or graduate degree
watch less TV than those with a high school degree or less. And in all these comparisons, never
lose sight of practical or substantive significance: are the amounts of television viewing by each
group different enough to be of practical importance?


The bottom line is that your choice of post hoc test should reflect your preference for the
power/false-positive tradeoff and your evaluation of how well the data meet the assumptions of
the analysis, and you then live with the results of that choice. Such books as Toothaker (1991) or Hsu
(1996) and their references evaluate the various post hoc tests on the basis of theoretical and
Monte Carlo considerations.

A.6 Graphing the Mean Differences


For presentations it is helpful to display the sample group means along with their 95% confidence
bands. We saw such error bar charts in the two-group case and they are useful here as well. To
create an error bar chart for TV hours grouped by respondent’s highest degree:

Click Graphs…Chart Builder


Click OK in the Information box
Click Reset
Click Gallery tab (if it's not already selected)
Click Bar in the Choose from: list

Select the icon for Simple Error Bar (usually third icon in second row) and drag
it to the Chart Preview canvas
Drag and drop DEGREE from the Variables: list to the X-Axis? area in the Chart Preview
canvas
Drag and drop TVHOURS from the Variables: list to the Y-Axis? area in the Chart
Preview canvas

The completed Chart Builder is shown in Figure A.14.
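
The same chart can also be produced from syntax. Chart Builder pastes GGRAPH/GPL code, but a
simpler sketch using the legacy GRAPH command (variable names as above) is:

   * Sketch: error bar chart (95% confidence intervals) of TV hours by degree.
   GRAPH
     /ERRORBAR(CI 95)=tvhours BY degree.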


Figure A.14 Chart Builder for Error Bar Chart

Click OK

The chart, shown in Figure A.15, provides a visual sense of how far the groups are separated. The
confidence bands are determined for each group separately and no adjustment is made based on
the number of groups that are compared or their (unequal) variances. From the graph we have a
clear sense of the relation between formal education degree and TV viewing.


Figure A.15 Error Bar Chart of TV Hours by Degree Group

A.7 Appendix: Group Differences on Ranks


Analysis of variance assumes that the distribution of the dependent measure within each group
follows a normal curve and that the within-group variation is homogeneous across groups. If any
of these assumptions fail in a gross way, as an alternative, you can sometimes apply techniques
that make fewer assumptions about the data. We saw such a variation when we applied tests that
didn’t assume homogeneity of variance but did assume normality (Brown-Forsythe and Welch).
However, what if neither homogeneity nor normality assumptions were met? In this case,
nonparametric statistics are available; they don’t assume specific data distributions described by
parameters such as the mean and standard deviation. Since these methods make few if any
distributional assumptions, they can often be applied when the usual assumptions are not met.
They can also be used with ordinal level measures.

The downside of such methods is that if the stronger data assumptions hold, then nonparametric
techniques are generally less powerful (probability of finding true differences) than the
appropriate parametric method. Second, there are some parametric statistical analyses that
currently have no corresponding nonparametric method. It is fair to say that boundaries separating
where one would use parametric versus nonparametric methods are in practice somewhat vague,
and statisticians can and often do disagree about which approach is optimal in a specific situation.
For more discussion of the common nonparametric tests see Daniel (1978), Siegel and Castellan
(1988) or Wilcox (1997).


Because of our concerns about the lack of homogeneity of variance and normality of TV hours
viewed for the different degree groups, we will use a nonparametric procedure—the Kruskal-
Wallis test—that only assumes that the dependent measure has ordinal (rank order) properties.

The basic logic behind this test is straightforward. If we rank order the dependent measure across
the entire sample, we would expect under the null hypothesis (no population differences) that the
mean rank (technically the sum of the ranks adjusted for sample size) should be the same for each
sample group. The Kruskal-Wallis test calculates the ranks, the sample group mean ranks, and the
probability of obtaining average ranks (weighted summed ranks) as far apart as, or further apart
than, those observed in the sample if the population groups were identical.

To run the Kruskal-Wallis test, we declare TVHOURS as the test variable (from which ranks are
calculated) and DEGREE as the independent or grouping variable.

Click Analyze…Nonparametric Tests…K Independent Samples


Move TVHOURS into the Test Variable List: box
Move DEGREE into the Grouping Variable: box

Note that the minimum and maximum value of the grouping variable must be specified using the
Define Range pushbutton.

Click the Define Range pushbutton


Enter 0 as the Minimum and 4 as the Maximum

Figure A.16 Analysis of Ranks Dialog Box

Click Continue
Click OK

By default, the Kruskal-Wallis test will be performed. The Kruskal-Wallis is the most commonly
used nonparametric test for this situation. However, two additional statistical tests are available
and you can choose to run all three if you want.
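
For reference, the Paste button in this dialog produces NPAR TESTS syntax; a minimal sketch
(variable names as above, with the group range 0 through 4 as specified in Define Range) is:

   * Sketch: Kruskal-Wallis test of TV hours across the five degree groups.
   NPAR TESTS
     /K-W=tvhours BY degree(0 4)
     /MISSING ANALYSIS.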


Figure A.17 Results of Kruskal-Wallis Nonparametric Analysis

The results are displayed in the two tables shown in Figure A.17. In the first table, we see that the
pattern of mean ranks (remember, smaller ranks imply less TV watched) follows that of the original
means of TVHOURS: the higher the degree, the less TV watched. The chi-square statistic used in
the Kruskal-Wallis test indicates that we are very unlikely (less than .0005, or 5 chances in
10,000) to obtain samples with average ranks so far apart if the null hypothesis (same distribution
of TV hours in each group) were true. Based on this result, we are now much more confident in
our original conclusion about overall mean differences, because we were able to confirm that
population differences exist without making all the assumptions required for analysis of variance.

As we noted earlier, there are no nonparametric equivalents for the post hoc tests, so if we are
interested in which groups differ from each other, we would still have to rely on the post hoc tests
in the One-way ANOVA procedure.


Summary Exercises
We will continue our investigation (from Chapter 7) of TV watching hours (TVHOURS) and the
number of hours using the web (WWWHR). We want to see whether the average number of hours
watching TV or the number of hours using the web differ by marital status.

1. Run exploratory data analysis on TVHOURS and WWWHR by MARITAL. (Hint: Don't
forget to select the "Pairwise Deletion" option.) Are any of the variables normally
distributed? What differences do you notice in the means and standard deviations for
each group? Is the homogeneity of variance assumption likely to be met? You might use
Chart Builder to produce a paneled histogram of each variable by marital status.
2. Run a One-way ANOVA looking at mean differences by marital status for these two
variables. Request the test for homogeneity and the robust measures. Interpret the results.
Which variables met the homogeneity of variance assumption? Are the means of any of
the variables significantly different for marital status groups?
3. Run Post Hoc Tests selecting a liberal test, such as LSD, a more conservative test, such
as Scheffe, and the Games-Howell if the variables did not meet the homogeneity of
variance criterion. Which groups are significantly different from which other groups? Do
the tests agree? If not, how might you summarize the results? (A syntax sketch covering
exercises 2 and 3 appears after this list.)
4. Use Chart Builder to display an error bar chart for each of these analyses.
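
If you would rather set up exercises 2 and 3 from syntax, a sketch along these lines could serve as
a starting point (variable names as used in this chapter; post hoc keywords, for example GH for
Games-Howell, are worth confirming in the syntax help):

   * Sketch: one-factor ANOVAs of TV and web hours by marital status with post hoc tests.
   ONEWAY tvhours wwwhr BY marital
     /STATISTICS DESCRIPTIVES HOMOGENEITY WELCH BROWNFORSYTHE
     /POSTHOC=LSD SCHEFFE GH ALPHA(0.05).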

For those with extra time:

In the Chapter 7 exercises, you looked at the age when the respondent's first child was born
(AGEKDBRN) and the number of household members (HHSIZE) by gender. Would you expect the
average of either of these variables to differ across education degree groups or marital status
categories? Test your expectation by running the appropriate ANOVA and interpret your results.



Appendix B: Introduction to Multiple Regression
Topics
• Running Multiple Regression
• Interpreting the Results
• Residuals and Outliers

Data
This chapter uses the Bank.sav data file.

B.1 Multiple Regression


In Chapter 9, we produced a predictive equation for beginning salary based upon education level.
Compensation analysts more often build such equations using several predictor variables instead
of the single variable approach we have used. This approach is called multiple regression.
Additionally we want to assess how well the equation fits the data and view diagnostics to assess
the regression assumptions. Since we have measured several variables that might be related to
beginning salary, we will add additional predictor (independent) variables to the equation,
evaluate the improvement in fit, and interpret the equation coefficients. In a successful analysis
we would obtain an equation useful in predicting starting salary based on background
information, and understand the relative contribution of each predictor variable.

Multiple regression represents a direct extension of simple regression. Instead of a single
predictor variable, multiple regression allows for more than one independent variable (Y = a +
b1*X1 + b2*X2 + b3*X3 + ...) in the prediction equation. While we are limited in the number of
dimensions we can view in a single plot (SPSS Statistics can build a 3-dimensional scatterplot),
the regression equation allows for many independent variables. When we run multiple regression
we will again be concerned with how well the equation fits the data, whether there are any
significant linear relations, and estimating the coefficients for the best-fitting prediction equation.
In addition, we are interested in the relative importance of the independent variables in predicting
the dependent measure.

In our example, we expand our prediction model of beginning salary to include years of formal
education (edlevel), years of previous work experience (work), age, and gender (sex). Gender is a
dichotomous variable coded 0 for males and 1 for females. As such (recall our earlier discussion),
it can be included as an independent variable in regression. Its regression coefficient will indicate
the relation between gender and beginning salary, adjusting for the effects of the other
independent variables.

To run the analysis, we open the Linear Regression dialog box and add the additional independent
variables to the Independent Variables list.


Click Analyze…Regression…Linear…
Move salbeg to the Dependent box
Move edlevel, sex, work, and age to the Independent(s): box

Figure B.1 Setting Up Multiple Regression

Since the four independent variables will be entered as a single block (we are at block 1 of 1), the
order in which we list the variables will not affect the analysis, but Regression will maintain this
order when presenting results.

Residual Plots
While we can run the multiple regression at this point, we will request some diagnostic plots
involving residuals and information about outliers. By default no residual plots will appear. These
options are explained below.

Click Plots

Within the Plots dialog box:

Click the Histogram check box in the Standardized Residual Plots area
Move *ZRESID into the Y: box
Move *ZPRED into the X: box


Figure B.2 Regression Plots Dialog Box

The options in the Standardized Residual Plots area of the dialog box all involve plots of
standardized residuals. Ordinary residuals are useful if the scale of the dependent variable is
meaningful, as it is here (beginning salary in dollars). Standardized residuals are helpful if the
scale of the dependent is not familiar (say a 1 to 10 customer satisfaction scale). By this we mean
that it may not be clear to the analyst just what constitutes a large residual; is an overprediction
of 1.5 units a large miss on a 1 to 10 scale? In such situations, standardized residuals (residuals
expressed in standard deviation units) are very useful because large prediction errors can be easily
identified. If the errors follow a normal distribution, then standardized residuals greater than 2 (in
absolute value) should occur about 5% of the time, and those greater than 3 (in absolute value)
should happen less than 1% of the time. Thus standardized residuals provide a norm against
which one can judge what constitutes a large residual. We requested a histogram of the
standardized residuals; note that a normal probability plot is available as well. Recall that the F
and t tests in regression assume that the residuals follow a normal distribution.

Regression can produce summaries concerning various types of residuals. Without going into all
these possibilities, we request a scatterplot of the standardized residuals (*ZRESID) versus the
standardized predicted values (*ZPRED). An assumption of regression is that the residuals are
independent of the predicted values, so if we see any patterns (as opposed to a random blob) in
this plot, it might suggest a way of adjusting and improving the analysis.

Click Continue

Next we will look at the Statistics dialog box. The Casewise Diagnostics choice appears here.
When this option is checked, Regression will list information about all cases whose standardized
residuals are more than 3 standard deviations from the line. This outlier criterion is under your
control.

Click Statistics
Click the Casewise diagnostics check box in the Residuals area


Figure B.3 Regression Statistics Dialog Box

Other statistics such as the 95% confidence interval for the B (regression) coefficients can be
requested.

Click Continue
Click OK
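
Equivalently, clicking Paste instead of OK generates REGRESSION syntax along these lines (a
sketch using the Bank.sav variable names; default subcommands such as /CRITERIA are omitted):

   * Sketch: multiple regression of beginning salary with residual plots and casewise outliers.
   REGRESSION
     /MISSING LISTWISE
     /STATISTICS COEFF OUTS R ANOVA
     /DEPENDENT salbeg
     /METHOD=ENTER edlevel sex work age
     /SCATTERPLOT=(*ZRESID,*ZPRED)
     /RESIDUALS HISTOGRAM(ZRESID)
     /CASEWISE PLOT(ZRESID) OUTLIERS(3).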

B.2 Multiple Regression Results


We now turn to the results of our multiple regression run.

Recall that listwise deletion of missing data has occurred; that is, if a case is missing data on any
of the five variables used in the regression, it will be dropped from the analysis. If this results in
heavy data loss, other choices for handling missing values are available in the Regression Options
dialog box (see also the SPSS Missing Values add-on module for multiple variable imputation
methods for estimating missing data values).

In Figure B.4 the dependent and independent variables are listed, followed by the Model
Summary table. The R-square statistic is about .49, indicating that with these four predictor
variables we can account for about 49% of the variation in beginning salaries. Education alone
had an R-square of .40, so the additional set of three predictors added only 9%: an
improvement, but a modest one. The Adjusted R-square is quite close to the R-square.
The standard error has dropped from $2,439 (with just education as a predictor) to $2,260: again,
an improvement, but not an especially large one.


Figure B.4 Variable Summary and Fit Measures

Next we turn to the ANOVA table.

Figure B.5 ANOVA Table

Since there are four independent variables, the F statistic tests whether any of the variables have a
linear relationship with beginning salary. Not surprisingly, since we already know from the
analysis in Chapter 9 that education is significantly related to beginning salary, the result is highly
significant.


Figure B.6 Multiple Regression Coefficients, Bs and Betas

In the Coefficients table, the independent variables appear in the order they were given in the
Regression dialog box, not in order of importance. Although the B coefficients are important for
prediction and interpretive purposes, analysts usually look first to the t test at the end of each line
to determine which independent variables are significantly related to the outcome measure. Since
four variables are in the equation, we are testing if there is a linear relationship between each
independent variable and the dependent measure after adjusting for the effects of the three other
independent variables. Looking at the significance values we see that education and gender are
highly significant (less than .0005), age is significant at the .05 level (p=.035), while work
experience is not linearly related to beginning salary (after controlling for the other predictors).
Thus we can drop work experience as a predictor. It may seem odd that work experience is not
related to salary, but since many of the positions were clerical, work experience may not play a
large role. Typically, you would rerun the regression after removing variables not found to be
significant, but we will proceed and interpret this output.

The estimated regression (B) coefficient for education is about $651, similar but not identical to
the coefficient ($691) found in the simple regression using formal education alone. In the simple
regression we estimated the B coefficient for education ignoring any other effects, since none
were included in the model. Here we evaluate the effect of education after controlling
(statistically adjusting) for age, work experience and gender. If the independent variables are
correlated, the change in B coefficient from simple to multiple regression can be substantial. So,
after controlling (holding constant) age, work experience and gender, a year of formal education,
on average, was worth $651 in starting salary. The gender variable has a B coefficient of –$1526.
This means that a one-unit change in gender (moving from male status to female status),
controlling for the other independent variables in the equation, is associated with a drop (negative
coefficient) in beginning salary of $1,526. To put it more plainly, females had a beginning salary
$1,526 lower than men, controlling for the other three variables in the equation. Age has a B
coefficient of $33, so a one-year increase in age (controlling for the other variables) was
associated with a $33 beginning salary increase. Since we found work experience not to be
significantly different from zero, we treat it as if it were zero. The constant or intercept term is
still negative, and would correspond to the predicted salary for a male (sex=0) with 0 years of
education, 0 years of work experience and whose age is 0—not a realistic combination. The
standard errors again provide precision measures for the regression coefficient estimates.

If we simply look at the estimated B coefficients we might think that gender is the most important
variable. However, the magnitude of the B coefficient is influenced by the standard deviation of
the independent variable. For example, sex takes on only two values (0 and 1), while education


values range from 8 years to over 20 years. The Beta coefficients explicitly adjust for such
standard deviation differences in the independent variables. They indicate what the regression
coefficients would be if all variables were standardized to have means of 0 and standard
deviations of 1. A Beta coefficient thus indicates the expected change (in standard deviation
units) of the dependent variable per one standard deviation unit increase in the independent
variable (after adjusting for other predictors). This provides a means of assessing relative
importance of the different predictor variables in multiple regression. The Betas are normed so
that the maximum should be less than or equal to one in absolute value (if any Betas are above 1
in absolute value, it suggests a problem with the data: multi-collinearity). Examining the Betas,
we see that education is the most important predictor, followed by gender, and then age. The Beta
for work experience is very near zero, as we would expect.
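
If you want to verify a Beta by hand, it is simply the B coefficient rescaled by the ratio of standard
deviations: Beta = B x (standard deviation of the predictor / standard deviation of the dependent
variable). The standard deviations can be obtained with a short Descriptives run such as the sketch
below; note that Descriptives uses all valid cases for each variable, so the hand calculation will be
only approximate if there are missing data.

   * Sketch: standard deviations needed to rescale a B coefficient into a Beta by hand.
   DESCRIPTIVES VARIABLES=salbeg edlevel sex work age
     /STATISTICS=MEAN STDDEV.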

If we needed to predict beginning salary from these background variables (dropping work
experience) we would use the B coefficients. Rounding to whole numbers, we would say:

salbeg = 651*edlevel – 1526*sex + 33*age – 2666.
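
As a quick check of the equation, a 30-year-old woman (sex = 1) with 16 years of education would
be predicted to start at about 651(16) - 1526(1) + 33(30) - 2666, or roughly $7,214. To score every
case in the file with this rounded equation, a sketch is shown below (predsal is a hypothetical new
variable name):

   * Sketch: apply the rounded prediction equation to each case.
   COMPUTE predsal = 651*edlevel - 1526*sex + 33*age - 2666.
   EXECUTE.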

B.3 Residuals and Outliers


In Figure B.7, we see those observations more than three standard deviations from the regression
fit line; assuming a normal distribution, this would happen less than 1% of the time by chance
alone. In this data file that would be about 5 outliers (.01*474), so the seven cases listed do not
seem excessive. However, it is interesting to note that all the large residuals are positive, and
some of them are quite substantial. Residuals should normally be balanced between positive and
negative values; when they are not, you should investigate the data further, so the next step would
be to see if these observations have anything in common (same job category perhaps, which may
be out of line with the others regarding salary). Since we know their case numbers (an ID variable
can be substituted), we could look at them more closely.

Figure B.7 Casewise Listing of Outliers

We see the distribution of the residuals with a normal bell-shaped curve superimposed in Figure
B.8. The residuals are a bit too concentrated in the center (notice the peak) and are skewed; notice
the long tail to the right. Given this pattern, a technical analyst might try a data transformation on
the dependent measure (taking logs), which might improve the properties of the residual


distribution. However, just as with ANOVA, larger sample sizes protect against moderate
departures from normality, and our sample size here should be adequate. Overall, the distribution
is not too bad, but there are clearly some outliers in the tail; these also show up in the casewise
outlier summary.

Figure B.8 Histogram of the Residuals

In the scatterplot of residuals (Figure B.9), we hope to see a horizontally oriented blob of points
with the residuals showing the same spread across different predicted values. Unfortunately, we
see a hint of a curving pattern: the residuals seem to slowly decrease then swing up at the end.
This type of pattern can emerge if the relationship is curvilinear, but a straight line is fit to the
data. Also, the spread of the residuals is much more pronounced at higher predicted values than at
the lower ones. This suggests lack of homogeneity of variance. Such a pattern is common with
economic data: there is greater variation at larger values. At this point, the analyst should think
about adjustments to the equation. Given the lack of homogeneity, the suggestion of
curvilinearity, and knowing that the dependent measure is income, an experienced regression user
would probably perform a log transform on beginning salary and rerun the analysis. This is not to
suggest that such an adjustment should occur to you at this stage; the main point is that you
should look at residual plots to check the assumptions of regression, and you may find hints there
on how to improve your model.
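
If you did want to pursue the log transformation suggested above, a minimal sketch is shown below
(lnsalbeg is a hypothetical new variable name); you would then rerun the regression with the new
variable as the dependent measure and re-examine the residual plots.

   * Sketch: natural log of beginning salary for a follow-up regression run.
   COMPUTE lnsalbeg = LN(salbeg).
   EXECUTE.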


Figure B.9 Scatterplot of Residuals and Predicted Values

Summary of Regression Results


Overall, the regression analysis was successful in that we can predict about 49% of the variation
in beginning salary from education, gender and age. We found that education is the best predictor,
but that gender played a substantial role. Examination of residual summaries suggested that a
straight line may not be the best function to fit (this was also evident from the scatterplot in
Figure 9.1), and there were several large positive residuals that should be checked more carefully.


Summary Exercises
Using the Bank data,

1. Use linear regression to develop a prediction equation for current salary (salnow), using
as predictors: age, edlevel, minority, salbeg, sex, time, and work. Request a histogram of
the errors, the scatterplot of residuals, and casewise diagnostics. What variables are
significant predictors of current salary? Which variable is the strongest predictor? The
weakest? Do the results make sense? What is the prediction equation?

2. Are there any problems with the assumptions for linear regression, such as homogeneity
of variance? Does a linear model fit the data?

For those with extra time:

1. Rerun the regression with only the significant variables. Do the coefficients (B) change
much or not?


References

Introductory Statistics Books


Burns, Robert P. and Burns, Richard. 2008 (forthcoming). Business Research Methods and
Statistics Using SPSS. London: Sage Publications Ltd.

Field, Andy. 2005. Discovering Statistics Using SPSS (2nd ed.). London: Sage Publications Ltd.

Hays, William L. 2007. Statistics (6th Ed.). New York: Wadsworth Publishing.

Kendrick, Richard J. 2004. Social Statistics: An Introduction Using SPSS (2nd Ed.). Allyn &
Bacon.

Knoke, David, Bohrnstedt, George W. and Mee, Alisa Potter. 2002. Statistics for Social Data
Analysis (4th ed.). Wadsworth Publishing.

Moore, David S., 2005. The Practice of Business Statistics with SPSS. W.H. Freeman.

Norusis, Marija J. 2008. SPSS 16.0 Guide to Data Analysis. (2nd Ed.) New York: Prentice-Hall.

Norusis, Marija J. 2008. SPSS 16.0 Statistical Procedures Companion. (2nd Ed.) New York:
Prentice-Hall.

Additional References
Agresti, Alan. 2007. An Introduction to Categorical Data Analysis (2nd Ed.). New York: Wiley-
Interscience.

Allison, Paul D. 1998. Multiple Regression: A Primer. Thousand Oaks, CA: Pine Forge.

Allison, Paul D. 2001. Missing Data. Thousand Oaks, CA: Sage.

Andrews, Frank M, Klem, L., Davidson, T.N., O’Malley, P.M. and Rodgers, W.L. 1981. A Guide
for Selecting Statistical Techniques for Analyzing Social Science Data. Ann Arbor, MI: Institute
for Social Research, University of Michigan.

Bishop, Yvonne M., Fienberg, S. and Holland, P.W. 1975. Discrete Multivariate Analysis:
Theory and Practice. Cambridge, MA: MIT Press.

Box, George E. P. and Cox, D.R. 1964. "An Analysis of Transformations," Journal of the Royal
Statistical Society, Series B, 26, pp. 211-252.

Box, George E. P., Hunter, W.G. and Hunter, J.S. 2005. Statistics for Experimenters (2nd Ed.).
New York: Wiley.


Brown, Morton B. and Forsythe, A. 1974. "The Small Sample Behavior of Some Statistics Which
Test the Equality of Several Means," Technometrics, 16, pp. 129-132.

Cleveland, William S. 1994. The Elements of Graphing Data (2nd ed.). Chapman & Hall/CRC.

Cochran, William G. 1977. Sampling Techniques (3rd ed.). New York: Wiley.

Cohen, Jacob. 1988. Statistical Power Analysis for the Behavioral Sciences (2nd Ed.). Hillsdale,
NJ: Lawrence Erlbaum Associates.

Cohen, Jacob, Cohen, P., et al. 2002. Applied Multiple Regression/Correlation Analysis for the
Behavioral Sciences (3rd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

Daniel, Cuthbert and Wood, Fred S. 1999. Fitting Equations to Data (2nd ed.). New York: Wiley.

Daniel, Wayne W. 2000. Applied Nonparametric Statistics (2nd ed.). Boston: Duxbury Press.

Draper, Norman and Smith, Harry. 1998. Applied Regression Analysis (3rd ed.). New York:
Wiley.

Few, Stephen. 2004. Show Me the Numbers: Designing Tables and Graphs to Enlighten.
Analytics Press.

Fienberg, Stephen E. 1980. The Analysis of Cross-Classified Categorical Data (2nd ed.).
Cambridge, MA: MIT Press.

Gibbons, Jean D. 2005. Nonparametric Measures of Association. Newbury Park, CA: Sage.

Hoaglin, David C., Mosteller, F. and Tukey, J.W. 1985. Exploring Data Tables, Trends and
Shapes. New York: Wiley.

Hoaglin, David C., Mosteller, F. and Tukey, J.W. 1991. Fundamentals of Exploratory Analysis of
Variance. New York: Wiley.

Hsu, Jason C. 1996. Multiple Comparisons: Theory and Methods. London: Chapman & Hall.

Huff, Darrell and Geis, Irving. 1993. How to Lie with Statistics (Reissue ed.). W.W. Norton &
Company.

Kirk, Roger E. 1994. Experimental Design: Procedures for the Behavioral Sciences (3rd ed.).
Belmont, CA: Brooks/Cole Publishing.

Kish, Leslie. 1965. Survey Sampling. New York: Wiley.

Klockars, Alan J. and Sax, G. 1986. Multiple Comparisons. Newbury Park, CA: Sage.

Kraemer, H.K and Thiemann, S. 1987. How Many Subjects? Statistical Power Analysis in
Research. Newbury Park, CA: Sage.


Milliken, George A. and Johnson, D.E. 2004. Analysis of Messy Data, Volume 1: Designed
Experiments. Chapman & Hall/CRC.

Mosteller, Frederick and Tukey, John W. 1977. Data Analysis and Regression. Reading, MA:
Addison-Wesley.

Salant, Priscilla and Dillman, Don A. 1994. How to Conduct Your Own Survey. New York:
Wiley.

Searle, Shayle R. 2005. Linear Models for Unbalanced Data. New York: Wiley-Interscience.

Siegel, Stanley and Castellan, N. J. 1988. Nonparametric Statistics for the Behavioral Sciences.
(2nd ed.). New York: McGraw-Hill.

Sudman, Seymour. 1976. Applied Sampling. New York: Academic Press.

Toothaker, Larry E. 1991. Multiple Comparisons for Researchers. Newbury Park, CA: Sage.

Tufte, Edward R. 2001. The Visual Display of Quantitative Information (2nd. Ed.). Graphics
Press.

Tukey, John W. 1977. Exploratory Data Analysis. Reading, MA: Addison-Wesley.

Tukey, John W. 1991. “The Philosophy of Multiple Comparisons,” Statistical Science, vol. 6, 1,
pp. 100-116.

Velleman, Paul F. and Wilkinson, L. 1993. “Nominal, Ordinal and Ratio Typologies are
Misleading for Classifying Statistical Methodology,” The American Statistician, vol. 47, pp. 65-
72.

Wilcox, Rand R. 2004. Introduction to Robust Estimation and Hypothesis Testing (2nd ed.). New
York: Academic Press.

Wilkinson, Leland. 2005. The Grammar of Graphics (2nd. Ed.). Springer.



Alternative Exercises for Chapters 8 & 9 and Appendix B
This appendix contains an alternative set of exercises for Chapter 8, Chapter 9 and Appendix B.
The exercises in the main text are based on the Bank.sav data file, which is discussed in the
respective chapters. These exercises are based on the GSS2004Intro.sav file, which is used in
other chapter examples and exercises. We present this as an option for the instructor.


Summary Exercises For Chapter 8


Using the GSS2004Intro.sav file:

Suppose you are interested in predicting the age when the first child is born. This might be of
interest if you were looking at programs targeted toward teenage parents. The outcome variable is
agekdbrn. Consider age, education (educ), spouse's education (speduc) (note: by using this
variable, you are limiting the analysis to those currently married), household size (hhsize),
number of children (childs), and sex (a numeric version of Gender).

1. Create a numeric version of Gender using the "Recode Into Different Variables"
procedure. Name the new variable Sex and recode "M" to 0 and "F" to 1. Assign value
labels to the new variable. (Syntax sketches for this step and for step 3 appear after this
exercise list.)

2. Run frequencies on sex so you understand its distribution. Run descriptive statistics on
the other variables for the same reason.

3. Now produce correlations with all these predictors and agekdbrn.

4. Then create scatterplots of the predictors and agekdbrn. Which variables have strong
correlations with agekdbrn? Do you find any potential problems with using linear
regression? Did you find any potential outliers?
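
For those who prefer syntax, sketches of the recode in exercise 1 and the correlations in exercise 3
follow. The recode sketch assumes the string source variable is named gender with values "M" and
"F"; check the actual variable name and codes in your file.

   * Sketch: numeric version of gender, with value labels.
   RECODE gender ("M"=0) ("F"=1) INTO sex.
   VALUE LABELS sex 0 "Male" 1 "Female".
   EXECUTE.

   * Sketch: correlations of the candidate predictors with agekdbrn.
   CORRELATIONS
     /VARIABLES=agekdbrn age educ speduc hhsize childs sex.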


Summary Exercises For Chapter 9


Using the GSS2004Intro.sav file:

1. Run a simple linear regression using education (educ) to predict age when child first born
(agekdbrn).

2. How well did you do in predicting age when the first child was born from this one variable?
(Hint: What is the R-square and how would you interpret it?)

3. Interpret the constant (intercept) and B values.

4. Use the predictive equation to predict the value of age when first child was born for a
person with 12 years of education.

5. Run a simple linear regression using one of the other variables from the set you explored
in the last chapter. Hint: You might want to try the one with the next highest correlation
coefficient. Answer Questions 2-3 above.


Summary Exercises For Appendix B


Using the GSS2004Intro.sav file:

1. Use linear regression to develop a prediction equation for age when the first child was born
(agekdbrn), using as predictors: age, educ, speduc, hhsize, childs, and sex (the numeric
version of gender that you need to create). Request a histogram of the errors, the
scatterplot of residuals, and casewise diagnostics. What variables are significant
predictors of agekdbrn? Which variable is the strongest predictor? The weakest? Do
the results make sense? What is the prediction equation?

2. Are there any problems with the assumptions for linear regression, such as homogeneity
of variance? Does a linear model fit the data?

3. Are there other variables that you believe might be good predictors? Consider recoding
Race to two categories, combining "Black" and "Other" and use it as a predictor. Is it
statistically significant?
