
STATISTICS IN

LANGUAGE STUDIES

ANTHONY WOODS
PAUL FLETCHER
ARTHUR HUGHES
UNIVERSITY OF READING

In this series:

P. H. MATTHEWS  Morphology
B. COMRIE  Aspect
R. M. KEMPSON  Semantic Theory
T. BYNON  Historical Linguistics
J. ALLWOOD, L.-G. ANDERSSON, Ö. DAHL  Logic in Linguistics
D. B. FRY  The Physics of Speech
R. A. HUDSON  Sociolinguistics
J. K. CHAMBERS and P. TRUDGILL  Dialectology
A. J. ELLIOT  Child Language
P. H. MATTHEWS  Syntax
A. RADFORD  Transformational Syntax
L. BAUER  English Word-formation
S. C. LEVINSON  Pragmatics
G. BROWN and G. YULE  Discourse Analysis
R. LASS  Phonology
R. HUDDLESTON  Introduction to the Grammar of English
B. COMRIE  Tense
W. KLEIN  Second Language Acquisition
A. CRUTTENDEN  Intonation
A. WOODS, P. FLETCHER, A. HUGHES  Statistics in Language Studies

The right of the
University of Cambridge
to print and sell
all manner of books
was granted by
Henry VIII in 1534.
The University has printed
and published continuously
since 1584.

CAMBRIDGE UNIVERSITY PRESS

CAMBRIDGE
LONDON  NEW YORK  NEW ROCHELLE
MELBOURNE  SYDNEY

CAMBRIDGE TEXTBOOKS IN LINGUISTICS

General Editors: B. COMRIE, C. J. FILLMORE, R. LASS, R. B. LE PAGE,
J. LYONS, P. H. MATTHEWS, F. R. PALMER, R. POSNER, S. ROMAINE,
N. V. SMITH, J. L. M. TRIM, A. ZWICKY

STATISTICS IN LANGUAGE STUDIES

Erratum:
The first expression on page 123 should read
H0: μ = 80 versus H1: μ < 80

Published by the Press Syndicate of the University of Cambridge
The Pitt Building, Trumpington Street, Cambridge CB2 1RP
32 East 57th Street, New York, NY 10022, USA
10 Stamford Road, Oakleigh, Melbourne 3166, Australia

© Cambridge University Press 1986

First published 1986

Printed in Great Britain at The Bath Press, Avon

British Library cataloguing in publication data
Woods, Anthony
Statistics in language studies. -
(Cambridge textbooks in linguistics)
1. Linguistics - Research - Statistical methods
I. Title  II. Fletcher, Paul  III. Hughes, Arthur
410'.72  P138.5

Library of Congress cataloguing in publication data
Woods, Anthony.
Statistics in language studies.
(Cambridge textbooks in linguistics)
Bibliography: p.
Includes index.
1. Linguistics - Statistical methods. I. Fletcher,
Paul. II. Hughes, Arthur. III. Title. IV. Series.
P138.5.W66 1986  410'.72  85-470

ISBN 0 521 25326 8 hard covers
ISBN 0 521 27312 9 paperback

CONTENTS

Preface  xi

1  Why do linguists need statistics?  1

2  Tables and graphs  8
2.1  Categorical data  8
2.2  Numerical data  13
2.3  Multi-way tables  19
2.4  Special cases  20
Summary  22
Exercises  23

3  Summary measures  25
3.1  The median  27
3.2  The arithmetic mean  29
3.3  The mean and the median compared  30
3.4  Means of proportions and percentages  34
3.5  Variability or dispersion  37
3.6  Central intervals  37
3.7  The variance and the standard deviation  40
3.8  Standardising test scores  43
Summary  45
Exercises  46

4  Statistical inference  48
4.1  The problem  48
4.2  Populations  49
4.3  The theoretical solution  52
4.4  The pragmatic solution  54
Summary  57
Exercises  58

5  Probability  59
5.1  Probability  59
5.2  Statistical independence and conditional probability  61
5.3  Probability and discrete numerical random variables  66
5.4  Probability and continuous random variables  69
5.5  Random sampling and random number tables  72
Summary  75
Exercises  75

6  Modelling statistical populations  77
6.1  A simple statistical model  77
6.2  The sample mean and the importance of sample size  80
6.3  A model of random variation: the normal distribution  86
6.4  Using tables of the normal distribution  89
Summary  93
Exercises  93

7  Estimating from samples  95
7.1  Point estimators for population parameters  95
7.2  Confidence intervals  96
7.3  Estimating a proportion  99
7.4  Confidence intervals based on small samples  101
7.5  Sample size  103
7.5.1  Central Limit Theorem  103
7.5.2  When the data are not independent  104
7.5.3  Confidence intervals  105
7.5.4  More than one level of sampling  106
7.5.5  Sample size to obtain a required precision  107
7.6  Different confidence levels  110
Summary  111
Exercises  112

8  Testing hypotheses about population values  113
8.1  Using the confidence interval to test a hypothesis  113
8.2  The concept of a test statistic  117
8.3  The classical hypothesis test and an example  120
8.4  How to use statistical tests of hypotheses: is significance significant?  127
8.4.1  The value of the test statistic is significant at the 1% level  129
8.4.2  The value of the test statistic is not significant  130
Summary  130
Exercises  131

9  Testing the fit of models to data  132
9.1  Testing how well a complete model fits the data  132
9.2  Testing how well a type of model fits the data  137
9.3  Testing the model of independence  139
9.4  Problems and pitfalls of the chi-squared test  144
9.4.1  Small expected frequencies  144
9.4.2  The 2 × 2 contingency table  146
9.4.3  Independence of the observations  147
9.4.4  Testing several tables from the same study  149
9.4.5  The use of percentages  150
Summary  151
Exercises  152

10  Measuring the degree of interdependence between two variables  154
10.1  The concept of covariance  154
10.2  The correlation coefficient  160
10.3  Testing hypotheses about the correlation coefficient  162
10.4  A confidence interval for a correlation coefficient  163
10.5  Comparing correlations  165
10.6  Interpreting the sample correlation coefficient  167
10.7  Rank correlations  169
Summary  174
Exercises  174

11  Testing for differences between two populations  176
11.1  Independent samples: testing for differences between means  176
11.2  Independent samples: comparing two variances  182
11.3  Independent samples: comparing two proportions  182
11.4  Paired samples: comparing two means  184
11.5  Relaxing the assumptions of normality and equal variance: non-parametric tests  188
11.6  The power of different tests  191
Summary  192
Exercises  193

12  Analysis of variance - ANOVA  194
12.1  Comparing several means simultaneously: one-way ANOVA  194
12.2  Two-way ANOVA: randomised blocks  200
12.3  Two-way ANOVA: factorial experiments  202
12.4  ANOVA: main effects only  206
12.5  ANOVA: factorial experiments  211
12.6  Fixed and random effects  212
12.7  Test score reliability and ANOVA  215
12.8  Further comments on ANOVA  219
12.8.1  Transforming the data  220
12.8.2  'Within-subject' ANOVAs  221
Summary  222
Exercises  222

13  Linear regression  224
13.1  The simple linear regression model  226
13.2  Estimating the parameters in a linear regression  229
13.3  The benefits from fitting a linear regression  230
13.4  Testing the significance of a linear regression  233
13.5  Confidence intervals for predicted values  234
13.6  Assumptions made when fitting a linear regression  235
13.7  Extrapolating from linear models  237
13.8  Using more than one independent variable: multiple regression  237
13.9  Deciding on the number of independent variables  241
13.10  The correlation matrix and partial correlation  244
13.11  Linearising relationships by transforming the data  245
13.12  Generalised linear models  247
Summary  247
Exercises  248

14  Searching for groups and clusters  249
14.1  Multivariate analysis  249
14.2  The dissimilarity matrix  252
14.3  Hierarchical cluster analysis  254
14.4  General remarks about hierarchical clustering  259
14.5  Non-hierarchical clustering  261
14.6  Multidimensional scaling  262
14.7  Further comments on multidimensional scaling  265
14.8  Linear discriminant analysis  267
14.9  The linear discriminant function for two groups  268
14.10  Probabilities of misclassification  269
Summary  271
Exercises  271

15  Principal components analysis and factor analysis  273
15.1  Reducing the dimensionality of multivariate data  273
15.2  Principal components analysis  275
15.3  A principal components analysis of language test scores  278
15.4  Deciding on the dimensionality of the data  282
15.5  Interpreting the principal components  284
15.6  Principal components of the correlation matrix  287
15.7  Covariance matrix or correlation matrix?  287
15.8  Factor analysis  290
Summary  295

Appendix A  Statistical tables  296
Appendix B  Statistical computation  307
Appendix C  Answers to some of the exercises  314

References  316
Index  319
PREFACE

This book began with initial contacts between linguists (Hughes and Fletcher) and a statistician (Woods) over specific research problems in first and second language learning, and testing. These contacts led to an increasing awareness of the relevance of statistics for other areas in linguistics and applied linguistics, and of the responsibility of those working in such areas to subject their quantitative data to the same kind of statistical scrutiny as other researchers in the social sciences. In time, students in linguistics made increasing use of the Advisory Service provided by the Department of Applied Statistics at Reading. It soon became clear that the dialogue between statistician and student linguist, if it was to be maximally useful, required an awareness of basic statistical concepts on the student's part. The next step, then, was the setting up of a course in statistics for linguistics students (taught by Woods). This is essentially the book of the course, and reflects our joint views on what linguistics students who want to use statistics with their data need to know.

There are two main differences between this and other introductory textbooks in statistics for linguists. First, the portion of the book devoted to probability and statistical inference is considerable. In order to elucidate the sample-population relationship, we consider in some detail basic notions of probability, of statistical modelling, and (using the normal distribution as an example of a statistical model) of the problem of estimating population values from sample estimates. While these chapters (4-8) may on initial reading seem difficult, we strongly advise readers who wish fully to understand what they are doing, when they use the techniques elaborated later in the book, to persevere with them.

The second difference concerns the range of statistical methods we deal with. From the second half of chapter 13 on, a number of multivariate techniques are examined in relation to linguistic data. Multiple regression, cluster analysis, discriminant function analysis, and principal component and factor analysis have been applied in recent years to a range of linguistic data.

ANTHONY WOODS
PAUL FLETCHER
ARTHUR HUGHES

University of Reading
July 1985

1
Why do linguists need statistics?

Linguists may wonder why they need statistics. The dominant theoretical framework in the field, that of generative grammar, has as its primary data-source judgements about the well-formedness of sentences. These judgements usually come from linguists themselves, are either-or decisions, and relate to the language ability of an ideal native speaker in a homogeneous speech community. The data simply do not call for, or lend themselves to, the assignment of numerical values which need to be summarised or from which inferences may be drawn. There appears to be no place here for statistics.

Generative grammar, however, despite its great contribution to linguistic knowledge over the past 25 years, is not the sole topic of linguistic study. There are other areas of the subject where the observed data positively demand statistical treatment. In this book we will scrutinise studies from a number of these areas and show, we hope, the necessity for statistics in each. In this brief introduction we will use a few of these studies to illustrate the major issues with which we shall be faced.

As we will demonstrate throughout the book, statistics allows us to summarise complex numerical data and then, if desired, to draw inferences from them. Indeed, a distinction is sometimes made between descriptive statistics on the one hand and inferential statistics on the other. The need to summarise and infer comes from the fact that there is variation in the numerical values associated with the data (i.e. the values over a set of measurements are not identical). If there were no variation, there would be no need for statistics.

Let us imagine that a phonetician, interested in the way that voiced-voiceless distinctions are maintained by speakers of English, begins by taking measurements of voice onset times (VOT), i.e. the time between the release of the stop and the onset of voicing, in initial stops. The first set of data consists of ten repetitions from each of 20 speakers of ten /p/-initial words. Now if there were no difference in VOT time, either
between words or between speakers, there would be no need here for statistics; the single VOT value would simply be recorded. In fact, of course, very few, if any, of the values will be identical. The group of speakers may produce VOT values that are all distinct on, for example, their first pronunciation of a particular word. Alternatively, the VOT values of an individual speaker may be different from word to word or, indeed, between repetitions of the same word. Thus the phonetician could have as many as 2,000 different values. The first contribution of statistics will be to provide the means of summarising the results in a meaningful and readily understandable way. One common approach is to provide a single 'typical' value to represent all of the VOT times, together with a measure of the way in which the VOT times vary around this value (the mean and the standard deviation - see chapter 3). In this way, a large number of values is reduced to just two.

We shall return to the phonetician's data, but let us now look at another example. In this case a psycholinguist is interested in the nature of aptitude for learning foreign languages. As part of this study 100 subjects are given a language aptitude test and later, after a period of language instruction, an achievement test in that language. One of the things the psycholinguist will wish to know is the form of the relationship between scores on the aptitude test and scores on the achievement test. Looking at the two sets of scores may give some clues: someone who scored exceptionally high on the aptitude test, for instance, may also have done extremely well on the achievement test. But the psycholinguist is not going to be able to assimilate all the information conveyed by the 200 scores simply by looking at each pair of scores separately. Although the kind of summary measures used by the phonetician will be useful, they will not tell the psycholinguist directly about the relationship between the two sets of scores. However, there is a straightforward statistical technique available which will allow the strength of the relationship to be represented in a single value (the correlation - see chapter 10). Once again, statistics serves the purpose of reducing complex data to manageable proportions.

The two examples given have concerned data summary, the reduction of complex data. In both cases they have concerned the performance of a sample of subjects. Of course, the ultimate interest of linguistic investigators is not in the performance only of samples. They usually wish to generalise to the performance of larger groups. The phonetician with the sample of 20 speakers may wish to be able to say something about all speakers of a particular accentual variety of English, or indeed about speakers of English in general. Similarly, he or she is interested in all /p/-initial words, not just those in the sample. Here again statistics can help. There are techniques which allow investigators to assess how closely the 'typical' scores offered in a sample of a particular size are likely to approximate to those of the group of whom they wish to make the generalisation (chapter 7), provided the sample meets certain conditions (see 4.4 and 5.5).

Let us take the example of the phonetician further. At the same time as the /p/-initial data were being collected, the ten subjects were also asked to pronounce 20 /b/-initial words. Again, the phonetician could reduce these data to two summary measures: the average VOT time for the group and the standard deviation. Let us consider for the moment only one of these - the average, or typical value. The phonetician would then observe that there is a difference between the typical VOT value for /p/-initial words and that for /b/-initial words. The typical VOT value for /p/-initial words would be larger. The question then arises as to whether the difference between the two values is likely to be one which represents a real difference in VOT times in the larger group for which a generalisation will be made, or whether it is one that has come about by chance. (If you think about it, if the measurement is precise there is almost certain to be some difference in the sample values.) There are statistical techniques which allow the phonetician to give the probability that the sample difference is indeed the manifestation of a 'real' difference in the larger group. In the example we have given (VOT times), the difference is in fact quite well established in the phonetic literature (see e.g. Fry 1979: 135-7), but it should be clear that there are potentially many claims of a similar nature which would be open to - and would demand - similar treatment.

These two examples hint at the kind of contribution statistics can and should make to linguistic studies, in summarising data, and in making inferences from them. The range of studies for which statistics is applicable is vast - in applied linguistics, language acquisition, language variation and linguistics proper. Rather than survey each of these fields briefly at this point, let us look at one acquisition study in detail, to try to achieve a better understanding of the problems faced by the investigator and the statistical issues involved. What are the problems which investigators want to solve, what measures of linguistic behaviour do they adopt, what is the appropriate statistical treatment for these measures, and how reliable are their results?

We address these issues by returning to voice onset time, now with reference to language acquisition. What precisely are the stages children go through in acquiring, in production, the distinction between voiced
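The reduction described above - from hundreds of VOT measurements to a 'typical' value and a measure of spread for each group - can be sketched in a few lines. The values below are invented purely for illustration; they are not data from any study cited in this chapter.

```python
import statistics

# Hypothetical VOT measurements (in ms), invented for illustration.
vot_p = [52, 61, 48, 70, 55, 63, 58, 49, 66, 57]   # /p/-initial tokens
vot_b = [8, 12, 5, 15, 9, 11, 7, 14, 10, 6]        # /b/-initial tokens

# Reduce each set of values to the two summary measures the text
# describes: the mean (a 'typical' value) and the standard deviation
# (how the values vary around it).
mean_p = statistics.mean(vot_p)
sd_p = statistics.stdev(vot_p)        # sample standard deviation
mean_b = statistics.mean(vot_b)
sd_b = statistics.stdev(vot_b)

print(f"/p/: mean = {mean_p:.1f} ms, sd = {sd_p:.1f} ms")
print(f"/b/: mean = {mean_b:.1f} ms, sd = {sd_b:.1f} ms")
```

Whatever the size of the original data set, each group is reduced to just two numbers, and (as the text goes on to note) the /p/-initial mean comes out larger than the /b/-initial one.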
and voiceless initial stops in English? (This discussion draws heavily on Macken & Barton 1980a; for similar studies concerning the acquisition of voicing contrasts in other languages the reader is referred to Macken & Barton 1980b for Spanish, Allen 1985 for French, and Viana 1985 for Portuguese.) The inquiry begins from the observation that transcriptions of children's early pronunciations of stops often show no /p/-/b/ distinctions; generally both /p/-targets (adult words beginning with /p/) and /b/-targets (adult words beginning with /b/) are pronounced with initial [b], or at least this is how auditory impressionistic transcriptions represent them. Is it possible, though, that young children are making distinctions which adult transcribers are unable to hear? VOT is established as a crucial perceptual cue to the voiced-voiceless distinction for initial stops; for English there is a 'short-lag' VOT range for voiced stops (from 0 to +30 ms for labials and apicals, 0 to +40 ms for velars), and a 'long-lag' range for voiceless stops (+60 to +100 ms). English speakers perceive stops with a VOT of less than +30 ms (for labials and apicals, +50 ms for velars) as voiced; any value above these figures tends to lead to the perception of the item in question as voiceless. Children's productions will tend to be assigned by adult transcribers to the phonemic categories defined by short- and long-lag VOT. So if at some stage of development children are making a consistent contrast using VOT, but within an adult phonemic category, it is quite possible that adult transcribers, because of their perceptual habits, will miss it. How is this possibility investigated? It should be apparent that such a study involves a number of issues for those carrying it out.

(a) We require a group of children, of an appropriate age, to generate the data. For a developmental study like this we have to decide whether the data will be collected longitudinally (from the same children at successive times separated by a suitable interval) or cross-sectionally (from different groups of children, where each 'group' is of a particular age, and the different groups span the age-range that is of interest for us). Longitudinal data have the disadvantage that they take as long to collect as the child takes to develop, whereas cross-sectional data can be gathered within a brief time span. With longitudinal data, however, we can be sure that we are charting the course of growth within individuals and make reliable comparison between time A and time B. With cross-sectional comparisons this is not so clear. Once we have decided on the kind of data we want, decisions as to the size of the sample and the selection of its elements have to be addressed. It is on the decisions made here that our ability to generalise the results of a study will depend. The Macken & Barton study was a longitudinal one, using four children who 'were monolingual speakers of English with no siblings of school age ... were producing at least some initial stop words ... showed evidence of normal language development ... and appeared to be co-operative' (1980a: 42-3). In addition, both parents of each child were native speakers of English, and all the children had normal hearing. The reasons for aspects of this subject description are transparent; general issues relating to sample size and structure are discussed below (4.4 and 7.5).

(b) A second issue which is common in linguistic studies is the size of the data sample from each individual. The number of subjects in the study we are considering is four, but the number of tokens of /p t k/- and /b d g/-initial adult targets is potentially very large. (An immediate question that might be asked is whether it is better to have relatively few subjects, with relatively many instances of the behaviour in which we are interested from each subject, or many subjects and fewer tokens. See 7.5 for some discussion.) In the VOT acquisition study the investigators also had to decide on the related issue of frequency of sampling and the number of tokens within each of the six categories of initial stop target. As it happens, they chose a fortnightly sampling interval and the number of tokens in a session ranged from a low of 25 to a high of 414. (The goal was to obtain at least 15 tokens for each stop consonant, but this was not always achieved in the early sessions.)

(c) Once the data are collected and the measurements made on each token from each individual for each session, the information provided needs to be presented in an acceptable and comprehensible form. Macken & Barton limited themselves, for the instrumental measurements they make, to 15 tokens of each stop type per session. It may well be that each of the 15 tokens within a category has a different VOT value, and for evaluation we therefore need summary values and/or graphic displays of the data. Macken & Barton use both tabular, numerical summaries and graphic representations (see chapters 2 and 3 for a general discussion of methods for data summaries).

(d) The descriptive summaries of the child VOT data suggest some interesting conclusions concerning one stage of the development of initial stop contrasts in some children. Recall that it is generally held that the perceptual boundary between voiced and voiceless labial or alveolar stops is +30 ms. At an early point in the development of alveolars by one subject, Tessa, the average value for /d/-initial targets is +2.4 ms while the average for /t/-initial targets is +20.5 ms. Both these values are within the adult voiced category, and so the adult is likely to perceive them as voiced.
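The perceptual boundaries just quoted (+30 ms for labials and apicals, +50 ms for velars) can be written out as a small rule. The function and its names are our own illustrative scaffolding, not part of Macken & Barton's procedure.

```python
# Perceptual boundaries (in ms) for initial stops, as stated in the text.
BOUNDARY_MS = {"labial": 30, "apical": 30, "velar": 50}

def perceived_category(vot_ms, place):
    """Category an adult transcriber would tend to assign: a VOT below
    the boundary for that place of articulation is heard as voiced,
    a VOT above it as voiceless."""
    return "voiced" if vot_ms < BOUNDARY_MS[place] else "voiceless"

# Tessa's average alveolar (apical) values: both fall below +30 ms,
# so an adult listener would tend to hear both as voiced, even though
# the child may be making a consistent VOT contrast between them.
print(perceived_category(2.4, "apical"))    # /d/-targets
print(perceived_category(20.5, "apical"))   # /t/-targets
```

This makes concrete why a contrast held entirely inside one adult phonemic category can be invisible to impressionistic transcription.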
But the values are rather different. Is this observed difference between the two averages a significant difference, statistically speaking? Or, restating the question in the terms of the investigation, is the child making a consistent distinction in VOT for /d/-initial and /t/-initial targets, but one which, because it is inside an adult phonemic category, is unlikely to be perceived? The particular statistical test that is relevant to this issue is dealt with in chapter 10, but chapters 3-8 provide a crucial preparation for understanding it.

(e) We have referred to one potentially significant difference for one child. As investigators we are usually interested in how far we are justified, on the basis of the sample data we have analysed, in extending our findings to a larger group of subjects than actually took part in our study. The answer to this question depends in large measure on how we handled the issues raised in (b) and (d) above, and is discussed again in chapter 4.

Much of the discussion so far has centred on phonetics - not because we believe that is the only linguistic area in which these issues arise, but because VOT is a readily comprehensible measure and studies employing it lend themselves to a straightforward illustration of concerns that are common to many areas of language study. We return to them continually in the pages that follow with reference to a wide variety of studies.

While the use made of the information in the rest of the book will reflect the reader's own purposes and requirements, we envisage that there will be two major reasons for using the book.

First, readers will want to evaluate literature which employs statistical techniques. The conclusions papers reach are of dubious worth if the measurements are suspect, if the statistical technique is inappropriate, or if the assumptions of the technique employed are not met. By discussing a number of techniques and the assumptions they make, the book will assist critical evaluation of the literature.

Second, many readers will be interested in planning their own research. The range of techniques introduced by the book will assist this aim, partly by way of examples from other people's work in similar areas. We should emphasise that for research planning the book will not solve all problems. In particular, it does not address in detail measurement (in the sense of what and how to measure in a particular field), nor, directly, experimental design; but it should go some of the way to assisting readers to select an appropriate statistical framework, and will certainly enable problems to be discussed in an informed way with a statistician.

Each chapter in the book contains some exemplification in a relevant field of the techniques it explains. The chapter is then followed by extensive exercises which must be worked through, to accustom the reader to the applications of the techniques, and their empirical implications. While the book is obviously not intended to be read through from cover to cover, since different readers will be interested in different techniques, we recommend that all users of the book read chapters 2-8 inclusive, since these are central to understanding. It is here that summary measures, probability and inference from samples to populations are dealt with. Many readers will find chapters 4-8 difficult. This is not because they require special knowledge or skills for their understanding. They do not, for example, contain any mathematics beyond simple algebra and the use of a common notation which is explained in earlier chapters. However, they do contain arguments which introduce and explain the logic and philosophy of statistical inference. It is possible to use in a superficial, 'cookbook' fashion the techniques described in later chapters without understanding the material in chapters 4-8, but a true grasp of the meaning and limitations of those techniques will not then be possible.

The second part of the book - from chapter 9 onwards - deals with a variety of techniques, details of which can be found in the contents list at the beginning of the book.
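Whether a difference between two averages, like the one raised in (d), is statistically significant is the subject of later chapters. Purely as a preview, one common approach computes a test statistic from the two samples' means, variances and sizes and refers it to statistical tables (such as those in Appendix A). The data below are invented for illustration, and the unequal-variance t statistic sketched here is just one of the tests discussed later in the book, not a reproduction of Macken & Barton's analysis.

```python
import math
import statistics

# Invented VOT values (in ms) for one child's /d/- and /t/-targets;
# these are NOT Tessa's actual measurements.
d_targets = [1.0, 3.5, 2.0, 2.8, 4.1, 0.9, 2.5, 3.0]
t_targets = [18.0, 22.5, 19.4, 21.0, 23.2, 17.8, 20.6, 21.5]

def two_sample_t(x, y):
    """t statistic for the difference between two sample means
    (unequal-variance, 'Welch' form)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    se = math.sqrt(vx / len(x) + vy / len(y))   # standard error of the difference
    return (my - mx) / se

t_stat = two_sample_t(d_targets, t_targets)
print(f"t = {t_stat:.2f}")
# A large value of t, referred to tables of the t distribution, would
# suggest a difference unlikely to have arisen by chance alone.
```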
Categorical data
Table2.1
2 (a) Frequencies ofdisorders-in a sample o/364 -liinguage-impairedm'a/ednUSA

Tables and graphs' ,,


: ;; __ .
Stu-ttering
- Phonological
- disability
Specific language ! Impaired
disOrder h(!afirig ' Total J.

57 20<) 47 51 364

(b) Relative frequencies. ofdi$0rders in a sample of;64 language-impait-ed males in US;l


Phonological Specific language Impaired
Stuttering disability disorder hearing Total
0.157 0574 0,129 0.140 1.000
When a linguistic study is carried out the investigator will be faced with
the prospect of understanding, and then explaining to others, the meaning (c) Frequencies of disorders in a sample of 364 Janguage-impalred males itl USA (figures
of the data which have been collected. An essential first step in thisprocess {'!, ,~r(],C,~f!$/'I,J,~_tJIIP#'lJ.,e jreque.w:ies (lf p_ert{!tUpge~)_
is to look for ways of summarisingthe resulkwhich bring out their most 1 , -1 ; :, :,-Phonological Specific languagt; :Impaircd
, ... S.t~~tering dis,abi.!~ty , disp~4cr : , hearing Total
obvious features. Indeed if this is''don<Hihaginatively'imd the trends1in' ' '''
57 (16) . 209. (57) . 47 (13) 51 : ( 1 4) 364 (106)
the data are clear enough, there may be no need for sophisticatedanalysis.
In this chapter we describe the types of table' and diagram most commonly'
employed for data summary.
Let us begin by looking at typical examples of the kind of data which might be collected in language studies. We will consider how, by means of tables, diagrams and a few simple calculations, the data may be summarised so that their important features can be displayed concisely. The procedure is analogous to writing a précis of an article or essay and has similar attractions and drawbacks. The aim is to reduce detail to a minimum while retaining sufficient information to communicate the essential characteristics of the original. Remember always that the use of data ought to enrich and elucidate the linguistic argument, and this can often be done quite well by means of a simple table or diagram.

2.1 Categorical data
It quite commonly arises that we wish to classify a group of people or responses or linguistic elements, putting each unit into one of a set of mutually exclusive classes. The data can then be summarised by giving the frequency with which each class was observed. Such data are often called categorical, since each element or individual of the group being studied can be classified as belonging to one of a (usually small) number of different categories. For example, in table 2.1(a) we have presented the kind of data one might expect on taking a random sample of 364 males with diagnosed speech and language difficulties in the USA (see e.g. Healey et al. 1981). We have put these subjects into four different categories of impairment.

Table 2.1(a) itself already comprises a neat and intelligible summary of the data, displaying the number of times that each category was observed out of 364 instances. This number is usually called the frequency or observed frequency of the category. However, it may be more revealing to display the proportions of subjects falling into the different classes, and these can be calculated simply by dividing each frequency by the total frequency, 364. The proportions or relative frequencies obtained in this way are displayed in table 2.1(b). Note that no more than three figures are given, though most pocket calculators will give eight or ten. This is deliberate. Very high accuracy is rarely required in such results, and the ease of assimilation of a table decreases rapidly with the number of figures used for each value. Do remember, however, that when you truncate a number you may have to alter the last figure which you wish to include. For example, written to three decimal places, 0.6437 becomes 0.644, while 0.3172 would be 0.317. The rule should be obvious.

A table of relative frequencies is not really informative (and can be downright misleading) unless we are given the total number of observations on which it is based. It should be obvious that the claim that 50% of native English speakers display a certain linguistic behaviour is better supported by the behaviour in question being observed in 500 of 1,000 subjects than in just two of a total of four. (This point is discussed in detail in a later chapter.) It is best to give both frequencies and relative frequencies, as in table 2.1(c).
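The arithmetic behind tables 2.1(b) and 2.1(c) is easy to reproduce by machine. A minimal Python sketch, assuming the category counts of table 2.1 and rounding to three figures as recommended above (the variable names are ours):

```python
# Relative frequencies for the four impairment categories of table 2.1
# (counts for the 364 males, as given in the table).
counts = {
    "stuttering": 57,
    "phonological disability": 209,
    "specific language disorder": 47,
    "impaired hearing": 51,
}
total = sum(counts.values())
# Divide each frequency by the total frequency and keep three figures.
rel_freq = {k: round(v / total, 3) for k, v in counts.items()}

print(total)                                 # 364
print(rel_freq["phonological disability"])   # 0.574
```

Note that `round` does proper rounding, not truncation, so 0.6437 would indeed become 0.644.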

Note that the relative frequencies in table 2.1(c) have been rounded further to only two figures and quoted as percentages (to change any decimal fraction to a percentage it is necessary only to move the decimal point two places to the right).

It often happens that we wish to compare the way in which the frequencies of the categories are distributed over two groups. We can present this by means of two-way tables as in table 2.2, where the sample has been extended to include 196 language-impaired females from the same background as the males in table 2.1. In table 2.2(a), the first row is exactly as table 2.1(c); the second row displays the relative frequencies of females across disorder categories, and the third row displays relative frequencies for the two groups combined. Table 2.2(b), however, displays the relative frequency of males to females within categories. So, for example, of the total number of stutterers (84), 68% are male while 32% are female. In table 2.2(c) the frequencies in parentheses in each cell are relative to the total number of individuals (560), and we can see, for instance, that the proportion of the total who are male and hearing-impaired is approximately 9% (51/560).

Table 2.2

(a) Frequencies of disorders in a sample of 560 language-impaired individuals in the USA, cross-classified by sex (frequencies relative to row totals are given in brackets as percentages)

         Stuttering   Phonological   Specific language   Impaired   Total
                      disability     disorder            hearing
Male     57 (16)      209 (57)       47 (13)             51 (14)    364 (100)
Female   27 (14)      118 (60)       31 (16)             20 (10)    196 (100)
Total    84 (15)      327 (58)       78 (14)             71 (13)    560 (100)

(b) Frequencies of disorders in a sample of 560 language-impaired individuals in the USA, cross-classified by sex (frequencies relative to column totals are given in brackets as percentages)

         Stuttering   Phonological   Specific language   Impaired   Total
                      disability     disorder            hearing
Male     57 (68)      209 (64)       47 (60)             51 (72)    364 (65)
Female   27 (32)      118 (36)       31 (40)             20 (28)    196 (35)
Total    84 (100)     327 (100)      78 (100)            71 (100)   560 (100)

(c) Frequencies of disorders in a sample of 560 language-impaired individuals in the USA, cross-classified by sex (frequencies relative to the total sample size are given in brackets as percentages)

         Stuttering   Phonological   Specific language   Impaired   Total
                      disability     disorder            hearing
Male     57 (10)      209 (37)       47 (8)              51 (9)     364 (65)
Female   27 (5)       118 (21)       31 (6)              20 (4)     196 (35)
Total    84 (15)      327 (58)       78 (14)             71 (13)    560 (100)

These tables have been constructed in a form that would be suitable if only one of them were to be presented. The choice would of course depend on the features of the data which we wanted to discuss. If, on the other hand, more than one of the tables were required, it would be neither necessary nor desirable to repeat all the total frequencies in each table. It would be preferable to present a sequence of simpler, less cluttered tables, as in table 2.3.

Table 2.3

(a) Frequencies of disorders in a sample of 560 language-impaired individuals in the USA (figures in brackets are percentages)

         Stuttering   Phonological   Specific language   Impaired   Total
                      disability     disorder            hearing
         84 (15)      327 (58)       78 (14)             71 (13)    560 (100)

(b) Frequencies of disorders in a sample of 560 language-impaired individuals in the USA, cross-classified by sex (percentages in brackets give relative frequencies of sexes within disorders)

         Stuttering   Phonological   Specific language   Impaired   Total
                      disability     disorder            hearing
Male     57 (68)      209 (64)       47 (60)             51 (72)    364 (65)
Female   27 (32)      118 (36)       31 (40)             20 (28)    196 (35)

(c) Frequencies of disorders in a sample of 560 language-impaired individuals in the USA, cross-classified by sex (percentages in brackets give relative frequencies of disorders within sex)

         Stuttering   Phonological   Specific language   Impaired
                      disability     disorder            hearing
Male     57 (10)      209 (37)       47 (8)              51 (9)
Female   27 (5)       118 (21)       31 (6)              20 (4)
Total    84 (15)      327 (58)       78 (14)             71 (13)

The tables we have introduced so far can be used as a basis for constructing graphs or diagrams to represent the data. Such diagrams will frequently bring out in a striking way the main features of the data. Consider figure 2.1(a), which is based on table 2.1. This type of graph is often called a bar chart and allows an 'at a glance' comparison of the frequencies of the classes. Figure 2.1(b) is the same chart constructed from the relative frequencies. Note that its appearance is identical to figure 2.1(a); the only alteration required is a change of the scale of the vertical axis. Since the categories have no inherent ordering, we have chosen to present them in the chart in decreasing order of frequency, but this is a matter of taste.
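The three variants of table 2.2 differ only in which total each cell is divided by, a point easily checked by computation. A minimal Python sketch, assuming the frequencies of table 2.2 (the variable names are ours):

```python
# Row-relative and column-relative percentages for a two-way table:
# rows are sexes, columns are the four disorder categories of table 2.2.
rows = {
    "male":   [57, 209, 47, 51],
    "female": [27, 118, 31, 20],
}
col_totals = [sum(col) for col in zip(*rows.values())]
grand = sum(col_totals)

# Table 2.2(a): each cell as a percentage of its row total.
row_pct = {k: [round(100 * x / sum(v)) for x in v] for k, v in rows.items()}
# Table 2.2(b): each cell as a percentage of its column total.
col_pct = {k: [round(100 * x / t) for x, t in zip(v, col_totals)]
           for k, v in rows.items()}

print(col_totals, grand)      # [84, 327, 78, 71] 560
print(row_pct["male"])        # [16, 57, 13, 14]
print(col_pct["male"])        # [68, 64, 60, 72]
```

Dividing instead by `grand` would reproduce table 2.2(c).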
Figure 2.1(a). Bar chart of frequencies of disorders in a sample of 364 language-impaired males in the USA (based on table 2.1).

Figure 2.1(b). Bar chart of relative frequencies of disorders in a sample of 364 language-impaired males (based on table 2.1).

Figure 2.2 shows how similar diagrams can be used to display the data of table 2.2. Note that in constructing figure 2.2 we have used the proportions relative to the total frequency: that is, we have divided the original frequencies by 560. Whether or not this is the appropriate procedure will depend on the point you wish to make and on how the data were collected.

Figure 2.2. Relative frequencies of disorders in a sample of 560 language-impaired individuals, further classified by sex (based on table 2.2c).

If the whole sample were collected without previously dividing the subjects into male and female, figure 2.2, based on table 2.2(c), would correctly give the proportions of a sample of individual patients who fall into different categories, determined both by their sex and by the type of defect they suffer. This would not be true if, say, males and females were recorded on different registers which were sampled separately, since the numbers of each sex in the sample would not then necessarily bear any relation to their relative numbers overall. Figure 2.3, based on table 2.2(a), showing the proportion of males who suffer a particular defect and, separately, the proportion of females suffering the same defect, would be correct. This would always be the more appropriate diagram for comparing the distribution of various defects within the different sexes.

2.2 Numerical data
The variables considered in the previous section were all classes or categories, and the numbers we depicted arose by counting how often a particular category occurred. It often happens that the variable we are observing takes numerical values: for example, the number of letters in
Figure 2.3. Relative frequencies of language disorders in samples of language-impaired individuals, within sexes (based on table 2.2a).

a word or morphemes in an utterance, a student's score in a vocabulary test, or the length of time between the release of a stop and the onset of voicing (VOT), and so on.

If the number of different observed values of the variable is small, then we can present it using the display methods of the previous section. In table 2.4(a) are given the lengths, in morphemes, of 100 utterances observed when an adult was speaking to a child aged 3 years. These have been converted into a frequency table in table 2.4(b) and the corresponding bar chart can be seen in figure 2.4. A major difference between this data and the categorical data of the previous section is that the data have a natural order and spacing. All the values between the minimum and maximum actually observed are possible, and provision must be made for all of them on the bar chart, including any which do not actually appear in the data set.

If the number of different values appearing in a data set is large, there will be some difficulty in fitting them all onto a bar chart without crushing them up and reducing the clarity of the diagram. Besides, unless the number of observations is very large, many possible values will not appear at all, causing frequent gaps in the chart, while other values will appear rather infrequently in the data.
Table 2.4(a). Lengths of 100 utterances (in morphemes) of an adult addressing a child aged 3 years

7 10 5 6 5 7 9 7 7 11
3 3 7 10 4 3 3 9 6 5
3 6 8 10 5 8 3 4 7 4
8 4 8 3 4 2 7 10 3 9
7 4 10 4 9 6 7 6 6 5
4 15 10 6 8 4 3 12 6 3
8 9 10 8 7 4 7 7 10 5
3 7 4 3 2 5 10 14 9 5
3 8 6 10 7 6 9 5 9 4

Figure 2.4. Bar chart for data in table 2.4(b). Lengths of 100 utterances (in morphemes) of an adult addressing a child aged 3 years.
Table 2.5(a). Scores of 108 students in the June 1980 Cambridge Proficiency Examination

194 184 135 161 186 198 190 240 147 174 197
176 183 117 161 185 186 208 200 157 217 191
190 192 145 162 186 148 241 184 201 208 117
135 229 208 209 203 145 201 204 192 179 179
224 209 179 223 192 221 239 238 199 174 145
226 214 211 215 176 178 184 221 198 196 184
164 209 142 196 160 165 166 224 229 184
147 163 207 179 197 120 255 150 233 188 175
225 156 211 190 204 222 219 186 160 189
218 149 160 188 224 140 220 149 170 197

Table 2.5(a) lists the total score of each of 108 students taking the Cambridge Proficiency Examination (CPE) of June 1980 at a European centre. The marks range from a low of 117 to a maximum of 255, giving a range of 255 - 117 = 138. The most frequent score, 184, appears only five times, and clearly it would be inappropriate to attempt to construct a bar chart using the individual scores.

The first step in summarising this data is to group the scores into around ten classes (between eight and 15 is usually convenient and practical). The first column of table 2.5(b) shows the classes decided on. The first will contain all those marks in the interval 110-124, the second those lying in the range 125-139, and so on.

Now we count the number of scores belonging to each class. The most efficient, and accurate, method of doing so is to work through the list of scores, crossing out each one in turn and noting it by a tally mark opposite the corresponding class interval. Tallies are usually counted in blocks of five, the fifth tally stroke being made diagonally to complete a block, as in column 2 of the table. The number of tally marks is noted for each class in column 3. These frequencies are then used to construct what is referred to as a histogram of the data (figure 2.5). No gaps are left between the rectangles, it being assumed by convention that each class will contain all those scores greater than, or equal to, the lower bound of the interval but less than the value of the lower bound of the next interval. So, for example, a score of 125 will go into the tally for the class 125-139, as will a score of 139. Provided the class intervals are equal, the height of each rectangle corresponds to the frequency of each class. As in the case of the bar chart, relative frequencies may be used to construct the histogram, this entailing only a change of scale on the vertical axis. Great care has to be taken not to draw a misleading diagram when the class intervals are not all of the same size - a good reason for choosing them to be equal. However, in unusual cases it may not be appropriate to have equal width intervals, or one may wish to draw a histogram based on data already grouped into classes of unequal width (see exercise 2.4 and figure 5.2).

Table 2.5(b). Frequency table of the scores of 108 students in the 1980 Cambridge Proficiency Examination

Class      Tally                        Frequency   Relative    Cumulative   Relative
interval                                            frequency   frequency    cumulative
                                                                             frequency
110-124    ||                                  2      0.02          2          0.02
125-139    ||                                  2      0.02          4          0.04
140-154    |||| |||| |                        11      0.10         15          0.14
155-169    |||| |||| ||                       12      0.11         27          0.25
170-184    |||| |||| |||| ||||                19      0.18         46          0.43
185-199    |||| |||| |||| |||| |||            23      0.21         69          0.64
200-214    |||| |||| |||| ||                  17      0.16         86          0.80
215-229    |||| |||| ||||                     15      0.14        101          0.94
230-244    |||| |                              6      0.06        107          0.99
245-259    |                                   1      0.01        108          1.00
                                             108

Figure 2.5. Histogram of the frequency data in table 2.5(b).
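The tallying procedure amounts to assigning each score to the class whose lower bound it reaches but whose upper limit it falls short of. A minimal Python sketch, assuming equal class widths (the helper name `frequency_table` and the small score set are ours, not the CPE data):

```python
# Group scores into equal class intervals, as for table 2.5(b): each
# class runs from its lower bound up to (but not including) the lower
# bound of the next class.
def frequency_table(scores, low, width, n_classes):
    freq = [0] * n_classes
    for s in scores:
        k = (s - low) // width        # index of the class containing s
        freq[k] += 1
    return freq

# A small illustrative data set (not the 108 CPE scores themselves).
scores = [117, 125, 139, 138, 184, 184, 200, 214, 215, 255]
print(frequency_table(scores, low=110, width=15, n_classes=10))
# -> [1, 3, 0, 0, 2, 0, 2, 1, 0, 1]
```

Note that 125 and 139 both land in the class 125-139, exactly as the tallying rule above requires.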

The fifth column of table 2.5(b) contains the cumulative frequencies for the scores of the 108 students. These cumulative frequencies can be interpreted as follows: two students scored less than 125 marks, 27 scored less than 170 marks, and so on. The relative cumulative frequencies of column 6, obtained by dividing each cumulative frequency in column 5 by the total frequency, 108, are usually more convenient, being easily translated into statements such as '25% of the students scored less than 170 marks', while '20% scored at least 215 marks', and so on.

Figure 2.6. Relative cumulative frequency curve for data in table 2.5(b).
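Both the cumulative columns of table 2.5(b) and a reading taken from a curve like figure 2.6 can be imitated in a few lines. A Python sketch, assuming the class boundaries and frequencies of table 2.5(b) and straight-line interpolation between the plotted points (so the answer need not agree exactly with a value read off the printed curve by eye; the helper name `percentile` is ours):

```python
# Class upper boundaries and frequencies as in table 2.5(b).
bounds = [125, 140, 155, 170, 185, 200, 215, 230, 245, 260]
freqs = [2, 2, 11, 12, 19, 23, 17, 15, 6, 1]

total, cum, running = sum(freqs), [], 0.0
for f in freqs:
    running += f / total
    cum.append(running)               # relative cumulative frequencies

def percentile(p, bounds, cum):
    """Score below which a proportion p of candidates fall, found by
    linear interpolation between successive points of the curve."""
    # Anchor the curve at the lower bound of the first class
    # (assumed 15 wide, like the others).
    pts = [(bounds[0] - 15, 0.0)] + list(zip(bounds, cum))
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if y1 >= p:
            return x0 + (p - y0) / (y1 - y0) * (x1 - x0)

print(round(percentile(0.10, bounds, cum)))   # 149
print(round(percentile(0.50, bounds, cum)))   # 190
```

The interpolated tenth percentile, 149, is close to the value of about 151 read from the printed curve, and the interpolated median agrees with the value of about 190 quoted in the text.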
We can also make use of this table to answer questions such as 'What is the mark that cuts off the lowest (or highest) scoring 10% of students from the remainder?' Probably the easiest way to do this is via the relative cumulative frequency curve of figure 2.6. To draw this curve we plot each relative cumulative frequency against the value at the beginning of the next class interval, i.e. plot 0.02 vertically above 125, 0.43 vertically above 185, and so on. Then join up the points with a series of short, straight lines connecting successive points. If we now find the position on the vertical axis corresponding to 10% (0.1), draw a horizontal line till it reaches the curve and then drop vertically onto the X-axis, the corresponding score, 151, is an estimate (see a later chapter) of the score which 10% of students fail to reach. This score is called the tenth percentile score. Any other percentile can be obtained in the same way.

Certain percentiles have a historical importance and special names. The 25th and 75th percentiles are known respectively as the first quartile and the third quartile (being one-quarter and three-quarters of the way up through the scores), while the 50th percentile, 'halfway' through the data, is called the median score. These values are discussed further in the next chapter. In this example, the median is about 190.

2.3 Multi-way tables
It is not uncommon for each observation in a linguistic study to be cross-classified by several factors. Until now we have looked only at one-way classification (e.g. table 2.1) and two-way classification (e.g. table 2.2 - in this latter table the two classifying factors are sex on the one hand and type of language disorder on the other). When subjects, or any other kind of experimental units, are classified in several different ways it is still possible, and helpful, to tabulate the data.

Khan (forthcoming) carried out a study of a number of phonological variables in the spoken English of 44 subjects in Aligarh city in North India. The subjects were classified by sex, age and social class. There were three age groups (16-30, 31-45, over 45) and three social classes (LM - lower middle, UM - upper middle, and U - upper). Table 2.6, an example of a three-way table, shows the distribution of the sample of subjects over these categories.

Table 2.6. 44 Indian subjects cross-classified by sex, age, and class

          Male                            Female
          Class                           Class
Age       LM   UM   U   Total    Age      LM   UM   U   Total
16-30      4    4   2    10      16-30     4    4   2    10
31-45      2    2   2     6      31-45     2    2   2     6
Over 45    4    4   4    12      Over 45   -    -   -     -
Total     10   10   8    28      Total     6    6   4    16

Khan studied several phonological variables relevant to Indian English, including subjects' pronunciation of /d/. The subjects' speech was recorded in four different situations and Khan recognised three different variants of /d/ in her subjects' speech: d-1, an alveolar variant very similar to English pronunciation of /d/; d-2, a post-alveolar variant of /d/; and d-3, a retroflex variant which is perceived as corresponding to a 'heavy Indian accent'. This type of study produces rather complex data since there were 12 scores for each subject: a score for each of the three variants in each of the four situations. One way to deal with this (others are

suggested in chapters 14 and 15) is to convert the three frequencies into a single score, which in this case would be an index of how 'Indian-like', on average, a subject's pronunciations of /d/ are. A method suggested by Labov (1966) is used to produce the data presented in table 2.7 (details of the method, which need not concern us here, are to be found in 15.1). Table 2.8 gives the average phonological indices for each sex by class combination.

Table 2.7. Phonological indices for /d/ of 44 Indian speakers of English

          Male                         Female
          Class                        Class
Age       LM     UM     U     Age      LM     UM     U
16-30     47.8   50.0   45.9  16-30    56.9   54.8   57.1
          49.4   48.4   51.6           43.1   48.7   52.8
          48.5   49.3                  53.3   47.2
          48.9   52.0                  46.7   44.8
31-45     57.8   50.9   66.7  31-45    51.2   46.5   49.4
          59.4   57.8   62.1           47.9   43.3   51.6
Over 45   52.6   56.4   58.6  Over 45  No data
          44.0   61.3   61.3
          59.6   57.1   59.4
          52.1   51.6   64.7

Table 2.8. Average phonological index for /d/ of 44 Indian speakers of English cross-classified by sex and social class

Class           Male        Female      Both sexes
Lower middle    51.9 (10)   52.2 (6)    52.0 (16)
Upper middle    53.5 (10)   48.3 (6)    51.6 (16)
Upper           57.5 (8)    50.7 (4)    55.2 (12)
All classes     54.1 (28)   50.4 (16)   52.7 (44)

Note: The figures in brackets give the number of subjects whose scores contributed to the corresponding average score.

2.4 Special cases
Although it will usually be possible to display data using one of the basic procedures described above, you should remain always alive to the possibility that rather special situations may arise where you may need to modify or extend one of those methods. You may feel that unusual types of data require a rather special form of presentation. Remember always that the major purpose of the table or graph is to communicate the data more easily without distorting its general import.

Hughes (1979) tape-recorded all the English spoken to and by an adult Spanish learner of English over a period of six months from the time that she began to learn the language, solely through conversation with the investigator. Hughes studied the frequency and accuracy with which the learner produced a number of features of English, one of which was the possessor-possessed ordering in structures like Marta's book, which is different from that of its equivalent in Spanish, el libro de Marta. The learner produced both sequences with English order, like Marta book, Marta's book, and sequences which seemed to reflect possessor-possessed order in Spanish, like book Marta. The frequency of such structures during the first 45 hours of learning, and their accuracy, is displayed in figure 2.7, reproduced from the original. The figure contains information about both spontaneous phrases, initiated by the learner, and imitations. How was it constructed? First, the number of relevant items per hour is indicated by the vertical scale at the right and represented by empty bars (imitations) and solid bars (spontaneous phrases). Second, the percentage of possessor-possessed noun phrases correct, over time, is read from the vertical scale at the left of the graph and represented by a continuous line (spontaneous phrases) and a dotted line (imitations - they were all correct in this example). It is clear that the learner is successful with imitations from the beginning, but much slower in achieving the accurate order in her spontaneous speech. It should be clear from earlier examples how the bars representing frequency are constructed. But how was accuracy defined and represented on this graph?

Hughes first scored all attempts, in order of occurrence, as correct (O) or incorrect (X) in terms of the order of the two elements (imitations and non-imitations being scored separately).

Response no.  1  2  3  4  5  6  7  8  9  10  11  12
              X  X  X  X  X  X  X  O  O  O   X   O

In order not to lose the benefit of the completeness of the record, the data were treated as a series of overlapping samples. Percentages of successful attempts at possessor-possessed noun phrases in every set of ten successive examples were calculated. From examples 1-12 the following percentages are derived:

Correct in responses 1-10 = 30%
                     2-11 = 30%
                     3-12 = 40%
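The overlapping ten-response samples are easily recomputed. A minimal Python sketch over the O/X record quoted above:

```python
# Percentage correct in each overlapping window of ten responses:
# responses 1-10, then 2-11, then 3-12, of the record above.
record = "XXXXXXXOOOXO"          # responses 1-12, O = correct
window = 10
percentages = [
    100 * record[i:i + window].count("O") // window
    for i in range(len(record) - window + 1)
]
print(percentages)    # [30, 30, 40]
```

Each entry would then be assigned to the mid-point of its window, as the text goes on to describe.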
Each percentage was then assigned to the mid-point of the series of responses to which it referred. Thus 30% was considered as the level of accuracy in the period between responses numbered 5 and 6, still 30% in the period between responses 6 and 7, and so on. The average level of accuracy for each hour was then calculated (this kind of average is usually called a moving average, since the sample on which it is based 'moves' through time), and this is what is shown in figure 2.7, the points joined by an unbroken line for non-imitations, and by a broken line for imitations. These lines give an indication of the way in which the ability of the speaker to express the possessor-possessed ordering correctly was progressing through time.

Figure 2.7. Possessor-possessed noun phrases in the speech of a Spanish learner of English (Hughes 1979).

We have certainly not exhausted the possibilities for tabular and graphical presentation of data in this short chapter, though you should find that one of the options proposed here will suffice for many data sets. It is worth reiterating that the purpose of such simple summaries is to promote a rapid understanding of the main features of the data, and it goes without saying that no summary method should be used if it obscures the argument or covers serious deficiencies in the data.

SUMMARY
This chapter has shown a number of ways in which data can be summarised in tables and diagrams.
(1) A distinction was made between categorical and numerical data.
(2) The notion of relative frequency, in terms of proportions or percentages, was introduced.
(3) Advice was given on the construction of:
    (a) tables: one-way, two-way, and multi-way
    (b) diagrams: bar charts, histograms, and cumulative frequency curves
(4) Percentiles, quartiles and median scores were introduced.
(5) It was pointed out that there may be occasions when unusual data will necessitate modification of the basic procedures.
(6) It was emphasised that the function of tables and diagrams is to present information in a readily assimilable form and without distortion. In particular, it was urged that proportions and percentages should always be accompanied by an indication of original frequencies.

EXERCISES
(1) In the Quirk Report (Quirk 1972) the following figures are given as estimates of the number of children in different categories who might require speech therapy. Construct a table of relative frequencies for these categories.
    (i) Pre-school age children with speech and language problems: 60 000
    (ii) Children in ordinary schools with speech and language problems: 180 000
    (iii) Physically and/or mentally handicapped children: 42 800
    (iv) Special groups (language disorders, autism, etc.): 5 000

(2) Assume that the male/female split in the categories above is as follows:
    (i) 40 000/20 000
    (ii) 100 000/80 000
    (iii) 35 000/7 800
    (iv) 4 500/500
    (a) Construct a two-way table of frequencies and relative frequencies, cross-classified by sex (cf. table 2.2(c)).
    (b) Draw a bar chart of the frequencies in each category.
    (c) Draw a bar chart of the relative frequencies for each category.

(3) The table below gives the scores of 93 students in a test of English proficiency.

183 206 150 164 162 226 189 233 159 149
188 166 205 200 146 190 236 155 203
165 237 172 152 180 141 140 181 194 208
180 173 225 185 191 168 161 209 205 166
165 144 165 165 168 186 156 187 191 204
133 161 197 173 162 179 254 206
133 171 177 193 131 167 181 152 121 114
198 156 149 187 178 176 173 149
132 181 159 153 168 207 205 198 208
    (a) Provide a frequency table (cf. table 2.5(b)).
    (b) Provide a histogram of the frequency data in the table.
    (c) Provide a relative cumulative frequency curve for the data.
    (d) What are the scores corresponding to the tenth percentile, the first quartile, and the third quartile?

(4) Suppose that the scores of 165 subjects in a language test are reported in the form of the following frequency table. Construct the corresponding histogram (note that the class intervals are not all the same size).

Class interval of scores   50-79  80-89  90-94  95-99  100-104  105-109  110-119  120-139
Number of subjects            16     20     26     28       22       19       23       11

3
Summary measures

We have seen that diagrams can be helpful as a means for the presentation of a summary version of a collection of data. Often, however, we will find it convenient to be able to talk succinctly about a set of numbers, and the message carried by a graph may be difficult to put into words. Moreover, we may wish to compare various sets of data, to look for important similarities or differences. Graphs may or may not be helpful in this respect; it depends on the specific question we want to answer and on how clearly the answer displays itself in the data. For example, if we compare figure 3.1 (derived from data in table 3.1) with figure 2.4 we can see immediately that the lengths of the utterances of a mother speaking to an 18-month-old child tend to be rather shorter than those in the speech of the same woman speaking to a child aged 3 years.

Figure 3.1. Bar chart of lengths of 100 utterances of a mother speaking to a child aged 18 months (data in table 3.1).
Table 3.1. Frequency table of the lengths of 100 utterances (in morphemes) of an adult addressing a child aged 18 months

Length of utterance   Number of utterances
 1                      3
 2                      5
 3                     28
 4                     22
 5                     14
 6                     11
 7                      8
 8                      7
 9                      2
10                      0
11                      0
12                      0
13                      0
14                      0
15                      0
16                      0
17                      0

However, it is quite rare for the situation to be so clear. In figure 3.2(a) we have drawn the histogram of the data of exercise 2.2 (reproduced in table 3.2), which consists of the total scores of 93 students at a Latin American centre in the June 1980 Cambridge Proficiency in English Examination. In figure 3.2(b) we have repeated the histogram of the scores of the European students already discussed in the previous chapter. The two histograms are rather alike but there are some dissimilarities. Can we make any precise statements about such dissimilarities in the overall level of performance of the two groups of students?

Table 3.2. Scores, ranked in ascending order, obtained by 93 candidates at a Latin American centre in the Cambridge Proficiency Examination

114 121 126 129 131 132 133 135 136 140
141 144 146 149 149 149 150 152 152 153
155 156 156 156 159 159 161 162 162 164
165 165 165 166 166 167 168 168 168 170
171 172 173 173 173 174 176 177 178 179
179 180 180 180 181 181 183 183 185 186
187 187 189 190 191 191 193 194 197 198
198 200 203 204 205 205 205 206 206 207
208 208 209 211 222 223 225 226 233 236
237 254 265

Figure 3.2(a). Histogram of CPE scores of 93 Latin American candidates (data in exercise 2.2).

Figure 3.2(b). Histogram of CPE scores of 108 European candidates (data in table 2.5).

3.1 The median
We have drawn the cumulative relative frequency curve for the Latin American group in figure 3.3(a) and again repeated that for the European students in figure 3.3(b). In both diagrams we have marked
Summary measures The arithmetic mean
in the median, or seth percentile(ihtroducediin<ch~pter-;;:.':Retl<leltlb'er-- i,,.u,:iJ in''bo't1i 'setti' who 'ilbt<lined. scores closetothe relevant' median. This, .
th~t this is the score which divid~s elieh:seubf scorces. irito:twone!ltly togetherwith 'the central position of-eachmedian 'indts own -data set,
equ~l subgroups; one of these cont~ins'all iherscores less thim the me'di~h, giVes; rise 'tii'>the idea' of a representative. score for: the group, expressed
the other all those greater than' thehmedian. We"see'th1ltthe illediah.~c!lrel '"' 1 'rn plii'aiies'st't'ch'as 'a typical'sttident' ot''uh~ratice's ofaverage:lengtb',
for the Latin American students, about '74 is somewhat lower than.that The median provides a way to specify a /typical' Or 'average' value
for the Europeans, about 190, and we might use this as the basis for which can be used as a descriptor for a whole group. Although som~
a statement that, in June 198o, a Latin American student at the celltre Latin Americans scored much higher marksthan many Europeans (indeed
of the ability range for his group scored a little Jess well than the correspond- the highest score from the two groups was from a Latin American), we
ing European. It is clear from the histograms that there are many students nevertheless can feel that we know something meaningful about the relative
performance of each group as a whole when we have obtained the two
1.0 median values.
Figure 3.3(a). Relative cumulative frequency curve for CPE scores of 93 Latin American subjects (data in exercise 2.2).

3.2 The arithmetic mean
The median is only one type of average and there are several others which can be extracted from any set of numbers. Perhaps the most familiar and widely used of these - many of us have learned to call it the average - is the arithmetic mean, or just mean, which is calculated for any set of numerical values simply by adding them together and dividing by the number of values in the set. The mean score for the Latin American students (from table 3.2) is given by:

(114 + 121 + ... + 265) ÷ 93 = 16517 ÷ 93 = 177.6

The mean length of the utterances (in morphemes) of a mother speaking to a child aged 3 years (from table 2.4(a)) is:

(2 + 7 + ... + 4) ÷ 100 = 635 ÷ 100 = 6.35
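The rule behind the mean - add the values together, then divide by how many there are - is simple enough to express directly in code. The following is a minimal sketch in Python (our own illustration; the function name is ours, and Python is not used in the book itself):

```python
def arithmetic_mean(values):
    """Add the values together and divide by the number of values in the set."""
    return sum(values) / len(values)

# The five-number set used later in this chapter has mean 8:
print(arithmetic_mean([3, 7, 8, 9, 13]))  # 8.0
```
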
Figure 3.3(b). Relative cumulative frequency curve for CPE scores of 108 European subjects (data in table 2.5).

This is a convenient point for an introduction to the kind of simple algebraic notation which we will use throughout the book. Although we will require little knowledge of mathematics other than simple arithmetic, the use of some basic mathematical notation is, in the end, an aid to clear argument and results in a great saving in space.
When we wish to refer to a general data set, without specifying particular numerical values, we will designate each number in the set by means of a letter and a suffix. For example, X26 will just mean the 26th value in some set of data. The suffixes do not imply an ordering in the values of the numbers; it will not be assumed that X1 is greater (or smaller) than X100, but only that X1 is the first and X100 the 100th in some list of numbers. For example, in table 2.4(a) the first utterance length recorded
is 2 morphemes while the last utterance length recorded is 4 morphemes: thus X1 = 2, X100 = 4.
If we wish to refer to different data sets in the same context we use different letters. For instance, X1, X2, ..., X28 and Y1, Y2, ..., Y28 would be labels for two different lists each containing 28 numbers. More generally we write X1, X2, ..., Xn to mean a list of an indefinite number (n) of numerical values, and we write Xi to refer to an unspecified member of the set.
The arithmetic mean of the data set X1, X2, ..., Xn is usually designated as X̄ and defined as:

X̄ = (sum of the values) ÷ (number of values)
  = (X1 + X2 + ... + Xn) ÷ n
  = (1/n)(X1 + X2 + ... + Xn)
  = (1/n) ΣXi

where the symbol Σ means 'add up all the Xs indicated by different values of the suffix i'. This may be simplified further and written:

X̄ = (1/n) ΣX

3.3 The mean and the median compared
We now have two possible measures of 'average' or 'typical', the mean and the median. They need not have the same value. For example, the 93 Latin American students have a mean score of 177.6, while the median extracted from figure 3.3(a) has a value of 174. Moreover, although the median score for the group of Latin American students happens to be similar to the mean score, this need not be the case - as we discuss below. Why should there be more than one average, and how can we reconcile differences between their values?
In order to be able to discuss the different properties of the median and the mean, we need to know how the former can be calculated rather than obtained from the cumulative frequency curve, although the graphical method is usually quite accurate enough for most purposes. In table 3.2 we have written the scores of the 93 Latin American students in ascending order, from the lowest to the highest mark. The 47th mark in this ranked

Table 3.3. Lengths of utterances, ranked in ascending order, of adult speaking to child aged 3 years

 3  3  3  3  3  3  3
 3  3  3  3  3  4  4  4  4  4
 4  4  4  4  4  5  5  5  5  5
 5  5  5  5  6  6  6  6  6  6
 6  6  6  7  7  7  7  7  7  7
 7  7  7  7  7  8  8  8  8
 8  8  8  9  9  9  9  9  9
 9  9 10 10 10 10 10 10 10 10
10 10 11 12 12 12 14 14 15 17

list, 174, has been circled; 46 students scored less than this and 46 scored more. Hence 174 is the median score of the group. You may like to try the same procedure for the European students. There are 108 of these and you should find that when ranked in ascending order the 54th score is 189 and the 55th 190. Thus there is no mark which exactly divides the group into two equal halves. Conventionally, in such a case, we take the average of the middle pair, so that here we would say that the median score is 189.5. The same convention is adopted even when the middle pair have the same value. The utterance lengths of table 2.4(a) are written in rank order in table 3.3. Of the 100 lengths, the 50th and 51st, the middle pair, both have the value 6, and the median will therefore be:

1/2 (6 + 6) = 6.

Although the mean and the median are both measures for locating the 'centre' of a data set, they can sometimes give rather different values. It is helpful to investigate what features of the data might cause this, to help us decide when one of the values, mean or median, might be more appropriate than the other as an indicator of a typical value for the data. This is best done by means of a very simple example. Consider the set of five numbers 3, 7, 8, 9, 13. The median value is 8 and the mean is 8 (X̄ = 1/5 (3 + 7 + 8 + 9 + 13) = 8). In this case the two measures give exactly the same result. Note that the numbers are distributed exactly symmetrically about this central value: 3 and 13 are both the same distance from 8, as are 7 and 9; the mean and the median will always have the same value when the data has this kind of symmetry. If we do the same for the values 3, 7, 8, 9, 83, we find that the median is still 8, but the mean is now 22. In other words, the presence of one extreme value has increased the mean dramatically, causing it to be rather different in value from the more typical group of small values (3, 7, 8, 9), falling between them and the very large extreme value. The median, on the other hand,
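The ranking procedure just described, including the convention of averaging the middle pair when there is an even number of values, can be sketched in a few lines of Python (our own illustration, not part of the original text):

```python
def median(values):
    """Rank the values; for an odd count take the middle one, for an even
    count take the average of the middle pair."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2

print(median([3, 7, 8, 9, 13]))   # 8   - odd number of values
print(median([5, 6, 6, 10]))      # 6.0 - average of the middle pair, 1/2 (6 + 6)
```
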
is quite unaffected by the presence of a single unusually large number and retains the value it had previously. One would say that the median is a robust indicator of the more typical values of the data, being unaffected by the occasional atypical value which might occur at one or other extreme.
When a set of data contains one or more observations which seem to have quite different values from the great bulk of the observations it may be worthwhile treating them separately by excluding the unusual value(s) before calculating the mean. For example, in this very simple case we could say that the data consists of four observations with mean 6.75 (3, 7, 8, 9) together with one very large value, 83. This is not misleading provided that the complete picture is honestly reported. Indeed, it may be a preferable way to report if there is some indication, other than just its value, that the observation is somehow unusual.
When the data values are symmetrically distributed, as in the previous example, then the mean and median will be equal. On the other hand, the difference in value between the mean and median will be quite marked if there is substantial skewness, or lack of symmetry, in a set of data; see figures 3.4(a) and 3.4(b). A data set is said to be skewed if the highest point of its histogram, or bar chart, is not in the centre, hence causing one of the 'tails' of the diagram to be longer than the other. The skewness is defined to be in the direction of the longer tail, so that 3.4(a) shows a histogram 'skewed to the right' and 3.4(b) is a picture of a data set which is 'skewed to the left'.

Figure 3.4(a). Histogram skewed to the right.

Figure 3.4(b). Histogram skewed to the left.

We have seen that the median has the property of robustness in the presence of an unusually extreme value and the mean does not. Nevertheless, the mean is the more commonly used for the 'centre' of a set of data. Why is this? The more important reasons will become apparent in later chapters, but we can see already that the mean can claim an advantage on grounds of convenience or ease of calculation for straightforward sets of data. In order to calculate the median we have to rank the data. At the very least we will need to group them to draw a cumulative frequency curve to use the graphical method. This can be very tedious indeed for a large data set. For the mean we need only add up the values, in whatever order. Of course, if the data are to be processed by computer both measures can be obtained easily.
However, there are situations when both the mean and the median can be quite misleading, and these are indicated by the diagrams in figures 3.5(a) and 3.5(b). In case (a) it is clear that the data are dominated by one particular value, easily the most typical value of the set. The most frequent value is called the modal value or mode. The median (3) is equal to the mode, while the mean (3.4) is larger. More typically, unless the mode is in the centre of the chart, the mean and the median will both be different from the mode, and from each other, but the median will be closer to the mode than the mean will be. We might feel that the mode is the best indicator of the typical value for this kind of data. Figure 3.5(b) shows a rather different case. Because the histogram is symmetrical, the mean and median are equal, but either is a silly choice as a 'typical value'. However, neither is there a single mode in this case.
These last two examples stand as a warning. For reasons which will later become clear, it becomes a habit to assume that, unless otherwise indicated, a data set is roughly symmetrical and bell-shaped. Important departures from this should always be clearly indicated, either by means
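The robustness of the median, and the sensitivity of the mean, can be checked directly with the two five-number sets used above (a Python sketch of our own):

```python
symmetric = [3, 7, 8, 9, 13]
with_extreme = [3, 7, 8, 9, 83]   # the 13 replaced by one extreme value

def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    return sorted(xs)[len(xs) // 2]   # middle value; odd-length lists only

print(mean(symmetric), median(symmetric))        # 8.0 8  - identical
print(mean(with_extreme), median(with_extreme))  # 22.0 8 - only the mean moves
```
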
of a histogram or similar graph or by a specific descriptive label - such as 'U-shaped', 'J-shaped', 'bi-modal', 'highly skewed', and so on.

Figure 3.5(a). A hypothetical bar chart with a pronounced mode.

Figure 3.5(b). A symmetrical U-shaped histogram.

3.4 Means of proportions and percentages
It happens frequently that data are presented as percentages or proportions. For example, suppose that the problem of interest were the distribution of a linguistic variable in relation to some external variable such as social class. To simplify matters we will consider the distribution of the variable (say, the proportion of times [n] is used word-finally in words like winning, running, as opposed to [ŋ]) in one social group, which we will call Middle working class (MWC). Let us suppose that we have two conditions for collecting the data: (a) a standard interview, and (b) a wordlist of 200 items. Ten different subjects are examined under each condition, and table 3.4 shows the number of times a verb with progressive ending was pronounced with final [n], expressed as a fraction of the total number of verbs with progressive endings used by the subject in the interview. If we calculate for each subject a percentage of [n]-final forms in

Table 3.4. Pronunciation of words ending '-ing' by 10 Middle working class speakers

                 (a) Interview                      (b) Wordlist
         Number of      Number of           Number of      Number of
Subject  [n] endings    tokens        %     [n] endings    tokens        %
1             10            44     22.7          36           200     18.0
2             87           193     45.1          48           200     24.0
3            111           216     51.4          64           200     32.0
4             55           103     53.4          56           200     28.0
5            145           241     60.2          70           200     35.0
6            116           183     63.4          61           200     30.5
7             77           194     39.7          62           200     31.0
8            126           218     57.8          53           200     26.5
9            109           223     48.9          77           200     38.5
10             6            32     18.8          42           200     21.0
Total        842          1647                  569          2000

the interview condition, we obtain the values in the fourth column of the table. If we take the mean of the percentages in that column, we obtain a value of 46.14%. However, on looking at the data we can see that the eight subjects who provide the great bulk of the data (1,571 tokens) have percentages which are close to or much higher than 46.14. The two subjects whose percentages are low are responsible for only 76 tokens, yet their inclusion in the sample mean as calculated above greatly reduces its numerical value. It is generally good practice to avoid, as far as possible, collecting data in such a way that we have less information about some subjects in the study than we have about others. It will not always be possible to achieve this - certainly not where the experimental material consists of segments of spontaneous speech or writing. The use of wordlists can remove the problem, since then each subject will provide a similar number of tokens. However, it is noteworthy that the two experimental conditions have given quite different results for many of the subjects; some of them give only half the proportion of [n] endings in the wordlist that they express in spontaneous speech.
At this point we should note that it may sometimes be inappropriate to take simple averages of percentages and proportions. Suppose that the data of table 3.4 for the interview condition (a) were not observed on ten different subjects but rather were the result of analysing ten different speech samples from a single individual for whom we wish to measure the percentage of [n] endings. Over the complete experiment this subject would have provided a total of 1,647 tokens, of which 842 were pronounced
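The two ways of averaging the interview percentages can be compared directly. The sketch below (Python, our own illustration) uses the per-subject counts as we read them from table 3.4; up to rounding it reproduces both the mean of the ten individual percentages (46.1%) and the pooled figure (51.1%) discussed in the text:

```python
# [n] endings and total tokens for the ten subjects under the interview
# condition, as read from table 3.4:
endings = [10, 87, 111, 55, 145, 116, 77, 126, 109, 6]
tokens  = [44, 193, 216, 103, 241, 183, 194, 218, 223, 32]

# Method 1: average the ten individual percentages (each subject counts equally).
percentages = [100 * e / t for e, t in zip(endings, tokens)]
mean_of_percentages = sum(percentages) / len(percentages)

# Method 2: pool all the tokens first (each token counts equally).
pooled = 100 * sum(endings) / sum(tokens)

print(round(mean_of_percentages, 1))  # 46.1
print(round(pooled, 1))               # 51.1
```

The two methods disagree here precisely because the subjects contributed very unequal numbers of tokens.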
[n]. This gives a percentage of (842/1647) × 100 = 51.1%. By adding all the tokens in this way we have reduced the effect of the two situations in which both the number of tokens and the proportion of [n] endings were untypically low. It will not always be obvious whether this is a better answer than the value obtained by averaging the ten individual percentages. You should think carefully about the meaning of the data and have a clear understanding of what it is you are trying to measure. Of course, if the base number of tokens is constant it will not matter which method of calculation is used; both methods give the same answer. For the wordlist condition (b) of table 3.4 the average of the ten percentages is 28.45% and 569 is exactly 28.45% of 2,000.
Suppose an examination consists of two parts, one, oral, scored out of 10 by an observer, the other a written multiple-choice paper with 50 items. The examination score for each subject will consist of one mark out of 50 and another out of 10. For example, a subject may score 20/50 (40%) in the written paper and 8/10 (80%) in the oral test. How should his overall percentage be calculated? The crux of the matter here is the weight that the examiner wishes to give to each part of the test. If the oral is to have equal weight with the written test then the overall score would be 1/2 (40% + 80%) = 60%. However, if the scoring system has been chosen to reflect directly the importance of each test in the complete examination, the written paper should be five times as important as the oral. The tester might calculate the overall score by the second method above:

(20 + 8) ÷ (50 + 10) = 28/60 = 46.7%

This latter score is frequently referred to as a weighted mean score since the scores for the individual parts of the examination are no longer given equal weights. It may be calculated thus:

(50 × 40% + 10 × 80%) ÷ 60 = 46.7%

where each percentage is multiplied by the required 'weight'. If the scores for the individual parts were given equal weights the mean score would be 60%, as above. It is important to be clear about the meaning of the two methods and to understand that they will often lead to different results.
It sometimes occurs that published data are presented only as proportions or percentages without the original values being given. This is a serious error in presentation since, as we will see in later chapters, it often prevents analysis of the data by another researcher. It may be difficult to interpret a percentage or a mean percentage when it is not clear what was the base quantity (i.e. total number of observations) over which the original proportion was measured, especially when the raw (i.e. original) values on which the percentages were based are not quoted.

3.5 Variability or dispersion
We introduced the mean and the median as measures which would be useful for the comparison of data sets, and it is certainly true that if we calculate the means (or medians) of two sets of scores, say, and find them to be very different, then we will have made a significant discovery. If, on the other hand, the two means are similar in value, or the medians are, this will not usually be sufficient evidence for the statement that the complete sets of scores have a similar overall shape or structure. Consider the following extreme, but artificial, example. Suppose one group of 50 subjects have all scored exactly the same mark, 35 out of 50 say, in some test; while in another group of the same size, 25 score 50 and 25 score 20. In each case the mean and median mark would be 35. However, there is a clear difference between the groups. The first is very homogeneous, containing people all of an equal, reasonably high level of ability, while the second consists of one subgroup of extremely high-scoring and another of rather low-scoring subjects. The situation will rarely be as obvious as this, but the example makes it clear that in order to make meaningful comparisons between groups it will be necessary to have some measure of how the scores in each group relate to their 'typical value' as determined by the mean or median or some other average value. It is therefore usual to measure (and report) the degree to which a group lacks homogeneity, i.e. the variability, or dispersion, of the scores about the average value.
Furthermore, we have argued in the opening chapter that it is variability between subjects or between linguistic tokens from the same subject which makes it necessary to use statistical techniques in the analysis of language data. The extent of variability, or heterogeneity if you like, in a population is the main determinant of how well we can generalise to that population the characteristics we observe in a sample. The more variability, the greater will be the sample size required to obtain a given quality of information (see chapter 7). It is essential to have some way of measuring variability.

3.6 Central intervals
We would not expect to find in practice either of the outcomes described in the previous section. For example, when we administer a
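The weighted mean of the examination example can be written as a small function (a Python sketch of our own; the function name is ours):

```python
def weighted_mean(values, weights):
    """Multiply each value by its weight and divide by the total weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Written paper: 40%, weighted 50; oral: 80%, weighted 10.
print(round(weighted_mean([40, 80], [50, 10]), 1))  # 46.7
print(weighted_mean([40, 80], [1, 1]))              # 60.0 - equal weights
```
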
test to a group of subjects we usually expect their marks to vary over a more or less wide range. Table 3.5 gives two plausible frequency tables of marks for two different groups of 200 subjects. The corresponding histograms and cumulative frequency curves are given in figures 3.6 and 3.7. We can see from the histograms that the marks of the second group are more 'spread out' and from the cumulative frequency curves that the median mark is the same for both groups (i.e. 42).

Table 3.5. Frequency tables of marks for two groups of 200 subjects

                        Group 1                             Group 2
Class                              Relative                            Relative
interval                Cumulative cumulative               Cumulative cumulative
(marks)     Frequency   frequency  frequency    Frequency   frequency  frequency
0-4                                                  4           4       0.02
5-9                                                  4           8       0.04
10-14            1           1       0.01            6          14       0.07
15-19            1           2       0.01            8          22       0.11
20-24            6           8       0.04            6          28       0.14
25-29           10          18       0.09           12          40       0.20
30-34           14          32       0.16           17          57       0.29
35-39           36          68       0.34           21          78       0.39
40-44           62         130       0.65           43         121       0.61
45-49           38         168       0.84           23         144       0.72
50-54           17         185       0.93           16         160       0.80
55-59            8         193       0.97            9         169       0.85
60-64            5         198       0.99            8         177       0.89
65-69            2         200       1.00            4         181       0.91
70-74                                                6         187       0.94
75-79                                                4         191       0.96
80-84                                                4         195       0.98
85-89                                                3         198       0.99
90-94                                                2         200       1.00
95-99

Figure 3.6(a). Histogram for data of Group 1 from table 3.5.

Figure 3.6(b). Histogram for data of Group 2 from table 3.5.

How can we indicate numerically the difference in the spread of marks? One way is by means of intervals centred on the median value and containing a stated percentage of the observed data values. For example, q1 and q3, the first and third quartiles, have been marked on both cumulative frequency curves.1 This means that generally half of all the observed values will lie between q1 and q3. We write (q1, q3) to represent all the possible values between q1 and q3 and we would say that (q1, q3) is a 50% central interval; 'central' because it is centred on the median, which itself is the centre point of the data set, and '50%' because it contains 50% of the values in the data set. The length of the interval, q3 minus q1, is called the interquartile distance (sometimes, the interquartile range). In the example above, figure 3.7(a) shows a 50% central interval (37.5, 47.5) while for figure 3.7(b) the interval is (32.5, 51.5). The interquartile distance gives a measure of how widely dispersed are the data. If at least half the values are very close to the median, the quartiles will be close together; if the data do not group closely around the median, the quartiles will be further apart. For the two cases shown in figures 3.7(a) and 3.7(b), the interquartile distances are 10 and 19 respectively, reflecting the wider

1 These were defined in 2.2. One-quarter of the observed values are smaller than the first quartile, while three-quarters are smaller than the third quartile. The first quartile, q1, is often referred to as the 25th percentile, P25, because 25% of the data have a value smaller than q1. Similarly, q3 may be called the 75th percentile, P75.
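Quartiles no longer need to be read off a cumulative frequency curve; any statistical package will compute them. As a sketch, Python's standard library offers `statistics.quantiles` (our own illustration on an invented data set; note that several slightly different conventions for computing quartiles are in use, so different packages can disagree in the decimal places):

```python
import statistics

marks = [2, 4, 5, 6, 6, 7, 9, 11, 12, 15, 18, 20]

q1, q2, q3 = statistics.quantiles(marks, n=4)  # quartiles under one common convention
print(q2)                                      # the median, 8.0
print((q1, q3))                                # the 50% central interval
print(q3 - q1)                                 # the interquartile distance, 9.0
```
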
dispersion of the values in Group 2. The quartiles do not have any special claim to exclusive use for defining a central interval. Indeed, as we shall see in later chapters, central intervals containing a higher proportion of the possible values are much more commonly used.

Figure 3.7(a). Relative cumulative frequency curve for data of Group 1 from table 3.5.

Figure 3.7(b). Relative cumulative frequency curve for data of Group 2 from table 3.5.

3.7 The variance and the standard deviation
Before the use of computers to manage experimental data it was common to reject the use of the interquartile distance as a measure of dispersion because of the need to rank the data values to obtain the median and quartiles. With the advent of easy-to-use computer packages like MINITAB (see Appendix B) this is no longer a constraint, and the quartiles and percentiles can provide a useful descriptive measure of variability. However, classical statistical theory is not based on percentiles, and other measures of dispersion have been developed whose theoretical properties are better known and more convenient. The first of these that we will discuss is the variance.

Table 3.6. Calculation of the variance of number of words per sentence in a short reading text

Number of words
per sentence      di (= Xi - X̄)      di²
      3              -6.75          45.56
      7              -2.75           7.56
      4              -5.75          33.06
      9              -0.75           0.56
     12               2.25           5.06
      2              -7.75          60.06
     14               4.25          18.06
     14               4.25          18.06
      6              -3.75          14.06
      9              -0.75           0.56
     11               1.25           1.56
     17               7.25          52.56
     18               8.25          68.06
     12               2.25           5.06
     10               0.25           0.06
      8              -1.75           3.06

ΣX = 156        Σdi = 0        Σdi² = 332.96
X̄ = 9.75

V = Σdi² ÷ (n - 1) = 332.96/15 = 22.20

Suppose we have a data set X1, X2, ..., Xn with arithmetic mean, X̄. We could calculate the difference between each value and the mean value, as follows: d1 = X1 - X̄, d2 = X2 - X̄, etc. It might seem that the average of these differences would give some measure of how closely the data values cluster around the mean, X̄. In table 3.6 we have carried out this calculation for the number of words per sentence in a short reading text. Notice that some values of the difference (second column) are positive, some are negative, and that when added together they cancel out to zero. This will happen with any set of data. However, it is still appealing to use the difference between each value and the mean value as a measure of dispersion. There is a method of doing this which will not give the
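The whole of table 3.6 can be reproduced in a few lines. The sketch below (Python, our own illustration) computes the deviations, squares them, and divides the total of the squares by n - 1; the tiny discrepancy with the table's total of 332.96 arises only because the table rounds each squared deviation to two decimal places:

```python
words = [3, 7, 4, 9, 12, 2, 14, 14, 6, 9, 11, 17, 18, 12, 10, 8]  # table 3.6

n = len(words)
mean = sum(words) / n                     # 9.75
deviations = [x - mean for x in words]    # the d_i column; these sum to zero
squared = [d ** 2 for d in deviations]    # the d_i^2 column

variance = sum(squared) / (n - 1)         # divide by (n - 1), not n
print(round(sum(deviations), 10))  # 0.0
print(round(variance, 2))          # 22.2
```
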
same answer (i.e. zero) for every data set, and which has other useful mathematical properties. After calculating the differences di, we square each to get the squared deviations about the mean, di² (third column). We next total these squared deviations (Σdi² = 332.96). Finally, we divide this total by (n - 1) (the number of values minus 1). The result of this calculation is the variance (V):

V = Σdi² ÷ (n - 1)    or    V = (1/(n - 1)) Σdi²

This would be the arithmetic mean of the squared deviations if the divisor were the sample size n. The reason for dividing by (n - 1) instead is technical and will not be explained here. Of course, if n is quite large, it will hardly make any difference to the result whether we divide by n or by (n - 1), and it is convenient, and in no way misleading, to think of the variance as the average of the squared deviations.
One inconvenient aspect of the variance is the units in which it is measured. In table 3.6 we calculated the variance of the number of words per sentence in a short reading text. Each Xi is a number of words and X̄ is therefore the mean number of words per utterance. But what are the units of di²? If we multiply five words by five words what kind of object do we have? Twenty-five what? The variance, V, will have the same units as each di², i.e. 'square words'. This concept is unhelpful from the point of view of empirical interpretation. However, if we take the square root of the variance, this will again have the unit 'words'. This measure, which is referred to as the standard deviation, referred to by the symbol s, has many other useful properties and, as a result, the standard deviation is the most frequently quoted measure of variability. Note, in passing, that s = √V; thus s² = V, and the variance is frequently referred to by the symbol s².
Before electronic calculators were cheap and readily available it was the custom at this point in statistical textbooks to introduce a number of formulae to simplify the process of calculating s or (equivalently) s². However, on a relatively inexpensive calculator the mean, X̄, and the standard deviation, s, of any data set can be obtained simultaneously by entering the data values into the calculator in a prescribed fashion.2

2 Many calculators will give two possible values of the standard deviation. One of these, designated sn-1 or σn-1, is the value calculated according to our formula and we recommend its use on all occasions. The other, sn or σn, has been calculated by replacing (n - 1) by n in our formula, and its use is best avoided. As noted above, when n is reasonably large there will be very little difference between the two values.

The standard deviation is one of the most important statistical measures. It indicates the typical amount by which values in the data set differ from the mean, X̄, and no data summary is complete until all relevant standard deviations have been calculated.

3.8 Standardising test scores
A major application of the standard deviation is in the process known as standardising, which can be carried out on any set of ordered, numerical data. One important use of standardising in language studies is to facilitate the comparison of test scores, and an example of this will be used to demonstrate the method.
Suppose that two individuals, A and B, have been tested (each by a different test) on their language proficiency and have achieved scores of 41 and 53 respectively. Naively, one might say without further ado that B has scored 'higher' or 'better' than A. However, they have been examined by different methods and questions must be asked about the comparability of the two tests. Let us suppose that both tests were taken by large numbers of subjects at the same time as they were taken by the two individuals that interest us, and that we have no reason to believe that, overall, the subjects taking the two tests differed in any systematic way. Suppose also that the mean score on the first test was 44 and that on the second test the mean was 49. We note immediately that A has scored below the average in one test while B has scored above the average in the other, so that there is an obvious sense in which B has done better than A. The comparison will not always be so obvious.
Let us assume, as before, scores of 41 and 53 for A and B in their respective tests, but that now the mean scores for the two tests are, respectively, 49 and 58. Both individuals now have lower than average scores. A has scored 8 marks below the mean, while B has just 5 marks less than the mean for this test. Does this imply that B has achieved a relatively higher score? Not necessarily. It depends how 'spread out' the scores are for the two tests. It is just conceivable that B's mark of 53 is the lowest achieved by anyone for the test while, say, 40% of the candidates on the other test scored less than A. In that case, B could be right at the bottom of the ability range while A would be somewhere near the middle. The comparison can properly be made only by taking into account the standard deviations of the test scores as well as their means. Suppose, for example, that the two tests had standard deviations of 8 marks and 5 marks respectively. Then the distance of A's score of 41 from the mean of his test is exactly the value of the standard deviation (49 - 8 = 41).
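Putting A and B on a common footing amounts to asking how many standard deviations each score lies from the mean of its own test, which is the calculation behind the standardising procedure. A quick check of the example in Python (our own sketch):

```python
def z_score(x, mean, sd):
    """Distance of a score from the mean, in standard-deviation units."""
    return (x - mean) / sd

# A: score 41 on a test with mean 49, standard deviation 8.
# B: score 53 on a test with mean 58, standard deviation 5.
print(z_score(41, 49, 8))  # -1.0
print(z_score(53, 58, 5))  # -1.0 - the same relative position
```
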
Summary mea~ures Summary
The same is true for B. The distance of his score of 53 from the mean is exactly the value of the standard deviation on the test that he took (58 - 5 = 53). Both scores can be said to be one standard deviation less than their respective means and, taking into account in this way the different dispersion of scores in the two tests, there is a sense in which both A and B have achieved the same scores relative to the other candidates who took the tests at the same time.

If the complete sets of scores for the two above tests had had exactly the same mean and exactly the same standard deviation, any mark on one test would have been directly comparable with the same mark on the other. But of course this is not usually the case. Standardising test scores facilitates comparisons between scores on different tests. To illustrate, suppose now that we have the scores X1, X2, ..., etc. of a set of subjects for some test and that the mean and standard deviation of these scores are X̄ and s respectively. Begin by changing each score, Xi, to a new score, Yi, by subtracting the mean:

    Yi = Xi - X̄

Now change each of the scores, Yi, into a further new score, Zi, by dividing each Yi by s, the standard deviation of the original X scores:

    Zi = Yi ÷ s

In table 3.7 this procedure has been applied to a small set of hypothetical scores to demonstrate the outcome.

Table 3.7. An example of the standardising procedure on a hypothetical data set

    X          subtract X̄ → Y     divide by s → Z
    73              20                  1.32
    42             -11                 -0.73
    36             -17                 -1.12
    51              -2                 -0.13
    63              10                  0.66

    X̄ = 53         Ȳ = 0              Z̄ = 0
    sX = 15.12     sY = sX = 15.12    sZ = 1

The original scores of the first column have a mean of X̄ = 53 and a standard deviation of sX = 15.12. The mean, X̄, is subtracted from each of the original scores to give a new value, Y, in the second column. The mean of these new values is Ȳ = 0, but the standard deviation is sY = 15.12, the same value as before. Finally, each value in the second column is changed into a standardised value, Z, by dividing by 15.12. The five values in column 3 now have a mean of zero and a standard deviation of 1.

By these two steps we have changed a set of scores X1, X2, etc. with mean X̄ and standard deviation s into a new set of scores Z1, Z2, etc. with mean zero and standard deviation 1. The Z scores are the standardised X scores. The process of standardising can be described by the single formula:

    Z = (X - X̄) ÷ s

Suppose two different language proficiency tests which purport to measure the same ability are administered to all available high school students in a certain area, each test on a different day. Their scores on the first test have a mean of 92 and a standard deviation of 14; on the second test, a mean of 143 and a standard deviation of 21. Student A, who for some reason or other is prevented from taking the second test, scores 121 on the first. Student B misses the first test but scores 177 on the second. If we wanted to compare the performance of these two students we could standardise their scores. The standardised score for Subject A is:

    ZA = (121 - 92) ÷ 14 = 2.07

and for Subject B is:

    ZB = (177 - 143) ÷ 21 = 1.62

Since ZA is greater than ZB we would say that Subject A has scored higher than Subject B on a standardised scale. It will be seen in chapters 6-8 that quite precise statements can often be made about such comparisons and that standardised values play an important role in statistical theory.

SUMMARY
This chapter introduces and explains various numerical quantities or measures which can be used to summarise sets of data in just a few numbers.
(1) The median and the mean (or arithmetic mean) were defined as measures of the 'typical value' of a set of data.
(2) The properties of the mean and median were discussed, and the median was shown to be more robust than the mean in the presence of one or two unusually extreme values.
(3) It was pointed out that there were two common ways (the ordinary mean
or the weighted mean) of calculating the mean proportion of a set of proportions, and the motivation and result of each type of mean were discussed.
(4) The variance, the standard deviation and the interquartile distance were presented as measures of variability or dispersion; the concept of the central interval was introduced.
(5) Standardised scores were explained and shown to be an appropriate tool for comparing the scores of different subjects on different tests.

EXERCISES
(1) (a) Work out the mean, median and mode for the data in table 2.4. What conclusions can you reach about the most suitable measure of central tendency to apply to these data?
    (b) Follow the same procedures for the data of table 3.1.
(2) The following table gives data from the June 1980 application of the CPE to a group of Asian students. For this data, construct a table of frequencies within appropriate class intervals, cumulative frequencies and relative cumulative frequencies. Draw the corresponding histogram and the relative cumulative frequency curve. Estimate the median and the interquartile distance.

    123  132  154  136  121  220  106   92  127  134
    127   70  111  116   70  131  136  170   74  114
     65  112   82  193  172  134  221  217  138  138
     51  113  136  108   97  146   75  188  123   92
    191  195   74  173  167  159  149  115  147   85
     88   96  255   93  171  219   84  118   90  111
    128   78  213  149  110  256  256  172  129  110

(3) Calculate mean and standard deviation scores for the mean length of utterance data in tables 2.4 and 3.1.
(4) An institute which offers intensive courses in oriental languages assigns beginning students to classes on the basis of their language aptitude test scores. Some students will have taken Aptitude Test A, which has a mean of 120 and a standard deviation of 12; the remainder will have taken Test B, which has a mean of 100 and a standard deviation of 15. (The means and standard deviations for both tests were obtained from large administrations to comparable groups of students.)

    Student    Test A    Test B
    P           132
    Q           124
    R           122
    S                      82
    T                      75
    U

    (a) Calculate the standardised score for each of the students listed in the table and rank the students according to their apparent ability, putting the best first.
    (b) The institute groups students into classes C, D, E or F according to their score on Test A as follows: those scoring at least 140 are assigned to Class C; those with at least 120 but less than 140 are Class D; those with at least 105 but less than 120 are Class E. The remainder are assigned to Class F. In which classes would you place students R, T and U?
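The standardising procedure is entirely mechanical, so it can be checked by machine. Below is a minimal Python sketch; `standardise` is our own name for the two-step procedure, not a library routine, and the figures are those of table 3.7 and of the two proficiency tests discussed in this chapter:

```python
from statistics import mean, stdev

def standardise(scores):
    """Two-step standardising: subtract the mean, then divide by s."""
    x_bar = mean(scores)
    s = stdev(scores)  # the (n - 1) formula recommended in this chapter
    return [(x - x_bar) / s for x in scores]

# The hypothetical data of table 3.7
z = standardise([73, 42, 36, 51, 63])
print([round(v, 2) for v in z])  # [1.32, -0.73, -1.12, -0.13, 0.66]

# Students A and B on two different proficiency tests
z_a = (121 - 92) / 14    # test 1: mean 92, standard deviation 14
z_b = (177 - 143) / 21   # test 2: mean 143, standard deviation 21
print(round(z_a, 2), round(z_b, 2))  # 2.07 1.62  (A scored higher)
```

The resulting Z scores always have mean zero and standard deviation 1, whatever the original mean and spread, which is exactly what makes scores from different tests comparable.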
4
Statistical inference

4.1 The problem
Until now we have been considering how to describe or summarise a set of data considered simply as an object in its own right. Very often we want to do more than this: we wish to use a collection of observed values to make inferences about a larger set of potential values; we would like to consider a particular set of data we have obtained as representing a larger class. It turns out that to accomplish this is by no means straightforward. What is more, an exhaustive treatment of the difficulties involved is beyond the scope of this book. In this chapter we can only provide the reader with a general outline of the problem of making inferences from observed values. A full understanding of this exposition will depend to some degree on familiarity with the content of later chapters. For this reason we suggest that this chapter is first read to obtain a general grasp of the problem, and returned to later for re-reading in the light of subsequent chapters.

We will illustrate the problem of inference by introducing some of the cases which we will analyse in greater detail in the chapters to come. One, for example, in chapter 8, concerns the size of the comprehension vocabulary of British children between 6 and 7 years of age. It is obviously not possible, for practical reasons, to test all British children of this age. We simply will not have the resources. We can only test a sample of children. We have learned, in chapters 2 and 3, how to make an adequate description of an observed group by, for example, constructing a histogram or calculating the mean and standard deviation of the vocabulary sizes of the subset of children selected. But our interest is often broader than this; we would like to know the mean and standard deviation which would have been obtained by testing all children of the relevant age. How close would these have been to the mean and standard deviation actually observed? This will depend on the relationship we expect to hold between the group we have selected to measure and the larger group of children from which it has been selected. How far can we assume the characteristics of this latter group to be similar to those of the smaller group which has been observed? This is the classical problem of statistical inference: how to infer from the properties of a part the likely properties of the whole. It will turn up repeatedly from now on. It is worth emphasising at the outset that, because of the way in which samples are selected in many studies in linguistics and applied linguistics, it is often simply not possible to generalise beyond the samples. We will return to this difficulty.

4.2 Populations
A population is the largest class to which we can generalise the results of an investigation based on a subclass. The population of interest (the target population) will vary in type and magnitude depending on the aims and circumstances of each different study or investigation. Within the limits set by the study in question, the population, in statistical terms, will always be considered as the set of all possible values of a variable. We have already referred to one study which is concerned with the vocabulary of 6-7-year-olds. The variable here is scores on a test for comprehension vocabulary size; the population of interest is the set of all possible values of this variable which could be derived from all 6-7-year-old children in the country. There are two points which should be apparent here. First, although as investigators our primary interest is in the individuals whose behaviour we are measuring, a statistical population is to be thought of as a set of values; a mean vocabulary size calculated from a sample of observed values is, as we shall see in chapter 7, an estimate of the mean vocabulary size that would be obtained from the complete set of values which form the target population. The second point that should be apparent is that it is often not straightforward in language studies to define the target population. After all, the set of 6-7-year-old children in Britain, if we take this to refer to the period between the sixth and seventh birthdays, is changing daily; so for us to put some limit on our statistical population (the set of values which would be available from these children) we have to set some kind of constraint. We return to this kind of problem below when we consider sampling frames. For the moment let us consider further the notion of 'target' or 'intended' population in relation to some of the other examples used later in the book.

Utterance length. If we are interested in the change in utterance length over time in children's speech, and collect data which sample utterance length, the statistical population in this case is composed of the length values of the individual utterances, not the utterances themselves. Indeed
we could use the utterances of the children to derive values for many different variables and hence to construct many different statistical populations. If instead of measuring the length of each utterance we gave each one a score representing the number of third person pronouns it contained, the population of interest would then be 'third person pronoun per utterance scores'.

Voice onset time (VOT). In the study first referred to in chapter 1, Macken & Barton (1980a) investigated the development of children's acquisition of initial stop contrasts in English by measuring VOTs for plosives the children produced which were attempts at adult voiced and voiceless stop contrasts. The statistical population here is the VOT measurements for /p, b, t, d, k and g/ targets, not the phonological items themselves. Note once again that it is not at all easy to conceptualise the target population. If we do not set any limit, the population (the values of all VOTs for word-initial plosives pronounced by English children) is infinite. It is highly likely, however, that the target population will necessarily be more limited than this as a result of the circumstances of the investigation from which the sample values are derived. Deliberate constraints (for example a sample taken only from children of a certain age) or accidental ones (non-random sampling; see below) will either constrain the population of interest or make any generalisation difficult or even impossible.

Tense marking. In the two examples we have just looked at, the population values can vary over a wide range. For other studies we can envisage large populations in which the individual elements can have one of only a few, or even two, distinct values. In the Fletcher & Peters (1984) study (discussed in chapter 7) one of the characteristics of the language of children in which the investigators were interested was their marking of lexical verbs with the various auxiliaries and/or suffixes used for tense, aspect and mood in English. They examined frequencies of a variety of individual verb forms (modal, past, present, do-support, etc.). However, it would be possible to consider, for example, just past tense marking and to ask, for children of a particular age, which verbs that referred to past events were overtly marked for past tense, and which were not. So if we looked at the utterances of a sample of children of 2;6, we could assign the value 1 to each verb marked for past tense, and zero to unmarked verbs. The statistical population of interest (the values of the children's past referring verbs) would similarly be envisaged as consisting of a large collection of elements, each of which could only have one or the other of these two values.

A population, then, for statistical purposes, is a set of values. We have emphasised that in linguistic studies of the kind represented in this book it is not always easy to conceptualise the population of interest. Let us assume for the moment, however, that by various means we succeed in defining our target population, and return to the problem of statistical inference from another direction. While we may be ultimately interested in populations, the values we observe will be from samples. How can we ensure that we have reasonable grounds for claiming that the values from our sample are accurate estimates of the values in the population? In other words, is it possible to construct our sample in such a way that we can legitimately make the inference from its values to those of the population we have determined as being of interest? This is not a question to which we can respond in any detail here. Sampling theory is itself the subject of many books. But we can illustrate some of the difficulties that are likely to arise in making generalisations in the kinds of studies that are used for exemplification in this book, which we believe are not untypical of the field as a whole.

Common sense would suggest that a sample should be representative of the population; that is, it should not, by overt or covert bias, have a structure which is too different from the target population. But more technically (remembering that the statistical population is a set of values), we need to be sure that the values that constitute the sample somehow reflect the target statistical population. So, for example, if the possible range of values for length of utterance for 3-year-olds is 1 to 11 morphemes, with larger utterances possible but very unusual, we need to ensure that we do not introduce bias into the sample by only collecting data from a conversational setting in which an excessive number of yes-no questions are addressed to the child by the interlocutor. Such questions would tend to increase the probability of utterance lengths which are very short: minor utterances like yes, no, or short sentences like I don't know. The difficulty is that this is only one kind of bias that might be introduced into our sample. Suppose that the interlocutor always asked open-ended questions, like What happened? This might increase the probability of longer utterances by a child or children in the sample. And there must be sources of bias that we do not even contemplate, and cannot control for (even assuming that we can control for the ones we can contemplate).1

1 We have passed over here an issue which we have to postpone for the moment, but which is of considerable importance for much of the research done in language studies. Imagine the case where the population of interest is utterance lengths of British English-speaking pre-school children. We have to consider whether it is better to construct a sample which consists of many utterances from a few children, or one which consists of a small number of utterances from each of many children. We will return to this question, and the general issue of sample design, in chapter 7.
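The point that one set of utterances can yield several different statistical populations can be made concrete. In the sketch below the three utterances are invented purely for illustration, and the pronoun list and the "-ed" test are crude stand-ins for real linguistic analysis; the point is only that each derived list is a different population of values obtained from the same utterances:

```python
# Invented mini-corpus, for illustration only
utterances = ["he kicked the ball", "they are running", "she laughed"]

# Population 1: utterance lengths in words
lengths = [len(u.split()) for u in utterances]

# Population 2: third person pronoun counts per utterance
pronouns = [sum(w in ("he", "she", "they", "him", "her", "them")
                for w in u.split())
            for u in utterances]

# Population 3: 0/1 past tense marks (a crude "-ed" heuristic)
past_marked = [int(any(w.endswith("ed") for w in u.split()))
               for u in utterances]

print(lengths)      # [4, 3, 2]
print(pronouns)     # [1, 1, 1]
print(past_marked)  # [1, 0, 1]
```

The third list shows the two-valued kind of population described for tense marking: every element is either 1 (marked) or 0 (unmarked).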
Fortunately there is a method of sampling, known as random sampling, that can overcome problems of overt or covert bias. What this term means will become clearer once we know more about probability. But it is important to understand from the outset that 'random' here does not mean that the events in a sample are haphazard or completely lacking in order, but rather that they have been constructed by a procedure that allows every element in the population a known probability of being selected in the sample.

While we can never be entirely sure that a sample is representative (that it has roughly the characteristics of the population relevant to our investigation), our best defence against the introduction of experimenter bias is to follow a procedure that ensures random samples (one such procedure will be described in chapter 5). This can give us reasonable confidence that our inferences from sample values to population values are valid. Conversely, if our sample is not constructed according to a random procedure we cannot be confident that our estimates from it are likely to be close to the population values in which we are interested, and any generalisation will be of a dubious nature.

How are the samples constructed in the studies we consider in this book? Is generalisation possible from them to a target population?

4.3 The theoretical solution
It will perhaps help us to answer these questions if we introduce the notion of a sampling frame by way of a non-linguistic example. This will incidentally clarify some of the difficulties we saw earlier in attempts to specify populations.

Suppose researchers are interested in the birthweights of children born in Britain in 1984 (with a view ultimately to comparing birthweights in that year with those of 1934). As is usual with any investigation, their resources will only allow them to collect a subset of these measurements, though a fairly large subset. They have to decide where and how this subset of values is to be collected. The first decision they have to make concerns the sources of their information. Maternity hospital records are the most obvious choice, but this leaves out babies born at home. Let us assume that health visitors (who are required to visit all new-born children and their mothers) have accessible records which can be used. What is now required is some well-motivated limits on these records, to constitute a sampling frame within which a random sample of birthweights can be constructed.

The most common type of sampling frame is a list (actual or notional) of all the subjects in the group to which generalisation is intended. Here, for example, we could extract a list of all the babies with birth dates in the relevant year from the records of all health visitors in Great Britain. We could then choose a simple random sample (chapter 5) of n of these babies and note the birthweights in their records. If n is large, the mean weight of the sample should be very similar (chapter 7) to the mean for all the babies born in that year. At the very least we will be able to say how big the discrepancy is likely to be (in terms of what is known as a 'confidence interval'; see chapter 7).

The problem with this solution is that the construction of the sampling frame would be extremely time-consuming and costly. Other options are available. For example, a sampling frame could be constructed in two or more stages. The country (Britain) could be divided into large regions (Scotland, Wales, North-East, West Midlands, etc.) and a few regions chosen from this first-stage sampling frame. For each of the selected regions a list of Health Districts can be drawn up (second stage) and a few Health Districts chosen, at random, from each region. Then it may be possible to look at the records of all the health visitors in each of the chosen Districts, or perhaps a sample of visitors can be chosen from the list (third stage) of all health visitors in each district.2

2 The analysis of the data gathered by such complex sampling schemes can become quite complicated and we will not deal with it in this book. Interested readers should see texts on sampling theory or survey design, or consult an experienced survey statistician.

The major constraint is of course resources: the time and money available for data collection and analysis. In the light of this, sensible decisions have to be made about, for example, the number of Health Districts in Britain to be included in the frame; or it may be necessary to limit the inquiry to children born in four months of the year instead of a complete year. In this example, the sampling frame mediates between the population of interest (which is the birthweights of all children born in Britain in 1984) and the sample, and allows us to generalise from the sample values to those in the population of interest.

If we now return to an earlier linguistic example, we can see how the sampling frame would enable us to link our sample with a population of interest. Take word-initial VOTs. Our interest will always be in the individuals of a relatively large group and in the measurements we derive from their behaviour. In the present case we are likely to be concerned with English children between 1;6 and 2;6, because this seems to be the time when they are learning to differentiate voiceless from voiced initial stops using VOT as a crucial phonetic dimension. Our resources will be
limited. We should, however, at least have a sampling frame which sets time limits (for instance, we could choose for the lower limit of our age range children who are 1;6 in a particular week in 1984); we would like it to be geographically well distributed (we might again use Health Districts). Within the sampling frame we must select a random sample of a reasonable size.3

3 The issues raised in the first footnote crop up again here: measurements on linguistic variables are more complex than birthweights. We could again ask whether we should collect many word-initial plosives from few children, or few plosives from many children (see chapter 7). A similar problem arises with the sample chosen by several stages. Is it better to choose many regions and a few Health Districts in each region, or vice versa?

That is how we might go about selecting children for such a study. But how are language samples to be selected from a child? Changing the example, consider the problem of selecting utterances from a young child to measure his mean length of utterance (mlu; see chapter 13). Again it is possible to devise a sampling frame. One method would be to attach a radio microphone to the child, which would transmit and record every single utterance he makes over some period of time to a tape-recorder. Let us say we record all his utterances over a three-month period. We could then attach a unique number to each utterance and choose a simple random sample (chapter 5) of utterances. This is clearly neither sensible nor feasible: it would require an unrealistic expenditure of resources. Alternatively, and more reasonably, we could divide each month into days and each day into hours, select a few days at random and a few hours within each day, and record all the utterances made during the selected hours. (See Wells 1985: chapter 1 for a study of this kind.) If this method of selection were to be used it would be better to assume that the child is active only between, say, 7 a.m. and 8 p.m. and select hours from that time period.

In a similar way, it will always be possible to imagine how a sampling frame could be drawn up for any finite population if time and other resources were unlimited. The theory which underlies all the usual statistical methods assumes that, if the results obtained from a sample are to be generalised to a wider population, a suitable sampling frame has been established and the sample chosen randomly from the frame. In practice, however, it is frequently impossible to draw up an acceptable sampling frame; so what, then, can be done?

4.4 The pragmatic solution
In any year a large number of linguistic studies of an empirical nature are carried out by many researchers in many different locations. The great majority of these studies will be exploratory in nature; they will be designed to test a new hypothesis which has occurred to the investigator, or to look at a modification of some idea already studied and reported on by other researchers. Most investigators have very limited resources and, in any case, it would be extravagant to carry out a large and expensive study unless it was expected to confirm and give more detailed information on a hypothesis which was likely to be true and whose implications had deep scientific, social or economic significance. Of necessity, each investigator will have to make use of the experimental material (including human subjects) to which he can gain access easily. This will almost always preclude the setting up of sampling frames and the random selection of subjects.

At first sight it may look as if there is an unbridgeable gap here. Statistical theory requires that sampling should be done in a special way before generalisation can be made formally from a sample to a population. Most studies do not involve samples selected in the required fashion. Does this mean that statistical techniques will be inapplicable to these studies? Before addressing this question directly, let us step back for a moment and ask what it is, in the most general sense, that the discipline of statistics offers to linguistics if its techniques are applicable.

What the techniques of statistics offer is a common ground, a common measuring stick by which experimenters can measure and compare the strength of evidence for one hypothesis or another that can be obtained from a sample of subjects, language tokens, etc. This is worth striving after. Different studies will measure quantities which are more or less variable and will include different numbers of subjects and language tokens. Language researchers may find several different directions from which to tackle the same issue. Unless a common ground can be established on which the results of different investigations can be compared using a common yardstick, it would be almost impossible to assess the quality of the evidence contained in different studies or to judge how much weight to give to conflicting claims.

Returning to the question of applicability, we would suggest that a sensible way to proceed is to accept the results of each study, in the first place, as though any sampling had been carried out in a theoretically 'correct' fashion. If these results are interesting (suggesting some new hypothesis or contradicting a previously accepted one, for example) then it is time enough to question how the sample was obtained and whether this is likely to have a bearing on the validity of the conclusions reached.
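Both sampling schemes sketched in the last section (the staged frame for the birthweight study, and the random choice of days and hours for utterance recording) can be imitated with Python's `random.sample`, which gives every element of a frame an equal, known chance of selection. This is a sketch only; every name and count below is invented for illustration:

```python
import random

random.seed(4)  # fixed seed so the sketch is reproducible

# Stages 1 and 2 of a multi-stage frame: regions, then Health Districts.
# The region and district names are invented placeholders.
frame = {
    "Region A": ["District 1", "District 2", "District 3"],
    "Region B": ["District 4", "District 5", "District 6"],
    "Region C": ["District 7", "District 8"],
}
regions = random.sample(sorted(frame), 2)  # first stage: 2 regions
districts = [d for r in regions
             for d in random.sample(frame[r], 2)]  # second stage

# Time-slot frame for utterance sampling: 90 days, with recordable
# hours starting 7 a.m. to 7 p.m. (the 7 a.m. - 8 p.m. waking window);
# draw 5 days at random, then 2 hours within each selected day.
days = random.sample(range(1, 91), 5)
slots = [(day, hour) for day in days
         for hour in random.sample(range(7, 20), 2)]
print(len(slots))  # 10 selected recording hours
```

Because each draw is made without replacement from an explicit list, the selection probability of every region, district, day and hour is known in advance, which is exactly the property that random sampling requires.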

Let us look at an example. In chapter 11 we discuss a study by Hughes & Lascaratou (1982) on the gravity of errors in written English as perceived by two different groups: native English-speaking teachers of English and Greek teachers of English. We conclude that there seems to be a difference in the way that the two groups judge errors, the Greek teachers tending to be more severe in their judgements. How much does this tell us about possible differences between the complete populations of native English-speaking teachers and Greek teachers of English? The results of the experiment would carry over to those populations, in the sense to be explained in the following four chapters, if the samples had been selected carefully from complete sampling frames. This was certainly not done. Hughes and Lascaratou had to gain the co-operation of those teachers to whom they had ready access. The formally correct alternative would have been prohibitively expensive. However, both samples of teachers at least contained individuals from different institutions. If all the English teachers had come from a single English institution and all the Greek teachers from a single Greek school of languages, then it could be argued that the difference in error gravity scores could be due to the attitudes of the institutions rather than the nationality of the teachers. On the other hand, all but one of the Greek teachers worked in Athens (the English teachers came from a wider selection of backgrounds) and we might query whether their attitudes to errors might be different from those of their colleagues in different parts of Greece. Without testing this argument it is impossible to refute it, but on common-sense grounds (i.e. the 'common sense' of a researcher in the teaching of second languages) it seems unlikely.

This, then, seems a reasonable way to proceed: judge the results as though they were based on random samples and then look at the possibility that they may be distorted by the way the sample was, in fact, obtained. However, this imposes on researchers the inescapable duty of describing carefully how their experimental material, including subjects, was actually obtained. It is also good practice to attempt to foresee some of the objections that might be made about the quality of that material and either attempt to forestall criticism or admit openly to any serious defects.

When the subjects themselves determine to which experimental group they belong, whether deliberately or accidentally, the sampling needs to be done rather more carefully. An important objective of the Fletcher & Peters (1984) study mentioned earlier was to compare the speech of language-normal children with that of language-impaired children. In this case the investigators could not randomly assign children to one of these groups: they had already been classified before they were selected. It is important in this kind of study to try to avoid choosing either of the samples such that they belong obviously to some special subgroup.

There is one type of investigation for which proper random sampling is absolutely essential. If a reference test of some kind is to be established, perhaps to detect lack of adequate language development in young children, then the test must be applicable to the whole target population and not just to a particular sample. Inaccuracies in determining cut-off levels for detecting children who should be given special assistance have economic implications (e.g. too great a demand on resources) and social implications (language-impaired children not being detected). For studies of this nature, a statistician should be recruited before any data is collected and before a sampling frame has been established.

With this brief introduction to some of the problems of the relation between sample and population, we now turn in chapter 5 to the concept of probability as a crucial notion in providing a link between the properties of a sample and the structure of its parent population. In the final section of that chapter we outline a procedure for random sampling. Chapter 6 deals with the modelling of statistical populations, and introduces the normal distribution, an important model for our purposes in characterising the relation between sample and population.

SUMMARY
In this chapter the basic problem of the sample-population relationship has been discussed.
(1) A statistical population was defined as a set of all the values which might ever be included in a particular study. The target population is the set to which generalisation is intended from a study based on a sample.
(2) Generalisation from a sample to a population can be made formally only if the sample is collected randomly from a sampling frame which allows each element of the population a known chance of being selected.
(3) The point was made that, for the majority of linguistic investigations, resource constraints prohibit the collection of data in this way. It was argued that statistical theory and methodology still have an important role to play in the planning and analysis of language studies.

EXERCISES
(1) In chapter 9 we refer to a study by Ferris & Politzer (1981) in which children with bilingual background are tested on their English ability via compositions written in response to a short film. Read the brief account of the study (pp. 139ff.),
find out what measureSWeie Used, and decide-\hat Would -constitute the ele~
ments of the relevant statistical population.
(2) See if you can do the same for Brasington's (I978) study ()f Remielleseloan 5
w0rd,, also explained in chapter 9 (PP: 'I42-4).
(3) ~eview,the Macken.&. Barton (I g8oa) study, detailed in chapter I . In consider~_
Probability
:ipg,.th~ 'i'nt~nded:.p.opulation' for this study, what factors do we have to take
.,!tl!O.~ccount?,. ,,
(4) In a well-known experimeht, Newport, Gleitman & Gleitman (I977) collected
conversational data from ts mother-child pairs, when the ctlildren were aged
between 15 and 27 months and again six months later. One of the variables
of Interest was the number of yes/no interrogatives used by mothers at the
first r,ecording. If we simply consider this variable, what caQ we say is the The link between the properties of a SaiJlple ~nd the structure.of its parent
intended population? Is it infinite? population Is prdvi!ied through the concept of I)robability. The concept
of probability js best introduced by' means of straightforward nlln-linguistic
examples. We will return to linguistic exemplil1cation once the ideas are
established. The esserttials car be illustrated by means of a simple game.

5.1 Probability
Suppose we have a box containing ten plastic discs, of identical shape, of which three are red, the remainder white. The discs are numbered 1 to 10, the red discs being those numbered 1, 2 and 3. The game consists of shaking the box, drawing a disc without looking inside, and noting both its number and its colour. The disc is returned to the box and the game repeated indefinitely. What proportion of the draws do you think will result in disc number 4 being drawn from the box? One draw in three? One draw in five? One draw in ten?

Surely we would all agree that the last is the most reasonable. There are ten discs. Each is as likely to be drawn as the others every time the game is played. Since there are ten discs with different numbers, we should expect that each number will be drawn about one-tenth of the time in a large number of draws, or trials as they are often called. We would say that the probability of drawing the disc numbered 4 on any one occasion is one in ten, and we will write:

P(disc number 4) = 1/10 = 0.1

Instead of determining the probability of disc 4 being drawn in the game, we could ask another question, perhaps, 'What is the probability on any occasion that the disc drawn is red?' Since three out of the ten discs are of this colour, that is, three-tenths of the discs in the box are red, then:
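The long-run behaviour of the game can also be checked by simulation rather than with a physical box. The sketch below is our own illustration in Python (a language not used in the book; the function name `play` is invented for the purpose): it draws a disc at random, with replacement, a large number of times and compares the relative frequencies with the probabilities derived above.

```python
import random

# The box: discs numbered 1-10; those numbered 1, 2 and 3 are red.
DISCS = list(range(1, 11))
RED = {1, 2, 3}

def play(n_trials, rng):
    """Draw a disc with replacement n_trials times; return the relative
    frequencies of three outcomes: disc number 4, a red disc, and a disc
    that is both red and odd-numbered."""
    drew_4 = drew_red = drew_red_odd = 0
    for _ in range(n_trials):
        disc = rng.choice(DISCS)
        drew_4 += (disc == 4)
        drew_red += (disc in RED)
        drew_red_odd += (disc in RED and disc % 2 == 1)
    return drew_4 / n_trials, drew_red / n_trials, drew_red_odd / n_trials

if __name__ == "__main__":
    p4, p_red, p_red_odd = play(100_000, random.Random(1))
    # Each relative frequency settles close to 0.1, 0.3 and 0.2 respectively.
    print(p4, p_red, p_red_odd)
```

With 100,000 trials the relative frequencies rarely differ from the theoretical values by more than a few thousandths, which is the point of the 'expected relative frequency' interpretation of probability.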
P(red disc) = 3/10 = 0.3

In the same way we can ask questions about the probabilities of many other outcomes. For example:

(i) the probability of drawing an even-numbered disc, since there are five of them, is:

P(even-numbered disc) = 5/10 = 0.5

(ii) the probability of drawing a disc which is both red and odd-numbered, since there are two of them (1 and 3), is:

P(red, odd-numbered disc) = 2/10 = 0.2

and so on.

It is helpful to play such games to gain some insight into the relation between the relative frequency of particular outcomes in a series of trials and your a priori expectations based on your knowledge of the contents of the box. You could repeat the game 100 times and see what is the actual proportion of red discs which is drawn. It is even better if a number of people play the game simultaneously and compare results. Table 5.1 shows what happened on one occasion when 42 students each played the game 100 times. Although the proportion of red discs drawn by individual students varied from 0.20 to 0.39, the mean proportion of red discs drawn over the whole group was 0.2995, very close to the value one would expect, i.e. 0.3.

Table 5.1. Number of times a red disc was drawn from a box containing 3 red and 7 white discs in 100 trials by 42 students

20 21 22 24 24 25 26 26 26 26 27 27 28 28 28
29 29 29 29 29 30 30 30 30 31 31 31 32 32 33
33 33 33 34 34 35 36 36 37 37 38 39

Number of red discs   Frequency
20-22                 3
23-25                 3
26-28                 9
29-31                 12
32-34                 8
35-37                 5
38-40                 2

Mean of the results of the 42 students is 29.952 per 100 draws.
Mean proportion is therefore 0.29952.

It should be clear that the actual number of red discs is not the important quantity here. The properties of the game depend only on the proportion of discs of that colour. For example, if the box contained 10,000 discs of which 3,000 were red it would continue to be true that P(red disc) = 0.3. This has some practical relevance. Suppose that a village contained 10,000 people of whom 30% were male, though this fact was unknown to us. If we wished to try to establish this proportion without having to identify the sex of every person in the village, we could play the above game, but this time with people instead of discs. We would repeatedly choose a person at random (a detailed method for random selection will be given in the final section of this chapter) and note whether the person chosen was male or female. If we repeated the process n times, we could use the proportion of males among these n people as an estimate of the proportion or expected relative frequency of males in the population. We saw from table 5.1 that it was possible to assess the properties of a sampling game empirically, when the proportions of different types are known. It is also possible to study the properties of sampling games theoretically. In establishing the proportion of males in the village, for instance, it is possible to find out how many persons should be sampled in order to give a reasonably accurate estimate, by using methods that are explained in chapter 7 (see especially section 7.6). As we shall see, if we take a sample of 400 persons we can expect with reasonable confidence (5% probability of error) that the proportion of males in the sample is between 28% and 32%; and we can be almost sure (1% probability of error) that the proportion of males in the sample would not be less than 25% or greater than 35%. (Compare these figures to the actual proportion of 30%.)

We should point out that, in practice, it is not common to use the sampling procedure we have employed in the game, where each disc/person is replaced in the box/village after its type has been noted. It is more common to take out a whole group of discs/people simultaneously and note the type of every element in the sample. This is equivalent to choosing the discs/people one at a time and then not replacing those already chosen before selecting the next. Provided only a small proportion of the total (say less than 10%) is sampled, it will not make much difference whether the sampling is done with or without replacement.

5.2 Statistical independence and conditional probability
Table 5.2 displays the numbers of individuals of either sex from two different hypothetical populations, classed as monolingual or bilingual.
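Exercise 1 at the end of the chapter asks you to replicate the experiment of table 5.1 by hand; it can equally be replicated by simulation. The following sketch is our own Python illustration (the names `red_count` and `class_experiment` are invented): 42 simulated students each make 100 draws, with replacement, from a box in which the probability of a red disc is 0.3.

```python
import random

def red_count(n_draws, rng):
    """Number of red discs in n_draws draws, with replacement, from a box
    containing 3 red and 7 white discs (P(red disc) = 0.3 on every draw)."""
    return sum(rng.random() < 0.3 for _ in range(n_draws))

def class_experiment(n_students=42, n_draws=100, seed=42):
    """One classroom replication of the table 5.1 experiment."""
    rng = random.Random(seed)
    results = sorted(red_count(n_draws, rng) for _ in range(n_students))
    mean = sum(results) / len(results)
    return results, mean

if __name__ == "__main__":
    results, mean = class_experiment()
    print(results)            # cf. the 42 raw scores in table 5.1
    print(mean, mean / 100)   # mean count and mean proportion, near 30 and 0.3
```

Different seeds give different classrooms, but the mean proportion over the 42 students stays close to 0.3, just as in the recorded session.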
Table 5.2. Numbers of monolingual or bilingual adults in two hypothetical populations cross-tabulated by sex

Population A
              Male    Female   Total
Bilingual     2080    1920      4000
Monolingual   3120    2880      6000
Total         5200    4800     10000

Population B
              Male    Female   Total
Bilingual     2500    1500      4000
Monolingual   2700    3300      6000
Total         5200    4800     10000

Each population contains the same number of individuals, 10,000, of whom 5,200 are male and 4,800 female. However, in population A the proportion of males who are bilingual is 2080/5200 = 0.4, the same as the proportion of bilingual females (1920/4800). In population B, on the other hand, 2500/5200 males, i.e. 0.48, are bilingual, while the proportion of bilingual females is only 0.31. This kind of imbalance may be observed in practice, for example, among the Totonac population of Central Mexico, where the men are more accustomed to trade with outsiders and are more likely to speak Spanish than the more isolated women. A similar effect may be encountered in first-generation immigrant communities where the men learn the tongue of their adopted country quickly at their place of employment while the women spend much more time at home, either isolated or mingling with others of their own ethnic and linguistic group.

Suppose that we were to label every member of such a population with a different number and then write each number on a different plastic disc. All the discs are to be of identical shape, but if the number corresponds to a male it is written on a white disc, if female it is written on a red. The discs are placed in a large box and well shaken. Then one is chosen without looking in the box. Clearly this is one way of choosing objectively a single person, i.e. a sample of size one, from the whole population, and you should see immediately that the following probability statements can be made about such a person chosen from either of the two populations, A or B.

The probability of a male being chosen:
P(male) = 5200/10000 = 0.52

The probability of a female:
P(female) = 0.48

The probability of a monolingual:
P(monolingual) = 0.6

The probability of a bilingual:
P(bilingual) = 0.4

(Notice that when the population is partitioned in such a way that each individual belongs to one and only one category, e.g. male-female or monolingual-bilingual, the total probability over all the categories of the partition is always 1.0: 0.52 + 0.48 = 1.0 and 0.4 + 0.6 = 1.0, for either population.)

Although the two populations have identical proportions of male-female and of monolingual-bilingual individuals, they have otherwise different structures. We see this when we look at finer detail.

For population A:
P(male and bilingual) = 2080/10000 = 0.208
P(female and bilingual) = 1920/10000 = 0.192
P(male and monolingual) = 3120/10000 = 0.312
P(female and monolingual) = 2880/10000 = 0.288
Total 1.000

For population B:
P(male and bilingual) = 2500/10000 = 0.25
P(female and bilingual) = 1500/10000 = 0.15
P(male and monolingual) = 2700/10000 = 0.27
P(female and monolingual) = 3300/10000 = 0.33
Total 1.00
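All of the probability statements above are obtained by dividing cell counts by the population total, so they can be generated mechanically from table 5.2. A minimal Python sketch (our own illustration; the dictionary layout and the name `joint_and_marginals` are ours, not the book's):

```python
# Cell counts from table 5.2 for the two hypothetical populations of 10,000.
POP_A = {("male", "bilingual"): 2080, ("female", "bilingual"): 1920,
         ("male", "monolingual"): 3120, ("female", "monolingual"): 2880}
POP_B = {("male", "bilingual"): 2500, ("female", "bilingual"): 1500,
         ("male", "monolingual"): 2700, ("female", "monolingual"): 3300}

def joint_and_marginals(pop):
    """Joint probability of each (sex, language) cell, plus the marginal
    probabilities of being male and of being bilingual."""
    total = sum(pop.values())
    joint = {cell: count / total for cell, count in pop.items()}
    p_male = sum(p for (sex, _), p in joint.items() if sex == "male")
    p_bilingual = sum(p for (_, lang), p in joint.items() if lang == "bilingual")
    return joint, p_male, p_bilingual

if __name__ == "__main__":
    for name, pop in (("A", POP_A), ("B", POP_B)):
        joint, p_male, p_bi = joint_and_marginals(pop)
        print(name, p_male, p_bi, joint)
```

Running it confirms that the two populations share the same marginals, P(male) = 0.52 and P(bilingual) = 0.4, while the joint probabilities differ, and that the joint probabilities of each partition sum to 1.0.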
Probability Statistical independence and conditional probability
Suppose that,. for either. population, .;wMcOntinueuraWi:ng :discs'.lin1iL : . (,,__,;. ;-l,_;,and1 ,.
.,,.
the very first white disc"appears,ithatds;lm~it:thec,firsi'mal<:'.persoids A ,'~ '\'{

chosen. What is the probability.that:this'llrstcemale>is,bili'ngual?. Notice' .;, ,: ,, !'(mal~)= soo =c. :i.
that this is not the same as askingl'<lt tlleprdbabilltylhatth.,,fii'<it'Person::i ~~;PL~ fi>,.' ~+:J;;_::~"-~l:Jf'rl'." ;-.,,,,_,_;fo_c;ioo ts _ ?. t-~ -i
chosen is male and bilingual. We'keep going:until 'w.e <:hoose ille'first . ...,
,. , 'Fromth~se' danipleswe' can see that; in population A; whichever restric-
male and this deliberately excludeslthe females-from .consideratiom In tion as to sex is imposed oi> selection(or indeed if there is no restriction
fact, we are restricting our attention to the subpopulation of -males and- at all); ihe probab!liiy of biling'llalism/monolingualism remains
theh asking abo1.1t the probability of an event which could occur when unchanged. Similarly, whichever restriction as to language category we
we imnose that restriction or condition. For this teason, such probability impose (or ihhere is no restriction at all), the probability of male/female
is called a conditional probability. remains the same.
To signal the restriction that we will consider only males, we will use Irl a population with these characteristics, the variables 'sex' and 'ian
a standard notation, P(bilingualimale'),.:where ..the.vertical.line:indic~tes, guage cate~O'ry' are said to be st~tlstically independent. In .practice,
that a restriction is being imposed, and wesaythatwei'equirethe~proli~bi< .. '' this ln'eai\9 ifilii kndwlng tile value 'Qf one ,gives no' .information about the '
lity that a chosen person is bilingu~t:g~'ve thliNhe persott i~ntale'; Siilc.e,'d -.. "j\1<:-:~ :1-Q\H~~-.~---~q;.:\,c,;_- _ \<- :,-_;,,, ...-,.:. .. , -_,1:""-'''::";:_:<f.::: -.'--'/ ! \: ..

in population A, there are a tota!.ofs',20o'males of'1whom 2,:o8oare:bic ..,, Pbj>Uiatiori B?<hibits taihtir different properties. We'know already that
lingual, the value oftheconditional:probability:is:. ' "' .,.: '" P(bilinguaf) ,;,:o ..\., but we cat\ see:
208d
!'(bilingual Imale)=--= 0.4
52.0d P(Bilingtiall male)=--= 0.48
2500

szoo
* 0.4
Note that this is exactly the same as the probability that a person cho~en,
and
irrespective of se~, will be bilingual, i.e.
,. . ' . ISOO
P(bilioguall male)= P(biiingual) = 0.4 P(l:ithnguall female)=--= o.JI "'0.4
48oo
We can calculate the probability in a similar fa~hion for the likelihood
that a chosen person is bilingual, give!\ thoit<she>is.femak: , . ,,,., 'i,:i: : (Both these c'onditional probabilities have:been.rounded totwo decimal ;.

;};.
places..) .It is clear that in thls ca~e a.p,erson~s languag.e.categQry wi)l be -. ' ~ 1 ; ' '
1920
P(bilioguall female)=--= o.4 ~ r;, : :p .. "'" depenclent on that persdn's sex. Thad: t~ sa:y; if a male Is selected, then f<(
'),:o_.l
4800 we know there is a higher chance that this person is bilingual than if
Note that this is again the same as theprobability for bilinguals; irrespective a feliiiile had been choseh. In general; P(X IY), the probability that event
of sex: 1! X occurs, given that the everlt Y has already occurred, can be calculated
. ' 4000
P(bthngual) = - - = 0.4 by the rule:
xoodo
If -e wish to determine the probability that a chosen netson is male, P(XIY) _ P(XandY)
given that the person is bilingual, the calculation is as follows:
hv)
2080 For example, in population B:
P(male Ibilingual)=--= o.s
4000 P(bilingual and male)
. P(bilinguall male)
Note that this is the same probability as: P(malc)

3 I2.0 0,25
P(malc I mormlingui\l) = - - ~ 0.52 m~,._._.:;,.ftli 0.48
6ooo o.sz
(14 6s
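The rule P(X | Y) = P(X and Y)/P(Y) is a one-line computation, and applying it to both populations makes the contrast between independence and dependence concrete. A sketch in Python (our own illustration; the function name is invented):

```python
def conditional(p_joint, p_condition):
    """P(X | Y) = P(X and Y) / P(Y)."""
    return p_joint / p_condition

# Population A: P(bilingual and male) = 0.208 and P(male) = 0.52.
p_a = conditional(0.208, 0.52)
# p_a equals P(bilingual) = 0.4, so restricting to males changes nothing:
# sex and language category are independent in population A.

# Population B: P(bilingual and male) = 0.25 and P(male) = 0.52.
p_b = conditional(0.25, 0.52)
# p_b is about 0.48, not the unconditional 0.4: in population B a male is
# more likely to be bilingual than a randomly chosen person, so the two
# variables are dependent.

if __name__ == "__main__":
    print(p_a, p_b)
```

Comparing each conditional probability with the corresponding unconditional one is exactly the independence check summarised at the end of the chapter.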
There is one important property of population A which results from the independence of the two variables. Consider the probability that the first person chosen is both male and bilingual. It turns out that:

P(male and bilingual) = 2080/10000 = 0.208
P(male) = 5200/10000 = 0.52
P(bilingual) = 4000/10000 = 0.4

Now, 0.208 = 0.4 × 0.52, so that we can see that the probability of a person being chosen who is both male and bilingual can be calculated as follows:

P(male and bilingual) = P(male) × P(bilingual)

This result holds only because the two variables of sex and linguistic type are independent.

In population B, on the other hand, we have P(male and bilingual) = 0.25, while, for the same population, P(male) = 0.52 and P(bilingual) = 0.4, so that:

P(male and bilingual) ≠ P(male) × P(bilingual)

This indicates the lack of independence between the two variables in this population. However, the relation:

P(male and bilingual) = P(male | bilingual) × P(bilingual)

does hold, since:

P(male | bilingual) = 2500/4000 = 0.625
P(bilingual) = 0.4
0.4 × 0.625 = 0.25

In general, for any two possible outcomes, X and Y, of a sampling experiment:

P(X and Y occur together) = P(X | Y) × P(Y) = P(Y | X) × P(X)

5.3 Probability and discrete numerical random variables
The examples we have seen so far have been concerned with rather simple categorical variables such as sex or linguistic category. However, the situation is very similar when we consider discrete numerical variables, the only difference being that the extended range of values in such variables allows us to introduce a richer variety of outcomes whose probability we might want to consider.

Table 5.3. Hypothetical family size distribution of 1,000 families

No. of children (X)   No. of families   Proportion
0                     121               0.121
1                     179               0.179
2                     263               0.263
3                     217               0.217
4                      99               0.099
5                      61               0.061
6 or more              60               0.060
Total                1000               1.000

Suppose that 1,000 families have given rise to the population of family sizes (i.e. number of children) summarised in table 5.3. In this population, let us choose one family at random. What is the probability that X, the number of children in this family, is 3?

Answer: P(X = 3) = 0.217

since that is the proportion of the family sizes which take the value 3. Similarly:¹

P(X = 5) = 0.061
P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2) = 0.121 + 0.179 + 0.263 = 0.563
P(0 < X ≤ 3) = P(X = 1 or 2 or 3) = 0.659

¹ The symbol < means 'is less than', while ≤ means 'is less than or equal to'. Similarly, the symbol > means 'is greater than', while ≥ means 'is greater than or equal to'.

Table 5.3 is an example of a probability distribution of a random variable. A random variable can be thought of as any variable whose value, which cannot be predicted with certainty, is ascertained as the outcome of an experiment, usually a sampling experiment, or game of some kind. For example, if we decide to choose a family at random from the hypothetical 1,000 families, we do not know for certain what the size of that family will be until after the sampling has been done. The distribution of such a random variable is simply the list of probabilities of the different values that the variable can take. If the different possible values of the variable can be enumerated or listed, as in this case, it is called a discrete random variable. Discrete variables may be numerical, like 'family size', or categorical, like 'sex' or 'colour'. (In the previous section we saw an example of the categorical variable 'colour' which took the
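Probabilities for a discrete random variable reduce to summing the listed proportions over the values in the range of interest, so the calculations just made can be reproduced directly from table 5.3. A Python sketch (our own illustration; the dictionary name and the helper `prob` are invented, and the open-ended class '6 or more' is represented simply by the value 6):

```python
# Probability distribution of family size X from table 5.3.
FAMILY_SIZE = {0: 0.121, 1: 0.179, 2: 0.263, 3: 0.217,
               4: 0.099, 5: 0.061, 6: 0.060}

def prob(event):
    """P(X satisfies event): sum the probabilities of the qualifying values."""
    return sum(p for x, p in FAMILY_SIZE.items() if event(x))

if __name__ == "__main__":
    print(prob(lambda x: x == 3))       # P(X = 3)            -> 0.217
    print(prob(lambda x: x <= 2))       # P(X <= 2)           -> 0.563
    print(prob(lambda x: 0 < x <= 3))   # P(0 < X <= 3)       -> 0.659
```

Note, as the text observes, that the population size (1,000 families) plays no role once the proportions are known.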
two values 'red' and 'white' with probabilities 0.3 and 0.7 respectively.) The distribution of a discrete random variable can be represented by a bar chart: figure 5.1, for example, gives the bar chart corresponding to table 5.3. We have already seen similar diagrams in figures 2.1 and 2.3, and these can be considered as approximations to the bar charts of the discrete random variables 'types of deficit' and 'length of utterance', based on samples of values of the corresponding random variables.

[Figure 5.1. Bar chart corresponding to the data of table 5.3]

For any discrete, numerical random variable, the probability that a single random observation has one of a possible range of values is just the sum of the individual probabilities for the separate values in the range (see exercise 5.3). Note that the actual size of the population is irrelevant once the proportion of elements taking a particular value is known.

5.4 Probability and continuous random variables
Suppose that a statistical population is constructed by having each of a large number of people carry out a simple linguistic task and noting the reaction time, in seconds, for each person's response. We suppose that the device used to measure the time is extremely accurate and measures to something like the nearest millionth of a second. In fact, conceptually at least, there is no limit to the accuracy with which we could measure a length of time. Ask yourself what is the next largest time interval greater than 1.643217 seconds. Whatever figure you give, it is always possible to suggest one even closer. For example, if 1.6432171 seconds is suggested, that is less close than 1.64321705 which, in turn, is not as close as 1.643217049, and so on. A variable with this property is called a continuous variable. Table 5.4 gives a hypothetical relative frequency table for the population of task times and figure 5.2 the corresponding histogram. There are several points worth noting here.

Table 5.4. Hypothetical distribution of task times

Time (in seconds)
From   To just less than   Proportion of times in this range
0      20                  0.035
20     30                  0.031
30     40                  0.061
40     50                  0.154
50     55                  0.202
55     60                  0.230
60     70                  0.161
70     80                  0.071
80     100                 0.055

[Figure 5.2. Distribution of task times of table 5.4]
First, the class intervals are not all of the same width, and you should examine how the histogram is adjusted to account for this. In particular, the class interval 0-20 has a higher proportion of the population than its neighbouring class 20-30 and yet the corresponding rectangle in the histogram is less tall; it is the areas of the rectangles (i.e. width × height) which correspond to the relative frequencies of the classes. Second, we do not need to state the actual number of elements of the population belonging to each class since, for calculating probabilities, we need only know the relative frequencies. Thirdly, the upper bounds of the class intervals are given as 'less than 20 seconds', 'less than 30 seconds', and so on since, because of the accuracy of our measuring instrument, we are assuming it is possible to get as close to these values as we might wish.

Let us choose, at random, one of these task times and denote it by Y. What is the probability that Y takes the value 25, say? We need to think rather carefully about this. Since we are measuring to the nearest millionth of a second, the range from 0 to 100 contains a possible 100 million different possible values, even more if our instrument for measuring times is even more accurate. The probability of getting a time of any fixed, exact value, say 25.000000 seconds or 31.163217 seconds, is very small. To all intents and purposes it is zero. This means that we will have to content ourselves with calculating probabilities for ranges of values of the variable, Y. For example:

P(Y < 20) = 0.035
P(Y < 50) = 0.035 + 0.031 + 0.061 + 0.154 = 0.281
P(40 ≤ Y < 55) = 0.154 + 0.202 = 0.356

(P(40 ≤ Y < 55) is to be interpreted as the probability that Y is equal to or greater than 40 but less than 55.)

In each case we simply identify the required range of values on the histogram and calculate the total area of the histogram encompassed by that range. Note here a very special point which was not true in the previous section:

P(Y < 20) = P(Y ≤ 20)

because, since the probability of getting any exact value is effectively zero:

P(Y ≤ 20) = P(Y < 20) + P(Y = 20) = P(Y < 20) + 0

This means that for continuous variables it will not matter whether we write < or ≤; the probability does not change.

Now, suppose we try to calculate P(10 < Y < 20). How can we evaluate this probability? The range does not coincide with the endpoints of class intervals as all the previous examples did. Remember that in the histogram the area defined by a particular range is the probability of obtaining a value within that range. If we shade, on the histogram, the area corresponding to 10 < Y < 20 (and refer to table 5.4 for the proportion of times that fall in this range) we see immediately that the required probability is 1/2 × 0.035 = 0.0175 - see figure 5.3(a). Similarly:

P(Y > 78) = (2/10 × 0.071) + 0.055

and - see figure 5.3(b):

P(50 < Y < 87) = 0.202 + 0.230 + 0.161 + 0.071 + (7/20 × 0.055)

[Figure 5.3(a). Estimating probabilities.]
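The 'area under the histogram' calculations above, including the partial intervals, follow a single pattern: each class contributes its proportion scaled by the fraction of its width that lies inside the range of interest. A Python sketch (our own illustration; the names `INTERVALS` and `prob_between` are invented, and like the text it assumes the times are spread evenly within each class interval):

```python
# Class intervals of table 5.4 as (lower bound, upper bound, proportion).
INTERVALS = [(0, 20, 0.035), (20, 30, 0.031), (30, 40, 0.061),
             (40, 50, 0.154), (50, 55, 0.202), (55, 60, 0.230),
             (60, 70, 0.161), (70, 80, 0.071), (80, 100, 0.055)]

def prob_between(a, b):
    """P(a < Y < b) as histogram area: each interval contributes its
    proportion times the fraction of its width falling inside (a, b)."""
    area = 0.0
    for lo, hi, p in INTERVALS:
        overlap = max(0.0, min(b, hi) - max(a, lo))
        area += p * overlap / (hi - lo)
    return area

if __name__ == "__main__":
    print(prob_between(0, 50))    # P(Y < 50)      -> 0.281
    print(prob_between(10, 20))   # P(10 < Y < 20) -> 0.0175
    print(prob_between(78, 100))  # P(Y > 78)
    print(prob_between(50, 87))   # P(50 < Y < 87)
```

Because P(Y = a) is effectively zero for a continuous variable, the same function serves for < and ≤ alike, as noted in the text.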
However, these probabilities will be only approximately correct, since the histogram of figure 5.2 is drawn to a rather crude scale with wide class intervals. Its true shape may be something like the smooth curve in figure 6.6. It would then be more difficult to calculate the area corresponding to the interval 10 < Y < 20, but there are methods which enable it to be done to any required degree of accuracy, so that tables can be produced. We will return to this idea in the next chapter.

[Figure 5.3(b). Estimating probabilities.]

5.5 Random sampling and random number tables
We can now return to the issue of random selection and explain in more detail what we mean by a 'random sample'. We have already indicated that 'random' does not mean 'haphazard' or 'without method'. In fact the selection of a truly random sample can be achieved only by closely following well-defined procedures. For illustration, let us suppose that we wish to select a sample of three subjects from a population of eight. (Obviously this situation would never arise in practice, but keeping the sample and population sizes small simplifies the explanation.) Random sampling or, in full, simple random sampling, is a selection process which in this case will ensure that every sample of three subjects has the same probability of being chosen. Suppose the eight subjects of the population are labelled A, B, C, D, E, F, G and H. Then there are 56 possible different samples of size 3 which could be selected:

ABC ABD ABE ABF ABG ABH ACD ACE ACF ACG
ACH ADE ADF ADG ADH AEF AEG AEH AFG AFH
AGH BCD BCE BCF BCG BCH BDE BDF BDG BDH
BEF BEG BEH BFG BFH BGH CDE CDF CDG CDH
CEF CEG CEH CFG CFH CGH DEF DEG DEH DFG
DFH DGH EFG EFH EGH FGH

Now suppose that 56 identical, blank discs are obtained and the three-letter code for a different sample inscribed on each disc. If the discs are now placed in a large drum or hat and well mixed, and then just one is chosen by blind selection, the three subjects corresponding to the chosen letter code would constitute a simple random sample.

The problem with this method is that, for even quite moderate sizes of sample and population, the total number of possible samples (and hence discs) becomes very large. For example, there are around a quarter of a million different samples of size 4 which can be chosen from a population of just 50. It is impossible to contemplate writing each of these on a different disc just to select a sample. Fortunately there is a much quicker method to achieve the same result. Let us return, for the moment, to the example of choosing a sample of three from a population of eight. Take eight discs and write the letter A on the first disc, B on the second, etc., until there is a single disc corresponding to each of the letters A-H. Thoroughly mix the discs and choose one of them blindfold. Mix the remaining seven discs, choose a second, mix the remaining six and choose a third. It can be shown mathematically that this method of selection also gives each of the 56 possible samples exactly the same chance of being chosen. However, this is still not a practicable method. For very large populations it would require a great deal of work to prepare the discs and it may be difficult to ensure that the mixing of them is done efficiently. There is another method available which is both more practicable and more efficient.

Each member of the population is labelled with a number: 1, 2, ..., N, where N is the total population size. Tables of random numbers can then be used to select a random sample of n different numbers between
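The two facts just appealed to, that there are 56 possible samples of size 3 from eight subjects and around a quarter of a million of size 4 from 50, and the sequential draw-without-replacement procedure itself, can all be checked in a few lines. A Python sketch (our own illustration; the standard library's `random.sample` implements exactly the draw-one-disc-at-a-time-without-replacement selection described above):

```python
import itertools
import random
from math import comb

SUBJECTS = "ABCDEFGH"

# Every possible sample of size 3 from the population of eight subjects.
ALL_SAMPLES = list(itertools.combinations(SUBJECTS, 3))   # 56 of them

# Sequential selection without replacement: draw one subject, then another
# from the remainder, then a third; each of the 56 samples is equally likely.
one_sample = random.Random(0).sample(SUBJECTS, 3)

# The 'quarter of a million' figure for samples of size 4 from 50.
N_50_4 = comb(50, 4)   # 230,300 possible samples

if __name__ == "__main__":
    print(len(ALL_SAMPLES), N_50_4, one_sample)
```

Enumerating the 230,300 size-4 samples is still feasible on a computer, but the point of the text stands: listing them on discs is hopeless, while sequential selection scales to any population size.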
Probability Exercises

Table 55 Random numbers It is good practice not to enter the random number tables _always at
the same point but to use some simple rule of your own for determining
44 59 62 z6 82 51_ 04 19 45 g8 03 51 so 14 28 02 12 29 88 87
the entry point, based on the d.ate or the time, or any quantity of this
''"" . ,,,.~0 .9o 5~ 5. 90 20 76 95.70 o2 84 74 69 o6 '3 98 86 o6 so
kind which will not always have the same value on every occasion you
.. ' 44-t33' z9 88 -_go. 49 07 55 6g so zo 27 59 51 97 53 57 04 2_2 26
use the tables. Many calc.ulators have a facility for producing random
47 57 22 52 75' 74 53 Il 76 II 21 J6 12 44 JI 89 16 91 _47 75
numbers, which can be useful if tables are not available.
03 20 54 ZQ 70 56 77 59 95 60 19 75 29 94 II 23 59 30 If 47
Simple random sampling is not the only acceptable way to obtain sam
I and N. The individuals corresponding to the chosen numbers will consti
pies. Although it does lead to the simplest statistical theory and the most
tute the required random sample. A table of random numbers is given direct interpretation of the data, it may not be easy to see how to obtain
in appendix A (table Ax). A portion of the table is given in table 5S a simple random sample of, say, a child's utterances or tokens of a particular
and we will use this to demonstrate the procedure. Suppose we wish to phonological variant. This is discussed briefly in chapter 7. However a
select a random sample of ten subjects from a total population of 7,83 sample is collected, if the result is to be a simple random sample it must
The digits in the table should be read off in non-overlapping groups of four, the same as the number of digits in the total population size. It does not matter whether they are read down columns or across rows - we read across for this example. The first 14 four-digit numbers created in that way are:

4459 6226 8251 0419 4598 0351 5014
2802 1229 8887 8590 2258 5290 2276

and would lead to the inclusion in the sample of the individuals numbered 4459, 6226, 419, 4598, 351, 5014, 2802, 1229, 2258, 2276 and 5290. The numbers 8251, 8887 and 8590 are discarded since they do not correspond to any individual of the population (which contains only 7,832 individuals). If a number is repeated it is ignored after the first time it is drawn. It is not necessary actually to write numbers against the names of every individual in the population before obtaining the sample. All that is required is that the individuals of the sample are uniquely identified by the chosen random numbers. For example, the population may be listed on computer output, each page containing 50 names. The random number 4459 then corresponds to the ninth individual on page 90 of the output. Similarly, 25 random words could be chosen from a book of 216 pages in which each page contains 30 lines and no line contains more than 15 words, as follows. Obtain 25 random numbers between 001 and 216 (pages); for each of these obtain one random number between 01 and 30 (lines on a page) and one random number between 01 and 15 (words in the line). This time, if a page number is repeated, that page should be included both (or more) times in the sample. Only if page, line and word number all coincide with a previous selection (i.e. the very same word token is repeated) should the choice be rejected.

SUMMARY
This chapter has introduced the concept of probability and shown how it can be measured both empirically and, in some situations, theoretically.
(1) The probability of a particular outcome to a sampling experiment was identified with the expected relative frequency of the outcome.
(2) The concept of statistical independence was discussed; to say that two events, X and Y, are independent is equivalent to the statement that P(X and Y both true) = P(X) × P(Y).
(3) The conditional probability of one event given that another has already occurred was defined as P(X|Y) = P(X and Y)/P(Y).
(4) If two events, X and Y, are independent, then P(X|Y) = P(X) and P(Y|X) = P(Y); that is, the conditional probabilities have the same values as the unconditional probabilities.
(5) The concept of a probability distribution was introduced. For discrete variables, the probability distribution can be presented as a table; for continuous variables it takes the form of a histogram and the probability that the variable lies in a certain range can be identified as the area of the corresponding part of the histogram.
(6) It was demonstrated how a simple random sample can be selected from a finite population with the help of random number tables.

EXERCISES
(1) Replicate yourself the experiment whose results are tabulated in table 5.1. Include the result from your 100 trials in table 5.1, and recalculate the mean.
(2) Using datum 41 as the entry point (you will find it in appendix A, table A1, 6th row, 4th column) and using this book as your data source, list the sample of 25 words suggested by the procedure on page 74.
Probability

(3) Using the probability distribution of family size in table 5.3, calculate the probability that a randomly chosen family has:
(a) more than 3 children
(b) fewer than 4 children
(c) at least 2 but no more than 5 children
(4) Estimate from figure 5.2 the following:
P(Y > 23)    P(50 ≤ Y < 60)
P(14 ≤ Y < 92)    P(91 ≤ Y < 96)
(5) Calculate from table 5.3 the following:
P(X ≤ 4)    P(0 < X ≤ 4)
(6) Calculate from the data for population B in table 5.2:
P(male | bilingual)    P(female | bilingual)

6 Modelling statistical populations

We pointed out in chapter 4 that the solution of many of our problems will depend on our ability to infer accurately from samples to populations. In chapter 5 we introduced the basic elements of probability and argued that it is by means of probability statements concerning random variables that we will be able to make inferences from samples to populations. In the present chapter we introduce the notion of a statistical model and describe one very common and important model.

We should say at the outset that the models with which we are concerned here are not of the kind most commonly met in linguistic studies. They are not, for instance, like the morphological models proposed by Hockett (1954); nor do they resemble the psycholinguists' models of speech production and perception. The models discussed in this chapter are statistical models for the description and analysis of random variability. No special mathematical knowledge or aptitude is required in order to understand them.

6.1 A simple statistical model
Statistical models are best introduced by means of an example. In chapter 1 we discussed in detail a study which looked at the voice onset time (VOT) for word-initial plosives in the speech of children in repeated samples over an eight-month period. For our present purpose we will consider only the VOTs for one pair of stop targets, /t/ and /d/, for one child at 1;8. To make our exposition easier, we will also assume that the tokens were in the same environment (in this case preceding /u:/). Look at table 6.1. The fictitious data displayed there are what one would expect to see only if an individual's VOT for a particular element in a certain environment were always precisely the same, i.e. if the population of an individual's VOTs for that element in that environment had a single value. Such VOTs would be like measurements of height or arm length. Provided that the measurement is very accurate, we do not have to measure the length of a person's arms over and over again in order to know whether the left arm is longer than the right.
Table 6.1. Hypothetical sample of VOTs in absence of variation

VOT for /d/    VOT for /t/
14.25          22.3
14.25          22.3
14.25          22.3
14.25          22.3
14.25          22.3
14.25          22.3
14.25          22.3
14.25          22.3

Table 6.2. Hypothetical, but realistic, sample of VOTs from a single subject

VOT for /d/    VOT for /t/
17.05          16.81
13.70          24.32
18.09          20.17
15.78          28.31
13.94          18.27
14.52          21.03
16.74          17.94
16.16          19.37
               23.16

In the same way, we would not have to take repeated measures of an individual's VOTs for /d/- and /t/-targets in a specific environment. In the case of the child, on the basis of a single accurate measurement of each, we would be able to say that the population VOT for /d/ (14.25 ms) is shorter than that for /t/ (22.3 ms) in the environment /_u:/. Put another way, it would be clear that the sample /d/ VOT and the sample /t/ VOT do not come from the same statistical population.

But of course VOTs are not like that. The data in table 6.2, though again invented, are much more realistic. This time there is considerable variation amongst /d/ VOTs and amongst /t/ VOTs in the same environment. As a result, it is no longer possible to make a simple comparison between a single /d/ VOT and a single /t/ VOT and come to a straightforward decision as to whether the /d/ VOTs and the /t/ VOTs came from the same population. In order to make this decision we will have to infer the structure of the populations from the structure of the samples.

If it were possible ever to obtain the complete population of /d/ VOTs for this child we could then calculate the mean VOT for the population. Let us designate it by μ. (It is customary for population values to be represented by Greek characters and for sample values to be represented by Roman characters.) Any individual value of VOT could then be understood as the sum of two elements: the mean, μ, of the population plus the difference, ε, between the mean value and the actual, observed VOT. A sample of VOTs, X1, X2, ..., Xn could then be expressed as:

X1 = μ + ε1
X2 = μ + ε2
...
Xn = μ + εn

μ is often called the 'true value' and εi the 'error in the i-th observation'.¹ Neither the word 'true' nor the word 'error' is meant to imply a value judgement. We suggest a more neutral terminology below.

Any individual (observed) value can then be seen as being made up of two elements: the true value (the mean of the population), and the distance, or deviation, of the observed value from the true value. To illustrate this, let us imagine that the mean of the population of the child's /d/ VOTs in the stated environment is 14.95. (Of course, in fact, the population mean can never be known for certain without observing the entire population, something which is impossible since there is no definite limit to the number of tokens of this VOT which the child might express.) So μ = 14.95. If we take the observed /d/ VOTs in table 6.2, these can be restated as follows:

14.95 + 2.10
14.95 − 1.25
14.95 + 3.14
(etc.)

¹ This is one of the many examples of statistical terminology, appropriate to the context in which a concept or technique was developed, being transferred to a wider context in which it is inappropriate or, at least, confusing. Scientists such as Pascal, Laplace and, particularly, Gauss in the second half of the eighteenth, and first part of the nineteenth, centuries were concerned with the problem of obtaining accurate measurements of physical quantities such as length, weight, density, temperature, etc. The instruments then available were relatively crude and repeated measurements gave noticeably different values. In that context it seemed reasonable to propose that there was a 'true value' and that a specific measurement could be described usefully in terms of its deviation from the true value. Furthermore, it seemed intuitively reasonable that the mean of an indefinitely large number of measurements would be extremely close to the true value provided the measuring device was unbiassed. By analogy, the mean of a population of measurements is often referred to as the 'true' value (cf. the 'true test score' in chapter 13) and any deviation from this mean as an 'error' or 'random error'.

The second element, the error, indicates the position of each observed value relative to the true value, or mean, μ. It will be represented by the symbol ε and its value may be either positive or negative. The division into true value and error can be extended to the population as a whole: any possible VOT for a /d/-target which might ever be pronounced by this child in this context can be represented as:

μ + ε

And that is an example of a statistical model. However, this definition of the model is not complete until some way is found to describe the variation in the value of ε from one VOT to another.

Returning to the child's VOTs, we use the model in this case to restate our problem. Is the mean of the population of /d/ VOTs the same as the mean of the population of /t/ VOTs? Does μ for /t/ equal μ for /d/?² Or do we assume that the /d/ VOTs and /t/ VOTs are members of the same population and that the child is not distinguishing between /d/ and /t/ in terms of VOT in the specified environment?

The example we have used in this chapter has concerned one individual providing a number of VOT values. But the model we have presented (population mean plus error) can also be applied in cases where a number of individuals have each provided a single value.

6.2 The sample mean and the importance of sample size
So far we have said that a population may be modelled by considering each of its values expressed as μ + ε. In the final section of chapter 4 we discussed some of the inferences we might want to make from a sample to a population. We may often wish to extract information from the sample about the value of the population mean, μ. Suppose now that we have a sample, X1, X2, ..., Xn, of n values from some population. It seems reasonable to imagine that there will be a more or less strong resemblance between the sample mean and the population mean. In particular, it appears to be a common intuition that in a 'very large' sample the sample mean should have a value 'very close' to the population mean, μ. Let us explore this intuition by considering a sample of just five observations:

X1 = μ + ε1, X2 = μ + ε2, X3 = μ + ε3, X4 = μ + ε4, and X5 = μ + ε5

Now form the sample mean, X̄. We have:

X̄ = (1/5) Σ Xi
  = (1/5)[X1 + X2 + X3 + X4 + X5]
  = (1/5)[(μ + ε1) + (μ + ε2) + (μ + ε3) + (μ + ε4) + (μ + ε5)]
  = (1/5)[5μ + (ε1 + ε2 + ε3 + ε4 + ε5)]
  = μ + ε̄    (ε̄ signifies 'mean error')

Clearly, the value of X̄ can also be expressed as a true value plus an error, where the true value is still μ, the population mean of the original Xs, and the error is the average value of the original errors. However, the mean of several errors is likely to be smaller in size than single errors, if only because some of the original errors will be negative and some will be positive, so that there will be a certain amount of cancelling out. It would seem, then, that the mean of a sample will have the same true value about which we require information and will tend to have a smaller error than the typical single measurement. The larger the sample size, the smaller the error is likely to be. This is such a central concept to the foundations of statistical inference that it is worth studying in some detail via a simple dice-throwing experiment.

A properly manufactured dice should be in the shape of a cube, with each face marked by a different one of the numbers 1, 2, 3, 4, 5 or 6, and be perfectly balanced so that no face is more likely to turn up than any other when the dice is thrown. If we were asked to predict the proportion of occurrences of any one number, say 3, in the entire population of numbers resulting from possible throws of the dice, our best prediction, given that there are six faces, would be one-sixth. We would expect each of the six numbers to occur on one-sixth of the occasions that the dice was thrown. This can be represented in a bar chart (figure 6.1), which would be a model for the population bar chart, i.e. a bar chart derived from all possible throws of a perfect dice. It is possible to calculate what would be the population mean, μ, of all the possible throws of this dice. Suppose the dice is thrown a very large number of times, N. Each possible value will also appear a very large number of times.

² It will be chapter 11 before we finally obtain the answer to this question. In the meantime, the reader is asked to accept that the truth will eventually be revealed and that the argumentation which will cause the delay is necessary to a proper understanding of the statistical methods used in obtaining the answer.
Figure 6.1. Bar chart for population of single throws of an ideal dice.

Suppose the value 1 appears N1 times, 2 appears N2 times, and so on. The total score achieved by the N throws will then be:

(N1 × 1) + (N2 × 2) + (N3 × 3) + (N4 × 4) + (N5 × 5) + (N6 × 6)

and the mean will be:

μ = (1/N)(N1 + 2N2 + 3N3 + 4N4 + 5N5 + 6N6)
  = N1/N + 2(N2/N) + 3(N3/N) + 4(N4/N) + 5(N5/N) + 6(N6/N)

If N is an indefinitely large number, the model of figure 6.1 implies that each possible value appears in one-sixth of throws. In other words:

N1/N = N2/N = N3/N = N4/N = N5/N = N6/N = 1/6

and:

μ = 1/6 + (2 × 1/6) + (3 × 1/6) + (4 × 1/6) + (5 × 1/6) + (6 × 1/6) = 3.5

With a similar kind of argument it is possible to show that the standard deviation of the population of dice scores is σ = 1.71.

A real dice was thrown 1,000 times and the results are shown in figure 6.2. As we might expect from the model shown in figure 6.1, each of the values occurred with roughly equal frequency. The mean of the 1,000 values is 3.61 and the standard deviation is 1.62. The actual outcome is similar to what would be predicted by the model we constructed for the population of throws of a perfect dice.

Figure 6.2. Typical histogram for the scores of 1000 single dice throws.

Using the model of the previous section we could express the value, X, of any particular throw of the dice as:

X = μX + ε

where μX = 3.5³ and ε takes one of the values 2.5 (X = 6), 1.5 (X = 5), 0.5 (X = 4), −0.5 (X = 3), −1.5 (X = 2) and −2.5 (X = 1).

Furthermore, from the physical properties of the dice we would expect that each of the 'errors' is equally likely to occur. This is a particularly simple model for the random error component ε, viz. that all errors are equally likely. The model is adequately expressed by the bar chart of figure 6.3 (which is identical to that in figure 6.1 except that the possible values of ε are marked on the horizontal axis rather than the possible scores), or by the formal statement:

'ε can take the six values 2.5, 1.5, 0.5, −0.5, −1.5, or −2.5'

and, for any particular throw:

P(ε = 2.5) = P(ε = 1.5) = P(ε = 0.5) = P(ε = −0.5) = P(ε = −1.5) = P(ε = −2.5) = 1/6

³ We have written μX here to indicate that we wish to refer to the population mean of the variable X. We will shortly introduce more variables and, to avoid confusion, it will be necessary to use a different symbol for the population mean of each different variable. Note here how inappropriate is the term 'true value': although the mean of the population of dice scores is μX = 3.5, it is never possible to achieve such a score on a single throw.


This latter description of the model is another example of a probability distribution (see section 5.3), the term for a list or a formula which indicates the relative frequency of occurrence of the different values of a random variable.

Figure 6.3. Bar chart of population of 'errors' for a single throw of a dice.

It is already obvious that the use of the word 'error' is hard to sustain, and we will from now on usually adopt the more neutral term residual, which indicates that ε is the value remaining when μ has been subtracted (μ + ε − μ = ε).

A second experiment was carried out with the dice. This time, after every ten throws the mean was taken of the numbers occurring in these throws. Thus if the first ten throws were 2, 2, 6, 3, 4, 1, 2, 5, 2, 6, a mean score of 3.3 (33 ÷ 10) was noted. In this way, 1,000 numbers, each a mean of ten scores, were obtained. Let us call them Y1, Y2, ..., Y1000. These numbers are a sample of a population of means - the means that would occur if the procedure were repeated indefinitely. Since there is effectively no limit to the number of times that the procedure could be repeated, the population of mean scores is infinite. The distribution, or histogram, of the population is known as the sampling distribution of the sample mean. The histogram of the sample of 1,000 mean scores is shown in figure 6.4.

Note that the histogram is quite symmetrical; it is shaped rather like an inverted bell. Furthermore, the mean of these 1,000 sample means was 3.48 and their standard deviation was 0.62, which is about one-third of the standard deviation of the original population of scores of single throws. Each of the individual mean scores can be written as Yi = μY + εi where, as we have shown above, μY = μX = 3.5 (since each Y is the average of a sample of Xs), and the residuals are means of the residuals of single scores and thus will generally be smaller than those for single scores. This seems to be borne out by the smaller value of the standard deviation, which indicates that the Y values are less spread out than the X values. The histogram of figure 6.4 shows this feature quite clearly when it is compared with figure 6.2.

Figure 6.4. Histogram of 1000 means of ten dice throws.

The whole experiment was repeated several times, using a different sample size each time. In every experiment 1,000 mean scores were obtained; the means and standard deviations of the 1,000 scores for each sample size are recorded in table 6.3. It can be seen that as the sample size is increased the standard deviation of the sample mean decreases, indicating that the larger the sample size, the closer the sample mean is likely to be to the true value. There is, indeed, a simple relationship, which can be demonstrated theoretically, between the standard deviation of a population of sample means and the standard deviation of the population of single scores. To obtain the standard deviation for sample means of samples of size n, the standard deviation of single scores should be divided by the square root of n. For example, in this case the population standard deviation of single scores is 1.71. For the population of sample means based on samples of size 10, the standard deviation will be 1.71 ÷ √10 = 0.54. (The sample of 1,000 such sample means had a sample standard deviation of 0.62.) Other examples appear in table 6.3.

The results of the series of experiments support the intuition that the sample mean should be 'something like' the population mean, and that the bigger the sample, the closer the sample mean will tend to be to the population mean.
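The second experiment is easy to replicate with simulated throws (a sketch only: 1,000 means of ten throws each, compared with the theoretical value 1.71 ÷ √10 ≈ 0.54; the seed is fixed purely so that the run is reproducible):

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(1)  # fixed seed so repeated runs give the same sample

# 1,000 sample means, each the mean of ten simulated dice throws.
means = [mean(random.randint(1, 6) for _ in range(10)) for _ in range(1000)]

print(round(mean(means), 2))     # close to the population mean 3.5
print(round(pstdev(means), 2))   # close to 1.71 / sqrt(10)
print(round(1.71 / sqrt(10), 2)) # 0.54
```

As in the printed experiment, the simulated means cluster around 3.5 with a spread roughly one-third of that of single throws.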

Table 6.3. The mean and the standard deviation of the sample mean

Number of throws          Typical sample of 1000 scores    Population of scores
averaged for each score   Mean    Standard deviation       Mean (μ)   Standard deviation
1                         3.61    1.62                     3.5        1.71
10                        3.47    0.62                     3.5        1.71/√10 = 0.54
25                        3.49    0.29                     3.5        1.71/√25 = 0.34
100                       3.51    0.14                     3.5        1.71/√100 = 0.17
400                       3.50    0.098                    3.5        1.71/√400 = 0.086
1000                      3.50    0.059                    3.5        1.71/√1000 = 0.054

However, we would like to be more specific than this. For example, we would like to be able to calculate a sample mean from a single sample and then say how close we believe it to be to the true value. Alternatively, it would be useful to know what size of sample we need to attain a particular accuracy. In order to answer such questions we need a model to describe the way that the value of the residual, ε, might vary from one measurement to another.

6.3 A model of random variation: the normal distribution
The model to be presented in this section is one for random variation. This term is in general use for those variations in repeated measurements which we seem unable to control. For example, the VOTs in table 6.2 are all different, though they purport to be several measurements of the same quantity all obtained under similar conditions, and they vary according to no recognisable or predictable pattern. It is not easy to define randomness. Its essential quality is lack of predictability. If we think of the child producing tokens of /d/ in the specified environment, there is no way in which we can predict in advance precisely what the VOT of the next token will be. In this sense, the variation in VOTs is random.⁴

Random variation can take many forms. The histogram of a population of measurements could be symmetrical with most of the values close to the mean, or skewed to the right with most values quite small but with a noticeable frequency of larger-than-average values. We have already discussed in chapter 2 the possibility that a histogram could be U-shaped or bi-modal, in which case most values would be either somewhat larger or somewhat smaller than the mean and very few will be close to the mean value. With this range of diversity, is it possible to formulate a general and useful model of random variation?

In figure 6.5 we have superimposed the histograms for several of our dice experiments. As we increase the number of throws whose mean is calculated to give a score, the histogram of the scores becomes more peaked and more bell-shaped. It is a fact that, even if the histogram of single scores had been skewed, or U-shaped, or whatever, the histograms of the means would still be symmetrical and bell-shaped for large samples. Furthermore, it can be demonstrated theoretically that, for large samples, the histogram of the sample mean will always have the same mathematical formula irrespective of the pattern of variation in the single measurements that are used to calculate the means. The formula was discovered by Gauss about two centuries ago and the corresponding general histogram (figure 6.6) is still often called the Gaussian curve, especially by engineers and physicists. During the nineteenth century the Gaussian curve was widely used, in the way that we describe below and in succeeding chapters, to analyse statistical data. Towards the end of that century other models were proposed for the analysis of special cases, though the Gauss model was still used much more often than the others. Possibly as a result of this it became known as the normal curve or normal distribution, and this is how we will refer to it henceforth.

⁴ Even if there were a discernible pattern, attributable perhaps to the effect of fatigue, it would still be impossible to make precise predictions about future VOTs; there would still be a random element. It is simpler at this stage of our exposition to deny ourselves the luxury of introducing a third element, such as fatigue effect, into our model of VOT populations.
Figure 6.5. Histogram of means of different numbers of dice throws.

We have here an example of a very stable and important statistical phenomenon. If samples of size n are repeatedly drawn from any population, and the sample means (i.e. the means of each of the samples) are plotted in a histogram, we find the following three things happen, provided that n, the sample size, is large: (1) the histogram is symmetrical; (2) the mean of the set of sample means is very close to that of the original population; (3) the standard deviation of the set of sample means will be very close to the original population standard deviation divided by the square root of the sample size, n.

We can go further than this. If the sample size, n, is large enough, then the histogram of the means of the samples of size n can always be very closely described by a single mathematical formula, irrespective of the population from which the samples are drawn. The only differences will be (a) the position of the centre of the histogram will depend on the value of the original population mean, μ; and (b) the degree to which it is peaked or flat depends on σ, the standard deviation of the original population: the larger σ is, the more spread out will be the histogram; the smaller σ is, the higher will be the peak in the centre.

Figure 6.6. The normal, or Gaussian, curve.

This patterning of sample means allows us to develop a statistical model for the histogram of the population of sample means from any experiment. In order to construct such a model for a particular case we need to know the mean and standard deviation of the population from which each sample is drawn. Each such model histogram, which will exhibit the shared characteristics of the histograms in figure 6.5, will closely approximate the true population histogram of figure 6.6, provided that the sample size is 'large'. (We will have more to say in succeeding chapters about the meaning of 'large' in this context.)⁵

The normal distribution is basic to a great deal of statistical theory which assumes that it provides a good model for the behaviour of the sample mean. It is this which will allow us to give answers to some of the various problems which we set ourselves in chapter 4. Before we can do this, however, we must learn to use the tables of the normal distribution.

6.4 Using tables of the normal distribution
We said in the previous section that the normal distribution is a good model for the statistical behaviour of the sample mean. We will use it in this way in future chapters. But this is not its only use. It turns out that any variable whose value comes about as the result of summing the values of several independent, or almost independent, components can be modelled successfully as a normal distribution. The size of a plant, for example, is likely to be determined by many factors such as the amount of light, water, nutrients and space, as well as its genetic make-up, etc. And it is indeed true that the distribution of the sizes of a number of plants of the same species grown under similar conditions can be modelled rather well by a normal distribution, i.e. a histogram representing the sizes of plants of a certain species will have the characteristic 'normal' shape.

⁵ The discussion in the last few paragraphs can be summarised by what is known as the Central Limit Theorem: 'Suppose the population of values of a variable, X, has mean μ and standard deviation σ. Let X̄(n) be the mean of a sample of n values randomly chosen from the population. Then as n gets larger the histogram of the population of all the possible values of X̄(n) becomes more nearly like the histogram of a normal distribution with mean μ and standard deviation σ/√n. This result will be true whatever is the form of the original population histogram of the variable, X.'

sizes of plarits- of a -certain species will have the characteristic 'normal' . the same value is subtracted from the mean. So the mean value of the
shape. This is true of many biological measurements.' It is also often new variable, Y, will be equal to the mean value of X minus Jl.. That
true of scores obtained on educational and psychological tests. Certainly 1S."
tests cat) be constructed in such a way that the population of test scores (mean Y,) = (mean X) - p.
will have a histogram very like that of a normal distribution. We need
=!'-!'
not bother how such tests are constructed. For the rest of this chapter =o
we will simply assume that we have such a test.
The test we have in mind is a language aptitude test, that is, a test The standard deviation of Y will, however, be exactly the same as that
designed to predict individuals' ability to learn foreign languages. The of X. All the values of X have been reduced by the same amount; they
distribution of test scores, over all the subjects we might wish to administer will still have the same relative values to one another and to the new
the test to, can be modelled quite closely by a normal distribution with mean value. In other words, subtraction of a constant quantity from all
a mean of so marks and a standard deviation of xo marks. Suppose that the elements of a population will not affect the value of the standard devia-
we know that a score of less than 35 indicates that the test-taker is most tion (exercise 6.J). To complete the standardisation we have to change
unlikely to be successful on one of the intensive foreign language courses. this standard deviation so that it will have the value I. We can do that
We might wish to estimate the proportion of all test-takers who would by dividing all the Y values by the number rt, the standard deviation
score below 35. We can do this very easily using tables of the normal of Y (and X). When a vmiable is divided by some number, the mean
distribution. In the following italicised passage we say something about is divided by the same number. So if we write Z= Y/rr it will be true
how the tables are constructed and this will help explain why they can that:
be used in the way that they are, and in particular, why we do not need (meanofY) o
to have separate tables for every possible combination of mean and standard meauofZ -=o
deviation. The reader may wish to skip this passage in order to see how " "
the tables are used, returning to this point in the text only later. Furthermore (see exercise6.4):
A normal distn'bution can have any mean or standard deviation. We (standard deviationofY) rr
therefore cannot give tables of every possible normal distribution - there standard deviation of Z =-=I
are simply too many possibilities. Fortunately, one of the exceptional and " "
extremely useful properties of the nonnal model is that an observation By these two steps we have changed X, a variable with population mean
jimn any normal distribution can be converted to an observation from 11. and standard deviation a; into Z, a van'able whose mean is zem and
the standard normal distribution by using the standardisation procedure standard deviation is unity, I. Remember what the two steps are. From
desclibed in 3 9. The population mean ofa standardised variable is always each value X we subtract Jl., the mean of the population of X values,
zero and its standard deviation is I. Let us try to see. why this is true .. and then divide the result by the population standard deviation, rr. As
Suppose a vmiable, X, comes from a papulation with mean 11. and stan- before, we can write the complete rule in a single formula:
dan/ deviation rr. First we change X into a new van'able, Y, by the rule:
Z=X-!'
Y=X-1'
\Vhen we subtract a constant quantity from all the elements in a population,
"
/.is called a standardised rmzdom van'able whether or not the distribution
11
Equally, it is 11ot tnll'- of many variables. The distribution of income or wealth in many
l!f the migina/ scores can be modelled successfully as a normal distn'bution.
5ocit:lica is Utnmlly skewed, with the great bulk of individuals receiving less thai1 the mean 1/owever, it is a special property of the normal distn'bution that if X was
income since the mcnn is inflntcd by n few very large incomes, A similar effect CliO often mmnal(v distributed then Z will also have a normal distribution. We say
he lii!Cil in the di~tribution of the time required to tt~urn a new tusk - n few individuals
w.ill t11kc 't-'t'!)' mu~::h longer lhnn the othcrtllU k-IIJ'n a new ~>ltill. 1t ought nul Hd1c-difllcult that Z has the standard normal distribution. We can exploit all this
tc1 thin!~ t)f otlwt ('Xunplt!ll, to change l}Ul!stilms about any nomllllly di.vtributed random valiable i11to
9!
equivalent questions about the standard normal variable and then use tables of that variable to answer the question. In other words, we do not need tables for every different normal distribution.

We want to know the proportion of test-takers we can expect to achieve a score of less than 35. To put this another way, if we choose a test-taker at random and obtain his test score, X, we wish to know the likelihood that the inequality X < 35 will be true. In order to answer this question, we will have to alter it until it becomes an equivalent question about the corresponding standardised score. (This is because, as was explained in the italicised passage, the tables we will use relate to the standard normal distribution.) This can be done as follows:

X < 35 is equivalent to X - 50 < 35 - 50 (subtracting the mean)
       is equivalent to X - 50 < -15
       is equivalent to (X - 50)/10 < -15/10 (dividing by the standard deviation)
       is equivalent to (X - 50)/10 < -1.5

What have we done? We have altered X by subtracting the population mean and dividing by the standard deviation; we have standardised X and changed it into the standard normal variable Z. Thus:

X < 35 is equivalent to Z < (35 - 50)/10, i.e. Z < -1.5

This is another way of saying that a subject whose test score is less than 35 will have a standardised test score less than -1.5. (Note that the minus sign is extremely important.)

Table A2 in appendix A gives the probability that Z < -1.5. Notice that the table consists of several pairs of columns. The left column of each pair gives values of Z. The right column gives the area of the standard normal histogram that lies to the left of the tabulated value of Z. (The relationship between areas in histograms and probabilities was discussed in chapter 5.) The diagram and rubric at the head of the table should be helpful.

For the example we have chosen we find P = 0.0668. Hence we can say that about 7% (6.68%) of scores will be less than 35. The accuracy of this answer will depend on how closely the distribution of the population of test results can, in fact, be described by the normal distribution with the same mean and variance as the population. If the population of test scores has a distribution which cannot be modelled by a normal distribution, it would be inappropriate to use standard scores in this way, since Z would not have a standard normal distribution.

SUMMARY

This chapter has discussed the concept of a statistical model.

(1) A model for a single measurement, X, was proposed: X = μ + ε where μ is the 'true' value or population mean and ε the error, deviation from the mean or residual.
(2) It was argued that means of samples of measurements would be less variable than individual measurements.
(3) The sampling distribution of the sample mean was introduced: for any random variable X with mean μ and standard deviation σ, the variable X̄, calculated from a sample of size n, will have the same mean μ but a smaller standard deviation, σ/√n. Furthermore, if n is large, X̄ will have a normal distribution.
(4) It was shown how to use tables of the standard normal distribution to answer questions about any normally distributed variable.

EXERCISES

(1) (a) Using the procedure of exercise 5.2, choose a random sample of 100 words and find the mean and standard deviation of the sample of 100 word lengths.
    (b) Divide the 100 words into 25 subsamples of 4 words each and calculate the mean word length of each subsample.
    (c) Calculate the mean and standard deviation of the 25 subsample means.
    (d) Discuss the standard deviations obtained in (a) and (c) above.
(2) Assuming that the 'true' mean VOT for /d/ for the observed child is 14.25, calculate the residuals for the /d/ VOTs of table 6.2.
(3) (a) Calculate the standard deviation of the /d/ VOTs of table 6.2.
    (b) Calculate the standard deviation of their residuals (see exercise 6.2). Discuss.
(4) (a) Calculate the standard deviation of the /t/ VOTs of table 6.2.
    (b) Divide each VOT by the standard deviation calculated in (a).
    (c) Calculate the standard deviation of these modified values and discuss the result.
(5) A score, Y, on a test is normally distributed with mean 120 and standard deviation 20. Find:
    (a) P(Y < 100)  (b) P(Y > 140)
    (c) P(Y < 130)  (d) P(Y > 105)
    (e) P(100 < Y < 130)  (f) P(135 < Y < 150)
    (g) the score which will be exceeded by 95% of the population.
    (Hint: You may find it helpful to begin by drawing sketches similar to figure 5.3.)

7
Estimating from samples

Chapter 6 introduced the normal distribution and the table associated with it. In the present chapter we will show how to make use of these to assess how well population values might be estimated by samples. We return to the question of measuring the /d/ VOT for a child (1;8) discussed in the previous chapter. We introduced there a model for a specific token, X, of VOT expressed by the child, namely:

X = μ + ε

which says that the value of the token can be considered as a mean (or 'true') value plus a random residual. If the value of μ were known we could use this single value as the /d/ VOT for the child and go on perhaps to compare it with the /t/ VOT (i.e. the mean of the population of /t/ VOTs) of the same child to decide whether the child is distinguishing between /d/ and /t/. In the present chapter we will consider the extent to which the value of μ can be estimated from a sample of /d/ VOTs. In chapter 11 we will return to the problem of comparing two different populations of VOTs.
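Before turning to the details, the idea can be previewed in a short simulation. This is a sketch with invented values (the μ and σ below are illustrative, not the child's actual data): tokens are generated from the model X = μ + ε and the sample mean is then compared with the 'true' μ.

```python
import random
import statistics

random.seed(0)
MU, SIGMA = 15.0, 5.0   # assumed 'true' mean and sd of the VOT population, in ms

# 100 simulated tokens from the model X = mu + epsilon,
# with epsilon a normally distributed residual.
tokens = [MU + random.gauss(0, SIGMA) for _ in range(100)]
x_bar = statistics.mean(tokens)
print(round(x_bar, 2))   # close to MU, but not exactly equal to it
```

Repeating the experiment with a fresh seed gives a different sample mean each time; the chapter is about how far such means can be expected to wander from μ.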

7.1 Point estimators for population parameters

It has to be recognised at the outset that the question we have just posed concerns population means. Clearly we do not have direct access to the population means; all we can do is estimate them from the sample values available to us. The question is 'how?'

It seems intuitively reasonable to suppose that the mean value, X̄, of a sample of /d/ VOTs will be similar to the mean, μ, of the population of VOTs that the child is capable of producing. But how accurate is this intuition? How similar will the two values be? X̄ has two mathematical properties which sanction its use as an estimator for μ. First of all, it is unbiassed. In other words, in some samples X̄ will be smaller than μ, in some it will be larger, but on average the value will be correct;
the mean of an infinitely large number of such sample means would be the mean of the population, the very quantity which we wish to estimate. Second, X̄ is a consistent estimator. This is the technical term used to describe an estimator which is more likely to be 'close to the true value' of the parameter it is estimating if it is based on a larger sample. We have seen that this is the case for X̄ in figure 6.5. The larger the sample size, the more closely the values of X̄ will cluster around the population mean, μ.

In fact, it is extremely common to use X̄ as an estimator of the population mean, not just for VOTs. The mean of any sample can be used to estimate the mean of the population from which the sample is drawn. A single number calculated from a sample of data and used to estimate a population parameter is usually referred to as a point estimator (in opposition to the interval estimators introduced below). There are many instances of a sample value being used as a point estimator of the corresponding population parameter. The proportion, p̂, of incorrectly marked past tense verbs in a sample of speech from a 2-year-old child is an unbiassed and consistent estimator of the proportion of such errors in the child's speech as a whole. The variance, s², of a sample of values from any population is likewise an unbiassed¹ and consistent estimator of the population variance, σ².

7.2 Confidence intervals

Although a point estimator is an indicator of the possible value of the corresponding population parameter, it is of limited usefulness by itself. Its value depends on the characteristics of a single sample; a new sample from the same population will provide a different value. It is therefore preferable to provide estimates which take into account explicitly this sampling variability and state a likely range within which the population value may lie. This is the motivation behind the idea of a confidence interval. We will illustrate the concept by considering again the VOT problem discussed at the beginning of the chapter.

Suppose that we have a sample of 100 /d/-target VOTs from a single child and find that the sample has a mean value of X̄ = 14.88 ms and a standard deviation s = 5.00 ms. Let us suppose further that the population of /d/ VOTs which could be produced by the child has mean μ and standard deviation σ, though of course we cannot know what these values are. Now we discovered in the previous chapter that, for reasonably large samples of size n, X̄ is a random observation taken from a normally distributed population of possible sample means. The mean of that population is also μ but its standard deviation is σ/√n. Again, we know from the characteristics of the normal distribution discussed in chapter 6 that it follows that 95% of samples from this population will have a mean value, X̄, in the interval μ ± 1.96σ/√n (i.e. within 1.96 standard deviations of the true mean value). If σ is small or n is very large, or both, the interval will be narrow, and X̄ will then usually be 'rather close' to the population value, μ. What we must do now is see just how close we can expect a sample mean of 14.88 to be to the population mean, given a sample size of 100 and an estimated σ of 5. You may wish to skip the italicised argumentation that follows, returning to it only when you have appreciated the practical outcome.

We can first calculate the standard deviation of X̄, the sample mean:

s_X̄ = σ/√n = 5/√100 = 5/10 = 0.5

Using the same argument as in 6.4 we know that:

(X̄ - μ)/0.5 = Z

has a standard normal distribution and that², from table A3:

P(-1.96 < Z < 1.96) = 0.95

In other words:

P(-1.96 < (X̄ - μ)/0.5 < 1.96) = 0.95

But the inequality: (X̄ - μ)/0.5 < 1.96 is the same as: X̄ - μ < (1.96 × 0.5)

1 In chapter 3 it was stated that, in the calculation of s², the sum of squared deviations is divided not by the sample size, n, but by (n - 1). The main reason for this is to ensure that s² is an unbiassed estimator of σ². If n were used in the denominator the sample variance would, on average, underestimate the population variance, though the discrepancy, or bias, would be negligible in large samples.
2 Strictly speaking we have shown this to be true only if the population standard deviation is known, whereas here the sample standard deviation, usually called the standard error, has been used. However, it can be shown that the argument still holds, even when the sample value is used, provided the sample size is large; n = 100 ought to be large enough. Questions about sample size are discussed further in 7.5.
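The table values used in this argument and in chapter 6 (for instance P(Z < -1.5) = 0.0668 and P(-1.96 < Z < 1.96) = 0.95) need not be taken on trust. A minimal sketch, using the standard relation between the normal curve and the error function available in Python's standard library:

```python
import math

def phi(z):
    """Cumulative probability P(Z < z) for the standard normal variable Z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(round(phi(-1.5), 4))               # area below z = -1.5, as in table A2
print(round(phi(1.96) - phi(-1.96), 4))  # central area used for the 95% interval
```

The same function can replace a table lookup for any of the probabilities required in this chapter.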
Similarly, the inequality: -1.96 < (X̄ - μ)/0.5 is the same as: μ < X̄ + (1.96 × 0.5)

Therefore, in place of: P(-1.96 < Z < 1.96) = 0.95 we can write:

P{(X̄ - 1.96 × 0.5) < μ < (X̄ + 1.96 × 0.5)} = 0.95

Now, the value of μ is some fixed but unknown quantity. The value of X̄ varies from one sample to another. The statement:

P{(X̄ - 1.96 × 0.5) < μ < (X̄ + 1.96 × 0.5)} = 0.95

means that if samples of size 100 are repeatedly chosen at random from a population with σ = 5 then for 95% of those samples the inequality will be true. In other words, the interval X̄ ± (1.96 × 0.5) will contain the value μ about 95% of the time. This interval is called a 95% confidence interval for the value of μ.

In the present example, the value 0.5 is just the standard deviation of X̄ which we know to have the value σ/√n in general. So (in general) we can say that:

X̄ ± 1.96σ/√n

is a 95% confidence interval for μ, the population mean. If you like, you may interpret this by saying that you are '95% certain' that μ lies inside the interval derived from a particular sample. In large samples (see 7.5) the sample standard deviation, s, can be used in place of σ and the interval X̄ ± 1.96(s/√n) will still be a 95% confidence interval for μ.

In the case we have used to exemplify the procedure, we have X̄ = 14.88 and the interval is 14.88 ± (1.96 × 0.5). Thus we are '95% sure' that the mean /d/ VOT of the child in his speech as a whole lies in the interval (13.90, 15.86) milliseconds. The sample standard deviation of the sample mean is called the standard error of the sample mean, and the 95% confidence interval is often written as: X̄ ± 1.96 (standard error of X̄). The term standard error is generally used to refer to the sample standard deviation of any estimated quantity.

7.3 Estimating a proportion

Another question raised in chapter 4 which we will deal with here concerned the estimating of proportions. How can we estimate from a sample of a child's speech (at a particular period in its development) the proportion of the population of tokens of correct irregular past tenses as opposed to (correct and incorrect) inflectionally marked past tenses, e.g. ran, brought, as opposed to runned, bringed, danced? In a sample of an English child's conversations during her third year, the following totals were observed:

Correct irregular past tense: 987
Inflectionally marked past tense: 372
Total: 1359

The proportion (p̂) of inflectionally marked past tenses in this sample is 372/1359 = 0.2737. Within what range would we expect the population proportion (p) to be? Just as with the mean (μ) in the previous section, it will be possible to say that we can be 95% sure that the population proportion is within a certain range of values. Indeed, the question about proportions can be seen as one about means. Suppose that a score, Y, is provided for each verb where:

Yᵢ = 1 if the ith verb is inflectionally marked
Yᵢ = 0 if the ith verb is not inflectionally marked

The mean of Y is Ȳ = ΣYᵢ/n and this is just the proportion of verb tokens which are inflectionally marked.

Thus p̂ is in fact a sample mean and we know, therefore, from the Central Limit Theorem, that the population of values of p̂ will have a normal distribution for large samples. In order to calculate confidence intervals for p as we did in the previous section for μ, we need to know the standard deviation of the population of sample proportions. As it turns out, there is a straightforward way of calculating this. At the same time, however, for a technical reason, the confidence limits for p are not calculated in quite the same way as for μ. The reader may wish to avoid the explicit discussion of these complicating factors and go directly to the formula for determining confidence limits for p, which is immediately after the italicised passage.

We noted in the preceding paragraph that we need to know the standard deviation of the population of sample proportions. We could estimate the
sample variance of the sample of Y values, s_Y², and then use s_Y/√n as the standard deviation of p̂. We would then, for large n, have as a 95% confidence interval for the true value p the interval p̂ ± 1.96 s_Y/√n.

In fact this procedure will do perfectly well for large values of n. On the other hand, it is the case that the value of s_Y² is always very close to p̂(1 - p̂). (It can be shown algebraically that:

s_Y² = (n/(n - 1)) p̂(1 - p̂)

and for large n the factor n/(n - 1) is almost exactly equal to 1 and can be ignored.) This means that we can avoid calculating s_Y² and that, as soon as we have calculated p̂, we can write immediately that s_Y² = p̂(1 - p̂), so that s_Y = √(p̂(1 - p̂)) and Ȳ = p̂. Thus a 95% confidence interval for p is:

p̂ ± {1.96√(p̂(1 - p̂)/n)}

For our example:

ΣY = 372
p̂ = 372/1359 = 0.2737

s_Y² = p̂(1 - p̂) = (372/1359) × (987/1359) = 0.1988

Hence, a 95% confidence interval for the true proportion of inflectionally marked past tenses in the speech of this child during this period is:

0.2737 ± 1.96√(0.1988 ÷ 1359) = 0.2737 ± 0.0237, i.e. (0.2500, 0.2974).

Unfortunately, for a technical reason which we will not explain here (see Fleiss 1981), this will give an interval which is a little too narrow. In other words, the probability that the true value of p lies inside the confidence interval will actually be a bit less than 95%. A minor correction ought to be used to adjust the interval to allow for this. The formula to give the correct 95% confidence interval is:

p̂ ± {1.96√(p̂(1 - p̂)/n) + 1/(2n)}

In the present case this means:

0.2737 ± (0.0237 + 1/2718)

i.e. (0.2497, 0.2978). This is now a more accurate 95% confidence interval for the true proportion. The correction has not made much difference here because the sample size was rather large. It would be more important in smaller samples.

7.4 Confidence intervals based on small samples

The second issue which we will deal with in this chapter concerns the estimates of population values on the basis of small samples. In chapter 4 mention was made of a study of syntactic and lexical differences between normal and language-impaired children in which Fletcher & Peters (1984) isolated a small group of children (aged 3-6 years) who failed the Stephens Oral Screening Test and who were also at least six months delayed on the Reynell Developmental Language Scales: Receptive. The children had hearing and intelligence within normal limits, were intelligible, and had not previously had speech therapy. Two hundred utterances were collected from each child in conversations under standard conditions and these were subjected to syntactic and lexical analysis. The purpose of the study was to make a preliminary identification of grammatical and lexical categories which might distinguish language-impaired children from normal. There were a number of categories examined in the study which discriminated the groups; for the purposes of this section we will consider only one, 'verb expansion'. This is a measure of the occurrence of auxiliary plus verb sequences in the set of utterances by a subject. The data for 'verb expansion' in the language-impaired group are shown in table 7.1.

Table 7.1. Verb expansion scores of eight children

Child 1    0.235
Child 2    0.270
Child 3    0.265
Child 4    0.300
Child 5    0.320
Child 6    0.275
Child 7    0.105
Child 8    0.260

Given a sample mean here, X̄, of 0.254, how close would we expect this to be to the mean score of the population from which the sample was drawn? How does the small sample size affect the way we go about establishing a confidence interval?

Because of the small sample size we cannot rely on the Central Limit Theorem, nor can we assume that the sample variance is very close to the true population variance (which is unknown). We can proceed from
this point only if we are willing to assume that the population distribution can be modelled by a normal distribution, i.e. that verb expansion scores over all the children of the target population would have a histogram close to that of a normal distribution. The validity of the procedure which follows depends on this basic assumption. Fortunately, many studies have shown that the procedure is quite robust to some degree of failure in the assumption. However, we should always remember that if the distribution of the original population were to turn out to be decidedly non-normal the method we explain here might not be applicable. In the present example there is no real problem. The score for each child is the proportion of auxiliary plus verb sequences found in a sample of 200 utterances. As we have argued above, a sample proportion is a special type of sample mean and the Central Limit Theorem allows us to feel secure that a sample mean based on 200 observations will be normally distributed. Hence the verb expansion scores will meet the criterion of normality.

The method of calculating confidence intervals for small samples is, in fact, essentially the same as for larger samples. The difference stems from the fact that for smaller samples the quantity (X̄ - μ)/(s/√n) does not have a standard normal distribution because s² is not a sufficiently accurate estimator of σ² in small samples even when, as here, we are assuming that the population is normally distributed. This means that we cannot establish a confidence interval by the argument used above. Fortunately, the distribution of (X̄ - μ)/(s/√n) is known, provided the individual X values are normally distributed, and is referred to as the t-distribution. The t-distribution has a symmetric histogram like the normal distribution, but is somewhat flatter. For large samples it is virtually indistinguishable from the standard normal distribution. In small samples, however, the t-distribution has a somewhat larger variance than the standard normal distribution.

In order to calculate the 95% confidence interval we make use of the formula X̄ ± ts/√n, in which t is the appropriate 5% value taken from tables of the t-distribution (table A4). This t-value varies according to sample size. You will see that on the left of the tables there is a column of figures headed degrees of freedom which run consecutively, 1, 2, 3, etc. This is a technical term from mathematics which we shall not attempt to explain here (but see chapter 9). It is important to enter the tables at the correct point in this column, i.e. with the correct number of degrees of freedom. The appropriate number is (n - 1) (one less than the number of observations in the sample). Thus, even without understanding the concept of degrees of freedom, you can see that by doing this we take into account the size of the sample and hence the shape of the distribution for that sample size. Since in the present case there are eight observations, the t-value will be based on 7 df. We enter the table at 7 df and move to the right along that row until we find the value in the column headed '5%'. The value found there (2.36) is the 5% t-value corresponding to 7 df and may be entered into the formula presented above. So the 95% confidence interval is:

0.254 ± (0.0653/√8 × 2.36) = 0.254 ± 0.054

i.e. from 0.200 to 0.308; we are 95% confident that the true mean verb expansion score for the population from which the eight subjects were drawn lies between 0.200 and 0.308. Note that if we had constructed the confidence interval, incorrectly, using the standard normal tables we would have calculated the 95% confidence interval to be (0.209, 0.299), thus overstating the precision of the estimate. The smaller the sample, the greater would be this overstatement.

7.5 Sample size

At various points in the book we have already referred to 'large' and 'small' samples and will continue to do so in the remaining chapters. What do we mean by a large sample? As in any other context, 'large' is a relative term and depends on the criteria used to make the judgement. Let us consider the most important cases.

7.5.1 Central Limit Theorem
In the discussion of the theorem in chapter 6 we pointed out on several occasions that it is only in 'large' samples that we can feel secure that the sample mean will be normally distributed. The size of sample required depends on the distribution of the individual values of the variable being studied and frequently not much will be known about that. If the variable itself happens to be normally distributed then the mean of any size of sample, however small, will have a normal distribution. If the variable is not normal but has a population histogram which has a single mode and is roughly symmetrical, i.e. is no more than slightly skewed, then samples of 20 or so will probably be large enough to ensure the normality of the sample mean. Only when the variable in question is highly skewed or has a markedly bi-modal histogram will much larger samples be required. Even then, a sample of, say, 100 observations ought to be large enough.
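The whole small-sample calculation for table 7.1 can be reproduced in a few lines. A sketch in Python, with the 5% t-value for 7 df taken from the t-table rather than computed:

```python
import statistics

# Verb expansion scores from table 7.1.
scores = [0.235, 0.270, 0.265, 0.300, 0.320, 0.275, 0.105, 0.260]
n = len(scores)

x_bar = statistics.mean(scores)   # about 0.254
s = statistics.stdev(scores)      # sample sd, with the (n - 1) divisor
t_5pc = 2.36                      # 5% point of t on 7 df, from table A4

half_width = t_5pc * s / n ** 0.5
print(round(x_bar - half_width, 3), round(x_bar + half_width, 3))
```

Replacing `t_5pc` with 1.96 reproduces the (narrower, and here incorrect) interval that the standard normal tables would give.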
7.5.2 When the data are not independent
The above comments on sample size are relevant where the observations in the sample are independent of one another - an assertion impossible to sustain for certain types of linguistic data.

For example, suppose we wish to estimate, for a single child, the proportion of verbs marked for the present perfect in his speech. If we analyse a single, long conversational extract of the subject's speech it is perfectly possible that, following a question from his interlocutor like 'What have you been doing?', he will respond with a series of utterances that contain verbs marked for present perfect. Thus, if he produces, say, five consecutive verbs marked for present perfect, it may be argued that he has made only one choice and not five. There is a sense in which he has provided a single instance of the verb form in which we are interested, not five. If, on the other hand, we admit for analysis only every twentieth verb that the subject produces, we might more reasonably hold that each tense or aspect choice appearing in the samples is independent of the others, and could consider that the tokens we have selected constitute a random sample. What implications does this view of independence have for research in applied linguistics and the efficient collection and analysis of language data, particularly when they comprise natural speech?

Information is a commodity which has to be paid for like any other, and resources (money or time) available to purchase it will always be limited. A data set consisting of n related values of a variable always contains less information for the estimation of population means and proportions than does a sample of n independent observations of the same variable. (The extent of the loss of information will depend on how correlated - see chapter 10 - the observed values are.) Thus, if it is possible to obtain n independent values at the same cost as n interdependent values it will be more efficient to do so. In particular, if the values are not independent much bigger samples than usual will be needed to assume that the sample means are normally distributed. Suppose we have decided to transcribe and analyse 100 utterances. (For simplicity, assume that each utterance contains only one main verb. The message will still be the same even if this is not true, but the argument is more complex.) We might: (a) record and transcribe the first 100 utterances; (b) record a total of 500 utterances and transcribe every fifth utterance. The transcription costs will be roughly equal in both cases but strategy (b) will require a recording period five times as long as (a). However, if it is still possible to carry out the recording in a single session the real difference in cost may not be very large. Even if this is not possible, the extra inconvenience of needing two or three sessions may well be repaid by data which are more informative in the sense that the standard errors of estimated quantities are smaller - i.e. confidence intervals are narrower and hypothesis tests more sensitive.³

Suppose, for the sake of illustration, that the subject, throughout his speech as a whole, puts 50% of his verbs in the present perfect. The probability that a randomly chosen verb in a segment of his speech is present perfect will then be 0.5. However, if the first verb in the segment is marked in this way then the probability that the next verb he utters is also present perfect will be greater than 0.5 (because of a 'persistence effect'). The fifth or tenth verb in a sequence is much less likely to be influenced by the first than is the immediate successor. Again to illustrate, let us suppose that the probability that the next verb is the same type as its immediate predecessor is 0.9. Now suppose that 100 verbs are sampled to estimate the proportion of present perfects uttered by the subject. We could choose 100 consecutive verbs, every other verb, every fifth verb, etc. Table 7.2 shows how the standard error of the estimated proportion decreases as a bigger gap is left between the sampled verbs. (There is no simple formula available to calculate these standard errors. They have been obtained by a computer simulation of the sampling process.)

Table 7.2. Standard errors from different sampling rules (sample size 50)

Spacing between verbs    Standard error (%)
Consecutive              2.17
Every second             1.41
Every third              1.21
Every fourth             1.05
Every fifth              0.99
Every tenth              0.84
Every twentieth          0.73

7.5.3 Confidence intervals
There are two criteria involved here in the definition of 'large'. One is, as before, the question of how close to normal is the distribution of the variable being studied. The second is the amount of knowledge available about the population variance. Usually the only information about this comes from the sample itself via the value of the sample variance.

3 Of course there will be many occasions when an investigator may be interested in a sequence of consecutive utterances. We simply wish to point out that information on some variables may be collected rather inefficiently in that case.
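There is no simple formula behind standard errors like those in table 7.2 above, but the simulation idea is easy to sketch. The fragment below assumes the persistence probability of 0.9 mentioned in the text; it illustrates only the qualitative effect (spaced sampling gives a smaller standard error) and will not reproduce the table's exact figures, which depend on details of the original simulation that the text does not give.

```python
import random

random.seed(42)

STAY = 0.9    # assumed probability that a verb repeats its predecessor's type
N = 50        # verbs kept per sample
REPS = 2000   # simulated samples per sampling rule

def verb_sequence(length):
    """Simulate a sequence of verb types: 1 = present perfect, 0 = other."""
    state = random.random() < 0.5
    seq = []
    for _ in range(length):
        seq.append(1 if state else 0)
        if random.random() > STAY:   # with probability 1 - STAY, switch type
            state = not state
    return seq

def standard_error(gap):
    """Empirical sd of the estimated proportion when every gap-th verb is kept."""
    estimates = []
    for _ in range(REPS):
        kept = verb_sequence(N * gap)[::gap]
        estimates.append(sum(kept) / len(kept))
    mean = sum(estimates) / REPS
    return (sum((e - mean) ** 2 for e in estimates) / REPS) ** 0.5

print(standard_error(1), standard_error(10))  # consecutive vs every tenth verb
```

The consecutive-sampling standard error comes out several times larger than the every-tenth figure, which is the point the table is making.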
Estimating from samples    Sample size

If the variable in question has a normal distribution, a confidence interval can be obtained for any sample size using the t-tables and the methods of the previous section. For samples of more than 50 or so the t-distribution is virtually indistinguishable from the standard normal and the confidence interval of 7.2 would be appropriate. If, on the other hand, there is some doubt about the normality of the variable, then it will not be possible to calculate a reliable interval for small samples. Biological measurements of most kinds can be expected to be approximately normally distributed. So can test scores, if the test is constructed so that the score is the sum of a number of items or components more or less independent of one another. Apart from that, certain types of variable are known from repeated study to have approximately the correct shape of distribution. If you are studying a variable about which there is some doubt, you should always begin by constructing a histogram of the sample if the sample size makes that feasible; gross deviations from the normal distribution ought to show up then (see also chapter 9). With smaller samples and a relatively unknown type of variable only faith will suffice, though it must be remembered that the accuracy of any conclusions we make may be seriously affected if the variable has an unusual distribution. However, whatever the form of the data it should be safe to rely on a confidence interval based on a sample of 100 or more observations and calculated as in 7.2.

7.4  More than one level of sampling
In many cases a study will require sampling at two different levels. In the verb expansion study discussed above a sample of children was first chosen and then a sample of utterances taken from each child. How should the experimenter's effort be distributed in such cases? Is it better to take a large number of utterances from a small number of children, or vice versa? Does it matter?

There is no single or simple answer to this question. Reduction in the value of data caused by lack of independence also occurs when the data are obtained from several subjects. Consider an example. In chapter 11 we discuss the relationship between age and mean length of utterances (MLU) in young children. Suppose we wished to estimate the MLU for children aged 2;6. We might look at many utterances from a few children or a few utterances from many children - will it matter? Clearly, repeated observations of the same child are likely to be related. A child may have a tendency to make utterances which are rather shorter or rather longer than average for his peer group. It is quite easy to demonstrate that if n utterances are to be used to estimate MLU for the whole age group then the most precise answer would be obtained by sampling one utterance from each of n children of that age. However, this will also be the most expensive sampling scheme. It will inevitably be cheaper to take more utterances from fewer children. Furthermore, by reducing the number of children it will almost certainly be possible to increase the total number of utterances. We may be able to use the same resources to analyse, say, 20 utterances from each of 25 children (500 utterances in total) or 75 utterances from each of 10 children (750 utterances in total) and it may be far from obvious which option would give the best results. It depends, in part, on the question being addressed: whether interest lies principally in variations between children, in variability in the speech of individuals, or in estimating the distribution of some linguistic variable over the population as a whole.

There is no room here to give a fuller discussion of this problem whose solution, in any case, requires a fair degree of technical knowledge. Reference and text books on sampling problems tend to be difficult for the layman largely because of the considerable variety of notation and terminology they employ and the level of technical detail they include. You should consult an experienced survey statistician before collecting large quantities of observational data of this kind.

7.5  Sample size to obtain a required precision
Let us return to the example of 7.2. Suppose we decide that we want to estimate the true average VOT for the child and that we want to be 95% sure that our estimate differs from the population value by no more than one millisecond. What size of sample ought we to take?

Another way of stating this requirement is that the 95% confidence interval for the true mean should be of the form X̄ ± 1. On the other hand, when we obtained the 95% confidence interval in 7.2 it was X̄ ± (1.96(σ/√n)). Hence we require that 1.96(σ/√n) = 1. In this example we have estimated that σ = 5, so we have (1.96 × 5)/√n = 1, or:

√n = 1.96 × 5 = 9.8
n = (9.8)² = 96.04

So, to meet the tolerance that we have insisted on we should need to obtain 96 or 97 randomly sampled /d/ VOTs - we would probably round up to n = 100.

We can obtain a general formula for the sample size in exactly the same way. Suppose that, with a certain confidence, we wish to estimate a population mean with an error no greater than d. Then we have to choose the sample size n to satisfy:

1.96σ/√n = d    (corresponds to d = 1 in the last example)

or

1.96σ = d√n

or

1.96σ/d = √n

Thus the formula to choose the appropriate sample size to estimate a population mean with 95% confidence is:

n = (1.96σ/d)²

where d indicates the required precision. The value obtained from this formula will rarely be a whole number and we would choose the next largest convenient number as we did in the example.

We notice that n will be large if (a) the value of d is small - we are insisting on a high degree of accuracy; and (b) σ², the population variance, is large - we then have to overcome the inherent variability of the population. The value of d is chosen by the experimenter. However, σ is a problem. Usually its value will not be known. One way round this is to take a fairly small sample, say 20 or 30, and calculate directly the sample variance, s². We can then use s in place of σ in the formula for the appropriate sample size in the study proper.

Thus the formula to choose the appropriate sample size to estimate a population mean with 95% confidence in ignorance of the population variance is:

n = (1.96s/d)²

The problem is that our estimate of the population variance based on a rather small sample might be quite inaccurate, but this is usually the best we can do.

If we are estimating a proportion, p, a different kind of solution is available. First, let us begin by writing the simplest form for the confidence interval for p, i.e.

p̂ ± 1.96√(p̂(1 − p̂)/n)

leaving out the special correction. We then have:

d = 1.96√(p(1 − p)/n)

or

d² = 1.96²{p(1 − p)}/n

or

n = (1.96²/d²){p(1 − p)}

This formula seems to suffer from a difficulty similar to the previous one. We will not know the population value of p. Even if we decide to use p̂, the estimate of p, in the expression for the standard deviation as we did previously, we will not know the value of p̂ until after the sample is taken. However, it should not take you long to convince yourself that if p is a number between zero and 1, then p(1 − p) cannot be larger than 0.25 and this will happen when p = 0.5. We can then obtain a value of n which will never be too small by using p(1 − p) = 0.25. Thus the formula to choose a conservatively large sample size to estimate a population proportion with 95% confidence is:

n = 0.25(1.96²/d²)

where d indicates the required precision.

Suppose, for example, that we wish to estimate what percentage of the population has a given characteristic and that, with 95% confidence, we wish to get the answer correct to 1% either way. That is the same as saying that we want a 95% confidence interval for the proportion, p, of the form p̂ ± 0.01 (remember that a percentage is just 100 times a proportion); so we need:

n = 0.25(1.96²/(0.01)²) = 9604 (a large sample!)
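Both sample-size recipes are easy to check numerically. The short sketch below (Python; the function names are our own, not from the book) reproduces the worked figures: n = 96.04 for the VOT example with σ = 5 and d = 1, and n = 9604 for estimating a proportion to within 0.01.

```python
def n_for_mean(sigma, d, z=1.96):
    """Sample size n satisfying z * sigma / sqrt(n) = d, i.e. making
    a 95% confidence interval for a mean have half-width d."""
    return (z * sigma / d) ** 2

def n_for_proportion(d, z=1.96):
    """Conservative sample size for a proportion, using the fact
    that p(1 - p) can never exceed 0.25."""
    return 0.25 * (z / d) ** 2

print(n_for_mean(sigma=5, d=1))    # about 96.04; round up to 97 or 100
print(n_for_proportion(d=0.01))    # about 9604, the 'large sample!'
print(n_for_proportion(d=0.1))     # about 96
```

In practice the result is rounded up to the next convenient whole number, as the text does.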
If we require the answer only within 10% either way, then:

n = 0.25(1.96²/(0.1)²) = 96

If the true value of p is small (<0.2) or large (>0.8) then this procedure may greatly exaggerate the sample size required. If you suspect that this may be the case and sampling is expensive or difficult, then you should consult a statistician.

7.6  Different confidence levels
There is nothing sacred about the value of 95% which we have used throughout this chapter to introduce and discuss the concept of confidence intervals, though it is very commonly used. However, one experimenter may wish to quote a range of values within which he is '99% sure' that the true population mean will lie. Another may be content to be '90% confident' of including the true value. What will be the consequence of changing the confidence level in this way?

When the idea of a 95% confidence interval was introduced in 7.2, the starting point was the expression:

P(−1.96 < Z < 1.96) = 0.95

The number 1.96 was chosen from the tables of the normal distribution to fix the probability at 0.95 or 95%. This probability can be altered to any other value by choosing the appropriate number from table A3 to replace 1.96. For example, beginning from:

P(−2.5758 ≤ Z ≤ 2.5758) = 0.99

and carrying out a sequence of calculations similar to those in 7.2, we will then arrive at a 99% confidence interval, 14.88 ± (2.5758 × 0.5) or (13.59, 16.17) for the true population mean. The length of this interval is 2.58 marks (16.17 − 13.59), longer than the 95% interval. That is to be expected. In order to be more certain of including the true value we must include more values, thus lengthening the interval. On the other hand, a 90% confidence interval would be 14.88 ± (1.6449 × 0.5) or (14.06, 15.70), shorter than the 95% interval. Table 7.3 gives a range of confidence intervals all based on the same data.

Table 7.3. Confidence intervals with different confidence levels

Confidence level (%)    C        Confidence interval    Length of interval
50                      0.6745   (14.54, 15.22)         0.67
60                      0.8416   (14.46, 15.30)         0.84
70                      1.0364   (14.36, 15.40)         1.04
80                      1.2816   (14.24, 15.52)         1.28
90                      1.6449   (14.06, 15.70)         1.64
95                      1.9600   (13.90, 15.86)         1.96
99                      2.5758   (13.59, 16.17)         2.58
99.9                    3.2905   (13.23, 16.53)         3.29

(x̄ = 14.88, s = 5, n = 100)
Note: The number C is obtained from table A3 to give the required confidence level.

Is it better to choose a high or a low confidence level? The lower the confidence level the more chance there is that the stated interval does not include the true value. On the other hand, the higher the confidence level, the wider will be the interval and wide intervals are less informative than narrow ones. We can be virtually 100% confident that the true value lies between 1.36 and 28.40, but that is hardly a useful statement. Some compromise must be reached between the level of confidence we might like and the narrow interval we would find useful. If a researcher knows, before carrying out an experiment, what level of confidence he requires, he can estimate the sample size, using the methods of the previous section, to obtain the desired width of interval. It is, again, simply a matter of choosing the appropriate number, C, from table 7.3 or table A3. Thus the general formula for choosing sample size to estimate a population mean is:

n = (Cσ/d)²

where, as before, d is the required precision. In the examples worked through in 7.2 we have used C = 1.96 corresponding to a confidence level of 95%.

SUMMARY
This chapter has addressed the problem of using samples to estimate the unknown values of population parameters such as a population mean or a population proportion.
(1) Point estimators were introduced and it was suggested that the sample mean and the sample variance and the sample proportion would be reasonable point estimators for their population counterparts; all of these estimators are unbiassed and consistent.
(2) The concept of a confidence interval was explained and it was shown how to derive a 95% confidence interval for a population mean, μ. For a large sample such an interval takes the form:

X̄ ± 1.96(s/√n)
where X̄ is the sample mean, s is the sample standard deviation and n is the sample size.
(3) A 95% confidence interval for the population proportion, p, was discussed and an example calculated, using the formula:

p̂ ± {1.96√(p̂(1 − p̂)/n) + 1/(2n)}

where p̂ is the sample proportion and n the sample size.
(4) The problem of obtaining a confidence interval for μ from a small sample was discussed and it was shown how this could be done provided the sample came from a normal distribution, using the tables of the t-distribution and the formula:

X̄ ± ts/√n

where t is the 5% point from table A4 corresponding to the appropriate number of degrees of freedom.
(5) The issue of sample size was discussed with its relation to the Central Limit Theorem and the required precision of a confidence interval.
(6) It was shown how to calculate confidence intervals with different confidence levels.

EXERCISES
(1) (a) Using the data of table 3.1, calculate a 95% confidence interval for the mean length of utterance of the observed adult speaking to her child.
    (b) Calculate a 99% confidence interval.
    (c) Explain carefully what is the meaning of these intervals.
(2) Repeat exercise 7.1 using the data of table 3.3.
(3) In table 3.4 are given the numbers of tokens of words ending -ing and the number pronounced [n] by each of ten subjects.
    (a) Calculate 95% confidence intervals for subject 6 for the proportion of [n] endings in all such words the subject might utter (a) spontaneously, (b) when reading from a wordlist.
    (b) Repeat for subject 1.
    (c) Suggest reasons for the differences in the widths of the four confidence intervals.
(4) Ten undergraduate students are chosen at random in a large university and are given a language aptitude test. Their marks are:
    62, 39, 48, 72, 81, 51, 54, 59, 67, 44
    Calculate a 95% confidence interval for the mean mark that would have been obtained if all the undergraduate students of the university had taken the test.

8  Testing hypotheses about population values

8.1  Using the confidence interval to test a hypothesis
In the previous chapter the confidence interval was introduced as a device for estimating a population parameter. The interval can also be used to assess the plausibility of a hypothesised value for the parameter. Miller (1951) cites a study of the vocabulary of children in which the average number of words recognised by children aged 6-7 years in the USA was 24,000.¹ Suppose that the same test had been carried out in the same year on 140 British children of the same age and that the mean size of vocabulary recognised by that sample was 24,800 with a sample standard deviation of 4,200 words. How plausible is the hypothesis that the population from which the sample of British children was chosen had the same mean vocabulary as the American children of the same age? Admittedly the sample of British children had a higher mean vocabulary size, but many samples of American children would also have had a mean score of more than 24,000. We need to rephrase the question. The mean of a sample of British children is 24,800, not 24,000. Is it nevertheless plausible that the mean vocabulary of the British population of children in this age range could be 24,000 words and that the apparent discrepancy is simply due to sampling variation, so that a new sample will have a mean vocabulary size closer to, perhaps less than, 24,000?

Let us begin by using the data obtained on the sample to calculate a 95% confidence interval for the mean vocabulary size of the whole population from which the sample was selected. Let us denote that mean vocabulary size by μ. Following the procedure of 7.2 we obtain the interval:

X̄ ± 1.96 s/√n

24800 ± (1.96 × 4200)/√140, i.e. (24104, 25496)

¹ This value was itself based on a sample. However, for the moment we will treat it as though it were the population value, a reasonable enough procedure if the American figure was based on a much bigger sample than the British mean. It is explained in chapter 11 how to compare two sample means directly.
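The interval just quoted can be reproduced in a few lines. A minimal sketch (Python; the variable names are ours):

```python
# 95% confidence interval for the mean vocabulary size of the British
# sample described in the text: n = 140, mean 24800, s.d. 4200.
n = 140
mean = 24_800
sd = 4_200
half_width = 1.96 * sd / n ** 0.5   # 1.96 standard errors of the mean
interval = (round(mean - half_width), round(mean + half_width))
print(interval)                     # (24104, 25496), as in the text
```

Note that 24,000, the American mean, lies below the lower limit of this interval; that observation is what the hypothesis test developed in the rest of the chapter turns on.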
At this point we might remind ourselves of the meaning of a 95% confidence interval. If it has been calculated properly, there is only a 5% chance that it will not contain the value of the mean of the whole population from which the sample was chosen. So, if the mean vocabulary size for British children aged 6-7 years were in fact 24,000 words, then for 95% of randomly chosen samples, that is for 19 samples out of 20, the 95% confidence interval would be expected to include the value 24,000, while one time in 20 it would not. For the particular sample tested, it turns out that the interval does not include the value 24,000, so either the true mean is some value other than 24,000 or it really is 24,000 and the sample we have chosen happens to be one of the 5% of samples which result in a 95% confidence interval failing to include the population mean. There is no way of knowing which of these two cases has occurred. There will never be any way of knowing for certain what is the truth in such situations. However, it is intuitively reasonable to suggest that when a confidence interval (based on a sample) fails to include a certain value, then we should begin to doubt that the sample has been drawn from a population having a mean with that value. In the present case, we should begin to doubt that the sample was drawn from a population having a mean of 24,000.

On the basis of what we have said so far, it is possible to develop a formal procedure for the use of observed data in assessing the plausibility of a hypothesis. In particular, let us suppose that the hypothesis we wished to test was that the average size of vocabulary of British children of 6-7 years old was the same as that of their American counterparts. We could decide to make use of a 95% confidence interval based on a sample of British children.

For the moment let us imagine that we do not yet know what that confidence interval is. There are two possible conclusions we could reach on the basis of the confidence interval, depending on whether or not it included 24,000, the mean vocabulary of the American children. If the interval does include 24,000 we could conclude that the data were consistent with the hypothesis; if the interval does not include 24,000 we would doubt the plausibility of the hypothesis. A convention has been established by which, in the latter case, we would say that we reject the hypothesis as false, while if the interval contains the hypothesised value we simply say that we have no grounds for rejecting it. (This convention and its dangers are discussed further in 8.4, but let us adopt it uncritically for the time being.) Since the hypothesis must be either true or false, there are four possible outcomes, which are displayed in table 8.1. As can be seen, two of these outcomes will lead to correct assessment of the situation, while the other two cause a mistaken conclusion.

Table 8.1. Possible outcomes from a test of hypothesis

                                       True state
Result of sample                       Population mean is 24000   Population mean is not 24000
Interval includes the value 24000      Correct                    Error 2
Interval does not include 24000        Error 1                    Correct

The first type of error, referred to as a type 1 error, is due to rejecting the value 24,000 when it is the correct value for the population mean. The probability of this type of error is exactly the 5% chance that the true population mean (24,000) accidentally falls outside the interval defined by the particular sample we have chosen.

The second type of error, known as a type 2 error, occurs when, although the population mean is no longer 24,000, that value is still included in the confidence interval. In general, the probability of making this type of error will not be known. It will depend on the true value, μ, of the population mean, the sample size, n, and the population variance, σ². Sometimes it is possible to calculate it, at least approximately, for different possible values of the population mean. That is true here and some values are given in table 8.2.

Table 8.2. Probabilities of type 2 error using a 95% confidence interval to test whether μ = 24000

                Sample size
True mean       n = 140       n = 500       n = 1000
23000           0.19          very small    very small
23500           0.71          0.24          0.04
23600           0.80          0.39          0.15
23700           0.87          0.64          0.38
23800           0.92          0.82          0.67
23900           0.95          0.92          0.89
24100           0.95          0.92          0.89
24200           0.92          0.82          0.67
24300           0.87          0.64          0.38
24400           0.80          0.39          0.15
24500           0.71          0.24          0.04
25000           0.19          very small    very small

(In all cases s = 4200)

These values have been calculated assuming that the standard deviation, s, of the sample of vocabulary sizes
is the correct standard deviation for the population of all British children in the relevant age group. To that extent they are approximate. When the sample size is 140 it can be seen, for example, that if the true population mean is 23,000 there is a probability of about 19% that the hypothesis that μ = 24,000 would still be found acceptable, while if the mean really is 24,500 the probability of making this error is more than 70%. Table 8.2 also demonstrates that the probability of making the second kind of error depends greatly on the size of sample used to test the hypothesis.

Of course, we could decide to use a confidence interval with a different confidence level to assess the plausibility of the hypothesis that the British population mean score was μ = 24,000. In particular, we might decide to reject that hypothesis only if the value μ = 24,000 was not included inside the 99% confidence interval. In that case, the probability of making a type 1 error would reduce to 1% since there is now a 99% probability that the true value will be included by the confidence interval, so that if the true population mean is 24,000 there is only a 1% chance that it will be excluded from the interval. On the other hand, the probability of making a type 2 error will now be increased. A 99% confidence interval will always be wider than a 95% confidence interval based on the same sample of data and therefore it will have more chance of including the value 24,000 even when the population mean has some other value. Table 8.3 gives the probability of making this second kind of error for different true values of the population mean, μ, and for three sample sizes.

Table 8.3. Probabilities of type 2 error using a 99% confidence interval to test whether μ = 24000

                Sample size
True mean       n = 140       n = 500       n = 1000
23000           0.31          0.001         very small
23500           0.82          0.37          0.08
23600           0.88          0.56          0.25
23700           0.93          0.77          0.53
23800           0.96          0.90          0.79
23900           0.98          0.96          0.94
24100           0.98          0.96          0.94
24200           0.96          0.90          0.79
24300           0.93          0.77          0.53
24400           0.88          0.56          0.25
24500           0.82          0.37          0.08
25000           0.31          0.001         very small

(In all cases s = 4200)

When n = 140 and the true population mean is 23,000, the probability of the second kind of error is now about 31%; when a 95% confidence interval is used this error would occur with a probability of only 19%. We seem to have reached an impasse. Any attempt to protect ourselves against one type of error will increase the chance of making the other. The conventional way to solve the dilemma is to give more importance to the type 1 error. The argument goes something like this.

The onus is on the investigator to show that some expected or natural hypothesis is untrue. Evidence for this should be accepted only if it is reasonably strong. Let us consider our vocabulary size example in this light. The population mean score over a large number of American children tested is 24,000. If this vocabulary test had never been carried out on British children before we could start out from the point that, in the absence of special circumstances, the mean vocabulary size of the latter population ought to be about the same as that of the former. If we decide to use the 95% confidence interval obtained from the sample of 140 test scores to test this hypothesis, we would conclude that it was false. The probability that the conclusion is in error (a type 1 error) would be only 5%, or 1 in 20. It would seem that, on balance, the evidence suggests there is something about the education or linguistic environment of the British children which promotes earlier assimilation of vocabulary.

In some cases, the rejection of a hypothesis might lead to some costly action being taken, a change in educational procedures or extra help to some apparently disadvantaged section of the community. In such cases we might feel that a 1 in 20 chance of needlessly spending resources as the result of an incorrect conclusion is too high a probability of error. We could base our decision on a 99% confidence interval since then a conclusion, based on a sample, that a particular subpopulation had a different or unusual mean value, would have a probability of only 1 in 100 of being incorrect. If the hypothesis testing procedure is to be used in this fashion (but see 8.4) we will wish to fix our attention on the probability of wrongly rejecting a hypothesis, and it would be useful to formulate the procedure in such a way that the answer to the following question can be obtained easily: 'If I reject a certain hypothesis on the basis of data obtained from observing a random sample, what is the probability that my rejection of the hypothesis is in error?'

8.2  The concept of a test statistic
Let us recap briefly the procedure presented in 7.2 for calculating the confidence intervals we have been discussing above. We obtain a random sample of 140 vocabulary scores and calculate X̄, the sample
mean, and s, the sample standard deviation. Provided the sample size is large enough - as it is here - the confidence interval then takes the form:

X̄ ± Z(s/√n)

where the value Z is chosen from tables of the standard normal distribution to give the required level of confidence. By an algebraic manipulation of this expression it can be shown that it is not strictly necessary to calculate the confidence interval in order to test a hypothesis. The algebraic argument follows below, in italics, for interested readers.

Suppose we wish to test the hypothesis that the population mean has the value μ. We will reject this if μ is not contained by the confidence interval. Now, μ will be contained inside the interval provided:

X̄ − Z(s/√n) < μ < X̄ + Z(s/√n)

The inequality 'A':

X̄ − Z(s/√n) < μ

is the same as:

X̄ < μ + Z(s/√n)

is the same as:

X̄ − μ < Z(s/√n)

is the same as:

(X̄ − μ)/(s/√n) < Z

Similarly, the inequality 'B':

μ < X̄ + Z(s/√n)

is the same as:

(μ − X̄)/(s/√n) < Z

Thus μ, the postulated value of the population mean, will be included by the confidence interval only if both the inequalities A and B are true. Taken together, these inequalities can be put into words, as follows. Find the absolute difference between X̄ and μ. (That is, take X̄ − μ if X̄ is greater, μ − X̄ if μ is greater.) Divide that difference by s/√n, the standard error of X̄. If the answer is less than the value of Z needed to calculate the confidence interval then the interval will include μ, otherwise it will not.

For our example, X̄ is larger than μ, and s = 4,200:

(X̄ − μ)/(s/√n) = (24800 − 24000)/(4200/√140) = 800/355 = 2.25

Now, to construct a 95% confidence interval, we see from table A3 that we would need to use Z = 1.96; for 99%, Z = 2.58. This tells us that 24,000 will be included in the 99% but not the 95% interval, since the Z-value corresponding to the sample is less than 2.58 but greater than 1.96. If we reject the value 24,000 as incorrect, the probability that the rejection is an error (type 1) is less than 5% (since the value falls outside the 95% confidence interval) but greater than 1% (since it does not fall outside the 99% confidence interval). Conventionally, we say that the postulated value, μ = 24,000, may be rejected at the 5% significance level but not at the 1% significance level. A common notation used to express this is 0.01 < P < 0.05, where 'P' is understood to be the probability of making a type 1 error, i.e. P is the significance level.

'At the 5% significance level' is just another way of saying that the postulated mean is not contained by the 95% confidence interval based on the sample. The 'significance level' is then just the probability that this exclusion is due to a sampling accident and not to the failure of the hypothesis. The value:

z = (X̄ − μ)/(s/√n)

is used as the criterion to assess the degree to which the sample supports, or fails to support, the hypothesis that μ is the mean of the population from which the sample has been selected. Such a value used in such a way is known as a test statistic. Every testable statistical hypothesis is judged on the basis of some test statistic derived from a sample.

Every test statistic will be a random variable because its value depends on the results of a random sampling procedure. If we were to repeat our test procedure a number of times, drawing a different sample each time,
'1 'estmg hypotheses about populatwn values 1f<e classical hypothesis test
we would obtain a. random sample of. values .of the test statistic and we have a lower mean vdcabulary size as n\easured by the test used (possibly
could plot its histogram as we did for other variables in chapter 2, A because the test, devised in the USA, has a cultural bias); (ii) We believe
test statistic is always .chosen in such a way that a mathematical model that British children will have a higher mean vocabulary size than American
for its hi~to!\pm is k11own so lo12g ai .the hypothesis being tested is true; children,'of"the sam.e'age (possibly because they start school at an earlier
lll- this ca~?.Jhe distribution of our test statistic, Z, whenever JL is, in age); (iii) We are simply checking whether there is any difference in the
faete.the mean of the population from which the sample is taken, would mean vocabulary size and have no prior hypothesis about the direction
be the standard normal distribution, It follows that the value Z = 1.96 of any difference which might exist.
woutd be exceeded in only 2,5% of samples when we have postulated A full statement of the problem will include a definition of the null
the correct value of JL, Similarly, only 2.5% of random samples would hypothesis (H 0 ) and the alternative hypothesis (H 1), In the present case
give a value less than - L96 when our hypothesis is true. Clearly we are the null hypothesis is H 0 : JL = 24,000 and we have to choose an alternative
again using the 95% central confidence interval and claiming that values from:
outside this interval are in some sense .. unusual. or. interesting. since they
(i) HI :'1'<2400~
should result only from I random sample in 20.
(ii) I-ll: 1'. > 24000
You should now be ready to understand a complete and formal definition
of the classical procedure for a statistical test of hypothesis.

8.3 The classical hypothesis test and an example
A hypothesis is stated about some random variable, for example, that British children of a certain age have a mean vocabulary of 24,000 words. The hypothesis which is taken as a starting point, and which often it is hoped will be refuted, is commonly called the null hypothesis and designated H0. We might write: let the variable X with mean μ be the vocabulary size of British children aged 6-7 years. Then we wish to test H0: μ = 24,000. The most important requirement to enable the test to take place is the existence of a suitable test statistic whose distribution is known when H0 is true. In this case there is one, provided we take a large sample of children:

    Z = (X̄ - μ)/(s/√n)

where μ is the hypothesised mean and s is the sample standard deviation. For the example we are discussing here:

    Z = (X̄ - 24,000)/(s/√n)

Next, we must be able to say what the value of our test statistic would have to be in order for us to reject the null hypothesis. This depends firstly on what alternative hypothesis we have in mind, i.e. what we suspect may in fact be the case if the null hypothesis is untrue. There are three obvious possibilities: (i) we believe that British children will have a mean vocabulary below 24,000 words (H1: μ < 24,000); (ii) we believe that it will be above 24,000 words (H1: μ > 24,000); (iii) we believe only that it will differ from 24,000 words, without specifying a direction:

    H1: μ ≠ 24,000

Despite its vagueness, the form of H1 will have a definite bearing on the outcome of a statistical hypothesis test. Consider again the test statistic:

    Z = (X̄ - 24,000)/(s/√n)

If (i) is true, the population mean is less than 24,000, so that samples from the population will generally have a sample mean less than 24,000. In that case Z will have a negative value since X̄ - 24,000 will be less than zero. So large, negative values of Z will tend to support H1 rather than H0; and positive values, however large, cannot be cited as support for the hypothesised alternative. If (ii) is true the argument will be completely reversed, so that large, positive values of Z should lead to the rejection of H0 in favour of H1. If (iii) is true the value of Z will tend to be either negative or positive depending on the actual value of μ. All we can say in this case is that any value of Z which is sufficiently different from zero, irrespective of sign, will be support for H1. Up to this point in our discussion we have tacitly been considering possibility (iii) and have been prepared to reject H0 if the value of Z is extreme in either direction.

Next, we require tables which will indicate those values of the test statistic which seem to give support to H1 over H0. These values are usually referred to as the critical values of the test statistic. The tables ought to be in a form which enables us to state the significance level of the test. This is simply another name for the probability that we will make a mistake if we decide to reject H0 in favour of H1 on the evidence of
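The arithmetic of the large-sample Z statistic is easy to check by machine. The sketch below uses purely hypothetical figures (a sample of 140 children with mean vocabulary 23,000 and standard deviation 4,000; none of these numbers come from the text) simply to show the computation.

```python
import math

def z_statistic(sample_mean, hypothesised_mean, sample_sd, n):
    """Large-sample Z statistic: Z = (X-bar - mu) / (s / sqrt(n))."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - hypothesised_mean) / standard_error

# Hypothetical figures for illustration only (not from the text):
# a sample of 140 children with mean 23,000 and s = 4,000.
z = z_statistic(23_000, 24_000, 4_000, 140)
print(round(z, 2))  # a large negative Z, which would support H1: mu < 24,000
```

A large negative value such as this would count against H0 under alternatives (i) and (iii), but not under (ii).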
Testing hypotheses about population values / The classical hypothesis test

our random sample of values of the variable we have observed, i.e. the probability of making a type 1 error.

Care is required at this point. The correct way to calculate the significance level of a test depends on the particular form of H1 which is considered relevant for the test and on the way in which the tables are presented. Some tables are presented in a form suitable for what is often called a two-tailed test: our tables A3 and A4 are of that kind. This type of test is so called because values sufficiently far out in either tail of the histogram of the test statistic will be taken as support for H1. A two-tailed test is the appropriate one to use when the alternative hypothesis does not predict the direction of the difference. In the present case, therefore, a two-tailed test will be called for if our alternative hypothesis is μ ≠ 24,000.

A one-tailed test, by contrast, is so called because values sufficiently far out in only one tail of the histogram of the test statistic will be taken as support for H1. A one-tailed test is appropriate when the direction of the difference is specified in the alternative hypothesis. Thus both the following:

    H1: μ < 24,000
    H1: μ > 24,000

require a one-tailed test.

As noted above, our treatment of the vocabulary size problem has assumed the alternative hypothesis to be μ ≠ 24,000. It follows from this that a two-tailed test has been appropriate. What we have done informally is indeed to carry out a two-tailed test, making use of the significance levels provided in table A3. We will now use another example to demonstrate the procedure formally, step by step, this time with an alternative hypothesis that specifies the direction of the difference.

Let us suppose that there is a proficiency test for students of French as a second language in which students educated to British A level standard are expected to score a mean of 80 marks. In a certain year teaching activities at some schools are disrupted by selective strikes. Ten students are chosen at random from those schools and are administered the test just before the time when they are due to sit the A level examination. (In practice, a much bigger sample would normally be chosen if there were real grounds for believing that the students' performance had been affected by the strikes, but we wish to demonstrate here how a small sample could be analysed.) The scores of the ten students were:

    62 71 75 56 80 87 62 96 57 69

The mean of this sample is 71.5, with a standard deviation of 13.18. Do you think their performance has been affected by the interruption of their studies? One way to answer this question is to test whether the students seem to have achieved results as good as, or worse than, those achieved by the body of students who have taken the same proficiency test in previous years at the same point in their French language education.²

The null hypothesis for this test will be that the students tested come from a population whose mean score is 80, the historical mean for students who have had the usual preparation for the A level examination. We would not expect the disruption of teaching to improve students' performance on the whole, so that a one-sided alternative will be appropriate.

We will therefore test the hypothesis that the mean test score of the population from which these mean scores are drawn is 80, against the alternative that the population mean score is less than 80. In other words, we wish to test:

    H0: μ = 80 versus H1: μ < 80

How are we to carry out this test? It looks similar to the test of hypothesis on mean vocabulary size that we carried out in 8.2 - apart from the change in the alternative hypothesis - but there is one very important difference. Here the sample size is too small for the Central Limit Theorem to be invoked safely. In the vocabulary example we did have a large sample, and this is what enabled us to be sure that the test statistic, Z, had a standard normal distribution. In small samples that will not be true.

In the last chapter we have already addressed the problem of small sample size in the context of the determination of a confidence interval. The solution which was suggested there carries over to the current problem. First, for small samples, we have to be willing to make an assumption about the distribution of the variable being sampled, namely that it has a normal distribution. Here, for example, in order to make any further progress we must assume that the proficiency test scores would be more or less normally distributed over the whole population of students with disrupted schooling. Provided that assumption is true, then the quantity:

    (X̄ - μ)/(s/√n)

will still make a suitable test statistic, since its distribution is known and

² A better way would be to compare them directly with students who will take the A level in the same year and whose studies were not disrupted. How to do that is explained in chapter 11.
can be tabulated. It is no longer standard normal; rather it has the t-distribution introduced in the previous chapter. For the present example the statistic:

    t = (X̄ - μ)/(s/√n)

will have a t-distribution with 9 (= 10 - 1) degrees of freedom (df). Hypothesis tests are often referred to by the name of the test statistic they use. In this case we might say that we are 'carrying out a t-test' (cf. F-test and chi-squared test in later chapters).

Hence, if the population mean score really is 80 then for any sample of ten scores:

    t = (X̄ - 80)/(s/√10)

will be a random value from the t-distribution with 9 df. If μ < 80, we would expect the test statistic to have a negative value (since then X̄ will usually be less than 80), and it will no longer have a t-distribution (since the incorrect value of μ will have been used in its calculation). In other words, if the alternative hypothesis H1: μ < 80 is true, then we would expect a value of t which is negative and far out in the tail of the histogram of the t-distribution.

In the test score example we have:

    n = 10, X̄ = 71.5, s = 13.18

so that the value of the test statistic is:

    t = (71.5 - 80)/(13.18/√10) = -2.04

The value of t is negative. If it had not been so, there could be no question of the data supporting the alternative hypothesis against the null, since a sample mean greater than 80 cannot be claimed as evidence in favour of a population mean less than 80! The question is whether it is a value extreme enough to indicate that the hypothesis μ = 80 is implausible. Figure 8.1 shows the histogram of the t-distribution with 9 df, with the histogram of the standard normal superimposed. They are very similar. Both are symmetric about the value zero but the t-distribution is flatter and spreads more into the tails, reflecting the extra uncertainty caused by the small sample size. If the null hypothesis is true, then μ = 80 and most samples will have a mean of around 80, so that t = (X̄ - μ)/(s/√n) will have a value around zero.

[Figure 8.1. Comparison of the histogram of the t-distribution with 9 df with that of the standard normal distribution.]

However, even if H0 is true some samples will correspond to a value of t in the left-hand tail (and so look as if they supported the alternative H1: μ < 80). A table of t-distributions, such as table A4, gives values of t which are somewhat unlikely if H0 is true. These values are often called percentage points of the t-distribution since they are the values which will be exceeded in only such and such a per cent of samples if H0 is true. However, the tables are set up to give the percentage points appropriate for a test which involves a two-tailed alternative hypothesis, when an extreme value of t, whether positive or negative, could be evidence in favour of the alternative rather than the null hypothesis. For example, for the t-distribution with 9 df the value given as the 10% point is 1.83, and figure 8.2 demonstrates the meaning of the tabulated value. When the null hypothesis is true the value of t will lie in one of the tails shaded in the figure (i.e. t > 1.83 or t < -1.83) for a total of 10% of samples. For half of those samples (5%) the value of t will lie in the left-hand tail, for the other half in the right. Only values in the left-hand tail can support the alternative hypothesis that μ < 80. If we decide to reject the null hypothesis in favour of the alternative whenever the t-value falls in the left-hand tail in figure 8.2, i.e. whenever t < -1.83, we would reach this conclusion mistakenly in only 5% of samples when the null hypothesis is actually true. In the present example, t = -2.04 (< -1.83). Conventionally, we could say that 'at the 5%

[Figure 8.2. The two-tailed 10% point of the t-distribution with 9 df.]
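The whole worked example can be checked with a few lines of plain Python (standard library only; this is our illustration, not part of the text's procedure), reproducing the sample mean, the sample standard deviation and the t statistic from the ten scores given above.

```python
import math

scores = [62, 71, 75, 56, 80, 87, 62, 96, 57, 69]

n = len(scores)
mean = sum(scores) / n
# Sample variance uses n - 1 in the denominator, as in chapter 3.
variance = sum((x - mean) ** 2 for x in scores) / (n - 1)
s = math.sqrt(variance)

# t = (X-bar - mu) / (s / sqrt(n)) with the hypothesised mean mu = 80.
t = (mean - 80) / (s / math.sqrt(n))

print(mean)          # 71.5
print(round(s, 2))   # 13.18
print(round(t, 2))   # -2.04, beyond the one-tailed 5% cut-off of -1.83
```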
significance level we can reject H0: μ = 80 in favour of the alternative H1: μ < 80'. This statement is frequently shortened to 'the value of t is significant at 5%'. The shorter version is acceptable provided the null and alternative hypotheses are stated elsewhere.

It is worth repeating again that the percentage points in the table are relevant for a two-tailed test. The tables say that 1.83 is the 10% value. Since the alternative used in the present example is one-sided only, one of the tails is eliminated from consideration and the use of the 'cut-off' or critical value t = -1.83 will lead to only a 5% probability of a type 1 error. Notice that if the two-sided alternative H1: μ ≠ 80 had been relevant here, then, at the 5% significance level, only values of t greater than 2.26 or less than -2.26 would have been significant, so that the value obtained here, t = -2.04, would no longer be significant at the 5% level! You should find this entirely logical, perhaps after some thought. The significance level is just the probability that the null hypothesis will be (mistakenly) rejected when it is correct. The value that will cut off the 95% of 'acceptable' t-values will be different depending on whether those values are distributed between both tails or are confined to just one. Similar arguments apply to the percentage points corresponding to any other significance level. The 2% point of t with 9 df is given as 2.76 in the tables. But this is, as always, for a two-sided alternative. For a one-sided alternative 2.76 (for H1: μ > 80) or -2.76 (for H1: μ < 80) is the 1% point.

Note in passing that table A4 does not give all possible degrees of freedom. For example, after 15 df the next tabulated value is 20 df. What happens if you need 18 df, i.e. the sample size is n = 19? You will notice that the critical value of t at any significance level decreases as the df increase; in other words, the bigger the sample the smaller are the t-values which are found to be significant. (This reflects the extra confidence we have in s² as a measure of population variance in bigger samples.) Notice further that the critical values for 20 df are very similar to those for 15 df. For any number of degrees of freedom between 15 and 20, either of those two rows gives very close approximations to the correct answer. It is conventional (and conservative) to use the row corresponding to the nearest number of degrees of freedom smaller than those which are required, when the latter are not tabulated.

A final point to notice about the t-tables is that as the degrees of freedom increase the values become more and more similar to those of the standard normal distribution. The values in the last row of table A4 are exactly those of the standard normal, i.e. the values in the last row of table A4 correspond to the second column of table A3.

8.4 How to use statistical tests of hypotheses: is significance significant?
Every statistical test of hypothesis has a similar logic, whatever the hypotheses being tested. There will be two hypotheses, one of which, the null, must be precise (e.g. μ = 80) while the other may be more or less vague (e.g. μ < 80). There must be a test statistic whose distribution is known when the null is true. Percentage points of that distribution can then be calculated and tabulated. Sometimes the tables will be appropriate to two-sided and sometimes to one-sided alternatives - the rubric will make it clear which is the case. The major constraint on the use of significance tests is that it is generally difficult to discover test statistics with known properties. Such statistics are available for only a few, standard null hypotheses. It frequently happens that a researcher wishes to address a question which is not easily formulated in terms of one of those hypotheses, and it would be mistaken to try to force all investigations into this framework. The value of statistical hypothesis testing as a scientific tool has been greatly exaggerated.

In chapter 7 it was argued that a confidence interval gives a useful summary of a data set. We hope it is clear from the development of the hypothesis test from a confidence interval in 8.2 that a test statistic and the result of a test of hypothesis are simply another way of summarising a set of data. It can give a useful and succinct summary, but it is no more than that. Important decisions should not be taken simply on the basis of a statistical hypothesis test. It is a misguided strategy to abandon an otherwise attractive line of research because a statistically significant result is not obtained as the result of a single experiment, or to believe that an unexpected rejection of a null hypothesis means, by itself, that an important scientific discovery has been made. A hypothesis test simply gives an indication of the strength of evidence (in a single experiment) for or against a working hypothesis. The emphasis is always on the type 1 error, the error which arises when we incorrectly reject a true null hypothesis on the basis of this one statistical experiment. The possibility of type 2 error tends to be forgotten. If we do not find a highly significant result, this does not mean that the null is correct. Sampling error, small sample size or the natural variability of the population under study may prevent us from detecting a substantial failure in the null hypothesis. Furthermore, although we control the probability of type 1 error by demanding that it be small, say 5% or 1% or less, we will not usually know how large is the probability of making a type 2 error. Except when the sample size is very large, the probability of a type 2 error is often rather higher than
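The meaning of a percentage point can also be seen by simulation. The sketch below (our illustration, not part of the text) draws many samples of size 10 from a normal population in which the null hypothesis is true, computes t for each, and estimates the two-tailed 10% point of the t-distribution with 9 df; the estimate should land close to the tabulated value of 1.83.

```python
import math
import random

random.seed(0)

def t_statistic(sample, mu):
    n = len(sample)
    mean = sum(sample) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return (mean - mu) / (s / math.sqrt(n))

# Samples of size 10 from a normal population in which H0 is true
# (mu = 80; the population spread does not matter, here sd = 13).
abs_t = sorted(
    abs(t_statistic([random.gauss(80, 13) for _ in range(10)], 80))
    for _ in range(100_000)
)

# The two-tailed 10% point is exceeded in absolute value by 10% of samples.
point = abs_t[int(0.9 * len(abs_t))]
print(round(point, 2))  # close to the tabulated value 1.83
```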
the probability of a type 1. Hence, there is often a high chance of missing important scientific effects if we rely solely on statistical tests made on small samples to detect them for us.

Hypothesis testing is set up as a conservative procedure, asking for a fairly small probability of a type 1 error before we make a serious claim that the null hypothesis has been refuted. The procedure is designed to operate from the point of view of a single researcher or research team assessing the result of a single experiment. If the experiment is repeated several times, whether by the same or by different investigators, the results need to be considered in a different light. For example, suppose the editors of a scientific journal decide, misguidedly, to publish those papers which contain a statistically significant result and reject all others. They might decide that a significance level of 5% would be required. Now let us suppose that 25 researchers have independently carried out experiments to test a null hypothesis which is, in fact, true. For any one of these individuals it is correct that there is only a 5% chance he will erroneously reject the null hypothesis. However, there is a chance of greater than 72% that at least one of the 25 researchers will find an incorrect, but statistically significant, result. If there were 100 researchers, this chance rises to more than 99%! The chance of a research report appearing in the journal would then depend largely on the popularity of the research topic, but there would be no way of assessing how likely to be true are the results in the published articles.

Another point to remember is that no null hypothesis will be exactly true. We tested the hypothesis (in 8.3) that μ = 80; we would probably not wish it to be rejected even if the true value were not 80 but 79.999, since this would not indicate any interesting difference in mean scores between the subpopulation and the wider population. On the other hand, if a large enough sample is used such small discrepancies will cause rejection of the null hypothesis. Look back to the example in 8.3. The test statistic was:

    t = (X̄ - μ)/(s/√n)

This can be seen to be a quotient of two numbers. The numerator is X̄ - μ; the divisor or denominator is s/√n. The test statistic will have a large value if either the numerator is large or the denominator is small. We can make the denominator as small as we please by increasing the value of n. For very large values of n, the test statistic can be significant even if the difference between X̄ and μ is trivially small. The explicit calculation of a suitable confidence interval would show immediately whether any differences from the hypothesised value were negligible, although the test had rejected the null hypothesis. If a statistical test indicates that some null hypothesis is to be rejected, we should always attempt to estimate more likely parameter values to replace those in the rejected null hypothesis. We must keep in mind always the difference between statistical and scientific significance, and we should remember that the latter will frequently have to be assessed further in the light of economic considerations.

Let us consider an example. Suppose that a new method of treatment has been suggested to alleviate a dysphasia. An investigation is carried out whereby an experimental group of n patients is treated for some months by the new method while a matched control group of the same size is treated over the same period by a standard method. We could then test, using one of the tests to be introduced in chapter 11, the null hypothesis that the degree of improvement was the same under both treatment methods against the alternative that the new method caused more improvement. Let us consider the possible outcomes of such a test.

8.4.1 The value of the test statistic is significant at the 1% level
What does this tell us? In itself, very little. We have just pointed out that we never expect any null hypothesis to be exactly true. The significant value of the test statistic means that our experiment has been able to discover that. The question is, has the significant value come about because, on average, there is a large benefit from the new method or because, perhaps, a very large sample of subjects was used? If it is the former then it is still possible that this is a sampling phenomenon, that the accidental allocation of patients to the two groups has placed in the experimental group a majority of patients who would have made most improvement under the old method. However, we know, from the significance level, that there is only a one in a hundred chance that the new treatment is in no way better than the standard.

Does this then mean that the new treatment should be introduced? Not at all. We must now ask about the relative costs of the two methods. If the new method costs much more than the standard method to administer, then it can only be introduced if it causes improvement so much more rapidly that at least the same number of patients annually can be helped to the same level of improvement as under the standard method. It will be necessary to test the new method with a large number and variety of patients before sufficient information can be obtained to assess this properly. Although a small sample might show that there is a statistically
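The 72% and 99% figures quoted above follow from the arithmetic of independent tests: if each of k researchers carries out a 5% test of a true null hypothesis, the chance that at least one of them rejects it is 1 - 0.95^k. A quick check:

```python
def chance_of_at_least_one_rejection(k, alpha=0.05):
    """Probability that at least one of k independent tests, each with
    significance level alpha, rejects a null hypothesis that is true."""
    return 1 - (1 - alpha) ** k

print(round(chance_of_at_least_one_rejection(25), 3))   # 0.723, i.e. more than 72%
print(round(chance_of_at_least_one_rejection(100), 3))  # 0.994, i.e. more than 99%
```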
significant difference of an apparently interesting magnitude, more information will always be necessary to assess the economic implications of changes of this type.

8.4.2 The value of the test statistic is not significant
In itself an 'uninteresting' value of the test statistic should not be the end of the story. Never forget the possibility of type 2 errors, especially if the sample size is very small. At least you should always look at the difference in performance of the two samples and ask 'Would a difference of this magnitude be important if it were genuine?' If the answer is affirmative, then it is worth considering a repetition of the experiment with a larger sample size. You should also take a careful look at some of the details of the data. For example, it could happen that many patients do not improve much under either method but, of those who do, the improvement might be more marked under the new treatment. The average gain of the new method would then be quite small because of the inertia of the 'non-improvers', and the variability in both samples would be increased because they are really mixtures of two types of patient. Both of those conditions would increase the probability of a type 2 error.

In the light of the above comments it should be clear that to report the results of a study by saying that something was significant at the 1% level or was not significant at the 5% level is unsatisfactory. It makes much more sense to discuss the details of the data in a manner which throws as much light as possible on the problem which you intended to tackle. A formal test of hypothesis then indicates the extent to which your conclusions may have been distorted by sampling variability. The occurrence of a significant value of a test statistic should be considered as neither necessary nor sufficient as a 'seal of approval' for the statistical validity of the conclusion. The general rule introduced in chapter 3 still holds good. Any summary of a set of experimental data may be revealing and helpful. This is equally true whether the summary takes the form of a table of means and variances, a graph or the result of a hypothesis test. In all cases, as much as possible of the original data should always be given to enable readers of an article to assess how adequate the summary is and to enable them to carry out a new analysis if they wish.

SUMMARY
This chapter introduces the concepts and philosophy of statistical hypothesis testing.
(1) A confidence interval can be used to test the hypothesis that a sample mean takes a particular value; type 1 errors and type 2 errors were defined.
(2) The concept of a test statistic was used to link confidence intervals with hypothesis tests.
(3) The classical hypothesis test was introduced as the test of a null hypothesis (H0) against a specific alternative hypothesis (H1) using as a criterion the value of a test statistic whose distribution is known provided H0 is true. The sample value of the test statistic is compared to a table of critical values to obtain the significance level (probability of a type 1 error) of the test. The meaning of, and need for, one-tailed and two-tailed tests was explained. To carry out a test of the null hypothesis, H0: μ = specified value, against any of the three common alternatives the relevant statistic is (X̄ - μ)/(s/√n). For small samples its value is compared with those of the t-distribution with (n - 1) degrees of freedom; for large samples it is compared with the critical values of the standard normal distribution.

EXERCISES
(1) A sample of 184 children take an articulation test. Their mean score is 48.8 with standard deviation 12.4. Show that these results are consistent with the null hypothesis that the population mean is μ = 50 against the alternative that μ ≠ 50.
(2) All the exercises at the end of chapter 7 require the calculation of confidence intervals. Take just one of those confidence intervals and reconsider its meaning in the light of the present chapter. In particular, formulate two different null hypotheses, one of which would be found plausible and the other of which would be rejected as a result of the interval. In both cases state, very precisely, the alternative hypothesis.
(3) An experimenter wants to test whether the mean test score, μ, of a population of subjects has a certain value. In particular he decides to test H0: μ = 80 versus H1: μ > 80. He obtains scores on a random sample of subjects and calculates the sample mean and standard deviation as X̄ = 84.2 and s = 14.6. He omits to report the sample size, n. Show that if n = 16, H0 would not be rejected at the 5% level, but that H0 would be rejected if n = 250.
(4) In the last example, find the smallest sample size, n, which would lead to the rejection of H0:
    (i) at the 5% significance level
    (ii) at the 1% level
(5) If X̄ = 80.1 and s = 14.6, show that H0: μ = 80 could still be rejected in favour of H1: μ > 80 at any level of significance.
(6) Discuss the implications of exercises 2, 3 and 4, above.
9 Testing the fit of models to data

9.1 Testing how well a complete model fits the data
In the previous chapter we learned how to test hypotheses concerning the value of important quantities associated with a population. What we tested was whether a particular model of the chosen type could be supported or should be rejected on the basis of data observed in a random sample. There are times, however, when we might have doubts about the very form of the model, when, for instance, we are uncertain whether it is appropriate for the population in which we are interested to be modelled as a normally distributed population.

Imagine the case where a School District in the USA wishes to identify those children beginning school who should be provided with speech therapy. Rather than involve themselves in the lengthy and expensive business of constructing a new articulation test, they plan to use a test which is already available. One test which seems on the surface to be suitable for this purpose is British. It has been validated and standardised in Glasgow in such a way that for the whole population of 5-year-old children in Glasgow the scores on the test are normally distributed, with a mean score of 50 and standard deviation of 10. In order to discover whether scores on the test will have similar properties when used with 5-year-old children in its own area, the US School District administers it to a random sample of 184 of these children. The results of this are presented as a frequency table in the first two columns of table 9.1. The mean score of the US children is 48.8 and the standard deviation is 12.4. Using the methods of the previous chapter, it can readily be shown (see exercise 8.1) that the mean of this sample is indeed consistent with a population mean of 50. But this does not tell us whether the complete population of test scores (i.e. those which would be made by all 5-year-old children in the area) could be modelled adequately as a normally distributed population. This question is important to the School District. As we saw in chapter 6, if a population is normally distributed, the proportion of the population

Table 9.1. Frequency table of 184 children's test scores

    Class intervals      Observed number   Standardised class   Expected proportion   Expected number
    of scores            of scores in      intervals            of scores in          of scores in
    (from - less than)   each class        Z = (X - 50)/10      each class            each class
    0 - 30                 2               below -2.0           0.023                  4.2
    30 - 35               11               -2.0 to -1.5         0.044                  8.1
    35 - 40               17               -1.5 to -1.0         0.092                 16.9
    40 - 45               31               -1.0 to -0.5         0.150                 27.6
    45 - 50               32               -0.5 to 0.0          0.191                 35.1
    50 - 55               39               0.0 to 0.5           0.191                 35.1
    55 - 60               22               0.5 to 1.0           0.150                 27.6
    60 - 65               19               1.0 to 1.5           0.092                 16.9
    65 - 70                6               1.5 to 2.0           0.044                  8.1
    70 - 75                4               2.0 to 2.5           0.017                  3.1
    75 or greater          1               above 2.5            0.006                  1.1

lying between any two points is known. If, therefore, the population of scores of the US children on the test is normally distributed, it will be possible to calculate what proportion of the population of children will obtain scores within a certain range. More particularly, it will be possible to calculate the proportion of children who will be given speech therapy if a certain (low) score is used as a cut-off point; i.e. only children obtaining a score lower than this will receive therapy. It is in fact quite common practice in the USA, in identifying children in need of treatment, to set that cut-off point at two standard deviations below the mean. If the test scores are normally distributed, this means that approximately 2.5% of children will be selected for treatment. But unless we know that the distribution of the US population scores can be modelled on a normal distribution, we cannot ascertain the proportion in this way.

How then do we determine whether or not the sample data obtained are consistent with a normally distributed population of test scores? If the frequency data are represented as a histogram (figure 9.1), it can be seen that they do in fact show some resemblance to a normal curve. But we must go further than this. The question that we have to answer is whether the number of scores in each class is sufficiently similar to that which would result typically when a random sample of 184 observations is taken from a normally distributed population with a mean of 50 and standard deviation of 10. The first step is to calculate the proportion of the model population which would lie in each class interval. To do this,
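The expected proportions in table 9.1 come from the standard normal distribution. With only the Python standard library, the normal CDF Φ can be written using math.erf; the sketch below (our illustration of the calculation, not part of the text) recovers the proportion for the class 30 - 35 and the tail below two standard deviations, the proposed therapy cut-off.

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 50, 10

# Proportion of a N(50, 10) population scoring from 30 to less than 35:
# standardise both endpoints and take the difference of the CDF values.
p_30_35 = phi((35 - mu) / sigma) - phi((30 - mu) / sigma)
print(round(p_30_35, 3))  # 0.044, as in column 4 of table 9.1

# Tail below two standard deviations under the mean (the therapy cut-off):
print(round(phi(-2.0), 3))  # 0.023, i.e. roughly 2.5% of children
```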
the endpoints of the intervals are standardised to Z-values, as in the third column of table 9.1 (Z-values calculated using the formula Z = (X - μ)/σ presented in chapter 6). In the fourth column are the expected proportions of the model population to be found in each class interval, these proportions being taken from a normal table (table A2) in the way described in chapter 6. In order to calculate the number of scores we would expect in each class if the sample of 184 scores had been taken from a population with the stated properties, we have only to multiply each expected proportion by 184, the results being given in column 5 (for example, 184 × 0.023 = 4.2).

[Figure 9.1. Histogram of the data of table 9.1. (Note that the first and last classes are not represented in the diagram.)]

Comparing columns 2 and 5 of the table, we can see that the actually observed number of scores in each class is not very different from what we would expect of a sample of 184 taken from a normal population with a mean of 50 and standard deviation of 10. So far this merely confirms the impression given by the histogram (figure 9.1). But we will now use these observed and expected frequencies to develop a formal test of the fit of the proposed model to the observed data.

Let the null hypothesis (chapter 8), H0, be that the data represent a random sample drawn from a normally distributed population with a mean of 50 and standard deviation of 10, and ask how the observed frequencies compare with those expected if H0 were true. The most obvious thing to do is calculate for each class the difference between the observed number of scores and the expected number of scores. The expected number can simply be subtracted from the observed number. The result of doing this can be seen in column 3 of table 9.2. (The reason for combining a number of the original classes will be given below.)

Table 9.2. Calculations for testing the hypothesis that the data of table 9.1 come from a normal distribution with mean 50 and standard deviation 10

    (1) Observed      (2) Expected      (3) Discrepancy   (4)              (5) Relative discrepancy
    frequency, oi     frequency, ei     (oi - ei)         (oi - ei)²       (oi - ei)²/ei
    13                12.3               0.7               0.49            0.04
    17                16.9               0.1               0.01            0.0006
    31                27.6               3.4              11.56            0.42
    32                35.1              -3.1               9.61            0.27
    39                35.1               3.9              15.21            0.43
    22                27.6              -5.6              31.36            1.14
    19                16.9               2.1               4.41            0.26
    11                12.3              -1.3               1.69            0.14

    Total deviance = 2.70

    (The first three and the last three classes of table 9.1 have been combined.)

Some of the discrepancies are positive, the remainder negative. But it is the magnitude of the discrepancy rather than its direction which is of interest to us; the sign (plus or minus) has no importance. What is more, just as with deviations around the mean (chapter 3), the sum of these discrepancies is zero. It will be helpful to square the discrepancies here, just as we did in chapter 3 with deviations around the mean. It should be clear that it is not the absolute discrepancy between observed and expected frequencies which is important. If, for example, we expected 10 scores to fall in a given class and observed 20 (twice as many), we would regard this as a more important aberration than if we observed 110 where 100 were expected, even though the absolute difference was 10 in both cases. For this reason we calculate the relative discrepancy by dividing the square of each absolute discrepancy by the expected frequency.
p, =so and u= 10. The alternative, H, is that the parent population (the quency. Thus, in the first row: 0.49 (square of discrepancy)+ 12.3
one from which the sample has been drawn) has a distribution different (expected frequency) f(ives a relative discrepancy of o.o4. The results of
from the one proposed. We must first obtain a measure of the discrepancy this and calculatiom:; for the remaining rows are found in column 5
between the observed scores and those which would be expected if 11 0 The procedure that we have fvllowed so far has given us a measure
1;14 135
Testing the fit of models to data A type of model
of deviance from the model for each class which will bezero whenthe Testing how well a type ofmodel fits the data
9'2
observed frequency of scores in the class is exactly what would be predicted In the previous section we saw how to test the fit of a model
by H 0 , and which will be large and positive when the discrepancy is large with a normal distribution and a given mean and standard deviation. It
compared to the expected value .. By summing the deviances in column was because the mean and standard deviation were given that the model
s we arrive at the total deviance, which is 2.70. Using the total deviance was described in the section heading as 'complete', since it was fully speci
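The arithmetic of table 9.2 can be checked in a few lines of Python; a minimal sketch, using the observed and expected frequencies of the table:

```python
# Observed and expected frequencies for the eight (combined) classes of
# table 9.2 (expected: a normal population with mean 50, sd 10, n = 184).
observed = [13, 17, 31, 32, 39, 22, 19, 11]
expected = [12.3, 16.9, 27.6, 35.1, 35.1, 27.6, 16.9, 12.3]

# Relative discrepancy for each class: (o - e)^2 / e
deviances = [(o - e) ** 2 / e for o, e in zip(observed, expected)]

total_deviance = sum(deviances)
print(round(total_deviance, 2))  # 2.7
```

Since 2.70 is well below the 10% critical value of 12.0 for 7 df, the computation confirms the conclusion reached below by hand.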
Using the total deviance as a test statistic, we are now in a position to decide whether or not the sample scores are consistent with their being drawn from a population of normally distributed scores with μ = 50 and σ = 10. This is because, provided that all the expected frequencies within classes are large enough (in that they are all greater than 5), the distribution of the total deviance is known when H0 is true. It is called a chi-squared distribution (sometimes written χ²), though, as with the t-distribution, it is really a family of distributions, each member of the family being identified by a number of degrees of freedom. The degrees of freedom depend on the number of classes which have contributed to the total deviance. For the present case there are eight classes, some of the original classes having been grouped together. This was done in order to meet the requirement that the expected frequency in each class should be more than 5 (for example the original classes '70–less than 75' and '75 or greater' did not have sufficiently large values). There are eight classes but the expected frequencies are not all independent. Since their total has to be 184, as soon as seven expected frequencies have been calculated, then the last one is known automatically. There are therefore just 7 df. If H0 is true, the total deviance will have approximately a chi-squared distribution with 7 df. Critical values of chi-squared are to be found in table A5. We can see there that the 10% critical value for 7 df is 12.0. Since the value we have actually obtained, 2.70, is much smaller than this, there is no real case to be made against the null hypothesis. The scores made by the US children are consistent with those expected from a random sample drawn from a population of scores having a normal distribution with a mean of 50 and a standard deviation of 10. With this knowledge the School District should be able to predict with reasonable accuracy the number of 5-year-old children in their area for whom speech therapy will be indicated by the British articulation test, whatever cut-off point they might wish to select. (Of course, we must not forget the possibility of a type 2 error, which would occur when the data really did come from a quite different distribution but the particular sample happened to look as though it came from the distribution specified by the null hypothesis. Given the largish sample size and the fact that the value of the test statistic is well below the critical value, it is highly unlikely that any serious error will be committed by accepting that H0 is true.)

9.2 Testing how well a type of model fits the data
In the previous section we saw how to test the fit of a model with a normal distribution and a given mean and standard deviation. It was because the mean and standard deviation were given that the model was described in the section heading as 'complete', since it was fully specified. But this information is not always available. If the US School District had decided not to use an existing test but to develop one of its own, then there would be no population mean and standard deviation which could be incorporated into the model. Nevertheless, for the reasons given in the previous section, the School District would still be concerned to know whether the population of 5-year-old children on the new test would have a normal distribution, regardless of the mean and standard deviation. How would it find out?

Let us imagine that such a test is produced and administered to a random sample of 341 5-year-old children in the area. The null hypothesis is that the data obtained in this way represent a random sample drawn from a normally distributed population. The alternative hypothesis is that the parent population is not normally distributed. The procedure followed to test the fit of a type of model (here one with a normal distribution) is very similar to the one elaborated in the previous section for a complete model. We begin with the sample scores summarised in the first two columns of table 9.3. Since we have not hypothesised a population mean (μ) or a standard deviation (σ), we will take as our estimate of these the sample mean (X̄) and standard deviation (s). In the present case X̄ = 64.36 and s = 10.6.

Table 9.3. Calculations for testing the hypothesis that the scores of 341 children come from a normal distribution

  (1)              (2)         (3)                (4)           (5)         (6)
  Class interval   Observed    Standardised       Proportion    Expected
  of scores        frequency   interval           expected      frequency   Deviance
  less than 45         9       below −1.83         0.0334         11.4        0.51
  45–50               28       −1.83 to −1.36      0.0536         18.3        5.14
  50–55               28       −1.36 to −0.89      0.0992         33.8        1.00
  55–60               46       −0.89 to −0.41      0.1548         52.8        0.88
  60–65               70       −0.41 to 0.06       0.1829         62.4        0.93
  65–70               53       0.06 to 0.53        0.1780         60.7        0.98
  70–75               41       0.53 to 1.00        0.1394         47.5        0.89
  75–80               38       1.00 to 1.47        0.0879         30.0        2.13
  80 and over         28       1.47 and over       0.0708         24.1        0.63

  Total deviance = 13.09   (X̄ = 64.36, s = 10.6)
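The calculations summarised in table 9.3 can be reproduced in Python; a minimal sketch, using the class boundaries and observed frequencies of the table. The normal proportions are computed here from the exact normal curve (via math.erf) rather than read from a four-decimal table, so the total deviance comes out very slightly different from the 13.09 obtained by hand:

```python
import math

def normal_cdf(z):
    """Proportion of a standard normal population lying below z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Class boundaries and observed frequencies from table 9.3 (n = 341).
bounds = [45, 50, 55, 60, 65, 70, 75, 80]
observed = [9, 28, 28, 46, 70, 53, 41, 38, 28]
n = sum(observed)
mean, sd = 64.36, 10.6   # estimated from the sample itself

# Standardise the boundaries, then find the proportion of the model
# population expected in each class and the corresponding frequency.
z = [(b - mean) / sd for b in bounds]
cum = [0.0] + [normal_cdf(v) for v in z] + [1.0]
expected = [n * (cum[i + 1] - cum[i]) for i in range(len(observed))]

total_deviance = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Two parameters were estimated from the data, so df = 9 - 1 - 2 = 6,
# as the text goes on to explain.
df = len(observed) - 1 - 2
print(round(total_deviance, 2), df)
```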
Using these figures, the endpoints of the intervals are standardised (column 3) and the expected proportion of the model population to be found in each class interval is again calculated with the help of a normal table (table A2). Each proportion is then multiplied by 341 to give the number of scores we would expect in each class if the sample were taken from a normally distributed population (column 5). For each class, the deviance is computed by means of the formula:

(oi − ei)² / ei

and the deviances are summed to give a total deviance of 13.09 (column 6).

The only difference from what was done in the last section is that this time the sample mean and standard deviation have been used to estimate their population equivalents. This affects the degrees of freedom. As we have said before, the degrees of freedom can be considered in a sense as the number of independent pieces of information we have on which to base the test of a hypothesis. We began here with nine separate classes whose frequencies we wished to compare with expected frequencies. However, we really have only eight separate expected values, since the ninth value will be the total frequency less the sum of the first eight expected frequencies. But the degrees of freedom have to be further reduced. In estimating the population mean and standard deviation from the sample, we have, if you like, extracted two pieces of information from the data (one to estimate the mean and another to estimate the standard deviation), reducing by two the number of pieces of information available for checking the fit of the model to the data. The degrees of freedom in this case are therefore 6.

You should not worry if you do not follow the argument in the previous paragraph. The concept of degrees of freedom is difficult to convey in ordinary language and in a book such as this we cannot hope to make it fully understood. So far we have appealed to intuition, using the notion of 'pieces of information'. From now on, however, as the reasoning in particular cases becomes more complex, we shall not always attempt to provide the rationale for the degrees of freedom in a particular example. We shall continue, of course, to make clear exactly how they should be calculated.

As we saw above, the degrees of freedom in this instance are 6 (9 − 1 − 2). If H0 is correct, the total deviance (13.09) will have approximately a chi-squared distribution with 6 df. We see in table A5 that the corresponding 5% critical value is 12.6. Since the total deviance obtained is greater than this, we reject the null hypothesis. From the evidence of the sample scores, it would seem rather unlikely that the population of scores of the 5-year-old children on the test will have a normal distribution. This means that if the School District were to use the test in its present form, it might not be possible to benefit from the known properties of the normal distribution when making decisions about the provision of speech therapy to children in the area. However, we must not forget that there is a probability of 5% that the result of the test is misleading and that the test scores really are normally distributed, or at least have a distribution sufficiently close to normal to meet the requirements of the School District authorities. On the other hand, inspection of table 9.3 suggests that the distribution of test scores is rather more spread out than the normal distribution, with higher frequencies than expected in the tails. If a cut-off point of two standard deviations below the mean is used, the remedial education services may be overwhelmed by having referred to them many more children than expected.

You will realise that the application of the statistical tests elaborated in this and the previous section is not limited to articulation tests and School Boards in the USA. It should not be difficult to think of comparable examples. What you might not realise is that this idea of testing the fit of models to observed data is not limited to test scores, but can be extended to different kinds of linguistic data. For instance, an assumption underlying factor analysis and regression (chapters 15 and 13) is that the population scores on each variable are normally distributed, and this can be checked in particular cases, when the sample size is large enough, by the method described in this section. If the data do not meet the assumption of normality, the results of factor analysis or regression analysis (if carried out) should be treated with extra caution.

9.3 Testing the model of independence
In this section we will present two examples of a rather different application of the chi-squared distribution. The first example is taken from a study reported by Ferris & Politzer (1981). They wanted to compare the English composition skills of two groups of students with bilingual backgrounds. The children in Group A had been born in the USA and educated in English. Those in Group B had been born in Mexico, where they had received their early schooling in Spanish, and had later moved to the USA, where their schooling had been entirely in English.
There were 30 children in each group, all about 14 years old. Each of them wrote a composition of at least 100 words in response to a short film, and the first 100 words of each essay were then scored in several different ways. One of the measures used was the number of verb tense errors made by each child in the composition, and the results of this are shown in table 9.4(a) (such a table is referred to as a contingency table). We can see there that there are differences between the two groups in the number of tense errors that they have made. What we must ask ourselves now is whether the differences observed could be due simply to sampling variation, that is, we have two samples drawn from the same population; or whether they indicate a real difference, that is, the two samples are actually from different populations.

Table 9.4. Contingency table of number of verb tense errors in children's essays

(a) Observed frequencies

                 Number of errors in verb tense
                   0      1 error   2–6 errors   Row total
  Group A          7         7          16          30
  Group B         13        11           6          30
  Column total    20        18          22          60

(b) Expected frequencies: (row total) × (column total) ÷ (grand total)

                   0      1 error   2–6 errors   Row total
  Group A         10         9          11          30
  Group B         10         9          11          30
  Column total    20        18          22          60

(c) Deviances: (observed − expected)² ÷ expected

                   0      1 error   2–6 errors
  Group A         0.9      0.44        2.27
  Group B         0.9      0.44        2.27

  Total deviance = 7.22

Reproduced from Ferris & Politzer (1981)

To answer this question, we must refer back to chapter 5 and the discussion there about the meaning of statistical independence. If the number of errors scored by an individual is independent of his early experience, then, if the experiment were to be carried out over the entire population, the proportions of individuals scoring no errors, one error, or more than one error would be the same for both subpopulations. Suppose then we set up the null hypothesis, H0, that the number of errors is independent of the early school experience. If H0 were true, we could consider that, as regards propensity to make errors in verb tense, the two groups are really a single sample from a single population. The proportion of this single sample making no errors is 20/60 (i.e. 20 of the 60 children make no errors). We consider this as an estimate of the proportion in the complete population who would make no errors of tense under the same conditions. If Group A were chosen from such a population, about how many could be expected to make no errors? Let us assume that the same proportion, 20/60, would fall into that category. Since Group A consists of 30 subjects, we would expect about (20/60) × 30 = 10 subjects of Group A to make no errors of tense. This figure is entered as an expected frequency in table 9.4(b). Proceeding in the same way, we obtain the number of subjects in Group A expected to make one verb error (9) and from two to six errors (11). When this process is repeated for Group B, the same expected frequencies (10, 9, 11) are obtained. This is because the two groups contain the same number of subjects, which will not always, or even usually, be the case. It is not necessary that the groups should be of the same size; the test of independence which we are developing here works perfectly well on groups of unequal size.

Generalising the above procedure, the expected frequencies of different numbers of errors in each group can be obtained by multiplying the total frequency of a given number of errors over the two groups (the column total) by the number of subjects in each group (the row total) and dividing the result by the grand total of subjects in the experiment. The formula:

(column total × row total) ÷ grand total

will give you the expected frequencies, however many rows and columns you have. You should check that by using it you can obtain all the expected frequencies in table 9.4(b).

Now that we have a table of observed frequencies and another of the corresponding expected frequencies, the latter being calculated by assuming the model of independence, we can test that model in the same way that we have tested models in the previous two sections. The total deviance is computed in 9.4(c), and it then only remains to check whether this value, 7.22, can be considered large enough to call for the rejection of the null hypothesis. As previously, the total deviance will have a chi-squared distribution if the model is correct. The degrees of freedom are easily calculated using the formula:

(number of columns − 1) × (number of rows − 1).
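These steps — expected frequencies from the row and column totals, then the total deviance — can be sketched as a short Python function (the frequencies are those of table 9.4(a); the function name is ours):

```python
def independence_test(table):
    """Return (total deviance, degrees of freedom) for a contingency table,
    with each expected frequency computed as row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    deviance = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand
            deviance += (o - e) ** 2 / e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return deviance, df

# Verb tense errors (table 9.4(a)): rows are Groups A and B,
# columns are 0 errors, 1 error, 2-6 errors.
deviance, df = independence_test([[7, 7, 16],
                                  [13, 11, 6]])
print(round(deviance, 2), df)  # 7.23 with 2 df (7.22 in the text, from rounded cells)
```

With 2 df the 5% critical value from table A5 is 5.99, so the deviance is significant at the 5% level, matching the conclusion reached in the text.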
In the present case this means (3 − 1) × (2 − 1) = 2. If we consult the chi-squared tables, we find that, with 2 df, the value 7.22 is significant at the 5% level. This suggests that the null hypothesis may be untenable and that the distribution of errors is different for the two populations. The data point to the population of 14-year-old bilingual children who had early schooling in Spanish in Mexico making more verb errors in compositions than those who were born in the USA and were educated there entirely in English.

The second example concerns the way in which English words borrowed into a Polynesian language have been modified to fit the phonological structure of the language. Brasington (1978) examined the characteristics of vowel epenthesis in loan words from English into Rennellese. This is a language spoken on the island of Rennell, at the eastern edge of the Solomon group. Table 9.5 gives some examples of typical treatments of English loans in this language.

Table 9.5. Some typical treatments of English loans in Rennellese

  blade      buledi        half      hapu
  cartridge  katalini      matches   masese
  crab       kalapu        milk      meleki
  cricket    kilikiti      plumber   palama
  cross      kolosi        pump      pamu
  engine     inisini       rifle     laepolo
  fight      paiti         rugby     laghabi
  fishing    pisingi       ship      sipi
  fork       poka          story     sitoli

Data from Brasington (1978)

It is apparent from the examples that (a) English consonant clusters tend in the Rennellese forms to have a vowel introduced between the two elements of the cluster: the initial /kr-/ of crab becomes /kal-/, the initial /bl-/ of blade becomes /bul-/, the medial /-gb-/ of rugby becomes /-ghab-/; (b) English final consonants tend to appear in Rennellese supported by a vowel: ship becomes /sipi/, half becomes /hapu/. These modifications (all referred to by Brasington as 'vowel epenthesis') can plausibly be attributed in general to the phonotactic structure of Rennellese, which exhibits a 'typically Polynesian ... simple sequential alternation of consonants and vowels' (Brasington 1978:27). The CV syllable structure of the borrowing language modifies the CCV or VC or VCCV structures of the loaning language in obvious and predictable ways. While this may explain the fact of epenthesis, the selection of particular epenthetic vowels in specific cases remains to be accounted for. The Rennellese vowel system (transcribed as i, e, a, o, u) has three heights and (except for the low vowel) a front/back distinction. Why does the Rennellese version of English plumber, /palama/, select /a/ as the epenthetic vowel to break up the /pl/ cluster? The most straightforward explanation would be one of reduplication of the non-epenthetic vowel. Rennellese represents the /ʌ/ vowel of English plumber as /a/; the same vowel is used as the epenthesised one. The same strategy seems to be followed in the Rennellese word for crab: English /æ/ is represented as /a/, and the same vowel epenthesised. We can see similar examples for medial position: rugby /laghabi/; and for final position: ship /sipi/. There are however counterexamples. In initial position, English blade /bleid/ is realised in Rennellese as /buledi/; in final position, half /ha:f/ appears as /hapu/.

We might ask at this point whether there is any association between the position at which epenthesis occurs and whether or not reduplication is the strategy adopted for selection of the epenthetic vowel. Our null hypothesis would be that reduplication and position are independent. To test this hypothesis of independence we tabulate the observed frequencies of each type of epenthesis, reduplicating and non-reduplicating, in each position, initial, medial and final, as in table 9.6(a). The data for this table were obtained by Brasington from Elbert (1975), a dictionary of Rennell-Bellona, and include all English loan words entered there – a total of 226. The expected frequencies are calculated in the same way as in the previous example, by multiplying column totals by row totals and dividing by the grand total. Table 9.6(b) shows expected frequencies and deviances.

Table 9.6. Contingency table of type of epenthesis by position of vowel in English loan words in Rennellese

(a) Observed frequencies

  Position of            Type of epenthesis
  epenthetic vowel   Reduplicating   Non-reduplicating
  Initial                 20               14
  Medial                  13                6
  Final                   61              112
                          94              132

(b) Expected frequencies and deviances

             Reduplicating   Non-reduplicating
  Initial     14.1 (2.5)       19.9 (1.7)
  Medial       7.9 (3.3)       11.1 (2.3)
  Final       72.0 (1.7)      101.1 (1.2)

  Total deviance = 12.7 on 2 df

The total deviance of 12.7, with 2 df, exceeds the 1% critical value, 9.21. On this basis we are likely to reject our null hypothesis, and assume that the use of the reduplicating strategy for epenthetic vowel selection is not independent of position. Inspection of the differences between observed and expected frequencies leads us to believe that reduplication is more likely in initial and medial position, and less likely in final position. This does not, of course, exhaust the search for factors relevant to the selection of specific epenthetic vowels, and for full details the reader is urged to consult Brasington (1978). We will, however, leave the example at this point.
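The Rennellese table can be checked in exactly the same way; a minimal sketch, using the frequencies of table 9.6(a). Exact expected values are used here, so the total differs marginally from the 12.7 obtained by summing the rounded cell deviances:

```python
# Epenthetic vowels (table 9.6(a)): rows are initial, medial, final position;
# columns are reduplicating and non-reduplicating epenthesis.
table = [[20, 14],
         [13, 6],
         [61, 112]]

row_totals = [sum(r) for r in table]
col_totals = [sum(c) for c in zip(*table)]
grand = sum(row_totals)  # 226 loan words in all

# Total deviance: sum of (observed - expected)^2 / expected over all cells.
deviance = sum((o - row_totals[i] * col_totals[j] / grand) ** 2
               / (row_totals[i] * col_totals[j] / grand)
               for i, r in enumerate(table) for j, o in enumerate(r))
df = (len(table) - 1) * (len(table[0]) - 1)
print(round(deviance, 1), df)  # 12.6 with 2 df
```

This exceeds the 1% critical value of 9.21 for 2 df, so the code reproduces the rejection of the null hypothesis.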
9.4 Problems and pitfalls of the chi-squared test
9.4.1 Small expected frequencies
We have already noted that, in order for the χ² test to have satisfactory properties, all expected frequencies have to be sufficiently large (generally 5 or greater). As we saw with the frequencies in table 9.2, it is sometimes necessary to group categories together to meet this condition. A similar problem may occur with a contingency table, with one or more of the expected frequencies falling below 5. There are two possible ways of dealing with this problem when it arises. The first is to consider whether a variable has been too finely subdivided: if so, then categories can be collapsed so that all the cells of the table do have a sufficiently large expected frequency. Suppose, for example, that in table 9.4(b) we had found that we expected very few people in one of the groups to make exactly one error. Then we could combine the second and third columns and classify degree of error into 'all correct' and 'some errors', and still test whether the two groups were similar with respect to the frequency of their errors. If, on the other hand, a similar problem had arisen in table 9.6(b) – let us say that the expected frequency for reduplicating epenthetic vowels in medial position was too low – it would have been more difficult to decide how to regroup the data. Should the medial position vowels be considered alongside those occurring in initial position, or with those epenthesised finally? Because of the nature of the data, there is no obvious solution to this particular problem.

It may also happen that, in a large table, any problem cells are distributed in a rather haphazard way. In such cases, the collapsing of cells necessary to eliminate all those with small expected frequencies will remove interesting detail. The only really satisfactory approach is to collect more data. It is in fact easy to estimate roughly how much more data we may need to collect. Consider the case where one of the variables classified in the table is controlled experimentally, as in the study of the bilingual schoolchildren where the sizes of the groups were chosen by the experimenter. Suppose that, for one of the groups, some cells have too small an expected value. Let us say that the smallest such value (for a cell we clearly wish to retain in the table) is about 1.25, i.e. a quarter of the required minimum. Then increasing the size of that group to four times its original value will cause all the expected values in that row to be increased by about the same factor, provided that the proportions of the new data falling in the different columns are roughly the same as in the original. However, to do this could be extremely expensive in terms of experimental effort, even supposing that the variables are under the experimenter's control. In the Rennellese loans, for example, both the type of epenthesis and its place in the word are language-contact phenomena which are not under the control of the researcher. In such circumstances it would be necessary to increase the total number of observations fourfold. Even then success, though likely, is not guaranteed, since we cannot be sure that the new observations will increase the row and column totals relevant to the cell in question as much as we had hoped. When we consider, in addition, that it is impossible in this particular case to find any further loan words from the sources used (since Brasington's data comprise all the loan words from English in Elbert 1975), and that further data will require expensive fieldwork, we are likely to conclude that there is little value in trying to augment the observations.

There is, however, a third option. We can go ahead and carry out the chi-squared test even if some expected frequencies are rather too small. It can be shown that the likely effect of this is to produce a value of the test statistic which is rather larger than it ought to be when the null hypothesis is true; that is, there is more likelihood of a type 1 error (chapter 8). On the other hand, if all the cells with small expected values have an observed frequency very similar to the expected, and thus contribute relatively little to the value of the total deviance, it is unlikely that the value of the deviance has been seriously distorted, and the result of the test can be accepted, especially if we adopt the attitude to hypothesis testing suggested in 8.5.

Whenever χ² values based on expected frequencies of less than 5 are reported, the reader's attention should be drawn to this fact and any conclusions arrived at on the basis of a statistically significant outcome should be expressed in suitably tentative terms. An examination of the use of χ² in the applied linguistics literature reveals that this is not always done. Indeed, there is cause to wonder whether the authors are always aware of the failure of the data to meet the requirements of the test.
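A routine safeguard, then, is to compute the expected frequencies and flag any that fall below 5 before running the test. A minimal sketch (the function name and the threshold argument are ours, not from the text):

```python
def small_expected_cells(table, minimum=5):
    """Return (row, column, expected frequency) for every cell of a
    contingency table whose expected frequency falls below the minimum."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    grand = sum(row_totals)
    return [(i, j, row_totals[i] * col_totals[j] / grand)
            for i in range(len(table)) for j in range(len(table[0]))
            if row_totals[i] * col_totals[j] / grand < minimum]

# The verb tense table (9.4(a)) has no expected frequency below 5 ...
print(small_expected_cells([[7, 7, 16], [13, 11, 6]]))  # []
# ... so the chi-squared test can be applied without regrouping.
```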
9.4.2 The 2 × 2 contingency table
Table 9.7 reproduces another section of the results from the paper of Ferris & Politzer (1981), this time comparing the essays of two groups of bilingual children for the number of errors made in pronoun agreement.

Table 9.7. Errors in pronoun agreement for two groups of bilingual schoolchildren

                 Number of errors
                   0      1–3     Row total
  Group A         15       15        30
  Group B         23        7        30
  Column total    38       22        60

  chi-squared = 3.516

Data from Ferris & Politzer (1981)

The authors quite correctly report a chi-squared value, i.e. the value of the total deviance, of 3.516 with 1 df, and this fails to reach the tabulated 5% value, 3.84. If you calculate the total deviance using the method explained above, you will arrive at a value of 4.6, which would apparently lead to the rejection of the null hypothesis at the 5% significance level, the null hypothesis stating that the type of early education does not affect the number of errors in pronoun agreement. Why is there this discrepancy between the statistic derived via the analysis explained earlier, and the value quoted in Ferris & Politzer (1981)? It can be shown mathematically that if the model of independence is correct then the total deviance will have approximately a chi-squared distribution with the relevant degrees of freedom. The approximation is very close when the expected frequencies are large enough and the table is larger than 2 × 2. When the table has only two rows and two columns the total deviance tends to be rather larger than it ought to be for its distribution to be modelled well as a chi-squared variable with 1 df. Thus a correction (referred to as 'Yates's correction') is needed, and takes the following form.

As always, for each of the four cells, you must find the difference between expected and observed frequency. Then, ignoring the sign of that difference, reduce its magnitude by 0.5, square the result and divide by the expected value as before. For example, for the cell giving the number of subjects in Group A who make no errors with agreement of pronouns (table 9.7) the expected frequency is 19 while 15 instances were actually observed. The magnitude of the difference (15 − 19) is 4. We reduce this to 3.5 and calculate the deviance (3.5)² ÷ 19 = 0.645. The remainder of the individual deviances are given in table 9.8.

Table 9.8. Corrected deviances for table 9.7, using the formula: (|observed frequency − expected frequency| − 0.5)² ÷ expected frequency

             Number of errors
               0        1–3
  Group A    0.645     1.113
  Group B    0.645     1.113

  Total = 3.516

We see that the chi-squared value of 3.516 is correct. Note that, in this example, the unmodified value of the total deviance would have led us to conclude, incorrectly, that the result was significant at the 5% level.
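The corrected calculation can be sketched in Python (the frequencies are those of table 9.7; the function name is ours):

```python
def yates_chi_squared(table):
    """Total deviance for a 2 x 2 contingency table with Yates's correction:
    each |observed - expected| is reduced by 0.5 before squaring."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    grand = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand
            total += (abs(o - e) - 0.5) ** 2 / e
    return total

# Pronoun agreement errors (table 9.7): Groups A and B, 0 vs 1-3 errors.
print(round(yates_chi_squared([[15, 15], [23, 7]]), 3))  # 3.517 (3.516 in the text)
```

Removing the `- 0.5` from the deviance line gives the uncorrected value of about 4.6, reproducing the discrepancy discussed above.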
9.4.3 Independence of the observations
It is quite common in the study of first or second language learning to analyse a series of utterances from the speech of an individual in order to ascertain the incidence of various grammatical elements and/or errors of a particular kind. Hughes (1979) provides a simple example of this. In his study of the learning of English through conversation by a Spanish speaker (referred to in chapter 3) one of the features that he investigated was noun phrases of the type possessor–possessed, e.g. Marta('s) bedroom. Over one period covered in the study the learner used a total of 81 constructions of this type. Of these, 55 showed correct English ordering (e.g. Marta('s) bedroom), while the remainder (26) reflected Spanish constituent order (e.g. bedroom Marta). On the lexical dimension, 23 of the 81 collocations involved pairings of words that were referred to as 'novel', since these particular pairings had not been used previously by the learner. The remaining 58 pairs were non-novel. An interesting question, relating to productivity in the learner's language, is whether the novel and non-novel instances manifest different error rates. The data are entered in table 9.9, which looks similar in structure to the 2 × 2 contingency table (table 9.7). However, it would be mistaken to use the chi-squared test on these data, because the separate utterances which were analysed to obtain the data for the table were not independent of each other. We have already stressed in earlier chapters that the data we use to estimate a parameter of a population, or to test a statistical hypothesis, must come from a random sample of the population.
Testing the fit of models to data l IVUtt:fft~ Uj Ult: (.flt-:;(jllUI'eU teSt

Table 99 The relation between the novelty of word pairings Tab kg. 1 o. Fi-equency with which two groups qf leamers of English supply
and ordering ennrs in the English of a native Spanish speaker and fail to supply regular plural morphemes in obligatmy contexts during
interview
Novel pairings Non-novel pairings Row total
Correct Group with instruction Group without instruction
order 8 47 55 lVJorphcmc Morpheme Morpheme l\-'lorphcmc
Incorrect supplied not supplied supplied not supplied
order IS 26
" ]2 28 42 ]6
Column total 23 s8 8I II2 24 28
39
Io6 39 3I 29
42 40 55 37
of the chi-squared test for association between the rows and columns of 26 24 62 55
a contingency table. In particular, all the instances which have been Group
recorded must be separate and not linked in any way. For example, in totals 3I8 ISS 229 I8S
spontaneous data from language learners it is not uncommon for instances
of the same structure to occur at points close in time, though not as part
of the same utterance. It would be naive to suppose that the form which to the plural morpheme on one occasion cannot be seen as independent
the structure takes on a second occasion is quite independent of the form of his performance.on all the other occasions within a single hour.
it took in an utterance occurring a few seconds earlier (see also 75.2). If we seem to have laboured the point about independence in relation
This is particularly important in situations like the one under discussion to x', there is a reason. There is evidence in applied linguistics publications
where some cell frequencies are likely to be more affected than others that the requirement of independence is not generally recognised. The
by the lack of independence. Instances of novel pairings of nouns cannot, reader is therefore urged to exercise care in the use of XZ and also to
by definition, be affected by repetition since when a word pair is repeated be on the alert when encountering it in the literature. Whenever an indivi-
it is no longer novel. However, instances of non-novel noun pairings could dual's contribution to a contingency table is more than 'one', then there
occur in successive utterances. Whenever it is not clear that the observations must be suspicion that the assumption of independence has not been met.
are completely independent, if a chi-squared value is calculated. it should
be treated very sceptically and attention. drawn to possible defects in the 9+4 Testing several tables from the same study
sampling method. In the Ferris & Politzer study, the essays of the 6o students
Consider the following example. A researcher is interested in the fre- were marked for six types of error in all,. and for each type of error a
quency with which two groups of learners of English supply the regular contingency table was presented in the original paper. Each table was
plural morpheme (realised as /s/, /z/, /<z/) in obligatory contexts. One analysed to look for differences between groups at the s% significance
group of learners has had formal lessons in English, the other group has level. There are two points to watch when several tests are carried out
not. Each learner is interviewed for one hour and the number of times based on data from the same subjects.
he supplies and does not supply the morpheme is noted. The results are First, the risk of spurious 'significant' results (i.e. type I errors) increases
reported in table 9 10. with the number of tests. If we carry out six independent chi-squared
The researcher would be completely mistaken to use the group totaJ tests at the s% significance level then, for each test it is true that the
in order to carry out a x' test. For one thing, two members of the group probability of wrongly concluding that the groups are different is only
that had received instruction contributed many more tokens than anyone o.os. However, the probability that at least one test will lead to this wrong
else in the study. Their performance had undue influence on the total conclusion is r -(0.95) 6 = 0.26. In fact the errors in verb tense which
for their group. But even if all learners had provided an equal number we analysed in table 94 were the only set of errors which gave a significant
of tokens, it would still not be correct to make usc of the group totals diffCI"(.'IlCC at the s% level. From the above argument the true significance
in calculating X' This is because an individual's performance with respect (taken iu conjunction wit:h all tht~ other tc~ts) could be o.26, or 26%,
I49
M8

and we might conclude that there is no strong evidence that the groups
are more different than might have occurred by random sampling from
the same population. (We cannot be sure of the exact value since the
contingency tables, being all based on the same subjects, will not be independent.
The true significance level will be somewhere between 5% and 26%.)

A second problem that may occur when one analyses the same scripts
or the same set of utterances for a number of different linguistic variables
is that these variables may not be independent. Even though we may
have, say, yes/no questions and auxiliaries as separate categories in our
analysis, it is unlikely that these variables are entirely unconnected. Analysing
sets of dependent tables can cause the groups to appear either more
or less alike than they really are, depending on the relationship between
the different variables. Again, in practice one may very well carry out
many tests based on a single data set, but it is important to realise that
the results cannot be given as much weight as if each table were based
on an independent data source.

9.4.5 The use of percentages
We have already seen that the data in table 9.7, when tested
for independence, produced a non-significant value of chi-squared, suggesting
no evidence of any association between rows and columns. Consider
now table 9.11(a), which provides a chi-squared value of 6.18, significant
at the 0.01 level (for 1 df). This table is, however, 'identical' to the first
except that the cell frequencies have been converted to percentages of
the total frequency. Table 9.11(b), where an alternative method of calculating
the frequencies of table 9.7 in terms of percentages is used, gives
an even more dramatic result. Here the frequencies are given as percentages
of the row totals, and the resulting chi-square of 13.4 appears to be highly
significant. Both examples serve to underline how misleading the conversion
of raw observed frequencies into percentages can be. The effect
can operate in the other direction, disguising significant results, if the
true total frequency is greater than 100.

Table 9.11. Data of table 9.7 restated as percentages

(a) Percentage of total number of subjects

                Number of errors
                0      1-3     Row total
Group A        25       25        50
Group B        38       12        50
Column total   63       37       100

(b) Percentage of number of subjects in each group

                Number of errors
                0      1-3     Row total
Group A        50       50       100
Group B        76       24       100
Column total  126       74       200

SUMMARY
The previous chapter dealt with the situation where the underlying
model was taken for granted (e.g. the data came from a normal distribution) or
did not matter (because the sample size was so large). This chapter discussed
the problem of testing whether the model itself was adequate, at least for a few
special cases.

(1) It was shown how to test H₀: a sample is from a normal distribution with
a specific mean and standard deviation versus H₁: the sample is not from
that distribution, by constructing a frequency table and comparing the
observed class frequencies, oᵢ, with the expected class frequencies, eᵢ,
if H₀ were true. The test statistic was the total deviance, X² = Σ{(oᵢ − eᵢ)²/eᵢ},
which would have a chi-squared distribution with k − 1 degrees of freedom,
where k is the number of classes in the frequency table.

(2) A test was presented of H₀: the sample comes from a normal distribution
with no specified mean and standard deviation versus H₁: it comes from some
other distribution. The procedure and test statistic were the same except that
the degrees of freedom were now (k − 3).

(3) The contingency table was introduced together with the chi-squared test,
to test the null hypothesis H₀: the conditions specified by the rows of the
table are independent of those specified by the columns (or H₀: there is no
association between rows and columns) versus the alternative, which is the
simple negation of H₀. The expected and observed frequencies are again compared
using the total deviance, which has a chi-squared distribution when H₀
is true. The number of degrees of freedom is obtained from the rule:

df = (no. of rows − 1) × (no. of columns − 1)

(4) It was pointed out that the chi-squared test of contingency tables is frequently
misused: all the expected frequencies must be 'reasonably large' (generally
5 or more); the 2 × 2 contingency table requires special treatment; the observations
must be independent of one another (the only completely safe
rule is that each subject supplies just a single token); and the raw observed frequencies
should be used, not the relative frequencies or percentages.
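As a cross-check, the computations in this section can be reproduced in a few lines of Python. The sketch below is not part of the original text; the raw frequencies of table 9.7 are not repeated on this page, but they can be inferred from table 9.11 together with the expected value of 19 quoted earlier (Group A: 15 and 15; Group B: 23 and 7).

```python
# Sketch (not from the book): Yates-corrected chi-squared for a 2 x 2 table,
# applied to table 9.7 (raw counts inferred from table 9.11) and to the same
# data restated as percentages (tables 9.11(a) and 9.11(b)).

def yates_chi2(table):
    """Chi-squared statistic for a 2 x 2 table with Yates' correction:
    reduce each |observed - expected| by 0.5 before squaring."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            e = rows[i] * cols[j] / n          # expected frequency
            stat += (abs(table[i][j] - e) - 0.5) ** 2 / e
    return stat

raw = [[15, 15], [23, 7]]          # table 9.7 (inferred raw counts)
pct_total = [[25, 25], [38, 12]]   # table 9.11(a): % of all subjects
pct_row = [[50, 50], [76, 24]]     # table 9.11(b): % of each group

print(round(yates_chi2(raw), 2))        # 3.52 -- below 3.84, not significant at 5%
print(round(yates_chi2(pct_total), 2))  # 6.18 -- spuriously 'significant'
print(round(yates_chi2(pct_row), 2))    # 13.41 -- even more misleading

# Familywise type I error for six independent tests at the 5% level:
print(round(1 - 0.95 ** 6, 2))          # 0.26
```

The corrected value for the raw table falls below the 5% critical value of 3.84, while both percentage versions appear significant: exactly the distortion this section warns against.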
EXERCISES

(1) Lascaratou (1984) studied the incidence of the passive in Greek texts of various
kinds. Two types of text that she looked at were scientific writing and statutes.
Out of a sample of 1,000 clauses from statutes, 698 were passive (incidentally,
she included only those clauses where a choice was possible, eliminating active
clauses which could not be passivised); out of a sample of 999 clauses of
scientific writing, 642 were passive. Knowing that sampling has been very
careful, what conclusions would you come to regarding the relative frequency
of the passive in texts of the two types?

(2) A group of monolingual native English speakers and a group of native speakers,
bilingual in Welsh and English, were shown a card of an indeterminate colour
and asked to name the colour, in English, choosing between the names blue,
green or grey. The responses are given below. Does it appear that a subject's
choice of colour name is affected by whether or not he speaks Welsh?

               Blue   Green   Grey
Monolingual:    28     41      16
Welsh:          40     38      29

(3) An investigator examined the relationship between oral ability in English and
extroversion in Chinese students. Oral ability was measured by means of an
interview. Extroversion was measured by means of a well-known personality
test. On both measures, students were designated as 'high', 'middle' or 'low'
scorers. The results obtained are shown separately for males and females in
table 9.12. Calculate the two χ² values, state the degrees of freedom, and
say whether the results are statistically significant. What conclusion would
you come to on the basis of this data?

Table 9.12. Data on relationship of oral ability in
English and extroversion (Chinese students)

Female subjects           Oral proficiency scores
                          High   Middle   Low
Extroversion   High        2       6       4
               Middle      3       2       1
               Low         1       3       2

Male subjects             Oral proficiency scores
                          High   Middle   Low
Extroversion   High        2       0       0
               Middle      2       7       0
               Low         3       1       2

(4) An investigator asked English-speaking learners of an exotic language whether
sentences he presented to them in that language were grammatical or not.
Amongst his results he reports the following: for sentences containing one
kind of error, there was a 33% rejection rate; for sentences containing a related
error, the rejection rate was only 13%. In both cases N is said to be 60.
Is the rejection rate for one kind of error significantly different from that
for the other kind?
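Readers who want to check their answer to exercise (1) numerically could use a sketch like the one below. It is not from the book; it simply applies the chapter's 2 × 2 chi-squared test (with the optional Yates correction) to the passive-clause counts, leaving the interpretation to the reader.

```python
# Sketch (not from the book): chi-squared statistic for exercise (1),
# statutes (698 passive out of 1,000) vs. scientific writing (642 out of 999).

def chi2_2x2(table, yates=False):
    """Chi-squared statistic for a 2 x 2 contingency table.
    With yates=True, each |observed - expected| is reduced by 0.5."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    correction = 0.5 if yates else 0.0
    stat = 0.0
    for i in range(2):
        for j in range(2):
            e = rows[i] * cols[j] / n
            stat += (abs(table[i][j] - e) - correction) ** 2 / e
    return stat

# rows: text type; columns: passive vs. non-passive clauses
observed = [[698, 302], [642, 357]]
print(round(chi2_2x2(observed, yates=True), 2))
# compare the result with the 5% critical value of 3.84 for 1 df
```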


10 Measuring the degree of interdependence between two variables

In chapters 5 and 6 we introduced a model for the description of the
random variation in a single variable. In chapter 13 we will discuss the
use of a linear model to describe the relationship between two random
variables defined on the same underlying population elements. In the
present chapter we introduce a measure of the degree of interdependence
between two such variables.

10.1 The concept of covariance
A study by Hughes & Lascaratou (1981) was concerned with
the evaluation of errors made by Greek learners of English at secondary
school. Three groups of judges were presented with 32 sentences from
essays written by such students. Each sentence contained a single error.
The groups of judges were (1) ten teachers of English who were native
speakers of the language, (2) ten Greek teachers of English, and (3) ten
native speakers of English who had no teaching experience. Each judge
was asked to rate each sentence on a 0-5 scale for the seriousness of the
error it contained. A score of 0 was to indicate that the judge could find
no error, while a score of 5 would indicate that in the judge's view it
contained a very serious error. Total scores assigned for each sentence
by the two native English-speaking groups are displayed in table 10.1.

Table 10.1. Total error scores assigned by ten native English
teachers (X) and ten native English non-teachers (Y) for each
of 32 sentences

Sentence   English teachers (X)   English non-teachers (Y)
    1               22                      22
    2               16                      18
    3               42                      42
    4               25                      21
    5               31                      26
    6               36                      41
    7               29                      26
    8               24                      20
    9               29                      18
   10               18                      15
   11               23                      21
   12               22                      19
   13               31                      39
   14               21                      23
   15               27                      24
   16               32                      29
   17               23                      18
   18               18                      16
   19               30                      29
   20               31                      22
   21               20                      12
   22               21                      26
   23               29                      43
   24               22                      26
   25               26                      22
   26               20                      19
   27               29                      30
   28               18                      17
   29               23                      15
   30               25                      15
   31               27                      28
   32               11                      14

x̄ = 25.03   sx = 6.25
ȳ = 23.63   sy = 8.26

One question that can be asked of this data is the extent to which the
groups agree on the relative seriousness of the errors in the sentences
overall. In this case we wish to examine the degree of association between
the total error scores assigned to sentences by the two native-speaker
groups.

As a first step in addressing this question, we construct the kind of
diagram to be seen in figure 10.1, which is based on the data in table
10.1, and is referred to as a scatter diagram, or scattergram. Each point
has been placed at the intersection on the graph of the total error score
for a sentence by the English teachers (the X axis) and the English non-teachers
(the Y axis). So, for example, the point for sentence 12 is placed
at the intersection of X = 22 and Y = 19. The pattern of points on the
scattergram does not appear to be haphazard. They appear to cluster
roughly along a line running from the origin to the top right-hand corner
of the scattergram. There are three possible ways, illustrated in figure
10.2, in which a scattergram with this feature could arise:

(a) There is an exact linear (i.e. 'straight line') relationship between X
and Y, distorted only by measurement error or some other kind of
random variation in the two variables.

(b) There is a non-linear relationship between X and Y which, especially
in the presence of random error, can be represented quite well by
a linear model.

(c) There is no degree of linear relation between X and Y, and any apparent
linearity in the way the points cluster in figure 10.1 is due entirely
to random error.

Figure 10.1. Scattergram of data of table 10.1 (scores of non-teachers plotted
against scores of teachers).

Figure 10.2. Three hypothetical relationships which might give rise to the
data of table 10.1 and figure 10.1.

As usual there will be no way to decide with certainty between these
different hypotheses. We must find some way of assessing the degree to
which each of them is supported by the observed evidence. To do this
we first of all require some measure of the extent to which a linear relationship
between two variables is implied by observed data. We have already
developed measures of 'average' (the mean and the median), measures of
dispersion (the variance and standard deviation), and now we will introduce
a measure of linear correlation.

This will be done in two stages, the first of which is to define a rather
general measure of the way and the extent to which two variables, X
and Y, vary together. This is known as the covariance, which we will
designate COV(X,Y), and is defined by:¹

COV(X,Y) = (1/(n − 1)) Σ(Xᵢ − X̄)(Yᵢ − Ȳ)

In table 10.2 we have shown what this formula would mean for the
data of table 10.1. For both of the variables X, the total error score for
each sentence given by the English teachers, and Y, the total error score
for the same sentence given by the English non-teachers, we start as though
we were about to calculate the variance (cf. 3.7). For example, we find
X̄, the mean of the 32 X observations, and calculate the difference dᵢ(X)
between each observed value, Xᵢ, and the mean. We then carry out a
similar operation on the 32 Y values to obtain the dᵢ(Y). We then multiply
d₁(X) by d₁(Y), d₂(X) by d₂(Y) and so on. We finish by adding the 32
products together and then divide by 31 to find the 'average product of
deviations from the mean'. The results of these calculations can be found
in table 10.2. (Note that the column of this table giving the cross-products
of X and Y is not used here, but is made use of in table 10.3 to calculate
the covariance by an alternative method.)

The reason for doing all this may not immediately seem obvious. It
may become clearer, however, when it is recognised that three distinct
patterns in the data are translated by this process into three rather different
outcomes:

Pattern 1
X and Y tend to increase together. In this case, when X is bigger

¹ There is a degree of arbitrariness about the choice of the divisor; some authors may use
n. The choice we have made simplifies later formulae. Of course, for largish
samples the result will be effectively the same.

than the X average, Y will usually be bigger than the Y average, so
that the product dᵢ(X)dᵢ(Y) will tend to be positive. Whenever X
is less than the X average, Y will usually be smaller than the Y average.
Both deviations dᵢ(X) and dᵢ(Y) will then be negative so that their
product will again be positive. Since most of the products will be
positive, their sum (and mean) will be positive and (possibly) quite
large.

Pattern 2
There is a tendency for Y to decrease as X increases and vice versa.
Now we will find that when X is below average, Y will usually be
above average and vice versa. The two deviations dᵢ(X) and dᵢ(Y)
will usually have opposite signs so that their product will be negative.
The sum of the products will then be large and negative.

Pattern 3
There is no particular relation between X and Y. This implies that
the sign of dᵢ(X) will have no influence on the sign of dᵢ(Y), so that
for about half the subjects they will have the same sign and for the
remainder one will be negative and the other positive. As a result,
about half the products dᵢ(X)dᵢ(Y) will be positive and the rest negative,
so that the sum of products will tend to give a value close to zero.

Table 10.2. Covariance between the error scores assigned by ten native
English teachers and ten native English non-teachers on 32 sentences

Sentence    X     Y      XY     d(X) = X − X̄   d(Y) = Y − Ȳ   d(X)d(Y)
    1      22    22     484        −3.03          −1.63           4.94
    2      16    18     288        −9.03          −5.63          50.84
    3      42    42    1764        16.97          18.37         311.74
    4      25    21     525        −0.03          −2.63           0.08
    5      31    26     806         5.97           2.37          14.15
    6      36    41    1476        10.97          17.37         190.55
    7      29    26     754         3.97           2.37           9.41
    8      24    20     480        −1.03          −3.63           3.74
    9      29    18     522         3.97          −5.63         −22.35
   10      18    15     270        −7.03          −8.63          60.67
   11      23    21     483        −2.03          −2.63           5.34
   12      22    19     418        −3.03          −4.63          14.03
   13      31    39    1209         5.97          15.37          91.76
   14      21    23     483        −4.03          −0.63           2.54
   15      27    24     648         1.97           0.37           0.73
   16      32    29     928         6.97           5.37          37.43
   17      23    18     414        −2.03          −5.63          11.43
   18      18    16     288        −7.03          −7.63          53.64
   19      30    29     870         4.97           5.37          26.69
   20      31    22     682         5.97          −1.63          −9.73
   21      20    12     240        −5.03         −11.63          58.50
   22      21    26     546        −4.03           2.37          −9.55
   23      29    43    1247         3.97          19.37          76.90
   24      22    26     572        −3.03           2.37          −7.18
   25      26    22     572         0.97          −1.63          −1.58
   26      20    19     380        −5.03          −4.63          23.29
   27      29    30     870         3.97           6.37          25.29
   28      18    17     306        −7.03          −6.63          46.61
   29      23    15     345        −2.03          −8.63          17.52
   30      25    15     375        −0.03          −8.63           0.26
   31      27    28     756         1.97           4.37           8.61
   32      11    14     154       −14.03          −9.63         135.11

COV(X,Y) = 1231.49 ÷ 31 = 39.73

r = COV(X,Y)/(sx·sy) = 39.73/(6.25 × 8.26) = 0.77

The covariance between two random variables is a useful and important
quantity and we will make use of it directly in this and later chapters.
However, its use as a descriptive variable indicating the degree of linearity
in a scatter diagram is made difficult by two awkward properties it possesses.

The first we have met before when the variance was introduced (see
3.7). The units in which covariance is measured will normally be difficult
to interpret. In the example of table 10.1 both dᵢ(X) and dᵢ(Y) will be
a number designating assigned error score, so that the product will have
units of 'error score × error score' or 'error score squared'. The second
is more fundamental. Look at figure 10.3, which shows the scatter diagram
of height against weight of 25 male postgraduate students of a British
university. As we would expect and can see from the data, taller students
tend to be heavier, but it is not invariably true that the taller of two
students is the heavier. In figure 10.3(a) heights are measured in metres
and weights in kilos. In figure 10.3(b) the units used are, respectively,
centimetres and grams. (If you think the two diagrams are identical look
carefully at the scales marked on the axes.) Clearly the relationship between
the two variables has not changed in the sample, but the covariances corresponding
to the two diagrams are 0.275 metre-kilos and 27,500 centimetre-grams,
respectively. Changing the units from metres to centimetres and
kilos to grams causes the covariance to be increased by a factor of 100,000.
(We would say that covariance is a scale-dependent measure.) Surely
if we can plot graphs of identical shape for two sets of data, we would
want any measure of that shape to give the same value both times? Fortunately,
both these defects in the covariance statistic are removed by making
a single alteration, which we will present in the next section.
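The arithmetic of table 10.2 can be verified directly. The sketch below is not part of the original text; it recomputes the covariance and the correlation from the 32 score pairs of table 10.1, and tiny discrepancies with the printed values arise only because the table works with deviations rounded to two decimals.

```python
# Sketch (not from the book): covariance and correlation for the data of
# table 10.1, computed with divisor n - 1 as in the chapter's definition.
from math import sqrt

x = [22, 16, 42, 25, 31, 36, 29, 24, 29, 18, 23, 22, 31, 21, 27, 32,
     23, 18, 30, 31, 20, 21, 29, 22, 26, 20, 29, 18, 23, 25, 27, 11]
y = [22, 18, 42, 21, 26, 41, 26, 20, 18, 15, 21, 19, 39, 23, 24, 29,
     18, 16, 29, 22, 12, 26, 43, 26, 22, 19, 30, 17, 15, 15, 28, 14]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# the 'average product of deviations from the mean'
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
sx = sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
sy = sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))
r = cov / (sx * sy)

print(round(cov, 2), round(r, 2))   # 39.72 0.77 (the table prints 39.73)
```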
Figure 10.3. Scattergram of height and weight of 25 students: (a) height in
metres against weight in kilos; (b) height in centimetres against weight in
grams.

10.2 The correlation coefficient
The standard deviation was calculated of both the heights and
the weights from the data on which the scatter diagrams of figure 10.3
were based. They are (sx and s*x are actually the same length; the star
simply indicates a change in units):

Set a: fig. 10.3(a)   sx = 0.051 metres       sy = 6.2 kilos
Set b: fig. 10.3(b)   s*x = 5.1 centimetres   s*y = 6,200 grams

As we would expect from our knowledge of the properties of the sample
standard deviation, s*x = 100sx and s*y = 1,000sy.

Now consider the product sx·sy. Its value for Set b is exactly 100,000
times its value for Set a (cf. the covariance), and the units of this product
are exactly the same as the units of the covariance in both cases: metre-kilos
and centimetre-grams respectively. This suggests that to describe, in some
sense, the 'shape' of the two scatter diagrams of figure 10.3, we might
try the quantity r, known as the correlation coefficient, in which we
use the product sx·sy as a denominator:

r(X,Y) = COV(X,Y)/(sx·sy)

For Set a:

r(X,Y) = 0.275 metre-kilos/(0.051 metres × 6.2 kilos) = 0.87

and for Set b:

r(X,Y) = 27,500 cm-grams/(5.1 cms × 6,200 grams) = 0.87

First note that r does not have any units. It is a dimensionless quantity
like a proportion or a percentage. (The units in the numerator 'cancel
out' with those in the denominator.) Second, it has the same value for
both scattergrams of figure 10.3. Changing the scale in which one, or
both, variables are measured does not alter the value of r. It can also
be shown that the value of the numerator, ignoring the sign, can never
be greater than sx·sy, and hence the value of r can never be greater
than 1.

We can then sum up the properties of r(X,Y) as follows:

(1) r = r(X,Y) = COV(X,Y)/(sx·sy)
(2) The units in which the variables are measured do not affect the
value of r (the correlation coefficient is scale-independent)
(3) −1 ≤ r ≤ 1 (i.e. r takes a value between plus and minus one)

The quantity r is sometimes referred to formally as the 'Pearson product-moment
coefficient of linear correlation'.² There are other ways of measuring
the degree to which two variables are correlated (we will meet one
shortly), but we will adopt the common practice of referring to r simply

² The terminology of statistics is often obscure. Many of the terms were chosen from other
branches of applied mathematics. Pearson is the name of the statistician credited with
the discovery of r and its properties; many simple statistics like the mean, variance and
covariance are called 'moments' by analogy with quantities in physics which are calculated
in a similar fashion, e.g. the moment of inertia, and 'product' refers to the multiplication
of the two factors (Xᵢ − X̄) and (Yᵢ − Ȳ) in the calculation of the covariance.
The same applies here. It is one thing to show that the hypothesis that
two variables have no correlation can be rejected; it is quite another to
argue that either variable contains an important amount of information
about the other. In other words, it is not so much the existence of correlation
between two variables that is important but rather the magnitude
of that correlation. A sample correlation coefficient is a point estimator
(chapter 7) of the population value, ρ, and may vary considerably from
one sample to another. A confidence interval would give much more information
about the possible value of ρ.

Unfortunately, there is no simple way to obtain a confidence interval
for the true correlation ρ in samples of less than 50 or so, even if
both variables have a normal distribution. For larger samples and normally
distributed variables a 95% confidence interval can be calculated as
follows:³

(e^X − 1)/(e^X + 1) < ρ < (e^Y − 1)/(e^Y + 1)

where X = 2(Z − 1.96/√(n − 3)), Y = 2(Z + 1.96/√(n − 3))

and Z = ½ loge[(1 + r)/(1 − r)]

The sample size is n and r is the sample correlation coefficient. The quantity
e has the value 2.71828... and logarithms to the base e are known as
natural logarithms; sometimes loge is written 'ln'. It is possible that some
of the symbols used here are unfamiliar to you, so we will work through
an example, step by step, which you should be able to follow on a suitable
calculator.

Suppose, in a sample of 60 subjects, a correlation of r = 0.32 has been
observed between two variables. Let us calculate the corresponding 95%
confidence interval for the true correlation, ρ. First calculate:

Z = ½ loge[(1 + r)/(1 − r)] = ½ loge(1.32/0.68) = ½ loge(1.9412)
  = 0.5 × 0.6633 = 0.3316

Now,

X = 2(Z − 1.96/√(n − 3)) = 2(0.3316 − 1.96/√57) = 0.1440

and

Y = 2(Z + 1.96/√(n − 3)) = 1.1824

Next we have e^X = e^0.1440 = 1.155 (the relevant calculator key may be
marked e^x or perhaps exp) and e^Y = e^1.1824 = 3.2622.

(e^X − 1)/(e^X + 1) = 0.155/2.155 = 0.072

(e^Y − 1)/(e^Y + 1) = 2.2622/4.2622 = 0.531

The value of the 95% confidence interval is then:

0.072 < ρ < 0.531

10.5 Comparing correlations
Using Fisher's Z-transformation it is also possible to test
whether two correlation coefficients, estimated from independent samples,
have come from populations with equal population correlations. Suppose
the correlations r₁ and r₂ have been calculated from samples of size n₁
and n₂ respectively. For both correlations, calculate the value:

Z = ½ loge[(1 + r)/(1 − r)]

³ This interval is obtained by relying on a device known as Fisher's Z-transformation;
see, for example, Downie & Heath (1965: 186).
The statistic: This test should not be used for small samples. For samples with n > roo
the value of t should be compared with percentage points of the standard
(z, - Zzl, j<nl.:::-3f[i,z=-i!
n1 +n2 -6
normal distribution.

is approximately standard normal provided the null hypothesis: p 1 = p 2 Io.6 Interpreting the sample correlation coefficient
is true. For sample sizes greater than so the statistic has a distribution It is rather difficult at this stage to explain precisely how to
very close to that of the standard normal and the hypothesis H 0 : p 1 = interpret a statement such as 'the correlation coefficient, r, between X
*
Pz versus 1-1 1: p 1 p2 can be tested by comparing the value of the statistic and Y was calculated from a sample of 32 pairs of error scores and
to percentage points of the normal distribution. A simple example will r = o.7z2'. We will be able to reveal its meaning more clearly after we
show how the arithmetic is done. have discussed the concept of linear regression in chapter I 3. However,
A sample of n 1 = 62 subjects is used to estimate the correlation between for the moment we will attempt to explain the essential idea in a somewhat
two variables X and Y. In a sample of n 2 = 69 different subjects the vari- simplistic way, at least for the case where both variables are normally
ables X and Ware measured. The first sample has a correlation r 1 = o.83 distributed.
between X and Y, the second has a correlation of r2 = o. 74 between X We show three different sample scattergrams in figure ro+ Now, let
and W. Test the hypothesis that the population correlations are equal us suppose for the moment that the three samples which gave rise to
(i.e. the correlation between X andY is equal to the correlation between
X and W). 1,)

y~l-~------------
I
Zt =-loge -- (1.83) = r.r88x
2 0.17

I
Z 2 =-log 0
(1.74) = 0.9505
--
y2 ------~--
Yl ----,
' . 1
: :
I
r=1.0
2 0.26 '' '
'
'
''
'
i '

~ = 1.326
(Z 1 - Z 2) --
X1 X2 X3 x
(b)
I25 y
X X X
The IO% point of the standard normal distribution for a two-tailed Xx X
test is I .645 so that the value I .326 is not significant. It is perfectly possible X X XXX
X X X X X r"' Q.8
that the correlation between X and Y is the same as the correlation between X X X

XandW. X X X X
The test we have just described is relevant only if the correlations r1 and r2 are based on independent samples. If X, Y and W were all measured on the same n subjects the test would not be valid. Downie & Heath (1959) give a test statistic for comparing two correlations obtained on the same sample. Their test is relevant when exactly three variables are involved, X, Y and W say, and it is required to compare the correlations of two of the variables with the third. To test whether rXW and rYW are significantly different they suggest the statistic:

(rXW - rYW)√((n - 3)(1 + rXY)) / √(2(1 - rXW² - rYW² - rXY² + 2rXW rYW rXY))

r = 0.772. We will be able to reveal its meaning more clearly after we have discussed the concept of linear regression in chapter 13. However, for the moment we will attempt to explain the essential idea in a somewhat simplistic way, at least for the case where both variables are normally distributed.

We show three different sample scattergrams in figure 10.4.

Figure 10.4. Three hypothetical scattergrams with the associated value of the correlation coefficient: (a) r = 1.0, (b) r = 0.8, (c) r = 0.2.
Now, let us suppose for the moment that the three samples which gave rise to the graphs produce values of r close to the true value ρ for their respective populations. In the first diagram, figure 10.4(a), the plotted points lie exactly on a straight line so that r = 1.0, i.e. the sample points are perfectly correlated. Now, if this is true of the population from which the sample was drawn then there is a sense in which either of the variables X or Y is redundant once we know the value of the other. Suppose that we have three observations (X1,Y1), (X2,Y2), (X3,Y3), with the property, for example, that the difference X3 - X1 is exactly twice as large as the difference X2 - X1. Then it will follow that the difference Y3 - Y1 is also exactly twice as large as the difference Y2 - Y1. We could say that variations in Y (or X) are accounted for 100% by variations in X (or Y).

Now consider figure 10.4(b). Because the points do not lie exactly on a straight line, it will not be true that variations in X will be associated with exactly predictable variation in Y. However, it is clear from the scattergram that if we know how the X values vary in a sample we ought to be able to guess quite well how the corresponding Y values will vary in relation to them. With the case shown in figure 10.4(c) we will not be able to do that nearly so well.

The correlation coefficient is often used to describe the extent to which knowledge of the way in which the values of one variable vary over observed units can be used to assess how the values of the other will vary over the same units. This is frequently expressed by a phrase such as 'the observed variation in X (or Y) accounts for P% of the observed variation in Y (or X)'. The value of P is obtained by squaring the value of r. The reason for this is explained in chapter 13. For the three examples pictured in figure 10.4 we have:

(a) r = 1.0   r² = 1.00 (= 100%)
(b) r = 0.8   r² = 0.64 (= 64%)
(c) r = 0.2   r² = 0.04 (= 4%)

For example, for case (b) we might say that '64% of the variation in Y is accounted for (or "is due to") the variation in X'.

As always, it is important to keep in mind that another random sample of values (Xi,Yi) from the same population will lead to a different value of r and r², so that the 64% mentioned above is only an estimate. How good that estimate is will, as usual, depend on the sample size. An important point to realise here is that by a 'random sample' we mean a sample chosen randomly with respect to both variables.

It is worth noting that just a few unusual observations can greatly affect the value of r. This is the situation in the scattergram of figure 10.5 where, although most points are distributed in apparently random fashion, there are three obviously extreme points. That these would greatly influence the estimated correlation can be seen from exercise 10.4. This is an additional reason for drawing a graph of the data. We have stressed at various times the value of looking closely at the data before carrying out any analysis.

Figure 10.5. A hypothetical scattergram. Most of the points are dotted about as though there were very little correlation - see figure 10.4(c) - but the presence of the circled points will produce a large value of r.

A highish correlation apparently due to only two or three observations ought to be treated with great caution. Apart from any other reason, if both variables come from normally distributed populations it will be extremely rare for a random sample of moderate size to contain two or three unusually extreme data values - often called outliers.
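The pull that a handful of extreme points exerts on r is easy to demonstrate by computer. In the Python sketch below (the data are invented for illustration, not taken from any study), nine points laid out on a regular grid have r = 0 exactly; adding three extreme points drives r above 0.9:

```python
def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# A 3 x 3 grid of points: X and Y vary completely independently, so r = 0.
xs = [1, 1, 1, 2, 2, 2, 3, 3, 3]
ys = [1, 2, 3, 1, 2, 3, 1, 2, 3]
print(round(pearson_r(xs, ys), 3))          # 0.0

# Three extreme points ('outliers') dominate the calculation.
xs_out = xs + [10, 11, 12]
ys_out = ys + [10, 11, 12]
print(round(pearson_r(xs_out, ys_out), 3))  # 0.968
```

A scattergram of the second data set would look very like figure 10.5: a patternless cloud plus a few extreme points, yet a very large value of r.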
10.7 Rank correlations
The statements made in the previous two sections about the meaning of r and how to test it all depend on the assumption that both the variables being observed follow a normal distribution. There will be many occasions when this assumption is not tenable. For example, in a study of foreign language learners' performance on a variety of tests, one of the variables might be the score on a multiple choice test with a maximum score of 100 (and perhaps with a distribution, over the whole population, approximately normal), while the other might be an impressionistic score out of 5 given by a native-speaking judge for oral fluency in an interview. The latter, principally because of the small number of possible categories (i.e. 0-5), will be distributed in a way very unlike a normal distribution. There will be times when the data do not consist of scores at all, for example when several judges are asked to rank a set of texts on perceived degree of difficulty.
In such cases there is an alternative method for measuring the relationship between two variables which does not assume that the variables are normally distributed. This method depends on a comparison of the rank orders rather than the numerical scores.

Let us go back to the example on error gravity and change the pairs of values (Xi,Yi) into (Ri,Si), where Ri gives the rank order of Xi in an ordered list of all the X values and Si is the rank order of Yi in a list of all the Y values - see table 10.4.

Table 10.4. Calculating the rank correlation coefficient

Sentence   X   Y   R (rank of X)   S (rank of Y)   R - S
(one row for each of the 32 sentences of the error gravity data)

Σ(Ri - Si)² = 1413.5
rs = 1 - (6 × 1413.5)/(32 × (32² - 1)) = 0.741

It should be clear to you that if X and Y are perfectly positively correlated (r = 1.0) then Si = Ri always - see figure 10.6(a). For perfect negative correlation (r = -1.0), Si = n + 1 - Ri, where n is the sample size - see figure 10.6(b). If X and Y vary independently of one another, there ought to be no relation between the rank of a particular Xi among the Xs and the rank of the associated Yi among the Ys. It ought therefore to be possible to derive a measure of the correlation between X and Y by considering the relationship between these rankings. One such measure is rs, the Spearman rank correlation coefficient.

Figure 10.6. The relation between (Xi, Yi) and (Ri, Si) for perfectly correlated data: (a) r = 1.0, (b) r = -1.0.

Suppose a random sample of n pairs has been observed and let Ri and Si be the ranks of the Xs and Ys as we have defined them above. Let D = Σ(Ri - Si)². Then rs is calculated by the formula:

rs = 1 - 6D/(n(n² - 1))
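For those with access to a computer the whole procedure can be mechanised. The Python sketch below (the helper names are our own) gives rank 1 to the largest value and shares the average rank among tied observations, in the way described in the following paragraphs, and then applies the formula for rs:

```python
def ranks(values):
    """Rank a list, giving rank 1 to the largest value; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    rks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1   # average of the 1-based positions i..j
        for k in range(i, j + 1):
            rks[order[k]] = avg
        i = j + 1
    return rks

def spearman_rs(xs, ys):
    """Spearman coefficient: rs = 1 - 6D / (n(n^2 - 1))."""
    n = len(xs)
    r, s = ranks(xs), ranks(ys)
    d = sum((ri - si) ** 2 for ri, si in zip(r, s))
    return 1 - 6 * d / (n * (n * n - 1))

print(ranks([6, 3, 4, 3, 1]))            # [1.0, 3.5, 2.0, 3.5, 5.0]
print(spearman_rs([1, 2, 3], [3, 2, 1])) # -1.0
```

The first call reproduces the five-observation tied-ranks example discussed shortly; the second illustrates that perfectly inverted orderings give rs = -1.0, in line with Si = n + 1 - Ri.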
It may look to you at first sight as though rs is simpler to calculate than r. This is not usually the case. First, except for very small samples, it is a time-consuming exercise, unless it can be done by computer, to calculate all the ranks.
Second, there is the problem of tied ranks. This arises when one of the variables has the same value in two or more of the pairs. Suppose in a sample of five observations we have:

X1 = 6, X2 = 3, X3 = 4, X4 = 3, X5 = 1

How should we calculate the ranks, Ri?

R1 = 1 (X1 has the largest value)
R2 = ? (should X2 be ranked third or fourth?)
R3 = 2
R4 = ?
R5 = 5

Clearly, R2 has to have the same value as R4, since X2 = X4, and its value must lie between 2 and 5. The way to deal with this is to take the average of the 'missing' ranks, in this case the average of 3 and 4, which is 3.5. Then R2 = R4 = 3.5. We can then calculate rs in the usual way.

Unfortunately, the problem does not end here. Tied ranks cause bias in the value of rs. If we use mean ranks and calculate rs by the usual formula we will tend to overestimate the true correlation. A full discussion of this problem can be found in Kendall (1970). However, unless a very high proportion of ranks is tied, the bias is likely to be very small. Siegel (1956) gives a formula to adjust the value of rs for tied ranks. He then gives an example with a number of tied ranks for which rs = 0.617 if the usual formula is used and rs = 0.616 if the exact (and more complex) formula is used. Our advice is to use the formula we have given even in the presence of ties, but to suspect that the apparent correlation may be very slightly exaggerated when a substantial proportion of the ranks are tied. The data on sentence error gravity are analysed in this way in table 10.4.

In this case, the Spearman correlation coefficient rs = 0.741, quite close to the value of the Pearson correlation coefficient r = 0.772 calculated on the same sample. Does this mean that rs can be interpreted in much the same way as r? In particular, will rs² still indicate the extent to which one variable 'explains' the other? Unfortunately, the answer to this question is an unequivocal 'No'. For moderate-sized samples from a population in which the variables are normally distributed the Spearman coefficient can be interpreted in roughly this way. However, even then rs will tend to underestimate the true correlation and will be more variable from one sample to another than the Pearson correlation coefficient. Besides, rs was designed to cope with data sets which are not normally distributed or whose exact values are not known (e.g. when only the ranks are known in the first place), and it should not be used on data which have a more or less normal distribution. Furthermore, for certain kinds of data the population value, ρs, of the Spearman coefficient is very different from that of ρ, the Pearson correlation coefficient, in the same population. Since rs is often calculated rather than r precisely in those situations where the underlying distribution of the data is unknown, it is never safe to try to interpret rs as a measure of correlation. In fact its only legitimate use is as a test statistic for testing the hypothesis that two variables are independent of one another. Consider the following example. Two different judges are asked to rank ten texts in order of difficulty:

Text:      A   B   C   D   E   F   G   H   J   K
Judge 1:   1   2   3   4   5   6   7   8   9   10
Judge 2:   4   2   7   1   3   10  8   9   6   5

Test the hypothesis that the judges are using entirely unrelated criteria and hence that their rankings are independent.

H0: There is no interdependence in the sets of judgements
H1: There is interdependence in the sets of judgements

Here we have:

D = 3² + 0² + 4² + 3² + ... + 5² = 90

and

rs = 1 - (6 × 90)/(10 × 99) = 0.4545

From table A7 we can see that this value of rs is just significant at the 10% level. There is, therefore, somewhat weak evidence of some measure of agreement between the judges. However, nothing else can be said. It is not possible, for example, to calculate a confidence interval for the population value, ρs, and there is therefore no way of knowing how precise the estimate 0.4545 is. This is a general deficiency of rank correlation methods.
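The arithmetic for the two judges can be checked in a few lines of Python (a sketch of our own; since both variables are already rankings, no ranking step is needed):

```python
judge1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
judge2 = [4, 2, 7, 1, 3, 10, 8, 9, 6, 5]

n = len(judge1)
D = sum((r - s) ** 2 for r, s in zip(judge1, judge2))
rs = 1 - 6 * D / (n * (n ** 2 - 1))
print(D, round(rs, 4))   # 90 0.4545
```

The printed values agree with the hand calculation above: D = 90 and rs = 0.4545.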
Table A7 gives percentage points of rs for sample sizes up to n = 30. For sample sizes greater than this it is safe to use table A6 of percentage points of the Pearson correlation coefficient. In large samples, and when the two variables are actually independent (so that ρ = ρs = 0), the values of rs and r will be very similar. In the case of correlated variables our previous statement holds: there is no simple relationship, in general, between r and rs, nor between the population correlation coefficients ρ and ρs. There is no simple way to interpret the 'strength' of correlation implicit in a sample value of rs.
SUMMARY
(1) The covariance, COV(X,Y) = Σ(Xi - X̄)(Yi - Ȳ)/(n - 1), was introduced and exemplified as a measure of the degree to which two quantities vary in step with one another. The covariance was shown to be scale-dependent.
(2) The correlation coefficient (Pearson's product-moment coefficient of linear correlation) between two variables, X and Y, was defined by r(X,Y) = COV(X,Y)/(sX sY) and was shown to be independent of the scales in which X and Y were measured. The value of r² could be interpreted as the proportion of the variability in either variable which was 'explained' by the other.
(3) The test statistic for the hypothesis H0: ρ = 0 was r itself; the critical values are to be found in table A6.
(4) It was shown how to calculate a confidence interval for the true correlation from large samples using Fisher's Z transformation.
(5) A test for the hypothesis that two population correlations were equal was presented.
(6) The Spearman rank correlation coefficient, rs, was defined. It was explained that it would usually be difficult to interpret rs as a measure of 'degree of interdependence', but that a test of the hypothesis H0: two variables are independent could be based on rs even when the data are decidedly non-normal.

EXERCISES
(1) Draw a scattergram to examine the relationship between the assigned error scores in the columns headed 'X' and 'Y' of table 11.5 on page 186, that is, between the scores of the Greek teachers and the English teachers. Compare your scattergram with figure 10.1. What do you notice?
(2) (a) Now calculate the covariance of these two sets of scores, using the method of table 10.2, and then determine the correlation coefficient.
(b) Recalculate the correlation coefficient using the rapid method of table 10.3. What does this value tell you about the linear relationship between the scores of the Greek teachers and those of the English teachers?
(3) Is the correlation coefficient obtained in exercise 10.2 significant?
(4) In the original study from which these data were taken, judges were presented with some sentences which were correct. For two of these sentences the English and Greek teachers were in complete agreement and assigned the following scores:

Sentence no.   Greek teachers   English teachers
33             3                3
34             0                0

Include these data in the set you used for your calculations in exercise 10.2 and recalculate the correlation coefficient. What difference does the addition of these two data points for each group make?
(5) Calculate the Spearman rank order correlation coefficient for the data you used in exercise 10.2.
(6) Twenty-five non-native speakers of English took a 50-item cloze test. The exact word method of scoring was used. Without warning, a few days later, they were presented with the same cloze passage, but this time for each blank they were given a number of possible options (i.e. multiple choice), these including all the responses they had given on the first version with the addition of the correct answer, if this had not been amongst their responses. The results were:

1st version   Multiple choice version     1st version   Multiple choice version
33            27                          25            20
31            31                          24            25
32            29                          24            24
30            33                          23            32
29            31                          23            23
29            30                          22            24
28            30                          21            15
29            23                          22            24
27            30                          21            23
28            29                          17            15
27            26                          16            17
26            27                          16            24
25            24

Twenty-five native speakers of English also took the first version of the test. Their scores were:

36 35 33 34 34 33 33 32 32 32 32
31 31 31 31 31 31 30 30 29 28 28
26 25 24

(a) Are the two sets of scores for the non-native speakers related? How do you interpret the correlation coefficient you have calculated?
(b) What is the relationship between non-native-speaker scores and native-speaker scores on the first version of the test?
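Readers tackling these exercises by computer may find a sketch along the following lines useful. It implements the covariance and correlation definitions from the summary above in Python (the data here are invented for illustration; substitute the error gravity or cloze scores as required):

```python
def covariance(xs, ys):
    """COV(X,Y) = sum((Xi - Xbar)(Yi - Ybar)) / (n - 1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def sd(xs):
    """Sample standard deviation (the covariance of a variable with itself)."""
    return covariance(xs, xs) ** 0.5

def pearson_r(xs, ys):
    """r(X,Y) = COV(X,Y) / (sX * sY)."""
    return covariance(xs, ys) / (sd(xs) * sd(ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(pearson_r(xs, ys), 3))   # 0.775

# Changing the scale of either variable leaves r unchanged:
print(round(pearson_r([10 * x for x in xs], [y + 100 for y in ys]), 3))  # 0.775
```

The second call illustrates the scale-independence noted in summary point (2).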
11 Testing for differences between two populations

It is often the case in language and related studies that we want to compare population means on the basis of two samples. In a well-known experiment in foreign language teaching (Scherer & Wertheimer 1964), first-year undergraduates at the University of Colorado were divided into two groups. One group was taught by the audiolingual method, the other by the more traditional grammar-translation method. The intention was to discover which method proved superior in relation to the attainment of a variety of language learning goals. The experiment continued for two years, and many measurements of ability in German were taken at various stages in the study. For our present purpose we will concentrate on just one measurement, that of speaking ability, made at the end of the experiment. Out of a possible score on this speaking test of 100, the traditional grammar-translation group obtained a mean score of 77.71 while the audiolingual group's mean score was 82.92. Thus the sample mean of the audiolingual group was higher than that of the grammar-translation group. But what we are interested in is whether this is due to an accidental assignment of the more able students to the audiolingual group or whether the higher mean is evidence that the audiolingual method is more efficient. We will address this question by way of a formal hypothesis test.

11.1 Independent samples: testing for differences between means
The notation we will use is displayed in table 11.1.

Table 11.1. Notation used in the formulation of a test of the hypothesis that two population means are equal

                                First population   Second population
Population mean                 μ1                 μ2
Population standard deviation   σ1                 σ2
Sample size                     n1                 n2
Sample mean                     X̄1                 X̄2
Sample standard deviation       s1                 s2

We wish to test the null hypothesis:

H0: μ1 = μ2

Our alternative hypothesis is:

H1: μ1 ≠ μ2

(At the outset of the experiment there was no compelling reason to believe that the audiolingual group would do better on this test after two years; it was not inconceivable that the grammar-translation group would do better. For this reason a non-directional alternative hypothesis is chosen.)

A test statistic can be obtained to carry out the hypothesis test provided two assumptions are made about these data:
1. The populations are both normally distributed. (As before, this will not be necessary if the sample size is large, since then the Central Limit Theorem assures the validity of the test.)
2. σ1 = σ2; that is, the population standard deviations are equal. This point is discussed further below. The value of the common standard deviation is estimated by:

s = √(((n1 - 1)s1² + (n2 - 1)s2²)/(n1 + n2 - 2))

This estimate is used in the calculation of the test statistic, t, as follows:

t = (X̄1 - X̄2)/√(s²/n1 + s²/n2)

This statistic has a t-distribution (the same as that met in chapter 7) with n1 + n2 - 2 df whenever the null hypothesis is true. The value of t can be compared with the critical values in table A4 to determine whether the difference between the sample means is significantly large.
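Both formulas can be evaluated directly from summary statistics. The Python sketch below (the function name is our own) uses the speaking-test figures reported in table 11.2:

```python
import math

def pooled_t(m1, s1, n1, m2, s2, n2):
    """Pooled standard deviation and two-sample t statistic, df = n1 + n2 - 2."""
    s = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    t = (m1 - m2) / math.sqrt(s**2 / n1 + s**2 / n2)
    return s, t

# Audiolingual group versus grammar-translation group (table 11.2):
s, t = pooled_t(82.92, 6.78, 24, 77.71, 7.37, 24)
print(round(s, 2), round(t, 2))   # 7.08 2.55
```

The value t = 2.55 with 46 df is then referred to table A4, as in the analysis that follows.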
The data in table 11.2 are taken from table 6-5 of Scherer & Wertheimer (1964).
Table 11.2. Group differences in speaking ability at the end of two years

     Audiolingual   Grammar-translation
X̄    82.92          77.71
s    6.78           7.37
n    24.00          24.00

Data from Scherer & Wertheimer (1964: table 6-5)

We first estimate the population standard deviation using the formula presented above:

s = √((23 × 6.78² + 23 × 7.37²)/46) = √((1057.27 + 1249.29)/46) = 7.08

This estimate of the population standard deviation is then used to calculate t:

t = (82.92 - 77.71)/√(7.08²/24 + 7.08²/24) = 5.21/√(2.09 + 2.09) = 5.21/2.04 = 2.55

If we enter table A4 with 46 df (n1 + n2 - 2), we discover that the t-value of 2.55 is significant at the 2% level (which is what was reported by Scherer & Wertheimer). The probability of the two sample means coming from populations with the same mean is less than 2 in 100 (which is the same as 1 in 50). There then appears to be support for the belief that the audiolingual method is superior to the grammar-translation method in developing speaking ability in German. However, various difficulties encountered by Scherer and Wertheimer in conducting the experiment mean that this result should be treated with caution.
Let us look now at another piece of research where it is appropriate to test for differences between means on the basis of two samples. Macken & Barton (1980a) investigated the acquisition of voicing contrasts in four English-speaking children whose ages at the beginning of the study ranged from 1;4.28 to 1;7.9. Each child was then seen every two weeks over an eight-month period, and recordings made of conversations between the child and its mother. The focus of the instrumental analysis of these data was the measurement of voice onset time (VOT) in initial stops in tokens produced by the children. On the assumption that VOT is the major determinant of a voiced/voiceless contrast in initial stops, its values were examined, separately for each child, to identify the point at which the child became capable of making a voicing contrast. The data in table 11.3 were extracted from Macken & Barton's table 5, which presents a summary of measurements on their subject, Jane.

Table 11.3. Mean VOT values (in milliseconds), standard deviations and number of tokens for a single child at different ages

                     Age 1;8.20        Age 1;11.0
                     /d/      /t/      /d/      /t/
Mean VOT (ms)        14.25    22.30    7.67     122.13
Standard deviation   15.44    13.50    19.17    54.19
Number of tokens     8        10       15       15

Data from Macken & Barton (1980a: table 5)

In normal, adult spoken English, the mean VOT for /t/ is higher than for /d/. Children have to learn to make the distinction. For the data in table 11.3 we can test the null hypothesis that there is no detectable difference in the mean VOT against the natural alternative that the mean VOT is higher for /t/. Using the formulae given above, we find the standard deviation estimate:

s = √((7 × 15.44² + 9 × 13.50²)/16) = 14.38 milliseconds

and evaluate the test statistic as:

t = (14.25 - 22.30)/√(14.38²/8 + 14.38²/10) = -1.18 with 16 df

On this occasion it can be argued that the natural alternative hypothesis is directional: we may discount the possibility that the VOT for /d/ could be greater than the VOT for /t/. The 5% critical value for t with 16 df would be 1.75 (the tabulated value for 15 df). The t-value we have calculated is less than this; indeed it is less than the 10% critical value (1.34).
We will therefore be unwilling to claim that the population mean VOT for /t/ is greater than that for /d/. We will not wish to claim that the child has learned to distinguish the two sounds successfully in terms of VOT, despite the difference of more than 8 ms in the sample means. This is due partly to the large standard deviations and partly to the small number of tokens observed. If we carry out a similar analysis on the VOTs observed on the same child at age 1;11.0 we find t = -7.71, which is highly significant and indicates that the child is now making a distinction between the two consonants.

Having discovered that it seems likely that a difference may exist between two means, it will usually make good sense to estimate how large that difference seems to be. A 95% confidence interval for the difference μ1 - μ2 is given by:

X̄1 - X̄2 ± (constant × √(s²/n1 + s²/n2))

where the constant is the 95% critical value of the t-distribution with (n1 + n2 - 2) df. The data imply that there is a difference between Jane's two mean VOTs at age 1;11.0. To estimate the size of this difference we first calculate:

s = √((14 × 19.17² + 14 × 54.19²)/28) = 40.65

A 95% confidence interval for the difference in the mean VOTs would then be:

X̄2 - X̄1 ± (2.04 × √(40.65²/14 + 40.65²/14))
or 114.46 ± 31.34
i.e. 83.12 to 145.80 milliseconds

We calculate therefore that we can be 95% certain that the difference between the population mean VOT for /d/ and that for /t/, for this child at this age, is somewhere between 83.12 and 145.80 ms. In view of the comment in the next paragraph concerning the variances of the VOTs of the two consonants, this interval is probably not quite correct, but it is unlikely to be far wrong and certainly gives a good idea of the order of difference between the mean VOTs.
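The value t = -7.71 quoted above can be checked with the same pooled arithmetic; a self-contained Python sketch (function name ours) using the age 1;11.0 figures from table 11.3:

```python
import math

def pooled_t(m1, s1, n1, m2, s2, n2):
    """Pooled standard deviation and two-sample t statistic, df = n1 + n2 - 2."""
    s = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    t = (m1 - m2) / math.sqrt(s**2 / n1 + s**2 / n2)
    return s, t

# VOT for /d/ versus /t/ at age 1;11.0 (table 11.3):
s, t = pooled_t(7.67, 19.17, 15, 122.13, 54.19, 15)
print(round(s, 2), round(t, 2))   # 40.65 -7.71
```

The same s = 40.65 is the estimate used in the confidence interval calculated above.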
The correctness of the above analysis depends on the two assumptions we have made: that the variable, VOT, is normally distributed for this subject at the two ages when the observations were taken, and that, at both ages, the VOT was equally variable for either sound.¹ Since the full data set is not published in the source paper we cannot judge whether the VOTs seemed to be normally distributed. With the small number of tokens observed it would be rather difficult to judge anyway, and the investigators would have had to rely on their experience of such data as a guide to the validity of the assumption. If we had, say, 30 tokens at each age instead of 15, modest deviations from normality would be unimportant. On the other hand, turning to the second assumption, there is always information contained in the values of the two sample standard deviations about the variability of the populations. In the test related to the VOTs at age 1;11.0, the sample standard deviations of the VOTs for /d/ and /t/ were 19.17 and 54.19, the larger being almost three times the value of the former. In the light of this, were we justified in carrying out the test? In the following section we give a test for comparing two variances, and it will be seen that we will reject decisively the assumption that VOT for /t/ has the same variance as VOT for /d/. This is serious and would remain so even if the sample sizes were much larger.²

There is a further problem which may arise with this kind of data where several tokens are elicited from a single subject in a short time period. Every method of statistical analysis presented in this book presupposes that the data consist of a set of observations on one (or several) simple random samples with the value of each observation quite independent of the values of the others. If a linguistic experiment is carried out in such a way that several tokens of a particular linguistic item occur close together, the subject may, consciously or not, cause them to be more similar than they would normally be. Such a lack of independence in the data will distort the results of any analysis, including the tests discussed in this chapter. This point has already been discussed in chapter 7.

¹ We did not question these assumptions in the case of the language-teaching example above. First, though we do not have access to the complete data set, it is not unreasonable to expect the distribution of scores on a language test with many items to approximate normality. Secondly, the F-test (see below) did not indicate a significant difference in variances.
² However, provided the sample sizes are equal, as they are here, it can be shown mathematically that the test statistic will still have a t-distribution when H0 is true but with fewer degrees of freedom - though never less than the number of tokens in just one of the samples, i.e. 15. The value of -7.71 is still highly significant even with only 15 df, so that the initial conclusion stands: the population mean VOT for /t/ is larger at age 1;11.0 than the population mean VOT for /d/. If the sample variances are such that we have to reject the hypothesis of equal population variances and the sample sizes are unequal, then a different test statistic with different properties has to be used. The relevant formulae, with a discussion, can be found in Wetherill (1972: 6.6).
11.2 Independent samples: comparing two variances
We have seen that it is necessary to assume that two populations have the same standard deviation or, equivalently, the same variance before a t-test for comparing their means will be appropriate. It is possible to check the validity of this assumption by testing H0: σ1² = σ2² (see table 11.1) against H1: σ1² ≠ σ2² using the test statistic:

F = larger sample variance / smaller sample variance

In the example of the previous section, two samples of 15 tokens of /t/ and /d/ from a child aged 1;11.0 gave standard deviations of 19.17 and 54.19 ms respectively. The larger sample variance is therefore 54.19² = 2936.56 and the smaller is 19.17² = 367.49, so that:

F = 2936.56 ÷ 367.49 = 7.99

If the null hypothesis is true and the population variances are equal, the distribution of this test statistic is known. It is called the F-distribution, and it depends on the degrees of freedom of both the numerator and the denominator. In every case the relevant degrees of freedom will be those used in the divisors of the sample standard deviations. Since both samples contained 15 tokens, the numerator and denominator of the F statistic, or variance ratio statistic as it is often called, will both have 14 df, and the statistic is denoted F14,14. In general we write Fm1,m2 where m1 and m2 are the degrees of freedom of the numerator and denominator respectively. In the tables of the F-distribution (table A5) we find that no value is given for 14 df in the numerator or in the denominator. To give all possible combinations would result in an enormous table. However, the 1% significance value for F12,15 is 3.67 and that for F24,15 is 3.29, so the value for F14,14 must be close to 3.6. The value obtained in the test is much larger than this, so that the value is significant at the 1% level and it seems highly probable that the VOT for /t/ has a much higher variance than that for /d/. Note, however, that this test (frequently referred to as the F-test) also requires that the samples be from normally distributed populations, and it is rather sensitive to failures in that assumption.
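The variance-ratio calculation is simple to reproduce; a Python sketch (function name ours), using the VOT standard deviations just discussed:

```python
def f_ratio(s1, s2):
    """Variance-ratio (F) statistic: the larger sample variance over the smaller."""
    v1, v2 = s1 ** 2, s2 ** 2
    return max(v1, v2) / min(v1, v2)

# Standard deviations of VOT for /d/ and /t/ at age 1;11.0:
F = f_ratio(19.17, 54.19)
print(round(F, 2))   # 7.99
```

With 14 and 14 df, the value 7.99 far exceeds the 1% point of roughly 3.6 interpolated in the text.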
11.3 Independent samples: comparing two proportions
We have already established a test for comparing two proportions. In table 9.7 we presented, as a contingency table, data from a study by Ferris & Politzer (1981) of pronoun agreement in two groups of bilingual children. The data in that table can be presented in a different form as:

Proportion of error-free cases
Group A   15/30 = 0.5     (p̂1)
Group B   23/30 = 0.7667  (p̂2)
Overall   38/60 = 0.6333  (p̂)

Now p̂1 and p̂2 are estimates of the true population proportions p1 and p2. We can test the hypothesis H0: p1 = p2 using the test statistic:

Z = (|p̂1 - p̂2| - ½(1/n1 + 1/n2)) / √(p̂(1 - p̂)(1/n1 + 1/n2))

where n1 and n2 are the sample sizes. (Remember that |p̂1 - p̂2| means the absolute magnitude of the difference between p̂1 and p̂2 and is always a positive quantity.) When the null hypothesis is true, Z will have a standard normal distribution.³ Here we have:

Z = (|0.5 - 0.7667| - ½(1/30 + 1/30)) / √(0.6333 × 0.3667 × (1/30 + 1/30)) = (0.2667 - 0.0333)/0.1244 = 1.876

The 5% significance value of the normal distribution for a two-sided test is 1.96, so that this result just fails to be significant at that level, exactly the conclusion reached by the chi-squared test of section 9.4.2. Furthermore, 1.876² = 3.519, almost exactly the value of the chi-squared statistic in that analysis, and the relevant 5% significance value of chi-squared was 3.84, which is 1.96². Because of this correspondence, the two methods will always give identical results. The only real difference is that with the test we have presented in this section it is possible to consider a one-sided alternative such as H1: p1 < p2, while the chi-squared test will always require a two-sided alternative.

³ Note that it is always the normal distribution that is referred to when testing for differences between two simple proportions - never the t-distribution. However, as when using the chi-squared test, care must be taken if there are fewer than five tokens in any of the four possible categories.
Testing for differences between two populations

be whenever a test has found them to be significantly different. An approximate 95% confidence interval for the difference in two proportions is calculated by the formula:

|p̂1 − p̂2| ± 1.96 × √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2)

i.e. 0.2667 ± (1.96 × √(0.5 × 0.5/30 + 0.7667 × 0.2333/30))

i.e. 0.032 to 0.501

Note that this interval does not include the value zero, although we would expect it to, since the hypothesis that p1 = p2, or p1 − p2 = 0, was not rejected at the 5% level (see chapter 7). This occurs because the confidence interval is not exact. It will not be sufficiently incorrect to mislead seriously, especially if the exact test of hypothesis is always carried out first.
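The test statistic of this section and the approximate confidence interval above can be sketched in a few lines of Python. The token counts 23/30 and 15/30 are inferred here from the proportions used in the worked example; the formulas themselves are the chapter's own.

```python
from math import sqrt

def two_proportion_z(x1, n1, x2, n2):
    """Continuity-corrected z statistic for comparing two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    correction = 0.5 * (1 / n1 + 1 / n2)
    denom = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (abs(p1 - p2) - correction) / denom

def proportion_diff_ci(x1, n1, x2, n2, z_crit=1.96):
    """Approximate 95% confidence interval for the difference |p1 - p2|."""
    p1, p2 = x1 / n1, x2 / n2
    half = z_crit * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    centre = abs(p1 - p2)
    return centre - half, centre + half

z = two_proportion_z(23, 30, 15, 30)        # proportions 0.7667 and 0.5000
low, high = proportion_diff_ci(23, 30, 15, 30)
print(round(z, 3), round(low, 3), round(high, 3))   # prints: 1.875 0.032 0.501
```

The z value just fails to reach the two-sided 5% critical value of 1.96, and the interval reproduces the 0.032 to 0.501 obtained above.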
11.4 Paired samples: comparing two means
Table 11.5 presents data on the error gravity scores of ten native English-speaking teachers and ten Greek teachers of English for 32 sentences which appeared in the compositions of Greek-Cypriot learners of English. (Note that the data for Greek teachers is adapted from that presented in Hughes & Lascaratou 1981, for purposes of exposition.) A summary of the data in the notation of the present chapter appears in table 11.4.

Table 11.4. Summary of data from table 11.5

                            Group 1 (English)   Group 2 (Greek)
Sample mean                 X̄1 = 25.03          X̄2 = 28.28
Sample standard deviation   s1 = 6.25           s2 = 7.85
Sample size                 32                  32

First let us test the assumption that the variances in the two populations are equal, H0: σ1 = σ2, against the hypothesis H1: σ1 ≠ σ2. The test statistic is:

F = larger s²/smaller s² = 7.85²/6.25² = 61.62/39.06 = 1.58

with 31 df in both numerator and denominator. From tables of the F-distribution we can see that this is not significant and we will assume that the population variances are equal.

Let us now test the hypothesis that the two groups of teachers give the same error gravity scores on average, i.e. H0: μ1 = μ2 versus H1: μ1 ≠ μ2. The situation is different from the previous examples in this chapter inasmuch as it is possible to compare the scoring of the Greek teachers and the English teachers in respect of each of the 32 sentences, not just in terms of their overall scoring. There is likely, on average, to be a correlation between scores awarded by different groups of judges on the same error. It is in this sense that we refer to the samples as 'correlated'.

When we have correlated samples we follow a rather different procedure in tests for a difference between population means. This will be detailed below, but first of all let us see what results we would obtain if (mistakenly!) we proceeded as if, as in the above examples, the samples were independent (i.e. uncorrelated). The test statistic would be:

t = (X̄1 − X̄2) / √(s²/32 + s²/32)

where:

s² = (31s1² + 31s2²)/62

This gives:

t = −1.83 with 62 df.

This is not significant at the 5% level. Apparently there is little evidence that errors are judged more severely by the Greek teachers. However, we have not used all the information in the experiment. The test we have carried out would be the only one possible if we had only the summary values of table 11.4. It ignores the fact that since each group of teachers assessed the same 32 sentences, and we know their scores on each sentence, we can compare their severity scores for the individual errors. In table 11.5 we present the total data set. In the last column of the table we have given the value obtained by subtracting the total score of the Greek teachers from the total score of the English teachers for each sentence individually. Some of these differences will be positive and others negative; it is important to be consistent in the order of subtraction and evaluate the sign properly. Now, the null hypothesis we wanted to test was that, on average, the English and the Greek teachers give equal scores for the errors, H0: μ1 = μ2. This is logically equivalent to testing the hypothesis that, on average, the difference in item scores is zero. Since di = Xi − Yi, the population mean difference, μd, will have the value μd = μ1 − μ2.
Table 11.5. Total error gravity scores of ten native English teachers (X) and ten Greek teachers of English (Y) on 32 English sentences

Sentence    X     Y     d = X − Y
 1          22    36    −14
 2          16     9      7
 3          42    29     13
 4          25    35    −10
 5          31    34     −3
 6          36    23     13
 7          29    25      4
 8          24    31     −7
 9          29    35     −6
10          18    21     −3
11          23    33    −10
12          22    13      9
13          31    22      9
14          21    29     −8
15          27    25      2
16          32    25      7
17          23    39    −16
18          18    19     −1
19          30    28      2
20          31    41    −10
21          20    25     −5
22          21    17      4
23          29    26      3
24          22    37    −15
25          26    34     −8
26          20    28     −8
27          29    33     −4
28          18    24     −6
29          23    37    −14
30          25    33     −8
31          27    39    −12
32          11    20     −9
H0: μ1 = μ2 is the same hypothesis as H0: μ1 − μ2 = 0, which in turn can be written as H0: μd = 0.

We already know how to test the last hypothesis. In chapter 8 we introduced a test of the null hypothesis that a sample was drawn from a population with a given population mean. The 32 differences in the last column of table 11.5 can be considered as a sample of the differences that might arise from two randomly chosen groups of these types assessing these errors. Following the procedure of 8.3, a suitable statistic to test the null hypothesis H0: μd = 0 is:

t = (d̄ − 0)/(s/√n) = d̄/(s/√n), with 31 df

where d̄ is the mean of the observed differences, s is the standard deviation of the sample of 32 differences and n is the sample size, 32. We find:

d̄ = −3.25, s = 8.32

and therefore that t = −2.21 with 31 df. From tables of the t-distribution we find that this value is significant at the 5% level, giving some reason to believe that there is a real difference between the scores of the teachers of the different nationalities.

Let us summarise what has occurred here. In the opening section of the chapter we presented a procedure for testing the null hypothesis that two population means are equal. We began the present section by carrying out that test on the error gravity score data of table 11.5. The conclusion we reached was that the null hypothesis of equal scores for the two groups could not be rejected even at the 10% level. We then carried out another test of the same hypothesis using the same data and found we could reject it at the 5% significance level. Is there a contradiction here?

There is none. It is important to realise that the second test made use of information about the differences between individual items which was ignored by the first. Indeed the first test could have been carried out in exactly the same way even if the two groups had scored different sets of randomly chosen student errors. By matching the items for the groups we have eliminated one source of variability in the experiment and increased the sensitivity of the hypothesis test.

This paired comparison or correlated samples t-test will frequently be relevant and it is usually good experimental practice to design studies to exploit its extra sensitivity. However, it requires exactly the same assumptions as the test which uses independent samples; the two populations being compared should be approximately normally distributed and have equal variances.

A 95% confidence interval for the difference between the two population means can be calculated as:

d̄ ± (constant × s/√n)

where the constant is the 5% significance value of the t-distribution with (n − 1) df. For this example we have:

|−3.25| ± (2.04 × 8.32/√32)

i.e. 0.2 to 6.3
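Both analyses of table 11.5 can be reproduced from the raw scores. The following standard-library sketch computes the (mistaken) independent-samples t and the paired-samples t side by side; library routines such as scipy.stats.ttest_ind and ttest_rel would compute the same statistics.

```python
import statistics
from math import sqrt

# Columns X and Y of table 11.5.
english = [22, 16, 42, 25, 31, 36, 29, 24, 29, 18, 23, 22, 31, 21, 27, 32,
           23, 18, 30, 31, 20, 21, 29, 22, 26, 20, 29, 18, 23, 25, 27, 11]
greek   = [36, 9, 29, 35, 34, 23, 25, 31, 35, 21, 33, 13, 22, 29, 25, 25,
           39, 19, 28, 41, 25, 17, 26, 37, 34, 28, 33, 24, 37, 33, 39, 20]
n = len(english)

# Independent-samples t, which ignores the pairing (equal sample sizes,
# so the pooled variance is the simple mean of the two sample variances).
s2 = (statistics.variance(english) + statistics.variance(greek)) / 2
t_ind = (statistics.mean(english) - statistics.mean(greek)) / sqrt(s2 * 2 / n)

# Paired-samples t, using the 32 differences d = X - Y.
d = [x - y for x, y in zip(english, greek)]
t_paired = statistics.mean(d) / (statistics.stdev(d) / sqrt(n))

print(f"independent: t = {t_ind:.2f} (62 df); paired: t = {t_paired:.2f} (31 df)")
# prints: independent: t = -1.83 (62 df); paired: t = -2.21 (31 df)
```

The two statistics reproduce the contrast discussed above: the unpaired analysis misses the difference that the paired analysis detects.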
11.5 Relaxing the assumptions of normality and equal variance: nonparametric tests
We have seen that experimental situations do arise (11.2) where the assumption that two populations have equal variances may be untenable, and that this will affect the validity of some of the tests introduced above. We may also have doubts about the other assumption, necessary except for large samples, that both samples are drawn from normally distributed populations. Occasions will arise when we have samples which are so small that there is a need to worry about this assumption as well. It is possible to carry out a test of the hypothesis that the samples come from two populations with similar characteristics without making any assumptions about their distribution. It will, of course, still be necessary that our data consist of proper random samples in which the values are independent observations. There is a great number of such tests, collected under the general heading of nonparametric tests - tests which require no special distributional assumptions - and we will present just two examples here. The chi-squared test for association in contingency tables in chapter 9 and the test for significant rank correlation in chapter 10 are two nonparametric tests we have already presented. A larger selection can be found in Siegel (1956).

Suppose that, as part of a study like that of Macken & Barton (1980a) (see 11.1), a child aged 2;0 is observed for tokens of /g/ and /k/ in the same phonological environment, and that the VOTs in milliseconds for the observed tokens were:

for /g/: 38, 195, 56, 3, 51, 89 (six tokens)
for /k/: 125, 73, 138, 35, 51, 190, 169 (seven tokens)

Despite the small number of tokens, we can test the null hypothesis that the VOTs for the two consonants are centred on the same value by means of a two-sample Mann-Whitney rank test. We begin by putting all 13 observations into a single, ranked list, keeping note of the sample in which each observation arose, as in table 11.6. It will not matter whether the ranking is carried out in ascending or descending order. Note how we have dealt with the tied value of 51 ms (see 10.6).

Table 11.6. Ranking of VOTs for /g/ and /k/ from a single child

Value    3    35   38   51   51   56   73   89   125  138  169  190  195
Source   /g/  /k/  /g/  /k/  /g/  /g/  /k/  /g/  /k/  /k/  /k/  /k/  /g/
Rank     1    2    3    4.5  4.5  6    7    8    9    10   11   12   13

Now, sum the ranks for the smaller of the two samples. If both samples are the same size then sum the ranks of just one of them. In this case the smaller sample consists of the VOTs for the child's six tokens of /g/; T, the sum of the relevant ranks, is given by:

T = 1 + 3 + 4.5 + 6 + 8 + 13 = 35.5

If m is the size of the sample whose ranks have been summed (m = 6 in this example) and n is the size of the other sample (n = 7), we calculate two statistics, U1 and U2, as follows:

U1 = mn + m(m + 1)/2 − T

U2 = mn − U1

Here we have:

U1 = 6 × 7 + (6 × 7)/2 − 35.5 = 27.5

U2 = 6 × 7 − 27.5 = 14.5

We then refer the smaller of these values to the corresponding value of table A9. (We have given the 5% significance values in this table; see Siegel if other critical values are required.) Since 14.5 is greater than the tabulated value of 6 we do not reject the equality of mean VOTs at the 5% significance level. The table of critical values we have supplied allows only for the situation where the larger of the two samples contains a maximum of 20 observations. For larger samples the sum of the ranks, T, can be used to create a different test statistic:

Z = (T − (m/2)(m + n + 1)) / √(mn(m + n + 1)/12)

which can then be compared with critical values of the standard normal distribution (table A3) to see whether the null hypothesis of equal population means can be rejected.
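The ranking procedure can be sketched as follows; the midrank helper (a name introduced here, not from the text) implements the tied-rank convention of 10.6.

```python
def rank_sum_u(sample, other):
    """Mann-Whitney statistics: rank all observations together, giving tied
    values the mean of the ranks they occupy, then sum the ranks of `sample`."""
    pooled = sorted(sample + other)

    def midrank(v):
        first = pooled.index(v) + 1        # 1-based rank of first occurrence
        count = pooled.count(v)
        return first + (count - 1) / 2     # mean rank over the tied run

    t = sum(midrank(v) for v in sample)    # T: rank sum of the chosen sample
    m, n = len(sample), len(other)
    u1 = m * n + m * (m + 1) / 2 - t
    return t, u1, m * n - u1

g = [38, 195, 56, 3, 51, 89]
k = [125, 73, 138, 35, 51, 190, 169]
t, u1, u2 = rank_sum_u(g, k)
print(t, u1, u2)   # prints: 35.5 27.5 14.5
```

The smaller value, min(U1, U2) = 14.5, is then compared with the tabulated 5% value of 6, reproducing the decision above.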
There are likewise various nonparametric tests which can be used to test hypotheses in paired samples. Suppose that 22 dysphasic patients have the extent of their disability assessed on a ten-point scale by two
different judges. Suppose that for 13 patients judge A assessed their condition to be much more serious (i.e. have a higher score) than judge B, for five patients the reverse is true and for the remaining four patients the judges agree. We can test the null hypothesis H0: the judges are giving the same assessment scores, on average, versus the two-sided alternative H1: on average the assessment scores of the judges are different, although it is unlikely that the assessment scores are normally distributed and the size of the sample is not large enough for us to rely on the Central Limit Theorem.

The procedure is to mark a subject with a plus sign if the first judge gives a higher score, or with a minus sign if the first judge gives a lower score. Subjects who receive the same score from both judges are left out of the analysis. Note the number of times, S, that the less frequent sign appears and the total number, T, of cases which are marked with one or other sign. Here S = 5 and T = 18.

Now enter table A10 using the values of S and T. Corresponding to S = 5 and T = 18 we find the value 0.096. These tabulated values are the significance levels corresponding to the two-tailed test. Hence we could say here that H0 could be rejected at the 10% significance level but not at the 5% level (P = 0.096 = 9.6%). For the one-sided alternative we first of all have to ask whether there are any indications that the alternative is true. For example, if we had used in the above example the one-sided alternative H1: judge B scores more highly, on average, there would be no point in a formal hypothesis test, since judge B has actually scored fewer patients more highly than judge A. If the direction of the evidence gives some support for a one-sided alternative, then table A10 should be entered as before using the values of S and T but the significance level should be halved. We have seen this before with table A3 and table A4.

Only values of T up to T = 25 are catered for by table A10. If T is greater than 25, the test can still be carried out by calculating S and T in the same way and then using the test statistic:

Z = (T − 2S − 1)/√T

which should be referred to table A3, critical values of the standard normal distribution. For example, if we carry out a sign test on the error gravity scores of table 11.5 we have T = 32, S = 11 (there are 11 positive and 21 negative differences) so that Z = (32 − 22 − 1)/√32, i.e. Z = 1.59, which is not a significantly large value of the standard normal distribution (table A3).
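The large-sample sign test applied to the 32 differences of table 11.5 can be sketched as:

```python
from math import sqrt

def sign_test_z(diffs):
    """Large-sample sign test: Z = (T - 2S - 1)/sqrt(T), where T counts the
    non-zero differences and S is the frequency of the rarer sign."""
    signs = [v for v in diffs if v != 0]   # zero differences are left out
    t = len(signs)
    plus = sum(1 for v in signs if v > 0)
    s = min(plus, t - plus)
    return (t - 2 * s - 1) / sqrt(t)

# The 32 sentence differences (English minus Greek) from table 11.5:
d = [-14, 7, 13, -10, -3, 13, 4, -7, -6, -3, -10, 9, 9, -8, 2, 7,
     -16, -1, 2, -10, -5, 4, 3, -15, -8, -8, -4, -6, -14, -8, -12, -9]
print(round(sign_test_z(d), 2))   # prints: 1.59
```

The value 1.59 falls short of the two-sided 5% critical value of 1.96, matching the conclusion in the text.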
11.6 The power of different tests
In chapter 8, where the basic concepts of statistical hypothesis testing were introduced, we discussed the notion of the two types of error which were possible as a result of a test of hypothesis: type 1 error, the incorrect rejection of a valid null hypothesis, and type 2, the failure to reject the null when it is mistaken. Let us recapitulate the results of the tests we have carried out in the current chapter on the error gravity scores of the two groups of judges (table 11.5). The paired sample t-test which we carried out in 11.4 resulted in the conclusion that there was a difference, significant at the 5% level, between the mean scores of the two groups. The sign test carried out on the same data in 11.5 could not detect a difference, not even at the 5% significance level, which is the weakest level of evidence which, by general convention, most experimenters would require in order to claim that 'the null hypothesis can be rejected'. It is important to understand why this has come about. Assessing a statistical hypothesis is similar to many other kinds of judgement. The correctness of the conclusion will depend on two things: the quantity of the information available and its quality. The two tests in question use different information.

It is an assumption of both tests that the observations were collected as an independent random sample. The fact that the observations were collected in this way can therefore be seen as a piece of information that both tests use. A second piece of information that both tests use is the direction of the difference in each pair (represented simply by plus or minus in the case of the sign test). But the paired sample t-test makes use of additional, different, information. First, it makes use of the fact that the populations of scores are normally distributed (one of its necessary assumptions). Secondly, it uses not only the direction but also the size of the difference in pairs (the information contained in the last column of table 11.5). Since the t-test is based on richer information, it is more sensitive to differences in the population means and will more readily give a significant value when such differences exist (and for this reason may be referred to as a more powerful test). In other words, for any set of data the sign test will be more likely than the t-test to cause a type 2 error. However, if the assumption about the parent populations which underlies the t-test, i.e. that they are normally distributed, is not justified, the likelihood of a type 1 error will be higher than the probability indicated by the t-value. The apparent significance of the test result can be exaggerated.

Which test then is it more appropriate to use? There is no simple answer.
A sensible procedure might be first to use the sign test. If a significant result is thereby obtained there is really no need to go on to carry out a t-test. However, there is no way to calculate a confidence interval for the size of the difference without assuming normality anyway. If a t-test is carried out, the researcher should be aware of the consequences of the possible failure to meet the assumptions of the test.

SUMMARY
This chapter has looked at various procedures for testing for differences between two groups.

(1) The t-test for independent samples to test H0: μ1 = μ2 uses the test statistic:

t = (X̄1 − X̄2) / √(s²/n1 + s²/n2)

where:

s² = ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)

which has a t-distribution with (n1 + n2 − 2) df when H0 is true.

(2) To compare two proportions estimated from independent samples the test statistic:

z = (|p̂1 − p̂2| − ½(1/n1 + 1/n2)) / √(p̂(1 − p̂)(1/n1 + 1/n2))

should be referred to tables of the standard normal distribution.

(3) The F-test for comparison of two variances was explained. The test statistic was F = (larger s²)/(smaller s²), to be compared with critical values of the F-distribution with (n1 − 1, n2 − 1) df.

(4) The paired samples t-test for testing for differences between two means was presented. The test is carried out by calculating d̄ and s from the differences and then comparing t = d̄/(s/√n) with tables of the t-distribution with (n − 1) df.

(5) Two nonparametric tests were explained: the Mann-Whitney test for independent samples and the sign test for paired samples.

EXERCISES
(1) In chapter 8 we discussed a sample of British children whose comprehension vocabularies were measured. The mean vocabulary for a sample of 140 children was 24,800 words with a standard deviation of 4,200. If a random sample of 108 American children has a mean vocabulary of 24,000 words with a standard deviation of 5,931, test the hypothesis that the two samples come from populations with the same mean vocabulary.
(2) Table 10.1 gives the total error gravity scores for ten native English speakers who are not teachers. In table 11.5 can be found the scores of ten Greek teachers of English on the same errors. Test the hypothesis that the two groups give the same error gravity scores, on average.
(3) Calculate a 95% confidence interval for the difference between the mean error gravity scores of the two groups in exercise 11.2.
(4) For the data of table 9.6 on vowel epenthesis in Rennellese, use the procedure of 11.3 to test whether reduplication is equally likely in initial and medial position.
(5) Using the same data as in exercise 11.2, test whether the two sets of error scores come from populations with equal variance.
(6) A sample of 14 subjects is divided randomly into two groups who are asked to learn a set of 20 vocabulary items in an unfamiliar language, the items being presented in a different format to the two groups but all subjects being allowed the same time to study the items before being tested. The numbers of errors recorded for each of the subjects are:

Format A: 3 4 11 6 8 2
Format B: 1 5 8 7 9 14 6 8

(Two of the students in the first group dropped out without taking the test.) Test whether the average number of errors is the same under both formats.
12 Analysis of variance - ANOVA

In the last chapter we explained how it was possible to test whether two sample means were sufficiently different to allow us to conclude that the samples were probably drawn from populations with different population means. When more than two different groups or experimental conditions are involved we have to be careful how we test whether there might be differences in the corresponding population means. If all possible pairs of samples are tested using the techniques suggested in the previous chapter, the probability of type 1 errors will be greater than we expect, i.e. the 'significance' of any differences will be exaggerated. In the present chapter we intend to develop techniques which will allow us to investigate possible differences between the mean results obtained from several (i.e. more than two) samples, each referring to a different population or collected under different circumstances.

12.1 Comparing several means simultaneously: one-way ANOVA
Imagine that an investigator is interested in the standard of English of students coming to Britain for graduate training. In particular he wishes to discover whether there is a difference in the level of English proficiency between groups of distinct geographical origins - Europe, South America, North Africa and the Far East. As part of a pilot study he administers a multiple choice test to 40 graduate students (10 from each area) drawn at random from the complete set of such students listed on a central file. The scores obtained by these students on the test are shown in table 12.1.

Table 12.1. Marks in a multiple choice vocabulary test of 40 candidates for the Cambridge Proficiency of English examination from four different regions

                                        Groups
                            1         2         3         4
                                      South     North     Far
                            Europe    America   Africa    East
                            10        33        26        26
                            19        21        25        21
                            24        25        19        25
                            17        32        31        22
                            29        16        15        11
                            37        16        25        35
                            32        20        23        18
                            29        13        32        12
                            22        23        20        22
                            31        20        15        21
Total                       250       219       231       213
Mean                        25.0      21.9      23.1      21.3
Sample standard deviation   8.138     6.607     5.915     6.897
Sample variance             66.222    43.655    34.988    47.567

The means for the four samples do not have exactly the same value - we would not expect that. However, we might ask if the observed variation in the means is of the order that we could expect from four different random samples, each drawn from the same population of test scores, or are the differences sufficiently large to indicate that students from certain areas are more proficient in English, on average, than students from other areas? This is a generalisation of the problem discussed in 11.1 for the comparison of two samples to test whether they were drawn from populations with different mean values. It might seem that the solution presented there could be applied here, by comparing these groups of candidates in pairs: the Europeans with the North Africans, the South Americans with the Europeans, and so on. Unfortunately, it can be demonstrated theoretically that doing this leads to an unacceptable increase in the probability of type 1 errors. If all six different pairs are tested at the 5% significance level there will be a much bigger than 5% chance that at least one of the tests will be found to be significant even when no population differences exist. The greater the number of samples observed, the more likely it will be that the difference between the largest sample mean and the smallest sample mean will be sufficiently great - even when all the samples are chosen from the same population - to give a significant value when a test designed for just two samples is used. We need a test which will take into account the total number of comparisons we are making. Such a test can be constructed by means of an analysis of variance, usually contracted to ANOVA or ANOVAR.

As usual, there will be some assumptions that must be met in order for the test to be applied, i.e. that each sample comes from a normally
distributed population and that the four populations of candidates' scores all have the same variance, σ². The data in table 12.1 consist of four groups of scores from four populations. Suppose that the i-th population has mean μi, so that Group 1 is a sample of scores from a population of scores, normally distributed with mean μ1 and variance σ²; and so on. The null hypothesis we will test shortly is H0: μ1 = μ2 = μ3 = μ4 against the alternative that not all the μi have the same value.

Assuming that each sample comes from a population with variance σ², the four different sample variances are four independent estimates of the common population variance. These can be combined, or pooled, into a single estimate by multiplying each estimate by its degrees of freedom (the sample size minus one), summing the four products and dividing the total by the sum of the degrees of freedom of the four sample variances, thus:

pooled variance estimate = ((n1 − 1)s1² + (n2 − 1)s2² + (n3 − 1)s3² + (n4 − 1)s4²) / (n1 + n2 + n3 + n4 − 4)

(If you glance back to 11.1 you will recognise this as a direct generalisation of the method we used to estimate the common variance when we wished to compare only two samples.)

This estimate of the population variance is often called the within-samples estimate of variance since it is obtained by first calculating the variances within each sample and then combining them. We will refer to this as sw². In the example of table 12.1, all the samples are the same size, n1 = n2 = n3 = n4 = 10, so that:

sw² = ((9 × 66.222) + (9 × 43.655) + (9 × 34.988) + (9 × 47.567)) / 36 = 48.11

Now, let us suppose for the moment that the null hypothesis is true, and that the common population mean value is μ. In that case (chapter 5), each of the four sample means is an observation from a normal distribution with mean μ and variance σ²/10. (Since we know that in general the standard deviation of a mean has the value σ/√n, its variance will be σ²/n.) In other words, the four sample means constitute a random sample from that distribution and the variance of this random sample of four means is an estimate of the population variance, σ²/10. The sample means are 25.0, 21.9, 23.1, 21.3. Treating these four values as a random sample of four observations, we can calculate the sample standard deviation in the normal way. We find that it is 1.632, and hence the sample variance, s², is 1.632² = 2.662. Since this is an estimate of σ²/10, multiplying it by 10 gives a new estimate of σ², called the between-groups estimate of variance, sb², since it measures the variation across the four sample means. We now have calculated sw² = 48.11 and sb² = 10 × 2.662 = 26.62. If the null hypothesis is true, both of these are estimates of the same quantity, σ², and it can be shown that the ratio of the estimates:

F = sb²/sw²

has an F-distribution with 3 and 36 df. There are 3 df in the numerator since it is the sample variance of four observations, and 36 in the denominator since it is a pooled estimate from four samples each of which had 10 observations and each of which hence contributed 9 df to calculate the sample variance. The F-distribution has appeared already in 11.2 as the test statistic for comparing two variances.

If the null hypothesis is not true, sb² will tend to be larger than sw² because the variability in the four sample means will be inflated by the differences between the population means. Large values of the F statistic therefore throw doubt on the validity of the null hypothesis. In the case of the multiple choice test scores, we have sb²/sw² = 26.62/48.11 = 0.55. The 5% critical value of F3,36 is just bigger than that of F3,40, which is 2.84, so that the value obtained from the data is not significant and there are no grounds for claiming differences between the groups. In other words, these data do not support the view that graduate students coming to Britain differ in their command of English according to their geographical origin.
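Both variance estimates can be checked directly from table 12.1. A minimal standard-library sketch (statistics.variance uses the n − 1 divisor assumed throughout this chapter):

```python
import statistics

scores = {
    "Europe":        [10, 19, 24, 17, 29, 37, 32, 29, 22, 31],
    "South America": [33, 21, 25, 32, 16, 16, 20, 13, 23, 20],
    "North Africa":  [26, 25, 19, 31, 15, 25, 23, 32, 20, 15],
    "Far East":      [26, 21, 25, 22, 11, 35, 18, 12, 22, 21],
}
n = 10  # observations per group

# Within-groups estimate: pool the four sample variances.  With equal
# group sizes the pooled value is simply their mean.
s_w2 = statistics.mean(statistics.variance(g) for g in scores.values())

# Between-groups estimate: n times the variance of the four sample means.
means = [statistics.mean(g) for g in scores.values()]
s_b2 = n * statistics.variance(means)

print(f"sw2 = {s_w2:.2f}, sb2 = {s_b2:.2f}, F = {s_b2 / s_w2:.2f}")
# prints: sw2 = 48.11, sb2 = 26.62, F = 0.55
```

The printed values match those derived in the text, and F = 0.55 is well below the 5% critical value of F3,36.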
The description just given of the analysis of variance procedure is not the most usual way in which the technique is presented. ANOVA, as we will see, is a rather general technique which can be applied to the comparison of means in data with quite complex structure. It is convenient, therefore, to have a method of calculating all the required quantities which will generalise easily. For this reason we will now repeat the analysis of the multiple choice test scores using the more common and general method. The analysis is a particular example of a one-way analysis of variance - the comparison of the means of groups which are classified according to a single (hence 'one-way') criterion variable, linguistic/geographical origin in this example. During the presentation of the alternative analysis we will take the opportunity to state the problem in a completely general way.

Suppose that samples of size n have been taken from each of m populations. We will write Yij for the j-th observation in the i-th group. For
flnatyszs Of vanance- fUVU V.fl. One-way ANOVA
example, in .table 12.1, Y4 ,7 = 18, the score of the seventh Far Eastern Table I2.2. ANOVA table for the data of table 12.1
(group 4) candidate. As is common when analysis of variance is presented, we write Yi. to mean the total of the observations of group i. That is:

    Yi. = ΣYij   (summed over j = 1, ..., n)

For our example:

    Y1. = 250,  Y2. = 219,  Y3. = 231,  Y4. = 213

The grand total of all the observations is designated Y.., so that:

    Y.. = 913

Since we have m samples (m = 4) each of size n (n = 10),¹ we have mn (4 × 10 = 40) observations in all. A term, usually called the correction factor or CF, is now calculated by:

    CF = Y..²/mn = 913²/40 = 20839.225

(It is often necessary, when calculating for an ANOVA, to keep a large number of figures in the intermediate calculations.)
We now calculate the total sum of squares, TSS, which is the sum of the between-groups sum of squares and the within-groups sum of squares. (The latter is often called the residual sum of squares (RSS) for a reason which will become apparent shortly.)

    TSS (total sum of squares) = ΣYij² - CF
        = 10² + 19² + ... + 22² + 21² - CF
        = 22651 - 20839.225
        = 1811.775

    between-groups SS = ΣYi.²/n - CF
        = (250² + 219² + 231² + 213²) ÷ 10 - CF
        = 79.875

    within-groups SS = total SS - between-groups SS
        = 1811.775 - 79.875
        = 1731.9

The within-groups sum of squares is the quantity left when the between-groups sum of squares is subtracted from the total sum of squares - hence the term 'residual sum of squares'. An ANOVA table is now constructed - table 12.2.

Table 12.2. ANOVA for data of table 12.1

    Source                     df    SS         MSS     F-ratio
    Between groups              3      79.875   26.62   F3,36 = 0.55
    Within groups (residual)   36    1731.9     48.11
    Total                      39    1811.775

The first column in the table gives the source of the sums of squares - between-groups, residual and total. The second column gives the degrees of freedom which are used to calculate the different variance estimates, i.e. 3 for between-groups and 36 (4 × 9) for the within-groups estimates, as we had in the first analysis above. Generally the between-groups degrees of freedom will be m - 1, one less than the number of groups, the total available degrees of freedom will be mn - 1, one less than the total number of observations, and the residual degrees of freedom are obtained by subtraction (see table 12.3, which is a general ANOVA table for one-way ANOVA of m samples each containing n observations). The fourth column of table 12.2, the mean sum of squares, is obtained by dividing each sum of squares by its degrees of freedom. Note that the values obtained at this stage are exactly the between-groups variance estimate and within-groups variance estimate that we calculated previously. The final column then gives, on the row corresponding to the source which is to be tested for differences (in this case between-groups), the F-ratio statistic required for the hypothesis test. It is important that you learn to interpret such tables, for two reasons. The first is that researchers often present their results in this way. The second is that, especially for complex data structures, you may perhaps not carry out the calculations by hand, leaving that to a computer package. The output from the package will usually contain an ANOVA table of some form.

Table 12.3. General ANOVA table for one-way ANOVA of m samples each containing n observations

    Source                     df         SS     MSS                   F-ratio
    Between groups             m - 1      BSS    sb² = BSS/(m - 1)     sb²/sr²
    Within groups (residual)   m(n - 1)   RSS    sr² = RSS/m(n - 1)
    Total                      mn - 1     TSS

¹ It is not necessary that the samples be of the same size, though experiments are often designed to make such groups equal. However, the general exposition becomes rather cumbersome if the sample sizes are different.
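The arithmetic above can be checked in a few lines of code. This is a sketch (the variable names are ours); it needs only the group totals and the sum of squared scores, 22651, quoted in the text, not the raw data.

```python
# One-way ANOVA of the multiple choice test scores, from the totals above.
m, n = 4, 10                                 # m groups of n observations each
group_totals = [250, 219, 231, 213]          # Y1., Y2., Y3., Y4.
sum_sq_scores = 22651                        # sum of all squared scores Yij**2
grand_total = sum(group_totals)              # Y.. = 913
CF = grand_total ** 2 / (m * n)              # correction factor = 20839.225
TSS = sum_sq_scores - CF                     # total sum of squares
BSS = sum(t ** 2 for t in group_totals) / n - CF   # between-groups SS
RSS = TSS - BSS                              # within-groups (residual) SS
F = (BSS / (m - 1)) / (RSS / (m * (n - 1)))  # F-ratio with 3 and 36 df
print(round(CF, 3), round(TSS, 3), round(BSS, 3), round(F, 2))
```

The printed values reproduce the entries of table 12.2 line by line.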
Analysis of variance - ANOVA
12.2 Two-way ANOVA: randomised blocks
In chapter 10 we saw that the sensitivity of a comparison between two means could be improved by pairing the observations in the two samples. This idea can be extended to the comparison of several means. Table 12.4 repeats the data on gravity of errors analysed in chapter 11, but now extended to three groups of judges, the third group consisting of the ten English non-teachers. There are now three ways to divide up the variability: variation in scores between m groups of judges, variation in scores between the n different errors, and residual, random variation. The necessary calculations and the resulting table are similar to those found in the one-way case, but with an extra item, the between-errors sum of squares, added.
We begin by calculating the totals, displayed in table 12.4:

    Yi.  the total for the i-th error
    Y.j  the total for the j-th set of judges
    Y..  the grand total

Table 12.4. Total error gravity scores of ten native English teachers (1), ten Greek teachers of English (2) and ten native English non-teachers (3) on 32 English sentences

    Sentence       1     2     3    Total (Yi.)
        1         22    36    22        80
        2         16     9    18        43
        3         42    29    42       113
        4         25    35    21        81
        5         31    34    26        91
        6         36    23    41       100
        7         29    25    26        80
        8         24    31    20        75
        9         29    35    18        82
       10         18    21    15        54
       11         23    33    21        77
       12         22    13    19        54
       13         31    22    39        92
       14         21    29    23        73
       15         27    25    24        76
       16         32    25    29        86
       17         23    39    18        80
       18         18    19    16        53
       19         30    28    29        87
       20         31    41    22        94
       21         20    25    12        57
       22         21    17    26        64
       23         29    26    43        98
       24         22    37    26        85
       25         26    34    22        82
       26         20    28    19        67
       27         29    33    30        92
       28         18    24    17        59
       29         23    37    15        75
       30         25    33    15        73
       31         27    39    28        94
       32         11    20    14        45
    Total (Y.j)  801   905   756      2462

We calculate, as before, a correction factor by:

    CF = Y..²/mn = 2462²/(3 × 32) = 63140.04

Then the total sum of squares:

    TSS = ΣYij² - CF
        = 22² + 16² + ... + 28² + 14² - CF
        = 68742 - 63140.04
        = 5601.96

The between-errors sum of squares:

    ESS = ΣYi.²/m - CF
        = (80² + 43² + ... + 94² + 45²) ÷ 3 - CF
        = 198096 ÷ 3 - CF
        = 2891.96

The between-groups (of judges) sum of squares:

    GSS = ΣY.j²/n - CF
        = (801² + 905² + 756²) ÷ 32 - CF
        = 365.02

Note that the divisor in each sum of squares calculation is just the number of observations which have gone into each of the values being squared. For example, in GSS we have (801² + 905² + 756²) ÷ 32 because each of the values 801, 905 and 756 is the sum of 32 data values. The corresponding ANOVA is presented in table 12.5.

Table 12.5. ANOVA for data of table 12.4

    Source                      df    SS         MSS       F-ratio
    Between errors              31    2891.96     93.29    F31,62 = 2.47
    Between groups of judges     2     365.02    182.51    F2,62 = 4.83
    Residual                    62    2344.98     37.82
    Total                       95    5601.96

As before, the residual sum of squares and the residual degrees of freedom are calculated by subtraction from the total sum of squares and total degrees of freedom respectively. The F-ratio for groups of judges is 4.83 with 2 and 62 df and this is significant beyond the 2.5% level, clearly indicating differences in the scores of the three sets of judges. The question remains whether this is due to the Greek judges scoring differently from English judges (whether teachers or not), teachers (whether Greek or English) scoring differently from non-teachers, and so on. We will return to this question in 12.5.
Note that the F-ratio for comparing errors is significant beyond the 1% level, but to investigate this was not an important part of this analysis. We have simply taken account of the variability that this causes in the scores so that we can make a more sensitive comparison between groups. This type of experimental design is often called a randomised block design.
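The whole randomised-block analysis can be reproduced directly from the data of table 12.4. This is a sketch (variable names are ours); the residual degrees of freedom, (n - 1)(m - 1) = 62, come from the subtraction described above.

```python
# Two-way randomised-block ANOVA of the error gravity scores (table 12.4).
# scores[i] holds the three judges' totals for sentence i + 1.
scores = [
    (22, 36, 22), (16,  9, 18), (42, 29, 42), (25, 35, 21),
    (31, 34, 26), (36, 23, 41), (29, 25, 26), (24, 31, 20),
    (29, 35, 18), (18, 21, 15), (23, 33, 21), (22, 13, 19),
    (31, 22, 39), (21, 29, 23), (27, 25, 24), (32, 25, 29),
    (23, 39, 18), (18, 19, 16), (30, 28, 29), (31, 41, 22),
    (20, 25, 12), (21, 17, 26), (29, 26, 43), (22, 37, 26),
    (26, 34, 22), (20, 28, 19), (29, 33, 30), (18, 24, 17),
    (23, 37, 15), (25, 33, 15), (27, 39, 28), (11, 20, 14),
]
n, m = len(scores), len(scores[0])            # 32 errors, 3 groups of judges
grand = sum(sum(row) for row in scores)       # Y.. = 2462
CF = grand ** 2 / (m * n)                     # correction factor
TSS = sum(y ** 2 for row in scores for y in row) - CF
ESS = sum(sum(row) ** 2 for row in scores) / m - CF        # between errors
GSS = sum(sum(col) ** 2 for col in zip(*scores)) / n - CF  # between groups
RSS = TSS - ESS - GSS                         # residual, (n - 1)(m - 1) = 62 df
F_groups = (GSS / (m - 1)) / (RSS / ((m - 1) * (n - 1)))
print(round(GSS, 2), round(F_groups, 2))      # 365.02 4.83
```

Printing the other sums of squares and mean squares in the same way reproduces table 12.5 entry by entry.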

12.3 Two-way ANOVA: factorial experiments
It is often convenient and efficient to investigate several experimental variables simultaneously. The sociolinguist, for example, may be interested in both the linguistic context and the social context in which a linguistic token is used; a psycholinguist may wish to study how word recognition reaction times vary in different prose types and with subjects in different groups. Indeed it will only be possible to study the interaction between such variables if they are observed simultaneously. We will use again the multiple choice test scores of the four groups of graduate students to introduce the terminology and demonstrate the technique.
In table 12.6 we have given the same data as in table 12.1 but now cross-classified by geographical origin and sex. The style of presentation of this table is quite a common one for cross-classified data, with various totals given in the margins of the table (they are often referred to as marginal totals): total scores by sex, total scores by geographical location and subtotals by the origin by sex cross-classification.

Table 12.6. Marks of 40 subjects in a multiple choice test (the subjects are classified by geographical location and sex)

                        Geographical location
    Sex             Europe (1)  South America (2)  North Africa (3)  South East Asia (4)   Total

    Male (1)            10            33                 26                 26
                        19            21                 25                 21
                        24            25                 19                 25
                        17            32                 31                 22
                        29            16                 15                 11
    Subtotal            99           127                116                105               447

    Female (2)          37            16                 25                 35
                        32            20                 18                  …
                        29            13                 23                 32
                        31            23                 34                  …
                        22            20                 15                 22
    Subtotal           151            92                115                108               466

    Total              250           219                231                213               913

To discuss these data and describe any formulae for their analysis it is convenient to refer to them by means of three suffixes; Yijk will refer to the score of the k-th subject of the j-th sex who belongs to the i-th geographical location. Generalising the use of the dot notation introduced in the previous section we write:

    Yij. = total score of subjects belonging to the i-th location and j-th sex (e.g. Y31. = 116)
    Yi.. = total score of subjects at the i-th location (Y2.. = 219)
    Y.j. = total score of subjects of the j-th sex (Y.2. = 466)
    Y... = grand total = 913

An experiment designed to give this kind of data structure is usually called a factorial experiment, the different criterion variables being called factors. These 'factors' are entirely unrelated to those of factor analysis, a technique discussed in chapter 15. Here there are two factors, sex and geographical origin. The different values of each factor are often referred to as the levels of the factor. Sex has two levels, male and female, and geographical location has four.
We can use this single set of data to test independently two different null hypotheses: whether mean scores are the same between geographical origins and whether mean scores are the same for the two sexes. The calculations required are similar to those of the example in the previous section. We begin by calculating the correction factor, CF:

    CF = Y...²/40 = 913²/40 = 20839.225
and continue by obtaining the various sums of squares:

    total SS = ΣYijk² - CF
        = (10² + 19² + ... + 22² + 21²) - CF
        = 1811.775

    between-locations SS = ΣYi..²/10 - CF
        = (250² + 219² + 231² + 213²) ÷ 10 - CF
        = 79.875

    between-sexes SS = ΣY.j.²/20 - CF
        = (447² + 466²) ÷ 20 - CF
        = 9.025

and this leads to the ANOVA in table 12.7(a), from which we conclude from the small F-ratios that there is no significant difference between geographical locations (we came to the same conclusion in 12.1), and none between sexes. However, the analysis carried out thus far tests the differences between the sample means of the locations calculated over all the observations for an origin irrespective of the sex of the subject. Likewise, the sample means for the sexes are calculated over all 20 observations for each sex ignoring any difference in location. Calculated in that way, the sample mean score for males is 447 ÷ 20 = 22.35 and for females it is 23.30, so that they are rather similar. However, suppose we look to see if there are differences between sexes within some of the locations (refer back to table 12.6). For the North Africa and South East Asia samples the mean scores of the two sexes are still very similar, but among European students the females have apparently done rather better, while for the South Americans the reverse is the case. These differences cancel out when we look at the sex averages over all the locations simultaneously. What we are possibly seeing here is an interaction between the two factors. In other words, it may be that there is a difference between the mean scores of the sexes, but the extent of the difference depends on the geographical location of the subjects. Any difference between the levels of a single factor which is independent of any other factor is referred to as a main effect. Differences which appear only when two or more factors are examined together are called interactions. As a form of shorthand, main effects are often designated by a single letter, e.g. L for the variation in mean score of students from different locations, and S for the variation between sexes. Interaction effects are designated by the use of the different main effects symbols joined by one (or more) crosses, e.g. L×S for the interaction between location and sex. Provided there is more than one observation for each combination of the levels of the main factors it is possible to test whether significant interaction effects are present. For the multiple choice test scores data of table 12.6 we have observed five scores for each of the eight combinations of the levels of sex and location. (For factorial experiments it is important that each combination has been observed the same number of times. If that has not happened it is still possible to carry out an analysis of variance but the main effects cannot then be tested independently of one another and there may be difficulties of interpretation of the ANOVA. Furthermore, the calculations become much more involved and it is not really feasible to carry them out by hand - see 13.12.) To test for a significant interaction we simply expand the ANOVA to include an interaction sum of squares calculated by:

    interaction SS = ΣYij.²/5 - CF
        = (99² + 127² + ... + 115² + 108²) ÷ 5 - CF
        = 473.775

Table 12.7.
(a) ANOVA of main effects only from data of table 12.6

    Source               df    SS         MS       F-ratio
    Between locations     3      79.875    26.62   F3,35 = 0.54
    Between sexes         1       9.025     9.025  F1,35 = 0.18
    Residual             35    1722.875    49.225
    Total                39    1811.775

(b) ANOVA of main effects and interaction from data of table 12.6

    Source                  df    SS         MS        F-ratio
    Between locations (L)    3      79.875    26.62    F3,32 = 0.68
    Between sexes (S)        1       9.025     9.025   F1,32 = 0.23
    Interaction (L×S)        3     473.775   157.925   F3,32 = 4.05
    Residual                32    1249.100    39.03
    Total                   39    1811.775

The relevant ANOVA appears in table 12.7(b). The only extra feature requiring comment is that the degrees of freedom for the interaction term are obtained by multiplying the degrees of freedom of the main effects included in the interaction (here, 3 × 1 = 3): the symbol for the interaction effect, L×S, is a useful mnemonic for this. The F-ratio for testing the interaction effect is significant at the 1% level, showing that such effects
need to be considered. The practical implication of this would be that when considering possible differences between the scores of subjects of different sex we should not leave out of consideration their geographical origin.²

² In this case the division of subjects into 'male' and 'female' was entirely hypothetical, carried out to demonstrate the basic concept of 'interaction'.

We now go on to consider more generally the interpretation of main effects and interaction in ANOVA.
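The factorial arithmetic of 12.3 can be checked from the cell and marginal totals of table 12.6 alone. This is a sketch that follows the chapter's own formulas (variable names are ours); only the sum of squared scores, 22651, is taken from the text.

```python
# Factorial ANOVA of the test scores, from the totals of table 12.6.
# cell_totals[i] = (male subtotal, female subtotal) for location i + 1.
cell_totals = [(99, 151), (127, 92), (116, 115), (105, 108)]
reps = 5                                      # scores per sex-location cell
sum_sq_scores = 22651                         # sum of all squared scores
loc_totals = [a + b for a, b in cell_totals]  # 250, 219, 231, 213
sex_totals = [sum(c) for c in zip(*cell_totals)]   # 447, 466
grand = sum(loc_totals)                       # Y... = 913
CF = grand ** 2 / 40                          # correction factor
total_SS = sum_sq_scores - CF                                  # 1811.775
loc_SS = sum(t ** 2 for t in loc_totals) / 10 - CF             # 79.875
sex_SS = sum(t ** 2 for t in sex_totals) / 20 - CF             # 9.025
int_SS = sum(t ** 2 for r in cell_totals for t in r) / reps - CF   # 473.775
res_SS = total_SS - loc_SS - sex_SS - int_SS                   # 1249.100
F_int = (int_SS / 3) / (res_SS / 32)          # F-ratio for L x S, 3 and 32 df
print(round(int_SS, 3), round(res_SS, 1), round(F_int, 2))
```

The divisors 10, 20 and 5 are, as in the text, the numbers of observations behind each squared total.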
12.4 ANOVA models: main effects only
Let us reconsider the first problem we discussed in this chapter - the one-way ANOVA of four independent samples of students from four locations. We wished to test the hypothesis that the population mean score of students from all geographical locations was the same, and we assumed that, at all locations, the scores were from a normal distribution with variance σ². All this can be summarised neatly in a simple mathematical model:

    Yij = μi + eij

where Yij is the j-th score observed at the i-th location, μi is the population mean at the i-th location, and eij is the random amount by which the j-th score, randomly chosen at the i-th location, deviates from the mean score. Our earlier assumption that the scores of students from the i-th location were normally distributed with mean μi and variance σ² is equivalent to the assumption that, for each geographical location, the 'error' or 'residual', eij, was normally distributed with mean zero and variance σ². We then tested the null hypothesis that μi = μ, the same value, for all locations.
Although this simple model is perfectly adequate for the one-way ANOVA problem, it does not generalise easily to more complex cases such as the factorial experiment. In order to make that possible a slight modification is needed. Suppose we ignore the existence of the four different locations. We could then consider the 40 scores as having come from a single population with mean μ, say. Now write Li = μi - μ. That is, Li is the difference between the mean of the overall population (that is, the grand population mean, μ) and the mean score of the population of scores of students from the i-th location, μi. Equivalently we can write μi = μ + Li, and substituting this into the previous model we now have:

    Yij = μ + Li + eij

Some values in such a model, Yij and eij, depend on the specific data values observed in the experiment. Others, μ and Li, are assumed to be fixed for all the different samples that might be chosen; they are population, as opposed to sample, values and are referred to as the parameters of the model. μ is usually called the grand mean and Li the main effect of origin i.
The previous null hypothesis that all the μi had the same value, μ, can now be restated as H0: Li = 0, for every value of i (that is, the main effect of geographical location is zero). This model can be generalised to cover a huge variety of situations. For example, consider the randomised block experiment of table 12.4. The observations are arranged in 32 'blocks', i.e. the errors. Within every block we have a score from each set of judges. A suitable model would be:

    Yij = μ + bi + gj + eij

which says that each score, Yij, is composed of four components summed together: the grand mean, μ, the block effect, bi, the group effect, gj, and the random variation, eij, about the mean score of judges of type j scoring error i for its gravity.
It may be easier to understand what this means if we fit the model to the observed data and estimate values for the parameters. Each parameter is estimated by the corresponding sample mean: μ is estimated by Y../96, since there is a total of 96 observations. We will use a circumflex to designate an estimate and write:

    μ̂ = Y../96 = 2462 ÷ 96 = 25.65
    ĝ1 = (total for group 1 ÷ number of scores for group 1) - μ̂
       = Y.1/32 - 25.65 = (801 ÷ 32) - 25.65 = -0.62
    (suggesting that group 1 scores may be smaller than the overall average)
    b̂1 = Y1./3 - μ̂   (each 'block' contains three scores - one from each group of judges)
       = (80 ÷ 3) - 25.65 = 26.67 - 25.65 = 1.02
    (so that the first error may be viewed as more serious than average)

The complete set of parameter estimates is given in the margins of table 12.8. The values of these estimates are useful when discussing the data. For example, we can say that the English teachers' group gives 0.62 (ĝ1) marks per error less than the mean (μ̂) while the Greek teachers' group gives 2.63 (ĝ2) marks per error more than the average. Error number 2 receives 11.32 (b̂2) marks per group of judges less than the mean gravity
Table 12.8. Total error gravity scores of ten native English teachers (1), ten Greek teachers of English (2) and ten native English non-teachers (3) on 32 English sentences

    Sentence       1      2      3    Total (Yi.)   Mean     b̂i
        1         22     36     22        80        26.67     1.02
        2         16      9     18        43        14.33   -11.32
        3         42     29     42       113        37.67    12.02
        4         25     35     21        81        27.00     1.35
        5         31     34     26        91        30.33     4.68
        6         36     23     41       100        33.33     7.68
        7         29     25     26        80        26.67     1.02
        8         24     31     20        75        25.00    -0.65
        9         29     35     18        82        27.33     1.68
       10         18     21     15        54        18.00    -7.65
       11         23     33     21        77        25.67     0.02
       12         22     13     19        54        18.00    -7.65
       13         31     22     39        92        30.67     5.02
       14         21     29     23        73        24.33    -1.32
       15         27     25     24        76        25.33    -0.32
       16         32     25     29        86        28.67     3.02
       17         23     39     18        80        26.67     1.02
       18         18     19     16        53        17.67    -7.98
       19         30     28     29        87        29.00     3.35
       20         31     41     22        94        31.33     5.68
       21         20     25     12        57        19.00    -6.65
       22         21     17     26        64        21.33    -4.32
       23         29     26     43        98        32.67     7.02
       24         22     37     26        85        28.33     2.68
       25         26     34     22        82        27.33     1.68
       26         20     28     19        67        22.33    -3.32
       27         29     33     30        92        30.67     5.02
       28         18     24     17        59        19.67    -5.98
       29         23     37     15        75        25.00    -0.65
       30         25     33     15        73        24.33    -1.32
       31         27     39     28        94        31.33     5.68
       32         11     20     14        45        15.00   -10.65
    Total (Y.j)  801    905    756      2462
    Mean       25.03  28.28  23.63     25.65
    ĝj         -0.62   2.63  -2.03

score (μ̂), but error number 11 is seen to be of about average seriousness since b̂11 is very close to zero, and so on. We can also examine the degree to which a particular score is well or badly fitted by the model, by calculating the value of the random or residual component of the score, eij. For example, Y20,2, the observed score on error number 20 of the Greek judges, is 41, while Ŷ20,2, the value obtained from the fitted model, is:

    Ŷ20,2 = μ̂ + b̂20 + ĝ2 = 25.65 + 5.68 + 2.63 = 33.96

so that the so-called residual error, the difference:

    ê20,2 = Y20,2 - Ŷ20,2 = 41 - 33.96

between the observed and fitted values is 7.04, which seems rather large, while:

    ê20,1 = Y20,1 - Ŷ20,1 = 31 - (25.65 + 5.68 - 0.62) = 0.29

which suggests a good correspondence between the observed and fitted values of the score of the English teachers on error number 20. This brings us to the last parameter which apparently has not yet been estimated, namely σ², the random error variance or residual variance. A good estimate of its value is given by the residual mean square in the ANOVA displayed in table 12.5, i.e. σ̂² = 37.82, so that the standard deviation is estimated by √37.82 = 6.15. This value is important when it comes to deciding which types of judges do seem to be giving different scores, on average, for the errors used in the study. In chapter 11 we gave the formula for a 95% confidence interval for the size of the difference between two means:

    (difference between sample means) ± 'constant' × √(s²/n1 + s²/n2)

where s² was an estimate of the common variance and n1 and n2 were the sample sizes. Let us use this to estimate how much difference there seems to be between the mean scores of English and Greek teachers. Applying the above formula gives the interval:

    (Ȳ.2 - Ȳ.1) ± constant × √(37.82/32 + 37.82/32)

Each of the sample sizes is 32, since both means are based on the scores for 32 different errors. The constant used in the formula is the 5% significance value of the t-distribution with the same number of degrees of freedom as there are for the residual in the ANOVA table - 62 in this case (see table 12.5). The interval will then be:

    (28.28 - 25.03) ± (2.0 × √2.36)
    or 3.25 ± 3.07
    i.e. 0.18 to 6.32

Similar confidence intervals for the other possible differences in means are as follows. For English teachers versus English non-teachers:

    (25.03 - 23.63) ± 3.07
    i.e. -1.67 to 4.47
and for Greek teachers versus English non-teachers:

    (28.28 - 23.63) ± 3.07
    i.e. 1.58 to 7.72

It might seem safe now, following the procedure of chapter 5 for carrying out tests of hypotheses using confidence intervals, to conclude that, at the 5% significance level, we can reject the two hypotheses that Greek teachers give the same scores as English teachers and non-teachers. On the other hand, there does not seem to be a significant difference between the mean scores of English teachers and English non-teachers. However, this procedure is equivalent to carrying out three pair-wise tests and, at the beginning of this chapter, we denied the validity of such an undertaking. There are theoretically correct procedures for making multiple comparisons - comparing all the possible pairs of means - but they are not simple to carry out. A frequently adopted rule of thumb is the following. Provided that the ANOVA has indicated a significant difference between a set of means, calculate the standard error s* for the comparison of any pair of means by:

    s* = √(2 × residual mean square ÷ n)

where n is the number of observations which have been averaged when calculating each mean. Then find the difference between each pair of means. If the difference between a pair of means is greater than 2s*, take this as suggesting that the corresponding population means may be different. If the difference in two sample means is greater than 3s*, take this as reasonably convincing evidence of a real difference.
For the three groups of judges, we know (see table 12.5) that the residual mean square is 37.82 and therefore:

    s* = √2.36 = 1.54,  2s* = 3.07  and  3s* = 4.61
    mean of Greek teachers - mean of English teachers = 3.25
    mean of English teachers - mean of English non-teachers = 1.40
    mean of Greek teachers - mean of English non-teachers = 4.65

from which we might conclude that Greek and English teachers probably give different scores on average and the Greek teachers and English non-teachers almost certainly do. However, this seems a suitable moment to reiterate our comment about the difference between statistical significance and scientific importance (chapter 7). It is important always to consider the observed magnitude of the differences in the means as estimated from the data and ask whether they seem large enough to be important, whether or not they are found to be significant by a statistical hypothesis test.

12.5 ANOVA models: factorial experiments
In 12.3 we introduced the concept of a factorial experiment, using as an example the multiple choice test scores classified by two factors, the sex of the subject who supplied the score and his or her geographical location. A model which we might try to fit to these data is:

    Yijk = μ + Li + Sj + eijk

where Yijk, as before, is the score of the k-th subject who is of the i-th location and of the j-th sex, μ is the grand population mean, Li is the main effect of the i-th location, Sj is the main effect of the j-th sex, and eijk is the random amount by which this subject's score is different from the population mean of all scores of subjects of the j-th sex of the i-th origin. As before, we assume that all the values of eijk are from a normal distribution with mean zero and variance σ². Use of this model would lead to the analysis of table 12.6. The values of the various parameters can be estimated using exactly the same steps as in the analysis of the error gravity scores above (see exercise 3). The residual variance, σ², is estimated by σ̂² = 49.225, the residual mean square of the ANOVA table 12.7(a). We have already seen that neither of the main effects is significant, i.e. there is no obvious difference in mean scores for the two sexes nor in the mean scores at the four different locations.
However, look again at the model:

    Yijk = μ + Li + Sj + eijk

There is an assumption here that, apart from the random variation, eijk, each score can be reconstructed by the addition of three parameters: the grand mean, plus the effect of having origin i (assumed equal for both sexes), plus the effect of the subject being of sex j (assumed equal for all locations). We have already demonstrated in 12.3 that this model will not give an adequate description of the data. There is an additional effect to consider. There seems to be an interaction between sex and location, males scoring better, on average, in one location and females scoring better in another. The model can be expanded to cope with this as follows:

    Yijk = μ + Li + Sj + aij + eijk

where the parameter aij is the additional correction which should be made for the interaction between the effects of the i-th origin and the j-th sex. We have already seen in table 12.7(b) that this interaction effect is significant. Table 12.9 gives the parameter estimates and the sample means for
the different mean effects and interactions.

Table 12.9. Estimation of the model Yijk = μ + Li + Sj + aij + eijk fitted to the data of table 12.6

                               Geographical location
    Sex            1               2               3               4
    1       Ȳ11. = 19.80    Ȳ21. = 25.40    Ȳ31. = 23.20    Ȳ41. = 21.00    Ȳ.1. = 22.35
            â11 = -4.72     â21 = 3.98      â31 = 0.58      â41 = 0.18      Ŝ1 = -0.48
    2       Ȳ12. = 30.20    Ȳ22. = 18.40    Ȳ32. = 23.00    Ȳ42. = 21.60    Ȳ.2. = 23.30
            â12 = 4.72      â22 = -3.98     â32 = -0.58     â42 = -0.18     Ŝ2 = 0.48

            Ȳ1.. = 25.0     Ȳ2.. = 21.9     Ȳ3.. = 23.1     Ȳ4.. = 21.3     Ȳ... = 22.83
            L̂1 = 2.17       L̂2 = -0.93     L̂3 = 0.27       L̂4 = -1.53     μ̂ = 22.83

    σ̂² = residual mean square = 39.03 (see table 12.7(b))
    standard error for comparing origin means = √(39.03 × 2 ÷ 10) = 2.79
    standard error for comparing sex means = √(39.03 × 2 ÷ 20) = 1.98
    standard error for comparing interaction means = √(39.03 × 2 ÷ 5) = 3.95

In this table, Ȳij. represents the sample mean score of subjects from location i and sex j, etc. When the interaction effect is significant there is not a great deal of point in examining the main effects. In this example it is quite unhelpful to say that the mean scores for the different sexes are about equal when that hides the fact that, between some origins, there seems to be an important difference of scores. It makes more sense to compare sexes within origins and origins within sex. In order to make this comparison we have to use the standard error for comparing interaction means, which has the value 3.95 (see table 12.9). For example, for origin 1 the difference in the sex means is 30.2 - 19.8 = 10.4, which is 2.6 times the relevant standard error. Using the guidelines proposed in the previous section, this suggests a real difference. In any case, an observed average difference of 10 marks in a test which was scored out of 50 is sufficiently large to merit further investigation. On the other hand, in the case of origin 4, the difference is only 0.6, which is certainly not significant compared to the standard error and is in any case hardly large enough to have any practical importance.

12.6 Fixed and random effects
In the example analysed in the previous section, students had been sampled from four different locations. We reached the conclusion that there was no main effect of location. Ignoring for the moment the important interaction effects, we might conclude that the mean mark of students having one of these locations would be very similar to the mean mark of students having another. How wide is the scope of this conclusion? Does it apply only to the four locations actually observed or can we extend it to students from other locations? The analysis we have carried out above is correct only if we do not wish to extend the results, formally, beyond the four locations involved in the experiment. If we intend these locations to serve as representatives of a larger group of locations, the model has to be conceptualised differently and a different analysis will be required.
The model fitted to the multiple choice test scores, ignoring interactions for the moment, was:

    Yijk = μ + Li + Sj + eijk

for which we tested the hypotheses H0: Li = 0 for all four locations and H0: Sj = 0 for both sexes. We reached the conclusion that both these hypotheses seemed reasonable. With this formulation of the model the results will not extend to other origins. This is known as a fixed effects model.
If we wish to widen the scope of the experiment we have to construct a mechanism to relate the effect of the locations actually involved to the effects of those not included in the experiment. This is usually done by assuming that there is a very large number of possible locations, each with its own location effect, L, on the mean score of students having that location. We then have to assume further that the different values of L can be modelled as a normal distribution with mean zero and some standard deviation, σL. The null hypothesis for a location effect will now be formulated somewhat differently, as H0: σL = 0, since if there is no variation in the values of L the variance of the distribution of L values would be zero. It will now be assumed that the four locations we have chosen for the experiment have been randomly sampled from all the possible locations we could have chosen. This is an example of a random effect, the four levels of the factor 'locations' being chosen randomly from a population of possible levels.
For a one-way ANOVA (see 12.1) the calculations and the F-test are carried out exactly the same whether origin is viewed as a fixed or a random effect. The difference lies in the conclusion we can reach and the kind of estimation possible in the model. The small F-value (table 12.2) would indicate that location effects were not important and, provided the four locations in the experiment had been randomly chosen from a large set of possible locations, this conclusion would apply to the whole population of locations. However, we have frequently indicated that, whatever the results of a hypothesis test, it is always advisable to estimate
the parameters of any model in case important effects have been missed by the statistical test or unimportant effects exaggerated. In this model an important parameter is σL, the standard deviation of the location effects. It is estimated by subtracting the residual mean square from the between locations mean square and taking the square root of the answer. From table 12.2 we would estimate:

    σ̂L = √(26.62 − 48.11)

Unfortunately the square root of a negative number does not exist, so that we cannot estimate σL, and this is a not infrequent outcome in random effects ANOVAs. The best we can say is that we are fairly certain that the value of σL² is about zero.

Table 12.10. ANOVA of main effects and interaction from data of table 12.6 with location as a random effect

Source                  df       SS         MS        F-ratio
Between locations (L)    3     79.875      26.62     F3,32 = 0.68
Between sexes (S)        1      9.025       9.025    F1,3  = 0.06
L x S                    3    473.775     157.925    F3,32 = 4.05
Residual                32   1249.100      39.03
Total                   39   1811.775

With higher order ANOVA (that is, two-way and more), even the F-tests will differ, depending on which effects we assume to be random or fixed, though the actual calculations of the sums of squares and mean squares will always be the same. In the previous section we carried out an ANOVA of data classified by sex and location. The table for that analysis (table 12.7(b)) is correct assuming that both sex and location are fixed effects. Clearly sex will always be a fixed effect - there are only two possible levels - but we could have chosen location as a random effect. The revised ANOVA table is given in table 12.10. You may have to look very hard before you find the only change that has occurred in the table. It is in the F-ratio column; the F value for testing the main effect of sex is now obtained by dividing the sex mean square by the mean square for the L x S interaction and not by the residual mean square. This has caused the F-value to decrease. In this example that was unimportant but it is in general possible that the apparent significance of the effect of one factor may be removed by assuming that another factor is a random rather than a fixed effect.

Further discussion of this problem in general terms is beyond the scope of this book. Every different mixture of random and fixed effects gives rise to a different set of F-ratios, and the greater the number of factors the greater is the number of possible combinations. Many books on experimental design or ANOVA give the details (e.g. Winer 1971). It is sensible to avoid the random effects assumption wherever possible, choosing levels of the different factors for well-considered experimental reasons rather than randomly. The fixed effects model is always easier to interpret because all the parameters can always be estimated. However, there are situations in linguistic studies where it may be difficult to avoid the use of the random effects model. It could be argued, for example, that in the analysis presented in 12.2 the 32 errors whose gravity was assessed by sets of judges are representatives of a population of possible errors and that 'error' should be considered as a random effect. This problem is discussed at length by Clark (1973), who advocates that random effects models should be used much more widely in language research.

Clark's suggestion is one way to cope with a complex and widespread problem, but it does seem a pity to lose the simplicity of the fixed effects model and replace it with a complicated variety of models containing various mixtures of fixed and random effects. There are other possible solutions. One is to claim that any differences found relate only to the particular language examples used. We could conclude that the Greek and English teachers give different scores on this particular set of errors. Though this may seem rather weak it may serve as an initial conclusion, allowing simple estimation of how large the differences seem to be and at least serving as a basis for the decision on whether further investigation is warranted. A second solution would be to identify classes of error into which all errors could be classified. If each of the 32 errors in the study were a representative of one of the, say, 32 possible classes of error, then we would be back to a fixed effects model.

There may still be a problem. Remember that an important assumption of the ANOVA model is that the variability should not depend on the levels of the factors. It may very well be that, say, different sets of judges find it easier to agree about the gravity of one type of error than the gravity of another. The variance of scores on the latter error would then be greater than on the former. If the difference is large this could have serious implications for the validity of the ANOVA (see 12.8). There is one area where random effects models occur naturally - in the assessment of the reliability of language tests.

12.7 Test score reliability and ANOVA
Language testers are quite properly interested in the 'reliability' of any test which they may administer. A completely reliable test would
be one in which an individual subject would always obtain exactly the same score if it were possible for him to repeat the test several times. How can reliability be measured? Several indices have been proposed (e.g. Ghiselli, Campbell & Zedeck 1981) but the most common is the following. Assume that for the i-th subject in a population there is an underlying true score, μi, for the trait measured by the test. The true scores (chapter 6) will form a statistical population with mean μ and variance σb², the b signifying that the variability is measured between subjects. In fact, a subject taking the test will not usually express his true score exactly, due to random influences, such as the way he feels on a particular day and so on. The score actually observed for the i-th subject will be Yi = μi + ei, where the error, ei, is usually assumed to be normally distributed with mean zero and some variance, σ². If this subject takes the same test several times, his observed score on the j-th occasion will be Yij = μi + eij, where eij is the error in measuring the true score of the i-th student on the j-th occasion when he takes the test. This model can be written:

    Yij = μ + ai + eij

where μ is the mean 'true' score of all the students in the population and ai is the difference between the 'true' score of the i-th student and the mean for all students. Since we had previously assumed that true scores were normally distributed with mean μ and variance σb², the values ai will be from a normal distribution with mean zero and variance σb². Now, if the measurement error variance is close to zero, all the variability in scores will be due to differences in the true scores of students. A common reliability coefficient is:

    rel = σb² / (σb² + σ²)

which is the proportion of the total variability which is due to true differences in the subjects. If rel = 1, there is no random error. If rel is close to zero, the measurement error is large enough to hide the true differences between students. How can we estimate rel? It can be shown that rel = ρ, the correlation between two repetitions of the same test over all the subjects, and rel is frequently estimated by r, the correlation between the scores of a sample of subjects each of whom takes the test twice. However, there is a problem with this. It is simply not possible to administer exactly the same test to a group of subjects on different occasions. It is much more common to administer two forms of the same test. The tester hopes that the two different versions of the test will be measuring the same trait in the same way - quite a large assumption. The correlational method of estimating the reliability makes no check on this assumption. If the second version of the test gave each subject exactly 10 marks (or 50 marks) more than the first version, the correlation would be 1, and the reliability would be apparently perfect, though the marks for each subject are quite different on the two applications of the test. The random effects ANOVA model provides a different method for estimating the reliability and also offers the possibility of checking the assumption that the 'parallel forms' of the test do measure the same thing in the same way.

Table 12.11. Scores of ten subjects on two parallel forms of the same test

Subject   Form 1   Form 2   Total
   1        63       67      130
   2        41       39       80
   3        78       71      149
   4        24       21       45
   5        39       48       87
   6        53       46       99
   7        56       51      107
   8        59       54      113
   9        46       37       83
  10        53       61      114
Total      512      495     1007

Table 12.11 shows the hypothetical marks of ten subjects on two forms of a standard test. There are two ways to tackle the analysis of this data. One is to assume that the parallel forms are equivalent so that the data can be considered as two independent observations of the same trait score on ten students. This is equivalent to assuming that any student has the same true score on both forms of the test. The observed scores can then be analysed using the model:

    Yij = μ + ai + eij

where μ is the common mean of all parallel forms of the test over the whole population of students, ai is the amount by which the score of the i-th student in the sample differs from this mean, and eij is the random error in measuring the score of the i-th student at the j-th test. The corresponding ANOVA is given in table 12.12(a).

The random error variance σ² is estimated by the residual mean square, s² = 31.95. It can be shown that the between-students mean square is an estimate of σ² + kσb², where k is the number of parallel forms used,
Table 12.12. ANOVA for data in table 12.11 allowing for differences in supposed parallel forms when assessing test reliability

(a) One-way ANOVA of data in table 12.11
Source             df       SS        MS
Between students    9    3655.05    406.12
Residual           10     319.50     31.95
Total              19    3974.55

(b) Two-way ANOVA of data in table 12.11
Source             df       SS        MS       F-ratio
Between students    9    3655.05    406.12
Between forms       1     186.05    186.05    F1,9 = 12.55
Residual            9     133.45     14.83
Total              19    3974.55

here two. The quantity sb², an estimate of σb², can be obtained by putting s² + 2sb² = 406.12, which gives sb² = 187.085. An estimate of the reliability is given by:

    rel = sb² / (sb² + s²) = 0.854

On the other hand, the correlation between the two sets of scores is 0.93, and frequently this would have been used as an estimate of the reliability. Why is there this discrepancy? The ANOVA table 12.12(b) gives a clue. The sample correlation will be a good estimator of the reliability only if it is true that the parallel forms of the test really do measure the same trait on the same scale. To obtain this second ANOVA we have assumed the model:

    Yij = μ + ai + fj + eij

where fj is the difference between the overall mean score μ of all forms of the test over the whole population and the mean of the j-th form used in this study over the whole population, i.e. the main effect of forms (a random effect). The F-ratio corresponding to this effect is highly significant, showing that the mean marks of different forms are not the same. In the sample the means for the two forms are 51.2 and 49.5. This suggests that the sample correlation is not appropriate as an estimate of the reliability since its use in that way assumes equivalence of parallel forms. In fact the use of the correlation coefficient ignores entirely the variability caused by using different forms. From table 12.12(b) we can re-estimate s² = 14.83, s² + 2sb² = 406.12 so that sb² = 195.64. This is a new estimate of σb², which we have already estimated, above, as 187.085. These two estimates for the same quantity, both calculated from the same data, have slightly different values because of the different models assumed for their calculation. The variance in the mean scores of different parallel forms, σf², can be estimated in a similar way by:

    s² + k sf² = mean square for forms

where k is the number of subjects in the sample. This gives 14.83 + 10sf² = 186.05, or sf² = 17.12. Now, the total variance of any score will be the sum of these three variances:

    s² + sb² + sf² = 14.83 + 195.64 + 17.12 = 227.59

Using the definition of reliability that says:

    rel = between-subjects variance / total variance = 195.64 / 227.59 = 0.860

which is very close to the estimate of 0.854 we obtained previously. However, if the variability due to parallel forms is wrongly assumed to be part of the true scores variance, we would obtain:

    (195.64 + 17.12) / 227.59 = 0.93

which is the correlation between the scores on the two forms of the test! Thus the use of the correlation to estimate reliability is likely to cause its overestimation. Furthermore, the ANOVA method extends with no difficulty to the case where several parallel forms have been used. Further discussion of the meaning and dangers of reliability coefficients can be found in Krzanowski & Woods (1984).

12.8 Further comments on ANOVA
In this, already rather long, chapter we have covered only the basic elements of ANOVA models. The possible variety of models is so large, with the details being different for each different data structure or experimental design, that it is neither possible nor appropriate to attempt a complete coverage here. The general principles are always the same and the details for most designs can be found in any of several books. However, there are two special points, of some importance in linguistic research, which need to be mentioned.
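The arithmetic of 12.7 is easy to check by machine. The following sketch (Python; ours, not the book's) recomputes the variance components and both reliability estimates directly from the mean squares reported in table 12.12, and shows how wrongly folding the forms variance into the true-score variance reproduces the inflated correlation of 0.93:

```python
# Variance components and reliability from the mean squares of table 12.12.
# All the input numbers are taken directly from the text of 12.7.

k_forms = 2      # parallel forms taken by each subject
n_subjects = 10  # subjects in the sample

# One-way model: MS(between students) estimates sigma^2 + k * sigma_b^2.
ms_between = 406.12
ms_resid_one_way = 31.95
s2 = ms_resid_one_way
sb2 = (ms_between - s2) / k_forms              # 187.085
rel_one_way = sb2 / (sb2 + s2)                 # about 0.854

# Two-way model: the 'forms' main effect is separated out first.
ms_forms = 186.05
ms_resid_two_way = 14.83
s2_2 = ms_resid_two_way
sb2_2 = (ms_between - s2_2) / k_forms          # 195.645
sf2 = (ms_forms - s2_2) / n_subjects           # 17.122
total_var = s2_2 + sb2_2 + sf2                 # 227.597
rel_two_way = sb2_2 / total_var                # about 0.860

# Treating the forms variance as true-score variance gives back the
# (overestimated) correlation between the two forms of the test.
r_inflated = (sb2_2 + sf2) / total_var         # about 0.93

print(round(rel_one_way, 3), round(rel_two_way, 3), round(r_inflated, 2))
```

The last line makes the chapter's point numerically: the correlational estimate (0.93) exceeds both ANOVA-based reliability estimates because it silently credits the difference between forms to the subjects.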
12.8.1 Transforming the data
The first point is the possibility of transforming data which do not meet the assumptions required for ANOVA to a different form which do. There are many possibilities, depending on the specific feature of the original data which might cause problems. However, one special case which may arise fairly frequently in applied language studies is data in the form of proportions or percentages, e.g. the proportion of correct insertions in a cloze test with 20 deletions. In this example, a sample of native speakers would be expected to score higher than second language learners. In an 'easy' cloze test native speakers might achieve very high scores, many getting all or almost all items correct, with a few perhaps scoring less well. Such data would lack symmetry and could not be normally distributed. Furthermore, since most of the subjects would then have very similar scores, the sample variance would be small. A sample of second language learners might show much greater spread of ability with a lower mean. In general, with this kind of data, the nearer the mean score is to 50% correct the greater will be the variance, while the symmetry and variance of the sample scores will both decrease as the mean approaches one of the extremes of 0% or 100%. It may not be legitimate in such cases to carry out a t-test or ANOVA to investigate whether there was a significant difference in the average scores of the two groups or to estimate what the difference might be, using the methods of chapter 8.

Provided most of the subjects in the experiment obtained scores in the range 20%-80% (i.e. 4/20 to 16/20) it would probably be acceptable to analyse the raw scores directly. However, if more than one or two scores lie outside this range, in particular if any scores are smaller than 10% correct or greater than 90% correct, it will not be safe to analyse them by the methods of earlier chapters without first transforming them.

The traditional solution to this problem is to change the scores of the individual subjects into scores on a different scale in such a way that the new scores will be normally distributed and have constant variance. This is done via the arcsine transformation.³ Standard computer packages will usually include a simple instruction to enable the data to be transformed in this way. The usual ANOVA or regression analysis can then be carried out on the W-scores instead of the X-scores.

Other transformations are in common use. It may happen that for some variables the variability over a population, or subpopulation, is related to the mean value. For example, if two groups of individuals have markedly different mean vocabulary sizes the group with the higher mean will usually show more variability in the vocabulary sizes of the individuals comprising the group. If the variance of the values in the different groups seems to be roughly proportional to their means then analysing the logarithms of the original values will give more reliable results. If, instead, the standard deviations of the groups are proportional to their means, taking the square root will help. (See also 13.12.)

12.8.2 'Within-subject' ANOVAs
The second general point we have not discussed but which may be important in linguistic experiments is the situation where subjects are divided into groups and each subject is measured on several variables. For example, we might consider an experiment where 12 subjects of each of two different nationalities are tested for their reaction times to several different stimuli. The data would have the structure shown in table 12.13.

Table 12.13. The structure of a 'within-subject' ANOVA model

                              Stimulus
Nationality                 1       2       3
    1     Subject 1       Y111    Y121    Y131
          Subject 2       Y112    Y122    Y132
    2     Subject 1       Y211    Y221    Y231
          Subject 2       Y212    Y222    Y232

The observation Yijk will be the reaction time of the k-th subject of the i-th nationality to the j-th stimulus. The comparison of stimuli can be carried out within each subject while the comparison of nationalities can only be carried out between sets of subjects. Variation in the reaction times of a single subject on different applications of the same stimulus is likely to be rather less than variation between subjects reacting to the same stimulus. Stimuli can therefore be compared more accurately than nationalities from such a design. (Standard texts on ANOVA often refer to this as a split-plot design since it typically occurs when several large agricultural plots (subjects) are treated with different fertilisers (nationalities) and then several varieties of a cereal (stimuli) are grown in each plot.) Data with this kind of structure cannot be analysed using the models we have presented in this chapter - see, e.g., Winer (1971) for details.

³ The transformed scores (W) can be obtained from the original scores (X) (where these are in percentages) by the formula: W = arcsin √(X/100). Most scientific calculators have a function key which gives the value of arcsine, otherwise written sin⁻¹.
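The arcsine transformation of footnote 3 amounts to a single line of code. A minimal sketch in Python (the function name is ours, not the book's; results are in radians):

```python
import math

def arcsine_transform(percent):
    """W = arcsin(sqrt(X/100)) for a percentage score X (footnote 3)."""
    return math.asin(math.sqrt(percent / 100))

# Scores near 0% or 100% are stretched apart, scores near 50% are left
# almost alone - this is what stabilises the variance across groups.
scores = [5, 25, 50, 75, 95]
transformed = [round(arcsine_transform(x), 3) for x in scores]
print(transformed)
```

Note that 50% maps to π/4 and 100% to π/2; the transformation is strictly increasing, so the ordering of subjects is unchanged.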
SUMMARY

Analysis of variance (ANOVA) was introduced and a number of special cases discussed.

(1) One-way ANOVA was explained and it was stated that to compare several means simultaneously it would not be correct to carry out pair-wise t-tests.
(2) Two-way ANOVA was introduced, especially the randomised block design which is an extension to several groups of the paired t-test.
(3) The concept of a factorial experiment was explained together with the terms factor and levels of a factor; the possibility of interaction was discussed and it was explained that when significant interaction was present it did not make much sense to base the analysis on the main effects alone.
(4) A convenient rule of thumb was presented for examining the differences between the means corresponding to different experimental conditions.
(5) The difference between a fixed effect and a random effect was discussed.
(6) The reliability of tests was discussed and it was shown, via ANOVA, that reliability measures based on correlations could be misleading.
(7) It was pointed out that linguistic data may not be suitable for ANOVA and transformation may be necessary.
(8) It was stated that the data from experiments which involved repeated measures from individual subjects may need to be analysed as a within-subjects or split-plot design, though this type of ANOVA was not explained further.

Table 12.14. Scores of students from three different centres on the same language test

 A    B    C
11   42   34
10   36   39
12   40   38
10   34   41
10   38   38
10   38   36
 9   32   38
11   41   30
 9   35   39
 7   35   36
 7   35   31
14   32   33
10   29   29
 8   35   28
 8   32   33
 8   28   30
 8   37   39
 7   30   26
 7   29   32
 8   30   20
 8   26   27
 7   31   27
 7   22   21
 7   29   23
 4   19   29
EXERCISES

(1) Table 12.14 represents scores on a language test by three groups of subjects from different centres. Using the method outlined in 12.1, test the hypothesis that there is no difference between the sample means overall. Use the rule of thumb procedure of 12.4 to test for differences between individual means.

(2) The values that follow are error scores (read from left to right) from a replication of the Hughes-Lascaratou study performed by non-native teachers of English who were of mixed nationalities.

Non-native teacher scores
35 10 26 40 30 24 26 31 44 14 33 15 24 34 22 24
30 20 25 40 30 15 24 35 40 25 35 26 33 30 42 10

    (a) Construct a new table (see 12.4) by substituting this column of 32 numbers for the Greek teachers' scores in the original.
    (b) Construct an ANOVA table for this data (see 12.2 and table 12.5).
    (c) Construct a table with the new data comparable to table 12.8, and again compare the difference between observed and fitted values for Y21,2 and Y20,1.
    (d) Do the non-native teachers behave similarly to the Greek teachers?

(3) Using the methods outlined in 12.5:
    (a) Estimate the following parameters from table 12.6: the overall mean; the mean for Location 1; the mean for the male subgroup from Location 2; the mean for the female subgroup from Location 4.
    (b) Using the residual variance, compare the difference between observed and fitted values for Y1,1; Y4,1; Y3,2.
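Exercise (1) asks for the one-way ANOVA of 12.1 worked by hand. As a check on that hand calculation, a small helper (ours, not the book's) computes the required sums of squares and F-ratio from a list of groups:

```python
# One-way ANOVA from scratch: returns the F-ratio and its two
# degrees of freedom for any number of (possibly unequal) groups.

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_scores = [y for g in groups for y in g]
    n = len(all_scores)
    grand_mean = sum(all_scores) / n

    # Between-groups sum of squares: group size times the squared
    # deviation of each group mean from the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Within-groups (residual) sum of squares.
    ss_within = sum(
        (y - sum(g) / len(g)) ** 2 for g in groups for y in g
    )
    df_between = len(groups) - 1
    df_within = n - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# Toy illustration (not the exercise data): two groups of three scores.
f, db, dw = one_way_anova([[1, 2, 3], [2, 3, 4]])
print(round(f, 2), db, dw)
```

For table 12.14 the three centres A, B and C would be passed in as three lists of 25 scores; the resulting F-ratio is then referred to tables of the F distribution with the returned degrees of freedom.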

13 Linear regression

In chapter 9 we proposed the correlation coefficient as a measure of the degree to which two random variables may be linearly related. In the present chapter we will show how information about one variable which is easily measured or well-understood can be exploited to improve our knowledge about a less easily measured or less familiar variable. To introduce the idea of a linear model, which is crucial for this chapter, we will begin with a simple non-linguistic example.

Suppose the manager of a shop is paid entirely on a commission basis and he receives at the end of each month an amount equal to 2% of the total value of sales made in that month. The problem, and the model for its solution, can be expressed mathematically. Let Y be the commission the manager ought to receive for the month just ended. Let X be the total value of the sales in that month. Then:

    Y = 0.02X    (Remember, 0.02 = 2%)

Figure 13.1. Graph of Y = 0.02X. (Commission (£) plotted against sales (thousands of £).)

The model can be represented graphically as in figure 13.1 by a straight line passing through the origin of the graph. When the value of X, the month's total sales, is known, then the corresponding value of Y, the commission, can be read off from the graph as shown in the figure. Note that for every £1 increase in X, the commission increases by 2p or £0.02. We would say that the slope or gradient of the line is 0.02. This tells us simply how much change to expect in the value of Y corresponding to a unit change in X.

Suppose that the shop manager does not like the extreme fluctuations which can take place in his earnings from one month to another and he negotiates a change in the way in which he is paid so that he receives a fixed salary of £500 each month plus a smaller commission of 1% of the value of sales. Can he still find a simple mathematical model to calculate his monthly salary? Clearly he can. With X and Y having the same meanings as before, the formula:

    Y = 500 + 0.01X

will be correct.

Figure 13.2. Graph of Y = 500 + 0.01X. (Commission (£) plotted against sales (thousands of £).)

The corresponding graph is shown in figure 13.2. Again it is a straight line. However, it does not pass through the origin, since even if there are no sales the manager still receives £500; nor does it slope so steeply, since now a unit increase in X corresponds to an increase of only 0.01 in Y. We would say in both cases that there was a linear relationship between X and Y, since in both cases the graph takes the form of a straight line. In general, any algebraic relation of the form:

    Y = α + βX
will have a graph which is a straight line. The quantity β is called the slope or gradient of the line and α is often referred to as the intercept or intercept on the Y-axis (figure 13.3). The values α and β remain fixed, irrespective of the values of X and Y.

Figure 13.3. Graph of the linear equation Y = α + βX.

13.1 The simple linear regression model
Is this notion of a linear relationship useful to us in linguistic studies? Consider figure 13.4 which represents data described by Miller & Chapman (1981). They calculated mean length of utterance (mlu) in morphemes for a group of 123 children between 17 months and 5 years of age. In figure 13.4, X, the age of each child, is plotted on the horizontal axis, and Y, the corresponding mlu, on the vertical axis. It is clear that these points do not fit exactly on a straight line. It is equally clear that mlu is increasing with age, and that it might be helpful to make some statement such as 'between the ages of a1 and a2 mlu increases by about so much for each month'. It will make it simpler to introduce and explain the concepts of the present chapter if we use the same two variables as Miller & Chapman, mlu and age, but with data from a smaller number of children. We have therefore constructed hypothetical data on 12 children and this appears in table 13.1. The values in the table are realistic and commensurate with the real data discussed in Miller & Chapman (1981).

Figure 13.4. Relationship between age (in months) and mean length of utterance (mlu) in morphemes in 123 children: mlu = -0.548 + 0.103 (age). Reproduced from Miller & Chapman (1981).

Table 13.1. Age and mean length of utterance for 12 hypothetical children

Child   Age in months (X)   mlu (Y)   Predicted mlu (Ŷ)   Residual
  1            24             2.10          1.82             0.28
  2            23             2.16          1.73             0.43
  3            31             2.25          2.43            -0.18
  4            20             1.93          1.47             0.46
  5            43             2.64          3.49            -0.85
  6            58             5.63          4.80             0.83
  7            28             1.96          2.17            -0.21
  8            34             2.23          2.70            -0.47
  9            53             5.19          4.36             0.83
 10            46             3.45          3.75            -0.30
 11            49             3.21          4.01            -0.80
 12            36             2.84          2.87            -0.03

COV(X,Y) = 13.881    sx = 12.573    X̄ = 37.081
                     sy = 1.243     Ȳ = 2.966

The corresponding scattergram appears in figure 13.5. The correlation, r, of mlu with age for the 12 children is 0.8882, obtained as follows:

    covariance (mlu, age) = 13.881 (see 10.1)
    standard deviation of ages = 12.573
    standard deviation of mlu = 1.243

Hence r = 13.881 ÷ (1.243 × 12.573) = 0.8882. From table A6, the hypothesis that the true correlation coefficient is zero can be rejected with p < 0.001, i.e. at the 0.1% significance level (cf. 10.3). We can therefore be quite confident that there is some degree of linear relationship between age and mlu. The next step is to construct a model which will specify
that relationship and help us to estimate the expected mlu of a child of a given age who was not observed in our experiment.

Figure 13.5. Scatter diagram and least squares regression line, Ŷ = -0.2897 + 0.0878X, for the data of table 13.1. (mlu in morphemes plotted against age in months.)

The process begins by defining a model which says that MLU(X),¹ the average mlu of all the children aged X months in the complete target population, will lie on a straight line when plotted against the age, X. Algebraically:

    MLU(X) = α + βX

or

    MLU = α + β × age

for some fixed, but as yet unknown, values of the parameters, α and β. The form of the model presupposes that age will be known first and MLU (i.e., mean mlu) calculated afterwards. For this reason we say that age is the independent variable and mlu the dependent variable. Alternatively, we may call MLU the predicted variable and age the predictor variable. The next step is to estimate values for α and β using the data. The final stage will be to assess how well the model appears to fit the observed data and the extent to which we can use it to predict the mlu of any child whose age is known.

¹ Note that mlu is the mean length of utterance of a single child while MLU is the average mlu of children of a particular age.

13.2 Estimating the parameters in a linear regression
This is entirely a mathematical problem and it is helpful to consider a general formulation of it. We have two variables which we will call X (age) and Y (mlu). We have a number of observations consisting of pairs of values (Xi, Yi), a value of X and the corresponding value of Y. We have plotted these points in figure 13.5 and we would now like to draw, on the same graph, the straight line which 'most nearly passes through all the points'. We can attempt to do this 'by eye', which will be subjective and probably inaccurate. Or we can look for an objective method which will ensure that the chosen line is determined automatically by the points we are trying to fit. A mathematician can suggest several ways of doing this, all of which lead to slightly different answers. To choose between the suggestions offered we have to consider the motivation for requiring the model in the first place.

We have implied already that we will want to use the model to estimate the mlu of a child of a given age. This suggests that we require from the mathematician a solution which ensures that the MLUs assigned by the model for the various ages which are represented in our sample will be close to those actually observed. We will adopt the notation that Ŷi is the MLU predicted by the model corresponding to the age, Xi, of the i-th subject while Yi is the value of the mlu observed for the i-th subject in the sample. The difference:

    ri = Yi − Ŷi

is called the i-th residual. If all the residuals were zero, the model line would pass exactly through the observed data points. We would like to achieve this if possible, but in practice it will never occur. In some sense, we could try to make sure that these residuals were as small as possible. However, there will be conflicts. Making the line pass closer to some points will make it move further from others. Again, some of the residuals will be positive, when the line passes below the corresponding point, while others will be negative, when it passes above. Perhaps we can solve the problem by choosing the line for which the algebraic sum of the residuals is zero? Unfortunately, there are many different lines for which this is true. It is rather similar to the problem we encountered when we were looking for a measure of dispersion of sample data about the sample mean in chapter 3 and discovered that the sample deviations always summed to zero. The solution we will adopt here is similar to the one we chose there. We will obtain estimates a and b of α and β, in such a way that the line, Ŷ = a + bX, will be the line which minimises Σ(Yi − Ŷi)², the
The resultant line is called the least squares line of linear regression of Y on X.²

The calculations needed to obtain a and b are very similar to those required for the calculation of r, the correlation coefficient, which we will usually have carried out anyway with this kind of data. They are:

b = COV(X,Y)/sx²  and  a = Ȳ − bX̄

For the data of table 13.1 we have:

COV(X,Y) = 13.881, sx² = 158.080, X̄ = 37.081, Ȳ = 2.966

whence

b = 0.0878  a = −0.2897

The estimated linear regression is therefore:

Ŷi = −0.2897 + 0.0878 Xi

For example, for the first child in table 13.1, who is aged 24 months, the model predicts an MLU of Ŷi = −0.2897 + (0.0878 × 24) = 1.82, which should be compared with an observed mlu of 2.10. The residual therefore has a value of e1 = Y1 − Ŷ1 = 0.28. The full set of fitted values and residuals is given in the final two columns of table 13.1. Note that the sum of the residuals is −0.01, which is effectively zero. The small discrepancy is caused by rounding errors accumulated in the series of calculations needed to produce the residuals.

13.3 The benefits from fitting a linear regression
What do we hope to gain by the use of a simple linear regression such as we have just carried out? The significant correlation tells us that at least part of the variation between the mlus of individuals is due to their different ages, in the sense that MLU tends to increase linearly with age. Fitting the regression line provides us with a tool for taking this into account, as we can demonstrate by pretending that we do not know how old each subject is, though we still know the subject's mlu. We would then be able to say only that, for these 12 subjects, the sample mean mlu was 2.966 morphemes with a standard deviation of 1.243. We might comment further, by calculating a confidence interval (see chapter 7), that 95% of the children in our target population, aged between 15 and 60 months, should have an mlu in the range 2.966 ± (2.2 × 1.243). (We are assuming here that mlu is normally distributed and are using the sample mean and standard deviation as estimates of the population values.) In other words, we believe 95% of children between 15 and 60 months will have an mlu in the range 0.23 to 5.70. This information is of little use. In the first place, the interval is too wide to be meaningful. In the second, we know already that expected MLU depends on age and we are not exploiting this knowledge.

It might then have occurred to us to carry out a bigger study, drawing several separate samples of mlus, each provided by subjects of a specified age, say 15 months, 20 months, etc. In this way we might hope to estimate the MLU for each age group and the range into which mlus might be expected to fall for, say, 95% of that age group in the population. We would expect that for each age these 95% confidence intervals would all be narrower than the rather wide one we have just estimated for the whole population when we ignored age differences. This is exactly what fitting a linear regression allows us to do, without the need to carry out different experiments separately for each age group. Of course, it will not be as accurate as carrying out several separate studies, but it will cost a great deal less effort and, under certain assumptions which we make explicit below, it will give results which are almost as good.

Note that the model purports to describe how the average mlu, which we have designated as MLU, alters with age. For example, it predicts that the average mlu of all children aged 24 months will be 1.82, from the model:

MLU = α + (β × age)

estimated by MLU = −0.2897 + 0.0878 × age.

Of course, within each age group there will still be variation in mlu; individual children of the same age will have different mlus. We will have to assume that this variation can be modelled successfully by a normal distribution with mean MLU(X) and some standard deviation σ. The mean will be different for different age groups, but it will be assumed that the standard deviation is the same for all age groups, as in the ANOVA models of the last chapter.

² Why 'regression'? Yet another instance of confusing terminology. The term was first used by Karl Pearson (cf. Pearson's r). To demonstrate the procedure we are discussing here, Pearson fitted a linear relationship between the heights of sons and the heights of their fathers. Pearson noted that, although there was a good-fitting linear relationship, the sons of small men tended to be taller than their fathers, while the sons of tall men tended to be smaller than their fathers. In other words, in both cases, the sons of fathers of extreme heights, either very small or very tall, would usually be nearer to average height than their fathers had been. Pearson referred to this as 'regression towards the mean'. Although this was a special case unrelated to the general principles of fitting linear relationships, the term 'linear regression' was adopted for general use.
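The least squares calculations just described can be sketched in a few lines of Python; the data below are illustrative stand-ins, not the actual values of table 13.1:

```python
# Sketch of the least squares fit of 13.2: b = COV(X,Y)/sx^2, a = Ybar - b*Xbar.
def least_squares(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
    sx2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    b = cov / sx2            # slope
    a = ybar - b * xbar      # intercept
    return a, b

ages = [15, 20, 24, 30, 36, 42, 48, 54, 60]           # hypothetical ages (months)
mlus = [1.1, 1.5, 2.1, 2.4, 2.9, 3.3, 3.8, 4.2, 4.6]  # hypothetical mlus
a, b = least_squares(ages, mlus)
fitted = [a + b * x for x in ages]
residuals = [y - f for y, f in zip(mlus, fitted)]
# As the text notes, the residuals from a least squares line sum to zero
# (up to rounding error).
```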
This can be summarised in the model, appropriate to individual subjects:

mlu = α + (β × age) + e

or

Yi = α + βXi + ei

where e is the residual difference between the true value of a subject's mlu and the expected value predicted by the regression line. It will be assumed that the residual 'error', e, is normally distributed over the population with mean zero and standard deviation σ within a particular age group and that σ has the same value for all age groups. These assumptions are illustrated in figure 13.6.

Figure 13.6. A normal distribution of mlus for each age group. All the distributions have the same standard deviation.

The last column in table 13.1 gives the observed residuals from the fitted regression line. The sum of these residuals approximates to zero and hence their mean, ē, is also zero. The sample standard deviation of the residuals, sr, is then estimated:

sr = √( Σ(e − ē)²/(n − 2) ) = √( Σe²/(n − 2) )  (since ē = 0)
   = 0.599

This is the familiar formula for the calculation of a sample standard deviation, except that the denominator is (n − 2) instead of the (n − 1) you might have expected. We have used up 2 df in obtaining the two estimates a and b and this leaves only (n − 2) for the residual error. It is common practice to assess the explanatory powers of a regression model in terms of the percentage reduction in the sample variance of the dependent variable brought about by fitting the model. The sample variance of the mlus was 1.243² = 1.545. After the model is fitted the residual variance is 0.599² = 0.359. The percentage reduction is:

(1.545 − 0.359)/1.545 × 100 = 76.8%

You may remember that in the previous chapter we suggested that the degree to which two variables might be linearly related could be measured by r², the square of the sample correlation coefficient. Here we have r = 0.8882 and r² = 0.789 = 78.9%, very similar to the percentage reduction in variance. In fact, for larger samples the two values will be virtually identical.

It is not necessary to calculate all the residuals to obtain the value of sr; it can be calculated more efficiently using the formula:

sr = √[ ((n − 1)/(n − 2)) ( sy² − (COV(X,Y)/sx)² ) ]
   = √[ (11/10) ( 1.243² − (13.881/12.573)² ) ]
   = 0.599, as before

13.4 Testing the significance of a linear regression
Although a regression line may seem to fit the data points quite well, there is, as always, the possibility that the effect is a property of a particular sample. It may be due simply to a chance selection of a few subjects in which the relation appears to hold, though for the population as a whole there is no such relationship. We can test the hypothesis H0: β = 0 against H1: β ≠ 0 using the test statistic:

t = b sx √(n − 1) / sr

and compare its value to the critical values of the t-distribution with (n − 2) df. Here we have:

(0.0878 × 12.573 × √11) / 0.599 = 6.11

and this is significant even at the 0.1% level. We can therefore be confident that there is a real gain to be made by fitting the model, a result we had already discovered by considering the correlation coefficient.
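The shortcut formula for sr and the slope test can be checked numerically from the summary values quoted in the text; a sketch:

```python
import math

# Summary values quoted for the 12 children of table 13.1.
n, sy, sx, cov, b = 12, 1.243, 12.573, 13.881, 0.0878

# Residual standard deviation without computing individual residuals.
sr = math.sqrt((n - 1) / (n - 2) * (sy**2 - (cov / sx) ** 2))

# Test statistic for H0: beta = 0, compared with t on (n - 2) df.
t = b * sx * math.sqrt(n - 1) / sr

print(round(sr, 3), round(t, 2))  # about 0.599 and 6.11, as in the text
```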
In fact the test of the null hypothesis H0: β = 0 is equivalent to the test of H0: ρ = 0.

We must not forget that the values a and b are estimates of the population values and are therefore subject to sampling error. We have estimated that the slope of the regression line is 0.0878, or, to put it another way, on average we expect a monthly increase of 0.0878 in the MLU of children in the target population. An approximate confidence interval for the true value, β, can be obtained as:

b ± (constant × sr) / √( (n − 1) sx² )

where the constant is the appropriate critical value from the t-distribution with (n − 2) df. Here we have 10 df and the corresponding 95% confidence interval will be:

0.0878 ± (2.23 × 0.599) / √( 11 × 12.573² )  or  0.0558 to 0.1198

13.5 Confidence intervals for predicted values
It is important to be clear exactly what we are trying to predict, or estimate, using a regression model. There are two possibilities. We might wish to estimate the population mean of the mlus for all children of a given age, X, the quantity we have previously designated by MLU(X). This can be done by using the confidence interval:

Ŷx ± constant × sr × √{ 1/n + (X − X̄)²/((n − 1) sx²) }

where Ŷx is the value predicted by the regression equation for age X and the constant is again chosen from the t-tables with (n − 2) df. For example, for the MLU of children aged 34 months we would calculate a 95% confidence interval as:

2.70 ± 2.23 × 0.599 × √{ 1/12 + (34 − 37.08)²/(12.573² × 11) }

or 2.70 ± 0.40

Hence, with 95% confidence, we can estimate that MLU(34) lies between 2.30 and 3.10. The calculation of such confidence intervals is the best way to display how precise, or otherwise, are the estimates. This interval is narrowest when estimating the MLU for children whose age is equal to the average for the sample. Note that both the terms under the square root sign will decrease as the sample size increases.

The second possibility is that we might wish to predict the mlu of an individual child of age X, say 34 months. We should not expect predictions for individuals to be as precise as the estimation of the mean for groups of individuals. The correct formula for a confidence interval for mlu(X), the mlu of an individual child of age X, is:

Ŷx ± constant × sr × √{ 1 + 1/n + (X − X̄)²/((n − 1) sx²) }

Again the constant is chosen from t-tables with (n − 2) df. For X = 34 months we have:

2.70 ± 2.23 × 0.599 × √{ 1 + 1/12 + (34 − 37.08)²/(12.573² × 11) }

i.e. 2.70 ± 1.39

We can interpret this by saying that as a result of our regression study we are 95% confident that a randomly chosen child aged 34 months will have an mlu of between 1.31 and 4.09. Note that in the expression under the square root sign this time there is a term (the first) which does not decrease with the sample size.

13.6 Assumptions made when fitting a linear regression
The first assumption is that the value of the independent variable should be known exactly for each element of the sample. In the example this means only that we should be certain of the age of each child. If there is some imprecision in the values of the independent variable, this will cause us to underestimate the slope of the regression line which, in turn, will affect the accuracy of the predicted values of the dependent variable.

Secondly, the distribution of the dependent variable about its mean value, for a given value of the independent variable, should be approximately normal. This is equivalent to requiring that the residuals have a normal distribution with mean zero.

Thirdly, the residual variance, or standard deviation, of the dependent variable about the line should be the same at all points of the line. In other words, the variability of the dependent variable should not be related to its predicted value. For example, it sometimes happens that the variability of the residuals increases in step with the predicted value.
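Both kinds of interval can be computed from the same summary values; a sketch reproducing the two 95% intervals at age 34 (t constant 2.23 with 10 df, as in the text):

```python
import math

# Summary values from the mlu example: n = 12, sr = 0.599,
# mean age 37.08 months, sx = 12.573.
n, sr, xbar, sx, t = 12, 0.599, 37.08, 12.573, 2.23
y_hat = -0.2897 + 0.0878 * 34    # predicted MLU at 34 months, about 2.70

def half_width(x, individual):
    core = 1 / n + (x - xbar) ** 2 / ((n - 1) * sx**2)
    if individual:               # the extra "1 +" term for a single child
        core += 1
    return t * sr * math.sqrt(core)

print(round(half_width(34, False), 2))  # about 0.40: mean MLU(34)
print(round(half_width(34, True), 2))   # about 1.39: an individual child
```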
You might notice such an effect if you plot the squared residuals against the predicted values, as we have done for the mlu example in figure 13.7, although in this case there is no evidence of any relationship. A special plot of the residual values on normal probability paper can give a rough check on the validity of the second assumption. Unfortunately, the special paper is not always readily available and the technique is rather tedious, so it will not be described here. Some computer packages, such as MINITAB (appendix B), will give a normal probability plot as part of the output whenever a regression line is fitted. The points on this plot should lie approximately on a straight line if our second and third assumptions are justified. If you are carrying out important research and use regression as one of your analytical techniques you should consult a statistician about this and other diagnostic tools.

Figure 13.7. Plot of squared residuals against fitted values for the regression of figure 13.5.

In 10.5, we pointed out that a very few extreme values may have a disproportionate effect on the value of a sample correlation (figure 10.9). Exactly the same will occur with the estimate of the slope of a regression line (see exercise 13.1). As usual, the data should be carefully examined by means of scattergrams to see whether such distortion is a possibility. The program MINITAB draws attention to any individual data points which seem to have an overwhelming influence on the estimation of the regression model.

13.7 Extrapolating from linear models
In 13.5 we presented formulae for the calculation of confidence intervals of predicted values. Each contained the term (X − X̄)². The larger this value is, the less precise will be the predictions. The value of (X − X̄)² is smallest, zero, when the independent variable is equal to X̄, the mean of the sample from which we estimated the regression equation. If we try to predict values of the dependent variable corresponding to values of the independent variable far from the sample mean, X̄, the value of the term (X − X̄)² will become very large and the predictions may be so imprecise as to be worthless. In any case there are special dangers involved in extending the prediction outside the range of values of the independent variable observed in the sample. It may be that the sample coincided with the part of a non-linear graph which can be approximated reasonably well by a straight line: figure 13.8 displays some possible examples. In general it is not wise to extrapolate outside the observed range of the independent variable.

Figure 13.8. Examples of situations in which extrapolation outside the observed data range would be highly misleading.

13.8 Using more than one independent variable: multiple regression
The simple linear regression model offered an opportunity to increase the precision by which we could make statements about a variable (mlu) by exploiting information about a second variable (age), the variables being observed simultaneously. There is no need to restrict ourselves to just one independent variable. It may very well be possible to improve the prediction of the values of the dependent variable by observing more than one independent variable. We will discuss the possible advantages of this approach, and the new features that are introduced, by considering a problem involving two independent variables.
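The widening of the confidence interval away from the sample mean can be seen numerically; a sketch using the summary values of the mlu example (ages far beyond 60 months lie outside the observed range, so those intervals would not be trustworthy in any case):

```python
import math

# Half-width of the 95% confidence interval for the mean MLU at age x,
# using the mlu summary values: n = 12, sr = 0.599, mean age 37.08,
# sx = 12.573, t constant 2.23 on 10 df.
n, sr, xbar, sx, t = 12, 0.599, 37.08, 12.573, 2.23

def half_width(x):
    return t * sr * math.sqrt(1 / n + (x - xbar) ** 2 / ((n - 1) * sx**2))

for age in (37, 50, 70, 100):    # ever further from the sample mean
    print(age, round(half_width(age), 2))  # the half-width keeps growing
```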
The extension to a larger number is quite straightforward. Consider the following hypothetical situation.

In the assessment of foreign language learners teachers have available to them several proficiency tests which are often cumbersome and difficult to administer. A teacher with limited time available might prefer to use a simpler test, but only if he could relate the scores on this test to the general proficiency test which has been standardised on a large sample. He has an idea from previous research that a cloze test might be such a predictor. To check on the predictive value of the cloze, he can use the simple linear regression technique above. He also envisages the possibility that another simple test, e.g. of vocabulary, will serve as a predictor of the general proficiency scores. Again the linear regression technique can be used. He chooses 30 students who have recently taken the standard proficiency test, and persuades them to take in addition a cloze test and a simple vocabulary test. The scores of the 30 students in these three tests are given in table 13.2. Figure 13.9 shows the three scattergrams which can be constructed from the data.

Table 13.2. Hypothetical scores of 30 students on three tests (columns: Student; Proficiency test (Y); Cloze (X1); Vocabulary (X2))

Figure 13.9. Scattergrams of data from table 13.2: proficiency test against cloze test (r = 0.8963), proficiency test against vocabulary test (r = 0.8378), and vocabulary test against cloze test (r = 0.7082).

From the scattergrams it is clear that the scores from both the simpler tests are correlated with the proficiency test score (and that the cloze test scores are somewhat correlated with the vocabulary test scores). However, the cloze test, X1, has apparently a slightly higher correlation (0.8963) with the proficiency test, Y, than has the vocabulary test, X2 (0.8378), so that it may be a better predictor of the proficiency test score. Let us begin by fitting the simple linear regression model:

Yi = α + βX1i + ei

Using the procedures introduced above, we find the best fitting regression is:

Ŷ = 24.19 + 3.433 X1

with a residual standard deviation, sr = 7.372. The sample mean and standard deviation of the 30 observed cloze scores are X̄1 = 10.87 and s1 = 4.26.
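With these summary values the individual prediction interval of 13.5 can be applied directly; a sketch for a student scoring 8 on the cloze test, taking the 95% t constant for 28 df to be about 2.04:

```python
import math

# Fitted simple regression of proficiency on cloze score (from the text):
# Y-hat = 24.19 + 3.433*X1, sr = 7.372, n = 30, cloze mean 10.87, sd 4.26.
n, sr, xbar, s, t = 30, 7.372, 10.87, 4.26, 2.04

def predict_individual(x):
    y_hat = 24.19 + 3.433 * x
    half = t * sr * math.sqrt(1 + 1 / n + (x - xbar) ** 2 / ((n - 1) * s**2))
    return y_hat - half, y_hat + half

lo, hi = predict_individual(8)
print(round(lo), round(hi))  # about 36 and 67 marks
```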
Using the appropriate formula of 13.5 above, he can say that, on the basis of this sample, he is 95% confident that the examination score of a student who scored, say, 8 marks in the cloze test lies in the interval:

Ŷ(8) ± 2.04 × sr × √{ 1 + 1/30 + (8 − 10.87)²/(29 × 4.26²) }

or 51.65 ± 15.40, i.e. between 36 and 67 marks.

This interval is too wide to be useful. It could be reduced by increasing the sample size, but the first term inside the square root sign will always have the same value and eventually the only way to narrow the interval will be to reduce sr. But this can be done only by fitting a better predicting model. How much more precise might the prediction be if we could take account of the student's vocabulary test score at the same time as his cloze test score? To answer this question we need to extend the regression model to include, as independent variables, the scores on both the cloze test and the vocabulary test. The most obvious way to do this is to add an extra term to the previous model and write:

Yi = α + β1X1i + β2X2i + ei

which says that the proficiency examination score of the i-th student in the sample is arrived at as the sum of four terms, a constant, a multiple of his score (X1i) on the cloze test, a multiple of his score (X2i) on the vocabulary test and a random error (ei) involved in arriving at his proficiency score. There are now three model parameters to estimate: α, β1 and β2 (whose estimates we designate by a, b1, b2). With more than one independent variable, it is rather tedious to carry out the calculations by hand and we will present the results obtained using the program MINITAB.

We find that the least squares regression estimate of the model gives:

a = 13.708  b1 = 2.329  b2 = 1.057

i.e. Ŷ = 13.708 + 2.329 X1 + 1.057 X2

The residual standard deviation is now sr = 5.715, which should be compared with a value of 7.372 when the model containing only the cloze test score was fitted. As might be expected, including the extra term in the model has allowed it to explain more of the variability in the proficiency scores, thus reducing the level of residual variation left unexplained by the model. In fact, the percentage reduction in the variance of the proficiency scores has been increased from 80% (using cloze test only) to 88% (using the vocabulary test as well).

Table 13.3. 95% confidence intervals for prediction of proficiency test scores of individual students

                                    Independent variables used
Cloze test     Vocabulary test
score (X1)     score (X2)          X1 alone    X2 alone    X1 and X2
 5             15                  26-57       29-67       29-54
 5             20                  26-57       40-78       34-59
 5             25                  26-57       51-89       39-65
10             15                  44-74       29-67       41-65
10             20                  44-74       40-78       46-70
10             25                  44-74       51-89       51-75
15             15                  61-91       29-67       51-77
15             20                  61-91       40-78       57-82
15             25                  61-91       51-89       63-87

Table 13.3 shows the 95% confidence intervals for the predicted scores of a student who obtains various (arbitrarily chosen) scores in the cloze test and the vocabulary test. It can be seen that the intervals calculated from the model that uses the scores from both preliminary tests are always narrower (i.e. more precise) than those obtained using just one of the scores. They are still rather too wide to be very useful. Can they be made even more precise? There are many possible reasons why the scores in the proficiency test are not predicted very precisely, even using two independent variables, but there are three which have particular importance. First, the model may not be correct. Suppose we write Ȳ(X1, X2) to mean the average score in the proficiency test of all the students in the population who would score X1 marks in the cloze test and X2 marks in the vocabulary test. Then the multiple regression model implies that this average can be calculated exactly by the formula:

Ȳ(X1, X2) = α + β1X1 + β2X2

If this is not true for some values of the independent variables, the predicted proficiency scores for students with those values of X1 and X2 will be biassed. This 'lack of fit' will also inflate the value of the residual standard deviation and hence widen all the confidence intervals. The model might be improved by the inclusion of more independent variables but the more complex it becomes, the harder it is to interpret and the more information is needed before it can be used for calculating predicted values. We discuss this possibility further in the following section.

Second, even if the model is correct, so that the mean proficiency score is well predicted for given values of the two independent variables, there may still be considerable variation between individual proficiency test scores of students who all have the same scores on the cloze and vocabulary tests; in other words the residual variation may be large.
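For two independent variables the least squares estimates can be obtained from the sample variances and covariances; a minimal sketch (the book's own estimates came from MINITAB, and the data here are made up simply to show that a plane generating the data exactly is recovered):

```python
# Two-predictor least squares via the normal equations, written in
# covariance form.
def fit_two_predictors(y, x1, x2):
    n = len(y)
    def mean(v):
        return sum(v) / n
    def cov(u, v):
        ub, vb = mean(u), mean(v)
        return sum((a - ub) * (b - vb) for a, b in zip(u, v)) / (n - 1)
    s11, s22, c12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    cy1, cy2 = cov(y, x1), cov(y, x2)
    det = s11 * s22 - c12**2     # assumes x1, x2 not perfectly correlated
    b1 = (cy1 * s22 - cy2 * c12) / det
    b2 = (cy2 * s11 - cy1 * c12) / det
    a = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return a, b1, b2

# Data generated exactly from a plane: the plane is recovered.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [2 + 3 * u + 0.5 * v for u, v in zip(x1, x2)]
a, b1, b2 = fit_two_predictors(y, x1, x2)
```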
Again the effect of this will be to inflate the residual standard deviation, leading to imprecise prediction of proficiency scores for individuals. In this case, too, it may be possible to make improvements by adding another independent variable, if one can be discovered which explains a significant proportion of the residual variance left unexplained by the first two.

The third possibility is that, even if the model is adequate, the sample size is not sufficient to give precise estimates of the parameters. The important quantity here is the number of degrees of freedom available for the estimation of the residual standard deviation, which is obtained by subtracting from the total sample size the number of parameters which have been estimated in the model. In our example, the residual degrees of freedom are 30 − 3 = 27, which ought to be quite sufficient. Increasing the sample size will always improve the precision of the parameter estimates but the scope for improvement will be limited if either of the first two sources of variability is present.

13.9 Deciding on the number of independent variables
In the example discussed in the previous section there were two possible independent variables which could be used to predict the value of a dependent variable. If we decide to use only one of them, we should clearly use the one which has the higher correlation with the dependent variable, since that ensures that the residual variance will be as small as possible. Although adding in the second will cause some further reduction in the residual variance, it will also complicate the model, and it is advisable to increase the complexity of the model in this way only if the extra reduction is statistically significant. In other words, we would like to test whether the second independent variable can explain a significant amount of the variance left unaccounted for by the simple regression model involving only the first variable.

This can be done by exploiting a mathematical link between linear regression and analysis of variance. It is perfectly possible to produce an ANOVA table as part of a regression analysis. We will not explain here how the calculations are carried out, but the principle is important and we will use the ANOVA tables produced by MINITAB as part of the regression analysis of the test scores of the 30 students. In table 13.4(a) is the ANOVA corresponding to the regression of proficiency test scores on the single independent variable, cloze test score. The effect of the regression is highly significant (F1,28 = 114.4).³

Table 13.4. ANOVA for the regression analysis of the data of table 13.2

(a) ANOVA for the model Y = α + βX1
Source        df    SS         MSS        F-ratio
Regression     1    6217.87    6217.87    114.4
Residual      28    1521.63      54.34
Total         29    7739.50

(b) ANOVA for the model Y = α + β1X1 + β2X2
Source        df    SS         MSS        F-ratio
Regression     2    6857.56    3428.78    105.0
Residual      27     881.94      32.66
Total         29    7739.50

SS explained by each variable in order given:
Source        df    SS
X1             1    6217.87
X2             1     639.68

The ANOVA for the multiple regression model containing the cloze and vocabulary test scores as independent variables appears in table 13.4(b). It is in two parts. The first simply shows that the multiple regression accounts for a significant proportion of the variance (F2,27 = 105.0). This is not surprising, since we already know that just one independent variable causes a significant effect. The second part of the ANOVA allows us to test whether the additional effect of adding in the second independent variable is significant. The extra reduction in the residual sum of squares (639.68) due to adding in X2 is tested for significance by dividing it by the residual mean square (32.66). The result, 639.68 ÷ 32.66 = 19.6, should be compared with critical values of F1,27 and is highly significant. We conclude that it is worthwhile to include both X1 and X2 in the model.

This whole procedure generalises easily to more than two independent variables. The only new feature is that it will not usually be obvious what is the best order in which to introduce additional independent variables into the regression equation. Many computer packages will include an option for stepwise regression in which the 'best' regression model (i.e. the model which leaves the smallest residual variance after it is fitted) with just one independent variable is calculated, then the best model with just two variables, and so on, until the addition of further variables does not cause a significant improvement.

³ If we test the significance of the regression using the test suggested in 13.4, we get t = 10.7 = √114.4. This relationship between the t-distribution with k degrees of freedom and the F-distribution with (1,k) degrees of freedom holds generally, and ensures that the test of hypothesis H0: β = 0 will give the same result whichever test statistic is used.
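The extra-sum-of-squares test can be reproduced directly from the entries of table 13.4; a sketch:

```python
# Does adding X2 after X1 explain significantly more variance?
ss_reg_x1_only = 6217.87   # regression SS with X1 alone, table 13.4(a)
ss_reg_both = 6857.56      # regression SS with X1 and X2, table 13.4(b)
residual_ms_both = 32.66   # residual mean square of the larger model

extra_ss = ss_reg_both - ss_reg_x1_only   # the gain from adding X2
f_extra = extra_ss / residual_ms_both     # compare with F on (1, 27) df
print(round(extra_ss, 2), round(f_extra, 1))  # about 639.69 and 19.6
```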
If there are many possible independent variables, the single variable which gives the best fitting model will not necessarily appear as one of the independent variables in the best two-variable model. It is also possible to carry out stepwise regression by first including all the possible independent variables and deleting them one at a time. You should consult a statistician if you are considering fitting models in this way.

13.10 The correlation matrix and partial correlation
The value of adding in a second independent variable, X2, to a regression model containing one independent variable, X1, depends on how much information about the dependent variable, Y, is contained in X2 which was not already contained in X1. This, in turn, depends largely on the correlation between X1 and X2. If they are highly correlated, most of the information contained in one of them will also be contained in the other. Because of the importance of these correlations it is usual to present them in a special tabular form called the correlation matrix. For the example introduced in 13.8, this is the sample correlation matrix:

       Y        X1       X2
Y      1
X1     0.8963   1
X2     0.8378   0.7082   1

Here the number in any position is the correlation between the variables which are used as labels for the corresponding row and column. All the values in the main diagonal are exactly equal to 1, since they represent the correlations between each variable and itself. The values above this diagonal are frequently left out since they already appear in the lower triangle. For example, the value left off in the upper right-hand corner is 0.8378, since r(Y,X2) = r(X2,Y).

The two 'independent' variables X1 and X2 have a correlation of 0.7082, which means that they are certainly not independent of one another. To some extent the information contained about the dependent variable in either one of them will be duplicated by the other. It would be useful to have a measure of how much one of these variables is correlated with the dependent variable even after allowing for the information provided by the other. Such a measure is the partial correlation coefficient. We will use the notation r(Y,X2 | X1) to mean the correlation of X2 with Y after the information contained in X1 has been taken into account. Technically, it is said that r(Y,X2 | X1) is the correlation between Y and X2 with X1 partialled out. It is calculated from the formula:

r(Y,X2 | X1) = [ r(Y,X2) − r(Y,X1) r(X2,X1) ] / √{ [1 − r²(Y,X1)] [1 − r²(X2,X1)] }

             = (0.8378 − 0.8963 × 0.7082) / √{ (1 − 0.8963²) (1 − 0.7082²) }

             = 0.6485

To some extent it is possible to interpret partial correlation coefficients in a similar way to ordinary correlation coefficients. In particular, 0.6485² = 0.42 = 42%, and we can say that adding X2 into the regression model after X1 will account for 42% of the residual variance still remaining after X1 has been fitted in the simple regression model (see exercise 13.3). The concept of partial correlation can be particularly useful when investigating the relationship between two variables, both of which are affected by a third variable in which we may not be particularly interested.

13.11 Linearising relationships by transforming the data
Sometimes mathematical transformations are used to make the data fit better to simple models. Figure 13.10 shows a hypothetical relationship between two variables X and Y. The curve drawn through the points is clearly not a straight line.

Figure 13.10. Graph of the relationship Y = 0.1X^6.7.
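The partial correlation of 13.10 is easy to verify from the entries of the correlation matrix; a sketch:

```python
import math

# Partial correlation of Y and X2 with X1 held fixed, using the sample
# correlations of 13.10.
r_y1, r_y2, r_12 = 0.8963, 0.8378, 0.7082

r_partial = (r_y2 - r_y1 * r_12) / math.sqrt((1 - r_y1**2) * (1 - r_12**2))
print(round(r_partial, 4))  # about 0.6485, as in the text
```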
is clearly not a straight line. It is similar in shape to curves which can
be expressed by an equation of the form:

Y = AX^B

where A and B are constants or parameters. Now, instead of Y consider
its logarithm, log Y:

log Y = log(AX^B) = log A + B log X

(If you do not understand the algebra then clearly you ought to consult
a statistician before attempting this procedure.) If we write W = log Y,
Z = log X, a = log A and b = B, the equation can be written:

W = a + bZ

which is exactly the form of the simple linear regression model introduced early in the chapter. Figure 13.11 shows a scatter diagram of W (log Y) against Z (log X) and indicates a much more linear relationship than was apparent in the previous figure. A linear regression could then be fitted safely to the logarithms of the original scores.

Although there are many situations in which data can be transformed to new values which are more suitable (in a technical sense) for statistical analysis, a major disadvantage of this process is that the interpretation of the analysis in terms of the original problem will be more difficult. The results obtained will be in terms of the transformed values, while it is likely that the interest of the experimenter will remain with the raw data in the original units. We cannot pursue this further here, but whenever you are advised to transform data it is best to make sure - before going to the trouble of carrying out the analysis - that it will be possible to interpret the results in the way that you require.

13.12 Generalised linear models
We have discussed the traditional transformations of non-normal data in the previous section, and the majority of extant textbooks dealing with ANOVA or regression will state that this is the only option for coping with such data. However, in recent years statisticians have developed tools for analysing directly many types of data without any need for them to be normally distributed. Indeed, these same tools can cope with data which are non-standard in other ways. For example, ANOVA can be carried out even when there are different numbers of subjects in each of the experimental groups, whereas traditional methods allow this only for one-way ANOVA. These new methods fall under the general heading of generalised linear models (GLMs). They require specialised computer packages - e.g. GLIM (generalised linear interactive modelling), developed at the Rothamsted Experimental Station in England, or the GLM option of SAS, the statistical analysis package usually available on large IBM computers - which may not be widely available. Furthermore, some professional statisticians are not yet familiar with GLMs and their analysis. However, there is no doubt that GLMs will become more widely used in the social sciences, and a recent book by Gilbert (1984) provides an accessible introduction to their applications.
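The logarithmic transformation discussed above, fitting W = a + bZ in order to recover Y = AX^B, can be sketched numerically. The data below are invented: they are generated exactly from Y = AX^B with A = 0.1 and B = 0.7, so the least-squares fit to the logarithms recovers the parameters precisely; real scores would of course scatter about the line.

```python
import math

# Hypothetical data generated exactly from Y = A * X**B with A = 0.1, B = 0.7.
xs = [1, 2, 4, 8, 16, 32]
ys = [0.1 * x ** 0.7 for x in xs]

# Work with W = log Y and Z = log X, so that W = a + bZ with a = log A, b = B.
zs = [math.log10(x) for x in xs]
ws = [math.log10(y) for y in ys]

n = len(zs)
z_bar = sum(zs) / n
w_bar = sum(ws) / n

# Least-squares estimates: b = COV(Z, W) / s_Z^2 and a = mean(W) - b * mean(Z).
cov_zw = sum((z - z_bar) * (w - w_bar) for z, w in zip(zs, ws)) / (n - 1)
var_z = sum((z - z_bar) ** 2 for z in zs) / (n - 1)
b = cov_zw / var_z          # recovers B = 0.7
a = w_bar - b * z_bar       # recovers log A = -1
A = 10 ** a                 # back-transform the intercept to the original scale
```

The back-transformation in the last line is the step that makes the fitted model interpretable in the original units, which is exactly the concern raised in the text.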

SUMMARY
This chapter dealt with linear regression models and how to fit them.
(1) The algebraic equation for a straight line was introduced as Y = a + bX, where the parameters a and b were the intercept and slope respectively. Y was called the dependent or predicted variable and X the independent or predictor variable.
(2) The residuals were defined as the differences between the observed and predicted values of the dependent variable. The least squares regression line was calculated by b = COV(X,Y)/s_X² and a = Ȳ − bX̄.
(3) The residual standard deviation, s_r, was calculated by:

s_r = √[ ((n − 1)/(n − 2)) (s_Y² − (COV(X,Y)/s_X)²) ]

Figure 13.11. Graph of the relation W = a + bZ, where W = log Y, Z = log X, a = log A (= −1), b = 0.7.

(4) The hypothesis H0: β = 0 (β being the 'true' slope of the line) versus H1: β ≠ 0 was tested by using the test statistic t = b·s_X·√(n − 1)/s_r, referring this to tables of the t-distribution with (n − 2) degrees of freedom.
(5) Formulae (13.5) were given for calculating confidence intervals for predicted values.
(6) It was shown how to extend the model to incorporate several predictor variables in a multiple linear regression.
(7) The concept of partial correlation was explained and discussed.
(8) It was noted that when a relationship is apparent between two variables which clearly cannot be represented by a straight line it may be possible to transform one or both variables so that the relation between the transformed variables is linear.
(9) It was pointed out that the use of generalised linear models allows ANOVA or regression analysis of many sets of data which do not meet the assumptions for the traditional analyses.

EXERCISES
(1) Amend table 13.1 by including these two extra observations:

Child   Age   MLU
13      24    5.72
14      56    1.90

(a) Recalculate the linear regression for the extended table.
(b) Calculate fitted values and residuals for the new scores.
(c) Test the significance of the new linear regression.
(d) Calculate a 95% confidence interval for the new MLU for children aged 30 months.
(e) Calculate the new 95% confidence interval for a child of 34 months.
(f) Compare all your calculations (a)-(e) with the values derived from the original table.
(2) Using the relevant data of table 13.2, calculate the linear regression for predicting proficiency test score from the vocabulary test score. Verify a selection of the confidence intervals in the 'X2 alone' column of table 13.3.
(3) (a) Calculate r(Y, X1 | X2), the correlation of Y and X1 with X2 partialled out.
(b) Estimate how much of the variability in Y which is not explained by X2 will then be accounted for by taking X1 into consideration as well.
(c) (Harder!) Show that the total percentage of the variability in Y which is accounted for by X1 and X2 together will be the same whether X1 is fitted first and X2 afterwards or vice versa.

14
Searching for groups and clusters

To this point in the book the methods introduced and discussed have been appropriate to the presentation and analysis of data consisting of a single variable observed on each of several sample units or subjects. We have discussed situations which involved several experimental conditions, or factors, in the ANOVA chapter, but the data itself consisted of observations of the same variable under each of the experimental conditions. It is true that in chapter 13 examples were introduced where several variables were measured for each subject, but one of these variables was given a special status (the dependent variable) and the others (the independent variables) were used to assist in its analysis via multiple regression. A rather different situation arises when several variables are observed on each sample unit or subject and we wish to present the data or to extract the information in the complete set of variables without giving a priori a special status to any of them. An example of this, which we shall develop in the next chapter, would be the set of scores of a number of subjects on a series of foreign language tests, with no reason for the scores on any one test to be regarded differently from the scores on the remainder of the tests. It is in such cases that multivariate statistical analysis is appropriate. Before looking at any technique in particular it may be helpful to make some general remarks about multivariate analysis.

14.1 Multivariate analysis
Multivariate analysis is a term used to cover a considerable variety of techniques for the description, simplification, synopsis and analysis of data sets consisting of several different variables all measured on the same sampling units or subjects. The notation of 13.8 is adapted easily for multivariate data sets by writing X_ij for the value or score of the j-th variable observed for the i-th individual. Suppose that we have a total of p different variables all observed on a sample of n individuals. The data can then be written as a matrix:

X11  X12  X13  ...  X1p
X21  X22  X23  ...  X2p
X31  X32  X33  ...  X3p
 .    .    .         .
 .    .    .         .
Xn1  Xn2  Xn3  ...  Xnp

The vector (X_i1, X_i2, X_i3, ..., X_ip) is often called the i-th observation or the observation on the i-th individual.

Most multivariate methods require extensive calculations and will often be operated on large quantities of data. For these reasons multivariate analysis will usually require a large and powerful computer equipped with special software packages. In this chapter and the next we will consider in detail only a few of the techniques which are, or could be, used in language studies. We will not attempt to describe how any of the calculations are carried out but will restrict the discussion to the motivation and meaning of each technique. Most multivariate methods are descriptive; they are used to provide succinct means of presenting large volumes of complex data. Few multivariate methods have been developed for fitting models or making precise inferences. Hypothesis testing is relatively rare with multivariate data, partly because the theory is difficult to develop and partly because it is rather hard to believe that a complex set of variables will meet the underlying assumptions required for those tests which have been developed.

Early in this book, in chapter 2, a distinction was made between two different types of data: the categorical, for which each observation was a type, e.g. male or female, and the numerical, for which each observation was a numerical value, e.g. a test score. The same distinction carries over to the multivariate case and the two types of data will often need different treatment. Suppose, for example, that an experiment is carried out to see whether school children with different linguistic backgrounds use English colour words differently. Different languages divide up the colour spectrum in different ways; closely related languages like English and Welsh, or English and French, may have minor differences, while the differences between English and Urdu may be extensive. For children who have already become used, in their home background, to a particular mode of labelling familiar items by colour, the learning of a new set of colour terms which may divide the colour spectrum differently can initially present problems. One way to carry out such an experiment would be to present, say, 12 objects of similar size and shape but of different colours to a number of children from each of several different language backgrounds, and ask every child to name the colour of each object. The resultant observations will consist of a list of 12 colour words for each subject.

How can we decide whether, say, children who learned Welsh as a first language tend to use different colour names from those who are English monolinguals? Or if children whose first language is Urdu are different from both groups in the labels they apply? If all the children, irrespective of their first language, use the same 12 colour names to describe the objects, again there is no problem. Of course, it is most unlikely that this will happen. Some Welsh will probably use a set of names which, for at least some objects, agrees with the names used by some English children. On the other hand, there will be variation within the two groups. The question to be considered is whether children of one group seem to make choices which agree more within that group than with the choices made by children from another group.

The problem is to decide what is meant here by 'agree more'. There is no obvious way to give a single score to each child and then compare variability in scores within and between groups (perhaps using ANOVA). There is no 'correct' name for the colour of each object with which we can compare the names given by the individual children. On the other hand, it is not difficult to think of a measure of the extent to which any two children are in agreement. One obvious possibility is to look at the names given by one child and compare these, object by object, with the names given by the other. The number of times that the two children use the same name for the same object could be used as a measure of the extent to which they use colour names in the same way.

It is, in fact, more usual to measure how dissimilar the subjects are, and the total number of mismatches would give a measure of that. One problem that frequently arises with this kind of experiment is that some subjects give no colour name, or give an unintelligible response, for one (or more) of the objects. We might then decide to discount the corresponding object when comparing this subject with others. This will tend to reduce, artificially, the number of mismatches. To exaggerate: a subject with no responses could be found to have no mismatches with anybody! Of course, such a subject should be discarded; it is usual to discard subjects with more than a very few missing or unintelligible responses. To cope with the problem of missing values of this type the dissimilarity or distance between two subjects would commonly be calculated as the proportion of mismatches among the objects for which both subjects give an intelli-
Table 14.1. Colour names given by three subjects to twelve objects

                         Object
          1  2  3  4  5  6  7  8  9  10  11  12
Subject 1  C  D  G  H  E  L  J  A  B  F   I   K
Subject 2  C  D  I  H  E  J  K  A  B  F   G   L
Subject 3  C  D  I  L  E  -  J  A  B  G   -   K

Note: Each of the letters represents a different colour name, while a dash stands for a missing response.

Table 14.2. Dissimilarity matrix of three subjects calculated from their colour descriptions of 12 objects

    Subject
    1      2      3
1   0
2   0.417  0
3   0.300  0.400  0

(0.417 = 5/12; 0.300 = 3/10; 0.400 = 4/10)

gible response. This type of distance measure is often called a matching coefficient. It may be felt that a child's inability to find a colour name for a particular object is important and should be included as part of the assessment of the (dis)similarity between two subjects. In that case a missing value could be treated as the 'colour name' BLANK for the corresponding object and the subjects always compared on their responses (including BLANKS) on the full set of objects. The proportion of mismatched responses would still make a sensible measure of the 'distance' between two subjects.

14.2 The dissimilarity matrix
When all the distances, or dissimilarities, have been calculated for every pair of subjects in a sample they can be presented in the form of a matrix known as the dissimilarity matrix. In table 14.1 we have given the responses from three hypothetical subjects presented with 12 different colour stimuli. Note that there are a few missing responses. Table 14.2 gives the corresponding dissimilarity matrix. Only the bottom half of the matrix is given: the missing half can be filled in by symmetry, since the dissimilarity between subject i and subject j must be the same as the dissimilarity between subject j and subject i. The main diagonal consists of zeros, since there can be no mismatches between a subject and himself. These two properties, symmetry and zeros on the main diagonal, are characteristic of a dissimilarity matrix.

Although we introduced the idea of dissimilarity as a means of comparing subjects scored on categorical variables, it is perfectly possible to construct a dissimilarity matrix from numerical variables. For example, suppose that ten students have each taken eight tests. A possible measure of similarity in the pattern of performance of two students across the tests is the correlation, r, between their sets of eight scores. Now, two students with exactly the same scores would have a correlation of 1. If we then choose as a measure of dissimilarity 1 − r² (not an uncommon choice), two students scoring exactly the same marks would be found to have zero dissimilarity. Table 14.3 gives the correlation coefficients for pairs of students and table 14.4 gives the corresponding dissimilarity matrix.

Table 14.3. Correlation matrix of ten subjects calculated from their scores on eight tests

Subject  1      2      3      4      5      6      7      8      9      10
1        1.000
2        0.384  1.000
3        0.729  0.571  1.000
4        0.088  0.381  0.265  1.000
5        0.543  0.773  0.510  0.121  1.000
6        0.409  0.682  0.635  0.358  0.526  1.000
7        0.371  0.239  0.297  0.187  0.211  0.345  1.000
8        0.709  0.538  0.700  0.122  0.596  0.468  0.384  1.000
9        0.714  0.693  0.859  0.356  0.627  0.729  0.416  0.705  1.000
10       0.639  0.663  0.724  0.415  0.630  0.641  0.249  0.526  0.867  1.000

Table 14.4. Matrix of dissimilarities of ten subjects calculated as (1 − correlation²)

Subject  1      2      3      4      5      6      7      8      9      10
1        0
2        0.853  0
3        0.469  0.674  0
4        0.992  0.855  0.930  0
5        0.705  0.402  0.740  0.985  0
6        0.833  0.535  0.597  0.872  0.723  0
7        0.862  0.943  0.912  0.965  0.955  0.881  0
8        0.497  0.711  0.510  0.985  0.645  0.781  0.853  0
9        0.490  0.520  0.262  0.873  0.607  0.469  0.827  0.503  0
10       0.592  0.560  0.476  0.828  0.603  0.589  0.938  0.723  0.248  0
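The calculation behind table 14.2 can be sketched in a few lines. The responses are those of table 14.1, and the distance function is the matching coefficient just described: the proportion of mismatches over the objects for which both subjects give an intelligible response.

```python
# Matching-coefficient dissimilarities for the three subjects of table 14.1.
# Each entry is a colour-name letter; None stands for a missing response.
responses = {
    1: ["C", "D", "G", "H", "E", "L", "J", "A", "B", "F", "I", "K"],
    2: ["C", "D", "I", "H", "E", "J", "K", "A", "B", "F", "G", "L"],
    3: ["C", "D", "I", "L", "E", None, "J", "A", "B", "G", None, "K"],
}

def dissimilarity(r1, r2):
    """Proportion of mismatches over objects where both responses are intelligible."""
    pairs = [(a, b) for a, b in zip(r1, r2) if a is not None and b is not None]
    return sum(1 for a, b in pairs if a != b) / len(pairs)

d12 = dissimilarity(responses[1], responses[2])  # 5 mismatches in 12 objects
d13 = dissimilarity(responses[1], responses[3])  # 3 mismatches in 10 objects
d23 = dissimilarity(responses[2], responses[3])  # 4 mismatches in 10 objects
```

Treating a missing value as the colour name BLANK instead would simply mean dropping the `None` filter and comparing all twelve positions.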

Looking at the correlation matrix we find that the highest correlation of all is 0.867, between subjects 9 and 10, and we could say that these seem to be the pair that are most alike. Subjects 9 and 3 have a very similar correlation (0.859) and could be linked together. Since subject 9 is linked to both 3 and 10 we could say that these three subjects form a cluster. If we require that a correlation be at least, say, 0.80 before a link is formed this would be the only cluster of subjects to be discovered using correlations in this way. If the 'alikeness' criterion is relaxed to allow correlation values down to 0.75 we find that subjects 2 and 5 are linked, so that we now have two clusters of subjects, (3, 9, 10) and (2, 5). If we link all those with correlations down to 0.70, the former group grows to include subjects 1 (linked with 9 and 3) and 8 (linked with 3 and 1), while the latter cluster remains with just the same two subjects. By gradually allowing weaker and weaker linkage (smaller correlations) we will eventually arrive at a point where all the subjects form a single cluster (exercise 1). Clearly, instead of using a similarity criterion, linking subjects with the greatest similarity, we could use a dissimilarity criterion, linking subjects with the smallest dissimilarity (exercise 2).

If, in the last example, instead of 1 − r² we used √(1 − r²), this would still be a sensible measure of dissimilarity. In general, for any set of data it is possible to imagine many different measures of 'distance' which would lead to different similarity matrices. We will return to this point later in the chapter. Note, in passing, that it will not always be the clustering of the subjects which is the endpoint of a cluster analysis. It will frequently be the case that interest centres on the different variables on which the subjects were scored. If that were the case here we could obtain a measure of the similarity between any two tests by calculating the correlation of the scores on those tests over the ten subjects.

Now that we have a measure of how dissimilar each individual is from all the others in the sample, it ought to be possible to explore questions such as whether Welsh children are less dissimilar to one another, in the main, in their choice of colour names than they are to children of other language backgrounds. There are many ways to continue using the dissimilarity matrix as a starting point. We will describe two widely used techniques: hierarchical cluster analysis and multidimensional scaling.

14.3 Hierarchical cluster analysis
It is difficult to give a good description of this technique in the abstract. We will therefore introduce it via the discussion of a recently published example of its use.

Baker & Derwing (1982), in a study of the acquisition of English plural inflections by 120 children from 2 to 7 years, have developed an analytical technique based on hierarchical cluster analysis to investigate the manner in which the rule system for these inflections is acquired. The data for their study comes from Innes (1974), who employed an improved version of Berko's (1958) technique for eliciting plurals from young children. This technique employs pictures of real or invented objects, with monosyllabic names, which the child is asked to identify. So, for example, the eliciting utterances by the experimenter might be (using Berko's most famous nonsense word) 'Here is a picture of a wug. Now there are two of them. There are two --.' The child's responses constitute the data for analysis.

The focus of such studies is the child's acquisition of a rule of pluralisation. There are three regular allomorphs of plural in English, and their application is conditioned by features of the final consonant or vowel of the noun to which they are attached. The three forms are /z/, /s/ and /ɪz/: /z/ is the form used after vowels and voiced consonants except /dʒ/, /z/ and /ʒ/; /s/ appears after voiceless consonants except /tʃ/, /s/ and /ʃ/; and /ɪz/ is the form used after stems ending in /tʃ/, /dʒ/, /s/, /z/, /ʃ/ and /ʒ/. There are, of course, exceptions to this rule in English - there are various irregular nouns like foot or sheep which behave differently. But the vast majority of nouns in the language, and any new nouns added, conform to this pluralising rule. In the context of linguistically based approaches to acquisition, a question of interest is how children develop this rule. At what point are they able to treat any new noun appropriately by adding the inflection appropriate to the stem-final item, and what are the stages by which they proceed to this knowledge?

As Baker & Derwing (1982) point out, this is not an easy question to answer from a cross-sectional developmental study. An analysis is required such that stages, or developmental patterns, emerge from the data. The stages of development ought to be identified as a consequence of the analysis. However, the analytical methods adopted with data of this kind have tended to obscure subject-determined patterns of response by using percentage correct scores (generally, the number of children responding correctly to a given item), and then age-blocking the data to try to discern developmental trends. This has the effect of tying the children's performance to adult norms, and of equating 'stage' with 'age'. As Derwing & Baker suggest, it may be that a given data set ought to be arranged by age groups, but this is something that should emerge from the data rather than be imposed on it in advance. A further problem is that the interpretation of group percentage correct scores by age as the
basis for the inference of rules in individual children can be quite misleading. Suppose, for example, that in response to a group of final stems requiring /z/ plurals, the pattern emerged as shown in the following table:

             Child
Stem-final   A  B  C  D  E
Vowel        ✓  ×  ×  ×  ×
/l/          ×  ✓  ×  ×  ×
/b/          ×  ×  ✓  ×  ×
/d/          ×  ×  ×  ✓  ×
/g/          ×  ×  ×  ×  ✓

We can see that each child has a 20% success rate on /z/-pluralised items, and that if these five stem-finals are the only ones included in the experiment, then the percentage correct rate for children of this age on voiced stem-finals is 20%. Each child, however, contributes to this result quite differently. Baker & Derwing utilise hierarchical clustering to overcome the problems of age-blocking and percentage correct as a measure, and to search out, in the data, groups of children who are treating similar subsets of stem-final segments as classes.

Table 14.5. Response coincidence matrix for one child (Baker & Derwing 1982: fig. 1)

Baker & Derwing had available data consisting of the responses of 94 children to 24 stimuli. Their first problem was to construct a measure of the dissimilarity between each pair of subjects based on the observed responses. Since they wished to leave aside the question of 'correct' or 'incorrect' responses (compared with adult norms) they began by constructing for each child a matrix showing the relationship between the child's responses to the different stimuli. A typical one of these matrices is reproduced here as table 14.5 (Baker & Derwing 1982: fig. 1). The 24 stimuli and the child's responses are written across the top and at the left of the table. Wherever two stimuli elicited the same response, that response was entered in the corresponding cell of the matrix; otherwise the entry was zero. (An irrelevant response was marked Irr.) Once the 94 'response coincidence' matrices had been constructed (each matrix contains 276 entries) the dissimilarity between each pair of children was then calculated as the proportion of the 276 entries in the coincidence matrices of the two children which were not the same. The resulting dissimilarity matrix, made up of the dissimilarities between each pair of children, was then used as input to a standard computer package known as CLUSTAN (see Wishart 1978). Part of the output is shown in figure 14.1 (Baker & Derwing 1982: fig. 2). This graph (often called a dendrogram) has subject numbers along the horizontal axis and values of the dissimilarity on the
vertical axis.

Figure 14.1. Dendrogram produced by CLUSTAN from the dissimilarity matrix of the 94 children (Baker & Derwing 1982: fig. 2)

The diagram can be used to see which children seem to be 'closest together' in the form of their responses (as measured by their response coincidence matrices). For example, subjects 5 and 102 are closer together (dissimilarity = 0.3) than either is to any other subject.¹ The same is true of subjects 107 and 119 (dissimilarity = 0.2), the latter pair being more alike than the former as indicated by the relative values of the dissimilarities. (Note that it is necessary to take care when reading off values of the dissimilarity near the bottom of the diagram - reproduced here as it appears in the original. The horizontal axis has been drawn in at a level which would correspond to a dissimilarity of −0.33, although negative values are meaningless. Presumably this is a peculiarity of the computer package used by Baker & Derwing.)

The dendrogram can be used to look for clusters of children who have become grouped together by the way in which they have responded to the stimuli (according to their individual response coincidence matrices). A cluster is defined by setting a maximum value, d, of the dissimilarity and drawing a line horizontally from that value all the way across the cluster diagram. Every time the horizontal line meets a vertical line a cluster is identified which includes all the subjects who can be reached by moving downwards along the dendrogram from the point of intersection. Each of these clusters has the property that every child in the cluster has at least one 'nearest neighbour' with whom his dissimilarity score is not greater than d. From figure 14.1 we can see that fixing d = 3 defines two clusters, the first containing 20 subjects (from subject 5 to subject 77 on the horizontal axis) and the second containing the remaining 74 subjects. By using d = 1.5 Baker & Derwing found four distinct groups, which they designated I, II, III, IV in their figure. They point out that the links between pairs of subjects in groups I and II are higher up (i.e. correspond to larger dissimilarities) than those in III and IV, suggesting that the latter two groups provide clearer or more consistent results than groups I and II in terms of how they treat the test items. They then go on to argue that the subjects in the four groups demonstrate different levels of evolution of pluralisation rule formation.
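Cutting a single-linkage dendrogram at a dissimilarity d is equivalent to linking every pair of subjects whose dissimilarity is no greater than d and reading off the connected components of the resulting graph. A minimal sketch of that idea, using a few of the correlations from table 14.3 and the 'correlation at least 0.75' criterion of 14.2 (here the link is formed when the similarity is high enough):

```python
# Single-linkage clusters at a similarity threshold: subjects are linked when
# their correlation is at least r_min, and clusters are the connected
# components of the graph of links. Only a few of the pairwise correlations
# from table 14.3 are listed; all omitted pairs fall below 0.75.
corr = {
    (9, 10): 0.867, (3, 9): 0.859, (2, 5): 0.773,
    (1, 3): 0.729, (3, 10): 0.724, (1, 9): 0.714,   # below the 0.75 cut-off
}
subjects = range(1, 11)

def clusters(r_min):
    parent = {s: s for s in subjects}
    def find(s):                      # follow links to the component root
        while parent[s] != s:
            s = parent[s]
        return s
    for (i, j), r in corr.items():
        if r >= r_min:                # link sufficiently similar pairs
            parent[find(i)] = find(j)
    groups = {}
    for s in subjects:
        groups.setdefault(find(s), set()).add(s)
    return [g for g in groups.values() if len(g) > 1]

# With r_min = 0.75 the two clusters found in 14.2 emerge: {3, 9, 10} and {2, 5}.
found = sorted(sorted(g) for g in clusters(0.75))
```

Lowering `r_min` admits weaker links and so grows or merges the clusters, which is exactly the effect of moving the horizontal cutting line down the dendrogram.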

¹ Although only 94 children are being compared, the subject numbers go as high as 120. Baker & Derwing eliminated 26 of the original subjects whose response pattern was judged to conform too closely (at least 21 of 24 items agreeing) to adult norms or to be too far from that norm (fewer than 3 of 24 items agreeing).

14.4 General remarks about hierarchical clustering
Cluster analysis is largely exploratory and descriptive. The number and composition of clusters can depend on several decisions which
have to be made by the researcher as well as on the structure of the data. by the daia alone without the addition of a pn.D1i assumptions by the
There are essentially three stages in the process, and we will consider investigator. Of course no plausible theory can be built on the basis of
them separately. The first stage is the construction of a dissimilarity matrix. an exploratory technique. The most that can be hoped for is that clusters
For any set of data it will usually be possible to imagine several different will appear whose structure has a plausible linguistic explanation or which
ways of measuring the dissimilarity between two subjects, all of which suggest a novel linguistic hypothesis. A specific investigation should then
ways give different numerical results. For purely categorical data the be carried out to see whether these tentative hypotheses can be confirmed
matching coefficient is very comtnon, but it is certainly not the only option. before any firm conclusions are reached.
Using different similarity measures may give quite different results. The
second stage is to adopt a mathematical criterion - the clustering algor- '45 Non-hierarchical clustering
ithm- to convert the dissimilarity matrix to a cluster diagram. Suggestions One special problem with hierarchical cluster analysis is that
for suitable criteria abound and the suitability of a particular criterion a few very close neighbours can distort the analysis and 'mistakes' made
depends in part on the type of dissimilarity measure used. It is difficult

to give general guidelines. The most important feature of this stage is
the degree of linkage which is required. A subject may be allocated to
a group if he is sufficiently like just one member of that group (single


linkage) - this was the criterion we adopted in the example of 14.2. Or it may be required that the subject be sufficiently close to all pre-existing members of the group (complete linkage); or some intermediate level of linkage may be required. (Single linkage is frequently called the 'nearest-neighbour' criterion, since a subject is added to a cluster if he is sufficiently like just one individual already included - the nearest neighbour; complete linkage, on the other hand, is referred to as 'furthest-neighbour' since a new subject has to be sufficiently close to all the current members of the cluster, including the one from whom he differs most.) The third stage is the choice of a value of the dissimilarity at which groups will be defined as in figure 14.1. Different values of d can be chosen to define different numbers of groups.
There are two ways of viewing the high degree of arbitrariness in the cluster analysis technique induced by leaving the investigator the choice of dissimilarity measure, clustering algorithm and cut-off dissimilarity value. It could be considered that this is a fatal defect of the procedure, in that too many subjective decisions have to be made; alternatively, it might be felt that the wide variety of possible choices is a positive benefit, allowing the technique to have a useful flexibility. Marriott (1974: 58), discussing methods based on the analysis of dissimilarity measures, says 'The experimenter must choose that which seems to him best for his problem; the mathematician cannot give much guidance. It is precisely this subjective element that gives distance-based methods their particular value.' This viewpoint is justified provided that cluster analysis is used primarily as an exploratory technique to look for plausible structure.

[Figure 14.2. An example of two clusters visible to the eye which would be difficult to detect using hierarchical cluster analysis.]

Clusters defined early in the clustering process cannot be undone later. Figure 14.2 demonstrates a case where this would happen. The rectangle represents a real pasture in which a wild flower is growing, the dots representing individual plants. It is clear that there are two clusters of plants, possibly the siblings of two different ancestors. However, if the distance between each pair of plants is used as a measure of their dissimilarity and a single linkage cluster algorithm is used, the first 'cluster' to form will consist of plants 1, 2, 3, 4, 5. The other plants will gradually be incorporated into this spurious cluster (which, if hierarchical clustering is used, can never be broken up) and the two clusters which are obvious in the diagram will not be identified in the dendrogram. There are other techniques devised for reducing multivariate data to a two-dimensional graph which do not suffer from this defect, although, on the other hand, they may leave it entirely to the eye of the experimenter to detect 'clusters' in this pictorial representation of the data.
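The stages just described - choose a linkage criterion, then merge groups whose linkage dissimilarity falls below a cut-off value d - can be sketched in code. This is an illustrative sketch only: the subject labels and the dissimilarity matrix below are invented, not taken from the worked example in the text.

```python
# Agglomerative clustering on a dissimilarity matrix, with a choice of
# 'single' (nearest-neighbour) or 'complete' (furthest-neighbour) linkage
# and a cut-off dissimilarity below which groups are merged.

def cluster(dissim, labels, linkage="single", cutoff=1.0):
    """Repeatedly merge the two closest clusters until no inter-cluster
    linkage dissimilarity is below the cut-off."""
    clusters = [{label} for label in labels]
    index = {label: i for i, label in enumerate(labels)}

    def d_between(c1, c2):
        ds = [dissim[index[a]][index[b]] for a in c1 for b in c2]
        # single linkage: nearest neighbour; complete linkage: furthest
        return min(ds) if linkage == "single" else max(ds)

    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = d_between(clusters[i], clusters[j])
                if d < cutoff and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return [sorted(c) for c in clusters]
        _, i, j = best
        clusters[i] |= clusters[j]
        del clusters[j]

# five invented subjects: A, B, C resemble one another, as do D and E
labels = ["A", "B", "C", "D", "E"]
dissim = [[0.0, 0.2, 0.3, 0.9, 0.8],
          [0.2, 0.0, 0.4, 0.9, 0.9],
          [0.3, 0.4, 0.0, 0.8, 0.9],
          [0.9, 0.9, 0.8, 0.0, 0.1],
          [0.8, 0.9, 0.9, 0.1, 0.0]]

groups = cluster(dissim, labels, "single", cutoff=0.5)
# → [['A', 'B', 'C'], ['D', 'E']]
```

With complete linkage and a stricter cut-off the same data can give a different partition, which is exactly the arbitrariness discussed above.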
Searching for groups and clusters

14.6 Multidimensional scaling
As with cluster analysis, the starting point for multidimensional scaling is to define a measure of distance or dissimilarity for the objects under study (subjects, variables, etc.). Unlike cluster analysis, multidimensional scaling is not designed explicitly to link the experimental elements in clusters. Instead, its object is to construct a pictorial representation of the relationships between the different elements implied by the values in a dissimilarity matrix. Many of the computer programs commonly used for carrying out multidimensional scaling provide for nonmetric scaling, in which the actual magnitude of the dissimilarity between two elements is not preserved but in which the rank order of dissimilarities is reproduced as far as possible. Suppose, for example, A, B and C are three of the subjects being analysed and the dissimilarity between A and B is twice as large as the dissimilarity between B and C. A nonmetric scaling method attempts only to provide a picture in which A, B and C will be represented by points such that A and B are at least as far apart as B and C; no attempt will be made to place the points so that the distance between A and B is still twice the distance between B and C. (See Shepard, Romney & Nerlove 1972 for further details.) Again, a full description of a linguistic example seems the best way to describe the process in detail.
Miller & Nicely (1955) carried out a study of the perceptual confusions among 16 English consonants spoken over a voice communications system with frequency distortion and random masking noise. One of the confusion matrices resulting from that study (their table V) is reproduced in table 14.6. The spoken stimuli are indicated by the consonants in the first column of the table, while those reported in response by the listener are indicated across the top of the table. Miller & Nicely analysed the data using information theory techniques to investigate the effect of distortion and noise on five articulatory features or 'dimensions': voicing, nasality, affrication, duration and place of articulation. They concluded that perception of any of these features is relatively independent of the perception of the others and that voicing and nasality were least affected by their different experimental conditions. Let us look at a multidimensional scaling analysis of this study.
The first step is to define a measure of the dissimilarity between any two of the 16 phonemes studied in the investigation. The less similar two consonants are, the less frequently should one be confused for the other. Suppose that the i-th phoneme was uttered as a stimulus n_i times in total and that n_ij is the number of times the j-th phoneme was perceived when the i-th was uttered.

[Table 14.6. Confusion matrix for the 16 English consonants (adapted from Miller & Nicely 1955, table V); the spoken stimuli label the rows and the perceived responses label the columns.]
Then a possible measure for the similarity, s_ij, between these two phonemes would be the proportion of times that either one was perceived as the other:

s_ij = (n_ij + n_ji) / (n_i + n_j)

and the dissimilarity can then be defined as d_ij = 1 - s_ij. Shepard (1972) carried out a multidimensional scaling on a dissimilarity matrix derived from the Miller & Nicely data to obtain the two-dimensional scaling solution of figure 14.3. It can be seen that the 16 phonemes are distributed along two dimensions which can be identified as a voicing dimension and a nasality dimension. Furthermore, the 'central' voiced fricatives (alveolar and alveopalatal) [z, ʒ] appear close together, as do the labial and alveolar nasals [m, n]. Among the other voiced consonants we can see relationships between [b] and [v], [v] and [ð] on the one hand, and [b] and [g] on the other. The voiceless phonemes to the left of the vertical axis cluster fairly well into the unvoiced stops (k, p, t), the alveolar and alveopalatal voiceless fricatives (s, ʃ) and the 'dental' voiceless fricatives (f, θ). This is a striking demonstration of the way that multidimensional scaling can indicate the structural relationships between individual elements based on dissimilarities between pairs of individuals. Furthermore, one of the important motivations for carrying out multidimensional scaling is to reduce the apparent dimensionality of the data (cf. principal components analysis in the next chapter), and that, too, has occurred here. Although there were initially five articulatory dimensions to describe the 16 consonants, a satisfactory representation has been achieved here in two dimensions.

[Figure 14.3. Multidimensional scaling of data on the perception of speech sounds (adapted from Shepard 1972). The two axes correspond to voicing and nasality; the labelled regions are 'voiced nasals', 'voiced stops and fricatives' and 'voiceless stops and fricatives'.]

14.7 Further comments on multidimensional scaling
In the example given above, clusters of a particular composition were expected, or hoped for, a priori. This will often not be the case: multidimensional scaling is frequently carried out to see whether useful elements of structure will be suggested by the two-dimensional solution. If an experimenter has no prior expectations about clusters which may arise from the analysis, it will frequently be difficult to decide what should constitute a cluster. It may help to carry out both a hierarchical clustering and a multidimensional scaling on the same dissimilarity matrix. Shepard (1972) gives an example of how the results from both can be combined in a single diagram.
As with hierarchical clustering, there are many different mathematical methods and computer algorithms for carrying out multidimensional scaling. However, provided nonmetric scaling is used, the different methods give very similar results. Nevertheless, if there is not a two-dimensional solution which retains the correct ranking for all the dissimilarity values (it may not be possible to find one) it may occasionally happen that different approximate solutions can be obtained from the same data using the same computer program. Discussion of this problem and how to avoid it should appear in the document which explains how to use any computer program for multidimensional scaling, under the heading of 'suboptimal solutions' or 'local maxima'. A full discussion of multidimensional scaling and its use can be found in Shepard, Romney & Nerlove (1972). A suite of computer programs called MDS(X) has been produced to carry out multidimensional scaling (and hierarchical cluster analysis) - see Coxon (1982).

14.8 Linear discriminant analysis
Cluster analysis and multidimensional scaling are techniques whose chief purpose is to search for that structure in multivariate data
which enables clusters of individuals such as subjects, variables, experimental conditions etc. to be identified. They can be applied to any kind of data, numerical or categorical, providing a measure of dissimilarity can be defined and calculated for each pair of individuals. By contrast, linear discriminant analysis is a technique for verifying that apparent clusters are real and for deciding to which cluster a new individual, observed in addition to those used to determine the clusters, should be assigned. The type of linear discriminant analysis to be described here is applicable when all the various scores or measurements observed for each individual are approximately normally distributed. Similar procedures do exist for categorical data, but they are less common and will not be discussed further here. Again we introduce the technique via an example discussed by Fletcher & Peters (1984).

[Table 14.7. Scores on two linguistic variables of 29 subjects (columns: subject, X1, X2, Y, Ŷ). X1 is the number of unmarked verb forms (UVF) in 200 utterances; X2 is the number of verb types (VT) in 200 utterances; Y = -1 means that the subject was diagnosed as language-impaired; Ŷ is the discriminant score discussed in 14.9.]

[Figure 14.4. Scattergram of data in table 14.7: the circled points originate from language-impaired subjects.]

The study explores the grammatical and lexical dimensions which characterise the expressive language of language-impaired children. Spontaneous language data were collected under standard conditions from a group of 20 normal children (henceforth LN) and a group of nine children diagnosed as language-impaired, using standardised test criteria (henceforth LI) and matched for age and intellectual functioning with the LN group (mean age for the LN group was 60.86 months, and for the LI group 62.33 months). Samples of 200 utterances from each group were scored on a set of 65 grammatical and lexical categories - largely derived from the LARSP procedure outlined in Crystal, Fletcher & Garman (1976). One of the grammatical variables scored was unmarked verb forms (UVF) - the number of lexical verb stems which had neither a suffix nor any auxiliary premodification. One of the lexical categories used was verb types (VT), referring to the number of different lexical verbs used in the sample by a child. The scores for each subject on these two variables are given in table 14.7 and figure 14.4 shows the scattergram obtained by plotting X2 (VT) against X1 (UVF) for each subject. The solid line in the figure divides the graph into two regions, one of which contains only points corresponding to LI subjects; the other contains points corresponding to the 20 LN subjects and to one LI subject. This suggests that it may be possible to formulate a simple rule, based solely on the scores
on UVF and VT, for assigning each subject to one of the categories 'language-impaired' or 'normal'. In fact, when there are only two categories (or groups or clusters) as in this case, such a rule can be developed using the multiple regression techniques of chapter 13.

14.9 The linear discriminant function for two groups
Suppose that a number of subjects have been scored on several variables, X1, X2, etc., and in addition, each subject in the sample is known or believed to belong to one of two categories. Create a new variable, Y, which takes only two values -1 and +1. All the subjects of one category (arbitrarily chosen - the LI group in the present example) are given a score of Y = -1 on the new variable; all the subjects of the other category score Y = 1 (the LN group, in our case). Y is then used as the dependent variable for a multiple regression analysis, X1, X2, etc. being the independent variables. Such an analysis was carried out on the data of table 14.7 to obtain the multiple regression equation:

Y = -0.87710 - 0.02770 X1 + 0.10354 X2

Using this equation we find that the 'predicted value of Y' for subject 1 is:

Ŷ1 = -0.87710 - 0.02770 × 42 + 0.10354 × 19.00 = -0.073

while the 'observed value' of Y is -1, since that subject was in the initial LI group. In the present context the multiple regression equation is referred to as the discriminant function and the fitted value Ŷi is usually called the discriminant function score of the i-th subject. The discriminant function scores of the 29 subjects are given in the final column of the table. Seven of the nine language-impaired subjects have negative scores and all the subjects in the LN group have positive scores. Subject 3 has a score more typical of the normal group. If we leave this subject out of consideration for the moment we find that the maximum score among the other LI subjects is only +0.056 while the minimum score for the normal language group is +0.126. In other words, if we adopt the rule that any subject whose discriminant function score is less than, say, 0.1 will be assigned to the LI group, while if the score is greater than 0.1 we assume the subject is linguistically normal, only one of the 29 subjects of the sample (subject 3) will be wrongly classified.
However, there is a degree of arbitrariness in this discrimination rule. Why use the value 0.1 as the boundary value for separating discriminant scores into two groups? Any cut-off value between 0.056 and 0.126 will cause all the normal-language subjects to be correctly classified and only one of the LI subjects (subject 3) to be misclassified. Why then choose 0.1 as the cut-off value? Does it matter? If our only object is to see how well we can discriminate between the two types of subject in the sample, it does not. However, it does matter which value we choose if we would like to use the discriminant function to classify new subjects (i.e. subjects not included in the original sample) into the normal or impaired categories solely on the basis of their UVF and VT scores in 200 utterances. We then need to choose a specific value for the discriminant score which will be the boundary score between the two groups. It might then seem that a reasonable choice would be 0.091, exactly mid-way between 0.056 and 0.126. However, this choice ignores the unfortunate fact that one of the LI subjects had a discriminant score higher than 0.056; subject 3 has been ignored entirely in choosing the cut-off value. We really need some method of choosing the boundary value which takes into account all the discriminant scores in the sample. It is more usual to proceed as follows. The mean discriminant score for the LI group was -0.300 (the mean of the first nine discriminant scores, Ŷ, in table 14.7). The mean discriminant score for the 20 subjects in the LN group is 0.685. Commonly the cut-off point would be chosen to be mid-way between these two means, i.e. at 0.1925. With this value we find that three subjects of the sample would now be misclassified. Subject 3, as before, with a discriminant score of Ŷ3 = 0.730, which is greater than 0.1925, would be wrongly assigned to the normal category. Subjects 21 and 23 with discriminant scores Ŷ21 = Ŷ23 = 0.126 would be wrongly classified as language-impaired. It may seem perverse to choose a cut-off value which misallocates three subjects when it is possible to choose a value which assigns only one subject to the wrong group. Nevertheless, there are good reasons for doing this which are discussed in the following section.

14.10 Probabilities of misclassification
The classification rule suggested at the end of the previous section can be summed up thus. For any subject, obtain UVF and VT scores as described at the beginning of 14.8. Calculate the discriminant score, Ŷ, for the subject using the discriminant function:

Ŷ = -0.8771 - (0.0277 × UVF) + (0.10354 × VT)

If Ŷ is greater than 0.1925 assign the subject to the normal category, otherwise assign him to the language-impaired category. On this basis one of the language-impaired subjects (subject 3) of the nine in the sample would be incorrectly classified, as would two of the 20 subjects with normal language.
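The whole procedure - fit the regression, score the sample, set the cut-off mid-way between the two group means, then classify a new subject - can be sketched as follows. The scores below are invented toy values, not the Fletcher & Peters data, and numpy's least squares stands in for the multiple regression step.

```python
# Two-group linear discriminant analysis via regression on Y = -1 / +1,
# with invented toy scores for six subjects on two variables.
import numpy as np

# X columns: the two predictors; y: -1 for group 1, +1 for group 2
X = np.array([[42.0, 19.0], [55.0, 16.0], [60.0, 14.0],   # group 1
              [28.0, 27.0], [15.0, 23.0], [23.0, 24.0]])  # group 2
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

# add an intercept column and fit y = b0 + b1*X1 + b2*X2 by least squares
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

scores = A @ coef                      # discriminant scores for the sample
m1, m2 = scores[y < 0].mean(), scores[y > 0].mean()
cutoff = (m1 + m2) / 2                 # mid-way between the two group means

# classify a new subject (intercept, X1, X2) not used in the fit
new_subject = np.array([1.0, 50.0, 15.0])
group = "group 2" if new_subject @ coef > cutoff else "group 1"
```

With equal group sizes the mid-way cut-off comes out at zero here, because a least-squares fit with an intercept makes the fitted scores average to the mean of y.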
The two normal-language subjects misclassified are subjects 21 and 23. The proportion of language-impaired subjects actually misclassified is 1/9; the proportion of the normal subjects misclassified is 2/20. This might seem to imply that the probability of misclassifying a subject of either type is about 10%. Unfortunately this is almost certain to be a serious underestimate of the true probability of misclassification. The coefficients of the discriminant function (-0.8771, -0.0277, 0.10354) have been estimated using a sample of 29 subjects. A different sample containing subjects of both types would lead to different estimates for these coefficients. The estimation procedure calculates the discriminant function which gives the best discrimination possible for the subjects in the sample. (Quite often a discriminant function can be found which makes no classification errors in the sample.) A new subject whose scores were not used to calculate the discriminant function is more likely to be misclassified than a subject of the original sample. This is particularly true when, as here, the discriminant function has been estimated from a small number of subjects. Provided the values of the discriminating variables (UVF and VT) are normally distributed over the target population and the subjects for the exercise are chosen randomly, then it is possible to estimate the probability of misclassification in either direction.
In table 14.7 each subject is given the label -1 or 1 to indicate the group to which he belongs. It is convenient to refer to the subjects of the group labelled with -1 as type 1, the others being type 2. Then we can define the probabilities P12 and P21 by:

Pij = P(subject of type i is misclassified as type j)

We will write mi to stand for the observed mean discriminant score of subjects of the i-th type and c to stand for the cut-off value to divide the discriminant scores into two groups. (Here m1 = -0.300, m2 = 0.685 and c = 0.1925.) Then P12, the probability of wrongly identifying as type 2 a new subject who is really type 1, can be estimated as follows:

Calculate Z1 = (c - m1) / √(m2 - m1) = 0.4925 / √0.985 = 0.4962

From tables of the standard normal distribution we find that the probability of exceeding this value (i.e. getting a discriminant score big enough to misclassify a type 1 subject who should be achieving a small score) is 0.31. This is our estimate of P12. To estimate P21 we calculate:

Z2 = (c - m2) / √(m2 - m1) = -0.4925 / √0.985 = -0.4962

and the probability of having a smaller Z value than this is 0.31. For this example we have P12 = P21 = 0.31. That the two types of misclassification have equal probabilities is a consequence of the cut-off value being chosen exactly mid-way between the sample discriminant mean scores of the two groups. It is perfectly possible to choose a cut-off value which makes one type of misclassification less likely than the other (exercise 14.3). If the proportions of the two types in the population are known a priori then it is possible to choose the cut-off point to minimise (P12 + P21), the total probability of misclassification (see Overall & Klett 1972: 9.7). The technique of linear discriminant analysis can be extended to the study of several groups (see Overall & Klett 1972: ch. 10; Bennett & Bowers 1976: 7.10). A linguistic study which uses multiple discriminant analysis can be found in Bellinger (1980).

SUMMARY
This chapter has introduced and discussed the notion of multivariate data and their analysis.
(1) It was pointed out that most multivariate methods are descriptive.
(2) A dissimilarity matrix was defined and the example of a matching coefficient was presented. It was shown that when continuous variables are observed on each subject the correlation between subjects can be used as a measure of closeness and clusters of subjects identified.
(3) Hierarchical cluster analysis was explained and an example was given. The concept of linkage was discussed and the terms nearest neighbour and furthest neighbour defined. It was stressed that cluster analysis is essentially exploratory.
(4) It was stated that a two-dimensional 'map' or picture of a set of multivariate data can be drawn by means of multidimensional scaling, and the meaning of nonmetric scaling was explained. An example was presented to show that meaningful dimensions can be identified by a multidimensional scaling.
(5) Linear discriminant analysis was explained as a special case of multiple regression (chapter 13). It was shown how to use the discriminant function and the discriminant scores to allocate subjects to one of two possible groups. It was shown how to establish a cut-off value and how to estimate misclassification probabilities.

EXERCISES
(1) Using the nearest-neighbour approach, continue the clustering process begun in 14.2 until a single cluster is eventually established. If you were told that really there were two types of subject in the sample, each of whom should
belong to a different group, what would be the two groups you would suggest on the basis of the above clustering?
(2) Using the dissimilarity matrix of table 14.4, repeat the clustering process for the ten subjects, but this time putting in the same cluster those subjects who have least dissimilarity to one another.
(3) Using the value of c = 0.1 as a cut-off, recalculate the misclassification probabilities for the discrimination example at the end of the chapter.
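The probability calculation of 14.10 (and hence a check on exercise (3)) can be reproduced with a few lines of code. This sketch uses math.erf for the standard normal distribution function; the helper name and its default arguments are our own, with the group means taken from the worked example in the text.

```python
# Estimating the two misclassification probabilities of section 14.10.
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def misclassification(c, m1=-0.300, m2=0.685):
    """P12 and P21 for cut-off c, following the recipe in section 14.10."""
    s = sqrt(m2 - m1)
    p12 = 1.0 - phi((c - m1) / s)   # type 1 wrongly classified as type 2
    p21 = phi((c - m2) / s)         # type 2 wrongly classified as type 1
    return p12, p21

p12, p21 = misclassification(0.1925)   # the worked example: both about 0.31
```

Calling `misclassification(0.1)` instead recomputes the two probabilities for the cut-off of exercise (3); since 0.1 is no longer mid-way between the group means, the two probabilities are no longer equal.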
15
Principal components analysis and factor analysis

In the previous chapter we defined the idea of a multivariate observation and looked at multivariate analysis techniques for discovering and confirming the presence of special groups among the observed individuals. In the present chapter we will look at methods designed specifically to reduce the dimensionality of the data.

15.1 Reducing the dimensionality of multivariate data
Suppose that scores on p variables, X1, X2, ..., Xp, have been observed for each of n subjects (see 15.3 for a language testing example). Typically, the variables will be intercorrelated, each variable having a higher correlation with some of the other variables than it does with the remainder. It is quite common that the pattern of intercorrelations is rather complex. As is often the case, the major statistical interest will lie in considering the differences between individuals. The special problem to be faced now is that the subjects can be different in a variety of ways. Two subjects may have similar scores for some of the variables and quite dissimilar scores on some of the others. If the number of variables is large, then it may be difficult to decide whether subject A differs more from subject B than from subject C since the pattern of differences may be quite dissimilar in the two cases. This prompts the question of whether it is possible to reduce the whole set of scores for each subject to a single score on a new variable in such a way that the variability between the subjects over the original set of variables is somehow expressed by the score on the single new variable. Here we have a situation frequently encountered in economics where, for example, a single variable (the retail price index) is used to indicate changes in the price of a large number of commodities or (the Dow Jones industrial share prices index) to measure movements in the value of shares traded on the New York Stock Exchange.
The advantage of reducing a complex multidimensional data set to a single variable is obvious. Use of the retail prices index allows statements
to be made about 'rising prices' or 'inflation' without the need to describe explicitly a confused situation in which prices of some foodstuffs are rising while others are falling, fuel is becoming more expensive while the cost of mortgages is decreasing, and so on. The disadvantage is equally obvious. The changes in any subset of the original variables are more important to some individuals than others. For example, it is of little comfort to a single investor to know that the share price index is rising in value if the few shares that he owns are rapidly becoming worthless.
When it is felt that reduction of a multivariate data set to values of a single variable may be worthwhile there remains the question of how it should be done. For the retail prices index, a 'basket' of commodities such as bread, meat, fuel, rents, mortgages, etc. is monitored, with each commodity given a numerical weight corresponding in some way to its average importance in the community. The price of each commodity is multiplied by the corresponding weight and the weighted prices are added together to create the index. (This is a gross oversimplification which nevertheless describes the essence of the procedure.) Is this type of index relevant to linguistic studies? Are indices ever used in linguistics?
Yes, they are. Two examples have appeared earlier in this book. The numbers in table 2.7 arise as values of an index. The original observation for the i-th subject was a three-dimensional vector (d_i1, d_i2, d_i3) where d_i1 was the number of tokens expressed by the subject of the first phonological variant, etc. (see 14.1 for notation). The value of the index for the i-th subject was then calculated as:

{(d_i1 + 2d_i2 + 3d_i3) ÷ (d_i1 + d_i2 + d_i3) - 1} × 100

(see 2.3). Again, we have, at various points in the text, referred to data from the Cambridge Proficiency in English Examination. This examination consists of a number of different sections, or language tasks. The observation for each candidate is the set of marks he obtains on these different tasks which are then combined into a single variable - the candidate's mark for the examination. This is a common procedure in language testing.
It is therefore relevant to ask two questions. When we are faced with a set of multivariate linguistic data, can we reduce them to a single score for each subject and still retain most of the information about differences between subjects which was contained in the original data? If this is not possible, can we at least reduce the original data to a new form of smaller dimensionality (i.e. in which each subject is scored on fewer variables) without significant loss of information? Many of the techniques of multivariate analysis are motivated by the desire to answer these questions.

15.2 Principal components analysis
Suppose, again, that we have a set of data consisting of the scores of n subjects on each of p variables, X_ij being the score of the i-th subject on the j-th variable, X_j. Usually the scores on some pairs of variables will be highly correlated; other pairs will have only small correlation. The scores of the subjects may have a larger variance on some of the variables than on others or the variances may all be roughly the same.
Principal components analysis (PCA) is a mathematical procedure for converting the p original scores (X_i1, X_i2, ..., X_ip) for the i-th subject into a new set of p scores (Y_i1, Y_i2, ..., Y_ip) for that subject in such a way that the new variables thus created will have properties which help to provide answers to the two questions posed at the end of the previous section. The endpoint of a PCA is to calculate a set of numerical weights or coefficients. An individual's score on one of the new variables, Y_k say, will be calculated by multiplying his score on each of the original variables by the appropriate coefficient and then summing the weighted scores. If we write Y_ik to mean the score of the i-th subject on the new variable, Y_k, this can be written succinctly as:

Y_ik = a_k1 X_i1 + a_k2 X_i2 + a_k3 X_i3 + ... + a_kp X_ip

where a_kj is the coefficient by which the score on the j-th original variable is multiplied when the new score is being calculated. Note that the values of the coefficients a_kj are the same for every subject. The set of coefficients, (a_k1, a_k2, ..., a_kp), used to calculate a subject's score on the new variable, Y_k, are called the coefficients of the k-th principal component, Y_k is referred to as the k-th principal component and the score Y_ik as the score of the i-th subject on the k-th principal component. There is a danger that the average reader is already becoming lost in what may seem to be a plethora of algebraic notation. A simple biological example may help to clarify the concept.
Jolicoeur & Mosimann (1960) used a PCA to analyse measurements of the length, height and width of the carapaces of painted turtles.¹ The original data is three-dimensional, X1, X2 and X3 being, respectively, the length, height and width of each turtle shell.

¹ The discussion of this data is adapted from Morrison (1976). Although Morrison's book may be too mathematical in its presentation to suit the taste of most of our readers, it can be recommended for the clear discussion of examples of all the multivariate procedures treated there.

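The extraction described in what follows can be previewed numerically: the eigenvalues of the variance-covariance matrix are the variances of the principal components, and the eigenvectors are their coefficients. This sketch applies numpy's symmetric eigendecomposition to the turtle-shell matrix reported in table 15.2 below; it is an illustration of the idea, not the computation the original authors performed.

```python
# Principal components from a variance-covariance matrix.
import numpy as np

# variance-covariance matrix of the turtle-shell measurements (table 15.2)
cov = np.array([[451.39, 168.70, 271.17],
                [168.70,  66.65, 103.29],
                [271.17, 103.29, 171.73]])

# eigenvalues are the variances of the principal components,
# eigenvectors their coefficients; eigh returns ascending order
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # largest first

total_variance = eigvals.sum()             # equals VAR(X1)+VAR(X2)+VAR(X3)
share_of_first = eigvals[0] / total_variance
```

Running this recovers the figures quoted in the text: a first component with variance close to 680.4, accounting for nearly 99% of the total variance of 689.77, and a first set of coefficients close to (0.81, 0.31, 0.50).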
Pn'ncipal components analysis and factor analysis Principal components analysis

Table I 5. I. Coejficie11ls ofPincipal components fivm a PCA Table rs.z. Vanance-covmiauce matrix ofmeasutements of
of measurements of turtle shells a sample of carapaces ofpainted turtles
Length (X 1) Height (X2) Wiuth(X,)
Component (Y)

Original dimension (X) Y,


x, 4$I.J9 r68.7o 271.17
Yz Y3 Xz 66.6s lOJ.29
Length (X 1) o.8r -o.ss -a.21 XJ I7L7J
Height (Xz) O.JI o. ro o.gs
Width (X 3) o.so o.83 -o.zs Adapted from l'vlorrison ( 1976: 274)

Adapted from Morrison (1976: 274)


is given in table 15.1, from which it can be seen that the first principal component, Y1, can be defined as:

Y1 = 0.81X1 + 0.31X2 + 0.50X3
i.e. Y1 = 0.81 (length) + 0.31 (height) + 0.50 (width)

In other words, to calculate the score on the first principal component of any turtle shell of the original sample we need only multiply the length, height and width of the shell by the appropriate weights and sum the weighted scores. Note that in table 15.1 some of the coefficients are negative. This is quite usual. The score of the i-th shell on the second component Y2 would be given by:

Yi2 = -0.55Xi1 + 0.10Xi2 + 0.83Xi3
i.e. Yi2 = -0.55 (length of i-th shell) + 0.10 (height of i-th shell) + 0.83 (width of i-th shell)

and could possibly be negative: this point is discussed further below.
Of course, all that has happened so far is that the three scores or measurements (length, height and width) have been changed into three different scores (on the three principal components). To see what has been gained in the process we have to look at the properties of these principal components.
The calculation of the coefficients of the principal components is carried out in such a way that the following statements are true:
(a) The total variance of the principal component scores of the n subjects in the sample is the same as the variance of their scores on the original variables, i.e.

VAR(Y1) + VAR(Y2) + ... + VAR(Yp) = VAR(X1) + VAR(X2) + ... + VAR(Xp)

(b) The coefficients, a11, a12, a13, ..., a1p, of the first principal component are chosen so that VAR(Y1) is as large as possible. Colloquially we say that Y1 is constructed to explain as much as possible of the total variability in the original scores of the sample subjects.
(c) The coefficients, a21, a22, a23, ..., a2p, of the second principal component, Y2, are chosen so that Y2 is uncorrelated with Y1 and Y2 is as variable as possible, i.e. explains as much as possible of the total variance remaining after Y1 has been extracted.
(d) The coefficients, a31, a32, a33, ..., a3p, of the third principal component, Y3, are chosen so that Y3 is uncorrelated with both Y1 and Y2 and explains as much as possible of the total variance remaining after Y1 and Y2 have been extracted. This process continues until the coefficients of all p principal components have been obtained.
Table 15.2 gives the variance-covariance matrix for the measurements made by Jolicoeur & Mosimann. The numbers on the main diagonal are the sample variances of the indicated variables. The off-diagonal terms are the sample covariances. (The matrix is, of course, symmetric - see chapter 10 - so that it is sufficient to fill in only the upper triangle.) We see that VAR(X1) = 451.39, VAR(X2) = 66.65 and VAR(X3) = 171.73, so that the total variance is 451.39 + 66.65 + 171.73 = 689.77.
The corresponding PCA, for which the coefficients are given in table 15.1, gives a first principal component, Y1, with variance 680.40, i.e. 98.64% of the total variance! In other words, almost all the overall variability in the three dimensions of the turtle carapaces can be expressed in a single dimension defined by the first principal component. The remaining two principal components have variances of 6.50 and 2.86 respectively, both very small proportions of the total variance.
Before leaving this example, two further points should be noted. The first is that the coefficients of the principal components are calculated or extracted from the values of the variance-covariance matrix - the actual measurements of the shells are not required; any other sample which gave rise to the same variances and covariances would result in an identical PCA. Second, it may be necessary to consider more than one component.
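Since the components depend only on the variance-covariance matrix, the whole turtle-shell example can be reproduced from table 15.2 alone. The sketch below is not part of the original text: it is a numerical check in Python (assuming the NumPy library) which extracts the principal components as the eigenvectors of the matrix in table 15.2.

```python
import numpy as np

# Variance-covariance matrix of the carapace measurements (table 15.2),
# with the lower triangle filled in by symmetry.
C = np.array([[451.39, 168.70, 271.17],
              [168.70,  66.65, 103.29],
              [271.17, 103.29, 171.73]])

# The coefficients of the principal components are the eigenvectors of C,
# and the variance of each component is the corresponding eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]        # largest variance first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

total = eigenvalues.sum()                    # = 689.77, the trace of C
share = eigenvalues[0] / total               # ≈ 0.9864 of the total variance

# First principal component: Y1 = 0.81*length + 0.31*height + 0.50*width.
# An eigenvector's sign is arbitrary, so fix it to match the book's table.
y1 = eigenvectors[:, 0]
if y1[0] < 0:
    y1 = -y1

print(np.round(eigenvalues, 2))   # ≈ 680.40, 6.50, 2.86
print(np.round(y1, 2))            # ≈ 0.81, 0.31, 0.50
```

Running this recovers the figures quoted above: a first component accounting for about 98.6% of the total variance of 689.77.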

In the turtle shell example almost 99% of the total variance was explained by the first principal component and there is obviously no need to go further. Frequently, as happens in the language testing example in the following section, several components are needed to explain a reasonably large proportion of the total variance. However many principal components are required, it is customary to attempt to interpret them meaningfully (some authors use the term to reify). From table 15.1 it can be seen that the first principal component has coefficients which are all positive, so that Y1 is a kind of weighted average of the original measurements. The score of any shell on this component is necessarily positive and will increase as any of the three measurements, length, height or width, increases. For this reason it might be considered as a 'size' component: 'large' turtle shells will have a high score on this component. The later components, on the other hand, can be thought of as 'shape' components. For example, the score on the second component of an individual shell can be either positive or negative. Shells which are unusually long for their height and width will have negative scores, while shells which are untypically short compared to their height and width will have positive scores. PCAs of biological measurements often result in the extraction of a first principal component which corresponds to an overall measure of size, which may be relatively uninteresting, and later components corresponding to aspects of shape, which may be more useful or interesting for purposes of classification or comparison.

15.3 A principal components analysis of language test scores
How may PCA be used in language studies? An obvious candidate for such analysis, mentioned earlier, would be the set of scores of a group of subjects on a battery of language tests, not all of which claim to be measuring the same ability. Some of the tests might claim to measure 'vocabulary', others 'grammar', yet others 'reading comprehension', and so on. The purpose of carrying out PCA would be to determine how many distinct abilities (appearing as components) are in fact being measured by these tests, what these abilities are, and what contribution each test makes to the measurement of each ability. This is obviously of interest to researchers in language testing. It can also be seen as possibly providing clues to the nature of language ability in general, which psycholinguists may make use of in developing theories and devising experiments.
Hughes & Woods (1982, 1983) studied the performance on the Cambridge Proficiency Examination (CPE) of candidates at different examination centres in June 1980. In this chapter we will concentrate on just 70 candidates, those who took the examination in Hong Kong. At that time the CPE could be seen as comprising 19 recognisable subtests:
(1) Multiple choice vocabulary (40 items): supply missing word in single-sentence context.
(2, 3) Multiple choice reading comprehension (each 10 items): questions on passages of about 600 words in length.
(4) Multiple choice listening comprehension (15 items): 5 questions on each of three passages read twice.
(5, 6) Essays: one requires the students to describe or narrate, the other to discuss.
(7) Extended reading: the candidates answer questions of a rather literary nature (e.g. what is the extended metaphor ...?) on a passage of English. Scoring is without reference to the quality of expression shown in the answers.
(8) In a conversation based on a photograph the candidates are scored for overall communication.
(9) In the same conversation the candidates are scored for vocabulary.
(10) The candidates speak on a prepared topic for about two minutes. They are scored for overall communication.
(11) They are also scored for 'grammar and structure' for their performance on the prepared topic.
(12) The candidates read one part in a dialogue, the examiner the other. Scoring is for pronunciation.
(Tasks 8-12 are all scored on a ten-point scale.)
(13) Situation: the candidates respond verbally to three situations put to them by the examiner. Scoring is out of 3 for each situation.
(14) Cloze (20 items): rational deletion and acceptable response scoring.
(15) Paraphrase (10 items): the first word or two of the paraphrase is given.
(16) Sentence completion (8 items): the mid-point of a sentence is omitted, the candidates having to fill the gaps 'with a suitable word or phrase'.
(17) Paraphrase (9 items): with the requirement that a given lexical item be used in making the paraphrase.
(18) Summarise (14 questions): the candidates answer questions on a passage of about 750 words, a test of their ability to 'understand, interpret and summarise'.
(19) Style: the candidates are required to convey information in a particular form or style.
Correlations between scores on the various subtests, as shown in table 15.3, varied from 0.12 (subtests 16 and 19) to 0.95 (subtests 10 and 11). The fact that two tests correlate as highly as 0.95 (and several pairs had correlations greater than 0.80) suggests strongly that the CPE was not

Table 15.3. Mean, standard deviation and variance of a sample of 70 candidates for the Cambridge Proficiency Examination

Subtest                                     Mean    Standard deviation    Variance
1. M/C vocabulary                           22.0           6.7              44.7
2. M/C reading comprehension                 5.2           1.6               2.5
3. M/C reading comprehension                 6.0           2.1               4.6
4. M/C listening comprehension               9.1           3.1               9.6
5. Essay (describe/narrate)                  8.3           2.6               6.9
6. Essay (discuss)                           7.8           2.4               5.7
7. Extended reading                          6.4           5.2              27.5
8. Conversation (communication)              5.1           2.0               3.9
9. Conversation (vocabulary)                 5.1           1.9               3.6
10. Communication (prepared topic)           5.3           2.1               4.5
11. Grammar (prepared topic)                 5.2           2.1               4.3
12. Pronunciation                            4.7           2.0               4.1
13. Conversation (response to situation)     4.7           2.0               4.0
14. Cloze                                    8.6           4.3              18.9
15. Paraphrase (free)                        5.1           2.7               7.5
16. Sentence completion                      4.4           2.5               6.1
17. Paraphrase (constrained)                10.8           3.6              13.3
18. Summarise                               13.1           5.8              33.9
19. Style                                    5.8           1.8               3.3

COMPLETE TEST                              132.2            -                440
'& I
measuring 19 distinct abilities. But how many? Hughes and Woods subjected the scores to a PCA. The results of the analysis, as output by the computer program used to carry it out, are shown in table 15.4.

[Table 15.4. Principal components of the covariance matrix of the 19 CPE subtest scores (computer output: coefficients of the first five components and the eigenvalues of the covariance matrix).]

The output begins by reminding the researcher how many variables have been measured (19) and the number of subjects involved (70). Note that the precise statement is NO. OF OBSERVATIONS = 70. As we have already mentioned, it is usual to consider the whole set of scores observed for any subject as a single observation - a multivariate observation. This is a useful convention which, apart from its mathematical convenience which we do not discuss here, serves to indicate the real size of sample involved. If an investigator chooses five subjects or five pieces of text, say, it does not matter whether he measures one variable or 60 variables on each, the sample size is still only five, and the data cannot be expected to give a reliable indication of the structure of a complete population based on a sample of five. Of course, looking at more variables may give a much more complete picture of the sample, but even this will be true only if the extra variables are not very highly correlated with those already observed. If they are, they will simply repeat information already contained in an earlier variable. It is to be recommended, in general, that the
sample size - the number of sample units or subjects - should be several times as large as the number of variables observed. The dimension of a multivariate observation is simply the number of different variables it contains.
Look next at the heading PRINCIPAL COMPONENTS OF COVARIANCE MATRIX followed by several rows of numbers, each row headed COMP1, COMP2, etc., COMP simply being an abbreviation of COMPONENT. (In the original output there were 19 such rows of which only the first three are reproduced in the table.) Each of these rows gives the values of the coefficients required to calculate scores on the corresponding principal component. For example, to obtain the score of each student on the first principal component we take:

0.51 × (score on subtest 1) + 0.07 × (score on subtest 2) + 0.12 × (score on subtest 3) + ... + 0.42 × (score on subtest 18) + 0.09 × (score on subtest 19)

The variances of all the principal components are given, in order, in the row of table 15.4 headed EIGENVALUES OF THE COVARIANCE MATRIX. The variance of COMP1 is 138.39, that of COMP2 is 14.10, and so on, and it can be seen that the variances decrease rapidly, until the last few principal components have very low variances indeed. The first five components (whose coefficients are given in the table) together account for 86% of the total variance - 66%, 7%, 6%, 4% and 3%, respectively. The remaining 14 components together account for only 14% of the total variance. As expected, the sum of the variances of the principal components (208.82) is equal to the total variance of the subtest scores (208.9) - the tiny discrepancy is due to rounding errors in the calculations. The latter total can be obtained by summing the variances in the last column of table 15.3.

15.4 Deciding on the dimensionality of the data
Our first purpose in carrying out this PCA, it will be remembered, was to determine how many distinct abilities were being measured by the Cambridge Proficiency Examination. How do we decide this? Well, if we look again at table 15.4 we see that the 19th (and last) principal component had a variance of only 0.14 (less than 0.1% of the total variance). In effect, this means that when the scores of the subjects on this component are calculated, all 70 scores turn out to be almost exactly the same. The small amount of residual variation could very easily be attributed to random measurement variation unrelated to the subjects' language proficiency. It seems most unlikely that this 19th component measures any real dimension of language proficiency. It certainly does not uncover any differences between the subjects in the sample on this putative dimension, and it is not credible that 70 subjects would achieve exactly the same score in a real test of some aspect of proficiency. A similar argument can be used to eliminate more than half the components. They have so little variation that we can readily accept the possibility that it is entirely due to random error and not to real differences in the subjects in some aspects of language proficiency. On the other hand, the first principal component has a large variance which accounts for 66% of the total variance. Does this component already measure all the reliable variability in proficiency among the 70 subjects? Does the second component, with a variance of 7% of the total, measure some genuine aspect of language proficiency? If so, it is measuring something quite distinct from whatever is measured by the first principal component, since the variables defined by the components are completely uncorrelated. An objective method has been developed recently for deciding how many components to accept. Eastment & Krzanowski's (1982) technique determines whether more information than 'random noise' is obtained through the acceptance of more principal components. The procedure begins with the first component and tests whether the second should be accepted too. If it is, the third is tested and so on until the point is reached where the inclusion of any further components would add more noise than information. The remaining components are rejected while the previous ones are all accepted. In the present case, all but the first three components were rejected by the technique. It appears that the CPE subtests were measuring three distinct (presumably language) abilities in the Hong Kong candidates. In the next section we will discuss what these three abilities might be.
The Eastment & Krzanowski method requires a special computer program and a large computing facility. In their absence a rule of thumb is frequently employed in deciding how many components to accept. The rule is: if the original data has p dimensions, assume that components which account for less than a fraction 1/p of the total variance should be discarded. There is no theoretical basis for this rule but it has frequently been adopted by users of principal components analysis. In the present example there were 19 subtests, and the rule would imply that any component accounting for less than 1/19 of the total variance (i.e. less than 5.3%) should be eliminated. On this basis, three components are accepted. The two methods give the same answer on this occasion, but that will frequently not be the case.
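The 1/p rule of thumb is simple enough to apply mechanically. Here is a small illustration in Python (not from the original text); the percentage shares of the first five components are those reported above, while the even split of the remaining 14% across the last 14 components is an assumption made purely for the example.

```python
def components_to_keep(variances, p=None):
    """Apply the ad hoc 1/p rule: keep every component whose share of
    the total variance is at least 1/p, where p is the number of
    original variables. Returns the (1-based) component numbers kept."""
    if p is None:
        p = len(variances)
    total = sum(variances)
    return [i + 1 for i, v in enumerate(variances) if v / total >= 1.0 / p]

# Percentage shares for the CPE analysis: 66, 7, 6, 4 and 3 for the first
# five components; the last 14 components share the remaining 14%
# (split evenly here as an illustrative assumption).
shares = [66, 7, 6, 4, 3] + [1] * 14
print(components_to_keep(shares, p=19))   # → [1, 2, 3]; threshold 1/19 ≈ 5.3%
```

As in the text, the rule retains exactly three components: only shares of 66%, 7% and 6% exceed the 5.3% threshold.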
15.5 Interpreting the principal components
In the example of 15.2, where PCA was carried out on the measurement of turtle shells, the three principal components were interpreted in terms of size and shape. The interpretation was carried out by looking at the coefficients of the principal components to see what weight was to be given to each of the original variables when calculating the scores on a given component. An alternative, and better, method is to look at the correlation between each component and every variable. If a component is very highly correlated with a particular variable then, in some sense, the component contains nearly all the information about differences in the subjects expressed by their scores on that variable (see chapter 10). The correlations of each of the 19 CPE subtests with the first three principal components of the scores of the Hong Kong candidates are given in table 15.5.

[Table 15.5. Correlations between the 19 CPE subtests and the first three principal components of the covariance matrix.]

As can be seen from the table, the first principal component has a noticeable correlation with all of the subtests, the lowest correlation being 0.44. This happens frequently with the first component. As noted earlier in the chapter, the biologist will frequently interpret the component as a measure of size and may find it relatively uninteresting. Similarly, a researcher investigating 'athletic competence' might give tests of skill in several different sports. The first, general, component will really be a measure of the overall health and fitness of his subjects, not of their specific athletic skills: speed, hand-eye co-ordination, etc. Our first principal component would likewise seem to represent the subjects' general level of ability in English (or perhaps something even deeper, like an aptitude for learning new languages or even general intellectual ability). It is hardly surprising that the general level of ability should be correlated with all of the 19 subtests: the better one is in English, the better one is likely to do on each subtest, broadly speaking. This is more likely to be the case when, as here, the subjects have been preparing for the test, presumably practising for its various parts.
Let us for the moment ignore the second component and turn to the third. Here there are six correlations over 0.5, the remainder being less than 0.25. These six correlations are between the component and subtests 8 to 13, all of which are related to what we might call 'speaking ability'. (Note that such a 'speaking' component did not appear in the PCA of every group analysed by Hughes & Woods. Remember that PCA is designed to pick out those components or dimensions on which the subjects had scores which were rather variable. If the subjects of a particular group

are relatively homogeneous in their overall speaking ability then no component corresponding to that skill will appear in the analysis.)
The meaning of the second component is less obvious. There are just three correlations with a magnitude of at least 0.25: 0.30 with subtest 1, -0.25 with subtest 7 and -0.43 with subtest 18. A subject who has a large, positive score on this component will be one who scores above average on the multiple choice vocabulary subtest and below average on the other two subtests, which both involve written answers to questions on texts presented (they were the only two subtests involving such a task). What ability does this second component represent? While we might wish to make something of the similarity between subtests 7 and 18, it is hard to see why performance on them should be related in the way it is to performance on subtest 1. All we can be sure of is that an important source of variability in the subjects was that while some did well in subtest 1 and badly in subtests 7 and 18, others showed the reverse pattern. It simply does not seem possible to identify the second principal component with a specific language skill.
Our failure to identify this component, indicated as significant by the Eastment-Krzanowski criterion, makes this an appropriate moment to point out the exploratory nature of PCA, especially in this study. The CPE is a practical language test, not a research instrument, and little is known of the subjects except that they took the examination at a particular centre. Using PCA in this way is best thought of as a preliminary step to more carefully controlled studies in which components indicated by PCA can be investigated in greater detail. Further data from the Hughes & Woods study will be found in the exercises in previous chapters.
Before leaving this topic we should point out that (except in the special case discussed in 15.7) the standard output of a PCA is unlikely to include the correlations between components and variables. However, these correlations can be calculated by the simple formula:

rij = aij √vi / sj

where vi is the variance of the i-th component, sj is the standard deviation of the j-th variable and rij is the correlation between the scores of the subjects on the i-th principal component and the j-th variable. For example, r15, the correlation between the first principal component and the fifth subtest, is given as 0.77 in table 15.5. It has been calculated as:

r15 = a15 √v1 / s5 = 0.17 × √138.39 / 2.6 = 0.77

15.6 Principal components of the correlation matrix
In the examples discussed so far in this chapter the principal components were extracted from the covariance matrix of the subjects' scores. It is perfectly possible to extract components instead from the correlation matrix of the scores. Indeed this latter is the only option for PCA offered by some computer packages (e.g. SPSS). Before discussing why and when one might be preferable to the other, we will show that it can matter which is chosen. The results of the principal components analysis based on the correlation matrix of the 19 subtest scores of the same 70 subjects as before are presented in table 15.6. A cursory comparison with table 15.4 is sufficient to see that the sets of coefficients are quite different. For example, the coefficients of the first component now all lie in a narrow range (0.13 to 0.26) compared to the much wider range (0.07 to 0.51) for the first component based on the covariance matrix. In the solution based on the covariance matrix the second component had much larger coefficients for subtests 7 and 18 than the others. In the new solution this is not true. The Eastment-Krzanowski criterion indicates that only the first two components - accounting for 58% and 10% respectively - are significant, i.e. a different conclusion about the inherent dimensionality of the data is reached depending on which matrix is analysed. The correlations between the two significant new components and the subtests are given in table 15.7, and again we can see that the results look different from those of table 15.5. The first component is still a 'general' component which has a highish correlation with all the subtests (all but one over 0.5). However, the second component is negatively correlated with scores on all the speaking tasks and positively correlated with some of the writing/reading tasks. No component of this structure was isolated from the covariance matrix.

15.7 Covariance matrix or correlation matrix?
That this is a question worth addressing has just been demonstrated. The apparent number and the structure of the significant components in a principal components analysis can depend on which matrix is used as a basis for the analysis. It is not difficult to show that the correlation between two variables is just the covariance of the corresponding standardised variables. This means that the question used as a heading for this section can be paraphrased as, 'When a principal components analysis is to be carried out on the scores of several variables measured on a sample of subjects, should the components be extracted from the covariance matrix of the original scores or should the scores be standardised first?'
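The formula for the component-variable correlation is easy to apply by machine. The following short function (a sketch, not from the original text) reproduces the worked example for r15 above:

```python
import math

def component_variable_corr(a_ij, v_i, s_j):
    """Correlation between the i-th principal component and the j-th
    variable: r_ij = a_ij * sqrt(v_i) / s_j."""
    return a_ij * math.sqrt(v_i) / s_j

# r15: coefficient 0.17 of subtest 5 in the first component,
# component variance v1 = 138.39, subtest standard deviation s5 = 2.6.
r15 = component_variable_corr(0.17, 138.39, 2.6)
print(round(r15, 2))   # → 0.77, the value quoted from table 15.5
```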
[Table 15.6. Principal components of the correlation matrix of the 19 CPE subtest scores (coefficients and eigenvalues).]

[Table 15.7. Correlations between the 19 subtests and the first two principal components extracted from the correlation matrix.]

Standardising the variables before analysis means that the components will be based on the correlation matrix of the original variables. There is no way to formulate a clear criterion on which to base this decision, though there is some support from theoretical statisticians for the use of unstandardised scores (e.g. Morrison 1976: 268). It is probably best to leave the original scores alone, i.e. to base the analysis on the covariance matrix, unless there is a good reason to standardise. One good reason would be if the data consist of variables which are not commensurable, such as age, income and IQ; in such a case they should almost certainly be standardised. If the variables are all of the same type, for example a set of scores on different language tests, they should be analysed in their original form. In general, the effect of standardising the scores will be to give extra weight to the first, general component. Also, the table of coefficients of the components (table 15.6) will then tell a very similar story to the table of correlations between the components and the original variables. This can be seen from the formula for calculating the correlation, rij, between the i-th component and the j-th variable:

rij = aij √vi / sj

Since, for all standardised variables, the variance (and hence the standard deviation) is unity,¹ i.e. sj = 1, it follows that, for components based on standardised variables (i.e. on the correlation matrix):

rij = aij √vi

so that the correlation between a component and a variable will be proportional to the coefficient, aij, which determines how much weight a subject's score on the variable has in determining his score on the component. If we then look at the relative weights given to two variables in calculating a component, it will not matter whether we consider the correlations or the original coefficients. For example, from table 15.6, in the second principal component, subtest 6 (coefficient 0.14) has twice the weight given to subtest 5 (coefficient 0.07); in table 15.7 the correlation of subtest 6 with the second component (r26 = 0.20) is also exactly twice the correlation of the fifth subtest with that component (r25 = 0.10). The only difference is that to obtain the correlation all the coefficients of the i-th component have been multiplied by √vi, the standard deviation of the component.

¹ Standardised variables (6.4) always have variance equal to 1. Since we have 19 variables all with variance 1, the total variance is 19, which is again the sum of the eigenvalues in table 15.6.
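The claim in 15.7 that the correlation matrix is just the covariance matrix of the standardised scores, and that a 'correlation matrix' PCA is therefore a PCA of standardised scores, can be checked directly. The sketch below is not from the original text; it assumes the NumPy library and uses simulated scores, since the CPE data themselves are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
# 70 simulated 'subjects' on 5 correlated 'subtests' (illustrative data).
scores = rng.normal(size=(70, 5)) @ rng.normal(size=(5, 5))

# Standardise each variable to mean 0 and standard deviation 1 ...
z = (scores - scores.mean(axis=0)) / scores.std(axis=0, ddof=1)

# ... then the covariance matrix of the standardised scores is exactly
# the correlation matrix of the original scores,
assert np.allclose(np.cov(z, rowvar=False), np.corrcoef(scores, rowvar=False))

# and each standardised variable contributes variance 1, so the total
# variance (the sum of the eigenvalues) is just the number of variables.
assert np.isclose(np.trace(np.cov(z, rowvar=False)), 5.0)
print("correlation matrix = covariance matrix of standardised scores")
```

With 19 standardised subtests the same argument gives a total variance of 19, as the footnote above observes.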

It is worth noting that this means that the coefficients of earlier components are forced to be quite small. In the example, the variance of the first component extracted from the correlation matrix is 10.98 and its standard deviation is 3.31. Since r1j = a1j × 3.31, and r1j can never be greater than 1, a1j cannot be greater than 1/3.31 = 0.30. On the other hand, the second component has variance 1.97 and standard deviation 1.40. It is therefore possible that a coefficient of this component could be as high as 1/1.40 = 0.71. This makes nonsense of any rule for interpreting components which looks only at the size of the coefficients. If we decide to reject coefficients smaller than, say, 0.35, all the coefficients in the first principal component are certain to be too small to be considered. It is recommended always to construct the table of correlations between components and variables (subtests), whether or not the variables have been standardised, and to base any empirical interpretation of the components on these correlations. However, although it does seem advisable to extract the principal components from the covariance matrix of the unstandardised variables, some widely used statistical analysis computer programs, e.g. some versions of SPSS, do not permit that option and automatically standardise the variables before analysing them. This is unfortunate, to say the least.
You may also find the recommendation that a component should be included as significant or useful if its variance is greater than 1. This is a special case of the ad hoc rule suggested in 15.4. Since all standardised variables have unit variance, the total variance of, say, 19 standardised variables will be 19, and 1/19 of this total is 1. For the example we have been discussing in this chapter the first three principal components of the correlation matrix have variances greater than unity, and the customary empirical rule would suggest that there are three recognisable dimensions in the data. However, according to the Eastment-Krzanowski criterion there are only two.
A discussion of principal components analysis with an extended example of its use in sociolinguistics can be found in Horvath (1985).

15.8 Factor analysis
fields, and there exists an abundance of texts describing the technique and giving examples of its use (e.g. Bennett & Bowers 1976; Maxwell 1977). For reasons explained below we will not attempt a detailed exposition of FA here, but there is often confusion about the difference between PCA and FA - indeed some researchers imply that they are the same thing - and it therefore seems worthwhile to make some attempt to discuss the essential differences between the two. Although testing is not by any means the only area of application for these techniques, they are commonly applied to sets of test scores in an attempt to discover the latent dimensions or constructs actually being measured by the tests, and we will discuss them in that context.
Carroll (1958) reports the results of a study of test batteries to measure the possible aptitude for learning foreign languages using subjects about to undertake a short intensive 'trial course' of Mandarin Chinese. Two independent samples, each of about 80 subjects, were presented with different batteries of tests and their scores analysed via a factor analysis. Our table 15.8 is adapted from Carroll's table 1 and shows the results of the analysis based on one of his samples. The numbers in the table (except for the final column) are the factor loadings which, for the moment, we will interpret as though they were coefficients of principal components since that is how they are interpreted by many authors, including Carroll in this paper. We discuss below what is the essential difference between factor loadings in a FA and the coefficients in a PCA.

Table 15.8. Factor analysis of test scores of students of Mandarin Chinese

Test                             F1      F2      F3      F4      F5      F6      h²
1. Tem-tem learning I           0.13   -0.07   -0.15    0.37    0.56    0.03    0.69
2. Tem-tem learning II         -0.08    0.05    0.16    0.12    0.69   -0.05    0.68
3. Tem-tem learning III         0.07    0.00    0.19   -0.02    0.46    0.30    0.46
4. Turse spelling               0.40    0.42   -0.06    0.12    0.03    0.33    0.61
5. Turse phonetic association   0.45    0.27   -0.04    0.43   -0.03    0.24    0.77
6. Spelling clues               0.70    0.00    0.11    0.01    0.07    0.26    0.89
7. Disarranged letters          0.38    0.27    0.06    0.36   -0.24    0.25    0.62
8. Rhyming                      0.56    0.17    0.04    0.35   -0.07   -0.05    0.69
9. Co-operative vocabulary      0.55    0.26    0.01    0.18    0.01   -0.01    0.56
10. Number learning            -0.06   -0.01    0.57    0.04    0.03    0.07    0.48
11. Words in sentences         -0.07    0.35    0.18    0.22    0.19    0.29    0.46
PCA is certainly not the only technique widely used to deter- 12. Phonetic discrimination o.os -o.45 O.J4 0.12 0.01 -o.u 0.47
mine the inherent dimensionality of a multivariate data set and identify IJ. Paired associates 0.32 o.si 0.38 -o.OJ -o.o7 O.OJ o.s6
14. Devanagari script -o.og o.xg O.JO 0.02 o.rg 0.39 0.48
the meaning of the underlying dimensions. 1\n alternative technique with rs. Perdaseb -o.Io o.xg 0.02 O.Ig
O.JO O.Jg 0.48
a longer hi~tory - the b~1sic mathematical formulation was propo::;cd by 16. Phonetic script 0.12 0.41 o.o8 047 0. I I 0.04 o.67
17. Criterion o.os O.JJ 0.43 0.18 0.28 O.IJ 0.77
Spcnnrum (1904) as 11 mcuns of investigating the structure of 'intelligence'
is factor analysis (Ft\). Fi\ is widely used in psychometrics and tclatcd
"e" Adnptt~d fwm Carroll (195H: table 1)

~fj(;i 29I
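The advice above (extract the components from the covariance matrix, then base any interpretation on the correlations between components and variables) can be sketched numerically for p = 2 variables, where the eigendecomposition of a covariance matrix has a closed form. The Python fragment below is our illustration, not part of the book; the variances 4.0 and 1.0 and the covariance 1.0 are invented:

```python
import math

def pca_2x2(var_x, var_y, cov_xy):
    """Principal components of a 2 x 2 covariance matrix (closed form).
    Returns (component variances, coefficient vectors), largest first."""
    # Eigenvalues of [[var_x, cov], [cov, var_y]]
    mean = (var_x + var_y) / 2.0
    half_diff = (var_x - var_y) / 2.0
    root = math.hypot(half_diff, cov_xy)
    lam1, lam2 = mean + root, mean - root   # the component variances
    # Unit-length eigenvector for lam1: direction of the first component
    theta = 0.5 * math.atan2(2.0 * cov_xy, var_x - var_y)
    a1 = (math.cos(theta), math.sin(theta))
    a2 = (-math.sin(theta), math.cos(theta))
    return (lam1, lam2), (a1, a2)

# Correlation between component k and variable j, as recommended for
# interpretation: corr(Y_k, X_j) = a_kj * sqrt(lambda_k) / sd(X_j)
(vars_, (a1, a2)) = pca_2x2(4.0, 1.0, 1.0)
corr_y1_x1 = a1[0] * math.sqrt(vars_[0]) / math.sqrt(4.0)
```

The total variance is preserved (the component variances sum to 4.0 + 1.0), which is the property the 'variance greater than 1' rule for standardised variables relies on.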
Carroll points out that factor 1 (F1) has high loadings for tests 4, 5,
6, 7, 8, 9, 13 and argues that it can therefore be identified as a 'verbal
knowledge' factor. Note, in particular, that tests 6 and 9 do not have
high loadings on any other factor. In a similar fashion he identifies an
'associative memory' factor (F3), a 'sound-symbol association ability' factor
(F4), and an 'inductive language learning ability' factor (F5). He
suggests that F6 might be related to 'syntactical fluency' or 'grammatical
sensitivity' but cannot find a simple interpretation for F2, although he
notes that 'whatever the nature of the factor, it is probably one of the
most important components of foreign language aptitude'. He argues that
this factor may represent 'an increment of test performance ascribable
to a specific motivation, interest, or facility with respect to unusual linguistic
materials' and tentatively suggests the name 'linguistic interest' factor.
So much then for possible interpretation. However, what exactly is a
'factor'? What does factor analysis 'do'? Full answers to these two questions
would require more space than we wish to devote to them here, and interested
readers should refer to the texts we have mentioned above. Unfortunately
for those who do not have a good grounding in mathematics it
is necessary to come to grips with the underlying mathematical concepts
in order to have a good understanding of FA in all its detail. However,
we can go some way towards an explanation of the process by comparing
PCA and FA.
As we have seen, the essential step in a PCA is to convert the scores
on the p original variables (X1, X2, ..., Xp) into scores on p new variables
(Y1, Y2, ..., Yp) by a rule of the form:

Equation 1 (PCA)
Yk = ak1X1 + ak2X2 + ... + akpXp

(i.e. the Ys are linear combinations of the Xs), where the coefficients
are chosen to give special properties to the new variables. There is only
one set of coefficients which will cause the Ys to have the required properties.
In other words, the solution to a PCA is unique in the sense that
two different researchers analysing the same set of test scores will arrive
at exactly the same principal components. There is no underlying model
for PCA (as there is, for example, for a discriminant analysis or a multiple
regression), nor are any assumptions needed about the distribution of the
test scores in some population. PCA reorganises the data in the sample
without the need to assume anything about its relation to a wider population.
FA is more flexible but less objective.
To carry out a factor analysis we must assume that the scores on each
of the tests are normally distributed. Indeed, it is assumed that for a
given subject each test score is made up of a linear combination of that
subject's scores on a set of latent dimensions or factors plus a component
particular to that subtest, i.e.

Xij = bj1Fi1 + bj2Fi2 + ... + bjmFim + eij

where Xij is the score of the i-th subject on the j-th subtest and Fik is
the 'score' of the same subject on the k-th dimension of, say, language
ability, among the m dimensions being measured by the various subtests.
The quantity eij is the part of the score Xij which is not accounted for
by the common dimensions and can be thought of as something particular
to the j-th test. The factor equation can be written with simpler notation,
thus:

Equation 2 (FA)
Xj = bj1F1 + bj2F2 + ... + bjmFm + ej

and is often expressed in words as 'any score on subtest j is a linear combination
of the scores on the common factors F1, F2, ..., Fm plus a contribution
from the specific factor, ej'. The quantity bjk is the loading
of the k-th factor in the j-th response (i.e. subtest score).
Although Equations 1 and 2 look very similar, there is a crucial difference.
The values of all the Xs in Equation 1 are observed for all the
subjects and the coefficients akj are estimated using the criteria that Y1
is as variable as possible, Y2 is uncorrelated with Y1, etc. These criteria
are sufficient to lead to a unique set of values for the akj. It is these criteria
which ensure that every experimenter who carries out a PCA on the variance-covariance
matrix of a given set of data will arrive at the same principal
components coefficients. FA, however, does not lead to a unique solution.
The process known as rotation of factors, discussed below, allows an
experimenter to choose the solution he prefers from an infinite set of possible
solutions.
Factors are referred to as 'latent' because they have not been, and possibly
will never be, observed directly. Their existence, importance and
structure have to be inferred from data via the model described above.
The model, described by Equation 2, has many superficial similarities
with multiple regression (chapter 13). The model assumes that there is
a (small) number of universal latent dimensions or variables, called factors;
that any individual has an unobserved 'true score' on each factor; and
that the individual's score on any of the observed variables (i.e. the test
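Equation 2 has a checkable consequence: if two subtests load on a single common factor, their expected correlation is the product of their loadings. The following simulation is our illustration, not part of the book; the loadings 0.8 and 0.7 are invented:

```python
import random

random.seed(1)

def simulate_two_tests(n, b1=0.8, b2=0.7):
    """Generate scores from a one-factor model X_j = b_j * F + e_j,
    with the specific parts scaled so each X_j has unit variance."""
    xs1, xs2 = [], []
    for _ in range(n):
        f = random.gauss(0, 1)                        # common factor score
        e1 = random.gauss(0, (1 - b1 ** 2) ** 0.5)    # specific part, test 1
        e2 = random.gauss(0, (1 - b2 ** 2) ** 0.5)    # specific part, test 2
        xs1.append(b1 * f + e1)
        xs2.append(b2 * f + e2)
    return xs1, xs2

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

x1, x2 = simulate_two_tests(20000)
r = corr(x1, x2)   # should be close to 0.8 * 0.7 = 0.56
```

Here the communality of test 1 is b1 squared = 0.64, and 1 - 0.64 is its specific variance, exactly as for the h2 column of table 15.8.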
scores in the language aptitude example) can be 'explained' or predicted
to a greater or less extent by his factor scores when these are combined
using Equation 2. The final column of table 15.8 gives the values of h2,
the proportion of the variability of the sample scores in each test which
is 'explained' by the factor scores. The values of h2 are usually called
the communalities while 1 - h2 gives the estimated value of the specific
variance of each test - that part of the variability in the sample scores
not explained by the factors.
However, the resemblance to multiple regression is only superficial.
First, the factors, which here play the role of the independent variables,
are not observed. In other words, we are trying to fit a model without
knowing the values of the independent variables! Second, there is not
just one dependent variable - there are always several. Carroll's test battery
contained 17 tests and his FA attempts to explain simultaneously the scores
on these 17 'dependent variables'. Not surprisingly, it is a far from trivial
exercise to solve this problem and several different methods have been
proposed. The most common are the centroid method, the principal-axes
method, and the maximum likelihood method. It is impossible
to discuss the difference between them in the present book (see e.g. Cureton
& D'Agostino 1983 for discussion and further references), but they will
give different results, estimating different values for the communalities
and perhaps even indicating different numbers of important factors. The
maximum likelihood method provides a test of the hypothesis that no
further factors are required after the first few have been fitted and is scale
invariant, which means that the same solution will be reached whether
the analysis is carried out on the variance-covariance matrix or on the
correlation matrix of the sample. However, whichever method is used,
even once the number of factors has been decided upon there are very
many (an infinite number!) mathematical solutions, all of which are 'correct'
but all of which will give different values to the factor loadings.
It is possible to explore these solutions, by a process called factor rotation,
to look for a set of factor loadings which the experimenter believes to
give a meaningful solution. It is not easy to give a short and clear description
of this process and interested readers should consult the references.
From the brief discussion above it should be clear that there is a certain
amount of indeterminacy in the FA process. The experimenter has a
choice of methods of factoring which will give different solutions. For
some of those methods it will matter whether the factors are extracted
from the covariance or the correlation matrix. A first solution from any
of these methods will allow a decision about the number of factors and
will give an estimate of the communalities. Without changing the number
of factors or the communalities it is possible to alter the factor loadings
by rotating the factors to search for loading patterns which offer simple
interpretation. In the main this will mean trying to establish factors with
very high loadings on just a few tests and almost zero loadings on the
others. All of this is a valid process in any case only if the assumptions
of the factor model are justified.
Cureton & D'Agostino suggest that if the purpose of analysis is to convert
a battery of test scores into a single, composite score the first principal
component will give the best weights for doing this. Furthermore, principal
components are much more accurately estimated from any sample than
factor loadings from the same sample and may give a clearer indication
of the underlying dimensionality of the data. On the other hand, factors
are usually easier to interpret than principal components and the plausibility
of a particular factor solution could be checked by means of further
data and a confirmatory factor analysis which allows hypotheses to
be tested about the original solution. Further discussion of the similarities
and differences between PCA and FA in a linguistics context can be found
in Woods (1983).

SUMMARY
This chapter has introduced and compared principal components
analysis (PCA) and factor analysis (FA).
(1) PCA produces a set of coefficients which allow the original scores to be converted
into scores on new variables, the principal components, each of which
is independent of the others and each of which successively accounts for as
much as possible of the total variability left after the earlier components have
been extracted.
(2) It was recommended that PCA be carried out usually on the variance-covariance
matrix and the correlation matrix be used only where there are good
reasons for doing so.
(3) By means of an example, the problems of reifying the components and deciding
on the inherent dimensionality were discussed.
(4) The model for FA was introduced and discussed. The different variables are
assumed to be normally distributed. The interpretation of factors was discussed
and exemplified.
(5) It was pointed out that PCA gives a unique solution while FA leaves a great
deal of choice to the experimenter through rotation of factors. It was pointed
out that this flexibility could be used creatively but must be used carefully.
APPENDIX A

Statistical tables
A1  Random numbers
A2  Standard normal distribution
A3  Percentage points of standard normal distribution
A4  Percentage points of t-distribution
A5  Percentage points of chi-squared distribution
A6  Percentage points of Pearson's correlation coefficient
A7  Percentage points of Spearman's rank correlation coefficient
A8  Percentage points of F distribution
A9  5% critical values of U for Mann-Whitney test
A10 Significance levels for the sign test

Table A1. Random numbers

[a page of random two-digit numbers, not legibly recoverable from this copy]
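A table such as A1 would nowadays be produced by a pseudo-random number generator rather than read from the page. A minimal Python sketch (ours, not the book's):

```python
import random

def random_number_table(rows, cols, seed=None):
    """Return a table of random two-digit numbers (00-99),
    like the rows of Table A1."""
    rng = random.Random(seed)   # a seed makes the table reproducible
    return [[rng.randint(0, 99) for _ in range(cols)] for _ in range(rows)]

table = random_number_table(5, 10, seed=42)
for row in table:
    print(" ".join(f"{n:02d}" for n in row))
```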

Table A2. Standard normal distribution

The distribution tabulated is that of the normal distribution with mean zero and standard
deviation 1. For each value of Z, the standardised normal deviate, the proportion, P, of
the distribution less than Z is given. For a normal distribution with mean mu and variance
sigma-squared, the proportion of the distribution less than some particular value, X, is obtained by
calculating Z = (X - mu)/sigma and reading the proportion corresponding to this value of Z.

Z       P         Z       P         Z      P         Z      P
-4.00   0.00003   -1.50   0.0668    0.00   0.5000    1.55   0.9394
-3.50   0.00023   -1.45   0.0735    0.05   0.5199    1.60   0.9452
-3.00   0.0014    -1.40   0.0808    0.10   0.5398    1.65   0.9505
-2.95   0.0016    -1.35   0.0885    0.15   0.5596    1.70   0.9554
-2.90   0.0019    -1.30   0.0968    0.20   0.5793    1.75   0.9599
-2.85   0.0022    -1.25   0.1056    0.25   0.5987    1.80   0.9641
-2.80   0.0026    -1.20   0.1151    0.30   0.6179    1.85   0.9678
-2.75   0.0030    -1.15   0.1251    0.35   0.6368    1.90   0.9713
-2.70   0.0035    -1.10   0.1357    0.40   0.6554    1.95   0.9744
-2.65   0.0040    -1.05   0.1469    0.45   0.6736    2.00   0.9772
-2.60   0.0047    -1.00   0.1587    0.50   0.6915    2.05   0.9798
-2.55   0.0054    -0.95   0.1711    0.55   0.7088    2.10   0.9821
-2.50   0.0062    -0.90   0.1841    0.60   0.7257    2.15   0.9842
-2.45   0.0071    -0.85   0.1977    0.65   0.7422    2.20   0.9861
-2.40   0.0082    -0.80   0.2119    0.70   0.7580    2.25   0.9878
-2.35   0.0094    -0.75   0.2266    0.75   0.7734    2.30   0.9893
-2.30   0.0107    -0.70   0.2420    0.80   0.7881    2.35   0.9906
-2.25   0.0122    -0.65   0.2578    0.85   0.8023    2.40   0.9918
-2.20   0.0139    -0.60   0.2743    0.90   0.8159    2.45   0.9929
-2.15   0.0158    -0.55   0.2912    0.95   0.8289    2.50   0.9938
-2.10   0.0179    -0.50   0.3085    1.00   0.8413    2.55   0.9946
-2.05   0.0202    -0.45   0.3264    1.05   0.8531    2.60   0.9953
-2.00   0.0228    -0.40   0.3446    1.10   0.8643    2.65   0.9960
-1.95   0.0256    -0.35   0.3632    1.15   0.8749    2.70   0.9965
-1.90   0.0287    -0.30   0.3821    1.20   0.8849    2.75   0.9970
-1.85   0.0322    -0.25   0.4013    1.25   0.8944    2.80   0.9974
-1.80   0.0359    -0.20   0.4207    1.30   0.9032    2.85   0.9978
-1.75   0.0401    -0.15   0.4404    1.35   0.9115    2.90   0.9981
-1.70   0.0446    -0.10   0.4602    1.40   0.9192    2.95   0.9984
-1.65   0.0495    -0.05   0.4801    1.45   0.9265    3.00   0.9986
-1.60   0.0548    -0.00   0.5000    1.50   0.9332    3.50   0.99977
-1.55   0.0606                                       4.00   0.99997

Table A3. Percentage points of standard normal distribution

This table gives the values of Z for which a given percentage, p, of the standardised
normal distribution lies outside the range -Z to +Z.

p      Z
90     0.1257
80     0.2533
70     0.3853
60     0.5244
50     0.6745
40     0.8416
30     1.0364
20     1.2816
10     1.6449
5      1.9600
2      2.3263
1      2.5758
0.2    3.0902
0.1    3.2905
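Tables A2 and A3 can be reproduced with the normal-distribution class in Python's standard library (available since Python 3.8); this snippet is our illustration, not part of the book:

```python
from statistics import NormalDist

z = NormalDist()                    # mean 0, standard deviation 1

# Table A2: proportion of the distribution less than Z
print(round(z.cdf(-1.50), 4))       # 0.0668
print(round(z.cdf(1.96), 4))        # 0.975

# Table A3: the Z outside whose range -Z to +Z a percentage p lies.
# For p = 5%, 2.5% lies in each tail, so we need the 97.5th percentile.
print(round(z.inv_cdf(0.975), 4))   # 1.96
```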

Table A4. Percentage points of t-distribution

This table gives the values of t for which a particular percentage, p, of Student's t-distribution
lies outside the range -t to +t. These values of t are tabulated for various
degrees of freedom. (The distribution narrows with increasing degrees of freedom.)

Degrees of                        p =
freedom    50     20     10     5      2      1      0.2    0.1
1          1.00   3.08   6.31   12.7   31.8   63.7   318    637
2          0.82   1.89   2.92   4.30   6.96   9.92   22.3   31.6
3          0.76   1.64   2.35   3.18   4.54   5.84   10.2   12.9
4          0.74   1.53   2.13   2.78   3.75   4.60   7.17   8.61
5          0.73   1.48   2.02   2.57   3.36   4.03   5.89   6.87
6          0.72   1.44   1.94   2.45   3.14   3.71   5.21   5.96
7          0.71   1.42   1.89   2.36   3.00   3.50   4.79   5.41
8          0.71   1.40   1.86   2.31   2.90   3.36   4.50   5.04
9          0.70   1.38   1.83   2.26   2.82   3.25   4.30   4.78
10         0.70   1.37   1.81   2.23   2.76   3.17   4.14   4.59
12         0.70   1.36   1.78   2.18   2.68   3.05   3.93   4.32
15         0.69   1.34   1.75   2.13   2.60   2.95   3.73   4.07
20         0.69   1.32   1.72   2.09   2.53   2.85   3.55   3.85
24         0.68   1.32   1.71   2.06   2.49   2.80   3.47   3.75
30         0.68   1.31   1.70   2.04   2.46   2.75   3.39   3.65
40         0.68   1.30   1.68   2.02   2.42   2.70   3.31   3.55
60         0.68   1.30   1.67   2.00   2.39   2.66   3.23   3.46
infinity   0.67   1.28   1.64   1.96   2.33   2.58   3.09   3.29

Table A5. Percentage points of chi-squared distribution

This table gives the values of chi-squared for which a particular percentage, p, of the chi-squared
distribution is greater than that value. These values are tabulated for various degrees of freedom.
(The distribution spreads with increasing degrees of freedom.)

Degrees of                           p =
freedom    97.5       95        50     10     5      2.5    1      0.1
1          0.000982   0.00393   0.45   2.71   3.84   5.02   6.64   10.8
2          0.0506     0.103     1.39   4.61   5.99   7.38   9.21   13.8
3          0.216      0.352     2.37   6.25   7.82   9.35   11.3   16.3
4          0.484      0.711     3.36   7.78   9.49   11.1   13.3   18.5
5          0.831      1.15      4.35   9.24   11.1   12.8   15.1   20.5
6          1.24       1.64      5.35   10.6   12.6   14.5   16.8   22.5
7          1.69       2.17      6.35   12.0   14.1   16.0   18.5   24.3
8          2.18       2.73      7.34   13.4   15.5   17.5   20.1   26.1
9          2.70       3.33      8.34   14.7   16.9   19.0   21.7   27.9
10         3.25       3.94      9.34   16.0   18.3   20.5   23.2   29.6
12         4.40       5.23      11.3   18.5   21.0   23.3   26.2   32.9
15         6.26       7.26      14.3   22.3   25.0   27.5   30.6   37.7
20         9.59       10.9      19.3   28.4   31.4   34.2   37.6   45.3
24         12.4       13.9      23.3   33.2   36.4   39.4   43.0   51.2
30         16.8       18.5      29.3   40.3   43.8   47.0   50.9   59.7
40         24.4       26.5      39.3   51.8   55.8   59.3   63.7   73.4
60         40.5       43.2      59.3   74.4   79.1   83.3   88.4   99.6
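Entries in Table A5 can be checked directly: for even degrees of freedom the upper-tail probability of the chi-squared distribution has a closed form, P(chi-squared > x) = exp(-x/2) multiplied by the sum over k < df/2 of (x/2)^k / k!. A Python sketch (ours, not the book's):

```python
import math

def chisq_tail(x, df):
    """P(chi-squared with even df exceeds x), via the closed-form series."""
    if df % 2 != 0:
        raise ValueError("the closed form shown here needs even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k        # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total

# Checking Table A5: the 5% point for 4 degrees of freedom is 9.49
print(round(chisq_tail(9.49, 4), 3))   # 0.05
```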

Table A6. Percentage points of Pearson's correlation coefficient

This table gives absolute values of the sample correlation coefficient r which would lead
to the rejection of the null hypothesis that the population correlation coefficient rho = 0 against
the alternative hypothesis that rho is not 0 at the stated significance levels p.

                    Significance levels (p)
Sample size (n)     0.10     0.05     0.01     0.001
3                   0.9877   0.9969   0.9999   0.9999
4                   0.900    0.950    0.990    0.999
5                   0.805    0.878    0.959    0.991
6                   0.729    0.811    0.917    0.974
7                   0.669    0.754    0.875    0.951
8                   0.621    0.707    0.834    0.925
9                   0.582    0.666    0.798    0.898
10                  0.549    0.632    0.765    0.872
11                  0.521    0.602    0.735    0.847
12                  0.497    0.576    0.708    0.823
13                  0.476    0.553    0.684    0.801
14                  0.457    0.532    0.661    0.780
15                  0.441    0.514    0.641    0.760
16                  0.426    0.497    0.623    0.742
17                  0.412    0.482    0.606    0.725
18                  0.400    0.468    0.590    0.708
19                  0.389    0.456    0.575    0.693
20                  0.378    0.444    0.561    0.679
21                  0.369    0.433    0.549    0.665
22                  0.360    0.423    0.537    0.652
27                  0.323    0.381    0.487    0.597
32                  0.296    0.349    0.449    0.554
42                  0.257    0.304    0.393    0.490
52                  0.231    0.273    0.354    0.443
62                  0.211    0.250    0.325    0.408
82                  0.183    0.217    0.283    0.357
102                 0.164    0.195    0.254    0.321

Table A7. Percentage points for distribution of the Spearman rank
correlation coefficient, rs, to test the hypothesis H0: rho-s = 0 versus H1: rho-s is not 0

Sample size (n)     0.10     0.05     0.02     0.01
4                   1.000
5                   0.900    1.000    1.000
6                   0.829    0.886    0.943    1.000
7                   0.714    0.786    0.893    0.929
8                   0.643    0.714    0.833    0.881
9                   0.600    0.700    0.783    0.833
10                  0.564    0.648    0.745    0.794
11                  0.536    0.618    0.709    0.764
12                  0.503    0.587    0.678    0.734
13                  0.484    0.560    0.648    0.703
14                  0.464    0.538    0.626    0.679
15                  0.446    0.521    0.604    0.657
16                  0.429    0.503    0.584    0.634
17                  0.414    0.488    0.566    0.618
18                  0.401    0.474    0.550    0.600
19                  0.391    0.460    0.535    0.584
20                  0.380    0.447    0.522    0.570
21                  0.370    0.436    0.510    0.556
22                  0.361    0.425    0.497    0.544
23                  0.353    0.416    0.486    0.532
24                  0.344    0.407    0.476    0.521
25                  0.337    0.398    0.466    0.511
26                  0.331    0.390    0.457    0.499
27                  0.324    0.383    0.449    0.492
28                  0.318    0.375    0.441    0.483
29                  0.312    0.369    0.433    0.475
30                  0.306    0.362    0.426    0.467

Adapted from Glasser & Winter (1961: table 2)
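The critical values in Table A6 can be approximated by simulating the null distribution of r for pairs of independent normal samples. This Monte Carlo sketch is ours, not the book's; the sample size 10 and the 5% level are chosen to match one entry of the table:

```python
import random

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def simulated_critical_r(n, sims=20000, level=0.05, seed=0):
    """Empirical two-sided critical |r| under H0: rho = 0."""
    rng = random.Random(seed)
    rs = []
    for _ in range(sims):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        rs.append(abs(pearson_r(xs, ys)))
    rs.sort()
    return rs[int((1 - level) * sims)]

crit = simulated_critical_r(10)   # Table A6 gives 0.632 for n = 10, p = 0.05
```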

Table A8. Percentage points of F-distribution

These tables give the values of F for which a given percentage of the F-distribution is
greater than F. The F-distribution arises when two independent estimates of a variance are divided one
by the other. Each of these estimates has its degrees of freedom associated with it; thus
to specify which particular F-distribution is to be considered, the degrees of freedom of
the numerator, n1, and the denominator, n2, must be given.

(a) 5 per cent point

n2 \ n1  1      2      3      4      5      6      7      8      10     12     24
2        18.5   19.0   19.2   19.2   19.3   19.3   19.4   19.4   19.4   19.4   19.5
3        10.1   9.55   9.28   9.12   9.01   8.94   8.89   8.85   8.79   8.74   8.64
4        7.71   6.94   6.59   6.39   6.26   6.16   6.09   6.04   5.96   5.91   5.77
5        6.61   5.79   5.41   5.19   5.05   4.95   4.88   4.82   4.74   4.68   4.53
6        5.99   5.14   4.76   4.53   4.39   4.28   4.21   4.15   4.06   4.00   3.84
7        5.59   4.74   4.35   4.12   3.97   3.87   3.79   3.73   3.64   3.57   3.41
8        5.32   4.46   4.07   3.84   3.69   3.58   3.50   3.44   3.35   3.28   3.12
9        5.12   4.26   3.86   3.63   3.48   3.37   3.29   3.23   3.14   3.07   2.90
10       4.96   4.10   3.71   3.48   3.33   3.22   3.14   3.07   2.98   2.91   2.74
12       4.75   3.89   3.49   3.26   3.11   3.00   2.91   2.85   2.75   2.69   2.51
15       4.54   3.68   3.29   3.06   2.90   2.79   2.71   2.64   2.54   2.48   2.29
20       4.35   3.49   3.10   2.87   2.71   2.60   2.51   2.45   2.35   2.28   2.08
24       4.26   3.40   3.01   2.78   2.62   2.51   2.42   2.36   2.25   2.18   1.98
30       4.17   3.32   2.92   2.69   2.53   2.42   2.33   2.27   2.16   2.09   1.89
40       4.08   3.23   2.84   2.61   2.45   2.34   2.25   2.18   2.08   2.00   1.79
60       4.00   3.15   2.76   2.53   2.37   2.25   2.17   2.10   1.99   1.92   1.70

(b) 1 per cent point

n2 \ n1  1      2      3      4      5      6      7      8      10     12     24
2        98.5   99.0   99.2   99.2   99.3   99.3   99.4   99.4   99.4   99.4   99.5
3        34.1   30.8   29.5   28.7   28.2   27.9   27.7   27.5   27.2   27.1   26.6
4        21.2   18.0   16.7   16.0   15.5   15.2   15.0   14.8   14.5   14.4   13.9
5        16.3   13.3   12.1   11.4   11.0   10.7   10.5   10.3   10.1   9.89   9.47
6        13.7   10.9   9.78   9.15   8.75   8.47   8.26   8.10   7.87   7.72   7.31
7        12.2   9.55   8.45   7.85   7.46   7.19   6.99   6.84   6.62   6.47   6.07
8        11.3   8.65   7.59   7.01   6.63   6.37   6.18   6.03   5.81   5.67   5.28
9        10.6   8.02   6.99   6.42   6.06   5.80   5.61   5.47   5.26   5.11   4.73
10       10.0   7.56   6.55   5.99   5.64   5.39   5.20   5.06   4.85   4.71   4.33
12       9.33   6.93   5.95   5.41   5.06   4.82   4.64   4.50   4.30   4.16   3.78
15       8.68   6.36   5.42   4.89   4.56   4.32   4.14   4.00   3.80   3.67   3.29
20       8.10   5.85   4.94   4.43   4.10   3.87   3.70   3.56   3.37   3.23   2.86
24       7.82   5.61   4.72   4.22   3.90   3.67   3.50   3.36   3.17   3.03   2.66
30       7.56   5.39   4.51   4.02   3.70   3.47   3.30   3.17   2.98   2.84   2.47
40       7.31   5.18   4.31   3.83   3.51   3.29   3.12   2.99   2.80   2.66   2.29
60       7.08   4.98   4.13   3.65   3.34   3.12   2.95   2.82   2.63   2.50   2.12
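Table A8 is used by forming a ratio of two independent variance estimates. A sketch in Python (ours, not the book's; the two sets of scores are invented):

```python
from statistics import variance

# Hypothetical scores from two groups of ten learners (9 df each)
group1 = [12, 15, 11, 18, 14, 16, 13, 17, 15, 14]
group2 = [14, 14, 15, 13, 15, 14, 16, 13, 14, 15]

v1, v2 = variance(group1), variance(group2)
f_ratio = max(v1, v2) / min(v1, v2)   # larger variance on top

# Table A8(a) has no (9, 9) entry; interpolating between the n1 = 8 and
# n1 = 10 entries of the n2 = 9 row (3.23 and 3.14) gives roughly 3.2.
print(round(f_ratio, 2))   # 5.25
```

Since 5.25 exceeds roughly 3.2, the two variances would be judged significantly different at the 5% level for these invented data.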
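Table A9 is used by computing the Mann-Whitney U statistic for two samples and comparing it with the tabulated value. A Python sketch (ours, not the book's; the scores are invented):

```python
def mann_whitney_u(a, b):
    """U statistics for two independent samples (no tie correction)."""
    u_a = sum(1 for x in a for y in b if x > y) \
        + 0.5 * sum(1 for x in a for y in b if x == y)
    u_b = len(a) * len(b) - u_a
    return min(u_a, u_b)

# Hypothetical scores for two small groups (n1 = 4, n2 = 5)
group_a = [1.1, 2.3, 2.9, 3.8]
group_b = [3.0, 4.2, 4.9, 5.1, 5.8]

u = mann_whitney_u(group_a, group_b)
# Table A9 gives 1 as the 5% critical value for n1 = 4, n2 = 5; since
# only values *less than* the tabulated value are significant, U = 1
# here just fails to reach significance at the 5% level.
print(u)
```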
Table A9. 5% critical values of U for a two-tailed Mann-Whitney test

n2 \ n1  4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20
2        -   -   -   -   0   0   0   0   1   1   1   1   1   2   2   2   2
3        -   0   1   1   2   2   3   3   4   4   5   5   6   6   7   7   8
4        0   1   2   3   4   4   5   6   7   8   9   10  11  11  12  13  13
5            2   3   5   6   7   8   9   11  12  13  14  15  17  18  19  20
6                5   6   8   10  11  13  14  16  17  19  21  22  24  25  27
7                    8   10  12  14  16  18  20  22  24  26  28  30  32  34
8                        13  15  17  19  22  24  26  29  31  34  36  38  41
9                            17  20  23  26  28  31  34  37  39  42  45  48
10                               23  26  29  33  36  39  42  45  48  52  55
11                                   30  33  37  40  44  47  51  55  58  62
12                                       37  41  45  49  53  57  61  65  69
13                                           45  50  54  59  63  67  72  76
14                                               55  59  64  67  74  78  83
15                                                   64  70  75  80  85  90
16                                                       75  81  86  92  98
17                                                           87  93  99  105
18                                                               99  106 112
19                                                                   113 119
20                                                                       127

Adapted from Siegel (1956: tables J, K).
Note: Values of U less than the tabulated values are significant.

Table A10. Significance levels for the sign test (the values given are
relevant to a two-tailed test)

     T = 0    1      2      3      4      5      6      7
n
5    0.062
6    0.031
7    0.016
8    0.008   0.070
9    0.004   0.039
10   0.002   0.021
11   0.001   0.012   0.065
12           0.006   0.039
13           0.003   0.022   0.092
14           0.002   0.013   0.057
15           0.001   0.007   0.035
16                   0.004   0.021   0.077
17                   0.002   0.013   0.049
18                   0.001   0.008   0.031   0.096
19                           0.004   0.019   0.064
20                           0.003   0.012   0.041
21                           0.001   0.007   0.027   0.078
22                                   0.004   0.017   0.052
23                                   0.003   0.011   0.035   0.093
24                                   0.002   0.007   0.023   0.064
25                                           0.004   0.015   0.043

Note: blank entries to the upper right correspond to significance levels
greater than 0.1; those to the lower left to levels that are very small.

APPENDIX B

Statistical computation

1. Calculators
Most of the measures, estimators and test statistics discussed in the
first thirteen chapters of the book can be obtained from data by the use of a
modest electronic calculator costing less than £20. To be at all useful a calculator
must have the following features: (a) an automatic square root function (key usually
marked '√'); (b) at least two memories. It is certainly worthwhile to buy a calculator
which has some built-in statistical calculations. Most calculators now have
a facility to calculate the mean and variance of a single data set. Many will also
calculate correlations (look for the letter r - usually to be found in a different
colour beside one of the keys rather than on it) and simple linear regressions.

2. Using a computer
However, throughout the book it has been assumed that, except where
the calculations are simple and straightforward, the analysis will be carried out
using a suitable computer package. There are several stages involved in that process.
The data have to be put in an appropriate form and possibly stored in advance,
the analysis carried out and the results printed on paper so that they can be studied
carefully at leisure. It will be assumed in what follows that the analysis is to be
carried out on a moderately large computing facility accessed via a terminal (or
remote terminal) which is not an integral part of the computer. It is possible
that you may have access to a microcomputer which has a program for analysing
statistical data. As yet there is no comprehensive statistical analysis program widely
available for microcomputers so that we can offer no general advice here, although
much of what follows below will still be relevant. (Versions of MINITAB, SPSS,
BMDP have been written for various microcomputers but the authors have no
information about their reliability or availability.)
From a remote terminal it will be necessary to establish communications with
the computer and identify yourself to it. Exactly how this is done will depend
on local conventions as well as on the type of computer you wish to use and
the type of terminal at which you are working. You will probably have to apply
to a Computer Centre for a personal identification code (to which you will usually
add a private password) and at the same time you should be able to receive copies
of the documents which explain the basic procedures. While communications are
being established from the terminal the computer may respond to each of your
instructions by printing messages on the screen. These messages may be quite
long and technical. It is usually worthwhile learning to recognise those phrases
or symbols which indicate that connections are being established as expected.
Now, suppose that your terminal has been successfully connected and the computer
has accepted you as a valid user. The next stage depends on whether you
will be using the statistical analysis package in batch mode or interactively.
In batch mode you will have to prepare all your instructions to the package in
a kind of shopping list or program and then submit the complete program to
the computer. The machine will attempt to carry out all your instructions in
sequence and will reply to your terminal only when it has completed all the tasks
requested in the program or until an erroneous instruction is discovered. In the
latter case it will usually carry out all the instructions prior to the error and print
out the corresponding results - the output - together with information about
the error. Whether or not it attempts to carry out the rest of the program will
depend on the seriousness of the error, but even if it does continue it is wise
to be highly sceptical of any analysis subsequent to an error and to repeat it after
making the required modification to the program.
A program is stored in the form of a file with a unique filename. It will usually
be created, i.e. typed into the computer, using a piece of software (another
program) called an editor which also allows you to correct any errors you make,
either at the moment of typing or after they are discovered when you attempt
to run your program. Again, the Computer Centre will supply instructions for
the use of the editor and for the submission of your program to the statistical
analysis package of your choice.
If the package is interactive or conversational and facilities are available to
use the computer interactively or in interactive mode then it will be convenient
to take advantage of this. It means that you can submit your instructions one
at a time to the analysis package and you will receive an immediate response on
the screen. There are at least two advantages in this. First, if you write an instruction
incorrectly the computer will refuse to accept it and will respond with a message
which will be more or less helpful. You can then try again until you get it right.
Second, the results obtained from one step in the analysis may help to suggest

[...] career you should endeavour to keep good records of your files, their filenames
and a summary of their contents. Note that even when using interactive mode
it is usually convenient to prepare your data and store it in a file before attempting
to analyse it. Interactive packages will usually allow you to type in the data as
it is required but if you make a typing error it may then be more difficult to
correct it than if it were stored separately.

3. Using MINITAB
The first two elements in the computing process are to connect the
terminal to the main computer and to create data files or program files and store
them. The third step will be to use a standard statistical analysis package. There
are many of these and most computer installations will offer only one or two
of them. However, they have many similarities, at least from the point of view
of an unsophisticated user. There should be manuals available at the Computer
Centre, though these may be difficult to read - computer manuals are not renowned
for their clarity. Most interactive packages have a HELP facility which enables
the user to request information from the computer about the instructions required
to carry out different types of analysis. For some packages there are simplified,
introductory handbooks available, e.g. Ryan, Joiner & Ryan (1985).
We give an example below of the MINITAB program required to carry out
the analysis of some of the examples discussed in earlier chapters. We have chosen
this package as an illustration since, from our experience, it is easily self-taught.
Although it is somewhat limited, it can be used to carry out most of the analyses
discussed in the book except for the multivariate examples of chapters 14 and
15.
This is not intended to be an exhaustive guide to MINITAB. Here we simply
exemplify some of its features in relation to our data. In addition, it is assumed
that the user has determined for himself how to access MINITAB on the particular
mainframe he is employing. To exemplify the use of the package we will take
the error gravity score data analysed in different ways in chapters 10, 11 and
12, which appear in full in table 12.4 on page 201.
In the examples below you may note small discrepancies between the values
given by MINITAB and those quoted in the main text of the book. For example,
in the correlation matrix calculated below, MINITAB gives a correlation as
new ideas or may indicate that the analysis you have planned is clearly going 0.769 whereas in chapter 10 (page 158) we obtained a value of 0.770 for this
to be unsuccessful or inappropriate. It will also be possible to experiment and, quantity. These values are so close that the difference is unimportant. The MINI-
with an immediate response to each of your attempts, reach a satisfactory analysis TAB value is probably more correct. Our value is more likely to have some rounding
withoUt spending hours or even days waiting for paper to reach you from the error because of the sequence of calculations - more suitable for hand calculation
Computer Centre. It will always be possible to get a hard copy, i.e. a version -which we used to obtain it.
printed on paper, of any part or all of the conversation you carry on with the
computer and the results of your analysis. The results, and the record of your 4 Inputting the data
conversation, can also be stored internally in the computer with known filenames The first task is to provide MINITAB with the data which is to be
which will enable you to recall them later. From the beginning of your computing statistically analysed. Data can either be typed in direct from the keyboard, or

398 309
Statistical computation
AppendixB
alternatively read in from an already extant file in the user's filestore. Both methods make use of the READ command.

4.1   Direct input
Data is defined in terms of columns, and read in rows. For the data of table 12.4, type:

READ C1-C3

(since there are three columns in the data). After you hit RETURN, the data is typed in as follows (with a space between each data point):

22 36 22   (type RETURN)
16 9 18    (type RETURN)
42 29 42   (type RETURN)

and so on. After the 32 rows of data have been typed in, type:

END

The machine will respond with the number of rows read; if you need to check the file once it is read, the PRINT command is available:

PRINT C1-C3

4.2   File input
If the data you wish to analyse is already available in a file, simply use the READ command with the name of the file enclosed in quotes, and the number of columns into which you want it to be read:

READ 'filename' C1-C3

After RETURN, the machine will respond with the number of rows read and the first few lines of the file, as follows:

32 ROWS READ

ROW   C1   C2   C3
  1   22   36   22
  2   16    9   18
  3   42   29   42
  4   25   35   21

If you want to see more of the file, the PRINT command can again be used.

5   Analysing the data
5.1   Descriptive statistics
The MINITAB package will readily provide descriptive statistics, including diagrams. The command HISTOGRAM C2, for example, would provide the following histogram of the scores in the second column of the table:

MIDDLE OF   NUMBER OF
INTERVAL    OBSERVATIONS
   10            1   *
   15            2   **
   20            4   ****
   25            7   *******
   30            5   *****
   35           10   **********
   40            3   ***

The commands MEAN and STDEV for individual columns will supply these particular measures of central tendency and dispersion, but a comprehensive set of descriptive statistics for the data in our file can be obtained by using the DESCRIBE command, as follows:

DESCRIBE C1-C3

            C1       C2       C3
N           32       32       32
MEAN     25.03    28.28    23.62
MEDIAN   24.50    28.50    22.00
TMEAN    24.86    28.68    23.04
STDEV     6.25     7.85     8.26
SEMEAN    1.10     1.39     1.46
MAX      42.00    41.00    43.00
MIN      11.00     9.00    12.00
Q3       29.00    34.75    27.75
Q1       21.00    23.25    18.00

In addition to the mean and standard deviation, the median (see p. 27) and standard error of the mean (see p. 98) are supplied, as well as information on minimum and maximum values in the description and the points at which the first (Q1) and third (Q3) quartiles fall. (The one piece of information we have not discussed earlier in the book is the TMEAN. This is a trimmed mean, i.e. the mean of the data with the extreme values trimmed off. The MINITAB TMEAN is calculated by excluding the highest 5% and the lowest 5% of the data values. Its use is not recommended.)

5.2   Correlation
A scatter diagram for two sets of data you want to correlate can be supplied using the PLOT command. Here we PLOT C1 VS C3 (error gravity scores of native English teachers and native English non-teachers; compare figure
10.1, p. 156).

[Figure B1. MINITAB plot of C1 vs. C3: a character-based scatter diagram with C1 on the vertical axis (10.00 to 52.00) and C3 on the horizontal axis (8.00 to 48.00); each X, or digit for coincident points, marks one pair of scores.]

The correlation is provided by the CORR command, and can be used for all possible correlations among our subjects: the command CORR C1-C3 gives the correlation matrix:

         C1       C2
C2    0.321
C3    0.769    0.000

5.3   Paired sample t-test
As a final example, let us see how MINITAB deals with the t-test that we discussed in 11.4. There we were concerned with testing the hypothesis that two groups of teachers (native English teachers and Greek teachers of English) give the same error gravity scores on average, i.e. H0: μ1 = μ2 against H1: μ1 ≠ μ2. We noted that because judges were addressing the same error there was likely to be some correlation between scores awarded by different groups of judges on the same error. We were then dealing with 'correlated' samples, and testing the hypothesis was seen to require a paired sample t-test. Within MINITAB this is done as follows. We set up a new column of numbers which corresponds to the differences between C1 and C2 (the X - Y column in table 11.5), as follows:

SUBTRACT C2 FROM C1, PUT DIFFERENCES INTO C4

(The new column, C4, corresponds to the X - Y column in table 11.5; if you want to check, try PRINT C4.)
To test the hypothesis H0: μ = 0 (see p. 185), we proceed as follows:

TTEST MU = 0, FOR DIFFERENCES IN C4

The output we get looks like this:

TEST OF MU = 0 VS MU N.E. 0

       N     MEAN    STDEV   SEMEAN       T   PVALUE
C4    32    -3.25     8.32     1.47   -2.21    0.035

Compare the t-value here with that arrived at in chapter 11.

This is a brief overview of a very few of the capabilities of the MINITAB package, to give you an inkling of what it can do. At the very least it will cut down dramatically on tedious computational time. Also, by doing this it (or a similar package) will enable you eventually to concentrate on interpreting the results of the tests and analyses you apply rather than on the details of their calculation.
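The calculations behind the CORR and TTEST commands above can be mirrored in any modern environment. As a rough sketch (not part of the book, which uses MINITAB on a mainframe), here is the same arithmetic in Python. The data used are only the four rows displayed in section 4.2, not the full 32-row table 12.4, so the results differ from the MINITAB output shown above.

```python
import math
import statistics as st

def pearson_r(x, y):
    """Product-moment correlation coefficient, as computed by CORR."""
    mx, my = st.mean(x), st.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def paired_t(x, y):
    """Paired-sample t statistic: the mean of the differences divided by
    its standard error (the calculation behind SUBTRACT ... / TTEST MU = 0)."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return st.mean(d) / (st.stdev(d) / math.sqrt(n))

# The four data rows shown in section 4.2 (not the full 32-row table).
c1 = [22, 16, 42, 25]   # native English teachers
c2 = [36, 9, 29, 35]    # Greek teachers of English

print(round(pearson_r(c1, c2), 3))   # -> 0.4
print(round(paired_t(c1, c2), 3))    # -> -0.153
```

Run on the complete table 12.4, pearson_r should reproduce MINITAB's 0.321 and paired_t its t of -2.21.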
APPENDIX C
Answers to some of the exercises

Chapter 3
(1a) mean = 6.35, median = 6, bimodal with modes at 2 and 7
(1b) mean = 4.13, median = 4, mode = 4
(3) mean = 6.35, standard deviation = 3.32; mean = 4.13, standard deviation = 1.88
(4a) order is: R(1.47), P(1.00), Q(0.33), U(-0.60), T(-1.67), S(-3.25)
(4b) R is class D, T is class F, U is class E

Chapter 5
(3) 0.22, 0.78, 0.~
(4) 0.956, 0.432, 0.954, 0.014
(5) 0.879, 0.758

Chapter 6
(3) standard deviation = 2.035 in both cases
(4) (a) 3.654 (c) 1.0
(5) (a) 0.1587 (b) 0.1587 (c) 0.6915 (d) 0.7734 (e) 0.5328 (f) 0.1598 (g) 152.9

Chapter 7
(1) (a) 3.74 to 4.52 (b) 3.65 to 4.61
(3a) 0.561 to 0.706 and 0.124 to 0.236
(4) mean = 57.7 and standard deviation = 13.1; confidence interval 28.1 to 87.3

Chapter 8
(1) t = -1.313
(3) if n = 16 then t = 1.151; if n = 250 then t = 4.548
(4) (i) 33 (ii) 66

Chapter 9
(1) chi-square = 4.62 on 1 df and is significant at 5%, not significant at 1%
(2) chi-square = 3.1 on 2 df and is not significant at 5%

Chapter 10
(2a) COV = 15.8, r = 0.32
(3) 0.05 < p < 0.10
(4) correlation increases dramatically to r = 0.62
(5) D = 3926 so that rank correlation coefficient = 0.28

Chapter 11
(1) Z = 1.2427, cannot reject H0
(2) paired samples here: t = 2.31 on 31 df, reject H0 at 5% level
(3) 0.55 to 8.77
(4) Z = 0.396, not significant
(5) F = 1.11 on (31, 31) df; not significant
(6) U1 = 32 and U2 = 11, hence not significant
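Some of the numerical answers above can be checked directly. Assuming, as the chapter contexts suggest, that the chapter 6 values are standard normal probabilities and that the two chapter 8 t-values in (3) differ only in sample size, a quick check in Python:

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Chapter 6 (5): these equal the quoted table values.
print(round(1 - phi(1.0), 4))   # -> 0.1587, the value quoted for (a) and (b)
print(round(phi(0.5), 4))       # -> 0.6915, as quoted for (c)
print(round(phi(0.75), 4))      # -> 0.7734, as quoted for (d)

# Chapter 8 (3): with the same mean and standard deviation, t grows as the
# square root of n, so t for n = 250 should be 1.151 * sqrt(250/16).
print(round(1.151 * math.sqrt(250 / 16), 2))   # -> 4.55, close to the quoted 4.548
```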

""
REFERENCES

Allen, G. 1985. How the young French child avoids the pre-voicing problem for word-initial voiced stops. Journal of Child Language 12: 37-46.
Baker, W. J. & Derwing, B. 1982. Response coincidence analysis as evidence for language acquisition strategies. Applied Psycholinguistics 3: 193-221.
Bellinger, D. 1980. Consistency in the pattern of change in mothers' speech: some discriminant analyses. Journal of Child Language 7: 469-87.
Bennett, S. & Bowers, D. 1976. Multivariate techniques for social and behavioural sciences. London: Macmillan.
Berko, J. 1958. The child's learning of English morphology. Word 14: 150-77.
Brasington, R. 1978. Vowel epenthesis in Rennellese and its general implications. Work in Progress 2, Phonetics Laboratory, University of Reading.
Carroll, J. B. 1958. A factor analysis of two foreign language aptitude batteries. Journal of General Psychology 59: 3-19.
Clark, H. 1973. The language-as-fixed-effect fallacy: a critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior 12: 335-59.
Coxon, A. P. M. 1982. The user's guide to multidimensional scaling. London: Heinemann.
Crystal, D., Fletcher, P. & Garman, M. 1976. The grammatical analysis of language disability: a procedure for assessment and remediation. London: Edward Arnold.
Cureton, E. E. & D'Agostino, R. B. 1983. Factor analysis: an applied approach. London: Lawrence Erlbaum Associates.
Downie, N. M. & Heath, R. W. 1965. Basic statistical methods, 2nd edn. New York: Harper & Row.
Eastment, H. T. & Krzanowski, W. J. 1982. Cross-validatory choice of the number of components from a principal components analysis. Technometrics 24: 73-7.
Elbert, S. H. 1975. Dictionary of the language of Rennell and Bellona. Copenhagen: National Museum of Denmark.
Ferris, M. R. & Politzer, R. L. 1981. Effects of early and delayed second language acquisition: English composition skills of Spanish-speaking junior high school students. TESOL Quarterly 15: 253-74.
Fleiss, J. L. 1981. Statistical methods for rates and proportions. New York: Wiley.
Fletcher, P. & Peters, J. 1984. Characterising language impairment in children: an exploratory study. Language Testing 1: 33-49.
Fry, D. B. 1979. The physics of speech. Cambridge: CUP.
Ghiselli, E. E., Campbell, J. P. & Zedeck, S. 1981. Measurement theory for the behavioural sciences. Oxford: W. H. Freeman.
Gilbert, G. N. 1984. Modelling society: an introduction to loglinear analysis for social researchers. London: George Allen & Unwin.
Glasser, G. J. & Winter, R. F. 1961. Critical values of rank correlation for testing the hypothesis of independence. Biometrika 48: 444-8.
Healey, W. C., Ackerman, B. L., Chappell, C. R., Perrin, K. L. & Stormer, J. 1981. The prevalence of communicative disorders: a review of the literature. Rockville, Md.: American Speech-Language-Hearing Association.
Hockett, C. 1954. Two models of grammatical description. Word 10: 210-31.
Horvath, B. M. 1985. Variation in Australian English. Cambridge: CUP.
Hughes, A. 1979. Aspects of a Spanish adult's acquisition of English. Interlanguage Studies Bulletin 4: 49-65.
Hughes, A. & Lascaratou, C. 1981. Competing criteria for error gravity. ELT Journal 36: 175-82.
Hughes, A. & Woods, A. J. 1982. Unitary competence and Cambridge Proficiency. Journal of Applied Language Study 1: 5-15.
Hughes, A. & Woods, A. J. 1983. Interpreting the performance on the Cambridge Proficiency Examination of students of different language backgrounds. In A. Hughes & D. Porter (eds) Current developments in language testing. London: Academic Press.
Innes, S. 1974. Developmental aspects of plural formation in English. Unpublished M.Sc. thesis, University of Alberta.
Jolicoeur, P. & Mosimann, J. E. 1960. Size and shape variation in the painted turtle: a principal component analysis. Growth 24: 339-54.
Kendall, M. G. 1970. Rank correlation methods, 4th edn. London: Griffin.
Khan, F. (forthcoming). Linguistic variation in Indian English: a sociolinguistic study. Unpublished Ph.D. thesis, University of Reading.
Krzanowski, W. J. & Woods, A. J. 1984. Statistical aspects of reliability in language testing. Language Testing 1: 1-20.
Labov, W. 1966. The social stratification of English in New York City. Washington DC: Center for Applied Linguistics.
Lascaratou, C. 1984. The passive voice in Modern Greek. Unpublished Ph.D. thesis, University of Reading.
Macken, M. & Barton, D. 1980a. The acquisition of the voicing contrast in English: a study of voice onset time in word-initial stop consonants. Journal of Child Language 7: 41-74.
Macken, M. & Barton, D. 1980b. The acquisition of the voicing contrast in Spanish: a phonetic and phonological study of word-initial stop consonants. Journal of Child Language 7: 433-58.
Marriott, F. H. C. 1974. The interpretation of multiple observations. New York and London: Academic Press.
Maxwell, A. E. 1977. Multivariate analysis in behavioural research. London: Chapman & Hall.
Miller, G. 1951. Language and communication. New York: McGraw-Hill.
Miller, G. & Nicely, P. E. 1955. An analysis of perceptual confusion among English consonants. Journal of the Acoustical Society of America 27: 338-52.
Miller, J. & Chapman, R. 1981. The relation between age and mean length of utterance in morphemes. Journal of Speech and Hearing Research 24: 154-61.
Morrison, D. F. 1976. Multivariate statistical methods, 2nd edn. New York: McGraw-Hill.
Newport, E., Gleitman, L. & Gleitman, H. 1977. Mother, I'd rather do it myself: some effects and non-effects of maternal speech style. In C. E. Snow & C. A. Ferguson (eds) Talking to children: language input and acquisition. Cambridge: CUP.
Overall, J. E. & Klett, C. 1972. Applied multivariate analysis. New York: McGraw-Hill.
Quirk Report. 1972. Speech therapy services. London: HMSO.
Ryan, B. F., Joiner, B. L. & Ryan, T. A. 1985. MINITAB student handbook, 2nd edn. Boston, Mass.: Duxbury Press.
Scherer, G. A. & Wertheimer, M. 1964. A psycholinguistic experiment in foreign language teaching. New York: McGraw-Hill.
Shepard, R. N. 1972. Psychological representation of speech sounds. In E. E. David & P. B. Denes (eds) Human communication: a unified view. New York: McGraw-Hill.
Shepard, R. N., Romney, A. K. & Nerlove, S. B. (eds) 1972. Multidimensional scaling: theory and applications in the behavioral sciences. New York: Seminar Press.
Siegel, S. 1956. Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill.
Spearman, C. 1904. The proof and measurement of association between two things. American Journal of Psychology 15: 88-103.
Viana, M. 1985. The acquisition of Brazilian Portuguese phonology. Unpublished Ph.D. thesis, University of Reading.
Wells, G. 1985. Language development in the pre-school years. Cambridge: CUP.
Wetherill, G. B. 1972. Elementary statistical methods. London: Chapman & Hall.
Winer, B. J. 1971. Statistical principles in experimental design. New York: McGraw-Hill.
Wishart, D. 1978. CLUSTAN: user manual. Program Library Unit, Edinburgh University.
Woods, A. J. 1983. Principal components and factor analysis in the investigation of the structure of language proficiency. In A. Hughes & D. Porter (eds) Current developments in language testing. London: Academic Press.

INDEX

Allen, 1
alternative hypothesis (H1), 120ff; see also null hypothesis
analysis of variance (ANOVA), 194-223; ANOVA models, 206-15; assumptions for, 195-6; fixed and random effects in, 212-14, 217; for regression analysis, 242-4; multiple comparison of means in, 210-11; one-way, 194-9; split-plot designs in, 221; test score reliability and, 215-19; transforming data in, 220-1; two-way: factorial, 202-6; two-way: randomised blocks, 200-2; 'within-subjects' ANOVA, 221-2
arcsine transformation, 220
audiolingual method, 176ff
average, 3, 29; see also mean, median, moving average
Baker, 255ff
bar chart, 11ff
Barton, 4, 5, 50, 58, 178, 188
Bellinger, 271
Bennett, 271, 291
Berko, 255
between-groups variance estimate, 197; see also analysis of variance
bias, see sampling bias
bilingualism, 61-6, 139-42, 146-7, 149-50, 182
bimodal distribution, 34, 103
BMDP, 307
Bowers, 271, 291
Brasington, 58, 142, 143, 144, 145
Cambridge Proficiency Examination (CPE), 16-19, 26-9, 30, 274, 278, 284-5
Campbell, 216
Carroll, 291, 292
categorical data, 8-13, 250, 266
central interval, 38-40
Central Limit Theorem, 89, 101, 102, 103, 123, 177
Chapman, 226, 227
chi-squared (χ²), 132-53; problems and pitfalls of, 144-51; test for goodness of fit, 132-9; test for independence, 139-44; Yates's correction in, 146
Clark, H., 215
class interval, 17
cloze test, 220, 238ff
CLUSTAN, 256
cluster analysis, see hierarchical cluster analysis, non-hierarchical clustering
colour words, 250-2
conditional probability, 61-6
confidence interval, 96-8, 101ff; for correlation coefficient, 163-5; for proportions, 184; for regression, 231; for t-ratio, 180, 187
confidence level, 110-11
contingency table, 140, 182
correction factor (CF), 198
correlation, 2, 104, 157-75, 216-19, 287ff; correlation coefficient (r: product-moment), 162-9: confidence interval for, 163-5, comparisons of, 165-7, interpreting, 167-9, significance test for, 163, testing hypotheses about, 162-3, 164; partial correlation, 244-5; rank correlation (rs: Spearman's), 169-74: significance test for, 173
correlation matrix, 244, 253, 287
covariance, 155-60, 161, 162, 174, 227, 230; calculation of, 157
covariance matrix, see variance-covariance matrix
Coxon, 265
critical value, 126
cross-product, 157
cross-sectional data, 4
Crystal, 267
cumulative frequency, 18
Cureton, 294, 295
D'Agostino, 294, 295
degrees of freedom, 102, 126, 136, 138-9, 141-2
dendrogram, 256, 258, 259
dependent variable, 228, 235ff, 249
Derwing, 255ff
directionality, 182, 183, 190, 191; see also one-tailed test, two-tailed test
discriminant analysis, see linear discriminant analysis
dispersion, measures of, 37-43
dissimilarity matrix, 252-4, 260, 264
distribution: chi-square, 132; F-, 182, 194-221 passim, 243, 304-5; normal, 86-93; sampling, 84; standard normal, 90-3, 183, 189, 270; t, 102-3, 106, 124, 125-6, 181, 183, 243, 300
Dow Jones Index, 273
Downie, 166
dysphasia, 129-30, 189-90
Eastment, 283, 286, 290
eigenvalue, 285ff
Elbert, 143, 145
epenthesis, 142-4
error (deviation), 79-84; see also residual
error gravity, 55-6, 154ff, 163, 172, 184ff, 201-2, 207-11, 215, 309ff
errors (by learners), 154
estimating a proportion, 99-101
estimating from samples, 95-111
factor analysis, 139, 206, 290-5
factors/factorial experiment, 202-6
F-distribution, 182, 194-221 passim, 243, 304-5
Ferris, 58, 139, 146, 149, 182
first quartile, 18
first stage sampling frame, 53
Fisher's Z-transformation, 164n, 165
fixed effects model in ANOVA, 212-14
Fletcher, 50, 57, 101, 266, 267
foreign language teaching, 55-7, 176
F-ratio, 182, 184, 194-221 passim; significance test for, 182, 304-5
French, 250
Fry, 3
F-test, 182
Garman, 267
Gauss, 79, 87; Gaussian curve, 87, 88; see also normal distribution
generalised linear models, 247
generative grammar, 1
German, 176
Ghiselli, 216
Gilbert, 247
Gleitman, H., 58
Gleitman, L., 58
GLIM (generalised linear interactive modelling), 247
goodness of fit tests, see chi-squared
grammar-translation method, 176
Greek, 152
Healey, 8
Heath, 166
hierarchical cluster analysis, 254-61
histogram, 17
Hockett, 77
Horvath, 290
Hughes, 21, 22, 55, 147, 154, 184, 278, 284, 285, 286
hypothesis, see null hypothesis, alternative hypothesis
hypothesis testing, 113-30; see also null hypothesis, alternative hypothesis
independence of observations, 104-5, 106, 147-9, 150, 181
independent variable, 228, 235, 242-3, 249
Indian English, 19-20
inference, see statistical inference
Innes, 255
interaction (between factors), 205
interquartile distance/range, 38-40
Joiner, 309
Jolicoeur, 275, 277
Khan, 19
Klett, 271
Krzanowski, 216, 283, 290
Labov, 20
language acquisition, 3-6, 25-6, 29-30, 31, 49-50, 51, 53-4, 77-80, 95, 110
language aptitude, 2, 90, 92-3
language impairment/disorders, 8-13, 101-3, 129-30, 132-9, 266ff
language learners' performance (foreign), 21-2, 147-9, 169
language tests, 16-19, 26-9, 30, 36, 122ff; see also Cambridge Proficiency Examination, proficiency tests
language variation, 3, 19-20
Laplace, 79
LARSP (language assessment, remediation and screening procedure), 267
Lascaratou, 55, 152, 154, 184
linear correlation, see correlation
linear discriminant analysis, 265-71
linear model, 224ff
linear regression, 139, 154, 167, 168, 169, 220, 224-48; multiple linear regression model, 240-2: partial regression, 245; simple linear regression model, 231-2, 239, 246: calculation of, 230, confidence interval for, 234-5, 240, estimating parameters in, 229-30, extrapolating from, 237, least squares regression line for, 228, testing the significance of, 233-4; stepwise regression, 243
loan words, 142-4
logarithms, 164, 221, 246
longitudinal data, 4
Macken, 4, 5, 50, 58, 178, 188
Mandarin, 291
Mann-Whitney rank test, 188-9, 306
Marriott, 260
matching coefficient, 252
Maxwell, 291
MDS(X), 265
mean, 2, 29-37, 311; differences between two means: independent samples, 176-81, paired samples, 184-7; multiple comparisons among, 210-11; standard error of, 97n, 98
mean length of utterance (mlu) and average mlu (MLU), 106-7, 226-33
median (score), 19, 27-9, 30-4
Miller, G., 113, 264
Miller, J., 226, 227
MINITAB, 41, 236, 240, 243, 307, 309-13
mode, 33
models, see statistical models
Morrison, 275ff, 289
Mosimann, 275, 277
moving average, 22
multidimensional scaling, 262-5
multiple regression, 237-45, 268-93
multivariate analysis, 249-52, 273ff
nasality, 264
Nerlove, 265
Newport, 58
Nicely, 264
non-hierarchical clustering, 261
nonparametric tests, 188-90
normal curve, 87, 88; see also normal distribution
normal distribution, 86-93; testing fit of to data, 132-9
normal probability paper, 236
noun phrase, 147-9
null hypothesis (H0), 120ff; see also alternative hypothesis
numerical data, 13ff, 250
one-tailed (vs. two-tailed) test, 122ff
oral fluency ability, 169, 176
outliers, 126-30, 169
Overall, 271
overlapping samples, 191
paired samples, 184-7
partial correlation, 244-5
Pascal, 79
passive, 152
past tense, 99-101
Pearson, K., 161, 172, 173, 230
Pearson product-moment correlation coefficient, see correlation
percentages, 34-7, 150-1, 220
percentile score, 18
Peters, 50, 57, 101, 266
phonology/phonological, 19-20, 142-4; see also voicing
plural morpheme inflection, 148-9, 255ff
point estimators, 95ff
Politzer, 58, 139, 146, 149, 182
pooled variance estimate, 196
populations, 49ff
power of tests, 191-2
present perfect, 104-5
principal components analysis (PCA), 265, 275-90; interpreting principal components, 284-6; compared with factor analysis, 292
probability, 59-75; conditional, 61-6; distribution, 67, 84
proficiency tests, 194ff, 238ff
pronoun, 182; agreement, 146-7
proportions, 9, 34, 182, 220; comparing two proportions, 182-4; confidence interval for, 184; estimating, 99-101; means of, 34-7
Quirk, 23
random effect model in ANOVA, 212-15, 217
random error, 79-84; see also residual
randomised block design, 202
randomness, 86-7
random number tables, 73-5
random sampling, 52, 54, 72-5
random variable: continuous, 68-72; discrete, 66-8
random variation, 86-93
range, 16
rank correlation, see correlation
rank order, 170
reaction time, 68-72
regression analysis, see linear regression
relative cumulative frequency curve, 18
relative frequencies, 9ff
reliability, 215-19; coefficient of, 216
Rennellese, 142-4
residual, 84ff, 229ff; see also error
response coincidence matrix, 256-7
Reynell Developmental Language Scales, 101
Romney, 265
Ryan, B., 309
Ryan, T., 309
sample mean, 84-9
samples, 2, 51-7; estimating from, 95-111; size of, 4, 5, 37, 58-9, 101-11, 126-9, 136, 280
sampling, 51-7; more than one level of, 106-7
sampling bias, 51-2
sampling distribution of the sample mean, 84
sampling frames, 49, 52-4
SAS, 247
scale-dependent (vs. scale-independent) measures, 159
scattergram/scatter diagram, 154, 155, 156, 160, 174, 227, 239
Scherer, 176
schizophrenia, 220
Shepard, 264, 265
Siegel, 306
significance level, 111ff
sign test, 190, 191, 192, 306
skewness, 32-4, 103
sociolinguistics, 34-6
Spearman, 171, 172, 173, 290
Spearman rank correlation coefficient, see correlation
split-plot designs in ANOVA, 221
SPSS, 287, 290, 307
standard deviation, 2, 40-3, 311; of sample mean, 85, 97-9; see also standard error of means
standard error, 99; of means, 97, 99; of proportions, 105
standardised score, 43-5, 90-2
standard normal distribution, 90-3, 183, 189, 270; compared with t-distribution, 125-6
statistical independence, 61-6; testing model of, 139-42
statistical inference, 48-57
statistical models, 77-94; in analysis of variance, 206-15; in multiple regression, 240; in simple linear regression, 231-2, 239; linear models, 224ff
statistical population, see populations
Stephens Oral Screening Test, 101
stepwise regression, 243-4
sum of squares, 198ff
t (Student's t), 102-3, 166-7, 177ff, 182-7, 191, 192, 220: assumptions for, 181, 187, 188, 192; calculations of, 178, 179; confidence interval for, 180, 187; directional vs. nondirectional test, 177, 179; formula for, 177; for independent samples, 176-81; for paired samples, 184-7, 191, 312; significance test for, 177-9, 300; t-distribution, 181, 183, 243
tables, 8ff; multi-way, 19; three-way, 19; two-way, 10
target population, 49ff
tense marking, 50, 99-101
test statistic, 117-20
transformation of data, 220-1, 245-7
true value, 79
two-tailed (vs. one-tailed) test, 122ff
type 1 error, 115-17, 119, 127-8, 145, 149, 191, 194
type 2 error, 115-17, 127-8, 136, 191
Urdu, 250
utterance length, 14-15, 25-6, 29-30, 31, 49-50, 51, 54, 106-7
variability, 37-43; measures of, 235
variables, see random variable, independent variable, dependent variable
variance, 40-3; comparing two variances, 182
variance-covariance matrix, 277ff, 287-90
variance estimates: between-groups, 197; pooled, 196; within-samples, 196
variance ratio statistic, see F-ratio
verb expansion, 101-3
verb forms, 139-42, 149-50; see also present perfect, past tense, verb expansion
Viana, 4
vocabulary, 48-9, 113-17, 120-1, 239
voice onset time (VOT), 1-2, 3, 4-6, 50, 53-4, 77-80, 95-9, 107-10, 179ff, 188-9
voicing, 178ff, 264
weighted mean, 36
Wells, 54
Wertheimer, 176ff
Wetherill, 181
Winer, 215, 221
Wishart, 256
within-samples variance estimate (Sw), 196
Woods, 216, 278, 284, 285, 286
Yates's correction, see chi-square
Zedeck, 216
Z-score, see standardised score
Z-value, 183, 189, 190; see also standard normal distribution
