
THE USE OF RANKS TO AVOID THE ASSUMPTION OF

NORMALITY IMPLICIT IN THE ANALYSIS OF VARIANCE


BY MILTON FRIEDMAN

National Resources Committee

Most projects involving the collection and analysis of statistical data have for one of their major aims the isolation of factors which account for variation in the variable studied. The statistical tool ordinarily employed for this purpose is the analysis of variance. Frequently, however, the data are sufficiently extensive to indicate that the assumptions necessary for the valid application of this technique are not justified. This is especially apt to be the case with social and economic data where the normal distribution is likely to be the exception rather than the rule. This difficulty can be obviated, however, by arranging each set of values of the variate in order of size, numbering them 1, 2, and so forth, and using these ranks instead of the original quantitative values. In this way no assumption whatsoever need be made as to the distribution of the original variate.

The utilization of ranked data is thus frequently a desirable device to avoid normality assumptions; in addition, however, it may be inescapable either because the data available relate solely to order, or because we are dealing with a qualitative characteristic which can be ranked but not measured.

The possibility of using ranked data in problems involving simple correlation and thereby avoiding assumptions of normality has recently been emphasized in an article by Harold Hotelling and Margaret Richards Pabst.¹ It is the purpose of the present article to outline a procedure whereby the analysis of ranked data can be employed in place of the ordinary analysis of variance when there are two (or more) criteria of classification. This procedure has two major advantages. As already indicated, it is applicable to a wider class of cases than the ordinary analysis of variance. In addition, it is less arduous than the latter technique, requiring but a fraction as much time. The loss of information through utilizing the procedure outlined below when the analysis of variance could validly be applied may thus be more than compensated for by its greater economy. This consideration is likely to be especially important with those large scale collections of social and economic data which have become increasingly frequent in recent years and for which the funds available for analysis are limited.
1 "Rank Correlation and Tests of Significance Involving No Assumption of Normality," Annals of Mathematical Statistics, VII (1936), 29-43.

676 AMERICAN STATISTICAL ASSOCIATION*

THE PROCEDURE

The procedure, which I shall call the method of ranks, involves first ranking the data in each row of a two-way table and then testing to see whether the different columns of the resultant table of ranks can be supposed to have all come from the same universe. This test is made by computing from the mean ranks for the several columns a statistic, $\chi_r^2$, which tends to be distributed according to the usual $\chi^2$ distribution when the ranking is, in fact, random, i.e., when the factor tested has no influence.
The details of the procedure can best be explained by presenting an example. Table I gives the standard deviations of expenditures on different categories of expenditure for seven income levels.² It is desired to determine whether the standard deviations differ significantly for the different income levels.

TABLE I
STANDARD DEVIATIONS AT DIFFERENT INCOME LEVELS* OF EXPENDITURES ON THE MAJOR CATEGORIES DURING 1935-36 OF 246 MINNEAPOLIS AND ST. PAUL FAMILIES OF WAGE-EARNERS AND LOWER SALARIED CLERICAL WORKERS†

                                             Annual family income
Category of expenditure       $750-  $1,000-  $1,250-  $1,500-  $1,750-  $2,000-  $2,250-
                              1,000    1,250    1,500    1,750    2,000    2,250    2,500

Housing                      $103.3   $68.42   $89.53   $77.94   $100.0   $108.2   $184.9
Household operation           42.19    44.31    60.91    73.90    43.87    61.74    102.3
Food                          71.27    81.88   100.71    86.52    100.3    90.75    100.6
Clothing                      37.59    60.05    56.97    60.79    71.82    83.04    117.1
Furnishings and equipment     58.31    52.73    96.04    60.42   104.33    89.78    85.77
Transportation                46.27    82.18    129.8    181.0   172.33    164.8    246.8
Recreation                    19.00    23.07    38.70    45.81    59.03    50.69    55.18
Personal care                  8.31     8.43     9.16    14.28    10.63    15.84    12.50
Medical care                  20.15    33.48    60.08    69.35   114.34    45.28    101.6
Education                      3.16     4.12    12.73    18.95     8.89    41.52    66.33
Community welfare              4.12    18.87     8.54    12.92    25.30    19.85    16.76
Vocation                       7.68    11.18    10.44    10.95    10.54    13.96    14.39
Gifts                          5.29    10.91    11.22    25.26    42.25    48.80    69.38
Other                          6.00     5.57    22.23     2.45     6.24     1.00     4.00

* In computing the standard deviations the influence of family composition (in terms of number of members and their age) was eliminated by grouping the families at each income level into similar family types and computing the sums of squares within such income-family type groups. These sums of squares were summed for the family types at each income level and divided by the number of degrees of freedom. This gave the variance at each income level. It is the square roots of the variances which are entered in the table.
† The figures in this table are based on schedules collected by the Cost of Living Division of the U. S. Bureau of Labor Statistics. These schedules were loaned to the National Resources Committee for special analyses, of which this is one.

2 The figures given in Table I were obtained from schedules on the receipts and disbursements of families of wage earners and lower salaried clerical workers during 1935-36 collected in Minneapolis and St. Paul by the Cost of Living Division of the U. S. Bureau of Labor Statistics. These schedules were loaned to the National Resources Committee for special analyses, several of which are used in this article.

The first step is to form Table II from Table I by ranking the standard deviations for each category, giving the lowest value a rank of 1, the next lowest a rank of 2, etc.³ Thus, in each row of Table II, we have a set of numbers from 1 to 7, since there are seven income levels.

TABLE II
RANKING OF INCOME LEVELS BY SIZE OF STANDARD DEVIATION FOR EACH CATEGORY OF EXPENDITURE*

                                             Annual family income
Category of expenditure       $750-  $1,000-  $1,250-  $1,500-  $1,750-  $2,000-  $2,250-
                              1,000    1,250    1,500    1,750    2,000    2,250    2,500

Housing                           5        1        3        2        4        6        7
Household operation               1        3        4        6        2        5        7
Food                              1        2        7        3        5        4        6
Clothing                          1        3        2        4        5        6        7
Furnishings and equipment         2        1        6        3        7        5        4
Transportation                    1        2        3        6        5        4        7
Recreation                        1        2        3        4        7        5        6
Personal care                     1        2        3        6        4        7        5
Medical care                      1        2        4        5        7        3        6
Education                         1        2        4        5        3        6        7
Community welfare                 1        5        2        3        7        6        4
Vocation                          1        5        2        4        3        6        7
Gifts                             1        2        3        4        5        6        7
Other                             5        4        7        2        6        1        3

a. Total                         23       36       53       57       70       70       83
b. Mean rank                  1.643    2.571    3.786    4.071    5.000    5.000    5.929
c. Deviation                 -2.357   -1.429    -.214     .071    1.000    1.000    1.929

Sum of squared deviations = 13.3692
$\chi_r^2$ = 40.108

* The figures in this table are derived from Table I.
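The ranking step that turns Table I into Table II can be sketched in a few lines of Python. This is only an illustrative sketch: the names `TABLE_I`, `rank_row`, `table_ii`, and `column_totals` are hypothetical, the values are copied from Table I, and the helper assumes that no row contains tied values (which holds for this table).

```python
# Standard deviations from Table I: one row per category of expenditure,
# one column per income level ($750-1,000 up to $2,250-2,500).
TABLE_I = {
    "Housing": [103.3, 68.42, 89.53, 77.94, 100.0, 108.2, 184.9],
    "Household operation": [42.19, 44.31, 60.91, 73.90, 43.87, 61.74, 102.3],
    "Food": [71.27, 81.88, 100.71, 86.52, 100.3, 90.75, 100.6],
    "Clothing": [37.59, 60.05, 56.97, 60.79, 71.82, 83.04, 117.1],
    "Furnishings and equipment": [58.31, 52.73, 96.04, 60.42, 104.33, 89.78, 85.77],
    "Transportation": [46.27, 82.18, 129.8, 181.0, 172.33, 164.8, 246.8],
    "Recreation": [19.00, 23.07, 38.70, 45.81, 59.03, 50.69, 55.18],
    "Personal care": [8.31, 8.43, 9.16, 14.28, 10.63, 15.84, 12.50],
    "Medical care": [20.15, 33.48, 60.08, 69.35, 114.34, 45.28, 101.6],
    "Education": [3.16, 4.12, 12.73, 18.95, 8.89, 41.52, 66.33],
    "Community welfare": [4.12, 18.87, 8.54, 12.92, 25.30, 19.85, 16.76],
    "Vocation": [7.68, 11.18, 10.44, 10.95, 10.54, 13.96, 14.39],
    "Gifts": [5.29, 10.91, 11.22, 25.26, 42.25, 48.80, 69.38],
    "Other": [6.00, 5.57, 22.23, 2.45, 6.24, 1.00, 4.00],
}


def rank_row(values):
    """Rank the values of one row, 1 = smallest (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda j: values[j])
    ranks = [0] * len(values)
    for rank, j in enumerate(order, start=1):
        ranks[j] = rank
    return ranks


# Table II: the same layout as Table I, with ranks in place of values.
table_ii = {name: rank_row(row) for name, row in TABLE_I.items()}

# Line a of Table II: the sum of the ranks in each column.
column_totals = [sum(col) for col in zip(*table_ii.values())]
print(column_totals)
```

Dividing each total by the number of rows (14) gives the mean ranks on line b of Table II.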
On the hypothesis that for any one category the value of the standard deviation is the same at all income levels, differences among the values in each row of Table I will arise solely from sampling fluctuations. The rank entered for a particular income level would then be a matter of chance; in repeated samples each of the numbers from 1 to 7 would appear with equal frequency.⁴

If, therefore, the standard deviation were independent of the income level, the set of ranks in each column would represent a random sample of 14 items (that being the number of categories of expenditure) from the discontinuous rectangular universe 1, 2, 3, 4, 5, 6, 7. The mean of this universe is 4, or, in general, $\tfrac{1}{2}(p+1)$, where p is the number of ranks. The variance is also 4, or in general $(p^2-1)/12$.⁵

The next step in the procedure is to obtain the mean rank for each column. These are given on line b of Table II. In the absence of a relation between the standard deviations and income level, these means are all estimates of the same thing, namely of the mean of the rectangular universe. Moreover, the sampling distribution of the means will be approximately normal so long as the number of rows is not too small.⁶

The sampling distribution of the mean ranks (where $\bar{r}_j$ is the mean rank of the j-th column) will have a mean value ($\rho$) of $\tfrac{1}{2}(p+1)$ and a variance $\sigma^2$ of $(p^2-1)/(12n)$, where n is the number of rows, i.e., the number of ranks averaged.⁷

3 It is, of course, immaterial whether the ranking is from the lowest to the highest or the reverse, i.e., from the highest to the lowest.

4 This statement is strictly valid only if the different entries in the same row are assumed to come from the same universe, no matter, of course, what its nature. In the present example it requires some qualification since the standard deviations in each row are not all based on the same number of cases. In this case, while two entries in the same row of the original table (e.g., Table I) will have the same expected value, one will exceed the other more than half the time. The reason for this is that the sampling distribution of the ratio of two variances is not symmetrical unless both variances are based on the same number of degrees of freedom. The mean value of the ratio is approximately unity, but the median is not equal to one: it is less than one if the numerator is based on fewer degrees of freedom than the denominator, and conversely. In ranking two standard deviations, therefore, the one based on the smaller number of cases would receive a rank of 1 more than half the time. When more than two standard deviations are ranked this tendency is somewhat compensated for by the greater probability that those based on the fewest cases will receive relatively high ranks, and thus the average rank will be less affected. This difficulty does not, however, affect the validity of the illustrative analysis presented here, since the two highest income classes contain the smallest numbers of families but have the highest average ranks.
More generally, when the entries in different columns of the same row come from symmetrical universes with the same mean but different variances, the several ranks will have the same expected value, but the probability distribution for each cell will not be exactly rectangular. This condition of symmetry is a sufficient condition for the ranks to have the same expected value; it is, however, more stringent than is necessary. This difficulty clearly calls for further analysis.

5 The sum of the numbers from 1 to p is $\tfrac{1}{2}p(p+1)$. The mean is therefore $\tfrac{1}{2}(p+1)$. The sum of the squares of the numbers from 1 to p is $(2p+1)(p+1)p/6$. The variance is, therefore, $(2p+1)(p+1)/6 - \tfrac{1}{4}(p+1)^2 = (p^2-1)/12$.

6 That the sampling distribution of samples drawn from a rectangular universe approaches normality quite rapidly is, of course, well known. The distribution of means for samples of two is a triangle; for samples of three it is made up of three parabolic segments, the first and third concave upwards, and the middle one concave downward. An empirical distribution for samples of ten is given by Hilda Frost Dunlap, "An Empirical Determination of the Distribution of Means, Standard Deviations and Correlation Coefficients Drawn from Rectangular Populations," Annals of Mathematical Statistics, II (1931), 66-81. The universe sampled was a discontinuous rectangular universe, including the integers from 1 to 6. The empirical distribution shows extremely close conformity to the normal curve.

7 This follows from the fact that the variance of a mean of n observations of equal weight is 1/n times the variance of an individual observation.

Since the true mean and true standard deviation of the chance universe are known, the hypothesis that the means come from a single homogeneous normal universe can be tested by computing

$$\chi_r^2 = \frac{p-1}{p} \sum_{j=1}^{p} \frac{(\bar{r}_j - \rho)^2}{\sigma^2} = \frac{12n}{p(p+1)} \sum_{j=1}^{p} \Bigl(\bar{r}_j - \tfrac{1}{2}(p+1)\Bigr)^2 .$$
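The constant in the second form of the statistic follows from substituting the variance of a mean rank, $\sigma^2 = (p^2-1)/(12n)$, into the first form. The worked step below is an added illustration, not part of the original derivation; the numerical line uses the values of the present example (n = 14 rows, p = 7 columns, and the sum of squared deviations reported in Table II).

```latex
\begin{aligned}
\frac{p-1}{p}\cdot\frac{1}{\sigma^2}
  &= \frac{p-1}{p}\cdot\frac{12n}{p^2-1}
   = \frac{12n}{p(p+1)}, \\[4pt]
n = 14,\ p = 7:\qquad
\frac{12\cdot 14}{7\cdot 8} &= 3,
\qquad \chi_r^2 = 3 \times 13.3692 = 40.1076 .
\end{aligned}
```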
So long as the number of rows and columns is not too small, $\chi_r^2$ computed in this way will be distributed according to the usual $\chi^2$ distribution with $p-1$ degrees of freedom.⁸ If, now, $\chi_r^2$ is significantly greater than might reasonably have been expected from chance, the implication is that the mean ranks differ significantly, i.e., that the size of the standard deviation depends on the income level.

The computation of $\chi_r^2$ is extremely simple. The mean of the seven mean ranks is, of necessity, equal to the true mean of 4. The difference between the mean rank for each column and 4 is given on line c of Table II. The sum of the squares of these differences is 13.3692 and $\chi_r^2 = 40.1076$.

This illustrative computation has been made using a formula that makes clear the nature of $\chi_r^2$. In actual practice the following alternative formula, which involves only integers and makes unnecessary the computation of the actual mean ranks, will be found more convenient:

$$\chi_r^2 = \frac{12}{np(p+1)} \sum_{j=1}^{p} \Bigl(\sum_{i=1}^{n} r_{ij}\Bigr)^2 - 3n(p+1),$$

where $r_{ij}$ is the rank entered in the i-th row and j-th column.

The number of degrees of freedom on which this estimate is based is $p-1 = 6$. For six degrees of freedom the value of $\chi^2$ which would be exceeded by chance once in 20 times is 12.592, and once in a hundred times, 16.812.⁹ The probability of a value greater than 40 is .000001.¹⁰ There can thus be little question that the observed mean ranks differ significantly, i.e., that the standard deviation is related to the income level. From the mean ranks it is seen that with but one minor exception the standard deviations consistently increase with income.

8 For a justification of the formula for $\chi_r^2$ and of the statement that $\chi_r^2$ tends to be distributed like $\chi^2$, as well as for some indication of the number of columns and rows necessary, see pp. 687-694 and the mathematical appendix.
9 Fisher, R. A., Statistical Methods for Research Workers, Table III.
10 Pearson, Karl, Tables for Statisticians and Biometricians, 3rd Edition, London, 1930, Part I, Table XII.

Since the value of $\chi_r^2$ is invariant under transpositions of the columns of ranks under their captions, this information (that the ranks increase with income) has not been utilized. Whenever the columns themselves can be ranked, the additional information supplied by the relationship between the order of the mean ranks and the order of the columns can be used by computing a rank difference correlation between the two corresponding sets of ranks, determining the probability that the correlation coefficient obtained would have been equalled or exceeded by chance, converting this probability into the value of $\chi^2$ which corresponds to it for two degrees of freedom, and pooling the resultant value of $\chi^2$ with $\chi_r^2$.¹⁰ᵃ In the present illustrative example the evidence is so clear that this additional information will obviously not affect the conclusion. It will, however, serve to exemplify the procedure. The rank difference correlation between the mean rank and the income level is .991. (In deriving this coefficient the tied ranks were treated in the manner suggested below, i.e., they were assigned the average value of the ranks for which they were tied.) The probability of securing a value as great as or greater than this is between .00277 and .00040. The value of $\chi^2$ corresponding to the larger of these figures for two degrees of freedom is $-2\log_e .00277 = 11.77$. Adding this to $\chi_r^2$ gives 51.88 as the value to be entered in the $\chi^2$ table for 8 degrees of freedom. The probability associated with this value is smaller than that for $\chi_r^2$ and, indeed, is so small that it cannot be determined from the published tables.
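The two formulas for the statistic and the pooling step can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the column totals are those on line a of Table II, and the probability .00277 is the one quoted in the text. Working from the exact totals gives a value of about 40.10, marginally below the 40.1076 obtained in the text from mean ranks rounded to three decimals.

```python
import math


def chi_r_squared_from_means(column_totals, n):
    """Statistic from the squared deviations of the mean ranks from (p + 1) / 2."""
    p = len(column_totals)
    expected_mean = (p + 1) / 2
    sum_sq_dev = sum((t / n - expected_mean) ** 2 for t in column_totals)
    return 12 * n / (p * (p + 1)) * sum_sq_dev


def chi_r_squared_from_totals(column_totals, n):
    """Same statistic from the squared column totals of ranks."""
    p = len(column_totals)
    sum_sq_totals = sum(t * t for t in column_totals)
    return 12 / (n * p * (p + 1)) * sum_sq_totals - 3 * n * (p + 1)


def chi_squared_from_probability(prob):
    """Convert a probability into chi-square with two degrees of freedom."""
    return -2 * math.log(prob)


totals = [23, 36, 53, 57, 70, 70, 83]  # line a of Table II
n_rows = 14                            # categories of expenditure

from_means = chi_r_squared_from_means(totals, n_rows)
from_totals = chi_r_squared_from_totals(totals, n_rows)
extra = chi_squared_from_probability(0.00277)

print(round(from_means, 3), round(from_totals, 3))
print(round(extra, 2), round(from_totals + extra, 2))
```

Both functions return the same number, which is the point of the alternative formula: it needs only the integer column totals.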
In order to test whether the standard deviations are related to the type of expenditure it is only necessary to repeat the above analysis; this time, however, treating the columns in the way in which the rows were previously treated, and vice versa. Thus the standard deviations would be ranked for each income level, and the mean ranks obtained for each type of expenditure.

It might appear offhand as if the procedure used to study the relation between standard deviations and income level does not make use of all of the information provided by Table II, that it neglects the distribution of the ranks within the columns, and that this supplies additional information about the consistency of the ranking. This, however, is not the case. Since Table II must contain n 1's, n 2's, ..., n p's, the total sum of the squared deviations from the grand mean is the same no matter what the arrangement of the ranks within the table; it is, in fact, equal to $np(p^2-1)/12$. The sum of squares within columns plus the sum of squares between columns must add up to this total. Knowledge of one of these sums of squares thus implies knowledge of the other. In the above example we have used the sum of squares between columns; no additional information is thus supplied by the sum of squares within columns.

It should be noted that in testing the significance of the differences among the columns no assumption whatsoever needs to be made as to the similarity of the distribution of the original variate for the different rows. The test takes the form of comparing the mean ranks for the several columns; essentially, however, the null hypothesis tested is that the original entries in each row are from the same universe; whether or not this universe is the same for the different rows is entirely irrelevant to the validity of the test.

10a See Hotelling and Pabst, op. cit., pp. 35 and 40, and Fisher, op. cit., art. 21.1.

The method of ranks does not provide for testing "interaction." It is of the very nature of the method that it cannot do so. Without exact quantitative measurement, "interaction," in the sense used in the ordinary analysis of variance, is meaningless.

It should further be noted that the method of ranks may not provide a test of the influence of a factor if there is reason to suspect that this influence is in a different direction for the different rows; if, for example, the standard deviation increases with income for certain types of expenditure and decreases with income for others. For in such a case the mean ranks of the p columns may all have the same expected value, although the p ranks for each of the rows do not. Thus, if $\chi_r^2$ is significant, the conclusion is that the ranking is not random. But $\chi_r^2$ may not be significant, not because the ranking is random, or because the differences in the mean ranks are too small for the observed sample to display significance, but because the influence of the factor tested is different in direction for the different rows. In this connection, however, the general point should be emphasized that non-significant results do not establish the validity of the null hypothesis in the same way that significant results tend to contradict it.

In some cases two (or more) of the values of the variate in a row will be identical, i.e., there will be "tied" ranks. Two procedures can be followed: first, the ranks tied for can be assigned to the two (or more) values at random; or second, each value can be given the average value of the ranks tied for (e.g., if two values are tied for the ranks 2 and 3 each can be given the rank of 2.5). In general, the second of these procedures seems to be preferable, since it uses slightly more of the information provided by the data.¹¹ The substitution of the average rank for the tied values does not affect the validity of the $\chi_r^2$ test.¹²
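The second procedure for ties, giving each tied value the average of the ranks tied for, can be sketched as below. The function name `average_ranks` is hypothetical and the code is an illustration rather than a prescribed implementation. The second call applies it to the column totals of Table II, where the two totals of 70 share ranks 5 and 6.

```python
def average_ranks(values):
    """Rank values from 1 (smallest), giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda j: values[j])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the group while the next sorted value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Positions start..end (0-based) correspond to ranks start+1..end+1.
        shared_rank = (start + 1 + end + 1) / 2
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


print(average_ranks([5, 3, 3, 8]))
print(average_ranks([23, 36, 53, 57, 70, 70, 83]))
```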
THE EFFICIENCY OF THE METHOD OF RANKS RELATIVE
TO THE ANALYSIS OF VARIANCE

It is evident that the method of ranks does not utilize all of the information furnished by the data, since it relies solely on order and makes no use of the quantitative magnitude of the variate. It is this very fact that makes it independent of the assumption of normality. At the same time, it is desirable to obtain some notion about the amount of information lost, that is, about the efficiency of the method of ranks in situations where the analysis of variance provides the proper test.¹³

11 This alternative method of handling tied ranks and its advantages were brought to my attention by Mr. W. Allen Wallis, who has developed a simple adjustment to the usual formula for the rank-difference correlation to allow for the treatment of tied ranks in this fashion.
12 Its only effect is to change very slightly the "true" value of the variance. In the extreme case when tied ranks are as probable as untied ranks, the variance of an individual observation is changed from $(p^2-1)/12$ to $p(p-1)/12$, i.e., it is reduced by $(p-1)/12$ or in the ratio of 1 to $p+1$. The reduction is thus relatively small when p is moderately large.
For the special case of p = 2 (i.e., of two ranks) the method of ranks is equivalent to the binomial series test of significance of a mean difference, that is, it is equivalent to testing whether the proportion of positive differences between the pairs of values in each row of the $2 \times n$ table (the proportion of 2's in the first column of the table of ranks) differs significantly from $\tfrac{1}{2}$.¹⁴ Now, W. G. Cochran recently showed¹⁵ that the binomial series test of a mean difference has an efficiency of 63.7 per cent. It follows that the method of ranks, for the special case of p = 2, likewise has an efficiency of 63.7 per cent.

13 By the "efficiency" of a statistic m used to estimate a parameter $\mu$ is meant the ratio of the variance of the maximum likelihood estimate of $\mu$ to the variance of m. The difference between this ratio and unity multiplied by 100 gives the percentage of "information" lost. (R. A. Fisher, op. cit., Chapter IX.)
In the present instance, since the analysis of ranks and the analysis of variance provide estimates of different parameters (in the one case, of $\chi_r^2$, and in the other, of the analysis of variance ratio) it is first necessary to secure a relationship between the two parameters which can be used to estimate one from the other. In this way both methods can be used to estimate the same parameter.

14 The analogous method for p greater than 2, while it provides a method for analyzing a table of ranks and seems superficially closely related to the method of ranks, is essentially very different.
This alternative procedure involves the formation from the basic table of ranks of a $p \times p$ contingency table giving the number of ranks of each size in each column. Thus, one of the classifications is by column number, the other by the value of the rank. Such tables can then be analyzed by computing $\chi^2$ in the usual manner and testing its significance. Unless the number of rows is large relative to the number of columns, the usual $\chi^2$ tables will, of course, not be applicable. Exact distributions can, however, be obtained in the manner indicated by F. Yates ("Contingency Tables Involving Small Numbers and the $\chi^2$ Test," Journal of the Royal Statistical Society, Supplement, Volume I (1934), pp. 217-35).
This procedure does not, however, test the same hypothesis as the method of ranks. The reason is that with the contingency table method the numerical values of the ranks in no way affect the result, whereas in the method of ranks they do. Thus, consider the following $3 \times 3$ tables of ranks:

      A.  1  2  3        B.  1  2  3
          1  2  3            1  2  3
          3  2  1            1  3  2

It is clear that B indicates greater departure from the hypothesis that the ranking is random than does A. Both tables contain one column in which all three ranks are identical and two columns in which two out of three ranks are the same. But in B these latter two columns contain ranks which vary less than for the corresponding columns of A. Stated differently, in B every rank in the last two columns is greater than any rank in the first; no comparable statement is valid for A.
The contingency analysis would indicate, however, that A and B diverge equally from expectation, since both will give contingency tables which, except for permutations of rows and columns, are identical. The method of ranks, on the other hand, will indicate that B diverges more from expectation than A; $\chi_r^2$ is 4⅔ for B, but only ⅔ for A.
For the purpose of determining whether one variable has a significant influence on another, it seems clear that the method of ranks is definitely preferable to the contingency analysis just outlined.
The reason why the two methods are equivalent for p = 2 is evident; when there are only two ranks, there is no possibility of different ranks diverging by varying amounts.

15 "The Efficiencies of the Binomial Series Tests of Significance of a Mean and of a Correlation Coefficient," Journal of the Royal Statistical Society, C (1937), 69-73.
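The comparison of tables A and B in footnote 14 can be checked directly with the column-total formula for the statistic. The sketch below is an added illustration; the function name `chi_r_squared` is hypothetical, and exact fractions are used so that the results are not blurred by rounding.

```python
from fractions import Fraction


def chi_r_squared(rank_table):
    """Method-of-ranks statistic for a table of ranks (rows ranked 1..p)."""
    n = len(rank_table)       # number of rows
    p = len(rank_table[0])    # number of columns, i.e., of ranks
    totals = [sum(column) for column in zip(*rank_table)]
    sum_sq_totals = sum(t * t for t in totals)
    return Fraction(12, n * p * (p + 1)) * sum_sq_totals - 3 * n * (p + 1)


TABLE_A = [[1, 2, 3], [1, 2, 3], [3, 2, 1]]
TABLE_B = [[1, 2, 3], [1, 2, 3], [1, 3, 2]]

chi_a = chi_r_squared(TABLE_A)
chi_b = chi_r_squared(TABLE_B)
print(chi_a, chi_b)
```

Table B yields the larger value, in line with the argument that the method of ranks, unlike the contingency analysis, is sensitive to how far the ranks in a column diverge.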
Moreover, this provides a measure of the minimum efficiency of the method of ranks. When p = 2, a classification in terms solely of greater or smaller is substituted for the exact quantitative measurements; as p increases a more and more finely subdivided scale is substituted for the exact measurements. It seems reasonable, therefore, that the loss in information through using ranks decreases as p increases.

For the special case of n = 2, it is shown below that $\chi_r^2$ and the rank difference correlation are essentially equivalent. On the assumption that the true correlation is zero, Hotelling and Pabst have shown that the efficiency of the rank difference correlation approaches 91.19 per cent as p increases. In their words, "the product-moment correlation is approximately as sensitive a test of the existence of a relationship in a normally distributed population with 91 cases as the rank correlation with 100 cases."¹⁶

For the more general case, when p and n are greater than 2, I have not been able to determine the efficiency of the method of ranks. It seems clear, however, that the loss of information is less than the 36 per cent lost when p = 2 and probably greater than the 9 per cent lost when n = 2.

In the absence of the theoretical analysis there are presented here the results of applying both the analysis of variance and the method of ranks to the same data. A comparison of these results will, of course, offer no conclusive evidence as to the relative efficiency of the two methods; but it should at least suggest whether the loss of information in using the method of ranks is so great as to vitiate completely its usefulness.

The data analyzed are the same as those utilized in the illustrative analysis summarized in Tables I and II above, i.e., they are data on the expenditures and savings during 1935-36 of 246 Minneapolis and St. Paul families of wage earners and lower salaried clerical workers. In the present instance, however, the analysis is directed toward determining whether income and family composition have a significant influence on the expenditures for the various categories and on savings. The analysis given above, it will be recalled, attempted to determine whether income had a significant influence on the standard deviations of expenditure.

The 246 families have been grouped into seven income classes,¹⁷ each $250 in range, and five family composition types.¹⁸ This gives 35 groups in all.

16 Op. cit., pp. 42-43.
17 The total income of a family is defined as including not only money income, but also the imputed value of gifts in kind, of food produced at home and of the use of a home owned by the family.
For each of the major categories of expenditures, for savings, and for certain sub-groups of items, three variances have been computed: the variance (1) between income levels, (2) between family types, and (3) within groups.¹⁹

For 14 major categories of expenditure, for 13 sub-groups of several of these categories, and for savings, there were computed the ratios of the variance between income levels and the variance between family types to the variance within groups. These ratios, designated as $F_i$ and $F_f$ respectively, are given in Table III.

To each of the 28 items considered, the method of ranks was also applied to test the influence of income and family type.

In testing the influence of income, the seven mean expenditures for each family type were ranked. This gave five sets of seven ranks. In testing the influence of family type the procedure was reversed; the five mean expenditures at each income level were ranked, giving seven sets of five ranks.

The results of the method of ranks are likewise given in Table III. This table gives the values of $\chi_r^2$ computed in testing for the influence of income, as well as those obtained in testing for the influence of family type.

For both the analysis of variance and the method of ranks the values which are significant at the .01 level are indicated with a double star; those which are significant only at the .05 level, with a single star.
18 The family types are defined as follows:
Type 1  Husband, wife, and one child under 16
Type 2  Husband, wife, and two children under 16
Type 3  Husband, wife, one person 16 or over, and one or no other persons
Type 4  Husband, wife, one child under 16, one person 16 or over, and one or two other persons
Type 5  Husband, wife, and three or four children under 16.

19 There was also computed the variance due to interaction. Since the method of ranks can give no measure of interaction, this variance is of no interest here. It is worth pointing out, however, that interaction was significant for only three out of 28 cases; for one of those the probability was between .05 and .01 and for two it was less than .01.
Since the numbers of items in the subclasses are neither equal nor proportionate, there is some difficulty in decomposing the variation between groups. The variances between income levels and between family types were computed by the method of weighted squares of means. This method does not give an estimate of interaction when there are more than two classes for both of the factors. Consequently, the variance due to interaction was computed by the method of unweighted means.
For an excellent statement of the difficulties raised by disproportionate subclass numbers and of the available methods of analysis, see G. C. Snedecor and G. M. Cox, "Disproportionate Subclass Numbers in Tables of Multiple Classification," Research Bulletin 180, Agricultural Experiment Station, Iowa State College of Agriculture and Mechanic Arts (March 1935).

TABLE III
RESULTS OF ANALYSIS OF VARIANCE AND METHOD OF RANKS
Measures of the Influence of Income and Family Type on Expenditures for the Major Categories of Expenditure and for Sub-Groups of Items, and on Savings, Based on Data on the Expenditures and Savings During 1935-36 of 246 Minneapolis and St. Paul Families¹

                                 Analysis of variance:      Method of ranks:
                                 ratios of variances²       $\chi_r^2$
Item                             Income     Family type     Income     Family type
                                 $F_i$      $F_f$

Major categories of expenditure³
  Food                           15.33**     5.75**         27.02**    19.09**
  Household operation             9.95**     1.01           24.24**     4.94
  Housing                         9.50**     1.63           21.94**     6.17
  Clothing                        9.40**     1.38           25.54**     9.46
  Recreation                      4.25**     1.98           23.83**    11.89*
  Personal care                   4.10**      .80           21.11**     4.14
  Transportation                  3.78**     1.97           24.00**    10.06*
  Gifts                           3.36**      .96           21.17**     3.74
  Community welfare               2.95**      .45           17.04**      .49
  Education                       2.93**     1.79           17.31**     8.11
  Medical care                    2.51*       .80           18.69**     6.51
  Vocation                         .69       1.01            4.71       1.51
  Furnishings and equipment        .42        .37            6.96       3.69
  Other                            .25        .30            5.74       5.40

Savings (or deficit)              2.50*      1.25           14.74*      4.57

Sub-groups of items
Food⁴:
  Dairy products                  6.71**     9.41**         23.66**    21.83**
  Fruit                           4.87**      .38           12.69*      3.31
  Food away from home             3.49**     3.94**         17.34**    10.09*
  Meat                            2.59*      2.02            9.34       3.77
  Miscellaneous foods             2.01       1.21           15.00*      5.49
  Fish                             .98       2.43*           4.11       1.91
  Vegetables                       .73       2.11            6.69       8.80
  Grain products                   .71       4.76**          3.26       9.71*
  Sweets                           .20       1.05            3.96       9.94*
  Poultry                          .20        .99             .30       1.89
Personal care:
  Personal service                4.31**      .70           19.80**     4.71
  Personal supplies               3.38**      .75           14.34*      1.49
Household operation:
  Fuel and light⁵                 7.26**     1.56           23.25**     6.74

* Indicates that observed figure is "significant," i.e., greater than the value which would be exceeded by chance once in twenty times. For the ratios of variances this value is 2.14 for income and 2.42 for family type. For $\chi_r^2$ it is 12.592 for income and 9.488 for family type. The difference between the values for income and family type is a result of a difference in the number of degrees of freedom on which the respective estimates are based.
** Indicates that observed figure is "highly significant," i.e., greater than the value which would be exceeded but once in a hundred times by chance. For the ratios of variances this value is 2.89 for income and 3.41 for family type. For $\chi_r^2$ it is 15.033 for income and 13.277 for family type.
1 The figures in this table are based on schedules collected by the Cost of Living Division of the U. S. Bureau of Labor Statistics. These schedules were loaned to the National Resources Committee for special analyses, one of which is presented here.
2, 3, 4, 5 See the notes that follow the discussion of Table IV.

The two methods yield measures of the influence of income and family type for 28 items. There are thus 56 independent analyses by each method. In Table III these measures are classified into three groups: those which would have been exceeded by chance (a) in more than five per cent of random samples, (b) in between five per cent and one per cent of random samples, and (c) in less than one per cent of random samples. An indication of the relative efficiency of the two methods is provided by Table IV, which gives a comparison of the two classifications.

From the entries in the diagonal of Table IV, it is seen that for 45 out of the 56 analyses the two methods lead to similar conclusions. In no case does one of the methods indicate a probability of less than .01 while the other indicates a probability greater than .05.
TABLE IV
COMPARISON OF RESULTS OF ANALYSIS OF VARIANCE AND METHOD OF RANKS

                                   Analysis of variance:
                              number of F's with probability
Method of ranks:              Greater     Between      Less
probability of $\chi_r^2$     than .05   .05 and .01  than .01    Total

Greater than .05                 28          2           0          30
Between .05 and .01               4          1           4           9
Less than .01                     0          1          16          17

Total                            32          4          20          56
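The counts quoted in the text can be re-derived from the body of Table IV. The following is only a small sketch in Python; the variable names are illustrative and do not come from the paper.

```python
# Table IV as a 3x3 grid of counts. Rows are the method-of-ranks classes and
# columns are the analysis-of-variance classes, both in the order
# (greater than .05, between .05 and .01, less than .01).
table_iv = [
    [28, 2, 0],
    [4, 1, 4],
    [0, 1, 16],
]

total = sum(sum(row) for row in table_iv)
# Diagonal entries: both methods place the analysis in the same class.
agree = sum(table_iv[k][k] for k in range(3))
# Corner entries: one method gives P < .01 while the other gives P > .05.
extreme_disagreement = table_iv[0][2] + table_iv[2][0]

print(total, agree, extreme_disagreement)
```

The diagonal sum corresponds to the 45 of 56 analyses mentioned in the text, and the two off-diagonal corners are the cases of outright contradiction.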

In this example, it seems clear that the loss of information in using
the method of ranks is not very great. Indeed, on the basis of Table
IV alone, it would be difficult, if not impossible, to choose between the
two methods.
A comparison of the ranking of the 28 items by the size of F and by
the size of $\chi_r^2$ provides one further indication that the hypotheses
tested by the two methods are essentially the same except for the
inclusion of the normality assumption in that tested by the analysis
of variance. The rank difference correlation between $F_i$ and the
corresponding $\chi_r^2$ is .88; between $F_f$ and the corresponding $\chi_r^2$, .66.
Both correlations are very large in comparison with their standard
error of .19.
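The quoted standard error appears to follow from the variance of r' under the hypothesis of independence, which the paper later gives as 1/(p-1), p being the number of pairs of ranks. Taking the 28 items as the pairs (an assumption about how the figure was obtained, not something the text states), a worked check is:

```latex
\sigma_{r'} = \sqrt{\frac{1}{28 - 1}} = \frac{1}{\sqrt{27}} \approx 0.19
```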
2 $F_i$ is the ratio of the variance between income levels to the variance within classes. $F_f$ is the
ratio of the variance between family types to the variance within classes.
3 Expenditures include not only money expenses but also the imputed value of gifts in kind. For
food, the imputed value of home produced food, and for housing, the imputed value of the use of an
owned home, are also included.
4 The original data give the expenditures on the sub-groups of food only for a seven day period. The
remaining ratios in the table are all based on data for annual expenditures.
5 Fuel and light is, of course, but one of the sub-groups under household operation.
It should be noted that the illustrative comparison just presented
is to some extent weighted against the analysis of variance. The
distribution of expenditure data departs considerably from normality.²⁰
In addition, the analysis summarized in Tables I and II indicated
that the standard deviation of expenditures is related to the income
level; the assumption of uniform variance is, therefore, not justified.
However, the body of data analyzed represents no more extreme a
departure from the assumptions of normality and uniform variance than
is frequently met with.
THE RELATION BETWEEN THE DISTRIBUTION OF $\chi_r^2$ AND $\chi^2$
The statement was made above without proof that $\chi_r^2$ tends to be
distributed as $\chi^2$ with $p-1$ degrees of freedom. This statement requires
justification.
It is well known that the sum of the squares of m independent
observations drawn from a normal universe with unit variance and zero
mean is distributed according to the $\chi^2$ distribution with m degrees of
freedom. In the present instance, when the number of rows is not too
small, the mean ranks can be treated as observations from a normal
universe with a true mean $\tfrac{1}{2}(p+1)$. However, only $p-1$ of the p mean
ranks are independent, since the sum of the p mean ranks must equal
$\tfrac{1}{2}p(p+1)$. If $(p-1)$ of them were selected at random, the sum of the
squared deviations from the true mean of $\tfrac{1}{2}(p+1)$ would seem to be
distributed as $\chi^2$. However, to discard one of the mean ranks would
neglect some of the information; in addition, there is no criterion for
deciding which to discard. Instead we can compute the mean squared
deviation and multiply it by the number of degrees of freedom, $(p-1)$.
This gives²¹

$$\frac{p-1}{p}\sum_{j=1}^{p}\left[\bar r_j - \tfrac{1}{2}(p+1)\right]^2$$

as the numerator of $\chi_r^2$. The denominator must be $\sigma^2$, the variance of $\bar r$.
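As an illustration of the ratio just described, the sketch below computes $\chi_r^2$ from a small table of ranks. It assumes that the variance of a mean rank is $(p^2-1)/(12n)$, which follows from result (9) of the mathematical appendix for the mean of n independent ranks; the function name and the example table are hypothetical.

```python
from fractions import Fraction


def friedman_statistic(ranks):
    """Return chi_r^2 for an n-by-p table whose rows each hold the ranks 1..p."""
    n, p = len(ranks), len(ranks[0])
    true_mean = Fraction(p + 1, 2)
    # Mean rank of each column.
    col_means = [Fraction(sum(row[j] for row in ranks), n) for j in range(p)]
    # Numerator: (p - 1)/p times the sum of squared deviations from the true mean.
    numerator = Fraction(p - 1, p) * sum((m - true_mean) ** 2 for m in col_means)
    # Denominator: variance of a mean rank (assumed here to be (p^2 - 1)/(12 n)).
    variance = Fraction(p * p - 1, 12 * n)
    return numerator / variance


# Hypothetical table: four sets (rows) of three ranks (columns).
example = [[1, 2, 3], [1, 3, 2], [2, 1, 3], [1, 2, 3]]
print(friedman_statistic(example))
```

Exact fractions are used so that the result can be matched against tabulated values without rounding; for this example the statistic is 9/2, one of the attainable values for p = 3, n = 4.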


20 On the question of the effect of departure from normality on the analysis of variance, see Egon S.
Pearson, "The Analysis of Variance in Cases of Non-normal Variation," Biometrika, Vol. 23, 1931, and
T. Eden and F. Yates, "On the Validity of Fisher's z Test When Applied to an Actual Example of
Non-Normal Data," Journal of Agricultural Science, Vol. 23, 1933. The conclusion of both papers is that
moderate departure from normality does not seriously affect the analysis of variance.
21 By analogy with $\chi^2$ as ordinarily defined, the multiplier $(p-1)/p$ seems unnecessary. The
difference is this. In the ordinary case we have a sum of squares artificially lessened because the
deviations are computed from the observed rather than the true mean. Here, the observed mean is, of
necessity, equal to the true mean. We thus have the sum of p squared deviations from the true mean,
one of these, however, being essentially a duplication. This is evident when there are only two columns
and the two deviations must be equal in absolute value; it is less obvious when there are more than two
columns and the duplication is, as it were, spread among all of the deviations. A rigorous demonstration
that $(p-1)/p$ is the multiplier needed to correct for this duplication is provided by the proof in the
mathematical appendix that the $\chi_r^2$ distribution approaches that of $\chi^2$.

This statement, of course, is not a rigorous proof that the distribution
of $\chi_r^2$ approaches the distribution of $\chi^2$ as n increases. A rigorous
proof has, however, been provided by Dr. S. S. Wilks and is reproduced
in the mathematical appendix.
In addition, the exact values of the first three moments of $\chi_r^2$ have
been derived.²² The mean value is $p-1$; the variance, $2(p-1)(n-1)/n$;
and the third moment about the mean, $8(p-1)(n-1)(n-2)/n^2$. The
TABLE V
EXACT DISTRIBUTION OF $\chi_r^2$ FOR TABLES WITH FROM 2 TO 9 SETS OF THREE RANKS
(p = 3; n = 2, 3, 4, 5, 6, 7, 8, 9)
P is the probability of obtaining a value of $\chi_r^2$ as great as or greater than the corresponding value of $\chi_r^2$

      n = 2              n = 3              n = 4              n = 5
$\chi_r^2$    P     $\chi_r^2$    P     $\chi_r^2$    P     $\chi_r^2$    P

   0     1.000       0.000   1.000       0.0    1.000       0.0    1.000
   1      .833       0.667    .944       0.5     .931       0.4     .954
   3      .500       2.000    .528       1.5     .653       1.2     .691
   4      .167       2.667    .361       2.0     .431       1.6     .522
                     4.667    .194       3.5     .273       2.8     .367
                     6.000    .028       4.5     .125       3.6     .182
                                         6.0     .069       4.8     .124
                                         6.5     .042       5.2     .093
                                         8.0     .0046      6.4     .039
                                                            7.6     .024
                                                            8.4     .0085
                                                           10.0     .00077

      n = 6              n = 7               n = 8                n = 9
$\chi_r^2$    P     $\chi_r^2$    P      $\chi_r^2$    P       $\chi_r^2$    P

  0.00   1.000       0.000   1.000       0.00   1.000         0.000   1.000
  0.33    .956       0.286    .964       0.25    .967         0.222    .971
  1.00    .740       0.857    .768       0.75    .794         0.667    .814
  1.33    .570       1.143    .620       1.00    .654         0.889    .865
  2.33    .430       2.000    .486       1.75    .531         1.556    .569
  3.00    .252       2.571    .305       2.25    .355         2.000    .398
  4.00    .184       3.429    .237       3.00    .285         2.667    .328
  4.33    .142       3.714    .192       3.25    .236         2.889    .278
  5.33    .072       4.571    .112       4.00    .149         3.556    .187
  6.33    .052       5.429    .085       4.75    .120         4.222    .154
  7.00    .029       6.000    .052       5.25    .079         4.667    .107
  8.33    .012       7.143    .027       6.25    .047         5.556    .069
  9.00    .0081      7.714    .021       6.75    .038         6.000    .057
  9.33    .0055      8.000    .016       7.00    .030         6.222    .048
 10.33    .0017      8.857    .0084      7.75    .018         6.889    .031
 12.00    .00013    10.286    .0036      9.00    .0099        8.000    .019
                    10.571    .0027      9.25    .0080        8.222    .016
                    11.143    .0012      9.75    .0048        8.667    .010
                    12.286    .00032    10.75    .0024        9.556    .0060
                    14.000    .000021   12.00    .0011       10.667    .0035
                                        12.25    .00086      10.889    .0029
                                        13.00    .00026      11.556    .0013
                                        14.25    .000061     12.667    .00066
                                        16.00    .0000036    13.556    .00035
                                                             14.000    .00020
                                                             14.222    .000097
                                                             14.889    .000054
                                                             16.222    .000011
                                                             18.000    .0000006

corresponding values for the $\chi^2$ distribution with $p-1$ degrees of
freedom are $p-1$, $2(p-1)$, and $8(p-1)$, respectively. It follows that $\chi_r^2$
and $\chi^2$ always have the same mean value, and that the variance and
third moment of $\chi_r^2$ approach the variance and third moment of $\chi^2$ as
n increases.
22 I am indebted to Mr. William C. Shelton for the derivation of the mean value and for suggesting
the method of deriving the other moments.
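The three moment formulas can be checked by brute force for very small tables. The sketch below enumerates every possible table of ranks, treating all tables as equally likely (the hypothesis of homogeneity), and compares exact moments with the stated expressions. It uses the computational form $\chi_r^2 = 12\sum_j R_j^2 / [np(p+1)] - 3n(p+1)$, where $R_j$ is the sum of the ranks in column j; this is algebraically equivalent to the definition in terms of mean ranks, and the function names are illustrative only.

```python
from fractions import Fraction
from itertools import permutations, product


def chi_r2(rows, p):
    """chi_r^2 from the column sums of a table of ranks (n rows, p columns)."""
    n = len(rows)
    sums = [sum(row[j] for row in rows) for j in range(p)]
    return Fraction(12 * sum(s * s for s in sums), n * p * (p + 1)) - 3 * n * (p + 1)


def exact_moments(p, n):
    """Mean, variance and third central moment over all (p!)^n equally likely tables."""
    perms = list(permutations(range(1, p + 1)))
    values = [chi_r2(rows, p) for rows in product(perms, repeat=n)]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    third = sum((v - mean) ** 3 for v in values) / len(values)
    return mean, variance, third


def formula_moments(p, n):
    """The three moments as given by the formulas in the text."""
    return (
        Fraction(p - 1),
        Fraction(2 * (p - 1) * (n - 1), n),
        Fraction(8 * (p - 1) * (n - 1) * (n - 2), n * n),
    )


for p, n in [(3, 2), (3, 3), (3, 4), (4, 2)]:
    print(p, n, exact_moments(p, n) == formula_moments(p, n))
```

Full enumeration grows as $(p!)^n$, so this check is practical only for tables about as small as those shown.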
For the special case of p = 3, the exact distribution of $\chi_r^2$ has been
derived for n from 2 to 9; and for p = 4, for n equal to 2, 3 and 4.²³
Table V gives the distributions for p = 3, and Table VI for p = 4. These
distributions give some empirical indication of how rapidly the $\chi_r^2$
distribution approaches the $\chi^2$ distribution; in addition, they can be
used to make exact tests for small values of n and p.

TABLE VI
EXACT DISTRIBUTION OF $\chi_r^2$ FOR TABLES WITH FROM 2 TO 4 SETS OF FOUR RANKS
(p = 4; n = 2, 3, 4)
P is the probability of obtaining a value of $\chi_r^2$ as great as or greater than the corresponding
value of $\chi_r^2$

      n = 2              n = 3                        n = 4
$\chi_r^2$    P     $\chi_r^2$    P     $\chi_r^2$    P     $\chi_r^2$    P

  0.0   1.000        0.2   1.000        0.0   1.000        5.7    .141
  0.6    .958        0.6    .958        0.3    .992        6.0    .105
  1.2    .834        1.0    .910        0.6    .928        6.3    .094
  1.8    .792        1.8    .727        0.9    .900        6.6    .077
  2.4    .625        2.2    .608        1.2    .800        6.9    .068
  3.0    .542        2.6    .524        1.5    .754        7.2    .054
  3.6    .458        3.4    .446        1.8    .677        7.5    .052
  4.2    .375        3.8    .342        2.1    .649        7.8    .036
  4.8    .208        4.2    .300        2.4    .524        8.1    .033
  5.4    .167        5.0    .207        2.7    .508        8.4    .019
  6.0    .042        5.4    .175        3.0    .432        8.7    .014
                     5.8    .148        3.3    .389        9.3    .012
                     6.6    .075        3.6    .355        9.6    .0069
                     7.0    .054        3.9    .324        9.9    .0062
                     7.4    .033        4.5    .242       10.2    .0027
                     8.2    .017        4.8    .200       10.8    .0016
                     9.0    .0017       5.1    .190       11.1    .00094
                                        5.4    .158       12.0    .000072

The tables show that if we adopt .01 as a level of significance, then
for p = 3, it is impossible to obtain a significant value for n less than 4,
and for n = 4 only perfect consistency will yield a significant value; for
n = 5, two values will satisfy the criterion and for n = 6, four values.
If .05 is adopted as a level of significance, only perfect consistency is
significant for n = 3, while 2 values are significant for n = 4, and four
values for n = 5.
For p = 4, the .01 criterion cannot be satisfied for n = 2, is satisfied
by perfect consistency for n = 3, and by 6 values for n = 4. The .05
criterion is satisfied by one value for n = 2, three values for n = 3, and
11 values for n = 4.
23 These distributions were derived by the rather laborious process of building up the distribution
for each value of n from the distribution for the next smaller value. The labor involved increases greatly
as n increases, and even more rapidly as p increases.
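The process of building up the distribution one set of ranks at a time lends itself to a short program. The sketch below is one possible reading of that procedure, not the author's own computation: it tracks the number of tables giving each vector of column sums, adds one row at a time, and then converts to upper-tail probabilities of $\chi_r^2$. All names are hypothetical, and the computational form in terms of column sums $R_j$ is assumed equivalent to the definition of $\chi_r^2$.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations


def rank_sum_counts(p, n):
    """Count tables of n rows of ranks 1..p by their vector of column sums."""
    perms = list(permutations(range(1, p + 1)))
    counts = Counter({(0,) * p: 1})
    for _ in range(n):  # extend every table of the previous size by one more row
        step = Counter()
        for sums, c in counts.items():
            for perm in perms:
                step[tuple(s + r for s, r in zip(sums, perm))] += c
        counts = step
    return counts


def exact_tail(p, n):
    """Map each attainable chi_r^2 to the probability of a value at least as great."""
    counts = rank_sum_counts(p, n)
    total = sum(counts.values())
    by_value = Counter()
    for sums, c in counts.items():
        value = Fraction(12 * sum(s * s for s in sums), n * p * (p + 1)) - 3 * n * (p + 1)
        by_value[value] += c
    tail, running = {}, 0
    for value in sorted(by_value, reverse=True):
        running += by_value[value]
        tail[value] = Fraction(running, total)
    return tail


def significant_values(p, n, level):
    """Number of attainable values whose tail probability does not exceed level."""
    return sum(1 for prob in exact_tail(p, n).values() if prob <= level)


for n in (3, 4, 5, 6):
    print(n, significant_values(3, n, Fraction(1, 100)))
```

Working with column-sum vectors rather than whole tables keeps the number of states small, which is presumably what made the hand computation feasible at all; the counts printed can be set beside the statements in the text for p = 3.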
CHART 1
COMPARISON OF DISTRIBUTIONS OF $\chi_r^2$ AND $\chi^2$ FOR TWO DEGREES OF FREEDOM
[Chart not reproduced. Panel A: distributions of $\chi_r^2$ for 3×9 table and of $\chi^2$ with 2 degrees of
freedom; ordinate P, abscissa $\chi^2$. Panel B: tails of distributions of $\chi_r^2$ and of $\chi^2$ on logarithmic
scale; curves for $\chi^2$ with 2 degrees of freedom and for $\chi_r^2$ with 3×3, 3×5, 3×7, and 3×9 tables.]

The comparison of the $\chi_r^2$ distribution with the $\chi^2$ distribution is
shown in Chart 1 for p = 3 and in Chart 2 for p = 4. In making this
comparison it is necessary to allow for the discontinuity of the $\chi_r^2$
distribution. Only a discrete number of finite values of $\chi_r^2$ are possible while
$\chi^2$ is continuous. The probability associated with any $\chi_r^2$ in Tables V
and VI must thus be considered as corresponding to a value of $\chi^2$
intermediate between that value of $\chi_r^2$ and the immediately preceding
value. This intermediate value has been arbitrarily chosen as halfway
between the two values of $\chi_r^2$. It is these intermediate values which
form the abscissas of the points plotted in the charts.
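The halfway adjustment for discontinuity is simple to state in code. The sketch below pairs each tabulated probability with the midpoint between its value of $\chi_r^2$ and the preceding one; the function name is hypothetical and the sample figures are taken from the p = 3, n = 9 column of Table V.

```python
def adjusted_points(values, tail_probs):
    """Pair each tail probability with the midpoint of its value and the preceding value.

    The smallest tabulated value has no predecessor and is therefore skipped.
    """
    points = []
    for k in range(1, len(values)):
        midpoint = (values[k - 1] + values[k]) / 2
        points.append((midpoint, tail_probs[k]))
    return points


# A few successive entries for p = 3, n = 9.
values = [8.000, 8.222, 8.667, 9.556]
probs = [0.019, 0.016, 0.010, 0.0060]
for abscissa, probability in adjusted_points(values, probs):
    print(round(abscissa, 3), probability)
```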
Panel A of Chart 1 compares the $\chi_r^2$ distribution for a 3×9 table with
the $\chi^2$ distribution for 2 degrees of freedom. For convenience, the
distributions have been compared in cumulative form. The ordinate gives
the probability of securing a value of $\chi^2$ or $\chi_r^2$ as great as or greater
than the abscissa. The solid line gives the $\chi^2$ distribution, the dotted
line, the $\chi_r^2$ distribution. The agreement between the two distributions
is very close. The cumulative curve for the $\chi_r^2$ distribution tends to
be somewhat above that for the $\chi^2$ distribution for low values of $\chi_r^2$
and below it for high values. This is to be expected since $\chi_r^2$ must
be less than a fixed finite value (that corresponding to perfect
consistency) while $\chi^2$ is not so limited.
In utilizing tests of significance, it is the small values of P, i.e., one
of the tails of the distribution, with which we are ordinarily concerned.
In order to bring out more clearly the behavior of this part of the $\chi_r^2$
distribution, a logarithmic scale is used for the probability in Panel B
of Chart 1. This panel gives the cumulative distribution of $\chi_r^2$ to the
right of P = .10 for p = 3, and n = 3, 5, 7, 9; as well as the corresponding
portion of the $\chi^2$ distribution. The chart shows the tendency of this
portion of the $\chi_r^2$ curve to be below the $\chi^2$ curve. It also clearly
indicates the tendency for the $\chi_r^2$ distribution to approach the $\chi^2$
distribution and suggests that it does so fairly rapidly.
Panels A and B of Chart 2 are similar to the corresponding panels
of Chart 1, but relate to p = 4. Panel A compares the $\chi_r^2$ distribution
for a 4×4 table with the $\chi^2$ distribution for three degrees of freedom.
The agreement is good, although, because of the smaller value of n, the
discrepancies are somewhat greater than in Panel A of Chart 1. It
indicates the same consistent tendency for the $\chi_r^2$ distribution to be
above the $\chi^2$ distribution for small values of $\chi_r^2$ and below it for large
values. Panel B gives the portion of the cumulative distribution of $\chi_r^2$
to the right of P = .10, for p = 4 and n = 2, 3, 4, as well as the corresponding
portion of the $\chi^2$ distribution. Once again the vertical scale is
logarithmic. The tendency for the $\chi_r^2$ distribution to approach the $\chi^2$
distribution very rapidly as n increases is plain.
CHART 2
COMPARISON OF DISTRIBUTIONS OF $\chi_r^2$ AND $\chi^2$ FOR THREE DEGREES OF FREEDOM
[Chart not reproduced. Panel A: distributions of $\chi_r^2$ for 4×4 table and of $\chi^2$ with 3 degrees of
freedom; ordinate P. Panel B: tails of distributions of $\chi_r^2$ and of $\chi^2$ on logarithmic scale; curves
for $\chi^2$ with 3 degrees of freedom and for $\chi_r^2$ with 4×2, 4×3, and 4×4 tables.]

The tendency for the values of $\chi_r^2$ adjusted for discontinuity to be
less than $\chi^2$ for small probabilities suggests that any errors resulting
from using the $\chi^2$ distribution as an approximation to the $\chi_r^2$
distribution are likely to be in the proper direction; that is, the significance of
results will be understated rather than exaggerated. This tendency
toward under-statement is compensated (indeed, in some cases
over-compensated) by the fact that the values of $\chi_r^2$ which can be observed
(i.e., the values of $\chi_r^2$ not adjusted for discontinuity) are always greater
than the adjusted values. This factor is of minor significance, however,
since the number of possible values of $\chi_r^2$ increases very rapidly, and
hence the interval between them decreases very rapidly, as either p
or n increases. Even for p and n both as small as 4, the difference
between the adjusted and unadjusted values of $\chi_r^2$ is, for practical
purposes, negligible. It is .15 in all but four cases, .3 in three of these, and
.45 in the remaining one.
A comparison of the $\chi^2$ and $\chi_r^2$ distributions at the critical points
sheds further light on this problem. For p = 3 the value of $\chi^2$
corresponding to P = .05 is 5.991. From Table V, the nearest value which
$\chi_r^2$ can have for p = 3, n = 9 is 6, and this has a probability of .057
associated with it. Thus, by using the $\chi^2$ distribution we should be led
to overestimate slightly the significance of a value of $\chi_r^2$ = 6. The next
higher value of $\chi_r^2$ is 6.22, and its significance we should estimate
properly, since the probability associated with it is .048. The value of $\chi^2$
corresponding to P = .01 is 9.21. From Table V, the nearest values
which $\chi_r^2$ can have are 8.67, with a probability of .0103, and 9.55 with
a probability of .0060. In this case, the use of the $\chi^2$ distribution would
yield the correct results; 8.67 would be attributed a probability greater
than .01 and 9.55 one less than .01.
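For two degrees of freedom the upper tail of the $\chi^2$ distribution has the closed form $e^{-x/2}$, so the comparison at the critical points can be reproduced with the standard library alone. The sketch below sets the exact probabilities quoted in the text beside the $\chi^2$ approximation; the function name is illustrative.

```python
import math


def chi2_tail_2df(x):
    """P(chi^2 >= x) for two degrees of freedom, which equals exp(-x / 2)."""
    return math.exp(-x / 2)


# (value of chi_r^2, exact probability) pairs quoted in the text for p = 3, n = 9.
quoted = [(6.0, 0.057), (6.22, 0.048), (8.67, 0.0103), (9.55, 0.0060)]
for value, exact in quoted:
    approx = chi2_tail_2df(value)
    print(value, exact, round(approx, 4))
```

The same closed form recovers the critical values used in the text, since $-2\ln(.05) \approx 5.991$ and $-2\ln(.01) \approx 9.21$.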
For p = 4, the value of $\chi^2$ corresponding to P = .05 is 7.815. The
nearest values of $\chi_r^2$ for p = 4 and n = 4, as given in Table VI, are 7.5 with
a probability of .052, and 7.8 with a probability of .036. The .01 value
of $\chi^2$ is 11.341. From the table, 9.3 has a probability of .012, 9.6 of
.0069, and 11.1 of .00094. Here, the use of $\chi^2$ would in each case
understate the significance of $\chi_r^2$.
While no definitive conclusions can be drawn from these comparisons
they suggest that for p = 3, the use of the $\chi^2$ distribution is likely to
give sufficiently accurate results for n greater than 9; while for p = 4,
the use of the $\chi^2$ distribution is likely to understate the significance of
large values of $\chi_r^2$ unless n is somewhat larger than 4. In view of the
apparent rapidity with which the $\chi_r^2$ distribution approaches the $\chi^2$
distribution when p = 4, it seems reasonable that for n equal to or
greater than 6, the $\chi^2$ distribution will give sufficiently accurate results.
For p greater than 4 it is more difficult to make any general statement;
but it seems safe to say that the $\chi^2$ distribution will give fairly accurate
results for n equal to or greater than 6.²⁴ A procedure that seems
applicable when p is quite large and n less than 6 is discussed below.

RELATION BETWEEN $\chi_r^2$ AND THE RANK DIFFERENCE CORRELATION

When only two sets of ranks are available, the appropriate measure
of relationship is the rank difference correlation coefficient, r'. This
coefficient is computed by the usual product-moment formula with
the ranks serving as the variables, or by the equivalent, but more
convenient, formula

$$r' = 1 - \frac{6\sum d^2}{p^3 - p}$$

where d is the difference between two paired values, and p, as above, is
the number of pairs of ranks.²⁵
For n = 2, $\chi_r^2$ uses the same data as the rank difference correlation
and is designed to test the same hypothesis. The two are, therefore,
essentially equivalent. It is shown in the mathematical appendix that
the relation between them is

$$\chi_r^2 = (p-1)(1+r').$$

For n = 2 testing the significance of r' is thus equivalent to testing the
significance of $\chi_r^2$.
Under the hypothesis of homogeneity, the mean value of r' is zero,
and its variance, $1/(p-1)$.²⁶ It follows from the last equation that,
for n = 2, both the mean value and variance of $\chi_r^2$ are $(p-1)$. These
results agree, of course, with the more general formulae given above.
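The relation for n = 2 is easy to confirm numerically. The sketch below computes r' from the rank-difference formula and $\chi_r^2$ from the column sums of a two-row table, then compares $\chi_r^2$ with $(p-1)(1+r')$. The function names and the pair of rankings are hypothetical, and the column-sum form of $\chi_r^2$ is assumed equivalent to its definition.

```python
from fractions import Fraction


def rank_difference_correlation(first, second):
    """r' = 1 - 6 * sum(d^2) / (p^3 - p) for two rankings of the same p items."""
    p = len(first)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(first, second))
    return 1 - Fraction(6 * sum_d2, p ** 3 - p)


def chi_r2_two_sets(first, second):
    """chi_r^2 for a table consisting of exactly two sets of ranks (n = 2)."""
    p = len(first)
    sums = [a + b for a, b in zip(first, second)]
    return Fraction(12 * sum(s * s for s in sums), 2 * p * (p + 1)) - 6 * (p + 1)


first = [1, 2, 3, 4, 5]
second = [2, 1, 4, 3, 5]
r_prime = rank_difference_correlation(first, second)
chi = chi_r2_two_sets(first, second)
print(r_prime, chi, chi == (len(first) - 1) * (1 + r_prime))
```

For these two rankings r' is 4/5 and $\chi_r^2$ is 36/5, in agreement with the relation.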

THE APPROACH TO NORMALITY

Hotelling and Pabst have shown that r' tends to become normally
distributed as p increases. It follows that for n = 2, $\chi_r^2$ tends also to
become normally distributed as p increases.
When n is large the distribution of $\chi_r^2$ approaches that of $\chi^2$ and the
latter approaches normality as the number of degrees of freedom
increases.
24 It is worth recalling that the rapidity with which the variance of the $\chi_r^2$ distribution approaches
the variance of the $\chi^2$ distribution depends solely on n and not at all on p. On the other hand, the
number of distinct values of $\chi_r^2$ depends on both p and n.
25 This is the usual notation except that the number of pairs of ranks is ordinarily designated as n.
The present notation is used in order to preserve consistency with the preceding analysis.
26 Hotelling and Pabst, op. cit., p. 36.

Since, for the smallest value of n as well as for large values, the
distribution of $\chi_r^2$ tends to normality as p increases, it seems reasonable
to assume that for intermediate values of n it behaves similarly.
Assuming this to be the case, then for small values of n and large values
of p the significance of $\chi_r^2$ can be tested by considering

$$\frac{\chi_r^2 - (p-1)}{\sqrt{2(p-1)\,\dfrac{n-1}{n}}}$$

as a normally distributed variate with zero mean and unit standard
deviation.
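The standardized variate suggested for small n and large p can be computed as follows. This is only a sketch: the function name and the sample figures are hypothetical, and the upper-tail probability is taken from the normal curve by way of the complementary error function.

```python
import math


def normal_approximation(chi_r2, p, n):
    """Standardize chi_r^2 by its exact mean and variance; return (z, upper tail)."""
    mean = p - 1
    variance = 2 * (p - 1) * (n - 1) / n
    z = (chi_r2 - mean) / math.sqrt(variance)
    # Upper-tail probability of a unit normal variate.
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2))
    return z, upper_tail


# Hypothetical case: 20 ranks in each of 3 sets, with an observed chi_r^2 of 35.
z, tail = normal_approximation(chi_r2=35.0, p=20, n=3)
print(round(z, 3), tail)
```

As the text goes on to say, how quickly this approximation becomes adequate as p grows is an open question, so any probability obtained this way is tentative.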
Further study is clearly needed on this point, both in order to obtain
a rigorous proof that for small values of n the $\chi_r^2$ distribution tends to
normality as p increases, and also to determine the rapidity with which
it approaches normality.
CONCLUSION

The method of ranks is a method which can be applied to data
classified by two (or more) criteria to determine whether the factors
used as criteria of classification have a significant influence on the
variate classified. Stated differently, the method tests the hypothesis
that the values of the variate corresponding to each subdivision by one
of the factors are homogeneous, i.e., from the same universe. The
method uses solely information on "order" and makes no use of the
quantitative values of the variate as such. For this reason no
assumption need be made as to the nature of the underlying universe or as to
whether the different sets of values come from similar universes. The
method is thus applicable to a wide class of problems to which the
analysis of variance, designed to test a similar hypothesis, cannot validly
be applied.
The basic step in the application of the method of ranks is the
computation of a statistic, $\chi_r^2$, from a table of ranks. The sampling
distribution of this statistic approaches the $\chi^2$ distribution as the number of
sets of ranks increases. When the number of sets of ranks is moderately
large (say greater than 5 for four or more ranks) the significance of
$\chi_r^2$ can be tested by reference to the available $\chi^2$ tables. When the
number of ranks in each set is 3, and the number of sets 9 or less, or
when the number of ranks in each set is 4, and the number of sets 4
or less, the significance of $\chi_r^2$ can be tested by reference to the exact
tables given above. When, however, both the number of ranks and the
number of sets of ranks are very small, it is impossible to obtain
significant results.
When the number of ranks is large, but the number of sets of ranks
small, there is reason to suppose (though no rigorous proof is
available) that $\chi_r^2$ is normally distributed about a mean of $p-1$ and with
a variance of $2(p-1)(n-1)/n$, where p is the number of ranks and
n the number of sets of ranks. In such cases, then, the significance of
$\chi_r^2$ can be tested by reference to tables of the normal curve.
The theoretical discussion of the efficiency of the method of ranks
relative to the analysis of variance indicates that in situations when
the latter method can validly be applied and when the number of sets
of ranks is large the maximum loss of information through using the
analysis of ranks is 36 per cent. The minimum loss is probably 9 per
cent. The amount of information lost appears to be greatest when there
are only two ranks in each set, and decreases as the number of ranks
increases.
The application of the two methods to the same body of data
provides further evidence as to their relative efficiency. The data employed
were classified into five groups by one of the factors and into seven by
the other. The results suggested that in this instance the loss of
information through using the method of ranks was not very great, that
both methods tended to yield the same result.
The method of ranks requires less than one-fourth as much time
as the analysis of variance. In the light of the conclusions just stated
concerning their relative efficiency, this suggests that even though the
assumptions necessary for the latter method are known to be satisfied,
if the problem of computation is a serious one, the method of ranks
might profitably be used as an alternative to the analysis of variance
or, at least, as a preliminary method to suggest fruitful hypotheses
which might then be more accurately tested by the analysis of variance.
MATHEMATICAL APPENDIX
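1. Proof that the $\chi_r^2$ distribution approaches the $\chi^2$ distribution as n
increases.²⁷
Let $r_{ij}$ = the rank in the i-th row and j-th column $(i = 1, \ldots, n;\ j = 1, \ldots, p)$,

$$r'_{ij} = r_{ij} - \tfrac{1}{2}(p+1) \tag{1}$$

and

$$\bar r'_j = \frac{1}{n}\sum_{i=1}^{n} r'_{ij}. \tag{2}$$

The characteristic function of the quantities $\bar r'_j$ $(j = 1, \ldots, p-1)$ is
given by

$$\phi = E\left(\exp\,\mathrm{i}\sum_{j=1}^{p-1}\theta_j\,\bar r'_j\right) \tag{3}$$

where E stands for expected value (the upright $\mathrm{i}$ in the exponent denotes $\sqrt{-1}$,
as distinct from the row index i). Only $p-1$ of the $\bar r'_j$'s are included
because $\bar r'_p$ is expressible in terms of $\bar r'_1, \ldots, \bar r'_{p-1}$. This in turn
follows from the fact that, for each value of i, $r_{ij}$ takes all the values from
1 to p as j varies. Substituting from (2) into (3) we have

$$\phi = E\left(\exp\,\frac{\mathrm{i}}{n}\sum_{i=1}^{n}\sum_{j=1}^{p-1}\theta_j\,r'_{ij}\right). \tag{4}$$

27 This proof is adapted from one given by Dr. S. S. Wilks in a letter to the author.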
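Since the set of ranks in each row is independent of the set of ranks
in any other row

$$\phi = \left[E\left(\exp\,\frac{\mathrm{i}}{n}\sum_{j=1}^{p-1}\theta_j\,r'_j\right)\right]^n \tag{5}$$

where $r'_j$ stands for any of the sets of $r'_{ij}$. Expanding,

$$\phi = \left\{E\left[1 + \frac{\mathrm{i}}{n}\sum_{j=1}^{p-1}\theta_j r'_j - \frac{1}{2n^2}\left(\sum_{j=1}^{p-1}\theta_j r'_j\right)^2 + \frac{1}{n^3}R'\right]\right\}^n \tag{6}$$

or

$$\phi = \left\{E\left[1 + \frac{\mathrm{i}}{n}\sum_{j=1}^{p-1}\theta_j r'_j - \frac{1}{2n^2}\left(\sum_{j=1}^{p-1}\theta_j^2\, r'^{2}_j + 2\sum_{j=1}^{p-2}\sum_{j'=j+1}^{p-1}\theta_j\theta_{j'}\, r'_j r'_{j'}\right) + \frac{1}{n^3}R'\right]\right\}^n. \tag{7}$$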

But since $r'_j$ takes all of the p values differing by unity from
$-\tfrac{1}{2}(p-1)$ to $\tfrac{1}{2}(p-1)$ with equal probability

$$E\,r'_j = 0, \tag{8}$$

$$E\,r'^{2}_j = \frac{1}{p}\sum_{r'_j=-(p-1)/2}^{(p-1)/2} r'^{2}_j = (p^2-1)/12. \tag{9}$$

Further

$$E\,r'_j r'_{j'} = -(p+1)/12 \tag{10}$$

since

$$\left(\sum_{r'_j=-(p-1)/2}^{(p-1)/2} r'_j\right)^2 = \sum_{r'_j=-(p-1)/2}^{(p-1)/2} r'^{2}_j + 2\sum_{r'_j=-(p-1)/2}^{(p-3)/2}\ \sum_{r'_{j'}=r'_j+1}^{(p-1)/2} r'_j r'_{j'} = 0 \tag{11}$$

and hence

$$p(p-1)\,E\,r'_j r'_{j'} = -p\,E\,r'^{2}_j = -p(p^2-1)/12. \tag{12}$$
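Results (8), (9) and (10) can be confirmed for small p by averaging over every permutation of the ranks. The following is a brief sketch; the function name is illustrative, and the first two columns are taken as representative of any pair of distinct columns.

```python
from fractions import Fraction
from itertools import permutations


def rank_deviation_moments(p):
    """E r'_j, E r'_j^2 and E r'_j r'_j' over all equally likely rankings of 1..p."""
    perms = list(permutations(range(1, p + 1)))
    center = Fraction(p + 1, 2)
    mean = sum(perm[0] - center for perm in perms) / len(perms)
    second = sum((perm[0] - center) ** 2 for perm in perms) / len(perms)
    cross = sum((perm[0] - center) * (perm[1] - center) for perm in perms) / len(perms)
    return mean, second, cross


for p in (3, 4, 5):
    mean, second, cross = rank_deviation_moments(p)
    print(p, mean, second, cross)
```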
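Using these results in (7) gives

$$\phi = \left\{1 - \frac{1}{2n^2}\left[\frac{p^2-1}{12}\sum_{j=1}^{p-1}\theta_j^2 - 2\,\frac{p+1}{12}\sum_{j=1}^{p-2}\sum_{j'=j+1}^{p-1}\theta_j\theta_{j'}\right] + \frac{1}{n^3}R\right\}^n \tag{13}$$

where R is a bounded function of $p, r'_1, \ldots, r'_{p-1}$ for n = 1, 2, 3, ....
Allowing n to approach infinity, we have

$$\phi \to \exp\left\{-\frac{p^2-1}{24n}\left(\sum_{j=1}^{p-1}\theta_j^2 - \frac{2}{p-1}\sum_{j=1}^{p-2}\sum_{j'=j+1}^{p-1}\theta_j\theta_{j'}\right)\right\}. \tag{14}$$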
This, however, is the characteristic function for a multivariate normal
distribution. It follows that $\bar r'_1, \ldots, \bar r'_{p-1}$ are asymptotically
normally distributed with a matrix of variances and covariances given by
the matrix of the quadratic in $\theta_1, \ldots, \theta_{p-1}$.
Taking the reciprocal of the matrix of the $\theta$'s and associating it
with the $\bar r'_j$'s we have as the distribution function of the $\bar r'_j$'s:

$$C\exp\left\{-\frac{1}{2}\,\frac{12n}{p(p+1)}\left(2\sum_{j=1}^{p-1}\bar r'^{2}_j + 2\sum_{j=1}^{p-2}\sum_{j'=j+1}^{p-1}\bar r'_j\,\bar r'_{j'}\right)\right\}\,d\bar r'_1\,d\bar r'_2\cdots d\bar r'_{p-1} \tag{15}$$

where C is a constant.
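The step of taking the reciprocal matrix can be filled in as follows. This is a sketch in notation that is not the paper's: I is the identity matrix and J the matrix of ones, both of order p-1, V is the matrix of variances and covariances read off from the quadratic in the $\theta$'s, and x is the column vector of the $\bar r'_j$.

```latex
\begin{aligned}
V &= \frac{p+1}{12n}\,(pI - J), \qquad J^2 = (p-1)J, \\
(pI - J)(I + J) &= pI + pJ - J - (p-1)J = pI
  \quad\Longrightarrow\quad
  V^{-1} = \frac{12n}{p(p+1)}\,(I + J), \\
x^{\top}(I + J)\,x &= \sum_{j=1}^{p-1} x_j^2 + \Bigl(\sum_{j=1}^{p-1} x_j\Bigr)^2
  = 2\sum_{j=1}^{p-1} x_j^2 + 2\sum_{j=1}^{p-2}\sum_{j'=j+1}^{p-1} x_j x_{j'} .
\end{aligned}
```

The exponent of the normal density is $-\tfrac{1}{2}x^{\top}V^{-1}x$, which gives the quadratic form appearing in (15).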
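Since $\sum_{j=1}^{p}\bar r'_j = 0$ it follows that $\bar r'_p = -\sum_{j=1}^{p-1}\bar r'_j$ and hence

$$\bar r'^{2}_p = \left(\sum_{j=1}^{p-1}\bar r'_j\right)^2 = \sum_{j=1}^{p-1}\bar r'^{2}_j + 2\sum_{j=1}^{p-2}\sum_{j'=j+1}^{p-1}\bar r'_j\,\bar r'_{j'}. \tag{16}$$

Substituting (16) in the exponent of (15) we have, finally, for the
distribution function

$$C\exp\left\{-\frac{1}{2}\,\frac{12n}{p(p+1)}\sum_{j=1}^{p}\bar r'^{2}_j\right\}d\bar r'_1\,d\bar r'_2\cdots d\bar r'_{p-1} = C\exp\left(-\tfrac{1}{2}\chi_r^2\right)d\bar r'_1\,d\bar r'_2\cdots d\bar r'_{p-1}, \tag{17}$$

by the definition of $\chi_r^2$. It follows that for n large $\chi_r^2$ is distributed like
$\chi^2$ with $p-1$ degrees of freedom.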
2. Derivation of the exact moments²⁸ of $\chi_r^2$
By definition

$$\chi_r^2 = \frac{12n}{p(p+1)}\sum_{j=1}^{p}\bar r'^{2}_j = \frac{12n}{p(p+1)}\,\frac{1}{n^2}\sum_{j=1}^{p}\left(\sum_{i=1}^{n} r'_{ij}\right)^2. \tag{18}$$

28 The derivation of the mean value of $\chi_r^2$ is adapted from one communicated to the author by
Mr. William C. Shelton, who also suggested the method used for deriving the other moments. The
method employed is essentially that developed by J. Splawa-Neyman, "Contributions to the Theory
of Small Samples Drawn from a Finite Population," Biometrika, Vol. 17, 1925, pages 472-79.
Expanding, replacing $\sum_{j=1}^{p}\sum_{i=1}^{n} r'^{2}_{ij}$ by its value $np(p^2-1)/12$, and
rearranging the order of the summation signs we have

$$\chi_r^2 = (p-1) + \frac{24}{p(p+1)n}\sum_{i=1}^{n-1}\sum_{i'=i+1}^{n}\sum_{j=1}^{p} r'_{ij}\,r'_{i'j}. \tag{19}$$

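Taking the expected value of both sides gives

$$E\chi_r^2 = (p-1) + \frac{24}{p(p+1)n}\sum_{i=1}^{n-1}\sum_{i'=i+1}^{n}\sum_{j=1}^{p} E(r'_{ij}\,r'_{i'j}). \tag{20}$$

But since one row of the table of ranks is entirely independent of any
other row, $E\,r'_{ij}\,r'_{i'j} = E\,r'_{ij}\;E\,r'_{i'j} = 0$.

$$E\chi_r^2 = p-1. \tag{21}$$

From (19) and (21)

$$\chi_r^2 - E\chi_r^2 = \frac{24}{p(p+1)n}\sum_{i=1}^{n-1}\sum_{i'=i+1}^{n}\sum_{j=1}^{p} r'_{ij}\,r'_{i'j}. \tag{22}$$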

The k-th moment of $\chi_r^2$ about its mean value can therefore be
obtained by evaluating the expected value of the k-th power of the right
hand side of (22).
To determine the variance of $\chi_r^2$ first note that $\sum_{j=1}^{p} r'_{ij}\,r'_{i'j}$ is
independent of $\sum_{j=1}^{p} r'_{ij}\,r'_{i''j}$. This can be proved by multiplying the two
expressions. The expected value of the resultant product is easily
shown to be zero. Likewise, $\sum_{j=1}^{p} r'_{ij}\,r'_{i'j}$ is independent of $\sum_{j=1}^{p} r'_{i''j}\,r'_{i'''j}$.
It follows that

$$E\left(\sum_{i=1}^{n-1}\sum_{i'=i+1}^{n}\sum_{j=1}^{p} r'_{ij}\,r'_{i'j}\right)^2 = \sum_{i=1}^{n-1}\sum_{i'=i+1}^{n} E\left(\sum_{j=1}^{p} r'_{ij}\,r'_{i'j}\right)^2. \tag{23}$$
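But

$$\begin{aligned}
E\left(\sum_{j=1}^{p} r'_{ij}\,r'_{i'j}\right)^2
&= E\left(\sum_{j=1}^{p} r'^{2}_{ij}\,r'^{2}_{i'j} + 2\sum_{j=1}^{p-1}\sum_{j'=j+1}^{p} r'_{ij}\,r'_{i'j}\,r'_{ij'}\,r'_{i'j'}\right) \\
&= \sum_{j=1}^{p} E\,r'^{2}_{ij}\;E\,r'^{2}_{i'j} + 2\sum_{j=1}^{p-1}\sum_{j'=j+1}^{p} E(r'_{ij}\,r'_{ij'})\,E(r'_{i'j}\,r'_{i'j'}).
\end{aligned} \tag{24}$$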
Substituting (9) and (10) into (24) and the resultant expression into
(23) gives

$$E\left(\sum_{i=1}^{n-1}\sum_{i'=i+1}^{n}\sum_{j=1}^{p} r'_{ij}\,r'_{i'j}\right)^2 = \frac{n(n-1)}{2}\cdot\frac{p^2(p-1)(p+1)^2}{12^2}. \tag{25}$$

Multiplying this by $[p(p+1)n/24]^{-2}$ gives, finally, the variance of
$\chi_r^2$

$$\sigma^2_{\chi_r^2} = 2\,\frac{n-1}{n}\,(p-1). \tag{26}$$
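The arithmetic behind (25) and (26) can be written out explicitly. The following worked step uses only (9), (10), (23) and (24), together with the fact that the double sum over i and i' contains n(n-1)/2 terms:

```latex
\begin{aligned}
E\Bigl(\sum_{j=1}^{p} r'_{ij}\,r'_{i'j}\Bigr)^2
  &= p\Bigl(\frac{p^2-1}{12}\Bigr)^2 + p(p-1)\Bigl(\frac{p+1}{12}\Bigr)^2
   = \frac{p\,(p+1)^2}{144}\bigl[(p-1)^2 + (p-1)\bigr]
   = \frac{p^2(p-1)(p+1)^2}{144}, \\[4pt]
\sigma^2_{\chi_r^2}
  &= \Bigl[\frac{24}{p(p+1)n}\Bigr]^2 \cdot \frac{n(n-1)}{2} \cdot \frac{p^2(p-1)(p+1)^2}{144}
   = \frac{2(n-1)(p-1)}{n}.
\end{aligned}
```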
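To determine the third moment of $\chi_r^2$ about its mean note that the
only term in the expansion of

$$\left(\sum_{i=1}^{n-1}\sum_{i'=i+1}^{n}\sum_{j=1}^{p} r'_{ij}\,r'_{i'j}\right)^3$$

whose expected value is not zero is

$$6\sum_{i=1}^{n-2}\sum_{i'=i+1}^{n-1}\sum_{i''=i'+1}^{n}\left[\sum_{j=1}^{p} r'_{ij}\,r'_{i'j}\;\sum_{j=1}^{p} r'_{ij}\,r'_{i''j}\;\sum_{j=1}^{p} r'_{i'j}\,r'_{i''j}\right]. \tag{27}$$

Expanding the expression in brackets, and grouping terms that have the same
expected value, gives

$$\begin{aligned}
&\sum_{j=1}^{p} r'^{2}_{ij}\,r'^{2}_{i'j}\,r'^{2}_{i''j}
 + 3\cdot 2\sum_{j=1}^{p-1}\sum_{j'=j+1}^{p} r'^{2}_{ij}\;r'_{i'j}\,r'_{i'j'}\;r'_{i''j}\,r'_{i''j'} \\
&\quad + 6\sum_{j=1}^{p-2}\sum_{j'=j+1}^{p-1}\sum_{j''=j'+1}^{p} r'_{ij}\,r'_{ij'}\;r'_{i'j}\,r'_{i'j''}\;r'_{i''j'}\,r'_{i''j''}.
\end{aligned} \tag{28}$$

Taking expected values, and substituting from (9) and (10) gives
$p^3(p-1)(p+1)^3/12^3$ as the expected value of (28). Substituting this
in (27) gives

$$E\left(\sum_{i=1}^{n-1}\sum_{i'=i+1}^{n}\sum_{j=1}^{p} r'_{ij}\,r'_{i'j}\right)^3 = 6\,\frac{n(n-1)(n-2)}{1\cdot 2\cdot 3}\cdot\frac{p^3(p-1)(p+1)^3}{12^3}. \tag{29}$$

Multiplying this by $[p(p+1)n/24]^{-3}$ gives as the third moment of
$\chi_r^2$ about its mean

$$\frac{8(p-1)(n-1)(n-2)}{n^2}. \tag{30}$$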
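The expected value of the bracketed product can be checked by counting index patterns. In the sketch below the abbreviations a and b are introduced only for this check and are not the paper's notation; a is the value in (9) and b the value in (10).

```latex
\begin{aligned}
a &= \frac{p^2-1}{12} = -(p-1)\,b, \qquad b = -\frac{p+1}{12}, \\[4pt]
p\,a^3 + 3p(p-1)\,a\,b^2 + p(p-1)(p-2)\,b^3
  &= p(p-1)\,b^3\bigl[-(p-1)^2 - 3(p-1) + (p-2)\bigr]
   = -p^3(p-1)\,b^3
   = \frac{p^3(p-1)(p+1)^3}{12^3}, \\[4pt]
\Bigl[\frac{24}{p(p+1)n}\Bigr]^3 \cdot n(n-1)(n-2) \cdot \frac{p^3(p-1)(p+1)^3}{12^3}
  &= \frac{8(p-1)(n-1)(n-2)}{n^2}.
\end{aligned}
```

The three terms on the left of the second line correspond to all three column indices equal, exactly two equal, and all distinct.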
3. Derivation of relationship between $\chi_r^2$ and rank difference correlation
(r') when n = 2.
From (19) we have, for n = 2

$$\chi_r^2 = (p-1) + \frac{12}{p(p+1)}\sum_{j=1}^{p} r'_{1j}\,r'_{2j}. \tag{31}$$

But, using the product moment formula, the rank difference correlation
coefficient is defined as²⁹

$$r' = \frac{\sum_{j=1}^{p} r'_{1j}\,r'_{2j}}{\sqrt{\sum_{j=1}^{p} r'^{2}_{1j}\,\sum_{j=1}^{p} r'^{2}_{2j}}} = \frac{12\sum_{j=1}^{p} r'_{1j}\,r'_{2j}}{p(p^2-1)}. \tag{32}$$

Substituting in (31) this gives

$$\chi_r^2 = (p-1)(1 + r') \tag{33}$$

as the relation between $\chi_r^2$ and the rank difference correlation
coefficient.
29 The notation used in (32) may be somewhat confusing. The symbol r' which stands for the rank
difference correlation coefficient is to be distinguished from $r'_{ij}$ which stands for the deviation of a
rank from its expected value.
