
Statistical Power Analysis
Author(s): Jacob Cohen
Source: Current Directions in Psychological Science, Vol. 1, No. 3 (Jun., 1992), pp. 98-101
Published by: Sage Publications, Inc. on behalf of Association for Psychological Science
Stable URL: http://www.jstor.org/stable/20182143

98 VOLUME 1, NUMBER 3, JUNE 1992


Statistical Power Analysis

Jacob Cohen

Jacob Cohen, Professor of Psychology at New York University, is the author of Statistical Power Analysis for the Behavioral Sciences (2nd ed., 1988) and co-author with Patricia Cohen of Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences (2nd ed., 1983), both published by Lawrence Erlbaum Associates. Address correspondence to Jacob Cohen, Department of Psychology, New York University, 6 Washington Place, 5th Floor, New York, NY 10003.

Current Directions in Psychological Science. Published by Cambridge University Press.

The power of a statistical test of a null hypothesis (H0) is the probability that the H0 will be rejected when it is false, that is, the probability of obtaining a statistically significant result. Statistical power depends on the significance criterion (α), the sample size (N), and the population effect size (ES).

The importance of power analysis arises from the fact that most empirical research in the social and behavioral sciences proceeds by formulating and testing H0s that the investigators hope to reject as a means of establishing facts about the phenomena under study.

A typical H0 is that a population product-moment correlation, r, is zero, to be tested at the two-sided (α2 =) .05 level. When this H0 is tested on a sample of N cases randomly drawn from a population in which r indeed does equal zero, researchers risk mistakenly rejecting the H0 when it is true, a Type I error, whose rate (.05) is controlled by the α criterion. They also risk mistakenly accepting the H0 as tenable when it is false, a Type II error, whose probability is called β. Power is thus 1 - β, the probability of not accepting the H0 when it is false, that is, the probability of successfully rejecting the H0.

The outcome of a statistical test depends on the degree to which the H0 is false, that is, on the magnitude of the population ES, which in this case is the absolute size of the population r: the larger the r, the greater the likelihood that the H0 will be rejected. It is also true that the outcome depends on N, a larger sample being more likely to result in rejection of a false H0 than a smaller one. Thus, at α2 = .05, for example, if the population r is .30, when N is 40, the power of the standard t test of a sample r turns out to equal .48, whereas when N is 80, power is .78. If the population r is .40, when N is 40, power is .74, but when N is 80, power is .96. Finally, the test outcome depends also on α, the risk of a Type I error. A smaller and therefore more stringent α criterion, say, α2 = .01, for any given population r and N, would result in smaller power. For example, with population r = .30 and N = 80, while power at α2 = .05 is .78, power at α2 = .01 is only .56.[1] Note also that at any given value of α, a two-sided test is more stringent than a one-sided test.

Statistical power analysis exploits the mathematical relationship among these four variables in statistical inference: power, α, N, and ES. The relationship is such that when any three of them are fixed, the fourth is determined. Two forms of power analysis are most useful: One is the determination of the N that is necessary to attain a specified degree of power to detect as significant (at specified α) a hypothesized ES. This form of power analysis is used in research planning. The second is the determination of power to detect a hypothesized ES (for specified N and α), the form used in meta-analytic power reviews of research areas or journals.

EFFECT SIZE

I noted that in testing a sample r, the ES is simply the population r. More generally, in the Neyman-Pearson system of statistical induction,[2] whence the concept of power is derived, the ES is the discrepancy between the null hypothesis, H0, and the alternate hypothesis of interest, H1. For testing a sample r, the H0 is that the population r is zero, and the H1 posits a specific nonzero value, for example, .30. Thus, the ES in this example is simply the difference: .30 - .00. Every statistical test has its own ES index, a continuous value that runs from zero,
when the H0 is true. Each ES index is a pure (i.e., scale-free) value that measures, in terms appropriate to it, the discrepancy between the H0 and the H1.

For example, the ES index for the difference between independent means in the classical t test is d, the difference between the population means standardized by dividing this difference by the common within-population standard deviation. (The difference is absolute for two-sided tests and is either positive or negative for one-sided tests.) The standardization results in a scale-free measure: d = .25 implies a quarter of a standard deviation difference between the population means, free of the units of measure of the variable in question, whether they are inches, centimeters, or points scored on a psychological test.

As another example, for testing the departure of a population proportion (P) from .50, the ES index is g = P - .50. If an investigator believes that there is a sex difference in the incidence of dyslexia such that boys are at different risk from girls, in a sample of dyslexic children, she would posit as the H0 that half the sample are of one sex, and as the H1 that a specified different proportion, say, .60, are of the other. The ES index would then be g = .60 - .50 = .10. Still another example is the analysis of variance test that a set of population means are all equal. The ES index for this test, f, is the standard deviation of these means divided by the common within-population standard deviation of the observations.[1]

Investigators in the social sciences find specifying the ES the most difficult aspect of power analysis. This is at least partly due to the relatively low level of consciousness about magnitudes in those disciplines. The conquest of psychological science by Fisherian null hypothesis testing (where the alternative to the H0 is simply its negation, so that no H1 is specified) has had the unfortunate effect of emphasizing the magnitudes of p values from significance tests rather than the magnitudes of the psychological phenomena under study.[3] A salutary side effect of the study of power analysis is its emphasis on ES. Neither power nor sample size can be determined in the absence of the investigator's readiness to consider just how wrong the null hypothesis is likely to be (i.e., the ES). The decision as to what population ES to posit arises from the investigator's knowledge of the field: the sample ESs found in previous investigations with similar variables, the results of pilot studies (though not reliable when based on small samples), and his or her educated intuition.

Because the ES indices are not generally familiar, I have proposed as conventions, or operational definitions, "small," "medium," and "large" values of each ES index to provide the user with some sense of its scale.[1] It was my intent that medium ES represent an effect of a size likely to be apparent to the naked eye of a careful observer, that small ES be noticeably smaller yet not trivial, and that large ES be the same distance above medium as small is below it. I also made an effort to make these conventions comparable across different statistical tests.

For example, for the test that r = 0, small, medium, and large ESs are, respectively, the population rs .10, .30, and .50. For the test that two population means are equal, the ESs, in the same order, are d = .20, .50, and .80. The .20 ES is exemplified by the mean IQ difference between twins and nontwins (the latter being larger), the .50 ES by the mean IQ difference between clerical and semiskilled workers, and the .80 ES by the mean IQ difference between Ph.D.s and college freshmen. In the analysis of variance test of the H0 that g populations have equal means, the f values (the standardized standard deviation of the means) are, respectively, .10, .25, and .40 for small, medium, and large ESs.[1]

Another means of facilitating the understanding of the various ES indices is by transforming them into other measures. For example, many of the ES indices (e.g., d, f, and the ES indices for the difference between proportions and for the degree of association in contingency tables of frequencies) may be translated into correlation coefficients or their squares, which may then be interpreted as proportions of variance. As another example, d may be expressed as various proportions of (non)overlap between normal distributions.[1]

α, THE SIGNIFICANCE CRITERION

The probability of mistakenly rejecting the H0, α, represents a research policy: the maximum risk one is prepared to take of making this error. It has become conventional that unless otherwise stated, this risk is set at .05. Smaller and thus more stringent values may be used, for example, when several H0s are to be tested in order to minimize the risk of making any Type I errors in an investigation (the experimentwise risk). Larger values may be used in exploratory studies. Also, for tests whose ESs may be either positive or negative, α may be defined as two-sided or one-sided. The latter has more power than the former when the sample effect is in the direction posited, but has zero power when the effect is in the opposite direction because the one-sided test logically precludes a contrary finding.

DETERMINING SAMPLE SIZE

In planning research, deciding the sample sizes is crucial. Because
research costs are at least approximately linear in the number of subjects, cost-effectiveness demands that this decision be appropriate. When asked in connection with a particular investigation what α and power are desired, a neophyte researcher might suggest α2 = .01 and some very large value for power, say, .99. Power analysis quickly determines that these specifications necessitate a sample size that is likely beyond the available resources. For example, for a test of the difference between means, if a medium ES (d = .5) exists in the population, these specifications require 194 cases in each of the two samples. Similarly, they require that if population r = .30, a test of the significance of a sample r have 254 cases. For α2 = .05 and .99 power, the N requirements are, respectively, 148 and 195.

To determine the necessary sample size, one needs to posit the α, ES, and desired power. I have proposed as a convention that in the absence of any other basis for setting the value for desired power, .80 be used.[1] In scientific research, it is typically more serious to make a false positive claim (Type I error) than a false negative one (Type II error). Because the implicit convention for significance is α = .05, the use of the convention .80 for desired power (hence, β = .20) makes the Type II error 4 times as likely as the Type I error, an arbitrary but reasonable reflection of their relative importance.[4]

A useful aid in determining the necessary sample size is a sample size planning table. To prepare such a table, the investigator selects values or ranges of values for α, ES, and power and then determines the N for each combination. This table provides the basis for a judicious choice of specifications or leads to the useful discovery that the research as conceived is not viable.[3]

Recall the investigator pursuing the question of a sex difference in the incidence of dyslexia. If in a population of dyslexic children half are boys, there is no sex difference, so H0 is P = .50. Departure from .50 would render H0 false. The ES index for this test is g = P - .50, the departure of the proportion from one half. If the investigator's resources are such that she could obtain an N of 90 to 100, and her expectation is a value of g in the range .10 to 1/6, she might compile the sample size planning table shown in Table 1 by looking up various combinations of α2 and g that would result in Ns within the desired range and noting the resulting power. From this table, she could choose a set of specifications.

Table 1. A sample size planning table

α2     g      Power   N
.01    1/6    .75     92
.02    .15    .75     98
.02    1/6    .85     98
.05    .10    .50     96
.05    .15    .85     97
.10    .10    .60     90
.10    .15    .90     91
.10    1/6    .95     92
.20    .15    .95     90

DETERMINING POWER

There is a useful role for power analysis in assessing completed research, particularly research in which nonsignificant results were obtained. Given the N employed and α, one needs only to posit the population ES to determine power. The sample ES found, or one or more ES values posited by the assessor, may serve this purpose. It is a common finding that power was poor for plausible ESs, usually because of small N.

In 1962, I reviewed the articles in the 1960 volume of the Journal of Abnormal and Social Psychology from the perspective of power.[5] I determined power for each statistical test in each article using the N employed at α2 = .01, .05, and .10 for the conventional definitions of small, medium, and large ES. I found, for example, that the median power to detect a medium ES at α2 = .05 was .46. The many power surveys done in the biosocial sciences since that time have had similar results. For example, a similar review by Sedlmeier and Gigerenzer of the 1984 Journal of Abnormal Psychology[6] found the median power under the same conditions to be a little worse (.44), and it was lower still (.37) when an experimentwise α criterion was employed. Even worse was the finding that in 11% of the studies, the H0 was taken as the research hypothesis and nonsignificance taken as confirmation: The median power of these studies to detect a medium ES at α2 = .05 was .25!

CONCLUSION

There has been no disagreement among research methodologists about the desirability of power analysis in research planning and assessment, yet progress in application of this method over the last quarter century has been slow. There have, however, been some rays of hope in the past few years. The popularity of meta-analysis has served to emphasize the size of effects and by thus raising the consciousness of behavioral scientists has promoted the cause of power analysis.[3] More directly, both graduate and undergraduate statistics textbooks have begun to feature chapter-length treatments of power analysis.[7] Finally, in addition to the reference works already noted,[1,4] there are available computer programs for power analysis and sample size determination.[8]


Acknowledgments. I am, as always, grateful to Patricia Cohen for her helpful comments.

Notes

1. J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. (Erlbaum, Hillsdale, NJ, 1988). This is the source of the system of power analysis described here; the power values and sample sizes of the illustrations derive from this book's tables.

2. J. Neyman and E.S. Pearson, On the use and interpretation of certain test criteria for purposes of statistical inference, Biometrika, 20A, 175-240, 263-294 (1928); J. Neyman and E.S. Pearson, On the problem of the most efficient tests of statistical hypotheses, Transactions of the Royal Society of London Series A, 231, 289-337 (1933).

3. J. Cohen, Things I have learned (so far), American Psychologist, 45, 1304-1312 (1990).

4. For an article-length treatment of sample size determination using the .80 convention and α = .01, .05, and .10, see J. Cohen, A power primer, Psychological Bulletin (in press). A useful alternative treatment is offered in H.C. Kraemer and S. Thiemann, How Many Subjects? Statistical Power Analysis in Research (Sage, Newbury Park, CA, 1987).

5. J. Cohen, The statistical power of abnormal-social psychological research: A review, Journal of Abnormal and Social Psychology, 65, 145-153 (1962).

6. P. Sedlmeier and G. Gigerenzer, Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309-316 (1989).

7. R. Rosenthal and R.L. Rosnow, Essentials of Behavioral Research: Methods and Data Analysis, 2nd ed. (McGraw Hill, New York, 1991); J. Welkowitz, R.B. Ewen, and J. Cohen, Introductory Statistics, 4th ed. (Harcourt Brace Jovanovich, San Diego, 1991).

8. M. Borenstein and J. Cohen, Statistical Power Analysis: A Computer Program (Erlbaum, Hillsdale, NJ, 1988); J. Hintze, Power Analysis and Sample Size (NCSS, Kaysville, UT, 1991). Some 13 programs are reviewed in R. Goldstein, Power and sample size via MS/PC-DOS computers, American Statistician, 43, 253-260 (1989).

Why Can Methods for Comparing Means Have Relatively Low Power, and What Can You Do to Correct the Problem?

Rand R. Wilcox

Rand R. Wilcox is Professor of Psychology at the University of Southern California. Address correspondence to Rand R. Wilcox, Department of Psychology, University of Southern California, Los Angeles, CA 90089-1061; e-mail: rwilcox@wilcox.usc.edu.

Certainly, one of the most common goals in applied research is comparing two or more groups in terms of some measure of location, that is, a quantity intended to represent the "typical" subject or object under study. Of course, the measure of location routinely used is the population mean, μ. If there is no difference between the distributions associated with two or more groups, standard methods for comparing means appear to provide good control over the probability of a Type I error (i.e., concluding the means are different when in fact they are equal). However, if the groups differ in some way, and in fact you should reject the hypothesis that the groups are the same in terms of some measure of location, then using standard methods for comparing means is one of the worst things you could possibly do. In fact, very slight departures from normality can have serious consequences.

In this article, I review the problem that arises in using conventional statistical methods to compare group means and then discuss some solutions. Standard nonparametric methods do not correct the problem, nor do some of the better known improvements for comparing means. There are, however, new methods that can help applied researchers.

WHY IS THERE A PROBLEM?

For comparing means, Student's t test is the most commonly used method, although some researchers might use Welch's[1] method instead. If you whisper "nonnormality," some researchers might respond by using the Mann-Whitney U test, but others might argue that Student's t test is robust to nonnormality. For many years within the statistical literature, it has been known that all three of these methods have serious practical problems, particularly in terms of power. Improved methods have now emerged and are ready to be used in applied work.

When choosing a procedure for comparing groups, it helps to keep three common goals in mind:

1. Control the probability of a Type I error when the distributions are identical.
2. Compute accurate confidence intervals for the difference between two measures of location when the distributions differ.
3. Achieve reasonably high power when the two groups differ in terms of some measure of location.

Goal 1 has received the most attention, especially within the social sciences. In this regard, Student's t test, and its extension to more than two groups, appears to perform very well.[2]

For Goal 2, Student's t test appears to perform reasonably when equal sample sizes are used, but for unequal sample sizes, serious problems arise. In particular, Cressie and Whitford[3] described general circumstances under which, no matter how

Copyright © 1992 American Psychological Society
