
Reliability and Validity: Meaning and Measurement
Roger J. Lewis, M.D., Ph.D.

Department of Emergency Medicine, Harbor-UCLA Medical Center
Associate Professor, UCLA School of Medicine

Presented at the 1999 Annual Meeting of the Society for Academic Emergency Medicine (SAEM) in Boston, Massachusetts.

Revision Date: May 17, 1999

Contact Information:
Roger J. Lewis, M.D., Ph.D.
Department of Emergency Medicine
Harbor-UCLA Medical Center
1000 W. Carson Street, Box 21
Torrance, CA 90509
Tel: (310) 222-6741
Fax: (310) 212-6101
e-mail: roger@emedharbor.edu (all lower case letters)

Introduction

The purpose of most research, whether a clinical trial, health services research, or an experimental animal study, is to demonstrate a relationship between an outcome of interest and one or more other variables or characteristics. These other variables or characteristics may be experimentally manipulated, such as the treatment given a patient in a clinical trial, or they may be characteristics of a healthcare system, such as the staffing pattern in an emergency department. Regardless of whether the study involves human subjects, animal subjects, or basic laboratory model systems, our ability to identify and measure the relationships of interest depends on our ability to accurately measure both the outcome variable of interest and those variables tentatively believed to affect outcome.

In many cases, it is not possible to measure directly the outcome variable of interest. For example, while one may be most interested in the effect of a new treatment on long-term survival, it may be practical to measure patient survival only up to 30 days after hospitalization. Other outcomes, such as quality of life or satisfaction with care, may be difficult to define precisely or to measure. In other situations, there may be a gold standard. The term gold standard refers to some exact method for measuring the characteristic of interest; by definition, the gold standard has absolute accuracy. In some cases the gold standard is easy to envision while, in other cases, there may be no true gold standard. For studies in which it is either impossible or impractical to use a gold standard for outcome, we may instead use a surrogate endpoint. A surrogate endpoint is a characteristic which is both practical to measure (that's why we are using it) and thought to be closely associated with the outcome of true interest. In virtually all clinical research, the outcomes measured are surrogates for the outcomes of true interest, although this is often not explicitly stated.

The terms validity and reliability refer to well defined, statistical measures of our ability to measure an outcome or characteristic. The purpose of this lecture is to introduce the meaning of validity and reliability, teach the use of appropriate statistical methods for estimating validity and reliability, and illustrate the use of these techniques using data from actual clinical studies.

Concepts

Types of Variables

There are generally considered to be four different types of variables: nominal or categorical, rank, interval, and ratio. A nominal or categorical variable can be used to divide patients into categories, for example, gender, mechanism of injury, etc. For a single variable, the categories are usually mutually exclusive. A rank variable puts observations into a particular order, but there is no defined distance between categories. Examples of rank variables include school grades and Likert scales (e.g., strongly disagree, mildly disagree, neutral, mildly agree, strongly agree). Interval variables are numbers, so differences between interval variables are meaningful, but there is no meaningful zero; dates are the most common interval variables. Ratio variables are numbers for which there is a meaningful zero. Examples are numerous, including serum glucose concentration, the length of time after an injury, etc. Most numbers measured in medical studies are ratio variables.


Both interval and ratio variables are usually continuous or roughly continuous, and I will frequently use the term continuous to refer to both types. When planning the analysis of continuous variables, it is often necessary to consider whether they are likely to be normally distributed. If they are not, they must often be converted to ranks before analysis (e.g., for the Spearman rank correlation coefficient).

Data Structure

The structure of data in reliability and validity studies is different than in most other medical studies, as it consists of two or more paired measurements. In the most common situation, there are two paired measurements on each subject or sample. In more complex designs, however, there may be more than two measurements on each sample, or different samples may have different numbers of measurements. Some measurements, such as the administration of questionnaires or tests, actually consist of multiple items, each of which represents a measurement. In the case of a questionnaire, one could consider the number of measurements on each subject to be the number of separate items on the questionnaire. When considering the structure of the data, it is also important to anticipate its type and, if it is continuous (interval or ratio), to consider whether it is likely to be normally distributed.

Meaning of Reliability

The term reliability refers to the degree to which repeated measurements, or measurements taken under identical circumstances, will yield the same results. This definition assumes that the act of measuring does not affect the variable or characteristic of interest. The statistical definition of reliability is related to the lay definition, in that a piece of machinery which is reliable according to the lay definition yields the same behavior each time it is used. Reliability is a measure of the randomness of the measurement process itself. In determining the reliability of a measurement technique, we may actually make repeated measurements, or we may calculate statistically what would be likely to happen if we were to make repeated measurements, based on an analysis of the correlations between parts of our measurement (e.g., correlations between the answers to different questions on a quality of life questionnaire).

Reliability is an important concept when there is no gold standard. The quality of the measurement is considered relative to itself at another time, rather than relative to an external standard. When there is a gold standard or an accepted external measure, then validity should also be considered (see below). Figure 1 shows results that might be obtained in a reliability study of a measurement method. Each sample or subject has the same measurement made twice and, for each observation, the first result is the horizontal coordinate and the second result is the vertical coordinate. If the measurement technique were perfectly reliable, then all the data would fall on the line that goes up from the origin at a 45 degree angle.
[Figure 1. Reliable Measurement: scatter plot of the second measurement (vertical axis) against the first measurement (horizontal axis); both axes run from -20 to 20.]

The data shown suggest that the method being tested is highly reliable. Note that there is no reference to a true value; one has no idea whether or not any of the measured values are correct, only that they don't change with repetition.

Meaning of Validity

The validity of a measurement can be defined as the degree to which the measured value reflects the characteristic it is intended to measure. For example, bedside glucose determinations are considered valid if the numerical results rarely differ substantially from serum glucose determinations. On the other hand, the presence of tachycardia is not a valid marker for intravascular volume depletion, as a substantial fraction of patients with intravascular volume depletion will fail to exhibit tachycardia. In order to be highly valid, a measurement technique must also be reliable, but the converse is not true--a measurement technique may be reliable but invalid. In other words, it can be consistent in the value returned, but these values may not accurately represent the characteristic that is intended. A measurement technique which is reliable but not valid is said to be biased if the errors tend to occur in one direction more than another, or if it is influenced by a factor that we do not intend it to measure.

The term valid implies that there is some sort of external standard, or gold standard, against which the current measurement is being compared. For example, the results of autopsy can be used as the gold standard for the cause of death in a study of trauma patients, or the results of serum measurements of troponin and the MB fraction of creatine kinase may be used as a gold standard definition for the occurrence of myocardial infarction. Validity can have both very concrete and somewhat subjective meanings. A concrete example would be the validity of cranial CT for detecting subarachnoid hemorrhage (SAH); the validity of the cranial CT reading can be rigorously defined by the sensitivity and specificity of the test, using the results of lumbar puncture as the gold standard. In contrast, the validity of ABEM certification to define clinical competency in emergency medicine is subjective, because of the difficulty in defining what we mean by clinical competency. Similarly, the validity of quality of life questionnaires is difficult to define precisely, because quality of life is a subjective characteristic.

Illustrations of Reliability and Validity

Figures 2 through 4 show data that might be obtained when comparing a measurement technique yielding a numerical result to a gold standard. In figure 2 the measurement is seen to be both reliable and valid, as it never deviates substantially from the 45 degree line. On the other hand, the data shown in figure 3 suggest that the measurement technique is valid, since it yields values that fall evenly on either side of the true value, but less reliable than shown in figure 2. Some might argue, however, that because of the spread in the data (reduced reliability) the validity is somewhat reduced as well. The data shown in figure 4 suggest that the measurement technique is neither reliable nor valid, yielding results that vary substantially and are biased toward values above the true value.
[Figure 2. Reliability and Validity: a reliable and valid measurement. Measured value (vertical axis) vs. gold standard measurement (horizontal axis); both axes run from -20 to 20.]


Intrinsic Reliability

Generally, the term reliability refers to intrinsic reliability, the consistency of a measurement technique on repeated trials. Examples include the consistency of repeated measurements on the same patient or sample (assuming the act of measurement does not change the true value in some way), or the correlation of scores on two administrations of the same test (again assuming no testing effect). Even though intrinsic reliability implies consistency on repetitions of the same measurement, the intrinsic reliability of the score from a multi-item test or questionnaire can be calculated, even if the test or questionnaire is given to each subject only once, by analyzing the correlation of responses to the different items in the test or questionnaire.

[Figure 3. Reliability and Validity: a less reliable but valid measurement. Measured value (vertical axis) vs. gold standard measurement (horizontal axis); both axes run from -20 to 20.]

[Figure 4. Reliability and Validity: a less reliable and not valid measurement. Measured value (vertical axis) vs. gold standard measurement (horizontal axis).]

Interrater Reliability

Interrater reliability refers to the correlation of responses from two or more raters, each evaluating the same endpoint or making the same measurement, in multiple subjects. Interrater reliability incorporates both the randomness of the measurement itself and the variability due to differences in raters. Interrater reliability is an extremely important concept which arises frequently in clinical research. In a multicenter clinical trial, the eligibility criteria, definitions of adverse events, and definitions of outcomes must all be selected to ensure that they can be applied consistently by different personnel at geographically distant clinical study sites. In other words, each of these criteria or definitions must have high interrater reliability. Even in single-center trials, multiple personnel often must evaluate patient characteristics (e.g., scoring adverse events), and the resulting data are compromised if the evaluations have low interrater reliability. After a clinical trial is complete and the results published, the interrater reliability of the original eligibility criteria determines whether the same patient population can be identified by clinicians in practice (can you read head CTs with the same accuracy as the investigators in the NINDS TPA-stroke trial?).

Approaches to Measuring Reliability

There are a number of different approaches to measuring reliability. These include the test-retest method, the alternative form method, and the split-halves method.


In the test-retest method, the measurement or multi-item test is used twice on each subject. The correlation coefficient between the paired measurements on each subject is calculated, termed a reliability coefficient, and denoted r. In general, a measurement is considered reliable if r > 0.80, although the desired reliability depends on the intended use of the measurement. Unfortunately, in the case of tests or questionnaires, the first administration often influences the subject's performance on the second administration, reducing the accuracy of reliability assessment using this method. This type of effect does not occur with laboratory measurements, and the test-retest method is most useful in that setting. The test-retest method can be used even when each measurement yields only a single result (like a laboratory test), while the reliability assessment methods discussed below cannot. Educational tests and questionnaires, on the other hand, usually contain multiple items, and a single test or questionnaire administration yields multiple data points.

In the alternative form method, the test or questionnaire is administered twice, but in a slightly different form the second time. For example, the order of the questions may be altered. The goal is to change the test enough to reduce memory effects but, at the same time, keep the content of the test as similar as possible. The alternative form approach may slightly reduce the effect of the first administration on the results of the second administration, but subjects can still learn from the first test administration.

In the split-halves method, the multi-item test or questionnaire is administered once and the score on one half of the test (e.g., all the odd numbered questions) is correlated with the same subject's score on the other half of the test (e.g., all the even numbered questions). Because the correlation coefficient is calculated from two half tests, it must be adjusted to give the appropriate reliability coefficient for the whole test or questionnaire. When using the split-halves method, it is important to note that the actual way the test or questionnaire is split is arbitrary, and different methods of splitting the test would yield slightly different reliability coefficients. The best value of the reliability coefficient is obtained by calculating the average of the coefficients that would be obtained from all possible splits (see Cronbach's α and the Kuder-Richardson formula below).

Additional Validity Concepts

Recall that the term validity refers to the degree to which a measurement procedure, test, or questionnaire measures the characteristic it is intended to measure. Validity implies that there is some external reference. Within this general definition of validity, there are at least four types of validity which are commonly mentioned: face validity, content validity, construct validity, and criterion validity. Face validity is the degree to which a test or questionnaire appears to be appropriate for its intended purpose, based on simple inspection of the test or questionnaire itself. It is the weakest form of validity and, some would argue, is not really a type of validity at all. Content validity refers to the degree to which the content of a test or questionnaire covers the extent and depth of the topics it is intended to cover.
Although highly subjective, this is a useful concept when evaluating educational tests and research questionnaires. For example, a questionnaire that asks only about the subject's experiences with injection drugs would have low content validity if the purpose was to measure all types of illicit drug use. Construct validity refers to the consistency between the questions on a test or questionnaire and accepted theoretical constructs related to the subject being studied.


For example, given the accepted idea or construct that domestic violence tends to be repetitive or chronic, a questionnaire asking about previous domestic violence experiences would be a valid measure of future risk. In other words, the content of the questions (previous domestic violence experience) and the purpose of the questionnaire (predicting future risk) are internally consistent because of our knowledge about patterns in domestic violence (chronic and repetitive). Criterion validity refers to the degree to which the proposed measurement, test, or questionnaire yields results consistent with an independent external criterion or gold standard. Criterion validity is the most concrete type of validity, and the type most often considered in traditional medical research. Other forms of validity, especially content and construct validity, often arise in health services research, psychiatric research, and sociological research.

Measurement of Reliability and Validity

Loosely speaking, all statistical methods for measuring reliability and validity assess the correlation between two or more measurements made on a single unit (e.g., patient, student, laboratory sample, etc.), averaged over many such units. In a reliability study there is no criterion standard, so each measurement is treated equally. Examples of appropriate statistical methods in this setting include correlation coefficients and Cohen's κ. In a validity study, one of the measurements is usually the criterion standard against which the measurement of interest is being compared. Examples of appropriate statistical methods for validity studies include regression methods (not discussed here), and estimates of % accuracy, sensitivity, specificity, positive predictive value, and negative predictive value. Sometimes, statistical methods most often used for assessing reliability (e.g., Pearson correlation coefficients) are used to assess validity, because one of the two variables is actually a criterion standard. The interpretation of the statistical results, therefore, is determined more by the type of paired data than by the statistical test itself.

First Steps in Measurement: Inspection

Before any statistical analysis of data is begun, it is important to inspect the data graphically (continuous data) or in a contingency table (categorical data). This initial inspection step is extremely important, as one can often visually detect patterns in the data that strongly influence the interpretation of the subsequent statistical analysis. Each statistical method, whether applied to continuous or categorical data, reduces all the information to one or a few numbers, and this reduction can inadvertently eliminate important information. Consider the situation in which two assays for a drug level are being compared and, because of a miscalibration, the second assay always yields a value twice the value from the first assay. Although this problem would be easy to detect by graphing the data, the resulting correlation coefficient would still be one.

Pearson Correlation Coefficient

The Pearson correlation coefficient is the correlation coefficient most commonly taught in elementary statistics courses and the one most commonly calculated. This correlation coefficient measures the strength of association between two variables, and it assumes that the two variables being correlated are normally distributed. The resulting value ranges from -1 to 1, with 0 indicating no correlation and 1 indicating perfect positive correlation.
When a p value is reported along with a correlation coefficient, it usually refers to the implied null hypothesis that the correlation coefficient is exactly zero.


This null hypothesis is rarely of clinical interest, as variables being correlated usually have some relationship to each other. Thus, it is much more useful to report a correlation coefficient with its associated 95% confidence interval (CI).

Spearman Rank Correlation Coefficient

The Spearman rank correlation coefficient is a nonparametric correlation coefficient used to assess the relationship between two continuous or roughly continuous variables. The Spearman rank correlation can be used on continuous variables that are not normally distributed, or on rank variables with many levels. It is obtained by replacing the numerical value of each measurement by its rank, and then calculating a Pearson correlation coefficient using the resulting ranks. Like the Pearson correlation coefficient, it can range from -1 to 1 in value, with 0 indicating no correlation above that expected by chance, and 1 indicating perfect correlation of the ranks. Also like the Pearson correlation coefficient, the Spearman rank correlation coefficient ρ is commonly reported with a p value for the null hypothesis that ρ = 0. It is much better to estimate ρ, reporting a point estimate and a 95% CI, than to test the hypothesis that ρ = 0.

Figure 5 shows a scatter plot of neurologic outcome scores obtained from children, using two different systems. The standard scoring system, whose results are shown on the horizontal axis, is an accepted system which is difficult to administer and requires significant training to apply appropriately. The experimental scoring system, whose results are shown on the vertical axis, is simplified and can be applied by nonspecialists with minimal training. If the experimental scoring system were found to be reliable and valid, it would be an appropriate surrogate outcome for clinical studies of neurologic outcome in this population. When the data shown in figure 5 are analyzed, one obtains a Pearson correlation coefficient of 0.59 (p < 0.0001) and a Spearman rank correlation coefficient of 0.48 (p < 0.0001). Thus, there is a highly statistically significant association between the results obtained using the two scoring systems. It is important to note, however, that the data obtained by both scoring systems are not normally distributed, and the Pearson correlation coefficient is probably an inappropriate measure of the validity of the experimental scoring system.

Analysis of Differences of Pairs

When two measurements are taken on each subject, it is quite useful to calculate the difference between the paired measurements for each subject. Very simple descriptive statistics, such as the standard deviation of these differences, or simply a graphical presentation of the differences, can be very useful in communicating the degree of reliability or validity. Specifically, for the case of numerical variables, we can define the difference for the ith subject as Δi = yi - zi, where yi is one measurement and zi is either the second measurement or the gold standard. Applying this method to the data for the neurologic outcome score (figure 5) yields the plot shown in figure 6.


This plot shows that there is clearly a systematic problem with the experimental neurologic outcome score being tested, since the errors are clearly not random and, therefore, the validity depends on the score obtained.
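As a rough illustration of the calculations just described, the following Python sketch computes a Pearson correlation coefficient with an approximate 95% CI (via the Fisher z transformation), a Spearman rank correlation coefficient, and simple descriptive statistics of the paired differences. The paired scores in the arrays standard and experimental are invented for illustration; they are not the data shown in figure 5.

# Sketch: correlation and paired-difference summaries for paired measurements.
# The data below are hypothetical; substitute the actual paired scores.
import numpy as np
from scipy import stats

standard = np.array([30, 45, 52, 60, 75, 80, 95, 100, 110, 118], dtype=float)
experimental = np.array([42, 50, 61, 58, 90, 85, 104, 103, 118, 121], dtype=float)

# Pearson correlation coefficient and the (rarely interesting) p value for r = 0.
r, p_pearson = stats.pearsonr(standard, experimental)

# Approximate 95% CI for r via the Fisher z transformation.
n = len(standard)
z = np.arctanh(r)
half_width = 1.96 / np.sqrt(n - 3)
ci_low, ci_high = np.tanh(z - half_width), np.tanh(z + half_width)

# Spearman rank correlation coefficient (no normality assumption).
rho, p_spearman = stats.spearmanr(standard, experimental)

# Differences of pairs: delta_i = y_i - z_i.
diff = experimental - standard
print(f"Pearson r = {r:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
print(f"Spearman rho = {rho:.2f}")
print(f"differences: mean = {diff.mean():.1f}, SD = {diff.std(ddof=1):.1f}, "
      f"median = {np.median(diff):.1f}")

Reporting the point estimate with its CI, as sketched here, follows the recommendation above to estimate the correlation rather than merely test whether it is zero.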

[Figure 6. Graphical Presentation of Differences: differences between paired scores plotted against the criterion value; annotated median = 15, IQR 1.4 to 24.6.]

Intraclass Correlation Coefficient

The analysis of differences between paired measurements is a central theme in reliability and validity assessment. There is a set of techniques, called intraclass correlation coefficients, based on the analysis of differences between pairs of observations. Intraclass correlation coefficients exist for complex experimental designs, e.g., more than two observations on each subject, more than two raters, etc. Even for simple cases, such as two normally-distributed variables, the intraclass correlation coefficient has advantages over the Pearson correlation coefficient. For example, if one of the variables is a multiple of the other variable, as may occur when an assay is miscalibrated, that will reduce the intraclass correlation coefficient but not the Pearson correlation coefficient. Many statistical methods for the analysis of paired data, such as Cohen's κ, are actually forms of intraclass correlation coefficients.

Analysis of Inter-Item Consistency

As you recall, one method for assessing the intrinsic reliability of a multi-item test or questionnaire is the split-halves approach, in which the responses on one half of the items are correlated with the responses on the other half. Unfortunately, there are many different ways to split a multi-item test, and each split will yield a potentially different reliability coefficient. The best measure of the test's reliability would, therefore, be the average of the reliability coefficients obtained from all possible splits. We will consider two statistical methods used to obtain this type of average reliability coefficient--one for the analysis of continuous or rank data, the other for the analysis of dichotomous data. When each item of a multi-item test yields a numerical or rank result, Cronbach's α can be used to estimate the average reliability coefficient that would be obtained from all possible splits. Cronbach's α is an appropriate method to analyze the reliability of questionnaires that use Likert scales (e.g., strongly disagree, mildly disagree, neutral, mildly agree, strongly agree), since Likert scales give rank type results. When each of the items on a multi-item test yields dichotomous results, it is still possible to obtain an estimate of the average reliability coefficient which would be obtained from all possible splits; in this case, the Kuder-Richardson formula (KR20) can be used. The split-half method (using a single split), Cronbach's α, and the Kuder-Richardson formula are all forms of inter-item consistency methods of reliability assessment. It is important to note that these forms of reliability assessment assume that all items measure the same characteristic or opinion. Thus, if a questionnaire is intended to assess several different opinions or subject areas, a separate reliability coefficient would need to be determined for each set of questions which address a particular area. For a single group of questions or test items, each addressing the same subject area, the reliability increases with the number of items. This increase in reliability is useful up to approximately 8 items addressing a single area.
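A minimal sketch of these inter-item calculations follows. It computes Cronbach's α from a subjects-by-items score matrix using the usual variance-based formula, and also correlates an odd-item half with an even-item half and applies the standard Spearman-Brown adjustment to estimate whole-test reliability from that single split. The item scores in the matrix are invented for illustration.

# Sketch: Cronbach's alpha and a Spearman-Brown adjusted split-half coefficient.
# Rows are subjects, columns are items; the scores below are hypothetical.
import numpy as np

scores = np.array([
    [4, 5, 4, 5, 3, 4],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 3],
    [5, 4, 5, 5, 4, 5],
    [1, 2, 1, 2, 2, 1],
    [4, 4, 3, 4, 5, 4],
], dtype=float)

k = scores.shape[1]                         # number of items
item_var = scores.var(axis=0, ddof=1)       # variance of each item
total_var = scores.sum(axis=1).var(ddof=1)  # variance of the total score

# Cronbach's alpha: average reliability over all possible split halves.
alpha = (k / (k - 1)) * (1 - item_var.sum() / total_var)

# Split-halves method: correlate odd-item and even-item half scores,
# then apply the Spearman-Brown adjustment for the whole-test reliability.
odd_half = scores[:, 0::2].sum(axis=1)
even_half = scores[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_full = 2 * r_half / (1 + r_half)          # Spearman-Brown formula

print(f"Cronbach's alpha = {alpha:.2f}")
print(f"split-half r = {r_half:.2f}, adjusted whole-test r = {r_full:.2f}")

For dichotomous (0/1) items, the same α calculation reduces to the Kuder-Richardson formula (KR20).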


Analysis of Nominal or Categorical Variables

Figure 7 shows the typical arrangement of validity data for a continuous variable (which happens to cluster around values of zero and one) and the corresponding contingency table after the data are converted to a dichotomous form. Note that the upward sloping diagonal, which corresponds to perfect agreement for continuous variables, is converted to a downward sloping diagonal in the contingency table.

[Figure 7. Arrangement of Data for Categorical Measurements vs a Gold Standard: the measured value plotted against the gold standard measurement, and the corresponding 2 x 2 contingency table (rows = measured value 1 or 0, columns = gold standard 1 or 0) with cell counts 45, 12, 7, and 36.]

Consider the evaluation of a diagnostic test which yields a dichotomous result, either positive or negative. We will also assume that there is a gold standard method for defining the presence or absence of the disease for which this test is used. In evaluating the performance of the test, one would arrange the data as shown in the left side of figure 8. Now consider the case in which there is no gold standard but, instead, two different diagnostic tests are used to determine the presence of the disease. In this case, neither diagnostic test is known to be superior to the other and, because no gold standard is available, we can only quantify the agreement between the two types of tests. In this case, the data would be arranged as shown in the right side of figure 8. The data are arranged identically to the left side of figure 8, but the gold standard has been replaced by one of the tests being evaluated. Because the data are arranged identically, whether or not a gold standard is available, the same statistical methods can be used in the two circumstances. The interpretation of the resulting statistics, however, is quite different.

[Figure 8. Arrangement of Dichotomous Data for Reliability and Validity Assessment: left panel (validity data), cells A, B, C, D with rows = test result (+/-) and columns = gold standard (+/-); right panel (reliability data), the same arrangement with the gold standard replaced by Rater A and the test result by Rater B.]
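As a small illustration of this arrangement, the sketch below tallies hypothetical paired dichotomous results (a test result paired with either a gold standard or a second rater) into the A, B, C, D cells of figure 8. The lists of results are invented for illustration.

# Sketch: arranging paired dichotomous results into the 2 x 2 table of figure 8.
# The paired results below are hypothetical (1 = positive, 0 = negative).
from collections import Counter

test = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
reference = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # gold standard or second rater

cells = Counter(zip(test, reference))
A = cells[(1, 1)]   # both positive
B = cells[(1, 0)]   # test positive, reference negative
C = cells[(0, 1)]   # test negative, reference positive
D = cells[(0, 0)]   # both negative

print(f"A={A}  B={B}")
print(f"C={C}  D={D}")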

Validity of Dichotomous Variables

When there is a gold standard for a dichotomous variable, there are standard measures of the validity of a test which are so commonly discussed in clinical medicine that you may not think of them as validity measures at all. These are the sensitivity, specificity, positive predictive value, and negative predictive value of the test. These customary measures of test performance are, in fact, measures of the test's validity as a surrogate for the gold standard method of diagnosing the disease of interest. The likelihood ratio for a positive test result (LR+) and the likelihood ratio for a negative test result (LR-), although less commonly used, are also very good measures of test performance. The LRs provide a measure of the test result's ability to increase or decrease the probability of the disease being present, and are independent of the disease prevalence in the population.




Characteristics and definitions of these measures of dichotomous test validity are shown in table 1.

Table 1. Measures of validity for a dichotomous test result (cells A, B, C, and D as arranged in figure 8).

Sensitivity: The sensitivity is the fraction of patients with the disease in question who have a positive test result. Thus, the sensitivity depends upon the test results only among those patients who truly have the disease in question. Formula: A / (A + C)

Specificity: The specificity of a test is the fraction of patients without the disease in question who have a negative test result. Thus, the specificity of a test depends only upon the results in patients without the disease in question. Formula: D / (B + D)

Positive Predictive Value: The positive predictive value of a test is the fraction of all patients with a positive test result who truly have the disease in question. Thus, the positive predictive value depends on the presence of the disease only among those patients with a positive test result. Formula: A / (A + B)

Negative Predictive Value: The negative predictive value of a test is the fraction of all patients with a negative test result who do not have the disease in question. Thus, the negative predictive value depends upon the absence of disease only among those patients with a negative test result. Formula: D / (C + D)

Likelihood Ratio (positive): The likelihood ratio for a positive test result is the ratio of the probability of a positive test result among patients with the disease to that among patients without the disease. Formula: [A / (A + C)] / [B / (B + D)]

Likelihood Ratio (negative): The likelihood ratio for a negative test result is the ratio of the probability of a negative test result among patients with the disease to that among patients without the disease. Formula: [C / (A + C)] / [D / (B + D)]
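A minimal sketch of these calculations follows, using the cell counts A, B, C, and D as laid out in figure 8; the numerical values are the dichotomized counts shown in figure 7, used purely for illustration.

# Sketch: validity measures for a dichotomous test versus a gold standard.
# Cell counts follow the arrangement of figure 8 (A, B, C, D); the values
# here are the dichotomized counts shown in figure 7.
A, B, C, D = 45, 12, 7, 36

sensitivity = A / (A + C)            # positive test among those with disease
specificity = D / (B + D)            # negative test among those without disease
ppv = A / (A + B)                    # disease among those with a positive test
npv = D / (C + D)                    # no disease among those with a negative test
lr_positive = sensitivity / (1 - specificity)   # = [A/(A+C)] / [B/(B+D)]
lr_negative = (1 - sensitivity) / specificity   # = [C/(A+C)] / [D/(B+D)]

print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
print(f"LR+ = {lr_positive:.1f}, LR- = {lr_negative:.2f}")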

Reliability of Categorical Data

Consider again the arrangement of the dichotomous data shown in the right hand panel of figure 8. In this case there is no criterion or gold standard, so our statistical analysis will evaluate reliability rather than validity. There are a number of methods for calculating reliability in this setting; we will consider % agreement, Cohen's κ, and the weighted κ in some detail.


% Agreement

The simplest measure of reliability, in the case of two dichotomous variables, is the calculation of % agreement. This is simply the percent of all observations for which both raters give the same result. It is calculated as

% Agreement = 100 x (A + D) / (A + B + C + D)

Although simple to calculate, % agreement is rarely useful because a significant amount of agreement can occur by random chance, giving a falsely elevated impression of the reliability of the raters (see below).

Chance Agreement

Consider two raters scoring a test which yields a dichotomous result (e.g., reading a chest radiograph for the presence or absence of a pulmonary infiltrate). Even if each rater gives a positive result completely at random (i.e., neither of them ever looks at the films), sometimes both will score a given radiograph as positive and sometimes both will score a given radiograph as negative. This is called chance agreement. The amount of chance agreement that is likely to occur among random raters depends on the fraction of the time, on average, they give a positive result. For example, if they each give a positive result 10% of the time, but randomly, then 81% of the time they will both give a negative result by chance, and 1% of the time they will both give a positive result by chance. The total chance % agreement will be 82%!

Cohen's κ and Weighted κ

Because some agreement among raters will occur by chance, even if both raters are completely random, it is useful to have a measure of interrater reliability which takes this chance agreement into account. In addition, the amount of extra agreement which is possible, given perfect interrater reliability, depends on the amount of agreement which occurs by chance. The definition of Cohen's κ is chosen so that the agreement expected only by chance is given a value of zero, and the maximum possible agreement is given the value one. Thus, regardless of the underlying frequency of the different outcome categories, values for κ will always range between -1 (the opposite of perfect agreement) and 1 (perfect agreement), with 0 signifying no agreement above that expected by chance. The formula for κ is given by

κ = (πo - πe) / (1 - πe)

where πo = Σi πii is the observed proportion of agreement, πe = Σi πi+ π+i is the proportion of agreement expected by chance, πij is the proportion of observations in the ith row and jth column, πi+ is the proportion of observations in the ith row, and π+i is the proportion of observations in the ith column.
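A minimal sketch of this calculation is shown below. Applied to the cell counts of figure 9, it gives an observed agreement of 0.81, a chance agreement of about 0.50, and κ of about 0.62, the value discussed in the next paragraph; the confidence interval is not computed here.

# Sketch: Cohen's kappa for two raters and a dichotomous rating.
# Cell counts follow figure 9 (rows = one rater, columns = the other).
import numpy as np

table = np.array([[45, 12],
                  [7, 36]], dtype=float)
n = table.sum()
p = table / n                       # proportions pi_ij

p_o = np.trace(p)                   # observed agreement: sum of pi_ii
p_e = (p.sum(axis=1) * p.sum(axis=0)).sum()   # chance agreement: sum of pi_i+ * pi_+i
kappa = (p_o - p_e) / (1 - p_e)

print(f"observed agreement = {p_o:.2f}, chance agreement = {p_e:.2f}")
print(f"kappa = {kappa:.2f}")       # approximately 0.62 for these counts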


Consider the data shown in figure 9. Using the formula for κ yields a value of 0.62. How do we interpret this value? As a general guide, values of κ less than 0.4 signify poor agreement, values between 0.4 and 0.8 show mild to moderate agreement, and values above 0.8 represent excellent agreement. It is important to remember, however, that what constitutes acceptable or excellent agreement depends on the intended use of the test. For example, when using cranial CT to detect hemorrhage prior to the administration of TPA for stroke, very high reliability is desired, since the consequence of giving TPA to a patient with an intracerebral bleed is so great.

[Figure 9. Example of Cohen's κ: a 2 x 2 table of ratings by Rater A (columns, +/-) and Rater B (rows, +/-) with cell counts 45, 12, 7, and 36; κ = 0.62, 95% CI 0.46 to 0.77.]

The definition and interpretation of κ is relatively straightforward for the case of two raters and only two ratings (e.g., positive and negative). When there are more than two possible ratings, however, the question arises of the relative importance of the different possible misclassification errors. For example, if the three ratings are nonurgent, urgent, and emergent, then a disagreement in which one rater classifies the subject as emergent and the other as nonurgent is a more serious disagreement than if one rater classifies the subject as emergent and the other as urgent. The weighted κ allows different importance to be placed on different misclassification errors. Although there is a standard choice for weighting different disagreements, the weighting can be adjusted for a particular use. This flexibility also makes it possible to obtain different values of the weighted κ, given a particular set of data, depending on the weighting scheme chosen.

[Figure 10. Reliability of a Neurologic Outcome Score: cross-tabulation of the scores assigned by Rater A and Rater B (n = 31); κ = 0.95 (95% CI 0.85 to 1.00), weighted κ = 0.98 (95% CI 0.93 to 1.00).]

Figure 10 shows some actual data used to evaluate the interrater reliability of a neurologic outcome scale which yields a value between 1 and 5. If these data are analyzed with a standard κ, a value of 0.95 is obtained. Note that the only disagreements were one point apart, the smallest non-zero disagreement. If the same data are analyzed with a weighted κ, a higher value of 0.98 is obtained, because the type of disagreement is taken into account.

In general, the goal of using κ for analysis is to estimate the magnitude of the agreement, rather than simply to show that there is any agreement above that expected by chance. Because of this, it is better to report an estimate for κ, along with a 95% CI, rather than the commonly produced p value for the null hypothesis that κ = 0. While many statistical software packages will give 95% CIs for κ, the formulas used are often inaccurate if κ is close to one.
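The following sketch illustrates the idea of weighting with a hypothetical 3 x 3 table for the nonurgent/urgent/emergent example mentioned above: disagreements are weighted by how far apart the two ratings are (a linear weighting), so an emergent-versus-nonurgent disagreement counts more heavily than an emergent-versus-urgent one. The counts are invented, and the linear weights shown are only one of several standard choices.

# Sketch: linearly weighted kappa for two raters and three ordered categories
# (nonurgent, urgent, emergent). The counts below are hypothetical.
import numpy as np

table = np.array([[30, 5, 1],
                  [4, 25, 3],
                  [0, 2, 10]], dtype=float)
n_cat = table.shape[0]
p = table / table.sum()

# Disagreement weights: 0 on the diagonal, larger for ratings further apart.
i, j = np.indices((n_cat, n_cat))
weights = np.abs(i - j)             # linear weighting scheme

row_marg = p.sum(axis=1)
col_marg = p.sum(axis=0)
expected = np.outer(row_marg, col_marg)

# Weighted kappa: 1 minus the ratio of weighted observed to weighted
# chance-expected disagreement.
weighted_kappa = 1 - (weights * p).sum() / (weights * expected).sum()
print(f"weighted kappa = {weighted_kappa:.2f}")

With all off-diagonal weights set equal, this reduces to the unweighted κ computed in the previous sketch.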


One symptom of this is that the upper bound given for the CI may be greater than one, even though one is the highest possible value for κ. Additional versions of κ are defined for situations in which there are more than two raters. Although these are beyond the scope of this lecture, they are discussed in some of the references and are available in some commercially available statistical software packages.

Comparison of Kappas

Consider the situation in which we are evaluating the interrater reliability at two different clinical sites participating in a multicenter clinical trial. For example, there may be a dichotomous endpoint, and we wish to ensure that this endpoint can be evaluated reliably at both of the clinical study sites. If each of the clinical study sites uses two different raters, then we might obtain data like those shown in figure 11. We note that the κ at Site 1 (0.82) appears to be higher than the κ at Site 2 (0.55).

[Figure 11. Comparing Kappas: Site 1, Rater A vs. Rater B, cell counts 49, 3, 6, and 42, κ = 0.82 (95% CI 0.71 to 0.93); Site 2, Rater C vs. Rater D, cell counts 51, 21, 1, and 27, κ = 0.55 (95% CI 0.40 to 0.70).]

Some statistical analysis software packages, e.g. SAS, will perform a statistical test to determine whether two or more kappas are statistically significantly different from each other. These analyses also give an overall or pooled value for κ, which takes into account the stratified nature of the data. In this case, the pooled value for κ is 0.72, with a 95% CI of 0.63 to 0.81. For this pooled κ to be meaningful, however, we must be sure that the two sites have statistically equivalent interrater reliability. In this case, SAS conducts a test, descriptively named the test for equal kappa coefficients, which demonstrates that the values of κ at the two sites are highly statistically significantly different, with a p value of 0.005. We would conclude from these data that the interrater reliability at Site 1 is better than at Site 2.

Non-Reliability Tests for 2 x 2 Tables with Paired Measurements

There are a number of tests which are used with 2 x 2 tables and paired observations which, nonetheless, are not appropriate for reliability and validity assessment. The most common of these tests is McNemar's test, which tests the null hypothesis that the number of disagreements in each direction is the same. It is essentially a test of symmetry of disagreement. Cochran's Q test is an extension of McNemar's test for use when there are multiple 2 x 2 tables. It is, in essence, a stratified test for homogeneity of symmetry.

Examples of Published Reliability Studies

Please find below the abstracts from two representative reliability studies; they illustrate many of the issues discussed above. In addition, a very nice example of a reliability and validity study is Kothari RU, Pancioli A, Liu T, Brott T, Broderick J. Cincinnati prehospital stroke scale: Reproducibility and validity. Ann Emerg Med 1999;33:373-378. Another good example is Mahadevan S, Mower WR, Hoffman JR, Peeples N, Goldberg W, Sonner R. Interrater reliability of cervical spine injury criteria in patients with blunt trauma. Ann Emerg Med 1998;31:197-201.


Example 1. Interrater Reliability of the Coma Recovery Scale (J Head Trauma Rehabil 1996;11:61-66). Objective: To assess interrater reliability (IRR) of the Coma Recovery Scale (CRS), a tool to evaluate recovery from severe brain injury and efficacy of interventions in minimally responsive patients. Design: Simultaneous CRS assessments of persons with brain injury were performed by 3 rehabilitation professionals of differing disciplines. CRS scores were independently assigned, with 1 rater administering the CRS and the other 2 observing. Setting: Acute inpatient brain injury rehabilitation and long-term care units. Patients: A convenience sample of 18 persons with severe brain injuries at Rancho levels 1 to 4. Results: Spearman's correlation coefficients for CRS subscores were very strong among the 3 raters (r = .60 to .96, all P < .01). The kappa statistic calculated for the overall CRS was 0.69, indicating substantial agreement. Conclusion: We conclude that the IRR of the CRS is adequate for use with individuals at Rancho Levels 1 to 4. Further research should examine...

Example 2. Preventable Death Classification: Interrater Reliability and Comparison with ISS-Based Survival Probability Estimates (Accid Anal and Prev 1995;27:199-206). The purpose of the study was to compare the injury-related threat to survival estimated by the Injury Severity Score (ISS) and a committee of experts. The charts of 116 patients with severe injuries (73 fatalities and 43 survivors) were reviewed. A committee of nine clinicians classified each case as survivable, potentially survivable, or nonsurvivable based on anatomical descriptors, mechanism of injury, and patient's age. Majority vote was used to determine the final committee classification. Based on the ISS values, cases were classified as survivable (9-24), potentially survivable (25-49), and nonsurvivable (>49). The results showed poor interrater reliability among the nine clinicians, with an overall intraclass correlation coefficient of 0.43. The ISS-based classification had high agreement with the final committee classification (overall weighted kappa = 0.71). This study demonstrated no additional benefit of using a committee to classify injury severity on the basis of anatomical damage over applying ISS-based survival probabilities.

Conclusions Since medical research is primarily a process of measurement, we should be careful to assess the reliability and validity of the measurements we make. Reliability and validity assessment involves both qualitative assessment (considering intrinsic versus interrater reliability, deciding on the type of validity to consider, reflection on the purpose of the measurement, consideration of the availability and appropriateness of a criterion standard, etc.) and quantitative assessment using a variety of statistical methods and measures.


References

General References

Fleiss JL. Statistical Methods for Rates and Proportions. Second Edition. John Wiley & Sons, New York, 1981.
Sachs L. Applied Statistics: A Handbook of Techniques. Second Edition. Springer-Verlag, New York, 1984.
Agresti A. An Introduction to Categorical Data Analysis. John Wiley & Sons, New York, 1996.
Agresti A. Categorical Data Analysis. John Wiley & Sons, New York, 1990.
Karras DJ. Statistical methodology: II. Reliability and validity assessment in study design, Part A. Acad Emerg Med 1997;4:64-71.
Karras DJ. Statistical methodology: II. Reliability and validity assessment in study design, Part B. Acad Emerg Med 1997;4:144-149.

Specialized References

Shrout PE. Measurement reliability and agreement in psychiatry. Stat Methods in Medical Research 1998;7:301-317.
Fleiss JL. Measuring agreement between two judges on the presence or absence of a trait. Biometrics 1975;31:651-659.
Fleiss JL, Cohen J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educ Psychol Meas 1973;33:613-619.
Fleiss JL, Cohen J, Everitt BS. Large-sample standard errors of kappa and weighted kappa. Psychol Bull 1969;72:323-327.
Masanic CA, Bayley MT. Interrater reliability of neurologic soft signs in an acquired brain injury population. Arch Phys Med Rehabil 1998;79:811-815.
Bland JM, Altman DG. A note on the use of the intraclass correlation coefficient in the evaluation of agreement between two methods of measurement. Computers in Biology and Medicine 1990;20:337-340.
Brenner H, Kliebsch U. Dependence of weighted kappa coefficients on the number of categories. Epidemiology 1996;7:199-202.
O'Dell MW, Jasin P, Stivers M, Lyons N, Schmidt S, Moore DE. Interrater reliability of the coma recovery scale. J Head Trauma Rehabil 1996;11:62-66.


Sampalis JS, Boukas S, Nikolis A, Lavoie A. Preventable death classification: Interrater reliability and comparison with ISS-based survival probability estimates. Accid Anal and Prev 1995;27:199-206.
Kothari RU, Pancioli A, Liu T, Brott T, Broderick J. Cincinnati prehospital stroke scale: Reproducibility and validity. Ann Emerg Med 1999;33:373-378.
Ridout MS, Demetrio CGB, Firth D. Estimating intraclass correlation for binary data. Biometrics 1999;55:137-148.
Mahadevan S, Mower WR, Hoffman JR, Peeples N, Goldberg W, Sonner R. Interrater reliability of cervical spine injury criteria in patients with blunt trauma. Ann Emerg Med 1998;31:197-201.
Tracy K, Adler LA, Rotrosen J, Edson R, Lavori P. Interrater reliability issues in multicenter trials, Part I: Theoretical concepts and operational procedures used in Department of Veterans Affairs cooperative study #394. Psychopharmacology Bulletin 1997;33:53-57.
Edson R, Lavori P, Tracy K, Adler LA, Rotrosen J. Interrater reliability issues in multicenter trials, Part II: Statistical procedures used in Department of Veterans Affairs cooperative study #394. Psychopharmacology Bulletin 1997;33:59-67.
Johnson CJ, Kittner SJ, McCarter RJ, Sloan MA, Stern BJ, Buchholz D, Price TR. Interrater reliability of an etiologic classification of ischemic stroke. Stroke 1995;26:46-51.
Gray SL, Nance AC, Williams DM, Pulliam CC. Assessment of interrater and intrarater reliability in the evaluation of metered dose inhaler technique. Chest 1994;105:710-714.
Vogel HP. Influence of additional information on interrater reliability in the neurologic examination. Neurology 1992;42:2076-2081.
Brennan PF, Hays BJ. The kappa statistic for establishing interrater reliability in the secondary analysis of qualitative clinical data. Research in Nursing & Health 1992;15:153-158.
Posner KL, Sampson PD, Caplan RA, Ward RJ. Measuring interrater reliability among multiple raters: An example of methods for nominal data. Statistics in Medicine 1990;9:1103-1115.

