
Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach
Author(s): Elizabeth R. DeLong, David M. DeLong, Daniel L. Clarke-Pearson
Source: Biometrics, Vol. 44, No. 3 (Sep., 1988), pp. 837-845
Published by: International Biometric Society
Stable URL: http://www.jstor.org/stable/2531595

BIOMETRICS 44, 837-845
September 1988

Comparing the Areas Under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach

Elizabeth R. DeLong
Quintiles, Inc., 1829 East Franklin Street, Chapel Hill, North Carolina 27514, U.S.A.

David M. DeLong
SAS Institute, Cary, North Carolina 27511, U.S.A.

and

Daniel L. Clarke-Pearson
Division of Oncology, Department of OBGYN, Duke University Medical Center, Durham, North Carolina 27710, U.S.A.

SUMMARY

Methods of evaluating and comparing the performance of diagnostic tests are of increasing importance as new tests are developed and marketed. When a test is based on an observed variable that lies on a continuous or graded scale, an assessment of the overall value of the test can be made through the use of a receiver operating characteristic (ROC) curve. The curve is constructed by varying the cutpoint used to determine which values of the observed variable will be considered abnormal and then plotting the resulting sensitivities against the corresponding false positive rates. When two or more empirical curves are constructed based on tests performed on the same individuals, statistical analysis on differences between curves must take into account the correlated nature of the data. This paper presents a nonparametric approach to the analysis of areas under correlated ROC curves, by using the theory on generalized U-statistics to generate an estimated covariance matrix.

Key words: Jackknifing; Mann-Whitney test; Receiver operating characteristic (ROC) curve; Structural components; U-statistics.

1. Introduction

Methods of evaluating and comparing the performance of diagnostic tests or indices are of increasing importance as new tests or indices are developed or measured. When a test is based on an observed variable that lies on a continuous or graded scale, an assessment of the overall value of the test can be made through the use of a receiver operating characteristic (ROC) curve (Hanley and McNeil, 1982; Metz, 1978). The underlying population curve is theoretically given by varying the cutpoint used to determine the values of the observed variable to be considered abnormal and then plotting the resulting sensitivities against the corresponding false positive rates. If a test could perfectly discriminate, it would have a value above which the entire abnormal population would fall and below which all normal values would fall (or vice versa). The curve would then pass through the point (0, 1) on the unit grid. The closer an ROC curve comes to this ideal point, the better its discriminating ability. A test with no discriminating ability will produce a curve that follows the diagonal of the grid.

For statistical analysis, a recommended index of accuracy associated with an ROC curve is the area under the curve (Swets and Pickett, 1982). The area under the population ROC curve represents the probability that, when the variable is observed for a randomly selected individual from the abnormal population and a randomly selected individual from the normal population, the resulting values will be in the correct order (e.g., abnormal value higher than the normal value). Generally, parametric assumptions are applied on the distributions of the observed variable in the normal and the abnormal populations. Maximum likelihood programs for estimating the area under the curve and relevant parameters under a binormal model assumption have been widely employed (Dorfman and Alf, 1969; Metz, 1978; Swets and Pickett, 1982) in order to estimate this area, although these distributions cannot be uniquely determined from the ROC curve. The methodology has been extended (Metz, Wang, and Kronman, 1984) to a "bivariate binormal" model for testing differences between correlated sample ROC curves that arise, for example, when different diagnostic tests are performed on the same individuals.

This paper addresses the nonparametric comparison of areas under correlated ROC curves. When calculated by the trapezoidal rule, the area falling under the points comprising an empirical ROC curve has been shown to be equal to the Mann-Whitney U-statistic for comparing distributions of values from the two samples (Bamber, 1975). Although the trapezoidal rule systematically underestimates the true area (Hanley and McNeil, 1982; Swets and Pickett, 1982) when the number of distinct values taken on by a discrete-valued diagnostic variable is small (say, 5 or 6), it nonetheless produces a meaningful statistic that can be used with confidence when the variable takes on a larger number of values. Hanley and McNeil (1983) use some properties of this nonparametric statistic to compare areas under ROC curves arising from two measures applied to the same individuals.
Their approach involves calculating for both the normal and the abnormal sample the correlation between the values of the original measures. The average of the two correlations is used along with the average of the areas under the two curves to arrive at an estimated correlation between the two areas. A table that applies when the average area is at least .70 is given. However, for measures that are not continuous or nearly so, their method relies on Gaussian modeling assumptions for estimating the variances of the two areas.

In Section 2 we present an alternative methodology using a more completely nonparametric approach which exploits the properties of the Mann-Whitney statistic. Section 3 presents an example of three correlated ROC curves derived from data on ovarian cancer patients undergoing surgery for bowel obstruction. Three different prognostic indices are evaluated and compared.

2. Analysis of Areas Under Correlated ROC Curves

Suppose a sample of N individuals undergo a test for predicting an event of interest or determining presence or absence of a medical condition and that the test is based on a continuous-valued diagnostic variable. We will follow the convention that higher values of the test variable are assumed to be associated with the event of interest, e.g., positive disease status. Also suppose it can be determined by means independent of the test that m of these individuals truly undergo the event or have the condition. Let this group be denoted by $C_1$ and let the group of n (= N - m) individuals who do not have the condition be denoted by $C_2$. Let $X_i$, i = 1, 2, ..., m and $Y_j$, j = 1, 2, ..., n be the values of the variable on which the diagnostic test is based for members of $C_1$ and $C_2$, respectively. These outcome values can be used to construct an empirical ROC curve for assessing the diagnostic performance of the test. For any real number z, let

$$\mathrm{sens}(z) = \frac{1}{m} \sum_{i=1}^{m} I(X_i \ge z),$$

where I(A) = 1 if A is true and 0 otherwise. Also let

$$\mathrm{spec}(z) = \frac{1}{n} \sum_{j=1}^{n} I(Y_j < z).$$


Then sens(z) is the empirical sensitivity of a test that is derived by dichotomizing the variable into positive or negative results on the basis of the cutpoint z and spec(z) is the corresponding empirical specificity. Now, as z varies over the possible values of the variable, the empirical ROC curve is a plot of sens(z) versus [1 - spec(z)]. Clearly, when z is larger than the largest possible value, the curve passes through (0, 0) and it monotonically increases to the point (1, 1) as z decreases to the smallest possible value. To be informative, the entire curve should lie above the 45° line where sens(z) = 1 - spec(z). Selection of an optimal cutpoint depends on a cost function of sensitivity and specificity.

It has been shown that the area under an empirical ROC curve, when calculated by the trapezoidal rule, is equal to the Mann-Whitney two-sample statistic applied to the two samples $\{X_i\}$ and $\{Y_j\}$. Because the Mann-Whitney statistic is a generalized U-statistic, statistical analysis regarding the performance of diagnostic tests can be performed by exploiting the general theory for U-statistics. The Mann-Whitney statistic estimates the probability, $\theta$, that a randomly selected observation from the population represented by $C_2$ will be less than or equal to a randomly selected observation from the population represented by $C_1$. It can be computed as the average over a kernel, $\psi$, as

$$\hat\theta = \frac{1}{mn} \sum_{j=1}^{n} \sum_{i=1}^{m} \psi(X_i, Y_j),$$

where

$$\psi(X, Y) = \begin{cases} 1 & Y < X, \\ \tfrac{1}{2} & Y = X, \\ 0 & Y > X. \end{cases}$$
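The equality between the kernel average and the trapezoidal area can be checked numerically. The following is a minimal Python sketch, not code from the paper: the function names (`psi`, `mann_whitney_auc`, `trapezoidal_auc`) and the small data set are hypothetical, and exact fractions are used only so that the two results can be compared without rounding.

```python
from fractions import Fraction

def psi(x, y):
    """Kernel from the text: 1 if y < x, 1/2 if y == x, 0 if y > x."""
    if y < x:
        return Fraction(1)
    if y == x:
        return Fraction(1, 2)
    return Fraction(0)

def mann_whitney_auc(xs, ys):
    """Average of psi over all (X_i, Y_j) pairs."""
    total = sum(psi(x, y) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def trapezoidal_auc(xs, ys):
    """Area under the empirical ROC curve by the trapezoidal rule."""
    m, n = len(xs), len(ys)
    # Move the cutpoint z from the largest observed value down to the smallest.
    cuts = sorted(set(xs) | set(ys), reverse=True)
    points = [(Fraction(0), Fraction(0))]  # z above every observation
    for z in cuts:
        sens = Fraction(sum(x >= z for x in xs), m)
        fpr = Fraction(sum(y >= z for y in ys), n)  # equals 1 - spec(z)
        points.append((fpr, sens))
    area = Fraction(0)
    for (f0, s0), (f1, s1) in zip(points, points[1:]):
        area += (f1 - f0) * (s0 + s1) / 2
    return area

# Hypothetical readings with ties, for illustration only.
xs = [3, 5, 5, 7]     # group C1 (condition present)
ys = [1, 3, 4, 5, 6]  # group C2 (condition absent)

auc_kernel = mann_whitney_auc(xs, ys)
auc_trapezoid = trapezoidal_auc(xs, ys)
print(auc_kernel, auc_trapezoid)
```

Ties are included in the toy data on purpose, since the value 1/2 assigned by the kernel to ties is what makes the two computations agree.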

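In terms of probabilities, $E(\hat\theta) = \theta = \Pr(Y < X) + \tfrac{1}{2}\Pr(X = Y)$. For continuous distributions, $\Pr(Y = X) = 0$.

Asymptotic normality and an expression for the variance of the Mann-Whitney statistic can be derived from theory developed for generalized U-statistics by Hoeffding (1948). Define

$$\begin{aligned}
\xi_{10} &= E[\psi(X_i, Y_j)\,\psi(X_i, Y_k)] - \theta^2, \quad j \ne k; \\
\xi_{01} &= E[\psi(X_i, Y_j)\,\psi(X_k, Y_j)] - \theta^2, \quad i \ne k; \\
\xi_{11} &= E[\psi(X_i, Y_j)\,\psi(X_i, Y_j)] - \theta^2.
\end{aligned} \qquad (1)$$

Then

$$\mathrm{var}(\hat\theta) = \frac{(n - 1)\xi_{10} + (m - 1)\xi_{01}}{mn} + \frac{\xi_{11}}{mn}. \qquad (2)$$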

Bamber (1975) provides a method of estimating the variance in the context of testing the significance of a single ROC curve. Bamber introduces a quantity $B_{XXY}$, which is the probability that two randomly chosen elements of the population $C_1$ will both be greater than or less than a randomly chosen element of $C_2$, minus the complementary probability that the observation from $C_2$ will be between the two from $C_1$. A similar quantity $B_{YYX}$ is also defined and the variance of $\hat\theta$ is given in terms of $B_{XXY}$ and $B_{YYX}$. $\mathrm{Var}(\hat\theta)$ is then estimated by empirically estimating $B_{YYX}$ and $B_{XXY}$. Formula (2) can be shown to be equivalent to Bamber's formula (4), which derives from work of Noether (1967) and applies when X and Y are not necessarily continuous.

Hoeffding's theory extends to a vector of U-statistics. Let $\hat\theta = (\hat\theta^1, \hat\theta^2, \ldots, \hat\theta^k)$ be a vector of statistics, representing the areas under the ROC curves derived from the readings $\{X_i^r\}$, $\{Y_j^r\}$ (i = 1, ..., m; j = 1, ..., n; 1 ≤ r ≤ k) of k different diagnostic measures. Then, similar to (1) above, define

$$\begin{aligned}
\xi_{10}^{rs} &= E[\psi(X_i^r, Y_j^r)\,\psi(X_i^s, Y_k^s)] - \theta^r\theta^s, \quad j \ne k; \\
\xi_{01}^{rs} &= E[\psi(X_i^r, Y_j^r)\,\psi(X_k^s, Y_j^s)] - \theta^r\theta^s, \quad i \ne k; \\
\xi_{11}^{rs} &= E[\psi(X_i^r, Y_j^r)\,\psi(X_i^s, Y_j^s)] - \theta^r\theta^s.
\end{aligned} \qquad (3)$$

The covariance of the rth and sth statistic can then be written as

$$\mathrm{cov}(\hat\theta^r, \hat\theta^s) = \frac{(n - 1)\xi_{10}^{rs} + (m - 1)\xi_{01}^{rs}}{mn} + \frac{\xi_{11}^{rs}}{mn}.$$

Sen (1960) has provided a method of structural components that yields consistent estimates of the elements of the variance-covariance matrix of a vector of U-statistics. This approach turns out to be equivalent to jackknifing, but is conceptually simpler when dealing with U-statistics. We will exploit this methodology to compare the areas under two or more ROC curves. For the rth statistic, $\hat\theta^r$, the X-components and Y-components are defined, respectively, as

$$V_{10}^r(X_i) = \frac{1}{n} \sum_{j=1}^{n} \psi(X_i^r, Y_j^r) \quad (i = 1, 2, \ldots, m)$$

and

$$V_{01}^r(Y_j) = \frac{1}{m} \sum_{i=1}^{m} \psi(X_i^r, Y_j^r) \quad (j = 1, 2, \ldots, n).$$

Also define the k x k matrix $S_{10}$ such that the (r, s)th element is

$$s_{10}^{rs} = \frac{1}{m - 1} \sum_{i=1}^{m} [V_{10}^r(X_i) - \hat\theta^r][V_{10}^s(X_i) - \hat\theta^s]$$

and similarly $S_{01}$, which has (r, s)th element

$$s_{01}^{rs} = \frac{1}{n - 1} \sum_{j=1}^{n} [V_{01}^r(Y_j) - \hat\theta^r][V_{01}^s(Y_j) - \hat\theta^s].$$

The estimated covariance matrix for the vector of parameter estimates, $\hat\theta = (\hat\theta^1, \hat\theta^2, \ldots, \hat\theta^k)$, is thus

$$S = \frac{1}{m} S_{10} + \frac{1}{n} S_{01}.$$
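The structural-components recipe above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' SAS program: the function name `structural_covariance`, the data layout (one list of readings per measure, with individuals in the same order across measures), and the toy readings are all hypothetical.

```python
def psi(x, y):
    """Mann-Whitney kernel: 1 if y < x, 1/2 if y == x, 0 if y > x."""
    return 1.0 if y < x else (0.5 if y == x else 0.0)

def structural_covariance(X, Y):
    """Return (theta, S) for k correlated measures.

    X[r] holds the m readings of measure r in group C1,
    Y[r] holds the n readings of measure r in group C2.
    """
    k, m, n = len(X), len(X[0]), len(Y[0])
    V10, V01, theta = [], [], []
    for r in range(k):
        # X-components: one value per member of C1.
        v10 = [sum(psi(x, y) for y in Y[r]) / n for x in X[r]]
        # Y-components: one value per member of C2.
        v01 = [sum(psi(x, y) for x in X[r]) / m for y in Y[r]]
        V10.append(v10)
        V01.append(v01)
        theta.append(sum(v10) / m)  # the mean of either component set is the area

    def cross(V, r, s, size):
        """(r, s) element of the sample covariance of the components."""
        return sum((V[r][i] - theta[r]) * (V[s][i] - theta[s])
                   for i in range(size)) / (size - 1)

    S = [[cross(V10, r, s, m) / m + cross(V01, r, s, n) / n
          for s in range(k)] for r in range(k)]
    return theta, S

# Hypothetical readings: 3 members of C1 and 4 of C2 on 2 measures.
X = [[3, 5, 7], [2, 6, 4]]
Y = [[1, 4, 5, 2], [3, 1, 5, 2]]

theta, S = structural_covariance(X, Y)
print(theta)
print(S)
```

The double loop over pairs costs on the order of k·m·n kernel evaluations, which is adequate for small samples; a sort-based count would be the natural refinement for larger data.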

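Let g be a real-valued function of $\theta$ that has bounded second derivatives in a neighborhood of $\theta$. Combining results from Sen (1960) and Arveson (1969, Theorem 16), it follows that if $\lim_{N \to \infty} m/n$ is bounded and nonzero, then $N^{1/2}[g(\hat\theta) - g(\theta)]$ is asymptotically normally distributed with mean 0 and variance $\sigma_g^2$, where

$$\sigma_g^2 = \lim_{N \to \infty} N \sum_{j=1}^{k} \sum_{i=1}^{k} \frac{\partial g}{\partial \theta^i} \frac{\partial g}{\partial \theta^j} \left( \frac{1}{m}\xi_{10}^{ij} + \frac{1}{n}\xi_{01}^{ij} \right),$$

with $\xi_{10}^{ij}$ and $\xi_{01}^{ij}$ as in (3). Further,

$$s_g^2 = N \sum_{j=1}^{k} \sum_{i=1}^{k} \frac{\partial g}{\partial \theta^i} \frac{\partial g}{\partial \theta^j} \left( \frac{1}{m}s_{10}^{ij} + \frac{1}{n}s_{01}^{ij} \right),$$

where $s_{10}^{ij}$ and $s_{01}^{ij}$ are the (i, j)th elements of $S_{10}$ and $S_{01}$, is a consistent estimate of $\sigma_g^2$. When g is simply a linear function, the theory reduces considerably, because the partial derivatives are the constants that comprise the linear function. Thus, for any contrast $L\theta'$, where L is a row vector of coefficients,

$$\frac{L\hat\theta' - L\theta'}{\left[ L \left( \frac{1}{m}S_{10} + \frac{1}{n}S_{01} \right) L' \right]^{1/2}}$$

has a standard normal distribution. A confidence interval for $L\theta'$ naturally follows.

By a modest generalization of these results, we can also apply any set of linear contrasts to a vector of areas under correlated ROC curves and perform a test of significance on $L\theta'$. The test then takes the form

$$(\hat\theta - \theta) L' \left[ L \left( \frac{1}{m}S_{10} + \frac{1}{n}S_{01} \right) L' \right]^{-1} L (\hat\theta - \theta)', \qquad (5)$$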

which has a chi-square distribution with degrees of freedom equal to the rank of $LSL'$. A confidence region can also be constructed.

A computer program written in the SAS language is available from the authors for computing components, covariance matrices, and contrasts. However, as indicated in the next section, the components can be computed easily by hand or by a simple computer program. The components can then be input to any program which computes sums of squares and cross-products in order to obtain the covariance matrix S.

3. Example

When to perform surgical correction of intestinal obstruction in patients known to have ovarian carcinoma is an unresolved problem. The dilemma centers around determining those patients for whom surgery presents a benefit. Castelado et al. (1981) and other authors have proposed that patients who survive longer than 2 months postoperatively can be declared to have "benefited" from the surgery. Using this criterion, Krebs and Goplerud (1983) devised a preoperative scoring system for use as a screening test in determining a patient's risk for failing to benefit from surgery. The scoring algorithm is presented in Table 1. According to this scoring system, patients with low scores should be good candidates for surgery and those with higher scores should be considered at risk for failing to benefit from surgery.

Table 1
Krebs-Goplerud scoring system for prognostic parameters in ovarian carcinoma complicated by bowel obstruction

Parameter                                                    Assigned risk score
Age (yr)
  <45                                                        0
  45-65                                                      1
  >65                                                        2
Nutritional status (deprivation)
  None or minimal                                            0
  Moderate                                                   1
  Severe                                                     2
Tumor status
  No palpable intra-abdominal masses                         0
  Palpable intra-abdominal masses                            1
  Liver involvement or distant metastases                    2
Ascites
  None or mild (asymptomatic, abdomen not distended)         0
  Moderate (abdomen distended)                               1
  Severe (symptomatic, requires frequent paracentesis)       2
Previous chemotherapy
  None, or no adequate trial                                 0
  Failed single-drug therapy                                 1
  Failed combination-drug therapy                            2
Previous radiation therapy
  None                                                       0
  Radiation therapy to pelvis                                1
  Radiation therapy to whole abdomen                         2

The following example evaluates the discriminating ability of the proposed screening algorithm on 49 consecutive ovarian cancer patients undergoing correction of intestinal obstruction at Duke University Medical Center. Of the 49 patients, 12 survived more than 2 months postoperatively and could be considered surgical successes; the remaining 37 are considered failures. The Krebs-Goplerud score (K-G) is compared against two other preoperatively measured indices: total protein (TP) and albumin (ALB), both of which are positively associated with the patient's nutritional status. Because ALB is one component of TP, these two measures are highly correlated, with a Kendall's tau-b value of .65. Increasing levels of ALB and TP are associated with better nutritional status, whereas increasing levels of K-G are associated with poorer prognosis. Thus, to simplify computations, we transformed K-G by subtracting it from 12, the maximum possible value, so that all indices would prognosticate in the same direction.

Figure 1 displays the empirical ROC curves for the three indices. From this figure, it appears that K-G offers little improvement over either ALB or TP. The estimated areas under the curves for K-G, ALB, and TP are .69, .72, and .65, respectively. To analyze and compare these areas, the covariance matrix for the vector of areas is needed. The method of structural components easily produces this matrix. For each of the variables of interest (K-G, ALB, TP), we can denote by $X^r$ (r = 1, 2, 3) the values associated with success and by $Y^r$ (r = 1, 2, 3) the values associated with surgical failures. Then, $\theta^r = \Pr(Y^r < X^r) + \tfrac{1}{2}\Pr(Y^r = X^r)$ and we compute the components individually for each of the three variables.

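If the data are first sorted by the variable of interest, it is a simple matter to calculate for each X the number of Y's less than X ($NYL_X$) and the number of Y's equal to X ($NYEQ_X$). The component for X is then $NYL_X + \tfrac{1}{2}NYEQ_X$, divided by the number of Y's (37 here) to match the definition of $V_{10}$ in Section 2. Likewise, for each Y we calculate the number of X's greater than Y ($NXG_Y$) and the number of X's equal to Y ($NXEQ_Y$). The component for Y is $NXG_Y + \tfrac{1}{2}NXEQ_Y$, divided by the number of X's (12 here).

For this example, there are 12 X's and three variables of interest, so the X-components form a 12 x 3 matrix, $V_{10}$. The 37 Y's yield a component matrix of dimension 37 x 3, $V_{01}$. The 3 x 3 matrices $S_{10}$ and $S_{01}$ are then computed as

$$S_{10} = \frac{1}{11}\left( V_{10}'V_{10} - 12\,\hat\theta'\hat\theta \right)$$

and

$$S_{01} = \frac{1}{36}\left( V_{01}'V_{01} - 37\,\hat\theta'\hat\theta \right).$$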
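The sort-and-count shortcut and the cross-product formula can be illustrated with Python's standard `bisect` module. This is a hedged sketch rather than the authors' program: the function names and the toy data (3 "successes" and 4 "failures" on 2 variables) are hypothetical, and the counts are divided by the opposite group's size so that the components match the definitions of $V_{10}$ and $V_{01}$ in Section 2.

```python
from bisect import bisect_left, bisect_right

def x_components(xs, ys):
    """(NYL + NYEQ/2) / n for each X, using a sorted copy of the Y's."""
    ys_sorted = sorted(ys)
    n = len(ys_sorted)
    comps = []
    for x in xs:
        nyl = bisect_left(ys_sorted, x)           # number of Y's less than x
        nyeq = bisect_right(ys_sorted, x) - nyl   # number of Y's equal to x
        comps.append((nyl + 0.5 * nyeq) / n)
    return comps

def y_components(xs, ys):
    """(NXG + NXEQ/2) / m for each Y, using a sorted copy of the X's."""
    xs_sorted = sorted(xs)
    m = len(xs_sorted)
    comps = []
    for y in ys:
        upper = bisect_right(xs_sorted, y)
        nxeq = upper - bisect_left(xs_sorted, y)  # number of X's equal to y
        nxg = m - upper                           # number of X's greater than y
        comps.append((nxg + 0.5 * nxeq) / m)
    return comps

def cov_from_crossproducts(V, theta):
    """(V'V - size * theta'theta) / (size - 1) for a size x k component matrix V."""
    size, k = len(V), len(theta)
    return [[(sum(row[r] * row[s] for row in V) - size * theta[r] * theta[s])
             / (size - 1) for s in range(k)] for r in range(k)]

# Hypothetical data, for illustration only.
X = [[3, 5, 7], [2, 6, 4]]
Y = [[1, 4, 5, 2], [3, 1, 5, 2]]

cols10 = [x_components(X[r], Y[r]) for r in range(2)]
V10 = [list(row) for row in zip(*cols10)]         # 3 x 2 matrix of X-components
theta = [sum(col) / len(col) for col in cols10]   # estimated areas
S10 = cov_from_crossproducts(V10, theta)
print(V10, theta, S10)
```

Sorting once and using binary search replaces the pairwise comparison of every X with every Y, which is the computational point of the "sort first" remark in the text.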

Figure 1. Receiver operating characteristic curves for Krebs-Goplerud score, total protein, and albumin (horizontal axis: false positive rate).

It is clear that $S_{10}$ and $S_{01}$ are the covariance matrices of $V_{10}$ and $V_{01}$, respectively. They can readily be obtained from any computer program that computes covariance matrices. The covariance matrix for the vector of areas is then

$$S = \frac{1}{12} S_{10} + \frac{1}{37} S_{01}.$$

Table 2
Estimated covariance matrix between areas under the three ROC curves

Covariance       K-G score   Albumin   Total protein
K-G score        .0110 .0028 .0033
Albumin          .0076 .0086
Total protein    .0100

Table 3
Correlation coefficients of pairs of areas calculated from estimated covariance matrix (ECM) and also from method of Hanley and McNeil (HM)

Correlation   Kendall's tau-b   Kendall's tau-b   Correlation
Survivors   Nonsurvivors   (ECM)   (HM)
.34   .20   .18   .17   K-G, ALB
.27   .21   -.01  .10   K-G, TP
.61   .66   .82   .61   ALB, TP

This matrix is displayed in Table 2. In Table 3, the resulting correlation coefficients are presented, along with Kendall's tau-b values for the group that benefited from surgery and for the remaining group, and finally the estimated correlations derived from the table in the paper by Hanley and McNeil (1983). For this set of data, our estimates tend to be larger.

Now, to compare K-G to the average of ALB and TP, we use the contrast L = (1, -.5, -.5). Evaluated at $\hat\theta$, the value of the contrast is .004. The standard deviation of this estimate is

$$(LSL')^{1/2} = .116.$$

A two-sided 95% confidence interval for this contrast is thus (-.223, .231), indicating negligible improvement by K-G over ALB and TP. To determine whether the Krebs-Goplerud score is better than at least one of the other indices, ALB and TP, we use the contrast

$$L = \begin{pmatrix} 1 & -1 & 0 \\ 1 & 0 & -1 \end{pmatrix}.$$

Then based on (5), the $\chi^2$ statistic with 2 degrees of freedom can be computed as 1.51 with a P-value of .47. Based on this sample of 49 patients, there appears to be no advantage in using the Krebs-Goplerud score over other routinely collected nutritional parameters, although power in this situation is likely to be very small because of the small sample size.
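The arithmetic behind the confidence interval and the P-value can be reproduced with a short Python sketch. The helper names are hypothetical, the two-contrast restriction (closed-form 2 x 2 inverse) is a simplification chosen for brevity, and the `theta_demo` and `S_demo` values are invented for illustration; only the numbers .004, .116, and 1.51 are taken from the text. For 2 degrees of freedom the chi-square upper tail probability has the closed form exp(-x/2), which is what the sketch uses.

```python
import math

def quad_form(a, S, b):
    """a S b' for row vectors a and b and a square matrix S."""
    k = len(S)
    return sum(a[r] * S[r][s] * b[s] for r in range(k) for s in range(k))

def contrast_interval(estimate, sd, z=1.96):
    """Two-sided normal-theory interval: estimate +/- z * sd."""
    return estimate - z * sd, estimate + z * sd

def chi_square_two_contrasts(L, theta, S):
    """Statistic d (L S L')^{-1} d' with d = L theta', for a 2 x k matrix L.

    Returns (statistic, p_value); the p-value uses the 2-df closed form.
    """
    d = [sum(c * t for c, t in zip(row, theta)) for row in L]
    a = quad_form(L[0], S, L[0])
    b = quad_form(L[0], S, L[1])
    c = quad_form(L[1], S, L[1])
    det = a * c - b * b
    stat = (c * d[0] ** 2 - 2 * b * d[0] * d[1] + a * d[1] ** 2) / det
    return stat, math.exp(-stat / 2)

# Numbers reported in the text: contrast value .004 with standard deviation .116.
low, high = contrast_interval(0.004, 0.116)
print(round(low, 3), round(high, 3))

# Reported chi-square statistic 1.51 on 2 degrees of freedom.
p_reported = math.exp(-1.51 / 2)
print(round(p_reported, 2))

# Invented areas and covariance matrix, only to exercise the function.
theta_demo = [0.80, 0.75, 0.70]
S_demo = [[0.010, 0.002, 0.002],
          [0.002, 0.010, 0.004],
          [0.002, 0.004, 0.010]]
L = [[1, -1, 0], [1, 0, -1]]
stat_demo, p_demo = chi_square_two_contrasts(L, theta_demo, S_demo)
print(stat_demo, p_demo)
```

With more than two contrasts a general matrix inverse and a chi-square distribution function would be needed; those are left out here to keep the sketch free of external dependencies.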

4. Discussion
ROC curves are frequently being applied to the evaluation of diagnostic or prognostic tests and indices. In order to make comparisons between two or more such indices derived from the same test units or subjects, the implicit correlation between the curves should be taken into account. This paper has presented a totally nonparametric approach to the comparison of the areas under two or more ROC curves by using the theory developed for generalized U-statistics. A covariance matrix can be estimated using the method of structural components and the resulting test statistic has asymptotically a chi-square distribution. The covariance matrix may also be used to construct confidence regions.
ACKNOWLEDGEMENTS

This work was supported in part by the Veterans' Administration Region 2 Health Services Research and Development Field Program.

RESUME

The importance of methods for evaluating and comparing the performance of diagnostic tests grows as new tests are developed and brought to market. When a test is based on an observed variable that is continuous or that takes its values on a graded scale, an overall assessment of the value of the test can be made using the receiver operating characteristic (ROC) curve. The curve is constructed by varying the cutpoint used to determine which values of the observed variable are to be considered abnormal, and then plotting the resulting sensitivities against the corresponding false positive rates. The correlated nature of the data must be taken into account in the statistical analysis of differences between curves when two or more empirical curves are constructed from tests based on the same individuals. This paper presents a nonparametric approach to the analysis of areas under correlated ROC curves, using the theory of generalized U-statistics to generate an estimated covariance matrix.



REFERENCES


Arveson, J. N. (1969). Jackknifing U-statistics. Annals of Mathematical Statistics 40, 2076-2100.
Bamber, D. (1975). The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology 12, 387-415.
Castelado, T. W., Petrilli, E. S., Ballon, S. C., and Lagasse, L. D. (1981). Intestinal operations in patients with ovarian carcinoma. American Journal of Obstetrics and Gynecology 139, 80-84.
Dorfman, D. D. and Alf, E. (1969). Maximum likelihood estimation of parameters of signal detection theory and determination of confidence intervals-rating-method data. Journal of Mathematical Psychology 6, 487-496.
Hanley, J. A. and McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29-36.
Hanley, J. A. and McNeil, B. J. (1983). A method of comparing the area under two ROC curves derived from the same cases. Radiology 148, 839-843.
Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics 19, 293-325.
Krebs, H. B. and Goplerud, D. R. (1983). Surgical management of bowel obstruction in advanced ovarian carcinoma. Obstetrics and Gynecology 61, 327-330.
Metz, C. E. (1978). Basic principles of ROC analysis. Seminars in Nuclear Medicine 8, 283-298.
Metz, C. E., Wang, P.-L., and Kronman, H. B. (1984). A new approach for testing the significance of differences between ROC curves measured from correlated data. In Information Processing in Medical Imaging VIII, F. Deconick (ed.), 432-445. The Hague: Martinus Nijhof.
Noether, G. E. (1967). Elements of Nonparametric Statistics. New York: Wiley.
Sen, P. K. (1960). On some convergence properties of U-statistics. Calcutta Statistical Association Bulletin 10, 1-18.
Swets, J. A. and Pickett, R. M. (1982). Evaluation of Diagnostic Systems: Methods from Signal Detection Theory. New York: Academic Press.

Received April 1987; revised October 1987 and January 1988.
