Generalizability theory
explicitly recognizes that there are multiple sources of error and that measures may have different reliabilities in different situations (e.g. different hospitals, genders, countries). It allows us to determine the sources of error using variance components and multilevel modelling techniques, and is an extension of the reliability model from two to more random and fixed factors.
Reliability and standard error / confidence intervals
Calculating the necessary number of items for a good reliability
Sample sizes for test-retest reliability studies
Reliability for categorical variables: kappa
From item scores to scale scores (total score, subscores)
Example
A new simple depression test was developed for use in psychiatric clinics to screen patients for depression. How can we determine a cut-off score that correctly classifies a patient as depressive? We could apply the new test to patients and ask psychiatrists to assess the same patients. If our test should classify patients similarly to psychiatrists, we can use e.g. ROC curves to determine the optimal cut-off score. The ROC curve allows us to evaluate the performance of tests that categorize cases into one of two groups.
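The mechanics behind a ROC curve can be sketched in a few lines: sweep every possible cut-off, record sensitivity and 1 - specificity at each one, and integrate the resulting curve. The scores and psychiatrist ratings below are made up for illustration; they are not the lecture's data.

```python
# Illustrative sketch: building a ROC curve by hand from raw test scores.

def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs for all cut-offs."""
    thresholds = sorted(set(scores), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical depression scores; 1 = rated depressive by the psychiatrist.
scores = [3, 5, 6, 8, 9, 10, 11, 12, 14, 15]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
pts = roc_points(scores, labels)
print(round(auc(pts), 2))  # -> 0.96
```

The curve starts at (0, 0) and ends at (1, 1); an area near 1 means the scores of the two groups barely overlap.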
ROC curve
[Figure: ROC curve (sensitivity vs. 1 - specificity) for the depression score, together with a histogram of depression scores. Diagonal segments are produced by ties.]

Area under the curve: .970, std. error .003, asymptotic 95% confidence interval .964 to .977.
Note: the test result variable Depression score has at least one tie between the positive and the negative actual state group.
Cut-off point
The choice of cut-off point depends on the importance of correct classification, the cost of misclassification, and the prevalence (the lower the prevalence, the higher the proportion of false positives among the positive results).
What is a good value for the area under the curve? 0.50 to 0.75 = fair; 0.75 to 0.92 = good; 0.92 to 0.97 = very good; 0.97 to 1.00 = excellent.
Cut-off point
Coordinates of the Curve
[Figure: ROC curve for the depression score; diagonal segments are produced by ties.]

Sensitivity   1 - Specificity
1.000         1.000
1.000         .898
.999          .813
.997          .683
.994          .596
.993          .529
.985          .372
.979          .306
.969          .240
.934          .134
.918          .092
.865          .034
.758          .015
.700          .013
.573          .005
.513          .002
.379          .001
.321          .001
.188          .000
.086          .000
.000          .000
10.5 seems to be a good classification cut-off point: a score above 10 is classified as depressive, a score of 10 or below as not depressive. 91.8% of the depressive patients are classified as depressive, and 9.2% of the non-depressed are misclassified as depressive. But the original score still carries more information (most of the wrongly classified patients score around 9 and 10): dichotomizing data causes up to 66% power loss in statistical analysis!
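One common (though not the lecture's stated) way to pick a cut-off from such a coordinate table is Youden's J = sensitivity + specificity - 1, maximized over the rows. Running it on the table above, the maximum falls on the row just after the 10.5 cut-off chosen in the text, with the two rows nearly tied:

```python
# Youden's J over the sensitivity / 1 - specificity pairs from the table above.
sens = [1.000, 1.000, .999, .997, .994, .993, .985, .979, .969, .934,
        .918, .865, .758, .700, .573, .513, .379, .321, .188, .086, .000]
fpr  = [1.000, .898, .813, .683, .596, .529, .372, .306, .240, .134,
        .092, .034, .015, .013, .005, .002, .001, .001, .000, .000, .000]

j = [s - f for s, f in zip(sens, fpr)]        # J = sensitivity - (1 - specificity)
best = max(range(len(j)), key=j.__getitem__)  # index of the best row
print(best, round(j[best], 3))
```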
[Figure: separate ROC curves for the depression score for males and females (sensitivity vs. 1 - specificity; diagonal segments are produced by ties), together with histograms of depression scores by gender.]
There seem to be no gender differences in the cut-off point. Use logistic regression to see the influence of age, gender and ethnicity on the classification: depressive (yes/no) ~ test score + sex + age + ethnicity.
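A minimal sketch of that model, on synthetic data (the data-generating assumption that only the test score matters, and all variable names, are invented for illustration), fitted with plain gradient descent rather than a statistics package:

```python
import math
import random

random.seed(1)
n = 200
X, y = [], []
for _ in range(n):
    score = random.uniform(0, 20)
    sex = random.randint(0, 1)
    age = random.uniform(18, 80)
    # Assumed truth: only the test score drives the diagnosis.
    p_true = 1 / (1 + math.exp(-(score - 10)))
    # Design row: intercept + roughly standardized predictors.
    X.append([1.0, (score - 10) / 5, sex, (age - 50) / 30])
    y.append(1 if random.random() < p_true else 0)

# Logistic regression by averaged gradient descent on the log-likelihood.
w = [0.0] * 4
for _ in range(1500):
    grad = [0.0] * 4
    for xi, yi in zip(X, y):
        p = 1 / (1 + math.exp(-sum(wk * xk for wk, xk in zip(w, xi))))
        for k in range(4):
            grad[k] += (p - yi) * xi[k]
    w = [wk - 0.5 * g / n for wk, g in zip(w, grad)]

print([round(wk, 2) for wk in w])  # the score coefficient should dominate
```

If sex and age (and, with real data, ethnicity) get coefficients near zero while the score coefficient is large, the classification does not depend on those covariates.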
Validity
Now we have a scale with very good reliability. But what about validity: is our scale measuring what it is supposed to measure? Next step: check whether our scale is not only reliable but also valid.
Validity
A valid measure is one which is measuring what it is supposed to measure. Validity refers to getting results that accurately reflect the concept being measured. Are we drawing valid conclusions from our measures: does a high score on our IQ scale really mean that the person is intelligent? Validity is the degree of confidence we can place in inferences about people based on their test scores. Validity implies reliability: reliability places the upper limit on the validity of a scale!
Validity Types
Three general types of validity, with several subcategories:
Expert validity: assessment that the items of a test are drawn from the domains being measured (takes place after the initial form of the scale has been developed)
Criterion validity: correlate the measure with a criterion measure known to be valid, e.g. an established test (other scales of the same or a similar measure are available)
Construct validity: examine whether the measure is related to other variables as required by theory, e.g. a depression score should change in response to a stressful life event (other scales of the same or a similar measure are not available)
Expert validity
Expert validity: do the experts agree?
Face validity: subjective assessment that the instrument/items appear to assess the desired qualities. Does the operationalization look like a good translation of the construct? Assessment by colleagues, friends, target subjects, clinicians. This is the weakest way to demonstrate construct validity.
Content validity: closely related to face validity, but a more rigorous assessment, done by an expert panel. It is concerned with sample-population representativeness: a subjective assessment that the instrument samples all the important contents/domains of the attribute.
Criterion validity
Criterion validity: Is the measure consistent with what we already know and what we expect?
Concurrent validity: correlate measurements of the new scale with a gold-standard criterion, both given at the same time, e.g. the new depression test and the Beck Depression Inventory. This allows for immediate results.
Predictive validity: correlate with a criterion which is not yet available, e.g. measuring aggressiveness and comparing it with reported aggressive acts in the following year, or intelligence and final exam results.
Criterion Validity
Steps for conducting a criterion-related validation study:
1. Identify a suitable criterion and a method for measuring it (for a new IQ test that aims to predict school performance, test scores should be correlated with exam results).
2. Identify an appropriate sample (students).
3. Correlate test scores and the criterion measure (exam results). The degree of relationship is the validity coefficient: if we estimate a Pearson correlation coefficient r (or an ICC), r is our validity coefficient. The interpretation is similar to reliability: the closer r is to 1, the better the validity.
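Step 3 can be sketched directly; the IQ scores and exam marks below are invented numbers, not real validation data:

```python
import math

# Validity coefficient = Pearson correlation between test and criterion.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

iq_scores  = [95, 100, 105, 110, 115, 120, 125, 130]   # hypothetical test scores
exam_marks = [55, 60, 58, 70, 75, 72, 85, 88]          # hypothetical criterion
print(round(pearson_r(iq_scores, exam_marks), 2))  # -> 0.96
```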
The number of breeding storks correlates almost perfectly with the number of newborn babies!
[Figure: number of breeding stork pairs plotted against millions of newborn babies in Germany.]
Different measures of association give different values (see phi = 0.69 vs. contingency coefficient = 0.57), so coefficients of different types cannot be compared directly. It is also difficult to evaluate a coefficient for larger contingency tables.
Phi is 0.691: there is a significant association between the test classification and the doctors' opinion.
Effect sizes
2x2 table (phi)
small effect (weak association): 0.1; medium effect: 0.3; large effect (strong association): 0.5; perfect association: 1
In general we want to see whether there is an association between our test score and an appropriate external measure. The right coefficient depends on the scale types (SPSS: Analyze > Descriptive Statistics > Crosstabs > Statistics):

Type of correlation coefficient   Types of scales
Pearson product-moment            both scales interval/ratio
Spearman rank-order               both scales ordinal; ordered categorical variables
Phi                               both scales naturally dichotomous (nominal)
Contingency coefficient           both scales nominal (more than 2 categories)
…                                 both scales artificially dichotomous (nominal)
…                                 one scale naturally dichotomous (nominal), one interval/ratio
…                                 one scale artificially dichotomous (nominal), one interval/ratio
…                                 one scale nominal, one scale ordinal
Pearson's correlation
small effect (weak association): 0.3; medium effect: 0.5; large effect (strong association): 0.8; perfect association: 1
Construct validation
Perhaps the most difficult type of validation (if done correctly). Construct validity refers to the degree to which a test or other measure assesses the underlying theoretical construct it is supposed to measure. It is traditionally defined as the experimental demonstration that a test is measuring the construct it claims to be measuring.
Construct validity
So far mainly: sensitivity to change, responsiveness; convergent and discriminant validity; the multitrait-multimethod (MTMM) matrix.
Convergent validity
Convergent validity is the degree to which our scale, which theoretically should be related to other attributes or variables, is actually interrelated with them. Example: a new depression scale should correlate with a measure of serotonin and norepinephrine imbalance, or with social inactivity. We assume that social activity is not only influenced by depression, so we do not expect a perfect correlation: we expect a correlation between the measures, but not a very strong one. If the correlation were very strong, our scale would measure the same thing; in that case, social activity would be just a proxy for depression and we could use social activity as the measure. Difference to concurrent validity: we do not compare our measure with similar measures of the same trait, but with related variables.
Discriminant validity
But by looking only at correlations with similar things, we will never discover a flaw where the scale measures too much. We need a method to tell that it measures just enough (not too much, not too little): discriminant validity. Discriminant validity is the degree to which our scale, which theoretically should not be related to certain other attributes or variables (e.g. depression and body size, or neuroticism), is in fact not interrelated with them. We need to find the traits which are similar according to our theory (high correlations) = convergent validity. But we also need to test traits which are different (low or no correlations) = discriminant validity. This way we can narrow down and check that we measure just the right thing.
MTMM-Matrix
Homotrait-homomethod: the reliability diagonal. The main diagonal contains the reliabilities of our instruments (correlations of each with itself); they should be the highest values in the matrix.
              SDL            Know
              Rater  Test    Rater  Test
SDL   Rater   0.53
      Test    0.42   0.79
Know  Rater   0.18   0.17    0.58
      Test    0.15   0.23    0.49   0.88
Homotrait-heteromethod: convergent validity. Correlations between the same traits measured with different methods; these should be high (but lower than the reliability coefficients).
Heterotrait-homomethod: discriminant validity. Correlations between different traits measured with the same method; these should be low. If these correlations are high, the method has an effect on the scores.
Heterotrait-heteromethod: correlations between different traits measured with different methods; these should be similarly low as the heterotrait-homomethod correlations if the method has no effect on the scores.
see: http://www.socialresearchmethods.net/kb/mtmmmat.htm
Validity
Example: a test is developed to measure xenophobia. The test was administered to 100 people working in the car industry. In addition to the test, the people were asked about their age, type of work (blue-collar vs. white-collar), political attitude (left to right), handedness, number of foreigners among their friends, and education. Can you think of some validity tests?
Upper limit of validity: Relationship between reliability of criterion and validity of new scale:
[Figure: maximal validity plotted as a function of the reliabilities; the observed validity r_xy can never exceed sqrt(r_xx * r_yy).]
r*_xy = r_xy / sqrt(r_xx * r_yy)

r*_xy = estimated validity if both scales were 100% reliable (r = 1)
r_xy = validity (correlation between the new test and the gold standard)
r_xx = reliability of the new test, e.g. test-retest
r_yy = reliability of the "gold standard", e.g. test-retest
r*_xy = r_xy / sqrt(r_yy)

Divide the observed validity by the maximum validity. (Validity would be 0.8 if both measures were perfectly reliable.)

r*_xy = estimated validity if the "gold standard" was 100% reliable (r = 1)
r_xy = validity (correlation between the new test and the gold standard)
r_yy = reliability of the "gold standard", e.g. test-retest
r*_xy = r_xy * sqrt(r_xx' * r_yy') / sqrt(r_xx * r_yy)

r*_xy = estimated validity
r_xy = observed correlation
r_xx, r_yy = observed reliabilities
r_xx', r_yy' = the changed reliabilities for the two variables
Example:

We observed a correlation (validity) of 0.6; the observed reliabilities are 0.8 for our scale and 0.7 for the criterion scale. How much would the validity increase if we could increase the reliability of our scale by 0.1?

Observed: r_xy = 0.6, r_xx = 0.8, r_yy = 0.7
Changed: r_xx' = 0.9, r_yy' = 0.7

r*_xy = r_xy * sqrt(r_xx' * r_yy') / sqrt(r_xx * r_yy) = (0.6 * 0.95 * 0.84) / (0.89 * 0.84) ≈ 0.64
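The worked example can be checked directly with the correction formula (function name is ours, for illustration):

```python
import math

# Correct the observed validity r_xy for a change in the reliabilities
# of the two scales: r*_xy = r_xy * sqrt(r_xx' * r_yy') / sqrt(r_xx * r_yy).
def corrected_validity(r_xy, r_xx_new, r_yy_new, r_xx, r_yy):
    return r_xy * math.sqrt(r_xx_new * r_yy_new) / math.sqrt(r_xx * r_yy)

print(round(corrected_validity(0.6, 0.9, 0.7, 0.8, 0.7), 2))  # -> 0.64
```

So raising the reliability of our scale from 0.8 to 0.9 lifts the validity only from 0.60 to about 0.64.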
Internal consistency
Inter-item correlation, item-total correlation, Cronbach's alpha
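Cronbach's alpha itself is short enough to sketch: alpha = k/(k-1) * (1 - sum of item variances / variance of the total score). The item scores below (5 persons, 3 items) are made up for illustration:

```python
# Cronbach's alpha on a small made-up data set.

def variance(xs):
    """Sample variance (n - 1 denominator)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(items):
    """items: one list of person scores per item."""
    k = len(items)
    totals = [sum(col) for col in zip(*items)]          # total score per person
    item_var = sum(variance(it) for it in items)        # sum of item variances
    return k / (k - 1) * (1 - item_var / variance(totals))

items = [
    [2, 4, 3, 5, 4],   # item 1
    [1, 4, 3, 4, 5],   # item 2
    [2, 5, 4, 5, 4],   # item 3
]
print(round(cronbach_alpha(items), 2))  # -> 0.93
```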