
Last week: Reliability II

Introduction to measurement and scale development Part 5: Validity


Daniel Stahl Department of Biostatistics & Computing

Generalizability theory
explicitly recognizes that there are multiple sources of error and that measures may have different reliabilities in different situations (e.g. different hospitals, gender, different countries); allows us to determine the sources of error using variance components and multilevel modelling techniques; is an extension of the reliability model from two to more random and fixed factors

Reliability and standard error / confidence intervals
Calculating the necessary number of items for a good reliability
Sample sizes for test-retest reliability studies
Reliability for categorical variables: kappa
From item scores to scale scores (total score, subscores)

Last week: From items to total score


If our final item pool passed the requirements for internal consistency, factor analysis for checking dimensionality, and other reliability checks, we need to combine the items into a total score (or several subscores). We need a rule to combine the information of all items into a total score (or a total score for each domain). Usually you simply add up the scores of each domain to obtain a subtotal score for that domain. Adding up usually works well.

Question from last week: Categorizing a continuous scale


If we accept the fact that doctors like to ruin a nice continuous outcome measure by turning it into a dichotomy (depressed vs. not depressed), we need a way to find the best cut-off score: the trade-off between sensitivity and specificity (or between true positives and false positives). http://www.childrensmercy.org/stats/ask/roc.asp

Example
A new simple depression test was developed for use in psychiatric clinics to screen patients for depression. How can we determine a cut-off score to correctly classify a patient as depressive? We could apply the new test to patients and ask psychiatrists to assess the same patients. If our test should classify patients in the same way as psychiatrists do, we can use e.g. ROC curves to determine the optimal cut-off score. The ROC curve allows us to evaluate the performance of tests that categorize cases into one of two groups.

Receiver-operating characteristic curve


An ROC curve is a graphical representation of the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) for every possible cut-off. The plot shows the false positive rate on the x axis and the true positive rate (sensitivity) on the y axis. A good diagnostic test is one that has small false positive and false negative rates. (SPSS: Analyze > ROC Curve)

Depression score versus doctor's classification

[Figure: histogram of depression scores (0-20) by actual state according to the doctor (not depressive vs. depressive), alongside the ROC curve plotting sensitivity (true positives) against 1 - specificity (false positives). Diagonal segments are produced by ties. Area under the curve: 0.97]

Area Under the Curve
Test result variable: Depression score
Area: .970   Std. error: .003   Asymptotic sig.: .000   Asymptotic 95% confidence interval: .964 to .977
Note: the test result variable has at least one tie between the positive and the negative actual state group.

Area under the ROC curve


A good test will result in an ROC curve that rises towards the upper left-hand corner very quickly. Therefore, the area under the curve is a measure of the diagnostic quality of the test. The larger the area, the better the diagnostic test:
Area = 1.0: 100% sensitivity and 100% specificity. Area = 0.5: 50% sensitivity and 50% specificity, i.e. no better than flipping a coin. In practice, a diagnostic test lies somewhere between these two extremes.
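As an illustration, the same analysis can be sketched outside SPSS. The following is a minimal Python sketch using scikit-learn; the toy data and all variable names are assumptions for the example, not the course data.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy data standing in for the clinic sample (assumed, not the real data):
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)             # doctor: 1 = depressive
score = y_true * 4 + rng.normal(10, 3, size=500)  # continuous depression score

fpr, tpr, thresholds = roc_curve(y_true, score)   # fpr = 1 - specificity
print(f"AUC = {roc_auc_score(y_true, score):.3f}")
```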

What is a good value for the area under the curve?
0.50 to 0.75 = fair
0.75 to 0.92 = good
0.92 to 0.97 = very good
0.97 to 1.00 = excellent

Cut-off point
The choice of cut-off point depends on:
the importance of correct classification,
the cost of misclassification,
the prevalence (the lower the prevalence, the higher the proportion of false positives among the positive results).

Cut-off point

Coordinates of the curve (SPSS; test result variable: Depression score; positive if greater than or equal to the cut-off):

Cut-off    Sensitivity    1 - Specificity
  .00        1.000            1.000
 1.50        1.000             .898
 2.50         .999             .813
 3.50         .997             .683
 4.50         .994             .596
 5.50         .993             .529
 6.50         .985             .372
 7.50         .979             .306
 8.50         .969             .240
 9.50         .934             .134
10.50         .918             .092
11.50         .865             .034
12.50         .758             .015
13.50         .700             .013
14.50         .573             .005
15.50         .513             .002
16.50         .379             .001
17.50         .321             .001
18.50         .188             .000
19.50         .086             .000
21.00         .000             .000

Cut-off point = 10.5

[Figure: zoomed ROC curve (sensitivity 0.7-1.0 against 1 - specificity 0.00-0.30) around the chosen cut-off. Diagonal segments are produced by ties.]
10.5 seems to be a good classification cut-off point: >10 = depressive, ≤10 = not depressive. 91.8% of depressive patients are classified as depressive; 9.2% of not-depressed patients are misclassified as depressive. But the original score still carries more information (most of the wrongly classified patients score around 9 and 10): dichotomizing data causes up to 66% power loss in statistical analysis!
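The slides choose 10.5 by inspecting the coordinates table. A common automatic criterion, shown here as an assumed alternative rather than the method used on the slide, is Youden's J (sensitivity + specificity - 1). The sketch reuses fpr, tpr and thresholds from the snippet above:

```python
# Youden's J = sensitivity + specificity - 1, evaluated at every threshold.
# Equal weighting of sensitivity and specificity is itself a choice; with
# unequal misclassification costs or low prevalence you would weight them.
j = tpr - fpr                 # tpr - fpr == sensitivity + specificity - 1
best = j.argmax()
print(f"cut-off = {thresholds[best]:.2f}, sensitivity = {tpr[best]:.3f}, "
      f"specificity = {1 - fpr[best]:.3f}")
```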


Check if classification is the same between males and females


[Figure: histograms of depression scores by the doctor's classification (not depressive vs. depressive), plotted separately for males and females, with a separate ROC curve for each gender (sensitivity against 1 - specificity; diagonal segments are produced by ties).]

There seem to be no gender differences in the cut-off point. Use logistic regression to see the influence of age, gender and ethnicity on classification: depressive (yes/no) ~ test score + sex + age + ethnicity (a sketch follows below).
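A minimal sketch of such a logistic regression in Python with statsmodels follows; the data frame, column names and toy data are assumptions for illustration (ethnicity would enter the formula as another C() term).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data standing in for the clinic sample (all names are illustrative).
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({"score": rng.normal(10, 4, n),
                   "sex": rng.choice(["m", "f"], n),
                   "age": rng.integers(18, 80, n)})
df["depressive"] = (df["score"] + rng.normal(0, 2, n) > 10).astype(int)

# Does the classification depend on sex or age once the score is controlled for?
result = smf.logit("depressive ~ score + C(sex) + age", data=df).fit()
print(result.summary())
```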

Literature: ROC curves


McFall RM, Treat TA (1999). Quantifying the information value of clinical assessments with signal detection theory. Annual Review of Psychology, 50, 215-241.
Lee WC (1999). Selecting diagnostic tests by ruling out or ruling in disease: the use of the Kullback-Leibler distance. International Journal of Epidemiology, 28, 521-525. (more than 2 classifications)
Strik JJ, Honig A, Lousberg R, Denollet J (2001). Sensitivity and specificity of observer and self-report questionnaires in major and minor depression following myocardial infarction. Psychosomatics, 42(5), 423-428.

Validity
Now we have a scale with very good reliability. But what about validity: is our scale measuring what it is supposed to measure? Next step: validity. Is our scale not only reliable but also valid?

Validity
A valid measure is one which is measuring what it is supposed to measure. Validity refers to getting results that accurately reflect the concept being measured. Are we drawing valid conclusions from our measures: does a high score on our IQ scale really mean that the person is intelligent? Validity is the degree of confidence we can place on inferences about people based on their test scores. Validity implies reliability: reliability places the upper limit on the validity of a scale!

Measurement Validity Types


3 general types of validity, with several subcategories:
Expert validity
Criterion validity
Construct validity

Validity Types
3 general types of validity, with several subcategories:
Expert validity: assessment that the items of a test are drawn from the domains being measured (takes place after the initial form of the scale has been developed).
Criterion validity: correlate the measure with a criterion measure known to be valid, e.g. an established test (= other scales of the same or a similar measure are available).
Construct validity: examines whether a measure is related to other variables as required by theory, e.g. a depression score should change in response to a stressful life event (= other scales of the same or a similar measure are not available).

Expert validity
Expert validity: do the experts agree?
Face validity: a subjective assessment that the instrument/items appear to assess the desired qualities. Does the operationalization look like a good translation of the construct? Assessed by colleagues, friends, target subjects, clinicians. The weakest way to demonstrate construct validity.
Content validity: closely related to face validity, but a more rigorous assessment, done by an expert panel. It is concerned with sample-population representativeness: a subjective assessment that the instrument samples all the important contents/domains of the attribute.

Criterion and Construct validity


Criterion validity: is the measure consistent with what we already know and what we expect?
Concurrent validity: correlate measurements of a new scale with a gold-standard criterion, both of which are given at the same time.
Predictive validity: correlate with a criterion which is not yet available.
Construct validity: sensitivity to change, responsiveness. Is the measure related to other variables as required by theory? E.g. a depression score should change in response to a stressful life event.
Discriminant validity: does not associate with constructs that should not be related.

Criterion and construct validity another view


Validation is mainly about generating hypotheses and designing experiments to test them. It is a process of hypothesis testing, and the important question is not which kind of validity we apply, but: does our hypothesis make sense in light of what the scale is designed to measure? However, the different categories of validity may help to generate hypotheses and to design studies.

Criterion validity:
Criterion validity: Is the measure consistent with what we already know and what we expect?
Concurrent validity: correlate measurements of a new scale with a gold-standard criterion, both of which are given at the same time, e.g. a new depression test and the Beck Depression Inventory (= allows for immediate results). Predictive validity: correlate with a criterion which is not yet available, e.g. measuring aggressiveness and comparing it with reported aggressive acts in the following year; intelligence and final exam results.

Criterion Validity
Steps for conducting a criterion-related validation study:
Identify a suitable criterion and a method for measuring it (e.g. a new IQ test with the aim of predicting school performance: test scores should be correlated with exam results).
Identify an appropriate sample (students).
Correlate test scores and the criterion measure (exam results). The degree of relationship is the validity coefficient: if we estimate a Pearson correlation coefficient r (or ICC), r is our validation coefficient. The interpretation is similar to reliability: the closer r is to 1, the better the validity. (See the sketch below.)
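A minimal sketch of the correlation step, assuming toy data and illustrative variable names:

```python
import numpy as np
from scipy.stats import pearsonr

# Toy data: a new IQ-type test validated against exam results (illustrative).
rng = np.random.default_rng(2)
iq_score = rng.normal(100, 15, 80)
exam_result = 0.5 * iq_score + rng.normal(0, 10, 80)

r, p = pearsonr(iq_score, exam_result)   # r is the validity coefficient
print(f"validity coefficient r = {r:.2f} (p = {p:.3g})")
```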

Concurrent validity: Number of storks as a measure of fertility?

[Figure: pairs of breeding storks plotted against millions of newborn babies in Germany. The number of breeding storks correlates almost perfectly with the number of newborn babies!]

Association between categorical scores

Often we classify according to our score into two or more categories, e.g. neurotic vs. not neurotic, extroverted vs. introverted. In this case we want to see if there is an association between the categories, e.g. between our neuroticism test result and the psychiatrist's/doctor's diagnosis:

                          Doctor's diagnosis
New test          Neurotic    Not neurotic
Neurotic             20             3
Not neurotic          2            10

Problems with categorical association coefficients

Phi is 0.691: there is a significant association between the test classification and the doctor's opinion.

Symmetric Measures (SPSS: Analyze > Descriptive Statistics > Crosstabs > Statistics)

                                           Value    Approx. Sig.
Nominal by Nominal   Phi                    .691        .000
                     Cramer's V             .691        .000
                     Contingency Coeff.     .568        .000
N of Valid Cases      35

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.

Problems: different measures of association result in different values (see phi = 0.69 vs. contingency coefficient = 0.57); therefore, comparison across different association measures is not possible. It is also difficult to evaluate a coefficient for larger contingency tables.
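For comparison, these association measures can be reproduced from the 2x2 table above; a sketch in Python (note that matching SPSS's .691 requires the chi-square without continuity correction):

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[20, 3],      # rows: new test (neurotic, not neurotic)
                  [2, 10]])     # columns: doctor (neurotic, not neurotic)
chi2, p, dof, expected = chi2_contingency(table, correction=False)
n = table.sum()

phi = np.sqrt(chi2 / n)                              # .691, as in the output
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
contingency_c = np.sqrt(chi2 / (chi2 + n))           # .568, as in the output
print(f"phi = {phi:.3f}, Cramer's V = {cramers_v:.3f}, C = {contingency_c:.3f}")
```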

Effect sizes

2x2 table (phi): small effect (weak association): 0.1; medium effect: 0.3; large (strong) association: 0.5; perfect association: 1.
Pearson's correlation: small effect (weak association): 0.3; medium effect: 0.5; large (strong) association: 0.8; perfect association: 1.

In general we want to see if there is an association between our test score and an appropriate measure. The choice of coefficient depends on the types of scales (SPSS: Analyze > Descriptives > Crosstabs > Statistics):

Type of correlation coefficient    Types of scales
Pearson product-moment             Both scales interval/ratio
Spearman rank-order                Both scales ordinal
Phi                                Both scales naturally dichotomous (nominal)
Contingency coefficient            Both scales nominal (more than 2 categories)
Linear-by-linear association       Ordered categorical variables
Tetrachoric                        Both scales artificially dichotomous (nominal)
Point-biserial                     One scale naturally dichotomous (nominal), one scale interval/ratio
Biserial                           One scale artificially dichotomous (nominal), one scale interval/ratio
Gamma                              One scale nominal, one scale ordinal

Measurement Validity Types


3 general types of validity, with several subcategories:
Expert validity: assessment that the items of a test are drawn from the domains being measured.
Criterion validity: correlate the measure with a criterion measure known to be valid, e.g. an established test.
Construct validity: sensitivity to change, responsiveness. Is the measure related to other variables as required by theory? E.g. a depression score should change in response to a stressful life event. Discriminant validity: does not associate with constructs that should not be related.

Construct validation
Perhaps the most difficult type of validation (if done correctly). Construct validity refers to the degree to which a test or other measure assesses the underlying theoretical construct it is supposed to measure. It is traditionally defined as the experimental demonstration that a test is measuring the construct it claims to be measuring.

Idea behind estimating construct validity


A construct is an unobservable trait, such as anxiety, self-esteem, xenophobia or IQ, that cannot be measured directly. The basic idea behind construct validity is that we can observe and measure behavioural or other traits (e.g. blood pressure) which, according to our hypothesized construct, are influenced by or result from the construct we intend to measure; e.g. sweaty palms are influenced by anxiety. We can then compare measurements of the observable traits with the measurements of our construct. If the observed behaviour correlates well with the scores of the test, we assume we have good construct validity.

Construct validity: examples


A simple experiment could take the form of a differential-groups study:
you compare the performance on the test between two groups, one that has the construct and one that does not. For example, the serotonin levels of people who score high on the depression test are compared with those of people who score low. If the group with the construct performs differently from the group without the construct, that result is said to provide evidence of the construct validity of the test (extreme-groups validity).

Construct validity: examples

You could do an intervention study: theories tell us how constructs respond to interventions, e.g. phobic symptoms subside after repeated exposure to the feared stimulus.

Our scale should measure the differences after the intervention as predicted: after repeated exposures, our test should show lower scores. If the test scores of the subjects change in the predicted direction, we have evidence of construct validity.

Construct validity: more examples (developmental changes)

Should the construct change with age? E.g. attention increases with age; memory retrieval decreases with age. This should be seen in your scale: give your scale to a group of various ages and check for the pattern predicted by your theory. Is there the predicted correlation between age and score?

Construct validity: more examples


We developed a test for perceived general health. For theoretical reasons we assume that perceived general health should be associated with the number of visits to a GP and with levels of self-medication. These two hypotheses can be formally tested by correlating the perceived general health score with the number of visits to the GP and with levels of self-medication. If the correlational analyses confirm your assumptions, this adds to the evidence of construct validity. Quasi-experimental study: we developed a test for depression. We assume that people who recently had a stressful life event score higher on our test than people who did not have a stressful life event (t-test; see the sketch below). Furthermore, the depression score of a person should increase after a stressful life event (sensitivity to change).
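A minimal sketch of the quasi-experimental check, with toy data standing in for the two groups:

```python
import numpy as np
from scipy.stats import ttest_ind

# Toy depression scores (assumed) for people with and without a recent
# stressful life event.
rng = np.random.default_rng(3)
event_group = rng.normal(12, 4, 60)
control_group = rng.normal(9, 4, 60)

t, p = ttest_ind(event_group, control_group)
print(f"t = {t:.2f}, p = {p:.3g}")  # higher mean with a life event supports validity
```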

Main steps in construct validity


Formulate several hypotheses; try to formulate hypotheses for each domain of your construct.
Define (operationalise) how you want to test your hypotheses: an experimental, quasi-experimental or correlational approach?
Develop the scale now!
Gather data that will allow you to test the hypotheses.
Determine whether your data support the hypotheses, e.g. with ANOVA, correlation, regression.

Construct validity
So far mainly: sensitivity to change, responsiveness.
Convergent and discriminant validity
Multitrait-multimethod matrix

Convergent validity
Convergent validity is the degree to which our scale, which theoretically should be related to other attributes or variables, is interrelated with them in reality. Example: a new depression scale should correlate with a measure of serotonin and norepinephrine imbalance, or with social inactivity. We assume that social activity is not influenced only by depression, and therefore we do not expect a perfect correlation: we expect a correlation between the measures, but not a very strong one. If the correlation were very strong, our scale would measure the same thing; in that case social activity would just be a proxy for depression, and we could use social activity as the measure. Difference from concurrent validity: we do not compare our measure with similar measures of the same trait but with related variables.

Discriminant validity
But by looking only at correlations with things that are similar, we will never discover a particular flaw: that the scale measures too much. We need a method to tell that it measures just enough (not too much, not too little): discriminant validity. Discriminant validity is the degree to which our scale, which theoretically should not be related to certain other attributes or variables (e.g. depression, or neuroticism, and body size), is in fact not interrelated with them in reality. We need to find the traits which are similar according to our theory (high correlations) = convergent validity. But we also need to test traits which are different (low or no correlations) = discriminant validity. This way we can narrow down and check that we measure just right.

Multitrait Multimethod matrix


If we find a correlation between two variables but we used similar methods of administration (e.g. both are psychometric tests), the correlation may be due to the shared administration method (e.g. the wording of items is difficult, or subjects try to be socially desirable in their answers). In this case we may find a correlation between two dissimilar trait measures where we do not expect one. Convergent and discriminant validity methods can be used to check the influence of the measurement method on our scores. Multitrait-multimethod matrix (MTMM) analysis allows us to disentangle correlations between instruments that are due to similarity of the test methods from those due to tapping the same attribute.

Multitrait Multimethod matrix


Two or more different traits (both similar and dissimilar traits) are measured by two or more methods (e.g. a psychometric test, a direct observation, a performance measure) at the same time. You then correlate the scores with each other. To construct an MTMM, you arrange the correlation matrix by concepts within methods. Essentially, the MTMM is just a correlation matrix between your measures, with one exception: instead of 1s along the diagonal (as in the typical correlation matrix), we substitute an estimate of the reliability of each measure on the diagonal.

Multitrait Multimethod matrix


We developed a scale for self-directed learning and want to validate the measure. We conduct a study of students and measure two traits: self-directed learning (SDL) and knowledge (Know). Furthermore, we measure each of the two traits in two different ways: a test measure and an exam rating. We assume that the two traits should not correlate strongly. We measured the reliability of each measure. The results are arrayed in the MTMM.

Example from: Streiner and Norman 2003, page 184

MTMM-Matrix

                      SDL                Know
                 Rater    Test      Rater    Test
SDL    Rater     0.53
       Test      0.42     0.79
Know   Rater     0.18     0.17      0.58
       Test      0.15     0.23      0.49     0.88

Homotrait-homomethod (the reliability diagonal): the main diagonal contains the reliabilities of our instruments (correlations of each instrument with itself); they should be the highest values in the matrix.

Homotrait-heteromethod (concurrent validity): correlations between the same trait measured with different methods; they should be high (but lower than the reliability coefficients).

Heterotrait-homomethod (discriminant validity): correlations between different traits measured with the same method; they should be low. If these correlations are high, the method has an effect on the scores.

Heterotrait-heteromethod: correlations between different traits measured with different methods; they should be about as low as the heterotrait-homomethod correlations if the method has no effect on the scores.
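A minimal sketch of how such a matrix could be assembled in Python with pandas, using toy scores and the reliabilities from the example above:

```python
import numpy as np
import pandas as pd

# Toy scores for two traits (SDL, Know), each measured by a rater and a test.
rng = np.random.default_rng(4)
data = pd.DataFrame(rng.normal(size=(100, 4)),
                    columns=["SDL_rater", "SDL_test", "Know_rater", "Know_test"])

# Reliabilities taken from the example matrix above.
reliabilities = {"SDL_rater": 0.53, "SDL_test": 0.79,
                 "Know_rater": 0.58, "Know_test": 0.88}

mtmm = data.corr()                    # heterotrait/heteromethod correlations
for name, rel in reliabilities.items():
    mtmm.loc[name, name] = rel        # reliability replaces 1.0 on the diagonal
print(mtmm.round(2))
```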

Multitrait Multimethod matrix


can be easily extended to include more similar and dissimilar traits and more methods: see e.g.
http://www.socialresearchmethods.net/kb/mtmmmat.htm


Multitrait Multimethod matrix


We developed a scale for self-esteem and want to validate the measure. We conduct a study of students and measure three traits or concepts: self-esteem (A), self-disclosure (B) and locus of control (C). Furthermore, we measure each of these traits in three different ways: a paper-and-pencil measure (1), a teacher rating (2) and a parent assessment (3). We assume that self-esteem should correlate with self-disclosure but not with locus of control (LC). We measured the reliability of each measure. The results are arrayed in the MTMM.

Have a go yourself!

Example from: http://www.socialresearchmethods.net/kb/mtmmmat.htm

Construct validity: measure and theory


In construct validity we are assessing two types of validity at the same time: is our measure valid, and is the theory regarding our construct valid (are the observed traits derived from our hypotheses really measures of the construct; is, e.g., college grade a measure of IQ)? If we find high validity, we have more confidence that both our measure and our theory are correct. However, if we obtain only low validity, the problem could be:
Our measure is good, but the theory is wrong.
The theory is good, but the measure is wrong.
Both theory and measure are wrong.
If we did an experiment (e.g. inducing anxiety), it could also be that the experiment did not work, while theory and measure are correct.

Summary: Evaluating construct validity


Evaluating construct validity will involve a large number of tests and an assessment of disparate information. Trying to confirm the plausible associations and disconfirm the implausible ones is often a long and incremental process. There will be no definitive answers, and there is no formal way to weigh the overall evidence. Reaching conclusions about construct validity is further complicated by the fact that the evidence tends to be interpreted in two different ways:
to use these associations to test whether the instrument is a good measure of the intended constructs (validation of the instrument);
to use these associations to confirm and clarify the constructs (validation of the underlying construct).

Many carefully planned validation tests need to be done.

Validity
Example: a test is developed to measure xenophobia. The test was administered to 100 people working in the car industry. In addition to the test, the people were asked about the following: age, type of work (blue-collar vs. white-collar worker), political attitude (left to right), handedness, number of foreigners as friends, and education. Can you think of some validity tests?

Possible validity tests


Older people show on average more xenophobia than younger ones.
People with a right-wing political attitude show on average more xenophobia.
People in unsafe job situations show more xenophobia than people with safe jobs; blue-collar workers should show on average more xenophobia than white-collar workers.
People with higher education should show on average less xenophobia.
Left-handed people should show a similar degree of xenophobia to right-handed people.

Validity and reliability


Validity implies reliability. A valid measure must be reliable, but a reliable measure may not be valid. Reliability places the upper limit of the validity of a scale!

Upper limit of validity

Validity is to some degree dependent on reliability: reliability places the upper limit on the validity of a scale.

$r_{xy} \le \sqrt{r_{xx}\, r_{yy}}$

where $r_{xy}$ is the validity (correlation between the new test and e.g. a gold standard), $r_{xx}$ is the reliability of the new test (e.g. test-retest), and $r_{yy}$ is the reliability of the "gold standard" (e.g. test-retest).

Example: maximum validity

If the reliability of our test is 0.8 and the reliability of a gold-standard test is 0.7, the maximum correlation (validity coefficient) between the two variables is:

$r_{xy} \le \sqrt{0.8 \times 0.7} = 0.75$

Upper limit of validity: relationship between the reliability of the criterion and the validity of the new scale:

[Figure: maximal validity plotted against the reliability of the criterion ("gold standard"), with separate curves for new-scale reliabilities of .70, .80 and .90.]


The problem: unreliable criterion


This relationship between reliability and validity can cause a problem in our validity analysis: our measure may be valid, but if we compare it with a not very reliable criterion, we will get a low validity estimate for our measure.

Correcting for low reliability

However, we can estimate what the validity coefficient of our new measure would be if both our new scale and the criterion were perfectly reliable:

$r^{*}_{xy} = \dfrac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}$

where $r^{*}_{xy}$ is the estimated validity if both scales were 100% reliable (r = 1), $r_{xy}$ is the observed validity (correlation between the new test and the gold standard), $r_{xx}$ is the reliability of the new test (e.g. test-retest), and $r_{yy}$ is the reliability of the "gold standard" (e.g. test-retest).

Example: estimated validity

A correlation between our new measure and a gold-standard test results in r = 0.6. What would the validity be if both were perfectly reliable? (Reliability of our test: 0.8; reliability of the gold-standard test: 0.7.)

$r^{*}_{xy} = \dfrac{0.6}{\sqrt{0.8 \times 0.7}} = \dfrac{0.6}{0.75} = 0.8$

We divide the observed correlation by the maximum attainable correlation: the validity would be 0.8 if both measures were perfectly reliable.

Estimating validity if the gold standard is perfect

However, our new measure is not perfectly reliable and never will be. A more realistic approach is to estimate the validity of our test if only the gold standard were perfectly reliable:

$r^{*}_{xy} = \dfrac{r_{xy}}{\sqrt{r_{yy'}}}$

where $r^{*}_{xy}$ is the estimated validity if the "gold standard" were 100% reliable (r = 1), $r_{xy}$ is the observed validity (correlation between the new test and the gold standard), and $r_{yy'}$ is the reliability of the gold standard (e.g. test-retest).

Example: estimated validity

A correlation between our new measure and a gold-standard test results in r = 0.6. What would the validity be if the gold standard were perfectly reliable? (Reliability of our test: 0.8; reliability of the gold-standard test: 0.7.)

$r^{*}_{xy} = \dfrac{0.6}{\sqrt{0.7}} = \dfrac{0.6}{0.84} = 0.72$

The validity would be 0.72 if the criterion measure were perfectly reliable.

Effect of increasing reliability on validity

If the validity of our scale is low due to low reliability, we could improve the reliability in order to increase the validity. How much do we have to improve the reliability of the scale to obtain an acceptable validity?

$r^{*}_{xy} = \dfrac{r_{xy}\,\sqrt{r^{\text{changed}}_{xx'}\, r^{\text{changed}}_{yy'}}}{\sqrt{r_{xx}\, r_{yy}}}$

where $r^{\text{changed}}_{xx'}$ and $r^{\text{changed}}_{yy'}$ are the changed reliabilities of the two variables, $r^{*}_{xy}$ is the estimated validity, $r_{xy}$ is the observed correlation, and $r_{xx}$, $r_{yy'}$ are the observed reliabilities.


Example:
We observed a correlation (validity) of 0.6; the observed reliabilities are 0.8 for our scale and 0.7 for the criterion scale. How much would the validity increase if we could increase the reliability of our scale by 0.1?

$r^{\text{changed}}_{xx'} = 0.9, \qquad r^{\text{changed}}_{yy'} = 0.7$

$r^{*}_{xy} = \dfrac{0.6 \times 0.95 \times 0.84}{0.89 \times 0.84} = 0.64$

(using $\sqrt{0.9} = 0.95$, $\sqrt{0.7} = 0.84$, $\sqrt{0.8} = 0.89$)

Usefulness of validity corrections

If our new scale should correlate strongly with another measure, these corrections give us an indication of the true validity of the instrument, and hence an indication that we are measuring the construct: a low uncorrected validity score might otherwise lead us to discard our scale when the problem is in reality one of reliability. They also help answer: is it worth increasing the reliability of our new scale? How reliable does our scale need to be to obtain a valid scale? (A sketch of these corrections follows below.)
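The three corrections above are simple enough to collect in a short sketch; the printed values reproduce the worked examples:

```python
import math

def max_validity(r_xx: float, r_yy: float) -> float:
    """Upper limit of the validity coefficient, given the two reliabilities."""
    return math.sqrt(r_xx * r_yy)

def corrected_validity(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Estimated validity if both measures were perfectly reliable.
    Pass r_xx=1.0 to correct only for the criterion's unreliability."""
    return r_xy / math.sqrt(r_xx * r_yy)

def validity_after_change(r_xy, r_xx, r_yy, r_xx_new, r_yy_new):
    """Estimated validity after changing one or both reliabilities."""
    return r_xy * math.sqrt(r_xx_new * r_yy_new) / math.sqrt(r_xx * r_yy)

print(round(max_validity(0.8, 0.7), 2))                          # 0.75
print(round(corrected_validity(0.6, 0.8, 0.7), 2))               # 0.8
print(round(corrected_validity(0.6, 1.0, 0.7), 2))               # 0.72
print(round(validity_after_change(0.6, 0.8, 0.7, 0.9, 0.7), 2))  # 0.64
```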

Which reliability measure to use?


"Use the type of reliability estimate that treats as error those factors that one decides should be treated as error" (Muchinsky 1996). E.g. if you think that the test may suffer from low item coverage of your domains, you should use Cronbach's alpha as the reliability measure. If you think that subjects may have problems with the test, you should use test-retest reliability estimates (ICC). Unfortunately, "there is no acceptable psychometric basis for creating validity coefficients as a product of correcting multiple types of unreliability" (Muchinsky 1996). Perhaps use the lowest one.

Summary of the course


The aim of psychometric scale development is to develop a tool to measure unobservable traits, i.e. latent constructs. These unobservable traits are measured by a scale which consists of many items (e.g. questions), all of which should tap into the construct or into domains of the construct. The key concepts of classical test development theory are reliability and validity.

Item pool generation
→ Face and content validity
→ Internal consistency: inter-item correlation, item-total correlation, Cronbach's alpha
→ Dimensionality: factor analysis; remove/revise items
→ From items to scale: total score, total subscores
→ More reliability (stability): interobserver, intra-observer, test-retest
→ Validity: construct, concurrent

