
6

Validity of Measures
With Karen L. Soeken

Validity is defined as "the degree to which evidence and theory support the interpretation entailed by proposed use of tests" (American Educational Research Association [AERA], American Psychological Association [APA], National Council on Measurement in Education [NCME], 1999, p. 9). The type of validity information to be obtained depends upon the aims or purposes for the measure rather than upon the type of measure per se. Hence, it is not appropriate to speak of a valid tool or measurement method, but rather of accruing evidence for validity by examining the scores resulting from a measure that is employed for a specific purpose with a specified group of respondents under a certain set of conditions. For any given measure, different aspects of validity will be investigated depending upon the measure's purpose(s). That is, when test scores are used or interpreted in more than one way, each intended interpretation must be validated (AERA, APA, & NCME, 1999, pp. 9–11).

Standards for measurement validity changed substantially in the Standards for Educational and Psychological Testing published by AERA, APA, and NCME in 1999. These changes resulted from concern that the focus on only the three types of validity (content, criterion-related, and construct) tended to narrow thinking and lead to the misconception that validity is a property of a test or measure per se. In the 1999 Standards the definition of validity was expanded to emphasize that validity should be viewed as a unitary concept. Validity is not a property of a tool. It is a property of the scores obtained with a measure when used for a specific purpose and with a particular group of respondents.

Thus, for any given measure, one or more types of evidence will be of interest. Evidence for the validity of a newly developed measure requires extensive, rigorous investigation using a number of different approaches depending upon the purpose(s), subjects, and the conditions under which it will be used, prior to employing the measure clinically and/or within a research study. In addition, evidence for validity should be obtained within the context of each study in which the measure is used for the collection of data. Therefore, for any given tool or measurement method, validity will be investigated in multiple ways depending upon the purpose(s) for measurement, and evidence for validity will be accrued with repeated use of the measure. The Standards focus on five distinct types of evidence based on: (1) test content, (2) response processes, (3) internal structure, (4) relations with other variables, and (5) the consequences of testing. Two important points are emphasized: First, these sources of evidence "may illuminate different aspects of validity, but they do not represent distinct types of validity . . . it is the degree to which all of the accumulated evidence supports the intended interpretation of test scores for the intended purpose" (AERA, APA, & NCME, 1999, p. 11). Second, "A sound validity argument integrates various strands of evidence into a coherent account of the degree to which existing evidence and theory support the intended interpretation of test scores for specific uses . . . encompasses evidence gathered from new studies and evidence available from earlier reported research . . . indicates the need for refining the definition of the construct, may suggest revisions in the test or other aspects of the testing process, and may
indicate areas needing further study" (AERA, APA, & NCME, 1999, p. 17).

Five types of evidence to be considered when investigating the validity of a measure are described in the Standards:

1. Evidence based on test content: The extent to which content, including specific items or tasks and their formats, represents the content domain. Evidence for relevance of the content domain to the proposed interpretation of scores might include:
   • Extent to which content represents the content domain (Content Validity)
   • Experts' review of sufficiency, relevance, and clarity of items (Content Validity, Item Analysis), and extent to which the intended construct is measured (Construct Validity)
   • Extent to which items or subparts match the definition of the construct and/or purpose of the measure (Construct Validity)
   • Evidence that construct underrepresentation or construct-irrelevant components may give unfair advantage to subgroups of subjects (Construct Validity)
2. Evidence based on response processes: The fit between the type of performances or responses in which respondents engage and the intended construct. Evidence that the construct of interest is assessed and criteria applied as intended, and that responses are not the result of irrelevant or extraneous factors like social desirability, may include:
   • Interviewing subjects to assess the basis for their responses
   • Observing subjects as they engage in task performances
   • Studying the way in which judges or observers employ criteria to record and score performances
3. Evidence based on internal structure: The extent to which relationships among items and components match the construct as operationally defined. Evidence based on internal structure might include:
   • Factor analysis, especially Confirmatory Factor Analysis (CFA)
   • Differential item functioning (DIF) studies to detect item bias
   • Item analysis to examine interrelationships
4. Evidence based on relations to other variables: The nature and extent of the relationships between scores and other variables that the measure is expected to correlate with or predict, as well as variables that the measure is not expected to relate with, and the extent to which these relationships are consistent with the construct that serves as the basis for the proposed interpretation of scores. Evidence based on relations with other variables may include:
   • Correlation with scores from another measure of the same construct (Criterion-related Validity)
   • Extent to which scores correlate with data obtained at a later date (Criterion-related Validity)
   • Differential group prediction studies
   • Validity generalization studies (Meta-analysis)
   • Convergent and divergent validity studies (Multitrait-multimethod)
   • Experimental and known group comparison studies
5. Evidence based on consequences of testing: Extent to which anticipated benefits of measurement are realized and/or unanticipated consequences (positive and negative) occur, and extent to which differential consequences are observed for different identifiable subgroups. Evidence based on consequences of testing may include:
   • Descriptive studies
   • Focus groups (AERA, APA, & NCME, 1999; Goodwin, 2002, pp. 100–106).

A number of validation frameworks have been proposed that are based on these Standards. Readers interested in learning more regarding these frameworks are referred to Slaney and Maraun (2008), who provide an overview and evaluation of validation frameworks proposed by others and propose a framework for test analysis based upon the Standards that addresses some of the limitations in existing frameworks, and Baldwin and colleagues (2009),
who developed and implemented a process to validate the 75 Core National Association of Clinical Nurse Specialists (NACNS) clinical nurse specialist (CNS) competencies among practicing CNSs.

NORM-REFERENCED VALIDITY PROCEDURES

Content Validity

Validation of the extent to which content represents the content domain is important for all measures and is especially of interest for instruments designed to assess cognition. The focus is on determining whether the items sampled for inclusion on the tool adequately represent the domain of content addressed by the instrument, and on the relevance of the content domain to the proposed interpretation of scores obtained when the measure is employed. For this reason, content validity is largely a function of how an instrument is developed. When a domain is adequately defined, objectives that represent that domain are clearly explicated, an exhaustive set of items to measure each objective is constructed, and then a random sampling procedure is employed to select a subset of items from this larger pool for inclusion on the instrument, the probability that the instrument will have adequate content validity is high.

When investigating content validity, the interest is in the extent to which the content of the measure represents the content domain. Procedures employed for this purpose usually involve having experts judge the specific items and/or behaviors included in the measure in terms of their relevance, sufficiency, and clarity in representing the concepts underlying the measure's development. To obtain evidence for content validity, the list of behavioral objectives that guided the construction of the tool, a definition of terms, and a separate list of items designed to specifically test the objectives are given to at least two experts in the area of content to be measured. These experts are then asked to (1) link each objective with its respective item, (2) assess the relevancy of the items to the content addressed by the objectives, and (3) judge if they believe the items on the tool adequately represent the content or behaviors in the domain of interest.

When only two judges are employed, the content validity index (CVI) is used to quantify the extent of agreement between the experts. To compute the CVI, two content specialists are given the objectives and items and are asked to independently rate the relevance of each item to the objective(s) using a 4-point rating scale: (1) not relevant, (2) somewhat relevant, (3) quite relevant, and (4) very relevant. The CVI is defined as the proportion of items given a rating of quite/very relevant by both raters involved. For example, suppose the relevance of each of 10 items on an instrument to a particular objective is independently rated by two experts using the 4-point scale, and the results are those displayed in Figure 6.1. Using the information from the figure, the CVI equals the proportion of items given a rating of 3 or 4 by both judges: CVI = 8/10 or 0.80. If all items are given ratings of 3 or 4 by both raters, interrater agreement will be perfect and the value of the CVI will be 1.00. If one-half of the items are jointly classified as 1 or 2, while the others are jointly classified as 3 or 4, the CVI will be 0.50, indicating an unacceptable level of content validity (Martuza, 1977).

When more than two experts rate the items on a measure, the alpha coefficient, discussed in Chapter 5, is employed as the index of content validity. Figure 6.2 provides an example of alpha employed for the determination of content validity for six experts' rating of the relevance of each of five items on a measure. In this case, the column headings represent each of the six experts' ratings (A–F), and the row headings represent each of the five items (1–5) rated. The resulting alpha coefficient quantifies the extent to which there is agreement between the experts' ratings of the items. A coefficient of 0.00 indicates lack of agreement between the experts and a coefficient of 1.00 indicates complete agreement. It should be noted that agreement does not mean that the same rating was assigned by all experts, but rather that the relative ordering or ranking of scores assigned by one expert matches the relative order assigned by the other experts.
                                      Judge 1
                            (1 or 2)            (3 or 4)
Judge 2                     not/somewhat        quite/very
                            relevant            relevant        Total
(1 or 2) not/somewhat
         relevant               2                   0               2
(3 or 4) quite/very
         relevant               0                   8               8
Total                           2                   8              10

FIGURE 6.1 Two judges' ratings of 10 items.
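To make the computation concrete, the following minimal Python sketch (an illustration, not part of the original text) computes the CVI from two judges' ratings. The rating vectors are hypothetical but reproduce the agreement pattern in Figure 6.1: items 1 through 8 are rated quite/very relevant (3 or 4) by both judges, and items 9 and 10 are rated not/somewhat relevant (1 or 2) by both.

```python
def cvi(ratings_a, ratings_b):
    """Proportion of items rated 3 or 4 (quite/very relevant) by BOTH raters."""
    assert len(ratings_a) == len(ratings_b)
    both_relevant = sum(1 for a, b in zip(ratings_a, ratings_b)
                        if a >= 3 and b >= 3)
    return both_relevant / len(ratings_a)

# Hypothetical 4-point ratings for 10 items, consistent with Figure 6.1.
judge_1 = [4, 3, 4, 4, 3, 4, 3, 4, 1, 2]
judge_2 = [3, 4, 4, 3, 4, 4, 4, 3, 2, 1]

print(cvi(judge_1, judge_2))  # 0.8, i.e., CVI = 8/10 as in the text
```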

                                 Judges
Items         A       B       C       D       E       F      Total
1            10       5       4       5      10       8        42
2             8       4       4       3       9       7        35
3            10       5       5       4       8      10        42
4            10       4       5       4      10      10        43
5             9       5       5       5      10       8        42
Total        47      23      23      21      47      43       204
Mean        9.4     4.6     4.6     4.2     9.4     8.6      40.8
Variance   0.64    0.24    0.20    0.70    0.64    1.44      8.56

alpha = (6/5)[1 - (3.86/8.56)]
      = (1.2)(1 - 0.45)
      = (1.2)(0.55)
      = 0.66

FIGURE 6.2 Example of alpha employed for the determination of content validity for six judges' rating of a five-item measure.
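The alpha calculation in Figure 6.2 can be reproduced with a short script. The following minimal Python sketch (an illustration, not part of the original text; it assumes NumPy is available) treats the six judges as the "items" in the alpha formula from Chapter 5, alpha = (k/(k - 1))(1 - sum of judge variances / variance of total scores), where 0.64 + 0.24 + 0.20 + 0.70 + 0.64 + 1.44 = 3.86.

```python
import numpy as np

# Ratings from Figure 6.2: rows are the five rated items,
# columns are judges A-F.
ratings = np.array([
    [10, 5, 4, 5, 10,  8],
    [ 8, 4, 4, 3,  9,  7],
    [10, 5, 5, 4,  8, 10],
    [10, 4, 5, 4, 10, 10],
    [ 9, 5, 5, 5, 10,  8],
])

k = ratings.shape[1]                          # 6 judges
judge_vars = ratings.var(axis=0, ddof=0)      # 0.64, 0.24, 0.20, 0.70, 0.64, 1.44
total_var = ratings.sum(axis=1).var(ddof=0)   # variance of item totals: 8.56

alpha = (k / (k - 1)) * (1 - judge_vars.sum() / total_var)
print(round(float(alpha), 2))  # 0.66, matching Figure 6.2
```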

When sufficiency and/or clarity of the specific items and/or behaviors included in the measure in representing the concept underlying the measure's development is of interest, the same procedure is employed.

Content validity judgments require subject matter expertise, and therefore careful attention must be given to the selection, preparation, and use of experts, and to the optimal number of experts in specific measurement situations. Readers interested in more information in this regard are referred to Berk (1990), Davis (1992), and Grant and Davis (1997). Content validity judgments are different from the judgments referred to in determining face validity. Face validity is not validity in the true sense and refers only to the appearance of the instrument to the layperson; that is, if upon cursory inspection an instrument appears to measure what the test constructor claims it measures, it is said to have face validity. If an instrument has face validity, the layperson is more apt to be motivated to respond; thus its presence may serve as a factor in increasing response rate. Face validity, when it is present, however, does not provide evidence for validity, that is, evidence that the instrument actually measures what it purports to measure.

For example, Ganz et al. (2009) examined content validity of a survey instrument employed
in a study undertaken to describe the oral-care practices of ICU nurses, to compare those practices with current evidence-based practice, and to determine if the use of evidence-based practice was associated with personal demographic or professional characteristics. Brodsky and Dijk (2008), when examining the validity of their questionnaire designed to evaluate Israeli nurses' and physicians' attitudes regarding the introduction of new nursing roles and the expanding scope of nursing practice, employed a panel of experts with theoretical knowledge and practical experience in the fields being analyzed. Specifically, the panel was asked to independently review the measure's content and face validity, and to scrutinize each item regarding its appropriateness, comprehensiveness, and relevance to the target population. Similarly, Blackwood and Wilson-Barnett (2007) distributed their questionnaire to assess nurses' knowledge and principles regarding protocolized weaning from mechanical ventilation to a panel of international experts in critical care with research experience to comment on its content and face validity. Other examples of content validity can be found in Glaister (2007); Schnall et al. (2008); Gatti (2008); Kearney, Baggs, Broome, Dougherty, and Freda (2008); and Hatfield (2008).

Construct Validity

Construct validity is important for all measures, especially measures of affect. The primary concern in assessing construct validity is the extent to which relationships among items included in the measure are consistent with the theory and concepts as operationally defined. The importance of consistency between theory and operationalization is exemplified in the work of Fouladbakhsh and Stommel (2007), who address the need for conceptual and operational consistency in discussing the development of a Complementary Alternative Medicine Health Care Model. Activities undertaken to obtain evidence for construct validity include:

• Examining item interrelationships
• Investigations of the type and extent of the relationship between scores and external variables
• Studies of the relationship between scores and other tools or methods intended to measure the same concepts
• Examining relationships between scores and other measures of different constructs
• Hypothesis testing of effects of specific interventions on scores
• Comparison of scores of known groups of respondents
• Testing hypotheses about expected differences in scores across specific groups of respondents
• Ascertaining similarities and differences in responses given by members of distinct subgroups of respondents

Construct validity is usually determined using (1) the contrasted groups approach, (2) the hypothesis testing approach, (3) the multitrait-multimethod approach (Campbell & Fiske, 1959; Martuza, 1977), and/or (4) factor analysis (Rew, Stuppy, & Becker, 1988).

Contrasted Groups Approach

In the contrasted groups approach, the investigator identifies two groups of individuals who are known to be extremely high and extremely low in the characteristic being measured by the instrument. The instrument is then administered to both the high and low groups, and the differences in the scores obtained by each are examined. If the instrument is sensitive to individual differences in the trait being measured, the mean performance of these two groups should differ significantly. Whether the two groups differ is assessed by use of an appropriate statistical procedure such as the t test or an analysis-of-variance test.

For example, to examine the validity of a measure designed to quantify venous access, the nurse might ask a group of clinical specialists on a given unit to identify a group of patients known to have good venous access and a group known to have very poor access. The nurse would employ the measure with both groups, obtain a mean for each group, then compare the differences between the two means using a t test or other appropriate statistic. If a significant difference between the two means is found, evidence for the construct validity of the measure is provided.

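The venous access example can also be sketched in code. The following minimal Python illustration (not part of the original text; the scores are hypothetical and SciPy is assumed to be available) compares the two known groups with an independent-samples t test, as described above.

```python
from scipy import stats

# Hypothetical venous access scores for two groups identified by
# clinical specialists as extremely high and extremely low on the trait.
good_access = [8, 9, 7, 9, 8, 10, 9, 8]
poor_access = [3, 4, 2, 5, 3, 4, 2, 3]

# Independent-samples t test comparing the two group means.
t_stat, p_value = stats.ttest_ind(good_access, poor_access)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A significant difference in the expected direction provides
# evidence for the construct validity of the measure.
```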