
APPLIED MEASUREMENT IN EDUCATION, 23: 286–306, 2010
Copyright © Taylor & Francis Group, LLC
ISSN: 0895-7347 print / 1532-4818 online
DOI: 10.1080/08957347.2010.486289

Using Confirmatory Factor Analysis and the Rasch Model to Assess Measurement Invariance in a High Stakes Reading Assessment


Jennifer Randall
Research and Evaluation Methods Program
University of Massachusetts, Amherst

George Engelhard, Jr.


Education Studies
Emory University

The psychometric properties and multigroup measurement invariance of scores across subgroups, items, and persons on the Reading for Meaning items from the Georgia Criterion Referenced Competency Test (CRCT) were assessed in a sample of 778 seventh-grade students. Specifically, we sought to determine the extent to which score-based inferences on a high stakes state assessment hold across several subgroups within the population of students. To that end, both confirmatory factor analysis (CFA) and Rasch (1980) models were used to assess measurement invariance. Results revealed a unidimensional construct with factorial-level measurement invariance across disability status (students with and without specific learning disabilities), but not across test accommodations (resource guide, read-aloud, and standard administrations). Item-level analysis using the Rasch Model also revealed minimal differential item functioning across disability status and none across accommodation status.

Correspondence should be addressed to Professor Jennifer Randall, Ph.D., University of Massachusetts, Hills House South, Room 171, Amherst, MA 01003. E-mail: jrandall@educ.umass.edu

The federal government, with the Individuals with Disabilities Education Act of 2004 (IDEA), defines the term "child with a disability" to mean a child with mental retardation, hearing impairments (including deafness), speech or language impairments, visual impairments (including blindness), serious emotional disturbance, orthopedic impairments, autism, traumatic brain injury, other health impairments, or specific learning disabilities and who, by reason thereof, needs special education and related services (Public Law 108-446, 108th Congress). Over 6.5 million infants, children, and youth are currently served under IDEA legislation (U.S. Department of Education, 2007b), which requires that all public school systems provide students with disabilities a free and appropriate education in the least restrictive environment. This least restrictive environment mandate often requires schools and school systems to place students with disabilities in regular, nonspecial education classrooms.
In addition to IDEA, No Child Left Behind (NCLB, 2002) seeks to address,
and to prevent, the practice of excluding disabled students from quality instruction and, consequently, assessment. Although the U.S. Department of Education
(DOE) does not require students with significant cognitive disabilities to achieve at the same levels as non-disabled students under NCLB, the DOE does demand that all other students with less severe disabilities make progress similar to that
of their non-disabled peers. Because many students with disabilities must be
assessed using the same tests as students without disabilities, the need for testing
accommodations that compensate for their unique needs and disabilities becomes
apparent. Yet some may argue that the non-standard accommodations required
by special needs students could undermine the meaningfulness of scores obtained
on a standardized test.
The inclusion of students with disabilities (SWDs) certainly presents some
measurement challenges. Federal law requires that the mandatory assessments
of SWDs meet current psychometric and test standards related to validity, reliability, and fairness of the scores. States must "(i) identify those accommodations for each assessment that do not invalidate the score; and (ii) instruct IEP Teams to select, for each assessment, only those accommodations that do not invalidate the score" (Department of Education, 2007b, p. 177781). The Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) dictate:
Standard 10.1
In testing individuals with disabilities, test developers, test administrators, and test
users should take steps to ensure that the test score inferences accurately reflect the
intended construct rather than any disabilities and their associated characteristics
extraneous to the intent of the measurement. (p. 106)
Standard 10.7
When sample sizes permit, the validity of inferences made from test scores and the
reliability of scores on tests administered to individuals with various disabilities
should be investigated and reported by the agency or publisher that makes the modification. Such investigations should examine the effects of modifications made for people with various disabilities on resulting scores, as well as the effects of administering standard unmodified tests to them. (p. 107)

This study seeks to address these standards by examining evidence of measurement invariance for a set of reading items used on the Georgia Criterion Referenced Competency Test (CRCT). The basic measurement problem addressed is whether or not the probability of an observed score on these reading items depends on an individual's group membership. In other words, measurement invariance requires that students from different groups (students with disabilities and students without disabilities, as well as students who receive resource guide, read-aloud, or standard administrations), but with the same true score, have the same observed score (Wu, Li, & Zumbo, 2007). Meredith (1993) provides a statistical definition of measurement invariance:
The random variable X is measurement invariant with respect to selection on V if F(x | w, v) = F(x | w) for all (x, w, v) in the sample space, where X denotes an observed random variable with realization x; W denotes the latent variable, with realization w, that underlies, or is measured by, X; and V denotes a random variable, with realization v, that functions as a selection of a subpopulation from the parent population by the function s(V), 0 ≤ s(v) ≤ 1. (see Meredith, 1993, p. 528)

Wu et al. (2007) assert that such a general definition is useful in that it can be applied to any observed variables at the item or test level, consequently providing a statistical basis for psychometric techniques such as factor analytic invariance, as well as differential item functioning, or item response theory methods (p. 3). At the test level, factor analysis provides an excellent psychometric framework in that the factor score acts as a surrogate for an individual's true score, and the observed random variables are represented by the items. When assessing data with dichotomous outcomes, factorial invariance is established if the factor loadings and thresholds are equivalent across multiple sub-populations. At the item level, item response models provide an appropriate psychometric framework in that a person's expected score on any one item acts as a proxy for the true score and the observed score on that same item represents the observed random variable. Item-level invariance is established if the item parameters are equivalent across multiple populations. In other words, for all values of θ (the underlying, or latent, construct), the item true scores are identical across groups. Both factorial and item-level equivalence are necessary when one seeks to provide evidence of measurement equivalence. As pointed out by Bock and Jones (1968), "in a well developed science, measurement can be made to yield invariant results over a variety of measurement methods and over a range of experimental conditions for any one method" (p. 9).


Previously, several methods have been employed to establish the measurement invariance of assessment results for SWDs receiving test accommodations. Analysis of variance and analysis of covariance procedures have been used to measure the effects of extended time (Munger & Loyd, 1991; Runyan, 1991) and read-aloud or oral (Meloy, Deville, & Frisbie, 2000; Bolt & Thurlow, 2006; Elbaum, 2007; Elbaum, Arguelles, Campbell, & Saleh, 2004) accommodations. Factor-analytic methods have been used to examine factorial invariance of assessments for SWDs receiving various accommodations such as oral administration of items (Huynh & Barton, 2006; Huynh, Meyer, & Gallant, 2004), extended time (Huesman & Frisbie, 2000; Rock et al., 1987), and large type (Rock, Bennett, & Kaplan, 1985). Similarly, methods that examine item-level equivalence have been utilized with Braille (Bennett, Rock, & Novatkoski, 1989), use of calculator (Fuchs, 2000a), read-aloud (Bielinski, Thurlow, Ysseldyke, Friedebach, & Friedebach, 2001; Bolt & Ysseldyke, 2006; Fuchs, 2000b), and extended time (Cohen, Gregg, & Deng, 2005; Fuchs, 2000b) accommodations.
The purpose of this present study is to describe a coherent framework that can
be used to explore systematically whether or not specific accommodations meet
psychometric criteria of measurement invariance for students with specific learning
disabilities (SLD) on items designed to assess reading for meaning in two stages.
The first stage utilizes the confirmatory factor analysis (CFA) model to establish
unidimensionality and to assess measurement invariance across several subgroups
at the test level, specifically factorial invariance. In the second stage, we present a different approach to assessing measurement invariance using the Rasch (1980) model, an item response theory model, to investigate item-level equivalence. First, we assessed the factor structure of the reading for meaning items by examining whether a single factor underlay the items. Next, we sought to determine whether a one factor measurement model for reading for meaning was invariant across disability status and type of test administration (i.e., assessing factorial invariance). In the second stage of the analysis, we examined the data to ensure the overall fit to the Rasch model. Finally, we sought to test item invariance over disability status and test administration using the Rasch model. This conceptualization of measurement invariance includes a consideration of test-level invariance as defined within the framework of confirmatory factor analysis, as well as item-level and person-level invariance as conceived within Rasch measurement theory.

METHOD
Participants
The students included in this study were drawn from a larger study in Georgia
that examined the effects of two test modifications (resource guide and read-aloud) on the performance of students with and without identified disabilities (Engelhard, Fincher, & Domaleski, 2006). The original study included students
from 76 schools with a wide range of physical, mental, and intellectual disabilities. Because the value and impact of a test accommodation can vary in relation
to the specific disability, we chose to focus only on students identified as having
a specific learning disability (SLD) within the broader category of students with
disabilities. Table 1 provides a description of the demographic characteristics of
the students by disability status (N = 788). Table 2 provides the demographic
characteristics by accommodation category (resource guides, read-aloud, and standard administration). Consistent with previous research that indicates male students
are disproportionately identified as having learning disabilities (DOE, 2007a; Wagner, Cameto, & Guzman, 2003; Wagner, Marder, Blackorby, & Cardoso, 2002), 70% of the 219 students with specific learning disabilities were male.
According to the Georgia Department of Education website, over 700,000 full
academic year students participated in the statewide Criterion Referenced
Competency Test in reading. Due to NCLB mandates, student ethnicity must also
be tracked and reported. This information can be used to infer the overall demographic make-up of all test-takers (as all K8 students in Georgia are required to
complete the CRCT) in order to assess the representativeness of our sample.
Across all ethnic groups the sample and population proportions were nearly identical. For example, 47.3% of public school students in Georgia are White, and 48.0% of our sample was composed of White students. Similarly, 38.1% of Georgia's population of students are Black, and 40.3% of our sample was composed of Black students. Hispanic students compose 8.99% of the student population, and 6.0% of our sample. In an effort to achieve equal group sizes, students with disabilities were oversampled in the original study. We would like to note, however, that 13.39% of Georgia's tested public school students have identified disabilities. We feel confident that our sample adequately represents the student population of Georgia.

TABLE 1
Demographic Characteristics of Seventh-Grade Students by Disability Status

                                          SWOD        SLD         Total
                                          n = 569     n = 219     n = 788
                                          72.2%       27.8%

Gender (percentages)
  1. Male (n = 410)                        32.4        19.7        52.0
  2. Female (n = 376)                      39.6         8.1        47.7

Race/Ethnicity (percentages)
  1. Asian, Pacific Islander                2.3         0.8         3.1
  2. Black, Non-Hispanic                   30.0        10.3        40.3
  3. Hispanic                               3.9         2.0         6.0
  4. American Indian, Alaskan Native        0.0         0.1         0.1
  5. White, Non-Hispanic                   34.2        13.7        48.0
  6. Multiracial                            1.7         0.9         2.5

Note. SWOD = Students Without Disabilities; SLD = Specific Learning Disability.
Instrument
Developed in 2000, the Georgia Criterion Referenced Competency Test (CRCT)
is a large state standardized assessment designed to measure how well public K–8 students in Georgia have mastered the Quality Core Curriculum (QCC). The QCCs are the curriculum content standards developed by the Georgia Department of Education for its public schools. Georgia law requires that all students in grades 1–8 be assessed in the content areas of reading, English/language arts, and mathematics. In addition, students in grades 3–8 are assessed in both social studies and science. The CRCT yields information on academic achievement
that can be used to diagnose individual student strengths and weaknesses as
related to instruction of the QCC, and to gauge the quality of education throughout Georgia. The reading CRCT for sixth-grade students consists of 40 operational selected-response items and 10 embedded field test (FT) items (FT items
do not contribute to the students score) within four content domains: reading for
vocabulary improvement, reading for meaning, reading for critical analysis, and
reading for locating and recalling. Twelve items from the reading for meaning
domain were selected and analyzed here because this domain most closely represents what is commonly referred to as reading comprehension. Reading for
Meaning is defined as the recognition of underlying and overall themes and concepts in fiction and nonfiction literature as well as the main idea and details of the
text. It also includes the recognition of the structure of information in fiction and
nonfiction. Items in the reading for meaning domain include identifying literary
forms; identifying purpose of text; identifying characters and their traits; recognizing sequence of events; recognizing text organization/structure; recognizing
explicit main idea; and retelling or summarizing.
Data Collection
All state schools were stratified into one of three categories based on the proportion of students receiving free and reduced lunch in each school. Within those
categories, schools were then randomly selected and assigned to one of three conditions (resource guide test modification; read-aloud, or oral, test modification; or the standard test condition), and all students (both students with
disabilities and without disabilities) within the school were tested under the same
condition. Two of the three conditions involved the use of a test modification, and the third condition involved the standard administration of the test. It should be noted that, for the purposes of the larger original study, all students were
tested under standard operational conditions at the end of the sixth grade during
the regular operational administration of the reading exam. The assignment to
one of three conditions involved the second administration of the same test which
was given the following spring when students were in the seventh grade. In summary, every student completed the reading exam under standard, operational conditions and then a second time under one of three conditions. Data from the
second experimental administration were analyzed for the purposes of this study.
Description of Test Modifications
The resource guide consisted of a single page (front and back) guide that provided students with key definitions and examples that were hypothesized to be
helpful. The resource guides were designed to provide students with scaffolded
support, much like the support they would receive in the classroom or that English language learners receive from a translation dictionary. The guides included commonly
used definitions of academic terms and vocabulary words (provided in alphabetical order as in a dictionary) that could be applied to the test. These terms were not
assessed by the exam, but rather provided explanations of construct-irrelevant
words, expressions, or phrases that might be found in the passages or within the
item stems. For example, a question may ask the student to identify the central
idea of the passage. The resource guide indicated that the central idea meant the
main point. Similarly, vocabulary within a passage that a student might not be familiar with, but that was not directly assessed, was defined in hopes that providing such support would increase the student's comprehension of the overall passage.
Vocabulary that was assessed was not defined. The guides were developed by a
committee of Georgia Department of Education specialists from assessment, curriculum, and special education offices. Careful attention was given to the constructs measured by the test items. The intent of the guides was to provide
students with information they could use to answer questions on the test, but
would not provide the students with the answers themselves. It was hypothesized
that the removal of construct-irrelevant vocabulary, or expressions, would
improve student performance on the exam as they would be able to focus on the
intended construct without confusion or frustration. One could imagine the
resource guide as a glossary of important terms used throughout the exam.
Because the use of resource guides was new for most students, students were
given the opportunity to work through a sample test using the resource guide.
Teachers were allowed to review the sample test with students and provide pointers, if necessary, on how the sample test related to the resource guide. Because
the test material is secure, it is not possible to reveal the actual content of the
resource guides here.


The read-aloud administration involved the teacher reading the entire test to
students, including reading passages and questions. Teachers were instructed to
read the test to students at a natural pace. Students were encouraged to read along
silently with teachers. The third type of administration was simply the standard
administration in which the test was administered in the standard format as if it
were an operational administration. Engelhard et al. (2006) should be consulted
for additional details regarding the full study.
Procedures
Data analyses were conducted in two stages. In the first stage, analyses were conducted with Mplus computer software (Muthén & Muthén, 1998–2007) using the confirmatory factor-analytic model for dichotomous data as defined by Muthén and Christofferson (1981):

x_g = τ_g + Λ_g ξ_g + δ_g,   (1)

where
x_g is a vector of observed scores for each group,
τ_g is a vector of item intercepts (or thresholds),
Λ_g is a matrix of the factor loadings,
ξ_g is a vector of factor scores (latent variables), and
δ_g is a vector of errors.
With the CFA model, the relationship between observed variables (in this case 12 reading items) and the underlying construct they are intended to measure (in this case reading for meaning) is modeled with the observed response to an item represented as a linear function of the latent construct (ξ), an intercept/threshold (τ), and an error term (δ). The factor loading (λ) describes the amount of change in x due to a unit of change in the latent construct (ξ). The parameters of the thresholds (τ^(g)) and the factor loadings (Λ^(g)) describe the measurement properties of dichotomous variables. If these measurement properties are invariant across groups, then

τ^(1) = τ^(2) = … = τ^(G) = τ,
Λ^(1) = Λ^(2) = … = Λ^(G) = Λ,

where G represents each group (see Muthén & Christofferson, 1981, p. 408).
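To make this constraint concrete, the short sketch below (Python, assuming a probit link of the form P(x_i = 1 | ξ) = Φ(λ_iξ − τ_i); the parameter values are hypothetical, and this is an illustration rather than the Mplus estimation routine) shows that when an item's loading and threshold are equal across groups, the item probability curves for those groups coincide, whereas a group-specific loading produces different curves.

```python
# Illustrative sketch (not the Mplus estimator): item response probabilities for a
# dichotomous one-factor CFA under a probit link, P(x_i = 1 | xi) = Phi(lambda_i*xi - tau_i).
# When loadings and thresholds are equal across groups, the curves coincide.
import numpy as np
from scipy.stats import norm

def item_prob(xi, loading, threshold):
    """Probability of a correct response given factor score xi."""
    return norm.cdf(loading * xi - threshold)

# Hypothetical parameters for a single item in three groups.
group_a = {"loading": 0.9, "threshold": -0.3}
group_b = {"loading": 0.9, "threshold": -0.3}   # invariant: same loading and threshold
group_c = {"loading": 0.6, "threshold": -0.3}   # non-invariant loading

xi = np.linspace(-3, 3, 7)
print(np.allclose(item_prob(xi, **group_a), item_prob(xi, **group_b)))  # True
print(np.allclose(item_prob(xi, **group_a), item_prob(xi, **group_c)))  # False
```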


Confirmatory factor analysis (CFA) was used to test the one factor measurement model of the 12 reading for meaning items with six groups, one for the full
sample and one for each of the five subgroups of interest: students with specific
learning disabilities, students without disabilities, students who received the
resource guides, students who received the read-aloud administration, and students who received the standard administration. All items were hypothesized to
be a function of a single latent factor, and error terms were hypothesized to be
uncorrelated. In each model, the factor loading from the latent factor to the first
item was constrained to 1.0 to set the scale of measurement.
All parameters were estimated using robust weighted least squares (WLSMV) with delta parameterization. With WLSMV estimation, probit regressions for the factor indicators regressed on the factors are estimated. We used the chi-square statistic to assess how well the model reproduced the covariance matrix. Because this statistic is sensitive to sample size and may not be a practical test of model fit (Cheung & Rensvold, 2002), we used two additional goodness-of-fit indexes less vulnerable to sample size: the comparative fit index (CFI) and the root mean square error of approximation (RMSEA). CFI values near 1.0 are optimal, with values greater than .90 indicating acceptable model fit (Byrne, 2006). With RMSEA, values of 0.0 indicate the best fit between the population covariance matrix and the covariance matrix implied by the model and estimated with sample data. Generally, values less than .08 are considered reasonable, with values less than .05 indicating a closer approximate fit (Kline, 2005).
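For reference, the conventional (maximum likelihood-based) formulas behind these two indexes can be sketched as follows; the robust WLSMV versions reported by Mplus use adjusted chi-square values, so the numbers produced by this sketch are illustrative rather than a reproduction of the software output.

```python
# Hedged sketch of the conventional RMSEA and CFI formulas; the robust WLSMV
# variants reported by Mplus differ in how the chi-square values are adjusted.
import math

def rmsea(chi2: float, df: int, n: int) -> float:
    # Root mean square error of approximation (one common form).
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_model: float, df_model: int, chi2_baseline: float, df_baseline: int) -> float:
    # Comparative fit index relative to the independence (baseline) model.
    d_model = max(chi2_model - df_model, 0.0)
    d_baseline = max(chi2_baseline - df_baseline, d_model, 0.0)
    return 1.0 - (d_model / d_baseline if d_baseline > 0 else 0.0)

# Using the unconstrained-model values reported later in Table 3 purely as an
# illustration (chi-square = 117.76 on 112 df, N = 786) gives an RMSEA of about .008.
print(round(rmsea(117.76, 112, 786), 3))
```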
Because identical model specification for each subgroup does not guarantee
that item measurement is equivalent across groups (Byrne & Campbell, 1999),
we conducted a series of tests for multigroup invariance by examining two
increasingly restrictive hierarchical CFA models. Models were run separately by
disability status as well as accommodation category, and the fit statistics were
used to verify adequate model fit before proceeding to subsequent steps (Byrne,
2006). Muthén and Muthén (1998–2007) recommend a set of models to be considered for measurement invariance of categorical variables, noting that because the item probability curve is influenced by both parameters, factor loadings and thresholds must be constrained in tandem. With the baseline model, thresholds and factor loadings are free across groups; scale factors are fixed at one in all groups; and the factor means are fixed at zero in all groups (to ensure model identification). This baseline model provides a model against which the subsequent invariance model can be compared. In the second model, factor loadings and thresholds were constrained to be invariant across the groups; scale factors were fixed at one in one group and free in the others; factor means were fixed at zero in one group and free in the others (the Mplus default). This is because the variances of the latent response variables are not required to be equal across subgroups (Muthén & Muthén, 1998–2007). Because the chi-square values for WLSMV estimations cannot be used for chi-square difference tests, we compared the fit of the two
models using the DIFFTEST option to determine if an argument for factorial invariance could be supported. The DIFFTEST procedure (available in Mplus) tests the null hypothesis that the restricted model does not worsen model fit (i.e., that the constrained model fits as well as the unconstrained model). Non-significant p-values therefore indicate equivalent model-data fit consistent with factorial invariance. In the absence of full factorial invariance, data were also examined to determine if partial measurement invariance was present. Partial measurement invariance applies when factors are configurally invariant (as in the baseline model) but do not demonstrate metric (factor loadings) invariance (Byrne, Shavelson, & Muthén, 1989). Byrne et al. (1989) assert that further tests of invariance and analysis can continue as long as configural invariance has been established and at least one item is metrically invariant. In such cases, Steenkamp and Baumgartner (1998) recommend that invariance constraints be relaxed for highly significant modification indices in order to minimize chance model improvement and maximize the cross-validity of the model.
In the second stage, data analyses were conducted with FACETS, a multifaceted Rasch measurement (MRM) computer program (Linacre, 2007). Three models were fit to these data. Model I can be written as follows:

ln[P_nijk1 / P_nijk0] = θ_n − δ_i − α_j − λ_k,   (2)

where
P_nijk1 = probability of person n succeeding on item i for group j and administration k,
P_nijk0 = probability of person n failing on item i for group j and administration k,
θ_n = location of person n on the latent variable,
δ_i = difficulty of item i,
α_j = location of group j, and
λ_k = location of administration k.
This model dictates that student achievement in reading for meaning is the latent
variable that is made observable through a set of 12 reading items, and that the
items vary in their locations on this latent variable. Unlike the CFA model, the
Rasch Model (a) allows for person and item parameters to be estimated independently of each other and (b) includes no item discrimination parameter (or item
loadings) as it is assumed to be equal across all items. The observed responses
are dichotomous (correct or incorrect), and they are a function of both person
achievement and the difficulty of the item. Group membership (dichotomously
scored as student with a specific learning disability or student without disability) and type of administration (standard, resource guide, read-aloud) may influence
person achievement levels. Once estimates of the main-effect parameters were obtained, the residuals were defined. The unstandardized residual reflects the difference between the observed and expected responses:

R_nijk = x_nijk − P_nijk.   (3)

A standardized residual can also be defined as follows:

Z_nijk = (x_nijk − P_nijk) / [P_nijk(1 − P_nijk)]^(1/2).   (4)

These residuals can be summarized to create mean square error (MSE) statistics (labeled Infit and Outfit statistics in the FACETS computer program) for each item and person. These MSE statistics can also be summarized over items and persons, as well as subsets of items and subgroups of persons. See Wright and Masters (1982) for a description of the Rasch-based fit statistics.
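A minimal numeric sketch of Equations 2 through 4 and the resulting mean square statistics is given below (Python, with simulated responses and hypothetical person and item parameters; this illustrates the computations rather than the FACETS calibration itself). The outfit statistic is the unweighted mean of the squared standardized residuals, and the infit statistic is the information-weighted version; both have an expected value near 1.0 under good model-data fit.

```python
# Sketch of Eqs. 2-4 and the infit/outfit mean square statistics (not the FACETS program).
# All parameter values are hypothetical and the responses are simulated from the model.
import numpy as np

def p_correct(theta, delta, alpha, lam):
    """P(x = 1) for person theta, item delta, group alpha, administration lam (Eq. 2)."""
    logit = theta - delta - alpha - lam
    return 1.0 / (1.0 + np.exp(-logit))

rng = np.random.default_rng(0)
theta = rng.normal(0.94, 1.64, size=200)            # hypothetical person measures
delta = np.linspace(-1.0, 1.8, 12)                  # hypothetical item difficulties
P = p_correct(theta[:, None], delta[None, :], 0.0, 0.0)
x = (rng.random(P.shape) < P).astype(float)         # simulated dichotomous responses

resid = x - P                                       # Eq. 3: unstandardized residual
var = P * (1.0 - P)                                 # binomial variance of each response
z = resid / np.sqrt(var)                            # Eq. 4: standardized residual

outfit = np.mean(z**2, axis=0)                      # per-item unweighted mean square
infit = np.sum(var * z**2, axis=0) / np.sum(var, axis=0)  # per-item weighted mean square
print(np.round(outfit, 2))
print(np.round(infit, 2))                           # values near 1.0 indicate good fit
```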
In addition to establishing item parameters and model fit statistics, the FACETS program was used to examine uniform differential item functioning (DIF).
DIF is present when the locations of the items vary, beyond sampling error,
across group characteristics, such as gender, race, or disability status. If, as a
researcher, one suspects that certain characteristics may interact or behave differently than others, one can simply add an interaction term for those two characteristics. Model II focuses on examining the interaction effects between items i and
group j (diaj). This can be written as follows

ln Pnijk1 / Pnijk0 = n i j k i j

[5]

Student groups are defined as students with and without disabilities. This model
explores whether or not the items are functioning invariantly over disability status.
The final model examined, Model III, assesses possible interaction effects
between items i and administration k (dik). It can be written as

ln Pnijk1 / Pnijk0 = n i j k i k

[6]

This model explores whether or not the items functioned invariantly across test
administrations. For both Model II and Model III, FACETS provides bias measures in terms of logits. These estimates are reported as t-scores (bias measure divided by its standard error) with finite degrees of freedom. When dealing
with more than 30 observations, t-scores with an absolute value greater than two are considered statistically significant, indicating differential item functioning and a threat to item-level invariance. Because we can expect statistically significant results to appear by chance due to the use of multiple significance tests, we used the Bonferroni multiple comparison correction to guard against spurious significance. To test the hypothesis that there is no DIF in this test at the p < .05 level, the most significant DIF effect must have a p-value less than .05 divided by the number of item-DIF contrasts.
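The sketch below illustrates this screening rule (Python, with hypothetical t-scores rather than actual FACETS output): each item-by-group contrast is flagged only if its two-sided p-value falls below .05 divided by the number of contrasts tested.

```python
# Hedged sketch of the Bonferroni screening rule described above; the t-scores are
# hypothetical and a normal approximation is used for the two-sided p-values.
from scipy.stats import norm

def bonferroni_dif_flags(t_scores, alpha=0.05):
    """Return indices of contrasts whose |t| survives the Bonferroni correction."""
    threshold = alpha / len(t_scores)                 # adjusted per-contrast alpha
    flags = []
    for idx, t in enumerate(t_scores):
        p_two_sided = 2 * (1 - norm.cdf(abs(t)))      # large-df normal approximation
        if p_two_sided < threshold:
            flags.append(idx)
    return flags

# Twelve hypothetical item-by-disability contrasts; only |t| = 3.53 survives the correction.
t_scores = [0.4, -1.1, 2.1, -2.3, 0.8, 1.5, 3.53, -0.2, 1.9, -1.4, 0.6, 2.0]
print(bonferroni_dif_flags(t_scores))   # [6]
```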

RESULTS
Study results are discussed within the frameworks of the two stages: first, the results of the multigroup confirmatory factor analyses using Mplus software; and second, the results of the Rasch analyses using FACETS software.
Stage 1: Confirmatory Factor Analyses
Results for Stage 1 of the data analysis are divided into three subsections. The first
subsection addresses the fit of the measurement model within each subgroup:
SLDs, students without disabilities, and students who received the resource guide,
read-aloud, and standard administrations. The second subsection details the examination of factorial invariance across test administration. The final subsection
describes the examination of factorial invariance across disability status.
Model fit within each subgroup. Recall that five separate CFAs were conducted to examine the measurement models of reading for meaning for each subgroup of interest. The reading for meaning measurement model demonstrated excellent model fit for students without a specific learning disability, χ²(45) = 41.20, p = .63, CFI = 1.00, RMSEA = .00; for students with SLD, χ²*(40) = 53.222, p = .08, CFI = .96, RMSEA = .04; for students who received the resource guide test administration, χ²(38) = 40.24, p = .37, CFI = 1.00, RMSEA = .02; for students who received the read-aloud administration, χ²(36) = 44.48, p = .16, CFI = .98, RMSEA = .03; and for students who received the standard administration, χ²(8) = 31.97, p = .74, CFI = 1.00, RMSEA = .00. Consequently, when testing groups for factorial invariance, we specified the same model for all subgroups.

*Degrees of freedom for these groups differ due to the way in which they are computed for the WLSMV estimator.


Factorial invariance across administration type. Recall that to assess between-group invariance, we examined change in fit statistics between the baseline model (i.e., factor loadings and thresholds free) and Model 2, in which factor loadings and thresholds were constrained to be equal or invariant. Our findings, presented in Table 3, suggest excellent overall fit for the baseline model, χ²(112) = 117.76, p = .34, CFI = 1.00, RMSEA = .01. Model 2 also reflects adequate model fit, χ²(122) = 152.80, p = .03, CFI = .98, RMSEA = .03. Using the DIFFTEST option in Mplus, we assessed whether Model 2 (nested model) was significantly different from the less restrictive model: χ²(18) = 43.27, p < .01. Results suggest that the factor structure of the reading for meaning domain is not invariant across the three test administrations. Consequently, the data were investigated to determine if partial measurement invariance (Byrne, Shavelson, & Muthén, 1989) could be established across test administrations.

TABLE 2
Demographic Characteristics of Seventh-Grade Students by Test Administration

                                        Resource Guides   Read Aloud   Standard    Total
                                        n = 254           n = 257      n = 275     n = 786
                                        32.3%             32.7%        35.0%

Gender (percentages)
  1. Male (n = 410)                      16.8              16.9         18.4        52.2
  2. Female (n = 376)                    15.5              15.8         16.5        47.8

Race/Ethnicity (percentages)
  1. Asian, Pacific Islander              1.5               0.1          1.4         3.1
  2. Black, Non-Hispanic                  9.2              13.2         17.9        40.3
  3. Hispanic                             1.7               2.4          1.9         6.0
  4. American Indian, Alaskan Native      0.0               0.0          0.1         0.1
  5. White, Non-Hispanic                 18.8              16.0         13.1        48.0
  6. Multiracial                          1.1               0.9          0.5         2.5

TABLE 3
Tests for Invariance for Reading for Meaning Measurement Model Across Test Administration: Summary of Goodness of Fit Statistics

Equality Test                       χ²        df     CFI    RMSEA    Δχ²      p-value
No Constraints (Configural)         117.76    112    1.00    .01      __       __
Factor Loadings & Thresholds        152.80    122     .98    .03     43.27     .00
Free Items 6 & 12                   133.05    120     .99    .02     22.81     .08

Note. χ² = chi-square statistic based on robust weighted least squares estimation; df = degrees of freedom; CFI = comparative fit index; RMSEA = root mean square error of approximation. Robust statistics are reported. Students who received resource guide (n = 254), students who received read-aloud (n = 257), students who received standard administration (n = 275).

Examination of the modification indices revealed that releasing the equality constraints of both the factor loadings and thresholds of Items 6 and 12 resulted in a better overall model, χ²(120) = 133.05, p = .20, CFI = .99, RMSEA = .02, and a non-significant chi-square test for difference, χ²(15) = 22.81, p = .08. Closer examination of the unconstrained parameter estimates, displayed in Table 4, revealed that Item 6 was less discriminating and easier for students in the read-aloud test administration than for students in the resource guide and standard test administrations. Furthermore, Item 12 was more discriminating and easier for students in the resource guide test administration than in the read-aloud or standard administrations. These findings suggest partial measurement invariance, or factorial invariance for a majority of the items.
Factorial invariance across disability type. In the next series of models within Stage 1 we examined factorial invariance across disability status, as presented in Table 5. As in the analyses of test administration, results indicate excellent overall fit across disability status with the baseline model, χ²(84) = 95.40, p = .19, CFI = .99, RMSEA = .01. Model 2 (factor loadings and thresholds constrained) also demonstrated adequate fit, χ²(90) = 98.95, p = .24, CFI = .99, RMSEA = .02. Again, using the DIFFTEST option in Mplus, we assessed whether Model 2 (nested model) was significantly different from the less restrictive model: χ²(10) = 8.14, p = .62. Results support complete factorial invariance across disability type.

TABLE 4
Item 6 and Item 12 Unconstrained Parameter Estimates

                Resource Guide               Read Aloud                 Standard
Item       Factor Loading   Threshold   Factor Loading   Threshold   Factor Loading   Threshold
6              1.198          −.644          .735          −.885         1.030          −.523
12              .923          −.239          .838          −.042          .835          −.080

TABLE 5
Tests for Invariance for Reading for Meaning Measurement Model Across Disability Status: Summary of Goodness of Fit Statistics

Equality Test                       χ²       df    CFI    RMSEA    Δχ²     p-value
No Constraints (Configural)         95.40    84    .99     .01      __      __
Factor Loadings & Thresholds        98.95    90    .99     .02     8.14     .62

Note. χ² = chi-square statistic based on robust weighted least squares estimation; df = degrees of freedom; CFI = comparative fit index; RMSEA = root mean square error of approximation. Robust statistics are reported. Regular education students (n = 569), students with specific learning disabilities (n = 219).

Given evidence that the measurement model representing the latent reading
ability for the reading for meaning factor was invariant across disability status
and demonstrated partial invariance across test administration, we ran a final CFA for the full sample using the original model (all items loading on the latent factor reading for meaning). This final full model showed excellent fit to the data, χ²(47) = 56.89, p = .15, CFI = 1.00, RMSEA = .02. Stage 1 results
provide strong evidence that at the test level (a) the reading for meaning domain
is a unidimensional construct and (b) the factorial structure is fully invariant
across disability status and partially invariant across administration type.
Stage 2: Multifaceted Rasch Measurement
Next, we turned our attention to Stage 2 of our data analysis based on the Rasch
measurement model. The results within this stage are divided into three subsections. The first subsection presents the main effects model (Model I). The second
and third subsections explore the interaction between items and disability status
(Model II) and test administration (Model III).
Model I: Main effects model. Figure 1 displays a variable map representing the calibrations of the students, items, conditions, and groups. The FACETS computer program (Linacre, 2007) was used to calibrate the four facets. The first column of Figure 1 represents the logit scale. The second column of the variable map displays the student measures of reading (for meaning) achievement. Higher ability students appear at the top of the column, while lower ability students appear at the bottom. Each asterisk represents 8 students. The student achievement measures range from −4.36 logits to 4.49 logits (M = .94, SD = 1.64, N = 786). The third column shows the locations of the administration conditions on the latent variable. Administrations appearing higher in this column yielded higher achievement. In the case of the reading for meaning items, the read-aloud administration yielded slightly higher results than both the standard and resource guide administrations; the resource guide administration yielded the lowest results overall. Group differences are shown in column four of the variable map. As expected, the overall achievement of the students without specific learning disabilities was higher on average as compared to the students with specific learning disabilities. The fifth and final column represents the location of the reading for meaning items, with item difficulty ranging from −1.02 logits to 1.86 logits (M = .00, SD = .84, N = 12).
Table 6 presents a variety of summary statistics related to the FACETS analyses. The items, administrations, and disability status are anchored at zero by definition. In order to define an unambiguous frame of reference for the model, only one facet (student measure) is allowed to vary.

[Figure 1 (variable map) appears here: columns show the logit scale, student measures, administration conditions, disability groups, and the 12 reading for meaning items.]

FIGURE 1 Variable map of reading ability. * = 8 students. Higher values for the student, type of administration, and disability status facets indicate higher scores on the reading ability construct. Higher values on the item facet indicate harder items.


TABLE 6
Summary Statistics for FACETS Analysis (Reading for Meaning Items, Grade 7)

                              Students     Item     Administration     Disability Status
Measures
  Mean                            .94       .00           .00                .00
  SD                             1.64       .84           .12                .30
  n                               786        12             3                  2
INFIT
  Mean                           1.00       .99          1.00               1.02
  SD                              .24       .11           .01                .05
OUTFIT
  Mean                           1.00      1.00          1.00               1.03
  SD                              .59       .23           .02                .07
Reliability of Separation         .65*      .99*          .83*               .98*
Chi-Square Statistic           1795.6     884.8          17.6              107.4
Degrees of Freedom                785        11             2                  1

*p < .01.

The overall model-data fit is quite good. The expected value of the mean square error statistics (infit and outfit) is 1.00 with a standard deviation of .20, and the expected values for these statistics are very close to the observed values. The most prominent exception is the student facet, which has more variation than expected for the outfit statistic (M = 1.00, SD = .59).
As shown in Table 6, all four of the reliability of separation statistics are statistically significant (p < .01): students, disability status, type of administration, and items. The reliability of separation statistic is conceptually equivalent to Cronbach's coefficient alpha, and it is used to test the hypothesis of whether or not there are significant differences between the elements within a facet. The largest reliability of separation index is .99 (Items), indicating a good spread of reading for meaning items on the latent variable. The smallest reliability of separation index is .65 (Students). Given the small number of items (N = 12), this is comparable to the values obtained for other subtests in similar situations. Both the type of administration (.83) and disability status (.98) were also well differentiated.
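As a point of reference, the reliability of separation is commonly computed as the proportion of observed variance in a facet's estimated measures that is not attributable to estimation error; the sketch below (Python, with a hypothetical average standard error, and not the exact FACETS computation) illustrates the idea.

```python
# Hedged sketch of the separation reliability idea: true variance divided by observed
# variance, where true variance is the observed variance minus the error variance.
def separation_reliability(observed_sd: float, rmse: float) -> float:
    true_variance = max(observed_sd**2 - rmse**2, 0.0)
    return true_variance / observed_sd**2

# Hypothetical values: measures with SD = 1.64 and an average standard error of 0.97
# give a reliability of about .65, comparable to the Students value in Table 6.
print(round(separation_reliability(1.64, 0.97), 2))
```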
Model II: Item × disability status interactions. Model II explores the
interactions among items and disability status. This model explores whether or
not the items are functioning invariantly over groups (i.e., differential item functioning). Two items (4 & 7) exhibited statistically significant differential item
functioning. Recall that the use of multiple significance tests can result in spurious significance, so Bonferroni multiple comparison tests were used to confirm

ASSESSING MEASUREMENT INVARIANCE IN READING ASSESSMENT

303

any apparent DIF. Comparison tests revealed only one statistically significant
item. Item 7 was differentially easier for students with specific learning disabilities with an observed score of 0.32, but an expected score of 0.23, t = 3.53.
Model III: Item × test administration interactions. Model III explores item-level invariance across test administrations. We found no statistically significant interaction bias between test administration and the reading for meaning items, suggesting complete item-level invariance across the type of test administration.

SUMMARY AND DISCUSSION


The major contribution of this study is to encourage a systematic approach to establishing measurement invariance for large-scale state assessments with dichotomous data. By combining and integrating both confirmatory factor-analytic and
Rasch measurement procedures, practitioners are able to develop a more complete picture of the extent to which score-based inferences from these measures
hold across several subgroups within a population of students. Although establishing measurement invariance is essential for all tests/measures that seek to
make inferences across multiple groups, it is particularly necessary when these
inferences have high stakes consequences (i.e., promotion/retention/graduation).
Add to this, the legal obligation of a school system to assess accurately protected
or vulnerable groups (i.e., students with disabilities), and the significance of this
study becomes apparent.
A two-stage approach was utilized. The first stage works within a CFA framework to establish both unidimensionality and test-level measurement invariance,
specifically factorial equivalence. Assuming factorial equivalence is established
in the first stage, the second stage works within a narrower conceptual framework focusing on invariance at the item-level using a model that allows for the
separation of item and person parameters. These complementary methods enable
the practitioner to address issues of model-data misfit to ensure accurate interpretation of test scores.
The results of this study provide strong evidence that the reading for meaning
items of the CRCT exhibit test-level invariance across SLDs and students without disabilities. The factorial invariance across test administration, however, is
less clear. Multigroup confirmatory factor analysis revealed a one-factor model
with partial measurement invariance (when Items 6 and 12 are freely estimated). These findings suggest that the use of read-aloud and resource guide accommodations may
change the underlying structure of the exam. Further examination into the utility
and appropriateness of these test accommodations may be necessary.
Analyses using the Rasch Model also suggest overall good item fit (outfit = 0.99)
with only one item exhibiting evidence consistent with differential item functioning across disability status. Students with SLD performed differentially better than expected on Item 7. Closer examination of this item also reveals some mild item
misfit (outfit = 1.26). The tendency of this item to function differentially across
disability status, its lack of fit to the measurement model, as well as its extremely
low p-value (0.39) suggest that it is a threat to item-level measurement invariance
and should be examined more closely by measurement professionals and practitioners. Indeed, such results suggest a clear need for detailed qualitative interpretations of the quantitative analysis. The two-stage approach to assessing
measurement invariance described in this article provides a useful template that
can be used, in conjunction with qualitative evaluations, to aid in establishing
fairness and equity in high stakes testing.

ACKNOWLEDGMENTS
We thank Chris Domaleski and Melisa Fincher for providing us with access to
the data set. The opinions expressed in this article are those of the authors, and they
do not reflect the views of the Georgia Department of Education.

REFERENCES
Asparouhov, T., & Muthén, B. (2006). Robust chi-square difference testing with mean and variance adjusted test statistics. Mplus Web Notes: No. 10. Retrieved from http://www.statmodel.com/download/webnotes/webnote10.pdf
Bennett, R., Rock, D., & Novatkoski, I. (1989). Differential item functioning on the SAT-M Braille edition. Journal of Educational Measurement, 26(1), 67–79.
Bielinski, J., Thurlow, M., Ysseldyke, J., Freidebach, J., & Freidebach, M. (2001). Read aloud accommodations: Effects on multiple choice reading and math items (Technical Report 31). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.
Bock, R. D., & Jones, L. V. (1968). The measurement and prediction of judgment and choice. San Francisco: Holden-Day.
Bolt, S., & Thurlow, M. (2006). Item level effects of the read aloud accommodation for students with disabilities (Synthesis Report 65). Minneapolis: University of Minnesota, National Center on Educational Outcomes.
Bolt, S., & Ysseldyke, J. (2006). Comparing DIF across math and reading/language arts tests for students receiving a read-aloud accommodation. Applied Measurement in Education, 19(4), 329–355.
Byrne, B. (2006). Structural equation modeling with EQS: Basic concepts, applications, and programming (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Byrne, B., & Campbell, T. L. (1999). Cross-cultural comparisons and the presumption of equivalent measurement and theoretical structure: A look beneath the surface. Journal of Cross-Cultural Psychology, 30(5), 555–574.
Byrne, B., Shavelson, R., & Muthén, B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105, 456–466.


Cheung, G., & Rensvold, R. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling: A Multidisciplinary Journal, 9, 233–245.
Cohen, A., Gregg, N., & Deng, M. (2005). The role of extended time and item content on a high-stakes mathematics test. Learning Disabilities Research and Practice, 20(4), 225–233.
Elbaum, B., Arguelles, M., Campbell, Y., & Saleh, M. (2004). Effects of a student-reads-aloud accommodation on the performance of students with and without learning disabilities on a test of reading comprehension. Exceptionality, 12(2), 71–87.
Elbaum, B. (2007). Effects of an oral testing accommodation on the mathematics performance of secondary students with and without learning disabilities. The Journal of Special Education, 40(4), 218–229.
Engelhard, G., Fincher, M., & Domaleski, C. S. (2006). Examining the reading and mathematics performance of students with disabilities under modified conditions: The Georgia Department of Education modification research study. Atlanta: Georgia Department of Education.
Fuchs, L. (2000a, July). The validity of test accommodations for students with disabilities: Differential item performance on mathematics tests as a function of test accommodations and disability status. Final report: U.S. Department of Education through the Delaware Department of Education.
Fuchs, L. (2000b, July). The validity of test accommodations for students with disabilities: Differential item performance on reading tests as a function of test accommodations and disability status. Final report: U.S. Department of Education through the Delaware Department of Education.
Huesman, R., & Frisbie, D. (2000, April). The validity of ITBS reading comprehension test scores for learning disabled and non learning disabled students under extended time conditions. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.
Huynh, H., Meyer, J., & Gallant, D. (2004). Comparability of student performance between regular and oral administrations for a high stakes mathematics test. Applied Measurement in Education, 17(1), 39–57.
Huynh, H., & Barton, K. (2006). Performance of students with disabilities under regular and oral administrations of a high-stakes reading examination. Applied Measurement in Education, 19(1), 21–39.
Kline, R. (2005). Principles and practice of structural equation modeling (2nd ed.). New York: Guilford.
Linacre, J. M. (2007). A user's guide to FACETS: Rasch-model computer programs. Chicago: winsteps.com.
Meloy, L., Deville, C., & Frisbie, D. (2000, April). The effect of a reading accommodation on standardized test scores of learning disabled students. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.
Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58(4), 525–543.
Munger, G., & Loyd, B. (1991). Effect of speededness on test performance of handicapped and nonhandicapped examinees. Journal of Educational Research, 85(1), 53–57.
Muthén, B., & Christofferson, A. (1981). Simultaneous factor analysis of dichotomous variables in several groups. Psychometrika, 46(4), 407–419.
Muthén, L. K., & Muthén, B. O. (1998–2007). Mplus user's guide (5th ed.). Los Angeles, CA: Muthén & Muthén.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: The University of Chicago Press. (Original work published 1960)
Rock, D., Bennett, R., & Kaplan, B. (1985). Internal construct validity of the SAT across handicapped and nonhandicapped populations (ETS Research Report RR-85-50). Princeton, NJ: Educational Testing Service.
Rock, D., Bennett, R., & Kaplan, B. (1987). Internal construct validity of a college admissions test across handicapped and nonhandicapped groups. Educational and Psychological Measurement, 47(1), 193–205.


Runyan, M. (1991). The effect of extra time on reading comprehension scores for university students with and without learning disabilities. Journal of Learning Disabilities, 24(2), 104–108.
Steenkamp, J.-B. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25, 78–90.
U.S. Department of Education. (2007a). Demographic and school characteristics of students receiving special education in elementary grades (NCES Publication 2007-005). Jessup, MD: National Center for Education Statistics.
U.S. Department of Education. (2007b). Title I: Improving the academic achievement of the disadvantaged; Individuals with Disabilities Education Act (IDEA); Final rule. Federal Register, Vol. 72, No. 67, Monday, April 9, 2007.
Wagner, M., Marder, C., Blackorby, J., & Cardoso, D. (2002). The children we serve: The demographic characteristics of elementary and middle school students with disabilities and their households. Menlo Park, CA: SRI International.
Wagner, M., Cameto, R., & Guzman, A. (2003). Who are secondary students in special education today? (A report from the National Longitudinal Transition Study). Retrieved September 1, 2008, from http://www.ncset.org/publications
Wright, B. D., & Masters, G. (1982). Rating scale analysis: Rasch measurement. Chicago: MESA Press.
Wu, A., Li, Z., & Zumbo, B. (2007). Decoding the meaning of factorial invariance and updating the practice of multi-group confirmatory factor analysis: A demonstration with TIMSS data. Practical Assessment, Research & Evaluation, 12(3), 1–23.

