AT THE INTERFACE BETWEEN LANGUAGE TESTING AND SECOND LANGUAGE ACQUISITION: COMMUNICATIVE LANGUAGE ABILITY AND TEST-TAKER CHARACTERISTICS

by

Lin Gu

An Abstract

May 2011
ABSTRACT
This study investigated the nature of communicative language ability as manifested in performance on the TOEFL iBT® test, as well as the relationship between this ability and test-takers’ study-abroad and learning experiences. The research interest in the nature of language ability is shared by the language testing community, whereas understanding the factors that affect language acquisition has been a focus of attention in the field of second language acquisition (Bachman & Cohen, 1998). This study utilizes a structural equation modeling approach, a hybrid of factor analysis and path analysis, to address issues at the interface between language testing and second language acquisition.

The purpose of this study is two-fold. The first has a linguistic focus: to provide a better understanding of the nature of communicative language ability by examining the dimensionality of this construct in both its absolute and relative senses. The second purpose, which has a social and cultural orientation, is to investigate the possible educational, social, and cultural influences on the acquisition of English as a foreign language, and the relationships between test performance and test-taker characteristics.

The results revealed that the ability measured by the test was predominantly skill-oriented. The role of the context of language use in defining communicative language ability could not be confirmed due to a lack of empirical evidence. As elicited by the test, this ability was found to have equivalent underlying representations in two groups of test-takers with different language contact experiences. The superiority of the study-abroad environment over learning in the home country could not be confirmed. Certain test-taker characteristics were found to have significant associations with aspects of the language ability, although the results also suggested that variables other than the ones specified in the models may have had an influence on test performance.

From a test validation point of view, the results of this study provide crucial validity evidence regarding the test’s internal structure, this structure’s generalizability across subgroups of test-takers, as well as its external relationships with relevant test-taker characteristics. The findings also further our understanding of what constitutes the test construct, and how this construct interacts with the individual and social characteristics of test-takers.
AT THE INTERFACE BETWEEN LANGUAGE TESTING AND SECOND LANGUAGE ACQUISITION: COMMUNICATIVE LANGUAGE ABILITY AND TEST-TAKER CHARACTERISTICS

by

Lin Gu

May 2011

Copyright by LIN GU, 2011
CERTIFICATE OF APPROVAL

PH.D. THESIS

Lin Gu

Timothy N. Ansley, Thesis Supervisor
Helen H. Shen
Lia M. Plakans
Bonnie S. Sunstein
To my loved ones:
Mother: Ms. Zhuo Zhihua (卓志华)
Father: Mr. Gu Shenggen (顾生根)
ACKNOWLEDGMENTS

I would like to thank my advisor, Dr. Liskin-Gasparro, for her guidance, help, and encouragement in every phase of this dissertation study. Without her belief in me and her unfailing support, I would not have been able to see this study to its completion.

I am also indebted to Dr. Ansley for his insightful comments and challenging questions throughout this project.

Sincere thanks also go to Dr. Plakans, Dr. Shen, and Dr. Sunstein for their warm encouragement and valuable feedback.

I would like to thank the Graduate College of the University of Iowa and Educational Testing Service for funding this research project, and to thank Educational Testing Service for permissions granted to use their copyrighted test material and data.

Last, but not least, I wish to express my deep gratitude to my parents and to Mary for their love and support.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Introduction
Unitary Competence Hypothesis
Multidimensional Competency Models
Method Effect
    The Factor Analysis Approach
    The CFA-MTMM Matrix Approach
Test-Taker Characteristics and Test Performance
    Proficiency Level
    Other Background Characteristics
    Target Language Contact
    ETS–TOEFL® Studies
Making a Validity Argument
Conclusion
Linearity and Multicollinearity
Estimation Method
Assessing Model Fit
Establishing the Baseline Model
Modeling the Context of Language Use
Multi-Group Invariance Analysis
Building Structural Equation Models
Preliminary Analysis
The Baseline Model
The Context of Language Use
    The Content Dimension
    The Setting Dimension
    The Final Model
Multi-Group Invariance Analysis
    Measurement Invariance
    Structural Invariance
    Results of Multi-Group Invariance Analysis
Structural Equation Models
    The Home-Country Group
    The Study-Abroad Group
NOTES
REFERENCES

LIST OF TABLES

Table 10. Fit Indices from the Multi-Group Structural Invariance Analysis

LIST OF FIGURES

Figure 10. Factor Loading Invariance with Unstandardized Estimates, Group I
Figure 11. Factor Loading Invariance with Unstandardized Estimates, Group II
Figure 12. Indicator Residual Invariance with Unstandardized Estimates, Group I
Figure 13. Indicator Residual Invariance with Unstandardized Estimates, Group II
Figure 14. Factor Mean Invariance with Unstandardized Estimates, Group I
Figure 15. Factor Mean Invariance with Unstandardized Estimates, Group II
Figure 18. Structural Equation Model with Standardized Estimates, Group I
Figure 19. Structural Equation Model with Standardized Estimates, Group II
CHAPTER ONE
INTRODUCTION
Recent years have witnessed a growing number of foreign language (FL) learners in the United States and worldwide. To meet the assessment needs of this expanding population, tests whose development and validation are well grounded in applied linguistics research and sound measurement practice are in great demand.

The most recent edition of the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) provides criteria for developing and evaluating educational and psychological measures, and it regards these measures as “the most significant improvements over previous practices” (p. 1). The Standards treats validity as the most fundamental consideration in developing and evaluating tests, and defines validity as “the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests” (p. 9).

Validity has traditionally been divided into three types (Alderson & Banerjee, 2002; Bachman, 1990): content-related, criterion-related, and construct-related validity (Anastasi & Urbina, 1997; Messick, 1989). Messick (1989) challenged this tripartite view. In his seminal article, validity is “an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment” (p. 13; original emphases in italics). The earlier definition of three validities has since been replaced with a single unified view of validity (Chapelle, 1999). As pointed out by Bachman (1990), this unitary view of validity has been “endorsed by the measurement profession as a whole” (p. 236).
The concept of construct is central to test validation. The Standards (AERA, APA, & NCME, 1999) defines a construct as the concept or characteristic that a test is designed to measure, and ties score interpretation to the “construct or concepts the test is intended to measure” (p. 9). Construct validity has always been considered central to test validation (Cronbach & Meehl, 1955). In the unitary validity framework, construct validity is the fundamental, unifying concern. As Bachman (1990) observes, the language testing community has embraced and adapted the unitary validity concept, seeking to ground the FL construct framework on both theoretical and empirical grounds.

One important source of validity evidence concerns a test’s internal structure. Examining the structural component of a test was proposed to understand the degree to which the nature and dimensionality of the inter-item structure reflect the nature and dimensionality of the construct domain (Messick, 1989). A dimensionality analysis was also proposed by Chapelle (1999) to examine the fit of the observed test structure to the theorized construct. Evidence based on a test’s internal structure was likewise suggested by Wolf et al. (2008) to support test score interpretation. Such evidence could be used to interpret the nature of the FL construct: whether the construct is unidimensional or multidimensional and, if the latter, what the makeup of the construct is.
The generality of this FL construct has also been a focus of investigation in the field of language testing. Messick (1989) warned against taking the generalizability of a construct meaning across various contexts for granted. He proposed that context effects, among other threats, be examined empirically. Validity evidence based on a test’s generalizability was also proposed by Chapelle (1999) to ensure legitimate test score interpretation and uses across groups of test-takers, time, instruction conditions, and test task characteristics. The idea of a universally applicable construct structure is complicated by, among other factors, differences in the language to be measured (e.g., English, French, Chinese) and the usually heterogeneous nature of the test-taking population. This line of research helps to answer the question of whether the same construct structure holds across groups of test-takers who differ in terms of the features salient to a specific testing situation.
The research interest in the nature of language ability has been shared by the language testing community, whereas understanding the factors that affect language acquisition has been a focus in the field of second language acquisition (Bachman & Cohen, 1998). However, to ensure valid test score interpretation and uses, the acquisition factors that reside outside of a test need to be investigated in relation to test performance. In other words, validity evidence based on a test’s external structure is also needed.
Consider the following hypothetical example. Let us assume that the relevant literature uniformly asserts that study abroad is more beneficial to language gains than classroom learning in the home country. A study is designed to compare the language gains of study-abroad learners versus those of classroom learners. Contrary to the common belief upheld by the literature, test results show that the gains are equivalent. We are now confronted with the dilemma that the test’s external relationship with a situation factor (study abroad vs. classroom learning) does not conform to what has been demonstrated in the language acquisition literature. This would lead us to doubt whether the scores can be interpreted to have captured the real gains across the groups, and whether the test has been used appropriately for the purpose of this study.
Examining a test’s external relationships with other tests and with non-test behaviors has been proposed by both measurement professionals (AERA, APA, & NCME, 1999; Messick, 1989) and language testing researchers (Chapelle, 1990; Wolf et al., 2008). In the context of language testing, this line of research ensures that a test has the appropriate external relationships, compatible with both testing and acquisition theories.
This dissertation research investigates the nature of FL ability¹ and this ability’s relationships with test-takers’ study-abroad and learning experiences. The current view in the field holds that language ability is communicative in nature (Canale, 1983; Canale & Swain, 1980). Situated in the context of the TOEFL iBT® test², this study addresses issues at the interface between language testing and second language acquisition.
The purpose of this study is two-fold. The first has a linguistic focus: to provide a better understanding of the nature of communicative language ability by examining the dimensionality of this construct in both its absolute and relative senses, as measured by the TOEFL iBT test. Dimensionality in its absolute sense refers to the number and nature of the dimensions that underlie the construct; it implies the invariance of the proposed factorial structure, irrespective of factors such as the FL tested, the test chosen, the population studied, and the conditions under which language acquisition takes place, to name just a few. Dimensionality in its relative sense makes no claim to the invariance described immediately above. It also implies that different aspects of this proficiency, if the proficiency is indeed divisible, may develop at different rates and over different paths.
The second purpose, which has a social and cultural orientation, is to investigate the relationships between test performance and test-taker characteristics (TTCs). These characteristics can include, among others, native language, language spoken at home, years of immersion, years of schooling in the target language, and anticipated career. Of these, test-takers’ contact with the target language through study-abroad and learning experiences is the focus of this research study. By investigating the individual and social–cultural characteristics of test-takers in relation to test performance, this study can shed light on how differences in learning contexts and experiences may affect FL acquisition.
Although it has been generally agreed that FL proficiency is a complex construct with multiple dimensions, it is still unclear to the research community what this proficiency consists of, and how the constituent parts interact. Specifically, this research is grounded in the framework of communicative competence (Canale, 1983; Canale & Swain, 1980). Since its initial appearance, the concept of communicative competence has had a strong influence not only on language teaching and learning, but also on test design and development. Building on this concept, later work extended communicative language ability to academic contexts (Chapelle, Grabe, & Berns, 1997). According to the COE model, the TOEFL test is intended to measure communicative language ability in academic settings. By modeling TOEFL iBT test performance, this study will make both theoretical and empirical contributions to our understanding of the communicative language ability measured by the test. The results could also provide evidence based on the internal structure of the test for developing a validity argument for test score interpretation and uses.
A further motivation for this study concerns the relationships between TTCs and FL test performance. The field of FL testing has recently seen a growing number of studies on TTCs, such as gender (Wang, 2006), native language background (Shin, 2005), ethnicity and preferred language (Ginther & Stevens, 1998), and home language environment (Bae & Bachman, 1998), to name just a few. Multiple studies have also been conducted in TOEFL® testing to validate score interpretation and uses in light of differences among test-takers on background variables, such as reasons for taking the test (Swinton & Powers, 1980), native language background (Stricker et al., 2005), target language exposure in home countries (Stricker & Rock, 2008), and years of classroom instruction (Stricker & Rock, 2008). The general consensus reached by the language testing research community is that the factor structure underlying FL test performance can be interpreted more meaningfully if relevant TTCs are taken into consideration. However, the amount of information we have obtained is far from what we need to fully understand this complex and dynamic network of relationships. Moreover, there are still TTCs that have not attracted enough attention from language testing researchers, and these deserve closer examination.
One such TTC is contact with the target language. Language contact is a concept especially relevant in FL testing situations, where test-takers are non-native speakers of the target language who may have had the experience of studying and/or living in the target language environment. Learning and living experience in the target language environment has been investigated in relation to test performance in only a few studies (e.g., Kunnan, 1994, 1995; Morgan & Mazzeo, 1988). This TTC is salient in most FL testing situations, but has not been examined extensively.
This study will contribute to our knowledge of the relationships between TTCs and FL test performance, especially the test performance of test-takers with different language contact experiences. Because of the TOEFL iBT® test’s diverse test-taking population, examining the internal structure of the test across relevant subgroups of test-takers would provide evidence for the test’s generalizability in building a validity argument.
From a pedagogical perspective, the study will inform us whether study abroad has an effect on language learning and acquisition and, if so, in what ways. There has been a rapid growth of research on study-abroad contexts of learning, and these studies have generated mixed results (e.g., Collentine & Freed, 2004). The results of this study will provide empirical evidence on the impact of study abroad, if any. From a practical point of view, the results of this study could advise potential FL test-takers on how to prepare for a test, as well as how to improve their FL proficiency in general. Studying in a foreign country normally requires a huge investment, both financial and emotional, from test-takers and their families. It is my hope that the results of this study can shed light on whether such an investment pays off in terms of language gains.

The significance of this study for the interface between language testing and second language acquisition lies in two areas: (a) extending the language testing research agenda to include variables external to a test, so that we can have a better understanding of not only the nature of the FL construct but also this construct’s external relationships with relevant TTCs; and (b) employing a structural equation modeling (SEM) approach to tackle issues at the interface between language testing and second language acquisition.
Chapter One describes the context of the problem, explains the purposes of the
study, indicates the contributions and significance, and presents the organization of the
dissertation.
Chapter Two reviews the literature on the nature of FL proficiency. The review includes proficiency models that have been subjected to empirical investigation in a testing situation. Furthermore, only studies that have utilized latent factor analysis and structural equation modeling are reviewed. The organization of the chapter proceeds from the debate over the unitary competence hypothesis and then broadens the scope of the review to include various multidimensional competence models. The review shows that the concept of a multidimensional FL proficiency has been well received on both theoretical and empirical grounds, although the research community has not yet reached an agreement regarding the makeup of the construct. The next section reviews studies that have modeled the effect of test methods, as opposed to constructs, in their analyses of proficiency models. The review indicates that the test method facet ought to be viewed as an integral part of the factor structure underlying FL test performance. The last section is devoted to summarizing empirical studies that have explicitly modeled TTCs in their analyses. These studies are divided into four groups. The first group focuses on the relationship between proficiency level and the degree of factor differentiation. Studies in the second group examine other TTCs, such as native language background and proficiency, learning condition, and gender. Studies in the third group examine target language contact as a TTC by using multi-group factor analysis. Last, validation studies in the context of TOEFL® testing are reviewed and summarized. The results of the literature review demonstrate that two consensuses have been reached by the research community. First, FL proficiency is multidimensional in nature. Second, the factor structure of FL test performance can be interpreted more meaningfully if relevant TTCs are taken into consideration. Directions for future research are suggested at the end of the chapter. They are: (a) to test two competing hypotheses, a hierarchical model and a correlated-trait model; (b) to examine performance on integrated test tasks that require the use of multiple skills; and (c) to study TTCs that have not attracted enough attention from previous researchers, especially target language contact.
Chapter Three presents the design of the study. The background and development of the TOEFL iBT® test as well as the format of the operational test are first introduced. The reason for choosing this test and how the unique characteristics of this test can assist us in accomplishing the general research purposes of this study are explained. Situated in the context of TOEFL iBT testing, three research questions are proposed and six hypotheses are formulated. A layout of materials and data is provided, and the planned data analyses are described.
Chapter Four reports the results of the analyses. Taking both skill-oriented abilities and language use context into consideration, the outcomes of establishing a baseline model for the entire sample are reported first, followed by results from multi-group invariance analyses across groups of test-takers with different language contact experiences. Last, results from analyzing two separate structural equation models, one for each group, are reported.

Chapter Five provides a review of the study and a summary of the primary findings concerning the relationships among language skills and language ability, group membership and language ability, and learning contexts and language ability. Insights gained from this study on language test development and validation are then discussed, and the promise of structural equation modeling for research at the interface between language testing and acquisition is appraised. The chapter further discusses the study’s contributions and limitations. Recommendations for future research are offered at the end of the chapter.
CHAPTER TWO
LITERATURE REVIEW
Introduction
In one influential discussion of the foundations of language assessment, it was stated that what is central to language testing is “an understanding of what language is, and what it takes to learn and use language, which then becomes the basis for establishing ways of assessing people’s abilities” (p. 80). Bachman (2000) also emphasized the importance of the nature of language ability and language use. In his review of modern language testing at the turn of the twenty-first century, Bachman asserted that what has guided the development and refinement of new tests is “current thinking in applied linguistics about the nature of language ability and language use” (p. 2). These positions were echoed by Wolf et al. (2008) in their statement that a clear understanding of the construct to be measured should precede assessment design.

The essential message conveyed by these authors is that, prior to conducting any assessment, a clear understanding of the language proficiency being assessed should be established. This point has been well taken in the field, and numerous models of language proficiency have emerged. Some models are compatible with one another, whereas others are competing and contradictory. Some models are more general than others and, as such, depict the construct in diverse contexts, whereas a more local model tends to depict the construct as it applies in a particular context. Some models are more complex, as they tend to include as many construct elements as possible, whereas others are more parsimonious, as they include only the elements that pertain to a particular testing situation. Some models enjoy strong empirical foundations, whereas others rest on scanty empirical support. Among models that claim empirical support, a wide range of methodologies has been used by researchers.
To discuss every language proficiency model that has been developed by theorists and practitioners is beyond the scope of this review of the literature. An exhaustive review might also suffer from a lack of focus, and it would be difficult to draw any practical conclusions from it. Therefore, only models that have been subjected to empirical investigation in a testing situation are included. As argued in Chapter One, validity evidence based on a test’s internal structure, its external structure, and the test’s generalizability is central to validation, and a latent trait approach is most appropriate for conducting such validity inquiries. This chapter therefore reviews studies that have utilized a latent approach in their investigations of the nature of FL proficiency. The review first traces the debate over the unitary competence hypothesis and then broadens its scope to include various multidimensional competence models. The next section reviews studies that have modeled the effect of test methods, as opposed to constructs, in their examination of language proficiency models. The last group of studies branches out to examine variables that reside outside a test, namely test-taker characteristics (TTCs), and their relationships with test performance.

The chapter concludes that there is no single best model of FL proficiency, at least not according to the results of this literature review, on which a test developer can base his or her test. However, consensus in some areas has been reached by the profession.
Unitary Competence Hypothesis

Almost every article on the nature of language proficiency traces this line of research back to Oller’s (1979) view of language proficiency as a unitary entity and the empirical investigations of the factorial structure of language proficiency led by him and his associates. Oller’s research addresses one of the fundamental questions relevant to educators and researchers of language acquisition: What is the nature of the language proficiency construct?
From the standpoint of a language test developer, Oller (1979) framed his question as “whether or not language ability can be divided up into separately testable components” (p. 423). A component that is testable will yield scores that are reliable, which means that this component of language ability is a relatively stable trait. Being testable also means that the component can be distinguished from other components on empirical grounds. Three competing hypotheses regarding the divisibility of language ability were put forward by Oller (1979). The divisibility hypothesis claims that tests of different components, skills, aspects, or elements do not share common variance, and performance on each test is accounted for by its unique variance. On the contrary, the indivisibility or unitary competence hypothesis states that no test measuring a particular element has unique variance, and that all tests share the same variance. Taking the middle ground, the third hypothesis posits that performance on a particular test is accounted for by both shared variance with other tests and unique variance that pertains only to this test. This last hypothesis was also termed the partial divisibility hypothesis.
The theoretical support most widely invoked by the proponents of the unitary hypothesis comes from the study of human intelligence: Spearman (1904) posited a general factor, g, that dominates most of the variance in human performance. If there are good reasons for assuming that language proficiency functions more or less like other human activities, then a single general factor could be expected to dominate language performance as well. Additional support comes from Spolsky’s belief in two fundamental truths about language: Language is redundant and it is creative. Spolsky (1968) argued that global proficiency tests, which assess the ability to process redundant and creative language sequences, are essentially measures of linguistic competence rather than discrete skills. Overall proficiency, understood as internalized knowledge of rules, seems much the same as underlying linguistic competence, which operates in all kinds of performance. Built upon Spearman’s (1904) general factor of intelligence, the general language proficiency factor served as a convenient label for the psychologically internalized grammar that governs all kinds of language performance.
Using principal component analysis (PCA), a series of early studies tested the unitary competence hypothesis (Oller, 1979; Oller & Hinofotis, 1980; Scholz, Hendricks, Spurling, Johnson, & Vandenburg, 1980). In a PCA procedure, the first principal component extracted accounts for the largest amount of variance among the measures. In all these studies, the same factorial pattern emerged. The first principal component accounted for a considerable amount of common variance, and therefore was interpreted as a general language proficiency factor. All tests had high loadings on the first component, and the loadings on this component were higher than the loadings on the subsequently extracted component(s). The residual correlations among the measures, after the first principal component had been extracted, were negligible. By obtaining such a factorial pattern, these researchers believed that the unitary competence hypothesis was supported.
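The mechanics of this argument are easy to reproduce. The sketch below is a hypothetical Python illustration (no such code appears in the studies reviewed): when a battery of subtests draws heavily on one common factor, the first principal component absorbs most of the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate scores on five subtests (e.g., listening, reading, cloze,
# dictation, grammar) that all draw on one strong common factor.
n = 500
g = rng.normal(size=n)                      # general ability
scores = np.column_stack(
    [0.8 * g + 0.6 * rng.normal(size=n) for _ in range(5)]
)

# PCA via eigen-decomposition of the correlation matrix.  The diagonal
# is fixed at unity, so common, specific, and error variance are all
# analyzed together -- the very property Farhady (1983) criticized.
corr = np.corrcoef(scores, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)[::-1]    # sorted descending

print(eigvals / eigvals.sum())
# The first component typically absorbs well over half of the total
# variance, inviting a "unitary competence" reading of the data.
```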
Although a dominant global factor was repeatedly confirmed, the PCA method used in these studies was criticized and considered inappropriate for conducting this line of investigation (Carroll, 1983; Farhady, 1983; Vollmer & Sang, 1983). In terms of extracting the initial factors, as Farhady (1983) argued, since values on the diagonal are set at unity in PCA, all variances, including the common variance, specific variance, and error variance, are used to define the underlying factors. Instead of using PCA, principal factor analysis (PFA) can be used. In PFA, estimated communalities are assigned to the diagonal cells, and specific and error variance components are not included. An iterative method is used to refine the estimates of the communalities at each step of factor extraction. PFA thus uses only the common variance among the variables in the analysis while systematically discarding uniqueness.
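The iterated-communalities idea can be made concrete. The following is a minimal, hypothetical numpy sketch of principal factor analysis (not code from the studies reviewed): squared multiple correlations seed the diagonal, and communalities are re-estimated after each extraction so that only common variance defines the factors.

```python
import numpy as np

def principal_factor_analysis(corr, n_factors=1, n_iter=50):
    """Principal factor analysis with iterated communality estimates."""
    r = corr.copy()
    # Initial communality estimates: squared multiple correlations.
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(corr))
    for _ in range(n_iter):
        np.fill_diagonal(r, h2)             # reduced correlation matrix
        eigvals, eigvecs = np.linalg.eigh(r)
        idx = np.argsort(eigvals)[::-1][:n_factors]
        loadings = eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0))
        h2 = (loadings ** 2).sum(axis=1)    # updated communalities
    return loadings, h2
```

Unlike PCA, the uniqueness (specific plus error variance) never enters the solution, which is why PFA was put forward as the more defensible tool in this debate.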
Both PCA and PFA extract the first factor in such a way as to account for the maximum amount of variance in each and all of the variables. Variance from many different common factor sources is extracted because the factor vector is placed in such a way that as many of the variables as possible have substantial projections on it. The first factor can thus extract variance from several unrelated variables. It is highly inflated and, therefore, the subsequent factors are correspondingly deflated. The first factor, by the very nature of the extraction procedure, will always account for the greatest amount of variance.
Researchers (Carroll, 1983; Farhady, 1983; Vollmer & Sang, 1983) also argued for factor rotation as a means of arriving at more meaningful factor patterns, since the initial unrotated factors may not give the best picture of the factor structure and may lead to misinterpretations. By rotation, a simpler factor structure might be arrived at, with each variable loading primarily on only one factor, and each factor accounting for a maximum of the variance generated by the variables that load on it.
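The rotation step itself is mechanical. Below is a hypothetical numpy implementation of varimax rotation (not drawn from the reviewed studies) that rotates an initial loading matrix toward the simple structure these researchers advocated:

```python
import numpy as np

def varimax(loadings, gamma=1.0, n_iter=100, tol=1e-6):
    """Varimax rotation: maximizes the variance of squared loadings
    per factor, so each variable loads primarily on one factor."""
    p, k = loadings.shape
    rotation = np.eye(k)
    var = 0.0
    for _ in range(n_iter):
        lam = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (lam ** 3 - (gamma / p) * lam
                          @ np.diag((lam ** 2).sum(axis=0)))
        )
        rotation = u @ vt
        if s.sum() < var * (1 + tol):       # converged
            break
        var = s.sum()
    return loadings @ rotation
```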
Multiple-factor structures did emerge from some of the findings when analysis methods different from the PCA method were used. Oller and Hinofotis (1980) found a two-factor and a three-factor structure by using factor rotation techniques. Scholz et al. (1980) were able to describe the data with four independent dimensions by using PFA. However, these researchers still uniformly concluded that the unitary factor solution was the best theoretical explanation for their data. The major reason was that no clear pattern appeared in which tests were grouped according to the posited skills or components. They came to the agreement that there must be a general factor underlying performance on many language processing tasks. As Oller (1983) stated, although this general factor may not account for all of the variance, it dominates performance across language tasks.

As the language testing research community started to realize the limitations of using PCA to configure the latent structure of language proficiency, the notion of a unitary language competence became less convincing. As researchers argued (Farhady, 1983; Vollmer & Sang, 1983; Woods, 1983), the derived factors contained not only common variance that was shared by different tests, but also test-specific variance and error variance. Since a PCA approach is programmed to extract the greatest amount of variance from all sources, the technique generally overestimates the significance of the first derived factor. The unidimensional model claimed by Oller and his colleagues has therefore generally been regarded as an artifact of applying statistical techniques incorrectly.
While the field continues to pursue this question regarding the nature of language proficiency, the unitary competence hypothesis has lost its standing among language researchers. Results from a growing number of studies using factor analytic techniques point instead to a multidimensional view of language proficiency. In the next section, the discussion on the nature of language proficiency turns to multidimensional models of competence.

Multidimensional Competency Models

An early multidimensional conception of language proficiency grew out of structuralism (Carroll, 1965; Lado, 1961). According to this school of thought, language components are placed against skills. The language components are phonology or orthography, morphology, syntax, and lexicon. The skill set includes auditory comprehension, oral production, reading, and writing. Even though the idea of the independence of the various components and skills has never been expressed in the form of a hypothesis, the components and skills are considered to be logically different and therefore independent of one another. This position implies the strong form of the divisibility hypothesis: each component-by-skill combination could, in principle, be tested separately. In practice, however, even Carroll admitted that it seemed daunting and also unnecessary to test all 16 separate competences. A review of the relevant literature shows that the strong form of this hypothesis has not drawn much empirical evidence in its support.
A weaker position shifts the focus away from testing discrete components such as structure or lexicon. Naturally, the four skills of listening, reading, speaking, and writing, which are regarded as integrated performance based on the candidate’s mastery of the whole array of language components, receive focus in this integrated approach to testing. On this view, an ideal language proficiency test should make it possible to differentiate levels of performance in each of the four skills.

After reviewing the history of the structuralist approach to language teaching and testing, Vollmer and Sang (1983) formulated one possible weak form, hypothesizing that each of the four skills could be focused upon and that each one could be measured more or less independently from the others. In this weak view of divisibility, executing skills like listening, reading, speaking, and writing requires the integrated use of different levels of language knowledge. The four-skills approach has had an enormous impact on how FL has been taught and tested. There is abundant evidence of teaching the four language skills as more or less distinguishable abilities.
Neither the unitary hypothesis nor the complete divisibility hypothesis can be justified in light of results from empirical studies. Oller (1983) conceded that multiple factors might underlie language proficiency. He was still in favor of a general factor, but this general factor might be partitioned into components. He further proposed that hierarchical models with one general factor and several first-order factors would work best in representing language proficiency. From the structuralist tradition, Carroll (1983) also started to recognize the possibility of the existence of a general language ability as the indicator of the overall rate of language development. At the same time, he maintained that language skills could be separately recognized and measured because skills have the potential to develop at different rates.

Results from early empirical studies convinced the field to embrace the multidimensional concept of FL ability (Carroll, 1958; Gardner & Lambert, 1965; Hosley & Meredith, 1979; Pimsleur, Stockwell, & Comrey, 1962). All of these studies produced factor structures in which FL proficiency appeared divisible instead of unitary. The trend shows that FL proficiency can be better explained by multiple factors than by a single general one.
However, the resulting factorial structures often vary vastly from one study to another, and little agreement exists on what the structure of FL proficiency could look like. In results from a majority of these studies, a more general factor appeared along with a set of more specific factors. However, questions remain regarding how this general factor relates to the specific factors, as well as what these specific factors are and how they relate to one another. The inconsistency of the resulting factorial structures from various studies does not allow for strong arguments in the development of a theory of FL proficiency. Vollmer and Sang (1983) offered several suggestions to tackle this issue. They proposed, among other things, hypothesis-driven analyses of the factorial structure of language ability. Their observations showed that factor analysis was used in an exploratory way in most of the earlier studies; therefore, results were largely limited to the labeling and ad hoc interpretation of the factors. Using confirmatory factor analysis (CFA) instead of exploratory factor analysis (EFA) could provide researchers with more opportunities for theory testing and theory building. As the field moved forward, these issues were recognized and treated with care in more recent studies.
Later studies examined these competing models within a latent trait approach. Four hypotheses are commonly tested in this body of research. Two of them are derived from Oller’s (1979) partial divisibility hypothesis. The first of these is called the correlated trait hypothesis, which posits that a number of factors underlie FL ability, and that these factors are correlated with each other. Competing with it is the higher-order hypothesis, which posits a number of primary (or first-order) factors subsumed under a general (or higher-order) factor. This general factor affects test performance only indirectly, through the primary factors. The assumption behind this hypothesis is that the correlations among the primary factors can be accounted for by the general factor. These two hypotheses are usually examined against each other and against the other two hypotheses, the unitary hypothesis and the divisibility hypothesis. The unitary hypothesis states that the variance in FL performance is accounted for by only one general underlying factor. The divisibility hypothesis claims that a number of factors account for the variance in FL performance, and that these factors are independent of one another.
The four hypotheses were first tested by Bachman and Palmer (1982), who posited three traits: grammatical competence, pragmatic competence, and sociolinguistic competence; phonology and graphology were not included in this framework. Each trait in this study was measured by using four different methods: interview, writing, multiple-choice, and self-rating. In the results, both the unitary hypothesis and the complete divisibility hypothesis were rejected. The model with a higher-order general factor and two uncorrelated primary trait factors provided the best fit. Grammatical competence and pragmatic competence clustered together, and emerged as one trait that was distinct from sociolinguistic competence. Except for the grammar measures, the rest of the measures loaded heavily on the general factor. This study found strong support for a substantial general factor, although the nature of this higher-order general factor remained open.

The framework investigated in Bachman and Palmer (1982) was based on the theory of communicative competence proposed by Canale and Swain (1980), and the authors hoped that their study could provide empirical support for some of the theoretical distinctions it drew. Canale and Swain’s original framework consisted of grammatical competence, sociolinguistic competence, and strategic competence, with the latter two being considered as equally important as the first. The framework was later revised by Canale (1983) with the addition of discourse competence to the original model.
As Canale (1983) pointed out, the tremendous interest that communicative competence generated was evident across the field, and its acceptance moved the general discussion about the nature of language proficiency farther away from the unitary proposition. However, as Canale (1983) acknowledged, the framework was more a theoretical than an empirical model. Little empirical evidence has been found to distinguish the four aspects of the competence, or to specify the manner in which the four areas interact. Although the framework has been criticized for being based mainly on theoretical rather than empirical work (Cziko, 1984), it has also been credited with advancing the field’s knowledge, upon which future efforts at proficiency model building could draw.
The four hypotheses were tested again in Bachman and Palmer (1983). They conducted a construct validation study of the Foreign Service Institute (FSI) oral interview test. Each posited trait, speaking or reading, was measured by three methods: interview, translation, and self-rating. Based on the results from the six measures, they found that the two partial divisibility hypotheses provided a better fit to their data. They thus rejected the unitary trait hypothesis and the divisibility hypothesis. Although the model of a general factor with two uncorrelated primary traits fit the data best statistically, the authors decided to adopt the two-factor correlated trait model because of the principle of parsimony. In their argument, the latter model had fewer free parameters to estimate, and therefore was preferred as the more parsimonious model. The results demonstrated strong support for the distinctness of speaking and reading as separate but related components of language proficiency.
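Comparisons between nested models of this kind conventionally rest on a chi-square difference (likelihood-ratio) test; the following is a sketch of the standard logic, not a formula quoted from the study:

```latex
\Delta\chi^{2} \;=\; \chi^{2}_{\text{restricted}} - \chi^{2}_{\text{full}},
\qquad
\Delta df \;=\; df_{\text{restricted}} - df_{\text{full}}
```

If the chi-square difference is not significant at the corresponding difference in degrees of freedom, the restricted model fits no worse than the fuller one, and the principle of parsimony favors retaining the model with fewer free parameters.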
Sang, Schmitz, Vollmer, Baumert, and Roeder (1986) claimed that neither the unitary competence hypothesis nor the complete divisibility hypothesis could be upheld in light of empirical data. Instead of using the four-skills framework, they postulated a factor structure with three correlated dimensions. The first dimension, basic knowledge elements, covered grammatical and lexical knowledge. The second dimension involved the integration of basic knowledge elements, and the third concerned oral interaction. The authors found that the posited correlated three-factor model fit the data; however, the correlations among the latent factors were high. To rule out a rival one-factor model based on the unitary factor hypothesis, a competing one-factor model was tested and rejected based on the results of model fit comparison. No higher-order model was tested in this study, so the potential comparison of a higher-order model and a correlated-factor model was left unexamined.
Using both factor analysis and comparison of group means, Harley, Cummins,
Swain, and Allen (1990) hypothesized a three-trait structure that included grammar,
discourse, and sociolinguistic competence. Each trait was measured in three modes: an
oral productive mode, a written productive mode, and a multiple-choice written mode.
The analysis failed to confirm the hypothesized three-trait structure. Instead, a two-factor solution, with a general factor and a written method factor, was found. Although factor
analysis did not show strong support for the hypothesized distinctions, the analysis of
mean comparisons did provide some evidence in support of a distinction among the
hypothesized traits. The pattern of score differences between native speakers and second
language learners was found not to be uniform across the three competence areas. Based
on this observation, the authors argued that since language skills could develop at
different rates, they could then be distinctly identified and measured separately.
Two competing models concerning structure–reading and discourse factors were investigated in Fouly, Bachman, and Cziko (1990): a correlated-trait hypothesis and a higher-order hypothesis. Both models proved to fit the data fairly well. The study concluded that both distinct factors and a general language factor existed.
Bachman, Davidson, Ryan, and Choi (1995) investigated whether two test batteries were comparable. The battery developed by Educational Testing Service (ETS) included the TOEFL® test, the SPEAK® test, and the TEW® test. The Cambridge-based test battery consisted of five measures of the First Certificate in English (FCE).
A similar correlated two-factor solution was found within each test battery. In
both cases, measures of listening and speaking loaded predominantly on one factor.
Written tests, on the other hand, loaded primarily on the second factor. A cross-test
analysis using all 13 variables was performed to investigate whether the two tests indeed
measured similar abilities. A correlated four-factor solution was obtained, with a speaking factor, a listening factor, a written-mode factor associated with the ETS test battery, and a written-mode factor associated with the Cambridge test battery. Because all the factors were substantially correlated, the data were further fit to a higher-order factor model. The resulting factor structure had a higher-order general factor and four uncorrelated primary factors. It was found that the higher-order factor accounted for a large proportion of the variance in each test battery. The authors claimed that the results of this comparative study supported the position that FL proficiency consists of a general factor along with several specific factors.
A more recent study (Shin, 2005) used the data from the Bachman et al. (1995)
comparability study, and found a correlated three-factor solution for the ETS test battery.
The factor structure with a listening factor, a speaking factor, and a written mode factor
closely resembled the findings from Bachman et al. (1995). The Shin (2005) study also
rejected the unitary trait model and the complete divisibility model. Between the higher-
order model and the correlated-trait model, the higher-order model was found to be
optimal in representing the data, and it was also considered to be more parsimonious.
Focusing on listening trait validity, Buck’s (1992) study was significant in this line of research in two ways. First, the study successfully demonstrated the uniqueness of the listening trait and its difference from the reading trait, in contrast to the previous studies that had failed repeatedly to make this distinction (Carroll, 1983; Oller & Hinofotis, 1980; Scholz et al., 1980). Second, the study illustrated the process of designing measures that operationalized the listening trait properly. Buck pointed out a number of ways in which listening and reading differ: mode of input, information density, and different vocabulary, among others. He also suggested that contextual variables be considered explicitly in measure design. By controlling for these textual and contextual variables, a measure can be made to include many of the characteristics of the normal listening situation, and become theoretically valid in operationalizing the underlying listening trait. Each trait, listening and reading, was measured by multiple methods, including gap-filling tests and translation. The results yielded a model with two closely correlated traits: listening and reading. When addressing this close relationship between listening and reading, Buck asserted that reading and listening share a common general comprehension ability, but that a distinction can nevertheless be drawn between tests of general language comprehension and tests of the listening comprehension trait. The results supported the position that language ability is divisible, even where two closely related comprehension abilities are concerned.
A related study distinguished an ability to comprehend relatively short context from an ability to comprehend relatively long context. The author detected that the two partial divisibility models could not be distinguished statistically, meaning that they fit the data equivalently. Model comparison between the partial divisibility models had yielded similarly unfruitful results in the past (Bachman & Palmer, 1983; Fouly et al., 1990). A higher-order model with three first-order factors and a correlated three-factor model could not be separated through statistical comparisons, and therefore, the superiority of one model over the other could not be established. The author called for future studies with more first-order factors in order to settle this controversy over the best-fitting model for second language proficiency. Nonetheless, the two partial divisibility models fit the data much better than the unitary model or the uncorrelated model. The researcher concluded that there were at least several distinct trait factors between the general factor and the observed variables, and that these specific trait factors were not independent of each other.
While the majority of the studies to date have focused on English as a FL, Bae and Bachman’s (1998) study, which examined Korean as a FL and as a heritage language, was an exception that departed from the English norm.
ancestry who were assumed to have exposure to Korean at home constituted the heritage
group. Learners of non-Korean ancestry who were assumed to have little or no out-of-
class contact with the Korean language were in the non-heritage group. By focusing on
the two receptive language skills, reading and listening, the study showed that a
correlated two-factor model described the data for both the heritage group and the non-
heritage learner group. A rival model with a single factor was also tested, and the fit of
this one-factor model was significantly worse than that of the two-factor model. The
authors concluded that the same underlying two-factor pattern applied for both groups.
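Checking whether the same factor pattern holds in two groups, as Bae and Bachman did, can be sketched as a multi-group CFA. The hypothetical fragment below (semopy-style syntax, invented file and variable names) fits one specification to each group as an informal first check; formal invariance testing would additionally constrain loadings to be equal across groups:

```python
import pandas as pd
import semopy

# Same two-factor specification for both groups.
desc = """
listening =~ listen1 + listen2 + listen3
reading   =~ read1 + read2 + read3
"""

data = pd.read_csv("korean_scores.csv")   # includes a 'heritage' column

# Fit the model separately within each group and compare the solutions.
for name, group in data.groupby("heritage"):
    model = semopy.Model(desc)
    model.fit(group.drop(columns="heritage"))
    print(name)
    print(model.inspect())
```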
In the studies discussed so far, relationships among the factors were specified as either nonexistent or non-directional. This means that factors were hypothesized either to be independent of one another or simply to covary. Directional relationships among the factors have not been reported extensively in the literature. Only a few studies examined such possible relationships. Upshur and Homburg (1983) found that while grammar and vocabulary were correlated, they both affected reading. In their study, the relationships between grammar and reading and between vocabulary and reading were modeled not as reciprocal, but as directional. This model implies that the development of grammar and vocabulary abilities has a direct effect on the development of reading ability. Directional relationships among the latent factors were also tested in Sang et al. (1986) in a three-factor model concerning grammar ability, vocabulary ability, and reading ability. The relationships were directional, as grammar and vocabulary ability were modeled as determinants of reading ability.
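In SEM terms, such directional hypotheses simply add regression paths between latent variables. A hypothetical specification (same lavaan-style syntax, invented indicator names) of the pattern tested by Upshur and Homburg (1983) and Sang et al. (1986):

```python
# Grammar and vocabulary are correlated exogenous factors with
# directional paths into reading: their development is modeled as
# driving the development of reading ability.
desc = """
grammar    =~ gram1 + gram2 + gram3
vocabulary =~ voc1 + voc2 + voc3
reading    =~ read1 + read2 + read3
reading ~ grammar + vocabulary
"""
```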
There are reasons for this lack of interest in testing directional relationships among the factors. Factor analysis, having its roots in correlation and regression analysis, can by itself establish only association, not a causal (i.e., directional) relationship. To justify that one construct causes another, one needs to rely on an experimental design that applies random sampling procedures and that controls for extraneous variables. In FL research, such an experimental design can hardly be achieved, because there are often too many extraneous factors to be controlled, and also because the construct under investigation is often poorly defined. Considering the wide variety of specific language factors that have emerged in the literature and the many possible relationships among them, it seems almost impossible to single out one construct from the rest and focus on its directional relationships.
The previous discussion should also bring our attention to the nature of the higher-order general factor that has repeatedly emerged in most of the studies discussed above. A higher-order model asserts that the general factor accounts for the correlational relationships among the first-order factors. However, the nature of this general language factor remains unclear, and researchers have offered different concepts that can be used to define this general language ability. Oller (1983) claimed that it is pragmatic mapping that underlies the ability. Bachman and Palmer (1982) asserted that this general ability has something to do with information processing in extended discourse. As vague as it sounded, Bachman et al. (1995) suggested that the general factor can be interpreted as the common aspect of language proficiency shared by the measures in both test batteries.
As summarized by Fouly et al. (1990), the nature of this general factor is open to
speculation and future research. There may even exist more than one higher-order factor,
depending on the number of first-order traits and the correlational relationships among
them. Although the number and the nature of the measures used have differed from study to study, the field seems to have come to the consensus that language ability comprises a number of specific components, although questions remain regarding what these specific components are and how they interact.
Across these studies, the number of measures used ranged from 6 (Bachman & Palmer, 1983) to 13 (Bachman et al., 1995). The format of the measures also varied, ranging from multiple-choice items to interactive speaking tasks. Furthermore, the number and the nature of the traits in the models for hypothesis testing varied vastly from one study to another. In some studies, the traits were derived from a theoretical model of competence (Bachman & Palmer, 1982). In other cases, a study was conducted for the purpose of test score validation (Bachman et al., 1995). Therefore, what was included in the factor model depended on the nature and purpose of the test, not on a particular theory. There were also times when the researchers intended to examine only some specific traits of interest and, therefore, included only these traits in their models (Buck, 1992; Shin, 2005).

In sum, the concept of a multidimensional FL proficiency has been well received on both theoretical and empirical grounds (Kunnan, 1998a), although the research community has not yet reached an agreement regarding the nature of the construct and the makeup of its components. In a recent publication on issues in assessing English language learners (Wolf et al., 2008), the authors emphasized the field’s uncertainty as they declared that “there has been no consensus across the tests in the definition of language proficiency” (p. 22).
The field needs to find a mechanism to reconcile findings from different studies in order to move forward on the agenda of defining what FL proficiency means and how we should measure it.
Method Effect
The major goal pursued in the studies reviewed above is the identification of one or more factors underlying FL proficiency. Given that a trait or a factor is latent and cannot be measured directly, test measures are needed to elicit performance that will reflect the target trait structure. We therefore take test performance on such measures as indicators of the latent trait structure. In most cases, multiple indicators or measures are needed to define each latent factor.

The models reviewed so far are mostly trait-based or skill-based. These models assume that no other factor contributes systematically to test performance. This assumption, however, cannot be guaranteed. If a particular test method (e.g., a cloze test) affects test performance in a systematic way, this method-specific influence should be taken into consideration when the overall factor structure of test performance is modeled. That is to say, if a special ability can be developed to respond to a particular type of test method, this method-specific ability needs to be modeled alongside the trait factors.
To say that score interpretation on a FL measure is valid is to suggest that the test score reflects how test-takers respond to the test items differently as a result of their different standings on the ability continuum, rather than as a result of their differential mastery of a particular test method. In other words, if another test of a different method designed to measure the same trait yields the same pattern in test results, we can conclude that convergent evidence has been found that the two different measures assess the same trait, and that it is the trait, not the test method, that contributes to the resulting test performance. In reality, method effect can hardly be ignored, and trait effect is always intertwined with method effect. A review of the relevant literature suggests that there are factors other than language proficiency that might also contribute to the underlying structure of test performance (Bachman et al., 1995; Bachman & Palmer, 1982, 1983).

In the previous sections, the focus of the review was on configuring the trait, namely FL proficiency. In the current section, this focus is broadened as method factors are investigated along with the trait factors. Some of the studies reviewed in the previous sections are reexamined from this new perspective along with some new studies. In these studies, test methods were investigated as potential latent factors responsible for test performance. The delicate relationship between test method and trait in the context of FL testing is discussed at the end of the review.
Two analytic approaches have been used to examine the distinctness of trait and test method. The first is confirmatory factor analysis (CFA), which has been widely used in FL proficiency dimensionality studies; instead of focusing exclusively on the possible trait factors, test method effects are also modeled as potential latent factors. The second approach combines CFA and the multitrait-multimethod (MTMM) matrix method. Both approaches provide a means of separating the effect of the measurement method from the effect of the trait being measured. Using either approach, the studies cited below demonstrated the influence of test method on test performance.
In Bachman and Palmer’s (1983) validation study of the FSI oral proficiency interview, three different methods were used to provide both convergent and discriminant evidence for the construct validity of the oral interview test. Besides the interview method,
translation and self-rating were also used to measure the two traits: speaking and reading.
Findings from the analyses showed that models with three correlated method factors
consistently provided a better fit than models positing no method factor or with fewer
than three methods. A model with two correlated trait factors and three correlated method
factors was shown to provide the best fit to the data. Among the three methods used,
measures involving the oral interview method loaded more heavily on trait than on
method, whereas measures using translation and self-rating loaded more heavily on
method than on trait. The authors claimed that the interview method demonstrated the
largest trait component and the smallest method component, showing both convergent
and discriminant validity evidence for the score interpretation and use.
In another study, Bachman and Palmer (1982) again employed multiple measures of the posited traits of grammatical, pragmatic, and sociolinguistic competence. The four methods used in this study were interview, writing sample,
multiple-choice test, and self-rating. As far as the method factors were concerned, models
with three correlated methods proved to fit better than models with four uncorrelated
methods. The three correlated method factors were interview, writing and multiple-choice
as one method factor, and self-rating. They also found that measures using interview and
self-rating loaded more heavily on method than on trait factors, whereas measures using
the writing and multiple-choice method consistently loaded more heavily on trait than on
method factors. This result contradicted the findings in their 1983 study, where measures
involving oral interview demonstrated more trait effect than method effect.
In both studies, besides language-related traits, test methods also accounted for test performance. In the first study, three correlated method factors (oral interview, translation, and self-rating) were built into the factor structure. In the second, the three correlated method factors were interview, writing and multiple-choice combined, and self-rating. In both cases, models including method factors proved to fit the data better than models positing only language-related factors. These results showed that when configuring latent factor structures underlying test results, test method effects should be taken into account.
Test method factors have also been found in other studies investigating FL
proficiency (Harley et al., 1990; Song, 2008; Turner, 1989). Harley et al. (1990) used multiple measures to assess hypothesized proficiency components. They categorized their measures by mode: oral productive mode, written productive mode, and multiple-choice written mode. For example, a grammar oral production task could consist of a structured interview scored for accuracy of verb forms, whereas a written production task could ask examinees to write a formal request letter and an informal note scored for the ability to distinguish formal and informal register. In their final factor model, a
general factor and a written method factor emerged from the test results on the various
measures. Although the nature of the general factor remained unclear, the written method factor clearly reflected an effect of the shared written mode.
An audio mode versus written input mode difference was found in Song (2008). This study examined the similarities and differences in factor structure underlying listening and reading abilities. In the results, three kinds of skill-related factors emerged, distinguished in part by the mode of input.
Factors other than language ability that contributed to performance on cloze tests were found in Turner (1989). Eight cloze tests were given, varying in first language (L1) or second language (L2), content domain (two topics), and contextual constraints (local and global). The results helped answer the question of what knowledge was required for completing cloze tests. Based on the results from factor analyses, the model that demonstrated the best fit was composed of three uncorrelated factors: two distinct language factors (L1 and L2) and a general factor, which was defined by Turner as cloze-taking ability (CTA). The author
claimed that the inclusion of the L1 factor made it possible to distinguish the L2 factor
from the CTA factor. Regarding the measures in L2, the findings showed that L2 cloze
performance was dependent on both the L2 language factor and on the CTA factor. The
results also showed that this CTA factor had greater influence on cloze performance than
did the L2 language factor. The presence of the CTA factor in the final factor structure
demonstrated that method effect existed in cloze tests across linguistic boundaries.
The possibility that a factor reflects method rather than trait cannot be ruled out simply based on the results from statistical analysis. Substantive knowledge
should always be called upon to determine the nature of the factors, and how to label
them. It is likely that a seemingly language-related factor can also be interpreted as a test
method factor. Taking the results from Bachman et al.'s (1995) comparability study as an example, the factors in the final solution across both test batteries included a speaking
factor, a structure, reading, and writing factor associated with the ETS test battery, a
structure, reading, and writing factor associated with the Cambridge test battery, and a
listening and interactive speaking factor. The second and third factor both included
measures in the written input mode. The distinctiveness of these two factors might have
lain either in the kind of language ability into which each test battery tapped or in the
kind of methods each battery used. In the latter case, this distinctiveness reflected method
effect across the two test batteries. The last factor, listening and interactive speaking,
included the TOEFL® test listening section, the FCE listening comprehension section,
and the FCE oral interview. At first it might seem strange that the FCE oral interview
would load on this factor with two listening measures instead of on the speaking factor
with the SPEAK® test from the ETS battery. However, after taking a closer look at the
nature of the measures, we would find that the FCE oral interview was interactive in nature and demanded substantive listening skill from the test takers, whereas the SPEAK test was monologic, entailing no oral interaction and no listening requirement. Therefore the last factor was
essentially the ability to perceive and process aural input, in other words, the listening
ability.
The discussion has now reached a point where it is hard to draw a clear-cut line between trait and method. Why did the structure, reading, and writing sections all load on a single factor? Is it because all of the measures share the same written mode, so that the real differences among the language-related abilities are masked by the commonalities of the methods?
All of the measures used in the studies reviewed here can be categorized in many
different ways. Measures can be separated by input mode as either aural or written. They
can be divided by output mode as either oral or written. They can also be distinguished by response format as either selected or constructed. It is clear that a listening test must have aural input. However, whether the listening test items elicit selected or constructed responses depends on the design of a particular test. When a short-answer item type is chosen, might listening ability be confounded with writing ability, so that listening and writing emerge as a single factor simply because the two measures share a common output mode? Does this mean that selected-response listening items are preferable because they measure the unique listening construct with no influence from writing? But can we be sure that the abilities being tapped by constructed-response listening tasks, such as processing, internalizing, and reconstructing aural input, are not part of the listening construct? These questions remain to be answered by the field.
The MTMM method has been used widely in construct validation studies to
investigate both convergent and discriminant validity, and to distinguish between method
effect and trait effect. The traditional MTMM method (Campbell & Fiske, 1959) provides
a straightforward mechanism to detect and distinguish the trait and method effects by
simply comparing the correlations in a MTMM matrix without having to conduct any
significance test. However, it does not specify any underlying factorial model to quantify
and account for the relationships between the measures and the traits. Due to this inadequacy in dealing with multivariate data, model evaluation and comparison cannot be performed. The CFA approach addresses this limitation by fitting latent factor models to MTMM matrix data. In this approach, model specification is performed at the onset of the analysis. The best fitting model for the data is selected as the baseline model, to which a series of alternative models are compared by using chi-square difference tests. The results of these model comparisons tell us the degree of trait convergence, trait divergence, and the extent of method effects.
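The logic of the chi-square difference test is worth stating explicitly. For a restricted model R nested within a fuller model F (that is, R is obtained by constraining parameters of F), the difference between the two model chi-square statistics is itself chi-square distributed under the null hypothesis that the constraints hold; this is standard SEM practice rather than anything particular to the studies reviewed here:

```latex
\Delta\chi^2 = \chi^2_R - \chi^2_F, \qquad
\Delta df = df_R - df_F, \qquad
\Delta\chi^2 \sim \chi^2_{\Delta df} \text{ under } H_0
```

A significant chi-square difference indicates that the restrictions (for example, removing the trait factors) significantly worsen model fit.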
A CFA approach of MTMM data was adopted in Llosa (2007) to investigate the
construct validity of a standardized test and a classroom assessment. The MTMM data
were based on the test results from Grades 2, 3, and 4 students on two English language proficiency measures: the California English Language Development Test (CELDT) and the English Language Development (ELD) Classroom Assessment
measures. To be specific, the study aimed to examine the extent to which these two tests
measured common language ability constructs, the extent to which the constructs
measured by the tests were distinct, and the extent to which the differences in test design affected performance. A baseline model with four correlated trait factors (Listening, Speaking, Reading, and Writing) and one higher-order general factor was first established. The two methods,
the CELDT and the ELD Classroom Assessment measures, were modeled onto the
baseline model as either correlated or uncorrelated. This established two versions of the
trait-method model. In Grade 4, the trait-method model with correlated method factors
was confirmed to provide the best fit while in Grades 2 and 3, the trait-method model
with uncorrelated method factors was selected as the best model. The results showed that the method factors operated somewhat differently across the grade levels.
To test convergence, the fit of a trait-method model was compared to the fit of a
method-only model (correlated method factors for Grade 4 and uncorrelated method
factors for Grades 2 and 3). Significant improvement in model fit of the former over the
latter indicated that method alone could not satisfactorily account for the data. This also confirmed the presence of the trait effects.
To test divergence, the fit of the baseline model was compared to the fit of a unidimensional trait-method model in which only one general trait was specified. The significant chi-square difference result indicated that the traits, although related, were distinct. To test for method effects, the fit of the trait-method model was compared to the fit of a trait-only model. Substantial method effects were detected based on the result of significant model fit improvement provided by the former over the latter.
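As a concrete illustration of one such nested comparison, the trait-divergence test, the sketch below fits a correlated three-trait model and a single-general-trait model to the same data and computes the chi-square difference. It is a minimal sketch only: it assumes the third-party semopy package (which accepts lavaan-style model strings), all variable and file names are invented, and the exact fit-index column names may vary across semopy versions.

```python
import pandas as pd
import semopy
from scipy.stats import chi2

# Hypothetical data: nine observed scores, three per skill, one row per examinee.
data = pd.read_csv("scores.csv")  # columns spk1..spk3, rdg1..rdg3, wrt1..wrt3

# Fuller model: three correlated trait factors.
correlated_traits = """
Speaking =~ spk1 + spk2 + spk3
Reading  =~ rdg1 + rdg2 + rdg3
Writing  =~ wrt1 + wrt2 + wrt3
Speaking ~~ Reading
Speaking ~~ Writing
Reading  ~~ Writing
"""

# Restricted model: one general trait only (equivalent to fixing all
# factor correlations to one), so the two models are nested.
general_trait = """
General =~ spk1 + spk2 + spk3 + rdg1 + rdg2 + rdg3 + wrt1 + wrt2 + wrt3
"""

def fit_chi2(description, frame):
    """Fit a model and return its chi-square statistic and degrees of freedom."""
    model = semopy.Model(description)
    model.fit(frame)
    stats = semopy.calc_stats(model)  # one-row table of fit indices
    return stats["chi2"].values[0], stats["DoF"].values[0]

chi2_full, df_full = fit_chi2(correlated_traits, data)
chi2_restr, df_restr = fit_chi2(general_trait, data)

# A significant difference means collapsing the traits worsens fit,
# i.e., evidence of trait divergence.
diff, ddf = chi2_restr - chi2_full, df_restr - df_full
print(f"delta chi2 = {diff:.2f} on {ddf:.0f} df, p = {chi2.sf(diff, ddf):.4f}")
```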
Trait convergence, trait divergence, and method effects were further examined in
light of individual parameter estimates. The results deviated from those of the overall
model comparisons to some extent. As far as convergence was concerned, the proportion
of method variance exceeded that of trait variance in the baseline model for Grade 4.
Trait discrimination was also challenged by the uniformly high loadings of the trait factors on the higher-order general factor.
This study demonstrated how a CFA approach could be applied to MTMM data to
detect and distinguish trait and method effects. Substantial method effects were found,
meaning that performance on the two tests reflected not only the traits measured but also
factors associated with the instruments used. It also showed that the trait effects were
significant in accounting for test performance, and that a trait-method model provided the best account of the data.
To investigate the construct validity of an academic listening test, a CFA approach to MTMM data was also employed in Shin (2008). The
three methods under investigation were summary task, incomplete outline, and open-
ended questions. The academic listening constructs specified in the study were the
abilities to extract main ideas, major ideas, and supporting details. A baseline trait-
method model with correlated traits and uncorrelated method factors was established at
the outset. The baseline model was then compared to a method-only model with no trait
factors to find convergent evidence. To search for discriminant evidence, the baseline
model was compared to a trait-method model with one general trait factor, and to a trait-method model with uncorrelated trait factors. Both convergent and discriminant evidence was found.
A higher-order model, which was equivalent to the baseline model in terms of fit
statistics, was chosen for examining the individual parameter estimates. The relative
effects of trait factors were mostly larger than those of the method factors. Although for
each trait factor some tasks were stronger indicators than others, the author concluded
that all three response formats were valid indicators of academic listening
comprehension.
The CFA approach to MTMM data is the most commonly used alternative to the
original Campbell and Fiske (1959) MTMM analysis (Sawaki, Stricker, & Oranje, 2008).
Multiple steps of model comparisons assist in finding evidence for trait convergence, trait
divergence, and method effects. The main goal of this kind of analysis is to evaluate the
effects of factors (e.g. test method factors) that are not specified as the constructs a test
intends to measure. Unlike the original MTMM analysis, this CFA approach to MTMM
data allows researchers to specify and evaluate underlying factorial models, and to find the best model that takes into account both trait and method factors based on the empirical evidence.
Before we can arrive at any conclusion regarding how test method interferes with the measurement of trait, we need to consider the special nature of language testing. In the context of language testing, oftentimes a particular method is designed to
measure a unique aspect of language ability. It is likely that the trait and the method
define each other until it is impossible to disentangle one from the other. In my view, FL
tests should be designed to reflect the dynamic and interactive relationships between trait
and method. The test method facet ought to be visualized as an integral part of the factor
structure underlying FL test performance. As Bachman (1990) put it, “the matching of
trait and method factors is one aspect that contributes to test authenticity” (p. 241).
This section summarizes studies investigating the effects of test method on test performance. The next section will discuss the relationship between test-taker characteristics and test performance. Before moving on, it is worth taking stock: the field has witnessed a proliferation of factor-analytic studies devoted to examining the structure of FL proficiency, and some consensus has emerged on two issues. The first consensus is that the result of a FL test can be better interpreted if multiple factors are considered to account for the performance. This is to say that FL proficiency is multidimensional. The second consensus is implied in the mixed findings in terms of the nature and number of factors underlying the construct. It has become clear that the make-up of FL proficiency is context dependent.
Results of any factorial analysis heavily depend on why and how the test is
constructed. Regarding test construction, test method facets (Bachman, 1990) have been proposed to capture the many possible constraints of the test we use. Tests may vary in length, content coverage and representation, and internal consistency, to name just a few. Methods that a test utilizes can also differ in terms of input mode, expected response, the complicated relationship between the two, and so on. As a result, the outcome of any analysis aimed at uncovering the structure of FL proficiency is inevitably shaped by these facets.
Another trend that has emerged in the literature is that researchers have started to
pay attention not only to the test being used but also to the characteristics of the test-
takers.
Results from early research cast doubt on the uniformity of factor structure across different groups of learners. Harley et al. (1990) failed to confirm an a priori three-trait model, and instead found that
a general factor and a written method factor could explain their test performance data
well. Although this two-factor model was chosen as the final model, they suggested that
the relationship between the different components of the construct might be highly dependent on different language learning experiences of the test-takers, and that it might therefore vary across learner groups. The importance of considering learner characteristics when interpreting test performance was also emphasized in Bachman et al.'s (1990) study. These authors recommended that future researchers investigate the relationships
between test performance and test-taker characteristics (TTCs). TTCs are recognized as one of the four influences on test scores in the communicative language ability (CLA) model (Bachman, 1990), alongside language ability, test method facets, and random measurement error; personal characteristics are thus placed on a par with CLA and test method facets. The personal characteristics proposed in the CLA model include cultural background, background knowledge, cognitive abilities, sex, and age.
The Bachman (1990) model acknowledges the influences of test method and
TTCs on test performance. It has a language ability component, a test method component,
and a component of TTCs, based on which researchers can posit and test relationships
about factors that influence test performance. Although critics (Chalhoub-Deville, 1997;
McNamara, 1990; Skehan, 1991) argued that it might be difficult to implement such a
comprehensive model in actual test development, the CLA model was nevertheless
portrayed as the state of the art by Alderson and Banerjee (2002) as they argued that the
model emphasizes the interaction between the language user, the discourse, and the
context.
Suggestions from these earlier researchers served as the springboard for later
studies concentrating on this test-taker factor. It has become a common awareness that TTCs bear on the validity of score interpretation and test use. Commonly discussed TTCs include academic background, native language and culture, field in/dependence, gender, ethnicity, and age. Kunnan (1998a) argued that test takers from certain backgrounds might perform differently based on factors that are relevant to the abilities being tested. This message is echoed in Alderson and Banerjee's (2002) review of language testing and assessment, as they argued that the field needs to understand the characteristics of different test-takers and how these characteristics interact with test-takers' ability.
The remainder of this section reviews studies that explicitly modeled TTCs in their analyses of FL proficiency. Studies are divided into four
groups. The first group of studies focused on the relationship between proficiency level
and the degree of factor differentiation. Studies in the second group examined other
TTCs, such as native language background and proficiency, learning condition, and
gender. Two studies investigating language ability in relation to target language contact
are included in the third group. And last, validation studies in the context of the TOEFL® test make up the fourth group.
Proficiency Level
Regarding the relationship of learners’ target language ability level and the nature
of their proficiency, two competing hypotheses have received much attention from the
field. One hypothesis states that the dimensions of language ability become more differentiated as proficiency increases; the other predicts decreasing differentiation with increasing proficiency. In opposition to both hypotheses, the null hypothesis claims
that factor structure becomes neither more nor less differentiable but stays the same
across learner groups of different ability levels. Research so far has shown quite mixed
results (Kunnan, 1992; Römhild, 2008; Shin, 2005; Swinton & Powers, 1980).
Kunnan (1992) detected a positive relationship between proficiency level and the degree of factor differentiation in an ESL placement test administered at the University of California. Based on test results, ESL students would be placed into four different course levels to receive further instruction on the English language, depending on their performance. The study investigated the structure of the language ability of the test-takers, and examined whether the test measured the same abilities in the same way across all ability levels. To achieve the first goal, an exploratory
factor analysis (EFA) was performed based on six subtest scores for the total group. A
two-factor solution was selected to be the most parsimonious and interpretable, with the
two listening subtest scores loading on one factor, and the subtest scores of reading and the other written-mode subtests loading on the other factor. To address the second goal, concerning the ability level of the test-takers, results of the EFA for each course group were compared.
Although a two-factor solution appeared to be the best fit for each course group, the
variables that loaded on the factors and the relationships of the factors were different
across the groups. The most salient difference was that only the group with the lowest
proficiency had an oblique solution, whereas the solutions for the other three course
groups were orthogonal. This is to say that for the students with the lowest proficiency,
the two factors were correlated in the EFA solution, whereas the factors could be
considered relatively independent for the other groups with higher proficiency. The
indication was that the students with lower ability had less differentiable skill abilities in
contrast to the students at higher levels of ability, whose skills appeared to be more
distinct. The author concluded that the test might not measure the same abilities across
groups of different ability levels. This finding posed a threat to the validity of score
interpretation, especially if a single composite score were reported for all test-takers, as it
was in this case. Due to the inconsistency in the factor structure across the course groups,
the author suggested considering reporting section scores along with the total scores for this test.
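Kunnan's contrast between oblique and orthogonal solutions is straightforward to emulate. The sketch below is illustrative only: it assumes the third-party factor_analyzer package and an invented file of six subtest scores. It extracts two factors under an orthogonal (varimax) and an oblique (promax) rotation and prints the factor correlation matrix of the oblique solution, where nonnegligible correlations would correspond to the kind of overlapping factors Kunnan found for the lowest-proficiency group.

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Hypothetical data: six subtest scores per examinee (invented file name).
scores = pd.read_csv("subtests.csv")

# Orthogonal solution: factors are forced to be uncorrelated.
fa_orth = FactorAnalyzer(n_factors=2, rotation="varimax")
fa_orth.fit(scores)
print("varimax loadings:\n", fa_orth.loadings_)

# Oblique solution: factors are allowed to correlate.
fa_obl = FactorAnalyzer(n_factors=2, rotation="promax")
fa_obl.fit(scores)
print("promax loadings:\n", fa_obl.loadings_)
print("factor correlations (phi):\n", fa_obl.phi_)  # set for oblique rotations
```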
Römhild (2008) also found that factor structure changed across examinee groups of different proficiency levels. However, instead of a positive relationship like the one found in Kunnan (1992), Römhild observed that language proficiency was negatively related to the degree of factor differentiation based on the results of the Examination for the Certificate of Proficiency in English (ECPE). The ECPE is a test of advanced English language proficiency, reflecting skills and content typical of advanced-level language use. The data, as used for this study, included the listening and reading subtests, and three structure subtests (grammar, vocabulary, and a cloze test).
halves by using the split-half technique. The assumption was that each half of the test
would be an adequate representation of the full-length test. To divide the examinees into
low- and high-proficiency groups, the study used an internal criterion measure based on
the test results from one ECPE test half. The results from the other half of the test were then analyzed separately within each group to determine the number of latent factors. Among the three competing baseline models, the five-factor model representing the subtest structure of the ECPE provided the best fit for both groups.
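The split-half grouping procedure itself is mechanical. A minimal sketch, with invented item columns and assuming a simple odd-even split as the equivalent-halves criterion, might look like this: one half forms the internal proficiency criterion and the other half is reserved for the group-wise factor analyses.

```python
import pandas as pd

# Hypothetical item-level responses: one row per examinee.
items = pd.read_csv("item_responses.csv")

# Odd-even split into two presumably equivalent test halves.
criterion_half = items.iloc[:, 0::2]  # used only to define groups
analysis_half = items.iloc[:, 1::2]   # reserved for the factor analyses

# Median split on the criterion half defines low/high proficiency groups.
criterion_score = criterion_half.sum(axis=1)
is_high = criterion_score >= criterion_score.median()

low_group = analysis_half[~is_high]
high_group = analysis_half[is_high]
print(len(low_group), "low-proficiency;", len(high_group), "high-proficiency")
```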
Although identical baseline models were found for each group, the outcome of
analyses on factor loading invariance showed that only partial invariance could be held
across the groups. Items that exhibited factor loading noninvariance were removed, and
the baseline model was re-established. This new partial measurement invariance model was then used to examine structural invariance. Differences in factor variances and covariances were observed, indicating that structural invariance could not be held across the groups. Factor variances were smaller in the low-proficiency group,
meaning that the performance of this group on each of the factors was more uniform. In
other words, there was less variation in the outcome measures for the low-ability group.
Factor correlations also appeared to be smaller in this group, compared to the high-
proficiency group. The weaker correlations indicated that the five factors were more separable in the low-proficiency group. Taken together, these results suggested that the structure of test-takers' language ability became less differentiated as their overall language proficiency increased.
The relationship between proficiency level and the structure of language ability was also examined in Shin (2005). Unlike the two studies reviewed above, this
study did not find sufficient evidence to suggest that factor structure differed across
ability groups. Participants in this study were from Bachman et al.’s (1995) study.
Students’ test results on the Cambridge test battery served as the external criterion to
define their proficiency levels. Their scores on the TOEFL® test and the SPEAK® test
from the ETS test battery were used to examine factorial invariance across the low-, intermediate-, and high-ability groups. A higher-order model with six first-order factors (defined in part by input mode and by speaking) seemed to represent the test results of each ability group optimally, and therefore was established as the baseline model. This baseline model was then estimated simultaneously across all ability groups with equality constraints imposed in a stepwise fashion. Partial measurement invariance was established after freeing the factor loadings that showed significant variances across the groups. The author then proceeded to examine structural invariance in terms of the loadings of the first-order factors on the general language factor. Four of the six first-order factor loadings were
shown to be invariant. Based on these results, the author concluded that the structure of
the ETS tests stayed partially equivalent across the groups, implying that the degree of
factor differentiation neither increased nor decreased with the development of overall
language ability. This finding helped to make the argument that the ETS tests functioned
in the same way across different proficiency groups, and that the use of a single composite score across groups was defensible.
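Formal multigroup estimation with cross-group equality constraints is typically carried out in dedicated SEM software; as a rough first check, however, one can fit the same baseline model separately within each group and compare the resulting loadings. The sketch below does exactly that with semopy and invented variable names; large group-to-group discrepancies in particular loadings flag the parameters a formal invariance test would likely have to free. Column and operator labels follow semopy's inspect() output and may vary by version.

```python
import pandas as pd
import semopy

# A deliberately small baseline model with invented indicator names.
baseline = """
Listening =~ lis1 + lis2 + lis3
Speaking  =~ spk1 + spk2 + spk3
"""

# Hypothetical data file with a 'level' column (e.g., low / mid / high).
data = pd.read_csv("scores_with_levels.csv")

loadings_by_group = {}
for level, subset in data.groupby("level"):
    model = semopy.Model(baseline)
    model.fit(subset.drop(columns="level"))
    estimates = model.inspect()                 # parameter estimate table
    measurement = estimates[estimates["op"] == "~"]
    loadings_by_group[level] = measurement[["lval", "rval", "Estimate"]]

# Eyeball the loadings side by side; formal invariance testing would
# instead constrain them equal across groups and test the difference.
for level, table in loadings_by_group.items():
    print(f"--- {level} ---\n{table}\n")
```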
The decreasing trend in the relationship between language proficiency and the
degree of factor differentiation found in Römhild (2008) was in disagreement with the
finding of Kunnan (1992). Factor structure was found to be partially invariant on both
measurement and structural levels in Shin (2005), indicating that the degree of factor
differentiation did not vary as a function of language proficiency. One reason for this
discrepancy might have to do with the nature of the data used in each study. Kunnan
(1992) and Shin (2005) used section scores as the unit of data analysis, whereas in
Römhild (2008) analyses were based on item-level data. Another reason could simply be the differences in the tests and the populations examined.
One issue that deserves our attention is how group membership was determined
differently in the studies. Kunnan (1992) did not use an independent criterion measure for
separating the groups. Instead, the same test used to determine group membership was
used to perform multiple group analyses. In contrast, both Shin (2005) and Römhild
membership. The latter approach is preferred because any significant result would be less likely to be an artifact of the grouping procedure.
Test-takers are characterized not only by their prior target language achievement but also by their native
language background, gender, past and current learning conditions, and many other
characteristics. Test-takers’ identities and life experiences are also valuable information
for us to understand their current learning profiles, and how they have arrived at where
they are. The research community has gradually embraced the idea that treating test-
takers regardless of their identities and life experiences will give us an over-simplified
picture of their test performance. In the previous section, we have discussed studies that
have examined prior target language achievement as one of the many TTCs, and its
interplay with the degree of factor differentiation. In the following section, studies that examined other TTCs are reviewed.
One study challenged the assumption of a uniform factor structure regardless of the differences in the learners or the conditions under which language acquisition took place. The study posited that differences in cognitive skills and learning conditions would be reflected in the structure of FL proficiency. Learners were divided into two groups: one with high L1 (German) proficiency and the other with low L1 proficiency. A three-factor model (elementary, complex, and communicative) was confirmed to be the best fitting model for the total sample. Then the
invariance of this structure was examined as a function of L1 proficiency and of the kind
of instruction the learners received. The study found that achievement in L1 could affect
the factor structure in several ways. Across the low-and high-L1 groups, differences in
factor loadings on complex skills and interactive use were found. Loadings on the factors
of more advanced language skills became more salient as the learners advanced in
mastering their native language. In other words, advanced FL skills were more likely to
appear as a distinguishable factor for learners with a high level of L1 ability. Another
finding was the differences in factor correlations. Factors were more closely correlated in
the group with a high level of competence in their L1. Based on this observation, the
authors suggested that a general L2 proficiency factor could emerge if the sample was
highly proficient in their mother tongue because learners with high-level L1 proficiency
tended to develop a more generalized ability structure by being more adept in transferring abilities across languages. In short, the structure of FL proficiency was found to be dependent on the learners' degree of mastery of their L1.
During the next step, two types of teaching strategies were examined in relation to test performance, with the teaching strategies specified as
predictors of the language components in the model. The traditional teaching style was
characterized by its heavy use of L1, and by favoring a bilingual approach to lexicon, whereas the modern teaching style emphasized communicative activities such as unstructured oral interactions. The results showed that the traditional teaching strategy
had a positive relationship with the elementary factor but no association with the complex
factor, and a negative association with the communicative factor. In contrast, the modern
teaching strategy was positively associated with all aspects of ability development.
Gender, as another key TTC, was studied in Wang (2006). In an attempt to provide evidence of fairness for two test batteries, the Examination for the Certificate of Proficiency in English (ECPE) and the Michigan English Language Assessment Battery (MELAB), the author investigated the dimensionality of both tests
and the equivalence of the factor structures across genders. Since both tests report total
scores, the author aimed to gather evidence to support the use of the total scores for both tests. Each battery included, among other components, a listening section and a GCVR (grammar, cloze, vocabulary, and reading) section. Analyses were performed on subtest scores.
Subjects were randomly split into two samples to form a calibration sample and a
validation sample. EFA without rotation was performed for each test separately on the
calibration sample. It was found that only one factor dominated the distribution of score
variance in both ECPE and MELAB. Factorial invariance under this one-factor model
was examined across gender on the validation sample for both tests. A model with
equality constraints on the model structure, latent factor variances, factor loadings, and
unique variances, provided the best fit to the data for both tests. This result demonstrated
that all parameter estimates in the one-factor model were invariant across genders for
ECPE and MELAB. The indication was that both tests functioned in the same way across
gender groups. Therefore it was reasonable to report a single total score for both groups, as both tests did.
As reviewed in the previous section, Shin (2005) did not find enough evidence to
support that the degree of factor differentiation varied as a function of proficiency level.
Partial measurement and structural invariance held across all three ability groups. However, noninvariance was detected in the factor loadings of pronunciation and fluency on the speaking factor. To help locate the source of this noninvariance, a follow-up MIMIC (multiple indicators, multiple causes) analysis was conducted in which native language background was specified as a covariate to determine how the language background of the test-takers was related to test performance.
Two language groups were formed: the Asian language group and the European
language group. Direct effects on the pronunciation and fluency measures from the
covariate were added in the re-specified model. This resulted in significant model
improvement. The appearance of the two direct effects suggested that there was a scale
bias that would produce different performance ratings due to group membership. The
author concluded that based on the results of the MIMIC modeling, the two speaking
rating scales, pronunciation and fluency, functioned differentially across the two major
language groups.
Factorial noninvariance suggests that test results reflect information other than what the test intends to reveal, such as native language background. Efforts should be made to investigate the nature of any such confounding information. Otherwise, the validity of score interpretation and use will be put into jeopardy. This study demonstrated how MIMIC modeling could be a powerful tool for detecting such confounding effects.
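In SEM terms, a MIMIC model is simply a factor model whose latent variable, and, where bias is suspected, specific indicators, are regressed on an observed covariate. A hedged sketch of the kind of model described above, using semopy, invented rating names, and a binary language-group dummy, could look like this; it is an illustration of the general technique, not a reproduction of the published analysis.

```python
import pandas as pd
import semopy

# Hypothetical speaking ratings plus a 0/1 language-family indicator.
data = pd.read_csv("speaking_ratings.csv")  # pron, flu, gram, voc, lang_group

# MIMIC model: the covariate predicts the latent speaking factor, and the
# two extra direct paths (pron and flu on lang_group) capture potential
# scale bias over and above the covariate's effect on the factor.
mimic = """
Speaking =~ pron + flu + gram + voc
Speaking ~ lang_group
pron ~ lang_group
flu ~ lang_group
"""

model = semopy.Model(mimic)
model.fit(data)

# Significant direct effects would suggest that the pronunciation and
# fluency scales function differentially across the two language groups.
print(model.inspect())
```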
The studies reviewed in this section specified one or multiple TTCs in their
models based on FL test performance. The findings generally support the assertion that TTCs account for part of test performance variability. The impact of the various TTCs can be observed in two ways. By means of multi-group analysis, TTCs can serve as grouping variables; alternatively, TTCs can be modeled as independent variables to have direct effects on test performance. Results from studies using either approach have shown that a universal view of FL proficiency, regardless of TTCs, is not warranted. Studies in this section have also demonstrated the usefulness of multi-group invariance analysis and SEM as tools for interpreting test performance in light of TTCs.
There were a few studies that examined test performance in relation to target
language contact experience as a TTC. Two kinds of language contact were of particular
interest to the researchers: language contact with the target language community and
language contact at home. The former usually occurs in a study-abroad context, whereas the latter characterizes heritage learners.
Language contact with the target language as a TTC was first examined in two studies of Advanced Placement (AP) examinations. Each year, college-bound high school students who want to earn college credit and advanced placement in particular areas of study take
the College Board's AP courses and exams. AP examinations cover a wide range of subject areas. The FL examinations assess skills that students might acquire after four to six semesters of college FL courses. Four FL exams are traditionally available for students to take: French, German, Latin, and Spanish. In 2007, three new languages were added to AP's list of FL examinations.
The demographics of the AP examinees can vary from exam to exam and at
different time periods. In the case of the AP French Language Examination, it was
indicated in Morgan and Mazzeo (1988) that the majority of the examinees taking this
exam learned French primarily through secondary school academic coursework, but that some examinees had spent time in a French-speaking country or used French at home. Whether the test is an appropriate measure for the latter two groups, and whether the test functions in the same way across all examinees, is a validity question that needs to be answered.
Morgan and Mazzeo (1988) identified four groups of examinees based on their
language-learning experiences. The first group, the standard group, had little or no out-
of-school French language experience. This group, which constituted the majority of the
test-takers, was divided into two samples for cross-validation purposes. The second group
(special group I) included the examinees who had spent at least one month in a French-
speaking country. The third group (special group II) regularly spoke or listened to French
at home, and therefore fit the definition of heritage learners. The fourth group consisted of college students (the college group).
Since the exam was intended to measure the listening, reading, writing, and
speaking skills (College Board, 1987), the primary goal of this study was to verify the
skills measured by the exam, and to determine whether the same dimensional structure
could be applied to the four populations. A correlated four-factor model (listening and
writing, language structure, reading, and speaking) provided the best overall fit to the
data. In this model the listening scores loaded with the writing scores on a common
factor. Discrepancy was found between the make-up of this factor model and the test’s
subsection structure, in which listening and writing were separate sections, and both
language structure and reading comprehension belonged to the reading section. The
reason given by the researchers for this discrepancy was that the listening items measured
a dimension similar to the one measured by the writing tasks. The researchers criticized
the equal weighting scheme, arguing that it actually overweighted students' ability to produce grammatically correct prose, as it was measured in both the listening and writing sections of the test, and it downplayed the ability to read and interpret French passages, as some of the items in the Reading section seemed to assess a separate grammar-oriented dimension.
This correlated four-factor model was examined for invariance between the
standard group and each of the other three groups. Between the standard group and
special group I, only invariance in factor loadings could be held. No invariance was
found between the standard group and special group II. Invariance in both factor loadings
and unique variances was found between the standard group and the college group. The
factor structure based on the standard group was most similar to the one of the college
group. What was common to the standard group and the college group was that both
groups lacked significant out-of-school French language experience. The indication was that out-of-school language experience affected the structure underlying test performance.
The AP Spanish Language Examination, which is designed to assess high school students' Spanish language proficiency in order to grant college credit and placement, was evaluated by Ginther and Stevens (1998). One purpose of the study was to investigate
whether the dimensionality underlying test performance was compatible with the four-
section design of the test, intending to measure language abilities in listening, reading,
writing, and speaking. The second goal was to study whether the same factor structure
would hold constant across relevant groups within the entire test-taking population. The reference group comprised examinees who described their background as Latin. The other four groups were Mexican Spanish-speakers, Mexican bilingual-speakers (who indicated equal preference for Spanish and English), White English-speakers, and another English-speaking group. A correlated four-factor model corresponding to the four sections of the test was specified as the a priori model. This model was found to provide a good fit for the
reference group based on item parcel scores. Group comparisons were conducted by
imposing multiple levels of equality constraints between the reference group and each of the other groups.
When comparing the reference group to the Mexican Spanish-speaking group and
the Mexican bilingual-speaking group, invariance in the number of the factors and factor
covariances did hold across the groups. However, only invariance in the number of the
factors was found when comparing the reference group to the two English-speaking
groups. This finding suggested that the more a group’s background in ethnicity and
preferred language deviated from the reference group, the less equivalent the factor structures were across the groups.
It was also found that the four factors were less closely related for the Spanish-
speaking groups than for the English-speaking examinees. Descriptive statistics also
showed that Spanish-speakers had higher means on all scores than the English-speakers. Unfortunately the study did not include a mean structure in group comparisons.
The researchers suggested classroom instructional environment as one potential cause for
the high factor correlations of the low-level groups. The rationale given was that if a group of learners acquired the language mainly through classroom instruction, the abilities they developed would be highly constrained by the type of instruction they received. If instruction addressed all four areas of skill development (listening, reading, writing, and speaking), it would be likely that the growth of these skills would proceed in parallel, yielding highly correlated factors.
Differences were also found in model parameter estimates across groups. The
speaking factor was found to be less correlated with the other factors for the two Spanish-
speaking groups than for the two English-speaking groups. This indicated that speaking
ability was both quantitatively and qualitatively different across the Spanish-speakers
compared to the English-speakers. Not only did the Spanish-speakers perform better on
the speaking tasks, but also their speaking performance was less related to their
performance on the other parts of the test. The researchers suggested that having out-of-
class experience with the target language might have had a fundamental influence on this
difference in factor structure. The researchers also criticized the fairness of the test. They argued that some examinees were evaluated not only on what they learned in the classroom but also on abilities developed outside of it.
Home language environment, a TTC particularly relevant to heritage language learners, was studied by Bae and Bachman (1998). The uniqueness of this study was that the nature of language proficiency was investigated in the Korean language instead of in English. Two groups of learners were included: the heritage learners and the non-heritage learners. Unlike the heritage learners, who were exposed to Korean at home, the non-heritage learners would mainly communicate in English. The study focused only on two
comprehension skills, listening and reading. A correlated two-factor model was selected
as the baseline model as it described the data for both groups well. This baseline model
was tested for the equivalence of parameters of interest across the groups.
The results from examining measurement invariance showed that the factor
loadings from one listening task on the listening factor were different across the two
groups. A task analysis of this listening task suggested that it might have tapped into relatively more integrated higher-level listening skills due to its complex
input. The factor loading for this task was greater for the Korean heritage group than for
the non-heritage group, suggesting that this task measured listening differentially across
the two groups, and that this task was a better indicator of the listening ability for the
former group with higher overall listening ability than for the latter group with lower
listening competence. The results from examining structural invariance indicated that the
factors were correlated in similar ways across the two groups. However, compared to the
non-heritage learners, the Korean heritage learners performed more uniformly on the
listening tasks. In contrast, the non-heritage learners demonstrated less variance in their
reading scores than the heritage learners. Although the same two-factor pattern was accepted for both groups in terms of the overall model fit, significant differences in individual parameters across groups indicated that some parts of the test functioned differently for the two groups. The study suggested that learners with different home language backgrounds might display different factor structures. Furthermore, not only might the configuration of their factor structures differ, but their ability to respond to certain tasks is likely to vary as well. The authors made an urgent call for future researchers to include a mean structure in multi-group factor analysis so that groups' factor means could be compared.
Kunnan's (1994, 1995) study was one of the few studies that employed a multi-group SEM approach to examine both a test's internal factorial representation and its external structural relationships. The study investigated the influences of several TTCs on test performance. The multi-group SEM analyses were conducted based on performance data on the ETS and Cambridge test batteries.
The overall model had a measurement component that modeled the relationships
between test measures and latent factors, as well as the relationships among the factors. It
also had a structural component to account for test performance's external relationships
with multiple TTCs. These TTCs included home-country formal instruction (HCF), home-country informal exposure, experience in the target language environment, and monitoring (MON). Both the measurement and structural components were tested for model fit
across two groups of test takers: non-Indo-European native language group (Thai, Arabic,
Japanese, and Chinese), and Indo-European native language group (Spanish, Portuguese,
French, and German). The research question was how the four TTCs mentioned above related to test performance. A four-factor measurement model was attempted for both language groups. The four factors were: reading and writing assessed by the ETS test battery, reading and writing assessed by the Cambridge test battery, listening, and speaking. Two competing structural models were postulated when modeling the relationships among the TTCs and test performance. In the
first model, the four TTCs were specified to have equal influences on the four factors. In
the second model, only MON had direct influence on test performance, whereas the other
three were allowed to cast influences on test performance through MON, the intervening
factor.
The study found that neither model produced a clear overall statistical fit for both
groups, although some direct effects of TTCs on the test performance were substantive
and interpretable. Home-country formal instruction was found to have positive influence on reading and writing, but negative influence on listening and speaking. Home-country informal exposure was found to have positive impact on listening and speaking, but negative impact on reading and writing. It was also discovered that the experience of staying in the target language environment was related to test performance.
ETS-TOEFL® Studies
The TOEFL® test is developed by Educational Testing Service (ETS), and is intended to measure English skills needed in an academic setting. Since its introduction, the test has served a very large and diverse test-taking population.
As discussed in the previous sections, the internal structure of a test can be highly
dependent on the characteristics of the group taking the test. Any effort to build a validity
argument for test score interpretation and use should take relevant test-taker variables
into consideration. Because of the TOEFL® test's extremely diverse test-taking population,
examining the relationships between TTCs and test performance has been a strong focus
in TOEFL validation studies. With every new generation of the TOEFL test, there has
been a new wave of validation studies concentrating on the factor structure and its
interplay with the ever-changing test-taking population (Hale et al., 1988; Hale et al.,
1989; Stricker et al., 2005; Stricker et al., 2008; Swinton & Powers, 1980). The following paragraphs review several of these studies.
Swinton and Powers (1980) provided validity evidence of TOEFL® test score
use by determining the construct the test intended to measure and the relationships among
the factors in light of multiple TTCs. As many as seven native language groups were included: African, Arabic, Chinese, Farsi, Germanic, Japanese, and Spanish. Groups differed in their overall level of proficiency based on their test performance, with the Germanic group having the highest and the Farsi group having the lowest overall language ability. Variables of age and reason for taking the TOEFL test were also analyzed for their relationships with test performance. Analyses were performed separately for each language group.
Regarding the influence of overall proficiency, it was found that the language
ability of the group with the highest proficiency, the Germanic group in this case, showed
the greatest distinctions in its factor structure. As many as eight factors emerged for the
Germanic group during the initial factor extraction. As the analyses proceeded, it was
also discovered that only two factors could be confirmed for the Farsi group, the group
with the lowest overall proficiency. These results demonstrated that the amount of factor
differentiation did vary as a function of overall ability level. One reason for finding a larger set of differentiated abilities in the high-level group, as the authors suggested, might
be that the responses from high-level test-takers were less likely to be contaminated with
random factors, such as guessing, which made it possible for some minor dimensions to
emerge.
Except for the Farsi and Germanic groups, a three-factor model was established
for each of the other language groups. A listening factor corresponding to Section I
(Listening Comprehension) of the test was found for all groups, meaning that the
response pattern on the listening items could be explained by one common factor for all
language groups. However, there were differences in the make-up of the factor structures
in the other parts of the test across two major language groups, the Indo-European group
(Spanish and Germanic) and the non-Indo-European group (African, Arabic, Japanese,
and Chinese). For the Indo-European group, the majority of the items in Section II
(Structure and Written Expression) loaded on one common factor, and most of the items in Section III (Vocabulary and Reading Comprehension) loaded on another common factor. The two factors corresponded well to Section II and Section III of the test. As for
the non-Indo-European group, most items from Reading Comprehension were found to
load with items from Structure and Written Expression on a common factor, whereas the
items from Vocabulary loaded on a separate factor. For this group, the items in Section
III did not seem to measure a common construct. Instead, the reading items tended to
measure a factor that was similar to the one measured by the items in Section II.
According to the authors, this finding had tremendous implications for score
reporting and interpretation. They suggested reporting subtest scores on Vocabulary and
Reading Comprehension separately rather than a combined Section III score, especially
for the non-Indo-European group. Their finding indicated that the interpretation of
Section III scores could be very different for the two major language groups. For the Indo-European speakers, Section III items appeared to measure a common ability, and the interpretation was relatively straightforward. For the non-Indo-European
speakers, two factors seemed to account for performance in Section III. To make score
interpretation even more complicated, one dimension also overlapped with a factor
measured by another section of the test. Reporting one single Section III score with the
assumption that all items in the section measured the same ability would be misleading.
The usefulness and necessity of reporting a separate vocabulary score were highlighted by these findings.
Influences of other test-taker variables were also detected via a factor extension analysis, in which these variables were regressed on the test performance. It was found that both age and graduate degree-oriented motive were positively related to a factor associated with the vocabulary ability for the non-Indo-European group, and with both vocabulary and reading abilities for the Indo-European group.
As language testing at ETS evolved into the 21st century, attempts were made to
revise the TOEFL® test so that the most recent theoretical developments in the field of
second language acquisition and FL testing could be reflected in the test design and task types. The LanguEdge courseware package had an ESL test which was similar in task design and test structure to the new generation of the TOEFL test which was under development at that point
(ETS, 2004). In both the LanguEdge test and the new TOEFL test, speaking and writing
have become more integrated with reading and listening (ETS, 2004).
Stricker et al. (2005) examined the construct validity of the LanguEdge test. In the test, integrated tasks were implemented so that language skills could be tested in a way that was close to how they would be used in real-life communication. Two out of
the five speaking tasks were integrated with listening and reading, respectively. There
were also two integrated writing tasks (out of three writing tasks in total), one having a
listening component and the other having a reading element. Factor structure of test
performance was examined across three native language groups: Arabic, Chinese, and
Spanish.
Scores on the four integrated tasks were counted toward the speaking and writing sections. These integrated tasks added an extra layer of complexity onto score interpretation across different test-taking groups. The questions of concern to the authors were (a) whether they could still find an appropriate model with factors corresponding to the sub-section structure, and (b) whether the factor structure would hold equivalent across the language groups. Analyses were based on dichotomously scored listening and reading as well as polytomously rated speaking and writing scores. A
correlated two-factor solution was chosen for each group and for the whole group. In this
model speaking ratings all loaded on one factor, whereas scores from the other sections
loaded on the other factor. According to this solution a common factor accounted for test
performance of the Listening, Reading, and Writing sections. The analyses did not
produce a factor model that corresponded to the structure of the test. The authors
suspected that there might have been a lack-of-instruction effect on the under-development of the test-takers' speaking ability, which might have made speaking stand apart as a separate factor.
The multiple group invariance analyses showed that invariance held across the
groups in terms of the number of factors, factor loadings and unique variances but not
factor correlations. The study concluded that the test functioned in a similar way across
groups although the correlations between the factors were lower for the Arabic sample than for the other samples.
The study failed to find separate factors for each section of the test. This might
have been because the inclusion of the integrated tasks had blurred the distinction of the
skill measured by each section. With regard to score interpretation, this finding also
called into question the usefulness of reporting section scores. On the other hand, the
study established a uniform factor structure across three native language groups. Factor structure equivalence was a valuable piece of validity evidence of test fairness, especially for a test with such a diverse test-taking population.
The TOEFL® test evolved into an internet-based mode in 2005. The introduction
of the internet-based TOEFL test, the TOEFL iBT® test, was a milestone in the test’s
growth and modernization. Computer and internet technologies make the TOEFL iBT test accessible to an even wider population. With this expanding test-taking population, it has become even more important for researchers to investigate the test's underlying factor structure across different test-taking groups around the world. Research studies, such as Stricker et al. (2005), laid the groundwork for validating TOEFL iBT test score interpretation and use.
Stricker and Rock (2008) recently assessed the structure invariance of TOEFL
iBT performance across subgroups of test takers who were identified by three criteria:
native language family backgrounds, exposure to English use in educational and business
contexts in home countries, and years of formal instruction in the English language.
Various native language groups in the test population were categorized into two
major language groups: the Indo-European language family and the non-Indo-European language family. Countries were also classified (based on their historical and administrative status) with regard to the prevalence of English use in educational and business contexts. Countries where English is the native language of the majority were not included in the study. Countries represented in the test-taker population were
divided into either the outer-circle country group or the expanding-circle country group. Regarding length of English study, test-takers were also divided into three groups: a group with fewer than 6 years of formal instruction in English, a group with instruction between 7 and 10 years, and a group with more than 10 years of instruction.
A higher-order model with four first-order factors was found to be the best-fitting
model for all test-takers in this study. The same structure was also identified in all subgroups defined by native language family, exposure to English, and years of formal instruction. The authors concluded that this higher-order structure conformed to the four-section test design, and supported the usefulness of reporting the total score as an overall indicator of the general language ability as well as four section scores as indicators of the four skills.
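In SEM syntax the higher-order structure the authors retained is compact: four first-order section factors, each measured by its own scores, all loading on one general factor. The sketch below (semopy again, with invented indicator names) illustrates the specification; it is not ETS's operational model, and it assumes semopy's lavaan-style syntax accepts latent variables as indicators of a higher-order factor.

```python
import pandas as pd
import semopy

# Higher-order model: a general factor governs the four section factors.
higher_order = """
Listening =~ lis1 + lis2 + lis3
Reading   =~ rdg1 + rdg2 + rdg3
Speaking  =~ spk1 + spk2 + spk3
Writing   =~ wrt1 + wrt2 + wrt3
General   =~ Listening + Reading + Speaking + Writing
"""

data = pd.read_csv("section_scores.csv")  # hypothetical score file
model = semopy.Model(higher_order)
model.fit(data)
print(semopy.calc_stats(model))  # overall fit of the hierarchical structure
```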
This uniformity in the structure of language ability as measured by the test across
diverse subgroups of test-takers provided strong validity evidence related to the test’s
internal structure and this structure’s generalization. Based on these findings, a validity
argument can be made that the TOEFL iBT® test functions the same way for diverse
subgroups of test-takers. This also implies that score comparisons among individual test-takers from different subgroups can be justified.
Examining the internal factor structure of a test is an essential step in building the validity argument for a test. The resulting factor structure sums up the
response pattern of test performance on the test. It also reveals the underlying ability or
abilities the test empirically measures. Our understanding of the nature of FL proficiency
is enhanced and verified through this process. However, as demonstrated by the results
from the studies cited in this review of literature, the make-up of the structure of FL
proficiency is highly context-dependent. It depends on what test is used, and how the test
is constructed. It also depends on who takes the test and under what circumstances. Tests
aiming at assessing proficiency in a FL can vary in test length, subtest structure, item or
task type, and scoring scheme, to name just a few. These facets, proposed by Bachman
(1990) as test method facets, can exert tremendous influence on test performance. In addition, TTCs, or learner variables, also interact with how a test functions. Evaluating factorial invariance across groups defined by TTCs, such as gender, native language background, learning condition, home language environment, and so forth, provides critical validity evidence
for test generalization and fairness. Finding factorial noninvariance implies that
divergence in the test performance of different groups exists. This will make it hard to
argue that the results of the test can be interpreted in a comparable way across the groups.
Due to differences in factor structure, such as the number of factors and the relationships among them, a test may not measure the same construct for all groups. In this case, will it still make good sense to use the same test or the same score reporting scheme for all test-takers?
Test fairness addresses whether a test measures the same construct in all relevant
subgroups of the population (AERA, APA, & NCME, 1999). Factorial invariance is an
important assumption for claiming test fairness. As highlighted by many authors in this review, finding out whether a factorial model is invariant across groups of test-takers helps decide how scores should be reported and used. Relevant issues in this regard include whether a single composite score or section scores or both should be reported, and how weights should be assigned to the section scores if a total score is reported. It remains an open challenge to design a score reporting system for a test that exhibits factorial noninvariance for different test-taking groups.
Conclusion
This chapter has reviewed and discussed models of FL proficiency. The review of the models was not intended to be exhaustive. The decision to review only selected models was based on two criteria. First, these models represent milestones in the course of
searching for an understanding of what FL proficiency is, and they have been influential
in FL test development. Second, the soundness of these models has been theoretically
examined and empirically investigated via a latent trait approach in language testing.
At the end of the last decade, Bachman (2000) pointed out two major themes of
advancement in language testing research. The progression of the field has been powered
by both the refinement of a rich variety of approaches and tools, and by a broadening of the research agenda. In defining the nature of FL proficiency, the field has witnessed rapid growth in theoretical positions and empirical techniques: from the domination of the unitary competence hypothesis to views of proficiency as multicomponential, dynamic, and communicative; from holding a dichotomous view of trait and method to recognizing their interdependence; and from being concerned with the internal structure of a test to examining a test's external relationship
with TTCs.
According to this literature review, consensus has been reached by the research community on two points. First, FL proficiency is viewed as a construct with multiple dimensions. Second, the view is held that the proficiency in a FL
can be interpreted more meaningfully if relevant TTCs are taken into consideration.
What the field has come to agree upon is crucial to the consolidation and advancement of the discipline. The shared belief about the nature of FL proficiency demonstrates the collaborative efforts from
researchers with different backgrounds and strengths. However, uncertainties still remain,
and these pose both challenges and opportunities for future language testing researchers.
One common theme that has emerged from this review is that it is still unclear to
the research community what FL proficiency consists of or how the constituent parts
interact. Sawaki et al. (2008) pointed out that multidimensional competence models come in several forms, and that there are three different schools of thought for modeling this construct. The first one claims that FL competence consists of multiple uncorrelated factors. The second believes that this proficiency can best be represented by a set of correlated factors. The third argues that the correlated first-order factors are themselves governed by a higher-order factor. This higher-order factor is general in nature, and it is responsible for performance across multiple skills. The higher the loadings of first-order factors on the general factor, the more likely it is that the general factor actually exists.
While the first position has been repeatedly refuted based on empirical evidence
showing that factors are indeed correlated, choosing between the second and the third
position has never been easy. The question is how high the correlations among the factors
should be to adopt a hierarchical structure. This question becomes even harder to resolve
when there are three first-order factors in the model. A hierarchical model with three first-order factors and one general factor is statistically equivalent to a correlated three-factor model in terms of overall model fit statistics. Researchers in the past have leaned towards
one position or the other by comparing model fit or by examining individual parameter
estimates when results from model comparisons did not indicate which one was clearly
superior to the other. Most recently, a bifactor model was tested in Sawaki et al. (2008).
In a bifactor model, two factor loadings are estimated for each observed score, one on the
general higher-order factor and the other on one of the first-order factors. More research
needs to be conducted before we can tell how well this bifactor model can be applied to FL proficiency data. Future researchers should keep testing these competing hypotheses until a general consensus emerges.
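The bifactor specification differs from the higher-order one in that every observed score loads directly on both the general factor and one group factor, with all factors kept orthogonal. A minimal sketch with invented names follows; note that the lavaan-style 0* fixed-covariance syntax is assumed to be accepted by semopy and should be verified against the installed version.

```python
import pandas as pd
import semopy

# Bifactor model: each score loads on the general factor and on exactly
# one group factor; all factor covariances are fixed to zero.
# ASSUMPTION: semopy accepts lavaan-style fixed covariances ("0*").
bifactor = """
General   =~ lis1 + lis2 + lis3 + rdg1 + rdg2 + rdg3 + spk1 + spk2 + spk3
Listening =~ lis1 + lis2 + lis3
Reading   =~ rdg1 + rdg2 + rdg3
Speaking  =~ spk1 + spk2 + spk3
General ~~ 0*Listening
General ~~ 0*Reading
General ~~ 0*Speaking
Listening ~~ 0*Reading
Listening ~~ 0*Speaking
Reading ~~ 0*Speaking
"""

data = pd.read_csv("scores.csv")  # hypothetical nine score columns
model = semopy.Model(bifactor)
model.fit(data)

# For each score, two loadings are estimated: one on General and one on
# the corresponding group factor, mirroring the description above.
print(model.inspect())
```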
The second unresolved issue has to do with the compatibility of trait and method. Since its initial appearance, the concept of communicative competence (Canale & Swain, 1980) has had a strong influence not only on language teaching and learning, but also on test development. Communicative competence emphasizes the use and coordination of all possible knowledge and skills required for interacting in actual communication (Canale, 1983). This poses an immediate threat to the validity and usefulness of tests that focus on only one aspect of language knowledge or skill at a time. For instance, a listening test is usually designed to measure listening ability and, ideally, not to measure anything else. A test battery will usually consist of a series of skill-based tests (listening, reading, writing, and speaking). Such a test battery is good enough to measure a wide repertoire of skills and knowledge separately, but not sufficient to measure the coordination of different skills and knowledge in actual communication.
A new type of language test task, the so-called integrated task, has been developed for and implemented in the TOEFL iBT® test. An integrated task requires the use of multiple skills simultaneously. For example, an integrated speaking task could involve both speaking and listening, and an integrated writing task could involve reading, listening, and writing. Such an integrated task is no longer tied to one modality and, consequently, comes closer to the coordinated language use found in real communication.
The arrival of integrated language tasks raises questions about how the skill-based factor models found in many of the studies in this review can accommodate performance on such tasks. One follow-up question is what the factor structure underlying performance on integrated tasks might look like. A further question is, if we do find distinct factors, how we are going to name the factors so that their interpretations remain meaningful. Future researchers should examine test performance on integrated tasks that are designed to elicit the coordinated use of multiple skills. The latent factor approach would still apply; however, the resulting factor structure might no longer be purely skill-based.
Third, this review has also pointed out that one of the top tasks on the agenda for future research is to investigate the relationships between TTCs and test performance. TTCs are recognized as one of the four influences on test scores in Bachman's (1990) CLA model, and in Bachman and Palmer's (1996) description of language use in language tests. According to this review, the field has recently witnessed a surge of interest and a growing number of empirical studies on TTCs, such as gender (Wang, 2006), native language background (Shin, 2005), ethnicity and preferred language (Ginther & Stevens, 1998), and home language environment (Bae & Bachman, 1998), to name just a few. However, the amount of information we have obtained is far from what we need to fully understand these complex and dynamic relationships. Moreover, there are still TTCs that have not attracted enough attention from the field, and whose relationships with test performance therefore remain underexplored.
One such TTC is language contact with the target language in a study-abroad environment. FL test-takers are non-native speakers of the target language, but they may have had the experience of studying and/or living in the target language environment. Learning experience in the target language environment has been investigated in relation to test performance in very few studies (Kunnan, 1994, 1995; Morgan & Mazzeo, 1988). Since this TTC is salient and relevant in the context of FL testing, future research should study the relationships between this TTC and FL test performance in various testing situations.
It is also worth mentioning that the growing interest in the topic of TTCs and test performance has been accompanied by methodological advances, most notably structural equation modeling (SEM). SEM has offered not only powerful research techniques, but also new perspectives to explore issues that interest researchers from both language testing and second language acquisition.
Within the general SEM framework, we can investigate a wide range of issues: (a) what the internal structure of a test is; (b) whether and to what degree the internal structure of a test holds equivalent across different test-taking groups of interest; and (c) whether this latent internal structure relates to any external variables of TTCs and, if so, in what way. Factor analysis, a family of latent multivariate analysis techniques, has been used extensively to investigate the first question. Multi-group invariance factor analysis has been used to address the second question. Built upon factor analysis and path analysis, SEM is able to further provide a structural view of the relationships between a test's internal structure and the learner variables that exist outside the test, such as age, gender, and learning experience. This structural component, together with the measurement component based on the internal structure of the test, could offer insights into the relationships between test-taker background and test performance. Future researchers who are interested in the relationships between TTCs and test performance are encouraged to take full advantage of this analytic framework.
This chapter reviews and synthesizes research results of studies that have investigated the nature of FL proficiency through a latent factor approach in the context of FL testing. Directions for future research are suggested at the end of the chapter. These concerns for future studies will be addressed in the design of the current study, which is described in the next chapter.
CHAPTER THREE
METHODOLOGY
This chapter first introduces the TOEFL iBT® test developed at Educational Testing Service (ETS), including a brief history of the test, and ETS's philosophy of testing English as a foreign language (FL). Next, two general goals of the study are explained, and three research questions are put forward. Borrowing knowledge from the studies reviewed in the previous chapter as well as insights from the study-abroad literature, six hypotheses are formulated. This is followed by a description of the TOEFL iBT public dataset, the subjects, and the measures used in this study. Last, planned analysis procedures are described.
The TOEFL iBT test is a test of English as a FL, and it is administered world-wide. The purpose of the test is to "measure the communicative language ability of people whose first language is not English" (Jamieson, Jones, Kirsch, Mosenthal, & Taylor, 2000). An online report on validity evidence states that "TOEFL iBT test scores are interpreted as the ability of the test taker to use and understand English as it is spoken, written, and heard in college and university settings," and "[t]he proposed uses of TOEFL iBT test scores are to aid in admissions and placement decisions at English-medium institutions" (ETS, 2008). As the newest member of the TOEFL® test suite, the TOEFL iBT test has gradually replaced the earlier paper-based and computer-based versions.
The development of the TOEFL iBT test was part of a broad effort to modernize language testing at ETS. Jamieson et al. (2000) explained that the impetus behind this movement was to build a test that could reflect communicative competence models. The design of the test was also intended to meet the needs of various stakeholders for more useful and meaningful score information.
In response to these calls, the current TOEFL iBT® test has two distinctive features. First, following the working model developed by the TOEFL Committee of Examiners (COE), the test measures communicative language proficiency and focuses on academic language and the language of university life (Chapelle et al., 1997). This COE model makes an explicit distinction between the context of language use and the internal capacities of individual language users. The model portrays the relationships between the two as dynamic and integrated, as the COE members believed that "the features of the context call on specific capacities defined within the internal operations" (p. 4). This understanding of language ability in relation to its context of use lays out the theoretical foundation upon which a test development framework with both a context and an ability dimension was built.
Second, the new test has mandatory speaking and writing sections, intended to measure the productive skills directly. In addition, a new type of language task, namely the integrated task, is for the first time introduced into TOEFL testing. Unlike the traditional independent tasks, which call upon language skill in a single modality, an integrated task requires test-takers to use and coordinate skills in more than one modality. For example, an integrated writing task can require test-takers to incorporate and synthesize information from an aural lecture and a reading passage in their written responses. These integrated tasks provide information not only about examinees' abilities in more than one skill area but also about their ability to coordinate the use of different skills in communication.
As stated in Chapter One, this study aims to provide empirical evidence to support our understanding of the nature of FL proficiency in both its absolute and relative senses, and to investigate the educational, social, and cultural influences on language proficiency. The object of investigation is communicative language ability as measured by the TOEFL iBT® test. The reasons for choosing the test are explained as follows.
First, the theoretical reasoning behind the development of the test reflects the current thinking in applied linguistics, which views language ability as communicative in nature. The test focuses on language use in communication, not language display in isolation. It specifies the relationship between the context of language use and language abilities as integrated, rather than dichotomous. The role of context in defining language ability is recognized and reflected in test development. Performance on the test can therefore serve as an empirical basis for studying communicative language ability.
Second, as the world's most widely used English language test (ETS, 2008), the TOEFL iBT test draws test-takers who differ on a wide range of background variables, such as native language and country, English language learning, and language contact experience. This diversity makes it possible to examine communicative language ability in its relative sense across different test-taker groups. With test-takers from different backgrounds, the results of the test could also offer unique perspectives on how different educational, social, and cultural environments affect the development of FL ability.
Based on TOEFL iBT® test performance, this study investigates the dimensionality of communicative language ability and its latent factorial structure across test-taker groups with different context-of-learning experiences. The study also intends to examine the impact of the length of formal learning, as well as the joint impact of learning and time abroad on the development of FL ability.
The first goal of this study concerns the nature of communicative language ability, as measured by the TOEFL iBT test, and how this ability relates to the COE model. Language competencies within individual users and the context of language use are both parts of the construct definition in the COE model. The COE model originally characterized the language use situation by five variables: participants, content, setting, purpose, and register. However, it was decided later to organize language tasks by modality largely because the results of surveys (Ginther & Grant, 1996; Taylor, 1993) showed that the majority of test users would like to have scores reported for speaking, writing, listening, and reading. It was obvious that there was a mismatch between the current thinking in applied linguistics and popular belief about what language proficiency meant. As pointed out by Chapelle et al. (1997), "[w]here the skills approach falls short, in the view of applied linguists, is its failure to account for the impact of the context of language use" (p. 26). Examining the relationship between the context of language use and cognitive skill-based capacities of language users, and whether the latent structure of test performance reflects the role of context in defining language ability, would offer the theoretical framework for addressing this goal.
The second goal concerns the relationships between communicative language ability and learner/test-taker background variables. This research interest has been shared by the second language research community as well as the TOEFL® research community. Several past TOEFL studies were devoted to validating score interpretation and use in light of differences of test-takers on background variables, such as reasons for taking the test (Swinton & Powers, 1980), native language background (Stricker et al., 2005; Stricker & Rock, 2008; Swinton & Powers, 1980), target language exposure in home countries (Stricker & Rock, 2008), and years of classroom instruction (Stricker & Rock, 2008). The current study aims to investigate the joint impact of study-abroad and formal learning experiences on
communicative language ability, based on TOEFL iBT® test performance. Although the test is intended for people whose first language is not English (Jamieson et al., 2000), some of the test-takers may have the experience of living in an English language environment, and others may not. Borrowing the concept of language contact from the study-abroad literature and treating it as a test-taker characteristic (TTC), this study first looks at whether or not having contact experience with the target language community has any effect on the underlying structure of test performance.
This part of the investigation examines communicative language ability in its relative sense. In other words, the result tells us whether or not the language abilities developed in groups of test-takers with different language contact experiences share the same underlying structure. From a test development and validation perspective, the outcomes of this investigation contain crucial information that can be used for test validation. Understanding how a test functions across different test-taker groups tells whether or not the test scores can be interpreted and used in a similar way across these groups. Since the test is administered both domestically (where English is the dominant language) and internationally (where English is not the dominant language), the results also inform us whether it is reasonable to treat scores from the two administration contexts as comparable.
Second, test-takers may also vary in terms of the length of study-abroad and formal language training they have received. The length of study-abroad as well as the length of learning are examined together to facilitate further understanding of their joint impact on the development of language ability. The results have both pedagogical and practical implications. Pedagogically speaking, results of such an investigation can offer empirical evidence of the impact of study abroad, and how it interacts with the effect of learning. From a practical perspective, findings can be used to advise future test-takers on how to prepare for the test.
The study-abroad and learning experiences are both salient in the target test-taking population, and their impact on test performance deserves an in-depth investigation in the context of TOEFL iBT® testing. The outcomes would inform us what to expect from and how to deal with the increasingly diverse and constantly changing test-taking population.
Research Questions
The first research question asks what the nature of communicative language ability is, as measured by the TOEFL iBT® test. To be more specific, this investigation focuses on answering the following two questions: (1) what the constituents of communicative language ability are and how they are related; and (2) whether the role of context in defining communicative language ability can be reflected in the latent structure of test performance.
The operational test adopts the four-skills approach to test design, and it reports a separate score on each of the skills (listening, reading, writing, and speaking) as well as a total score. Factor-analytic studies (Sawaki et al., 2008; Stricker & Rock, 2008) have provided supporting evidence for this skill-based design and reporting scheme. Stricker and Rock (2008), using task-based scores from 2720 test-takers during the TOEFL iBT field test, found that a correlated four-factor model and a higher-order model with four first-order factors fit the data similarly well. They concluded that the latter was the best model based on the principle of parsimony. Sawaki et al. (2008) worked with the same dataset as the one in Stricker and Rock (2008) but conducted the analysis based on item-level scores. They found that the fit of the correlated four-factor model was better than that of the higher-order model based on the result of a chi-square difference test, although the differences in other fit indices were minimal. They concluded that the higher-order model remained a defensible representation, given the minimal differences and its parsimony.
In an earlier study, Stricker et al. (2005) examined the factor structure of the LanguEdge™ test, a prototype of the TOEFL iBT® test. Based on task-based scores, the authors found a correlated two-factor model across three native language groups (Arabic, Chinese, and Spanish). One of the factors was a speaking factor, and the other was a combination of the remaining skills.
The higher-order model and the correlated four-factor model had both exhibited good fit to TOEFL® test performance in the past. Since the LanguEdge test used in Stricker et al. (2005) and the TOEFL iBT test have similar test structures and item types, the correlated two-factor model found in Stricker et al. (2005) could also provide a suitable solution with the current dataset. Based on the results of these studies, a higher-order model, a correlated four-factor model, and a correlated two-factor model are all plausible factorial solutions. To address the first focus of research question one, the following hypothesis is formulated: Hypothesis 1 states that the communicative language ability measured by the TOEFL iBT test can be best explained by a higher-order model, and can also be explained adequately by a correlated four-factor model and a correlated two-factor model.
In the context of TOEFL testing, the context of language use is interpreted as part of the definition of communicative language ability (Chapelle et al., 1997), and language use situation is characterized by five variables: participants, content, setting, purpose, and register (Jamieson et al., 2000). The content variable refers to the topic of a task. The setting variable refers to the location of a language act. The content and the setting variables are key ingredients in defining the context of language use. To address the second focus of research question one, models with situational factors, content and setting, are subject to evaluation of fit. The two corresponding hypotheses are formulated as follows: Hypothesis 2 states that adding a content dimension to the baseline model established through testing Hypothesis 1 improves model fit, and therefore demonstrates the role of context in defining communicative language ability. Hypothesis 3 states that adding a setting dimension to the baseline model established through testing Hypothesis 1 improves model fit, and therefore demonstrates the role of context in defining communicative language ability.
The second research question asks whether communicative language ability, as measured by the TOEFL iBT® test, has equivalent representation across groups of test-takers. Specifically, it asks whether the latent structure of test performance is invariant across two groups of test-takers, with one group having been exposed to an English-speaking environment and the other without such experience. There is a handful of studies that have examined the nature of FL proficiency in relation to TTCs by using multi-group invariance analysis. The structure of FL proficiency was found to be equivalent across test-taker groups in some studies (Stricker et al., 2005; Stricker & Rock, 2008; Shin, 2005; Wang, 2006), but different in others (Bae & Bachman, 1998; Ginther & Stevens, 1998; Kunnan, 1992; Morgan & Mazzeo, 1988; Römhild, 2008; Sang et al., 1986). Among these studies, only Morgan and Mazzeo (1988) used study-abroad experience as a TTC to define and compare groups, comparing the group of test-takers who had spent at least one month in the target language (French) environment with those who had not.
Due to the lack of empirical studies examining study-abroad experience via a latent trait approach, insights were drawn from studies on language contact at home. Although the experience of language contact at home is not equivalent to the one of studying abroad in the target language society, they share similarities to a certain extent. In the same study conducted by Morgan and Mazzeo (1988), no equivalence was found in the structure of language ability between the heritage learner group and the standard group. Bae and Bachman (1998) also found that the heritage Korean learner group and the non-heritage Korean learner group differed in terms of the underlying structure of their language ability. Based on these observations, the following hypothesis is formulated: Hypothesis 4 states that the two groups of test-takers, one with exposure to an English-speaking environment and the other without such experience, differ in the underlying configuration of their communicative language ability.
The third research question investigates the impact of length of time abroad and length of formal learning on communicative language ability, as measured by the TOEFL iBT® test. Specifically, this investigation addresses the following two questions: (1) whether the length of formal learning differentiates test-takers who do not have study-abroad experience; and (2) whether the length of time abroad and formal learning jointly differentiate test-takers who have study-abroad experience.
Only one study, conducted by Kunnan (1994, 1995), investigated the impact of study-abroad experience on test performance via a structural equation modeling (SEM) approach. His findings suggested that both home country formal instruction and exposure to the target language environment had influences on aspects of language ability, and the influences could be either positive or negative. Drawing on insights from study-abroad studies, Davidson argued that the length of study abroad was positively correlated with language gains (as cited in Dewey, 2004, p. 322), whereas Freed (1995) claimed that formal language instruction was an important factor in language development. Related research showed that years of formal language instruction had an impact on Spanish proficiency gain. Magnan and Back (2007) claimed that coursework had a positive influence on linguistic gain in French. The literature indicates that both formal instruction and study-abroad experience have influences on language development. The two hypotheses related to the third research question are formulated as follows: Hypothesis 5 states that for test-takers without study-abroad experience, the development of their language ability is associated with the length of formal learning. Hypothesis 6 states that for test-takers with study-abroad experience, the development of their language ability is associated with the length of formal learning and the length of study-abroad.
A request for access to the TOEFL iBT public dataset was submitted to ETS, and was approved by ETS. The dataset contained two test forms (Form A and Form B) and test performance of 1000 test-takers on each form. Form A and its associated test performance from 1000 test-takers (Sample A) were used in the analysis in this study. A description of this sample follows.
One thousand test-takers were randomly drawn from one TOEFL iBT administration during fall 2006. Each test-taker was linked to a unique sample identification number. Background information was recorded for each test-taker at the time of the test administration. Test-takers who took the test in the United States or Canada were identified as domestic, whereas those who took the test in all other countries were identified as overseas. There were 418 domestic test-takers and 582 overseas test-takers. The recorded background information included age, gender, native country, and native language. Test-takers were also asked to respond to questions regarding the reason for taking the test, the type of institution in which they were interested, the amount of financial support they expected, the amount of time they spent studying English, the amount of time they spent in content classes taught in English, and whether (and for how long) they had lived in an English-speaking country.
To study the impact of target language exposure on English proficiency as manifested in TOEFL iBT® test performance, the first step was to identify who had had the exposure and who had not upon taking the test. Although test-taking location was recorded for every test-taker, it could not be used as a reliable indicator for identifying the group membership. Among the 582 people who took the test overseas, 106 of them indicated that they had lived in an English-speaking country. It was very likely that these test-takers, after having been exposed to English, had relocated and therefore took the test at an overseas test center. Another indicator of the group membership was test-takers' responses to the question 'have you ever lived in a country in which English is the main language spoken'. This indicator was used to identify the two groups. It also contained information on the length of English language exposure.
The third research question concerns test-takers' study-abroad experience and learning experience, and their joint impact on test performance. Test-takers' responses on two more background questions were used for this investigation. One question concerned the amount of time they spent studying English, and the other was about the amount of time they spent in content classes taught in English.
A total of 399 test-takers responded to all three of these background questions. Among these 399 test-takers, 29 said that they had never lived in an English-speaking country, although the record showed that they took the test either in the United States or Canada. Bearing in mind that these 29 test-takers were physically located in an English-speaking country at the time they took the test, their responses to this question contradicted documented fact. One possible cause for this inconsistency could be that these test-takers were not able to fully understand the question and therefore provided inaccurate information (X. Xi, personal communication, November 17, 2010). Another explanation could be that they took the test immediately on arriving in the US or Canada (J. E. Liskin-Gasparro, personal communication, January 17, 2011). Among the 399 test-takers who answered all three background questions, this study identified 370 test-takers who provided consistent information; these 370 constituted the study sample.
The location of test-taking was recorded for each test-taker. This information told us where a test-taker was physically located when taking the test. Based on test-taking location, inferences could be made about what kind of linguistic environment a test-taker was immersed in at the time of testing. One hundred fifty-seven subjects took the test at test centers located either in the United States or Canada. The remaining 213 subjects took the test at overseas test centers.
All subjects provided age information. The average age of these 370 test-takers was 24 at the time of testing. The youngest subject was 14, and the oldest was 51. The majority of the subjects (about 85%) were between the ages of 15 and 30.
Of the 370 subjects, 321 reported their gender, among whom 54% were male and 46% were female.
Test-takers were asked to use a list of country and region codes to indicate their native countries (where they were born), and a list of native language codes to indicate their native languages. All but one of the 370 subjects responded to the country question. The 369 respondents were from a total of 56 countries or regions. More than half of the subjects were from the seven largest groups: Korea, Japan, India, Taiwan, France, Thailand, and China. With regard to native languages, a total of 38 different native languages were represented in this sample. The five most frequently spoken native languages, in order of the number of speakers, were Korean, Japanese, Chinese, Spanish, and Arabic. Native speakers of these five languages made up about 59% of the total sample. Four subjects did not report their native languages.
Test-takers were asked to respond to the question 'what is your main reason for taking TOEFL.' They were provided with the following answer choices: (1) to enter a college or university as an undergraduate student, (2) to enter a college or university as a graduate student, (3) to enter a school other than a college or university, (4) to become licensed or certified in a profession, (5) to demonstrate my proficiency in English to the company for which I work or expect to work, and (6) other than above. Out of the 370 subjects, 366 responded. About 88% of the respondents indicated that they took the test in order to enter a college or university either as an undergraduate or a graduate student.
When asked the question 'what types of institution are you interested in attending,' 367 subjects responded. The provided answer choices were: (1) four-year college or university, (2) two-year college, (3) graduate school, (4) ESL institute, and (5) don't know. Subjects were allowed to choose more than one type of institution. Among the 367 subjects who responded, none of them chose more than one type of institution. About 88% of them indicated that they were interested in attending either a four-year college or university or a graduate school.
Of the 370 subjects, 356 answered the question 'how much do you and your family expect to contribute annually toward your study in the United States or Canada.' They could choose from the following answer choices: (1) less than $5000, (2) $5000 to $10,000, (3) $10,001 to $15,000, (4) $15,001 to $25,000, (5) more than $25,000, or (6) don't know. More than a third of the respondents indicated that they did not know the answer. About one fifth of them expected to contribute more than $25,000 annually.
All 370 subjects answered the question 'how much time have you spent studying English (in a secondary or post-secondary school).' The answer choices were: (1) none, (2) less than 1 year, (3) 1 year or more, but less than 2 years, (4) 2 years or more, but less than 5 years, (5) 5 years or more, but less than 10 years, and (6) 10 years or more. The majority of them (about 64%) reported that they had studied English for at least 5 years by the time they took the test. A third of the subjects had studied English for 10 years or more.
All 370 subjects responded to the question 'how much time have you attended a school where subjects other than English (for example, mathematics or chemistry) were taught in English.' The answer choices were: (1) none, (2) less than 1 year, (3) 1 year or more but less than 2 years, (4) 2 years or more but less than 5 years, and (5) 5 years or more. About a third of them indicated that they had never had such experience. Close to 60% of the subjects indicated that they had at least one year of such experience.
All 370 subjects responded to the question ‘have you ever lived in a country in
which English is the main language spoken.’ To answer this question, they chose from
the following possible answers: (1) no, (2) yes, for less than 6 months, (3) yes, for 6
months to 1 year, and (4) yes, for more than 1 year. About two-thirds of the subjects
indicated that they had lived in an English-speaking country upon taking the test.
The sample of 370 subjects used in this study was just a fraction of the total
random sample of 1000 test-takers generated from one TOEFL iBT® test administration.
Since the study sample was not randomly generated, two steps were taken to ensure that
the study sample was comparable to the random sample of 1000 test-takers.
First, the study sample (N=370) was compared to the total random sample (N=1000) on all available background variables collected for the test-takers (see Table 1). Second, one-sample t-tests for the section and total scores were conducted to detect any significant differences in test performance between the two samples.
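As an illustration of this second step, a one-sample t-test compares each study-sample mean against the total-sample mean treated as the population value. The sketch below assumes hypothetical column names (listening, reading, speaking, writing, total) and placeholder file names; the actual variable layout of the dataset may differ.

import pandas as pd
from scipy import stats

# Hypothetical data frames: study sample (N=370) and total sample (N=1000).
study = pd.read_csv("sample_a_study.csv")   # placeholder file name
total = pd.read_csv("sample_a_total.csv")   # placeholder file name

for score in ["listening", "reading", "speaking", "writing", "total"]:
    # Test the study-sample mean against the total-sample mean,
    # which plays the role of the hypothesized population mean.
    t, p = stats.ttest_1samp(study[score], popmean=total[score].mean())
    print(f"{score}: t = {t:.2f}, p = {p:.4f}")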
The percentages in Table 1 show that both samples had the same distribution on
the location variable. The age and gender distributions in the two samples resembled each
other well.
With regard to country of origin and native language, more than half of the test-takers came from the same seven countries in both samples. However, the percentage of test-takers from Mainland China in the study sample was disproportionately low, compared to the one in the total sample. The five most frequently spoken native languages among test-takers were also the same across the two samples, although the order of the two lists differed. It is noteworthy that a disproportionately high percentage of test-takers from Mainland China were not included in the study sample because these test-takers did not answer all three background questions regarding their learning and exposure experiences. Test-takers from India formed a sizable group in both samples. The native languages spoken by these test-takers included a variety of languages native to people living in the various parts of the country. Due to this diverse linguistic situation, none of these languages had a large native language group in either sample.
Although English has a special administrative role in Indian society, very few test-takers considered English as their native language. In the total sample of 1000 test-takers, among the 120 Indian test-takers, only five reported their first language as English. In the study sample of 370 test-takers, among the 45 Indian test-takers, none of them indicated their first language as English. This observation allowed us to treat the Indian test-takers as learners of English as a FL.
English also has a special administrative role in the region of Hong Kong. All 11 test-takers from Hong Kong in the total sample identified their native language as Chinese. This observation allowed us to treat the test-takers from Hong Kong as learners of English as a FL as well.
Regarding variables that had a large number of missing values in the total sample, such as reason for taking the test, financial support, time spent studying English, time spent in content classes taught in English, and the time spent living in an English-speaking country, the distribution patterns of the obtained responses were similar across the two samples. Percentages based on the test-takers who actually responded to these questions were also reported. For example, the table shows that among the total 1000 test-takers, 464 test-takers (N=464) actually responded to the question asking for their reason for taking the test, and 36% of them indicated that they took the test to enter a college or university. These percentages were very close to the ones based on the study sample of 370 test-takers.
Furthermore, the four section scores and the total score were compared across the two samples. The descriptive statistics shown in Table 2 indicated that the section and total mean scores were all higher in the study sample than the ones in the total sample.
A series of one-sample t-tests was conducted to compare the means of the study sample to the ones of the total sample. The means of the study sample were tested against the total sample means to evaluate the statistical significance of the differences. The results, summarized in Table 3, showed that only the listening score was found to be significantly higher in the study sample than the one in the total sample at the p level of 0.01. At the p level of 0.05, the listening score and the total score were found to be significantly higher in the study sample than the ones in the total sample. These results suggested that the study sample did not deviate from the total sample significantly with respect to most of the background variables and test scores. However, some discrepancies were found regarding the percentages of test-takers from Mainland China and Chinese native speakers. The reason for these discrepancies remains unknown. This should be kept in mind when generalizing the findings from this study to the total sample of test-takers.
Measures
The test used in this study was an operational form of the TOEFL iBT test administered during the fall of 2006. The test had four sections: listening, reading, speaking, and writing.
The listening section had six tasks. Each listening task had a prompt followed by 5 or 6 questions. There were 34 items in total in the listening section. Each item was scored one point for a correct answer, and zero points for a wrong answer. Items that were not reached or were omitted were marked as N or M, respectively. The total possible raw score points for the listening section was 34.
The reading section had three tasks. Each reading task had a prompt followed by a set of questions. Most of the reading items, worth one point each, were dichotomously scored. Three items were polytomously scored, worth either two or three points. Items that were not reached or were omitted were marked as N or M, respectively. The total possible raw score points for the reading section therefore exceeded the number of items.
The speaking section contained six tasks. The first two tasks asked test-takers to speak in response to a prompt on a familiar topic. These two tasks were considered independent because the required responses were not dependent on any information provided through other channels during the test. The other four were integrated speaking tasks. These tasks required test-takers to provide oral responses based on the information they received through listening or reading or both channels. Each task was rated on a 0–4 holistic scale at one-point intervals. The total possible raw score points for the speaking section was 24.
The writing section consisted of two tasks. The first task was an integrated task that required test-takers to provide written responses based on the information they received through listening and reading. The second one, an independent task, asked test-takers to write in response to a written prompt. Each task was rated on a 1–5 holistic scale at half-point intervals (up to the first decimal place). The total possible raw score points for the writing section was 10.
In summary, the whole test had 17 language tasks, organized into four sections by modality. There were six listening tasks, three reading tasks, six speaking tasks, and two writing tasks. To help understand the dynamics between language skill and the context of language use manifested within each task, the situations of these tasks are described as follows.
As noted earlier, a language use situation in the COE model is characterized by five variables: participants, content, setting, purpose, and register (Jamieson et al., 2000). The content refers to the topic of a task. The setting refers to the location of a language act.
The first listening task (L1) was situated in a conversation between a male student
and a female biology professor at the professor’s office. The participants talked mainly
about how to prepare for an upcoming test. The nature of the interaction was consultative
with frequent turns. To complete this task, test-takers were asked to respond to five
multiple-choice questions.
The second listening task (L2) presented part of a lecture in an art history class in
a formal classroom setting. The male professor was the only participant. The language
used by the professor was formal. There was no interaction between the professor and the
audience during the task. To complete this task, test-takers were asked to respond to six
multiple-choice questions.
The third listening task (L3) presented part of a lecture in a meteorology class in a formal classroom setting. Three participants were involved: one male professor, one male student, and one female student. The language used was formal with periods of short interaction between the professor and the students. To complete this task, test-takers were asked to respond to six multiple-choice questions.
The fourth listening task (L4) was situated in a conversation between a female student and a male employee at the university housing office. The topic of their conversation was
housing opportunities on and off campus. The nature of the interaction was consultative
with relatively short turns. To complete this task, test-takers were asked to respond to five
multiple-choice questions.
The fifth listening task (L5) presented part of a lecture in an education class in a
formal classroom setting. The participants were a female professor and a male student.
The language used was formal with periods of short interaction between the professor
and the student. To complete this task, test-takers were asked to respond to six multiple-
choice questions.
The last listening task (L6) presented part of a lecture in an environmental science class in a formal classroom setting. The male professor was the only participant. The language used was formal with no interaction between the professor and the audience during the entire listening time. To complete this task, test-takers were asked to respond to six multiple-choice questions.
In the reading section, all three reading tasks were based on academic content. In the first task (R1), test-takers were asked to respond to 14 multiple-choice questions of various kinds based on a reading passage on the topic of psychology. The second task (R2) asked test-takers to respond to questions based on a reading passage about archeology. The last task (R3) required the test-takers to answer questions based on a reading passage about biology. The language used in the readings was formal in register. The settings of these language tasks remained unknown since such information was not provided in the task specifications.
Lack of context development was also found for the first speaking task (S1) and the second speaking task (S2). Neither task specified the situation of language use. The topics for both tasks were non-academic. Both tasks required test-takers to provide an oral response to a prompt.
The third speaking task (S3) had a reading and a listening component. The topic was the university's plan to renovate the library. The reading component required test-takers to read a short article in the student newspaper about the change. The listening component presented a conversation between two students in a non-academic setting discussing their opinions about this renovation plan. The two participants interacted with each other frequently with relatively short turns. The test-takers were required to give an oral response to a prompt based on the content of the reading and listening materials.
The fourth speaking task (S4) also had a reading and a listening component. The reading component presented a short passage on an academic topic. The listening component presented part of a lecture on the same topic delivered by a male professor in a formal classroom setting. The language used in this task was formal. There was no interaction between the professor and his audience during the listening time. Test-takers were required to give an oral response to a prompt based on the content of the reading and listening materials.
The fifth speaking task (S5) contained a listening component. The listening part was situated in a conversation between a male and a female professor in an office setting. The focus of their dialogue was related to the class requirements for a student. The register of the language was consultative in nature with frequent turns between the two participants. Test-takers were required to give an oral response to a prompt based on the content of the conversation.
The last speaking task (S6) had a listening component. The listening part presented part of a lecture delivered by a professor in a formal classroom setting. The language used was formal with no interaction between the professor and the audience. Test-takers were required to give an oral response to a prompt based on the content of the lecture.
The first writing task (W1) was integrated with reading and listening. Test-takers were first asked to read a passage about an academic topic and then to listen to part of a lecture on the same topic delivered by a male professor in a classroom setting. There was no interaction between the professor and his audience during the entire listening time. The test-takers were required to give a written response to a prompt based on the content of the reading and listening materials.
The second writing task (W2) asked test-takers to write on a non-academic topic. The context of language use in this task remained unknown since this information was not provided in the task specification. The test-takers were required to provide a written response to a prompt.
Based on the descriptions of the tasks, two broad categories emerged with regard
to task content. The first type was academically oriented. The content of these tasks was
developed mainly based on scholarly or textbook articles in the realm of natural and
social sciences. The second type involved topics that were related to courses (e.g., exam
preparation) or to life on campus (e.g., student housing). The development of these tasks
did not rely on information in a particular academic area. In other words, having previous knowledge of a particular academic topic was not likely to interact with test-takers' responses to this type of task. By this categorization, all 17 tasks were sorted into either academic or non-academic. There were ten tasks with academic content and seven with non-academic content.
The tasks were also sorted by the location where a language act occurred. Unfortunately, the information on setting was not always provided. Lack of context development was observed for all reading tasks as well as some speaking and writing tasks. Tasks for which the setting of language use was developed could be divided into two groups: instructional and non-instructional. The first group of tasks involved language acts that took place inside classrooms. In this type of setting, interactions among the participants (if any) were usually sporadic, and the language used was academically oriented and formal. The second type of tasks took place outside classrooms (e.g., a professor's office, the library), where interactions were usually more frequent and the language used tended to be less formal. Table 4 summarizes the 17 tasks in terms of skill, content, and setting.
Analysis Procedures
The dataset used in this study included 370 test-takers' scores on 17 skill-based language tasks. Listening and reading items that were not reached or were omitted were assigned a score of zero. There was no missing score value for the speaking and writing tasks.
Descriptive analyses were conducted using Predictive Analytics SoftWare® Statistics 18 (SPSS Inc., 2009). Analyses involving latent variables were conducted in Mplus.
Level of Measure
The choice of the level of measure in a factor-analytic study is driven by practical considerations and theoretical needs. This study used task scores, also called item parcel scores, as the level of measure. For the listening and reading sections, a task score was the total score summed across a set of items based on a common prompt. Six listening task scores and three reading task scores were therefore obtained. A task score in the writing and speaking sections was simply the score assigned for a task. Six speaking task scores and two writing task scores were therefore obtained. Each task score was used in the analysis as an observed variable. There were 17 observed variables in the study in total. Variable names were the same as the task names listed in Table 4.
There are several reasons for choosing task scores as the level of measure. First, one of the research questions is to examine the relationship between language skill and the context of language use in the underlying structure of the test performance. Two key elements used for characterizing a language use situation were content and setting, both of which could be defined at the task level. Individual items within a task all shared the same focus of content and setting. Secondly, using task scores instead of item scores unified the level of measurement: the summed scores based on the dichotomously scored listening and reading items could be treated as continuous, as could the ratings based on the polytomously scored speaking and writing tasks. Within the framework of structural equation modeling (SEM), Kunnan (1998b) recommended not mixing different levels of measurement in a single covariance or correlation matrix. The third reason for using this level of measure is to reduce the chance of having correlations among the observed variables that are extremely high. It is very likely for items based on a common prompt to have dependence upon one another. Multicollinearity among the observed variables can be one of the reasons why an estimation process fails (Kline, 2005). Using task scores instead of item scores, as suggested by Stricker and Rock (2008), would help to alleviate the problem caused by the dependence among items associated with a common prompt in this study. The last reason comes out of concern for the sample size needed for the planned multivariate analysis. The more variables used in an analysis, the larger the sample size needed. Using task scores would help to reduce the number of observed variables in this analysis, and therefore to reduce the sample size needed for the study.
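In code, the parceling step amounts to summing item scores within each prompt. The sketch below uses hypothetical item column names (e.g., l1_01 for item 1 of listening task 1) and a placeholder file name, since the actual layout of the public dataset is not reproduced here.

import pandas as pd

# Hypothetical item-level data: columns such as "l1_01" ... "l1_05" hold
# the item scores belonging to listening task 1, and so on.
items = pd.read_csv("sample_a_items.csv")  # placeholder file name

def parcel(df: pd.DataFrame, prefix: str) -> pd.Series:
    """Sum all item columns sharing a task prefix into one task score."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    return df[cols].sum(axis=1)

tasks = pd.DataFrame()
for t in [f"l{i}_" for i in range(1, 7)] + [f"r{i}_" for i in range(1, 4)]:
    tasks[t.rstrip("_")] = parcel(items, t)
# Speaking (S1-S6) and writing (W1, W2) ratings are already task-level
# scores and would be carried over unchanged.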
Distribution of Values
The choice of an estimation method for latent analysis depends on the distribution of observed values. The most commonly used estimation methods assume multivariate normality. Therefore, both univariate and multivariate normality were inspected. Univariate normality was checked by examining the skewness and kurtosis indices, and by examining the plots of score distributions. Multivariate normality was evaluated following the procedures suggested by Kline (2005). The distribution of the values was examined so that an appropriate estimation method could be chosen.
Most estimation procedures for multivariate analysis are based either on a covariance or a correlation matrix, both of which capture only linear relationships. If the relationship between two variables is not linear, this non-linearity will not be captured in the analysis. Linearity was examined using scatter plots of all possible pairs of the variables in this study.
Bivariate correlations were also screened to check whether any of them were extremely high. If one variable is highly correlated with another, then it means that one of the variables is redundant in terms of measuring the construct. Kline (2005) suggested screening the observed variables for such redundancy before model fitting.
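The univariate screening and the redundancy check are straightforward to express in code. The following sketch assumes a hypothetical data frame of the 17 task scores; the 0.9 cutoff for flagging extreme correlations echoes the criterion adopted later for factor correlations and is used here only for illustration.

import numpy as np
import pandas as pd
from scipy import stats

tasks = pd.read_csv("sample_a_tasks.csv")  # placeholder: 17 task-score columns

# Univariate normality screen: skewness and excess kurtosis per variable.
for col in tasks.columns:
    sk = stats.skew(tasks[col])
    ku = stats.kurtosis(tasks[col])  # excess kurtosis; 0 under normality
    print(f"{col}: skew = {sk:.2f}, kurtosis = {ku:.2f}")

# Redundancy screen: flag any extremely high bivariate correlation.
corr = tasks.corr().abs()
high = np.argwhere(np.triu(corr.values, k=1) > 0.9)
for i, j in high:
    print(f"possible redundancy: {tasks.columns[i]} ~ {tasks.columns[j]}")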
Estimation Method
The default estimator for latent analysis with continuous variables is maximum likelihood (ML), which produces conventional standard errors and a chi-square test statistic. Since ML estimation is sensitive to violations of multivariate normality, a corrected normal theory method should be used to avoid bias caused by non-normality in the dataset, as recommended by Kline (2005). By using a corrected method, the original data is analyzed using a normal theory method (ML in this case), but the estimates of standard errors are robust to non-normality and the test statistics are corrected. Mplus provides such a corrected normal theory estimation method (called MLM), which produces maximum likelihood parameter estimates with standard errors and a mean-adjusted chi-square test statistic that are robust to non-normality. The MLM chi-square is also referred to as the Satorra-Bentler scaled chi-square.
The adequacy and appropriateness of the models were evaluated based on two criteria: (1) the values of selected overall model fit indices, and (2) the significance and reasonableness of individual parameter estimates. The selection of model fit indices used in this study was based on Kline's (2005) suggestions. Below is a brief description of each index.
Model Chi-Square
The value of the chi-square (χ2) statistic reflects the distance in fit between the model-implied covariance structure and the observed data. A χ2 test evaluates the statistical significance of this distance. The lower the value is, the closer the two structures are, and therefore the better the model fits the data.
Normed Chi-Square
The χ2 statistic is sensitive to sample size. The value tends to be high when the sample size is large. This could lead to rejecting a model whose deviation from the observed data structure may not be significant (Bollen, 1989). Dividing the chi-square value by the degrees of freedom (df) can reduce the sensitivity of chi-square to sample size (Kline, 2005). The result is a lower value referred to as normed chi-square (χ2/df). A ratio of less than 3 is an indicator of good model fit, as recommended by Kline (1998). This criterion was adopted in the current study.
Root Mean Square Error of Approximation
Neither χ2 nor χ2/df has a built-in mechanism that corrects for model complexity. Generally speaking, if two models show equivalent fit to the same data, the simpler one is preferred over the more complex one based on the principle of parsimony. When dealing with the same data, the simpler a model is, the fewer parameters are estimated, and the higher the degrees of freedom are. The root mean square error of approximation (RMSEA) takes degrees of freedom into account and therefore favors simpler models. A value of zero indicates a perfect fit. The higher the value goes, the worse the fit gets. An RMSEA smaller than 0.05 can be interpreted as a sign of good model fit, while values between 0.05 and 0.08 indicate reasonable approximation of error (Browne & Cudeck, 1993). This criterion was adopted in the current study.
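The standard formula makes the parsimony correction explicit: the degrees of freedom enter the denominator, so simpler models (higher df) yield lower values, everything else being equal:

\[
\mathrm{RMSEA} = \sqrt{\frac{\max(\chi^2 - df,\ 0)}{df\,(N-1)}}.
\]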
Comparative Fit Index
The comparative fit index (CFI) compares the fit of the specified model to the fit of a baseline model which assumes zero covariances among the observed variables. Because it is usually unrealistic to assume that variables are uncorrelated, the fit of such a baseline model is typically very poor. The improvement in fit over the baseline model indicates how much better the specified model accounts for the data. A rule of thumb, suggested by Hu and Bentler (1999), is that a CFI value larger than 0.9 shows the specified model has a good fit. This criterion was adopted in the current study.
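In its usual form, the index rescales the noncentrality of the specified model (M) against that of the baseline model (B):

\[
\mathrm{CFI} = 1 - \frac{\max(\chi^2_M - df_M,\ 0)}{\max(\chi^2_B - df_B,\ 0)}.
\]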
Standardized Root Mean Square Residual
The standardized root mean square residual (SRMR) is an absolute fit index which is based on the mean absolute correlation residual. The size of a correlation residual indicates how an observed correlation matrix differs from the model-implied one. An SRMR value of zero shows that there is no difference between the two correlation matrices, indicating a perfect model fit. An SRMR value less than 0.1 is commonly considered a sign of acceptable fit (Kline, 2005). This criterion was adopted in the current study.
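Taken together, the cutoffs form a simple screening rule. The helper below is a sketch rather than the study's actual code: it computes the normed chi-square, RMSEA, and CFI from a model's chi-square output and checks every criterion (an SRMR value, which requires the residual matrix, is passed in directly).

import math

def evaluate_fit(chi2, df, n, chi2_base, df_base, srmr):
    """Apply the cutoffs adopted in this study to one fitted model."""
    normed = chi2 / df
    rmsea = math.sqrt(max(chi2 - df, 0) / (df * (n - 1)))
    cfi = 1 - max(chi2 - df, 0) / max(chi2_base - df_base, 1e-12)
    return {
        "normed_chi2": (normed, normed < 3),   # Kline (1998)
        "rmsea": (rmsea, rmsea < 0.08),        # Browne & Cudeck (1993)
        "cfi": (cfi, cfi > 0.9),               # Hu & Bentler (1999)
        "srmr": (srmr, srmr < 0.1),            # Kline (2005)
    }

# Example with made-up values:
for index, (value, ok) in evaluate_fit(
        chi2=250.0, df=115, n=370,
        chi2_base=4000.0, df_base=136, srmr=0.045).items():
    print(f"{index}: {value:.3f} ({'pass' if ok else 'fail'})")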
In addition to the overall fit indices, individual parameter estimates were also inspected. The sign of an estimate should be checked to ensure that the meaning of the estimate is theoretically sound. The value of an estimate divided by its standard error provides a test statistic that can be used to evaluate the significance of the estimate. Multicollinearity among latent factors can also be detected by examining their estimated correlations. An extremely high correlation estimate between two factors indicates a linear dependency among the factors. This means that the factors are empirically indistinguishable, which makes a model implausible. Previous researchers (Sawaki et al., 2008; Stricker et al., 2005; Stricker & Rock, 2008) used a value of 0.9 to screen out extremely high correlations among factors. This criterion was adopted in the current study.
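The significance check mentioned here is the usual Wald-type ratio, referred to the standard normal distribution:

\[
z = \frac{\hat{\theta}}{SE(\hat{\theta})}, \qquad |z| > 1.96 \ \Rightarrow\ \text{significant at } \alpha = .05.
\]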
Testing the first hypothesis examines whether the structure of the ability measured by the test conforms to the test design, the score reporting scheme, and the TOEFL® framework, and whether a previously established factor model based on similar test data could also be compatible with the current dataset. Hypothesis 1 states that the communicative language ability measured by the TOEFL iBT® test can be best explained by a higher-order model, and can also be explained adequately by a correlated four-factor model and a correlated two-factor model. A higher-order model, a correlated four-factor model, and a correlated two-factor model were specified a priori and tested for fit as competing models. All three models were shown to be compatible with previous TOEFL test data. Integrated speaking tasks whose completion required language processing in multiple modalities were found to load on the target modality (speaking), whereas integrated writing tasks were found to load on the designated writing factor (Sawaki et al., 2008; Sawaki et al., 2009). Therefore the integrated speaking and writing tasks were specified to load on their target modality in all three models. As the result of testing the first hypothesis, the model that best represented the latent structure of the dataset used in this study was established as the baseline model for the subsequent analyses.
Figure 1 illustrates the relationships among the observed variables, latent factors, and residual/unique variances in the higher-order model. The observed variables are represented by the rectangular boxes. Latent variables, including the four skill factors and the residuals, are indicated by the ellipses. The six listening variables (L1–L6) loaded on a common factor which could be referred to as the listening factor (L). In other words, the six listening tasks were the indicators of the presumed listening factor. The three reading tasks (R1–R3) loaded on a common factor which could be interpreted as the reading factor (R). A presumed speaking factor (S) was responsible for the relationships among the six speaking tasks (S1–S6). The two writing variables (W1 and W2) shared a common underlying factor, possibly a writing factor (W). These were the four first-order factors corresponding to the four modalities. The relationships among the first-order factors were accounted for by a higher-order factor, a presumed general language ability factor (G). In other words, the first-order factors were constrained to interact with one another only through the higher-order factor. The higher-order factor represented a common underlying dimension across the four first-order factors. The residual variances (E1–E17), the part of the variance of an indicator that could not be explained by its respective factor in the model, were uncorrelated with one another. Also referred to as measurement errors, the residual variances reflected how reliable the indicators were in measuring their respective factors.
The constituents and their relationships in the correlated four-factor model are illustrated in Figure 2. In the absence of a higher-order factor, the four factors, each corresponding to its respective modality, were modeled to correlate with one another.
The correlated two-factor model (Figure 3) was nested within the correlated four-factor model by constraining the correlations among the listening, reading, and writing factors to be one. In this model, variables from the listening, reading, and writing sections all loaded on a common factor, a presumed non-speaking factor (L/R/W). The six speaking variables loaded on the second factor, probably a speaking factor (S). The two factors were allowed to correlate.
The baseline model, confirmed in the step above, provided a platform for testing the role of the context of language use in defining the ability measured by the TOEFL iBT® test. Next, situational factors were added to the baseline model to evaluate whether they improved model fit.
Language tasks were categorized with regard to the context of language use. Task grouping was based on two key elements: content and setting. On the dimension of content, all tasks were categorized as either academic or non-academic. On the dimension of setting, when the setting of a task was provided, the task was labeled as either instructional or non-instructional. Hypothesis 2 states that adding a content dimension to the baseline model improves model fit, and therefore demonstrates the role of context in defining communicative language ability. To test this hypothesis, a content dimension was added to the baseline model from the previous hypothesis testing, and this two-dimensional model was evaluated for fit. In this model, each task loaded on two factors, a skill-based factor and a content-based factor. Ten tasks loaded on a common content factor associated with academic material, whereas the other seven loaded on a non-academic content factor.
Imposing the second dimension was expected to help explain the relationships among the observed variables together with the common skill-based factors. In the baseline model, the residual variances were estimated under the assumption that they were unique to their respective variables, and were not associated with one another in a systematic way. However, whether these residual variances are truly unique becomes a question of doubt if the context of language use is considered. Successful model testing would support the claim that performance on language tasks could be accounted for by the situational factors as well as the skill-based factors, and would therefore confirm the role of context in defining communicative language ability.
Hypothesis 3 states that adding a setting dimension to the baseline model improves model fit, and therefore demonstrates the role of context in defining communicative language ability. To test this hypothesis, a setting dimension was added to the baseline model, and this two-dimensional model was tested for fit. In this model, each task loaded on two factors, a skill-based factor and a setting-based factor. Seven tasks loaded on a common instructional setting factor, and four loaded on a common non-instructional setting factor. The remaining six tasks without setting development loaded on a third setting factor. Successful model testing would support the claim that performance on language tasks could be accounted for by the situational factors as well as the skill-based factors, and therefore confirmed the role of context in defining communicative language ability.
Based on the results of the analysis above, a model that best represented the test
construct was chosen as the final model for the entire sample to be used in the following
multi-group analysis.
The second research question asks whether communicative language ability, as measured by the TOEFL iBT® test, has equivalent representation across two groups of test-takers, with one group having been exposed to an English-speaking environment and the other without such experience. In other words, the results would inform us whether group membership moderated the relations among the variables in the factor model. In all steps of this analysis, Hypothesis 4 was at stake: the two groups, one with the exposure experience and the other without such experience, were hypothesized to differ in the underlying configuration of their communicative language ability. One group of 124 test-takers (Group I) had never lived in an English language environment upon taking the test. The other group of 246 test-takers (Group II) had experience of living in the target language community upon test-taking. The multi-group invariance analysis was conducted in a stepwise fashion.
The measurement component of the model was prioritized over the structural component. The measurement component consisted of the number of the factors, the relationships between the factors and their respective indicators, and the residual variances. The measurement part defined the meanings of the factors by specifying how they were measured and what their indicators were. The structural component included the variances and covariances of the factors. It described how the factors related to one another.
It would only make sense to ensure the equality of factor relationships after it has
been established that the factors have the same meanings across groups. This is why the
measurement component should be tested for equality first. If successful, then testing the structural component would follow. A mean structure was imposed in all steps of the analysis to examine group mean differences on latent factors. Factor means are part of the structural component of the model. Therefore the equality of factor means was tested when the structural component was
inspected.
None of the studies reviewed incorporated a mean structure. Bae and Bachman
(1998) called for using mean structures as a suggestion for future researchers. As pointed
out by these authors, a latent mean structure approach allows us to account for measurement error when comparing groups on their mean performance.
Measurement Invariance
The measurement invariance analysis proceeded in a hierarchically ordered manner. The first step was to test the equivalence of the overall factor structure
across the groups. The same factor structure was imposed on both groups
simultaneously. Parameter estimates in one group were allowed to vary from the ones in
the other group. The result of this step answered the question whether the factors had the same configuration across the groups.
The second step was to test the equivalence of factor loadings across the groups.
Equality constraints were imposed on all factor loadings across the groups. Factor loading
estimates in one group were not allowed to be different from the ones in the other group.
Residual variances were allowed to differ. This was a more restrictive model compared to
the model tested in the first step. Model fit was evaluated. Since the two models were
nested, a chi-square difference test was conducted to evaluate which model should be kept.
A chi-square difference test without significant result would indicate that the fit of the
more restrictive model did not deteriorate badly enough to justify adopting the more
liberal model. In this case, the more restrictive model would be adopted. The result of this
step would tell us whether the indicators measured the factors in a comparable way across
the groups.
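Because the Satorra-Bentler statistic is a scaled statistic, the plain difference between two scaled chi-squares is not itself chi-square distributed; a scaled difference must be computed instead. The following is a minimal sketch (in Python, illustrative only, with hypothetical input values, since scaling correction factors are not reported in this text) of the standard scaled-difference procedure used with the MLM estimator:

    from scipy.stats import chi2

    def sb_scaled_chisq_diff(t0, df0, c0, t1, df1, c1):
        """Satorra-Bentler scaled chi-square difference test for nested models.

        t0, df0, c0: scaled chi-square, degrees of freedom, and scaling
                     correction factor of the more restrictive (nested) model;
        t1, df1, c1: the same quantities for the less restrictive model.
        """
        dd = df0 - df1
        cd = (df0 * c0 - df1 * c1) / dd   # scaling correction for the difference
        trd = (t0 * c0 - t1 * c1) / cd    # scaled difference statistic
        return trd, dd, chi2.sf(trd, dd)  # statistic, df, p value

    # Hypothetical values for illustration only:
    trd, dd, p = sb_scaled_chisq_diff(t0=210.0, df0=133, c0=1.10,
                                      t1=195.0, df1=118, c1=1.12)
    print(f"scaled difference = {trd:.3f} on {dd} df, p = {p:.3f}")

A non-significant p value from this computation would support retaining the more restrictive model, as described above.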
The third step was to test the equivalence of residual variances in both groups.
Equality constraints were imposed on the residual variances along with the factor loadings
across the groups. This was a more restrictive model compared to the model tested in the
second step. Model fit was evaluated, and a chi-square difference test could be conducted to evaluate which model should be adopted. The result of this step would inform us whether the indicators in one group were as reliable as the ones in the other group at measuring their respective factors.
Structural Invariance
Factor means, variances, and covariances were the targets of this investigation.
The common practice is to test the invariance of factor means and covariances before testing factor variances, because groups are usually expected to differ in their variabilities on the common factors.
The first step was to test the invariance of factor means. The means of the factor
indicators (also called endogenous variables) were estimated as intercepts. The estimated
means of the indicators, the intercepts, were constrained to be equal across the groups for
model identification purposes. They were held equal so that the differences of factor
means could be estimated. The means of the latent factors (also called exogenous
variables) were fixed to be zero in one group and free to be estimated in the comparison
group. Model fit was evaluated. A chi-square difference test, between the current model and its preceding model, was conducted to evaluate which model should be adopted. The result would indicate the estimated relative standings of the groups on the latent factor means.
In the next step the equality of factor covariances was tested. These parameters
were held invariant for both groups. Model fit was evaluated, and a chi-square difference test was conducted to evaluate which model should be adopted. This would be followed
by a test of factor variance invariance if the factor covariances could be held equal.
The analyses above, however, could not tell whether the length of study-abroad experience had any effect on the development of the language ability. In
this section, structural equation models were built and tested to investigate the third
research question: what impact the lengths of study-abroad and classroom learning have on the development of communicative language ability, as measured by the TOEFL iBT®
test.
Three TTCs were introduced as independent variables. One was the length of time
spent studying English. The second one was the length of time spent in content classes
taught in English. The last one was the length of time spent living in an English-speaking
country. The first two characteristics concerned English language training in a formal setting. The last one concerned English language contact experience.
Not all three variables applied to both groups of test-takers. For Group I test-takers with no study-abroad experience, only the first two
variables were relevant. For Group II test-takers with study-abroad experience, all three
variables were relevant. Model testing was conducted separately for Group I and Group
II, because each group had a different set of relevant background variables.
Hypothesis 5 asserts that for test-takers who have no study-abroad experience (the
home-country group) the development of their language ability is associated with the
length of formal learning. To test this hypothesis, two independent variables – the time
spent studying English and the time spent in content classes taught in English – were
modeled to have direct associations with the language ability. This model was subjected to a test of fit, and the significance of the direct effects was also evaluated.
Hypothesis 6 states that for test-takers who have had study-abroad experience
(the study-abroad group) the development of their language ability is associated with the
length of formal learning and the length of study-abroad. Three independent variables
were modeled to have direct associations with the language ability. They were: the time
spent studying English, the time spent in content classes taught in English, and the length
of study-abroad experience. This model was subjected to a test of fit, and the significance of the direct effects was also evaluated.
CHAPTER FOUR
This chapter reports the results of testing the hypotheses put forward in the previous chapter. Results from establishing a model for the entire sample are reported first, followed by results from the multi-group invariance analysis across groups with different study-abroad and learning experiences. Last, the results of evaluating two unique structural equation models, one for each group, are presented.
Preliminary Analysis
Table 5 summarizes the descriptive statistics for the observed variables, including
possible score range, mean, standard deviation, kurtosis, skewness, and z scores for the kurtosis and skewness statistics. Variables L1 to L6 represent Listening Task One to Listening Task Six. Variables R1 to R3 represent Reading Task One to Reading Task Three. Variables S1 to S6 represent Speaking Task One to Speaking Task Six. Variables W1 and W2 refer to Writing Task One and Writing Task Two.
The z scores reported in the table can be used to test univariate normality. Z scores for the
kurtosis values were significant at p < .01 for variables L3, L4, L6, R3, and W1. Z scores
for the skewness values were significant at p < .01 for the following variables: L1, L4,
L5, L6, and R2. Since this was a relatively large sample (N=370), these z scores were
interpreted with reservation. Instead, the absolute values of the kurtosis and skewness
statistics as well as the shape of the distributions were used to evaluate normality.
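To make this screening concrete, the following sketch (Python, purely illustrative) reproduces the z-score check for a few variables, using the standard errors reported in Table 5 and the two-tailed critical value for p < .01:

    kurtosis_se, skewness_se = 0.253, 0.127   # standard errors from Table 5
    critical_z = 2.576                        # two-tailed critical value, p < .01

    variables = {  # variable: (kurtosis, skewness), values from Table 5
        "L1": (-0.264, -0.373),
        "L4": (4.245, -1.911),
        "S1": (-0.340, -0.028),
    }

    for name, (kurt, skew) in variables.items():
        z_kurt = kurt / kurtosis_se   # e.g., L1: -0.264 / 0.253 = -1.043
        z_skew = skew / skewness_se   # e.g., L1: -0.373 / 0.127 = -2.937
        flags = [label for label, z in (("kurtosis", z_kurt), ("skewness", z_skew))
                 if abs(z) > critical_z]
        print(name, round(z_kurt, 3), round(z_skew, 3), flags or "not significant")

The computed z values match the ones shown in Table 5, which is why, with a large sample, the absolute size of the statistics and the shape of the distributions were given more weight than significance alone.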
Table 5. Descriptive Statistics for the Observed Variables

Variable Range Mean Std. Dev. Kurtosis Z Kurtosis (SE=.253) Skewness Z Skewness (SE=.127)
L1 0-5 3.33 1.138 -.264 -1.043 -0.373 -2.937
L2 0-6 3.57 1.376 -.581 -2.296 -0.098 -.772
L3 0-6 2.97 1.560 -.734 -2.901 0.179 1.409
L4 0-5 4.44 0.888 4.245 16.779 -1.911 -15.047
L5 0-6 4.37 1.297 -.259 -1.024 -0.637 -5.016
L6 0-6 4.78 1.384 .976 3.858 -1.223 -9.630
R1 0-15 6.94 2.725 -.343 -1.356 0.285 2.244
R2 0-15 10.06 3.027 -.652 -2.577 -0.393 -3.094
R3 0-15 9.98 3.064 -.963 -3.806 -0.119 -.937
S1 0-4 2.51 0.759 -.340 -1.344 -0.028 -.220
S2 0-4 2.62 0.805 -.508 -2.008 -0.016 -.126
S3 0-4 2.50 0.755 .073 .289 -0.086 -.677
S4 0-4 2.39 0.827 -0.027 -.107 -0.036 -.283
S5 0-4 2.58 0.810 0.172 .680 -0.150 -1.181
S6 0-4 2.53 0.856 -0.115 -.455 -0.132 -1.039
W1 1-5 3.23 1.148 -0.690 -2.727 -0.289 -2.276
W2 1-5 3.46 0.817 -0.169 -.668 -0.020 -.157
The values of kurtosis and skewness are zero in a normal distribution. As Table 5 shows, except for variables L4 and L6, the values of skewness and kurtosis were all reasonably small, suggesting that univariate normality could be held for these variables. Variable L4 had a kurtosis value of 4.25 and a skewness value of
-1.91. Variable L6 had a kurtosis value of 0.98 and a skewness value of -1.22. Examining
the histograms of these two variables revealed that both distributions exhibited a ceiling
effect. Univariate normality could not be held in these two cases. Having two extremely
non-normal variables indicated that this set of variables could deviate from multivariate
normality. These facts were taken into consideration when choosing an appropriate
Table 6. Correlation Matrix of the Observed Variables

L1 L2 L3 L4 L5 L6 R1 R2 R3 S1 S2 S3 S4 S5 S6 W1 W2
L1 1
L2 .35 1
L3 .38 .50 1
L4 .33 .31 .28 1
L5 .41 .39 .40 .40 1
L6 .41 .42 .44 .49 .46 1
R1 .36 .43 .45 .31 .36 .44 1
R2 .36 .49 .48 .39 .43 .55 .54 1
R3 .37 .44 .49 .38 .41 .51 .56 .65 1
S1 .38 .42 .42 .37 .43 .45 .40 .32 .40 1
S2 .34 .36 .36 .39 .32 .38 .39 .33 .33 .58 1
S3 .39 .37 .38 .45 .38 .43 .35 .37 .40 .56 .57 1
S4 .34 .35 .40 .39 .37 .45 .36 .33 .42 .63 .55 .57 1
S5 .38 .37 .41 .45 .42 .46 .44 .38 .44 .56 .58 .57 .60 1
S6 .41 .39 .41 .48 .45 .49 .38 .39 .37 .62 .64 .61 .57 .64 1
W1 .50 .46 .51 .43 .50 .54 .53 .59 .58 .48 .46 .51 .43 .55 .52 1
W2 .45 .49 .48 .44 .48 .48 .50 .50 .54 .57 .60 .56 .52 .63 .56 .61 1
Next, the linearity and multicollinearity of the variables were scrutinized to ensure
that the variables were represented in the dataset appropriately. Linearity was examined
using scatter plots of all possible pairs of the variables. No violation of linearity was
found. Pairwise multicollinearity was checked by inspecting the correlation matrix of the
variables. As shown in Table 6, dependence among all pairs of variables was moderate, and no pairwise multicollinearity was found. As reported earlier, univariate non-normality was detected with two variables, which suggested that the distribution of the set of
variables could deviate from multivariate normality. It was then decided to implement a
corrected normal theory estimation method in the multivariate analysis to avoid bias
caused by non-normality in the dataset. The MLM estimator provided by Mplus was chosen. This estimator produces maximum likelihood parameter estimates with a corrected test statistic known as the Satorra-Bentler test statistic (χ2S-B) (Satorra & Bentler,
1994). This test statistic is mean-adjusted and robust to non-normality (Muthén &
Muthén, 2010).
Results from previous studies revealed that several factor models could adequately account for the underlying structure of TOEFL® test performance (Sawaki
et al., 2008; Stricker & Rock, 2008; Stricker et al., 2005). A higher-order model (Figure
1), a correlated four-factor model (Figure 2), and a correlated two-factor model (Figure 3)
had all been shown to be plausible factor solutions in previous research. To establish the baseline model, all three competing models were tested for fit with the current data through a series of
confirmatory factor analyses. Summarized in Table 7, the selected fit indices were used
to evaluate model fit. Model df refers to a model’s degrees of freedom. χ2S-B refers to the
Satorra-Bentler chi-square test statistic. χ2S-B /df refers to the normed Satorra-Bentler chi-
square test statistic. CFI refers to the comparative fit index. RMSEA refers to the root
mean square error of approximation. SRMR refers to the standardized root mean square
residual.
The degrees of freedom (df) indicate how parsimonious or saturated a model is. They equal the number of unique observed variances and covariances minus the number of free parameters estimated from the data. A free parameter is a parameter that is free to be estimated during model
estimation. On the other hand, a fixed parameter is one whose value is determined
without model estimation. With the same data structure, the more free parameters a
model is specified to estimate, the less parsimonious or more saturated the model is, and
the lower the degrees of freedom are. Model fit generally deteriorates when model
complexity decreases because there are fewer free parameters to estimate in a more
parsimonious model.
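As a concrete check of this relationship, the degrees of freedom reported below can be reproduced by counting observed moments and free parameters. A short sketch (Python; the parameter counts are inferred from the model descriptions and the scaling constraint, so this should be read as an illustration rather than the study's own computation):

    p = 17                        # observed variables: 6 L, 3 R, 6 S, 2 W
    moments = p * (p + 1) // 2    # unique variances and covariances = 153

    # Correlated two-factor model: one loading per factor fixed for scaling.
    free = (p - 2) + p + 2 + 1    # 15 loadings + 17 residuals + 2 variances + 1 covariance
    print(moments - free)         # df = 153 - 35 = 118

    # Correlated four-factor model: one loading per factor fixed for scaling.
    free = (p - 4) + p + 4 + 6    # 13 loadings + 17 residuals + 4 variances + 6 covariances
    print(moments - free)         # df = 153 - 40 = 113

Both counts reproduce the degrees of freedom reported for the two-factor (118) and four-factor (113) models.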
The correlated four-factor model with 113 degrees of freedom was the most
saturated model among the three. The correlated two-factor model was the most
parsimonious one with 118 degrees of freedom. All fit indices deteriorated when model
complexity decreased from the correlated four-factor model to the correlated two-factor
model.
First, the criteria pre-determined based on the relevant literature were used to
evaluate overall model fit. All three chi-square values (χ2S-B) were significant (p = 0.000),
which put model fit in doubt. However, as discussed, the value of the model chi-square
should be interpreted with caution because this test statistic is highly sensitive to sample
size. To reduce the sensitivity of chi-square to sample size, the chi-square values were
divided by the degrees of freedom. As the normed chi-square (χ2S-B /df) values showed,
the ratios were all well below 3, which indicated that all three models fit the data well.
The values of comparative fit indices (CFI) were all larger than 0.9. This meant
that the fit of all three models improved substantially when compared to their respective null (independence) models.
The values of root mean square error of approximation (RMSEA) for the two
more saturated models (higher-order model and correlated four-factor model) were below
0.05, and the value for the most parsimonious one (correlated two-factor model) was
between 0.05 and 0.08. This outcome could be interpreted as a sign of good fit for all three models.
The values of the standardized root mean square residual (SRMR) were all well below
0.1. This indicated that the correlation matrices did not differ significantly from the
model-implied ones. All three models were considered to have good fit.
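The evaluation criteria applied in this section can be summarized in a small helper function; the cutoffs are the ones named in the text (normed chi-square below 3, CFI above 0.9, RMSEA below 0.05 for close fit or 0.08 for reasonable fit, SRMR below 0.1). This is a sketch of the decision rules only, not a substitute for inspecting parameter estimates:

    def evaluate_fit(chisq_sb, df, cfi, rmsea, srmr):
        """Screen global model fit against the cutoffs used in this chapter."""
        return {
            "normed chi-square < 3": chisq_sb / df < 3,
            "CFI > 0.9": cfi > 0.9,
            "RMSEA <= 0.08": rmsea <= 0.08,   # < .05 close fit, .05-.08 reasonable
            "SRMR < 0.1": srmr < 0.1,
        }

    # Hypothetical values for illustration:
    print(evaluate_fit(chisq_sb=250.0, df=118, cfi=0.93, rmsea=0.055, srmr=0.045))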
The selected fit indices for all three models were satisfactory except for the model chi-square values. Therefore, on the global level all three models demonstrated acceptable fit. Individual parameter estimates were then examined for appropriateness and significance. The results of testing the higher-order model showed that the estimated
residual variance of the writing factor was negative, and the estimated correlation
between the higher-order factor and the writing factor was larger than one. These findings
signaled problems in model specification, and therefore made the model inadmissible.
Although this higher-order model was confirmed by previous researchers with similar
TOEFL® test data, this model was not compatible with the current dataset. One possible
reason could be that the sample used in this study was a much smaller one (N=370)
compared to the one (N=2070) used in Sawaki et al. (2008) and Stricker and Rock (2008).3
Regarding the correlated four-factor model, it was detected that the correlation
between the listening and the writing factor was estimated as high as 0.97, larger than the
0.9 acceptance level.4 This high level of correlation indicated a linear dependence
between the factors, meaning that the factors were not distinct enough to be considered as
two separate factors. It was then decided that this model should also be rejected based on the study sample.
An examination of the result from testing the two-factor model showed that all
parameter estimates were appropriate and significant. Taking all criteria into
consideration, this model provided the best explanation of the data, and therefore was retained.
The correlated two-factor model was established as the baseline model as the
outcome of the previous analysis. Built upon this baseline model, testing the role of
context in the internal structure of the test performance was pursued next.
One key element in defining context, content of a language task, was tested for its
ability in accounting for the relationships among the variables along with the skill-based
factors. Previous task analysis showed that ten language tasks were associated with
academic material, whereas the other seven were related to non-academic content. A
content factor dimension was imposed on the baseline model (Figure 4). Two content factors, academic and non-academic, were specified.
As illustrated in Figure 4, all language tasks were specified to load on two factors,
a skill-based factor and a content factor. Taking the first speaking task (S1) as an
example, this task loaded with the other speaking tasks on the speaking factor. This task
also loaded on a non-academic content factor since the task was not related to academic
content. Along this content dimension, ten tasks loaded on a common content factor
associated with academic material, while the other seven loaded on a non-academic
content factor.
This two-dimensional model was tested for fit. The result indicated that
convergence could not be reached. No overall model fit index was reported. Individual parameter estimates were reported without standard errors, so the significance of the estimates could not be evaluated. Adding a second dimension of content brought in severe problems in model specification. The content dimension failed to capture the relationships among the
variables in conjunction with the two skill-based factors based on the test performance of
the study sample (N=370).8 This model was inadmissible, and therefore was discarded.
Another key element in defining context, setting of a language act, was also tested
for its ability in accounting for the relationships among the variables along with the skill-
based factors. Previous task analysis showed that seven tasks were situated in an instructional setting and four in a non-instructional setting. The setting for the remaining six tasks could not be identified due to lack of context development. A setting factor dimension was imposed on the baseline model (Figure 5). Three setting factors were specified: instructional, non-instructional, and not available (N/A).9
As illustrated in Figure 5, all language tasks were specified to load on two factors,
a skill-based factor and a setting factor. Taking the third speaking task (S3) as an
example, this task loaded with the other speaking tasks on the speaking factor. This task also loaded on the instructional setting factor since it was situated in an instructional environment. Along this setting dimension, seven tasks loaded on a common
instructional setting factor, and four on a common non-instructional setting factor. The
remaining six tasks without context development all loaded on a third setting factor
(N/A).
This two-dimensional model was tested for fit. Once again, the result indicated
that convergence could not be reached. Adding a second dimension of setting brought in
severe problems in model specification. The setting dimension failed to capture the
relationships among the variables in conjunction with the two skill-based factors based on
the test performance of the study sample (N=370).10 This model was inadmissible, and therefore was discarded.
The correlated two-factor model was adopted as the final model for the entire
sample group. The first factor, loaded with tasks from the listening, reading, and writing sections, could be interpreted as a non-speaking factor. The second factor, loaded exclusively on the speaking tasks, could be interpreted as a speaking factor. The results of
model testing showed that this model demonstrated an adequate fit to the data. The final model, with unstandardized and standardized parameter estimates, is presented in Figure 6 and Figure 7 respectively. In both figures, a path pointing from a latent factor to an
observed variable, also called an indicator, represented the presumed effect of the factor
on that variable. Estimates of these effects were factor loadings. A path pointing from a measurement error term to an observed variable represented the presumed effect of unsystematic errors on the variable. A path linking the two latent factors did not indicate
directionality. It simply represented the unanalyzed association between the two factors.
The variables were rearranged for ease of visual display. The scales of the latent variables were assigned through the unit loading identification (ULI) constraint.
The factor loadings for the first indicator of each factor were fixed at one to assign a scale
to the factors. The measurement errors were assigned a scale through fixing their
estimated effects on the indicators to be unitary. For example, the unstandardized factor
loading of the second listening task (L2) on the non-speaking factor (L/R/W) was
estimated to be 1.312. The numbers printed next to the latent factors represented factor
variances. The factor covariance was indicated by the number printed next to the path
linking the two latent factors. Next to the measurement errors were residual variances of
the indicators. The variances of the non-speaking and the speaking factors were estimated
to be 0.430 and 0.335 respectively. The covariance between the two factors was
estimated to be 0.306.
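The unstandardized estimates just reported imply specific covariances among the indicators. A brief sketch (Python; the L2 loading, factor variances, and factor covariance are the values given above, while the 1.0 speaking-task loading stands in for an unreported estimate and is purely hypothetical):

    var_nonspeaking = 0.430   # variance of the non-speaking factor (L/R/W)
    var_speaking = 0.335      # variance of the speaking factor (S)
    cov_factors = 0.306       # covariance between the two factors

    loading_L1 = 1.0          # fixed at one under the ULI constraint
    loading_L2 = 1.312        # estimated loading of L2 on the non-speaking factor

    # Model-implied covariance between two indicators of the same factor:
    print(loading_L1 * loading_L2 * var_nonspeaking)   # cov(L1, L2) = 0.564

    # Model-implied covariance between indicators of different factors,
    # using the hypothetical speaking-task loading of 1.0:
    print(loading_L2 * 1.0 * cov_factors)              # 0.401

Comparing such model-implied values against the observed correlations in Table 6 is, in essence, what the fit indices summarize globally.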
The standardized estimates were computed when the latent factors and the
observed variables were standardized. The variances of all variables, including the
factors, the observed variables, and the residuals, were fixed at one. The estimated factor
correlation was reported next to the path linking the two factors. The correlation between
the two skill-based factors was estimated to be 0.807. The factor loadings were interpreted as standardized regression coefficients; for example, one standard deviation of change in the latent speaking factor predicted 0.764 standard deviation of change in the
first speaking variable (S1). The higher a factor loading was, the better the indicator was
at measuring the latent factor. The standardized factor loadings could also be interpreted
as estimated correlations between a latent factor and its indicators in the current model
because each indicator was specified to measure only one latent factor. For example, the
estimated correlation between the speaking factor and the first speaking variable (S1) was
0.764. This meant that 58.4% (0.764²) of the total variance of this indicator could be
accounted for by the speaking factor. The standardized residual variances represented the
percentage of variance of the indicators that could not be explained by the common
factors. In the case of the first speaking variable (S1), 41.6% of the variance could not be
accounted for by the speaking factor. The standardized residual path coefficient for the
direct effect of the measurement error on this speaking variable was 0.645 (the square root of 0.416), which meant that one standard deviation of change in the error term was associated with 0.645 standard deviation of change in this speaking variable.
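The arithmetic behind this decomposition follows directly from the standardized loading; a minimal sketch (Python; values as reported above):

    import math

    loading = 0.764                   # standardized loading of S1 on the speaking factor
    explained = loading ** 2          # 0.584 -> 58.4% of the variance of S1 explained
    residual = 1 - explained          # 0.416 -> 41.6% left unexplained
    error_path = math.sqrt(residual)  # 0.645, standardized error path coefficient

    print(round(explained, 3), round(residual, 3), round(error_path, 3))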
Assuming that the obtained final model was the correct model for the entire
group, the multi-group invariance analysis investigated whether the specified model
could hold equivalent across two groups of test-takers. One hundred and twenty-four test-
takers who had never been immersed in an English language environment were grouped
together (Group I). The other group (Group II) of 246 test-takers had lived in an English-
speaking country for various lengths of time. Table 8 summarizes the descriptive statistics for the two groups. The multi-group invariance analysis compared the two groups in a hierarchically ordered fashion. In all steps of the analysis, unstandardized estimates were examined. Standardized estimates should not be used for comparing groups, as groups are assumed to differ in their variabilities on the common factors.
Measurement Invariance
First, the invariance of the overall factor structure was inspected. The same factor structure was imposed on both groups
simultaneously but parameter estimates were allowed to differ across the groups.
The resulting unstandardized parameter estimates for both groups are shown in
Figure 8 and Figure 9 respectively. The same correlated two-factor model was applied on
both groups. Factor loadings, indicator residuals as well as factor means, variances, and
covariance had different estimates in each group. The numbers printed next to a factor
referred to the factor mean and factor variance. In Group I the means of the latent factors
were fixed at zero. Factor variances were estimated at 0.518 for the non-speaking factor
and 0.374 for the speaking factor. The estimated covariance of the factors was 0.353. In
Group II the means of the latent factors were free to be estimated. The mean of the first
factor was estimated to be 0.140 lower than the one in Group I. The mean of the second factor was estimated to be 0.017 lower than the one in Group I. Factor variance estimates were 0.386
and 0.319 in Group II. The estimated covariance of the factors was 0.283 in Group II.
Model fit indices are summarized in Table 9. Except for the model chi-square, all
model fit indices were satisfactory. All parameter estimates, as reported in the figures,
were appropriate and reasonable. It was then concluded that the same correlated two-
factor structure could be held across the groups. The result of this step ensured that the
performance in both groups could be accounted for by the same two factors, a speaking
factor which was loaded with tasks from the speaking section and a non-speaking factor
which was loaded with tasks from the sections of listening, reading, and writing. The analysis then proceeded to the next step.
Next, the equivalence of factor loadings was inspected. Factor loading estimates were constrained to be equal across the groups, so that the factor loading estimates were the same for both groups. Indicator residuals, factor means, factor
variances, and factor covariance were allowed to differ across the groups.
Model fit indices are summarized in Table 9. Except for the model chi-square, all
model fit indices were satisfactory. All parameter estimates, as reported in the figures, were appropriate and reasonable. The Satorra-Bentler chi-square difference (χ2S-B|Diff) between this model and the preceding model was not significant: 15.266 with 15 df (χ2S-B|Diff / df = 1.018). Compared to the model tested in the preceding step, the current model
had fewer free parameters to estimate because it imposed equal factor loadings across the
groups. Therefore the current model was more restrictive and simpler. The decrease in
the number of free parameters led to deterioration in model fit. The non-significant result
demonstrated that model fit did worsen, but not enough to justify choosing the more liberal model.
It was then concluded that the factor loadings could be held invariant across the
groups. The result of this step indicated that the factors were measured by their indicators
in a comparable way for both groups. The amount of the variance of an indicator that
could be accounted for by its respective factor was comparable across the groups. The analysis then proceeded to the next step.
The third step was to examine the equivalence of residual variances in both
groups. Residual variances along with the factor loadings were constrained to be equal across the groups.
As illustrated in Figure 12 and Figure 13, the factor loadings as well as the
residuals were fixed to be the same for both groups. Factor means, factor variances, and factor covariance were free to differ across the groups.
Model fit indices are summarized in Table 9. Except for the model chi-square, all
model fit indices were satisfactory. All parameter estimates, as reported in the figures, were appropriate and reasonable. The Satorra-Bentler chi-square difference (χ2S-B|Diff) between this model and the preceding model was not significant: 22.249 with 17 df (χ2S-B|Diff / df = 1.309). This result indicated that the model fit did not deteriorate enough to
justify choosing the preceding more saturated model over the current simpler one. It was
then concluded that the indicator residuals could be held invariant across the groups. The
outcome of this step showed that the indicators performed equally at measuring their
respective factors for both groups. The amount of the variance of an indicator that could not be explained by its factor was also comparable across the groups.
Analysis in the above three steps completed the test of multi-group measurement
invariance. The testing of equality on factor structure, factor loadings, and residuals all
succeeded. It was concluded that the measurement part of the model could be held
equivalent across the groups. The multi-group invariance analysis then proceeded with testing the structural component.
Structural Invariance
Next, the structural part of the model was scrutinized with equality control across
the groups.
The invariance of factor means was examined first. As shown in Figure 14 and
Figure 15, the measurement part of the model, including the factor loadings and indicator
residuals, was fixed to be the same across the groups. Factor means in both groups were fixed at zero and thereby held equal. Factor variances and covariance were free to differ.
The model fit indices are summarized in Table 10. Except for the model chi-
square, all model fit indices were satisfactory. All parameter estimates, as reported in the
figures, were appropriate and reasonable. The Satorra-Bentler chi-square difference (χ2S-B|Diff) between this model and the preceding model was not significant: 3.771 with 2 df
(χ2S-B|Diff / df = 1.886). This result indicated that the model fit did not deteriorate enough
to justify choosing the more saturated model over the current simpler one. It was then
concluded that the factor means could be held invariant across the groups. The outcome
of this step showed that the two groups were equivalent in terms of latent factor means.
In other words, there was not enough evidence to say that one group was better than the
other on either latent ability. The structural invariance analysis then proceeded to the next
step.
The next step was to test the equivalence of factor covariance. These parameters
were constrained to be equal across the groups. As indicated in Table 10, model
estimation did not succeed. This result failed to demonstrate the equivalence of the factor
covariance across the groups. The multi-group invariance analysis was then terminated.
Since no further step of testing factor variance invariance would be taken, it could then be
assumed that the groups differed in their variabilities on the common factors. The multi-
group model estimated in the preceding step became the final model for the groups.
Table 10. Fit Indices from the Multi-Group Structural Invariance Analysis

In summary, the factor means could be held invariant across the groups, but the factor covariance could not be held across the groups. The final model for both groups, as
shown in Figure 14 and Figure 15, had a correlated two-factor structure. The factors were
measured by the same set of indicators, which ensured that the factors had the same
meanings for both groups. The first factor was a combination of listening, reading, and
writing, a non-speaking factor. The second factor was a speaking factor, loaded
exclusively with the speaking tasks. Factor loadings were also equivalent for both groups.
For example, the unstandardized factor loading of the second speaking task (S2) on the
speaking factor was 1.042 for both groups, which meant that this speaking task
functioned equivalently as an indicator of the speaking factor for Group I and Group II.
Indicator residuals were comparable across the groups as well. For example, the residual
variance of the first writing task (W1) was 0.489 for both groups, which indicated that the
same amount of variance of this writing task was left unexplained by its factor for Group
I as for Group II. Finally, the means of the latent factors were equal across the groups. In
terms of their latent abilities, the groups did not differ from each other since the model
testing succeeded when the factor means were held invariant across the groups.
Since the test of factor covariance equality failed, these estimates were not fixed
to be equal for the groups. Factor variances were also assumed to be unequal. Factor
variance estimates were 0.438 and 0.345 in Group I, and 0.426 and 0.330 in Group II.
The covariance estimate of the factors was 0.313 in Group I, and 0.303 in Group II.
Although model testing indicated that the groups differed on these parameters, the estimates were close in magnitude across the groups.
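Converting the group-specific covariances into correlations makes the closeness of the two groups visible; a one-line computation with the estimates just reported (Python, illustrative):

    import math

    def factor_corr(cov, var1, var2):
        """Correlation between two factors from their covariance and variances."""
        return cov / math.sqrt(var1 * var2)

    print(round(factor_corr(0.313, 0.438, 0.345), 3))   # Group I:  0.805
    print(round(factor_corr(0.303, 0.426, 0.330), 3))   # Group II: 0.808

On the correlation metric the two groups are nearly indistinguishable, which is consistent with the conclusion that the group differences on these parameters were minor.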
In the end, the factor models for the two groups were almost identical with only
minor differences. The factor structure underlying the performance of test-takers with
study-abroad experience was almost the same as the one from test-takers without such
experience. These results informed us that the impact of this group membership on the
test performance was minimal. It was then reasonable to conclude that the test
functioned in a comparable way across the two subgroups of test-takers, one with study-
abroad experience and one without such experience. In other words, the structure of
communicative language ability, as measured by the TOEFL iBT® test, was found to be equivalent across the two groups.
Next, test-taker background variables were introduced into the models. The independent variables, such as the time spent studying English and the time spent in content classes taught in English, were specified to have direct effects on the development of the latent factors, and these effects were evaluated for significance.
A unique model was built for each group. With the group of test-takers who did
not have study-abroad experience, the home-country group, as shown in Figure 16, two
independent variables – the time spent studying English and the time spent in content
classes taught in English – were modeled to have direct effects on both latent abilities.
The two independent variables were represented by the rectangular boxes labeled as
‘Study’ and ‘Content’ in the figure as these variables were directly observed. The latent
factors, represented in the ellipses, were dependent variables in the model. A path pointing from an independent variable to a latent factor represented the direct effect of the former on the latter. The part of the variance of a latent factor that
could not be explained by the independent variables, also called disturbance, was
represented in an ellipse next to the latent factor. The disturbances of the two latent
factors (D1 and D2), linked by a double-arrow line, were free to vary and covary.
With the study-abroad group, as shown in Figure 17, three independent variables were modeled to have direct effects on both latent
factors. They were the time spent studying English (labeled as ‘Study’), the time spent in
content classes taught in English (labeled as ‘Content’), and the length of living in an English-speaking country (labeled as ‘Live’). Because the scale of the independent variables was not an issue in model estimation, only the scale of the latent dependent variables needed to be set (Muthén & Muthén, 2010). Since the same set of continuous indicators used previously was used in these two
models to define the latent factors, the same estimation method from earlier was
implemented here as well. The models were evaluated by using the MLM estimator in
Mplus. The selected fit indices, summarized in Table 11, showed both models fit the data
well. In the next section, the standardized parameter estimates in each group are presented. Parameter estimates were first inspected to check if the model was appropriate and reasonable at explaining the relationships among the variables.
All factor loadings were significant at p < 0.01. The path coefficients from the
‘Study’ variable to the latent factors were 0.139 and 0.290. The path coefficients from
the ‘Content’ variable to the factors were 0.227 and 0.338. Significance of the effects of the independent variables on the factors was marked by one asterisk next to a path coefficient at p < 0.05, and two asterisks at p < 0.01.
Three out of the four path coefficients were significant. The path coefficient
between the ‘Study’ variable and the non-speaking factor was not significant. This meant
that the change of latent non-speaking ability was not likely to be affected by the length
of time studying English for the test-takers without study-abroad experience. The path
coefficient between the ‘Content’ variable and the non-speaking factor was significant at
p < 0.05. This indicated that one standard deviation of change in the length of taking
content classes taught in English brought up 0.227 standard deviation of change in the latent non-speaking ability.
With regard to speaking, both the ‘Study’ and ‘Content’ variables had a significant impact (p < 0.01) on this latent factor. The path coefficient between the ‘Study’ variable
and the speaking factor showed that one standard deviation of change in the length of
studying English brought up 0.290 standard deviation of change in the latent speaking
ability. The path coefficient between the ‘Content’ variable and the speaking factor
showed that one standard deviation of change in the length of taking content classes
taught in English brought up 0.338 standard deviation of change in the latent speaking
ability.
The residual variance of the non-speaking factor was 0.915, which meant that
91.5% of the variance of the non-speaking factor could not be explained by the two
independent variables. The residual variance of the speaking factor was 0.756, which
meant that 75.6% of the variance of the speaking factor could not be explained by the two
independent variables. The two residuals were correlated at 0.787. Both standardized
residuals, especially the one for the non-speaking factor, were very high. This indicated
that variables other than the ones specified in the model might have had influences on the
development of the latent factors. Since these other variables were not represented in the
model, their impact on the latent variables could not be analyzed in this study.
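Expressed as variance explained, the standardized disturbances translate directly into R-squared values for the latent factors; a minimal sketch with the estimates just reported (Python):

    residual_nonspeaking = 0.915   # standardized disturbance, non-speaking factor
    residual_speaking = 0.756      # standardized disturbance, speaking factor

    # Proportion of each factor's variance explained by the background variables:
    print(round(1 - residual_nonspeaking, 3))   # 0.085 -> 8.5% for the non-speaking factor
    print(round(1 - residual_speaking, 3))      # 0.244 -> 24.4% for the speaking factor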
To summarize, for the home-country group, both the length of studying English and the length of taking content classes in English had significant associations with the speaking ability. Only the length of taking content classes in English had a significant association with the non-speaking ability.
For the study-abroad group, parameter estimates were likewise inspected to check if the model was appropriate and reasonable at explaining the relationships among the variables.
All factor loadings were significant at p < 0.01. The path coefficients from the
‘Study’ variable to the latent factors were 0.347 and 0.328. The path coefficients from the
‘Content’ variable to the factors were 0.098 and 0.094. The path coefficients from the
‘Live’ variable to the factors were 0.155 and 0.229. Significance of the effects of the
independent variables on the factors was marked by one asterisk next to a path coefficient at p < 0.05, and two asterisks at p < 0.01.
Four out of the six path coefficients were significant at p < 0.01. The significant
path coefficients between the ‘Study’ variable and both factors indicated that one
standard deviation of change in the length of studying English brought up 0.347 standard
deviation of change in the non-speaking factor, and 0.328 standard deviation of change in
the speaking factor. The ‘Content’ variable had no significant impact on either of the
factors. The ‘Live’ variable had a significant impact on both factors. One standard deviation of change in the length of living in an English-speaking country brought up 0.155 standard deviation of change in the non-speaking factor, and 0.229 standard deviation of change in the speaking factor.
The residual variances for the factors were high, 0.816 for the non-speaking factor
and 0.795 for the speaking factor. They were correlated at 0.767. This indicated that
variables other than the ones specified in the model may have had influences on the latent
factors. Since these other variables were not represented in the model, their impact on the latent variables could not be analyzed in this study. To summarize, for the study-abroad group, the lengths of studying English and living in an English language environment both had significant associations with both ability components.
CHAPTER FIVE
This chapter starts with a review of the study, followed by a summary of the
primary findings. Discussion focuses on three topics: (1) the nature of communicative
language ability, (2) group membership and language ability, and (3) learning contexts
and language ability. The implications of this study for foreign language (FL) test
development and validation are elaborated. The merits of using a structural equation
modeling (SEM) approach to address issues at the interface between language testing and
second language acquisition are appraised. The study's contributions and its limitations are discussed.
This study examined the structure of communicative language ability through a latent factor approach based on TOEFL iBT test performance of test-takers with different study-abroad and learning experiences. This ability's associations with the lengths of study-abroad and learning were also investigated.
The first research question asked what the nature of the measured language ability is, and what the role of context is in defining this ability. This question was
investigated with both skill abilities and the context of language use taken into
consideration. Based on the results from previous factor-analytic research with similar
data, three competing models were tested for fit through a series of confirmatory factor analyses. Neither the higher-order structure nor the correlated four-factor structure could be confirmed. Instead, a correlated two-factor model was shown to be compatible with the data. One of the factors was a speaking factor, and the other could be interpreted as a non-speaking factor. This model was established as the baseline model. Two context-related factors–content
and setting–were then added to the baseline model with the goal of improving model fit.
However, in neither case did model estimation succeed. This indicated that, contrary to
the hypotheses, the added context factors were not useful in explaining the latent
structure of the test performance together with the skill-based factors. The correlated two-
factor model fit the data well, and was adopted as the final model for the whole sample.
The second question investigated whether or not having had contact with the target language environment moderated the nature of communicative language ability. This was achieved by conducting multi-group invariance analysis across two groups of test-takers. One group had lived in an English language environment prior to taking the test, whereas the other group had not. Simultaneous
multi-group invariance analysis with a mean structure was carried out with parameters
constrained to equality across the groups in a hierarchical fashion. The results showed
that the moderating effect of this group membership on test performance was minimal.
Across the groups, the test measured the same set of latent abilities in similar ways. The
groups did not differ in terms of their standings on the factor means either. Contrary to
the hypothesis, the nature of communicative language ability elicited by the test had an equivalent underlying representation across the two groups.
The third research question inquired if the lengths of study-abroad and learning
have any association with the development of communicative language ability. This was
accomplished by establishing models unique to test-taker groups. With the group of test-
takers who had had study-abroad experience, the length of time abroad, the time spent
studying English, and the time spent in content classes taught in English were modeled to
have direct effects on the latent abilities in the correlated two-factor model. With the
group of test-takers who had not had study-abroad experience, the time spent studying
English and the time spent in content classes taught in English were modeled to have
direct effects on the latent abilities in the correlated two-factor model. The results lent
partial support to the hypotheses. Although both study-abroad and learning were found to have significant associations with aspects of the language ability, large portions of the factor variances remained unexplained in the models. This result suggested that variables other than the ones specified in the models might have had influences on the development of the latent abilities.
Discussion
Discussion centers on the following three topics: (1) the nature of communicative
language ability, (2) group membership and language ability, and (3) learning contexts and language ability.
The goal of the TOEFL iBT® test is to assess communicative language ability,
whose definition reflects the influences of both skill abilities and the context of language
use. The test is designed to reflect this current thinking in applied linguistics.
The construct the test intends to measure–the ability to use the English language communicatively–was empirically represented by two components. One component, on which all six speaking tasks loaded, could be
interpreted as the speaking ability of this group of test-takers. The second component, on
which tasks from the listening, reading, and writing sections loaded, could be labeled as
the ability to listen, read, and write. In other words, listening, reading, and writing were
the non-speaking ability of this group of test-takers. In relation to test scores, this two-
factor structure meant that a test-taker who was higher ranked on listening was also likely
to perform well on reading and writing, but not necessarily on speaking. Likewise, the
fact that a test-taker scored high on speaking could not be used to draw the conclusion
that the person could also perform well on listening, reading, and writing. Both components were skill-based, and together they accounted successfully for the underlying structure of the test performance.
In contrast to skills, the role of context in defining the construct was not reflected
in this factor model. When two aspects of the context of language use, content and
setting, were examined together with the skill-based factors, model fit did not improve.
Rather, including these situation factors made the models inadmissible. Neither situation
factor was successful at explaining test performance when tested together with the skill-
based factors. With regard to task content, test-takers’ performance was not influenced by
the content of the language tasks, whether academic or non-academic. As far as setting
was concerned, the performance of test-takers was not affected by the availability of the setting or by whether the setting was instructional or non-instructional.
The final model chosen to account for the test performance for the whole group
contained two correlated skill-based factors. This model, however, did not indicate any
influence of the context of language use in the latent configuration of the test construct.
This finding spoke to the current understanding of the multi-component nature of FL proficiency reached by applied linguists and language testers. Contrary to the unitary view of language proficiency endorsed by Oller (1979), the nature of FL proficiency has repeatedly been found to be multi-componential. Speaking and reading were found to be two distinct factors in Bachman and Palmer's (1983) study.
Several distinct ability factors (among them the interactive use of language) were found in Sang et al. (1986). A higher-order model was found to best represent the nature of FL ability in Fouly et al. (1990). Bachman et al. (1995) were able to identify distinct first-order factors. Both Buck (1992) and Bae and Bachman (1998) demonstrated that the two receptive skills, listening and reading, were factorially different. Factors such as comprehending short context and comprehending long context have also been distinguished. The correlated two-factor model found in this
study added another piece of supporting evidence for the multi-component nature of FL
ability.
This two-factor model suggested that responses to the listening, reading, and
writing tasks might have required similar skills, whereas the speaking tasks might have
demanded a somewhat different skill set. Finding this two-factor model could also be due
to the differences in testing method. Both listening and reading used multiple-choice
questions, whereas speaking employed constructed-response tasks. The two writing tasks
were also constructed-response items but they loaded together with the listening and
reading tasks. Still, the majority of the tasks (9 out of 11) on the non-speaking factor used
objective multiple-choice questions. Test method effect might have contributed to the
finding of the two-factor model. The third explanation for the distinctiveness between a
speaking factor and a non-speaking factor could be instruction, or lack thereof. The speaking section became mandatory with the TOEFL iBT test, whereas listening, reading, and
writing had long been part of the TOEFL testing routine before the introduction of the
TOEFL iBT test. For years TOEFL test-takers could choose not to be tested on speaking.
Lack of test preparation and training could be the reason for finding a speaking ability distinct from the other skills.
From a test validation point of view, the correlated two-factor model failed to
confirm an internal test structure that was compatible with the test’s section design and
score reporting scheme. The TOEFL iBT® test has a structure of four skill-based sections:
listening, reading, writing, and speaking. Stricker and Rock (2008) and Sawaki et al.
(2008) both concluded that a higher-order factor (general FL ability) together with four first-order skill factors (listening, reading, speaking, and writing) provided the best explanation of TOEFL iBT test performance.
Results from these previous TOEFL iBT studies supported the practice of reporting a
separate score on each of the skill sections as well as a total score, an average of the four section scores.
The correlated two-factor model confirmed in this study was identical to what
Stricker et al. (2005) found based on a prototype of the TOEFL iBT® test. The results of
this study and Stricker et al. (2005) suggested a different internal structure of the test and,
therefore, an alternative way to organize test content and to report scores. According to
this factor model, the ability measured by the test had two instead of four latent components. Tasks designed to measure separate abilities of listening, reading, and writing were all
associated with the same latent ability, an ability that was not related to speaking. This
correlated two-factor model without a higher-order structure did not provide enough
evidence to support the existence of a general language ability. To reflect the internal
structure of the test suggested by this model, the domain of the test might be organized
into either speaking or non-speaking. If future studies support the findings of this study,
the test could report a speaking score based on speaking tasks and a non-speaking score
based on tasks from listening, reading, and writing. The results of this study did not
provide justification for reporting a total score for the entire test.
A discrepancy was observed between the theoretical definition of communicative language ability and the empirically obtained factor structure. The
definition of the test construct highlights the intertwining relationships between the
context of language use and the skill-based capacities within individuals, and suggests
organizing the test domain by language use situation (Chapelle et al., 1997). Although the
operational TOEFL iBT® test follows the four-skills convention to organize test content,
the context of language use can still play a role in defining the ability measured by the test. This possibility was examined through a two-dimensional factor modeling approach.
In the language testing literature, this approach has been used in multiple studies
to demonstrate factors that are not skill-based. Most of these studies focused on test
method effects. Bachman and Palmer (1983) empirically demonstrated the existence of
three method-related factors (interview, translation, and self-rating) together with two trait factors. In another study by the same authors (Bachman & Palmer, 1982), the best model they found was one combining method factors (including writing sample and multiple-choice test, and self-rating) and skill factors (grammatical and pragmatic competence, and sociolinguistic competence). Another study discovered that a two-dimensional model with three skill-based factors (main idea comprehension among them) and two input-mode factors (an audio input mode and a written input mode) provided the best match to the data.
Llosa (2007) found that two-dimensional models with both skill factors and method factors (standardized test and classroom assessment) provided the best explanation for the data in multiple test
populations. A two-dimensional model with three skill factors (extracting main ideas,
major ideas, and supporting details) and three method factors (summary task, incomplete
outline, and open-ended question) was chosen in Shin’s (2008) study to account for FL
test performance.
The common approach in the above studies was to demonstrate the method effects by imposing on the skill-based factors a second dimension of test method factors. Successful model estimation indicated that the method
factors together with the skill-based factors were responsible for the underlying structure
of the ability being measured. This approach was adopted in this study to examine the
role of context in defining communicative language ability measured by the TOEFL iBT®
test. A second dimension of situation factors, based on content and setting, was tested,
respectively, along with the skill-based correlated two-factor structure. Model estimation
did not succeed in either case, indicating that a model with both skill-based factors and
situation factors was not empirically compatible with the data. Instead, a skill-based two-
factor model provided the best explanation of the construct. However, there was a
noticeable amount of overlap in the indicators that represented skill and situation factors.
For example, 8 out of the 10 indicators associated with the academic content were also
indicators of the non-speaking factor. This could be a possible reason for the estimation failures.
These findings suggested that the ability measured by the test was predominantly
skill-oriented. The relationships between the context of language use and the skill-based
capacities were not captured in the latent structure of the test construct. In other words,
the role of the context of language use in defining communicative language ability could not be confirmed.
The language testing research community has long been aware of the relativity of
FL ability. Earlier researchers made the call for interpreting the nature of FL ability in
light of learner variability (Harley et al., 1990; Kunnan, 1998a). In a more recent review
of English language testing and assessment, Alderson and Banerjee (2002) restated this call.
The field has witnessed a surge of empirical studies that investigated the nature of
FL ability in relation to learner variability. This line of research responded to the question of whether the nature of FL ability varies across groups of learners defined by characteristics such as learner background (Römhild, 2008; Shin, 2005), cognitive skill (Sang et al., 1986), gender (Wang, 2006), and target language contact experience.
study abroad environment (Morgan & Mazzeo, 1988), or an at-home environment (Bae &
Bachman, 1998; Morgan & Mazzeo, 1988). Results from these studies suggested that
language contact as a TTC moderated test performance. In other words, language abilities
developed in groups with different language contact experiences were different in terms
of latent structure.
This study took a special research interest in the relationship between the nature
of communicative language ability, as measured by the TOEFL iBT® test, and test-takers’
target language contact, either having lived in an English speaking environment or not
having done so. The results contrasted with the outcomes from previous studies.
At the measurement level across the two groups, the test measured the same
abilities (speaking and non-speaking). The test tasks functioned equivalently as indicators
of the abilities they measured. At the structural level, the two groups did not differ in
terms of their mean performance on the latent abilities. The degrees of variability of the
latent abilities as well as the correlational relationship between the two abilities were
found to vary across the groups, but not by much. It is usually assumed that factors differ in their variability in different groups. That was the reason for choosing the unit loading identification constraint to scale the factors rather than fixing the factor variances. In conclusion, these results suggested that having the study-abroad experience (from less
than 6 months to more than 1 year) did not alter how a test-taker performed on the test, in
terms of the latent factor structure as well as the latent factor means.
The factorial invariance found in this study, however, did resonate with what
other TOEFL iBT® multi-group studies have found. Stricker et al. (2005) concluded that
a correlated two-factor structure could be applied across three native language groups. In
another study, Stricker and Rock (2008) confirmed the same higher-order structure across
subgroups by native language family, exposure in home countries, and formal instruction.
By focusing on test-takers’ exposure in the target language community, the results of this
study provided another piece of evidence for the generalizability of the test’s internal
structure.
This study did not find convincing evidence to claim that test-takers with study-
abroad experience performed differently on the test as a whole group from the group
without such experience. The English language ability developed in the two groups had the same underlying structure. Both groups exhibited a distinct speaking ability. They also exhibited an ability that could be captured by their responses to the listening, reading, and writing tasks collectively. The inclusion of a mean structure in this study led to the surprising finding that the study-abroad test-takers did not turn out to be better at English compared to the test-takers who had never had study-abroad experience. This result ran counter to the assumed superiority of study abroad over learning in the home country, a common belief as captured by Freed, Segalowitz and Dewey (2004). The assumed superiority of study abroad over classroom settings has also been challenged by study abroad researchers. After reviewing
studies that compared language gains from study-abroad and from home country formal
learning, Collentine and Freed (2004) found no convincing evidence that one learning
context was of absolute superiority compared to the other. Depending on the aspects of
linguistic development and levels of proficiency, one learning context might produce
more gains than the other. A study by Davidson (as cited in Dewey, 2004, p. 322) also
pointed out that it might take a full year of target language contact for the linguistic
benefits to become evident, and he called for additional research on the effects of length
of study-abroad. In this study, exposure to the English language varied from less than 6
months to more than 1 year. Among the 246 test-takers who had been immersed, only
about half of them had lived in an English-speaking country for more than a year.
From the test validation point of view, confirming an equivalent structure across
the two groups provided important validity evidence based on the test’s generalizability.
The TOEFL iBT® test is intended for people whose first language is not English.
However the intended test-taking population could be very diverse. With growing
opportunities for study abroad, language learning has expanded from its traditional classroom context into the target language community. One salient characteristic of this diverse population is the target language contact experience. Taking the sample in this study as
an example, about two thirds of the test-takers had the experience of living in an English
language country prior to taking the test. The rest did not have such experience. This diversity raises a legitimate question: Should we use the same test for all intended test-takers; and if we do
use the same test, can test scores be interpreted and used in the same way? The results of
the multi-group invariance analysis in this study indicated that the test functioned
equivalently for both groups of test-takers, regardless of their different target language
contact experiences.
The TOEFL iBT® test is administered both domestically (where English is the dominant language) and internationally (where English is not the dominant language). Test-taking location can also be used as a rough index of whether a test-taker has or has not had language
contact with English. The results from this study provided partial support to use the same
test format and score reporting scheme for both domestic and international test-taking
locations.11 Additional research on how domestic and international TOEFL iBT® test-takers perform is recommended. It is not unusual that a test's target test-taking population is diverse. To justify using the same test across groups, it is
first necessary to make sure that the test measures the same construct in an equivalent
way. If analysis suggests that a four-skills test measures the designed four skills for one
group and one general skill for another group, then reporting skill-based scores makes
good sense for the former group, and reporting only total scores for the second group is
preferred. As a matter of fact, when the test is used on the second group, the length of the
test could be reduced to one-fourth of its original length, because the information obtained from the four sections would be largely redundant. Measurement invariance should also be established before attempting to compare group means. That is because it is important to know what to compare before
conducting any comparisons. Assume that in one group factor loadings of listening items
are relatively high and those in another group are relatively low. This hypothetical
finding would indicate that these test items were better indicators of the listening ability
in the first group than in the second. Comparing group means in listening based on these
items would therefore be questionable. In the first group, deriving a listening score from
these items would be acceptable. In the second group, since the items would not be good
indicators, it would be doubtful how much information about listening they really
conveyed, and a valid listening score based on them might not be warranted.
Therefore comparison would make little sense. Consider another scenario. An integrated
task involving both listening and speaking loads with listening tasks in one group but
with speaking tasks in the other group. Comparing groups on this task would not be
warranted because the task has simply measured different things in different groups. In
conclusion, to ensure fair interpretation and use of test scores across diverse groups of
test-takers, measurement invariance should be established before any scores are compared.
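As a compact illustration (standard multi-group confirmatory factor analysis notation, offered here as a sketch rather than the study's exact parameterization), let the observed scores in group g follow

    $$x^{(g)} = \tau^{(g)} + \Lambda^{(g)}\xi^{(g)} + \delta^{(g)}, \qquad g = 1, 2.$$

Metric invariance constrains the loadings to be equal, $\Lambda^{(1)} = \Lambda^{(2)}$; scalar invariance adds equal intercepts, $\tau^{(1)} = \tau^{(2)}$. Only when these constraints hold can a difference in latent means, $\kappa^{(2)} - \kappa^{(1)}$, be interpreted as a genuine ability difference rather than an artifact of the measurement.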
Two broad learning contexts can be distinguished. An instructional context centers on the
formal study of the language itself, including vocabulary and grammatical structures.
Formal FL training in the home country usually provides such a context. A communicative
context centers on using the language for authentic purposes, such as exchanging information and
opinions. Study abroad in the target language community is likely to create such a context.
An FL learner might have experience with one of these contexts, or with both. The results
of this study provided opportunities to understand how both contexts are associated with
the development of language ability. For test-takers who indicated no experience of direct
contact with an English language environment (the home-country group), learning was
captured by their experience of studying English and studying content classes taught in
English. The former would occur most likely in an instructional context, whereas the
latter might happen in a hybrid context which could be both instructional and
communicative. For test-takers who claimed having direct contact with the target
language community (the study-abroad group), learning was captured by their experience
of living in an English-speaking environment, in addition to the two learning situations above.
The findings of this study suggested that all three learning situations had an
impact on the development of aspects of language ability. The speaking ability of the
home-country group was associated with the lengths of both studying English and taking
content classes taught in English. This group’s ability to listen, read, and write was
associated only with the length of time taking content classes taught in English.
For the study-abroad group, the non-speaking ability (as distinct from
speaking) had a significant relationship with the length of time of taking content classes
taught in English. The lengths of studying English and of living in an English-speaking
country were significantly associated with both ability components. Between these two
learning situations, the length of time studying English held the stronger relationship,
suggesting that formal study rather than residence abroad carried the greater
impact on the development of language ability. This finding was compatible with results
from studies comparing study abroad and formal instruction in the home country (Díaz-
Campos, 2004; Collentine, 2004; Lafford, 2004; Sasaki, 2007). These studies indicated
that students receiving formal language instruction in their home countries made just as
much gain (if not more) on some aspects of language ability as study abroad learners.
What was surprising was that even within the study-abroad group, test-takers' time
living abroad did not hold the strongest relationship with their abilities. This ran counter
to the common belief in the absolute superiority of study abroad over other learning
contexts.
Practically speaking, this study suggested that study abroad might not be the only
way to prepare for the test and to improve English language ability. Training received in
a formal classroom setting might help just as much, if not more, in performing well on the test.
Implications
The outcomes of this study have implications for a broad range of issues
associated with language acquisition and language testing. This section is organized
around three topics. First, thoughts regarding test development are elaborated. Second, the idea of
establishing FL test-taker profiles is put forward. Last, using a structural equation modeling (SEM) approach to work at the
interface between language testing and second language acquisition is discussed.
Test Development
If the context of language use and the internal skill abilities of individual language
users are both part of the definition of the communicative language ability that the
TOEFL iBT® test intends to measure, the role of this language use context in defining the
test construct should be reflected in the internal structure of the test. Unfortunately, the
attempt to demonstrate the abilities to respond to context did not succeed. The skill
abilities, as already reflected in the design of the test, were shown to be the dominant
forces in determining the internal structure of the test. Test-takers’ ability to respond to
different situations defined by content and setting did not appear to have an impact on
the internal structure of the test. This raises a question about the testability of context-based
language components. In other words, are these components testable yet? Current
theorizing about communicative competence, including the communicative language ability model
proposed by Bachman (1990), embraces the context of language use in its framework.
What remains unclear is whether test-takers' responses to different language use
situations are reliable and distinguishable enough for the associated abilities to be captured in the way predicted by the
original theory. This study suggested that features of the language use situation, such as
content and setting, were not able to elicit the associated context-based abilities.
However, not all features used to define the context of language use were tested
simultaneously in this study. Furthermore, not all tasks used to test the context-associated
ability components had a fully developed language use situation. This lack of context
development was especially obvious in all reading tasks as well as independent writing
and speaking tasks. This might have been the cause for the failure to find context-based
abilities.
With the intent to understand the nature of communicative language ability, this
study chose the TOEFL iBT® test in hopes of allowing context-dependent abilities to
surface and appear in the internal structure of the test, because this test was designed to
particularly reflect communicative language use in contexts. The results implied that
more care and attention need to be given to context development for the tasks used in the
test. Making a complete departure from the four-skills test design, while not yet
acceptable to the majority of test users, might offer more opportunity for implementing a
fully context-oriented construct.
The results also implied that new metrics might need to be considered for scoring
the test. In this study, test-takers with study-abroad experience did not perform better
on the test, compared to those without such experience. In spite of the prevailing belief in
the superiority of study-abroad learning contexts over traditional classroom contexts, this
study suggested that test-takers without study-abroad experience were just as likely (or
unlikely) to do well on the test. As
Collentine and Freed (2004) pointed out, the types of language gains attributed to an
immersion environment, such as gains in oral fluency, might not be captured by measures
that were not designed to detect them. In
TOEFL iBT® testing, the speaking and writing tasks are rated based on holistic scales.
Scores are not available on aspects of speaking, such as oral fluency and pronunciation.
Scores are not available on aspects of writing either, such as writing fluency and
complexity. Such variables might not be readily testable or quantifiable by the existing metrics, but they
could reveal differences across learning contexts and inform research on language
acquisition. Reporting scores on these variables might also provide language learners
with more fine-grained feedback on their progress.
Test-Taker Profiles
Viewing the study-abroad learning context as different from the classroom-based
instructional context is built upon the assumption that the former provides more
opportunities for contact with the target language community (Dewey, 2004; Freed et al.,
2004; Segalowitz & Freed, 2004). Freed et al. (2004) raised concern about this presumed
privilege, and proposed to use concrete measures to characterize the nature, quality, and
quantity of that contact. The Language Contact Profile
(Freed, Dewey, & Segalowitz, 2004) was designed to provide such a measure to
document and quantify language learners' interaction with native speakers during time
abroad.
This study examined whether test-takers with different learning experiences performed
differently on the test. Their language contact experience,
or lack of it, was characterized by a dichotomous measure: either having it or not having
it. No difference in test performance across the groups was found, which could be
attributed to the fact that the real differences in learning contexts were not captured by
such a crude measure. Future test administrations could collect detailed language contact
information for the test-takers. This contact information can be used to establish FL test-
taker profiles. Such a profile traditionally includes information like age, gender, native
country and native language, etc. With increasing opportunities for study-abroad,
especially after the U.S. Senate declared 2006 the Year of Study Abroad (Magnan &
Back, 2007), it has become more relevant to understand test-takers not only by pre-
existing demographic information but also by what they have done to make use of the
target language in non-traditional learning contexts, such as study abroad. Elements such
as the nature and intensity of target language contact, which are largely absent from the traditional
classroom setting, may trigger or hinder learning in study-abroad contexts. Such elements
need to be built into FL test-taker profiles. Carefully and fully developed test-taker
profiles will enhance understanding of what to expect from and how to deal with our
increasingly diverse test-taking populations.
A SEM Approach
This study used a structural equation modeling (SEM) approach to examine the
construct of communicative language ability and its relationships with learning
contexts, based on test performance on the TOEFL iBT® test. The research interest in the
nature of language ability has been shared by the language testing community, whereas
understanding the factors that affect language acquisition has been a focus of attention in
the field of second language acquisition (Bachman & Cohen, 1998). At the interface between
language testing and acquisition, there is the issue of how to bring the insights gained
from one field to inform the research agenda in the other. A SEM approach, a hybrid of
factor analysis and path analysis, offers a research method that can be used to address
these interface issues.
A structural equation model usually has two parts. The measurement component
illustrates the relationships between latent abilities and their indicators. Such a latent
ability cannot be observed directly; it must be inferred from observed performance.
Multiple indicators (observed scores) are also required for each latent factor to ensure
adequate measurement. Finding a latent structure that underlies
test performance clarifies the nature of the theoretical construct. In language testing, validating a theory of
communicative language ability could involve finding a latent factor structure based on
test performance that is compatible with the theoretical configuration of this ability. As
results of such validation attempts accumulate over multiple tests and under different testing
conditions, theories of language ability can be advanced.
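In generic notation (a sketch of the standard formulation, not the exact specification used in this study), the measurement part can be written as

    $$x = \tau + \Lambda\xi + \delta,$$

where $x$ is the vector of observed scores, $\tau$ the intercepts, $\Lambda$ the matrix of factor loadings, $\xi$ the vector of latent abilities, and $\delta$ the vector of measurement errors.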
The second part of a structural equation model is the structural model. Building a
structural model allows an investigation of the impact of multiple variables on the latent
abilities simultaneously. If an explanatory variable is itself latent, the
nature of this variable should be verified in a measurement model first before being
entered into the structural model. In language testing, these independent variables usually
represent test-taker characteristics. Examining the relationships between latent abilities and these
characteristics can shed light on how test-takers acquire a certain ability and/or reach a
certain ability level. The structural model thus brings language educators into a dialogue with
language testers.
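Continuing the same generic notation, and assuming for illustration that the explanatory variables are observed and collected in a vector $z$ (as the length-of-experience variables were in this study), the structural part regresses the latent abilities on them:

    $$\xi = \Gamma z + \zeta,$$

where $\Gamma$ contains the path coefficients of substantive interest and $\zeta$ the disturbances, that is, the portion of each latent ability left unexplained by the modeled characteristics.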
From the measurement model, language teachers will gain a better understanding
of learners' achievement: exactly what they learn and what kinds of abilities they acquire.
This will affect how they direct teaching resources and organize curriculum.
From the structural model, test developers will have better ideas about how the test
functions; that is, whether test results reflect the language gains associated with changes in test-
taker characteristics. This will lead to better use of test instruments to detect language
gains. A SEM approach provides a platform upon which such conversations across language
testing and language acquisition can take place. One way such a model can be specified in software is sketched below.
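As a minimal sketch of how such a two-part model could be specified, the following uses the open-source Python package semopy rather than the Mplus software used in this study; the factor names, indicator names, covariates, and data file are hypothetical stand-ins, not the study's actual variables.

    # A minimal sketch, assuming the semopy package; all variable names are
    # hypothetical stand-ins for the study's indicators and covariates.
    import pandas as pd
    import semopy

    MODEL_DESC = """
    speaking =~ speak1 + speak2 + speak3
    nonspeaking =~ listen1 + read1 + write1 + write2
    speaking ~ years_english + years_content
    nonspeaking ~ years_english + years_content
    """

    data = pd.read_csv("toefl_sample.csv")  # hypothetical data, one row per test-taker

    model = semopy.Model(MODEL_DESC)  # "=~" lines: measurement part; "~" lines: structural part
    model.fit(data)                   # estimates loadings and path coefficients
    print(model.inspect())            # parameter estimates with standard errors

The first two lines of the model description define the measurement part (latent factors and their indicators); the last two define the structural part (latent factors regressed on test-taker characteristics).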
Unique Contributions
This study made a unique attempt to examine the role of context in the definition
of communicative language ability. With this intent in mind, the TOEFL iBT® test was
chosen because this test intends to measure communicative language ability, with both
skill abilities and the context of language use as parts of the theoretical definition of the
test construct. Context-based factors were modeled based on the
performance elicited by the test. The content and setting factors were specifically
chosen in hopes of allowing context-dependent abilities to
surface and appear in the internal structure of the test. Contextual factors have not been
examined in the literature on construct studies in language testing. This study set an
example of using factor analysis to examine the role of context in defining language
ability.
This study also examined a test-taker characteristic (TTC) that had not been studied in the context of TOEFL iBT® testing. The
language testing community has long been aware of the fact that the makeup of language
ability might not be equivalent across test-taker groups with different characteristics.
Language contact with the target language community, a factor that has been studied
in the study abroad literature, was treated as such a TTC in this study
and was examined in its relation to the underlying structure of the construct. With
growing opportunities for study abroad in recent
years, this TTC has become relevant and salient in more testing situations, especially in the situation
of TOEFL iBT® testing, since the test is administered both domestically and internationally.
Moreover, not only the factorial structure but also the mean structure of the
construct was examined across groups. Comparing latent factor means is preferable to
comparing means based on observed scores because: (1) the pre-established measurement
invariance ensures that the latent factors represent the same abilities across groups, and
(2) measurement errors are taken into account in the overall model. This study made a
methodological contribution by incorporating the mean structure into the invariance
analysis.
Limitations and Future Research
This study started the analysis with a dataset of a randomly generated sample of
1000 subjects. Due to missing values and inconsistent responses in the original dataset,
only 370 subjects were included in the final analysis. The reasons for the missing values
and inconsistent responses could only be speculated upon, not confirmed. One reason
could be that some test-takers did not interpret the background questions correctly
because these questions were presented in English rather than in their native languages.
Unknown factors, which could not be taken into account due to the lack of information, might have
had influences on the results of the analysis. Future researchers are recommended to use
datasets with more complete background information.
The study sample of 370 subjects matched the original random sample of 1000
subjects in terms of test performance. Although the study sample was sufficient to carry out a single-group
analysis, when divided into groups in the multi-group analysis, the subgroup samples
seemed relatively small. Confidence in the results of such analyses could still be sustained
on the grounds that the study sample could be considered a good representative of the
total sample as well as the target population. Nevertheless, the results should be
interpreted with caution.
When investigating the latent structure of the construct, the writing factor was
represented by only two indicators. In factor analysis, a factor needs at least two
indicators for the model to be identifiable. This is the minimum requirement, and in
practice it is advised to have more than two indicators for each factor. Future researchers
are recommended to test the latent structure of FL ability with each factor represented by
at least three indicators.
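As a standard identification count (general factor-analytic arithmetic, not a result from this study): a factor measured by $p$ indicators, considered in isolation with its variance fixed to 1 for scaling, offers $p(p+1)/2$ observed variances and covariances against $2p$ free parameters ($p$ loadings plus $p$ error variances):

    $$p = 2:\ \tfrac{2 \cdot 3}{2} = 3 < 4 = 2p, \qquad p = 3:\ \tfrac{3 \cdot 4}{2} = 6 = 6 = 2p.$$

With only two indicators, the factor is under-identified on its own and becomes estimable only through its covariances with the other factors in the model; with three indicators it is just-identified.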
When investigating the influences of language use context, this study chose to
focus on only two situational factors, content and setting. Not all features used to define
the context of language use were tested in this study. This was partially because not all
situational features were present in the specifications of all test tasks. Future researchers
interested in this line of study could explore other situational factors, such as participants.
Although the design of the TOEFL iBT® test intends to reflect the role of
language use context in defining the communicative language ability, not all tasks have a
fully developed language use situation that would allow for adequately assessing the
impact of contextual factors. This might have been the cause for failing to model the
context-based abilities in this study. Future researchers
who are interested in the nature of communicative language ability should conduct
studies based on tests that are developed through a communicative approach. Such a test,
with a clear focus on communicative language use, could organize the domain of interest
by language use situation so that it is more likely for contextual factors to be a part of the
representation of the test construct. The development of language tasks used in such a test
should also have a clear context orientation, focusing on key features that have been
identified as theoretically important. Such a test might need to
make a complete departure from the widely accepted four-skills design to fully
realize a communicative orientation.
The design of the multi-group analysis separated the test-takers into two groups,
either having or not having the study-abroad experience. However, some findings from
the study abroad literature suggested a full year of study-abroad as the threshold for the
linguistic benefits to become evident. A grouping based on the length of time abroad,
either more or less than one year of study-abroad experience, might allow for detecting
such benefits.
This full year threshold hypothesis was not originally intended, and therefore was
not tested in this study. However, after finding no structural and mean difference between
the two groups defined in the study, two additional sets of analyses were attempted based
on the study sample of 370 test-takers. In the first analysis, test-takers with less than 6
months of study-abroad (N=191) were compared to the ones with more than 6 months of
time abroad (N=179). With measurement invariance established, the latter group
performed significantly better (p < 0.01) on speaking than the former. In the second
analysis, test-takers with less than one year of study-abroad (N=240) were compared to
the ones with more than one year of time abroad (N=130). With measurement
invariance established, the latter group performed significantly better on the speaking
factor (p < 0.001), and to a lesser degree on the non-speaking factor (p < 0.05). These
results suggested that language ability, especially the speaking ability, was positively
associated with the length of study-abroad. For differences in language gains to become
evident, a sufficient length of target language contact appears necessary. Future
researchers are encouraged to test this full-year threshold hypothesis in other
testing situations. They are also encouraged to examine TTCs that have not caught the
research community’s attention but may be relevant to language testing and acquisition.
The current study had a research interest in the joint impact of classroom learning
and study-abroad experiences on language ability. The learning experience
variables, however, were defined by length in years. The richness of these learning experiences was
not fully captured in the models tested. This probably explains why a large portion of
the factor variances could not be accounted for in the study. Variables other than the ones
specified in the models were not investigated, due to lack of such information. It is
recommended that future researchers who share the same research interest collect
detailed information on the nature and intensity of learners' language contact with the
target language. The findings of this study also
encourage collaborative research efforts joined by both language testing researchers and
language acquisition researchers. Through a proper method, such as SEM, this line of
research would inform not only what constitutes language ability but also how the aspects
of this ability develop under different learning conditions.
Finally, data utilized in the study were purely observational, providing a static
image of the relationships among the variables at one point in time. Without a controlled
experimental design, causal statements could not be made with confidence, which
prohibited making inferences about the factors responsible for language development.
Carefully designed experimental studies are recommended for the future to help fully
address these questions.
NOTES
1
In this article, a foreign language refers to a language that is learned after a person has
already learned his or her native language(s). The term ‘foreign language’ is used
interchangeably with second language. In the context of this study, foreign language
ability is conceptualized as a latent trait. The term ‘ability’ is used interchangeably with
proficiency and competence.
2
TOEFL iBT is a registered trademark of Educational Testing Service (ETS). This
publication is not endorsed or approved by ETS.
3
The higher-order factor model was also tested based on the test performance of 1000
test-takers in the total random sample. The model was rejected because extremely high
correlations were found between the higher-order factor and the first-order factors.
4
Because of the confirmatory nature of the analysis, only models that had been confirmed
in the previous TOEFL® literature were hypothesized. However, after finding a high
correlation between listening and writing, these two factors were grouped together and
tested in a three-factor model (listening/writing, reading, and speaking). A close to 0.9
correlation was found between the reading factor and the listening/writing factor, which
suggested that a two-factor model with a speaking factor and a non-speaking factor
would be a more appropriate model for the data. The result of post hoc model fitting was
not reported because of the exploratory nature of modification procedures and the risk of
capitalization on chance factors.
5
The correlated four-factor model was also tested based on the test performance of 1000
test-takers in the total random sample. The model was rejected because extremely high
correlations were found among the latent factors.
6
The correlated two-factor model was also tested based on the test performance of 1000
test-takers in the total random sample. Taking all criteria into consideration, this model
provided the best explanation of the data.
7
A model with only content and no skill factors was tested based on the 370 test-takers’
performance. The result indicated that the model fit was not acceptable.
8
The two-dimensional model with both content and skill factors was also tested on the
test performance of 1000 test-takers in the total random sample. The model was rejected
because the estimated correlation between the two content factors was extremely high,
and a number of factor loadings were nonsignificant.
9
A model with only setting and no skill factors was tested based on the 370 test-takers’
performance. The result indicated that the model fit was not acceptable.
10
The two-dimensional model with both setting and skill factors was also tested on the
test performance of 1000 test-takers in the total random sample. The model was rejected
because the estimated correlations among the setting factors were extremely high, and a
number of factor loadings were nonsignificant.
11
Simultaneous multiple-group invariance analyses were also performed across the two
groups: 418 domestic test-takers and 582 overseas test-takers in the total random sample.
Both measurement and structural invariance could be held. No factor mean difference
was found. This result supported using the same test format and score reporting scheme
both internationally and domestically. This step was not reported because test-taking
location was found not to be a reliable indicator of language contact experience.
REFERENCES
Alderson, J. C., & Banerjee, J. (2002). Language testing and assessment (Part 2).
Language Teaching, 35, 79-113. doi: 10.1017/S0261444802001751
Anastasi, A., & Urbina, S. (1997). Psychological testing. Upper Saddle River, NJ:
Prentice-Hall.
Bachman, L. F. (2000). Modern language testing at the turn of the century: Assuring that
what we count counts. Language Testing, 17(1), 1–42. doi:
10.1177/026553220001700101
Bachman, L. F., Davidson, F., & Foulkes, J. (1990). A comparison of the abilities
measured by the Cambridge and Educational Testing Service EFL Test Batteries.
Issues in Applied Linguistics, 1(1), 30-55.
Bachman, L. F., Davidson, F., Ryan, K. & Choi, I.-C. (1995). An investigation into the
comparability of two tests of English as a foreign language: The Cambridge-
TOEFL comparability study. New York: Cambridge University Press.
Bachman, L. F., & Palmer, A. S. (1982). The construct validation of some components of
communicative proficiency. TESOL Quarterly 16(4), 449–465.
Bachman, L. F., & Palmer, A. S. (1983). The construct validation of the FSI oral
interview. In J. W. Oller, Jr. (Ed.), Issues in language testing research (pp. 154-
169). Rowley, MA: Newbury House.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and
developing useful language tests. Oxford: Oxford University Press.
Bae, J., & Bachman, L. F., (1998). A latent variable approach to listening and reading:
Testing factorial invariance across two groups of children in the Korean/English
two-way immersion program. Language Testing, 15(3), 380-414. doi:
10.1177/026553229801500304
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A.
Bollen, & J. S. Long (Eds.), Testing structural equation models (pp. 136-162).
Newbury Park, CA: Sage.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the
multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81-105. doi:
10.1037/h0046016
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second
language teaching and testing. Applied Linguistics, 1(1), 1-47.
Carroll, J. B. (1958). A factor analysis of two foreign language aptitude batteries. Journal
of General Psychology, 59(3), 3-19.
Carroll, J. B. (1983). Psychometric theory and language testing. In J. W. Oller, Jr. (Ed.),
Issues in language testing research (pp. 80-107). Rowley, MA: Newbury House.
Chapelle, C., Grabe, W., & Berns, M. (1997). Communicative language proficiency:
Definition and implications for TOEFL 2000. (TOEFL Monograph Series Report
No. 10). Princeton, NJ: Educational Testing Service.
College Board. (1987). Advanced placement course description: French. New York:
College Entrance Examination Board.
College Board. (1989). Advanced placement course description: Spanish language and
literature. New York: College Entrance Examination Board.
Collentine, J., & Freed, B. (2004). Learning context and its effects on second language
acquisition. Studies in Second Language Acquisition, 26, 153-171.
doi:10.1017/S0272263104262015
Educational Testing Service. (2004). The next generation TOEFL test: Focus on
communication. Retrieved from http://www.ets.org/toefl/nextgen/
Educational Testing Service. (2008). Validity evidence supporting the interpretation and
use of TOEFL® iBT scores. Retrieved from http://www.ets.org
Fouly, K. A., Bachman, L. F., & Cziko, G. A. (1990). The divisibility of language
competence: A confirmatory approach. Language Learning, 40(1), 1-21. doi:
10.1111/j.1467-1770.1990.tb00952.x
Freed, B. F. (1995). What makes us think that students who study abroad become fluent?
In B. F. Freed (Ed.), Second language acquisition in a study abroad context (pp.
123-148). Amsterdam: John Benjamins.
Freed, B. F., Dewey, D. P., Segalowitz, N., & Halter, R. (2004). The language
contact profile. Studies in Second Language Acquisition, 26, 349-356. doi:
10.1017/S0272263104062096
Freed, B. F., Segalowitz, N., & Dewey, D. P. (2004). Context of learning and second
language fluency in French: Comparing regular classroom, study abroad, and
intensive domestic immersion programs. Studies in Second Language Acquisition,
26, 275-301. doi: 10.1017/S0272263104062060
Gardner, R. C., & Lambert, W. E. (1965). Language aptitude, intelligence, and second
language achievement. Journal of Educational Psychology, 56(4), 191-199. doi:
10.1037/h0022400
Ginther, A., & Grant, L. (1996). A review of the academic needs of native English-
speaking college students in the United States. (TOEFL Monograph Series Report
No. 1). Princeton, NJ: Educational Testing Service.
Ginther, A., & Stevens, J. (1998). Language background, ethnicity, and the internal
construct validity of the Advanced Placement Spanish Language Examination. In
A. J. Kunnan (Ed.), Validation in language assessment (pp. 169-194). Mahwah,
NJ: Lawrence Erlbaum Associates.
Hale, G. A., Stansfield, C. W., Rock, D. A., Hicks, M. M., Butler, F. A., & Oller, J. W.
Jr. (1988). Multiple-choice cloze items and the Test of English as a Foreign
Language. (TOEFL Research Report No. 26; ETS Research Report No. 88-02).
Princeton, NJ: Educational Testing Service.
Hale, G. A., Rock, D. A., & Jirele, T. (1989). Confirmatory factor analysis of the Test of
English as a Foreign Language. (TOEFL Research Report No. 32; ETS Research
Report No. 89-42). Princeton, NJ: Educational Testing Service.
Harley, B., Cummins, J., Swain, M., & Allen, P. (1990). The nature of language
proficiency. In B. Harley, P. Allen, J. Cummins, & M. Swain (Eds.), The
development of second language proficiency (pp. 7-25). New York: Cambridge
University Press.
Hosley, D., & Meredith, K. (1979). Inter- and intra-test correlates of the TOEFL. TESOL
Quarterly, 13(2), 209-217.
Hu, L.-T., & Bentler, P. M. (1999). Cutoff criteria for fit indices in covariance structure
analysis: Conventional criteria versus new alternatives. Structural Equation
Modeling, 6, 1-55.
Jamieson, J., Jones, S., Kirsch, I., Mosenthal, P., & Taylor, C. (2000). TOEFL 2000
framework: A working paper. (TOEFL Monograph Series Report No. 16).
Princeton, NJ: Educational Testing Service.
Kline, R. B. (1998). Principles and practice of structural equation modeling. New York:
Guilford.
Kline, R. B. (2005). Principles and practice of structural equation modeling (2nd ed.).
New York: Guilford.
Lafford, B. A. (2004). The effect of the context of learning on the use of communication
strategies by learners of Spanish as a second language. Studies in Second
Language Acquisition, 26, 201-225. doi: 10.1017/S0272263104062035
Magnan, S. S., & Back, M. (2007). Social interaction and linguistic gain during study
abroad. Foreign Language Annals, 40(1), 43-61. doi: 10.1111/j.1944-
9720.2007.tb02853.x
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed.) (pp.
13-103). New York: American Council on Education and Macmillan.
McNamara, T. F. (1990). Item response theory and the validation of an ESP test for
health professionals. Language Testing 7, 52-76. doi:
10.1177/026553229000700105
Morgan, R., & Mazzeo, J. (1988). A Comparison of the structural relationships among
reading, listening, writing, and speaking components of the AP French Language
Examination for AP candidates and college students. (ETS Research Report No.
88-59). Princeton, NJ: Educational Testing Service.
Muthén, L. K., & Muthén, B. O. (2010). Mplus User’s Guide (6th ed.). Los Angeles, CA:
Muthén & Muthén.
Oller, J. W. Jr. (1974). Expectancy for successive elements: Key ingredient to language
use. Foreign Language Annals, 7, 443-452.
Oller, J. W. Jr. (1979). The factorial structure of language proficiency: Divisible or not?
In J. W. Oller, Jr. Language test at school: A pragmatic approach (pp. 423-458).
London: Longman.
Oller, J. W. Jr. (1983). A consensus for the eighties? In J. W. Oller, Jr. (Ed.), Issues in
language testing research (pp. 351-356). Rowley, MA: Newbury House.
Oller, J. W. Jr., & Hinofotis, F. B. (1980). Two mutually exclusive hypotheses about
second language ability: Indivisible or partially divisible competence. In J. W.
Oller, Jr. & K. Perkins (Eds.), Research in language testing (pp. 13-23). Rowley,
MA: Newbury House.
Pimsleur, P., Stockwell, R. P., & Comrey, A. L. (1962). Foreign language learning
ability. Journal of Educational Psychology, 53(1), 15-26. doi: 10.1037/h0044336
Römhild, A. (2008). Investigating the invariance of the ECPE factor structure across
different proficiency levels. Spaan Fellow Working Papers in Second or Foreign
Language Assessment, 6, 29-55. Ann Arbor, MI: University of Michigan English
Language Institute. www.isa.umich.edu/eli/research/spaan
Sang, F., Schmitz, B., Vollmer, H. J., Baumert, J., & Roeder, P. M. (1986). Models of
second language competence: A structural equation approach. Language Testing,
3(1), 54-79. doi: 10.1177/026553228600300103
Satorra, A., & Bentler, P. M. (1994). Corrections to test statistics and standard errors in
covariance structure analysis. In A. von Eye, & C. C. Clogg (Eds.), Latent
variables analysis (pp. 399-419). Thousand Oaks, CA: Sage.
Sawaki, Y., Stricker, L., & Oranje, A. (2008). Factor structure of the TOEFL Internet-
Based Test (iBT): Exploration in a field trial sample. (TOEFL iBT Research
Report No. 04; ETS Research Report No. 08-09). Princeton, NJ: Educational
Testing Service.
Sawaki, Y., Stricker, L. J., & Oranje, A. H. (2009). Factor structure of the TOEFL
Internet-based test. Language Testing, 26(1), 5-30. doi:
10.1177/0265532208097335
Scholz, G., Hendricks, D., Spurling, R., Johnson, M., & Vandenburg, L. (1980). Is
language ability divisible or unitary? A factor analysis of twenty-two English
proficiency tests. In J. W. Oller, Jr. & K. Perkins (Eds.), Research in language
testing (pp. 24-33). Rowley, MA: Newbury House.
Shin, S.-K. (2005). Did they take the same test? Examinee language proficiency and the
structure of language tests. Language Testing, 22(1), 31-57. doi:
10.1191/0265532205lt296oa
Shin, S. (2008). Examining the construct validity of a web-based academic listening test:
An investigation of the effects of response formats. Spaan Fellow Working
Papers in Second or Foreign Language Assessment, 6, 95-129. Ann Arbor, MI:
University of Michigan English Language Institute.
www.isa.umich.edu/eli/research/spaan
Skehan, P. (1991). Progress in language testing: The 1990s. In J. C. Alderson, & B. North
(Eds.), Language testing in the 1990s: The communicative legacy (pp. 3-21).
London: Macmillan.
SPSS Inc. (2009). PASW® Statistics Base 18. Chicago, IL: SPSS Inc.
Stricker, L. J., & Rock, D. A. (2008). Factor structure of the TOEFL Internet-Based Test
across subgroups. (TOEFL iBT Research Report No. 07; ETS Research Report
No. 08-66). Princeton, NJ: Educational Testing Service.
Stricker, L. J., Rock, D. A., & Lee, Y.-W. (2005). Factor structure of the LanguEdge™
Test across language groups. (TOEFL Monograph Series Report No. 32).
Princeton, NJ: Educational Testing Service.
Swinton, S. S., & Powers, D. E. (1980). Factor analysis of the Test of English as a
Foreign Language for several language groups. (TOEFL Research Report No.
06; ETS Research Report No. 80-32). Princeton, NJ: Educational Testing Service.
Taylor, C. (1993). Report of TOEFL score users focus groups. TOEFL 2000 Internal
Report. Princeton, NJ: Educational Testing Service.
Upshur, J. A., & Homburg, T. J. (1983). Some relations among language tests at
successive ability levels. In J. W. Oller, Jr. (Ed.), Issues in language testing
research (pp. 188-202). Rowley, MA: Newbury House.
Vollmer, H. J., & Sang, F. (1983). Competing hypotheses about second language ability:
A plea for caution. In J. W. Oller, Jr. (Ed.), Issues in language testing research
(pp. 29-79). Rowley, MA: Newbury House.
Wang, S. D. (2006). Validation and invariance of factor structure of the ECPE and
MELAB across gender. Spaan Fellow Working Papers in Second or Foreign
Language Assessment, 4, 41-56. Ann Arbor, MI: University of Michigan English
Language Institute. www.isa.umich.edu/eli/research/spaan
Wolf, M. K., Kao, J., Herman, J., Bachman, L. F., Bailey, A., Bachman, P. L.,
Farnsworth, T., & Chang, C. (2008). Issues in assessing English language
learners: English language proficiency measures and accommodation uses—
Literature review (Part 1 of 3). (CRESST Report No. 731). Los Angeles, CA:
CRESST/UCLA.
Woods, A. (1983). Principal components and factor analysis in the investigation of the
structure of language proficiency. In A. Hughes & D. Porter (Eds.), Current
developments in language testing (pp. 43-52). New York: Academic Press.