AT THE INTERFACE BETWEEN LANGUAGE TESTING AND SECOND LANGUAGE ACQUISITION: COMMUNICATIVE LANGUAGE ABILITY AND TEST-TAKER CHARACTERISTICS

by

Lin Gu

An Abstract

May 2011
ABSTRACT
This study investigated the nature of communicative language ability as manifested in performance on the TOEFL iBT® test, as well as the relationship between this ability and test-takers’ study-abroad and learning experiences. The research interest in the nature of language ability is shared by the language testing community, whereas understanding the factors that affect language acquisition has been a focus of attention in the field of second language acquisition (Bachman & Cohen, 1998). This study utilizes a structural equation modeling approach, a hybrid of factor analysis and path analysis, to address issues at the interface between language testing and second language acquisition.

The purpose of this study is two-fold. The first has a linguistic focus: to provide a better understanding of the nature of communicative language ability by examining the dimensionality of this construct in both its absolute and relative senses. The second purpose, which has a social and cultural orientation, is to investigate the possible educational, social, and cultural influences on the acquisition of English as a foreign language, and the relationships between test performance and test-taker characteristics.

The results revealed that the ability measured by the test was predominantly skill-oriented. The role of the context of language use in defining communicative language ability could not be confirmed due to a lack of empirical evidence. As elicited by the test, this ability was found to have equivalent underlying representations in two groups of test-takers with different language contact experiences. The superiority of the study-abroad environment over learning in the home country could not be confirmed. Certain test-taker characteristics were found to have significant associations with aspects of the language ability, although the results also suggested that variables other than the ones specified in the models may have had an influence on test performance.

From a test validation point of view, the results of this study provide crucial validity evidence regarding the test’s internal structure, this structure’s generalizability across subgroups of test-takers, as well as its external relationships with relevant test-taker characteristics. The findings also further our understanding of what constitutes the test construct, and how this construct interacts with the individual and social characteristics of test-takers.
AT THE INTERFACE BETWEEN LANGUAGE TESTING AND SECOND LANGUAGE ACQUISITION: COMMUNICATIVE LANGUAGE ABILITY AND TEST-TAKER CHARACTERISTICS

by

Lin Gu

May 2011

Copyright by LIN GU, 2011
CERTIFICATE OF APPROVAL

PH.D. THESIS

Lin Gu

Timothy N. Ansley, Thesis Supervisor
Helen H. Shen
Lia M. Plakans
Bonnie S. Sunstein
To my loved ones:
Mother: Ms. Zhuo Zhihua (卓志华)
Father: Mr. Gu Shenggen (顾生根)
ACKNOWLEDGMENTS

I would like to thank my advisor, Dr. Liskin-Gasparro, for her guidance, help, and encouragement in every phase of this dissertation study. Without her belief in me and her unfailing support, I would not have been able to see this study to its completion.

I am also indebted to Dr. Ansley for his insightful comments and challenging questions throughout this project.

Sincere thanks also go to Dr. Plakans, Dr. Shen, and Dr. Sunstein for their warm encouragement and valuable feedback.

I would like to thank the Graduate College of the University of Iowa and Educational Testing Service for funding this research project, and to thank Educational Testing Service for permissions granted to use their copyrighted test material and data.

Last, but not least, I wish to express my deep gratitude to my parents and to Mary for their love and support.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Introduction
Unitary Competence Hypothesis
Multidimensional Competency Models
Method Effect
    The Factor Analysis Approach
    The CFA-MTMM Matrix Approach
Test-Taker Characteristics and Test Performance
    Proficiency Level
    Other Background Characteristics
    Target Language Contact
    ETS–TOEFL® Studies
Making a Validity Argument
Conclusion
Linearity and Multicollinearity
Estimation Method
Assessing Model Fit
Establishing the Baseline Model
Modeling the Context of Language Use
Multi-Group Invariance Analysis
Building Structural Equation Models
Preliminary Analysis
The Baseline Model
The Context of Language Use
    The Content Dimension
    The Setting Dimension
    The Final Model
Multi-Group Invariance Analysis
    Measurement Invariance
    Structural Invariance
    Results of Multi-Group Invariance Analysis
Structural Equation Models
    The Home-Country Group
    The Study-Abroad Group
NOTES
REFERENCES

LIST OF TABLES

Table 10. Fit Indices from the Multi-Group Structural Invariance Analysis

LIST OF FIGURES

Figure 10. Factor Loading Invariance with Unstandardized Estimates, Group I
Figure 11. Factor Loading Invariance with Unstandardized Estimates, Group II
Figure 12. Indicator Residual Invariance with Unstandardized Estimates, Group I
Figure 13. Indicator Residual Invariance with Unstandardized Estimates, Group II
Figure 14. Factor Mean Invariance with Unstandardized Estimates, Group I
Figure 15. Factor Mean Invariance with Unstandardized Estimates, Group II
Figure 18. Structural Equation Model with Standardized Estimates, Group I
Figure 19. Structural Equation Model with Standardized Estimates, Group II
CHAPTER ONE
INTRODUCTION
Recent years have witnessed a growing number of foreign language (FL) learners in the United States and worldwide. To meet the assessment needs of this expanding population, tests whose development and validation are well grounded in applied linguistics research and sound measurement practice are in great demand.

The most recent edition of the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) provides criteria for developing and evaluating educational and psychological measures, and it regards these measures as “the most significant improvements over previous practices” (p. 1). The Standards treats validity as the most fundamental consideration in developing and evaluating tests, and defines validity as “the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests” (p. 9).

Validity has traditionally been divided into three types (Alderson & Banerjee, 2002; Bachman, 1990): content-related, criterion-related, and construct-related validity (Anastasi & Urbina, 1997; Messick, 1989). Messick (1989) challenged this tripartite view. In his seminal article, validity is “an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment” (p. 13; original emphases in italics). The earlier definition of three validities has since been replaced with a single unified view of validity (Chapelle, 1999). As pointed out by Bachman (1990), this unitary view of validity has been “endorsed by the measurement profession as a whole” (p. 236).
The concept of construct is central to test validation. The Standards (AERA, APA, & NCME, 1999) defines a construct as the concept or characteristic that a test is designed to measure, and ties score interpretation to the “construct or concepts the test is intended to measure” (p. 9). Construct validity has always been considered central to test validation (Cronbach & Meehl, 1955). In the unitary validity framework, construct validity is the fundamental, unifying concern. As Bachman (1990) observes, the language testing community has embraced and adapted the unitary validity concept, seeking to ground the FL construct framework on both theoretical and empirical grounds.

One important source of validity evidence concerns a test’s internal structure. Examining the structural component of a test was proposed to understand the degree to which the nature and dimensionality of the inter-item structure reflect the nature and dimensionality of the construct domain (Messick, 1989). A dimensionality analysis was also proposed by Chapelle (1999) to examine the fit of the observed test structure to the theorized construct. Evidence based on a test’s internal structure was likewise suggested by Wolf et al. (2008) to support test score interpretation. Such evidence could be used to interpret the nature of the FL construct: whether the construct is unidimensional or multidimensional and, if the latter, what the makeup of the construct is.
The generality of this FL construct has also been a focus of investigation in the field of language testing. Messick (1989) warned against taking the generalizability of a construct meaning across various contexts for granted. He proposed that context effects, among other threats, be examined empirically. Validity evidence based on a test’s generalizability was also proposed by Chapelle (1999) to ensure legitimate test score interpretation and uses across groups of test-takers, time, instruction conditions, and test task characteristics. The idea of a universally applicable construct structure is complicated by, among other factors, differences in the language to be measured (e.g., English, French, Chinese) and the usually heterogeneous nature of the test-taking population. This line of research helps to answer the question of whether the same construct structure holds across groups of test-takers who differ in terms of the features salient to a specific testing situation.
The research interest in the nature of language ability has been shared by the language testing community, whereas understanding the factors that affect language acquisition has been a focus in the field of second language acquisition (Bachman & Cohen, 1998). However, to ensure valid test score interpretation and uses, the acquisition factors that reside outside of a test need to be investigated in relation to test performance. In other words, validity evidence based on a test’s external structure is also needed.
Consider the following hypothetical example. Let us assume that the relevant literature uniformly asserts that study abroad is more beneficial to language gains than classroom learning in the home country. A study is designed to compare the language gains of study-abroad learners versus those of classroom learners. Contrary to the common belief upheld by the literature, test results show that the gains are equivalent. We are now confronted with the dilemma that the test’s external relationship with a situation factor (study abroad vs. classroom learning) does not conform to what has been demonstrated in the language acquisition literature. This would lead us to doubt whether the scores can be interpreted to have captured the real gains across the groups, and whether the test has been used appropriately for the purpose of this study.
Examining a test’s external relationships with other tests and with non-test behaviors has been proposed by both measurement professionals (AERA, APA, & NCME, 1999; Messick, 1989) and language testing researchers (Chapelle, 1990; Wolf et al., 2008). In the context of language testing, this line of research ensures that a test has the appropriate external relationships, compatible with both testing and acquisition theories.
This dissertation research investigates the nature of FL ability¹ and this ability’s relationships with test-takers’ study-abroad and learning experiences. The current view in the field holds that language ability is communicative in nature (Canale, 1983; Canale & Swain, 1980). Situated in the context of the TOEFL iBT® test², this study addresses issues at the interface between language testing and second language acquisition.
The purpose of this study is two-fold. The first has a linguistic focus: to provide a better understanding of the nature of communicative language ability by examining the dimensionality of this construct in both its absolute and relative senses, as measured by the TOEFL iBT test. Dimensionality in its absolute sense refers to the number and nature of the dimensions that underlie the construct; it implies the invariance of the proposed factorial structure, irrespective of factors such as the FL tested, the test chosen, the population studied, and the conditions under which language acquisition takes place, to name just a few. Dimensionality in its relative sense makes no claim to the invariance described immediately above. It also implies that different aspects of this proficiency, if the proficiency is indeed divisible, may develop at different rates and over different paths.
The second purpose, which has a social and cultural orientation, is to investigate the relationships between test performance and test-taker characteristics (TTCs). These characteristics can include, among others, native language, language spoken at home, years of immersion, years of schooling in the target language, and anticipated career. Of these, test-takers’ contact with the target language through study-abroad and learning experiences is the focus of this research study. By investigating the individual and social–cultural characteristics of test-takers in relation to test performance, this study can shed light on how differences in learning contexts and experiences may affect FL acquisition.
Although it has been generally agreed that FL proficiency is a complex construct with multiple dimensions, it is still unclear to the research community what this proficiency consists of, and how the constituent parts interact. Specifically, this research is grounded in the framework of communicative competence (Canale, 1983; Canale & Swain, 1980). Since its initial appearance, the concept of communicative competence has had a strong influence not only on language teaching and learning, but also on test design and development. Building on this concept, later work extended communicative language ability to academic contexts (Chapelle, Grabe, & Berns, 1997). According to the COE model, the TOEFL test is intended to measure communicative language ability in academic settings. By modeling TOEFL iBT test performance, this study will make both theoretical and empirical contributions to our understanding of the communicative language ability measured by the test. The results could also provide evidence based on the internal structure of the test for developing a validity argument for test score interpretation and uses.
A further motivation for this study concerns the relationships between TTCs and FL test performance. The field of FL testing has recently seen a growing number of studies on TTCs, such as gender (Wang, 2006), native language background (Shin, 2005), ethnicity and preferred language (Ginther & Stevens, 1998), and home language environment (Bae & Bachman, 1998), to name just a few. Multiple studies have also been conducted in TOEFL® testing to validate score interpretation and uses in light of differences among test-takers on background variables, such as reasons for taking the test (Swinton & Powers, 1980), native language background (Stricker et al., 2005), target language exposure in home countries (Stricker & Rock, 2008), and years of classroom instruction (Stricker & Rock, 2008). The general consensus reached by the language testing research community is that the factor structure underlying FL test performance can be interpreted more meaningfully if relevant TTCs are taken into consideration. However, the amount of information we have obtained is far from what we need to fully understand this complex and dynamic network of relationships. Moreover, there are still TTCs that have not attracted enough attention from language testing researchers, and these deserve closer examination.
One such TTC is contact with the target language. Language contact is a concept especially relevant in FL testing situations, where test-takers are non-native speakers of the target language who may have had the experience of studying and/or living in the target language environment. Learning and living experience in the target language environment has been investigated in relation to test performance in only a few studies (e.g., Kunnan, 1994, 1995; Morgan & Mazzeo, 1988). This TTC is salient in most FL testing situations, but has not been examined extensively.
This study will contribute to our knowledge of the relationships between TTCs and FL test performance, especially the test performance of test-takers with different language contact experiences. Because of the TOEFL iBT® test’s diverse test-taking population, examining the internal structure of the test across relevant subgroups of test-takers would provide evidence for the test’s generalizability in building a validity argument.
From a pedagogical perspective, the study will inform us whether study abroad has an effect on language learning and acquisition and, if so, in what ways. There has been a rapid growth of research on study-abroad contexts of learning, and these studies have generated mixed results (e.g., Collentine & Freed, 2004). The results of this study will provide empirical evidence on the impact of study abroad, if any. From a practical point of view, the results of this study could advise potential FL test-takers on how to prepare for a test, as well as how to improve their FL proficiency in general. Studying in a foreign country normally requires a huge investment, both financial and emotional, from test-takers and their families. It is my hope that the results of this study can shed light on whether such an investment pays off in terms of language gains.

The significance of this study for the interface between language testing and second language acquisition lies in two areas: (a) extending the language testing research agenda to include variables external to a test, so that we can have a better understanding of not only the nature of the FL construct but also this construct’s external relationships with relevant TTCs; and (b) employing a structural equation modeling (SEM) approach to tackle issues at the interface between language testing and second language acquisition.
Chapter One describes the context of the problem, explains the purposes of the
study, indicates the contributions and significance, and presents the organization of the
dissertation.
Chapter Two reviews the literature on the nature of FL proficiency. The review includes proficiency models that have been subjected to empirical investigation in a testing situation. Furthermore, only studies that have utilized latent factor analysis and structural equation modeling are reviewed. The organization of the chapter proceeds from the debate over the unitary competence hypothesis and then broadens the scope of the review to include various multidimensional competence models. The review shows that the concept of a multidimensional FL proficiency has been well received on both theoretical and empirical grounds, although the research community has not yet reached an agreement regarding the makeup of the construct. The next section reviews studies that have modeled the effect of test methods, as opposed to constructs, in their analyses of proficiency models. The review indicates that the test method facet ought to be viewed as an integral part of the factor structure underlying FL test performance. The last section is devoted to summarizing empirical studies that have explicitly modeled TTCs in their analyses. These studies are divided into four groups. The first group focuses on the relationship between proficiency level and the degree of factor differentiation. Studies in the second group examine other TTCs, such as native language background and proficiency, learning condition, and gender. Studies in the third group examine target language contact as a TTC by using multi-group factor analysis. Last, validation studies in the context of TOEFL® testing are reviewed and summarized. The results of the literature review demonstrate that two consensuses have been reached by the research community. First, FL proficiency is multidimensional in nature. Second, the factor structure of FL test performance can be interpreted more meaningfully if relevant TTCs are taken into consideration. Directions for future research are suggested at the end of the chapter. They are: (a) to test two competing hypotheses, a hierarchical model and a correlated-trait model; (b) to examine performance on integrated test tasks that require the use of multiple skills; and (c) to study TTCs that have not attracted enough attention from previous researchers, especially target language contact.
Chapter Three presents the design of the study. The background and development of the TOEFL iBT® test as well as the format of the operational test are first introduced. The reason for choosing this test and how the unique characteristics of this test can assist us in accomplishing the general research purposes of this study are explained. Situated in the context of TOEFL iBT testing, three research questions are proposed and six hypotheses are formulated. A layout of materials and data is provided, and the planned data analyses are described.
Chapter Four reports the results of the analyses. Taking both skill-oriented abilities and language use context into consideration, the outcomes of establishing a baseline model for the entire sample are reported first, followed by results from multi-group invariance analyses across groups of test-takers with different language contact experiences. Last, results from analyzing two separate structural equation models, one for each group, are reported.

Chapter Five provides a review of the study and a summary of the primary findings concerning the relationships among language skills and language ability, group membership and language ability, and learning contexts and language ability. Insights gained from this study on language test development and validation are then discussed, and the promise of structural equation modeling for research at the interface between language testing and acquisition is appraised. The chapter further discusses the study’s contributions and limitations. Recommendations for future research are offered at the end of the chapter.
CHAPTER TWO
LITERATURE REVIEW
Introduction
In one influential discussion of the foundations of language assessment, it was stated that what is central to language testing is “an understanding of what language is, and what it takes to learn and use language, which then becomes the basis for establishing ways of assessing people’s abilities” (p. 80). Bachman (2000) also emphasized the importance of the nature of language ability and language use. In his review of modern language testing at the turn of the twenty-first century, Bachman asserted that what has guided the development and refinement of new tests is “current thinking in applied linguistics about the nature of language ability and language use” (p. 2). These positions were echoed by Wolf et al. (2008) in their statement that a clear understanding of the construct to be measured should precede assessment design.

The essential message conveyed by these authors is that, prior to conducting any assessment, a clear understanding of the language proficiency being assessed should be established. This point has been well taken in the field, and numerous models of language proficiency have emerged. Some models are compatible with one another, whereas others are competing and contradictory. Some models are more general than others and, as such, depict the construct in diverse contexts, whereas a more local model tends to depict the construct as it applies in a particular context. Some models are more complex, as they tend to include as many construct elements as possible, whereas others are more parsimonious, as they include only the elements that pertain to a particular testing situation. Some models enjoy strong empirical foundations, whereas others rest on scanty empirical support. Among models that claim empirical support, a wide range of methodologies has been used by researchers.
To discuss every language proficiency model that has been developed by theorists and practitioners is beyond the scope of this review of the literature. An exhaustive review might also suffer from a lack of focus, and it would be difficult to draw any practical conclusions from it. Therefore, only models that have been subjected to empirical investigation in a testing situation are included. As argued in Chapter One, validity evidence based on a test’s internal structure, its external structure, and the test’s generalizability is central to validation, and a latent trait approach is most appropriate for conducting such validity inquiries. This chapter therefore reviews studies that have utilized a latent approach in their investigations of the nature of FL proficiency. The review first traces the debate over the unitary competence hypothesis and then broadens its scope to include various multidimensional competence models. The next section reviews studies that have modeled the effect of test methods, as opposed to constructs, in their examination of language proficiency models. The last group of studies branches out to examine variables that reside outside a test, namely test-taker characteristics (TTCs), and their relationships with test performance.

The chapter concludes that there is no single best model of FL proficiency, at least not according to the results of this literature review, on which a test developer can base his or her test. However, consensus in some areas has been reached by the profession.
Unitary Competence Hypothesis

Almost every article on the nature of language proficiency traces this line of research back to Oller’s (1979) view of language proficiency as a unitary entity and the empirical investigations of the factorial structure of language proficiency led by him and his associates. Oller’s research addresses one of the fundamental questions relevant to educators and researchers of language acquisition: What is the nature of the language proficiency construct?
From the standpoint of a language test developer, Oller (1979) framed his question as “whether or not language ability can be divided up into separately testable components” (p. 423). A component that is testable will yield scores that are reliable, which means that this component of language ability is a relatively stable trait. Being testable also means that the component can be distinguished from other components on empirical grounds. Three competing hypotheses regarding the divisibility of language ability were put forward by Oller (1979). The divisibility hypothesis claims that tests of different components, skills, aspects, or elements do not share common variance, and performance on each test is accounted for by its unique variance. On the contrary, the indivisibility or unitary competence hypothesis states that no test measuring a particular element has unique variance, and that all tests share the same variance. Taking the middle ground, the third hypothesis posits that performance on a particular test is accounted for by both shared variance with other tests and unique variance that pertains only to this test. This last hypothesis was also termed the partial divisibility hypothesis.
The theoretical support most widely invoked by the proponents of the unitary hypothesis comes from the study of human intelligence: Spearman (1904) posited a general factor, g, that dominates most of the variance in human performance. If there are good reasons for assuming that language proficiency functions more or less like other human activities, then a single general factor could be expected to dominate language performance as well. Additional support comes from Spolsky’s belief in two fundamental truths about language: Language is redundant and it is creative. Spolsky (1968) argued that global proficiency tests, which assess the ability to process redundant and creative language sequences, are essentially measures of linguistic competence rather than discrete skills. Overall proficiency, understood as internalized knowledge of rules, seems much the same as underlying linguistic competence, which operates in all kinds of performance. Built upon Spearman’s (1904) general factor of intelligence, the general language proficiency factor served as a convenient label for the psychologically internalized grammar that governs all kinds of language performance.
Using principal component analysis (PCA), a series of early studies tested the unitary competence hypothesis (Oller, 1979; Oller & Hinofotis, 1980; Scholz, Hendricks, Spurling, Johnson, & Vandenburg, 1980). In a PCA procedure, the first principal component extracted accounts for the largest amount of variance among the measures. In all these studies, the same factorial pattern emerged. The first principal component accounted for a considerable amount of common variance, and therefore was interpreted as a general language proficiency factor. All tests had high loadings on the first component, and the loadings on this component were higher than the loadings on the subsequently extracted component(s). The residual correlations among the measures, after the first principal component had been extracted, were negligible. By obtaining such a factorial pattern, these researchers believed that the unitary competence hypothesis was supported.
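The mechanics of this argument are easy to reproduce. The sketch below is a hypothetical Python illustration (no such code appears in the studies reviewed): when a battery of subtests draws heavily on one common factor, the first principal component absorbs most of the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate scores on five subtests (e.g., listening, reading, cloze,
# dictation, grammar) that all draw on one strong common factor.
n = 500
g = rng.normal(size=n)                      # general ability
scores = np.column_stack(
    [0.8 * g + 0.6 * rng.normal(size=n) for _ in range(5)]
)

# PCA via eigen-decomposition of the correlation matrix.  The diagonal
# is fixed at unity, so common, specific, and error variance are all
# analyzed together -- the very property Farhady (1983) criticized.
corr = np.corrcoef(scores, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)[::-1]    # sorted descending

print(eigvals / eigvals.sum())
# The first component typically absorbs well over half of the total
# variance, inviting a "unitary competence" reading of the data.
```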
Although a dominant global factor was repeatedly confirmed, the PCA method used in these studies was criticized and considered inappropriate for conducting this line of investigation (Carroll, 1983; Farhady, 1983; Vollmer & Sang, 1983). In terms of extracting the initial factors, as Farhady (1983) argued, since values on the diagonal are set at unity in PCA, all variances, including the common variance, specific variance, and error variance, are used to define the underlying factors. Instead of using PCA, principal factor analysis (PFA) can be used. In PFA, estimated communalities are assigned to the diagonal cells, and specific and error variance components are not included. An iterative method is used to refine the estimates of the communalities at each step of factor extraction. PFA thus uses only the common variance among the variables in the analysis while systematically discarding uniqueness.
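The iterated-communalities idea can be made concrete. The following is a minimal, hypothetical numpy sketch of principal factor analysis (not code from the studies reviewed): squared multiple correlations seed the diagonal, and communalities are re-estimated after each extraction so that only common variance defines the factors.

```python
import numpy as np

def principal_factor_analysis(corr, n_factors=1, n_iter=50):
    """Principal factor analysis with iterated communality estimates."""
    r = corr.copy()
    # Initial communality estimates: squared multiple correlations.
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(corr))
    for _ in range(n_iter):
        np.fill_diagonal(r, h2)             # reduced correlation matrix
        eigvals, eigvecs = np.linalg.eigh(r)
        idx = np.argsort(eigvals)[::-1][:n_factors]
        loadings = eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0))
        h2 = (loadings ** 2).sum(axis=1)    # updated communalities
    return loadings, h2
```

Unlike PCA, the uniqueness (specific plus error variance) never enters the solution, which is why PFA was put forward as the more defensible tool in this debate.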
Both PCA and PFA extract the first factor in such a way as to account for the maximum amount of variance in each and all of the variables. Variance from many different common factor sources is extracted because the factor vector is placed in such a way that as many of the variables as possible have substantial projections on it. The first factor can thus extract variance from several unrelated variables. It is highly inflated and, therefore, the subsequent factors are correspondingly deflated. The first factor, by the very nature of the extraction procedure, will always account for the greatest amount of variance.
Researchers (Carroll, 1983; Farhady, 1983; Vollmer & Sang, 1983) also argued for factor rotation as a means of arriving at more meaningful factor patterns, since the initial unrotated factors may not give the best picture of the factor structure and may lead to misinterpretations. By rotation, a simpler factor structure might be arrived at, with each variable loading primarily on only one factor, and each factor accounting for a maximum of the variance generated by the variables that load on it.
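The rotation step itself is mechanical. Below is a hypothetical numpy implementation of varimax rotation (not drawn from the reviewed studies) that rotates an initial loading matrix toward the simple structure these researchers advocated:

```python
import numpy as np

def varimax(loadings, gamma=1.0, n_iter=100, tol=1e-6):
    """Varimax rotation: maximizes the variance of squared loadings
    per factor, so each variable loads primarily on one factor."""
    p, k = loadings.shape
    rotation = np.eye(k)
    var = 0.0
    for _ in range(n_iter):
        lam = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (lam ** 3 - (gamma / p) * lam
                          @ np.diag((lam ** 2).sum(axis=0)))
        )
        rotation = u @ vt
        if s.sum() < var * (1 + tol):       # converged
            break
        var = s.sum()
    return loadings @ rotation
```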
Multiple-factor structures did emerge from some of the findings when analysis methods different from the PCA method were used. Oller and Hinofotis (1980) found a two-factor and a three-factor structure by using factor rotation techniques. Scholz et al. (1980) were able to describe the data with four independent dimensions by using PFA. However, these researchers still uniformly concluded that the unitary factor solution was the best theoretical explanation for their data. The major reason was that no clear pattern appeared in which tests were grouped according to the posited skills or components. They came to the agreement that there must be a general factor underlying performance on many language processing tasks. As Oller (1983) stated, although this general factor may not account for all of the variance, it dominates performance across language tasks.

As the language testing research community started to realize the limitations of using PCA to configure the latent structure of language proficiency, the notion of a unitary language competence became less convincing. As researchers argued (Farhady, 1983; Vollmer & Sang, 1983; Woods, 1983), the derived factors contained not only common variance that was shared by different tests, but also test-specific variance and error variance. Since a PCA approach is programmed to extract the greatest amount of variance from all sources, the technique generally overestimates the significance of the first derived factor. The unidimensional model claimed by Oller and his colleagues has therefore generally been regarded as an artifact of applying statistical techniques incorrectly.
While the field continues to pursue this question regarding the nature of language proficiency, the unitary competence hypothesis has lost its standing among language researchers. Results from a growing number of studies using factor analytic techniques point instead to a multidimensional view of language proficiency. In the next section, the discussion on the nature of language proficiency turns to multidimensional models of competence.

Multidimensional Competency Models

An early multidimensional conception of language proficiency grew out of structuralism (Carroll, 1965; Lado, 1961). According to this school of thought, language components are placed against skills. The language components are phonology or orthography, morphology, syntax, and lexicon. The skill set includes auditory comprehension, oral production, reading, and writing. Even though the idea of the independence of the various components and skills has never been expressed in the form of a hypothesis, the components and skills are considered to be logically different and therefore independent of one another. This position implies the strong form of the divisibility hypothesis: each component-by-skill combination could, in principle, be tested separately. In practice, however, even Carroll admitted that it seemed daunting and also unnecessary to test all 16 separate competences. A review of the relevant literature shows that the strong form of this hypothesis has not drawn much empirical evidence in its support.
A weaker position shifts the focus away from testing discrete components such as structure or lexicon. Naturally, the four skills of listening, reading, speaking, and writing, which are regarded as integrated performance based on the candidate’s mastery of the whole array of language components, receive focus in this integrated approach to testing. On this view, an ideal language proficiency test should make it possible to differentiate levels of performance in each of the four skills.

After reviewing the history of the structuralist approach to language teaching and testing, Vollmer and Sang (1983) formulated one possible weak form, hypothesizing that each of the four skills could be focused upon and that each one could be measured more or less independently from the others. In this weak view of divisibility, executing skills like listening, reading, speaking, and writing requires the integrated use of different levels of language knowledge. The four-skills approach has had an enormous impact on how FL has been taught and tested. There is abundant evidence of teaching the four language skills as more or less distinguishable abilities.
Neither the unitary hypothesis nor the complete divisibility hypothesis can be justified in light of results from empirical studies. Oller (1983) conceded that multiple factors might underlie language proficiency. He was still in favor of a general factor, but this general factor might be partitioned into components. He further proposed that hierarchical models with one general factor and several first-order factors would work best in representing language proficiency. From the structuralist tradition, Carroll (1983) also started to recognize the possibility of the existence of a general language ability as the indicator of the overall rate of language development. At the same time, he maintained that language skills could be separately recognized and measured because skills have the potential to develop at different rates.

Results from early empirical studies convinced the field to embrace the multidimensional concept of FL ability (Carroll, 1958; Gardner & Lambert, 1965; Hosley & Meredith, 1979; Pimsleur, Stockwell, & Comrey, 1962). All of these studies produced factor structures in which FL proficiency appeared divisible instead of unitary. The trend shows that FL proficiency can be better explained by multiple factors than by a single general one.
However, the resulting factorial structures often vary vastly from one study to another, and little agreement exists on what the structure of FL proficiency could look like. In results from a majority of these studies, a more general factor appeared along with a set of more specific factors. However, questions remain regarding how this general factor relates to the specific factors, as well as what these specific factors are and how they relate to one another. The inconsistency of the resulting factorial structures from various studies does not allow for strong arguments in the development of a theory of FL proficiency. Vollmer and Sang (1983) offered several suggestions to tackle this issue. They proposed, among other things, hypothesis-driven analyses of the factorial structure of language ability. Their observations showed that factor analysis was used in an exploratory way in most of the earlier studies; therefore, results were largely limited to the labeling and ad hoc interpretation of the factors. Using confirmatory factor analysis (CFA) instead of exploratory factor analysis (EFA) could provide researchers with more opportunities for theory testing and theory building. As the field moved forward, these issues were recognized and treated with care in more recent studies.
Later studies examined these competing models within a latent trait approach. Four hypotheses are commonly tested in this body of research. Two of them are derived from Oller’s (1979) partial divisibility hypothesis. The first of these is called the correlated trait hypothesis, which posits that a number of factors underlie FL ability, and that these factors are correlated with each other. Competing with it is the higher-order hypothesis, which posits a number of primary (or first-order) factors subsumed under a general (or higher-order) factor. This general factor affects test performance only indirectly, through the primary factors. The assumption behind this hypothesis is that the correlations among the primary factors can be accounted for by the general factor. These two hypotheses are usually examined against each other and against the other two hypotheses, the unitary hypothesis and the divisibility hypothesis. The unitary hypothesis states that the variance in FL performance is accounted for by only one general underlying factor. The divisibility hypothesis claims that a number of factors account for the variance in FL performance, and that these factors are independent of one another.
The four hypotheses were first tested by Bachman and Palmer (1982), who posited three traits: grammatical competence, pragmatic competence, and sociolinguistic competence; phonology and graphology were not included in this framework. Each trait in this study was measured by using four different methods: interview, writing, multiple-choice, and self-rating. In the results, both the unitary hypothesis and the complete divisibility hypothesis were rejected. The model with a higher-order general factor and two uncorrelated primary trait factors provided the best fit. Grammatical competence and pragmatic competence clustered together, and emerged as one trait that was distinct from sociolinguistic competence. Except for the grammar measures, the rest of the measures loaded heavily on the general factor. This study found strong support for a substantial general factor, although the nature of this higher-order general factor remained open.

The framework investigated in Bachman and Palmer (1982) was based on the theory of communicative competence proposed by Canale and Swain (1980), and the authors hoped that their study could provide empirical support for some of the theoretical distinctions it drew. Canale and Swain’s original framework consisted of grammatical competence, sociolinguistic competence, and strategic competence, with the latter two being considered as equally important as the first. The framework was later revised by Canale (1983) with the addition of discourse competence to the original model.
As Canale (1983) pointed out, the tremendous interest that communicative competence generated was evident across the field, and its acceptance moved the general discussion about the nature of language proficiency farther away from the unitary proposition. However, as Canale (1983) acknowledged, the framework was more a theoretical than an empirical model. Little empirical evidence has been found to distinguish the four aspects of the competence, or to specify the manner in which the four areas interact. Although the framework has been criticized for being based mainly on theoretical rather than empirical work (Cziko, 1984), it has also been credited with advancing the field’s knowledge, upon which future efforts at proficiency model building could draw.
The four hypotheses were tested again in Bachman and Palmer (1983). They conducted a construct validation study of the Foreign Service Institute (FSI) oral interview test. Each posited trait, speaking or reading, was measured by three methods: interview, translation, and self-rating. Based on the results from the six measures, they found that the two partial divisibility hypotheses provided a better fit to their data. They thus rejected the unitary trait hypothesis and the divisibility hypothesis. Although the model of a general factor with two uncorrelated primary traits fit the data best statistically, the authors decided to adopt the two-factor correlated trait model because of the principle of parsimony. In their argument, the latter model had fewer free parameters to estimate, and therefore was preferred as the more parsimonious model. The results demonstrated strong support for the distinctness of speaking and reading as separate but related components of language proficiency.
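Comparisons between nested models of this kind conventionally rest on a chi-square difference (likelihood-ratio) test; the following is a sketch of the standard logic, not a formula quoted from the study:

```latex
\Delta\chi^{2} \;=\; \chi^{2}_{\text{restricted}} - \chi^{2}_{\text{full}},
\qquad
\Delta df \;=\; df_{\text{restricted}} - df_{\text{full}}
```

If the chi-square difference is not significant at the corresponding difference in degrees of freedom, the restricted model fits no worse than the fuller one, and the principle of parsimony favors retaining the model with fewer free parameters.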
Sang, Schmitz, Vollmer, Baumert, and Roeder (1986) claimed that neither the unitary competence hypothesis nor the complete divisibility hypothesis could be upheld in light of empirical data. Instead of using the four-skills framework, they postulated a factor structure with three correlated dimensions. The first dimension, basic knowledge elements, covered grammatical and lexical knowledge. The second dimension involved the integration of basic knowledge elements, and the third concerned oral interaction. The authors found that the posited correlated three-factor model fit the data; however, the correlations among the latent factors were high. To rule out a rival one-factor model based on the unitary factor hypothesis, a competing one-factor model was tested and rejected based on the results of model fit comparison. No higher-order model was tested in this study, so the potential comparison of a higher-order model and a correlated-factor model was left unexamined.
Using both factor analysis and comparison of group means, Harley, Cummins,
Swain, and Allen (1990) hypothesized a three-trait structure that included grammar,
discourse, and sociolinguistic competence. Each trait was measured in three modes: an
oral productive mode, a written productive mode, and a multiple-choice written mode.
The analysis failed to confirm the hypothesized three-trait structure. Instead, a two-factor solution, with a general factor and a written method factor, was found. Although factor
analysis did not show strong support for the hypothesized distinctions, the analysis of
mean comparisons did provide some evidence in support of a distinction among the
hypothesized traits. The pattern of score differences between native speakers and second
language learners was found not to be uniform across the three competence areas. Based
on this observation, the authors argued that since language skills could develop at
different rates, they could then be distinctly identified and measured separately.
Two competing models concerning structure–reading and discourse factors were investigated in Fouly, Bachman, and Cziko (1990): a correlated-trait hypothesis and a higher-order hypothesis. Both models proved to fit the data fairly well. The study concluded that both distinct factors and a general language factor existed.
Bachman, Davidson, Ryan, and Choi (1995) investigated whether two test batteries were comparable. The battery developed by Educational Testing Service (ETS) included the TOEFL® test, the SPEAK® test, and the TEW® test. The Cambridge-based test battery consisted of five measures of the First Certificate in English (FCE).
A similar correlated two-factor solution was found within each test battery. In
both cases, measures of listening and speaking loaded predominantly on one factor.
Written tests, on the other hand, loaded primarily on the second factor. A cross-test
analysis using all 13 variables was performed to investigate whether the two tests indeed
measured similar abilities. A correlated four-factor solution was obtained, with a speaking factor, a listening factor, a written-mode factor associated with the ETS test battery, and a written-mode factor associated with the Cambridge test battery. Because all the factors were substantially correlated, the data were further fit to a higher-order factor model. The resulting factor structure had a higher-order general factor and four uncorrelated primary factors. It was found that the higher-order factor accounted for a large proportion of the variance in each test battery. The authors claimed that the results of this comparative study supported the position that FL proficiency consists of a general factor along with several specific factors.
A more recent study (Shin, 2005) used the data from the Bachman et al. (1995)
comparability study, and found a correlated three-factor solution for the ETS test battery.
The factor structure with a listening factor, a speaking factor, and a written mode factor
closely resembled the findings from Bachman et al. (1995). The Shin (2005) study also
rejected the unitary trait model and the complete divisibility model. Between the higher-
order model and the correlated-trait model, the higher-order model was found to be
optimal in representing the data, and it was also considered to be more parsimonious.
Focusing on listening trait validity, Buck’s (1992) study was significant in this line of research in two ways. First, the study successfully demonstrated the uniqueness of the listening trait and its difference from the reading trait, in contrast to the previous studies that had failed repeatedly to make this distinction (Carroll, 1983; Oller & Hinofotis, 1980; Scholz et al., 1980). Second, the study illustrated the process of designing measures that operationalized the listening trait properly. Buck pointed out a number of ways in which listening and reading differ: mode of input, information density, and different vocabulary, among others. He also suggested that contextual variables be considered explicitly in measure design. By controlling for these textual and contextual variables, a measure can be made to include many of the characteristics of the normal listening situation, and become theoretically valid in operationalizing the underlying listening trait. Each trait, listening and reading, was measured by multiple methods, including gap-filling tests and translation. The results yielded a model with two closely correlated traits: listening and reading. When addressing this close relationship between listening and reading, Buck asserted that reading and listening share a common general comprehension ability, but that a distinction can nevertheless be drawn between tests of general language comprehension and tests of the listening comprehension trait. The results supported the position that language ability is divisible, even where two closely related comprehension abilities are concerned.
A related study distinguished an ability to comprehend relatively short context from an ability to comprehend relatively long context. The author detected that the two partial divisibility models could not be distinguished statistically, meaning that they fit the data equivalently. Model comparison between the partial divisibility models had yielded similarly unfruitful results in the past (Bachman & Palmer, 1983; Fouly et al., 1990). A higher-order model with three first-order factors and a correlated three-factor model could not be separated through statistical comparisons, and therefore, the superiority of one model over the other could not be established. The author called for future studies with more first-order factors in order to settle this controversy over the best-fitting model for second language proficiency. Nonetheless, the two partial divisibility models fit the data much better than the unitary model or the uncorrelated model. The researcher concluded that there were at least several distinct trait factors between the general factor and the observed variables, and that these specific trait factors were not independent of each other.
While the majority of the studies to date have focused on English as a FL, Bae and Bachman’s (1998) study, which examined Korean as a FL and as a heritage language, was an exception that departed from the English norm.
ancestry who were assumed to have exposure to Korean at home constituted the heritage
group. Learners of non-Korean ancestry who were assumed to have little or no out-of-
class contact with the Korean language were in the non-heritage group. By focusing on
the two receptive language skills, reading and listening, the study showed that a
correlated two-factor model described the data for both the heritage group and the non-
heritage learner group. A rival model with a single factor was also tested, and the fit of
this one-factor model was significantly worse than that of the two-factor model. The
authors concluded that the same underlying two-factor pattern applied for both groups.
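Checking whether the same factor pattern holds in two groups, as Bae and Bachman did, can be sketched as a multi-group CFA. The hypothetical fragment below (semopy-style syntax, invented file and variable names) fits one specification to each group as an informal first check; formal invariance testing would additionally constrain loadings to be equal across groups:

```python
import pandas as pd
import semopy

# Same two-factor specification for both groups.
desc = """
listening =~ listen1 + listen2 + listen3
reading   =~ read1 + read2 + read3
"""

data = pd.read_csv("korean_scores.csv")   # includes a 'heritage' column

# Fit the model separately within each group and compare the solutions.
for name, group in data.groupby("heritage"):
    model = semopy.Model(desc)
    model.fit(group.drop(columns="heritage"))
    print(name)
    print(model.inspect())
```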
In the studies discussed so far, relationships among the factors were specified as either nonexistent or non-directional. This means that factors were hypothesized either to be independent of one another or simply to covary. Directional relationships among the factors have not been reported extensively in the literature. Only a few studies examined such possible relationships. Upshur and Homburg (1983) found that while grammar and vocabulary were correlated, they both affected reading. In their study, the relationships between grammar and reading and between vocabulary and reading were modeled not as reciprocal, but as directional. This model implies that the development of grammar and vocabulary abilities has a direct effect on the development of reading ability. Directional relationships among the latent factors were also tested in Sang et al. (1986) in a three-factor model concerning grammar ability, vocabulary ability, and reading ability. The relationships were directional, as grammar and vocabulary ability were modeled as determinants of reading ability.
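In SEM terms, such directional hypotheses simply add regression paths between latent variables. A hypothetical specification (same lavaan-style syntax, invented indicator names) of the pattern tested by Upshur and Homburg (1983) and Sang et al. (1986):

```python
# Grammar and vocabulary are correlated exogenous factors with
# directional paths into reading: their development is modeled as
# driving the development of reading ability.
desc = """
grammar    =~ gram1 + gram2 + gram3
vocabulary =~ voc1 + voc2 + voc3
reading    =~ read1 + read2 + read3
reading ~ grammar + vocabulary
"""
```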
There are reasons for this lack of interest in testing directional relationships among the factors. Factor analysis, having its roots in correlation and regression analysis, can by itself establish only association, not a causal (i.e., directional) relationship. To justify that one construct causes another, one needs to rely on an experimental design that applies random sampling procedures and that controls for extraneous variables. In FL research, such an experimental design can hardly be achieved, because there are often too many extraneous factors to be controlled, and also because the construct under investigation is often poorly defined. Considering the wide variety of specific language factors that have emerged in the literature and the many possible relationships among them, it seems almost impossible to single out one construct from the rest and focus on its directional relationships.
The previous discussion should also bring our attention to the nature of the higher-order general factor that has repeatedly emerged in most of the studies discussed above. A higher-order model asserts that the general factor accounts for the correlational relationships among the first-order factors. However, the nature of this general language factor remains unclear, and researchers have offered different concepts that can be used to define this general language ability. Oller (1983) claimed that it is pragmatic mapping that underlies the ability. Bachman and Palmer (1982) asserted that this general ability has something to do with information processing in extended discourse. As vague as it sounded, Bachman et al. (1995) suggested that the general factor can be interpreted as the common aspect of language proficiency shared by the measures in both test batteries.
As summarized by Fouly et al. (1990), the nature of this general factor is open to
speculation and future research. There may even exist more than one higher-order factor,
depending on the number of first-order traits and the correlational relationships among
them. Although the number and the nature of the measures used have differed from study to study, the field seems to have come to the consensus that language ability comprises a number of specific components, although questions remain regarding what these specific components are and how they interact.
Across these studies, the number of measures used ranged from 6 (Bachman & Palmer, 1983) to 13 (Bachman et al., 1995). The format of the measures also varied, ranging from multiple-choice items to interactive speaking tasks. Furthermore, the number and the nature of the traits in the models for hypothesis testing varied vastly from one study to another. In some studies, the traits were derived from a theoretical model of competence (Bachman & Palmer, 1982). In other cases, a study was conducted for the purpose of test score validation (Bachman et al., 1995). Therefore, what was included in the factor model depended on the nature and purpose of the test, not on a particular theory. There were also times when the researchers intended to examine only some specific traits of interest and, therefore, included only these traits in their models (Buck, 1992; Shin, 2005).

In sum, the concept of a multidimensional FL proficiency has been well received on both theoretical and empirical grounds (Kunnan, 1998a), although the research community has not yet reached an agreement regarding the nature of the construct and the makeup of its components. In a recent publication on issues in assessing English language learners (Wolf et al., 2008), the authors emphasized the field’s uncertainty as they declared that “there has been no consensus across the tests in the definition of language proficiency” (p. 22).
The field needs to find a mechanism to reconcile findings from different studies in order to move forward on the agenda of defining what FL proficiency means and how we should measure it.
Method Effect
The major goal pursued in the studies reviewed above is the identification of one or more factors underlying FL proficiency. Given that a trait or a factor is latent and cannot be measured directly, test measures are needed to elicit performance that will reflect the target trait structure. We therefore take test performance on such measures as indicators of the latent trait structure. In most cases, multiple indicators or measures are needed to define each latent factor.

The models reviewed so far are mostly trait-based or skill-based. These models assume that no other factor contributes systematically to test performance. This assumption, however, cannot be guaranteed. If a particular test method (e.g., a cloze test) affects test performance in a systematic way, this method-specific influence should be taken into consideration when the overall factor structure of test performance is modeled. That is to say, if a special ability can be developed to respond to a particular type of test method, this method-specific ability needs to be modeled alongside the trait factors.
To say that score interpretation on a FL measure is valid is to suggest that the test score reflects how test-takers respond to the test items differently as a result of their different standings on the ability continuum, rather than as a result of their differential mastery of a particular test method. In other words, if another test of a different method designed to measure the same trait yields the same pattern in test results, we can conclude that convergent evidence has been found that the two different measures assess the same trait, and that it is the trait, not the test method, that contributes to the resulting test performance. In reality, method effect can hardly be ignored, and trait effect is always intertwined with method effect. A review of the relevant literature suggests that there are factors other than language proficiency that might also contribute to the underlying structure of test performance (Bachman et al., 1995; Bachman & Palmer, 1982, 1983).

In the previous sections, the focus of the review was on configuring the trait, namely FL proficiency. In the current section, this focus is broadened as method factors are investigated along with the trait factors. Some of the studies reviewed in the previous sections are reexamined from this new perspective along with some new studies. In these studies, test methods were investigated as potential latent factors responsible for test performance. The delicate relationship between test method and trait in the context of FL testing is discussed at the end of the review.
Two analytic approaches have been used to examine the distinctness of trait and test method. The first is confirmatory factor analysis (CFA), which has been widely used in FL proficiency dimensionality studies; instead of focusing exclusively on the possible trait factors, test method effects are also modeled as potential latent factors. The second approach combines CFA and the multitrait-multimethod (MTMM) matrix method. Both approaches provide a means of separating the effect of the measurement method from the effect of the trait being measured. Using either approach, the studies cited below demonstrated the influence of test method on test performance.
In Bachman and Palmer’s (1983) validation study of the FSI oral proficiency interview, three different methods were used to provide both convergent and discriminant evidence for the construct validity of the oral interview test. Besides the interview method,
translation and self-rating were also used to measure the two traits: speaking and reading.
Findings from the analyses showed that models with three correlated method factors
consistently provided a better fit than models positing no method factor or with fewer
than three methods. A model with two correlated trait factors and three correlated method
factors was shown to provide the best fit to the data. Among the three methods used,
measures involving the oral interview method loaded more heavily on trait than on
method, whereas measures using translation and self-rating loaded more heavily on
method than on trait. The authors claimed that the interview method demonstrated the
largest trait component and the smallest method component, showing both convergent
and discriminant validity evidence for the score interpretation and use.
In another study, Bachman and Palmer (1982) again employed multiple measures of the posited traits of grammatical, pragmatic, and sociolinguistic competence. The four methods used in this study were interview, writing sample,
multiple-choice test, and self-rating. As far as the method factors were concerned, models
with three correlated methods proved to fit better than models with four uncorrelated
methods. The three correlated method factors were interview, writing and multiple-choice
as one method factor, and self-rating. They also found that measures using interview and
self-rating loaded more heavily on method than on trait factors, whereas measures using
the writing and multiple-choice method consistently loaded more heavily on trait than on
method factors. This result contradicted the findings in their 1983 study, where measures
involving oral interview demonstrated more trait effect than method effect.
In both studies, besides language-related traits, test methods also accounted for test performance. In the first study, three correlated method factors (oral interview, translation, and self-rating) were built into the factor structure. In the second, the three correlated method factors were interview, writing and multiple-choice combined, and self-rating. In both cases, models including method factors proved to fit the data better than models positing only language-related factors. These results showed that when configuring latent factor structures underlying test results, test method effects should be taken into account.
Test method factors have also been found in other studies investigating FL
proficiency (Harley et al., 1990; Song, 2008; Turner, 1989). Harley et al. (1990) used multiple measures to assess hypothesized proficiency components. They categorized their measures by mode: oral productive mode, written productive mode, and multiple-choice written mode. For example, a grammar oral production task could consist of a structured interview scored for accuracy of verb forms, whereas a written production task could ask examinees to write a formal request letter and an informal note scored for the ability to distinguish formal and informal register. In their final factor model, a
general factor and a written method factor emerged from the test results on the various
measures. Although the nature of the general factor remained unclear, the written method factor clearly reflected an effect of the shared written mode.
An audio mode versus written input mode difference was found in Song (2008). This study examined the similarities and differences in factor structure underlying listening and reading abilities. In the results, three kinds of skill-related factors emerged, distinguished in part by the mode of input.
Factors other than language ability that contributed to performance on cloze tests were found in Turner (1989). Eight cloze tests were given, varying in first language (L1) or second language (L2), content domain (two topics), and contextual constraints (local and global). The results helped answer the question of what knowledge was required for completing cloze tests. Based on the results from factor analyses, the model that demonstrated the best fit was composed of three uncorrelated factors: two distinct language factors (L1 and L2) and a general factor, which was defined by Turner as cloze-taking ability (CTA). The author
claimed that the inclusion of the L1 factor made it possible to distinguish the L2 factor
from the CTA factor. Regarding the measures in L2, the findings showed that L2 cloze
performance was dependent on both the L2 language factor and on the CTA factor. The
results also showed that this CTA factor had greater influence on cloze performance than
did the L2 language factor. The presence of the CTA factor in the final factor structure
demonstrated that method effect existed in cloze tests across linguistic boundaries.
The possibility that a factor reflects method rather than trait cannot be ruled out simply based on the results from statistical analysis. Substantive knowledge
should always be called upon to determine the nature of the factors, and how to label
them. It is likely that a seemingly language-related factor can also be interpreted as a test
method factor. Taking the results from Bachman et al.'s (1995) comparability study as an example, the factors in the final solution across both test batteries included a speaking
factor, a structure, reading, and writing factor associated with the ETS test battery, a
structure, reading, and writing factor associated with the Cambridge test battery, and a
listening and interactive speaking factor. The second and third factor both included
measures in the written input mode. The distinctiveness of these two factors might have
lain either in the kind of language ability into which each test battery tapped or in the
kind of methods each battery used. In the latter case, this distinctiveness reflected method
effect across the two test batteries. The last factor, listening and interactive speaking,
included the TOEFL® test listening section, the FCE listening comprehension section,
and the FCE oral interview. At first it might seem strange that the FCE oral interview
would load on this factor with two listening measures instead of on the speaking factor
with the SPEAK® test from the ETS battery. However, after taking a closer look at the
nature of the measures, we would find that the FCE oral interview was interactive in nature and demanded substantive listening skill from the test takers, whereas the SPEAK test was monologic, entailing no oral interaction and no listening requirement. Therefore the last factor was
essentially the ability to perceive and process aural input, in other words, the listening
ability.
The discussion has now reached a point where it is hard to draw a clear-cut line between trait and method. Why did the structure, reading, and writing sections all load on a single factor? Is it because all of the measures share the same written mode, so that the real differences among the language-related abilities are masked by the commonalities of the methods?
All of the measures used in the studies reviewed here can be categorized in many
different ways. Measures can be separated by input mode as either aural or written. They
can be divided by output mode as either oral or written. They can also be distinguished by response format as either selected or constructed. It is clear that a listening test must have aural input. However, whether the listening test items elicit selected or constructed responses depends on the design of a particular test. When a short-answer item type is chosen, might listening ability be confounded with writing ability, so that listening and writing emerge as a single factor simply because the two measures share a common output mode? Does this mean that selected-response listening items are preferable because they measure the unique listening construct with no influence from writing? But can we be sure that the abilities being tapped by constructed-response listening tasks, such as processing, internalizing, and reconstructing aural input, are not part of the listening construct? These questions remain to be answered by the field.
The MTMM method has been used widely in construct validation studies to
investigate both convergent and discriminant validity, and to distinguish between method
effect and trait effect. The traditional MTMM method (Campbell & Fiske, 1959) provides
a straightforward mechanism to detect and distinguish the trait and method effects by
simply comparing the correlations in a MTMM matrix without having to conduct any
significance test. However, it does not specify any underlying factorial model to quantify
and account for the relationships between the measures and the traits. Due to this inadequacy in dealing with multivariate data, model evaluation and comparison cannot be performed. The CFA approach addresses this limitation by fitting latent factor models to MTMM matrix data. In this approach, model specification is performed at the onset of the analysis. The best fitting model for the data is selected as the baseline model, to which a series of alternative models are compared by using chi-square difference tests. The results of these model comparisons tell us the degree of trait convergence, trait divergence, and the extent of method effects.
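The logic of the chi-square difference test is worth stating explicitly. For a restricted model R nested within a fuller model F (that is, R is obtained by constraining parameters of F), the difference between the two model chi-square statistics is itself chi-square distributed under the null hypothesis that the constraints hold; this is standard SEM practice rather than anything particular to the studies reviewed here:

```latex
\Delta\chi^2 = \chi^2_R - \chi^2_F, \qquad
\Delta df = df_R - df_F, \qquad
\Delta\chi^2 \sim \chi^2_{\Delta df} \text{ under } H_0
```

A significant chi-square difference indicates that the restrictions (for example, removing the trait factors) significantly worsen model fit.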
A CFA approach of MTMM data was adopted in Llosa (2007) to investigate the
construct validity of a standardized test and a classroom assessment. The MTMM data
were based on the test results from Grades 2, 3, and 4 students on two English language proficiency measures: the California English Language Development Test (CELDT) and the English Language Development (ELD) Classroom Assessment
measures. To be specific, the study aimed to examine the extent to which these two tests
measured common language ability constructs, the extent to which the constructs
measured by the tests were distinct, and the extent to which the differences in test design affected performance. A baseline model with four correlated trait factors (Listening, Speaking, Reading, and Writing) and one higher-order general factor was first established. The two methods,
the CELDT and the ELD Classroom Assessment measures, were modeled onto the
baseline model as either correlated or uncorrelated. This established two versions of the
trait-method model. In Grade 4, the trait-method model with correlated method factors
was confirmed to provide the best fit while in Grades 2 and 3, the trait-method model
with uncorrelated method factors was selected as the best model. The results showed that the method factors operated somewhat differently across the grade levels.
To test convergence, the fit of a trait-method model was compared to the fit of a
method-only model (correlated method factors for Grade 4 and uncorrelated method
factors for Grades 2 and 3). Significant improvement in model fit of the former over the
latter indicated that method alone could not satisfactorily account for the data. This also confirmed the presence of the trait effects.
To test divergence, the fit of the baseline model was compared to the fit of a unidimensional trait-method model in which only one general trait was specified. The significant chi-square difference result indicated that the traits, although related, were distinct. To test for method effects, the fit of the trait-method model was compared to the fit of a trait-only model. Substantial method effects were detected based on the result of significant model fit improvement provided by the former over the latter.
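As a concrete illustration of one such nested comparison, the trait-divergence test, the sketch below fits a correlated three-trait model and a single-general-trait model to the same data and computes the chi-square difference. It is a minimal sketch only: it assumes the third-party semopy package (which accepts lavaan-style model strings), all variable and file names are invented, and the exact fit-index column names may vary across semopy versions.

```python
import pandas as pd
import semopy
from scipy.stats import chi2

# Hypothetical data: nine observed scores, three per skill, one row per examinee.
data = pd.read_csv("scores.csv")  # columns spk1..spk3, rdg1..rdg3, wrt1..wrt3

# Fuller model: three correlated trait factors.
correlated_traits = """
Speaking =~ spk1 + spk2 + spk3
Reading  =~ rdg1 + rdg2 + rdg3
Writing  =~ wrt1 + wrt2 + wrt3
Speaking ~~ Reading
Speaking ~~ Writing
Reading  ~~ Writing
"""

# Restricted model: one general trait only (equivalent to fixing all
# factor correlations to one), so the two models are nested.
general_trait = """
General =~ spk1 + spk2 + spk3 + rdg1 + rdg2 + rdg3 + wrt1 + wrt2 + wrt3
"""

def fit_chi2(description, frame):
    """Fit a model and return its chi-square statistic and degrees of freedom."""
    model = semopy.Model(description)
    model.fit(frame)
    stats = semopy.calc_stats(model)  # one-row table of fit indices
    return stats["chi2"].values[0], stats["DoF"].values[0]

chi2_full, df_full = fit_chi2(correlated_traits, data)
chi2_restr, df_restr = fit_chi2(general_trait, data)

# A significant difference means collapsing the traits worsens fit,
# i.e., evidence of trait divergence.
diff, ddf = chi2_restr - chi2_full, df_restr - df_full
print(f"delta chi2 = {diff:.2f} on {ddf:.0f} df, p = {chi2.sf(diff, ddf):.4f}")
```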
Trait convergence, trait divergence, and method effects were further examined in
light of individual parameter estimates. The results deviated from those of the overall
model comparisons to some extent. As far as convergence was concerned, the proportion
of method variance exceeded that of trait variance in the baseline model for Grade 4.
Trait discrimination was also challenged by the uniformly high loadings of the trait factors on the higher-order general factor.
This study demonstrated how a CFA approach could be applied to MTMM data to
detect and distinguish trait and method effects. Substantial method effects were found,
meaning that performance on the two tests reflected not only the traits measured but also
factors associated with the instruments used. It also showed that the trait effects were
significant in accounting for test performance, and that a trait-method model provided the best account of the data.
To investigate the construct validity of an academic listening test, a CFA approach to MTMM data was also employed in Shin (2008). The
three methods under investigation were summary task, incomplete outline, and open-
ended questions. The academic listening constructs specified in the study were the
abilities to extract main ideas, major ideas, and supporting details. A baseline trait-
method model with correlated traits and uncorrelated method factors was established at
the outset. The baseline model was then compared to a method-only model with no trait
factors to find convergent evidence. To search for discriminant evidence, the baseline
model was compared to a trait-method model with one general trait factor, and to a trait-method model with uncorrelated trait factors. Both convergent and discriminant evidence was found.
A higher-order model, which was equivalent to the baseline model in terms of fit
statistics, was chosen for examining the individual parameter estimates. The relative
effects of trait factors were mostly larger than those of the method factors. Although for
each trait factor some tasks were stronger indicators than others, the author concluded
that all three response formats were valid indicators of academic listening
comprehension.
The CFA approach to MTMM data is the most commonly used alternative to the
original Campbell and Fiske (1959) MTMM analysis (Sawaki, Stricker, & Oranje, 2008).
Multiple steps of model comparisons assist in finding evidence for trait convergence, trait
divergence, and method effects. The main goal of this kind of analysis is to evaluate the
effects of factors (e.g. test method factors) that are not specified as the constructs a test
intends to measure. Unlike the original MTMM analysis, this CFA approach to MTMM
data allows researchers to specify and evaluate underlying factorial models, and to find the best model that takes into account both trait and method factors based on the empirical evidence.
Before we can arrive at any conclusion regarding how test method interferes with the measurement of trait, we need to consider the special nature of language testing. In the context of language testing, oftentimes a particular method is designed to
measure a unique aspect of language ability. It is likely that the trait and the method
define each other until it is impossible to disentangle one from the other. In my view, FL
tests should be designed to reflect the dynamic and interactive relationships between trait
and method. The test method facet ought to be visualized as an integral part of the factor
structure underlying FL test performance. As Bachman (1990) put it, “the matching of
trait and method factors is one aspect that contributes to test authenticity” (p. 241).
This section summarizes studies investigating the effects of test method on test performance. The next section will discuss the relationship between test-taker characteristics and test performance. Before moving on, it is worth taking stock: the field has witnessed a proliferation of factor-analytic studies devoted to examining the structure of FL proficiency, and some consensus has emerged on two issues. The first consensus is that the result of a FL test can be better interpreted if multiple factors are considered to account for the performance. This is to say that FL proficiency is multidimensional. The second consensus is implied in the mixed findings in terms of the nature and number of factors underlying the construct. It has become clear that the make-up of FL proficiency is context dependent.
Results of any factorial analysis heavily depend on why and how the test is
constructed. Regarding test construction, test method facets (Bachman, 1990) have been proposed to capture the many possible constraints of the test we use. Tests may vary in length, content coverage and representation, and internal consistency, to name just a few. Methods that a test utilizes can also differ in terms of input mode, expected response, the complicated relationship between the two, and so on. As a result, the outcome of any analysis aimed at uncovering the structure of FL proficiency is inevitably shaped by these facets.
Another trend that has emerged in the literature is that researchers have started to
pay attention not only to the test being used but also to the characteristics of the test-
takers.
Results from early research cast doubt on the uniformity of factor structure across different groups of learners. Harley et al. (1990) failed to confirm an a priori three-trait model, and instead found that
a general factor and a written method factor could explain their test performance data
well. Although this two-factor model was chosen as the final model, they suggested that
the relationship between the different components of the construct might be highly dependent on different language learning experiences of the test-takers, and that it might therefore vary across learner groups. The importance of considering learner characteristics when interpreting test performance was also emphasized in Bachman et al.'s (1990) study. These authors recommended that future researchers investigate the relationships
between test performance and test-taker characteristics (TTCs). TTCs are recognized as one of the four influences on test scores in the communicative language ability (CLA) model (Bachman, 1990), alongside language ability, test method facets, and random measurement error; personal characteristics are thus placed on a par with CLA and test method facets. The personal characteristics proposed in the CLA model include cultural background, background knowledge, cognitive abilities, sex, and age.
The Bachman (1990) model acknowledges the influences of test method and
TTCs on test performance. It has a language ability component, a test method component,
and a component of TTCs, based on which researchers can posit and test relationships
about factors that influence test performance. Although critics (Chalhoub-Deville, 1997;
McNamara, 1990; Skehan, 1991) argued that it might be difficult to implement such a
comprehensive model in actual test development, the CLA model was nevertheless
portrayed as the state of the art by Alderson and Banerjee (2002) as they argued that the
model emphasizes the interaction between the language user, the discourse, and the
context.
Suggestions from these earlier researchers served as the springboard for later
studies concentrating on this test-taker factor. It has become a common awareness that TTCs bear on the validity of score interpretation and test use. Commonly discussed TTCs include academic background, native language and culture, field in/dependence, gender, ethnicity, and age. Kunnan (1998a) argued that test takers from certain backgrounds might perform differently based on factors that are relevant to the abilities being tested. This message is echoed in Alderson and Banerjee's (2002) review of language testing and assessment, as they argued that the field needs to understand the characteristics of different test-takers and how these characteristics interact with test-takers' ability.
The remainder of this section reviews studies that explicitly modeled TTCs in their analyses of FL proficiency. Studies are divided into four
groups. The first group of studies focused on the relationship between proficiency level
and the degree of factor differentiation. Studies in the second group examined other
TTCs, such as native language background and proficiency, learning condition, and
gender. Two studies investigating language ability in relation to target language contact
are included in the third group. And last, validation studies in the context of the TOEFL® test make up the fourth group.
Proficiency Level
Regarding the relationship of learners’ target language ability level and the nature
of their proficiency, two competing hypotheses have received much attention from the
field. One hypothesis states that the dimensions of language ability become more differentiated as proficiency increases; the other predicts decreasing differentiation with increasing proficiency. In opposition to both hypotheses, the null hypothesis claims
that factor structure becomes neither more nor less differentiable but stays the same
across learner groups of different ability levels. Research so far has shown quite mixed
results (Kunnan, 1992; Römhild, 2008; Shin, 2005; Swinton & Powers, 1980).
Kunnan (1992) detected a positive relationship between proficiency level and the degree of factor differentiation in an ESL placement test administered at the University of California. Based on test results, ESL students would be placed into four different course levels to receive further instruction on the English language, depending on their performance. The study investigated the structure of the language ability of the test-takers, and examined whether the test measured the same abilities in the same way across all ability levels. To achieve the first goal, an exploratory
factor analysis (EFA) was performed based on six subtest scores for the total group. A
two-factor solution was selected to be the most parsimonious and interpretable, with the
two listening subtest scores loading on one factor, and the subtest scores of reading and the other written-mode subtests loading on the other factor. To address the second goal, concerning the ability level of the test-takers, results of the EFA for each course group were compared.
Although a two-factor solution appeared to be the best fit for each course group, the
variables that loaded on the factors and the relationships of the factors were different
across the groups. The most salient difference was that only the group with the lowest
proficiency had an oblique solution, whereas the solutions for the other three course
groups were orthogonal. This is to say that for the students with the lowest proficiency,
the two factors were correlated in the EFA solution, whereas the factors could be
considered relatively independent for the other groups with higher proficiency. The
indication was that the students with lower ability had less differentiable skill abilities in
contrast to the students at higher levels of ability, whose skills appeared to be more
distinct. The author concluded that the test might not measure the same abilities across
groups of different ability levels. This finding posed a threat to the validity of score
interpretation, especially if a single composite score were reported for all test-takers, as it
was in this case. Due to the inconsistency in the factor structure across the course groups,
the author suggested considering reporting section scores along with the total scores for this test.
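Kunnan's contrast between oblique and orthogonal solutions is straightforward to emulate. The sketch below is illustrative only: it assumes the third-party factor_analyzer package and an invented file of six subtest scores. It extracts two factors under an orthogonal (varimax) and an oblique (promax) rotation and prints the factor correlation matrix of the oblique solution, where nonnegligible correlations would correspond to the kind of overlapping factors Kunnan found for the lowest-proficiency group.

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Hypothetical data: six subtest scores per examinee (invented file name).
scores = pd.read_csv("subtests.csv")

# Orthogonal solution: factors are forced to be uncorrelated.
fa_orth = FactorAnalyzer(n_factors=2, rotation="varimax")
fa_orth.fit(scores)
print("varimax loadings:\n", fa_orth.loadings_)

# Oblique solution: factors are allowed to correlate.
fa_obl = FactorAnalyzer(n_factors=2, rotation="promax")
fa_obl.fit(scores)
print("promax loadings:\n", fa_obl.loadings_)
print("factor correlations (phi):\n", fa_obl.phi_)  # set for oblique rotations
```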
Römhild (2008) also found that factor structure changed across examinee groups of different proficiency levels. However, instead of a positive relationship like the one found in Kunnan (1992), Römhild observed that language proficiency was negatively related to the degree of factor differentiation based on the results of the Examination for the Certificate of Proficiency in English (ECPE). The ECPE is a test of advanced English language proficiency, reflecting skills and content typical of advanced-level language use. The data, as used for this study, included the listening and reading subtests, and three structure subtests (grammar, vocabulary, and a cloze test).
halves by using the split-half technique. The assumption was that each half of the test
would be an adequate representation of the full-length test. To divide the examinees into
low- and high-proficiency groups, the study used an internal criterion measure based on
the test results from one ECPE test half. The results from the other half of the test were then analyzed separately within each group to determine the number of latent factors. Among the three competing baseline models, the five-factor model representing the subtest structure of the ECPE provided the best fit for both groups.
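The split-half grouping procedure itself is mechanical. A minimal sketch, with invented item columns and assuming a simple odd-even split as the equivalent-halves criterion, might look like this: one half forms the internal proficiency criterion and the other half is reserved for the group-wise factor analyses.

```python
import pandas as pd

# Hypothetical item-level responses: one row per examinee.
items = pd.read_csv("item_responses.csv")

# Odd-even split into two presumably equivalent test halves.
criterion_half = items.iloc[:, 0::2]  # used only to define groups
analysis_half = items.iloc[:, 1::2]   # reserved for the factor analyses

# Median split on the criterion half defines low/high proficiency groups.
criterion_score = criterion_half.sum(axis=1)
is_high = criterion_score >= criterion_score.median()

low_group = analysis_half[~is_high]
high_group = analysis_half[is_high]
print(len(low_group), "low-proficiency;", len(high_group), "high-proficiency")
```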
Although identical baseline models were found for each group, the outcome of
analyses on factor loading invariance showed that only partial invariance could be held
across the groups. Items that exhibited factor loading noninvariance were removed, and
the baseline model was re-established. This new partial measurement invariance model was then used to examine structural invariance. Differences in factor variances and covariances were observed, indicating that structural invariance could not be held across the groups. Factor variances were smaller in the low-proficiency group,
meaning that the performance of this group on each of the factors was more uniform. In
other words, there was less variation in the outcome measures for the low-ability group.
Factor correlations also appeared to be smaller in this group, compared to the high-
proficiency group. The weaker correlations indicated that the five factors were more separable in the low-proficiency group. Taken together, these results suggested that the structure of test-takers' language ability became less differentiated as their overall language proficiency increased.
The relationship between proficiency level and the structure of language ability was also examined in Shin (2005). Unlike the two studies reviewed above, this
study did not find sufficient evidence to suggest that factor structure differed across
ability groups. Participants in this study were from Bachman et al.’s (1995) study.
Students’ test results on the Cambridge test battery served as the external criterion to
define their proficiency levels. Their scores on the TOEFL® test and the SPEAK® test
from the ETS test battery were used to examine factorial invariance across the low-, intermediate-, and high-ability groups. A higher-order model with six first-order factors (defined in part by input mode and by speaking) seemed to represent the test results of each ability group optimally, and therefore was established as the baseline model. This baseline model was then estimated simultaneously across all ability groups with equality constraints imposed in a stepwise fashion. Partial measurement invariance was established after freeing the factor loadings that showed significant variances across the groups. The author then proceeded to examine structural invariance in terms of the loadings of the first-order factors on the general language factor. Four of the six first-order factor loadings were
shown to be invariant. Based on these results, the author concluded that the structure of
the ETS tests stayed partially equivalent across the groups, implying that the degree of
factor differentiation neither increased nor decreased with the development of overall
language ability. This finding helped to make the argument that the ETS tests functioned
in the same way across different proficiency groups, and that the use of a single composite score across groups was defensible.
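Formal multigroup estimation with cross-group equality constraints is typically carried out in dedicated SEM software; as a rough first check, however, one can fit the same baseline model separately within each group and compare the resulting loadings. The sketch below does exactly that with semopy and invented variable names; large group-to-group discrepancies in particular loadings flag the parameters a formal invariance test would likely have to free. Column and operator labels follow semopy's inspect() output and may vary by version.

```python
import pandas as pd
import semopy

# A deliberately small baseline model with invented indicator names.
baseline = """
Listening =~ lis1 + lis2 + lis3
Speaking  =~ spk1 + spk2 + spk3
"""

# Hypothetical data file with a 'level' column (e.g., low / mid / high).
data = pd.read_csv("scores_with_levels.csv")

loadings_by_group = {}
for level, subset in data.groupby("level"):
    model = semopy.Model(baseline)
    model.fit(subset.drop(columns="level"))
    estimates = model.inspect()                 # parameter estimate table
    measurement = estimates[estimates["op"] == "~"]
    loadings_by_group[level] = measurement[["lval", "rval", "Estimate"]]

# Eyeball the loadings side by side; formal invariance testing would
# instead constrain them equal across groups and test the difference.
for level, table in loadings_by_group.items():
    print(f"--- {level} ---\n{table}\n")
```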
The decreasing trend in the relationship between language proficiency and the
degree of factor differentiation found in Römhild (2008) was in disagreement with the
finding of Kunnan (1992). Factor structure was found to be partially invariant on both
measurement and structural levels in Shin (2005), indicating that the degree of factor
differentiation did not vary as a function of language proficiency. One reason for this
discrepancy might have to do with the nature of the data used in each study. Kunnan
(1992) and Shin (2005) used section scores as the unit of data analysis, whereas in
Römhild (2008) analyses were based on item-level data. Another reason could simply be the differences in the tests and the populations examined.
One issue that deserves our attention is how group membership was determined
differently in the studies. Kunnan (1992) did not use an independent criterion measure for
separating the groups. Instead, the same test used to determine group membership was
used to perform multiple group analyses. In contrast, both Shin (2005) and Römhild
membership. The latter approach is preferred because any significant result would be less likely to be an artifact of the grouping procedure.
Test-takers are characterized not only by their prior target language achievement but also by their native
language background, gender, past and current learning conditions, and many other
characteristics. Test-takers’ identities and life experiences are also valuable information
for us to understand their current learning profiles, and how they have arrived at where
they are. The research community has gradually embraced the idea that treating test-
takers regardless of their identities and life experiences will give us an over-simplified
picture of their test performance. In the previous section, we have discussed studies that
have examined prior target language achievement as one of the many TTCs, and its
interplay with the degree of factor differentiation. In the following section, studies that examined other TTCs are reviewed.
One study challenged the assumption of a uniform factor structure regardless of the differences in the learners or the conditions under which language acquisition took place. The study posited that differences in cognitive skills and learning conditions would be reflected in the structure of FL proficiency. Learners were divided into two groups: one with high L1 (German) proficiency and the other with low L1 proficiency. A three-factor model (elementary, complex, and communicative) was confirmed to be the best fitting model for the total sample. Then the
invariance of this structure was examined as a function of L1 proficiency and of the kind
of instruction the learners received. The study found that achievement in L1 could affect
the factor structure in several ways. Across the low-and high-L1 groups, differences in
factor loadings on complex skills and interactive use were found. Loadings on the factors
of more advanced language skills became more salient as the learners advanced in
mastering their native language. In other words, advanced FL skills were more likely to
appear as a distinguishable factor for learners with a high level of L1 ability. Another
finding was the differences in factor correlations. Factors were more closely correlated in
the group with a high level of competence in their L1. Based on this observation, the
authors suggested that a general L2 proficiency factor could emerge if the sample was
highly proficient in their mother tongue because learners with high-level L1 proficiency
tended to develop a more generalized ability structure by being more adept in transferring abilities across languages. In short, the structure of FL proficiency was found to be dependent on the learners' degree of mastery of their L1.
During the next step, two types of teaching strategies were examined in relation to test performance, with the teaching strategies specified as
predictors of the language components in the model. The traditional teaching style was
characterized by its heavy use of L1, and by favoring a bilingual approach to lexicon, whereas the modern teaching style emphasized communicative activities such as unstructured oral interactions. The results showed that the traditional teaching strategy
had a positive relationship with the elementary factor but no association with the complex
factor, and a negative association with the communicative factor. In contrast, the modern
teaching strategy was positively associated with all aspects of ability development.
Gender, as another key TTC, was studied in Wang (2006). In an attempt to provide evidence of fairness for two test batteries, the Examination for the Certificate of Proficiency in English (ECPE) and the Michigan English Language Assessment Battery (MELAB), the author investigated the dimensionality of both tests
and the equivalence of the factor structures across genders. Since both tests report total
scores, the author aimed to gather evidence to support the use of the total scores for both tests. Each battery included, among other components, a listening section and a GCVR (grammar, cloze, vocabulary, and reading) section. Analyses were performed on subtest scores.
Subjects were randomly split into two samples to form a calibration sample and a
validation sample. EFA without rotation was performed for each test separately on the
calibration sample. It was found that only one factor dominated the distribution of score
variance in both ECPE and MELAB. Factorial invariance under this one-factor model
was examined across gender on the validation sample for both tests. A model with
equality constraints on the model structure, latent factor variances, factor loadings, and
unique variances, provided the best fit to the data for both tests. This result demonstrated
that all parameter estimates in the one-factor model were invariant across genders for
ECPE and MELAB. The indication was that both tests functioned in the same way across
gender groups. Therefore it was reasonable to report a single total score for both groups, as both tests did.
As reviewed in the previous section, Shin (2005) did not find enough evidence to
support that the degree of factor differentiation varied as a function of proficiency level.
Partial measurement and structural invariance held across all three ability groups. However, noninvariance was detected in the factor loadings of pronunciation and fluency on the speaking factor. To help locate the source of this noninvariance, a follow-up MIMIC (multiple indicators, multiple causes) analysis was conducted in which native language background was specified as a covariate to determine how the language background of the test-takers was related to test performance.
Two language groups were formed: the Asian language group and the European
language group. Direct effects on the pronunciation and fluency measures from the
covariate were added in the re-specified model. This resulted in significant model
improvement. The appearance of the two direct effects suggested that there was a scale
bias that would produce different performance ratings due to group membership. The
author concluded that based on the results of the MIMIC modeling, the two speaking
rating scales, pronunciation and fluency, functioned differentially across the two major
language groups.
Factorial noninvariance suggests that test results reflect information other than what the test intends to reveal, such as native language background. Efforts should be made to investigate the nature of any such confounding information. Otherwise, the validity of score interpretation and use will be put into jeopardy. This study demonstrated how MIMIC modeling could be a powerful tool for detecting such confounding effects.
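In SEM terms, a MIMIC model is simply a factor model whose latent variable, and, where bias is suspected, specific indicators, are regressed on an observed covariate. A hedged sketch of the kind of model described above, using semopy, invented rating names, and a binary language-group dummy, could look like this; it is an illustration of the general technique, not a reproduction of the published analysis.

```python
import pandas as pd
import semopy

# Hypothetical speaking ratings plus a 0/1 language-family indicator.
data = pd.read_csv("speaking_ratings.csv")  # pron, flu, gram, voc, lang_group

# MIMIC model: the covariate predicts the latent speaking factor, and the
# two extra direct paths (pron and flu on lang_group) capture potential
# scale bias over and above the covariate's effect on the factor.
mimic = """
Speaking =~ pron + flu + gram + voc
Speaking ~ lang_group
pron ~ lang_group
flu ~ lang_group
"""

model = semopy.Model(mimic)
model.fit(data)

# Significant direct effects would suggest that the pronunciation and
# fluency scales function differentially across the two language groups.
print(model.inspect())
```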
The studies reviewed in this section specified one or multiple TTCs in their
models based on FL test performance. The findings generally support the assertion that TTCs account for part of test performance variability. The impact of the various TTCs can be observed in two ways. By means of multi-group analysis, TTCs can serve as grouping variables; alternatively, TTCs can be modeled as independent variables to have direct effects on test performance. Results from studies using either approach have shown that a universal view of FL proficiency, regardless of TTCs, is not warranted. Studies in this section have also demonstrated the usefulness of multi-group invariance analysis and SEM as tools for interpreting test performance in light of TTCs.
There were a few studies that examined test performance in relation to target
language contact experience as a TTC. Two kinds of language contact were of particular
interest to the researchers: language contact with the target language community and
language contact at home. The former usually occurs in a study-abroad context, whereas the latter characterizes heritage learners.
Language contact with the target language as a TTC was first examined in two studies of Advanced Placement (AP) examinations. Each year, college-bound high school students who want to earn college credit and advanced placement in particular areas of study take
the College Board's AP courses and exams. AP examinations cover a wide range of subject areas. The FL examinations assess skills that students might acquire after four to six semesters of college FL courses. Four FL exams are traditionally available for students to take: French, German, Latin, and Spanish. In 2007, three new languages were added to AP's list of FL examinations.
The demographics of the AP examinees can vary from exam to exam and at
different time periods. In the case of the AP French Language Examination, it was
indicated in Morgan and Mazzeo (1988) that the majority of the examinees taking this
exam learned French primarily through secondary school academic coursework, but that some examinees had spent time in a French-speaking country or used French at home. Whether the test is an appropriate measure for the latter two groups, and whether the test functions in the same way across all examinees, is a validity question that needs to be answered.
Morgan and Mazzeo (1988) identified four groups of examinees based on their
language-learning experiences. The first group, the standard group, had little or no out-
of-school French language experience. This group, which constituted the majority of the
test-takers, was divided into two samples for cross-validation purposes. The second group
(special group I) included the examinees who had spent at least one month in a French-
speaking country. The third group (special group II) regularly spoke or listened to French
at home, and therefore fit the definition of heritage learners. The fourth group consisted of college students (the college group).
Since the exam was intended to measure the listening, reading, writing, and
speaking skills (College Board, 1987), the primary goal of this study was to verify the
skills measured by the exam, and to determine whether the same dimensional structure
could be applied to the four populations. A correlated four-factor model (listening and
writing, language structure, reading, and speaking) provided the best overall fit to the
data. In this model the listening scores loaded with the writing scores on a common
factor. Discrepancy was found between the make-up of this factor model and the test’s
subsection structure, in which listening and writing were separate sections, and both
language structure and reading comprehension belonged to the reading section. The
reason given by the researchers for this discrepancy was that the listening items measured
a dimension similar to the one measured by the writing tasks. The researchers criticized
the equal weighting scheme, arguing that it actually overweighted students' ability to produce grammatically correct prose, as it was measured in both the listening and writing sections of the test, and it downplayed the ability to read and interpret French passages, as some of the items in the Reading section seemed to assess a separate grammar-oriented dimension.
This correlated four-factor model was examined for invariance between the
standard group and each of the other three groups. Between the standard group and
special group I, only invariance in factor loadings could be held. No invariance was
found between the standard group and special group II. Invariance in both factor loadings
and unique variances was found between the standard group and the college group. The
factor structure based on the standard group was most similar to the one of the college
group. What was common to the standard group and the college group was that both
groups lacked significant out-of-school French language experience. The indication was that out-of-school language experience affected the structure underlying test performance.
The AP Spanish Language Examination, which is designed to assess high school students' Spanish language proficiency in order to grant college credit and placement, was evaluated by Ginther and Stevens (1998). One purpose of the study was to investigate
whether the dimensionality underlying test performance was compatible with the four-
section design of the test, intending to measure language abilities in listening, reading,
writing, and speaking. The second goal was to study whether the same factor structure
would hold constant across relevant groups within the entire test-taking population. The reference group comprised examinees who described their background as Latin. The other four groups were Mexican Spanish-speakers, Mexican bilingual-speakers (who indicated equal preference for Spanish and English), White English-speakers, and another English-speaking group. A correlated four-factor model corresponding to the four sections of the test was specified as the a priori model. This model was found to provide a good fit for the
reference group based on item parcel scores. Group comparisons were conducted by
imposing multiple levels of equality constraints between the reference group and each of the other groups.
When comparing the reference group to the Mexican Spanish-speaking group and
the Mexican bilingual-speaking group, invariance in the number of the factors and factor
covariances did hold across the groups. However, only invariance in the number of the
factors was found when comparing the reference group to the two English-speaking
groups. This finding suggested that the more a group’s background in ethnicity and
preferred language deviated from the reference group, the less equivalent the factor structures were across the groups.
It was also found that the four factors were less closely related for the Spanish-
speaking groups than for the English-speaking examinees. Descriptive statistics also
showed that Spanish-speakers had higher means on all scores than the English-speakers. Unfortunately the study did not include a mean structure in group comparisons.
The researchers suggested classroom instructional environment as one potential cause for
the high factor correlations of the low-level groups. The rationale given was that if a group of learners acquired the language mainly through classroom instruction, the abilities they developed would be highly constrained by the type of instruction they received. If instruction addressed all four areas of skill development (listening, reading, writing, and speaking), it would be likely that the growth of these skills would proceed in parallel, yielding highly correlated factors.
Differences were also found in model parameter estimates across groups. The
speaking factor was found to be less correlated with the other factors for the two Spanish-
speaking groups than for the two English-speaking groups. This indicated that speaking
ability was both quantitatively and qualitatively different across the Spanish-speakers
compared to the English-speakers. Not only did the Spanish-speakers perform better on
the speaking tasks, but also their speaking performance was less related to their
performance on the other parts of the test. The researchers suggested that having out-of-
class experience with the target language might have had a fundamental influence on this
difference in factor structure. The researchers also criticized the fairness of the test. They argued that some examinees were evaluated not only on what they learned in the classroom but also on abilities developed outside of it.
Home language environment, a TTC particularly relevant to heritage language learners, was studied by Bae and Bachman (1998). The uniqueness of this study was that the nature of language proficiency was investigated in the Korean language instead of in English. Two groups of learners were included: the heritage learners and the non-heritage learners. Unlike the heritage learners, who were exposed to Korean at home, the non-heritage learners would mainly communicate in English. The study focused only on two
comprehension skills, listening and reading. A correlated two-factor model was selected
as the baseline model as it described the data for both groups well. This baseline model
was tested for the equivalence of parameters of interest across the groups.
The results from examining measurement invariance showed that the factor
loadings from one listening task on the listening factor were different across the two
groups. A task analysis of this listening task suggested that it might have tapped into relatively more integrated higher-level listening skills due to its complex
input. The factor loading for this task was greater for the Korean heritage group than for
the non-heritage group, suggesting that this task measured listening differentially across
the two groups, and that this task was a better indicator of the listening ability for the
former group with higher overall listening ability than for the latter group with lower
listening competence. The results from examining structural invariance indicated that the
factors were correlated in similar ways across the two groups. However, compared to the
non-heritage learners, the Korean heritage learners performed more uniformly on the
listening tasks. In contrast, the non-heritage learners demonstrated less variance in their
reading scores than the heritage learners. Although the same two-factor pattern was accepted for both groups in terms of the overall model fit, significant differences in individual parameters across groups indicated that some parts of the test functioned differently for the two groups. The study suggested that learners with different home language backgrounds might display different factor structures. Furthermore, not only might the configuration of their factor structures differ, but their ability to respond to certain tasks is likely to vary as well. The authors made an urgent call for future researchers to include a mean structure in multi-group factor analysis so that groups' factor means could be compared.
Kunnan's (1994, 1995) study was one of the few studies that employed a multi-group SEM approach to examine both a test's internal factorial representation and its external structural relationships. The study investigated the influences of several TTCs on test performance. The multi-group SEM analyses were conducted based on performance data on the ETS and Cambridge test batteries.
The overall model had a measurement component that modeled the relationships
between test measures and latent factors, as well as the relationships among the factors. It
also had a structural component to account for test performance's external relationships
with multiple TTCs. These TTCs included home-country formal instruction (HCF), home-country informal exposure, experience in the target language environment, and monitoring (MON). Both the measurement and structural components were tested for model fit
across two groups of test takers: non-Indo-European native language group (Thai, Arabic,
Japanese, and Chinese), and Indo-European native language group (Spanish, Portuguese,
French, and German). The research question was how the four TTCs mentioned above related to test performance. A four-factor measurement model was attempted for both language groups. The four factors were: reading and writing assessed by the ETS test battery, reading and writing assessed by the Cambridge test battery, listening, and speaking. Two competing structural models were postulated when modeling the relationships among the TTCs and test performance. In the
first model, the four TTCs were specified to have equal influences on the four factors. In
the second model, only MON had direct influence on test performance, whereas the other
three were allowed to cast influences on test performance through MON, the intervening
factor.
The study found that neither model produced a clear overall statistical fit for both
groups, although some direct effects of TTCs on the test performance were substantive
and interpretable. Home-country formal instruction was found to have positive influence on reading and writing, but negative influence on listening and speaking. Home-country informal exposure was found to have positive impact on listening and speaking, but negative impact on reading and writing. It was also discovered that the experience of staying in the target language environment was related to test performance.
ETS-TOEFL® Studies
The TOEFL® test is developed by Educational Testing Service (ETS), and is intended to measure English skills needed in an academic setting. Since its introduction, the test has served a very large and diverse test-taking population.
As discussed in the previous sections, the internal structure of a test can be highly
dependent on the characteristics of the group taking the test. Any effort to build a validity
argument for test score interpretation and use should take relevant test-taker variables
into consideration. Because of the TOEFL® test's extremely diverse test-taking population,
examining the relationships between TTCs and test performance has been a strong focus
in TOEFL validation studies. With every new generation of the TOEFL test, there has
been a new wave of validation studies concentrating on the factor structure and its
interplay with the ever-changing test-taking population (Hale et al., 1988; Hale et al.,
1989; Stricker et al., 2005; Stricker et al., 2008; Swinton & Powers, 1980). The following paragraphs review several of these studies.
Swinton and Powers (1980) provided validity evidence of TOEFL® test score
use by determining the construct the test intended to measure and the relationships among
the factors in light of multiple TTCs. As many as seven native language groups were included: African, Arabic, Chinese, Farsi, Germanic, Japanese, and Spanish. Groups differed in their overall level of proficiency based on their test performance, with the Germanic group having the highest and the Farsi group having the lowest overall language ability. Variables of age and reason for taking the TOEFL test were also analyzed for their relationships with test performance. Analyses were performed separately for each language group.
Regarding the influence of overall proficiency, it was found that the language
ability of the group with the highest proficiency, the Germanic group in this case, showed
the greatest distinctions in its factor structure. As many as eight factors emerged for the
Germanic group during the initial factor extraction. As the analyses proceeded, it was
also discovered that only two factors could be confirmed for the Farsi group, the group
with the lowest overall proficiency. These results demonstrated that the amount of factor
differentiation did vary as a function of overall ability level. One reason for finding a larger set of differentiated abilities in the high-level group, as the authors suggested, might
be that the responses from high-level test-takers were less likely to be contaminated with
random factors, such as guessing, which made it possible for some minor dimensions to
emerge.
Except for the Farsi and Germanic groups, a three-factor model was established
for each of the other language groups. A listening factor corresponding to Section I
(Listening Comprehension) of the test was found for all groups, meaning that the
response pattern on the listening items could be explained by one common factor for all
language groups. However, there were differences in the make-up of the factor structures
in the other parts of the test across two major language groups, the Indo-European group
(Spanish and Germanic) and the non-Indo-European group (African, Arabic, Japanese,
and Chinese). For the Indo-European group, the majority of the items in Section II
(Structure and Written Expression) loaded on one common factor, and most of the items in Section III (Vocabulary and Reading Comprehension) loaded on another common factor. The two factors corresponded well to Section II and Section III of the test. As for
the non-Indo-European group, most items from Reading Comprehension were found to
load with items from Structure and Written Expression on a common factor, whereas the
items from Vocabulary loaded on a separate factor. For this group, the items in Section
III did not seem to measure a common construct. Instead, the reading items tended to
measure a factor that was similar to the one measured by the items in Section II.
According to the authors, this finding had tremendous implications for score
reporting and interpretation. They suggested reporting subtest scores on Vocabulary and
Reading Comprehension separately rather than a combined Section III score, especially
for the non-Indo-European group. Their finding indicated that the interpretation of
Section III scores could be very different for the two major language groups. For the Indo-European speakers, Section III items appeared to measure a common ability, and the interpretation was relatively straightforward. For the non-Indo-European
speakers, two factors seemed to account for performance in Section III. To make score
interpretation even more complicated, one dimension also overlapped with a factor
measured by another section of the test. Reporting one single Section III score with the
assumption that all items in the section measured the same ability would be misleading.
The usefulness and necessity of reporting a separate vocabulary score were highlighted by these findings.
Influences of other test-taker variables were also detected via a factor extension analysis, in which these variables were regressed on the test performance. It was found that both age and graduate degree-oriented motive were positively related to a factor associated with the vocabulary ability for the non-Indo-European group, and with both vocabulary and reading abilities for the Indo-European group.
As language testing at ETS evolved into the 21st century, attempts were made to
revise the TOEFL® test so that the most recent theoretical developments in the field of
second language acquisition and FL testing could be reflected in the test design and task types. The LanguEdge courseware package had an ESL test which was similar in task design and test structure to the new generation of the TOEFL test which was under development at that point
(ETS, 2004). In both the LanguEdge test and the new TOEFL test, speaking and writing
have become more integrated with reading and listening (ETS, 2004).
Stricker et al. (2005) examined the construct validity of the LanguEdge test. In the test, integrated tasks were implemented so that language skills could be tested in a way that was close to how they would be used in real-life communication. Two out of
the five speaking tasks were integrated with listening and reading, respectively. There
were also two integrated writing tasks (out of three writing tasks in total), one having a
listening component and the other having a reading element. Factor structure of test
performance was examined across three native language groups: Arabic, Chinese, and
Spanish.
Scores on the four integrated tasks were counted toward the speaking and writing sections. These integrated tasks added an extra layer of complexity onto score interpretation across different test-taking groups. The questions of concern to the authors were (a) whether they could still find an appropriate model with factors corresponding to the sub-section structure, and (b) whether the factor structure would hold equivalent across the language groups. Analyses were based on dichotomously scored listening and reading as well as polytomously rated speaking and writing scores. A
correlated two-factor solution was chosen for each group and for the whole group. In this
model speaking ratings all loaded on one factor, whereas scores from the other sections
loaded on the other factor. According to this solution a common factor accounted for test
performance of the Listening, Reading, and Writing sections. The analyses did not
produce a factor model that corresponded to the structure of the test. The authors
suspected that there might have been a lack-of-instruction effect on the under-development of the test-takers' speaking ability, which might have made speaking stand apart as a separate factor.
The multiple group invariance analyses showed that invariance held across the
groups in terms of the number of factors, factor loadings and unique variances but not
factor correlations. The study concluded that the test functioned in a similar way across
groups although the correlations between the factors were lower for the Arabic sample than for the other samples.
The study failed to find separate factors for each section of the test. This might
have been because the inclusion of the integrated tasks had blurred the distinction of the
skill measured by each section. With regard to score interpretation, this finding also
called into question the usefulness of reporting section scores. On the other hand, the
study established a uniform factor structure across three native language groups. Factor structure equivalence was a valuable piece of validity evidence of test fairness, especially for a test with such a diverse test-taking population.
The TOEFL® test evolved into an internet-based mode in 2005. The introduction
of the internet-based TOEFL test, the TOEFL iBT® test, was a milestone in the test’s
growth and modernization. Computer and internet technologies make the TOEFL iBT test accessible to an even wider population. With this expanding test-taking population, it has become even more important for researchers to investigate the test's underlying factor structure across different test-taking groups around the world. Research studies, such as Stricker et al. (2005), laid the groundwork for validating TOEFL iBT test score interpretation and use.
Stricker and Rock (2008) recently assessed the structure invariance of TOEFL
iBT performance across subgroups of test takers who were identified by three criteria:
native language family backgrounds, exposure to English use in educational and business
contexts in home countries, and years of formal instruction in the English language.
Various native language groups in the test population were categorized into two
major language groups: the Indo-European language family and the non-Indo-European language family. Countries were also classified (based on their historical and administrative status) with regard to the prevalence of English use in educational and business contexts. Countries where English is the native language of the majority were not included in the study. Countries represented in the test-taker population were
divided into either the outer-circle country group or the expanding-circle country group. Regarding length of English study, test-takers were also divided into three groups: a group with fewer than 6 years of formal instruction in English, a group with instruction between 7 and 10 years, and a group with more than 10 years of instruction.
A higher-order model with four first-order factors was found to be the best-fitting
model for all test-takers in this study. The same structure was also identified in all subgroups defined by native language family, exposure to English, and years of formal instruction. The authors concluded that this higher-order structure conformed to the four-section test design, and supported the usefulness of reporting the total score as an overall indicator of the general language ability as well as four section scores as indicators of the four skills.
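In SEM syntax the higher-order structure the authors retained is compact: four first-order section factors, each measured by its own scores, all loading on one general factor. The sketch below (semopy again, with invented indicator names) illustrates the specification; it is not ETS's operational model, and it assumes semopy's lavaan-style syntax accepts latent variables as indicators of a higher-order factor.

```python
import pandas as pd
import semopy

# Higher-order model: a general factor governs the four section factors.
higher_order = """
Listening =~ lis1 + lis2 + lis3
Reading   =~ rdg1 + rdg2 + rdg3
Speaking  =~ spk1 + spk2 + spk3
Writing   =~ wrt1 + wrt2 + wrt3
General   =~ Listening + Reading + Speaking + Writing
"""

data = pd.read_csv("section_scores.csv")  # hypothetical score file
model = semopy.Model(higher_order)
model.fit(data)
print(semopy.calc_stats(model))  # overall fit of the hierarchical structure
```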
This uniformity in the structure of language ability as measured by the test across
diverse subgroups of test-takers provided strong validity evidence related to the test’s
internal structure and this structure’s generalization. Based on these findings, a validity
argument can be made that the TOEFL iBT® test functions the same way for diverse
subgroups of test-takers. This also implies that score comparisons among individual test-takers from different subgroups can be justified.
Examining the internal factor structure of a test is an essential step in building the validity argument for a test. The resulting factor structure sums up the
response pattern of test performance on the test. It also reveals the underlying ability or
abilities the test empirically measures. Our understanding of the nature of FL proficiency
is enhanced and verified through this process. However, as demonstrated by the results
from the studies cited in this review of literature, the make-up of the structure of FL
proficiency is highly context-dependent. It depends on what test is used, and how the test
is constructed. It also depends on who takes the test and under what circumstances. Tests
aiming at assessing proficiency in a FL can vary in test length, subtest structure, item or
task type, and scoring scheme, to name just a few. These facets, proposed by Bachman
(1990) as test method facets, can exert tremendous influence on test performance. In addition, TTCs, or learner variables, also interact with how a test functions. Evaluating factorial invariance across groups defined by TTCs, such as gender, native language background, learning condition, home language environment, and so forth, provides critical validity evidence
for test generalization and fairness. Finding factorial noninvariance implies that
divergence in the test performance of different groups exists. This will make it hard to
argue that the results of the test can be interpreted in a comparable way across the groups.
Due to differences in factor structure, such as the number of factors and the relationships among them, a test may not measure the same construct for all groups. In this case, will it still make good sense to use the same test or the same score reporting scheme for all test-takers?
Test fairness addresses whether a test measures the same construct in all relevant
subgroups of the population (AERA, APA, & NCME, 1999). Factorial invariance is an
important assumption for claiming test fairness. As highlighted by many authors in this review, finding out whether a factorial model is invariant across groups of test-takers helps decide how scores should be reported and used. Relevant issues in this regard include whether a single composite score or section scores or both should be reported, and how weights should be assigned to the section scores if a total score is reported. It remains an open challenge to design a score reporting system for a test that exhibits factorial noninvariance for different test-taking groups.
Conclusion
This chapter has reviewed and discussed models of FL proficiency. The review of the models was not intended to be exhaustive. The decision to review only selected models was based on two criteria. First, these models represent milestones in the course of
searching for an understanding of what FL proficiency is, and they have been influential
in FL test development. Second, the soundness of these models has been theoretically
examined and empirically investigated via a latent trait approach in language testing.
At the end of the last decade, Bachman (2000) pointed out two major themes of
advancement in language testing research. The progression of the field has been powered
by both the refinement of a rich variety of approaches and tools, and by a broadening of the research agenda. In defining the nature of FL proficiency, the field has witnessed rapid growth in theoretical positions and empirical techniques: from the domination of the unitary competence hypothesis to views of proficiency as multicomponential, dynamic, and communicative; from holding a dichotomous view of trait and method to recognizing their interdependence; and from being concerned with the internal structure of a test to examining a test's external relationship
with TTCs.
According to this literature review, consensus has been reached by the research community on two points. First, FL proficiency is viewed as a construct with multiple dimensions. Second, the view is held that the proficiency in a FL
can be interpreted more meaningfully if relevant TTCs are taken into consideration.
What the field has come to agree upon is crucial to the consolidation and advancement of the discipline. The shared belief about the nature of FL proficiency demonstrates the collaborative efforts from
researchers with different backgrounds and strengths. However, uncertainties still remain,
and these pose both challenges and opportunities for future language testing researchers.
One common theme that has emerged from this review is that it is still unclear to
the research community what FL proficiency consists of or how the constituent parts
interact. Sawaki et al. (2008) pointed out that multidimensional competence models come in several forms, and that there are three different schools of thought for modeling this construct. The first one claims that FL competence consists of multiple uncorrelated factors. The second believes that this proficiency can best be represented by a set of correlated factors. The third argues that the correlated first-order factors are themselves governed by a higher-order factor. This higher-order factor is general in nature, and it is responsible for performance across multiple skills. The higher the loadings of first-order factors on the general factor, the more likely it is that the general factor actually exists.
While the first position has been repeatedly refuted based on empirical evidence
showing that factors are indeed correlated, choosing between the second and the third
position has never been easy. The question is how high the correlations among the factors
should be to adopt a hierarchical structure. This question becomes even harder to resolve
when there are three first-order factors in the model. A hierarchical model with three first-order factors and one general factor is statistically equivalent to a correlated three-factor model in terms of overall model fit statistics. Researchers in the past have leaned towards
one position or the other by comparing model fit or by examining individual parameter
estimates when results from model comparisons did not indicate which one was clearly
superior to the other. Most recently, a bifactor model was tested in Sawaki et al. (2008).
In a bifactor model, two factor loadings are estimated for each observed score, one on the
general higher-order factor and the other on one of the first-order factors. More research
needs to be conducted before we can tell how well this bifactor model can be applied to FL proficiency data. Future researchers should keep testing these competing hypotheses until a general consensus emerges.
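The bifactor specification differs from the higher-order one in that every observed score loads directly on both the general factor and one group factor, with all factors kept orthogonal. A minimal sketch with invented names follows; note that the lavaan-style 0* fixed-covariance syntax is assumed to be accepted by semopy and should be verified against the installed version.

```python
import pandas as pd
import semopy

# Bifactor model: each score loads on the general factor and on exactly
# one group factor; all factor covariances are fixed to zero.
# ASSUMPTION: semopy accepts lavaan-style fixed covariances ("0*").
bifactor = """
General   =~ lis1 + lis2 + lis3 + rdg1 + rdg2 + rdg3 + spk1 + spk2 + spk3
Listening =~ lis1 + lis2 + lis3
Reading   =~ rdg1 + rdg2 + rdg3
Speaking  =~ spk1 + spk2 + spk3
General ~~ 0*Listening
General ~~ 0*Reading
General ~~ 0*Speaking
Listening ~~ 0*Reading
Listening ~~ 0*Speaking
Reading ~~ 0*Speaking
"""

data = pd.read_csv("scores.csv")  # hypothetical nine score columns
model = semopy.Model(bifactor)
model.fit(data)

# For each score, two loadings are estimated: one on General and one on
# the corresponding group factor, mirroring the description above.
print(model.inspect())
```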
The second unresolved issue has to do with the compatibility of trait and method. Since its initial appearance, the concept of communicative competence (Canale & Swain, 1980) has had a strong influence not only on language teaching and learning, but also on test development. Communicative competence emphasizes the use and coordination of all possible knowledge and skills required for interacting in actual communication (Canale, 1983). This poses an immediate threat to the validity and usefulness of tests that focus on only one aspect of language knowledge or skill at a time. For instance, a listening test is usually designed to measure listening ability and, ideally, not to measure anything else. A test battery will usually consist of a series of skill-based tests (listening, reading, writing, and speaking). Such a test battery is good enough to measure a wide repertoire of skills and knowledge separately, but not sufficient to measure the coordination of different skills and knowledge in actual communication.
A new type of language test task, the so-called integrated task, has been developed for and implemented in the TOEFL iBT® test. An integrated task requires the use of multiple skills simultaneously. For example, an integrated speaking task could involve both speaking and listening, and an integrated writing task could involve reading, listening, and writing. Such an integrated task is no longer tied to one modality and, consequently, comes closer to the coordinated language use found in real communication.
The arrival of integrated language tasks raises questions about how the skill-based factor models found in many of the studies in this review can accommodate performance on such tasks. One follow-up question is what the factor structure underlying performance on integrated tasks might look like. A further question is, if we do find distinct factors, how we are going to name the factors so that their interpretations remain meaningful. Future researchers should examine test performance on integrated tasks that are designed to elicit the coordinated use of multiple skills. The latent factor approach would still apply; however, the resulting factor structure might no longer be purely skill-based.
Third, this review has also pointed out that one of the top tasks on the agenda for future research is to investigate the relationships between TTCs and test performance. TTCs are recognized as one of the four influences on test scores in Bachman's (1990) CLA model, and in Bachman and Palmer's (1996) description of language use in language tests. According to this review, the field has recently witnessed a surge of interest and a growing number of empirical studies on TTCs, such as gender (Wang, 2006), native language background (Shin, 2005), ethnicity and preferred language (Ginther & Stevens, 1998), and home language environment (Bae & Bachman, 1998), to name just a few. However, the amount of information we have obtained is far from what we need to fully understand these complex and dynamic relationships. Moreover, there are still TTCs that have not attracted enough attention from the field, and whose relationships with test performance therefore remain underexplored.
One such TTC is language contact with the target language in a study-abroad environment. FL test-takers are non-native speakers of the target language, but they may have had the experience of studying and/or living in the target language environment. Learning experience in the target language environment has been investigated in relation to test performance in very few studies (Kunnan, 1994, 1995; Morgan & Mazzeo, 1988). Since this TTC is salient and relevant in the context of FL testing, future research should study the relationships between this TTC and FL test performance in various testing situations.
It is also worth mentioning that the growing interest in the topic of TTCs and test performance has been accompanied by methodological advances, most notably structural equation modeling (SEM). SEM has offered not only powerful research techniques, but also new perspectives to explore issues that interest researchers from both language testing and second language acquisition.
Within the general SEM framework, we can investigate a wide range of issues: (a) what the internal structure of a test is; (b) whether and to what degree the internal structure of a test holds equivalent across different test-taking groups of interest; and (c) whether this latent internal structure relates to any external variables of TTCs and, if so, in what way. Factor analysis, a family of latent multivariate analysis techniques, has been used extensively to investigate the first question. Multi-group invariance factor analysis has been used to address the second question. Built upon factor analysis and path analysis, SEM is able to further provide a structural view of the relationships between a test's internal structure and the learner variables that exist outside the test, such as age, gender, and learning experience. This structural component, together with the measurement component based on the internal structure of the test, could offer insights into the relationships between test-taker background and test performance. Future researchers who are interested in the relationships between TTCs and test performance are encouraged to take full advantage of this analytic framework.
This chapter reviews and synthesizes research results of studies that have investigated the nature of FL proficiency through a latent factor approach in the context of FL testing. Directions for future research are suggested at the end of the chapter. These concerns for future studies will be addressed in the design of the current study, which is described in the next chapter.
CHAPTER THREE
METHODOLOGY
This chapter first introduces the TOEFL iBT® test developed at Educational Testing Service (ETS), including a brief history of the test, and ETS's philosophy of testing English as a foreign language (FL). Next, two general goals of the study are explained, and three research questions are put forward. Borrowing knowledge from the studies reviewed in the previous chapter as well as insights from the study-abroad literature, six hypotheses are formulated. This is followed by a description of the TOEFL iBT public dataset, the subjects, and the measures used in this study. Last, planned analysis procedures are described.
The TOEFL iBT test is a test of English as a FL, and it is administered world-wide. The purpose of the test is to "measure the communicative language ability of people whose first language is not English" (Jamieson, Jones, Kirsch, Mosenthal, & Taylor, 2000). An online report on validity evidence states that "TOEFL iBT test scores are interpreted as the ability of the test taker to use and understand English as it is spoken, written, and heard in college and university settings," and "[t]he proposed uses of TOEFL iBT test scores are to aid in admissions and placement decisions at English-medium institutions" (ETS, 2008). As the newest member of the TOEFL® test suite, the TOEFL iBT test has gradually replaced the earlier paper-based and computer-based versions.
The development of the TOEFL iBT test was part of a broad effort to modernize language testing at ETS. Jamieson et al. (2000) explained that the impetus behind this movement was to build a test that could reflect communicative competence models. The design of the test was also intended to meet the needs of various stakeholders for more useful and meaningful score information.
In response to these calls, the current TOEFL iBT® test has two distinctive features. First, following the working model developed by the TOEFL Committee of Examiners (COE), the test measures communicative language proficiency and focuses on academic language and the language of university life (Chapelle et al., 1997). This COE model makes an explicit distinction between the context of language use and the internal capacities of individual language users. The model portrays the relationships between the two as dynamic and integrated, as the COE members believed that "the features of the context call on specific capacities defined within the internal operations" (p. 4). This understanding of language ability in relation to its context of use lays out the theoretical foundation upon which a test development framework with both a context and an ability dimension was built.
Second, the new test has mandatory speaking and writing sections, intended to measure the productive skills directly. In addition, a new type of language task, namely the integrated task, is for the first time introduced into TOEFL testing. Unlike the traditional independent tasks, which call upon language skill in a single modality, an integrated task requires test-takers to use and coordinate skills in more than one modality. For example, an integrated writing task can require test-takers to incorporate and synthesize information from an aural lecture and a reading passage in their written responses. These integrated tasks provide information not only about examinees' abilities in more than one skill area but also about their ability to coordinate the use of different skills in communication.
As stated in Chapter One, this study aims to provide empirical evidence to support our understanding of the nature of FL proficiency in both its absolute and relative senses, and to investigate the educational, social, and cultural influences on language proficiency. The object of investigation is communicative language ability as measured by the TOEFL iBT® test. The reasons for choosing the test are explained as follows.
First, the theoretical reasoning behind the development of the test reflects the current thinking in applied linguistics, which views language ability as communicative in nature. The test focuses on language use in communication, not language display in isolation. It specifies the relationship between the context of language use and language abilities as integrated, rather than dichotomous. The role of context in defining language ability is recognized and reflected in test development. Performance on the test can therefore serve as an empirical basis for studying communicative language ability.
Second, as the world's most widely used English language test (ETS, 2008), the TOEFL iBT test draws test-takers who differ on a wide range of background variables, such as native language and country, English language learning, and language contact experience. This diversity makes it possible to examine communicative language ability in its relative sense across different test-taker groups. With test-takers from different backgrounds, the results of the test could also offer unique perspectives on how different educational, social, and cultural environments affect the development of FL ability.
Based on TOEFL iBT® test performance, this study investigates the dimensionality of communicative language ability and its latent factorial structure across test-taker groups with different context-of-learning experiences. The study also intends to examine the impact of the length of formal learning, as well as the joint impact of learning and time abroad on the development of FL ability.
The first goal of this study concerns the nature of communicative language ability, as measured by the TOEFL iBT test, and how this ability relates to the COE model. Language competencies within individual users and the context of language use are both parts of the construct definition in the COE model. The COE model originally characterized the language use situation by five variables: participants, content, setting, purpose, and register. However, it was decided later to organize language tasks by modality largely because the results of surveys (Ginther & Grant, 1996; Taylor, 1993) showed that the majority of test users would like to have scores reported for speaking, writing, listening, and reading. It was obvious that there was a mismatch between the current thinking in applied linguistics and popular belief about what language proficiency meant. As pointed out by Chapelle et al. (1997), "[w]here the skills approach falls short, in the view of applied linguists, is its failure to account for the impact of the context of language use" (p. 26). Examining the relationship between the context of language use and cognitive skill-based capacities of language users, and whether the latent structure of test performance reflects the role of context in defining language ability, would offer the theoretical framework for addressing this goal.
The second goal concerns the relationships between communicative language ability and learner/test-taker background variables. This research interest has been shared by the second language research community as well as the TOEFL® research community. Several past TOEFL studies were devoted to validating score interpretation and use in light of differences of test-takers on background variables, such as reasons for taking the test (Swinton & Powers, 1980), native language background (Stricker et al., 2005; Stricker & Rock, 2008; Swinton & Powers, 1980), target language exposure in home countries (Stricker & Rock, 2008), and years of classroom instruction (Stricker & Rock, 2008). The current study aims to investigate the joint impact of study-abroad and formal learning experiences on
communicative language ability, based on TOEFL iBT® test performance. Although the test is intended for people whose first language is not English (Jamieson et al., 2000), some of the test-takers may have the experience of living in an English language environment, and others may not. Borrowing the concept of language contact from the study-abroad literature and treating it as a test-taker characteristic (TTC), this study first looks at whether or not having contact experience with the target language community has any effect on the underlying structure of test performance.
This part of the investigation examines communicative language ability in its relative sense. In other words, the result tells us whether or not the language abilities developed in groups of test-takers with different language contact experiences share the same underlying structure. From a test development and validation perspective, the outcomes of this investigation contain crucial information that can be used for test validation. Understanding how a test functions across different test-taker groups tells whether or not the test scores can be interpreted and used in a similar way across these groups. Since the test is administered both domestically (where English is the dominant language) and internationally (where English is not the dominant language), the results also inform us whether it is reasonable to treat scores from the two administration contexts as comparable.
Second, test-takers may also vary in terms of the length of study-abroad and formal language training they have received. The length of study-abroad as well as the length of learning are examined together to facilitate further understanding of their joint impact on the development of language ability. The results have both pedagogical and practical implications. Pedagogically speaking, results of such an investigation can offer empirical evidence of the impact of study abroad, and how it interacts with the effect of learning. From a practical perspective, findings can be used to advise future test-takers on how to prepare for the test.
The study-abroad and learning experiences are both salient in the target test-taking population, and their impact on test performance deserves an in-depth investigation in the context of TOEFL iBT® testing. The outcomes would inform us what to expect from and how to deal with the increasingly diverse and constantly changing test-taking population.
Research Questions
The first research question asks what the nature of communicative language ability is, as measured by the TOEFL iBT® test. To be more specific, this investigation focuses on answering the following two questions: (1) what the constituents of communicative language ability are and how they are related; and (2) whether the role of context in defining communicative language ability can be reflected in the latent structure of test performance.
The operational test adopts the four-skills approach to test design, and it reports a separate score on each of the skills (listening, reading, writing, and speaking) as well as a total score. Factor-analytic studies (Sawaki et al., 2008; Stricker & Rock, 2008) have provided supporting evidence for this skill-based design and reporting scheme. Stricker and Rock (2008), using task-based scores from 2720 test-takers during the TOEFL iBT field test, found that a correlated four-factor model and a higher-order model with four first-order factors fit the data similarly well. They concluded that the latter was the best model based on the principle of parsimony. Sawaki et al. (2008) worked with the same dataset as the one in Stricker and Rock (2008) but conducted the analysis based on item-level scores. They found that the fit of the correlated four-factor model was better than that of the higher-order model based on the result of a chi-square difference test, although the differences in other fit indices were minimal. They concluded that the higher-order model remained a defensible representation, given the minimal differences and its parsimony.
In an earlier study, Stricker et al. (2005) examined the factor structure of the LanguEdge™ test, a prototype of the TOEFL iBT® test. Based on task-based scores, the authors found a correlated two-factor model across three native language groups (Arabic, Chinese, and Spanish). One of the factors was a speaking factor, and the other was a combination of the remaining skills.
The higher-order model and the correlated four-factor model had both exhibited good fit to TOEFL® test performance in the past. Since the LanguEdge test used in Stricker et al. (2005) and the TOEFL iBT test have similar test structures and item types, the correlated two-factor model found in Stricker et al. (2005) could also provide a suitable solution with the current dataset. Based on the results of these studies, a higher-order model, a correlated four-factor model, and a correlated two-factor model are all plausible factorial solutions. To address the first focus of research question one, the following hypothesis is formulated: Hypothesis 1 states that the communicative language ability measured by the TOEFL iBT test can be best explained by a higher-order model, and can also be explained adequately by a correlated four-factor model and a correlated two-factor model.
In the context of TOEFL testing, the context of language use is interpreted as part of the definition of communicative language ability (Chapelle et al., 1997), and language use situation is characterized by five variables: participants, content, setting, purpose, and register (Jamieson et al., 2000). The content variable refers to the topic of a task. The setting variable refers to the location of a language act. The content and the setting variables are key ingredients in defining the context of language use. To address the second focus of research question one, models with situational factors, content and setting, are subject to evaluation of fit. The two corresponding hypotheses are formulated as follows: Hypothesis 2 states that adding a content dimension to the baseline model established through testing Hypothesis 1 improves model fit, and therefore demonstrates the role of context in defining communicative language ability. Hypothesis 3 states that adding a setting dimension to the baseline model established through testing Hypothesis 1 improves model fit, and therefore demonstrates the role of context in defining communicative language ability.
The second research question asks whether communicative language ability, as measured by the TOEFL iBT® test, has equivalent representation across groups of test-takers. Specifically, it asks whether the latent structure of test performance is invariant across two groups of test-takers, with one group having been exposed to an English-speaking environment and the other without such experience. There is a handful of studies that have examined the nature of FL proficiency in relation to TTCs by using multi-group invariance analysis. The structure of FL proficiency was found to be equivalent across test-taker groups in some studies (Stricker et al., 2005; Stricker & Rock, 2008; Shin, 2005; Wang, 2006), but different in others (Bae & Bachman, 1998; Ginther & Stevens, 1998; Kunnan, 1992; Morgan & Mazzeo, 1988; Römhild, 2008; Sang et al., 1986). Among these studies, only Morgan and Mazzeo (1988) used study-abroad experience as a TTC to define and compare groups, comparing the group of test-takers who had spent at least one month in the target language (French) environment with those who had not.
Due to the lack of empirical studies examining study-abroad experience via a latent trait approach, insights were drawn from studies on language contact at home. Although the experience of language contact at home is not equivalent to the one of studying abroad in the target language society, they share similarities to a certain extent. In the same study conducted by Morgan and Mazzeo (1988), no equivalence was found in the structure of language ability between the heritage learner group and the standard group. Bae and Bachman (1998) also found that the heritage Korean learner group and the non-heritage Korean learner group differed in terms of the underlying structure of their language ability. Based on these observations, the following hypothesis is formulated: Hypothesis 4 states that the two groups of test-takers, one with exposure to an English-speaking environment and the other without such experience, differ in the underlying configuration of their communicative language ability.
The third research question investigates the impact of length of time abroad and length of formal learning on communicative language ability, as measured by the TOEFL iBT® test. Specifically, this investigation addresses the following two questions: (1) whether the length of formal learning differentiates test-takers who do not have study-abroad experience; and (2) whether the length of time abroad and formal learning jointly differentiate test-takers who have study-abroad experience.
Only one study, conducted by Kunnan (1994, 1995), investigated the impact of study-abroad experience on test performance via a structural equation modeling (SEM) approach. His findings suggested that both home country formal instruction and exposure to the target language environment had influences on aspects of language ability, and the influences could be either positive or negative. Drawing on insights from study-abroad studies, Davidson argued that the length of study abroad was positively correlated with language gains (as cited in Dewey, 2004, p. 322), whereas Freed (1995) claimed that formal language instruction was an important factor in language development. Related research showed that years of formal language instruction had an impact on Spanish proficiency gain. Magnan and Back (2007) claimed that coursework had a positive influence on linguistic gain in French. The literature indicates that both formal instruction and study-abroad experience have influences on language development. The two hypotheses related to the third research question are formulated as follows: Hypothesis 5 states that for test-takers without study-abroad experience, the development of their language ability is associated with the length of formal learning. Hypothesis 6 states that for test-takers with study-abroad experience, the development of their language ability is associated with the length of formal learning and the length of study-abroad.
A request for access to the TOEFL iBT public dataset was submitted to ETS, and was approved by ETS. The dataset contained two test forms (Form A and Form B) and test performance of 1000 test-takers on each form. Form A and its associated test performance from 1000 test-takers (Sample A) were used in the analysis in this study. A description of this sample follows.
One thousand test-takers were randomly drawn from one TOEFL iBT administration during fall 2006. Each test-taker was linked to a unique sample identification number. Background information was recorded for each test-taker at the time of the test administration. Test-takers who took the test in the United States or Canada were identified as domestic, whereas those who took the test in all other countries were identified as overseas. There were 418 domestic test-takers and 582 overseas test-takers. The recorded background information included age, gender, native country, and native language. Test-takers were also asked to respond to questions regarding the reason for taking the test, the type of institution in which they were interested, the amount of financial support they expected, the amount of time they spent studying English, the amount of time they spent in content classes taught in English, and whether (and for how long) they had lived in an English-speaking country.
To study the impact of target language exposure on English proficiency as manifested in TOEFL iBT® test performance, the first step was to identify who had had the exposure and who had not upon taking the test. Although test-taking location was recorded for every test-taker, it could not be used as a reliable indicator for identifying the group membership. Among the 582 people who took the test overseas, 106 of them indicated that they had lived in an English-speaking country. It was very likely that these test-takers, after having been exposed to English, had relocated and therefore took the test at an overseas test center. Another indicator of the group membership was test-takers' responses to the question 'have you ever lived in a country in which English is the main language spoken'. This indicator was used to identify the two groups. It also contained information on the length of English language exposure.
The third research question concerns test-takers' study-abroad experience and learning experience, and their joint impact on test performance. Test-takers' responses on two more background questions were used for this investigation. One question concerned the amount of time they spent studying English, and the other was about the amount of time they spent in content classes taught in English.
A total of 399 test-takers responded to all three of these background questions. Among these 399 test-takers, 29 said that they had never lived in an English-speaking country, although the record showed that they took the test either in the United States or Canada. Bearing in mind that these 29 test-takers were physically located in an English-speaking country at the time they took the test, their responses to this question contradicted documented fact. One possible cause for this inconsistency could be that these test-takers were not able to fully understand the question and therefore provided inaccurate information (X. Xi, personal communication, November 17, 2010). Another explanation could be that they took the test immediately on arriving in the US or Canada (J. E. Liskin-Gasparro, personal communication, January 17, 2011). Among the 399 test-takers who answered all three background questions, this study identified 370 test-takers who provided consistent information; these 370 constituted the study sample.
The location of test-taking was recorded for each test-taker. This information told us where a test-taker was physically located when taking the test. Based on test-taking location, inferences could be made about what kind of linguistic environment a test-taker was immersed in at the time of testing. One hundred fifty-seven subjects took the test at test centers located either in the United States or Canada. The remaining 213 subjects took the test at overseas test centers.
All subjects provided age information. The average age of these 370 test-takers was 24 at the time of testing. The youngest subject was 14, and the oldest was 51. The majority of the subjects (about 85%) were between the ages of 15 and 30.
Of the 370 subjects, 321 reported their gender, among whom 54% were male and 46% were female.
Test-takers were asked to use a list of country and region codes to indicate their native countries (where they were born), and a list of native language codes to indicate their native languages. All but one of the 370 subjects responded to the country question. The 369 respondents were from a total of 56 countries or regions. More than half of the subjects were from the seven largest groups: Korea, Japan, India, Taiwan, France, Thailand, and China. With regard to native languages, a total of 38 different native languages were represented in this sample. The five most frequently spoken native languages, in order of the number of speakers, were Korean, Japanese, Chinese, Spanish, and Arabic. Native speakers of these five languages made up about 59% of the total sample. Four subjects did not report their native languages.
Test-takers were asked to respond to the question 'what is your main reason for taking TOEFL.' They were provided with the following answer choices: (1) to enter a college or university as an undergraduate student, (2) to enter a college or university as a graduate student, (3) to enter a school other than a college or university, (4) to become licensed or certified in a profession, (5) to demonstrate my proficiency in English to the company for which I work or expect to work, and (6) other than above. Out of the 370 subjects, 366 responded. About 88% of the respondents indicated that they took the test in order to enter a college or university either as an undergraduate or a graduate student.
When asked the question 'what types of institution are you interested in attending,' 367 subjects responded. The provided answer choices were: (1) four-year college or university, (2) two-year college, (3) graduate school, (4) ESL institute, and (5) don't know. Subjects were allowed to choose more than one type of institution. Among the 367 subjects who responded, none of them chose more than one type of institution. About 88% of them indicated that they were interested in attending either a four-year college or university or a graduate school.
Of the 370 subjects, 356 answered the question 'how much do you and your family expect to contribute annually toward your study in the United States or Canada.' They could choose from the following answer choices: (1) less than $5000, (2) $5000 to $10,000, (3) $10,001 to $15,000, (4) $15,001 to $25,000, (5) more than $25,000, or (6) don't know. More than a third of the respondents indicated that they did not know the answer. About one fifth of them expected to contribute more than $25,000 annually.
All 370 subjects answered the question 'how much time have you spent studying English (in a secondary or post-secondary school).' The answer choices were: (1) none, (2) less than 1 year, (3) 1 year or more, but less than 2 years, (4) 2 years or more, but less than 5 years, (5) 5 years or more, but less than 10 years, and (6) 10 years or more. The majority of them (about 64%) reported that they had studied English for at least 5 years by the time they took the test. A third of the subjects had studied English for 10 years or more.
All 370 subjects responded to the question 'how much time have you attended a school where subjects other than English (for example, mathematics or chemistry) were taught in English.' The answer choices were: (1) none, (2) less than 1 year, (3) 1 year or more but less than 2 years, (4) 2 years or more but less than 5 years, and (5) 5 years or more. About a third of them indicated that they had never had such experience. Close to 60% of the subjects indicated that they had at least one year of such experience.
All 370 subjects responded to the question ‘have you ever lived in a country in
which English is the main language spoken.’ To answer this question, they chose from
the following possible answers: (1) no, (2) yes, for less than 6 months, (3) yes, for 6
months to 1 year, and (4) yes, for more than 1 year. About two-thirds of the subjects
indicated that they had lived in an English-speaking country upon taking the test.
The sample of 370 subjects used in this study was just a fraction of the total
random sample of 1000 test-takers generated from one TOEFL iBT® test administration.
Since the study sample was not randomly generated, two steps were taken to ensure that
the study sample was comparable to the random sample of 1000 test-takers.
First, the study sample (N=370) was compared to the total random sample (N=1000) on all available background variables collected for the test-takers (see Table 1). Second, one-sample t-tests for the section and total scores were conducted to detect any significant differences in test performance between the two samples.
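As an illustration of this second step, a one-sample t-test compares each study-sample mean against the total-sample mean treated as the population value. The sketch below assumes hypothetical column names (listening, reading, speaking, writing, total) and placeholder file names; the actual variable layout of the dataset may differ.

import pandas as pd
from scipy import stats

# Hypothetical data frames: study sample (N=370) and total sample (N=1000).
study = pd.read_csv("sample_a_study.csv")   # placeholder file name
total = pd.read_csv("sample_a_total.csv")   # placeholder file name

for score in ["listening", "reading", "speaking", "writing", "total"]:
    # Test the study-sample mean against the total-sample mean,
    # which plays the role of the hypothesized population mean.
    t, p = stats.ttest_1samp(study[score], popmean=total[score].mean())
    print(f"{score}: t = {t:.2f}, p = {p:.4f}")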
The percentages in Table 1 show that both samples had the same distribution on
the location variable. The age and gender distributions in the two samples resembled each
other well.
With regard to country of origin and native language, more than half of the test-takers came from the same seven countries in both samples. However, the percentage of test-takers from Mainland China in the study sample was disproportionately low, compared to the one in the total sample. The five most frequently spoken native languages among test-takers were also the same across the two samples, although the order of the two lists differed. It is noteworthy that a disproportionately high percentage of test-takers from Mainland China were not included in the study sample because these test-takers did not answer all three background questions regarding their learning and exposure experiences. Test-takers from India formed a sizable group in both samples. The native languages spoken by these test-takers included a variety of languages native to people living in the various parts of the country. Due to this diverse linguistic situation, none of these languages had a large native language group in either sample.
Although English has a special administrative role in Indian society, very few test-takers considered English as their native language. In the total sample of 1000 test-takers, among the 120 Indian test-takers, only five reported their first language as English. In the study sample of 370 test-takers, among the 45 Indian test-takers, none of them indicated their first language as English. This observation allowed us to treat the Indian test-takers as learners of English as a FL.
English also has a special administrative role in the region of Hong Kong. All 11 test-takers from Hong Kong in the total sample identified their native language as Chinese. This observation allowed us to treat the test-takers from Hong Kong as learners of English as a FL as well.
Regarding variables that had a large number of missing values in the total sample, such as reason for taking the test, financial support, time spent studying English, time spent in content classes taught in English, and the time spent living in an English-speaking country, the distribution patterns of the obtained responses were similar across the two samples. Percentages based on the test-takers who actually responded to these questions were also reported. For example, the table shows that among the total 1000 test-takers, 464 test-takers (N=464) actually responded to the question asking for their reason for taking the test, and 36% of them indicated that they took the test to enter a college or university. These percentages were very close to the ones based on the study sample of 370 test-takers.
Furthermore, the four section scores and the total score were compared across the two samples. The descriptive statistics shown in Table 2 indicated that the section and total mean scores were all higher in the study sample than the ones in the total sample.
A series of one-sample t-tests was conducted to compare the means of the study sample to the ones of the total sample. The means of the study sample were tested against the total sample means to evaluate the statistical significance of the differences. The results, summarized in Table 3, showed that only the listening score was found to be significantly higher in the study sample than the one in the total sample at the p level of 0.01. At the p level of 0.05, the listening score and the total score were found to be significantly higher in the study sample than the ones in the total sample. These results suggested that the study sample did not deviate from the total sample significantly with respect to most of the background variables and test scores. However, some discrepancies were found regarding the percentages of test-takers from Mainland China and Chinese native speakers. The reason for these discrepancies remains unknown. This should be kept in mind when generalizing the findings from this study to the total sample of test-takers.
Measures
The test used in this study was an operational form of the TOEFL iBT test administered during the fall of 2006. The test had four sections: listening, reading, speaking, and writing.
The listening section had six tasks. Each listening task had a prompt followed by 5 or 6 questions. There were 34 items in total in the listening section. Each item was scored one point for a correct answer, and zero points for a wrong answer. Items that were not reached or were omitted were marked as N or M, respectively. The total possible raw score points for the listening section was 34.
The reading section had three tasks. Each reading task had a prompt followed by a set of questions. Most of the reading items, worth one point each, were dichotomously scored. Three items were polytomously scored, worth either two or three points. Items that were not reached or were omitted were marked as N or M, respectively. The total possible raw score points for the reading section therefore exceeded the number of items.
The speaking section contained six tasks. The first two tasks asked test-takers to speak in response to a prompt on a familiar topic. These two tasks were considered independent because the required responses were not dependent on any information provided through other channels during the test. The other four were integrated speaking tasks. These tasks required test-takers to provide oral responses based on the information they received through listening or reading or both channels. Each task was rated on a 0–4 holistic scale at one-point intervals. The total possible raw score points for the speaking section was 24.
The writing section consisted of two tasks. The first task was an integrated task that required test-takers to provide written responses based on the information they received through listening and reading. The second one, an independent task, asked test-takers to write in response to a written prompt. Each task was rated on a 1–5 holistic scale at half-point intervals (up to the first decimal place). The total possible raw score points for the writing section was 10.
In summary, the whole test had 17 language tasks, organized into four sections by modality. There were six listening tasks, three reading tasks, six speaking tasks, and two writing tasks. To help understand the dynamics between language skill and the context of language use manifested within each task, the situations of these tasks are described as follows.
As noted earlier, a language use situation in the COE model is characterized by five variables: participants, content, setting, purpose, and register (Jamieson et al., 2000). The content refers to the topic of a task. The setting refers to the location of a language act.
The first listening task (L1) was situated in a conversation between a male student
and a female biology professor at the professor’s office. The participants talked mainly
about how to prepare for an upcoming test. The nature of the interaction was consultative
with frequent turns. To complete this task, test-takers were asked to respond to five
multiple-choice questions.
The second listening task (L2) presented part of a lecture in an art history class in
a formal classroom setting. The male professor was the only participant. The language
used by the professor was formal. There was no interaction between the professor and the
audience during the task. To complete this task, test-takers were asked to respond to six
multiple-choice questions.
The third listening task (L3) presented part of a lecture in a meteorology class in a formal classroom setting. Three participants were involved: one male professor, one male student, and one female student. The language used was formal with periods of short interaction between the professor and the students. To complete this task, test-takers were asked to respond to six multiple-choice questions.
The fourth listening task (L4) was situated in a conversation between a female student and a male employee at the university housing office. The topic of their conversation was
housing opportunities on and off campus. The nature of the interaction was consultative
with relatively short turns. To complete this task, test-takers were asked to respond to five
multiple-choice questions.
The fifth listening task (L5) presented part of a lecture in an education class in a
formal classroom setting. The participants were a female professor and a male student.
The language used was formal with periods of short interaction between the professor
and the student. To complete this task, test-takers were asked to respond to six multiple-
choice questions.
The last listening task (L6) presented part of a lecture in an environmental science class in a formal classroom setting. The male professor was the only participant. The language used was formal with no interaction between the professor and the audience during the entire listening time. To complete this task, test-takers were asked to respond to six multiple-choice questions.
In the reading section, all three reading tasks were based on academic content. In the first task (R1), test-takers were asked to respond to 14 multiple-choice questions of various kinds based on a reading passage on the topic of psychology. The second task (R2) asked test-takers to respond to questions based on a reading passage about archeology. The last task (R3) required the test-takers to answer questions based on a reading passage about biology. The language used in the readings was formal in register. The settings of these language tasks remained unknown since such information was not provided in the task specifications.
Lack of context development was also found for the first speaking task (S1) and the second speaking task (S2). Neither task specified the situation of language use. The topics for both tasks were non-academic. Both tasks required test-takers to provide an oral response to a prompt.
The third speaking task (S3) had a reading and a listening component. The topic was the university's plan to renovate the library. The reading component required test-takers to read a short article in the student newspaper about the change. The listening component presented a conversation between two students in a non-academic setting discussing their opinions about this renovation plan. The two participants interacted with each other frequently with relatively short turns. The test-takers were required to give an oral response to a prompt based on the content of the reading and listening materials.
The fourth speaking task (S4) also had a reading and a listening component. The reading component presented a short passage on an academic topic. The listening component presented part of a lecture on the same topic delivered by a male professor in a formal classroom setting. The language used in this task was formal. There was no interaction between the professor and his audience during the listening time. Test-takers were required to give an oral response to a prompt based on the content of the reading and listening materials.
The fifth speaking task (S5) contained a listening component. The listening part was situated in a conversation between a male and a female professor in an office setting. The focus of their dialogue was related to the class requirements for a student. The register of the language was consultative in nature with frequent turns between the two participants. Test-takers were required to give an oral response to a prompt based on the content of the conversation.
The last speaking task (S6) had a listening component. The listening part presented part of a lecture delivered by a professor in a formal classroom setting. The language used was formal with no interaction between the professor and the audience. Test-takers were required to give an oral response to a prompt based on the content of the lecture.
The first writing task (W1) was integrated with reading and listening. Test-takers were first asked to read a passage about an academic topic and then to listen to part of a lecture on the same topic delivered by a male professor in a classroom setting. There was no interaction between the professor and his audience during the entire listening time. The test-takers were required to give a written response to a prompt based on the content of the reading and listening materials.
The second writing task (W2) asked test-takers to write on a non-academic topic. The context of language use in this task remained unknown since this information was not provided in the task specification. The test-takers were required to provide a written response to a prompt.
Based on the descriptions of the tasks, two broad categories emerged with regard
to task content. The first type was academically oriented. The content of these tasks was
developed mainly based on scholarly or textbook articles in the realm of natural and
social sciences. The second type involved topics that were related to courses (e.g., exam
preparation) or to life on campus (e.g., student housing). The development of these tasks
did not rely on information in a particular academic area. In other words, having previous knowledge of a particular academic topic was not likely to interact with test-takers' responses to this type of task. By this categorization, all 17 tasks were sorted into either academic or non-academic. There were ten tasks with academic content and seven with non-academic content.
The tasks were also sorted by the location where a language act occurred. Unfortunately, the information on setting was not always provided. Lack of context development was observed for all reading tasks as well as some speaking and writing tasks. Tasks for which the setting of language use was developed could be divided into two groups: instructional and non-instructional. The first group of tasks involved language acts that took place inside classrooms. In this type of setting, interactions among the participants (if any) were usually sporadic, and the language used was academically oriented and formal. The second type of tasks took place outside classrooms (e.g., a professor's office, the library), where interactions were usually more frequent and the language used tended to be less formal. Table 4 summarizes the 17 tasks in terms of skill, content, and setting.
Analysis Procedures
The dataset used in this study included 370 test-takers' scores on 17 skill-based language tasks. Listening and reading items that were not reached or were omitted were assigned a score of zero. There was no missing score value for the speaking and writing tasks.
Descriptive analyses were conducted using Predictive Analytics SoftWare® Statistics 18 (SPSS Inc., 2009). Analyses involving latent variables were conducted in Mplus.
Level of Measure
The choice of the level of measure in a factor-analytic study is driven by practical considerations and theoretical needs. This study used task scores, also called item parcel scores, as the level of measure. For the listening and reading sections, a task score was the total score summed across a set of items based on a common prompt. Six listening task scores and three reading task scores were therefore obtained. A task score in the writing and speaking sections was simply the score assigned for a task. Six speaking task scores and two writing task scores were therefore obtained. Each task score was used in the analysis as an observed variable. There were 17 observed variables in the study in total. Variable names were the same as the task names listed in Table 4.
There are several reasons for choosing task scores as the level of measure. First, one of the research questions is to examine the relationship between language skill and the context of language use in the underlying structure of the test performance. Two key elements used for characterizing a language use situation were content and setting, both of which could be defined at the task level. Individual items within a task all shared the same focus of content and setting. Secondly, using task scores instead of item scores unified the level of measurement: the summed scores based on the dichotomously scored listening and reading items could be treated as continuous, as could the ratings based on the polytomously scored speaking and writing tasks. Within the framework of structural equation modeling (SEM), Kunnan (1998b) recommended not mixing different levels of measurement in a single covariance or correlation matrix. The third reason for using this level of measure is to reduce the chance of having correlations among the observed variables that are extremely high. It is very likely for items based on a common prompt to have dependence upon one another. Multicollinearity among the observed variables can be one of the reasons why an estimation process fails (Kline, 2005). Using task scores instead of item scores, as suggested by Stricker and Rock (2008), would help to alleviate the problem caused by the dependence among items associated with a common prompt in this study. The last reason comes out of concern for the sample size needed for the planned multivariate analysis. The more variables used in an analysis, the larger the sample size needed. Using task scores would help to reduce the number of observed variables in this analysis, and therefore to reduce the sample size needed for the study.
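In code, the parceling step amounts to summing item scores within each prompt. The sketch below uses hypothetical item column names (e.g., l1_01 for item 1 of listening task 1) and a placeholder file name, since the actual layout of the public dataset is not reproduced here.

import pandas as pd

# Hypothetical item-level data: columns such as "l1_01" ... "l1_05" hold
# the item scores belonging to listening task 1, and so on.
items = pd.read_csv("sample_a_items.csv")  # placeholder file name

def parcel(df: pd.DataFrame, prefix: str) -> pd.Series:
    """Sum all item columns sharing a task prefix into one task score."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    return df[cols].sum(axis=1)

tasks = pd.DataFrame()
for t in [f"l{i}_" for i in range(1, 7)] + [f"r{i}_" for i in range(1, 4)]:
    tasks[t.rstrip("_")] = parcel(items, t)
# Speaking (S1-S6) and writing (W1, W2) ratings are already task-level
# scores and would be carried over unchanged.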
Distribution of Values
The choice of an estimation method for latent analysis depends on the distribution of observed values. The most commonly used estimation methods assume multivariate normality. Therefore, both univariate and multivariate normality were inspected. Univariate normality was checked by examining the skewness and kurtosis indices, and by examining the plots of score distributions. Multivariate normality was evaluated following the procedures suggested by Kline (2005). The distribution of the values was examined so that an appropriate estimation method could be chosen.
Most estimation procedures for multivariate analysis are based either on a covariance or a correlation matrix, both of which capture only linear relationships. If the relationship between two variables is not linear, this non-linearity will not be captured in the analysis. Linearity was examined using scatter plots of all possible pairs of the variables in this study.
Bivariate correlations were also screened to check whether any of them were extremely high. If one variable is highly correlated with another, then it means that one of the variables is redundant in terms of measuring the construct. Kline (2005) suggested screening the observed variables for such redundancy before model fitting.
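The univariate screening and the redundancy check are straightforward to express in code. The following sketch assumes a hypothetical data frame of the 17 task scores; the 0.9 cutoff for flagging extreme correlations echoes the criterion adopted later for factor correlations and is used here only for illustration.

import numpy as np
import pandas as pd
from scipy import stats

tasks = pd.read_csv("sample_a_tasks.csv")  # placeholder: 17 task-score columns

# Univariate normality screen: skewness and excess kurtosis per variable.
for col in tasks.columns:
    sk = stats.skew(tasks[col])
    ku = stats.kurtosis(tasks[col])  # excess kurtosis; 0 under normality
    print(f"{col}: skew = {sk:.2f}, kurtosis = {ku:.2f}")

# Redundancy screen: flag any extremely high bivariate correlation.
corr = tasks.corr().abs()
high = np.argwhere(np.triu(corr.values, k=1) > 0.9)
for i, j in high:
    print(f"possible redundancy: {tasks.columns[i]} ~ {tasks.columns[j]}")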
Estimation Method
The default estimator for latent analysis with continuous variables is maximum likelihood (ML), which produces conventional standard errors and a chi-square test statistic. Since ML estimation is sensitive to violations of multivariate normality, a corrected normal theory method should be used to avoid bias caused by non-normality in the dataset, as recommended by Kline (2005). By using a corrected method, the original data is analyzed using a normal theory method (ML in this case), but the estimates of standard errors are robust to non-normality and the test statistics are corrected. Mplus provides such a corrected normal theory estimation method (called MLM), which produces maximum likelihood parameter estimates with standard errors and a mean-adjusted chi-square test statistic that are robust to non-normality. The MLM chi-square is also referred to as the Satorra-Bentler scaled chi-square.
The adequacy and appropriateness of the models were evaluated based on two criteria: (1) the values of selected overall model fit indices, and (2) the significance and reasonableness of individual parameter estimates. The selection of model fit indices used in this study was based on Kline's (2005) suggestions. Below is a brief description of each index.
Model Chi-Square
The value of the chi-square (χ2) statistic reflects the distance in fit between the model-implied covariance structure and the observed data. A χ2 test evaluates the statistical significance of this distance. The lower the value is, the closer the two structures are, and therefore the better the model fits the data.
Normed Chi-Square
The χ2 statistic is sensitive to sample size. The value tends to be high when the sample size is large. This could lead to rejecting a model whose deviation from the observed data structure may not be significant (Bollen, 1989). Dividing the chi-square value by the degrees of freedom (df) can reduce the sensitivity of chi-square to sample size (Kline, 2005). The result is a lower value referred to as normed chi-square (χ2/df). A ratio of less than 3 is an indicator of good model fit, as recommended by Kline (1998). This criterion was adopted in the current study.
Root Mean Square Error of Approximation
Neither χ2 nor χ2/df has a built-in mechanism that corrects for model complexity. Generally speaking, if two models show equivalent fit to the same data, the simpler one is preferred over the more complex one based on the principle of parsimony. When dealing with the same data, the simpler a model is, the fewer parameters are estimated, and the higher the degrees of freedom are. The root mean square error of approximation (RMSEA) takes degrees of freedom into account and therefore favors simpler models. A value of zero indicates a perfect fit. The higher the value goes, the worse the fit gets. An RMSEA smaller than 0.05 can be interpreted as a sign of good model fit, while values between 0.05 and 0.08 indicate reasonable approximation of error (Browne & Cudeck, 1993). This criterion was adopted in the current study.
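The standard formula makes the parsimony correction explicit: the degrees of freedom enter the denominator, so simpler models (higher df) yield lower values, everything else being equal:

\[
\mathrm{RMSEA} = \sqrt{\frac{\max(\chi^2 - df,\ 0)}{df\,(N-1)}}.
\]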
Comparative Fit Index
The comparative fit index (CFI) compares the fit of the specified model to the fit of a baseline model which assumes zero covariances among the observed variables. Because it is usually unrealistic to assume that variables are uncorrelated, the fit of such a baseline model is typically very poor. The improvement in fit over the baseline model indicates how much better the specified model accounts for the data. A rule of thumb, suggested by Hu and Bentler (1999), is that a CFI value larger than 0.9 shows the specified model has a good fit. This criterion was adopted in the current study.
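In its usual form, the index rescales the noncentrality of the specified model (M) against that of the baseline model (B):

\[
\mathrm{CFI} = 1 - \frac{\max(\chi^2_M - df_M,\ 0)}{\max(\chi^2_B - df_B,\ 0)}.
\]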
Standardized Root Mean Square Residual
The standardized root mean square residual (SRMR) is an absolute fit index which is based on the mean absolute correlation residual. The size of a correlation residual indicates how an observed correlation matrix differs from the model-implied one. An SRMR value of zero shows that there is no difference between the two correlation matrices, indicating a perfect model fit. An SRMR value less than 0.1 is commonly considered a sign of acceptable fit (Kline, 2005). This criterion was adopted in the current study.
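Taken together, the cutoffs form a simple screening rule. The helper below is a sketch rather than the study's actual code: it computes the normed chi-square, RMSEA, and CFI from a model's chi-square output and checks every criterion (an SRMR value, which requires the residual matrix, is passed in directly).

import math

def evaluate_fit(chi2, df, n, chi2_base, df_base, srmr):
    """Apply the cutoffs adopted in this study to one fitted model."""
    normed = chi2 / df
    rmsea = math.sqrt(max(chi2 - df, 0) / (df * (n - 1)))
    cfi = 1 - max(chi2 - df, 0) / max(chi2_base - df_base, 1e-12)
    return {
        "normed_chi2": (normed, normed < 3),   # Kline (1998)
        "rmsea": (rmsea, rmsea < 0.08),        # Browne & Cudeck (1993)
        "cfi": (cfi, cfi > 0.9),               # Hu & Bentler (1999)
        "srmr": (srmr, srmr < 0.1),            # Kline (2005)
    }

# Example with made-up values:
for index, (value, ok) in evaluate_fit(
        chi2=250.0, df=115, n=370,
        chi2_base=4000.0, df_base=136, srmr=0.045).items():
    print(f"{index}: {value:.3f} ({'pass' if ok else 'fail'})")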
In addition to the overall fit indices, individual parameter estimates were also inspected. The sign of an estimate should be checked to ensure that the meaning of the estimate is theoretically sound. The value of an estimate divided by its standard error provides a test statistic that can be used to evaluate the significance of the estimate. Multicollinearity among latent factors can also be detected by examining their estimated correlations. An extremely high correlation estimate between two factors indicates a linear dependency among the factors. This means that the factors are empirically indistinguishable, which makes a model implausible. Previous researchers (Sawaki et al., 2008; Stricker et al., 2005; Stricker & Rock, 2008) used a value of 0.9 to screen out extremely high correlations among factors. This criterion was adopted in the current study.
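The significance check mentioned here is the usual Wald-type ratio, referred to the standard normal distribution:

\[
z = \frac{\hat{\theta}}{SE(\hat{\theta})}, \qquad |z| > 1.96 \ \Rightarrow\ \text{significant at } \alpha = .05.
\]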
Testing the first hypothesis examines whether the structure of the ability measured by the test conforms to the test design, the score reporting scheme, and the TOEFL® framework, and whether a previously established factor model based on similar test data could also be compatible with the current dataset. Hypothesis 1 states that the communicative language ability measured by the TOEFL iBT® test can be best explained by a higher-order model, and can also be explained adequately by a correlated four-factor model and a correlated two-factor model. A higher-order model, a correlated four-factor model, and a correlated two-factor model were specified a priori and tested for fit as competing models. All three models were shown to be compatible with previous TOEFL test data. Integrated speaking tasks whose completion required language processing in multiple modalities were found to load on the target modality (speaking), whereas integrated writing tasks were found to load on the designated writing factor (Sawaki et al., 2008; Sawaki et al., 2009). Therefore the integrated speaking and writing tasks were specified to load on their target modality in all three models. As the result of testing the first hypothesis, the model that best represented the latent structure of the dataset used in this study was established as the baseline model for the subsequent analyses.
Figure 1 illustrates the relationships among the observed variables, latent factors, and residual/unique variances in the higher-order model. The observed variables are represented by the rectangular boxes. Latent variables, including the four skill factors and the residuals, are indicated by the ellipses. The six listening variables (L1–L6) loaded on a common factor which could be referred to as the listening factor (L). In other words, the six listening tasks were the indicators of the presumed listening factor. The three reading tasks (R1–R3) loaded on a common factor which could be interpreted as the reading factor (R). A presumed speaking factor (S) was responsible for the relationships among the six speaking tasks (S1–S6). The two writing variables (W1 and W2) shared a common underlying factor, possibly a writing factor (W). These were the four first-order factors corresponding to the four modalities. The relationships among the first-order factors were accounted for by a higher-order factor, a presumed general language ability factor (G). In other words, the first-order factors were constrained to interact with one another only through the higher-order factor. The higher-order factor represented a common underlying dimension across the four first-order factors. The residual variances (E1–E17), the part of the variance of an indicator that could not be explained by its respective factor in the model, were uncorrelated with one another. Also referred to as measurement errors, the residual variances reflected how reliable the indicators were in measuring their respective factors.
The constituents and their relationships in the correlated four-factor model are illustrated in Figure 2. In the absence of a higher-order factor, the four factors, each corresponding to its respective modality, were modeled to correlate with one another.
The correlated two-factor model (Figure 3) was nested within the correlated four-factor model by constraining the correlations among the listening, reading, and writing factors to be one. In this model, variables from the listening, reading, and writing sections all loaded on a common factor, a presumed non-speaking factor (L/R/W). The six speaking variables loaded on the second factor, probably a speaking factor (S). The two factors were allowed to correlate.
The baseline model, confirmed in the step above, provided a platform for testing the role of the context of language use in defining the ability measured by the TOEFL iBT® test. Next, situational factors were added to the baseline model to evaluate whether they improved model fit.
Language tasks were categorized with regard to the context of language use. Task grouping was based on two key elements: content and setting. On the dimension of content, all tasks were categorized as either academic or non-academic. On the dimension of setting, when the setting of a task was provided, the task was labeled as either instructional or non-instructional. Hypothesis 2 states that adding a content dimension to the baseline model improves model fit, and therefore demonstrates the role of context in defining communicative language ability. To test this hypothesis, a content dimension was added to the baseline model from the previous hypothesis testing, and this two-dimensional model was evaluated for fit. In this model, each task loaded on two factors, a skill-based factor and a content-based factor. Ten tasks loaded on a common content factor associated with academic material, whereas the other seven loaded on a non-academic content factor.
Imposing the second dimension was expected to help explain the relationships among the observed variables together with the common skill-based factors. In the baseline model, the residual variances were estimated under the assumption that they were unique to their respective variables, and were not associated with one another in a systematic way. However, whether these residual variances are truly unique becomes a question of doubt if the context of language use is considered. Successful model testing would support the claim that performance on language tasks could be accounted for by the situational factors as well as the skill-based factors, and would therefore confirm the role of context in defining communicative language ability.
Hypothesis 3 states that adding a setting dimension to the baseline model improves model fit, and therefore demonstrates the role of context in defining communicative language ability. To test this hypothesis, a setting dimension was added to the baseline model, and this two-dimensional model was tested for fit. In this model, each task loaded on two factors, a skill-based factor and a setting-based factor. Seven tasks loaded on a common instructional setting factor, and four loaded on a common non-instructional setting factor. The remaining six tasks without setting development loaded on a third setting factor. Successful model testing would support the claim that performance on language tasks could be accounted for by the situational factors as well as the skill-based factors, and therefore confirmed the role of context in defining communicative language ability.
Based on the results of the analysis above, a model that best represented the test
construct was chosen as the final model for the entire sample to be used in the following
multi-group analysis.
The second research question asks whether communicative language ability, as measured by the TOEFL iBT® test, has equivalent representation across two groups of test-takers, with one group having been exposed to an English-speaking environment and the other without such experience. In other words, the results would inform us whether group membership moderated the relations among the variables in the factor model. In all steps of this analysis, Hypothesis 4 was at stake: the two groups, one with the exposure experience and the other without such experience, were hypothesized to differ in the underlying configuration of their communicative language ability. One group of 124 test-takers (Group I) had never lived in an English language environment upon taking the test. The other group of 246 test-takers (Group II) had experience of living in the target language community upon test-taking. The multi-group invariance analysis was conducted in a stepwise fashion.
The measurement component of the model was prioritized over the structural component. The measurement component consisted of the number of the factors, the relationships between the factors and their respective indicators, and the residual variances. The measurement part defined the meanings of the factors by specifying how they were measured and what their indicators were. The structural component included the variances and covariances of the factors. It described how the factors related to one another.
It would only make sense to ensure the equality of factor relationships after it has
been established that the factors have the same meanings across groups. This is why the
measurement component should be tested for equality first. If successful, then testing the structural component would follow. A mean structure was imposed in all steps of the analysis to examine group mean differences on latent factors. Factor means are part of the structural component of the model. Therefore the equality of factor means was tested when the structural component was
inspected.
None of the studies reviewed incorporated a mean structure. Bae and Bachman
(1998) called for using mean structures as a suggestion for future researchers. As pointed
out by these authors, a latent mean structure approach allows us to account for measurement error when comparing groups on their mean performance.
Measurement Invariance
The measurement invariance analysis proceeded in a hierarchically ordered manner. The first step was to test the equivalence of the overall factor structure
across the groups. The same factor structure was imposed on both groups
simultaneously. Parameter estimates in one group were allowed to vary from the ones in
the other group. The result of this step answered the question whether the factors had the same configuration across the groups.
The second step was to test the equivalence of factor loadings across the groups.
Equality constraints were imposed on all factor loadings across the groups. Factor loading
estimates in one group were not allowed to be different from the ones in the other group.
Residual variances were allowed to differ. This was a more restrictive model compared to
the model tested in the first step. Model fit was evaluated. Since the two models were
nested, a chi-square difference test was conducted to evaluate which model should be kept.
A chi-square difference test without significant result would indicate that the fit of the
more restrictive model did not deteriorate badly enough to justify adopting the more
liberal model. In this case, the more restrictive model would be adopted. The result of this
step would tell us whether the indicators measured the factors in a comparable way across
the groups.
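Because the Satorra-Bentler statistic is a scaled statistic, the plain difference between two scaled chi-squares is not itself chi-square distributed; a scaled difference must be computed instead. The following is a minimal sketch (in Python, illustrative only, with hypothetical input values, since scaling correction factors are not reported in this text) of the standard scaled-difference procedure used with the MLM estimator:

    from scipy.stats import chi2

    def sb_scaled_chisq_diff(t0, df0, c0, t1, df1, c1):
        """Satorra-Bentler scaled chi-square difference test for nested models.

        t0, df0, c0: scaled chi-square, degrees of freedom, and scaling
                     correction factor of the more restrictive (nested) model;
        t1, df1, c1: the same quantities for the less restrictive model.
        """
        dd = df0 - df1
        cd = (df0 * c0 - df1 * c1) / dd   # scaling correction for the difference
        trd = (t0 * c0 - t1 * c1) / cd    # scaled difference statistic
        return trd, dd, chi2.sf(trd, dd)  # statistic, df, p value

    # Hypothetical values for illustration only:
    trd, dd, p = sb_scaled_chisq_diff(t0=210.0, df0=133, c0=1.10,
                                      t1=195.0, df1=118, c1=1.12)
    print(f"scaled difference = {trd:.3f} on {dd} df, p = {p:.3f}")

A non-significant p value from this computation would support retaining the more restrictive model, as described above.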
The third step was to test the equivalence of residual variances in both groups.
Equality constraints were imposed on the residual variances along with the factor loadings
across the groups. This was a more restrictive model compared to the model tested in the
second step. Model fit was evaluated, and a chi-square difference test could be conducted to evaluate which model should be adopted. The result of this step would inform us whether the indicators in one group were as reliable as the ones in the other group at measuring their respective factors.
Structural Invariance
Factor means, variances, and covariances were the targets of this investigation.
The common practice is to test the invariance of factor means and covariances before testing factor variances, because groups are usually expected to differ in their variabilities on the common factors.
The first step was to test the invariance of factor means. The means of the factor
indicators (also called endogenous variables) were estimated as intercepts. The estimated
means of the indicators, the intercepts, were constrained to be equal across the groups for
model identification purposes. They were held equal so that the differences of factor
means could be estimated. The means of the latent factors (also called exogenous
variables) were fixed to be zero in one group and free to be estimated in the comparison
group. Model fit was evaluated. A chi-square difference test, between the current model and its preceding model, was conducted to evaluate which model should be adopted. The result would indicate the estimated relative standings of the groups on the latent factor means.
In the next step the equality of factor covariances was tested. These parameters
were held invariant for both groups. Model fit was evaluated, and a chi-square difference test was conducted to evaluate which model should be adopted. This would be followed
by a test of factor variance invariance if the factor covariances could be held equal.
The analyses above, however, could not tell whether the length of study-abroad experience had any effect on the development of the language ability. In
this section, structural equation models were built and tested to investigate the third
research question: what impact the lengths of study-abroad and classroom learning have on the development of communicative language ability, as measured by the TOEFL iBT®
test.
Three TTCs were introduced as independent variables. One was the length of time
spent studying English. The second one was the length of time spent in content classes
taught in English. The last one was the length of time spent living in an English-speaking
country. The first two characteristics concerned English language training in a formal setting. The last one concerned English language contact experience.
Not all three variables applied to both groups of test-takers. For Group I test-takers with no study-abroad experience, only the first two
variables were relevant. For Group II test-takers with study-abroad experience, all three
variables were relevant. Model testing was conducted separately for Group I and Group
II, because each group had a different set of relevant background variables.
Hypothesis 5 asserts that for test-takers who have no study-abroad experience (the
home-country group) the development of their language ability is associated with the
length of formal learning. To test this hypothesis, two independent variables – the time
spent studying English and the time spent in content classes taught in English – were
modeled to have direct associations with the language ability. This model was subjected to a test of fit, and the significance of the direct effects was also evaluated.
Hypothesis 6 states that for test-takers who have had study-abroad experience
(the study-abroad group) the development of their language ability is associated with the
length of formal learning and the length of study-abroad. Three independent variables
were modeled to have direct associations with the language ability. They were: the time
spent studying English, the time spent in content classes taught in English, and the length
of study-abroad experience. This model was subjected to a test of fit, and the significance of the direct effects was also evaluated.
CHAPTER FOUR
This chapter reports the results of testing the hypotheses put forward in the previous chapter. Results from establishing a model for the entire sample are reported first, followed by results from the multi-group invariance analysis across groups with different study-abroad and learning experiences. Last, the results of evaluating two unique structural equation models, one for each group, are presented.
Preliminary Analysis
Table 5 summarizes the descriptive statistics for the observed variables, including
possible score range, mean, standard deviation, kurtosis, skewness, and z scores for the kurtosis and skewness statistics. Variables L1 to L6 represent Listening Task One to Listening Task Six. Variables R1 to R3 represent Reading Task One to Reading Task Three. Variables S1 to S6 represent Speaking Task One to Speaking Task Six. Variables W1 and W2 refer to Writing Task One and Writing Task Two.
The z scores reported in the table can be used to test univariate normality. Z scores for the
kurtosis values were significant at p < .01 for variables L3, L4, L6, R3, and W1. Z scores
for the skewness values were significant at p < .01 for the following variables: L1, L4,
L5, L6, and R2. Since this was a relatively large sample (N=370), these z scores were
interpreted with reservation. Instead, the absolute values of the kurtosis and skewness
statistics as well as the shape of the distributions were used to evaluate normality.
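To make this screening concrete, the following sketch (Python, purely illustrative) reproduces the z-score check for a few variables, using the standard errors reported in Table 5 and the two-tailed critical value for p < .01:

    kurtosis_se, skewness_se = 0.253, 0.127   # standard errors from Table 5
    critical_z = 2.576                        # two-tailed critical value, p < .01

    variables = {  # variable: (kurtosis, skewness), values from Table 5
        "L1": (-0.264, -0.373),
        "L4": (4.245, -1.911),
        "S1": (-0.340, -0.028),
    }

    for name, (kurt, skew) in variables.items():
        z_kurt = kurt / kurtosis_se   # e.g., L1: -0.264 / 0.253 = -1.043
        z_skew = skew / skewness_se   # e.g., L1: -0.373 / 0.127 = -2.937
        flags = [label for label, z in (("kurtosis", z_kurt), ("skewness", z_skew))
                 if abs(z) > critical_z]
        print(name, round(z_kurt, 3), round(z_skew, 3), flags or "not significant")

The computed z values match the ones shown in Table 5, which is why, with a large sample, the absolute size of the statistics and the shape of the distributions were given more weight than significance alone.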
Table 5. Descriptive Statistics for the Observed Variables

Variable Range Mean Std. Dev. Kurtosis Z Kurtosis (SE=.253) Skewness Z Skewness (SE=.127)
L1 0-5 3.33 1.138 -.264 -1.043 -0.373 -2.937
L2 0-6 3.57 1.376 -.581 -2.296 -0.098 -.772
L3 0-6 2.97 1.560 -.734 -2.901 0.179 1.409
L4 0-5 4.44 0.888 4.245 16.779 -1.911 -15.047
L5 0-6 4.37 1.297 -.259 -1.024 -0.637 -5.016
L6 0-6 4.78 1.384 .976 3.858 -1.223 -9.630
R1 0-15 6.94 2.725 -.343 -1.356 0.285 2.244
R2 0-15 10.06 3.027 -.652 -2.577 -0.393 -3.094
R3 0-15 9.98 3.064 -.963 -3.806 -0.119 -.937
S1 0-4 2.51 0.759 -.340 -1.344 -0.028 -.220
S2 0-4 2.62 0.805 -.508 -2.008 -0.016 -.126
S3 0-4 2.50 0.755 .073 .289 -0.086 -.677
S4 0-4 2.39 0.827 -0.027 -.107 -0.036 -.283
S5 0-4 2.58 0.810 0.172 .680 -0.150 -1.181
S6 0-4 2.53 0.856 -0.115 -.455 -0.132 -1.039
W1 1-5 3.23 1.148 -0.690 -2.727 -0.289 -2.276
W2 1-5 3.46 0.817 -0.169 -.668 -0.020 -.157
The values of kurtosis and skewness are zero in a normal distribution. As Table 5 shows, except for variables L4 and L6, the values of skewness and kurtosis were all reasonably small, suggesting that univariate normality could be held for these variables. Variable L4 had a kurtosis value of 4.25 and a skewness value of
-1.91. Variable L6 had a kurtosis value of 0.98 and a skewness value of -1.22. Examining
the histograms of these two variables revealed that both distributions exhibited a ceiling
effect. Univariate normality could not be held in these two cases. Having two extremely
non-normal variables indicated that this set of variables could deviate from multivariate
normality. These facts were taken into consideration when choosing an appropriate
Table 6. Correlation Matrix of the Observed Variables

L1 L2 L3 L4 L5 L6 R1 R2 R3 S1 S2 S3 S4 S5 S6 W1 W2
L1 1
L2 .35 1
L3 .38 .50 1
L4 .33 .31 .28 1
L5 .41 .39 .40 .40 1
L6 .41 .42 .44 .49 .46 1
R1 .36 .43 .45 .31 .36 .44 1
R2 .36 .49 .48 .39 .43 .55 .54 1
R3 .37 .44 .49 .38 .41 .51 .56 .65 1
S1 .38 .42 .42 .37 .43 .45 .40 .32 .40 1
S2 .34 .36 .36 .39 .32 .38 .39 .33 .33 .58 1
S3 .39 .37 .38 .45 .38 .43 .35 .37 .40 .56 .57 1
S4 .34 .35 .40 .39 .37 .45 .36 .33 .42 .63 .55 .57 1
S5 .38 .37 .41 .45 .42 .46 .44 .38 .44 .56 .58 .57 .60 1
S6 .41 .39 .41 .48 .45 .49 .38 .39 .37 .62 .64 .61 .57 .64 1
W1 .50 .46 .51 .43 .50 .54 .53 .59 .58 .48 .46 .51 .43 .55 .52 1
W2 .45 .49 .48 .44 .48 .48 .50 .50 .54 .57 .60 .56 .52 .63 .56 .61 1
Next, the linearity and multicollinearity of the variables were scrutinized to ensure
that the variables were represented in the dataset appropriately. Linearity was examined
using scatter plots of all possible pairs of the variables. No violation of linearity was
found. Pairwise multicollinearity was checked by inspecting the correlation matrix of the
variables. As shown in Table 6, dependence among all pairs of variables was moderate, and no pairwise multicollinearity was found. As reported earlier, univariate non-normality was detected with two variables, which suggested that the distribution of the set of
variables could deviate from multivariate normality. It was then decided to implement a
corrected normal theory estimation method in the multivariate analysis to avoid bias
caused by non-normality in the dataset. The MLM estimator provided by Mplus was chosen. This estimator produces maximum likelihood parameter estimates with a corrected test statistic known as the Satorra-Bentler test statistic (χ2S-B) (Satorra & Bentler,
1994). This test statistic is mean-adjusted and robust to non-normality (Muthén &
Muthén, 2010).
Results from previous studies revealed that several factor models could adequately account for the underlying structure of TOEFL® test performance (Sawaki
et al., 2008; Stricker & Rock, 2008; Stricker et al., 2005). A higher-order model (Figure
1), a correlated four-factor model (Figure 2), and a correlated two-factor model (Figure 3)
had all been shown to be plausible factor solutions in previous research. To establish the baseline model, all three competing models were tested for fit with the current data through a series of
confirmatory factor analyses. Summarized in Table 7, the selected fit indices were used
to evaluate model fit. Model df refers to a model’s degrees of freedom. χ2S-B refers to the
Satorra-Bentler chi-square test statistic. χ2S-B /df refers to the normed Satorra-Bentler chi-
square test statistic. CFI refers to the comparative fit index. RMSEA refers to the root
mean square error of approximation. SRMR refers to the standardized root mean square
residual.
The degrees of freedom (df) indicate how parsimonious or saturated a model is. They equal the number of unique observed variances and covariances minus the number of free parameters estimated from the data. A free parameter is a parameter that is free to be estimated during model
estimation. On the other hand, a fixed parameter is one whose value is determined
without model estimation. With the same data structure, the more free parameters a
model is specified to estimate, the less parsimonious or more saturated the model is, and
the lower the degrees of freedom are. Model fit generally deteriorates when model
complexity decreases because there are fewer free parameters to estimate in a more
parsimonious model.
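As a concrete check of this relationship, the degrees of freedom reported below can be reproduced by counting observed moments and free parameters. A short sketch (Python; the parameter counts are inferred from the model descriptions and the scaling constraint, so this should be read as an illustration rather than the study's own computation):

    p = 17                        # observed variables: 6 L, 3 R, 6 S, 2 W
    moments = p * (p + 1) // 2    # unique variances and covariances = 153

    # Correlated two-factor model: one loading per factor fixed for scaling.
    free = (p - 2) + p + 2 + 1    # 15 loadings + 17 residuals + 2 variances + 1 covariance
    print(moments - free)         # df = 153 - 35 = 118

    # Correlated four-factor model: one loading per factor fixed for scaling.
    free = (p - 4) + p + 4 + 6    # 13 loadings + 17 residuals + 4 variances + 6 covariances
    print(moments - free)         # df = 153 - 40 = 113

Both counts reproduce the degrees of freedom reported for the two-factor (118) and four-factor (113) models.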
The correlated four-factor model with 113 degrees of freedom was the most
saturated model among the three. The correlated two-factor model was the most
parsimonious one with 118 degrees of freedom. All fit indices deteriorated when model
complexity decreased from the correlated four-factor model to the correlated two-factor
model.
First, the criteria pre-determined based on the relevant literature were used to
evaluate overall model fit. All three chi-square values (χ2S-B) were significant (p = 0.000),
which put model fit in doubt. However, as discussed, the value of the model chi-square
should be interpreted with caution because this test statistic is highly sensitive to sample
size. To reduce the sensitivity of chi-square to sample size, the chi-square values were
divided by the degrees of freedom. As the normed chi-square (χ2S-B /df) values showed,
the ratios were all well below 3, which indicated that all three models fit the data well.
The values of comparative fit indices (CFI) were all larger than 0.9. This meant
that the fit of all three models improved substantially when compared to their respective null (independence) models.
The values of root mean square error of approximation (RMSEA) for the two
more saturated models (higher-order model and correlated four-factor model) were below
0.05, and the value for the most parsimonious one (correlated two-factor model) was
between 0.05 and 0.08. This outcome could be interpreted as a sign of good fit for all three models.
The values of the standardized root mean square residual (SRMR) were all well below
0.1. This indicated that the correlation matrices did not differ significantly from the
model-implied ones. All three models were considered to have good fit.
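The evaluation criteria applied in this section can be summarized in a small helper function; the cutoffs are the ones named in the text (normed chi-square below 3, CFI above 0.9, RMSEA below 0.05 for close fit or 0.08 for reasonable fit, SRMR below 0.1). This is a sketch of the decision rules only, not a substitute for inspecting parameter estimates:

    def evaluate_fit(chisq_sb, df, cfi, rmsea, srmr):
        """Screen global model fit against the cutoffs used in this chapter."""
        return {
            "normed chi-square < 3": chisq_sb / df < 3,
            "CFI > 0.9": cfi > 0.9,
            "RMSEA <= 0.08": rmsea <= 0.08,   # < .05 close fit, .05-.08 reasonable
            "SRMR < 0.1": srmr < 0.1,
        }

    # Hypothetical values for illustration:
    print(evaluate_fit(chisq_sb=250.0, df=118, cfi=0.93, rmsea=0.055, srmr=0.045))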
The selected fit indices for all three models were satisfactory except for the model chi-square values. Therefore, on the global level all three models demonstrated acceptable fit. Individual parameter estimates were then examined for appropriateness and significance. The results of testing the higher-order model showed that the estimated
residual variance of the writing factor was negative, and the estimated correlation
between the higher-order factor and the writing factor was larger than one. These findings
signaled problems in model specification, and therefore made the model inadmissible.
Although this higher-order model was confirmed by previous researchers with similar
TOEFL® test data, this model was not compatible with the current dataset. One possible
reason could be that the sample used in this study was a much smaller one (N=370)
compared to the one (N=2070) used in Sawaki et al. (2008) and Stricker and Rock (2008).3
Regarding the correlated four-factor model, it was detected that the correlation
between the listening and the writing factor was estimated as high as 0.97, larger than the
0.9 acceptance level.4 This high level of correlation indicated a linear dependence
between the factors, meaning that the factors were not distinct enough to be considered as
two separate factors. It was then decided that this model should also be rejected based on the study sample.
An examination of the result from testing the two-factor model showed that all
parameter estimates were appropriate and significant. Taking all criteria into
consideration, this model provided the best explanation of the data, and therefore was retained.
The correlated two-factor model was established as the baseline model as the
outcome of the previous analysis. Built upon this baseline model, testing the role of
context in the internal structure of the test performance was pursued next.
One key element in defining context, content of a language task, was tested for its
ability in accounting for the relationships among the variables along with the skill-based
factors. Previous task analysis showed that ten language tasks were associated with
academic material, whereas the other seven were related to non-academic content. A
content factor dimension was imposed on the baseline model (Figure 4). Two content factors, academic and non-academic, were specified.
As illustrated in Figure 4, all language tasks were specified to load on two factors,
a skill-based factor and a content factor. Taking the first speaking task (S1) as an
example, this task loaded with the other speaking tasks on the speaking factor. This task
also loaded on a non-academic content factor since the task was not related to academic
content. Along this content dimension, ten tasks loaded on a common content factor
associated with academic material, while the other seven loaded on a non-academic
content factor.
This two-dimensional model was tested for fit. The result indicated that
convergence could not be reached. No overall model fit index was reported. Individual parameter estimates were reported without standard errors, so the significance of the estimates could not be evaluated. Adding a second dimension of content brought in severe problems in model specification. The content dimension failed to capture the relationships among the
variables in conjunction with the two skill-based factors based on the test performance of
the study sample (N=370).8 This model was inadmissible, and therefore was discarded.
Another key element in defining context, setting of a language act, was also tested
for its ability in accounting for the relationships among the variables along with the skill-
based factors. Previous task analysis showed that seven tasks were situated in an instructional setting and four in a non-instructional setting. The setting for the remaining six tasks could not be identified due to lack of context development. A setting factor dimension was imposed on the baseline model (Figure 5). Three setting factors were specified: instructional, non-instructional, and not available (N/A).9
As illustrated in Figure 5, all language tasks were specified to load on two factors,
a skill-based factor and a setting factor. Taking the third speaking task (S3) as an
example, this task loaded with the other speaking tasks on the speaking factor. This task also loaded on the instructional setting factor since it was situated in an instructional environment. Along this setting dimension, seven tasks loaded on a common
instructional setting factor, and four on a common non-instructional setting factor. The
remaining six tasks without context development all loaded on a third setting factor
(N/A).
This two-dimensional model was tested for fit. Once again, the result indicated
that convergence could not be reached. Adding a second dimension of setting brought in
severe problems in model specification. The setting dimension failed to capture the
relationships among the variables in conjunction with the two skill-based factors based on
the test performance of the study sample (N=370).10 This model was inadmissible, and therefore was discarded.
The correlated two-factor model was adopted as the final model for the entire
sample group. The first factor, loaded with tasks from the listening, reading, and writing sections, could be interpreted as a non-speaking factor. The second factor, loaded exclusively on the speaking tasks, could be interpreted as a speaking factor. The results of
model testing showed that this model demonstrated an adequate fit to the data. The final model, with unstandardized and standardized parameter estimates, is presented in Figure 6 and Figure 7 respectively. In both figures, a path pointing from a latent factor to an
observed variable, also called an indicator, represented the presumed effect of the factor
on that variable. Estimates of these effects were factor loadings. A path pointing from a measurement error term to an observed variable represented the presumed effect of unsystematic errors on the variable. A path linking the two latent factors did not indicate
directionality. It simply represented the unanalyzed association between the two factors.
The variables were rearranged for ease of visual display. The scales of the latent variables were assigned through the unit loading identification (ULI) constraint.
The factor loadings for the first indicator of each factor were fixed at one to assign a scale
to the factors. The measurement errors were assigned a scale through fixing their
estimated effects on the indicators to be unitary. For example, the unstandardized factor
loading of the second listening task (L2) on the non-speaking factor (L/R/W) was
estimated to be 1.312. The numbers printed next to the latent factors represented factor
variances. The factor covariance was indicated by the number printed next to the path
linking the two latent factors. Next to the measurement errors were residual variances of
the indicators. The variances of the non-speaking and the speaking factors were estimated
to be 0.430 and 0.335 respectively. The covariance between the two factors was
estimated to be 0.306.
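The unstandardized estimates just reported imply specific covariances among the indicators. A brief sketch (Python; the L2 loading, factor variances, and factor covariance are the values given above, while the 1.0 speaking-task loading stands in for an unreported estimate and is purely hypothetical):

    var_nonspeaking = 0.430   # variance of the non-speaking factor (L/R/W)
    var_speaking = 0.335      # variance of the speaking factor (S)
    cov_factors = 0.306       # covariance between the two factors

    loading_L1 = 1.0          # fixed at one under the ULI constraint
    loading_L2 = 1.312        # estimated loading of L2 on the non-speaking factor

    # Model-implied covariance between two indicators of the same factor:
    print(loading_L1 * loading_L2 * var_nonspeaking)   # cov(L1, L2) = 0.564

    # Model-implied covariance between indicators of different factors,
    # using the hypothetical speaking-task loading of 1.0:
    print(loading_L2 * 1.0 * cov_factors)              # 0.401

Comparing such model-implied values against the observed correlations in Table 6 is, in essence, what the fit indices summarize globally.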
The standardized estimates were computed when the latent factors and the
observed variables were standardized. The variances of all variables, including the
factors, the observed variables, and the residuals, were fixed at one. The estimated factor
correlation was reported next to the path linking the two factors. The correlation between
the two skill-based factors was estimated to be 0.807. The factor loadings were interpreted as standardized regression coefficients; for example, one standard deviation of change in the latent speaking factor predicted 0.764 standard deviation of change in the
first speaking variable (S1). The higher a factor loading was, the better the indicator was
at measuring the latent factor. The standardized factor loadings could also be interpreted
as estimated correlations between a latent factor and its indicators in the current model
because each indicator was specified to measure only one latent factor. For example, the
estimated correlation between the speaking factor and the first speaking variable (S1) was
0.764. This meant that 58.4% (0.764²) of the total variance of this indicator could be
accounted for by the speaking factor. The standardized residual variances represented the
percentage of variance of the indicators that could not be explained by the common
factors. In the case of the first speaking variable (S1), 41.6% of the variance could not be
accounted for by the speaking factor. The standardized residual path coefficient for the
direct effect of the measurement error on this speaking variable was 0.645 (the square root of 0.416), which meant that one standard deviation of change in the error term was associated with 0.645 standard deviation of change in this speaking variable.
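The arithmetic behind this decomposition follows directly from the standardized loading; a minimal sketch (Python; values as reported above):

    import math

    loading = 0.764                   # standardized loading of S1 on the speaking factor
    explained = loading ** 2          # 0.584 -> 58.4% of the variance of S1 explained
    residual = 1 - explained          # 0.416 -> 41.6% left unexplained
    error_path = math.sqrt(residual)  # 0.645, standardized error path coefficient

    print(round(explained, 3), round(residual, 3), round(error_path, 3))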
Assuming that the obtained final model was the correct model for the entire
group, the multi-group invariance analysis investigated whether the specified model
could hold equivalent across two groups of test-takers. One hundred and twenty-four test-
takers who had never been immersed in an English language environment were grouped
together (Group I). The other group (Group II) of 246 test-takers had lived in an English-
speaking country for various lengths of time. Table 8 summarizes the descriptive statistics for the two groups. The multi-group invariance analysis compared the two groups in a hierarchically ordered fashion. In all steps of the analysis, unstandardized estimates were examined. Standardized estimates should not be used for comparing groups, as groups are assumed to differ in their variabilities on the common factors.
Measurement Invariance
First, the invariance of the overall factor structure was inspected. The same factor structure was imposed on both groups
simultaneously but parameter estimates were allowed to differ across the groups.
The resulting unstandardized parameter estimates for both groups are shown in
Figure 8 and Figure 9 respectively. The same correlated two-factor model was applied on
both groups. Factor loadings, indicator residuals as well as factor means, variances, and
covariance had different estimates in each group. The numbers printed next to a factor
referred to the factor mean and factor variance. In Group I the means of the latent factors
were fixed at zero. Factor variances were estimated at 0.518 for the non-speaking factor
and 0.374 for the speaking factor. The estimated covariance of the factors was 0.353. In
Group II the means of the latent factors were free to be estimated. The mean of the first
factor was estimated to be 0.140 lower than the one in Group I. The mean of the second factor was estimated to be 0.017 lower than the one in Group I. Factor variance estimates were 0.386
and 0.319 in Group II. The estimated covariance of the factors was 0.283 in Group II.
Model fit indices are summarized in Table 9. Except for the model chi-square, all
model fit indices were satisfactory. All parameter estimates, as reported in the figures,
were appropriate and reasonable. It was then concluded that the same correlated two-
factor structure could be held across the groups. The result of this step ensured that the
performance in both groups could be accounted for by the same two factors, a speaking
factor which was loaded with tasks from the speaking section and a non-speaking factor
which was loaded with tasks from the sections of listening, reading, and writing. The analysis then proceeded to the next step.
Next, the equivalence of factor loadings was inspected. Factor loading estimates were constrained to be equal across the groups, so that the factor loading estimates were the same for both groups. Indicator residuals, factor means, factor
variances, and factor covariance were allowed to differ across the groups.
Model fit indices are summarized in Table 9. Except for the model chi-square, all
model fit indices were satisfactory. All parameter estimates, as reported in the figures, were appropriate and reasonable. The Satorra-Bentler chi-square difference (χ2S-B|Diff) between this model and the preceding model was not significant: 15.266 with 15 df (χ2S-B|Diff / df = 1.018). Compared to the model tested in the preceding step, the current model
had fewer free parameters to estimate because it imposed equal factor loadings across the
groups. Therefore the current model was more restrictive and simpler. The decrease in
the number of free parameters led to deterioration in model fit. The non-significant result
demonstrated that model fit did worsen, but not enough to justify choosing the more liberal model.
It was then concluded that the factor loadings could be held invariant across the
groups. The result of this step indicated that the factors were measured by their indicators
in a comparable way for both groups. The amount of the variance of an indicator that
could be accounted for by its respective factor was comparable across the groups. The analysis then proceeded to the next step.
The third step was to examine the equivalence of residual variances in both
groups. Residual variances along with the factor loadings were constrained to be equal across the groups.
As illustrated in Figure 12 and Figure 13, the factor loadings as well as the
residuals were fixed to be the same for both groups. Factor means, factor variances, and factor covariance were free to differ across the groups.
Model fit indices are summarized in Table 9. Except for the model chi-square, all
model fit indices were satisfactory. All parameter estimates, as reported in the figures, were appropriate and reasonable. The Satorra-Bentler chi-square difference (χ2S-B|Diff) between this model and the preceding model was not significant: 22.249 with 17 df (χ2S-B|Diff / df = 1.309). This result indicated that the model fit did not deteriorate enough to
justify choosing the preceding more saturated model over the current simpler one. It was
then concluded that the indicator residuals could be held invariant across the groups. The
outcome of this step showed that the indicators performed equally at measuring their
respective factors for both groups. The amount of the variance of an indicator that could not be explained by its factor was also comparable across the groups.
Analysis in the above three steps completed the test of multi-group measurement
invariance. The testing of equality on factor structure, factor loadings, and residuals all
succeeded. It was concluded that the measurement part of the model could be held
equivalent across the groups. The multi-group invariance analysis then proceeded with testing the structural component.
Structural Invariance
Next, the structural part of the model was scrutinized with equality control across
the groups.
The invariance of factor means was examined first. As shown in Figure 14 and
Figure 15, the measurement part of the model, including the factor loadings and indicator
residuals, was fixed to be the same across the groups. Factor means in both groups were fixed at zero and thereby held equal. Factor variances and covariance were free to differ.
The model fit indices are summarized in Table 10. Except for the model chi-
square, all model fit indices were satisfactory. All parameter estimates, as reported in the
figures, were appropriate and reasonable. The Satorra-Bentler chi-square difference (χ2S-B|Diff) between this model and the preceding model was not significant: 3.771 with 2 df
(χ2S-B|Diff / df = 1.886). This result indicated that the model fit did not deteriorate enough
to justify choosing the more saturated model over the current simpler one. It was then
concluded that the factor means could be held invariant across the groups. The outcome
of this step showed that the two groups were equivalent in terms of latent factor means.
In other words, there was not enough evidence to say that one group was better than the
other on either latent ability. The structural invariance analysis then proceeded to the next
step.
The next step was to test the equivalence of factor covariance. These parameters
were constrained to be equal across the groups. As indicated in Table 10, model
estimation did not succeed. This result failed to demonstrate the equivalence of the factor
covariance across the groups. The multi-group invariance analysis was then terminated.
Since no further step of testing factor variance invariance would be taken, it could then be
assumed that the groups differed in their variabilities on the common factors. The multi-
group model estimated in the preceding step became the final model for the groups.
Table 10. Fit Indices from the Multi-Group Structural Invariance Analysis

In summary, the factor means could be held invariant across the groups, but the factor covariance could not be held across the groups. The final model for both groups, as
shown in Figure 14 and Figure 15, had a correlated two-factor structure. The factors were
measured by the same set of indicators, which ensured that the factors had the same
meanings for both groups. The first factor was a combination of listening, reading, and
writing, a non-speaking factor. The second factor was a speaking factor, loaded
exclusively with the speaking tasks. Factor loadings were also equivalent for both groups.
For example, the unstandardized factor loading of the second speaking task (S2) on the
speaking factor was 1.042 for both groups, which meant that this speaking task
functioned equivalently as an indicator of the speaking factor for Group I and Group II.
Indicator residuals were comparable across the groups as well. For example, the residual
variance of the first writing task (W1) was 0.489 for both groups, which indicated that the
same amount of variance of this writing task was left unexplained by its factor for Group
I as for Group II. Finally, the means of the latent factors were equal across the groups. In
terms of their latent abilities, the groups did not differ from each other since the model
testing succeeded when the factor means were held invariant across the groups.
Since the test of factor covariance equality failed, these estimates were not fixed
to be equal for the groups. Factor variances were also assumed to be unequal. Factor
variance estimates were 0.438 and 0.345 in Group I, and 0.426 and 0.330 in Group II.
The covariance estimate of the factors was 0.313 in Group I, and 0.303 in Group II.
Although model testing indicated that the groups differed on these parameters, the estimates were close in magnitude across the groups.
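Converting the group-specific covariances into correlations makes the closeness of the two groups visible; a one-line computation with the estimates just reported (Python, illustrative):

    import math

    def factor_corr(cov, var1, var2):
        """Correlation between two factors from their covariance and variances."""
        return cov / math.sqrt(var1 * var2)

    print(round(factor_corr(0.313, 0.438, 0.345), 3))   # Group I:  0.805
    print(round(factor_corr(0.303, 0.426, 0.330), 3))   # Group II: 0.808

On the correlation metric the two groups are nearly indistinguishable, which is consistent with the conclusion that the group differences on these parameters were minor.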
In the end, the factor models for the two groups were almost identical with only
minor differences. The factor structure underlying the performance of test-takers with
study-abroad experience was almost the same as the one from test-takers without such
experience. These results informed us that the impact of this group membership on the
test performance was minimal. It was then reasonable to conclude that the test
functioned in a comparable way across the two subgroups of test-takers, one with study-
abroad experience and one without such experience. In other words, the structure of
communicative language ability, as measured by the TOEFL iBT® test, was found to be equivalent across the two groups.
Next, test-taker background variables were introduced into the models. The independent variables, such as the time spent studying English and the time spent in content classes taught in English, were specified to have direct effects on the development of the latent factors, and these effects were evaluated for significance.
A unique model was built for each group. With the group of test-takers who did
not have study-abroad experience, the home-country group, as shown in Figure 16, two
independent variables – the time spent studying English and the time spent in content
classes taught in English – were modeled to have direct effects on both latent abilities.
The two independent variables were represented by the rectangular boxes labeled as
‘Study’ and ‘Content’ in the figure as these variables were directly observed. The latent
factors, represented in the ellipses, were dependent variables in the model. A path pointing from an independent variable to a latent factor represented the direct effect of the former on the latter. The part of the variance of a latent factor that
could not be explained by the independent variables, also called disturbance, was
represented in an ellipse next to the latent factor. The disturbances of the two latent
factors (D1 and D2), linked by a double-arrow line, were free to vary and covary.
With the study-abroad group, as shown in Figure 17, three independent variables were modeled to have direct effects on both latent
factors. They were the time spent studying English (labeled as ‘Study’), the time spent in
content classes taught in English (labeled as ‘Content’), and the length of living in an English-speaking country (labeled as ‘Live’). Because the scale of the independent variables was not an issue in model estimation, only the scale of the latent dependent variables needed to be set (Muthén & Muthén, 2010). Since the same set of continuous indicators used previously was used in these two
models to define the latent factors, the same estimation method from earlier was
implemented here as well. The models were evaluated by using the MLM estimator in
Mplus. The selected fit indices, summarized in Table 11, showed both models fit the data
well. In the next section, the standardized parameter estimates in each group are presented. Parameter estimates were first inspected to check if the model was appropriate and reasonable at explaining the relationships among the variables.
All factor loadings were significant at p < 0.01. The path coefficients from the
‘Study’ variable to the latent factors were 0.139 and 0.290. The path coefficients from
the ‘Content’ variable to the factors were 0.227 and 0.338. Significance of the effects of the independent variables on the factors was marked by one asterisk next to a path coefficient at p < 0.05, and two asterisks at p < 0.01.
Three out of the four path coefficients were significant. The path coefficient
between the ‘Study’ variable and the non-speaking factor was not significant. This meant
that the change of latent non-speaking ability was not likely to be affected by the length
of time studying English for the test-takers without study-abroad experience. The path
coefficient between the ‘Content’ variable and the non-speaking factor was significant at
p < 0.05. This indicated that one standard deviation of change in the length of taking
content classes taught in English brought up 0.227 standard deviation of change in the latent non-speaking ability.
With regard to speaking, both the ‘Study’ and ‘Content’ variables had a significant impact (p < 0.01) on this latent factor. The path coefficient between the ‘Study’ variable
and the speaking factor showed that one standard deviation of change in the length of
studying English brought up 0.290 standard deviation of change in the latent speaking
ability. The path coefficient between the ‘Content’ variable and the speaking factor
showed that one standard deviation of change in the length of taking content classes
taught in English brought up 0.338 standard deviation of change in the latent speaking
ability.
The residual variance of the non-speaking factor was 0.915, which meant that
91.5% of the variance of the non-speaking factor could not be explained by the two
independent variables. The residual variance of the speaking factor was 0.756, which
meant that 75.6% of the variance of the speaking factor could not be explained by the two
independent variables. The two residuals were correlated at 0.787. Both standardized
residuals, especially the one for the non-speaking factor, were very high. This indicated
that variables other than the ones specified in the model might have had influences on the
development of the latent factors. Since these other variables were not represented in the
model, their impact on the latent variables could not be analyzed in this study.
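Expressed as variance explained, the standardized disturbances translate directly into R-squared values for the latent factors; a minimal sketch with the estimates just reported (Python):

    residual_nonspeaking = 0.915   # standardized disturbance, non-speaking factor
    residual_speaking = 0.756      # standardized disturbance, speaking factor

    # Proportion of each factor's variance explained by the background variables:
    print(round(1 - residual_nonspeaking, 3))   # 0.085 -> 8.5% for the non-speaking factor
    print(round(1 - residual_speaking, 3))      # 0.244 -> 24.4% for the speaking factor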
To summarize, for the home-country group, both the length of studying English and the length of taking content classes in English had significant associations with the speaking ability. Only the length of taking content classes in English had a significant association with the non-speaking ability.
For the study-abroad group, parameter estimates were likewise inspected to check if the model was appropriate and reasonable at explaining the relationships among the variables.
All factor loadings were significant at p < 0.01. The path coefficients from the
‘Study’ variable to the latent factors were 0.347 and 0.328. The path coefficients from the
‘Content’ variable to the factors were 0.098 and 0.094. The path coefficients from the
‘Live’ variable to the factors were 0.155 and 0.229. Significance of the effects of the
independent variables on the factors was marked by one asterisk next to a path coefficient at p < 0.05, and two asterisks at p < 0.01.
Four out of the six path coefficients were significant at p < 0.01. The significant
path coefficients between the ‘Study’ variable and both factors indicated that one
standard deviation of change in the length of studying English brought up 0.347 standard
deviation of change in the non-speaking factor, and 0.328 standard deviation of change in
the speaking factor. The ‘Content’ variable had no significant impact on either of the
factors. The ‘Live’ variable had a significant impact on both factors. One standard deviation of change in the length of living in an English-speaking country brought up 0.155 standard deviation of change in the non-speaking factor, and 0.229 standard deviation of change in the speaking factor.
The residual variances for the factors were high, 0.816 for the non-speaking factor
and 0.795 for the speaking factor. They were correlated at 0.767. This indicated that
variables other than the ones specified in the model may have had influences on the latent
factors. Since these other variables were not represented in the model, their impact on the latent variables could not be analyzed in this study. To summarize, for the study-abroad group, the lengths of studying English and living in an English language environment both had significant associations with both ability components.
CHAPTER FIVE
This chapter starts with a review of the study, followed by a summary of the
primary findings. Discussion focuses on three topics: (1) the nature of communicative
language ability, (2) group membership and language ability, and (3) learning contexts
and language ability. The implications of this study for foreign language (FL) test
development and validation are elaborated. The merits of using a structural equation
modeling (SEM) approach to address issues at the interface between language testing and
second language acquisition are appraised. The study's contributions and its limitations are discussed.
This study examined the structure of communicative language ability through a latent factor approach based on TOEFL iBT test performance of test-takers with different study-abroad and learning experiences. This ability's associations with the lengths of study-abroad and learning were also investigated.
The first research question asked what the nature of the measured language ability is, and what the role of context is in defining this ability. This question was
investigated with both skill abilities and the context of language use taken into
consideration. Based on the results from previous factor-analytic research with similar
data, three competing models were tested for fit through a series of confirmatory factor analyses. Neither the higher-order structure nor the correlated four-factor structure could be confirmed. Instead, a correlated two-factor model was shown to be compatible with the data. One of the factors was a speaking factor, and the other could be interpreted as a non-speaking factor. This model was established as the baseline model. Two context-related factors–content
and setting–were then added to the baseline model with the goal of improving model fit.
However, in neither case did model estimation succeed. This indicated that, contrary to
the hypotheses, the added context factors were not useful in explaining the latent
structure of the test performance together with the skill-based factors. The correlated two-
factor model fit the data well, and was adopted as the final model for the whole sample.
The second question investigated whether or not having had contact with the target language environment moderated the nature of communicative language ability. This was achieved by conducting multi-group invariance analysis across two groups of test-takers. One group had lived in an English language environment prior to taking the test, whereas the other group had not. Simultaneous
multi-group invariance analysis with a mean structure was carried out with parameters
constrained to equality across the groups in a hierarchical fashion. The results showed
that the moderating effect of this group membership on test performance was minimal.
Across the groups, the test measured the same set of latent abilities in similar ways. The
groups did not differ in terms of their standings on the factor means either. Contrary to
the hypothesis, the nature of communicative language ability elicited by the test had an equivalent underlying representation across the two groups.
The third research question inquired if the lengths of study-abroad and learning
have any association with the development of communicative language ability. This was
accomplished by establishing models unique to test-taker groups. With the group of test-
takers who had had study-abroad experience, the length of time abroad, the time spent
studying English, and the time spent in content classes taught in English were modeled to
have direct effects on the latent abilities in the correlated two-factor model. With the
group of test-takers who had not had study-abroad experience, the time spent studying
English and the time spent in content classes taught in English were modeled to have
direct effects on the latent abilities in the correlated two-factor model. The results lent
partial support to the hypotheses. Although both study-abroad and learning were found to have significant associations with aspects of the language ability, large portions of the factor variances remained unexplained in the models. This result suggested that variables other than the ones specified in the models might have had influences on the development of the latent abilities.
Discussion
Discussion centers on the following three topics: (1) the nature of communicative
language ability, (2) group membership and language ability, and (3) learning contexts and language ability.
The goal of the TOEFL iBT® test is to assess communicative language ability,
whose definition reflects the influences of both skill abilities and the context of language
use. The test is designed to reflect this current thinking in applied linguistics.
The construct the test intends to measure–the ability to use the English language communicatively–was empirically represented by two components. One component, on which all six speaking tasks loaded, could be
interpreted as the speaking ability of this group of test-takers. The second component, on
which tasks from the listening, reading, and writing sections loaded, could be labeled as
the ability to listen, read, and write. In other words, listening, reading, and writing were
the non-speaking ability of this group of test-takers. In relation to test scores, this two-
factor structure meant that a test-taker who was higher ranked on listening was also likely
to perform well on reading and writing, but not necessarily on speaking. Likewise, the
fact that a test-taker scored high on speaking could not be used to draw the conclusion
that the person could also perform well on listening, reading, and writing. Both components were skill-based, and together they accounted successfully for the underlying structure of the test performance.
In contrast to skills, the role of context in defining the construct was not reflected
in this factor model. When two aspects of the context of language use, content and
setting, were examined together with the skill-based factors, model fit did not improve.
Rather, including these situation factors made the models inadmissible. Neither situation
factor was successful at explaining test performance when tested together with the skill-
based factors. With regard to task content, test-takers’ performance was not influenced by
the content of the language tasks, whether academic or non-academic. As far as setting
was concerned, the performance of test-takers was not affected by the availability of the setting or by whether the setting was instructional or non-instructional.
The final model chosen to account for the test performance for the whole group
contained two correlated skill-based factors. This model, however, did not indicate any
influence of the context of language use in the latent configuration of the test construct.
This finding spoke to the current understanding of the multi-component nature of FL proficiency reached by applied linguists and language testers. Contrary to the unitary view of language proficiency endorsed by Oller (1979), the nature of FL proficiency has repeatedly been found to be multi-componential. Speaking and reading were found to be two distinct factors in Bachman and Palmer's (1983) study.
Several distinct ability factors (among them the interactive use of language) were found in Sang et al. (1986). A higher-order model was found to best represent the nature of FL ability in Fouly et al. (1990). Bachman et al. (1995) were able to identify distinct first-order factors. Both Buck (1992) and Bae and Bachman (1998) demonstrated that the two receptive skills, listening and reading, were factorially different. Factors such as comprehending short context and comprehending long context have also been distinguished. The correlated two-factor model found in this
study added another piece of supporting evidence for the multi-component nature of FL
ability.
This two-factor model suggested that responses to the listening, reading, and
writing tasks might have required similar skills, whereas the speaking tasks might have
demanded a somewhat different skill set. Finding this two-factor model could also be due
to the differences in testing method. Both listening and reading used multiple-choice
questions, whereas speaking employed constructed-response tasks. The two writing tasks
were also constructed-response items but they loaded together with the listening and
reading tasks. Still, the majority of the tasks (9 out of 11) on the non-speaking factor used
objective multiple-choice questions. Test method effect might have contributed to the
finding of the two-factor model. The third explanation for the distinctiveness between a
speaking factor and a non-speaking factor could be instruction, or lack thereof. The speaking section became mandatory with the TOEFL iBT test, whereas listening, reading, and
writing had long been part of the TOEFL testing routine before the introduction of the
TOEFL iBT test. For years TOEFL test-takers could choose not to be tested on speaking.
Lack of test preparation and training could be the reason for finding a speaking ability distinct from the other skills.
From a test validation point of view, the correlated two-factor model failed to
confirm an internal test structure that was compatible with the test’s section design and
score reporting scheme. The TOEFL iBT® test has a structure of four skill-based sections:
listening, reading, writing, and speaking. Stricker and Rock (2008) and Sawaki et al.
(2008) both concluded that a higher-order factor (general FL ability) together with four first-order skill factors (listening, reading, speaking, and writing) provided the best explanation of TOEFL iBT test performance.
Results from these previous TOEFL iBT studies supported the practice of reporting a
separate score on each of the skill sections as well as a total score, an average of the four section scores.
The correlated two-factor model confirmed in this study was identical to what
Stricker et al. (2005) found based on a prototype of the TOEFL iBT® test. The results of
this study and Stricker et al. (2005) suggested a different internal structure of the test and,
therefore, an alternative way to organize test content and to report scores. According to
this factor model, the ability measured by the test had two instead of four latent components. Tasks designed to measure separate abilities of listening, reading, and writing were all
associated with the same latent ability, an ability that was not related to speaking. This
correlated two-factor model without a higher-order structure did not provide enough
evidence to support the existence of a general language ability. To reflect the internal
structure of the test suggested by this model, the domain of the test might be organized
into either speaking or non-speaking. If future studies support the findings of this study,
the test could report a speaking score based on speaking tasks and a non-speaking score
based on tasks from listening, reading, and writing. The results of this study did not
provide justification for reporting a total score for the entire test.
A discrepancy was observed between the theoretical definition of communicative language ability and the empirically obtained factor structure. The
definition of the test construct highlights the intertwining relationships between the
context of language use and the skill-based capacities within individuals, and suggests
organizing the test domain by language use situation (Chapelle et al., 1997). Although the
operational TOEFL iBT® test follows the four-skills convention to organize test content,
the context of language use can still play a role in defining the ability measured by the test. This possibility was examined through a two-dimensional factor modeling approach.
In the language testing literature, this approach has been used in multiple studies
to demonstrate factors that are not skill-based. Most of these studies focused on test
method effects. Bachman and Palmer (1983) empirically demonstrated the existence of
three method-related factors (interview, translation, and self-rating) together with two trait factors. In another study by the same authors (Bachman & Palmer, 1982), the best model they found was one combining method factors (including writing sample and multiple-choice test, and self-rating) and skill factors (grammatical and pragmatic competence, and sociolinguistic competence). Another study discovered that a two-dimensional model with three skill-based factors (main idea comprehension among them) and two input-mode factors (an audio input mode and a written input mode) provided the best match to the data.
Llosa (2007) found that two-dimensional models with both skill factors and method factors (standardized test and classroom assessment) provided the best explanation for the data in multiple test
populations. A two-dimensional model with three skill factors (extracting main ideas,
major ideas, and supporting details) and three method factors (summary task, incomplete
outline, and open-ended question) was chosen in Shin’s (2008) study to account for FL
test performance.
The common approach in the above studies was to demonstrate the method effects by imposing on the skill-based factors a second dimension of test method factors. Successful model estimation indicated that the method
factors together with the skill-based factors were responsible for the underlying structure
of the ability being measured. This approach was adopted in this study to examine the
role of context in defining communicative language ability measured by the TOEFL iBT®
test. A second dimension of situation factors, based on content and setting, was tested,
respectively, along with the skill-based correlated two-factor structure. Model estimation
did not succeed in either case, indicating that a model with both skill-based factors and
situation factors was not empirically compatible with the data. Instead, a skill-based two-
factor model provided the best explanation of the construct. However, there was a
noticeable amount of overlap in the indicators that represented skill and situation factors.
For example, 8 out of the 10 indicators associated with the academic content were also
indicators of the non-speaking factor. This could be a possible reason for the estimation failures.
These findings suggested that the ability measured by the test was predominantly
skill-oriented. The relationships between the context of language use and the skill-based
capacities were not captured in the latent structure of the test construct. In other words,
the role of the context of language use in defining communicative language ability could not be confirmed.
The language testing research community has long been aware of the relativity of
FL ability. Earlier researchers made the call for interpreting the nature of FL ability in
light of learner variability (Harley et al., 1990; Kunnan, 1998a). In a more recent review
of English language testing and assessment, Alderson and Banerjee (2002) restated this call.
The field has witnessed a surge of empirical studies that investigated the nature of
FL ability in relation to learner variability. This line of research responded to the question of whether the nature of FL ability varies across groups of learners defined by characteristics such as learner background (Römhild, 2008; Shin, 2005), cognitive skill (Sang et al., 1986), gender (Wang, 2006), and target language contact experience.
study abroad environment (Morgan & Mazzeo, 1988), or an at-home environment (Bae &
Bachman, 1998; Morgan & Mazzeo, 1988). Results from these studies suggested that
language contact as a TTC moderated test performance. In other words, language abilities
developed in groups with different language contact experiences were different in terms
of latent structure.
This study took a special research interest in the relationship between the nature
of communicative language ability, as measured by the TOEFL iBT® test, and test-takers’
target language contact, either having lived in an English speaking environment or not
having done so. The results contrasted with the outcomes from previous studies.
At the measurement level across the two groups, the test measured the same
abilities (speaking and non-speaking). The test tasks functioned equivalently as indicators
of the abilities they measured. At the structural level, the two groups did not differ in
terms of their mean performance on the latent abilities. The degrees of variability of the
latent abilities as well as the correlational relationship between the two abilities were
found to vary across the groups, but not by much. It is usually assumed that factors differ in their variability in different groups. That was the reason for choosing the unit loading identification constraint to scale the factors rather than fixing the factor variances. In conclusion, these results suggested that having the study-abroad experience (from less
than 6 months to more than 1 year) did not alter how a test-taker performed on the test, in
terms of the latent factor structure as well as the latent factor means.
The factorial invariance found in this study, however, did resonate with what
other TOEFL iBT® multi-group studies have found. Stricker et al. (2005) concluded that
a correlated two-factor structure could be applied across three native language groups. In
another study, Stricker and Rock (2008) confirmed the same higher-order structure across
subgroups by native language family, exposure in home countries, and formal instruction.
By focusing on test-takers’ exposure in the target language community, the results of this
study provided another piece of evidence for the generalizability of the test’s internal
structure.
This study did not find convincing evidence to claim that test-takers with study-
abroad experience performed differently on the test as a whole group from the group
without such experience. The English language ability developed in the two groups had the same underlying structure. Both groups exhibited a distinct speaking ability. They also exhibited an ability that could be captured by their responses to the listening, reading, and writing tasks collectively. The inclusion of a mean structure in this study led to the surprising finding that the study-abroad test-takers did not turn out to be better at English compared to the test-takers who had never had study-abroad experience. This result ran counter to the assumed superiority of study abroad over learning in the home country, a common belief as captured by Freed, Segalowitz and Dewey (2004). The assumed superiority of study abroad over classroom settings has also been challenged by study abroad researchers. After reviewing
studies that compared language gains from study-abroad and from home country formal
learning, Collentine and Freed (2004) found no convincing evidence that one learning
context was of absolute superiority compared to the other. Depending on the aspects of
linguistic development and levels of proficiency, one learning context might produce
more gains than the other. A study by Davidson (as cited in Dewey, 2004, p. 322) also
pointed out that it might take a full year of target language contact for the linguistic
benefits to become evident, and he called for additional research on the effects of length
of study-abroad. In this study, exposure to the English language varied from less than 6
months to more than 1 year. Among the 246 test-takers who had been immersed, only
about half of them had lived in an English-speaking country for more than a year.
From the test validation point of view, confirming an equivalent structure across
the two groups provided important validity evidence based on the test’s generalizability.
The TOEFL iBT® test is intended for people whose first language is not English.
However the intended test-taking population could be very diverse. With growing
opportunities for study abroad, language learning has expanded from its traditional classroom context into the target language community. One salient characteristic of this diverse population is the target language contact experience. Taking the sample in this study as
an example, about two thirds of the test-takers had the experience of living in an English
language country prior to taking the test. The rest did not have such experience. This diversity raises a legitimate question: Should we use the same test for all intended test-takers; and if we do
use the same test, can test scores be interpreted and used in the same way? The results of
the multi-group invariance analysis in this study indicated that the test functioned
equivalently for both groups of test-takers, regardless of their different target language
contact experiences.
The TOEFL iBT® test is administered both domestically (where English is the dominant language) and internationally (where English is not the dominant language). Test-taking location can also be used as a rough index of whether a test-taker has or has not had language
contact with English. The results from this study provided partial support to use the same
test format and score reporting scheme for both domestic and international test-taking
locations.11 Additional research on how domestic and international TOEFL iBT® test-takers perform is recommended. It is not unusual that a test's target test-taking population is diverse. To justify using the same test across groups, it is
first necessary to make sure that the test measures the same construct in an equivalent
way. If analysis suggests that a four-skills test measures the designed four skills for one
group and one general skill for another group, then reporting skill-based scores makes
good sense for the former group, and reporting only total scores for the second group is
preferred. As a matter of fact, when the test is used on the second group, the length of the
test could be reduced to one-fourth of its original length, because the information obtained from the four sections would be largely redundant. Measurement invariance should also be established before attempting to compare group means. That is because it is important to know what to compare before
conducting any comparisons. Assume that in one group factor loadings of listening items
are relatively high and those in another group are relatively low. This hypothetical
finding would indicate that these test items were better indicators of the listening ability
in the first group than in the second. Comparing group means in listening based on these
items would therefore be questionable. In the first group, deriving a listening score from
these items would be acceptable. In the second group, since the items would not be good
indicators, it would be doubtful how much information about listening they really
conveyed, and a valid listening score based on them might not be warranted.
Therefore comparison would make little sense. Consider another scenario. An integrated
task involving both listening and speaking loads with listening tasks in one group but
with speaking tasks in the other group. Comparing groups on this task would not be
warranted because the task has simply measured different things in different groups. In
conclusion, to ensure fair interpretation and use of test scores across diverse groups of
test-takers, measurement invariance should be established before any scores are compared.
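As a compact illustration (standard multi-group confirmatory factor analysis notation, offered here as a sketch rather than the study's exact parameterization), let the observed scores in group g follow

    $$x^{(g)} = \tau^{(g)} + \Lambda^{(g)}\xi^{(g)} + \delta^{(g)}, \qquad g = 1, 2.$$

Metric invariance constrains the loadings to be equal, $\Lambda^{(1)} = \Lambda^{(2)}$; scalar invariance adds equal intercepts, $\tau^{(1)} = \tau^{(2)}$. Only when these constraints hold can a difference in latent means, $\kappa^{(2)} - \kappa^{(1)}$, be interpreted as a genuine ability difference rather than an artifact of the measurement.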
Two broad learning contexts can be distinguished. An instructional context centers on the
formal study of the language itself, including vocabulary and grammatical structures.
Formal FL training in the home country usually provides such a context. A communicative
context centers on using the language for authentic purposes, such as exchanging information and
opinions. Study abroad in the target language community is likely to create such a context.
An FL learner might have experience with one of these contexts, or with both. The results
of this study provided opportunities to understand how both contexts are associated with
the development of language ability. For test-takers who indicated no experience of direct
contact with an English language environment (the home-country group), learning was
captured by their experience of studying English and studying content classes taught in
English. The former would occur most likely in an instructional context, whereas the
latter might happen in a hybrid context which could be both instructional and
communicative. For test-takers who claimed having direct contact with the target
language community (the study-abroad group), learning was captured by their experience
of living in an English-speaking environment, in addition to the two learning situations above.
The findings of this study suggested that all three learning situations had an
impact on the development of aspects of language ability. The speaking ability of the
home-country group was associated with the lengths of both studying English and taking
content classes taught in English. This group’s ability to listen, read, and write was
associated only with the length of time taking content classes taught in English.
For the study-abroad group, the non-speaking ability (as distinct from
speaking) had a significant relationship with the length of time of taking content classes
taught in English. The lengths of studying English and of living in an English-speaking
country were significantly associated with both ability components. Between these two
learning situations, the length of time studying English held the stronger relationship,
suggesting that formal study rather than residence abroad carried the greater
impact on the development of language ability. This finding was compatible with results
from studies comparing study abroad and formal instruction in the home country (Díaz-
Campos, 2004; Collentine, 2004; Lafford, 2004; Sasaki, 2007). These studies indicated
that students receiving formal language instruction in their home countries made just as
much gain (if not more) on some aspects of language ability as study abroad learners.
What was surprising was that even within the study-abroad group, test-takers' time
living abroad did not hold the strongest relationship with their abilities. This ran counter
to the common belief in the absolute superiority of study abroad over other learning
contexts.
Practically speaking, this study suggested that study abroad might not be the only
way to prepare for the test and to improve English language ability. Training received in
a formal classroom setting might help just as much, if not more, in performing well on the test.
Implications
The outcomes of this study have implications for a broad range of issues
associated with language acquisition and language testing. This section is organized
around three topics. First, thoughts regarding test development are elaborated. Second, the idea of
establishing FL test-taker profiles is put forward. Last, using a structural equation modeling (SEM) approach to work at the
interface between language testing and second language acquisition is discussed.
Test Development
If the context of language use and the internal skill abilities of individual language
users are both part of the definition of the communicative language ability that the
TOEFL iBT® test intends to measure, the role of this language use context in defining the
test construct should be reflected in the internal structure of the test. Unfortunately, the
attempt to demonstrate the abilities to respond to context did not succeed. The skill
abilities, as already reflected in the design of the test, were shown to be the dominant
forces in determining the internal structure of the test. Test-takers’ ability to respond to
different situations defined by content and setting did not appear to have an impact on
the internal structure of the test. This raises a question about the testability of context-based
language components. In other words, are these components testable yet? Current
theorizing about communicative competence, including the communicative language ability model
proposed by Bachman (1990), embraces the context of language use in its framework.
What remains unclear is whether test-takers' responses to different language use
situations are reliable and distinguishable enough for the associated abilities to be captured in the way predicted by the
original theory. This study suggested that features of the language use situation, such as
content and setting, were not able to elicit the associated context-based abilities.
However, not all features used to define the context of language use were tested
simultaneously in this study. Furthermore, not all tasks used to test the context-associated
ability components had a fully developed language use situation. This lack of context
development was especially obvious in all reading tasks as well as independent writing
and speaking tasks. This might have been the cause for the failure to find context-based
abilities.
With the intent to understand the nature of communicative language ability, this
study chose the TOEFL iBT® test in hopes of allowing context-dependent abilities to
surface and appear in the internal structure of the test, because this test was designed to
particularly reflect communicative language use in contexts. The results implied that
more care and attention need to be given to context development for the tasks used in the
test. Making a complete departure from the four-skills test design, while not yet
acceptable to the majority of test users, might offer more opportunity for implementing a
fully context-oriented construct.
The results also implied that new metrics might need to be considered for scoring
the test. In this study, test-takers with study-abroad experience did not perform better
on the test, compared to those without such experience. In spite of the prevailing belief in
the superiority of study-abroad learning contexts over traditional classroom contexts, this
study suggested that test-takers without study-abroad experience were just as likely (or
unlikely) to do well on the test. As
Collentine and Freed (2004) pointed out, the types of language gains attributed to an
immersion environment, such as gains in oral fluency, might not be captured by measures
that were not designed to detect them. In
TOEFL iBT® testing, the speaking and writing tasks are rated based on holistic scales.
Scores are not available on aspects of speaking, such as oral fluency and pronunciation.
Scores are not available on aspects of writing either, such as writing fluency and
complexity. Such variables might not be readily testable or quantifiable by the existing metrics, but they
could reveal differences across learning contexts and inform research on language
acquisition. Reporting scores on these variables might also provide language learners
with more fine-grained feedback on their progress.
Test-Taker Profiles
Viewing the study-abroad learning context as different from the classroom-based
instructional context is built upon the assumption that the former provides more
opportunities for contact with the target language community (Dewey, 2004; Freed et al.,
2004; Segalowitz & Freed, 2004). Freed et al. (2004) raised concern about this presumed
privilege, and proposed to use concrete measures to characterize the nature, quality, and
quantity of that contact. The Language Contact Profile
(Freed, Dewey, & Segalowitz, 2004) was designed to provide such a measure to
document and quantify language learners' interaction with native speakers during time
abroad.
This study examined whether test-takers with different learning experiences performed
differently on the test. Their language contact experience,
or lack of it, was characterized by a dichotomous measure: either having it or not having
it. No difference in test performance across the groups was found, which could be
attributed to the fact that the real differences in learning contexts were not captured by
such a crude measure. Future test administrations could collect detailed language contact
information for the test-takers. This contact information can be used to establish FL test-
taker profiles. Such a profile traditionally includes information like age, gender, native
country and native language, etc. With increasing opportunities for study-abroad,
especially after the U.S. Senate declared 2006 the Year of Study Abroad (Magnan &
Back, 2007), it has become more relevant to understand test-takers not only by pre-
existing demographic information but also by what they have done to make use of the
target language in non-traditional learning contexts, such as study abroad. Elements such
as the nature and intensity of target language contact, which are largely absent from the traditional
classroom setting, may trigger or hinder learning in study-abroad contexts. Such elements
need to be built into FL test-taker profiles. Carefully and fully developed test-taker
profiles will enhance understanding of what to expect from and how to deal with our
increasingly diverse test-taking populations.
A SEM Approach
This study used a structural equation modeling (SEM) approach to examine the
construct of communicative language ability and its relationships with learning
contexts, based on test performance on the TOEFL iBT® test. The research interest in the
nature of language ability has been shared by the language testing community, whereas
understanding the factors that affect language acquisition has been a focus of attention in
the field of second language acquisition (Bachman & Cohen, 1998). At the interface between
language testing and acquisition, there is the issue of how to bring the insights gained
from one field to inform the research agenda in the other. A SEM approach, a hybrid of
factor analysis and path analysis, offers a research method that can be used to address
these interface issues.
A structural equation model usually has two parts. The measurement component
illustrates the relationships between latent abilities and their indicators. Such a latent
ability cannot be observed directly; it must be inferred from observed performance.
Multiple indicators (observed scores) are also required for each latent factor to ensure
adequate measurement. Finding a latent structure that underlies
test performance clarifies the nature of the theoretical construct. In language testing, validating a theory of
communicative language ability could involve finding a latent factor structure based on
test performance that is compatible with the theoretical configuration of this ability. As
results of such validation attempts accumulate over multiple tests and under different testing
conditions, theories of language ability can be advanced.
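In generic notation (a sketch of the standard formulation, not the exact specification used in this study), the measurement part can be written as

    $$x = \tau + \Lambda\xi + \delta,$$

where $x$ is the vector of observed scores, $\tau$ the intercepts, $\Lambda$ the matrix of factor loadings, $\xi$ the vector of latent abilities, and $\delta$ the vector of measurement errors.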
The second part of a structural equation model is the structural model. Building a
structural model allows an investigation of the impact of multiple variables on the latent
abilities simultaneously. If an explanatory variable is itself latent, the
nature of this variable should be verified in a measurement model first before being
entered into the structural model. In language testing, these independent variables usually
represent test-taker characteristics. Examining the relationships between latent abilities and these
characteristics can shed light on how test-takers acquire a certain ability and/or reach a
certain ability level. The structural model thus brings language educators into a dialogue with
language testers.
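Continuing the same generic notation, and assuming for illustration that the explanatory variables are observed and collected in a vector $z$ (as the length-of-experience variables were in this study), the structural part regresses the latent abilities on them:

    $$\xi = \Gamma z + \zeta,$$

where $\Gamma$ contains the path coefficients of substantive interest and $\zeta$ the disturbances, that is, the portion of each latent ability left unexplained by the modeled characteristics.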
From the measurement model, language teachers will gain a better understanding
of learners' achievement: exactly what they learn and what kinds of abilities they acquire.
This will affect how they direct teaching resources and organize curriculum.
From the structural model, test developers will have better ideas about how the test
functions; that is, whether test results reflect the language gains associated with changes in test-
taker characteristics. This will lead to better use of test instruments to detect language
gains. A SEM approach provides a platform upon which such conversations across language
testing and language acquisition can take place. One way such a model can be specified in software is sketched below.
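As a minimal sketch of how such a two-part model could be specified, the following uses the open-source Python package semopy rather than the Mplus software used in this study; the factor names, indicator names, covariates, and data file are hypothetical stand-ins, not the study's actual variables.

    # A minimal sketch, assuming the semopy package; all variable names are
    # hypothetical stand-ins for the study's indicators and covariates.
    import pandas as pd
    import semopy

    MODEL_DESC = """
    speaking =~ speak1 + speak2 + speak3
    nonspeaking =~ listen1 + read1 + write1 + write2
    speaking ~ years_english + years_content
    nonspeaking ~ years_english + years_content
    """

    data = pd.read_csv("toefl_sample.csv")  # hypothetical data, one row per test-taker

    model = semopy.Model(MODEL_DESC)  # "=~" lines: measurement part; "~" lines: structural part
    model.fit(data)                   # estimates loadings and path coefficients
    print(model.inspect())            # parameter estimates with standard errors

The first two lines of the model description define the measurement part (latent factors and their indicators); the last two define the structural part (latent factors regressed on test-taker characteristics).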
Unique Contributions
This study made a unique attempt to examine the role of context in the definition
of communicative language ability. With this intent in mind, the TOEFL iBT® test was
chosen because this test intends to measure communicative language ability, with both
skill abilities and the context of language use as parts of the theoretical definition of the
test construct. Context-based factors were modeled based on the
performance elicited by the test. The content and setting factors were specifically
chosen in hopes of allowing context-dependent abilities to
surface and appear in the internal structure of the test. Contextual factors have not been
examined in the literature on construct studies in language testing. This study set an
example of using factor analysis to examine the role of context in defining language
ability.
This study also examined a test-taker characteristic (TTC) that had not been studied in the context of TOEFL iBT® testing. The
language testing community has long been aware of the fact that the makeup of language
ability might not be equivalent across test-taker groups with different characteristics.
Language contact with the target language community, a factor that has been studied
in the study abroad literature, was treated as such a TTC in this study
and was examined in its relation to the underlying structure of the construct. With
growing opportunities for study abroad in recent
years, this TTC has become relevant and salient in more testing situations, especially in the situation
of TOEFL iBT® testing, since the test is administered both domestically and internationally.
Moreover, not only the factorial structure but also the mean structure of the
construct was examined across groups. Comparing latent factor means is preferable to
comparing means based on observed scores because: (1) the pre-established measurement
invariance ensures that the latent factors represent the same abilities across groups, and
(2) measurement errors are taken into account in the overall model. This study made a
methodological contribution by incorporating the mean structure into the invariance
analysis.
Limitations and Future Research
This study started the analysis with a dataset of a randomly generated sample of
1000 subjects. Due to missing values and inconsistent responses in the original dataset,
only 370 subjects were included in the final analysis. The reasons for the missing values
and inconsistent responses could only be speculated upon, not confirmed. One reason
could be that some test-takers did not interpret the background questions correctly
because these questions were presented in English rather than in their native languages.
Unknown factors, which could not be taken into account due to the lack of information, might have
had influences on the results of the analysis. Future researchers are recommended to use
datasets with more complete background information.
The study sample of 370 subjects matched the original random sample of 1000
subjects in terms of test performance. Although the study sample was sufficient to carry out a single-group
analysis, when divided into groups in the multi-group analysis, the subgroup samples
seemed relatively small. Confidence in the results of such analyses could still be sustained
on the grounds that the study sample could be considered a good representative of the
total sample as well as the target population. Nevertheless, the results should be
interpreted with caution.
When investigating the latent structure of the construct, the writing factor was
represented by only two indicators. In factor analysis, a factor needs at least two
indicators for the model to be identifiable. This is the minimum requirement, and in
practice it is advised to have more than two indicators for each factor. Future researchers
are recommended to test the latent structure of FL ability with each factor represented by
at least three indicators.
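As a standard identification count (general factor-analytic arithmetic, not a result from this study): a factor measured by $p$ indicators, considered in isolation with its variance fixed to 1 for scaling, offers $p(p+1)/2$ observed variances and covariances against $2p$ free parameters ($p$ loadings plus $p$ error variances):

    $$p = 2:\ \tfrac{2 \cdot 3}{2} = 3 < 4 = 2p, \qquad p = 3:\ \tfrac{3 \cdot 4}{2} = 6 = 6 = 2p.$$

With only two indicators, the factor is under-identified on its own and becomes estimable only through its covariances with the other factors in the model; with three indicators it is just-identified.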
When investigating the influences of language use context, this study chose to
focus on only two situational factors, content and setting. Not all features used to define
the context of language use were tested in this study. This was partially because not all
situational features were present in the specifications of all test tasks. Future researchers
interested in this line of study could explore other situational factors, such as participants.
Although the design of the TOEFL iBT® test intends to reflect the role of
language use context in defining the communicative language ability, not all tasks have a
fully developed language use situation that would allow for adequately assessing the
impact of contextual factors. This might have been the cause for failing to model the
context-based abilities in this study. Future researchers
who are interested in the nature of communicative language ability should conduct
studies based on tests that are developed through a communicative approach. Such a test,
with a clear focus on communicative language use, could organize the domain of interest
by language use situation so that it is more likely for contextual factors to be a part of the
representation of the test construct. The development of language tasks used in such a test
should also have a clear context orientation, focusing on key features that have been
identified as theoretically important. Such a test might need to
make a complete departure from the widely accepted four-skills design to fully
realize a communicative orientation.
The design of the multi-group analysis separated the test-takers into two groups,
either having or not having the study-abroad experience. However, some findings from
the study abroad literature suggested a full year of study-abroad as the threshold for the
linguistic benefits to become evident. A grouping based on the length of time abroad,
either more or less than one year of study-abroad experience, might allow for detecting
such benefits.
This full year threshold hypothesis was not originally intended, and therefore was
not tested in this study. However, after finding no structural and mean difference between
the two groups defined in the study, two additional sets of analyses were attempted based
on the study sample of 370 test-takers. In the first analysis, test-takers with less than 6
months of study-abroad (N=191) were compared to the ones with more than 6 months of
time abroad (N=179). With measurement invariance established, the latter group
performed significantly better (p < 0.01) on speaking than the former. In the second
analysis, test-takers with less than one year of study-abroad (N=240) were compared to
the ones with more than one year of time abroad (N=130). With measurement
invariance established, the latter group performed significantly better on the speaking
factor (p < 0.001), and to a lesser degree on the non-speaking factor (p < 0.05). These
results suggested that language ability, especially the speaking ability, was positively
associated with the length of study-abroad. For differences in language gains to become
evident, a sufficient length of target language contact appears necessary. Future
researchers are encouraged to test this full-year threshold hypothesis in other
testing situations. They are also encouraged to examine TTCs that have not caught the
research community’s attention but may be relevant to language testing and acquisition.
The current study had a research interest in the joint impact of classroom learning
and study-abroad experiences on language ability. The learning experience
variables, however, were defined by length in years. The richness of these learning experiences was
not fully captured in the models tested. This probably explains why a large portion of
the factor variances could not be accounted for in the study. Variables other than the ones
specified in the models were not investigated, due to lack of such information. It is
recommended that future researchers who share the same research interest collect
detailed information on the nature and intensity of learners' language contact with the
target language. The findings of this study also
encourage collaborative research efforts joined by both language testing researchers and
language acquisition researchers. Through a proper method, such as SEM, this line of
research would inform not only what constitutes language ability but also how the aspects
of this ability develop under different learning conditions.
Finally, data utilized in the study were purely observational, providing a static
image of the relationships among the variables at one point in time. Without a controlled
experimental design, causal statements could not be made with confidence, which
prohibited making inferences about the factors responsible for language development.
Carefully designed experimental studies are recommended for the future to help fully
address these questions.
NOTES
1
In this article, a foreign language refers to a language that is learned after a person has
already learned his or her native language(s). The term ‘foreign language’ is used
interchangeably with second language. In the context of this study, foreign language
ability is conceptualized as a latent trait. The term ‘ability’ is used interchangeably with
proficiency and competence.
2
TOEFL iBT is a registered trademark of Educational Testing Service (ETS). This
publication is not endorsed or approved by ETS.
3
The higher-order factor model was also tested based on the test performance of 1000
test-takers in the total random sample. The model was rejected because extremely high
correlations were found between the higher-order factor and the first-order factors.
4
Because of the confirmatory nature of the analysis, only models that had been confirmed
in the previous TOEFL® literature were hypothesized. However, after finding a high
correlation between listening and writing, these two factors were grouped together and
tested in a three-factor model (listening/writing, reading, and speaking). A close to 0.9
correlation was found between the reading factor and the listening/writing factor, which
suggested that a two-factor model with a speaking factor and a non-speaking factor
would be a more appropriate model for the data. The result of post hoc model fitting was
not reported because of the exploratory nature of modification procedures and the risk of
capitalization on chance factors.
5
The correlated four-factor model was also tested based on the test performance of 1000
test-takers in the total random sample. The model was rejected because extremely high
correlations were found among the latent factors.
6
The correlated two-factor model was also tested based on the test performance of 1000
test-takers in the total random sample. Taking all criteria into consideration, this model
provided the best explanation of the data.
7
A model with only content and no skill factors was tested based on the 370 test-takers’
performance. The result indicated that the model fit was not acceptable.
8
The two-dimensional model with both content and skill factors was also tested on the
test performance of 1000 test-takers in the total random sample. The model was rejected
because the estimated correlation between the two content factors was extremely high,
and a number of factor loadings were nonsignificant.
9
A model with only setting and no skill factors was tested based on the 370 test-takers’
performance. The result indicated that the model fit was not acceptable.
10
The two-dimensional model with both setting and skill factors was also tested on the
test performance of 1000 test-takers in the total random sample. The model was rejected
because the estimated correlations among the setting factors were extremely high, and a
number of factor loadings were nonsignificant.
11
Simultaneous multiple-group invariance analyses were also performed across the two
groups: 418 domestic test-takers and 582 overseas test-takers in the total random sample.
Both measurement and structural invariance could be held. No factor mean difference
was found. This result supported using the same test format and score reporting scheme
both internationally and domestically. This step was not reported because test-taking
location was found not to be a reliable indicator of language contact experience.
REFERENCES
Alderson, J. C., & Banerjee, J. (2002). Language testing and assessment (Part 2).
Language Teaching, 35, 79-113. doi: 10.1017/S0261444802001751
Anastasi, A., & Urbina, S. (1997). Psychological testing. Upper Saddle River, NJ:
Prentice-Hall.
Bachman, L. F. (2000). Modern language testing at the turn of the century: Assuring that
what we count counts. Language Testing, 17(1), 1–42. doi:
10.1177/026553220001700101
Bachman, L. F., Davidson, F., & Foulkes, J. (1990). A comparison of the abilities
measured by the Cambridge and Educational Testing Service EFL Test Batteries.
Issues in Applied Linguistics, 1(1), 30-55.
Bachman, L. F., Davidson, F., Ryan, K. & Choi, I.-C. (1995). An investigation into the
comparability of two tests of English as a foreign language: The Cambridge-
TOEFL comparability study. New York: Cambridge University Press.
Bachman, L. F., & Palmer, A. S. (1982). The construct validation of some components of
communicative proficiency. TESOL Quarterly 16(4), 449–465.
Bachman, L. F., & Palmer, A. S. (1983). The construct validation of the FSI oral
interview. In J. W. Oller, Jr. (Ed.), Issues in language testing research (pp. 154-
169). Rowley, MA: Newbury House.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and
developing useful language tests. Oxford: Oxford University Press.
Bae, J., & Bachman, L. F., (1998). A latent variable approach to listening and reading:
Testing factorial invariance across two groups of children in the Korean/English
two-way immersion program. Language Testing, 15(3), 380-414. doi:
10.1177/026553229801500304
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A.
Bollen, & J. S. Long (Eds.), Testing structural equation models (pp. 136-162).
Newbury Park, CA: Sage.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the
multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81-105. doi:
10.1037/h0046016
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second
language teaching and testing. Applied Linguistics, 1(1), 1-47.
Carroll, J. B. (1958). A factor analysis of two foreign language aptitude batteries. Journal
of General Psychology, 59(3), 3-19.
Carroll, J. B. (1983). Psychometric theory and language testing. In J. W. Oller, Jr. (Ed.),
Issues in language testing research (pp. 80-107). Rowley, MA: Newbury House.
Chapelle, C., Grabe, W., & Berns, M. (1997). Communicative language proficiency:
Definition and implications for TOEFL 2000. (TOEFL Monograph Series Report
No. 10). Princeton, NJ: Educational Testing Service.
College Board. (1987). Advanced placement course description: French. New York:
College Entrance Examination Board.
College Board. (1989). Advanced placement course description: Spanish language and
literature. New York: College Entrance Examination Board.
Collentine, J., & Freed, B. (2004). Learning context and its effects on second language
acquisition. Studies in Second Language Acquisition, 26, 153-171.
doi:10.1017/S0272263104262015
Educational Testing Service. (2004). The next generation TOEFL test: Focus on
communication. Retrieved from http://www.ets.org/toefl/nextgen/
Educational Testing Service. (2008). Validity evidence supporting the interpretation and
use of TOEFL® iBT scores. Retrieved from http://www.ets.org
Fouly, K. A., Bachman, L. F., & Cziko, G. A. (1990). The divisibility of language
competence: A confirmatory approach. Language Learning, 40(1), 1-21. doi:
10.1111/j.1467-1770.1990.tb00952.x
Freed, B. F. (1995). What makes us think that students who study abroad become fluent?
In B. F. Freed (Ed.), Second language acquisition in a study abroad context (pp.
123-148). Amsterdam: John Benjamins.
Freed, B. F., Dewey, D. P., Segalowitz, N., & Halter, R. (2004). The language
contact profile. Studies in Second Language Acquisition, 26, 349-356. doi:
10.1017/S0272263104062096
Freed, B. F., Segalowitz, N., & Dewey, D. P. (2004). Context of learning and second
language fluency in French: Comparing regular classroom, study abroad, and
intensive domestic immersion programs. Studies in Second Language Acquisition,
26, 275-301. doi: 10.1017/S0272263104062060
Gardner, R. C., & Lambert, W. E. (1965). Language aptitude, intelligence, and second
language achievement. Journal of Educational Psychology, 56(4), 191-199. doi:
10.1037/h0022400
Ginther, A., & Grant, L. (1996). A review of the academic needs of native English-
speaking college students in the United States. (TOEFL Monograph Series Report
No. 1). Princeton, NJ: Educational Testing Service.
Ginther, A., & Stevens, J. (1998). Language background, ethnicity, and the internal
construct validity of the Advanced Placement Spanish Language Examination. In
A. J. Kunnan (Ed.), Validation in language assessment (pp. 169-194). Mahwah,
NJ: Lawrence Erlbaum Associates.
Hale, G. A., Stansfield, C. W., Rock, D. A., Hicks, M. M., Butler, F. A., & Oller, J. W.
Jr. (1988). Multiple-choice cloze items and the Test of English as a Foreign
Language. (TOEFL Research Report No. 26; ETS Research Report No. 88-02).
Princeton, NJ: Educational Testing Service.
Hale, G. A., Rock, D. A., & Jirele, T. (1989). Confirmatory factor analysis of the Test of
English as a Foreign Language. (TOEFL Research Report No. 32; ETS Research
Report No. 89-42). Princeton, NJ: Educational Testing Service.
Harley, B., Cummins, J., Swain, M., & Allen, P. (1990). The nature of language
proficiency. In B. Harley, P. Allen, J. Cummins, & M. Swain (Eds.), The
development of second language proficiency (pp. 7-25). New York: Cambridge
University Press.
Hosley, D., & Meredith, K. (1979). Inter- and intra-test correlates of the TOEFL. TESOL
Quarterly, 13(2), 209-217.
Hu, L.-T., & Bentler, P. M. (1999). Cutoff criteria for fit indices in covariance structure
analysis: Conventional criteria versus new alternatives. Structural Equation
Modeling, 6, 1-55.
Jamieson, J., Jones, S., Kirsch, I., Mosenthal, P., & Taylor, C. (2000). TOEFL 2000
framework: A working paper. (TOEFL Monograph Series Report No. 16).
Princeton, NJ: Educational Testing Service.
Kline, R. B. (1998). Principles and practice of structural equation modeling. New York:
Guilford.
Kline, R. B. (2005). Principles and practice of structural equation modeling (2nd ed.).
New York: Guilford.
Lafford, B. A. (2004). The effect of the context of learning on the use of communication
strategies by learners of Spanish as a second language. Studies in Second
Language Acquisition, 26, 201-225. doi: 10.1017/S0272263104062035
Magnan, S. S., & Back, M. (2007). Social interaction and linguistic gain during study
abroad. Foreign Language Annals, 40(1), 43-61. doi: 10.1111/j.1944-
9720.2007.tb02853.x
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed.) (pp.
13-103). New York: American Council on Education and Macmillan.
McNamara, T. F. (1990). Item response theory and the validation of an ESP test for
health professionals. Language Testing 7, 52-76. doi:
10.1177/026553229000700105
Morgan, R., & Mazzeo, J. (1988). A Comparison of the structural relationships among
reading, listening, writing, and speaking components of the AP French Language
Examination for AP candidates and college students. (ETS Research Report No.
88-59). Princeton, NJ: Educational Testing Service.
Muthén, L. K., & Muthén, B. O. (2010). Mplus User’s Guide (6th ed.). Los Angeles, CA:
Muthén & Muthén.
Oller, J. W. Jr. (1974). Expectancy for successive elements: Key ingredient to language
use. Foreign Language Annals, 7, 443-452.
Oller, J. W. Jr. (1979). The factorial structure of language proficiency: Divisible or not?
In J. W. Oller, Jr. Language test at school: A pragmatic approach (pp. 423-458).
London: Longman.
Oller, J. W. Jr. (1983). A consensus for the eighties? In J. W. Oller, Jr. (Ed.), Issues in
language testing research (pp. 351-356). Rowley, MA: Newbury House.
Oller, J. W. Jr., & Hinofotis, F. B. (1980). Two mutually exclusive hypotheses about
second language ability: Indivisible or partially divisible competence. In J. W.
Oller, Jr. & K. Perkins (Eds.), Research in language testing (pp. 13-23). Rowley,
MA: Newbury House.
Pimsleur, P., Stockwell, R. P., & Comrey, A. L. (1962). Foreign language learning
ability. Journal of Educational Psychology, 53(1), 15-26. doi: 10.1037/h0044336
Römhild, A. (2008). Investigating the invariance of the ECPE factor structure across
different proficiency levels. Spaan Fellow Working Papers in Second or Foreign
Language Assessment, 6, 29-55. Ann Arbor, MI: University of Michigan English
Language Institute. www.isa.umich.edu/eli/research/spaan
Sang, F., Schmitz, B., Vollmer, H. J., Baumert, J., & Roeder, P. M. (1986). Models of
second language competence: A structural equation approach. Language Testing,
3(1), 54-79. doi: 10.1177/026553228600300103
Satorra, A., & Bentler, P. M. (1994). Corrections to test statistics and standard errors in
covariance structure analysis. In A. von Eye, & C. C. Clogg (Eds.), Latent
variables analysis (pp. 399-419). Thousand Oaks, CA: Sage.
Sawaki, Y., Stricker, L., & Oranje, A. (2008). Factor structure of the TOEFL Internet-
Based Test (iBT): Exploration in a field trial sample. (TOEFL iBT Research
Report No. 04; ETS Research Report No. 08-09). Princeton, NJ: Educational
Testing Service.
Sawaki, Y., Stricker, L. J., & Oranje, A. H. (2009). Factor structure of the TOEFL
Internet-based test. Language Testing, 26(1), 5-30. doi:
10.1177/0265532208097335
Scholz, G., Hendricks, D., Spurling, R., Johnson, M., & Vandenburg, L. (1980). Is
language ability divisible or unitary? A factor analysis of twenty-two English
proficiency tests. In J. W. Oller, Jr. & K. Perkins (Eds.), Research in language
testing (pp. 24-33). Rowley, MA: Newbury House.
Shin, S.-K. (2005). Did they take the same test? Examinee language proficiency and the
structure of language tests. Language Testing, 22(1), 31-57. doi:
10.1191/0265532205lt296oa
Shin, S. (2008). Examining the construct validity of a web-based academic listening test:
An investigation of the effects of response formats. Spaan Fellow Working
Papers in Second or Foreign Language Assessment, 6, 95-129. Ann Arbor, MI:
University of Michigan English Language Institute.
www.isa.umich.edu/eli/research/spaan
Skehan, P. (1991). Progress in language testing: The 1990s. In J. C. Alderson, & B. North
(Eds.), Language testing in the 1990s: The communicative legacy (pp. 3-21).
London: Macmillan.
SPSS Inc. (2009). PASW® Statistics Base 18. Chicago, IL: SPSS Inc.
Stricker, L. J., & Rock, D. A. (2008). Factor structure of the TOEFL Internet-Based Test
across subgroups. (TOEFL iBT Research Report No. 07; ETS Research Report
No. 08-66). Princeton, NJ: Educational Testing Service.
Stricker, L. J., Rock, D. A., & Lee, Y.-W. (2005). Factor structure of the LanguEdge™
Test across language groups. (TOEFL Monograph Series Report No. 32).
Princeton, NJ: Educational Testing Service.
Swinton, S. S., & Powers, D. E. (1980). Factor analysis of the Test of English as a
Foreign Language for several language groups. (TOEFL Research Report No.
06; ETS Research Report No. 80-32). Princeton, NJ: Educational Testing Service.
Taylor, C. (1993). Report of TOEFL score users focus groups. TOEFL 2000 Internal
Report. Princeton, NJ: Educational Testing Service.
Upshur, J. A., & Homburg, T. J. (1983). Some relations among language tests at
successive ability levels. In J. W. Oller, Jr. (Ed.), Issues in language testing
research (pp. 188-202). Rowley, MA: Newbury House.
Vollmer, H. J., & Sang, F. (1983). Competing hypotheses about second language ability:
A plea for caution. In J. W. Oller, Jr. (Ed.), Issues in language testing research
(pp. 29-79). Rowley, MA: Newbury House.
Wang, S. D. (2006). Validation and invariance of factor structure of the ECPE and
MELAB across gender. Spaan Fellow Working Papers in Second or Foreign
Language Assessment, 4, 41-56. Ann Arbor, MI: University of Michigan English
Language Institute. www.isa.umich.edu/eli/research/spaan
Wolf, M. K., Kao, J., Herman, J., Bachman, L. F., Bailey, A., Bachman, P. L.,
Farnsworth, T., & Chang, C. (2008). Issues in assessing English language
learners: English language proficiency measures and accommodation uses—
Literature review (Part 1 of 3). (CRESST Report No. 731). Los Angeles, CA:
CRESST/UCLA.
Woods, A. (1983). Principal components and factor analysis in the investigation of the
structure of language proficiency. In A. Hughes & D. Porter (Eds.), Current
developments in language testing (pp. 43-52). New York: Academic Press.