
University of Iowa

Iowa Research Online


Theses and Dissertations

Spring 2011

At the interface between language testing and second language acquisition: communicative language ability and test-taker characteristics
Lin Gu
University of Iowa

Copyright 2011 LIN GU

This dissertation is available at Iowa Research Online: https://ir.uiowa.edu/etd/972

Recommended Citation
Gu, Lin. "At the interface between language testing and second language acquisition: communicative language ability and test-taker
characteristics." PhD (Doctor of Philosophy) thesis, University of Iowa, 2011.
https://ir.uiowa.edu/etd/972.



AT THE INTERFACE BETWEEN LANGUAGE TESTING AND SECOND

LANGUAGE ACQUISITION:

COMMUNICATIVE LANGUAGE ABILITY AND TEST-TAKER

CHARACTERISTICS

by

Lin Gu

A thesis submitted in partial fulfillment of the requirements


for the Doctor of Philosophy degree in Second Language Acquisition
in the Graduate College of
The University of Iowa

May 2011

Thesis Supervisors: Associate Professor Judith E. Liskin-Gasparro


Associate Professor Timothy N. Ansley
Copyright by

LIN GU

2011

All Rights Reserved


Graduate College
The University of Iowa
Iowa City, Iowa

CERTIFICATE OF APPROVAL

____________________________

PH.D. THESIS

_____________

This is to certify that the Ph.D. thesis of

Lin Gu

has been approved by the Examining Committee


for the thesis requirement for the Doctor of Philosophy degree
in Second Language Acquisition
at the May 2011 graduation.

Thesis Committee: ________________________________________


Judith E. Liskin-Gasparro, Thesis Supervisor

________________________________________
Timothy N. Ansley, Thesis Supervisor

________________________________________
Helen H. Shen

________________________________________
Lia M. Plakans

________________________________________
Bonnie S. Sunstein
To my loved ones:
Mother: Ms. 卓志华
Father: Mr. 顾生根

ACKNOWLEDGMENTS

I would like to thank my advisor, Dr. Liskin-Gasparro, for her guidance, help, and

encouragement in every phase of this dissertation study. Without her belief in me and her

unfailing support, I would not have been able to see this study to its completion.

I am also indebted to Dr. Ansley for his insightful comments and challenging

questions. The training I received from studying in his department helped me

tremendously to conceptualize and implement this research project.

Sincere thanks also go to Dr. Plakans, Dr. Shen, and Dr. Sunstein for their warm

encouragement throughout the conduct of this study.

I would like to thank the Graduate College of the University of Iowa and

Educational Testing Service for funding this research project, and to thank Educational

Testing Service for permission to use its copyrighted test material and data.

Last, but not least, I wish to express my deep gratitude to my parents, to Mary,

and to Walker for their support and love.

ABSTRACT

The present study investigates the nature of communicative language ability as

manifested in performance on the TOEFL iBT® test, as well as the relationship between

this ability and test-takers’ study-abroad and learning experiences. The research interest

in the nature of language ability is shared by the language testing community, whereas

understanding the factors that affect language acquisition has been a focus of attention in

the field of second language acquisition (Bachman & Cohen, 1998). This study utilizes a

structural equation modeling approach, a hybrid of factor analysis and path analysis, to

address issues at the interface between language testing and second language acquisition.

The purpose of this study is two-fold. The first has a linguistic focus: to provide

empirical evidence to enhance our understanding of the nature of communicative

language ability by examining the dimensionality of this construct in both its absolute

and relative senses. The second purpose, which has a social and cultural orientation, is to

investigate the possible educational, social, and cultural influences on the acquisition of

English as a foreign language, and the relationships between test performance and test-

taker characteristics.

The results revealed that the ability measured by the test was predominantly skill-

oriented. The role of the context of language use in defining communicative language

ability could not be confirmed due to a lack of empirical evidence. As elicited by the test,

this ability was found to have equivalent underlying representations in two groups of test-

takers with different context-of-learning experiences. The common belief in the

superiority of the study-abroad environment over learning in the home country could not

be upheld. Furthermore, both study-abroad and home-country learning were shown to

have significant associations with aspects of the language ability, although the results

also suggested that variables other than the ones specified in the models may have had an

impact on the development of the ability being investigated.

From a test validation point of view, the results of this study provide crucial

validity evidence regarding the test’s internal structure, this structure’s generalizability

across subgroups of test-takers, as well as its external relationships with relevant test-

taker characteristics. Such a validity inquiry contributes to our understanding of what

constitutes the test construct, and how this construct interacts with the individual and

socio-cultural variables of foreign language learners and test-takers.

TABLE OF CONTENTS

LIST OF TABLES……………………………………………………………………...viii

LIST OF FIGURES………………………………………………………………………ix

CHAPTER ONE INTRODUCTION……………………………………………………...1

Context of the Problem……………………………………………………………1


Purposes of the Study……………………………………………...5
Contributions and Significance…………………………………………………...6
Organization of the Dissertation…………………………………………………9

CHAPTER TWO LITERATURE REVIEW…………………………………….............12

Introduction………………………………………………………………………12
Unitary Competence Hypothesis………………………………………………...14
Multidimensional Competency Models………………………………………….18
Method Effect……………………………………………………………………32
The Factor Analysis Approach………………………………………………34
The CFA-MTMM Matrix Approach………………………………………...39
Test-Taker Characteristics and Test Performance……………………………….43
Proficiency Level…………………………………………………………….47
Other Background Characteristics…………………………………………...52
Target Language Contact……………………….……………………………56
ETS–TOEFL® Studies………………………………………………………64
Making a Validity Argument…………………………………………………….71
Conclusion……………………………………………………………………….72

CHAPTER THREE METHODOLOGY…………………………………………….......79

About the TOEFL iBT® Test……………………………………………………79


General Goals of the Study………………………………………………………82
Research Questions………………………………………………………………85
Research Question One………………………………………………………85
Research Question Two……………………………………………………...87
Research Question Three…………………………………………………….88
The TOEFL iBT® Public Dataset………………………………………………..90
Sample Representativeness and Appropriateness………………………………..95
Measures………………………………………………………………………..101
Structure of the Test………………………………………………………...101
Descriptions of Task Situations…………………………………………….102
Categorizing Tasks by Content and Setting………………………………...106
Analysis Procedures……………………………………………………….........107
Level of Measure…………………………………………………………...108
Distribution of Values………………………………………………………109

Linearity and Multicollinearity……………………………………………..110
Estimation Method………………………………………………………….110
Assessing Model Fit………………………………………………………...111
Establishing the Baseline Model……………………………………………113
Modeling the Context of Language Use……………………………………118
Multi-Group Invariance Analysis…………………………………………..120
Building Structural Equation Models………………………………………124

CHAPTER FOUR ANALYSIS AND RESULTS……………………………………...126

Preliminary Analysis……………………………………………………………126
The Baseline Model…………………………………………………………….129
The Context of Language Use………………………………………………….132
The Content Dimension…………………………………………………….132
The Setting Dimension……………………………………………………..133
The Final Model………………………………………................................136
Multi-Group Invariance Analysis………………………………………………138
Measurement Invariance……………………………………………………141
Structural Invariance………………………………………………………..149
Results of Multi-Group Invariance Analysis…………………………….....152
Structural Equation Models…………………………………………………….153
The Home-Country Group………………………………………………….157
The Study-Abroad Group…………………………………………………..159

CHAPTER FIVE DISCUSSION AND CONCLUSIONS……………………………162

Overview of the Study………………………………………………………….162


Summary of the Primary Findings……………………………………………...163
Discussion……………………………………………………………………..164
The Nature of Communicative Language Ability………………………….165
Group Membership and Language Ability…………………………………171
Learning Contexts and Language Ability…………………………………..176
Implications……………………………………………………………………..178
Foreign Language Test Development………………………………………178
Test-Taker Profiles………………………………………………………….181
The Interface between Testing and Acquisition……………………………182
Unique Contributions…………………………………………………………...184
Limitations and Recommendations…………………………………..................186

NOTES………………………………………………………………………………….190

REFERENCES…………………………………………………………………………192

LIST OF TABLES

Table 1. Test-Taker Characteristics across the Two Samples………………………95

Table 2. Descriptive Statistics across the Two Samples…………………………...100

Table 3. One-Sample t-Tests………………………………………………………100

Table 4. Task Content and Context………………………………………………..107

Table 5. Descriptive Statistics for the Observed Variables………………………..127

Table 6. Correlations of the Observed Variables…………………………………..128

Table 7. Fit Indices for the Three Competing Models……………………………..129

Table 8. Descriptive Statistics across the Two Groups……………………………140

Table 9. Fit Indices from the Multi-Group Measurement Invariance Analysis…147

Table 10. Fit Indices from the Multi-Group Structural Invariance Analysis……….151

Table 11. Fit Indices for the Structural Equation Models…………………………..156

LIST OF FIGURES

Figure 1. Higher-Order Factor Model………………………………………………116

Figure 2. Correlated Four-Factor Model……………………………………………117

Figure 3. Correlated Two-Factor Model……………………………………………118

Figure 4. Correlated Two-Factor Model with a Content Dimension……………….134

Figure 5. Correlated Two-Factor Model with a Setting Dimension………………..135

Figure 6. Final Model with Unstandardized Estimates……………………………..139

Figure 7. Final Model with Standardized Estimates………………………………..140

Figure 8. Factor Structure Invariance with Unstandardized Estimates Group I……142

Figure 9. Factor Structure Invariance with Unstandardized Estimates Group II…...143

Figure 10. Factor Loading Invariance with Unstandardized Estimates Group I.........144

Figure 11. Factor Loading Invariance with Unstandardized Estimates Group II……145

Figure 12. Indicator Residual Invariance with Unstandardized Estimates Group I….147

Figure 13. Indicator Residual Invariance with Unstandardized Estimates Group II...148

Figure 14. Factor Mean Invariance with Unstandardized Estimates Group I………..150

Figure 15. Factor Mean Invariance with Unstandardized Estimates Group II………151

Figure 16. Structural Equation Model Group I………………………………………155

Figure 17. Structural Equation Model Group II……………………………………...156

Figure 18. Structural Equation Model with Standardized Estimates Group I……….158

Figure 19. Structural Equation Model with Standardized Estimates Group II………160


CHAPTER ONE

INTRODUCTION

Context of the Problem

Recent years have witnessed a growing number of foreign language (FL) learners

in the United States and worldwide. To meet the assessment needs of this expanding

population, tests whose development and validation are well grounded in applied

linguistics and educational measurement theories are in high demand.

The most recent edition of Standards for Educational and Psychological Testing

(American Educational Research Association [AERA], American Psychological

Association [APA], & National Council on Measurement in Education [NCME], 1999;

hereafter referred to as Standards) asserts the importance of developing sound

educational and psychological measures, and it regards these measures as “the most

important contributions of behavioral science to our society, providing fundamental and

significant improvements over previous practices” (p. 1). The Standards regards validity

as the “most fundamental consideration in developing and evaluating tests,” and it

defines validity as “the degree to which evidence and theory support the interpretations of

test scores entailed by proposed uses of tests” (p. 9).

Validity has been traditionally divided into three different types (Alderson &

Banerjee, 2002; Bachman, 1990). The three traditionally used validity categories include

content-related validity, predictive and concurrent criterion-related validity, and

construct-related validity (Anastasi & Urbina, 1997; Messick, 1989). Messick (1989)

reintroduced validity as a unitary concept. As defined in Messick’s (1989) seminal

article, validity is “an integrated evaluation judgment of the degree to which empirical

evidence and theoretical rationales support the adequacy and appropriateness of

inferences and actions based on test scores or other modes of assessment” (p. 13; original

emphases in italics). The earlier division of validity into three types has thus been replaced with a

single unified view of validity (Chapelle, 1999). As pointed out by Bachman (1990), this

unitary view of validity has been “endorsed by the measurement profession as a whole”

(p. 236).

A crucial aspect of this unitary validity concept is the centrality of construct

validation. The Standards (AERA, APA, & NCME, 1999) defines constructs as

knowledge, skills, abilities, processes, or characteristics to be assessed, and refers score

interpretation to the “construct or concepts the test is intended to measure” (p. 9).

Construct validity has always been considered central to test validation (Cronbach &

Meehl, 1955). In the unitary validity framework, construct validity is the “fundamental

and all-inclusive validity concept” (Anastasi & Urbina, 1997, p. 114).

Messick’s (1989) validity framework has had a profound impact on the

development and professionalization of the field of FL testing. FL testers have generally

embraced and adapted the unitary validity concept. Validity, as Bachman (1990) puts it,

is “the most important quality of test use” (p. 289).

In the context of FL testing, the centerpiece of a validation inquiry is to develop a

FL construct framework based on both theoretical and empirical grounds. The most

fundamental question is what constitutes the construct of FL proficiency. Messick (1989)

proposed to examine the structural component of a test to understand the degree to which

the nature and dimensionality of the inter-item structure reflect the nature and

dimensionality of the construct domain. Dimensionality of a construct domain refers to



the domain’s underlying factorial structure. As an approach to collect validity evidence,

dimensionality analysis was also proposed by Chapelle (1999) to examine the fit of

empirical test performance to a psychometric model based on a relevant construct theory.

Evidence based on a test’s internal structure was also suggested by Wolf et al. (2008) to

validate English as a FL tests. Results of investigating the internal structure of a test

could be used to interpret the nature of the FL construct: whether the construct is

unidimensional or multidimensional, and what the makeup of a multidimensional

construct is.
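Stated in standard confirmatory factor analysis notation (a textbook formulation added here for reference, not the dissertation's own), the internal-structure question takes the form

x = \Lambda \xi + \delta, \qquad \Sigma = \Lambda \Phi \Lambda' + \Theta_{\delta},

where x is the vector of observed section or task scores, \xi the latent ability factor(s), \Lambda the matrix of factor loadings, \Phi the covariance matrix of the factors, and \Theta_{\delta} the diagonal matrix of unique variances. Dimensionality analysis then asks how many factors (columns of \Lambda), and which pattern of loadings and factor correlations, are needed for the model-implied \Sigma to reproduce the observed covariances among the test scores adequately.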

The generality of this FL construct has also been a focus of investigation in the

field of language testing. Messick (1989) warned against taking the generalizability of a

construct meaning across various contexts for granted. He proposed that context effects

on score interpretation, especially those associated with different population groups, be systematically appraised.

Validity evidence based on a test’s generalizability was also proposed by Chapelle (1999)

to ensure legitimate test score interpretation and uses across groups of test-takers, time,

instruction conditions, and test task characteristics. The idea of a universally applicable

construct framework seems especially questionable in language testing, considering the

differences in the language to be measured (e.g., English, French, Chinese), and the

usually heterogeneous nature of the test-taking population. This line of research helps to

answer the question of whether the same construct structure holds across groups of test-

takers who differ in terms of the features salient to a specific testing situation.

The research interest in the nature of language ability has been shared by the

language testing community, whereas understanding the factors that affect language

acquisition has been a central focus of the field of language acquisition (Bachman &

Cohen, 1998). However, to ensure valid test score interpretation and uses, the acquisition

factors that reside outside of a test need to be investigated in relation to test performance.

In other words, validity evidence based on a test’s external structure is needed to ensure

validity.

To illustrate this point, we use language testing in a study-abroad situation as an

example. Let us assume that the relevant literature uniformly asserts that study abroad is

superior to the traditional classroom learning environment. A controlled experimental

study is designed to compare the language gains of study-abroad learners versus those of

classroom learners. Contrary to the common belief upheld by the literature, test results

show that the gains are equivalent. We are now confronted with the dilemma that the

test’s external relationship with a situation factor (study abroad vs. classroom learning)

does not conform to what has been demonstrated in the language acquisition literature.

This would bring us to doubt if the scores can be interpreted to have captured the real

gains across the groups, and if the test has been used appropriately for the purpose of this

study.

Examining a test’s external relationships with other tests and with non-test

behaviors has been proposed by both measurement professionals (AERA, APA, &

NCME, 1999; Messick, 1989) as well as language testing researchers (Chapelle, 1990;

Wolf et al., 2008). In the context of language testing, this line of research ensures that a

test has the appropriate external relationships that are compatible with both testing and

acquisition theories.

Purposes of the Study

This dissertation research investigates the nature of FL ability1, and this ability’s

relationships with test-takers’ study-abroad and learning experiences. The current view in

applied linguistics conceptualizes FL ability as communicative in nature (Canale, 1983;

Canale & Swain, 1980). Situated in the context of the TOEFL iBT® test2, this study

focuses on test-takers of English as a FL, and on the construct validation of

communicative language ability.

The purpose of this study is two-fold. The first has a linguistic focus: to provide

empirical evidence to enhance our understanding of the nature of communicative

language ability by examining the dimensionality of this construct in both its absolute

and relative senses, as measured by the TOEFL iBT Test. Dimensionality refers to the

underlying factorial structure of the construct. Dimensionality in the absolute sense

implies the invariance of the proposed factorial structure, irrespective of factors such as

the FL tested, the test chosen, the population studied, and the conditions under which

language acquisition takes place, to name just a few. Dimensionality in its relative sense

implies no such universally applicable factorial structure of FL proficiency. Instead, the

factorial structure varies depending on many factors, such as those mentioned

immediately above. It also implies that different aspects of this proficiency, if the

proficiency is indeed divisible, may develop at different rates and over different paths

under varying conditions.
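In the notation introduced above, and letting \Lambda^{(g)} denote the loading matrix for a particular group, test, or learning condition g, dimensionality in the absolute sense amounts to the cross-condition equality constraint

\Lambda^{(1)} = \Lambda^{(2)} = \dots = \Lambda^{(G)},

whereas the relative sense allows \Lambda^{(g)}, and possibly the factor covariances \Phi^{(g)}, to differ across conditions. This is offered only as a compact restatement; it is the kind of constraint examined empirically in the multi-group invariance analyses reported in Chapter Four.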

The second purpose, which has a social–cultural orientation, is to investigate the

possible educational, social, and cultural influences on English as a FL acquisition, and

the relationships between test performance and test-taker characteristics (TTCs). These

characteristics can include, among others, native language, language spoken at home,

years of immersion, years of schooling in the target language, and anticipated career.

Examining test performance in relation to test-takers’ study-abroad and learning

experiences is the focus of this research study. By investigating the individual and social–

cultural conditions of test-takers through a structural equation modeling (SEM) approach,

this study can shed light on how differences in learning contexts and experiences may

interact with the nature and path of language acquisition.
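For readers less familiar with SEM, the sense in which it is a hybrid of factor analysis and path analysis can be stated in conventional LISREL-style notation (generic notation, not taken from this study). A measurement model

x = \Lambda_x \xi + \delta, \qquad y = \Lambda_y \eta + \varepsilon

links observed indicators to exogenous (\xi) and endogenous (\eta) latent variables, and a structural model

\eta = B \eta + \Gamma \xi + \zeta

specifies the directed relationships among them. In a design like the present one, the study-abroad and learning-experience variables would enter on the exogenous side and the communicative-ability factors on the endogenous side, so that the coefficients in \Gamma carry the substantive questions about how learning context relates to ability.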

Contributions and Significance

The outcomes of this study will contribute to a better understanding of FL ability.

Although it has been generally agreed that FL proficiency is a complex construct with

multiple dimensions, it is still unclear to the research community what this proficiency

consists of, and how the constituent parts interact. Specifically, this research focuses on

examining the nature of communicative language proficiency based on the concept of

communicative competence (Canale, 1983; Canale & Swain, 1980). Since its initial

appearance, the concept of communicative competence has had a strong influence not

only on language teaching and learning, but also on test design and development. Based

on the communicative competence literature, the TOEFL® Committee of Examiners

(COE) developed a model that defines communicative language proficiency in academic

contexts (Chapelle, Grabe, & Berns, 1997). According to the COE model, the TOEFL

iBT® test measures academic communicative language proficiency. Based on TOEFL

iBT test performance, this study will make both theoretical and empirical contributions to

our understanding of communicative language proficiency in academic contexts, as

measured by the test. The results could also provide evidence based on the internal

structure of the test for developing a validity argument for test score interpretation and

uses.

Second, this study will contribute to expanding our understanding of the

relationships between TTCs and FL test performance. The field of FL testing has recently

witnessed a surge of interest, which has resulted in a growing number of empirical

studies on TTCs, such as gender (Wang, 2006), native language background (Shin,

2005), ethnicity and preferred language (Ginther & Stevens, 1998), and home language

environment (Bae & Bachman, 1998), to name just a few. Multiple studies have also

been conducted in TOEFL® testing to validate score interpretation and uses in light of

differences of test-takers on background variables, such as reasons for taking the test

(Swinton & Powers, 1980), native language background (Stricker et al., 2005), target

language exposure in home countries (Stricker & Rock, 2008), and years of classroom

instruction (Stricker & Rock, 2008). The general consensus reached by the language

testing research community is that the factor structure underlying FL test performance

can be interpreted more meaningfully if relevant TTCs are taken into consideration.

However, the amount of information we have obtained is far from what we need to fully

understand the complex and dynamic network of relationships. Moreover, there are still

TTCs that have not attracted enough attention from language testing researchers, and

therefore their relationships with test performance remain under-researched.

One such TTC is contact with the target language. Language contact is a concept

developed by study-abroad researchers. It specifies the nature and intensity of language

learners’ out-of-class contact with the target language in a study-abroad environment. FL

test-takers are non-native speakers of the target language who may have had the

experience of studying and/or living in the target language environment. Learning and

living experience in the target language environment has been investigated in relation to

test performance in only a few studies (e.g., Kunnan, 1994, 1995; Morgan & Mazzeo,

1988). This TTC is salient in most FL testing situations, but has not been examined

extensively.

This study will contribute to our knowledge of the relationships between TTCs

and FL test performance, especially the test performance of test-takers with different

language contact experiences. Because of the TOEFL iBT® test’s diverse test-taking

population, examining the internal structure of the test across relevant subgroups of test-

takers would provide evidence for the test’s generalizability in building a validity

argument for proposed score interpretation and uses.

From a pedagogical perspective, the study will inform us whether study abroad

has an effect on language learning and acquisition and, if so, in what ways. There has

been a rapid growth of research on study abroad contexts of learning, and these studies

have generated mixed results (e.g., Collentine & Freed, 2004). The results of this study

will provide empirical evidence on the impact of study abroad, if any. From a practical

point of view, the results of this study could advise potential FL test-takers on how to

prepare for a test, as well as how to improve their FL proficiency in general. Studying in

a foreign country normally requires a huge investment, both financial and emotional,

from test-takers and their families. It is my hope that the results of this study can shed

some light on how effective study abroad is in relation to FL test performance.

In general, the significance of this research to the fields of FL testing and FL

acquisition lies in two areas: (a) extending the language testing research agenda to include

variables external to a test so that we can have a better understanding of not only the

nature of FL construct but also this construct’s external relationships with relevant TTCs;

and (b) employing a SEM approach to tackle issues at the interface between language

acquisition and language testing.

Organization of the Dissertation

Chapter One describes the context of the problem, explains the purposes of the

study, indicates the contributions and significance, and presents the organization of the

dissertation.

Chapter Two examines theoretical models and empirical studies on FL

proficiency. This review includes proficiency models that have been subjected to

empirical investigation in a testing situation. Furthermore, only studies that have utilized

latent factor analysis and structural equation modeling are reviewed. The organization of

the literature review is laid out as follows.

A unitary competence model is discussed first. This is followed by an expansion

of the review to include various multidimensional competence models. The review shows

that the concept of a multidimensional FL proficiency has been well received based on

both theoretical and empirical grounds, although the research community has not yet

reached an agreement regarding the makeup of the construct. The next section reviews

studies that have modeled the effect of test methods, as opposed to constructs, in their

analyses of proficiency models. The review indicates that the test method facet ought to be

viewed as an integral part of the factor structure underlying FL test performance. The

last section is devoted to summarizing empirical studies that have explicitly modeled

TTCs in their analyses. Studies are divided into four groups. The first group focuses on

the relationship between proficiency level and the degree of factor differentiation. Studies

in the second group examine other TTCs, such as native language background and

proficiency, learning condition, and gender. Studies in the third group examined target

language contact as a TTC by using multi-group factor analysis. Last, validation studies

in the context of TOEFL® testing are reviewed and summarized. The results of the

literature review demonstrate that two consensuses have been reached by the research

community. First, FL proficiency is a complex construct with multiple dimensions.

Second, the factor structure of FL test performance can be interpreted more meaningfully

if relevant TTCs are taken into consideration. Directions for future research are suggested

at the end of the chapter. They are: (a) to test two competing hypotheses, a hierarchical

model versus a non-hierarchical model of FL proficiency; (b) to examine test

performance on integrated test tasks that require the use of multiple skills; and (c) to

study TTCs that have not attracted enough attention from previous researchers, especially

language contact with the target language community.

Chapter Three presents the design of the study. The background and development

of the TOEFL iBT® test as well as the format of the operational test are first introduced.

The reason for choosing this test and how the unique characteristics of this test can assist

us in accomplishing the general research purposes of this study are explained. Situated in

the context of TOEFL iBT testing, three research questions are proposed and six

hypotheses are formulated. A layout of materials and data is provided, and planned data

analysis is explained in a step-by-step fashion.

Chapter Four reports the results of the analyses. Taking both skill-oriented

abilities and language use context into consideration, the outcomes of establishing a

model for the entire sample are reported first, followed by results from multi-group

invariance analysis across groups of test-takers with different context-of-learning

experiences. Last, results from analyzing two unique structural equation models with

components of study-abroad and learning experiences are reported.

Chapter Five provides a review of the study and a summary of the primary

findings. Discussion focuses on three topics: the nature of communicative language

ability, group membership and language ability, and learning contexts and language

ability. Insights gained from this study on language test development and validation are

elaborated. Using a structural equation modeling approach to address issues at the

interface between language testing and acquisition is appraised. The chapter further

discusses the study’s contributions and limitations. Recommendations for future research

are also provided.



CHAPTER TWO

LITERATURE REVIEW

Introduction

In Alderson and Banerjee’s (2002) state-of-the-art review of language testing and

assessment, they stated that what is central to language testing is “an understanding of

what language is, and what it takes to learn and use language, which then becomes the

basis for establishing ways of assessing people’s abilities” (p. 80). Bachman (2000) also

emphasized the importance of the nature of language ability and language use. In his

review of modern language testing at the turn of the twenty-first century, Bachman

asserted that what has guided the development and refinement of new tests is “current

thinking in applied linguistics about the nature of language ability and language use” (p.

2). These positions were echoed by Wolf et al. (2008) in their statement that “[a] clear

definition of a construct to be measured is a fundamental first step in any validity

argument” (p. 20).

The essential message conveyed by these authors is that prior to conducting any

sort of language assessment, an understanding of the nature of foreign language (FL)

proficiency that is being assessed should be established. This point has been well

accepted by the profession, as a profusion of FL proficiency models of diverse kinds

has emerged. Some models are compatible with one another, whereas others are

competing and contradictory. Some models are more general than the others and, as

Chalhoub-Deville (1997) explained, a more general model is capable of representing a

construct in diverse contexts, whereas a more local model tends to depict the construct as

it applies in a particular context. Some models are more complex as they tend to include

as many construct elements as possible, whereas others are more parsimonious as they

include only the elements that pertain to a particular testing situation. Some models enjoy

strong empirical foundations, whereas others are based on scanty empirical support.

Among models that claim empirical support, a wide range of methodologies are used by

different researchers. The implementation of each methodology often entails the

employment of a unique set of statistical and non-statistical techniques.

To discuss every language proficiency model that has been developed by theorists

and practitioners is beyond the scope of this review of the literature. An exhaustive

review might also suffer from a lack of focus, and it would be difficult to draw any practical

conclusions. Therefore only models that have been subjected to empirical investigation in

a testing situation will be reviewed. Furthermore, as argued in the previous chapter,

validity evidence based on a test’s internal structure, external structure, and the test’s

generalizability is instrumental to our understanding of the nature of FL proficiency. A

latent trait approach is most appropriate for conducting such validity inquiries. This

chapter reviews studies that have utilized a latent approach in their investigations of the

nature of language proficiency.

A unitary competence model is discussed first. This is followed by an expansion

of the review to include various multidimensional competence models. The next section

reviews studies that have modeled the effect of test methods, as opposed to constructs, in

their examination of language proficiency models. The last group of studies branches out to

examine variables that reside outside a test, namely test-taker characteristics (TTCs), and

to interpret language proficiency models as informed by the relationships between test

performance and TTCs.



The chapter concludes that there is not a best model of FL proficiency, at least not

according to the results of this literature review, on which a test developer can base his or

her test. However, consensus in some areas has been reached by the profession.

Unitary Competence Hypothesis

Almost every article on the nature of language proficiency traces this line of

research back to Oller’s (1979) view of language proficiency as a unitary entity and the

empirical investigations of the factorial structure of language proficiency led by him and

his associates. Oller’s research addresses one of the fundamental questions that is

relevant to educators and researchers of language acquisition: What is the nature of the

proficiency that a language learner attains?

From the standpoint of a language test developer, Oller (1979) framed his

question as “whether or not language ability can be divided up into separately testable

components” (p. 423). A component that is testable will yield scores that are reliable,

which means that this component of language ability is a relatively stable trait. Being

testable also means that the component can be distinguished from other components on

both theoretical and empirical grounds.

Three hypotheses regarding the testability of language components were put

forward by Oller (1979). The divisibility hypothesis claims that tests of different

components, skills, aspects, or elements do not share common variance, and performance

on each test is accounted for by its unique variance. On the contrary, the indivisibility or

unitary competence hypothesis states that no test measuring a particular element has

unique variance, and that all tests share the same variance. Taking the middle ground, the

third hypothesis is the partial divisibility hypothesis. According to this hypothesis,



performance on a particular test is accounted for by both shared variance with other tests

and unique variance that pertains only to this test. This last hypothesis was also

considered to be the weak form of the unitary hypothesis.
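The three hypotheses are easiest to keep distinct in terms of the classical decomposition of observed score variance (a textbook formulation, added here for clarity):

\mathrm{Var}(X_i) = \text{common variance} + \text{specific variance} + \text{error variance}.

The divisibility hypothesis predicts that the portion shared across tests of different components is negligible; the unitary competence hypothesis predicts that the common portion exhausts essentially all of the reliable variance; and the partial divisibility hypothesis allows both a substantial common portion and reliable test-specific portions.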

The theory widely used by the proponents of the unitary hypothesis comes from

cognitive science. Based on Spearman’s (1904) thinking, a general factor of intelligence

dominates most of the variance in human performance. If there are good reasons for

assuming that language proficiency functions more or less like other human activities,

then this competence is unidimensional. Another strand of linguistic evidence comes

from Spolsky’s belief in two fundamental truths about language: Language is redundant

and it is creative. Spolsky (1968) argued that global proficiency tests, which assess

learners’ ability to utilize grammatical redundancies in response to novel verbal

sequences, are essentially measures of linguistic competence, rather than discrete skills,

such as writing, reading, speaking, and listening. Knowledge of a language, manifested as

knowledge of rules, seems much the same as underlying linguistic competence, which

operates in all kinds of performance. Built upon Spearman’s (1904) general factor of

intelligence and Spolsky’s (1968) linguistic competency captured in grammatical

redundancies and creativity, Oller (1974) proposed an expectancy grammar as a

convenient label for the psychologically internalized grammar that governs all kinds of

language behaviors and performance.

Based on a statistical method called principal component analysis (PCA), multiple

studies tested the unitary competence hypothesis (Oller, 1979; Oller & Hinofotis, 1980;

Scholz, Hendricks, Spurling, Johnson, & Vandenburg, 1980). In a PCA procedure, the

first principal component extracted accounts for the largest amount of variance among the

measures. In all these studies, the same factorial pattern emerged. The first principal

component accounted for a considerable amount of common variance, and therefore was

considered as a good predictor of the intercorrelations among the measures. Language

tests had high loadings on the first component, and the loadings on this component were

higher than the loadings on the subsequently extracted component(s). The residual

correlations among the measures, after the first principal component had been extracted,

were negligible. By obtaining such a factorial pattern, these researchers believed that the

confirmation for a general factor underlying language proficiency was found.
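The factorial pattern these studies report is easy to reproduce on simulated data. The sketch below is illustrative only: the simulated correlation structure and variable names are my own assumptions, not data from the studies cited.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 500  # hypothetical examinees

# Five correlated "subtest" scores: one shared ability source
# plus test-specific noise.
general = rng.normal(size=n)
subtests = np.column_stack(
    [0.8 * general + 0.6 * rng.normal(size=n) for _ in range(5)]
)

pca = PCA().fit(subtests)

# Proportion of total variance captured by each principal component.
# With data generated this way, the first component dominates -- the
# pattern that was read as evidence for a unitary general factor.
print(pca.explained_variance_ratio_.round(2))

# Loadings of each subtest on that first component are uniformly high.
print(pca.components_[0].round(2))
```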

Although a dominant global factor was repeatedly confirmed, the PCA method

used in these studies was criticized and considered inappropriate for conducting this line

of investigation (Carroll, 1983; Farhady, 1983; Vollmer & Sang, 1983). In terms of

extracting the initial factors, as Farhady (1983) argued, since values on the diagonal are

set at unity in PCA, all variances, including the common variance, specific variance and

error variance are used to define the underlying factors. Instead of using PCA, principal

factor analysis (PFA) can be used. In PFA, estimated communalities are assigned to the

diagonal cells, and specific and error variance components are not included. An iterative

method is used to refine estimates of the communalities to obtain the best possible

estimates of the communalities for various steps of factor extraction. PFA uses only the

common variance among the variables in the analysis while systematically discarding

uniqueness.

Both PCA and PFA extract the first factor in such a way as to account for the

maximum amount of variance in each and all of the variables. Variance from many

different common factor sources is being extracted because the factor vector is placed in

such a way that as many of the variables as possible have substantial projections on it.

The first factor can extract variance from several unrelated variables. It is highly inflated

and, therefore, the subsequent factors are correspondingly deflated. The first factor, by

the very nature of the extraction procedure, will account for the greatest amount of

variance, and it will appear to be more prominent than the rest.

Researchers (Carroll, 1983; Farhady, 1983; Vollmer & Sang, 1983) also

suggested that initial factor structures should be rotated to obtain psychologically

meaningful factor patterns since the initial unrotated factors may not give the best picture

of the factor structure and may lead to misinterpretations. By rotation, a simpler factor

structure might be arrived at, with each variable loading primarily on only one factor, and

each factor accounting for a maximum of the variances generated by the variables that

load on it.
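For contrast, the sketch below (again on simulated data of my own construction, not data from the studies reviewed) fits a common factor model, which operates on shared variance rather than total variance, and applies the kind of varimax rotation these critics recommended.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
n = 500

# Two correlated but distinct abilities, each measured by three subtests.
ability_a = rng.normal(size=n)
ability_b = 0.5 * ability_a + rng.normal(size=n)
subtests = np.column_stack(
    [0.8 * ability_a + 0.5 * rng.normal(size=n) for _ in range(3)]
    + [0.8 * ability_b + 0.5 * rng.normal(size=n) for _ in range(3)]
)

# Common factor analysis models only the shared variance; varimax
# rotation seeks a simple structure in which each subtest loads
# mainly on a single factor.
fa = FactorAnalysis(n_components=2, rotation="varimax").fit(subtests)

# Rotated loadings (factors x subtests): the first three subtests should
# load chiefly on one factor and the last three on the other.
print(fa.components_.round(2))

# Estimated unique variances, which PCA would instead fold into its
# components.
print(fa.noise_variance_.round(2))
```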

Multiple-factor structures did emerge from some of the findings when analysis

methods different from the PCA method were used. Oller and Hinofotis (1980) found a

two-factor and a three-factor structure by using factor rotation techniques. Scholz et al.

(1980) were able to describe the data with four independent dimensions by using PFA.

However, these researchers still uniformly concluded that the unitary factor solution was

the best theoretical explanation for their data. The major reason was that no clear pattern

appeared in which tests were grouped according to the posited skills or components. They

came to the agreement that there must be a general factor underlying performance on

many language processing tasks. As Oller (1983) stated, although this general factor may

be componentially complex, language users must in some manner possess a generative

system that functions in an integrated fashion in many communicative contexts.



In conclusion, a general language proficiency is manifested as an underlying

unitary competence interpreted along the lines of an expectancy grammar. As the

language testing research community started to realize the limitations of using PCA to

configure the latent structure of language proficiency, the notion of a unitary language

competence became less convincing. As researchers argued (Farhady, 1983; Vollmer &

Sang, 1983; Woods, 1983), the derived factors contained not only common variance that

was shared by different tests, but also test-specific variance and error variance. Since a

PCA approach is programmed to extract the greatest amount of variance from all sources,

the technique generally overestimates the significance of the first derived factor. The

unidimensional model claimed by Oller and his colleagues has been generally considered

to be inadmissible because it was based on empirical findings that used statistical

techniques incorrectly.

While the field continues to pursue this question regarding the nature of language

proficiency, methodologies such as factor analysis are gradually gaining popularity

among language researchers. Results from a growing number of studies using factor

analysis have begun to demonstrate evidence in support of a multidimensional view of

language proficiency. In the next section, the discussion on the nature of language

proficiency will be expanded to include various multidimensional competence models.

Multidimensional Competency Models

In contrast to a unitary competence view, a divisibility hypothesis (Oller, 1979)

postulates that there are independent multiple components underlying FL competence.

This multidimensional approach for describing FL ability is based on the assumption of

structuralism (Carroll, 1965; Lado, 1961). According to this school of thought, language

ability can be specified as a series of skills or knowledge components. Carroll (1965)

proposed that language performance can be identified by using a matrix in which

components are placed against skills. The language components are phonology or

orthography, morphology, syntax, and lexicon. The skill set includes auditory

comprehension, oral production, reading, and writing. Even though the idea of the

independence of the various components and skills has never been expressed in the form

of a hypothesis, the components and skills are considered to be logically different and

therefore independent of one another. This position implies the strong form of the

divisibility hypothesis, which posits that language ability is made up of a number of

different and uncorrelated dimensions. In the case of Carroll’s (1965) theoretical

framework, any combination of a skill and a component is independent of any of the

other possible combinations. These 16 different combinations are theoretically included

in performance specifications, against which 16 independent abilities can be tested. In

practice, however, even Carroll admitted that it seemed daunting and also unnecessary to

test all 16 separate competences. A review of the relevant literature shows that the strong

form of this hypothesis has not drawn much empirical evidence in its support.
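To make the size of this skills-by-components matrix concrete, the short sketch below simply enumerates its sixteen cells; the labels follow the lists given above, and the code itself is only illustrative.

```python
from itertools import product

components = ["phonology/orthography", "morphology", "syntax", "lexicon"]
skills = ["auditory comprehension", "oral production", "reading", "writing"]

# Under the strong form of the divisibility hypothesis, each (skill,
# component) pair is a separately testable, mutually independent ability.
for skill, component in product(skills, components):
    print(f"{skill} x {component}")

print(len(skills) * len(components), "cells in total")
```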

Carroll (1965) then proposed to distinguish an integrated approach from a discrete

structure-point approach to language testing. When an integrated approach is taken, the

total communicative effect of an utterance is emphasized instead of specific points of

structure or lexicon. Naturally the four skills of listening, reading, speaking, and writing,

which are regarded as integrated performance based on the candidate’s mastery of the

whole array of language components, receive focus in this integrated approach to testing.

He asserted that an ideal language proficiency test should make it possible to differentiate

levels of performance in those integrated skill dimensions of performance.

After reviewing the history of the structuralist approach to language teaching and

testing, Vollmer and Sang (1983) formulated one possible weak form, hypothesizing that

each of the four skills could be focused upon and that each one could be measured more

or less independently of the others. In this weak view of divisibility, executing skills

like listening, reading, speaking, and writing requires the integrated use of different levels

of knowledge in solving complex language tasks.

As manifested in most of today’s FL curricula and textbooks, this four-skills

approach has had an enormous impact on how FLs have been taught and tested. There is

abundant evidence of teaching the four language skills as more or less distinguishable

areas of performance as well as testing these skills as separate constructs. By adopting

this four-skills approach, it is assumed, consciously or unconsciously, that each of the

skills can be individually focused and measured.

Neither the unitary hypothesis nor the complete divisibility hypothesis can be

justified in light of results from empirical studies. Oller (1983) conceded that multiple

factors might underlie language proficiency. He was still in favor of a general factor, but

this general factor might be partitioned into components. He further proposed that

hierarchical models with one general factor and several first-order factors would work

better than simpler models.

From the structuralist tradition, Carroll (1983) also started to recognize the

possibility of the existence of a general language ability as the indicator of overall rate of

development in language learning. Nevertheless, Carroll still believed that different



language skills could be separately recognized and measured because skills have the

tendency to be developed and specialized to different degrees, or at different rates.

Results from early empirical studies convinced the field to embrace the

multidimensional concept of FL ability (Carroll, 1958; Gardner & Lambert, 1965; Hosley

& Meredith, 1979; Pimsleur, Stockwell, & Comrey, 1962). All of these studies confirmed

the multidimensionality of FL ability. They also implied that linguistic competence is

divisible instead of unitary. The trend shows that FL proficiency can be better explained

by a plurality of underlying factors than by a unitary general factor.

However, resulting factorial structures often vary vastly from one study to

another. There is no clear indication as to what the multidimensional construct should or

could look like. In results from a majority of these studies, a more general factor

appeared along with a set of more specific factors. However, questions remain regarding

how this general factor relates to the specific factors as well as what these specific factors

are and how they relate to one another. The inconsistency of the resulting factorial

structures from various studies does not allow for strong arguments in the development of

a general theory of FL proficiency.

With the flourishing of different multidimensional models but no reconciliation,

Vollmer and Sang (1983) offered several suggestions to tackle this issue. They proposed

including a wider variety of measures, especially productive measures. They suggested

the possibility of testing an alternative hierarchical model. Regarding the statistical

methods used, they recommended taking a more confirmatory approach in modeling

language ability. Their observations showed that factor analysis was used in an

exploratory way in most of the earlier studies; therefore, results were largely limited to

the labeling and ad hoc interpretation of the factors. Using confirmatory factor analysis

(CFA) instead of exploratory factor analysis (EFA) could provide researchers with more

opportunities for theory testing and theory building. As the field moved forward, in more

recent studies, these issues have been recognized and treated with care by researchers.

The following section gives a comprehensive review of empirical studies with a

focus on the factorial structure underlying FL proficiency through a confirmatory latent

trait approach. Four hypotheses are commonly tested in this body of research. Two of

them are derived from Oller’s (1979) partial divisibility hypothesis. The first of these is

called the correlated trait hypothesis, which hypothesizes that a number of factors

underlie FL ability, and that these factors are correlated with each other. Competing with

the correlated trait hypothesis is the higher-order hypothesis. According to this

hypothesis, what underlies the structure of FL ability is a number of uncorrelated specific

primary (or first-order) factors subsumed under a general (or higher-order) factor. This

general factor affects test performance only indirectly, through the primary factors. The

assumption behind this hypothesis is that the correlations among the primary factors can

be accounted for by the general factor. These two hypotheses are usually examined

against each other and against the other two hypotheses, the unitary hypothesis and the

divisibility hypothesis. The unitary hypothesis postulates that FL performance is

accounted for by only one general underlying factor. The divisibility hypothesis claims

that a number of factors account for the variance in FL performance, and that these

factors are independent of each other.
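Writing \Sigma = \Lambda \Phi \Lambda' + \Theta for the model-implied covariance matrix of the test scores (standard CFA notation, not wording drawn from the studies themselves), the four hypotheses differ only in the constraints placed on \Lambda and \Phi: (a) the unitary hypothesis corresponds to a single-factor model, \Sigma = \lambda \lambda' + \Theta; (b) the divisibility hypothesis to several factors with \Phi = I, that is, uncorrelated factors; (c) the correlated trait hypothesis to several factors with \Phi unrestricted; and (d) the higher-order hypothesis to \Phi = \gamma \gamma' + \Psi, so that a general factor reproduces the correlations among the first-order factors and reaches the observed scores only through them.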

A multi-component theoretical framework that consisted of grammatical

competence, pragmatic competence, and sociolinguistic competence was tested in



Bachman and Palmer (1982). Grammatical competence was specified to include

morphology and syntax. Pragmatic competence was defined as the mastery of

vocabulary, cohesion, and organization. Sociolinguistic competence referred to the

mastery of register, nativeness, and non-literal language. As noted by the authors,

phonology and graphology were not included in this framework. Each trait in this study

was measured by using four different methods: interview, writing, multiple-choice, and

self-rating. In the results, both the unitary hypothesis and the complete divisibility

hypothesis were rejected. The model with a higher-order general factor and two

uncorrelated primary trait factors provided the best fit. Grammatical competence and

pragmatic competence clustered together, and emerged as one trait that was distinct from

sociolinguistic competence. Except for the grammar measures, the rest of the measures

loaded heavily on the general factor. This study found strong support for a substantial

general factor. Although the nature of this higher-order general factor remained open, the

authors conceptualized the two first-order factors as the organizational aspects of

language and the affective aspects of language.

The framework investigated in Bachman and Palmer (1982) was based on the

concept of communicative competence. Since little empirical research was available

regarding the validity of the components in the communicative competence framework,

the authors hoped that their study could provide empirical support for some of the

components posited by the communicative competence framework.

Communicative competence was introduced by Canale and Swain (1980). Their

language framework comprised three aspects: grammatical competence, sociolinguistic competence, and strategic competence, with the latter two considered as prominent as grammatical competence. This framework was slightly modified by Canale (1983) with the addition of discourse competence to the original model. As

Canale (1983) pointed out, the tremendous interest communicative competence generated

indicated a departure from a grammar-centered to a communication-focused language

research approach. Furthermore, instead of viewing communicative competence as a

global unitary factor, the construct was conceptualized as a modular one, composed of several theoretically distinct factors. The popularity and acceptance of communicative competence moved the general discussion about the nature of language proficiency

farther away from the unitary proposition. However, as Canale (1983) acknowledged, the

framework was more a theoretical than an empirical model. Little empirical evidence has

been found to distinguish the four aspects of the competence, or to specify the manner in

which the four areas interact. Although the framework has been criticized for being based

mainly on theoretical but not empirical work (Cziko, 1984), it has also been credited for

advancing the field’s knowledge upon which future efforts of proficiency model

formulation and testing could be grounded (Bachman, 1990).

The four hypotheses were tested again in Bachman and Palmer (1983). They

conducted a construct validation study of the Foreign Service Institute (FSI) oral

interview test. Each posited trait, speaking or reading, was measured by three methods:

interview, translation, and self-rating. Based on the results from the six measures, they

found that the two partial divisibility hypotheses provided a better fit to their data. They

thus rejected the unitary trait hypothesis and the divisibility hypothesis. Although the

model of a general factor with two uncorrelated primary traits fit the data best

statistically, the authors decided to adopt the two-factor correlated trait model on grounds of parsimony: in their argument, the latter model had fewer free parameters to estimate and was therefore preferred. The results

demonstrated strong support for the distinctness of speaking and reading as separate but

correlated traits. The results also showed evidence in support of a multidimensional FL

proficiency.

Sang, Schmitz, Vollmer, Baumert, and Roeder (1986) claimed that neither the

unitary trait hypothesis nor the four-skills multidimensional model of FL competence

could be upheld in light of empirical data. Instead of using the four-skills framework,

they postulated a factor structure with three correlated dimensions. The first dimension,

basic elements of knowledge, was associated with measures of pronunciation, spelling,

and lexical knowledge. The second dimension, integration of basic knowledge elements,

related to measures of grammar and reading comprehension. The last dimension,

interactive use of language, was reflected in measures of listening comprehension and

interaction. The authors found that the posited correlated three-factor model fit the data;

however, the correlations among the latent factors were high. A rival one-factor model based on the unitary factor hypothesis was also tested and rejected on the basis of model fit comparison. No higher-order model was tested in this study, so the potential comparison between a higher-order model and a correlated model was not pursued.

Using both factor analysis and comparison of group means, Harley, Cummins,

Swain, and Allen (1990) hypothesized a three-trait structure that included grammar,

discourse, and sociolinguistic competence. Each trait was measured in three modes: an

oral productive mode, a written productive mode, and a multiple-choice written mode.

The analysis failed to confirm the hypothesized three-trait structure. Instead, a two-factor

solution with a general factor and a written method factor was found. Although factor

analysis did not show strong support for the hypothesized distinctions, the analysis of

mean comparisons did provide some evidence in support of a distinction among the

hypothesized traits. The pattern of score differences between native speakers and second

language learners was found not to be uniform across the three competence areas. Based

on this observation, the authors argued that since language skills could develop at

different rates, they could then be distinctly identified and measured separately.

Two competing hypotheses concerning FL ability in the areas of oral-aural,

structure-reading and discourse were investigated in Fouly, Bachman, and Cziko (1990):

a correlated-trait hypothesis and a higher-order hypothesis. Both models proved to fit the

data fairly well. The study concluded that both distinct factors and a general language

factor existed.

In a large-scale, cross-educational-system, cross-test-battery study, Bachman, Davidson, Ryan, and Choi (1995) investigated whether two test batteries were comparable in terms of underlying factorial structure. The test battery administered by Educational Testing Service (ETS) included the TOEFL® test, the SPEAK® test, and the TWE® test. The Cambridge-based test battery consisted of five measures of the First

Certificate in English (FCE) test.

A similar correlated two-factor solution was found within each test battery. In

both cases, measures of listening and speaking loaded predominantly on one factor.

Written tests, on the other hand, loaded primarily on the second factor. A cross-test

analysis using all 13 variables was performed to investigate whether the two tests indeed

measured similar abilities. A correlated four-factor solution was obtained with a speaking

factor, a listening factor, a written-mode factor associated with the ETS test battery, and a written-mode factor associated with the Cambridge test battery. Because all the factors were

highly correlated, the researchers transformed the correlated four-factor model to a

higher-order factor model. The resulting factor structure had a higher-order general factor

and four uncorrelated primary factors. It was found that the higher-order factor accounted

for a large proportion of the variance in each test battery. The authors claimed that the

results of this comparative study supported the position that FL proficiency consists of a

general factor and a small number of distinct first-order factors.

A more recent study (Shin, 2005) used the data from the Bachman et al. (1995)

comparability study, and found a correlated three-factor solution for the ETS test battery.

The factor structure with a listening factor, a speaking factor, and a written mode factor

closely resembled the findings from Bachman et al. (1995). The Shin (2005) study also

rejected the unitary trait model and the complete divisibility model. Between the higher-

order model and the correlated-trait model, the higher-order model was found to be

optimal in representing the data, and it was also considered to be more parsimonious.

Focusing on listening trait validity, Buck’s (1992) study was significant in this

line of research in two ways. First, the study successfully demonstrated the uniqueness of

the listening trait and its difference from the reading trait, in contrast to the previous

studies that had failed repeatedly to make this distinction (Carroll, 1983; Oller &

Hinofotis, 1980; Scholz et al., 1980). Second, the study illustrated the process of

designing measures that operationalized the listening trait properly. Buck pointed out a

number of ways in which listening and reading differ: mode of input, information

density, and different vocabulary, among others. He also suggested that contextual

information should be provided to replicate the processes used in real-world listening. By

controlling for these textual and contextual variables, a measure can be made to include

many of the characteristics of the normal listening situation, and become theoretically

valid in operationalizing the underlying listening trait. Each trait, listening and reading,

was measured by using four methods: open-ended questions, multiple-choice questions,

gap-filling tests, and translation. The results yielded a model with two closely correlated

traits: listening and reading. When addressing this close relationship between listening

and reading, Buck asserted that reading and listening share a common general language

comprehension process. He proposed distinguishing between two types of listening tests:

tests of general language comprehension and tests of the listening comprehension trait.

The results supported the position that language ability is divisible, even where two

highly related receptive skills are concerned.

In an investigation of the relationships of second language proficiency, FL

aptitude, and intelligence, 11 measures in Sasaki’s (1993) study yielded a correlated

three-factor solution: a composition ability, an ability to comprehend relatively short

context, and an ability to comprehend relatively long context. The author found that

the two partial divisibility models could not be distinguished statistically, meaning that

they fit the data equivalently. Model comparison between the partial divisibility models

had yielded similarly unfruitful results in the past (Bachman & Palmer, 1983; Fouly et al.,

1990). A higher-order model with three first-order factors and a correlated three-factor

model are statistically equivalent. They cannot be distinguished by resorting to model

comparisons, and therefore, the superiority of one model over the other cannot be

demonstrated. Sasaki urged future researchers to include more theoretically supported

first-order factors in order to settle this controversy over the best-fitting model for second

language proficiency. Nonetheless, the two partial divisibility models fit the data much

better than the unitary model or the uncorrelated model. The researcher concluded that

there were at least several distinct trait factors between the general factor and the

observed variables, and that these specific trait factors were not independent of each

other.
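One way to see the statistical equivalence noted above, using standardized first-order factors and notation of my own, is through the implied factor correlations: in a higher-order model with exactly three first-order factors, the correlation between any two of them is the product of their second-order loadings,

    $\phi_{12} = \gamma_1\gamma_2, \qquad \phi_{13} = \gamma_1\gamma_3, \qquad \phi_{23} = \gamma_2\gamma_3,$

so the three free correlations of the correlated-trait model map one-to-one onto the three higher-order loadings (for example, $\gamma_1 = \sqrt{\phi_{12}\phi_{13}/\phi_{23}}$ whenever the ratio is admissible). Both models therefore have the same number of free parameters and imply the same covariance matrix; only with four or more first-order factors does the higher-order structure become more restrictive and, in principle, empirically distinguishable from the correlated-trait structure.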

While the majority of the studies to date have focused on English as a FL, Bae

and Bachman’s (1998) study which examined Korean as a FL and as a heritage language,

was an exception that departed from the English norm. Learners of Korean-American

ancestry who were assumed to have exposure to Korean at home constituted the heritage

group. Learners of non-Korean ancestry who were assumed to have little or no out-of-

class contact with the Korean language were in the non-heritage group. By focusing on

the two receptive language skills, reading and listening, the study showed that a

correlated two-factor model described the data for both the heritage group and the non-

heritage learner group. A rival model with a single factor was also tested, and the fit of

this one-factor model was significantly worse than that of the two-factor model. The

authors concluded that the same underlying two-factor pattern applied for both groups.

In the studies discussed so far, relationships among the factors were specified as either absent or non-directional. This means that factors were hypothesized

either to be independent or to be correlated reciprocally. Directional relationships among

factors have not been reported extensively in the literature. Only a few studies examined

such possible relationships. Upshur and Homburg (1983) found that while grammar and

vocabulary were correlated, they both affected reading. In their study, the relationships

between grammar and reading and between vocabulary and reading were modeled not to

be reciprocal, but rather directional. This model implies that the development of grammar

and vocabulary abilities has a direct effect on the development of reading ability.

Directional relationships among the latent factors were also tested in Sang et al. (1986) in

a three-factor model concerning grammar ability, vocabulary ability, and reading ability.

The relationships were directional as the grammar and vocabulary ability were modeled

to have direct effects on the reading ability.

There are reasons for this lack of interest in testing directional relationships

among the factors. Factor analysis, having its roots in correlation and regression

techniques, lacks the explanatory power to confirm or disconfirm a causal (i.e.,

directional) relationship. To justify that one construct causes another, one needs to rely

on an experimental design that applies random sampling procedures and that controls for

extraneous factors. In most educational and FL studies, implementing a strict

experimental design can hardly be achieved, because there are often too many extraneous

factors to be controlled, and also because the construct under investigation is often poorly

defined. Considering the wide variety of specific language factors that have emerged in

the literature and the many possible relationships among them, it seems almost

impossible to single out one construct from the rest, and focus on its directional

relationship to the others.

The previous discussion should also draw our attention to the nature of the

higher-order general factor that has repeatedly emerged in most of the studies discussed

above. A higher-order model asserts that the general factor accounts for the correlational

relationship among the first-order factors. However, the nature of this general language

ability remained uncertain. Different researchers have proposed different theoretical

concepts that can be used to define this general language ability. Oller (1983) claimed

that it is pragmatic mapping that underlies the ability. Bachman and Palmer (1982)

asserted that this general ability has something to do with information processing in

extended discourse. More vaguely, Bachman et al. (1995) suggested that the

general factor can be interpreted as the common aspect of language proficiency shared by

subjects when they perform on a certain test.

As summarized by Fouly et al. (1990), the nature of this general factor is open to

speculation and future research. There may even exist more than one higher-order factor,

depending on the number of first-order traits and the correlational relationships among

them. Although the number and the nature of the measures used have been different from

study to study, the field seems to have come to the consensus that language ability

consists of multiple aspects, components, or competencies, but questions still remain

regarding what these specific components are and how they interact.

As demonstrated in this review, the number of measures used in a single study

ranged from 6 (Bachman & Palmer, 1983) to 13 (Bachman et al., 1995). The format of

these measures could be as discrete as multiple-choice items, or as integrated as an

interactive speaking task. Furthermore, the number and the nature of traits in the models

for hypothesis testing also varied vastly from one study to another. In some studies,

modeling was based on a previously established theoretical framework (Bachman &

Palmer, 1982). In other cases, a study was conducted for the purpose of test score

validation (Bachman et al., 1995). Therefore, what was included in the factor model

depended on the nature and purpose of the test but not on a particular theory. There were

also times when the researchers intended to examine only some specific traits of interest

and, therefore, included only these traits in their models (Buck, 1992; Shin, 2005).

In conclusion, the concept of a multidimensional FL proficiency has been well

received based on both theoretical and empirical grounds (Kunnan, 1998a), although the

research community has not yet reached an agreement regarding the nature of the

constituents, or on the manner in which they interact (Chalhoub-Deville, 1997). In a

recent publication on issues in assessing English language learners (Wolf et al., 2008),

the authors emphasized the field’s uncertainty as they declared that “there has been no

consensus across the tests in the definition of language proficiency” (p. 22).

The field needs to find a mechanism to reconcile findings from different studies in

order to move forward on the agenda of defining what FL proficiency means and how we

develop valid tests to measure it.

Method Effect

The major goal pursued in the studies reviewed is the identification of one or

more factors underlying FL proficiency. Given that a trait or a factor is latent and cannot

be measured directly, test measures are needed to elicit performance that will reflect the

target trait structure. We therefore take test performance on such measures as indicators

of the latent trait structure. In most cases, multiple indicators or measures are needed to

elicit performance on one latent factor.

Some of the models contained only language-related factors, be they knowledge-

based or skill-based. These models assume that no other factor contributes systematically

to the resulting test performance. However, this assumption usually cannot be



guaranteed. If a particular test method (e.g., cloze test) affects test performance in a

systematic way, this method-specific influence should be taken into consideration when

the overall factor structure of test performance is modeled. That is to say, if a special

ability can be developed to respond to a particular type of test method, this method-

related ability is what we need to distinguish from language-related ability.

To say that score interpretation on a FL measure is valid is to suggest that differences in how test-takers respond to the items reflect their different standings on the ability continuum rather than their differential mastery of that

particular test method. In other words, if another test of a different method designed to

measure the same trait yields the same pattern in test results, we can conclude that

convergent evidence has been found that the two different measures measure the same

trait, and that it is the trait, not the test method, that contributes to the resulting test

performance. In reality, method effect can hardly be ignored, and trait effect is always

intertwined with method effect. A review of the relevant literature suggests that there are

factors other than language proficiency that might also contribute to the underlying

structure of test performance (Bachman et al., 1995; Bachman & Palmer, 1982, 1983;

Harley et al., 1990; Turner, 1989).

The review of studies in the previous sections was approached with a focus on

configuring the trait, namely FL proficiency. In the current section, this focus has been

broadened as method factors are investigated along with the trait factors. Some of the

studies reviewed in the previous sections will be reexamined from this new perspective

along with some new studies. In these studies, test methods were investigated as potential latent factors responsible for test performance. The delicate relationship between

test method and trait in the context of FL testing is discussed at the end of the review.

Two approaches are commonly used to provide information regarding the

distinctness of trait and test method. The first is confirmatory factor analysis (CFA),

which has been widely used in FL proficiency dimensionality studies. Instead of focusing

exclusively on the possible trait factors, test method effects are also considered as

potential factors underlying test performance. The second approach is a combination of

CFA and the multitrait-multimethod (MTMM) matrix method. Both approaches provide

a way to investigate convergent and discriminant validity by distinguishing the effect of

measurement method from the effect of the trait being measured. Using either approach,

the studies cited below demonstrated the influence of test method on test performance.

The Factor Analysis Approach

In Bachman and Palmer’s (1983) validation study of the FSI oral proficiency

interview, three different methods were used to provide both convergent and divergent

evidence for the construct validity of the oral interview test. Besides the interview method,

translation and self-rating were also used to measure the two traits: speaking and reading.

Findings from the analyses showed that models with three correlated method factors

consistently provided a better fit than models positing no method factor or with fewer

than three methods. A model with two correlated trait factors and three correlated method

factors was shown to provide the best fit to the data. Among the three methods used,

measures involving the oral interview method loaded more heavily on trait than on

method, whereas measures using translation and self-rating loaded more heavily on

method than on trait. The authors claimed that the interview method demonstrated the

largest trait component and the smallest method component, showing both convergent

and discriminant validity evidence for the score interpretation and use.

In another study, Bachman and Palmer (1982) again employed multiple measures

to measure each of the three posited traits (grammatical competence, pragmatic

competence, and sociolinguistic competence) underlying communicative language

competence. The four methods used in this study were interview, writing sample,

multiple-choice test, and self-rating. As far as the method factors were concerned, models

with three correlated methods proved to fit better than models with four uncorrelated

methods. The three correlated method factors were interview, writing and multiple-choice

as one method factor, and self-rating. They also found that measures using interview and

self-rating loaded more heavily on method than on trait factors, whereas measures using

the writing and multiple-choice method consistently loaded more heavily on trait than on

method factors. This result contradicted the findings in their 1983 study, where measures

involving oral interview demonstrated more trait effect than method effect.

In both studies, besides language-related traits, test methods also accounted for

test performance. In the first study, three correlated method factors (oral interview,

translation, and self-rating) were built into the factor structure. In the second case, three

correlated methods (interview, writing and multiple-choice, and self-rating) were

incorporated as contributing factors. In both studies, models involving method factors

proved to fit the data better than models positing only language-related factors. These

results showed that when configuring latent factor structures underlying test results, test

methods together with language-related abilities should both be considered.



Test method factors have also been found in other studies investigating FL

proficiency (Harley et al., 1990; Song, 2008; Turner, 1989). Harley et al. (1990) used

three types of methods for measuring grammatical, discourse, and sociolinguistic

components. They categorized their measures by mode: oral productive mode, written

productive mode, and multiple-choice written mode. For example, a grammar oral

production task could consist of a structured interview scored for accuracy of verb

morphology, prepositions, and syntax. A discourse multiple-choice task could require

examinees to select coherent sentences at the paragraph level. A sociolinguistic written

production task could ask examinees to write a formal request letter and an informal note, scored for the ability to distinguish formal from informal register. In their final factor model, a

general factor and a written method factor emerged from the test results on the various

measures. Although the nature of the general factor remained unclear, the written method

factor could definitely be taken as an indication of the presence of method effect.

A difference between aural and written input modes was found in Song (2008).

This study examined the similarities and differences in factor structure underlying

listening and reading abilities. In the results, three skill-related factors together with the two input-mode factors explained the data best.

Factors other than language ability that contributed to performance on cloze tests

were found in Turner (1989). Eight cloze tests were given, varying in first language (L1)

or second language (L2), content domain (two topics), and contextual constraints (local

and global). The results helped answer the question of what knowledge was required for

successful performance on cloze tests, including, but not limited to, language ability.

Based on the results from factor analyses, the model that demonstrated the best fit was

composed of three uncorrelated factors: two distinct language factors (L1 and L2) and a

general factor, which was defined by Turner as cloze-taking ability (CTA). The author

claimed that the inclusion of the L1 factor made it possible to distinguish the L2 factor

from the CTA factor. Regarding the measures in L2, the findings showed that L2 cloze

performance was dependent on both the L2 language factor and the CTA factor. The

results also showed that this CTA factor had greater influence on cloze performance than

did the L2 language factor. The presence of the CTA factor in the final factor structure

demonstrated that method effect existed in cloze tests across linguistic boundaries.

Even if a factor structure contains only elements that can be interpreted as

language-related factors, alternative explanations of the nature of the factors cannot be

ruled out simply based on the results from statistical analysis. Substantive knowledge

should always be called upon to determine the nature of the factors, and how to label

them. It is likely that a seemingly language-related factor can also be interpreted as a test

method factor. Taking the results from Bachman et al.’s (1995) comparability study as

the example, the factors in the final solution across both test batteries included a speaking

factor, a structure, reading, and writing factor associated with the ETS test battery, a

structure, reading, and writing factor associated with the Cambridge test battery, and a

listening and interactive speaking factor. The second and third factors both included

measures in the written input mode. The distinctiveness of these two factors might have

lain either in the kind of language ability that each test battery tapped into or in the

kind of methods each battery used. In the latter case, this distinctiveness reflected method

effect across the two test batteries. The last factor, listening and interactive speaking,

included the TOEFL® test listening section, the FCE listening comprehension section,

and the FCE oral interview. At first it might seem strange that the FCE oral interview

would load on this factor with two listening measures instead of on the speaking factor

with the SPEAK® test from the ETS battery. However, after taking a closer look at the

nature of the measures, we would find that the FCE oral interview was interactive in

nature, and it demanded substantial listening skill from the test-takers, whereas the SPEAK test was non-interactive and monologic in nature and had no listening requirement. Therefore the last factor was

essentially the ability to perceive and process aural input, in other words, the listening

ability.

The discussion has now reached the point where it is hard to draw a clear-cut line

between a language-related factor and a non-language-related factor. Why do the TOEFL

structure, reading, and writing sections all load on a single factor? Is it true that grammar

ability, reading ability, and writing ability are essentially indistinguishable? Or is it

because all of the measures share the same written mode so that the real differences

among the language-related abilities are masked by the commonalities of the methods

used to measure them?

All of the measures used in the studies reviewed here can be categorized in many

different ways. Measures can be separated by input mode as either aural or written. They

can be divided by output mode as either oral or written. They can also be distinguished

by contextual situation as either context reduced or context embedded. There is no doubt

that a listening test must have aural input. However, whether the listening test items

should be responded to in context-reduced situations (e.g., multiple-choice items) or in

context-embedded situations (e.g., constructing short answers) depends on the purpose of



a particular test. When a short-answer item type is chosen, might listening

ability be confounded with writing ability, so that listening and writing emerge as a single

factor simply because the two measures share a common output mode? Does this mean

that multiple-choice listening questions are superior to short-answer listening questions

because the former measure the unique listening construct with no influence from

writing? But can we be sure that the abilities being tapped by constructed-response

listening tasks, such as processing, internalizing, and reconstructing aural input, are not

part of the listening construct? These questions remain to be answered by the field.

The CFA-MTMM Matrix Approach

The MTMM method has been used widely in construct validation studies to

investigate both convergent and discriminant validity, and to distinguish between method

effect and trait effect. The traditional MTMM method (Campbell & Fiske, 1959) provides

a straightforward mechanism to detect and distinguish the trait and method effects by

simply comparing the correlations in an MTMM matrix without having to conduct any

significance test. However, it does not specify any underlying factorial model to quantify

and account for the relationships between the measures and the traits. Due to this

inadequacy in dealing with multivariate data, model evaluation and comparison cannot be conducted using the traditional MTMM approach.
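As a schematic illustration of the Campbell and Fiske logic (the trait and method labels below are generic and are not taken from any study reviewed in this chapter), a design with two traits (T1, T2) measured by two methods (M1, M2) yields a correlation matrix whose blocks are inspected directly:

              M1T1     M1T2     M2T1     M2T2
    M1T1      (rel)
    M1T2      HT-MM    (rel)
    M2T1      MT-HM    HT-HM    (rel)
    M2T2      HT-HM    MT-HM    HT-MM    (rel)

Here (rel) marks the reliability diagonal (same trait, same method); MT-HM marks the monotrait-heteromethod, or validity, diagonal, whose correlations must be substantial for convergent evidence; HT-MM and HT-HM mark the heterotrait-monomethod and heterotrait-heteromethod triangles, which the validity-diagonal values must exceed for discriminant evidence.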

Widaman (1985) proposed a confirmatory factor analysis (CFA) approach to

MTMM matrix data. In this approach, model specification is performed at the outset of

the analysis. The best fitting model for the data is selected as the baseline model to which

a series of alternative models are compared by using chi-square difference tests. The

results of these model comparisons tell us the degree of trait convergence, trait

discrimination, and method effects in test measures.
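As a minimal sketch of the nested-model chi-square difference test that drives these comparisons (the model fitting itself is assumed to have been done elsewhere, and the fit values shown are hypothetical placeholders rather than results from any study discussed here), the computation is straightforward in Python:

    # Minimal sketch of a nested-model chi-square difference test; the
    # chi-square and df values are hypothetical placeholders.
    from scipy.stats import chi2

    def chi_square_difference(chisq_restricted, df_restricted, chisq_full, df_full):
        """Return the difference statistic, its df, and the p-value."""
        delta_chisq = chisq_restricted - chisq_full  # restricted model fits no better
        delta_df = df_restricted - df_full
        p_value = chi2.sf(delta_chisq, delta_df)     # upper-tail probability
        return delta_chisq, delta_df, p_value

    # Hypothetical comparison: method-only model (restricted) vs. trait-method model
    delta, ddf, p = chi_square_difference(312.4, 120, 187.9, 101)
    print(f"chi-square difference = {delta:.1f}, df = {ddf}, p = {p:.4f}")
    # A significant difference indicates that adding the trait factors improves
    # fit, which is taken as convergent evidence.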

A CFA approach to MTMM data was adopted in Llosa (2007) to investigate the

construct validity of a standardized test and a classroom assessment. The MTMM data

were based on the test results from Grades 2, 3, and 4 students on two English as a

Second Language (ESL) assessments: the California English Language Development

Test (CELDT) and the English Language Development (ELD) Classroom Assessment

measures. To be specific, the study aimed to examine the extent to which these two tests

measured common language ability constructs, the extent to which the constructs

measured by the tests were distinct, and the extent to which the differences in test design

and ratings contributed to the underlying structure of test performance.

A baseline model with three first-order trait factors (Listening/Speaking, Reading,

and Writing) and one higher-order general factor was first established. The two methods,

the CELDT and the ELD Classroom Assessment measures, were modeled onto the

baseline model as either correlated or uncorrelated. This established two versions of the

trait-method model. In Grade 4, the trait-method model with correlated method factors

was confirmed to provide the best fit while in Grades 2 and 3, the trait-method model

with uncorrelated method factors was selected as the best model. The results showed that

both trait and method influenced test performance in all grades.

To test convergence, the fit of a trait-method model was compared to the fit of a

method-only model (correlated method factors for Grade 4 and uncorrelated method

factors for Grades 2 and 3). Significant improvement in model fit of the former over the

latter indicated that method alone could not provide a satisfactory explanation of the data. This also confirmed the presence of the trait effects.

To find discriminant evidence, the fit of a trait-method model was compared to

the fit of a unidimensional trait-method model in which only one general trait was

specified. The significant chi-square difference result indicated that the traits, although

influenced by a higher-order general factor, were still distinguishable empirically.

Evidence of trait divergence was therefore obtained.

Method effects were investigated by comparing the fit of a trait-method model to

the fit of a trait-only model. Substantial method effects were detected based on the result

of significant model fit improvement provided by the former over the latter.

Trait convergence, trait divergence, and method effects were further examined in

light of individual parameter estimates. The results deviated from those of the overall

model comparisons to some extent. As far as convergence was concerned, the proportion

of method variance exceeded that of trait variance in the baseline model for Grade 4.

Trait discrimination was also challenged by the uniformly high factor loadings on the

higher-order factor in Grades 2 and 3. A significant method correlation in Grade 4 also

suggested a common method bias across the two measures.
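The trait and method variance proportions referred to here follow from the standard decomposition for a standardized indicator that loads on one trait factor and one method factor, assuming, as is usual in these models, that trait and method factors are uncorrelated with each other; the notation is mine rather than Llosa's:

    $\mathrm{Var}(x) = \lambda_T^{2} + \lambda_M^{2} + \theta = 1,$

so that $\lambda_T^{2}$ is the proportion of trait variance, $\lambda_M^{2}$ the proportion of method variance, and $\theta$ the uniqueness. Convergence is called into question whenever $\lambda_M^{2}$ exceeds $\lambda_T^{2}$, which is what was observed for Grade 4.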

This study demonstrated how a CFA approach could be applied to MTMM data to

detect and distinguish trait and method effects. Substantial method effects were found,

meaning that performance on the two tests reflected not only the traits measured but also

factors associated with the instruments used. It also showed that the trait effects were

significant in accounting for test performance, and that a trait-method model provided the

best explanation of the data.



In order to examine the effects of response formats in a web-based academic

listening test, a CFA approach to MTMM data was also employed in Shin (2008). The

three methods under investigation were summary task, incomplete outline, and open-

ended questions. The academic listening constructs specified in the study were the

abilities to extract main ideas, major ideas, and supporting details. A baseline trait-

method model with correlated traits and uncorrelated method factors was established at

the outset. The baseline model was then compared to a method-only model with no trait

factors to find convergent evidence. To search for discriminant evidence, the baseline

model was compared to a trait-method model with one general trait factor, and a trait-method model with uncorrelated trait factors. Both convergent and discriminant evidence was

obtained based on the results of model comparisons.

A higher-order model, which was equivalent to the baseline model in terms of fit

statistics, was chosen for examining the individual parameter estimates. The relative

effects of trait factors were mostly larger than those of the method factors. Although for

each trait factor some tasks were stronger indicators than others, the author concluded

that all three response formats were valid indicators of academic listening

comprehension.

The CFA approach to MTMM data is the most commonly used alternative to the

original Campbell and Fiske (1959) MTMM analysis (Sawaki, Stricker, & Oranje, 2008).

Multiple steps of model comparisons assist in finding evidence for trait convergence, trait

divergence, and method effects. The main goal of this kind of analysis is to evaluate the

effects of factors (e.g. test method factors) that are not specified as the constructs a test

intends to measure. Unlike the original MTMM analysis, this CFA approach to MTMM

data allows researchers to specify and evaluate underlying factorial models, and to find

the best model that takes into account both trait and method factors based on the

outcomes of a series of statistical difference tests.

Before we can arrive at any conclusion regarding how test method interferes with

trait structure in the context of FL testing, a fundamental question needs to be answered:

Should we view method and trait dichotomously, or should we perceive them as an

integrated unit? In traditional construct theory, from a cognitive-psychological

perspective, test method is viewed as a source of irrelevance that should be minimized in

testing. In the context of language testing, a particular method is often designed to measure a unique aspect of language ability. It is likely that the trait and the method define each other to the point that it is impossible to disentangle one from the other. In my view, FL

tests should be designed to reflect the dynamic and interactive relationships between trait

and method. The test method facet ought to be visualized as an integral part of the factor

structure underlying FL test performance. As Bachman (1990) put it, “the matching of

trait and method factors is one aspect that contributes to test authenticity” (p. 241).

This section has summarized studies investigating the effects of test method on test

performance. The next section will discuss the relationship between test-taker

characteristics and test performance.

Test-Taker Characteristics and Test Performance

Since Oller’s (1979) proposal for a unidimensional view of language proficiency,

the field has witnessed a proliferation of factor-analytic studies devoted to examining the

nature of FL proficiency. Researchers seem to have reached a consensus on at least two

issues. The first consensus is that the result of a FL test can be better interpreted if

multiple factors are considered to account for the performance. This is to say that FL

proficiency is multidimensional in nature rather than unitary. The second point of agreement is implied by the mixed findings regarding the nature and number of factors underlying the

construct. It has become clear that the make-up of FL proficiency is context dependent.

Results of any factorial analysis heavily depend on why and how the test is

constructed. Regarding test construction, test method facets (Bachman, 1990) have been

proposed as having a decisive influence on FL test performance. By using a particular

test to validate the theoretical construct of FL proficiency, we have to acknowledge the

many possible constraints of the test we use. Tests may vary in length, content coverage

and representation, and internal consistency, to name just a few. Methods that a test

utilizes can also differ in terms of input mode, expected response, the complicated

relationship between the two, and so on. As a result, the outcome of any analysis aimed at

demystifying the nature of FL proficiency can be interpreted meaningfully only with a

full understanding of a particular testing situation.

Another trend that has emerged in the literature is that researchers have started to

pay attention not only to the test being used but also to the characteristics of the test-

takers.

Results from early research cast doubt on the uniformity of factor structure across

all groups of test-takers. In an effort to validate the communicative competence model,

Harley et al. (1990) failed to confirm an a priori three-trait model, and instead found that

a general factor and a written method factor could explain their test performance data

well. Although this two-factor model was chosen as the final model, they suggested that

the relationship between the different components of the construct might be highly

dependent on different language learning experiences of the test-takers, and that it might

be differentially related to cognitive attributes of these individuals. They urged future

studies to probe how different aspects of FL proficiency develop and interact as a

function of learning experience. The importance of considering test-takers’ backgrounds

when interpreting test performance was also emphasized in Bachman et al.’s (1990)

study. These authors recommended that future researchers investigate the relationships

between test-taker characteristics (TTCs) and test performance.

TTCs are recognized as one of the four influences on test scores in the

communicative language ability (CLA) model (Bachman, 1990): communicative

language ability, test method facets, personal characteristics, and random measurement

error. Personal characteristics, or background characteristics, are specified in the model

on a par with CLA and test method facets. The personal characteristics proposed in the

CLA model include cultural background, background knowledge, cognitive abilities, sex,

and age.

The Bachman (1990) model acknowledges the influences of test method and

TTCs on test performance. It has a language ability component, a test method component,

and a component of TTCs, based on which researchers can posit and test relationships

about factors that influence test performance. Although critics (Chalhoub-Deville, 1997;

McNamara, 1990; Skehan, 1991) argued that it might be difficult to implement such a

comprehensive model in actual test development, the CLA model was nevertheless

portrayed as the state of the art by Alderson and Banerjee (2002) as they argued that the

model emphasizes the interaction between the language user, the discourse, and the

context.

Suggestions from these earlier researchers served as the springboard for later

studies concentrating on this test-taker factor. It is now commonly recognized that results of factor analyses should be interpreted in light of learner variability. In Kunnan's

(1998a) categorization of language assessment research themes within the Messick

(1989) framework, studies on TTCs provide an evidential source of justification of valid

test use. The TTCs mentioned in that summary were academic background, native

language and culture, field in/dependence, gender, ethnicity, and age. Kunnan (1998a)

argued that test-takers from certain backgrounds might perform differently based on

factors that are relevant to the abilities being tested. This message is echoed in Alderson

and Banerjee’s (2002) review of language testing and assessment, as they argued that the

most important challenge for language testers is to understand the characteristics of

different test-takers and how these characteristics interact with test-takers’ ability

as manifested in their test performance.

The following section is devoted to summarizing empirical studies that have

explicitly modeled TTCs in their analyses of FL proficiency. Studies are divided into four

groups. The first group of studies focused on the relationship between proficiency level

and the degree of factor differentiation. Studies in the second group examined other

TTCs, such as native language background and proficiency, learning condition, and

gender. Two studies investigating language ability in relation to target language contact

are included in the third group. And last, validation studies in the context of TOEFL®

testing are reviewed and summarized.



Proficiency Level

Regarding the relationship between learners' target language ability level and the nature

of their proficiency, two competing hypotheses have received much attention from the

field. One hypothesis states that the dimensions of language ability become more

differentiated as a function of increasing examinee proficiency. The competing

hypothesis postulates a negative relationship between the degree of factor differentiation

and increasing proficiency. In opposition to both hypotheses, the null hypothesis claims

that factor structure becomes neither more nor less differentiable but stays the same

across learner groups of different ability levels. Research so far has shown quite mixed

results (Kunnan, 1992; Römhild, 2008; Shin, 2005; Swinton & Powers, 1980).

Kunnan (1992) detected a positive relationship between proficiency level and the

degree of factor differentiation by using an ESL placement test developed at the

University of California. Based on test results, ESL students would be placed into one of four course levels for further English language instruction.

As part of the test validation process, Kunnan investigated the dimensionality of

the language ability of the test-takers, and examined whether the test measured the same

abilities in the same way across all ability levels. To achieve the first goal, an exploratory

factor analysis (EFA) was performed based on six subtest scores for the total group. A

two-factor solution was selected to be the most parsimonious and interpretable, with the

two listening subtest scores loading on one factor, and the subtest scores of reading and

grammar loading on the second factor.



To investigate if this factor structure would vary as a function of the proficiency

level of the test-takers, results of the EFA for each course group were compared.

Although a two-factor solution appeared to be the best fit for each course group, the

variables that loaded on the factors and the relationships of the factors were different

across the groups. The most salient difference was that only the group with the lowest

proficiency had an oblique solution, whereas the solutions for the other three course

groups were orthogonal. This is to say that for the students with the lowest proficiency,

the two factors were correlated in the EFA solution, whereas the factors could be

considered relatively independent for the other groups with higher proficiency. The

indication was that the students with lower ability had less differentiable skill abilities in

contrast to the students at higher levels of ability, whose skills appeared to be more

distinct. The author concluded that the test might not measure the same abilities across

groups of different ability levels. This finding posed a threat to the validity of score

interpretation, especially if a single composite score were reported for all test-takers, as it

was in this case. Due to the inconsistency in the factor structure across the course groups,

the author suggested that section scores be reported along with total scores when making placement decisions.

Römhild (2008) also found that factor structure changed across examinee groups

with different levels of language proficiency. However, instead of a positive relationship

like the one found in Kunnan (1992), Römhild observed that language proficiency was

negatively related to the degree of factor differentiation based on the results of the

Examination for the Certificate of Proficiency in English (ECPE). Decreasing degree of



factor differentiation was found as a function of examinees’ increasing proficiency across

two ability-level groups.

ECPE, developed at the English Language Institute at the University of Michigan,

is a test of advanced English language proficiency, reflecting skills and content typically

used in university or professional contexts. The multiple-choice section of ECPE, used

for this study, included the listening and reading subtests, and three structure subtests

(grammar, vocabulary, and a cloze test). The items were divided into two equivalent

halves by using the split-half technique. The assumption was that each half of the test

would be an adequate representation of the full-length test. To divide the examinees into

low- and high-proficiency groups, the study used an internal criterion measure based on

the test results from one ECPE test half. The results from the other half of the test were

used to conduct the multiple group factor analysis.
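The splitting step can be sketched as follows; the simulated responses, the odd/even item assignment, and the median cut below are my own illustrative assumptions, since the summary above does not specify Römhild's exact assignment rule:

    # Illustrative sketch only: simulated 0/1 item responses stand in for the
    # real ECPE data, and the odd/even split and median cut are assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    responses = rng.integers(0, 2, size=(500, 60))   # 500 examinees, 60 items

    criterion_half = responses[:, 0::2]              # one half: internal criterion
    analysis_half = responses[:, 1::2]               # other half: factor analysis

    criterion_score = criterion_half.sum(axis=1)
    cut = np.median(criterion_score)
    low_group = analysis_half[criterion_score < cut]     # low-proficiency sample
    high_group = analysis_half[criterion_score >= cut]   # high-proficiency sample
    # Multiple-group factor analysis would then be fitted to the two samples.
    print(low_group.shape, high_group.shape)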

A series of EFAs based on item-level data was performed separately on each

group to determine the number of latent factors. Among the three competing baseline

models, the five-factor model representing the subtest structure of ECPE provided the

best model fit for both groups.

Although identical baseline models were found for each group, the outcome of

analyses on factor loading invariance showed that only partial invariance could be held

across the groups. Items that exhibited factor loading noninvariance were removed, and

the baseline model was re-established. This new partial measurement invariance model

was then examined for structural invariance. Significant differences in factor

variances and covariances were observed, indicating that structural invariance could not

be held across the groups. Factor variances were smaller in the low-proficiency group,

meaning that the performance of this group on each of the factors was more uniform. In

other words, there was less variation in the outcome measures for the low-ability group.

Factor correlations also appeared to be smaller in this group, compared to the high-

proficiency group. The weaker correlations indicated that the five factors were more

separable in the low-proficiency group. This result indicated that the structure of test-

takers’ language ability became less differentiated as their overall language proficiency

increased.

The role of target language proficiency in determining the structure of language

ability was also examined in Shin (2005). Unlike the two studies reviewed above, this

study did not find sufficient evidence to suggest that factor structure differed across

ability groups. Participants in this study were from Bachman et al.’s (1995) study.

Students’ test results on the Cambridge test battery served as the external criterion to

define their proficiency levels. Their scores on the TOEFL® test and the SPEAK® test

from the ETS test battery were used to examine factorial invariance across the low-,

intermediate-, and advanced-level groups.

A higher-order factor model, with three first-order factors (listening, written

mode, and speaking), seemed to represent the test results of each ability group optimally,

and therefore was established as the baseline model. This baseline model was then

estimated simultaneously across all ability groups with equality constraints imposed in a

step-wise fashion to examine both measurement and structural equivalence. Partial

measurement invariance was established by removing constraints from three factor

loadings that differed significantly across the groups. The author then proceeded

to examine structural invariance by imposing equality on the loadings of the first-order



factors on the general language factor. Four of the six first-order factor loadings were

shown to be invariant. Based on these results, the author concluded that the structure of

the ETS tests stayed partially equivalent across the groups, implying that the degree of

factor differentiation neither increased nor decreased with the development of overall

language ability. This finding helped to make the argument that the ETS tests functioned

in the same way across different proficiency groups, and that the use of a single

composite score could be justified.
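The step-wise constraints described here follow the usual nested sequence for multiple-group CFA; stated compactly, with group superscripts and notation of my own, the sequence is: (a) a configural model, in which the same pattern of loadings is specified in every group but all parameters are free; (b) a measurement (metric) invariance model, in which the first-order loadings are constrained equal across groups, $\Lambda^{(1)} = \Lambda^{(2)} = \dots = \Lambda^{(G)}$; and (c) a structural invariance model, in which the loadings of the first-order factors on the general factor are additionally constrained equal, $\Gamma^{(1)} = \Gamma^{(2)} = \dots = \Gamma^{(G)}$. Each model is compared with the preceding one by a chi-square difference test, and partial invariance is obtained by releasing the constraints on the particular parameters found to differ.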

The decreasing trend in the relationship between language proficiency and the

degree of factor differentiation found in Römhild (2008) was in disagreement with the

finding of Kunnan (1992). Factor structure was found to be partially invariant on both

measurement and structural levels in Shin (2005), indicating that the degree of factor

differentiation did not vary as a function of language proficiency. One reason for this

discrepancy might have to do with the nature of the data used in each study. Kunnan

(1992) and Shin (2005) used section scores as the unit of data analysis, whereas in

Römhild (2008) analyses were based on item-level data. Another reason could simply be

that the researchers used different tests.

One issue that deserves our attention is how group membership was determined

differently in the studies. Kunnan (1992) did not use an independent criterion measure for

separating the groups. Instead, the same test used to determine group membership was

used to perform multiple group analyses. In contrast, both Shin (2005) and Römhild

(2008) employed an independent criterion measure (internal or external) to decide group

membership. The latter approach is preferred because any significant result would be less

likely to be attributable to chance if an independent criterion measure were used.



Other Background Characteristics

FL learners come into a language testing situation as complex human beings,

characterized not only by their prior target language achievement but also by their native

language background, gender, past and current learning conditions, and many other

characteristics. Test-takers’ identities and life experiences are also valuable information

for us to understand their current learning profiles, and how they have arrived at where

they are. The research community has gradually embraced the idea that treating test-takers as if their identities and life experiences were irrelevant gives us an over-simplified picture of their test performance. In the previous section, we discussed studies that

have examined prior target language achievement as one of the many TTCs, and its

interplay with the degree of factor differentiation. In the following section, studies that

have investigated other TTCs are reviewed.

Sang et al. (1986) criticized the unwarranted assumption of the equivalence of

factor structure regardless of the differences in the learners or the conditions under which

language acquisition took place. The study posited that differences in cognitive skills and

learning conditions could influence the structure of language proficiency.

Levels of cognitive skills were determined by learners' L1 proficiency. Subjects were

divided into two groups: one with high L1 (German) proficiency and the other with low

achievement in German. Learning conditions were divided according to two different

teaching strategies that had been implemented.

First, a three-dimensional factor structure (elementary, complex, and

communicative) was confirmed to be the best fitting model for the total sample. Then the

invariance of this structure was examined as a function of L1 proficiency and of the kind

of instruction the learners received. The study found that achievement in L1 could affect

the factor structure in several ways. Across the low- and high-L1 groups, differences in

factor loadings on complex skills and interactive use were found. Loadings on the factors

of more advanced language skills became more salient as the learners advanced in

mastering their native language. In other words, advanced FL skills were more likely to

appear as a distinguishable factor for learners with a high level of L1 ability. Another

finding was the differences in factor correlations. Factors were more closely correlated in

the group with a high level of competence in their L1. Based on this observation, the

authors suggested that a general L2 proficiency factor could emerge if the sample was

highly proficient in their mother tongue because learners with high-level L1 proficiency

tended to develop a more generalized ability structure by being more adept at transferring

cognitive ability from one aspect to another. In conclusion, the structure of L2

proficiency was found to be dependent on the learners’ degree of mastery of their L1.

During the next step, two types of teaching strategies were examined in relation to

test performance. A modern and a traditional teaching strategy were modeled as

predictors of the language components in the model. The traditional teaching style was

characterized by its heavy use of L1, and by favoring a bilingual approach to lexicon,

etymological explanations, traditional oral translation, and normative understanding of

grammar. In contrast, a modern approach emphasized using a direct method and

unstructured oral interactions. The results showed that the traditional teaching strategy

had a positive relationship with the elementary factor but no association with the complex

factor, and a negative association with the communicative factor. In contrast, the modern

teaching strategy was positively associated with all aspects of ability development,

especially with the communicative component.
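
The modeling step just described, in which an instructional condition is entered as a predictor of ability components, can be illustrated with a minimal sketch. The simulated data, the 0/1 coding of teaching strategy, and the use of ordinary least squares on observed factor scores are hypothetical simplifications; Sang et al. estimated these relationships as structural paths to latent factors rather than as regressions on observed scores.

import numpy as np

# Minimal sketch (simulated data): a binary teaching-strategy indicator used as a
# predictor of a "communicative" ability component, echoing the predictive role the
# two instructional conditions played in the model described above.
rng = np.random.default_rng(0)
n = 200
modern = rng.integers(0, 2, size=n)                  # 0 = traditional, 1 = modern strategy
communicative = 0.5 * modern + rng.normal(0, 1, n)   # hypothetical factor scores

X = np.column_stack([np.ones(n), modern])            # intercept + strategy indicator
coef, *_ = np.linalg.lstsq(X, communicative, rcond=None)
print("estimated modern-vs-traditional effect:", round(coef[1], 3))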

Gender, as another key TTC, was studied in Wang (2006). Demonstrating test

fairness across genders is an important aspect of building a validity argument. As an

attempt to provide evidence of fairness for two test batteries, the Examination for the

Certificate of Proficiency in English (ECPE) and the Michigan English Language

Assessment Battery (MELAB), the author investigated the dimensionality of both tests

and the equivalence of the factor structures across genders. Since both tests report total

scores, the author aimed to gather evidence to support the use of the total scores for both

male and female test-takers.

ECPE included a speaking section, a listening section, and a section measuring

grammar, cloze, vocabulary, and reading (GCVR). MELAB consisted of a composition, a

listening section, and a GCVR section. Analyses were performed on subtest scores.

Subjects were randomly split into two samples to form a calibration sample and a

validation sample. EFA without rotation was performed for each test separately on the

calibration sample. It was found that only one factor dominated the distribution of score

variance in both ECPE and MELAB. Factorial invariance under this one-factor model

was examined across gender on the validation sample for both tests. A model with

equality constraints on the model structure, latent factor variances, factor loadings, and

unique variances, provided the best fit to the data for both tests. This result demonstrated

that all parameter estimates in the one-factor model were invariant across genders for

ECPE and MELAB. The indication was that both tests functioned in the same way across

gender groups. Therefore it was reasonable to report a single total score for both groups,

and to compare scores across genders.
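
The exploratory step in this design, checking whether a single factor accounts for the covariation among the subtest scores, can be sketched as follows. The simulated scores and the use of scikit-learn’s FactorAnalysis are illustrative stand-ins for the actual data and EFA procedure reported in the study.

import numpy as np
from sklearn.decomposition import FactorAnalysis

# Sketch (simulated data): three subtest scores driven by one underlying ability,
# analyzed with a one-factor model in the spirit of the calibration-sample EFA.
rng = np.random.default_rng(1)
n = 500
ability = rng.normal(size=n)
scores = np.column_stack([
    0.8 * ability + rng.normal(0, 0.6, n),   # e.g., speaking
    0.7 * ability + rng.normal(0, 0.7, n),   # e.g., listening
    0.9 * ability + rng.normal(0, 0.4, n),   # e.g., GCVR
])

fa = FactorAnalysis(n_components=1).fit(scores)
print("loadings on the single factor:", fa.components_.round(2))
print("unique variances:", fa.noise_variance_.round(2))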

As reviewed in the previous section, Shin (2005) did not find enough evidence to

support that the degree of factor differentiation varied as a function of proficiency level.

Partial measurement and structural invariance held across all three ability groups.

However, noninvariance was still detected in individual parameter estimates, especially

in the factor loadings of pronunciation and fluency on the speaking factor. To help locate

the source of noninvariance, a method called multiple indicators multiple causes

(MIMIC) modeling was employed. In MIMIC modeling, language background was

specified as a covariate to determine how the language background of the test-takers was

related to the detected noninvariance.

Two language groups were formed: the Asian language group and the European

language group. Direct effects on the pronunciation and fluency measures from the

covariate were added in the re-specified model. This resulted in significant model

improvement. The appearance of the two direct effects suggested that there was a scale

bias that would produce different performance ratings due to group membership. The

author concluded that based on the results of the MIMIC modeling, the two speaking

rating scales, pronunciation and fluency, functioned differentially across the two major

language groups.

Group membership, as defined by language background in this case or as

determined by other criteria in other studies, is critical in helping us understand whether

test results reflect information other than what the test intends to reveal, such as native

language background. Efforts should be made to investigate the nature of any such

confounding information. Otherwise, the validity of score interpretation and use will be

put into jeopardy. This study demonstrated how MIMIC modeling could be a powerful

tool for addressing such test fairness issues.
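
To make the MIMIC logic concrete, a minimal sketch is given below in lavaan-style model syntax fitted with the semopy package. The variable names, the simulated data, and the reliance on semopy’s Model, fit, and inspect interface are assumptions made for illustration; this is not a reconstruction of Shin’s analysis.

import numpy as np
import pandas as pd
from semopy import Model  # assumed available; any SEM package with lavaan-style syntax would do

# Simulated data: a speaking factor measured by four ratings, a background dummy that
# predicts the factor, and additional direct effects on two of the indicators (the kind
# of scale bias the MIMIC re-specification described above is meant to detect).
rng = np.random.default_rng(2)
n = 400
asian = rng.integers(0, 2, n)
speaking = 0.3 * asian + rng.normal(0, 1, n)
data = pd.DataFrame({
    "asian_background": asian,
    "pronunciation": 0.8 * speaking - 0.2 * asian + rng.normal(0, 0.5, n),
    "fluency":       0.7 * speaking - 0.2 * asian + rng.normal(0, 0.5, n),
    "vocabulary":    0.8 * speaking + rng.normal(0, 0.5, n),
    "grammar":       0.9 * speaking + rng.normal(0, 0.5, n),
})

desc = """
speaking =~ pronunciation + fluency + vocabulary + grammar
speaking ~ asian_background
pronunciation ~ asian_background
fluency ~ asian_background
"""
model = Model(desc)
model.fit(data)
print(model.inspect())  # the two covariate-to-indicator paths flag where noninvariance is located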

The studies reviewed in this section specified one or multiple TTCs in their

models based on FL test performance. The findings generally support the assertion that

FL test performance can be interpreted more meaningfully in light of test-taker

variability. The impact of the various TTCs can be observed in two ways. By means of

multi-group invariance analysis, factor structure equivalence can be examined across

groups defined by learner characteristics. Functioning as covariates, TTCs can also be

modeled as independent variables to have direct effects on test performance. Results from

studies using either approach have shown that a universal view of FL proficiency,

regardless of TTCs, is not warranted. Studies in this section have also demonstrated the

usefulness of multi-group invariance analysis and SEM as tools for interpreting test

performance in relation to relevant TTCs.

Target Language Contact

There were a few studies that examined test performance in relation to target

language contact experience as a TTC. Two kinds of language contact were of particular

interest to the researchers: language contact with the target language community and

language contact at home. The former usually occurs in a study-abroad context, whereas

the latter is often situated in a heritage language environment.

Language contact with the target language as a TTC was first examined in two

Advanced Placement (AP) examinations. Each year, college-bound high school students

who want to earn college credit and advanced placement in particular areas of study take

the College Board’s AP courses and exams. AP examinations cover a wide range of

academic subjects, including foreign languages. AP FL exams generally measure FL

skills that students might acquire after four to six semesters of college FL courses. Four

FL exams are traditionally available for students to take: French, German, Latin, and

Spanish. In 2007, three new languages were added to AP’s list of FL examinations:

Chinese, Japanese, and Italian.

The demographics of the AP examinees can vary from exam to exam and at

different time periods. In the case of the AP French Language Examination, it was

indicated in Morgan and Mazzeo (1988) that the majority of the examinees taking this

exam learned French primarily through secondary school academic coursework, but that

some of the examinees had a significant amount of out-of-school French language

experience, either by studying in a French-speaking country or by exposure at home.

Whether the test is an appropriate measure for the latter two groups, and whether the test

functions in the same way across all examinees is a validity question that needs to be

addressed before test scores can be interpreted and used appropriately.

Morgan and Mazzeo (1988) identified four groups of examinees based on their

language-learning experiences. The first group, the standard group, had little or no out-

of-school French language experience. This group, which constituted the majority of the

test-takers, was divided into two samples for cross-validation purposes. The second group

(special group I) included the examinees who had spent at least one month in a French-

speaking country. The third group (special group II) regularly spoke or listened to French

at home, and therefore fit the definition of heritage learners. The fourth group included in

the study was college students who had no out-of-school experience.



Since the exam was intended to measure the listening, reading, writing, and

speaking skills (College Board, 1987), the primary goal of this study was to verify the

skills measured by the exam, and to determine whether the same dimensional structure

could be applied to the four populations. A correlated four-factor model (listening and

writing, language structure, reading, and speaking) provided the best overall fit to the

data. In this model the listening scores loaded with the writing scores on a common

factor. Discrepancy was found between the make-up of this factor model and the test’s

subsection structure, in which listening and writing were separate sections, and both

language structure and reading comprehension belonged to the reading section. The

reason given by the researchers for this discrepancy was that the listening items measured

a dimension similar to the one measured by the writing tasks. The researchers criticized

the equal weighting scheme, arguing that it actually overweighted students’ ability to

produce grammatically correct prose, as it was measured in both listening and writing

sections of the test, and it downplayed the ability to read and interpret French passages,

as some of the items in the Reading section seemed to assess a separate grammar-oriented

factor of language structure.

This correlated four-factor model was examined for invariance between the

standard group and each of the other three groups. Between the standard group and

special group I, only invariance in factor loadings could be held. No invariance was

found between the standard group and special group II. Invariance in both factor loadings

and unique variances was found between the standard group and the college group. The

factor structure based on the standard group was most similar to the one of the college

group. What was common to the standard group and the college group was that both

groups lacked significant out-of-school French language experience. The indication was

that out-of-school FL experience did have an influence on the structure of test

performance.

Similar to the AP French Examination, the AP Spanish program proposes to

assess high school students’ Spanish language proficiency in order to grant college credit

or advanced placement in college Spanish language courses (College Board, 1989).

The internal construct validity of the AP Spanish Language Examination was

evaluated by Ginther and Stevens (1998). One purpose of the study was to investigate

whether the dimensionality underlying test performance was compatible with the four-

section design of the test, intending to measure language abilities in listening, reading,

writing, and speaking. The second goal was to study whether the same factor structure

would hold constant across relevant groups within the entire test-taking population.

Group membership was determined by ethnicity and preferred language. The

reference group consisted of Latin Spanish-speakers who identified their ethnic

background as Latin. The other four groups were Mexican Spanish-speakers, Mexican

bilingual-speakers (who indicated equal preference for Spanish and English), White

English-speakers, and Black English-speakers.

A correlated four-factor model (listening, reading, writing, and speaking) was

specified as the a priori model. This model was found to provide a good fit for the

reference group based on item parcel scores. Group comparisons were conducted by

imposing multiple levels of equality constraints between the reference group and each of

the four comparison groups.



When comparing the reference group to the Mexican Spanish-speaking group and

the Mexican bilingual-speaking group, invariance in the number of the factors and factor

covariances did hold across the groups. However, only invariance in the number of the

factors was found when comparing the reference group to the two English-speaking

groups. This finding suggested that the more a group’s background in ethnicity and

preferred language deviated from the reference group, the less equivalent the factor

structures seemed to be.

It was also found that the four factors were less closely related for the Spanish-

speaking groups than for the English-speaking examinees. Descriptive statistics also

showed that the Spanish-speakers had higher means on all scores than the English-

speakers. Unfortunately the study did not include a mean structure in group comparisons.

The researchers suggested classroom instructional environment as one potential cause for

the high factor correlations of the low-level groups. The rationale given was that if a

group of students learned and experienced a FL mostly in a formal instructional setting,

the abilities they developed would be highly constrained by the type of instruction they

received. If an equal amount of attention was given to each aspect of language

development (listening, reading, writing, and speaking), it would be likely that the growth

of the four abilities would appear to be relatively uniform.

Differences were also found in model parameter estimates across groups. The

speaking factor was found to be less correlated with the other factors for the two Spanish-

speaking groups than for the two English-speaking groups. This indicated that speaking

ability was both quantitatively and qualitatively different between the Spanish-speakers and the English-speakers. Not only did the Spanish-speakers perform better on

the speaking tasks, but also their speaking performance was less related to their

performance on the other parts of the test. The researchers suggested that having out-of-

class experience with the target language might have had a fundamental influence on this

difference in factor structure. The researchers also criticized the fairness of the test. They

argued that some examinees were evaluated not only on what they learned in the

classroom but also on what they brought to the classroom.

Heritage language environment, another possibly relevant characteristic for many

language learners, was studied by Bae and Bachman (1998). The uniqueness of this study

was that the nature of language proficiency was investigated in the Korean language

instead of in English. Two groups of learners were included: the heritage learners and the

non-heritage learners. In terms of home language environment, the Korean heritage

learners were likely to be immersed in a Korean-speaking environment, whereas the non-

heritage learners would mainly communicate in English. The study focused only on two

comprehension skills, listening and reading. A correlated two-factor model was selected

as the baseline model as it described the data for both groups well. This baseline model

was tested for the equivalence of parameters of interest across the groups.

The results from examining measurement invariance showed that the factor

loadings from one listening task on the listening factor were different across the two

groups. A task analysis of this listening task suggested that this task might have had

tapped into relatively more integrated higher-level listening skills due to its complex

input. The factor loading for this task was greater for the Korean heritage group than for

the non-heritage group, suggesting that this task measured listening differentially across

the two groups, and that this task was a better indicator of the listening ability for the

former group with higher overall listening ability than for the latter group with lower

listening competence. The results from examining structural invariance indicated that the

factors were correlated in similar ways across the two groups. However, compared to the

non-heritage learners, the Korean heritage learners performed more uniformly on the

listening tasks. In contrast, the non-heritage learners demonstrated less variance in their

reading scores than the heritage learners. Although the same two-factor pattern was

accepted for both groups in terms of the overall model fit, significant differences in

individual parameters across groups pointed out that some parts of the test functioned

differently for learners in different groups.
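
The claim that a larger loading makes a task a “better indicator” for one group can be made concrete in terms of the proportion of observed-score variance attributable to the latent factor. The parameter values below are hypothetical, chosen only to illustrate the direction of the difference reported for the two groups.

# Sketch (hypothetical parameter values): proportion of task-score variance
# explained by the listening factor, under a larger versus a smaller loading.
def indicator_reliability(loading, factor_variance, unique_variance):
    explained = loading ** 2 * factor_variance
    return explained / (explained + unique_variance)

print("heritage group:    ", round(indicator_reliability(0.9, 1.0, 0.4), 2))
print("non-heritage group:", round(indicator_reliability(0.5, 1.0, 0.4), 2))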

This study demonstrated that performance from test-takers of different heritage

language backgrounds might display different factor structures. Furthermore, not only might the configuration of their factor structures differ, but their ability to respond to certain

tasks is likely to vary as well. The authors made an urgent call for future researchers to

include a mean structure in multi-group factor analysis so that groups’ factor means could

be compared.

Kunnan’s (1994, 1995) study was one of the few studies that employed a multi-

group structural equation modeling (SEM) approach to examine both construct

representation and its external structural relationships. The study investigated the

influence of social milieu and previous exposure/instruction on test performance. The

multi-group SEM analyses were conducted based on performance data on the ETS and

Cambridge test batteries in Bachman et al. (1991).

The overall model had a measurement component that modeled the relationships

between test measures and latent factors, as well as the relationships among the factors. It

also had a structure component to account for test performance’s external relationships

with multiple TTCs. These TTCs included: home-country formal instruction (HCF),

home-country informal exposure (HCI), English-speaking country instruction or

exposure (ESC), and self-monitoring by test-takers of their own language production

(MON). Both the measurement and structural components were tested for model fit

across two groups of test takers: non-Indo-European native language group (Thai, Arabic,

Japanese, and Chinese), and Indo-European native language group (Spanish, Portuguese,

French, and German). The research question was how the four TTCs mentioned above

influenced FL test performance across the two major language groups.

In configuring the measurement model, a correlated four-factor model was

attempted for both language groups. The four factors were: reading and writing assessed

by the ETS test battery, reading and writing assessed by the Cambridge test battery,

listening-noninteractional, and listening-interactional. Two competing hypotheses were

postulated when modeling the relationships among the TTCs and test performance. In the

first model, the four TTCs were specified to have equal influences on the four factors. In

the second model, only MON had direct influence on test performance, whereas the other

three were allowed to cast influences on test performance through MON, the intervening

factor.
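
The two competing specifications can be written out schematically in lavaan-style syntax. The factor labels below abbreviate the four ability factors named above, the TTC labels follow the acronyms introduced in the study, and the strings sketch the structural part only; they are not Kunnan’s actual model code.

# Model 1: every TTC has a direct path to every ability factor.
direct_effects = """
ets_reading_writing ~ HCF + HCI + ESC + MON
cam_reading_writing ~ HCF + HCI + ESC + MON
listening_noninteractional ~ HCF + HCI + ESC + MON
listening_interactional ~ HCF + HCI + ESC + MON
"""

# Model 2: only self-monitoring (MON) affects performance directly; the other
# TTCs work through MON as an intervening variable.
mediated = """
MON ~ HCF + HCI + ESC
ets_reading_writing ~ MON
cam_reading_writing ~ MON
listening_noninteractional ~ MON
listening_interactional ~ MON
"""

print(direct_effects, mediated)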

The study found that neither model produced a clear overall statistical fit for both

groups, although some direct effects of TTCs on the test performance were substantive

and interpretable. Home-country formal instruction was found to have a positive influence on reading and writing, but a negative influence on listening and speaking. Home-country

informal exposure was found to have positive impact on listening and speaking, but

negative impact on reading and writing. It was also discovered that the experience of

English-speaking country instruction or exposure had unquestionable positive influence

on listening and speaking.

ETS-TOEFL® Studies

The TOEFL® test is developed and administered by Educational Testing Service

(ETS), and is intended to measure English skills to use in an academic setting. Since the

test is administered worldwide, its test-taking population is highly heterogeneous in terms

of native languages, educational and vocational experiences, exposure to the English

language, level of degree sought, and other factors.

As discussed in the previous sections, the internal structure of a test can be highly

dependent on the characteristics of the group taking the test. Any effort to build a validity

argument for test score interpretation and use should take relevant test-taker variables

into consideration. Because of the TOEFL test’s extremely diverse test-taking population,

examining the relationships between TTCs and test performance has been a strong focus

in TOEFL validation studies. With every new generation of the TOEFL test, there has

been a new wave of validation studies concentrating on the factor structure and its

interplay with the ever-changing test-taking population (Hale et al., 1988; Hale et al.,

1989; Stricker et al., 2005; Stricker et al., 2008; Swinton & Powers, 1980). The following

section reviews three of these studies.

Swinton and Powers (1980) provided validity evidence of TOEFL® test score

use by determining the construct the test intended to measure and the relationships among

the factors in light of multiple TTCs. As many as seven native language groups were

specified in the study: African, Arabic, Chinese (Non-Taiwanese), Farsi, Germanic,



Japanese, and Spanish. Groups differed in their overall level of proficiency based on their

test performance with the Germanic group having the highest and the Farsi group having

the lowest overall language ability. Variables of age and reason for taking the TOEFL

test were also analyzed for their relationships with test performance. Analyses were

conducted based on item-level data.

Regarding the influence of overall proficiency, it was found that the language

ability of the group with the highest proficiency, the Germanic group in this case, showed

the greatest distinctions in its factor structure. As many as eight factors emerged for the

Germanic group during the initial factor extraction. As the analyses proceeded, it was

also discovered that only two factors could be confirmed for the Farsi group, the group

with the lowest overall proficiency. These results demonstrated that the amount of factor

differentiation did vary as a function of overall ability level. One reason for finding a

larger set of differentiated abilities in the high-level group, as the author suggested, might

be that the responses from high-level test-takers were less likely to be contaminated with

random factors, such as guessing, which made it possible for some minor dimensions to

emerge.

Except for the Farsi and Germanic groups, a three-factor model was established

for each of the other language groups. A listening factor corresponding to Section I

(Listening Comprehension) of the test was found for all groups, meaning that the

response pattern on the listening items could be explained by one common factor for all

language groups. However, there were differences in the make-up of the factor structures

in the other parts of the test across two major language groups, the Indo-European group

(Spanish and Germanic) and the non-Indo-European group (African, Arabic, Japanese,

and Chinese). For the Indo-European group, the majority of the items in Section II

(Structure and Written Expression) loaded on one common factor, and most of the items

in Section III (Vocabulary and Reading Comprehension) loaded on another common

factor. The two factors corresponded well to Section II and Section III of the test. As for

the non-Indo-European group, most items from Reading Comprehension were found to

load with items from Structure and Written Expression on a common factor, whereas the

items from Vocabulary loaded on a separate factor. For this group, the items in Section

III did not seem to measure a common construct. Instead, the reading items tended to

measure a factor that was similar to the one measured by the items in Section II.

According to the authors, this finding had tremendous implications for score

reporting and interpretation. They suggested reporting subtest scores on Vocabulary and

Reading Comprehension separately rather than a combined Section III score, especially

for the non-Indo-European group. Their finding indicated that the interpretation of

Section III scores could be very different for the two major language groups. For the

Indo-European speakers, these scores indicated their relative status on a unidimensional

ability, and the interpretation was relatively straightforward. For the non-Indo-European

speakers, two factors seemed to account for performance in Section III. To make score

interpretation even more complicated, one dimension also overlapped with a factor

measured by another section of the test. Reporting one single Section III score with the

assumption that all items in the section measured the same ability would be misleading.

The usefulness and necessity of reporting a separate vocabulary score were highlighted

by the study for improving TOEFL® score reporting.



Influences of other test-taker variables were also detected via a factor extension

analysis, in which these variables were regressed on the test performance. It was found

that both age and graduate degree-oriented motive were positively related to a factor

associated with the vocabulary ability for the non-Indo-European group and with both

vocabulary and reading abilities for the Indo-European group.

As language testing at ETS evolved into the 21st century, attempts were made to

revise the TOEFL® test so that the most recent theoretical developments in the field of

second language acquisition and FL testing could be reflected in the test design and

construction. As part of this broad mission, LanguEdge™ courseware was introduced to

improve the learning of English as a second language by providing classroom assessment

of communicative skills (ETS, 2002). The assessment component included in the

courseware package had an ESL test which was similar in task design and test structure

to the new generation of the TOEFL test which was under development at that point

(ETS, 2004). In both the LanguEdge test and the new TOEFL test, speaking and writing

have become more integrated with reading and listening (ETS, 2004).

Stricker et al. (2005) examined the construct validity of the LanguEdge test. In the

test, integrated tasks were implemented so that language skills could be tested in a way

that was close to how they would have been used in real life communications. Two out of

the five speaking tasks were integrated with listening and reading, respectively. There

were also two integrated writing tasks (out of three writing tasks in total), one having a

listening component and the other having a reading element. Factor structure of test

performance was examined across three native language groups: Arabic, Chinese, and

Spanish.

Scores on the four integrated tasks were counted for the speaking and writing

sections. These integrated tasks added an extra layer of complexity onto score

interpretation across different test-taking groups. The questions of concern to the authors

were (a) whether they could still find an appropriate model with factors corresponding to

the sub-section structure, and (b) whether the factor structure would hold equivalent

across groups of different native language backgrounds.

Analyses were performed based on composite scores by prompt or passage for

listening and reading as well as polytomously rated speaking and writing scores. A

correlated two-factor solution was chosen for each group and for the whole group. In this

model speaking ratings all loaded on one factor, whereas scores from the other sections

loaded on the other factor. According to this solution a common factor accounted for test

performance of the Listening, Reading, and Writing sections. The analyses did not

produce a factor model that corresponded to the structure of the test. The authors

suspected that there might have been a lack-of-instruction effect on the under-

development of the test-takers’ speaking ability, which might have made speaking stand

out as a unique factor.

The multiple group invariance analyses showed that invariance held across the

groups in terms of the number of factors, factor loadings and unique variances but not

factor correlations. The study concluded that the test functioned in a similar way across

groups although the correlations between the factors were lower for the Arabic sample

than for the other two.

The study failed to find separate factors for each section of the test. This might

have been because the inclusion of the integrated tasks had blurred the distinction of the

skill measured by each section. With regard to score interpretation, this finding also

called into question the usefulness of reporting section scores. On the other hand, the

study established a uniform factor structure across three native language groups. Factor

structure equivalence was a valuable piece of validity evidence of test fairness, especially

for a test with a diverse test-taking population.

The TOEFL® test evolved into an internet-based mode in 2005. The introduction

of the internet-based TOEFL test, the TOEFL iBT® test, was a milestone in the test’s

growth and modernization. Computer and internet technologies make the TOEFL iBT

test accessible to a wide population of interested test-takers. With a growing test-taking

population, it has become even more important for researchers to investigate the test’s

underlying factor structure across different test-taking groups around the world. Research

studies, such as Stricker et al. (2005), laid the groundwork for validating TOEFL iBT test

scores. Other researchers have continued to contribute to developing the validity

argument for test score interpretation and use.

Stricker and Rock (2008) recently assessed the structure invariance of TOEFL

iBT performance across subgroups of test takers who were identified by three criteria:

native language family backgrounds, exposure to English use in educational and business

contexts in home countries, and years of formal instruction in the English language.

Various native language groups in the test population were categorized into two

major language groups: the Indo-European language family and the non-Indo-European

language family. Kachru’s (1984) classification of inner-circle countries (English is

primary), outer-circle countries (English has special administrative status), and

expanding-circle countries (English is considered important but has no special



administrative status) with regard to the prevalence of English use in educational and

business contexts was also adopted in this study.

Focusing on learners of English as a FL, test-takers from the inner-circle countries

were not included in the study. Countries represented in the test-taker population were

divided into either the outer-circle country group or the expanding-circle country group.

Furthermore, based on the length of formal instruction in English received, individual

test-takers were also divided into three groups: a group with fewer than 6 years of formal

instruction in English, a group with instruction between 7 and 10 years, and a group with

more than 11 years of formal classroom experience with English.

A higher-order model with four first-order factors was found to be the best-fitting

model for all test-takers in this study. The same structure was also identified in all

subgroups of test-takers, categorized by native language family, exposure, and formal

instruction. The authors concluded that this higher-order structure conformed to the four-

section test design, and supported the usefulness of reporting the total score as an overall

indicator of the general language ability as well as four section scores as indicators of the

more specific abilities.
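
For reference, the covariance structure implied by a higher-order model of this kind can be written in standard second-order factor-analytic notation (a generic formulation, not one taken from the report):

\Sigma = \Lambda (\Gamma \Gamma^{\top} + \Psi) \Lambda^{\top} + \Theta

where \Lambda holds the loadings of the observed scores on the four first-order factors, \Gamma holds the loadings of the first-order factors on the single higher-order factor (whose variance is fixed to one for identification), \Psi is the diagonal matrix of first-order disturbance variances, and \Theta is the diagonal matrix of unique variances of the observed scores.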

This uniformity in the structure of language ability as measured by the test across

diverse subgroups of test-takers provided strong validity evidence related to the test’s

internal structure and this structure’s generalization. Based on these findings, a validity

argument can be made that the TOEFL iBT® test functions the same way for diverse

subgroups of test-takers. This also implies that score comparisons among individual test-

takers who belong to different subgroups are meaningful and legitimate.



Making a Validity Argument

Examining the dimensionality of a test is an essential part of the process of

building the validity argument for a test. The resulting factor structure sums up the

response pattern of test performance on the test. It also reveals the underlying ability or

abilities the test empirically measures. Our understanding of the nature of FL proficiency

is enhanced and verified through this process. However, as demonstrated by the results

from the studies cited in this review of literature, the make-up of the structure of FL

proficiency is highly context-dependent. It depends on what test is used, and how the test

is constructed. It also depends on who takes the test and under what circumstances. Tests

aiming at assessing proficiency in a FL can vary in test length, subtest structure, item or

task type, and scoring scheme, to name just a few. These facets, proposed by Bachman

(1990) as test method facets, can cast tremendous influences on test performance. In

addition, TTCs, or learner variables, also interact with how a test functions. Evaluating

whether a test exhibits equivalent factorial structure across different examinee

populations, defined by proficiency level, native language background, gender, learning

condition, home language environment, and so forth, provides critical validity evidence

for test generalization and fairness. Finding factorial noninvariance implies that

divergence in the test performance of different groups exists. This will make it hard to

argue that the results of the test can be interpreted in a comparable way across the groups.

Due to the differences in factor structure such as the number of factors and the

correlations among the factors, score interpretation could be very group-dependent. In

this case, will it still make good sense to use the same test or the same score reporting

system for the different groups?



Test fairness addresses whether a test measures the same construct in all relevant

subgroups of the population (AERA, APA, & NCME, 1999). Factorial invariance is an

important assumption for claiming test fairness. As highlighted by many authors in this

review, finding out whether a factorial model is invariant across groups of test-takers helps

decide how to report scores, and how test scores should be used. Relevant issues in this

regard include whether a single composite score or section scores or both should be

reported, and how weights should be assigned to the section scores if a total score is

needed. The recognition of population diversity needs to be incorporated when designing

a score reporting system for a test that exhibits factorial noninvariance for different test-

taking groups.

Conclusion

This chapter has reviewed and discussed models of FL proficiency. The review of

the models was not intended to be inclusive. The decision to review only selected models

was based on two criteria. First, these models represent milestones in the course of

searching for an understanding of what FL proficiency is, and they have been influential

in FL test development. Second, the soundness of these models has been theoretically

examined and empirically investigated via a latent trait approach in language testing.

At the end of the last decade, Bachman (2000) pointed out two major themes of

advancement in language testing research. The progression of the field has been powered

by both the refinement of a rich variety of approaches and tools, and by a broadening of

philosophical perspectives and the repertoire of researchable questions. In the case of

defining the nature of FL proficiency, the field has witnessed rapid growth in theoretical

positions and empirical techniques: from the domination of the unitary competence

model to a profusion of multidimensional competence models; from viewing language

ability as consisting of discrete skills and components to understanding language use as

dynamic and communicative; from holding a dichotomous view of trait and method to

perceiving test method as an integral part of FL competence; from being primarily

concerned with the internal structure of a test to examining a test’s external relationship

with TTCs.

According to this literature review, consensus has been reached by the research

community in two areas. First, it is generally agreed that FL proficiency is a complex

construct with multiple dimensions. Second, the view is held that the proficiency in a FL

can be interpreted more meaningfully if relevant TTCs are taken into consideration.

What the field has come to agree upon is crucial to the consolidation and

professionalization of the field of language testing. The unification in the fundamental

belief about the nature of FL proficiency demonstrates the collaborative efforts from

researchers with different backgrounds and strengths. However, uncertainties still remain,

and these pose both challenges and opportunities for future language testing researchers.

One common theme that has emerged from this review is that it is still unclear to

the research community what FL proficiency consists of or how the constituent parts

interact. Sawaki et al. (2008) pointed out that multidimensional competence models come

in different forms, varying in terms of the exact factor structures identified.

Under the broad category of multidimensional competence, there are three

different schools of thought for modeling this construct. The first one claims that FL

competence consists of multiple uncorrelated factors. The second believes that this

proficiency can best be represented by a set of correlated factors. The third argues that the

relationships among the factors should be explained in a hierarchical structure, in which

the construct is composed of a set of uncorrelated first-order factors subsumed under a

higher-order factor. This higher-order factor is general in nature and is responsible for

performance across multiple skills. The higher the loadings of first-order factors on the

general factor, the more likely it is that this general factor actually exists.

While the first position has been repeatedly refuted based on empirical evidence

showing that factors are indeed correlated, choosing between the second and the third

position has never been easy. The question is how high the correlations among the factors

should be to adopt a hierarchical structure. This question becomes even harder to resolve

when there are three first-order factors in the model. A hierarchical model with three

primary factors, is statistically equivalent to the corresponding correlated three-factor

model in terms of overall model fit statistics. Researchers in the past have leaned towards

one position or the other by comparing model fit or by examining individual parameter

estimates when results from model comparisons did not indicate which one was clearly

superior to the other. Most recently, a bifactor model was tested in Sawaki et al. (2008).

In a bifactor model, two factor loadings are estimated for each observed score, one on the

general higher-order factor and the other on one of the first-order factors. More research

needs to be conducted before we can tell how well this bifactor model can be applied to

explain FL test performance.
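
The loading pattern that distinguishes a bifactor model can be sketched numerically. The loadings below are hypothetical, two scores per skill are used for compactness, and the general and group factors are treated as orthogonal with unit variance, which is the usual bifactor assumption.

import numpy as np

# Columns: general factor, then orthogonal group factors for listening, reading,
# speaking, and writing; each score loads on the general factor and on one group factor.
Lambda = np.array([
    [0.7, 0.4, 0.0, 0.0, 0.0],   # listening score 1
    [0.6, 0.5, 0.0, 0.0, 0.0],   # listening score 2
    [0.7, 0.0, 0.4, 0.0, 0.0],   # reading score 1
    [0.8, 0.0, 0.3, 0.0, 0.0],   # reading score 2
    [0.6, 0.0, 0.0, 0.5, 0.0],   # speaking score 1
    [0.6, 0.0, 0.0, 0.6, 0.0],   # speaking score 2
    [0.7, 0.0, 0.0, 0.0, 0.4],   # writing score 1
    [0.7, 0.0, 0.0, 0.0, 0.5],   # writing score 2
])
Theta = np.diag(1 - (Lambda ** 2).sum(axis=1))  # unique variances (standardized scores)
Sigma = Lambda @ Lambda.T + Theta               # model-implied covariance matrix
print(Sigma.round(2))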

Future researchers should keep testing these competing hypotheses until general

agreement is reached regarding whether language proficiency should be hierarchical or

non-hierarchical in its underlying representation.



The second unresolved issue has to do with the compatibility of trait and method.

Since its initial appearance, the concept of communicative competence (Canale & Swain,

1980) has had a strong influence not only on language teaching and learning, but also on

test development. Communicative competence emphasizes the use and coordination of all

possible knowledge and skills required for interacting in actual communication (Canale,

1983). This poses an immediate threat to the validity and usefulness of tests that focus on

only one aspect of language knowledge or skill at a time. For instance, a listening test is

usually designed to measure listening ability and, ideally, not to measure anything else. A

test battery will usually consist of a series of skill-based tests (listening, reading, writing,

and speaking), knowledge-based tests, (vocabulary, grammar, phonology, etc.), or both.

Such a test battery is good enough to measure a wide repertoire of skills and knowledge

separately, but not sufficient to measure the coordination of different skills and

knowledge, which is exactly what communicative competence requires.

A new type of language test task, the so-called integrated task, has been

developed for and implemented in the TOEFL iBT® test. An integrated task requires the

use of multiple skills simultaneously. For example, an integrated speaking task could

involve both speaking and listening, and an integrated writing task could involve reading,

listening, and writing. Such an integrated task is no longer tied to one modality and,

therefore, can no longer be defined by skill or knowledge.

The arrival of integrated language tasks raises questions about how these skill-

based factor models found in many of the studies in the review are compatible with the

theoretical definition of communicative competence. A second question is what the

structure underlying performance on integrated tasks might look like. A third follow-up

question is if we do find distinct factors, how are we going to name the factors so that

they are interpreted appropriately and meaningfully?

Future researchers should examine test performance on integrated tasks that are

designed specifically to operationalize the construct of communicative competence. A

latent factor approach would still apply; however, the resulting factor structure might no

longer be compatible with the four-skills naming convention.

Third, this review has also pointed out that one of the top tasks on the agenda for

future research is to investigate the relationships between TTCs and test performance.

TTCs are recognized as one of the four influences on test scores in Bachman’s (1990)

CLA model, and in Bachman and Palmer’s (1996) description of language use in

language tests. According to this review, the field has recently witnessed a surge of

interest and a growing number of empirical studies on TTCs, such as gender (Wang,

2006), native language background (Shin, 2005), ethnicity and preferred language

(Ginther & Stevens, 1998), home language environment (Bae & Bachman, 1998), to

name just a few. However, the amount of information we have obtained is far from what

we will need to fully understand the complex and dynamic relationships. Moreover, there

are still TTCs that have not attracted enough attention by the field, and therefore their

relationships with test performance remain under-researched.

One such TTC is language contact with the target language in a study-abroad

environment. FL test-takers are non-native speakers of the target language, but they may

have had the experience of studying and/or living in the target language environment.

Learning experience in the target language environment has been investigated in relation

to test performance in very few studies (Kunnan, 1994, 1995; Morgan & Mazzeo, 1988).

Since this TTC is salient and relevant in the context of FL testing, future research should

study the relationships of this TTC and FL test performance in various testing situations.

It is also worth mentioning that the growing interest in the topic of TTCs and test

performance can to a large extent be attributed to statistical advancements in structural

equation modeling (SEM). SEM has offered not only powerful research techniques, but

also new perspectives to explore issues that interest researchers from both language

testing and second language acquisition.

Within the general SEM framework, we can investigate a wide range of issues: (a)

what the internal structure of a test is; (b) whether and to what degree the internal

structure of a test holds equivalent across different test-taking groups of interest; and (c)

whether this latent internal structure relates to any external variables of TTCs and, if so,

in what way. Factor analysis, a family of latent multivariate analysis techniques, has been

used extensively to investigate the first question. Multi-group invariance factor analysis

has been used to address the second question. Built upon factor analysis and path

analysis, SEM is able to further provide a structural view of the relationships between a

test’s internal structure and the learner variables that exist outside the test, such as age,

gender, classroom learning experience, and study-abroad experience. This structural

component, together with the measurement component based on the internal structure of

the test, could offer insights into the relationships of test-taker background and test

performance. Future researchers, who are interested in the relationships between TTCs

and FL test performance, should embrace an SEM approach in their investigations.
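
As a concrete, if much simplified, illustration of the kind of combined measurement-and-structural model advocated here, the sketch below specifies two latent skills, each measured by three hypothetical task scores, and regresses both on two background variables. The simulated data, the variable names, and the use of the semopy package’s lavaan-style syntax are all assumptions made for illustration.

import numpy as np
import pandas as pd
from semopy import Model  # assumed available; lavaan would accept the same model syntax

rng = np.random.default_rng(3)
n = 600
months_abroad = rng.poisson(6, n).astype(float)
years_instruction = rng.integers(2, 15, n).astype(float)
speaking = 0.05 * months_abroad + 0.02 * years_instruction + rng.normal(0, 1, n)
reading = 0.01 * months_abroad + 0.06 * years_instruction + rng.normal(0, 1, n)

def tasks(factor, prefix):
    # three hypothetical task scores per latent skill
    return {f"{prefix}{i}": 0.8 * factor + rng.normal(0, 0.6, n) for i in (1, 2, 3)}

data = pd.DataFrame({"months_abroad": months_abroad,
                     "years_instruction": years_instruction,
                     **tasks(speaking, "s"), **tasks(reading, "r")})

desc = """
speaking =~ s1 + s2 + s3
reading =~ r1 + r2 + r3
speaking ~ months_abroad + years_instruction
reading ~ months_abroad + years_instruction
"""
model = Model(desc)
model.fit(data)
print(model.inspect())  # loadings (measurement part) and TTC paths (structural part)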

This chapter reviews and synthesizes research results of studies that have

investigated the nature of FL proficiency through a latent factor approach in the context

of FL testing. Directions for future research are suggested at the end of the chapter.

These concerns for future studies will be addressed in the design of the current study

described in the next chapter.



CHAPTER THREE

METHODOLOGY

This chapter first introduces the TOEFL iBT® test developed at Educational

Testing Service (ETS), including a brief history of the test, and ETS’s philosophy of

testing English as a foreign language (FL). Next, two general goals of the study are

explained, and three research questions are put forward. Borrowing knowledge from the

studies reviewed in the previous chapter as well as insights from the study-abroad

literature, six hypotheses are formulated. This is followed by a description of the TOEFL

iBT public dataset, the subjects, and the measures used in this study. Last, planned

analysis procedures are laid out.

About the TOEFL iBT® Test

The TOEFL iBT test is a test of English as a FL, and it is administered world-

wide. The purpose of the test is to “measure the communicative language ability of

people whose first language is not English” (Jamieson, Jones, Kirsch, Mosenthal,

& Taylor, 2000). An online report on validity evidence states that “TOEFL iBT test

scores are interpreted as the ability of the test taker to use and understand English as it is

spoken, written, and heard in college and university settings,” and “[t]he proposed uses of

TOEFL iBT test scores are to aid in admissions and placement decisions at English-

medium institutions of higher education and to support English-language instruction”

(ETS, 2008). As the newest member of the TOEFL® test suite, the TOEFL iBT test has

been in operational use since 2005.

The development of the TOEFL iBT test was part of a broad effort to modernize

language testing at ETS. Jamieson et al. (2000) explained that the impetus behind this

movement was to build a test that could reflect communicative competence models. The

design of the test was also intended to meet the needs of various stake-holders for more

constructed-response tasks as well as tasks that integrated multiple language modalities.

In response to these calls, the current TOEFL iBT® test has the following unique

characteristics that distinguish it from its precursors.

First, based on a model developed by the TOEFL® Committee of Examiners

(COE), the test measures communicative language proficiency and focuses on academic

language and the language of university life (Chapelle et al., 1997). This COE model

makes an explicit distinction between the context of language use and the internal

capacities of individual language users. The model portrays the relationships between the

two as dynamic and integrated, as the COE members believed that “the features of the

context call on specific capacities defined within the internal operations” (p. 4). This

understanding of language ability in relation to its context of use lays out the theoretical

foundation upon which a test development framework with both context and ability

components has been established (Jamieson et al., 2000).

Second, the new test has mandatory speaking and writing sections, intended to

elicit constructed responses from test-takers. A new kind of constructed-response task, namely the integrated task, was introduced into TOEFL testing for the first time. Unlike

the traditional independent tasks, which call upon language skill in a single modality, an

integrated task requires test-takers to use and coordinate skills in more than one modality.

For example, an integrated writing task can require test-takers to incorporate and

synthesize information from an aural lecture and a reading passage in their written

responses. These integrated tasks not only provide information about examinees’ abilities

in more than one skill area but also their abilities to coordinate the use of different skills

in response to specific language tasks.

As stated in Chapter One, this study aims to provide empirical evidence to support

our understanding of the nature of FL proficiency in both its absolute and relative senses,

and to investigate the educational, social, and cultural influences on language proficiency.

The FL proficiency investigated in this study was defined as communicative language

ability as measured by the TOEFL iBT® test. The reasons for choosing the test are

explained as follows.

First, the theoretical reasoning behind the development of the test reflects the

current thinking in applied linguistics, which views language ability as being communicative

in nature. The test focuses on language use in communication, not language display in

isolation. It specifies the relationship between the context of language use and language

abilities as integrated, rather than dichotomous. The role of context in defining language

ability is recognized and reflected in test development. Performance based on the test will

enable us to empirically examine the theoretical construct of communicative language

ability, as a way to enhance our understanding of FL proficiency in general.

Second, as the world’s most widely used English language test (ETS, 2008), the

test enjoys an extremely diverse test-taking population differing on many background

variables, such as native language and country, English language learning, and language

contact experience. This testing situation provides us with ample opportunities to

examine communicative language ability in its relative sense across different test-taker

groups. With test-takers from different backgrounds, the results of the test could also

offer unique perspectives on how different educational, social, and cultural environments

affect the nature and path of FL acquisition.

General Goals of the Study

Based on the TOEFL iBT® test performance, this study investigates the

dimensionality of communicative language ability and its latent factorial structure across

test-taker groups with different context-of-learning experiences. The study also intends to

discover the interactions between test-takers’ study-abroad and learning experiences, as

well as the joint impact learning and time abroad on the development of FL ability.

The first goal of this study is to investigate the internal structure of

communicative language ability, as measured by the TOEFL iBT test, and how this

structure conforms to the communicative competence framework endorsed by the COE

model. Language competencies within individual users and the context of language use

are both parts of the construct definition in the COE model. The COE model originally

proposed to organize the test domain by a situation-based approach that “acknowledges

the complexities and interrelatedness of features of language and contexts in

communicative language proficiency” (Chapelle et al., 1997, p. 26). Adopting this

situation-based approach, Jamieson et al. (2000) suggested characterizing situations with

five variables: participants, content, setting, purpose, and register. However, it was

decided later to organize language tasks by modality largely because the results of

surveys (Ginther & Grant, 1996; Taylor, 1993) showed that the majority of test users

would like to have scores reported for speaking, writing, listening, and reading. It was

obvious that there was a mismatch between the current thinking in applied linguistics and

popular belief about what language proficiency meant. As pointed out by Chapelle et al.

(1997), “[w]here the skills approach falls short, in the view of applied linguists, is its

failure to account for the impact of the context of language use” (p. 26).

The nature of communicative language ability, as informed by the relationships

between the context of language use and cognitive skill-based capacities of language

users, is examined by adopting the situation-based approach in this study. Understanding

the role of context in defining the language ability would offer the theoretical framework

of communicative competence greater power in delimiting the nature of FL ability.

The second goal is to investigate the relationships between communicative

language ability and learner/test-taker background variables. This research interest has

been shared by the second language research community as well as the TOEFL® research

community. Several past TOEFL studies were devoted to validating score interpretation

and use in light of differences of test-takers on background variables, such as reasons for

taking the test (Swinton & Powers, 1980), native language background (Stricker et al.,

2005; Stricker & Rock, 2008; Swinton & Powers, 1980), target language exposure in

home countries (Stricker & Rock, 2008), and years of classroom instruction (Stricker &

Rock, 2008). The current study aims to investigate the joint impact of study-abroad in the

target language and formal classroom learning experience on the development of

communicative language ability, based on TOEFL iBT® test performance. Although the

test is intended for people whose first language is not English (Jamieson et al., 2000),

some of the test-takers may have the experience of living in an English language

environment, and others may not. Borrowing the concept of language contact from the

study-abroad literature and treating it as a test-taker characteristic (TTC), this study first

looks at whether or not having the contact experience with the target language

community has any effect on the underlying structure of the test performance.

Theoretically speaking, this investigation allows us to examine the nature of

communicative language ability in its relative sense. In other words, the result tells us

whether or not the language abilities developed in groups of test-takers with different

context-of-learning experiences are equivalent in terms of their factorial structures. From

a test development and validation perspective, the outcomes of this investigation contain

crucial information that can be used for test validation. Understanding how a test

functions across different test-taker groups tells whether or not the test scores can be

interpreted and used in a similar way across these groups. Since the test is administered

both domestically (where English is the dominant language) and internationally (where

English is not the dominant language), the results also inform us if it is a reasonable

practice to use the same test format at all locations.

Second, test-takers may also vary in terms of the length of study-abroad and

formal language training they have received. The length of study-abroad as well as the

length of learning are examined together to facilitate further understanding of the joint

impact of study-abroad and learning on the development of communicative language

ability. The results have both pedagogical and practical implications. Pedagogically

speaking, results of such an investigation can offer empirical evidence of the impact of

study abroad, and how it interacts with the effect of learning. From a practical

perspective, findings can be used to advise future test-takers on how to prepare for the

test and how to improve their English proficiency in general.

The study-abroad and learning experiences are both salient in the target test-

taking population, and their impact on test performance deserves an in-depth investigation

in the context of TOEFL iBT® testing. The outcomes would inform us what to expect

from and how to deal with the increasingly diverse and constantly changing test-taking

population during test development and validation.

Research Questions

Research Question One

The first research question asks what the nature of communicative language

ability is, as measured by the TOEFL iBT® test. To be more specific, this investigation

focuses on answering the following two questions: (1) what the constituents of

communicative language ability are and how they are related; and (2) whether the role of

context in defining communicative language ability is reflected in the latent structure

of the test performance.

The operational test adopts the four-skills approach to test design, and it reports a

separate score on each of the skills (listening, reading, writing, and speaking) as well as a

total score. Factor-analytic studies (Sawaki et al., 2008; Stricker & Rock, 2008) have

provided supporting evidence for this skill-based design and reporting scheme. Stricker

and Rock (2008), using task-based scores from 2720 test-takers during the TOEFL iBT

field test, found that a correlated four-factor model and a higher-order model with four

first-order factors fit the data similarly well. They concluded that the latter was the best

model based on the principle of parsimony. Sawaki et al. (2008) analyzed the same

dataset as Stricker and Rock (2008) but conducted the analysis based on item-

level scores. They found that the fit of the correlated four-factor model was better than

that of the higher-order model based on the result of a chi-square difference test, although

the differences in other fit indices were minimal. They concluded that the higher-order

model was preferred because it was more parsimonious.

In an earlier study Stricker et al. (2005) examined the factor structure of the

LanguEdge™ test, a prototype of the TOEFL iBT® test. Based on task-based scores, the

authors found a correlated two-factor model across three native language groups (Arabic, Chinese,

and Spanish). One of the factors was a speaking factor, and the other was a combination

of listening, reading, and writing.

The higher-order model and the correlated four-factor model had both exhibited a

good fit to TOEFL® test performance in the past. Since the LanguEdge test used in

Stricker et al. (2005) and the TOEFL iBT test have similar test structures and item types,

the correlated two-factor model found in Stricker et al. (2005) could also provide a suitable

solution for the current dataset. Based on the results of these studies, a higher-order

model, a correlated four-factor model, and a correlated two-factor model, are all plausible

factorial solutions. To address the first focus of research question one, the hypothesis is

put forward as follows:

Hypothesis 1: The structure of the communicative language ability measured by

the TOEFL iBT test can be best explained by a higher-order model, and can also be

explained adequately by a correlated four-factor model and a correlated two-factor

model.

In the context of TOEFL testing, the context of language use is interpreted as part

of the definition of communicative language ability (Chapelle et al., 1997), and language

use situation is characterized by five variables: participants, content, setting, purpose, and

register (Jamison et al., 2000). The content variable refers to the topic of a task. The

setting variable refers to the location of a language act. The content and the setting

variables are key ingredients in defining the context of language use. To address the

second focus of research question one, models with situational factors, content and

setting, are subject to evaluation of fit. The two corresponding hypotheses are formulated

as follows:

Hypothesis 2: Adding a dimension of content factors to the model confirmed

through testing Hypothesis 1 improves model fit, and therefore demonstrates the role of

context in defining communicative language ability.

Hypothesis 3: Adding a dimension of setting factors to the model confirmed

through testing Hypothesis 1 improves model fit, and therefore demonstrates the role of

context in defining communicative language ability.

Research Question Two

The second research question inquires whether the communicative language

ability, as measured by the TOEFL iBT® test, has equivalent representation across groups

of test-takers differing on one TTC: study-abroad. To be more specific, this question

investigates whether the underlying configuration of communicative language ability is

invariant across two groups of test-takers, with one group having been exposed to an

English-speaking environment and the other without such experience. There is a handful of

studies that have examined the nature of FL proficiency in relation to TTCs by using multi-

group invariance analysis. This ability, as measured by a test of a researcher’s choice,

was found to be equivalent across test-taker groups in some studies (Stricker et al., 2005;

Stricker & Rock, 2008; Shin, 2005; Wang, 2006), but different in others (Bae & Bachman,

1998; Ginther & Stevens, 1998; Kunnan, 1992; Morgan & Mazzeo, 1988; Römhild,

2008; Sang et al., 1986). Among these studies, only Morgan and Mazzeo (1988) used

study-abroad experience as a TTC to define and compare groups. When comparing the

group of test-takers who had spent at least one month in the target language (French)

environment to the standard group (with little or no out-of-school French language

experience), the two groups’ language abilities appeared to have different underlying representations.

Due to the lack of empirical studies examining study-abroad experience via a latent trait

approach, results of studies involving heritage language learners were brought in to help

formulate a meaningful hypothesis. Heritage language learners usually have target

language contact at home. Although the experience of language contact at home is not

equivalent to that of studying abroad in the target language society, the two share

similarities to a certain extent. In the same study conducted by Morgan and Mazzeo

(1988), no equivalence was found in the structure of language ability between the

heritage learner group and the standard group. Bae and Bachman (1998) also found that

the heritage Korean learner group and the non-heritage Korean learner group differed in

terms of underlying structure of their language ability. Based on these observations, the

hypothesis in response to the second research question is put forward as follows:

Hypothesis 4: Two groups of test-takers, one with study-abroad experience and

the other without such experience, differ in the underlying configuration of their

communicative language ability, as measured by the TOEFL iBT® test.

Research Question Three

The third research question investigates the impact of length of time abroad and

classroom learning on the development of communicative language ability, as measured

by the TOEFL iBT® test. Specifically, this investigation addresses the following two

questions: (1) whether the length of formal learning differentiates test-takers who do not have

study-abroad experience; and (2) whether the length of time abroad and formal learning

differentiate test-takers who have study-abroad experience.

Only one study, conducted by Kunnan (1994, 1995), has investigated the impact of

length of learning and study-abroad on FL ability through a structural equation modeling

(SEM) approach. His findings suggested that both home country formal instruction and

experience in the target language environment had influences on the development of

aspects of language ability, and the influences could be either positive or negative.

Further insights can be borrowed from study-abroad research: Davidson argued that the length of

study abroad was positively correlated with language gains (as cited in Dewey, 2004, p.

322), whereas Freed (1995) claimed that formal language instruction was an important

variable in predicting proficiency gains achieved abroad. Díaz-Campos’s (2004) study

showed that years of formal language instruction had an impact on Spanish proficiency gains.

Magnan and Back (2007) claimed that coursework had a positive influence on linguistic

gain in French. The literature indicates that both formal instruction and study-abroad

experience have influences on language development. The two hypotheses related to the

third research question are put forward as follows:

Hypothesis 5: For test-takers who have no study-abroad experience, the

development of their language ability is associated with the length of formal learning.

Hypothesis 6: For test-takers who have study-abroad experience, the development

of their language ability is associated with the length of formal learning and the length of

study-abroad.

The TOEFL iBT® Public Dataset

A request for access to the TOEFL iBT public dataset was submitted to and approved

by ETS. The dataset contained two test forms (Form A and Form B) and the

test performance of 1000 test-takers on each form. Form A and its associated test

performance from 1000 test-takers (Sample A) were used in the analysis in this study. A

description of the requested data is provided below.

One thousand test-takers were randomly drawn from one TOEFL iBT

administration during fall 2006. Each test-taker was linked to a unique sample

identification number (e.g., 20070001). The information on test-taking location was

recorded for each test-taker at the time of the test administration. Test-takers who took

the test in the United States or Canada were identified as domestic, whereas those who

took the test in all other countries were identified as overseas. There were 418 domestic

test-takers and 582 overseas test-takers.

Test-takers were asked to provide information on their age, gender, native

country, and native language. Test-takers were also asked to respond to questions

regarding the reason for taking the test, the type of institution in which they were

interested, the amount of financial support they expected, the amount of time they spent

studying English, the amount of time they spent in content classes taught in English, and

the amount of time they spent living in an English-speaking country.

To investigate the effect of target language exposure on English language

proficiency as manifested in TOEFL iBT® test performance, the first step was to identify

who had had such exposure and who had not before taking the test. Although test-taking location

was recorded for every test-taker, it could not be used as a reliable indicator for

identifying the group membership. Among the 582 people who took the test overseas,

106 of them indicated that they had lived in an English-speaking country. It was very

likely that these test-takers, after having been exposed to English, had relocated and

therefore took the test in an overseas test center. Another indicator of the group

membership was test-takers’ responses to the question ‘have you ever lived in a country

in which English is the main language spoken’. This indicator was used to identify the

groups. It also provided information on the length of English language exposure.

Furthermore, this study intended to examine the relationship between study-

abroad experience and learning experience, and their joint impact on test performance.

Test-takers’ responses on two more background questions were used for this

investigation. One question concerned the amount of time they spent studying English,

and the other was about the amount of time they spent in content classes taught in

English.

Three hundred ninety-nine test-takers responded to all three background

questions. Among these 399 test-takers, 29 test-takers said that they had never lived in an

English-speaking country, although the record showed that they took the test either in the

United States or Canada. Bearing in mind that these 29 test-takers were physically

located in an English-speaking country at the time they took the test, their responses to

this question contradicted documented fact. One possible cause for this inconsistency

could be that these test-takers were not able to fully understand the question and therefore

provided inaccurate information (X. Xi, personal communication, November 17, 2010).

Another explanation could be that they took the test immediately upon arriving in the United States or

Canada (J. E. Liskin-Gasparro, personal communication, January 17, 2011). Among the

399 test-takers who answered all three background questions, this study identified 370

test-takers whose study-abroad experience could be confirmed. Background

characteristics of these 370 test-takers are discussed below.

The location of test-taking was recorded for each test-taker. This information told

us where a test-taker was physically located when taking the test. Based on test-taking

location, inferences could be made about what kind of linguistic environment a test-taker

was immersed in at the time of testing. One hundred fifty-seven subjects took the test at

test centers located either in the United States or Canada. The remaining 213 subjects

took the test at overseas locations.

All subjects provided age information. The average age of these 370 test-takers

was 24 at the time of testing. The youngest subject was 14, and the oldest was 51. The

majority of the subjects (about 85%) were between the ages of 15 and 30.

Of the 370 subjects, 321 reported on their gender, among whom 54% were male

and 46% were female.

Test-takers were asked to use a list of country and region codes to indicate their

native countries (where they were born), and a list of native language codes to report their

native languages. All but one of the 370 subjects responded. The 369 respondents were

from a total number of 56 countries or regions. More than half of the subjects were from

these seven largest groups: Korea, Japan, India, Taiwan, France, Thailand, and China.

With regard to their native languages, a total of 38 different native languages were

represented in this sample. The five most frequently spoken native languages in order of

the number of their speakers were Korean, Japanese, Chinese, Spanish, and Arabic. Native

speakers of these five languages made up about 59% of the total sample. Four subjects

indicated that their native language was English.

Test-takers were asked to respond to the question ‘what is your main reason for

taking TOEFL.’ They were provided with the following answer choices: (1) to enter a

college or university as an undergraduate student, (2) to enter a college or university as a

graduate student, (3) to enter a school other than a college or university, (4) to become

licensed to practice my profession in the United States or Canada, (5) to demonstrate my

proficiency in English to the company for which I work or expect to work, and (6) other

than above. Out of the 370 subjects, 366 responded. About 88% of the respondents

indicated that they took the test in order to enter a college or university either as an

undergraduate student or a graduate student.

When asked the question ‘what types of institution are you interested in

attending,’ 367 subjects responded. The provided answer choices were: (1) four-year

college or university, (2) two-year community college, (3) graduate or professional

school, (4) ESL institute, and (5) don’t know. Subjects were allowed to choose more than

one type of institution. Among the 367 subjects who responded, none of them chose more

than one type of institution. About 88% of them indicated that they were interested in

attending either four-year colleges/universities or graduate/professional schools.

Of the 370 subjects, 356 answered the question ‘how much do you and your

family expect to contribute annually toward your study in the United States or Canada.’

They could choose from the following answer choices: (1) less than $5000, (2) $5000 to

$10,000, (3) $10,001 to $15,000, (4) $15,001 to $25,000, (5) more than $25,000, or (6)

don’t know. More than a third of the respondents indicated that they did not know the

answer. About one fifth of them expected to receive more than $25,000 in financial

support from their families.

All 370 subjects answered the question ‘how much time have you spent studying

English (in a secondary or post secondary school).’ The answer choices were: (1) none,

(2) less than 1 year, (3) 1 year or more, but less than 2 years, (4) 2 years or more, but less

than 5 years, (5) 5 years or more, but less than 10 years, and (6) 10 years or more. The

majority of them (about 64%) reported that they had studied English for at least 5 years

by the time they took the test. A third of the subjects had studied English for 10 years or

more at the time of testing.

All 370 subjects responded to the question ‘how much time have you attended a

secondary or post-secondary school in which content classes (such as math, history,

chemistry) were taught in English.’ The answer choices were: (1) none, (2) less than 1

year, (3) 1 year or more but less than 2 years, (4) 2 years or more but less than 5 years,

and (5) 5 years or more. About a third of them indicated that they had never had such

experience. Close to 60% of the subjects indicated that they had at least one year of such

experience.

All 370 subjects responded to the question ‘have you ever lived in a country in

which English is the main language spoken.’ To answer this question, they chose from

the following possible answers: (1) no, (2) yes, for less than 6 months, (3) yes, for 6

months to 1 year, and (4) yes, for more than 1 year. About two-thirds of the subjects

indicated that they had lived in an English-speaking country by the time they took the test.

Sample Representativeness and Appropriateness

The sample of 370 subjects used in this study was just a fraction of the total

random sample of 1000 test-takers generated from one TOEFL iBT® test administration.

Since the study sample was not randomly generated, two steps were taken to verify that

the study sample was comparable to the random sample of 1000 test-takers.

Table 1. Test-Taker Characteristics across the Two Samples

Background Variable    Answer Choices                      Study Sample (N = 370)   Total Sample (N = 1000)
Location               1 = U.S. or Canada                  42%                      42%
                       2 = All other countries             58%                      58%
Age                    Below 15                            1%                       2%
                       Between 15 and 30                   85%                      86%
                       Above 30                            14%                      12%
Gender                 Male                                47%                      43%
                       Female                              40%                      42%
                       Missing                             13%                      15%
Native country         Korea                               18%                      15%
                       Japan                               13%                      9%
                       India                               11%                      11%
                       Taiwan                              5%                       5%
                       France                              4%                       3%
                       Thailand                            4%                       3%
                       China                               4%                       12%
                       Other                               40%                      42%
                       Missing                             < 1%                     < 1%
Native language        Korean                              19%                      15%
                       Japanese                            13%                      9%
                       Spanish                             10%                      8%
                       Chinese                             10%                      18%
                       Arabic                              6%                       8%
                       Others                              41%                      41%
                       Missing                             < 1%                     < 1%

For the remaining variables, three percentages are reported: one for the study sample (N = 370), one for the
total sample (N = 1000), and one for the total-sample test-takers who actually responded to the question
(the number of respondents is given with each variable).

Background Variable                 Answer Choices                                                          Study Sample   Total Sample   Respondents
Reason for taking the test          1 = To enter a college or university as an undergraduate student        37%            17%            36%
(respondents: N = 464)              2 = To enter a college or university as a graduate student              50%            24%            52%
                                    3 = To enter a school other than a college or university                3%             1%             3%
                                    4 = To become licensed to practice my profession in the U.S. or Canada  4%             2%             5%
                                    5 = To demonstrate my proficiency in English to the company for which
                                        I work or expect to work                                            2%             < 1%           2%
                                    6 = Other than above                                                    4%             2%             3%
                                    Missing                                                                 1%             54%            0%
Type of institution                 1 = Four-year college or university                                     37%            17%            37%
(respondents: N = 453)              2 = Two-year community college                                          4%             2%             4%
                                    3 = Graduate or professional school                                     50%            23%            51%
                                    4 = ESL institute                                                       1%             < 1%           1%
                                    5 = Don’t know                                                          7%             3%             8%
                                    Missing                                                                 1%             55%            0%
Financial support                   1 = Less than $5,000                                                    8%             3%             8%
(respondents: N = 404)              2 = $5,000 to $10,000                                                   10%            4%             10%
                                    3 = $10,001 to $15,000                                                  14%            5%             13%
                                    4 = $15,001 to $25,000                                                  9%             4%             10%
                                    5 = More than $25,000                                                   21%            9%             21%
                                    6 = Don’t know                                                          35%            15%            37%
                                    Missing                                                                 4%             60%            0%
Time spent studying English         1 = None                                                                2%             1%             3%
(respondents: N = 414)              2 = Less than 1 year                                                    4%             2%             4%
                                    3 = 1 year or more but less than 2 years                                10%            4%             10%
                                    4 = 2 years or more but less than 5 years                               20%            8%             20%
                                    5 = 5 years or more but less than 10 years                              34%            14%            34%
                                    6 = 10 years or more                                                    30%            12%            29%
                                    Missing                                                                 0%             59%            0%
Time spent in content classes       1 = None                                                                31%            13%            32%
taught in English                   2 = Less than 1 year                                                    10%            4%             10%
(respondents: N = 411)              3 = 1 year or more but less than 2 years                                16%            7%             16%
                                    4 = 2 years or more but less than 5 years                               20%            8%             20%
                                    5 = 5 years or more                                                     23%            9%             22%
                                    Missing                                                                 0%             59%            0%
Time spent living in an             1 = No                                                                  34%            21%            39%
English-speaking country            2 = Yes, for less than 6 months                                         18%            9%             16%
(respondents: N = 527)              3 = Yes, for 6 months to 1 year                                         13%            6%             11%
                                    4 = Yes, for more than 1 year                                           35%            17%            33%
                                    Missing                                                                 0%             53%            0%

First, the study sample (N=370) was compared to the total random sample

(N=1000) on all available background variables collected for the test-takers (see Table 1).

Second, one-sample t-tests for the section and total scores were conducted to detect any

mean difference across the two samples.



The percentages in Table 1 show that both samples had the same distribution on

the location variable. The age and gender distributions in the two samples resembled each

other closely.

With regard to country of origin and native language, more than half of the test-

takers came from the same seven countries in both samples. However, the percentage of

test-takers from Mainland China in the study sample was disproportionately low,

compared to the one in the total sample. The five most frequently spoken native

languages among test-takers were also the same across the two samples, although the

order of the two lists differed. It is noteworthy that a disproportionately high percentage of

test-takers from Mainland China were not included in the study sample because these

test-takers did not answer all three background questions regarding their learning and

study-abroad experiences. As a result, the percentage of native Chinese speakers was

disproportionately low in the study sample, compared to the total sample.

As shown in Table 1, test-takers from India constituted a large native country

group for both samples. The native languages spoken by these test-takers included a

variety of languages native to people living in the various parts of the country. Due to this

diverse linguistic situation, none of these languages had a large native language group in

either sample.

Although English has a special administrative role in Indian society, very few

test-takers considered English as their native language. In the total sample of 1000 test-

takers, among the 120 Indian test-takers, only five reported their first language as

English. In the study sample of 370 test-takers, among the 45 Indian test-takers, none of

them indicated their first language as English. This observation allowed us to treat the

test-takers from India as learners of English as a second language with confidence.

English also has a special administrative role in the region of Hong Kong. All 11

test-takers from Hong Kong in the total sample identified their native language as

Chinese. This observation allowed us to treat the test-takers from Hong Kong as learners

of English as a second language with confidence.

Regarding variables that had a large number of missing values in the total sample,

such as reason for taking the test, financial support, time spent studying English, time

spent in content classes taught in English, and the time spent living in an English-

speaking country, the distribution patterns of the obtained responses were similar across

the two samples. Percentages based on the test-takers who actually responded to these

questions were also reported. For example, the table shows that among the total 1000

test-takers, 464 test-takers (N=464) actually responded to the question asking for their

reason for taking the test, and 36% of them indicated that they took the test to enter a

college or university as undergraduate students. As indicated in the table, these

percentages were very close to the ones based on the study sample of 370 test-takers.

Furthermore, the four section scores and the total score were compared across the

two samples. The descriptive statistics shown in Table 2 indicated that the section and

total mean scores were all higher in the study sample than the ones in the total sample.

A series of one-sample t-tests was conducted to compare the means of the study

sample to the ones of the total sample. The means of the study sample were tested against

the total sample means to evaluate the statistical significance of the differences. The

results, summarized in Table 3, showed that only the listening score was found to be

significantly higher in the study sample than the one in the total sample at the p level of

0.01. At the p level of 0.05, the listening score and the total score were found to be

significantly higher in the study sample than the ones in the total sample. These results

suggested that the study sample did not deviate substantially from the total sample with

regard to test performance.

Table 2. Descriptive Statistics across the Two Samples

                 Study Sample (N = 370)         TOEFL iBT® Public Dataset Sample A (N = 1000)

                 Mean        Std. Dev.          Mean        Std. Dev.
Listening 23.46 5.44 22.67 5.92
Reading 26.98 7.51 26.23 7.95
Speaking 15.13 3.91 15.08 3.86
Writing 6.70 1.77 6.52 1.90
Total 72.27 16.35 70.50 17.36

Table 3. One-Sample t-Tests

Score        Mean Diff.    t       df     Sig. (2-tailed)


Listening 0.79 2.79 369 0.006
Reading 0.75 1.92 369 0.056
Speaking 0.05 0.26 369 0.796
Writing 0.18 1.91 369 0.056
Total 1.77 2.08 369 0.038
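For illustration, the comparison reported in Table 3 can be reproduced with a standard one-sample t-test; the sketch below uses SciPy rather than the SPSS software actually employed, assumes the 370 study-sample listening scores are available in a hypothetical plain-text file, and takes the total-sample mean from Table 2 as the test value.

    import numpy as np
    from scipy import stats

    # Hypothetical file holding the 370 study-sample listening scores.
    listening = np.loadtxt("listening_study_sample.txt")

    # Test the study-sample mean against the total-sample mean in Table 2 (22.67).
    t_stat, p_value = stats.ttest_1samp(listening, popmean=22.67)
    print(round(t_stat, 2), round(p_value, 3))     # Table 3 reports t = 2.79, p = 0.006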

Generally speaking, the two samples were reasonably comparable, although

discrepancies were found regarding the percentages of test-takers from Mainland China

and Chinese native speakers. The reason why these test-takers left the background questions

unanswered remains unknown. This should be kept in mind when generalizing the findings

from this study to the total sample as well as to the whole TOEFL iBT® test-taking population.

Measures

The test used in this study was an operational form of the TOEFL iBT test

administered during the fall of 2006. The test had four sections: listening, reading,

speaking, and writing. The structure of the test is explained as follows.

Structure of the Test

The listening section had six tasks. Each listening task had a prompt followed by

5 or 6 questions. There were 34 items in total in the listening section. Each item was

scored for one point for a correct answer, and zero points for a wrong answer. Items that

were not reached or were omitted were marked as N or M, respectively. The total

possible raw score points for the listening section was 34.

The reading section had three tasks. Each reading task had a prompt followed by

12 to 14 questions. There were 41 items in total in the reading section. Thirty-eight of

them, worth one point each, were dichotomously scored. Three items were polytomously

scored, worth either two or three points. Items that were not reached or were omitted

were marked as N or M, respectively. The total possible raw score points for the reading

section was 45.

The speaking section contained six tasks. The first two tasks asked test-takers to

provide oral responses to a written prompt. These tasks were considered to be

independent because required responses were not dependent on any information provided

through other channels during the test. The other four were integrated speaking tasks.

These tasks required test-takers to provide oral responses based on the information they

received through listening or reading or both channels. Each task was rated on a 0–4

holistic scale in one-point increments. The total possible raw score points for the speaking

section was 24.

The writing section consisted of two tasks. The first task was an integrated task

that required test-takers to provide written responses based on the information they

received through listening and reading. The second one, an independent task, asked test-

takers to write in response to a written prompt. Each task was rated on a 1–5 holistic scale

at half-point intervals (up to the first decimal place). The total possible raw score points

for the writing section was 10.
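The section composition just described can be restated compactly; the numbers below simply echo the task counts and maximum raw score points given in the preceding paragraphs.

    # Task counts and maximum raw score points per section, as described above.
    sections = {
        "listening": {"tasks": 6, "max_raw": 34},   # 34 dichotomous items
        "reading":   {"tasks": 3, "max_raw": 45},   # 38 dichotomous + 3 polytomous items
        "speaking":  {"tasks": 6, "max_raw": 24},   # each task rated 0-4
        "writing":   {"tasks": 2, "max_raw": 10},   # each task rated 1-5
    }
    print(sum(s["tasks"] for s in sections.values()))   # 17 language tasks in all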

In summary, the whole test had 17 language tasks, organized into four sections by

modality. There were six listening tasks, three reading tasks, six speaking tasks, and two

writing tasks. To help understand the dynamics between language skill and the context of

language use manifested within each task, the situations of these tasks are described as

follows.

Descriptions of Task Situations

In the context of TOEFL® testing, a language task situation is characterized by

five variables: participants, content, setting, purpose, and register (Jamison et al., 2000).

The content refers to the topic of a task. The setting refers to the location of a language

act.

The first listening task (L1) was situated in a conversation between a male student

and a female biology professor at the professor’s office. The participants talked mainly

about how to prepare for an upcoming test. The nature of the interaction was consultative

with frequent turns. To complete this task, test-takers were asked to respond to five

multiple-choice questions.

The second listening task (L2) presented part of a lecture in an art history class in

a formal classroom setting. The male professor was the only participant. The language

used by the professor was formal. There was no interaction between the professor and the

audience during the task. To complete this task, test-takers were asked to respond to six

multiple-choice questions.

The third listening task (L3) presented part of a lecture in a meteorology class in a

formal classroom setting. Three participants were involved, one male professor, one male

student, and one female student. The language used was formal with periods of short

interaction between the professor and the students. To complete this task, test-takers were

asked to respond to six multiple-choice questions.

The fourth listening task (L4) was situated in a conversation between a female student

and a male employee at the university housing office. The topic of their conversation was

housing opportunities on and off campus. The nature of the interaction was consultative

with relatively short turns. To complete this task, test-takers were asked to respond to five

multiple-choice questions.

The fifth listening task (L5) presented part of a lecture in an education class in a

formal classroom setting. The participants were a female professor and a male student.

The language used was formal with periods of short interaction between the professor

and the student. To complete this task, test-takers were asked to respond to six multiple-

choice questions.

The last listening task (L6) presented part of a lecture in an environmental science

class in a formal classroom setting. The male professor was the only participant. The

language used was formal with no interaction between the professor and the audience

during the entire listening time. To complete this task, test-takers were asked to respond

to six multiple-choice questions.

In the reading section, all three reading tasks were based on academic content. In

the first task (R1), test-takers were asked to respond to 14 multiple-choice questions of

various kinds based on a reading passage on the topic of psychology. The second task

(R2) required the test-takers to respond to 14 multiple-choice questions of various kinds

based on a reading passage about archeology. The last task (R3) required the test-takers

to respond to 13 multiple-choice questions of various kinds based on a reading passage

about biology. The language used in the readings was formal in register. The settings of

these language tasks remained unknown since such information was not provided in the

task specifications.

Lack of context development was also found for the first speaking task (S1) and

the second speaking task (S2). Neither task specified the situation of language use. The

topics for both tasks were non-academic. Both tasks required test-takers to provide an

oral response to a prompt.

The third speaking task (S3) had a reading and a listening component. The topic

was the university’s plan to renovate the library. The reading component required test-

takers to read a short article in the student newspaper about the change. The listening

component involved a conversation between a male and a female student in a non-

academic setting discussing their opinions about this renovation plan. The two

participants interacted with each other frequently with relatively short turns. The test-

takers were required to give an oral response to a prompt based on the content of the

reading passage and the dialogue.

The fourth speaking task (S4) also had a reading and a listening component. The

reading component required test-takers to read an article from a biology textbook on an

academic topic. The listening component presented part of a lecture on the same topic

delivered by a male professor in a formal classroom setting. The language used in this

task was formal. There was no interaction between the professor and his audience during

the listening time. Test-takers were required to give an oral response to a prompt based

on the content of the reading passage and the lecture.

The fifth speaking task (S5) contained a listening component. The listening part

was situated in a conversation between a male and a female professor in an office setting.

The focus of their dialogue was related to the class requirements for a student. The

register of the language was consultative in nature with frequent turns between the two

participants. Test-takers were required to give an oral response to a prompt based on the

content of the dialogue.

The last speaking task (S6) had a listening component. The listening part

presented part of a lecture delivered by a female anthropology professor in a classroom

setting. The language used was formal with no interaction between the professor and the

audience. Test-takers were required to give an oral response to a prompt based on the

content of the lecture.

The first writing task (W1) was integrated with reading and listening. Test-takers

were first asked to read a passage about an academic topic and then to listen to part of a

lecture on the same topic delivered by a male professor in a classroom setting. There was

no interaction between the professor and his audience during the entire listening time.

The test-takers were required to give a written response to a prompt based on the content

of the reading passage and the lecture.

The second writing task (W2) asked test-takers to write on a non-academic topic.

The context of language use in this task remained unknown since this information was

not provided in the task specification. The test-takers were required to provide a written

response to a prompt.

Categorizing Tasks by Content and Setting

Based on the descriptions of the tasks, two broad categories emerged with regard

to task content. The first type was academically oriented. The content of these tasks was

developed mainly based on scholarly or textbook articles in the realm of natural and

social sciences. The second type involved topics that were related to courses (e.g., exam

preparation) or to life on campus (e.g., student housing). The development of these tasks

did not rely on information in a particular academic area. In other words, having previous

knowledge on a particular academic topic was not likely to interact with test-takers’

responses to this type of task. By this categorization, all 17 tasks were sorted into either

academic or non-academic. There were ten tasks with academic content and seven with

non-academic content.

The tasks were also sorted by the location where a language act occurred.

Unfortunately the information on setting was not always provided. Lack of context

development was observed for all reading tasks as well as some speaking and writing

tasks. Tasks for which the setting of language use was developed could be divided into

two groups: instructional and non-instructional. The first group of tasks involved

language acts that took place inside classrooms. In this type of setting, interactions among

the participants (if any) were usually sporadic, and the language used was

academically oriented and formal. The second type of tasks took place outside

classrooms (e.g., a professor’s office, the library) where interactions were usually more

frequent and the language used tended to be less formal. Table 4 summarizes the 17

language tasks by content and setting.

Table 4. Task Content and Context

Task Name Type Content Setting


L1 Listening Non-academic Non-instructional
L2 Listening Academic Instructional
L3 Listening Academic Instructional
L4 Listening Non-academic Non-instructional
L5 Listening Academic Instructional
L6 Listening Academic Instructional
R1 Reading Academic N/A
R2 Reading Academic N/A
R3 Reading Academic N/A
S1 Speaking Non-academic N/A
S2 Speaking Non-academic N/A
S3 Speaking Non-academic Non-instructional
S4 Speaking Academic Instructional
S5 Speaking Non-academic Non-instructional
S6 Speaking Academic Instructional
W1 Writing Academic Instructional
W2 Writing Non-academic N/A

Analysis Procedures

The dataset used in this study included 370 test-takers’ scores on 17 skill-based

language tasks. Listening and reading items that were not reached or were omitted were

assigned a score of zero. There was no missing score value for the speaking and writing

tasks.

Analyses involving only observed variables were performed by using the

Predictive Analytics SoftWare® Statistics 18 (SPSS Inc., 2009). Analyses involving latent

variables were performed by using Mplus (Muthén & Muthén, 2010).

Level of Measure

An appropriate level of measurement was chosen based on statistical

considerations and theoretical needs. This study used task scores, also called item parcel

scores, as the level of measure. For the listening and reading sections, a task score was

the total score summed across a set of items based on a common prompt. Six listening

task scores and three reading task scores were therefore obtained. A task score in the

writing and speaking sections was simply the score assigned for a task. Six speaking task

scores and two writing task scores were therefore obtained. Each task score was used in

the analysis as an observed variable. There were 17 observed variables in the study in

total. Variable names were the same as the task names listed in Table 4.

There are several reasons for choosing task scores as the level of measure.

First, one of the research questions is to examine the relationship between language skill

and the context of language use in the underlying structure of the test performance. Two

key elements used for characterizing a language use situation were content and setting,

both of which could be defined at the task level. Individual items within a task all shared

the same focus of content and setting. Secondly, using task scores instead of item scores

would allow all variables to be treated as continuous. Task scores based on

dichotomously scored listening and reading items could be treated as continuous as well

as the ratings based on the polytomously scored speaking and writing tasks. Although

other levels of measurement (such as categorical or ordinal) are permitted in structural

equation modeling (SEM), Kunnan (1998b) recommended not mixing different levels of

measurement in a single covariance or correlation matrix. The third reason for using this

level of measure is to reduce the chance that correlations among the observed

variables are extremely high. It is very likely for items based on a common prompt to

have dependence upon one another. Multicollinearity among the observed variables

can be one of the reasons why an estimation process fails to converge (Kline, 2005). Using

task scores instead of item scores, as suggested by Stricker and Rock (2008), would help to

alleviate the problem caused by the dependence among items associated with a common

prompt in this study. The last reason comes out of concern for sample size needed for the

planned multivariate analysis. The more variables used in an analysis, the larger the sample

size needed. Using task scores would help to reduce the number of observed variables

in this analysis, and therefore to reduce the sample size needed for the study.
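A minimal sketch of forming the task (item-parcel) scores with pandas, assuming a hypothetical item-level file in which column names carry the parent task as a prefix (e.g., L1_01 ... L1_05); the actual layout of the public dataset may differ.

    import pandas as pd

    items = pd.read_csv("item_scores.csv")        # hypothetical item-level file

    def parcel(df, task):
        # Sum the dichotomously scored items belonging to one listening or reading task.
        cols = [c for c in df.columns if c.startswith(task + "_")]
        return df[cols].sum(axis=1)

    task_names = ["L1", "L2", "L3", "L4", "L5", "L6", "R1", "R2", "R3"]
    tasks = pd.DataFrame({name: parcel(items, name) for name in task_names})

    # The speaking and writing ratings (S1-S6, W1, W2) are already task-level scores
    # and would simply be appended unchanged, yielding the 17 observed variables.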

Distribution of Values

Choosing an appropriate estimation procedure for multivariate analysis

depends on the distribution of observed values. The most commonly used estimation

methods (e.g., maximum likelihood estimation) assume univariate and multivariate

normality of observed variables. Assumptions regarding univariate and multivariate

normality were inspected. Univariate normality was checked by examining the skewness

and kurtosis indices, and by examining the plots of score distributions. Multivariate

normality was evaluated based on the results of univariate normality inspection, as



suggested by Kline (2005). The distribution of the values was examined so that an

informed decision could be made regarding choosing an appropriate estimation method.
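A minimal sketch of that univariate screening using SciPy, assuming the 17 task scores sit in a hypothetical CSV file; SciPy reports excess kurtosis, so values near zero for both indices are consistent with approximate normality.

    import pandas as pd
    from scipy.stats import skew, kurtosis

    tasks = pd.read_csv("task_scores.csv")        # hypothetical file with 17 task-score columns
    for col in tasks.columns:
        print(col,
              round(skew(tasks[col]), 2),         # skewness index
              round(kurtosis(tasks[col]), 2))     # excess kurtosis (0 under normality)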

Linearity and Multicollinearity

Most estimation procedures for multivariate analysis are based either on the

covariance or the correlation matrix of the observed variables. If the relationship

between two variables is not linear, this non-linearity will not be captured in the

correlation coefficients. In this case, a nonlinear approach needs to be adopted. Linearity

was examined using scatter plots of all possible pairs of the variables in this study.

Multicollinearity occurs when the correlations among pairs of variables are

extremely high. If one variable is highly correlated with another, then it means that one of

the variables is redundant in terms of measuring the construct. Kline (2005) suggested

either eliminating redundant variables or combining them into a composite variable to avoid

multicollinearity in the data. Pairwise multicollinearity was checked by inspecting the

correlation matrix of the variables in this study.
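A minimal sketch of the pairwise screening: compute the correlation matrix of the 17 task scores and flag any pair whose correlation is extremely high (0.9 is used below purely as an illustrative cutoff).

    import pandas as pd

    tasks = pd.read_csv("task_scores.csv")        # hypothetical file with 17 task-score columns
    corr = tasks.corr()

    # Report pairs of observed variables with extremely high correlations.
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if abs(corr.loc[a, b]) > 0.9:         # illustrative cutoff
                print(a, b, round(corr.loc[a, b], 2))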

Estimation Method

The default estimator for latent analysis with continuous variables is maximum

likelihood (ML) in Mplus. The ML method estimates parameters with

conventional standard errors and a chi-square test statistic. Since ML estimation is sensitive

to non-normality, when the distributions of the observed variables are non-normal, a

corrected normal theory method should be used to avoid bias caused by non-normality in

the dataset, as recommended by Kline (2005). By using a corrected method, the original

data is analyzed using a normal theory method (ML in this case), but the estimates of

standard errors are robust to non-normality and the test statistics are corrected. Mplus

provides such a corrected normal theory estimation method (called MLM), which

produces maximum likelihood parameter estimates with standard errors and a mean-

adjusted chi-square test statistic that are robust to non-normality. The MLM chi-square is

also referred to as the Satorra-Bentler chi-square (Muthén & Muthén, 2010).

Assessing Model Fit

The adequacy and appropriateness of the models were evaluated based on two

criteria: (1) the values of selected overall model fit indices, and (2) the significance and

reasonableness of individual parameter estimates. The selection of model fit indices used

in this study was based on Kline’s (2005) suggestions. Below is a brief description of

each model fit index.

Chi-Square Test of Model Fit

The value of the chi-square (χ2) statistic reflects the distance in fit between the

model-implied variance-covariance structure and the variance-covariance structure of the

observed data. A χ2 test evaluates the statistical significance of this distance. The lower

the value is, the closer the two structures are, and therefore the better the model

corresponds to the observed data.

Normed Chi-Square

The model chi-square discussed previously has a drawback: it is

sensitive to sample size. The value tends to be high when the sample size is large. This

could lead to the rejection of a model whose deviation from the observed data structure may

not be practically meaningful (Bollen, 1989). Dividing the chi-square value by the degrees of freedom

(df) can be used to reduce the sensitivity of chi-square to sample size (Kline, 2005). The

result is a lower value referred to as the normed chi-square (χ2/df). A ratio of less than 3 is an

indicator of good model fit, as recommended by Kline (1998). This criterion was adopted in

the current study.

Root Mean Square Error of Approximation

Neither χ2 nor χ2/df has a built-in mechanism that corrects for model complexity.

Normally speaking, if two models show equivalent fit to the same data, the simpler one is

preferred over the more complex one based on the principle of parsimony. When dealing

with the same data, the simpler a model is, the fewer parameters are estimated, and the

higher the degrees of freedom are. The root mean square error of approximation

(RMSEA) is a parsimony-adjusted index that corrects for model complexity, and

therefore favors simpler models. A value of zero indicates a perfect fit. The higher the

value goes, the worse the fit gets. An RMSEA smaller than 0.05 can be interpreted as a sign

of good model fit, while values between 0.05 and 0.08 indicate a reasonable error of

approximation (Browne & Cudeck, 1993). This criterion was adopted in the current study.
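The two indices described so far can be computed directly from a model's chi-square; the sketch below uses the standard point-estimate formula for RMSEA, which is an approximation and not necessarily identical to the value Mplus reports under MLM estimation, and the numbers in the example are purely hypothetical.

    import math

    def normed_chi_square(chi2, df):
        # Values below 3 are taken as a sign of good fit (Kline, 1998).
        return chi2 / df

    def rmsea(chi2, df, n):
        # Standard point estimate: sqrt(max(chi2 - df, 0) / (df * (n - 1))).
        return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

    # Hypothetical illustration: chi-square of 250 on 115 df with n = 370 test-takers.
    print(round(normed_chi_square(250, 115), 2))   # 2.17, below the cutoff of 3
    print(round(rmsea(250, 115, 370), 3))          # 0.056, between 0.05 and 0.08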

Comparative Fit Index

The comparative fit index (CFI) compares the fit of the specified model to the fit

of a baseline model which assumes zero covariances among the observed variables.

Because it is usually unrealistic to assume that variables are uncorrelated, the fit of a

baseline model is often very poor. The improvement in fit over the baseline model shows

how much better the specified model is. A rule of thumb, suggested by Hu and Bentler

(1999), is that a CFI value larger than 0.9 shows the specified model has a good fit. This

criterion was adopted in the current study.



Standardized Root Mean Square Residual

The standardized root mean square residual (SRMR) is an absolute fit index

which is based on the mean absolute correlation residual. The size of a correlation

residual indicates how far an observed correlation is from the model-implied one. An

SRMR value of zero shows that there is no difference between the two correlation

matrices, indicating a perfect model fit. An SRMR value of less than 0.1 is commonly

considered as a sign of acceptable fit (Kline, 2005). This criterion was adopted in the

current study.

Individual Parameter Estimates

Parameter estimates can be examined for appropriateness and significance. The

sign of an estimate should be checked to ensure that the meaning of the estimate is

theoretically sound. The value of an estimate divided by its standard error provides a test

statistic that can be used to evaluate the significance of the estimate. Multicollinearity

among latent factors can also be detected by examining their estimated correlations. An

extremely high correlation estimate between two factors indicates a linear dependency

among the factors. This means that the factors are empirically indistinguishable, which

makes a model implausible. Previous researchers (Sawaki et al., 2008; Stricker et al.,

2005; Stricker & Rock, 2008) used a value of 0.9 to screen out extremely high

correlations among factors. This criterion was adopted in the current study.
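A minimal sketch of these two checks with hypothetical numbers: the ratio of an estimate to its standard error evaluated as a z statistic, and a screen of estimated factor correlations against the 0.9 value.

    from scipy.stats import norm

    def estimate_z_test(estimate, std_error):
        # Ratio of a parameter estimate to its standard error, with a two-tailed p value.
        z = estimate / std_error
        return z, 2 * norm.sf(abs(z))

    print(estimate_z_test(0.85, 0.07))        # hypothetical loading: z is about 12.1, p < .001

    # Hypothetical factor correlation estimates screened against the 0.9 criterion.
    factor_corrs = {("L", "R"): 0.82, ("L", "S"): 0.74, ("R", "W"): 0.91}
    print({pair: r for pair, r in factor_corrs.items() if abs(r) > 0.9})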

Establishing the Baseline Model

A confirmatory approach to factor analysis was adopted to respond to research

question one by examining whether the structure of communicative language ability

measured by the test conforms to the test design, the score-reporting scheme, and the TOEFL®

literature. A series of confirmatory factor analyses was conducted to find out whether a

previously established factor model with similar test data could also be compatible with

the data in this study.

Hypothesis 1 states that the structure of the communicative language ability

measured by the TOEFL iBT® test can be best explained by a higher-order model, and

can also be explained adequately by a correlated four-factor model and a correlated two-

factor model. A higher-order model, a correlated four-factor model, and a correlated two-

factor model were specified a priori and tested for fit as competing models. All three

models were shown to be compatible with previous TOEFL test data. Integrated speaking

tasks whose completion required language processing in multiple modalities were found

to load on the target modality (speaking), whereas integrated writing tasks were found to

load on the designated writing factor (Sawaki et al., 2008; Sawaki et al., 2009).

Therefore the integrated speaking and writing tasks were specified to load on their target

modality in all three models. As a result of testing the first hypothesis, the model that

best represented the latent structure of the dataset used in this study was established as

the baseline model, on which future analysis was based.

Figure 1 illustrates the relationships among the observed variables, latent factors,

and residual/unique variances in the higher-order model. The observed variables are

represented by the rectangular boxes. Latent variables, including the four skill factors and

the residuals, are indicated by the ellipses. The six listening variables (L1–L6) loaded on

a common factor which could be referred to as the listening factor (L). In other words, the

six listening tasks were the indicators of the presumed listening factor. The three reading

tasks (R1–R3) loaded on a common factor which could be interpreted as the reading factor

(R). A presumed speaking factor (S) was responsible for the relationships among the six

speaking tasks (S1–S6). The two writing variables (W1 and W2) shared a common

underlying factor, possibly a writing factor (W). These were the four first-order factors

corresponding to the four modalities. The relationships among the first-order factors were

subsumed under the influence of a higher-order factor, a hypothetical general language

ability factor (G). In other words, the first-order factors were constrained to interact with

one another only through the higher-order factor. The higher-order factor represented a

common underlying dimension across the four first-order factors. The residual variances

(E1–E17), the part of the variance of an indicator that could not be explained by its

respective factor in the model, were uncorrelated with one another. Also referred to as the

measurement errors, the residual variances reflected how reliably or

unreliably an indicator measured its latent factor.

The constituents and their relationships in the correlated four-factor model are

illustrated in Figure 2. In the absence of a higher-order factor, the four factors, each

corresponding to its respective modality, were modeled to correlate with one another.

The correlated two-factor model (Figure 3) was nested within the correlated four-

factor model by constraining the correlations among the listening, reading, and writing

factors to be one. In this model, variables from the listening, reading, and writing sections

all loaded on a common factor, a presumed non-speaking factor (L/R/W). The six

speaking variables loaded on the second factor, probably a speaking factor (S). The two

factors were allowed to covary.



Figure 1. Higher-Order Factor Model



Figure 2. Correlated Four-Factor Model



Figure 3. Correlated Two-Factor Model
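For illustration, the measurement structures just described can be written in lavaan-style syntax and fitted with the Python semopy package; this is only a sketch of the model specifications, since the study itself fitted the models in Mplus with the MLM estimator, and the data file name is hypothetical.

    import pandas as pd
    from semopy import Model, calc_stats

    higher_order = """
    L =~ L1 + L2 + L3 + L4 + L5 + L6
    R =~ R1 + R2 + R3
    S =~ S1 + S2 + S3 + S4 + S5 + S6
    W =~ W1 + W2
    G =~ L + R + S + W
    """

    # Dropping the higher-order factor G leaves the four first-order factors free
    # to correlate, which corresponds to the correlated four-factor model.
    correlated_four = "\n".join(higher_order.strip().splitlines()[:-1])

    tasks = pd.read_csv("task_scores.csv")        # hypothetical file with 17 task scores
    for description in (higher_order, correlated_four):
        model = Model(description)
        model.fit(tasks)
        print(calc_stats(model))                  # chi-square, CFI, RMSEA, and other indices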

Modeling the Context of Language Use

The baseline model, confirmed in the step above, provided a platform for testing

the role of context in defining communicative language ability as measured by the

TOEFL iBT® test. Next, situational factors were added to the baseline model, and these

expanded models were evaluated through a series of

confirmatory factor analyses.

Language tasks were categorized with regard to the context of language use. Task

grouping was based on two key elements: content and setting. On the dimension of

content, all tasks were categorized into either academic or non-academic. On the

dimension of setting, when the setting of a task was provided, the task was labeled as

either occurring in a non-instructional or an instructional setting. Tasks without sufficient

setting development were grouped into a third category.

Hypothesis 2 asserts that adding a dimension of content factors to the baseline

model improves model fit, and therefore demonstrates the role of context in defining

communicative language ability. A dimension of content factors was imposed on the

baseline model from the previous hypothesis testing, and this two-dimensional model was

evaluated for fit. In this model, each task loaded on two factors, a skill-based factor and a

content-based factor. Ten tasks loaded on a common content factor associated with

academic material, whereas the other seven loaded on a non-academic content factor.

Imposing the second dimension was expected to help explain the relationships

among the observed variables together with the common skill-based factors. In the

baseline model, the residual variances were modeled to be uncorrelated on the

assumption that they were unique to their respective variables, and were not associated

with one another in a systematic way. However, whether these residual variances are

truly unique becomes questionable once the context of language use is considered.

Successful model testing would indicate that performance on language tasks could be

accounted for by the situational factors as well as the skill-based factors, and would therefore

confirm the role of context in defining communicative language ability.

Hypothesis 3 states that adding a dimension of setting factors to the baseline

model improves model fit, and therefore demonstrates the role of context in defining

communicative language ability. A dimension of setting factors was imposed on the

baseline model, and this two-dimensional model was tested for fit. In this model, each

task loaded on two factors, a skill-based factor and a setting-based factor. Seven tasks

loaded on a common instructional setting factor, and four loaded on a common non-

instructional setting factor. The remaining six tasks without setting development loaded on a

third setting factor. Successful model testing would indicate that performance on language

tasks could be accounted for by the situational factors as well as the skill-based factors,

and would therefore confirm the role of context in defining communicative language ability.

Based on the results of the analysis above, a model that best represented the test

construct was chosen as the final model for the entire sample to be used in the following

multi-group analysis.

Multi-Group Invariance Analysis

The following multi-group invariance analysis investigated the second research

question: whether the configuration of communicative language ability, as measured by

the TOEFL iBT® test, has equivalent representation across two groups of test-takers, with

one group having been exposed to an English-speaking environment and the other

without such experience. In other words, the results would inform us whether group

membership moderated the relations among the variables in the factor model. In all steps

of the invariance analysis, the two groups were tested simultaneously.



Hypothesis 4 states that two groups of test-takers, one with study-abroad

experience and the other without such experience, differ in the underlying configuration

of their language ability. Based on self-reported background information, one group of

124 test-takers (Group I) had never lived in an English language environment before

taking the test. The other group of 246 test-takers (Group II) had experience of living in

the target language community prior to taking the test. The multi-group invariance analysis was

executed simultaneously across these two groups in a step-wise fashion.

In the planned multi-group invariance analysis, the measurement component in

the model was prioritized over the structural component. The measurement component

consisted of the number of the factors, the relationship between the factors and their

respective indicators, and the residual variances. The measurement part defined the

meanings of the factors by specifying how they were measured and what their indicators

were. The structural component included the variances and covariances of the factors. It

specified the relationships among the latent factors.

It would only make sense to ensure the equality of factor relationships after it has

been established that the factors have the same meanings across groups. This is why the

measurement component should be tested for equality first. If successful, then testing the

equality of the structural component can proceed.

Furthermore, a mean structure was imposed in the multi-group invariance analysis

to examine group mean differences on the latent factors. The mean structure was included in all

steps of the analysis so that the nested models remained comparable. Group mean

differences on the latent factors could be inspected only if measurement invariance held.



Therefore the equality of factor means was tested when the structural component was

inspected.

None of the studies reviewed incorporated a mean structure. Bae and Bachman

(1998) recommended the use of mean structures as a direction for future research. As these

authors pointed out, a latent mean structure approach allows measurement error to be

accounted for when investigating group differences.

Measurement Invariance

Three steps were planned in testing measurement invariance in a hierarchically

ordered manner. The first step was to test the equivalence of the overall factor structure

across the groups. The same factor structure was imposed on both groups

simultaneously. Parameter estimates in one group were allowed to vary from the ones in

the other group. The result of this step answered the question whether the factors had the

same meanings across the groups.

The second step was to test the equivalence of factor loadings across the groups.

Equality constraints were imposed on all factor loadings across the groups. Factor loading

estimates in one group were not allowed to be different from the ones in the other group.

Residual variances were allowed to differ. This was a more restrictive model compared to

the model tested in the first step. Model fit was evaluated. Since the two models were

nested, a chi-square difference test was conducted to evaluate which model should be kept.

A non-significant chi-square difference would indicate that the fit of the

more restrictive model did not deteriorate enough to justify adopting the more

liberal model. In this case, the more restrictive model would be adopted. The result of this

step would tell us whether the indicators measured the factors in a comparable way across

the groups.
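
The nested-model decision rule just described can be restated compactly. The sketch below assumes ordinary (unscaled) maximum likelihood chi-square values and uses hypothetical numbers that are not estimates from this study; with the Satorra-Bentler statistic used here, a scaled difference rather than a simple subtraction is normally required.

```python
# A minimal sketch of the nested-model decision rule, assuming ordinary
# (unscaled) maximum likelihood chi-square values; the figures below are
# hypothetical and are not estimates from this study.
from scipy.stats import chi2

def keep_restrictive_model(chisq_restrictive, df_restrictive,
                           chisq_liberal, df_liberal, alpha=0.05):
    """Return True if the more restrictive (constrained) model should be retained."""
    chisq_diff = chisq_restrictive - chisq_liberal   # fit can only worsen with added constraints
    df_diff = df_restrictive - df_liberal            # constraints add degrees of freedom
    p_value = chi2.sf(chisq_diff, df_diff)           # upper-tail probability of the difference
    return p_value > alpha                           # non-significant -> keep the simpler model

# Hypothetical example: the constrained model adds 15 df and 16 chi-square units.
print(keep_restrictive_model(216.0, 115, 200.0, 100))   # True
```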

The third step was to test the equivalence of residual variances in both groups.

Equality constraints were imposed on the residual variances along with the factor loadings

across the groups. This was a more restrictive model compared to the model tested in the

second step. Model fit was evaluated, and a chi-square difference test could be conducted

to evaluate which model should be adopted. The result of this step would inform us

whether the indicators in one group were as reliable as the ones in the other group at

measuring the factors.

Structural Invariance

Factor means, variances, and covariances were the targets of this investigation.

The common practice is to test the invariance of factor means and covariances before

testing factor variances because it is usually expected for groups to differ in their

variabilities on the common factors (Kline, 2005).

The first step was to test the invariance of factor means. The means of the factor

indicators (also called endogenous variables) were estimated as intercepts. The estimated

means of the indicators, the intercepts, were constrained to be equal across the groups for

model identification purpose. They were held equal so that the differences of factor

means could be estimated. The means of the latent factors (also called exogenous

variables) were fixed to be zero in one group and free to be estimated in the comparison

group. Model fit was evaluated. A chi-square difference test, between the current and its

preceding model during testing measurement invariance, was conducted to evaluate



which model should be adopted. The result would indicate the estimated relative

differences in factor means across the groups.

In the next step the equality of factor covariances was tested. These parameters

were held invariant for both groups. Model fit was evaluated, and a chi-square difference

test was conducted to evaluate which model should be adopted. This would be followed

by a test of factor variance invariance if the factor covariances could be held equal.

Building Structural Equation Models

The previous multi-group invariance analysis investigated whether or not having

study-abroad experience had any effect on the development of the language ability. In

this section, structural equation models were built and tested to investigate the third

research question: what the impact of the length of study-abroad and classroom learning is on

the development of communicative language ability, as measured by the TOEFL iBT®

test.

Three TTCs were introduced as independent variables. One was the length of time

spent studying English. The second one was the length of time spent in content classes

taught in English. The last one was the length of time spent living in an English-speaking

country. The first two characteristics concerned English language training in a

formal setting. The last one concerned English language contact experience.

The information was obtained based on test-takers’ responses to the background

questions. For Group I test-takers with no study-abroad experience, only the first two

variables were relevant. For Group II test-takers with study-abroad experience, all three

variables were relevant. Model testing was conducted separately for Group I and Group

II, because each group had a different set of relevant background variables.

Hypothesis 5 asserts that for test-takers who have no study-abroad experience (the

home-country group) the development of their language ability is associated with the

length of formal learning. To test this hypothesis, two independent variables – the time

spent studying English and the time spent in content classes taught in English – were

modeled to have direct associations with the language ability. This model was subjected to

a test of fit, and the significance of the direct effects was also evaluated.

Hypothesis 6 proclaims that for test-takers who have had study-abroad experience

(the study-abroad group) the development of their language ability is associated with the

length of formal learning and the length of study-abroad. Three independent variables

were modeled to have direct associations with the language ability. They were: the time

spent studying English, the time spent in content classes taught in English, and the length

of study-abroad experience. This model was subjected to a test of fit, and the significance of

the direct effects was also evaluated.



CHAPTER FOUR

ANALYSIS AND RESULTS

This chapter reports the results of testing the hypotheses put forward in the

previous chapter in order to answer the three research questions. Outcomes of

establishing a model for the entire sample are reported first, followed by results from

multi-group invariance analysis across groups of test-takers with different context-of-

learning experiences. Last, the results of evaluating two unique structural equation

models are reported.

Preliminary Analysis

Table 5 summarizes the descriptive statistics for the observed variables, including

possible score range, mean, standard deviation, kurtosis, skewness, and z scores for the

kurtosis and skewness values. Variables L1 to L6 refer to Listening Task One to

Listening Task Six. Variables R1 to R3 represent Reading Task One to Reading Task

Three. Variables S1 to S6 correspond to Speaking Task One to Speaking Task Six.

Variables W1 and W2 refer to Writing Task One and Writing Task Two.

First, the assumptions of univariate and multivariate normality were checked. Z

scores reported in the tables can be used to test univariate normality. Z scores for the

kurtosis values were significant at p < .01 for variables L3, L4, L6, R3, and W1. Z scores

for the skewness values were significant at p < .01 for the following variables: L1, L4,

L5, L6, and R2. Since this was a relatively large sample (N=370), these z scores were

interpreted with reservation. Instead, the absolute values of the kurtosis and skewness

statistics as well as the shape of the distributions were used to evaluate normality.

Table 5. Descriptive Statistics for the Observed Variables

Variable   Range   Mean    Std. Dev.   Kurtosis   Z Kurtosis (SE=.253)   Skewness   Z Skewness (SE=.127)
L1 0-5 3.33 1.138 -.264 -1.043 -0.373 -2.937
L2 0-6 3.57 1.376 -.581 -2.296 -0.098 -.772
L3 0-6 2.97 1.560 -.734 -2.901 0.179 1.409
L4 0-5 4.44 0.888 4.245 16.779 -1.911 -15.047
L5 0-6 4.37 1.297 -.259 -1.024 -0.637 -5.016
L6 0-6 4.78 1.384 .976 3.858 -1.223 -9.630
R1 0-15 6.94 2.725 -.343 -1.356 0.285 2.244
R2 0-15 10.06 3.027 -.652 -2.577 -0.393 -3.094
R3 0-15 9.98 3.064 -.963 -3.806 -0.119 -.937
S1 0-4 2.51 0.759 -.340 -1.344 -0.028 -.220
S2 0-4 2.62 0.805 -.508 -2.008 -0.016 -.126
S3 0-4 2.50 0.755 .073 .289 -0.086 -.677
S4 0-4 2.39 0.827 -0.027 -.107 -0.036 -.283
S5 0-4 2.58 0.810 0.172 .680 -0.150 -1.181
S6 0-4 2.53 0.856 -0.115 .455 -0.132 -1.039
W1 1-5 3.23 1.148 -0.690 -2.727 -0.289 -2.276
W2 1-5 3.46 0.817 -0.169 -.668 -0.020 -.157

The values of kurtosis and skewness are zero in a normal distribution. As Table 5

shows, except for variables L4 and L6, the values of skewness and kurtosis were all

within the acceptable range of -1 to 1, indicating that univariate normality could be

held for these variables. Variable L4 had a kurtosis value of 4.25 and a skewness value of

-1.91. Variable L6 had a kurtosis value of 0.98 and a skewness value of -1.22. Examining

the histograms of these two variables revealed that both distributions exhibited a ceiling

effect. Univariate normality could not be held in these two cases. Having two extremely

non-normal variables indicated that this set of variables could deviate from multivariate

normality. These facts were taken into consideration when choosing an appropriate

estimation procedure for this group of variables.
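
As a concrete restatement of the screening procedure just described, the z columns of Table 5 can be reproduced by dividing each statistic by its standard error and flagging absolute values beyond 2.58, the two-tailed critical value at p < .01. The sketch below uses three variables from Table 5; as noted above, the final decision relied on the absolute skewness and kurtosis values because of the large sample size.

```python
# Reproducing the z columns of Table 5 for a few variables (values from Table 5;
# SE of kurtosis = .253, SE of skewness = .127).
stats = {  # variable: (kurtosis, skewness)
    "L1": (-0.264, -0.373),
    "L4": (4.245, -1.911),
    "S1": (-0.340, -0.028),
}
SE_KURT, SE_SKEW = 0.253, 0.127
CRITICAL = 2.58  # two-tailed critical z at p < .01

for var, (kurt, skew) in stats.items():
    z_kurt, z_skew = kurt / SE_KURT, skew / SE_SKEW
    flagged = max(abs(z_kurt), abs(z_skew)) > CRITICAL
    print(f"{var}: z_kurtosis = {z_kurt:.3f}, z_skewness = {z_skew:.3f}, "
          f"{'significant at p < .01' if flagged else 'within expectation'}")
```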



Table 6. Correlations of the Observed Variables

L1 L2 L3 L4 L5 L6 R1 R2 R3 S1 S2 S3 S4 S5 S6 W1 W2
L1 1
L2 .35 1
L3 .38 .50 1
L4 .33 .31 .28 1
L5 .41 .39 .40 .40 1
L6 .41 .42 .44 .49 .46 1
R1 .36 .43 .45 .31 .36 .44 1
R2 .36 .49 .48 .39 .43 .55 .54 1
R3 .37 .44 .49 .38 .41 .51 .56 .65 1
S1 .38 .42 .42 .37 .43 .45 .40 .32 .40 1
S2 .34 .36 .36 .39 .32 .38 .39 .33 .33 .58 1
S3 .39 .37 .38 .45 .38 .43 .35 .37 .40 .56 .57 1
S4 .34 .35 .40 .39 .37 .45 .36 .33 .42 .63 .55 .57 1
S5 .38 .37 .41 .45 .42 .46 .44 .38 .44 .56 .58 .57 .60 1
S6 .41 .39 .41 .48 .45 .49 .38 .39 .37 .62 .64 .61 .57 .64 1
W1 .50 .46 .51 .43 .50 .54 .53 .59 .58 .48 .46 .51 .43 .55 .52 1
W2 .45 .49 .48 .44 .48 .48 .50 .50 .54 .57 .60 .56 .52 .63 .56 .61 1

Next, the linearity and multicollinearity of the variables were scrutinized to ensure

that the variables were represented in the dataset appropriately. Linearity was examined

using scatter plots of all possible pairs of the variables. No violation of linearity was

found. Pairwise multicollinearity was checked by inspecting the correlation matrix of the

variables. As shown in Table 6, dependence among all pairs of variables was moderate

(.28–.65). No extremely high value of correlation coefficient was found.

The results of the preliminary analysis indicated that univariate non-normality

was detected with two variables, which suggested that the distribution of the set of

variables could deviate from multivariate normality. It was then decided to implement a

corrected normal theory estimation method in the multivariate analysis to avoid bias

caused by non-normality in the dataset. The MLM estimator provided by Mplus was

implemented. The chi-square statistic generated by the MLM estimator in Mplus is a

corrected test statistic known as Satorra-Bentler test statistic (χ2S-B) (Satorra & Bentler,

1994). This test statistic is mean-adjusted and robust to non-normality (Muthén &

Muthén, 2010).

The Baseline Model

Results from the previous studies revealed that a couple of factor models could be

adequate at accounting for the underlying structure of TOEFL® test performance (Sawaki

et al., 2008; Stricker & Rock, 2008; Stricker et al., 2005). A higher-order model (Figure

1), a correlated four-factor model (Figure 2), and a correlated two-factor model (Figure 3)

were all proved to be possible factor solutions in the past. To establish the baseline model

all three competing models were tested for fit with the current data through a series of

confirmatory factor analyses. The selected fit indices, summarized in Table 7, were used

to evaluate model fit. Model df refers to a model’s degrees of freedom. χ2S-B refers to the

Satorra-Bentler chi-square test statistic. χ2S-B /df refers to the normed Satorra-Bentler chi-

square test statistic. CFI refers to the comparative fit index. RMSEA refers to the root

mean square error of approximation. SRMR refers to the standardized root mean square

residual.

Table 7. Fit Indices for the Three Competing Models

Model                          Model df   χ2S-B     χ2S-B /df   CFI     RMSEA   SRMR

Higher-order factor model      115        215.125   1.871       0.969   0.049   0.038
Correlated four-factor model   113        185.882   1.645       0.977   0.042   0.033
Correlated two-factor model    118        268.486   2.275       0.953   0.059   0.043

The degrees of freedom (df) indicates how parsimonious or saturated a model is.

A model’s degrees of freedom can be determined by subtracting the number of free

parameters from the number of unique elements in the variance-covariance structure of

the data. A free parameter is a parameter that is free to be estimated during model

estimation. On the other hand, a fixed parameter is one whose value is determined

without model estimation. With the same data structure, the more free parameters a

model is specified to estimate, the less parsimonious or more saturated the model is, and

the lower the degrees of freedom are. Model fit generally deteriorates when model

complexity decreases because there are fewer free parameters to estimate in a more

parsimonious model.
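
As a rough illustration of this bookkeeping, the 118 degrees of freedom reported for the correlated two-factor model can be reproduced from the 17 observed variables, assuming the unit loading identification described later in this chapter (one loading per factor fixed at one).

```python
# Degrees of freedom = unique variance-covariance elements - free parameters.
p = 17                                   # observed variables (tasks)
unique_elements = p * (p + 1) // 2       # 153 unique variances and covariances

free_loadings = 17 - 2                   # one loading per factor fixed at 1 for scaling
residual_variances = 17                  # one residual variance per indicator
factor_variances = 2                     # two correlated factors
factor_covariance = 1                    # one covariance between the factors
free_parameters = (free_loadings + residual_variances
                   + factor_variances + factor_covariance)   # 35

print(unique_elements - free_parameters)  # 153 - 35 = 118, as in Table 7
```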

The correlated four-factor model with 113 degrees of freedom was the most

saturated model among the three. The correlated two-factor model was the most

parsimonious one with 118 degrees of freedom. All fit indices deteriorated when model

complexity decreased from the correlated four-factor model to the correlated two-factor

model.

First, the criteria pre-determined based on the relevant literature were used to

evaluate overall model fit. All three chi-square values (χ2S-B) were significant (p < .001),

which put model fit in doubt. However, as discussed, the value of the model chi-square

should be interpreted with caution because this test statistic is highly sensitive to sample

size. To reduce the sensitivity of chi-square to sample size, the chi-square values were

divided by the degrees of freedom. As the normed chi-square (χ2S-B /df) values showed,

the ratios were all well below 3, which indicated that all three models fit the data well.

The values of comparative fit indices (CFI) were all larger than 0.9. This meant

that the fit of all three models improved substantially relative to their respective

null (independence) models, which assume no covariances among the variables.

The values of root mean square error of approximation (RMSEA) for the two

more saturated models (higher-order model and correlated four-factor model) were below

0.05, and the value for the most parsimonious one (correlated two-factor model) was

between 0.05 and 0.08. This outcome could be interpreted as a sign of good fit for all

three models when model complexity was accounted for.

The values of the standardized root mean square residual (SRMR) were all well below

0.1. This indicated that the observed correlation matrix did not differ substantially from the

model-implied ones. All three models were considered to have good fit.

The selected fit indices for all three models were satisfactory except for the model

chi-square values. Therefore, on the global level all three models demonstrated

reasonable fit to the data.
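
The global screening just summarized can be restated compactly. The sketch below applies the cutoffs named in the text (normed chi-square below 3, CFI above 0.9, RMSEA no greater than 0.08, and SRMR below 0.1) to the values reported in Table 7.

```python
# Checking the Table 7 values against the pre-determined global fit criteria.
models = {
    # model: (df, Satorra-Bentler chi-square, CFI, RMSEA, SRMR)
    "Higher-order factor model":    (115, 215.125, 0.969, 0.049, 0.038),
    "Correlated four-factor model": (113, 185.882, 0.977, 0.042, 0.033),
    "Correlated two-factor model":  (118, 268.486, 0.953, 0.059, 0.043),
}
for name, (df, chisq, cfi, rmsea, srmr) in models.items():
    passes = (chisq / df < 3) and (cfi > 0.9) and (rmsea <= 0.08) and (srmr < 0.1)
    print(f"{name}: normed chi-square = {chisq / df:.3f}, acceptable fit = {passes}")
```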

Next, individual parameter estimates were examined for appropriateness and

significance. The results of testing the higher-order model showed that the estimated

residual variance of the writing factor was negative, and the estimated correlation

between the higher-order factor and the writing factor was larger than one. These findings

signaled problems in model specification, and therefore made the model inadmissible.

Although this higher-order model was confirmed by previous researchers with similar

TOEFL® test data, this model was not compatible with the current dataset. One possible

reason could be that the sample used in this study was a much smaller one (N=370)

compared to the one (N=2070) used in Sawaki et al. (2008) and Stricker & Rock (2008). 3

Regarding the correlated four-factor model it was detected that the correlation

between the listening and the writing factor was estimated as high as 0.97, larger than the

0.9 acceptance level.4 This high level of correlation indicated a linear dependence

between the factors, meaning that the factors were not distinct enough to be considered as

two separate factors. It was then decided that this model based on the study sample

(N=370) was also inadmissible.5

An examination of the result from testing the two-factor model showed that all

parameter estimates were appropriate and significant. Taking all criteria into

consideration, this model provided the best explanation to the data, and therefore was

adopted as the baseline model for the following analysis.6

The Context of Language Use

As the outcome of the previous analysis, the correlated two-factor model was established as the

baseline model. Building upon this baseline model, testing the role of

context in the internal structure of the test performance was pursued next.

The Content Dimension

One key element in defining context, content of a language task, was tested for its

ability in accounting for the relationships among the variables along with the skill-based

factors. Previous task analysis showed that ten language tasks were associated with

academic material, whereas the other seven were related to non-academic content. A

content factor dimension was imposed on the baseline model (Figure 4). Two content

factors were specified, academic and non-academic.7

As illustrated in Figure 4, all language tasks were specified to load on two factors,

a skill-based factor and a content factor. Taking the first speaking task (S1) as an

example, this task loaded with the other speaking tasks on the speaking factor. This task

also loaded on a non-academic content factor since the task was not related to academic

content. Along this content dimension, ten tasks loaded on a common content factor

associated with academic material, while the other seven loaded on a non-academic

content factor.

This two-dimensional model was tested for fit. The result indicated that

convergence could not be reached. No overall model fit index was reported. Individual

parameter estimates were reported without standard errors; therefore, the significance of

the parameter estimates could not be evaluated.

Adding a second dimension of content brought in severe problems in model

specification. The content dimension failed to capture the relationships among the

variables in conjunction with the two skill-based factors based on the test performance of

the study sample (N=370).8 This model was inadmissible, and therefore was discarded

from any future analysis.

The Setting Dimension

Another key element in defining context, setting of a language act, was also tested

for its ability to account for the relationships among the variables along with the skill-

based factors. Previous task analysis showed that seven tasks were situated in an

instructional setting, and four were in a non-instructional setting. The information on

setting for the remaining six tasks could not be identified due to lack of context

development. A setting factor dimension was imposed on the baseline model (Figure 5).

Three setting-related factors were specified, instructional, non-instructional, and not

available (N/A).9

Figure 4. Correlated Two-Factor Model with a Content Dimension

As illustrated in Figure 5, all language tasks were specified to load on two factors,

a skill-based factor and a setting factor. Taking the third speaking task (S3) as an

example, this task loaded with the other speaking tasks on the speaking factor. This task

also loaded on a non-instructional setting factor since it was not situated in an

instructional environment. Along this setting dimension, seven tasks loaded on a common

instructional setting factor, and four on a common non-instructional setting factor. The

remaining six tasks without context development all loaded on a third setting factor

(N/A).

Figure 5. Correlated Two-Factor Model with a Setting Dimension



This two-dimensional model was tested for fit. Once again, the result indicated

that convergence could not be reached. Adding a second dimension of setting brought in

severe problems in model specification. The setting dimension failed to capture the

relationships among the variables in conjunction with the two skill-based factors based on

the test performance of the study sample (N=370).10 This model was inadmissible, and

therefore was discarded from any future analysis.

The Final Model

The correlated two-factor model was adopted as the final model for the entire

sample group. The first factor, on which tasks from the listening, reading, and writing

sections loaded, could be interpreted as a non-speaking factor. The second factor, on which

the speaking tasks loaded exclusively, could be interpreted as a speaking factor. The results of

model testing showed that this model demonstrated an adequate fit to the data. The final

model with unstandardized and standardized parameter estimates is illustrated in Figure 6

and Figure 7 respectively. In both figures, a path pointing from a latent factor to an

observed variable, also called an indicator, represented the presumed effect of the factor

on that variable. Estimates of these effects were factor loadings. A path pointing from a

measurement error to an indicator corresponded to the presumed effect of random and

systematic errors on the variable. A path linking the two latent factors did not indicate

directionality. It simply represented the unanalyzed association between the two factors.

The variables were rearranged for ease of visual display.

To obtain the unstandardized estimates illustrated in Figure 6, the scales of the

latent variables were assigned through the unit loading identification (ULI) constraint.

The factor loadings for the first indicator of each factor were fixed at one to assign a scale

to the factors. The measurement errors were assigned a scale through fixing their

estimated effects on the indicators to be unitary. For example, the unstandardized factor

loading of the second listening task (L2) on the non-speaking factor (L/R/W) was

estimated to be 1.312. The numbers printed next to the latent factors represented factor

variances. The factor covariance was indicated by the number printed next to the path

linking the two latent factors. Next to the measurement errors were residual variances of

the indicators. The variances of the non-speaking and the speaking factors were estimated

to be 0.430 and 0.335 respectively. The covariance between the two factors was

estimated to be 0.306. Estimation of the residual variance of the first listening task (L1),

for example, was 0.861.
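
As a check on how these unstandardized estimates are read, standard confirmatory factor analysis algebra implies that the variance of an indicator equals the squared loading times the factor variance plus the residual variance. Applied to the first listening task (L1), whose loading is fixed at one by the ULI constraint, the estimates above approximately reproduce the observed standard deviation reported in Table 5.

```python
# Model-implied variance of an indicator: loading**2 * factor_variance + residual_variance.
loading = 1.0            # L1 loading fixed at 1 under the ULI constraint
factor_variance = 0.430  # estimated variance of the non-speaking factor
residual = 0.861         # estimated residual variance of L1

implied_variance = loading ** 2 * factor_variance + residual
print(round(implied_variance, 3), round(implied_variance ** 0.5, 3))
# 1.291 and 1.136, close to the observed standard deviation of 1.138 for L1 in Table 5
```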

The standardized estimates were computed when the latent factors and the

observed variables were standardized. The variances of all variables, including the

factors, the observed variables, and the residuals, were fixed at one. The estimated factor

correlation was reported next to the path linking the two factors. The correlation between

the two skill-based factors was estimated to be 0.807. The factor loadings were

standardized regression coefficients. For example, one standard deviation of change in

the latent speaking factor predicted 0.764 standard deviation of change in the

first speaking variable (S1). The higher a factor loading was, the better the indicator was

at measuring the latent factor. The standardized factor loadings could also be interpreted

as estimated correlations between a latent factor and its indicators in the current model

because each indicator was specified to measure only one latent factor. For example, the

estimated correlation between the speaking factor and the first speaking variable (S1) was

0.764. This meant that 58.4% (0.7642) of the total variance of this indicator could be

accounted for by the speaking factor. The standardized residual variances represented the

percentage of variance of the indicators that could not be explained by the common

factors. In case of the first speaking variable (S1), 41.6% of the variance could not be

accounted for by the speaking factor. The standardized residual path coefficient for the

direct effect of the measurement error on this speaking variable was 0.645 (the square root of 0.416),

which meant that one standard deviation of change in the error term was associated with

0.645 standard deviation of change in this variable.
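
The arithmetic behind these standardized quantities can be restated in a few lines, using the first speaking task (S1) and its standardized loading of 0.764.

```python
# From a standardized loading to explained variance, residual variance,
# and the residual path coefficient for S1.
loading = 0.764
explained = loading ** 2          # share of variance explained by the speaking factor
residual = 1 - explained          # standardized residual variance
error_path = residual ** 0.5      # path coefficient from the measurement error

print(f"explained variance: {explained:.3f}")   # about 0.584, i.e., 58.4%
print(f"residual variance:  {residual:.3f}")    # about 0.416, i.e., 41.6%
print(f"error path:         {error_path:.3f}")  # about 0.645
```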

Multi-Group Invariance Analysis

Assuming that the obtained final model was the correct model for the entire

group, the multi-group invariance analysis investigated whether the specified model

could hold equivalent across two groups of test-takers. One hundred and twenty-four test-

takers who had never been immersed in an English language environment were grouped

together (Group I). The other group (Group II) of 246 test-takers had lived in an English-

speaking country for various lengths of time. Table 8 summarizes the descriptive

statistics across the two groups.

The multi-group invariance analysis was executed simultaneously across these

two groups in a hierarchically ordered fashion. In all steps of the analysis, unstandardized

parameter estimates were reported. Generally speaking, unstandardized estimates should

be used for comparing groups, as groups are assumed to differ in their variabilities on

common factors (Kline, 2005).



Figure 6. Final Model with Unstandardized Estimates



Figure 7. Final Model with Standardized Estimates

Table 8. Descriptive Statistics across the Two Groups

             Group I (N = 124)        Group II (N = 246)
             Mean      Std. Dev.      Mean      Std. Dev.
Listening    23.34     5.48           23.52     5.43
Reading      28.25     7.22           26.34     7.59
Speaking     15.08     3.98           15.16     3.88
Writing       6.89     1.83            6.60     0.26
Total        73.56    16.37           71.62    16.34

Measurement Invariance

During the first step of testing measurement invariance, factor structure

invariance was inspected. The same factor structure was imposed on both groups

simultaneously but parameter estimates were allowed to differ across the groups.

The resulting unstandardized parameter estimates for both groups are shown in

Figure 8 and Figure 9 respectively. The same correlated two-factor model was applied on

both groups. Factor loadings, indicator residuals as well as factor means, variances, and

covariance had different estimates in each group. The numbers printed next to a factor

referred to the factor mean and factor variance. In Group I the means of the latent factors

were fixed at zero. Factor variances were estimated at 0.518 for the non-speaking factor

and 0.374 for the speaking factor. The estimated covariance of the factors was 0.353. In

Group II the means of the latent factors were free to be estimated. The mean of the first

factor was estimated 0.140 lower than the one in Group I. The mean of the second factor

was estimated 0.017 lower than the one in Group I. Factor variance estimates were 0.386

and 0.319 in Group II. The estimated covariance of the factors was 0.283 in Group II.

Model fit indices are summarized in Table 9. Except for the model chi-square, all

model fit indices were satisfactory. All parameter estimates, as reported in the figures,

were appropriate and reasonable. It was then concluded that the same correlated two-

factor structure could be held across the groups. The result of this step ensured that the

performance in both groups could be accounted for by the same two factors, a speaking

factor which was loaded with tasks from the speaking section and a non-speaking factor

which was loaded with tasks from the sections of listening, reading, and writing. The

invariance analysis then proceeded to the next step.



Figure 8. Factor Structure Invariance with Unstandardized Estimates Group I



Figure 9. Factor Structure Invariance with Unstandardized Estimates Group II

Next, the equivalence of factor loadings was inspected. Factor loading estimates

were held to be equal across the groups.

As illustrated in Figure 10 and Figure 11, the resulting unstandardized factor

loading estimates were the same for both groups. Indicator residuals, factor means, factor

variances, and factor covariance were allowed to differ across the groups.

Figure 10. Factor Loading Invariance with Unstandardized Estimates Group I

Model fit indices are summarized in Table 9. Except for the model chi-square, all

model fit indices were satisfactory. All parameter estimates, as reported in the figures,

were appropriate and reasonable. The Satorra-Bentler chi-square difference (χ2S-B|Diff)

between this model and the preceding model was not significant: 15.266 with 15 df

(χ2S-B|Diff / df = 1.018). Compared to the model tested in the preceding step, the current model

had fewer free parameters to estimate because it imposed equal factor loadings across the

groups. Therefore the current model was more restrictive and simpler. The decrease in

the number of free parameters led to deterioration in model fit. The non-significant result

demonstrated that model fit did worsen but not enough to justify choosing the more

saturated model over the simpler one.

Figure 11. Factor Loading Invariance with Unstandardized Estimates Group II



It was then concluded that the factor loadings could be held invariant across the

groups. The result of this step indicated that the factors were measured by their indicators

in a comparable way for both groups. The amount of the variance of an indicator that

could be accounted for by its respective factor was comparable across the groups. The

invariance analysis then proceeded to the next step.

The third step was to examine the equivalence of residual variances in both

groups. Residual variances along with the factor loadings were constrained to be equal

across the groups.

As illustrated in Figure 12 and Figure 13, the factor loadings as well as the

residuals were fixed to be the same for both groups. Factor means, factor variances, and

factor covariance were allowed to differ in each group.

Model fit indices are summarized in Table 9. Except for the model chi-square, all

model fit indices were satisfactory. All parameter estimates, as reported in the figures,

were appropriate and reasonable. The Satorra-Bentler chi-square difference (χ2S-B|Diff)

between this model and the preceding model was not significant: 22.249 with 17 df

(χ2S-B|Diff / df = 1.309). This result indicated that the model fit did not deteriorate enough to

justify choosing the preceding more saturated model over the current simpler one. It was

then concluded that the indicator residuals could be held invariant across the groups. The

outcome of this step showed that the indicators performed equally at measuring their

respective factors for both groups. The amount of the variance of an indicator that could

not be explained by its respective factor was comparable across the groups.



Figure 12. Indicator Residual Invariance with Unstandardized Estimates Group I

Table 9. Fit Indices from the Multi-Group Measurement Invariance Analysis

Multi-group measurement invariance   Model df   χ2S-B     χ2S-B /df   CFI     RMSEA   SRMR

Factor structure invariance          251        441.956   1.761       0.941   0.064   0.055
Factor loading invariance            266        456.488   1.716       0.941   0.062   0.061
Indicator residual invariance        283        478.524   1.691       0.940   0.061   0.064

Analysis in the above three steps completed the test of multi-group measurement

invariance. The testing of equality on factor structure, factor loadings, and residuals all

succeeded. It was concluded that the measurement part of the model could be held

equivalent across the groups. The multi-group invariance analysis then proceeded with

the testing of structural invariance.

Figure 13. Indicator Residual Invariance with Unstandardized Estimates Group II



Structural Invariance

Next, the structural part of the model was scrutinized with equality constraints across

the groups.

The invariance of factor means was examined first. As shown in Figure 14 and

Figure 15, the measurement part of the model, including the factor loadings and indicator

residuals, was fixed to be the same across the groups. Factor means in both groups were

fixed to zero to be equal. Factor variances and covariance were free to differ.

The model fit indices are summarized in Table 10. Except for the model chi-

square, all model fit indices were satisfactory. All parameter estimates, as reported in the

figures, were appropriate and reasonable. The Satorra-Bentler chi-square difference

(χ2S-B|Diff) between this model and the preceding model was not significant: 3.771 with 2 df

(χ2S-B|Diff / df = 1.886). This result indicated that the model fit did not deteriorate enough

to justify choosing the more saturated model over the current simpler one. It was then

concluded that the factor means could be held invariant across the groups. The outcome

of this step showed that the two groups were equivalent in terms of latent factor means.

In other words, there was not enough evidence to say that one group was better than the

other on either latent ability. The structural invariance analysis then proceeded to the next

step.

The next step was to test the equivalence of factor covariance. These parameters

were constrained to be equal across the groups. As indicated in Table 10, model

estimation did not succeed. This result failed to demonstrate the equivalence of the factor

covariance across the groups. The multi-group invariance analysis was then terminated.

Since no further step of testing factor variance invariance would be taken, it could then be

assumed that the groups differed in their variabilities on the common factors. The multi-

group model estimated in the preceding step became the final model for the groups.

Figure 14. Factor Mean Invariance with Unstandardized Estimates Group I



Figure 15. Factor Mean Invariance with Unstandardized Estimates Group II

Table 10. Fit Indices from the Multi-Group Structural Invariance Analysis

Multi-group structural invariance   Model df   χ2S-B     χ2S-B /df   CFI     RMSEA   SRMR

Factor mean invariance              285        482.295   1.692       0.939   0.061   0.065
Factor covariance invariance        No convergence

Results of Multi-Group Invariance Analysis

Multi-group invariance analysis succeeded at each step until the equality of the factor

covariance failed to hold across the groups. The final model for both groups, as

shown in Figure 14 and Figure 15, had a correlated two-factor structure. The factors were

measured by the same set of indicators, which ensured that the factors had the same

meanings for both groups. The first factor was a combination of listening, reading, and

writing, a non-speaking factor. The second factor was a speaking factor, loaded

exclusively with the speaking tasks. Factor loadings were also equivalent for both groups.

For example, the unstandardized factor loading of the second speaking task (S2) on the

speaking factor was 1.042 for both groups, which meant that this speaking task

functioned equivalently as an indicator of the speaking factor for Group I and Group II.

Indicator residuals were comparable across the groups as well. For example, the residual

variance of the first writing task (W1) was 0.489 for both groups, which indicated that the

same amount of variance of this writing task was left unexplained by its factor for Group

I as for Group II. At last, the means of the latent factors were equal across the groups. In

terms of their latent abilities, the groups did not differ from each other since the model

testing succeeded when the factor means were held invariant across the groups.

Since the test of factor covariance equality failed, these estimates were not fixed

to be equal for the groups. Factor variances were also assumed to be unequal. Factor

variance estimates were 0.438 and 0.345 in Group I, and 0.426 and 0.330 in Group II.

The covariance estimate of the factors was 0.313 in Group I, and 0.303 in Group II.

Although model testing indicated that the groups differed on these parameters, the

differences were small.
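
One way to see how small these differences are is to rescale each group's factor covariance into a correlation (the covariance divided by the square root of the product of the factor variances), using the estimates just reported.

```python
# Implied factor correlations from the group-specific variance and covariance estimates.
groups = {
    "Group I":  {"var_nonspeaking": 0.438, "var_speaking": 0.345, "cov": 0.313},
    "Group II": {"var_nonspeaking": 0.426, "var_speaking": 0.330, "cov": 0.303},
}
for name, g in groups.items():
    r = g["cov"] / (g["var_nonspeaking"] * g["var_speaking"]) ** 0.5
    print(f"{name}: implied factor correlation = {r:.3f}")
# Both values fall near 0.81, in line with the whole-sample estimate of 0.807.
```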



In the end, the factor models for the two groups were almost identical, with only

minor differences. The factor structure underlying the performance of test-takers with

study-abroad experience was almost the same as the one from test-takers without such

experience. These results informed us that the impact of this group membership on the

test performance was minimal. It was then reasonable to conclude that the test

functioned in a comparable way across the two subgroups of test-takers, one with study-

abroad experience and one without such experience. In other words, the structure of

communicative language ability, as measured by the TOEFL iBT® test, was found to

have equivalent representations across the two groups.

Structural Equation Models

The length of study-abroad along with two learning-oriented variables—the time

spent studying English and the time spent in content classes taught in English—were

specified to have direct effects on the development of the latent factors, and these effects

were investigated through a structural equation modeling (SEM) approach.

A unique model was built for each group. With the group of test-takers who did

not have study-abroad experience, the home-country group, as shown in Figure 16, two

independent variables – the time spent studying English and the time spent in content

classes taught in English – were modeled to have direct effects on both latent abilities.

The two independent variables were represented by the rectangular boxes labeled as

‘Study’ and ‘Content’ in the figure as these variables were directly observed. The latent

factors, represented in the ellipses, were dependent variables in the model. A path

pointing from an independent observed variable to a dependent latent factor represented

the direct effect of the former on the latter. The part of the variance of a latent factor that

could not be explained by the independent variables, also called disturbance, was

represented in an ellipse next to the latent factor. The disturbances of the two latent

factors (D1 and D2), linked by a double-arrow line, were free to vary and covary.
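
In conventional SEM notation, the model just described for the home-country group can be written as two regressions of the latent factors on the observed background variables, with correlated disturbances. The gamma symbols below are generic path coefficients introduced for illustration; they are not labels used in the figures.

```latex
\begin{aligned}
\text{Non-speaking} &= \gamma_{11}\,\text{Study} + \gamma_{12}\,\text{Content} + D_1 \\
\text{Speaking}     &= \gamma_{21}\,\text{Study} + \gamma_{22}\,\text{Content} + D_2,
\qquad \operatorname{Cov}(D_1, D_2) \text{ freely estimated}
\end{aligned}
```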

With the group of test-takers who had study-abroad experience, as shown in

Figure 17, three independent variables were modeled to have direct effects on both latent

factors. They were the time spent studying English (labeled as ‘Study’), the time spent in

content classes taught in English (labeled as ‘Content’), and the length of living in an

English-speaking country (labeled as ‘Live’).

In both models, the independent variables were categorical in nature. However,

the scale of the independent variables was not an issue in model estimation, only the scale

of dependent variables was (L. K. Muthén, personal communication, December 10,

2010). Since the same set of continuous indicators used previously was used in these two

models to define the latent factors, the same estimation method from earlier was

implemented here as well. The models were evaluated by using the MLM estimator in

Mplus. The selected fit indices, summarized in Table 11, showed both models fit the data

well. In the next section, the standardized parameter estimates in each group were

examined for appropriateness and significance.



Figure 16. Structural Equation Model Group I



Figure 17. Structural Equation Model Group II

Table 11. Fit Indices for the Structural Equation Models

SEM        Model df   χ2S-B     χ2S-B /df   CFI     RMSEA   SRMR

Group I    148        256.623   1.734       0.915   0.077   0.059
Group II   163        285.102   1.749       0.947   0.055   0.046

The Home-Country Group

The standardized parameter estimates, displayed in Figure 18, were examined to

check if the model was appropriate and reasonable at explaining the relationships among

the variables for the group of test-takers without study-abroad experience.

All factor loadings were significant at p < 0.01. The path coefficients from the

‘Study’ variable to the latent factors were 0.139 and 0.290. The path coefficients from

the ‘Content’ variable to the factors were 0.227 and 0.338. These standardized path

coefficients could be interpreted as regression coefficients between the dependent and

independent variables. Significance of the effects of the independent variables on the

factors was marked by one asterisk next to a path coefficient at p < 0.05, and two

asterisks at p < 0.01.

Three out of the four path coefficients were significant. The path coefficient

between the ‘Study’ variable and the non-speaking factor was not significant. This meant

that the latent non-speaking ability was unlikely to be affected by the length

of time spent studying English for the test-takers without study-abroad experience. The path

coefficient between the ‘Content’ variable and the non-speaking factor was significant at

p < 0.05. This indicated that one standard deviation of change in the length of taking

content classes taught in English was associated with a 0.227 standard deviation increase in the

non-speaking latent factor.

With regard to speaking, both ‘Study’ and ‘Content’ variables had significant

impact (p < 0.01) on this latent factor. The path coefficient between the ‘Study’ variable

and the speaking factor showed that one standard deviation of change in the length of

studying English was associated with a 0.290 standard deviation increase in the latent speaking

ability. The path coefficient between the ‘Content’ variable and the speaking factor

showed that one standard deviation of change in the length of taking content classes

taught in English was associated with a 0.338 standard deviation increase in the latent speaking

ability.

Figure 18. Structural Equation Model with Standardized Estimates Group I



The residual variance of the non-speaking factor was 0.915, which meant that

91.5% of the variance of the non-speaking factor could not be explained by the two

independent variables. The residual variance of the speaking factor was 0.756, which

meant that 75.6% of the variance of the speaking factor could not be explained by the two

independent variables. The two residuals were correlated at 0.787. Both standardized

residuals, especially the one for the non-speaking factor, were very high. This indicated

that variables other than the ones specified in the model might have had influences on the

development of the latent factors. Since these other variables were not represented in the

model, their impact on the latent variables could not be analyzed in this study.

In conclusion, for the group of test-takers without study-abroad experience, the

length of studying English and taking content classes in English had significant

associations with the speaking ability. Only the length of taking content classes in

English had a significant association with the non-speaking ability.

The Study-Abroad Group

The standardized parameter estimates, displayed in Figure 19, were examined to

check if the model was appropriate and reasonable at explaining the relationships among

the variables for the group of test-takers with study-abroad experience.

All factor loadings were significant at p < 0.01. The path coefficients from the

‘Study’ variable to the latent factors were 0.347 and 0.328. The path coefficients from the

‘Content’ variable to the factors were 0.098 and 0.094. The path coefficients from the

‘Live’ variable to the factors were 0.155 and 0.229. Significance of the effects of the

independent variables on the factors was marked by one asterisk next to a path coefficient

at p < 0.05, and two asterisks at p < 0.01.



Figure 19. Structural Equation Model with Standardized Estimates Group II

Four out of the six path coefficients were significant at p < 0.01. The significant

path coefficients between the ‘Study’ variable and both factors indicated that one

standard deviation of change in the length of studying English was associated with a 0.347 standard

deviation increase in the non-speaking factor, and a 0.328 standard deviation increase in

the speaking factor. The ‘Content’ variable had no significant impact on either of the

factors. The ‘Live’ variable had significant impact on both factors. One standard

deviation of change in the length of living in an English-speaking environment was

associated with a 0.155 standard deviation increase in the non-speaking factor, and a 0.229 standard

deviation increase in the speaking factor.

The residual variances for the factors were high, 0.816 for the non-speaking factor

and 0.795 for the speaking factor. They were correlated at 0.767. This indicated that

variables other than the ones specified in the model may have had influences on the latent

factors. Since these other variables were not represented in the model, their impact on the

latent variables could not be analyzed in this study.

In conclusion, for the group of test-takers with study-abroad experience, the

lengths of studying English and living in an English language environment both had

significant impacts on both non-speaking and speaking abilities.



CHAPTER FIVE

DISCUSSION AND CONCLUSIONS

This chapter starts with a review of the study, followed by a summary of the

primary findings. Discussion focuses on three topics: (1) the nature of communicative

language ability, (2) group membership and language ability, and (3) learning contexts

and language ability. The implications of this study for foreign language (FL) test

development and validation are elaborated. The merits of using a structural equation

modeling (SEM) approach to address issues at the interface between language testing and

acquisition are appraised. The study’s contributions and its limitations are discussed.

Recommendations for future research are provided at the end.

Overview of the Study

This dissertation research investigated the nature of communicative language

ability as manifested in TOEFL iBT® test performance. It also investigated the

relationships between communicative language ability and test-takers’ study-abroad and

learning experiences.

The current view in applied linguistics conceptualizes FL ability as

communicative in nature. In this study, the context dependence of this communicative

language ability was examined through a latent factor approach based on TOEFL iBT test

performance. The relativity of this ability was inspected by conducting multi-group

invariance analysis across groups of test-takers with different context-of-learning

experiences. This ability’s associations with the length of study-abroad and learning were

further explored through a SEM approach.



Summary of the Primary Findings

The first research question asked what constitutes communicative language

ability, and what the role of context is in defining this ability. This question was

investigated with both skill abilities and the context of language use taken into

consideration. Based on the results from previous factor-analytic research with similar

data, three competing models were tested for fit through a series of confirmatory factor

analyses. The efforts to confirm a higher-order model and a correlated four-factor

structure did not succeed. Instead, a correlated two-factor model was shown to be

compatible with the data. One of the factors was a speaking factor, and the other could be

interpreted as a non-speaking factor, a combination of listening, reading, and writing.

This model was established as the baseline model. Two context-related factors–content

and setting–were then added to the baseline model with the goal of improving model fit.

However, in neither case did model estimation succeed. This indicated that, contrary to

the hypotheses, the added context factors were not useful in explaining the latent

structure of the test performance together with the skill-based factors. The correlated two-

factor model fit the data well, and was adopted as the final model for the whole sample.

The second question investigated whether or not having had contact with the

target language environment has an effect on the latent structure of communicative

language ability. This was achieved by conducting multi-group invariance analysis across

two subgroups of test-takers. One group of test-takers had lived in an English-language

environment prior to taking the test, whereas the other group had not. Simultaneous

multi-group invariance analysis with a mean structure was carried out with parameters

constrained to equality across the groups in a hierarchical fashion. The results showed

that the moderating effect of this group membership on test performance was minimal.

Across the groups, the test measured the same set of latent abilities in similar ways. The

groups did not differ in terms of their standings on the factor means either. Contrary to

the hypothesis, the nature of communicative language ability elicited by the test had

equivalent underlying factorial representations in the two groups.

The third research question inquired if the lengths of study-abroad and learning

have any association with the development of communicative language ability. This was

accomplished by establishing models unique to test-taker groups. With the group of test-

takers who had had study-abroad experience, the length of time abroad, the time spent

studying English, and the time spent in content classes taught in English were modeled to

have direct effects on the latent abilities in the correlated two-factor model. With the

group of test-takers who had not had study-abroad experience, the time spent studying

English and the time spent in content classes taught in English were modeled to have

direct effects on the latent abilities in the correlated two-factor model. The results lent

partial support to the hypotheses. Although both study-abroad and learning were found to

have significant associations with aspects of communicative language ability, large

portions of the factor variances remained unexplained in the models. This result

suggested that variables other than the ones specified in the models might have had

an impact on the development of the ability being investigated.

Discussion

Discussion centers on the following three topics: (1) the nature of communicative

language ability, (2) group membership and language ability, and (3) learning contexts

and language ability.



The Nature of Communicative Language Ability

The goal of the TOEFL iBT® test is to assess communicative language ability,

whose definition reflects the influences of both skill abilities and the context of language

use. The test is designed to reflect this current thinking in applied linguistics.

The construct the test intends to measure–the ability to use the English language

communicatively in a North American university context–was found to have two latent

components. One component, on which all six speaking tasks loaded, could be

interpreted as the speaking ability of this group of test-takers. The second component, on

which tasks from the listening, reading, and writing sections loaded, could be labeled as

the ability to listen, read, and write. In other words, listening, reading, and writing were

basically indistinguishable, not separate abilities. Together they could be understood as

the non-speaking ability of this group of test-takers. In relation to test scores, this two-

factor structure meant that a test-taker who was higher ranked on listening was also likely

to perform well on reading and writing, but not necessarily on speaking. Likewise, the

fact that a test-taker scored high on speaking could not be used to draw the conclusion

that the person could also performed well on listening, reading, and writing. Both

components were skill-based, and together they accounted successfully for the

relationships among the tasks used in the test.

In contrast to skills, the role of context in defining the construct was not reflected

in this factor model. When two aspects of the context of language use, content and

setting, were examined together with the skill-based factors, model fit did not improve.

Rather, including these situation factors made the models inadmissible. Neither situation

factor was successful at explaining test performance when tested together with the skill-

based factors. With regard to task content, test-takers’ performance was not influenced by

the content of the language tasks, whether academic or non-academic. As far as setting

was concerned, the performance of test-takers was not affected by the availability of the

setting description, or by the nature of the setting, whether instructional or non-

instructional.

The final model chosen to account for the test performance for the whole group

contained two correlated skill-based factors. This model, however, did not indicate any

influence of the context of language use in the latent configuration of the test construct.

First, the finding of a two-factor model representing communicative language

ability was consistent with the consensus on the multi-component nature of FL

proficiency reached by applied linguists and language testers. Contrary to the unitary

view of language proficiency endorsed by Oller (1979), the nature of FL proficiency has

been shown by many researchers to consist of multiple components. Speaking and

reading were found to be two distinct factors in Bachman and Palmer’s (1983) study.

Three factors (basic elements of knowledge, integration of basic knowledge elements,

and interactive use of language) were found in Sang et al. (1986). A higher-order model

with three first-order factors (oral–aural, structure–reading, and discourse) represented

the nature of FL ability in Fouly et al. (1990). Bachman et al. (1995) were able to identify

a higher-order model with speaking, listening, and test-specific writing abilities as

distinct first-order factors. Both Buck (1992) and Bae and Bachman (1998) demonstrated

that the two receptive skills, listening and reading, were factorially different. The

components of FL ability found in Sasaki (1993) included writing, comprehending short

context, and comprehending long context. The correlated two-factor model found in this

study added another piece of supporting evidence for the multi-component nature of FL

ability.

This two-factor model suggested that responses to the listening, reading, and

writing tasks might have required similar skills, whereas the speaking tasks might have

demanded a somewhat different skill set. Finding this two-factor model could also be due

to the differences in testing method. Both listening and reading used multiple-choice

questions, whereas speaking employed constructed-response tasks. The two writing tasks

were also constructed-response items but they loaded together with the listening and

reading tasks. Still, majority of the tasks (9 out of 11) on the non-speaking factor used

objective multiple-choice questions. Test method effect might have contributed to the

finding of the two-factor model. The third explanation for the distinctiveness between a

speaking factor and a non-speaking factor could be instruction or lack thereof. Speaking

section became mandatory in the TOEFL iBT testing, whereas listening, reading, and

writing had long been part of the TOEFL testing routine before the introduction of the

TOEFL iBT test. For years TOEFL test-takers could choose not to be tested on speaking.

Lack of test preparation and training could be the reason for finding a speaking ability

that was different from listening, reading, and writing combined.

From a test validation point of view, the correlated two-factor model failed to

confirm an internal test structure that was compatible with the test’s section design and

score reporting scheme. The TOEFL iBT® test has a structure of four skill-based sections:

listening, reading, writing, and speaking. Stricker and Rock (2008) and Sawaki et al.

(2008) both concluded that a higher-order factor (general FL ability) together with four

first-order factors corresponding to the four skills respectively (listening, reading,



speaking, and writing) provided the best explanation of TOEFL iBT test performance.

Results from these previous TOEFL iBT studies supported the practice of reporting a

separate score on each of the skill sections as well as a total score, an average of the four

section scores, for the test.

The correlated two-factor model confirmed in this study was identical to what

Stricker et al. (2005) found based on a prototype of the TOEFL iBT® test. The results of

this study and Stricker et al. (2005) suggested a different internal structure of the test and,

therefore, an alternative way to organize test content and to report scores. According to

this factor model, the ability measured by the test had two instead of four latent

components, a speaking component and a non-speaking component. Items that were

designed to measure separate abilities of listening, reading, and writing were all

associated with the same latent ability, an ability that was not related to speaking. This

correlated two-factor model without a higher-order structure did not provide enough

evidence to support the existence of a general language ability. To reflect the internal

structure of the test suggested by this model, the domain of the test might be organized

into either speaking or non-speaking. If future studies support the findings of this study,

the test could report a speaking score based on speaking tasks and a non-speaking score

based on tasks from listening, reading, and writing. The results of this study did not

provide justification for reporting a total score for the entire test.

Second, the failure to demonstrate the influences of situation factors on test

performance revealed a discrepancy between the theoretical conception of

communicative language ability and the empirically obtained factor structure. The

definition of the test construct highlights the intertwining relationships between the

context of language use and the skill-based capacities within individuals, and suggests

organizing the test domain by language use situation (Chapelle et al., 1997). Although the

operational TOEFL iBT® test follows the four-skills convention to organize test content,

the context of language use can still play a role in defining the ability measured by the

test. One way to demonstrate this influence of context is to establish data-compatible

factor models that contain situation factors.

In the language testing literature, this approach has been used in multiple studies

to demonstrate factors that are not skill-based. Most of these studies focused on test

method effects. Bachman and Palmer (1983) empirically demonstrated the existence of

three method-related factors (interview, translation, and self-rating) together with two

correlated skill factors (speaking and reading) in a two-dimensional model. In another

study by the same authors (Bachman & Palmer, 1982), the best model they found was

also two-dimensional, and contained both method factors (interview, a combination of

writing sample and multiple-choice test, and self-rating) and skill factors (grammatical

competence, pragmatic competence, and sociolinguistic competence). Song (2008)

discovered that a two-dimensional model with three skill-based factors (main idea

comprehension, detailed comprehension, and implication) and two method-related factors

(an audio input mode and a written input mode) provided the best match to the data.

Llosa (2007) found that two-dimensional models with both skill factors

(listening/speaking, reading, and writing) and method factors (standardized assessment

and classroom assessment) provided the best explanation for the data in multiple test

populations. A two-dimensional model with three skill factors (extracting main ideas,

major ideas, and supporting details) and three method factors (summary task, incomplete

outline, and open-ended question) was chosen in Shin’s (2008) study to account for FL

test performance.

The common approach in the above studies was to demonstrate the method effects

by evaluating factor models with two dimensions, a skill-based dimension and a

dimension of test method factors. Successful model estimation indicated that the method

factors together with the skill-based factors were responsible for the underlying structure

of the ability being measured. This approach was adopted in this study to examine the

role of context in defining communicative language ability measured by the TOEFL iBT®

test. A second dimension of situation factors, based on content and setting, was tested,

respectively, along with the skill-based correlated two-factor structure. Model estimation

did not succeed in either case, indicating that a model with both skill-based factors and

situation factors was not empirically compatible with the data. Instead, a skill-based two-

factor model provided the best explanation of the construct. However, there was a

noticeable amount of overlap in the indicators that represented skill and situation factors.

For example, 8 out of the 10 indicators associated with the academic content were also

indicators of the non-speaking factor. This could be a possible reason for estimation

failure (T.N. Ansley, personal communication, April 4, 2011).
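
To make the specification issue concrete, the two-dimensional models referred to above let every task load on one skill factor and one situation factor, with correlations across the two dimensions fixed at zero so that skill variance and situation variance can be separated. A hypothetical Mplus-style fragment for the skill-plus-content version is sketched below, with the same DATA and VARIABLE setup as in the earlier sketch; the assignment of tasks to the academic and non-academic content factors is invented for illustration and is not the operational classification used in the study.

    MODEL:
      speak    BY s1-s6;                    ! skill dimension
      nonspeak BY l1-l6 r1-r3 w1-w2;
      academic BY l1-l5 r1-r2 w2 s1-s2;     ! situation dimension: academic content (illustrative)
      nonacad  BY l6 r3 w1 s3-s6;           ! non-academic content (illustrative)
      speak WITH nonspeak;                  ! factors may correlate within a dimension
      academic WITH nonacad;
      speak WITH academic@0 nonacad@0;      ! but not across dimensions
      nonspeak WITH academic@0 nonacad@0;

When most indicators of a situation factor are also indicators of the same skill factor, as was the case here, the two factors compete for nearly the same common variance, which is precisely the condition under which this kind of model tends to become empirically underidentified and estimation fails.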

These findings suggested that the ability measured by the test was predominantly

skill-oriented. The relationships between the context of language use and the skill-based

capacities were not captured in the latent structure of the test construct. In other words,

the role of the context of language use in defining communicative language ability could

not be confirmed due to a lack of empirical evidence.



Group Membership and Language Ability

The language testing research community has long been aware of the relativity of

FL ability. Earlier researchers called for interpreting the nature of FL ability in

light of learner variability (Harley et al., 1990; Kunnan, 1998a). In a more recent review

of English language testing and assessment, Alderson and Banerjee (2002) restated the

importance of understanding the characteristics of test-takers and how these

characteristics interact with their abilities measured by a test.

The field has witnessed a surge of empirical studies that investigated the nature of

FL ability in relation to learner variability. This line of research responded to the question

of whether the nature of language ability measured by a test varied as a function of

various test-taker characteristics (TTCs), such as proficiency level (Kunnan, 1992;

Römhild, 2008; Shin, 2005), cognitive skill (Sang et al., 1986), gender (Wang, 2006), and

ethnicity (Ginther & Stevens, 1998).

A few of these studies focused on the contact with the target language, either in a

study abroad environment (Morgan & Mazzeo, 1988), or an at-home environment (Bae &

Bachman, 1998; Morgan & Mazzeo, 1988). Results from these studies suggested that

language contact as a TTC moderated test performance. In other words, language abilities

developed in groups with different language contact experiences were different in terms

of latent structure.

This study took a special research interest in the relationship between the nature

of communicative language ability, as measured by the TOEFL iBT® test, and test-takers’

target language contact, either having lived in an English speaking environment or not

having done so. The results contrasted with the outcomes from previous studies.

At the measurement level across the two groups, the test measured the same

abilities (speaking and non-speaking). The test tasks functioned equivalently as indicators

of the abilities they measured. At the structural level, the two groups did not differ in

terms of their mean performance on the latent abilities. The degrees of variability of the

latent abilities as well as the correlational relationship between the two abilities were

found to vary across the groups, but not by much. It is usually assumed that factors

differ in their variability across groups. That was the reason for choosing the unit

loading identification approach in the multi-group invariance analysis in this study. In

conclusion, these results suggested that having the study-abroad experience (from less

than 6 months to more than 1 year) did not alter how a test-taker performed on the test, in

terms of the latent factor structure as well as the latent factor means.
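
A sketch of how such a multi-group comparison can be set up, again in hypothetical Mplus-style syntax, may make the analysis easier to follow; the grouping variable, data file, and task labels are placeholders.

    DATA:     FILE IS toefl_tasks.dat;
    VARIABLE: NAMES ARE group s1-s6 l1-l6 r1-r3 w1-w2;
              GROUPING IS group (0 = home 1 = abroad);  ! study-abroad experience: no / yes
    ANALYSIS: ESTIMATOR = MLM;
    MODEL:
      speak    BY s1-s6;                ! first loading of each factor fixed at 1 in both
      nonspeak BY l1-l6 r1-r3 w1-w2;    ! groups (unit loading identification)
      speak WITH nonspeak;
      [speak@0 nonspeak@0];             ! latent means fixed at 0 in the reference group
    MODEL abroad:
      [speak nonspeak];                 ! latent means freely estimated in the comparison group
    OUTPUT:   STANDARDIZED;

By Mplus default the factor loadings and intercepts are held equal across the two groups, so the freely estimated latent means of the abroad group can be read directly as latent mean differences; factor variances and the factor covariance remain free in each group unless further equality constraints are imposed.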

The factorial invariance found in this study, however, did resonate with what

other TOEFL iBT® multi-group studies have found. Stricker et al. (2005) concluded that

a correlated two-factor structure could be applied across three native language groups. In

another study, Stricker and Rock (2008) confirmed the same higher-order structure across

subgroups by native language family, exposure in home countries, and formal instruction.

By focusing on test-takers’ exposure in the target language community, the results of this

study provided another piece of evidence for the generalizability of the test’s internal

structure.

This study did not find convincing evidence to claim that test-takers with study-

abroad experience performed differently on the test as a whole group from the group

without such experience. The English language ability developed in the two groups

appeared to be similar in nature. Both groups seemed to possess a distinct speaking



ability. They also exhibited an ability that could be captured by their responses to the

listening, reading, and writing tasks, a non-speaking ability. Furthermore, imposing a

mean structure in this study led to the surprising finding that the study-abroad test-takers

did not turn out to be better at English compared to the test-takers who had never had

such exposure. The common preference for study-abroad contexts over formal classroom

settings in the home country, a belief captured by Freed, Segalowitz, and Dewey

(2004), could not be upheld based on this study.

The supposed superiority of study-abroad learning contexts over traditional

classroom settings has also been challenged by study abroad researchers. After reviewing

studies that compared language gains from study-abroad and from home country formal

learning, Collentine and Freed (2004) found no convincing evidence that one learning

context was of absolute superiority compared to the other. Depending on the aspects of

linguistic development and levels of proficiency, one learning context might produce

more gains than the other. Davidson (as cited in Dewey, 2004, p. 322) also

pointed out that it might take a full year of target language contact for the linguistic

benefits to become evident, and he called for additional research on the effects of length

of study-abroad. In this study, exposure to the English language varied from less than 6

months to more than 1 year. Among the 246 test-takers who had been immersed, only

about half of them had lived in an English-speaking country for more than a year.

Grouping test-takers with different lengths of study-abroad experience together might

have diluted the impact of language contact.

From the test validation point of view, confirming an equivalent structure across

the two groups provided important validity evidence based on the test’s generalizability.

The TOEFL iBT® test is intended for people whose first language is not English.

However, the intended test-taking population can be very diverse. With growing

opportunities for study abroad, language learning has expanded from its traditional

classroom settings to community-embedded contexts. One characteristic that divides the

population is the target language contact experience. Taking the sample in this study as

an example, about two thirds of the test-takers had the experience of living in an English

language country prior to taking the test. The rest did not have such experience. This

demographic characteristic of the TOEFL iBT® test-taker population brings up a

legitimate question: Should we use the same test for all intended test-takers; and if we do

use the same test, can test scores be interpreted and used in the same way? The results of

the multi-group invariance analysis in this study indicated that the test functioned

equivalently for both groups of test-takers, regardless of their different target language

contact experiences.

Furthermore, the test is administered both domestically (where English is the

dominant language) and internationally (where English is not the dominant language).

Although test-taking location is not always a reliable indication of a test-taker’s language

contact experience, it does suggest the test-taker’s immediate linguistic environment. It

can also be used as a rough index of whether a test-taker has or has not had language

contact with English. The results from this study provided partial support to use the same

test format and score reporting scheme for both domestic and international test-taking

locations.11 Additional research on how domestic and international TOEFL iBT® test-

takers perform is needed to strengthen this validity argument.



Validity evidence based on a test’s generalizability is an important concern if the

target test-taking population is diverse. To justify using the same test across groups, it is

first necessary to make sure that the test measures the same construct in an equivalent

way. If analysis suggests that a four-skills test measures the designed four skills for one

group and one general skill for another group, then reporting skill-based scores makes

good sense for the former group, and reporting only total scores for the second group is

preferred. As a matter of fact, when the test is used on the second group, the length of the

test could be reduced to one-fourth of its original length because information obtained

from different skill sections is redundant.

Factorial equivalence is also a condition that needs to be satisfied in order to

compare group means. That is because it is important to know what to compare before

conducting any comparisons. Assume that in one group factor loadings of listening items

are relatively high and those in another group are relatively low. This hypothetical

finding would indicate that these test items were better indicators of the listening ability

in the first group than in the second. Comparing group means in listening based on these

items would become questionable. In the first group, deriving a listening score this way

would be acceptable. In the second group, since the items would appear not to be good

indicators, there would be doubt regarding how much information about listening they

really conveyed. A valid listening score based on these items might not be warranted.

Therefore comparison would make little sense. Consider another scenario. An integrated

task involving both listening and speaking loads with listening tasks in one group but

with speaking tasks in the other group. Comparing groups on this task would not be

warranted because the task has simply measured different things in different groups. In

conclusion, to ensure fair interpretation and use of test scores across diverse groups of

test-takers, validity evidence based on a test’s generalizability through multi-group

invariance analysis should be well established.

Learning Contexts and Language Ability

Language learning usually occurs in two contexts, an instructional context and a

communicative context, as pointed out by Batstone (2002). In an instructional context,

the goal is to improve linguistic expertise, such as learning new vocabulary or

grammatical structures. Formal FL training in the home country usually provides such a

context. In a communicative context, the objective is to use the target language to

perform communicative functions, such as exchanging information or expressing

opinions. Study abroad in the target language community is likely to create such

communicative contexts for learners.

A FL learner might have experience with one of the contexts, or both. The results

of this study provided opportunities to understand how both contexts are associated with

the development of language ability. For test-takers who indicated no experience of direct

contact with an English language environment (the home-country group), learning was

captured by their experience of studying English and studying content classes taught in

English. The former would occur most likely in an instructional context, whereas the

latter might happen in a hybrid context which could be both instructional and

communicative. For test-takers who claimed having direct contact with the target

language community (the study-abroad group), learning was captured by their experience

of living in an English-speaking country as well as studying English and studying content

classes taught in English.



The findings of this study suggested that all three learning situations had an

impact on the development of aspects of language ability. The speaking ability of the

home-country group was associated with the lengths of both studying English and taking

content classes taught in English. This group’s ability to listen, read, and write was

associated only with the length of time taking content classes taught in English.

With the study-abroad group, neither ability component (speaking or non-

speaking) had a significant relationship with the length of time of taking content classes

in English. The lengths of time of studying English and living in an English-speaking

country were significantly associated with both ability components. Between these two

learning situations, the length of time studying English held a stronger relationship with

both ability components than time living in an English-speaking country.
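
In modeling terms, these relationships were obtained by regressing the two latent abilities on observed learning-experience variables in the structural part of the model. A hypothetical Mplus-style fragment for the study-abroad group is shown below; the predictor names (years of English study, years of content classes taught in English, and years of residence in an English-speaking country) are invented placeholders, and the measurement part is the same two-factor model as before.

    DATA:     FILE IS toefl_tasks.dat;
    VARIABLE: NAMES ARE yrs_eng yrs_cont yrs_abrd s1-s6 l1-l6 r1-r3 w1-w2;
    MODEL:
      speak    BY s1-s6;
      nonspeak BY l1-l6 r1-r3 w1-w2;
      speak    ON yrs_eng yrs_cont yrs_abrd;   ! latent abilities regressed on years of English
      nonspeak ON yrs_eng yrs_cont yrs_abrd;   ! study, content classes, and residence abroad
      speak WITH nonspeak;                     ! residual covariance between the two abilities

The home-country model is the same except that the residence variable is dropped; the estimated regression coefficients from such a model correspond to the associations discussed above.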

Both learning contexts, instructional and communicative, appeared to have an

impact on the development of language ability. This finding was compatible with results

from studies comparing study abroad and formal instruction in the home country (Díaz-

Campos, 2004; Collentine, 2004; Lafford, 2004; Sasaki, 2007). These studies indicated

that students receiving formal language instruction in their home countries made just as

much gain (if not more) on some aspects of language ability as study abroad learners.

What was surprising was that even within the study-abroad group, test-takers’

performance was associated more with their experience in an instructional learning

context than in a communicative learning context. This again exhibited counter-evidence

to the common belief in the absolute superiority of study abroad over other learning

contexts.

Pedagogically speaking, this study demonstrated that both learning contexts,

instructional and communicative, had an impact on language development.

Practically speaking, this study suggested that study abroad might not be the only

way to prepare for the test, and to improve English language ability. Training received in

a formal classroom setting might help test-takers perform just as well, if not better, on the

TOEFL iBT® test, and develop English language ability just as effectively.

Implications

The outcomes of this study have implications for a broad range of issues

associated with language acquisition and language testing. This section is organized by

three topics. First, thoughts regarding test development are elaborated. Second, an idea of

understanding the impact of learning contexts through establishing test-taker profiles is

put forward. Last, using a structural equation modeling (SEM) approach to reach the

interface between language testing and acquisition is discussed.

Foreign Language Test Development

If the context of language use and the internal skill abilities of individual language

users are both part of the definition of the communicative language ability that the

TOEFL iBT® test intends to measure, the role of this language use context in defining the

test construct should be reflected in the internal structure of the test. Unfortunately the

attempt to demonstrate the abilities to respond to context did not succeed. The skill

abilities, as already reflected in the design of the test, were shown to be the dominant

forces in determining the internal structure of the test. Test-takers’ ability to respond to

different situations defined by content and setting did not appear to have an impact on

their test performance. As a result, the theoretical context-based ability in the

communicative language framework could not be demonstrated.

This disappointing finding raises questions regarding the testability of context-

based language components. In other words, are these components testable yet? Current

thinking in applied linguistics, such as the model of communicative language ability

proposed by Bachman (1990), embraces the context of language use in its framework.

But, is it ready to be tested? Is test-takers’ ability to respond to different language use

situations reliable and distinguishable enough so that this ability can be captured in the

internal structure of a test?

It is one thing to conceptualize context-based components in a theoretical

framework. It is another to empirically demonstrate the existence of such ability

components. Such empirical attempts might eventually lead to modifications of the

original theory. This study suggested that features of language use situation, such as

content and setting, were not able to elicit the associated context-based abilities.

However, not all features used to define the context of language use were tested

simultaneously in this study. Furthermore, not all tasks used to test the context-associated

ability components had a fully developed language use situation. This lack of context

development was especially obvious in all reading tasks as well as independent writing

and speaking tasks. This might have been the cause for the failure to find context-based

abilities.

With the intent to understand the nature of communicative language ability, this

study chose the TOEFL iBT® test in hopes of allowing context-dependent abilities to

surface and appear in the internal structure of the test, because this test was designed to

particularly reflect communicative language use in contexts. The results implied that

more care and attention need to be given to context development for the tasks used in the

test. Making a complete departure from the four-skills test design, while not yet

acceptable to the majority of test users, might offer more opportunity for implementing a

communicative language framework to elicit context-dependent language components.

The results also implied that new metrics might need to be considered for scoring

test performance. Test-takers who had study-abroad experience performed equivalently

on the test, compared to those without such experience. In spite of the prevailing belief in

the superiority of study-abroad learning contexts over traditional classroom contexts, this

study suggested that test-takers without study-abroad experience were just as likely (or

unlikely) to succeed on the test as test-takers with such experience. However, as

Collentine and Freed (2004) pointed out, the types of language gains attributed to an

study-abroad environment might not be captured by traditional metrics. In the context of

TOEFL iBT® testing, the speaking and writing tasks are rated based on holistic scales.

Scores are not available on aspects of speaking, such as oral fluency and pronunciation.

Scores are not available on aspects of writing either, such as writing fluency and

discursive coherence. Other non-traditional metrics could be: sociolinguistic

appropriateness, communicative strategies, pragmatic appropriateness, etc. These

variables might not be readily testable or quantifiable by the existing metrics. But they

may be capable of reflecting the influences of the study-abroad context on language

acquisition. Reporting scores on these variables might also provide language learners

useful diagnostic feedback (X. Xi, personal communication, January 7, 2010).



Test-Taker Profiles

Viewing the study-abroad learning context as different from the instructional context

based in classrooms is built upon the assumption that the former provides more

opportunities for contact with the target language community (Dewey, 2004; Freed et al.,

2004; Segalowitz & Freed, 2004). Freed et al. (2004) raised concern about this presumed

privilege, and proposed to use concrete measures to characterize the nature, quality and

intensity of language contact experience. The Language Contact Profile questionnaire

(Freed, Dewey, & Segalowitz, 2004) was designed to provide such a measure to

document and quantify language learners’ interaction with native speakers during time

abroad.

This dissertation study looked at whether test-takers with different context-of-

learning experiences performed differently on the test. Their language contact experience,

or lack of it, was characterized by a dichotomous measure: either having it or not having

it. No difference in test performance across the groups was found, which could be

attributed to the fact that the real differences in learning contexts were not captured by

this grouping method.

To fully understand the impact of study-abroad on language development, it

might be necessary to take further measures to obtain detailed language contact

information for the test-takers. This contact information can be used to establish FL test-

taker profiles. Such a profile traditionally includes information like age, gender, native

country and native language, etc. With increasing opportunities for study-abroad,

especially after the U.S. Senate declared 2006 the Year of Study Abroad (Magnan &

Back, 2007), it has become more relevant to understand test-takers not only by pre-

existing demographic information but also by what they have done to make use of the

target language in non-traditional learning contexts, such as study abroad. Elements such

as housing arrangement and personal relationships, which are usually irrelevant in a

classroom setting, may trigger or hinder learning in study-abroad contexts. Such elements

need to be built into FL test-taker profiles. Carefully and fully developed test-taker

profiles will enhance understanding of what to expect from and how to deal with our

increasingly diverse and constantly changing test-taking population during test

development and validation.

The Interface between Testing and Acquisition

This study used a structural equation modeling (SEM) approach to examine the

nature of communicative language ability and this ability’s relationship to learning

contexts based on test performance of the TOEFL iBT® test. The research interest in the

nature of language ability has been shared by the language testing community, whereas

understanding the factors that affect language acquisition has been one of the focuses in

the field of language acquisition (Bachman & Cohen, 1998). At the interface between

language testing and acquisition, there is the issue of how to bring the insights gained

from one field to inform the research agenda in the other. A SEM approach, a hybrid of

factor analysis and path analysis, offers a research method that can be used to address the

issues that connect the two fields.

A structural equation model usually has two parts. The measurement component

illustrates the relationships between latent abilities and their indicators. Such a latent

approach to examine the nature of a construct permits a consideration of measurement

errors by means of directly estimating error variances in the measurement model.



Multiple indicators (observed scores) are also required for each latent factor to ensure

construct convergence. Finding different indicators loading on different latent factors

provides evidence of construct divergence. Establishing a measurement model based on

test performance clarifies the nature of the theoretical construct. In language testing, this

construct is often referred to as FL proficiency. Under the communicative competence

framework, this construct is communicative language ability. Construct validation of

communicative language ability could involve finding a latent factor structure based on

test performance that is compatible with the theoretical configuration of this ability. As

results of such validation attempts over multiple tests and under different testing

situations become available, the field’s understanding of communicative language ability

can be advanced.
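
In the notation of Bollen (1989), the measurement part of such a model can be written as

    y = \Lambda_y \eta + \varepsilon,   with   \mathrm{Cov}(\varepsilon) = \Theta_\varepsilon,

so that the model-implied covariance matrix of the observed scores is

    \Sigma = \Lambda_y \Phi \Lambda_y^{\top} + \Theta_\varepsilon,

where y collects the observed task scores, \eta the latent abilities (for example, speaking and non-speaking), \Lambda_y the factor loadings, \Phi the covariance matrix of the latent abilities, and \Theta_\varepsilon the error variances that are estimated directly rather than absorbed into the abilities.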

The second part of a structural equation model is the structural model. Building a

structural model allows an investigation of the impact of multiple variables on the latent

abilities from the measurement model simultaneously. These independent variables

themselves can be either latent or observed. If an independent variable is latent, the

nature of this variable should be verified in a measurement model first before being

entered into the structural model. In language testing, these independent variables usually

correspond to test-taker characteristics, such as age, gender, teaching method, context of

learning, length of learning, etc. Interpreting test performance in relation to background

characteristics can shed light on how test-takers acquire a certain ability and/or reach a

certain ability level. The structural model brings language educators into a dialogue with

language test researchers behind the measurement model.
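
In the same notation, the structural part adds regressions of the latent abilities on a vector x of observed test-taker characteristics,

    \eta = \Gamma x + \zeta,   with   \mathrm{Cov}(\zeta) = \Psi,

where \Gamma contains the structural coefficients (for example, the effect of years of instruction on the speaking factor) and \zeta the disturbances, that is, the part of each latent ability left unexplained by the background variables.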



From the measurement model, language teachers will have a better understanding

of learners' achievement: exactly what they learn and what kinds of abilities they acquire.

This will have impact on how they direct teaching resources and organize curriculum.

From the structural model, test developers will have a better idea of how the test

functions; that is, whether test results reflect the language gains associated with changes in test-

taker characteristics. This will lead to better use of test instruments to detect language

gains. A SEM approach provides a platform upon which conversations across language

testing and acquisition can be facilitated.

Unique Contributions

Conceptualized as a FL construct study, this dissertation focused on the construct

of communicative language ability. With this intent in mind, the TOEFL iBT® test was

chosen because this test intends to measure communicative language ability, with both

skill abilities and the context of language use as parts of the theoretical definition of the

test construct. An investigation of context-dependent abilities was launched based on test

performance elicited by the test. The content and setting factors were specifically

modeled, in hopes that such an investigation might allow context-dependent abilities to

surface and appear in the internal structure of the test. Contextual factors have not been

examined in the literature of construct studies in language testing. This study set up an

example of using factor analysis to examine the role of context in defining language

ability.

Second, this study investigated the language construct in relation to a test-taker

characteristic (TTC) that had not been studied in the context of TOEFL iBT® testing. The

language testing community has long been aware of the fact that the makeup of language

ability might not hold equivalent across test-taker groups with different characteristics.

Language contact with the target language community, a factor that has been studied

extensively in language acquisition and study-abroad research, was introduced as a TTC,

and was examined in its relation to the underlying structure of the construct. With

growing opportunities of language learning in community-embedded contexts in recent

years, this TTC has become relevant and salient in more testing situations. In the situation

of TOEFL iBT® testing especially, since the test is administered both domestically and

internationally, establishing the test’s internal structure equivalence is an important

validation concern. This study made a unique contribution by examining test-takers’

language contact experience in relation to their test performance as measured by the

TOEFL iBT test.

Moreover, not only the factorial structure but also the mean structure of the

language construct was inspected. Applying a mean structure in a multi-group invariance

analysis allows a comparison of groups on latent factor means after measurement

invariance has been established across groups. Such a comparison is superior to

comparing means based on observed scores because: (1) the pre-established measurement

invariance ensures that the latent factors represent the same abilities across groups, and

(2) measurement errors are taken into account in the overall model. This study made a

unique methodological contribution to the field of language testing and acquisition by

demonstrating how a mean structure could be incorporated in multi-group invariance

analysis.
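
Concretely, adding a mean structure extends the measurement equation of group g with an intercept vector \tau and latent means \kappa:

    y^{(g)} = \tau + \Lambda \eta^{(g)} + \varepsilon^{(g)},   with   E(\eta^{(g)}) = \kappa^{(g)}.

With \Lambda and \tau constrained equal across groups (the measurement invariance established first), \kappa is fixed at zero in the reference group and freely estimated in the other group, so the estimate of \kappa in the second group is directly the latent mean difference between the groups, with measurement error already partialled out.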

Limitations and Recommendations

This study started the analysis with a dataset of a randomly generated sample of

1000 subjects. Due to missing values and inconsistent responses in the original dataset,

only 370 subjects were included in the final analysis. The reasons for missing values and

inconsistency in the responses could only be speculated but not confirmed. One reason

could be that some test-takers did not interpret the background questions correctly

because these questions were presented in English rather than in their native languages.

Correctness and truthfulness of the self-reported information could be questioned. These

factors, which could not be taken into account due to the lack of information, might have

had influences on the results of the analysis. Future researchers are recommended to use

questionnaires translated to test-takers’ native languages to ensure the quality and

correctness of their self-reported background information.

The study sample of 370 subjects matched the original random sample of 1000

subjects reasonably well on all background characteristics as well as their test

performance. Although the study sample was sufficient to carry out a single group

analysis, when divided into groups in the multi-group analysis, the subgroup samples

seemed relatively small. Confidence in the results of such analysis could still be sustained

based on the fact that the study sample could be considered representative of the

total sample as well as the target population. Nevertheless, the results should be

interpreted and generalized to the population with caution.

When investigating the latent structure of the construct, the writing factor was

represented by only two indicators. In factor analysis, a factor needs at least two

indicators for the model to be identifiable. This is the minimum requirement, and in

practice it is advised to have more than two indicators for each factor. Future researchers

are recommended to test the latent structure of FL ability with each factor represented by

a sufficient number of indicators.
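
The identification requirement mentioned above follows from the usual counting rule: with p observed indicators, the data supply p(p+1)/2 distinct variances and covariances, so the number of freely estimated parameters t must satisfy

    t \le \frac{p(p+1)}{2}.

This condition is necessary but not sufficient (Bollen, 1989; Kline, 2005); a factor with only two indicators is identified only by borrowing information from its correlations with other factors, which is why three or more indicators per factor are generally recommended.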

When investigating the influences of language use context, this study chose to

focus on only two situational factors, content and setting. Not all features used to define

the context of language use were tested in this study. This was partially because not all

situational features were present in the specifications of all test tasks. Future researchers

interested in this line of study could explore other situational factors, such as participants,

purpose, and register.

Although the design of the TOEFL iBT® test intends to reflect the role of

language use context in defining the communicative language ability, not all tasks have a

fully developed language use situation that would allow for adequately assessing the

impact of contextual factors. This might have been the cause for failing to model

contextual factors in the underlying representation of the construct. Future researchers

who are interested in the nature of communicative language ability should conduct

studies based on tests that are developed through a communicative approach. Such a test,

with a clear focus on communicative language use, could organize the domain of interest

by language use situation so that it is more likely for contextual factors to be a part of the

representation of the test construct. The development of language tasks used in such a test

should also have a clear context orientation, focusing on key features that have been

empirically proved to have impacts on test performance. It might even be necessary to

make a complete departure from the widely accepted four-skills design to fully

implement the communicative language framework in FL testing.



The design of the multi-group analysis separated the test-takers into two groups,

either having or not having the study-abroad experience. However, some findings from

the study abroad literature suggested a full year of study-abroad as the threshold for the

linguistic benefits to be detectable. Grouping test-takers by this one-year rule, having

either more or less than one year of study-abroad experience, might allow for detecting

differences in the language ability across groups.

This full-year threshold hypothesis was not part of the original research design, and therefore was

not tested in this study. However, after finding no structural and mean difference between

the two groups defined in the study, two additional sets of analyses were attempted based

on the study sample of 370 test-takers. In the first analysis, test-takers with less than 6

months of study-abroad (N=191) were compared to the ones with more than 6 months of

time abroad (N=179). Although measurement invariance could be held, the latter group

performed significantly better (p < 0.01) on speaking than the former. In the second

analysis, test-takers with less than one year of study-abroad (N=240) were compared to

the ones with more than one year of time abroad (N=130). Although measurement

invariance could be held, the latter group performed significantly better on the speaking

factor (p < 0.001), and to a lesser degree on the non-speaking factor (p < 0.05). These

results suggested that language ability, especially the speaking ability, was positively

associated with the length of study-abroad. For differences in language gains to become

detectable, a full year of study-abroad might be needed.

Future researchers are encouraged to investigate the generalizability of FL

proficiency in relation to study-abroad experience of various lengths in the same or other



testing situations. They are also encouraged to examine TTCs that have not caught the

research community’s attention but may be relevant to language testing and acquisition.

The current study had a research interest in the joint impact of classroom learning

and community-embedded learning on language ability development. Both learning

variables were defined by length in years. The richness of these learning experiences was

not fully captured in the models tested. This probably explained why a large portion of

the factor variances could not be accounted for in the study. Variables other than the ones

specified in the models were not investigated, due to lack of such information. It is

recommended that future researchers who share the same research interest collect

detailed information on the nature and intensity of learners’ language contact with the

target language community through well-developed instruments. Such endeavors would

encourage collaborative research efforts joined by both language testing researchers and

language acquisition researchers. Through a proper method, such as SEM, this line of

research would inform not only what constitutes language ability but also how the aspects

of this ability are associated with acquisition factors.

Finally, data utilized in the study were purely observational, providing a static

image of the relationships among the variables at one point in time. Without a controlled

experimental design, causal statements could not be made with confidence, which

prohibited making inferences about the factors responsible for language development.

Carefully designed experimental studies are recommended for the future to help fully

understand the impact of learning contexts on language development.



NOTES
1
In this article, a foreign language refers to a language that is learned after a person has
already learned his or her native language(s). The term ‘foreign language’ is used
interchangeably with second language. In the context of this study, foreign language
ability is conceptualized as a latent trait. The term ‘ability’ is used interchangeably with
proficiency and competence.
2
TOEFL iBT is a registered trademark of Educational Testing Service (ETS). This
publication is not endorsed or approved by ETS.
3
The higher-order factor model was also tested based on the test performance of 1000
test-takers in the total random sample. The model was rejected because extremely high
correlations were found between the higher-order factor and the first-order factors.
4
Because of the confirmatory nature of the analysis, only models that had been confirmed
in the previous TOEFL® literature were hypothesized. However after finding a high
correlation between listening and writing, these two factors were grouped together and
tested in a three-factor model (listening/writing, reading, and speaking). A close to 0.9
correlation was found between the reading factor and the listening/writing factor, which
suggested that a two-factor model with a speaking factor and a non-speaking factor
would be a more appropriate model for the data. The result of post hoc model fitting was
not reported because of the exploratory nature of modification procedures and the risk of
capitalization on chance factors.
5
The correlated four-factor model was also tested based on the test performance of 1000
test-takers in the total random sample. The model was rejected because extremely high
correlations were found among the latent factors.
6
The correlated two-factor model was also tested based on the test performance of 1000
test-takers in the total random sample. Taking all criteria into consideration, this model
provided the best explanation to the data.
7
A model with only content and no skill factors was tested based on the 370 test-takers’
performance. The result indicated that the model fit was not acceptable.
8
The two-dimensional model with both content and skill factors was also tested on the
test performance of 1000 test-takers in the total random sample. The model was rejected
because the estimated correlation among the two content factors was extremely high, and
that a number of factor loadings were insignificant.
9
A model with only setting and no skill factors was tested based on the 370 test-takers’
performance. The result indicated that the model fit was not acceptable.
10
The two-dimensional model with both setting and skill factors was also tested on the
test performance of 1000 test-takers in the total random sample. The model was rejected

because the estimated correlations among the setting factors were extremely high, and
that a number of factor loadings were insignificant.
11
Simultaneous multiple-group invariance analyses were also performed across the two
groups: 418 domestic test-takers and 582 overseas test-takers in the total random sample.
Both measurement and structural invariance could be held. No factor mean difference
was found. This result supported using the same test format and score reporting scheme
both internationally and domestically. This step was not reported because test-taking
location was found not to be a reliable indicator of language contact experience.

REFERENCES

Alderson, J. C., & Banerjee, J. (2002). Language testing and assessment (Part 2).
Language Teaching, 35, 79-113. doi: 10.1017/S0261444802001751

American Educational Research Association, American Psychological Association, &


National Council on Measurement in Education. (1999). Standards for
educational and psychological testing. Washington, DC: American Educational
Research Association.

Anastasi, A., & Urbina, S. (1997). Psychological testing. Upper Saddle River, NJ:
Prentice-Hall.

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford


University Press.

Bachman, L. F. (2000). Modern language testing at the turn of the century: Assuring that
what we count counts. Language Testing, 17(1), 1–42. doi:
10.1177/026553220001700101

Bachman, L. F., & Cohen, A. D. (1998). Language testing—SLA interfaces: An update.


In L. F. Bachman, & A. D. Cohan (Eds.), Interfaces between second language
acquisition and language testing research (pp. 1-31). New York: Cambridge
University Press.

Bachman, L. F., Davidson, F., & Foulkes, J. (1990). A comparison of the abilities
measured by the Cambridge and Educational Testing Service EFL Test Batteries.
Issues in Applied Linguistics, 1(1), 30-55.

Bachman, L. F., Davidson, F., Ryan, K. & Choi, I.-C. (1995). An investigation into the
comparability of two tests of English as a foreign language: The Cambridge-
TOEFL comparability study. New York: Cambridge University Press.

Bachman, L. F., & Palmer, A. S. (1982). The construct validation of some components of
communicative proficiency. TESOL Quarterly 16(4), 449–465.

Bachman, L. F., & Palmer, A. S. (1983). The construct validation of the FSI oral
interview. In J. W. Oller, Jr. (Ed.), Issues in language testing research (pp. 154-
169). Rowley, MA: Newbury House.

Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and
developing useful language tests. Oxford: Oxford University Press.

Bae, J., & Bachman, L. F., (1998). A latent variable approach to listening and reading:
testing factorial invariance across two groups of children in the Korean/English
two-way immersion program. Language Testing, 15(3), 380-414. doi:
10.1177/026553229801500304

Batstone, R. (2002). Contexts of engagement: A discourse perspective on “intake” and


“pushed output.” System, 30, 1–14. doi: doi:10.1016/S0346-251X(01)00055-0

Buck, G. (1992). Listening comprehension: construct validity and trait characteristics.


Language Learning, 42(3), 313–357. doi: 10.1111/j.1467-1770.1992.tb01339.x

Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.

Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A.
Bollen, & J. S. Long (Eds.), Testing structural equation models (pp. 136-162).
Newbury Park, CA: Sage.

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the
multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81-105. doi:
10.1037/h0046016

Canale, M. (1983). From communicative competence to communicative language


pedagogy. In J. G. Richards, & R. W. Schmidt (Eds.), Language and
communication (pp. 2-27). London: Longman.

Canale, M., & Swain, M. (1980). Theoretical bases of communicative approach to second
language teaching and testing. Applied Linguistics, 1(1), 1-47.

Carroll, J. B. (1958). A factor analysis of two foreign language aptitude batteries. Journal
of General Psychology, 59(3), 3-19.

Carroll, J. B. (1965). Fundamental consideration in testing for English language


proficiency of foreign students. In H. B. Allen (Ed.), Teaching English as a
second language: A book of readings (pp. 364-372). New York: McGraw-Hill.

Carroll, J. B. (1983). Psychometric theory and language testing. In J. W. Oller, Jr. (Ed.),
Issues in language testing research (pp. 80-107). Rowley, MA: Newbury House.

Chapelle, C. A. (1999). Validity in language assessment. Annual Review of Applied


Linguistics, 19, 254-272. doi: 10.1017/S0267190599190135

Chalhoub-Deville, M. (1997). Theoretical models, assessment frameworks and test


construction. Language Testing, 14(1), 3-22. doi: 10.1177/026553229701400102

Chapelle, C., Grabe, W., & Berns, M. (1997). Communicative language proficiency:
Definition and implications for TOEFL 2000. (TOEFL Monograph Series Report
No. 10). Princeton, NJ: Educational Testing Service.

College Board. (1987). Advanced placement course description: French. New York:
College Entrance Examination Board.

College Board. (1989). Advanced placement course description: Spanish language and
literature. New York: College Entrance Examination Board.

Collentine, J. (2004). The effects of learning contexts on morphosyntactic and lexical


development. Studies of Second Language Acquisition, 26, 227-248. doi:
10.1017/S0272263104062047

Collentine, J., & Freed, B. (2004). Learning context and its effects on second language
acquisition. Studies of Second Language Acquisition, 26, 153-171.
doi:10.1017/S0272263104262015

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests.


Psychological Bulletin, 52, 281–302. doi: 10.1037/h0040957

Cziko, G. A. (1984). Some problems with empirically-based models of communicative


competence. Applied Linguistics, 5, 23-38. doi:10.1093/applin/5.1.23

Dewey, D. P. (2004). A comparison of reading development by learners of Japanese in


intensive domestic immersion and study abroad contexts. Studies of Second
Language Acquisition, 26, 303-327. doi: 10.1017/0S0272263104062072

Díaz-Campos, M. (2004). Context of learning in the acquisition of Spanish second


language phonology. Studies of Second Language Acquisition, 26, 249-273. doi:
10.1017/S0272263104062059

Educational Testing Service. (2002). LanguEdge courseware score interpretation guide.


Princeton, NJ: Author.

Educational Testing Service. (2004). The next generation TOEFL test: Focus on
communication. Retrieved from http://www.ets.org/toefl/nextgen/

Educational Testing Service. (2008). Validity evidence supporting the interpretation and
use of TOEFL® iBT scores. Retrieved from http://www.ets.org

Farhady, H. (1983): On the plausibility of the unitary language proficiency factor. In J.


W. Oller, Jr. (Ed.), Issues in language testing research (pp. 11-28). Rowley, MA:
Newbury House.

Fouly, K. A., Bachman, L. F., & Cziko, G. A. (1990). The divisibility of language
competence: A confirmatory approach. Language Learning, 40(1), 1-21. doi:
10.1111/j.1467-1770.1990.tb00952.x

Freed, B. F. (1995). What makes us think that students who study abroad become fluent?
In B. F. Freed (Ed.), Second language acquisition in a study abroad context (pp.
123-148). Amsterdam: John Benjamins.

Freed, B. F., Dewey, D. P., Segalowitz, N., & Halter, R. (2004). The language
contact profile. Studies of Second Language Acquisition, 26, 349-356. doi:
10.1017/S0272263104062096

Freed, B. F., Segalowitz, N., & Dewey, D. P. (2004). Context of learning and second
language fluency in French: Comparing regular classroom, study abroad, and
intensive domestic immersion programs. Studies of Second Language Acquisition,
26, 275-301. doi: 10.1017/S0272263104062060

Gardner, R. C., & Lambert, W. E. (1965). Language aptitude, intelligence, and second
language achievement. Journal of Educational Psychology, 56(4), 191-199. doi:
10.1037/h0022400

Ginther, A., & Grant, L. (1996). A review of the academic needs of native English-
speaking college students in the United States. (TOEFL Monograph Series Report
No. 1). Princeton, NJ: Educational Testing Service.

Ginther, A., & Stevens, J. (1998). Language background, ethnicity, and the internal
construct validity of the Advanced Placement Spanish Language Examination. In
A. J. Kunnan (Ed.), Validation in language assessment (pp. 169-194). Mahwah,
NJ: Lawrence Erlbaum Associates.

Hale, G. A., Stansfield, C. W., Rock, D. A., Hicks, M. M., Butler, F. A., & Oller, J. W.
Jr. (1988). Multiple-choice cloze items and the Test of English as a Foreign
Language. (TOEFL Research Report No. 26; ETS Research Report No. 88-02).
Princeton, NJ: Educational Testing Service.

Hale, G. A., Rock, D. A., & Jirele, T. (1989). Confirmatory factor analysis of the Test of
English as a Foreign Language. (TOEFL Research Report No. 32; ETS Research
Report No. 89-42). Princeton, NJ: Educational Testing Service.

Harley, B., Cummins, J., Swain, M., & Allen, P. (1990). The nature of language
proficiency. In B. Harley, P. Allen, J. Cummins, & M. Swain (Eds.), The
development of second language proficiency (pp. 7-25). New York: Cambridge
University Press.

Hosley, D., & Meredith, K. (1979). Inter- and intra-test correlates of the TOEFL. TESOL
Quarterly, 13(2), 209-217.

Hu, L.-T., & Bentler, P. M. (1999). Cutoff criteria for fit indices in covariance structure
analysis: Conventional criteria versus new alternatives. Structural Equation
Modeling, 6, 1-55.

Jamieson, J., Jones, S., Kirsch, I., Mosenthal, P., & Taylor, C. (2000). TOEFL 2000
framework: A working paper. (TOEFL Monograph Series Report No. 16).
Princeton, NJ: Educational Testing Service.

Kachru, B. B. (1984). World Englishes and the teaching of English to non-native


speakers: Contexts, attitudes, and concerns. TESOL Newsletter, 18, 25–26.

Kunnan, A. J. (1992). An investigation of a criterion- referenced test using G-theory, and


factor and cluster analyses. Language Testing, 9, 30-49. doi:
10.1177/026553229200900104

Kunnan, A. J. (1994). Modelling relationships among some test-taker characteristics and


performance on EFL tests: An approach to construct validation. Language Testing,
11, 225-250. doi: 10.1177/026553229401100301

Kunnan, A. J. (1995). Test taker characteristics and test performance: A structural


modeling approach. Cambridge: Cambridge University Press.

Kunnan, A. J. (1998a). Approach to validation in language assessment. In A. J. Kunnan


(Ed.), Validation in language assessment (pp. 1-16). Mahwah, NJ: Lawrence
Erlbaum Associates.

Kunnan, A. J. (1998b). An introduction to structural equation modelling for language


assessment research. Language Testing, 15(3), 295–332. doi:
10.1177/026553229801500302

Kline, R. B. (1998). Principles and practice of structural equation modeling. New York:
Guilford.

Kline, R. B. (2005). Principles and practice of structural equation modeling (2nd ed.).
New York: Guilford.

Lado, R. (1961). Language testing. New York: McGraw-Hill.

Lafford, B. A. (2004). The effect of the context of learning on the use of communication
strategies by learners of Spanish as a second language. Studies of Second
Language Acquisition, 26, 201-225. doi: 10.1017/S0272263104062035

Llosa, L. (2007). Validating a standards-based classroom assessment of English


proficiency: A multitrait-multimethod approach. Language Testing, 24(4), 489-
515. doi: 10.1177/0265532207080770

Magnan, S. S., & Back, M. (2007). Social interaction and linguistic gain during study
abroad. Foreign Language Annals, 40(1), 43-61. doi: 10.1111/j.1944-
9720.2007.tb02853.x

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed.) (pp.
13-103). New York: American Council on Education and Macmillan.

McNamara, T. F. (1990). Item response theory and the validation of an ESP test for
health professionals. Language Testing 7, 52-76. doi:
10.1177/026553229000700105

Morgan, R., & Mazzeo, J. (1988). A Comparison of the structural relationships among
reading, listening, writing, and speaking components of the AP French Language
Examination for AP candidates and college students. (ETS Research Report No.
88-59). Princeton, NJ: Educational Testing Service.

Muthén, L. K., & Muthén, B. O. (2010). Mplus User’s Guide (6th ed.). Los Angeles, CA:
Muthén & Muthén.

Oller, J. W. Jr. (1974). Expectancy for successive elements: Key ingredient to language
use. Foreign Language Annals, 7, 443-452.

Oller, J. W. Jr. (1979). The factorial structure of language proficiency: Divisible or not?
In J. W. Oller, Jr., Language tests at school: A pragmatic approach (pp. 423-458).
London: Longman.

Oller, J. W. Jr. (1983). A consensus for the eighties? In J. W. Oller, Jr. (Ed.), Issues in
language testing research (pp. 351-356). Rowley, MA: Newbury House.

Oller, J. W. Jr., & Hinofotis, F. B. (1980). Two mutually exclusive hypotheses about
second language ability: Indivisible or partially divisible competence. In J. W.
Oller, Jr. & K. Perkins (Eds.), Research in language testing (pp. 13-23). Rowley,
MA: Newbury House.

Pimsleur, P., Stockwell, R. P., & Comrey, A. L. (1962). Foreign language learning
ability. Journal of Educational Psychology, 53(1), 15-26. doi: 10.1037/h0044336

Römhild, A. (2008). Investigating the invariance of the ECPE factor structure across
different proficiency levels. Spaan Fellow Working Papers in Second or Foreign
Language Assessment, 6, 29-55. Ann Arbor, MI: University of Michigan English
Language Institute. www.isa.umich.edu/eli/research/spaan

Sang, F., Schmitz, B., Vollmer, H. J., Baumert, J., & Roeder, P. M. (1986). Models of
second language competence: A structural equation approach. Language Testing,
3(1), 54-79. doi: 10.1177/026553228600300103

Satorra, A., & Bentler, P. M. (1994). Corrections to test statistics and standard errors in
covariance structure analysis. In A. von Eye, & C. C. Clogg (Eds.), Latent
variables analysis (pp. 399-419). Thousand Oaks, CA: Sage.

Sasaki, M. (1993). Relationships among second language proficiency, foreign language
aptitude, and intelligence: A structural equation modeling approach. Language
Learning, 43(3), 313-344. doi: 10.1111/j.1467-1770.1993.tb00617.x

Sasaki, M. (2007). Effects of study-abroad experiences on EFL writers: A multiple-data
analysis. The Modern Language Journal, 91(4), 602-620. doi: 10.1111/j.1540-
4781.2007.00625.x

Sawaki, Y., Stricker, L., & Oranje, A. (2008). Factor structure of the TOEFL Internet-
Based Test (iBT): Exploration in a field trial sample. (TOEFL iBT Research
Report No. 04; ETS Research Report No. 08-09). Princeton, NJ: Educational
Testing Service.

Sawaki, Y., Stricker, L. J., & Oranje, A. H. (2009). Factor structure of the TOEFL
Internet-based test. Language Testing, 26(1), 5-30. doi:
10.1177/0265532208097335

Segalowitz, N. (2004). Context, contact, and cognition in oral fluency acquisition:
Learning Spanish in at home and study abroad contexts. Studies in Second
Language Acquisition, 26, 173-199. doi: 10.1017/S0272263104062023

Scholz, G., Hendricks, D., Spurling, R., Johnson, M., & Vandenburg, L. (1980). Is
language ability divisible or unitary? A factor analysis of twenty-two English
proficiency tests. In J. W. Oller, Jr. & K. Perkins (Eds.), Research in language
testing (pp. 24-33). Rowley, MA: Newbury House.

Shin, S.-K. (2005). Did they take the same test? Examinee language proficiency and the
structure of language tests. Language Testing, 22(1), 31–57. doi:
10.1191/0265532205lt296oa

Shin, S. (2008). Examining the construct validity of a web-based academic listening test:
An investigation of the effects of response formats. Spaan Fellow Working
Papers in Second or Foreign Language Assessment, 6, 95-129. Ann Arbor, MI:
University of Michigan English Language Institute.
www.isa.umich.edu/eli/research/spaan

Skehan, P. (1991). Progress in language testing: The 1990s. In J. C. Alderson, & B. North
(Eds.), Language testing in the 1990s: The communicative legacy (pp. 3-21).
London: Macmillan.

Song, M. Y. (2008). Do divisible subskills exist in second language (L2) comprehension?
A structural equation modeling approach. Language Testing, 25(4), 435–464. doi:
10.1177/0265532208094272

Spearman, C. (1904). "General Intelligence," objectively determined and measured. The
American Journal of Psychology, 15(2), 201-292.

Spolsky, B. (1968). Language testing: The problem of validation. TESOL Quarterly,
2(2), 88-94.

SPSS Inc. (2009). PASW® Statistics Base 18. Chicago, IL: SPSS Inc.

Stricker, L. J., & Rock, D. A. (2008). Factor structure of the TOEFL Internet-Based Test
across subgroups. (TOEFL iBT Research Report No. 07; ETS Research Report
No. 08-66). Princeton, NJ: Educational Testing Service.

Stricker, L. J., Rock, D. A., & Lee, Y.-W. (2005). Factor structure of the LanguEdge™
Test across language groups. (TOEFL Monograph Series Report No. 32).
Princeton, NJ: Educational Testing Service.

Swinton, S. S., & Powers, D. E. (1980). Factor analysis of the Test of English as a
Foreign Language for several language groups. (TOEFL Research Report No.
06; ETS Research Report No. 80-32). Princeton, NJ: Educational Testing Service.

Taylor, C. (1993). Report of TOEFL score users focus groups. TOEFL 2000 Internal
Report. Princeton, NJ: Educational Testing Service.

Turner, C. E. (1989). The underlying factor structure of L2 cloze test performance in
Francophone, university-level students: Causal modeling as an approach to
construct validation. Language Testing, 6(2), 172-197. doi:
10.1177/026553228900600205

Upshur, J. A., & Homburg, T. J. (1983). Some relations among language tests at
successive ability levels. In J. W. Oller, Jr. (Ed.), Issues in language testing
research (pp. 188-202). Rowley, MA: Newbury House.

Vollmer, H. J., & Sang, F. (1983). Competing hypotheses about second language ability:
A plea for caution. In J. W. Oller, Jr. (Ed.), Issues in language testing research
(pp. 29-79). Rowley, MA: Newbury House.

Wang, S. D. (2006). Validation and invariance of factor structure of the ECPE and
MELAB across gender. Spaan Fellow Working Papers in Second or Foreign
Language Assessment, 4, 41-56. Ann Arbor, MI: University of Michigan English
Language Institute. www.isa.umich.edu/eli/research/spaan

Widaman, K. F. (1985). Hierarchically nested covariance structure models for multitrait-
multimethod data. Applied Psychological Measurement, 9, 1–26. doi:
10.1177/014662168500900101

Wolf, M. K., Kao, J., Herman, J., Bachman, L. F., Bailey, A., Bachman, P. L.,
Farnsworth, T., & Chang, C. (2008). Issues in assessing English language
learners: English language proficiency measures and accommodation uses—
Literature review (Part 1 of 3). (CRESST Report No. 731). Los Angeles, CA:
CRESST/UCLA.

Woods, A. (1983). Principal components and factor analysis in the investigation of the
structure of language proficiency. In A. Hughes, & D. Porter (Eds.), Current
developments in language testing (pp. 43-52). New York: Academic Press.
