A Comparative Study of Construct Validation of Graduation English Proficiency Tests between Universities in Mainland China and Taiwan
Byron Gong
Department of English Language and Literature, Soochow University, Taipei, Taiwan, R.O.C.


Keywords: construct validation, college graduation English test, washback effect, high-stakes test

Abstract


This paper reports findings from an analytical study of how construct validation is reflected in the graduation English tests that have been widely conducted at universities in Mainland China and Taiwan. The findings relate to key test perceptions of construct validation for test designers and stakeholders to consider, should the mandatory testing systems in use be endorsed by the educational authorities in the greater China area. This study analyzes documentary and empirical data to assess how significantly construct validation is represented in these tests. Evidence shows that the construct validation behind college graduation English proficiency tests such as the CET-4 test in Mainland China and the GEPT (High-Intermediate) test in Taiwan has been effectively and moderately represented in the test content. Both the CET and GEPT tests have a fairly sound rationale to be selected as qualified English proficiency tests for universities at grass-roots level to use. The CET test in the Mainland appears to have exerted a more powerful washback influence on its national college English programmes, while the autonomous and decentralized language assessment approach within universities in Taiwan provides autonomy in defining the test construct of ability according to each university's own local needs. It is suggested that universities in Taiwan should consider using a unified testing system as a kind of benchmark graduation English proficiency test, although such a unified testing system may not be the only tool to be adopted for assessing college students' English competence, so as to avoid college ELT programmes becoming test-driven.


INTRODUCTION


The purpose of this paper is to present a comparative study of how significantly test constructs can be reflected in the graduation English tests used at universities in Mainland China and Taiwan respectively. To what extent a large-scale test is designed to measure the intended test contents effectively and satisfactorily can be considered the primary concern for language test designers. In this sense, construct validity of such a test could become one of the most important considerations for test designers. This is especially critical when a test becomes high stakes, because the outcomes of a high-stakes test can be closely linked with test candidates' access to upward socio-economic mobility. This paper provides studies of large-scale benchmark tests for college graduation across the Taiwan Strait. College students in Mainland China are under enormous pressure from a mandatory national testing system, i.e. the College English Test (CET-4), while their counterparts in Taiwan have to face the General English Proficiency Test (GEPT, at its High-Intermediate level) or other tests designed by individual universities as equivalent to that level. Many universities in Taiwan have demanded that undergraduate students pass the GEPT or other similar tests before graduation. To enhance the quality of their English education, universities in both the Mainland and Taiwan are trying their best to carry out an unprecedented educational movement in terms of implementing a graduation English threshold test for all university students (Cheng 2008; Yang 2009; Tsai and Tsou 2009). Against this background, construct validity could become one of the most important considerations for test designers, and it is significant to assess the construct validation of such tests and the implications for all test stakeholders. However, until recently there has been no research focused on test validation comparing the CET and GEPT tests in general, and little is known about their features concerning construct validity in particular. Thus, this research attempts to fill a gap in this field. On the other hand, as can be seen from the large scale and number of test candidates in Mainland China and Taiwan, it is impossible to discuss in depth how construct validity is reflected in every aspect of national tests such as the CET-4 and GEPT (High-Intermediate) in a paper of this scope. This paper only serves to introduce the issues and concerns of test construct validation. Therefore, it focuses only on some key aspects of how construct validation is reflected in the English language testing systems at Chinese universities. As this paper is not a case study, test construct validation is discussed only at the macro level, viewed from the angle of national interest and of local colleges or universities. (Note that in this paper, "college" and "university" are used interchangeably, and both words refer to institutions of higher education within the Chinese context.) There are different kinds of English proficiency tests designed
individually by local universities, but passing the CET or GEPT can be considered a prerequisite for (non-English major) college students to graduate from a university in both the Mainland and Taiwan (Cheng 2008; Yang and Weir 1998; MOE 2005). The grass-roots policies at most universities in the Mainland and Taiwan stipulate clearly that college students have to pass the CET or GEPT test in order to graduate. Therefore, for the purpose of this research, the researcher focuses on the CET-4 in Mainland China and its counterpart, the GEPT High-Intermediate, in Taiwan, rather than other levels of the GEPT test that are irrelevant to this paper. Other grass-roots English proficiency tests, such as the SCUEPT test designed locally by Soochow University in Taipei, which is similar to the GEPT, may be studied in order to provide an example of what universities are doing at grass-roots level in this respect. Although the number of test candidates of the CET in the Mainland is much larger than that of the GEPT (or other kinds of threshold tests, such as the SCUEPT) in Taiwan, the two kinds of English proficiency tests share astonishing similarities in many important aspects such as test nature, purpose, format, method, and components. As it is a common practice to conduct comparative studies between regional or local English proficiency tests designed and applied by individual universities and national or international ones such as TOEFL or IELTS, it is significant to conduct a comparative study between the CET-4 and GEPT (High-Intermediate) across the Taiwan Strait, although only a narrow aspect of the construct validation of such tests can be probed within the limited space of this paper. In addition, it is significant to have a mutual understanding of such similar tests, even though the test constructs and washback effect may be interpreted differently by different test stakeholders. A comparative study of the test constructs of these similar English proficiency tests would not only be beneficial to the academic circles across the Taiwan Strait, but may also address the need for possible mutual accreditation of test results. The researcher hopes that the results of this paper will provide a new starting point and constructive implications for a possible exchange of experience in large-scale English language testing in Taiwan, as part of the momentum of the search for a solution to quality ELT assessment. Hopefully, the curriculum designers of college ELT programmes could also gain considerable insight from this study of test constructs for quality college ELT teaching.


THEORETICAL FRAMEWORK


Construct validation is one of the most important considerations for test designers. Weir (2004) pointed out that "test validation is the process of generating evidence to support the well-foundedness of inferences concerning trait from test scores"; essentially, testing should be concerned with evidence-based validity. It is therefore significant to assess the construct validation of the CET, the GEPT, and other local tests used in the Mainland and Taiwan.

However, until recently there was no research focused on test validation comparing the CET and GEPT tests in general, and little was known about their features concerning construct validity in particular. Therefore, the a priori hypothesis in this study involves an assumption that the test rationale applied in test development will have an important bearing on test validity in general and consequential validity in particular. In the light of the trait of construct validity, evidence will be the focal point of this research. The theoretical framework of this paper is linked with a central issue: how construct validation can be reflected in the CET test in Mainland China and the GEPT test (or other grass-roots tests) in Taiwan. To begin with, it is necessary to take a look at the meaning of validity and validation. Validity in language testing has traditionally been understood as the attempt "to discover whether a test measures accurately what it is intended to measure" (Hughes, 2003, p.26), while validation can be modestly understood as the whole process of investigating and verifying evidence to provide sufficient justification for designing and conducting a language testing system in a certain way. In other words, a sound language testing system should be accountable for why a test is used in that way. As for the nature of this kind of process, it is a never-ending process (Fulcher and Davidson, 2007). In his elucidation of methods of test validation, Xi (2008, p.177) described: "Validity is a theoretical notion that defines the scope and nature of validation work, whereas validation is the process of developing and evaluating evidence for a proposed score interpretation and use. The way validity is conceptualized determines the scope and the nature of validity investigations and hence the methods to gather evidence. Validation frameworks specify the process used to prioritize, integrate, and evaluate evidence collected using various methods." Xi's explanation provides a comprehensive understanding of the relationship between validity and validation.

Furthermore, it is important to understand what construct validation means for the purpose of this research. It is believed that construct validation involves an analysis of the qualities that a test is intended to measure, thus providing a basis for the rationale of a test (Weir 2004; Bachman and Palmer, 1996). According to Cronbach (1984, p.149), construct validity is also "a fluid and creative process". Therefore, for the purpose of this research, the researcher adopts the following conception in this paper: construct validity of a language test refers to an indication of how representative it is of an underlying theory of language learning, and construct validation involves an investigation of the qualities that a test measures, thus providing a basis for the rationale of a test (Davies, Brown, Elder, Hill, Lumley, and McNamara, 1999). As Bachman (2004, p.15) wrote, "A construct, then, is an attribute that has been defined in a specific way for the purpose of a particular measurement situation." In still other words, construct validity is concerned with the question: "Is the study actually investigating what it is supposed to be investigating?" (Nunan, 1992). In simple terms, construct
means the idea used to support a test designer's decision as to why s/he should design or construct a test in a certain way. This seems simple, but the whole procedure of verifying construct validation can be extremely challenging, and there are various approaches to the appropriate explanation of construct validation from different perspectives (Fulcher, 2007, pp.183-191). In commenting on Cronbach and Meehl's 1955 work, Fulcher (2007, p.10) also pointed out that central to understanding score meaning lies the question of what evidence can be presented to support a particular score interpretation. This requires that test designers be accountable for the tests they create. Nevertheless, as Alderson (1995, p.183) pointed out, construct validity is the most difficult concept to explain. In language testing, it is believed that construct validity refers to the totality of evidence about whether a particular operationalization of a theoretical framework adequately represents what is intended by the theoretical account of the construct being measured. It is not a simple measurement of one item such as an internal reliability coefficient. Bachman and Palmer (1996) proposed the dominant notion of test usefulness, which contains six qualities: validity, reliability, authenticity, interactiveness, and impact, as well as practicality. As for the conception of construct validation, Xi (2008) has pointed out that the constructs of language tests have become increasingly complex and may go beyond what has traditionally been defined. Therefore, regarding the nature of the CET and GEPT, it would be a huge project to conduct a thorough study of the construct validation of a large-scale test such as the national CET test in Mainland China or the GEPT in Taiwan. Any one of the following issues, such as correlational evidence, content validity, factor analysis, test usefulness, expert evaluation, or the multitrait-multimethod approach, would be a daunting project in itself. A comprehensive study of all possible aspects of construct validity is certainly beyond the scope of this paper. Hence, due to the limited space, the researcher conducts a modest study focused on some important aspects, as a starting point for further research.

Viewed broadly, the researcher holds that a good language test should be well supported by the rationale behind the test, and that a good language test can bring about positive backwash effects on language teaching. The researcher holds that construct validation should be studied so as to see to what extent the rationale for a high-stakes test can be effectively reflected in the test purposes, contents, and results, with positive backwash effects in terms of social mobility. This study involves a supposition that positive backwash effects on national English test programmes can be enhanced only when construct validation is supported by the needed rationale. Regarding the influence of test washback, construct validity of the above-mentioned high-stakes tests (the CET and GEPT) in the Chinese context can be viewed at two levels: the national and grass-roots levels. At the national level, construct validity can be viewed from its symbolic value for the state interest. In other words, the symbolic value of construct validation is related to the strategic needs of
national college English education, i.e. the primary stakeholders' needs, and to how it can provide positive support for certain English test policies to be continued at the macro level. Meanwhile, at the grass-roots level, construct validity is more related to its functional value for end-users (teachers and students), which concerns how universities can have a reliable and valid evaluation system for the purpose of quality English education at college level. Therefore, the fundamental question raised in this paper is to what extent construct validation can be reflected and quantified as significant in college English graduation tests in Mainland China and Taiwan.

RESEARCH METHOD


For the purpose of this research, both documentary and empirical data of the CET-4 test in Mainland China and the GEPT High-Intermediate test (and a representative local sample test, the SCUEPT) in Taiwan are used in this paper. The reason why this paper focuses on the GEPT (High-Intermediate) is the following. The GEPT test has different versions for different levels (Elementary, Intermediate, High-Intermediate, Advanced, and Superior), and each set is a complete English proficiency test. The designer of the GEPT, the LTTC (Language Training and Testing Center), states that an examinee who passes the GEPT (High-Intermediate) "has a generally effective command of English. His/her English ability is roughly equivalent to that of a university graduate in Taiwan whose major was not English" (LTTC, 2003). Therefore, as this research is related to tests for non-English major college students, only the GEPT at High-Intermediate level is relevant and can be compared with its counterpart, the CET-4, which has the same test purpose and function. GEPT tests at other levels are, therefore, irrelevant to this research. It is important to realize that there is a difference in studying construct validation between psychological testing and language proficiency testing. Generally, in psychological testing, as it is difficult to tell what factors can really be considered valid causal factors in the development of mental tests and the analysis of data collected from these tests, researchers need to find which factors really cause a change of behavior among all possible factors, for example by using LCA (latent change analysis) (Windle, 2000). Broadly speaking, the techniques used by psychometricians to identify the influence of causal factors are relatively more complicated than those used for proficiency language testing. But in language proficiency testing, experienced test designers and teachers are usually not in a darkroom in which they need to discover what counts as a language skill. It is clear that these test designers do not have to verify whether the four language skills should be the test components in a high-stakes language proficiency test. Therefore, most English proficiency tests, such as TOEFL, IELTS, CET, and GEPT, contain the four language skills: listening, reading, writing, and speaking. Language testing, as Fulcher (2007) pointed out, is about doing. It is about creating
tests. In other words, at the macro level, language test designers do not need to find and prove that the four skills rather than others should be assessed in a foreign language proficiency test. The research approach to construct validation here is accordingly different from that of the more complicated psychometrics. Therefore, as far as the research method used in this paper is concerned, the researcher did not conduct a factor analysis in order to prove that the four language skills are important causal factors in one's language proficiency. Instead, construct validation is studied through various methods so as to gain a comprehensive understanding of this issue from different perspectives. In other words, the method used for studying the construct validation of language proficiency testing differs from that used for the construct validation of psychological tests. Although there is no single best way to study construct validity, the researcher specifically looks into some aspects that are commonly regarded as important by testing professionals. These aspects include the analysis of test specifications, internal reliability, internal correlations, factor analysis, the inter-subtest correlation matrix, content validity, etc., as mentioned previously (Alderson et al., 1995; Bachman, 2004; Fulcher and Davidson, 2007; Xi, 2008). Generally, an effective approach to the study of the construct validity of a language proficiency test is to look at the correlations among different test components. Regarding coefficients of correlation among different test components, researchers should be aware that it is not desirable for two components of a language proficiency test to have a very high coefficient of correlation, such as over 0.85. In this respect, Alderson et al. (1995, p.184) pointed out that one way of assessing the construct validity of a test is to correlate the different test components with each other. Since the reason for having different test components is that they all measure something different and therefore contribute to the total language ability that test designers want to test, we should expect these correlations to be fairly low, possibly in the order of +.3 to +.5. If two components correlate very highly with each other, say +.9, we might wonder whether the two subtests are indeed testing different traits or skills, or whether they are testing essentially the same thing. However, as far as the interpretation of coefficients of correlation is concerned, there is no absolute value. Normally, it is suggested that a correlation between +.3 and +.7 can be considered acceptable (Yang and Weir, 2000, p.61). For the purpose of this paper, the researcher considers any coefficient of correlation among different test components smaller than +.3 as rather low, because when coefficients are on the small side, such as +.2, the content validity of a test becomes very questionable. Of course, at the other end of the continuum, a rather high coefficient should also be rejected, since
there would be a question whether the two test components are testing essentially the same thing, as Alderson et al. pointed out (1995, p.184). As the CET-4 test is said to be a kind of criterion-related norm-referenced test (Jin, 2005; Yang and Weir, 1998), the researcher first looked into the criterion that the CET is related to, and studied how that criterion is reflected in the test specifications. It is believed that construct validation can be approached by studying the test specifications of the CET (or GEPT) for a start; as Alderson (1995) stated, analyzing test specifications is generally regarded as the correct starting point in studying the construct validity of a test. Then, step by step, the researcher analyzed empirical data of the CET, GEPT, and SCUEPT tests. These aspects included the analysis of internal reliability, internal correlations, factor analysis, the inter-subtest correlation matrix, content validity, and group differences. In particular, much of the first-hand empirical data on SCUEPT test results of college students in Taiwan was collected from a random selection of over 2000 non-English major 2nd-year students (from Soochow University in 2007 and 2008). The empirical results are also discussed in detail. Finally, the researcher discusses the effects of the benchmark exams used in Mainland China and Taiwan in terms of both their symbolic and functional values. The researcher holds that a probe of these aspects can yield sensible answers to the research question of this study. The discussion of relevant documentary and empirical data begins in the next section.
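To make the screening procedure described above concrete, the following minimal Python sketch (using the pandas library; the file name and subtest column labels are hypothetical, not the actual SCUEPT data format) computes an inter-subtest Pearson correlation matrix and flags coefficients outside the working range of +.3 to +.7 adopted in this paper:

import pandas as pd

# Hypothetical score file: one row per candidate, one column per subtest.
scores = pd.read_csv("subtest_scores.csv")  # e.g. columns: LS, LL, SC, FR, CR, Cloze

# Pearson inter-subtest correlation matrix (cf. Alderson et al., 1995, p.184).
r = scores.corr(method="pearson")
print(r.round(3))

# Flag pairs outside the range considered acceptable in this paper.
for i, a in enumerate(r.columns):
    for b in r.columns[i + 1:]:
        coef = r.loc[a, b]
        if coef < 0.3:
            print(a, b, round(coef, 3), "rather low: content validity questionable")
        elif coef > 0.7:
            print(a, b, round(coef, 3), "rather high: may be testing the same trait")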

GRADUATION ENGLISH TESTS AT UNIVERSITIES IN MAINLAND CHINA


The educational authorities in Mainland China have been promoting an unprecedented English language testing system at universities, with millions of test candidates each year. From the very beginning, the British Council has been providing support to the CET Committee, and CALS, the applied linguistics research centre at the University of Reading in Britain, has been responsible for research on the validation of the CET. The standardized national CET (College English Test), which was introduced by the National College English Testing Committee (NCETC) on behalf of China's Ministry of Education in 1987 and revised in 2005, has been such a high-stakes college English proficiency test that undergraduate university students in China are required to take it before graduation (Han et al., 2004). The number of CET candidates increases every year. In the 1995 academic year, 583,135 students in China took the CET, with a passing rate of 66% (Yang and Weir, 1998); 9.58 million students took the test in the 2005 academic year (Jin, 2005). In reality, the CET has become such a high-stakes benchmark test that most universities demand that students pass the CET in order to obtain their bachelor's degree. Although China's Ministry of Education altered its test policy in 2005 by stating that the CET is not to be linked with students' graduation, college students still consider the CET test crucial because the CET
test certificate is an important criterion for many employers to consider at a job interview (Cheng 2008). In other words, viewed from its functional value, it is an irreversible trend for millions of Chinese college students to take the CET test. Although there are negative voices against the CET (Han et al., 2004), it is generally believed that such a mandatory standardized English language testing system has brought about a cumulative positive effect on quality teaching in college English education in Mainland China (Jin, 2005; Yang and Weir, 1998). Therefore, being linked with both educational and social status, the CET test has been high-stakes for 20 years since its first launch in 1987.

Construct Validation Reflected in the CET-4 Test Specifications (Mainland China)

How much the test specifications of the CET-4 (College English Test) reflect the intended requirements of China's national teaching syllabus is considerably relevant to the degree to which construct validation can be fully represented in terms of its symbolic value. The CET is a national standardized test designed according to China's National College English Teaching Syllabus for Non-English Majors 1999 (revised in 2007). This national syllabus stipulates specific quantitative requirements for college students to achieve in terms of their English language proficiency, and the skills of reading and listening are of paramount importance (http://edu.people.com.cn/GB/8216/43375/5995154.html). The CET tests have two basic versions, CET-4 and CET-6. The CET-6 is for students who have passed the CET-4 and have taken elective English courses of Band 5-6. The CET Spoken English Test (CET-SET) is administered only to a very small number of students who choose to take it, on the condition that these students have passed the CET-4 with a score of 80 or above out of a full score of 100, or the CET-6 with a score of 75 or above. However, only the CET-4, which is the focus of this paper, is considered the benchmark test that virtually all undergraduate students need to pass, and the CET-4 test is administered twice a year, in January and June. According to the CET Committee (2006), there are four main components in the CET-4 test: Listening Comprehension (35%: dialogues and long talks), Reading Comprehension (35%), Cloze (15%: one cloze passage and sentence translation), and Writing (15%: one short essay). As for the test specifications of the CET-4, the in-house report states that the guiding principle is to reflect the requirements of the national syllabus. According to the research report by Yang and Weir (1998), the CET-4 test specifications have generally met the requirements of the national syllabus. As the test is meant to help implement this national teaching syllabus, the CET test designers paid attention to the following aspects so as to construct a theoretical framework for the CET test: 1) The relationship between knowledge and ability: conceptually, language is a tool for communication. The ultimate aim of
ELT is to ensure that students can use English to communicate; therefore, the CET should test language skills rather than language knowledge. 2) The relationship between fluency and accuracy: the designers of the CET have set specific speed requirements for reading, listening and writing (i.e. 50 wpm and 129 wpm are set for reading and listening in the CET-4 test). 3) The relationship between sentence understanding and discourse comprehension: as communication is based on discourse comprehension, the CET should take into consideration not only sentence structures but also the ability to understand discourse. 4) The relationship between receptive ability and productive ability: the CET specifications require that both passive and active skills be examined. According to the research report by Yang and Weir (1998), the CET-4 test is designed according to the above four major considerations, which constitute the basis for its construct validation at a macro level. The symbolic value of the construct validation of the CET test can therefore be indicated by the degree to which the CET test is accepted by both the educational authorities and university English teachers. According to an official survey by China's CET Committee (2006), the CET has successfully achieved the aims of its test specifications, and the construct validation based on the theoretical framework is well represented in each delivery of the CET test. Specifically, the statistics provide the following implications. The internal reliability of objective items in the CET test reaches 0.9 or above each time the CET test is administered, indicating high reliability. A series of questionnaire studies on the CET has indicated that 92% of college teachers in China agree that the CET test can effectively reflect students' actual English proficiency level, indicating high validity in terms of expert judgment. As the CET is a criterion-related norm-referenced test, the passing score set in the CET correlates with the teachers' assessment of the test candidates' passing score with a correlation coefficient of 0.82. In addition, the CET test scores correlate with the order of class assessment results given by teachers with a correlation coefficient of 0.7, which is very good because it is difficult to achieve such a high coefficient in large-scale standardized tests. Over 86% of college teachers agree that the contents of the CET are appropriately designed and that each part has a proper weighting. The CET has a complete testing system, including item bank management, test formation and organization, administration, statistical analysis of test results, test fairness, and practicality.
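To illustrate what the internal reliability figure above measures, here is a minimal Python sketch of Cronbach's alpha computed from a candidates-by-items score matrix (a generic textbook formula, not the CET Committee's own procedure; the demo data are hypothetical):

import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    # items: candidates x items matrix of item scores
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of candidates' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 500 candidates, 80 dichotomous items driven by a shared
# ability factor, so that the items correlate and alpha comes out high.
rng = np.random.default_rng(0)
ability = rng.normal(size=(500, 1))
demo = (ability + rng.normal(size=(500, 80)) > 0).astype(float)
print(round(cronbach_alpha(demo), 3))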

Therefore, to conclude this part, it is clear that the construct validity of the CET is mainly associated with the state interest at the national level. In other words, the test specifications of the CET reflect governmental initiatives for the centralization and standardization of language testing at a national level, with a centralized definition of the ability construct. Furthermore, empirically, China's national educational authorities have gained solid statistical support for their CET policy to be continued nationwide. Next, the graduation benchmark testing system in Taiwan will be discussed.

GRADUATION ENGLISH TESTS AT UNIVERSITIES IN TAIWAN


In Taiwan, there is not a mandatory island-wide English proficiency test set by Taiwan's Ministry of Education for undergraduate students to take for graduation. Unlike its counterpart in the Mainland, the educational authority in Taiwan has been carrying out an American style of autonomous and decentralized language assessment within colleges and universities. The educational authority in Taiwan has transferred power to each individual university, which gives universities more freedom to decide what kind of English proficiency is needed for their undergraduate students according to each university's own principles, and many universities in Taiwan have recently announced that they will carry out their own benchmark English testing systems for college students. In other words, a kind of threshold English test is about to be carried out in the near future across university campuses in Taiwan. In addition, undergraduate students can also take other English proficiency tests as proof of their English proficiency before they finish their 4-year university education, such as the GEPT (General English Proficiency Test, a criterion-referenced test set by a non-governmental organization in Taiwan), TOEIC, IELTS, or TOEFL. Nevertheless, the local GEPT test is the most popular English proficiency test for college students to take, although the passing rate for college students is around 32% (LTTC, 2007). Meanwhile, among the English proficiency tests designed by individual universities at grass-roots level in Taiwan, different universities have their own testing systems and criteria. In contrast to the common practice of using an achievement test in the last term of the university programme, the SCUEPT (Soochow University English Proficiency Test) appears to be at the forefront of the campaign for a standardized English proficiency threshold test for undergraduate students to pass for graduation. Many other universities are now also trying to design their own proficiency tests. By and large, it is clear that a variety of benchmark English tests will soon become high-stakes tests for thousands of college students in Taiwan, as such test certificates help college graduates to have better opportunities in the job market, too.

Construct Validation in the GEPT/SCUEPT Test Specifications (Taiwan)

Now let us take a look at the benchmark graduation English tests for college students in Taiwan. There are more than 150 officially accredited universities in Taiwan. However, there is no cohesive paradigm of college English assessment at the tertiary level, and there exists no specified requirement from the educational authorities in Taiwan demanding that all college graduates take an English proficiency test before they graduate. As mentioned earlier, the educational authority in Taiwan has transferred power to each individual university, which gives individual universities more freedom to decide what the graduation threshold test should be. According to the researcher's investigation, only some universities require their students to take the GEPT or other public tests as part of the requirements for graduation. According to a 2007 report on the GEPT, only 22% of GEPT testees took the GEPT in order to submit their test scores to their universities for reference. As for the specifications, the GEPT is not designed to test just college students' English proficiency, but the English proficiency of the general public, which is very different from the CET. In addition, there exists little research on the possibility of a large-scale mandatory testing system designed especially for college English education in Taiwan. (Note: There is a new test called the College Student English Proficiency Test (CSEPT) designed by the LTTC in Taiwan, but it is still at its experimental stage and virtually no analytical statistics have been released so far.) Notwithstanding this, the freedom from government control regarding a standardized and centralized assessment within universities in Taiwan reflects autonomy in defining the construct of ability, whose rationale could have a different framework for different socioeconomic purposes. However, the negative side of such autonomy in defining the ability construct can also cause various problems for local universities to solve. Practically, the general scenario of the evaluation of English programmes among universities in Taiwan is that test scores may be inconsistent and incompatible. In other words, the macro-relationship between college students' English competence and the applied evaluation methods in Taiwan is not clear, which means that different universities adopt their own methods of evaluating students' English proficiency, and such methods may be inconsistent from year to year and differ across departments, even from one individual teacher to another. Thus, it is far from the desirable situation in which testing results can be considered mutually comparable among colleges in Taiwan in both theory and practice. This is because the current evaluation criteria for college graduates' English proficiency in Taiwan are not based on an island-wide or nationally agreed standard, such as that of the CET-4 used in Mainland China. Therefore, the evaluation results produced by different colleges and universities in Taiwan are difficult to interpret in terms of statistical analysis at a national level. In addition, at present, graduation examinations across colleges and universities are mostly progress tests or achievement tests, and the contents of such achievement tests can be widely different from one university to
another on the basis that different teaching materials are used. In a sense, such assessment practice in English language programs can hardly provide reliable and valid evaluations of college graduates' English language proficiency when compared with each other. For example, let us look at the achievement test scores of the same English course at two campuses of the same university.

Table 1. Pearson Correlation of Freshmen English Scores at Two Campuses

                    Taipei Campus    Kaohsiung Campus
Taipei Campus       1.000            0.034
Kaohsiung Campus    0.034            1.000

Note. N=900, Year 2005.

Table 1 reflects the fact that test scores are not comparable even within the same university. No significant correlation can be found between the test scores at the two campuses (Kaohsiung and Taipei) of the same university, with a Pearson correlation of 0.034 (p<0.01). This may suggest that, as there are no established English test syllabus and test specifications for universities in Taiwan to follow, different tests are used for evaluation (Gong, 2004). In fact, each university has to take its own approach to the assessment of its English programmes at different levels. Hence, in the light of different teaching materials, the tests used for graduation examinations, if required, are unsurprisingly related to different teaching materials. The test results are accordingly not comparable, due to the fact that the test contents differ, and there are hardly any test specifications written for such wide-ranging tests. Thus, as far as construct validity is concerned, the understanding and interpretation of language ability can vary across universities. With an autonomous and decentralized language testing system, each individual college or university may decide its own testing criteria according to its own needed rationale. Therefore, at the national level, the symbolic value of the construct validation of both the GEPT and individual college tests in Taiwan appears limited when compared with that in the Mainland, where the construct of the CET test is closely related to the national English teaching syllabus. Meanwhile, for universities at grass-roots level, their test results appear to be inconsistent and incompatible. Thus, English language testing can be said to be satisfactory and successful only in terms of each university's own interpretation of the needed test construct, or ability.

EMPIRICAL ANALYSIS OF THE GEPT, CET AND SCUEPT


The writer has discussed how test validation of the CET and GEPT tests can be viewed from the aspect of test specifications in the previous Sections IV and V. Now, let us take a look at how construct validation can be viewed
from the aspect of internal reliability, internal correlation coefficients, and factor analysis of these tests, while construct validity linked with content validity and expert evaluation will be discussed in the next sections. Test designers in both Mainland China and Taiwan have paid much attention to the issue of validity in their tests at different levels. Let us look at some more empirical data so as to have a better view of the CET-4 test and the GEPT (or SCUEPT). First, the reliability results of the above three tests are reported as follows: the reliability (Cronbach's alpha) of the CET is 0.9 (Yang and Weir, 1998; Yang, 2009), that of the GEPT is said to be 0.85 (LTTC 2003, p.23), and that of the SCUEPT is 0.86 (Gong). The CET has kept its reliability at 0.9 for nearly 20 years, which is remarkably high.

Internal Correlation Coefficients of the GEPT Test in Taiwan

As mentioned previously in Section III (Research Method), the consideration of internal correlation coefficients is important and useful for evaluating the construct validation of these tests. The following Tables 2-6 show the internal correlation coefficients of the GEPT test in Taiwan.

Table 2. Internal Correlation Coefficients of the GEPT (Year 2000)

            Test A (N=375)           Test B (N=360)
Component   AL1    AL2    AL3        BL1    BL2    BL3
L1          1                        1
L2          0.777  1                 0.822  1
L3          0.779  0.792  1          0.788  0.839  1
R1          0.591  0.629  0.620      0.636  0.651  0.648
R2          0.590  0.624  0.605      0.663  0.684  0.713
R3          0.598  0.648  0.680      0.695  0.730  0.770
Note: AL1 = Test A Listening Part 1; AL2 = Test A Listening Part 2; AL3 = Test A Listening Part 3; BL1 = Test B Listening Part 1; BL2 = Test B Listening Part 2; BL3 = Test B Listening Part 3. The abbreviations for the test components are made by the writer of this paper. ** p ≤ 0.01 (High-Intermediate Report, LTTC, 2000, p. L-12).

Table 2 shows that all the internal correlation coefficients of the two GEPT tests A and B are quite high within the same component, Listening, especially in Test B, and fairly good between the two components of Listening and Reading, with the lowest at 0.590. However, a few of the correlation coefficients of Test B are somewhat too high, being beyond the range of +.3 to +.7. In the same year, 2000, another version of the GEPT was delivered by the LTTC, and the following Table 3 reveals the relevant internal correlation coefficients of that GEPT test.

Table 3. Internal Correlation Coefficients of the GEPT (N=375, LTTC 2000)

Sub-test            Reading Part A   Reading Part B   Reading Part C
Reading Part A      1.000
Reading Part B      0.681            1.000
Reading Part C      0.686            0.722            1.000
Listening Part A    0.591            0.590            0.598
Listening Part B    0.629            0.624            0.648
Listening Part C    0.620            0.605            0.680
Note. ** p ≤ 0.01 (High-Intermediate Report, LTTC, 2000, p. R-11).

According to the LTTC GEPT reports, there seems to be a decreasing trend in the internal correlation coefficients of the GEPT (High-Intermediate), as revealed in the following Table 4 and Table 5.

Table 4. Internal Correlation Coefficients of the GEPT (LTTC 2003)

            Listening   Reading   Writing   (Speaking)
Listening   1.00
Reading     0.56        1.00
Writing     0.50        0.70      1.00
Speaking    0.61        0.38      0.49      1.00
Note. ** p ≤ 0.01 (High-Intermediate Report, LTTC, 2003, p.31).

Table 5. Internal Correlation Coefficients of the GEPT (LTTC 2007)

            Listening   Reading   Writing   (Speaking)
Listening   1.00
Reading     0.37        1.00
Writing     0.19        0.32      1.00
Speaking    0.39        0.23      0.38      1.00
Note. ** p ≤ 0.01 (High-Intermediate Report, LTTC, 2007, p.2).

Another phenomenon reflected in the above Tables 2-5 is that the internal correlation coefficients of the GEPT (High-Intermediate) are not very consistent, which can be clearly observed in Table 5. Researchers need to pay more attention to this inconsistency. Looking at Tables 4 and 5, we can see that the inter-subtest correlations of the GEPT in 2003 and 2007
generally appear to be on the small side of a commonly used range, i.e. +.3 to +.7, when compared with those of the CET, whose figures are mostly around 0.5 (p<0.01). This might indicate that the GEPT subtests are rather disintegrative with regard to language communication skills. Probably, further efforts are needed to achieve less divergent construct validity in the test, which requires keeping a fair balance between convergent and divergent validity by using test items with better discrimination and difficulty indices. The internal correlation coefficients of the GEPT 2000 test appear to be more convergent than those of the 2007 test. It is revealed in the 2000 GEPT Report that the high coefficients between Listening and Reading in the GEPT may be caused by the fact that the test format and content are similar, i.e. both are of paragraph comprehension (LTTC, 2000). Nevertheless, in general, the GEPT 2000 and 2003 Reports provided quite satisfactory figures in terms of internal correlation coefficients.

Internal Correlation Coefficients of the CET Test in China

On the other hand, let us now take a look at the CET-4 test in the Mainland. According to China's National CET Committee, the CET-4 test is a standardized test designed with the cooperation of the British Council (Yang and Weir, 1998). Research on the test validation of the CET is mainly the responsibility of the British team; specifically, CALS, the applied linguistics research centre at the University of Reading, is in charge. As Cheng (2008) pointed out, much research has been conducted on the validity of a widely-used test format, multiple-choice questions in objective tests, the main feature of the CET test. A fair number of empirical studies have been conducted in China and published in Chinese academic journals, among them many case studies concerning the validity of the CET test. For example, Zhou (2004), who conducted a comparability study of the CET, found a Pearson correlation coefficient of 0.712 (p<0.01) between two CET tests. Regarding the internal correlation coefficients of each part of the CET-4 test, Yang and Weir (1998) provided valuable research results. The following Table 6 shows that the internal correlation coefficients of each part of the CET-4 are between 0.3 and 0.7 (p<0.01). According to a number of studies (Yang and Weir 1998; Yang and Jin 2000; Jin 2005; Cheng 2008), it is worth noticing that the CET-4 has kept a stable trend of internal correlation coefficients, with no obvious ups and downs over the past 20 years. Therefore, compared with Table 6, the internal correlation coefficients of the GEPT are less desirable; in particular, the coefficient between Listening and Writing in the GEPT 2007 (Table 5) is on the small side in comparison with that of the CET shown in Table 6, although writing could be an important factor pulling down the figures of the GEPT in 2007, as the LTTC explains (LTTC, 2007). The phenomenon shown in Table 2 (LTTC 2000) and Table 5 (LTTC 2007) might be a factor for GEPT test designers to consider in improving the GEPT test items and components, so as to achieve a more stable trend in the internal correlation coefficients of the GEPT (High-Intermediate) test.
Table 6. Internal Pearson Correlation Coefficients of the CET-4

        LC      RC      VS      CL      WR      Total
LC      1.000
RC      0.563   1.000
VS      0.539   0.615   1.000
CL      0.467   0.531   0.626   1.000
WR      0.388   0.359   0.470   0.404   1.000
Total   0.792   0.892   0.802   0.707   0.581   1.000
Note. LC = listening comprehension; RC = reading comprehension; VS = vocabulary & structure; CL = cloze test; WR = writing (Yang and Weir, 1998, p.60). ** p ≤ 0.01 (double-checked directly by email with the author in February 2009).

A Look at the Local SCUEPT Test in Taiwan

Regarding the grass-roots English proficiency tests designed and applied by local universities in Taiwan, let us take a look at the SCUEPT test designed by Soochow University in Taipei as a representative grass-roots test for the purpose of this paper. This is mainly due to the fact that the SCUEPT is a complete and relatively sound testing system used as a benchmark test for college graduation. Since 2007, the reliability coefficients of the SCUEPT have been +.803, .853, .831, and .819 respectively, which provides good evidence of its internal consistency. Furthermore, let us take a look at the inter-subtest correlation matrix, because it is one of the popular methods for assessing the construct validation of a test. The writer therefore conducted a study by correlating the different test components with each other, i.e. by means of the inter-subtest correlation matrix designed for establishing the construct validity of a test. The correlation matrix in Table 7 below shows the inter-subtest correlation coefficients among the SCUEPT scores at Soochow University in Taiwan, computed by the researcher in 2009. Table 7 reveals that the correlation coefficients of each SCUEPT subtest with the total score appear to be very desirable, ranging from Short Listening to Cloze at 0.792, 0.655, 0.691, 0.607, 0.662, and 0.463 respectively. As for the inter-subtest correlations, the internal correlation coefficients among the test components are generally within the acceptable range (Alderson et al., 1995, p.184), despite being on the small side of a commonly used range, i.e. +.3 to +.5 (p<0.01). But we may notice that the internal correlation coefficients of Cloze are quite low, at 0.226, 0.200, 0.214, 0.174, 0.191, and 0.223 respectively. However,
the test time, format, and skills used in doing the Cloze differ considerably from those of the other test components. Based on the analysis of the test papers, the writer found that many students did not have time to finish the last test component, i.e. the Cloze, within the specified test time. This could be the main causal factor in its low correlations with all the other test components, indicating that most college students need more training in speed reading.

Table 7. (Pearson) Inter-subtest Correlation Matrix of the SCUEPT Test, 5/2009

             Total Score  LS       LL       SC       FR       CR       Cloze
Total Score  1
LS           .792**       1
LL           .655**       .542**   1
SC           .691**       .367**   .273**   1
FR           .607**       .392**   .325**   .393**   1
CR           .662**       .350**   .268**   .379**   .319**   1
Cloze        .463**       .226**   .200**   .214**   .174**   .223**   1

Note. N=1660. LS = listening short conversations, LL = listening long talks, SC = sentence completion, FR = fast reading, CR = careful reading. ** p ≤ 0.01 (2-tailed).

Meanwhile, the inter-subtest correlations of all the other test components appear to be quite acceptable according to the criterion set by Alderson et al. (1995, p.184), who wrote that "we should expect these correlations fairly low possibly in the order of +.3 +.5". Thus, on the whole, the internal correlation coefficients of the SCUEPT test at Soochow University in Taiwan prove to be sound because (1) the correlation coefficients of each SCUEPT subtest with the total score appear to be very desirable, and (2) the inter-subtest correlations of six out of seven subtests are generally within the acceptable range of +.3 to +.5. The strong correlations between the subtests and the total score may provide some grounds for the construct validation of the SCUEPT, while the weak coefficients may indicate test components that need improving in the SCUEPT test battery. Therefore, the internal correlation coefficients of the SCUEPT test at Soochow University, which represents a typical grass-roots university in Taiwan, can be considered quite supportive of test construct validity for a locally designed English proficiency test for college graduation. Regarding concurrent validity, the writer of this paper also administered a reading test of the SCUEPT and the GEPT to 120 students within two weeks. Then a paired-samples t test was conducted to see whether the SCUEPT differs greatly from the GEPT. Table 8 shows that there is not a statistically significant difference between the SCUEPT and GEPT (p = .074 > .05), which is
a clear index allowing us to tentatively describe these two tests as similar.

Table 8. Paired-Samples T Test of SCUEPT and GEPT (Reading Test)

                                        95% Confidence Int. of Difference
                Mean    SD      SEM     Lower    Upper    t       df    Sig.
Pair 1
SCUEPT - GEPT   1.048   4.576   .576    -.104    2.199    1.818   62    .074

Note. p > .05.
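A minimal Python sketch of the paired-samples procedure behind Table 8 follows (using scipy; the score arrays are hypothetical stand-ins for the actual SCUEPT and GEPT reading scores):

import numpy as np
from scipy import stats

# Hypothetical paired reading scores of the same students on both tests.
scuept = np.array([72, 65, 80, 58, 69, 74, 61, 77])
gept = np.array([70, 68, 75, 60, 70, 71, 63, 74])

t, p = stats.ttest_rel(scuept, gept)   # paired-samples t test
print(f"t = {t:.3f}, p = {p:.3f}")     # here p > .05: no significant mean difference

r, p_r = stats.pearsonr(scuept, gept)  # concurrent correlation between the tests
print(f"r = {r:.3f}, p = {p_r:.3f}")   # a high r supports concurrent validity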

Next, the Pearson correlation between the two tests was calculated, as illustrated in Table 9. There was a statistically significant positive correlation between the two tests, r = 0.72 (p = .000). Statistically, this means they are positively related at a high level.

Table 9. Correlations between SCUEPT and GEPT (N=120)

                              SC        GEPT
SC      Pearson Correlation   1         .720**
GEPT    Pearson Correlation   .720**    1

Note. ** p ≤ 0.01 (2-tailed).

In other words, the writer cautiously holds that the usefulness of the SCUEPT can be partly supported by its high correlation with the GEPT, which is practically very useful for the students. Hence, the writer maintains that the construct validation of the SCUEPT, viewed from the angle of test usefulness, is also supported by the evidence of its high correlation with the GEPT High-Intermediate.

SCUEPT Viewed from a Differential-Group Experiment

Empirically, another popular method of studying construct validity is the differential-group experiment, and the writer conducted a study using this method with the SCUEPT only. This is because the writer of this paper holds that both the CET and GEPT are relatively well-designed large-scale tests, and there is already much research literature on results based on the differential-group experiment showing positive support for the construct validity of the CET and GEPT (Yang & Weir, 1998; Yang & Jin, 2000; LTTC, 2007). Therefore, their results will not be separately presented, due to the limited space here. The intention of this method is to detect bias in the test for or against groups of students defined by their biodata characteristics (Brown, 2005, p.227). This means a researcher compares the performance of two groups on a test: one group that obviously has the construct (the daytime English majors in this study) and another group that has little or less of the construct (the night-time English majors and non-English majors in this study). If the first group scored high on the test
and the other group(s) scored low, this would be an argument for the construct validity of the test; that is, those who have the construct score higher on the test than those who do not (Alderson et al., 1995, p.185). In Bachman's view, differences in group performance can be stated in terms of group means, as follows:

X̄U1 > X̄U2 > X̄Uinto > X̄Uprep

If these differences were observed, this would be evidence in support of the claim that scores from this test would be useful for predicting future performance in a relevant language use domain, namely language use tasks in an academic setting (Bachman 2004, pp.290-292).
(Note: U1, U2, Uinto, and Uprep in the above expression refer to different levels of university students in the study conducted by Brown et al. (2002), which was further adopted by Bachman, 2004, pp.290-292.)
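The logic of the comparison can be sketched in a few lines of Python (the score vectors are hypothetical; the one-way ANOVA is added as one common way to test whether the group difference is significant, not as the exact procedure used in this study):

import numpy as np
from scipy import stats

# Hypothetical SCUEPT scores for the three groups compared in this study.
day_majors = np.array([68, 71, 66, 74, 63, 70])    # assumed to have the construct
night_majors = np.array([60, 58, 63, 57, 61, 62])  # assumed to have less of it
non_majors = np.array([52, 55, 49, 57, 50, 54])    # assumed to have the least

means = [g.mean() for g in (day_majors, night_majors, non_majors)]
print("group means:", [round(m, 2) for m in means])

# The construct-validity argument requires the predicted ordering of means.
assert means[0] > means[1] > means[2]

f, p = stats.f_oneway(day_majors, night_majors, non_majors)
print(f"one-way ANOVA: F = {f:.2f}, p = {p:.4f}")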

Therefore, the researcher of this paper followed this method to compare the performance of three groups of second-year students on the same SCUEPT English proficiency test. The first group consisted of daytime English majors, who were considered to have high ability; the second group consisted of night-time English majors; and the third group consisted of non-English majors, whose English ability was presumed to be the lowest. According to the empirical analysis presented in Table 10, the first group scored highest, the second group scored relatively low, and the third group, the non-English majors, scored lowest of all. Table 10 shows the result of the descriptive analysis of the test scores of the SCUEPT test administered in May 2009.

Table 10. Test Scores of the SCUEPT Among Three Groups of Students (2009)

Group                       N     Mean    Median   Mode    SD
Daytime English Majors      63    67.69   67.40    58.20   7.73
Night-time English Majors   57    59.46   60.00    58.8    8.67
Non-English Majors          224   53.44   54.00    61.00   10.95

The data consisted of 344 students at Soochow University: 63 daytime English majors, 57 night-time English majors, and 224 non-English majors. The mean scores for the three groups are X̄group1 = 67.69, X̄group2 = 59.46, and X̄group3 = 53.44 respectively. The medians for the three groups are 67.4, 60, and 54. Other variables show a similar tendency. As Table 10 shows, there is a clear difference in mean test scores among the three groups, and even between the two groups of English majors. Therefore, based on these statistical results, the researcher has a strong argument for the construct validity of the test scores of the SCUEPT test. That is to say, the test differentiated between students who have much of the SCUEPT proficiency construct (daytime English majors), those who have less (night-time English majors), and those who have very little of the proficiency construct (non-English majors). When taking other evidence into consideration,
such as the content validity and inter-subtest correlation matrix mentioned in the previous sections, the writer can form a convincing argument that the SCUEPT scores reflect the construct that the SCUEPT was designed to measure. So far this section has studied test construct validation from the aspect of internal correlation coefficients and other aspects of the three tests. The writer can cautiously conclude this section by stating that the notion of construct validity has not only been considerably represented in the CET, GEPT, and SCUEPT tests, but has also played an important role in the development of the test specifications and construction, although there is still much to be done. Next, it is worthwhile to take a look at how construct validation is reflected in the factor analysis of the CET and GEPT.

Factor Analysis of the CET and GEPT Test Components

As mentioned in Sections II and III, it is generally accepted that test construct validation can also be probed by factor analysis. Regarding the results of factor analysis of the CET test, China's National College English Test Committee has released a number of study results. Let us take a look at the 1998 statistics in Tables 11 and 12.

Table 11. CET-4 Factor Analysis (using Principal Components Analysis)

Factor   Eigenvalue   % of total variance   Cumulative %
1        3.00246      60.0                  60.0
2        .68722       13.7                  73.8
3        .55023       11.0                  84.8
4        .41964       8.4                   93.2
5        .34045       6.8                   100.2

Table 12. Factor 1 Matrix from Principal Components Analysis

Component   Factor 1
LC          .76497
RC          .80278
CL          .78927
VS          .85199
WR          .65115

Note. LC = listening comprehension; RC = reading comprehension; VS = vocabulary & structure; CL = cloze test; WR = writing. From Yang and Weir, 1998, p.227.

According to Tables 11 and 12, the eigenvalue of factor 1 in the CET is much greater than 1 and its contribution reaches 60%. Among the five factors extracted from the test components, factor 1 accounts for most of the variable contributions, with loadings ranging between 0.651 and 0.852. Therefore, factor 1
Next, Tables 13 and 14 show the results of the factor analysis of the GEPT (High-Intermediate) in its 2003 Report, which is so far the only report to have publicly released factor analysis results for the GEPT test.

Table 13. Factor Analysis of GEPT (High-Intermediate) Stage 1 Test (Listening and Reading)

Test Components                              Factor 1   Factor 2   Factor 3
Listening Comprehension: Q&A                 0.28       0.72       0.37
Listening Comprehension: Short Dialogues     0.33       0.75       0.29
Listening Comprehension: Short Talks         0.27       0.78       -0.04
Reading Comprehension: Voc & Structure       0.52       0.34       0.32
Reading Comprehension: Cloze Test            0.63       0.32       0.28
Reading Comprehension: Reading               0.70       0.36       0.25
Expl. Var                                    5.05       4.19       3.23
Prp. Totl                                    0.24       0.20       0.15

Note. From GEPT 2003 Report, p. 14.

Table 14. Factor Analysis of GEPT (High-Intermediate) Stage 2 Test (Listening, Reading, Writing, and Speaking)

Test Components                              Factor 1   Factor 2
Listening Comprehension: Q&A                 0.11       0.75
Listening Comprehension: Short Dialogues     0.11       0.71
Listening Comprehension: Short Talks         0.26       0.80
Reading Comprehension: Voc & Structure       0.63       0.32
Reading Comprehension: Cloze Test            0.63       0.28
Reading Comprehension: Reading               0.50       0.53
Writing: Chinese to English                  0.63       0.38
Writing: Guided Writing                      0.71       0.32
Speaking                                     0.26       0.67
Expl. Var                                    4.76       4.50
Prp. Totl                                    0.30       0.28

Note. From GEPT 2003 Report, p. 32.

Tables 13 and 14 were provided in the GEPT 2003 Report, but unfortunately the original eigenvalues and percentages of total variance of the factors were not clearly reported. Nevertheless, the report mentioned that in the Stage 1 test there are three factors whose eigenvalues are greater than 1, and that these three factors account for 59% of the variance. Factor 1 covers most of the variable contributions. The GEPT report (LTTC, 2003) wrote: "Reading accounts for more of the variance in Factor 1, and Listening accounts for more of the variance in Factor 2. Possibly, Factor 1 is related to Reading, and Factor 2 is related to Listening, and Factor 3 is related to Writing" (p. 14). Similarly, for Table 14 the same report noted that there are two factors whose eigenvalues are greater than 1, and that these two factors account for 58% of the variance. The report also suggested: "Possibly, Factor 1 is related to Reading and Writing, and possibly Factor 2 is related to Listening" (p. 32).

In short, when comparing the factor matrices in Tables 12, 13, and 14, the writer found that the loadings for the CET are generally greater than those for the GEPT. According to the Kaiser criterion, researchers usually retain only factors with eigenvalues greater than 1; in essence, unless a factor extracts at least as much variance as the equivalent of one original variable, it is dropped. As Mousavi (2002) mentioned, "if the correlations among the variables in the correlation matrix are close to zero, then no factors will emerge from the factor analysis. On the other hand, the higher the correlations among the variables, the more likely it will be that one or more factors will result from the analysis" (p. 245). Accordingly, this might tentatively suggest that the factor analysis of the CET reflects higher correlations among its test components than that of the GEPT. In general, however, the construct validation of both the CET and GEPT seems well grounded in each test's own account; in other words, the results of factor analysis may provide further support for both tests in terms of construct validation.

Finally, as for the SCUEPT, no report concerning its factor analysis is available. Because of this data limitation, the writer admits that he was unfortunately unable to conduct a factor analysis himself this time. A factor analysis of the SCUEPT therefore remains a worthwhile task for future studies.
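As a complement to the figures reported above, the following minimal Python sketch shows how the eigenvalue-based reasoning (principal components extraction plus the Kaiser criterion) can be reproduced. The 5x5 inter-subtest correlation matrix is a hypothetical stand-in with component labels borrowed from Table 12; the actual CET and GEPT correlation matrices are not reproduced here.

import numpy as np

# Hypothetical inter-subtest correlation matrix (LC, RC, CL, VS, WR);
# the values are illustrative assumptions, not the actual CET data.
R = np.array([
    [1.00, 0.55, 0.52, 0.58, 0.42],
    [0.55, 1.00, 0.57, 0.62, 0.45],
    [0.52, 0.57, 1.00, 0.60, 0.44],
    [0.58, 0.62, 0.60, 1.00, 0.48],
    [0.42, 0.45, 0.44, 0.48, 1.00],
])

# PCA on a correlation matrix reduces to an eigendecomposition;
# the eigenvalues sum to the number of variables (here, 5).
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]        # largest factor first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

pct = 100 * eigenvalues / eigenvalues.sum()
print("Eigenvalues:   ", np.round(eigenvalues, 3))
print("% of variance: ", np.round(pct, 1))
print("Cumulative %:  ", np.round(np.cumsum(pct), 1))

# Kaiser criterion: retain only factors with eigenvalues greater than 1,
# i.e. factors explaining at least as much as one original variable.
print("Factors retained:", int((eigenvalues > 1).sum()))

# Loadings of each subtest on factor 1 (the eigenvector scaled by the
# square root of its eigenvalue; an eigenvector's sign is arbitrary).
loadings_f1 = eigenvectors[:, 0] * np.sqrt(eigenvalues[0])
print("Factor 1 loadings:", np.round(np.abs(loadings_f1), 3))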

VALIDATION VIEWED FROM CONTENT VALIDITY


Construct validation should also be viewed from the perspective of content validity. One good way to study content validity is to gather the judgment of experts. Alderson et al. (1995) point out: "Typically, content validation involves experts making judgements in some systematic way. A common way is for them to analyze the content of a test and to compare it with a statement of what the content ought to be" (p. 173). Tables 15, 16, and 17 below show the relevant results for the CET-4 in Mainland China and for the GEPT and SCUEPT in Taiwan.

Table 15. Content Validity of the CET-4 Test Based on Teachers' Evaluation

General Comments                           English Teachers (N=144)   Students (N=2490)
1. Useful/reflecting students' ability     68%                        41.6%
2. Useful/good for jobs                    24%                        26.5%
3. Useless/not reflecting ability          4%                         14.6%
4. Useless/students unwilling to take      4%                         11.4%

Note. From Yang and Weir, 1998, pp. 174-175.

Table 15 clearly shows that only 8% of the teachers held negative views on the CET-4, while the majority believed the CET-4 to be credible, indicating high approval of its content validity in terms of expert judgment.

On the other hand, no expert judgment has been officially reported regarding the GEPT in Taiwan. However, the researcher conducted a survey among 17 English teachers, who gave their judgment of the content validity of the GEPT. The question was: To what extent does the GEPT (High-Intermediate) test properly reflect the claimed English proficiency of non-English major students? Responses were given on a scale from 1 to 10; the higher the number, the less the test suffers from poor construct representation. The feedback collected on the GEPT is shown in Table 16.

Table 16. Teachers' Feedback on Construct Relevance of the GEPT

Teacher   Rating        Teacher   Rating
1         7             10        6
2         9             11        7
3         6             12        7
4         7             13        6
5         8             14        8
6         7             15        6
7         8             16        7
8         7             17        8
9         6

Note. N=17, selected from universities in the greater Taipei and Kaohsiung areas in 2006.

This expert feedback on the GEPT shows that the agreement percentage reaches only 71% (120/170), which means that on average the GEPT has not received very high marks in terms of expert judgment. In this respect, the content validity of the GEPT can only be rated as moderately significant, which is fairly good.

As for the SCUEPT, the researcher interviewed 20 college English teachers in 2007 and collected questionnaires to obtain their judgment of the content validity of the SCUEPT. The question put to these 20 teachers was: To what extent does the SCUEPT test not suffer from construct under-representation or construct-irrelevant variance? Responses were given on a scale from 1 to 10; the higher the number, the less the test suffers from construct under-representation. The feedback is shown in Table 17.

Table 17. Teachers' Feedback on Construct Under-representation of the SCUEPT

Teacher   1   2   3   4   5   6   7   8   9   10
Rating    8   7   8   9   9   8   6   9   7   9
Teacher   11  12  13  14  15  16  17  18  19  20
Rating    8   8   7   9   9   8   9   7   9   8

Note. N=20, selected from universities in the greater Taipei area in 2007.

This expert feedback shows that the agreement percentage reaches 81% (162/200), i.e. the SCUEPT test content does not appear to suffer from construct under-representation. Therefore, relatively speaking, the content validity of the SCUEPT can be rated as significantly strong. However, further research is still needed, because the selected respondents of 17 and 20 teachers may not be adequately representative; regarding the demographic data, larger samples are badly needed for better results.
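The agreement percentages cited for Tables 16 and 17 can be reproduced as the sum of all ratings divided by the maximum possible total (10 points per rater times N raters). The short Python sketch below assumes this is the computation used; the helper name agreement_pct is introduced here purely for illustration, not a term from the paper.

# Ratings copied from Tables 16 and 17.
gept_ratings = [7, 9, 6, 7, 8, 7, 8, 7, 6,       # teachers 1-9
                6, 7, 7, 6, 8, 6, 7, 8]          # teachers 10-17
scuept_ratings = [8, 7, 8, 9, 9, 8, 6, 9, 7, 9,  # teachers 1-10
                  8, 8, 7, 9, 9, 8, 9, 7, 9, 8]  # teachers 11-20

def agreement_pct(ratings, scale_max=10):
    # Share of the maximum possible total actually awarded by the raters.
    return 100 * sum(ratings) / (scale_max * len(ratings))

print(f"GEPT:   {agreement_pct(gept_ratings):.0f}% "
      f"({sum(gept_ratings)}/{10 * len(gept_ratings)})")
print(f"SCUEPT: {agreement_pct(scuept_ratings):.0f}% "
      f"({sum(scuept_ratings)}/{10 * len(scuept_ratings)})")
# Prints 71% (120/170) and 81% (162/200), matching the figures above.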

DISCUSSION


Test Washback

The educational authorities and educational institutions at different levels in both the Mainland and Taiwan have put testing in a position to affect the huge number of stakeholders involved. The influence of testing in China, especially in relation to the nationwide College English Test (CET), has a close relationship with her national College English Teaching Syllabus; the test construct validation clearly needs to be in accordance with the national syllabus first. As for the CET, test washback is linked first with the needs of primary stakeholders, and then with those of less important ones. Universities consider CET results a reliable index of the quality of their ELT programmes. Meanwhile, the CET test results are directly linked with the testees' social mobility in terms of job hunting. The washback effect of the CET is considerably influential on stakeholders at each level (Jin & Wu, 2010). The stakes of the GEPT (or other tests) in Taiwan, however, differ greatly from those in China, and the washback effect of the GEPT is weak, as the GEPT is not the only tool adopted for assessing college students' English competence. In fact, many universities have created their own criteria according to their different frameworks of construct validity. Specifically, the symbolic value of the CET test in China can be considered dramatically enormous for college ELT programmes in modern Chinese education. The implementation of the CET in Mainland China has caused millions of college students to study English, and it is no exaggeration to say that China now has the largest English-learning population in the world. Universities at grass-roots level in China have utilized the test's functional value

to the extreme, and, to some extent, the CET has caused a high degree of tension between college ELT teaching and students' functional needs (Han et al., 2004). The washback effect has played a critical role in tertiary ELT curricula across all universities in China. In a study of the CET, Cheng (2008) wrote about both positive and negative washback: "Most of the CET stakeholders think highly of the test, especially its design, administration, marking, and the new measures adopted in recent years. They believe that positive washback of the test is much greater than the negative washback, and the negative washback is primarily due to the misuse of the test. However, some CET stakeholders are dissatisfied with the over use of multiple-choice (MC) format in the test, the lack of direct score reports to the teachers, the incomplete evaluation of students' English proficiency without a compulsory spoken English test, and the use of the test as the sole means in evaluating the quality of college English teaching and learning" (p. 30).

The issue of CET washback is complicated, with many factors determining the outcome of college ELT programmes. But most college teachers hold that such a standardized English language testing system has brought about a cumulative positive effect on the quality of teaching in college English education in China (Jin, 2005; Yang and Weir, 1998; Cheng, 2008). Clearly, tertiary ELT programmes at grass-roots level in China have had to make adjustments accordingly under the influence of CET washback.

In Taiwan, the GEPT has become an important English test whose national symbolic value has made all stakeholders cherish its educational and social significance. Compared with the CET in China, the GEPT is not a high-stakes test for most college students in Taiwan, as there is no obligation for all college students to pass the GEPT. However, the GEPT has successfully influenced all test stakeholders, especially the primary ones in tertiary ELT education (LTTC, 2003, 2007). This is clearly reflected in the fact that almost all universities in Taiwan have made the GEPT the first choice for their students to take, whether as a mandatory or an optional test (LTTC, 2007), and college English teachers have made more effort to teach their students. Specifically, some universities have clearly stated that all students must pass the GEPT or another test (such as the SCUEPT) to obtain a bachelor's degree. Accordingly, universities encourage students to take the GEPT or a test designed by the university at grass-roots level. It can hardly be denied that the most powerful washback effects of a test such as the GEPT are reflected not just in the improvement of each university's ELT syllabus design, but also in its social value and function. Officially, the Ministry of Education in Taiwan demanded a few years ago that at least 50% of college students should pass the GEPT test by 2007. In turn, most universities at grass-roots level started to carry out various award plans encouraging students to take the GEPT test by providing financial support. In part, this shows that the functional value of the GEPT is high, as the test's usefulness is widely accepted.

As Bachman (2004) pointed out, "An overriding consideration in designing, developing and using language tests is that of test usefulness" (p. 5). This reflects that test usefulness relates to many important areas that test designers need to know, and test construct validity also needs to be related to test usefulness. One can hardly imagine a test being considered socially valid if it turns out to be of little use by any means.

Test Symbolic and Functional Values

This paper has introduced and discussed the issues and concerns of the construct validation of the CET in Mainland China and of the GEPT and SCUEPT (as a representative English proficiency test designed by a local university) in Taiwan from different aspects. At macro level, for the Chinese educational authorities in the Mainland, the CET test serves the state interest and the needs of national higher education. The CET reflects governmental initiatives for the centralization and standardization of language testing at a national level, with a centralized definition of the ability construct to be validated. The CET is a mandatory standardized English language testing system, and all college students in the Mainland need to pass this high-stakes benchmark test for graduation. Therefore, the CET has virtually mobilized all Chinese college students to study English hard. Although there are criticisms of the CET (Han et al., 2004), the major positive symbolic and functional value of the CET's backwash effects is that all universities in China have realized the importance of college English education at the national level and have taken various actions to promote college English actively, which has brought about a cumulative positive effect on the quality of teaching in college English education nationwide (Jin, 2005). In this sense, the construct validation of this high-stakes national test is linked with the needs of the national ELT curriculum. The symbolic value of the construct validation of the CET in China is related to the needed rationale not only educationally but also socially and economically, because the CET has become popular in the job market across the Mainland (Jin, 2005; Cheng, 2008).

As for the GEPT test, however, the educational authorities in Taiwan have adopted a completely different approach to the assessment of college English education. Universities in Taiwan have much more freedom to decide what kind of English proficiency is needed for their undergraduate students, according to each university's own understanding and interpretation of the test construct of ability. The autonomous and decentralized language assessment approach within colleges and universities in Taiwan provides autonomy in defining the test construct of ability according to their own local needs. Therefore, the GEPT is not high-stakes, and the symbolic value of the construct validation of the GEPT appears limited when compared with that in the Mainland, where the construct of the CET test is closely related to the national English teaching syllabus.

In view of the symbolic value of construct validation, the researcher believes that the CET deserves comparatively more credit, because a set of dedicated test specifications has been designed that is directly in line with the purpose of the CET test in the Mainland. By contrast, the GEPT's test specifications are not specially designed to evaluate college graduates' English competency, let alone to serve as a graduation threshold. Therefore, strictly speaking, the purpose of the GEPT and its backwash effects on college English education do not completely accord with the needs of college English programs in Taiwan; there is no comparison between the CET and the GEPT in this sense. On the credit side, the SCUEPT is designed as a graduation threshold test, but the symbolic value of its construct validity is limited, as it serves just one individual university in Taiwan.

At grass-roots level, the functional value of the construct validation of both the CET and the GEPT (or the SCUEPT) can be viewed in terms of their specific needed rationales. Generally, the CET, GEPT, and SCUEPT all have solid empirical data to support their own rationales regarding internal reliability, correlation coefficients, content validity, and so on, all of which are satisfactorily acceptable at micro level. However, given the different tests used for graduation examinations in Taiwan, the test results are unsurprisingly not comparable, because the test contents differ and hardly any test specifications have been written for such wide-ranging tests. The functional value of construct validity is thus also limited to each individual university in Taiwan.

CONCLUSION


In this final section, the writer concludes with findings and implications. Drawing on both documentary and empirical data, the writer has examined the construct validation of the CET and GEPT (and SCUEPT) tests from various aspects and has provided justification for how construct validation can be quantified and reflected in the CET and GEPT, which are used as graduation English tests in Mainland China and Taiwan. Evidence has shown that the construct validation behind college graduation English proficiency tests such as the CET and GEPT (and SCUEPT) in Mainland China and Taiwan has been effectively and moderately represented in the test content. This claim is satisfactorily supported by the analysis of documentary and empirical data according to the major research approaches. Empirical analysis of test specifications, internal reliability, internal correlations, factor analysis, the (Pearson) inter-subtest correlation matrix, content validity, and so on has also provided strong support for the claim and answered the main research question. The findings have shown that construct validation can be strongly reflected and quantified as significant in college English graduation tests in Mainland China and Taiwan. In this sense, both the CET and GEPT tests have a fairly sound rationale to be selected as qualified English proficiency tests for universities at grass-roots level to use.

In a narrower sense, however, the CET test in the Mainland appears to have exerted a more powerful washback influence on her national college English programmes. The researcher suggests that universities in Taiwan should consider using a unified testing system as a kind of benchmark graduation English proficiency test, although such a unified testing system may not be the only tool that should be adopted for assessing college students' English competence, so as to avoid college ELT programmes becoming test-driven.

In short, this research has provided an initial study of the construct validation behind college graduation English proficiency tests such as the CET and GEPT (and SCUEPT) in Mainland China and Taiwan. In a sense, this research has gone a considerable way toward filling a gap in this field as a starting point, although much remains to be done. As Bachman (2004) says, "It is never sufficient for the purpose of supporting a validation argument" (p. 279), and there is still much more to be done in the field of language testing in general and test validity in particular. The researcher firmly believes that the issue of construct validation can be studied more thoroughly from different perspectives. It is hoped that the results of this study provide a clear indication of the degree to which construct validation is reflected in the CET and GEPT tests, so as to help test designers improve their language test designs. Meanwhile, the researcher also hopes that the results of this paper can provide a new starting point for an exchange of experience in large-scale English language testing, as part of the continuing search for quality college ELT testing.


REFERENCES


Alderson, J. C., Clapham, C., and Wall, D. (1995). Language Test Construction and Evaluation. Cambridge: Cambridge University Press.
Bachman, L. F. (2004). Statistical Analysis for Language Assessment. Cambridge: Cambridge University Press.
Bachman, L. F. and Palmer, A. S. (1996). Language Testing in Practice. Oxford: Oxford University Press.
Brown, J. D. (2005). Testing in Language Programs: A Comprehensive Guide to English Language Assessment. New York: McGraw-Hill Education (Asia).
Cheng, L. (2008). The key to success: English language testing in China. Language Testing, 25(1), 15-37.
Cronbach, L. J. (1984). Essentials of Psychological Testing (4th ed.). New York: Harper and Row.
Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., and McNamara, T. (1999). Studies in Language Testing 7: Dictionary of Language Testing. Cambridge: Cambridge University Press.
Fulcher, G. and Davidson, F. (2007). Language Testing and Assessment. London: Routledge.
Gong, B. (2004). A need for a unified assessment of college English language programs: Some theoretical and practical considerations for quality ELT in Taiwan. Shih Chien Management Commentary, Issue 1.
Han, B., Dai, M., and Yang, L. (2004). Problems with College English Test as emerged from a survey. Foreign Languages and Their Teaching, 179(2), 17-23.
Hughes, A. (2003). Testing for Language Teachers (2nd ed.). Cambridge: Cambridge University Press.
Jin, Y. (2005). The National College English Test of China. In L. Hamp-Lyons (Chair), The big tests: Intentions and evidence. Symposium presented at the International Association of Applied Linguistics (AILA) 2005 Conference, Madison, WI.
Jin, Y. and Wu, J. (2010). A preliminary study of the Internet-based CET-4. Paper presented at the 12th Academic Forum on English Language Testing in Asia (AFELTA 2010), Taipei, Taiwan.
LTTC (2000, 2003, 2007). A Statistical Report on the Scores of a GEPT Test. Taipei: LTTC.
MOE (2005). Higher Education Newsletter, Vol. 166. Taipei: Ministry of Education.
Mousavi, S. A. (2002). An Encyclopedic Dictionary of Language Testing (3rd ed.). Taipei: Tung Hua Book Co., Ltd.
National College English Testing Committee, PRC (2006). College English Test Sample Papers. Shanghai: Shanghai Foreign Language Education Press.
Tsai, Y. and Tsou, C. (2009). A standardized English language proficiency test as the graduation benchmark: Student perspectives on its application in higher education. Assessment in Education: Principles, Policy & Practice, 16(3), 319-330.
Weir, C. (2004). Language Testing and Validation: An Evidence-based Approach. Basingstoke: Palgrave.
Windle, M. (2000). A latent growth curve model of delinquent activity among adolescents. Applied Developmental Science, 4(4), 193-207.
Xi, X. (2008). Methods of validation. In E. Shohamy and N. H. Hornberger (Eds.), Encyclopedia of Language and Education (pp. 177-196). New York: Springer.
Yang, H. (2009). The sociological aspects of language testing. Paper presented at the 2009 LTTC International Conference on English Language Teaching and Testing, Taipei, Taiwan.
Yang, H. and Jin, Y. (2000). Score interpretation of CET. In Proceedings of the Third International Conference on English Language Testing in Asia (pp. 32-40). Hong Kong: Hong Kong Examinations Authority.
Yang, H. and Weir, C. (1998). Validation Study of the National College English Test (third edition). Shanghai: Shanghai Foreign Language Education Press.

