PRINCIPLES OF LANGUAGE ASSESSMENT

This chapter explores how principles of language assessment can and should be applied to formal tests, but with the ultimate recognition that these principles also apply to assessments of all kinds. In this chapter, these principles will be used to evaluate an existing, previously published, or created test. Chapter 3 will center on how to use those principles to design a good test.

How do you know if a test is effective? For the most part, that question can be answered by responding to such questions as: Can it be given within appropriate administrative constraints? Is it dependable? Does it accurately measure what you want it to measure? These and other questions help to identify five cardinal criteria for "testing a test": practicality, reliability, validity, authenticity, and washback. We will look at each one, but with no priority order implied in the order of presentation.

PRACTICALITY

An effective test is practical. This means that it

• is not excessively expensive,
• stays within appropriate time constraints,
• is relatively easy to administer, and
• has a scoring/evaluation procedure that is specific and time-efficient.

A test that is prohibitively expensive is impractical. A test of language proficiency that takes a student five hours to complete is impractical: it consumes more time (and money) than necessary to accomplish its objective. A test that requires individual one-on-one proctoring is impractical for a group of several hundred test-takers and only a handful of examiners. A test that takes a few minutes for a student to take and several hours for an examiner to evaluate is impractical for most classroom situations. A test that can be scored only by computer is impractical if the test takes place a thousand miles away from the nearest computer. The value and quality of a test sometimes hinge on such nitty-gritty, practical considerations.

RELIABILITY

A reliable test is consistent and dependable. If you give the same test to the same student or matched students on two different occasions, the test should yield similar results. The issue of reliability of a test may best be addressed by considering a number of factors that may contribute to the unreliability of a test. Consider the following possibilities (adapted from Mousavi, 2002, p. 804): fluctuations in the student, in scoring, in test administration, and in the test itself.

Student-Related Reliability

The most common learner-related issue in reliability is caused by temporary illness, fatigue, a "bad day," anxiety, and other physical or psychological factors, which may make an "observed" score deviate from one's "true" score. Also included in this category are such factors as a test-taker's "test-wiseness" or strategies for efficient test taking (Mousavi, 2002, p. 804).

Rater Reliability

Human error, subjectivity, and bias may enter into the scoring process. Inter-rater unreliability occurs when two or more scorers yield inconsistent scores of the same test, possibly for lack of attention to scoring criteria, inexperience, inattention, or even preconceived biases. In the story above about the placement test, the initial scoring plan for the dictations was found to be unreliable; that is, the two scorers were not applying the same standards. Rater-reliability issues are not limited to contexts where two or more scorers are involved.

Intra-rater unreliability is a common occurrence for classroom teachers because of unclear scoring criteria, fatigue, bias toward particular "good" and "bad" students, or simple carelessness. When I am faced with up to 40 tests to grade in only a week, I know that the standards I apply (however subliminally) to the first few tests will be different from those I apply to the last few. I may be "easier" or "harder" on those first few papers, or I may get tired, and the result may be an inconsistent evaluation across all tests. One solution to such intra-rater unreliability is to read through about half of the tests before rendering any final scores or grades, then to recycle back through the whole set of tests to ensure an even-handed judgment. In tests of writing skills, rater reliability is particularly hard to achieve since writing proficiency involves numerous traits that are difficult to define. The careful specification of an analytical scoring instrument, however, can increase rater reliability (J. D. Brown, 1991).
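Rater consistency can also be examined quantitatively. The following is a minimal sketch, using invented scores for ten dictations marked by two hypothetical raters on a 20-point scale: a simple correlation indicates whether the raters rank the test-takers similarly, and the difference in their mean scores indicates whether one rater is applying a systematically harsher standard.

# Illustrative sketch only: invented scores from two hypothetical raters
# who each marked the same ten dictations on a 20-point scale.
from statistics import mean

rater_a = [18, 15, 12, 17, 9, 14, 16, 11, 13, 19]
rater_b = [17, 16, 10, 18, 8, 15, 14, 12, 11, 20]

def pearson_r(x, y):
    # Pearson correlation of the two sets of scores: values near 1 mean the
    # raters rank the test-takers in nearly the same order.
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm_x = sum((a - mx) ** 2 for a in x) ** 0.5
    norm_y = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (norm_x * norm_y)

print(f"rank-order agreement (r): {pearson_r(rater_a, rater_b):.2f}")
# A gap between the two means signals a harsher or more lenient standard
# even when the rank ordering largely agrees.
print(f"difference in mean scores: {mean(rater_a) - mean(rater_b):+.2f}")

Calculating such an index does not by itself remove the inconsistency; as noted above, it is the careful specification of scoring criteria, such as an analytical scoring instrument, that improves agreement.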
Test Administration Reliability

Unreliability may also result from the conditions in which the test is administered. I once witnessed the administration of a test of aural comprehension in which a tape recorder played items for comprehension, but because of street noise outside the building, students sitting next to windows could not hear the tape accurately. This was a clear case of unreliability caused by the conditions of the test administration. Other sources of unreliability are found in photocopying variations, the amount of light in different parts of the room, variations in temperature, and even the condition of desks and chairs.

Test Reliability

Sometimes the nature of the test itself can cause measurement errors. If a test is too long, test-takers may become fatigued by the time they reach the later items and hastily respond incorrectly. Timed tests may discriminate against students who do not perform well on a test with a time limit. We all know people (and you may be included in this category!) who "know" the course material perfectly but who are adversely affected by the presence of a clock ticking away. Poorly written test items (that are ambiguous or that have more than one correct answer) may be a further source of test unreliability.

VALIDITY

By far the most complex criterion of an effective test, and arguably the most important principle, is validity, "the extent to which inferences made from assessment results are appropriate, meaningful, and useful in terms of the purpose of the assessment" (Gronlund, 1998, p. 226). A valid test of reading ability actually measures reading ability, not 20/20 vision, previous knowledge in a subject, or some other variable of questionable relevance. To measure writing ability, one might ask students to write as many words as they can in 15 minutes, then simply count the words for the final score. Such a test would be easy to administer (practical), and the scoring quite dependable (reliable). But it would not constitute a valid test of writing ability without some consideration of comprehensibility, rhetorical discourse elements, and the organization of ideas, among other factors.

How is the validity of a test established? There is no final, absolute measure of validity, but several different kinds of evidence may be invoked in support. In some cases, it may be appropriate to examine the extent to which a test calls for performance that matches that of the course or unit of study being tested.
In other cases, we may be concerned with how well a test determines whether or not students have reached an established set of goals or level of competence. Statistical correlation with other related but independent measures is another widely accepted form of evidence. Other concerns about a test's validity may focus on the consequences of a test, beyond measuring the criteria themselves, or even on the test-taker's perception of validity. We will look at these five types of evidence below.

Content-Related Evidence

If a test actually samples the subject matter about which conclusions are to be drawn, and if it requires the test-taker to perform the behavior that is being measured, it can claim content-related evidence of validity, often popularly referred to as content validity (e.g., Mousavi, 2002; Hughes, 2003). You can usually identify content-related evidence observationally if you can clearly define the achievement that you are measuring. A test of tennis competency that asks someone to run a 10-yard
