PRINCIPLES OF
LANGUAGE ASSESSMENT
This chapter explores how principles of language assessment can and should be
applied to formal tests, but with the ultimate recognition that these principles also
apply to assessments of all kinds. In this chapter, these principles will be used to
evaluate an existing, previously published, or created test. Chapter 3 will center on
how to use those principles to design a good test.
How do you know if a test is effective? For the most part, that question can be
answered by responding to such questions as: Can it be given within appropriate
administrative constraints? Is it dependable? Does it accurately measure what you
want it to measure? These and other questions help to identify five cardinal criteria
for “testing a test”: practicality, reliability, validity, authenticity, and washback. We will
look at each one, but with no priority order implied in the order of presentation.
PRACTICALITY
An effective test is practical. This means that it
• is not excessively expensive,
• stays within appropriate time constraints,
• is relatively easy to administer, and
• has a scoring/evaluation procedure that is specific and time-efficient.
A test that is prohibitively expensive is impractical. A test of language profi-
ciency that takes a student five hours to complete is impractical—it consumes
more time (and money) than necessary to accomplish its objective. A test that
requires individual one-on-one proctoring is impractical for a group of several hundred
test-takers and only a handful of examiners. A test that takes a few minutes for
a student to take and several hours for an examiner to evaluate is impractical for
most classroom situations. A test that can be scored only by computer is impractical
if the test takes place a thousand miles away from the nearest computer. The value
and quality of a test sometimes hinge on such nitty-gritty, practical considerations.

RELIABILITY
A reliable test is consistent and dependable. If you give the same test to the same
student or matched students on two different occasions, the test should yield sim-
ilar results. The issue of reliability of a test may best be addressed by considering a
number of factors that may contribute to the unreliability of a test. Consider the
following possibilities (adapted from Mousavi, 2002, p. 804): fluctuations in the
student, in scoring, in test administration, and in the test itself.
Student-Related Reliability
The most common learner-related issue in reliability is caused by temporary illness,
fatigue, a “bad day,” anxiety, and other physical or psychological factors, which may
make an “observed” score deviate from one's “true” score. Also included in this category
are such factors as a test-taker's “testwiseness” or strategies for efficient test
taking (Mousavi, 2002, p. 804).
Rater Reliability
Human error, subjectivity, and bias may enter into the scoring process. Inter-rater
unreliability occurs when two or more scorers yield inconsistent scores of the same
test, possibly for lack of attention to scoring criteria, inexperience, inattention, or
even preconceived biases. In the story above about the placement test, the initial
scoring plan for the dictations was found to be unreliable—that is, the two scorers
were not applying the same standards.
Rater-reliability issues are not limited to contexts where two or more scorers
are involved. Intra-rater unreliability is a common occurrence for classroom
teachers because of unclear scoring criteria, fatigue, bias toward particular “good”
and “bad” students, or simple carelessness. When I am faced with up to 40 tests to
grade in only a week, I know that the standards I apply—however subliminally—to
the first few tests will be different from those I apply to the last few. I may be “easier”
or “harder” on those first few papers, or I may get tired, and the result may be an
inconsistent evaluation across all tests. One solution to such intra-rater unreliability
is to read through about half of the tests before rendering any final scores or grades,
then to recycle back through the whole set of tests to ensure an even-handed judgment.
In tests of writing skills, rater reliability is particularly hard to achieve since
writing proficiency involves numerous traits that are difficult to define. The careful
specification of an analytical scoring instrument, however, can increase rater
reliability (J. D. Brown, 1991).
Test Administration Reliability
Unreliability may also result from the conditions in which the test is administered.
I once witnessed the administration of a test of aural comprehension in which a tape
recorder played items for comprehension, but because of street noise outside the
building, students sitting next to windows could not hear the tape accurately. This
was a clear case of unreliability caused by the conditions of the test administration.
Other sources of unreliability are found in photocopying variations, the amount of
light in different parts of the room, variations in temperature, and even the condition
of desks and chairs.

Test Reliability
Sometimes the nature of the test itself can cause measurement errors. If a test is too
long, test-takers may become fatigued by the time they reach the later items and
hastily respond incorrectly. Timed tests may discriminate against students who do
not perform well on a test with a time limit. We all know people (and you may be
included in this category!) who “know” the course material perfectly but who are
adversely affected by the presence of a clock ticking away. Poorly written test items
(that are ambiguous or that have more than one correct answer) may be a further
source of test unreliability.
VALIDITY
By far the most complex criterion of an effective test—and arguably the most important
principle—is validity, “the extent to which inferences made from assessment
results are appropriate, meaningful, and useful in terms of the purpose of the assessment”
(Gronlund, 1998, p. 226). A valid test of reading ability actually measures
reading ability—not 20/20 vision, nor previous knowledge in a subject, nor some
other variable of questionable relevance. To measure writing ability, one might ask
students to write as many words as they can in 15 minutes, then simply count the
words for the final score. Such a test would be easy to administer (practical), and the
scoring quite dependable (reliable). But it would not constitute a valid test of
writing ability without some consideration of comprehensibility, rhetorical discourse
elements, and the organization of ideas, among other factors.
How is the validity of a test established? There is no final, absolute measure of
validity, but several different kinds of evidence may be invoked in support. In some
cases, it may be appropriate to examine the extent to which a test calls for performance
that matches that of the course or unit of study being tested. In other cases,
we may be concerned with how well a test determines whether or not students have
reached an established set of goals or level of competence. Statistical correlation with
other related but independent measures is another widely accepted form of evidence.
Other concerns about a test's validity may focus on the consequences—beyond
measuring the criteria themselves—of a test, or even on the test-taker's
perception of validity. We will look at these five types of evidence below.
Content-Related Evidence
If a test actually samples the subject matter about which conclusions are to be
drawn, and if it requires the test-taker to perform the behavior that is being measured,
it can claim content-related evidence of validity, often popularly referred to as
content validity (e.g., Mousavi, 2002; Hughes, 2003). You can usually identify
content-related evidence observationally if you can clearly define the achievement that
you are measuring. A test of tennis competency that asks someone to run a 10-yard