CHARACTERISTICS OF A GOOD
ASSESSMENT
LEARNING OUTCOMES
Introduction
Assessment has to be meaningful and significant to you and your students. To classroom
teachers, any classroom assessments should provide accurate evidence of student learning.
The accuracy of the evidence collected influences the effectiveness of instructional decisions
made. Remember that teaching and assessment are not two separate components; in fact, the very basis of formative assessment is to provide teachers with data that informs every aspect of instructional decision-making. In other words, we assess because we need information about student learning to better target our teaching.
To students, assessments should motivate and challenge their learning. The items or tasks
designed for any assessment should be able to measure the complexity of the knowledge acquired. In language learning, assessments should be designed to accurately measure the four
components of language skills: reading, writing, listening and speaking.
We know that a test should consist of items that sample student learning, but what makes a
test a good test? How do we know that the instrument we have constructed is of high
quality? How do we determine the relevance of test content? How do we ascertain that the
reading test developed does, in fact, assess reading and not unrelated skills?
In this chapter, we will learn two general principles that guide teachers in test development:
VALIDITY and RELIABILITY. The application of these principles leads to the development of
high quality measures of student learning and meaningful assessment results.
Validity of Assessment
There are various definitions available to explain the concept of validity. The oldest and the
most common definition is presented below:
Definition of Validity
If you were to examine this definition carefully, you would realize that this definition does not really
capture the true nature of validity or assessment itself. Yet, this is the most commonly used
definition to describe validity.
We have learnt that a test is only one type of assessment, yet the principle of validity encompasses all assessment types. Therefore, the definition above is too simplistic in nature and insufficient to
represent what validity truly is.
More recent publications of assessment-related literature have discussed the concept of validity as
being related to the quality of interpretation made. For example, Nitko and Brookhart (2007) define
validity as the soundness of teachers’ interpretations and uses of students’ assessment results.
Similar to this, Miller, Linn and Gronlund (2013) view validity as an evaluation of the adequacy and
appropriateness of the interpretations and uses of assessment results.
All measures of learning are indirect. Because the constructs that we assess are latent in nature, we
have to base our assessment on tools that are observable and measurable. Evidence of learning is
derived in the form of written responses, verbal responses or through assessment tasks that are
performance-based and/or production-based. From these written and verbal responses, as well as the learning demonstrated through performances and objects created, teachers interpret whether learning has or has not taken place. If an assessment is highly valid, it provides teachers with quality information about student learning; hence, a more accurate interpretation of learning can be made. For example, a speaking test must include assessment tasks that require candidates to demonstrate their speaking ability verbally. If this speaking test is instead dominated by items or tasks that require candidates to demonstrate their proficiency in writing, then the test has low validity.
Our interpretation of validity often involves the process of making inferences. An inference refers to a conclusion drawn from the synthesis and interpretation of evidence; it subsequently requires judgment on the basis of the evidence gathered and prior beliefs/expectations.
More on Validity
Activity 4.1
Validity Types
There are five types of validity: face validity, construct validity, content validity,
consequential validity and criterion validity.
Face Validity
Face validity represents the determination of an assessment’s validity at surface level. An
assessment has a high face validity if, at surface level, it translates sufficiently the content and
construct being assessed. For example, in a writing test, face validity is determined by asking whether the test has a sufficient number of items to represent the construct assessed, whether the length of the assessment is appropriate to the number of items or the task complexity, whether the items or tasks are appropriate representations of the knowledge or content assessed, and so on. Face validity is the weakest validity type.
Construct Validity
A construct is also called an attribute, proficiency, ability, or skill being assessed. Construct validity
refers to the degree to which inferences about the construct can legitimately be made from assessment results. Its use is more significant in research than in classroom testing, for the purpose of establishing a relationship between the aspect(s) investigated in a study and the theoretical framework used. In a normal classroom test, construct validity is important when the aspect assessed is abstract in nature, has no obvious body of content, or when there are no existing criteria.
Content Validity
Content Validity refers to how well the content of the assessment samples the classroom situations
or subject matter about which conclusions are to be drawn. A reading test, for example, has a high
content validity if the tasks included in the test are a good representation of the domain (reading) assessed and emphasize important aspects of reading skills.
Consequential Validity
This type of validity refers to the extent to which the use of assessment results accomplishes its intended purposes and the extent to which unintended effects of assessment are avoided. Teachers need to be
aware that the assessment implemented may have positive and/or negative effect on stakeholders.
Some of these effects include increased learning, decreased learning, increased morale and
motivation, decreased morale and motivation, narrowing of instruction, dropping out of school, etc.
Criterion Validity
Criterion validity comprises predictive validity and concurrent validity. Predictive validity
refers to the extent to which performance on the assessment predicts future performance.
Concurrent validity, on the other hand, is the extent to which the assessment is able to provide an estimate of current performance on some valued measure other than the test itself.
Reliability of Assessment
The word reliable is often associated with consistent, and reliability of assessment is often
described in relation to the consistency of assessment results. The definition presented
below is the most common definition used to describe reliability.
Definition of Reliability
The definition above implies that if a student is assessed multiple times with the same assessment
measure (e.g. given the same test), the student must be able to obtain the same scores regardless of
how many times he or she is assessed. If this student obtains different scores, then the assessment is
not reliable. There are two problems with this. First, in a normal classroom situation, teachers do
not give the same test to the same student more than once. Second, if a student sits for test A today
and sits for test A again next week, it is highly likely that the score would improve. Therefore, the
definition given is a poor description of reliability. Even though some reliability types do look at
consistency of assessment results, this consistency is the outcome of the use of multiple measures
(different instruments) that produce the same set of scores, or the extent to which items in an instrument would be able to produce similar responses. A more comprehensive definition of reliability is
provided below:
Reliability Types
This sub-chapter discusses four types of reliability: inter-rater reliability, intra-rater reliability,
parallel form reliability and internal consistency reliability.
Inter-rater Reliability
Inter-rater reliability refers to the consistency of judgment across different assessors using the same
assessment task and procedures. Teachers only need to be concerned with this type of reliability
when there is more than one assessor grading the assessment. Different raters grade differently due to differences in worldview, beliefs and experiences. It is important that all raters approach students' work with the same marking standards. This can be achieved if all assessors meet to
discuss marking standards before the marking process begins.
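Whether a pre-marking discussion of standards has worked can be checked afterwards by counting how often the assessors actually agree. The sketch below is an illustration only: the grades are invented, and simple percent agreement is used as the index (chance-corrected indices such as Cohen's kappa are common in practice but are beyond this chapter):

```python
# Illustration only: percent agreement between two raters who graded
# the same ten essays. The grade lists are hypothetical example data.

rater_1 = ["A", "B", "B", "C", "A", "D", "B", "C", "A", "B"]
rater_2 = ["A", "B", "C", "C", "A", "D", "B", "B", "A", "B"]

# Count the essays on which both raters assigned the same grade.
agreements = sum(g1 == g2 for g1, g2 in zip(rater_1, rater_2))
percent_agreement = agreements / len(rater_1)

print(f"{percent_agreement:.0%}")  # prints "80%"
```

Here the two raters agree on 8 of 10 essays; a low figure after the standards discussion would signal that the marking standards are still being applied differently.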
Intra-rater Reliability
Intra-rater reliability is the consistency of assessment outcomes across time and location, and using
the same assessment task administered by the same assessor. Raters are human beings with emotions and feelings, and are susceptible to distractions. To ensure consistent output (and thus high intra-rater reliability), raters have to make sure that their objectivity remains high when marking.
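The other two reliability types named above, parallel form and internal consistency, are usually estimated statistically rather than by examining rater behaviour. Parallel-form reliability, for example, can be estimated by correlating students' scores on two equivalent forms of the same test. A minimal sketch in Python, using invented scores:

```python
# Illustration only: parallel-form reliability estimated as the Pearson
# correlation between scores on two equivalent test forms.
# The score lists are hypothetical example data.

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x ** 0.5 * var_y ** 0.5)

form_a = [55, 62, 70, 48, 81, 77]   # hypothetical scores on Form A
form_b = [58, 60, 73, 50, 79, 80]   # same students on equivalent Form B

r = pearson_r(form_a, form_b)
print(round(r, 2))  # values near 1.0 suggest the forms rank students consistently
```

A correlation close to 1.0 indicates that the two instruments produce essentially the same set of scores, which is the sense of "consistency" described earlier in this chapter.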
Classroom Discussion
For each of the questions below, identify the type of validity or reliability.
1. A physics teacher wonders whether her final exam this year is equivalent to her final
exam last year.
3. A law lecturer wonders whether the grades she assigns to her students would differ
if she had not fallen sick during the marking period.
4. A teacher discusses with other teachers the standards for marking a science test prior to the marking period.
1. Define validity.
4. Define reliability.
LEARNING OUTCOMES
Introduction
When teachers develop a test, in addition to the content assessed, they also need to be well versed in the different types of item formats. A structured item, for example, would gather a different kind of response compared to a short-answer item, in relation to the length and depth of knowledge demonstrated, even though both items assess the same content. Each type of test item has
its own unique characteristics, uses, advantages, limitations and rules for construction.
Objective-type Items
The use of objective-type items enables teachers to administer objective tests to their students. An objective test is usually a paper-based test in which candidates select a response from a range of alternatives established by the task developers. This type of testing allows teachers to cover a wider range of topics (than asking essay-type questions) within a shorter
time frame.
Activity 5.1
true/false item
underline item
matching item
completion item
multiple-choice question (MCQ)
True/False Item
A true/false item is easy to construct and score, hence its widespread use in classroom tests. A true/false item is usually phrased in the form of a statement, and the students are required to identify whether the statement is correct (true) or incorrect (false). Given the simplicity of its structure, this type of item requires a relatively short time frame to construct. It enables teachers to assess a wide range of topics; however, the focus is usually on factual information. Students do not have the freedom to portray their knowledge beyond the identification of correct and incorrect statements. This item format is the least reliable, as students have a high probability (50%) of guessing each item correctly.
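The 50% guessing probability per item compounds over a whole test. The short calculation below is an illustration only, with a hypothetical 10-item test and pass mark, showing the chance that a student who guesses every item still passes:

```python
# Illustration only: probability that pure guessing passes a true/false
# test. The test length and pass mark are hypothetical.
from math import comb

n_items = 10       # hypothetical test length
pass_mark = 6      # hypothetical: at least 6 correct answers to pass
p = 0.5            # chance of guessing one true/false item correctly

# Binomial probability of getting at least `pass_mark` items right by luck.
p_pass = sum(comb(n_items, k) * p**k * (1 - p)**(n_items - k)
             for k in range(pass_mark, n_items + 1))

print(f"{p_pass:.1%}")  # prints "37.7%"
```

Even with no knowledge at all, a guessing student passes this hypothetical test more than a third of the time, which is why scores on short true/false tests are weak evidence of learning.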
Underline Item
As the name suggests, an underlining-type item only requires the candidates to underline the correct option. An example of an underlining-type item is presented below.
Matching Item
A matching-type item is appropriate for measuring the relationship or connection between two aspects. It is relatively easy to construct and score. An example of a matching-type item is on the following
page.
Tan and Ramarau, 2013
Completion Item
In language tests, a completion item is designed in a manner that requires the candidates to complete a sentence with a one-word or short-phrase answer. Commonly used in reading and listening tests, this type of item is easy to construct; however, it may be more difficult to score than other objective-type items. This is primarily because, for each item, there might be more than one acceptable answer. Two examples of completion items are shown below.
Of all the objective-type items, true/false is the least reliable item type. Why?
Objective-type items are highly favoured by language teachers. Teachers generally teach up
to five different groups a week and each group may have more than 40 students. Testing is
time consuming and grading is often a tedious task. Compared to subjective-type items, objective testing eases the assessment process, and the objective item types provide language teachers with a number of advantages.
Ease of scoring;
Cost efficient scoring procedures;
Assessing large number of candidates at one time;
Appearance of objectivity – reduced bias (high inter-rater and intra-rater reliabilities);
Ease of establishing and maintaining standardised test administration conditions;
Multiple ways to assess underpinning knowledge and understanding;
Item effectiveness can be computed and determined; and
Ease of determining validity and reliability.
Despite the advantages listed, objective-type items are not without limitations:
Subjective-type Items
Subjective-type items are also known as constructed-response items (CR items). Rather than selecting a response from a list of options, the candidates are required to create their own
responses (American Educational Research Association, American Psychological Association, and
National Council on Measurement in Education, 1999).
Activity 5.3
Generally, subjective-type items consist of short-answer items, structured items and essay items.
They vary in relation to the level of objectivity and the length of responses expected.
Short-answer Item
Compared to other types of subjective items, the short-answer item has the highest level of objectivity. This type of item is often used by language teachers to assess specific aspects of knowledge involving the retrieval of specific information. The responses expected from this type of item are limited to one or two words; hence, it is often limited to measuring the remembering of information.
Structured Item
The length of the responses expected from this type of item is longer than for short-answer items and shorter than for essay-type items. A structured item provides the candidates with more flexibility in demonstrating their language skills; however, the responses expected from such items are more limited in scope than those of essay items.
Essay Item
Essay item provides students with the freedom to construct and present ideas and concepts in their
own words. One major advantage of this type of item is that it enables teachers to assess complex
skills, particularly learning outcomes that are not readily measured by objective-type items.
Teachers also have the option to increase or decrease the level of item difficulty by asking about more complex or less complex content. Two types of essay item are the restricted-response essay
item and extended-response essay item.
Restricted-Response Essay Item
Compared with the extended-response essay item, the restricted-response essay item has a narrower scope of content assessed and a shorter expected length of response. These limitations are usually expressed as part of the item. Please refer to the sample below.
Extended-Response Essay Item
In an extended-response essay item, students have greater freedom to organise their ideas, analyse issues and problems, and creatively integrate their thoughts and feelings in a manner they view as appropriate. Although this type of item presents teachers with the opportunity to measure complex language skills and higher-order cognitive skills, it may be perceived as having low inter-rater and intra-rater reliability. The scoring of such items is time consuming and requires a systematic measure as a way to improve inter-rater and intra-rater reliabilities.
Subjective item formats are appropriate when the construct assessed is competency in abstract thinking and analysis. They are also adaptable to extended response activities
such as development of proposals, reports, applications, presentations, etc. Below are the
advantages of subjective-item formats.
Easier to construct;
Appearance of high face validity;
Appearance of high content validity in assessing writing skills;
Assessment of high order thinking skills;
Assessment of complex and abstractive learning outcomes;
Assessment is more authentic – resembling tasks that have real-life relevance (e.g. writing newspaper articles, reports, letters, etc.)
Despite the advantages listed, subjective-type items are not without limitations:
If the construct assessed is a non-language skill, such item types may favour candidates with good analytical and communication skills, and writing ability may become one of the criteria tested rather than the actual substance of the assessment.
Exercises
3. Why does the true/false item have the lowest reliability compared to other types of objective items?
4. In your opinion, why is the MCQ a better item format than the true/false item?
5. Why do subjective-type items appear to have higher face validity than objective-
type items?