
CHAPTER THREE

CHARACTERISTICS OF A GOOD
ASSESSMENT

LEARNING OUTCOMES

At the end of the chapter, the students should be able to...


1. describe validity types.
2. describe reliability types.
3. compare and contrast validity and reliability.

Introduction
Assessment has to be meaningful and significant to you and your students. To classroom
teachers, any classroom assessment should provide accurate evidence of student learning.
The accuracy of the evidence collected influences the effectiveness of the instructional
decisions made. Remember that teaching and assessment are not two separate components;
in fact, the very basis of doing formative assessment is to feed teachers with data that
informs every aspect of instructional decision-making. In other words, we assess because
we need information about student learning to better target our teaching.

To students, assessments should motivate and challenge their learning. The items or tasks
designed for any assessment should be able to measure the complexity of the knowledge
acquired. In language learning, assessments should be designed to accurately measure the
four components of language skill: reading, writing, listening and speaking.

We know that a test should consist of items that sample student learning, but what makes a
test a good test? How do we know that the instrument we have constructed is of high
quality? How do we determine the relevance of test content? How do we ascertain that the
reading test developed does, in fact, assess reading and not unrelated skills?

In this chapter, we will learn two general principles that guide teachers in test development:
VALIDITY and RELIABILITY. The application of these principles leads to the development of
high quality measures of student learning and meaningful assessment results.
Validity of Assessment
There are various definitions available to explain the concept of validity. The oldest and the
most common definition is presented below:

Definition of Validity

A test is valid if it measures what it is supposed to measure.

If you were to examine this definition carefully, you would realise that it does not really
capture the true nature of validity, or of assessment itself. Yet, this is the most commonly
used definition of validity.

We have learnt that a test is only one type of assessment, yet the principle of validity
encompasses all assessment types. The definition above is therefore too simplistic and
insufficient to represent what validity truly is.

More recent publications of assessment-related literature have discussed the concept of validity as
being related to the quality of interpretation made. For example, Nitko and Brookhart (2007) define
validity as the soundness of teachers’ interpretations and uses of students’ assessment results.
Similar to this, Miller, Linn and Gronlund (2013) view validity as an evaluation of the adequacy and
appropriateness of the interpretations and uses of assessment results.

Validity: Refreshed Definition

Validity concerns the interpretation of student learning that the
assessment generates. It refers to the accuracy of that interpretation.

All measures of learning are indirect. Because the constructs that we assess are latent in nature, we
have to base our assessment on tools that are observable and measurable. Evidence of learning is
derived in the form of written responses, verbal responses, or through assessment tasks that are
performance-based and/or production-based. From these written and verbal responses, as well as
from the learning demonstrated through performances and objects created, teachers interpret whether
learning has or has not taken place. If an assessment is highly valid, it provides teachers with
quality information about student learning, and hence a more accurate interpretation of learning
can be made. For example, a speaking test must include assessment tasks that require candidates to
demonstrate their speaking ability verbally. If this speaking test is instead dominated by items
or tasks that require candidates to demonstrate their proficiency in writing, then the test
has low validity.

Our interpretation of validity often involves the process of making inferences. An inference is a
conclusion drawn from the synthesis and interpretation of evidence; it requires judgment on the
basis of the evidence gathered and prior beliefs/expectations.
More on Validity

Validity of any assessment is dependent upon the purpose
of the assessment and the way in which evidence is
interpreted and used by the key stakeholders.

Activity 3.1

Can you think of examples of valid or invalid measures for a test?

Validity Types
There are five types of validity: face validity, construct validity, content validity,
consequential validity and criterion validity.

Face Validity
Face validity represents the determination of an assessment’s validity at surface level. An
assessment has high face validity if, at surface level, it sufficiently reflects the content and
construct being assessed. For example, in a writing test, face validity is judged by asking whether
the test has a sufficient number of items to represent the construct assessed; whether the length of
the assessment is appropriate for the number of items or the complexity of the tasks; whether the
items or tasks appropriately represent the knowledge or content assessed; and so on. Face validity
is the weakest validity type.

Construct Validity
A construct is the attribute, proficiency, ability or skill being assessed. Construct validity
refers to the degree to which inferences about that construct can legitimately be made. Its use is
more significant in research than in classroom tests, for the purpose of establishing a relationship
between the aspect(s) investigated in a study and the theoretical framework used. In a normal
classroom test, construct validity is important when the aspect assessed is abstract in nature,
has no obvious body of content, or when there are no existing criteria.
Content Validity
Content validity refers to how well the content of the assessment samples the classroom situations
or subject matter about which conclusions are to be drawn. A reading test, for example, has high
content validity if the tasks included in the test are a good representation of the domain (reading)
assessed and emphasise important aspects of reading skills.

Consequential Validity
This type of validity refers to the extent to which the use of assessment results accomplishes its
intended purposes, and the extent to which unintended effects of the assessment are avoided. Teachers
need to be aware that the assessment implemented may have positive and/or negative effects on
stakeholders. Some of these effects include increased or decreased learning, increased or decreased
morale and motivation, narrowing of instruction, dropping out of school, etc.

Criterion Validity
Criterion validity is represented by two sub-types: predictive validity and concurrent validity.
Predictive validity refers to the extent to which performance on the assessment predicts future
performance. Concurrent validity, on the other hand, is the extent to which the assessment is able
to provide an estimate of current performance on some valued measure other than the test itself.

Reliability of Assessment
The word reliable is often associated with consistent, and reliability of assessment is often
described in relation to the consistency of assessment results. The definition presented
below is the most common definition used to describe reliability.

Definition of Reliability

An assessment is reliable if it is able to yield consistent results.

The definition above implies that if a student is assessed multiple times with the same assessment
measure (e.g. given the same test), the student should obtain the same score regardless of how many
times he or she is assessed. If the student obtains different scores, then the assessment is not
reliable. There are two problems with this. First, in a normal classroom situation, teachers do
not give the same test to the same student more than once. Second, if a student sits for test A today
and sits for test A again next week, it is highly likely that the score would improve. The definition
given is therefore a poor description of reliability. Even though some reliability types do look at
the consistency of assessment results, this consistency is the outcome of using multiple measures
(different instruments) that produce the same set of scores, or the extent to which the items in an
instrument produce similar responses. A more comprehensive definition of reliability is
provided below:

Reliability: Refreshed Definition

Reliability is the extent to which the items are free of
distractions and the assessment procedures are highly objective.

Reliability Types
This sub-chapter discusses four types of reliability: inter-rater reliability, intra-rater reliability,
parallel forms reliability and internal consistency reliability.

Inter-rater Reliability
Inter-rater reliability refers to the consistency of judgment across different assessors using the same
assessment task and procedures. Teachers only need to be concerned with this type of reliability
when more than one assessor grades the assessment. Different raters grade differently due to
differences in worldview, beliefs and experiences. It is therefore important that all raters approach
students’ work with the same marking standards. This can be achieved if all assessors meet to
discuss marking standards before the marking process begins.
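
Where a numerical check is desired, agreement between two raters can also be quantified. Below is a minimal Python sketch using Cohen's kappa, a widely used chance-corrected agreement statistic; the ten essays and the grades assigned are hypothetical, not data from this chapter.

from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(ratings_a)
    # Observed proportion of exact agreements.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: the probability that both raters assign the same
    # grade by coincidence, given each rater's own grade distribution.
    dist_a, dist_b = Counter(ratings_a), Counter(ratings_b)
    p_chance = sum((dist_a[g] / n) * (dist_b[g] / n) for g in dist_a)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical grades two teachers assigned to the same ten essays.
rater_1 = ["A", "B", "B", "C", "A", "B", "C", "A", "B", "B"]
rater_2 = ["A", "B", "C", "C", "A", "B", "B", "A", "B", "B"]
print(f"kappa = {cohens_kappa(rater_1, rater_2):.2f}")  # 1.0 = perfect agreement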

Intra-rater Reliability
Intra-rater reliability is the consistency of assessment outcomes across time and location when the
same assessment task is marked by the same assessor. Raters are normal human beings with
emotions and feelings, and they are susceptible to distractions. To ensure consistent output (and
hence high intra-rater reliability), raters have to maintain a high level of objectivity when marking.

Parallel Forms Reliability
Parallel forms reliability is used to assess the consistency of the results of two tests constructed in
the same manner and addressing the same content. To examine parallel forms reliability, teachers
have to create two parallel forms of the test. One way to accomplish this is to create a large set of
questions that address the same construct and then randomly divide the questions into two sets.
Both instruments are then administered to the same sample of students.
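
Assuming the common approach of estimating parallel forms reliability as the Pearson correlation between students' scores on the two forms, the minimal Python sketch below splits a hypothetical question pool into two forms and correlates two hypothetical sets of scores.

import random

def split_into_parallel_forms(questions, seed=7):
    # Shuffle the pool reproducibly, then split it into two equal halves.
    pool = questions[:]
    random.Random(seed).shuffle(pool)
    half = len(pool) // 2
    return pool[:half], pool[half:]

def pearson_r(xs, ys):
    # Correlation between the two score sets from the same students.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical: ten questions on the same construct, divided into two forms.
form_a, form_b = split_into_parallel_forms([f"Q{i}" for i in range(1, 11)])

# Hypothetical scores of five students who sat both forms.
scores_a = [12, 15, 9, 18, 14]
scores_b = [11, 16, 10, 17, 13]
print(f"parallel forms reliability = {pearson_r(scores_a, scores_b):.2f}")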
Internal Consistency Reliability
Internal consistency concerns the extent to which the items or tasks act together to elicit a
consistent type of response. Internal consistency is perhaps the most readily computed reliability
type, and a wide variety of internal consistency measures can be used, including
Cronbach’s alpha, split-half, average inter-item and average item-total correlations.
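
As an illustration, Cronbach's alpha can be computed directly from a student-by-item score matrix using the standard formula alpha = k/(k-1) x (1 - sum of item variances / variance of total scores). The minimal Python sketch below uses hypothetical marks of four students on three items.

def cronbach_alpha(scores):
    # scores: one row per student, one column per item.
    k = len(scores[0])  # number of items

    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical marks of four students on three items (0-5 marks each).
scores = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
]
print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")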

Validity and Reliability

The method used to collect the evidence will impact on reliability. The way in which the
assessor uses and interprets the evidence collected will impact on validity.

Classroom Discussion

For each of the questions below, identify the type of validity or reliability.

1. A physics teacher wonders whether her final exam this year is equivalent to her final
exam last year.

2. A standard-six English teacher wonders whether the grades he assigns to his
students’ essays are equivalent to the grades his colleagues would assign them.

3. A law lecturer wonders whether the grades she assigns to her students would differ
if she had not fallen sick during the marking period.

4. A teacher discusses with other teachers the standards for marking a science test
prior to the marking period.

5. A teacher includes attendance and students’ participation as part of the overall
assessment for her course.
Exercises

Answer all the questions.

1. Define validity.

2. List all validity types.

3. Of all types of validity, which is the weakest?

4. Define reliability.

5. List all reliability types.

6. Suggest two strategies to improve inter-rater reliability.

7. Suggest two strategies to improve intra-rater reliability.


CHAPTER FOUR
TEST FORMATS

LEARNING OUTCOMES

At the end of the chapter, the students should be able to...


1. describe different formats of objective-type items.
2. describe different formats of subjective-type items.
3. distinguish between objective and subjective item formats.

Introduction
When teachers develop a test, in addition to the content assessed, they also need to be well
versed in the different types of item formats. A structured item, for example, would gather a
different kind of response compared with a short-answer item, in terms of the length and depth of
knowledge demonstrated, even though both items ask about the same content. Each type of test item
has its own unique characteristics, uses, advantages, limitations and rules for construction.

Objective-type Items
The use of objective-type items enables teachers to administer objective tests to their
students. An objective test is usually a paper-based test in which candidates select a response
from a range of alternatives established by the task developers. This type of testing allows
teachers to cover a wider range of topics (than asking essay-type questions) within a shorter
time frame.

Activity 4.1

Brainstorm the different types of objective-type items.


Objective-type items include:

• true/false item
• underline item
• matching item
• completion item
• multiple-choice question (MCQ)

True/False Item
The true/false item is easy to construct and score, hence its widespread use in classroom tests.
A true/false item is usually phrased in the form of a statement, and the students are required to
identify whether the statement is correct (true) or incorrect (false). Given the simplicity of its
structure, this type of item requires a relatively short time frame to construct. It enables
teachers to assess a wide range of topics; however, the focus is usually on factual
information. Students do not have the freedom to portray their knowledge beyond the
identification of correct and incorrect statements. This item format is the least reliable, as
students are presented with a high probability (50%) of guessing the item correctly.

Underline Item
As the name suggests, the underline item only requires the candidates to underline the correct
option. An example of an underline item is presented below.

[Example underline item from Tan and Ramarau, 2013]

Matching Item
The matching item is appropriate for measuring the relationship or connection between two aspects.
It is relatively easy to construct and score. An example of a matching item is shown below.

[Example matching item from Tan and Ramarau, 2013]

Completion Item
In language tests, the completion item requires the candidates to complete a sentence with a
one-word or short-phrase answer. Commonly used in reading and listening tests, this type of item
is easy to construct; however, it may be more difficult to score than other objective-type items,
primarily because each item might have more than one acceptable answer. Two examples of
completion items are shown below.

[Example completion item from Saw and Chow, 2014]

[Example completion item from Teo and Chang, 2014]

Multiple-choice Question (MCQ)
The multiple-choice question allows teachers to sample learning across a wide range of topics and to
target a variety of cognitive skills, from remembering to understanding, applying, analysing,
synthesising and evaluating. However, high-quality MCQs are difficult to construct, even more so for
items that tap into higher order thinking skills such as analysis, synthesis and evaluation. The
format is nevertheless highly favoured by teachers because of the ease of scoring, and because
responses can be scored by machine. The MCQ is a more reliable item type than the true/false item,
as students only have a 25% chance (given four options) of guessing the item correctly. A sample
MCQ is presented below.

[Example MCQ from Tan and Ramarau, 2013]
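
To make the guessing probabilities above concrete, the minimal Python sketch below simulates a student guessing blindly on a 20-item test, once with true/false items and once with four-option MCQs; the test length and number of trials are arbitrary illustrative choices.

import random

def simulate_blind_guessing(n_items, n_options, n_trials=10_000, seed=1):
    # Average percentage score a blind guesser obtains across many tests.
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        # Each item is guessed correctly with probability 1 / n_options.
        correct += sum(rng.randrange(n_options) == 0 for _ in range(n_items))
    return 100 * correct / (n_trials * n_items)

print(f"true/false:   {simulate_blind_guessing(20, 2):.1f}% expected from guessing")
print(f"4-option MCQ: {simulate_blind_guessing(20, 4):.1f}% expected from guessing")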


Activity 4.2

Of all the objective-type items, true/false is the least reliable item type. Why?

Advantages of Objective Testing

Objective-type items are highly favoured by language teachers. Teachers generally teach up
to five different groups a week, and each group may have more than 40 students. Testing is
time consuming and grading is often a tedious task. Compared with subjective-type items,
objective testing eases the assessment process, and objective item types provide
language teachers with a number of advantages:

• Ease of scoring;
• Cost-efficient scoring procedures;
• Assessing a large number of candidates at one time;
• Appearance of objectivity – reduced bias (high inter-rater and intra-rater reliabilities);
• Easily established and maintained standardised test administration conditions;
• Allows multiple ways to assess underpinning knowledge and understanding;
• Item effectiveness can be computed and determined (see the sketch after this list); and
• Ease of determining validity and reliability.
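
The chapter does not prescribe a method for computing item effectiveness, but a common choice is classical item analysis: a difficulty index (the proportion of students answering an item correctly) and a discrimination index (how well the item separates high scorers from low scorers). The minimal Python sketch below uses a hypothetical 0/1 response matrix.

def item_analysis(responses):
    # responses: one list of 0/1 item scores per student.
    n_items = len(responses[0])
    # Rank students by total score to form upper and lower scoring groups.
    ranked = sorted(responses, key=sum, reverse=True)
    third = max(1, len(ranked) // 3)
    upper, lower = ranked[:third], ranked[-third:]
    stats = []
    for i in range(n_items):
        # Difficulty: proportion of all students answering item i correctly.
        difficulty = sum(row[i] for row in responses) / len(responses)
        # Discrimination: upper-group success rate minus lower-group rate.
        discrimination = (sum(row[i] for row in upper) / len(upper)
                          - sum(row[i] for row in lower) / len(lower))
        stats.append((difficulty, discrimination))
    return stats

# Hypothetical responses of six students to four items (1 = correct).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
]
for i, (p, d) in enumerate(item_analysis(responses), start=1):
    print(f"item {i}: difficulty p = {p:.2f}, discrimination D = {d:+.2f}")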

Disadvantages of Objective Testing

Despite the advantages listed, objective-type items are not without limitations:

• Limited to assessing knowledge and understanding;
• Lack face validity;
• Require high-level skills in item writing, test construction and data analysis.

Subjective-type Items
Subjective-type items are also known as constructed-response items (CR items). Rather than
selecting a response from a list of options, the candidates are required to create their own
responses (American Educational Research Association, American Psychological Association, and
National Council on Measurement in Education, 1999).
Activity 4.3

List the different types of subjective-type items.

Generally, subjective-type items consist of short-answer items, structured items and essay items.
They vary in relation to the level of objectivity and the length of responses expected.

Short-answer Item
Compared with other types of subjective items, the short-answer item has the highest level of
objectivity. This type of item is often used by language teachers to assess a specific aspect of
knowledge involving the retrieval of specific information. The responses expected from this type of
item are limited to one or two words only; hence, it is often limited to measuring the recall of
information.

[Example short-answer item from Tan and Ramarau, 2013]

Structured Item
The length of response expected from this type of item is longer than for a short-answer item and
shorter than for an essay-type item. The structured item provides the candidates with more
flexibility in demonstrating their language skills; however, the responses expected from such items
are still constrained by the scope and structure set out in the item.

Essay Item
The essay item provides students with the freedom to construct and present ideas and concepts in
their own words. One major advantage of this type of item is that it enables teachers to assess
complex skills, particularly learning outcomes that are not readily measured by objective-type items.
Teachers also have the option to increase or decrease the level of item difficulty by targeting
more complex or less complex content. Two types of essay item are the restricted-response essay
item and the extended-response essay item.
Restricted-Response Essay Item

As opposed to the extended-response essay item, the restricted-response essay item has a narrower
scope of content assessed and a shorter expected length of response. These limitations are
usually expressed as part of the item. Please refer to the sample below.

[Example restricted-response essay item from Saw and Chow, 2014]

Extended-Response Essay Item

In the extended-response essay item, students have greater freedom to organise their ideas, analyse
issues and problems, and creatively integrate their thoughts and feelings in a manner they view as
appropriate. Though this type of item presents teachers with the opportunity to measure complex
language skills and higher order cognitive skills, it may be perceived as having low inter-rater and
intra-rater reliability. The scoring of such items is time consuming and requires systematic
measures to improve inter-rater and intra-rater reliabilities.

Advantages of Subjective Testing

Subjective item formats are appropriate when the construct assessed is competency in
abstract thinking and analysis. They are also adaptable to extended-response activities
such as the development of proposals, reports, applications, presentations, etc. Below are the
advantages of subjective item formats.

• Easier to construct;
• Appearance of high face validity;
• Appearance of high content validity in assessing writing skills;
• Assessment of higher order thinking skills;
• Assessment of complex and abstract learning outcomes;
• Assessment is more authentic – resembling tasks that have real-life relevance
(e.g. writing newspaper articles, reports, letters, etc.)

Disadvantages of Subjective Testing

Despite the advantages listed, subjective-type items are not without limitations:

• Time consuming to mark;
• Appearance of low inter-rater and intra-rater reliabilities;
• Assessment criteria can be vague or unspecified, leading to subjective judgment.

If the construct assessed is a non-language skill, such items may favour candidates
with good analytical and communication skills, and writing ability may become one of the
criteria tested rather than the actual substance of the assessment.
Exercises

Answer all the questions.

1. List all objective-item formats.

2. List all subjective-item formats.

3. Why does the true/false item have the lowest reliability compared with other types of
objective items?

4. In your opinion, why is the MCQ a much better item format than the true/false item?

5. Why do subjective-type items appear to have higher face validity than objective-type
items?
