
Evaluation in Language Teaching
By The Untamed

Table of contents
Introduction to the topic
Concepts
Types of testing
Techniques
Things to consider when designing a test
What makes a good test?

Table of contents (cont.)
Marking tests
Conclusion
Bibliography

Introduction
Education is a weapon against many of the evils that threaten humankind, such as poverty, illness and ignorance, and evaluation is a vital element of any system of education. It is what makes it possible for educators, teachers, policy makers and the community in general to see what has been accomplished and what is still missing. That is no different for language teachers, which is why this assignment is about evaluation in language teaching.

Concepts of evaluation
Evaluation is the systematic process of collecting and analyzing data in order to determine whether, and to what degree, objectives have been or are being achieved (Osman).
For Tyler (1949), evaluation is the process of determining to what extent the educational objectives are being realized.

Concepts of evaluation (cont.)
Cameron (1945) summed up these definitions by saying that evaluation is the process of making an overall judgment about one's work or a whole school's work.

Testing
Testing is a technique of obtaining information for evaluation purposes.

Types of testing
Placement Tests
Diagnostic Tests
Achievement Tests
Proficiency Tests

Placement tests
Placing new students in the right class in a school is facilitated by the use of placement tests. Usually based on the syllabuses and materials the students will follow and use once their level has been decided on, these test grammar and vocabulary knowledge and assess students' productive and receptive skills.

Diagnostic tests
This type of test can be used to expose learners' difficulties, gaps in their knowledge and skill deficiencies during a course. Thus, once we know what the problems are, we can do something about them.

Progress or achievement tests
These tests are designed to measure learners' language and skill progress in relation to the syllabus they have been following.
Progress tests are often written by teachers and given to students every few weeks to see how well they are doing.

Achievement tests (cont.)
Achievement tests only work if they contain item types the students are familiar with. At the end of a term, the achievement test should reflect progress, not failure. It should reinforce the learning that has taken place, not go out of its way to expose weaknesses.

Proficiency tests
These give a general picture of a student's knowledge and ability. They are frequently used as stages people have to reach if they want to be admitted to a foreign university, get a job or obtain some kind of certificate.

Common test techniques
For Hughes, test techniques are means of eliciting behaviour from candidates that will tell us about their language abilities. Useful techniques should:
Elicit behaviour which is a reliable and valid indicator of the ability in which teachers are interested;
Elicit behaviour which can be reliably scored;
Be as economical of time and effort as possible.

1. Multiple choice items
Multiple choice items can come in many forms, but they all have a stem and a number of options, of which only one is correct; the others are distractors.
Ms. Larson has been teaching __________ a month.
A. during
B. for
C. while
D. since

Advantages
Scoring can be perfectly reliable
Scoring should also be rapid and economical
It is possible to include more items than would otherwise be possible in a given period
It allows the testing of receptive skills

Disadvantages
The technique tests only recognition knowledge. It might give an inaccurate picture of the candidate's ability if there is a gap between their productive and receptive skills. For instance, a person who can identify the correct answer in the item above may not be able to produce the correct form when speaking or writing.

Guessing may have a considerable but unknowable effect on test scores. The chance of guessing the correct answer in a three-option multiple choice item is one in three. On average, someone can be expected to score 33 on a 100-item test purely by guessing. The trouble, then, is that we can never know what part of any particular individual's score has come about through guessing. So if multiple choice is to be used, there should be at least four options.

Cheating may be facilitated. The fact that the responses on a multiple choice test (a, b, c, d) are so simple makes it easy to communicate them to other candidates non-verbally. This can be prevented by having at least two versions of the test, the only difference between them being the order in which the options are presented.

2. YES/NO and TRUE/FALSE items
When test takers have to choose between YES and NO, or between TRUE and FALSE, we are facing a case of multiple choice items with only two options, which means they have a 50% chance of guessing the right answer. True/False items are sometimes extended by requiring test takers to give a reason for their choice. However, this is a potentially difficult writing task when writing is not meant to be tested.

3. Short-answer items
Items in which the test taker has to provide a short answer are common, particularly in listening and reading tests. Examples:
What does "them" in the last sentence refer to?
At what time is the plane scheduled to leave for London?

Advantages:
There will be less guessing
The technique is not restricted by the need for distractors
Cheating is likely to be more difficult
Items should be easier to design

Disadvantages:
Responses may take longer and so reduce the possible number of items;
The test taker has to produce language in order to respond;
Scoring may be invalid or unreliable, if judgment is required;
Scoring may take longer.

4. Gap filling items
This technique can be used for both reading and listening,
E.g. Hannibal particularly liked to eat brains because of their _______ and their _______
as well as for vocabulary and grammar,
E.g. He asked me for money, __________ though he knows I earn a lot less than him.
E.g. Dad has lost his job. We have to tighten our _________ from now on.

Advantages:
Makes use of short answers
Does not call for significant productive skills
It does not take long when the words can be found in the text or are straightforward

Things to consider when designing a test
Assess the test situation
Decide what to test
Balance the elements
Weight the scores
Make the test work

Assess the test situation
Before we start to write the test, we need to be aware of the context in which the test takes place. We have to decide how much time should be given to the test-taking, when and where it will take place, and how much time there is for marking.

Decide what to test
We have to list what we want to include in our test, bearing in mind the purpose of the test and the skills to be tested. We have to know what programme items can be included and what kinds of topics and situations are appropriate for our students.

Balance the elements
Balancing the elements involves estimating how long we want each section of the test to take and then writing test items within those time constraints. The amount of space and time we give to the various elements should also reflect their importance in our teaching.

Weight the Scores
Even after having balanced the elements in our test, we still have no picture of our students' success or failure; that picture depends on how many marks are given to each section of the test. If we have a three-section test with five questions per section, and we give two marks to each question in section one but only one mark to the questions in the remaining sections, it means that it is more important for students to do well in the former than in the latter.
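The example above can be checked with a short Python sketch. The section names and numbers are just the hypothetical three-section, five-question test from this slide.

```python
# Hypothetical weighting from the slide: three sections of five questions,
# two marks per question in section one, one mark per question elsewhere.
marks_per_question = {"section_1": 2, "section_2": 1, "section_3": 1}
QUESTIONS_PER_SECTION = 5

def section_totals(marks_per_question, questions_per_section):
    """Maximum marks available in each section under this weighting."""
    return {name: marks * questions_per_section
            for name, marks in marks_per_question.items()}

totals = section_totals(marks_per_question, QUESTIONS_PER_SECTION)
print(totals)                # {'section_1': 10, 'section_2': 5, 'section_3': 5}
print(sum(totals.values()))  # 20: section one alone is worth half the test
```

Making the weighting explicit like this shows students and markers at a glance where doing well matters most.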

Make the test Work
When we write test items, the first thing to do is to get fellow teachers to try them out, because they may spot problems which we are not aware of and come up with possible answers and alternatives that we had not anticipated.
Later, having made changes based on our colleagues' reactions, we can try out the test on students. Of course, these students will not be the ones we actually intend to test,

Make the test Work (cont.)
but a class that is roughly similar, or even a class one level above. This will allow us to see which items cause unnecessary problems and how long the test takes.
Such trialling is designed to avoid disaster and to yield a whole range of possible answers to the test items. This means that when other people finally mark the test, we can give them a list of possible alternatives and thus ensure reliable scoring.

What makes a good test?


Validity
Reliability

Validity
Criterion-related validity
Concurrent validity
Predictive validity
Face validity
Construct validity
Formative validity
Sampling validity
Validity in Scoring
How to make a test valid?

Criterion-related validity
Criterion-related validity relates to the degree to which results on the test agree with those provided by some independent and highly dependable assessment of the candidates' ability.
There are essentially two kinds of criterion-related validity: concurrent validity and predictive validity.

Face validity
A test is said to have face validity if it looks as if it measures what it is supposed to measure. Face validity is not a scientific notion and is not seen as providing evidence for construct validity, yet it can be very important. A test which does not have face validity may not be accepted by candidates, teachers, education authorities or employers.

Construct validity
Construct validity is used to ensure that the test actually measures what it is intended to measure and not other variables. Using a panel of experts familiar with the construct is one way in which this type of validity can be assessed. The experts examine the items and decide what each specific item is intended to measure; students can be involved in this process to obtain their feedback.

Formative validity
When applied to outcomes assessment, formative validity is used to assess how well a measure is able to provide information that helps improve the program under study.

Sampling validity
Sampling validity ensures that the measure covers the broad range of areas within the concept under study. Not everything can be covered, so items need to be sampled from all of the domains. This may have to be done by reflecting on what an individual personally feels are the most important or relevant areas.

Validity in Scoring
It is worth pointing out that if a test is to have validity, not only the items but also the way the responses are scored must be valid; it is no use having excellent items if they are scored invalidly. A reading test may call for short written responses; if the scoring of these responses takes spelling and grammar into account, it is not valid to assume that the test measures reading ability. By measuring more than one ability, it makes the measurement of the one ability in question less accurate.

How to make a test more valid?
In order to make a test more valid, there is an obligation to carry out a full validation exercise before the test becomes operational.
First, write explicit specifications for the test which take account of all that is known about the constructs that are to be measured. Make sure that you include a representative sample of their content in the test.

How to make a test more valid? (cont.)
Second, whenever feasible, use direct testing. If for some reason it is decided that indirect testing is necessary, reference should be made to the research literature to confirm that measurement of the relevant underlying construct has been demonstrated using the testing techniques that are to be employed.
Third, make sure that the scoring of responses relates directly to what is being tested.

How to make a test more valid? (cont.)
Finally, do everything possible to make the test reliable. If a test is not reliable, it cannot be valid.

Reliability
Test-retest reliability
Parallel forms reliability
Inter-rater reliability
Internal consistency reliability
Reliability coefficient
How to make a test more reliable?

Test-retest reliability
This is a measure of reliability obtained by administering the same test twice, over a period of time, to a group of individuals. The scores from time one and time two can then be correlated in order to evaluate the test's stability over time.
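As a sketch of what "correlating the scores" means in practice, the snippet below computes a Pearson correlation between two administrations of the same test. The scores are invented illustration data, not from the source.

```python
from statistics import mean, stdev

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# Invented scores for six students, tested twice some weeks apart.
time_one = [55, 62, 70, 48, 81, 66]
time_two = [58, 60, 73, 50, 79, 68]

r = pearson(time_one, time_two)
print(round(r, 2))  # close to 1.0, suggesting the test is stable over time
```

A coefficient near 1.0 indicates stable results across administrations; a low coefficient would suggest the test is not measuring consistently.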

Parallel forms reliability
This is a measure of reliability obtained by administering different versions of an assessment tool to the same group of individuals. Both versions must contain items that explore the same construct, skill or knowledge base.
The scores from the two versions can then be correlated in order to evaluate the consistency of results across alternate versions.

Inter-rater reliability
This is used to assess the degree to which different judges or raters agree in their assessment decisions. This measure is useful because human observers will not necessarily interpret answers the same way; they may disagree as to how well certain responses or materials demonstrate knowledge of the construct or skill being assessed.
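One minimal way to quantify this agreement is simple percent agreement between two raters, sketched below with invented pass/fail decisions. (Chance-corrected statistics such as Cohen's kappa are stricter, but the idea is the same.)

```python
def percent_agreement(ratings_a, ratings_b):
    """Fraction of candidates to whom two raters gave the same decision."""
    matches = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b)
    return matches / len(ratings_a)

# Invented decisions by two raters on the same six oral performances.
rater_a = ["pass", "pass", "fail", "pass", "fail", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass"]

print(percent_agreement(rater_a, rater_b))  # 5 of 6 decisions agree
```

Low agreement signals that the rating criteria are ambiguous or that raters need further training, both of which the later slides address.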

Internal consistency reliability
Under this measure we find average inter-item correlation and split-half reliability. These measures are used to evaluate the degree to which different test items that explore the same construct produce similar results.

Reliability coefficient
This is a way of confirming how accurate a test or measure is, by giving it to the same subjects more than once and determining whether there is a correlation, which is the strength of the relationship and similarity between the two scores.

How to make a test more reliable?
Take enough samples of behaviour
Exclude items which do not discriminate well between weaker and stronger students
Do not allow candidates too much freedom
Write unambiguous items
Provide clear and explicit instructions
Ensure that tests are well laid out and perfectly legible

How to make a test more reliable? (cont.)
Make candidates familiar with the format and testing techniques
Provide uniform and non-distracting conditions of administration
Use items that permit scoring which is as objective as possible
Make comparisons between candidates as direct as possible
Provide a detailed scoring key

How to make a test more reliable? (cont.)
Train scorers
Agree acceptable responses and appropriate scores at the outset of scoring
Identify candidates by number, not name
Employ multiple, independent scoring

Marking Tests
Training
More than one scorer
Global assessment
Analytic profiles
Scoring and interacting during oral tests

Training
If scorers have examples of scripts at various different levels and discuss what marks they should be given, their marking is likely to be less erratic than if they come to the task fresh. If scorers are allowed to watch and discuss videoed oral tests, they can be trained to rate samples of spoken English accurately and consistently in terms of predefined descriptions of performance.

Having More than one Scorer
Reliability can be greatly enhanced by having more than one scorer. The more people who look at a script, the greater the chance that its true worth will be located somewhere between the various scores it is given. Two examiners watching an oral test are more likely to agree on a reliable score than one.

Use Global assessment
One way of specifying the scores that can be given to productive skill work is to create predefined descriptions of performance, known as global assessment scales. Such descriptions say what students need to be capable of in order to gain the required marks.

Analytic profiles
Marking gets more reliable when a student's performance is analyzed in much greater detail: instead of just a general assessment, marks are awarded for different elements. For oral assessment we can judge a student's speaking in a number of different ways, such as pronunciation, fluency, use of lexis and grammar, and intelligibility. A combination of global and analytic scoring gives the best chance of a reliable mark.

Scoring and interacting during oral tests
Scorer reliability in oral tests is helped not only by global assessment and analytic profiles but also by separating the role of scorer (examiner) from the role of interlocutor (the examiner who guides and provokes conversation).

Scoring and interacting during oral tests (cont.)
In many tests of speaking, students are now put in pairs or groups for certain tasks, since it is felt that this will ensure genuine interaction and will help to relax students in a way that interlocutor-candidate interaction might not.

Conclusion
Evaluation is one of the methods used to test how well knowledge and skills are being delivered and received.
While there is a lot to be said about this topic, the principles are the one aspect that caught our attention the most. We look at evaluation as a system of testing learners, and at the principles as the rules that govern that system, without which evaluation would not be effective.

Bibliography
Cronbach, L. J. 1971. Test Validation. In R. L. Thorndike (Ed.), Educational
Harmer, J. 2007. The Practice of English Language Teaching. England: Pearson Education Ltd.
Hedge, T. 2003. Teaching and Learning in the Language Classroom. UK: OUP.
Hughes, A. 2003. Testing for Language Teachers. UK: CUP.

Thank You
