
Assessing and Evaluating Learning

Outline:
Introduction
Definition of assessment and evaluation
Aim of student evaluation
Steps in student evaluation
The basic principles of assessment/evaluation
Regulation of learning by the teacher
Types of evaluation
Qualities of a test
Characteristics of measurement instrument
Advantages and disadvantages of different types of tests

Introduction
Assessment and evaluation are essential components of teaching and learning. Without an effective evaluation program it is impossible to know whether students have learned, whether teaching has been effective, or how best to address student learning needs.

Definition of assessment:
Assessment is the process of gathering information on student learning.
Definition of evaluation:
Evaluation is the process of analyzing, reflecting upon, and summarizing assessment information, and making judgments and/or decisions based on the information collected.

Aim of student evaluation


Incentive to learn
Feedback to student
Modification of learning activities
Selection of students
Success or failure
Feedback to teacher
Protection of society

Types of evaluation
1- Formative evaluation:
It is an ongoing classroom process that keeps students and educators informed of students' progress toward program learning objectives.
The main purpose of formative evaluation is to improve instruction and student learning.

2- Summative evaluation
It occurs most often at the end of a unit. The teacher uses summative evaluation to determine what has been learned over a period of time, to summarize student progress, and to report to students, parents and educators on progress relative to curriculum objectives.

3- Diagnostic evaluation
It usually occurs at the beginning of the school year or before a new unit.
It identifies students who lack prerequisite knowledge, understanding or skills. Diagnostic testing also identifies student interests.
Diagnostic evaluation provides information essential to teachers in designing appropriate programs for all students.

Steps in student evaluation
Defining the criteria of the educational objectives
Development and use of measuring instruments
Interpretation of measurement data
Formulation of judgment and taking of appropriate action

Principles of Evaluation
Evaluation should be
1. Based on clearly stated objectives
2. Comprehensive
3. Cooperative
4. Used judiciously
5. A continuous and integral part of the teaching-learning process

Qualities of a Good Measuring Instrument
Validity: the extent to which the instrument measures what it is intended to measure.
Reliability: the consistency with which an instrument measures a given variable.
Objectivity: the extent to which independent and competent examiners agree on what constitutes a good answer for each of the elements of a measuring instrument.
Practicability: the overall simplicity of the use of a test, both for the test constructor and for students.

Qualities of a test
Directly related to educational objectives
Realistic & practical
Concerned with important & useful matters
Comprehensive but brief
Precise & clear

QUALITIES OF A GOOD MEASURING INSTRUMENT
Validity means the degree to which a test or measuring instrument measures what it intends to measure. The validity of a measuring instrument has to do with its soundness: what the test measures, its effectiveness, and how well it can be applied.
For instance, to judge the validity of a performance test, it is necessary to consider what kind of performance the test is supposed to measure and how well it manifests itself.

VALIDITY
Denotes the extent to which an instrument is measuring what it is supposed to measure.

Criterion-Related Validity
A method for assessing the validity of an instrument by comparing its scores with another criterion known already to be a measure of the same trait or skill.
Criterion-related validity is usually expressed as a correlation between the test in question and the criterion measure. The correlation coefficient is referred to as a validity coefficient.

Types of Validity
Content Validity. Content validity means the extent to which the content or topic of the test is truly representative of the course. It involves, essentially, the systematic examination of the test content to determine if it covers a representative sample of the behaviour domain to be measured. It is very important that the behaviour domain to be tested be systematically analysed to make certain that all major aspects are covered by the test items, and in correct proportions. The domain under consideration should be fully described in advance rather than defined after the test has been prepared.

CONTENT VALIDITY
Content validity is described by the relevance of a test to different types of criteria, such as thorough judgment and systematic examination of relevant course syllabi and textbooks, pooled judgment of subject matter experts, statements of behavioural objectives, analysis of teacher-made test questions, among others. Thus content validity depends on the relevance of the individual's responses to the behaviour under consideration, rather than on the apparent relevance of item content.

Content validity
Content validity is commonly used in evaluating achievement tests. A well-constructed achievement test should cover the objectives of instruction, not just its subject matter. The Taxonomy of Educational Objectives by Bloom would be of great help in listing the objectives to be covered in an achievement test.
Content validity is particularly appropriate for criterion-referenced measures. It is also applicable to certain occupational tests designed to select and classify employees. But content validity is inappropriate for aptitude and personality tests.

CONTENT VALIDITY
Whether the individual items of a test represent what you actually want to assess.

ILLUSTRATION
For instance, a teacher wishes to validate a test in Mathematics. He requests experts in Mathematics to judge if the test items or questions measure the knowledge, skills, and values they are supposed to measure. Another way of testing validity is for the teacher to check if the test items or questions represent the knowledge, skills and values suggested in the Mathematics course content.

Good and Scates (1972) suggested the evidences of validity of a test or questionnaire, which are as follows:
1. Is the question on the subject? Yes ___ No ___
2. Is the question perfectly clear and unambiguous? Yes ___ No ___
3. Does the question get at something stable which is typical of the individual or of the situation? Yes ___ No ___
4. Does the question pull? Yes ___ No ___
5. Do the responses show a reasonable range of variation? Yes ___ No ___
6. Is the information obtained consistent? Yes ___ No ___
7. Is the item sufficiently inclusive? Yes ___ No ___
8. Is there a possibility of using an external criterion to evaluate the test/questionnaire? Yes ___ No ___

CONCURRENT VALIDITY
Concurrent validity is the degree to which the test agrees or correlates with a criterion set up as an acceptable measure. The criterion is always available at the time of testing. It is applicable to tests employed for the diagnosis of existing status rather than for the prediction of future outcomes.

CONCURRENT VALIDITY
The extent to which a procedure correlates with the current behavior of subjects.

ILLUSTRATION
For example, a teacher wishes to validate a Science achievement test he has constructed. He administers the test to a group of Science students. The results of the test are correlated with an acceptable Science test which has previously been proven valid. If the correlation is high, the Science test he has constructed is valid.
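
To make the procedure concrete, here is a minimal Python sketch of the validity coefficient described above: a Pearson correlation between scores on the new test and on the established criterion test. The score lists are hypothetical, for illustration only.

from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two paired score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

new_test = [78, 85, 62, 90, 70]    # hypothetical scores on the new Science test
criterion = [75, 88, 60, 93, 72]   # hypothetical scores on the validated test
print(round(pearson_r(new_test, criterion), 2))  # a high r is evidence of validity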

PREDICTIVE VALIDITY
Predictive validity is determined by showing how well predictions made from the test are confirmed by evidence gathered at some subsequent time. The criterion measure against which this type of validity is checked is important because the outcome for the subject is predicted.

PREDICTIVE VALIDITY
The extent to which a procedure allows accurate predictions about a subject's future behavior.

ILLUSTRATION
For instance, the teacher wants to estimate how well a student may be able to do in graduate school courses on the basis of how well he has done on tests he took in the undergraduate courses. The criterion measure against which the test scores are validated is obtained after a long interval.

CONSTRUCT VALIDITY
Construct validity of a test is the extent to which the test measures a theoretical trait. This involves such tests as those of understanding, appreciation and interpretation of data. Examples are intelligence and mechanical aptitude tests.

CONSTRUCT VALIDITY
The extent to which a test measures a
theoretical construct or attribute.

CONSTRUCT
Abstract concepts such as
intelligence, self-concept,
motivation, aggression and
creativity that can be observed by
some type of instrument.

ILLUSTRATION
For example, a teacher wishes to establish the validity of an IQ measure using the Culture Fair Intelligence Test. He hypothesizes that students with high IQ also have high achievement and those with low IQ, low achievement. He therefore administers both the Culture Fair Intelligence Test and an achievement test to groups of students. If students with high IQ have high scores in the achievement test and those with low IQ have low scores in the achievement test, the test is valid.

A test's construct validity is often assessed by its convergent and discriminant validity.

FACTORS AFFECTING VALIDITY
1. Test-related factors
2. The criterion to which you compare your instrument may not be well enough established
3. Intervening events
4. Reliability

RELIABILITY
Reliability means the extent to which a test is dependable, self-consistent and stable. In other words, the test agrees with itself. It is concerned with the consistency of responses from moment to moment: even if a person takes the same test twice, the test yields the same results. However, a reliable test may not always be valid.

RELIABILITY
The consistency of measurements.
A RELIABLE TEST produces similar scores across various conditions and situations, including different evaluators and testing environments.

How do we account for an individual who does not get exactly the same test score every time he or she takes the test?
1. Test-taker's temporary psychological or physical state
2. Environmental factors
3. Test form
4. Multiple raters

RELIABILITY COEFFICIENTS
The statistic for expressing reliability.
Expresses the degree of consistency in the measurement of test scores.
Denoted by the letter r with two identical subscripts (rxx).

RELIABILITY
For instance, Student C took a Chemistry test twice. His answer to item 5, "What is the neutral pH?", was 6.0. In the second administration of the same test and question, his answer was still 6.0; thus, his response is reliable but not valid. His answer is reliable because of the consistency of his responses (6.0), but not valid because the answer is wrong: the correct answer is pH 7.0. Hence, a reliable test may not always be valid.

METHODS IN TESTING THE RELIABILITY OF A GOOD MEASURING INSTRUMENT

TEST-RETEST METHOD. The same measuring instrument is administered twice to the same group of students and the correlation coefficient is determined. The limitations of this method are: (1) when the time interval is short, the respondents may recall their previous responses and this tends to make the correlation coefficient high; (2) when the time interval is long, such factors as unlearning and forgetting, among others, may occur and may result in a low correlation of the measuring instrument; and (3) regardless of the time interval separating the two administrations, other varying environmental conditions such as noise, temperature, lighting, and other factors may affect the correlation coefficient of the measuring instrument.

TEST-RETEST RELIABILITY
Suggests that subjects tend to obtain the same score when tested at different times.

SPEARMAN RANK CORRELATION COEFFICIENT
Spearman rho is a statistical tool used to measure the relationship between paired ranks assigned to individual scores on two variables, X and Y. Thus, it is used to correlate the scores in a test-retest method.

Spearman rho formula:

rs = 1 - (6ΣD²) / (N³ - N)

where
rs = Spearman rho
ΣD² = sum of the squared differences between ranks
N = total number of cases

Ex. Spearman rho Computation of the First and Second Administration of an Achievement Test in English (artificial data)

Student    X     Y     Rx     Ry      D      D²
1          90    70    2.0    7.5   -5.5   30.25
2          43    31   13.0   12.5    0.5    0.25
3          84    79    6.5    3.0    3.5   12.25
4          86    70    4.5    7.5   -3.0    9.00
5          55    43   11.0   10.5    0.5    0.25
6          77    70    8.5    7.5    1.0    1.00
7          84    75    6.5    4.5    2.0    4.00
8          91    88    1.0    1.0    0.0    0.00
9          40    31   14.0   12.5    1.5    2.25
10         75    70   10.0    7.5    2.5    6.25
11         86    80    4.5    2.0    2.5    6.25
12         89    75    3.0    4.5   -1.5    2.25
13         48    30   12.0   14.0   -2.0    4.00
14         77    43    8.5   10.5   -2.0    4.00
TOTAL                                     ΣD² = 82.00

(X = first administration, Y = second administration; Rx, Ry = ranks; D = Rx - Ry.)

SPEARMAN RHO VALUE

rs = 1 - (6ΣD²) / (N³ - N)
   = 1 - 6(82) / (14³ - 14)
   = 1 - 492/2730
   = 1 - 0.18021978
   = 0.82 (high relationship)
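
As a check on the arithmetic, here is a minimal Python sketch of the same computation, using N = 14 and the ΣD² = 82.00 from the table above.

def spearman_rho(sum_d_squared, n):
    """rs = 1 - 6*sum(D^2) / (N^3 - N), the formula used above."""
    return 1 - (6 * sum_d_squared) / (n ** 3 - n)

print(round(spearman_rho(82.0, 14), 2))  # 0.82 -> high relationship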

PARALLEL-FORMS METHOD. Parallel or equivalent forms of a test may be administered to the group of students, and the paired observations correlated. In estimating reliability by the administration of parallel or equivalent forms of a test, a criterion of parallelism is required (Ferguson and Takane, 1989). The two forms of the test must be constructed so that the content, type of item, difficulty, instructions for administration, and many others, are similar but not identical.

ALTERNATE FORMS RELIABILITY
Also known as equivalent forms reliability or parallel forms reliability
Obtained by administering two equivalent tests to the same group of examinees
Items are matched for difficulty on each test
It is necessary that the time frame between giving the two forms be as short as possible

SPLIT-HALF METHOD

The test in this method may be administered once, but the test items are divided into two halves. The common procedure is to divide a test into odd and even items. The two halves of the test must be similar but not identical in content, number of items, difficulty, means and standard deviations. Each student obtains two scores, one on the odd and the other on the even items in the same test. The scores obtained in the two halves are correlated. The result is a reliability coefficient for a half test. Since the reliability holds only for a half test, the reliability coefficient for the whole test may be estimated by using the Spearman-Brown formula.

Split-Half Reliability
Sometimes referred to as internal consistency
Indicates that subjects' scores on some trials consistently match their scores on other trials

Formula

rwt = 2(rht) / (1 + rht)

where rwt is the reliability of the whole test and rht is the reliability of the half test.
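
A minimal Python sketch of this step-up correction, applied to the half-test reliability of .84 obtained in the example below; the function simply encodes the formula above.

def spearman_brown(r_half):
    """rwt = 2*rht / (1 + rht): whole-test reliability from half-test reliability."""
    return 2 * r_half / (1 + r_half)

print(round(spearman_brown(0.84), 2))  # 0.91 -> very high reliability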

For instance, a test is administered to the students as a pilot sample to test the reliability coefficient of the odd and even items.

Student    X (odd)   Y (even)    Rx     Ry      D      D²
1          23        30          9.0    7.5    1.5    2.25
2          25        24          7.5    9.5   -2.0    4.00
3          27        30          6.0    7.5   -1.5    2.25
4          25        35          7.5    6.0    1.5    2.25
5          50        51          2.0    4.0   -2.0    4.00
6          38        60          4.0    1.0    3.0    9.00
7          55        55          1.0    2.5   -1.5    2.25
8          35        40          5.0    5.0    0.0    0.00
9          48        55          3.0    2.5    0.5    0.25
10         21        24         10.0    9.5    0.5    0.25
TOTAL                                        ΣD² = 26.50

rht = .84
rwt = .91 (very high reliability)

INTERNAL CONSISTENCY METHOD

This method is used with psychological tests which consist of dichotomously scored items: the examinee either passes or fails an item. A rating of 1 is assigned for a pass and 0 (zero) for a failure. The reliability coefficient is determined by Kuder-Richardson Formula 20, a measure of the internal consistency or homogeneity of a measuring instrument. The formula is

rxx = [N / (N - 1)] x [(SD² - Σpiqi) / SD²]

where N is the number of items, pi is the proportion of examinees passing item i, qi = 1 - pi, and SD² is the variance of the total scores,

SD² = Σ(X - X̄)² / (n - 1), with X̄ = ΣX / n (n = number of examinees).

COMPUTATION OF KUDER-RICHARDSON FORMULA 20
(14 examinees; the second column gives the number who passed each item)

Item   No. passing    pi     qi     piqi
1      12             .86    .14    .1204
2      12             .86    .14    .1204
3      11             .79    .21    .1659
4      10             .71    .29    .2059
5      10             .71    .29    .2059
6      10             .71    .29    .2059
7       9             .64    .36    .2304
8       8             .57    .43    .2451
9       8             .57    .43    .2451
10      4             .29    .71    .2059

Σpiqi = 1.9509

rxx = .79 (high relationship)
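
For illustration, here is a minimal Python sketch of KR-20 computed from a 0/1 item-score matrix (rows = examinees, columns = items). The matrix is hypothetical, and the variance uses the n - 1 divisor given in the formula above.

def kr20(scores):
    """Kuder-Richardson Formula 20 for dichotomously scored items."""
    n_items = len(scores[0])
    n_examinees = len(scores)
    totals = [sum(row) for row in scores]
    mean = sum(totals) / n_examinees
    # Variance of total scores (n - 1 divisor, as in the formula above).
    variance = sum((t - mean) ** 2 for t in totals) / (n_examinees - 1)
    # Sum of p*q over items, where p = proportion of examinees passing the item.
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in scores) / n_examinees
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / variance)

scores = [
    [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0],
    [1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 0],
]
print(round(kr20(scores), 2))  # hypothetical data -> about 0.87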

INTERRATER RELIABILITY
Involves having two raters independently
observe and record specified behaviors,
such as hitting, crying, yelling, and getting
out of the seat, during the same time period

TARGET BEHAVIOR
A specific behavior the observer is looking to record.
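
A minimal sketch of the simplest interrater-reliability index, percent agreement, for two raters' interval-by-interval records of a target behavior. The records are hypothetical: 1 means the behavior was observed in the interval, 0 means it was not.

rater_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
rater_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

# Proportion of intervals on which the two raters agree.
agreements = sum(a == b for a, b in zip(rater_a, rater_b))
print(agreements / len(rater_a))  # 0.8 -> the raters agree on 80% of intervals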

Interpretation of Correlation Coefficient Values

An r from + 0.00 to + 0.20 denotes negligible correlation
An r from + 0.21 to + 0.40 denotes low correlation
An r from + 0.41 to + 0.70 denotes marked or moderate correlation
An r from + 0.71 to + 0.90 denotes high correlation
An r from + 0.91 to + 0.99 denotes very high correlation
An r of + 1.00 denotes perfect correlation
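
This scale can be codified directly; the following minimal Python sketch maps the absolute value of r to the labels above.

def interpret_r(r):
    """Descriptive label for a correlation coefficient, per the scale above."""
    r = abs(r)
    if r <= 0.20:
        return "negligible correlation"
    if r <= 0.40:
        return "low correlation"
    if r <= 0.70:
        return "marked or moderate correlation"
    if r <= 0.90:
        return "high correlation"
    if r < 1.00:
        return "very high correlation"
    return "perfect correlation"

print(interpret_r(0.82))  # "high correlation"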

OBTAINED SCORE
The score you get when you administer a
test
Consists of two parts: the true score and
the error score

STANDARD ERROR of MEASUREMENT (SEM)

Gives the margin of error that you should expect in an individual test score because of the imperfect reliability of the test.
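
The text does not give the formula; the conventional one computes SEM from the test's standard deviation and its reliability coefficient, SEM = SD * sqrt(1 - rxx). A minimal sketch:

from math import sqrt

def sem(sd, reliability):
    """Standard error of measurement: SEM = SD * sqrt(1 - rxx)."""
    return sd * sqrt(1 - reliability)

print(round(sem(10, 0.91), 1))  # SD = 10, rxx = .91 -> SEM = 3.0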

Evaluating the Reliability Coefficients
The test manual should indicate why a certain type of reliability coefficient was reported.
The manual should indicate the conditions under which the data were obtained.
The manual should indicate the important characteristics of the group used in gathering reliability information.

FACTORS AFFECTING RELIABILITY
1. Test length
2. Test-retest interval
3. Variability of scores
4. Guessing
5. Variation within the test situation

Test reliability can be improved by the following factors:

INCREASED NUMBER OF TEST ITEMS.
HETEROGENEITY OF THE LEARNER GROUP.
MODERATE ITEM DIFFICULTY.
OBJECTIVE SCORING.
LIMITED TIME.

USABILITY
Usability means the degree to which the measuring instrument can be satisfactorily used by teachers, researchers, supervisors and school managers without undue expenditure of time, money, and effort. In other words, usability means practicality.

Factors that determine usability

Ease of administration
Ease of scoring
Construction of the test in objective type
Answer keys are adequately prepared
Scoring directions are fully understood
Ease of interpretation and application
Low cost
Proper mechanical makeup

Advantages and disadvantages of different types of tests
1- Oral examinations:
Advantages
1. Provide direct personal contact with candidates.
2. Provide opportunity to take mitigating circumstances into account.
3. Provide flexibility in moving from candidate's strong points to
weak areas.
4. Require the candidate to formulate his own replies without cues.
5. Provide opportunity to question the candidate about how he
arrived at an answer.
6. Provide opportunity for simultaneous assessment by two
examiners.

1- Oral examinations

Disadvantages
1. Lack standardization.
2. Lack objectivity and reproducibility of results.
3. Permit favoritism and possible abuse of the
personal contact.
4. Suffer from undue influence of irrelevant factors.
5. Suffer from shortage of trained examiners to
administer the examination.
6. Are excessively costly in terms of professional time in relation to the limited value of the information they yield.

2- Practical examinations

Advantages
1. Provide opportunity to test in realistic setting skills involving
all the senses while the examiner observes and checks
performance.
2. Provide opportunity to confront the candidate with
problems he has not met before both in the laboratory and
at the bedside, to test his investigative ability as opposed
to his ability to apply ready-made "recipes".
3. Provide opportunity to observe and test attitudes and
responsiveness to a complex situation (videotape
recording).
4. Provide opportunity to test the ability to communicate under pressure, to discriminate between important and trivial issues, and to arrange the data in a final form.

2- Practical examinations
Disadvantages
1. Lack standardized conditions in laboratory
experiments using animals, in surveys in the
community or in bedside examinations with patients of
varying degrees of cooperativeness.
2. Lack objectivity and suffer from intrusion of irrelevant factors.
3. Are of limited feasibility for large groups.
4. Entail difficulties in arranging for examiners to
observe candidates demonstrating the skills to be
tested.

3- Essay examinations
Advantages
1. Provide candidate with opportunity to demonstrate his knowledge and his ability to organize ideas and express them effectively.
Disadvantages
1. Limit severely the area of the student's total work
that can be sampled.
2. Lack objectivity.
3. Provide little useful feedback.
4. Take a long time to score.

4- Multiple-choice questions
Advantages
1. Ensure objectivity, reliability and validity; preparation of
questions with colleagues provides constructive criticism.
2. Increase significantly the range and variety of facts that
can be sampled in a given time.
3. Provide precise and unambiguous measurement of the
higher intellectual processes.
4. Provide detailed feedback for both student and teachers.
5. Are easy and rapid to score.

4- Multiple-choice questions
Disadvantages
1. Take a long time to construct in order to avoid
arbitrary and ambiguous questions.
2. Also require careful preparation to avoid
preponderance of questions testing only recall.
3. Provide cues that do not exist in practice.
4. Are "costly" where number of students is
small.
