
DEVELOPMENT OF VARIED TOOLS:
KNOWLEDGE AND REASONING
Types of Objective Tests:

a. true-false items
b. multiple-choice type items
c. matching items
d. enumeration and filling of blanks; and
e. essays
Planning a Test and Construction of a Table of Specifications (TOS)
The important steps in planning for a test are:
Identifying Test Objectives. An objective test, if it is to be comprehensive, must cover the various levels of Bloom's taxonomy. Each objective consists of a statement of what is to be achieved and, preferably, the percentage of students expected to achieve it.
EXAMPLE: We want to construct a test in the topic:
“Subject-Verb Agreement in English” for a Grade V
class. The following are typical objectives:
KNOWLEDGE. The students must be able to identify
the subject and the verb in a given sentence.
COMPREHENSION. The students must be able to
determine the appropriate form of a verb to be used
given the subject of a sentence.
APPLICATION. The students must be able to write
sentences observing rules on subject-verb agreement.
ANALYSIS. The students must be able to break down
a given sentence into its subject and predicate.
SYNTHESIS/EVALUATION. The students must be
able to formulate rules to be followed regarding
subject-verb agreement.
Deciding on the type of objective test. The test
objectives guide the kind of objective tests that will
be designed and constructed by the teacher.
Preparing a table of specifications. A table of specifications (TOS) is a test map that guides the teacher in constructing a test. The simplest TOS consists of four columns: (a) level of objective to be tested, (b) statement of objective, (c) item numbers where such an objective is being tested, and (d) number of items and percentage out of the total for that particular objective.
TABLE OF SPECIFICATIONS PROTOTYPE

LEVEL                    OBJECTIVE                             ITEM NUMBERS         NO.      %
1. Knowledge             Identify subject-verb                 1, 3, 5, 7, 9        5        16.67%
2. Comprehension         Forming appropriate verb forms        2, 4, 6, 8, 10       5        16.67%
3. Application           Determining subject and predicate     11, 13, 15, 17, 19   5        16.67%
4. Analysis              Formulating rules on agreement        12, 14, 16, 18, 20   5        16.67%
5. Synthesis/Evaluation  Writing of sentences observing        Part II              10 pts   33.32%
                         rules on subject-verb agreement
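To make the arithmetic in the last column concrete, here is a minimal sketch (in Python) of how the percentage weights of a TOS can be computed; the row labels and weights simply mirror the prototype above.

# A minimal sketch: computing the percentage column of a TOS.
# Rows mirror the prototype above; the weight is the number of
# items (or points, for the essay part) allotted to each objective.
rows = [
    ("Knowledge",            "Identify subject-verb",             5),
    ("Comprehension",        "Forming appropriate verb forms",    5),
    ("Application",          "Determining subject and predicate", 5),
    ("Analysis",             "Formulating rules on agreement",    5),
    ("Synthesis/Evaluation", "Writing sentences on agreement",   10),
]

total = sum(weight for _, _, weight in rows)  # 30 in the prototype

for level, objective, weight in rows:
    # 5/30 = 16.67%; 10/30 = 33.33%, which the prototype rounds
    # to 33.32% so that the column totals exactly 100%.
    pct = 100 * weight / total
    print(f"{level:<22} {objective:<36} {weight:>3}  {pct:6.2f}%")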
Constructing the test items. The actual construction
of the test items follows the TOS. As a general rule, it
is advised that the actual number of items to be
constructed in the draft should be double the desired
number of items. For instance, if there are five (5)
knowledge level items to be included in the final test
form, then at least ten (10) knowledge level items
should be included in the draft.
Item analysis and try-out. The test draft is tried out on a group of pupils or students. The purpose of this try-out is to determine: (a) item characteristics through item analysis, and (b) characteristics of the test itself: validity, reliability, and practicality.
CONSTRUCTING A TRUE-FALSE TEST
Here are some rules of thumb in constructing true-
false items:
RULE 1. Do not give a hint (inadvertently) in the
body of the question.
Example. The Philippines gained its independence
in 1898 and therefore celebrated its centennial year
in 2000.
RULE 2. Avoid using the words “always”, “never”,
“often” and other adverbs that tend to be either
always true or always false.
EXAMPLE. Christmas always falls on a Sunday
because it is a Sabbath day.
RULE 3. Avoid long sentences as these tend to be
“true”. Keep sentences short.
EXAMPLE. Tests need to be valid, reliable, and
useful, although, it would require a great amount of
time and effort to ensure that tests possess these
test characteristics.
RULE 4. Avoid trick statements with some minor
misleading word or spelling anomaly, misplaced
phrases, etc. A wise student who does not know the
subject matter may detect this strategy and thus
get the answer correctly.
EXAMPLE. True or False. The Principle of our
school is Mr. Albert P. Panadero.
RULE 5. Avoid quoting verbatim from reference
materials or textbooks. This practice sends the wrong
signal to the student that it is necessary to memorize
the textbook word for word and thus, acquisition of
higher level thinking skills is not given due
importance.
RULE 6. Avoid specific determiners or give-away
qualifiers. Students quickly learn that strongly
worded statements are more likely to be false than
true, for example, statements with “never”, “no”,
“all”, “always”. Moderately worded statements are
more likely to be true than false. Statements with
“many”, “often”, “sometimes”, “generally”,
“frequently”, or “some” should be avoided.
RULE 7. With true or false questions, avoid a
grossly disproportionate number of either
true or false statements or even patterns in
the occurrence of true and false statements.
MULTIPLE CHOICE TESTS
A generalization of the true-false test, the multiple-choice type of test offers the student more than two (2) options per item to choose from. Each item in a multiple-choice test consists of two parts: (a) the stem, and (b) the options.

Guidelines in Constructing Multiple Choice Items


1. Do not use unfamiliar words, terms and phrases.
EXAMPLE. What would be the system reliability of a computer system whose slave and peripherals are connected in parallel circuits and each one has a known time-to-failure probability of 0.05?
2. Do not use modifiers that are vague and
whose meanings can differ from one person
to the next such as: much, often, usually, etc.
EXAMPLE. Much of the process of
photosynthesis takes place in the:
a. bark
b. leaf
c. stem
3. Avoid complex or awkward word arrangements.
Also, avoid use of negatives in the stem as this
may add unnecessary comprehension difficulties.
EXAMPLE.
(Poor) As President of the Republic of the
Philippines, Corazon Cojuangco Aquino would
stand next to which President of the Philippine
republic subsequent to the 1986 EDSA
Revolution?
(Better) Who was the President of the
Philippines after Corazon C. Aquino?
4. Do not use negatives or double negatives as such
statements tend to be confusing. It is best to use
simpler sentences rather than sentences that would
require expertise in grammatical construction.
EXAMPLE.
(Poor) Which of the following will not cause
inflation in the Philippine economy?
(Better) Which of the following will cause inflation in the Philippine economy?
5. Each item stem should be as short as possible; otherwise you risk testing more for reading and comprehension skills.
6. Distracters should be equally plausible and
attractive.
EXAMPLE. The short story “May Day’s Eve” was written by which Filipino author?
a. Jose Garcia Villa
b. Nick Joaquin
c. Genoveva Edrosa Matute
d. Robert Frost
e. Edgar Allan Poe
7. All multiple-choice options should be grammatically consistent with the stem.
8. The length, explicitness, or degree of
technicality of alternatives should not be the
determinants of the correctness of the answer. The
following is an example of this rule:
EXAMPLE: If the three angles of two triangles are
congruent, then the triangles are:
a. congruent whenever one of the sides of
triangles are congruent
b. similar
c. equiangular and therefore, must also be
congruent
d. equilateral if they are equiangular
9. Avoid stems that reveal the answer to another item.
10. Avoid alternatives that are synonymous with others or those that include or overlap others.
EXAMPLE. What causes ice to transform from solid
state to liquid state?
a. change in temperature
b. changes in pressure
c. change in the chemical composition
d. change in heat levels
11. Avoid presenting sequenced items in the same
order as in the text.
12. Avoid use of assumed qualifiers that
many examinees may not be aware of.
13. Avoid use of unnecessary words or
phrases, which are not relevant to the
problem at hand (unless such discriminating
ability is the primary intent of the
evaluation). The item’s value is particularly
damaged if the unnecessary material is
designed to distract or mislead. Such items
test the student’s reading comprehension
rather than knowledge of the subject matter.
EXAMPLE. The side opposite the thirty
degree angle in a right triangle is equal to
half the length of the hypotenuse. If the sine of a 30-degree angle is 0.5 and the hypotenuse is 5, what is the length of the side opposite the 30-degree angle?
a. 2.5
b. 3.5
c. 5.5
d. 1.5
14. Avoid use of non-relevant sources of difficulty such as requiring a complex calculation when only knowledge of a principle is being tested.
15. Avoid extreme specificity requirements in responses.
16. Include as much of the item as possible in the stem.
This allows less repetition and shorter choice options.
17. Use the “None of the above” option only when the
keyed answer is totally correct. When choice of the
“best” response is intended, “none of the above” is not
appropriate, since the implication has already been made
that the correct response may be partially inaccurate.
18. Note that the use of “all of the above” may allow
credit for partial knowledge. In a multiple option
item, (allowing only one option choice) if a student
only knew that two (2) options were correct, he
could then deduce the correctness of “all of the
above”. This assumes you are allowed only one
correct choice.
19. Having compound response choices may
purposefully increase difficulty of an item.
20. The difficulty of a multiple-choice item may be controlled by varying the homogeneity or degree of similarity of the responses. The more homogeneous the options, the more difficult the item.
EXAMPLE.
(Less Homogeneous)
Thailand is located in:
a. Southeast Asia
b. Eastern Europe
c. South America
d. East Africa
e. Central America

(More Homogeneous)
Thailand is located next to:
a. Laos & Kampuchea
b. India & China
c. China & Malaya
d. Laos & China
e. India & Malaya
MATCHING TYPE AND SUPPLY TYPE ITEMS
The matching type items may be considered as modified multiple
choice type items where the choices progressively reduce as one
successfully matches the items on the left with the items on the
right.
EXAMPLE: Match the items in column A with the items in column
B.
COLUMN A            COLUMN B
__ 1. Magellan      a. First President of the Republic
__ 2. Mabini        b. National Hero
__ 3. Rizal         c. Discovered the Philippines
__ 4. Lapu-Lapu     d. Brain of the Katipunan
__ 5. Aguinaldo     e. The great painter
                    f. Defended Limasawa island
Another useful device for testing lower order
thinking skills is the supply type of tests. Like the
multiple choice test, the items in this kind of test
consist of a stem and a blank where the students
would write the correct answer.

EXAMPLE. The study of life and living organisms is called _________.
Supply type tests depend heavily on the way that
the stems are constructed. These tests allow for one
and only one answer and, hence, often test only the
students’ knowledge. It is, however, possible to
construct supply-type tests that will test higher-order thinking, as the following example shows:

EXAMPLE. Write an appropriate synonym for each of the following. Each blank corresponds to a letter:
Metamorphose: _ _ _ _ _ _
Flourish: _ _ _ _
ESSAYS
Essays, classified as non-objective tests, allow for the assessment of higher-order thinking skills. Such tests require students to organize their thoughts on a subject matter in coherent sentences in order to inform an audience. In an essay test, students are required to write one or more paragraphs on a specific topic.
Essay questions can be used to measure attainment
of a variety of objectives. Stecklein (1995) has listed 14 types of abilities that can be measured by essay items:
1. comparisons between two or more things
2. the development and defense of an opinion
3. questions of cause and effect
4. explanations of meanings
5. summarizing of information in a designated area
6. analysis
7. knowledge of relationships
8. illustrations of rules, principles, procedures,
and applications
9. applications of rules, laws, and principles to
new situations
10. criticisms of the adequacy, relevance, or
correctness of a concept, idea, or information
11. formulation of new questions and problems
12. reorganization of facts
13. discrimination between objects, concepts,
or events
14. inferential thinking
The following are rules of thumb which
facilitate the scoring of essays:
RULE 1. Phrase the direction in such a way
that students are guided on the key concepts
to be included.
EXAMPLE. Write an essay on the topic:
“Plant Photosynthesis” using the following
keywords and phrases: chlorophyll, sunlight,
water, carbon dioxide, oxygen, by-product,
stomata.
RULE 2. Inform the students on the criteria to be
used for grading their essays. This rule allows
the students to focus on relevant and
substantive materials rather than on peripheral
and unnecessary facts and bits of information.
EXAMPLE. Write an essay on the topic: “Plant
Photosynthesis” using the keywords indicated.
You will be graded according to the following
criteria: (a) coherence, (b) accuracy of
statements, (c) use of keywords, (d) clarity, and
(e) extra points for innovative presentation of
ideas.
RULE 3. Put a time limit on the essay test.
RULE 4. Decide on your essay grading system
prior to getting the essays of your students.
RULE 5. Evaluate all of the students’ answers to one question before proceeding to the next question.
RULE 6. Evaluate answers to essay questions
without knowing the identity of the writer.
RULE 7. Whenever possible, have two or
more persons grade each answer.
ITEM ANALYSIS
AND
VALIDATION
ITEM ANALYSIS
There are two important characteristics of an
item that will be of interest to the teacher.
These are: (a) item difficulty, and (b)
discrimination index.
The difficulty of an item or item difficulty is
defined as the number of students who are
able to answer the item correctly divided by
the total number of students.
ITEM DIFFICULTY = number of students with
correct answer/ total number of students
The following arbitrary rule is often used in
the literature:

RANGE OF DIFFICULTY INDEX   INTERPRETATION     ACTION
0 – 0.25                    difficult          revise or discard
0.26 – 0.75                 right difficulty   retain
0.76 and above              easy               revise or discard
The index of discrimination is the difference between the difficulty index of the upper group (DU) and that of the lower group (DL):

Index of discrimination = DU – DL
EXAMPLE. Obtain the index of
discrimination of an item if the upper 25% of
the class had a difficulty index of 0.60 (i.e.
60% of the upper 25% got the correct
answer) while the lower 25% of the class had
a difficulty index of 0.20.
Here, DU = 0.60 while DL = 0.20, thus index
of discrimination = .60 - .20 = .40
The following rule of thumb is used:

INDEX RANGE    INTERPRETATION                              ACTION
-1.0 – -0.50   can discriminate but item is questionable   discard
-0.55 – 0.45   non-discriminating                          revise
0.46 – 1.0     discriminating item                         include
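As a minimal sketch in Python, the two indices and the rule-of-thumb classifications above can be computed as follows; the class of 40 students is hypothetical, while the DU and DL values come from the worked example.

# Sketch: item difficulty and the upper-lower discrimination index,
# classified using the rule-of-thumb tables above.

def item_difficulty(n_correct: int, n_students: int) -> float:
    """Proportion of students who answered the item correctly."""
    return n_correct / n_students

def classify_difficulty(p: float) -> str:
    if p <= 0.25:
        return "difficult: revise or discard"
    elif p <= 0.75:
        return "right difficulty: retain"
    return "easy: revise or discard"

def discrimination_index(du: float, dl: float) -> float:
    """DU - DL: difficulty index of the upper group minus the lower group."""
    return du - dl

# Hypothetical class: 24 of 40 students answered the item correctly.
p = item_difficulty(24, 40)
print(p, "->", classify_difficulty(p))    # 0.6 -> right difficulty: retain

# Worked example from the text: DU = 0.60, DL = 0.20.
print(discrimination_index(0.60, 0.20))   # 0.40 -> discriminating item; include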
MORE SOPHISTICATED DISCRIMINATION INDEX
Item discrimination refers to the ability of an item
to differentiate among students on the basis of
how well they know the material being tested.
Various hand calculation procedures have
traditionally been used to compare item responses
to total test scores using high and low scoring
groups of students. Computerized analyses provide
more accurate assessment of the discrimination
power of items because they take into account
responses of all students rather than just high and
low scoring groups.
The item discrimination index provided by
ScorePak® is a Pearson Product Moment
correlation between student responses to a
particular item and total scores on all other
items on the test. This index is the equivalent of
a point-biserial coefficient in this application. It
provides an estimate of the degree to which an
individual item is measuring the same thing as
the rest of the items.
ScorePak® classifies item discrimination as “good” if the index is above .30; “fair” if it is between .10 and .30; and “poor” if it is below .10.
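ScorePak's internal computation is not reproduced in the text; the sketch below illustrates the same idea with numpy: correlate each student's score on one item (0/1) with the total of their scores on all other items, then apply the .10/.30 cut-offs. The response matrix is made-up data.

import numpy as np

# Sketch: item discrimination as a Pearson correlation between responses
# to one item (0/1) and the total score on all OTHER items (the
# point-biserial coefficient in this application).
# rows = students, columns = items; the data here are hypothetical.
responses = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
])

for item in range(responses.shape[1]):
    rest = responses.sum(axis=1) - responses[:, item]  # total on other items
    r = np.corrcoef(responses[:, item], rest)[0, 1]
    label = "good" if r > 0.30 else "fair" if r >= 0.10 else "poor"
    print(f"item {item + 1}: r = {r:+.2f} ({label})")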
At the end of the Item Analysis report, test
items are listed according to their degrees of
difficulty (easy, medium, hard) and
discrimination (good, fair, poor). These
distributions provide a quick overview of the
test, and can be used to identify items which
are not performing well and which can
perhaps be improved or discarded.
SUMMARY

The item-analysis procedure for norm-referenced tests provides the following information:
1. the difficulty of the item
2. the discriminating power of the item
3. the effectiveness of each alternative
Benefits derived from Item Analysis
1. it provides useful information for class
discussion of the test
2. it provides data which helps students
improve their learning
3. it provides insights and skills that lead to
the preparation of better tests in the future
INDEX OF DIFFICULTY

P = (Ru + RL) / T x 100

Where:
Ru – the number in the upper group who answered the item correctly
RL – the number in the lower group who answered the item correctly
T – the total number who tried the item

For the group as a whole, the index of difficulty may also be written as P = R / T x 100, where R is the number who answered the item correctly and T is the total number who tried the item. A value of P = 40% means that 40 percent of those who tried the item answered it correctly. The smaller the percentage figure, the more difficult the item.

INDEX OF ITEM DISCRIMINATING POWER

Estimate the item discriminating power using the formula:

D = (Ru – RL) / (T/2)

The discriminating power of an item is reported as a decimal fraction; maximum discriminating power is indicated by an index of 1.00. Maximum discrimination is usually found at the 50 percent level of difficulty.

Difficulty levels (P expressed as a proportion):
0.00 – 0.20 = very difficult
0.21 – 0.80 = moderately difficult
0.81 – 1.00 = very easy
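A short sketch of the two formulas above; the group counts are hypothetical, chosen so that the difficulty comes out to the 40% figure mentioned in the text.

# Sketch of the upper-lower formulas above.

def index_of_difficulty(ru: int, rl: int, t: int) -> float:
    """P = (Ru + RL) / T x 100, expressed as a percentage."""
    return (ru + rl) / t * 100

def discriminating_power(ru: int, rl: int, t: int) -> float:
    """D = (Ru - RL) / (T/2), a decimal fraction with maximum 1.00."""
    return (ru - rl) / (t / 2)

# Hypothetical item: 6 of the upper 10 and 2 of the lower 10 answered
# correctly (T = 20 students tried the item).
print(index_of_difficulty(6, 2, 20))   # 40.0 (%) -> moderately difficult
print(discriminating_power(6, 2, 20))  # 0.4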
VALIDATION
After performing the item analysis and
revising the items which need revision, the
next step is to validate the instrument. The
purpose of validation is to determine the
characteristics of the whole test itself,
namely, the validity and reliability of the test.
Validation is the process of collecting and
analyzing evidence to support the
meaningfulness and usefulness of the test.
VALIDITY
Validity is the extent to which a test measures what it purports to measure; it refers to the appropriateness, correctness, meaningfulness, and usefulness of the specific decisions a teacher makes based on the test results.
Criterion-related evidence of validity refers to
the relationship between scores obtained
using the instrument and scores obtained
using one or more other tests (often called the criterion).

Construct-related evidence of validity refers to the nature of the psychological construct or characteristic being measured by the test.
In order to obtain evidence of criterion-related validity,
the teacher usually compares scores on the test in
question with the scores on some other independent
criterion test which presumably already has high validity.
Example. If a test is designed to measure mathematics
ability of students and it correlates highly with a
standardized mathematics achievement test (external
criterion), then we say we have high criterion-related
evidence of validity.
In particular, this type of criterion-related validity is called concurrent validity. Another type of criterion-related validity is predictive validity, wherein the test scores on the instrument are correlated with scores on a later performance (criterion measure) of the students.
Example. The mathematics ability test constructed by the teacher may be correlated with the students’ later performance in a Division-wide mathematics achievement test.
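As a sketch, the correlation coefficient referred to here can be computed directly; the paired scores below are hypothetical.

import numpy as np

# Sketch: criterion-related evidence of validity as a correlation between
# a teacher-made test and an external criterion. Scores are hypothetical.
teacher_test = np.array([78, 85, 62, 90, 70, 88, 55, 95])
criterion    = np.array([75, 88, 65, 92, 68, 84, 60, 97])  # standardized test

r = np.corrcoef(teacher_test, criterion)[0, 1]
print(f"validity coefficient: r = {r:.2f}")  # a high r -> strong evidence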
Apart from the use of correlation coefficient in
measuring criterion-related validity, Gronlund
suggested using the so-called expectancy table.
This table is easy to construct and consists of the
test (predictor) categories listed on the left hand
side and the criterion categories listed horizontally
along the top of the chart.
Example. Suppose that a mathematics
achievement test is constructed and the scores are
categorized as high, average, and low. The criterion
measure used is the final average grades of the
students in high school: very good, good, and
needs improvement.
The two way table lists down the number of
students falling under each of the possible
pairs of (test, grade) as shown below:
GRADE POINT AVERAGE

Test Score   Very Good   Good   Needs Improvement
High         20          10     5
Average      10          25     5
Low          1           10     14
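A minimal sketch of how such a two-way tally can be built from raw records; the handful of (test category, grade category) pairs below is hypothetical, not the data in the table above.

from collections import Counter

# Sketch: tallying an expectancy table from one (test category,
# criterion category) pair per student. The records are hypothetical.
records = [
    ("High", "Very Good"), ("High", "Good"), ("Average", "Good"),
    ("Low", "Needs Improvement"), ("Average", "Very Good"),
    ("High", "Very Good"), ("Low", "Good"), ("Average", "Good"),
]

counts = Counter(records)
rows = ["High", "Average", "Low"]
cols = ["Very Good", "Good", "Needs Improvement"]

print(f"{'Test Score':<12}" + "".join(f"{c:>19}" for c in cols))
for row in rows:
    print(f"{row:<12}" + "".join(f"{counts[(row, c)]:>19}" for c in cols))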
RELIABILITY
Reliability refers to the consistency of the
scores obtained – how consistent they are for
each individual from one administration of an
instrument to another and from one set of
items to another.
Reliability and validity are related concepts: if an instrument is unreliable, it cannot yield valid outcomes. As reliability improves, validity may improve (or it may not).
The following table is a standard followed
almost universally in educational tests and
measurement.
RELIABILITY     INTERPRETATION
.90 and above   Excellent reliability; at the level of the best standardized tests
.80 – .90       Very good for a classroom test
.70 – .80       Good for a classroom test; in the range of most. There are probably a few items which could be improved.
.60 – .70       Somewhat low. This test needs to be supplemented by other measures (e.g., more tests) to determine grades. There are probably some items which could be improved.
.50 – .60       Suggests need for revision of test, unless it is quite short (ten or fewer items). The test definitely needs to be supplemented by other measures (e.g., more tests) for grading.
.50 or below    Questionable reliability. This test should not contribute heavily to the course grade, and it needs revision.
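The text does not say which reliability coefficient the table assumes. For classroom tests scored 0/1 per item, one common internal-consistency estimate is KR-20, sketched below on made-up data; the resulting value can be read against the table above.

import numpy as np

# Sketch: KR-20 reliability for dichotomously scored (0/1) items.
# rows = students, columns = items; the data are hypothetical.
responses = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
])

k = responses.shape[1]                         # number of items
p = responses.mean(axis=0)                     # proportion correct per item
q = 1 - p                                      # proportion incorrect per item
var_total = responses.sum(axis=1).var(ddof=1)  # variance of total scores

kr20 = (k / (k - 1)) * (1 - (p * q).sum() / var_total)
print(f"KR-20 = {kr20:.2f}")  # interpret using the reliability table above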
