
Running head: EF ASSESSMENT PROJECT PAPER

AL6730 Assessment
Moe Kitamura
Introduction
This paper presents and analyzes the assessment our group created for a beginner-level
class at Education First (EF) in Waikiki, Hawaii. In designing this assessment, our group aimed
to measure students' understanding and use of gerunds and the present progressive form, their
ability to identify job titles, and their reading comprehension. We also aimed to assess students'
writing ability through questions on concepts and ideas presented in short texts. The paper first
describes the project, including background information on the host class, the host institution,
and the group members' language-teaching experience. It then describes the assessment
instrument used, together with the objectives and specifications for the test. Results are presented
through calculations of Item Facility (I.F.) and Item Discrimination (I.D.), along with an
interpretation of those results. Finally, concerns and future inquiries about the assessment are
discussed, as are personal reflections on the assessment from each group member.
Project Description
Background Information

Host class
Name: EF International Language School, Honolulu
Class schedule: M, W, F: 8:30-11:20; T, Th: 2:30-5:20
Students' proficiency level: Basic/Elementary
Students' needs: Students must be able to read and understand the story, put the story in order,
explain the story to their classmates, understand important vocabulary, and match words to their
meanings (nouns, verbs, adjectives, etc.) (e.g., epic = amazing, impressively great)


Class objectives: Students will improve their language skills through Listening, Speaking,
Reading, Writing and Grammar activities.
Students will also complete a language project and read a novel based on the theme: music.
Overall description of teaching approach by host teacher:
The teacher explains each task carefully to make sure students understand it.
Overall description of assessment approach by host teacher:
The teacher assesses students based on their attitude and their exam results, scoring tests using
her own method.
Example: −1 (incorrect; Ss already learned it in class, so a point is deducted)
1+ (correct, but not yet taught in class, so no point is awarded)
1 (incorrect, but not yet taught in class, so no point is deducted)
Your overall observations of the class:
Small, very interactive class; the students are comfortable with one another.
Cory is fun and uses many different modes (iPad, whiteboard, book, etc.) to instruct and motivate
students.

Host Institution
Name: EF International Language School Honolulu
Location: 2255 Kuhio Ave Suite 1100 (Waikiki)
Overall goals:
Can understand sentences and frequently used expressions related to areas of most immediate
relevance (e.g. very basic personal and family information, shopping, local geography,
employment). Can communicate in simple and routine tasks requiring a simple and direct
exchange of information on familiar and routine matters. Can describe in simple terms aspects of
his/her background, immediate environment and matters in areas of immediate need.
Additional observations that you have about the institution:
The classes are small, include international students from all over the world, and are held in a
very modern and professional location.

Group members
Chelsea: Taught basic-level French for 2 years at the college level.


Youngjee: Taught English to adults at a private language school in Korea for 17 years.


Moe: Taught English at a public elementary school in Japan for 2 years.
Ramon: Taught English part-time at a conversation school in Japan for six years.
Language Assessment Instrument
This assessment was given to the host class on Wednesday, March 6, 2013, from 10:00
to 11:00 am. The test was scored both qualitatively and quantitatively, using both objective and
subjective scoring (refer to Appendix A and Appendix B).
This assessment is a mid-term test: an achievement test that aims to measure how well
individual students, and the class as a whole, are meeting the class objectives. The assessment
has three main sections, the first of which is divided into two parts. The first part of the first
section is multiple choice on gerunds, and the second part uses pictures to have students identify
job titles; we designed the test to reinforce this theme. Both parts of the first section use indirect
testing, as they are multiple choice and not communicative. For the second section, two short
texts were created along with short-answer questions. These items are semi-direct: they test
reading and writing comprehension while also assessing vocabulary and grammar. The third
section consists of a model email sent to the students with specific questions, which the students
must answer in a reply. This section is direct, as it assesses the students' ability to read and
comprehend the model email and respond to its questions accurately; the email format is already
provided to the students (refer to Appendix C). This section is communicative. All sections are
criterion-referenced, as scoring is based on the course objectives.
Objectives
Students will match words and phrases in gerund form to the most appropriate context.
Students will recognize job titles and match them with the appropriate picture.
Students will read short passages and answer comprehension questions using either the past tense
or the present progressive tense, depending on the context.
Students will construct six sentences in email form, giving suggestions to a future EF student
based on their own experiences in Hawaii.


Specification
1. Specifications of content:
a. Operations: Ability to read slowly and carefully to comprehend a text and to
express its meaning in writing.
b. Types of text: Authentic, short text.
c. Length of text: 103 words (first text) and 152 words (second text).
252 words total.
d. Addressees of text: Academic readers at the Basic level.
e. Topics: Duke the Hawaiian Surfer for the first text, phone conversation between
friends for the second text.
f. Readability (Flesch-Kincaid or grade level): Elementary
g. Structural range: Simple past tense, present continuous, present perfect, future
tense (will, be going to), modals (should, have to)
h. Vocabulary range: As found in the textbook, such as action verbs, nouns,
adjectives, and gerunds, covering daily activities, jobs, narration, phone
conversations, and email production.
i. Dialect and style: Standard American formal.
2. Structure, timing, medium, and techniques:
a. Test structure: 3 sections; all reading and writing.
b. Number of items: 15 multiple-choice, 6 short answer, 1 short composition; 22
reading and writing.
c. Number of passages: 2
d. Medium: Paper and pencil (no electronic devices allowed).
e. Testing techniques: Multiple-choice, short answer, composition writing.
3. Criterial level of performance:
To demonstrate mastery, students must reach 80 percent accuracy in each of the three
sections. The number of students reaching this level will be the number considered to
have succeeded in terms of the objectives of the course.


4. Scoring procedure:
Subjective items, such as the short answers and the short composition, will be negotiated
and decided by two testers. Each correct answer is worth 1 point and a partially correct
answer is worth 0.5 points; the total score out of 27 (the number of items) gives the final
percentage (for example, a raw score of 24 gives 24/27 ≈ 89%).
5. Sampling:
Texts will be created by the assessment production group using specific vocabulary and
grammar from the unit being tested in the course textbook.
Student Results based on 1st grading attempt
Our group's assessment for the EF students consists of multiple choice (with a word
bank), job-title recognition using pictures, comprehension-check and sentence-structure
questions for the reading passages, and composition writing (in email form). Overall, there were
27 items, which we analyze using I.F. and I.D. In the Comprehension Check and Composition
Writing sections, where items require subjective scoring, partial points were applied in the
scoring. Note that a partial score is not counted as a correct answer in the calculation of I.F. and
I.D. First, let us look at the I.F. chart below to see how difficult each item was for the students:
Item Facility (n = 15)

Item    Students who answered item correctly    I.F.
1       14                                      0.93
2       13                                      0.86
3       15                                      1.0
4       12                                      0.8
5       14                                      0.93
6       15                                      1.0
7       14                                      0.93
8       15                                      1.0
9       0                                       0
10      15                                      1.0
11      15                                      1.0
12      15                                      1.0
13      13                                      0.86
14      12                                      0.8
15      15                                      1.0
16      11                                      0.73
17      9                                       0.6
18      6                                       0.4
19      2                                       0.13
20      3                                       0.2
21      2                                       0.13
22      14                                      0.93
23      13                                      0.86
24      12                                      0.8
25      8                                       0.53
26      10                                      0.66
27      8                                       0.53

Looking at the chart above, several items have an I.F. of 1.0, meaning that all of the
students answered them correctly; for a criterion-referenced achievement test, this is the ideal
situation. Since only items 3, 6, 8, 10, 11, 12, and 15 had an I.F. of 1.0, the test was probably not
too easy; if many items had an I.F. of 1.0, some modification might be needed. In contrast, items
18, 19, 20, and 21, ranging from 0.13 to 0.2, were clearly difficult for the students. These
numbers show that the students have not yet fully understood the usage of the present progressive
form tested in the Comprehension Check of the reading section (specifically Section 2B). Based
on these findings, we suggest that the teacher re-teach or review the target grammar with the
students.
Items 25 and 27 can be perceived as marginally difficult, because their I.F. results fall in
the middle range. In criterion-referenced tests it is best to review items with an I.F. around 0.5.
Items 17 and 26 fall in an acceptable range, while the rest of the items (1, 2, 4, 5, 7, 13, 14, 16,
22, 23, and 24), with I.F. ratings of 0.73 to 0.93, are considered high, so some modification may
be needed.
Finally, the most problematic item in the I.F. results is item 9, which no student
answered correctly. Looking back at Section 1A, all of the students chose the option "studying"
in place of "doing homework." We conclude that this is a case where the item has two potential
keys: "studying" is an additional possibility within the context of the item's text.
Using Flanagan's method, the students' I.D. chart is illustrated below. Note that because
we are dealing with a small number of students, we used the top 25% and the bottom 25% of the
students to identify the high and low scorers; the top 4 and bottom 4 scorers were therefore used
in the formula.


Item Discrimination (n = 15)

Item    High scorers (top 4) correct    Low scorers (bottom 4) correct    I.D.
1       4                               3                                 0.26
2       4                               3                                 0.26
3       4                               4                                 0
4       4                               2                                 0.53
5       4                               4                                 0
6       4                               4                                 0
7       4                               3                                 0.26
8       4                               4                                 0
9       0                               0                                 0
10      4                               4                                 0
11      4                               4                                 0
12      4                               4                                 0
13      3                               3                                 0
14      3                               2                                 0.26
15      4                               4                                 0
16      4                               0                                 1.0
17      3                               3                                 0
18      3                               1                                 0.53
19      2                               0                                 0.53
20      2                               1                                 0.26
21      2                               0                                 0.53
22      3                               4                                 -0.26
23      3                               3                                 0
24      4                               3                                 0.26
25      4                               0                                 1.06
26      4                               1                                 0.8
27      4                               0                                 1.06

In relation to the results shown in the I.F. analysis, there is no discrimination for items 3,
6, 8, 10, 11, 12, and 15 because all of the students answered them correctly. For item 9, no
student answered correctly, so there is no variance between high and low scorers and therefore
no discrimination. For items 5, 13, 17, and 23, no discrimination is shown because the high
scorers and the low scorers performed at the same level.
Items 16, 25, and 27, with I.D. values of 1.0 and 1.06, show essentially perfect
discrimination between the high and low scorers. Item 26, with an I.D. of 0.8, also discriminates
well. On the other hand, items 1, 2, 7, 14, 20, and 24, with an I.D. of 0.26, show low
discrimination between the high and low scorers and are considered outside of the acceptable
range (Oller, 1979). Items 4, 18, 19, and 21, with an I.D. of 0.53, show acceptable item
discrimination.
Item 22, with an I.D. of -0.26, shows an interesting pattern: the low-scoring students
actually performed better on it than the high-scoring students. This item is in Section 3
(Composition Writing) of the test, where one of the high-scoring students received only partial
points in the subjective scoring. Overall, the I.F. reveals that the majority of the items in the test
were rather easy and that the students performed well. However, there is clear evidence that the
Comprehension Check items in Section 2A (focused on the past tense) and especially Section 2B
(focused on the present progressive form) were difficult for the students. Based on this item
analysis, we recommend that the teacher review these particular grammar points further. As for
the I.D., the analysis shows that many items had zero discrimination, either because all the
students answered them correctly or because the high and low scorers performed at the same
level. One item showed negative discrimination as a result of partial grading.
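As a reference for how figures like those above can be produced, the sketch below shows
one way to compute I.F. and a Flanagan-style I.D. in Python. This is an illustration only, not the
scoring script we used: the student data is invented, and the sketch divides the high/low
difference by the group size of 4, whereas our own calculation divided by 3.75 (25% of 15),
which is why values such as 1.06 appear in the table above.

    # Illustrative sketch only: invented data, not the actual EF test scores.
    def item_facility(item_scores):
        """I.F. = proportion of students who answered the item correctly."""
        return sum(item_scores) / len(item_scores)

    def item_discrimination(all_scores, item, group_size=4):
        """Flanagan-style I.D. on one item: rank students by total score,
        then compare the top and bottom groups. Dividing by group_size is
        one common convention; the tables above divide by 3.75 instead."""
        ranked = sorted(all_scores, key=sum, reverse=True)
        high = ranked[:group_size]
        low = ranked[-group_size:]
        return (sum(s[item] for s in high) - sum(s[item] for s in low)) / group_size

    # 15 invented students, 2 items each (1 = correct, 0 = incorrect).
    students = [[1, 1]] * 10 + [[1, 0]] * 4 + [[0, 0]]
    print(item_facility([s[0] for s in students]))   # 14/15 = 0.93
    print(item_discrimination(students, item=1))     # (4 - 0) / 4 = 1.0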
Student Results based on revised grading scheme
After making our final calculations of both I.F. and I.D., we received feedback from our
professor showing how our scoring scheme made the test grading unreliable. The scoring was
therefore reweighted, making the second section worth a total of 12 points instead of 6 and the
third section worth 18 points instead of 6. A new grading rubric was made (refer to Appendix D)
for all subjectively scored sections, and the new scores were used to recalculate the I.F. and I.D.,
making them more reliable. In the first section no grading changed, but we did find an
inconsistency between the actual answers and our original answer key, which made the objective
scoring of one item incorrect. The entire test was therefore regraded by one grader and then
reviewed by a second and third grader. Based on Oller's (1979) criteria, we analyze the I.F. table
illustrated below:
Item Facility (n = 15)

Item    # of Correct    I.F.
1       14              0.93
2       15              1.0
3       15              1.0
4       12              0.8
5       14              0.93
6       15              1.0
7       15              1.0
8       15              1.0
9       15              1.0
10      15              1.0
11      15              1.0
12      15              1.0
13      13              0.86
14      12              0.8
15      15              1.0
16      12              0.8
17      4               0.26
18      8               0.53
19      3               0.2
20      5               0.33
21      6               0.4
22      15              1.0
23      13              0.86
24      13              0.86
25      9               0.6
26      10              0.66
27      8               0.53

Looking at the table, several items have an I.F. of 1.0, meaning that all the students
answered them correctly. Items 2, 3, 6, 7, 8, 9, 10, 11, 12, 15, and 22 turned out to be too easy
for the students, so modification is strongly suggested. In contrast, items 17 and 19, ranging from
0.2 to 0.26, were clearly difficult for the students. Alongside item 19, it is noticeable that items
20 and 21 (ranging from 0.33 to 0.4) are also in the difficult range. These three items (19, 20, and
21) make up Section 2B of the test material, and the I.F. indicates that overall the students had
the most trouble with the Comprehension Check portion that tests the present progressive form.
This means that the present progressive should be reviewed in class.
Items 18 and 27 sit on the marginal line, in the middle range of difficulty. In this
criterion-referenced test it is best to review items with an I.F. around 0.5. Items 25 and 26 fall in
an acceptable range, whereas the rest of the items (1, 4, 5, 13, 14, 16, 23, and 24) received I.F.
ratings of 0.8 to 0.93 and are considered high. These items might need modification because
they are too easy for the students; viewed differently, however, they may simply show that most
of the students understood the questions and achieved the objectives. For these items, we would
need additional tests to determine whether the students have truly achieved the class objectives.
Using Flanagan's method, the students' I.D. chart is illustrated below. Note that because
we are assessing a small group of students, we used the top 25% and bottom 25% of scorers in
the class; within this class, 4 high scorers and 4 low scorers were therefore used in the formula.
It is also important to note that for the items in Section 2 (specifically items 16-21) and Section
3, partial grading applies: students who received a score of 1.5-2 on a Section 2 item or 2.5-3 on
a Section 3 item are regarded as correct scorers.

Item Discrimination (n = 15)

Item    I.D.
1       0.25
2       0
3       0
4       0.5
5       0
6       0
7       0.25
8       0
9       0
10      0
11      0
12      0
13      0
14      0.25
15      0
16      0.5
17      0.5
18      0.5
19      0.5
20      0.25
21      0.25
22      0
23      0
24      0.25
25      0.75
26      0.75
27      1.0

In relation to the results shown in the I.F. analysis, there is no discrimination for items 2,
3, 6, 8, 9, 10, 11, 12, 15, and 22 because all of the students answered them correctly. For items 5,
13, and 23, no discrimination is shown because the high scorers and the low scorers performed at
the same level.
Item 27 shows perfect discrimination between high- and low-scoring students. Items 4,
16, 17, 18, 19, 25, and 26 show good discrimination between high and low scorers, while items
1, 7, 14, 20, 21, and 24, with an I.D. of 0.25, sit at the lowest acceptable value according to Oller
(1979).
Compared to the first grading, the revised results discriminate more clearly between the
high-scoring and the low-scoring students, because more weight was placed on the writing items.
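To make the recoding step described above concrete, here is a small sketch of how a
partially credited score could be mapped to correct/incorrect before the I.D. calculation. The
thresholds are the ones stated earlier (1.5-2 points on Section 2 items, 2.5-3 points on Section 3
items); the exact item ranges and the example scores are illustrative assumptions, not our grade
book.

    # Sketch of the partial-credit recoding described above; thresholds are
    # the ones stated in the paper, the example scores are invented.
    def recode_as_correct(item_number, raw_score):
        """Return 1 if a partially credited score counts as 'correct'."""
        if 16 <= item_number <= 21:        # Section 2 items, worth 2 points
            return 1 if raw_score >= 1.5 else 0
        if 22 <= item_number <= 27:        # Section 3 items, worth 3 points
            return 1 if raw_score >= 2.5 else 0
        return 1 if raw_score >= 1 else 0  # Section 1 items, scored 0/1

    print(recode_as_correct(17, 1.5))  # 1: meets the Section 2 threshold
    print(recode_as_correct(24, 2.0))  # 0: below the Section 3 threshold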
Reflection and Discussion
Reliability of the Test (EF School)
Before I took this assessment class, I had some experience making tests for elementary
students, but I did not think much about how I made those tests (reliability, authenticity, validity,
etc.) or how I should assess students. This time, our group made a test for adults who come from
different countries, so I was really nervous and could not imagine how to make a good test for
them. I cooperated with my group members to make the test, and we finally administered it to
the students. I thought our test went quite well, but when we tried to score the students' work,
there were some problems. In this paper, I will focus especially on the reliability of our test.
Reliability
Bachman and Palmer (1996) state that six elements, namely reliability, construct
validity, authenticity, interactiveness, impact, and practicality, indicate the quality of a test, and
that reliability and validity are especially essential measurement qualities. McNamara (2000)
explains that reliability is the consistency of a test's measurement of individuals. In other words,
if a test has high reliability, students will receive the same score whenever they take the same
test. By these researchers' definitions, the reliability of our test depends on how we made the
test and how we administered it to the students.
According to McNamara (2000), there are several ways to check reliability. Many
teachers are probably familiar with the terms test/retest reliability, equivalent-forms reliability,
split-half reliability, and rational-equivalence reliability. Each of these terms refers to a
statistical method used to establish the consistency of student performances within a given test
or across more than one test. However, we could not have the students take our test again in this
project, so it was not possible to check reliability from repeated scores. We could have tried the
split-half method on our test, but we had only 27 items, which is not really enough for that
calculation; we would like to apply this method when we have the chance to evaluate students in
the future.
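For such a future attempt, a split-half check could look something like the sketch below:
split the items into two halves, correlate students' half scores, and apply the Spearman-Brown
correction. This is a hedged illustration with randomly generated scores, not data from our test.

    # Hedged illustration of the split-half method; the score matrix is random.
    import random
    from statistics import correlation  # available in Python 3.10+

    def split_half_reliability(score_matrix):
        """Correlate odd-item vs. even-item half totals, then apply the
        Spearman-Brown correction: r_full = 2r / (1 + r)."""
        odd = [sum(row[0::2]) for row in score_matrix]
        even = [sum(row[1::2]) for row in score_matrix]
        r = correlation(odd, even)
        return (2 * r) / (1 + r)

    random.seed(0)
    scores = [[random.randint(0, 1) for _ in range(27)] for _ in range(15)]
    print(round(split_half_reliability(scores), 2))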
Scorer reliability
Moskal (2000) notes that the two forms of reliability typically considered in classroom
assessment and in rubric development involve rater (or scorer) reliability. Rater reliability
generally refers to the consistency of scores assigned by two independent raters, and of scores
assigned by the same rater at different points in time; the former is referred to as inter-rater
reliability, while the latter is intra-rater reliability. In our case, two scorers assessed the students'
work together, after which another scorer graded it and compared the scores. Negotiating the
final score was the most difficult part, because each of us has different skills, techniques,
teaching experience, and methods. Judging from our experience and our individual scores, our
inter-rater reliability seems good. Our intra-rater reliability may also be acceptable, because we
spent around three hours on the scoring; however, if we were to score again at a different time,
the scores might change.
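A quick numerical check of inter-rater reliability could be done along the lines of the
sketch below, comparing two raters' scores for the same set of responses by exact agreement and
by correlation. The scores shown are invented for illustration; they are not our actual grades.

    # Invented example scores from two raters for six subjective items.
    from statistics import correlation

    rater_a = [3.0, 2.5, 1.5, 3.0, 2.0, 2.5]
    rater_b = [3.0, 2.0, 1.5, 2.5, 2.0, 3.0]

    # Exact-agreement rate: the share of items given identical scores.
    agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

    # Pearson correlation: do the raters rank the responses similarly?
    print(round(agreement, 2), round(correlation(rater_a, rater_b), 2))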


Rubric Scoring
Even if the scorers have good or acceptable reliability, it means little if they do not have
a good, reliable scoring rubric. In our case, this was the biggest problem among the scorers when
deciding scores for the subjectively scored items, because we had not clearly decided the criteria
(how we should award points). Moreover, we had not decided exactly when to give partial
points; even though we made a scoring rubric, we did not build it around our test objectives.
Our test also had one big problem with score reliability: we did not show the point value
of each section, so students could not see which sections and parts were more important than
others. Subjectively scored items usually carry more points than objective items, but in our test
we gave every item the same weight, whether subjective or objective. For these reasons, the
reliability of our rubric scoring was low.
Section 1
From this point, I will analyze the reliability of each section. For Section 1:
*We made this section multiple choice, so we limited the candidates' freedom of choice.
*We provided an example to show students how they were supposed to answer, so we avoided
ambiguous instructions and familiarized candidates with the format and testing techniques.
*This section uses objectively scored items, because it is multiple choice.
*For this section, we did not decide on the degree of acceptable variation for some items. For
example, in Section 1B, one picture looks like either a photographer or a tourist.
Overall, Section 1 seems to have acceptable reliability.
Section 2
*There are only a few items, as each part has just three questions (and still, many students could
not answer them correctly).
*This section limits the candidates' freedom of choice somewhat, because candidates can guess
the answer key from the question.
*I thought we used clear, explicit instructions and tried to avoid ambiguity, but most of the
students misunderstood or answered incorrectly, so there may have been problems with our
explanation even though we tried to present the test clearly. We should have included an
example sentence to help them understand the aim of both sections.
*We only wanted short answers, so we did not give much space to write; the students understood
this aim, so we can say this section was at least partially clearly presented.
*We tried to familiarize candidates with the format and content when making the reading
sections: Section 2A is about Duke Kahanamoku, a famous surfer in Hawaii, whom candidates
might have read about in the newspaper or seen on TV, and Section 2B presents a conversation
of the kind candidates are likely to experience or overhear in the future.
For Section 2, our test's reliability is low, because many students were confused and we
did not show an example of how candidates were supposed to answer.
Section 3
*This section has an acceptable number of items: six, which is neither too many nor too few for
Elementary-level students.
*We made the task familiar to candidates in format and testing technique, because they will face
a real situation of writing to somebody who has not yet visited Hawaii.
*We tried to avoid ambiguous instructions and to use clear, explicit ones, but we realized that
this section allows many possible answers and does not limit the candidates' freedom of choice.
Many students answered "you should go..." even though they were supposed to answer "you
should do something," so they confused the uses of "go" and "do."
For Section 3, reliability is only partly acceptable, because some students were not sure
how they should answer the question, which means our explanation was not clear. If I remake
the test, I will add other wh-questions to give the task more reliability.
As for the classroom environment, it can be said to lower reliability for several reasons.
The classroom is small, and fifteen students have to sit very close together without enough space
to take a test. Some students might cheat, so we sometimes needed to watch them and walk
around to prevent it. A few of them asked each other questions, so we had to remind them to ask
us instead.
Future Inquiries
In the future, I would like to research how to make oral and listening tests with high
reliability, because we did not have a chance to make one in this project. I also realized how
important rubric scoring is for increasing reliability. I would also like to learn how to make a
rubric that evaluates oral and listening skills.
References
Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation.
Cambridge: Cambridge University Press.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing
useful language tests. Oxford: Oxford University Press.
McNamara, T. F. (2000). Language testing. Oxford: Oxford University Press.
Moskal, B. M. (2000). Scoring rubrics: What, when and how? Practical Assessment,
Research & Evaluation, 7(3).
Oller, J. W. (1979). Language tests at school. London: Longman.
