Problem Statement
In a typical operational scoring day for the TOEFL iBT Speaking Test,
raters score for up to 8 hours, continuously and repeatedly listening to
audio responses and assigning scores (Xi and Mollaun, 2009).
Anecdotal evidence indicates that raters feel tired toward the end of
a shift and that their rating quality decreases. This raises the question
of whether rating performance could be affected by fatigue
related to the total scoring time of a shift (shift length) and the time
spent continuously on a task (session length). However, little
attention seems to have been paid to this issue, either in the testing
literature in general or in language testing in particular.
Purpose/Objectives of the Study
The goal of this study is to examine the effects of fatigue on human
raters of audio responses by comparing rating accuracy and
consistency at various time points throughout a scoring day, under shift
conditions that differ in shift length and session length.
Literature Review and Conceptual Framework of the Study
The literature review covers:
i. The definition of fatigue
ii. The factors that cause fatigue and their effects on human work
performance
iii. The effects of fatigue on the judgments of human raters
iv. The influence of fatigue in language assessment
No information about a conceptual framework is included in the
journal article.
Methodology
- Including research design, sampling, instrumentation and data
analysis methods
**explain in detail the instruments/research tools used
Design
This is a quantitative study that uses an experimental method. The 72
raters were divided into four different groups. Each group was an
experimental group that was given a different rating time pattern.
Sampling
The researchers do not state the specific sampling technique used.
A total of 72 TOEFL iBT Speaking Test raters were selected as the
study sample.
Instruments
1) Responses from TOEFL iBT speaking tasks.
Four tasks of the TOEFL iBT Speaking Test were chosen as the instrument.
They were chosen to represent a range of task types (independent or
integrated) and topics (familiar topics, campus life topics or
academic content-related topics) while keeping the experiment
logistically manageable. A total of 13,596 responses to the four
tasks were scored in the experiment, including 5,446 validity
responses. The score levels of the validity responses were identified,
reviewed and confirmed by a panel of expert raters prior to the
experiment. Of the responses scored by each rater, 40% were
validity responses, which were evenly distributed across each hour's
scoring throughout the experiment.
2) Survey
The survey was developed to obtain raters' perceptions of the shift
schedule, time-related fatigue and scoring behaviours. It includes 8
items about mental and physical fatigue, 4 items about behavioural
indicators of fatigue and 2 general open-ended questions pertaining
to raters' background and shift preference. The survey was delivered
through an online system, which automatically captured and stored
raters' response data in a database.
2. Rating Quality
Rating accuracy and consistency were measured by comparing the
ratings assigned in this study with the expert scores for the validity
responses, based on three descriptive statistics:
a) The proportions of exact and adjacent (one score band
difference) agreement (Pexact and Padj)
b) Cohen's (1960) kappa coefficient
c) The root-mean-square deviation (RMSD) between the ratings
assigned in this study and the expert scores.
3. Rating Consistency
Rating consistency was measured by comparing rating accuracy
across the 8 time units.
4. Survey Data
The survey data were summarized using descriptive statistics and
were compared among the shift conditions.
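The rating-quality statistics listed above (Pexact, Padj, Cohen's kappa
and RMSD) and their comparison across time units can be illustrated with
a minimal sketch. This is not the authors' analysis code. The function
names, the record layout and the sample scores are hypothetical, and Padj
is assumed here to count only ratings that are exactly one score band
away from the expert score.

```python
from collections import Counter, defaultdict
from math import sqrt


def agreement_stats(ratings, expert):
    """Compare one rater's scores with expert scores on the same responses."""
    if len(ratings) != len(expert) or not ratings:
        raise ValueError("need two equal-length, non-empty score lists")
    n = len(ratings)
    diffs = [r - e for r, e in zip(ratings, expert)]

    # Proportion of identical scores and of scores one band apart (assumption).
    p_exact = sum(d == 0 for d in diffs) / n
    p_adj = sum(abs(d) == 1 for d in diffs) / n

    # Cohen's kappa: (observed - chance agreement) / (1 - chance agreement).
    rater_counts = Counter(ratings)
    expert_counts = Counter(expert)
    p_chance = sum(rater_counts[c] * expert_counts[c] for c in rater_counts) / n**2
    kappa = (p_exact - p_chance) / (1 - p_chance) if p_chance < 1 else float("nan")

    # Root-mean-square deviation between rater and expert scores.
    rmsd = sqrt(sum(d * d for d in diffs) / n)
    return {"p_exact": p_exact, "p_adj": p_adj, "kappa": kappa, "rmsd": rmsd}


def stats_by_time_unit(records):
    """records: iterable of (time_unit, rater_score, expert_score) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for unit, rater_score, expert_score in records:
        grouped[unit][0].append(rater_score)
        grouped[unit][1].append(expert_score)
    return {unit: agreement_stats(*grouped[unit]) for unit in sorted(grouped)}


# Hypothetical scores on a 0-4 scale, split into two illustrative time units.
records = [
    (1, 3, 3), (1, 2, 3), (1, 4, 4), (1, 3, 2),
    (2, 1, 1), (2, 2, 2), (2, 3, 4), (2, 4, 4),
]
for unit, stats in stats_by_time_unit(records).items():
    print(unit, {name: round(value, 3) for name, value in stats.items()})
```

Comparing the per-unit values side by side is one simple way to see
whether accuracy drifts over a shift; the study itself used 8 time units
rather than the two shown here.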
Implications
1. The study suggests that, if operational conditions permit, routine
evaluation of rating quality across time might be conducted for all
raters or for a random group of them. Such information could help scoring
leaders determine the typical time when a particular rater's rating
quality decreases and how severe the decrease becomes, in an attempt to
avoid or reduce periods when relatively low-quality scoring might be
happening.
2. Similarities found between the rating accuracy curve for the
6-hour condition with 2-hour intervals and the typical work curve seen
in the human performance and engineering psychology literature confirm
that human raters, like human beings working in other areas, share a
similar pattern that is related to cumulative time working on the
same or similar tasks.
Suggestions for Further Research
Future studies may benefit from taking into account the differences
among tasks and raters, and the interactions between them, while
applying more complex statistical models.
Replications using other tests with speaking or essay responses may
also help confirm and generalize the current findings.
Comments/Analysis
● This is a complex experimental study.
● The problem statement focuses on a specific test administration
issue.
● The literature review covers topics that are relevant to the
study.
● There is no control group; instead, four comparison groups were
created, each receiving a different treatment.
● The research objectives are not stated, and the research questions
that are included are rather unclear.
● The sampling technique is not explained. However, the other
elements are clearly described.
● It is not stated whether the survey instrument was pilot-tested.
There is also no information about the validity and reliability
values of this instrument.
● The data analysis methods are comprehensive.
● The findings are presented with clear reference to the research
questions.
● The implications of the study do not include any recommendation
for an ideal work schedule for rating a language test.