
Journal 1

Authors : Guangming Ling, Pamela Mollaun and Xiaoming Xi (2014)


Journal Information
Title : A Study on the Impact of Fatigue on Human Raters When
Scoring Speaking Responses

Journal : Language Testing 2014 31:479


(http://ltj.sagepub.com/content/31/4/479)

Problem Statement
In a typical operational scoring day for the TOEFL iBT Speaking Test,
raters score for up to 8 hours continuously and repeatedly, listening
to audio responses and assigning scores (Xi and Mollaun, 2009).
Anecdotal evidence indicates that raters feel tired toward the end of
a shift, with their rating quality decreasing. This leads to questions
about whether rating performance could be affected by fatigue
related to the total scoring time of a shift (shift length) and the time
spent continuously on a task (session length). However, little
attention seems to have been paid to this in the testing literature in
general or in language testing in particular.

Purpose/Objectives of the Study
The goal of this study is to examine the effects of fatigue on human
raters of audio responses by comparing rating accuracy and
consistency at various time points throughout a scoring day, under
shift conditions that differ by shift length and session length.

Literature Review & Conceptual Framework
The literature review covers:
i. Definition of fatigue
ii. Factors that cause fatigue and their effects on human work
performance
iii. Effects of fatigue on the judgments of human raters
iv. The influence of fatigue in language assessment
No information on a conceptual framework is included in the
article.

Methodology
(Including research design, sampling, instrumentation and data
analysis methods. **Explain in detail the instrument/research tool
used.)

Research Design
This is a quantitative study that uses an experimental method. The
72 raters were divided into four different groups. Every group was an
experimental group, and each was given a different rating time
pattern.

Sampling
The researchers do not specifically state the sampling technique
used. A total of 72 TOEFL iBT Speaking Test raters were selected as
the study sample.

Instruments
1) Responses from TOEFL iBT speaking tasks.
Four tasks of the TOEFL iBT Speaking Test were chosen as the
instrument. They were chosen to represent a range of task types
(independent or integrated) and topics (familiar topics, campus
life topics or academic content-related topics) while keeping the
experiment logistically manageable. A total of 13,596 responses on
the four tasks were scored in the experiment, including 5,446 validity
responses. The score levels of the validity responses were identified,
reviewed and confirmed by a panel of expert raters prior to the
experiment. 40% of the responses scored by each rater were
validity responses, which were evenly distributed across each hour’s
scoring throughout the experiment.

2) Survey
The survey was developed to obtain raters’ perceptions of the shift
schedule, time-related fatigue and scoring behaviours. It includes 8
items about mental and physical fatigue, 4 items about behavioural
indicators of fatigue and 2 general open-ended questions pertaining
to raters’ background and shift preference. The survey was delivered
through an online system, which automatically captured and stored
raters’ response data in a database.

Data Analysis Methods

1. Rating Productivity
Rating productivity within a unit of time (e.g. 1 hour) was quantified
by the number of ratings assigned by raters of a particular shift within
an hour, which can be treated as a direct measure of rating output.

2. Rating Quality
Rating accuracy and consistency were measured by comparing the
ratings assigned in this study with the expert scores for the validity
responses, based on three descriptive statistics:
a) The proportions of exact and adjacent (one score band
difference) agreement (Pexact and Padj)
b) Cohen’s (1960) Kappa Coefficient
c) Root-Mean-Square-Deviation (RMSD) between the ratings
assigned in this study and the expert scores.
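
The three statistics listed above can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `agreement_stats`, the score lists and the band scale are made up for illustration, the kappa is assumed to be the unweighted form, and adjacent agreement is assumed to mean a difference of exactly one score band.

```python
from collections import Counter
from math import sqrt


def agreement_stats(rater, expert):
    """Return exact agreement, adjacent agreement, Cohen's kappa and RMSD.

    `rater` and `expert` are equal-length lists of integer band scores for
    the same validity responses (hypothetical input format).
    """
    if len(rater) != len(expert) or not rater:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(rater)
    diffs = [r - e for r, e in zip(rater, expert)]

    # Proportion of ratings identical to the expert score.
    p_exact = sum(d == 0 for d in diffs) / n
    # Proportion of ratings exactly one band away (assumed definition).
    p_adj = sum(abs(d) == 1 for d in diffs) / n

    # Agreement expected by chance, from the two marginal distributions.
    rater_freq = Counter(rater)
    expert_freq = Counter(expert)
    p_chance = sum(rater_freq[k] * expert_freq[k] for k in rater_freq) / (n * n)
    # Unweighted Cohen's kappa; undefined when chance agreement is 1.
    kappa = (p_exact - p_chance) / (1 - p_chance) if p_chance < 1 else float("nan")

    # Root-mean-square deviation between rater and expert scores.
    rmsd = sqrt(sum(d * d for d in diffs) / n)
    return {"p_exact": p_exact, "p_adj": p_adj, "kappa": kappa, "rmsd": rmsd}


# Made-up scores for eight validity responses, for illustration only.
rater_scores = [3, 2, 4, 3, 1, 2, 3, 4]
expert_scores = [3, 2, 3, 3, 2, 2, 4, 4]
stats = agreement_stats(rater_scores, expert_scores)
print(stats)
```

Higher Pexact and kappa and a lower RMSD would indicate ratings closer to the expert scores.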

3. Rating Consistency
Rating consistency was measured by comparing rating accuracy
across the 8 time units.
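
One way to picture this comparison across time units is sketched below in Python. The record layout `(hour, rater_score, expert_score)`, the function name and the data are hypothetical, and using the max-min spread as a consistency indicator is an assumption of this sketch, not a measure reported in the article.

```python
from collections import defaultdict


def exact_agreement_by_hour(records):
    """Map each time unit (hour) to its proportion of exact agreement."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for hour, rater_score, expert_score in records:
        totals[hour] += 1
        hits[hour] += int(rater_score == expert_score)
    return {hour: hits[hour] / totals[hour] for hour in sorted(totals)}


# Made-up validity-response records: (hour, rater score, expert score).
records = [
    (1, 3, 3), (1, 2, 2), (1, 4, 3), (1, 3, 3),
    (2, 2, 2), (2, 3, 3), (2, 1, 1), (2, 4, 4),
    (3, 3, 2), (3, 2, 2), (3, 4, 3), (3, 1, 1),
]
by_hour = exact_agreement_by_hour(records)
# A smaller spread across hours suggests more consistent accuracy over time.
spread = max(by_hour.values()) - min(by_hour.values())
print(by_hour, spread)
```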

4. Survey Data
The survey data were summarized using descriptive statistics and
were compared among the shift conditions.

The general linear model (GLM), including analysis of variance
(ANOVA) and univariate GLM models, was applied to evaluate
whether the factors manipulated in this study, including shift length,
session length, time unit and their interactions, would have any effect
on scoring productivity or scoring accuracy as measured by the above
statistics.
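
As a simplified illustration of the ANOVA idea, the Python sketch below computes a one-way F statistic for a single factor. It is only a sketch: the study's models included several factors and their interactions, which this example does not cover, and the function name, group labels and hourly rating counts are invented.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of observations per condition.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more observations than groups")

    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up hourly rating counts for three hypothetical shift conditions.
shift_a = [41, 42, 43]
shift_b = [42, 43, 44]
shift_c = [45, 46, 47]
f_stat, df_b, df_w = one_way_anova_f([shift_a, shift_b, shift_c])
print(f_stat, df_b, df_w)
```

A large F relative to the F distribution with (df_between, df_within) degrees of freedom would suggest that the conditions differ on the measured outcome.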

Results/Findings of the Study
Overall Productivity
The average number of ratings per rater per hour was highest for
Shift GH (6 hours with 3-hour intervals).

Overall Rating Accuracy
72.1% of the validity responses were in exact agreement with the
expert score.

Accuracy of Ratings Within Each Time Unit
Shift EF (6 hours with 2-hour intervals) had a high level of rating
accuracy in each of the 6 hours.

End-of-Shift Survey Results
a) More than half of the raters recognized that they felt a relatively
lower level of scoring confidence in the last task session of
the day.
b) Fewer raters reported feeling tired during the 6-hour shifts
than in the 8-hour shifts.
c) More than half of the raters reported that they felt more tired in
the afternoon than in the morning, regardless of the shift or
session length.
d) 76% of all raters recognized that they struggled most to
concentrate on the scoring during the last task session of the
day.
e) Re-listening to a response is an indicator of a low level of
concentration during the scoring and possibly threatens the
accuracy of scoring.
f) 45% of raters believed they would be more focused on
scoring if given more or longer breaks.
g) The 8-hour shifts had more raters who reported tiredness
than the 6-hour shifts.
h) More raters working in a shift with longer sessions reported
tiredness toward the end of a shift than those in a shift with
2-hour sessions.

Discussion/Implications of the Study
Discussion
1. The 6-hour shifts had higher rating accuracy, greater hourly
productivity and greater rating consistency across time than the
8-hour shifts.
2. The shifts with 2-hour intervals had greater hourly productivity,
higher rating accuracy and greater rating consistency across time
units than the shifts with 4-hour sessions.
3. The shift with two 3-hour intervals had the highest hourly
productivity, with its rating accuracy and consistency comparable to
those of the shift with four 2-hour intervals.
4. More than half of all raters perceived more fatigue in the
afternoon than in the morning. More raters reported fatigue during
scoring in the 8-hour shift than in the 6-hour shift.

Implications
1. The study suggests that, if operational conditions permit, routine
evaluation of rating quality across time for all or a random group of
raters might be conducted. Such information could help scoring
leaders determine the typical time when a particular rater’s rating
quality decreases and how severe the decrease becomes, in an
attempt to avoid or reduce times when relatively low-quality scoring
might be happening.
2. Similarities found between the rating accuracy curve for the
6-hour shift with 2-hour intervals and the typical work curve seen in
the human performance and engineering psychology literature
confirm that human raters, like human beings working in other areas,
share a similar pattern that is related to cumulative time working on
the same or similar tasks.

Suggestions for Further Research
Future studies may benefit from taking into account the differences
among tasks, raters and their interactions while applying more
complex statistical models.
Replications using other tests with speaking or essay responses may
also help confirm and generalize the current findings.

Comments/Analysis
● This is a complex experimental study.
● The problem statement is written with a focus on a specific test
administration issue.
● The literature review covers topics relevant to the study.
● There is no control group; instead, four comparison groups were
created, each with a different treatment.
● The research objectives are not written. The research questions
that are included are also rather unclear.
● The sampling technique is not explained. However, clear
explanations are given for the other elements.
● It is not stated whether the survey instrument was pilot-tested.
There is also no information on the validity and reliability values of
this instrument.
● The data analysis methods are comprehensive.
● The findings are presented with clear reference to the research
questions.
● The implications of the study do not propose any ideal work
schedule for the scoring of a language test.
