
Research Report
Reliability of Scores on the Stroke Rehabilitation Assessment of Movement (STREAM) Measure
Background and Purpose. The Stroke Rehabilitation Assessment of Movement (STREAM) is a new clinical measurement tool for evaluating the recovery of voluntary movement and basic mobility following
stroke. This article presents the results of 3 substudies examining the
reliability (interrater and intrarater) and internal consistency of
STREAM scores. Subjects and Methods. A direct-observation reliability study was conducted on 20 patients who had strokes and were in a
rehabilitation setting. Pairs of raters from a group of 6 participating
therapists provided data to judge interrater agreement. A videotaped
assessments reliability study was done to assess intrarater and interrater agreement on the scoring of videotaped performances using the
STREAM measure and involved 4 videotaped assessments that were
viewed and rated on 2 occasions by 20 physical therapists. The internal
consistency of the STREAM scores was evaluated for 26 patients who
had strokes and who demonstrated the full range of motor ability.
Results. The reliability of the STREAM scores was demonstrated by
generalizability correlation coefficients of .99 for total scores and of .96
to .99 for subscale scores. The internal consistency of the STREAM
scores was demonstrated by Cronbach alphas of greater than .98 on the
subscales and overall. Conclusion and Discussion. These high levels of
reliability support the use of the STREAM instrument for the measurement of motor recovery following stroke. Further work on the validity
and responsiveness of the STREAM measure is in progress. [Daley K, Mayo N, Wood-Dauphinee S. Reliability of scores on the Stroke Rehabilitation Assessment of Movement (STREAM) measure. Phys Ther. 1999;79:8-23.]
Kathy Daley
Nancy Mayo
Sharon Wood-Dauphinee
Key Words: Cerebrovascular accident, Motor recovery, Outcome measure, Reliability.
Physical Therapy . Volume 79 . Number 1 . January 1999
Although a number of instruments are available to measure the recovery of movement following stroke,1-10 we believe that they have not been widely used in clinical practice. A Canadian survey* revealed that, although several of these outcome measures were being used for research purposes,11-16 they were being used routinely in fewer than 5% of physical therapy departments. Lengthy administration time, complexity of scoring, and dependence on equipment were cited as barriers to routine use in clinical settings.
Background of the STREAM Measure
Researchers and clinicians at the Jewish Rehabilitation Hospital (JRH), Laval, Quebec, Canada, collaborated to produce the original Stroke Rehabilitation Assessment of Movement (STREAM) measure in 1986. The instrument was designed for use in physical therapy departments to provide a comprehensive, objective, and quantitative evaluation of the motor functioning of individuals with stroke. The instrument was designed to be quick and simple to administer, and it fit very well into the routine clinical assessment scheme. In the first phase of the instrument's development, the items and scoring of the original STREAM measure were subjected to review by 2 panels of experts consisting of a total of 20 physical therapists. Based on the panels' recommendations, the original instrument was refined. This intermediate test version of the STREAM measure then underwent preliminary evaluations for the reliability and internal consistency of scores. Item reduction was then carried out.

* Working Group on Outcome Measures in Physiotherapy. Health and Welfare Canada Contract, 1992. Address requests for information to: Dr Nancy Mayo, The STREAM Research Group, School of Physical and Occupational Therapy, McGill University, Davis House, 3654 Drummond St, Montreal, Quebec, Canada H3G 1Y5.
K Daley, MSc(Rehabilitation Science), BSc(PT), was a graduate student in rehabilitation science, McGill University, Montreal, Quebec, Canada,
when this study was completed.
N Mayo, PhD(Epidemiology), MSc(Applied), BSc(PT), is Research Scientist, Royal Victoria Hospital, Montreal, Quebec, Canada, and Assistant
Professor, Faculty of Medicine, Department of Epidemiology, and School of Physical and Occupational Therapy, McGill University. Address all
correspondence to Dr Mayo at The STREAM Research Group, School of Physical and Occupational Therapy, McGill University, Davis House, 3654
Drummond St, Montreal, Quebec, Canada H3G 1Y5.
S Wood-Dauphinee, PhD(Epidemiology), MSc(Applied), BSc(PT), is Professor and Director, School of Physical and Occupational Therapy, and
Professor, Department of Medicine and Department of Epidemiology and Biostatistics, McGill University.
Ethical approval for this study was obtained from the ethics committees of the Jewish Rehabilitation Hospital and the School of Physical and
Occupational Therapy, McGill University.
This research was presented, in part, in poster format at the Joint Congress of the Canadian Physiotherapy Association and the American Physical Therapy Association, June 4-8, 1994, Toronto, Ontario, Canada.
This project was funded by the Réseau de Recherche en Réadaptation de Montréal et de l'Ouest du Québec (RRRMOQ). The first author was supported, in part, by a Royal Canadian Legion/Physiotherapy Foundation of Canada fellowship in gerontology.
This article was submitted September 12, 1997, and was accepted May 22, 1998.
The final version of the STREAM measure consists of 30 items or test movements that are equally distributed
among 3 subscales: upper-limb movements, lower-limb
movements, and basic mobility items. The STREAM
scoring form, including the criteria for scoring the
items, is presented in the Appendix. Limb movements
are scored on a 3-point scale. Mobility items are scored
on a 4-point scale similar to that used for scoring limb
movements except that a category has been added to
allow for independence with the help of a mobility aid.
Thus, the maximum raw total STREAM score is 70, with
each of the limb subscales scored out of 20 points and
the mobility subscale scored out of 30 points. To allow
for the possibility that occasionally an item cannot be
scored (eg, because of restricted range of motion or
pain), the subscale and total scores may be transformed
to scores out of 100. The procedure for transforming the
scores is provided in the STREAM test manual.† Further
details of the development and validation of the content
of the STREAM measure are presented elsewhere.17 The
present article presents the results of reliability studies
carried out on the STREAM measure.
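The scoring arithmetic described above can be illustrated with a short Python sketch. The actual transformation procedure is given in the STREAM test manual and is not reproduced in this article; the proration used here (points earned divided by the maximum obtainable on the items actually scored, multiplied by 100) is an assumption, and the function and variable names are hypothetical.

```python
# Maximum points per item: limb items are scored 0-2, mobility items 0-3.
MAX_POINTS = {"upper": 2, "lower": 2, "mobility": 3}
ITEMS_PER_SUBSCALE = 10  # 30 items divided equally among 3 subscales


def raw_maximum(subscale):
    """Maximum raw score for a subscale when every item is scored."""
    return MAX_POINTS[subscale] * ITEMS_PER_SUBSCALE


def transformed_score(item_scores, subscale):
    """Prorate a subscale score to a 0-100 scale.

    item_scores holds one entry per item; None marks an item that could
    not be scored (for example, because of pain). ASSUMPTION: unscored
    items are dropped from both the numerator and the denominator.
    """
    scored = [s for s in item_scores if s is not None]
    if not scored:
        raise ValueError("no scorable items")
    return 100.0 * sum(scored) / (MAX_POINTS[subscale] * len(scored))
```

Under this assumption, a subject with one unscorable upper-limb item is rated against 18 obtainable points rather than 20, so a missing item does not lower the transformed score.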
The objectives of this study were (1) to determine the
extent to which pairs of therapists concur on scores for
the items of the STREAM measure, on subscale scores,
and on the total score (interrater reliability); (2) to
assess the consistency of STREAM scores based on the
observation of videotaped performances across occasions and across raters (intrarater and interrater reliability); and (3) to examine the extent to which the items of
the STREAM measure relate to each other, to their
respective subscales, and to the group of items as a whole
(internal consistency).
Method
Overview of Study Design

Three separate substudies were conducted. The first
objective was achieved through the direct-observation
reliability study, involving pairs of raters who directly
observed and assessed patients. To realize the second
objective, the videotaped assessments reliability study
was done with raters viewing and rating videotaped
assessments on 2 occasions. To accomplish the third
objective, internal consistency was examined using data
accrued in the direct-observation reliability study, as well
as additional data collected on patients scoring in the
lower end of the range.
The Direct-Observation Reliability Study
The study was carried out at JRH, a 120-bed rehabilitation hospital with a 40-bed stroke unit that admits
approximately 200 stroke patients per year. Written
informed consent was obtained prior to a patient's participation. A convenience sample of 20 cooperative
persons with stroke was selected. These individuals were
selected to represent a wide range of motor dysfunction,
age, and time since stroke. We excluded patients with
any major comorbid conditions that interfered with
motor function or its assessment, such as a neurological
condition in addition to the stroke, a severe comprehension disorder, marked bilateral motor or sensory impairment, amputation of a limb, or severe rheumatoid
arthritis.
Sixteen subjects with ischemic stroke and 4 subjects with
hemorrhagic stroke participated. Fourteen subjects had
left-hemisphere cerebrovascular accidents (CVAs), probably reflecting a greater propensity to admit patients
needing a multitude of rehabilitation services. The subjects (11 men, 9 women) ranged in age from 47 to 86
years (mean=66.7, SD=10.7). They were evaluated on the
STREAM measure from 47 to 238 days (mean=104.5,
SD=42.7) following the occurrence of their CVAs. The
characteristics of these subjects are presented in Table 1.
Six physical therapists provided the scores for the 20
subjects in the sample of convenience. Prior to the
reliability study, each therapist independently reviewed
the STREAM test manual. The 6 therapists then participated in a 2-hour training session (led by the primary
author [KD]) in which the STREAM test manual was
discussed, a videotaped STREAM assessment was scored,
and the scores were discussed. Currently, in clinical
settings, therapists are orienting themselves to the
STREAM measure using only the test manual or sometimes doing an in-service as a group. The therapists then
independently practiced using the STREAM measure to
score 2 patients in the clinical milieu. The evaluations
for the reliability study were carried out simultaneously
and independently by pairs of raters, with one therapist
performing the assessment and the other therapist
observing close at hand.
The 6 raters were 4 therapists working at JRH, 1 therapist
from an acute care setting, and the primary author. They
had 1 to 9 years of experience as physical therapists
(mean=5, SD=2.1) and had worked with patients with
stroke for 1.5 to 3.5 years (mean=2.5, SD=0.6). Details of
the raters' clinical backgrounds are given in Table 2.
The Videotaped Assessments Reliability Study

Four patients with a range of motor deficits secondary to
stroke who had participated in the direct-observation
reliability study and who agreed to be videotaped participated in this phase of the study. These patients were
reassessed with the STREAM measure, and the assessments were videotaped.
† Available on request from Dr Mayo.
Table 1.
Characteristics of Subjects

Type of CVA(a)  Side of CVA  Sex  Age (y)  Days Post-CVA  Comments/Comorbid Conditions

Direct-observation reliability study
Ischemic(b)     R            M    60       77             Shoulder pain
Ischemic        R            M    67       141            1 previous CVA (same side)
Ischemic        R            M    70       47             3 previous CVAs (same side)
Ischemic        R            M    86       130            Language barrier
Ischemic        R            F    73       119            Mild perceptual/memory deficits
Ischemic(b)     L            M    61       110            Aphasic/perceptual problems
Ischemic        L            M    67       66             Expressive aphasia
Ischemic        L            M    70       70             Aphasic
Ischemic        L            M    75       105            Aphasic/language barrier
Ischemic        L            F    47       >1,460         2 previous CVAs; lupus
Ischemic        L            F    48       113            Mild aphasia
Ischemic        L            F    63       92             Mild ataxia in upper extremity
Ischemic        L            F    69       94             1 previous CVA (full recovery)
Ischemic        L            F    71       122            Shoulder pain
Ischemic        L            F    79       64             Hip fracture (pinned, weight bearing as tolerated)
Ischemic        L            F    80       52             Ataxia and vertigo; osteoarthritic knees
Hemorrhagic(b)  R            F    50       155            Aneurysm clipped; lobectomy
Hemorrhagic     L            M    58       89             Expressive aphasia
Hemorrhagic     L            M    59       101            No complications
Hemorrhagic(b)  L            M    80       238            Perception/memory impaired

Additional subjects with low-level motor function for internal consistency analysis
Ischemic        R            F    81       44             Perception/cognition impaired
Ischemic        L            M    63       15             Aphasic/perceptual problems
Ischemic        L            F    67       21             Mild aphasia
Hemorrhagic     R            F    34       16             Migraines
Hemorrhagic     L            F    57       18             Perception/cognition impaired
Hemorrhagic     L            M    70       41             Perception/cognition impaired; aphasia

a CVA=cerebrovascular accident.
b Subject also participated in videotaped assessments reliability study.
The 4 subjects (3 men, 1 woman), identified in Table 1, ranged in age from 50 to 80 years (mean=63.0, SD=10.8). Two of the subjects had left-sided CVAs, and the other 2 subjects had right-sided CVAs. One subject had aphasia, 2 subjects had perceptual and memory impairments, and 1 subject had mild shoulder pain. They were evaluated an average of 145 days (SD=60.4) after having had a stroke.
Twenty raters were recruited from Montreal-area health
care facilities by sending notices to hospital physical
therapy departments explaining the study. The raters
were selected to cover a wide range of clinical backgrounds, with a minimum of 6 months of experience
working with patients with stroke.
The 20 physical therapists who participated in rating the videotapes had diverse clinical backgrounds, with 6 months to 11 years of experience working with patients with stroke (mean=4.5, SD=3.2) and 1 to 33 years of experience working as physical therapists (mean=9.0, SD=7.9). Eight of the therapists worked in acute care settings, 9 therapists worked in inpatient and outpatient rehabilitation settings, and 3 therapists worked in long-term care (LTC) settings. Table 2 provides an overview of the raters' clinical backgrounds.
Prior to the first viewing session, the raters independently reviewed the STREAM test manual and practiced
administering the STREAM measure on at least 2
patients. At the beginning of the first viewing session, the
therapists practiced rating a sample videotaped performance (not the one used in the actual reliability study),
and the scoring of the STREAM measure was briefly
discussed as a group (led by the primary author). The 20
raters, divided into 2 groups of 10 raters for convenience
of viewing, simultaneously and independently evaluated
the 4 videotaped assessments. During the rating sessions,
no discussion of the scoring of any of the items was
permitted. Items were replayed up to 3 times upon
request, as some of the smaller movements were more
difficult to see on videotape and because, when the
testing is done in the clinical setting, the patient is
permitted up to 3 attempts to perform a test item. The raters viewed each of the 4 videotaped performances on one occasion and on a second occasion approximately 1 month later. The videotapes were presented in a random order at each session.

Table 2.
Characteristics of Raters

Direct-observation reliability study
Service Area: Acute care; Inpatient rehabilitation
Caseload: Neurological; Stroke; Mixed; Neurological; Brain injury; Mixed
No. of Patients With Stroke Assessed in Past Year: 10-30; >30; >30; >30; <10; 10-30
Years of Experience in Physical Therapy(a): 4.5 (2.5); 4.5 (2.5); 2 (2.5); 4.5 (1.5); 9 (3); 3.5 (3.5)

Videotaped assessments reliability study
Service Area: Acute care; Acute care; Acute care; Acute care; Acute care; Inpatient rehabilitation/acute care; Inpatient rehabilitation/LTC(b); Outpatient rehabilitation; Outpatient rehabilitation; Outpatient orthopedic; LTC; MSc(c) student; MSc student
Caseload: Neurological; Mixed; Neurological; Neurological; Neurological; Neurological; Mixed; Mixed; Geriatrics; Mixed; Stroke; Stroke; Neurological; Mixed; Neurological; Neurological; Orthopedic; Mixed; NA(d); NA
No. of Patients With Stroke Assessed in Past Year: 10-30; 10-30; 10-30; 10-30; 10-30; >30; 10-30; 10-30; 10-30; >30; 10-30; 10-30; >30; 10-30; >30; >30; <10; 10-30; <10; <10
Years of Experience in Physical Therapy(a): 1 (0.5); 1 (0.5); 10 (5); 3 (1); 11.5 (2); 4 (4); 5.5 (4); 2 (1); 5 (2); 20 (10); 9 (4); 8 (3); 7 (6.5); 9.5 (3.5); 24 (11); 33 (11); 6.5 (5); 10 (7.5); 4 (2.5); 8 (4)

a Years of experience working with patients with stroke shown in parentheses.
b LTC=long-term care.
c MSc=Master of Science degree.
d NA=not applicable.
Internal Consistency
For evaluating the internal consistency of scores
obtained with the STREAM measure, we used the scores
for the 20 patients in the direct-observation reliability
study. None of these patients had a score below 30 (out
of 100) on the STREAM measure. The STREAM measure, however, is intended for use in assessing the full
range of motor function possible following stroke.
Therefore, scores were collected for an additional 6
patients with low levels of motor function who met our
inclusion criteria. Four additional physical therapists,
who had participated in the videotaped assessments
reliability study and who were familiar with administering the STREAM measure, provided us with the
STREAM scores for these 6 patients.
The characteristics of the 20 subjects in the sample of convenience were presented previously. The 6 additional subjects with low levels of motor function (mean STREAM score=19/100, SD=9.7) were from 4 different facilities (ie, 1 LTC facility and 3 acute care hospitals). These subjects (2 men, 4 women) had an average age of 62 years (SD=14.5) and were evaluated an average of 26 days (SD=12) following stroke. Three of the subjects had ischemic strokes, and the remaining 3 subjects had hemorrhagic strokes. Two subjects had right-sided CVAs, and 4 subjects had left-sided CVAs. Three subjects were aphasic, and 4 subjects had prominent perceptual and cognitive impairments. The characteristics of these subjects are presented in Table 1.
Data Analysis
Scores for the individual items were summed to produce
subscale and total scores on the STREAM measure for
each subject. The subscale and total scores from the
direct-observation reliability study were transformed to
scores out of 100 (over the 20 subjects, a total of 15 items
could not be scored because of restricted range of
motion or pain). As there were no missing data in the
videotaped assessments reliability study, however, raw
scores were used (ie, a maximum total STREAM score of
70). SAS statistical software18,‡ was used to compute Pearson correlations, cell frequencies, kappa statistics, signed-rank statistics, and Cronbach alphas. The GENOVA version 2.1 program19 was used to obtain generalizability correlation coefficients (GCCs) for subscale and total scores. The analyses of rater agreement were parallel for the 2 reliability studies, except that only interrater agreement was evaluated in the direct-observation reliability study, whereas estimates of intrarater and interrater agreement were derived from the videotaped assessments reliability study.
Interrater and intrarater reliability. The agreement between raters for scoring items of the STREAM measure in the direct-observation reliability study was described using the index of crude agreement (the total percentage of subjects in which paired scores agree precisely), expected agreement, and Cohen's kappa statistic.20 Quadratically weighted kappa statistics20-22 were used; this statistic produces values equivalent to intraclass correlation coefficients (ICCs) for the same data23,24 and reflects chance-corrected agreement in which disagreements are weighted by the square of the distance between the categories, so that larger discrepancies are penalized disproportionately. Kappa values typically range from zero (chance-level agreement) to unity (perfect agreement); the closer to 1 kappa is, the better the agreement between scores. As the kappa statistic is prevalence dependent, it is influenced by the distribution of scores, the variability among subjects, and the number of rating categories; meaningful values of kappa can only be derived when there is sufficient variability in scores.25,26
Kappa statistics could not be derived for the individual
item scores of the videotaped assessments reliability
study due to insufficient variability in scores with only 4
subjects. The distributions of agreement on scoring,
however, were tabulated.
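The three item-level agreement indexes named above (crude agreement, expected agreement, and quadratically weighted kappa) can be sketched in a few lines of Python. This is an illustrative implementation of the standard definitions, not the SAS code used in the study; the function name and the 0-based category coding are assumptions.

```python
def weighted_kappa(ratings_a, ratings_b, n_categories):
    """Return (crude agreement, expected agreement, weighted kappa).

    ratings_a and ratings_b are equal-length lists of integer category
    codes 0..n_categories-1 given by two raters to the same subjects.
    """
    n = len(ratings_a)
    k = n_categories
    # Observed joint proportions and each rater's marginal proportions.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    crude = sum(observed[i][i] for i in range(k))
    expected = sum(row[i] * col[i] for i in range(k))

    # Quadratic disagreement weights: (i - j)^2 / (k - 1)^2.
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * observed[i][j]
            den += w * row[i] * col[j]
    if den == 0:
        raise ValueError("kappa is undefined without variability in scores")
    return crude, expected, 1.0 - num / den
```

The explicit error for a zero denominator mirrors the point made in the text: when every rating falls in one category, as can happen with only 4 videotaped subjects, kappa cannot be derived and only the distribution of agreement can be tabulated.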
For describing the agreement on subscale and composite scores, GCCs were calculated.27 For the direct-observation reliability study, these statistics reflected agreement between raters. For the videotaped assessments reliability study, GCCs described agreement both within and between raters. These statistics, based on the generalizability theory, are analogous to traditional reliability coefficients, except that GCCs not only reflect the magnitude of the error, as would a traditional reliability coefficient, but also attribute the error to a specific source.19,24,28 Generalizability correlation coefficients, like ICCs, range from 0 to 1, and are based on an analysis-of-variance model. The closer the GCC is to unity, the greater the generalizability or reliability. Reliability coefficients of .95 or better are recommended as the minimal requirement for a clinical outcome measure used in making judgments about individuals.29-31 That is, only when at least 95% of the total variance is due to true variance is the risk of falsely classifying an individual acceptably small. When an individual's care is not directly affected, a somewhat lower degree of certainty can be accepted, and coefficients of greater than .80 are generally considered to be acceptable when tests are used to make decisions about a group or for research purposes.29-31
In both the direct-observation reliability study and the
videotaped assessments reliability study, subjects and
raters contributed to the variability of the error terms. In
the videotaped assessments reliability study, an additional source of variance was the timing of the viewing
(first or second). These sources of variance can be
considered as (1) fixed, where the resultant GCC
indicates the extent to which a person can generalize
across the particular subjects, raters, or occasions
involved in the study, or (2) random, where the related
GCC indicates the extent to which a person can generalize results to any rater, subject, or occasion. When
using videotapes to assess intrarater agreement, however, the variability associated with changes in a patient's
performance on separate occasions is eliminated. Thus,
estimates of reliability obtained using videotapes may
differ from those that would be obtained in clinical
practice.
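To make the idea of partitioning error among sources concrete, the following Python sketch estimates variance components for a simple subjects-by-raters table and forms two single-rater coefficients. It is a generic analysis-of-variance illustration under stated assumptions (a fully crossed design with one score per cell, and the ICC-style treatment of fixed versus random raters); it is not the GENOVA computation used in the study, and the function names are hypothetical.

```python
def variance_components(scores):
    """Estimate variance components for a subjects x raters table.

    scores[p][r] is the score given to subject p by rater r; every rater
    scores every subject once (a fully crossed design).
    """
    n = len(scores)          # number of subjects
    k = len(scores[0])       # number of raters
    grand = sum(sum(row) for row in scores) / (n * k)
    subj_means = [sum(row) / k for row in scores]
    rater_means = [sum(row[r] for row in scores) / n for r in range(k)]

    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_rater = n * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_resid = ss_total - ss_subj - ss_rater

    ms_subj = ss_subj / (n - 1)
    ms_rater = ss_rater / (k - 1)
    ms_resid = ss_resid / ((n - 1) * (k - 1))

    # Negative estimates are truncated at zero, a common convention.
    return {
        "subjects": max((ms_subj - ms_resid) / k, 0.0),
        "raters": max((ms_rater - ms_resid) / n, 0.0),
        "residual": ms_resid,
    }


def reliability_coefficients(scores):
    """Return (raters-random, raters-fixed) single-rater coefficients."""
    vc = variance_components(scores)
    true_var = vc["subjects"]
    # Raters random: systematic rater differences count as error.
    random_raters = true_var / (true_var + vc["raters"] + vc["residual"])
    # Raters fixed: only the residual counts as error.
    fixed_raters = true_var / (true_var + vc["residual"])
    return random_raters, fixed_raters
```

When raters differ little in their average severity, the rater component is near zero and the two coefficients nearly coincide, which is one way coefficients for fixed and random rater models can come out almost identical.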
In addition to GCCs, Wilcoxon matched-pairs signed-rank statistics,32 the nonparametric equivalent of the
paired-sample t test, were computed to identify trends in
the scoring. This statistic examines the differences
between pairs of scores, which under the null hypothesis
are assumed to have a median difference of zero.33 If the
signed-rank statistic is significant, this finding indicates
poor agreement due to a tendency to score either
consistently higher or lower on the different rating
occasions (intrarater) or by either the assessing or
observing raters (interrater).
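The signed-rank procedure described above can be sketched as follows. The article does not state which form of the statistic was reported, so the centered form used here (sum of positive ranks minus its null expectation) and the normal approximation for the probability are assumptions; the function name is hypothetical.

```python
import math


def signed_rank(first, second):
    """Wilcoxon matched-pairs signed-rank summary for paired scores."""
    diffs = [a - b for a, b in zip(first, second) if a != b]  # zeros dropped
    n = len(diffs)
    if n == 0:
        return {"n": 0, "w_plus": 0.0, "w_minus": 0.0, "s": 0.0, "p_approx": 1.0}
    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = average
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    s = w_plus - n * (n + 1) / 4.0  # centered form: expected value 0 under the null
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    # Two-sided normal approximation, without a correction for ties.
    p_approx = math.erfc(abs(s) / sd / math.sqrt(2.0))
    return {"n": n, "w_plus": w_plus, "w_minus": w_minus, "s": s, "p_approx": p_approx}
```

With this convention, a negative statistic indicates that the second set of scores (for example, the observing rater or the second viewing) tends to be higher than the first. For small samples an exact probability would be preferable to the approximation shown.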
Internal consistency. Three statistics were calculated:

(1) Pearson correlation coefficients for each possible
pair of items included, (2) the correlations between the
scores for individual items and the subscale and total
scores, calculated by omitting that item, and (3) Cronbach alphas34 for each subscale and for the STREAM
measure as a whole. Because Cronbach alphas are influenced by the total number of items included in an
instrument and increase in value if related items are
added, alpha values must be interpreted accordingly.
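A minimal Python sketch of Cronbach's alpha, together with the standardized form that shows why alpha rises as related items are added, is given below. The formulas are the standard ones; the function names and the mean inter-item correlation used in the usage note are illustrative assumptions, not values from this study.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha; scores is a list of subjects, each a list of k item scores."""
    k = len(scores[0])
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def standardized_alpha(k, mean_r):
    """Alpha implied by k items with mean inter-item correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)
```

For a hypothetical mean inter-item correlation of .6, `standardized_alpha(10, 0.6)` is lower than `standardized_alpha(30, 0.6)`, so a 30-item total can be expected to show a higher alpha than a 10-item subscale even when the items are no more strongly related.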
‡ SAS Institute Inc, PO Box 8000, Cary, NC 27511.
Results
The Direct-Observation Reliability Study

Interrater agreement for individual items of the STREAM measure. Over all items, there were a total of 585 paired ratings (ie, 30 items × 20 subjects, with 15 items scored as X). Perfect agreement occurred for 89.4% of these ratings, disagreement by 1 category occurred for 9.6% of the ratings, and disagreement by 2 categories occurred for only 1.0% of the ratings.

Figure 1 is a stem-and-leaf plot summarizing the distribution of kappa statistics for the interrater agreement on scores for each of the 30 items. The kappa values clustered around .8 and .9, indicating that there was excellent agreement on scoring of all of the items, with the exception of the item "rising from sitting to standing," which demonstrated a kappa value of .65.

Figure 1.
Stem-and-leaf plot of kappa statistics showing interrater agreement on scoring the 30 items of the Stroke Rehabilitation Assessment of Movement (STREAM) measure in the direct-observation reliability study. Numbers in box (stem) represent first decimal place of kappa; numbers to right of box (leaves) represent second decimal place of kappa. Kappas are quadratically weighted.

Interrater agreement for STREAM subscale and total scores. The STREAM subscale and total scores given by the 2 raters for each of the 20 subjects are graphed in Figure 2. The close proximity of the 2 lines indicates excellent interrater agreement that was consistent across the entire range of scores. The GCCs for interrater agreement on the subscale and total scores for the 20 subjects are presented in Table 3 and Figure 2. Of the 3 subscales, the scores on the upper-extremity subscale were the most reliable, followed by the scores of the lower-extremity and basic mobility subscales.

There was a tendency for the rater who was observing to score slightly higher than the rater who was doing the hands-on assessment. The observers and the assessors gave the same total STREAM scores for 7 of the 20 subjects, but the observers gave higher total STREAM scores than did the assessors for 10 subjects. To determine whether the difference between the scores given by the 2 raters was significant, signed ranks were computed for subscale and total scores. These test statistics, along with their related probabilities, are shown in Table 3. None of the signed ranks were significant (at P<.05). The signed-rank statistic, however, was marginal for the total scores.

The Videotaped Assessments Reliability Study
Intrarater agreement for individual items of the STREAM measure. Over the 30 items on the 4 videotapes, there were a total of 2,400 paired ratings (20 raters × 4 videotapes × 30 items). Perfect agreement occurred in 85.7% of the ratings, disagreement by 1 category occurred for 12.1% of the ratings, and disagreement by 2 categories occurred for only 2.2% of the ratings. Perfect agreement was generally slightly less frequent on scoring of the videotapes of the subjects with left-sided CVAs and perceptual problems (videotapes C and D).
Intrarater agreement for STREAM subscale and total scores. Figure 3 shows the pattern of agreement on the total STREAM scores given by the 20 raters for each of the 4 videotapes on the 2 occasions. The close proximity of the 2 lines suggests excellent intrarater agreement. The GCCs for intrarater agreement on the subscale and total scores over the 4 videotapes are shown in Table 3. The GCCs were virtually identical for the models with raters fixed and random. Unlike the results of the direct-observation reliability study, where the upper-extremity subscale was the most reliably scored, the GCCs in this study were slightly higher for the lower-extremity and mobility subscales, followed by the upper-extremity subscale. All raters demonstrated excellent intrarater agreement on scoring the 4 videotapes, with GCCs for individual raters ranging from .982 to .999 (model with subjects and occasions fixed).
Although the paired ratings were generally within a few
points of each other, the STREAM subscale and total
scores given were consistently higher (an average of 2
points higher for total scores over the 4 videotapes) for
the second rating session. The signed ranks, presented
in Table 3, were significant (P<.05) for all subscales on
videotapes C and D, indicating that the trend to score
higher on the second viewing session was significant for
these videotapes.
Interrater agreement for STREAM scores on the 2 occasions. The relative flatness (slope near 0) of the lines that show the scores on the 2 occasions in Figure 3 indicates that the agreement between raters on the STREAM total scores was excellent on each occasion. On both occasions, the GCCs reflecting the agreement among the 20 raters on STREAM total scores given on the 4 videotaped assessments were .999 (model with subjects fixed and raters random).

Figure 2.
Direct-observation reliability study: interrater agreement. GCC=generalizability correlation coefficient.
Internal Consistency
The individual item-to-total correlations for the
STREAM measure were calculated for our sample of 26
subjects and ranged from .579 to .926. Item-to-subscale
score correlations ranged from .585 to .967. The alpha
coefficients reflecting the effect that omitting a particular item would have on the overall alpha ranged
from .982 to .984. Alpha coefficients were .965 for the
mobility subscale and .979 for each of the limb subscales.
The overall alpha coefficient for the STREAM measure
was .984.
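The item-to-subscale and item-to-total correlations reported above were calculated with the item in question omitted from the score it is correlated with. A minimal Python sketch of that calculation follows; the function names are hypothetical, and the sketch assumes every item shows some variability across subjects (otherwise the correlation is undefined).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def corrected_item_total(scores):
    """Correlate each item with the sum of the remaining items.

    scores is a list of subjects, each a list of item scores. Omitting the
    item from the total avoids inflating the correlation by correlating
    the item partly with itself.
    """
    k = len(scores[0])
    result = []
    for i in range(k):
        item = [row[i] for row in scores]
        rest = [sum(row) - row[i] for row in scores]
        result.append(pearson(item, rest))
    return result
```

Passing only the items of one subscale yields item-to-subscale correlations; passing all items yields item-to-total correlations.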
Discussion
The extent to which the results of any reliability study
can be generalized to clinical practice depends on how
closely the conditions in the study approximate those of
assessment in the clinical setting, that is, on how similar
are the subjects, the raters, and the setting and manner
in which the evaluations are done.
The subjects participating in the STREAM reliability

studies reflected the distribution and range of comorbid
medical problems typically encountered in the stroke
rehabilitation setting, and they demonstrated a wide
range of motor ability. The results of these studies,
therefore, should be generalizable to similar inpatient
rehabilitation populations.
The clinical backgrounds of the raters involved were
diverse. The raters had graduated from 6 universities,
were working in 12 different facilities, and had a wide
range of general and stroke-specific clinical experience.
In addition, despite very little training in the use of the
STREAM measure, excellent reliability was achieved.
The results of our studies, therefore, should be generalizable across the spectrum of rater training and experience. That is, similar results should be achievable among
physical therapists who have carefully read the STREAM
test manual and have used the STREAM measure a few
times in the clinical setting to familiarize themselves with
the scoring form. It must be acknowledged, however,
that the brief training that the raters received could have positively influenced the reliability estimates in this study and thus potentially limit the generalizability of the study results.

Table 3.
Results of the Stroke Rehabilitation Assessment of Movement (STREAM) Measure Reliability Studies

                            Direct Observation         Videotaped Assessments
                            (Interrater Agreement)     (Intrarater Agreement)
                            GCC(a)   Signed Rank(b)    GCC(c)   Signed Rank(b,d)    Cronbach Alpha
Upper-extremity subscale    .994     -11 (.311)        .963     A: -6.5 (.801)      .979
                                                                B: -9.5 (.365)
                                                                C: -35 (.049)
                                                                D: -53 (.009)
Lower-extremity subscale    .993     -8 (.281)         .999     A: -7 (.188)        .979
                                                                B: -6.5 (.659)
                                                                C: -39 (.012)
                                                                D: -26 (.045)
Basic mobility subscale     .982     -16 (.172)        .999     A: -12.5 (.273)     .965
                                                                B: -13.5 (.427)
                                                                C: -83 (.0003)
                                                                D: -33 (.0001)
Total scores on STREAM      .995     -26 (.072)        .999     A: -25 (.246)       .984
                                                                B: -17.5 (.076)
                                                                C: -74 (.0001)
                                                                D: -99 (.0001)

a Model: raters and subjects random. GCC=generalizability correlation coefficient.
b Probabilities shown in parentheses.
c Model: subjects and occasions fixed, raters random.
d Signed ranks given for agreement on each of the 4 videotaped assessments (A-D).
The estimates of interrater and intrarater reliability
obtained for the STREAM measure under the conditions
imposed in this study exceed the .95 level that some
authors believe is required for clinical decision making.29-31 However, the testing procedures used in the
direct-observation reliability study, where both ratings
are made in a quiet environment and in the same testing
session so that variability in patient performance is
eliminated, represent ideal conditions for achieving
reliable scores. Similarly, videotaped assessments may be
performed in a more standardized fashion and under
more controlled conditions than would be found in a
busy clinical setting. Thus, reliability estimates obtained
under the somewhat artificial conditions of the study
may represent optimal reproducibility. The generalizability of the results of this study to the realities of the
clinical setting, where therapists assess patients who may
be less than cooperative, on separate occasions, and with
frequent interruptions, still needs to be determined
through further study.
Only one of the items retained on the completed
STREAM measure demonstrated less than excellent reliability. Kappa was only .65 for the sit to stand item.
This item was retained, however, because the consensus
panels had deemed it to be crucial as an important
milestone of motor functioning and because it performed well in terms of internal consistency, correlating
at .83 with the subscale score and .77 with the total score.
The disagreements between raters' scores on this item
(in both reliability studies) were due to difficulty differentiating between normal movement patterns (score 2
or 3) and abnormal movement patterns (score 1c). In an
attempt to improve the reliability of scores for this item,
a note has been added to the scoring form (ie, Note:
pushing up with hand[s] to stand=aid [score 2]; asymmetry such as trunk lean, Trendelenburg position, hip
retraction, or excessive flexion or extension of the
affected knee=marked deviation [score 1a or 1c]).
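As an illustration of how an item-level agreement statistic such as the kappa reported for the sit-to-stand item can be computed, the following minimal Python sketch calculates unweighted Cohen's kappa for two raters. The function name `cohen_kappa` and the example scores are hypothetical and are not study data. The study may have used a weighted form of kappa, which this sketch does not implement.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same subjects."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed proportion of subjects on which the two raters agree.
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    p_exp = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_obs - p_exp) / (1.0 - p_exp)


# Hypothetical scores on one item for 8 subjects (not data from the study).
scores_rater_1 = ["3", "3", "1c", "2", "3", "1c", "0", "3"]
scores_rater_2 = ["3", "2", "1c", "2", "3", "3", "0", "3"]
print(round(cohen_kappa(scores_rater_1, scores_rater_2), 3))
```

Because kappa corrects for chance agreement, an item on which most subjects receive the same score can show a modest kappa even when raw agreement is high.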
As shown by the plots of paired ratings (Figs. 2 and 3),
the reliability of scores obtained for the STREAM measure was excellent across the full range of scores. This is
an important quality for instruments intended to be
used for evaluating patients with a wide range of motor
capabilities. It is not clear from the literature whether
the other available stroke motor assessments demonstrated similar reliability across the range of scores.
For the direct-observation reliability study, the finding
of slightly greater reliability for scores on the upper-extremity subscale versus scores on the lower-extremity
and mobility subscales may be due to a greater heterogeneity in patients' upper-extremity scores, as upper-extremity recovery tends to be slower and less complete
than that for the lower extremity. Another possible
contributor to this slightly higher reliability may be that
several of the patients had flaccid upper extremities and
clearly were not able to perform the test movement
(score of 0), thereby reducing the possibility of rater
disagreement on scoring.
Figure 3.
Videotaped assessments reliability study: intrarater agreement. GCC=generalizability correlation coefficient.
In contrast, in the videotaped assessments reliability
study, the upper-extremity subscale was slightly less
reliably scored. Because it is more difficult to observe
small movements on videotape, such as movements of
the hand, this finding is not surprising. Also not unexpected were the slightly lower levels of overall agreement
we observed for videotapes C and D, in which our 2
subjects were moderately influenced by increased reflex
activity and had perceptual problems. This finding suggests that heightened reflex activity and perceptual
problems may make the scoring of movement slightly
more difficult.
There was a tendency for raters to score the videotaped
subjects higher during the second rating occasion than
during the first rating occasion. It is possible that this
trend was an artifact of the therapists' learning curve.
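The direction of such a shift between occasions can be read from the sign of a signed rank statistic. The sketch below is a minimal Python illustration that centers the sum of positive ranks at its null expectation, S = (sum of ranks of positive differences) - n(n+1)/4. This centered form, the first-minus-second differencing, the function name `signed_rank_statistic`, and the example scores are all assumptions made for illustration; they are not taken from the study data or confirmed as the study's exact computation.

```python
def signed_rank_statistic(first, second):
    """Centered Wilcoxon signed rank statistic for paired scores.

    Differences are first - second; zero differences are dropped and tied
    absolute differences receive their average rank.
    """
    if len(first) != len(second):
        raise ValueError("paired score lists must have equal length")
    diffs = [a - b for a, b in zip(first, second) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0
    # Indices of the differences, ordered by absolute size.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    positive_rank_sum = sum(r for r, d in zip(ranks, diffs) if d > 0)
    return positive_rank_sum - n * (n + 1) / 4


# Hypothetical total scores from one rater on two occasions (not study data).
occasion_1 = [40, 52, 33, 61]
occasion_2 = [42, 52, 36, 62]
print(signed_rank_statistic(occasion_1, occasion_2))
```

Under these assumptions, a negative value indicates that scores tended to be higher on the second occasion than on the first.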
The slightly higher level of intrarater agreement for the
videotaped assessments reliability study compared with
the level of interrater agreement found in the direct-observation reliability study is as would be expected, as
typically agreement within raters is better than that
between raters. The slightly higher overall interrater
agreement achieved in the videotaped assessments reliability study than in the direct-observation reliability
study is, in light of the more controlled testing conditions, also as we would expect.
Several factors probably contributed to the very high
estimates of reliability obtained for STREAM scores. The
potential effects of the somewhat controlled conditions
have already been elucidated. In our opinion, however,
several characteristics of the STREAM measure, most
notably the simple scoring scheme and standardized
testing instructions, are likely to enhance the reliability
of the scores, regardless of the testing conditions. An
additional advantage is that amplitude, gross quality, and independence in mobility
(ie, quantity, quality, and independence of movement)
have been incorporated into the scoring of the STREAM
measure while still maintaining simplicity and objectivity; no other stroke motor assessment
includes all of these dimensions of interest to therapists.
The high degree of internal consistency found for the
STREAM measure indicates that the items are measuring
one concept, presumably the recovery of motor function. The alpha coefficients for the STREAM measure
and its subscales surpass the .90 level recommended by Feinstein35 for an
instrument to be clinically useful for measuring a specific concept. The subjects included in the internal
consistency analysis came from an inpatient rehabilitation setting as well as from acute care and LTC settings,
and their scores on the STREAM measure were distributed across the entire range of possible scores. The
results of this analysis, therefore, should be representative of the performance of the STREAM measure when it
is used with the population for which it is intended.
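For reference, coefficient alpha can be computed from a subject-by-item score matrix as alpha = k/(k-1) x (1 - sum of item variances / variance of total scores), where k is the number of items. The following Python sketch is a minimal illustration of that formula; the function name `cronbach_alpha` and the small score matrix are hypothetical and are not data from the study.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Coefficient alpha for a matrix with one row per subject, one column per item."""
    if len(scores) < 2:
        raise ValueError("at least 2 subjects are required")
    k = len(scores[0])
    if k < 2 or any(len(row) != k for row in scores):
        raise ValueError("each subject needs the same number (>= 2) of item scores")
    # Sample variance of each item across subjects.
    item_variances = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of the subjects' total scores.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical scores for 5 subjects on 4 items (not data from the study).
example_scores = [
    [0, 0, 1, 0],
    [1, 1, 1, 2],
    [2, 2, 3, 2],
    [3, 2, 3, 3],
    [3, 3, 3, 3],
]
print(round(cronbach_alpha(example_scores), 3))
```

Alpha approaches 1 when items rise and fall together across subjects, which is why a wide spread of motor ability in the sample matters for the estimate.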
Although we have presented internal consistency alongside rater agreement, and internal consistency is frequently considered as a form of reliability in the literature, we contend that it may be more appropriate to
consider internal consistency as being related to validity.
For example, in the process of item reduction and
justification for the content of the STREAM measure, we
used the correlations between an individual item's scores
and (1) each of the other items' scores (inter-item
correlations), (2) subscale scores (item-to-subscale correlations), and (3) total scores (item-to-total correlations).17 The full details of the internal consistency
analysis are presented in this article for completeness,
but the reader should recognize that internal consistency is an issue separate from rater agreement.
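The item-to-total correlations described in this paragraph can be illustrated with a plain Pearson correlation between each item's scores and the total scores. The sketch below is a minimal Python example; the helper names `pearson_r` and `item_total_correlations` are hypothetical, the data are invented, and because it is not stated here whether the item itself was excluded from the total (a "corrected" correlation), the sketch exposes that choice as a parameter.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / sqrt(sxx * syy)


def item_total_correlations(scores, exclude_item=False):
    """Correlate each item (column) with the total score across subjects (rows).

    With exclude_item=True the item is removed from the total before
    correlating, which avoids correlating an item partly with itself.
    """
    k = len(scores[0])
    results = []
    for j in range(k):
        item = [row[j] for row in scores]
        total = [sum(row) - (row[j] if exclude_item else 0) for row in scores]
        results.append(pearson_r(item, total))
    return results


# Hypothetical scores for 5 subjects on 3 items (not data from the study).
example_scores = [[0, 1, 0], [1, 1, 2], [2, 3, 2], [3, 2, 3], [3, 3, 3]]
print([round(r, 2) for r in item_total_correlations(example_scores)])
```

The same helper applies to item-to-subscale correlations if each row holds only the items of one subscale.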
Only one other stroke motor assessment9 has been
evaluated in terms of internal consistency, although this
psychometric property can conveniently be evaluated
using the same data that are used to obtain estimates of
rater agreement, provided that the subjects' scores span
the range of the instrument. In addition, because the
information obtained through internal consistency analysis can be used to support the appropriateness of the
content, it would seem all the more important to evaluate internal consistency.
In all of the characteristics that are important for an
outcome measure for the recovery of movement following stroke, the STREAM measure rivals other related
measures. It provides a considerable amount of internally consistent and reliable information on the movement of individuals following stroke. Its greatest advantage over other measures may be its excellent clinical
utility. Although the STREAM measure consists of 30
items, the simple scoring process and the way in which
the instrument is organized (ie, ordinal scaling with
consistent descriptions applied across all items, flow of
items from supine to standing and from low to high level
in terms of motor ability, standardized verbal instructions, and the fact that no special equipment is required)
combine to facilitate rapid assessment. These qualities of
the STREAM measure will certainly be much appreciated by clinicians.
Implications for Future Research on the STREAM Measure

The involvement of patients with stroke with varied
clinical profiles in the reliability studies has indicated
that the reliability of scores obtained with the instrument
across a relatively diverse population is excellent. Further tests of reliability, however, should be carried out in
different clinical facilities and settings (eg, acute care,
LTC) and under less controlled conditions than was the
case in this study in order to show the generalizability of
the reliability of STREAM scores across institutions,
patient populations, and testing conditions. The results
of other measurement studies evaluating criterion validity and construct validity and of longitudinal studies
assessing the responsiveness of the STREAM measure
are pending. A training videotape is planned, and a test
manual is available to help standardize the testing procedures. Ultimately, the STREAM measure will need to
be used in clinical trials and evaluated in terms of its
efficiency relative to other instruments for discerning
the effect of treatments on the recovery of movement.
Conclusions
The reliability of scores obtained with the STREAM
measure as determined under the conditions of this
study is excellent, both within and between raters, with
GCCs of .99 for total scores and from .96 to .99 for
subscale scores. The internal consistency of the STREAM
scores is also excellent, with Cronbach alphas of greater
than .98 on the subscales. It is anticipated that the
simplicity and overall clinical utility of the STREAM
measure will facilitate the incorporation of this instrument into the clinical setting for the routine objective
measurement of motor function. Further testing of the
measurement properties of the STREAM measure,
namely validity and responsiveness, is under way.
References
1 Ashburn A. A physical assessment for stroke patients. Physiotherapy. 1982;68:109-113.
2 Bobath B. Adult Hemiplegia: Evaluation and Treatment. London, England: William Heinemann Medical Books Ltd; 1978.
3 Brunnstrom S. Movement Therapy in Hemiplegia. New York, NY: Harper & Row; 1970.
4 Carr JH, Shepherd RB, Nordholm L, Lynne D. Investigation of a new motor assessment scale for stroke patients. Phys Ther. 1985;65:175-180.
5 Fugl-Meyer A, Jaasko L, Leyman I, et al. The post-stroke hemiplegic patient, I: a method for evaluation of physical performance. Scand J Rehabil Med. 1975;7:13-31.
6 Gowland C, Torresin W, Stratford PW. Chedoke-McMaster Stroke Assessment: a comprehensive clinical and research measure. In: Proceedings of the 11th International Congress of the World Confederation for Physical Therapy, Barbican Centre, London, England, 1991. London, England: Chartered Society of Physiotherapy; 1991;2:851-853.
7 Guarna F, Corriveau H, Chamberland J, Arsenault B. An evaluation of the hemiplegic subject based on the Bobath approach. Scand J Rehabil Med. 1988;20:1-16.
8 Lindmark B, Hamrin E. Evaluation of functional capacity after stroke as a basis for active intervention. Scand J Rehabil Med. 1988;20:103-115.
9 LaVigne JM. Hemiplegia sensorimotor assessment form. Phys Ther. 1974;54:128-134.
10 Lincoln A, Leadbitter I. Assessment of motor function in stroke patients. Physiotherapy. 1979;65:48-51.
11 Badke MB, Duncan PW. Patterns of rapid motor responses during postural adjustments when standing in healthy subjects and hemiplegic patients. Phys Ther. 1983;63:13-20.
12 Bernspang B, Asplund K, Eriksson S, Fugl-Meyer A. Motor and perceptual impairments in acute stroke patients: effects on self-care activity. Stroke. 1987;18:1081-1086.
13 Dettmann M, Linder MT, Sepic SB. Relationships among walking performance, postural stability, and functional assessments of the hemiplegic patient. Am J Phys Med. 1987;66:77-90.
14 Gowland C. Recovery of motor function following stroke: profile and predictors. Physiotherapy Canada. 1982;34:77-84.
15 Henley S, Pettit S, Todd-Pokropek A, Tupper A. Who goes home? Predictive factors in stroke recovery. J Neurol Neurosurg Psychiatry. 1985;48:1-6.
16 Loewen S, Anderson B. Predictors of stroke outcome using objective measurement scales. Stroke. 1990;21:78-81.
17 Daley K, Mayo N, Danys I, et al. The Stroke Rehabilitation Assessment of Movement (STREAM): refining and validating the content. Physiotherapy Canada. 1997;49:269-278.
18 SAS Procedures Guide for Personal Computers, Version 6 Edition. Cary, NC: SAS Institute Inc; 1985.
19 Crick T, Brennan R. GENOVA: A Generalized Analysis of Variance System, Version 2. Iowa City, Iowa: American College Testing Program Inc; 1983.
20 Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement. 1960;20:37-46.
21 Cohen J. Weighted kappa: nominal scale agreement with provision for scale disagreement or partial credit. Psychol Bull. 1968;70:213-220.
22 Kramer M, Feinstein A. Clinical biostatistics, LIV: the biostatistics of concordance. Clin Pharmacol Ther. 1981;29:111-123.
23 Fleiss J, Cohen J. The equivalence of weighted kappa and the intraclass correlation coefficient as a measure of reliability. Educational and Psychological Measurement. 1973;33:613-619.
24 Streiner D, Norman G. Health Measurement Scales: A Practical Guide to Their Development and Use. Oxford, England: Oxford University Press; 1991.
25 Feinstein A, Cicchetti D. High agreement but low kappa, I: the problems of two paradoxes. J Clin Epidemiol. 1990;43:543-549.
26 Soeken K, Prescott P. Issues in the use of kappa to estimate reliability. Med Care. 1986;24:733-741.
27 Cronbach L, Gleser G, Nanda H. The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles. New York, NY: John Wiley & Sons Inc; 1972.
28 DeVellis R. Scale Development: Theory and Applications. London, England: Sage Publications Ltd; 1991.
29 Helmstadter G. Principles of Psychological Measurement. New York, NY: Appleton-Century-Crofts; 1964.
30 Nunnally J. Psychometric Theory. New York, NY: McGraw-Hill Inc; 1978.
31 Weiner E, Stewart B. Assessing Individuals. Boston, Mass: Little, Brown & Co Inc; 1984.
32 Wilcoxon F. Individual comparisons by ranking methods. Biometrics Bulletin. 1945;1:80-89.
33 Colton T. Statistics in Medicine. Boston, Mass: Little, Brown & Co Inc; 1974.
34 Cronbach L. Coefficient alpha and the internal structure of tests. Psychometrika. 1951;16:297-334.
35 Feinstein A. Clinimetrics. New Haven, Conn: Yale University Press; 1987.

Key Words: Cerebrovascular accident, Motor recovery, Outcome measure, Reliability.

Physical Therapy . Volume 79 . Number 1 . January 1999
