
Psychology of Sport and Exercise 14 (2013) 97–102

Contents lists available at SciVerse ScienceDirect

Psychology of Sport and Exercise


journal homepage: www.elsevier.com/locate/psychsport

Review

To adjust or not adjust: Nonparametric effect sizes, confidence intervals, and real-world meaning
Andreas Ivarsson a, *, Mark B. Andersen b, Urban Johnson a, Magnus Lindwall c, d
a Center of Research on Welfare Health and Sport, Halmstad University, Sweden
b School of Sport and Exercise Science and the Institute of Sport, Exercise and Active Living, Victoria University, Melbourne, Australia
c Department of Food and Nutrition, and Sport(s) Science, University of Gothenburg, Sweden
d Department of Psychology, University of Gothenburg, Sweden

ARTICLE INFO

Article history:
Received 26 April 2012
Received in revised form 23 July 2012
Accepted 24 July 2012
Available online 1 September 2012

Keywords:
Adjusted effect size
Practical significance
Statistical interpretation

ABSTRACT

Objectives: The main objectives of this article are to: (a) investigate if there are any meaningful differences between adjusted and unadjusted effect sizes, (b) compare the outcomes from parametric and nonparametric effect sizes to determine if the potential differences might influence the interpretation of results, (c) discuss the importance of reporting confidence intervals in research, and discuss how to interpret effect sizes in terms of practical real-world meaning.
Design: Review.
Method: A review of how to estimate and interpret various effect sizes was conducted. Hypothetical examples were then used to exemplify the issues stated in the objectives.
Results: The results from the hypothetical research designs showed that: (a) there is a substantial difference between adjusted and non-adjusted effect sizes, especially in studies with small sample sizes, and (b) there are differences in outcomes between the parametric and non-parametric effect size formulas that may affect interpretations of results.
Conclusions: The different hypothetical examples in this article clearly demonstrate the importance of treating data in ways that minimize potential biases and the central issues of how to discuss the meaningfulness of effect sizes in research.
© 2012 Elsevier Ltd. All rights reserved.

The ubiquity of p < .05 in null hypothesis significance testing (NHST) is a convention that has been firmly established in research for many years (Thompson, 2002a). Even if NHST is one of the most common methods used in sport and exercise psychology to evaluate the impact of, for example, an intervention, several researchers (Andersen, McCullagh, & Wilson, 2007; Andersen & Stoové, 1998) have discussed the problems of using only NHST, because p levels may have little, if anything, to do with real-world meaning and practical value.
Nakagawa and Cuthill (2007) suggested two shortcomings of using NHST: (a) this approach does not provide an estimate of the magnitude of an effect, and (b) there is no indication of the precision of this estimate. Nakagawa and Cuthill's point means that a statistically significant result, by itself, has little to say about the clinical or practical significance of the effect (Fröhlich,

* Corresponding author. Sektionen för Hälsa och Samhälle, Halmstad Högskolan, P. O. Box 823, 30118 Halmstad, Sweden. Tel.: +46 (0) 35 16 74 48; fax: +46 (0) 35 16 72 64.
E-mail address: Andreas.Ivarsson@hh.se (A. Ivarsson).
1469-0292/$ – see front matter © 2012 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.psychsport.2012.07.007

Pieter, & Stark, 2009; Jacobson & Truax, 1991). Another criticism of NHST is that, for some studies, results may be a reflection of the power of the research design, which can be easily manipulated by changing sample sizes (Kirk, 1996). Henson (2006) presented an example where the p value in a randomized intervention study was .051. By adding just one person (under the condition that the M and SD stayed the same) to each group, the p value would decrease to .049. In this case, if one judged the intervention effect based on just the p value, then the intervention is effective in the larger study (N = 18) and not effective in the smaller study (N = 16). This issue of sample-size influences on p values in experimental and correlational designs, and how there may be potential biases when discussing the value of research findings, has also been explored in sport and exercise science (e.g., Andersen & Stoové, 1998; Edvardsson, Ivarsson, & Johnson, 2012). That p values are sensitive to sample size also influences review and meta-analytic studies that are based on NHST results. For example, Hedges and Olkin (1980) highlighted vote-counting methods as one problematic approach, because such methods often only count the number of studies that report statistically significant results when comparing means for experimental and control groups.


To take a step toward using numbers that may tell us something


about the clinical, meaningful, or practical significance of results,
many researchers (e.g., Fritz, Morris, & Richler, 2012; Thompson,
2002a) have suggested the need for reporting and interpreting
effect sizes. Also, the Publication Manual of the American Psychological Association (6th ed., APA, 2010) mandates reporting effect
sizes in quantitative research articles to enhance the interpretability of results.
In response to these recommendations, an increased number of
journals are requiring that effect sizes be reported in quantitative
studies. In sport and exercise psychology journals, Andersen et al.
(2007) found that the reporting of effect sizes had increased in
recent years. Specifically, of 54 experimental and correlational
studies examined, 44 included effect sizes in the results sections.
Even though the reporting of effect sizes has increased, Andersen
et al. also found that few of the studies with reported effect sizes
(only 7 out of 44; 16%) had interpretations of what those effects
might suggest in terms of real-world meaning.
There are a number of different effect-size indicators (Cohen, 1992) that are based on, for example, correlations, shared variance, standardized differences between means, or the degree of overlap of distributions (Grissom & Kim, 2012). In general, effect sizes can be classified into two large categories (Rosenthal, 1991). One of the categories, which contains the r and R² effect sizes, is based on correlation coefficients (e.g., correlations, regression, structural equation modeling, multi-level modeling), whereas the other category, containing Cohen's d, Glass's Δ, and Hedges's g effect sizes, is based on the standardized mean differences between two groups (e.g., t tests, ANOVAs; Keef & Roberts, 2004). From the correlation coefficient category, the bivariate r effect size (Pearson's r) is probably the most widely used. The r effect size indicates the magnitude of the correlation between two variables (Ferguson, 2009). The formula for calculating the bivariate r effect size is (Tabachnick & Fidell, 2007):

r = (NΣXY − ΣXΣY) / √{[NΣX² − (ΣX)²][NΣY² − (ΣY)²]}
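As a concrete check, the computational formula above can be translated line for line into code. The following sketch is our own illustration, not code from the article (the function name pearson_r is an assumption):

```python
import math

def pearson_r(x, y):
    """Bivariate r from raw sums, following the computational formula above."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    numerator = n * sxy - sx * sy
    denominator = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return numerator / denominator

# A perfectly linear relationship should yield r = 1.0
print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 3))  # 1.0
```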
When discussing the bivariate r effect size, it could also be useful to highlight two other effect sizes that could be used in correlational research. These two effect sizes are: the point-biserial correlation (rpb), which should be used if one of the variables in the correlation is dichotomous, and the phi coefficient (φ or rφ), which is used in correlations with two dichotomous variables (Bonett, 2007). For calculation formulas for these other correlations, see Fritz et al. (2012).
Probably the most common effect size, when comparing the standardized mean differences between two groups, is Cohen's d for independent means (Thomas, Nelson, & Silverman, 2005), which is the difference between the means of two groups divided by the pooled standard deviation of the two groups (Cohen, 1988; Nakagawa & Cuthill, 2007). This effect-size indicator is used when the aim is to compare the magnitude of difference between two conditions. One formula for calculating Cohen's d, when the distributions of both groups meet the criteria for using parametric tests and group ns are equal (Rosenthal & Rubin, 2003), is:

d = (M₁ − M₂) / √[(SD₁² + SD₂²)/2]

In the formula, M₁ and M₂ are the group means and SD₁ and SD₂ are the groups' standard deviations. Bivariate r and Cohen's d can be converted into each other by using the following formulas (Ferguson, 2009):



r = √[d² / (d² + 4)]

d = 2r / √(1 − r²)
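The pooled-SD definition of d and the two conversion formulas can be sketched in Python. This is our own minimal illustration (the function names are assumptions, not the article's code):

```python
import math

def cohen_d(m1, m2, sd1, sd2):
    """Cohen's d for equal group ns: mean difference over the pooled SD."""
    return (m1 - m2) / math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)

def d_to_r(d):
    """r from d (Ferguson, 2009)."""
    return math.sqrt(d ** 2 / (d ** 2 + 4))

def r_to_d(r):
    """d from r (Ferguson, 2009)."""
    return 2 * r / math.sqrt(1 - r ** 2)

# The two conversions are inverses of each other:
print(round(r_to_d(d_to_r(0.50)), 3))  # 0.5
```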
An additional group of effect sizes that has also been used is based on odds ratios (OR; odds ratio, relative risk, risk difference). The OR group of effect sizes is used to compare relative likelihood or risk for a specific outcome between two or more groups (Ferguson, 2009). The odds ratio is calculated by the formula OR = AD/BC. The letters A, B, C, and D represent observed cell frequencies when a study has two groups and two possible outcomes (A = group 1/outcome 1; B = group 1/outcome 2; C = group 2/outcome 1; D = group 2/outcome 2; Nakagawa & Cuthill, 2007). The OR effect size can be converted into r by using Pearson's (1900) formula:

r = cos[π / (1 + √OR)]
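Pearson's approximation can be verified with a short sketch (our own illustration; the function names are assumptions). Note that an OR of 1 means no association, which maps onto r = 0:

```python
import math

def odds_ratio(a, b, c, d):
    """OR = AD/BC from the four cell frequencies of a 2 x 2 table."""
    return (a * d) / (b * c)

def or_to_r(odds):
    """Pearson's (1900) approximation: r = cos(pi / (1 + sqrt(OR)))."""
    return math.cos(math.pi / (1 + math.sqrt(odds)))

print(round(or_to_r(1.0), 3))  # 0.0, i.e., no association
```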
Our intention with this article is to highlight and discuss a few
practical issues that might occur when using and interpreting effect
sizes. The main purposes of this article are to: (a) investigate if there
are any meaningful differences between adjusted and unadjusted
effect sizes, (b) compare outcomes from parametric and nonparametric effect sizes that may affect interpretations of results,
(c) discuss the importance of reporting confidence intervals for
effect sizes in research, and explore how to interpret effect sizes in
terms of practical real-world meaning.
Adjusted vs. unadjusted effect sizes
One issue that is rarely discussed in the sport and exercise
psychology literature is that any effect size can be in one of two
forms, adjusted and unadjusted. The difference between these two
conditions is that in an adjusted effect size the magnitude of the
effect is "corrected" to allow for generalization to the population.
The unadjusted effect size is sample specic and tends to be an
overestimation of the population effect size (Thompson, 2006).
Thompson (2002a) listed three different design issues that will affect the potential sampling variance: (a) sample size, (b) number of variables measured, and (c) population effect size. In order to adjust for the potential sampling variance, several formulas have been suggested. All formulas have in common that it is the R² value that is adjusted. Probably the best-known formula for adjusted R² was developed by Ezekiel (1930).¹ Wang and Thompson (2007) found that Ezekiel's formula, in comparison to other suggested formulas (e.g., Claudy's, 1978), provides a better and more reliable result (in most cases). Ezekiel's formula states that an adjusted standardized difference (d*) can be calculated from the unadjusted standardized difference (d). The steps for this calculation are: (a) converting the d into an r, using the formula r = d/√(d² + 4); (b) squaring the r; (c) using Ezekiel's formula to calculate the adjusted effect r²*, where the formula is r²* = r² − (1 − r²)[v/(N − v − 1)] and v is the number of predictor variables; and (d) taking the square root of r²* (i.e., r*) and then using this value to calculate d* with the formula d* = 2r*/√(1 − r²*) (Thompson, 2002a). In addition, Ezekiel's formula can be used to correct bivariate r effect sizes (Wang & Thompson, 2007).
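The four steps can be collected into one small function. The sketch below is our own illustration (the name adjusted_d and the None return for a negative corrected r² are assumptions, not the article's code):

```python
import math

def adjusted_d(d, n, v=1):
    """Ezekiel-adjusted d* from an unadjusted d, total sample size n,
    and v predictor variables. Returns None when the corrected r**2
    turns negative (small samples with small effects)."""
    r2 = d ** 2 / (d ** 2 + 4)                  # steps (a) and (b)
    r2_adj = r2 - (1 - r2) * (v / (n - v - 1))  # step (c), Ezekiel's formula
    if r2_adj < 0:
        return None
    r_adj = math.sqrt(r2_adj)                   # step (d)
    return 2 * r_adj / math.sqrt(1 - r2_adj)

# Reproduces the first row of Table 1: d = .50 at N = 20 shrinks to about .16
print(round(adjusted_d(0.50, 20), 2))  # 0.16
```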

¹ Ezekiel's (1930) correction formula is used in SPSS to calculate the adjusted R².



Table 1
Unadjusted vs. adjusted effect sizes (d) for hypothesized conditions.

Sample size    Unadj. d    Adj. d*    Difference (%)
N = 20         .50         .16        68%
N = 40         .50         .38        24%
N = 80         .50         .44        12%
N = 160        .50         .47        6%
N = 20         .80         .65        18.75%
N = 40         .80         .71        11.25%
N = 80         .80         .76        5%
N = 160        .80         .77        3.75%

Researchers have, for many years, argued about the importance of using adjusted effect sizes instead of unadjusted ones. Ezekiel (1930), Thompson (2002a), and Wang and Thompson (2007) highlighted that it is important to use the adjusted ones, especially in studies with small sample sizes. On the other hand, Roberts and Henson (2002) stated that the differences between adjusted and unadjusted effect sizes are so close to zero that they have no practical value. With small samples and small effects, Ezekiel's equations fall apart and sometimes produce negative r²s, whose square roots are not real numbers and are impossible to use to calculate d*. As an example, let us say that we have a study with a sample of 16 participants. The Cohen's d effect size was .20, and the r² was therefore .0099. Using Ezekiel's correction formula to adjust R², the calculation is .0099 − (1 − .0099)(1/(16 − 1 − 1)) = −.06. This correction formula is not really any correction at all. A negative R² is a fundamental problem because it means that the predictor variable would explain less than 0% of the variance, which is nonsensical. The problem with negative R² values has been discussed by Vacha-Haase and Thompson (2004), and Leach and Henson (2007) suggested replacing the negative Ezekiel values with zeros in the calculation formula (equal to Cohen's d = 0), but this suggestion is not particularly helpful. To illustrate whether there are any meaningful differences between adjusted and unadjusted ds, we have, in Table 1, used medium (.50) and large (.80) unadjusted ds to illustrate the patterns of changes in adjusted d*s as one increases sample size with Ns of 20, 40, 80, and 160.
In Table 1, the results from the fictional cases show that conditions with few participants have more biased effect sizes than studies with more participants. As an example, we can imagine a study aimed at investigating a cognitive behavioral therapy (CBT)-based intervention's effects on self-confidence in ice hockey. For this study, we have 20 players, divided into two groups: experimental and control. Before the intervention started, the players were asked to complete a self-confidence questionnaire. After 10 weeks of CBT interventions, the players were asked to complete the same questionnaire as they did before the intervention started. The calculated unadjusted Cohen's d effect size for the post-intervention scores is .50, indicating that the intervention positively influenced the players' self-confidence. But if we use Ezekiel's formula to correct for sampling variance, the adjusted Cohen's d is .16, a difference of 68%. The adjusted Cohen's d for the intervention effect size is small and probably would not be considered as having any practical significance. If we had done the same study, with the same Cohen's d, but with 400 participants in each group (experimental and control), the unadjusted and adjusted ds would be .50 and .49, respectively (a 2% difference). This example clearly shows the importance of adjusting for sampling biases when the sample size is small (as it often is in sport psychology studies).
Effect sizes for parametric vs. nonparametric tests
Another "adjustment" to effect size reporting would be to
determine when to use parametric versus nonparametric formulas.


Effect sizes are used to answer the question, "How big is it?" (i.e., what is the magnitude of the effect?; Nakagawa & Cuthill, 2007). As stated previously in this article, there are several different types of effect sizes. Most effect size estimates carry the assumption that the data are reasonably normally distributed. For differences between two independent groups when nonparametric tests have been performed, the value of the z distribution can be used to calculate the effect size (Fritz et al., 2012). To calculate the effect size (in this case the point-biserial correlation rpb) for some nonparametric tests, the formula rpb = z/√N could be used. In the formula, z is the z value that would be obtained from performing a Mann–Whitney or Wilcoxon test, or it could be calculated by hand, and N is the sample size in the study. To use the effect size estimate rpb to calculate the Cohen's d value, the formula d = 2rpb/√(1 − rpb²) is used (Fritz et al., 2012). In looking through the literature, only a few studies have used this formula for nonparametric tests. Considering the low number of articles using the formula for nonparametric tests, there might be significant underestimations of effect sizes and the subsequent interpretations of their practical significance.
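The two-step conversion (z to rpb to d) can be sketched as follows. This is our own illustration of the Fritz et al. (2012) formulas, not code from the article:

```python
import math

def d_from_z(z, n):
    """Cohen's d from the z value of a Mann-Whitney or Wilcoxon test."""
    rpb = z / math.sqrt(n)  # point-biserial correlation
    return 2 * rpb / math.sqrt(1 - rpb ** 2)

# With z = 2.01 and N = 30 (the values used in the injury example below),
# the nonparametric effect size comes out at about .79
print(round(d_from_z(2.01, 30), 2))  # 0.79
```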
To illustrate with an example, we present a fictional study with the aim to test a preventive intervention for lowering sport injury occurrence. In the study design, two groups with equal ns exist, one intervention and one control, and the outcome variable is number of injuries per person (e.g., 0, 1, 2). The data are substantially skewed, with more than half the participants in the intervention group receiving scores of "0" injuries. The mean and SD for the intervention group are M = .40 and SD = .737, and for the control group M = .93 and SD = .799. Using these values results in a Cohen's d of .69:

d = (0.93 − 0.40) / √[(.799² + .737²)/2] = .69
If we instead use the z value from a Mann–Whitney U test (with the same data) to calculate the Cohen's d with the formula Fritz et al. (2012) suggested, the effect size will be .79:

rpb = 2.01/√30 = .368; Cohen's d = 2(.368)/√(1 − .368²) = .79

This gives a difference of .1 in the Cohen's d effect size between the two formulas. That difference may or may not be meaningful, but in the realm of injuries, and the personal, financial, and performance costs, it may have practical significance. In this case, using the parametric instead of the nonparametric effect size will result in an underestimation of the intervention effect (and possibly an underestimation of the real-world value of the intervention). The same potential bias exemplified above could also be present in correlational studies (Ferguson, 2009). Therefore, it is of equal importance to use the appropriate correlational analysis (Pearson's r or Spearman's rho, rs) in order not to violate the assumptions of parametric and nonparametric effect sizes. Spearman's rs coefficient is the ordinal-level-of-measurement (ranks) equivalent of Pearson's r (Rupinski & Dunlap, 1996), but the coefficient is, in general when computed on the same data, smaller than its r counterpart. Bishara and Hittner (2012) emphasized the importance of using rs for studies with small sample sizes and/or with data that are substantially non-normally distributed. As an illustrative example, we suggest a study aimed to investigate the relationship between physical self-perception and physical fitness. The sample is 109 moderately active adults between 25 and 60 years of age. The calculated mean for self-perception is 13.89 (SD = 4.31, skew = .42, kurtosis = .91), and for physical fitness the mean is 8.94 (SD = 4.24, skew = 7.69, kurtosis = 3.21). The fitness data are not normally distributed. Correlational analyses, using both Pearson's r and rs, resulted in an


r = .21 and an rs = .18. To transform an rs coefficient into a Pearson's r, the following formula could be used (Rupinski & Dunlap, 1996):

r = 2 sin(π·rs/6)

The result from this calculation, using an rs of .18, is that the estimated Pearson's r is approximately .188. The result shows a difference between the two Pearson's r coefficients, where the coefficient calculated from rs is smaller than the effect size that was directly calculated from the Pearson's r formula. In this case, however, the difference is not that large (.21 vs. .188).
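The Rupinski and Dunlap (1996) transformation is a one-liner in code. The sketch below (the function name is ours) reproduces the .18 to roughly .188 conversion from the example:

```python
import math

def r_from_rs(rs):
    """Estimate Pearson's r from Spearman's rs: r = 2 sin(pi * rs / 6)."""
    return 2 * math.sin(math.pi * rs / 6)

print(round(r_from_rs(0.18), 3))  # 0.188
```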
Interpretations of effect sizes: what is meaningful?
Even though the reporting of effect sizes has increased in the sport and exercise psychology literature, Andersen et al. (2007) have suggested that real-world interpretation of what effect sizes mean is still not common practice. To supply researchers with conventions for how to interpret effect sizes for differences between groups, Cohen (1988) suggested three categories: small (d = .20, r = .10, OR = 1.50), medium (d = .50, r = .24, OR = 3.50), and large (d = .80, r = .37, OR = 5.10). But Cohen's conventions are just that, conventions and not hard rules. Kraemer et al. (2003) recommended that researchers should not only use these suggested categories when discussing the practical value of the study but also consider what might be a clinically or meaningfully significant effect in real-world terms. A small effect, by Cohen's conventions, might translate to outcomes that have large effects in terms of costs and benefits for the population in question. In discussing the practical value of an effect size, Vacha-Haase and Thompson (2004) recommended considering what variables are being measured as well as the context of the study.
To help researchers interpret whether a result is meaningful (e.g., clinical significance; see Thompson, 2002a), several different statistics have been developed (Fritz et al., 2012; Kraemer et al., 2003). Three examples are: confidence intervals for effect sizes (CI; Thompson, 2002b), probability of superiority (PS; Fritz et al., 2012; Grissom, 1994) combined with the common language effect size (Dunlap, 1994; McGraw & Wong, 1992), and number needed to treat (NNT; Nuovo, Melinkow, & Chang, 2002). A CI describes the interval where most (90% or 95%) of the participants in a study are located for a specific variable (Thompson, 2002b) and could be used to compare the results from one study with results from other studies (Thompson, 2002b; Wilkinson & The Task Force on Statistical Inference, 1999). CIs for effect sizes, such as Cohen's d, at the 95% level, can be calculated with the formula 95% CI = ES − 1.96se to ES + 1.96se (Nakagawa & Cuthill, 2007), where se is the asymptotic standard error of the effect size (to calculate the 90% level, change 1.96 to 1.645). The se value, based on the Cohen's d effect size, can be calculated with the formula:

 
se(d) = √{[(n₁ + n₂ − 1)/(n₁ + n₂ − 3)] · [4/(n₁ + n₂)] · (1 + d²/8)}

The formula for calculating the se value, using bivariate r, is:

se(r) = 1/√(n − 3)
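Putting the two se formulas and the CI formula together, a confidence interval for Cohen's d can be sketched as below (our own illustration; the function names are assumptions):

```python
import math

def se_d(d, n1, n2):
    """Asymptotic standard error of d (Nakagawa & Cuthill, 2007)."""
    n = n1 + n2
    return math.sqrt(((n - 1) / (n - 3)) * (4 / n) * (1 + d ** 2 / 8))

def se_r(n):
    """Asymptotic standard error of bivariate r."""
    return 1 / math.sqrt(n - 3)

def ci_for_d(d, n1, n2, z=1.96):
    """CI for d: ES +/- z * se; use z = 1.645 for a 90% interval."""
    margin = z * se_d(d, n1, n2)
    return (d - margin, d + margin)

# A medium effect from two groups of 10 has a very wide 95% CI crossing zero
low, high = ci_for_d(0.50, 10, 10)
print(round(low, 2), round(high, 2))  # -0.44 1.44
```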
If the 95% or the 90% CI for the ES does not include .0, or a negative number, then one can be fairly confident that some effect has taken place. This interpretation reflects NHST in that the effect is significant at p < .05 (for the 95% confidence interval) or p < .10 (for the 90% confidence interval). Specifically, a CI for an ES states that there is a 95% or a 90% chance that the true population effect is between the lower and upper scores in the CI (Finch & Cumming, 2009). In discussing CIs, it is important to state that CIs are sensitive to violations of normality (Thompson, 2002b) as well as to sample sizes and standard deviations (Finch & Cumming, 2009). This sensitivity might lead to inaccurate interpretations of results. It is important not to use too-small or too-heterogeneous samples when calculating the CI.
The second statistic we have chosen to present is the probability of superiority index (PS; Fritz et al., 2012). PS was developed to give the percentage of occasions on which a randomly chosen participant from the group with the higher mean will have a higher score than a randomly chosen participant from the other group (in a two-group design; Fritz et al., 2012; Grissom, 1994). The PS can, when raw data are not available, be calculated from sample means and variances using the estimator that McGraw and Wong called the "common language effect size indicator" (CL; Grissom & Kim, 2012). The formula for calculating the CL, based on a z score, is (McGraw & Wong, 1992):



Z_CL = (X̄₁ − X̄₂) / √(S₁² + S₂²)
The proportion of the area under the normal curve that falls below Z_CL is the CL statistic that one can use to estimate the PS (Grissom, 1994; Grissom & Kim, 2012). As an example, one could have a study from which the value of the calculated Z_CL is 1.0. For the 1.0 value, the proportion of the area under the normal curve is estimated to be .84. If the researcher has the data available, the calculation formula for PS is PS = U/(mn), where U is the Mann–Whitney statistic and m and n are the sample sizes for the groups.
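Both routes to the PS, the CL estimate from means and variances and the exact U-based calculation, can be sketched as follows (our own illustration; the normal CDF is computed via the error function):

```python
import math

def norm_cdf(z):
    """Proportion of the area under the standard normal curve below z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def cl_from_moments(m1, m2, var1, var2):
    """CL (McGraw & Wong, 1992): area below Z_CL = (M1 - M2)/sqrt(S1^2 + S2^2)."""
    z_cl = (m1 - m2) / math.sqrt(var1 + var2)
    return norm_cdf(z_cl)

def ps_from_u(u, m, n):
    """Exact PS from the Mann-Whitney U statistic and the two group sizes."""
    return u / (m * n)

# The article's illustration: Z_CL = 1.0 corresponds to a CL of about .84
print(round(norm_cdf(1.0), 2))  # 0.84
```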
As one example of how to use the PS, let us say that we have conducted a study with the aim to increase the participants' general subjective well-being by using a 4-week stress management intervention. The results of the study show an increase in well-being for the intervention group compared to no change for the control group, with a Cohen's d of .75, which is equal to a PS score of approximately 70 (to obtain the formula for the PS calculation for different estimates of effect size, see Grissom, 1994; Ruscio, 2008). A PS score of 70 states that if participants were sampled randomly, one from each of the groups, the score of the one from the condition with the higher mean (in this example the intervention group) would be higher than that of the one from the control group for 70% of the pairs. Given that subjective well-being is an important variable, and that the experimental group outscored the control group in 70% of such pairs at the end of the study, these results point to the potential practical value of the intervention.
A third example of an indicator developed to clarify the practical value of effect sizes is number needed to treat (NNT). The NNT
score is the number of participants who must be treated to give
one more success/one less failure as one outcome of an intervention. The NNT effect size indicator is primarily used in research
with one dichotomous outcome variable (Ferguson, 2009). To calculate the NNT indicator, the proportion of failure cases, in decimal form, in the experimental group should be subtracted from the proportion of failure cases in the control group. The result of this calculation is known as the risk difference (RD). One formula for calculating the NNT, using the RD score, is 1/RD. In this formula, the
result of 1 is the best NNT score indicating that the treatment is
perfect (i.e., all participants in the experimental group have
improved whereas no participants in the control group have;
Kraemer et al., 2003). In order to illustrate the use of NNT, we will


examine a fictional study with interventions aimed at preventing sport injuries. The study design involves one experimental group and one control group, and the outcome variable for the study is injury or no-injury during the competitive season. The calculated effect size (Cohen's d) for this fictional study is .38 (smallish). The percentage of injured athletes in the experimental group was 46%, whereas 80% of the participants in the control group experienced at least one injury during the competitive season. The calculated RD is .80 − .46 = .34, and the NNT is 1/.34 = 2.94. The calculated NNT score indicates that approximately one person out of three had a beneficial outcome due to the intervention. So, is the result from this example of practical value, even if it showed only a smallish effect size? To decide, we have to consider that injuries are a common problem in sports and that there are substantial health, financial, performance, and happiness costs associated with them. If we are able to help one out of three athletes who take part in the intervention, we would argue that it has meaningful practical value even if the study showed a smallish effect size. This example also shows the importance of taking the context into consideration when discussing the meaningfulness of effect sizes and not simply using Cohen's conventions of small, medium, and large effects.
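The NNT arithmetic from the injury example can be sketched in a few lines (our own illustration; the function name is an assumption):

```python
def nnt(failure_rate_control, failure_rate_experimental):
    """Number needed to treat: the reciprocal of the risk difference."""
    risk_difference = failure_rate_control - failure_rate_experimental
    return 1 / risk_difference

# 80% injured in the control group vs. 46% in the experimental group
print(round(nnt(0.80, 0.46), 2))  # 2.94
```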

Summary

The overall aim of this article was to highlight and discuss some important issues around reporting effect sizes in sport and exercise psychology research. The different fictional research examples in this article clearly demonstrate the importance of treating data in proper ways to minimize potential biases, but also how to discuss the practical value of effect sizes in research. The fictional examples also illustrate three major points. First, our examples suggest that it is important to use adjusted effect sizes, especially for studies with small samples and large effect sizes, to avoid overestimations. Second, using parametric effect size formulas for nonparametric data will often result in possibly misleading effect sizes, with either underestimations of the effects (e.g., for data with one categorical variable) or overestimations of the effects (e.g., for correlational data). Given that our hypothetical examples showed differences both between parametric and non-parametric and between adjusted and non-adjusted effect sizes, choosing the proper formula is important for interpreting results (e.g., results from a meta-analysis). For example, meta-analyses are performed to determine the mean effect size across studies (Iaffaldano & Muchinsky, 1985), and the results could be biased if the studies integrated in the analysis had positively (or negatively) biased effect sizes. For researchers conducting meta-analyses, if there is enough information to adjust unadjusted effect sizes or to use nonparametric calculations to transform what may be biased effect sizes, then the researchers could do their own adjusting of results before effect sizes are entered into meta-analyses. Third, the article highlights three indicators (i.e., CI, PS, NNT) that have been developed to assess effect sizes, and researchers may want to consider the advantages of reporting these indicators as complements in their discussions about how to interpret research findings.

References
American Psychological Association. (2010). Publication manual of the American
Psychological Association (6th ed.). Washington, DC: Author.
Andersen, M. B., McCullagh, P., & Wilson, G. (2007). But what do the numbers really
tell us? Arbitrary metrics and effect size reporting in sport psychology research.
Journal of Sport & Exercise Psychology, 29, 664e672.

101

Andersen, M. B., & Stoové, M. A. (1998). The sanctity of p < .05 obfuscates good
stuff: a comment on Kerr and Goss. Journal of Applied Sport Psychology, 10,
168–173. http://dx.doi.org/10.1080/10413209808406384.
Bishara, A. J., & Hittner, J. B. (2012). Testing the significance of a correlation with
nonnormal data: comparison of Pearson, Spearman, transformation, and
resampling approaches. Psychological Methods. Advance online publication.
http://dx.doi.org/10.1037/a0028087.
Bonett, D. G. (2007). Transforming odds ratios into correlations for meta-analytic
research. American Psychologist, 62, 254–255.
Claudy, J. G. (1978). Multiple regression and validity estimation in one sample.
Applied Psychological Measurement, 2, 595–607. http://dx.doi.org/10.1177/
014662167800200414.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.).
Hillsdale, NJ: Erlbaum.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159. http://
dx.doi.org/10.1037/0033-2909.112.1.155.
Dunlap, W. P. (1994). Generalizing the common language effect size indicator to
bivariate normal correlations. Psychological Bulletin, 116, 509–511. http://
dx.doi.org/10.1037/0033-2909.116.3.509.
Edvardsson, A., Ivarsson, A., & Johnson, U. (2012). Is a cognitive-behavioural
biofeedback intervention useful to reduce injury risk in junior football
players? Journal of Sport Science and Medicine, 11, 331–338.
Ezekiel, M. (1930). The sampling variability of linear and curvilinear regressions:
a first approximation to the reliability of the results secured by the graphic
"successive approximation" method. The Annals of Mathematical Statistics, 1,
275–315, 317–333. http://dx.doi.org/10.1214/aoms/1177733062.
Ferguson, C. J. (2009). An effect size primer: a guide for clinicians and researchers.
Professional Psychology: Research and Practice, 40, 532–538. http://dx.doi.org/
10.1037/a0015808.
Finch, S., & Cumming, G. (2009). Putting research in context: understanding
confidence intervals from one or more studies. Journal of Pediatric Psychology,
34, 903–916. http://dx.doi.org/10.1093/jpepsy/jsn118.
Fritz, C. O., Morris, P. E., & Richler, J. J. (2012). Effect size estimates: current use,
calculations, and interpretation. Journal of Experimental Psychology: General,
141, 2–18. http://dx.doi.org/10.1037/a0024338.
Fröhlich, M., Emrich, E., Pieter, A., & Stark, R. (2009). Outcome effects and effect
sizes in sport sciences. International Journal of Sports Science and Engineering, 3,
175–179.
Grissom, R. J. (1994). Probability of superior outcome of one treatment over another.
Journal of Applied Psychology, 79, 314–316. http://dx.doi.org/10.1037/
0021-9010.79.2.314.
Grissom, R. J., & Kim, J. J. (2012). Effect sizes for research: Univariate and multivariate
applications. New York, NY: Taylor & Francis.
Hedges, L. V., & Olkin, I. (1980). Vote-counting methods in research synthesis.
Psychological Bulletin, 88, 359–369. http://dx.doi.org/10.1037/0033-2909.88.2.359.
Henson, R. K. (2006). Effect-size measures and meta-analytic thinking in counseling
psychology research. The Counseling Psychologist, 34, 601–629. http://
dx.doi.org/10.1177/0011000005283558.
Iaffaldano, M., & Muchinsky, P. M. (1985). Job satisfaction and job performance:
a meta-analysis. Psychological Bulletin, 97, 251–273.
Jacobson, N. S., & Truax, P. (1991). Clinical significance: a statistical approach to
defining meaningful change in psychotherapy research. Journal of Consulting and
Clinical Psychology, 59, 12–19. http://dx.doi.org/10.1037/0022-006X.59.1.12.
Keef, S. P., & Roberts, L. A. (2004). The meta-analysis of partial effect sizes. British
Journal of Mathematical and Statistical Psychology, 57, 97–129. http://dx.doi.org/
10.1348/000711004849303.
Kirk, R. (1996). Practical significance: a concept whose time has come. Educational
and Psychological Measurement, 56, 746–759. http://dx.doi.org/10.1177/
0013164496056005002.
Kraemer, H. C., Morgan, G. A., Leech, N. L., Gliner, J. A., Vaske, J. J., & Harmon, R. J.
(2003). Measures of clinical significance. Journal of the American Academy of
Child and Adolescent Psychiatry, 42, 1524–1529. http://dx.doi.org/10.1097/
01.chi.0000091507.46853.d1.
Leach, L. F., & Henson, R. K. (2007). The use and impact of adjusted R2 effects in
published regression research. Multiple Linear Regression Viewpoints, 33, 1–11.
McGraw, K. O., & Wong, S. P. (1992). A common language effect size statistic.
Psychological Bulletin, 111, 361–365. http://dx.doi.org/10.1037/
0033-2909.111.2.361.
Nakagawa, S., & Cuthill, I. C. (2007). Effect size, confidence interval and statistical
significance: a practical guide for biologists. Biological Reviews, 82, 591–605.
http://dx.doi.org/10.1111/j.1469-185X.2007.00027.x.
Nuovo, J., Melnikow, J., & Chang, D. (2002). Reporting number needed to treat and
absolute risk reduction in randomized controlled trials. Journal of the American
Medical Association, 287, 2813–2814. http://dx.doi.org/10.1001/jama.287.21.2813.
Pearson, K. (1900). Mathematical contributions to the theory of evolution. VII. On
the correlation of characters not quantitatively measurable. Philosophical
Transactions of the Royal Society of London. Series A, Containing Papers of a
Mathematical or Physical Character, 195, 1–47.
Roberts, K. J., & Henson, R. K. (2002). Correction for bias in estimating effect sizes.
Educational and Psychological Measurement, 62, 241–252. http://dx.doi.org/
10.1177/0013164402062002003.
Rosenthal, R. (1991). Meta-analytic procedures for social research. Newbury Park, CA:
Sage.
Rosenthal, R., & Rubin, D. B. (2003). r_equivalent: a simple effect size indicator.
Psychological Methods, 8, 492–496. http://dx.doi.org/10.1037/1082-989X.8.4.492.

Rupinski, M. T., & Dunlap, W. P. (1996). Approximating Pearson product-moment
correlations from Kendall's tau and Spearman's rho. Educational and
Psychological Measurement, 56, 419–429. http://dx.doi.org/10.1177/
0013164496056003004.
Ruscio, J. (2008). A probability-based measure of effect size: robustness to base
rates and other factors. Psychological Methods, 13, 19–30. http://dx.doi.org/
10.1037/1082-989X.13.1.19.
Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.).
Boston, MA: Pearson Education.
Thomas, J. R., Nelson, J. K., & Silverman, S. J. (2005). Research methods in physical
activity. Champaign, IL: Human Kinetics.
Thompson, B. (2002a). "Statistical", "practical", and "clinical": how many kinds of
significance do counselors need to consider? Journal of Counseling and
Development, 80, 64–71. http://dx.doi.org/10.1002/j.1556-6678.2002.tb00167.x.
Thompson, B. (2002b). What future quantitative social science research could look
like: confidence intervals for effect sizes. Educational Researcher, 31, 25–32.
http://dx.doi.org/10.3102/0013189X031003025.
Thompson, B. (2006). Foundations of behavioral statistics: An insight-based approach.
New York, NY: Guilford Press.
Vacha-Haase, T., & Thompson, B. (2004). How to estimate and interpret various
effect sizes. Journal of Counseling Psychology, 51, 473–481. http://dx.doi.org/
10.1037/0022-0167.51.4.473.
Wang, Z., & Thompson, B. (2007). Is the Pearson r2 biased, and if so, what is the best
correction formula? Journal of Experimental Education, 75, 109–125. http://
dx.doi.org/10.3200/JEXE.75.2.109-125.
Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods
in psychology journals: guidelines and explanations. American Psychologist, 54,
594–604. http://dx.doi.org/10.1037/0003-066X.54.8.594.