
Outcome Research Design


Counseling Outcome Research and Evaluation 1(2) 1-18 © The Author(s) 2010. Reprints and permission: sagepub.com/journalsPermissions.nav DOI: 10.1177/2150137810373613 http://core.sagepub.com

Statistical Power, Sampling, and Effect Sizes: Three Keys to Research Relevancy
Christopher A. Sink1 and Nyaradzo H. Mvududu2

Abstract
This article discusses the interrelated issues of statistical power, sampling, and effect sizes when conducting rigorous quantitative research. Technical and practical connections are made between these concepts and various inferential tests. To increase power and generate effect sizes that merit practical or clinical notice, not only must the research aims and associated design be well devised, reflecting best scientific practice, but state-of-the-art sampling procedures must also be applied with a sufficiently large and representative number of participants. Applications to research conducted in the counseling profession are included.

Keywords
statistical power, effect sizes, sampling, research method, counseling
Received 23 November 2009. Revised 9 February 2010. Accepted 22 April 2010.

As most experienced researchers understand, a publishable study with generalizable and meaningful findings necessitates a well-conceived and executed research design. In contrast, quantitative studies lacking the rigor and quality essential for scholarly journals may be criticized for a variety of research-related flaws, such as inadequate controls, unrepresentative sampling, loose data collection procedures, as well as instrumentation deficiencies, outcomes with trivial professional significance, and other confounding factors affecting internal and external validity (Babbie, 2010; Moss et al., 2009; Rowland & Thornton, 2002). One of the major reasons for poor research performance by academics is the lack of an effective research skill set, including method skills (Wheeler, Seagren, Becker, Kinley, & Mlinek, 2008). Even though these research concerns are largely avoidable, investigator errors continue with some regularity, often with serious unintended consequences. For example, these errors may reduce statistical power and the magnitude of effect sizes (ESs). In certain situations, the researcher might end up rejecting the wrong hypothesis and advancing erroneous conclusions (Huck, 2009).

1 Department of Counselor Education, Seattle Pacific University, Seattle, WA, USA
2 Department of Curriculum and Instruction, Seattle Pacific University, Seattle, WA, USA

Corresponding Author: Christopher A. Sink, Department of Counselor Education, Seattle Pacific University, School of Education, 3307 Third Avenue West, Seattle, WA 98155, USA. Email: csink@spu.edu


If the ultimate goal of most counseling-related research is to positively affect the profession and the work of practitioners and their clients, knowing the basics and the nuances of quality research is indispensable. Succinctly put, counseling research must reflect best practices (Osborne, 2008). In an attempt to inform nascent and perhaps more knowledgeable counseling researchers, the central intent of this article is to discuss the primary interconnections between sampling, statistical power, and ESs. To illustrate how these notions may influence the results and conclusions of a counseling-related investigation, a hypothetical study is initially provided.

Research Scenario
A researcher investigated the efficacy of narrative therapy (NT) on client sense of self-efficacy using two convenience (nonprobability) samples of caregiver-referred adolescents (ages 14 to 18). All participants received a minimum of 10 counseling sessions from licensed mental health practitioners at several urban mental health clinics. Prior to the quasi-experimental1 study's onset, one group of participants, the experimental (e) group, reported feeling mild-to-moderate depression and anxiety for at least 3 months. Participants comprising the comparison (c) group had exhibited moderate school behavior problems (SBP; e.g., acting out in class, fighting) for at least 3 months, as reported by the school counselor. For reasons made clear later, to improve statistical power, the researcher was able to obtain 50 participants for each group. Each participant was individually pretested on a self-efficacy measure by the investigator. Following a 10-session intervention interval, the groups were readministered the inventory (posttest), and 5 weeks later, again asked to complete the instrument (follow-up test). The independent variable of primary interest was group, with two levels (experimental and comparison), and the dependent variable was client perceptions of self-efficacy (total score) measured at three intervals (pre, post, and follow-up). Based on the research literature, the researcher hypothesized (i.e., alternative hypothesis) that the experimental sample after NT would report significantly higher self-efficacy scores than the comparison group at the time of post- and follow-up testing. In other

words, the researcher anticipated, based on theoretical assumptions and previous research, that NT is more efficacious for clinical samples showing anxiety and depression (internalizing behavioral symptoms) than for school-based samples with SBP (externalizing behavioral symptoms). Using this scenario, we pose two related questions: (a) What is the best way to ensure that the research study has sufficient statistical power to find statistically significant group differences on the self-efficacy measure as well as to generate consequential outcomes (sizable ESs) that clinicians will take note of? (b) What influence do sampling and sample size exert on statistical power and ESs? In response, we first tackle the challenging notion of statistical power, moving then to sampling issues, and finally, to practical significance.

Statistical Power
The goal of inferential statistics is testing particular hypotheses about potential group differences or correlations between variables. Statistical power, a concept directly linked with inferential testing, concerns the ability to detect group differences or nonzero correlations. Competent researchers estimate power early on as they design their studies (Huck, 2009). Statistical procedures that are powerful have a greater likelihood of finding any true effect that may exist. Metaphorically, the concept of statistical power can be likened to the process of magnification (Meyers, Gamst, & Guarino, 2006). A more powerful magnifying glass has the ability to show greater detail. Similarly, a more powerful statistical test used to examine data can better reveal a significant result. In practitioner language, power is the odds that a researcher will observe a treatment/intervention effect when it occurs (Trochim, 2006) or, as Cohen (1988) suggests, statistical power is the probability that the researcher will come to the conclusion that the phenomenon under investigation actually exists. In our research scenario above, the investigator's research goal is to detect any differential group effect following the NT intervention period (pre- to posttesting). In other words,
our experimenter wants to correctly reject the null hypothesis (μexperimental = μcomparison) and instead affirm the alternative hypothesis (μexperimental > μcomparison).2 Retaining the alternative hypothesis when it was true would indicate that at the time of posttesting adolescents in the experimental group (i.e., participants exhibiting anxiety and depression) reported significantly higher self-efficacy scores than those adolescents in the comparison group (i.e., participants exhibiting SBP). To accomplish this end, the researcher will need to maximize statistical power. Power is often expressed as 1 − β, where β represents the likelihood of committing a Type II error (i.e., the probability of incorrectly retaining the null hypothesis). Betas can range from .00 to 1.00. When the beta is very small (close to .00), the statistical test has the most power. For example, if the beta equals .05, then statistical power is .95. Multiplying statistical power by 100 yields a power estimate as a percentage. Thus, 95% power (1 − β = .95, or 95%) suggests that there is a 95% probability of correctly finding a significant result if an effect exists. Typically, a power index greater than .80 (or β < .20) is considered statistically powerful (Park, 2003).

In summary, rigorous quantitative studies need to be well designed to generate adequate power to detect any statistically significant group effects or correlations (Huck, 2009). A priori power analysis helps researchers determine (a) how large the sample sizes need to be to generate sufficient power, or (b) whether the power for fixed sample sizes is large enough to justify moving ahead with the study. Several statistical power calculators exist online, which allow researchers to estimate how large the samples must be to generate sufficient power. In some variants, the online calculators will yield sample error computations as well. An extensive website developed by StatPages.org provides a wide array of statistical options to choose from, including power, sample size, and experimental design calculations (see http://statpages.org/index.html#Power). Another useful option is available through Lenth (2006-2009). This resource provides Java applets (software) for power and sample size, allowing the researcher to more effectively plan statistical studies.
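
To make the a priori logic concrete, the sketch below, which assumes the Python statsmodels package is available, mirrors what such calculators report for an independent-samples t test: it solves for the per-group sample size needed to reach a target power and, conversely, checks the power of a design whose group sizes are already fixed. The effect size value is illustrative, not taken from the scenario.

```python
# A minimal a priori power sketch for an independent-samples t test,
# assuming the statsmodels package; values are illustrative only.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# (a) Per-group n needed for 80% power at alpha = .05, given a
#     hypothesized standardized effect (Cohen's d) of .50.
n_per_group = analysis.solve_power(effect_size=0.50, alpha=0.05,
                                   power=0.80, alternative='two-sided')

# (b) Power achieved with a fixed design of 50 participants per group,
#     as in the research scenario, for the same hypothesized effect.
achieved_power = analysis.power(effect_size=0.50, nobs1=50, alpha=0.05,
                                ratio=1.0, alternative='two-sided')

print(f"required n per group: {n_per_group:.0f}")
print(f"power with n = 50 per group: {achieved_power:.2f}")
```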

Factors Affecting Power


All statistical procedures and their power estimates are influenced by various research-related factors. For instance, use of a one-tailed rather than a two-tailed test will increase power; relatively less evidence is required to find a significant effect with a one-tailed test than with a two-tailed test. The use of parametric versus nonparametric tests also affects power. Parametric tests are generally more powerful than nonparametric ones. Other factors include the alpha level (α), ES, sample size (n), and distribution variance (Park, 2003). These are discussed in more detail below.

Alpha Level. Most readers with a basic knowledge of statistics are familiar with the concept of the alpha (α) level or the p value. This numerical index reflects in part the level of risk researchers are willing to tolerate when they reject chance as a plausible explanation for the significant results derived from an inferential analysis (e.g., t test). By convention, this level is set at .05, signifying that the researcher is willing to accept a 5% chance of rejecting the null hypothesis erroneously. To illustrate, if our fictional researcher wants to virtually rule out chance as a potential reason for finding significant mean differences between experimental and comparison groups, when conducting inferential tests the researcher needs to set the alpha level very low (e.g., α = .01). In other words, to reduce the likelihood of committing a Type I error (i.e., incorrectly rejecting the null hypothesis, or saying there is a true difference between groups when there is not), the researcher selects a priori a conservative alpha level of .01. As Kline (2004) elucidated, when an investigator designs a study to purposely reduce the chances of Type I error, there are perhaps unintentional consequences. For example, setting the alpha level to a very stringent value increases the potential for committing a Type II error (β). A more liberal alpha level (e.g.,
α = .10) translates into a lower β, which in turn leads to a higher value for statistical power (1 − β). When the practical consequences of a Type I error are not serious, the investigator may choose a higher alpha level to increase the likelihood of finding a relatively minor difference between group means. For instance, when considering the study of the NT effect on self-efficacy, the counseling method may be benign and present no drain on resources. In this case, selecting a conservative alpha level should have little, if any, practical consequences. However, given that NT is a relatively new technique, our researcher may want to detect even the smallest group difference on the self-efficacy measure. To do so, the investigator may increase the alpha level to .15 to enhance statistical power. The appropriate balance of alpha level and power is informed by the desired ES. It should be noted here that for purposes of journal publication most reviewers consider an alpha level of .05 the upper limit. To recap, investigators test null hypotheses presuming them to be false. It is important, therefore, to consider the probability of committing a Type I error (α) and a Type II error (β), and the role statistical power (1 − β) exerts in the process (Davey & Savla, 2010). Decisions about how to maximize statistical power, while at the same time balancing Type I and Type II errors, require a relatively sophisticated knowledge of research design and relevant statistical properties underlying inferential analyses.

ES. Since this topic is detailed later in the article, we only mention here how this concept influences statistical power. As accomplished researchers understand, when a study involves large sample sizes (e.g., over 500), the likelihood of finding even a minimal effect (e.g., main or group effect) is near 100%. However, in most cases where the group mean differences are negligible but statistically significant, the ES will be trivial, allowing the researcher to conclude that the statistically significant finding has little, if any, application to the real world. There are multiple types of ESs that quantify the extent to which groups differ or the strength of two or more correlated variables (Thompson,
2008). In the circumstance of estimating the statistical power of t or F tests, researchers are generally interested in determining a standardized ES (e.g., Cohen's d). As the ES value increases, the power to detect actual group differences increases (i.e., there is a positive relationship between 1 − β and ES). For example, if our fictitious researcher's study was designed in such a way as to maximize statistical power (e.g., increase sample size for each group to 500), it is more likely that even a small ES for group differences on the dependent variable (self-efficacy scores) would be detected. If the inferential procedure deployed has low power, it is less likely to distinguish group differences even when they exist, and the associated ES may be small. One of the best ways to enhance statistical power is to design a study with a more than adequate sample size.

Sample Size. Sample size is an important element affecting not only statistical significance but also statistical power (Maxwell & Delaney, 2004). Because larger samples are associated with more stable sample statistics, reduced sampling error (i.e., a lower standard error of the mean), and narrower confidence intervals, an increase in sample size is generally commensurate with a boost in statistical power. By maximizing power and minimizing sampling error, statistically significant effects with trivial ESs are more readily detected. That is, mean differences that may have little importance are detected. Researchers are cautioned, therefore, to keep in mind that a large sample can result in a statistically significant effect even when the ES is minimal. In brief, power analysis allows researchers to determine the size of the sample required to obtain a statistically significant result (Mertens, 2010).

Variance. Data variability is influenced by the reliability and validity of the measures used, the design of the study, and the extent to which extraneous variables are controlled. When there is less variability (e.g., sum of squares) in the population, the estimated standard error of the mean tends to be smaller (Maxwell & Delaney, 2004). Under these circumstances, the sample
statistics provide relatively good estimates of the population parameters. Because lower variance also tends to be associated with higher levels of statistical power, researchers must seriously consider this factor. For situations where two groups are compared, such as the one presented in our research scenario, the researcher assumes that the groups have equivalent variance in self-efficacy scores. For instances where the assumption of equality of variance is violated, the probability of committing a Type II error becomes inflated, and this, in turn, lowers the level of statistical power. In an effort to restrict sample variability, researchers can, for example, be very specific in defining the sample, limiting it to adolescents with moderate school behavior problems or mild-to-moderate depression and anxiety.
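
The sketch below, again assuming the statsmodels package and using purely illustrative values, shows how the factors discussed above (tailedness of the test, alpha level, effect size, and sample size) each move the power of an independent-samples t test when the other factors are held fixed; lower outcome variance enters indirectly, by enlarging the standardized effect size.

```python
# Illustrative only: how individual design choices shift statistical power.
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower().power
baseline = dict(effect_size=0.40, nobs1=50, alpha=0.05, alternative='two-sided')

print('baseline design     ', round(power(**baseline), 2))
print('one-tailed test     ', round(power(**{**baseline, 'alternative': 'larger'}), 2))
print('more liberal alpha  ', round(power(**{**baseline, 'alpha': 0.10}), 2))
print('larger effect size  ', round(power(**{**baseline, 'effect_size': 0.60}), 2))
print('larger groups       ', round(power(**{**baseline, 'nobs1': 100}), 2))
```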


Application of Power Analysis to Research Designs


This section considers briefly how power analysis is relevant to various research designs often utilized in counseling research.

Single Participant Design. In single participant designs, the researcher's major concern is effectively increasing the internal and external validity of the findings. Using a multiple baseline design is a good way to do so (Ferron & Sentovich, 2002). How participants are randomly assigned to the various baselines affects the level of statistical power. Power is largest in cases where there is an alternating treatment design (Ferron & Sentovich); that is, the participant receives one of two treatments at each experimental session. For instance, in our NT study, the researcher would design the investigation so each participant randomly receives NT in one session and perhaps another intervention in the next. However, to be more technically accurate, it is more common to observe the dependent variable across a number of baselines and then NT sessions; subsequently, in a reversal design (e.g., ABA or ABAB), the treatment is withdrawn for several sessions to observe whether improvement as measured by the dependent variable decreases, thus increasing the researcher's confidence in the study's internal and external validity.

Group Experimental Design. Increasing sample size, as previously discussed, is often a useful means to enhance statistical power. This can be costly in terms of money and effort, necessitating perhaps other more realistic ways of increasing power. For example, adding a pretest as a covariate while retaining one's original sample size may lead to a lowering of the standard deviation of the error term for the same sample size. As a result, the ES, the precision of statistical estimates, and statistical power increase. Although impractical in most settings, Venter, Maxwell, and Bolig (2002) recommended at least five repeated measures for the posttest to achieve significant gains in power. Any fewer than five repeated measures (testings) is likely no more beneficial than simply using a pretest-with-one-posttest-only design. When groups are being compared on dependent variables, they need to be relatively equivalent, for example, in their demographic characteristics. Researchers attempt to equalize the samples by randomly selecting participants for the study and then randomly assigning participants to groups. Obviously, when the study is unable to use randomization procedures, the findings from inferential statistical analyses are problematic (Jo, 2002). The lack of randomization confounds the treatment effect and tends to reduce the statistical power of the study. To apply this information to our research scenario, the lack of randomization lowers statistical power. If the experimental group has a large proportion of treatment no-shows, the power to detect differences in self-efficacy is greatly compromised. Other than encouraging participation, there is little else a researcher can do about noncompliance. However, an investigator can perhaps further adjust the study's method (e.g., use more psychometrically sound instrumentation and improve sampling procedures) to minimize error and gain adequate power.
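
A back-of-the-envelope way to see the covariate gain described above, under the simplifying assumptions that the pretest correlates ρ with the posttest and that the usual ANCOVA conditions hold, is the following approximation:

```latex
% Adjusting for a pretest that correlates \rho with the posttest shrinks the
% error standard deviation, so the standardized effect that drives power grows:
\sigma_{\text{adjusted}} \approx \sigma \sqrt{1 - \rho^{2}}, \qquad
d_{\text{adjusted}} \approx \frac{d}{\sqrt{1 - \rho^{2}}}
% e.g., with \rho = .50 the error SD drops by about 13%, and the operative
% effect size (and hence power) rises accordingly at no cost in sample size.
```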


Meta-Analysis. Contemporary researchers are using meta-analysis as a quantitative method of research synthesis (Cooper, 2010; Konstantopoulos, 2008) and, at times, as a statistical approach to increase statistical power in a fixed-effect model3 where samples are from the same population (Cohn & Becker, 2003). As more studies are included in the meta-analysis, not only does power improve but the precision of descriptive statistics (estimates) to represent population values increases as well. There are several criticisms leveled against meta-analytic studies. For instance, publication bias (i.e., journals largely publish articles with significant findings) and the lack of attention to statistical power are often cited as serious weaknesses (Muncer, Craigie, & Holmes, 2003). In cases where the null hypothesis is rejected, the statistical power issue is less of a concern. When the null hypothesis is not rejected, however, there are two possible explanations that need to be considered: the null hypothesis is true, or the studies included in the meta-analysis were underpowered. As we have mentioned previously, research conducted with low power requires a large ES to produce significant findings. In this pursuit, researchers may artificially inflate ES estimates and thus increase the heterogeneity of ESs in a meta-analysis. This practice tends to mask both the extent and the direction of the true ES. There are various ways to address this challenge (e.g., weighting the mean), but power analysis is the most effective method. The bottom line for researchers is that power level should be one of the inclusion criteria in selecting studies for meta-analyses; failing this, at the very least, the statistical power of the included studies should be discussed.
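
As a brief illustration of the "weighting the mean" idea mentioned above, the sketch below pools a handful of hypothetical study-level standardized mean differences with fixed-effect (inverse-variance) weights, so that larger, more precise studies count for more than small, underpowered ones; all of the numbers are invented for illustration.

```python
import numpy as np

# Hypothetical per-study effect sizes (d) and their sampling variances.
d = np.array([0.30, 0.55, 0.10, 0.42])
var_d = np.array([0.040, 0.110, 0.025, 0.090])

w = 1.0 / var_d                        # fixed-effect (inverse-variance) weights
d_pooled = np.sum(w * d) / np.sum(w)   # precision-weighted mean effect size
se_pooled = np.sqrt(1.0 / np.sum(w))   # standard error of the pooled estimate

print(f"pooled d = {d_pooled:.2f}, "
      f"95% CI [{d_pooled - 1.96 * se_pooled:.2f}, "
      f"{d_pooled + 1.96 * se_pooled:.2f}]")
```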

Closing Remarks on Statistical Power


Because hypothesis testing and statistical power are so interconnected, these topics are critical for researchers to address in their studies (Shieh, 2003; Thompson, 2008). Before funding a research study, granting institutions such as the National Institute of Mental Health want evidence that the study under consideration has sufficient statistical power. Power analysis may be an ethical issue as well (Miles, 2003). Consider, for example, counseling-related research. Clients may consent to participate in a relevant study with the hope of aiding future recipients of mental health services. It behooves researchers to do all they can to generate reliable and valid results with meaningful implications for clinical practice. As noted previously, an underpowered study will reduce the probability of finding a significant and practical effect.

Up to this point, we have discussed power analyses conducted before the results are fully analyzed. Although discouraged as an unsound research practice (e.g., Huck, 2009; Millis, 2003; Nakagawa & Foster, 2004), naïve investigators may conduct a retrospective or post hoc power analysis. This after-the-fact analysis determines the level of statistical power subsequent to a nonsignificant effect. Researchers deploying this method are tempted to conclude, on one hand, that when a nonsignificant finding was attained with high power, the finding reflects reality; on the other hand, a nonsignificant effect with low power is deemed inconclusive. Figure 1 provides a graphical representation of a retrospective power analysis (n = 40, p < .05). The statistical power estimate (y-axis) is a function of the detected ES (x-axis). Critics of the retrospective approach point out that the technique disregards the impact of sample size and ES (Aguinis, Beaty, Boik, & Pierce, 2005; Huck, 2009) by essentially detecting the power necessary to determine the ES already found. Depending on whether the investigator uses an a priori or retrospective approach, the same sample size and alpha level can lead to the equivalent level of statistical power regardless of ES and p values.

The preferred method is to conduct a prospective power analysis (a priori analysis) before the study is fully instituted. Millis (2003) underscored this point, suggesting that failure to conduct such a prospective analysis is tantamount to committing a research "deadly sin." The procedure requires the researcher to determine the needed sample size to attain the desired power to detect a predetermined ES.


Figure 1. Graph representing a retrospective power analysis (n = 40, p < .05)

Figure 2. Graph representing statistical power as a function of sample size

Figure 2 shows a graph of statistical power as a function of sample size. For 80% power (usually considered to be sufficient power), the group size required is about 50. In the case of a two-group comparison, as in the above research scenario, a total of 100 participants would then be adequate.
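
Because the required n depends heavily on the effect size the researcher expects to detect (an assumption Figure 2 necessarily builds in), it is worth solving the same problem for several candidate effects. A short sketch, again assuming statsmodels, is shown below; note that an effect near d = .58, the value obtained later in this article's scenario, lands close to the "about 50 per group" figure read off Figure 2.

```python
# Per-group sample size needed for 80% power (two-sided alpha = .05)
# across several hypothesized effect sizes; illustrative only.
from statsmodels.stats.power import TTestIndPower

solve = TTestIndPower().solve_power
for d in (0.20, 0.50, 0.58, 0.80):
    n = solve(effect_size=d, alpha=0.05, power=0.80, alternative='two-sided')
    print(f"d = {d:.2f}: about {round(n)} participants per group")
```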


Although there are multiple commercially available products to choose from (e.g., nQuery 7.0, PASS: Power Analysis and Sample Size 2008), a no-cost online tool for conducting various statistical power analyses, including the a priori method, is G*Power III4 (available at: http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/). Ultimately, the goal is to prevent investigators from conducting time-consuming and costly studies with little chance of detecting significant treatment/intervention effects. In the event that the power analysis produces undesirable outcomes, the researcher can reconceptualize the study and improve its design and execution.

Sampling
As alluded to previously, sample size is a strategic component in enhancing statistical power and, potentially, ESs. If researchers want to capitalize on the power gained from collecting data from large samples, effective representative sampling procedures must be used. Because this is a complex issue and there are numerous texts to consult (e.g., Cochran, 2007; Creswell, 2009; Houser, 2008), we only address the major considerations affecting statistical power and ESs. Obviously, the study's context and goals direct what sampling method is deployed. In most situations, the most advantageous sampling approach is one that is realistic, efficient, and, perhaps most importantly, minimizes potential error variance from sampling bias (i.e., systematic error associated with a nonrandom sample from a population) and produces representative samples. Researchers want to reduce bias in the data, which can often lead to severely non-normal distributions, including those that show substantial kurtosis and/or skewness. One of the best ways to maximize power and to increase the likelihood of obtaining a normal distribution of error variance is through true probability sampling involving various randomization techniques (e.g., simple, stratified, multi-stage, and systematic). Ideally, counseling researchers should attempt to include 30 or more participants per group (Judd, McClelland, & Ryan, 2009). Given the
nature of research ethics, the availability of clients, and financial constraints, random sampling is usually not feasible for counseling studies. When this is the case, the next option researchers turn to is nonprobability5 sampling, such as selective (e.g., purposive, expert, and snowball) and convenience methods (e.g., use of participants on hand). Typically then, counseling studies resort to soliciting volunteers from an opportune location (e.g., a local mental health clinic or school), where researchers use naturally formed groups (e.g., clients in a counseling center, families in a faith community) in that setting as study participants (Creswell, 2009). Regrettably, with all approaches to nonprobability sampling, investigators cannot be assured of representative samples or that the data they collect will be appropriate for parametric inferential analyses. We recommend that researchers limited by funding, potential participant groups, and ethical considerations use a modified form of randomization with nonprobability sampling. Using fairly large sample sizes (n ≥ 30 per group) drawn from intact groups, researchers can randomly assign therapists to treatments and participants to groups (e.g., experimental vs. comparison). If random assignment is not feasible, matching participants to groups is the next best alternative. If possible, match participants in the experimental and comparison groups on key demographic variables (e.g., gender, ethnicity, age, socioeconomic status [SES]). Prior to computing inferential statistics, ensure the similarity of groups and the appropriateness of the data for parametric analyses through data screening techniques (e.g., scatterplots, box-and-whisker plots) and by reviewing relevant descriptive statistics (e.g., Ms, SDs, kurtosis, skewness).6 In brief, rigorous sampling reduces error variance in the groups, which in turn improves statistical power and the chances for sizable and consequential effects.
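
The following sketch, using invented data and the pandas/SciPy libraries, illustrates the modified approach recommended above: clients from an intact convenience pool are randomly assigned to the two conditions, and the resulting groups are screened for comparability (means, standard deviations, skewness, and kurtosis) before any inferential test is run.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2010)

# Hypothetical intact (convenience) pool of 100 caregiver-referred adolescents.
pool = pd.DataFrame({
    "client_id": range(100),
    "age": rng.integers(14, 19, size=100),
    "pretest_self_efficacy": np.round(rng.normal(60, 9, size=100), 1),
})

# Randomly assign clients from the intact pool to the two conditions.
shuffled = pool.sample(frac=1, random_state=2010).reset_index(drop=True)
shuffled["group"] = np.where(shuffled.index < 50, "experimental", "comparison")

# Screen the groups for comparability and distribution shape before testing.
for name, grp in shuffled.groupby("group"):
    scores = grp["pretest_self_efficacy"]
    print(name, f"M = {scores.mean():.1f}", f"SD = {scores.std(ddof=1):.1f}",
          f"skew = {stats.skew(scores):.2f}",
          f"kurtosis = {stats.kurtosis(scores):.2f}")
```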

ESs
There is a wealth of literature detailing the statistical properties, the need for, and value of ESs in quantitative psychology-related research (see, e.g., Cortina & Nouri, 2000; Grissom &
Kim, 2005; Huberty, 2002; Kline, 2004; Sink & Stroh, 2006; Thompson, 2006a, 2006b, 2007, 2008; Trusty, Thompson, & Petrocelli, 2004; Vacha-Haase & Thompson, 2004, for detailed discussions). These numerical indices essentially provide an estimate of the magnitude of the effect, which in turn represents the practical value or clinical utility of the statistical finding. In this section, we briefly underscore the rationale for including ES estimates in studies, review the types of ESs and their interpretation, and clarify the relationship between ES and sample size, which, as explained above, is also an important element of calculating statistical power (Cohen, 1988, 1992; Huck, 2009). Applying the research scenario summarized above, let us assume, for whatever reason, that the counseling investigator was mainly concerned with possible group mean differences at the follow-up phase of the study and ignored possible mean differences at the pre- and posttest phases. Furthermore, the experimenter anticipated finding a statistically significant result as well as a modest ES (e.g., Cohen's d = .40), suggesting that, in improving self-efficacy, NT is more clinically useful with adolescent clients with internalizing disorders than with youth exhibiting school behavior issues. After consulting a basic statistics text, the researcher opted for a simple independent samples t test, computing it using a typical statistical software package like SPSS (2009). A statistically significant result for group differences (t = 10.95, two-tailed, p < .001; Me = 70.5, SD = 7.5; Mc = 65.5, SD = 9.5) was revealed. The investigator, pleased with the statistically significant t value, next asked a follow-up question: Is the mean difference between samples, favoring the experimental group (clients with depression/anxiety), practically or clinically significant? In response, depending on what statistical procedure one uses (e.g., a t test or F test using the general linear model7 [GLM]), SPSS (2009) generates a relevant ES index, indicating the size of the significant effect (see Trusty et al., 2004, for a discussion of SPSS and ESs in counseling research). Given that the mean of the experimental group was significantly larger than the
comparison group's average score and the samples have comparable standard deviations, the researcher anticipates at least a moderate ES. However, experimenters must be careful about making this type of prediction, because practical significance as measured by an ES can be near zero even though statistical significance was found (Huck, 2009; Kline, 2004).

Rationale
As implied above, there are good reasons for including ES estimates when reporting on the findings of a quantitative study. Most notably, doing so makes statistical sense and reflects best practices (Fidler & Cumming, 2008; Kline, 2004; Thompson, 2008). The American Psychological Association's (APA, 2010) Publication Manual underscores this point, reiterating that the traditional manner of presenting quantitative findings (the null hypothesis statistical testing method8) is insufficient. By only reporting the inferential statistics (df, value of a t, F, or χ2 test, derived p value) and whether the null hypothesis was rejected or not, readers have little perspective on the wider meaning of the results (see Kline, 2004, for a technical discussion). Similarly, in an attempt to bolster the rigor of educational research, the Coalition for Evidence-based Policy (2003) argued that studies must report the size of the intervention's effects, so that readers can judge their educational import (p. 9). The Results section of a manuscript, therefore, should include ES indices, and, if relevant, associated confidence intervals. In summary, without a measure of practical significance, the real-world importance of the statistical findings may be lost.

Types of ESs and Interpretation


Thus far, we have attempted to correct a common research misconception: statistical significance does not imply practical significance (Fidler & Cumming, 2008; Huck, 2009). Statistical significance involves null and alternative (experimental) hypotheses, an a priori significance level (commonly called an alpha level or α), and the derived p value. Practical significance
is determined using the statistical data generated after the inferential statistical procedure is computed. As summarized below, two major classes of ESs (group difference and relationship- or variance-explained indices) are generally discussed in related technical publications (e.g., Grissom & Kim, 2005; Huberty, 2002; Kline, 2004; Rosnow & Rosenthal, 1996; Thompson, 2006a, 2006b, 2008; Vacha-Haase & Thompson, 2004). Complicating matters, however, numerous ES indicators are available to researchers, and ESs exist for both parametric and nonparametric tests (see Sheskin, 2007; Sink & Stroh, 2006; Thompson, 2006a, 2006b, 2008, for application-oriented summaries). Because a synopsis of all these alternatives is beyond the scope of this article, we consider here only those indices most often reported in quantitative studies using parametric statistical tests.

Group difference indices. These values9, often categorized as part of the d family, should be included when conducting group comparison studies. For instance, when experimental and control (or comparison) groups are statistically compared after an intervention, various d indices (e.g., Cohen's d) are frequently reported. In meta-analytic studies, Glass's delta (Δ) is widely reported as the d-family ES. Essentially, a group difference index involves calculating the size of the mean difference between two samples (e.g., experimental and comparison), taking into account group size (n) and standard deviation (SD) (Kline, 2004). More precisely, the d estimate, a standardized ES, is calculated as the difference in the mean outcome between the intervention/treatment and comparison/control groups, divided by the pooled standard deviation (Thalheimer & Cook, 2002). Standardized ESs are understood in a similar way as a z score, where a d is calculated on a common scale, allowing the researcher to evaluate the success of different intervention programs using the same metric. Our fictional investigator computed an independent samples t test revealing a statistically
significant effect for Group (t = 10.95, two-tailed, p < .01). As anticipated, the participants in the experimental sample on average outperformed the adolescents in the comparison group. To determine the magnitude of the independent variable's impact on the dependent variable, descriptive statistics (ns, Ms, and SDs) from the case study are used to calculate Cohen's d, a well-recognized group difference index. This indicator is easily computed from one of several online ES calculators (see Table 1 for sample websites). After inputting the relevant data, the d = .58. To adequately grasp the nuances of Cohen's d, the relevant statistical formulae are overviewed next. The generic formula10 for Cohen's d is:

d = \frac{M_e - M_c}{s_{pooled}},

where Me = mean (average score) of the experimental group, Mc = mean (average score) of the comparison/control group, and s_pooled = pooled within-group SD (or common sample SD across both groups); more precisely, statisticians use the square root of the pooled11 within-groups variance (s_p^2). The formula for s_pooled is as follows:

s_{pooled} = \sqrt{\frac{(n_e - 1)s_e^2 + (n_c - 1)s_c^2}{n_e + n_c}},

where s^2 = variance, the e subscript refers to the experimental group and the c subscript to the comparison/control group, and ne = sample size for the experimental group and nc = sample size for the comparison/control group (Kline, 2004; Thalheimer & Cook, 2002). The d index can also be calculated when the researcher does not possess the standard deviation for each group, using this formula:

d = t\sqrt{\frac{n_e + n_c}{n_e n_c} \cdot \frac{n_e + n_c}{n_e + n_c - 2}},

where t = the t statistic, n = the number of participants, the e subscript refers to the experimental group, and the c subscript relates to the comparison group (or control group).
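
A direct transcription of the mean-and-SD formula, using the descriptive statistics reported above for the follow-up comparison, is sketched below. Note that many online calculators pool with ne + nc − 2 in the denominator rather than ne + nc; with 50 participants per group the two versions agree to roughly d = .58.

```python
import math

# Descriptive statistics reported for the hypothetical follow-up comparison.
n_e, m_e, sd_e = 50, 70.5, 7.5   # experimental group (depression/anxiety)
n_c, m_c, sd_c = 50, 65.5, 9.5   # comparison group (school behavior problems)

# Pooled within-group SD (here using the common n_e + n_c - 2 denominator).
s_pooled = math.sqrt(((n_e - 1) * sd_e**2 + (n_c - 1) * sd_c**2)
                     / (n_e + n_c - 2))

d = (m_e - m_c) / s_pooled
print(f"Cohen's d = {d:.2f}")   # approximately .58
```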


Table 1. Online Resources for Effect Size (ES) Calculations and Interpretation

Effect Size Calculator (http://www.uccs.edu/~faculty/lbecker/). This easy-to-use ES calculator will generate (1) Cohen's d and r using Ms and SDs, and (2) d and r using t values and df for an independent-groups t test. Calculates d using SDpooled. Not to be used for paired t test (within-subjects) values.

http://www.cemcentre.org/renderpage.asp?linkID=30325017. Calculator is built into a Microsoft Excel spreadsheet. The user only needs the Ms, ns, and SDs for the two groups being compared. The ES and the difference with a CI are generated.

http://www.cognitiveflexibility.org/effectsize/. This user-friendly calculator produces Cohen's d for between- and within-subjects studies. Only Ms and SDs are needed. The ES for a dependent t test also requires the r between the means being compared. A d statistic can be produced using t scores and group ns. Note: d is calculated using the average of each mean's individual SD, rather than SDpooled.

To extend the previous discussion, suppose another researcher computed a univariate or one-way analysis of variance (ANOVA) comparing posttest mean differences from three groups of participants after each received 10 weeks of intensive NT. In this situation, a d index is computed from the ANOVA's F test statistics (Cortina & Nouri, 2000). Here, the experimenter could generate a d for the mean comparison between Groups 1 and 2 (M1 with M2). Subsequently, ds can be calculated for other group comparisons (e.g., M1 with M3) as well. The required formula is as follows:

d = \frac{M_1 - M_2}{\sqrt{MSE \cdot \frac{n_1 + n_2 - 2}{n_1 + n_2}}},

where M1 = the mean of one group, M2 = the mean of another group, n = sample size, and MSE = the mean square error term12 from the F test (Thalheimer & Cook, 2002). Another standardized mean difference ES, most often reported in meta-analyses, is Glass's delta (Δ) (Sink & Mvududu, 2009). The formula looks similar to Cohen's d:

\Delta = \frac{M_e - M_c}{s_{control}},

where Me = mean (average score) of the experimental group, Mc = mean (average score) of the comparison/control group, and s_control is the standard deviation of the control (comparison) group (Kline, 2004). Unlike the d, the standardizer of Δ is generated only from the control (comparison) group's SD; thus, when using Δ, homogeneity of variance is not assumed (Kline, 2004).
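
For readers who prefer code to formulas, the two estimators just described can be written as small helper functions; the numeric inputs below are hypothetical and are used only to show the call pattern.

```python
import math

def d_from_anova(m1, m2, n1, n2, mse):
    """Cohen's d for two ANOVA cells, treating the square root of MSE
    (adjusted as in the formula above) as the pooled within-group SD."""
    return (m1 - m2) / math.sqrt(mse * (n1 + n2 - 2) / (n1 + n2))

def glass_delta(m_e, m_c, sd_control):
    """Glass's delta: standardize the mean difference by the control
    (comparison) group's SD only, so equal variances are not assumed."""
    return (m_e - m_c) / sd_control

# Hypothetical values for illustration only.
print(round(d_from_anova(70.5, 65.5, 50, 50, 73.3), 2))
print(round(glass_delta(70.5, 65.5, 9.5), 2))
```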


After the calculation of Cohen's d or Glass's Δ, the researcher still needs to interpret the derived estimate in practical terms, that is, to evaluate whether an intervention makes a meaningful difference in an applied setting (Kline, 2004, p. 135). In our fictional study, the researcher computed the d, yielding a value of .58. So what does this number mean? Before we answer this question, some further information is still required. Theoretically, ds and Δs can range between 0.00 and +4.00 (Kline), but rarely in research with human respondents do they even reach the +1.50 threshold. Generally, they range between 0.0 and 1.0. If the experimental group's mean is significantly higher than the comparison group's mean (i.e., the alternative hypothesis is retained and the null hypothesis is rejected), the d or Δ will be a positive number. In the odd situation when the comparison group outperforms the experimental group, the ES will be in the negative range. According to Cohen's (1988, 1992) rules of thumb, an ES of .20 is considered relatively small; .50 is the benchmark to be considered medium-sized; a d of .80-plus is a large ES. Actual ESs from counseling, social science, and educational research are often less than optimal, suggesting that Cohen's estimates may be somewhat inaccurate for meta-analytic research in the helping professions (Lipsey & Wilson, 2001). Table 2 provides a quick reference guide to interpreting d-family ESs.

Group difference indices provide researchers with estimates of how much the experimental group surpassed the control (comparison) group following an intervention, using a standardized quantitative index. For example, when a d or Δ equals .08, the ES can be understood in several related ways. First, participants in the experimental and control (comparison) groups produced relatively similar means (i.e., the intervention produced only a trivial or near-zero mean difference between groups, Me ≈ Mc). Second, the mean of the experimental group is around the 50th percentile (actual 53rd percentile) of the comparison group. Third, the distribution of scores for the experimental group overlaps almost completely with the distribution of scores for the comparison group (i.e., the percentage of score distribution overlap is near 100%; actual 94% overlap).

Finally, a near-zero d is a very small ES. Similarly, to interpret a d of .5, a moderate ES, Table 2 shows that the mean of the experimental group is at the 69th percentile of the comparison group. In this case, the experimental group outperformed the control (comparison) group by .5 of a SD, with the percentage of distribution nonoverlap for both samples being approximately 33%. We now have enough background information to answer the researcher's primary question regarding practical significance. A d of .58 is considered a moderate ES, indicating that the experimental group's mean was around the 73rd percentile of the distribution of the comparison sample. The researcher is able to conclude that clients in the experimental sample, when tested at the study's follow-up phase, had higher self-efficacy scores (≈ .6 SD) following NT than did participants in the comparison group. The researcher must also report the caveats (e.g., issues related to internal and external validity) attached to making this assertion.

Relationship or Variance Explained. The second category of ESs, commonly referred to as the r family of indices, reflects the degree of association or covariation (i.e., shared variance) between the independent and dependent variables (Kline, 2004). Readers are perhaps familiar with the Pearson r, which provides an estimate of the observed strength of the relationship between two continuous variables (e.g., variables using interval data). Bivariate correlations range in absolute value from .00 to 1.00; as rs approach ±1.00, the relationship between the variables becomes stronger. If r is squared (r2), then this value represents the proportion of explained (shared) variance between two variables (i.e., the coefficient of determination), ranging from .00 to 1.00. If the r2 is multiplied by 100, one obtains the percentage of explained or shared variance, ranging from 0% to 100%. In quantitative research, the aim is to produce a large r, and thus a strong r2, indicating a large percentage of score overlap between two variables. With this information in mind, we return to our invented research example.


Table 2. Guide to Interpreting Group Difference (d-family) Indices

D or d (in SD units)   Percentile of Comparison Group (M1 to M2)   % of Distribution Nonoverlap   Benchmark ES Strength
2.0                    98                                          81
1.9                    97                                          79
1.8                    96                                          77
1.7                    96                                          75
1.6                    95                                          73
1.5                    93                                          71
1.4                    92                                          68
1.3                    90                                          65
1.2                    88                                          62
1.1                    86                                          59
1.0                    84                                          55
.9                     82                                          52
.8                     79                                          47                             Large
.7                     76                                          43
.6                     73                                          38
.5                     69                                          33                             Medium
.4                     66                                          27
.3                     62                                          21
.2                     58                                          15                             Small
.1                     54                                          8
.0                     50                                          0

Note. Adapted from Becker (2000) and Cohen (1988); d or Δ can exceed 2.0, but this situation is extremely rare; see Kline (2004) for a discussion of distribution overlap; benchmarks are tentative.
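
The percentile column of Table 2 can be reproduced directly from the normal curve: assuming roughly normal score distributions with equal SDs, the proportion of the comparison group falling below the experimental group's mean is simply the normal cumulative probability of d (Cohen's U3). A small sketch using SciPy is shown below; its output matches the table to within rounding.

```python
from scipy.stats import norm

def percentile_standing(d):
    """Cohen's U3: percentile of the comparison-group distribution at which
    the experimental-group mean falls (assumes normality and equal SDs)."""
    return 100 * norm.cdf(d)

for d in (0.08, 0.5, 0.8, 2.0):
    print(f"d = {d:.2f} -> about the {percentile_standing(d):.0f}th percentile")
```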

An investigator used an ANOVA to compare posttest self-efficacy mean differences from three groups of participants after each received 10 weeks of intensive NT. In this scenario, we indicated that a d could be calculated as a useful ES index. However, in comparative studies with two or more independent variables (e.g., participant group, counseling treatment, clinic location), researchers often analyze the effect of these variables on the dependent variable(s) (e.g., participant scores on self-efficacy, mood, or personality inventories) using GLM factorial procedures (e.g., factorial ANOVA, analysis of covariance [ANCOVA], or multivariate analysis of variance [MANOVA]). Keeping the technical details of the GLM procedure to a minimum and somewhat oversimplified, the typical statistical software package like SPSS (2009) will generate a multiple correlation ratio (R2) for the overall ANOVA model13 (see, e.g., Field, 2009; Green & Salkind, 2008, for SPSS procedures and statistical output) and an eta squared for each ANOVA effect. Specifically, the R2 is the ES for the overall ANOVA model (all variance components are included; e.g., total sum of squares, error sum of squares, main effect[s] and interaction effect[s] sums of squares). This multiple correlation ratio represents the amount of variance in the dependent variable explained by the combined main and interaction effects, excluding the mean square error (MSE) term. Moreover, SPSS conveniently outputs an additional ES for each main (independent variable [IV]) and interaction (e.g., IVa by IVb) effect. Although interpreted in the same way as an R2, depending on the type of analysis of variance requested, SPSS produces an eta squared (η2) or a partial eta squared (ηp2). The former ES represents the ratio of the sum of squares for a particular effect of interest (e.g., a main effect/independent variable) to the ANOVA's total sum of squares (i.e., the total sums of squares for all main and interaction effects plus
error). Therefore, η2 = SSeffect/SStotal. When η2 = .40, for example, this means that 40% of the variability in the dependent variable can be explained or accounted for by the independent variable. However, a more precise ES to report for each main and interaction effect in a GLM factorial design is the ηp2. Technically, this ES is defined as the ratio of the sum of squares for a particular effect (main or interaction) to an adjusted total sum of squares, which is composed of two variance components: (a) the sum of squares for the effect under consideration and (b) the sum of squares for the error term associated with that effect (Huck, 2009). Thus, the partial eta squared is symbolically represented as: ηp2 = SSeffect/(SSeffect + SSerror). The ηp2 gives researchers a more precise ES estimate for a specific effect (e.g., independent variable or main effect A, independent variable or main effect B, or the A × B interaction effect). In other words, the ηp2 is understood similarly to a partial correlation squared, in that this ES is the proportion of variance explained by a particular effect (main or interaction) after removing any variance associated with the other effects. Practically speaking, then, when a counseling researcher computes a factorial ANOVA and SPSS (2009) outputs a partial η2 of .15 for the main effect (independent variable) of Group, one can report that the Group effect accounted for 15% of the variance made up of that effect plus its associated error. Researchers should also note that when computing a one-way ANOVA, the η2 is reported as the ES for the particular independent variable (the main effect). However, this is not the case in studies where there are two or more independent variables (factorial designs). If, for instance, the researcher wants to investigate the effects of Group (2 levels: experimental and comparison) and SES (3 levels: low, medium, and high) on the dependent variable (e.g., self-efficacy total score), a two-way ANOVA could be computed. Here, we have two main effects (Group and SES) and the interaction effect (Group by SES). For each of these three effects, the researcher should report a partial η2. For the overall two-way ANOVA model, the η2 can be reported in the findings as well. Table 3 provides user-friendly guidelines for determining what constitutes a small, medium, or large variance-explained index. Finally, because counseling researchers often use ANOVA or an associated procedure as the inferential approach of choice for group comparison studies, consulting Howell (2008) for a practitioner-friendly discussion of this statistical method is recommended.
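
The two ratios just defined are easy to compute by hand from any ANOVA summary table; the sketch below uses hypothetical sums of squares from the 2 (Group) × 3 (SES) design described above simply to show the arithmetic.

```python
def eta_squared(ss_effect, ss_total):
    """Proportion of the total variability attributable to one effect."""
    return ss_effect / ss_total

def partial_eta_squared(ss_effect, ss_error):
    """Proportion of effect-plus-error variability attributable to the effect."""
    return ss_effect / (ss_effect + ss_error)

# Hypothetical sums of squares from a 2 (Group) x 3 (SES) factorial ANOVA.
ss_group, ss_ses, ss_interaction, ss_error = 820.0, 310.0, 95.0, 4650.0
ss_total = ss_group + ss_ses + ss_interaction + ss_error

print(f"eta squared (Group)         = {eta_squared(ss_group, ss_total):.3f}")
print(f"partial eta squared (Group) = {partial_eta_squared(ss_group, ss_error):.3f}")
```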

Additional Comments and Caveats on ES Interpretation


Given the numerous ESs available to researchers, readers may wonder which ones are best to report in quantitative studies. Like most issues in research and statistics, there is not a universal rule of thumb to apply to all studies. Because d-family and r-family ESs can be computed for most inferential statistics in group comparison research designs, both are easily reported. However, following Thompson's (e.g., 2007, 2008) lead, the APA (2010) and the Coalition for Evidence-based Policy (2003) recommended that variance-accounted-for ESs might be better understood by practitioners if they were converted to unsquared indices associated with the d family of ESs. Coalition for Evidence-based Policy (2003) authors provided sample statements from published research that assist readers in understanding the practical importance of a study's findings. For example, an investigator might report the following: Based on the ESs derived from the significant findings, students in the experimental group, in contrast to the control group, improved their vocabulary skills by two grade levels, showed a 20% reduction in weekly use of illicit drugs, and showed a 20% increase in high school attendance. When reviewing previous research, authors might want to follow this example from Wasik and Slavin (as cited in Coalition for Evidence-based Policy, 2003):

Evidence from randomized controlled trials, discussed in the following journal articles, suggests that one-on-one tutoring of at-risk readers by a well-trained tutor yields an ES of about 0.7. This means that the average tutored student reads more proficiently than approximately 75% of the untutored students in the control group. (p. 9)


Table 3. Guide to Interpreting Variance-Explained Indices

                                               Tentative Benchmark to Determine ES Strength
Variance-Explained Index              Symbol   Small   Medium   Strong
Pearson product-moment correlation    r        .10     .30      .50
Multiple correlation squared          R2       .01     .09      .25
Eta squared                           η2       .01     .06      .14
Partial eta squared a                 ηp2      .01     .06      .14

Note. Each squared ES ranges from 0 to 1; multiplying each ES value by 100 yields a percentage of explained variance.
a Threshold values are usually smaller than those from an eta squared; hence, small, medium, and large benchmarks for ηp2 are probably too large, so they must be interpreted with caution (Green & Salkind, 2008; Sink & Stroh, 2006).


Finally, the decision as to which ES to include in a research manuscript depends on the study's aims, samples, context, and target audience (e.g., Thompson, 2006a, 2006b, 2007). Moreover, when interpreting ESs researchers must consider individual differences among client groups and the ES findings from previous research with comparable interventions and target groups (Hill, Bloom, Black, & Lipsey, 2008). Correspondingly, investigators and consumers of research are cautioned not to apply Cohen's (1988, 1992) guidelines too rigorously when characterizing what ESs are small, medium, or large. Again, ES interpretation must be situated within the particular research application, and with any d-related ES, researchers should, if appropriate, include confidence intervals (Cohen & Lea, 2004; Thompson, 2007).

Conclusion

In this article, we reviewed and practically discussed the interdependency of statistical power, sampling, and ESs within the context of quantitative counseling research. Reflecting in part Osborne's (2008) recommendations, it is our hope that future studies will use research designs that are well conceptualized, methodologically sound, and executed in a manner that minimizes error and maximizes statistical power and ESs. If possible, a priori power analyses should be conducted. Furthermore, the proposed sampling method must be carefully scrutinized, ensuring that the samples are relatively similar in composition and representative of the population. It is essential that counseling investigators move beyond the null hypothesis significance testing approach when computing and reporting on inferential statistics. All significant findings should be accompanied by their relevant ESs and associated confidence intervals. By adopting these best practices, counseling studies will reflect the highest standards set for quantitative research.

Notes

1. A quasi-experimental study, among other characteristics, includes nonrandom (nonprobability) sampling (see Stuart & Rubin, 2008, for details).
2. In technical language, accepting the alternative hypothesis means a researcher is suggesting that the population mean (μ) of the experimental group is significantly greater than the population mean (μ) of the control group, assuming, for example, an a priori alpha (α) level of .05.
3. See the useful but technical article by Guo and Luh (2009), which overviews fixed-effect factorial ANOVAs, statistical designs often used in quasi-experimental counseling research.
4. For the a priori technique, this tool computes a sample size as a function of the statistical power level (1 − β), the significance level (α), and the to-be-detected population ES.
5. More precisely, nonprobability sampling as referred to here is a procedure in which the selection of participants is based on factors other than random chance.


6. For a user-friendly online resource covering these concepts, see Trochim (2006).
7. GLM is a statistical procedure that uses linear regression modeling to analyze complicated factorial designs (see Judd et al., 2009, for a detailed explanation).
8. For critiques of null hypothesis significance tests, alternatives to them, and their relationship to effect size estimation, see Fidler and Cumming (2008) and Killeen (2005).
9. These d-family metrics are also called nonsquared ESs.
10. Technically, for reasons fully explained in authoritative texts (e.g., Cohen, 1988; Kline, 2004), the d is similar to Hedges and Olkin's g index, which is computed if, for instance, the groups being compared have unequal ns or the ns are relatively small (see also Becker, 2000). Thalheimer and Cook (2002) also point out that these d equations are to be used with independent samples, where the relative homogeneity of variances (or SDs) is assumed. These ES formulae do not apply to research situations where participants are measured more than once (e.g., pre- and posttest) on the same outcome variable. These situations, called repeated-measures designs (e.g., paired or dependent t test), are discussed in several sources (e.g., Becker, 2000; Kline, 2004).
11. The notion of pooled variance can be understood as a way of estimating the variance of different participant groups across different research situations.
12. MSE is also referred to as MSWithin (MSW) when computing an ANOVA using SPSS's GLM procedure; MSE is equivalent to MSResidual (MSR) in multiple regression analysis. MS is a measure of average variance for different components (e.g., main effect or MSBetween, interaction effect, error/residual, total) of an ANOVA, where the sum of squares is divided by the df. Thus, MSE = SSerror/dferror.
13. Determining whether the overall ANOVA is significant or not is based on the omnibus F and the derived p value.

Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the authorship and/or publication of this article.

Funding
The authors received no financial support for the research and/or authorship of this article.

References

Aguinis, H., Beaty, J. C., Boik, R. J., & Pierce, C. A. (2005). Effect size and power in assessing moderating effects of categorical variables using multiple regression: A 30-year review. Journal of Applied Psychology, 90, 94-107.
American Psychological Association. (2010). Publication manual of the American Psychological Association (6th ed.). Washington, DC: Author.
Babbie, E. (2010). The practice of social research (12th ed.). Belmont, CA: Wadsworth.
Becker, L. (2000). Effect size (ES). Retrieved from http://www.uccs.edu/~faculty/lbecker/es.htm
Coalition for Evidence-based Policy. (2003). Identifying and implementing educational practices supported by rigorous evidence: A user friendly guide. Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance.
Cochran, W. G. (2007). Sampling techniques (3rd ed.). New York, NY: Wiley.
Cohen, B. H., & Lea, B. R. (2004). Essentials of statistics for the social and behavioral sciences. New York, NY: John Wiley.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159.
Cohn, L. D., & Becker, B. J. (2003). How meta-analysis increases statistical power. Psychological Methods, 8, 243-253.
Cooper, H. (2010). Research synthesis and meta-analysis: A step-by-step approach. Thousand Oaks, CA: Sage.
Cortina, J. M., & Nouri, H. (2000). Effect size for ANOVA designs. Thousand Oaks, CA: Sage.
Creswell, J. W. (2009). Research design: Qualitative, quantitative, and mixed methods approaches (3rd ed.). Thousand Oaks, CA: Sage.
Davey, A., & Savla, J. (2010). Statistical power analysis with missing data: A structural equation modeling approach. New York, NY: Routledge.
Ferron, J., & Sentovich, C. (2002). Statistical power of randomization tests used with multiple-baseline designs. The Journal of Experimental Education, 70, 165-178.
Fidler, F., & Cumming, G. (2008). The new stats: Attitudes for the 21st century. In J. W. Osborne (Ed.), Best practices in quantitative methods (pp. 1-12). Thousand Oaks, CA: Sage.
Field, A. (2009). Discovering statistics using SPSS (3rd ed.). Thousand Oaks, CA: Sage.
Green, S. B., & Salkind, N. J. (2008). Using SPSS for Windows and Macintosh: Analyzing and understanding data (5th ed.). Upper Saddle River, NJ: Prentice-Hall.
Grissom, R. J., & Kim, J. J. (2005). Effect sizes for research: A broad practical approach. Mahwah, NJ: Erlbaum.
Guo, J.-H., & Luh, W.-M. (2009). On sample size calculation for 2 × 2 fixed-effect ANOVA when variances are unknown and possibly unequal. British Journal of Mathematical and Statistical Psychology, 62, 417-425.
Hill, C. J., Bloom, H. S., Black, A. R., & Lipsey, M. W. (2008). Empirical benchmarks for interpreting effect sizes in research. Child Development Perspectives, 2, 172-177.
Houser, R. (2008). Counseling and educational research: Evaluation and application (2nd ed.). Thousand Oaks, CA: Sage.
Howell, D. (2008). Best practices in the analysis of variance. In J. W. Osborne (Ed.), Best practices in quantitative methods (pp. 341-357). Thousand Oaks, CA: Sage.
Huberty, C. J. (2002). A history of effect size indices. Retrieved from http://epm.sagepub.com/cgi/content/abstract/62/2/227
Huck, S. W. (2009). Statistical misconceptions. New York, NY: Routledge.
Jo, B. (2002). Statistical power in randomized intervention studies with noncompliance. Psychological Methods, 7, 178-193.
Judd, C. M., McClelland, G. H., & Ryan, C. S. (2009). Data analysis: A model comparison approach (2nd ed.). New York, NY: Routledge.
Killeen, P. R. (2005). An alternative to null-hypothesis significance tests. Psychological Science, 16, 345-353.
Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.
Konstantopoulos, S. (2008). An introduction to meta-analysis. In J. W. Osborne (Ed.), Best practices in quantitative methods (pp. 177-194). Thousand Oaks, CA: Sage.
Lenth, R. V. (2006-2009). Java applets for power and sample size [Computer software]. Retrieved from http://www.stat.uiowa.edu/~rlenth/Power
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.
Maxwell, S. E., & Delaney, H. D. (2004). Designing experiments and analyzing data: A model comparison perspective (2nd ed.). Mahwah, NJ: Erlbaum.
Mertens, D. M. (2010). Research and evaluation in education and psychology (3rd ed.). Los Angeles, CA: Sage.
Meyers, L. S., Gamst, G., & Guarino, A. J. (2006). Applied multivariate research: Design and interpretation. Thousand Oaks, CA: Sage.
Miles, J. (2003). A framework for power analysis using a structural equation modeling procedure. BMC Medical Research Methodology, 3, 1-11.
Millis, S. R. (2003). Statistical practices: The seven deadly sins. Child Neuropsychology, 9, 221-233.
Moss, P. A., Phillips, D. C., Erickson, F. D., Floden, R. E., Lather, P. A., & Schneider, B. L. (2009). Learning from our differences: A dialogue across perspectives on quality in education research. Educational Researcher, 38, 501-517. doi:10.3102/0013189X09348351
Muncer, S. J., Craigie, M., & Holmes, J. (2003). Meta-analysis and power: Some suggestions for the use of power in research synthesis. Understanding Statistics, 2, 1-12.
Nakagawa, S., & Foster, T. M. (2004). The case against retrospective statistical power analyses with an introduction to power analysis. Acta Ethologica, 7, 103-108.
Osborne, J. W. (Ed.). (2008). Best practices in quantitative methods. Thousand Oaks, CA: Sage.
Park, H. M. (2003). Understanding the statistical power of a test. Retrieved from http://www.indiana.edu/~statmath/stat/all/power/power.html#One-way
Rosnow, R. L., & Rosenthal, R. (1996). Computing contrasts, effect sizes, and counternulls on other people's published data: General procedures for research consumers. Psychological Methods, 1, 331-340.
Rowland, D., & Thornton, J. (2002). Transforming ideas into research. In J. Niebauer (Ed.), The clinical research survival guide (pp. 43-78). London, UK: REMEDICA Publishing.
Sheskin, D. (2007). Handbook of parametric and nonparametric statistical procedures (4th ed.). New York, NY: CRC Press.
Shieh, G. (2003). A comparative study of power and sample size calculations for multivariate general linear models. Multivariate Behavioral Research, 38, 285-307.
Sink, C. A., & Mvududu, N. H. (2009). Meta-analysis. In B. T. Erford (Ed.), Encyclopedia of counseling (pp. 334-337). Alexandria, VA: American Counseling Association.
Sink, C. A., & Stroh, H. R. (2006). Practical significance: The use of effect sizes in school counseling research. Professional School Counseling, 9, 401-411.
SPSS, Inc. (2009). Statistics family. Retrieved from http://www.spss.com/software/statistics/
Stuart, E. A., & Rubin, D. B. (2008). Best practices in quasi-experimental designs: Matching methods for causal inference. In J. W. Osborne (Ed.), Best practices in quantitative methods (pp. 155-176). Thousand Oaks, CA: Sage.
Thalheimer, W., & Cook, S. (2002, August). How to calculate effect sizes from published research articles: A simplified methodology. Retrieved from http://work-learning.com/effect_sizes.htm
Thompson, B. (2006a). Research synthesis: Effect sizes. In J. Green, G. Camilli, & P. B. Elmore (Eds.), Handbook of complementary methods in education research (pp. 583-603). Washington, DC: American Educational Research Association.
Thompson, B. (2006b). Role of effect sizes in contemporary research in counseling. Counseling and Values, 50, 176-186.
Thompson, B. (2007). Effect sizes, confidence intervals, and confidence intervals for effect sizes. Psychology in the Schools, 44, 423-432.
Thompson, B. (2008). Computing and interpreting effect sizes, confidence intervals, and confidence intervals for effect sizes. In J. W. Osborne (Ed.), Best practices in quantitative methods (pp. 246-262). Thousand Oaks, CA: Sage.

Trochim, W. M. K. (2006). Statistical power. Retrieved from http://www.socialresearchmethods.net/kb/power.php
Trusty, J., Thompson, B., & Petrocelli, J. V. (2004). Practical guide for reporting effect size in quantitative research. Journal of Counseling & Development, 82, 107-110.
Vacha-Haase, T., & Thompson, B. (2004). How to estimate and interpret various effect sizes. Journal of Counseling Psychology, 51, 473-481.
Venter, A., Maxwell, S. E., & Bolig, E. (2002). Power in randomized group comparisons: The value of adding a single intermediate time point to a traditional pretest-posttest design. Psychological Methods, 7, 194-209.
Wheeler, D. W., Seagren, A. T., Becker, L. W., Kinley, E. R., & Mlinek, D. D. (2008). The academic chair's handbook (2nd ed.). San Francisco, CA: Jossey-Bass.

Bios
Christopher A. Sink, PhD, NCC, LMHC, professor of counselor education at Seattle Pacific University (16 years), has been actively involved with the school counseling profession for nearly 30 years. Prior to serving as a counselor educator, he worked as a secondary and post-secondary counselor. He has many years of editorial experience in counseling-related journals and has published extensively in the areas of school counseling and educational psychology. Chris is an advocate for systemic and strengths-based school-based counseling. Currently, his research agenda includes program evaluation, research methods in school counseling, and spirituality as an important feature of adolescent resiliency. Sink's (2011) latest book, published by Brooks/Cole, is called Mental Health Interventions for School Counselors.

Nyaradzo H. Mvududu, EdD, is an associate professor of curriculum and instruction at Seattle Pacific University (7 years). She teaches statistics and research courses in the School of Education as well as the School of Health Science. Nyaradzo's research interests are in statistics education. Her current research agenda focuses on investigating factors that impact the teaching and learning of statistics. She has published a number of journal articles in this area and made conference presentations.
