
One Way ANOVA

ANOVA stands for Analysis of Variance

ANOVA allows us to compare the means from more than two sets of scores.

A significant ANOVA indicates that changes in the independent variable affect the dependent variable.

ANOVA does not indicate which pairs of conditions are significantly different.

Use planned contrasts or unplanned (“post hoc”) contrasts to assess whether pairs of conditions are significantly different.

ANOVA Assumptions

1. Normally distributed populations


2. Equal population variances
3. Random sampling used
4. Dependent variable uses an interval or ratio scale
Digression on Scales: Levels of Measurement

Interval: 0 doesn’t mean “none” (e.g., IQ score)
- distances between points on the scale are equal, but ratios aren’t meaningful (e.g., temperature)

Ratio: Same as interval scale, but 0 means “none” and ratios are meaningful (e.g., weight or age: a person who is 50 is twice as old as one who is 25).

Nominal: Numbers are just labels for attributes (e.g., color)

Ordinal: categories have a logical order (e.g., ranks)

Digression on Scales: Types of Data

Continuous: Numerical data that can be fractional (e.g., temperature)

Discrete: Numerical data that cannot be fractional (e.g., number of World Cup trophies)

2
One-way ANOVA
National Airlines

National Airlines recently introduced a daily early-morning nonstop flight between Houston and Chicago. The vice president of marketing for National Airlines decided to perform a statistical test to see whether National’s average passenger load on this new flight is different from that of each of its two major competitors (which we will call competitor 1 and competitor 2). Ten early-morning flights were selected at random from each of the three airlines and the percentage of unfilled seats on each flight was recorded. These data are stored in an Excel file on the website at “National Airlines (Excel).”

Is there evidence that National’s average passenger load on the new flight is different from that of its two competitors? Report a p value and interpret the results of the statistical test.

3
Raw Data (in Excel)

4
Raw Data (in SPSS)

5
Transform Data into Analysis-Ready Form

6
Analyze → Compare Means → One-Way ANOVA

7
Post Hoc Contrasts

8
Results
Descriptives: Unfilled

               N    Mean    Std. Dev.   Std. Error   95% CI Lower   95% CI Upper   Min   Max
National       10    9.80   2.044       .646          8.34          11.26          7     13
Competitor 1   10   11.30   2.003       .633          9.87          12.73          7     13
Competitor 2   10   12.60   2.011       .636         11.16          14.04          9     15
Total          30   11.23   2.269       .414         10.39          12.08          7     15

ANOVA: Unfilled

                 Sum of Squares   df   Mean Square   F       Sig.
Between Groups    39.267           2   19.633        4.815   .016
Within Groups    110.100          27    4.078
Total            149.367          29
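As a sanity check, the F ratio and its p value can be recomputed from the sums of squares in the ANOVA table above, using Python’s scipy (an aside — the slides use SPSS, so this is just a sketch of the same arithmetic):

```python
from scipy import stats

# Reproduce the F test from the SPSS ANOVA table above
# (uses the printed sums of squares, not the raw Excel data).
ms_between = 39.267 / 2      # Mean Square, Between Groups
ms_within = 110.100 / 27     # Mean Square, Within Groups
F = ms_between / ms_within
p = stats.f.sf(F, 2, 27)     # upper-tail probability of F(2, 27)
print(round(F, 3), round(p, 3))   # matches the table: F = 4.815, Sig. = .016
```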

Post Hoc Tests

Multiple Comparisons

Dependent Variable: Unfilled
Bonferroni

(I) Airline    (J) Airline    Mean Diff. (I-J)   Std. Error   Sig.   95% CI Lower   95% CI Upper
National       Competitor 1   -1.500             .903         .325   -3.81            .81
               Competitor 2   -2.800*            .903         .013   -5.11           -.49
Competitor 1   National        1.500             .903         .325    -.81           3.81
               Competitor 2   -1.300             .903         .484   -3.61           1.01
Competitor 2   National        2.800*            .903         .013     .49           5.11
               Competitor 1    1.300             .903         .484   -1.01           3.61

*. The mean difference is significant at the .05 level.
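SPSS’s Bonferroni column is just each pairwise two-tailed p value multiplied by the number of comparisons (3 here). That can be verified from the summary statistics alone (a sketch; the function name `bonferroni_p` is mine, and MSW, n, and df come from the ANOVA output above):

```python
from math import sqrt
from scipy import stats

msw, n, df_within, n_comparisons = 4.078, 10, 27, 3
se = sqrt(msw * (1/n + 1/n))          # pooled SE of a mean difference, ~.903

def bonferroni_p(mean_diff):
    t = abs(mean_diff) / se
    p = 2 * stats.t.sf(t, df_within)  # two-tailed pairwise p
    return min(1.0, n_comparisons * p)

print(round(bonferroni_p(-2.800), 3))  # National vs. Competitor 2 (~.013)
print(round(bonferroni_p(-1.500), 3))  # National vs. Competitor 1 (~.325)
```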

9
The χ² Test

For data in which each outcome is assigned to exactly one of a set of mutually exclusive and exhaustive categories

Derived from square of Z statistic

Assumes: Independence of observations (can't take several observations from 1 person and analyze them with χ²)

Compares the observed and expected values

Needs at least 5 expected observations per cell

Values range from 0 on up (no negative values)
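The “derived from the square of Z” point can be seen numerically: a 1-df χ² test on Z² gives the same p value as a two-tailed Z test (a quick scipy illustration):

```python
from scipy import stats

# P(chi-square(1) > z^2) equals P(|Z| > z), so the two p values agree.
z = 1.96
p_z = 2 * stats.norm.sf(z)          # two-tailed Z test, ~.05
p_chi2 = stats.chi2.sf(z**2, df=1)  # chi-square test on Z^2
print(round(p_z, 6), round(p_chi2, 6))
```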

10
One-way χ²

Are There Differences Among the Levels of 1 Variable?

H0: P1 = P2 = P3 = P4 = 1/4
(where P1 + P2 + P3 + P4 = 1)
Ha: At least one population proportion ≠ 1/4

χ² = Σ (O − E)² / E

χ² = (O1 − E1)²/E1 + … + (Om − Em)²/Em

Where Oi and Ei are the observed and expected # of occurrences for m (exhaustive & mutually exclusive) outcomes

Squared deviations in which large disparities “count” for more than small disparities

df = (# levels in the independent variable − 1)

11
100 Analysts Rate an IPO
Strong Buy Buy Hold Sell Strong Sell
24 33 22 16 5

H0: PSB = PB = PH = PS = PSS
Ha: Not all of the proportions are equal

Observed & Expected (expected counts in parentheses)

Strong Buy   Buy    Hold   Sell   Strong Sell
24           33     22     16     5
(20)         (20)   (20)   (20)   (20)
Critical Value (df = 4, α = .05): χ²(4) = 9.49

χ² = Σ (O − E)²/E = 4²/20 + 13²/20 + 2²/20 + 4²/20 + 15²/20 = 21.50

Reject H0 because the test statistic (21.50) is greater than the critical value (9.49).
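The same one-way χ² can be run in a couple of lines with scipy (a check on the hand computation; `chisquare` defaults to equal expected counts):

```python
from scipy import stats

# One-way chi-square for the 100 analyst ratings (expected count 20 per cell).
observed = [24, 33, 22, 16, 5]
chi2, p = stats.chisquare(observed)       # default: equal expected frequencies
critical = stats.chi2.ppf(0.95, df=4)     # ~9.49
print(round(chi2, 2), round(critical, 2))  # 21.5 > 9.49, so reject H0
```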

12
100 Analysts Rate an IPO:
Testing Unequal Categories
Test the null hypothesis that twice as many analysts will offer some form of buy recommendation (either Strong Buy or Buy) as will offer either a hold or some form of sell recommendation (Sell or Strong Sell).

Collapse analysts’ recommendations into 3 categories:

Buy   Hold   Sell
57    22     21

H0: PB = 2PH = 2PS
Ha: The proportions do not follow this pattern

Critical Value (α = .05): χ²(2) = 5.99

Observed & Expected (expected counts in parentheses)

Buy    Hold   Sell
57     22     21
(50)   (25)   (25)

χ² = Σ (O − E)²/E = 49/50 + 9/25 + 16/25 = 1.98

Do not reject H0 because the test statistic (1.98) is less than the critical value (5.99).
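With unequal expected counts, the same test just needs the `f_exp` argument (a scipy sketch of the computation above):

```python
from scipy import stats

# Unequal expected counts: H0 says buys are twice as likely as holds or sells.
observed = [57, 22, 21]
expected = [50, 25, 25]
chi2, p = stats.chisquare(observed, f_exp=expected)
print(round(chi2, 2), round(p, 3))   # matches the hand computation and SPSS
```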

13
100 Analysts Rate an IPO:
Testing Unequal Categories in SPSS
Analyze → Nonparametric → Chi-Square

14
Set up the expected values for each category

Results
Analysts' Recommendations

Observed N Expected N Residual


Buy 57 50.0 7.0
Hold 22 25.0 -3.0
Sell 21 25.0 -4.0
Total 100

Test Statistics

                 Analysts' Recommendations
Chi-Square(a)    1.980
df               2
Asymp. Sig.      .372

a. 0 cells (.0%) have expected frequencies less than 5. The minimum expected cell frequency is 25.0.

15
Two-way χ²

Are There Differences Between 2 Variables?

H0: Variables A and B are independent
Ha: Variables A and B are dependent

df = (# rows - 1)(# columns - 1)

Tests nondirectional hypotheses only, using a single tail

16
SPSS: 2-Way χ² Tests

Vioxx data file:

Industry [Industry ties? 1=no, 2=yes]


Vioxx [Bring Vioxx back? 1=no, 2=yes]

17
Analyze → Descriptive Statistics → Crosstabs

Click on “Statistics”; select “chi-square”

18
Click OK. This is the χ² output:

Crosstabs
Case Processing Summary

                                     Valid         Missing      Total
                                     N    Percent  N   Percent  N    Percent
Industry Ties? * Bring Vioxx Back?   32   100.0%   0   .0%      32   100.0%

Industry Ties? * Bring Vioxx Back? Crosstabulation (Count)

                      Bring Vioxx Back?
                      no    yes   Total
Industry Ties?  no    14    8     22
                yes   1     9     10
Total                 15    17    32

19
Notice that the χ² test below is significant (p = .005), but not entirely reliable because the expected cell count in one of the cells is less than 5. (You should be able to verify that the lower left cell in the Crosstabulation above is the one with the undercount.)
Chi-Square Tests

                               Value      df   Asymp. Sig. (2-sided)   Exact Sig. (2-sided)   Exact Sig. (1-sided)
Pearson Chi-Square             7.942(b)   1    .005
Continuity Correction(a)       5.935      1    .015
Likelihood Ratio               8.893      1    .003
Fisher's Exact Test                                                    .007                   .006
Linear-by-Linear Association   7.694      1    .006
N of Valid Cases               32

a. Computed only for a 2x2 table
b. 1 cells (25.0%) have expected count less than 5. The minimum expected count is 4.69.

When one or more of your cells has an expected count less than
5, report Fisher's Exact Test (in the SPSS output). Fisher’s
Exact Test has no test statistic, no critical value, and no
confidence interval. Report it as follows: “p = .007, Fisher’s
Exact Test, 2-tailed.”
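The same 2-way analysis can be sketched with scipy: `chi2_contingency` reproduces the Pearson χ² and the expected counts (flagging the small cell), and `fisher_exact` gives the test to report in that case:

```python
from scipy import stats

# 2 x 2 table from the Vioxx crosstab: rows = industry ties (no/yes),
# columns = bring Vioxx back (no/yes).
table = [[14, 8], [1, 9]]

chi2, p, df, expected = stats.chi2_contingency(table, correction=False)
print(round(chi2, 3), round(p, 3))        # Pearson chi-square, as in SPSS
print(round(float(expected.min()), 2))    # minimum expected count (< 5 here)

# With a small expected count, report Fisher's exact test instead:
odds, p_fisher = stats.fisher_exact(table)
print(round(p_fisher, 3))                 # 2-tailed exact p
```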

20
Correlation
How do the scores on one variable change with the scores on
another variable?

Correlations are concerned with measuring the direction and magnitude of a linear relationship between two variables.

The stronger the correlation, the more accurately we can predict Y from knowing X.

Scatterplot: A graph containing clusters of dots that represent all X-Y pairs of observations.

Involves an examination of pairs of X-Y scores (one-sample procedure).

21
Correlation Coefficients
Measures the extent to which the individual Xi-Yi scores that make up a pair occupy the same or opposite positions within their distributions.
- Pos relation: Pairs tend to occupy similar relative positions in their distributions
- Neg relation: Pairs tend to occupy opposite relative positions in their distributions

Two types (there are others as well):
- Pearson r (continuous data): rxy
- Phi coefficient (binary variables: 2 × 2 tables): φ

Range from -1 to 1
1 = perfect pos relation
-1 = perfect neg relation
0 = No relation

Failure to find a strong r may mean:
(a) chance,
(b) the variables are unrelated, or
(c) the variables are related nonlinearly.

22
r Computation (by hand)
1. Transform each Y score into a Z score (Zy)
2. Transform each X score into a Z score (Zx)
3. Determine correspondence between each of the paired Zs
- r indicates the average correspondence between the paired
Zs.

r = Mean of the crossproduct of Z scores.

Population: r = Σ ZxZy / N          Sample: r = Σ ZxZy / (N − 1)

(note: Zs will differ for population & samples because the denominator for computing population Zs is σ and the denominator for computing sample Zs is s.)

When large pos correspondence: Z crossproduct is pos & large
When small neg correspondence: Z crossproduct is neg & small
- (lots of + and − canceling each other out)

Strength of Relationship

r2 = Proportion of variability of Y accounted for by X

23
Strong Correlation
(population computation)
Student     # High School A's (X)   # College A's (Y)   Zx      Zy
Alejandro   13                      14                   1.50    0.50
Bernardo     9                      18                   0.50    1.50
Carlos       7                      12                   0.00    0.00
Dominique    5                      10                  -0.50   -0.50
Enrique      1                       6                  -1.50   -1.50

r = Σ ZxZy / N = [(1.5)(.5) + (.5)(1.5) + (0)(0) + (−.5)(−.5) + (−1.5)(−1.5)] / 5 = 0.80
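The mean-crossproduct definition can be checked against scipy’s `pearsonr` on the same five students (a sketch; note the population z scores use ddof = 0):

```python
import numpy as np
from scipy import stats

# Pearson r as the mean crossproduct of population z scores.
x = np.array([13, 9, 7, 5, 1])     # high school A's
y = np.array([14, 18, 12, 10, 6])  # college A's

zx = (x - x.mean()) / x.std()      # population SD (ddof=0)
zy = (y - y.mean()) / y.std()
r_manual = (zx * zy).mean()

r, p = stats.pearsonr(x, y)        # same r; p matches the SPSS output (.104)
print(round(r_manual, 2), round(r, 2), round(p, 3))
```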

24
Strong Correlation: Using SPSS
Analyze  Correlate  Bivariate

Correlation Output
Correlations

                                    High School   College
High School   Pearson Correlation   1             .800
              Sig. (2-tailed)                     .104
              N                     5             5
College       Pearson Correlation   .800          1
              Sig. (2-tailed)       .104
              N                     5             5

25
Two Points of Caution with Correlations
1. Restriction of range (i.e., truncated range) problem

When the relevant range of X or Y scores is a truncated part of the whole, the truncated X-Y correlation will be smaller than the whole X-Y correlation.
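Restriction of range can be illustrated with the earlier five-student example: dropping the lowest X scorer shrinks the observed correlation (a sketch; the truncation point X ≥ 5 is my choice for illustration):

```python
import numpy as np
from scipy import stats

x = np.array([13, 9, 7, 5, 1])
y = np.array([14, 18, 12, 10, 6])

r_full, _ = stats.pearsonr(x, y)                       # 0.80 on the full range
r_truncated, _ = stats.pearsonr(x[x >= 5], y[x >= 5])  # keep only X >= 5
print(round(r_full, 2), round(r_truncated, 2))         # truncated r is smaller
```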

2. Correlation does not mean causation
- may be a correlated 3rd variable
- Even if no 3rd variable is involved, it’s not always clear which variable is the cause and which is the effect.

26
Phi Coefficient Φ
Correlation for Categorical Data (2 × 2 Tables):

a   b
c   d

φ = (ad − bc) / √[(a + b)(c + d)(a + c)(b + d)]

Yes   No
50    20
10    4

φ = 0 (since ad − bc = 200 − 200 = 0)

27
Phi Coefficient Φ (using SPSS)
10 5
5 8

Analyze → Descriptive Statistics → Crosstabs → Statistics

28
Click “Statistics” and check “Phi and Cramer’s V”

Symmetric Measures

                                  Value   Approx. Sig.
Nominal by Nominal   Phi          .282    .136
                     Cramer's V   .282    .136
N of Valid Cases                  28

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.

(ignore “Cramer’s V”)
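The φ value SPSS reports for this table follows directly from the formula above (a quick check in plain Python):

```python
from math import sqrt

# Phi for the 2 x 2 table analyzed in SPSS above.
a, b, c, d = 10, 5, 5, 8
phi = (a*d - b*c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(round(phi, 3))   # matches the SPSS value (.282)
```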

29
Regression
Regression: The primary purpose of regression is prediction

Predictions about the linear relationship between independent and dependent variables.

Independent = predictor = explanatory
Dependent = response = criterion

Types of Regression

1. Linear (least squares regression line)
- Simple regression: one predictor variable
- Multiple regression: multiple predictor variables

2. Nonlinear (can linearize many of these via transformation)
- Positive curvilinear (e.g., diminishing marginal utility)
- Polynomial (quadratic – parabola-shaped; cubic)
- Exponential or negative curvilinear (L-shaped)

3. Logistic (when dependent variable is categorical)
- Example: graduate or not; sales are weak/moderate/strong

30
Lines: Y = b0 + b1X

b0 and b1 are regression coefficients
- can be positive or negative
- b1 is more important than b0

b0 = Y intercept (value of Y when X = 0)
b1 = Slope (how much Y changes when X changes by 1 unit)

Example #1: Suppose Aeromexico wants to examine the relation between number of flight delays and number of passenger complaints.

X = Number of flight delays
Y = Number of passenger complaints

Suppose that the data are as follows (X, Y): (0, 1), (1, 3), (2, 5)

31
Scatterplot: Delays vs. Complaints
[Scatterplot: complaints (Y) vs. delays (X)]

The line that fits these data perfectly is: Y = 1 + 2X
# Complaints = 1 + 2 (# Flight Delays)

X   Y
0   1 + 2(0) = 1   (0, 1)
1   1 + 2(1) = 3   (1, 3)
2   1 + 2(2) = 5   (2, 5)
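Because the three points lie exactly on a line, any least-squares fit recovers Y = 1 + 2X with zero residual error (a numpy sketch):

```python
import numpy as np

# Fit a degree-1 polynomial (a line) to the perfectly linear delay data.
x = np.array([0, 1, 2])
y = np.array([1, 3, 5])
slope, intercept = np.polyfit(x, y, 1)
print(intercept, slope)   # recovers b0 = 1, b1 = 2
```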

32
But what if the scatterplot looked like this?

[Scatterplot: complaints (Y) vs. delays (X), now with substantial scatter]
We’ll need to estimate a line of best fit using linear regression

33
The regression line (also called the “least squares regression line”) minimizes the squared difference between the observed and predicted values of the response variable (as given by the regression line).

- The difference between the actual and predicted values is called the “residual.”

- Minimizing these squared residuals gives a slope that is as close as possible to the true slope.

[Scatterplot: complaints (Y) vs. delays (X) with fitted regression line; R Sq Linear = 0.475]

(We’ll talk about what “R Sq Linear” means later on…)

34
Example #2: Suppose UT wants to examine relation between alumni
donations to the school and number of football victories.
X = Number of football victories
Y = Amount of alumni donations the following year

Alumni Donations = $10,000,000 + $200,000 (# Football Victories)

Y = 10,000,000 + 200,000X

Caution #1: X-Y relation may not be causal

Caution #2: Regression line estimates are most trustworthy near the bulk of the data (usually the center).

35
Linear Regression Assumptions

1. Linearity
- Linear relationship between X and Y
- Same expected change in Y moving from X1 to X2 vs. moving from X2 to X3

Test: X-Y scatterplot (look for nonlinearities)

Correction: Insert a curvilinear term (usually quadratic: X²)
Y = b0 + b1X + b2X²

Or: log transformation of the X variable (if data are positive)
- brings large values down, pushes small values further apart

2. Independence of Observations
- Residuals across Xs are not correlated

Test: Durbin-Watson (0-4) (tests residual correlation among Xs)
0.0 - 1.5 = pos correlation
1.5 - 2.5 = no correlation
2.5 - 4.0 = neg correlation

Correction: Transformation of Y variable (percentages or logs)
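The Durbin-Watson statistic itself is simple: squared successive residual differences over squared residuals (a minimal sketch; the function name is mine, and the alternating residuals are made-up illustration data):

```python
import numpy as np

# Durbin-Watson: near 2 = no autocorrelation, near 0 = positive,
# near 4 = negative (perfectly alternating residuals push it toward 4).
def durbin_watson(residuals):
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

print(durbin_watson([1, -1, 1, -1]))  # alternating signs -> 3.0 (negative side)
```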

36
3. Normality
- The distrib. at each Xi is normal
- The errors have a normal distribution

Test #1: Plot histogram of residuals (should be normal)

Test #2: Normal Prob. Plot
- Plot of cumulative probabilities
- Should follow the diagonal (if residuals follow a normal distrib.)

Correction: Log transformation of Y

4. Constant Variance (homoskedasticity)
- Each Yi distrib. has the same variance
- Means that the effects of other factors do not depend on the level of X
- Common problem: Variance goes up as X increases (funnel shape)

Test: Scatterplot of X vs. Residuals
- Should not show a funnel-shaped pattern

Correction: Log transformation of Y

37
Example: Simple Linear Regression
Houston Astros Payroll
Identify a regression equation that predicts the median salary for a Houston Astros baseball player based on knowledge of the total team payroll.

Independent variable: Total Payroll
Dependent variable: Median Salary

Here are your data (figures are in thousands).

You can access this data file on the website as well (“Houston Astros salary data”).

38
1. Create X–Y Scatterplot

Graphs → Scatter → Simple → Define → OK

39
Median Salary – Total Payroll Scatterplot

[Scatterplot: Median Salary (Y) vs. Total Payroll (X)]

This scatterplot shows that the linearity assumption is OK
- we’ll check the other 3 assumptions shortly

40
2. Visual check for outliers (remove if necessary)

3. Add regression line:


Double click on graph

Single click on a data point (it will enlarge and change color)

Elements → Fit Line at Total

41
Fit Line at Total → Linear

42
4. Conduct Regression Analysis

43
Put Independent and Dependent variables in the right boxes

Click Statistics

44
Click Plots

45
Click Save

By checking these boxes, you will create extra columns on your data
file. You will get a Predicted Values (“PRE_1”) column and a Residual
Values (“RES_1”) column.

46
5. Examine Regression Output

Model Summary(b)

Model 1: R = .754(a), R Square = .569, Adjusted R Square = .540, Std. Error of the Estimate = 220.53978
Change Statistics: R Square Change = .569, F Change = 19.790, df1 = 1, df2 = 15, Sig. F Change = .000
Durbin-Watson = 2.346

a. Predictors: (Constant), Total Payroll
b. Dependent Variable: Median Salary

ANOVAb

Sum of
Model Squares df Mean Square F Sig.
1 Regression 962530.2 1 962530.159 19.790 .000a
Residual 729566.9 15 48637.793
Total 1692097 16
a. Predictors: (Constant), Total Payroll
b. Dependent Variable: Median Salary

Coefficientsa

Unstandardized Standardized
Coefficients Coefficients
Model B Std. Error Beta t Sig.
1 (Constant) 110.736 111.951 .989 .338
Total Payroll .012 .003 .754 4.449 .000
a. Dependent Variable: Median Salary

47
6. Is model statistically significant?

Yes, because F = 19.79, p = .000 (i.e., p < .001).

7. Identify equation for the simple linear model (i.e., the regression line)

Coefficientsa

Unstandardized Standardized
Coefficients Coefficients
Model B Std. Error Beta t Sig.
1 (Constant) 110.736 111.951 .989 .338
Total Payroll .012 .003 .754 4.449 .000
a. Dependent Variable: Median Salary

Y = Intercept + B · (X)

Median Salary = 110.736 + .012 (Total Payroll)

Or, in actual dollars:


Median Salary = $110,736 + .012 (Total Payroll)
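Applying the fitted line is just arithmetic; a small sketch (the function name is mine, and the coefficients come from the SPSS output above, with payroll in thousands of dollars):

```python
# Apply the fitted regression line (figures in thousands of dollars).
b0, b1 = 110.736, 0.012

def predicted_median_salary(total_payroll):
    return b0 + b1 * total_payroll

# E.g., the 2005 payroll of $76,779 thousand:
print(predicted_median_salary(76779))  # ~1032.084 thousand, i.e. ~$1,032,084
```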

48
8. Check the other 3 linear regression assumptions

8a. Independence: D-W = 2.346 (OK, because it’s between 1.5 and 2.5)

8b. Normality:
- Histogram of residuals (is it normal?)
- Normal Prob. Plot (are points near the diagonal?)

Histogram
Dependent Variable: Median Salary

[Histogram of regression standardized residuals; Mean = -6.94E-17, Std. Dev. = 0.968, N = 17]

OK, because residuals have a roughly normal shape

49
Normal P-P Plot of Regression Standardized Residual
Dependent Variable: Median Salary

[P-P plot: Expected Cum Prob (Y) vs. Observed Cum Prob (X)]

OK, because points are near the diagonal

50
8c. Constant variance: Is there an absence of a funnel shape in the scatterplot of X vs. Residuals?

Go to your Modified Data File:

51
Here’s a look at your data file ordered from lowest to highest payroll,
where some of the columns are rearranged to make it more readable:

52
Test the Constant Variance assumption by looking at the
X vs. Residuals scatterplot. Check for funnel pattern.

[Scatterplot: Unstandardized Residual (Y) vs. Total Payroll (X)]

There’s a hint of a funnel pattern here.
(Consider a log transformation of the Y variable – Median Salary)

53
9. Search output for “Casewise Diagnostics” that describe outliers

None were found here, so nothing shows up in the SPSS output.

But if you changed the Casewise Diagnostics (in “Statistics”) to show outliers beyond 1 sd, here’s what you’d get:


Casewise Diagnostics(a)

Predicted
Case Number Std. Residual Median Salary Value Residual
3 1.060 500.00 266.1707 233.82928
8 -1.320 185.00 476.0631 -291.06310
14 2.229 1300.00 808.3513 491.64868
15 -1.558 500.00 843.7011 -343.70113
16 1.218 1200.00 931.4056 268.59437
17 -1.051 750.00 981.7387 -231.73868
a Dependent Variable: Median Salary

54
Don’t Trust Your Model TOO Much…
Question:
The Houston Astros payroll in 2005 = $76,779,000. What does the regression line predict the median salary will be?

Answer:
Predicted Median Salary =
$110,736 + (.012)(76,779,000) = $1,032,084

Actual: $500,000

Question:
Why was the model so far off?

55
1988 Houston Astros (Total payroll = $13,455,000; Median = $500,000)

56
2005 Houston Astros (Total payroll = $76,779,000; Median = $500,000)

57
