
TOPIC                                         Sub topic                                                        Link
Inference and Statistical Significance (1)    Inferences about the population mean                             Population mean
Inference and Statistical Significance (2)    Inferences about the difference between two population means     Difference between two population means
Inference and Statistical Significance (3)    Inferences about the population regression coefficients          The significance of linear regression Coefficients
Inference and Statistical Significance (4)    Inferences about the association between categorical variables   Chi squared statistic
Inference and Statistical Significance (5)    Inferences about the population regression coefficients          Regression
CONTENTS Population mean

Inferences about the population mean


CHILDREN   POP MEAN
0          2.4
1          2.4
4          2.4
1          2.4
2          2.4
1          2.4
0          2.4
4          2.4
3          2.4
2          2.4
3          2.4
1          2.4
0          2.4
1          2.4
0          2.4
1          2.4

AVERAGE
1.5

You may have seen or heard of the TV programme called 'Two Point Four Children'.
The premise of this programme was the Census information that, for the UK population, there
were on average 2.4 children per family unit.
Now look at the data in the A5:A20 range. Here we have the number of children in each of a
random sample of 16 family units. The average number of children for this sample was calculated
in A23 and is seen to be 1.5.
This illustrates the fundamental concept of Inference - namely, what can we infer from this sample
value of 1.5 about the average number of children in the population as a whole, which we believe to be 2.4?
There are two possibilities:
a) things in the population have changed since the census and the population average (mean) number
of children is no longer 2.4.
b) the population average is in fact still 2.4 but the sample is not representative and by chance we
have obtained an inordinately low value for the sample average.
Now since we can never be certain about either of these possibilities we will have to use a
"balance of probabilities" approach. In other words we can calculate the probability of obtaining
a sample average as low as 1.5 from a population in which the actual average was in fact 2.4.
This is the essence of inference, since if we can calculate this probability it will allow us to
choose between the two possibilities in a logical manner. Put simply, if the probability of obtaining
a sample average as low as 1.5 from a population whose average was in fact 2.4 is 1% then we are
more likely to infer that the supposition is untrue than if this probability is 99%.
In the former case we would reject the presumed value of 2.4, but in the latter case we would
have no reason to reject it.
To calculate this probability we use a data analysis routine called
t-Test: Paired Two Sample for Means
But before we can use this we must replicate the presumed population mean alongside
the sample data. This has been done in the B5:B20 range. Because every value in that column is 2.4,
each paired difference is simply a sample value minus 2.4, so the routine tests the sample mean against 2.4.
Now select the option above from Data Analysis. The variable 1 range is A4:A20, the variable 2 range
is B4:B20, Labels is checked and the output range is selected to be A40. Keep the default alpha value of 0.05.
The results should be the same as those in the following link. Study the comments carefully
for an explanation of the meaning of the calculated statistics. Inference 1
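The t statistic reported by the routine can be reproduced by hand. The following is a minimal Python sketch, not part of the workbook, that applies the usual formula t = (sample mean - presumed mean) / (s / sqrt(n)) to the 16 values in A5:A20. The variable names are illustrative, and the critical value of 2.13 is the two-tail figure at alpha = 0.05 shown in the Data Analysis output.

```python
import math
import statistics

# Number of children in each of the 16 sampled family units (range A5:A20).
children = [0, 1, 4, 1, 2, 1, 0, 4, 3, 2, 3, 1, 0, 1, 0, 1]
presumed_mean = 2.4  # population mean presumed from the Census

n = len(children)
sample_mean = statistics.mean(children)
sample_sd = statistics.stdev(children)       # sample standard deviation (n - 1 divisor)
standard_error = sample_sd / math.sqrt(n)
t_stat = (sample_mean - presumed_mean) / standard_error

# Two-tail critical value for 15 degrees of freedom at alpha = 0.05,
# as reported in the Data Analysis output.
t_critical_two_tail = 2.13
reject_presumed_mean = abs(t_stat) > t_critical_two_tail

print(f"mean={sample_mean:.2f} sd={sample_sd:.2f} se={standard_error:.2f} t={t_stat:.2f}")
print("reject presumed mean of 2.4:", reject_presumed_mean)
```

Comparing the absolute t statistic with the critical value is equivalent to comparing the two-tail probability with alpha, so the sketch needs no statistics library.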
PROCEED TO SHEET 2 (Difference between two population means)
CONTENTS Difference between two population means

Inferences about the difference between the two populations.


A CHILDREN   B CHILDREN
0            5
1            3
4            4
1            2
2            4
1            3
0            2
4            4
3            3
2            2
3            3
1            2
0            3
1            2
0            4
1            1

The previous sheet used data from a single sample to make inferences about
a presumed population value.
Sometimes however we will want to take samples from two populations and use these to make
inferences about the difference between the two populations.
For example we might calculate the difference between the two sample means and use this to make
an inference about the difference between the means of the two populations from which they were
drawn. Usually the presumption will be that the difference between the population means is zero,
and this will be tested against any of three alternatives:
The difference is not equal to zero
The difference is greater than zero
The difference is less than zero
To see how this can be done, look at the data in the yellow area.
Here we have taken a random sample of size 16 from each of two regions (A and B) and recorded the
number of children in each family unit.
The question is - can we say that there is a significant difference in the mean number of children
between family units in the two regions?
We do this once again with the Data Analysis routine's option:
t-Test: Paired Two Sample for Means
The variable 1 range is A4:A20, the variable 2 range is B4:B20, Labels is checked and the
output range is selected to be A32. Change the default alpha value from 0.05 to 0.01.
The results should be the same as those in the following link. Study the comments carefully
for an explanation of the meaning of the calculated statistics. Inference 2
The calculated t Stat of -3.71 is so large in absolute value that we have no difficulty in accepting either
of two of the alternative presumptions, namely that the mean difference is not equal to zero or that it is
less than zero.
That is, we reject the presumption that the difference in the mean number of children is zero,
and accept with 1 - 0.001 = 0.999 = 99.9% confidence that the mean difference is less than zero.
Alternatively we can accept with 1 - 0.002 = 0.998 = 99.8% confidence that the mean difference is
not equal to zero.
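The same statistic can be checked outside Excel. The following minimal Python sketch, not part of the workbook, mirrors the paired routine: it treats row i of region A and row i of region B as a pair, takes the differences A - B, and divides their mean by their standard error. The names are illustrative, and the critical values 2.60 (one-tail) and 2.95 (two-tail) are the alpha = 0.01 figures shown in the Data Analysis output.

```python
import math
import statistics

# Number of children per family unit in each regional sample (ranges A5:A20 and B5:B20).
region_a = [0, 1, 4, 1, 2, 1, 0, 4, 3, 2, 3, 1, 0, 1, 0, 1]
region_b = [5, 3, 4, 2, 4, 3, 2, 4, 3, 2, 3, 2, 3, 2, 4, 1]

# Paired differences, row by row, as the paired routine computes them.
differences = [a - b for a, b in zip(region_a, region_b)]
n = len(differences)
mean_difference = statistics.mean(differences)
standard_error = statistics.stdev(differences) / math.sqrt(n)
t_stat = mean_difference / standard_error

# Critical values for 15 degrees of freedom at alpha = 0.01, from the output.
t_critical_one_tail = 2.60
t_critical_two_tail = 2.95

difference_not_zero = abs(t_stat) > t_critical_two_tail   # two-tail alternative
difference_below_zero = t_stat < -t_critical_one_tail     # one-tail alternative

print(f"mean difference={mean_difference:.4f} t={t_stat:.2f}")
print("not equal to zero:", difference_not_zero, "| less than zero:", difference_below_zero)
```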
PROCEED TO SHEET 3 (The significance of linear regression Coefficients)
CONTENTS The significance of linear regression Coefficients

Testing the significance of regression parameters


Look at the Regression Output below. It is the one that we obtained for the regression of sales on price and advertising expenditure in the Excel_4 file.
The calculated coefficients in B31, B32 and B33 are sample estimates of the unknown true population
coefficients. This being the case we should make some inferences about their reliability as estimates.
To do this we presume that the true value of each coefficient is zero and we test each one
against the alternative presumption that it is not equal to zero.
Excel does this in the yellow area below and the meaning of each statistic is explained in the comments.

We should also note that in F26, under the heading Significance F, Excel has calculated the value to be
0.00000035307 (displayed as 0 in the output below because of rounding). This very small probability
measures the chance of obtaining an F statistic as large as the one calculated if the true population value
of R squared were in fact zero. Since this is very unlikely, we can be fairly certain that the
calculated R squared value of 0.93856 is significantly different from zero.
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.97
R Square             0.94
Adjusted R Square    0.92
Standard Error       1030.04
Observations         12

ANOVA
              df    SS              MS             F        Significance F
Regression    2     145881218.66    72940609.33    68.75    0
Residual      9     9548781.34      1060975.7
Total         11    155430000

              Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept     19668.96        2728.32           7.21      0          13497.07     25840.86
ADVEX         0.53            0.05              10.18     0          0.41         0.65
PRICE         -6.41           0.78              -8.17     0          -8.18        -4.63
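The summary figures in this output are all derived from the ANOVA sums of squares. The following minimal Python sketch, not part of the workbook and with illustrative names, recomputes the mean squares, F, R squared, adjusted R squared and the standard error from the SS and df columns using the standard ANOVA identities.

```python
import math

# Sums of squares and degrees of freedom from the ANOVA table.
ss_regression = 145881218.66
ss_residual = 9548781.34
df_regression = 2
df_residual = 9

ms_regression = ss_regression / df_regression   # mean square = SS / df
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual            # F = MS regression / MS residual

ss_total = ss_regression + ss_residual
df_total = df_regression + df_residual
r_squared = ss_regression / ss_total
adjusted_r_squared = 1 - (1 - r_squared) * df_total / df_residual
standard_error = math.sqrt(ms_residual)         # standard error of the regression

print(f"MS regression={ms_regression:.2f} F={f_stat:.2f}")
print(f"R squared={r_squared:.5f} adjusted={adjusted_r_squared:.2f} SE={standard_error:.2f}")
```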
PROCEED TO SHEET 4 (Chi squared statistic)
Using the CHI SQUARED Statistic to test the degree of association of a cross tabulation
Look at the cross tabulation reproduced below. It is the one we obtained in sheets 1&2 of the Excel_3 file
The question raised is whether on the basis of these observed frequencies we can make any inferences
about the degree of association between gender and verdict.
To do this requires that we have some idea of the frequencies that would be expected on a purely
random basis and then compare the actually observed frequencies with the expected ones.
To calculate the expected frequencies we argue as follows. 29/50 are Female and 24/50 fail, so if gender and verdict
were unrelated a proportion 29/50*24/50 = 0.2784 of the total of 50, that is 13.92, would be expected to be Female
and fail. Once we cancel the grand total (50) from top and bottom we get:
expected frequency = column total*row total/grand total = 29*24/50 = 13.92.
To calculate all four expected frequencies in Excel we proceed as follows. First note we have named
the D24 cell as GT (grand total). Now in B26 enter:
=B$24*$D22/GT
and copy this cell down into B27 and then along into C26:C27. The expected frequencies will be calculated.
With the actual frequencies located in the B22:C23 range and the expected in the B26:C27 range we can now
use an Excel function known as =CHITEST to test these data for association. Note that =CHITEST does not return
the Chi Squared statistic itself: it returns the significance (probability) associated with that statistic.
The syntax for this function is =CHITEST(range of actual frequencies,range of expected frequencies)
So use B29 to contain: =CHITEST(B22:C23,B26:C27)

Count of VERDICT    GENDER
VERDICT             F        M        Grand Total
FAIL                14       10       24
PASS                15       11       26
Grand Total         29       21       50

Expected frequencies
FAIL                13.92    10.08
PASS                15.08    10.92

A figure of 0.963404 will be returned, but what does this tell us?
The chi squared statistic is the sum, over all cells of the table, of (actual - expected)^2/expected.
It can be shown that if every observed frequency were equal to the expected frequency then chi squared
would be calculated as zero. In this case there is no association between the two variables - the actual
frequencies are just what would be expected by chance. However, as the actual and the expected frequencies
start to diverge, the calculated value of chi squared increases and as it does so provides evidence of a
degree of association between the variables. For these data the statistic is approximately 0.0021.
But how high does it have to become before it becomes significant? The answer is to compute the significance
of the calculated chi squared statistic from Excel's =CHIDIST function.
The syntax is: =CHIDIST(Calculated Chi Squared Value,degrees of freedom)
Now note that in an R by C contingency (Pivot) table the Degrees of Freedom are given by (R-1)*(C-1).
So with 2 rows and 2 columns our table has (2-1)*(2-1) = 1 degree of freedom.
Applying =CHIDIST to the chi squared value of about 0.0021 with 1 degree of freedom gives about 0.9634,
which is the figure that =CHITEST has already returned in B29. B29 should therefore not itself be passed to
=CHIDIST, since it holds a probability rather than a chi squared value.
As with previous significance tests we require a value of 0.05 or less to provide an acceptable level
of significance. The figure of 0.963404 is far above this, so these data provide no evidence of an
association between gender and verdict.
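The calculation can also be traced outside Excel. The following minimal Python sketch, not part of the workbook and with illustrative names, rebuilds the expected frequencies from the row and column totals, sums (actual - expected)^2/expected to obtain the chi squared statistic, and converts it to a significance value. It assumes exactly 1 degree of freedom, for which the upper-tail chi squared probability equals erfc(sqrt(x/2)); a larger table would need a statistics library instead.

```python
import math

# Observed frequencies from the cross tabulation (rows: FAIL, PASS; columns: F, M).
observed = [[14, 10],
            [15, 11]]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency = row total * column total / grand total.
expected = [[r * c / grand_total for c in column_totals] for r in row_totals]

# Chi squared statistic = sum over cells of (observed - expected)^2 / expected.
chi_squared = sum(
    (o - e) ** 2 / e
    for observed_row, expected_row in zip(observed, expected)
    for o, e in zip(observed_row, expected_row)
)

degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
if degrees_of_freedom != 1:
    raise ValueError("the erfc shortcut below is only valid for 1 degree of freedom")

# Upper-tail probability of chi squared with 1 degree of freedom.
significance = math.erfc(math.sqrt(chi_squared / 2))

print("expected:", [[round(e, 2) for e in row] for row in expected])
print(f"chi squared={chi_squared:.4f} significance={significance:.6f}")
```

For these data the statistic is about 0.0021 and the significance about 0.9634, which is the probability that =CHITEST returns.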
CONTENTS Regression

Using regression to investigate consumption relations between different products


Consider the data shown in the yellow highlighted area. Here, over 30 time periods, we have collected data
on the following variables:
QDS = the quantity of strawberries demanded
PS = the price of strawberries
PR = the price of raspberries
PP = the price of peaches
PIC = the price of ice cream
PC = the price of cream
Use multiple regression to fit an equation of the form:
QDS = a + b*PS + c*PR + d*PP + e*PIC + f*PC
TIME         QDS       PS       PR       PP       PIC      PC
Period 1     5827.4    £1.40    £2.24    £2.12    £1.62    £1.56
Period 2     5727.4    £1.36    £1.56    £1.54    £2.03    £2.20
Period 3     5842.1    £1.48    £2.29    £2.20    £1.58    £1.80
Period 4     5767.8    £1.50    £2.10    £1.70    £2.18    £1.92
Period 5     5765      £1.30    £2.04    £2.00    £2.10    £2.10
Period 6     5793.2    £1.32    £1.96    £1.64    £1.49    £1.66
Period 7     5721.1    £1.38    £2.23    £1.46    £2.34    £1.86
Period 8     5719.5    £1.45    £1.43    £1.78    £2.12    £1.95
Period 9     5702.3    £1.45    £1.79    £1.62    £2.04    £1.66
Period 10    5794.9    £1.40    £1.77    £1.29    £1.49    £1.63
Period 11    5732.1    £1.52    £1.93    £1.87    £1.62    £1.88
Period 12    5774.3    £1.50    £2.37    £1.38    £1.94    £1.79
Period 13    5774.5    £1.48    £1.41    £1.93    £1.42    £1.95
Period 14    5805      £1.38    £2.40    £1.82    £1.88    £1.70
Period 15    5748.4    £1.38    £1.98    £1.28    £2.30    £1.48
Period 16    5763.3    £1.40    £2.39    £1.37    £1.95    £1.69
Period 17    5836.8    £1.49    £1.90    £1.88    £1.53    £1.36
Period 18    5726.3    £1.50    £1.85    £2.08    £1.82    £1.68
Period 19    5788.4    £1.53    £2.36    £1.59    £1.46    £1.79
Period 20    5733.3    £1.54    £1.41    £1.84    £1.86    £1.28
Period 21    5705      £1.49    £1.98    £1.63    £2.31    £1.82
Period 22    5800.8    £1.47    £1.96    £1.57    £1.92    £1.45
Period 23    5710.7    £1.40    £1.79    £1.92    £2.28    £1.87
Period 24    5825.5    £1.40    £2.09    £1.54    £1.50    £1.37
Period 25    5750.2    £1.35    £1.42    £2.18    £1.60    £1.41
Period 26    5753.1    £1.35    £1.87    £1.63    £1.91    £1.63
Period 27    5728.4    £1.40    £1.46    £1.96    £2.36    £1.50
Period 28    5795.3    £1.37    £2.17    £1.24    £1.69    £1.77
Period 29    5801.9    £1.32    £2.05    £1.52    £1.56    £1.98
Period 30    5755.4    £1.40    £2.26    £1.30    £2.32    £2.20

Which of the calculated regression coefficients are significantly different from zero at the
1% level of significance?

By inspecting the signs and magnitudes of the regression coefficients, which products can be said
to be good substitutes for strawberries and which can be said to be good complements?

A suggested solution is contained in the following link
Inference 3
CONTENTS

CHILDREN   POP MEAN
0          2.4
1          2.4
4          2.4
1          2.4
2          2.4
1          2.4
0          2.4
4          2.4
3          2.4
2          2.4
3          2.4
1          2.4
0          2.4
1          2.4
0          2.4
1          2.4

AVERAGE
2.4

t-Test: Paired Two Sample for Means

                                CHILDREN    POP MEAN
Mean                            1.5         2.4
Variance                        1.87        0
Observations                    16          16
Pearson Correlation             0
Hypothesized Mean Difference    0
df                              15
t Stat                          -2.63
P(T<=t) one-tail                0.01
t Critical one-tail             1.75
P(T<=t) two-tail                0.02
t Critical two-tail             2.13

Sample Standard Deviation       1.37
Sample Standard Error           0.34
Mean difference                 -0.9
t Stat                          -2.63

Return to Population mean


CONTENTS

A CHILDREN   B CHILDREN
0            5
1            3
4            4
1            2
2            4
1            3
0            2
4            4
3            3
2            2
3            3
1            2
0            3
1            2
0            4
1            1

t-Test: Paired Two Sample for Means

                                A CHILDREN    B CHILDREN
Mean                            1.5           2.94
Variance                        1.87          1.13
Observations                    16            16
Pearson Correlation             0.21
Hypothesized Mean Difference    0
df                              15
t Stat                          -3.71
P(T<=t) one-tail                0
t Critical one-tail             2.6
P(T<=t) two-tail                0
t Critical two-tail             2.95

Return to Difference between two population means


CONTENTS

For a coefficient to be significant at the 1% level the figure in the P-value column
of the regression output must be less than or equal to 0.01. Those that meet
this requirement (Intercept, PR and PIC) have been highlighted in green in the output below.

The positive coefficients for PR and PP indicate that they are substitutes for
strawberries - if PR or PP increases (decreases) the demand for strawberries
will increase (decrease). Since the coefficient of PR exceeds that for PP,
raspberries are a closer substitute than peaches.

The negative coefficients for PIC and PC indicate that they are complements for
strawberries - if PIC or PC increases (decreases) the demand for strawberries
will decrease (increase). Since the coefficient of PIC exceeds that for PC in absolute value,
ice cream is more closely tied to strawberry sales than cream.

Overall the regression is far from robust. Although the R squared is quite
high (69.7%) and significantly different from zero, the fact that only two out of the
five variable coefficients are significant means that the model is unlikely to perform
well for prediction purposes. The highly significant intercept term
confirms this since it does not refer to an explanatory variable.
Finally the fact that the coefficient for the price of strawberries themselves
is not significant should give us cause for concern.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.84
R Square             0.7
Adjusted R Square    0.63
Standard Error       24.55
Observations         30

ANOVA
              df    SS          MS        F        Significance F
Regression    5     33309.52    6661.9    11.06    0
Residual      24    14461.14    602.55
Total         29    47770.65

              Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept     5958.66         117.67            50.64     0          5715.81      6201.51
PS            -95.95          68.72             -1.4      0.18       -237.79      45.88
PR            63.89           15.69             4.07      0          31.52        96.27
PP            12.37           17.87             0.69      0.5        -24.5        49.25
PIC           -79.39          15.52             -5.12     0          -111.43      -47.36
PC            -30.68          20.18             -1.52     0.14       -72.32       10.96
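The reading of the output can be sketched in a few lines of Python. This is an illustration rather than part of the workbook: it recomputes each t statistic as coefficient / standard error, tests it against the two-tail 1% critical value for 24 degrees of freedom, and labels each other product from the sign of its coefficient. The critical value of 2.797 is taken from standard t tables, not from the workbook, and the names are illustrative.

```python
# (coefficient, standard error) for each price variable, from the regression output.
coefficients = {
    "PS": (-95.95, 68.72),
    "PR": (63.89, 15.69),
    "PP": (12.37, 17.87),
    "PIC": (-79.39, 15.52),
    "PC": (-30.68, 20.18),
}

# Assumption: two-tail 1% critical value of t with 24 residual degrees of freedom,
# taken from standard t tables (it does not appear in the workbook).
T_CRITICAL_1_PERCENT = 2.797

t_stats = {name: coef / se for name, (coef, se) in coefficients.items()}
significant = [name for name, t in t_stats.items() if abs(t) > T_CRITICAL_1_PERCENT]

# Sign of the cross-price coefficient: positive means substitute, negative means complement.
# PS is the product's own price, so it is left out of this classification.
relation = {
    name: ("substitute" if coef > 0 else "complement")
    for name, (coef, _) in coefficients.items()
    if name != "PS"
}

for name, t in t_stats.items():
    print(f"{name}: t={t:.2f} significant at 1%={name in significant} {relation.get(name, '')}")
```

Testing |t| against the critical value is equivalent to testing the P-value against 0.01, so the sketch agrees with the P-value column to the precision shown.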

Return to Regression