Many elementary concepts have been skipped. At this stage, it is assumed that you know them well. In particular, you MUST know how to do HATPC for each of the 8 hypothesis tests. Only important points, or those that connect several topics, are elaborated here. You have ABSOLUTELY NO hope of passing STAT170 if you do not know the 8 HATPCs. This PP file will NOT push you from F to P. Its contents will only help students at P or above, who already have the presumed basic knowledge.
Review of:
*5 types of graphics
*5 types of research questions
*8 statistical tests
*8 or MORE types of reports
5 types of graphics
One categorical (Lecture 2, 11): bar chart or pie chart
One numerical (Lecture 2, 7): histogram
Two categorical (Lecture 2, 11, 12): clustered bar chart
Two numerical (Lectures 2, 9 & 10): scatter plot
One categorical and one numerical: comparative box plots
STAT170 is restricted to only 5 combinations of variables, 5 different types of graphics, and 5 possible research questions. The most important step is correctly identifying the types of variables: NUMERICAL vs CATEGORICAL. Surprisingly, many students have difficulty with this very first step. Identifying the variables correctly (or wrongly) leads you to the correct (or wrong) type of graphic, research question, and statistical test.
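The identification step can be sketched as a small lookup. This is only an illustration: the function name `suggest` and the dictionary layout are hypothetical, while the graphics and tests are the ones listed in this review.

```python
# Hypothetical lookup: (number of categorical, number of numerical variables)
# -> (type of graphic, statistical test), as listed in this review.
GUIDE = {
    (1, 0): ("bar chart or pie chart",
             "z-test of proportion or chi-sq test of proportions"),
    (0, 1): ("histogram", "1-sample z or t test of mean"),
    (2, 0): ("clustered bar chart", "chi-sq test of association"),
    (0, 2): ("scatter plot", "regression: test of slope"),
    (1, 1): ("comparative box plots", "2-sample t-test"),
}


def suggest(variable_types):
    """variable_types: list of 'categorical' / 'numerical' strings."""
    n_cat = variable_types.count("categorical")
    n_num = variable_types.count("numerical")
    if (n_cat, n_num) not in GUIDE:
        # e.g. three variables: no STAT170 test covers this case
        raise ValueError("no STAT170 test for this combination of variables")
    return GUIDE[(n_cat, n_num)]


print(suggest(["categorical", "numerical"]))
```

A wrong count of variables (for example, 3 instead of 1) falls straight through to the error branch, which mirrors the warning above.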
Comment depends on whether the variable is ordinal or nominal
*Ordinal: comment similar to histogram
*Nominal: comment on which categories have the highest and lowest frequencies
[Figure: bar chart of diet (meat, vegetarian, vegan); frequency axis from 0 to 400]
*Comment on shape (skewed left/right, normal)
*Range from xxxx to xxxx
*Majority (high frequencies) of data about xxxx
*Comment on outliers (if present)
*Comment on any unusual features (if present)
Example:
*U-shaped, high frequencies near both ends, lowest frequencies near the centre
*U-shaped, but slightly skewed left
*Range from 0 to 12
[Figure: histogram; labels recovered: Assessment, Freq. (0 to 100), Individual Days; horizontal axis 0 to 12]
[Figure: comparative box plots of Age (axis 15 to 35) for the day and evening groups]
[Figure: scatter plot of Birth Rate (0 to 50) against Median Age (10 to 45)]
Never compare the actual frequencies (sizes). Only compare % (or proportions) (shapes). Since the proportions are almost the same, i.e. about 1/3 and 2/3 for smokers and non-smokers, smoking status is independent of Activity Level (no association).
Never compare the actual frequencies (sizes). Only compare % (or proportions) (shapes). Since percentages of smokers and non-smokers are obviously different for males and females, there is an association between smoking status and gender.
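A minimal sketch of comparing percentages rather than raw counts. The counts below are hypothetical (they are not the lecture data), and `row_percentages` is an illustrative helper name.

```python
# Hypothetical counts, for illustration only (not from the lecture).
counts = {
    "male":   {"smoker": 30, "non-smoker": 70},
    "female": {"smoker": 20, "non-smoker": 180},
}


def row_percentages(table):
    """Convert each row of counts to percentages of that row's total."""
    result = {}
    for group, row in table.items():
        total = sum(row.values())
        result[group] = {k: 100 * v / total for k, v in row.items()}
    return result


for group, pct in row_percentages(counts).items():
    print(group, pct)
```

In this made-up table the group sizes differ (100 vs 200), so comparing the raw counts 30 and 20 would understate the difference between the two percentages of smokers.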
One numerical: histogram; 1-sample z or t test
One categorical: bar chart; z-test of proportion or chi sq test of proportions
One categorical and one numerical: comparative boxplots; 2-sample t test
Two numerical: scatter plot; t-test of slope
A mistake will cost you at least 6 marks in HATPC, plus other marks in subsequent parts of the question. The key is to look at the definition, not the meaning we use in daily language. Read the question! The results are unchanged if we use the names ABC or XYZ instead of AGE.
Think of the survey. How many questions? 3 or 1? How many columns do you need to store the data? 3 or 1? You are doomed if you choose 3 variables. In fact there is no test in STAT170 that involves 3 variables.
Another example: How many variables are there? 1, 2 or 4? You are doomed if you choose 4 variables.
One categorical: bar chart. Tests: z-test of proportion (Lect 7), 2 categories only; $\chi^2$ test of proportions (GOF) (Lect 11), 2 or more categories.
One numerical: hist, stem-leaf, boxplot. Question: Is the mean equal to ___? Tests: z and t-tests of mean (Lect 7).
Two categorical: clustered bar chart. Question: Is there an association between ___ and ___? Test: chi sq test of association (Lect 11, 12) or odds ratio.
Two numerical: scatter plot. Question: Is there a relation between ___ and ___? Test: regression analysis, test of slope (Lect 9, 10).
One categ (binary) & one numeric: comparative boxplots. Question: Is there a diff in heights between males and females? Test: 2-sample t-test (Lect 8).
Note:
1. There is the paired t-test, which doesn't fit in any of the 5 cases above; perhaps it fits best in the 2nd case (one-sample t-test).
2. 7 tests above + paired t-test = 8 hypothesis tests in STAT170
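z-test of proportion
Ho: $\pi = \pi_0$
Test statistic: $z = \dfrac{p - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}}$
95% C.I.: $p \pm 1.96\sqrt{\dfrac{p(1-p)}{n}}$
Conclusion (NOT reject Ho): Proportion could be equal to $\pi_0$.
Conclusion (reject Ho): Proportion is higher/lower than $\pi_0$.

Chi sq test of proportions (GOF)
Ho: $\pi_1 = \ldots,\ \pi_2 = \ldots,\ \pi_3 = \ldots$
Assumption: $E_j = n\pi_j \ge 5$
Test statistic: $\chi^2 = \sum_j \dfrac{(O_j - E_j)^2}{E_j}$, df = c-1
Conclusion (NOT reject Ho): The proportions $\pi_1 = \ldots,\ \pi_2 = \ldots,\ \pi_3 = \ldots$ COULD be correct.
Conclusion (reject Ho): The proportions $\pi_1 = \ldots,\ \pi_2 = \ldots,\ \pi_3 = \ldots$ are NOT correct.

Chi sq test of association (independence, independent proportions)
Ho: X and Y are independent (no association)
Assumption: $E_j = \dfrac{\text{row total} \times \text{col total}}{\text{grand total}} \ge 5$
Test statistic: $\chi^2 = \sum_j \dfrac{(O_j - E_j)^2}{E_j}$, df = (r-1)(c-1)
Conclusion (NOT reject Ho): X and Y COULD be independent (not associated).
Conclusion (reject Ho): X and Y are dependent (associated).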
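The two chi-square statistics summarised above can be sketched as follows. The function names and the data are hypothetical, chosen only to show how the expected counts and the degrees of freedom are formed; no p-value is computed here.

```python
def chi_sq_gof(observed, proportions):
    """Goodness-of-fit statistic and df for Ho: the given proportions."""
    n = sum(observed)
    expected = [n * pi for pi in proportions]      # E = n * pi
    if min(expected) < 5:
        raise ValueError("every expected count must be at least 5")
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1                 # df = c - 1


def chi_sq_association(table):
    """Association statistic and df for a table given as a list of rows."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    grand = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_tot[i] * col_tot[j] / grand    # row total * col total / grand total
            if e < 5:
                raise ValueError("every expected count must be at least 5")
            stat += (o - e) ** 2 / e
    return stat, (len(table) - 1) * (len(table[0]) - 1)  # df = (r-1)(c-1)


# Hypothetical data, for illustration only
print(chi_sq_gof([50, 30, 20], [0.4, 0.4, 0.2]))
print(chi_sq_association([[30, 70], [20, 180]]))
```

Both functions refuse to return a statistic when an expected count is below 5, so the assumption check cannot be skipped by accident.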
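Alternative hypothesis (all tests): opposite of Ho, plus "is higher/lower".

1-sample z-test of mean (keywords: mean, average)
Ho: $\mu = \mu_0$
Test statistic: $z = \dfrac{\bar{y} - \mu_0}{\sigma/\sqrt{n}}$
95% C.I.: $\bar{y} \pm 1.96\,\dfrac{\sigma}{\sqrt{n}}$
Conclusion (reject Ho): Ave xxx is higher/lower than $\mu_0$.

1-sample t-test of mean (keywords: mean, average)
Ho: $\mu = \mu_0$
Test statistic: $t = \dfrac{\bar{y} - \mu_0}{s/\sqrt{n}}$, df = n-1
95% C.I.: $\bar{y} \pm t_{n-1}\,\dfrac{s}{\sqrt{n}}$
Conclusion (reject Ho): Ave xxx is higher/lower than $\mu_0$.

Paired t-test (keyword: difference)
Ho: $\mu_d = 0$
Assumption: differences from a normal popn, or $n \ge 25$ (CLT)
Test statistic: $t = \dfrac{\bar{y}_d - \mu_d}{s_d/\sqrt{n}}$, df = n-1
95% C.I.: $\bar{y}_d \pm t_{n-1}\,\dfrac{s_d}{\sqrt{n}}$
Conclusion (reject Ho): The difference is higher/lower than 0 on ave.

2-sample t-test
Ho: $\mu_1 = \mu_2$
Assumption: both groups from normal popn, same SD
Test statistic: $t = \dfrac{\bar{y}_1 - \bar{y}_2}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$, df = n1+n2-2
95% C.I.: $(\bar{y}_1 - \bar{y}_2) \pm t_{n_1+n_2-2}\; s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$
Conclusion (NOT reject Ho): There COULD be no difference between ave xxx and ave xxx.
Conclusion (reject Ho): Ave xxx is higher/lower than ave xxx.

Test of linear relation between 2 variables (keywords: relation, predict)
Ho: $\beta = 0$
Test statistic: $t = b/SE_b$, df = n-2
Conclusion (NOT reject Ho): There COULD be no relation between X & Y.
Conclusion (reject Ho): There is a positive/negative relation.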
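As a sketch of how two of the test statistics for means are computed, the snippet below implements the 1-sample t and the 2-sample (pooled) t. The data are hypothetical, the function names are illustrative, and the pooled SD uses the standard formula $s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$, which is an addition here since the summary only shows the symbol $s_p$.

```python
import math
import statistics


def one_sample_t(data, mu0):
    """t statistic and df for Ho: mu = mu0."""
    n = len(data)
    ybar = statistics.mean(data)
    s = statistics.stdev(data)                     # sample SD (divisor n-1)
    return (ybar - mu0) / (s / math.sqrt(n)), n - 1


def two_sample_t(group1, group2):
    """Pooled t statistic and df for Ho: mu1 = mu2 (same-SD assumption)."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = statistics.stdev(group1), statistics.stdev(group2)
    # Standard pooled SD (assumption: not spelled out in the summary)
    sp = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    diff = statistics.mean(group1) - statistics.mean(group2)
    return diff / (sp * math.sqrt(1 / n1 + 1 / n2)), n1 + n2 - 2


# Hypothetical data, for illustration only
print(one_sample_t([2, 4, 6, 8, 10], mu0=5))
print(two_sample_t([1, 2, 3], [4, 5, 6]))
```

The p-value would then be read from t tables (or software) using the returned df; that step is left out of this sketch.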
Key points to write in the Simple report (Check list) 1-hypothesis-test only
Introduction
*What this study is about, and why it was done, if known
*Research question (any wording is OK)
*Target population
Method
*How the sample was collected (why random and representative)
*Define variables
*Statistical method used
*Null hypothesis
*Justify assumptions [put under Method or Results, depending on the type of test]
Results (NO HATPC; NO calculations)
*Test statistic
*P-val, decision (reject/not reject null)
Conclusion
*Decision in words: There is evidence/no evidence [Check that the research question is answered.]
*Your conclusion should be almost the same (several sentences) as the conclusion you have in the proper hypothesis test (HATPC), e.g. 95% CI if appropriate.
Note: It is most important that you identify the correct statistical method used (how???). For example, if it is a chi-sq test and you mention a t-test, then the rest does not make sense, and you'll lose most of the marks and your time!
[Figure: Y and X4]
What is the best X and how to choose it? In EACH set, choose the variable with the smallest p-val (i.e. the one that most strongly rejects Ho), EXCEPT for regression. For regression, choose the largest $r^2$, not the smallest p-val.
An example on regression to illustrate the 1st general rule:
Needed for choosing the BEST variable(s):

Variable   Assumptions satisfied?   P-val   Significant variable? (Reject Ho?)
X1         No                       -----   -----
X2         Yes                      0.006   Yes
X3         Yes                      0.000   Yes
X4         No                       -----   -----
X5         Yes                      0.07    No

[Figure: Y and X3]
Hence only X2 and X3 are significant (important) variables affecting Y. And X3 is the best predictor for Y.
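The screening step in this example can be sketched as a simple filter. The values come from the table above; the function name `significant` and the tuple layout are hypothetical. Picking the BEST regression predictor needs $r^2$, which the table does not give, so this sketch stops at the list of significant variables.

```python
# (variable, assumptions satisfied?, p-value or None when not reported)
results = [
    ("X1", False, None),
    ("X2", True, 0.006),
    ("X3", True, 0.000),
    ("X4", False, None),
    ("X5", True, 0.07),
]


def significant(rows, alpha=0.05):
    """Keep variables whose assumptions hold and whose p-value is below alpha."""
    return [name for name, ok, p in rows
            if ok and p is not None and p < alpha]


print(significant(results))
```

Variables with violated assumptions are discarded before any p-value is looked at, which matches the order of the checks in the table.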
[Figures: Y vs Wt, Y vs Starts]
Compare:
Y vs WT: p-val = 0.00055
Y vs STARTS: p-val = 0.0012
Both p-val < 0.05 => both Wt and Starts affect Y, but Wt has a stronger effect (because of its smaller p-val).

[Figures: Y vs WIN, Y vs Payout]
Compare:
Y vs WIN: p-val = 0.5641
Y vs PAYOUT: p-val = 0.0000
Hence WIN has no effect on Y. Payout has an effect.
Results (NO HATPC; NO calculations)
*Discard poor ones (assumptions violated, or p-val>0.05). (AVOID lengthy, repetitive checking of p-vals one by one.)
*IF required by the question, pick the best one within each group.
Conclusion
*Answer the research question!
BTW, what is the research question like? Two possibilities:
*Which of the variables X1, X2, ... affect variable Y?
*Which of the variables X1, X2, ... BEST affect variable Y?
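$z = \dfrac{y - \mu}{\sigma}$ vs. $z = \dfrac{\bar{y} - \mu}{\sigma/\sqrt{n}}$ in probability calculations:
Look for the keyword "mean" or "average" => y-bar.
Note that there are NO such formulas: $z = \dfrac{\bar{y} - \mu}{\sigma}$ vs. $z = \dfrac{y - \mu}{\sigma/\sqrt{n}}$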
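A short sketch of the difference between a probability about one individual and a probability about a sample mean. It reuses the IQ values from the percentile example in this review (mean 100, SD 15); the sample size 25 and the cut-off 106 are hypothetical.

```python
import math
from statistics import NormalDist

mu, sigma = 100, 15      # IQ population values used in this review
n = 25                   # hypothetical sample size


def z_individual(y):
    """z for ONE observation y: divide by sigma."""
    return (y - mu) / sigma


def z_mean(ybar, n):
    """z for a sample MEAN y-bar: divide by sigma / sqrt(n)."""
    return (ybar - mu) / (sigma / math.sqrt(n))


# P(one person's IQ > 106) vs P(mean IQ of 25 people > 106)
p_one = 1 - NormalDist().cdf(z_individual(106))
p_mean = 1 - NormalDist().cdf(z_mean(106, n))
print(round(p_one, 4), round(p_mean, 4))
```

The same cut-off gives a much smaller probability for the mean than for one individual, because the sample mean varies less than a single observation.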
Baby ID   Mother's age   Father's age
21        28             33
22        34             40
23        24             26
24        34             45
25        32             35
26        24             27
27        30             39
28        29             27
29        37             34
30        41             46
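Assuming this table is meant as paired data (each baby has one mother and one father, so the two ages are linked), the paired t statistic can be sketched as below. That reading is an assumption, since the slide itself does not say which test the table belongs to; only the statistic and df are computed, not the p-value.

```python
import math
import statistics

mother = [28, 34, 24, 34, 32, 24, 30, 29, 37, 41]
father = [33, 40, 26, 45, 35, 27, 39, 27, 34, 46]

# One difference per baby: father's age minus mother's age
diffs = [f - m for f, m in zip(father, mother)]

n = len(diffs)
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)            # sample SD of the differences
t = mean_d / (sd_d / math.sqrt(n))        # Ho: mu_d = 0
df = n - 1

print(mean_d, round(sd_d, 3), round(t, 2), df)
```

Treating the two columns as one column of differences is what makes this a single-variable problem, in line with the earlier warning about counting variables.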
The SD from a data set (sample) MUST be s, never $\sigma$ => t-test.
*Do watch out if both $\sigma$ and s are given. Once we have $\sigma$, s is useless => use z-test.
Tips and Hints: Which condition? $n\pi \ge 5$ and $n(1-\pi) \ge 5$, or $np \ge 5$ and $n(1-p) \ge 5$?
8. Lect 5 (prob calculation on p) or Lect 7 (z-test on $\pi$): $z = \dfrac{p - \pi}{\sqrt{\pi(1-\pi)/n}}$. Check $n\pi \ge 5$ and $n(1-\pi) \ge 5$.
Lect 6 (CI for $\pi$): $p \pm 1.96\sqrt{\dfrac{p(1-p)}{n}}$. Check $np \ge 5$ and $n(1-p) \ge 5$.
Rule: p goes with p, $\pi$ goes with $\pi$; p NEVER goes with $\pi$.
Note that although the above 2 formulas are in the formula sheet, the 2 corresponding conditions are not. You have to know which one is the correct condition for checking.
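The pairing rule (the condition uses the same quantity as the standard error) can be sketched in code. The function names and the numbers are hypothetical; the formulas are the two quoted above.

```python
import math


def z_for_proportion(p, pi, n):
    """z-test / probability calculation: standard error AND condition use pi."""
    if n * pi < 5 or n * (1 - pi) < 5:
        raise ValueError("need n*pi >= 5 and n*(1-pi) >= 5")
    return (p - pi) / math.sqrt(pi * (1 - pi) / n)


def ci_for_proportion(p, n):
    """95% CI: standard error AND condition use the sample proportion p."""
    if n * p < 5 or n * (1 - p) < 5:
        raise ValueError("need n*p >= 5 and n*(1-p) >= 5")
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin


# Hypothetical numbers, for illustration only
print(z_for_proportion(p=0.6, pi=0.5, n=100))
print(ci_for_proportion(p=0.6, n=100))
```

Each function checks only the condition that matches its own standard error, so the two conditions cannot be mixed up.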
pth percentile
(a) Given ANY sample of size n, use the formula: n*p/100 (Lect 2). Then check whether the result is an integer or a non-integer etc.
Eg AGE: 12, 17, 28, 32, 33, 40, 40, 67 (MUST be sorted first!)
(b) Given a population (of infinite size) with known (given) $\mu$ and $\sigma$:
(i) Given normal: Find z from the given area p (1-tailed). Then find $y = \mu + \sigma z$.
Eg: It is known that IQ is normally distributed with mean 100 and SD 15 ($\mu = 100$, $\sigma = 15$). What is the 10th percentile? What is the LQ?
(ii) Non-normal (or unknown distribution): CANNOT do it!
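Both cases can be sketched in a few lines. For the sample case, the slide only says to check whether n*p/100 is an integer; the rule coded below (integer k: average the k-th and (k+1)-th sorted values; non-integer: round up) is a common textbook convention and is an assumption here, so check it against Lect 2. The function name `sample_percentile` is illustrative.

```python
import math
from statistics import NormalDist


def sample_percentile(data, p):
    """p-th percentile of a sample, using position n*p/100 (0 < p < 100)."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    values = sorted(data)                     # MUST be sorted first
    n = len(values)
    if (n * p) % 100 == 0:                    # n*p/100 is an integer k
        k = (n * p) // 100
        return (values[k - 1] + values[k]) / 2    # assumed rule: average k-th and (k+1)-th
    k = math.ceil(n * p / 100)                # assumed rule: round up
    return values[k - 1]


age = [12, 17, 28, 32, 33, 40, 40, 67]
print(sample_percentile(age, 25), sample_percentile(age, 50))

# Normal population: y = mu + sigma * z, with z found from the area p
iq = NormalDist(mu=100, sigma=15)
print(round(iq.inv_cdf(0.10), 2))   # 10th percentile
print(round(iq.inv_cdf(0.25), 2))   # lower quartile (LQ)
```

`NormalDist.inv_cdf` does the table lookup and the step $y = \mu + \sigma z$ in one call; by hand, the z values would be read from the normal table.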
10. Wrong: "No association/association between males and females." "No association/association between smokers and non-smokers." (In fact, males, females, smokers and non-smokers are NOT variables.) It should be: Could be no association/There is association between Sex and Smoking Status.
Eg 2-sample t: Ho: There is no difference in exam marks on average for boys and girls.
Eg chi sq test of association: Ho: There is NO association between X and Y.
Eg regression: Ho: $\beta = 0$ (No relation between X and Y)
P-val<0.05 => Reject Ho
*Negate (make opposite) Ho
*Be certain, use the verb "is"
*Also give further info: is greater/less than, is longer/shorter (eg one-sample or 2-sample t), except chi sq
P-val>0.05 => Do not reject Ho
*Copy Ho
*Change the verb "is" to "could be".
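The wording rule can be sketched as a tiny helper. The function name `conclude` and its arguments are hypothetical; the caller still has to supply the negated Ho (with direction where relevant) for the "reject" case.

```python
def conclude(p_value, h0_with_is, negated_h0, alpha=0.05):
    """Return the decision and the conclusion wording.

    h0_with_is: Ho written as a sentence using the verb 'is'
    negated_h0: the opposite of Ho, with 'is higher/lower' where relevant
    """
    if p_value < alpha:
        # Reject Ho: be certain, use the verb "is"
        return "Reject Ho. " + negated_h0
    # Do not reject Ho: copy Ho and change "is" to "could be"
    return "Do not reject Ho. " + h0_with_is.replace(" is ", " could be ", 1)


print(conclude(
    0.30,
    "There is no association between Sex and Smoking Status.",
    "There is an association between Sex and Smoking Status.",
))
```

With a p-value of 0.30 the helper copies Ho and softens the verb, which gives the "could be no association" form recommended above.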
Which is more difficult? Surprisingly, students find the symbols in STAT170 more difficult than the 26 English letters in Primary 1. If you have problems in the left column, you will be in big trouble. You will NOT lose just a few marks, but many!
$z = \dfrac{\bar{y} - \mu}{\sigma/\sqrt{n}}$
Ask yourself
How many hours did I spend on STAT170 each week, on average? Macquarie University recommends (minimum): 3 credit points * 3 hours = 9 hours = 4 hours in class + 5 hours on your own at home Every WEEK.
How many ticks do you have in the above list? ____ Unfortunately, even just ONE tick, e.g. "Can do just one hypothesis test", can (and will) lead to a failure! Note: # = fatal