BY
MAIMOONA KANWAL
ROLL # 17111714-010
Submitted To
Department of Zoology
UNIVERSITY OF GUJRAT
1: Univariate analysis:
Univariate analysis uses only one variable, and all statistical measures are computed on that single variable.
Qualitative data
Qualitative data describe a quality of something, such as color. It is a categorical
measurement expressed not in numbers but by means of a natural-language
description.
For the univariate analysis, a single nominal variable was selected: the department visited,
recorded across different hospitals, to check which department is visited most often
by patients in a month.
Departments
                       Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Cardiotherapy       4        13.3        13.3               13.3
        Gynaecology        14        46.7        46.7               60.0
        Neurology           3        10.0        10.0               70.0
        Oncology            1         3.3         3.3               73.3
        Urology             8        26.7        26.7              100.0
        Total              30       100.0       100.0
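The percentages in the table above can be reproduced from the frequencies alone. The following is a minimal sketch, not the SPSS procedure itself; the function name `frequency_table` is only illustrative.

```python
# Frequencies copied from the Departments table above.
frequencies = {
    "Cardiotherapy": 4,
    "Gynaecology": 14,
    "Neurology": 3,
    "Oncology": 1,
    "Urology": 8,
}


def frequency_table(freqs):
    """Return rows of (category, count, percent, cumulative percent)."""
    n = sum(freqs.values())
    rows = []
    running = 0
    for category, count in freqs.items():
        running += count
        rows.append((category, count, 100 * count / n, 100 * running / n))
    return rows


total = sum(frequencies.values())
rows = frequency_table(frequencies)

for category, count, percent, cumulative in rows:
    print(f"{category:<14}{count:>4}{percent:>8.1f}{cumulative:>8.1f}")
print(f"{'Total':<14}{total:>4}")
```

Because there are no missing values, the percent and valid percent columns coincide, so only one of them is computed here.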
Interpretation:
The Frequency column shows how many observations fall in each category. It
indicates that the sample consists of 30 observations and that the gynaecology department is visited most by
patients, while the oncology department has the lowest count. The Percent
column gives the percentage of observations in each category out of all observations;
the gynaecology department has the highest percentage (46.7%). The Valid Percent column
displays the percentage of observations in each category out of the total number of nonmissing
responses. As our data have no missing values, it is identical to the Percent column. You can verify
the proportion for each group by dividing its count in the "Frequency" column by the value of
"Total" that appears after the last valid category.
Pie chart
Interpretation:
The largest portion of the pie chart is covered by the gynaecology department, so the highest
number of patients visited this department in a month. It is followed by the urology department, while the smallest portion
is covered by the oncology department.
Ordinal data:
To study ordinal data, the patients' response to their visit was selected, which is an ordinal variable. A bar
graph is used to display qualitative data.
Bar Graph
The bar graph revealed that patients are highly satisfied with the hospitals.
Scale data
In this data, the income from medicines sold in the last week is a scale variable. No frequency table
is produced for scale data, so it is shown by a histogram, which is used for quantitative data. When
studying quantitative data, the mean, median, variance, standard deviation and skewness are also
examined.
Statistics
Preference for operation in last month
N               Valid          30
                Missing         0
Mean                         4.90
Median                       5.00
Std. Deviation              1.242
Variance                    1.541
Skewness                    -.030
Std. Error of Skewness       .427
Kurtosis                    -.879
Std. Error of Kurtosis       .833
Minimum                         3
Maximum                         7
Percentiles     25           4.00
                50           5.00
                75           6.00
Interpretation:
The statistics table shows that there are 30 valid values and no missing values. The center of the distribution can be
approximated by the median (or second quartile), 5, which means that half of the values lie at or above
5 and half at or below 5. The maximum value is 7 and the minimum value is 3. The mean (4.90)
is close to the median, suggesting that the distribution is symmetric, which is confirmed by
the small negative skewness (-0.030). The kurtosis (-0.879) lies within about two standard errors (0.833) of zero, which is consistent with a normal distribution.
The histogram shows a roughly bell-shaped curve, indicating that our data are approximately normally distributed.
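The standard errors of skewness and kurtosis in the table depend only on the sample size, so they can be checked by hand. The sketch below uses the usual textbook formulas for these standard errors, with n = 30 from the table; the rule of thumb that a ratio within about ±1.96 is compatible with normality is an assumption added here, not something reported by the output.

```python
import math


def se_skewness(n):
    """Standard error of sample skewness for a sample of size n."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_kurtosis(n):
    """Standard error of sample excess kurtosis for a sample of size n."""
    return 2 * se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))


n = 30                                 # valid cases in the Statistics table
skewness, kurtosis = -0.030, -0.879    # values reported in the table

ses = se_skewness(n)
sek = se_kurtosis(n)
z_skew = skewness / ses                # skewness in units of its standard error
z_kurt = kurtosis / sek                # kurtosis in units of its standard error

print(f"SE skewness = {ses:.3f}, SE kurtosis = {sek:.3f}")
print(f"z(skewness) = {z_skew:.2f}, z(kurtosis) = {z_kurt:.2f}")
```

Both ratios are well inside ±1.96, which supports the reading that the distribution does not depart noticeably from normality.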
2: Bivariate analysis:
This type of analysis involves two variables.
Crosstab:
A crosstab is also called a contingency table. It shows the relationship between two categorical variables.
There are 48 observations in total, divided equally between males and females. Gender is
taken as the row variable because it is the independent variable, while smoking is taken as the
dependent variable. The data show that the proportion of smokers is higher among male respondents:
19 of the 24 males are smokers, while only 9 of the 24 females are smokers.
Chi square test
It is used to check the association between two categorical variables measured on a random sample.
Hypothesis:
Ho: Variable A and Variable B are independent.
Ha: Variable A and Variable B are not independent.
Chi-Square Tests
                                Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                               (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square              8.571a    1    .003
Continuity Correctionb          6.943     1    .008
Likelihood Ratio                8.884     1    .003
Fisher's Exact Test                                          .008         .004
Linear-by-Linear Association    8.393     1    .004
N of Valid Cases                48
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 10.00.
b. Computed only for a 2x2 table
Interpretation:
The value of the test statistic is 8.571. The footnote for this statistic pertains to the expected cell
count assumption (i.e., expected cell counts are all greater than 5): no cells had an expected
count less than 5, so this assumption was met. Because the test statistic is based on a 2x2
crosstabulation table, the degrees of freedom (df) for the test statistic is 1. The corresponding
p-value of the test statistic is p = 0.003. Since the p-value is smaller than our chosen significance
level (α = 0.05), we reject the null hypothesis and conclude that there is enough
evidence to suggest an association between gender and smoking behavior.
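The Pearson and continuity-corrected statistics can be recomputed from the crosstab counts given earlier (19 of 24 males and 9 of 24 females smoke, so 5 males and 15 females do not). This is a minimal sketch using only the standard library; it relies on the fact that for 1 degree of freedom the chi-square upper-tail probability equals erfc(sqrt(x/2)), and the function name is illustrative.

```python
import math

# Rows: male, female. Columns: smoker, non-smoker (non-smokers derived as 24 - smokers).
observed = [[19, 5], [9, 15]]


def chi_square_2x2(table, yates=False):
    """Return (statistic, p-value, minimum expected count) for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    min_expected = float("inf")
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            min_expected = min(min_expected, expected)
            diff = abs(table[i][j] - expected)
            if yates:
                diff = max(diff - 0.5, 0.0)   # continuity correction
            statistic += diff ** 2 / expected
    p_value = math.erfc(math.sqrt(statistic / 2))  # upper tail, valid for df = 1 only
    return statistic, p_value, min_expected


chi2, p, min_exp = chi_square_2x2(observed)
chi2_c, p_c, _ = chi_square_2x2(observed, yates=True)

print(f"Pearson chi-square = {chi2:.3f}, p = {p:.3f}, min expected = {min_exp:.2f}")
print(f"Continuity corrected = {chi2_c:.3f}, p = {p_c:.3f}")
```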
Relative risk
Risk ratio (RR), or relative risk, is the ratio of the probability of an outcome in an exposed
group to the probability of the same outcome in an unexposed group.
Interpretation:
A risk ratio of 1 (or close to 1) suggests no difference or little difference in risk between the two groups (the incidence in each group is about the same).
Risk Estimate
                                          Value    95% Confidence Interval
                                                   Lower     Upper
Odds Ratio for Hormone use (yes / no)     1.000    .020      50.397
For cohort Status = uninfected            1.000    .141      7.099
For cohort Status = light                 1.000    .141      7.099
N of Valid Cases                          4
Interpretation:
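The odds ratio is 1.000, so the exposure (hormone use) does not affect the odds of the outcome. The 95% confidence interval (0.020 to 50.397) is very wide because there are only 4 valid cases.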
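The odds ratio, the relative risk and their confidence intervals can be sketched as follows. The underlying 2x2 counts are not shown in the report, so one observation per cell is assumed here; that assumption is consistent with N = 4 and reproduces the reported intervals when log-scale (Wald) intervals are used. The function name `risk_estimates` is illustrative.

```python
import math
from statistics import NormalDist


def risk_estimates(a, b, c, d, level=0.95):
    """a, b: exposed with/without outcome; c, d: unexposed with/without outcome."""
    z = NormalDist().inv_cdf(0.5 + level / 2)

    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

    risk_ratio = (a / (a + b)) / (c / (c + d))
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))

    def interval(estimate, se):
        # Interval built on the log scale, then transformed back.
        return estimate * math.exp(-z * se), estimate * math.exp(z * se)

    return {
        "OR": (odds_ratio, *interval(odds_ratio, se_log_or)),
        "RR": (risk_ratio, *interval(risk_ratio, se_log_rr)),
    }


# Assumed counts: one case in every cell (not stated in the report).
result = risk_estimates(1, 1, 1, 1)
for name, (value, lower, upper) in result.items():
    print(f"{name} = {value:.3f}, 95% CI ({lower:.3f}, {upper:.3f})")
```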
3: Correlation
Assumptions:
Independence of cases: Cases should be independent of each other.
Linear relationship: The two variables should be linearly related to each other. This can be
assessed with a scatterplot: plot the values of the variables on a scatter diagram and check whether the plot
yields a relatively straight line.
Homoscedasticity: The residuals scatterplot should be roughly rectangular in shape.
Hypothesis:
Ho: There is no linear correlation between Variable A and Variable B.
Ha: There is a linear correlation between Variable A and Variable B.
Correlations
                                    height    deadspace
height      Pearson Correlation     1         .846**
            Sig. (2-tailed)                   .000
            N                       15        15
deadspace   Pearson Correlation     .846**    1
            Sig. (2-tailed)         .000
            N                       15        15
**. Correlation is significant at the 0.01 level (2-tailed).
Interpretation:
The table shows a Pearson correlation of r = 0.846 between height and deadspace. This value is
close to 1, so there is a strong positive linear relationship between the two variables: changes in one variable are
strongly associated with changes in the other. The correlation is significant at the 0.01 level (2-tailed).
However, we cannot draw any other
conclusions about this relationship (for example, about cause and effect) from this number alone.
The scatter plot indicates a positive correlation in our data: as one variable increases,
the other also increases.
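The significance of the correlation can be checked from r and N alone with the usual t statistic, t = r·sqrt((n − 2)/(1 − r²)), on n − 2 degrees of freedom. The sketch below is an illustration only; the critical value 3.012 is the tabulated two-tailed value for α = 0.01 with 13 degrees of freedom, hard-coded here as an assumption rather than computed.

```python
import math


def correlation_t(r, n):
    """t statistic for testing that the population correlation is zero."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


r, n = 0.846, 15          # values from the Correlations table
df = n - 2
t_stat = correlation_t(r, n)

T_CRIT_01 = 3.012         # tabulated two-tailed critical value, alpha = 0.01, df = 13 (assumed)
significant = abs(t_stat) > T_CRIT_01
r_squared = r ** 2        # share of variance the two variables have in common

print(f"t = {t_stat:.2f} on {df} df, significant at 0.01: {significant}")
print(f"r squared = {r_squared:.3f}")
```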
4: Linear regression
Simple linear regression is a statistical method that allows us to summarize and study the
relationship between two continuous (quantitative) variables.
Descriptive Statistics
               Mean       Std. Deviation    N
systolic bp    106.438    3.7589            16
Age            13.500     2.3664            16
Model Summaryb
Model   R       R Square   Adjusted R   Std. Error of   Change Statistics
                           Square       the Estimate    R Square Change   F Change   df1   df2   Sig. F Change
1       .693a   .481       .444         2.8040          .481              12.955     1     14    .003
ANOVAa
Model            Sum of Squares   df   Mean Square   F        Sig.
1   Regression   101.860          1    101.860       12.955   .003b
    Residual     110.077          14   7.863
    Total        211.938          15
a. Dependent Variable: systolic bp
b. Predictors: (Constant), Age
Interpretation:
R-squared is a statistical measure of how close the data are to the fitted regression line. The
R square value of 0.481 indicates that nearly half of the variation in systolic blood pressure is explained by this
model.
The ANOVA table gives a significance value of 0.003, which is less than 0.05, so we can reject the null hypothesis.
A histogram is drawn to check the normality of the data. Its bell shape indicates that our data are
approximately normally distributed.
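The model summary values follow directly from the sums of squares in the ANOVA table. The sketch below recomputes them under the usual definitions (R² = SS regression / SS total, F = MS regression / MS residual); the function name `regression_summary` is illustrative and not part of SPSS.

```python
import math


def regression_summary(ss_regression, ss_residual, n, n_predictors):
    """Derive R square, adjusted R square, F and the standard error of the estimate."""
    ss_total = ss_regression + ss_residual
    df_residual = n - n_predictors - 1
    ms_regression = ss_regression / n_predictors
    ms_residual = ss_residual / df_residual
    return {
        "r_square": ss_regression / ss_total,
        "adj_r_square": 1 - ms_residual / (ss_total / (n - 1)),
        "f": ms_regression / ms_residual,
        "std_error": math.sqrt(ms_residual),
    }


# Sums of squares from the ANOVA table for systolic bp on Age (16 cases, 1 predictor).
summary = regression_summary(101.860, 110.077, n=16, n_predictors=1)
for name, value in summary.items():
    print(f"{name} = {value:.4f}")
```

The same function can be applied to any regression ANOVA table by passing its sums of squares, the number of cases and the number of predictors.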
Multiple linear regression:
In multiple linear regression, several variables, denoted x, are regarded as the predictor, explanatory, or independent
variables.
The other variable, denoted y, is regarded as the response, outcome, or dependent variable.
Model Summaryb
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .975a   .950       .932                .9821                        1.473
a. Predictors: (Constant), Age, Weight, Height, DBP
b. Dependent Variable: SBP
The R square value is 0.950, which means that our model explains nearly all (95%) of the variation in
SBP.
ANOVAa
Model            Sum of Squares   df   Mean Square   F        Sig.
1   Regression   201.329          4    50.332        52.189   .000b
    Residual     10.609           11   .964
    Total        211.938          15
a. Dependent Variable: SBP
b. Predictors: (Constant), Age, Weight, Height, DBP
The ANOVA table gives a significance value of 0.000, which is less than 0.05, so
we reject the null hypothesis and accept the alternative one.
Normality test:
5: Comparison of variables:
To compare the means of two groups we applied a t test. In our analysis, we applied the
independent samples t test.
Formulation of hypothesis:
The null and alternative hypotheses of our test are as follows.
H0: There is no difference in mean blood pressure between the two hypertension groups.
H1: There is a difference in mean blood pressure between the two hypertension groups.
Whether the null hypothesis is rejected depends on the p-value: if it is less than
0.05, the null hypothesis is rejected; otherwise it is not rejected.
Assumptions:
Normal distribution
Independence
No outliers
H0: µ1 = µ2 (“the two population means are equal”)
H1: µ1 ≠ µ2 (“the two population means are not equal”)
Interpretation:
As the p-value is 0.095, which is greater than 0.05, H0 is not rejected, i.e. the difference between the
two means is not statistically significant.
Statistics Project After mid
Introduction to some terms
Experimental Unit:
It is the smallest unit into which the experimental material is divided and to which a treatment is applied.
For example, if we are checking the effect of a growth hormone on animals, each animal is an
experimental unit.
Factor:
A factor is an explanatory variable in the experiment that is controlled by the researcher. It
is divided into two or more levels, and combinations of factor levels are termed treatments. For
example, urea is a fertilizer; its various concentrations are termed levels, and a combination of one of these
with a concentration of some other fertilizer is termed a treatment.
Experimental error:
It is the difference between the measured value and the true value that is due to random
factors. It is described in terms of accuracy and precision: accuracy is the closeness of a
measured value to the true value, while precision is the closeness of repeated measurements to one
another.
Definition:
A completely randomized design is one in which all experimental units are similar, without any grouping. Treatments are
allocated at random to the experimental units, which are homogeneous in nature.
Problem Statement
In this experiment, different types of growth hormones are used to check the growth
rate of fruit plants. It is compared with control group to choose the most effective
growth hormone.
HYPOTHESIS
Whether the null hypothesis is rejected depends on the p-value: if the p-value
is less than 0.05, the null hypothesis is rejected, and if the p-value is greater than 0.05,
the null hypothesis is not rejected.
EXPERIMENTAL MATERIAL
LOCAL CONTROL
Local control means controlling all factors except the ones
under investigation.
Here, as we are checking the effect of growth hormones on plant growth rate,
all other factors that can affect it, such as temperature, are kept constant.
ANALYSIS
Stat - ANOVA - One-Way ANOVA - Response - Factor - Confidence level - Graphs
(check box plots and four in one) - Click OK
Among the four-in-one plots, the normal probability plot and the histogram show that our
residuals are normally distributed: in the normal probability plot the residuals should
follow a straight line, which they do for our data. In the versus-fits plot, roughly the
same number of residuals lie on either side of 0, so it is reasonable to say that the residuals have
constant variance. The last plot (residuals versus order) shows that our data are
free of autocorrelation.
Results Interpretation:
Factor Levels Values
Growth hormones 4 Auxins, Cytokines, Ethylene, Gibberellins
The p-value is 0.000, which is below our significance level, so the null hypothesis is rejected and we
accept the alternative hypothesis, which states that not all treatment means are equal.
Model Summary
S R-sq R-sq(adj) R-sq(pred)
4.98011 21.37% 20.16% 18.12%
R Square:
R square values lie between 0 and 100%. It indicates how well the model explains the
variation in the response. Its value indicates that our model explains 21.37% of the variation in the data.
Pred. R square
It indicates how well the model predicts new observations; in our case, the model explains 18.12% of the variation in new observations.
Means
Growth hormones N Mean StDev 95% CI
Auxins 50 18.420 5.091 (17.031, 19.809)
Cytokines 50 25.160 4.287 (23.771, 26.549)
Ethylene 50 19.540 5.023 (18.151, 20.929)
Gibberellins 50 20.380 5.447 (18.991, 21.769)
A 95% confidence interval means that if we repeated the experiment 100 times and computed an interval each time, about 95 of
those intervals would contain the true mean. For auxins, we are 95% confident that the true mean lies between 17.031 and 19.809;
for cytokines, between 23.771 and 26.549;
for ethylene, between 18.151 and 20.929;
and for gibberellins, between 18.991 and 21.769.
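The intervals and the R-sq value above can be recovered from the summary statistics, assuming (as the identical interval widths suggest) that every interval is built from the pooled standard deviation S = 4.98011. The t quantile 1.9721 for 196 error degrees of freedom is a tabulated value hard-coded here as an assumption, and the helper name is illustrative.

```python
import math

# (N, mean) per growth hormone, copied from the Means table.
groups = {
    "Auxins": (50, 18.420),
    "Cytokines": (50, 25.160),
    "Ethylene": (50, 19.540),
    "Gibberellins": (50, 20.380),
}
S_POOLED = 4.98011   # S from the Model Summary
T_CRIT = 1.9721      # assumed tabulated t quantile (0.975, df = 196)


def confidence_interval(mean, n, s=S_POOLED, t=T_CRIT):
    """95% interval for one group mean based on the pooled standard deviation."""
    half_width = t * s / math.sqrt(n)
    return mean - half_width, mean + half_width


n_total = sum(n for n, _ in groups.values())
df_error = n_total - len(groups)
grand_mean = sum(n * m for n, m in groups.values()) / n_total

ss_treatment = sum(n * (m - grand_mean) ** 2 for n, m in groups.values())
ss_error = df_error * S_POOLED ** 2
r_sq = ss_treatment / (ss_treatment + ss_error)
f_value = (ss_treatment / (len(groups) - 1)) / S_POOLED ** 2

for name, (n, mean) in groups.items():
    lower, upper = confidence_interval(mean, n)
    print(f"{name:<13} ({lower:.3f}, {upper:.3f})")
print(f"R-sq = {100 * r_sq:.2f}%, F = {f_value:.2f}")
```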
There are 16 observations in total, eight from each block (low level of N and high level of N).
HYPOTHESIS
1. Null hypothesis: All treatment means are equal.
1. Alternative hypothesis: Not all treatment means are equal.
2. Null hypothesis: All block means are equal.
2. Alternative hypothesis: Not all block means are equal.
3. Null hypothesis: Treatments and blocks are independent (there is no interaction).
3. Alternative hypothesis: Treatments and blocks are dependent (there is an interaction).
Whether a null hypothesis is rejected depends on the p-value: if the p-value is less than 0.05, the
null hypothesis is rejected, and if the p-value is greater than 0.05, the null hypothesis is not rejected.
EXPERIMENTAL MATERIAL
1 A
2 B
3 C
4 D
1 High N level
2 Low N level
Local Control
Here we checked the effect of different fertilizers on plant height. Only the type of fertilizer is
changed; all other factors, such as pH, temperature and oxygen concentration, are kept constant.
Analysis
Stat - ANOVA - General Linear Model - Fit General Linear Model - Response (height) -
Factors (block, fertilizer) - Model (select the response and factors and click
Add) - Options and Stepwise (by default) - Graphs (check four in one) - Storage (check fits and
residuals) - Click OK
General Linear Model: height versus block, fertilizer
Method
Factor coding (-1, 0, +1)
Factor Information
Factor   Type    Levels   Values
block    Fixed   2        high N, Low N
ferti    Fixed   4        0, 1, 2, 3
There is one block factor with two levels (low and high N), while the fertilizer factor has four levels.
Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
block 1 11.560 11.5600 11.56 0.009
ferti 3 158.942 52.9808 52.98 0.000
block*ferti 3 0.235 0.0783 0.08 0.970
Error 8 8.000 1.0000
Total 15 178.737
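As the p-value for the block factor (0.009) is smaller than 0.05, the null hypothesis is rejected, so not all block means
are equal. For fertilizer, the p-value is 0.000, so the null hypothesis is also rejected and the alternative is
accepted: there is a significant difference among the fertilizer means. The p-value for the block*ferti interaction (0.970) is greater than 0.05, so there is no significant interaction between blocks and fertilizers.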
Model Summary
S R-sq R-sq(adj) R-sq(pred)
1 95.52% 91.61% 82.10%
The R square value (95.52%) is close to 100%, which means that our model explains nearly all of the variation in our data.
Coefficients
Designs of experiment
2^k factorial
These designs are created to study a large number of factors, with each factor having the minimal
number of levels, just two. The levels of each factor are termed high and low, coded +1 and -1.
The factors may be qualitative or quantitative in nature.
Description
In this example, I chose two factors, each having two levels:
Blood pressure: Low, High (75, 130)
Glucose Uptake: Low, High (5, 10)
Response:
Diabetes value as affected by these two factors
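The runs of such a two-level design can be listed by taking every combination of the coded levels -1 and +1. The sketch below is only an illustration of that idea, not Minitab's own procedure; two replicates are assumed because the analysis of variance reported for this example has 7 total degrees of freedom (8 runs), and run order is not randomized here.

```python
from itertools import product

# Low and high settings for each factor, as listed in the description.
factors = {
    "blood pressure": (75, 130),
    "glucose uptake": (5, 10),
}


def full_factorial(factor_levels, replicates=1):
    """Return a list of (coded levels, actual settings) for a 2^k full factorial."""
    names = list(factor_levels)
    runs = []
    for _ in range(replicates):
        for coded in product((-1, 1), repeat=len(names)):
            actual = {
                name: factor_levels[name][0 if level == -1 else 1]
                for name, level in zip(names, coded)
            }
            runs.append((coded, actual))
    return runs


design = full_factorial(factors, replicates=2)  # replicates=2 is an assumption
for coded, actual in design:
    print(coded, actual)
```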
Hypothesis:
Null hypothesis:
There is no significant difference in diabetes value based on blood pressure and glucose uptake.
Alternate hypothesis:
There is a significant difference in diabetes value based on blood pressure and glucose uptake.
Pathway:
Stat - DOE - Factorial - Create Factorial Design - create the design by selecting 2^k - add response
values - Stat - DOE - Factorial - Analyze Factorial Design - select all graphs - click OK
Among the four-in-one plots, the normal probability plot and the histogram show that our
residuals are normally distributed: in the normal probability plot the residuals should
follow a straight line, which they do for our data. In the versus-fits plot, roughly the
same number of residuals lie on either side of 0, so it is reasonable to say that the residuals have
constant variance. The last plot (residuals versus order) shows that our data are
free of autocorrelation.
Factorial Regression: diabetes versus blood pressure, glucose uptake
Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
Model 3 6370.38 2123.46 112.50 0.000
Linear 2 6160.25 3080.13 163.19 0.000
blood pressure 1 5565.13 5565.13 294.84 0.000
glucose uptake 1 595.13 595.13 31.53 0.005
2-Way Interactions 1 210.13 210.13 11.13 0.029
blood pressure*glucose uptake 1 210.13 210.13 11.13 0.029
Error 4 75.50 18.88
Total 7 6445.88
The p-value for the model is 0.000, which is below 0.05, so we reject the null hypothesis and accept the alternative hypothesis.
Model Summary
S R-sq R-sq(adj) R-sq(pred)
4.34454 98.83% 97.95% 95.31%
There are seven factors, each having two levels, and the response is the yield of
plants. So I decided to apply a 1/16 fractional factorial design.
A Temperature
B pressure
C Humidity
D Soil fertility
E water level
F wind level
G fertilizer concentration
Hypothesis:
Null hypothesis:
There is no significant difference in plant yield based on the seven factors.
Alternate hypothesis:
There is a significant difference in plant yield based on the seven factors.
Pathway
Stat - DOE - Factorial - Create Factorial Design - select 1/16 fractional factorial - click OK
Add yield as the response, then repeat the steps, choosing Analyze Factorial Design.
Results:
Alias Structure
I + ABD + ACE + AFG + BCF + BEG + CDG + DEF + ABCG + ABEF + ACDF + ADEG + BCDE + BDFG + CEFG + ABCDEFG
A + BD + CE + FG + BCG + BEF + CDF + DEG + ABCF + ABEG + ACDG + ADEF + ABCDE + ABDFG + ACEFG + BCDEFG
B + AD + CF + EG + ACG + AEF + CDE + DFG + ABCE + ABFG + BCDG + BDEF + ABCDF + ABDEG + BCEFG + ACDEFG
C + AE + BF + DG + ABG + ADF + BDE + EFG + ABCD + ACFG + BCEG + CDEF + ABCEF + ACDEG + BCDFG + ABDEFG
D + AB + CG + EF + ACF + AEG + BCE + BFG + ACDE + ADFG + BCDF + BDEG + ABCDG + ABDEF + CDEFG + ABCEFG
E + AC + BG + DF + ABF + ADG + BCD + CFG + ABDE + AEFG + BCEF + CDEG + ABCEG + ACDEF + BDEFG + ABCDFG
F + AG + BC + DE + ABE + ACD + BDG + CEG + ABDF + ACEF + BEFG + CDFG + ABCFG + ADEFG + BCDEF + ABCDEG
G + AF + BE + CD + ABC + ADE + BDF + CEF + ABDG + ACEG + BCFG + DEFG + ABEFG + ACDFG + BCDEG + ABCDEF
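The alias structure above can be derived by multiplying effect "words", where any letter that appears twice cancels (A*A = I). The sketch below assumes the design generators D = AB, E = AC, F = BC and G = ABC; these are not stated in the output but are consistent with the three-letter and four-letter words in the first line of the alias structure. Function names are illustrative.

```python
from itertools import combinations

# Assumed generators written as defining words: D=AB, E=AC, F=BC, G=ABC.
GENERATORS = ["ABD", "ACE", "BCF", "ABCG"]


def multiply(word_a, word_b):
    """Multiply two effects; letters present in both cancel out."""
    return "".join(sorted(set(word_a) ^ set(word_b)))


def defining_relation(generators):
    """All products of one or more generator words (the words aliased with I)."""
    words = set()
    for size in range(1, len(generators) + 1):
        for combo in combinations(generators, size):
            word = ""
            for generator in combo:
                word = multiply(word, generator)
            words.add(word)
    return words


def aliases(effect, generators):
    """Effects confounded with the given effect, shortest words first."""
    products = (multiply(effect, word) for word in defining_relation(generators))
    return sorted(products, key=lambda w: (len(w), w))


relation = defining_relation(GENERATORS)
print("I +", " + ".join(sorted(relation, key=lambda w: (len(w), w))))
print("A +", " + ".join(aliases("A", GENERATORS)))
```

Every main effect is aliased with three two-factor interactions, which is why only main effects are estimated in the analysis that follows.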
Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
Model 7 2419.00 345.57 2.00 0.175
Linear 7 2419.00 345.57 2.00 0.175
A 1 9.00 9.00 0.05 0.825
B 1 64.00 64.00 0.37 0.559
C 1 1521.00 1521.00 8.82 0.018
D 1 196.00 196.00 1.14 0.318
E 1 36.00 36.00 0.21 0.660
F 1 529.00 529.00 3.07 0.118
G 1 64.00 64.00 0.37 0.559
Error 8 1380.00 172.50
Total 15 3799.00
The p-value for the model is 0.175, which is more than 0.05, so we fail to reject the null hypothesis.
Only factor C gives a significant result: its p-value is 0.018, which is less than 0.05.
Model Summary
S R-sq R-sq(adj) R-sq(pred)
13.1339 63.67% 31.89% 56.00%
Coded Coefficients
Term Effect Coef SE Coef T-Value P-Value VIF
Constant 64.75 3.28 19.72 0.000
A 1.50 0.75 3.28 0.23 0.825 1.00
B 4.00 2.00 3.28 0.61 0.559 1.00
C -19.50 -9.75 3.28 -2.97 0.018 1.00
D -7.00 -3.50 3.28 -1.07 0.318 1.00
E 3.00 1.50 3.28 0.46 0.660 1.00
F 11.50 5.75 3.28 1.75 0.118 1.00
G -4.00 -2.00 3.28 -0.61 0.559 1.00
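The columns of the coded coefficients table are linked by simple relations that hold for a balanced, orthogonal two-level design with ±1 coding: the coefficient is half the effect, its standard error is S/sqrt(N), and the adjusted sum of squares of a term is N times the squared coefficient. The sketch below checks these relations against the reported values; N = 16 runs is taken from the 15 total degrees of freedom, and the helper name is illustrative.

```python
import math

N_RUNS = 16        # total df of 15 in the ANOVA table implies 16 runs
S = 13.1339        # S from the Model Summary

# Effects copied from the Coded Coefficients table.
effects = {"A": 1.50, "B": 4.00, "C": -19.50, "D": -7.00,
           "E": 3.00, "F": 11.50, "G": -4.00}

se_coef = S / math.sqrt(N_RUNS)


def term_summary(effect):
    """Coefficient, t-value and adjusted sum of squares for one main effect."""
    coef = effect / 2
    return {"coef": coef, "t": coef / se_coef, "adj_ss": N_RUNS * coef ** 2}


summaries = {name: term_summary(effect) for name, effect in effects.items()}
model_ss = sum(row["adj_ss"] for row in summaries.values())

print(f"SE Coef = {se_coef:.2f}")
for name, row in summaries.items():
    print(f"{name}: coef = {row['coef']:.2f}, t = {row['t']:.2f}, Adj SS = {row['adj_ss']:.2f}")
print(f"Model SS = {model_ss:.2f}")
```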
Hard-to-change factors: A
Analysis of Variance
Model Summary
S R-sq(SP) S(WP) R-sq(WP)
5.47655 36.86% 2.37135 1.22%
The model explains about 36.86% of the subplot (SP) variation in the data, as indicated by R-sq(SP).
Coded Coefficients
Term Effect Coef SE Coef T-Value P-Value VIF
Constant 61.59 1.53 40.24 0.001
A[HTC] 0.48 0.24 1.53 0.16 0.005 *
B -0.781 -0.391 0.968 -0.40 0.018 1.00
C -2.106 -1.053 0.968 -1.09 0.007 1.00
D -1.744 -0.872 0.968 -0.90 0.003 1.00
A[HTC]*B -2.206 -1.103 0.968 -1.14 0.274 1.00
A[HTC]*C 2.294 1.147 0.968 1.18 0.256 1.00
A[HTC]*D 0.531 0.266 0.968 0.27 0.788 1.00
B*C -0.469 -0.234 0.968 -0.24 0.812 1.00
B*D 2.169 1.084 0.968 1.12 0.282 1.00
C*D -0.356 -0.178 0.968 -0.18 0.857 1.00
A[HTC]*B*C 0.456 0.228 0.968 0.24 0.817 1.00
A[HTC]*B*D 2.219 1.109 0.968 1.15 0.271 1.00
A[HTC]*C*D 1.119 0.559 0.968 0.58 0.573 1.00
B*C*D -0.294 -0.147 0.968 -0.15 0.882 1.00
A[HTC]*B*C*D -0.794 -0.397 0.968 -0.41 0.688 1.00
Nested ANOVA
Description
Content of Drug Samples Manufactured at Two Sites (3 randomly chosen batches at each site, 5
randomly chosen pills from each of the 6 batches)
Null hypothesis:
All the group means are same
Alternate hypothesis:
All the group means are not same
Pathway
The p-value exceeds our significance level of 0.05, which indicates that all group means are the
same.
S = 0.109962   R-Sq = 61.94%   R-Sq(adj) = 54.01%
Variance Components
Source   Var Comp.   % of Total   StDev
Site     -0.006*     0.00         0.000
Batch    0.020       62.65        0.142
Error    0.012       37.35        0.110
Total    0.032                    0.180
Interpretation
The ANOVA table shows no significant Site effect. However, there is a very highly significant Batch effect,
and some investigation into how to produce more uniform batches may be in order. Notice that Site is
"tested against" Batch and that Batch is tested against Error. If a variance component estimate is less than
zero, Minitab displays the estimate but sets it to zero when calculating the percent of total
variability.
Cube points: 4
Center points in cube: 5
Axial points: 4
Center points in axial: 0
α: 1.41421
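The design summary above is what one would expect for a two-factor central composite design. The sketch below assumes a rotatable design, for which α is the fourth root of the number of cube points; that assumption reproduces the listed α of 1.41421, and the function name is illustrative.

```python
def ccd_summary(k, center_points):
    """Point counts and axial distance for a rotatable central composite design."""
    cube_points = 2 ** k            # full two-level factorial portion
    axial_points = 2 * k            # two axial (star) points per factor
    alpha = cube_points ** 0.25     # rotatability condition (assumed)
    return {
        "cube_points": cube_points,
        "axial_points": axial_points,
        "center_points": center_points,
        "alpha": alpha,
        "total_runs": cube_points + axial_points + center_points,
    }


design = ccd_summary(k=2, center_points=5)
for name, value in design.items():
    print(f"{name}: {value}")
```

The 13 runs match the 12 total degrees of freedom in the analysis of variance for this design.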
Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
Model 5 5038.0 1007.6 0.37 0.006
Linear 2 1963.5 981.8 0.36 0.001
A 1 135.0 135.0 0.05 0.031
B 1 1828.6 1828.6 0.67 0.041
Square 2 1668.2 834.1 0.30 0.127
A*A 1 296.2 296.2 0.11 0.342
B*B 1 1518.6 1518.6 0.55 0.281
2-Way Interaction 1 1406.3 1406.3 0.51 0.497
A*B 1 1406.3 1406.3 0.51 0.297
Error 7 19176.8 2739.5
Lack-of-Fit 3 8766.0 2922.0 1.12 0.239
Pure Error 4 10410.8 2602.7
Total 12 24214.8
Model Summary
S R-sq R-sq(adj) R-sq(pred)
52.3406 69.81% 60.00% 67.00%