Applied Environmental Measurement Techniques: Statistics Exploratory Data Analysis

Sharon Kühlmann-Berenzon
Department of Mathematical Statistics

Chalmers - Göteborgs Universitet
HT2001
sharon@math.chalmers.se
Applied Environmental Measurement Techniques

Statistics
Exploratory Data Analysis
or Quantitative Detective Work
1 Introduction
So you have collected your data. And now what? The fun is not over yet. The search for
the evidence starts now. That evidence will lead to the answers to the questions posed in the
experimental planning stage. Tukey (1977), a famous statistician who laid the foundations of
exploratory data analysis (EDA), called this very first step of the analysis "quantitative detective
work". What you will do now is find clues and gather evidence. If EDA is the detective part,
then the inference analysis stage following EDA is the trial, where you present your evidence,
and based on it, give a ruling concerning the hypotheses and generalize conclusions.
Why not go directly to the inference? Because, if during the EDA you do not find anything
interesting, then there is little to be done in the inference stage.
For now, however, we will concentrate on tools for the detective work. Think of it as your
opportunity to get to know your data, to get a feeling for what you have measured, to see how
it looks, and to listen to what it is trying to tell you. Examples of things you can learn during
the EDA:
find errors, typos, and mistakes in the data;
find possible trends;
find possible relationships;
find gaps in the study.
2 Notation

Symbol       Description
$\sum$       summation
$x_i$        observation
$n$          number of observations
$\bar{x}$    mean
Md           median
Mo           mode
R            range
IQR          inter-quartile range
$s^2$        variance
$s$          standard deviation
$r$          correlation coefficient
3 Example
To illustrate the methods used in EDA, we will use a dataset on mercury in fish, collected in
Florida in 1993. The data is available through StatLib at:
http://lib.stat.cmu.edu/DASL/Datafiles/MercuryinBass.html.
Other sites with information on mercury in fish are:
http://www.cfsan.fda.gov/~dms/admehg.html
http://www.cfsan.fda.gov/~frf/sea-mehg.html
The following information is reproduced as it appears on the webpage:
Datafile Name : Mercury in Bass

Datafile Subjects : Health , Nature
Story Names : Mercury Contamination in Bass
Reference : Lange, Royals & Connor. (1993).
Transactions of the American Fisheries Society.
Authorization : contact authors
Description:
Largemouth bass were studied in 53 different Florida lakes to examine the factors that influ-
ence the level of mercury contamination. Water samples were collected from the surface of the
middle of each lake in August 1990 and then again in March 1991. The pH level, the amount of
chlorophyll, calcium, and alkalinity were measured in each sample. The average of the August
and March values were used in the analysis. Next, a sample of fish was taken from each lake
with sample sizes ranging from 4 to 44 fish. The age of each fish and mercury concentration
in the muscle tissue was measured. (Note: Since fish absorb mercury over time, older fish will
tend to have higher concentrations). Thus, to make a fair comparison of the fish in different
lakes, the investigators used a regression estimate of the expected mercury concentration in a
three year old fish as the standardized value for each lake. Finally, in 10 of the 53 lakes, the age
of the individual fish could not be determined and the average mercury concentration of the
sampled fish was used instead of the standardized value.
Number of cases: 53
Table 1: Mercury in Bass: Variable description
Variable Name : Description (measurement units)

1. ID : ID number
2. Lake : Name of the lake
3. Alkalinity : Alkalinity (mg/L as Calcium Carbonate)
4. pH : pH
5. Calcium : Calcium (mg/l)
6. Chlorophyll : Chlorophyll (mg/l)
7. Avg.Mercury : Average mercury concentration (parts per million)
in the muscle tissue of the fish sampled from that lake
8. No.samples : How many fish were sampled from the lake
9. min : Minimum mercury concentration amongst the sampled fish
10. max : Maximum mercury concentration amongst the sampled fish
11. 3.yr.Standard.mercury : Regression estimate of the mercury concentration in a 3 year
old fish from the lake (or = Avg Mercury when
age data was not available)
12. age.data : Indicator of the availability of age data on fish sampled
4 What type of variables do you have?
Before starting the actual EDA, you must identify clearly the type of measurements you have.
In Statistics, every property that has been measured is called a variable. The reason for the
name is that we assume that the measurements are random and can vary; in other words, the
property or variable can have different values. A simple example: height of a student; not all
students have the same height.
There are three types of variables: categorical, discrete/counts, and continuous. The main
difference among them is the scale used for the measurements. The importance of being able to
differentiate among them is that not all methods and tools can be used for all types. Knowing
what you can and cannot do is a crucial first step.
4.1 Categorical variables
These are variables that are measured according to predetermined categories. The factors or
treatments chosen in the experimental planning stage are often variables of this type. A special
case of categorical variables is those with only two categories, usually indicating the presence
or absence of a certain characteristic; variables of this type are known as binary variables.
Ordered categories can also be seen as a special case, where the categories have a natural order.
Examples:
Area    Variable                                           Categories
Air     Pollen in the air                                  no risk; some risk; middle risk; high risk (ordered)
Water   Water quality at bathing places (Miljö Göteborg)   unsuitable; suitable with restrictions; suitable
Water   Algae bloom at bathing places                      yes; no (binary)
Water   Fish species                                       salmon; cod; tuna; swordfish; other
4.2 Discrete and counts
These are measurements that can only take certain values. A typical example of discrete data
is counts, i.e. data that tell us "how many" and can only take whole numbers (e.g. 0, 1,
2, . . . ).
Examples:
Area Variable
Air Number of cars passing through Korsvägen between 7am and 9am.
Water Number of fish in the lake.
In the Mercury study, the number of samples "No.samples" is a count variable.
4.3 Continuous
These are variables that can be measured, in theory, with infinite precision.
Examples:
Area Variable
Air Ozone level
Water Temperature of water
Water Weight of a fish
Most of the variables measured in the Mercury study are of this type, e.g. alkalinity, chloro-
phyll, and pH.
5 Population and Sample
The data collected for a variable will most probably have been measured on a sample. A sample
is defined as a subset of the population. The population is the complete set of elements, i.e. all
the fish in the lake, every square centimeter of the surface of the lake, every minute of the day,
etc. Since not every "element" can be measured, we work with a subgroup, the sample,
and measure our variables on the elements of the sample.
Since we are interested in conclusions for the population, the hypotheses in the planning
stage are formulated for the population. The sample data, however, is used for testing those
hypotheses.
6 Distribution of a Random Variable
A typical set of data from a variable will have several replicates. The first step is to obtain an
idea of what the distribution of the variable looks like. This can be done with a frequency
distribution, which shows how often a particular value was obtained. Hopefully the sample will
be large enough for the sample distribution to approximate the population distribution well.
A frequency distribution can be shown in a table or in a graph. Table 2 shows the observed
and relative frequencies for the number of samples per lake. The observed frequencies
show that in one lake, 4 fish were caught, in another 5 fish, in four lakes, 6 fish, etc. In most
of the lakes, 34 of them, between 10 and 12 fish were sampled. The relative frequencies are the
same values expressed as percentages: in about 21% of the lakes, 10 fish were captured.
Table 2: Mercury in Bass: Observed and relative frequencies for the number of sampled fish
per lake.
No. Observed Relative
samples frequency frequency
1 0 0
2 0 0
3 0 0
4 1 1.89
5 1 1.89
6 4 7.55
7 2 3.77
8 1 1.89
9 0 0
10 11 20.75
11 3 5.66
12 20 37.74
13 3 5.66
14 2 3.77
15+ 5 9.43
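A minimal sketch of how the observed and relative frequencies of Table 2 could be tabulated is given below. The list `no_samples` is rebuilt from the frequencies in Table 2, with the five lakes in the "15+" class entered as 15 purely for illustration; the function name `frequency_table` is hypothetical.

```python
from collections import Counter

def frequency_table(values):
    """Return {value: (observed frequency, relative frequency in %)}."""
    counts = Counter(values)
    n = len(values)
    return {v: (c, round(100 * c / n, 2)) for v, c in sorted(counts.items())}

# Number of fish per lake, rebuilt from Table 2.
# Assumption: the "15+" class is entered as 15, since the exact values are not listed.
no_samples = (
    [4] + [5] + [6] * 4 + [7] * 2 + [8] + [10] * 11 + [11] * 3
    + [12] * 20 + [13] * 3 + [14] * 2 + [15] * 5
)

table = frequency_table(no_samples)
for value, (observed, relative) in table.items():
    print(f"{value:>3} {observed:>3} {relative:>6.2f}")
```

Values that never occur (1, 2, 3 and 9 fish) simply do not appear in the output; they would have to be added by hand if a row with frequency 0 is wanted.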
The same information can be viewed graphically using a histogram. A histogram has bars
that represent the frequency. In the case of categorical and discrete data, we can draw one bar
for each value. For continuous data, it is necessary to make intervals, as was done for Calcium
in Fig 1(a). If the data is categorical and there is no implied order in the categories, the
histogram must be drawn using horizontal bars, ordered from largest to smallest. The reason
for this is that a horizontal axis always gives the illusion of order, which does not necessarily
apply to categorical data. For discrete and continuous variables, vertical bars should be used.
The form of the distribution for Calcium is skewed, i.e. it is not symmetrical. In this case
it is positively skewed (skewed to the right), since most of the observations are concentrated
in the low values of Calcium and a long tail extends towards the high values. Most distributions
are either skewed or symmetrical. The distribution of the pH, for example, is symmetrical
(Fig. 1(b)).
For categorical data, pie charts may also be a useful tool for studying their distribution.
[Figure 1: histograms, vertical axis Frequency. (a) Calcium (mg/l); (b) pH (log scale)]
Figure 1: Mercury in Bass. Data source: StatLib
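The interval-making step for continuous data can be sketched as follows. This is only an illustration under stated assumptions: the values in `calcium` are hypothetical (not the real measurements), the interval width of 20 mg/l is arbitrary, and `bin_counts` is a made-up helper name.

```python
from collections import Counter

def bin_counts(values, width):
    """Count observations per interval [k*width, (k+1)*width)."""
    counts = Counter(int(v // width) * width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical calcium-like measurements in mg/l (not the real data).
calcium = [1.1, 3.0, 5.2, 12.6, 19.9, 20.0, 44.1, 90.7]

# A crude text histogram: one row per non-empty interval, one '#' per observation.
for lower, count in bin_counts(calcium, width=20).items():
    print(f"{lower:>3} to <{lower + 20:<3} {'#' * count}")
```

The choice of interval width changes how the histogram looks, so it is worth trying more than one width before judging the form of the distribution.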
7 Exploring One Variable
Location and dispersion are two of the most important characteristics of the distribution of
a variable. A lot of information can be obtained by describing the location and dispersion.
Location refers to the position of the distribution; dispersion represents the variability in the
distribution. It is not always obvious how to summarize these characteristics. The type of data
(categorical, discrete, or continuous) and the form of the distribution (skewed or symmetrical)
define in part how to represent the location and dispersion.
8 Statistics for Location
8.1 Mode
This is a statistic that can always be used, although it might not give very much information. It
represents the value of the variable that happens most often. For the number of samples of fish
per lake, the mode is 12, and in pH, between 6 and 7. It may be used for categorical, discrete,
and continuous variables, and in skewed and symmetrical distributions. In some distributions,
there might be two or more modes.
8.2 Median
The median represents the value of the variable that accumulates 50% of the observations.
First we order all observations from smallest to largest; the median is the value in the center.
If the center falls between two observations, the median is equal to the average of those two
observations. The median is interpreted as the value for which 50% of the observations are
smaller, and 50% larger. The median is 12 for the number of samples of fish per lake, for
calcium, 12.60, and for the maximum mercury concentration, 0.84.
It may be used for discrete and continuous variables, and for categorical if there is some
order in the categories.
8.3 Mean
The mean is also known as the average. It is calculated as the sum of all values divided by the
number of observations:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad (1)$$
The average number of fish per lake is 13.06, the average calcium level is 22.2, and mean
maximum mercury concentration is 0.8745.
The average can only be used for discrete and continuous variables, never for categorical data.
The mean is sensitive to extreme values, i.e. to those values that are at the tails or ends of the
distribution. Even one single, very small or very large observation can change the mean
considerably. For this reason, one must be careful when reporting the mean of a skewed distribution.
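The sensitivity of the mean to a single extreme value can be seen in a small sketch using Python's standard `statistics` module. The numbers are hypothetical and chosen only for illustration; they are not taken from the Mercury in Bass data.

```python
import statistics

# Hypothetical measurements (not from the Mercury in Bass data).
values = [2, 3, 3, 4, 5]
with_extreme = values + [50]   # the same data plus one very large observation

for label, data in [("original", values), ("with extreme value", with_extreme)]:
    print(label,
          "| mode:", statistics.mode(data),
          "| median:", statistics.median(data),
          "| mean:", round(statistics.mean(data), 2))
```

The mode and the median barely move when the extreme value is added, while the mean more than triples.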
8.4 Proportion
Proportions are the typical statistics used for categorical variables, in particular for binary ones.
The interest is on the proportion of elements that possess a certain characteristic.
The proportion of lakes that are acidic (i.e. pH < 6.8) in the mercury in bass example is 0.35.
9 Statistics for Dispersion
9.1 Range
This is a very simple statistic, which can be used for discrete and continuous variables. It is
computed as the difference between the maximum and the minimum values:
$$R = x_{\max} - x_{\min} \qquad (2)$$
The larger the range, the more variability can be seen in the distribution. This statistic
does not give much information, but it may be used to compare similar variables, or the same
variable analyzed for two or more groups.
The range of alkalinity is $R = 128 - 1.2 = 126.8$, and for calcium $R = 90.7 - 1.1 = 89.6$,
meaning that the alkalinity values spread over a larger scale.
9.2 Percentiles
The percentiles are similar to the median, but instead of dividing the observations at 50%,
they divide them at some other percentage. The most popular ones are the 25th percentile and
the 75th percentile: the first gives the value where 25% of the observations are below and 75%
above, and the second represents the value where 75% of the observations are below and 25%
above. These two percentiles can be used to calculate the inter-quartile range (IQR), which is
another measure of dispersion. It is the difference between the two percentiles (or quartiles,
since they are located at 1/4 and 3/4 of the ordered observations):
$$\mathrm{IQR} = P_{75} - P_{25} \qquad (3)$$
where $P_{25}$ and $P_{75}$ denote the 25th and 75th percentiles.
The median, 5th, 25th, 75th, and 95th percentiles are used to draw boxplots. This is another
graphical tool used to describe the distribution and represent some of the location and disper-
sion statistics at the same time. It is very useful for comparing distributions and for gaining a
good picture in a quick and easy way (see Fig 2). The edges of the box are defined by the 25th
and 75th percentiles, and the line inside the box represents the median. The two lines spreading
from the box end at the 5th and 95th percentile (sometimes it is the 10th and 90th, depending
on the software).
Figure 2 shows the boxplot of chlorophyll. The distribution seems skewed, with most observations
at the lower values and a long tail towards the higher values (positive skewness), and 6
observations lie above the 95th percentile.
[Figure 2: boxplot; vertical axis: Chlorophyll (mg/l)]
Figure 2: Mercury in Bass: Chlorophyll (mg/l). Data source: StatLib
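The quartiles behind a boxplot, and the IQR computed from them, might be obtained as in the sketch below. It relies on `statistics.quantiles` from the Python standard library (Python 3.8 or later) with its default method; the data are hypothetical. Software packages use slightly different rules for interpolating percentiles, so small differences between programs are to be expected.

```python
import statistics

# Hypothetical observations (not from the Mercury in Bass data).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# quantiles(..., n=4) returns the three cut points:
# the 25th percentile, the median and the 75th percentile.
p25, p50, p75 = statistics.quantiles(data, n=4)
iqr = p75 - p25
value_range = max(data) - min(data)

print("25th percentile:", p25)
print("median:", p50)
print("75th percentile:", p75)
print("IQR:", iqr, "| range:", value_range)
```

The IQR covers only the middle half of the observations, so unlike the range it is not affected by a single extreme value.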
9.3 Variance and Standard deviation
A commonly used statistic for dispersion is the variance. It may only be used for discrete and
continuous data. The variance is based on the sum of the squared deviations from the mean and
is calculated as:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \qquad (4)$$
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1} \qquad (5)$$
Since the variance is in squared units, which may be difficult to interpret, the standard
deviation is more commonly used. The standard deviation is the square root of the variance,
$$s = \sqrt{s^2}, \qquad (6)$$
and is, therefore, in the same units as the observations.
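As a sketch, the variance can be computed directly from its definition (squared deviations from the mean, divided by n - 1) and from the algebraically equivalent shortcut form based on the sum of squares. The data and the function names below are hypothetical, used only to show that both forms agree.

```python
import math
import statistics

def variance_definition(x):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(x)
    mean = sum(x) / n
    return sum((xi - mean) ** 2 for xi in x) / (n - 1)

def variance_shortcut(x):
    """Equivalent form: (sum of x_i^2 - n * mean^2) / (n - 1)."""
    n = len(x)
    mean = sum(x) / n
    return (sum(xi ** 2 for xi in x) - n * mean ** 2) / (n - 1)

# Hypothetical observations (not from the Mercury in Bass data).
x = [2, 4, 4, 4, 5, 5, 7, 9]

s2 = variance_definition(x)
s = math.sqrt(s2)   # standard deviation, in the same units as the observations
print("variance:", round(s2, 3), "| standard deviation:", round(s, 3))
```

With very large values the shortcut form can lose numerical precision, which is one reason to prefer the definition (or a library routine such as `statistics.variance`) in practice.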
9.4 Coefficient of variation
Sometimes it is of interest to compare the variability of two variables that are measured in
different units, or of two variables with very different means. One way of going about it is by
calculating the coefficient of variation (CV). It is the ratio of the standard deviation to the
mean and can be expressed as a percentage:
$$CV = \frac{s}{\bar{x}} \times 100\% \qquad (7)$$
The CV of pH is 19.55, and for alkalinity, 101.79. This means that alkalinity is much more
variable than pH.
The CV for very small samples is biased, meaning that it does not approximate correctly the
CV of the population. It can be corrected using
$$CV^{*} = \left(1 + \frac{1}{4n}\right) CV, \qquad (8)$$
where $CV$ is the uncorrected coefficient of variation and $CV^{*}$ the corrected one.
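A small sketch of the CV and its small-sample correction follows. It assumes that the correction in Eq. (8) is the factor (1 + 1/(4n)) described by Sokal and Rohlf (1995); the data and the function names are hypothetical.

```python
import statistics

def coefficient_of_variation(x):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * statistics.stdev(x) / statistics.mean(x)

def corrected_cv(x):
    """Small-sample correction (assumed form): multiply the CV by (1 + 1/(4n))."""
    n = len(x)
    return (1 + 1 / (4 * n)) * coefficient_of_variation(x)

# Hypothetical observations (not from the Mercury in Bass data).
x = [2, 4, 4, 4, 5, 5, 7, 9]
print("CV (%):", round(coefficient_of_variation(x), 2))
print("corrected CV (%):", round(corrected_cv(x), 2))
```

The correction factor approaches 1 as n grows, so it only matters for small samples.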
10 Exploring Two Variables
10.1 Contingency table: categorical vs categorical
If you want to explore any possible relationship between two categorical variables, then you
first need to construct a contingency table, also known as a crosstabulation. This is a table
with the categories of variable 1 displayed as columns, and the categories of variable 2 as rows.
The cells of the table then contain the number of observations that fall in each combination of
both variables.
The U.S. Food and Drug Administration (1994) considers levels of mercury in fish under 1 ppm
to be safe. If we are interested in exploring the relationship between 3-year standard
mercury levels in fish and acidity in the lake, we might create a contingency table such as
Table 3. It is not clear that there is a relationship between these two variables, but we won't
draw any conclusions at this stage.
Table 3: Mercury in Bass: Average Mercury level above or below 1 ppm and acidity of the lake.
               Average mercury level
pH level     < 1 ppm     ≥ 1 ppm     Total
> 6.8           25           1          26
≤ 6.8           22           5          27
Total           47           6          53
Data source: StatLib
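Building such a table amounts to classifying every lake on both variables and counting the combinations. The sketch below uses hypothetical (pH, mercury) pairs, not the real data, and the cut-offs 6.8 and 1 ppm are taken from the text only as illustrative thresholds; the helper names are made up.

```python
from collections import Counter

# Hypothetical lakes as (pH, mercury in ppm) pairs; not the real data.
lakes = [(5.2, 1.3), (7.4, 0.4), (6.1, 0.9), (8.0, 0.2), (4.9, 1.1), (7.0, 0.6)]

def ph_class(ph):
    return "pH < 6.8" if ph < 6.8 else "pH >= 6.8"

def mercury_class(hg):
    return "< 1 ppm" if hg < 1 else ">= 1 ppm"

# Each cell counts the lakes that fall in one combination of the two categories.
cells = Counter((ph_class(ph), mercury_class(hg)) for ph, hg in lakes)

columns = ["< 1 ppm", ">= 1 ppm"]
print(f"{'':<10} {columns[0]:>9} {columns[1]:>9} {'Total':>6}")
for row in ["pH < 6.8", "pH >= 6.8"]:
    counts = [cells[(row, col)] for col in columns]
    print(f"{row:<10} {counts[0]:>9} {counts[1]:>9} {sum(counts):>6}")
```

A `Counter` returns 0 for combinations that never occur, so empty cells are printed as 0 rather than causing an error.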
10.2 Boxplots: categorical vs discrete/continuous
We saw before that boxplots can be very useful tools for studying distribution, location, and
dispersion. They are also helpful for relating a categorical variable to a continuous or discrete
variable. We may draw a boxplot for every category and see if there are any obvious trends or
differences in the distributions of the continuous variable among the categories.
Figure 3 compares the calcium measurements according to acidity levels of the lakes. It is
clear that not only the location of the distributions is different, but also the dispersion of the
observations.
[Figure 3: boxplots of Calcium (mg/l) for each acidity level: almost neutral, mildly acid, moderately acid, acid, very acid]
Figure 3: Mercury in Bass: Calcium (mg/l) in lakes according to acidity level defined by the
Swedish EPA. Data source: StatLib
10.3 Scatterplots: discrete/continuous vs discrete/continuous
Two discrete or continuous variables can be explored in a scatterplot. Each variable represents
a coordinate in an x-y system.
If we are interested in finding out whether there is any relationship between calcium in the
lake and the average mercury in the fish, we may draw a plot such as Fig. 4. We can observe some
type of exponential trend in the data, but also some outlying observations that might be
important to check.
[Figure 4: scatterplot; horizontal axis: Calcium (mg/l); vertical axis: Mercury 3-year std. (ppm)]
Figure 4: Mercury in Bass: Calcium (mg/l) and Mercury for a 3-year standard (ppm). Data
source: StatLib
10.4 Bar charts
Categorical and discrete/continuous variables may be seen together in bar charts as well. A bar
chart may also let you "see" three variables at the same time.
One example of a bar chart is given in Fig. 5. In this case the bars were drawn vertically since
the acidity levels have a natural order.
[Figure 5: bar chart; horizontal axis: Acidity level (almost neutral, mildly acid, moderately acid, acid, very acid); vertical axis: Calcium (mg/l)]
Figure 5: Mercury in Bass: Mean (bar) and Standard deviation (line) of Calcium (mg/l) accord-
ing to acidity level of the lake (Swedish EPA levels). Data source: StatLib
10.5 Time series
If you wish to explore a variable as it evolves in time, then the most convenient way to do it is
by plotting a scatterplot and connecting the points. The points must be spaced according to the
time intervals, i.e. the space between two points in the graph must be proportional to the time
elapsed between the two observations.
Imagine that 25 measurements (1 per week) were conducted at the same lake (not from the
database). In Fig. 6 we can see how the alkalinity varied over that period.
[Figure 6: time series plot; horizontal axis: Time (weeks 1 to 25); vertical axis: Alkalinity (mg/l as Calcium Carbonate)]
Figure 6: Mercury in Bass: Alkalinity (mg/l as Calcium Carbonate) between 6 March 2000 and
28 August 2000.
10.6 Correlation
The relationship between two continuous variables can be summarized in the correlation
coefficient. The correlation, often denoted by $r$, is a number in the interval $[-1, +1]$ that
reflects the strength of the linear relationship. If two variables increase or decrease
simultaneously in a linear way, then $r$ will be close to 1; this is a positive correlation, as
in Fig 7(a). If one increases as the other decreases, then $r$ will be close to $-1$ and it is
called a negative correlation (see Fig 7(b)). Values of $r$ close to 0 represent a lack of linear
relationship; see Figures 8(a) and 8(b).
[Figure 7: scatterplots of Var2 against Var1. (a) High positive correlation; r=0.98. (b) High negative correlation]
Figure 7: Scatterplots of highly correlated variables

[Figure 8: scatterplots of Var2 against Var1. (a) No correlation; r=0.01. (b) Low correlation; r=0.36]
Figure 8: Scatterplots of low correlated variables
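For continuous variables, the correlation can be calculated with the formula for the Pearson
correlation:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (9)$$
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y}, \qquad (10)$$
where $x_i$ and $y_i$ are the i-th observations of the respective variables; $\bar{x}$ and
$\bar{y}$ are the means of variable $x$ and variable $y$; $s_x$ and $s_y$ are the standard
deviations of variable $x$ and variable $y$; and $n$ is the number of observations.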
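The Pearson correlation can be sketched in a few lines, here in the form that divides the sum of cross-products by (n - 1) times the two standard deviations. The data are hypothetical; the third example is deliberately U-shaped to show that a strong but non-linear relationship can still give a correlation of zero.

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation: sum of cross-products over (n - 1) * s_x * s_y."""
    n = len(x)
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return cross / ((n - 1) * statistics.stdev(x) * statistics.stdev(y))

# Hypothetical observations (not from the Mercury in Bass data).
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]      # increases linearly with x
y_down = [10, 8, 6, 4, 2]    # decreases linearly with x
y_curve = [4, 1, 0, 1, 4]    # strong but non-linear (U-shaped) relationship

for label, y in [("up", y_up), ("down", y_down), ("curve", y_curve)]:
    print(label, round(pearson(x, y), 2))
```

This is why a scatterplot should always accompany the correlation coefficient: the number alone cannot distinguish "no relationship" from "a relationship that is not linear".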
If one of the two variables is an ordered categorical variable (e.g. acidity level), the
correlation must use the Spearman formula, also known as the rank correlation. This formula
does not use the values of the observations, but rather the ranks. The rank is the number an
observation receives when the set of data is ordered.
Say we have ten observations on acidity level. Table 4 has the 10 observations in the first
column. In the second column they have been ordered from least acid to most acid; the third
column gives numbers according to the order. It is clear that some of them are tied, i.e. they
have the same level, and therefore should have the same rank. The new rank is the average of
the original order numbers for the tied observations.
For example: the first four observations are the same and have order numbers 1, 2, 3, and 4. The
average of those numbers is 2.5, so they all receive a rank of 2.5. The last three are also
tied and have order numbers 8, 9, and 10, so their rank is the average, 9.
The Spearman correlation is calculated as the Pearson correlation (Eq. 10), but using the
ranks instead of the observations of both variables.
Table 4: Example of ranking

Unordered          Ordered            Order   Rank
observation        observation
acid               almost neutral       1      2.5
very acid          almost neutral       2      2.5
almost neutral     almost neutral       3      2.5
almost neutral     almost neutral       4      2.5
very acid          moderately acid      5      5
almost neutral     acid                 6      6.5
very acid          acid                 7      6.5
almost neutral     very acid            8      9
acid               very acid            9      9
moderately acid    very acid           10      9
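The ranking with ties shown in Table 4 can be sketched as follows. The numeric codes given to the acidity levels and the function name `average_ranks` are assumptions made for this illustration; only the order of the codes matters, not their values.

```python
def average_ranks(values):
    """Rank from smallest to largest; tied values share the average of their order numbers."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next ordered value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + 1 + end + 1) / 2   # mean of order numbers start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

# Numeric codes for the ordered categories (the coding itself is an assumption).
level = {"almost neutral": 1, "mildly acid": 2, "moderately acid": 3,
         "acid": 4, "very acid": 5}

# The ten unordered observations of Table 4.
observations = ["acid", "very acid", "almost neutral", "almost neutral", "very acid",
                "almost neutral", "very acid", "almost neutral", "acid", "moderately acid"]

ranks = average_ranks([level[o] for o in observations])
for obs, rank in zip(observations, ranks):
    print(f"{obs:<16} {rank}")
```

Applying the Pearson formula to two such rank lists, one per variable, gives the Spearman correlation.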
In the Mercury in Bass dataset, we obtained a correlation of -0.46 for Calcium and the
Mercury 3-year standard (see Fig 4). There seems to be a relationship between those two
variables, but it is not linear, so the Pearson correlation is not very high. We can also compare
the acidity level with the Mercury 3-year standard using the Spearman correlation, and that gives
$r_s = [\ldots]$.
Caution with correlation:
Correlation should not be confused with causality, i.e. concluding that one variable is the
cause of the other. Two variables may be correlated through a third, unobserved variable. Small
samples make it easier to find correlations that later fade when a larger sample is
studied. Sometimes correlations are also hidden when the sample includes different subgroups.
In any case, correlations should be used with caution, especially when drawing conclusions.
In the environmental sciences it is often very difficult to obtain high correlations, since there
are many other factors interfering.
11 Skewed Distributions
If a distribution is symmetrical, the mode, the median, and the mean will coincide. The more
skewed the distribution is, the further away from each other they will be. For pH, the median is
6.8, and the mean is 6.6, but for calcium they are 12.6 and 22.2, respectively. The rather large
difference between the median and mean of calcium reflects the skewness of its distribution.
For this reason, if the observations are skewed, the researcher often reports both statistics, since
only one of them does not tell the whole story. In a similar way, the standard deviation may not
be the most appropriate way of describing variability in a skewed distribution, so the IQR
or the difference between the 5th and 95th percentiles may be a better statistic.
Positively skewed distributions appear often in environmental studies, since the values are
often concentrated close to zero and only a few observations are very large. This is commonly
seen in air and water quality measurements.
12 Logarithmic Scale
Some variables, e.g. pH, are used in the logarithmic scale, i.e. their original values are
transformed to their logarithm:
$$x = \log(x^{*}), \qquad (11)$$
where $x^{*}$ are the original observations.
In the case of pH, the original observations are powers of ten with negative exponents,
$x^{*} = 10^{-k}$, but the actual data used are the exponents $k$ (in the case of pH, the
transformation is $-\log(x^{*})$).
There are several reasons for using logarithms. In the case of pH, the raw values differ
greatly among them. Using a linear scale would mean that the observations would range
between 100 and 1 000 000 000, instead of from 2 to 9. This means that the interpretation
must be done carefully, since a difference of 2 in the logarithmic scale represents a
difference of 100 times in the linear scale.
In a graph, the scale must also be logarithmic. That means that each tick on the axis
represents 10 times more than the one before. Some software is capable of doing this, and special
logarithmic paper is also available for this purpose. Figure 9 compares pH and mercury levels
on a 3-year standard fish, where pH is graphed on a logarithmic scale. The relationship
between both variables seems linear. The interpretation is that a tenfold increase in acidity
(a decrease of one pH unit) is related to an increase in mercury (how large an increase cannot
be determined yet).
A skewed distribution can sometimes be made more symmetric if the logarithms of the
observations are used. This might be useful in some analyses that only work for variables that
have a normal distribution. It may also help stabilize the variance, when the variance tends to
increase with the mean; e.g. measurements of height of trees are more variable in taller than
shorter trees. In general logarithms will find relations that are based on multiple increments.
[Figure 9: scatterplot; horizontal axis: pH; vertical axis: Mercury 3-year std (ppm)]
Figure 9: Mercury in Bass: pH and Mercury for a 3-year standard (ppm). Data source: StatLib
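Two properties of the logarithmic scale discussed in this section can be checked with a short sketch: a difference of 2 units on the log scale corresponds to a factor of 100 on the linear scale, and taking logarithms can make a skewed set of values symmetric. The numbers are hypothetical, and base-10 logarithms are assumed throughout (as is conventional for pH).

```python
import math
import statistics

# pH is minus the base-10 logarithm of the raw value (hypothetical raw values).
raw = [1e-4, 1e-6]
ph = [-math.log10(v) for v in raw]
print("pH values:", ph, "| ratio of raw values:", raw[0] / raw[1])

# A skewed set of hypothetical observations becomes symmetric after taking logs.
skewed = [1, 10, 100, 1000, 10000]
logged = [math.log10(v) for v in skewed]
print("before: mean", statistics.mean(skewed), "median", statistics.median(skewed))
print("after:  mean", statistics.mean(logged), "median", statistics.median(logged))
```

Before the transformation the mean is far above the median, a sign of positive skewness; after it the two coincide.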
13 Checking the Data
Errors in the measurements occur constantly; never assume it will not happen to you.
Common errors are:
contaminated sample;
mistake when reading the instrument;
mistake when transcribing the figure onto the notebook or computer (typo);
error in calculating a variable (e.g. indices, times, etc.);
skipping an observation;
repeating an observation;
inverted numbers;
mistake in coding;
wrong placement of decimal point;
wrong measurement units.
Transforming the data into codes that the computer can interpret is called coding. For
example, one could use 1 for almost neutral, 2 for moderately acid, 3 for acid, and 4 for very
acid. Often a special number is used when an observation is missing (not available, lost, not
measured, etc.). In many cases the missing code is 99, 9999 or some other value that does not
appear among the observations and is easily recognized.
The dataset does not only include the measurements, but also data on the factors/treatments,
blocks, and other relevant information. You must also check that the observations are correctly
linked to their respective information.
For categorical data, a table of frequencies will often tell you if you have the right categories
(codes) and in the expected proportions. For discrete and continuous data, calculating statistics
for location and dispersion is very useful. Especially helpful are: the maximum and the mini-
mum values, the number of observations, and the boxplot. The maximum and the minimum
might tell you if any extremes are present; these extremes are called outliers. Outliers might be
part of the observations, but sometimes they might be due to an error and are also easy to spot
in a boxplot. The number of observations will indicate if you have all the observations in the
database, and that none have been carelessly missed or repeated.
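A first pass over a variable might look like the sketch below, which reports the number of observations, the number of missing values, the minimum and the maximum, and lists repeated IDs. Everything here is hypothetical: the records, the missing code 9999 and the helper name `check_values` are assumptions for illustration only.

```python
from collections import Counter

MISSING = 9999   # hypothetical missing-value code

def check_values(values, missing=MISSING):
    """Summaries that help spot typos, outliers and lost observations."""
    present = [v for v in values if v != missing]
    return {
        "n": len(values),
        "n_missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
    }

# Hypothetical records: (lake ID, pH). The pH of 68.0 mimics a misplaced decimal
# point, and lake 3 was entered twice.
records = [(1, 6.1), (2, 68.0), (3, 7.3), (3, 7.3), (4, MISSING), (5, 5.4)]

summary = check_values([ph for _, ph in records])
print(summary)

id_counts = Counter(lake for lake, _ in records)
repeated = [lake for lake, count in id_counts.items() if count > 1]
print("repeated IDs:", repeated)
```

A maximum pH of 68 is impossible, so the maximum alone is enough to flag the misplaced decimal point. Note that the missing code must be excluded before computing any statistic, otherwise it would be treated as a real (and extreme) observation.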
14 Everything in Perspective
You now have a set of tools to start applying on your data. Where do you start? The first thing
is to give your data a good double-check. Mistakes not caught at the beginning will make you
work double, and checking the data is always easier and less time consuming than doing all
the analyses.
Use the other tools according to the type of variables you have and according to the question
you have. You do not need to draw contingency tables and scatterplots for all variables against
all variables. Use your time to investigate the most pressing issues: remember, you are looking
for clues that will help you solve your hypotheses.
The tables and graphs at this point need not be "pretty", they are "for your eyes only". Be
sure, however, that you have a title on each one to avoid confusion.
15 What and How to Report
It is not necessary to put all of the analyses in the report. Choose those that you consider the
most relevant, and those that will support your discussion and conclusions.
Those tables and graphs that go into the report should follow certain conventions.
The figure should be self-contained, i.e. all relevant information should be included.
The title should express clearly what variables are being used, how they are being
expressed (e.g. frequency, proportion, mean, etc.), their units of measurement, and when
and where the data was taken.
If the observations are not yours, then include the source.
If a figure is included in the report, then there must be a comment or discussion on it.
Point out to the reader those features in the figure that are relevant for you.
More on "pretty" tables (see examples throughout this document):
Round numbers to two digits according to the precision of the measurements.
Use enough space to make it easy to follow the trends and to compare.
Do not use vertical lines; they break the flow of the information.
Use only horizontal lines.
Lines should be used only for: top and bottom of table; to separate heading of the table
from the body of the table; to separate a line of totals or summary statistics.
More on "pretty" graphs (see examples throughout this document):
The axes should be properly labelled and preferably include the unit of measurement.
If the axis does not start at zero, then it is recommended that the axis be "broken" at the
beginning.
Choose axes that do not distort or suppress information.
Include a legend box where the symbols or colors are described.
If the categories are not ordered, then use horizontal bars.
In a pie chart, order the sections from the largest to the smallest.
A good exercise is to check figures in scientific papers and in newspapers. Analyze if they
are correctly done. If not, then explain what is wrong with them; if they are particularly clear
and well-done, what makes them stand out.
References
Altman, D. G. (1991), Practical Statistics for Medical Research, Chapman & Hall.
Chatfield, C. (1988), Problem solving. A statistician’s guide, Chapman & Hall.
Sokal, R. R. and Rohlf, F. J. (1995), Biometry: the principles and practice of statistics in biological
research, W. H. Freeman, 3rd ed.
Tukey, J. W. (1977), Exploratory Data Analysis, Addison–Wesley.
U.S. Food and Drug Administration (1994), “Mercury in Fish: Cause for Concern? (revised
1995),” http://www.cfsan.fda.gov/~dms/mercury.html.
Utts, J. M. (1996), Seeing through statistics, Duxbury Press.
Yandell, B. S. (1997), Practical Data Analysis for Designed Experiments, Chapman & Hall.