Rda

Multivariate Statistics in Ecology and
Quantitative Genetics
Redundancy analysis
Dirk Metzler & Martin Hutzenthaler

http://evol.bio.lmu.de/_statgen
Summer semester 2011

1 Redundancy analysis
Setting
Example: Artificial fish data
Triplots
Example: Height weight data
Example: Species richness on sandy beaches (RIKZ data)
The order of importance
Redundancy analysis Setting
Contents
Setting
Triplots
Given: Data frames/matrices Y and X

The variables in X are called explanatory variables
The variables in Y are called response variables

Goal: Find those components of Y which are linear
combinations of X and (among those) represent as
much variance of Y as possible.

Assumption: There is a linear dependence of the response
variables in Y on the explanatory variables in X .

The idea behind redundancy analysis is to apply linear
regression in order to represent Y as linear function of X and
then to use PCA in order to visualize the result.

The idea behind redundancy analysis is to apply linear
regression in order to represent Y as linear function of X and
then to use PCA in order to visualize the result.
Among those components of Y which can be linearly explained
with X (multivariate linear regression) take those components
which represent most of the variance.
Before applying RDA:

Is Y increasing with increasing values of X ?
If the variables in X are twice as high, are the variables in Y
also approximately twice as high?
These questions are to check the assumption of linear
dependence.
Redundancy analysis Example: Artificial fish data
Contents
Setting
Triplots
To illustrate the output of redundancy analysis (RDA)

we consider an artificial example from p. 590 of
P. Legendre and L. Legendre.
Numerical Ecology
(We will not go into the maths of RDA)

The artificial data set represents fish abundances at 10 sites

along a tropical reef transect. The first three sites are on sand
and the others alternate between coral and other substrate.
The water depth is given in meter.
The artificial data set represents fish abundances at 10 sites

along a tropical reef transect. The first three sites are on sand
and the others alternate between coral and other substrate.
The water depth is given in meter.
> fishes <- read.table("artificialFishes.txt",h=T); fishes

Site Sp1 Sp2 Sp3 Sp4 Sp5 Sp6 Depth Coral Sand Other
1 1 1 0 0 0 0 0 1 0 1 0
2 2 0 0 0 0 0 0 2 0 1 0
3 3 0 1 0 0 0 0 3 0 1 0
4 4 11 4 0 0 8 1 4 0 0 1
5 5 11 5 17 7 0 0 5 1 0 0
6 6 9 6 0 0 6 2 6 0 0 1
7 7 9 7 13 10 0 0 7 1 0 0
8 8 7 8 0 0 4 3 8 0 0 1
9 9 7 9 10 13 0 0 9 1 0 0
10 10 5 10 0 0 2 4 10 0 0 1
The abundancies of the six species are the response variables.

Depth, Coral, Sand and Other are explanatory variables.
We do not need to scale=TRUE as abundancies are on
comparable scales.

comparable scales.
As Coral, Sand and Other are linearly dependent, the
covariance matrix is singular. So we can only use two out of the
three variables.

comparable scales.
three variables.
We choose Depth, Sand and Coral as explanatory variables.

comparable scales.
three variables.
We choose Depth, Sand and Coral as explanatory variables.
library(vegan) # rda() is in this library

Resp <- fishes[,c("Sp1","Sp2","Sp3","Sp4","Sp5","Sp6")]
Expl <- fishes[,c("Depth","Coral","Sand","Other")]
myrda <- rda(Resp,Expl)
plot(myrda,scaling=2)
Correlation triplot
row2
1
row3
Sand
row1
2
1
row9
Sp4 Sp3
Coral
row7
row5
RDA2
0
Sp6
1
Sp2
row10 Sp5
Depth
Sp1
row8
2
row6
1
row4
2 1 0 1 2 3
Distance triplot
1
1
row2
row3
Sand
row1 Sp4 Sp3
row9
Coral row7 row5
0
0
row10
row8
Depth
1
row6
RDA2
1
row4 Sp6
2
Sp2
Sp5
3
Sp1
1 0 1 2 3 4
If we gather the three variables Coral, Sand and Other into

one factor variable Substrate, then R eliminates automatically
the last variable.
Substrate <- c(rep("Sand",3),

rep(c("Other","Coral"),3),"Other")
myrda <- rda(Resp~Depth+Substrate,data=Expl)
Correlation triplot:
row2
1
row3
row1
SubstrateSand
2
1
row9
Sp4 Sp3
row7
row5
RDA2
0
Sp6
1
Sp2
row10 Sp5
Depth
Sp1
row8
2
row6
1
SubstrateOther
row4
2 1 0 1 2 3
Redundancy analysis Triplots
Contents
Setting
Triplots
The graphical output of RDA consists of two biplots on top of

each other and is called triplot.
You produce a triplot with plot(rda.object) (which itself calls
plot.cca()).

plot.cca()).
There are three components in a triplot:

Continuous explanatory variables (numeric values) are
represented by lines. Nominal explanatory variables (factor
object) (coded 0 1) by squares (or triangles) (one for each
level). The square is plotted at the centroid of the
observations that have the value 1.

plot.cca()).

The response variables by labels or lines.

plot.cca()).

The response variables by labels or lines.
The observations by points or labels.
Distance triplot (scaling=1)

Distances between points (observations), between squares
or between points and squares approximate distances of
the observations (or the centroid of the nominal explanatory
variable).
Angles between lines of response variables and lines of
explanatory variables represent a two-dimensional
approximation of correlations.
Other angles between lines are meaningless.
Distance triplot (scaling=1)

Distances between points (observations), between squares
or between points and squares approximate distances of
the observations (or the centroid of the nominal explanatory
variable).
Angles between lines of response variables and lines of
explanatory variables represent a two-dimensional
approximation of correlations.
Other angles between lines are meaningless.
The projection of a point onto the line of a response variable
at right angle approximates the position of the
corresponding object along the corresponding variable.
Squares/triangles cannot be compared with lines of
qualitatively explanatory variables.
Correlation triplot (scaling=2)

The cosine of the angle between lines (of response variable
or of explanatory variable) is approximately equal to the
correlation between the corresponding variables.

Distances are meaningless.

The projection of a point onto a line (response variable or
explanatory variable) at right angle approximates the value
of the corresponding variable of this observation.

The projection of a point onto a line (response variable or
explanatory variable) at right angle approximates the value
of the corresponding variable of this observation.
The length of lines are not important.
Redundancy analysis Example: Height weight data
Contents
Setting
Triplots
Recall hsw and fm.col from the slides on PCA.
Expl <- hsw[,c(1,2,3)]

Resp <- hsw[,c(1,2,3)]
myrda <- rda(Resp~shoe,scale=TRUE,data=Expl)
# Distance triplot
# The following command does not plot (type=None)
plot(myrda,scaling=1,type="n",main="Distance triplot")
segments(x0=0,y0=0,
x1=scores(myrda, display="species", scaling=1)[,1],
y1=scores(myrda, display="species", scaling=1)[,2])
text(myrda, display="sp", scaling=1, col=2)
text(myrda, display="bp", scaling=1,
row.names(scores(myrda, display="bp")), col=3)
points(myrda,display=c("sites"),scaling=1,pch=1,col=fm.col
Distance triplot, scaling=1

shoe
shoe

0

1
PC1
height
3
1
4
weight
4 3 2 1 0 1
RDA1
# Correlation triplot
plot(myrda,scaling=2,type="n",main="Correlation triplot")
segments(x0=0,y0=0,
Correlation triplot, scaling=2
1.0

0.5

0.0

shoe shoe
0

PC1

0.5

1.0
height

1.5
weight
2.0
3 2 1 0
RDA1
Now without shoe as response variable:
Expl <- hsw[,c(1,2,3)]

Resp <- hsw[,c(1,3)]
myrda <- rda(Resp~shoe,scale=TRUE,data=Expl)
# Distance triplot
# The following command does not plot (type=None)
plot(myrda,scaling=1,type="n",main="Distance triplot")
segments(x0=0,y0=0,
Distance triplot, scaling=1
4
weight
1
height
2
PC1

shoe
0
3 2 1 0 1
RDA1
# Correlation triplot
plot(myrda,scaling=2,type="n",main="Correlation triplot")
segments(x0=0,y0=0,
Correlation triplot, scaling=2
1
2.0
weight
1.5

height
1.0

PC1

0.5

0.0

shoe
0

0.5

1.0
2.5 2.0 1.5 1.0 0.5 0.0 0.5
RDA1
Redundancy analysis Example: Species richness on sandy beaches (RIKZ data)
Contents
Setting
Triplots
Which factors influence the species richness on sandy

beaches?
Data from the dutch National Institute for Coastal and
Marine Management (RIKZ: Rijksinstituut voor Kust en Zee)
see also
Zuur, Ieno, Smith (2007) Analysing Ecological Data.
Springer
richness angle2 NAP grainsize humus week

1 11 96 0.045 222.5 0.05 1
2 10 96 -1.036 200.0 0.30 1
3 13 96 -1.336 194.5 0.10 1
4 11 96 0.616 221.0 0.15 1
. . . . . . .
. . . . . . .
21 3 21 1.117 251.5 0.00 4
22 22 21 -0.503 265.0 0.00 4
23 6 21 0.729 275.5 0.10 4
. . . . . . .
. . . . . . .
43 3 96 -0.002 223.0 0.00 3
44 0 96 2.255 186.0 0.05 3
45 2 96 0.865 189.5 0.00 3
Meaning of the Variables
index i index of sampling station

richness Number of species that were found in a plot.
angle1 angle of the station
angle2 slope of the beach a the plot
exposure index composed of wave action etc.
NAP altitude of the plot compared to the mean sea level.
grainsize average diameter of sand grains
humus fraction of organic material
week in which of 4 weeks was this plot probed.
(many more variables in original data set)
library(vegan)
RIKZ <- read.table("RIKZGroups.txt", header = TRUE)
Species <- RIKZ[,2:5]
#Data were square root transformed
Species.sq <- sqrt(Species)
I1 <- rowSums(Species) #Could be used to drop sites with

#of 0.
ExplVar <- RIKZ[, c("angle1","exposure","salinity",
"temperature","NAP","penetrability",
"grainsize","humus","chalk",
"sorting1")]
RIKZ_RDA<-rda(Species.sq, ExplVar, scale=T)
plot(RIKZ_RDA,scaling=2)
Correlation biplot
sit6
4
3 sit28
sit17
2
sit29
RDA2
1
1
sit4
Insecta
sorting1
chalksit19
sit18
sit35
Polychaeta
sit1
NAP
sit21
grainsize
salinity
sit9
temperature sit34 sit41
angle1
0
0
Mollusca sit39
humus
Crustacea
sit20 sit32 exposure
sit25
sit2
sit22sit37 sit27 sit26
sit10
sit3 sit23
sit42 sit11 sit31
sit43sit15
sit30
sit5
sit14
sit38 sit40
sit24
sit44
sit7 sit8 sit36sit33
sit16
sit12
penetrability
sit13 sit45
3 2 1 0 1 2
RDA1
A different triplot
# Correlation biplot, sclaing=2

plot(RIKZ_RDA, scaling=2,main="Correlation",type="n")
segments(x0=0,y0=0,
x1=scores(RIKZ_RDA, display="species", scaling=2)
y1=scores(RIKZ_RDA, display="species", scaling=2)
text(RIKZ_RDA, display="sp", scaling=2, col=2)
text(RIKZ_RDA, display="bp", scaling=2,
row.names(scores(RIKZ_RDA, display="bp")), col=3)
text(RIKZ_RDA, display=c("sites"), scaling=2,labels=rownam
cor(Species.sq,ExplVar)
Correlation
4
3 28
17
2
29
RDA2
1
1
4 Insecta sorting1
chalk19
35 18
Polychaeta 21NAP
grainsize
salinity 1
9 34 41
temperature
angle1
0
0
Mollusca
humus 39 exposure
Crustacea 20 32
25
2237 32 27 26
10
23
1143 31
5421438 1530
40
1244
24
7 8 13 453633
16
penetrability
3 2 1 0 1 2
RDA1
Redundancy analysis The order of importance
Contents
Setting
Triplots
Anova on RDA objects
Which of the explanatory variables is the most important?

Which are the least important or even irrelevant?
As RDA is based on linear regression, the same methods apply.
Due to time constraint, this is not part of the lecture.
Anova on RDA objects
Which of the explanatory variables is the most important?

Which are the least important or even irrelevant?
As RDA is based on linear regression, the same methods apply.
Due to time constraint, this is not part of the lecture.
Have a try for yourself:
anova(RIKZ_RDA)
step(RIKZ_RDA)
dropterm(RIKZ_RDA)

Rda

Enviado por

Dados do documento

Direitos autorais

Formatos disponíveis

Compartilhar este documento

Compartilhar ou incorporar documento

Opções de compartilhamento

Você considera este documento útil?

Este conteúdo é inapropriado?

Direitos autorais:

Formatos disponíveis

Rda

Enviado por

Direitos autorais:

Formatos disponíveis

Multivariate Statistics in Ecology and

Dirk Metzler & Martin Hutzenthaler

Summer semester 2011

Given: Data frames/matrices Y and X

Given: Data frames/matrices Y and X

Given: Data frames/matrices Y and X

Given: Data frames/matrices Y and X

Given: Data frames/matrices Y and X

Before applying RDA:

To illustrate the output of redundancy analysis (RDA)

(We will not go into the maths of RDA)

The artificial data set represents fish abundances at 10 sites

The artificial data set represents fish abundances at 10 sites

> fishes <- read.table("artificialFishes.txt",h=T); fishes

The abundancies of the six species are the response variables.

The abundancies of the six species are the response variables.

The abundancies of the six species are the response variables.

The abundancies of the six species are the response variables.

library(vegan) # rda() is in this library

If we gather the three variables Coral, Sand and Other into

Substrate <- c(rep("Sand",3),

The graphical output of RDA consists of two biplots on top of

The graphical output of RDA consists of two biplots on top of

There are three components in a triplot:

The graphical output of RDA consists of two biplots on top of

There are three components in a triplot:

The graphical output of RDA consists of two biplots on top of

There are three components in a triplot:

Distance triplot (scaling=1)

Distance triplot (scaling=1)

Correlation triplot (scaling=2)

Correlation triplot (scaling=2)

Correlation triplot (scaling=2)

Correlation triplot (scaling=2)

Recall hsw and fm.col from the slides on PCA.

Expl <- hsw[,c(1,2,3)]

Distance triplot, scaling=1

Correlation triplot, scaling=2

Now without shoe as response variable:

Expl <- hsw[,c(1,2,3)]

Distance triplot, scaling=1

Correlation triplot, scaling=2

2.5 2.0 1.5 1.0 0.5 0.0 0.5

Which factors influence the species richness on sandy

richness angle2 NAP grainsize humus week

Meaning of the Variables

index i index of sampling station

I1 <- rowSums(Species) #Could be used to drop sites with

# Correlation biplot, sclaing=2

Anova on RDA objects

Which of the explanatory variables is the most important?

Anova on RDA objects

Which of the explanatory variables is the most important?

Você também pode gostar