V8 Protein Phylogeny

Phylogenetic Prediction (of single genes)
Material of this lecture taken from

- chapter 6, DW Mount "Bioinformatics"
- A. Rokas et al., Nature 425, 798 (2003):
Genome-scale approaches to resolving incongruence in molecular phylogenies.

A phylogenetic analysis of a family of related nucleic acid or protein
sequences is a determination of how the family might have been derived
during evolution.
The sequences are placed as outer branches on a tree, which depicts the
evolutionary relationships among them.
8. Lecture WS 2003/04 Bioinformatics III 1

3 main approaches in single-gene phylogeny
- maximum parsimony
- distance
- maximum likelihood
Popular programs:
PHYLIP (phylogenetic inference package – J Felsenstein)
PAUP (phylogenetic analysis using parsimony – Sinauer Assoc.)

Concept of evolutionary trees
An evolutionary tree is a 2-dimensional graph showing evolutionary relationships
among organisms, or in the case of sequences, in certain genes from separate
organisms.
[Figure: a rooted tree and an unrooted tree for sequences A, B, C and D, with nodes and branches labeled]
Length of branches reflects number of sequence changes.
Often: assume uniform rate of mutations (molecular clock hypothesis).

Concept of evolutionary trees
Number of substitutions in each branch is generally assumed to vary
according to the Poisson distribution that gives the probability P_n around an
average number x:

\[ P_n = \frac{e^{-x} x^n}{n!} \]
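
As a small numerical illustration of the Poisson formula above, the following sketch evaluates P_n for one branch. The function name `poisson_pmf` and the average of x = 2 substitutions are illustrative assumptions, not values from the lecture.

```python
import math


def poisson_pmf(n: int, x: float) -> float:
    """Probability of n substitutions in a branch when the average is x."""
    return math.exp(-x) * x**n / math.factorial(n)


# Assumed example value: on average 2 substitutions in the branch.
x = 2.0
probabilities = [poisson_pmf(n, x) for n in range(6)]
for n, p in enumerate(probabilities):
    print(f"P_{n} = {p:.4f}")
```

The probabilities over all n sum to 1, and for x = 2 the values P_1 and P_2 coincide because 2^1/1! = 2^2/2!.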
The number of possible trees increases very rapidly with the number of sequences:

#sequences   #rooted trees   #unrooted trees
3            3               1
4            15              3
5            105             15
...
7            10395           945
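
The counts in this table can be reproduced with the standard counting formulas for bifurcating trees, (2n-3)!! rooted and (2n-5)!! unrooted trees for n sequences. These formulas are not stated on the slide; the sketch below assumes them, and the function names are illustrative.

```python
def odd_double_factorial(k: int) -> int:
    """Product k * (k-2) * (k-4) * ... * 1 for odd k; returns 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees for n >= 2 sequences: (2n-3)!!"""
    return odd_double_factorial(2 * n - 3)


def unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees for n >= 3 sequences: (2n-5)!!"""
    return odd_double_factorial(2 * n - 5)


for n in (3, 4, 5, 7):
    print(n, rooted_trees(n), unrooted_trees(n))
```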

Methods for Single-Gene Phylogeny
Choose set of related sequences
→ Obtain multiple sequence alignment
→ Is there strong sequence similarity?
   Yes → Maximum parsimony methods
   No → Is there clearly recognizable sequence similarity?
      Yes → Distance methods
      No → Maximum likelihood methods
→ Analyze how well data support prediction

Maximum Parsimony Method
Method predicts the evolutionary tree that minimizes the number of steps
required to generate the observed variation in the sequences.
Step 0 Input: multiple sequence alignment

Step 1 For each aligned position, identify phylogenetic trees that require the
smallest number of evolutionary changes to produce the observed
sequence changes.
Step 1.5 Continue analysis for every position in the sequence alignment.
Step 2 Sequence variations at each site in the alignment are placed at the tips
of the trees. Identify the tree (trees) that produce the smallest number
of changes overall for all sequence positions.
Because all possible trees are examined, the method is best suited for sequences
that are quite similar and for a small number of sequences.
Under these conditions it is guaranteed to find the most parsimonious tree.

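Example

Sequence#   Sequence position
            1 2 3 4 5 6 7 8 9
1           A A G A G T G C A
2           A G C C G T G C G
3           A G A T A T C C A
4           A G A G A T C C G

These are 4 sequences giving 3 possible unrooted trees. E.g. trees for position 5:
[Figure: the three unrooted trees for position 5 (Seq1 = G, Seq2 = G, Seq3 = A, Seq4 = A).
Left tree: Seq1 and Seq2 joined, Seq3 and Seq4 joined.
Middle tree: Seq1 and Seq3 joined, Seq2 and Seq4 joined.
Right tree: Seq1 and Seq4 joined, Seq2 and Seq3 joined.]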
Informative sites:
(1) A site must favor one tree over another (site 5 is informative, but
sites 1, 6, 8 are not).
(2) To be informative, a site must also have at least two different sequence
characters, each present in at least two sequences (only sites 5, 7, and 9 are
informative according to this rule).
Combining sites 5, 7, and 9, the left tree is the best tree for these 4 sequences.
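
The example can be checked with a short sketch. It scores each of the three unrooted trees with Fitch-style counting of the minimum number of changes per column; this counting scheme is one common way to score a fixed tree and is an assumption here, not necessarily what PAUP or PHYLIP do internally. Sequence indices 0 to 3 stand for Seq1 to Seq4, and all function names are illustrative.

```python
from collections import Counter

ALIGNMENT = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]

# The three unrooted trees for four sequences, written as two joined pairs.
QUARTET_TREES = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]


def fitch_join(left: set, right: set):
    """Combine two state sets; a change is counted when they do not overlap."""
    common = left & right
    if common:
        return common, 0
    return left | right, 1


def site_cost(column: str, tree) -> int:
    """Minimum number of changes that one column requires on one tree."""
    (a, b), (c, d) = tree
    left, cost_left = fitch_join({column[a]}, {column[b]})
    right, cost_right = fitch_join({column[c]}, {column[d]})
    _, cost_middle = fitch_join(left, right)
    return cost_left + cost_right + cost_middle


def is_informative(column: str) -> bool:
    """At least two different characters, each present at least twice."""
    return sum(1 for count in Counter(column).values() if count >= 2) >= 2


columns = ["".join(seq[i] for seq in ALIGNMENT) for i in range(len(ALIGNMENT[0]))]
informative_sites = [i + 1 for i, col in enumerate(columns) if is_informative(col)]
tree_costs = {tree: sum(site_cost(col, tree) for col in columns) for tree in QUARTET_TREES}
best_tree = min(tree_costs, key=tree_costs.get)

print("informative sites:", informative_sites)
print("changes per tree:", tree_costs)
print("best tree:", best_tree)
```

Non-informative sites add the same number of changes to every tree, so they do not affect which tree is selected.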

Where maximum parsimony fails
Parsimony can give misleading information when rates of sequence change vary
in the different branches of a tree that are represented by the sequence data.
[Figure: the real tree for Seq1 to Seq4 (Seq1 and Seq4 carry G, Seq2 and Seq3 carry A) next to the tree predicted by parsimony]
Real tree: 2 long branches in which G has turned to A independently, possibly with
some intermediate steps.
In parsimony analysis rates of change along all branches of the tree are
assumed equal. Therefore the tree predicted from parsimony will not be correct.

Distance methods
The distance method employs the number of changes between each pair in a
group of sequences to produce a phylogenetic tree of the group.
The sequence pairs that have the smallest number of sequence changes
between them are termed "neighbors". On a tree, these sequences share a
node or common ancestor position and are each joined to that node by a branch.
Goal of distance methods: identify the tree that correctly positions neighbors and that
also has branch lengths that reproduce the original data as closely as possible.
→ neighbor-joining algorithm, Fitch-Margoliash algorithm
Finding the closest neighbors among a group of sequences by the distance
method is often the first step in producing a multiple sequence alignment.
E.g. ClustalW uses the neighbor-joining distance method.

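Example
sequence A   ACGCGTTGGGCGATGGCAAC
sequence B   ACGCGTTGGGCGACGGTAAT
sequence C   ACGCATTGAATGATGATAAT
sequence D   ACACATTGAGTGATAATAAT

Distances between sequences:
nAB = 3, nAC = 7, nAD = 8, nBC = 6, nBD = 7, nCD = 3

Distance table:
     A   B   C   D
A    -   3   7   8
B    -   -   6   7
C    -   -   -   3
D    -   -   -   -

[Figure: unrooted tree in which A (branch length 2) and B (branch length 1) share one node, C (branch length 1) and D (branch length 2) share the other node, and the two nodes are joined by a branch of length 4]
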
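
The distance table can be recomputed by counting mismatched positions between each pair of sequences. The sketch below also checks that the branch lengths read from the tree figure (A = 2, B = 1, C = 1, D = 2, central branch = 4) reproduce every pairwise distance. Function and variable names are illustrative.

```python
from itertools import combinations

SEQUENCES = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}


def count_changes(s: str, t: str) -> int:
    """Number of aligned positions at which two sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t))


distances = {
    (p, q): count_changes(SEQUENCES[p], SEQUENCES[q])
    for p, q in combinations(sorted(SEQUENCES), 2)
}

# Branch lengths as read from the tree figure.
LEAF_BRANCH = {"A": 2, "B": 1, "C": 1, "D": 2}
CENTRAL_BRANCH = 4
NODE_OF = {"A": "left", "B": "left", "C": "right", "D": "right"}


def tree_distance(p: str, q: str) -> int:
    """Path length between two leaves on the four-sequence tree."""
    length = LEAF_BRANCH[p] + LEAF_BRANCH[q]
    if NODE_OF[p] != NODE_OF[q]:
        length += CENTRAL_BRANCH
    return length


for pair, d in distances.items():
    print(pair, d, tree_distance(*pair))
```

In this example the tree fits the data exactly; with real data the fit is usually only approximate, which is why distance methods look for the branch lengths that reproduce the table as closely as possible.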
Maximum likelihood approach
Method uses probability calculations to find a tree that best accounts for the
variation in a set of sequences.
Similar to maximum parsimony method in that analysis is performed on each
column of a multiple sequence alignment. All trees are considered.
Because the rate of appearance of new mutations is very small, the more
mutations are needed to fit a tree to the data, the less likely that tree.
Start with an evolutionary model of sequence change that provides estimates of
rates of substitution of one base for another (transitions and transversions).

Base   A                  C                  G                  T
A      -u(aπC+bπG+cπT)    uaπC               ubπG               ucπT
C      ugπA               -u(gπA+dπG+eπT)    udπG               ueπT
G      uhπA               ujπC               -u(hπA+jπC+fπT)    ufπT
T      uiπA               ukπC               ulπG               -u(iπA+kπC+lπG)
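
A minimal sketch of how such a rate matrix can be assembled. The slide does not define the symbols, so the sketch assumes that u is an overall rate, πX is the frequency of the target base X, and a to l are relative rate parameters. Each off-diagonal entry is u × (relative rate) × (frequency of the target base), and each diagonal entry is minus the sum of the other entries in its row. All numerical values are placeholders.

```python
BASES = "ACGT"


def rate_matrix(u: float, pi: dict, relative_rate: dict) -> dict:
    """Build q[x][y], the rate of substituting base x by base y.

    relative_rate[(x, y)] plays the role of the letters a..l in the table.
    """
    q = {x: {} for x in BASES}
    for x in BASES:
        off_diagonal_sum = 0.0
        for y in BASES:
            if x != y:
                q[x][y] = u * relative_rate[(x, y)] * pi[y]
                off_diagonal_sum += q[x][y]
        q[x][x] = -off_diagonal_sum  # each row sums to zero
    return q


# Placeholder values: equal base frequencies and equal relative rates.
pi = {base: 0.25 for base in BASES}
relative_rate = {(x, y): 1.0 for x in BASES for y in BASES if x != y}
Q = rate_matrix(u=1.0, pi=pi, relative_rate=relative_rate)

for x in BASES:
    print(x, [round(Q[x][y], 2) for y in BASES])
```

Models that distinguish transitions from transversions correspond to giving the two kinds of substitution different relative rates in `relative_rate`.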

Maximum likelihood approach
Step1 Align set of sequences
Step2 Examine substitutions in each column for their fit to a set of trees that
describe possible phylogenetic relationships among the sequences.
Each tree has a certain likelihood based on the series of mutations that are
required to give the sequence data.
The probability of each tree is the product of the mutation rates in each branch of
the tree, which itself is the product of the rate of substitution in each branch times
the branch length.
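\[ P_{\mathrm{tree}_i} = \prod_{\mathrm{branch}\,1}^{\mathrm{branch}\,n(i)} (\text{mutation rate})_i \]
\[ \phantom{P_{\mathrm{tree}_i}} = \prod_{\mathrm{branch}\,1}^{\mathrm{branch}\,n(i)} (\text{rate of substitution in branch}(i) \times \text{length of branch}(i)) \]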
Advantage of maximum likelihood approach:
allows evaluation of trees with variations in mutation rates in different lineages.
Can be used for more diverse sequences.
Disadvantage: computationally intense.

Resolve Incongruences in Phylogeny
Conflicts in larger sets of molecular data can have many possible causes, which
makes it difficult to decide how to handle them.
E.g. two genes with different evolutionary history (e.g. owing to hybridization or
horizontal transfer) will necessarily give incongruent pictures while still depicting
true histories.
Here: compare genome sequence data for 7 Saccharomyces yeast species:
S. cerevisiae
S. paradoxus
S. mikatae
S. kudriavzevii
S. bayanus
S. castellii
S. kluyveri
plus one outgroup fungus Candida albicans.
Rokas et al. Nature 425, 798 (2003)

Identify orthologous genes to serve as phylogenetic markers:
106 genes which are distributed throughout the S. cerevisiae genome on all 16
chromosomes and comprise a total length of 127026 nt = 42342 amino acids
corresponding to roughly 1% of the genomic sequence and 2% of the predicted
genes.
Criteria to select genes spaced ca. every 40 kb:

(1) genes have homologous sequence in each of the 8 species
(2) genes have at least two homologous flanking syntenic genes
(3) genes can be aligned over most of the protein.
3 types of analysis:
- maximum likelihood (ML) analysis of nucleotide data
- maximum parsimony (MP) analysis of nucleotide data
- MP of the amino acid data

Align individual genes with ClustalW. Edit manually to exclude indels and areas of
uncertain alignment → left with 76% of the sequence of each gene on average.
Tree construction with PAUP by the branch-and-bound algorithm, which is guaranteed to
find the optimal tree. Estimate tree reliability using non-parametric bootstrap
re-sampling.
Analysis of the 106 genes gave more than 20 alternative ML or MP trees.

Generate 50% majority-rule consensus trees by bootstrapping.
Next slide shows several strongly supported trees.

Bootstrap analysis.
A method for testing how well a particular data set fits a model.
E.g. the validity of the branch arrangement in a predicted phylogenetic tree can
be tested by resampling columns in a multiple sequence alignment to create
many new alignments.
The appearance of a particular branch in trees generated from these resampled
sequences can then be measured.
Alternatively, a sequence may be left out of an analysis to determine how
much the sequence influences the results of an analysis.
Here: swap individual nucleotide sites or positions of genes (bootstrap replicates).
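
The column resampling described above can be sketched as follows. Columns are drawn with replacement until the replicate has the same length as the original alignment, a tree is built from each replicate, and the fraction of replicates that yield each result is recorded. The tree-building step is passed in as a function; the toy `closest_to_first` used here is a hypothetical stand-in for a real tree-building program, and the alignment is the four-sequence parsimony example from this lecture.

```python
import random
from collections import Counter


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the length."""
    n_columns = len(alignment[0])
    picks = [rng.randrange(n_columns) for _ in range(n_columns)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


def bootstrap_support(alignment, infer_tree, n_replicates=100, seed=0):
    """Fraction of replicates in which each inferred result appears."""
    rng = random.Random(seed)
    counts = Counter(
        infer_tree(bootstrap_replicate(alignment, rng)) for _ in range(n_replicates)
    )
    return {result: count / n_replicates for result, count in counts.items()}


def closest_to_first(alignment):
    """Toy stand-in for tree building: index of the nearest neighbor of sequence 0."""
    def changes(s, t):
        return sum(a != b for a, b in zip(s, t))
    return min(range(1, len(alignment)), key=lambda j: changes(alignment[0], alignment[j]))


ALIGNMENT = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]
support = bootstrap_support(ALIGNMENT, closest_to_first, n_replicates=200, seed=1)
print(support)
```

A support value near 1 means that nearly every resampled alignment gives the same answer; values spread over several alternatives indicate that the data do not decide between them.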

Alternative Tree topologies
Single-gene data sets generate multiple, robustly supported alternative topologies.

Representative alternative trees recovered from analyses of nucleotide data of 106
selected single genes and six commonly used genes are shown. The trees are the
50% majority-rule consensus trees from the genes YBL091C (a), YDL031W (b),
YER005W (c), YGL001C (d), YNL155W (e) and YOL097C (f).
These 6 genes were selected without consideration of their function. Might
commonly used, well-known genes with important functions provide better resolution?

Alternative Tree topologies
Results from the commonly used genes actin (g), hsp70 (h), β-tubulin (i), RNA
polymerase II (j), elongation factor 1-α (k) and 18S rDNA (l). Numbers above
branches indicate bootstrap values (ML on nucleotides/MP on nucleotides).
→ Same problem of alternative topologies as before.

Explanations?
The alternative phylogenies could have resulted from a number of different
scenarios:
(1) most genes could have weakly supported most phylogenies and strongly
supported only a few alternative trees,
(2) most genes could have strongly supported one phylogeny and a few genes
strongly supported only a small number of alternatives,
(3) there could have been some combinations of these scenarios so that each
branch among alternative phylogenies had either weak or strong support
depending on the gene.
To distinguish between these possibilities, identify all branches recovered during
single-gene analyses, record each bootstrap value with respect to the gene and
method of analysis.
→ 8 branches were shared by all three analyses with multiple instances of
bootstrap values > 50%.

Common Branches
The distribution of bootstrap values for the eight prevalent branches recovered
from 106 single-gene analyses highlights the pervasive conflict among single-
gene analyses. a, Majority-rule consensus tree of the 106 ML trees derived from
single-gene analyses. Across all analyses, there were eight commonly observed
branches; the five branches in the consensus tree (numbers 1–5; a) and the three
branches (numbers 6–8) shown in b.

Bootstrap Values of Common Branches
Only branches 1 and 4 are supported by a majority of genes.
c, For each of the eight branches, the ranked distribution of per cent bootstrap values recovered from
the three analyses of 106 genes is shown. Results from ML (blue) and MP (red) analyses of
nucleotide data sets, and MP analyses of amino acid data sets (black), are shown. For each branch,
the mean bootstrap value and 95% confidence intervals from the ML analyses and the percentage of
ML trees supporting this branch (in parentheses) are indicated below each graph. Although the
ranked distributions of bootstrap values from the three analyses are remarkably similar for most
branches, on a gene-by-gene basis there is no tight correspondence between bootstrap values from
ML and MP analyses.
Rokas et al. Nature 425, 798 (2003)

How different are the trees?
The degree of conflict among the trees could be relatively minor.
Determine how many taxa (genes) would need to be removed to make two
trees congruent (i.e. coincide exactly).

Reversal distance problem
Extensive incongruence between trees derived from
the 106 individual-gene data sets. Pairwise
comparisons between 50% majority-rule consensus
trees from 106 single-gene ML analyses of
nucleotide data (black bars), MP analyses of
nucleotide data (white bars), and MP analyses of
amino acid data (grey bars) were categorized on the
basis of the minimum number of taxa that need to be
removed for two trees to reach congruence (x axis).
For each of the analyses, the majority of pairwise comparisons require the
removal of two or more taxa before congruence is attained.

What leads to incongruence?
Many factors were checked that could lead to incongruence between single-gene
phylogenies:
- outgroup choice (repeat all analyses without C. albicans)
- number of variable sites
- number of parsimony-informative sites
- gene size
  (these three: significantly correlated with bootstrap values for some branches)
- rate of evolution
- nucleotide composition
- base compositional bias
- genome location
- gene ontology
→ No parameter can systematically account for or predict the performance of single
genes!

Can incongruence be overcome?
Although we do not know the cause(s) of incongruence between single-gene
phylogenies, the critical question is how this incongruence between single trees
might be overcome to arrive at the actual species tree.
Can the single-gene data sets be concatenated into one large data set?

Concatenation of single genes gives a single tree!
Phylogenetic analyses of the
concatenated data set composed
of 106 genes yield maximum
support for a single tree,
irrespective of method and type of
character evaluated. Numbers
above branches indicate bootstrap
values (ML on nucleotides/MP on
nucleotides/MP on amino acids).
All alternative topologies were rejected.

This level of support for a single tree with 5 internal branches is unprecedented.
This tree can now be referred to as the species tree.
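
Concatenation itself is a simple operation on the aligned genes: for every species, the aligned sequences of all genes are joined end to end, and the tree is then built from the combined alignment. A minimal sketch follows; the species labels and the short sequences are made-up toy data, not the yeast data set.

```python
def concatenate(gene_alignments):
    """Join per-gene alignments (dicts: species -> aligned sequence) end to end."""
    species = list(gene_alignments[0])
    for alignment in gene_alignments:
        if set(alignment) != set(species):
            raise ValueError("every gene must be present in every species")
        if len({len(seq) for seq in alignment.values()}) != 1:
            raise ValueError("sequences within one gene must have equal length")
    return {sp: "".join(alignment[sp] for alignment in gene_alignments) for sp in species}


# Toy data (hypothetical): two short aligned genes in three species.
gene_1 = {"sp1": "ACGT", "sp2": "ACGA", "sp3": "ACTA"}
gene_2 = {"sp1": "GGC", "sp2": "GGT", "sp3": "GAT"}

combined = concatenate([gene_1, gene_2])
print(combined)
```

The check that every gene is present in every species mirrors selection criterion (1) for the 106 genes.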

How much data is required?
The concatenated data recovered a tree with maximum support on all branches,
despite divergent levels of support for each branch among single-gene analyses.
→ At what size did the data set arrive at the species tree?

Convergence on single tree
[Figure: bootstrap values for branch 3 (a) and branch 5 (b) as a function of data set size]
A minimum of 20 genes is required to recover >95% bootstrap values for each
branch of the species tree. a, b, The bootstrap values for branches 3 (a) and 5 (b)
were constructed from the concatenation of randomly re-sampled orthologous
nucleotides (left) or random subsets of genes (right).
The species tree is recovered with robust support (>95% bootstrap values in all
branches at 95% confidence interval) by analyses of a minimum of 20
concatenated genes. All analyses were performed using MP.

Independent evolution?
It has been suggested that nucleotides within a given gene do not evolve
independently.
Re-sample a subset of orthologous nucleotides from the total data set.
Only 3000 randomly chosen nucleotide positions (corresponding to less than three
concatenated genes) are sufficient to generate a single tree with > 95% confidence.
This indicates that nucleotides in genes have not evolved independently (because
when using complete genes, at least 20 genes are necessary to generate a single
tree).
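
The two ways of building smaller data sets can be contrasted in a short sketch: drawing individual positions from the concatenated alignment versus drawing whole genes. Sampling without replacement is an assumption made here for simplicity, and the three short genes are made-up toy data.

```python
import random


def sample_sites(concatenated, n_sites, rng):
    """Pick n_sites random positions from the concatenated alignment."""
    length = len(next(iter(concatenated.values())))
    picks = rng.sample(range(length), n_sites)
    return {sp: "".join(seq[i] for i in picks) for sp, seq in concatenated.items()}


def sample_genes(gene_alignments, n_genes, rng):
    """Pick n_genes whole genes at random and concatenate them."""
    chosen = rng.sample(gene_alignments, n_genes)
    return {sp: "".join(gene[sp] for gene in chosen) for sp in chosen[0]}


# Toy data (hypothetical): three aligned genes of four sites in two species.
genes = [
    {"sp1": "ACGT", "sp2": "ACGA"},
    {"sp1": "GGCA", "sp2": "GGTA"},
    {"sp1": "TTAC", "sp2": "TCAC"},
]
concatenated = {sp: "".join(gene[sp] for gene in genes) for sp in genes[0]}

rng = random.Random(0)
by_sites = sample_sites(concatenated, 4, rng)  # 4 positions spread over all genes
by_genes = sample_genes(genes, 1, rng)         # 4 positions from a single gene
print(by_sites)
print(by_genes)
```

Both samples have the same size, but the positions in `by_sites` come from many genes, whereas those in `by_genes` share the history and constraints of one gene.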

Implications for resolution of phylogenies
Unreliability of single-gene data sets stems from the fact that each gene is shaped
by a unique set of functional constraints through evolution.
Phylogenetic algorithms are sensitive to such constraints.
Such problems can be avoided with genome-wide sampling of independently
evolving genes.
In other cases the amount of sequence information needed to resolve specific
relationships will be dependent on the particular phylogenetic history under
examination.
Branches depicting speciation events separated by long time intervals may be
resolved with a smaller amount of data, and those depicting speciation events
separated by shorter intervals may be much harder to resolve.

Summary
Robust strategies exist for phylogenies built on single-gene comparisons
(maximum parsimony, distance, maximum likelihood).
Problem: phylogenies derived from individual genes are incongruent.
This can be resolved by integrative analysis of multiple (here at least 20) genes.
It is desirable to combine results from phylogenies constructed from local
sequence information with trees constructed from genome rearrangement.
The power of genome rearrangement studies is the construction of ancestral
genomes. Then one can derive the speed of evolution at different times, dissect
mutation biases at different times from the influence of genomic context ...
and possibly derive the driving forces of biological evolution.
This lecture rounds up the first block of the Bioinformatics III course on
genome structure, rearrangements etc.
Next block until Christmas: gene finding, SNPs, functional genomics
