ASBP Training - Alignment and Phylogeny

III.
Evolutionary Change in DNA Sequences
Kinds of questions
Identification of INDIVIDUALS: does the fish in the freezer match the carcass in the field?
Detecting RELATEDNESS: can kin selection (i.e. a high level of relatedness) explain cooperative courtship behavior?
Assigning INDIVIDUALS to POPULATIONS: do fish populations across the Bohol Sea show sufficient differentiation to allow us to assign unknown samples to a source population with a high level of confidence?
Defining structure of POPULATIONS: what forces could explain the genetic differentiation among populations of rabbit fish in the western Philippines?
Identifying SPECIES boundaries: are these two forms of rock fish a single species or two distinct species?
PHYLOGENETIC TREES: where do whales (Cetaceans) fit in a phylogenetic tree of mammalian groups? What is the grand arrangement of the tree of life in terms of kingdoms and phyla?
WHY USE MOLECULAR MARKERS?

Only genetically transmitted traits are informative to phylogeny estimation
Molecular markers open the whole biological world to genetic scrutiny

Genetic markers access an almost unlimited pool of genetic variability
Molecular data distinguish homology (common ancestry) from analogy (convergence from different ancestors)
Molecular data provide a common yardstick for measuring divergence
Molecular data facilitate mechanistic appraisals of evolution
PHYLOGENY and SYSTEMATICS
How are taxa arranged in the tree of life?

MORPHOLOGY / MOLECULAR APPROACHES
[Figure: tree of Gibbon, Orangutan, Human, Chimpanzee and Gorilla; values shown on the figure: 0.092, 0.060, 0.019, 0.0075]
Terms
Nodes: terminal (observed taxa) and internal (hypothetical ancestors)
Dichotomous or polytomous branching (a polytomy reflects uncertainty of relationships or multiple simultaneous branching)
Rooted vs unrooted trees
Clades and ingroups: monophyletic vs paraphyletic
Ingroup and outgroup
Nucleotide Difference Between Sequences

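A simple measure of the extent of sequence divergence is p, the proportion of nucleotide sites at which the two sequences are different. This is estimated by:

$$\hat{p} = n_d / n$$

where $n_d$ is the number of nucleotide sites that differ between the two sequences and $n$ is the total number of sites examined. This is called the p distance. Although the overall nucleotide difference is useful, it is often informative to look at the frequencies of the different types of nucleotide pairs.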
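The p distance can be computed directly from two aligned sequences. The following is a minimal sketch, assuming the sequences are already aligned, have equal length and contain no gaps; the function name `p_distance` and the example sequences are illustrative and not part of the original slides.

```python
def p_distance(seq_x, seq_y):
    """Proportion of sites at which two aligned sequences differ (n_d / n)."""
    if len(seq_x) != len(seq_y) or len(seq_x) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(seq_x)                                         # sites examined
    n_d = sum(1 for a, b in zip(seq_x, seq_y) if a != b)   # differing sites
    return n_d / n


# Two made-up 10-site sequences that differ at two sites.
print(p_distance("ATGCGTCGTT", "ATGCGACGAT"))
```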
Different types of nucleotide pairs between X and Y

Class                     Nucleotide pair (frequency)                  Total
Identical nucleotides     AA (O1)   TT (O2)   CC (O3)   GG (O4)        O
Transition-type pair      AG (P11)  GA (P12)  TC (P21)  CT (P22)       P
Transversion-type pair    AT (Q11)  TA (Q12)  AC (Q21)  CA (Q22)
                          TG (Q31)  GT (Q32)  CG (Q41)  GC (Q42)       Q
Transition/ Transversion ratio

$$\hat{R} = \hat{P} / \hat{Q}$$

R is usually 0.5 to 2.0 in many nuclear genes. In mtDNA it can be as high as 15. R is subject to a large sampling error when the number of nucleotides examined (n) is small:

$$V(\hat{R}) = \hat{R}^2 \left( \frac{1}{n\hat{P}} + \frac{1}{n\hat{Q}} \right)$$

Assumption
$P_{11} = P_{12}$; $P_{21} = P_{22}$; $Q_{11} = Q_{12}$; $Q_{21} = Q_{22}$; $Q_{31} = Q_{32}$; $Q_{41} = Q_{42}$
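As an illustration of the ratio and its sampling variance, the sketch below counts transition-type and transversion-type pairs in two aligned sequences and then applies the formulas for R and V(R). It assumes gap-free sequences of equal length containing only A, C, G and T; the function name `ts_tv_ratio` and the example sequences are made up for this example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def ts_tv_ratio(seq_x, seq_y):
    """Return (P, Q, R, V(R)) for two aligned, gap-free sequences."""
    n = len(seq_x)
    transitions = transversions = 0
    for a, b in zip(seq_x, seq_y):
        if a == b:
            continue
        pair = {a, b}
        # A<->G and C<->T are transitions; every other mismatch is a transversion.
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if transitions == 0 or transversions == 0:
        raise ValueError("R and V(R) need at least one transition and one transversion")
    P = transitions / n
    Q = transversions / n
    R = P / Q
    var_R = R ** 2 * (1 / (n * P) + 1 / (n * Q))
    return P, Q, R, var_R


# Made-up example: 3 transitions and 2 transversions in 10 sites.
P, Q, R, var_R = ts_tv_ratio("AAGGCCTTAA", "GAGACTTAAC")
print(P, Q, R, var_R)
```

With only 10 sites the variance is larger than R itself, which is the large sampling error the slide warns about.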
Estimation of the number of substitutions

When p is large, it gives an underestimate of the number of substitutions. Why? It does not account for backward and parallel substitutions. A number of mathematical models have been developed to address this. We will discuss:
Jukes and Cantor's method
Kimura's two-parameter method
Tajima and Nei's method
Tamura's method
Tamura and Nei's method
Jukes-Cantor model
Assumes that nucleotide substitution occurs at any nucleotide site with equal frequency, and that at each site a nucleotide changes to one of the three remaining nucleotides with a probability of $\alpha$ per year.
Probability of change in a nucleotide = rate of substitution:

$$r = 3\alpha$$

Substitution rate matrix (rows: original nucleotide; columns: mutant nucleotide):
      A    T    C    G
A     -    α    α    α
T     α    -    α    α
C     α    α    -    α
G     α    α    α    -
Consider X and Y
Let $q_t$ = proportion of identical nucleotides between X and Y at time t
Let $p_t = 1 - q_t$ = proportion of different nucleotides
Probability that a site with identical nucleotides in X and Y at time t remains identical at time t+1: $(1-r)^2$, or approximately $1 - 2r$
Probability that a site with different nucleotides in X and Y at time t becomes identical at time t+1: $2 \cdot (r/3)(1-r) = 2r(1-r)/3$, or approximately $2r/3$
Deriving a value for d

$$q_{t+1} = (1 - 2r)\, q_t + \frac{2r}{3} (1 - q_t)$$

$$q_{t+1} - q_t = \frac{2r}{3} - \frac{8r}{3}\, q_t$$

Using a continuous-time model, with dq/dt representing $q_{t+1} - q_t$:

$$\frac{dq}{dt} = \frac{2r}{3} - \frac{8r}{3}\, q$$

The solution of this equation with the initial condition q = 1 at t = 0 is:

$$q = 1 - \frac{3}{4} \left( 1 - e^{-8rt/3} \right)$$

Under our present model, the expected number of nucleotide substitutions per site (d) for the two sequences is 2rt. Therefore, d is given by:

$$d = -\frac{3}{4} \ln\left( 1 - \frac{4}{3}\, p \right)$$

where p = 1 - q is the proportion of different nucleotides between X and Y. An estimate $\hat{d}$ can be obtained by using $\hat{p}$ in place of p. The large-sample variance of $\hat{d}$ is:

$$V(\hat{d}) = \frac{9p(1-p)}{(3-4p)^2\, n}$$
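The Jukes-Cantor correction and its large-sample variance can be sketched in a few lines. This is a minimal illustration of the two formulas, assuming p has already been estimated from n aligned sites; the function name `jukes_cantor` is not from the slides.

```python
import math


def jukes_cantor(p, n):
    """Jukes-Cantor distance d and its large-sample variance V(d).

    p: observed proportion of different sites; n: number of sites compared.
    """
    if not 0 <= p < 0.75:
        # The logarithm is undefined once p reaches 3/4 (saturation).
        raise ValueError("p must satisfy 0 <= p < 0.75")
    d = -0.75 * math.log(1 - 4 * p / 3)
    var_d = 9 * p * (1 - p) / ((3 - 4 * p) ** 2 * n)
    return d, var_d


d, var_d = jukes_cantor(p=0.3, n=100)
print(d, var_d)
```

For p = 0.3 the corrected distance is larger than p, reflecting the backward and parallel substitutions that the raw proportion misses.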
Kimura Two Parameter model

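Considers the higher rate of transitional versus transversional nucleotide substitution: the rate of transitional substitution per site per year is $\alpha$ and the rate of transversional substitution is $2\beta$.
Total substitution rate per year: $r = \alpha + 2\beta$
Substitution rate matrix (rows: original nucleotide; columns: mutant nucleotide):
      A    T    C    G
A     -    β    β    α
T     β    -    α    β
C     β    α    -    β
G     α    β    β    -
Deriving d

$$P = \frac{1}{4} \left( 1 - 2e^{-4(\alpha+\beta)t} + e^{-8\beta t} \right)$$

$$Q = \frac{1}{2} \left( 1 - e^{-8\beta t} \right)$$

where t is the time since the two sequences diverged, and P and Q are the proportions of transitional and transversional differences:

$$d = 2rt = 2\alpha t + 4\beta t = -\frac{1}{2} \ln(1 - 2P - Q) - \frac{1}{4} \ln(1 - 2Q)$$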
Variance of d (Kimura's model)

Variance of d is:

$$V(\hat{d}) = \frac{1}{n} \left[ c_1^2 P + c_3^2 Q - (c_1 P + c_3 Q)^2 \right]$$

where

$$c_1 = \frac{1}{1 - 2P - Q}, \quad c_2 = \frac{1}{1 - 2Q}, \quad c_3 = \frac{c_1 + c_2}{2}$$
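A short sketch of the Kimura two-parameter distance and its variance, written directly from the formulas above. It assumes P and Q (the proportions of transitional and transversional differences) have already been counted over n sites; the function name `kimura_2p` is illustrative.

```python
import math


def kimura_2p(P, Q, n):
    """Kimura two-parameter distance d and its variance V(d)."""
    a = 1 - 2 * P - Q
    b = 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError("P and Q are too large for the logarithms to be defined")
    d = -0.5 * math.log(a) - 0.25 * math.log(b)
    c1 = 1 / a
    c2 = 1 / b
    c3 = (c1 + c2) / 2
    var_d = (c1 ** 2 * P + c3 ** 2 * Q - (c1 * P + c3 * Q) ** 2) / n
    return d, var_d


print(kimura_2p(P=0.10, Q=0.05, n=200))

# When transversions are exactly twice as frequent as transitions
# (P = p/3, Q = 2p/3), the result reduces to the Jukes-Cantor distance.
print(kimura_2p(P=0.1, Q=0.2, n=100)[0])
```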
Notes:
In both the Kimura and Jukes Cantor models, the expected frequencies of A,C,T and G will eventually become equal to 0.25.
Both models make no assumption about the initial frequencies. This property makes the two models applicable to a wider range of conditions than many other models. There is no need to assume the stationarity of nucleotide frequencies for estimating d.
Tajima-Nei (Equal-input) model
Substitution rate matrix (rows: original nucleotide; columns: mutant nucleotide):
      A      T      C      G
A     -      αgT    αgC    αgG
T     αgA    -      αgC    αgG
C     αgA    αgT    -      αgG
G     αgA    αgT    αgC    -

A similar model was proposed independently by Felsenstein (1981) and Tajima and Nei (1982). In this model it is necessary to assume stationarity of nucleotide frequencies for estimating the number of nucleotide substitutions:

$$d = -b \ln\left( 1 - \frac{p}{b} \right)$$

where

$$b = \frac{1}{2} \left[ 1 - \sum_{i=1}^{4} g_i^2 + \frac{p^2}{c} \right]$$

and c is given by:

$$c = \sum_{i=1}^{3} \sum_{j=i+1}^{4} \frac{x_{ij}^2}{2 g_i g_j}$$

Here $x_{ij}$ (i < j) is the relative frequency of nucleotide pair i and j when the two DNA sequences are compared. The nucleotide frequencies $g_i$ are estimated from the two sequences compared.
The variance of d is:

$$V(\hat{d}) = \frac{b^2\, p(1-p)}{(b-p)^2\, n}$$
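The equal-input distance needs the nucleotide frequencies and the pair frequencies in addition to p, so a worked sketch may help. The code below assumes two aligned, gap-free sequences over A, C, G, T, pools both sequences to estimate the frequencies g_i, and uses the standard form of b with its leading factor of 1/2. The function name `tajima_nei` and the example sequences are illustrative only.

```python
import math
from itertools import combinations


def tajima_nei(seq_x, seq_y):
    """Tajima-Nei (equal-input) distance d and its variance V(d)."""
    n = len(seq_x)
    bases = "ACGT"
    # g_i: nucleotide frequencies pooled over the two sequences.
    g = {b: (seq_x.count(b) + seq_y.count(b)) / (2 * n) for b in bases}
    # x_ij: relative frequency of the unordered mismatch pair {i, j}.
    x = {
        (i, j): sum(1 for a, b in zip(seq_x, seq_y) if {a, b} == {i, j}) / n
        for i, j in combinations(bases, 2)
    }
    p = sum(x.values())
    if p == 0:
        return 0.0, 0.0
    c = sum(v ** 2 / (2 * g[i] * g[j]) for (i, j), v in x.items() if v > 0)
    b = 0.5 * (1 - sum(f ** 2 for f in g.values()) + p ** 2 / c)
    if p >= b:
        raise ValueError("p is too large for the logarithm to be defined")
    d = -b * math.log(1 - p / b)
    var_d = b ** 2 * p * (1 - p) / ((b - p) ** 2 * n)
    return d, var_d


# Made-up 20-site example with two A/G and two C/T differences.
print(tajima_nei("AAAAACCCCCGGGGGTTTTT", "AAAAGCCCCTGGGGATTTTC"))
```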
Tamura model
In Kimura's model, the frequencies of the four nucleotides eventually become 0.25. In real data, however, nucleotide frequencies are rarely equal and the GC content is often quite different from 0.5 (for Drosophila, for example, about 0.1). Tamura's (1992) model was developed as an extension of Kimura's model to the case of low or high GC content.

$$d = -h \ln\left( 1 - \frac{P}{h} - Q \right) - \frac{1}{2} (1 - h) \ln(1 - 2Q)$$

where $h = 2\theta(1 - \theta)$ and $\theta$ is the GC content.
Substitution rate matrix (rows: original nucleotide; columns: mutant nucleotide):
      A      T      C      G
A     -      βθ2    βθ1    αθ1
T     βθ2    -      αθ1    βθ1
C     βθ2    αθ2    -      βθ1
G     αθ2    βθ2    βθ1    -
θ1 = gG + gC
θ2 = gA + gT
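The effect of GC content on the estimate can be seen by evaluating Tamura's distance for the same P and Q at two different GC contents. This is a minimal sketch of the distance formula only (no variance), with the function name `tamura_1992` chosen for illustration.

```python
import math


def tamura_1992(P, Q, gc):
    """Tamura (1992) distance for transition proportion P, transversion
    proportion Q and GC content gc (a fraction between 0 and 1)."""
    h = 2 * gc * (1 - gc)
    a = 1 - P / h - Q
    b = 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError("P and Q are too large for the logarithms to be defined")
    return -h * math.log(a) - 0.5 * (1 - h) * math.log(b)


# With GC content 0.5, h = 0.5 and the formula reduces to Kimura's distance.
print(tamura_1992(P=0.10, Q=0.05, gc=0.5))
# With a strongly biased GC content the same P and Q imply a larger d.
print(tamura_1992(P=0.10, Q=0.05, gc=0.1))
```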
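Tamura-Nei model
Hasegawa et al. (1985) used, in their maximum likelihood method, a model that is a hybrid of Kimura's model and the equal-input model and that considers both the transition/transversion and GC content biases mentioned earlier. The formula for d under that model is quite complicated but similar to the one for Tamura and Nei's model, of which it is a special case. For the Tamura-Nei model:

$$d = -\frac{2 g_A g_G}{g_R} \ln\left[ 1 - \frac{g_R}{2 g_A g_G} P_1 - \frac{1}{2 g_R} Q \right] - \frac{2 g_T g_C}{g_Y} \ln\left[ 1 - \frac{g_Y}{2 g_T g_C} P_2 - \frac{1}{2 g_Y} Q \right] - 2 \left[ g_R g_Y - \frac{g_A g_G g_Y}{g_R} - \frac{g_T g_C g_R}{g_Y} \right] \ln\left[ 1 - \frac{1}{2 g_R g_Y} Q \right]$$

where $g_R = g_A + g_G$ and $g_Y = g_T + g_C$ are the purine and pyrimidine frequencies, $P_1$ and $P_2$ are the proportions of transitional differences between A and G and between T and C, and Q is the proportion of transversional differences.
Substitution rate matrix (rows: original nucleotide; columns: mutant nucleotide):
      A       T       C       G
A     -       βgT     βgC     α1gG
T     βgA     -       α2gC    βgG
C     βgA     α2gT    -       βgG
G     α1gA    βgT     βgC     -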
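Because the Tamura-Nei formula has three logarithmic terms, it is easy to misplace a coefficient, so a small sketch may be useful. The code below implements the standard Tamura-Nei distance, assuming the purine transition proportion P1 (A/G), the pyrimidine transition proportion P2 (T/C), the transversion proportion Q and the four nucleotide frequencies are already known. The function name `tamura_nei` and the dictionary layout of the frequencies are choices made for this example.

```python
import math


def tamura_nei(P1, P2, Q, g):
    """Tamura-Nei distance.

    P1: proportion of A/G transitional differences
    P2: proportion of T/C transitional differences
    Q:  proportion of transversional differences
    g:  dict of nucleotide frequencies with keys "A", "T", "C", "G"
    """
    gA, gT, gC, gG = g["A"], g["T"], g["C"], g["G"]
    gR = gA + gG  # purine frequency
    gY = gT + gC  # pyrimidine frequency
    k1 = 2 * gA * gG / gR
    k2 = 2 * gT * gC / gY
    k3 = 2 * (gR * gY - gA * gG * gY / gR - gT * gC * gR / gY)
    w1 = 1 - P1 / k1 - Q / (2 * gR)
    w2 = 1 - P2 / k2 - Q / (2 * gY)
    w3 = 1 - Q / (2 * gR * gY)
    if min(w1, w2, w3) <= 0:
        raise ValueError("differences are too large for the logarithms to be defined")
    return -k1 * math.log(w1) - k2 * math.log(w2) - k3 * math.log(w3)


# With equal nucleotide frequencies and P1 = P2, the result matches
# Kimura's two-parameter distance for P = P1 + P2.
equal = {"A": 0.25, "T": 0.25, "C": 0.25, "G": 0.25}
print(tamura_nei(P1=0.05, P2=0.05, Q=0.05, g=equal))
```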
Gamma Distances
For the distances listed above, the rate of nucleotide substitution is assumed to be the same for all nucleotide sites. In reality, this assumption rarely holds, and the rate varies from site to site. Statistical analyses of the substitution rate at different nucleotide sites suggest that the rate variation approximately follows a gamma distribution.
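The slides do not give a gamma distance formula. As one illustration, the commonly used gamma version of the Jukes-Cantor distance is d = (3/4) a [(1 - 4p/3)^(-1/a) - 1], where a is the shape parameter of the gamma distribution. The sketch below assumes that form; the function name `jukes_cantor_gamma` and the parameter name `a` are choices made for this example.

```python
def jukes_cantor_gamma(p, a):
    """Jukes-Cantor distance with gamma-distributed rates among sites.

    p: observed proportion of different sites
    a: gamma shape parameter (smaller a means stronger rate variation)
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    if a <= 0:
        raise ValueError("the gamma shape parameter must be positive")
    return 0.75 * a * ((1 - 4 * p / 3) ** (-1 / a) - 1)


# Strong rate variation (a = 0.5) gives a much larger distance than the
# near-uniform case (very large a), which approaches the plain
# Jukes-Cantor value for the same p.
print(jukes_cantor_gamma(0.3, a=0.5))
print(jukes_cantor_gamma(0.3, a=1e6))
```

Ignoring rate variation among sites therefore tends to underestimate d, and the bias grows as p increases.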
Comparison of Different Distance Measures

[Figure: estimated number of substitutions per site (y-axis, 0 to 2) plotted against the expected number of substitutions per site (x-axis) for the Tamura-Nei, Tamura, Kimura, Jukes-Cantor and p distances]
Alignment of Nucleotide Sequences

Unaligned sequences:
ATGCGTCGTT
ATCCGCGAT

Alignment with one gap:
ATGCGTCGTT
ATCCG_CGAT

Alignment with three gaps:
ATGC_GTCGTT
AT_CCG_CGAT
Methods
Similarity index: Needleman and Wunsch (1970)
Alignment distance: Sellers (1974)
E = min(w1 × number of mismatches + w2 × number of gaps)
w1 and w2 are the penalties for a mismatch and a gap (e.g. 1 and 4). The gap penalty is a function of the gap length. Similarly, mismatches can be divided into transitional and transversional mismatches, with different penalties given to each.
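The alignment distance can be illustrated with a small dynamic-programming sketch. It makes a simplifying assumption that is not in the slides: every gap position costs the same penalty w2 (a linear gap penalty), whereas the slides note that the penalty is in general a function of gap length. The function names `alignment_cost` and `alignment_distance` are illustrative; the sequences are the ones from the alignment slide.

```python
def alignment_cost(row_x, row_y, w1=1, w2=4, gap_char="_"):
    """Cost of one given alignment: w1 per mismatch, w2 per gap position."""
    cost = 0
    for a, b in zip(row_x, row_y):
        if a == gap_char or b == gap_char:
            cost += w2
        elif a != b:
            cost += w1
    return cost


def alignment_distance(x, y, w1=1, w2=4):
    """Minimum cost over all alignments of x and y (linear gap penalty)."""
    prev = [j * w2 for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * w2] + [0] * len(y)
        for j in range(1, len(y) + 1):
            substitute = prev[j - 1] + (0 if x[i - 1] == y[j - 1] else w1)
            cur[j] = min(substitute, prev[j] + w2, cur[j - 1] + w2)
        prev = cur
    return prev[-1]


# Cost of the one-gap and three-gap alignments from the slide.
print(alignment_cost("ATGCGTCGTT", "ATCCG_CGAT"))
print(alignment_cost("ATGC_GTCGTT", "AT_CCG_CGAT"))
# Minimum over all possible alignments of the two ungapped sequences.
print(alignment_distance("ATGCGTCGTT", "ATCCGCGAT"))
```

With penalties of 1 and 4, the one-gap alignment is preferred because each extra gap costs as much as four mismatches.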
Alignment of Multiple Sequences

It is customary to use a progressive alignment algorithm: pairs of sequences with small distances are aligned first, and the alignment of more distantly related sequences is done progressively for larger and larger groups.
Pairs of sequences are aligned using a pairwise alignment algorithm.
Groups of sequences are aligned with each other using a profile alignment algorithm.
Handling sequence gaps in estimation of evolutionary distances

Complete deletion: delete all sites with gaps from the data analysis. Generally desirable because different regions of DNA sequences often evolve differently.
Pairwise deletion: if the number of nucleotides involved in the gaps is small and the gaps are distributed more or less at random, distances may be computed for each pair of sequences, ignoring only those gaps that occur in the two sequences being compared.
Example
Sequences (- = gap, ? = missing or ambiguous site):
1  A-AC-GGAT-AGGA-ATAAA
2  AT-CC?GATAA?GAAAAC-A
3  ATTCC-GA?TACGATA-AGA

Option              Sequence data                   Differences/Comparisons
                                                    (1,2)    (1,3)    (2,3)
Complete-deletion   1  A C GA A GA A A A            1/10     0/10     1/10
                    2  A C GA A GA A C A
                    3  A C GA A GA A A A
Pairwise-deletion   1  A-AC-GGAT-AGGA-ATAAA         2/12     3/13     3/14
                    2  AT-CC?GATAA?GAAAAC-A
                    3  ATTCC-GA?TACGATA-AGA
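The two gap-handling options can be checked by counting directly from the three example sequences. The sketch below uses the 20-site versions of the sequences from the pairwise-deletion row and treats every character other than A, C, G or T (here `-` and `?`) as missing; the function names are illustrative. Counted this way, sequences 1 and 3 share 13 comparable sites.

```python
from itertools import combinations

NUCLEOTIDES = set("ACGT")

SEQS = [
    "A-AC-GGAT-AGGA-ATAAA",
    "AT-CC?GATAA?GAAAAC-A",
    "ATTCC-GA?TACGATA-AGA",
]


def count_differences(seq_a, seq_b, sites):
    """Number of the given sites at which two sequences differ."""
    return sum(1 for i in sites if seq_a[i] != seq_b[i])


def complete_deletion(seqs):
    """Use only sites where every sequence has a nucleotide."""
    sites = [i for i in range(len(seqs[0]))
             if all(s[i] in NUCLEOTIDES for s in seqs)]
    return {(a + 1, b + 1): (count_differences(seqs[a], seqs[b], sites), len(sites))
            for a, b in combinations(range(len(seqs)), 2)}


def pairwise_deletion(seqs):
    """For each pair, use the sites where both sequences have a nucleotide."""
    result = {}
    for a, b in combinations(range(len(seqs)), 2):
        sites = [i for i in range(len(seqs[a]))
                 if seqs[a][i] in NUCLEOTIDES and seqs[b][i] in NUCLEOTIDES]
        result[(a + 1, b + 1)] = (count_differences(seqs[a], seqs[b], sites), len(sites))
    return result


# Each value is (number of differences, number of sites compared).
print(complete_deletion(SEQS))
print(pairwise_deletion(SEQS))
```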
Assignment:
Reading assignments
ClustalX: http://inn-prot.weizmann.ac.i./software/ClustalX.html
http://www.biozentrum.unibas.ch/`biphit/slustal/ClustalX_help.html
Mega 2: http://www.megasoftware.net/
