Genome Bioinformatics: Richard Durbin Sanger Institute Rd@sanger - Ac.uk

Genome bioinformatics
Richard Durbin
Sanger Institute
rd@sanger.ac.uk
Interpreting the Human Genome Sequence
• How do we get at information in genome sequences?
• Brief look at human genome paper
• Gene-finding
– ‘ab initio’ gene finding; Hidden Markov Model methods
– Adding extra information
– Comparative approaches using homologous sequence
• Computational analysis of the proteome
– Similarity based analyses
– ‘Novel’ approaches: Rosetta, phylogenetic footprint etc.
• RNA genes
• Managing and accessing data: NCBI, Ensembl, DAS
The era of sequencing genomes
Organism        Size (Mb)   Genes      Gene density   Type              Completion date
H. influenzae   2           1,700      1/1kb          Bacterium         1995
Yeast           13          6,000      1/2kb          Eukaryotic cell   1996
Nematode        100         18,000     1/6kb          Metazoan          1998
Human           3000        ?40,000    1/75kb         Mammal            2000/3
Mouse, fish (3), rat, another worm, fly … shotgun drafts in 2001/2
Value of Genomic Sequence
• Complete information – we can get it right once and for all
– The basis for a complete catalogue of genes
– An index to draw things together
– Archival reference for the future
• New entry points
– Sequence matching to cross species and access new family members
– Gene disruption and expression studies in model organisms
– Comparative studies – sequence conserved between organisms has been selected
• Genome structure and archaeology
– Long range structure – chromosome organisation and function
– Evolutionary studies – “fossil genes”
• Materials
– The basis for designing experiments, all known in advance
– A substrate for computational knowledge extraction
Human Genome Paper (Nature, 2001)
(see also Science, 2001)
• Overview, history etc.
• How we got the sequence
– Assembly and integration with the map is challenging
• Repeat content
• Identifying genes
– RNA genes
– Protein coding genes
• Analysis of proteome
• Access, medical impact, conclusion
Repeat analysis
• The human genome is approximately 45% dispersed repeat
• 20% LINEs, preferring AT rich segments
• 13% is SINEs (11% Alu), strongly preferring GC rich segments
– but Alus use LINE insertion machinery
• 8% LTR (retrovirus like) and 2% DNA transposons
• Another 3% is tandem simple sequence repeats (e.g. triplet)
• And another 3-5% is segmentally duplicated at high similarity (over 1 kb at over 90% identity)
• Identifying and screening these out is essential to avoid confusion (e.g. fake matches or predictions)
• Repeats also have interesting biology, and can help probe evolution
Initial human gene set IGI/IPI.1
• 31,778 putative gene transcripts
– 14,882 known from RefSeq, SwissProt, TrEMBL
– 16,896 predictions from Ensembl and Genie
• Combine HMMs with protein, cDNA confirmation
• Calibration indicates 24,500 genes represented
– 20% overprediction, 1.4x fragmentation in 16,896
• Undercount calibration suggests 31,000 estimate for total
• Current (Ensembl): ~29,500 found, 35-40k possible
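The calibration figures above can be checked with a little arithmetic. The sketch below is one reading of them, not necessarily the paper's exact procedure. It assumes each known transcript stands for one gene, and that the predictions are first reduced by the 20% overprediction rate and then divided by the 1.4x fragmentation factor.

```python
# Figures from the slide; how they combine is an assumed interpretation.
known = 14_882          # transcripts known from RefSeq, SwissProt, TrEMBL
predicted = 16_896      # predictions from Ensembl and Genie
overprediction = 0.20   # fraction of predictions assumed to be spurious
fragmentation = 1.4     # predicted fragments per real gene

total_transcripts = known + predicted
# Drop spurious predictions, then merge fragments of the same gene.
genes_from_predictions = predicted * (1 - overprediction) / fragmentation
genes_represented = known + genes_from_predictions

print(total_transcripts)          # the 31,778 putative transcripts
print(round(genes_represented))   # close to the quoted 24,500
```

Under these assumptions the result lands within a few tens of genes of the 24,500 quoted on the slide.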

Basic characteristics of human genes
                        Median    Mean      Sample (size)
Internal exons          122 bp    145 bp    Confirmed by mRNA or EST, flanked by confirmed introns (43317)
Exon number             7         8.8       RefSeq alignments to finished sequence (3501)
Introns                 1023 bp   3365 bp   RefSeq alignments to finished sequence (27238)
3’ UTR                  400 bp    770 bp    Confirmed by mRNA or EST on Chr 22 (689)
5’ UTR                  240 bp    300 bp    Confirmed by mRNA or EST on Chr 22 (463)
Coding sequence (CDS)   1100 bp   1340 bp   Selected RefSeq entries (1804)
                        367 aa    447 aa
Genomic extent          14 kb     27 kb     Selected RefSeq entries (1804)
[Figure: size distributions of exons and introns in human, worm and Drosophila. Panels: percentage of exons against exon length (bp); percentage of introns against intron length (bp).]
Gene density as a function of GC content
GC content correlates with repeats
Methods for gene prediction
• Ab initio
– use general knowledge of gene structure: rules and statistics
– current best methods are all based on hidden Markov models, which
use Dynamic Programming
• Genscan (Burge)
• FGENES (Solovyev)
• HMMGene (Krogh)
• Similarity based
– comparison to known proteins, cDNAs, ESTs
– better, but only available if you have similar data to compare to
– best if the sequence comes from this gene (verification, not prediction)
Searching for genes:
Bacteria easy – Human hard
Promoter 5’ UTR <--------- coding region ------> 3’ UTR
Bacterial gene: continuous coding region, known signals
?? 5’ UTR <--- coding region -----------> 3’ UTR polyA site
Human gene: fragmented coding region, unknown signals, contained in much more DNA
Hidden Markov Models
• Imagine a sequence of hidden states, shadowing the sequence (labelling it “intergenic”, “exon”, “intron” etc.)
• Given this, assign the sequence a probability that is the product of:
– the probability of the set of states (the structure)
– the probability of the sequence given the structure
• Find the structure that maximises this probability for the sequence (Viterbi algorithm)
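The Viterbi step described above can be sketched on a toy two-state model. All of the following are invented for illustration and are not parameters of any real gene finder: the state names, the start, transition and emission probabilities, and the test sequence. Scores are kept as log probabilities, so the product of "structure" and "sequence given structure" becomes a sum.

```python
import math

def viterbi(seq, states, start_p, trans_p, emit_p):
    """Return (best log probability, most probable state path) for seq."""
    # v[i][s]: best log probability of any state path ending in s at position i
    v = [{s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in states}]
    back = [{}]
    for i in range(1, len(seq)):
        v.append({})
        back.append({})
        for s in states:
            # Best predecessor state for s at position i
            prev, score = max(
                ((p, v[i - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            v[i][s] = score + math.log(emit_p[s][seq[i]])
            back[i][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for i in range(len(seq) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return v[-1][last], path

# Hypothetical toy model: AT-rich "intergenic" versus GC-rich "exon"
states = ("intergenic", "exon")
start_p = {"intergenic": 0.9, "exon": 0.1}
trans_p = {
    "intergenic": {"intergenic": 0.9, "exon": 0.1},
    "exon": {"intergenic": 0.1, "exon": 0.9},
}
emit_p = {
    "intergenic": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "exon": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}

log_prob, path = viterbi("ATATATGCGCGCGCATATAT", states, start_p, trans_p, emit_p)
print(path)
```

With these made-up numbers the GC-rich stretch in the middle is labelled "exon" and the flanks "intergenic", because the emission gain over eight bases outweighs the cost of the two state switches.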
Hidden Markov Models (2)
• Fully probabilistic, so can do proper statistics
– Can estimate the parameters from labelled data
– Can give confidence values
• Semi- or Generalised HMMs
– A state explains a subsequence (e.g. a whole exon), rather than a single base
– Transitions between states occur at features detected by other methods (e.g. splice site consensus)
[Figure: candidate features numbered 1 to 22 along the sequence, with segments between them. The selected features 1, 6, 9, 13, 15 and 17 define the predicted gene.]
Score = loc(1) + tran(1,6) + loc(6) + ……+ loc(15) + tran(15, 17) + loc(17)

tran(x,y) = segment_score(x,y) - length_pen(x,y)
How Dynamic Programming works
• Sort features and proceed left to right
• Keep best score so far per feature
• From end, trace back to find features that make best score
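The three steps above can be sketched as follows, using the Score, loc and tran notation from the previous slide. The feature positions, the loc values, and the segment_score and length_pen functions are all hypothetical numbers chosen for illustration. The chain is allowed to start and end at any feature.

```python
def best_chain(features, loc, tran):
    """Return (best score, selected features) for the best-scoring chain."""
    features = sorted(features)        # sort features, proceed left to right
    best = {}   # best score so far of a chain ending at each feature
    prev = {}   # predecessor on that best chain (None if the chain starts here)
    for j, y in enumerate(features):
        best[y], prev[y] = loc(y), None
        for x in features[:j]:
            cand = best[x] + tran(x, y) + loc(y)
            if cand > best[y]:
                best[y], prev[y] = cand, x
    # From the best end feature, trace back to recover the chain
    end = max(features, key=lambda f: best[f])
    chain = [end]
    while prev[chain[-1]] is not None:
        chain.append(prev[chain[-1]])
    chain.reverse()
    return best[end], chain

# Hypothetical scores for five features
loc_scores = {1: 2.0, 2: -2.0, 3: 1.5, 4: -0.5, 5: 3.0}

def segment_score(x, y):
    return 1.0                # assumed constant reward for any segment

def length_pen(x, y):
    return 0.5 * (y - x)      # assumed linear penalty on segment length

def tran(x, y):
    return segment_score(x, y) - length_pen(x, y)

score, chain = best_chain(loc_scores, loc_scores.get, tran)
print(score, chain)
```

Each feature is compared against every earlier feature, so the cost grows with the square of the number of features. That is still far cheaper than enumerating every possible chain.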
Genscan (Burge and Karlin, 1997)
• Dramatic improvement over previous methods
• Generalised HMM
• Different parameter sets for different GC content regions (intron length distribution and exon stats)
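The idea of GC-dependent parameter sets can be illustrated with a small sketch: measure the GC fraction of a region, then look up which parameter set applies. The bin boundaries and labels below are placeholders assumed for illustration; they are not taken from the slides and should not be read as Genscan's actual configuration.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    acgt = sum(seq.count(base) for base in "ACGT")
    return gc / acgt if acgt else 0.0

# Hypothetical bins: (upper GC bound, label of the parameter set to use)
PARAMETER_SETS = [(0.43, "I"), (0.51, "II"), (0.57, "III"), (1.01, "IV")]

def choose_parameter_set(seq):
    """Pick the parameter set whose GC bin contains this region."""
    gc = gc_fraction(seq)
    for upper, label in PARAMETER_SETS:
        if gc < upper:
            return label
    return PARAMETER_SETS[-1][1]

print(choose_parameter_set("ATATATATGC"))   # AT-rich region
print(choose_parameter_set("GCGCGCGCAT"))   # GC-rich region
```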
Current performance of ab initio methods
• We can confirm gene structures experimentally by
sequencing cDNA
• Current methods are not really good enough
– 75% correct per exon, worse with initial and final exons
– 20% correct per gene
– easier for simpler organisms, e.g. nematode C. elegans
• Options are to improve methods, or get extra information
– an attractive source of new information is whole genome sequence
from a related species, e.g. mouse for man
-20bp
pkinase.hmm 1 YELGEKLGEGA GKVYKAKHK---TGKIVAVKILKKESLSLL REIQI
++ LG + G+ Y+A + ++I+ + +K + + + E+ +
INIKNLLGGDT GCLYMAPKVQATKQQIYKLCFIKIKTFVLQ TELNL
HSU71B4 -27753 aaaaactgggaGTGTGAGTA Intron 1 CAGTgtttagcagcgaaccatatttaaaaatgccAGGTCACTA Intron 2 CAGGagcac
tataattggac <2-----[27718:22469]-2> ggtatccataccaaataatgttatacttta <2-----[22375:21185]-2> catat
atcatggtata acatgaaaaaaaaaattagcctaaattgta tacct
+3bp - 6bp
pkinase.hmm 45 LKRLN-HPNIVRLLGVFED-----SKDHLY LVLEYMEGGDLFDYLRRKG--PLSEKEAKKIALQILR
L++++ H+NIV ++G+F L+ +V+E++ G+ D++R+ L E+++ +I ++IL+
LRKYSFHKNIVSFYGAFFKLSPPGQRHQLW MVMELCAAGSVTDVVRMTSNQSLKEDWIAYICREILQ
HSU71B4 -21168 caatttcaaagtttggttacaccgccccctGTATGTT Intron 3 CAGagagttgggtgagggaaaaacataggtagtatcgacc
tgaactaaattctagcttatgccgagaatg<0-----[21078:15667]-0>tttatgccgctcattgtcgaagtaaagtcatggatta
gggctccactgcctaatcggtcttggcatg ggggataatgcttagagcttgtaaatgtttccaactg
- 66bp - 8bp - 1bp

pkinase.hmm 104 GLEYLHSNGIVHRDLKPENILLDENGTVKI DFGLAKLLK-SGEKLTTFV
GL++LH ++++HRD+K +N+LL++N VK+ DFG++++++ ++++++F+
GLAHLHAHRVIHRDIKGQNVLLTHNAEVKL DFGVSAQVSRTNGRRNSFI
HSU71B4 -15555 GTGAGTC Intron 4 CAGgtgcccgccgaccgaagcagccacagggacGGTAAGTT Intron 5 CAGTTgtggagcgaaaagaaaata
<0-----[15555:14066]-0>gtcatacagttagatagaatttcaacatat <1-----[13974:10915]-1> atgtgcatggcagggagtt
catctcacaatcgccatgtgggttttaaag ttagtcggcattaagttct
0bp - 3bp -1 bp +2bp

pkinase.hmm 153 GTPWYMMAPEVILKG-----RGYSTK VDVWSLGVILYELLTGKL FPG-D
GTP++M APEV + R Y+ + +DVWS+G++ +E++ G + +
GTPYWM-APEV-IDCDEDPRRSYDYR SDVWSVGITAIEMAEGAP LCNLQ
HSU71B4 -10855 gactta gcgg agtgggcacttgtaGTGAGTG Intron 6 CAGaggttggaagagaggggcCGTGAGTA Intron 7 CAGCTctacc
gccagt ccat tagaaacggcaaag<0-----[10783: 8881]-0>gatgctgtcctatcagcc <1-----[8825 : 4234]-1> tgata
gaacgg atgg tcttgcaaccttca ttggtgattctagtaact gtcta
+1bp +1bp
pkinase.hmm 196 PLEELFRIKKRLRLPLPPNC SEELKDLLKKCLNKDPSKRPTAKELLEHPW
PLE+LF I+++ ++ + ++ S+ + +++KC K+ RPT +L+HP+
PLEALFVILRESAPTVKSSG SRKFHNFMEKCTIKNFLFRPTSANMLQHPF
HSU71B4 -4214 ctggctgatcgtgcagatagTGGTAAAGA Intron 8 TAGGtcatcatagataaaatctccatgaacccct
ctactttttgacccctacgg <2-----[4154 : 3085]-2> cgataattaagctaatttgccccattaact
cgatccttggattcacacca ctgcctcgagtgaatcgtttttacgtacat
GeneWise (Ewan Birney)
• GeneWise aligns a protein sequence (or HMM) to genomic
DNA taking into account splicing information
Other comparative approaches
• Procrustes (Gelfand, Mironov and Pevzner, 1996)
– Find possible exons, align and piece together homologue
– Similar sensitivity to GeneWise, but lower specificity
• GenomeScan (Yeh, Lim and Burge, 2001)
– Extension of Genscan to use protein matches where available to add to the Genscan score for an exon
– Higher sensitivity, especially when the match is weak (it always predicts something), but lower specificity
9,471,780 alignments of mouse shotgun
reads to the human genome in Ensembl
How conservation can help us
AYTGTHISSQKLIISCLPNOTKSIAIHIDDENAWYA
AYTGTHISSQKLIISCLPNOTKSIAIHIDDENAWYA
DEFYTHISPSQALISCAMPLETELYIHIDDENYWAE
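The toy sequences above can be scanned for conserved stretches with a few lines of code. This is a minimal sketch that assumes the two sequences are already aligned without gaps and have equal length. The helper name and the minimum run length are arbitrary choices.

```python
def conserved_runs(a, b, min_len=3):
    """Return runs of at least min_len consecutive identical aligned residues."""
    runs, current = [], ""
    for x, y in zip(a, b):
        if x == y:
            current += x
        else:
            if len(current) >= min_len:
                runs.append(current)
            current = ""
    if len(current) >= min_len:
        runs.append(current)
    return runs

seq1 = "AYTGTHISSQKLIISCLPNOTKSIAIHIDDENAWYA"
seq2 = "DEFYTHISPSQALISCAMPLETELYIHIDDENYWAE"
print(conserved_runs(seq1, seq2))   # ['THIS', 'ISC', 'IHIDDEN']
```

Only the residues shared by both sequences survive, which is the intuition behind using a related genome to highlight functional regions.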
But we can’t just use conserved regions
Test on Chromosome 22: 13472 mouse hits, 4978 exons
• Specificity
– Coding: 79% correct, 21% wrong
– Non coding: 85% correct, 15% wrong
• Sensitivity
– Coding: 1266 out of 2991 exons found (42%) Ugh!
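For reference, the two measures can be written out explicitly. The sketch assumes the usual gene-finding convention: sensitivity is the fraction of real exons that are found, and specificity is the fraction of hits that are correct. The function names are illustrative.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real exons that were found."""
    return true_pos / (true_pos + false_neg)

def specificity(true_pos, false_pos):
    """Fraction of hits that are correct (gene-finding convention)."""
    return true_pos / (true_pos + false_pos)

# Coding exons on chromosome 22, from the slide: 1266 of 2991 found
coding_sensitivity = sensitivity(1266, 2991 - 1266)
# Coding hits, per 100 hits: 79 correct, 21 wrong
coding_specificity = specificity(79, 21)

print(f"{coding_sensitivity:.1%}", f"{coding_specificity:.0%}")
```

So although most conserved hits in coding regions are correct, fewer than half of the exons are detected by conservation alone.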
Comparative Genefinders
• ROSETTA (Batzoglou et al, 2000)
– align first, then account for both sequences with emission model
– requires full alignment, which is likely to be incorrect where weak
• SLAM (Pachter) and Doublescan (Meyer)
– jointly align while predicting genes
– computationally expensive (extra search dimension) and hard to parameterise
– still effectively requires a full alignment
Twinscan (Korf et al., 2001)
Fit a “conservation sequence” alongside the target sequence
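A conservation sequence can be pictured as a second string, the same length as the target, that records at each base whether an informant sequence aligns there and whether it matches. The sketch below is only in the spirit of that idea. The three symbols (`|` match, `:` mismatch, `.` unaligned) and the input format (ungapped blocks given as a 0-based target start plus the informant bases) are assumptions made for illustration.

```python
def conservation_sequence(target, aligned_blocks):
    """Build a per-base conservation string for target.

    aligned_blocks: iterable of (start, informant_bases) pairs, where start is
    the 0-based target position of an ungapped aligned block (assumed format).
    """
    cons = ["."] * len(target)              # default: no alignment at this base
    for start, informant in aligned_blocks:
        for offset, base in enumerate(informant):
            pos = start + offset
            if pos >= len(target):
                break
            cons[pos] = "|" if base == target[pos] else ":"
    return "".join(cons)

target = "ACGTACGTAC"
blocks = [(2, "GTAG")]                      # hypothetical informant alignment
print(target)
print(conservation_sequence(target, blocks))
```

A gene finder can then model both strings together, with different emission probabilities for the conservation symbols in coding and non-coding states.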

Alternative splicing
• Alternative splicing is very prevalent in human (vertebrates), and was historically underestimated.
• For the Human Genome paper, full-length (coding) transcripts were reconstructed from cDNAs and ESTs on two chromosomes
• Chr 22: 642 transcripts map to 245 genes
– Average 2.6 transcripts per gene
– Two or more transcripts for 145 (59%) genes
• Chr 19: 1859 transcripts map to 544 genes
– Average 3.2 transcripts per gene
• 70% alternatives affect coding sequence
• Compare C. elegans data
– 22% genes have multiple transcripts, average 1.34 transcripts per gene
Automated protein analysis
Properties of the human protein set
• 74% of IPI proteins had matches to other known proteins
(others known or have EST evidence)
• These hit 8100 fly proteins (61%), 7850 worm proteins
(43%), 2800 yeast (46%)
• 1308 orthologous groups
– 564 are 1-1-1-1, mostly anabolic enzymes
– Without yeast, 1195 1-1-1 include signalling proteins, e.g. receptor and src-like tyrosine kinases
• About 40% had matches to domains or motifs in Interpro
(PFAM+Prints+Prosite)
Distribution of homologue matches
Numbers of paralogues
[Figure: “Paralogues H:x F:1 W:1”, a histogram of frequency against number of human paralogues.]
Domain association
[Figure: bar chart for the domains Pkinase, SH3, PDZ, RasGAP, EGF and PHD, compared across weed, yeast, worm, fly and human.]
Domain architectures
Domain architecture numbers
[Figure: number of domain architectures, split into extracellular and intracellular, for human, weed, yeast, worm and fly.]
Summary comparison to worms and flies
• Vertebrates have around twice as many genes
• Genes are much more spread out on the genome,
potentially allowing for more complex transcriptional
regulation
• There is at least twice as much alternative splicing
• Typical protein lengths are comparable, but more protein
domains are used, in more architectures
Phylogenetic profiling
Pellegrini et al. PNAS 1999
Gene Fusion and other correlation methods
• Functional relationship from gene fusion patterns
– Marcotte et al. Science 1999
– Enright and Ouzounis, Nature 1999
• Combining different data sources
– Marcotte (2000)
Resources and technology
UCSC genome browser
http://genome.cse.ucsc.edu/
Ensembl: http://www.ensembl.org/
GeneView gives properties and links
NCBI
http://www.ncbi.nlm.nih.gov/genome/guide/human/
Distributed Annotation Server (Stein et al.)
[Figure: DAS architecture. Labels: database providers, external contributors, sequence/coordinate server, annotation servers, synchronisation, viewer, users, programs; data exchanged as XML and HTML.]
References for Genome Informatics

***International Human Genome Sequencing Consortium (2001). “Initial sequencing and analysis of the human genome”. Nature, 409(6822), 860-921.
Mammoth paper, but thorough over what can be done in many areas of genome informatics. Section on dispersed repeats is particularly strong.

**Venter, C. et al. (2001). “The Sequence of the Human Genome”. Science, 291, 1304-1351.
For comparison. Protein family section is strong.

***Burge, C. & Karlin, S. (1997). “Prediction of complete gene structures in human genomic DNA”. J. Mol. Biol., 268, 78-94.
Very influential paper introducing GENSCAN. Worth reading even if you skip the technical bits.

**Burge, C.B. & Karlin, S. (1998). “Finding the genes in genomic data”. Current Opinion in Structural Biology, 8, 346-354.
Intelligent review of the problem and available approaches at the time.

*Rogic, S., Mackworth, A.K. & Ouellette, B.F. (2001). “Evaluation of gene-finding programs on mammalian sequences”. Genome Research, 11, 817-832.
Most recent comparative evaluation of mammalian genefinding. Could be better.

**Korf, I., Flicek, P., Duan, D. & Brent, M.R. (2001). “Integrating genomic homology into gene structure prediction”. Bioinformatics, 17, S140-S148.
Nice paper on what is currently the most promising approach to comparative genefinding.

**Marcotte, E. (2000). “Computational genetics: finding protein function by nonhomology methods”. Current Opinion in Structural Biology, 10, 359-365.
Reviews and discusses the new set of methods to try to identify protein interactions and function using whole genome protein sets.

**Rivas, E., Klein, R.J., Jones, T.A. & Eddy, S.R. (2001). “Computational identification of noncoding RNAs in E. coli by comparative genomics”. Current Biology, 11, 1369-1373.
First paper with results from potentially powerful approach to finding novel RNA genes by comparative methods.
