Richard Durbin
Sanger Institute
rd@sanger.ac.uk
Interpreting the Human Genome Sequence
• How do we get at information in genome sequences?
• Brief look at human genome paper
• Gene-finding
– ‘ab initio’ gene finding; Hidden Markov Model methods
– Adding extra information
– Comparative approaches using homologous sequence
• Computational analysis of the proteome
– Similarity based analyses
– ‘Novel’ approaches: Rosetta, phylogenetic footprint etc.
• RNA genes
• Managing and accessing data: NCBI, Ensembl, DAS
The era of sequencing genomes
[Table: genome Size (Mb), Genes and Completion date by organism]
Mouse, fish (3), rat, another worm, fly … shotgun drafts in 2001/2
Value of Genomic Sequence
• Complete information – we can get it right once and for all
– The basis for a complete catalogue of genes
– An index to draw things together
– Archival reference for the future
• New entry points
– Sequence matching to cross species and access new family members
– Gene disruption and expression studies in model organisms
– Comparative studies – sequence conserved between organisms has been selected
• Genome structure and archaeology
– Long range structure – chromosome organisation and function
– Evolutionary studies – “fossil genes”
• Materials
– The basis for designing experiments, all known in advance
– A substrate for computational knowledge extraction
Human Genome Paper (Nature, 2001)
(see also Science, 2001)
• Overview, history etc.
• How we got the sequence
– Assembly and integration with the map is challenging
• Repeat content
• Identifying genes
– RNA genes
– Protein coding genes
• Analysis of proteome
• Access, medical impact, conclusion
Repeat analysis
• The human genome is approximately 45% dispersed repeat
• 20% LINEs, preferring AT rich segments
• 13% is SINEs (11% Alu), strongly preferring GC rich segments
– but Alus use LINE insertion machinery
• 8% LTR (retrovirus like) and 2% DNA transposons
• Another 3% is tandem simple sequence repeats (e.g. triplet)
• And another 3-5% is segmentally duplicated at high similarity (over
1 kb at over 90% identity)
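As a quick arithmetic check on the figures above, the dispersed repeat classes can be totalled. This is a minimal Python sketch using only the rounded percentages quoted on this slide; the variable and function names are illustrative, not from the paper.

```python
# Percent of the human genome by dispersed repeat class, as quoted above
# (rounded slide figures, not the exact values from the paper).
dispersed = {"LINE": 20, "SINE": 13, "LTR": 8, "DNA transposon": 2}

def total_percent(classes):
    """Sum the percentage contributions of a set of repeat classes."""
    return sum(classes.values())

dispersed_total = total_percent(dispersed)

# Adding tandem simple sequence repeats ("another 3%").
with_simple = dispersed_total + 3

print(dispersed_total, with_simple)
```

The rounded class figures sum to 43%, which is consistent with the "approximately 45%" headline once rounding is allowed for.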
[Figure: distributions of Exon Length (bp) and Intron Length (bp), as Percentage of Exons and Percentage of Introns, for Human, Worm and Drosophila]
Gene density as a function of GC content
GC content correlates with repeats
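To relate gene density or repeat content to GC content, the sequence is usually summarised in windows. The sketch below is one simple way to do that, assuming plain uppercase or lowercase DNA strings; the window size, function names and example sequence are made up for illustration.

```python
def gc_content(seq):
    """Fraction of G or C bases in seq (case-insensitive); 0.0 if empty."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def windowed_gc(seq, window, step=None):
    """Yield (start, gc_fraction) for successive windows along seq."""
    step = step or window  # default: non-overlapping windows
    for start in range(0, len(seq) - window + 1, step):
        yield start, gc_content(seq[start:start + window])

example = "ATATATATGCGCGCGC"  # made-up sequence
windows = list(windowed_gc(example, window=8))
print(windows)
```

Each window's GC fraction can then be binned and compared with the number of genes or repeats annotated in that window.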
Methods for gene prediction
• Ab initio
– use general knowledge of gene structure: rules and statistics
– current best methods are all based on hidden Markov models, which
use dynamic programming
• Genscan (Burge)
• FGENES (Solovyev)
• HMMGene (Krogh)
• Similarity based
– comparison to known proteins, cDNAs, ESTs
– better, but only available if you have similar data to compare to
– best if the sequence comes from this gene (verification, not prediction)
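The dynamic programming step behind HMM gene finders can be illustrated with the Viterbi algorithm on a toy model. The sketch below is not Genscan, FGENES or HMMGene: it assumes a hypothetical two-state HMM (non-coding "N", AT rich; coding "C", GC rich) with invented probabilities, purely to show how the most probable state path is recovered.

```python
import math

STATES = ("N", "C")  # N = non-coding, C = coding (toy labels, assumed)
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {  # invented emission probabilities for illustration only
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}

def viterbi(seq):
    """Return the most probable state path for seq under the toy HMM."""
    if not seq:
        return ""
    # score[s] = best log probability of any path ending in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in s at this position.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("ATATATGCGCGCGCATATAT"))
```

Real gene finders use many more states (exons in each frame, introns, splice sites, intergenic sequence) and length distributions, but the recursion has the same shape.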
Searching for genes:
Bacteria easy – Human hard
[Figure: candidate Features 1 to 22 along the sequence; Selected Features 1, 6, 9, 13, 15, 17 make up the Predicted Gene]
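Selecting a consistent subset of candidate features to form a predicted gene can be framed as a dynamic programming problem. The sketch below assumes each feature is a hypothetical (start, end, score) triple and simply picks the non-overlapping chain with the highest total score; real gene finders also enforce reading frame and splice site compatibility, which is omitted here.

```python
def best_chain(features):
    """Pick non-overlapping features with maximal total score.

    features: list of (start, end, score) with inclusive end coordinates.
    Returns (total_score, chosen_features_in_order).
    """
    feats = sorted(features, key=lambda f: f[1])  # order by end position
    best = []  # best[i] = (score, chain) of the best chain ending at feats[i]
    for i, (start, end, score) in enumerate(feats):
        top_score, top_chain = 0.0, []
        for j in range(i):
            # A predecessor must finish before this feature starts.
            if feats[j][1] < start and best[j][0] > top_score:
                top_score, top_chain = best[j]
        best.append((top_score + score, top_chain + [feats[i]]))
    return max(best, key=lambda b: b[0]) if best else (0.0, [])

# Made-up candidate features and scores.
candidates = [(0, 99, 5.0), (50, 149, 6.0), (200, 299, 4.0),
              (250, 349, 1.0), (400, 499, 3.0)]
total, chain = best_chain(candidates)
print(total, chain)
```

The quadratic inner loop keeps the sketch short; sorting plus binary search would make it faster on large feature sets.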
+3bp - 6bp
pkinase.hmm 45 LKRLN-HPNIVRLLGVFED-----SKDHLY LVLEYMEGGDLFDYLRRKG--PLSEKEAKKIALQILR
L++++ H+NIV ++G+F L+ +V+E++ G+ D++R+ L E+++ +I ++IL+
LRKYSFHKNIVSFYGAFFKLSPPGQRHQLW MVMELCAAGSVTDVVRMTSNQSLKEDWIAYICREILQ
HSU71B4 -21168 caatttcaaagtttggttacaccgccccctGTATGTT Intron 3 CAGagagttgggtgagggaaaaacataggtagtatcgacc
tgaactaaattctagcttatgccgagaatg<0-----[21078:15667]-0>tttatgccgctcattgtcgaagtaaagtcatggatta
gggctccactgcctaatcggtcttggcatg ggggataatgcttagagcttgtaaatgtttccaactg
+1bp +1bp
pkinase.hmm 196 PLEELFRIKKRLRLPLPPNC SEELKDLLKKCLNKDPSKRPTAKELLEHPW
PLE+LF I+++ ++ + ++ S+ + +++KC K+ RPT +L+HP+
PLEALFVILRESAPTVKSSG SRKFHNFMEKCTIKNFLFRPTSANMLQHPF
HSU71B4 -4214 ctggctgatcgtgcagatagTGGTAAAGA Intron 8 TAGGtcatcatagataaaatctccatgaacccct
ctactttttgacccctacgg <2-----[4154 : 3085]-2> cgataattaagctaatttgccccattaact
cgatccttggattcacacca ctgcctcgagtgaatcgtttttacgtacat
GeneWise (Ewan Birney)
• GeneWise aligns a protein sequence (or HMM) to genomic
DNA taking into account splicing information
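Two pieces of the splicing information visible in the alignment above are the intron phase (the 0 and 2 shown in the angle brackets) and the GT...AG intron boundaries. The sketch below illustrates both checks; it is not GeneWise's algorithm, and it assumes exon lengths count coding bases only, starting at the first base of a codon.

```python
def intron_phases(exon_lengths):
    """Phase of the intron following each exon except the last.

    Phase is the number of bases (0, 1 or 2) of a split codon that
    lie before the intron.
    """
    phases, coding = [], 0
    for length in exon_lengths[:-1]:
        coding += length
        phases.append(coding % 3)
    return phases

def has_canonical_splice_sites(intron):
    """True if the intron starts with GT and ends with AG."""
    intron = intron.upper()
    return (len(intron) >= 4 and intron.startswith("GT")
            and intron.endswith("AG"))

# Made-up exon lengths and intron sequence.
print(intron_phases([90, 62, 100]))
print(has_canonical_splice_sites("GTATGTTcccCAG"))
```

A spliced aligner has to keep the phase consistent across each intron so that the translated exons stay in frame with the protein.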
Other comparative approaches
AYTGTHISSQKLIISCLPNOTKSIAIHIDDENAWYA
AYTGTHISSQKLIISCLPNOTKSIAIHIDDENAWYA
DEFYTHISPSQALISCAMPLETELYIHIDDENYWAE
But we can’t just use conserved regions
Test on Chromosome 22: 13472 mouse hits 4978 exons
• Specificity
– Coding: 79% correct, 21% wrong
– Non coding: 85% correct, 15% wrong
• Sensitivity
– Coding: 1266 out of 2991 exons found (42%) Ugh!
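The sensitivity figure above is a simple ratio of exons found to exons known. A minimal sketch of the arithmetic, using the counts quoted on this slide (the helper name is illustrative):

```python
def fraction(part, whole):
    """part / whole as a float, guarding against an empty denominator."""
    return part / whole if whole else 0.0

exons_found, exons_total = 1266, 2991  # figures quoted above
sensitivity = fraction(exons_found, exons_total)
print(f"sensitivity = {sensitivity:.1%}")  # prints: sensitivity = 42.3%
```

Specificity on this slide appears to be the fraction of hits that are correct, so it would be computed the same way from the counts of correct and total hits in each class.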
Comparative Genefinders
• ROSETTA (Batzoglou et al, 2000)
– align first, then account for both sequences with emission model
– requires full alignment, which is likely to be incorrect where similarity is weak
[Figure: histogram of Frequency against No. human paralogues]
Domain association
[Figure: bar chart for the domains Pkinase, SH3, PDZ, RasGAP, EGF and PHD in Weed, Yeast, Worm, Fly and Human]
Domain architectures
[Figure: Domain architecture numbers; Number of architectures, Extracellular and Intracellular, for Human, Weed, Yeast, Worm and Fly]
Summary comparison to worms and flies
• Vertebrates have around twice as many genes
• Genes are much more spread out on the genome,
potentially allowing for more complex transcriptional
regulation
• There is at least twice as much alternative splicing
• Typical protein lengths are comparable, but more protein
domains are used, in more architectures
Phylogenetic profiling
Pellegrini et al. PNAS 1999
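Phylogenetic profiling represents each protein by the pattern of genomes in which a homologue is present, and proposes functional links between proteins with matching patterns. The sketch below groups proteins with identical profiles; the protein and genome names are hypothetical, and homologue detection itself is assumed to have been done already.

```python
from collections import defaultdict

def group_by_profile(presence, genomes):
    """Group proteins that share an identical presence/absence profile.

    presence: dict mapping protein -> set of genomes with a homologue.
    genomes: ordered list of genome names defining the profile columns.
    """
    groups = defaultdict(list)
    for protein, found_in in presence.items():
        profile = tuple(int(g in found_in) for g in genomes)
        groups[profile].append(protein)
    return {p: sorted(members) for p, members in groups.items()}

# Made-up example data.
genomes = ["g1", "g2", "g3", "g4"]
presence = {
    "P1": {"g1", "g3"},
    "P2": {"g1", "g3"},
    "P3": {"g2", "g3", "g4"},
}
profiles = group_by_profile(presence, genomes)
print(profiles)
```

Here P1 and P2 share the profile (1, 0, 1, 0) and would be candidates for a functional link; in practice near-identical profiles are also of interest, which needs a distance measure rather than exact grouping.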
Gene Fusion and other correlation methods
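The gene fusion idea is that two separate proteins in one genome are likely to be functionally linked if they appear fused into a single protein in another genome. The sketch below is a simplified illustration that assumes proteins are already described by sets of hypothetical domain labels; published methods work from sequence similarity searches rather than labels like these.

```python
def fusion_links(query_proteins, fused_proteins):
    """Predict links between query proteins from fused proteins elsewhere.

    Both arguments map protein name -> set of domain (family) labels.
    Two distinct query proteins are linked when a single fused protein
    contains a domain from each, and the two share no domain themselves.
    """
    links = set()
    names = sorted(query_proteins)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            dom_a, dom_b = query_proteins[a], query_proteins[b]
            if dom_a & dom_b:
                continue  # homologous to each other: not informative
            for fused, dom_f in fused_proteins.items():
                if dom_a & dom_f and dom_b & dom_f:
                    links.add((a, b, fused))
    return sorted(links)

# Made-up example data.
query = {"A": {"d1"}, "B": {"d2"}, "C": {"d3"}}
fused = {"AB_fused": {"d1", "d2"}}
links = fusion_links(query, fused)
print(links)
```

Promiscuous domains that occur in many unrelated proteins would generate false links under this scheme, so a real analysis would need to filter them out.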
[Figure: diagram of Server, Viewer and Users, with xml and html data flows]
References for Genome Informatics
***International Human Genome Sequencing Consortium (2001). “Initial sequencing and analysis of the human genome”. Nature, 409(6822), 860-921. Mammoth paper, but thorough on what can be done in many areas of genome informatics. Section on dispersed repeats is particularly strong.
**Venter, J.C. et al. (2001). “The sequence of the human genome”. Science, 291, 1304-1351. For comparison. Protein family section is strong.
***Burge, C. & Karlin, S. (1997). “Prediction of complete gene structures in human genomic DNA”. J. Mol. Biol., 268, 78-94. Very influential paper introducing GENSCAN. Worth reading even if you skip the technical bits.
**Burge, C.B. & Karlin, S. (1998). “Finding the genes in genomic data”. Current Opinion in Structural Biology, 8, 346-354. Intelligent review of the problem and available approaches at the time.
*Rogic, S., Mackworth, A.K. & Ouellette, B.F. (2001). “Evaluation of gene-finding programs on mammalian sequences”. Genome Research, 11, 817-832. Most recent comparative evaluation of mammalian genefinding. Could be better.
**Korf, I., Flicek, P., Duan, D. & Brent, M.R. (2001). “Integrating genomic homology into gene structure prediction”. Bioinformatics, 17 (Suppl. 1), S140-S148. Nice paper on what is currently the most promising approach to comparative genefinding.
**Marcotte, E. (2000). “Computational genetics: finding protein function by nonhomology methods”. Current Opinion in Structural Biology, 10, 359-365. Reviews and discusses the new set of methods to try to identify protein interactions and function using whole genome protein sets.
**Rivas, E., Klein, R.J., Jones, T.A. & Eddy, S.R. (2001). “Computational identification of noncoding RNAs in E. coli by comparative genomics”. Current Biology, 11, 1369-1373. First paper with results from a potentially powerful approach to finding novel RNA genes by comparative methods.