Genome Analysis

20-04-10
Genomics
Study of the genome (the complete set of genetic information of an organism)
• Structural genomics: 3D structure of all proteins encoded
• Functional genomics: assess the gene function
Computational genomics includes:
• bio-sequence comparison
• gene prediction
• prediction of promoters
• prediction of orthologous genes
• prediction of transcription factor binding motifs
• prediction of genome rearrangement
• prediction of operons
• prediction of SNPs and haplotype analysis
• prediction of simple and complex repeats
• …
To understand genomes, the following distinct types of analysis are required:
1. Acquisition of sequence data
Acquired in the form of many individual sequences of 500-800 bp, which must be assembled into the contiguous genome sequence
2. Analysis of the genome sequence to locate the genes, control sequences, and promoters
3. Computational analyses, which aid genomics and post-genomic research
A Microbial Genome Sequencing Project
Random sequencing:
• Library construction
• Colony picking
• Template preparation
• Sequencing reactions
• Base calling
• Sequence files
• Sample tracking
Genome assembly:
• TIGR Assembler
• Genome scaffold
• Combinatorial PCR, POMP
• Ordered contig set
• Gap closure, sequence editing
• Re-assembly
• ONE ASSEMBLY!
Annotation:
• Gene finding
• Homology searches
• Initial role assignments
• Metabolic pathways
• Gene families
• Comparative genomics
• Transcriptional/translational regulatory elements
• Repetitive sequences
Data release:
• Publication
• www.tigr.org
What is the Procedure for Gene Prediction?
Obtain a new genomic DNA sequence, then:
1. Translate in all six reading frames and compare to the protein sequence database
2. Perform a database similarity search of the expressed sequence tag (EST) database of the same organism, or of cDNA sequences if available
Then use a gene prediction programme to locate genes, and analyze regulatory sequences in the gene.
The term DNA sequencing refers to methods for determining
the order of the nucleotide bases, adenine, guanine, cytosine, and
thymine, in a molecule of DNA.
Sequencing can be done by the following methods:
• Chain termination / Sanger-Coulson method (1977)
- using nucleotide analogs
- attaching fluorescent dyes to nucleotides (Hood and co-workers, 1986)
• Automated sequencing (1987, Applied Biosystems)
A shotgun or clone-by-clone strategy may be applied for this purpose.
Two Competing Strategies for the Human Genome
• Hierarchical shotgun [public Human Genome Project]
• Whole-genome shotgun [Celera project]
Figure: Two main shotgun-sequencing strategies
Figure: Main steps in clone-by-clone shotgun sequencing
Phred
• The phred software reads DNA sequencing trace files, calls bases,
and assigns a quality value to each called base.
• The quality value is a log-transformed error probability, specifically
Q = -10 log10(Pe), where Q is the quality value and Pe is the error
probability of a base call. The phred quality values have been
thoroughly tested for both accuracy and power to discriminate between
correct and incorrect base-calls.
• Phred can use the quality values to perform sequence trimming.
• Phred works well with trace files from the following manufacturers'
sequencing machines: Amersham Biosciences, Applied Biosystems,
Beckman Instruments, and LI-COR Life Sciences. See the phred
documentation for specific compatibility information.
• Phred runs on most computers and operating systems including
Apple Mac OS X, *BSD, Hewlett-Packard HP-UX, HP-Compaq Tru64,
IBM AIX, Linux, Microsoft Windows, Silicon Graphics IRIX, and SUN
Solaris.
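The relationship between a phred quality value and its error probability, Q = -10 log10(Pe), can be sketched in a few lines. The trimming helper below is only an illustration of the idea of quality-based trimming: it is not phred's own trimming algorithm, and the function names and the threshold `min_q=20` are assumptions made for this example.

```python
import math

def phred_to_error_prob(q):
    """Error probability Pe for a quality value Q, from Q = -10 * log10(Pe)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Quality value Q for an error probability Pe."""
    return -10 * math.log10(p)

def trim_by_quality(bases, quals, min_q=20):
    """Return the longest contiguous stretch whose bases all have Q >= min_q.

    Illustrative only: a simple 'longest good run' rule, not phred's method.
    """
    best_start, best_len, run_start = 0, 0, None
    # A sentinel of -1 closes a run that reaches the end of the read.
    for i, q in enumerate(list(quals) + [-1]):
        if q >= min_q:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    return bases[best_start:best_start + best_len]

if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
    print(trim_by_quality("ACGTACGT", [5, 30, 30, 30, 10, 40, 40, 8]))
```

A quality of 20 corresponds to one expected error in 100 base calls, and each further 10 points reduces the error probability tenfold.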
Phrap
• Phrap is a program for assembling shotgun DNA sequence data.
• It allows use of the entire read and not just the trimmed high
quality part.
• It uses a combination of user-supplied and internally computed
data quality information to improve assembly accuracy in the
presence of repeats.
• It constructs the contig sequence as a mosaic of the highest
quality read segments rather than a consensus.
• It provides extensive assembly information to assist in
troubleshooting assembly problems, and it handles large datasets.
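To make the idea of assembling shotgun reads into a contig concrete, here is a minimal greedy overlap-merge sketch. It is a toy and not Phrap's algorithm: it ignores quality values, sequencing errors, reverse-complement reads and repeats, and the names `overlap`, `greedy_assemble` and the `min_len` parameter are assumptions made for this example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest exact overlap.

    Returns the list of contigs left when no overlaps remain.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return reads
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]

if __name__ == "__main__":
    # Three overlapping toy reads from one short, repeat-free sequence.
    print(greedy_assemble(["ATGGCGTA", "GCGTACCTG", "ACCTGAAT"]))
```

Greedy merging of this kind fails when the genome contains repeats longer than the overlaps, which is one reason real assemblers use quality information and clone-end (long-range) data.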
Figure: Shotgun-sequence assembly.
Figure: Long-range sequence assembly in whole-genome shotgun sequencing.
STS: Sequence Tagged Site Mapping
• The most powerful technique for physical mapping.
• An STS is a short DNA sequence (100-500 bp) that occurs only once in the genome being studied.
• To map STS markers, a collection of overlapping DNA fragments
from the genome is needed. This collection is called the mapping
reagent and is usually a clone library.
• Sources of STSs may be:
- Expressed sequence tags (ESTs), obtained by analyzing cDNA, which
therefore represent expressed genes. An EST can be used as an STS
if it comes from a unique gene and not from a family of
related/similar genes.
- Random genomic sequences, obtained by sequencing random cloned
genomic DNA
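The defining property of an STS, that it occurs only once in the genome, can be checked computationally once sequence is available. The sketch below counts exact occurrences of a candidate marker on both strands; the function names are assumptions for this example, and the toy marker is far shorter than a real 100-500 bp STS.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))

def count_occurrences(genome, marker):
    """Count (possibly overlapping) exact occurrences on the given strand."""
    count, pos = 0, genome.find(marker)
    while pos != -1:
        count += 1
        pos = genome.find(marker, pos + 1)
    return count

def is_unique_sts(genome, marker):
    """True if the marker matches exactly once, counting both strands."""
    total = count_occurrences(genome, marker)
    rc = reverse_complement(marker)
    if rc != marker:  # avoid double-counting palindromic markers
        total += count_occurrences(genome, rc)
    return total == 1

if __name__ == "__main__":
    genome = "ATGGCGTACCTGAATGGCTT"
    print(is_unique_sts(genome, "GCGTACC"))  # occurs once
    print(is_unique_sts(genome, "ATGG"))     # occurs twice
```

Exact matching is a simplification; in practice a candidate from a gene family would also need to be screened for near-identical copies.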
There are now a large number of gene and protein databases.
It is possible to find homologous sequences for a given gene or
protein-coding sequence in these databases.
Software programs such as BLAST and FASTA provide the means to
align sequences, and such homology searching provides clues to
the potential structure and function of a given DNA sequence.
General DNA Sequence databases
EMBL – http://www.ebi.ac.uk
at the European Bioinformatics Institute (EBI), Cambridge, UK
GenBank – http://ncbi.nlm.nih.gov
at the National Institutes of Health (NIH), USA
The DNA Data Bank of Japan (DDBJ) at Mishima, Japan – http://ddbj.nig.ac.jp
Human mapping databases, Johns Hopkins, USA – http://gdbwww.gdb.org
Gene Prediction…
• Computational gene finding is a process of:
– Identifying common phenomena in known genes
– Building a computational framework/model that can accurately
describe the common phenomena
– Using the model to scan uncharacterized sequence to identify
regions that match the model, which become putative genes
– Testing and validating the predictions

Different Types of Gene Finding
• RNA genes
– tRNA, rRNA, snRNA, microRNA
• Protein coding genes

– Prokaryotic
• No introns, simpler regulatory features
– Eukaryotic
• Exon-intron structure
• Complex regulatory features
Approaches to Gene Finding
Direct
– Exact or near-exact matches of EST, cDNA, or protein sequences
from the same, or a closely related, organism
– Algorithm-based searches investigate the nucleotide
composition and other intrinsic features of genomic DNA
Indirect
1. Look for something that looks like one gene (homology)
Rely on previously identified genes in other organisms
2. Look for something that looks like all genes (ab initio)
3. Hybrid, combining homology and ab initio (and perhaps
even direct) methods
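Open Reading Frames: 6 possibilities
The identical sequence can be read in three frames on each of the two strands (top strand written 5'→3', complementary strand beneath it 3'→5'):
Frame 1:
TCG TAC GTA GCT AGC TAG CTA
AGC ATG CAT CGA TCG ATC GAT
Frame 2:
T CGT ACG TAG CTA GCT AGC TA
A GCA TGC ATC GAT CGA TCG AT
Frame 3:
TC GTA CGT AGC TAG CTA GCT A
AG CAT GCA TCG ATC GAT CGA T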
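The six reading frames of the example sequence can be generated programmatically. The sketch below only splits each strand into codons; translating them to amino acids would additionally need a codon table, which is omitted here. The function names and the `+1`/`-1` frame labels are assumptions for this example, and the reverse-strand frames are reported 5'→3', i.e. as the reverse complement of the input.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]

def codons(seq, frame):
    """Split seq into complete codons, starting at offset frame (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

def six_frames(seq):
    """Return the codons of all six reading frames, keyed '+1'..'+3', '-1'..'-3'."""
    rc = reverse_complement(seq)
    frames = {}
    for f in range(3):
        frames[f"+{f + 1}"] = codons(seq, f)
        frames[f"-{f + 1}"] = codons(rc, f)
    return frames

if __name__ == "__main__":
    for name, cods in six_frames("TCGTACGTAGCTAGCTAGCTA").items():
        print(name, " ".join(cods))
```

Incomplete codons at either end of a frame are dropped, which is why frames 2 and 3 yield one codon fewer than frame 1 for this 21-base sequence.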
cDNA is made from mRNA
1. Mature mRNA: Start codon, Stop codon, and a poly-A tail (AAAAAAA)
2. Add polyT primer (TTTTTTT), nucleotides, and reverse transcriptase, giving a DNA/RNA hybrid
3. RNA is removed (by NaOH) and the second strand is synthesized
4. The product is complementary DNA (cDNA)
A full-length cDNA is hard to find
• The open reading frame (ORF) runs from the Start to the Stop codon of the mRNA, upstream of the poly-A tail (AAAAAAA)
• mRNA is degraded from the 5' end
• Most cDNAs are not full length (flcDNA) and the ORF is incomplete (partial)
cDNA (EST) libraries have few flcDNAs
- cDNA libraries are made and individual clones are sequenced at random
- A sequenced cDNA is called an Expressed Sequence Tag (EST)
- Millions of ESTs from different tissues of different organisms are
stored in GenBank, but only a few are full-length cDNAs
- How to find the longest ones, and where?
EST Division: Expressed Sequence Tags
dbEST: http://www.ncbi.nlm.nih.gov/dbEST/
How ESTs are generated:
• The nucleus holds 80-100,000 genes, giving 80-100,000 RNA gene products
• Make a cDNA library: 80-100,000 unique cDNA clones in the library
• Isolate unique clones and sequence once from each end, giving two ESTs per clone (e.g. clone xyz: sequence1 TAGTCA, sequence2 CGTACT)
Prokaryotic and Eukaryotic Biology
Affect ORF-Based Gene Prediction
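In its simplest, prokaryote-style form (no introns), ORF-based gene prediction amounts to scanning each reading frame for a start codon followed in-frame by a stop codon. The sketch below does this for one strand only; the function name, the `min_codons` cutoff and the choice of ATG as the only start codon are assumptions for this example. In eukaryotes the exon-intron structure interrupts ORFs, so a scan like this is not sufficient there.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Find ATG...stop ORFs in the three forward frames of seq.

    Returns (start, end, frame) tuples; end is exclusive and includes the
    stop codon. min_codons counts codons from ATG up to, not including,
    the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

if __name__ == "__main__":
    seq = "CCATGGCTTGGAAATAGCC"
    for start, end, frame in find_orfs(seq):
        print(frame, seq[start:end])
```

To cover all six frames, the same scan can be run on the reverse complement of the sequence.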
GLIMMER: A Microbial Gene Finder
• GLIMMER 2.0: released late 1999
• > 200 site licenses worldwide
• Works on bacteria, archaea, viruses too
• Malaria (eukaryotic) version: GLIMMERM
• Refs: Salzberg et al., NAR, 1998, Genomics 1999; Delcher et al., NAR, 1999
• Web site and code: http://www.tigr.org/
GENSCAN: 1997 by Chris Burge & Samuel Karlin
- Combines coding region statistics, splice signal predictions, and
motifs near the gene into a single HMM framework.
- E.g. a splice signal prediction is more believable if signs of a
coding region appear on one side but not the other.
- Accuracy of prediction decreases with short exons or unusual
codon usage.
- Limited to predicting genes in human/vertebrate DNA.
Other gene prediction algorithms include:
GeneSplicer – predicts splice sites in eukaryotic genomic DNA
GrailEXP – predicts genes, exons, promoters, CpG islands, ESTs, repetitive
elements, etc.
Glimmer – Gene Locator and Interpolated Markov ModelER; finds genes in microbial DNA
MZEF
TigrScan – gene finder based on a hidden Markov model architecture
FGeneH, Hexon
Genie
Promoter prediction tools
-PromoterFinder
-PromoterInspector
-Dragon
-ModelInspector
-Eponine
Repeats:
- FORRepeats (http://al.jalix.org/FORRepeats)
- REPuter (http://www.genomes.de)
- GCG / EMBOSS (UNIX/Linux platform; needs prior information)
- Spectral Repeat Finder
Because intron size and number vary from one organism to another,
gene prediction in eukaryotes depends on recognizing known exons.
This can be done by comparing the unknown DNA sequence, in all
reading frames, with known protein-coding gene sequences or ESTs in
databases, using a homology-based alignment search.
Multiple sequence alignments

• Main goal is generally to align regions of similar structural or
functional importance among sequences
• Sequence similarity is an important indicator of related function
• Most popular tool: ClustalW
BLAST Search (Basic Local Alignment Search Tool)
Pairwise, ungapped alignment between the query sequence and database sequences
• FastA
– Better for nucleotides than for proteins
– Local alignments (heuristic)
• BLAST - Basic Local Alignment Search Tool
– Better for proteins than for nucleotides
– Local alignments
• Smith-Waterman
– More sensitive than FastA or BLAST
– Local alignments
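The Smith-Waterman local alignment listed above can be written as a short dynamic-programming routine. This is a minimal sketch with a linear gap penalty; the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) and the function name are assumptions for this example, and real searches would use substitution matrices and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Fill the score matrix; a cell never drops below 0 (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until a 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

if __name__ == "__main__":
    print(smith_waterman("GGGACGTCCC", "TTACGTAA"))
```

The full matrix makes the method exhaustive but quadratic in time and memory, which is why heuristic tools such as FastA and BLAST are used for database-scale searches.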
