20-04-10

Genomics
Study of Genome (complete set of genetic information of an organism)

Structural genomics: 3D structure of all proteins encoded
Functional genomics: assess the gene function

Computational genomics:
• gene prediction
• bio-sequence comparison
• prediction of promoters
• prediction of orthologous genes
• prediction of transcription factor binding motifs
• prediction of genome rearrangement
• prediction of operons
• prediction of SNPs and haplotype analysis
• prediction of simple and complex repeats
• …
To understand genomes, the following distinct types of
analysis are required:
1. Acquisition of sequence data
Data are acquired as many individual sequences (reads) of 500-
800 bp, which must be assembled into the contiguous
genome sequence.

2. Analysis of the genome sequence to locate the genes,
control sequences, and promoters

3. Computational analyses
These aid genomics and post-genomic research.
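The assembly step in item 1 can be sketched with a toy greedy overlap merger. This is only an illustration under simplifying assumptions (error-free reads, one strand, no repeats); the function names `overlap` and `greedy_assemble` and the `min_len` parameter are made up for this example and do not come from any real assembler.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))
```

Real reads are 500-800 bp, contain errors, and come from both strands, so production assemblers use quality values and much more careful overlap detection than this sketch.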
A Microbial Genome Sequencing Project

Random sequencing → Genome Assembly → Annotation → Data Release

Random sequencing:
• Library construction
• Colony picking
• Template preparation
• Sequencing reactions
• Base calling
• Sequence files
• Sample tracking

Genome Assembly:
• TIGR Assembler
• Genome scaffold
• Combinatorial PCR, POMP
• Ordered contig set
• Gap closure
• Sequence editing
• Re-assembly
• ONE ASSEMBLY!

Annotation:
• Gene finding
• Homology searches
• Initial role assignments
• Metabolic pathways
• Gene families
• Comparative genomics
• Transcriptional/translational regulatory elements
• Repetitive sequences

Data Release:
• Publication
• www.tigr.org
What is the Procedure for Gene Prediction?

Obtain new genomic DNA sequence
→ 1. Translate in all six reading frames and compare to protein
sequence database
→ 2. Perform database similarity search of expressed sequence
tag (EST) database of same organism, or cDNA sequences if available
→ Use gene prediction programme to locate genes
→ Analyze regulatory sequences in the gene
The term DNA sequencing refers to methods for determining
the order of the nucleotide bases (adenine, guanine, cytosine, and
thymine) in a molecule of DNA.

Sequencing can be done by the following methods:

• Chain termination / Sanger-Coulson method (1977)
- using nucleotide analogs (dideoxynucleotides)
- attaching fluorescent dyes to nucleotides
(Hood and co-workers, 1986)

• Automated sequencing (1987, Applied Biosystems)

A shotgun or clone-by-clone strategy may be applied for this
purpose.
Two Competing Strategies for the Human Genome

• Hierarchical shotgun [Public human genome project]

• Whole-genome shotgun [Celera project]
Figure: Two main shotgun-sequencing strategies
Figure: Main steps in clone-by-clone shotgun sequencing
Phred
• The phred software reads DNA sequencing trace files, calls bases,
and assigns a quality value to each called base.
• The quality value is a log-transformed error probability, specifically
Q = -10 log10(Pe), where Q and Pe are the quality value and error
probability of a particular base call. The phred quality values have
been thoroughly tested for both accuracy and power to discriminate
between correct and incorrect base-calls.
• Phred can use the quality values to perform sequence trimming.
• Phred works well with trace files from the following manufacturers'
sequencing machines: Amersham Biosciences, Applied Biosystems,
Beckman Instruments, and LI-COR Life Sciences. See the phred
documentation for specific compatibility information.
• Phred runs on most computers and operating systems including
Apple Mac OS X, *BSD, Hewlett-Packard HP-UX, HP-Compaq Tru64,
IBM AIX, Linux, Microsoft Windows, Silicon Graphics IRIX, and SUN
Solaris.
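The relation between a phred quality value and its error probability, Q = -10 log10(Pe), can be illustrated with a few lines of Python. The end-trimming helper is a deliberately naive assumption made for this example (it just drops low-quality bases from both ends); it is not phred's own trimming algorithm, and all function names are hypothetical.

```python
import math


def phred_quality(error_prob):
    """Quality value Q for a base-call error probability Pe."""
    return -10.0 * math.log10(error_prob)


def error_probability(quality):
    """Inverse mapping: error probability Pe for a quality value Q."""
    return 10.0 ** (-quality / 10.0)


def trim_low_quality_ends(bases, quals, min_q=20):
    """Naive trimming: strip bases below min_q from both ends of the read."""
    start, end = 0, len(bases)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return bases[start:end], quals[start:end]


print(phred_quality(0.001))      # 1 error in 1000 calls
print(error_probability(20))     # quality 20
print(trim_low_quality_ends("ACGTAC", [5, 10, 30, 40, 35, 8]))
```

With this scale, Q20 corresponds to a 1 in 100 chance of a wrong base call and Q30 to 1 in 1000.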
Phrap
• Phrap is a program for assembling shotgun DNA sequence data.

• It allows use of the entire read and not just the trimmed high
quality part.

• It uses a combination of user-supplied and internally computed
data quality information to improve assembly accuracy in the
presence of repeats.

• It constructs the contig sequence as a mosaic of the highest
quality read segments rather than a consensus.

• It provides extensive assembly information to assist in
troubleshooting assembly problems, and it handles large datasets.
Figure: Shotgun-sequence assembly.
Figure: Long-range sequence assembly in whole-genome shotgun sequencing.
STS (Sequence Tagged Site) Mapping
• The most powerful technique for physical mapping.
• An STS is a short DNA sequence (100-500 bp).
• It must occur only once in the genome being studied.
• To map STS markers, a collection of overlapping DNA fragments
from the genome is needed. This collection is called the mapping
reagent and is usually a clone library.
• Sources may be:
- Expressed sequence tags (ESTs), obtained by analyzing
cDNA, which therefore represent expressed genes. An EST can be used
as an STS if it comes from a unique gene and not from a
family of related/similar genes.
- Random genomic sequences, obtained by sequencing
random cloned genomic DNA
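The uniqueness requirement for an STS can be checked computationally once a genome sequence is available. The sketch below counts occurrences of a candidate on both strands; the function names and the tiny example sequences are invented for illustration (a real STS is 100-500 bp, not 7 bp).

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_occurrences(genome, probe):
    """Count (possibly overlapping) matches of probe on both strands."""
    def count(text, pattern):
        n, pos = 0, text.find(pattern)
        while pos != -1:
            n += 1
            pos = text.find(pattern, pos + 1)
        return n

    total = count(genome, probe)
    rc = reverse_complement(probe)
    if rc != probe:  # do not double-count palindromic probes
        total += count(genome, rc)
    return total


def is_valid_sts(genome, candidate):
    """An STS candidate must occur exactly once in the genome."""
    return count_occurrences(genome, candidate) == 1


genome = "ATGCCGTAGGATTACAGGCATCC"
print(is_valid_sts(genome, "GATTACA"))  # occurs once
print(is_valid_sts(genome, "GC"))       # occurs more than once
```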
A large number of gene and protein databases are now available, and
homologous sequences for a given gene or protein-coding sequence can
be found in them.
Software programs such as BLAST and FASTA align sequences, and such
homology searching provides clues to the potential structure and
function of a given DNA sequence.
General DNA Sequence databases
EMBL – http://www.ebi.ac.uk
at the European Bioinformatics Institute (EBI), Cambridge, UK
GenBank – http://ncbi.nlm.nih.gov
at the National Institutes of Health (NIH), USA
DNA Data Bank of Japan (DDBJ), Mishima, Japan – http://ddbj.nig.ac.jp
Human Mapping Databases, Johns Hopkins, USA – http://gdbwww.gdb.org
Gene Prediction…
• Computational gene finding is a process of:

– Identifying common phenomena in known genes

– Building a computational framework/model that can accurately
describe the common phenomena

– Using the model to scan uncharacterized sequence to identify
regions that match the model, which become putative genes

– Test and validate the predictions
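The four steps above can be sketched with a very small statistical model: codon frequencies learned from known genes, compared against background sequence as a log-odds score. This is an illustrative assumption, not the model of any specific gene finder; the training sequences, pseudocount, and function names are all made up.

```python
import math
from collections import Counter

ALL_CODONS = [x + y + z for x in "ACGT" for y in "ACGT" for z in "ACGT"]


def codon_counts(seqs):
    """Count in-frame codons (frame 0) over a list of sequences."""
    counts = Counter()
    for s in seqs:
        counts.update(s[i:i + 3] for i in range(0, len(s) - 2, 3))
    return counts


def train_log_odds(coding, background, pseudocount=1.0):
    """Step 1 and 2: learn codon log-odds (coding vs. background)."""
    c, b = codon_counts(coding), codon_counts(background)
    c_total = sum(c[k] for k in ALL_CODONS) + pseudocount * 64
    b_total = sum(b[k] for k in ALL_CODONS) + pseudocount * 64
    return {k: math.log((c[k] + pseudocount) / c_total)
               - math.log((b[k] + pseudocount) / b_total)
            for k in ALL_CODONS}


def score_window(seq, model):
    """Step 3: score a window; positive means 'looks more like coding'."""
    return sum(model.get(seq[i:i + 3], 0.0) for i in range(0, len(seq) - 2, 3))


model = train_log_odds(["ATGGCCGCCGCCTAA"], ["ATATATATATATATA"])
print(score_window("GCCGCCGCC", model))   # resembles the training genes
print(score_window("ATAATAATA", model))   # resembles the background
```

Step 4 (testing and validation) would compare such scores against genes held out of the training set; real gene finders train on thousands of genes and use richer models.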


Different Types of Gene Finding

• RNA genes
– tRNA, rRNA, snRNA, microRNA

• Protein coding genes


– Prokaryotic
• No introns, simpler regulatory features
– Eukaryotic
• Exon-intron structure
• Complex regulatory features
Approaches to Gene Finding

Direct
– Exact or near-exact matches of EST, cDNA, or Proteins
from the same, or closely related organism
– Algorithm based searches investigate the nucleotide
composition and other intrinsic features of genomic DNA

Indirect
1. Look for something that looks like one gene (homology)
Rely on previously identified genes in other organisms

2. Look for something that looks like all genes (ab initio)

3. Hybrid, combining homology and ab initio (and perhaps
even direct) methods
Open Reading Frames: 6 possibilities

TCG TAC GTA GCT AGC TAG CTA


AGC ATG CAT CGA TCG ATC GAT
identical sequence

T CGT ACG TAG CTA GCT AGC TA


A GCA TGC ATC GAT CGA TCG AT

TC GTA CGT AGC TAG CTA GCT A


AG CAT GCA TCG ATC GAT CGA T
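The six reading frames of the example sequence can be generated and translated with a short script. This is a minimal sketch using the standard genetic code; the function names and the frame labels (+1 to +3, -1 to -3) are choices made for this example. Note that the slide writes the complementary strand 3'→5' beneath the top strand, while the code reports the reverse strand 5'→3'.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def codons(seq, offset):
    """Split seq into complete codons starting at offset (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frame_translation(seq):
    """Translate all three frames of each strand; 'X' marks unknown codons."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            protein = "".join(CODON_TABLE.get(c, "X") for c in codons(s, offset))
            frames[strand + str(offset + 1)] = protein
    return frames


seq = "TCGTACGTAGCTAGCTAGCTA"
print(" ".join(codons(seq, 0)))
for name, protein in six_frame_translation(seq).items():
    print(name, protein)
```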
cDNA is made from mRNA

1. Mature mRNA: Start ... Stop AAAAAAA
2. Add polyT primer (TTTTTTT), nucleotides, and reverse transcriptase
3. DNA/RNA hybrid: AAAAAAA paired with TTTTTTT
4. RNA removed (by NaOH) and second strand synthesized
5. Complementary DNA (cDNA)
A full length cDNA is hard to find

mRNA: Start [Open Reading Frame (ORF)] Stop AAAAAAA
mRNA is degraded from the 5’ end.

Most cDNAs are not full length (flcDNA)
and the ORF is incomplete (partial).
cDNA (EST) libraries have few flcDNAs

cDNA libraries are made and individual clones are
sequenced at random.

- A sequenced cDNA is called an Expressed Sequence Tag (EST)
- Millions of ESTs from different tissues of different organisms are
stored in GenBank
- But only a few are full length cDNAs!
- How, and where, to find the longest ones?
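One simple way to flag candidate full-length clones is to look for a complete ORF (ATG through an in-frame stop codon) in each EST and rank the ESTs by ORF length. The sketch below scans only the three forward frames; the function names and the `min_codons` parameter are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) of ATG...stop ORFs in the three forward frames.

    end is exclusive and includes the stop codon.
    """
    orfs = []
    for offset in range(3):
        start = None
        for i in range(offset, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                      # first start codon after a stop
            elif codon in STOP_CODONS and start is not None:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


def longest_complete_orf(seq):
    """Longest complete ORF, or None if the sequence has no ATG...stop."""
    return max(find_orfs(seq), key=lambda o: o[1] - o[0], default=None)


full = "CCATGAAATTTGGGTAACC"       # start codon present
truncated = "AAATTTGGGTAACC"       # 5' end (including ATG) missing
print(longest_complete_orf(full))
print(longest_complete_orf(truncated))
```

A 5'-truncated EST like the second example has no start codon upstream of the stop, which is the "partial ORF" situation described above.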
EST Division: Expressed Sequence Tags
dbEST http://www.ncbi.nlm.nih.gov/dbEST/

ESTs
nucleus: 80-100,000 genes
→ RNA: 80-100,000 gene products
→ make cDNA library: 80-100,000 unique cDNA clones in library
→ isolate unique clones
→ sequence once from each end
(clone xyz: sequence1 TAGTCA, sequence2 CGTACT)
Prokaryotic and Eukaryotic Biology Affect ORF-Based Gene Prediction
GLIMMER: A Microbial Gene Finder
• GLIMMER 2.0: released late 1999
• > 200 site licenses worldwide
• Works on bacteria, archaea, viruses too
• Malaria (eukaryotic) version: GLIMMERM
• Refs: Salzberg et al., NAR, 1998, Genomics 1999; Delcher et al.,
NAR, 1999
• Web site and code: http://www.tigr.org/
GENSCAN: 1997 by Chris Burge & Samuel Karlin
- Combines coding region statistics, splice signal predictions, and
motifs near the gene into a single HMM framework.
- E.g. a splice signal prediction is more believable if signs of a
coding region appear on one side but not the other.
- Accuracy of prediction decreases with short exons or unusual
codon usage.
- Limited to predicting genes in human/vertebrate DNA.
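The idea of decoding a sequence with a hidden Markov model can be shown with a toy two-state model (N = non-coding, C = coding) and the Viterbi algorithm. Every probability below is invented for the example, and the model is far simpler than GENSCAN's, which has many states for exons, introns, and signals; the sketch only shows how an HMM assigns the most probable state to each base.

```python
import math

STATES = ("N", "C")
START = {"N": 0.5, "C": 0.5}
# Made-up parameters: states are "sticky", coding is GC-rich.
TRANS = {"N": {"N": 0.9, "C": 0.1},
         "C": {"N": 0.1, "C": 0.9}}
EMIT = {"N": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
        "C": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}}


def viterbi(seq):
    """Most probable state path for a non-empty sequence (log space)."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best previous state from which to enter state s.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


seq = "ATATATAT" + "GCGCGCGCGC" + "ATATATAT"
print(seq)
print(viterbi(seq))
```

Because switching states is penalized, a single stray G or C inside an AT-rich stretch does not flip the label; only a sustained change in composition does. This mirrors the point above that a signal is more believable when the surrounding evidence agrees with it.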
Other gene prediction algorithms include:
GeneSplicer – predicts splice sites in eukaryotic genomic DNA
GrailEXP – predicts genes, exons, promoters, CpG islands, ESTs,
repetitive elements, etc.
Glimmer – gene locator (uses interpolated Markov models) to find
genes in microbial DNA
MZEF
TigrScan – gene finder based on a hidden Markov model architecture
FGeneH, Hexon
Genie
Promoter prediction tools

-PromoterFinder

-PromoterInspector

-Dragon

-ModelInspector

-Eponine

Repeats:

- FORRepeats (http://al.jalix.org/FORRepeats)
- REPuter (http://www.genomes.de)
- GCG/EMBOSS (UNIX/Linux platform; needs prior information)
- Spectral Repeat Finder
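Simple (tandem) repeats, the easiest class to detect, can be found with a regular expression. This is only a sketch of the idea and not how any of the tools listed above work; the function name and the `max_unit` and `min_copies` parameters are assumptions for this example, and only perfect repeats are reported.

```python
import re


def find_tandem_repeats(seq, max_unit=6, min_copies=3):
    """Find perfect tandem repeats with a unit of 1..max_unit bases.

    Returns a list of (start, unit, copies). The lazy quantifier makes
    the regex prefer the shortest repeating unit at each position.
    """
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    return [(m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
            for m in pattern.finditer(seq)]


print(find_tandem_repeats("GGACACACACGG"))
print(find_tandem_repeats("TTTTTT"))
```

Complex or degenerate repeats (with mismatches or interspersed copies) need index-based or spectral methods such as those in the tools above.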
Because intron size and number vary from one organism to another,
gene prediction in eukaryotes depends on recognizing known exons.
This can be done by comparing the unknown DNA sequence, in all
reading frames, with known protein-coding gene sequences or ESTs in
databases, using a homology-based alignment search.

Multiple sequence alignments


• Main goal is generally to align regions of similar structural or
functional importance among sequences
• Sequence similarity is an important indicator of related function
• The most popular tool is ClustalW
BLAST Search (Basic Local Alignment Search Tool)
Pairwise, ungapped alignment of a query sequence against a database

• FastA
– Better for nucleotides than for proteins
– Local alignments (heuristic)
• BLAST - Basic Local Alignment Search Tool
– Better for proteins than for nucleotides
– Local alignments (heuristic)
• Smith-Waterman
– More sensitive than FastA or BLAST
– Local alignments (exact dynamic programming)
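The Smith-Waterman local alignment can be written in a few dozen lines. The scoring values (match +2, mismatch -1, linear gap -2) are arbitrary choices for this sketch; real searches use substitution matrices and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,       # gap in b
                          H[i][j - 1] + gap)       # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, x, y = smith_waterman("TTTACGTACGTTT", "GGGACGTACGGGG")
print(score)
print(x)
print(y)
```

The matrix fill takes time proportional to the product of the two sequence lengths, which is why heuristic tools such as FastA and BLAST are used for database-scale searches despite being less sensitive.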
