Presented by:
Soneela Anjum(459-FBAS/BSBI/F15)
Maheera Amjad(478-FBAS/BSBI/F15)
Gene:
A unit of hereditary material that is transferred
from a parent to offspring and is held to determine some
characteristic of the offspring.
Or
A sequence of nucleotides coding for a protein.
The coding region of a gene encodes instructions that
allow a cell to produce a specific protein or enzyme. Early
estimates put the number of human genes between 50,000 and
100,000 (current estimates are closer to 20,000 protein-coding
genes), with each gene made up of hundreds to hundreds of
thousands of chemical bases.
Gene Prediction:
Gene prediction or gene finding refers to the process
of identifying the regions of genomic DNA that
encode genes.
It is all about detecting coding regions and inferring gene
structure.
The process includes detection of the location of open
reading frames (ORFs) and delineation of the
structures of introns as well as exons if the genes of
interest are of eukaryotic origin.
Contd.
In the field of bioinformatics, gene
identification from large DNA sequences is
known to be a significant challenge.
The ultimate goal is to describe all the genes
computationally with near 100% accuracy.
Gene finding is one of the first and most
important steps in understanding the genome
of a species once it has been sequenced.
Background
In the 1960s it was discovered that the sequence of
codons in a gene determines the sequence of
amino acids in a protein.
We are all familiar with the central dogma of life.
Contd.
The central dogma of molecular biology deals with the
detailed residue-by-residue transfer of sequential
information. It states that such information cannot be
transferred from protein to either protein or nucleic acid.
(Francis Crick)
Originally stated in 1958, but questioned in the 1960s due
to evidence of viral RNA-to-DNA transfer.
These events and the accompanying experimental work led to
the development of gene prediction.
Computational Gene Prediction:
We provide a DNA sequence as input:
S = (s1, s2, ..., sn) ∈ Σ*,
in which Σ = {'A', 'C', 'G', 'T'}.
The output is an accurate labeling of each element in S as
belonging to a coding region (exon), an intergenic region, or a
non-coding region (intron).
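To make the input and output concrete, here is a minimal sketch of the labeling. The function name label_sequence, the one-letter labels (E, I, N), and the half-open 0-based coordinates are illustrative assumptions, not part of any particular predictor.

```python
def label_sequence(length, exons, gene_span):
    """Label each position as exon (E), intron (I) or intergenic (N).

    Coordinates are 0-based and half-open (start, end); this convention
    and the one-letter labels are assumptions made for illustration.
    """
    labels = ["N"] * length            # start with everything intergenic
    gene_start, gene_end = gene_span
    for i in range(gene_start, gene_end):
        labels[i] = "I"                # inside the gene: intron by default
    for exon_start, exon_end in exons:
        for i in range(exon_start, exon_end):
            labels[i] = "E"            # exons overwrite the intron label
    return "".join(labels)


# A 12-base toy sequence with one gene (positions 2-9) and two exons.
print(label_sequence(12, [(2, 5), (8, 10)], (2, 10)))
```

A gene predictor has to recover such a label string from the bases alone; here it is built from known coordinates only to show the target format.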
Gene prediction is difficult
DNA sequence signals have low information
content (degenerate and highly
unspecific).
It is difficult to discriminate real signals from false ones.
Sequencing errors.
Very short exons (as short as 3 bp), especially initial exons.
Many very long introns.
Gene Prediction Limits
Existing predictors are for protein-coding regions
Non-coding areas are not detected (5' and 3'
UTRs)
Non-coding RNA genes are missed
Predictions are for typical genes
Partial genes are often missed
Training sets may be biased
Atypical genes use other grammars
Prokaryotic Gene Prediction
High gene density and simple gene
structure.
Short genes have little information.
Overlapping genes.
No introns.
GENE PREDICTION IN PROKARYOTES
Prokaryotes, which include bacteria and archaea, have
small genomes with sizes ranging from 0.5 to 10 Mbp.
Promoters are DNA segments upstream of transcribed regions
at which transcription is initiated.
1  atg ccc aag ctg aat agc gta gag ggg ttt tca tca ttt gag gac gat gta taa
    M   P   K   L   N   S   V   E   G   F   S   S   F   E   D   D   V   *
2  tgc cca agc tga ata gcg tag agg ggt ttt cat cat ttg agg acg atg tat
    C   P   S   *   I   A   *   R   G   F   H   H   L   R   T   M   Y
3  gcc caa gct gaa tag cgt aga ggg gtt ttc atc att tga gga cga tgt ata
    A   Q   A   E   *   R   R   G   V   F   I   I   *   G   R   C   I
Contd.
The three reading frames in the forward direction are
shown with the translated amino acids below
each DNA sequence.
Frame 1 starts with the "a", Frame 2 with the "t" and
Frame 3 with the "g".
Stop codons are indicated by an "*" in the protein
sequence.
The longest ORF is in Frame 1.
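The three-frame translation above can be reproduced with a short sketch. It uses the standard genetic code and treats an ORF as a stretch from the first ATG (M) to the next in-frame stop codon. The function names translate and longest_orf are illustrative, and only the forward strand is scanned, as on the slide.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna, frame):
    """Translate one forward reading frame (frame = 0, 1 or 2)."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def longest_orf(dna):
    """Return (frame number 1-3, protein) for the longest complete ORF."""
    best = (0, "")
    for frame in range(3):
        # Split at stop codons; the last piece has no stop, so drop it.
        for segment in translate(dna, frame).split("*")[:-1]:
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best[1]):
                best = (frame + 1, segment[start:])
    return best


SEQ = "atgcccaagctgaatagcgtagaggggttttcatcatttgaggacgatgtataa"
for frame in range(3):
    print(frame + 1, translate(SEQ, frame))
print(longest_orf(SEQ))
```

A real ORF finder would also scan the three reverse-complement frames and usually applies a minimum length cutoff.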
Gene prediction using Markov models (MM) and hidden Markov models (HMM):
An MM describes the probability distribution of
nucleotides in a gene.
k is the order of a Markov model.
A zero-order MM assumes that each base occurs
independently.
A first-order MM assumes that the occurrence of any
base depends on the base preceding it.
Contd.
A second-order MM assumes dependence on the previous two bases.
And so on.
The approach assumes that oligonucleotide distributions in
coding regions differ from those in non-coding
regions.
Effective Markov models are built on sets of three
nucleotides (codon-sized units).
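As a rough illustration of the first-order case, the sketch below trains one transition table on coding examples and one on non-coding examples, then scores a query by its log-likelihood ratio. The training sequences, the add-one pseudocount, and the function names are toy assumptions; the first base's own probability is ignored for simplicity.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_first_order(sequences, pseudocount=1.0):
    """Estimate P(next base | previous base) with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1.0
    model = {}
    for prev in ALPHABET:
        total = sum(counts[prev].values()) + pseudocount * len(ALPHABET)
        model[prev] = {nxt: (counts[prev][nxt] + pseudocount) / total
                       for nxt in ALPHABET}
    return model


def log_odds(seq, coding_model, noncoding_model):
    """Sum of log ratios of transition probabilities; > 0 favors coding."""
    return sum(math.log(coding_model[p][n] / noncoding_model[p][n])
               for p, n in zip(seq, seq[1:]))


# Toy training data (assumed for illustration only).
coding = train_first_order(["ATGGCGGCGGCG"])
noncoding = train_first_order(["ATATATATATAT"])

print(log_odds("GCGGCG", coding, noncoding))   # positive: looks coding
print(log_odds("TATATA", coding, noncoding))   # negative: looks non-coding
```

Higher-order models condition on more preceding bases, which needs more training data per context; that trade-off is one reason the order is kept small in practice.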
Regulatory Sequences:
DNA elements that are present near the start sites of
genes.
Serve as binding sites for the gene transcription
machinery.
Should not be confused with the translation start
sites (start codons, which encode methionine).
Directly regulate gene expression.
Each gene has a unique set of regulatory sequences.
Contd.
CAAT box
Operator
Pribnow box
TATA box
A-box
Z-box
C-box
E-box
G-box
Promoter regions in Prokaryotes:
In prokaryotes, promoter and
regulatory segments are
located about 35 and 10 base pairs
upstream of the transcription
start site.
They are referred to as the -35 and -10 boxes.
The -35 box has the consensus sequence
TTGACA.
The -10 box has the consensus sequence
TATAAT (Pribnow box).
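A minimal sketch of scanning for this pattern follows: a near-match to TTGACA followed, after a spacer, by a near-match to TATAAT. The spacer range of 16 to 18 bp and the one-mismatch tolerance are assumptions chosen for illustration; they are not stated above, and real promoter finders use weighted position matrices rather than a mismatch count.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_promoters(dna, max_mismatches=1, spacer_range=(16, 18)):
    """Return (pos_35, pos_10) index pairs of candidate promoters.

    spacer_range and max_mismatches are illustrative assumptions.
    """
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - 5):
        if mismatches(dna[i:i + 6], "TTGACA") > max_mismatches:
            continue
        for spacer in range(spacer_range[0], spacer_range[1] + 1):
            j = i + 6 + spacer
            box10 = dna[j:j + 6]
            if len(box10) == 6 and mismatches(box10, "TATAAT") <= max_mismatches:
                hits.append((i, j))
    return hits


# Toy sequence: -35 box, a 17 bp spacer, then the -10 box.
toy = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GGGG"
print(find_promoters(toy))
```

Because both boxes are short and degenerate, a scan like this produces many false positives on real genomes, which is the low-information-content problem noted earlier.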
Promoter regions in Eukaryotes:
The core of many eukaryotic promoters is the TATA
box, located about 30 bp upstream of the transcription start site.
Many genes have an initiator sequence (Inr).
TATA box:
Consensus sequence TATA(A/T)A(A/T).
Inr (initiator sequence):
Pyrimidine-rich sequence (C/T)(C/T)CA(C/T)(C/T).
Contd.
CAAT box:
Also referred to as the CAT box.
Consensus sequence GGCCAATCT.
Located 60-100 bases upstream of the transcription start site.
GC box:
Consensus sequence GGGCGG.
Found upstream of the TATA box.
~110 bases upstream of the transcription initiation site.
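The consensus sequences above translate directly into regular expressions, with (A/T) becoming the character class [AT]. The sketch below is an assumption-laden illustration: it reports exact consensus matches only, on the forward strand, and ignores the expected distances from the transcription start site.

```python
import re

# Consensus patterns taken from the slides, written as regular expressions.
MOTIFS = {
    "TATA box": "TATA[AT]A[AT]",
    "Inr": "[CT][CT]CA[CT][CT]",
    "CAAT box": "GGCCAATCT",
    "GC box": "GGGCGG",
}


def scan_motifs(dna):
    """Return (name, position, matched text) tuples sorted by position."""
    dna = dna.upper()
    hits = []
    for name, pattern in MOTIFS.items():
        # A lookahead lets overlapping occurrences be reported too.
        for match in re.finditer(f"(?=({pattern}))", dna):
            hits.append((name, match.start(), match.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


# Toy upstream region containing one copy of each motif.
toy_region = "GGGCGG" + "AAAA" + "GGCCAATCT" + "AAAA" + "TATAAAA" + "GGGG" + "CTCATT"
for hit in scan_motifs(toy_region):
    print(hit)
```

Exact matching is strict; real promoter elements often deviate from the consensus, so practical tools score partial matches instead.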
GenScan:
A bioinformatics tool used to identify
complete gene structures.
Can be used to predict the locations of genes
and their exon-intron boundaries in genomic
sequences.
The GENSCAN Web server can be found at MIT.
GENEID
Geneid is one of the oldest gene structure prediction
programs, recently updated to a new and faster
version.
Uses a hierarchical search structure (signal → exon →
gene).
1st: finds signals (splice sites, start and stop codons).
2nd: starting from the signals found, scores regions for
exon-defining signals and protein-coding potential.
3rd: a dynamic programming algorithm is used to
search the space of predicted exons to assemble the
gene structure.
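The third step can be illustrated with a generic dynamic programming sketch that picks the highest-scoring chain of non-overlapping candidate exons. This is not geneid's actual algorithm or scoring scheme: the exon tuples, the scores, and the function name assemble_gene are hypothetical, and frame compatibility between exons is ignored.

```python
from bisect import bisect_right


def assemble_gene(exons):
    """Pick the best-scoring set of non-overlapping exons.

    exons: list of (start, end, score) with inclusive coordinates.
    Returns (total_score, chosen_exons). Generic weighted-interval
    chaining, used here only to illustrate the idea.
    """
    exons = sorted(exons, key=lambda e: e[1])
    ends = [e[1] for e in exons]
    # best[k] holds the best (score, chain) using only the first k exons.
    best = [(0.0, [])]
    for k, (start, end, score) in enumerate(exons):
        # p = number of earlier exons that end before this one starts.
        p = bisect_right(ends, start - 1, 0, k)
        take_score = best[p][0] + score
        if take_score > best[k][0]:
            best.append((take_score, best[p][1] + [(start, end, score)]))
        else:
            best.append(best[k])
    return best[-1]


# Hypothetical candidate exons with made-up scores.
candidates = [(1, 100, 5.0), (50, 150, 8.0), (120, 200, 4.0), (160, 300, 6.0)]
print(assemble_gene(candidates))
```

The key design point is that each exon is scored once and the best partial assembly is reused, so the search avoids enumerating every combination of exons.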
Contd.
Very fast, scaling linearly with the length of the
sequence (in both time and memory), so it is suited to
analyzing large sequences.
Trained on Drosophila and human data.
Available at http://www1.imim.es/geneid.html
Species:
Homo sapiens
Command:
geneid -P human.param -G -o geneid4840.fasta
Running time:
0.13 secs