
Ensembl gene annotation project (e!62 and e!63)
Homo sapiens (human, GRCh37 assembly)


Raw computes stage: Searching for sequence patterns, aligning proteins and cDNAs to the genome.
Approximate time: 3 weeks
The annotation process of the high-coverage human assembly began with the raw compute stage [Figure 1] whereby the genomic sequence was screened for sequence patterns including repeats using RepeatMasker [1] (version 3.2.5, with parameters '-nolow -species homo sapiens -s'), Dust [2] and TRF [3]. RepeatMasker and Dust combined masked 46.60% of the species genome.

[Figure 1 flowchart. Stages shown: Raw computes (46.6% repeat masked, 451720 Uniprot aligned); Proteins/cDNAs aligned, filter; cDNAs and ESTs aligned, filter; Add UTR to coding models; Ensembl gene set; Human Vega gene set; Merge; Final gene set.]

Figure 1: Summary of human gene annotation project.

Transcription start sites were predicted using Eponine-scan [4] and FirstEF [5]. CpG islands and tRNAs [6] were also predicted. Genscan [7] was run across RepeatMasked sequence and the results were used as input for UniProt [8], UniGene [9] and Vertebrate RNA [10] alignments by WU-BLAST [11]. Passing only Genscan results to BLAST is an effective way of reducing the search space and therefore the computational resources required. This resulted in 451720 UniProt, 352139 UniGene and 342282 Vertebrate RNA sequences aligning to the genome.

Targetted stage: Generating coding models from human evidence

Approximate time: 7 weeks
Next, human protein and cDNA sequences were downloaded from public databases (UniProt SwissProt/TrEMBL [8] and RefSeq [9] for proteins, ENA/Genbank/DDBJ and RefSeq [9] for cDNAs) and filtered to remove sequences based on predictions. The human protein sequences were first mapped to rough locations in the genome using Pmatch to reduce the search space for the subsequent Genewise step, as indicated in [Figure 2]. Models of the coding sequence (CDS) were produced from the proteins using Genewise [13], which was run with four different sets of parameters to accommodate cases where some coding models contain non-canonical (non GT/AG) splice sites. In parallel to the Genewise step, human cDNAs with known CDS start/end coordinates were aligned to the genome using exonerate (cdna2genome model) [12] to generate coding models [Figure 2]. Additionally, pre-aligned annotated cDNAs were re-aligned to unmasked genomic regions. This approach helped in discovering small exons which may have been ignored by exonerate because of their size [Figure 2]. Because all cDNAs used in this step had known pairing with proteins (e.g. RefSeq cDNAs with accession prefix 'NM_' matching RefSeq proteins with 'NP_' prefix), it allowed the comparison of coding models generated by exonerate for a given cDNA to those generated by Genewise using its counterpart protein. The Apollo software [15] was used to visualise the results of filtering. Where one protein sequence had generated more than one candidate coding model at a locus, the BestTargetted module was used to select the coding model that most closely matched the source protein to take through to the next stage of the gene annotation process.

The generation of transcript models using species-specific (in this case, human) data is referred to as the 'Targetted stage'. This stage resulted in 120658 coding models built from 41383 human proteins and 67232 cDNAs which were taken through to the UTR addition stage.

Figure 2: Targetted stage using human protein and cDNA sequences.

Similarity stage: Generating additional coding models using proteins from related species


Approximate time: 2 weeks
Following the human Targetted alignments, additional coding models were generated as follows. The UniProt alignments from the Raw Computes step were filtered to retain only those sequences belonging to UniProt's 'Mammalia' and 'Vertebrata' taxonomical classes as well as Uniprot's Protein Existence (PE) classification level 1 and 2. In genomic regions which were not covered by any coding models from Targetted alignments, WU-BLAST was rerun for the Uniprot protein sequences and the results were passed to Genewise [13] to build coding models. In most cases, multiple coding models built from different Uniprot proteins were generated in a single locus, each model with a slightly different exon-intron structure. To filter for the best supported structures, the TranscriptConsensus module was used to compare each Genewise model against human cDNA and EST alignments in the region (see next section on how these alignments were generated), where exons in the Genewise model were scored for overlapping with exons of cDNA/EST alignments, and model(s) with the highest combined score in a region were kept. The generation of transcript models using data from related species is referred to as the 'Similarity stage' [Figure 3]. This stage resulted in 4452 and 1995 coding models supported by mammalian Uniprot proteins and non-mammalian vertebrate Uniprot proteins respectively.
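The exon-overlap scoring described for the Similarity stage can be illustrated with a small sketch. This is not the TranscriptConsensus module itself: the scoring rule (total overlapping bases), the function names and the coordinate convention are all assumptions made for illustration.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive; hypothetical convention


def overlap(a: Exon, b: Exon) -> int:
    """Number of bases shared by two exons (0 if they do not overlap)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def score_model(model: List[Exon], evidence: List[List[Exon]]) -> int:
    """Assumed score: bases of model exons overlapped by cDNA/EST exons."""
    return sum(overlap(exon, ev_exon)
               for exon in model
               for alignment in evidence
               for ev_exon in alignment)


def best_models(models: List[List[Exon]],
                evidence: List[List[Exon]]) -> List[List[Exon]]:
    """Keep the model(s) with the highest combined score in a region."""
    scores = [score_model(m, evidence) for m in models]
    top = max(scores)
    return [m for m, s in zip(models, scores) if s == top]
```

In this sketch, ties are all kept, mirroring the "model(s)" wording of the text; the real module may weight evidence differently.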

Figure 3: Alignment and filtering of mammalian and vertebrate proteins.

cDNA and EST alignments


Approximate time: 2-3 weeks
Human cDNA and EST sequences were previously downloaded from ENA/Genbank/DDBJ and RefSeq [9], clipped to remove polyA tails, and aligned to the genome using Exonerate (est2genome model) [Figure 4].

Figure 4: Alignment of human cDNAs and ESTs to the human genome.

221864 (of 276510) human cDNAs and 7297521 (of 8174393) human ESTs aligned to the genome. The coverage and percentage identity cut-offs for cDNA alignments were both set at 98%, which is higher than those for ESTs (90% coverage, 97% percentage identity) because cDNAs are generally less fragmented than ESTs. EST alignments were used to generate EST-based gene models similar to those for mouse [14] and these are displayed on the website in a separate track from the Ensembl gene set.
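The coverage and identity cut-offs quoted above amount to a simple per-alignment filter. The sketch below assumes that both thresholds are inclusive and that every alignment carries precomputed coverage and identity percentages; the `Alignment` class and its field names are hypothetical.

```python
from dataclasses import dataclass

# (minimum coverage %, minimum identity %) per sequence type, as quoted in the text
THRESHOLDS = {"cdna": (98.0, 98.0), "est": (90.0, 97.0)}


@dataclass
class Alignment:
    name: str
    kind: str        # "cdna" or "est"
    coverage: float  # percentage of the query sequence aligned
    identity: float  # percentage identity of the aligned region


def passes(aln: Alignment) -> bool:
    """True if the alignment meets both cut-offs (assumed inclusive)."""
    min_cov, min_id = THRESHOLDS[aln.kind]
    return aln.coverage >= min_cov and aln.identity >= min_id
```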

Filtering coding models


Approximate time: 2 weeks
The set of coding models was finalised after another stage of filtering, which involved manual removal of some more Targetted models supported by dubious human protein/cDNA evidence on a case-by-case basis, and removal, using a Perl script, of ~60% of Similarity alignments which contained non-canonical (non GT/AG) splice sites. The Apollo software [15] was used to visualise the results of filtering.
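A check for non-canonical splice sites of the kind described above might look as follows. The original filter was a Perl script that is not reproduced here; this Python sketch assumes forward-strand models, 0-based half-open coordinates and an in-memory genome string, all of which are illustrative choices.

```python
from typing import List, Tuple


def intron_is_canonical(genome: str, start: int, end: int) -> bool:
    """True if genome[start:end] begins with GT and ends with AG."""
    intron = genome[start:end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def has_noncanonical_splice(genome: str, exons: List[Tuple[int, int]]) -> bool:
    """exons: sorted (start, end) pairs, 0-based half-open, forward strand."""
    return any(
        not intron_is_canonical(genome, left_end, right_start)
        for (_, left_end), (right_start, _) in zip(exons, exons[1:])
    )
```

Reverse-strand models would need the reverse complement of the intron sequence before the same test is applied.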

Addition of UTR to coding models


Approximate time: 2 weeks
After finalising the set of coding models, those generated by Genewise alignments were extended into the untranslated regions (UTRs) using human cDNAs. Coding models generated by exonerate's cdna2genome (this includes the exonerate2genes_region approach, where pre-aligned cDNA sequences are aligned to unmasked genomic regions) already contained UTR annotations and hence did not go through this UTR addition step. Where available, human DiTag alignments were used to guide the positioning of UTRs and add additional weight to some UTR structures, while RefSeq 'NM' cDNA vs 'NP' protein pairing information was used to ensure the correct matching of cDNAs to coding models supported by RefSeq proteins. This resulted in 41017 (of 48223) coding models from 37007 human proteins with UTR, and 405 (of 2994) coding models from 370 Uniprot proteins with UTR.

Generating multi-transcript Ensembl genes


Approximate time: 4-5 weeks
The above steps generated a large set of potential transcript models, with or without UTR, many of which overlapped one another. Redundant transcript models were collapsed and the remaining unique set of transcript models were clustered into multi-transcript genes where each transcript in a gene has at least one coding exon that overlaps a coding exon from another transcript within the same gene. The resulting Ensembl gene set contained 23086 genes, of which 22369 contained transcripts supported by human cDNAs/proteins only (from the 'Targetted' stage of the build), and 717 contained transcripts supported by Uniprot proteins only from the 'Similarity' stage of the build [Figure 5]. The Ensembl genes were associated with a total of 52559 Ensembl transcripts, of which 51835 were supported by human cDNAs/proteins, and 724 had support from Uniprot proteins [Figure 6].
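The clustering rule above (transcripts belong to the same gene when they are linked by overlapping coding exons) is a connected-components problem. The following sketch uses a union-find structure; it assumes all transcripts lie on the same chromosome and strand, and the function names are hypothetical rather than taken from the Ensembl code base.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive coordinates


def exons_overlap(a: Exon, b: Exon) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


def cluster_transcripts(transcripts: Dict[str, List[Exon]]) -> List[List[str]]:
    """Group transcript names into genes by shared coding-exon overlap."""
    names = sorted(transcripts)
    parent = {n: n for n in names}

    def find(x: str) -> str:
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if any(exons_overlap(ea, eb)
                   for ea in transcripts[a] for eb in transcripts[b]):
                parent[find(b)] = find(a)

    clusters: Dict[str, List[str]] = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(clusters.values())
```

Note that two transcripts can end up in one gene without overlapping each other directly, as long as a third transcript overlaps both.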

[Pie chart "Evidence for Ensembl genes". Legend: human cDNAs/proteins; Uniprot mamm./vert. proteins.]

Figure 5: Supporting evidence for human Ensembl gene set.

[Pie chart "Evidence for Ensembl transcripts". Legend: human cDNAs/proteins; Uniprot mamm./vert. proteins.]

Figure 6: Supporting evidence for human Ensembl transcript set.

Pseudogenes, immunoglobulin genes, mitochondrial genes


Approximate time: 3 weeks
The Ensembl gene set was screened for pseudogenes and retrotransposed genes. Next, human immunoglobulin (Ig) genes were annotated using the Ensembl 'Ig genebuild' pipeline [16]. Briefly, human proteins and cDNAs for Ig genes were downloaded from IMGT [17] and aligned to the human genome using Exonerate. The Exonerate alignments were processed to join the V/D/J/C segments together into Ig gene models, which were then compared to the Ig genes already present in the Ensembl gene set (generated at the Targetted stage). If the models generated by the 'Ig genebuild' pipeline overlapped with existing Ensembl genes at the exon level, the existing Ensembl genes were replaced by the new Ig gene models, as the latter are usually more accurate representations of Ig genes. Also imported into the Ensembl gene set were annotations of mitochondrial genes in INSDC [18] and short non-coding RNAs (e.g. miRNAs, snoRNAs) generated by the ncRNA pipeline [19].

Merging Ensembl and Vega gene sets, annotating long intergenic non-coding RNA genes and generating the final gene set.
Approximate time: 12 weeks
Following the completion of the Ensembl gene set, Ensembl annotations and manual annotations (primarily generated by the HAVANA team at the Wellcome Trust Sanger Institute) from the Vega database [20, 21] were merged at the transcript level to create the final gene set. The Vega database (as of 12 September 2010) contained 39565 genes and 133451 transcripts. In the merge process, Ensembl and Vega transcripts were merged if they had identical exon-intron structures. If transcripts from the two annotation sources matched at all internal exon-intron boundaries, i.e. had identical splicing pattern, but one of them had longer terminal exons, usually the UTRs, they were merged too, but the resulting merged transcript would adopt the exon-intron structure of the Vega transcript as we prioritised Vega annotation over Ensembl. Transcripts which had not been merged, either because of differences in internal exon-intron boundaries or presence of transcripts in only one annotation source, were transferred from the source to the final gene set intact.

The Ensembl-Vega merge code also took into account the biotype and supporting evidence associated with the transcripts from both annotation sources. For a pair of transcripts to be merged, if there was a mismatch in biotype, e.g. the Ensembl transcript is protein-coding but the Vega counterpart is non-coding, the Vega biotype would have precedence over the Ensembl model and the Ensembl transcript would undergo a biotype change to match its Vega counterpart. The translation for the Ensembl transcript would then be removed if the transcript has lost its protein-coding biotype. Biotype conflicts between Ensembl and Vega were always reported to the HAVANA team for investigation, and when resolved, could improve the merged gene set in the future. As for supporting evidence, the merge of Ensembl and Vega transcripts also involved merging of protein/cDNA supporting evidence associated with the transcripts to ensure the basis on which the annotations were made would not be lost. Following the merge, long intergenic non-coding RNA genes (lincRNAs) were annotated by the Ensembl lincRNA pipeline [19] and incorporated in the final gene set.

An important feature of the merged gene set is the presence of all Vega source transcripts. This has been made possible by allowing Vega annotation to take precedence over Ensembl's when merging transcripts which do not match at their terminal exons or have different biotypes. Of all Vega transcripts, 18.3% of them were merged with Ensembl transcripts. The vast majority of merged transcripts (89.6%) are of protein-coding biotype. Vega transcripts which were not merged (82.7% of Vega source transcripts) were mostly alternative splice variants, pseudogenes or non-coding. These transcripts were fully transferred into the final gene set.

The final Ensembl-Vega set consisted of 44314 genes and 160002 transcripts. Of the 160002 transcripts, 15.3% (24492) were the result of merging Ensembl and Vega annotations, 16.1% (25718) originated from Ensembl, 68.5% (109650) originated from Vega, and the remaining ~0.4% were incorporated from other sources (e.g. immunoglobulin gene segments/transcripts imported from IMGT data).

As a quality-control measure, Ensembl translations of protein-coding transcripts in the final merged gene set were aligned against the NCBI RefSeq and Uniprot/SwissProt sets of public curated protein sequences (which were used in the 'Targetted' stage of the gene build) to calculate the proportion of curated sequences covered by the merged gene set. Over 99% of RefSeq and SwissProt proteins were represented in the merged gene set, and in the majority of cases, there was a 100% match between the curated protein and Ensembl translation.

Since Ensembl release 56 (September 2009), the Ensembl-Vega gene set has exactly corresponded to a GENCODE release [23]. The gene set in release 62, which this document describes, corresponds to GENCODE release 7. Each GENCODE release also contains the full annotation of the consensus coding sequence (CCDS) transcript models [24]. All CCDS models are included in each release of the human gene set.
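The transcript-level merge rule described in this section (merge when the internal exon-intron boundaries agree, then adopt the Vega structure and biotype) can be sketched as below. This is an illustrative reading of the text, not the Ensembl-Vega merge code; in particular, treating single-exon transcripts as mergeable only when their exons are identical is an assumption, and all names are hypothetical.

```python
from typing import List, Optional, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive coordinates on one strand


def intron_chain(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Introns implied by consecutive exons: the internal splicing pattern."""
    ordered = sorted(exons)
    return [(left[1] + 1, right[0] - 1)
            for left, right in zip(ordered, ordered[1:])]


def merge_pair(ensembl: List[Exon], vega: List[Exon],
               ensembl_biotype: str, vega_biotype: str
               ) -> Optional[Tuple[List[Exon], str, bool]]:
    """Return (exons, biotype, biotype_conflict) or None if not merged."""
    same_exons = sorted(ensembl) == sorted(vega)
    same_splicing = len(vega) > 1 and intron_chain(ensembl) == intron_chain(vega)
    if same_exons or same_splicing:
        # Vega takes precedence for both structure and biotype.
        return sorted(vega), vega_biotype, ensembl_biotype != vega_biotype
    return None
```

The returned conflict flag stands in for the reporting of biotype mismatches to the annotation team; unmerged pairs (a `None` result) would each be carried into the final set unchanged.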

Protein annotation, cross-referencing, stable Identifiers


Approximate time: 4 weeks
Before public release the transcripts and translations were given external references (cross-references to external databases), while translations were searched for domains/signatures of interest and labelled where appropriate. Stable identifiers were assigned to each gene, transcript, exon and translation. When annotating a species for the first time, these identifiers are auto-generated. In all subsequent annotations for a species, the stable identifiers are propagated based on comparison of the new gene set to the previous gene set.


Additional annotation and post genebuild filtering in Ensembl release 63


Addition of annotation on haplotype regions
Approximate time: 1-2 weeks
The annotation of the haplotype regions on chromosomes 6, 14 and 17 was added after the main reference genome had been annotated. Figure 7 shows the annotation pipeline, which closely follows the procedure described earlier. The annotation resulted in a final gene set of 2831 genes, of which 240 were pseudogenes or retrotransposed genes.

Figure 7: Workflow for the annotation of haplotype regions in chromosomes 6, 14 and 17.

Post genebuild filtering


Approximate time: 3-4 weeks
To eliminate and filter out poorly supported models that may have erroneously been included in the full annotation, the human gene set undergoes an additional filtering process after each annotation. This is to take advantage of the comparative genomics information that becomes available only after the first annotation has been released.


All models annotated by Ensembl were filtered systematically by a series of Perl scripts to remove models with erroneous structures. Examples of such scenarios would be where a model differed considerably in its internal structure compared to other models in the same locus, or if exons were missing or had non-consistent splice sites. In addition, models supported by cDNA fragments with wrongly annotated short open-reading frames were removed manually on a case-by-case basis. Further filtering of the models was done using the following criteria at gene level:
   o Lack of homologues
   o Single transcript
   o Lack of overlapping protein and cDNA alignments
   o Frameshifts

The filtering resulted in removal of 545 transcripts and 560 genes. Subsequently the Ensembl annotation was combined with the Vega annotation to produce the GENCODE gene set (release 8).

Further information on the Ensembl gene set


The main focus of the Ensembl automatic gene annotation pipeline is to generate a conservative set of protein-coding gene models, although some non-coding genes and pseudogenes may also be annotated. The Vega project [20, 21], on the other hand, focuses on manually annotating alternative splice variants for all genes and annotating a much wider range of gene/transcript types, including non-coding genes (e.g. processed transcripts, nonsense-mediated decay transcripts, polymorphic pseudogenes) [22]. Therefore, the Ensembl and Vega annotation approaches complement each other and by merging the Ensembl and Vega annotations, we aim to provide a more comprehensive final gene set for human. Every gene model produced by the Ensembl gene annotation pipeline is supported by biological sequence evidence (see the 'Supporting evidence' link on the left-hand menu of a Gene page or Transcript page); ab initio models are not included in our gene set. Ab initio predictions and the full set of cDNA and EST alignments to the genome are available on our website.


The quality of a gene set is dependent on the quality of the genome assembly. Genome assembly can be assessed in a number of ways, including:

1. Coverage estimate
   o A higher coverage usually indicates a more complete assembly.
   o Using Sanger sequencing only, a coverage of at least 2x is preferred.
2. N50 of contigs and scaffolds
   o A longer N50 usually indicates a more complete genome assembly.
   o Bearing in mind that an average human gene may be 10-15 kb in length, contigs shorter than this length will be unlikely to hold full-length gene models.
3. Number of contigs and scaffolds
   o A lower number of toplevel sequences usually indicates a more complete genome assembly.
4. Alignment of cDNAs and ESTs to the genome
   o A higher number of alignments, using stringent thresholds, usually indicates a more complete genome assembly.

More information on the Ensembl automatic gene annotation process can be found at:

Curwen V, Eyras E, Andrews TD, Clarke L, Mongin E, Searle SM, Clamp M. The Ensembl automatic gene annotation system. Genome Res. 2004, 14(5):942-50. [PMID: 15123590]

Potter SC, Clarke L, Curwen V, Keenan S, Mongin E, Searle SM, Stabenau A, Storey R, Clamp M. The Ensembl analysis pipeline. Genome Res. 2004, 14(5):934-41. [PMID: 15123589]

http://www.ensembl.org/info/docs/genebuild/genome_annotation.html

http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembldoc/pipeline_docs/the_genebuild_process.txt?root=ensembl&view=co
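The N50 measure listed among the assembly quality criteria above can be computed directly from a list of contig or scaffold lengths. The sketch below uses the common definition (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length); the function name is illustrative.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that brings the running sum to half the total.
        if running * 2 >= total:
            return length
    raise ValueError("n50() requires at least one sequence length")
```

For example, an assembly whose N50 is well below the 10-15 kb gene length mentioned above would be expected to fragment many gene models.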


References
1. Smit AFA, Hubley R & Green P: RepeatMasker Open-3.0. 1996-2010. www.repeatmasker.org

2. Kuzio J, Tatusov R, and Lipman DJ: Dust. Unpublished but briefly described in: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. A Fast and Symmetric DUST Implementation to Mask Low-Complexity DNA Sequences. Journal of Computational Biology 2006, 13(5):1028-1040.

3. Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999, 27(2):573-580. [PMID: 9862982] http://tandem.bu.edu/trf/trf.html

4. Down TA, Hubbard TJ: Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Res. 2002, 12(3):458-461. http://www.sanger.ac.uk/resources/software/eponine/ [PMID: 11875034]

5. Davuluri RV, Grosse I, Zhang MQ: Computational identification of promoters and first exons in the human genome. Nat Genet. 2001, 29(4):412-417. [PMID: 11726928]

6. Lowe TM, Eddy SR: tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997, 25(5):955-64. [PMID: 9023104]

7. Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997, 268(1):78-94. [PMID: 9149143]

8. Goujon M, McWilliam H, Li W, Valentin F, Squizzato S, Paern J, Lopez R: A new bioinformatics analysis tools framework at EMBL-EBI. Nucleic Acids Res. 2010, 38 Suppl:W695-699. http://www.uniprot.org/downloads [PMID: 20439314]

9. Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Landsman D, Lipman DJ, Lu Z, Madden TL, Madej T, Maglott DR, Marchler-Bauer A, Miller V, Mizrachi I, Ostell J, Panchenko A, Pruitt KD, Schuler GD, Sequeira E, Sherry ST, Shumway M, Sirotkin K, Slotta D, Souvorov A, Starchenko G, Tatusova TA, Wagner L, Wang Y, John Wilbur W, Yaschenko E, Ye J: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2010, 38(Database issue):D5-16. [PMID: 19910364]

10. http://www.ebi.ac.uk/ena/

11. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215(3):403-410. [PMID: 2231712]

12. Slater GS, Birney E: Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 2005, 6:31. [PMID: 15713233]

13. Birney E, Clamp M, Durbin R: GeneWise and Genomewise. Genome Res. 2004, 14(5):988-995. [PMID: 15123596]

14. Eyras E, Caccamo M, Curwen V, Clamp M. ESTGenes: alternative splicing from ESTs in Ensembl. Genome Res. 2004, 14(5):976-987. [PMID: 15123595]

15. Lewis SE, Searle SM, Harris N, Gibson M, Lyer V, Richter J, Wiel C, Bayraktaroglir L, Birney E, Crosby MA, Kaminker JS, Matthews BB, Prochnik SE, Smithy CD, Tupy JL, Rubin GM, Misra S, Mungall CJ, Clamp ME: Apollo: a sequence annotation editor. Genome Biol. 2002, 3(12):RESEARCH0082. [PMID: 12537571]

16. http://www.ensembl.org/info/docs/genebuild/ig_tcr.html

17. ftp://ftp.cines.fr/IMGT/IMGT.zip

18. http://www.ncbi.nlm.nih.gov/nuccore/NC_012920

19. http://www.ensembl.org/info/docs/genebuild/ncrna.html

20. http://vega.sanger.ac.uk/Homo_sapiens/Info/Index

21. L. G. Wilming, J. G. R. Gilbert, K. Howe, S. Trevanion, T. Hubbard and J. L. Harrow: The vertebrate genome annotation (Vega) database. Nucleic Acid Res. 2008 Jan; Advance Access published on November 14, 2007; doi:10.1093/nar/gkm987

22. http://vega.sanger.ac.uk/info/about/gene_and_transcript_types.html

23. Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D, et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 2006, 7 Suppl. 1, S4.1-S4.9.

24. Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, Searle S, Farrell CM, Loveland JE, Ruef BJ, et al. The consensus coding sequence (CCDS) project: identifying a common protein-coding gene set for the human and mouse genomes. Genome Res. 2009, 19, 1316-1323.
