[...] -species homo sapiens -s), Dust [2] and TRF [3]. RepeatMasker and Dust combined masked [...]
Raw computes
[Figure 1 flowchart: numbers of human proteins, human cDNAs and PE1,2 UniProt proteins retained at each stage of the build, through to the merge step]
Figure 1: Summary of human gene annotation project.

Transcription start sites were predicted using Eponine-scan [4] and FirstEF [5]. CpG islands and tRNAs [6] were also predicted. Genscan [7] was run across RepeatMasked sequence and the results were used as input for UniProt [8], UniGene [9] and Vertebrate RNA [10] alignments by WU-BLAST [11]. Passing only Genscan results to BLAST is an effective way of reducing the search space and therefore the computational resources required. This resulted in 451720 UniProt, 352139 UniGene and 342282 Vertebrate RNA sequences aligning to the genome.
Approximate time: 7 weeks

Next, human protein and cDNA sequences were downloaded from public databases (UniProt SwissProt/TrEMBL [8] and RefSeq [9] for proteins, ENA/Genbank/DDBJ and RefSeq [9] for cDNAs) and filtered to remove sequences based on predictions. The human protein sequences were first mapped to rough locations in the genome using Pmatch to reduce the search space for the subsequent Genewise step, as indicated in [Figure 2]. Models of the coding sequence (CDS) were produced from the proteins using Genewise [13], which was run with four different sets of parameters to accommodate cases where coding models contain non-canonical (non GT/AG) splice sites.

In parallel to the Genewise step, human cDNAs with known CDS start/end coordinates were aligned to the genome using exonerate (cdna2genome model) [12] to generate coding models [Figure 2]. Additionally, pre-aligned annotated cDNAs were re-aligned to unmasked genomic regions. This approach helped in discovering small exons which may have been ignored by exonerate because of their size [Figure 2]. Because all cDNAs used in this step had a known pairing with proteins (e.g. RefSeq cDNAs with accession prefix 'NM_' matching RefSeq proteins with 'NP_' prefix), the coding models generated by exonerate for a given cDNA could be compared with those generated by Genewise using its counterpart protein. The Apollo software [15] was used to visualise the results of filtering.

Where one protein sequence had generated more than one candidate coding model at a locus, the BestTargetted module was used to select the coding model that most closely matched the source protein to take through to the [...] models using species-specific (in this case, human) data is referred to as the 'Targetted stage'. This stage resulted in 120658 coding models built from 41383 human proteins and 67232 cDNAs, which were taken through to the UTR addition stage.
Figure 2: Targetted stage using human protein and cDNA sequences.
Similarity stage:
[...] was rerun for the UniProt protein sequences and the results were passed to Genewise [13] to build coding models. In most cases, multiple coding models built from different UniProt proteins were generated in a single locus, each model with a slightly different exon-intron structure. To filter for the best supported structures, the TranscriptConsensus module was used to compare each Genewise model against human cDNA and EST alignments in the region (see next section on how these alignments were generated). Exons in the Genewise model were scored for overlapping with exons of cDNA/EST alignments, and the model(s) with the highest combined score in a region were kept. The generation of transcript models using data from related species is referred to as the 'Similarity stage' [Figure 3]. This stage resulted in 4452 and 1995 coding models supported by mammalian UniProt proteins and non-mammalian vertebrate UniProt proteins respectively.
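The exon-overlap scoring described above can be sketched roughly as follows. This is a minimal illustration only, not the TranscriptConsensus implementation: the scoring function (total overlapping bases), the data layout and the names `overlap`, `score_model` and `best_models` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive coordinates


def overlap(a: Exon, b: Exon) -> int:
    """Number of bases shared by two exons (0 if they do not overlap)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def score_model(model: List[Exon], evidence: List[List[Exon]]) -> int:
    """Hypothetical score: bases of model exons overlapped by cDNA/EST exons,
    summed over every evidence alignment in the region."""
    return sum(overlap(m, e) for m in model for aln in evidence for e in aln)


def best_models(models: Dict[str, List[Exon]],
                evidence: List[List[Exon]]) -> List[str]:
    """Names of the model(s) with the highest combined score in the region."""
    scores = {name: score_model(exons, evidence) for name, exons in models.items()}
    top = max(scores.values())
    return sorted(name for name, s in scores.items() if s == top)


if __name__ == "__main__":
    models = {"A": [(100, 200), (300, 400)], "B": [(100, 200), (350, 450)]}
    evidence = [[(100, 200), (300, 400)]]  # one cDNA alignment with two exons
    print(best_models(models, evidence))
```

Ties are kept deliberately, mirroring the "model(s)" wording above; a real scorer would presumably also weigh splice-site agreement, which this sketch ignores.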
Figure 3: Alignment and filtering of mammalian and vertebrate proteins.
Figure 4: Alignment of human cDNAs and ESTs to the human genome.

221864 (of 276510) human cDNAs and 7297521 (of 8174393) human ESTs aligned to the genome. The coverage and percentage identity cut-offs for cDNA alignments were both set at 98%, which is higher than those for ESTs (90% coverage, 97% percentage identity), because cDNAs are generally less fragmented than ESTs. EST alignments were used to generate EST-based gene models similar to those for mouse [14], and these are displayed on the website in a separate track from the Ensembl gene set.
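As an illustration of how such cut-offs act on a set of alignments, the following sketch applies the thresholds quoted above. The function names, the tuple layout and the choice of inclusive comparisons are assumptions for the example and are not taken from the Ensembl code.

```python
from typing import Iterable, List, Tuple

# (minimum coverage %, minimum identity %) per evidence type, as quoted in the text
THRESHOLDS = {"cdna": (98.0, 98.0), "est": (90.0, 97.0)}

Alignment = Tuple[str, float, float]  # (accession, coverage %, identity %)


def passes(kind: str, coverage: float, identity: float) -> bool:
    """True if an alignment meets both cut-offs (inclusive; an assumption)."""
    min_cov, min_id = THRESHOLDS[kind]
    return coverage >= min_cov and identity >= min_id


def keep_alignments(kind: str, alignments: Iterable[Alignment]) -> List[str]:
    """Accessions of the alignments that survive filtering."""
    return [acc for acc, cov, ident in alignments if passes(kind, cov, ident)]


if __name__ == "__main__":
    # Hypothetical EST alignments, for illustration only
    ests = [("est_1", 95.0, 97.5), ("est_2", 88.0, 99.0), ("est_3", 92.0, 96.0)]
    print(keep_alignments("est", ests))
```

The same alignment can pass as an EST but fail as a cDNA, which reflects the stricter cut-offs applied to the less fragmented cDNA sequences.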
[...] cDNAs/proteins only (from the 'Targetted' stage of the build), and 717 contained transcripts supported by UniProt proteins only from the 'Similarity' stage of the build [Figure 5]. The Ensembl genes were associated with a total of 52559 Ensembl transcripts, of which 51835 were supported by human cDNAs/proteins, and 724 had support from UniProt proteins [Figure 6].
Merging Ensembl and Vega gene sets, annotating long intergenic non-coding RNA genes and generating the final gene set.
Approximate time: 12 weeks

Following the completion of the Ensembl gene set, Ensembl annotations and manual annotations (primarily generated by the HAVANA team at the Wellcome Trust Sanger Institute) from the Vega database [20, 21] were merged at the transcript level to create the final gene set. The Vega database (as of 12 September 2010) contained 39565 genes and 133451 transcripts.

In the merge process, Ensembl and Vega transcripts were merged if they had identical exon-intron structures. If transcripts from the two annotation sources matched at all internal exon-intron boundaries, i.e. had an identical splicing pattern, but one of them had longer terminal exons (usually the UTRs), they were also merged, and the merged transcript adopted the exon-intron structure of the Vega transcript because Vega annotation was prioritised over Ensembl. Transcripts which had not been merged, either because of differences in internal exon-intron boundaries or because they were present in only one annotation source, were transferred from the source to the final gene set intact.

The Ensembl-Vega merge code also took into account the biotype and supporting evidence associated with the transcripts from both annotation sources. For a pair of transcripts to be merged, if there was a mismatch in biotype, e.g. the Ensembl transcript is protein-coding but the Vega counterpart is non-coding, the Vega biotype would take precedence over the Ensembl model and the Ensembl transcript would undergo a biotype change to match its Vega counterpart. The translation for the Ensembl transcript would then be removed if the transcript had lost its protein-coding biotype. Biotype conflicts between Ensembl and Vega were always reported to the HAVANA team for investigation and, when resolved, could improve the merged gene set in the future. As for supporting evidence, the merge of Ensembl and Vega transcripts also involved merging the protein/cDNA supporting evidence associated with the transcripts, to ensure that the basis on which the annotations were made would not be lost.

Following the merge, long intergenic non-coding RNA genes (lincRNAs) were annotated by the Ensembl lincRNA pipeline [19] and incorporated in the final gene set.

An important feature of the merged gene set is the presence of all Vega source transcripts. This has been made possible by allowing Vega annotation to take precedence over Ensembl's when merging transcripts which do not match at their terminal exons or have different biotypes. Of all Vega transcripts, 18.3% were merged with Ensembl transcripts. The vast majority of merged transcripts (89.6%) are of protein-coding biotype. Vega transcripts which were not merged (82.7% of Vega source transcripts) were mostly alternative splice variants, pseudogenes or non-coding. These transcripts were fully transferred into the final gene set.

The final Ensembl-Vega set consisted of 44314 genes and 160002 transcripts. Of the 160002 transcripts, 15.3% (24492) were the result of merging Ensembl and Vega annotations, 16.1% (25718) originated from Ensembl, 68.5% (109650) originated from Vega, and the remaining <0.4% were incorporated from other sources (e.g. immunoglobulin gene segments/transcripts imported from IMGT data).

As a quality-control measure, Ensembl translations of protein-coding transcripts in the final merged gene set were aligned against the NCBI RefSeq and UniProt/SwissProt sets of public curated protein sequences (which were used in the 'Targetted' stage of the gene build) to calculate the proportion of curated sequences covered by the merged gene set. Over 99% of RefSeq and SwissProt proteins were represented in the merged gene set, and in the majority of cases there was a 100% match between the curated protein and the Ensembl translation.

Since Ensembl release 56 (September 2009), the Ensembl-Vega gene set has exactly corresponded to a GENCODE release [23]. The gene set in release 62, which this document describes, corresponds to GENCODE release 7. Each GENCODE release also contains the full annotation of the consensus coding sequence (CCDS) transcript models [24]. All CCDS models are included in each release of the human gene set.
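The transcript-level merge rule described in this section (identical internal exon-intron boundaries, with Vega structure and biotype taking precedence) can be sketched as below. This is a simplified illustration under stated assumptions, not the Ensembl-Vega merge code: the `Transcript` class, the function names, and the handling of single-exon transcripts (merged only when both are single-exon and overlap) are hypothetical, and strand, chromosome and supporting evidence are ignored.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive, sorted by start


@dataclass
class Transcript:
    source: str   # "ensembl", "vega" or "merged"
    biotype: str
    exons: List[Exon]


def intron_chain(t: Transcript) -> List[Tuple[int, int]]:
    """Introns implied by consecutive exons, i.e. the internal boundaries."""
    return [(a[1] + 1, b[0] - 1) for a, b in zip(t.exons, t.exons[1:])]


def merge(ens: Transcript, vega: Transcript) -> Optional[Transcript]:
    """Return the merged transcript, or None if the pair must stay separate."""
    if len(ens.exons) == 1 or len(vega.exons) == 1:
        # Assumption: single-exon transcripts merge only with an overlapping
        # single-exon partner, since they have no internal boundaries to compare.
        if len(ens.exons) != len(vega.exons):
            return None
        (s1, e1), (s2, e2) = ens.exons[0], vega.exons[0]
        if min(e1, e2) < max(s1, s2):
            return None
    elif intron_chain(ens) != intron_chain(vega):
        return None
    # Vega takes precedence for both the exon-intron structure and the biotype.
    return Transcript("merged", vega.biotype, list(vega.exons))


if __name__ == "__main__":
    ens = Transcript("ensembl", "protein_coding", [(100, 200), (300, 400), (500, 650)])
    vega = Transcript("vega", "processed_transcript", [(80, 200), (300, 400), (500, 600)])
    print(merge(ens, vega))
```

Comparing intron chains rather than exon lists is what lets transcripts with different terminal exon lengths merge while any difference at an internal splice site keeps them apart.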
Figure 7: Workflow for the annotation of haplotype regions in chromosomes 6, 14 and 17.
All models annotated by Ensembl were filtered systematically by a series of Perl scripts to remove models with erroneous structures. Examples of such scenarios would be where a model differed considerably in its internal structure compared to other models in the same locus, or if exons were missing or had non-consistent splice sites. In addition, models supported by cDNA fragments with wrongly annotated short open-reading frames were removed manually on a case-by-case basis. Further filtering of the models was done using the following criteria at gene level:
- Lack of homologues
- Single transcript
- Lack of overlapping protein and cDNA alignments
- Frameshifts
The filtering resulted in removal of 545 transcripts and 560 genes. Subsequently the Ensembl annotation was combined with the Vega annotation to produce the GENCODE gene set (release 8).
The quality of a gene set is dependent on the quality of the genome assembly. Genome assembly can be assessed in a number of ways, including:

1. Coverage estimate
   - A higher coverage usually indicates a more complete assembly.
   - Using Sanger sequencing only, a coverage of at least 2x is preferred.
2. N50 of contigs and scaffolds
   - A longer N50 usually indicates a more complete genome assembly.
   - Bearing in mind that an average human gene may be 10-15 kb in length, contigs shorter than this length will be unlikely to hold full-length gene models.
3. Number of contigs and scaffolds
   - A lower number of toplevel sequences usually indicates a more complete genome assembly.
4. Alignment of cDNAs and ESTs to the genome
   - A higher number of alignments, using stringent thresholds, usually indicates a more complete genome assembly.

More information on the Ensembl automatic gene annotation process can be found at:

Curwen V, Eyras E, Andrews TD, Clarke L, Mongin E, Searle SM, Clamp M. The Ensembl automatic gene annotation system. Genome Res. 2004, 14(5):942-50. [PMID: 15123590]
Potter SC, Clarke L, Curwen V, Keenan S, Mongin E, Searle SM, Stabenau A, Storey R, Clamp M. The Ensembl analysis pipeline. Genome Res. 2004, 14(5):934-41. [PMID: 15123589]
http://www.ensembl.org/info/docs/genebuild/genome_annotation.html
http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembldoc/pipeline_docs/the_genebuild_process.txt?root=ensembl&view=co
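For readers unfamiliar with the N50 statistic listed among the assembly-quality measures above, a minimal sketch of the usual definition follows: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name `n50` and the example lengths are illustrative assumptions.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that brings the running sum to half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    # Hypothetical contig lengths in kb
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Comparing the resulting value against the 10-15 kb typical gene length quoted above gives a quick sense of whether most contigs could hold a full-length gene model.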
References
1. Smit AFA, Hubley R, Green P: RepeatMasker Open-3.2. 1996-2010. www.repeatmasker.org
2. Kuzio J, Tatusov R, Lipman DJ: Dust. Unpublished but briefly described in: Morgulis A, Gertz EM, Schäffer AA, Agarwala R: A Fast and Symmetric DUST Implementation to Mask Low-Complexity DNA Sequences. Journal of Computational Biology 2006, 13(5):1028-1040.
4. Down TA, Hubbard TJ: Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Res. 2002, 12(3):458-461. http://www.sanger.ac.uk/resources/software/eponine/ [PMID: 11875034]
5. Davuluri RV, Grosse I, Zhang MQ: Computational identification of promoters and first exons in the human genome. Nat Genet. 2001, 29(4):412-417. [PMID: 11726928]
6. Lowe TM, Eddy SR: tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997, 25(5):955-64. [PMID: 9023104]
7. Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997, 268(1):78-94. [PMID: 9149143]
8. Goujon M, McWilliam H, Li W, Valentin F, Squizzato S, Paern J, Lopez R: A new bioinformatics analysis tools framework at EMBL-EBI. Nucleic Acids Res. 2010, 38 Suppl:W695-699. http://www.uniprot.org/downloads [PMID: 20439314]
9. Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Landsman D, Lipman DJ, Lu Z, Madden TL, Madej T, Maglott DR, Marchler-Bauer A, Miller V, Mizrachi I, Ostell J, Panchenko A, Pruitt KD, Schuler GD, Sequeira E, Sherry ST, Shumway M, Sirotkin K, Slotta D, Souvorov A, Starchenko G, Tatusova TA, Wagner L, Wang Y, John Wilbur W, Yaschenko E, Ye J: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2010, 38(Database issue):D5-16. [PMID: 19910364]
10. http://www.ebi.ac.uk/ena/
11. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215(3):403-410. [PMID: 2231712]
12. Slater GS, Birney E: Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 2005, 6:31. [PMID: 15713233]
13. Birney E, Clamp M, Durbin R: GeneWise and Genomewise. Genome Res. 2004, 14(5):988-995. [PMID: 15123596]
14. Eyras E, Caccamo M, Curwen V, Clamp M: ESTGenes: alternative splicing from ESTs in Ensembl. Genome Res. 2004, 14(5):976-987. [PMID: 15123595]
15. Lewis SE, Searle SM, Harris N, Gibson M, Lyer V, Richter J, Wiel C, Bayraktaroglir L, Birney E, Crosby MA, Kaminker JS, Matthews BB, Prochnik SE, Smithy CD, Tupy JL, Rubin GM, Misra S, Mungall CJ, Clamp ME: Apollo: a sequence annotation editor. Genome Biol. 2002, 3(12):RESEARCH0082. [PMID: 12537571]
16. http://www.ensembl.org/info/docs/genebuild/ig_tcr.html
17. ftp://ftp.cines.fr/IMGT/IMGT.zip
18. http://www.ncbi.nlm.nih.gov/nuccore/NC_012920
19. http://www.ensembl.org/info/docs/genebuild/ncrna.html
20. http://vega.sanger.ac.uk/Homo_sapiens/Info/Index
21. Wilming LG, Gilbert JGR, Howe K, Trevanion S, Hubbard T, Harrow JL: The vertebrate genome annotation (Vega) database. Nucleic Acid Res. 2008 Jan; Advance Access published on November 14, 2007; doi:10.1093/nar/gkm987