
I WORKSHOP BRASILEIRO DE BIOINFORMÁTICA

18 de Outubro de 2002, Gramado, Rio Grande do Sul, BRASIL

ANAIS

Promoção
SBC - Sociedade Brasileira de Computação

Edição
Ana L. C. Bazzan (UFRGS), André Carlos Ponce de Leon F. de Carvalho (USP-São Carlos)

Organização
UFRGS - Universidade Federal do Rio Grande do Sul; USP-São Carlos - Universidade de São Paulo

Realização
Instituto de Informática - UFRGS

CIP - CATALOGAÇÃO NA PUBLICAÇÃO

Workshop Brasileiro de Bioinformática (1. : 2002 out. 18 : Gramado). Anais / Edição Ana L. C. Bazzan, André Carlos Ponce de Leon F. de Carvalho. Porto Alegre : Instituto de Informática da UFRGS, 2002. 107 p. : il. ISBN 85-88442-35-3. Conhecido também como WOB 2002. 1. Bioinformática. I. Bazzan, Ana L. C. II. Carvalho, André Carlos Ponce de Leon F. de. III. WOB (1. : 2002 : Gramado).

This volume was printed from originals delivered already typeset by the authors.

Cover: Roberta Krahe Edelweiss. Typesetting: Luciana Schroeder and Fábio Zschornack. Printing: Editora Evangraf Ltda.

1st BRAZILIAN WORKSHOP ON BIOINFORMATICS


October 18, 2002, Gramado, Rio Grande do Sul, BRAZIL

PROCEEDINGS

Promotion
SBC - Brazilian Computing Society

Editors
Ana L. C. Bazzan (UFRGS), André Carlos Ponce de Leon F. de Carvalho (USP-São Carlos)

Organization
UFRGS - Universidade Federal do Rio Grande do Sul; USP-São Carlos - Universidade de São Paulo

Realization
Instituto de Informática - UFRGS


PREFACE
It is a pleasure to address the fascinating area of Bioinformatics in this volume, and especially to see the results of months of hard work by the many people who have collaborated to organize the First Brazilian Workshop on Bioinformatics. We would like to thank all researchers who have helped organize this event, those who have submitted articles, the reviewers for carefully reading them, and the students who have worked to bring the proceedings to their final form.

We start by trying to give a brief introduction to the area of Bioinformatics for newcomers. This task is not trivial, since nobody really agrees on what Bioinformatics is. Of course, everyone has her/his own definition and a proper idea of what a hot topic of research might be. Therefore, it is not our aim here to make an extensive review of the area. Such guidance can be found in several textbooks as well as in the dozens of journals, conferences, and workshops now devoted to this theme. Afterwards, we focus on the Brazilian panorama, which is full of examples of successful projects in the area. This can be attributed in part to a change of paradigm in the way science is done: geographically distributed scientists, from different fields, working together in a network. Finally, we introduce the workshop itself and give an outlook.

Bioinformatics - or computational biology - is a new area of scientific knowledge that can be seen as a marriage between biology and computer science. It employs tools - mathematical, computational, statistical, and from other related areas - in the solution of problems in biology in a broad sense. Nowadays, the main focus is on problems related to molecular biology and genetics. Despite its novelty as a research area, it was born with great potential. According to an investment banking firm, the market for bioinformatics has been estimated at two billion US dollars between 2000 and 2005 (Howard, 2000). Bioinformatics companies work mainly with data collection and storage, query tools for databases, and data analysis. Some of them charge fees to grant access to these tools and databases, but there has been increasing pressure from the scientific community to keep information and tools free of charge. The trend is towards keeping raw information free, while the more aggregated information will be charged for, since the main target is big pharmaceutical companies, which can afford to pay for that information. For this branch of industry, the advantage of using bioinformatics techniques is obvious: the prospect of finding a drug for a specific disease becomes easier and quicker to realize. Therefore, we can say that bioinformatics can change biotechnology and medicine. This certainly does not mean closing all the so-called wet labs and substituting bioinformatics laboratories for them! However, some aspects of scientific discovery in medicine and biotechnology will migrate to what is known today as in silico biology. It is already a reality that genome projects cannot survive without bioinformatics, especially given the time constraints.

We can say that the area was born in the early 1980s with the creation of the database called GenBank by the U.S. Department of Energy. This database was designed to

store the DNA sequences that researchers were starting to identify. GenBank went through several phases: in the beginning, people received the sequences and typed them in using specially designed keyboards (with only four keys: A, C, G, and T). Later, with file transfer protocols, researchers themselves could submit their data to the database. Since then, the number of submissions and queries to the bank has been increasing dramatically and has reached 7 billion DNA base pairs (Howard, 2000). Finally, with the appearance of the world wide web, it became possible for anyone to access the database free of charge.

One genome project in particular was fundamental to the creation of other databases: the human genome project (HGP). According to Howard (2000), Incyte Genomics is capable of generating 20 million base pairs in a single day! Celera Genomics claims to have a database of 50 terabytes! These figures only concern data on DNA sequences; they skyrocket if we consider data on expressed genes, proteins, scientific publications, and so on. Currently, researchers are producing, as never before, huge databases that contain all these data as well as more aggregated data, such as how proteins interact and which role an interaction plays in certain diseases. And these databases will grow exponentially. However, it is widely recognized that the raw information is of little help if we cannot fully understand it. We have yet to learn the relevance of the data being generated! At this point, it is important to remember the words of Ridley (1999): the genome must not be related ultimately to diseases. That means we should not focus only on finding genes that cause diseases (a focus that could conceal a profit-making bias). Understanding life should be the ultimate goal of basic research. Also - and here we apologize for our computer science bias - researchers, at least on the computer science side, should look at biology as an inspiration for new computing paradigms, as was the case in the not so distant past with Neural Networks, Genetic Algorithms, and others. For instance, DNA computing can potentially solve open problems in computer science that are currently stuck due to the complexity of the corresponding algorithms.

Now, turning to the Brazilian scenario, bioinformatics is becoming an increasingly hot topic. Projects like the sequencing of the bacterium Xylella fastidiosa, supported by FAPESP, have shown the growing interest of researchers as well as funding agencies in the field. The success of this project had a threefold effect: it introduced a new paradigm of how to do research (i.e. the collaborative way, forming networks of labs, institutions, and people); it gave visibility to Brazilian science abroad; and it stimulated similar experiences at the national level. Currently, a dozen projects are being conducted: from other bacteria like C. violaceum, to sugar cane and eucalyptus, and to cancer. In 2001, the MCT, the CNPq, and FINEP launched a call for project proposals in the area of bioinformatics within the PNBRG (Programa Nacional de Biotecnologia e Recursos Genéticos).

It is also important to highlight the effort that has been put into the formation of human resources in the area. Apart from the recent initiative of CAPES to support PhD programmes in bioinformatics, the specialization course of the LNCC provides young people - mostly graduate and undergraduate students - with an opportunity to learn the

essential techniques and the basic theory, enabling them to share the acquired knowledge upon returning to their laboratories and research groups.

Given all this, the idea of organizing this workshop seemed ripe to us. We thought it would be a good idea to join a well-established event in the Brazilian computer science scene, the SBBD/SBES (Symposium on Databases / Symposium on Software Engineering), and organize the workshop as a satellite event, given that this would be its first edition. The main goal of this workshop is to bring together researchers working on all aspects of computational biology and bioinformatics in order to exchange recent research results in their areas. It is also intended to be a forum for students to discuss their research projects and to attract people to this new field, given the growing demands posed by the genome projects. The idea proved successful: we received 21 submissions, mostly from people with a computer science background. This number can be considered adequate, given that the area is very young in Brazil and results are just starting to appear. Given the time constraints, we could accept only about 50% of the submitted papers for oral presentation. However, we have invited some of the other authors to present a poster and publish an extended abstract of their work in these proceedings.

The topics addressed are: Algorithms and Applications in Bioinformatics, Biological Databases, Biological Data Integration, Biological Data Mining, Biological Data Visualization, Biological Information Extraction and Retrieval, Biological Knowledge Bases, Biological Knowledge Representation and Inference, Bio-ontologies, DNA Computing, Computational Drug Discovery, Distributed Knowledge Discovery and Learning, Functional Genomics, Gene Expression Analysis, Gene Identification, Gene Regulation, Grammatical Analysis of Molecular Sequences, Information-Theoretic Analysis of Molecular Sequences, Models of Biomolecular Computing, Molecular Dynamics and Simulation, Molecular Evolution, Molecular Sequence Alignment, Molecular Sequence Assembly, Molecular Sequence Classification, Molecular Sequence Databases, Molecular Sequence Patterns and Motifs, Molecular Structural Motifs, Phylogeny Construction, Proteomics, Protein Folding, Protein Structure Prediction, Regulatory Genetic Networks, Simulation Tools, Software Environments for Bioinformatics, Software Tools for Computational Biology, and Statistical Analysis of Molecular Sequences.

Finally, as for the challenges for the near future: the development of high-throughput data acquisition technologies in the biological sciences, together with advances in digital storage, computing, and communication technologies, will continue to transform biology in general, and molecular biology in particular. The exponential growth of genome data, which is available to researchers via databases and the Internet, will not stop soon. This poses a challenge for algorithm and tool developers in terms of software engineering and design, complexity of algorithms, and formal methods, since we may now be at the frontier between the generation of BLAST-like tools and GenBank-like databases and a generation of new algorithms and tools. Although the former are very suitable for human-made queries, they clearly cannot cope with the needs of high-throughput sequencing projects; here, better structured databases are necessary. Also, sequence and structural information are stored in databases along with various tools which are far from being integrated and user-friendly.

Therefore, a major goal for researchers is to rethink issues such as: How is information encoded, stored, decoded, and used in biological systems? What sequence regularities (if any) are predictive of protein function? How can we precisely characterize the syntax (grammar) and semantics (meaning) of macromolecular sequences? How do hundreds of genes interact over time to orchestrate specific biological processes of interest? It should be clear that research in bioinformatics requires the development of sophisticated algorithms and of software tools for data storage and retrieval, data integration, information extraction, and exploratory data analysis and discovery (through data mining and data visualization), using heterogeneous biological data sources, databases, knowledge bases, and ontologies. The design and development of such tools is a major goal of bioinformatics or computational biology.

Porto Alegre, August 2002

Ana L. C. Bazzan
Instituto de Informática
Univ. Fed. do Rio Grande do Sul
bazzan@inf.ufrgs.br

André C. Ponce de Leon F. de Carvalho
Depto. de Ciência da Computação
USP-SC
andre@icmc.sc.usp.br

References
Howard, K. (2000). The Bioinformatics Gold Rush. Scientific American, July 2000.

Ridley, M. (1999). Genome. New York: Harper Collins.

I Workshop Brasileiro de Bioinformática


Comitê de Organização / Organizing Committee
Ana L. C. Bazzan (UFRGS), co-chair
André Carlos Ponce de Leon F. de Carvalho (USP-São Carlos), co-chair

Comitê de Programa / Program Committee


Aldo von Wangenheim (UFSC)
Ana L. C. Bazzan (UFRGS)
Ana Tereza Ribeiro de Vasconcelos (LNCC)
Antônio Basilio de Miranda (FIOCRUZ)
Antônio Braga (UFMG)
André C. Ponce de Leon F. de Carvalho (USP-São Carlos)
José Carlos M. Mombach (UNISINOS)
Leila Ribeiro (UFRGS)
Marcílio Souto (UFPE)
Osmar Norberto de Souza (PUCRS)
Paulo Martins Engel (UFRGS)
Sandro J. de Souza (Ludwig Inst. for Cancer Research)
Sérgio Ceroni da Silva (UFRGS)
Sérgio Lifschitz (PUC-Rio)
Teresa Ludermir (UFPE)
Wellington Santos Martins (UCB)
Wilson Araújo da Silva Jr. (Fundação Hemocentro)

Avaliadores / Reviewers
Aldo von Wangenheim
Alexandre Delbem
Ana L. C. Bazzan
Ana Tereza Ribeiro de Vasconcelos
André C. P. de Leon F. de Carvalho
Antônio Basilio de Miranda
Antônio Braga
Arnaldo Moura
Aurora Pozo
Cláudio R. Geyer
Fernando Von Zuben
João Batista Oliveira
José Carlos M. Mombach
Leila Ribeiro
Marcílio Souto
Maria do Carmo Nicoletti
Marinho de Andrade
Mauro Biajiz
Osmar Norberto de Souza
Paulo Martins Engel
Sandro J. de Souza
Sérgio Ceroni da Silva
Sérgio Lifschitz
Solange Rezende
Teresa Ludermir
Wellington Santos Martins
Wilson Araújo da Silva Jr.

Sociedade Brasileira de Computação

Diretoria
Presidente: Flávio Rech Wagner (UFRGS)
Vice-Presidente: Luiz Fernando Gomes Soares (PUC-Rio)
Administração e Finanças: Taisy Silva Weber (UFRGS)
Eventos e Comissões Especiais: Ana Teresa de Castro Martins (UFC)
Educação: Marcos José Santana (USP São Carlos)
Publicações: Claudia Maria Bauzer Medeiros (UNICAMP)
Secretarias Regionais: Aleardo Manacero Jr. (UNESP São José do Rio Preto)
Divulgação e Marketing: Sérgio Vanderlei Cavalcante (UFPE)
Planejamento e Programas Especiais: Robert Carlisle Burnett (PUC-PR)
Regulamentação da Profissão: Roberto da Silva Bigonha (UFMG)
Eventos Especiais: Ricardo de Oliveira Anido (UNICAMP)

Conselho

Membros Titulares


Mandato 2001-2005
Ana Carolina Salgado (UFPE)
Paulo César Masiero (USP-São Carlos)
Rosa Maria Viccari (UFRGS)
Sergio de Mello Schneider (UFU)
Tomasz Kowaltowski (UNICAMP)

Mandato 1999-2003
André C. P. Carvalho (USP-São Carlos)
Daltro José Nunes (UFRGS)
José Carlos Maldonado (USP-São Carlos)
Silvio Romero de Lemos Meira (UFPE)
Therezinha Souza Costa (PUC-Rio)

Membros Suplentes
Itana Maria de Souza Gimenes (UEM)
Jaime Simão Sichman (USP)
Raul Sidnei Wazlawick (UFSC)
Miguel Jonathan (UFRJ)

Sumário / Contents

FULL PAPERS

A CGM/BSP Parallel Similarity Algorithm
C. E. R. Alves, E. N. Cáceres, F. Dehne, S. W. Song

Phylogeny from Whole Genome Comparison
Graziela S. Araújo, Nalvo F. Almeida Jr.

DNA-Based Modelling of Parallel Algorithms
Leonardo Vieira Cervo, Leila Ribeiro

BLAST Implementation Issues on Workstation Clusters
Rogério Luís de Carvalho Costa, Sérgio Lifschitz

Splice Junction Recognition using Machine Learning Techniques
Ana C. Lorena, Gustavo E. A. P. A. Batista, André C. P. L. F. de Carvalho, Maria C. Monard

AnaGel: A Distributed System for Storage and Analysis of Electrophoretical Records
Edré Quintão Moreira, Osvaldo Carvalho

On the Pursuit of Optimal Sequence Trimming Parameters for EST Projects
Fabiano Cruz Peixoto, José Miguel Ortega

Evolving Phylogenetic Trees: an Alternative to Black-Box Approaches
Oclair Prado, Fernando J. Von Zuben, Sérgio F. dos Reis

Analysis of Functional Interactions of Enzymes in Mycoplasma pneumoniae
Adriana N. dos Reis, Cláudia K. Barcellos, Fabiana Herédia, Jean J. Schmith, José Carlos M. Mombach, Ney Lemke, Rejane A. Ferreira

A Semi-Automatic Methodology for Localization of Short Mitochondrial Genes in Long Sequences
Rafael Santos, José Humberto Machado Tambor, Luciana Campos Paulino, Ana L. C. Bazzan

A Comparison Between Symbolic and Non-Symbolic Machine Learning Techniques in Automated Annotation of the Keywords Field of SWISS-PROT
Luciana F. Schroeder, Ana L. C. Bazzan, João Valiati, Paulo M. Engel, Sérgio Ceroni

POSTERS

Stability Evaluation of Clustering Algorithms for Time Series Gene Expression Data
Ivan G. Costa, Francisco de A. T. de Carvalho, Marcílio C. P. de Souto

Ordering Gene Expression Data Using One-Dimensional Self-Organizing Maps
Lalinka de C. T. Gomes, Fernando J. Von Zuben, Pablo Moscato

Reverse Engineering of Genetic Networks Using Variable Length Genetic Algorithms with Boolean Networks
Ricardo Linden, Amit Bhaya

Um Estudo Empírico de Alinhamento de Seqüências Utilizando Computação Sistólica e MPI
Deive Ciro de Oliveira

Creation of a Hidden Markov Model for Preliminary Identification and Characterization of Subfamily Signature in Serpin Proteins Superfamily
Cristina Russo, Hermes de Amorim, Ana Bazzan, Jorge Guimarães

Transductive Support Vector Machines for Cancer Diagnosis and Classification of Microarray Gene Expression Data
Robinson Semolini, Fernando J. Von Zuben

Aprendizado de Máquina Aplicado ao Estudo de Marcadores Moleculares para Produção de Carne Bovina
Silvia H. M. G. da Silva, Ana C. Lorena, André C. P. L. F. de Carvalho, Danielle D. Tambasco, Luciana C. A. Regitano

  "!#$%'&(%'")%103245"6879)%"0A@B&DC E#FHGIFQPFHRISUTWVYXa`cbHGIFQd9F3EfWe g V hiV'XipcbQqrFHstV'uQvQVYwWb f vHxyFQFyvQW )edgfih jckalnmohqpr pikIsQr't uv'wpr mr3r'pikawHxisHrt uyrYwz{uWxQsWy|e}Alor'~ahzx crcln uYmn mnu Y'pcu 's wzx Y'r' cu9rlorYfWpikYxis9e}Alor ~1h{zx drfh jckalnmohqpr pikIikapkalorYzQcpik nrYo'u9 Wr' cfi h jckilnmohqcxignr1rWxrYfr'prx siWucuzQuAuYwokalgsW1Yhqk1WfWYa"kYxY lozqkico"uYfdr r'"t u YWyYrYcwi zquWxQsHrtcu Y yrYwzquxQsWyA}Alor ~1h{zx drfh jckalnmohqpr pikpiksH Y cii"c kkaiplnhkamnh{kaffc r  ka '} sWuI ymrYolozqYh{fu''lohqm uYp9 l)# auYx h w h{ f#r' f rYz{hqrYfWfp 9 fcag 11 'Ym fI a k Y u e l o o l { h i k a k  f Y r W f u#WwWk) rYfWzq8 Yu'loh{h ikY lnf kaWr wihqpilnkahqmmo qQwW ap uY tk1 wifiu'hql1 r"ihqur'f lnzuYzqk1wizfapum#rY fWwW po "YuY 3 z q uc ln1ucr'a zA a { h 3 x u o l { h i o a k o l ' r i k t l }A H r h k'rYzqmnfuI ikilnkam1mn k1fAh{zqk1kafcr"hquf lnkamowz kaomnmmnuYu'lnmAr'k1hr'fWka puYh f kau1 wiz{Q ik1 h{fu'kl ucpi Q g WAUviV hiV'X'QRtSWhiiuQX f vHx f QQSU g'f UWvXgUvBrvh f g X iQ#t3Qo yV'iQV'v g V g W f hiXvX f Qvx f V vi f SQhiWHSV'UvEgWQ f UWv f SrUWSUWWu f 3V hiXv8hV HSV HhQSV'XgYcbX u XuQVXV h uWHXUS hiiVYX3V"orV V'v f Q Q iXVYf iQV v g V'X8YQg b"Wb"YcFqQHhiuQV hiWhiVWbXV'ig QV'fv g V g Wf f g hXWv g f vBf VHXV'xBUvBuHV XSivWXV TV h f SiuQV hHhQSV'X QbiQb"F #vQV f i xV'vi XUS f hiUV'XV ogV'V vXV'iQV v g V'XX f SUWv uQV'biu uQVvXV'hi vX V'XgvuQV#ogXVYiQV v V'X'bUv X g u f u f uHVtor8XV'iQV'v g VYX V g f VuQVIX f fWVg X 'VWFc VV"V g u f g uHV f SUWvH V'viWHorXVYHV v V'Xeu hiV XUS f h#SUSeXuQcuQV8 f hiXtuQV hiVuQV' f g ub f vHxBxQ V hiV vi# f hiXguQV'hV8Xf ffg VYX hiV9UvHXV'hiV'xF3 V hV8viV'hVYXoiV'xBUviuQVVYXo SUUWvQV viV ogV'V vBorXhivHX'b vHx f uQVIX g hVrX g u f f v f SUWvQV'vTV'X f V f XQf hiVgW3uHc 9 g uiuQVIXoihUvQiX f hV f Sf WVF uQVX VUxV g f vV f QQSUVYxuQV vrV f vi HvxuQVxUX f v g V3V"orV V'v ogXVYHV f v g V'X'FY f Vr vA HvHxuQVUvQ8QvQ8V'hAWHUvHXV'hivHX'bYxV'SV UWvHX vHx XQXoiiUWvHX#vQV'V'xVYxf h f vHXWhi vQV9XV'iQV'v g V8viuQViuQV hYF nvWuQV'htrWhaf xQX'b gV f viiV'x8vQVWriuQVXhiUvQX f vHx f VV'i f S)BiuQV WuQV'h'FeV f XXUWv XiX#iV'SV'V'vi hiV'xW3V h ivHX vHx|XV V uQVSUV'XiXV"V'vHXTV g iXUWv g uQVYXVIV'h f UWvHX'Ff uHX'biuQV#xfX f v g VIfX f V f XHhVWuQc 8 g uuHV# XhivQiX)x 3V'h'F yWr' lh{rY zmowW i uYlrY fp u r y AtsW yr'lniuckalnkam c gyx9 " g e yy r)k1 ifWgk8 c s k yslnuc Y) uYui $ y i u W  x y rglorYyf rs 'lorY"f g xi ry# 'lorY fo m W Y Y" 'x WrY fr'pr xrYfW pxsHr' f p sWufWc # g Wye xrYefWsp
      ! #" $ &% (')  0 1 0 32547676 44 8' 9 @2A171 0 B2A4717C76 D4  9E  &" B' 11 F4  0 G 174 H4  I6 GP00 C 1 Q2 256 G 4 44 G RI6B2A471 0 4B2 D4  7C 44 GP0 11 G

 v x   3V8ogXoihUvQiXtcTV h#XWV f SUQu f V V" p F Rtv"!$#&%('0)2143` )65I ) V ogV'V fv7 f vHx7 ` pX f f g uHvQWtuHVX93WSX8@9A f vHx B97Uv X g u f iu f 8IrVxh f SUvHV'X3V"orV V'vuQV f g uQVYxX8SUX'buHV'XV SUvQV'X vQvH g hiXiXrV fWg uiuQV hYF uQV f SUvQV viXuQcIXgiuQVXUS f hiUV'XrV ogV'V vuHV ogXog'ihf UvQiX D F CUTWV v v SUvQV viV ogV'V vogXoihUvQiX brV v XXUWv EGIHQP 3#i  f XtWSUSUcIX FG fg u fg SHf v)iuQV f SUvQV vihV g V UTWVYX fg V'hg'ff Uvf T f SUQVxfF V 3V vxvH WviX g WviV'viiX f vHxuQV W f SrX g WhiVWhiuQV f SUWvHV'viUXuQVXQ iuQVT f SUQV'X XiXUWvQVYxiX g SHvX F fg SQv u f XtorxV vii g'f S g u f h fWg V'hiXSRTVUbUSS hf V g V UTWV f T f SUQX V W`YaRQbcUedgf h Y f 14!Q5 GIi d1Fs# V hiV vi g u f h fWg V hap X R7r q USUSgTVuHV v|T f SQs V W`YaRQbcUedtu Y f 1v % E 14!Q5 Gci d1FqAUv f SUSb f X fWg V8Uv f g SHv hiV g V'TV'X f gT WSUSQ V wSybuQV hiT V y9F VS WhuHVT f SHVWgiuQV3V'X f SUvQV viT Y Hc 5%14!Q# f Hx !Q#&%('0)2143)65d#uQ g u WUTWV'X#iuQV f 8Q X g WhiVWF uQX f 8Q X g WhiVUX g f SSUV'x|uHV UuYXbIdWhrXoihUvQi X f vHx E %1%#! P %5I3V"orV V'vuHVIor9XoihUvQiX8VxV vQWVYxiT F uQV hiV f VhVu f vWvQV f SUWvQV'vu f U9QX g hVYcF R XVYHV vi S SWhiiuQ i g WQQVuHVXUS f hioV ogV'V v orXhiUvQXHXV'X V g uQvQHV g f f SSUV'f@ x $Q)!$1v% GPH ' P !$1v1%)'WF uHV g QSV o uHUX f SUWhuQ UX f8Yedf xd1F uQV g WvXoih g UWvWuQVWi f S f SUWvQV'v g f v VxWvHV Uv XV'iQV vii f S U V 8YahgidjdF EgWvHXUxQV h k k0l vH x k 4k0mdFi V vW UviuQVXWSUiv WQQUvQ SS uQV XUUS f hiiVYXV ogV'Vf v f hiQh f hiBQhiV Qg f V'Xrf uQVorXhivQiXXo f hg UvQBufuHV XuQhiV hHhV QV'X vHxHXVQhiV TUWHXSU g WQiV'xhiV'XQSaXiXWSUTWVuHV QhiWQSUV h S f hV hIQhiV HVYX F f uQV hiV f h V g iXXUQSV8QhiV QV'XIn vHB x dog QhiV QVYXIp F v Yeg q dsrf Yadog$e d f ihx tuQV hiV uiX bgV g'f v f hh f vHWVWQh g f S g HS f UWvHXtv f i V fWg x u tsYaRQbcUedrhV'QhiV'XV viiXguHVXUS f hio V ogV'V B v 9 RY f vu x  U F #HXV'hTVu f gV g f v g HVguQVrT f SUQV'XAW tsYeRQbIUedHXvHIuHV iuQhV'VgHhV'TiUWX V SUV V viis X tsYaRvw w bcUed1x b tsYaRvw w bcUnwq d f vHg x tyYeRQbcUpwq d"bV g f XVtuQV'hV f hiVtWvHSuHhV'V f X g WQivQ f v f SUUWvQV viI3V"orV V'z v 9 RY f vHz x  U"FH V g'f v f SUUWv 9 { RY) u  Usw" f vx f g u f X fg Vi| u  U bh f SUW v 9 RXw "iu  Uw " f vHx f g T u 9 RYo u } U"biWh f SUWT v 8 Rnw "3io u  { U { f vHx f g u ix u 9 R'F f X fuQ g V VXUUS f hiouQV f SUvQV vi#3V"orV V'vXhivHs X f vH x  g f v|V g QVYx f XrWSUSUcIX ~ t RQbcUyw 2 w|y  bIUsw " gzW`YaRQbcUed tsYeRQbIUedn f t Rw 0 t Rw 0 bIU 2 w|y RIg v r8 ' %aXSv7YqAUWQhiV9e dX xhiV VYx g g S g h f QuuQXVITV h g V'X f hiV uQg V ` p ` 3WUvp aXP W f v ` r p hxbf uVYxWg VYXhfW hx vio YaIbduQVhx viiXYeIbsged1bDYeDg$0bd f vxYafg$0bg$qd1F V XiX V  v YaAg e drFYadg q dWhixx f 8 uuQVXUS f hoQhiWQSUV Uv uQVrv f iQf h f Sig ff "if uQp V Ye|g e dYedng q deTV h g V'X6 f hiV)v8WvQV i$ WvQV WhihiV'XvHxV v V iu|iuQx V Yeg q dYeduge d#V vihiVYX)iuQT V t f hib f vHx|uQV g XWg f vV'xV9hiWg TWV hV F YebcdTWV'hiV"F YacbdXV'i f S3T y` m f vHo x jiXw Wh mDw f vHx ji6 f vxiW`YeIbdgmjw f vHxBDhXw F rXV XXV V#u uQVXUS hoQhQSUV g f v 3VTiUV rV'x f X g QUvQuHV UvQU9Qf XQh g V nXv f f uUv f hf xsIy R CF
 B F 

i rlohqp 
RE '

#vQV# V HShV#uQVHXVA h SUSV'S WQQUvQUXriuQhQWu uQVXV SUHX V haXWrWf h Xo f UWvX8hq f Xo$CU f Hf f GiuQVg hivQV g WvHvQV g V'xtvQ f XV'x rV cg HS fg uQUvQV'X'b u s! P !Q##3#`% P 5 2!$#"!F! Gci %)3$#%&'!Wh(!F3 EIE !'3(y! EIE %)'&))25 3 P10 ! G 32# !34)SUQh f hiUV'X'F uQVIS f V'v g UvX g u g SXoiV haXWhrV'cQS fWg uHvQVYXeC5cXX g Qh hV'viSSVYXXriu vY&6jU vxHhWh vQHXvHuQVYXVhiV'XWQh g VYXrUXix f f f7 h hiV vHxUv f h f f SUSV'S f vHxxfUXhiUQVYx f g WQQUvQHF G 8 UV vi h SSUV 4 % S 9gPIA R @ YBy! !$##3#%CS!$) 1 3 E !F! GIi %)3d f SUWhuQXh uQVxv g f g Qf hiWf Wh f UvQQhQSUP V 5u f TWV3VH V v WGIG f EcUvQ V'x C SUS vHD x 9 h b F 9gPIA R @ f SWhiiuQXhuHV XhivH V'xUvQ QhQSUV u f TWV|V'f V v$f QhiW3Xf V'x E F RI3XWSU BV" SF cFAR WhiVWV'vQV h f S)XHxQWr f h f SSUV S f SUWWhiuQXWh8xv f g QhWh f gvH gf f v|3VXV V'v|Uv H GYI F 9PI( R @ f SWhiiuQX b3uQcrV TV hYbxvQ f WV8vii W Q i v W 8 Q Q v i v v x i X X Q V uHVvQ8V'h f T f US f QSUV Qhi g V'XiXWhaX 3V g X V8 g hixQV h)g if uQVQhifWQSUV f X B fWuQg V V Q Y  P )3' !Q)" #! %5d1FuHV vX uuHV WhiV"i g'f SUS V 8 g UV f vi f SWhiiuQX f hiVBQSUV V viiV'x WvhiVP f SV"P XoivQ fg uQUvQV'X'g buHV|X3V VYx  QHX W f UvQV'x f hiVV'vBxUX f HviUvQF SUvihx g V'x f XQSUV GIH ! PIE 3' P !$)U #! P %58xQV Sb g'f SUSVYW x V # X`YQ)U# R f SU f vT )  s  ! Q ! # { #  3 a # ! 0 3 F # 2 # a V b Y r  F W U W T Y V  X i h V X W v Q U S V Q i h ' V x i H v 9 X v uHV 3V hWh H E P Gci f Pv H g V iuQV f SUWhuQH XtQSUV V viiV'xWvV f XUvQf Hb f UvQSUg xQUXhiHVYx|V hb f h f SUSV'S fg uQUvQV'X'FR b y 9 SWhiiuQ vHXUXiX XVYHV v V8WrXQV'hiXV'HXXV h f V'x E Q) GIiPcH )2%dcq!$5% H )DeI!f PP %a3 PIE F nv g f XQ3V haXoiV f bQV fWg uQhig g V'XiXhV"V g VYX f XVf #UvHxV'V'vHxV'vV'h f UWvHXHXUvQS g f Sx f f f T f S f QSUVv V fg u Qhi g VYXXWh f uHV Xo htiuQV8XQ3V haXoiV b XIrV SUS X W9HvQ iv g WvXXoivHXV vHx f vHxhiV g V'TV gf VYXX f V'X'FRtv i # P 3#!Q5f % H ) Uv f f Xg QV'hiXV' g'gf WhihVYX3 WvxQXIiBXV'vHxUvQWhhV g V'TUvQ I X Y V X X ' V r X U v V u Q i h ' V i X X ' h F f R XUi US hf xQV SUXriuQfWVg g f ! g 3g !Q%)3Ih!i #&5% 1  53 #pfvF!BbQQhiW3XV'x f stV'uQvQ V 35!$#r cF nv uQXxQH V SPI b E W QhiP g V'XiXWhaX f hiV GcH g WvQvQV g PIiEV'xiuQhiWQu f vviiV h vQV UWv|vHV"orWh 3F uQV8V'h GIH ! PIE 3' P !$)U #! P %5 g WVYXthiW iuQV8 fWg iu f uHV g WhvH Q QSUV g X V v V fWg uHh g VYXXW h drqIW X WvXxV h HSS hV hiu v uQVviH93V h Qh g V'XiXhiX'FiR v E C2@ f SUWhuH g WvXXoaXg f XV'iQf V'v g Vf hiWQvHxHfX b f SV'hv f UvQgV'SS xV HvHV'xSU f S g WQQUvQ f vxWSUW f S g 8QvQ g f UWvFWdtWhi f SUSbxQhiUvQ fg Q vQhiWHvHx g r VHXVuQV3V'XtXVYiQV vi f S f SUWWhiuQ huQVHh g VYXXUvQ8Wx f ff T f US Q U S V S U S S F E C2@ f SWhiiuQX f X3V g f S g f XVW f s v y 9 f SUWhuHuQV hiV f SUSuHV f W8Qg'vQf iR v W 3 V h i H v e X H Q v r V X Q 3 V a h o X i V i h g V x Q v V U 9 v V t hV'S f UWvF uQVIv E C2@ g SUWWhiuQg XAf QSUV V'vf iV'xWv QhhiV viiS T US Qf SUVg8QSUQh iuQV'u i X X i h A X Q i h ' V X V i v ) X 3 V Y V xQHX f g f f f g
A #  # A

XUS hiuQVX3V VYxQHXgQhiV'x iV'xUviuQV h SWF uQVEvC2@ SUWhuH xVYXUWv S UXif UvQU 'VuQVvQ8V'hg XQ3V haXV X f vHxuHV f WQvif AS g'f S g Q f UWvf F V hVUviV'hVYXoiV'xv|W UvQvH h SUSV'S SUWhuHXriu vB3VUQSV'V viVYx Wv f T f Sf f QSUV f h f SUSV'Sg fg uQvHf V'X f vHx f Wf f Uv f g W f iHSVV"fV g g f iv UV'X f X8HhV x VYxvuQVEvC2@ xV'SbiUvHxV 3V vxV vigWuQV# h HS hoVIWviiV h g WvHvQV g UWv vQV"g orWh HXV'xF WQh ivHcSVYxWViuQV hiV f hiVBvQ f UgQSUV fV viVYxhiV'XQSaX HXUvQuHV b y 9FWv E C2@xV'Sh#XV'iQV v g V9XUS f ho HhQSV'F V9u f TVQSUV V'viV'xB#v f rV'cQSAiu E SvQxV'XIiuTV hiQhiWUXUvQhVYXQSiX'F  ttA t# "!$#%g"!$#))Ho'&( 0q) 21 vHB x  0q)   1 VorXhiUvQXvXWV f SUQu f 3V" FQ V V"y SUSAxQV'Xv ` f f h f SUSUV S f f SWhiiuQ `i g v  QVuQVXUS f hio3V"orV V vz f vHxz v v f 4 ! 3 b a V b Y  u o W Q i h ' V i X X i h X H v x SU SeV WhiUvV fWg uBQhi g V'XiXWhYU F tXvH f uQXhiV'XQSYbgV g'f v HvHx f g vWQU f f S f SUUW5 vQV g vif 3V"orV V B v f vB x F uQVXV'iQV vii S f SUWhuQX3iu f XSTVAV 8 g UV viSUtiuQXQhiWQSUV HXVuHVV g uQvQHV Qxv f g QhiWWh f f UvQHb"XWSUTvQ vvHX v ViuQVgQhQSUV t UvQ xT vi WV f SUhV f x g WQiV'xXSQUWvHXgWhIf X f SUSV'f hg vXo f v g V'XAuHVX f Vf QhiWHSfV'5f f F E C2@Wcs y 9xV SbgVxUTix4 V  vii XSTVuQVXUS f hoHhQSV' WvBiuQVv W|QUV g V'X'bWX V 5 b f vHxV fg uQhi g V'XiXWh687bA@@ 9 A9 WbHhiV g V UTWVYXuHV8XhivQu f vHx uQVuQV g VW  YaCB 7ED `$FHGIQPe` R7 GI d"F G g uQhi g VYXXWhS6 7 g W QQV'XuHVtV SUV V'v aX t 7 YaRQbcUedWuQVXQH f hig t 7 buQV hiV T9 fWRU 9 f vH" x Yaswq d 5 g V9 U"9 5 HXUvQ 7 XoiuQhiV VQhiV TUWHXV SUV V viiX t 7 YaRnw 0 bIUqd" b t 7 YeR w 0 bcUwe d vHT x t 7 YaRQbcUw q d"bW3V g'f HXVuHV hiV f hiVIvQS8uHhV'Vt f X) QQUvQ f v f SUvQV viV f ogV'V | v 9 RY f vH| x  U F V g f v f SUWu v 9 { RY gW ig u  Uw" f vHx f g u f X fWg Vg u  U bWh f SUW4 v 8 Rw"3ig u  Uw fgu u 9 RYiu u } U bWh f SUUW v 8 RXw$ i u  { U" f vHx| f g u f X fWg V fvHix u 8 RYF B t 7 b)V fWg u Qh g V'XiXh46 7 HXV'XuQVB3V'XXV'iQV vii f S g QViuQV|XQQ f ih U S W W i h Q u U S U S S F X V X iu f Qhi g V'XiXWhW6 7 b |f b g'f vWvQSU$Xo f h f WQQUvQ iuQg'V f V SUV V viip f 7 YeRQbIUq d XV'iV| X t V # h f uQVQhi g V'XiXWh6 7XD ` u f X g HVYx f h g uQVXQQ f ihu tY7ED ` YeRQbIUed1F stV'vQiVi`b7a b)W 9 cbIyc9 Wb f SUSiuQVV SUV V viaXguQVhiUWuiQvHx f hh YhUWui iXo g  SQ6 v d uH V yu f h| i Q u V X Q Q i h A Y t 7 F p B @ W i h V Q i h V U X V U S W f g bd`e7a )etY7YeRQb 5 dbYyw ed 5 g$b9 R@9ly  5 1 F V yu f h uQV xV iuQV 7 f SWhiiuQ UX9uQV7 WSUSUcvQeRIV h g WQ7 ivQ|uQx ruQV XQQ f f hi t buQVHh g VYXXWhe6 XV vHxHXBQhi g VYXXWhe6 Pe` uQV V SUV V viiX `b7a F #XUvQ`e7a biQhi g V'XiXWh6 7 Pe` g'f v g WQiVtiuQ V yiu f hriuQV#XQQ f ihg t 7 Pe` F RV'n h WTw hiWQvxQX bQuQVHh g VYXXWhf6 5 hiV g V UTWVYXf` 5` D ` f vHx g WQiV'XiuQV H5 haXoI f h iuQVXQQ f ihut 5 F nv uQV Wuw hiWQvHxbuQVQhi g VYXXWhb6 5 hV g V UTWVYX(` 5 D ` f vHx g WQQV'XriuQs V Wxu f hWeiuQVXHQ f hiu t vHx vQUXuQVYXuQV WQ ivF tXvQiuQX#X g uQV'xQQST V YqAQhV0 d1bgV g f vB5XV fViu f #vBiuQV HhaXg thQvHxf bHWvQSUHh0 Vg 'XiXWh6 ` rWh X F nvuHV8XV g vHxhiWQvHxbQhi g VYXXWhaXf6 ` f vxg6 p gh F #XtV f X i XV'V#iu f IUvhQvHB x yb f SSQhi g VYXXWhaX6 7 rWh biuHV hiVe9 Sl 9 yF
A $  A   A

 

   





H rf8qQAauY#wfihq1r"h{uYflnuwfpimmnkipwiz{h{f
'

hVYX3V g UTWV SUWF W S 1 QTS 1XU tsYeRQbI Uqd  f )et RQbc Uyw 6wybctI Rp w 0bcUSw "gFW YeRQbcUedbct Rpw wbcU 6wiy 1 b uQV hi4 V Yejw qd Y 5 g$e9 R9 Y 5 f vHxuYw ed 5 ge9lUe9| 5 F Yq da` %(' bl 9 yu 9W YWFUq d ) ` 1325b I  YWFUWFUq dT` %c' Yyvw qd 5 ge9R 9ly  5 C Ied d9lUe9 5 tyYeRQbcUed YWFUWFq0 dwv tVxDfVyagi hc`epr a qDsu6 t 7 PA 7 ` R YWFq0d ) `s q 1325b I YWFqFUq d 'brVbr)E&b Y `ea b 687ED ` d  YWFqFq0 dT` %c' Yyvw q d 7ED 5 ` ge 9 Rl 9 y 5 C Ied dl 9 Ue9 5 gihcprqDsut tyYeRQbcUed YWFqF w d ) `yfVi q W 1i2eb I v txuya `e7a 687 Pe` ) G H v  x  I R U S W h H u Wvw GIH 1v1& )2% G !Q5% H ) PH  ) E %5 i br%('b46( 3 G !Q) EH #Q3y5 i 3 PH e#{31  E %)'# 2ev 8Y 5  d E 3G 23)65%a!$# GIH 1  5%)'g5%143%)|3I! GIiPHG 3 EIEH$P  0P uQ V|Qhi VYXXWh 6 XV vHxHX `ea Hh g VYXXWhW6 p f V'h g WQivQ uQ| V yiu QSP Hqg H  5 SUvQVYgXWgiuQV ` v5  XQH f ` hi t ` FARiV y h WBw g W9HvQ g'f iv hiWHvHxQX'b Qh g V'XiXhb6 ` HvHUXuQV'XaXgh FyUS f hiSUWbQhi g VYXXWhe6 p HvQXuHV'XiX9gh f V hSW 8QvQ iv9hiWQvxQX F uQV'vb V h Wvwf g4 W8QvQ g f iv9hiWQvxQX bQhi g VYXXWh 6 7 g WvQ H XuQVYXig Xf rWh FWyv g VrVu f TWV f WQhi g V'XiXWhaX bg f iV hg Wyw g W9HvQ g'f ivhiWHvHxQX'b U S S Q u V s B W Q i h Y V X X W a h X u W T V H Q v X Q u Y V  x Q u ' V I h r W h F f G uQhi g V'XiXWhHXV'f X XV'iQV'vi S SWhiiuQi HViuQVtXUS ho8XQQ g iuQX SUWhf uQ V'Xyf 8f Y v d WQQg UvQUVF f f hiut fW7 g F uHXr f f g 5
A A

"!$#&%('0)$132547698@)A4B)$!ACD')E1GF HPIRQTS 1VU Yqd uQV|viH93V hXW$WQh g V'XiXhiXsY0d uQVvQ93V ho9WiuQVQhi g VYXXW hYb uQV hiVW9 ( 9lW` f vHx Y wd uQVXhivQB f vHx|uHVXQHXhiUvQB 7 WgX V8 f vx 5 b

f SUSBUvV fWg uQh g V'XiXh f vx|XV vHxUvQuQV3WQvx f hiVYX#uQVhiuQhi g V'XiXh'buHV WhihiV ivQV'XiXW3iuQV SWhiiuQ g WV'Xv f iQh f SUS9h iuQV g WhihV g vHV'XiXuHV#XV'iQV vx g f S f g SUWhuHF uQV'f vb f iV hIWv w W9HvQ g'f iv hiWQvHxHX bxtsYabdjd)USSXhV#uHV XUS f hio V ogV'V vuQVXhivHs X f gvu x F  d! % %g'AQo  W W i ' W ''   " WW Y " "' WW 'Y WW Y   '' WW WW '' ' WW '  '' ' WY "Y Y W i a W YY   W W "Y W WW '' ' " Y Y Y WW   i W Y '
A 2 6 C 2 G 6 I82 6 RI82 6 2A4P6 4 47I7IFC 4 4 4 G7070 4 0 C 0 4 4 4 4 4 4 4P664 4PI P1 47C1 4 4 4 4B2 G C 4 4 4 4 4 G 4 4 4 4 4 PI1 4 4 4 4 4 PC G 4PI 0 IB256 4 22 G I 4 4 4 66 47I (2 4 0 47I P1 4PI6 0 4PIF4P6 47176 25476 25476 2A4P6 4 32 I 4 4 4 P61 4 P682 4 0G 4 R6F4 7C R6F4 7C 7471 7471 PI7I G G I7I 7 PIFC7C G 4 G 4 I7I 64 G 6 4 2564C 4 4 4 4 32A4 P4 4 325I76 4 4 4 64 PC 64 PC 14PI 0 1 IF1 P44PI 6 G 7I G 2 4 3272561 4 32 4 4 G7G 4 3254 74 2 2 0 1(2 I76FCPI 6I47C IF4PIC 4 32 6 0 6 4 4 4 6 G C 0 P41 P41 G 0 I GG I 74 6 2 C 0 1 I 71 CB25176 CB25176 G C 0 I 32 6 0 6 I 17C 7C 2 P4 G CC 0 I 0 PI1 I41(2 G7G 256 P6 G70 4(2 0 C C(2A1P6 C 71P6 G 6F4PI 2 2 2 4741 C(25176 2 G 47C4 G G I7I4 C(2 G 1 2

2 br%('b4 585 i 33) H 0 # ' HQP %5 i 1 tyYeub djd %## E 5 HQP 3u5 i 3 EGIHQP 3 H 0 5 i 3 e % 1 E %#! P %5gec35 3c3)z5 i 3 E 5 P %)' E !$)  uQV hV'(QhicTWVYX8u iV hWFw W9HvQ f iv hiWHvHxQX'b)Qhi g V'XiXh 6 5 PHqHH 0PvQ XuQVYXiXrWh Fyv g Vrf V f fhiVVYXXV vii f SSU g g QUvQg' iuQVXUS f hioXV'iQV vx

'@ 5 AX YVW '@ 9 TR SU '@ ) '@A '

 7

4 7 4 %% 4   %

 !2'(3) %4&!!'('(!#"))$ 00"1 "$3'3'33)6)655 7 0'3)(53"$ 4%  7 47%

2'

a`H wljckamuY3ku'imnkaljkaph{kam
6

47% 0' BDCFE6GI8(H2' PFPQE(CFP )6'

3'

90'

47%

9 4 7 % AX YVW ) TR SU % 4  %   ' 2' 4

5 %

 ' 9 %4 )(53'0' )(5393""" )65$ 7 5 " 5$!90 8(53)


7 %

3 wljckamuY3ku'imnkaljkaph{kam

3'

 B C E 8(G ' H!P P E0C P )6'

4%

0'

90'

47%

Vu TVUQSV'V viVYx iuQVF8Y&WdhQvHxQXXUS f ho f SUWhuQ v f rV cHS iu E SvHf xQV'X'FG fg uBvQxVu f X# E @PtRA@ V h vHxhV8 @h X f F uHVvHxQV'X f hV g vQvQV g VYx iuQhiWQu f ' @|$UviV h g f vQV g UWv vHV"orE Wh 3F uQVW UvQV'x9UV'X)XuQciu f )u E 18!$##XVYHV v g V'X'b'iuQV g 9QvQ g f UWv9UV UXXvQ g f vif uQV v g f hiV'xi uQV Q UWv UVuWhiVtiu v f vHx E Qh g V'XiXhiX'b"hiV'XV g UTWV'S8YQY rIQY f vg x9 rf ' Sd"F1uQV vrV f QQSUuQf V f SUW huQ XV'iQV'v g VYXIWhiV f V'hu f v|QYbXUvQvQVWhIogHh g VYXXWhaX'buQV8 f vBV'h UXgvQgV vQQWui9XWSUTWViuQVtQhiWHSV'F uQVtQUS f UWvWX f TV'XHXgV f vQvHWSUV'XiX hVYXQSUvQtUVYX F uHUXegQSxvQWA g g QhAQiuQVvQxV'Xeu f TVWhiV f Uv9V'hF uHX gVu f TVXQQHhVYXXV'xuQVYXViV'X'F nvWV'vQV h SbQiuQV8UQSUV V vi UWvWuQVv E C2@ b y 9 f SUWhuHXuQcIXiu f uHV uQV'WhiV" g f Shf VYXHSaX f hiV g Wv Hhif V'xviuQVQSUV V'v f iv F gtY!(o Vu f TWVQhiV'XV viiV'x f v E C2@ s y 9 h f SUSUV S f SWhiiuQ h u 8Y(Wd 8QvQ UWv hQvHxQX g WQQV#uQVX g WhiVtiuQVXf U S f hi o3 V"orV V v og8XoihUvQgX'F nvuHUXgg f f 3V h gVIu f TWVIrWh WVYx8u f QV'xQS g X V# 5 r 5 FiV f hiV g HhhiV viSU9rWh vHiu v !w0! 5!$5% Q3 GIixH % G 3 H 0 5 i 3 Hc 5%14!Q#e# HG X E %dcq3iQhuQV'hxQV g hiV f XVuQVhQvHvQvH f UVWe iuQV f SUW huHF u 8Y&Wd W9x uQV f SUWvQV'vV ogV'V vBiuQVorXhiUvQX g f v|VW f UvQV'xi vQ g'f ivhQvHxQX fWg ih fWg UvQh uHV SUcgV'hhUWui g hvHV hWriuQV hx Wg h f QuUv 8Ye gmdjdiV FAqHWhiuQX b tsYeRQbIUedWh f SUSg3WUviiX8iuQV hxWh f Qu8HX83V XoiWhiV'xxQhivQ8uHV g WQQ f iu v YhVYiQhiUvQ8 8YaodjdIX fg e V d1FQR XSUUWuiSU x 3V'hV'vi f S
A 

WWhiuQ uQ g uBXVYXIWvHSx8YUv )ub d 1 dX fWg VX f SUX3V UvQUHSV'V'viV'xWvuHV E C2@ cys9 xV SF v tXvQIuQVF@BWvQV)QhiW3V hUV'Xg ciWiiuQVhxsIRyCbcRISUTWVYXp35j!Q#  u f TV)QhiW3XV'x v 4 8 YSUWj Wd 8QvQ UWvhQvHxQXv E C2@ s y 9xQiv HhWh f vQ f SUWhuQ f WhXSTUvQig uQVXoihUvQ g fVYxUvQHhQSV' V ogV'V v f XfhivQgu f vHx f SSXQHXhivHX# o X i h U Q v   A F V i h V r W h i U Q v u i Q u V U Q S ' V V i v U W v SXvuQXQhiWHSV'F fqQQhuQV'hhVb"gVf UviV'vHxV HShViuQV 3cTWVxV Xf#XSTxVeV i uQf V) 8QSiHSV f SUWvHV'vi f f QhQSUV F T % %Q%t%  YI'rlo rY) t hfWr z jckauYmal x zqj#h{f3 r'ak8 kalnkimomaloxh{f r kah if k' xrYifWlnu'p sH uYf sWur fWaiY y r'lomnrY3z{zqk1zAf pW firYh{ lnu' n m Y u a k p { h f q z 1 k   ! #"$"&% ')(10325476981"A@CB !7DE$F G E )&IHPHRQQS5QT"$Q U)V698 E WHXY"3P8`6a56bG`VoxQr Yk r a akagWokau'p3 u u d y Ac3sr sQlox uYlnk v r'zz{r'HxerYfWp sQr'aUr'pph{fe W t m o { z q h a x c ce f1hqk1fr'lor'z{z{kaz r'zqYuYx loh r m  Y u l m o o l { h f  a k i p q h { h W f  Y r f 8 p n l 1 k { z r o a k p n l Y u i { z a k 1 m !`g "$Bihqpr@ EsF Gq6ap{x W Y5t Y vu W #Hr ilohxrYfp # rYwir'iz{hfH#sW1r'zr zqkir'lorYz{zqk1zYkauYkiolohqrYzq'uYloh im g 1 k i W f Y k x  uYl)Va6buY r'lnximnkr'Y'lokar'm h{fkap twiz x hqauY wokalnm1 fwxPvpC"A@CBy698"37G7HRQd@ E$F G`6H6bXHRQdI50 vu E D i g k1ci5fWt k#u p3 iW3x Aur'lnmnkrYlorYh{fWkap8ir'lor'zzqk1zHrYzq'uYloh im1 !RF 5HRQ g GAPx"$Q UV9698 E H'x " i WW " rY'z{h{xzeAr'ufz{ p wiIh{r yr drl f Qyr lorYz{zqk1zepWifrYhqlnuY'lorYh{fWi3kiifihq1rYz )kauYltdrs h jckalnmoh AuiwWokalsW1hqk1fWak rkaW"{x ' W187 ' r { z  h c z ' r f Aayr lQ gfrYhqilnu'YlorYh{f)hqauYfjck5h cxaufW1rajih rYfWm3mnir'lnmoh p W ilnu'YlorYh{fix )V6bHRQr@ EsF G`6aV ! 5Vxir'Ykimg vu Wx r 1 k f Y q z a k 1 l r f  h f o n l c u i p w i q h u 9 f o  u ' r o l ' r { z { z a k 3 z p i f ' r q h 56bGqAW6aV9 @ EsF G`6b ! 5VXVx 7t u iWx ' y # t grYuz{z e rYcfWx p I ru1z{hfWi gilnu h{r ok9moloh{fr oih{f @ EsF G`6ap ! G`Vd D ox Y 5 IvW vrwif trYfWpt "sW~ix r'f mc rfr'z{'uYloh i uYlphfekalnk1fh{rYzUgizqkeauYr lohqmnufh@ EiE p "A@CBx 5 vu ImosQh{ }rh{ z{r'glokah kahqpikam) zqk1h{ rYf8 rYfWr'p r ut r' 1 hqpwimnfWkamnWiwHka fakYuYk1QfWkalou rYzHi kiok1Wh{uc p r iz{hq1r zqkou#Wkmnk1r'lnuYl f k { h f n l u f 1 m ehqpXBjRQSpkxpxi 5t ' u x Yy #'sk1z{z{kilnm1kekau'lr'fp#auiwWr"h{uYfukijcuz{wh{uYfir l#pihqmrYfika m tyr"nokalof#lnkaiuY' f h { h Y u f xhqph"$Q U698 E ox 15t vu x WIyvWsWs9 kye w r'zr r'ifh{p fvA uY k1hqpr'rYffchqx m1 g  6b)G56bbl6m@ EsF Gq6H)6bHQIBnRQS5GqQoHwkIbRQo D W w { z q h o m sWh #r'fpt sHt u r"okalo r'f Ypi k1fhfgi1r"hqufuWau#uftuzqka1wz{r'lmowmnkaWwk1fWakam1 hqp#BnQSpTkIpx i 5 x i c pr'z{hr'f" ilohqpiYh{f# uc pik1z)u'lr lorYz{zqk1ziuiwWr hquYfq @ EiE Gq7H6brP1698 "A@CBx Y5t u 'x W WIsQ' x wrYWf p9dtcrYfckal1Hr mAo5k emnk1r'lnih{f#rYz{zqu1h{ftkalnlnuYlnm1I @ EiE pX"A@CBx Y5 t u
A 2 " '   E  (' #" 644P6 6 ' ' 9' 251 I 1 C 1C7C 251174 G E ' ' "  )61C G 4 0 925171 G 8E 6 G 32 0G 2 0 &251171 I &" 4 P4 H1B2 E 2511(2 7E 71 0 2A17176 0 @' 2547I 9E C 0 272 2A171 C ' " ' 256 G CB2 P4P6 9251C74 1 F' 6F4 G IF4 G I G &251 00 E 2A4 B ' PC G I G 2A1 0 4 22 2 G IF1 GP0G 9251C74 256 2A171 0 2 G 2 P 2 0 32517I 251 0 &2517CB2 ' GG C 2A4 G 2722 &251174 25I G I C G 1B2 2511P6

10

11

12

13

14

15

DNA-Based Modelling of Parallel Algorithms


Leonardo Vieira Cervo1, Leila Ribeiro1
1 Instituto de Informática, UFRGS, Porto Alegre, Brazil
{leo, leila}@inf.ufrgs.br

Abstract. The area of computational biology is growing fast, fed by a revolution in DNA sequencing and mapping technology. With these new technologies for the manipulation of sequences, the relevance of finding efficient solutions to the so-called computationally intractable problems has also grown, because many problems involved in analyzing DNA belong to this class. One approach to finding such solutions is to use DNA itself to perform computations, taking advantage of the massive parallelism involved in operations that manipulate DNA sequences. This is what is studied in the area of DNA computing. This work proposes a formal model to represent the DNA structure and the operations performed on it in the laboratory. This model can be used to analyze DNA-based algorithms, as well as to simulate such algorithms on a computer. We use graph grammars, a formal specification language, to model the DNA sequences and operations.

1 Introduction
In computer science, many problems are called intractable because no known algorithm can solve them efficiently, that is, in a reasonable amount of time. Among them is a large class of problems that could only be solved efficiently by a machine able to perform a huge number of parallel steps (in theoretical computer science, a typical model for this is the non-deterministic Turing machine). This class is called NP-Hard [1], and many problems of practical relevance belong to it. For example, many problems in computational biology are included in this class. Recently, concepts from biology have been contributing, with great potential, to the computing area, helping to find other ways to compute solutions to problems considered intractable under the classical approach. One of the most promising ideas is to use DNA sequences and their manipulation to solve computational problems [2]. Due to the number of chemical reactions that can occur in parallel during the manipulation of sequences, problems that were considered intractable under the conventional model of computation are now receiving a new interpretation [3]. The first ideas on DNA computation were presented by Adleman in 1994 [2], and since then many efforts have been made towards improving the results presented at that time, by designing more efficient DNA algorithms (requiring


smaller amounts of DNA) and by minimizing the error rates that may occur during DNA manipulation, leading to more reliable results.

To better understand the possibilities and limits of applying these new concepts, and of DNA and its manipulation techniques, as well as to describe a model that can serve as a basis for the analysis of DNA-based algorithms, it is necessary to construct a suitable mathematical model. This model should be an abstraction of reality that preserves the basic characteristics of the studied systems. In particular, parallel models must be considered, because parallelism is a main feature of DNA-based algorithms. However, few investigations have been made into formal models for DNA computing. Some of these works use new kinds of string grammars to describe DNA algorithms [4]. Here we propose a model based on graphs instead of strings as the basis for the description of DNA computing. Graphs are a natural way to explain complex situations in an intuitive form [5]. Rules between graphs can be used to capture the dynamic aspects of systems. Graph grammars [6][7] combine the advantages and the potential of graphs and rules in a computational model, also taking advantage of the graphical representation. Moreover, parallelism appears naturally in this model, because many rules may be applied to different parts of the same graph at the same time. The theory of concurrency and parallelism has been extensively investigated for graph grammars [8].

In this paper we propose a mathematical model for DNA-based algorithms using graph grammars. We model the DNA structure as a graph and DNA manipulations as graph grammar rules. One of the advantages of our approach is to reduce the costs and errors of DNA-based algorithms. DNA computing is typically costly due to the use of expensive enzymes and laboratory procedures, and, as in conventional computing, many errors can be made during the construction of an algorithm. Therefore, a way to verify or test DNA-based algorithms before implementing them is worth investigating. This model can be used as a basis for analysis methods for DNA algorithms, and also for simulating such algorithms on a computer.

This paper is structured as follows: Sect. 2 gives an overview of how to use DNA sequences to solve computational problems; Sect. 3 introduces graph grammars and how to use them to model DNA sequences and operations; Sect. 4 concludes our work. To illustrate our approach, we show as an example the graph grammar corresponding to the DNA-based solution of the well-known hamiltonian path problem. We assume that the reader is familiar with the basic notions of DNA structure and its manipulation.

2 Using DNA to Solve Computational Problems


In 1994, approaches that used the biological properties of DNA sequences to store and manipulate information appeared. The idea was to use a large number of sequences as "processors" which compute in parallel. This type of computation with DNA sequences is called Computation with DNA or DNA Computing [9]. Adleman [2] used biological experiments with DNA sequences to


solve some instances of the Hamiltonian path problem, which is considered intractable because of its NP-completeness [1]. Lipton [3] showed how to extend this idea to solve any NP problem and discussed the practical relevance of this approach. He defined a model of biological computation that has the ability to manipulate large collections of DNA sequences (in test tubes). The execution of an operation on a test tube means some simple manipulation of each sequence in the tube. In that way, each sequence corresponds to a piece of information, and all these pieces can be modified in parallel. In the following, we present the Hamiltonian Path Problem and give an idea of how this problem can be solved using DNA sequences, as in [2].

2.1 Hamiltonian Path Problem

Consider a graph where the vertices are cities and the arcs are flights between the cities, where one of the vertices is called the initial city and another the final city. The Hamiltonian Path Problem consists in answering the following question: "Is it possible to find a route from the initial to the final city passing through all other cities in the graph without visiting any city more than once?". An instance of this problem can be seen in Figure 1, where we have 4 cities, with Atlanta being the initial one and Detroit the final one. To model this problem using DNA, we first describe each city as a DNA sequence (Table 1), where the first half of the nucleotides is called the first name of the city, and the second half the last name. The existing flights are then coded by concatenating the complement of the last name of the origin city with the complement of the first name of the destination city (Table 2).

Figure 1 Hamiltonian Path Problem Schema

Table 1 Cities coded with DNA

City      DNA Name
Atlanta   GCTA
Boston    AGCT
Chicago   GATC
Detroit   TGAC

Table 2 Flights coded with DNA

Flight              DNA flight number
Atlanta - Boston    ATTC
Atlanta - Detroit   ATAC
Boston - Chicago    GACT
Boston - Detroit    GAAC
Boston - Atlanta    GACG
Chicago - Detroit   AGAC
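As a sanity check on Tables 1 and 2, the flight codes can be derived mechanically from the city names. The following minimal sketch is ours, not the paper's (the helper names complement and flight_code are illustrative); it reproduces the values in Table 2:

```python
# Watson-Crick complement of a DNA string (sketch; helper names are ours).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(seq):
    return "".join(COMPLEMENT[base] for base in seq)

CITIES = {"Atlanta": "GCTA", "Boston": "AGCT", "Chicago": "GATC", "Detroit": "TGAC"}

def flight_code(origin, destination):
    # Complement of the origin's "last name" (second half of its code)
    # followed by the complement of the destination's "first name".
    return complement(CITIES[origin][2:]) + complement(CITIES[destination][:2])

assert flight_code("Atlanta", "Boston") == "ATTC"   # matches Table 2
assert flight_code("Chicago", "Detroit") == "AGAC"  # matches Table 2
```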

18

The complements of the city names are the Watson-Crick complements. For the instance of the problem shown in Figure 1, there is only one hamiltonian path, and it passes through Atlanta, Boston, Chicago and Detroit, in this order. In the DNA computation, this path is represented by a sequence of size 16. The algorithm used for the resolution of the problem consists of 5 steps:

1. Generate combinations of possible paths randomly;
2. Remove all paths that do not start in the initial city (Atlanta);
3. Remove all paths that do not finish in the final city (Detroit);
4. Remove all paths that do not have a size of 16 nucleotides;
5. For each existing city, remove the paths that do not include it, thus excluding those that do not pass through all the cities.

If any paths remain after the execution of these steps, they are the solutions of the problem (the answer to the question is yes); a string-level sketch of this filtering follows.
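The sketch below (names are ours) mimics the five steps in ordinary code and is purely illustrative, not the DNA algorithm itself: in the test tube, step 1 happens massively in parallel through hybridization and ligation, and steps 2-5 are wet-lab filtering operations.

```python
FLIGHTS = {("Atlanta", "Boston"), ("Atlanta", "Detroit"), ("Boston", "Chicago"),
           ("Boston", "Detroit"), ("Boston", "Atlanta"), ("Chicago", "Detroit")}
CITIES = {"Atlanta", "Boston", "Chicago", "Detroit"}

def candidate_paths(max_cities=4):
    # Step 1: generate path candidates by chaining flights.
    paths = [[a, b] for (a, b) in FLIGHTS]
    for _ in range(max_cities - 2):
        paths += [p + [b] for p in paths for (a, b) in FLIGHTS if p[-1] == a]
    return paths

solutions = [p for p in candidate_paths()
             if p[0] == "Atlanta"         # step 2: starts in the initial city
             and p[-1] == "Detroit"       # step 3: ends in the final city
             and len(p) == 4              # step 4: 4 cities = 16 nucleotides
             and set(p) == CITIES]        # step 5: passes through every city
print(solutions)  # [['Atlanta', 'Boston', 'Chicago', 'Detroit']]
```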

3 DNA with Graph Grammars


The main idea of the graph grammar model is to describe the states of a computation as graphs and the operations on these states as rules having a graph on each of the left- and right-hand sides. Here we will not present the formal definitions of graph grammars; we concentrate instead on intuitive explanations of their syntax and semantics.

3.1 DNA Sequences as Graphs

In our approach, DNA sequences are modeled as graphs. The vertices correspond to nucleotides and directed arcs describe the links among them. To distinguish different types of nucleotides we use labels on the vertices. Moreover, we have some additional types of vertices and arcs to describe side conditions (like temperature or the presence of some specific enzyme) and nucleotide characteristics (like orientation, or being the begin/end nucleotide of a sequence), respectively. In Figure 2 we see a graph model, called DNA-GG, of a DNA sequence of size 4. The labels A, T, C and G identify the different nucleotides in the molecule. Labels 53 and 35 indicate the orientation of each nucleotide (these arcs are actually loops on the corresponding vertices, but are drawn as arcs from the label to the vertex to ease understanding). Labels I and F mark the beginning and end of each sequence. The simple arrows represent the linking between consecutive nucleotides and the double arrows the linking between complementary nucleotides.

Figure 2 Representation of the DNA structure using Graphs
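The DNA-GG graph of Figure 2 can be mirrored in an ordinary adjacency structure. A minimal sketch follows; the class and field names are ours, chosen only to echo the labels in the figure:

```python
from dataclasses import dataclass, field

@dataclass
class Nucleotide:
    base: str                     # vertex label: 'A', 'T', 'C' or 'G'
    orientation: str              # '53' or '35' loop
    marks: set = field(default_factory=set)   # subset of {'I', 'F'}
    next: "Nucleotide | None" = None   # simple arrow: backbone link
    pair: "Nucleotide | None" = None   # double arrow: complementary link

def single_strand(seq, orientation="53"):
    nodes = [Nucleotide(base, orientation) for base in seq]
    for a, b in zip(nodes, nodes[1:]):
        a.next = b                # link consecutive nucleotides
    nodes[0].marks.add("I")       # begin of the sequence
    nodes[-1].marks.add("F")      # end of the sequence
    return nodes

atlanta = single_strand("GCTA")   # Atlanta's city strand from Table 1
```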


3.2 DNA Manipulations as Graph Rules

Rules consist of a left-hand side, a right-hand side and a mapping between them. The left-hand side models the situation that must occur for the rule to be applied. The right-hand side models the effect of the application of the rule, and the mapping shows which items are preserved or changed (deleted or created) during the application of the rule. In the graphical representation, the mapping is given implicitly: items that appear on both the left- and right-hand sides are preserved, items that appear only on the left-hand side are deleted, and items that appear only on the right-hand side are created.

The hybridization rule (Figure 3, left) represents the operation of hybridization. On the left-hand side of the rule we have a vertex modeling low temperature of the solution and two disconnected complementary nucleotides with opposite orientation (the fact that these nucleotides belong to single strands is modeled by the looping arcs on each nucleotide vertex); on the right-hand side we see that the temperature did not change and the two nucleotides with opposite orientation are now connected to each other (the looping arcs are removed because now we have a double strand). The denaturation rule (Figure 3, right) represents the operation of denaturation and is modeled analogously to the hybridization operation, with the difference that it occurs when the temperature is high and transforms double strands into single strands. Similar rules exist for all other possible complementary nucleotides. Note that these rules can be applied simultaneously to all nucleotides in a sequence.

Figure 3 - Rule 1 and 2: Hybridization and Denaturation
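On the structure sketched in Sect. 3.1, these two rules can be read as conditional updates. The following is a simplified sketch of ours (it ignores the relative positions of the strands, which the graph match handles):

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def apply_hybridization(nucleotides, temperature):
    # Left-hand side: low temperature plus two unpaired complementary
    # nucleotides of opposite orientation; right-hand side: pair them.
    if temperature != "low":
        return
    for x in nucleotides:
        for y in nucleotides:
            if (x is not y and x.pair is None and y.pair is None
                    and x.orientation != y.orientation
                    and COMPLEMENT[x.base] == y.base):
                x.pair, y.pair = y, x     # create the double arrow

def apply_denaturation(nucleotides, temperature):
    # Inverse rule: high temperature deletes the complementary links,
    # turning double strands back into single strands.
    if temperature == "high":
        for x in nucleotides:
            x.pair = None
```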

The ligation rule (Figure 4) represents the operation of ligation. The left-hand side shows that, for this operation to occur, the presence of the enzyme ligase in the solution is needed, as well as nucleotides in the configuration shown in the graph. The effect of the application of this rule is that the ligase enzyme is still present, and the nucleotides that were marked as the begin and end of a sequence are now connected to each other.

Figure 4 Rule 3: Ligation
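A sketch of the ligation rule in the same style: the bridging test below is our simplification of the configuration shown in Figure 4 (two strand ends held together by a common complementary strand), under the assumption that strands are stored left to right.

```python
def apply_ligation(nucleotides, enzymes):
    if "ligase" not in enzymes:
        return                    # left-hand side requires the ligase vertex
    for x in nucleotides:
        for y in nucleotides:
            bridged = (x.pair is not None and y.pair is not None
                       and (x.pair.next is y.pair or y.pair.next is x.pair))
            if "F" in x.marks and "I" in y.marks and bridged:
                x.next = y        # new backbone arc joins the two sequences
                x.marks.discard("F")
                y.marks.discard("I")
```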


3.3 DNA-based Graph Grammar for the Hamiltonian Path Problem

In this section we give some ideas on how to solve the hamiltonian path problem using DNA with graph grammars, following the approach presented in Sect. 2.1. Due to space limitations, this solution will only be sketched. The modeling of the sequences used in the algorithm as graphs is straightforward. These graphs (corresponding to cities and flights) are shown in Figure 5.

Figure 5 Cities and Flights Model

Figure 6 Example of (part of) States


Figure 6 illustrates examples of graphs representing (part of) the states obtained by applying the rules described above to the subgraphs corresponding to the hamiltonian path example described in Sect. 2.1. Graph G0 represents part of the initial state: we have the subgraphs corresponding to the cities and to the flights, and a condition (low temperature). Here rule Hybridization may be applied, leading to graph G1. If we introduce ligase into this graph, we can apply rule Ligation and obtain graph G2. This example shows that it is possible to execute the first step of the algorithm that solves the hamiltonian path problem, starting from a large amount of graphs for each city and each flight.

Next, the rules in Figures 7, 8 and 9 can be used to find the answer to the hamiltonian path problem starting from graph G2. The first rule in Figure 7 marks with F1s the sequences which have GCTA (Atlanta) as their starting subsequence. Then, rule Final City marks with F2s each sequence that has TGAC (Detroit) as its end subsequence. Rule Detect Size marks with F3 all sequences that have a size of 16 nucleotides. Finally, the rules Other Cities mark with F4 and F5 the sequences that contain the cities Chicago and Boston, respectively. Here we omitted the rules that propagate the Fs along a sequence (for example, there is a rule that, upon seeing two consecutive nucleotides labeled with F0 and F1, turns the F0 label into an F1 label). After applying all these rules, the resulting sequences marked with F5 are the solutions of the hamiltonian path problem.

Figure 7 Rules Initial City and Final City

Figure 8 Rule Detect the Size

Figure 9 Rule Other Cities
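Read at the level of the underlying strings, these marking rules implement a staged filter in which a sequence is promoted from F1 up to F5 only while every test succeeds. A minimal sketch of ours (the function name is illustrative); a sequence reaching mark 5 is a solution:

```python
ATLANTA, BOSTON, CHICAGO, DETROIT = "GCTA", "AGCT", "GATC", "TGAC"

def final_mark(seq):
    mark = 0
    if seq.startswith(ATLANTA):             mark = 1  # rule Initial City (F1)
    if mark == 1 and seq.endswith(DETROIT): mark = 2  # rule Final City (F2)
    if mark == 2 and len(seq) == 16:        mark = 3  # rule Detect Size (F3)
    if mark == 3 and CHICAGO in seq:        mark = 4  # rule Other Cities (F4)
    if mark == 4 and BOSTON in seq:         mark = 5  # rule Other Cities (F5)
    return mark

assert final_mark(ATLANTA + BOSTON + CHICAGO + DETROIT) == 5  # the solution
```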


4 Conclusion
In this paper we have shown how to model DNA sequences and DNA manipulation operations using graph grammars. We thus obtained a formal description of DNA-based algorithms, and this description can be analyzed to verify properties and correctness using standard verification techniques. Moreover, the formal model can be simulated to get an idea of the results of the computations before implementing the algorithm in the laboratory. As the main interest is to use this approach to solve problems that are intractable on a computer, simulation is of restricted use (because the simulation on a computer will necessarily be inefficient). However, some bugs in the algorithms can be found in this way, without spending money and time on laboratory experiments. For the model presented in this paper, we have constructed a computer algorithm using L-systems [10]. This algorithm provides a nice simulation of the execution of all steps discussed in Sect. 3. We expect to use this model as a basis to develop reliable and efficient algorithms to solve NP-Hard problems, especially those involving DNA analysis (in this case, some of the effort spent on coding the problem into DNA sequences is not necessary, because we start directly with the sequence to be analyzed).

References
1. Garey, M. R. and Johnson, D. S. Computers and Intractability: a guide to the theory of NP-completeness. W. H. Freeman, San Francisco, 1979.
2. Adleman, L. M. Molecular computation of solutions to combinatorial problems. Science 266, pp. 1021-1024, 1994.
3. Lipton, R. J. Speeding up computations via molecular biology. Technical report, Princeton University, 1994.
4. Păun, G.; Rozenberg, G.; Salomaa, A. DNA Computing: new computing paradigms. Springer-Verlag, New York, 1998.
5. Ribeiro, Leila. Métodos Formais de Especificação: gramáticas de grafos. In: Escola de Informática da SBC-Sul (8. : 2000 maio 15-19 : Ijuí, RS; Foz do Iguaçu, PR; Tubarão, SC). Livro Texto. Santa Maria: UFSM, 2000.
6. Rozenberg, G. Handbook of Graph Grammars and Computing by Graph Transformation. Vol. 1: Foundations. World Scientific, Singapore, 1999.
7. Ehrig, H., Engels, G., Kreowski, H.-J. and Rozenberg, G. Handbook of Graph Grammars and Computing by Graph Transformation. Vol. 2: Applications, Languages and Tools. World Scientific, Singapore, 1999.
8. Ehrig, H., Kreowski, H.-J., Montanari, U. and Rozenberg, G. Handbook of Graph Grammars and Computing by Graph Transformation. Vol. 3: Concurrency, Parallelism and Distribution. World Scientific, Singapore, 1999.
9. Rooß, Diana; Wagner, Klaus W. On the Power of DNA-Computing. Information and Computation 131, pp. 95-109, 1996.
10. Lindenmayer, Aristid. Mathematical models for cellular interactions in development, Parts I and II. Journal of Theoretical Biology, 18:280-315, 1968.


 "!$#%!'&()021$1'341 ()65 " ( 798@1A!$#%!'&B()DCEF3G1A!$47H1 I PRQ'TVS UXW P%Y$`$acS bedfThgpiqUXrqiqsut P g P bwvBixiRyd TVS U Q W P%Y WbXtWuvB 4 Fz dFeqflm |uocfh gqe hieq}H jkflouf dn Fmeoc~qpBF eyBouf"kq$tz2}Houfu qrfleRojsm pocptHeuouvwBxocgyl}HgRou{  p e q g % f R g lxlouV| yFR RFuFuqf4gRFe Bl~quFFeyuBBouqf{ BR'2$kl VocxVxyRx2h 'u'ocxh FR B" yVflxBeRuxxx  gqe pXF|u ou|uxVg gqxocRxq2pXxxouocflpyen fle oudh u|ujkfud xA eqfle %ouyflr e h c j l f p w f e F u o w f  c | F u | l  ' X p euocoupe Foufl eg d% 9VV@ p q fypVvFxFxVocxe VvF%FBou|cpXR fl dh pX|uyld4 yg@ FexFdh |u|ufu|gR |uocxp 'fwy) )x FyxuBe |u"fwjHf@F|u|2 puq Fd ouflVe x fpf yT P $vXtTe P bwv P ) P y)iRydWu) P UXvBiRyv P ATVUBiRvW P ybWuy P ) ` vBiRvW P yiqsfW P s PRQu Wcb bTF ` TV yFT% P )iRUWcb ` P yt P P P s PRQy bXTFiqUtP iqyPRd iqsuW Q yfTVyvFT"FiRy P )iqUTiUBTV FTVyvs iR fAQ TFdbXTF TVyFTv ihdfiRvBiR`iybT ` vt ` bP iRydfb P AbXTF ` TVyFTFb ` b ` iRsus iRWu)Wuy Q iqvpWcdf Wuy bwWu)WuscY' iRPUWuvXWcTFb bTF TVyFT yVvPW P yb P P TVY'y vXW iRsuy Q dG u i w b c W F R i s W f y v'` TFiRUBXt" P srWcb P yT vXtTQ P bv$ P P ` sciqU2iqP @ P 'iRs QP UXWuvtfb$v P d P vXtfWcb{fW P TVby Wus ) F T TFb$ @` iqUXWcb y"iqyd) TVyv 2TVUiqvXW y2 Y' fU PqQ UBiRbxiRUBTGeWcdTVs iqyd P TVvByTVFy P UBiRiRsuW v P yfU B T F T  v F T d u W y c s R i c W F T b UBTG Wuyf dWTVUBvTVtyiRv v)iRY' r P PR suruQ TFdGeWuvXt Q TVy P T{bTF ` TVyY'V Wuy Q iqydGP iRyiRs bwWcbpQ tTVP UTiRUBT{ R i y Y' p P U f`U UBiRbhiFrqiRWusciqfscTqsuWuyT P p U"PqbX` TFiqUtfWuy yfU P vTVY'Wuy % b q i y d y V` scTVWciyVWcdfbV$T) iruT yP" bwWcP dfTVUBTFdWuy Y' UTBATVUP WuTVyvb pH2 P TVryTVUw2vtT t UBTFb suvBbtTVUT4 F T y i b u W s B T  v V T y ) d v ruP TVUb` W yb Q P PRQ P scTfU P FTFbb'Wcb ` b ` iRsuvs tTVruU TVU vWu ptTee t Te yb @Wuy TefU UBiRrqiRP UWciRvW P yb TBP WcbvTq Q uuR4vXtiRviyQXtfWcTVruT P P UBT"TV)VWcTVy Q ` vxiRUBT)TVWu vXtTVUhtiqUdpiqUT UWcTVyvBTFd U%suWu@WuvBTFdePtTVydfP TFiqsuWuy PP eWuvQ t P )fscTBiRP ydsciqU T@bFiRscP T Y' fU P P FTFbbwWuy Q y iq vTV UyiRvWuryT U)vXt ` bXTGs Wuy ` UXiRUBP dv P 2TVvXvBTVU"ATVUX E UXQiRPyP FTVb%P Wcb)v iqP UiqsuscTVsuWcFT suY' R i y d B T f F T B v % T u W h v u W y G i u s X v u W f U F F T X b b e U y i t u W y q T d bvXATVUX UX yVQ T P vW P yiRsuUBTFiyd fUTFbTVyvrWuyiqy UBTFbTFiRUBXt4sciR P Uiqv P UXWcTFbfWcbAv P iyd P f P vXiR $ v i b t q i  U F T d y f t Q T$iybbwWuy Q AiqUtfWuvTFVv ` UTAsuWuuTeie P UbviqvXW P y%Vs ` bvTVU P UwwbWu)s yiVs ` bvWuTVy U PP UA F T b X b i v U ` y P Y' U P FP TFbbXTFb9WuyGiRUBiRsuscTVs9 P q pTVruTVUwWuvWcbey vixvXUWurWciRs2viubw)v PR` iRUBiRsuscTVsuWcFT ` Y' Qu i)Vs ` bvBTVUeP iRydiRuT ` iqWuy ` P sus ` bT P hvXtTG F i q r R i u W c s q i f c s T q i  U q i u s c s V T e s  U F T b B U F F T V b  G T w b v  U q i yvBTFTi UXUBTFVv"iRyd @fscQyTVvBT%TBTF ` vXW P y29iqUXvWc ` sciRP Us Q P yTx ` bP vxXt PP P bTxFiRUBP TB ` sus vtTiqUiqP suscTVsuWcFiRvP W P y bvXUBiRvBT RvXtT4diqviqiybTiRUvXWuvW yfWuy iRyd)iRsus FiqvXW yiqfU iuXtTFbHiqyd@iqscb vtT @ P ` yfWcFiRvW P P y P` dfTVsrptWcb4 P UX P ` b P yWcbb ` TFb P U%i Y' Wu)fscTVTVyvBiRvW P y P y UbwvBiRvW yVs bwvBTVUw

24

tWcP b)iRAQ TVU)Wcb Q P U Q iRyfWcFTFP diyb@ P sus P bWuy p vXtTGyTBv TFVvW P y2HTfUBTFbXTVPyvb P T P Q P iRUvWuvXW ` yyfWuy P UBi P TVyviqvXW y@iqyd@UBTVsuWcFiqvXW yP bwP vUBiRvBT P WcTFP buptTVy2 WuynTFVvW yvtT ` @P yfWcFiRvW y" dTVscbATVvp` TFTVybwWuvBTFbeiRUBP T{fU bTFdx` sus P pTFdG vtT4iRUBiRsuscTVsATBfTV P vW y iRsuvBTVUXyiRvWuruTFbVyThdWcbX bbWuyTFVvW yxvXtTeUTFb suvb fviqWuyTFdWuyUBiyVvWcFTiRyd yVs ` dfTeWuynTFVvXW P y G'fif{{4fH'A yvtWcbbTFVvXW P ypT%eWusus'fUXWcTV fUBTFbXTVyvev P TBfTF ` vW P ybwvUiqvT Q WcTFb P U Y' vtiRv iqfs v P btiRUBTFdPiy P vtf Wuy Q iRUBiRsuscTVspTVyrWuU P yfTVyvBb pP vXtiRffU P iytTFbiRUBTiubXTFd P y dfiRP vBiRiubXTQ iRUP vXWuvW yfWuy Q P y P y ThtiqydvXtT%dfiRvBiRiubXTWcb{TVyvXWuUBTVs UBTVfsuWcFiRvBTFdiRv4TFiuXt fU Q FTFbXbWuy y dTiRyd yxvXQ tT P vXtTVP UAtiqydwvtTdfiRvBiRP iubXT$Wcb'iqUXvWuvP W P yTFd{Wuyv PP ` P suvWufscT iUi TVyvBbxiRydvtTV` yP iybbwW yTFdv vtTiFrqiRWusciqscTy dfTFbVHTn ybwWcdfTVU@ U@ vti iybvP TVUbsciFryTFbTBTF vXW yGbXtTVTq ` P UHvtTpUTVfsuWcFiqvTFd%dfiqviqiubXTbwWuv iRvW y2RvtTiubwvBTVU'y P dfT9WuyWuvWciRsusP UBTFFTVWuryTFb9b P T UB TF ` TFbwvBb9iWuP y Q TVyTVUBiRsFi{bTVvPY' ` suvXWufscPqTQ bXTF ` TVyFTFbr v P P AT P @ UTFdv vP tTdfiRvBiRY' iybTl ` yAfiqp v F T u i X G t y f d T v t V T  U e T c W b i f U B U R i E B U F T y i d v 2 { T U t V T  U B T B U q T i P ATVUBiRvW P yFiqy2TxTBTF ` vTFdi Q iqWuybwvTFiyts P Fiqs'dfiRvBiRiybT{WuydfTVATVydfTVyvs iU P vt T P vXtTVUBbV ptT9iubwP vBTVUry P dfTHWcb$UBTFP bw P ybP WuscT PRP ` U'dWcbwvUWu P ` vWuy Q vXt` TbTF ` TVy FTFbAiR P y Q iqP susfiFrRP iqWusu iqscT bwsciFruTP y dfTFbRpP tfWcbHiqsus FiRvW P yxbt scdxATpd yTWuy)b Xt)i{i vtiRv$vtT UXP s iyd Q iubXbW yTFdx v TFiyXt)y dfTpWcbpiRffP U WuiRvBTVs vtTbiRTqptTVUTeTBWcbvBbpi4TV` iqP fU iuXtTFb vt iRvhi AT"iRffsuWcTFd2T@ ybWcdfTVUhtTVUBT i viubwbdWcbwvUXWu vW ybwvUiqvX T Qu q viubw P UUTFb P ydb9v P i" P )iRUWcb P y P i Q WuryTVyWuyf ` vbXTF ` TVyFTeWuvXtGvtT dfiRvBiRiubXTq ` P tTVyniRsusbX`TF TVyFTFbiRUBT"iy bbwW Q yTFQdGv PP ibwsciFruT y P dfTq` ri Y' Q WcbTBTF ` vTFd P U P TFiuXt vtT yFTFb$iqv'i{ U FTFbXbWuy y dfTq iqy UBTFb suvBb$iRUB` T TVyTVUBP iRvB` TFdhiqyd%vXtTFbT P bvXTFt T%TV P f iqUTxbTVyv4 v y i b  v V T e U y f d tT iybvBTVUeUTFFTVWuruTFb` vtT%UTFb su` vbfU d FTFdGiqvhTFiuXt Q scTGUBTFTq ` rsup P U@ bsciFryTy P dfTGiRydQyi b u W y b v vtTWuyfWuvXWciqsUTF TFbv"Wcb@ WusuvFpptfWcbUBTVfsuWcFiRvBTFd dfiRvBiRiubXTbvXUBiRvBT ` WcbpY'UB TVfUTFbTVyvBTFdWuy W Qu` UBT"ui Q tTVypT"U y P Wuy iRUBiRsuscTVs'` eWuvXt i)iUi TV` yvBTFdGdf iRvBiRiybT%bvX` UBiRvBT Qy P vtT P Q ` iybvTVUP Wcb@bvXWususUBTFbw yP bWuscT@ U)P dWcb vUXWu vXWuy P vXtPTWuyf v)bXTFP TVyFTFb%UBTF TFbwvxv vtT bsciFryT"` y dTFbrPetfWcXty tiFruT yfQs iiUBiyQ VvW y vtTxet scT@dfiqviqiubXTqrptTxiRWuy dP Wu) suv WcP bv dfP TVvTVU@WuyTevtThUBi TVyv TVyTVUiqvXW P y P suWc vtiRveWcbeATVvvBTVUeiydfiRfvBTFd v vX tT4etPR` scTfU FTFP bXb v@bt P scdATy vTFd vtiRv" vtTvTB" v !f scTFb ` bTFP diyb"vtTdiqviqiybT"P P U)i Y' R P Q ` fU UBiR yviqWuybHiRy bTF TVyFTFbWuy UiRv{ #RTeWususA ybWcdfTVUtTVUBTei iUP i Q TVyv Q P TVyTVUiqvXW P yTVvt P P dvtiqvAWcb$bWu@WusciqU$v P t P UWc P yvBiR sRiUi Q TVyviqPvXW P y2 P ) P y v i%UTVP sciqvXW y` iRsdfiRvBiRiybTe yvBTBv qvtTVUTeP eWususA2TiRy !fscTFbTFiuXt ` etfWcXt eWuvXt P i b V T v { b F T V T y F F T 4 b e u W X v t v t V T u W % U c W f d V T y v W $ ! 
F q i X v W y V b A e f t c W X t R i B U T d c W b X v u W y V x v b X b V T B v b vtT P UXW Q WuyiqsAdfiRvBiRiubX% T !fscTqptfWcbbwWuv ` iRvW P y)WcbeWusus ` bvUiqvTFd)Wuy W Qu` UTyi yT Pq` scd%vtfWuyf%iR PR` v'&qUBTFiRWuy Q &iRiRUv$TFiyt"bXTF ` TVyFT'WuyxvXtTpdfiqviqiubXT P UvtT bTViRUBiRvW P y Q P pWcdfTVyvX$W ! FiqvXW P ysuWuyTFb4iU P vXtTVWuUbXTF ` TVyFTFbetWctWcb%iqyiRs PRQPR` bev P i ruTVUXvWcFiRsiUi TVyviqvXW P y%WuyGUBTVsciRvW P yiRsdfiRvBiRiubXTFbR P TVryTVUwuT{tiFryTy P v P ybwWcdfTVUBTFd

25

Input Query File


Sequence 1 Sequence 2

Input Query File

..
Sequence

Master Node

Sequence 1 Sequence 2

..
Sequence

Master Node

Sequence 1 Sequence 2

Sequence i+1 Sequence i+2

Sequence l+1 Sequence l+2

Sequence 1 Sequence 2

Sequence 1 Sequence 2

Sequence 1 Sequence 2

..
Sequence i

..
Sequence j

..
Sequence k

..
Sequence k

..
Sequence k

..
Sequence k

Node 1
Replica 1

Node 2

Node N

Node 1
Fragment 1

Node 2

Node N

...
Replica 2 Replica N Fragment 2

...
Fragment N

BLAST Results

BLAST Results

Output File

Master Node
Output Assembly

Output File

Master Node
Output Assembly

(a) Replicated Database Schema

(0)21r43}HBy|uocpFghFeg"5y 6ldhBeRg%pF lxwF6woux

(b) Fragmented Database Schema

vtf WcdfTFi)tTVUBT{2TFFP i ` bT{pT% PR` scdtiruT%v P tiqy Q T{vtT Y' P dfT{v P@Qy` iqUiqyvTFT P WcUXb UBTFVvXyTFbbHP iqyd )Q fscTVvTVyTFbbV tTVy% @P iqUXWuy vtTUBP TVfsuWcFiRvBTFddiqviqiyb T2eWuvthvtT$iUBi Q TVyvBTFddiqviqiybT$bwvUiqvX QT u u vtTVU` TpWcbiRy vtTVUHWu) P UXvBiRyv9` dWATV` UTVyFTq yvtTpsciR` vvTVUwyv ` PP fviqWuP y@vtT%!yiqs P P @ fscTVPvBTUTFb suP vFRvtTp iubwvBTVU$y dfTp bv9b f)WuvHTFiuXt"Wuyf vbTF TVyFTv iRsusbwsciFruTy dfTFb Wuy UdfTVU%v TBY'TF ` vT P ryTVUx` vtP T@et P scT P UW Q WuyiRsdf` iRvBiRiyP bTqtTVyinbwsciFP ruTy P dfTvBTVUX )WuyiRP vBTFbpWuvb P Q TBQ TF vXW yAf` WuvhbTVPydbeWuvBbUBTFb suvBbev P vXtThiybvBTVUeyQ dfTqetfWct Wcb UBTFbw ybWuscT UTVU Wuy P vtT4UTF` b suvb vBiRWuyTFdiRv4iR sus'P y dfTFbP WuynibWuy ` sc7 T !fyiRsAUBP TFb ` suv vtiRvpeWusus'AT4fUP TFbTVyvBP TFd@v vtT bTVUFptfWcbiybbTVxs HQ vtT{s FiRs2UTFb suvbpWcbiRy vXtP TVU bvTVvXtiRv Wcb UT )fscTBY' etTVyvXtT%dfiqviqiubXTWcbUBi TVyQvBTFdptTVUBT{eWusus'ATqf U TFiuXtWuyf ` vbXTF ` TVyFTq9 8 UTFb ` suvb9 P @ U 8dfiRvBiRiubXTiUi TVyvb A 9 B7 CD B$2F E{' C TeUBTFbTVyvHtTVUBP Te P Q) ` yWcFQ iqvXW P yxiRyd"TBTF ` vW P y" P dfTVscb$PRv` tiqvFiqy@P 2TpWu@fscTVTVyvBTFd P U" vtT)iRUvXWuvW yfWuy bwvUiqvT WcPRTFQ b%dfTFbXVUWuATFP dWuyvtT@fUBTVrW b)TFVvXW yA 'ptTiRPUBiRsuscTVs TBTF ` vW P y P {i P Y' P p U Uiq Wcb"d yT"WuydTV2TVydTVyvXs WuyiRy bsciFryT)y dfTFb TFiuXtneWuvXtP iGs FiqsQ nP vtT@dfiqviqiubXTqP vtiRv%WcbTVWuP vXtTVUxiGUBTVP fsuWcFi QP vtQ T P UXW Q WuyiRP s dfiRvBiRiuP bXT UxiiUi P P TVyvFrptTxiybvTVU4y @ dT%Wcb%UBTFb ybWufscT U%iybbW yfWuy viubwbhv TFiuXt y f d 4 T R i y G d R i c s b p U v t T G !fyiqsAiubXbTVxfs P UTFb ` suvb Q TVyTVUiqvTFd"iqvTFiyXty P dTq H iytfU PRQ UBiRTBTF ` vXW P yeWusus ` bThI i !fscT{vtiRve P yviqWuybpi"bwWuy Q scT4Wuyf ` vbTF ` TVP yFT iqydiRy P vtTVP U !fscP TWcb ` bP TFdn P U"vXtT PR` vX` ` v%UBTFb ` suvBbV` 9ptTVUBTB P UBTqWhvX` tT)iubwvBTVU% y dfT UBTFFTVWuryTFb9i7!fscTev ATpfU FTFbbTFdheWuvXt suvXP WufscTebTF P TVyFTFbyTFiyt)bXP TF TVyFTpP eWusus Q Q TVyP TVUX iqvT4ixdWcbwvWuyVQ v !fscTqptTFbXR T !fscTFbpeWususAAT4iRsus FiqvTFd"v vXtT4bwsciFruThy dTFbiyF UdWuy v i

26

bAP TFVW$! P suWc ptT7 !fyiqs9iR` ydG bWuy Q scTl PR` v ` vQ!fscTheWusus$2ThiFrqiRWusciqscTiRvevtTiubwvBTVU P P y dfTeetTVyvXtTet scThTBTF vW yvTVU@WuyiqvTFb TtiruT4TBP ATVUWu TVyvTFd%vtfUP TFTpWuy P UiqvXW P yvUBiRybTVU9P bXtTVP TFb9iR P y Q vtTe y P dfTF` b Q ptT WcdfTFiWcbv ruTVUXW etfWct yTWcb` vtTGA TVvvTVU)iqfU iuXt P UvUiqybTVUXWuy WuyfP v bTF ` TVyFTFbhiRy d Y' p PR` v ` v% UBTFb suvBb yvtTS!fUbv%TVvXt dAvXtT"iubwvBTVU%y dfT VUBP TFiRvBTFbhiRPRy `fQ !fscTFbxvtiqv)iRUBTWuyf ` v"v P P vtT Y' Q p TBTF ` P vW P ybV9ptPTFbTS!fscTFPb)iRUBT WcTFdvXtfU tT i UWV X`Y" bacXdUefVgl ATVUBiRvWuy b bvBTV )P iqydv iqP susey dfTFb nTFP bbXi Q TFbP iRUBTx X b V T y v{rWch i  p idXfqq rtsX"u%rvqfq`9sSwxacX`y`rUX) lHv TFiuXty dfTqeWuvXt WuyP UiqvXW y P ` hvt T !scTFb" P Q WcTFdHptT Y' EfU P P FTFd ` UBTFbhiRPqUB` TGTB`TF ` vBTFdiRv)TFiuXt y P dfTiRydUTFb su v !fscTFbiqUT TVyTV UBiRvBTFdptTyiRT P vtTFbXT vX Q S v !fscTFbiRUBTbTVyv v P vtTiubwvBTVU"y P dP TeWuvXt R i y d v t T u i w b B v V T " U f c W F T " b X v t V T i iRWuyeWuvXtiUBV @iRydyptfWcb dfTVsfWcbWusus ` bvXUBiRvBTFd@Wuy W Qu` U T iB
2 c`d t o ydp u qz v{ r|w sx } ~ 
de f g hi pqrs t u vwyx  C

 $ $ e ge f h2ijktl$mdn p `
          ! " #$ %& '( ) 021435 6 78 9 @ ACBED FHG 24p E p

IQP4R S2T2U2VXW Y ` acb d e fhgpi q r4sQtvuEwEx h i j k lmn o p q r s t u v w x yz{p|} ~ 

c d efg

y pEcHE

def ghji k l mn opq@rs

(0)21 p q r st uv wx y Q  3 4 5 6 789@A B CD E F G H IQP RQS0TUV WYX` a b c dfeg h0i

44 X 22 2E cp p E H 4 4 2

   ! " #%$& '  dfehgYikj

vw x y z { | }~ k
tu v w x yz { | }~ C

  Q

lfm noYprqhsYtku
 ! " #$% & ' ( ) 0 1 2 3 45 6 78@9ACB D E F

Q Q k k2 fQ k     

G H I P Q R S T U V W X Y `ab c

(0)$1 rz$fwdd{ueyocpBoufle fugRB|cx


27

yinbTF P ydTVvt P d2TVruTVU vXtfWuy Q WcbbwWu)WusciRU"v P vtTfUTVrW PR` b@ P @ ` yfWcFiRvW P y bXtTVTTBfFTVvvXtiqvvtTVUBTiRUBTy P fP t bWcFiRs2 P f WcTFb P 2vXtTR!fP scTFbV yP dfTFTFdpTh P ybwWcdfTVU P X a 0  $X gvq acX`Y%@ Rv TVyiRfscT"b TxbsciFryT)y dTFbv iyFFTFbb !fscTFbhWuyi vP UiqybwiRUBTVyviqyyTVUwptTFbT !fP scTFbiRUBT{ft bWcFiRsus s P FiRvB`TFdiRvevtT{iyY'bv TVUy P dfPRTq` Wuy` i scdfTVU'vXtiqv$WcP bHbtiRUBTFd%eWuvXt"iqsus vXtTVU$iytWuyTFP bAWuy"vtTVs bvBTVUFRptT p P vX v !fscTFbhiqUTxiq scb eUXWuvvTVyWuyvXtTxbiRTx bwtiqUTFd scdfTVUwptTP!fscT"yiqTFbiqUTxbTVyvhv vtT iybvTVU$y P P dfTpvtU PR`fQ QyI t `  TFbbi Q TFbRptTeiubwvBTVU$y P dfTeiyFFTFbbdWuUBTFVvXs vtTFbXT0!scTFb iubpWuveWcbbt eyGWuy W UBP T i
y jy C y y C
t u vwx y z{ | } ~  g &c4 c  0 44

C C y C
     ! " # & $ % ' (0) 1 243 5 67 8 9 @ A B C D E FG HI PQ RTS4UVW

T c4c

d4egf hji k l4m n op

X`Y acbTd4e f g h i p q r s tuv w xy

 0  S

qTr0s  T
 

  

    ! " # $ % & ' ( ) 0 1 2 3 45 67 8@9 A BC@DFEHGPIPQSR T U VWX Y `a b c d e fgh ip q r s tu v w x y P@S Sd ef ghSi jkmlonpq rSsut u  @ uu uHP S

H@

 !#" $% &' (0)1

z {| }~ @  P

2 3 4 5 6 7 8 9 @ A0B CEDGF H I P QSR T UWV X Y ` a b c d e f g h i p q rs t u vw xy0 r s t u v w x y

z{ | } ~G # d 

G S de f g h i j kl mEn o p q

 m P  F S SP@S @u  vxwy

SS0  Sd

W          !#"%$'&)( 0 ( 12 43 576 8 9S  @ 2  F 2  A S@ (CB ! 43ED G HI  PA7H  2  !   !QRST 2 PA7 W 3 3 !ST   PA

28

43 U"%$'& H 3 P56 2RST 2  3  A2 A2 D QF 2 0  ( 8 H 3   !ST V5Q('WYX a`b6c14 A2QdS9 2 4  2 H D  3  @@ eH  3 f"%$'& H 3  2   D   @  2 Td 2 5g6 2 9S 3 ! H 3 ) @@ 2 (    !S9 F Rg 2 S!B d@@ 3  S@ h5i  0  #p PqA PH @ sr 39 Ut4uwvA 2UD   R(q@ hA  ! 2 @ @ 3 H 3 x5

iy 8ST 4   % d 2 f 8) yFC 2S9 B 0    S@B4! h 14 P56 8 Pc2 pFC9 2UF d A9d7 !!)( YG( 3T  @ 4@B29H'S  2 a 3  B7 ( D  Bg  F !W'X a`b6 CBg 3  H h56 8 U  !STA S 43 "%$'& !)( '!   (  PAF9 2   9 29@H  aB7 @  @ 2 !   8 2  @a tq5

 2   P H APF@H fr 3T uvV ! 3  B2hd 3 P 2G14 F E @P  4@B29H5T6 2 Td 2  S!B @C!HI   !7dS9 2  S 43 "%$'& HH 3 x56 2dS9 2   !S9 2H H 3 a HS9 2    D P5d6 2 A 2pdS9 2 edGpH ( H d 3 2 g@ 3 R4d f  9dge( 54& E HeW'X )`h6i 3 @ G14 @ e( G 3 !ST  D  2a  P  ( 8 B29H5
2 ) j h 4xA2RST 2  2   D  (2 d  0 W ' X )`b6 3 3  e    2 Td 8 5 rb 2 A(  H8 @2 @ hA9 PF d VAwg9P 3 ! Hd 3 !STa 43 5h6     !7!HI   2xSTaBg  HH V5 2 ' 6 2 Td    8   !Fb!STA !S 3 H 3 F kW'X a`b6$s1 40   '  2   D 56 'FCx9A  b   F 3 2 H 3    S   @B!  V5

6 8G14 fp P4( 2(0 39   B29Hd 3 STS ! 2 P4 @ B29H  54l FCPS9A2@G F   7 YF Rg 8 9S@ 2 4m Y!S 2nu0ov'!  QI   SB7eH !  2 pu ovF 2 8GW'X )`b6$   !S9a(  2dS9 2 xA2G  @    S !! 4 !!7dS9 2 m      !ST V5

i 2 !  H D C( e 3 !ST I  GaS!B 2 Td 2 A 2P@  B   fF 2jS 3    7( 2'WYX a`b6$%14    S 4 h56 2   )( PS9P 4 3  2 3T   8 b  3  8  !h   D  2   'F !!hB7 8H e9Y14 3 2  3 H W'X )`h6$ ST  2 3  80 (0@ 39  @ 4@B29H58&dd qB7 2  2   d7 D Q( 2W'X )`b6r 3  S9d B7 3f  hA q TAPirs  gW'X a`b6% 3  2 H  3 43C  0 WYX a`b6$#!twtq5 r 2  @x5 r e14  A 8  H   C( 2GFE      @ S S  '  !  V56 2eF   j( 8Q0 ( @ 39  @ B2TS    3 S  F %r 3T @GuuBhv@5

29

wYx4y4zxb{g|'}{7~s{gzdezx}Y

6 214 p P !Bg 2GPS G`b  FC@   @ d tF g4d   P S@A d 3)  43T 3 F aX e" 0 "%$'&hb5 BaST@S  wqA8irs 0 W'X )`b6t5 bA 7 e  @B8TauQ q9 A @@C 97 a ve %f(PF d @ ! 3    D PA !Bg @! q5 i 2@  @ B@   S!B 4 F f   ( 2 2 @2    8Y8@!PbW'X )`b6y7( w5 r 2 P   B29Hd    3 A2 B7 (  HI  7! @  dS9 2  A 2   3 2 2aS  ! 4 3 @eB7FC 2  eI   H 2 4 @B29H  P!E 3 2 S    C(q @x54i Rg F g EB2 V A  (   ( 8 9dgb d  !B    P 3  S Bg gT C 5 6 2 B  UTd  !!       s  @ 0  0 r  2  pB 2T  H @ @  @x5d  2e g a ( R  2  Aj! 2 2 3b  B @  # u P 43 14  %d7 hv( 8 @S  @eF 2   B p  f e( e   @  pF 8   4Y    !Bg  0 twq5  ( 2e0 ( @ 39  @ B29H P8FCF   U !gT r4 2     F  @ I    S P  g n u  ov 2e0 ( @      u ov 3 3 39 3 e q 2Q2    xm   3   I 5 iy P 8 2)0 ( @ 39  3       P    @ edbd g  d( ( 5h&  h A (  b HeF 3   PpS  3  B2 0 3G  2'F 2 !PPS 'h  h5i# 8ST   p2    4dF  %B2 3 H 576 2 (  3T  e       %FC 2 2 @ g B  ' ( H    I  x m 2    @Y e 3 (0@ 39  gTP  9 P H9 Hd  !BA79GFP!Y9 2 @ B7  (I   p  (  3T  5d6 e  Pd @f 3 2 B2TUHI  B 2   3 2d pw0 (   2e e9Hd  3T  8GdxST 8   0 B (9S   5 6 2G  2  e (0 ( @ 39   3     d    3   Bg(  q5 6 8W'X a`b6i 3 @g9  2 B2TGd  )F 8 2 0 ( e  3T   4 D  % W   F 2  2  2   3 !ST HI    d     e!tq59i 8 FC  d    e0 (  39   B29HA 2 W'X )`b6  3    edxST 8  @Y(     B2TF pS   2 @ h   2 2 3 gd  56 PA  hW'X )`h6c14  d b gf( HI   2  7  @   xAb !  2 2  G( 8 0    B29H5 HI  H!Ad  U 3T   8 fF    2 fB7  d     @    U9e S  !B @  @ e2 !PPS  F 2G0 ( @ 39   B2T5 6 8  (  A # 4  ( !   2  3  ( 82      @TRgE 2  @  2 4 A 2k2     (ics 0 W'X )`b6rFT H h546 F  S  !B 9S @B29HQS   B7  d     B 2)W'X a`b6c  3 @ !ttwq54i# 2xST D 1 2 4 TS  39  2#8 @    8ES k( 2 3  B29HES  PST (

bT  P

8 w@8 Hq !f !  #H b

30

W'X )`b6F !e(0 39  P5iyF  p3T @  8 2 3 dG   '   S e B 2adxST 2 dFC@) 2 B     ( 2G1h7  D  @  @x5

pYQgQ~b}Ye~

i# 8STH @ 8 !  Hd q     H Q( 2 W'X )`by 6 2!P  P d S   56 2 !B  P  u ov 8 0    f ( e  4PY 2 xFB2gbxAVu0 ov 2Q1 0 U P F h !S9P!   8 #u ove q S9HSPF( 3      @  @ !S9x5 2 6 81b7     @  @ B   h A  @ ! 4!B7 tqA8STe d F 2 2   apBgd  eg  d(  @ EF 2 2 B29H ( 3T   pF 2 2   2   Y( G P   B29H5hl  A2  8   0 FPST       S aB7  d    kBg( @ !ST  d    H   h 5 &  hA 4 @ 3 3 S ! B ! % U(     Bgd   B29H  4 @ 3 14  5 r B 7Ak (  3T  @ B29H    1 3   2 ( P    2 b5 iyG   !F H g 3f 2 S  !B 2  f (  '2 @!PVW'X )`b6   S  3 F %  g 8 F  (  p H U2 !P7WYX a`b6 b xA4 %9e6 0 B W'X )`b6!tTt p6   0 WYX a`b6!tq5 {7{7y{Qh{7~

jh SRq g Qbd x @h 9hb2bT 9h @ e h !T 8 9 # #! # @  TW # # SR 9 ww T9 WC d # Sdwh  w 9eww x T T g Y9 # w a G #! ewY W 8C9 # 8@u @ W 0wq  E Cw9 # S dfwwT T g Y # 'h' 0 d ww #p #  Gwf '2  #@S T8 T ' # ! U Wh #HVdqbqR xT d b h T @4hH W S   wg P wa # e ! T WdTjYgR Pw 9Tf x T8 p e7 W wa 9h wPdQ aoa q T'h   )' W49 q9x ) wa j ww9T 7s T  T wa w Pw @ w Tf7 7g # 9  # S 9T # 9e0 R  @ w Cdw wuwwT @S # 7 #w Pw h TC' # h 9 w4H7 ! 7 # ! #e #@ 97 @ q Pw # Ha0 eY7h w cw #7 hT  'h G W w) w T' 98  @ q8CCT ! x w 0 T PdQ ! W w  T 9 2 9 Ww W h w 9 w Ww WH w w d H P # #q b T'dq8C S R @7@  @H%' Q ! w 9q w2 ! T T8 Tw 'c P W 9 9 j wa b ww  wwTTww h

31

Splice Junction Recognition using Machine Learning Techniques


Ana C. Lorena1, Gustavo E. A. P. A. Batista1, André C. P. L. F. de Carvalho1, and Maria C. Monard1

1 Universidade de São Paulo, Instituto de Ciências Matemáticas e de Computação, Av. Trabalhador São-Carlense, 400 - Centro - Cx. Postal 668, São Carlos - São Paulo - Brasil {aclorena, gbatista, andre, mcmonard}@icmc.sc.usp.br

Abstract. Since the start of the Human Genome Project, a large amount of sequence data has been generated. These data need to be analyzed. One of the main analyses to be carried out is the identification of regions of these sequences that correspond to genes. For such, one can search for particular signals associated with gene expression. Among the searched signals are the splice junctions. This recognition problem can be efficiently accomplished with the use of computational intelligence techniques. Many of the genetic databases, however, are characterized by the presence of high levels of noise, which can deteriorate the learning techniques' performance. This paper investigates the influence of noisy data on the performance of two different learning techniques (Decision Trees and Support Vector Machines) in the splice junction recognition problem. Results indicate that the elimination of noisy patterns from the datasets employed can improve Decision Trees' comprehensiveness and Support Vector Machines' performance.

1 Introduction

The Human Genome Project, whose main goal is the sequencing of all human genetic information (and also the genomes of other selected species), is generating a large amount of sequence data. One of the current issues in Bioinformatics is the recognition of patterns in these data. This work investigates the identification of genes in DNA sequences. This task can be solved by two different approaches: search by signal and search by content [6]. The first approach searches for signals associated with the gene expression process, like promoters and splice junction regions. The second approach looks for general patterns in the sequences that indicate the presence of a coding region. This work investigates the use of two different Machine Learning (ML) techniques, Decision Trees [12] and Support Vector Machines [13], in the splice junction recognition problem. Since many of the genetic databases are characterized by the presence of high levels of noise [7], the influence of noisy patterns on the learning process is also evaluated. The less reliable patterns (possible noise) were eliminated using the Tomek links heuristic [3, 18].

This paper is organized as follows: Section 2 discusses the splice junction recognition problem. Section 3 describes the pre-processing technique used for noise elimination. Section 4 presents the learning techniques considered and related works. Sections 5 and 6 present the experiments conducted and the results obtained, respectively. Section 7 concludes this paper.

2 Splice junction recognition

The main process that occurs in all organisms' cells is the production of proteins. Proteins are essential components of all living beings, having structural and regulatory functions [10]. The protein coding process from the genetic sequence information is named gene expression. Gene expression is composed of two stages: Transcription and Translation. In the Transcription phase, an mRNA (messenger Ribonucleic Acid) is synthesized from the DNA (Deoxyribonucleic Acid). The protein coding is performed in the Translation stage, using the mRNA sequence as a model. There are differences in the described processes between organisms named eukaryotes and prokaryotes. Eukaryotic genes are composed of alternating segments of exons and introns. Exons correspond to regions that are translated into proteins, and introns to regions that do not code for proteins. Gene expression in eukaryotic organisms thus has an additional step, in which introns are spliced out of the mRNA molecule. Splice junctions are the boundary points where splicing occurs. The splice junction recognition problem involves identifying whether a specified sequence has a splice site or not, and its type (exon-intron or intron-exon). This paper is concerned with the identification of these regions. The final goal of this work is the recognition of genes in DNA sequences. Section 4 presents descriptions of some works devoted to the splice recognition task.

3 Data pre-processing

To evaluate the effect of noisy data on the performance of the learning techniques considered in this work, a pre-processing phase for the elimination of possible noise was applied. Patterns considered less reliable were then eliminated from the dataset employed. The detection of this noisy data was performed using a heuristic called Tomek links [18]. In order to illustrate how this heuristic works, consider the dataset from Fig. 1. The patterns from this dataset can be divided into three groups [3]:

Mislabeled samples: data incorrectly classified. The (-) patterns in the left region of Fig. 1a are examples of mislabeled samples.

Borderlines: patterns too close to the decision border induced for data classification. These examples are unreliable, since even a small quantity of noise can move them to the wrong side of the decision border.

Fig. 1. Applying Tomek links to a dataset. Original data set (a), Tomek links identified (b), and Tomek links removed (c) [3].

Safe samples: the remaining patterns. These samples should compose the learning dataset.

The Tomek links heuristic allows the identification of mislabeled and borderline samples. Given two examples x and y from distinct classes, let d(x, y) be the distance between these instances. A pair (x, y) is considered a Tomek link if there is no example z such that d(x, z) < d(x, y) or d(y, z) < d(y, x) (Fig. 1). The computation of the distances d was performed with the Value Difference Metric (VDM) [3, 17].
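To make the definition concrete, the following minimal sketch detects Tomek links from a precomputed pairwise distance matrix (e.g., filled with VDM distances); it is an illustration under these assumptions, not the authors' implementation.

import java.util.ArrayList;
import java.util.List;

/** Naive Tomek link detection over a precomputed distance matrix. */
public class TomekLinks {

    // A pair (x, y) from distinct classes is a Tomek link if no example z
    // satisfies d(x, z) < d(x, y) or d(y, z) < d(x, y).
    public static List<int[]> find(double[][] dist, int[] labels) {
        List<int[]> links = new ArrayList<>();
        int n = labels.length;
        for (int x = 0; x < n; x++) {
            for (int y = x + 1; y < n; y++) {
                if (labels[x] == labels[y]) continue;   // classes must differ
                boolean isLink = true;
                for (int z = 0; z < n && isLink; z++) {
                    if (z == x || z == y) continue;
                    if (dist[x][z] < dist[x][y] || dist[y][z] < dist[x][y]) isLink = false;
                }
                if (isLink) links.add(new int[]{x, y}); // both ends are noise candidates
            }
        }
        return links;
    }
}

Each detected pair contains the mislabeled or borderline candidates that the pre-processing phase removes.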

4 Learning techniques

There are several supervised Machine Learning (ML) algorithms capable of extracting concepts from data samples. Given a set of known examples, the learning algorithm induces a classifier C that should be able to predict the class of any pattern from the same domain where the learning process occurred. The class represents the item one wishes to make predictions about [2]. Among the ML works in splice junction recognition one can mention [14] and [19], in which propositional rules of the biological domain are used to initialize an Artificial Neural Network (ANN [8]). ANNs have achieved good performance in this task. The Statlog Project [11] also reports the use of various ML techniques for splice site identification. Another approach based on ANNs was proposed by Rampone [16], in which Boolean formulae inferred from data were refined by an ANN. This work investigates the use of two ML techniques following different approaches: Decision Trees (DTs) [12], a symbolic learning technique, and Support Vector Machines (SVMs) [13], a main representative of statistical learning. DTs organize the information in a structure composed of nodes and ramifications [2]. The nodes represent tests applied to the data, or represent classes when the node is a leaf. The ramifications are the possible results of the tests. The main advantage of DTs is the comprehensiveness of the induced rules in the classification process. SVMs are learning techniques based on the Statistical Learning Theory, proposed by Vapnik and Chervonenkis [20]. They map the input data to an abstract space of high dimension, where the examples can be efficiently separated by a hyperplane. The SVM incorporates this concept with the use of functions named Kernels. These functions allow access to complex spaces in a simplified and computationally efficient way. The optimal hyperplane in this space is defined as the one that maximizes the separation margin between data belonging to different classes. The main advantages of SVMs are their precision and their robustness with high dimensional data. However, unlike DTs, SVM classifiers are not directly interpretable.

5 Experiments

The splice junction dataset used in this work is composed of known primate DNA sequences, collected from GenBank 64.1 by Noordewier et al. [14], with their corresponding classes. This dataset is available in the UCI benchmark database [4]. The possible classifications for the DNA sequences are: EI, when the sequence has an exon-intron border; IE, for the intron-exon border; or N, if the sequence does not have a splice region. Table 1 summarizes this dataset, showing the total number of instances (Instances), the number of continuous and nominal features present (Features), the approximate class distribution (Class %), the majority error (ME) and whether there are missing values (MV). The techniques described in the previous Section were applied in the generation of binary classifiers 1, because the Tomek links heuristic is simpler to perform when dealing with two classes. The splice junction recognition problem was then divided in the following manner: a classifier was induced to distinguish sequences that have splice junctions (IE+EI) from the ones that do not (N). If the presence of a splice junction is verified, a second classifier distinguishes whether it is of the IE or EI type. For simplicity of reference, the IE+EI vs N subproblem will be referred to as splice detection (SD), and the IE vs EI one will be named splice type identification (STI).
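The decomposition can be read as a small decision cascade; the sketch below is only illustrative (the Classifier interface is a hypothetical placeholder, not part of the tools used in the paper).

/** Hypothetical binary classifier abstraction for the two subproblems. */
interface Classifier {
    boolean predict(double[] x);
}

/** Composes the SD and STI classifiers into the original three-class decision. */
class TwoStage {

    private final Classifier sd;  // true = has a splice junction (IE+EI), false = N
    private final Classifier sti; // true = IE, false = EI

    TwoStage(Classifier sd, Classifier sti) {
        this.sd = sd;
        this.sti = sti;
    }

    String classify(double[] x) {
        if (!sd.predict(x)) return "N";          // splice detection (SD)
        return sti.predict(x) ? "IE" : "EI";     // splice type identification (STI)
    }
}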

Table 1. Dataset summary description

Instances  Features (nom., cont.)  Class %                 ME   MV
3190       60 (60, 0)              IE 25%, EI 25%, N 50%   50%  no

1 Classifiers involving only two classes.


The dataset was divided into 10 disjoint sets of approximately equal size, according to the 10-fold cross validation method [12]. Nine of these sets are used in training the learning techniques, and the resulting classifier is tested on the remaining set. This process is repeated ten times, making ten training and testing cycles. The total error is averaged over the errors obtained in each cycle. Since the original problem was divided into two (SD and STI), this procedure was performed for the total dataset (SD case) and for a subset with only the IE and EI examples (STI subproblem). The pre-processing procedure was then applied to all training sets generated, eliminating instances considered mislabeled and borderline. Approximately 6% of the instances were eliminated from the SD training sets, and approximately 5% were removed from the STI training sets. For simplicity, from now on we shall refer to the cleaned training datasets as pre-processed datasets, and to the ones that did not go through this pre-processing phase as original datasets. These datasets were then used in the generation of the classifiers, produced by DTs and SVMs. The DT induction was performed with the use of the C4.5 algorithm [15] and the SVMs with the assistance of the SVMTorch II tool [5]. Unlike DTs, SVMs require data to be in a numerical format. So, the splice junction dataset features had to be coded in a continuous format. The following coding was used: A = (1 0 0 0), C = (0 1 0 0), G = (0 0 1 0) and T = (0 0 0 1). This coding scheme ensures equidistance between all the possible feature values. Thus, for SVMs, each sample has 240 attributes (features).
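As an illustration of this coding, a minimal sketch (not the authors' code) that turns a DNA string into the numeric vector used as SVM input; a 60-base window yields 240 attributes.

/** One-hot coding: A=(1,0,0,0), C=(0,1,0,0), G=(0,0,1,0), T=(0,0,0,1). */
public class OneHot {

    public static double[] encode(String seq) {
        double[] v = new double[4 * seq.length()];
        for (int i = 0; i < seq.length(); i++) {
            switch (seq.charAt(i)) {
                case 'A': v[4 * i]     = 1; break;
                case 'C': v[4 * i + 1] = 1; break;
                case 'G': v[4 * i + 2] = 1; break;
                case 'T': v[4 * i + 3] = 1; break;
                // other IUPAC symbols would need a convention of their own
            }
        }
        return v;
    }
}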

6 Results

Decision Trees. An important property of DTs is their comprehensiveness, which in general is better for smaller trees. For the original dataset of the SD subproblem, the mean size of the induced trees was 230.6 ± 8.0 nodes. For the SD pre-processed training data, this size was 209.0 ± 14.3 nodes, showing approximately 10% of reduction, or some comprehensiveness improvement. For the original STI training set, the mean DT size was 88.2 ± 9.6 nodes. After pre-processing, this size was reduced to 81.8 ± 8.8 nodes (approximately 10% of reduction), also denoting a comprehensiveness gain. These measures refer to the trees induced after the pruning process 2. Before pruning, the mean sizes of the SD trees were 690.6 ± 26.0 and 591.4 ± 25.0 nodes, respectively (15% of reduction with pre-processing). In the STI case, the mean sizes were 295.4 ± 17.8 and 197.0 ± 16.9 nodes (33% of reduction). These facts show that the Tomek links heuristic really managed to clean the data of noisy examples. Table 2 shows the overall performance of all induced trees. This table presents, for each experiment (SD and STI), the total misclassification mean error (Error), as well as the error for each class (IE+EI, N, IE and EI). It can be observed that the errors on both datasets (in each experiment) are similar. However, comprehensiveness always improves, although this improvement is only statistically significant for DTs before the pruning process.
2 Pruning is a technique that minimizes the influence of noise in classifier induction.


Table 2. DTs performance

               SD experiment                          STI experiment
Dataset        Error      IE+EI Error  N Error        Error      IE Error   EI Error
Original       4.4 ± 0.6  1.9 ± 0.6    6.7 ± 0.9      4.3 ± 1.5  5.3 ± 2.3  3.3 ± 2.8
Pre-processed  4.5 ± 0.9  1.9 ± 1.0    6.8 ± 1.0      4.5 ± 1.6  5.3 ± 2.4  3.7 ± 2.4

Support Vector Machines. In the SVM experiments, several types of Kernel functions were tested. The functions considered were: Polynomials of different degrees (1 to 5), Gaussians with varying standard deviation values (0.01, 0.1, 1, 10, 50, 100) and a Sigmoid. Training was performed until a training error rate inferior to 0.01 was reached. Besides the Kernel parameters mentioned, the other parameters were set to the default values of SVMTorch II [5]. In the SD experiments, the best results for the original dataset were achieved by the Gaussian Kernel with a standard deviation of 5, and with a Polynomial Kernel of third degree for the pre-processed dataset. In the STI case, the best Kernel for both datasets was the Polynomial of fifth degree. For SVMs, an important measure is the final number of support vectors (SVs) of the generated model. The SVs are the most representative training data for the SVM classification task. They are the patterns closest to the optimal hyperplane, whose equation is defined using these samples. Thus, the presence of noisy patterns can influence the determination of this hyperplane equation. In the original SD experiment, the final number of SVs was 1696.4 ± 8.3, against 1529.4 ± 16.5 for the pre-processed data (10% of reduction). In the STI experiment, the results were 1175.3 ± 6.1 and 1109.9 ± 7.4 SVs (6% of reduction), respectively. It was also noticed that the training time was reduced with data pre-processing (approximately 7% in the SD experiment and 15% in the STI case). Similarly to Table 2, Table 3 shows the overall performance of the SVM classifiers during testing. It can be verified that the overall performance of the SVM classifiers was maintained with data pre-processing (while gains were achieved through reductions in the training time and number of SVs). For the SD experiment, in particular, the results were better with data pre-processing, at a 95% confidence level.
Table 3. SVMs performance

               SD experiment                          STI experiment
Dataset        Error      IE+EI Error  N Error        Error      IE Error   EI Error
Original       3.6 ± 0.9  1.3 ± 0.8    2.2 ± 0.6      1.9 ± 1.1  1.7 ± 1.1  2.1 ± 2.3
Pre-processed  2.9 ± 0.8  1.0 ± 0.8    2.0 ± 0.4      2.1 ± 1.2  1.2 ± 1.2  0.9 ± 0.5


Performance comparison. Using the t-test for paired data [9] over the misclassification errors of the different classifiers induced, a performance comparison was carried out among the learning techniques considered. Through this test, it was verified that SVMs outperform DTs in all cases, with a 95% confidence level. This is due to their ability to deal with high dimensional patterns, such as those in the splice junction dataset. DTs, like other symbolic ML techniques, are not well suited for this kind of data. The UCI benchmark [4] also provides results from previous works for the same data (e.g. [14, 19]). However, these works did not divide the problem into two binary subproblems, making the comparison with the results presented in this paper difficult. The only comparable result is the N Error, for which [19] obtained an average error rate of 4.6%. The SVM classifiers modelled in this work achieved better results (average error of 2.0%). It should be pointed out, however, that [19] used only 1000 examples randomly chosen from the complete dataset, while in the present work all 3190 examples were employed.

7 Conclusions

This paper presented a study of the influence of noisy data on the performance of two different learning techniques: Decision Trees and Support Vector Machines. The application problem was the recognition of splice junctions in DNA sequences. Many genetic databases are characterized by the presence of noise, justifying the choice of this application domain. For this evaluation, a pre-processing phase for the elimination of possible noisy examples was applied to the datasets employed. The application of this phase led to improvements in the learning methods' performance. In the DTs' case, gains were mainly noticed in the comprehensiveness of the induced trees. For SVMs, significant reductions were obtained in the training time and in the final number of support vectors, the patterns that determine the optimal hyperplane equation for data separation. This suggests that the optimal hyperplane is no longer oriented by unreliable cases. The cost of the pre-processing phase should also be considered. For n examples having m features, this phase has O(m n²) complexity. Thus, it is a costly process. It should be observed, however, that this phase is applied only once, independently of the number of ML techniques employed afterwards. In spite of the elimination of noisy patterns by the Tomek links heuristic, some of the examples removed may be relevant to the classifier induction process. To avoid this situation, the tuning of this algorithm must be carefully performed. As possible future work, the Tomek links heuristic will be applied to other datasets from the Molecular Biology field. Further experiments should also be performed to adjust the ML algorithms' parameters. Finally, it can be stated that the splice junction dataset used in the experiments conducted does not have a high level of noise. This may be due to previous pre-processing applied to this data, since it has been used in several works (e.g. [14, 19]). Despite this fact, the results obtained are encouraging.

Acknowledgements

The authors would like to thank CNPq and Fapesp, Brazilian research support agencies, for the financial support provided.

References
1. Baldi, P., Brunak, S.: Bioinformatics - The Machine Learning Approach. The MIT Press (1998)
2. Baranauskas, J. A., Monard, M. C.: Reviewing some Machine Learning Concepts and Methods. Technical Report 102, Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos, Brazil, ftp://ftp.icmc.sc.usp.br/pub/BIBLIOTECA/rel_tec/RT_102.ps.zip (2000)
3. Batista, G., Carvalho, A., Monard, M. C.: Applying one-sided selection to unbalanced datasets. Mexican International Conference on Artificial Intelligence (MICAI), Lecture Notes in Artificial Intelligence, Vol. 1793. Springer-Verlag (2000) 315-325
4. Blake, C. L., Merz, C. J.: UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/ (1998)
5. Collobert, R., Bengio, S.: SVMTorch: Support vector machines for large scale regression problems. Journal of Machine Learning Research, Vol. 1 (2001) 143-160
6. Craven, M. W., Shavlik, J. W.: Machine Learning Approaches to Gene Recognition. IEEE Expert, Vol. 9, No. 2. IEEE Computer Society Press (1994) 2-10
7. Cristianini, N.: Support vector and kernel methods for bioinformatics. Pacific Symposium on Biocomputing Tutorial, Kauai Marriott, Kauai, http://www.supportvector.net/PSB2002.pdf (2002)
8. Haykin, S.: Neural Networks - A Comprehensive Foundation. Prentice Hall (1999)
9. Johnson, R. A.: Miller and Freund's Probability and Statistics for Engineers. Prentice Hall (2000)
10. Lewis, R.: Human Genetics - Concepts and Applications. McGraw Hill (2001)
11. Michie, D., Spiegelhalter, D. J., Taylor, C. C.: Machine Learning, Neural and Statistical Classification. Ellis Horwood (1994)
12. Mitchell, T.: Machine Learning. McGraw Hill (1997)
13. Müller, K. R., Mika, S., Rätsch, G., Tsuda, K., Schölkopf, B.: An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, Vol. 12, No. 2. IEEE Computer Society Press (2001) 181-201
14. Noordewier, M. O., Towell, G. G., Shavlik, J. W.: Training Knowledge-Based Neural Networks to Recognize Genes in DNA Sequences. Advances in Neural Information Processing Systems, Vol. 3. Morgan Kaufmann (1991) 530-536
15. Quinlan, J. R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, CA (1993)
16. Rampone, S.: Recognition of splice junctions on DNA sequences by BRAIN learning algorithm. Bioinformatics, Vol. 14, No. 8. Oxford University Press (1998) 676-684
17. Stanfill, C., Waltz, D.: Toward Memory-Based Reasoning. Communications of the ACM, Vol. 29, No. 12 (1986) 1213-1228
18. Tomek, I.: Two Modifications of CNN. IEEE Transactions on Systems, Man and Cybernetics, SMC-6. IEEE Computer Society Press (1976) 769-772
19. Towell, G. G.: Symbolic Knowledge and Neural Networks: Insertion, Refinement, and Extraction. PhD Thesis, University of Wisconsin - Madison (1991)
20. Vapnik, V. N., Chervonenkis, A.: On the Uniform Convergence of Relative Frequencies of Events to their Probabilities. Theory of Probability and Its Applications, No. 16 (1968) 262-280


AnaGel: A Distributed System for Storage and Analysis of Electrophoretical Records


Edrê Quintão Moreira and Osvaldo Carvalho
Laboratório de Computação Científica, Departamento de Ciência da Computação, Universidade Federal de Minas Gerais, 31270-901 Belo Horizonte - MG - Brazil {edre, osvaldo}@lcc.ufmg.br

Abstract. In this work we present AnaGel, a tool for storage and analysis of electrophoretical records useful for geneticists and biochemists. The features of AnaGel that distinguish it from similar tools are its use through a web browser, the ability to share data among researchers in different laboratories, and the possibility of using customized algorithms for lane comparisons. AnaGel has been developed in Java, using the Grasshopper agent system.

Keywords: electrophoresis analysis, distributed systems, Java agents Topics of interest: Biological Databases, Software Tools for Computational Biology

1 Introduction

Gel electrophoresis [12], [5], [6], [4], [13] is a method that separates macromolecules on the basis of size, electric charge, and other physical properties. Many important biological molecules, such as amino acids, peptides, proteins, nucleotides, and nucleic acids, exist in solution as electrically charged species. A gel is a colloid in a solid form. Gel electrophoresis refers to the technique in which molecules are forced by an electrical field to move across a span of gel. A molecule's properties determine how rapidly the molecule moves through a gelatinous medium. Figure 1 shows a typical result from a gel electrophoresis. Multiple samples of substances containing macromolecules are separated by a single electrophoresis experiment; each sample gives rise to a pattern called a lane. Gel electrophoresis is one of the most used tools in molecular biology and is of critical value in many aspects of genetic manipulation and study. One use is the identification of particular DNA molecules by the band patterns they yield in gel electrophoresis after being cut with various restriction enzymes. Viral DNA, plasmid DNA, and particular segments of chromosomal DNA can all be identified in this way. Another use is the isolation and purification of individual fragments containing interesting genes, which can be recovered from the gel with full biological activity.


Fig. 1. Typical electrophoresis gel. The leftmost lane is generated by a sample containing macromolecules with known weights, which are used to evaluate the molecular weight of bands in the other lanes

This work describes AnaGel [11], [10], [9], a software tool for storage and analysis of electrophoretical records. Many other tools exist with similar purposes (see Table 1); AnaGel presents several advantages over these tools:

- AnaGel is a web application, and only requires a browser to execute it;
- AnaGel is a multi-user application: users of one laboratory can share data, stored on a single database;
- AnaGel is a distributed application: two or more databases can be visited by a mobile agent [14], [2], [7], [8] for a similarity search;
- the similarity between two lanes can be calculated by a method provided by the user, who only needs to program a Java class, descending from an abstract class from the AnaGel package, which is uploaded to the system.

This work is organized in the following way. In section 2 we present the user interface of AnaGel. The structure of the system is described in section 3. A comparison between AnaGel and similar tools is presented in section 4, together with some conclusions.

2 User Interface

AnaGel is structured as a distributed client-server application. The only software that is necessary to install on the client machine is a web browser with the appropriate Java Plug-in, currently Java Plug-in 1.3.0_01, which is available on the Java web site 1. If the user's browser does not support Java 2, the installation is automatically recommended and the URL is redirected to the download page.
1 http://www.java.sun.com


The system has an easy-to-use and intuitive interface. A set of tabs groups semantically correlated panels, guiding the user during the gel processing and analysis phases. Figure 2 shows the main window of the system.

Fig. 2. Main window of AnaGel

The reporting module provides functionalities to create reports as GIF images and send them by e-mail to the user, who can use the AnaGel reports as figures in word processors, in HTML pages, etc. Users upload their gel images and algorithms to the system through an HTML form. These data are stored on the server, in a separate place for each user. The uploaded images can later be accessed by the AnaGel client applet to be processed. The Gel Processing palette is designed to interact with the user and extract useful information from the image. From each lane AnaGel extracts a profile, which is a linear set of points in gray scale. Figure 3 shows one of the subpalettes, responsible for profile extraction and correction of distortions that may be present in the gel. In an interactive way, the system is able to find bands and estimate the molecular weight of each one. Reports with the extracted profiles and graphical information about the molecular weight estimation can be requested by the user. Figure 4 shows the Gel Register palette, which is used to provide some textual data about the electrophoresis process for documentation purposes. It is possible to control the visibility of each gel, in such a way that the owner can specify which other users can use the processed gel in their experiments. Through the Compare palette, shown in Figure 5, the user can define the parameters to search for lanes in his laboratory or in cooperating ones. In this palette, the user can specify his own comparison algorithm, which must have been previously uploaded, to search for similar lanes. The AnaGel system implements several tools for lane comparison, allowing comparison between two records, sorting of a set of records by proximity to a master one, and comparison of several lanes among themselves, generating a similarity matrix.
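For illustration only (this is not AnaGel's code), a profile in the sense above can be obtained by averaging gray values across a lane's width, row by row; the lane coordinates are assumed to be known.

import java.awt.image.BufferedImage;

/** Illustrative profile extraction: average gray value across the lane width, per row. */
public class LaneProfile {

    public static double[] extract(BufferedImage gel, int laneLeft, int laneWidth) {
        double[] profile = new double[gel.getHeight()];
        for (int y = 0; y < gel.getHeight(); y++) {
            long sum = 0;
            for (int x = laneLeft; x < laneLeft + laneWidth; x++) {
                sum += gel.getRaster().getSample(x, y, 0); // gray level of pixel (x, y)
            }
            profile[y] = (double) sum / laneWidth; // one point of the lane profile
        }
        return profile;
    }
}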


Fig. 3. Processing of a single lane

Fig. 4. Information about the electrophoresis process

Fig. 5. Bands sharing of two electrophoretical records


The cooperating laboratories to be used are defined in the Labs palette, shown in Figure 6, which lists all laboratories that constitute the AnaGel virtual repository.

Fig. 6. List of all laboratories of the AnaGel net

3 Implementation

AnaGel was designed to provide an environment for testing and validating users' conjectures. To provide the functionalities of the system, it is built on top of Java and mobile agent technologies. The system is implemented in a three-tier client/server architecture. The client, implemented as a Java applet, provides means of accessing the AnaGel software through the web. It is responsible mainly for image processing and report generation. Comparison between records is made on the server side, with the use of the users' customized algorithms. The server interfaces with a relational database system, where the processed records are stored. It also communicates with a mobile agent system to dispatch mobile agents in order to provide collaboration between different repositories. Dynamic class loading is used to allow the use of algorithms uploaded by the user. The supplied algorithm must inherit from an abstract class defined in the AnaGel API. Users' algorithms are uploaded through an HTML form. The submitted form is processed by a servlet [3], which recovers the supplied file, compiles it, and makes it available for future use. For example, if a user whose login is ronaldo wants to implement his own comparison algorithm, he should declare a class similar to that shown below

package users.ronaldo.compare;

import ufmg.dcc.anagel.compare.CompareSamples;
import ufmg.dcc.anagel.lane.Lane;

public class RonaldoCompare extends CompareSamples {
    public float compare(Lane lane1, Lane lane2) {
        // Method body: compute and return a similarity score for the two lanes ...
        return 0.0f; // placeholder
    }
}


and upload it as mentioned above. Mobile agents are used to provide collaboration between different, remotely located repositories. Some selection criteria (textual data about the run and samples) and the comparison algorithm are attached to mobile agents that are dispatched to each repository. In the destination laboratory, the agent interacts locally with the AnaGel server to pre-select records and filter them using the attached algorithm. The agent then returns to the source laboratory, bringing only the records that satisfy the required parameters. In our experiment, we used the Grasshopper [1] agent system. Each AnaGel server communicates with one agent system, which has two places, Dispatcher and Receiver. In the Dispatcher place resides a stationary agent that can be directly accessed by the AnaGel server to initiate and dispatch client agents through a proxy object. This place also receives the agents when they come back to the source agency and groups the partial results carried by each one to return to the AnaGel server. The Receiver place provides functionalities to receive visiting agents, allowing them to use local resources. A proxy agent residing in this place is capable of communicating with the AnaGel server. The agent is dispatched from a source agency to the destination agencies, carrying the selection criteria and the algorithm. Then it executes its task locally and moves back to the source agency. This communication protocol can be observed in Figure 7, after the sketch below.
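To make the agent's local task concrete, the following sketch illustrates the filtering step performed at each visited repository; CompareSamples and Lane follow the AnaGel API shown above, while Repository and Record are hypothetical placeholders, and Grasshopper's actual migration API is intentionally omitted.

import java.util.ArrayList;
import java.util.List;

import ufmg.dcc.anagel.compare.CompareSamples;
import ufmg.dcc.anagel.lane.Lane;

// Illustrative placeholders for the local AnaGel data access at a destination.
interface Repository { List<Record> preSelect(String criteria); }
interface Record { Lane lane(); }

/** The task an agent executes locally at each visited repository. */
class SearchTask {

    private final String criteria;          // textual data about run and samples
    private final CompareSamples algorithm; // user-supplied comparison algorithm
    private final List<Record> matches = new ArrayList<>();

    SearchTask(String criteria, CompareSamples algorithm) {
        this.criteria = criteria;
        this.algorithm = algorithm;
    }

    void runAt(Repository repository, Lane query, float threshold) {
        for (Record candidate : repository.preSelect(criteria)) {  // local pre-selection
            if (algorithm.compare(query, candidate.lane()) >= threshold) {
                matches.add(candidate); // only matching records travel back to the source
            }
        }
    }

    List<Record> results() { return matches; }
}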

[Figure 7 diagram: the Source AnaGel server drives a Dispatcher agent in the Dispatcher place of the Source Agency; mobile agents travel to the Receiver places of Destination Agencies 1 and 2, where Proxy agents communicate with Destination AnaGel servers 1 and 2.]

Fig. 7. Interaction among servers using mobile agents


Table 1. Comparative table of software tools for gel electrophoresis analysis: AnaGel, Phoretix (http://www.phoretix.com), TotalLab (http://www.amershambiosciences.com), Gel-Pro (http://www.mediacy.com/gppage.htm), GelSite (http://www.nucleicassays.com/gelsite), GeneTools (http://www.syngene.com/genetools.asp) and SigmaGel (http://www.spssscience.com/SigmaGel/index.cfm). The characteristics compared are: automatic lane finding; automatic band finding; analysis correction for irregular gels; quantitative analysis; molecular weight estimation; Rf calibration; histograms; dendrograms; isoelectric point determination; band sharing in the same gel; interaction with the work environment; image filtering; unlimited number of markers; library of markers; image annotation; customized algorithm for lane comparison; customized algorithm for lane normalization; similarity coefficient computation; accessibility through web browsers; shared database of experimental data; collaboration between laboratories; lane search by similarity; auditability of the gel processing phase. [The per-tool feature marks of this table could not be recovered from the source.]

4 Conclusions

In contrast with other commercial software for gel electrophoresis analysis and storage, AnaGel allows similarity search, providing a way to recover experimental data which match specific user criteria. Table 1 shows a comparison between AnaGel and other commercial software currently available. AnaGel provides a good environment for collaboration among several laboratories located around the world. It enables the sharing of experimental data among scientists, who can collaborate without any effort. It also allows the testing of new comparison methods developed by any user through its class upload and dynamic loading mechanism. This model of cooperation is easily extensible to several other applications that deal with shared experimental data.

References
1. Grasshopper - An Intelligent Mobile Agent Platform. White Paper.
2. Agent technology green paper. OMG Document agent/00-09-0. http://www.objs.com/agent/agents Green Paper v100.doc, September 2000.
3. Mary Campione, Kathy Walrath, and Alison Huml. The Java Tutorial Continued: The Rest of the JDK. Addison-Wesley, Reading, MA, USA, 1999.
4. D. Grierson. Gel Electrophoresis of RNA. In Gel Electrophoresis of Nucleic Acids: A Practical Approach. IRL Press Limited, 1982.
5. B. D. Hames. An Introduction to Polyacrylamide Gel Electrophoresis. In Gel Electrophoresis of Proteins: A Practical Approach. IRL Press Limited, 1981.
6. B. D. Hames and D. Rickwood. Gel Electrophoresis of Proteins: A Practical Approach. IRL Press, 1981.
7. Colin G. Harrison, David M. Chess, and Aaron Kershenbaum. Mobile Agents: Are they a good idea? Technical report, IBM, T. J. Watson Research Center, Yorktown Heights, New York, March 1995.
8. Danny B. Lange and Mitsuru Oshima. Seven good reasons for mobile agents. Communications of the ACM, 42(3):88-89, March 1999.
9. Edrê Quintão Moreira. Um modelo cooperativo para aplicações distribuídas baseado na web: Aplicação à análise e armazenamento de registros eletroforéticos. Master's thesis, Departamento de Ciência da Computação - Universidade Federal de Minas Gerais, April 2002.
10. Maria Luiza Assunção Pimenta. Anagel - Armazenamento e Análise de Registros Eletroforéticos, 1996.
11. Maria Luiza Assunção Pimenta. Anagel: Um sistema de análise de registros eletroforéticos. Master's thesis, Departamento de Ciência da Computação - Universidade Federal de Minas Gerais, 1996.
12. D. Rickwood and B. D. Hames. Gel Electrophoresis of Nucleic Acids: A Practical Approach. IRL Press, 1982.
13. P. G. Sealey and E. M. Southern. Electrophoresis of DNA. In Gel Electrophoresis of Nucleic Acids: A Practical Approach. IRL Press Limited, 1982.
14. David Wong, Noemi Paciorek, and Dana Moore. Java-based mobile agents. Communications of the ACM, 42(3):92-102, March 1999.


On the Pursuit of Optimal Sequence Trimming Parameters for EST Projects


Fabiano Cruz Peixoto1 and José Miguel Ortega2

1 Laboratório de Computação Científica, UFMG, Belo Horizonte, MG 31270-901, Brazil, fpeixoto@lcc.ufmg.br
2 Departamento de Bioquímica e Imunologia, ICB, UFMG, Belo Horizonte, MG 31270-010, Brazil, miguel@icb.ufmg.br

Abstract. In EST projects it is extremely important to be able to identify the informative region of the read, by trimming the non-informative ones. Several methods for sequence trimming are available. In this work we present a methodology to compare such methods, based on the sequencing of defined sequences (the pUC18 cloning vector) and on homology searches with BLAST to find the informative region. This methodology has shown that Phred's trim_alt algorithm provides proper sequence trimming.

Keywords: Phred, trimming, EST

1 Introduction

Single-run, partial sequencing of cDNA generates information known as an Expressed Sequence Tag (EST) [1]. A collection of EST sequences, organized in databases such as dbEST or UniGene [2], has assisted gene discovery programs as well as gene annotation in genome projects. For a proper use of EST information, it is critical to identify and eliminate the portion of the read that stands for sequences of low quality [3, 4]. Moreover, when used as the subject of a BLAST [5] homology query, an EST sequence shall contain only the portion of sequence that will lead to identification of the transcript that it represents. We noticed that ESD-processed files generated by an automated MegaBACE sequencer usually matched sequences in databases beyond the limits of the trimmed sequences. This was noticed when Phred algorithms [6] (trim, or trim_alt with trim_cutoff 10%) were used to filter out the low quality part of the sequence. In this work, we pursue the optimal parameters for trimming. This is done by analyzing the results of the sequencing of a known molecule, a plasmid cloning vector. For sequences basecalled by Phred, we chose as informative the region delimited by a BLAST homology query. Our work shows that calibration of a given sequencer with under a thousand reads supports


the use of the trim_alt parameter up to 18%, without incorporation of low quality reads at the end of the sequence. We did not intend to address calibration of the initial low quality part of the read, since proper positioning of a sequencing primer allows one to clearly depict the vector/insert transition and trimming. Our approach assists in the choice of trim_alt parameters that lead to the highest amount of information contained in ESTs.

2 Methodology

2.1 Sequences

All the sequences used in this work have been provided by 3 laboratories from Universidade Federal de Minas Gerais. Briefly, standard sequencing reactions were pooled and distributed on three 96-well plates. Each plate was used to load a MegaBACE sequencer three times, yielding a total of 864 attempts. From those, 846 processed ESD files were obtained.

2.2 Processing

After sequencing, all ESD files have been collected and processed as summarized in figure 1.

Fig. 1. Processing Methodology.

Step 1. First, a BLAST database was built, using the pUC18 cloning vector sequence as its only member. This database was referred to as pucdb.

formatdb -t pucdb -i puc.fasta -p F


Step 2. All the ESD files were processed by Phred 3. In this step, no trimming parameters were used.

phred traces/trace_i.esd -st fasta -q qual/trace_i.qual -s fasta/trace_i.fasta

Step 3. Each FASTA-formatted file generated by Phred was submitted to a homology search against the pUC18 cloning vector database (pucdb) using MegaBLAST. The parameters were based upon those used in the UniGene Clustering Pipeline [2], [7].

megablast -i fasta/trace_i.fasta -d pucdb -D 3 -f T -a 1 -X 30 -q -2 -W 40 -F "m" -U T

For each MegaBLAST result, the longest alignment obtained was stored for further use. If no BLAST hit was obtained, we simply discarded the ESD file. In our experiments, only four out of 846 were discarded. Here we define the Information Position concept:

Definition 1. Information Position is represented by a tuple (start, end), where start and end are, respectively, the q.start and q.end of the longest alignment reported by MegaBLAST.

Step 4. The ESD files were processed again using Phred, but this time trimming parameters were used. Two different types of executions were done: the first using -trim and the other using -trim_alt 4. When using -trim_alt, the parameter -trim_cutoff varied from 0.01 (1%) through 0.25 (25%).

phred traces/trace_i -trim "" -st fasta -q qual/trace_i.qual -s fasta/trace_i.fasta.trim

phred traces/trace_i -trim_alt "" -st fasta -q qual/trace_i.qual -s fasta/trace_i.fasta.j -trim_cutoff j    (* j varying from 0.01 to 0.25)

Definition 2. Trimming Position is represented by a tuple (start, end), where start and end are, respectively, the start and end trimming positions deduced from the first line of the resulting FASTA-formatted file.

2.3 Output
2.3 Output

The following information⁵ was obtained from the processing phase described above:


Information Position[i]: Information Position for each processed sequence i, where i is a sequence index;
Trimming Position[i][0]: Trimming Position for each processed sequence i with the -trim option;
Trimming Position[i][j]: Trimming Position for each processed sequence i with the -trim_alt option, with trim cutoff j (j varying from 1% to 25%).

³ In this work we used Phred version 0.000925.c.
⁴ For details about these two parameters, see the Phred documentation [6].
⁵ Here we are concerned about the end part of the sequence; therefore we use the end component of the Information Position and Trimming Position tuples.

Definition 3. For each sequence i, and each cutoff value j, we define Information Index[i][j] as:

InformationIndex[i][j] = TrimmingPosition[i][j] - InformationPosition[i]   (1)

Clearly, when the Information Index is negative, the result sequence was trimmed more than it should have been, causing information contained in it to be discarded. When the Information Index is positive, the result sequence was trimmed less than it should have been, resulting in the inclusion of information that is not valid. Based on the Information Index, the following values can be calculated:

Disc Sequences Total number of result sequences where the trimming procedure resulted in an end position lower than the end position reported by the best alignment⁶.

DiscSeq[j] = #{sequence i | InformationIndex[i][j] < 0}   (2)

Inc Sequences Total number of result sequences where the trimming procedure resulted in an end position greater than the end position reported by the best alignment.

IncSeq[j] = #{sequence i | InformationIndex[i][j] > 0}   (3)

Discarded Bases Total number of bases that, according to the MegaBLAST alignment, were correct, but were not included in the final results because of the trimming procedure. For each cutoff value j, the number of discarded bases is calculated based only on sequences where InformationIndex[i][j] is lower than 0.

DiscBases[j] = - Σ_i InformationIndex[i][j], for all i where InformationIndex[i][j] < 0   (4)

⁶ The symbol # means "number of".

Included Bases Total number of bases that, according to the MegaBLAST alignment, were incorrect, but were included in the final results because of the trimming procedure. For each cutoff value j, the number of included bases is calculated based only on sequences where InformationIndex[i][j] is higher than 0.

IncBases[j] = Σ_i InformationIndex[i][j], for all i where InformationIndex[i][j] > 0   (5)

Average Discarded Bases Average number of bases discarded from the result sequences where discarding occurred.

AvgDiscBases[j] = DiscBases[j] / DiscSeq[j]   (6)

Average Included Bases Average number of bases included in the result sequences where inclusion occurred.

AvgIncBases[j] = IncBases[j] / IncSeq[j]   (7)
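Once the end components of the Information Position and of each Trimming Position are known for every read, equations (1)-(7) reduce to a single pass over the data. The Perl sketch below is our illustrative rendering of those definitions; the input hashes (%info_end, %trim_end) are assumed to have been filled by parsing the MegaBLAST reports and the trimmed FASTA headers.

# Metrics of equations (2)-(7) for one cutoff value. The arguments map a
# read id to the end of its best alignment and to the end of its trimmed
# region, respectively (assumed inputs).
sub trimming_metrics {
    my ($info_end, $trim_end) = @_;
    my ($disc_seq, $inc_seq, $disc_bases, $inc_bases) = (0, 0, 0, 0);
    for my $id (keys %$info_end) {
        # Equation (1): TrimmingPosition end minus InformationPosition end
        my $index = $trim_end->{$id} - $info_end->{$id};
        if ($index < 0) {           # over-trimmed: valid bases discarded
            $disc_seq++;
            $disc_bases -= $index;  # equation (4): count of lost bases
        }
        elsif ($index > 0) {        # under-trimmed: invalid bases included
            $inc_seq++;
            $inc_bases += $index;   # equation (5)
        }
    }
    return {
        DiscSeq      => $disc_seq,                                # (2)
        IncSeq       => $inc_seq,                                 # (3)
        DiscBases    => $disc_bases,                              # (4)
        IncBases     => $inc_bases,                               # (5)
        AvgDiscBases => $disc_seq ? $disc_bases / $disc_seq : 0,  # (6)
        AvgIncBases  => $inc_seq  ? $inc_bases  / $inc_seq  : 0,  # (7)
    };
}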

3 Results and Discussion

All the values described in Methodology/Output were calculated for a set of 846 sequences. A small group of four sequences was discarded because no BLAST hit with the pUC18 cloning vector was obtained. The values were imported into a spreadsheet and the results are displayed below. To verify which parameters best minimize the number of bases - included or discarded - that do not correspond to the sequenced molecule, results were analyzed as shown in figure 2. In this graph, four sets of results have been plotted. The first two, Included(trim) and Discarded(trim), represent, respectively, the calculated values for AvgIncBases[0] and AvgDiscBases[0]. It is important to note that these two points have been plotted as lines in the graph just to compare them to the other sets of values. For the sequences that had bases included by the trimming process, six bases were included in each sequence on average. For the sequences that had bases discarded by the trimming process, 291 bases were discarded from each sequence. Clearly, the use of the -trim parameter results in loss of information. The other two sets of results, Included(TrimAlt) and Discarded(TrimAlt), represent, respectively, the values for AvgIncBases[i] and AvgDiscBases[i], with i varying from 1% to 25%. It can be seen that, using a -trim_cutoff parameter up to 18%, there is no significant increase in the average number of bases included in the sequences that had inclusions. For example, the number depicted for 19% is 19 bases included, which may not cause any harm in a homology search using BLAST.


Fig. 2. Average Number of Bases Included/Discarded.

More interesting results can be seen by analyzing the sequences that presented discarded bases. With a -trim_cutoff of 1% (average Phred value of 20) the average number of discarded bases is 430. This procedure yields a number of discarded bases higher than using -trim. A value close to 3% gives results similar to -trim. Steadily increasing the -trim_cutoff, there is a decrease in the number of discarded bases. Remarkably, it is possible to decrease the number of discarded bases without increasing the number of included bases, as can be observed in the graph. For example, using 10% (average Phred value of 10), the average number of discarded bases decreased to 161, while the average number of included bases is 0. A good compromise is achieved with 18% (average Phred value of 7). In this case, the average number of discarded bases decreased to 45, while the average number of included bases is 8. A similar analysis was performed considering the total number of included and discarded bases (figure 3). In this case we also calculated an overall number of bases included/discarded, represented by the Total(trim_alt) set. It is interesting to notice that the line representing this set intercepts the x-axis between trim cutoff values of 19% and 20%. Therefore, even with a minimum overall number of included/discarded bases, the choice between 19% and 20% may cause the inclusion of fortuitous bases (16 and 30, respectively). We next analyzed the number of sequences that presented included or discarded bases as a function of the -trim_cutoff parameter (Figure 4). It can be seen that from 15% up to 25%, the number of sequences that present included bases rises from


Fig. 3. Total Number of Bases Included/Discarded.

Fig. 4. Total Number of Sequences with Bases Included/Discarded.

14 to 677. Similarly, as expected, the number of sequences with discarded bases decreases from 828 to 165. From a conservative point of view, it can be seen in the graph that, for a trim cutoff below 16%, there are no sequences, or only a small number, that switch from having discarded bases to having included bases.

4 Conclusions and Future Work

In this paper we demonstrated that the Phred -trim_alt algorithm can be used with the -trim_cutoff parameter up to 18% without including miscalled bases. The -trim_alt algorithm with the proper parameters is capable of recovering more information than the -trim algorithm. Our results suggest that other trimming algorithms, such as window-based trimming algorithms, should also be analyzed. We are presently extending this analysis to compare and calibrate different sequencing equipment and sequencing procedures.

Acknowledgment

We thank Laboratório de Genética e Bioquímica, Laboratório de Imunologia de Doenças Infecciosas and Laboratório de Biodiversidade e Evolução Molecular from Rede Genoma de Minas Gerais (Minas Gerais Genome Network) - especially Marina M. Mourão, Lucila Grossi Gonçalves Pacífico and Renata A. Ribeiro - for the sequences. We also thank CENAPAD-MG/CO for the machines used in this work. Finally we thank Alessandra Campos and Fabrício R. Santos for revising this manuscript.

References
[1] Adams, M. D., Soares, M. B., Kerlavage, A. R., Fields, C., Venter, J. C.: Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nat Genet 4 (1993) 373-380.
[2] http://www.ncbi.nlm.nih.gov/UniGene/build.html.
[3] Ewing, B., Hillier, L., Wendl, M. C., Green, P.: Base-Calling of Automated Sequencer Traces Using Phred. I. Accuracy Assessment. Genome Research 8 (1998) 175-185.
[4] Ewing, B., Green, P.: Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities. Genome Research 8 (1998) 186-194.
[5] Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., Lipman, D. J.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (1997) 3389-3402.
[6] http://www.phrap.org/phrap.docs/phred.html.
[7] Lukas Wagner, private communication.


Evolving Phylogenetic Trees: An Alternative to Black-Box Approaches


Oclair Prado¹, Fernando J. Von Zuben¹, Sérgio F. dos Reis²

¹ DCA/FEEC/Unicamp - Brazil
{oclair, vonzuben}@dca.fee.unicamp.br
² IB/Unicamp - Brazil
sfreis@unicamp.br

Abstract. This paper presents the main steps to reconstruct phylogenetic trees using an evolutionary algorithm with a wide range of alternative computational procedures for fitness evaluation and for tuning the search engine. Due to the fine equilibrium between exploration and exploitation of the search space, high-quality reconstructions are obtained among an explosion of candidates and with the additional requirement of adjusting the length of all branches in a given tree. A toolbox is available, and the matrix codification adopted, the associated genetic operators, and the optimization procedures involved are clearly described to guarantee reproducibility.

1. Introduction
A molecular phylogenetic tree is a tree-structured graph that represents the evolutionary process of genes, and is constructed from sequence attributes obtained from several organisms, which will compose the leaves of the tree. Reconstruction of phylogenetic trees is one of the most important problems in evolutionary theory [15]. The basic principle of tree reconstruction is to infer the evolutionary process of taxa (biological entities such as genes, proteins, individuals, populations, species, or higher taxonomic units) from their molecular sequence data [13]. A difficulty here is the lack of information: we do not have data from the common ancestors, so their states must be inferred from the analysis of the current organisms [9]. Finding the optimal tree is an NP-Complete problem [2], and the number of candidate trees may be calculated using the following expression:
(2n - 3)! / (2^(n-2) (n - 2)!)   (1)

The ability of evolutionary algorithms to find near-optimal solutions in search spaces with a huge number of candidates will be explored to propose high-quality solutions to the problem of phylogenetic inference, especially when many taxa are included or demanding evolutionary models are applied to define the fitness of a given tree [5]. This paper shows the details of the Maximum Likelihood method and presents three genetic operators specially developed to create new tree topologies based on old ones. Using these genetic operators, the evolutionary approach is potentially able to generate any topology during the search. The Phylogenetic Tree Project (PTP) is the resulting software package, specially developed to provide a user-friendly interface that is suitable even for researchers with incipient knowledge of computer systems. Although other works have already been presented in the literature using evolutionary computation for phylogenetic reconstruction, such as Matsuda [7] and Matsuda et al. [8], this paper differs from the others in explicitly presenting the details of the gene codification and mutation operators used in PTP, which helps other researchers to reproduce our results and even improve them. To the best of our knowledge, no other paper in this field has done this before. Another important aspect related to PTP is the availability of a wide range of alternative computational procedures for fitness evaluation and for tuning the search engine, seldom available together in one software package. An extensive list of other available packages may be found at:
http://evolution.genetics.washington.edu/phylip/software.html.

2. Phylogeny
A number of methods have been proposed for reconstructing phylogenetic trees. These methods can be divided into two groups [14]: (a) model-based methods, in which there must be a criterion to analyze and evaluate the candidate trees until the best one, or a good one, is found. This analysis is generally performed using a probabilistic model. (b) non-model-based methods, in which there must be a sequence of steps to achieve the best tree. One important thing about non-model-based methods is that they generally find one unique answer for the problem under analysis, while the model-based methods may find many good trees as the final result. One exception is the parsimony ratchet method [10], which is one of the non-model-based methods that is able to propose multiple trees.

2.1. Model-Based Method

Maximum likelihood method: attempts to avoid the limitations of other methods by trying to make explicit and efficient use of all character-states based on stochastic models of those data (e.g. DNA base substitution models, amino acid substitution models, etc.) [15]. The use of the maximum likelihood (ML) method for phylogenetic inference was first presented by [1] for gene frequency data, but they faced a number of problems in implementing this method. After that, Felsenstein [3] developed an improved algorithm for constructing a phylogenetic tree using the ML method. As they are presently implemented, ML methods for tree reconstruction start from a given tree topology, and then search for the branch lengths that maximize the probability of the data being explained, given the tree. These probabilities are then compared over different trees (with different topologies) and the tree with the greatest probability is taken as the best estimate [15].


For DNA sequence data, ML methods work with aligned nucleotides with no insertions/deletions [9]. The model on which ML is based specifies the probability of one sequence changing to another over a given interval of time, under a specific base-substitution model.
Fig. 1. Nucleotide substitution rates: transition (α) and transversion (β)

Figure 1 shows the base-substitution model proposed by Swofford et al. [14]. Under the one-parameter mutation model, each of the four bases is expected to become equally frequent, suggesting that the probability of any one mutating to another one is the same, i.e. 0.25. An alternative would be to use the average frequencies of each base, found in the set of sequences from which a tree is being constructed [15]. At any single site, the ML model works with probabilities P_ij(T) that base i will have changed to base j after a time T. Subscripts i and j take the values 1, 2, 3 and 4 for bases A, G, C and T, respectively. According to Felsenstein [3], these probabilities can be written as:

P_ii(T) = (1 - p) + p π_i,  i = j   (2)
P_ij(T) = p π_j,  j ≠ i   (3)
p = 1 - e^(-v)   (4)

where π_i is the relative frequency of the i-th nucleotide and the v's are the branch lengths estimated by maximizing the likelihood function for a given set of observed nucleotides. Figure 2 shows one of the 3 possible trees when 3 sequences are considered. For nucleotide position j in the sequences under analysis, the observed bases are C, T, C and the unobserved bases are set to k and l. Adding the terms for all possible values of k and l, the likelihood L(j) is given by:

L(j) = Σ (k=1..4) Σ (l=1..4) π_k P_kl(v4) P_kC(v3) P_lC(v1) P_lT(v2)   (5)

Although neighboring nucleotides in a DNA sequence are not independent, the models do assume independence of evolution at different sites, so that the probability of a set of sequences for some tree is the product of the probabilities for each of the sites in the sequences. Calculating over all m sites, the likelihood L is given by:

L = Π (j=1..m) L(j)   (6)

One method for maximizing the likelihood of a tree, adjusting the branch lengths v, is presented in Weir [15].
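To make equations (2)-(6) concrete, the Perl fragment below evaluates the likelihood of the three-sequence tree of Figure 2. Only the formulas come from the text; the base frequencies, branch lengths and the single site pattern (C, T, C) are illustrative values chosen by us.

use strict;
use warnings;

# Relative base frequencies (A, G, C, T); 0.25 each as in the
# one-parameter model. Illustrative values, as are the branch lengths.
my @pi = (0.25, 0.25, 0.25, 0.25);

# Equations (2)-(4): probability of base $i changing to base $j along a
# branch of length $v, with p = 1 - exp(-v).
sub P {
    my ($i, $j, $v) = @_;
    my $p = 1 - exp(-$v);
    return $i == $j ? (1 - $p) + $p * $pi[$i] : $p * $pi[$j];
}

# Equation (5): site likelihood for the tree of Figure 2 with observed
# bases C, T, C, summing over the unobserved internal states k and l.
my @v = (0.1, 0.1, 0.05, 0.05);     # (v1, v2, v3, v4)
my ($C, $T) = (2, 3);               # base indices: A=0, G=1, C=2, T=3
my $site_L = 0;
for my $k (0 .. 3) {
    for my $l (0 .. 3) {
        $site_L += $pi[$k] * P($k, $l, $v[3]) * P($k, $C, $v[2])
                           * P($l, $C, $v[0]) * P($l, $T, $v[1]);
    }
}

# Equation (6): over m independent sites the likelihood is a product,
# so in practice one accumulates log-likelihoods.
printf "site likelihood %.6g (log %.4f)\n", $site_L, log($site_L);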

3. PTP (Phylogenetic Tree Project)


PTP (Phylogenetic Tree Project) implements Distance Matrix and Maximum Likelihood modules for fitness evaluation, given a tree, besides considering the generic base-substitution model in Figure 1. We intend to add more modules in the future, and the software was developed to accept extensions.
Fig. 2. One of the rooted trees for three sequences

3.1. Maximum Likelihood Method

This is one of the modules that may be considered by the EA module to calculate the fitness of the tree. It is based on the Maximum Likelihood concept, as shown in section 2.1. In the current version of PTP, branch-length maximization is implemented using the gradient optimization method. The necessary details to implement the optimization algorithm are described in Prado [11]. The other modules are also described in Prado [11].

Codification of the tree

In the PTP software, a tree format structure was used rather than a list of attributes, as illustrated in Figure 3.


Fig. 3. One possible graph representation for a tree.

Trees, like the one represented in Figure 3, may be converted into their matrix of adjacencies [6]. Figure 4 shows the matrix of adjacencies used inside PTP for the tree of Figure 3, as a particular case.

Genetic Operators

PTP has three operators for mutation:

Leaf-leaf mutation: exchanging columns of the matrix of adjacencies at the left side of the root column exchanges leaves.
Fig. 4. The matrix of adjacencies for the tree of Figure 3.

Leaf-branch mutation: this operator exchanges one column at the left side of the root with another at the right side.

Branch-branch mutation: exchanging columns of the matrix of adjacencies at the right side of the root column exchanges branches.

PTP is able to find any candidate tree using these genetic operators during the search process. Along our simulations, these operators have provided enough flexibility for the search engine, and their parameters may be adjusted by the user in one of the PTP configuration screens. The high number of PTP parameters should not be a reason for worry to users, because all of them have default values, with good performance in a wide range of simulations. The use of these operators allows PTP to find a good tree in a reasonable time. Although the matrix of adjacencies is a sparse matrix, a parsimonious structure is adopted here to simplify the computation. Since a leaf is a terminal node and cannot be the father of any other node, we do not work with the classical matrix of adjacencies, as stated in Manber [6]; we use just the rows needed for the fathers. Using this simplification, our matrix takes nearly half the memory space of the conventional one.
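A hedged sketch of the column-swap idea behind these operators follows; the matrix layout (one column per node, leaf columns to the left of the root column) is our reading of the codification described above, not an excerpt from the PTP source.

# Leaf-leaf mutation as a column swap on the matrix of adjacencies:
# columns to the left of the root column stand for leaves. $matrix is a
# reference to an array of row references.
sub leaf_leaf_mutation {
    my ($matrix, $root_col) = @_;
    my $a = int rand $root_col;         # pick two leaf columns
    my $b = int rand $root_col;
    return if $a == $b;                 # degenerate pick: mutate nothing
    for my $row (@$matrix) {
        @$row[$a, $b] = @$row[$b, $a];  # swap the two columns in this row
    }
}
# Branch-branch mutation is the same swap restricted to columns at the
# right side of the root column; leaf-branch mutation picks one column
# from each side.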

4. Simulation Results
One interesting experiment that we have done with sequences from the literature [15], shown in Table 1, produced the results reported in Figure 5. Five sequence fragments were analyzed with three available toolboxes: Phylip (DNAML module) [4], PTP [12] and PAML [16]. PTP and Phylip were used to find the topology of the maximum likelihood tree, while PAML was used only to calculate the maximum likelihood, because the tree topology must be provided as an input. Another interesting point here is that PTP works with evolutionary algorithms and uses rooted trees, while Phylip uses unrooted trees and a distinct method to find the tree topology.
Table 1. Mitochondrial DNA sequences used to compare results (obtained from Weir [15])

Name        Fragment
Human       GTAAATATAGTTTAACCAAAACATCAGATTGTGAATCTGACAACAGAGGCTTACGACCCCTTATTTACC
Chimpanzee  GTAAATATAGTTTAACCAAAACATCAGATTGTGAATCTGACAACAGAGGCTCACGACCCCTTATTTACC
Gorilla     GTAAATATAGTTTAACCAAAACATCAGATTGTGAATCTGATAACAGAGGCTCACAACCCCTTATTTACC
Orangutan   GTAAATATAGTTTAACCAAAACATTAGATTGTGAATCTAATAATAGGGCCCCACAACCCCTTATTTACC
Gibbon      GTAAACATAGTTTAATCAAAACATTAGATTGTGAATCTAACAATAGAGGCTCGAAACCTCTTGCTTACC

Fig. 5. Generated trees from sequences of Table 1 using maximum likelihood method in all cases: Tree in part (a) was obtained with the module DNAML from Phylip [4], the tree in (b) was obtained with PTP [12], and the tree in (c) was obtained with PAML [16], but the tree topology was given by PTP.


In order to look for a good tree topology, PTP has default parameters: total and intermediate populations with 100 elements each, execution time limit set to 10 hours, generation limit set to 100, and mutation percentage set to 100%; the selection strategy used in each generation takes 40% of the total population from the best elements among all of them (intermediate and total population from the past generation) and 10% from the worst elements of all populations. This selection strategy seems to preserve the evolving capacity by increasing the diversity of the population. In PTP, the fitness of each element is its maximum likelihood. There are several substitution models proposed to estimate the likelihood of a given tree, depending on the values of α and β in Figure 1. PTP's default substitution model is the Felsenstein model [3]. Figure 5 shows the results obtained with these three toolboxes and Table 2 shows the pairwise distances between the elements.
Table 2. Pairwise distances

Distances              (a)    (b)    (c)
Gorilla-Gibbon         0,183  0,182  0,177
Gorilla-Orangutan      0,096  0,107  0,091
Gorilla-Human          0,045  0,048  0,043
Gorilla-Chimpanzee     0,030  0,024  0,029
Gibbon-Orangutan       0,189  0,189  0,178
Gibbon-Human           0,228  0,230  0,220
Gibbon-Chimpanzee      0,213  0,206  0,206
Human-Chimpanzee       0,015  0,024  0,014

Although the tree in Figure 5(a) seems different from the other two at first sight, in fact all of them are similar from a phylogenetic reconstruction point of view. Taking the distance between any two elements as the sum of the branch lengths from one to the other, one finds very similar values in the three cases. This is one more example of the strength of the maximum likelihood method, which was capable of adjusting the branch lengths in a coherent way, even though one of the trees is unrooted. So, despite the simplicity of the data set considered here, and given the availability of previously proposed software packages with similar performance, what are the motivations for the investment in another package, like PTP? The first one is that PTP is not a black-box implementation, like other alternative evolutionary approaches for phylogeny inference, because a step-by-step description of the modules is provided. Secondly, PTP was designed to incorporate a wide range of alternative procedures for fitness evaluation and modules for human-machine interaction. Therefore, the results in Table 2 should be interpreted as a validation of the PTP package, and not as a failure to overcome the results of already available packages.

5. Conclusions
In this paper, we presented the main steps involved in the reconstruction of phylogenetic trees using evolutionary computation. A toolbox is available [12], and the user has the possibility of defining the modules that will evaluate the fitness of a given tree. The most refined options lead to the calculation of the likelihood of a given tree, using an iterative optimization algorithm to obtain the length of the branches. Next versions of PTP will implement a 2nd-order method for this optimization, and some other phylogenetic tree reconstruction approaches, such as maximum parsimony and neighbor-joining. The architecture proposed here will facilitate the process of appending these new modules, and our three mutation operators will still work properly, because they operate directly on the genetic code, associated with a matrix of adjacencies. Other interesting PTP features are its user-friendly graphical user interface, which eases the process of acquiring the required parameters, and the ability to stop at any time and save the partial results if it is necessary to resume later. The text file that provides the inputs is also used to record intermediate information, such as the tree topologies under evaluation and their branch lengths. Although the search engine based on an evolutionary algorithm presents a fine runtime scalability with the size of the problem, when many taxa are being manipulated the search may require a long time. So, the ability to stop the process, save the intermediate data, and resume later may be useful.

Acknowledgements
CNPq has sponsored this research via grants 521100/01-1 and 300910/96-7.

References
1. Cavalli-Sforza, L.L. & Edwards, A.W.F. Phylogenetic analysis: Models and estimation procedures, Am. J. Hum. Genet. 19:233-257, 1967.
2. Day, W.H.E. Computational complexity of inferring phylogenies from dissimilarity matrices, Bull. Math. Biol. 49:461-467, 1987.
3. Felsenstein, J. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17:368-376, 1981.
4. Felsenstein, J. Phylip toolbox and other software available at http://evolution.genetics.washington.edu/phylip/software.html.
5. Lewis, P.O. A Genetic Algorithm for Maximum-Likelihood Phylogeny Inference Using Nucleotide Sequence Data. Mol Biol Evol 15, pp. 277-283, March 1998.
6. Manber, U. Introduction to Algorithms: A Creative Approach. Addison-Wesley Publishing Company, 1989.
7. Matsuda, H. Protein phylogenetic inference using maximum likelihood with a genetic algorithm, Pacific Symposium on Biocomputing. World Scientific, London, pp. 512-523, 1996.
8. Matsuda, H., Yoshikawa, T., Tabe, T., Kishinami, R. & Hashimoto, A. On the Implementation of a Phylogenetic Tree Database, IEEE, 42-45, 1999.
9. Nei, M. & Kumar, S. Molecular Evolution and Phylogenetics, Oxford University Press, 2000.
10. Nixon, K. The parsimony ratchet, a new method for rapid parsimony analysis, Cladistics 15:407-414, 1999.
11. Prado, O. Computação Evolutiva Empregada na Reconstrução de Árvores Filogenéticas, Dissertação de Mestrado, Faculdade de Engenharia Elétrica e de Computação, Unicamp, 2001.
12. Prado, O. & Von Zuben, F.J. The Phylogenetic Tree Project (PTP). Toolbox available at ftp://ftp.dca.fee.unicamp.br/pub/docs/vonzuben/oclair/, 2001.
13. Swofford, D.L. & Olsen, G.J. Phylogeny Reconstruction, Molecular Systematics, ed. D.M. Hillis and C. Moritz, 411-501, 1990.
14. Swofford, D.L., Olsen, G.J., Wadell, P.J. & Hillis, D.M. Phylogeny Inference. Molecular Systematics, Sinauer, Sunderland, MA, pp. 407-514, 1996.
15. Weir, B.S. Genetic Data Analysis II, Sinauer, Sunderland, MA, 1996.
16. Yang, Z. Phylogenetic analysis by maximum likelihood (PAML), version 3.0. University College London, London, England, 2000.


Analysis of Functional Interactions of Enzymes in Mycoplasma pneumoniae


Adriana N. dos Reis, Cláudia K. Barcellos, Fabiana Herédia, Jean J. Schmith, José Carlos M. Mombach, Ney Lemke, Rejane A. Ferreira

Laboratory of Bioinformatics and Computational Biology
Centro de Ciências Exatas e Tecnológicas, Universidade do Vale do Rio dos Sinos
Av. Unisinos, 950 93022-000 São Leopoldo RS, Brazil
{mombach, lemke}@exatas.unisinos.br

Abstract. In an organism proteins perform many biological functions. In the metabolism they act as catalysts in a network of chemical reactions. Many biological functions are integrated through a network of interactions and their characterization is important to the understanding of the orchestrated mechanisms that underlie the machinery of life. In this work we propose a new description of the network of functional interactions among enzymes. We present preliminary results of its application to the metabolism of Mycoplasma pneumoniae. We investigate the importance and conservation of enzyme connectivity in the metabolism and find that some enzymes are highly connected. Additional applications of the network include the easier visualization of enzyme actions and interactions and the identification of incorrect functional annotations.

Keywords: Metabolic Networks, Proteomics, Drug Targets

1 Introduction
The challenge of postgenomic biology is to understand how genetic information results in the orchestrated action of gene products in time and space to generate function. In medicine, this is perhaps best reflected in the numerous disorders based on polygenic traits and the notion that the number of human diseases exceeds the number of genes in the genome (Gavin et al., 2002). Inferring the function of a DNA or protein sequence by analogy to the functions of other similar sequences has had a profound impact on our ability to identify the functions of sequenced genes. Similar reasoning can be applied to biological pathways to identify the presence of known metabolic pathways in the annotated genome of an organism: it is possible to predict the function of an unknown sequence by searching a reference database of sequences for those that are similar to it, and it is possible to predict pathways from a sequenced genome by analogy to a reference database of pathways (Karp et al., 1999).

In our studies we have used the KEGG database (Kyoto Encyclopedia of Genes and Genomes, Kanehisa & Goto, 2000) and the ERGO (WIT) database (Overbeek et al., 2000). The KEGG approach to pathway prediction relies on a generic, multi-species conceptualization of pathways. Thus, a generic pathway may not occur in its entirety in any one organism. When performing a pathway prediction for a new organism, the KEGG group highlights on the generic pathway diagram those enzymes present in the genome of that organism.

Genomics-based drug discovery relies strongly on accurate functional annotation of a genome. Although the frequency of incorrect functional annotations in the sequence databases has not been firmly established, a recent study has estimated the error rate to be 8% in full microbial genomes (Brenner, 1999). Incorrect annotations can cost a pharmaceutical company time in pursuing incorrect targets. Pathway analysis of a genome can help identify both false-positive (incorrect) functional annotations and false-negative annotations (unidentified genes) through an examination of the pathway distribution of gene annotations. Pathway analysis identifies pathway holes - missing steps within a pathway that is largely known. Pathway holes may be due to enzymes that have not yet been identified within the genome and are hidden among the uncharacterized ORFs. Pathway analysis also identifies singleton steps, which are single steps in a pathway where the majority of steps in the pathway are predicted to be absent. Singleton steps may be due to incorrect functional annotations, and so their predicted functions should be carefully verified.

One type of drug target that should probably be avoided is a metabolic enzyme for which the organism has one or more isoenzymes. By contrast, metabolic enzymes that are used in multiple pathways, i.e., highly connected ones, are attractive targets, because knocking out a single protein could disrupt multiple pathways. An enzyme could be used in multiple pathways because it is multifunctional and/or because the reaction that it catalyses is used in multiple pathways. Microbial enzymes that do not have human homologs may also be attractive targets although, if the human and microbial enzymes have diverged sufficiently, they may have different drug responses (Karp, 1999).

The sequencing of the entire genomes of Mycoplasma genitalium (Fraser et al., 1995) and M. pneumoniae (Himmelreich et al., 1997) has attracted considerable attention among life scientists to the molecular biology of mycoplasmas, the smallest self-replicating organisms. Mycoplasma pneumoniae is the closest known relative of M. genitalium, with a genome size of 816 kb, 236 kb larger than that of M. genitalium. Comparison of the two genomes indicates that M. pneumoniae includes orthologs of virtually every one of the 480 M. genitalium protein-coding genes, plus an additional 197 genes (Himmelreich et al., 1997). M. pneumoniae is of special interest because it is the pathogen of pneumonia in humans (Razin et al., 1998); it has 706 genes in 816,394 bp and 38 metabolic pathways (Karp, 1999).

In a cell or microorganism, the metabolic processes are integrated through a complex network of cellular constituents and reactions (Hartwell et al., 1999).


Recently, empirical studies on the structure of metabolic networks have reported serious deviations from a random structure, showing that these systems are described by scale-free networks (Barabási & Albert, 1999), for which P(k) follows a power law, P(k) ~ k^(-γ), where γ is an exponent close to 2. Their results have shown that the large-scale structural organization of a metabolic network is indeed very similar to that of robust and error-tolerant networks (Albert et al., 2000). The uniform network topology observed in all 43 studied organisms indicates that, irrespective of their individual building blocks or species-specific reaction pathways, the large-scale structure of metabolic networks may be identical in all living organisms, in which the same highly connected substrates may provide the connections between modules responsible for distinct metabolic functions (Hartwell et al., 1999). In proteome networks, highly connected proteins with a central role in the network's architecture are three times more likely to be essential than proteins with only a small number of links to other proteins (Jeong et al., 2001).

Barabási's group has proposed that metabolic networks can be described by a network where the nodes are cellular compounds and the links are the reactions that produce them. The analysis of the network suggests that it is possible to determine which compounds are the most important for an organism. Despite the relevance of the work, it does not carry much useful biological information, since the compounds cannot be used as, for example, drug targets. Since our goal is to determine the network of interactions of proteins, we decided to investigate the network of enzymes. This choice is justified by the fact that enzymes are coded in the DNA and have a well-established function. Furthermore, we believe that a metric based on the network of metabolic interactions can help to determine which enzymes are more relevant for the metabolism. Here we investigate the common hypothesis that highly connected enzymes are important for the metabolism and so are good drug targets. We also investigate the conservation of these enzymes.

In this work we propose and study a new approach to networks of enzymes; specifically, we focus on the interactions among enzymes in the metabolic network of M. pneumoniae. We have analysed the data obtained from the KEGG and ERGO databases using Perl algorithms.

2. Methodology

To generate the enzyme interaction network, we have developed a set of Perl algorithms that use two types of data files from the KEGG database (last updated: April 2002) about Mycoplasma pneumoniae: the reaction list with the main substrates and products, and the list of enzyme functions identified within each metabolic pathway. Our software initially extracts the Enzyme Commission (EC) numbers and the reactions that occur in the Mycoplasma. A manual analysis was necessary to check the subnetworks. Some inconsistencies were found, and these were resolved by comparison with the information in the ERGO database (http://www.integratedgenomics.com). An additional program that substitutes synonym elements or corrects the direction of some reactions standardized the processed data. This information was the input to a program that creates the general network, where the nodes are enzyme functions represented by EC numbers. In this network, two nodes are connected if a product of the reaction catalysed by the first enzyme is used as a substrate in the reaction catalysed by the other enzyme. Bidirectional links represent reversible reactions. The network was proposed to represent the functional dependencies among all the enzymes in the metabolism, since a given node may not work if a node linked to it is not present or is removed.

We use the Pajek software (a tool for large network analysis) to create the visualization of the network from the pairs of interactions identified by the main program. Different colors were assigned to distinct metabolic classes according to the KEGG classification in the general network. Multicolored nodes mean that they are included in more than one metabolic map. After the network is created, information like the number of connections can be obtained, and searches for specific enzymes can be done automatically.

The analysis of the ORFs (open reading frames) coding for the enzyme functions is done by investigation of amino acid sequence conservation. We use ERGO's system of coding ORFs. The screening is done with PSI-BLAST, from the National Center for Biotechnology Information, through an external tool provided by ERGO. The degree of homology is described by the parameter e, and the results obtained are sorted into three categories of conservation scores: high (e < 10^-50), medium (10^-50 < e < 10^-20) and low (e > 10^-20). This analysis was performed for Mycoplasmas, Prokaryotes and Eukaryotes, from the highest connectivity down to connectivity 10. For lower connectivities we chose two random ECs for the same analysis. The genes analysed belong to the list of dispensable genes from Hutchison III et al. (1999). We have also investigated the conservation of the enzymatic functions of M. pneumoniae in Prokaryotes and Eukaryotes using ERGO.
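The construction rule - link enzyme A to enzyme B whenever a product of A's reaction is a substrate of B's reaction - can be stated compactly in Perl. In the sketch below the reaction list is a hypothetical pre-parsed structure (two real glycolytic reactions are used as sample data); in the actual work this input comes from the KEGG data files.

use strict;
use warnings;

# Hypothetical pre-parsed reaction list: EC number plus main substrates
# and products (KEGG compound identifiers).
my @reactions = (
    { ec => '2.7.1.69', substrates => ['C00031'], products => ['C00668'] },
    { ec => '5.3.1.9',  substrates => ['C00668'], products => ['C05345'] },
);

# Link two EC nodes when a product of the first reaction is used as a
# substrate in the reaction catalysed by the other enzyme.
my %link;
for my $r1 (@reactions) {
    my %is_product = map { $_ => 1 } @{ $r1->{products} };
    for my $r2 (@reactions) {
        next if $r1->{ec} eq $r2->{ec};
        $link{ $r1->{ec} }{ $r2->{ec} } = 1
            if grep { $is_product{$_} } @{ $r2->{substrates} };
    }
}

# Connectivity of each enzyme function (in-links plus out-links)
my %degree;
for my $a (keys %link) {
    for my $b (keys %{ $link{$a} }) { $degree{$a}++; $degree{$b}++ }
}
printf "%-10s %d\n", $_, $degree{$_}
    for sort { $degree{$b} <=> $degree{$a} } keys %degree;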

3. Results

The metabolic network we have constructed is shown in Figure 1. The enzymes with the highest connectivity are dark circled. Connectivity, conservation and essentiality are presented in Table 1.

Table 1. Relationship between connectivity, conservation and essentiality for the enzyme functions analysed.


+++ = high conservation; ++ = medium conservation; + = low conservation; - = without significant homology; s.u. = enzyme subunit. The dispensable ORFs are highlighted.


Fig. 1. Representation of general metabolic network of Mycoplasma pneumoniae. Bidirectional links represent reversible reactions. Different colors (gray scales) were used to define distinct metabolic classes. Multicolored nodes imply that they are included in more than one metabolic map. The enzymes with higher connectivity are dark circled.

4. Discussion

In this work we have proposed a network to study the mutual influences among enzymes in the metabolic network of M. pneumoniae. We find that quite a few enzymes have high connectivity. From the investigation of the correlation between connectivity and essentiality presented in Table 1, we see that, at least for Mycoplasmas, the majority of the enzymes studied seem to be essential (not dispensable), and thus we do not find support for the assumption that essentiality correlates with connectivity. However, this investigation will be extended to other organisms to confirm the generality of this finding, since Mycoplasmas have minimal genomes and thus possibly almost all genes are indispensable.

The degree of homology found for amino acid sequences in the different classes (Mycoplasmas, Prokaryotes and Eukaryotes) helps to identify which of these highly connected functions in the Mycoplasma metabolism are the best targets for drugs. According to this criterion, good targets would be those ORFs with high conservation in Mycoplasmas and low conservation in the host as, for example, ORFs 515, 93 and 171 in Table 1. Another possible target, even if not highly connected, would be the function 2.7.1.69, since it occurs only in prokaryotes. Furthermore, this function is essential for the entrance of carbohydrates into the cells. A possible drug could then be designed to be harmful to a conserved domain in its isoenzymes; otherwise it would be harmless due to functional redundancy. It is also interesting that some ECs are present in all organisms (Prokaryotes and Eukaryotes) while the respective amino acid sequence in M. pneumoniae presents very low or no homology to that in other organisms (e.g. 2.4.2.1). Drug design for this specific sequence or region in the structure of these enzymes can be harmful only for the parasite, without affecting the function in the host. The degree of connectivity and conservation do not present the expected strong correlation, which should be further investigated by extending the analysis to all ECs. A possible reason for that is that we have analysed ORFs whose functions were determined by homology, so there is high conservation for the majority of the sequences investigated.

The color distribution of nodes shows the connections between different metabolic classes. If the enzymatic connections are correctly connected in a metabolic map, they must be connected in the network. This is not the case for some nodes in the general network. There are two reasons to explain that: (1) the colors in the network represent different metabolic classes belonging to different metabolic maps that cannot be connected directly, or (2) there are missing ECs or incorrectly determined functions in the metabolic map of M. pneumoniae. Incorrect annotations such as singletons or holes have been reported by Karp et al. (1999) as very frequent in genome analysis. Our network can help in the identification of these problems. For instance, the network suggests further analysis and a possible revalidation of the 1.1.1.2 function in the class of Biodegradation of Xenobiotics. The missing connections shown in Fig. 1 for the enzymes of the metabolism of amino acids (see EC 6.X.X.X) are explained in the literature by the loss of the genes involved in amino acid biosynthesis by Mycoplasmas and the need for the supply of these elements from the host (Razin et al., 1998). In summary, the network we have proposed is a useful tool to analyse enzyme functions and interactions and can help to identify incorrect annotations.

References
1. Gavin, A-C., Bösche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz, J., Rick, J. M., Michon, A-M., Cruciat, C-M., Remor, M., Höfert, C., Schelder, M., Brajenovic, M., Ruffner, H., Merino, A., Klein, K., Hudak, M., Dickson, D., Rudi, T., Gnau, V., Bauch, A., Bastuck, S., Huhse, B., Leutwein, C., Heurtier, M-A., Copley, R. R., Edelmann, A., Querfurth, E., Rybin, V., Drewes, G., Raida, M., Bouwmeester, T., Bork, P., Seraphin, B., Kuster, B., Neubauer, G., Superti-Furga, G.: Nature v. 415 (2002) 141-147
2. Hutchison III, C. A., Peterson, S. N., Gill, S. R., Cline, R. T., White, O., Fraser, C. M., Smith, H. O., Venter, J. C.: Global Transposon Mutagenesis and a Minimal Mycoplasma Genome. Science v. 286 (1999) 2165-2169
3. Jeong, H., Mason, S. P., Barabási, A.-L.: Lethality and centrality in protein networks. Nature v. 411 (2001) 41-42
4. Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N., Barabási, A.-L.: The large-scale organization of metabolic networks. Nature v. 407 (2000) 651-654
5. Kanehisa, M., Goto, S.: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research v. 28 (2000) 27-30
6. Karp, P. D., Krummenacker, M., Paley, S., Wagg, J.: Integrated pathway-genome databases and their role in drug discovery. Trends in Biotechnology v. 17 (1999) 275-281
7. Overbeek, R., Larsen, N., Pusch, G. D., D'Souza, M., Selkov Jr, E., Kyrpides, N., Fonstein, M., Maltsev, N., Selkov, E.: WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Research v. 28 (2000) 123-125
8. Razin, S., Yogev, D., Naot, Y.: Molecular Biology and Pathogenicity of Mycoplasmas. Microbiology and Molecular Biology Reviews v. 62 (1998) 1094-1156


A semi-automatic methodology for localization of short mitochondrial genes in long sequences


Rafael Santos¹, José Humberto Machado Tambor¹, Luciana Campos Paulino¹, and Ana L. C. Bazzan²

¹ Laboratório de Genética Molecular e Genomas
Instituto de Pesquisa e Desenvolvimento, Universidade do Vale do Paraíba
Av. Shishima Hifumi 2911, Urbanova, São José dos Campos, São Paulo, Brazil 12244-000
{rafael,jtambor,lpaulino}@univap.br
² Instituto de Informática, UFRGS
Caixa Postal 15064, 91501-970 Porto Alegre, Brazil
bazzan@inf.ufrgs.br

Abstract. Identification of short genes in long sequences using similarity measures (e-values and scores in BLAST queries) can be difficult in mitochondrial genomes, since the similarity results of some genes can be shadowed by neighbor matches with higher similarity values. The same could happen for genes with relatively low similarity but which can be considered of interest in a particular study. In order to locate and identify those genes, a manual analysis of the similarity search results must be done, which can be time-consuming and error-prone. In this report we present a methodology which aids researchers in the location of those genes by semi-automatically masking subsequences corresponding to genes that were already identified and limiting subsequent searches to the regions that did not present any result in previous steps. A tool that implements this methodology was created and used in some database searches using a sequence obtained from a mitochondrial genome. We expected that analysis using this tool would be easier, if not faster, than the manual analysis. Some results of the use of the tool are presented and compared with results obtained by manual similarity searching of BLAST results. As expected, the proposed tool didn't present new results (i.e. different from the ones found in the manual analysis), since both rely on the same search mechanism, input and parameters, but the results were clearer in the sense of not being cluttered with similar results, and the shorter genes could be located more easily in the final similarity report. Some comments on the classification of this tool as a software agent are also given. Suggestions for improvements of the methodology and tool are also presented.


1 Introduction

The mitochondrial genome is usually composed of adjacent genes with few base pairs separating the genes from each other. A good example is the mitochondrial genome of the yeast Saccharomyces cerevisiae [2]. Our laboratory is studying the mitochondrial genome of the fungus Paracoccidioides brasiliensis, whose size was estimated at around 70 kb using different restriction enzymes. Presently we are on the final steps of sequencing, joining the contigs that were generated. When one searches for similar genes using the BLAST (Basic Local Alignment Search Tool [1]) tool against the NCBI (National Center for Biotechnology Information) database, the genes which are more conserved will have higher similarity, and consequently better e-values and higher scores. When this happens, neighbor genes with lower similarity (or genes with relatively few base pairs) do not appear in the comparison's results (due to a user-chosen limitation on the number of hits to be displayed), or appear in positions where their relevancy cannot be adequately estimated. Figure 1 shows the scheme of the Saccharomyces cerevisiae mitochondrial genome, illustrating this problem: when the complete genome is submitted to the NCBI database for comparison, the biggest regions will possibly be highly similar to sequences in the database, appearing at the beginning of the hit list. On the other hand, the smaller regions would not appear in the hit list, or would appear in a position so low (since the hits are ordered by e-values or scores) as to be considered not relevant.

Fig. 1. Scheme of the Saccharomyces cerevisiae mitochondrial genome (from http://www.ncbi.nlm.nih.gov:80/cgi-bin/Entrez/framik?db=genome&gi=105).

One obvious solution for this problem would be the manual analysis of the search result, considering as interesting hits (in the sense of possible similarity to a sequence of interest) the ones with good e-values and scores but little overlap with already considered hits, and the ones with acceptable e-values and scores in regions where nothing was previously considered as being interesting. In order for this to work, the query sent to NCBI should be formulated so as to allow a large number of resulting hits and/or consider even hits with marginal e-values and scores, which can easily be done either via a WWW form or via a command-line interface to the NCBI BLAST server. The main drawback of this approach would be the amount of work the user would have to locate the non-overlapping hits in the search result.

This report presents a methodology that may help researchers who do this kind of analysis, by semi-automatically selecting a subsequence that corresponds to a good hit in the database search and eliminating this subsequence from future searches, thus possibly eliminating overlapping hits and increasing the possibility of locating shorter genes that could be shadowed by large genes with higher similarity. A tool that implements this methodology is also described. This tool could be considered agent-like in a broader sense: it does a task on behalf of the user, taking some simple decisions to achieve the final results. The next sections of this report present some details on the methodology and the tool that implements it, some examples of its utilization, conclusions, and some considerations on the methodology which could lead to improvements.

2 A task-based approach for BLAST searches

The proposed methodology considers that a BLAST search with a mitochondrial genome sequence as input gives as results several hits with relatively equal e-values and scores, which are somehow related to each other and therefore redundant, while some shorter genes on the genome would appear in low positions on the hit list and therefore could possibly be ignored in spite of being important. Ideally, to avoid this problem, one could subdivide the sequence into subsequences where 1) a gene or known region could be present or not and 2) there is little or no overlap with different genes of the same genome, ensuring that each subsequence would match a sequence in the NCBI database. Obviously this approach would require knowledge about the sequences that is not available before searching the database. In order to solve this problem, we devised a simple algorithm that, when searching the NCBI database, uses only a subsequence of the whole sequence of interest at a time, discarding regions of the whole sequence that were already matched in previous searches.

Before presenting the algorithm, let's define some terms that will be used throughout it. A search task is a group of input sequence (with some ancillary information), execution script (that performs the search in the NCBI database using a BLAST client) and output results that is self-contained, i.e., it can be executed independently of other existing search tasks. A task pool is a structure that is able to hold zero or more search tasks. The task pool can be implemented as a stack or queue, although it is possible to add or extract more than one search task from it (for example, if multiple searches are to be executed in parallel).

The algorithm's steps are as follows:

1. In the first step of the algorithm, create the (empty) task pool and the first search task. The first search task should use, as input, the whole sequence of interest S. The first search task should be put on the task pool.
2. Get a search task from the task pool.
3. Obtain a BLAST search result between the subsequence s in the search task and the NCBI databases, using the search task's script.
4. Store the result obtained in step 3. If no hits were found, no further processing should be done. If there were hits, parse the results in order to check which region of the subsequence s corresponds to the best hit b on the resulting hit list. Extract (or mask) from s the region corresponding to the best hit b, creating as a result a number of new subsequences that can be zero (if the length of b is equal to or larger than the length of s), one (if b is aligned left or right with s, leaving a remaining subsequence on the right or left side of s, respectively) or two (if b is in the middle of s, with remaining subsequences on its left and right). Figure 2 shows graphically the possible subsequences that could be created from certain cases of s and b.
5. If there were one or two subsequences in the previous step, create new search tasks with those subsequences and put them on the task pool.
6. Repeat from step 2 until the task pool is empty.

Fig. 2. Possible configurations for matching a sequence b with a sequence s. On the top two lines, b is equal to or larger than s, leaving no subsequences for further search. On the next two lines, b covers the left part of s, leaving a right subsequence for further search. On the next two lines, b covers the right part of s, leaving a left subsequence for further search. On the last line, b covers a part of s, leaving subsequences on the left and right sides for further searches.

The algorithm will stop naturally, since there will be subsequences that do not create new search tasks because their results present no similarity hits: it is common to pass some thresholds for e-values and scores to the BLAST client software, so it will not return hits below those thresholds. Nevertheless, it is possible to further narrow the creation of new search tasks by 1) considering a minimum length for the subsequences s and/or 2) avoiding the creation of subtasks after some recursion depth.
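A minimal Perl rendering of the task-pool loop follows. The BLAST call, best-hit extraction and result storage are reduced to stub functions (run_blast, best_hit_range, save_result), our placeholders for what the real tool does by shelling out to a BLAST client and parsing its report.

use strict;
use warnings;

# Stubs standing in for the BLAST client call and report parsing that
# the real tool performs; they are placeholders only.
sub run_blast      { my ($s, $e) = @_; return "report for $s-$e" }
sub best_hit_range { return (undef, undef) }  # (q.start, q.end) of best hit
sub save_result    { }

my $seq_len = 33936;                  # length of the sequence of interest
my @pool    = ([0, $seq_len - 1]);    # step 1: task for the whole sequence
my $min_len = 50;                     # optional minimum subsequence length

while (my $task = shift @pool) {      # steps 2 and 6
    my ($start, $end) = @$task;
    my $report = run_blast($start, $end);            # step 3
    my ($b_start, $b_end) = best_hit_range($report);
    next unless defined $b_start;     # no hits: this branch dies out

    save_result($start, $end, $report);              # step 4
    push @pool, [$start, $b_start - 1]               # step 5: left flank
        if $b_start - $start >= $min_len;
    push @pool, [$b_end + 1, $end]                   # step 5: right flank
        if $end - $b_end >= $min_len;
}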
Although the algorithm presented creates a list of search tasks (in contrast to the manual approach, which would require a single search against the NCBI database), those search tasks require just a few hits (or even just the best hit), and the results, when combined, are easier to understand. A tool that implements the algorithm was created using the Perl language, which presented the following advantages over other considered options: Perl is tightly integrated with the Linux operating system we used, making possible the creation of the search tasks as new scripts that were put on an execution queue maintained by the system's tools; there are tools for running BLAST clients and parsing their results written as modules in Perl (BioPerl [3]); and Perl has operators for other common tasks, especially for pattern matching and string processing, that proved useful for the implementation of the algorithm.

2.1 Some results obtained with the approach

To exemplify the presented methodology, a contig with 33936 base pairs (one of the largest found in the assemblies done in our laboratory) was used as input for the algorithm, and processed with the tool that implements it. For this particular analysis, we used the BLASTX program with an expectancy value of 0.00001, showing only hits whose e-value was smaller than 10^-5. The search was done against the NCBI database plus non-redundant proteins. A visual representation³ of the results can be seen in figure 3.
Fig. 3. Visual representation of the results of the methodology and tool for one contig. The hits found at each level of execution of the search tasks were:

Level 1: cytochrome oxidase II [Trichophyton rubrum], score 966, e-value 1e-102, pos 22054-22752, size 699.
Level 2: COI intron 9 protein, Podospora anserina mitochondrion, score 725, e-value 4e-74, pos 16043-16963, size 921; gene ND5 intron 1 protein, Neurospora crassa mitochondrion, score 284, e-value 1 (PARSE ERROR), pos 24419-24694, size 276.
Level 3: rRNA intron protein, Emericella nidulans mitochondrion, score 527, e-value 3e-51, pos 5518-6672, size 1155; hypothetical nox3 protein, Emericella nidulans mitochondrion, score 674, e-value 8e-69, pos 17096-18067, size 972.
Level 4: COI i1 protein [Agrocybe aegerita], score 517, e-value 2e-50, pos 14653-15495, size 843; cytochrome-c oxidase I (Trichophyton rubrum) mitochondrion, score 672, e-value 1e-68, pos 20171-20647, size 477.
Level 5: cytochrome oxidase subunit I [Aspergillus tubingensis], score 307, e-value 1e-27, pos 15710-16042, size 333; cytochrome-c oxidase I, Emericella nidulans mitochondrion, score 501, e-value 3e-49, pos 18184-18516, size 333; ATPase proteolipid [Neurospora crassa], score 154, e-value 4e-09, pos 21523-21744, size 222; putative reverse transcriptase-maturase-transposase [Pseudomonas putida], score 299, e-value 4e-25, pos 11121-12020, size 900.
Level 6: cytochrome-c oxidase I, Neurospora crassa mitochondrion, score 296, e-value 5e-25, pos 8763-8984, size 222; cytochrome-c oxidase I, Trichophyton rubrum mitochondrion, score 140, e-value 3e-07, pos 14377-14487, size 111; Orf294 COI intron 15 protein, Podospora anserina mitochondrion, score 283, e-value 5e-24, pos 19040-19744, size 705.

³ Created with a graphics editor; a visualization tool for the task pool is under construction.

Figure 3 shows six levels of execution of the search tasks. The level 1 corresponds to a single search tasks using the whole contig as input, and the search task found the best hit on the database in the region between base pairs 22054 and 22752. This region was eliminated (not considered) for the next steps, and from the search task on the level 1 two new search tasks were created (one with a subsequence from base pair 0 to base pair 22053 and other from base pair 22753 to base pair 33935). Those subsequences
3

Created with a graphics editor a visualization tool for the task pool is under construction.


Those subsequences were created as search tasks and included in the task pool, and the process was repeated. Some features of the algorithm can be seen in the visual representation of its results. From level 3 on, some branches produced no hits above the specified thresholds, and no further search tasks were created from those branches; the same happened at levels 4 and 6, with short subsequences that did not generate any hits in the database search. One interesting event happened at level 5: for some short subsequences, the matched hit was larger than the subsequence itself (meaning that only part of the subsequence was enough to match a hit in the database); those hits would be ignored, since the corresponding regions had already been matched in previous steps. Figure 3 includes only search tasks up to depth level six; for this particular contig the algorithm created search tasks up to level eight, but those were removed from the figure so it would fit on one page. Since the search tasks were run over a period of a few days (due to server problems and Internet black-outs in our laboratory), we could not gather reliable information on the time required to run the whole task pool and obtain the final results; still, we consider the ability to execute the search part by part over a period of time a desirable feature of the methodology. For a simple comparison with our methodology, we ran a single BLASTX query considering a maximum of 1000 hits; a partial list of the top 128 hits is shown in Figure 4. Although a complete comparison between our approach and the manual analysis of the results of a single BLASTX query would take some time, our methodology's results were considered satisfactory and in accordance with what was expected. One can see in Figure 3 that no hits were found in the region from around 25 kbp on; that region expresses tRNAs, so no hits were found in the databases we used for the searches (using a different tool, 15 tRNAs were found between base pairs 25357 and 32034).

Conclusions

The methodology presented in this report was able to reduce the redundant information obtained with searches against NCBI's database using the BLASTX tool, and can be used to help locate short genes (or genes whose similarity values may be smaller than those of neighboring genes) in long sequences of mitochondrial genomes. The methodology could also be used as part of a wider, more comprehensive mitochondrial genome annotation system. The software tool that implements the methodology can be considered an agent if we use a less strict typology (e.g. Nwana's [5]), even if it does not present some characteristics that some authors consider important or essential for agent characterization.

1. sp|Q01556|COX2_TRIRU CYTOCHROME C OXIDASE POLYPEPTIDE II >gi|578... | 376 | e-102
2. pir||S26949 cytochrome-c oxidase (EC 1.9.3.1) chain II - dermato... | 376 | e-102
3. pir||S05629 cytochrome-c oxidase (EC 1.9.3.1) chain II - Emerice... | 360 | 4e-97
4. sp|P13588|COX2_EMENI CYTOCHROME C OXIDASE POLYPEPTIDE II | 360 | 4e-97
5. ref|NP_074950.1|| (NC_001329) cytochrome c oxidase subunit 2 [Po... | 333 | 7e-89
6. sp|P00411|COX2_NEUCR Cytochrome c oxidase polypeptide II >gi|662... | 329 | 1e-87
7. ref|NP_074933.1|| (NC_001329) orf313 [Podospora anserina] >gi|48... | 283 | 7e-74
...
115. sp|P02382|RMS5_EMENI MITOCHONDRIAL RIBOSOMAL PROTEIN S5 >gi|7105... | 207 | 6e-51
116. gb|AAF15327.1|AF181939_1 (AF181939) cytochrome oxidase subunit 2... | 207 | 6e-51
117. emb|CAA06892.1| (AJ006146) cytochrome oxidase subunit II [Acorus... | 206 | 2e-50
118. gb|AAF43636.1|AF207678_1 (AF207678) cytochrome oxidase subunit 2... | 205 | 2e-50
119. sp|P27168|COX2_DAUCA Cytochrome c oxidase polypeptide II >gi|662... | 205 | 2e-50
120. ref|NP_612824.1|| (NC_003522) cytochrome c oxidase subunit II [A... | 205 | 3e-50
121. ref|NP_037598.1|| (NC_002387) cytochrome c oxidase subunit 2 [Ph... | 205 | 3e-50
122. gb|AAF15330.1|AF181946_1 (AF181946) cytochrome oxidase subunit 2... | 204 | 7e-50
123. gb|AAB92366.1| (AF036383) cytochrome c oxidase subunit II [Brass... | 204 | 7e-50
124. emb|CAB63469.1| (AJ235922) cytochrome c oxidase subunit II [Kluy... | 204 | 7e-50
125. emb|CAB63476.1| (AJ235916) cytochrome c oxidase subunit II [Kluy... | 203 | 9e-50
126. gb|AAC72264.1| (AF010257) COI i1 protein [Agrocybe aegerita] | 203 | 9e-50
127. sp|Q02212|COX2_PHYME Cytochrome c oxidase polypeptide II >gi|754... | 203 | 9e-50
128. gb|AAF15328.1|AF181940_2 (AF181940) cytochrome oxidase subunit 2... | 203 | 1e-49
...

Fig. 4. Partial list of the top 128 hits obtained with BLASTX (description, score, e-value).

One of the problems that genome annotation researchers face is that the best similarity (e-values and scores) obtained with BLAST searches against NCBI and other databases is often not enough to identify a sequence. Often, instead of considering a sequence similar to the best hit in the database search results, a researcher will choose another hit from that list because its description fits better what was expected (e.g. one hit may have a smaller similarity value than another, but will be considered better because it is related to an organism closer to the one being studied). Considering that problem, and the fact that our approach always takes the best hit from the hit list, we consider our methodology semi-automatic: it does not annotate the mitochondrial genome by itself and could, at some points, benefit from user interaction. Nevertheless, in the test runs we did with the tool, one of the authors concluded that he would make the same decisions as the tool, and that the decisions made differently would have little impact on the final result.

Directions of future work

Some improvements being considered for the methodology and tool are:
- The algorithm could be implemented in parallel, since more than one search could be performed at a time, and only search tasks that are ready to be executed are in the

task pool. We are weighing the benefits of parallelization: since the longest subtask in the algorithm is the NCBI search (which is not done on a local server), some tests would be necessary to verify whether parallelization would result in a gain in execution time.
- We could partially solve the problem of selecting the best search result hit, presented in section 3, by considering not only the single best hit but choosing, from each search task result, the n best hits and allowing the user to choose one of them. The main drawback of this possibility is that each search task result may spawn between zero and 2n new search tasks, which in turn could generate up to (2n)^2 further search tasks, and so on. Even if the number of search tasks would eventually cease to grow, it could be very large at the beginning of the analysis. We are considering the benefits of this approach, especially together with the ability to run searches in parallel.
- The integration of the methodology with a tool like Blast Search Updater [4] could make it easier for researchers to keep track of changes in the database searches and results.
- The use of other search mechanisms (such as the one used to locate tRNAs) could also be of value.
- During its execution, the algorithm maintains a pool of search tasks, and each of those can be considered a simple agent. We could try to implement the methodology using a multi-agent framework, but with agents being neither competitive nor collaborative.

Acknowledgments

Two of the authors are supported by FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo) grants: Rafael Santos is supported by grant number 00/08705-3 and José Humberto Machado Tambor by grant number 01/09495-5. The authors would like to thank professors Francisco Nóbrega and Marina Nóbrega, of the Laboratório de Genética Molecular e Genomas of Universidade do Vale do Paraíba, for advice and assistance.

References
1. S. Altschul, T. Madden, A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25:3389-3402, 1997.
2. K. Berger and M. Yaffe. Mitochondrial DNA inheritance in Saccharomyces cerevisiae. Trends Microbiol, 8(11):508-513, Nov 2000.
3. BioPerl home page. URL http://www.bioperl.org, last verified in April 2002.
4. M. Boone and C. Upton. Blast Search Updater: a notification system for new database matches. Bioinformatics, 16(11):1054-1055, 2000.
5. H. S. Nwana. Software agents: an overview. Knowledge Engineering Review, 11(3):205-244, October/November 1996.


A comparison between symbolic and non-symbolic machine learning techniques in automated annotation of the Keywords field of SWISS-PROT
Luciana F. Schroeder1, Ana L. C. Bazzan1, João Valiati1, Paulo M. Engel1, and Sérgio Ceroni2
1 Instituto de Informática, UFRGS, Caixa Postal 15064, 91501-970 Porto Alegre, Brazil, {luciana,bazzan,jvaliati,engel}@inf.ufrgs.br
2 Centro de Biotecnologia and Fac. de Veterinária, Univ. Fed. do Rio Grande do Sul, ceroni@dna.cbiot.ufrgs.br

Abstract. The aim of this work is to carry out a comparison between symbolic and non-symbolic approaches to the task of automated annotation of the field called Keywords in SWISS-PROT. The non-symbolic technique employed was a feedforward artificial neural network (ANN), while the symbolic one was CN2. Using an ANN trained with the well-known Backpropagation algorithm on previously annotated data from public databases like SWISS-PROT, a classifier was built that maps attributes of a specific protein to keywords encountered in the SWISS-PROT and TrEMBL databases. The symbolic counterpart, CN2, builds a specific classifier for each keyword. The non-symbolic classifier is much more compact than its symbolic counterpart; however, the symbolic one had a slightly better performance and is also more readable to the end user. The performance of the obtained classifiers was evaluated using data taken from SWISS-PROT (for training) and TrEMBL (for validation).

1 Introduction
With the increase in submissions of sequences to public databases, there is a clear need for tools to generate automatic annotation. Following previous work on automated annotation, we employ symbolic and non-symbolic machine learning techniques to generate automated annotation of the field Keywords, an important one in the SWISS-PROT database. The aim of this procedure is threefold: to complete the annotation of keywords, which is far from adequate; to acquire experience in order to be able to propose automatic annotation of other (more complex) fields of the SWISS-PROT database; and to compare symbolic and non-symbolic techniques in this domain. To test our approach, we employ data related to organisms of the family Mycoplasmataceae, because one of these organisms is the object of the PIGS project [10]. This organism, Mycoplasma hyopneumoniae, is a bacterium which colonizes the respiratory tract of swine and is the primary agent of enzootic pig pneumonia. It causes

considerable economic losses through retarded growth, poor food conversion, and increased susceptibility of pigs to infection by other organisms. The disease is one of the most relevant occurring in pigs in southern Brazil. One of the expected results of the PIGS project is to fully sequence and annotate the genome of that microorganism. Then, in a later phase, important proteins will be expressed, aiming at the development of diagnostic tests and vaccine production. This paper is organized as follows: in the next section we briefly refer to related work, as well as to our previous work; then, Section 3 describes the data employed and the symbolic and non-symbolic methods; in Section 4 we compare the results achieved.

2 Previous Work
There has been an explosion of data, information, and computational tools coming out of the diverse genome projects. NCBI and EBI alone report huge databases, mostly not clearly structured. Here, technologies originally developed with other purposes in mind can help, because the motivation behind their usage is the same: the necessary data is distributed among several sources, it is dynamic, its content is heterogeneous, and most of the work can be done in parallel. In [2, 4], prototypes are described that aim at facilitating the process of annotation. Both works are based on information gathering: search, filtering, integration, analysis, and presentation of the data to the user. Machine learning techniques have been widely used in bioinformatics (e.g. [6, 5]). Automatic annotation and machine learning are combined in [7]. The latter work describes a machine learning based approach to generate rules from already annotated keywords of the SWISS-PROT database; such rules can then be applied to yet unannotated protein sequences. Since this work actually motivated ours, we provide here just a brief introduction; details can be found in [7]. Basically, the authors developed a method to automate the annotation of those keywords in SWISS-PROT, based on the algorithm called C4.5 [8]. This algorithm works on training data (in this case, previously annotated keywords regarding proteins). Such data is, in this case, mainly taxonomy entries, INTERPRO classification, and PFAM and PROSITE patterns. Given these data (called attributes), C4.5 derives a classification for a target attribute (in this case, the keyword). Since dealing with all the data in SWISS-PROT at once would be prohibitive, it was divided into protein groups according to the INTERPRO classification. Then each group is submitted to an implementation of C4.5 contained in the software package Weka (http://www.cs.waikato.ac.nz/ml/). Rules are generated and a confidence factor for each rule is calculated. The quality of the rules is evaluated by calculating a confidence factor based on the number of false and true positives, by performing a cross-validation, and by testing the rate of error in predicting keyword annotation over the TrEMBL database. The resulting framework (called Spearmint) can be accessed at http://golgi.ebi.ac.uk/Spearmint. The Keywords field in the SWISS-PROT database is a very important one, used mainly when a researcher wants to compare an unknown sequence s/he is working with to the sequences already deposited in SWISS-PROT. Unfortunately, regarding the


family of Mycoplasmataceae, a high number of proteins in SWISS-PROT are classified as hypothetical protein (around 50% of them, according to data obtained in February 2002). Besides, the proteins in TrEMBL, which are also potential targets for comparison, are poorly annotated regarding the Keywords field (in the data we collected, 378 out of 1894 had no keyword at all, while 896 had no attributes). Therefore, the good results achieved in [7] motivated us to conduct a similar study aiming at automated keyword annotation for the Mycoplasmataceae universe of proteins. This way, we can extend the annotation in both TrEMBL and SWISS-PROT for internal use in the PIGS project. These results are reported in [1]; here we give just a brief introduction for the sake of clarity. We began by reproducing the approach described in [7]. Soon we realized that, since we had reduced our universe to the proteins related to Mycoplasmataceae, we could do better by modifying their method. Indeed, that method is based on a partition of the SWISS-PROT proteins by INTERPRO Accession Number (henceforth IPR Acc). Thus, rules (recommending or not the annotation of a keyword) are generated for each IPR Acc (when applicable) and, after that, ranked by a confidence factor (CF). This may be confusing for the user: for instance, when two or more rules have close CFs but recommend contrary annotations (i.e. one recommends the annotation while another does not), how should the user decide? The approach we use in [1] is similar to that reported in [7], but we consider all applicable IPR Accs as attributes at once. Of course, taxonomy is not an attribute in our case, since we are dealing with a single family, namely the Mycoplasmataceae.

3 Methods
3.1 Data

The data collection was done in February 2002 by means of the SRS web site (version 6, at www.srs.ebi.ac.uk). Basically, we performed a query on the SWISS-PROT database in which the Organism field included Mycoplasmataceae but the Keyword field did not include the word hypothetical or Complete proteome. This was done to eliminate hypothetical proteins from the training set. Also, we created a view for the SWISS-PROT database which associates it with IPRmatches (through personal communication with the SRS maintainers we found out that the association with INTERPRO was not working properly, so we used IPRmatches instead, which provides the required data as well). This view included, for SWISS-PROT, the AccNumber and keywords, and, for IPRmatches, the IPR AccNumber. The number of proteins related to the Mycoplasmataceae family was 722 (Feb. 2002), while there were around 393 IPR Accs. Around 84 keywords appeared in the data. The next step was the retrieval process, which is very easy in SRS. This generated a table (fields are delimited by ';') partially depicted in Figure 1.

SWISSPROT:FTSH_MYCGE;P47695;Cell division;ATP-binding;Transmembrane;Hydrolase;Metalloprotease;Zinc;Complete proteome;IPRMATCHES:P47695;IPR000642;IPR003593;IPR003959;IPR003960;PROSITE:AAA;PS00674;
SWISSPROT:FTSH_MYCPN;P75120;Cell division;ATP-binding;Transmembrane;Hydrolase;Metalloprotease;Zinc;Complete proteome;IPRMATCHES:P75120;IPR000642;IPR003593;IPR003959;IPR003960;PROSITE:AAA;PS00674;
SWISSPROT:AMPA_MYCGE;P47631 Q49371;Hydrolase;Aminopeptidase;Manganese;Complete proteome;IPRMATCHES:P47631;IPR000819;PROSITE:CYTOSOL_AP;PS00631;
SWISSPROT:AMPA_MYCPN;P75206;Hydrolase;Aminopeptidase;Manganese;Complete proteome;IPRMATCHES:P75206;IPR000819;PROSITE:CYTOSOL_AP;PS00631;
SWISSPROT:AMPA_MYCSA;P47707;Hydrolase;Aminopeptidase;Manganese;IPRMATCHES:P47707;IPR000819;PROSITE:CYTOSOL_AP;PS00631;
[...]

Fig. 1. Data Extracted from SRS
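To make the record layout concrete, the sketch below shows one way the semicolon-delimited SRS records of Figure 1 could be parsed into keyword sets and binary IPR attribute vectors of the kind both classifiers consume. This is an illustration only; any layout detail beyond what Figure 1 shows is an assumption.

```python
# Sketch: turn SRS records (as in Fig. 1) into (keyword set, IPR set) per protein.
# Assumes each record has a SWISSPROT part (id; accession; keywords) followed
# by an IPRMATCHES part (accession; IPR numbers), as suggested by Figure 1.
import re

def parse_record(record: str):
    iprs = set(re.findall(r"IPR\d{6}", record))          # IPR accession numbers
    keywords = set()
    m = re.search(r"SWISSPROT:[^;]+;[^;]+;(.*?)IPRMATCHES:", record, re.S)
    if m:
        keywords = {k.strip() for k in m.group(1).split(";") if k.strip()}
    return keywords, iprs

def to_vector(iprs, all_iprs):
    # binary presence/absence vector over the ~393 IPR attributes
    return [1 if ipr in iprs else 0 for ipr in sorted(all_iprs)]
```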

3.2 Symbolic Approach

This subsection describes the symbolic approach, which uses the data generated as explained before as input for the CN2 algorithm [3]. CN2 is a rule-induction algorithm developed by Peter Clark. It constructs simple, comprehensible production rules in domains where noise may be present. The rules produced by CN2 take the form if condition then class. The CN2 algorithm consists of two main procedures: a search algorithm performing a beam search for a good rule, and a control algorithm for repeatedly executing the search. During the search procedure, a rule is constructed by searching for a condition that covers a large number of examples of an arbitrary class C and few of other classes. Having found a good condition, the algorithm removes the examples it covers from the training set and adds the rule if condition then predict C to the end of the list. For the remaining set, a new rule is constructed, until no further complexes of sufficient quality are found. A typical input file is partially depicted in Figure 2. The first lines indicate how the attributes are mapped for the 722 proteins (in this case there are 393 such lines); the last of these is the target attribute (keyword). Then come the 722 data lines, each formed by the presence or absence of each IPR characteristic, separated by spaces. To save time, we generated the CN2 rules only for the keywords which appear in valid lines of the test data set (i.e. those from TrEMBL); a valid line has to have at least one keyword and at least one IPR attribute. A rule is depicted in Figure 3. It is possible to compare the structure of these rules with similar ones produced by the Spearmint tool at the web site given in Section 2. Once the rules were generated, we proceeded to evaluate their quality. Of course, we avoid performing the test on the data set used to generate the rules. The obvious candidate test data set is the TrEMBL database, which has a structure similar to SWISS-PROT's. The main difference is that TrEMBL has poorer annotation; however, the existing annotation of keywords is enough for evaluation purposes. The data was extracted from TrEMBL in the same way already explained for the extraction of data from the SWISS-PROT database (i.e. the query, view, and save procedures).

IPR000005: yes no;
IPR000023: yes no;
IPR000032: yes no;
IPR000037: yes no;
............
IPR004821: yes no;
class: Zinc-finger no;
@
no no no no no no no no no ... no yes ... no;
...

Fig. 2. Input to the CN2 Algorithm (class Zinc-finger)

Many proteins do not have either a keyword or an attribute; those lines were therefore deleted.

IF IPR001241 = no AND IPR002936 = yes THEN class = Zinc-finger [8 0]
IF IPR000191 = yes THEN class = Zinc-finger [3 0]

Fig. 3. Example of an Output from CN2 (class Zinc-finger)

For the remaining proteins, the following evaluation procedure was performed: if the protein is annotated with keyword K, then the rule for K (generated by CN2) is checked. For instance, take the rule in Figure 3: it says that if the protein has the INTERPRO classification IPR002936 and does not have IPR001241, then it should have the keyword Zinc-finger. This procedure was repeated for each protein in the validation set (948 proteins from the TrEMBL database). Figure 4 shows the accuracy per keyword obtained by each technique, CN2 and ANN. The accuracy estimate is calculated as TP (true positives, the number of examples correctly covered by the rule) plus TN (true negatives, the number of examples correctly discarded by the rule), divided by the total number of instances in the validation set. The CN2 algorithm correctly predicted around 99% of the given keywords, which is a very good result.
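This evaluation loop can be sketched as follows. The rule representation and helper names below are ours (CN2's native output format differs); the example rule transcribes the first Zinc-finger rule of Figure 3.

```python
# Sketch of the per-keyword evaluation: a rule is a list of (IPR, expected)
# pairs joined by AND; a protein is a (set_of_IPRs, set_of_keywords) pair.
def rule_fires(rule, iprs):
    return all((ipr in iprs) == expected for ipr, expected in rule)

def accuracy(rules_for_k, keyword, proteins):
    tp = tn = fp = fn = 0
    for iprs, keywords in proteins:
        predicted = any(rule_fires(r, iprs) for r in rules_for_k)
        actual = keyword in keywords
        if predicted and actual:
            tp += 1
        elif not predicted and not actual:
            tn += 1
        elif predicted:
            fp += 1
        else:
            fn += 1
    return (tp + tn) / len(proteins)   # the (TP + TN) / total estimate used above

# e.g. the first Zinc-finger rule of Fig. 3:
zinc_rule = [("IPR001241", False), ("IPR002936", True)]
```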

Keyword | CN2 Accuracy | ANN Accuracy
Aminopeptidase | 100.00% | 100.00%
ATP synthesis | 98.84% | 98.89%
ATP-binding | 95.90% | 94.46%
Cell division | 100.00% | 99.72%
CF(0) | 99.68% | 98.89%
CF(1) | 98.84% | 98.89%
Chaperone | 99.58% | 99.31%
Coiled coil | 100.00% | 100.00%
DNA recombination | 100.00% | 100.00%
DNA repair | 99.26% | 98.89%
DNA replication | 99.16% | 99.03%
DNA-binding | 99.26% | 98.89%
DNA-directed DNA polymerase | 99.68% | 99.86%
DNA-directed RNA polymerase | 99.58% | 98.75%
Elongation factor | 99.37% | 99.86%
Endonuclease | 99.58% | 98.75%
Excision nuclease | 99.79% | 99.58%
Exonuclease | 100.00% | 99.72%
FAD | 99.68% | 99.58%
Fatty acid biosynthesis | 99.89% | 100.00%
Flavoprotein | 99.26% | 99.31%
Gluconeogenesis | 99.89% | 100.00%
Glycerol metabolism | 99.89% | 100.00%
Glycolysis | 98.11% | 97.09%
Glycosidase | 99.89% | 100.00%
Glycosyltransferase | 99.58% | 99.31%
GTP-binding | 100.00% | 100.00%
Heat shock | 99.58% | 99.31%
Helicase | 99.79% | 100.00%
Hydrogen ion transport | 98.95% | 97.78%
Hydrolase | 98.74% | 99.17%
Initiation factor | 100.00% | 100.00%
Isomerase | 99.37% | 100.00%
Kinase | 98.63% | 97.23%
Ligase | 99.16% | 100.00%
Lipoprotein | 99.89% | 99.58%
Lyase | 99.68% | 100.00%
Magnesium | 98.53% | 98.06%
Manganese | 99.47% | 99.31%
Membrane | 98.53% | 98.20%
Metal-binding | 99.47% | 98.61%
Metalloprotease | 99.79% | 99.45%
Methyltransferase | 99.16% | 100.00%
Multifunctional enzyme | 99.89% | 99.72%
NAD | 99.16% | 99.31%
NADP | 99.68% | 98.06%
Nickel | 100.00% | 100.00%
Nuclease | 99.58% | 98.75%
Nucleotide biosynthesis | 99.68% | 99.03%
Nucleotidyltransferase | 99.37% | 100.00%
One-carbon metabolism | 99.68% | 99.45%
Oxidoreductase | 99.79% | 99.72%
Pentose shunt | 99.89% | 100.00%
Peptide transport | 99.16% | 98.89%
Phospholipid biosynthesis | 99.89% | 99.72%
Phosphorylation | 99.37% | 99.03%
Phosphotransferase system | 99.58% | 99.03%
Primosome | 100.00% | 100.00%
Protein biosynthesis | 97.90% | 96.95%
Protein transport | 99.58% | 99.72%
Pyridoxal phosphate | 99.89% | 99.72%
Redox-active center | 99.47% | 99.86%
Repeat | 99.58% | 99.31%
Ribosomal protein | 100.00% | 100.00%
RNA-binding | 99.89% | 99.58%
rRNA-binding | 99.79% | 99.45%
Schiff base | 99.89% | 99.45%
Signal | 99.89% | 99.58%
Signal recognition particle | 100.00% | 100.00%
SOS response | 99.79% | 99.58%
Sugar transport | 99.16% | 98.75%
Thiamine pyrophosphate | 99.68% | 99.31%
Topoisomerase | 99.16% | 98.89%
Transcription | 99.68% | 99.03%
Transcription regulation | 99.89% | 99.72%
Transferase | 98.21% | 99.58%
Translocation | 99.58% | 99.72%
Transmembrane | 98.53% | 98.06%
Transport | 99.37% | 99.03%
tRNA processing | 99.47% | 99.45%
Zinc | 98.53% | 96.95%
Zinc-finger | 99.79% | 99.72%
Accuracy over all classes | 99.45% | 99.23%

Fig. 4. Accuracy for Each Class and Over All Classes


3.3 Non-Symbolic Approach

The data used was gathered in the same way as for the symbolic tool. With the data sets defined, we proceeded to the neural network training stage. The artificial neural network (ANN) model chosen was the Multilayer Perceptron (MLP) with the Resilient Backpropagation (Rprop) training algorithm [9]. This choice was made because of the philosophy used in the learning process and its effective and efficient training. The basic network architecture was composed of 393 neurons in the input layer, 40 neurons in the hidden layer, and 84 neurons in the output layer. It was stipulated that a neural network would be considered trained when it reached either a mean square error of 10^-3, a maximum of 50 epochs, or a gradient threshold of 10^-6. The structure and parameters of the trained neural networks were stored to be used in the validation stage, where test samples were propagated and the outputs evaluated. In a first stage, the set of 722 samples was used for training; the neural network was considered trained when the mean square error target was reached (in 29 epochs). In the validation of the trained neural model, a set of 948 samples was used. In a second stage, we used a training set of 948 samples; the network was considered trained when the maximum number of epochs was reached, with the mean square error stabilizing at 2.6 x 10^-3. The trained ANN classifier is a single black box that maps each protein simultaneously onto the various keywords. In our experiment, 393 IPR Accs were used as possible attributes for identifying a protein and 84 keywords were used as the domain for annotation. The results are shown in Figure 4. In fact, we considered the standard data of a contingency table (acceptance precision, rejection precision, and overall accuracy) as efficiency metrics.
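For concreteness, here is a minimal sketch of the 393-40-84 MLP trained with Rprop under the stopping criteria stated above, written with PyTorch as a stand-in for the original implementation; the sigmoid activations and all unstated training details are assumptions.

```python
# Sketch: MLP 393-40-84 trained with Rprop until MSE <= 1e-3 or 50 epochs.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(393, 40), nn.Sigmoid(),   # hidden layer (activation assumed)
    nn.Linear(40, 84), nn.Sigmoid(),    # one output unit per keyword
)
optimizer = torch.optim.Rprop(model.parameters())
loss_fn = nn.MSELoss()

def train(X, Y, max_epochs=50, target_mse=1e-3):
    """X: (n_samples, 393) IPR vectors; Y: (n_samples, 84) keyword targets."""
    for epoch in range(max_epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(X), Y)
        loss.backward()
        optimizer.step()
        if loss.item() <= target_mse:   # stop criterion: mean square error
            break
    return loss.item()
```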

4 Comparison
The ANN is more compact and normally more efficient. Besides, it generates its classifier from a single file containing all the data. However, as one of our goals is to integrate these methods into an environment for annotation of ORFs and proteins, it is important that the end user be able to see and analyse the rules generated; the biggest disadvantage of ANNs in this regard is that the rules are not straightforward to the end user. On the other hand, the symbolic method was not able to cope with all entries at once, meaning that we had to generate the rules one by one, i.e. one for each keyword. This is an important bottleneck in the process. However, since the rules are supposed to be generated only once (in fact, given the ever-changing nature of the databases, the rule generation should be repeated periodically, but this can be done with low frequency), this is not a significant shortcoming.

5 Conclusion
The main objective of this work was to carry out a comparison between symbolic and non-symbolic approaches to the task of automated annotation of the Keywords field of SWISS-PROT.


The comparison focused, on the one hand, on the accuracy of the generated model in predicting the correct keyword for previously unknown data; on the other hand, the compactness of the model was considered an important element of comparison, since the symbolic approach requires the generation of a specific model for each keyword, while the non-symbolic approach generates just one model for the whole task. Our results show the trade-offs between the two approaches. As the symbolic approach models each keyword separately, it can learn each class better; on the other hand, this leads to a considerable number of models, because in practical applications the number of keywords can reach a hundred or more. The neural network approach produces a very compact model, consisting of just one classifier with multiple outputs; however, the neural network must consider all the data at once, which leads to a slightly worse performance than the symbolic approach.

References
1. A. Bazzan, S. Ceroni, P. Engel, A. Pitinga, L. Schroeder, and F. A. Souto. Automatic annotation of keywords for proteins related to Mycoplasmataceae using machine learning techniques. In European Conf. on Computational Biology (to appear), 2002.
2. K. Bryson, M. Luck, M. Joy, and D. Jones. Applying agents to bioinformatics in GeneWeaver. In Proc. of the Fourth Int. Workshop on Collaborative Information Agents, Lecture Notes in Computer Science. Springer-Verlag, 2000.
3. P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning, 3:261-283, 1989.
4. K. Decker, X. Zheng, and C. Schmidt. A multi-agent system for automated genomic annotation. In Proc. of the Int. Conf. on Autonomous Agents, Montreal, 2001. ACM Press.
5. R. D. King et al. Drug design by machine learning: the use of inductive logic programming to model the structure-activity relationships of trimethoprim analogues binding to dihydrofolate reductase. Proc. Natl. Acad. Sci., 89:11322-11326, 1992.
6. R. D. King and M. Sternberg. Machine learning approach for the prediction of protein secondary structure. J. Mol. Biol., 216:441-457, 1990.
7. E. Kretschmann, W. Fleischmann, and R. Apweiler. Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT. Bioinformatics, 17:920-926, 2001.
8. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
9. M. Riedmiller and H. Braun. Rprop - description and implementation details. Technical report, 1994.
10. A. Zaha. Projeto rede sul de análise de genomas e biologia estrutural, 2001. In Portuguese.


Stability Evaluation of Clustering Algorithms for Time Series Gene Expression Data
Ivan G. Costa1, Francisco de A. T. de Carvalho1, Marcílio C. P. de Souto2
1 Centro de Informática, Universidade Federal de Pernambuco, {igcf, fatc}@cin.ufpe.br
2 ICMC, Universidade de São Paulo, marcilio@icmc.usp.br

Keywords: Gene Expression Analysis, Functional Genomics.

1 Introduction
Different clustering techniques, such as hierarchical clustering [1], Self-Organizing Maps (SOM) [2], graph-theoretic approaches [3], and dynamical clustering [4], among others, have been used in the analysis of gene expression data. The focus of these studies is often on the biological results, and there is no critical analysis of the adequacy of the clustering methods used. In this paper, an evaluation methodology that assesses the stability of clustering methods in relation to external validation criteria is presented. More specifically, we show preliminary results obtained from the stability evaluation of three clustering techniques on five gene expression time series data sets from yeast cells [1]. The stability is evaluated through a replication analysis that investigates the extent to which similar results are obtained by analysing subsets of the data [5]. Two validity indices, Rand and Hubert, are used to measure the quality of the results by comparing the clustering results with gene annotation [6].

2 Stability
An approach to evaluating the results of clustering algorithms is to investigate their stability. This can be accomplished by reanalysing a modified version of the data set and noting the extent to which the new classification differs from the original. In this work, a procedure called Replication Analysis is used; it is described below [5]:
1. The data set is randomly divided into two samples, A and B;
2. A is clustered using a given method, and the centroids of the clusters are calculated;
3. The distances between the centroids and the individuals in B are calculated, and the individuals of B are assigned to the closest centroid (centroid assignment step);
4. B is clustered in the same way as A (direct cluster step);


5. A measure of agreement between the a priori classification and each of the partitions obtained in Steps 3 and 4 is calculated, and their difference is taken;
6. Go to step 1, n times (the number of resamples).

This procedure is analogous to cross-validation, with the A and B samples corresponding to the training and validation sets. To measure the degree of agreement between the a priori classification and the resulting partitions, two external indices, Rand and Hubert [6], are used.
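A compact sketch of one resample of this procedure is given below, using k-means as a stand-in clustering method and scikit-learn's adjusted Rand score as a stand-in agreement index (the paper itself evaluates hierarchical, dynamical, and SOM clustering with the Rand and Hubert indices).

```python
# Sketch of one resample of the replication analysis (Steps 1-5).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def replication_step(X, labels, k, rng):
    """X: (n_genes, n_timepoints); labels: a priori classes; k: cluster count."""
    idx = rng.permutation(len(X))
    a, b = idx[: 2 * len(X) // 3], idx[2 * len(X) // 3 :]   # A = 66%, B = 33%
    km_a = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[a])
    # Step 3: assign B to the closest centroid learned on A
    assigned = km_a.predict(X[b])
    # Step 4: cluster B directly, in the same way as A
    direct = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[b])
    # Step 5: agreement of each partition with the a priori classes, then the difference
    return (adjusted_rand_score(labels[b], assigned)
            - adjusted_rand_score(labels[b], direct))

# repeat n times, then t-test the differences against zero, e.g.:
# diffs = [replication_step(X, labels, 6, np.random.default_rng(i)) for i in range(30)]
```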

3 Experiments and Results


Data from five distinct biological processes, the mitotic cell division cycle (alpha, cdc15, and elutriation), sporulation, and diauxic shift, are used in this study [1]. They were collected using DNA microarrays, which measured the expression profiles of up to 6400 known yeast Open Reading Frames. For each data set, two other sets were generated containing only genes with known enzyme (EC) or functional (FC) classification (original data available at http://mips.sf.de/proj/yeast/catalogues). A missing-data filter was applied to the resulting data sets, excluding genes with more than 20% missing values. Furthermore, in each data set, the 25% of time series with the lowest variance between time points were removed, in order to exclude uninformative genes. This process resulted in ten data sets, with the FC and EC data sets containing on average 2585 and 738 genes, respectively. The ten data sets were presented to each of the three algorithms, resulting in 30 experiments. For each of these 30 experiments, 30 resamples of the data were formed, with the A and B sets containing 66% and 33% of the genes, respectively. The clustering methods were set to find six clusters in the EC data sets and thirteen clusters in the FC data sets; these numbers correspond to the number of a priori classes in the gene annotations. Finally, in order to assess whether there was a significant difference between the results of the centroid assignment and direct cluster steps, the evaluation metrics obtained from the resamples of each experiment were analysed using a t-test. For the Rand index, at a 99% level of significance, differences between the results of the centroid assignment and direct cluster steps were detected in all hierarchical experiments (see Table 1). On the other hand, no significant differences were detected in the dynamical clustering and SOM experiments. Using the Hubert index, one of the hierarchical experiments and two dynamical clustering experiments were found unstable at a 95% level of significance. At a 99% level of significance, Hubert detected instability in only one dynamical clustering experiment.
Index (signif.) | Hierarchical | Dynamical | SOM
Rand (95%) | All | - | -
Hubert (95%) | Alpha/FC | CDC/FC, Elut./FC | -
Rand (99%) | All | - | -
Hubert (99%) | - | Elut./FC | -

Table 1: Experiments where the t-test found a significant difference


A visual inspection of the contingency tables of the cluster results versus the prior gene classification provided some insights into the results obtained. In hierarchical clustering, a few clusters concentrated most of the genes. In more detail, the clusters resulting from Step 3 (centroid assignment) of the resampling procedure usually had a higher concentration of genes than the ones resulting from Step 4 (direct cluster). On the other hand, the SOM and dynamical clustering experiments presented results with a more homogeneous distribution of genes, in both the direct cluster and centroid assignment steps. An analysis of the indices' equations confirms that Rand is more prone than Hubert to detect such concentrations.

4 Conclusions
The preliminary experiments presented in this paper showed some interesting results of the proposed evaluation methodology. For example, no conclusion can be drawn from the results of applying the Hubert index, since only a few experiments were found unstable. Considering the measures made with Rand, more interesting results were revealed, as instability was detected in all the hierarchical clustering experiments. The difference between the results obtained with the Rand and Hubert indices shows that, in order to draw conclusions with the proposed methodology, it is necessary to know the properties of each index. Furthermore, since each index may capture distinct features of the results, the use of multiple evaluation indices is encouraged. As further work, other important clustering methods should be evaluated, and validity indices with distinct characteristics should be added to the methodology.
Acknowledgements. This paper was supported in part by grants from CNPq (Proc. 30138792-3 and 130916/2001-3) and FAPESP (Proc. 02/03049-6).

References
1. Eisen, M.B. et al. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, v. 95, p. 14863-14868, 1998.
2. Tamayo, P. et al. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA, v. 96, n. 6, p. 2907-2912, 1999.
3. Ben-Dor, A., Shamir, R., Yakhini, Z. Clustering gene expression patterns. Journal of Computational Biology, v. 6(3/4), p. 281-297, 1999.
4. Costa, I.G., de Carvalho, F.A.T., de Souto, M.C.P. A symbolic approach to gene expression time series analysis. Proc. of the VII Brazilian Symposium on Neural Networks, 2002, to appear.
5. Breckenridge, J.N. Replicating cluster analysis: method, consistency, and validity. Multivariate Behavioral Research, v. 24, p. 147-161, 1989.
6. Jain, A.K., Dubes, R.C. Algorithms for Clustering Data. Prentice Hall, 1988.


Ordering Gene Expression Data Using One-Dimensional Self-Organizing Maps


Lalinka de C. T. Gomes, Fernando J. Von Zuben and Pablo Moscato
Department of Computer Engineering and Industrial Automation, State University of Campinas (Unicamp), CP 6101, Campinas, SP, 13083-970, Brazil, {lalinka, vonzuben, moscato}@dca.fee.unicamp.br

Abstract. Microarray technology allows researchers to simultaneously measure the expression levels of thousands of genes. The analysis of data produced by such experiments provides knowledge about gene function. An important step in the analysis of gene expression data is the detection of genes with similar expression patterns. Real-time computational tools for organization and visualization are crucial to understand and analyze the data. In this work, we make use of an algorithm based on self-organizing neural networks to organize gene expression data, in order to reveal trends in gene expression profiles from the biological viewpoint.

1 Introduction

Microarray experiments are performed to simultaneously examine gene expression levels on a genomic scale. The extraction of knowledge and the analysis of gene expression information require dealing with large amounts of biological data, making simple visual analysis of the raw data rather impracticable. Under these circumstances, there is a clear need for computational tools capable of presenting the information in such a way that biologists can organize gene expression information in an intuitive manner. Here we present a method based on a connectionist approach that explores concepts of self-organizing maps to position the neurons according to the distribution of the expression levels of genes in each experiment.

2 Gene Expression Data Organization and Visualization

Several techniques have been developed to organize gene expression data, most of them based on hierarchical clustering procedures followed by an interpretation of the obtained tree to produce the ordered sequence [1], [2]. On the other hand, the problem of organizing gene expression data may be directly described as defining the best sequential order of genes, in such a way that genes with similar expression profiles lie relatively close. The visualization of gene expression data is generally

made by means of a graphical representation [2]. The graphical representation consists of a matrix where rows denote genes and columns denote experiments. The gene expression levels are measured relative to a cDNA (complementary DNA) probe. Under-expressed genes are represented by green intensities, over-expressed genes by red intensities, and equally-expressed genes are displayed in black.

3 A Connectionist Approach to the Gene Ordering Problem

In this work we propose a method that explores concepts of self-organizing maps (SOM) [4] to adjust the weight vectors of the neurons according to the distribution of the expression levels of genes in each experiment, based on a competitive learning algorithm. The algorithm, first implemented to solve the travelling salesman problem (TSP) and the vehicle routing problem (VRP) [3], was extended to operate in a multidimensional space and to deal with the gene sequencing problem. Figure 1(a) shows the initial configuration for 102 genes and 2 experiments plotted in the plane, together with the corresponding graphical representation; each gene corresponds to a 2-dimensional vector. Figure 1(b) shows both representations for the output of the algorithm.

Fig. 1. a) Initial ordering of 102 genes and 2 experiments plotted in the plane and the corresponding graphical representation; b) sequence of genes in the plane and respective graphical representation after the application of the self-organizing map algorithm.
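A minimal sketch of the underlying idea follows: a one-dimensional SOM whose neurons live in expression space is trained by competitive learning, and genes are then ordered by their winning neuron. This is our generic illustration, not the authors' C implementation; the parameter values and the fixed Gaussian neighborhood are assumptions (the paper's algorithm uses an adjustable-size neighborhood).

```python
# Sketch: order genes with a 1-D self-organizing map (competitive learning).
import numpy as np

def som_order(genes, n_neurons=None, epochs=100, lr=0.5, sigma0=None, seed=0):
    """genes: (n_genes, n_experiments) array; returns an ordering of gene indices."""
    rng = np.random.default_rng(seed)
    n = n_neurons or len(genes)
    w = rng.normal(size=(n, genes.shape[1]))               # neuron weight vectors
    sigma = sigma0 or n / 2
    for t in range(epochs):
        decay = np.exp(-t / epochs)                        # shrink lr and neighborhood
        for g in genes[rng.permutation(len(genes))]:
            win = np.argmin(((w - g) ** 2).sum(axis=1))    # competitive step
            d = np.arange(n) - win                         # 1-D neighborhood distance
            h = np.exp(-(d ** 2) / (2 * (sigma * decay) ** 2))
            w += (lr * decay) * h[:, None] * (g - w)       # cooperative update
    winners = [np.argmin(((w - g) ** 2).sum(axis=1)) for g in genes]
    return np.argsort(winners)                             # gene order along the map
```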

4 Computational Results
The program was coded in the C language, using the Borland C++ version 3.0 compiler, and the simulations were performed on a PC with a 500 MHz Celeron processor. In the simulations shown in Figure 2 we used public biological datasets (http://www.ncbi.nlm.nih.gov/entrez) and compared the SOM ordering algorithm with our implementation of the agglomerative complete-link clustering algorithm. We applied the algorithm to the public microarray dataset of diurnal and circadian-regulated genes in Arabidopsis. Figure 2 shows the graphical representation of the initial ordering of genes, with total cost 2347; the graphical representation of the new sequence obtained by the SOM algorithm, with cost 708; and the graphical representation for the

hierarchical clustering algorithm, with a cost of 856. Based on the obtained results, we conclude that, although the SOM algorithm has higher execution times, the resulting solution is significantly better, particularly for instances with large numbers of genes and experiments.

Fig. 2. Graphical representation of diurnal and circadian-regulated genes in Arabidopsis: a) original data; b) SOM; c) hierarchical clustering.

5 Conclusions
Considering the huge amount of gene expression data available and the rapid growth of information at the genomic level, computational tools to organize and analyze these data tend to become mandatory. In this work, we applied a method based on self-organization to order gene expression data, and we compared the performance of the SOM ordering algorithm with an implementation of the agglomerative complete-link clustering algorithm. The better performance of the SOM algorithm may be explained by the use of an adjustable-size neighborhood during the competitive learning phase.

References
1. Bar-Joseph, Z., Gifford, D., Jaakkola, T. (2001). Fast optimal leaf ordering for hierarchical clustering. In Proceedings of the Ninth International Conference on Intelligent Systems for Molecular Biology.
2. Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, Vol. 95, No. 25, pp. 14863-14868.
3. Gomes, L.C.T., Von Zuben, F.J. (2002). A neuro-fuzzy approach to the capacitated vehicle routing problem. Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN2002), Vol. 2, pp. 1930-1935.
4. Kohonen, T. (1997). Self-Organizing Maps. 2nd Edition, Springer Verlag.

Acknowledgements: CNPq has sponsored this research via grant 521100/01-1 and scholarships 300910/96-7 and 141882/01-8.

Reverse Engineering of Genetic Networks Using Variable Length Genetic Algorithms with Boolean Networks
Ricardo Linden1, Amit Bhaya2
1 FSMA-RJ/COPPE, R. Campos Elíseos S/N, Macaé-RJ, rlinden@pobox.com
2 COPPE, Federal University of Rio de Janeiro (UFRJ), Cidade Universitária, CEP 21945-970, Rio de Janeiro, RJ, Brazil, amit@nacad.ufrj.br

Abstract. Nowadays, given the new microarray technology, a huge amount of gene expression data is available. In order to understand the gene expression process more completely, we need to know the control structure of genetic expression. We present a new scheme for a genetic algorithm that shows promising results when modelling Boolean networks: we can effectively and quickly obtain good approximations of the underlying control structures, given very little information on the possible trajectories.

1. Introduction
Nowadays, the data available on gene expression is growing exponentially, but in order to understand the underlying dynamics of gene transcription regulation we need more than merely collecting large amounts of experimental data from gene expression assays. A framework for deriving and expressing the biochemical architecture of genetic systems, using experimental data, is required [4]. Most cells have the same DNA, but they differentiate because of many factors, including transcriptional regulation. This occurs through the combinatorial action of gene products on sequence elements close to each gene's transcriptional start site, establishing the foundation for cell differentiation. Nowadays we use Boolean networks to model gene networks. A Boolean network, denoted G(V, F), consists of a set V = {v1, v2, ..., vn} of nodes representing genes and a list F = {f1, f2, ..., fn} of Boolean functions, where a Boolean function fi(vi1, vi2, ..., vik) with inputs from specified nodes vi1, vi2, ..., vik is assigned to each node vi. For a subset U of V, an expression pattern of U is a function from U to {0, 1}. An expression pattern of V is also called a state of the Boolean network, and represents the states of the nodes (genes), where each node is assumed to take either 0 (not expressed) or 1 (expressed) as its state value. The expression pattern at time t+1 is determined by the Boolean functions F applied to the expression pattern at time t [1]. Even though Boolean networks are extremely simple, they exhibit behaviour analogous to the real development of organisms: concepts such as cell types can be modelled as attractors in state space, differentiation as transitions between attractors, and stability of expression patterns as basins of attraction. Thus, it may be argued that, using a simple model such as a Boolean network, we can gain useful insight into the extremely complex issue of development. There are some features of cell regulation that point to Boolean behaviour: for example, DNA expression in which inhibition occurs through the binding of an element to an operon [2], and the many elements regulated by sigmoidal processes, which can be crudely approximated by a step function, which is essentially Boolean. Other cellular features are essentially binary as well.
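The definition above maps directly onto code. The sketch below simulates a trajectory of a Boolean network in which each node is stored as its input nodes plus a 2^k-bit truth table, mirroring the encoding used by the GA described in the next section; the three-node example network is ours.

```python
# Sketch: simulate a Boolean network G(V, F). Each node i is given by
# (inputs_i, table_i), where table_i has 2^k bits indexed by the inputs' values.
def step(state, nodes):
    new = []
    for inputs, table in nodes:
        idx = 0
        for v in inputs:                  # pack input values into a table index
            idx = (idx << 1) | state[v]
        new.append(table[idx])
    return new

# toy 3-node example (ours): v0 <- NOT v2, v1 <- v0 AND v2, v2 <- v1
nodes = [([2], [1, 0]), ([0, 2], [0, 0, 0, 1]), ([1], [0, 1])]
state = [1, 0, 0]
for t in range(4):
    state = step(state, nodes)            # expression pattern at time t+1
```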


In order to model the dynamics of transcriptional regulation, we use genetic algorithms (GAs). The idea behind GAs is to mimic the natural evolution of species in order to create a new kind of search technique. A GA is fully defined by the coding scheme (how each individual is represented in the computer), the operators (both mutation and crossover), and the evaluation or fitness function (i.e., a measure of the quality of the current solutions (individuals) to the problem at hand). The chromosome structure we used is depicted in Figure 1. Inside the chromosome there are n genes, each storing the elements that regulate one node of the Boolean network. The regulation consists of two parts: the nodes that affect the current node, and the Boolean function that their combination generates. The mutation operator can change any position of the gene (either a regulation element or a Boolean function bit) with a given probability. The crossover operator is very simple: we exchange the regulation strategies for each node of our Boolean networks according to a dice throw. This is equivalent to the uniform crossover operator, but applied to the whole regulation of each node, not to the components of this regulation. In order to evaluate the performance of a chromosome, we stored t trajectories with nt steps each; each trajectory represents the real behaviour of the network we want to model. Each candidate network receives the first state of each trajectory, and the GA computes the intermediate and final steps for this network. The number of bits that differ at each stage of the trajectory is added and averaged over the number of steps. Afterwards, each chromosome is penalized according to the number of non-zero bits (active relationships) it represents, so that shorter chromosomes get extra credit. Analyzing our results, we noticed that most runs found a perfect regulation for some nodes. Therefore, we combined the best results of our GA in order to improve the final solution: we selected the top 8 solutions, based only on the GA evaluation function (so as not to insert any unfair bias), and combined all possible regulations. (A sketch of the encoding and fitness computation is given after Fig. 1.)

Fig. 1: Correspondence between the regulation of node vi and its GA representation: the gene for node vi stores its regulating nodes vi1, vi2, ..., vik, followed by 2^k bits giving the next value node vi will store for each combination of the values currently stored in vi1, vi2, ..., vik.
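The sketch announced above: a chromosome is a list of per-node regulations in the encoding of Fig. 1, its fitness is the average trajectory mismatch plus a complexity penalty, and crossover exchanges whole per-node regulations. The penalty weight and the data structures are our assumptions; step() is the state-update function from the earlier Boolean-network sketch.

```python
# Sketch: fitness of a chromosome = average bit mismatch along the stored
# trajectories + penalty on the number of active relationships. Lower is better.
import random

PENALTY = 0.01   # weight of the complexity penalty (assumed value)

def fitness(chromosome, trajectories):
    """chromosome: list of (inputs, table) per node; trajectories: list of
    observed state sequences of the real network."""
    errors, steps = 0, 0
    for traj in trajectories:
        state = traj[0]                       # the GA receives the first state only
        for target in traj[1:]:
            state = step(state, chromosome)   # reuse step() from the sketch above
            errors += sum(a != b for a, b in zip(state, target))
            steps += 1
    active = sum(len(inputs) for inputs, _ in chromosome)
    return errors / steps + PENALTY * active

def uniform_node_crossover(mom, dad, rng):
    # exchange whole per-node regulations according to a "dice throw"
    return [m if rng.random() < 0.5 else d for m, d in zip(mom, dad)]

# e.g.: child = uniform_node_crossover(parent_a, parent_b, random.Random(0))
```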

2. Results
In order to test our GA, we created a few examples containing interesting features. It must be stressed that our implementation is efficient: one run of the GA takes an average of 5-10 seconds on a desktop PC (Pentium MMX 166 MHz). We presented the algorithm with some purely abstract examples that include interesting characteristics, like periodicity, and the algorithm was able to identify all the correct relationships. Afterwards, we created trajectories that resemble the lac operon in E. coli, and our algorithm discovered the correct relationships among the elements, even if the Boolean functions were not exact. Still, the technique worked remarkably well for such a simple model.


3. Comparison with previous work


Even though the matrix model described in [5] is continuous, it is computationally complex, the solutions found are limited in the number of inputs, and it cannot model ubiquitous functions such as a logical AND. Since our GA does not constrain the number of inputs, it can easily find more realistic regulations. The comparison with the work described in [1] is more straightforward: it uses exactly the same binary model we do, but limits the number of inputs to two, allowing for an exhaustive search strategy. This limitation is not realistic, since in nature most genes are regulated by 4 to 8 elements [3]. This limitation obviously does not apply to our work, since the GA admits any number of regulating nodes.

4. Conclusions and further work


Our GA works really well with small networks, but it seems that the algorithm proposed does not scale well to bigger networks. Therefore, in order to apply the technique described in this paper, we need to divide the genes into overlapping clusters, so that we can extract many small networks from a big network and apply the GA to each. Given that cis-regulatory control strategies are modular [3], separating them into overlapping clusters should not limit our ability to model them. We also need to improve the crossover operator in our GA: the current operator is quite crude and does not perform fine tuning of the regulation network. In addition, we need to study the effects of the evaluation function: penalizing the most complex relationships makes the networks tend to the simplest solution, but makes it more difficult to find complex correct solutions. Notwithstanding the comments above, we obtained some very satisfying results: our GA was able to discover many useful relationships and to emulate the dynamics of the trajectories quite well. It is, of course, necessary to develop reliable statistical measures of the quality of the solution in fitting and extrapolating the data. Another future direction is the application of this GA technique, perhaps hybridized with fuzzy sets, to a continuous network: in order to have realistic simulations, we must work in the domain of real numbers.

5. References
[1] T. Akutsu, S. Miyano, S. Kuhara. Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. Pacific Symposium on Biocomputing (PSB), 1999.
[2] B. Alberts et al. Molecular Biology of the Cell. 4th Ed., Garland, USA, 2002.
[3] M. I. Arnone, E. H. Davidson. The hardwiring of development: organization and function of genomic regulatory systems. Development 124, 1851-1864, UK, 1997.
[4] P. Smolen, D. Baxter, J. D. Byrne. Modeling transcriptional control in gene networks: methods, recent results, and future directions. Bulletin of Mathematical Biology 62, 247-292, USA, 2000.
[5] E. P. van Someren, L. F. A. Wessels, M. J. T. Reinders. Information extraction for modelling gene expressions. Delft University, The Netherlands, 2000.


An Empirical Study of Sequence Alignment Using Systolic Computation and MPI
Deive Ciro de Oliveira
UFLA - Universidade Federal de Lavras, DCC - Departamento de Ciência da Computação, Cx. Postal 37, CEP 37.200-000, Lavras (MG), deive@comp.ufla

Abstract. This paper describes the implementation of a parallel algorithm for computing optimal alignments between two character sequences. Solution strategies for this problem are widely used in computational biology, specifically in the area of homology verification. The approach is theoretically grounded in systolic computation, and it is implemented in a message-passing environment (MPI: Message Passing Interface).

Introduction

In computational biology, the area that studies the applicability of computational tools to problems of biological scope, the treatment of homologies between character sequences is an extremely important problem. One of the strategies used to establish how much two character sequences resemble each other is sequence alignment. Aligning two sequences consists of laying them out one against the other, with the possibility of inserting the so-called spaces, or gaps. Assigning a value to each such arrangement, according to the specifics of the application, is a way to differentiate advantageous alignments (more matches) from non-advantageous ones (fewer matches). The maximal value among all possible alignments between two sequences is the value of the optimal alignment. One of the techniques used to obtain the optimal alignment value is dynamic programming, which fills a matrix as a function of the two sequences while respecting certain recurrence relations. The cost to obtain the optimal alignment value is O(nm), where n and m are the lengths of the sequences involved. A parallel strategy is summarized next.

Parallelization and Systolic Computation

The parallel method works essentially by filling the matrix in parallel, identifying entries that can be computed simultaneously. Using this technique, one identifies l groups of entries that can be computed simultaneously, where l is the sum of n and m.

With a number of processors equal to the minimum of n and m, the problem can be solved in l steps. This strategy can be applied directly in hardware; one such implementation is called a systolic array. It is based on the interconnection of processing cells, each performing a simple task, with the goal of filling the matrix and obtaining the optimal alignment value.
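The wavefront structure the systolic method exploits can be seen in a sequential sketch: every cell on the anti-diagonal i + j = d depends only on diagonals d-1 and d-2, so all cells of one diagonal could be computed at once, one per systolic cell. The global-alignment scoring scheme below is a generic choice for illustration.

```python
# Sketch: global alignment DP filled by anti-diagonals (the cells of each
# diagonal are independent and could be computed by one systolic cell each).
def align_score(s, t, match=1, mismatch=-1, gap=-2):
    n, m = len(s), len(t)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = i * gap
    for j in range(1, m + 1):
        M[0][j] = j * gap
    for d in range(2, n + m + 1):              # l = n + m anti-diagonals
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + sub,   # values come from diagonals
                          M[i - 1][j] + gap,       # d-2 and d-1 only
                          M[i][j - 1] + gap)
    return M[n][m]
```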

MPI (Message Passing Interface)

MPI is an environment that emulates parallelism. Its paradigm is based on message passing. The processing units in the MPI environment are processes, i.e., units of code in execution. The implementation developed here is based on the systolic array described in [6], with the processing cells realized as running processes.
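A minimal mpi4py sketch of this arrangement (ours, not the paper's code): each process plays one systolic cell, owning one column of the matrix and consuming the values of the previous column as they arrive from its left neighbour. It assumes one process per character of the second sequence.

```python
# Sketch (run with: mpiexec -n <len(t)> python systolic.py): process r computes
# column r+1 of the DP matrix, receiving cell values from its left neighbour.
from mpi4py import MPI

comm = MPI.COMM_WORLD
r = comm.Get_rank()
s, t = "GATTACA", "GCATGCU"          # toy sequences (illustration only)
gap, j = -2, r + 1                   # this process owns column j of M
col = [j * gap]                      # M[0][j]
diag_prev = (j - 1) * gap            # M[0][j-1]
for i in range(1, len(s) + 1):
    left = i * gap if r == 0 else comm.recv(source=r - 1)   # M[i][j-1]
    sub = 1 if s[i - 1] == t[j - 1] else -1
    cell = max(diag_prev + sub, col[-1] + gap, left + gap)
    diag_prev = left                 # becomes M[i-1][j-1] for the next row
    col.append(cell)
    if r + 1 < comm.Get_size():
        comm.send(cell, dest=r + 1)  # feed the next systolic cell
if r == comm.Get_size() - 1:
    print("alignment score:", col[-1])
```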

Conclusion

The computational cost of problems related to the search for homologies between character sequences is extremely high, owing to input data of large dimensions. The study of different alternatives for obtaining optimal alignment values is therefore important, seeking to optimize time costs and to reduce the computational load required to solve the problem. The systolic strategy can be a viable alternative for obtaining optimal alignment values. To assess this applicability, future studies should consider the cost trade-offs of a hardware implementation, as well as the additional time generated by process communication when using parallel environments based on the message-passing paradigm. The adoption of these environments in didactic implementations is important for future hardware developments (VHDL, FPGA), anticipating situations found in the design and implementation of application-specific integrated circuits.

References
1. www.cbs.dtu.dk/databases/DOGS.
2. www.ncbi.nlm.nih.gov.
3. J. Meidanis and J. C. Setubal. Uma Introdução à Biologia Computacional. IX Escola de Computação, Recife, 1994.
4. D. T. Hoang. A systolic array for the alignment problem. Brown University, 1992.
5. K. Howard. The bioinformatics gold rush. Scientific American, 2000.
6. U. Manber. Introduction to Algorithms: A Creative Approach. Addison-Wesley, 1989.
7. M. Tompa. Lecture notes on sequence analysis. University of Washington, 2000.
8. V. Kumar, A. Grama, A. Gupta, and G. Karypis. Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin/Cummings - Addison Wesley, 1993.


Creation of a Hidden Markov Model for Preliminary Identification and Characterization of Subfamily Signatures in the Serpin Protein Superfamily
Cristina Russo (1,2); Hermes de Amorim (1,3); Ana Bazzan (2); Jorge Guimarães (1)

1. Centro de Biotecnologia, UFRGS, Porto Alegre/RS, Av. Bento Gonçalves, 9500; 2. Instituto de Informática, UFRGS, Porto Alegre/RS, Av. Bento Gonçalves, 9500; 3. Depto. de Química, ULBRA, Canoas/RS

Introduction

Serpins (Serine Proteinase Inhibitors) are a superfamily of regulatory proteins (350-500 amino acids in size) with a variety of biological roles, including inhibition of chymotrypsin- and trypsin-like serine proteinases. These proteins attract increasing interest because of their medical importance (point mutations can cause a number of disease states, including blood clotting disorders, pulmonary emphysema, cirrhosis, and mental diseases) and also because of their unusual functional mechanism. This mechanism involves a conformational change, known as the stressed-to-relaxed (S→R) transition, between conformational states of different folding topologies [1]. The conserved serpin fold includes three β-sheets and at least seven α-helices (most typically nine). The reactive center loop (RCL, also called the reactive site loop, RSL) is crucial for the inhibitory function. Five conformational states appear in serpin crystal structures, differing primarily in the structure of the RCL. Several hundred serpins have been identified in higher eukaryotes and viruses [2]. They represent an expanding superfamily of structurally similar but functionally diverse proteins [2],[3]. Most serpins inhibit serine proteinases of the chymotrypsin/trypsin family. However, several members that no longer function as proteinase inhibitors have also been identified. They perform other roles in different physiological systems, such as hormone transport (corticosteroid-binding globulin) and blood pressure regulation. Thus, understanding the biological function of most serpins remains an ongoing challenge for biomedical researchers. Since the beginning of the post-genomic era, functional and structural characterization of new proteins has represented a critical task for computational biology. Most profile and motif databases tend to classify protein sequences into a broad spectrum of protein families. On the other hand, protein classification through systems capable of distinguishing between subfamilies within a structurally and functionally diverse superfamily has not yet been implemented. The most popular methods of sequence analysis, such as BLAST [4] or FASTA [5], are not sensitive enough to distinguish and capture small differences in protein sequences [6]. The goal of this work is to contribute to the effort of elucidating structure-function relationships in proteins. Specifically, we explore a Hidden Markov Model (HMM) methodology for a preliminary identification and characterization of subfamily signatures in the serpin superfamily. Four models were created that allowed us to distinguish: (a) serpins in general; (b) inhibitory serpins; (c) serpins that inhibit serine proteinases of the blood coagulation cascade; and (d) serpins with unrelated and/or unknown inhibitory function. They were called MT, MI, MC and MN, respectively. The models were trained and their multiple sequence alignments (MSAs) created; the results of the four HMMs were tested using BLASTp. HMMs are stochastic methods also used to create statistical profiles for protein families and DNA sequences. The method is based on the probability of finding an expected amino acid or nucleotide base at each position of the primary sequence of a macromolecule.
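As a hedged sketch of the kind of post-processing described here (not the authors' actual pipeline, which relied on trained HMMs), the per-column consensus residue and the 75% conservation threshold reported below can be computed directly from an MSA; the toy alignment is invented for illustration.

# Per-column consensus and conserved sites of an MSA (one aligned string
# per sequence); a column is flagged when its most frequent residue
# covers more than `threshold` of the sequences.
from collections import Counter

def consensus_and_conserved(msa, threshold=0.75):
    consensus, conserved = [], []
    for col in zip(*msa):                       # iterate over MSA columns
        counts = Counter(r for r in col if r != "-")
        if not counts:
            consensus.append("-")
            continue
        residue, freq = counts.most_common(1)[0]
        consensus.append(residue)
        if freq / len(col) > threshold:         # >75% of sequences agree
            conserved.append(len(consensus) - 1)
    return "".join(consensus), conserved

msa = ["MKTAYIA", "MKTSYIA", "MKTAYLA", "MRTAYIA"]   # invented alignment
cons, sites = consensus_and_conserved(msa)
print(cons, sites)    # consensus string and conserved column indices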

Results and Discussion


Each of the four models generated a consensus sequence intended to represent an average sequence for the whole subfamily. A serpin could then be assigned to a subfamily by testing its similarity to the subfamily consensus sequence. In order to validate the models, the four consensus sequences were tested with the BLASTp program. As expected, the MT consensus sequence matched a large group of serpins, all of them with high scores. For the second model, MI, BLASTp gave the highest scores only for serpins with an inhibitory role. For MC, all BLASTp hits with the highest scores were serpins related to the clotting cascade. For MN, the BLASTp results were non-inhibitory serpins.

Sequence Conservation

Information concerning sequence conservation was obtained from the MSAs. The first three models showed two match-state regions with residue conservation higher than 75%, located at positions 170 to 244 and 392 to 400 of the HMM. The fourth model, MN, did not show these conserved regions. The conserved residues in MT, MI and MC seem to represent the mobile parts of the protein related to the S→R conformational change. Indeed, Irving and co-workers [2] showed that site-directed mutations designed to switch a single amino acid residue in this region produced profound differences in the S→R conformational change. The observation that the MN serpin sequences do not show the same conserved sites in the HMM agrees with experimental results. For example, the RCL region of inhibitory serpins should be occupied by residues with short side-chains; these residues are thought to allow efficient and rapid insertion of the cleaved RCL into β-sheet A.

Subfamily Signatures

The variations found in the HMMs of MI, MC and MN, relative to MT and to each other, were taken as signatures of these three subfamilies. The model MI showed an extremely conserved site for inhibitory proteins, not present in the total model MT, located at states 313 to 336. This site could be responsible for the inhibitory characteristics of these proteins and could be related to residues at the RCL of inhibitory serpins. These regions are important in controlling and modulating serpin conformational changes [1], [2], [3]. Conserved sites are present in MC, located at states 72 to 96 and 366 to 370; these regions probably have biological relevance for the structure or folding of the proteins belonging to the subfamily. The model MN presented only one conserved region, located at states 205 to 239.

Conclusions
The HMMs were observed to generate reasonable information about the four subfamilies of serpins. Each of the models generated a consensus sequence, so that a serpin could be classified into a subfamily by testing its similarity to the subfamily consensus sequence. Testing the sequences with BLASTp as a validation tool showed that they correctly represent the members of the corresponding model. The models' MSAs provided information about conservation and variation in the proteins. All of the models containing serpins with inhibitory activity showed a conserved site located at positions 313 to 336. This site could be mapped onto the RCL and the breach, shutter and gate regions of inhibitory serpins (regions related to serpin conformational changes). There were also particular conserved regions in the subfamily models: MC presented two conserved sites, and MN, one. These sites could carry functional-structural information and should be investigated. In further work, more proteins should be added to the models in order to get better results. The selection of protein subfamilies and the definition of their limits could also be revised, since they depend on the researcher who defines them.

References
1. Ye, S., Goldsmith, E. J.: Serpins and other covalent protease inhibitors. Curr Opin Struct Biol 2001, 11:740.
2. Irving, J. A., et al.: Phylogeny of the serpin superfamily: implications of patterns of amino acid conservation for structure and function. Genome Research 2000, 10:1845.
3. Silverman, G. A., et al.: The serpins are an expanding superfamily of structurally similar but functionally diverse proteins. J Biol Chem 2001, 276(36):33293.
4. Altschul, S. F., et al.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389.
5. Pearson, W., Lipman, D.: Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 1988, 85:2444.
6. Karchin, R., Karplus, K., Haussler, D.: Classifying G-protein coupled receptors with support vector machines. Bioinformatics 2002, 18:147.


Transductive Support Vector Machines for Cancer Diagnosis and Classification of Microarray Gene Expression Data
Robinson Semolini and Fernando J. Von Zuben
Department of Computer Engineering and Industrial Automation, State University of Campinas (Unicamp), Brazil
{semolini,vonzuben}@dca.fee.unicamp.br

Abstract:
The purpose of this paper is to present a powerful methodology for classification using gene expression data. The following problems will be considered: determination of specific diagnosis categories for the small round blue cell tumors of childhood, and classification of genes into functional groups. The classification task will be performed by means of transductive inference with SVM, implemented based on training and testing data sets. In the case of the training samples, experts have previously classified the genes or diagnostic categories into their respective classes; given the testing samples, the purpose is to determine the corresponding class. Thereby, we will be able to classify, in only one step, each sample of the testing data set as a member or not of a certain class. As transductive inference has not previously been applied to the classification of gene expression data with SVM as the classification tool, it will be compared with the traditional inductive method in a series of exhaustive experiments, with promising results.

1 - Introduction
With the purpose of improving generalization and increasing the performance of the estimator, transductive inference is considered here to estimate function values in only one step. It represents an alternative approach to inductive inference, which requires two steps: the first step (inductive) consists of discovering the functional dependence among input-output variables, and the second step (deductive) uses this functional dependence to evaluate the output at the points of interest. We will apply this methodology to the following problems:

(A) Classification of genes of the budding yeast Saccharomyces cerevisiae into functional groups, using gene expression data from DNA microarray hybridization experiments [1]. The data set is composed of the ratios of the expression levels of 2467 genes of budding yeast, measured in 79 different DNA microarray hybridization experiments. The genes were classified into five functional classes: tricarboxylic acid cycle (TCA), respiration (Resp), cytoplasmic ribosomal genes (Ribo), proteasome (Prot) and histones (Hist). The genes belonging to a given functional class will be classified as positive, and the other genes as negative. The gain function to be maximized on the testing data will be G = GN - FP - 2·FN, where FP and FN are the numbers of false positives and false negatives, and GN is the null gain [1]. We will consider the complexity index Complexity = 1 - (G / GN). Using this index, we can compare the classification complexity of the different functional classes.

(B) Classification of small round blue cell tumors (SRBCTs) of childhood into four specific diagnosis categories based on their gene expression signatures [4]. The data set is composed of 6567 genes and 88 samples divided into: Ewing family of tumors (EWS), neuroblastoma (NB), rhabdomyosarcoma (RMS) and Burkitt lymphoma (BL). Lee, Y. & Lee, C.-K. [5] showed that only 20 genes (those with the largest ratios in feature selection methods) were enough to correctly classify SRBCTs. The performance to be maximized will be measured as the percentage of correct classifications on the testing data.

We will use Support Vector Machines (SVM) as the classification tool in this study, leading to Transductive Support Vector Machines (TSVM). SVM is a machine learning algorithm, introduced by Vapnik [6] and based on the Structural Risk Minimization principle from Statistical Learning Theory, which implements a nonlinear mapping (performed by a kernel function chosen a priori) of the input vectors x_i into a high-dimensional feature space, where an optimal separating hyperplane is constructed. When the training data are separable, the optimal hyperplane in feature space is the one with the maximal distance between the hyperplane and the closest image of the input vectors x_i from the training data. For nonseparable training data, a generalization of this concept is used. Estimating the separating hyperplane is equivalent to solving the following optimization problem [6]:
Minimize:

    W(\alpha) = -\sum_{i=1}^{n} \alpha_i + \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)        (1)

subject to:

    \sum_{i=1}^{n} \alpha_i y_i = 0,        0 \le \alpha_i \le C,  i = 1, \dots, n

where C is a parameter that controls the tradeoff between the complexity of the decision function and the number of training samples misclassified (generalization parameter).
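As a small worked illustration of the evaluation metrics defined in the introduction (not code from the paper), the gain function and the complexity index can be computed directly from the confusion counts; the counts used below are invented for the example.

# Gain G = GN - FP - 2*FN (false negatives cost twice as much as false
# positives; GN is the gain of the null classifier) and the derived
# complexity index Complexity = 1 - G/GN.
def gain(gn, fp, fn):
    return gn - fp - 2 * fn

def complexity(g, gn):
    return 1.0 - g / gn

g = gain(gn=34, fp=4, fn=9)          # hypothetical counts
print(g, complexity(g, gn=34))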


2 - Transductive Support Vector Machines

Besides the input-output training data (x_i, y_i), 1 <= i <= n, belonging to two classes {±1}, we will operate with the testing sample data (x*_j, y*_j), 1 <= j <= k, supposedly obtained from the same statistical distribution. According to Vapnik [7], the objective (for the nonseparable case) is to find the classification of the testing data that minimizes the following dual representation of the optimization problem:
Minimize:

    W(\alpha, \alpha^*, y^*) = -\sum_{i=1}^{n} \alpha_i - \sum_{j=1}^{k} \alpha_j^*
        + \frac{1}{2} \sum_{i,r=1}^{n} \alpha_i \alpha_r y_i y_r K(x_i, x_r)
        + \frac{1}{2} \sum_{j,r=1}^{k} \alpha_j^* \alpha_r^* y_j^* y_r^* K(x_j^*, x_r^*)
        + \sum_{i=1}^{n} \sum_{j=1}^{k} \alpha_i \alpha_j^* y_i y_j^* K(x_i, x_j^*)        (2)

subject to:

    \sum_{i=1}^{n} \alpha_i y_i + \sum_{j=1}^{k} \alpha_j^* y_j^* = 0
    0 \le \alpha_i \le C,  i = 1, \dots, n
    0 \le \alpha_j^* \le C^*,  j = 1, \dots, k

where C and C* are parameters that control the tradeoff between the complexity of the decision function and the number of misclassified training and testing samples, respectively. In order to solve optimization problem (2), which in general is not convex and thus prone to local optima, we will apply Transductive SVMlight [2] [3] to find an approximate solution using a form of local search. The TSVMlight algorithm starts by training an inductive SVM on the training data and classifying the testing data according to the user parameter T+, which specifies the fraction of testing data to be assigned to the positive class. The next step is the main loop, which adjusts two unbalanced cost parameters on misclassification errors, C*_- and C*_+, devoted to assigning an unlabeled sample to the negative or the positive class, respectively. The role of these two costs is to contribute to the adjustment of the user parameter T+. The algorithm starts with small values for these two parameters for the unlabeled data and uniformly increases the influence of the unlabeled data up to a predefined penalty level C*. Inside this loop there is another one that switches the labels of two examples of testing data having positive slack variables, so that the objective function decreases. At each change of the costs, and at each label exchange involving two points, optimization problem (1) is solved to produce an approximate solution to problem (2).
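The following Python sketch conveys the flavor of this local-search scheme in a deliberately simplified form. It is not Joachims' TSVMlight: there is no incremental cost schedule for C*_- and C*_+, scikit-learn's SVC stands in for the quadratic-programming core, and the kernel, C value, number of rounds and the margin-violation heuristic for choosing the pair to swap are all assumptions made for illustration.

# Simplified transductive local search: label a fraction T+ of the test
# points positive, then repeatedly swap one positive/negative label pair
# among margin violators, retraining the SVM in between.
import numpy as np
from sklearn.svm import SVC

def tsvm_sketch(X_train, y_train, X_test, t_plus=0.5, n_rounds=10):
    clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
    scores = clf.decision_function(X_test)
    k_pos = int(t_plus * len(X_test))            # user parameter T+
    y_star = np.full(len(X_test), -1)
    y_star[np.argsort(scores)[-k_pos:]] = 1      # most positive scores
    for _ in range(n_rounds):
        X = np.vstack([X_train, X_test])
        y = np.concatenate([y_train, y_star])
        clf = SVC(kernel="rbf", C=1.0).fit(X, y)
        margins = y_star * clf.decision_function(X_test)
        pos = np.where((y_star == 1) & (margins < 1))[0]
        neg = np.where((y_star == -1) & (margins < 1))[0]
        if len(pos) == 0 or len(neg) == 0:
            break                                 # no violating pair left
        i, j = pos[np.argmin(margins[pos])], neg[np.argmin(margins[neg])]
        y_star[i], y_star[j] = -1, 1              # swap one label pair
    return y_star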

3 - Results
To evaluate the performance of TSVM, we presupposed that only a small part of the samples is already classified; this part is taken as the training data, and the remaining samples form the testing data. The objective is to classify these testing samples. This approach allows us to investigate the consequences of progressively increasing the training sample size on the performances of SVM and TSVM. The samples, randomly selected, were divided according to the distribution in Table 1.

Table 1: Data structure for the application of TSVM. The column Complexity was obtained by Brown et al. [1]. (A) = Saccharomyces cerevisiae data set; (B) = SRBCTs data set; + = positive class; - = negative class.


The 20 experiments in data set "A" and the 16 in data set "B" were repeated 5 times, each characterized by a different, randomly defined set of training data, to measure the effectiveness of TSVM compared with SVM. In the experiments on data set "A", when the best result for TSVM or SVM produced negative values for the gain function G, the best choice was the null hypothesis, for which the value of the gain function G is equal to zero. The results of the experiments comparing TSVM and SVM are shown in Figure 1. For the class BL, data set "B", the two methodologies led to 100% correct classification of the testing data.

4 - Conclusions
Similar to the results obtained by Joachims [3] for the problem of text classification, our experiments with functional classification and cancer diagnosis using gene expression data lead to the assertion that transductive inference produces better classifiers than inductive inference, both using SVM as the classifier, in the presence of a reduced number of high-dimensional samples in the training set. When the size of the training set increases, the performance of inductive inference approaches that of transductive inference. In the experiments on classifying genes into functional classes, the method of inductive inference was not capable of classifying the TCA class with a minimum of performance, where 3 of the 4 outputs (Figure 1(A1)) were given by the null hypothesis; for the Resp class, 2 of the 4 outputs (Figure 1(A2)) were also given by the null hypothesis. These two classes were the ones presenting the highest complexity index (Table 1). So, there is a high correlation between the classification complexity and the inefficacy of the inductive method when applied to a training set of small size. In the cancer diagnosis problem, of low complexity, 6 samples in the training data of TSVM are enough. We also see that the number of test samples does not need to be very high to benefit from the transductive methodology, since in this problem its size varies between 64 and 70 samples. As a consequence, the transductive approach using SVM (TSVM) may contribute substantially to classification problems using gene expression data, where the set of samples already classified is usually small and there are still many samples to be classified. If the classification complexity is high (gene classification into functional classes), the number of training samples does not need to be small to obtain benefits from the transductive methodology; and if the classification complexity is low, the size of the testing data does not need to be high, as indicated by the results for cancer diagnosis.
[Figure 1 plots, for SVM and TSVM, the gain function G versus the training sample size for the negative class in data set A (panels (A1) TCA, (A2) Resp, (A3) Prot, (A4) Hist, (A5) Ribo) and the percentage of correct classification versus the training sample size in data set B (panels (B1) EWS, (B2) NB, (B3) RMS).]
Figure 1: Comparison of the performance of the two methodologies on testing data, inductive (SVM) and transductive (TSVM), for the different diagnostic categories or functional classes. Each point shows the average and the standard deviation (only half) over the 5 different simulations.

5 - Acknowledgments

The authors thank Thorsten Joachims for making the SVMlight software package available at http://svmlight.joachims.org/. CNPq has sponsored this research via grants 521100/01-1 and 300910/96-7.

References

[1] Brown, M. P. S., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C. W., Furey, T. S., Ares, M. Jr. and Haussler, D. (2000), Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Academy of Sciences of the USA, 97(1):262-267.
[2] Joachims, T. (1999a), Making large-scale SVM learning practical. In: Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges and A. Smola (eds.), MIT Press.
[3] Joachims, T. (1999b), Transductive inference for text classification using support vector machines. International Conference on Machine Learning (ICML).
[4] Khan, J., Wei, J., Ringner, M., Saal, L., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C., Peterson, C. and Meltzer, P. (2001), Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7:673-679.
[5] Lee, Y. and Lee, C.-K. (2002), Classification of multiple cancer types by multicategory support vector machines using gene expression data. UW-Madison Statistics Dept. Technical Report 1051.
[6] Vapnik, V. N. (1995), The Nature of Statistical Learning Theory. Springer.
[7] Vapnik, V. N. (1998), Statistical Learning Theory. John Wiley and Sons, New York.


Machine Learning Applied to the Study of Molecular Markers for Beef Cattle Production
Silvia H. M. G. da Silva (1), Ana C. Lorena (1), André C. P. L. F. de Carvalho (1), Danielle D. Tambasco (2) and Luciana C. A. Regitano (2)

1. Universidade de São Paulo, Instituto de Ciências Matemáticas e de Computação, Cx. Postal 668, CEP 13560-970, São Carlos - São Paulo - Brasil
{silviah, aclorena, andre}@icmc.sc.usp.br
2. Embrapa - Centro de Pesquisas de Pecuária do Sudeste, Cx. Postal 339, CEP 13560-970, São Carlos - São Paulo - Brasil
{danielle, luciana}@cppse.embrapa.br

Abstract. Several factors are related to beef cattle production. Identifying them contributes directly to the choice of the most promising genetic crosses. This work investigates the use of Artificial Neural Networks and Support Vector Machines to study the influence of molecular markers on the animal's daily weight gain from birth to weaning. The error rates obtained were low, showing that the variables involved are related to weight gain in this phase.

Introduction

The generation of technologies and knowledge that contribute to increasing the availability of animal products of better quality and lower cost is of fundamental importance for the progress of livestock farming. The use of molecular markers, especially DNA markers, allows the genetic potential of an animal to be determined with greater precision [4]. This research generates an enormous amount of data, requiring computational tools that ease its interpretation. Machine Learning (ML) techniques, being able to learn on their own from a data set, represent an attractive alternative for dealing with this kind of problem. This work investigates the use of two intelligent techniques, Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs), for the interpretation of molecular marker data. ANNs can be defined as distributed parallel systems composed of simple processing units that compute certain mathematical functions, simulating neurons [2]. SVMs use functions called kernels to map the data into a high-dimensional space, where a linear function suffices to approximate the distribution of the data [7]. This paper is organized as follows: Section 2 presents the materials and methods used in the experiments conducted; Section 3 lists and discusses the results obtained and concludes the paper.

Materials and Methods

The database used in this work includes data obtained through the project "Molecular markers applied to beef cattle production", developed at CPPSE - Embrapa. Data were obtained for the molecular markers LGB (found in the β-lactoglobulin gene) and GH (growth hormone gene). The data come from 189 animals resulting from crosses of Nelore cows with Aberdeen Angus, Canchim and Simmental bulls. The attributes used as inputs for training the learning techniques were the animal's sex, genetic group, treatment (with or without feed supplementation), sire, dam's age at calving, and the marker combinations (GH and LGB, GH, LGB, and no marker). The output, in turn, was the average weight gain. The work investigated the influence of the markers on the animals' daily weight gain in the period between birth and weaning. The results were obtained using 10-fold cross-validation [6]. Multilayer Perceptron ANNs were used, trained with the backpropagation algorithm [2]. Networks with one hidden layer were tested, varying the number of neurons. The learning rate and the momentum term were both set to 0.1. The SNNS tool [8] was used to implement the ANNs. For the SVMs, several kernel functions were employed (polynomial, Gaussian and sigmoidal). The SVMs were generated with the aid of the SVMTorch II tool [3].
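For readers who wish to reproduce the general setup, the following sketch re-creates it with current open-source tools rather than the SNNS and SVMTorch II packages actually used; the synthetic data, the gamma = 1/(2*sigma^2) mapping from the kernel standard deviation, and the iteration budget are assumptions made for illustration only.

# MLP regressor trained by backpropagation (learning rate and momentum
# 0.1) and an RBF-kernel SVR, both scored by 10-fold cross-validated MSE.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(189, 6))                   # 189 animals, 6 attributes
y = X @ rng.normal(size=6) + rng.normal(scale=0.3, size=189)

mlp = MLPRegressor(hidden_layer_sizes=(1,), solver="sgd",
                   learning_rate_init=0.1, momentum=0.1, max_iter=2000)
svr = SVR(kernel="rbf", gamma=1 / (2 * 50**2))  # Gaussian kernel, sigma = 50

for name, model in [("MLP", mlp), ("SVR", svr)]:
    mse = -cross_val_score(model, X, y, cv=10,
                           scoring="neg_mean_squared_error")
    print(name, mse.mean(), mse.std())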

Results, Discussion and Conclusions

Table 1 presents the error rates obtained in the tests of the learning techniques considered. The ANN with 1 neuron in the hidden layer was the best network topology in all experiments. SVMs with a Gaussian kernel and standard deviation 50 gave the best results in the experiments with both markers and with the GH marker alone. For the experiments with LGB and without markers, the best kernel was the Gaussian with standard deviation 100. Comparing the performance of the techniques used, the SVMs were more accurate, although this cannot be asserted at a 95% confidence level [5]. The lowest error rate was obtained in the experiment using only the GH marker. From a physiological point of view, this can be attributed to the fact that the GH marker is related to the gene that encodes growth hormone.

Table 1. Mean squared error obtained in the experiments

            GH + LGB        GH              LGB             no marker
ANNs        11.67 ± 4.02    11.08 ± 3.39    11.69 ± 4.21    11.26 ± 4.07
SVMs         9.95 ± 3.16     9.83 ± 3.13    10.15 ± 3.13    10.07 ± 3.28

The experiment with LGB presented the highest error rate, indicating that this marker does not influence daily weight gain. When LGB is combined with GH, this rate decreases; however, the error remains higher than that of the analysis of GH alone, suggesting an epistatic effect of β-lactoglobulin. In the absence of markers, an error rate similar to the previous ones is observed. This indicates that other variables strongly influence daily weight gain, suggesting further experimental tests as well as the use of other molecular markers. Another approach is to analyze which genetic groups are most related to the production traits.

Acknowledgments

The authors thank CNPq and Fapesp for the financial support granted for this project.

References
1. Baldi, P., Brunak, S.: Bioinformatics - The Machine Learning Approach. The MIT Press (1998)
2. Braga, A. P., Carvalho, A. C. P. L. F., Ludermir, T. B.: Redes Neurais Artificiais: Teoria e Aplicações. Livros Técnicos e Científicos, Rio de Janeiro (2000)
3. Collobert, R., Bengio, S.: SVMTorch: Support vector machines for large scale regression problems. Journal of Machine Learning Research, Vol. 1 (2001) 143-160
4. Coutinho, L. L., Regitano, L. C. A.: Uso de marcadores moleculares na indústria animal. In: Biologia Molecular Aplicada à Produção Animal, Embrapa Informação Tecnológica (2001) 215 p.
5. Johnson, R. A.: Miller and Freund's Probability and Statistics for Engineers. Prentice Hall (2000)
6. Mitchell, T.: Machine Learning. McGraw Hill (1997)
7. Smola, A., Schölkopf, B.: A tutorial on support vector regression. NeuroCOLT2 Technical Report NC2-TR-1998-030 (1998)
8. Zell, A., Mamier, G., Vogt, M., Mache, N., Hübner, R., Döring, S., Herrmann, K., Soyez, T., Schmalzl, M., Sommer, T., Hatzigeorgiou, A., Posselt, D., Schreiner, T., Kett, B., Clemente, G., Wieland, J.: SNNS - Stuttgart Neural Network Simulator, User Manual, Version 4.1. Technical Report 6/95, Institute for Parallel and Distributed High Performance Systems (IPVR), University of Stuttgart (1995)

