Published by: Samudrala Vijaykumar on Oct 07, 2011
Copyright: Attribution Non-commercial


Bioinformatics


Introduction

Biology is in the middle of a major paradigm shift driven by computing technology. Although it is already an informational science in many respects, the field has rapidly become much more computational and analytical. Rapid progress in genetics and biochemistry research, combined with the tools provided by modern biotechnology, has generated massive volumes of genetic and protein sequence data. Bioinformatics has been defined as a means for analysing, comparing, graphically displaying, modeling, storing, systemising, searching, and ultimately distributing biological information, which includes sequences, structures, function, and phylogeny. Thus bioinformatics may be defined as a discipline that generates computational tools, databases, and methods to support genomic and post-genomic research. It comprises the study of DNA structure and function, gene and protein expression, protein production, structure and function, genetic regulatory systems, and clinical applications. Bioinformatics needs expertise from Computer Science, Mathematics, Statistics, Medicine, and Biology. Bioinformatics is the application of computer technology to the management of biological information. Computers are used to gather, store, analyse and integrate biological and genetic information, which can then be applied to gene-based drug discovery and development.


Definition of Bioinformatics
Bioinformatics is the analysis of biological information using computers and statistical techniques. It is the science of developing and utilizing computer databases and algorithms to accelerate and enhance biological research. Bioinformatics is more of a tool than a discipline: it provides the tools for the analysis of biological data.

The National Center for Biotechnology Information (NCBI 2001) defines bioinformatics as:
"Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. There are three important sub-disciplines within bioinformatics: the development of new algorithms and statistics with which to assess relationships among members of large data sets; the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures; and the development and implementation of tools that enable efficient access and management of different types of information."

From Webopedia:
The application of computer technology to the management of biological information. Specifically, it is the science of developing computer databases and algorithms to facilitate and expedite biological research. Bioinformatics is used largely in the field of human genome research by the Human Genome Project, which has been determining the sequence of the entire human genome (about 3 billion base pairs), and it is essential in using genomic information to understand diseases. It is also used largely for the identification of new molecular targets for drug discovery.


The three terms bioinformatics, computational biology and bioinformation infrastructure are often used interchangeably. These three may be defined as follows:

1. Bioinformatics refers to database-like activities, involving persistent sets of data that are maintained in a consistent state over essentially indefinite periods of time.
2. Computational biology encompasses the use of algorithmic tools to facilitate biological analyses.
3. Bioinformation infrastructure comprises the entire collective of information management systems, analysis tools and communication networks supporting biology.

Thus, the latter may be viewed as a computational scaffold of the former two.


History of Bioinformatics

Modern bioinformatics can be classified into two broad categories: biological science and computational science. Here is the data of historical events for both biology and computer sciences.

The history of biology in general, B.C. and before the discovery of genetic inheritance by G. Mendel in 1865, is extremely sketchy and inaccurate. That discovery was the start of bioinformatics history. Gregor Mendel is known as the "Father of Genetics". He experimented on the cross-fertilization of different colors of the same species. He carefully recorded and analyzed the data. Mendel illustrated that the inheritance of traits could be more easily explained if it was controlled by factors passed down from generation to generation.

The understanding of genetics has advanced remarkably in the last thirty years. In 1972, Paul Berg made the first recombinant DNA molecule using ligase. In that same year, Stanley Cohen, Annie Chang and Herbert Boyer produced the first recombinant DNA organism. In 1973, two important things happened in the field of genomics. The advancement of computing in the 1960s and 1970s resulted in the basic methodology of bioinformatics. However, it was in the 1990s, when the Internet arrived, that the full-fledged bioinformatics field was born.

Here are some of the major events in bioinformatics over the last several decades. Some of the events listed occurred long before the term "bioinformatics" was coined.

BioInformatics Events

1665  Robert Hooke published Micrographia and described the cellular structure of cork. He also described microscopic examinations of fossilized plants and animals, comparing their microscopic structure to that of the living organisms they resembled. He argued for an organic origin of fossils, and suggested a plausible mechanism for their formation.
1683  Antoni van Leeuwenhoek discovered bacteria.
1686  John Ray, in his book "Historia Plantarum", catalogued and described 18,600 kinds of plants. His book gave the first definition of species based upon common descent.
1843  Richard Owen elaborated the distinction of homology and analogy.
1864  Ernst Haeckel (Häckel) outlined the essential elements of modern zoological classification.
1865  Gregor Mendel (1822-1884), Austria, established the theory of genetic inheritance.
1902  The chromosome theory of heredity is proposed by Sutton and Boveri, working independently.
1962  Pauling's theory of molecular evolution.
1905  The word "genetics" is coined by William Bateson.
1913  First ever linkage map created by Columbia undergraduate Alfred Sturtevant (working with T.H. Morgan).
1930  Tiselius, Uppsala University, Sweden. A new technique, electrophoresis, is introduced by Tiselius for separating proteins in solution. "The moving-boundary method of studying the electrophoresis of proteins" (published in Nova Acta Regiae Societatis Scientiarum Upsaliensis, Ser. IV, Vol. 7, No. 4).
1946  Genetic material can be transferred laterally between bacterial cells, as shown by Lederberg and Tatum.

1952  Alfred Day Hershey and Martha Chase proved that DNA alone carries genetic information. This was proved on the basis of their bacteriophage research.
1961  Sidney Brenner, François Jacob and Matthew Meselson identify messenger RNA.
1965  Margaret Dayhoff's Atlas of Protein Sequences.
1970  Needleman-Wunsch algorithm.
1977  DNA sequencing and software to analyze it (Staden).
1981  Smith-Waterman algorithm developed.
1981  The concept of a sequence motif (Doolittle).
1982  GenBank Release 3 made public.
1982  Phage lambda genome sequenced.
1983  Sequence database searching algorithm (Wilbur-Lipman).
1985  FASTP/FASTN: fast sequence similarity searching.
1988  National Center for Biotechnology Information (NCBI) created at NIH/NLM.
1988  EMBnet network for database distribution.
1990  BLAST: fast sequence similarity searching.
1991  EST: expressed sequence tag sequencing.
1993  Sanger Centre, Hinxton, UK.
1994  EMBL European Bioinformatics Institute, Hinxton, UK.
1995  First bacterial genomes completely sequenced.
1996  Yeast genome completely sequenced.
1997  PSI-BLAST.
1998  Worm (multicellular) genome completely sequenced.
1999  Fly genome completely sequenced.
2000  Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL. The large-scale organization of metabolic networks. Nature 2000 Oct 5;407(6804):651-4.
2000  The genome for Pseudomonas aeruginosa (6.3 Mbp) is published.
2000  The A. thaliana genome (100 Mb) is sequenced.
2001  The human genome (3 Giga base pairs) is published.

DEFINITIONS RELATED TO BIOINFORMATICS:

Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modelling and computational simulation techniques to the study of biological, behavioral, and social systems.

Genomics: Genomics is any attempt to analyze or compare the entire genetic complement of a species. It is, of course, possible to compare genomes by comparing more or less representative subsets of genes within genomes.

Proteomics: It is the study of proteins: their location, structure and function.

Pharmacogenomics: It is the application of genomic approaches and technologies to the identification of drug targets. In short, pharmacogenomics is using genetic information to predict whether a drug will help make a patient well or sick.

Pharmacogenetics: It is the study of how the actions of and reactions to drugs vary with the patient's genes.

Chemical informatics: Computer-assisted storage, retrieval and analysis of chemical information, from data to chemical knowledge.

Comparative genomics: The study of human genetics by comparisons with model organisms such as mice, the fruit fly and the bacterium E. coli.

Pharmacoinformatics: It concentrates on the aspects of bioinformatics dealing with drug discovery.

Biophysics: An interdisciplinary field which applies techniques from the physical sciences to understanding biological structure and function.

AIMS OF BIOINFORMATICS:

The aims of bioinformatics include:

• First, at its simplest, bioinformatics organises data in a way that allows researchers to access existing information and to submit new entries as they are produced, e.g. the Protein Data Bank for 3D macromolecular structures. While data curation is an essential task, the information stored in these databases is essentially useless until analysed. Thus the purpose of bioinformatics extends much further.
• The second aim is to develop tools and resources that aid in the analysis of data. For example, having sequenced a particular protein, it is of interest to compare it with previously characterised sequences. This needs more than just a simple text-based search, and programs such as FASTA and PSI-BLAST must consider what comprises a biologically significant match. Development of such resources dictates expertise in computational theory as well as a thorough understanding of biology.

• The third aim is to use these tools to analyse the data and interpret the results in a biologically meaningful manner. Traditionally, biological studies examined individual systems in detail and frequently compared them with a few that are related. In bioinformatics, we can now conduct global analyses of all the available data with the aim of uncovering common principles that apply across many systems and highlighting novel features.

INFORMATION NETWORKS:

Computational tools and databases are essential to the management of the exponentially growing volume of biological data and to the identification of suitable patterns in it. The National Center for Biotechnology Information (NCBI) in the United States and the European Bioinformatics Institute (EBI) in England are two main life science servers responsible for dealing with this data.

National Center for Biotechnology Information (NCBI)

In November 1988, the US Senate recognised the need for computerised data processing in the biomedical and biochemical field and passed legislation that helped to establish NCBI at the National Library of Medicine (NLM). NCBI's four main tasks are:

• To create automated systems that can analyse and store data pertaining to molecular biology, genetics and biochemistry.
• To facilitate usage of the databases and analytical software available to the scientific community.
• To coordinate worldwide efforts to gather biological data.

• To conduct research in computerised analysis of structure-function relationships for key biological molecules.

Databases supported by NCBI:

• Protein sequence: These are experimentally sequenced proteins and translated nucleotide sequences from nucleotide libraries.
  1. Redundant protein sequence databases, e.g. PIR's complete database, which consists of PIR1+PIR2+PIR3.
• Nucleotide sequence: These are DNA and RNA sequences derived from less automated sequencing projects.
  1. Redundant nucleotide sequence databases, e.g. dbEST.

European Bioinformatics Institute (EBI):

EBI is an outstation of the European Molecular Biology Laboratory (EMBL) located at Hinxton, England. Its tasks and goals are similar to those of NCBI and include:

• Bioinformatics tracking technology
• Research and development of bioinformatics software
• Training and supporting its subscribers
• Relevant bioinformatics services

Databases at EBI:

• EMBL database
• SWISS-PROT database
• dbEST & dbSTS
• PDB
• NDB

• IMGT database

Skills Required to become a successful Bioinformatician:

As mentioned earlier, the bioinformatics profession requires a wide range of skills, and it is not possible to learn all of them. Here are the topics that are essential to enter this profession.

1. Molecular Biology
2. Central Dogma of molecular biology
3. Experience with one or more molecular biology software packages. Some of the molecular biology packages are GCG, BLAST, FASTA etc.
4. Learn Unix or Linux. These days Unix or Linux (free, open source) is extensively used in biotechnology for its robustness and for the tools and software available for this platform, so it is very important to learn these operating systems.
5. Computer programming languages like C/C++, Perl or Python, Java and HTML should be known by a bioinformatician.
6. Database Management Systems. Learn Oracle and MySQL (free database server), which are extensively used to store gigabytes of biotech data for further analysis.
7. Learn to use sequence analysis and molecular modeling software.

BIOLOGICAL DATABASES:

A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system. A simple database might be a single file containing many records, each of which includes the same set of information. For example, a record associated with a nucleotide sequence database typically contains information such as the contact name, the input sequence with a description of the type of molecule, the scientific name of the source organism from which it was isolated, and, often, literature citations associated with the sequence.

For researchers to benefit from the data stored in a database, two additional requirements must be met:

1. Easy access to the information, and

2. A method for extracting only that information needed to answer a specific biological question.

Currently, a lot of bioinformatics work is concerned with the technology of databases. These databases include both "public" repositories of gene data like GenBank or the Protein Data Bank (the PDB) and private databases like those used by research groups involved in gene mapping projects or those held by biotech companies. Making such databases accessible via open standards like the Web is very important, since consumers of bioinformatics data use a range of computer platforms: from the more powerful and forbidding UNIX boxes favoured by the developers and curators to the far friendlier Macs often found populating the labs of computer-wary biologists. DNA and RNA are the nucleic acids that store the

hereditary information about an organism. These macromolecules have a fixed structure, which can be analyzed by biologists with the help of bioinformatics tools and databases.

Protein Structure Database:
1. Protein Data Bank (PDB)
2. Structural Classification of Proteins
3. Class, Architecture, Topology and Homologous superfamily

Protein Sequence Database
• Primary Database
1. Swiss-Prot
2. Protein Information Resource (PIR)
• Secondary Database
1. Prosite
2. Protein Family (PFAM)

Nucleotide Sequence Database
1. GenBank
2. DNA Database of Japan (DDBJ)
3. European Molecular Biology Laboratory (EMBL)

Composite Databases
1. National Center for Biotechnology Information (NCBI)
2. Sequence Retrieval System (SRS)
3. Catalogue of Databases (DBCAT)

Other Databases
1. Receptor-Ligand Database (ReliBase)
2. Restriction Enzyme Database (REBASE)
3. G-Protein Coupled Receptor Database (GPCRDB)
4. Nuclear Receptor Database (NucleaRDB)
5. Literature Database - PubMed
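As a rough illustration of the record-and-query idea behind the databases listed above, the sketch below models a nucleotide record with the fields mentioned earlier (contact name, sequence, molecule type, source organism, citations) and extracts only the records that answer one question. The class name, field names, the helper `find_by_organism` and the accession strings are all hypothetical; real databases such as GenBank define their own record formats.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceRecord:
    """Hypothetical, simplified record; not the schema of any real database."""
    accession: str
    organism: str          # scientific name of the source organism
    molecule_type: str     # e.g. "DNA" or "mRNA"
    sequence: str
    contact: str = ""
    citations: list = field(default_factory=list)


def find_by_organism(records, organism):
    """Return only the records whose source organism matches (case-insensitive)."""
    wanted = organism.lower()
    return [r for r in records if r.organism.lower() == wanted]


# Made-up example entries, used only to exercise the query.
records = [
    SequenceRecord("DEMO0001", "Homo sapiens", "DNA", "ATGGCC"),
    SequenceRecord("DEMO0002", "Mus musculus", "mRNA", "AUGGCU"),
    SequenceRecord("DEMO0003", "Homo sapiens", "DNA", "TTGACA"),
]

human = find_by_organism(records, "homo sapiens")
print([r.accession for r in human])
```

Every record carries the same set of fields, which is what makes a simple filter like this possible; a production database would add indexing so that the query does not have to scan every record.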

Genbank: GenBank (Genetic Sequence Databank) is one of the fastest growing repositories of known genetic sequences. It has a flat file structure, that is, an ASCII text file readable by both humans and computers. In addition to sequence data, GenBank files contain information like accession numbers and gene names, phylogenetic classification and references to published literature. There are approximately 191,400,000 bases and 183,000 sequences as of June 1994.

EMBL: The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences collected from the scientific literature and patent applications and directly submitted from researchers and sequencing groups. Data collection is done in collaboration with GenBank (USA) and the DNA Database of Japan (DDBJ). The database currently doubles in size every 18 months and currently (June 1994) contains nearly 2 million bases from 182,615 sequence entries.

SwissProt: This is a protein sequence database that provides a high level of integration with other databases and also has a very low level of redundancy (meaning that few identical sequences are present in the database).

PROSITE: The PROSITE dictionary of sites and patterns in proteins is prepared by Amos Bairoch at the University of Geneva.

EC-ENZYME: The 'ENZYME' data bank contains the following data for each type of characterized enzyme for which an EC number has been provided: EC number, recommended name, alternative names, catalytic activity, cofactors, pointers to the SWISS-PROT entries that correspond to the

enzyme, and pointers to diseases associated with a deficiency of the enzyme.

RCSB-PDB: The RCSB PDB contains 3-D biological macromolecular structure data from X-ray crystallography, NMR, and Cryo-EM. It is operated by Rutgers, The State University of New Jersey, and the San Diego Supercomputer Center at the University of California, San Diego.

GDB: The GDB Human Genome Data Base supports biomedical research, clinical medicine, and professional and scientific education by providing for the storage and dissemination of data about genes and other DNA markers, map location, genetic disease and locus information, and bibliographic information.

OMIM: The Mendelian Inheritance in Man data bank (MIM) is prepared by Victor McKusick with the assistance of Claire A. Francomano and Stylianos E. Antonarakis at Johns Hopkins University.

PIR-PSD: PIR (Protein Information Resource) produces and distributes the PIR-International Protein Sequence Database (PSD). It is the most comprehensive and expertly annotated protein sequence database. The PIR serves the scientific community through on-line access, distributing magnetic tapes, and performing off-line sequence identification services for researchers. Release 40.00 (March 31, 1994): 67,423 entries,

19,747,297 residues.

Protein sequence databases are classified as primary, secondary and composite depending upon the content stored in them. PIR and Swiss-Prot are primary databases that contain protein sequences as 'raw' data. Secondary databases (like Prosite) contain information derived from protein sequences. Primary databases are combined and filtered to form non-redundant composite databases.

Genethon Genome Databases:
PHYSICAL MAP: computation of the human genetic map using DNA fragments in the form of YAC contigs.
GENETIC MAP: production of micro-satellite probes and the localization of chromosomes, to create a genetic map to aid in the study of hereditary diseases.
GENEXPRESS (cDNA): catalogue of the transcripts required for protein synthesis obtained from specific tissues, for example neuromuscular tissues.

MGD: The Mouse Genome Databases: MGD is a comprehensive database of genetic information on the laboratory mouse. This initial release contains the following kinds of information: Loci (over 15,000 current and withdrawn symbols), Homologies (1300 mouse loci, 3500 loci from 40 mammalian species), Probes and Clones (about 10,000), PCR primers (currently 500 primer pairs), Bibliography (over 18,000 references), Experimental data (from 2400 published articles).

ACeDB (A Caenorhabditis elegans Database): Contains data from the Caenorhabditis Genetics Center (funded by the NIH National Center for Research Resources), the C. elegans genome project (funded by the MRC and NIH), and the worm community. ACeDB is also the name of the generic genome database software in use by an increasing number of genome projects. The software, as well as the C. elegans data, can be obtained via ftp. ACeDB databases are available for the following species: C. elegans, Human Chromosome 21, Human Chromosome X, Drosophila melanogaster, mycobacteria,

Arabidopsis, soybeans, rice, maize, grains, forest trees, Solanaceae, Aspergillus nidulans, Bos taurus, Gossypium hirsutum, Neurospora crassa, Saccharomyces cerevisiae, Schizosaccharomyces pombe, and Sorghum bicolor.

MEDLINE: MEDLINE is NLM's premier bibliographic database covering the fields of medicine, nursing, dentistry, veterinary medicine, and the preclinical sciences. Journal articles are indexed for MEDLINE, and their citations are searchable, using NLM's controlled vocabulary, MeSH (Medical Subject Headings). MEDLINE contains all citations published in Index Medicus, and corresponds in part to the International Nursing Index and the Index to Dental Literature. Citations include the English abstract when published with the article (approximately 70% of the current file).

For researchers to benefit from all this information, however, two additional things were required: 1) ready access to the collected pool of sequence information and 2) a way to extract from this pool only those sequences of interest to a given researcher. Simply collecting, by hand, all necessary sequence information of interest to a given project from published journal articles quickly became a formidable task. After collection, the organization and analysis of this data still remained. It could take weeks to months for a researcher to search sequences by hand in order to find related genes or proteins.

Computer technology has provided the obvious solution to this problem. Not only can computers be used to store and organize sequence information into databases, but they can also be used to analyze sequence data rapidly. The evolution of computing power and storage capacity has, so far, been able to outpace the increase in sequence information being created. Theoretical scientists have derived new and

sophisticated algorithms which allow sequences to be readily compared using probability theories. These comparisons become the basis for determining gene function, developing phylogenetic relationships and simulating protein models. The physical linking of a vast array of computers in the 1970s provided a few biologists with ready access to the expanding pool of sequence information. This web of connections, now known as the Internet, has evolved and expanded so that nearly everyone has access to this information and the tools necessary to analyze it.

Databases of existing sequencing data can be used to identify homologues of new molecules that have been amplified and sequenced in the lab. The property of sharing a common ancestor, homology, can be a very powerful indicator in bioinformatics.

Sequence Analysis:

In bioinformatics, a sequence alignment is a way of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that residues with identical or similar characters are aligned in successive columns.
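To make the idea of inserting gaps so that similar residues line up in columns more concrete, here is a minimal sketch of global alignment by dynamic programming in the style of the Needleman-Wunsch method. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration; real tools use substitution matrices and more elaborate gap penalties, and when several alignments tie for the best score this sketch returns just one of them.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, row_a, row_b = needleman_wunsch("GATTACA", "GCATGCG")
print(best)
print(row_a)
print(row_b)
```

The two returned rows have equal length and correspond to the rows of the alignment matrix described above, with "-" marking each inserted gap.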

[Figure: A sequence alignment, produced by ClustalW, between two human zinc finger proteins identified by GenBank accession number.]

If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps as indels (that is, insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another. In protein sequence alignment, the degree of similarity between amino acids occupying a particular position in the sequence can be interpreted as a rough measure of how conserved a particular region or sequence motif is among lineages. The absence of substitutions, or the presence of only very conservative substitutions (that is, the substitution of amino acids whose side chains have similar biochemical properties) in a particular region of the sequence, suggests that this region has structural or functional importance. Although DNA and RNA nucleotide bases are more similar to each other than are amino acids, the conservation of base pairing can indicate a similar functional or structural role. Sequence alignment can also be used for non-biological sequences, such as those present in natural language or in financial data.
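The interpretation above (identities, substitutions, and gaps read as indels) can be sketched in a few lines for a pairwise alignment that has already been computed. The function names are hypothetical, and the identity measure used here (identical columns divided by columns without a gap) is only one of several conventions in use.

```python
def classify_columns(row_a, row_b):
    """Label each column: '|' identity, '.' substitution, ' ' gap (indel)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    labels = []
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            labels.append(" ")
        elif x == y:
            labels.append("|")
        else:
            labels.append(".")
    return "".join(labels)


def percent_identity(row_a, row_b):
    """Identical columns as a percentage of the columns that contain no gap."""
    labels = classify_columns(row_a, row_b)
    compared = labels.count("|") + labels.count(".")
    if compared == 0:
        return 0.0
    return 100.0 * labels.count("|") / compared


row_a = "ACGT-A"
row_b = "ACCTGA"
print(row_a)
print(classify_columns(row_a, row_b))
print(row_b)
print(percent_identity(row_a, row_b))
```

For protein alignments, a fuller version would also distinguish conservative from non-conservative substitutions, which requires a table of amino acid similarities that is not shown here.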

Sequence alignments can be stored in a wide variety of text-based file formats, many of which were originally developed in conjunction with a specific alignment program or implementation. Most web-based tools allow a number of input and output formats, such as FASTA format and GenBank format; however, the use of specific tools authored by individual research laboratories can be complicated by limited file format compatibility. A general conversion program is available at DNA Baser or Readseq (for Readseq you must upload your files on a foreign server and provide your email address).

Computational approaches to sequence alignment generally fall into two categories:
1. global alignments
2. local alignments

Calculating a global alignment is a form of global optimization that "forces" the alignment to span the entire length of all query sequences. By contrast, local alignments identify regions of similarity within long sequences that are often widely divergent overall. Local alignments are often preferable, but can be more difficult to calculate because of the additional challenge of identifying the regions of similarity. A variety of computational algorithms have been applied to the sequence alignment problem, including slow but formally optimizing methods like dynamic programming and efficient heuristic or probabilistic methods designed for large-scale database search.

[Figure: Illustration of global and local alignments demonstrating the 'gappy' quality of global alignments that can occur if sequences are insufficiently similar.]

Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot end in gaps.) A general global alignment technique is called the

Needleman-Wunsch algorithm and is based on dynamic programming. Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The Smith-Waterman algorithm is a general local alignment method also based on dynamic programming. With sufficiently similar sequences, there is no difference between local and global alignments.

Pairwise alignment:

Pairwise sequence alignment methods are used to find the best-matching piecewise (local) or global alignments of two query sequences. Pairwise alignments can only be used between two sequences at a time, but they are efficient to calculate and are often used for methods that do not require extreme precision (such as searching a database for sequences with high homology to a query). The three primary methods of producing pairwise alignments are dot-matrix methods, dynamic programming, and word methods; however, multiple sequence alignment techniques can also align pairs of sequences. Although each method has its individual strengths and weaknesses, all three pairwise methods have difficulty with highly repetitive sequences of low information content, especially where the number of repetitions differs in the two sequences to be aligned. One way of quantifying the utility of a given pairwise alignment is the 'maximum unique match' (MUM), or the longest subsequence that occurs in both query sequences. Longer MUM sequences typically reflect closer relatedness.

Multiple sequence alignment:

Multiple sequence alignment is an extension of pairwise alignment to incorporate more than two sequences at a time. Multiple alignment methods try to align all of the sequences in a given query set. Multiple alignments are often used in identifying conserved sequence regions across a group of sequences hypothesized to be evolutionarily related. Such conserved sequence motifs can be used in conjunction with structural and mechanistic information to locate the catalytic active sites of enzymes. Alignments are also used to aid in establishing evolutionary relationships by constructing phylogenetic trees. Multiple sequence alignments are computationally difficult to produce, and most formulations of the problem lead to NP-complete combinatorial optimization problems. Nevertheless, the utility of these alignments in bioinformatics has led to the development of a variety of methods suitable for aligning three or more sequences.

BioInformatics Tools

Bioinformatics tools are the software programs for saving, retrieving and analysing biological data and extracting information from them. Factors that must be taken into consideration when designing these tools are:

• The end user (the biologist) may not be a frequent user of computer technology, and thus the tools should be very user friendly.
• These software tools must be made available over the internet, given the global distribution of the scientific research community.

The bioinformatics tools may be categorized into the following categories:

• Homology and Similarity Tools
• Protein Function Analysis
• Structural Analysis

• Sequence Analysis

Homology and Similarity Tools

The term homology implies a common evolutionary relationship between two traits, whether they are DNA sequences or bristle patterns on a fly's nose. Homologous sequences are sequences that are related by divergence from a common ancestor. Thus the degree of similarity between two sequences can be measured, while their homology is a case of being either true or false. This set of tools can be used to identify similarities between novel query sequences of unknown structure and function and database sequences whose structure and function have been elucidated.

Protein Function Analysis

Function analysis is the identification and mapping of all functional elements (both coding and non-coding) in a genome. This group of programs allows you to compare your protein sequence to the secondary (or derived) protein databases that contain information on motifs, signatures and protein domains. Highly significant hits against these different pattern databases allow you to approximate the biochemical function of your query protein.

Structural Analysis

This set of tools allows you to compare structures with the known structure databases. The function of a protein is more directly a consequence of its structure than of its sequence, with structural homologs tending to share functions. The determination of a protein's 2D/3D structure is crucial in the study of its function.

Sequence Analysis

This set of tools allows you to carry out further, more detailed analysis on your query sequence, including evolutionary analysis, identification of mutations, hydropathy regions, CpG islands and compositional biases. The identification of these and other biological properties are all clues that aid the search to elucidate the specific function of your sequence.
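Two of the sequence-analysis measures mentioned above, compositional bias and CpG islands, rest on simple counts. The sketch below computes GC content and the commonly used observed/expected CpG ratio, (number of CG dinucleotides × sequence length) / (number of C × number of G). The function names are hypothetical, and deciding whether a region actually qualifies as a CpG island needs additional thresholds and a sliding window, which are not shown.

```python
def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_observed_expected(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


example = "CGCGAATT"
print(gc_content(example))
print(cpg_observed_expected(example))
```

A ratio well above what the base composition alone would predict suggests that CG dinucleotides are enriched in the region, which is the signal CpG island finders look for.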

Bioinformatics Tools
BLAST
The Basic Local Alignment Search Tool (BLAST), for comparing gene and protein sequences against others in public databases, now comes in several types including PSI-BLAST, PHI-BLAST, and BLAST 2 sequences. Specialized BLASTs are also available for human, microbial, malaria, and other genomes, as well as for vector contamination, immunoglobulins, and tentative human consensus sequences.

FASTA
A database search tool used to compare a nucleotide or peptide sequence to a sequence database. The program is based on the rapid sequence algorithm described by Lipman and Pearson. It was the first widely used algorithm for database similarity searching. The program looks for optimal local alignments by scanning the sequence for small matches called "words". Initially, the scores of segments in which there are multiple word hits are calculated ("init1"). Later the scores of several segments may be summed to generate an "initn" score. An optimized alignment that includes gaps is shown in the output as "opt". The sensitivity and speed of the search are inversely related and controlled by the "k-tup" variable, which specifies the size of a "word".

EMBOSS
EMBOSS (The European Molecular Biology Open Software Suite) is a free, open source software analysis package specially developed for the needs of the molecular biology user community. Within EMBOSS you will find around 100 programs (applications) for sequence alignment, database searching with sequence patterns, protein motif identification and domain analysis, nucleotide sequence pattern analysis, codon usage analysis for small genomes, and much more.

Clustalw
ClustalW is a general purpose multiple sequence alignment program for DNA or proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences, calculates the best match for the selected sequences, and lines them up so that the identities, similarities and differences can be seen.
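The word-scanning idea in the FASTA description above can be sketched in a few lines of Python. This is only an illustration of the word-lookup stage under simplifying assumptions, not the FASTA program itself: it counts exact shared words per diagonal and does not compute the "init1", "initn" or "opt" scores. The names word_index, diagonal_hits and the parameter ktup are chosen here for illustration.

```python
from collections import defaultdict


def word_index(seq, ktup):
    """Map every word of length ktup in seq to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        index[seq[i:i + ktup]].append(i)
    return index


def diagonal_hits(query, subject, ktup):
    """Count shared words per diagonal (subject offset minus query offset)."""
    index = word_index(query, ktup)
    hits = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            hits[j - i] += 1
    return dict(hits)


# Hypothetical toy sequences, for illustration only.
print(diagonal_hits("ACGTACGT", "TTACGTAC", 4))  # {-2: 2, 2: 3}
print(diagonal_hits("ACGTACGT", "TTACGTAC", 6))  # {2: 1}
```

The two calls show the trade-off described in the text: a smaller word size finds more hits (greater sensitivity, more work), while a larger word size finds fewer hits and is faster. A diagonal with many word hits marks a segment worth scoring further.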

RasMol
RasMol is a powerful research tool for displaying the structure of DNA, proteins, and smaller molecules. Protein Explorer, a derivative of RasMol, is an easier-to-use program.

Application Programs
JAVA in Bioinformatics: Owing to its platform independence, Java is emerging as a key player in bioinformatics. Physiome Sciences' computer-based biological simulation technologies and Bioinformatics Solutions' PatternHunter are two examples of the growing adoption of Java in bioinformatics.
Perl in Bioinformatics: Perl is also used in the processing of biological data. One example of a Perl project is the BioPerl project.

APPLICATIONS OF BIOINFORMATICS
Bioinformatics is the use of IT in biotechnology for data storage, data warehousing and the analysis of DNA sequences. Bioinformatics requires knowledge of many branches, such as biology, mathematics, computer science, and the laws of physics and chemistry, and of course a sound knowledge of IT to analyze biotech data. Bioinformatics is not limited to computing data; in reality it can be used to solve many biological problems and to find out how living things work. It is the comprehensive application of mathematics (e.g., probability and statistics), science (e.g., biochemistry), and a core set of problem-solving methods (e.g., computer algorithms) to the understanding of living systems.
The science of bioinformatics has many beneficial uses in the modern-day world. These include the following:
• Molecular medicine
• More drug targets
• Personalised medicine
• Preventive medicine
• Gene therapy
• Microbial genome applications
• Waste cleanup
• Climate change
• Alternative energy sources
• Biotechnology
• Antibiotic resistance
• Forensic analysis of microbes
• The reality of bioweapon creation
• Evolutionary studies
• Agriculture
• Crops
• Insect resistance
• Improve nutritional quality
• Grow crops in proper soils and that are drought resistant
• Animals
• Comparative studies

• Major Application I: Designing Drugs
  - Understanding How Structures Bind Other Molecules (Function)
  - Designing Inhibitors
  - Docking, Structure Modeling
• Major Application II: Finding Homologs
• Major Application III: Overall Genome Characterization
  - Overall Occurrence of a Certain Feature in the Genome, e.g. how many kinases in Yeast

  - Compare Organisms and Tissues: Expression levels in Cancerous vs Normal Tissues

• Sequence mapping of biomolecules (DNA, RNA, Proteins).
• Identification of nucleotide sequences of functional genes.
• Finding of sites that can be cut by restriction enzymes.
• Prediction of functional gene products.
• To trace the evolutionary tree of genes.
• For the prediction of the 3-D structure of proteins.
• Molecular modeling of biomolecules.
• Designing of drugs for medical treatment.
• Handling of vast biological data, which is otherwise not possible.
• Development of models for the functioning of various cells, tissues and organs.
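One of the uses listed above, finding sites that can be cut by restriction enzymes, lends itself to a short Python sketch. The recognition sequences shown for EcoRI, BamHI and HindIII are the standard ones; the function name, the dictionary name and the example DNA string are hypothetical and used only for illustration. The sketch reports where each recognition sequence starts, not the exact cut position within it.

```python
# Recognition sequences of three well-known enzymes (all palindromic,
# so scanning one strand is sufficient).
RECOGNITION_SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def find_restriction_sites(dna, sites=RECOGNITION_SITES):
    """Return {enzyme: [0-based start positions of its recognition sequence]}."""
    dna = dna.upper()
    found = {}
    for enzyme, site in sites.items():
        positions = []
        start = dna.find(site)
        while start != -1:
            positions.append(start)
            start = dna.find(site, start + 1)  # continue after this hit
        found[enzyme] = positions
    return found


# Hypothetical example sequence, for illustration only.
example_dna = "ttGAATTCaaGGATCCccGAATTC"
print(find_restriction_sites(example_dna))
# {'EcoRI': [2, 18], 'BamHI': [10], 'HindIII': []}
```

Established packages offer the same kind of search over large enzyme collections; this version only shows the underlying idea of pattern lookup in a sequence.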
