
Introduction to Bioinformatics

What is Bioinformatics?

Bioinformatics is a relatively new interdisciplinary field that integrates computer science, mathematics, biology, and information technology to manage, analyze, and understand biological, biochemical, and biophysical information. Bioinformatics is a computational science and a subset of the larger field of Computational Biology.

What is Bioinformatics?

Bioinformatics is the use of computers to study biology
Bioinformatics is the science of using information to understand biology
Bioinformatics is the integration of information technology (IT) and biology
Bioinformatics is the development of computational methods for studying the structure, function, and evolution of genes, proteins, and whole genomes

Some Terminology
The cell is the primary unit of life
A cell consists of molecules, chemical reactions, and a copy of the genome for that organism
All life on this planet depends on three types of molecules: DNA, RNA, and proteins

Some Terminology

DNA
  Holds information on how the cell works
RNA
  Transfers short pieces of information to different parts of the cell
  Provides templates for protein synthesis
Proteins
  Form enzymes, send signals to other cells, and regulate gene activity
  Form the body's major components (e.g. hair, skin, etc.)

DNA - Deoxyribonucleic Acid


Genetic material
Consists of two long strands
Each strand is made of:
  Phosphates
  Sugar (deoxyribose)
  Nucleotide bases: A (adenine), G (guanine), C (cytosine), T (thymine)

DNA Double Helix Structure

The Central Dogma of Molecular Biology


DNA --(transcription)--> RNA --(translation)--> Protein

Information is transferred from DNA (the information storage molecule) to RNA (the information transfer molecule) to a specific protein (the functional product).

More Terminology

Transcription of DNA

DNA is transcribed into RNA
RNA exists as a single strand, and can also form double-helical regions
RNA consists of A, C, G, and U (uracil)
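At the sequence level, transcription can be sketched as replacing every T with U. The Python snippet below is a minimal illustration under that simplification: it assumes the input is the DNA coding strand, and the function name `transcribe` and the example sequence are hypothetical, not taken from the slides.

```python
def transcribe(dna: str) -> str:
    """Return the RNA sequence for a DNA coding strand (T replaced by U)."""
    dna = dna.upper()
    invalid = set(dna) - set("ACGT")
    if invalid:
        raise ValueError(f"unexpected characters in DNA sequence: {sorted(invalid)}")
    return dna.replace("T", "U")


example_dna = "ATGGCCTTTTAA"  # hypothetical sequence, for illustration only
example_rna = transcribe(example_dna)
print(example_rna)
```

In the cell, RNA polymerase reads the complementary template strand, so this string replacement models only the resulting sequence, not the mechanism.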
Types of RNA

Messenger RNA (mRNA)
Transfer RNA (tRNA)
Ribosomal RNA (rRNA)

More Terminology

Translation of Messenger RNA (mRNA):

mRNA is translated into protein

Proteins:

Linear polymers built from amino acids

The transfer of information from DNA to specific protein via RNA takes place according to the genetic code.

The RNA sequence is divided into blocks of three letters
Each block is called a codon
Each codon corresponds to a specific amino acid
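The codon idea can be sketched in a few lines of Python. The snippet below splits an RNA string into three-letter blocks and looks each one up in a deliberately partial codon table. The names `CODON_TABLE`, `split_codons`, and `translate`, and the example sequence, are assumptions made for illustration; the full genetic code has many more entries than shown.

```python
# Partial codon table for illustration only; the full genetic code has 64 entries.
CODON_TABLE = {
    "AUG": "Met",  # also the usual start codon
    "UUU": "Phe",
    "UUC": "Phe",
    "GCC": "Ala",
    "GGC": "Gly",
    "UAA": "Stop",
    "UAG": "Stop",
    "UGA": "Stop",
}


def split_codons(rna: str) -> list[str]:
    """Split an RNA sequence into consecutive three-letter codons.

    Trailing bases that do not fill a whole codon are dropped.
    """
    usable = len(rna) - len(rna) % 3
    return [rna[i:i + 3] for i in range(0, usable, 3)]


def translate(rna: str) -> list[str]:
    """Translate codons to amino acids until a stop codon is reached."""
    protein = []
    for codon in split_codons(rna):
        amino_acid = CODON_TABLE.get(codon, "???")  # "???" = not in this partial table
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein


print(split_codons("AUGGCCUUUUAA"))
print(translate("AUGGCCUUUUAA"))
```

For the hypothetical input above, the codons are AUG, GCC, UUU, and UAA, so translation yields Met, Ala, Phe and then stops.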

More Terminology

Four different nucleotides are used to build DNA (A, G, C, T) and RNA (A, G, C, U)
20 different amino acids are used in protein synthesis
Four nucleotides can be arranged in 4 * 4 * 4 = 64 different combinations of three, so there are 64 different codons
Some codons are redundant (several codons encode the same amino acid) and some have a special function: terminating the translation process
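The count of 64 codons can be checked directly by enumerating every three-letter combination of the four RNA bases. This is a small sketch; the variable names are illustrative, and the three stop codons listed are those of the standard genetic code.

```python
from itertools import product

RNA_BASES = "ACGU"
STOP_CODONS = {"UAA", "UAG", "UGA"}  # stop codons of the standard genetic code

# Every ordered triplet of bases, e.g. "AAA", "AAC", ..., "UUU".
all_codons = ["".join(triplet) for triplet in product(RNA_BASES, repeat=3)]
sense_codons = [codon for codon in all_codons if codon not in STOP_CODONS]

print(len(all_codons))    # 4 * 4 * 4 = 64
print(len(sense_codons))  # 61 codons remain to encode amino acids
```

With 61 amino-acid-encoding codons and only 20 amino acids, several codons must map to the same amino acid, which is the redundancy mentioned above.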

Why is bioinformatics important?

Traditionally, research was carried out entirely in the experimental laboratory, but the huge increase in data in the genomic era has created a need to incorporate computers into the research process.
There are three central biological processes around which bioinformatics tools must be developed:
DNA sequence determines protein sequence
Protein sequence determines protein structure
Protein structure determines protein function

Major research areas

Sequence analysis - A comparison of genes within a species or between different species can show similarities between protein functions, or relations between species.
Comparison of sequences to find similar and dissimilar regions (sequence alignment)
Identification of gene structures, reading frames, distributions of introns and exons, and regulatory elements
Revealing the evolution and genetic diversity of organisms
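Sequence alignment can be made concrete with a small dynamic-programming sketch. The function below computes a Needleman-Wunsch global alignment score with a linear gap penalty. The slides do not name a specific alignment algorithm, so this choice, the function name `global_alignment_score`, and the scoring values (+1 match, -1 mismatch, -1 gap) are assumptions made for illustration.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against an empty sequence costs one gap per letter.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = score[i - 1][j - 1] + pair  # align a[i-1] with b[j-1]
            up = score[i - 1][j] + gap             # gap in b
            left = score[i][j - 1] + gap           # gap in a
            score[i][j] = max(diagonal, up, left)
    return score[-1][-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("ACGT", "AGT"))
```

Identical sequences score one point per position, while "ACGT" against "AGT" scores 2 (three matches and one gap). Production tools use substitution matrices and more elaborate gap penalties than this sketch.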

Computational evolutionary biology - Evolutionary biology is the study of the origin and descent of species, as well as their change over time. Informatics has assisted evolutionary biologists in several key ways; it has enabled researchers to:
trace the evolution of a large number of organisms by measuring changes in their DNA, rather than through physical taxonomy or physiological observations alone
build complex computational models of populations to predict the outcome of the system over time
track and share information on an increasingly large number of species and organisms

Prediction of protein structure - Protein structure prediction is another important application of bioinformatics. In the genomic branch of bioinformatics, homology is used to predict the function of a gene: if the sequence of gene A, whose function is known, is homologous to the sequence of gene B, whose function is unknown, one could infer that B may share A's function. MODELLER is a widely used program for homology modelling. The Protein Data Bank (PDB) is the database of 3D coordinates of proteins.

Drug Designing - Drug design is the approach of finding drugs by design, based on their biological targets. Computer-assisted drug design uses computational chemistry to discover, enhance, or study drugs and related biologically active molecules.

Phylogenetics - Predicting the genetic or evolutionary relationships among a set of organisms. Mitochondrial SNPs and microsatellites (DNA repeats) are commonly used markers in phylogenetics. MEGA and PAUP* are some of the important software packages. Maximum Parsimony and Maximum Likelihood are the most commonly used methods.

Biological databases: why?

The need for storing and communicating large datasets has grown
To make biological data available to scientists
To make biological data available in computer-readable form

Type of data

nucleotide sequences
protein sequences
protein sequence patterns or motifs
macromolecular 3D structure
gene expression data
metabolic pathways

Different classifications of databases

Primary or derived databases
  Primary databases: experimental results entered directly into the database
  Secondary databases: results of analysis of primary databases

Aggregate of many databases
  Links to other data items
  Combination of data
  Consolidation of data

Nucleotide sequence databases

EMBL, GenBank, and DDBJ are the three primary nucleotide sequence databases
EMBL: www.ebi.ac.uk/embl/
GenBank: www.ncbi.nlm.nih.gov/Genbank/
DDBJ: www.ddbj.nig.ac.jp

GenBank

An annotated collection of all publicly available nucleotide and protein sequences
Set up in 1979 at LANL (Los Alamos)
Maintained since 1992 by NCBI (Bethesda)

http://www.ncbi.nlm.nih.gov

EMBL Nucleotide Sequence DB

An annotated collection of all publicly available nucleotide and protein sequences
Created in 1980 at the European Molecular Biology Laboratory in Heidelberg
Maintained since 1994 by EBI (Cambridge)
http://www.ebi.ac.uk/embl.html

DDBJ - DNA Data Bank of Japan

An annotated collection of all publicly available nucleotide and protein sequences
Started in 1984 at the National Institute of Genetics (NIG) in Mishima
Still maintained at this institute by a team led by Takashi Gojobori
http://www.ddbj.nig.ac.jp

Other NCBI nucleic acids DBs

EST database: A collection of expressed sequence tags, or short, single-pass sequence reads from mRNA (cDNA).

GSS database: A database of genome survey sequences, or short, single-pass genomic sequences.

HomoloGene: A gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs.

HTG database: A collection of high-throughput genome sequences from large-scale genome sequencing centers, including unfinished and finished sequences.

SNPs database: A central repository for both single-base nucleotide substitutions and short deletion and insertion polymorphisms.

RefSeq: A database of non-redundant reference sequence standards, including genomic DNA contigs, mRNAs, and proteins for known genes. Multiple collaborations, both within NCBI and with external groups, support data-gathering efforts.

STS database: A database of sequence tagged sites, or short sequences that are operationally unique in the genome.

UniSTS: A unified, non-redundant view of sequence tagged sites (STSs).

UniGene: A collection of ESTs and full-length mRNA sequences organized into clusters, each representing a unique known or putative human gene annotated with mapping and expression information and cross-references to other sources.

Bioinformatics Tools

BLAST:
The Basic Local Alignment Search Tool (BLAST) compares gene and protein sequences against others in public databases. It now comes in several types, including PSI-BLAST, PHI-BLAST, and BLAST 2 Sequences. Specialized BLASTs are also available.

FASTA
A database search tool used to compare a nucleotide or peptide sequence to a sequence database. It was the first widely used algorithm for database similarity searching. The program looks for optimal local alignments by scanning the sequence for small matches called "words"
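The word-matching step described above can be sketched as an index lookup. The Python snippet below records where every length-k word occurs in a subject sequence, then reports each position where a word from the query also appears. It illustrates only the word-scanning idea and is not the FASTA program's actual implementation; the function names, the default `k = 3`, and the example sequences are assumptions.

```python
from collections import defaultdict


def index_words(sequence: str, k: int) -> dict[str, list[int]]:
    """Map every length-k word in the sequence to its start positions."""
    positions = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        positions[sequence[i:i + k]].append(i)
    return dict(positions)


def shared_words(query: str, subject: str, k: int = 3) -> list[tuple[int, int, str]]:
    """Return (query_pos, subject_pos, word) for every exact word match."""
    subject_index = index_words(subject, k)
    hits = []
    for i in range(len(query) - k + 1):
        word = query[i:i + k]
        for j in subject_index.get(word, []):
            hits.append((i, j, word))
    return hits


for hit in shared_words("GATTACA", "TTACAG"):
    print(hit)
```

In this hypothetical example all three hits have the same offset between query and subject positions (2), meaning they lie on one diagonal; runs of hits on a shared diagonal are the kind of region a word-based search would go on to score and extend.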

ClustalW
ClustalW is a general purpose multiple sequence alignment program for DNA or proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences, calculates the best match for the selected sequences, and lines them up so that the identities, similarities and differences can be seen.

RasMol
RasMol is a powerful research tool for displaying the structure of DNA, proteins, and smaller molecules. Protein Explorer, a derivative of RasMol, is an easier-to-use program.

DeepView (also known as Swiss-PdbViewer)
A tool for viewing and exploring macromolecular models in three dimensions, and for manual and semi-automated homology modeling.

Conclusion

Bioinformatics in India is at an early stage of development. But at 4 to 5 centers in the country, one sees a mature understanding of the needs of this sector and world-class development of tools and applications. These centers will ensure that India's traditional strengths in IT are leveraged to place the country on par with the developed countries.
