BIOINFORMATICS
OBJECTIVE
At the end of this exercise, the student should be able to
1. Access and explore existing biochemical information from key electronic resources such as
the National Center for Biotechnology Information (NCBI), the ExPASy server of the Swiss Institute of
Bioinformatics, and the Protein Data Bank.
2. Use the Basic Local Alignment Search Tool (BLAST) to compare sequences (DNA or protein) with
other sequences from various organisms.
3. Use the molecular visualization software Swiss PDB Viewer to manipulate and
analyze the three‐dimensional structures of molecules.
INTRODUCTION
A. What is Bioinformatics?
The cross‐disciplinary field of bioinformatics is defined as “conceptualizing biology in terms
of molecules (in the sense of physical chemistry) and applying informatics techniques (derived
from disciplines such as applied math, computer science and statistics) to understand and organize
the information associated with these molecules, on a large scale. In short, bioinformatics is a
management information system for molecular biology” (1). Put plainly, it is the convergence of
molecular biology and computational analysis to examine the information associated with
biomolecules, and the term is often used as a synonym for computational molecular biology.
Bioinformatics has three goals. The first is to create and maintain databases
that store large quantities of biological information such as amino acid and nucleotide sequences
(1,2). These databases give the scientific community easy access to existing data and a way to submit new
and revised entries (1). Bioinformatics is more than building databases, however. Its second goal is to
develop new tools and algorithms that uncover the wealth of information hidden in
those biological data (1,2). The third goal is to apply these tools to the analysis and
interpretation of biological information in order to gain comprehensive insight into living systems (1,2).
Common activities in bioinformatics include analyzing DNA and protein sequences, aligning
different DNA and protein sequences to identify similarity and cluster them into families of related
sequences, and predicting and viewing three‐dimensional models of protein structures. Table 1 below
gives a summary of the biological information examined in bioinformatics.
Table 1. Data sources analyzed in bioinformatics (2001) (1)
Data source | Data size | Bioinformatics topics
Raw DNA sequence | 8.2 million sequences (9.5 billion bases) | Separating coding and non‐coding regions; identification of introns and exons; gene product prediction; forensic analysis
Protein sequence | 300,000 sequences (~300 amino acids each) | Sequence comparison algorithms; multiple sequence alignment algorithms; identification of conserved sequence motifs
Macromolecular structure | 13,000 structures (~1000 atomic coordinates each) | Secondary and tertiary structure prediction; 3D structural alignment algorithms; protein geometry measurements; surface and volume shape calculations; intermolecular interactions
Genomes | 40 complete genomes (1.6 million‐3 billion bases each) | Characterization of repeats; structural assignments to genes; phylogenetic analysis; genomic‐scale censuses (characterization of protein content, metabolic pathways); linkage analysis relating specific genes to diseases
Gene expression | Largest: ~20 time point measurements for ~6000 genes | Correlating expression patterns; mapping expression data to sequence, structural and biochemical data
Literature | 11 million citations | Digital libraries for automated bibliographical searches; knowledge databases of data from literature
Metabolic pathways | not stated | Pathway simulation
Bioinformatics is a broad topic, so for a newcomer the field may be overwhelming,
daunting and confusing at first because of its computational and mathematical content as well as
its terminology. Furthermore, the field is evolving so rapidly that it is
nearly impossible to keep up with all the progress. This exercise is designed to introduce students
to computational techniques in the area of bioinformatics for the analysis of biomolecules, with
emphasis on proteins and nucleic acids.
B. Selected Databases
Protein Sequence Database
UniProt (Universal Protein Resource) is the result of the merger of three separate protein
databases: Swiss‐Prot and TrEMBL (Translated EMBL Nucleotide Sequence Data Library), maintained by
the Swiss Institute of Bioinformatics and the European Bioinformatics Institute, and
Georgetown University’s PIR‐PSD (Protein Information Resource Protein Sequence Database).
UniProtKB/Swiss‐Prot, the smaller component of UniProt, provides manually annotated protein
sequences with additional information, while UniProtKB/TrEMBL provides computationally
analyzed sequence records that await manual annotation and later transfer to Swiss‐Prot (3).
Nucleotide Sequences and Genome Sequences Database
GenBank, operated by NCBI (National Center for Biotechnology Information), and the nucleotide
databases of EMBL (European Molecular Biology Laboratory) and DDBJ (DNA Data Bank of Japan) contain
annotated nucleotide sequences. The Entrez Genome database by NCBI holds all complete and
partial genomes (1).
Structural Databases
The Protein Data Bank (PDB) is a repository of experimentally determined three‐dimensional (3D)
structures of biomolecules such as proteins and DNA. Most of these structures
are obtained through X‐ray crystallography and NMR (1).
Online Mendelian Inheritance in Man (OMIM)
OMIM can be accessed through the Entrez database of NCBI. It is a database that
catalogues human genes and genetic disorders. It is linked to gene entries in GenBank and
PubMed articles and sequences.
Publisher’s MEDLINE/Public MEDLINE (PubMed)
PubMed is a bibliographic database that can serve as a search engine for abstracts and
articles published in different biomedical journals.
Closer Look: NCBI’s Entrez Interface System
The Entrez facility maintained by NCBI is an interface system that serves as a gateway for
accessing and traversing all of its component databases, which cover protein and DNA sequences,
three‐dimensional structure information, genome mapping and bibliographic records (5). PubMed,
OMIM and Nucleotide (GenBank) are all Entrez databases (5).
C. Tools for Analysis in Bioinformatics
Basic Local Alignment Search Tool (BLAST)
BLAST finds sequences similar to the input (query) sequence. The results can help identify
previously characterized sequences, find evolutionarily related (homologous) proteins and
infer function from sequence similarity (6).
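To make the idea of a local alignment score concrete, the sketch below implements Smith‐Waterman dynamic programming, the exact method that BLAST approximates with faster heuristics. It is not BLAST itself, and the scoring values (match +2, mismatch -1, gap -2) are illustrative assumptions rather than BLAST defaults.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two made-up sequences that share the short stretch "ACG".
print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

A higher score means a longer or cleaner shared region. BLAST reports a normalized version of such a score together with an E‐value.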
Multiple Sequence Alignment, ClustalW
ClustalW is a multiple sequence alignment tool used for DNA and protein sequences. It
produces the best match over the length of selected sequences, highlighting similarities and
differences (7). This is useful for the identification of conserved patterns in protein families,
predicting three‐dimensional structures of proteins and discovering evolutionary relationships
between sequences (8).
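Once an alignment has been produced, conserved positions can be read off column by column. The sketch below assumes the aligned sequences are already available as equal‐length strings with "-" for gaps. The three sequences shown are made up for illustration and are not ClustalW output.

```python
def conserved_columns(aligned):
    """Return the indices of columns where every sequence has the same residue."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return [
        i
        for i, column in enumerate(zip(*aligned))
        if len(set(column)) == 1 and column[0] != "-"
    ]


# Hypothetical aligned fragments; "-" marks a gap.
alignment = ["MKV-LA", "MRVGLA", "MKV-IA"]
print(conserved_columns(alignment))
```

Columns that stay identical across a whole protein family are candidates for conserved motifs.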
Deep View (Swiss‐Pdb Viewer)
Deep View is a molecular graphics program useful in exploring macromolecular models in
three dimensions. It allows simultaneous loading of several molecules for analysis. Measurement
of bond angles and distances between atoms, examination of H‐bonding, mutational studies on
amino acids and side chain conformation and structural alignment are some of the main
applications of Deep View (9).
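The distance measurements that Deep View reports can be reproduced from the coordinates stored in a PDB file. The sketch below assumes the standard fixed‐width ATOM record layout, in which x, y and z occupy columns 31‐38, 39‐46 and 47‐54. The helper example_record and the two atoms are hypothetical and exist only to produce correctly aligned test input.

```python
import math


def atom_xyz(record):
    """Return (x, y, z) in angstroms from a fixed-width PDB ATOM/HETATM record."""
    return (float(record[30:38]), float(record[38:46]), float(record[46:54]))


def atom_distance(record_a, record_b):
    """Euclidean distance between two atoms, in angstroms."""
    return math.dist(atom_xyz(record_a), atom_xyz(record_b))


def example_record(serial, name, res, chain, resnum, x, y, z):
    """Build a made-up ATOM record with the coordinate columns in the right place."""
    return (f"ATOM  {serial:5d} {name:<4s} {res:3s} {chain}{resnum:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


a = example_record(1, "N", "HIS", "A", 50, 0.0, 0.0, 0.0)
b = example_record(2, "CA", "HIS", "A", 50, 3.0, 4.0, 0.0)
print(atom_distance(a, b))
```

With a real structure, the two records would be lines copied from the downloaded PDB file instead of generated ones.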
Expert Protein Analysis System (ExPASy)
ExPASy is a proteomics server operated by the Swiss Institute of Bioinformatics. It consists
of a variety of databases, tools and software for protein analysis. The databases include UniProt,
while the tools and software cover identification and characterization of proteins, protein
to DNA reverse translation, BLAST similarity searches and prediction of the secondary and tertiary
structure of proteins (10).
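Reverse translation does not have a single answer, because most amino acids are encoded by more than one codon. As a rough illustration, the sketch below counts how many different DNA sequences could encode a given peptide under the standard genetic code. The function name is an assumption for this example, and stop codons are ignored.

```python
import math

# Number of codons for each amino acid in the standard genetic code (61 sense codons).
CODONS_PER_RESIDUE = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_coding_sequences(peptide):
    """Return how many distinct DNA sequences encode the peptide."""
    return math.prod(CODONS_PER_RESIDUE[aa] for aa in peptide.upper())


# Even a four-residue peptide has many possible coding sequences.
print(count_coding_sequences("MISS"))
```

Because the count grows so quickly, a reverse translation tool needs some rule, such as an organism's codon usage, to pick one sequence among the many candidates.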
PROCEDURE
A. PubMed (11)
Go to www.pubmed.org.
Do a basic search in PubMed on a human biomedical topic of your own choice that has a
genetic component.
1. What search term(s) did you use?
2. How many hits do you get?
3. Give the reference of the first citation on the list
4. Is this reference available in full‐text?
5. How would you adjust your search to give only review papers?
a. How many hits did you get?
b. Give the citation of the review paper you found.
c. Is this available in full‐text?
6. Are there any books in the PubMed library that discuss the topic that you chose? How
do you find them?
a. Cite the book source that gives the best discussion of your topic
b. Why did you consider this the best?
7. Use the OMIM database on PubMed to look up the genetics associated with your topic.
a. What is the entry number and title for this genetic topic?
b. Is there a location for this gene in the human genome? On which chromosome is it
located? Click the gene map locus.
8. Using the key PubMed and OMIM information that you found, give a brief summary
about the topic that you chose
9. Give the key references
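The same kind of PubMed search can be expressed as a query to NCBI's E‐utilities, the programmatic interface to the Entrez databases. The sketch below only builds the request URL and does not contact the server. The search topic is a placeholder, and review[pt] is the PubMed field tag used here to restrict results to review articles.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, reviews_only=False, retmax=20):
    """Build an ESearch URL for PubMed; retmax caps the number of IDs returned."""
    if reviews_only:
        term = f"({term}) AND review[pt]"
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Placeholder topic; substitute the search term used in step 1.
print(pubmed_search_url("cystic fibrosis"))
print(pubmed_search_url("cystic fibrosis", reviews_only=True))
```

Opening either URL in a browser returns an XML document whose Count element corresponds to the number of hits asked for in steps 2 and 5a.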
B. Swiss‐Prot (11)
Go to www.expasy.org, go to the databases section, click Swiss‐Prot and TrEMBL, and search for a
protein. Choose one that you think is an interesting match. Click to see the NiceProt view
and answer the following questions:
1. Swiss‐Prot ID Name and Code
2. Organism where it comes from
3. Full name and alternate name for this protein
4. How many amino acids?
5. Based on the general annotations, describe the general function of this protein and
where it is expressed.
6. Does this protein have any non‐protein components such as metal ions, other
cofactors, prosthetic groups or sugars attached via glycosylation sites? If yes, please
describe.
7. Where is the active site or binding site of this protein, if any?
8. Look at the gene ontology. What functional descriptions are associated with your
protein?
9. Look at the feature section and note what other features are listed. What is shown by
the graphical view?
10. Are there any variants of this protein? Note any of them.
11. Look for the secondary structures. Describe the secondary structure of your selected
protein. Which secondary structure predominates?
12. Look at the amino acid sequence, choose a chain and determine the pI and MW. What
results do you get?
13. Click BLAST. What information do you get?
14. Click alignment and paste two sequences. What information do you get?
15. Is a three‐dimensional structure of this protein available? Look for a PDB link, if any. If
there are several PDB IDs, just choose one.
16. Give the citation for the 3‐D structure.
17. Give the molecular description.
18. What can you say about the tertiary structure? Click on Jmol
19. What experimental data were used to predict the 3‐D structure?
20. Go back to the main entry page and go to the features. What key domains/motifs are present
in this protein, and which amino acid residues are in each?
21. Click ProtParam and analyze the parameters of the protein. Study the results. What
information is given?
Go back to main entry page, click FASTA format link. Copy and paste into notepad the
sequence of your protein.
22. Go to tools from the main page and go to the section on DNA to protein. Go to reverse
translate and get the nucleotide sequences that code for your protein. Why is the
codon usage table an important parameter?
23. Go to the primary structure analysis section and open ProtScale. Choose alpha helix (Chou and
Fasman). What information can you get?
24. Choose any other parameter that you want in #23
25. Go to the section on secondary structure prediction. Go to JPred and paste your
protein sequence. Choose one of the hits found
26. Give a short summary of what your protein is like and what it does, based on your
ExPASy findings.
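The molecular weight asked for in steps 12 and 21 can be estimated directly from the sequence. The sketch below sums average residue masses and adds one water molecule for the free termini. The mass table is a commonly tabulated set of average values, so results may differ slightly from ProtParam's, and the pI is not computed here.

```python
# Average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.015  # added once for the free N- and C-termini


def molecular_weight(sequence):
    """Estimate the average molecular weight (Da) of an unmodified peptide chain."""
    return sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER


# The example amino acid sequence used in part C of this exercise.
print(round(molecular_weight("MISSPIGGYISENIGMATIC"), 1))
```

The function raises a KeyError for letters outside the 20 standard amino acids, which is a useful check that a FASTA header line was not pasted in by mistake.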
C. NCBI BLAST (12)
Go to www.ncbi.nlm.nih.gov/. Select BLAST. (You can use this site, http://www.digital‐world‐biology.com/BLAST,
to access an online tutorial for this program.)
1. Select nucleotide BLAST (blastn)
2. Input your own random nucleotide sequence (fill the first line of the Query Box: e.g.
CATATTACTATGGGTACTCTTA)
3. In “Choose databases”, select ‘other’
4. Select BLAST and wait for the results page to load. The time it takes depends
on the type of search and the number of other users on the NCBI website at the same
time.
5. Did your sequence produce a significant hit? How many?
6. Click Search summary. How many sequences did it search in the database? How many
nucleotide letters did it search in the database?
7. Repeat analysis this time using your own random amino acid sequence (e.g.
MISSPIGGYISENIGMATIC). Select protein BLAST (blastp)
8. Did your sequence produce a significant hit? How many?
Close the results page when you are done.
9. Select nucleotide BLAST to go back to its page.
10. Copy and paste the unknown nucleotide sequence given to you by your instructor.
Make sure to copy the entire sequence including the > symbol and the name.
11. Repeat the search you performed above using this unknown sequence. Wait for the
results.
12. What is the E‐value and score for the first matching sequence? What is the significance
of the E‐value?
13. How many sequences can you find with an E‐value larger than 0.01?
14. What is the most likely identity of this sequence? What data supports this conclusion?
15. What are the 3 closest relatives based on sequence similarity? What other organisms
encode for this gene based on nucleotide sequence similarity?
Close the results page when you are done.
16. Return to the nucleotide BLAST page. Shorten the sequence given to you by pasting only
the first 50 base pairs into the query box.
17. Repeat the BLAST search you performed above for the longer sequence, this time with the
shortened query, and wait for the results to appear.
18. How do these E‐values compare with the ones you obtained above? Is the identity of
the best hit different from when you used the complete nucleotide sequence?
19. From the two BLAST searches, what can you deduce about how the length of a query
sequence affects your confidence in the sequence search?
20. Give the first five sequences from the list of sequences producing significant alignments.
21. Repeat the analysis using the unknown amino acid sequence given to you by your
instructor.
22. What is the most likely identity of this sequence? What data supports this conclusion?
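The E‐value questions above can be explored numerically. The sketch below uses the standard relationship between the bit score S', the query length m and the database length n, E = m × n × 2^(−S'), and ignores the edge corrections that BLAST applies to the lengths. All numbers in the example are hypothetical.

```python
def expected_hits(bit_score, query_len, db_len):
    """Estimate the E-value: hits of at least this bit score expected by chance."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical searches against a database of 10 billion letters.
# A short query can only produce short alignments, so its best bit score is lower.
print(f"{expected_hits(bit_score=90, query_len=500, db_len=1e10):.2e}")
print(f"{expected_hits(bit_score=40, query_len=50, db_len=1e10):.2e}")
```

Each additional bit of score halves the E‐value, which is why the lower scores attainable with a 50‐base query usually translate into much larger E‐values and less confidence in the match.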
D. Deep View (This is a take‐home activity for the students). Refer to the tutorial provided.
1. Download Deep View from http://spdbv.vital‐it.ch/. Spdbv stands for Swiss PDB
Viewer. Make sure to choose the version that is compatible with your operating system.
2. Open the zip file and extract the files to the Desktop.
3. Click the file to start the program.
4. When the program opens, two windows will pop up. One is the
“Swiss‐PdbViewer 4.0.1” window and the smaller one is the “About Swiss‐PdbViewer...” window.
Click the X mark on the right side of the window to close
the “About Swiss‐PdbViewer...” window.
5. Download the DeepView manual from http://spdbv.vital‐it.ch/Swiss‐PdbViewerManualv3.7.pdf.
Become familiar with the basic movement controls.
6. Load the PDB file into Deep View. To do this, click “File” then choose “Import”. In the
text box, type “2DN1”, the code for human hemoglobin. Click “PDB file” under “Grab
from Server:”. Another option is “Open PDB File” from the “File” menu. If a prompt
appears, click “Ok”. Also close the inset that appears before you proceed.
7. Familiarize yourself with the basic controls. The first four buttons (1) place the
image at the center of the screen, (2) translate, (3) zoom in and out and (4) rotate.
8. Change the display by clicking “Display” > render in 3D and “Display” > render in
solid 3D.
9. Select the alignment window (“Window”>Alignment)
10. Look for all the Histidine residues (H) by running the cursor over the sequence.
Observe how amino acid residues are distinguished.
11. Go to the control panel (“Window”>control panel)
12. Select the entire protein by clicking to the far left of the first column. The text for all
the residues will turn red. Select the entire protein and color the residues by their
secondary structure (Color>by Secondary structure). Which color corresponds to
which type of secondary structure?
13. Overlay the ribbon view of the protein by right‐clicking the “ribn” title on the top of the
column. How would you describe the tertiary structure of this protein?
14. Deselect the ribbon view by removing all the “v”s. To do this, right‐click again in the
“ribn” column.
15. Find the Heme on the control panel. Change its color to something more visible (box
on far right of Zn in the control panel)
16. Go to the control panel and select the entire protein and then select the
Ramachandran window (“Window”>Ramachandran).
17. Look at the residues within the outlined yellow regions. What secondary structure
does each region of the plot correspond to?
18. Is there a deviant amino acid, one outside the blue and yellow outlines? If yes,
determine its identity.
19. Choose a point within the yellow region, click on it and move its angle. What happens
to the tertiary structure? Close the Ramachandran window.
20. Center the complete model 3INC in CPK colors, and display H‐bonds.
21. Label A His 509 by clicking on the label column next to it.
22. Click the MUTATE button (second from right in button row), and then click any atom of
A SER9. Select Glycine (G) from the pop‐up menu that appears. What are the effects of
the amino acid replacement?
23. To accept changes after using the MUTATION button, simply click it again and click YES
on the dialog box that appears. Gly 50 on the Control Panel window will replace His
50.
24. Perform energy minimization on the insulin molecule. Go to Tools > Energy
Minimization, then click File > Save > Current layer.
25. Load 2DN1 and the mutant file.
26. You can control the visibility of the two layers by checking or unchecking under the vis
column. Try superimposing the two layers by checking the vis for both layers.
Differentiate the two structures. Close the window after.
27. Download PDB file 2DN2 (human hemoglobin deoxygenated) and load it with 2DN1.
Differentiate the structure.
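Each point in the Ramachandran window is a pair of backbone dihedral angles. Phi is defined by the atoms C(i-1), N(i), CA(i), C(i) and psi by N(i), CA(i), C(i), N(i+1). The sketch below computes a dihedral angle from four atomic positions. The coordinates in the example are made up to give an easily checked result and do not come from 2DN1.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Return the dihedral angle in degrees defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)  # unit vector along the central bond
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


# Made-up points arranged so that the two outer bonds are perpendicular.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Applying the function to consecutive backbone atoms of a chain yields the phi and psi values that Deep View plots, one point per residue.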
DISCUSSION
1. Submit a presentation about your selected protein using all the computational analyses
discussed.
REFERENCES
1. Luscombe, N.M.; Greenbaum, D.; Gerstein, M. What is Bioinformatics? An introduction and
overview. Yearbook of Medical Informatics, Schattauer, Stuttgart/New York, 2001.
2. Bioinformatics. http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html (accessed
November 2010)
3. GenBank, RefSeq, TPA and UniProt: What’s in a Name?
http://www.microbemagazine.org/index.php?option=com_content&view=article&id=1270
:genbank‐refseq‐tpa‐and‐uniprot‐whats‐in‐a name&catid=347:letters&Itemid=419
(accessed November 2010)
4. Online Mendelian Inheritance in Man. http://www.nslij‐genetics.org/search_omim.html
(accessed November 2010)
5. NCBI’s Entrez System by Alex E. Lash.
6. BLAST: Basic Local Alignment Sequence Tools by Jonathan M. Urbach
7. ClustalW. http://www.ebi.ac.uk/Tools/clustalw/ (accessed November 2010)
8. ClustalW. (http://www‐bimas.cit.nih.gov/clustalw/clustalw.html) (accessed November
2010)
9. Swiss‐Pdb Viewer DeepView v4.0 by Nicolas Guex , Alexandre Diemand , Manuel C. Peitsch , &
Torsten Schwede. (http://spdbv.vital‐it.ch/)
10. Bioinformatics: Proteomics: ExPASy Proteomics Server.
http://bioinformatics‐made‐easy.blogspot.com/2009/11/expasy‐proteomics‐server.html (accessed November 2010)
11. Bioinformatics Module by Nina Rosario L. Rojas PhD for Chem 349 Special Topics in
Biochemistry 1st sem 2007‐2008 [Unpublished]
12. Bioinformatics: An Interactive Introduction to NCBI by Seth Bordenstein
http://serc.carleton.edu/microbelife/k12/bioinformatics/index.html (accessed
November 2010)
13. Ship,N.J.; D. B. Zamble. Analyzing 3‐D Structure of Human Carbonic Anhydrase II and Its
Mutants Using Deep View and Protein Data Bank. J. Chem. Ed. [Online], 2005, 82, 1805‐
1808
14. Deep View Tutorial, project submitted by Saluria, J. V.; Vergara, C. J.; Hermosa, C.; Ventura,
I. R. Chem 146.1, 1st semester, 2010‐2011
15. Deep View Tutorial, project submitted by Estandarte, A. K.; Estevanez, P. J.; Garcia, M.C.;
Reyes, Roy. Chem 146.1, 1st semester, 2010‐2011
16. Deep View Tutorial, project submitted by Alcan.; Estevanez, P. J.; Garcia, M.C.; Reyes, Roy.
Chem 146.1, 1st semester, 2010‐2011
DISCUSSION
2. Obtain your assigned nucleotide sequence and a copy of the worksheet from this site:
http://www.digitalworldbiology.com/BLAST/62000sequences.html. Submit your answers
to the guide questions included in the worksheet.