
Bioinformatics Laboratory 2

Sequence Alignment

By the end of this lab, you should:

be able to find proteins similar to your protein of interest
have an idea of how to perform sequence alignments using available online tools
know where to look for further help on using these tools

We need to have some sequences to work with in this lab. In the Biological Databases lab, we
learned how to look up sequence information in one of the many databases available online.
Look up the protein sequences of the genes of your choice and download them to your home
directory as FASTA files.
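
Once the FASTA files are downloaded, it can help to check what is in them. Here is a minimal
sketch of a FASTA reader in Python; the function name parse_fasta and the example record are
made up for illustration, and real files may need more careful handling.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} from the lines of a FASTA file."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # header line, without the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()  # sequence may span many lines
    return records


# Hypothetical two-line record, used only to show the call.
example = """>example_protein made-up record for illustration
MKTAYIAKQR
QISFVKSHFS
""".splitlines()

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

For a file on disk, the same function can be called as parse_fasta(open(path)), where path is
wherever you saved the download.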

Why not the nucleotide sequence? Well, many different triplet codons translate to the same
amino acid, and what ultimately determines the structure, and thus the function, of the protein is
its amino acid sequence. So most of the time, it makes more sense to align protein sequences
rather than nucleotide sequences. This is not a strict rule, though: there are definitely times
when you want to align nucleotide sequences, e.g. for non-protein-coding sequences, regulatory
regions of DNA, etc.
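
To make the codon argument concrete, here is a small sketch using the standard genetic code.
The two DNA strings are invented for illustration: they differ at half of their positions but
translate to the same peptide.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def translate(dna: str) -> str:
    """Translate a DNA string codon by codon, starting at the first base."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def count_matches(a: str, b: str) -> int:
    """Count positions where two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b))


# Two made-up coding sequences that use different codons for Leu, Ser and Lys.
dna1 = "ATGCTTTCTAAA"
dna2 = "ATGTTGAGCAAG"

print(translate(dna1), translate(dna2))
print(count_matches(dna1, dna2), "of", len(dna1), "nucleotides match")
```

A nucleotide alignment of these two strings would look mediocre, while the protein alignment
is perfect.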

1. Finding similar sequences using BLAST

First of all, why do you want to find similar sequences? As an example, say you stumbled upon
a new protein and somehow got a hold of its sequence. The only problem is that you have no
idea what the protein's function is! One way to start figuring out the function using bioinformatics
is to find a bunch of other sequences that are similar to your protein. If other people already
figured out the functions for that set of proteins, then you can infer that your protein probably
has a similar function, purely based on sequence similarity. Obviously, this is not a guarantee
that your protein actually has that function (the ultimate proof is through experiments), but it's a
pretty good start.

In lecture, we've been talking about aligning two sequences and assigning scores to an
alignment such that it will give us information about how similar the two sequences are. We've
also looked at the Needleman-Wunsch algorithm and started investigating the basis for the
scoring schemes. The Needleman-Wunsch algorithm is relatively easy to implement using
dynamic programming. However, its running time and memory use both grow with the product of
the two sequence lengths, so it becomes slow and memory-hungry on sequences of any significant length.
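
As a reminder of what the algorithm does, here is a minimal sketch of Needleman-Wunsch in
Python. The scoring scheme (match +1, mismatch -1, gap -1) is an assumption chosen for
simplicity, not the scheme used in lecture or by any particular tool. Note the full table of
(len(a)+1) x (len(b)+1) cells, which is where the time and memory go.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic programming table.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


total, row1, row2 = needleman_wunsch("GATTACA", "GATACA")
print(total)
print(row1)
print(row2)
```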

Yet, all bioinformatics students learn about this algorithm, because it gives a good intuition of
how alignments can be performed and more importantly, what the scores mean. When you want
alignments of "real" sequences, you would probably turn to some of the already available
alignment tools. (But the idea behind the scores still holds - so that's why it's very important to
understand what alignment scores mean!)

Most of you have probably heard of BLAST as a program for finding similar sequences,
available online from the NCBI website. BLAST is linked to many of the sequence databases, so
when you enter a query sequence, it essentially performs many local alignments for you against
the sequences in the database you choose and gives you the set of highest scoring alignments.
Obviously, it's not performing thousands and thousands of full dynamic-programming alignments
(such as Needleman-Wunsch) or you would never get a result! In fact, BLAST uses a heuristic
approach: it first splits the sequence you enter into a bunch of short "words", scans for these
words in the sequence database, and then uses the hits to seed alignments extending from the
found words. You can learn a lot more about the actual workings of BLAST in this chapter of the
NCBI Handbook.
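
The word-and-seed idea can be sketched in a few lines of Python. This is a deliberately
simplified illustration, not BLAST's actual implementation: it only finds exact word matches
(real BLAST also considers similar "neighborhood" words and then extends each seed), and the
word length of 3 and the toy sequences are assumptions made for the example.

```python
from collections import defaultdict


def words(seq, w=3):
    """All overlapping words of length w, with their start positions."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]


def build_index(database, w=3):
    """Map each word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos, word in words(seq, w):
            index[word].append((name, pos))
    return index


def find_seeds(query, index, w=3):
    """Exact word hits between the query and the indexed database."""
    seeds = []
    for qpos, word in words(query, w):
        for name, dpos in index.get(word, []):
            seeds.append((word, qpos, name, dpos))
    return seeds


# Toy "database" of two made-up protein fragments.
database = {"toy1": "MKVLAAGIV", "toy2": "GGSLAAGPW"}
index = build_index(database)
seeds = find_seeds("TTLAAGQQ", index)
for seed in seeds:
    print(seed)
```

Each seed marks a spot where an alignment could start; only those regions get the more
expensive alignment step, which is why the search is fast.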

Go to the BLAST webpage.

Having an available search tool doesn't necessarily mean life becomes simple! Look at all the
choices you have before you even enter a query sequence! Luckily, the people at NCBI have
spent a lot of time writing documentation on BLAST, so check out this table to see what the
differences between all the various BLAST programs are. For now, just focus on the ones for
protein queries.

Since our goal here is to look for sequences similar to the query sequence and our query
sequence is most likely longer than 15 residues, protein blast (blastp) seems like a
decent choice. So select protein blast (blastp) from the table (or from the main BLAST
page).

You should now see a page with a form containing many boxes for you to fill in. The
main text box labeled 'Enter accession number, gi, or FASTA sequence' is where you
put in your query sequence. Not sure what kind of format the sequence should be in?
Click on the link next to 'Enter accession number, gi, or FASTA sequence' for an
explanation.

How do we tell BLAST where to look for similar sequences? That's specified by the
database we use for the query, chosen by the 'Database' dropdown box. To learn what
the different databases are, click on the link next to the 'Database' dropdown. Choosing
an appropriate database is an important step, since the results you get completely
depend on which databases you choose to query. If you only care about finding similar
sequences in human, you can use the 'Organism' text box to restrict the search to (or
exclude) particular organisms rather than searching the entire database, which would
search all sequences regardless of organism.

There are many other settings for blastp and I suggest clicking on the explanation for
each to find out what that setting is for. If you're really itching to do your first alignment,
just leave everything at its default value and click on the 'Blast' button.

Once you submit your request, you'll be directed to a page with your Request ID. Wait a
bit to let your request go through the queue. By default, BLASTP will also run a search
against the Conserved Domain Database (CDD) and display the results graphically while
it performs the Blast search. The graphic shows protein domains that may be present in
your query. Click on the graphic to get more information about the search results. Here's
a page with help on CDD.
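
The submit-then-wait pattern with a Request ID also exists in NCBI's URL interface to BLAST.
The sketch below only builds the two request URLs and does not contact the server. The
parameter names (CMD, PROGRAM, DATABASE, QUERY, RID, FORMAT_TYPE) are taken from my reading of
the NCBI BLAST URL API and should be checked against the current NCBI documentation before
use; the helper names and the example Request ID are hypothetical.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def submit_url(query, program="blastp", database="swissprot"):
    """URL that would submit a search (the response page carries the Request ID)."""
    params = {
        "CMD": "Put",
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": query,  # an accession number or a raw sequence
    }
    return BLAST_URL + "?" + urlencode(params)


def results_url(request_id, format_type="Text"):
    """URL that would fetch the results for a previously issued Request ID."""
    params = {"CMD": "Get", "RID": request_id, "FORMAT_TYPE": format_type}
    return BLAST_URL + "?" + urlencode(params)


print(submit_url("NP_005206"))
print(results_url("EXAMPLE_RID"))  # placeholder ID, not a real request
```

If you do script searches this way, follow NCBI's usage guidelines on request frequency.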

When the server finishes processing your request, it'll show you the set of sequences it
found. Congrats! You just performed your first BLAST search! Below where the results
show your query and its length is a graphic titled "Distribution of BLAST Hits". This
graphic shows you at a glance which matches were significant and where they line up
with your query. You should be able to tell from this whether the matches are local or
global with respect to your query. These matches are also referred to as "target"
sequences. Scroll down a bit and you'll see a big list of whole sequences or sequence
fragments that matched your query sequence.
o Associated with each sequence is a 'score', which is the pairwise alignment
score between that sequence and your query sequence.
o You'll also see something called an 'E-value', which is the 'expect value': the
number of alignments scoring at least this well that you would expect to find by
chance in a database of this size (more here). The E-value is calculated from the
alignment score, the length of the query sequence and the database size. The
smaller the E-value, the more significant the match. A rule of thumb is that
E-values less than 1e-3 are significant matches.
o Below the line showing the Expect value are three numbers identified as
"Identities", "Positives", and "Gaps". Identities give the number of exact matches
between the query and the target sequence over the length of this alignment. A
general rule of thumb is that percent identity should be at least 25-30% over an
alignment of at least 80-100 amino acids to be able to assert that the sequences
are homologous, meaning that they are evolutionarily related. If sequences are
homologous, then you have a better case for asserting they share a common
function. "Positives" takes into account both exact matches and conservative
substitutions (i.e., similar amino acids). "Gaps" refers to the number of gaps in the
alignment. All else being equal, the fewer the gaps, the better the alignment.
o By default, BLAST will filter out low-complexity regions which have biased
amino-acid composition (e.g., runs of repeated amino acids) because they will
skew the results. In the output, BLAST will display these regions grayed out and
in lower case.
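
The numbers in the list above can be reproduced on a toy example. In this sketch,
expect_value uses the bit-score form of the E-value (E = m * n * 2^-S', with m the query
length and n the total database length) and ignores the length corrections that BLAST
applies. The "similar amino acid" groups are a simplified stand-in for a real substitution
matrix, and the aligned strings are made up.

```python
# Simplified stand-in for "conservative substitution"; BLAST itself counts a
# pair as positive when its substitution-matrix score is greater than zero.
SIMILAR_GROUPS = [set("ILVM"), set("KR"), set("DE"), set("ST"), set("FYW")]


def expect_value(bit_score, query_len, db_len):
    """E = m * n * 2**(-S') for bit score S' (no edge-effect corrections)."""
    return query_len * db_len * 2.0 ** (-bit_score)


def alignment_stats(query, target):
    """Count identities, positives and gaps in two aligned, gapped strings."""
    if len(query) != len(target):
        raise ValueError("aligned strings must have the same length")
    identities = positives = gaps = 0
    for q, t in zip(query, target):
        if q == "-" or t == "-":
            gaps += 1
        elif q == t:
            identities += 1
            positives += 1  # identical residues also count as positives
        elif any(q in group and t in group for group in SIMILAR_GROUPS):
            positives += 1
    return {
        "length": len(query),
        "identities": identities,
        "positives": positives,
        "gaps": gaps,
    }


stats = alignment_stats("MKV-LDEST", "MRVALEE-T")
print(stats)
print("percent identity: %.0f%%" % (100.0 * stats["identities"] / stats["length"]))
print(expect_value(bit_score=50, query_len=300, db_len=10**8))
```

Notice that doubling the database size doubles the E-value for the same score, which is one
reason the choice of database matters.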

Note that the alignment scores are based on the scoring matrix that BLAST used - for protein
sequence alignments, this is a very important setting and how you choose an appropriate matrix
depends on many things, including what kind of sequence similarities you want to detect, how
long your sequences are, etc. Here's a longer explanation of substitution matrices.
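
To see why the matrix matters, compare scoring the same ungapped alignment with plain
identity counting and with substitution scores. The dictionary below is a small hand-copied
subset of BLOSUM62 entries, enough for this toy example only; the function names and the two
four-residue strings are made up.

```python
# A few BLOSUM62 entries (the matrix is symmetric); NOT the full matrix.
BLOSUM62_SUBSET = {
    ("L", "L"): 4, ("I", "L"): 2,
    ("K", "K"): 5, ("K", "R"): 2,
    ("D", "D"): 6, ("D", "E"): 2,
    ("W", "W"): 11,
}


def matrix_score(a, b, matrix=BLOSUM62_SUBSET):
    """Look up a pair in either order; raises KeyError outside the subset."""
    key = (a, b) if (a, b) in matrix else (b, a)
    return matrix[key]


def identity_score(a, b):
    """1 for an exact match, 0 otherwise."""
    return 1 if a == b else 0


def score_ungapped(seq1, seq2, scorer):
    """Sum the per-position scores of two equal-length sequences."""
    return sum(scorer(a, b) for a, b in zip(seq1, seq2))


query, target = "LKDW", "IREW"
print("identity score:", score_ungapped(query, target, identity_score))
print("matrix score:  ", score_ungapped(query, target, matrix_score))
```

Only one of the four positions is identical, yet the matrix gives credit for the three
conservative substitutions and a large reward for the rare tryptophan match.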

[Figure: the three parts of the BLAST results page: 1. Graphic Display, 2. Hit List, 3. Alignment]

BLAST EXERCISE 1

The gene DCC is deleted in colorectal cancer and is located on human chromosome 18q21.3. It
encodes a tumor suppressor protein. Expression of the gene is reduced significantly in most
colorectal carcinomas. The protein sequence of human DCC has the RefSeq accession number
NP_005206.

Locate the GenBank record for this protein. Note the length of the amino acid sequence.
Perform a BLASTP search using this protein as the query and Swiss-Prot as the target
database. Limit the search to mammalian species only and use BLOSUM62 as the
scoring matrix.
o The DCC protein from human is most closely related to the DCC protein from
what other mammal?
o What percent identity do they share?
o What is their percent similarity?
o What is the length of the alignment? Were both proteins aligned along their entire
length?
o Does the DCC protein contain any low-complexity regions that have been
masked out by BLASTP? If so, where?
Look at the results for the protein with Swiss-Prot ID P97798.
o What percent identity does it share with the query?
o What is the alignment length? Is it a global or local alignment for the query and
the target?
Based on the BLASTP results, can any general observations be made regarding the
putative function or cellular role of DCC?

2. Protein Databases and Prediction Servers

But what if the sequence you're interested in just isn't similar enough to any other known
sequences, so by sequence alignments, you can't really figure out what your protein does?
There are a number of databases which contain information about protein families, domains,
motifs, etc. Some of these databases can also help you find similar proteins.

Let's first take a look at PROSITE. By pasting in either a query sequence or its accession
number, you can scan the database to see if your protein contains any of the
known protein domains and sites, with the idea that this will help you figure out what the function
of your protein is. Depending on what your protein is, you may or may not detect
binding sites, phosphorylation sites, etc.

Another thing you can do with your protein sequence is to try to predict its secondary structure
using one of several prediction servers, such as PSI-PRED, PredictProtein, and JPRED. On most of
these sites, you submit your protein sequence along with your email address, and when the
results are ready, you get a notification email.

Finally, there are several databases which contain classifications of structural elements in
proteins, such as SCOP, CATH, and HOMSTRAD. These databases try to group families of proteins
together by common structural elements, so you can view all the proteins with coiled-coil
domains, for example.

BLAST EXERCISE 2

"Jurassic Park" Dino-DNA Analysis

In 1990, Michael Crichton published the book Jurassic Park about the resurrection of dinosaurs
using the blood from the stomachs of insects which had been encased in amber. At one point in
the book, Dr. Henry Wu is asked to explain some of the DNA techniques used in reconstructing the
extinct dinosaur genomes. Dr. Wu describes the use of restriction enzymes and how the
fragmented pieces of dino DNA can be spliced together with these enzymes. He also alludes to
the fact that they don't have the entire genome but that they "fill in the gaps" with modern-day
frog DNA. At one point during his discussion he points to a computer screen and remarks "Here
you see the actual structure of a small fragment of dinosaur DNA."

gcgttgctgg cgtttttcca taggctccgc ccccctgacg agcatcacaa aaatcgacgc
ggtggcgaaa cccgacagga ctataaagat accaggcgtt tccccctgga agctccctcg
tgttccgacc ctgccgctta ccggatacct gtccgccttt ctcccttcgg gaagcgtggc
tgctcacgct gtaggtatct cagttcggtg taggtcgttc gctccaagct gggctgtgtg
ccgttcagcc cgaccgctgc gccttatccg gtaactatcg tcttgagtcc aacccggtaa
agtaggacag gtgccggcag cgctctgggt cattttcggc gaggaccgct ttcgctggag
atcggcctgt cgcttgcggt attcggaatc ttgcacgccc tcgctcaagc cttcgtcact
ccaaacgttt cggcgagaag caggccatta tcgccggcat ggcggccgac gcgctgggct
ggcgttcgcg acgcgaggct ggatggcctt ccccattatg attcttctcg cttccggcgg
cccgcgttgc aggccatgct gtccaggcag gtagatgacg accatcaggg acagcttcaa
cggctcttac cagcctaact tcgatcactg gaccgctgat cgtcacggcg atttatgccg
caagtcagag gtggcgaaac ccgacaagga ctataaagat accaggcgtt tcccctggaa
gcgctctcct gttccgaccc tgccgcttac cggatacctg tccgcctttc tcccttcggg
ctttctcatt gctcacgctg taggtatctc agttcggtgt aggtcgttcg ctccaagctg
acgaaccccc cgttcagccc gaccgctgcg ccttatccgg taactatcgt cttgagtcca
acacgactta acgggttggc atggattgta ggcgccgccc tataccttgt ctgcctcccc
gcggtgcatg gagccgggcc acctcgacct gaatggaagc cggcggcacc tcgctaacgg
ccaagaattg gagccaatca attcttgcgg agaactgtga atgcgcaaac caacccttgg
ccatcgcgtc cgccatctcc agcagccgca cgcggcgcat ctcgggcagc gttgggtcct
gcgcatgatc gtgctagcct gtcgttgagg acccggctag gctggcgggg ttgccttact
atgaatcacc gatacgcgag cgaacgtgaa gcgactgctg ctgcaaaacg tctgcgacct
atgaatggtc ttcggtttcc gtgtttcgta aagtctggaa acgcggaagt cagcgccctg

In 1992 Dr. Mark Boguski at NCBI entered this sequence into a text editor and searched all of
the known DNA sequences at the time. Dr. Boguski wrote up his findings and submitted a
manuscript to the journal BioTechniques, as a tongue-in-cheek joke. His manuscript was
accepted and published. (Boguski, M.S. A Molecular Biologist Visits Jurassic Park. (1992)
BioTechniques 12(5):668-669). You will reproduce this experiment using BLAST.

EXERCISE 2a:
From the main BLAST page select Nucleotide-nucleotide BLAST (blastn). This brings up a web
page where you can specify your query sequence along with various parameters. Copy and paste
the above "dinosaur DNA" sequence into the window labeled Search, then click the BLAST!
button to start the search. Click the Format! button on the web page that appears and reformat
the page to the old view.

The most obvious feature of the resulting page is the graphic near the top which depicts the
"hits" or database matches for your query sequence. The number of hits depends on the degree
of similarity found between your input sequence and the sequences in the database. The
uppermost red line in the graphic represents your query sequence. The colored lines below
represent the "hits" or sequences that closely match your query sequence. Lavender lines
represent close or identical matches while green, blue and black lines are more imperfect
matches. The text immediately below the graphic describes the DNA sequences represented by
the lines in the graphic with the best matches presented first. The hyperlink at the start of each
line of text will take you to an entry in the DNA sequence database that corresponds to the gene
named in that line of text.

For each of the top three matches, click on the link to the left and report the entry that appears
under the heading SOURCE ORGANISM.

If a sequence does not correspond to a natural organism but instead represents a man-made
construct, the SOURCE ORGANISM entry will identify it as an artificial sequence. How many of
the top ten matches are artificial sequences? Identify any actual organisms in the top ten.

In practice, researchers rarely have complete and exact DNA samples. Some mistakes will
undoubtedly occur in extracting sequences from samples, and gaps may occur as pieces of a
sample are lost or incorrectly combined. This is why BLAST reports multiple matches and
provides matching information via the colored lines and overall score. Advanced users of
BLAST can specify additional search parameters to control how similar a match must be in
order to be reported.

EXERCISE 2b:
Introduce errors into the Jurassic Park sequence by deleting the first two lines and last two lines
in the sequence, and randomly changing five bases in the remaining sequence. How, if at all, do
these changes affect the search results?
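
If you would rather not edit the sequence by hand, the changes can be scripted. This is a
minimal sketch: the function name introduce_errors and the short toy input are made up, and
you would pass in the lines of the Jurassic Park sequence instead. It drops the first two and
last two lines, removes the spaces, and replaces five randomly chosen bases with a different
base each.

```python
import random


def introduce_errors(lines, n_changes=5, seed=None):
    """Drop the first/last two lines, then change n_changes random bases."""
    rng = random.Random(seed)  # pass a seed to get a repeatable result
    kept = lines[2:-2]
    bases = list("".join(kept).replace(" ", "").lower())
    # sample() picks distinct positions, so exactly n_changes bases differ.
    for pos in rng.sample(range(len(bases)), n_changes):
        bases[pos] = rng.choice([b for b in "acgt" if b != bases[pos]])
    return "".join(bases)


# Toy six-line input standing in for the real sequence.
toy_lines = [
    "aaaaaaaaaa aaaaaaaaaa",
    "cccccccccc cccccccccc",
    "acgtacgtac gtacgtacgt",
    "ttttgggggg ccccaaaaaa",
    "gggggggggg gggggggggg",
    "tttttttttt tttttttttt",
]
print(introduce_errors(toy_lines, seed=1))
```

Paste the output into the blastn query box in place of the original sequence.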

BLAST EXERCISE 3

"The Lost World" Dino-DNA Analysis

Mark Boguski's published article was brought to Crichton's attention. In his second book, "The
Lost World", Mr. Crichton used Dr. Boguski as a consultant. Dr. Boguski constructed an
interesting sequence from existing species and also embedded a message in the protein
translation of the DNA sequence which he submitted for use in the book. Here is the sequence
Dr. Boguski gave Crichton for the book "The Lost World":

gaattccgga agcgagcaag agataagtcc tggcatcaga tacagttgga gataaggacg
gacgtgtggc agctcccgca gaggattcac tggaagtgca ttacctatcc catgggagcc
atggagttcg tggcgctggg ggggccggat gcgggctccc ccactccgtt ccctgatgaa
gccggagcct tcctggggct gggggggggc gagaggacgg aggcgggggg gctgctggcc
tcctaccccc cctcaggccg cgtgtccctg gtgccgtggg cagacacggg tactttgggg
accccccagt gggtgccgcc cgccacccaa atggagcccc cccactacct ggagctgctg
caaccccccc ggggcagccc cccccatccc tcctccgggc ccctactgcc actcagcagc
gggcccccac cctgcgaggc ccgtgagtgc gtcatggcca ggaagaactg cggagcgacg
gcaacgccgc tgtggcgccg ggacggcacc gggcattacc tgtgcaactg ggcctcagcc
tgcgggctct accaccgcct caacggccag aaccgcccgc tcatccgccc caaaaagcgc
ctgcgggtga gtaagcgcgc aggcacagtg tgcagccacg agcgtgaaaa ctgccagaca
tccaccacca ctctgtggcg tcgcagcccc atgggggacc ccgtctgcaa caacattcac
gcctgcggcc tctactacaa actgcaccaa gtgaaccgcc ccctcacgat gcgcaaagac
ggaatccaaa cccgaaaccg caaagtttcc tccaagggta aaaagcggcg ccccccgggg
gggggaaacc cctccgccac cgcgggaggg ggcgctccta tggggggagg gggggacccc
tctatgcccc ccccgccgcc ccccccggcc gccgcccccc ctcaaagcga cgctctgtac
gctctcggcc ccgtggtcct ttcgggccat tttctgccct ttggaaactc cggagggttt
tttggggggg gggcgggggg ttacacggcc cccccggggc tgagcccgca gatttaaata
ataactctga cgtgggcaag tgggccttgc tgagaagaca gtgtaacata ataatttgca
cctcggcaat tgcagagggt cgatctccac tttggacaca acagggctac tcggtaggac
cagataagca ctttgctccc tggactgaaa aagaaaggat ttatctgttt gcttcttgct
gacaaatccc tgtgaaaggt aaaagtcgga cacagcaatc gattatttct cgcctgtgtg
aaattactgt gaatattgta aatatatata tatatatata tatatctgta tagaacagcc
tcggaggcgg catggaccca gcgtagatca tgctggattt gtactgccgg aattc
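
Since the message is hidden in the protein translation, reading it means translating the DNA
in each forward reading frame and looking through the one-letter amino acid codes. Here is a
minimal sketch using the standard genetic code; the short test string is made up, and you
would pass in the "Lost World" sequence yourself (spaces and line breaks are ignored).

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def translate(dna, frame=0):
    """Translate dna starting at offset frame (0, 1 or 2); '*' marks a stop."""
    clean = "".join(dna.split()).upper()  # drop spaces and line breaks
    protein = []
    for i in range(frame, len(clean) - 2, 3):
        # Codons with unexpected characters are shown as 'X'.
        protein.append(CODON_TABLE.get(clean[i:i + 3], "X"))
    return "".join(protein)


toy = "atggcc aaatag"  # made-up example, formatted like the handout's sequences
for frame in range(3):
    print(frame, translate(toy, frame))
```

Online translation tools do the same job if you prefer not to run code.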

EXERCISE 3a:
Once again, invoke Nucleotide-nucleotide BLAST (blastn) and copy and paste all or part of this
new "Lost World" sequence into the Search window and submit it to BLAST. Click the link to the
left of the highest-scoring match in the list of sequences. Which organism is this DNA sequence
from?

To see more about the organism click the ORGANISM link to the left. Do the same thing for the
second-highest-scoring match and identify the organism. Is either of these organisms related
to dinosaurs?
