created: 14th January 1998, last updated: 14th January 1998, © 1998 ABRF

DATABASE SEARCHING USING SHORT PEPTIDE SEQUENCES

Bruno A. Gaëta

The University of Sydney

Introduction

Sequence database similarity searching is one of the most common computing techniques in modern biology. It allows the large repositories of DNA and protein sequence information to be queried using a sequence, with the goal of identifying database sequences homologous to this query sequence. This technique can be extremely useful in the study of gene and protein structure, function and evolution.

One particular application of this technique in protein biochemistry is the identification of a protein using the sequence of a short peptide fragment, which could be obtained for example by amino-terminal sequencing of a spot on a 2-D electrophoresis gel. However this type of search often does not produce results because typical similarity search programs are optimized for longer query sequences, and must be especially configured to handle short queries.

Most database similarity search programs are, in a nutshell, sequence alignment programs. Their fundamental principle is to find the best alignment between the query sequence and every sequence in the database. For protein sequences, this involves taking into account not only amino acids which are identical between the two sequences, but also 'related' amino acids which may indicate a common origin or function for the two sequences (for example, a Lys-Arg mismatch is much less of a mismatch than Lys-Glu). The program can then derive a 'similarity score' from each of these alignments, by assigning positive values to matches and negative or zero values to mismatches or gaps, then report these database sequences with the highest similarity score.

In practice, aligning a whole sequence database with a query sequence is unmanageable on all but specially designed massively parallel computers. Database similarity search programs must therefore use 'shortcuts' in order to keep search times down to a practical level, without too much loss in sensitivity. The two most common implementations of this approach are BLAST (1) and FastA (5). This article discusses both methods with special emphasis on their use with short peptide query sequences.

BLAST

BLAST (Basic Local Alignment Search Tool) is, in fact, a collection of five different programs which allow different combinations of nucleic acid and protein query sequences and databases to be used. The programs of most relevance to protein identification from short peptides are blastp, which searches a protein sequence database, and tblastn, which searches the 6 possible translations of the sequences in a DNA sequence database. A common way of accessing BLAST is through the World-Wide Web, for example on the public access server at the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/). The software itself is freely available by anonymous ftp (ftp://ncbi.nlm.nih.gov/blast).

BLAST speeds up its search of the database by first looking up an 'index' of every oligomer (by default, every tripeptide) in the database for oligomers showing a sufficient degree of similarity to those present in the query peptide. This allows the program to identify very quickly which database sequences show some similarity with the query. BLAST then tries to extend these initial regions of similarity into a larger ungapped alignment. This alignment without gaps is called an HSP (high-scoring segment pair).

Gapped BLAST (3) and WU-BLAST(http://blast.wustl.edu/) are new versions of BLAST able to refine these HSPs by inserting gaps in the sequences, therefore giving rise to gapped alignments which make more biological sense than ungapped HSPs. Both programs are also faster than the original BLAST. In gapped BLAST (available at NCBI, http://www.ncbi.nlm.nih.gov/), only database sequences containing at least two nearby oligomers with some similarity to the query sequence are considered for further analysis and extension. This 'two-hit' approach increases the sensitivity of the search as well as its speed, provided there is sufficient similarity between the sequences. However, for short peptide sequences which would give rise to low-scoring HSPs, the original version of BLAST appears more sensitive than gapped BLAST. WU-BLAST is another refinement of the original BLAST algorithm which provides, faster, more sensitive searches, and gapped alignments. ISREC, the Swiss Institute for Experimental Cancer Research (http://www.ch.embnet.org/) is one site providing free access to WU-BLAST on the WWW.

For each of the alignments produced by the search program, a score can be calculated. For protein sequence alignments, each amino acid pair is given a score taken from a scoring matrix (a table assigning a value to every possible amino acid pairing) and the alignment score is calculated by adding up individual amino acid pair scores and subtracting penalties for gaps present in the alignment.

A major strength of BLAST is the statistical significance evaluation it performs on the results. Given the score of an alignment, BLAST is able to evaluate its statistical expectation, which is related to the probability that this alignment is a chance occurrence. Statistical expectations are easier to deal with than alignment scores because they are independent of the scoring scheme used to calculate alignment scores. For example, knowing that an alignment has a score of 100 is not very informative in the absence of a reference scale. On the other hand, describing this alignment as having a 10% probability of being a chance occurrence gives a good idea of how significant the match is. In a typical BLAST output, this probability is given in the P(N) column, next to the name and description of the database entry (Figure 1a and 1b).

After creating the alignments, BLAST screens them according to their likelihood of being a chance occurrence. The key parameter in this process is the Expect (Statistical expectation) value, which controls the level of similarity with the query required for a database sequence to be reported as a match. The Expect value corresponds to the number of sequences expected to be found in the database by chance alone. By default, the Expect value is 10, which means that if the database contained only random sequences, 10 of these sequences would be reported as having a suitable level of similarity with the query given the size of the database. Lowering the Expect value makes the search more 'stringent' since fewer sequences with lower similarity (which are considered more likely to be random matches) are reported.

When a peptide sequence of 8 to 10 residues is used as the query for a standard BLAST search, the program often fails to report any matching sequences in the database, even though such sequences are present. The major reason for this behavior is the statistical screening process: a short peptide sequence has a high probability of matching sequences in the database by chance alone, and therefore most of the database matches are considered chance occurrences and rejected. This can be illustrated by the following crude example: if a sequence database contains 108 residues, a random decapeptide has a probability of approximately 108/2010 x 10-5 of occurring by chance alone, since there are approximately 108 10-mers in the database and 2010 possible decapeptides. The statistical expectation for a random decapeptide is therefore ~108 x 10-5 = 1000, well in excess of the Expect default value of 10 (note that the statistics used by BLAST are much more sophisticated than those used in this crude example, and do take into account the frequency of occurrence of the different amino acids and the length of the sequences compared). The smaller the query peptide and the larger the database to search, the more likely the peptide to occur by chance alone, and therefore of the 'real' matches to be rejected as chance occurrences.

The logical solution to this problem would to use a much higher Expect value. Unfortunately most BLAST implementations do not allow the Expect parameter to be set to more than 1000, which is not usually sufficient. It is therefore necessary to 'fool' the program into assuming that the database is much smaller than it actually is, so that the number of matches expected by chance alone is lowered. This is done by modifying the 'effective database length' or Z parameter, which controls the value that the program uses as the number of residues in the database for its calculations. By default, the actual number of residues in the database is used as the value of this parameter, but this can be changed in some BLAST implementations. For example, the BLAST interface available to users of the Australian National Genomic Information Service (ANGIS) allows the effective database length to be changed directly by typing in a new value in the corresponding entry box. If the WWW at NCBI is used, the Advanced BLAST search form must be used, and 'Z=' followed by the new value of Z entered in the 'other advanced options' box (e.g.., to use an effective database length of 100,000, 'Z=100000' must be entered in the 'other advanced options' box). Which value to use depends on the length of query peptide and on the size of the database to be searched. It is usually simplest to experiment with a few values ranging from 100,000 to 10,000,000 and to examine the results (especially the alignments) to find a suitable compromise between the identification of homologous sequences and the reporting of 'false positives' (similarities occurring by chance alone). The effect of reducing the effective database length when using a short peptide query sequence is demonstrated in Figure 1a and 1b.

Note that changing the effective database length invalidates the statistics, and that the results can only be interpreted by using biological knowledge when looking at the alignments and the function of the sequences identified, since the numbers (expectation and probability) are no longer valid.

Another factor which may affect the result of the search is the presence of low complexity regions in the query peptide. A low complexity region can be a simple repeat or a region of abnormal amino acid composition, which can bias the statistics when present in a query sequence. Because of this, many BLAST implementations, including the one at NCBI, filter out by default low complexity regions from the query sequence (discussed in ref. 2). This can present a problem with short queries, which can be completely 'filtered out' if their composition is biased. It is therefore a good idea to run the search with and without filtering and to compare the results to make sure that these are not affected by compositional bias.

FastA

FastA (5) is a similarity search program which can be used to search a nucleotide sequence database with a nucleotide query sequence, or a protein sequence database with a protein query sequence. Its companion program TFastA (or FastA-Trans) is used to search a 6-frames translation of a nucleotide sequence database with a protein query sequence. FastA is distributed freely and is available for a range of platforms at ftp://ftp.virginia.edu/pub/fasta (the installation requires that the sequence databases be installed on your local computer). FastA can also be accessed through a number of public WWW sites such as the Baylor College of Medicine search launcher (http://gc.bcm.tmc.edu:8088/search-launcher/launcher.html) and the European Bioinformatics Institute (http://www2.ebi.ac.uk/fasta3/).

FastA accelerates database searching by using several passes over the database and only retaining a 'best matching' subset for further analysis at each pass, therefore 'pruning down' the database progressively. The first pass is similar to BLAST's, in that short sequence 'words' are compared for rapid detection of small regions of similarity. However, FastA uses a smaller word size (called k-tuple or ktup in this case): by default, 6 for nucleic acids, and 2 for proteins (versus 11 and 3 for BLAST). Unlike BLAST, FastA requires the words to match perfectly, and may overlook at this stage some weak but significant similarity between protein sequences (for example, a Lys-Arg match will be ignored). With a protein sequence query, decreasing the ktup to 1 can increase the sensitivity of the search significantly, at the expense of speed.

The program then extends the initial short regions of similarity into alignments without gaps, the best of which are subsequently joined into a longer alignment (whose score is designated the initn score).

Finally, regions with a high initn are aligned with the query sequence using a slower, more sensitive method. The score calculated for these gapped alignments is called the opt score. By default this score is used to sort the sequences in the output.

The more recent versions of FastA perform a statistical evaluation of the results similar to that of BLAST, albeit less rigorous, since the statistical model used assumes alignments without gaps (2). First, a normalized score called the z-score is derived from one of the other scores (by default, the opt score). This score is then converted into a statistical expectation value, which approximates the probability that a given match is a chance occurrence (when under 0.05). Database sequences with an expectation value higher than the Cutoff expectation parameter are listed in the program output, together with the various scores (Figure 1c). The statistical expectation is given in the E column.

The calculation of the statistical expectation from the z-score is based on alignments between the query and a large number of sequences sampled at random from the database, which are used as a representative sample of chance alignments. Since this process is less stringent than the strict statistics used by BLAST, FastA reports many more significant matches than BLAST when a short peptide is used as the query sequence (when both programs are used in their default configuration). An example is shown in Figure 1c. Even then, it is recommended to increase the value of the Cutoff expectation parameter (for instance, to 100 instead of the default value of 10) when using FastA with a short query sequence.

(a)
                                                                    Smallest
                                                                      Sum
                                                               High Probability
Sequences producing High-scoring Segment Pairs:               Score P(N)   N
 
pir|-|E49958 AMP-activated protein kinase, 63K, catalytic...   44   0.59   1

 

(b)
                                                                    Smallest
                                                                      Sum
                                                               High Probability
Sequences producing High-scoring Segment Pairs:               Score P(N)   N
 
pir|-|E49958 AMP-activated protein kinase, 63K, cataly...      44 0.0057   1
sp|-|AAK1_PIG 5'-AMP-ACTIVATED PROTEIN KINASE, CATALYTI...     44 0.067    1
gp|-|G2077991 H.sapiens mRNA for AMP-activated protein ...     44 0.082    1
sp|-|AAK1_RAT 5'-AMP-ACTIVATED PROTEIN KINASE, CATALYTI...     44 0.090    1
gp|-|G1326255 Caenorhabditis elegans cosmid T01C8              44 0.090    1
sp|-|AAK2_HUMAN 5'-AMP-ACTIVATED PROTEIN KINASE, CATALYTI...   37 0.59     1
sp|-|AAK2_RAT 5'-AMP-ACTIVATED PROTEIN KINASE, CATALYTI...     37 0.59     1
pir|-|S51025 AMP-activated protein kinase - human              37 0.59     1
gp|-|G862473 Rattus norvegicus 5'-AMP-activated protei...      37 0.59     1
gp|-|G581218 E. coli rpmH gene for ribosomal protein L34       24 0.94     1
gp|-|G699230 Mycobacterium leprae cosmid B2266                 33 0.96     1
gp|-|G1771144 L.delbrueckii pepG and pepW genes and unk...     32 0.99     1
sp|-|YH04_YEAST HYPOTHETICAL 91.2 KD PROTEIN IN RPS7A-SCH...   32 0.99     1
pir|-|S65519 carcinoembryonic antigen-binding protein,...      26 0.996    1
gp|-|G736297 S.cerevisiae chromosome XIII cosmid 8325          31 0.996    1
pir|-|S59441 hypothetical protein YMR200w - yeast (Sac...      31 0.996    1
sp|-|YM56_YEAST HYPOTHETICAL 28.9 KD PROTEIN IN CLN1-RAD1...   31 0.997    1
sp|-|TYSY_MYCGE THYMIDYLATE SYNTHASE (EC 2.1.1.45) (TS)        31 0.997    1
sp|-|TYSY_LACCA THYMIDYLATE SYNTHASE (EC 2.1.1.45) (TS)        31 0.997    1
gp|-|G149601 L.casei thymidylate synthase (thyA) gene,...      31 0.997    1
sp|-|RUVB_SYNY3 HOLLIDAY JUNCTION DNA HELICASE RUVB            31 0.997    1
gp|-|G1743244 H.contortus mRNA for glutamate gated chlo...     31 0.998    1
pir|-|B32735 thyroglobulin - sheep (fragment)                  22 0.998    1
gp|-|G599989 S.cerevisiae chromosome IX cosmid 8277            31 0.998    1
sp|-|YIM8_YEAST HYPOTHETICAL 117.9 KD PROTEIN IN FKH1-STH...   31 0.998    1
gp|-|G1737175 Saccharomyces cerevisiae DNA repair/trans...     31 0.998    1
gp|-|G736704 Human cytoskeleton associated protein (CG...      30 0.9995   1
gp|-|G1905902 Homo sapiens DNA from chromosome 19-cosmi...     30 0.9995   1
pir|-|S42513 recombination-activating protein RAG-2 - ...      22 0.9996   1
sp|-|MEXA_PSEAE MULTIDRUG RESISTANCE PROTEIN MEXA PRECURSOR    30 0.9998   1
pir|-|S50865 avermectin-sensitive glutamate-gated chlo...      30 0.9998   1
gp|-|G1627519 A.vinelandii algG gene                           30 0.9998   1
gp|-|G1072219 Caenorhabditis elegans cosmid R03E9              30 0.9998   1
gnl|gp7|G664874 Saccharomyces cerevisiae chromosome XII c...   30 0.9998   1
gp|-|G208220 E.coli dnaA-lacZ fusion protein gene frag...      24 0.9999   1
gp|-|G554123 Mouse Ig rearranged kappa-chain mRNA V8-J...      21 0.99992  1
pir|-|JS0244 hypothetical 2.85K protein - liverwort (M...      22 0.99992  1
gp|-|G999431 Pleurochrysis carterae chloroplast rpl27 ...      22 0.99992  1
 

 

 
(c)
The best scores are:                                initn init1 opt z-sc E(256142)
E49958 pir|-:AMP-activated protein kinase, ( 18)     57 57 57 210.3 0.00017
AAK1_PIG sp|-:5'-AMP-ACTIVATED PROTEIN K ( 132)      57 57 57 197.3 0.0009
G2077991 gp|-:H.sapiens mRNA for AMP-activ ( 257)    57 57 57 192.9 0.0016
AAK1_RAT sp|-:5'-AMP-ACTIVATED PROTEIN K ( 548)      57 57 57 188.0 0.003
G1326255 gp|-:Caenorhabditis elegans cosmi ( 562)    57 57 57 187.8 0.003
AAK2_HUMAN sp|-:5'-AMP-ACTIVATED PROTEIN K ( 552)    48 48 48 157.3 0.15
AAK2_RAT sp|-:5'-AMP-ACTIVATED PROTEIN K ( 552)      48 48 48 157.3 0.15
S51025 pir|-:AMP-activated protein kinase ( 552)     48 48 48 157.3 0.15
G862473 gp|-:Rattus norvegicus 5'-AMP-acti ( 552)    48 48 48 157.3 0.15
S59441 pir|-:hypothetical protein YMR200w ( 218)     32 32 41 139.6 1.5
G736297 gnl|gpd:S.cerevisiae chromosome XI ( 218)    32 32 41 139.6 1.5
G699230 gp|-:Mycobacterium leprae cosmid B ( 384)    42 42 42 139.3 1.5
YM56_YEAST sp|-:HYPOTHETICAL 28.9 KD PROTE ( 256)    32 32 41 138.5 1.7
TYSY_MYCGE sp|-:THYMIDYLATE SYNTHASE (EC 2 ( 287)    37 37 40 134.4 2.9
TYSY_LACCA sp|-:THYMIDYLATE SYNTHASE (EC 2 ( 316)    37 37 40 133.8 3.1
G149601 gp|-:L.casei thymidylate synthase ( 329)     37 37 40 133.5 3.2
G1743244 gp|-:H.contortus mRNA for glutama ( 432)    25 25 40 131.7 4
G396866 gp|-:A. thaliana transcribed seque ( 55)     36 36 36 131.6 4.1
PSBO_SYNY3 sp|-:PHOTOSYSTEM II MANGANESE-S ( 274)    36 36 39 131.3 4.3 

 

Figure 1. Database similarity search using a short protein sequence. A non-redundant protein database combining the SWISS-PROT, PIR and GenPept databases (total length 78,254,572 residues) was searched for sequences similar to the octapeptide MSLQLYQVD (from the N-terminus of the pig AMP-activated protein kinase sequence). The matching database sequences reported by the following programs are listed: (a) BLASTP with default settings, (b) BLASTP with the effective database length (Z) set to 500000, (c) FastA with default settings.

Conclusion

BLAST and FastA are the two most popular methods for similarity searching, and it is good practice to try both when searching a database for related sequences. A comparison of various database similarity search methods has demonstrated that FastA used with a ktup of 1 is more sensitive than BLAST for detecting distant protein homologies, mainly because it produces more meaningful gapped alignments (6). This, together with its more suitable statistical handling of short query sequences, makes FastA a better algorithm for searching sequence databases for entries similar to a short peptide sequence. However, FastA is much slower than BLAST, especially with the ktup set to 1, and there is a scarcity of public WWW servers offering a FastA search option for some of the larger, most useful databases. In particular, FastA is not available on a public server to search the non-redundant protein database maintained at NCBI, which supports only BLAST. FastA searches of non-redundant, up to date databases may therefore be available only to some investigators, through in-house or subscription-based services (which may also provide BLAST searches of the same databases, therefore improving the integration and management of the results).

In practice, the quality of the database (how up to date, non-redundant and complete it is) is a key factor determining the sensitivity and the usefulness of a database similarity search (7). The advanced BLAST search of the non-redundant protein database at NCBI is therefore the best option for researchers who have access only to public database search servers, especially since the introduction of the gapped BLAST service. However, when searching the database using short peptide query sequences, it may be necessary to reduce the effective database length (therefore invalidating the statistical analysis) in order to obtain any results.

References

1. Altschul, S. F., Gish, W., Miller, W., Myers, E. and Lipman (1990) Basic Local Alignment Search Tool. J. Mol. Biol. 215:403.

2. Altschul, S. F., Boguski, M. S., Gish, W and Wooton, J. C. (1994) Issues in Searching Molecular Sequence Databases. Nature Gen. 6:119.

3. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman D. J. (1997), Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389.

5. Pearson, W. R. and Lipman, D. J. (1988) Improved Tools for Biological Sequence Comparison. Proc. Natl Acad. Sci. USA 85:2444.

6. Pearson, W. R. (1995) Comparison of methods for searching protein sequence databases. Protein Science 4:1145.

7. Pearson, W. R. (1996) Effective protein sequence comparison. Meth. Enzymol. 266:227.


Return to the ABRF Home Page