Created: 3rd January 1999, last updated: 4th January 1999, © 1999 ABRF
David D. Stranz (1) and LeRoy B. Martin III (2)
Contact Details:
(1) Sierra Analytics, LLC, 3453 Dragoo Park Drive, Modesto, CA 95350 USA
david_stranz@MassSpec.com
(2) Micromass, Inc., 100 Cummings Center, Suite 407N, Beverly, MA 01915
USA
Keywords: primary sequence, mass spectrometry, genetic algorithm, computer technique
The genetic algorithm (GA) is a recent approach to solving optimization problems in rugged solution spaces. Traditional algorithms often become trapped in local optima or require excessive computational time to find the global optimum. Because the genetic algorithm explores multiple regions of the solution space simultaneously, it can avoid the local optimum problem and can often find a solution in less time.
Algorithms for derivation of peptide sequence from MS and MS/MS data typically involve stepwise extension of partial candidate sequences until a best fit is achieved. Low data quality or ambiguous or missing fragment ions often cause this process to fail when the sequence cannot be further extended.
Unlike these algorithms, the genetic algorithm proposes complete candidate sequences. The GA selection operator chooses the best-fitting sequences for carry-over into successive refinement steps, and the crossover and mutation operators alter the selected sequences to obtain a better fit with the experimental data.
In this paper, initial results of the application of the GA to peptide sequencing from MS and MS/MS data are presented. The representation of the candidate sequences, fitness functions, and operators for selection, crossover, and mutation are described using example cases.
The genetic algorithm bears some resemblance to the natural process of evolution [1-3]. The algorithm operates on a set (population) of possible solutions to the problem (the individuals or genomes). Each individual is encoded with the parameters, coefficients, or other information used to determine its goodness of fit as a potential solution. Individuals with higher fitness represent better solutions to the problem. Populations are evolved through successive generations, each containing progressively higher fitness individuals.
The GA uses three types of operators to drive evolution. The most important of these is selection. The selection operator chooses those individuals from each generation with the highest fitness (by evaluating their objective or fitness functions) by weighting their probability of selection according to their fitness. Individuals with high fitness are more likely to survive into the next generation, while those that are unfit will die off. Thus, parameter values (genes) which represent higher fitness are propagated, and the next generation will contain higher fitness individuals.
While selection improves the overall fitness of the population, the selected individuals are unchanged. In order to improve individual fitness, two more operators are employed. Of these, crossover is more important. All variants of the crossover operator mix the genetic information from two selected individuals. So, using the selection operator, two individuals are chosen to be parents. Using the crossover operator, certain parameter values (chosen at random) are exchanged between the two parents to yield two children. The parents are discarded, and the children are placed into the next generation. In practice, the probability that the two parents will be crossed is usually less than unity, so often the parents are simply carried over intact into the next generation (becoming their own children, in effect). This helps to ensure that good parameter combinations are not always bred away.
However, selection and crossover operations can work only with the genes already present in the population. To introduce new genes into the population and to retain genetic diversity, the mutation operator is used. After selection and crossover, the mutation operator is applied to each child before it is placed into the next generation. The mutation operator alters the values of one or more parameters (subtly or drastically). Typically, mutation is given a low probability of occurrence (0.01 or less). Mutation is a key feature for avoiding local optima traps. If selection and crossover have caused a set of parameter values to become localized to a small range, mutation can knock an individual out of the set and into a regime of higher fitness. As in nature, though, more often than not mutation is fatal but still serves to keep diversity alive.
Probability and statistics govern all of the genetic algorithm functions. Because of this, no two "runs" of the GA on the same problem will produce the same results. This is a drawback to the GA, since there can be no guarantee that if an optimum result exists it will be found during any specific run. This is particularly true if the solution space is rugged, i.e. has many local optima or the global optimum is not greatly distinguished.
The statistical nature of the GA also leads to the problem of determining when to stop. Most GA implementations define the termination criterion to be typically one of:
It is not intuitively obvious why an algorithm based on random chance should be any better at finding good solutions than simply generating possibilities at random. The reason why the GA works is that, while specific decisions are made according to chance (selection, crossover, and mutation), the forces driving the algorithm are anything but random.
At each generation, only the individuals representing the best solutions make it into the next. Properly-designed crossover and mutation operators and properly chosen probabilities of application of those operations ensure that good parameter values become better on an individual basis, and that population diversity is maintained from generation to generation.
All computer programs were written in the C++ language using the Microsoft(R) Visual C++(R) Developer Studio, Version 4.2. Computational trials were run on Intel(R) Pentium(tm) or Pentium Pro(tm) personal computers running the Microsoft(R) Windows NT(R) Version 4.0 operating system.
Samples of renin substrate (DRVYI HPFHL LVYS) and glu-fibrinopeptide (EGVND NEEGF FSAR) were purchased from Sigma Chemical Company, and were prepared as solutions at 1 pmol/microlitre concentration in 50% acetonitrile/water with 0.2% formic acid added.
Collision-induced dissociation mass spectra were acquired using a Micromass, Ltd. QTOF(tm) orthogonal extraction quadrupole-time of flight mass spectrometer in MS/MS mode. Samples were introduced using electrospray ionization at a flow rate of 5 microlitres/min, needle voltage at 3 kV, and counter-electrode voltage at 500 V. The quadrupole (MS1) was set to pass the doubly-charged parent ion [M + 2H]2+ with open resolution of 4 Da. Collision voltage was 30 V. Resolution of the TOF (MS2) was 6500.
Following acquisition, spectra were combined, smoothed using a Savitzky-Golay algorithm, and centroided. Because of the open resolution on MS1, the resulting spectra contained features due to multiple isotopomers. To remove these, the spectra were further refined using a reduction step to remove less chemically significant ions within a sliding mass window, followed by a proprietary deisotoping algorithm that merges isotopic clusters into a single monoisotopic ion. For the renin substrate spectrum, this post-processing resulted in a refined spectrum containing only 182 peaks of the original 375, but which retained 92% of the original intensity.
A GA-based approach was developed for interpreting peptide MS and MS/MS spectra to derive sequence. Peptide sequencing represents a good area for GA application, since the solution space (number of peptides that could match a given spectrum) is large and the traditional approaches [4, 5] are computationally intensive: For example, each step in a stepwise extension algorithm must consider 20 possibilities. If the peptide has a length of N residues, the stepwise solution thus scales as 20^N.
For this work, peptides containing only unmodified amino acid residues were considered. Computationally, the peptides were represented as a linked list where each node in the list contained the single letter abbreviation for the residue. A lookup table was used to translate between residue abbreviation and molecular weight.
This representation permits peptides of arbitrary length, yielding a population of various-length peptides. Selection, crossover, and mutation force the population to a distribution of lengths about a problem-determined mean by eliminating peptides that are too short or too long.
Each peptide in the initial population was constructed by adding randomly selected residues to the end of the list until the peptide mass exceeded the measured MW. The population size was typically chosen as 250 peptides to obtain adequate residue sampling statistics.
Two peptide properties are measured in mass spectrometry: the overall molecular weight of the peptide and the ion series characteristic of fragmentation along the peptide backbone. The objective function, F(pep), used in this work was designed to reflect these properties.
A partial score, f1(pep), that measures goodness of molecular weight fit was computed as follows:
| MW Match Score: | When: |
| 1.0 | mw(meas) - tol <= mw(pep) <= mw(meas) + tol |
| [mw(meas) - mw(pep)] / mw(meas) | mw(pep) < mw(meas) - tol |
| [2 * mw(meas) - mw(pep)] / mw(meas) | mw(meas) + tol < mw(pep) <= 2 * mw(meas) |
| .0001 | mw(pep) > 2 * mw(meas) |
where mw(meas) is the experimentally-determined molecular weight of the unknown peptide, mw(pep) is the calculated molecular weight for the proposed peptide, and tol is the mass measurement tolerance.
A second partial score, f2(pep), based on a standard intensity-weighted spectral comparison function used in GC/MS library searching, was used to compute the similarity of the MS/MS spectrum predicted for the proposed peptide to the experimentally-determined spectrum. The comparison rewards features that match between the two spectra with an increased score, and discounts the score for unmatched features in either spectrum. The metric yields a value between 0.0 (no match) and 1.0 (perfect match). For this work, the predicted spectrum included the singly- and doubly- charged ions for the b and y'' series and internal acyl fragments.
The overall fitness of a proposed peptide was computed as the average of these two partial scores. Prior to selection, the raw population fitness scores were normalized to the 0.0 to 1.0 range.
Because the two parent peptides selected for crossover could be of different lengths, the crossover operator must be insensitive to length as well as be capable of generating children of lengths different than either parent.
The simplest list-based crossover operator is single-point crossover. A number is chosen between 1 and the length of the shorter of the two parent peptides. The list contents following this point are exchanged to yield two children (each of which has a length equal to one of the parents). (See Figure 1a).
This scheme can be extended to uniform multi-point crossover. At each point along the shorter parent peptide a coin is tossed; if it lands heads up, the residues at that location are exchanged, otherwise they are left in place. When the end of the shorter chain is reached, the coin is flipped again, and if heads up, the remaining residues from the longer parent are added to end of the shorter one to yield the two children. (See Figure 1b)
Neither of these schemes can generate children of lengths different from their parents. A variant of the single-point crossover chooses a point within each parent peptide. The two ends are exchanged, yielding two children of possibly different lengths than either parent. (See Figure 1c)

Figure 1: Peptide crossover operators
Once child peptides were created through selection and crossover, one of several mutation operators was employed, each with small probability of occurrence (See Figure 2a-f)
All of these operators have the ability to introduce residue positional diversity, but only the random replacement mutator has the ability to introduce compositional diversity into the population. This ensures that specific amino acids are not forever lost.
For example, if no peptide in a given generation has an arginine at the N terminus, none of the crossover operators could generate a child with an N-terminal arginine. Only through mutation could an N-terminal R be introduced into the population.
If no peptide in a given generation contains arginine in any sequence position, then only the random replacement mutator can re-introduce arginine into the population.

Figure 2: Peptide Mutation Operators
Numerous trials of the GA were run on the processed spectra from each peptide to determine the effects of various parameters on the performance of the algorithm. Because of the stochastic nature of the algorithm, trials were not strictly reproducible, so several runs were made at each parameter setting to obtain an average.
Parameters included the number of peptides in the population, number of iterations (generations), and crossover and mutation probabilities. In addition, performance of two different GA implementations was examined: one that operated with complete replacement of the population in each generation (with and without elitism), and a steady-state algorithm that replaced only a fraction of the peptides in each cycle. The percentage replacement was a variable parameter.
Two typical trials for renin substrate are shown in Figure 3. This graph shows the improvement in average score over the population and best individual score within the population at each generation, using both GA (simple and steady-state) implementations. As can be seen, the steady-state algorithm gives much better performance, despite the fact that it only replaces the worst 5% of the population at every cycle (and therefore only considers 5% as many total peptides over the course of the trial). This is in accord with observations in other GA applications.

Performance of the steady-state GA was characterized by a rapid convergence to a single best peptide for both molecular weight and spectral match. This peptide remained best for a long period, and then was abruptly replaced by a better one. This corresponds to the jumps in score in Figure 3. Prior to one of these jumps, the numerical range of the partial scores became narrower as the minimum approached that of the best score. Following the jump, the range widened. At each successive jump, however, the range became less wide and the population diversity gradually decreased.
The simple GA was characterized by a similarly rapid convergence to a best peptide for the spectral match only; the best matching peptide for the molecular weight score was often the same as this, but showed considerable fluctuation. The range of each partial score was much wider than that of the steady-state GA, and showed no particular pattern. The combined score fluctuated about its initial value of about 0.50 with no improvement.
Neither GA was very successful at elucidating the correct sequences, despite generating high-scoring peptides. In general, the best peptides that were generated were exact matches to the molecular weight (within experimental mass measurement tolerance), but were only moderate to good spectral matches. The major features in the spectrum were matched, but an examination of the assignments showed mostly fortuitous matches to intense peaks by individual series or acyl ions. There were generally no extended series matches (eg. b2+, b3+, b4+, ...) or many matches to minor ions.
The overall best-matching peptides typically had short, correct, internal sub-sequences that matched acyl ions, or short pieces of the N- or C-terminal sequence matching series ions. None of the mutators appeared to be capable of rearranging these best-matching peptides into better ones, despite very high (by usual GA standards) mutation probabilities (0.01 - 0.10).
Several trials were performed using spectral matching as the sole score (molecular weight was ignored). In general, this produced peptides which were good spectral matches but poor molecular weight matches. The peptides were mostly somewhat too long, generating additional predicted ions that fortuitously matched spectral features. The scoring penalty for extra unmatched predicted ions kept these peptides from growing without limit (since the longer the peptide, the more possibilities for a match to an acyl ion).
To reduce the reward given to fortuitous matches of predicted acyl ions, two approaches were tried. In one, no acyl ions were predicted for the candidate peptides, only y and b series ions. In the other, match weighting was introduced. The weighting gave full score to y and b series ion matches, but only 50% score to acyl ions. Neither approach gave a high degree of improvement. The first failed because the spectra do contain some acyl ion features, and failure to match those reduces the score. The second gave somewhat better results so it was retained.
The results obtained in this preliminary study suggest that a genetic algorithm can ultimately be designed which will successfully derive peptide sequence from MS/MS data. The algorithms implemented here easily and quickly find peptides that match a given molecular weight and that contain short pieces of the correct sequence. The steady-state algorithm is much faster and provides better overall performance than the simple GA, with or without elitism. Unlike the exponential scaling of step-wise extension algorithms for peptide sequencing, the computational time for the genetic algorithm scales linearly with peptide length, providing much faster execution for longer peptides.
It is apparent that the strictly numerical spectral match algorithm employed here is not discriminating enough to force convergence to peptides with highly fit spectral matches. Work is under way to develop a better scoring scheme, one that gives significant reward to correct series ion matches and strongly discriminates against fortuitous matches.
It is also apparent that the genetic algorithm will be applicable to sequencing any type of biopolymer (peptide, modified peptide, oligonucleotide, etc.) whose spectra are regular and predictable from sequence. The algorithms here were deliberately developed to be independent of the type of biopolymer; the only dependence is obviously in the computation of molecular weights and predicted spectra but these are completely replaceable components. To sequence oligonucleotides, for example, one needs only to substitute a table containing the nucleotides and their masses, and an oligonucleotide-specific spectral prediction function.
Extension of the algorithm to sequence modified peptides is also feasible. The simplest method is to augment the pool of residues with a set of modified residues. The algorithm would construct peptides containing both modified and unmodified residues, and the replacement mutator would select from this pool to produce modified peptides. A more sophisticated approach would use a mutator that only applied a modification to a residue if the surrounding consensus sequence was appropriate, and an initializer that did the same on construction. This would eliminate infeasible peptides.
1. Holland JH. 1975. Adaptation in natural and artificial systems. Ann Arbor: The University of Michigan Press.
2. Goldberg DE. 1989. Genetic algorithms in search, optimization, and machine learning. Reading, MA: Addison-Wesley Publishing Company, Inc.
3. Davis L, ed. 1991. Handbook of genetic algorithms. New York: Van Nostrand Reinhold.
4. Siegel MM, Bauman N. 1988. An efficient algorithm for sequencing peptides using fast atom bombardment mass spectral data. Biomed. Environ. Mass Spectrom. 15:333-343.
5. Hines WM, Falick AM, Burlingame AL, Gibson BW. 1991. Pattern-based algorithm for peptide sequencing from tandem high energy collision-induced dissociation mass spectra. J. Am. Soc. Mass Spectrom. 3:326-336.