MSMSFit: A method for comparing MSMS data to amino acid sequences

Fitting MSMS data to a proposed acid sequence is a bit messy. One has to account for potential ion types as well as data errors. The image above depicts sample MSMS data (top half) along side the amino acid from which it is derived (bottom half, with multiple lines for different ions).
However, when one accounts for the possible ions and accepts that some MSMS weights are red-herrings then point matches should be very close (within 0.02 Daltons). This method (for now dubbed MSMSFit) takes advantage of the preciseness of a correct match.
The process essentially walks from one acid to the next and, given the possible ions at that point, finds the closest MSMS peak. If the acid falls in a gap in MSMS data (such as in the left side of the above figure), then that acid is skipped. Error (the difference between the best fit ion and the MSMS peak) is squared and added to a tally. MSMSFit is then computed by this formula:
hits / ((1.0 + errorTally) * usedAcids);
- “hits” – the number of ions that had less than 0.02 Dalton difference from a peak.
- “errorTally” – the sum of the squared errors between best fit ions and peaks.
- “usedAcids” – the number of acids compared (error is less than 100 daltons).
The errorTally factor is not crucial and can possibly be removed. It is included to punish for loose fit points.
For a set of acid sequences, if we give each a HMM_score and a MSMSFit score and plot the results where HMM_score is the x-axis, a scatter chart like this is typical:

The correct sequence is the major outlier in the upper right corner. The correct sequence performs signifgantly better for both HMM_score as well as for MSMSFit.
Performance and computational complexity: In addition to being accurate, the process is very fast. MSMSFit runs in O(N + M) time where N and M are the cardinalities of the acid sequence and the peaks. To match 5 spectra in E. coli took less than a second.
Here is the code for MSMSFit:
-(float) getMSMSFit:(NSString *)acidSequence forPeaks:(NSArray *)peaks
{
int i, j, hits = 0, usedAcids = 0, peakIndex = 0, ionCount, seqLen = [acidSequence length];
float nextYion, currentFragMass, nextFragMass, currentPeakMass, minError, nextYError, errorTally = 0.0;
float errors[7];
float ionDiff[7];
int Type = MONOISOTOPIC;
double hydrogenMass = (Type == MONOISOTOPIC) ?HYDROGEN_MONO : HYDROGEN_AVE,
oxygenMass = (Type == MONOISOTOPIC) ? OXYGEN_MONO : OXYGEN_AVE,
carbonMass = (Type == MONOISOTOPIC) ? CARBON_MONO : CARBON_AVE,
nitrogenMass = (Type == MONOISOTOPIC) ? NITROGEN_MONO : NITROGEN_AVE;
float peptideMass = [self calculatePeptideMass:acidSequence];
ionCount = 3;
//yIon
ionDiff[0] = peptideMass + (2 * hydrogenMass);
//y18Ion
ionDiff[1] = peptideMass - oxygenMass;
//y17Ion
ionDiff[2] = peptideMass - nitrogenMass - hydrogenMass;
currentFragMass = AAMasses[(int)[acidSequence characterAtIndex:0]] +HYDROGEN_MONO; //adding a hydrogen to get b-ion
peakIndex = [peaks count] - 1;
currentPeakMass = [[peaks objectAtIndex:peakIndex] mass];
for (i = 0; i < seqLen - 1; i++) {
minError = currentFragMass; // a suffeciently large value to start
nextFragMass = currentFragMass + AAMasses[(int)[acidSequence characterAtIndex:i + 1]];
nextYion = ionDiff[0] - nextFragMass;
if (currentPeakMass > nextYion) {
while (true) {
if (peakIndex == 0) break;
currentPeakMass = [[peaks objectAtIndex:peakIndex] mass] +HYDROGEN_MONO;
for (j = 0; j < ionCount; j++) {errors[j] = fabs((ionDiff[j] - currentFragMass) - currentPeakMass);}
nextYError = fabs(nextYion - currentPeakMass);
if (nextYError < errors[1]) break;
BOOL gettingWorse = true;
for (j = 0; j < ionCount; j++) {gettingWorse = gettingWorse && (errors[j] > minError);}
if (gettingWorse) break;
for (j = 0; j < ionCount; j++) {if(errors[j] < minError) minError = errors[j];}
peakIndex--;
}
if ((minError < 0.02) && (minError < nextYError)){
errorTally += minError * minError;
hits++;
}
if (minError < 100.0)
usedAcids++;
}
currentFragMass = nextFragMass;
}
if (hits == 0) {return 0.0f;}
return hits / ((1.0 + errorTally) * usedAcids);
}
One thing to notice is that, at this point, MSMSFit accounts only for y-ions as those produce stronger results. This makes for somewhat difficult to follow as though we work forwards through the acid sequence we are subtracting from the overall sequence mass and working backwards through the peak array. Adding b-ions could add signifiant computation time (as the program would either have to merge and sort the two ion lists or keep track of which errors were best fit for a given acid cleavage point). Initial tests of MSMSFit have that only accounting for y-ions produces reliable results.
Also, for reference, the output errors for the MSMS spectrum which corresponded to ALNSVEASQPHQDQMEK. Keep in mind that errors of greater than 100 Daltons are not tallied. (I think the order is reversed):
72.044937
11.986694
0.005615
1.992188
0.007324
0.010986
0.005249
0.002441
0.001099
26.928284
0.001160
1262.613037
1377.640015
1505.698608
1636.739136
1765.781738