Report on lattice-based MSMS hashing

by Brian | 2nd December 2009

To judge how effective it might be to compare spectra based on a hash of their peaks, I wrote a program that takes a list of spectra and converts them to hashes.

Within that set of spectra there were two pairs of sibling spectra that should be close matches, as I am confident that the spectra in each pair are derived from the same peptide. For each pair I took the distance between the siblings and set that as the distance threshold. I then found the distance between one of the siblings and every other spectrum in the set.
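The procedure above can be sketched roughly as follows. This is only an illustration: the function name `count_within_threshold` and its parameters are hypothetical, the distance function is passed in rather than fixed, and the comparison uses "less than or equal" on the assumption that the sibling comparison itself is counted.

```python
def count_within_threshold(query, sibling, library, distance):
    """Count spectra in `library` at least as close to `query` as its sibling.

    `library` is assumed to hold every spectrum except `query` (so it
    includes the sibling). `distance` is any function of two spectra.
    """
    # The sibling-to-sibling distance defines the threshold.
    threshold = distance(query, sibling)
    # Assumption: a spectrum exactly at the threshold counts as a hit.
    return sum(1 for other in library if distance(query, other) <= threshold)
```

With toy one-number "spectra" and `distance = lambda a, b: abs(a - b)`, a query of 10 with sibling 13 gives a threshold of 3, and a library of `[13, 11, 20, 8, 14]` yields a count of 3.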

Ideally, few spectra would fall below the threshold, since every non-sibling that does so is a false positive. Unfortunately this was not the case.

For sibling group one:

  • comparing to sibling one, 59 spectra fell below the threshold.
  • comparing to sibling two, 111 spectra fell below the threshold.

For sibling group two:

  • comparing to sibling one, 105 spectra fell below the threshold.
  • comparing to sibling two, 64 spectra fell below the threshold.

In the worst case (111 below the threshold), which we would have to assume to avoid discarding good matches, we could only eliminate 1 spectrum (because one of those was the comparison between the siblings, which defines and therefore equals the threshold value).

The lattice hash was created as a 4 by 4 matrix where the row corresponds to intensity and the column corresponds to mass. To divide the peaks evenly among the intensity rows, a spectrum's peaks were sorted by intensity and divided into four equal groups. The mass bin was determined by dividing a peak's mass by the precursor mass and multiplying by 4.
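A minimal sketch of how such a lattice could be built, not the original program. It assumes each peak is a `(mass, intensity)` pair, that each cell holds a count of peaks (the description above does not say what the cells store), and that a peak whose mass reaches or exceeds the precursor mass is clamped into the last column. The name `lattice_hash` is illustrative.

```python
def lattice_hash(peaks, precursor_mass, size=4):
    """Build a size x size grid of peak counts.

    Rows index the intensity group (by rank), columns index the mass bin.
    `peaks` is a list of (mass, intensity) pairs.
    """
    grid = [[0] * size for _ in range(size)]
    # Sort ascending by intensity so rank decides the row.
    ranked = sorted(peaks, key=lambda peak: peak[1])
    n = len(ranked)
    for rank, (mass, _intensity) in enumerate(ranked):
        # Equal-sized intensity groups: rank 0..n-1 maps onto rows 0..size-1.
        row = rank * size // n
        # Mass bin: mass / precursor mass * size, truncated.
        # Assumption: clamp to the last column if mass >= precursor mass.
        col = min(int(mass / precursor_mass * size), size - 1)
        grid[row][col] += 1
    return grid
```

For example, `lattice_hash([(100, 1), (300, 2), (600, 3), (900, 4)], 1000)` places one peak in each diagonal cell, because both intensity rank and mass increase together.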

Distance was computed by summing the absolute value of the difference between corresponding cells of two hash matrices.
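As a small illustration of that distance, assuming each hash is a list of rows of numbers with matching dimensions (the name `lattice_distance` is made up for this sketch):

```python
def lattice_distance(hash_a, hash_b):
    """Sum of absolute differences between corresponding cells."""
    return sum(
        abs(cell_a - cell_b)
        for row_a, row_b in zip(hash_a, hash_b)
        for cell_a, cell_b in zip(row_a, row_b)
    )
```

For the 2 by 2 grids `[[1, 0], [0, 2]]` and `[[0, 0], [3, 2]]` this gives 1 + 0 + 3 + 0 = 4. Identical hashes always give 0.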

Of course, both the distance formula and the lattice distribution could be changed, but these initial results are not encouraging. The matrix dimensions could also be increased, but for an n by n matrix the comparison time grows as O(n^2).
