Tandem Mass data hash

by Brian | 9th October 2009

“Tandem mass data hash”: try saying that five times fast. To get a better feel for how MS/MS data will be represented as a hash, I wrote a quick visualizer.

How the MS/MS data is hashed: an N×M bin matrix is computed, where N is the number of mass bins and M is the number of intensity bins. Here I’ve used 4 for both. I first went through all the MS/MS data and found the maximum mass and maximum intensity, which set the thresholds that decide which bin each mass/intensity value falls into. I took the natural log of the intensity so that lower intensities are not zeroed out or dwarfed by large outlier intensities.
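
As a rough sketch of the binning described above (not the code behind the figures in this post), the following assumes equal-width bins running from 0 up to the global maximum, and clamps log-intensities below 0 (intensities under 1) into the lowest bin. The function names `global_maxima`, `bin_index`, and `bin_matrix` and the example peak values are made up for illustration.

```python
import math


def global_maxima(spectra):
    """Return (max mass, max ln(intensity)) over every peak in every spectrum."""
    max_mass = max(mass for peaks in spectra for mass, _ in peaks)
    max_log_intensity = max(
        math.log(intensity)
        for peaks in spectra
        for _, intensity in peaks
        if intensity > 0
    )
    return max_mass, max_log_intensity


def bin_index(value, max_value, n_bins):
    """Map value onto one of n_bins equal-width bins spanning [0, max_value]."""
    if max_value <= 0:
        return 0
    index = int(value / max_value * n_bins)
    # Assumption: out-of-range values are clamped into the first or last bin.
    return min(max(index, 0), n_bins - 1)


def bin_matrix(peaks, max_mass, max_log_intensity, n_mass_bins=4, n_intensity_bins=4):
    """Count peaks per (mass bin, log-intensity bin) cell."""
    matrix = [[0] * n_intensity_bins for _ in range(n_mass_bins)]
    for mass, intensity in peaks:
        if intensity <= 0:
            continue  # ln is undefined here; skipping such peaks is an assumption
        row = bin_index(mass, max_mass, n_mass_bins)
        col = bin_index(math.log(intensity), max_log_intensity, n_intensity_bins)
        matrix[row][col] += 1
    return matrix


# Hypothetical (mass, intensity) peaks, not taken from the real data set.
example_peaks = [(100.0, 10.0), (500.0, 1000.0), (900.0, 50.0), (950.0, 1000.0)]
max_mass, max_log_intensity = global_maxima([example_peaks])
for row in bin_matrix(example_peaks, max_mass, max_log_intensity):
    print(row)
```

Each row of the result is one mass bin and each column one intensity bin, so a larger count corresponds to a darker cell in the visualizations.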

An example visualization of a spectrum when parceled out into a bin matrix:
[Image: msms hash]
The darkness of a bin represents the number of peaks that fall in that mass/intensity range. In this spectrum, there were very few peaks with low intensity and none with high intensity.

The tandem mass data I was working with contained two amino acid sequences that were each represented twice. Here are the “standard” representations of those spectra (each peak is drawn as a vertical line whose position is determined by the peak’s mass):

HGTDDGVVWMNWK
precursor-mass: 1545.12
precursor-mass: 1545.52

KGGETSEMYLIQPDSSVKPYR
precursor-mass: 2386.29
precursor-mass: 2385.91

Here are the visualizations for how their spectra are broken down into bins:

HGTDDGVVWMNWK:

KGGETSEMYLIQPDSSVKPYR:

While the two representations of KGGETSEMYLIQPDSSVKPYR look fairly similar, there is an unfortunate amount of dissimilarity between the two HGTDDGVVWMNWK spectra. A matching algorithm using this bin technique and this data could very well miss the HGTDDGVVWMNWK match.
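
The post does not say how a matching algorithm would score two bin matrices against each other. One simple possibility, assumed here purely for illustration, is to normalize each matrix by its total peak count and sum the absolute differences between cells, so that 0 means identical distributions and 2 means no overlap at all. The names `normalized` and `bin_distance` are hypothetical.

```python
def normalized(matrix):
    """Scale a bin matrix so that its cells sum to 1 (all zeros if it is empty)."""
    total = sum(sum(row) for row in matrix)
    if total == 0:
        return [[0.0] * len(row) for row in matrix]
    return [[count / total for count in row] for row in matrix]


def bin_distance(a, b):
    """L1 distance between two normalized bin matrices of the same shape."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("bin matrices must have the same shape")
    na, nb = normalized(a), normalized(b)
    return sum(
        abs(x - y)
        for row_a, row_b in zip(na, nb)
        for x, y in zip(row_a, row_b)
    )
```

Normalizing first means two spectra with different peak counts but the same overall shape still come out close; whether that is the right behavior for real MS/MS matching is an open question.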

Here is how the same data looks when parceled into 8×8 bins:

HGTDDGVVWMNWK:

KGGETSEMYLIQPDSSVKPYR:

We can see that there is still a good match for KGGETSEMYLIQPDSSVKPYR, but that there would continue to be difficulty in matching the two HGTDDGVVWMNWK spectra.

For the curious, here are many spectra represented as 4×4 and many as 8×8.

4 Responses to “Tandem Mass data hash”

  1. Oct 9th, 2009 :

    Hi Brian,
    Good work so far. I think the problem you are having is that your mass bins and intensity bins are fixed, not relative.

    How about this: make your bins relative to the highest intensity peak in the spectrum (or, maybe better, the average peak intensity), and the mass of the precursor. It might work better that way.

    BTW – I actually think it worked much better for HGTDDGVVWMNWK than you thought it did. If there were a distance metric, it would say these two were quite close. But I think doing the calculations as relative values will be even better.

  2. jainab

    Oct 9th, 2009 :

    Thanks, Brian, for doing this. I have only one small suggestion. Could you check how it looks if you divide the mass range into a large number of bins, say 15 or 20, while keeping the intensity bin count at 4? Just a check, so we can be sure that this won’t work.

  3. Oct 14th, 2009 :

    Jainab, that is a good suggestion. But saying “so that we can be sure that this won’t work” is a bit misleading. I don’t think these experiments proved that the method won’t work. In fact, contrary to what Brian wrote, I find these first experiments quite encouraging, given that they are exactly that: _first_ experiments.

  4. Brian

    Oct 14th, 2009 :

    Of course: I should keep the bins relative to the max values for a specific spectrum. This makes perfect sense as we would only be comparing spectra with similar precursor masses. Also, it is obvious the method I have chosen for parceling along the y-axis is not the best.
