Tandem Mass data hash
“Tandem mass data hash”: try saying that five times fast. To get a better feel for how MS/MS data will be represented as a hash, I wrote a quick visualizer.
How the MS/MS data is hashed: an N×M bin matrix is computed, where N is the number of mass bins and M is the number of intensity bins. Here I’ve used 4 for both. I first went through all the MS/MS data and found the maximum mass and maximum intensity, which set the thresholds for which mass/intensity values go in which bins. I took the natural log of intensity so that lower intensities are not zeroed out or dwarfed by large outlier intensities.
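The binning step could be sketched roughly like this in Python. This is an illustration rather than the code behind the visualizer: the function names (`global_maxima`, `bin_spectrum`), the equal-width bins starting at zero, and the clamping of negative log intensities into the lowest bin are all assumptions, since the description above only fixes the maxima and the natural log.

```python
import math

def global_maxima(spectra):
    """Scan every spectrum for the largest mass and largest ln(intensity).

    spectra: list of spectra, each a list of (mass, intensity) pairs.
    Assumes every intensity is > 0 so the log is defined.
    """
    max_mass = max(m for peaks in spectra for m, _ in peaks)
    max_log_i = max(math.log(i) for peaks in spectra for _, i in peaks)
    return max_mass, max_log_i

def bin_spectrum(peaks, max_mass, max_log_i, n_mass_bins=4, n_intensity_bins=4):
    """Count the peaks of one spectrum in each (intensity, mass) bin.

    Rows index intensity bins and columns index mass bins.
    Assumes equal-width bins from 0 up to the global maximum,
    with max_mass > 0 and max_log_i > 0.
    """
    matrix = [[0] * n_mass_bins for _ in range(n_intensity_bins)]
    for mass, intensity in peaks:
        log_i = math.log(intensity)
        col = int(mass / max_mass * n_mass_bins)
        row = int(log_i / max_log_i * n_intensity_bins)
        # A value equal to the maximum would land one past the last bin,
        # and ln(intensity) < 0 would land below the first; clamp both.
        col = max(0, min(col, n_mass_bins - 1))
        row = max(0, min(row, n_intensity_bins - 1))
        matrix[row][col] += 1
    return matrix
```

Calling `bin_spectrum` with `n_mass_bins=8, n_intensity_bins=8` would give the finer grid instead of the 4×4 one.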
An example visualization of a spectrum when parceled out into a bin matrix:

The darkness of a bin represents the number of peaks that fall in that mass/intensity range. We can see that in this spectrum there were very few peaks with low intensity and none with high intensity.
The tandem mass data I was working with contained two amino acid sequences that each appeared twice. Here are the “standard” representations of those spectra (each peak is drawn as a vertical line whose placement is determined by peak mass):
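For a rough idea of how a count matrix turns into shading, here is a small text-only sketch. The function name `render`, the character ramp, and drawing the highest-intensity row at the top are all assumptions made for illustration; the actual visualizer draws shaded squares.

```python
def render(matrix, shades=" .:*#"):
    """Render a bin-count matrix as text, darker characters for fuller bins.

    matrix: rows are intensity bins (lowest first), columns are mass bins.
    The fullest bin gets the last (darkest) character in `shades`.
    """
    peak = max(max(row) for row in matrix) or 1  # avoid dividing by zero
    top = len(shades) - 1
    lines = []
    for row in reversed(matrix):  # assumed layout: highest intensity on top
        lines.append("".join(shades[count * top // peak] for count in row))
    return "\n".join(lines)
```

Each count is scaled against the fullest bin in the same spectrum, so the shading is relative within one picture rather than comparable across pictures.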
HGTDDGVVWMNWK
precursor-mass: 1545.12
precursor-mass: 1545.52
KGGETSEMYLIQPDSSVKPYR
precursor-mass: 2386.29
precursor-mass: 2385.91
Here are the visualizations for how their spectra are broken down into bins:
HGTDDGVVWMNWK:

KGGETSEMYLIQPDSSVKPYR:

While the two representations of KGGETSEMYLIQPDSSVKPYR look fairly similar, there is an unfortunate amount of dissimilarity between the two HGTDDGVVWMNWK spectra. A matching algorithm using this bin technique and this data could very well fail to recognize the HGTDDGVVWMNWK pair as a match.
Here is a visualization of the same data parsed into 8×8 bins:
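To put a number on "fairly similar", one option is to flatten two bin matrices and compare them as vectors. The sketch below uses cosine similarity, which is my stand-in for illustration; nothing here settles what the matching algorithm will actually use, and the function name `bin_similarity` is hypothetical.

```python
import math

def bin_similarity(a, b):
    """Cosine similarity between two bin-count matrices of the same shape.

    Returns a value from 0.0 (no bins in common) to 1.0 (same proportions).
    Returns 0.0 if either matrix is entirely empty.
    """
    flat_a = [count for row in a for count in row]
    flat_b = [count for row in b for count in row]
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(x * x for x in flat_a))
    norm_b = math.sqrt(sum(y * y for y in flat_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Under a measure like this, peaks that shift into a neighboring bin count as a complete miss for that bin, which is one way two spectra of the same sequence can end up looking dissimilar.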
HGTDDGVVWMNWK:

KGGETSEMYLIQPDSSVKPYR:

We can see that there is still a good match for KGGETSEMYLIQPDSSVKPYR, but that there would continue to be difficulty in matching the two HGTDDGVVWMNWK spectra.
For the curious, here are many spectra represented as 4×4 and many as 8×8.