The USP data

by Brian | 28th June 2010

The Chen lab recently gave us a set of spectra derived from the USP set of 50 proteins. We were told that the set probably contains a fair number of impurities, so some of the spectra will not correspond to any peptide in the USP list.

I’ve created a test set of spectrum/peptide pairs that I think does a good job of filtering out the impurity peptides. The proportion that survived was much smaller than I had originally guessed: only 1434 of the 7036 spectra (about 20%) come from peptides in the USP set.

About the creation:

I built the first pass of this body of spectrum/peptide pairs by taking every match in the top 10 TandemFit results that was also in the USP peptide list. I then went through and “hand verified” the results by printing out and evaluating a report of all of the spectrum/peptide matches that were incorrectly assigned. The report included graphic representations of the spectra, with the peaks that aligned to theoretical ions highlighted. I went through it and picked out the spectra where the TandemFit match seemed to far outclass the match from the USP set. This amounted to removing 5 objectionable spectra. These wonky results may have been produced by chimeric spectra, or by the odd chance that one of the peptides from the USP list had properties similar to one of the impurity peptides.
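The first-pass selection described above can be sketched roughly as follows. This is only an illustration, not the code that was actually run: the function name, the dictionary layout, and the example peptide strings are all made up, and I am assuming that when several USP peptides appear in a spectrum's top N, the best-ranked one is kept.

```python
from typing import Dict, List, Set, Tuple


def first_pass_pairs(
    ranked_matches: Dict[str, List[str]],
    usp_peptides: Set[str],
    top_n: int = 10,
) -> Dict[str, Tuple[str, int]]:
    """Keep, for each spectrum, the best-ranked candidate that is a USP peptide.

    ranked_matches maps a (hypothetical) spectrum id to its candidate peptides,
    ordered best first, as a search tool might report them.
    Returns spectrum id -> (peptide, 1-based rank).
    """
    pairs: Dict[str, Tuple[str, int]] = {}
    for spectrum_id, peptides in ranked_matches.items():
        # Only the first top_n candidates are considered.
        for rank, peptide in enumerate(peptides[:top_n], start=1):
            if peptide in usp_peptides:
                pairs[spectrum_id] = (peptide, rank)
                break  # assumption: keep only the best-ranked USP hit
    return pairs


# Made-up example data, purely for illustration.
example_matches = {
    "s1": ["AAAK", "PEPTIDEK"],           # USP peptide at rank 2
    "s2": ["XXXK", "YYYK"],               # no USP peptide: treated as impurity
    "s3": ["LLLK"] * 10 + ["ELVISK"],     # USP peptide at rank 11, outside top 10
}
example_usp = {"PEPTIDEK", "ELVISK"}
example_pairs = first_pass_pairs(example_matches, example_usp)
```

Spectra with no USP peptide in their top N simply drop out, which is how the impurity spectra get filtered. The hand verification step is not modelled here.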

Of course, I recognize that this “hand verification” may further bias the dataset. You can look at the ones that I have removed here to see if I’m being crazy:

http://proteomics.me/files/20100628

Granted, because the test set was created using TandemFit, it will be somewhat biased towards TandemFit. Even so, it should be very useful for tuning some of the algorithm’s parameters so that the matches currently in positions 2 through 10 move into position 1.

To estimate how many hits we missed by stopping at the top 10 matches, I ran the same select-and-hand-verify process on the top 5 matches. Going from the top 5 to the top 10 grew the match set from 1340 to 1434, a 7 percent gain. If I repeated the process for the top 20 matches, I wouldn’t expect more than another 7 percent gain. I would bet that this test set accounts for at least 90% of the correct matches.
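The arithmetic behind those figures is easy to check. The snippet below is just a back-of-the-envelope sketch: the counts come from this post, while the helper names and the assumption that nothing of note turns up beyond the top 20 are mine.

```python
def relative_gain(before: int, after: int) -> float:
    """Fractional growth of the match set when the cutoff is widened."""
    return (after - before) / before


def coverage_if_gain_bounded(gain_bound: float) -> float:
    """Share of the correct matches already captured, assuming a wider
    cutoff would grow the set by at most gain_bound."""
    return 1.0 / (1.0 + gain_bound)


TOTAL_SPECTRA = 7036
TOP5_MATCHES = 1340
TOP10_MATCHES = 1434

usp_fraction = TOP10_MATCHES / TOTAL_SPECTRA             # about 0.204
top10_gain = relative_gain(TOP5_MATCHES, TOP10_MATCHES)  # about 0.070

# If the top 20 added at most another 7 percent, and nothing beyond that,
# the current set would cover about 93% of the correct matches.
coverage = coverage_if_gain_bounded(0.07)

print(f"USP fraction: {usp_fraction:.1%}")
print(f"Gain from top 5 to top 10: {top10_gain:.2%}")
print(f"Coverage under a 7% bound: {coverage:.1%}")
```

Under that bound the coverage comes out a little above 93%, which leaves some room for matches ranked below 20 before the "at least 90%" guess fails.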

Results from these tests:

Again, because of the above assumptions, these numbers are approximate.

Percent correct at one percent error rate: 57.53%
e value at that rate: 0.1349
Total correct top matches: 82.5%
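For readers wondering how a number like "percent correct at one percent error rate" could be computed, here is one possible reading, offered as a guess and not as TandemFit's actual procedure. I am assuming each spectrum contributes its top match with an e value and a correct/incorrect label from the test set, that lower e values are better, and that the error rate is the fraction of incorrect matches among those accepted at a given cutoff. The function name and the toy data are hypothetical.

```python
from typing import List, Optional, Tuple


def percent_correct_at_error_rate(
    top_matches: List[Tuple[float, bool]],
    max_error_rate: float = 0.01,
) -> Tuple[float, Optional[float]]:
    """top_matches holds (e_value, is_correct) for each spectrum's top match.

    Returns (fraction of all spectra accepted and correct, e value cutoff)
    at the loosest cutoff whose accepted set stays within max_error_rate.
    Ties in e value are not treated specially in this sketch.
    """
    ordered = sorted(top_matches, key=lambda match: match[0])
    total = len(ordered)
    best_fraction: float = 0.0
    best_cutoff: Optional[float] = None
    correct = 0
    incorrect = 0
    for e_value, is_correct in ordered:
        if is_correct:
            correct += 1
        else:
            incorrect += 1
        # Error rate among everything accepted up to this e value.
        if incorrect / (correct + incorrect) <= max_error_rate:
            best_fraction = correct / total
            best_cutoff = e_value
    return best_fraction, best_cutoff


# Toy data, not taken from the USP set.
toy_matches = [(0.01, True), (0.02, True), (0.5, False), (0.9, True)]
toy_fraction, toy_cutoff = percent_correct_at_error_rate(toy_matches)
```

On the toy data only the first two matches can be accepted without exceeding a 1% error rate, so the function returns half of the spectra and a cutoff of 0.02.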
