<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Proteomics.ME</title>
	<atom:link href="http://proteomics.me/feed/" rel="self" type="application/rss+xml" />
	<link>http://proteomics.me</link>
	<description>Proteomics Software</description>
	<lastBuildDate>Thu, 28 Oct 2010 20:05:50 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>E values are less reliable with databases of spliced peptides</title>
		<link>http://proteomics.me/2010/10/28/e-values-are-less-reliable-with-databases-of-spliced-peptides/</link>
		<comments>http://proteomics.me/2010/10/28/e-values-are-less-reliable-with-databases-of-spliced-peptides/#comments</comments>
		<pubDate>Thu, 28 Oct 2010 19:59:12 +0000</pubDate>
		<dc:creator>Brian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://proteomics.me/?p=511</guid>
		<description><![CDATA[When we calculate our E values we are assuming that This is a great example of the type of histogram on which the Fenyo method of calculating E values performs poorly. See that one tall histogram bar? Because it falls in the small section where the least-squares line is calculated, the fit comes out with a much steeper [...]]]></description>
			<content:encoded><![CDATA[<p>When we calculate our E values we are assuming that </p>
<p>This is a great example of the type of histogram on which the Fenyo method of calculating E values performs poorly.</p>
<p><img src="http://proteomics.me/files/20101028/bad-e-value.jpg" alt="2.1931996962851298E-14" /></p>
<p>See that one tall histogram bar?  Because it falls in the small section where the least-squares line is calculated, the fit comes out with a much steeper slope (and, as a direct result, a much better E value) than it really should.</p>
<p>These kinds of histograms happen all the time in databases which contain peptides derived from multiple splicing junctions.  The reason is that there may be many peptides where the front end or tail end is correct.  For example, let&#8217;s say the correct peptide for a spectrum is &#8220;WSFFFFCGYN&#8221;, but an intron position begins after the C.  The database may contain many variations on that peptide which begin with &#8220;WSFFFFC&#8221;.  If those alternate peptides also fall within our precursor tolerance then they will all produce fairly decent scores, which creates the kind of spike seen in the histogram above.</p>
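<p>Here is a rough Python sketch of why such a spike matters. It assumes the E value comes from fitting a least-squares line to the log of the survival counts over a short window and then extrapolating that line out to the top score. The histograms, the fit window, and the top score are all made up for illustration, and the function names are hypothetical; this is not the Peppy code.</p>

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def estimate_e_value(histogram, window, top_score):
    """Fit log10(survival) over `window`, then extrapolate to `top_score`."""
    # survival[i] = number of scores that fall in bin i or any higher bin
    survival = []
    running = 0
    for count in reversed(histogram):
        running += count
        survival.append(running)
    survival.reverse()
    xs = list(window)
    ys = [math.log10(survival[x]) for x in xs]
    slope, intercept = fit_line(xs, ys)
    return slope, 10 ** (slope * top_score + intercept)

# Made-up score histograms: identical except for one tall bar in bin 4.
smooth = [400, 200, 100, 50, 25, 12, 6, 3]
spiked = list(smooth)
spiked[4] = 250  # many near-miss peptides sharing the same decent score

window = range(4, 8)  # the small section used for the line fit
top_score = 20        # bin of the best match, far out on the right

slope_smooth, e_smooth = estimate_e_value(smooth, window, top_score)
slope_spiked, e_spiked = estimate_e_value(spiked, window, top_score)
print(slope_smooth, e_smooth)
print(slope_spiked, e_spiked)
```

<p>With the tall bar sitting at the left edge of the fit window, the fitted slope is steeper and the extrapolated E value for the same top score is smaller, even though nothing about the best match has changed.</p>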
]]></content:encoded>
			<wfw:commentRss>http://proteomics.me/2010/10/28/e-values-are-less-reliable-with-databases-of-spliced-peptides/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tandem Mass Spectrometry Peak Calibration Tool</title>
		<link>http://proteomics.me/2010/08/26/tandem-mass-spectrometry-peak-calibration-tool/</link>
		<comments>http://proteomics.me/2010/08/26/tandem-mass-spectrometry-peak-calibration-tool/#comments</comments>
		<pubDate>Thu, 26 Aug 2010 17:38:23 +0000</pubDate>
		<dc:creator>Brian</dc:creator>
				<category><![CDATA[New Methods]]></category>

		<guid isPermaLink="false">http://proteomics.me/?p=508</guid>
		<description><![CDATA[It hit me today: What if the theoretical mass I&#8217;m calculating for ions is consistently off? Even if it is just a little (talking fractions of a proton), that would negatively impact a peptide matching algorithm&#8217;s performance. How can we measure this? Relatively easy. The somewhat challenging part is getting a set of spectra from [...]]]></description>
			<content:encoded><![CDATA[<p>It hit me today:  What if the theoretical mass I&#8217;m calculating for ions is consistently off?  Even if it is just a little (talking fractions of a proton), that would negatively impact a peptide matching algorithm&#8217;s performance.</p>
<p>How can we measure this?</p>
<p>Relatively easy.  The somewhat challenging part is getting a set of spectra from your machine where you know the peptide from which each spectrum came.  These test sets are useful for many things, and finding consistent theoretical peak error is one of them.</p>
<p>To calculate average theoretical peak error, go through each spectrum you have.  For each spectrum, take the peptide it was derived from and find the theoretical ions of this peptide.  Then, for each of these theoretical ions, find the most intense (it&#8217;s so <em>intense</em>!) peak of the spectrum that falls within a fairly conservative mass window.  (This mass window should be large enough to account for normal error but small enough to avoid false matches.  A good starting point is 0.5 Da.)  Find the mass difference between this peak and the theoretical ion.  </p>
<p>Do this for every theoretical ion of every peptide for every spectrum and get the average.</p>
<p>Voila!</p>
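<p>The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the Peppy code: it assumes the theoretical ion masses for each peptide have already been computed, that a spectrum is a plain list of (mass, intensity) pairs, and that the difference is kept signed so a consistent offset shows up as a non-zero average. The function name and the sample numbers are made up.</p>

```python
def average_peak_error(test_set, window=0.5):
    """Mean signed offset (observed minus theoretical) over all matched ions.

    test_set: iterable of (theoretical_ion_masses, peaks) pairs, where
    peaks is a list of (mass, intensity) tuples for one spectrum.
    """
    total = 0.0
    matched = 0
    for theoretical_masses, peaks in test_set:
        for ion_mass in theoretical_masses:
            # Peaks close enough to this theoretical ion to be candidates.
            nearby = [p for p in peaks if window >= abs(p[0] - ion_mass)]
            if not nearby:
                continue  # no peak inside the window; skip this ion
            # Take the most intense candidate, as described above.
            best_mass, _ = max(nearby, key=lambda p: p[1])
            total += best_mass - ion_mass
            matched += 1
    return total / matched if matched else 0.0

# Made-up test set: every observed peak sits about 0.1 Da too high.
test_set = [
    ([100.0, 200.0],
     [(100.1, 50.0), (100.3, 10.0), (200.1, 80.0), (350.0, 99.0)]),
    ([150.0],
     [(150.1, 30.0)]),
]
error = average_peak_error(test_set)
print(round(error, 3))  # 0.1
```

<p>An average near zero would suggest the theoretical masses are fine; a consistent offset like the one in this toy data would be worth correcting before scoring.</p>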
]]></content:encoded>
			<wfw:commentRss>http://proteomics.me/2010/08/26/tandem-mass-spectrometry-peak-calibration-tool/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>TandemFit animation</title>
		<link>http://proteomics.me/2010/08/20/tandemfit-animation/</link>
		<comments>http://proteomics.me/2010/08/20/tandemfit-animation/#comments</comments>
		<pubDate>Fri, 20 Aug 2010 14:39:40 +0000</pubDate>
		<dc:creator>Brian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://proteomics.me/?p=506</guid>
		<description><![CDATA[Here&#8217;s a kind of nifty animation of how TandemFit walks through each MS/MS peak twice (once forwards, once backwards) to find the matches to the theoretical ions of a peptide. Note that the ion being searched for is displayed. The animation!]]></description>
			<content:encoded><![CDATA[<p>Here&#8217;s a kind of nifty animation of how TandemFit walks through each MS/MS peak twice (once forwards, once backwards) to find the matches to the theoretical ions of a peptide.  Note that the ion being searched for is displayed.</p>
<p><a href="http://proteomics.me/files/20100820/tandemfit.mov">The animation</a>!</p>
]]></content:encoded>
			<wfw:commentRss>http://proteomics.me/2010/08/20/tandemfit-animation/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
<enclosure url="http://proteomics.me/files/20100820/tandemfit.mov" length="1757711" type="video/quicktime" />
		</item>
		<item>
		<title>Aurum dataset oddity</title>
		<link>http://proteomics.me/2010/08/19/aurum-dataset-oddity/</link>
		<comments>http://proteomics.me/2010/08/19/aurum-dataset-oddity/#comments</comments>
		<pubDate>Thu, 19 Aug 2010 21:13:33 +0000</pubDate>
		<dc:creator>Brian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://proteomics.me/?p=501</guid>
		<description><![CDATA[It looks like spectrum: T10475_Well_A13_2025.07_16898.mgf..pkl and spectrum: T10475_Well_A13_2025.07_17096.mgf..pkl are the same. This is bad news for me as TandemFit gets both of those &#8220;wrong&#8221;. The quotes are there because TandemFit&#8217;s match of QAGLQLQESLEPAVRLDR has 11 fragment alignments versus the 2 produced by VPAPSIEDICHVLSTVCK, which is the &#8220;correct&#8221; peptide. Update: other duplicates: T10475_Well_A12_1386.68_16898.mgf..pkl T10475_Well_A12_1386.68_17096.mgf..pkl T10475_Well_A03_1551.77_16898.mgf..pkl T10475_Well_A03_1551.77_17096.mgf..pkl T10475_Well_A11_1386.69_16898.mgf..pkl T10475_Well_A11_1386.69_17096.mgf..pkl T10475_Well_A10_1188.45_17096.mgf..pkl [...]]]></description>
			<content:encoded><![CDATA[<p>It looks like spectrum:<br />
T10475_Well_A13_2025.07_16898.mgf..pkl </p>
<p>and spectrum:<br />
T10475_Well_A13_2025.07_17096.mgf..pkl </p>
<p>are the same.  This is bad news for me as TandemFit gets both of those &#8220;wrong&#8221;.  The quotes are there because TandemFit&#8217;s match of QAGLQLQESLEPAVRLDR has 11 fragment alignments versus the 2 produced by VPAPSIEDICHVLSTVCK, which is the &#8220;correct&#8221; peptide.</p>
<p><strong>Update</strong>:  other duplicates:<br />
T10475_Well_A12_1386.68_16898.mgf..pkl<br />
T10475_Well_A12_1386.68_17096.mgf..pkl </p>
<p>T10475_Well_A03_1551.77_16898.mgf..pkl<br />
T10475_Well_A03_1551.77_17096.mgf..pkl </p>
<p>T10475_Well_A11_1386.69_16898.mgf..pkl<br />
T10475_Well_A11_1386.69_17096.mgf..pkl </p>
<p>T10475_Well_A10_1188.45_17096.mgf..pkl<br />
T10475_Well_A10_1188.45_16898.mgf..pkl </p>
<p>T10475_Well_A10_2143.98_17096.mgf..pkl<br />
T10475_Well_A10_2143.98_16898.mgf..pkl </p>
<p>Hmm&#8230; it looks like something of a pattern.</p>
]]></content:encoded>
			<wfw:commentRss>http://proteomics.me/2010/08/19/aurum-dataset-oddity/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to Fake a Great E Value</title>
		<link>http://proteomics.me/2010/08/13/how-to-fake-a-great-e-value/</link>
		<comments>http://proteomics.me/2010/08/13/how-to-fake-a-great-e-value/#comments</comments>
		<pubDate>Fri, 13 Aug 2010 16:23:35 +0000</pubDate>
		<dc:creator>Brian</dc:creator>
				<category><![CDATA[Interesting Aside]]></category>

		<guid isPermaLink="false">http://proteomics.me/?p=483</guid>
		<description><![CDATA[To better peer into and inspect the code I have for calculating E values, I created a histogram visualizer. Here are two histograms for score distributions for two different spectra: (Note: The small green bars are where values are above zero but wouldn&#8217;t normally be drawn, as the bar would be less than one [...]]]></description>
			<content:encoded><![CDATA[<p>To better peer into and inspect the code I have for calculating E values, I created a histogram visualizer.  Here are two histograms for score distributions for two different spectra:  (Note: The small green bars are where values are above zero but wouldn&#8217;t normally be drawn, as the bar would be less than one pixel high.)</p>
<p>Histogram for Spectrum A:<br />
<div id="attachment_484" class="wp-caption aligncenter" style="width: 310px"><a href="http://proteomics.me/2010/08/13/how-to-fake-a-great-e-value/4a/" rel="attachment wp-att-484"><img src="http://proteomics.me/wp-content/uploads/2010/08/4a.jpg" alt="" title="4a" width="300" height="300" class="size-full wp-image-484" /></a><p class="wp-caption-text">E value: 1.0659386771035567E-26</p></div></p>
<p>Histogram for Spectrum B:<br />
<div id="attachment_485" class="wp-caption aligncenter" style="width: 310px"><a href="http://proteomics.me/2010/08/13/how-to-fake-a-great-e-value/365a/" rel="attachment wp-att-485"><img src="http://proteomics.me/wp-content/uploads/2010/08/365a.jpg" alt="" title="365a" width="300" height="300" class="size-full wp-image-485" /></a><p class="wp-caption-text">E value: 2.64817948989741E-19</p></div></p>
<p>These two score distributions typify what you&#8217;d expect when a correct match is present.  Most scores fall into a large mass on the low end, but then, on the rightmost side, far away from the other scores, you see a tiny blip where the correct score lives.</p>
<p>Now, just by looking at the two, would you say that you are much more confident about the results of the first one than the second?  Maybe so, but would you say that you are roughly 25,000,000 times more confident?  Well, that&#8217;s what those E values imply.  Like I said before, E values are highly sensitive to the shape of the right tail of the histogram, and the slight difference in shape between the two skewed the linear regression enough to make the confidence factors starkly different.</p>
<p>This dependence on histogram shape can be used to game the E value method and make a scoring algorithm that produces results with highly inflated confidence scores.</p>
<p>What I&#8217;m about to do is simple and can be implemented by any scoring method.</p>
<p>IMPORTANT NOTE:  I&#8217;m not changing the E value code at all.  I&#8217;m only changing the code in the scoring method.  I&#8217;ve gone ahead and altered the code in TandemFit to square the final score.  Here&#8217;s the code:</p>
<p>score = score * score;</p>
<p>Simple enough?  Do you think this improves the quality of the scores?</p>
<p>Spoiler:  It doesn&#8217;t.  As long as the scores are non-negative, this will not change the ordering of TandemFit scores.  The best score will still be the best score and the worst score will still be the worst score.  However, the distribution of the scores will be very different.  What were scores of 1, 2, 3, 4 and 5 will now be scores of 1, 4, 9, 16 and 25.  Because larger scores are more affected by the squaring process, our best scores (the true matches) will appear much further removed from the rest.</p>
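<p>A quick way to check that claim, written here as a small Python sketch rather than as the actual TandemFit code (the helper names are made up): the ranking of the scores is untouched by squaring, while the gap between the best score and the runner-up grows.</p>

```python
scores = [1, 2, 3, 4, 5]
squared = [s * s for s in scores]

def ranking(values):
    """Indices of the values, ordered from best (highest) to worst."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)

def gap_to_runner_up(values):
    """How far the best score sits above the second-best score."""
    ordered = sorted(values, reverse=True)
    return ordered[0] - ordered[1]

print(squared)                                              # [1, 4, 9, 16, 25]
print(ranking(scores) == ranking(squared))                  # True
print(gap_to_runner_up(scores), gap_to_runner_up(squared))  # 1 9
```

<p>Nothing about which peptide wins has changed, yet the right tail of the histogram has been stretched, and the right tail is exactly what the line fit looks at.</p>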
<p>With this one change, here are the histograms and E values for the scores for those same two spectra:</p>
<p>Histogram for Spectrum A:<br />
<div id="attachment_486" class="wp-caption aligncenter" style="width: 310px"><a href="http://proteomics.me/2010/08/13/how-to-fake-a-great-e-value/4b/" rel="attachment wp-att-486"><img src="http://proteomics.me/wp-content/uploads/2010/08/4b.jpg" alt="3.202345672236334E-91" title="4b" width="300" height="300" class="size-full wp-image-486" /></a><p class="wp-caption-text">E value: 3.202345672236334E-91</p></div></p>
<p>Histogram for Spectrum B:<br />
<div id="attachment_487" class="wp-caption aligncenter" style="width: 310px"><a href="http://proteomics.me/2010/08/13/how-to-fake-a-great-e-value/365b/" rel="attachment wp-att-487"><img src="http://proteomics.me/wp-content/uploads/2010/08/365b.jpg" alt="" title="365b" width="300" height="300" class="size-full wp-image-487" /></a><p class="wp-caption-text">E value: 1.1669794248500472E-57</p></div></p>
<p>Look at those frickin&#8217; E values!  The first one is within throwing distance of being one over a googol!  </p>
<p>Now let&#8217;s say we were comparing the two scoring methods.  For spectrum A, is our confidence really more than 10^64 times better with the second scoring method?</p>
<p>Will you ever trust anyone&#8217;s E values again?</p>
]]></content:encoded>
			<wfw:commentRss>http://proteomics.me/2010/08/13/how-to-fake-a-great-e-value/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The battle of ORF: Open Reading Frames</title>
		<link>http://proteomics.me/2010/06/28/the-battle-of-orf-open-reading-frames/</link>
		<comments>http://proteomics.me/2010/06/28/the-battle-of-orf-open-reading-frames/#comments</comments>
		<pubDate>Mon, 28 Jun 2010 20:49:15 +0000</pubDate>
		<dc:creator>Brian</dc:creator>
				<category><![CDATA[Code Change]]></category>

		<guid isPermaLink="false">http://proteomics.me/?p=472</guid>
		<description><![CDATA[Peppy, up until now, only digested chromosomes inside of open reading frames. This is no more. There is now choice! In the properties file you can set whether you want to digest the whole chromosome or only the ORFs.]]></description>
			<content:encoded><![CDATA[<p>Peppy, up until now, only digested chromosomes inside of open reading frames.  This is no more.  There is now choice!  In the properties file you can set whether you want to digest the whole chromosome or only the ORFs.</p>
]]></content:encoded>
			<wfw:commentRss>http://proteomics.me/2010/06/28/the-battle-of-orf-open-reading-frames/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The USP data</title>
		<link>http://proteomics.me/2010/06/28/the-usp-data/</link>
		<comments>http://proteomics.me/2010/06/28/the-usp-data/#comments</comments>
		<pubDate>Mon, 28 Jun 2010 19:24:54 +0000</pubDate>
		<dc:creator>Brian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://proteomics.me/?p=470</guid>
		<description><![CDATA[The Chen lab recently gave us a set of spectra derived from the USP set of 50 proteins. We were told that there are probably a good number of impurities in this set, so some of the spectra will not correspond to peptides found in the USP list. [...]]]></description>
			<content:encoded><![CDATA[<p>The Chen lab recently gave us a set of spectra derived from the USP set of 50 proteins.  We were told that there are probably a good number of impurities in this set, so some of the spectra will not correspond to peptides found in the USP list.</p>
<p>I&#8217;ve created a test set of spectrum/peptide pairs which I think does a good job of filtering out the impurity peptides. The USP fraction was much smaller than I had originally guessed: only about 1434 spectra out of 7036 (roughly 20%) come from peptides in the USP set.</p>
<p><strong>About the creation:</strong></p>
<p>The first pass at this body of spectrum/peptide pairs was made by taking all matches that were in the top 10 results for TandemFit and were also in the USP peptide list.  I then went through and &#8220;hand verified&#8221; the results.  This was done by printing out and evaluating a report of all of the spectrum/peptide matches that were incorrectly assigned.  The report included graphic representations of the spectra with peaks highlighted for aligned theoretical ions.  I went through it and picked out spectra where the TandemFit match seemed to far outclass the match from the USP set.  This amounted to removing 5 objectionable spectra.  These wonky results may have been produced by chimeric spectra or by odd luck that one of the peptides from the USP list had similar properties to one of the impurity peptides.  </p>
<p>Of course, I recognize that this &#8220;hand verification&#8221; may further bias the dataset.  You can look at the ones that I have removed here to see if I&#8217;m being crazy:</p>
<p><a href="http://proteomics.me/files/20100628">http://proteomics.me/files/20100628</a></p>
<p>Granted, as the test set was created using TandemFit, it will be somewhat biased towards TandemFit.  However, it is very good for potentially tuning some of the parameters in the algorithm to get the matches that were in positions 2 through 10 into position 1.</p>
<p>As for how many hits we have missed by going with the top 10 matches: I went through the same process of selecting and hand verifying with the top 5 matches.  Going from the top 5 to the top 10 means the match set went from 1340 to 1434.  That&#8217;s a 7 percent gain.  If I repeated the process for the top 20 matches I wouldn&#8217;t expect more than another 7 percent gain.  I would bet that this test set accounts for at least 90% of the correct matches.</p>
<p><strong>Results from these tests.  </strong></p>
<p>Again, because of the above assumptions, these numbers are approximate.</p>
<p>Percent correct at one percent error rate: 57.53%<br />
E value at that rate:  0.1349<br />
Total correct top matches:  82.5%</p>
]]></content:encoded>
			<wfw:commentRss>http://proteomics.me/2010/06/28/the-usp-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MSMSFit now TandemFit</title>
		<link>http://proteomics.me/2010/06/25/msmsfit-now-tandemfit/</link>
		<comments>http://proteomics.me/2010/06/25/msmsfit-now-tandemfit/#comments</comments>
		<pubDate>Fri, 25 Jun 2010 15:10:51 +0000</pubDate>
		<dc:creator>Brian</dc:creator>
				<category><![CDATA[MSMSFit]]></category>

		<guid isPermaLink="false">http://proteomics.me/?p=468</guid>
		<description><![CDATA[I recently found that something exists called &#8220;MS-Fit&#8221;. I found this out because I had confused one of our lab members into thinking that it had something to do with MSMSFit. To avoid future confusion I am redubbing that scoring method &#8220;TandemFit&#8221;. The name works well in that it scores tandem mass spectrometry data, and it does [...]]]></description>
			<content:encoded><![CDATA[<p>I recently found that something exists called &#8220;MS-Fit&#8221;.  I found this out because I had confused one of our lab members into thinking that it had something to do with MSMSFit.  To avoid future confusion I am redubbing that scoring method &#8220;TandemFit&#8221;.</p>
<p>The name works well in that it scores tandem mass spectrometry data, and it does so by sequentially comparing spectra and theoretical spectra in tandem.</p>
]]></content:encoded>
			<wfw:commentRss>http://proteomics.me/2010/06/25/msmsfit-now-tandemfit/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Spectrum intensity normalization</title>
		<link>http://proteomics.me/2010/06/24/spectrum-intensity-normalization/</link>
		<comments>http://proteomics.me/2010/06/24/spectrum-intensity-normalization/#comments</comments>
		<pubDate>Thu, 24 Jun 2010 20:16:16 +0000</pubDate>
		<dc:creator>Brian</dc:creator>
				<category><![CDATA[MSMSFit]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://proteomics.me/?p=463</guid>
		<description><![CDATA[Just added a method and a modified constructor to the Spectrum object which allow for normalization of the peak intensities. It finds the maximum intensity for a given spectrum and then walks through each peak, dividing its intensity by this maximum. This can be handy for SpectrumMatch as well as for comparing TandemFit [...]]]></description>
			<content:encoded><![CDATA[<p>Just added a method and a modified constructor to the Spectrum object which allow for normalization of the peak intensities.  It finds the maximum intensity for a given spectrum and then walks through each peak, dividing its intensity by this maximum.</p>
<p>This can be handy for SpectrumMatch as well as for comparing TandemFit scores (formerly MSMSFit).</p>
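<p>For illustration, the normalization described above looks roughly like this in Python. This is a sketch and not the actual Spectrum method: it assumes a spectrum is a plain list of (mass, intensity) pairs, and the function name and sample peaks are made up.</p>

```python
def normalize_intensities(peaks):
    """Scale intensities so that the tallest peak becomes 1.0.

    peaks: list of (mass, intensity) tuples; a new list is returned.
    """
    if not peaks:
        return []
    max_intensity = max(intensity for _, intensity in peaks)
    if max_intensity == 0:
        return list(peaks)  # nothing to scale; avoid dividing by zero
    return [(mass, intensity / max_intensity) for mass, intensity in peaks]

peaks = [(114.09, 20.0), (228.13, 80.0), (341.22, 40.0)]
normalized = normalize_intensities(peaks)
print(normalized)  # [(114.09, 0.25), (228.13, 1.0), (341.22, 0.5)]
```

<p>After this step every spectrum has a maximum intensity of 1.0, so intensity-based scores from different spectra sit on the same scale.</p>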
]]></content:encoded>
			<wfw:commentRss>http://proteomics.me/2010/06/24/spectrum-intensity-normalization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Peppy test</title>
		<link>http://proteomics.me/2010/06/21/peppy/</link>
		<comments>http://proteomics.me/2010/06/21/peppy/#comments</comments>
		<pubDate>Mon, 21 Jun 2010 15:43:34 +0000</pubDate>
		<dc:creator>Brian</dc:creator>
				<category><![CDATA[Featured]]></category>

		<guid isPermaLink="false">http://proteomics.me/?p=435</guid>
		<description><![CDATA[Peppy, the open-source proteogenomic mapping software, has been benchmarked at processing over 600,000 spectra per day on a consumer-grade desktop&#8230;]]></description>
			<content:encoded><![CDATA[<p><img src="http://proteomics.me/images/peppy-main-page.jpg" alt="Peppy - proteogenomic mapping software" /></p>
<p>Peppy, the open-source proteogenomic mapping software, has been benchmarked at processing over 600,000 spectra per day on a consumer-grade desktop&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://proteomics.me/2010/06/21/peppy/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

