MSMSFit idea: skip ion if fragment already accounted for

by Brian | 19th November 2009

If a certain fragment fault-line has already been accounted for by another ion, we can skip further ions for that location.
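A minimal sketch of the skip, assuming each candidate ion carries the backbone position (the "fault-line") it would explain and a flag saying whether it matched a peak. The `Ion` tuple layout and `count_explained_sites` are hypothetical names for illustration, not MSMSFit's actual code.

```python
from typing import Iterable, List, Set, Tuple

# (cleavage_position, ion_type, matched) -- a hypothetical layout for illustration.
Ion = Tuple[int, str, bool]


def count_explained_sites(ions: Iterable[Ion]) -> Tuple[int, List[Ion]]:
    """Count cleavage positions explained by at least one matched ion.

    Once a position is explained, later ions for that position are skipped.
    Returns the count and the list of ions that were actually examined.
    """
    explained: Set[int] = set()
    examined: List[Ion] = []
    for ion in ions:
        position, _ion_type, matched = ion
        if position in explained:
            continue  # fault-line already accounted for; skip this ion
        examined.append(ion)
        if matched:
            explained.add(position)
    return len(explained), examined
```

The saving depends on ion order: the earlier a matching ion appears for a position, the more ions for that position are skipped.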

Brainstorm: massive “crowdsourcing” of proteomic processing

by Brian | 3rd November 2009

Similar in principle to the various folding@home projects. It may seem that GFS is not conducive to such segmentation, since genomic data is large enough to make download times prohibitive. This is not the case if a node “specializes” in a subset of the genome. A node only needs access to a spectrum file and [...]

proteomic database fields

by Brian | 1st September 2009

For a basic database which holds peptides from in silico digestion of a genome, I think these fields will be helpful:
peptide – the chain sequence of amino acids
location – This will probably be many fields which will track location information
mass – (theoretical)
size – the number of amino acids
cleavage – number [...]
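One way the fields listed above might look as a record type. This is only a sketch: the `Location` breakdown is a guess, since the post says only that location will span several fields, and the meaning of `cleavage` is truncated in the excerpt, so it is kept as a plain integer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    # Hypothetical breakdown; the post only says location will span several fields.
    sequence_name: str
    start: int            # 0-based start in the source sequence (assumed convention)
    stop: int             # exclusive end (assumed convention)
    forward_strand: bool


@dataclass(frozen=True)
class PeptideRecord:
    peptide: str          # amino acid sequence
    location: Location
    mass: float           # theoretical mass
    cleavage: int         # cleavage count; the post's definition is truncated

    @property
    def size(self) -> int:
        """Number of amino acids, derived from the sequence rather than stored."""
        return len(self.peptide)
```

Deriving `size` from `peptide` avoids the two getting out of sync. A real table might still store it as a column so that it can be indexed.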

Processing a directory of PKL files

by Brian | 5th August 2009

So MSMSFit may be a good replacement for seqtagscore, but it doesn’t solve the redundant processing overhead that comes with comparing a directory of fasta files to a directory of pkl files. The way we are doing it now, each comparison runs as an individual job; that means that if there are 100 pkl [...]

Grouping MSMS data

by Brian | 13th July 2009

It looks as though MSMS data may contain multiple peak lists for the same polypeptide. Would it be beneficial to combine the data of such suspected duplicates? I’ve noticed that sometimes there will be large gaps in regions of a peak list; combining lists could fill these gaps and perhaps help our algorithms. Of [...]
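A rough sketch of what combining suspected duplicates could look like, assuming each peak list is a list of (m/z, intensity) pairs with positive intensities. The function name, the 0.5 tolerance, and the intensity-weighted merge are all assumptions for illustration; deciding which lists are duplicates in the first place is not covered here.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity); intensities assumed positive


def merge_peak_lists(lists: List[List[Peak]], tolerance: float = 0.5) -> List[Peak]:
    """Pool peaks from several lists, merging neighbours within `tolerance` m/z."""
    pooled = sorted(peak for peaks in lists for peak in peaks)
    merged: List[Peak] = []
    for mz, intensity in pooled:
        if merged and mz - merged[-1][0] <= tolerance:
            last_mz, last_intensity = merged[-1]
            total = last_intensity + intensity
            # Intensity-weighted m/z keeps the merged peak near the stronger signal.
            merged[-1] = ((last_mz * last_intensity + mz * intensity) / total, total)
        else:
            merged.append((mz, intensity))
    return merged
```

Peaks present in only one list pass through unchanged, which is how a gap in one list gets filled by another.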

Points to optimize

by Brian | 6th July 2009

In calcScoresForMSMS, the loop which goes through the masses surrounds the loop which goes through the sequence. If we put the sequence loop on the outside, this will confer some advantages:
translateNucleotideSequence won’t have to be called over and over.
The code which produces a probable MSMS spectrum for a given sequence could be called [...]
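The loop swap can be sketched as follows. `score_all`, `translate`, and `score` are hypothetical stand-ins, not the real calcScoresForMSMS or translateNucleotideSequence; the point is only that per-sequence work moves out of the inner loop.

```python
from typing import Callable, Dict, List, Tuple


def score_all(
    sequences: List[str],
    masses: List[float],
    translate: Callable[[str], str],
    score: Callable[[str, float], float],
) -> Dict[Tuple[int, int], float]:
    """Score every (sequence, mass) pair with the sequence loop on the outside."""
    scores: Dict[Tuple[int, int], float] = {}
    for s_index, nucleotides in enumerate(sequences):
        protein = translate(nucleotides)  # done once per sequence, not once per pair
        for m_index, mass in enumerate(masses):
            scores[(s_index, m_index)] = score(protein, mass)
    return scores
```

With S sequences and M masses, translation runs S times instead of S × M times.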

“String Scaling” for fuzzy string comparison

by Brian | 25th June 2009

A common trick in image matching (e.g. face recognition, character recognition, etc.) is to first scale the two images being compared to the same dimensions in pixels. My thought is that this technique could be used to estimate similarity between two sequences of amino acids. Yesterday I put up a post about a possible ordering [...]
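One possible reading of the idea, as a sketch: stretch both sequences to a common length by nearest-neighbour sampling (the simplest image-scaling method), then count matching positions. Both function names and the choice of nearest-neighbour sampling are assumptions; the post may have had a different scaling or comparison step in mind.

```python
def scale_string(sequence: str, length: int) -> str:
    """Resample `sequence` to `length` characters by nearest-neighbour sampling."""
    if not sequence or length <= 0:
        return ""
    source_length = len(sequence)
    return "".join(sequence[i * source_length // length] for i in range(length))


def scaled_similarity(a: str, b: str) -> float:
    """Fraction of matching positions after scaling both strings to the longer length."""
    length = max(len(a), len(b))
    if length == 0:
        return 1.0
    scaled_a, scaled_b = scale_string(a, length), scale_string(b, length)
    matches = sum(x == y for x, y in zip(scaled_a, scaled_b))
    return matches / length
```

For example, `scale_string("ABC", 6)` gives `"AABBCC"`, so `"ABC"` and `"AABBCC"` score 1.0 under this measure.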

charToIntRepresentation returns char

by Brian | 24th June 2009

Some failures on the overall conversion to char… I am trying a more scaled-back approach where just charToIntRepresentation returns a char array. With 4 tandem spectra on E. coli (Escherichia_coli_K-12_MG1655.fasta) the improvement was 3.7% (1m51s to 1m47s). Not much. With 68 tandem spectra on E. coli there was, surprisingly, negative improvement (19m41s (1181s) to 19m55s [...]

for future: matchTagsFiltered

by Brian | 19th June 2009

matchTagsFiltered is where the unnecessary charToIntRepresentation is used. Though charToIntRepresentation may not be entirely unnecessary, as tags are stored as an NSArray, which still needs to be converted… Need to look at this later. Too distracted by converting NSStrings to chars…

Note to self – Strings should be char arrays

by Brian | 17th June 2009

Strings are essentially arrays of UTF characters which take up 4 bytes. As we are comparing proteome data, we don’t need that size of a data container to represent the amino acids. If, rather than Strings, we used arrays of characters (one byte each), then comparisons and assignment operators could be sped up as much [...]

Tuesday from home

by Brian | 16th June 2009

Working from home (waiting for plumber). Had to quickly drop by lab to pick up source files and parameters list. Future Idea: need to make “getting started with GFS” intro guide. Success! Found that if we create a temp variable to store acidArrayAsInts[i] for comparisons then we save 2% to 3% of time on findLongestCommonSubstring.

findLongestCommonSubstring – short sequence

by Brian | 12th June 2009

It would be interesting to check the results if longestCommonSubstring were skipped where the query sequence is significantly longer than the comparison sequence. Perhaps an if clause with a cut-off based on the length of the existing longest string relative to the length of the sequence string.
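A sketch of one conservative version of such a cut-off. It relies on the fact that a common substring can never be longer than the shorter of the two strings, so a comparison that cannot beat the current best is skipped without changing the result. The post may intend a more aggressive, heuristic threshold; `best_match` and the row-wise dynamic programme below are illustrative and not the GFS implementation.

```python
from typing import List, Tuple


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest common substring, by row-wise dynamic programming."""
    previous = [0] * (len(b) + 1)
    best = 0
    for char_a in a:
        current = [0] * (len(b) + 1)
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


def best_match(query: str, candidates: List[str]) -> Tuple[int, int]:
    """Return (best length, number of candidates skipped by the cut-off)."""
    best, skipped = 0, 0
    for candidate in candidates:
        # A common substring cannot be longer than the shorter string, so a
        # pair whose shorter string is no longer than the best cannot improve it.
        if min(len(query), len(candidate)) <= best:
            skipped += 1
            continue
        best = max(best, longest_common_substring(query, candidate))
    return best, skipped
```

Comparing the longest candidates first raises `best` early and lets the cut-off fire more often.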

findLongestCommonSubstring

by Brian | 12th June 2009

As I believe characters are stored as a type of integer, converting to an integer array may be redundant and counter to its goal of expediting comparisons. Test time performance when removing this step.