MSMSFit idea: skip ion if fragment already accounted for
If a certain fragment fault-line has already been accounted for by another ion, we can skip further ions for that location.
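A minimal sketch of the skipping idea, in Python rather than the MSMSFit code itself. The tuple layout `(site, ion_type, mz)`, the function name `select_ions`, and the m/z values are all made up for illustration; `site` stands for the index of the fault-line a fragment comes from.

```python
def select_ions(candidate_ions):
    """Keep the first ion seen for each fault-line and skip the rest.

    Each candidate is a (site, ion_type, mz) tuple. This layout is an
    assumption for the sketch, not the MSMSFit data structure.
    """
    covered = set()  # fault-lines already accounted for
    kept = []
    for site, ion_type, mz in candidate_ions:
        if site in covered:
            continue  # another ion already explains this location
        covered.add(site)
        kept.append((site, ion_type, mz))
    return kept


# Illustrative values only, not real measurements.
ions = [(1, "b", 114.09), (1, "y", 548.30), (2, "b", 227.18), (2, "a", 199.18)]
print(select_ions(ions))
```

Which ion "wins" a location here is simply the first one encountered; a real scorer might prefer the most intense or most reliable ion type instead.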
Similar in principle to the various Folding@home projects. It may seem that GFS is not conducive to such segmentation, since genomic data is large enough to make download times prohibitive. This is not the case if a node “specializes” in a subset of the genome. A node only needs access to a spectrum file and [...]
For a basic database which holds peptides from in silico digestion of a genome, I think these fields will be helpful:
peptide – the chain sequence of amino acids
location – this will probably be many fields which will track location information
mass – (theoretical)
size – the number of amino acids
cleavage – number [...]
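As a rough sketch, the fields above could be laid out as a single SQLite table. Everything beyond the five field names is an assumption: the three location columns (`seq_name`, `start`, `stop`) are hypothetical stand-ins for the "many fields", the `cleavage` column is kept as a plain integer because its description is cut off above, and the index on `mass` is just a guess at what lookups would need.

```python
import sqlite3

SCHEMA = """
CREATE TABLE peptides (
    peptide   TEXT    NOT NULL,  -- chain sequence of amino acids
    seq_name  TEXT    NOT NULL,  -- location (hypothetical): source sequence
    start     INTEGER NOT NULL,  -- location (hypothetical): start offset
    stop      INTEGER NOT NULL,  -- location (hypothetical): stop offset
    mass      REAL    NOT NULL,  -- theoretical mass
    size      INTEGER NOT NULL,  -- number of amino acids
    cleavage  INTEGER NOT NULL   -- "cleavage" field from the list above
);
CREATE INDEX peptides_by_mass ON peptides (mass);
"""


def open_peptide_db(path=":memory:"):
    """Create the table in a new database (in memory by default)."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def peptides_in_mass_window(conn, low, high):
    """Return peptide sequences whose theoretical mass lies in [low, high]."""
    rows = conn.execute(
        "SELECT peptide FROM peptides WHERE mass BETWEEN ? AND ? ORDER BY mass",
        (low, high),
    )
    return [row[0] for row in rows]


# Placeholder rows, for illustration only.
conn = open_peptide_db()
conn.execute("INSERT INTO peptides VALUES (?, ?, ?, ?, ?, ?, ?)",
             ("AAA", "seq1", 0, 8, 231.12, 3, 0))
conn.execute("INSERT INTO peptides VALUES (?, ?, ?, ?, ?, ?, ?)",
             ("GG", "seq1", 9, 14, 132.05, 2, 0))
print(peptides_in_mass_window(conn, 200.0, 300.0))
```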
So MSMSFit may be a good replacement for seqtagscore, but it doesn’t solve the redundant processing overhead that comes with comparing a directory of FASTA files to a directory of pkl files. The way we are doing it now amounts to an individual job for each comparison; that means that if there are 100 pkl [...]
It looks as though MSMS data may contain multiple peak lists for the same polypeptide. Would it be beneficial to combine the data of such suspected duplicates? I’ve noticed that sometimes there will be large gaps in regions of a peak list; combining lists could fill these gaps and perhaps help our algorithms. Of [...]
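One possible way to combine two suspected duplicates, sketched in Python. The peak representation `(mz, intensity)`, the tolerance of 0.5, and the merge rule (sum the intensities, take the intensity-weighted mean m/z) are all assumptions chosen for the example, not something we have tested on real data. Intensities are assumed to be positive.

```python
def merge_peak_lists(a, b, tol=0.5):
    """Merge two peak lists, fusing peaks that lie within `tol` of each other.

    `a` and `b` are lists of (mz, intensity) tuples with positive intensity.
    Fused peaks get the summed intensity and the intensity-weighted mean m/z.
    """
    merged = []
    for mz, intensity in sorted(a + b):
        if merged and mz - merged[-1][0] <= tol:
            prev_mz, prev_intensity = merged[-1]
            total = prev_intensity + intensity
            merged[-1] = ((prev_mz * prev_intensity + mz * intensity) / total, total)
        else:
            merged.append((mz, intensity))
    return merged


# Made-up peaks: list_b fills the gap around m/z 200 that list_a lacks.
list_a = [(100.0, 10.0), (300.0, 5.0)]
list_b = [(100.2, 10.0), (200.0, 7.0)]
for peak in merge_peak_lists(list_a, list_b):
    print(peak)
```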
In calcScoresForMSMS, the loop which goes through the masses surrounds the loop which goes through the sequence. If we put the sequence loop on the outside, this will confer some advantages: translateNucleotideSequence won’t have to be called over and over; the code which produces a probable MSMS spectrum for a given sequence could be called [...]
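The loop swap can be sketched as follows. This is Python, not the actual calcScoresForMSMS; `toy_translate` and `toy_score` are placeholder stand-ins (the first is not a real codon translation) that exist only to count how often the translation step runs in each ordering.

```python
def scores_masses_outside(sequences, masses, translate, score):
    """Mass loop outside: the translation runs once per (mass, sequence) pair."""
    results = {}
    for mass in masses:
        for seq in sequences:
            results[(seq, mass)] = score(translate(seq), mass)
    return results


def scores_sequences_outside(sequences, masses, translate, score):
    """Sequence loop outside: the translation runs once per sequence."""
    results = {}
    for seq in sequences:
        protein = translate(seq)  # hoisted out of the mass loop
        for mass in masses:
            results[(seq, mass)] = score(protein, mass)
    return results


calls = {"n": 0}


def toy_translate(seq):
    calls["n"] += 1
    return seq[::3]  # placeholder only, NOT a real translation


def toy_score(protein, mass):
    return len(protein) * mass  # placeholder score


seqs = ["ATGGCCAAA", "ATGTTT"]
masses = [500.0, 750.0, 1000.0]

calls["n"] = 0
before = scores_masses_outside(seqs, masses, toy_translate, toy_score)
print("masses outside:   ", calls["n"], "translations")

calls["n"] = 0
after = scores_sequences_outside(seqs, masses, toy_translate, toy_score)
print("sequences outside:", calls["n"], "translations")
print("same scores:", before == after)
```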
A common trick in image matching (e.g. face recognition, character recognition) is to first scale the two images being compared to the same dimensions in pixels. My thought is that this technique could be used to estimate similarity between two sequences of amino acids. Yesterday I put up a post about a possible ordering [...]
Some fails on overall conversion to char… Am trying a more scaled-back approach where just charToIntRepresentation returns a char array. With 4 tandem spectra on E. coli (Escherichia_coli_K-12_MG1655.fasta) the improvement was 3.7% (1m51s to 1m47s). Not much. With 68 tandem spectra on E. coli there was, surprisingly, negative improvement (19m41s (1181s) to 19m55s [...]
matchTagsFiltered is where the unnecessary charToIntRepresentation is used. Though charToIntRepresentation may not be entirely unnecessary, as tags are an NSArray, which still needs to be converted… Need to look at this later. Too distracted by converting NSStrings to chars…
Strings are essentially arrays of Unicode characters, which can take up to 4 bytes each. As we are comparing proteome data, we don’t need that size of a data container to represent the amino acids. If rather than Strings we used arrays of characters (one byte each) then comparisons and assignment operators could be sped up as much [...]
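To put numbers on the storage side of this, here is a small Python check of how many bytes one sequence takes in different encodings. It only illustrates container size; it says nothing about how much faster our comparisons would get. The function name `storage_sizes` and the example peptide are made up.

```python
def storage_sizes(seq):
    """Bytes needed to hold `seq` in three fixed encodings."""
    return {
        "ascii": len(seq.encode("ascii")),       # one byte per residue
        "utf-16": len(seq.encode("utf-16-le")),  # two bytes per residue
        "utf-32": len(seq.encode("utf-32-le")),  # four bytes per residue
    }


# Amino acid one-letter codes are all plain ASCII, so one byte each is enough.
print(storage_sizes("PEPTIDE"))
```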
Working from home (waiting for plumber). Had to quickly drop by lab to pick up source files and parameters list. Future idea: need to make a “getting started with GFS” intro guide. Success! Found that if we create a temp variable to store acidArrayAsInts[i] for comparisons, then we save 2% to 3% of time on findLongestCommonSubstring.
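The shape of that change, sketched in Python. This is a textbook dynamic-programming longest-common-substring, assumed for illustration; it is not the actual findLongestCommonSubstring, and the 2% to 3% figure above was measured on the real code, not on this sketch. The point is only where the temp variable goes.

```python
def longest_common_substring_length(a, b):
    """Length of the longest run of items shared by sequences `a` and `b`."""
    best = 0
    prev = [0] * (len(b) + 1)  # run lengths ending at the previous item of a
    for i in range(len(a)):
        a_i = a[i]  # temp variable: read a[i] once, not once per comparison
        cur = [0] * (len(b) + 1)
        for j in range(len(b)):
            if a_i == b[j]:
                cur[j + 1] = prev[j] + 1
                if cur[j + 1] > best:
                    best = cur[j + 1]
        prev = cur
    return best


print(longest_common_substring_length("PEPTIDE", "TIDEPEP"))
```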
It would be interesting to check whether results change if longestCommonSubstring were skipped whenever the query sequence is significantly longer than the comparison sequence. Perhaps an if clause with a cut-off based on the length of the existing longest string relative to the length of the sequence string.
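One safe version of such a cut-off, as a Python sketch: a common substring can never be longer than the shorter of the two sequences, so any comparison where that length does not exceed the best match so far can be skipped without changing the result. The names `lcs_length` and `best_match` are made up, and the standard library's `difflib.SequenceMatcher` stands in for our own longestCommonSubstring. A more aggressive cut-off based on a length ratio would skip more but could change results, which is the thing to check.

```python
from difflib import SequenceMatcher


def lcs_length(a, b):
    """Longest common substring length, via the standard library."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def best_match(query, candidates):
    """Return (best candidate, its match length, number of comparisons skipped)."""
    best_len, best_seq, skipped = 0, None, 0
    for cand in candidates:
        # Upper bound on any common substring; skip if it cannot beat best_len.
        if min(len(query), len(cand)) <= best_len:
            skipped += 1
            continue
        n = lcs_length(query, cand)
        if n > best_len:
            best_len, best_seq = n, cand
    return best_seq, best_len, skipped


print(best_match("PEPTIDE", ["TIDEPEP", "PEP", "TID", "PEPTIDEK"]))
```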
As I believe characters are stored as a type of integer, converting them to an integer array may be redundant and counter to its goal of expediting comparisons. To do: test time performance with this step removed.
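A Python analogy for the test described above, not the Objective-C code itself: the items of a `bytes` object are already integers, so the explicit conversion step in the second function adds work without changing the answer. The function names and sequences are made up, and the timings will vary by machine, so no numbers are claimed here.

```python
import timeit


def count_equal_direct(a, b):
    """Compare two bytes objects item by item; the items are already ints."""
    return sum(x == y for x, y in zip(a, b))


def count_equal_converted(a, b):
    """Same comparison, but with an explicit conversion to int lists first."""
    a_ints = [int(x) for x in a]
    b_ints = [int(x) for x in b]
    return sum(x == y for x, y in zip(a_ints, b_ints))


def time_both(a, b, number=20):
    """Time both variants on the same input; returns seconds per variant."""
    return {
        "direct": timeit.timeit(lambda: count_equal_direct(a, b), number=number),
        "converted": timeit.timeit(lambda: count_equal_converted(a, b), number=number),
    }


seq_a = b"MKTAYIAKQR" * 1000  # made-up sequences, one byte per residue
seq_b = b"MKTAYLAKQR" * 1000
print(count_equal_direct(seq_a, seq_b) == count_equal_converted(seq_a, seq_b))
print(time_both(seq_a, seq_b))
```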