How GFS handles MS/MS files
I’m curious if a job on one DTA/PKL file with N spectra will run faster than a job that processes the exact same data broken into N different files.
I just wrote a quick Java app to take OutputFile2.dta, which has 112 spectra, and split it into 112 separate DTA files.
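For reference, a splitter along these lines could look like the sketch below. It is not the actual app. It assumes, purely for illustration, that spectra in the combined file are separated by one or more blank lines, which may not match how OutputFile2.dta is laid out. The class name DtaSplitter and the output names spectrum_001.dta, spectrum_002.dta, and so on are made up.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class DtaSplitter {

    // ASSUMPTION: spectra in the combined file are separated by blank lines.
    // Adjust the boundary test if the real file uses a different delimiter.
    static List<String> splitSpectra(String combined) {
        List<String> spectra = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : combined.split("\\R")) {
            if (line.trim().isEmpty()) {
                // A blank line closes the spectrum collected so far.
                if (current.length() > 0) {
                    spectra.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(line).append('\n');
            }
        }
        // The last spectrum may not be followed by a blank line.
        if (current.length() > 0) {
            spectra.add(current.toString());
        }
        return spectra;
    }

    // Usage: java DtaSplitter <combined file> <output directory>
    public static void main(String[] args) throws IOException {
        Path input = Paths.get(args[0]);
        Path outDir = Paths.get(args[1]);
        Files.createDirectories(outDir);

        String combined = new String(Files.readAllBytes(input), StandardCharsets.UTF_8);
        List<String> spectra = splitSpectra(combined);

        // Write each spectrum to its own numbered file (hypothetical naming scheme).
        for (int i = 0; i < spectra.size(); i++) {
            Path out = outDir.resolve(String.format("spectrum_%03d.dta", i + 1));
            Files.write(out, spectra.get(i).getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Wrote " + spectra.size() + " files to " + outDir);
    }
}
```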
At the moment I am getting the baseline: GFS running OutputFile2.dta on chr4.fasta. Done. It took 19 minutes 10 seconds.
Now running GFS on the directory containing 112 separate DTA files, where each file contains one spectrum. Done. It took 26 minutes 17 seconds.
It takes about 6 min 20 sec to digest chr4. That means pure sequence-searching time goes up about 55% (12 min 50 sec to 19 min 57 sec; 770 sec to 1197 sec).
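The arithmetic can be double-checked with a few lines of Java. This is only a sketch: it subtracts the roughly 6 min 20 sec digestion time from each wall-clock total and compares what is left. Since the digestion time is approximate, the result is only good to a few seconds. The class and method names are made up for the example.

```java
public class TimingCheck {

    // Convert a minutes/seconds pair to total seconds.
    static int toSeconds(int minutes, int seconds) {
        return minutes * 60 + seconds;
    }

    // Percentage increase going from baseline to other.
    static double percentIncrease(int baseline, int other) {
        return 100.0 * (other - baseline) / baseline;
    }

    public static void main(String[] args) {
        int digest = toSeconds(6, 20);        // approximate chr4 digestion time
        int singleFile = toSeconds(19, 10);   // one DTA file with 112 spectra
        int manyFiles = toSeconds(26, 17);    // 112 single-spectrum DTA files

        // Search time is the wall-clock total minus digestion.
        int singleSearch = singleFile - digest;
        int manySearch = manyFiles - digest;

        System.out.printf("single file search: %d sec%n", singleSearch);
        System.out.printf("many files search:  %d sec%n", manySearch);
        System.out.printf("increase: %.1f%%%n", percentIncrease(singleSearch, manySearch));
    }
}
```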
What could account for this? Multiple resorting of the fragments array in seqFragPackage?
UPDATE: I am finding that when multithreading is engaged, multiple files can actually perform better than one huge file. I’m running a quad-core machine, and multiple files run significantly better because (I’m assuming) the parceling of jobs is much more efficient.
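To illustrate why many small files might parcel out well across cores, here is a minimal sketch of per-file jobs handed to a fixed-size thread pool. This is not how GFS actually schedules work (I have not confirmed that). searchFile is a hypothetical placeholder for searching one file, and the thread count of 4 just mirrors the quad-core machine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFileJobs {

    // HYPOTHETICAL stand-in for searching one spectrum file; not a GFS method.
    // It returns the file name length so the example has a result to collect.
    static int searchFile(String fileName) {
        return fileName.length();
    }

    // Submit one job per file to a pool of `threads` workers and collect
    // the results in the same order as the input list.
    static List<Integer> runAll(List<String> files, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String file : files) {
                Callable<Integer> job = () -> searchFile(file);
                futures.add(pool.submit(job));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get()); // blocks until that job finishes
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("file job failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("spectrum_001.dta", "spectrum_002.dta", "spectrum_003.dta");
        System.out.println(runAll(files, 4));
    }
}
```

With one job per file, an idle worker simply takes the next file off the queue, so 112 small jobs spread across four cores fairly evenly. A single large file gives the pool only one unit of work unless the program splits it internally.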