proteomic database fields

by Brian | 1st September 2009

For a basic database which holds peptides from in silico digestion of a genome, I think these fields will be helpful:

  • peptide – the amino acid sequence of the peptide
  • location – probably several fields that together record where the peptide sits in the genome
  • mass – the theoretical mass
  • size – the number of amino acids
  • cleavage – number of missed trypsin cleavage sites (0 or 1)
  • total – the number of genomes in the database that contain the peptide
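
To make the per-peptide fields concrete, here is a rough Python sketch of how the peptide, size, cleavage and mass values could be produced for one sequence. It rests on several assumptions that are not settled in this post: the input is an already-translated protein sequence, trypsin cuts after K or R unless the next residue is P (a commonly used rule), and mass means the monoisotopic mass of the neutral peptide. The names `cleave` and `digest` are made up for illustration, and location is left out because it depends on how the genome coordinates are tracked.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def cleave(protein):
    """Split after K or R, except when the next residue is P (assumed rule)."""
    pieces, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])  # C-terminal piece with no K/R at the end
    return pieces


def digest(protein, max_missed=1):
    """Return one row per peptide with 0..max_missed missed cleavage sites."""
    pieces = cleave(protein)
    rows = []
    for missed in range(max_missed + 1):
        for i in range(len(pieces) - missed):
            seq = "".join(pieces[i:i + missed + 1])
            mass = sum(RESIDUE_MASS[aa] for aa in seq) + WATER
            rows.append({"peptide": seq, "size": len(seq),
                         "cleavage": missed, "mass": round(mass, 4)})
    return rows
```

Calling `digest("AKGRPLK")` gives three rows: `AK` and `GRPLK` with no missed cleavage, and `AKGRPLK` with one. The R in `GRPLK` is not a cut site because it is followed by P.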

Total is worth explaining. If two E. coli genomes are digested and put into the database and a peptide occurs in both (very likely), then the “total” value for that peptide will be 2. If 50 E. coli genomes are added to the database and a certain peptide is present in only half of them, then the “total” value in that peptide's entry will be 25.
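
As a small sketch of that counting rule (not a final design), the total for each peptide is the number of genomes it appears in, so a peptide that occurs several times within one genome still counts once for that genome. The function name `peptide_totals` and the dict-of-lists input are assumptions made for the example.

```python
from collections import Counter


def peptide_totals(genomes):
    """genomes maps a genome name to an iterable of its peptide sequences."""
    totals = Counter()
    for peptides in genomes.values():
        # set() ensures each genome contributes at most 1 to a peptide's total
        totals.update(set(peptides))
    return totals
```

With `{"genome_1": ["AK", "GRPLK", "AK"], "genome_2": ["AK"]}` as input, `AK` gets a total of 2 and `GRPLK` gets a total of 1.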

The usefulness is, of course, in detecting anomalies such as mutations and errors. GFS can search for a match to some MS/MS data, but if the best-matching peptide has a “total” value of only 1 when the maximum “total” value is 50, then we could look at that result with more skepticism.
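
One way to turn that skepticism into a check is to compare a peptide's total with the maximum total in the database. This is only a sketch: the function name `is_suspect` and the 10% cutoff are placeholders chosen for illustration, not values anyone has tested.

```python
def is_suspect(total, max_total, min_fraction=0.1):
    """Flag a match whose peptide is found in few of the stored genomes.

    min_fraction is a hypothetical cutoff, chosen only for illustration.
    """
    if max_total <= 0:
        raise ValueError("max_total must be positive")
    return total / max_total < min_fraction
```

For the example above, `is_suspect(1, 50)` is true (1/50 = 0.02), while `is_suspect(25, 50)` is false.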

UPDATE: We could have TWO tables. One that keeps only unique peptides (perhaps along with the total), but the other table will keep every peptide. This way when we find a peptide we are interested in from the smaller, unique-peptide table, we can then use the larger table to tell us exactly which genomes contain that peptide.
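
Here is a rough sketch of the two-table idea. It uses Python's built-in sqlite3 module with an in-memory database so the example is self-contained; the real thing would presumably live in MySQL. The table names, column names and sample rows are all made up for illustration, and location is reduced to a single `start` column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE peptide_occurrence (      -- the larger table: every peptide
    peptide TEXT NOT NULL,
    genome  TEXT NOT NULL,
    start   INTEGER NOT NULL           -- stand-in for the location fields
);
CREATE TABLE unique_peptide (          -- the smaller table: one row per peptide
    peptide TEXT PRIMARY KEY,
    total   INTEGER NOT NULL
);
""")

# Hypothetical sample data: AK occurs in two genomes, GRPLK in one.
rows = [("AK", "genome_1", 10), ("AK", "genome_2", 42), ("GRPLK", "genome_1", 12)]
conn.executemany("INSERT INTO peptide_occurrence VALUES (?, ?, ?)", rows)

# Fill the unique table, counting each genome once per peptide.
conn.execute("""
INSERT INTO unique_peptide (peptide, total)
SELECT peptide, COUNT(DISTINCT genome)
FROM peptide_occurrence
GROUP BY peptide
""")


def genomes_containing(peptide):
    """Look up in the larger table which genomes contain the peptide."""
    cur = conn.execute(
        "SELECT DISTINCT genome FROM peptide_occurrence "
        "WHERE peptide = ? ORDER BY genome",
        (peptide,),
    )
    return [genome for (genome,) in cur]
```

A search would hit `unique_peptide` first, and only for an interesting peptide would `genomes_containing` go to the larger table. An index on `peptide_occurrence(peptide)` would probably be needed once the table gets big.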

Any other fields that would be useful?

LINK: getting MySQL, PHP and phpMyAdmin running on OS X

MySQL C API

2 Responses to “proteomic database fields”

  1. jainab

    Sep 1st, 2009 :

    Hi Brian,
    I think that should do the job. The only thing I can think of right now is that we need to add the frame number. We also need to somehow link the genomic sequence name to both the location information and the total, because the location is different for each sequence, and the total alone does not make sense unless we know the names of the sequences it counts.

  2. Peter

    Dec 21st, 2009 :

    I have put together a normalized schema for this, but will need help with the specifics. Currently I do not store information about number of relations to a peptide/sequence because this can be computed efficiently.

    Using a database like this may make implementation under condor or other parallel computing frameworks more efficient, since sequences may be demand loaded.

    I would post a drawing here but I do not know how this tool accommodates drawings…
