“String Scaling” for fuzzy string comparison

by Brian | 25th June 2009

A common trick in image matching (e.g. face recognition, character recognition, etc.) is when comparing two images to first scale them to the same dimensions in pixels. My thought is that this technique could be used to estimate similarity between two sequences of amino acids.

Yesterday I put up a post about a possible ordering of amino acids. Here I’ve assigned each of those amino acids a color. Now, colors are essentially points in three dimensions (red, green, blue). Assigning amino acids a color is good for visualization purposes, but it is lossy. Ideally we would be using the equivalent of a 20-dimensional color.
color-assignment

Our two demonstration sequences shall be the ones employed from this site:

HEAGAWGHEE and PAWHEAE

Using the Needleman & Wunsch algorithm we get this global alignment:
nw

Using the Smith & Waterman algorithm we get this local alignment:
sw

Both of these use O(mn) where m and n are the two string lengths. That’s a lot of computation to see that two strings have 30% similarity. If a similarity factor is all that is desired (no alignment data needed), then string scaling may be a viable option as it has a computation time of O(n + m).

A visual example. We take our two test strings and represent them as colors:

HEAGAWGHEE: heagawghee

PAWHEAE: pawheae

We then scale both images to be the same size:

HEAGAWGHEE scaled:
heagawghee-scaled

PAWHEAE scaled:
pawheae-scaled

Using a standard distance formula (such as square root of the sum of the squared differences) we can derive a factor of similarity.

Again, colors were employed for visual demonstration, but this is a lossy technique as we are essentially representing that which is 20 dimensional as 3 dimensional. Better accuracy in our similarity factor is dependent both on dimensions employed as well as the size to which the string is scaled (in this case both strings were scaled to a length of 5).

Leave a Reply

Name (Required)

Email (Required - will not be published)

Website

Message (Required)