Notes on hanging bug

by Brian | 3rd November 2009

Lately GFS has been hanging during long tasks. It is a hang where processing drops to nothing (vs. high processing with zero progress). When running with two threads it hangs after about 35 minutes (counting 5 minutes of digestion) in single thread mode it goes for 55 minutes.

When paused at the hang point and examined with the debugger I see that in thread 1 we are stuck in the callClientWithParams loop where

while (![self clientsCompleted]) {
		if (debugLevel > 1)
			fprintf(stdout, "In loop before sleep.\n");
		NSDate *interval = [[NSDate alloc] initWithTimeIntervalSinceNow:2.0];
		[NSThread sleepUntilDate:interval];
		[interval release];
		if (debugLevel > 1)
			fprintf(stdout, "In loop after sleep.\n");
	}

in thread 4, which looks more promising of a lead, we are stuck at runShotgunClient line

 [frags setMasses:masses];

“masses” is an NSArray with 0 items. Unfortunatly, I am not finding anymore clues about the masses variable (such as how we are getting no values, why we are hanging here, why this wasn’t happening before, how to avoid this, etc. etc.).

The bug disappears with a precursor cutoff of 1000Da, but appears with a 600Da cutoff. My hypothesis is that a fragment with something bad about it that weighs less than 1000Da is being grabbed.

This morning it hung after the setMasses line, but still in that general area of runShotgunClient. This line is what was hanging:

tandemMatchContainer = [frags getTandemMatchResults:dict];

There were 0 objects in “dict” (however, I think there are supposed to be 0 at that point) and again 0 objects in “masses”.

0167.1: 1009.1985 1
2669.1: 1196.2334 1

2644.3: 3252.1479 3
start time: 2009-11-04 09:15:07 -0500
end time: 2009-11-04 09:45:26 -0500

2660.2: 2285.9971 2
shotgun start: 2009-11-04 09:53:00 -0500
start time: 2009-11-04 09:54:51 -0500
end time: 2009-11-04 10:25:11 -0500

it could be that [clients count] never reaches 0 because a client isn’t deleted/deallocated?

when debug level is set to 2 we see “Done with VendServer” before the hang. It appears to be the one time the message is displayed.

looks like SERVER_TIMEOUT might be the culprit. Some questions arise: SERVER_TIMEOUT is 1200 which is 20 minutes, and this is running for at least 30 minutes and digestion only takes about 5 minutes. Also, why has this not cropped up before for jobs that have taken this amount of time? Why does it run so much longer in single thread mode?

shotgun start: 2009-11-04 14:33:43 -0500
shotgun end time: 2009-11-04 15:28:24 -0500

Leave a Reply

Name (Required)

Email (Required - will not be published)

Website

Message (Required)