Long, fruitless day trying to reproduce an error (And it's only 4-1/2 hours old)

Bryan and one of his peoples are experiencing what appears to be the "big nasty evil database bug" (formerly the "evil database bug"), where the database (b)locks in the middle of a hierarchy import (which is the only database operation that's occurring).

This has a lot of secondary effects, as once the database thread is blocked, the main thread winds up building up a huge queue of things to do in the database thread. Eventually something blocks solidly and the zombie watcher notices that the process is not responding and restarts it (trying to kill the old one as it goes).

Unfortunately, this particular way of crashing does not admit of killing, so the app winds up with two running copies. The second copy runs until it blocks on the database and eventually becomes unresponsive. The third copy then runs...

With enough time in this state, the time-to-failure for each new process drops to fractions of a second.
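To make the queue build-up concrete, here's a minimal Python sketch of a "database thread" that stalls while the main thread keeps handing it work. It is not the real application code: `simulate_blocked_worker`, `stall_after` and the two events are made up for illustration, and the stall is faked with an event wait rather than a real libpq call. The backlog it reports is the kind of growing queue described above.

```python
import queue
import threading


def simulate_blocked_worker(jobs=50, stall_after=5):
    """Return (jobs completed, jobs still queued) once the worker has stalled."""
    work = queue.Queue()
    stalled = threading.Event()   # set by the worker just before it "blocks"
    release = threading.Event()   # only used to clean up the thread afterwards
    completed = []

    def db_thread():
        while True:
            job = work.get()
            if job is None:       # sentinel: shut down
                return
            if len(completed) == stall_after:
                stalled.set()
                release.wait()    # stands in for the database call that never returns
            completed.append(job)

    worker = threading.Thread(target=db_thread, daemon=True)
    worker.start()

    # The main thread has no idea the worker is stuck, so it just keeps queueing.
    for job in range(jobs):
        work.put(job)

    stalled.wait(timeout=5)
    done, backlog = len(completed), work.qsize()

    # Unblock and stop the worker so the sketch exits cleanly.
    release.set()
    work.put(None)
    worker.join(timeout=5)
    return done, backlog
```

In the sketch the worker can be released; in the real failure nothing ever releases it, so the backlog only grows until the watcher steps in.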

That's all observed behaviour on the running system. The problem is that I have as yet been unable to reproduce the problem's source. I've been uploading and uploading and uploading every combination of "different" hierarchy maps I have and haven't been able to produce the crash. I'm sure I'm just missing some trivial variable; I just haven't been able to figure out what it is.

Ah, figured out the variable (finally): a set of existing records and a set of new records in the same data-set. That gets me to the point where I can reproduce the failure, but I'm still stuck with a failure that appears to be in either libpq or PostgreSQL itself.

