I spent a solid 7 hours this afternoon/evening working on Cinemon. Over the last few days I've left the live system running with the code that traps blocks, and it had identified the two places where unreasonably long blocks were occurring.
The second of those two is the one that's been causing problems on the live system. I was using semaphore-based parallel iteration code which was consuming generators. Each iteration of the generator was a fairly heavy operation (i.e. probably a second's worth of heavy calculation), with the idea being that only one iteration would ever occur in a single "time-slice". (There are 400+ iterations of that generator on the live system.)
Unfortunately for that approach, the semaphore-based parallel iteration code was doing a very straightforward [ sem.run( callable, item, ...) for item in iterable ] to populate a regular old DeferredList, which meant that every last iteration was being produced at the moment the parallel operation was initiated. Oops.
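To make the trap concrete, here's a minimal sketch in plain Python. It is not the actual parallel.py code: heavy_items and schedule are hypothetical stand-ins for the expensive generator and for sem.run, and the only point is to show that a list comprehension drains the generator before the call that builds the list returns.

```python
def heavy_items(log, count=5):
    """Hypothetical stand-in for the expensive generator."""
    for i in range(count):
        # The second or so of heavy calculation would happen here.
        log.append(f"produced {i}")
        yield i


def schedule(log, item):
    """Hypothetical stand-in for sem.run(callable, item): it only queues the item."""
    log.append(f"queued {item}")
    return item


def eager_start(log):
    # Same shape as the original list comprehension: building the list
    # pulls every item out of the generator before this function returns.
    return [schedule(log, item) for item in heavy_items(log)]


eager_log = []
eager_start(eager_log)

# Every item has already been produced, even though nothing has "run" yet.
produced_before_any_work = [e for e in eager_log if e.startswith("produced")]
print(len(produced_before_any_work), "items produced up front")
```

With 400+ iterations at roughly a second each, that up-front drain is one very long block.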
Once found, it was easy to fix: I just rewrote the parallel.py code to do what I thought it was doing in the first place. I also introduced staggering of the first X elements so that no more than one iteration is calculated in a single time-slice. I'm testing it now on the live system and should know by tomorrow whether it eliminates the problem entirely or whether yet more optimisations are required (I actually did quite a few of those before I found the parallelisation problem).
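For illustration, here's roughly the shape of the lazy version, sketched with the standard library's asyncio rather than the Twisted code actually in use, so it's an analogy and not the real parallel.py. The names run_parallel, limit and stagger are all made up for this sketch. A fixed number of workers share one iterator, each pulls its next item only when it's free, and the workers start a little apart so the first few items aren't all produced at once.

```python
import asyncio


async def run_parallel(func, iterable, limit=3, stagger=0.002):
    """Run func over iterable with at most `limit` items in flight.

    Items are pulled from the shared iterator one at a time, only when a
    worker is free, so a generator is never drained up front.
    """
    iterator = iter(iterable)
    results = []

    async def worker(index):
        # Staggered start: worker N waits N * stagger seconds, so the first
        # `limit` items are spread out instead of produced back to back.
        await asyncio.sleep(index * stagger)
        for item in iterator:  # next() on the generator happens here, lazily
            results.append(await func(item))

    await asyncio.gather(*(worker(i) for i in range(limit)))
    return results


# Small demonstration with hypothetical stand-ins for the real work.
log = []
in_flight = 0
max_in_flight = 0


def heavy_items(count):
    for i in range(count):
        log.append(("produced", i))  # the heavy calculation would go here
        yield i


async def process(item):
    global in_flight, max_in_flight
    in_flight += 1
    max_in_flight = max(max_in_flight, in_flight)
    log.append(("started", item))
    await asyncio.sleep(0.005)  # pretend to wait on the network
    in_flight -= 1
    return item * 2


results = asyncio.run(run_parallel(process, heavy_items(10), limit=3))
print(max_in_flight, "items in flight at most")
```

In the log, each "produced" entry is immediately followed by the matching "started" entry, which is the behaviour I thought I had in the first place: an item only gets calculated when something is ready to work on it.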
I'm thinking I'll ratchet down the hang-tolerance level to something like 0.5s and see what pops out there. I'm guessing tonight's changes will have increased our maximum scannable CMTS size by 20 or 30%, maybe more. I'm wondering if maybe I can get us up to double the size even before bringing in the spreadable servers.