Columnar Sort of N Columns in 2 Directions, a.k.a. Data-grid Sorting

There is a thread on Python-list about removing the "cmp" parameter to Python's sort. One little tidbit I had missed was the introduction of a stable sort guarantee with Python 2.2. I've got lots of older code that uses cmp, so I figured I'll just start rewriting those as I come across them... later that day, ran into the first one, the data-grids in RunSnakeRun were showing extremely slow sort times from... you guessed it, millions of calls to the various pieces of code invoked by the cmp methods.

The data-grids in RSR are "virtual" (that is, cell contents are looked up on demand for display (there really isn't any reason any more to do this, as we currently always have all records in ram anyway, but that's the way they were originally created)), but they allow the user to sort them on any column, either up or down.

So, we have N columns, each column may or may not be part of the sort schema (we originally just have a single sorting column), each column may be sorted up or down, columns can have many equal values. The code to implement that is charmingly simple, and does perform far better than the cmp-based version:

        for ascending,column in self.sortOrder[::-1]:
            self.sorted.sort( key=column.get, reverse=(not ascending))

The actual code has a note there regarding the stable sort guarantee, but that's all that's needed to provide the sorting. The sortOrder is kept in "human" order (that is, most-important-sort first), so iterate over it in reverse order. Use standard list sorting using the column's get method as a key, and either reverse or not based on whether the user wanted ascending or descending for that column. It's much shorter than the cmp() method was, and on those enormous meliae dumps the performance is far better.

So, accept the guarantee of a stable sort and move on with your key-based lives.

[Update] to be clear, you only need to do this to apply all current sorts to a new list, for the add-a-single-sort case, you can just sort by the new sort-column on the old sorted list.

Comments

  1. Eric

    Eric on 04/05/2011 2:57 p.m. #

    If you know that all of your reversed columns are numeric, then you could get away with a single sort using a tuple key:

        def keygen(row):
            return tuple((col.get(row) if asc else -col.get(row))
                for asc, col in reversed(self.sortOrder))
        self.sorted.sort(key=keygen)

    Granted, this means iterating over sortOrder for each row, and comparing tuples instead of integers, so profiling on real data might show that multiple sorts perform better. It's also useless for strings.

  2. Mike Fletcher

    Mike Fletcher on 04/05/2011 8:13 p.m. #

    Agreed, but RunSnakeRun has lots of string values on which people want to sort (module name, function name, object type), which are very hard to "invert" (not impossible, but a PITA when you allow for unicode values).

Comments are closed.

Pingbacks

Pingbacks are closed.