The Occasional Occurence

A sorting problem

April 12, 2007 at 11:11 AM | categories: Python, General

Here's the situation.

I have a list of known field names.

specific_order = ['id', 'age', 'zzyzx', 'bar', 'spam_eggs']

That list is in the order that I want the resulting data displayed in. The dataset is a simple dictionary where the items in the specific_order list are the keys.

However, in each dataset there is the possibility that derived fields/keys will be present, like 'bar_foo' or 'id_new'.

In the final dataset, I would like the fields to be ordered first by the known field names, and then by any derived field names. So, something like:

final = ['id', 'id_new', 'age', 'zzyzx', 'bar', 'bar_foo',  'spam_eggs']

So, with the specific_order list and a copy of that list sorted by descending length (spo_by_length) I have created the following comparison function (to be passed to list.sort()):

def cmp_headers(a, b):
    ca, cb = None, None
    if (a not in specific_order) or (b not in specific_order):
        for candidate in spo_by_length:
            if (ca and cb):
                break
            if not ca and a.startswith(candidate):
                ca = candidate
            if not cb and b.startswith(candidate):
                cb = candidate
    a = ca or a
    b = cb or b
    if a in specific_order and b in specific_order:
        return cmp(specific_order.index(a), specific_order.index(b))
    if a in specific_order:
        return -1
    if b in specific_order:
        return 1
    return cmp(a, b)

Here is some sample sorted output:

In [2]: keys = specific_order + ['id_old', 'id_new', 'bar_foo', 'spam_eggs_ham']
In [3]: sorted(keys, cmp_headers)
Out[3]:
['id',
 'id_old',
 'id_new',
 'age',
 'zzyzx',
 'bar',
 'bar_foo',
 'spam_eggs',
 'spam_eggs_ham']

The part that concerns me in cmp_headers is the for loop. Is there any way that I can get the desired sort order without looping over the length-sorted list of fields? Is there a better sort algorithm in general for this sort of task?

cw

PS - this is somewhat academic, as this sort operation will be performed once per dataset, so not too expensive in the long run. However, I would like to know if there is a better way.