A sorting problem

Here’s the situation.

I have a list of known field names.

specific_order = ['id', 'age', 'zzyzx', 'bar', 'spam_eggs']

That list is in the order that I want the resulting data displayed in. The dataset is a simple dictionary where the items in the specific_order list are the keys.

However, in each dataset there is the possibility that derived fields/keys will be present, like 'bar_foo' or 'id_new'.

In the final dataset, I would like the fields to be ordered first by the known field names, and then by any derived field names. So, something like:

final = ['id', 'id_new', 'age', 'zzyzx', 'bar', 'bar_foo',  'spam_eggs']

So, with the specific_order list and a copy of that list sorted by descending length (spo_by_length) I have created the following comparison function (to be passed to list.sort()):

def cmp_headers(a, b):
    ca, cb = None, None
    if (a not in specific_order) or (b not in specific_order):
        for candidate in spo_by_length:
            if (ca and cb):
                break
            if not ca and a.startswith(candidate):
                ca = candidate
            if not cb and b.startswith(candidate):
                cb = candidate
    a = ca or a
    b = cb or b
    if a in specific_order and b in specific_order:
        return cmp(specific_order.index(a), specific_order.index(b))
    if a in specific_order:
        return -1
    if b in specific_order:
        return 1
    return cmp(a, b)

Here is some sample sorted output:

In [2]: keys = specific_order + ['id_old', 'id_new', 'bar_foo', 'spam_eggs_ham']
In [3]: sorted(keys, cmp_headers)
Out[3]:
['id',
 'id_old',
 'id_new',
 'age',
 'zzyzx',
 'bar',
 'bar_foo',
 'spam_eggs',
 'spam_eggs_ham']

The part that concerns me in cmp_headers is the for loop. Is there any way that I can get the desired sort order without looping over the length-sorted list of fields? Is there a better sort algorithm in general for this sort of task?

cw

PS - this is somewhat academic, as this sort operation will be performed once per dataset, so not too expensive in the long run. However, I would like to know if there is a better way.

16 Responses to “A sorting problem”

  1. JR Rozko Says:

    The problem appears to be that you are speaking in absolute gibberish. However, I am also concerned about the for loop in the cmp_headers, this simply must not be - we must stop at nothing to alleviate this problem, no matter the cost! I am pretty sure all this can be solved by simply applying either the algorithm of E=MC squared or 2+2=4, really either should work.

  2. christian Says:

    Thanks for that JR. Do you think I should organize some sort of rally around this problem?

  3. Doug Napoleone Says:

    I have seen this problem come up a few times now. Unfortunately the only solution I have come up with myself for large data sets is a bit heavy handed. It requires creating a custom class to wrap the strings with a custom __cmp__ function and has its own initial sort index.

    The Idea is to have two lists. One with the sort order you want, another with the index into the first list, but sorted wising the string comparisons.
    That way you can use bisect (binary search) to find the closest match and decode on the insertion point. Then you need to update all the indexes in the index list above the insertion point. To do that efficiently you then need a dictionary mapping as well.

    Not sure if I am explaining myself well enough. I will try to draw up an example implementation and post it. the short answer is, yes it can be done faster, but at the cost of much more memory, and worse efficiency on small datasets.

    A more complex approach would be a tree implementation. This would work quite fast at the cost of code complexity. If you are sure you will never get any new roots, and you always have a delimiter (’_') then that is the best approach. You can run into problems if you have [ 'carto', 'car', 'cart'] where does ‘cartoon’ belong?

  4. Thomas Says:

    # my proposal:

    specific_order = ['id', 'age', 'zzyzx', 'bar', 'spam_eggs']
    fields = specific_order + ['id_old', 'id_new', 'bar_foo', 'spam_eggs_ham']
    
    idx_field_pairs = []
    
    for field in fields:
        for idx, prefix in enumerate(specific_order):
            if field.startswith(prefix):
                idx_field_pairs.append((idx, field))
                break
    
    idx_field_pairs.sort()
    print [field for (idx, field) in idx_field_pairs]
    

    EDIT: Formatted code with PRE tags. -cw

  5. Doug Napoleone Says:

    Thomas,
    The problem with that is it is On^2 which is no better than the original implementation and in some ways is worse as now both loops are in the interpreter and over more data to boot, as you concatenate the data for the fields array. If code complexity is the concern then your implementation is indeed the better option.

    -Doug

  6. christian Says:

    Thanks for the thoughts so far.

    Doug: Since you mentioned it a couple times, I think that for this particular problem I am looking for a balance between speed and code simplicity. Your tree suggestion is interesting though, as the derived names will always be joined to a parent name with an underscore.

    Thomas: As Doug mentioned, yours is definitely the easiest solution to follow.

  7. sri Says:

    OK, screwed up twice:
    here is in on my blog:
    http://defcraft.blogspot.com/2007/04/merge-problem.html

  8. Gary Bernhardt Says:

    I *think* this is O(n log n) (and only then, because of the original sorting of the keys). It builds a mapping from the parent keys (those in the original class) to the child keys (all of them). It does that by sorting both the parents and the children and stepping through them at the same time, so it’s O(n). Then it massages the child lists to get them back into their original order, and finally joins all of the child lists together based on the parents’ specific_order.

    This started out so simple, and became so disgusting. I promise, I don’t usually write code this nasty. :)

    def awesome_sort(keys):
        mapping = field_mapping(keys)
        mapping = restore_natural_order(mapping, keys)
        # Join the child fields, ordered by their specific order
        return sum([mapping[k] for k in specific_order], [])
    
    def field_mapping(keys):
        “”"
        Return a dictionary mapping each parent field to all fields that start with
        it (including itself).
        “”"
        mapping = dict((k, []) for k in specific_order)
        # Sort both the keys and the parent keys so we can step through them
        # together.
        sorted_keys, sorted_parents = sorted(keys), sorted(specific_order)
    
        # This loop is O(n) - it hits each element in sorted_keys and
        # sorted_parents at most once.
        for parent in sorted_parents:
            # Eat any keys that begin with this parent key
            while sorted_keys and sorted_keys[0].startswith(parent):
                mapping[parent].append(sorted_keys.pop(0))
    
        return mapping
    
    def restore_natural_order(mapping, keys):
        “”"
        Given a mapping from field_mapping(), reorder the child key lists to their
        original order.
        “”"
        # Invert the mapping - build a dictionary mapping each child key to its
        # parent key.
        inverse_mapping = {}
        for parent, child_keys in mapping.iteritems():
            for k in child_keys:
                inverse_mapping[k] = parent
    
        # Invert the inverted mapping (ugh) to rebuild the original mapping, but
        # with the original order.
        result = dict((k, []) for k in specific_order)
        for key in keys:
            parent = inverse_mapping[key]
            result[parent].append(key)
    
        return result
    
    assert sorted(keys, cmp_headers) == awesome_sort(keys)
    

    EDIT: added PRE tags back in - cw

  9. Gary Bernhardt Says:

    sri,

    I thought about doing it that way as well, but Christian never said you could assume underscores. Without a token to split on, you’d have to search for common prefixes between the two sets of keys. In that case, I think you’d either have to loop over both lists (n^2), or sort both of them and process them both in order (something like my solution - n log n).

  10. sri Says:

    well, he did say *derived* keys,
    so you should be able to exploit on that

  11. JR Rozko Says:

    ok, I guess they have some good ideas as well.

  12. christian Says:

    Thanks for more ideas guys!

    Gary: Very cool. Nice comments too :-) I think I might go with something like this. I’ll have to massage it into a cmp-like function to work with the API that I am using though. It requires a cmp function to sort the fields.

    sri: Also a good looking solution, and you are right, there is definite merging going on. However, as Gary mentioned, assuming the underscores probably wouldn’t work. A parent name could be foo_bar and a derived name could be foo_bar_spam_eggs. Crazy, I know :-)

    And JR, well, thanks for sticking around :-)

  13. Thomas Says:

    (my first attempt to submit this has been rejected as spam, so here is another try …)

    christian: I understand that you need a solution with a cmp function, but I want to share the result of my experiments anyway.

    The code I posted above was indeed not intended to be runtime optimal, it was just “the simplest thing that could possible work”, for me at last.

    I did some measurements and awesome_sort is indeed MUCH faster:

    naive_sort 1500 keys => 0.341000080109 sec
    naive_sort 3000 keys => 1.31200003624 sec
    naive_sort 4500 keys => 3.28399991989 sec
    naive_sort 6000 keys => 5.34800004959 sec

    awesome_sort 1500 keys => 0.00999999046326 sec
    awesome_sort 3000 keys => 0.0699999332428 sec
    awesome_sort 4500 keys => 0.130000114441 sec
    awesome_sort 6000 keys => 0.210999965668 sec

    (naive_sort is my solution from above)

    But it can be even further improved.

    The first thing I noticed is the use of sum to flatten a list.
    sum(list_of_lists, []) is analogous to

    result = []
    for lst in list_of_lists:
    result = result + lst

    Given that list1 + list2 is O(len(list1) + len(list2)) it follows that this sum idiom is O(N**2).
    In order to replace it, I finally removed restore_natural_order completely and used the inline loop shown below instead.

    def awesome_sort2(keys, specific_order):
    mapping = field_mapping2(keys, specific_order)
    # concat the fragments according to the order give by specific_orderS
    result = []
    for key in specific_order:
    result.extend(mapping[key])
    return result

    Another opportunity for improvement was the use of pop(0) in loop of field_mapping.
    pop(0) has complexity O(N), as the whole array has to be memmove’d, but removing the last list element with pop() is O(1).
    So I sorted the list keys in reversed order and used pop() instead of pop(0), leading to:

    def field_mapping2(keys, specific_order):
    mapping = dict((k, []) for k in specific_order)
    # change: reverse=True
    sorted_keys, sorted_parents = sorted(keys, reverse=True), sorted(specific_order)

    for parent in sorted_parents:
    while sorted_keys and sorted_keys[-1].startswith(parent): # change: -1
    mapping[parent].append(sorted_keys.pop()) # change: pop()

    return mapping

    All this results in a massive speedup:

    awesome_sort 7500 keys => 0.351000070572 sec
    awesome_sort 15000 keys => 1.29200005531 sec
    awesome_sort 22500 keys => 4.03499984741 sec
    awesome_sort 30000 keys => 11.1059999466 sec

    awesome_sort2 7500 keys => 0.0400002002716 sec
    awesome_sort2 15000 keys => 0.110999822617 sec
    awesome_sort2 22500 keys => 0.120000123978 sec
    awesome_sort2 30000 keys => 0.170000076294 sec

  14. Thomas Says:

    For the record, here is how I create the test data:

    def random_permutation(lst):
    “”"Permutes lst by swapping random entries.”"”
    def swap(lst, i, j):
    tmp = lst[j]
    lst[j] = lst[i]
    lst[i] = tmp
    import random
    random.seed(0)
    result = list(lst)
    n = len(result)
    for i in range(n):
    j = random.randint(0, n-1)
    swap(result, i, j)
    return result

    def generate_dataset(n, suffixes):
    “”"Generates dataset as a pair (keys, specific_order).

    >>> generate_dataset(2, ["a", "b"])
    (['1/', '1/b', '0/a', '0/b', '0/', '1/a'], ['1/', '0/'])
    “”"
    specific_order = []
    keys = []
    for i in range(n):
    prefix = str(i) + “/” # to assure that no prefix is a prefix of another prefix
    specific_order.append(prefix)
    for suffix in suffixes:
    keys.append(prefix + suffix)
    keys.extend(specific_order)
    return random_permutation(keys), random_permutation(specific_order)

  15. Gary Bernhardt Says:

    Well done, Thomas. I wasn’t thinking a whole lot about implementation - academic as this exercise is, I was thinking “given an ideal list implementation…”, etc. The sum and pop optimizations were good finds.

    However, I think you need to keep restore_natural_order. Your results don’t match the originals:

    >>> sorted(['id', 'id_old', 'id_new'], cmp_headers) # Christian’s original method
    ['id', 'id_old', 'id_new']
    >>> awesome_sort(['id', 'id_old', 'id_new']) # matches original
    ['id', 'id_old', 'id_new']
    >>> awesome_sort2(['id', 'id_old', 'id_new'], specific_order) # doesn’t match - new and old are switched
    ['id', 'id_new', 'id_old']

    That happens because the lists that come out of the mapping function will be sorted alphanumerically. Christian’s original function sorted the keys by the specific order first, and by the original order of the child keys second. That’s what restore_natural_order fixes.

  16. Thomas Says:

    Well, I’ve noticed this change in semantics, but that’s how I apparently wrongly interpreted “ordered first by the known field names, and then by any derived field names”. I had understood that the suffixes should be used as the secondary sort criteria. Of course I should have mentioned that.

    But, anyway, an awesome_sort3 with reintroduced restore_natural_order, but with the faster field_mapping2 and the sum idiom replaced by a faster alternative too, ie

    def flatten(iteratable_of_lists):
    ..result = []
    ..for lst in iteratable_of_lists:
    ….result.extend(lst)
    ..return result

    def awesome_sort3(keys, specific_order):
    ..mapping = field_mapping2(keys, specific_order)
    ..mapping = restore_natural_order(mapping, keys, specific_order)
    ..# Join the child fields, ordered by their specific order
    ..return flatten(mapping[k] for k in specific_order)

    has the correct semantics and is still much faster than the original solution:

    original:
    awesome_sort 7500 keys => 0.341000080109 sec
    awesome_sort 15000 keys => 1.20199990273 sec
    awesome_sort 22500 keys => 3.30400013924 sec
    awesome_sort 30000 keys => 8.97299981117 sec

    1st improved solution, wrong semantics:
    awesome_sort2 7500 keys => 0.0400002002716 sec
    awesome_sort2 15000 keys => 0.0899999141693 sec
    awesome_sort2 22500 keys => 0.121000051498 sec
    awesome_sort2 30000 keys => 0.169999837875 sec

    solution from above, correct semantics;
    awesome_sort3 7500 keys => 0.0500001907349 sec
    awesome_sort3 15000 keys => 0.119999885559 sec
    awesome_sort3 22500 keys => 0.22000002861 sec
    awesome_sort3 30000 keys => 0.300999879837 sec

Leave a Reply