A sorting problem
Here’s the situation.
I have a list of known field names.
specific_order = ['id', 'age', 'zzyzx', 'bar', 'spam_eggs']
That list is in the order that I want the resulting data displayed in. The dataset is a simple dictionary where the items in the specific_order list are the keys.
However, in each dataset there is the possibility that derived fields/keys will be present, like 'bar_foo' or 'id_new'.
In the final dataset, I would like the fields to be ordered first by the known field names, and then by any derived field names. So, something like:
final = ['id', 'id_new', 'age', 'zzyzx', 'bar', 'bar_foo', 'spam_eggs']
So, with the specific_order list and a copy of that list sorted by descending length (spo_by_length) I have created the following comparison function (to be passed to list.sort()):
def cmp_headers(a, b):
ca, cb = None, None
if (a not in specific_order) or (b not in specific_order):
for candidate in spo_by_length:
if (ca and cb):
break
if not ca and a.startswith(candidate):
ca = candidate
if not cb and b.startswith(candidate):
cb = candidate
a = ca or a
b = cb or b
if a in specific_order and b in specific_order:
return cmp(specific_order.index(a), specific_order.index(b))
if a in specific_order:
return -1
if b in specific_order:
return 1
return cmp(a, b)
Here is some sample sorted output:
In [2]: keys = specific_order + ['id_old', 'id_new', 'bar_foo', 'spam_eggs_ham'] In [3]: sorted(keys, cmp_headers) Out[3]: ['id', 'id_old', 'id_new', 'age', 'zzyzx', 'bar', 'bar_foo', 'spam_eggs', 'spam_eggs_ham']
The part that concerns me in cmp_headers is the for loop. Is there any way that I can get the desired sort order without looping over the length-sorted list of fields? Is there a better sort algorithm in general for this sort of task?
cw
PS - this is somewhat academic, as this sort operation will be performed once per dataset, so not too expensive in the long run. However, I would like to know if there is a better way.
April 12th, 2007 at 11:36 am
The problem appears to be that you are speaking in absolute gibberish. However, I am also concerned about the for loop in the cmp_headers, this simply must not be - we must stop at nothing to alleviate this problem, no matter the cost! I am pretty sure all this can be solved by simply applying either the algorithm of E=MC squared or 2+2=4, really either should work.
April 12th, 2007 at 12:21 pm
Thanks for that JR. Do you think I should organize some sort of rally around this problem?
April 12th, 2007 at 12:40 pm
I have seen this problem come up a few times now. Unfortunately the only solution I have come up with myself for large data sets is a bit heavy handed. It requires creating a custom class to wrap the strings with a custom __cmp__ function and has its own initial sort index.
The Idea is to have two lists. One with the sort order you want, another with the index into the first list, but sorted wising the string comparisons.
That way you can use bisect (binary search) to find the closest match and decode on the insertion point. Then you need to update all the indexes in the index list above the insertion point. To do that efficiently you then need a dictionary mapping as well.
Not sure if I am explaining myself well enough. I will try to draw up an example implementation and post it. the short answer is, yes it can be done faster, but at the cost of much more memory, and worse efficiency on small datasets.
A more complex approach would be a tree implementation. This would work quite fast at the cost of code complexity. If you are sure you will never get any new roots, and you always have a delimiter (’_') then that is the best approach. You can run into problems if you have [ 'carto', 'car', 'cart'] where does ‘cartoon’ belong?
April 12th, 2007 at 12:55 pm
# my proposal:
specific_order = ['id', 'age', 'zzyzx', 'bar', 'spam_eggs'] fields = specific_order + ['id_old', 'id_new', 'bar_foo', 'spam_eggs_ham'] idx_field_pairs = [] for field in fields: for idx, prefix in enumerate(specific_order): if field.startswith(prefix): idx_field_pairs.append((idx, field)) break idx_field_pairs.sort() print [field for (idx, field) in idx_field_pairs]EDIT: Formatted code with PRE tags. -cw
April 12th, 2007 at 1:06 pm
Thomas,
The problem with that is it is On^2 which is no better than the original implementation and in some ways is worse as now both loops are in the interpreter and over more data to boot, as you concatenate the data for the fields array. If code complexity is the concern then your implementation is indeed the better option.
-Doug
April 12th, 2007 at 1:22 pm
Thanks for the thoughts so far.
Doug: Since you mentioned it a couple times, I think that for this particular problem I am looking for a balance between speed and code simplicity. Your tree suggestion is interesting though, as the derived names will always be joined to a parent name with an underscore.
Thomas: As Doug mentioned, yours is definitely the easiest solution to follow.
April 12th, 2007 at 2:30 pm
OK, screwed up twice:
here is in on my blog:
http://defcraft.blogspot.com/2007/04/merge-problem.html
April 12th, 2007 at 3:41 pm
I *think* this is O(n log n) (and only then, because of the original sorting of the keys). It builds a mapping from the parent keys (those in the original class) to the child keys (all of them). It does that by sorting both the parents and the children and stepping through them at the same time, so it’s O(n). Then it massages the child lists to get them back into their original order, and finally joins all of the child lists together based on the parents’ specific_order.
This started out so simple, and became so disgusting. I promise, I don’t usually write code this nasty.
def awesome_sort(keys): mapping = field_mapping(keys) mapping = restore_natural_order(mapping, keys) # Join the child fields, ordered by their specific order return sum([mapping[k] for k in specific_order], []) def field_mapping(keys): “”" Return a dictionary mapping each parent field to all fields that start with it (including itself). “”" mapping = dict((k, []) for k in specific_order) # Sort both the keys and the parent keys so we can step through them # together. sorted_keys, sorted_parents = sorted(keys), sorted(specific_order) # This loop is O(n) - it hits each element in sorted_keys and # sorted_parents at most once. for parent in sorted_parents: # Eat any keys that begin with this parent key while sorted_keys and sorted_keys[0].startswith(parent): mapping[parent].append(sorted_keys.pop(0)) return mapping def restore_natural_order(mapping, keys): “”" Given a mapping from field_mapping(), reorder the child key lists to their original order. “”" # Invert the mapping - build a dictionary mapping each child key to its # parent key. inverse_mapping = {} for parent, child_keys in mapping.iteritems(): for k in child_keys: inverse_mapping[k] = parent # Invert the inverted mapping (ugh) to rebuild the original mapping, but # with the original order. result = dict((k, []) for k in specific_order) for key in keys: parent = inverse_mapping[key] result[parent].append(key) return result assert sorted(keys, cmp_headers) == awesome_sort(keys)EDIT: added PRE tags back in - cw
April 12th, 2007 at 5:18 pm
sri,
I thought about doing it that way as well, but Christian never said you could assume underscores. Without a token to split on, you’d have to search for common prefixes between the two sets of keys. In that case, I think you’d either have to loop over both lists (n^2), or sort both of them and process them both in order (something like my solution - n log n).
April 12th, 2007 at 5:43 pm
well, he did say *derived* keys,
so you should be able to exploit on that
April 12th, 2007 at 5:53 pm
ok, I guess they have some good ideas as well.
April 12th, 2007 at 7:28 pm
Thanks for more ideas guys!
Gary: Very cool. Nice comments too
I think I might go with something like this. I’ll have to massage it into a cmp-like function to work with the API that I am using though. It requires a cmp function to sort the fields.
sri: Also a good looking solution, and you are right, there is definite merging going on. However, as Gary mentioned, assuming the underscores probably wouldn’t work. A parent name could be foo_bar and a derived name could be foo_bar_spam_eggs. Crazy, I know
And JR, well, thanks for sticking around
April 13th, 2007 at 1:03 pm
(my first attempt to submit this has been rejected as spam, so here is another try …)
christian: I understand that you need a solution with a cmp function, but I want to share the result of my experiments anyway.
The code I posted above was indeed not intended to be runtime optimal, it was just “the simplest thing that could possible work”, for me at last.
I did some measurements and awesome_sort is indeed MUCH faster:
naive_sort 1500 keys => 0.341000080109 sec
naive_sort 3000 keys => 1.31200003624 sec
naive_sort 4500 keys => 3.28399991989 sec
naive_sort 6000 keys => 5.34800004959 sec
awesome_sort 1500 keys => 0.00999999046326 sec
awesome_sort 3000 keys => 0.0699999332428 sec
awesome_sort 4500 keys => 0.130000114441 sec
awesome_sort 6000 keys => 0.210999965668 sec
(naive_sort is my solution from above)
But it can be even further improved.
The first thing I noticed is the use of sum to flatten a list.
sum(list_of_lists, []) is analogous to
result = []
for lst in list_of_lists:
result = result + lst
Given that list1 + list2 is O(len(list1) + len(list2)) it follows that this sum idiom is O(N**2).
In order to replace it, I finally removed restore_natural_order completely and used the inline loop shown below instead.
def awesome_sort2(keys, specific_order):
mapping = field_mapping2(keys, specific_order)
# concat the fragments according to the order give by specific_orderS
result = []
for key in specific_order:
result.extend(mapping[key])
return result
Another opportunity for improvement was the use of pop(0) in loop of field_mapping.
pop(0) has complexity O(N), as the whole array has to be memmove’d, but removing the last list element with pop() is O(1).
So I sorted the list keys in reversed order and used pop() instead of pop(0), leading to:
def field_mapping2(keys, specific_order):
mapping = dict((k, []) for k in specific_order)
# change: reverse=True
sorted_keys, sorted_parents = sorted(keys, reverse=True), sorted(specific_order)
for parent in sorted_parents:
while sorted_keys and sorted_keys[-1].startswith(parent): # change: -1
mapping[parent].append(sorted_keys.pop()) # change: pop()
return mapping
All this results in a massive speedup:
awesome_sort 7500 keys => 0.351000070572 sec
awesome_sort 15000 keys => 1.29200005531 sec
awesome_sort 22500 keys => 4.03499984741 sec
awesome_sort 30000 keys => 11.1059999466 sec
awesome_sort2 7500 keys => 0.0400002002716 sec
awesome_sort2 15000 keys => 0.110999822617 sec
awesome_sort2 22500 keys => 0.120000123978 sec
awesome_sort2 30000 keys => 0.170000076294 sec
April 13th, 2007 at 1:04 pm
For the record, here is how I create the test data:
def random_permutation(lst):
“”"Permutes lst by swapping random entries.”"”
def swap(lst, i, j):
tmp = lst[j]
lst[j] = lst[i]
lst[i] = tmp
import random
random.seed(0)
result = list(lst)
n = len(result)
for i in range(n):
j = random.randint(0, n-1)
swap(result, i, j)
return result
def generate_dataset(n, suffixes):
“”"Generates dataset as a pair (keys, specific_order).
>>> generate_dataset(2, ["a", "b"])
(['1/', '1/b', '0/a', '0/b', '0/', '1/a'], ['1/', '0/'])
“”"
specific_order = []
keys = []
for i in range(n):
prefix = str(i) + “/” # to assure that no prefix is a prefix of another prefix
specific_order.append(prefix)
for suffix in suffixes:
keys.append(prefix + suffix)
keys.extend(specific_order)
return random_permutation(keys), random_permutation(specific_order)
April 13th, 2007 at 10:30 pm
Well done, Thomas. I wasn’t thinking a whole lot about implementation - academic as this exercise is, I was thinking “given an ideal list implementation…”, etc. The sum and pop optimizations were good finds.
However, I think you need to keep restore_natural_order. Your results don’t match the originals:
>>> sorted(['id', 'id_old', 'id_new'], cmp_headers) # Christian’s original method
['id', 'id_old', 'id_new']
>>> awesome_sort(['id', 'id_old', 'id_new']) # matches original
['id', 'id_old', 'id_new']
>>> awesome_sort2(['id', 'id_old', 'id_new'], specific_order) # doesn’t match - new and old are switched
['id', 'id_new', 'id_old']
That happens because the lists that come out of the mapping function will be sorted alphanumerically. Christian’s original function sorted the keys by the specific order first, and by the original order of the child keys second. That’s what restore_natural_order fixes.
April 14th, 2007 at 1:14 am
Well, I’ve noticed this change in semantics, but that’s how I apparently wrongly interpreted “ordered first by the known field names, and then by any derived field names”. I had understood that the suffixes should be used as the secondary sort criteria. Of course I should have mentioned that.
But, anyway, an awesome_sort3 with reintroduced restore_natural_order, but with the faster field_mapping2 and the sum idiom replaced by a faster alternative too, ie
def flatten(iteratable_of_lists):
..result = []
..for lst in iteratable_of_lists:
….result.extend(lst)
..return result
def awesome_sort3(keys, specific_order):
..mapping = field_mapping2(keys, specific_order)
..mapping = restore_natural_order(mapping, keys, specific_order)
..# Join the child fields, ordered by their specific order
..return flatten(mapping[k] for k in specific_order)
has the correct semantics and is still much faster than the original solution:
original:
awesome_sort 7500 keys => 0.341000080109 sec
awesome_sort 15000 keys => 1.20199990273 sec
awesome_sort 22500 keys => 3.30400013924 sec
awesome_sort 30000 keys => 8.97299981117 sec
1st improved solution, wrong semantics:
awesome_sort2 7500 keys => 0.0400002002716 sec
awesome_sort2 15000 keys => 0.0899999141693 sec
awesome_sort2 22500 keys => 0.121000051498 sec
awesome_sort2 30000 keys => 0.169999837875 sec
solution from above, correct semantics;
awesome_sort3 7500 keys => 0.0500001907349 sec
awesome_sort3 15000 keys => 0.119999885559 sec
awesome_sort3 22500 keys => 0.22000002861 sec
awesome_sort3 30000 keys => 0.300999879837 sec