XOR Media

Natural Sort Order with Zero Padding

Which of the following lists is sorted in the most “natural” fashion?

A:
    Elementary Season 1 Episode 1
    Elementary Season 1 Episode 10
    Elementary Season 1 Episode 11
    Elementary Season 1 Episode 12
    Elementary Season 1 Episode 13
    Elementary Season 1 Episode 2
    Elementary Season 1 Episode 3
    Elementary Season 1 Episode 4
    Elementary Season 1 Episode 5
    Elementary Season 1 Episode 6
    Elementary Season 1 Episode 7
    Elementary Season 1 Episode 8
    Elementary Season 1 Episode 9
B:
    Elementary Season 1 Episode 1
    Elementary Season 1 Episode 2
    Elementary Season 1 Episode 3
    Elementary Season 1 Episode 4
    Elementary Season 1 Episode 5
    Elementary Season 1 Episode 6
    Elementary Season 1 Episode 7
    Elementary Season 1 Episode 8
    Elementary Season 1 Episode 9
    Elementary Season 1 Episode 10
    Elementary Season 1 Episode 11
    Elementary Season 1 Episode 12
    Elementary Season 1 Episode 13

The Problem

While the former is technically sorted, it’s ordering is mathematical, only taking in to account the character values at each index in turn. The latter which we’ll refer to as human ordering where higher level logic is applied to realized that “10” comes after “9” and not “1.” Note that we’re not talking about “The” or “A,” removal which is another interesting topic, we’re just focusing on numbers in strings here.

Back on track… The simple “trick” to get a mathematical sorting algorithm to result in the natural sort order is to zero pad the numbers in the strings before handing them off to sorting.

We’ll start with an example of zero padded strings so you can see what we’re after. If you run the following strings through a sorting algorithm they’ll come out in the expected order.

Elementary Season 0001 Episode 0001
Elementary Season 0001 Episode 0010
Elementary Season 0001 Episode 0002
Elementary Season 0001 Episode 0011

The Code

I’ve looked at several ways to do this using python and after a bit of somewhat scientific benchmarking landed on the use of SRE_Pattern.sub as the best balance of performance and simplicity.

from sys import maxint
import re

# optional '-' to support negative numbers
_num_re = re.compile(r'-?\d+')
# number of chars in the largest possible int
_maxint_digits = len(str(maxint))
# format for zero padding positive integers
_zero_pad_int_fmt = '{0:0' + str(_maxint_digits) + 'd}'
# / is 0 - 1, so that negative numbers will come before positive
_zero_pad_neg_int_fmt = '/{0:0' + str(_maxint_digits) + 'd}'


def _zero_pad(match):
    n = int(match.group(0))
    # if n is negative, we'll use the negative format and flip the number using
    # maxint so that -2 comes before -1, ...
    return _zero_pad_int_fmt.format(n) \
        if n > -1 else _zero_pad_neg_int_fmt.format(n + maxint)

def zero_pad_numbers(s):
    return _num_re.sub(_zero_pad, s)

To use zero_pad_numbers to sort we just provide it as the key function to sort/sorted. This transforms each element for the purpose of ordering them, but doesn’t modify the elements returned so they’re in the preferred ordering, but remain in their original form.

unordered = ['file12.txt', 'file3.txt', 'file1.txt', ...]
ordered = sorted(unordered, key=zero_pad_numbers)
# ordered
#    ['file1.txt', 'file3.txt', 'file12.txt']

What About Databases

It’s common to need to order things in this manner when they’re retrieved from a database. In most situations it’s not reasonable to pull all of data back in to the web server for sorting, nor is it practical to try and do the zero padding in the database/query.

The best route to take here is to add an extra column for sorting, i.e. name_sorted. You can then hook in to the save method of your object and compute the sorted version of name. This way you only incur the cost of zero padding once rather than every time the data is retrieved.