Wednesday, 5 December 2012

Are you relying on hash keys being ordered?

tl;dr - if you rely on Perl 5, and plan to eventually upgrade to the upcoming 5.18, do this *now*:

  1. Install 5.17.6 (you do use perlbrew, right?);
  2. Try your modules and apps in it (you do have tests, right?);
  3. If  anything breaks, it's likely because you're relying on keys(), values() or each() being in some particular order. You really shouldn't, so go sort() your keys or something :)
  4.  If some CPAN module you depend on suddenly fails on 5.17.6, make sure to let the author know;
  5.  Spread the word!

On Hashes & Security


The Perl 5 core team has always put security as one of its top priorities. To put things under perspective, in late 2011, an algorithmic complexity remote denial of service attack (original paper, advisory, slides, video) was found on major language implementations like PHP, Ruby, Python, Java, even JavaScript. It's been fixed in Perl 5 since... 2003.

That was then. What about now?


Still thinking about security, Yves Orton pushed some important changes these past few weeks, changes that are going into perl 5.18.0. Among other things, to quote 7dc8663964, it:

  • Introduces multiple new hash functions to choose from at build time. This includes Murmur-32, SDBM, DJB2, SipHash, SuperFast, and an improved version of the original One-at-a-time.
  • Rips out the old HvREHASH mechanism and replaces it with a per-process random hash seed.

Optimizations aside, the ability to change hash functions easily is important because, if, for whatever reason, the active function is found vulnerable to an attack, you don't have to wait until the Perl Core Team (or your specific vendor/system) releases a fix: just recompile your perl setting another hash function as default.

The important bit, however, is the per-process random hash seed. Until now, perl was using a not-so-great hash seed, one that was set during compilation. All hashes would use this seed, and if a collision attack was detected it would trigger a rehash, where every item in the hash would have its hash value recalculated, with corresponding effects performance and memory. Of course, when too many collisions were found, the rehash would switch to a random seed instead.

Now, after this change, every process is guaranteed to use a random seed.

Hash randomization should make perl even more robust to complexity attacks, and with simpler code. But, as you may have predicted, there's a side effect to it: the order of hash keys changes more often than before.

Sweet! But, what does it mean to my code?


As it is stated in perlsec since version 5.8.1 (that one from 2003), Perl has never guaranteed any ordering of the hash keys, and in fact the ordering has already changed several times throughout its history. The problem, however, is that a lot of developers end up inadvertently relying on hashes being ordered, or rather in some random but constant order, simply because that particular order worked on their machine. Talk about a subtle bug!

This may not be your case, but you should check nonetheless. Andreas König, Father Chrysostomos and the rest of the P5P/CPANTesters gang have gone through the enormous effort of testing several major CPAN distributions for this and letting authors know whenever it failed a test while running on a patched version of perl, but they can only do so much, and there's *your* code to test, too.

You know, code your app runs, code that you haven't checked to CPAN.

Oddly enough, it looks like most of the found issues are on test cases themselves, tests that expect keys() to be in a particular order. Now, keys() is guaranteed only to return items in the same order as values() or each(), and even that is only true for the same process, so make sure you're not shooting yourself on the foot.

LIES! My code is perfect, you're the ones that broke Perl!


Well, not really. Like I said, it's a subtle bug, one that might be hitting your production code right now, but only on some very specific scenarios, and be very hard to reproduce and debug. If you don't trust me, there's a very simple experiment you can run on your system perl:

First, let's create a simple one liner that creates 15 key/value pairs, and print them on the screen:

   > perl -E 'local $,=q[, ]; $hash{$_} = $_ for 1..15; say keys %hash'
     6, 11, 3, 7, 9, 12, 2, 15, 14, 8, 1, 4, 13, 10, 5

You may have gotten a different order (did you?), but you'll probably get that same order no matter how many times you run it:

   > perl -E 'local $,=q[, ]; $hash{$_} = $_ for 1..15; say keys %hash'
     6, 11, 3, 7, 9, 12, 2, 15, 14, 8, 1, 4, 13, 10, 5
   > perl -E 'local $,=q[, ]; $hash{$_} = $_ for 1..15; say keys %hash'
     6, 11, 3, 7, 9, 12, 2, 15, 14, 8, 1, 4, 13, 10, 5
   > perl -E 'local $,=q[, ]; $hash{$_} = $_ for 1..15; say keys %hash'
     6, 11, 3, 7, 9, 12, 2, 15, 14, 8, 1, 4, 13, 10, 5
   > perl -E 'local $,=q[, ]; $hash{$_} = $_ for 1..15; say keys %hash'
     6, 11, 3, 7, 9, 12, 2, 15, 14, 8, 1, 4, 13, 10, 5
   > ...

What happens, however, if your code adds a 16th key and then, realizing its mistake, removes it right afterwards (highlighted code below)? There are still 15 keys, the very same 15 keys as before, so surely they'll be in the same order, right? Right? Wrong:

   > perl -E 'local $,=q[, ]; $hash{$_}=$_ for 1..15;
             $hash{16}=16; delete $hash{16}; say keys %hash'
     11, 7, 2, 1, 13, 6, 3, 9, 12, 14, 15, 8, 4, 10, 5


This can happen anywhere, like when reusing a hash variable:

    sub init { ( 1=>1, 2=>2, 3=>3, 4=>4, 5=>5 ) }

    my %hash = init();
    say "original: " . join ', ' => keys %hash;
    $hash{$_} = $_ for 6..100;

    %hash = init(); # restores original values
    say "original? " . join ', ' => keys %hash;


This is what I get on my good old 5.14.3:

    original: 4, 1, 3, 2, 5
    original? 2, 1, 3, 4, 5


As you can see, it's a real problem and it could be lurking in your code right now. What Yves' patch does is simply expose the issue more explicitly to you. This is a good thing, because, aside from the extra security protection, it will let you spot buggy code much easier. If you try that previous one-liner on 5.17.6, you'll get a different key order every time you run it:

   > perl -E 'local $,=q[, ]; $hash{$_} = $_ for 1..15; say keys %hash'
     1, 5, 15, 12, 6, 4, 10, 9, 3, 13, 7, 14, 11, 2, 8
   > perl -E 'local $,=q[, ]; $hash{$_} = $_ for 1..15; say keys %hash'
     5, 11, 7, 3, 15, 6, 12, 2, 13, 9, 8, 14, 10, 1, 4
   > perl -E 'local $,=q[, ]; $hash{$_} = $_ for 1..15; say keys %hash'
     2, 15, 14, 13, 5, 1, 9, 10, 3, 11, 6, 8, 12, 4, 7
   > perl -E 'local $,=q[, ]; $hash{$_} = $_ for 1..15; say keys %hash'
     8, 2, 14, 10, 1, 9, 4, 3, 6, 15, 5, 13, 7, 12, 11


Uh-oh... looks like my code is broken.


Not to worry, the fix is usually pretty easy! Look for the failing test and see if whatever is being tested calls keys(), values() or each() at some point. You'll likely want to sort() the results or change your code algorithm to something more deterministic.

I don't really have that many tests... What can I do?


Look for calls to keys(), values() or each() in your code, and make sure they are not relying on the order of the elements being returned. It is ok to do something like:

  my @keys   = keys %hash;
  my @values = values %hash;
  say "hash key $keys[3] is $values[3]";

because, as I said before, keys() and values() will always use the same order for the same process, whatever that order is. However, this is not ok:

  if ($keys[0] eq 'some_key') {
     ...
  }

simply because there's no way to guarantee the order of the list returned by keys(). The code above might have worked, however, if you always sorted the returned value, like so:

  my @keys = sort keys %hash;
 

Indirect usage


Sadly, your code is not safe just because you don't use those functions (or have them properly sorted). Sometimes you expect lists of values from external modules, and those lists might be affected by the change. So make sure you look for arrays that are populated by external functions and see if you rely on their order being a particular one. For example, you might have code like:

   my ($name, $age, $rate) = Some::Module->new->get_list( 'some_user' );
   
And, within Some::Module, you'll find the suspect:

  sub get_list {
    my ($self, $username) = @_;
    return values $self->{data}{$username};
  }

Make a failing test for it, push a fix, rinse and repeat :)

I hate this! Switch it back!


This is hardly going to happen. Remember: hash randomization is a good thing! Please take another look at the sections above and try to fix your code. If you need help, ask for it at the usual places, like your local mailing list or IRC - heck, even Facebook has a Perl group!

But if you really really really need the previous behavior, you can simply stick to 5.16, or try compiling perl defining PERL_HASH_FUNC_ONE_AT_A_TIME_OLD to simulate the old algorithm, but the entire rehashing mechanism is gone, so specifying your own PERL_HASH_SEED value is probably as close as you'll get :)

Many thanks to the nice folks at P5P for their continuous effort in keeping us safe!

4 comments:

  1. Nice clear post, garu.

    I'd just add that the reason so many tests break is that we need to test complex data structures whose order depends on the order of hash keys, but the order is unimportant in runtime code. Adding a sort into runtime code just so that the tests pass is wasteful.

    Fortunately, RJBS' excellent module Test::Deep allows us to test these complex structures with embedded unordered array refs using the bag() function. Just change eg:

    {foo => [1,2,3]}

    to

    {foo => bag(1,2,3)}

    and you're done


    ReplyDelete
  2. Does that imply that code like

    @merged{keys %hash} = values %hash;

    is now suspect? Because that would be *bad*...

    ReplyDelete
  3. I wouldn't thinks so; from perldoc -f values :

    The values are returned in an apparently random order. The actual random order is subject to change in future versions of Perl, but it is guaranteed to be the same order as either the "keys" or "each" function would produce on the same (unmodified) hash.

    ReplyDelete
  4. All nice, but I wouldn't agree that example function requires fix:

    sub get_list {
    my ($self, $username) = @_;
    return values $self->{data}{$username};
    }

    For me it's perfectly OK, as long as documentation states clearly that it returns elements in random order. Sorting requires some additional time, so I'd prefer have it done only when really needed.

    ReplyDelete