With the latest release of Core, I’ve had occasion to think about how our libraries differ from INRIA’s. One difference that’s been there from the very beginning is in the List module: INRIA’s list functions are not tail-recursive, and ours are. The obvious reason to prefer tail-recursive solutions is that they can handle lists of arbitrary length, but they have a downside as well. INRIA’s list implementation is not tail-recursive largely for performance reasons, as described by Xavier here. All that heap allocation you get by mapping and then reversing the list really costs.
The key tradeoff that Xavier points to is the choice between running fast on small-to-medium lists and running at all on long lists. A few different people have opined that lists don’t make sense for large datasets anyway, which argues in favor of the choice made in the standard library of using the non-tail-recursive version. But I’ve never bought this argument. It’s hard to predict what your code is going to be used for, and you don’t want landmines where your code simply blows up (rather than degrading gracefully in performance) when run on an unexpectedly large dataset. Using a non-tail-recursive list implementation leads to exactly such brittle behavior.
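To make the tradeoff concrete, here is a minimal sketch of the two classic implementations, written against the OCaml standard library only (Core not assumed). The naive version builds the result on the call stack and overflows on long lists; the tail-recursive version survives any length but pays for an extra reversed intermediate list.

```ocaml
(* Naive map: not tail-recursive, so each element consumes stack space.
   Raises Stack_overflow on sufficiently long lists. *)
let rec naive_map f = function
  | [] -> []
  | hd :: tl -> f hd :: naive_map f tl

(* Tail-recursive map: rev_map runs in constant stack space, at the cost
   of allocating and traversing an extra reversed list. *)
let tail_rec_map f l = List.rev (List.rev_map f l)

let () =
  (* Both agree on small inputs. *)
  assert (naive_map succ [1; 2; 3] = [2; 3; 4]);
  assert (tail_rec_map succ [1; 2; 3] = [2; 3; 4]);
  (* The tail-recursive version handles a million elements fine;
     naive_map on the same list would typically blow the stack. *)
  let big = List.init 1_000_000 (fun i -> i) in
  assert (List.length (tail_rec_map succ big) = 1_000_000)
```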
So, it occurred to me, can we have the best of both worlds? As Xavier points out, there are “magic” implementations that use the dreaded Obj.magic, but Obj.magic is notoriously difficult to reason about, and is really the same thing as hacking the compiler. Among other things, it leaves you with no guarantees when new versions of the compiler are released. ExtLib takes this approach, but it’s never been something we’ve been comfortable with.
But what if we write a version that detects when we’re dealing with a big list, and dynamically switches implementations? Here’s a simple version.
open Core.Std

let rec count_map ~f l ctr =
  match l with
  | [] -> []
  | hd :: tl ->
    f hd ::
    (if ctr < 5000 then count_map ~f tl (ctr + 1)
     else List.map ~f tl)

let map ~f l = count_map ~f l 0
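As a quick sanity check, here is the same hybrid rendered against the standard library alone (since Core.Std isn’t assumed here, List.rev of List.rev_map stands in for Core’s tail-recursive List.map): recurse on the stack for the first 5000 elements, then switch to the tail-recursive fallback.

```ocaml
(* Stdlib rendering of the hybrid: build the first 5000 results on the
   stack, then fall back to a tail-recursive map for the remainder. *)
let rec count_map ~f l ctr =
  match l with
  | [] -> []
  | hd :: tl ->
    f hd ::
    (if ctr < 5000 then count_map ~f tl (ctr + 1)
     else List.rev (List.rev_map f tl))

let map ~f l = count_map ~f l 0

let () =
  assert (map ~f:succ [1; 2; 3] = [2; 3; 4]);
  (* A million elements no longer blows the stack. *)
  assert (List.length (map ~f:succ (List.init 1_000_000 (fun i -> i)))
          = 1_000_000)
```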
This works a lot better. It’s a little bit slower than the standard List.map for small lists, and about the same as the tail-recursive List.map for large lists. But we can do better still. There are two more optimizations I played with. The first is to do a little loop unrolling on the recursion, and the second is to deal with the large-list case by going through arrays, as suggested in a post by Christophe Troestler. Here’s the resulting code:
open Core.Std

let list_array_map ~f l =
  Array.to_list (Array.map ~f (Array.of_list l))

let rec count_map ~f l ctr =
  match l with
  | [] -> []
  | [x] -> [f x]
  | [x; y] -> [f x; f y]
  | [x; y; z] -> [f x; f y; f z]
  | x :: y :: z :: w :: tl ->
    f x :: f y :: f z :: f w ::
    (if ctr > 500 then list_array_map ~f tl
     else count_map ~f tl (ctr + 1))

let map ~f l = count_map ~f l 0
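The same sanity check works for the unrolled variant, again phrased against the standard library rather than Core.Std so it runs standalone (note that stdlib Array.map takes its function positionally). Up to three elements are handled directly, four elements are processed per recursive call, and after roughly 500 iterations (about 2000 elements) the remainder goes through an array.

```ocaml
(* Stdlib version of the unrolled hybrid: map via an array for the tail
   of a long list, which is tail-safe and allocation-friendly. *)
let list_array_map ~f l =
  Array.to_list (Array.map f (Array.of_list l))

let rec count_map ~f l ctr =
  match l with
  | [] -> []
  | [x] -> [f x]
  | [x; y] -> [f x; f y]
  | [x; y; z] -> [f x; f y; f z]
  | x :: y :: z :: w :: tl ->
    f x :: f y :: f z :: f w ::
    (if ctr > 500 then list_array_map ~f tl
     else count_map ~f tl (ctr + 1))

let map ~f l = count_map ~f l 0

let () =
  (* Agrees with a known-good tail-recursive map on a long list. *)
  let l = List.init 100_000 (fun i -> i) in
  assert (map ~f:succ l = List.rev (List.rev_map succ l))
```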
This implementation does better still. It’s actually faster than the standard implementation on short lists, and only a little slower on long lists. Here are some very rough benchmarks (done on an x86-64 box). The mean and standard deviation shown are of the ratio of each implementation’s running time to that of the standard library’s. “core” is the implementation currently in Core, and “fast” is the above implementation.
## list length 0 ##
core: mean 1.648838, nstdev 0.043502
fast: mean 0.717259, nstdev 0.043177
## list length 1 ##
core: mean 2.113085, nstdev 0.075585
fast: mean 0.596140, nstdev 0.049489
## list length 2 ##
core: mean 1.989603, nstdev 0.044707
fast: mean 0.636450, nstdev 0.003376
## list length 5 ##
core: mean 2.003528, nstdev 0.043638
fast: mean 0.821950, nstdev 0.024802
## list length 10 ##
core: mean 1.428904, nstdev 0.016536
fast: mean 0.729491, nstdev 0.018766
## list length 20 ##
core: mean 1.443628, nstdev 0.062703
fast: mean 0.743741, nstdev 0.018308
## list length 100 ##
core: mean 1.301089, nstdev 0.019097
fast: mean 0.898968, nstdev 0.017839
## list length 1000 ##
core: mean 1.719725, nstdev 0.025758
fast: mean 0.950799, nstdev 0.018624
## list length 25000 ##
core: mean 1.721275, nstdev 0.044541
fast: mean 1.188690, nstdev 0.031437
The performance improvement seems worth the ugly code, given how common map is. I suspect some hack like this will make its way into a future version of Core.