This article deals with some little-known dark corners of the OCaml compiler and how to work around them to produce more efficient code. The bottom line is that you should avoid partial applications and instead eta-expand your functions as much as possible. To understand why, let’s compare the performance of the following definitions:
let f a b c = a + b + c
let g1 a l = List.fold_left ~f:(f a) ~init:0 l
let g2 a l = List.fold_left ~f:(fun b -> f a b) ~init:0 l
let g3 a l = List.fold_left ~f:(fun b c -> f a b c) ~init:0 l
Those three versions are all semantically equivalent (as long as f does not perform mutations between arguments): they all return the sum of the elements in l, plus a for each element. Yet, they have quite different execution times (n is the length of the list l; please ignore the last 3 columns for the moment):
n | g1 | g2 | g3 | g4 | g5 | g6 |
---|---|---|---|---|---|---|
0 | 0.18 | 0.11 | 0.12 | 0.14 | 0.12 | 0.10 |
1 | 0.39 | 0.43 | 0.22 | 0.25 | 0.35 | 0.24 |
5 | 1.16 | 1.78 | 0.64 | 0.62 | 1.54 | 0.23 |
10 | 2.16 | 3.49 | 1.11 | 1.14 | 2.73 | 0.36 |
100 | 20.32 | 33.69 | 10.19 | 10.25 | 24.45 | 2.85 |
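For reference, here is a self-contained version of these definitions that you can check for yourself (using the stdlib’s ListLabels module, which provides the same labeled interface as the List module used above):

```ocaml
(* Self-contained versions of f, g1, g2 and g3.  ListLabels stands in
   for the labeled List module used in the article. *)
let f a b c = a + b + c
let g1 a l = ListLabels.fold_left ~f:(f a) ~init:0 l
let g2 a l = ListLabels.fold_left ~f:(fun b -> f a b) ~init:0 l
let g3 a l = ListLabels.fold_left ~f:(fun b c -> f a b c) ~init:0 l

let () =
  (* All three agree: sum of the elements, plus a per element. *)
  let l = [1; 2; 3] in
  assert (g1 10 l = 36);  (* 0+10+1, then +10+2, then +10+3 *)
  assert (g2 10 l = 36);
  assert (g3 10 l = 36)
```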
So, what is going on here? The answer is in the way the OCaml compiler optimizes calls to functions with multiple arguments. The lambda-calculus interpretation of f is of a function that takes one argument a and returns a closure (i.e. a function plus an environment containing the value of a), which in turn takes one argument b, etc. Applying f to three arguments would then amount to applying them one by one, creating intermediate closures in the process.
For efficiency reasons, most functional languages compile f as a function of arity 3, i.e. as a function optimized to take 3 arguments all at once. Most of the time, functions are called with all their arguments anyway, so this optimization is a good thing. In essence, this is similar (but not quite equivalent) to the following uncurried version of f:
let f_non_curried (a, b, c) = a + b + c
Obviously, the curried version of f is strictly more expressive since it allows partial applications, and it avoids the allocation of the tuple (a,b,c), using three CPU registers or the stack instead. Still, there needs to be a mechanism to deal with applications where the number of arguments does not match: inside List.fold_left, there is a call to the argument ~f with two arguments (the list element and the accumulated result). If the arity of ~f is not 2, some generic functions caml_applyN and caml_curryN come into play to decompose the application into simpler steps, potentially creating the intermediate closures we talked about. You might have seen those functions show up in the output of gprof, but not many people are fully aware of this internal mechanism (if someone knows of a good reference on this subject, please report it, as I could not find any). If you are curious about their definitions, you can look at them in the (undocumented) output of ocamlopt -dcmm.
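To build intuition, here is an illustrative OCaml-level model of what caml_curry3 amounts to. The real caml_curry3 is compiler-generated code visible only at the Cmm level; this sketch merely mimics its behavior, and the (); no-ops force genuine arity-1 functions at each step so that every application really does allocate an intermediate closure:

```ocaml
(* Sketch of caml_curry3's behavior: when a 3-argument function is
   applied to fewer arguments than its arity, each step allocates a
   closure waiting for the next argument. *)
let curry3_model f =
  fun a -> (); fun b -> (); fun c -> f a b c

let () =
  let add3 a b c = a + b + c in
  let waiting_for_b = curry3_model add3 1 in  (* intermediate closure *)
  let waiting_for_c = waiting_for_b 2 in      (* another closure      *)
  assert (waiting_for_c 3 = 6)
```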
Let’s go back to our example and consider g3 first. One closure (the argument to List.fold_left) is initially allocated, but since this function is of arity 2 (b and c), all the calls from List.fold_left are direct and require no allocation.
On the other hand, g1 also allocates a closure initially, but it is of a different nature: it is a call to caml_curry3 with f and a as arguments. This function creates a closure of arity 1, simply waiting for the next argument, so that when List.fold_left calls it, there is an arity mismatch. We then end up passing the first argument b on its own, creating a new closure, and finally passing in the second argument c. In the end, we allocate one extra closure per element in the list (and execute more instructions), which is why g1 is twice as slow as g3.
g2 is even worse: the initial allocation creates a closure of arity 1 (waiting for the argument b). There is a first mismatch when List.fold_left calls it, as with g1, so we go into slow mode and pass in b only. But then there is a second mismatch inside the code of the local function when we call f (of arity 3) with only the two arguments a and b. Arguments are passed in one by one and we create two intermediate closures in the process. Finally, c, which was put aside by List.fold_left, is applied in a final step. In the end, g2 allocates two closures per element in the list and is three times as slow as g3.
Those interpretations can be checked by measuring the number of allocated words with (Gc.stat ()).Gc.minor_words before and after a function call (I thank Stephen Weeks for this idea). The following results confirm how much GC pressure is created by g1 and g2 compared to g3 (a closure typically takes 4 or 5 words in our example):
g1 | g2 | g3 | g4 | g5 | g6 |
---|---|---|---|---|---|
5 + 5n | 4 + 10n | 5 | 5 | 5 + 5n | 5 |
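The measurement itself can be done with a small helper like the following (a sketch: exact word counts depend on the compiler version and differ between bytecode and native code, so treat the numbers as indicative only):

```ocaml
(* Measure how many minor-heap words a thunk allocates, in the spirit
   of the Gc.minor_words measurement described above. *)
let words_allocated thunk =
  let before = (Gc.stat ()).Gc.minor_words in
  ignore (thunk ());
  let after = (Gc.stat ()).Gc.minor_words in
  after -. before

let () =
  let f a b c = a + b + c in
  let g3 a l = ListLabels.fold_left ~f:(fun b c -> f a b c) ~init:0 l in
  (* Allocation counters never go backwards. *)
  assert (words_allocated (fun () -> g3 1 [1; 2; 3]) >= 0.)
```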
It is worth noting that the actual cost does not come so much from the allocation itself (it is just a few extra instructions, after all) but from the GC: reclaiming the memory is quite expensive, and this cost is amortized over all allocations. Reducing the total amount of allocation will automatically reduce the GC pressure. There is an old page written by Pierre Weis that basically says that a block allocation is roughly equivalent to 100 additions on average (taking into account the cost of reclaiming the memory by the GC). Those results are outdated, so they probably do not apply anymore, but they give the right order of magnitude.
A different style
If we really wanted to use a partial application as in g1, we would need to tell the compiler that f is a function that takes one argument a and returns a function taking two arguments. Ideally, we would like to write one of the following:
let f2 a = fun b c -> a + b + c
let f2 = fun a -> fun b c -> a + b + c
Sadly, the OCaml compiler still treats those definitions as a function of arity 3. We have to write the following cumbersome version to make it work:
let f2 a = (); fun b c -> a + b + c
let g4 a l = List.fold_left ~f:(f2 a) ~init:0 l
(or any other no-op operation instead of ()). This does indeed solve the problem: if you look at the timings above, you will see that g4 is about as fast as g3, and for good reason, since they now perform essentially the same operations and allocations.
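Here is a self-contained version of the trick for reference (ListLabels stands in for the labeled List module used above):

```ocaml
(* The (); makes f2 an arity-1 function returning an arity-2 function,
   so the closure produced by (f2 a) matches the arity fold_left
   expects and all calls stay direct. *)
let f2 a = (); fun b c -> a + b + c
let g4 a l = ListLabels.fold_left ~f:(f2 a) ~init:0 l

let () =
  assert (g4 10 [1; 2; 3] = 36);
  assert (f2 1 2 3 = 6)  (* full application still works as usual *)
```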
But what if we accidentally use f2 combined with the eta-expanded style of g3?
let g5 a l = List.fold_left ~f:(fun b c -> f2 a b c) ~init:0 l
As you can see in the timings, things get much worse again, since for every list element, a closure will be created from the application of f2 to a.
Whether to go with g3 or g4 is a matter of style: if you know the arguments are always going to be decomposed in the same way, you can use g4. But if you use your function sometimes partially, sometimes with all its arguments, or with different levels of partial application, you have to resort to g3. Since this style is also optimal in every scenario, it is probably the right choice in most situations.
As a side note, there is another major optimization down the road. Instead of calling the generic List.fold_left (or any other such function), we could define a local recursive function that calls f directly:
let g6 a l =
  let rec fold_left accu = function
    | [] -> accu
    | x :: r -> fold_left (f a accu x) r
  in
  fold_left 0 l
As you can see from the timings, this gives another factor of 3 in this particular case (mostly because of inlining, more efficient loops and direct branching instead of function calls). It might be possible to perform this syntactic transformation automatically through a Camlp4 extension, but we have not tried it yet (and one should consider the impact on code size and instruction caching in the CPU).
Inlining
Another nasty effect of partial applications is that they prevent inlining. It is a common style to define small generic functions and make specialized versions through partial application. Here is a somewhat extreme example:
let mult_and_add m c x = m * x + c
let double_and_add_v1 = mult_and_add 2
let double_and_add_v2 c x = mult_and_add 2 c x
If you look at the assembly code generated for those functions (with ocamlopt -S), you will see that the second version compiles to only two instructions, does not require any allocation, and is likely to be inlined further at call sites. The first version, on the other hand, creates one closure from the partial application at startup, another one on every call (!), and is made up of many more instructions. I ran some benchmarks and found a speedup factor of about 15 between the two versions.
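Both specializations compute the same function, of course; only the compiled code differs. A quick sanity check (per the discussion above, v1 goes through a closure built by partial application, while v2 is eta-expanded into a plain two-argument function that the compiler can inline):

```ocaml
(* The two versions agree on all inputs; their difference is purely
   in the generated code, not in the results. *)
let mult_and_add m c x = m * x + c
let double_and_add_v1 = mult_and_add 2
let double_and_add_v2 c x = mult_and_add 2 c x

let () =
  assert (double_and_add_v1 3 5 = 13);  (* 2 * 5 + 3 *)
  assert (double_and_add_v2 3 5 = 13)
```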
Yet another trick
This is a somewhat related trick that applies whenever you have the following (quite common) scheme:
List.iter ~f:(fun x -> do_something ... x ...) l
(in most cases with List.iter, List.map, List.fold_left, etc.). Every time this code gets executed, a new closure gets allocated for the local function before the call to List.iter. If the argument list l is going to be empty a large fraction of the time, the closure will be useless most of the time. This can get quite expensive in the long run and can be avoided with the following trick:
if l <> [] then List.iter ~f:(fun x -> do_something ... x ...) l
Similarly, if you know that l is likely to contain only 0 or 1 element, you can specialize even further:
match l with
| [] -> ()
| [x] -> do_something ... x ...
| _ -> List.iter ~f:(fun x -> do_something ... x ...) l
You will save the closure allocation in most cases and the function call in the 1-element case is a good candidate for inlining.
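Here is a concrete instance of the guard (sum_into is an illustrative name, not a real API): the closure captures the free variable acc, so without the guard it would be allocated on every call, even when the list is empty.

```ocaml
(* The closure (fun x -> acc := !acc + x) captures acc and must be
   heap-allocated -- but only on the non-empty path, thanks to the
   guard, since the ~f argument is only evaluated in the then-branch. *)
let sum_into acc l =
  if l <> [] then ListLabels.iter ~f:(fun x -> acc := !acc + x) l

let () =
  let acc = ref 0 in
  sum_into acc [1; 2; 3];
  assert (!acc = 6);
  sum_into acc [];  (* empty list: no closure allocated, no change *)
  assert (!acc = 6)
```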
Note that this trick does not apply when the local function does not have any free variables (i.e. when it does not reference any value defined outside its own scope). In that case, no closure needs to be allocated at run time: the function is fully allocated at compile time.
Does it really matter?
It really depends on what kind of code you are writing. If your application is very small, runs for a short period of time or does not allocate enough to ever trigger the GC, then you probably don’t care. But for applications with high-performance requirements, or for central libraries that are used everywhere, it can make a big difference. Here at Jane Street, we achieved significant speedups with those simple transformations on one of our biggest and most speed-sensitive applications.
I have to admit that eta-expanding all the functions in my code is intellectually less satisfying than using higher-order code and partial applications. Readability can also suffer, although there are situations where it benefits from the change, by making it more explicit what the different arguments are.