Almost every programming language uses 64-bit integers on typical modern Intel machines. OCaml uses a special 63-bit representation. How does this affect OCaml?
OCaml int memory representation
Most of OCaml’s types are represented in memory as a header followed by data. The header is a 64-bit integer containing the length of the data and a tag. The tag is a rough classification of the type. The only OCaml types that differ from this are ints and, in some cases, floats.
Floats normally have a header and data, the data being the value of the float itself. This representation is called “boxed”. If a record field has type float, the record’s data will actually contain a pointer to the float data. The only exceptions are records containing only floats, and float arrays, whose data contain the float values themselves instead of pointers. This representation is called “unboxed”.
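The boxed/unboxed distinction can be poked at with the unsafe Obj module. The following is only a rough illustration, assuming a standard (flat-float-array) build; the `point` record type is made up for the example:

```ocaml
(* Inspecting representations with Obj: a lone float is a boxed block,
   while a float array and an all-float record use the special unboxed
   Double_array_tag layout. *)
type point = { x : float; y : float }

let () =
  Printf.printf "3.14 is boxed:             %b\n" (Obj.is_block (Obj.repr 3.14));
  Printf.printf "float array is unboxed:    %b\n"
    (Obj.tag (Obj.repr [| 1.0; 2.0 |]) = Obj.double_array_tag);
  Printf.printf "all-float record likewise: %b\n"
    (Obj.tag (Obj.repr { x = 1.0; y = 2.0 }) = Obj.double_array_tag)
```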
Values of type int are never stored as header plus data (boxed). An int `x` is stored as `(x << 1) | 1`, where `<<` is left shift and `|` is bitwise or, so its least significant bit is always set. Pointers are word-aligned, so they never have this bit set, which is how ints and pointers are distinguished. It is assumed that much of a typical program’s data consists of integers, so this is done to significantly improve performance:
- there’s no need to dereference a pointer when getting an int
- no memory allocation is needed when creating ints
- less work for the garbage collector
- less memory fragmentation
- no memory is needed for int headers
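One directly visible consequence of the tag bit is that OCaml’s native int is only 63 bits wide on a 64-bit platform. A quick check (`Sys.int_size` is available from OCaml 4.03 on):

```ocaml
(* The tag bit costs one bit of range: max_int is 2^62 - 1, not 2^63 - 1. *)
let () =
  Printf.printf "Sys.int_size = %d\n" Sys.int_size;  (* 63 on x86-64 *)
  Printf.printf "max_int      = %d\n" max_int         (* 4611686018427387903 *)
```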
Distinguishing whether a value is an int or a pointer is as simple as testing `x & 1`, so this feature doesn’t slow down the garbage collector, polymorphic hash, polymorphic compare, or anything else that structurally inspects data. Note that this doesn’t apply to the types `int32` and `int64`, which are always boxed.
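To see the difference in practice, here is a small illustration with the unsafe Obj module: a plain int is an immediate (tagged) value, while int32 and int64 values are blocks on the heap.

```ocaml
let () =
  Printf.printf "1  is immediate: %b\n" (Obj.is_int (Obj.repr 1));     (* true *)
  Printf.printf "1l is boxed:     %b\n" (Obj.is_block (Obj.repr 1l));  (* true *)
  Printf.printf "1L is boxed:     %b\n" (Obj.is_block (Obj.repr 1L))   (* true *)
```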
Penalty
Having the extra bit comes at a price: arithmetic operations become more complicated. For example:

- `x + y` is translated to the CPU instructions for `x + y - 1`
- `x * y` is translated to the CPU instructions for `(x >> 1) * (y - 1) + 1`
- `x / y` is translated to the CPU instructions for `(((x >> 1) / (y >> 1)) << 1) + 1`
- `x lsl y` is translated to the CPU instructions for `((x - 1) << (y >> 1)) + 1`
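To see why these translations are correct, one can simulate the tag in plain OCaml. The `tag` function below is just a helper for the example (the real tagging is done by the compiler), and OCaml’s `lsl`/`asr` stand in for the C-style shifts above:

```ocaml
(* tag x = (x lsl 1) lor 1, i.e. 2x + 1 *)
let tag x = (x lsl 1) lor 1

let () =
  let x, y = 5, 7 in
  (* addition: tag x + tag y - 1 = 2x + 2y + 1 = tag (x + y) *)
  assert (tag (x + y) = tag x + tag y - 1);
  (* multiplication: (tag x asr 1) * (tag y - 1) + 1 = x * 2y + 1 = tag (x * y) *)
  assert (tag (x * y) = (tag x asr 1) * (tag y - 1) + 1);
  (* shift: ((tag x - 1) lsl (tag y asr 1)) + 1 = (2x lsl y) + 1 = tag (x lsl y) *)
  assert (tag (x lsl y) = ((tag x - 1) lsl (tag y asr 1)) + 1);
  print_endline "tagged-arithmetic identities hold"
```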
Sometimes this penalty is small or nonexistent. For instance, there is no need to fix the bit in `x + y - z`, because the tag bits cancel: (2x+1) + (2y+1) - (2z+1) = 2(x+y-z) + 1. Only one bit fix is needed for all five additions in `x + y + z + w + u + v`, since the six tagged operands add up to twice the sum plus 6, and a single subtraction of 5 restores the tag.
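A quick check of the chained-addition claim, with the same hand-rolled `tag` helper:

```ocaml
let tag x = (x lsl 1) lor 1

let () =
  let x, y, z, w, u, v = 1, 2, 3, 4, 5, 6 in
  (* a single "- 5" fixes the tag for the whole chain of additions *)
  assert (tag (x + y + z + w + u + v)
          = tag x + tag y + tag z + tag w + tag u + tag v - 5)
```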
Another help is the Intel CPU instruction `LEA`, which can compute the sum of three integers in a single instruction, such as `x + y - 1`. Unfortunately, `LEA` became very slow in recent generations of CPUs, and Intel gives no indication that this will change.
This benchmark (test.ml) tries to estimate the performance difference. The results from a Sandy Bridge machine show roughly a 2x slowdown in arithmetic operations. The generated assembly can be examined by compiling with “ocamlopt -S test.ml”.
| operation       | 63-bit (ns) | 64-bit (ns) | slowdown |
|-----------------|-------------|-------------|----------|
| add independent | 0.327588    | 0.121502    | 2.696156 |
| add dependent   | 0.328160    | 0.169375    | 1.937477 |
| mul independent | 0.614824    | 0.328060    | 1.874120 |
| mul dependent   | 1.094343    | 0.328872    | 3.327565 |
| lsl independent | 0.359828    | 0.166088    | 2.166489 |
| lsl dependent   | 0.762251    | 0.177468    | 4.295151 |
| xor independent | 0.249350    | 0.122900    | 2.028886 |
| xor dependent   | 0.404255    | 0.170380    | 2.372668 |
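For readers who want to try something similar, here is a minimal sketch in the same spirit (this is not the original test.ml, and the numbers it prints include loop overhead):

```ocaml
(* Time a loop of dependent adds on native ints. *)
let () =
  let n = 100_000_000 in
  let acc = ref 0 in
  let t0 = Sys.time () in
  for i = 1 to n do
    acc := !acc + i        (* each add depends on the previous result *)
  done;
  let t1 = Sys.time () in
  Printf.printf "acc=%d  %.3f ns/add\n"
    !acc ((t1 -. t0) *. 1e9 /. float_of_int n)
```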
Agner Fog’s instruction tables show that the difference is even bigger with later generations of CPUs. For instance, Haswell can do four integer adds per cycle versus one `LEA`.
Conclusion
The benefits of unboxed ints are substantial. On the other hand, arithmetic operations are significantly slower. How much do slower arithmetic operations affect an average program? Could there be a solution that keeps ints unboxed but still has fast arithmetic?