When GADTs (Generalized Algebraic Data Types) landed in OCaml, I wasn’t particularly happy about it. I assumed that it was the kind of nonsense you get when you let compiler writers design your programming language.
Which is to say that the standard GADT examples all seem to be about the kinds of things that compiler writers do, like embed domain-specific languages or build typed abstract-syntax trees. But it didn’t seem particularly relevant for the kind of systems programming that I think about.
But it became apparent pretty quickly that I was wrong. In particular, since GADTs have landed, at Jane Street we’ve found lots of examples where GADTs are important for performance, of all things. The theme in these examples is that GADTs enable you to tweak your memory representations in ways that would otherwise be painful or impossible to do safely in OCaml.
The Problem of Polymorphism
I’d like to walk through a simple example that illustrates this aspect of GADTs,
but first, a few words about OCaml’s memory representation. OCaml’s polymorphism
is in an important way backed on that memory representation. In particular,
consider a simple polymorphic function like List.iter
, which has the following
type:
val iter: 'a list -> f:('a -> unit) -> unit
The polymorphic type tells you that List.iter
can operate on lists of any
type, and in OCaml, this is achieved with a single compiled version of the code.
This is possible because the memory representation of the elements of a list are
uniform: you can always refer to an OCaml value in a single word, either as a
pointer to a heap-allocated value, or as an immediate that fits inside that
word.
That means that some OCaml datatypes are less efficient space-wise than you might imagine. Arrays, for example, take the same amount of space per element whether those elements are bytes, 32-bit ints, or 64-bit ints. (There’s actually some special magic in the compiler for float arrays, though this is probably more trouble than it’s worth, as described by Alain Frisch here. But let’s ignore float arrays for now.)
OCaml does have a tighter representations for byte arrays, called bytes
. But
it’s a completely different type, and so building a general purpose data
structure that uses bytes
when it would make sense, and ordinary arrays
otherwise, is a little awkward.
Controlling memory representation without GADTs
Let’s see what happens if we try to design (without GADTs) an array type that
sometimes uses the general array representation and sometimes uses bytes
.
You could imagine representing such a value using an ordinary variant.
type 'a t = | Array of 'a array
| Bytes of bytes
We could then implement each of the operations we want on our new array type, implementing each operation differently depending on the particular representation. Let’s see what happens if we just take this idea and run with it, implementing all the required functions in the most straightforward way.
> module Compact_array = struct
type 'a t = | Array of 'a array
| Bytes of bytes
let of_bytes x : char t = Bytes x
let of_array x = Array x
let length = function
| Array a -> Array.length a
| Bytes s -> Bytes.length s
let get t i =
match t with
| Array a -> a.(i)
| Bytes s -> s.[i]
let set t i v =
match t with
| Array a -> a.(i) <- v
| Bytes s -> s.[i] <- v
end;;
module Compact_array :
sig
type 'a t = Array of 'a array | Bytes of bytes
val of_bytes : bytes -> char t
val of_array : 'a array -> 'a t
val length : 'a t -> int
val get : char t -> int -> char
val set : char t -> int -> char -> unit
end
This seems pretty good at first glance, but the inferred types aren’t quite what
we want. In particular, get
and set
only work with Compact_array
s
containing characters. If you think about how type inference works, it’s not
really all that surprising. If you think about the code for get
:
let get t i =
match t with
| Array a -> Array.get a i
| String s -> String.get s i
The OCaml compiler is looking for a single type to assign to the return value
for all the cases of the match. Given that String.get
always returns a char
,
then Compact_array.get
will be restricted to only returning a char
.
One way to work around this problem is to essentially implement what we want as a poor-man’s object. Here, we just write the code separately for the different cases, and stuff those functions into a record full of closures. Here’s how that looks.
> module Compact_array = struct
type 'a t = { len: unit -> int
; get: int -> 'a
; set: int -> 'a -> unit
}
let of_string s =
{ len = (fun () -> String.length s)
; get = (fun i -> String.get s i)
; set = (fun i x -> String.set s i x)
}
let of_array a =
{ len = (fun () -> Array.length a)
; get = (fun i -> Array.get a i)
; set = (fun i x -> Array.set a i x)
}
let length t = t.len ()
let get t i = t.get i
let set t i x = t.set i x
end;;
module Compact_array :
sig
type 'a t = {
len : unit -> int;
get : int -> 'a;
set : int -> 'a -> unit;
}
val of_string : bytes -> char t
val of_array : 'a array -> 'a t
val length : 'a t -> int
val get : 'a t -> int -> 'a
val set : 'a t -> int -> 'a -> unit
end
This more or less solves the problem, but it’s still not really the memory
representation we want. In particular, we have to allocate three closures for
each Compact_array.t
, and this number of closures will only go up as we add
more functions whose behavior depends on the underlying array.
GADTs to the rescue
Let’s go back to our failed variant-based implementation, but rewrite it using the GADT syntax. Note that we’re not trying to change the types at all this time, just rewriting the same type we had before in the language of GADTs.
type 'a t = | Array : 'a array -> 'a t
| Bytes : bytes -> 'a t
The syntax of this declaration suggests thinking about variant constructor like
Array
or Bytes
as functions from the constructor arguments to the type of
the resulting value, with the thing to the right of the :
roughly
corresponding to the type signature of the constructor.
Note that for the Array
constructor, the type value of 'a
depends on the
type of the argument:
> Array [|1;2;3|];;
- : int t = Array [|1; 2; 3|]
> Array [|"one";"two";"three"|];;
- : bytes t = Array [|"one"; "two"; "three"|]
But for the Bytes
constructor, the type 'a
in the type is still free.
> Bytes "foo";;
- : 'a t = Bytes "foo"
This is really the problematic case, because we’d like for Bytes "foo"
for the
parameter 'a
to by char
, since in the Bytes
case, that’s what the element
type of our array is.
Because GADTs give us the ability to specify the type on the right-hand side of the arrow, we can get that.
type 'a t = | Array : 'a array -> 'a t
| Bytes : bytes -> char t
Now, the Bytes constructor behaves as we’d like it too.
> Bytes "foo";;
- : char t = Bytes "foo"
Now let’s see what happens when we try to write the length function.
> let length t =
match t with
| Bytes b -> Bytes.length b
| Array a -> Array.length a
;;
val length : char t -> int = <fun>
Disappointingly, we’re again stuck with a function that doesn’t have the right
type. In particular, the compiler has decided that this function can only
operate on char t
, when we want it to work for arrays of any type.
But the problem now is that type inference in the presence of GADTs is difficult, and the compiler needs a little help. Roughly speaking, without some hints, OCaml’s type system will try to identify all types as having a single value within a given function. But in this case, we need a type variable which might have different values in different branches of a match statement.
We can do this by creating a locally-abstract type el
to represent the type
parameter of t
(and the element type), and annotating t
accordingly.
> let length (type el) (t:el t) =
match t with
| Bytes b -> Bytes.length b
| Array a -> Array.length a
;;
val length : 'a t -> int = <fun>
Now we see that we get the right type. We can push this approach through to get a complete implementation.
> module Compact_array = struct
type 'a t = | Array : 'a array -> 'a t
| Bytes : bytes -> char t
let of_bytes x = Bytes x
let of_array x = Array x
let length (type el) (t:el t) =
match t with
| Array a -> Array.length a
| Bytes s -> Bytes.length s
let get (type el) (t:el t) i : el =
match t with
| Array a -> Array.get a i
| Bytes s -> Bytes.get s i
let set (type el) (t:el t) i (v:el) =
match t with
| Array a -> Array.set a i v
| Bytes s -> Bytes.set s i v
end;;
module Compact_array :
sig
type 'a t = Array : 'a array -> 'a t | Bytes : bytes -> char t
val of_bytes : bytes -> char t
val of_array : 'a array -> 'a t
val length : 'a t -> int
val get : 'a t -> int -> 'a
val set : 'a t -> int -> 'a -> unit
end
As I said at the beginning, this is really just an example of the more general theme. GADTs are about more than clever typed interpreters; they’re a powerful mechanism for building easy to use abstractions that give you more precise control of your memory representation. And getting the right memory representation is often critical for building high performance applications.