We’re once again at the end of our internship season, and it’s time to do our annual review of what the interns achieved while they were here.

Jane Street is a big enough place and the internship is wide-ranging enough that it’s impossible to really cover the full spread of work that interns do here. So, instead, we’ve picked a few interesting projects to focus on, just to give you a sense of the possibilities.

Here are the ones I’m going to discuss this year:

  • Arya Maheshwari wrote a first version of Camels, a Polars-like dataframe library for OCaml.

  • Arvin Ding designed a variant of our bin-prot binary-serialization protocol, which was aimed at achieving better writing speed at the expense of compactness.

  • Alex Li worked on improving a time-travel debugger for Limshare, a system we’ve built for sharing risk limits across multiple trading systems.

Now, let’s dive into the details!

Dataframes for OCaml

Tables are a pretty convenient way to organize, think about, and work with data. They show up in all sorts of places: databases, spreadsheets, and a wide variety of so-called dataframe libraries in languages like Python and R.

In the Python world, there’s the ubiquitous pandas library, as well as some more modern, higher-performance alternatives, like Polars. Polars is written in Rust, and is usable from both Rust and Python. Many of Polars’ performance advantages come from its parallel execution engine, which is built on top of Rust’s support for fearless concurrency, i.e. Rust’s ability to provide type-level protection against data races.

Sometimes we want to use dataframes from OCaml, too. To that end, we’ve written OCaml bindings for Polars and used them in various applications. That’s been great, both because dataframes are a convenient programming idiom, and because it gives us an easy API for accessing parallelism safely.

But the Polars experience hasn’t been perfect. We don’t have a good incremental build story for Rust, which, in combination with Rust’s fairly slow build times, makes depending on Polars a real drag on the development process. Also, we’ve found some limitations and bugs in Polars (and our bindings) that have been harder to address than we’d like.

As it happens, a lot of the performance-oriented language features that make Rust a good choice for Polars are becoming available in OCaml, thanks to our work on data-race-free parallel programming based on our system of modes, and on flat-and-narrow data representations based on unboxed types.

So, we decided to experiment with a pure OCaml dataframe library, both because we thought it would be easier to use in existing OCaml applications, and because we thought it would be a good testbed for exercising the new language features we’re working on.

Arya Maheshwari’s task this summer was to build a first version of such a library, called Camels. The goal was to lay down the bones of the system in a way that would let us work out the basic structure and core APIs, before we dive into the performance-sensitive parts of the system.

We wanted an API that was easy to use and expressive, while still being amenable to serious optimization. Arya did this by separating out the syntax, i.e. the underlying structure of the computation, from the semantics, i.e. what the computation actually does. So, a function like this one:

let running_sumproduct ~value ~weight ~ordering =
  let open Expr in
  let product = float value *. float weight in
  let sorted_product = sort_by product ~by:(float ordering) in
  cumsum sorted_product

doesn’t actually execute the sumproduct computation. Instead, it produces an expression that describes the computation to be run. This phase separation gives you an opportunity to compile expressions down to a more efficient form before executing them, which you might do with code like this:

let execute_running_sumproduct df ~value ~weight ~ordering =
  Query.select (Query.view df) ~cols:[ running_sumproduct ~value ~weight ~ordering ]
  |> Dataframe.compile_exn
  |> Dataframe.execute

Another interesting design challenge was how to handle broadcasting. Broadcasting allows you to promote a single scalar value to a column before using it in column-level operations. Many dataframe libraries make broadcasting implicit, which is both convenient and sometimes very confusing. We chose to make broadcasting explicit instead. So, you might write something like this, which creates an expression that adds three to a column:

let add3 column =
  let open Expr in
  int column + broadcast (int' 3)

In order to catch improper broadcasts and type mismatches at compile time, Arya added a simple type system for expressions. The expression type tracks both the underlying row type and whether it represents a scalar or a full-length column. For example, if we forget a broadcast:

let add3 column =
  let open Expr in
  int column + int' 3

The OCaml compiler will produce the following error at the expression int' 3:

This expression has type (int, Length.one) t
       but an expression was expected of type (int, Length.input) t
       Type Length.one is not compatible with type Length.input
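To give a sense of how errors like this can be caught purely in the types, here’s a sketch of what such an interface might look like (our illustration; Camels’ actual signatures surely differ). A phantom type parameter tracks whether an expression is a single value or a full-length column, and broadcast is the only bridge between the two:

(* Illustrative phantom-typed expression interface, not Camels' real API. *)
module Length = struct
  type one   (* a single scalar value          *)
  type input (* one value per row of the input *)
end

module type Expr_sketch = sig
  type ('a, 'len) t

  val int : string -> (int, Length.input) t (* a column, referenced by name *)
  val int' : int -> (int, Length.one) t     (* an integer literal           *)
  val broadcast : ('a, Length.one) t -> ('a, Length.input) t

  (* Both arguments must have the same length type, so a forgotten
     [broadcast] fails to typecheck. *)
  val ( + ) : (int, 'len) t -> (int, 'len) t -> (int, 'len) t
end

With ( + ) constrained this way, int column + int' 3 tries to unify Length.one with Length.input, producing exactly the kind of error shown above.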

Camels isn’t finished yet. We’re interested in writing alternate backends that exploit both SIMD parallelism and multicore OCaml thread-level parallelism, as well as experimenting with expression fusion and query planning as part of the compilation step. But Arya’s work this summer has already gotten it to a pretty great starting place!

Faster (and fatter) binary serialization

We build a lot of latency-sensitive systems that need to respond quickly to incoming data, mostly in the form of market data. The core workflow is pretty simple: read in the new data, update your calculations accordingly, and maybe send out a packet in response.

That’s most of what they do, but it’s not all they do: our systems typically also need to log information on a transaction-by-transaction basis, for later analysis and debugging.

This logging needs to be as lightweight as possible, so it doesn’t slow down the rest of the program. We typically do this by writing the information we care about in a compact binary form to be picked up by another, less latency-sensitive process, which will take that data and then do some final processing and formatting to store it in our logs.

This serialization is often done in a format called Binprot, and we use code-gen syntax extensions to make it relatively painless to generate the serializer/deserializer code. But Binprot, while quite efficient, isn’t really optimized for this use-case. In particular, Binprot gives up some speed on the writing side in exchange for reducing the number of bytes required in the serialized output.
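To give a sense of the workflow, deriving the serializers is as simple as annotating the type definition:

(* With ppx_bin_prot, [@@deriving bin_io] generates bin_write_fill,
   bin_read_fill, and friends for the type. *)
open Core

type fill =
  { symbol : string
  ; price : int
  ; size : int
  }
[@@deriving bin_io]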

The goal of Arvin Ding’s intern project, then, was to create a library to serialize OCaml data types as fast as possible, sacrificing size along the way.

The central change to the design was to give up on Binprot’s variable-length encoding of integers. OCaml’s ints are 64 bits long (well…really 63 bits, but we can ignore that for now), which means that you’d normally need 8 bytes to represent them.

Binprot doesn’t do that, and instead tries to take advantage of the fact that most ints are small, and therefore can fit in fewer bytes. Here’s what the integer-encoding code looks like:

let bin_write_int buf ~pos n =
  assert_pos pos;
  if n >= 0
  then
    if n < 0x80 (* 2^7: can be stored in 7 bits *)
    then all_bin_write_small_int buf pos n
    else if n < 0x8000 (* 2^15: can be stored in 15 bits *)
    then all_bin_write_int16 buf pos n
    else if arch_sixtyfour && n >= 0x8000_0000 (* 2^31: doesn't fit in 32 bits *)
    then all_bin_write_int64 buf pos (Int64.of_int n)
    else all_bin_write_int32 buf pos (Int32.of_int n)
  else (* ... symmetric logic for negative ints ... *)
;;

This is a great trick, and does buy us a lot of size-reduction. But it has a real cost, both because of the computation that has to be done for each int that’s written, and because it means that the serialized data representation is pretty different from the in-memory representation of the same object. That prevents you from doing bulk copying of the bytes out using, say, memcpy, a very efficient routine that takes full advantage of the CPU’s ability to copy lots of adjacent bytes in parallel.

If we give up on the variable-length encoding, then we can start using memcpy for some of the serialization work. In particular, adjacent, non-pointer fields in a record can be copied simply by issuing a single memcpy that covers the full range of those fields.
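To make the contrast concrete, here’s a minimal sketch of fixed-width serialization for a record of two ints (our illustration, not Arvin’s actual code). Every field occupies exactly 8 bytes at a known offset, so adjacent fields could just as well be written with one bulk copy:

(* Illustrative fixed-width, little-endian layout: 8 bytes per int field,
   at fixed offsets, closely mirroring the in-memory representation. *)
type t =
  { price : int
  ; size : int
  }

let write buf ~pos { price; size } =
  Bytes.set_int64_le buf pos (Int64.of_int price);
  Bytes.set_int64_le buf (pos + 8) (Int64.of_int size);
  pos + 16

let read buf ~pos =
  let price = Int64.to_int (Bytes.get_int64_le buf pos) in
  let size = Int64.to_int (Bytes.get_int64_le buf (pos + 8)) in
  { price; size }, pos + 16

Note that the writer does no branching per int: its work is independent of the values being written.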

There were lots of technical challenges here, and it required Arvin to familiarize himself with some tricky and low-level parts of our codebase, and of OCaml itself.

Rather than modify the existing ppx_bin_prot syntax extension, Arvin decided to use typerep, a first-class representation of the type of an object, which is in turn generated by a different syntax extension. The serializer then works off of the typerep, which is a lot easier to program against than generating code directly at the syntactic level.
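To give a flavor of the approach (a toy of our own, not Arvin’s serializer), a typerep is a GADT you can pattern-match on to make per-type decisions:

(* Uses Jane Street's typerep library; [@@deriving typerep] generates
   [typerep_of_fill]. *)
open Typerep_lib.Std

type fill =
  { price : int
  ; size : int
  }
[@@deriving typerep]

(* A toy analysis: are values of this type immediate (pointer-free)? *)
let is_immediate : type a. a Typerep.t -> bool = function
  | Typerep.Int -> true
  | Typerep.Char -> true
  | Typerep.Bool -> true
  | Typerep.Unit -> true
  | _ -> false

(* A real serializer would instead walk the record structure inside
   [typerep_of_fill] to plan which byte ranges can be bulk-copied. *)
let _ = is_immediate typerep_of_fill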

The project also involved diving into low-level C bindings, to do some of the bit-twiddling that was required, as well as developing a fairly detailed understanding of OCaml’s underlying memory representation of types. Arvin relied a lot on the memory-representation chapter of Real World OCaml!

But there were further challenges. Arvin had some difficult debugging work tracking down some stray allocations. One key moment came when we realized that Obj.unsafe_ith_field allocates (🤯), but only when it is used on a record of only floats (because of floats’ special memory representation). So Arvin had to write a custom C version of unsafe_ith_field.

The end result worked out really well. Performance benchmarks showed it doing better than Binprot in every case we looked at. For small messages, it might be only 10-20% better; for large messages with a lot of non-pointer data, it could be as much as 15 times better!

And those improvements have translated to our production systems. The new protocol has been in production for more than a month now, and we’ve seen improvements in tail latencies of 30-65% in real systems.

We’re really excited about the end result. We haven’t observed any crashes or incorrect serialization/deserialization round-trips, which suggests that the many Quickcheck tests Arvin wrote were not in vain. And the library is fairly generic, so we’re excited to adapt it to more use cases throughout the firm.

A time-travel debugger

A key part of Jane Street’s trading systems is our risk controls. These controls largely rest on a set of risk-checking systems that are responsible for enforcing a collection of risk rules, with each system managing a particular slice of our trading.

Each risk-checking system has to be allocated some limits, upper bounds on how much risk the trading associated with that system is allowed to take on. These limits are precious, since they bound the amount of trading we can do. And historically, we allocated risk limits statically, based on human-written configuration files.

Static allocation is relatively simple, but it doesn’t allow us to maximize the use of our limits, since it requires us to predict in advance which systems will need them, and there are limits on how well that can be done, even in principle.

That’s why we built Limshare, a system for dynamically allocating risk limits while keeping a bound on our overall risk exposure. Limshare is built on a framework called Aria, which implements a distributed state machine. In an Aria app, you have a single global log of updates which, when run one after the other, build up the application’s state. This provides a simple replication and persistence story, since any process can always reconstruct the state of the system by replaying the global log.
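As a cartoon of the idea (the names below are illustrative, not Aria’s actual API), the application’s state is just a left fold of the update log:

(* A toy Aria-style state machine: state is a pure fold over the log. *)
type update =
  | Reserve of int
  | Release of int

type state = { reserved : int }

let init = { reserved = 0 }

let apply state = function
  | Reserve n -> { reserved = state.reserved + n }
  | Release n -> { reserved = state.reserved - n }

(* Any process can rebuild the current state by replaying the global log. *)
let replay log = List.fold_left apply init log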

One nice property of this approach is that the update log is useful for debugging. When something goes wrong, it’s in theory possible to step over events in the log to reconstruct exactly how you got into a bad state.

In practice, though, we’d only ever built rudimentary tools for replaying Aria messages in Limshare. There was an interactive debugger that let you step through updates one by one, printing them out along with some useful bits of the state of the application.

The trouble was that it had no notion of snapshotting. It would just replay messages from the beginning of the day until a target time t, and from then on you could step forward one message at a time (or jump to another target time). But because you couldn’t go back, if you skipped past the interesting part of the stream you’d have to start the replayer over again from the beginning of the day.

Alex Li’s project was to make the debugger much faster, by leveraging Aria’s system of snapshots. A snapshot is effectively a summary of the state of the system, as of a particular point in time. Without snapshots, in order to spin up an application, you need to replay all updates from the beginning of time. With snapshots, you can just start from the latest snapshot, and replay updates going forward from there.

Alex added logic to the debugger to take snapshots of the app state at fixed intervals. To go backwards in time, the debugger finds the latest snapshot before the target point and replays messages forward from there, efficiently reconstructing the state of the world as of any earlier message. Limshare itself had no snapshotting logic to begin with, so as part of this work Alex added that, too.
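Here’s a small sketch of the core trick (our illustration, not the debugger’s actual code), reusing the apply/init shape from the sketch above:

(* Jump to update number [target] by restoring the nearest earlier
   snapshot and replaying only the gap after it. [snapshots] is an
   ascending list of (index, state) pairs. *)
let seek ~apply ~init ~snapshots ~log ~target =
  let start, state =
    List.fold_left
      (fun acc (index, snap) -> if index <= target then (index, snap) else acc)
      (0, init)
      snapshots
  in
  (* Replay just the updates between the snapshot and the target. *)
  List.filteri (fun i _ -> i >= start && i < target) log
  |> List.fold_left apply state

Without snapshots, [start] is always 0 and every seek replays the whole day; with them, the replay cost is bounded by the snapshot interval.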

This debugger has made it easy to reconstruct what happened in the wake of complex production incidents. Here’s an example, inspired by real events, of how you could use the debugger to investigate why a given order was rejected unexpectedly.

First, we step forward until we see a reject happen. Then, we step back by one message, to just before the rejection decision was made.

> step-time 10m
(Decision (Resize_pool 2) REJECTED)
stop_condition: Breakpoint: Reached a rejected request decision.
current stream time: 2024-08-26 10:36:23.967645939
> back-messages 1
(Request.Resize_pool (pool_id 140737488355330) (downcrash $9_241_233)
 (upcrash $1_391))
current stream time: 2024-08-26 10:36:23.967491204

Now, we can print out the state to see what the limit usages were at the relevant scopes, as well as the allocation request that was rejected.

> print
Checked out resources and limits
┌────────────────────┬─────────────┬──────────────┬─────────────┬──────────────┐
│ node id            │ resources ↓ │      limit ↓ │ resources ↑ │      limit ↑ │
├────────────────────┼─────────────┼──────────────┼─────────────┼──────────────┤
│ kumquat            │ U$6_453_178 │ U$10_000_000 │    U$34_748 │ U$10_000_000 │
└────────────────────┴─────────────┴──────────────┴─────────────┴──────────────┘

pools
┌─────────────────┬─────────────┬───────────────────────┬──────────────────────────┐
│            pool │ risk system │        request bundle │                     size │
├─────────────────┼─────────────┼───────────────────────┼──────────────────────────┤
│ 140737488355330 │      nasdaq │ pts2, kumquat         │              ↓ $0 | ↑ $0 │
│ 140737488355329 │        nyse │ pts1, kumquat         │ ↓ $6_453_178 | ↑ $34_748 │
└─────────────────┴─────────────┴───────────────────────┴──────────────────────────┘

Undecided requests
┌───┬────────┬─────────────────┬─────────────────┬─────────────────────────┐
│ # │   Kind │            Node │            Pool │            Desired Size │
├───┼────────┼─────────────────┼─────────────────┼─────────────────────────┤
│ 1 │ Resize │ kumquat         │ 140737488355330 │ ↓ $9_241_233 | ↑ $1_391 │
└───┴────────┴─────────────────┴─────────────────┴─────────────────────────┘

What we can see in this case is that pts2’s request was rejected because pts1 already had a large limit reservation in place. From there, we can dig in further to figure out why, for example by jumping back another five minutes to see how long the reservation had been in place.

> back-time 5m
("Enforcer lease " (has_lease_until (2024-08-26 10:31:28.713728082-04:00)))
stop_condition: Time_limit
current stream time: 2024-08-26 10:31:23.713892045
> print
Checked out resources and limits
┌────────────────────┬─────────────┬──────────────┬─────────────┬──────────────┐
│ node id            │ resources ↓ │      limit ↓ │ resources ↑ │      limit ↑ │
├────────────────────┼─────────────┼──────────────┼─────────────┼──────────────┤
│ kumquat            │ U$6_453_178 │ U$10_000_000 │    U$34_748 │ U$10_000_000 │
└────────────────────┴─────────────┴──────────────┴─────────────┴──────────────┘

pools
┌─────────────────┬─────────────┬───────────────────────┬──────────────────────────┐
│            pool │ risk system │        request bundle │                     size │
├─────────────────┼─────────────┼───────────────────────┼──────────────────────────┤
│ 140737488355330 │      nasdaq │ pts2, kumquat         │              ↓ $0 | ↑ $0 │
│ 140737488355329 │        nyse │ pts1, kumquat         │ ↓ $6_453_178 | ↑ $34_748 │
└─────────────────┴─────────────┴───────────────────────┴──────────────────────────┘

It turns out there was no good reason for a reservation of that size to be out for that long, which made it clear that the reservation was due to a bug.

This example highlights what’s great about good observability tools: they make visible what was previously hidden, and thereby dramatically simplify the process of debugging. In that way, it feels similar to some other tools we’ve built, like magic-trace and memtrace.

Another exciting thing about this debugger is that the concept is really very general. There’s nothing about it that in principle limits it to Limshare, and indeed, the Aria team has recently adopted the project, and plans to make it a broadly supported tool across the Aria ecosystem. There are already several other teams interested in adopting it for their Aria apps!

Join in on the fun!

If any of this sounds like fun, you should apply to one of our internships. One of the great things about our intern program is that you get a chance to work on real and impactful projects that will really stretch your skills as an engineer, which I hope these examples do a good job of demonstrating.