Yet again, we’re at the end of our internship season, and so it’s time to summarize what the interns were up to!
This year, people recommended a real bumper crop of exciting projects for me to include. It’s kind of crazy how many great intern projects are out there. To mention a few that I’m not going to have time to cover in detail:
- Annie Hu spent a big chunk of her summer investigating, implementing, and optimizing different neural net sequence models, trying out a variety of compilation techniques and toolchains.
- Aster Oransoy added build priorities to our build systems’ shared action execution service, so you can ensure low-priority builds don’t slow down high-priority ones.
- Allen Pei wrote a quickcheck-like system for creating automated tests of trading systems by generating randomized sequences of market events, along with shrinking heuristics for creating minimal test cases.
- Evan Thompson wrote an LSP for our inline CSS syntax extension which includes a CSS validator that found tons of instances of invalid CSS in our applications.
- Zhibo Chen added a generic form of optional arguments to OCaml, so that optional values can use types other than the traditional OCaml option type (including more efficient representations).
- Conor Kennedy added predicate pushdown to our internal data warehouse system, so that filtering happens before data reaches the full query engine, and even wrote a mini query planner that analyzes filter expressions to derive narrower key ranges.
- Joe Cutler worked on using JIT-ing to make our HardCaml simulator fast enough to be competitive with Verilator, but with much better start-up times.
And those are just the ones I felt like I could explain in a handful of words each!
As usual, I picked just three projects to go into in more detail. In particular:
- Leo Gagnon wrote a (sometimes dramatically) more efficient evaluator for JSQL, our internal SQL dialect that we use for lots of different user-facing tools.
- Aryan Khatri built a new version of our OCaml torch bindings that leverages OxCaml’s new features for controlling memory management to clean up tensors safely and deterministically.
- Anthony Li wrote a library for managing memory across processes within our trading systems via ref-counting, making it possible to more efficiently and safely ship data across the process boundary.
Let’s dive in!
Faster (J)SQL evaluation
We use a lot of SQL at Jane Street, both in standard Postgres (or similar) databases floating around, and for accessing our own homegrown analytics-oriented data warehouse software.
Over time, we came to realize that SQL was sufficiently well-known internally that we wanted to use it beyond the context of databases, as a general language for filtering and transforming tabular data. This could be useful in all sorts of contexts: web UIs, data visualization tools, trading-systems configuration tools, etc.
The problem with this idea is… which version of SQL should you use? Every database you look at has its own charmingly unique SQL dialect that’s almost but not quite the same as all the others.
We decided to deal with this by (I know, I know) building our own dialect of SQL called JSQL. We’ve built a bunch of tools for using JSQL, including parsers, translators to other SQL dialects, web-UI components, and a collection of different in-memory evaluators for computing the results of a JSQL expression without invoking a traditional database at all.
Our evaluators started out very simple, doing little more than walking through a collection of rows and one-by-one evaluating whether they passed or failed a WHERE clause. Over time, we’ve built multiple evaluators with different performance properties, including incremental evaluators.
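In other words, the original approach was roughly this (a minimal sketch, assuming a hypothetical Where_clause.matches that evaluates a WHERE clause against a single row):
let naive_eval ~where rows =
  (* Linear scan: test every row against the WHERE clause. *)
  List.filter rows ~f:(fun row -> Where_clause.matches where row)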
That said, none of our evaluators were all that sophisticated, and in particular, none of them made use of indexing. Leo Gagnon’s project was to change that!
The idea was that, when presented with data that’s in an indexed container like a Map.t or Hashtbl.t, the evaluator should be able to use that indexing to more efficiently filter down to the data you need. So, if you have a SELECT statement where the WHERE clause contains:
author = "Dijkstra" AND publication_year > 1980
and the underlying data is contained in, say, a Paper.t list String.Map.t (a map from author names to lists of their papers), Leo’s evaluator would have to:
- determine that we only care about things under the key "Dijkstra",
- use an O(log n) Map.find to get the resulting Paper.t list,
- use List.filter on the resulting much smaller list to select the papers with publication_year > 1980
Which is way more efficient than walking over the entire map.
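Concretely, the indexed plan amounts to something like this hand-written equivalent (a rough sketch; Paper.publication_year is a hypothetical accessor):
let matching_papers (papers_by_author : Paper.t list String.Map.t) =
  match Map.find papers_by_author "Dijkstra" with
  | None -> []
  | Some papers ->
    (* Only Dijkstra's papers get scanned, rather than the whole map. *)
    List.filter papers ~f:(fun paper -> Paper.publication_year paper > 1980)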
Getting this done involved a bunch of steps!
- Building a selection type that represented the possible over-approximations of the range of keys that would be needed to satisfy a given query (there’s a sketch of this type after the list).
- Writing code to extract and optimize the selection for a given query.
- Writing code to specialize the execution of the selection to the backing store for the data. For example, the selection type tracks when ranges of keys are in scope. The Map.t type supports efficient range queries, but the Hashtbl.t type doesn’t, so you need different execution strategies depending on which you use to store your data.
- Supporting multi-index data structures, like our Immutable_indexable_bag. This involved building selection heuristics that help us pick the most efficient index to use.
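For a flavor of that first step, here’s a hypothetical sketch of the kind of variants a selection type might have (the real type is richer than this):
type 'key selection =
  | All                              (* no usable key constraint; fall back to a full scan *)
  | Exact of 'key list               (* a finite set of keys to look up directly *)
  | Range of { lower : 'key option; upper : 'key option }  (* a contiguous range of keys *)
  | Union of 'key selection list     (* the union of several simpler selections *)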
And, of course, benchmarking.
The results of that benchmarking were pretty promising. We ran some sample queries over 3.8 million rows of test data, comparing a linear scan over an array versus an index-optimized scan over a Map.t.
This first query shows a ~700x speedup, since it lets us zoom in on just the MSFT trades, ignoring everything else.
SELECT * WHERE und = "MSFT US" AND event_date > "2025-01-01"::date
+----------------------------------+--------------------+
| aggregator_name | average_time |
+----------------------------------+--------------------+
| jsql-aggregations-eval | 15.844514478s |
| jsql-indexed-aggregations-eval | 21.939788ms |
+----------------------------------+--------------------+
This second query is more complicated, in that it requires us to do a scan over a range of values, but we still get a ~30x speedup here.
SELECT * WHERE
(und = "MSFT US" OR (und >= "AAPL US" AND und < "AMZN US"))
AND event_date > "2025-01-01"::date
+--------------------------------+--------------------+
| aggregator_name | average_time |
+--------------------------------+--------------------+
| jsql-aggregations-eval | 37.056874003s |
| jsql-indexed-aggregations-eval | 1.324532585s |
+--------------------------------+--------------------+
Despite this being a pretty algorithmic and performance-oriented project, a lot of the challenges turned out to be about API design: getting all of this work done in a codebase that was simple and readable, and that presented a convenient API to users.
Better Torch bindings
We use PyTorch a lot as part of our machine learning efforts, and as you might expect, most of that work is done in Python. But sometimes, we want to drive PyTorch from OCaml, which we do using ocamltorch, originally written by Laurent Mazare some years back.
But OCaml is in some ways an awkward match for PyTorch, because OCaml manages memory using a tracing GC, in contrast to Python, which uses a refcounting GC.
A lot of ink has been spilled on the tradeoffs between refcounting and tracing, but one clear difference is around the determinism of collection. With a tracing GC, it’s hard to know when the memory you’ve allocated will be reclaimed. With refcounting, your object will be collected the moment you drop your last reference to it.
This determinism comes in handy when you’re using your collector for managing things other than main memory, like precious GPU memory. This is a plot of GPU memory usage over time while doing one forward and backward pass on a batch, then some sampling, then three more batches, written naively with ocamltorch.
This behavior is pretty awful! We’re holding on to tensors we just don’t need anymore, which is basically intolerable.
You’d deal with this in ocamltorch by carefully calling Gc.full_major () after each batch and each token sampled, to force the GC to recognize that the memory is unused and reclaim it. That gives you the desired memory behavior:
but it’s a poor solution, since the calls to the GC are expensive, and there’s no discipline to help you make sure you put them in the right place.
Aryan’s project was to build a better API for ocamltorch that provided a safe and efficient discipline for managing tensor memory, leveraging some of the new features of OxCaml, a set of extensions to OCaml that have been developed at Jane Street.
The basic idea is to introduce a way of marking a scope of allocation for a tensor, using this with_rc_scope function, where “rc” is short for “reference count”:
val with_rc_scope : (unit -> 'a) @ local -> 'a
The idea is that the body of the closure passed to this function acts as a scope, and that any tensors allocated within it will have their refcounts decremented when the function ends.
To make this all work, we use OxCaml’s local mode to make sure that tensors can’t escape their scope. In particular, any function that allocates a tensor will allocate it as a local value:
val ( + ) : t @ local -> t @ local -> t @ local
This prevents the allocated value from being returned from the closure passed to with_rc_scope.
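For instance, code along these lines is rejected (a sketch, assuming a Tensor.ones constructor like the one in the existing ocamltorch bindings):
(* Doesn't compile: the locally-allocated tensor would escape the scope that
   is responsible for decrementing its refcount. *)
let leaked = Tensor.with_rc_scope (fun () -> Tensor.ones [ 2; 2 ])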
Here’s a worked example of how you might use this in practice.
(* [xs] and [ys] are the training inputs and targets. *)
let vs = Var_store.create ~name:"vs" () in
let opt = Optimizer.sgd vs ~learning_rate:1e-3 in
let model = Model.init vs in
for _index = 1 to 100 do
  (* Every tensor allocated during this training step is released when the
     scope ends. *)
  Tensor.with_rc_scope (fun () ->
    Optimizer.zero_grad opt;
    let ys_ = Model.forward model xs in
    let loss = Tensor.(mean (square (ys - ys_))) in
    Tensor.backward loss;
    Optimizer.step opt)
done;
The full API is a bit more complicated than just that. The system has support for nested scopes, which is needed to support many of the idioms that are used in practice for both training and inference workflows on GPUs. As part of that, there is some special support for returning tensors from an inner scope to an outer scope in a controlled way that doesn’t violate the reference counting rules.
The project itself involved a lot of experimentation at the API level, to design an API that was easy to use and understand and that also captured the memory-use patterns we run into in practice. The project also had an interesting performance-engineering aspect to it: removing all of the now-unnecessary GC invocations made it easier to understand and identify further inefficiencies (like unnecessary synchronizations between the CPU and GPU) that were harder to see amongst the performance mess created by the full_major invocations.
We have more ideas about how to extend and improve these interfaces, but we already expect the new APIs to be quite useful in their current form. This is part of our open-source code, so once the new code is released, you’ll be able to find it here.
Ref-counted objects in shared memory
At Jane Street, we have lots of performance-sensitive trading systems that gather complex information over the course of their execution, and then periodically serialize pieces of that data over a shared-memory channel to another process.
This is generally a pretty good approach, but it has its limitations. Serialization always has a cost, but here it’s made worse by the fact that the data we want to send is complex nested data with shared structure between messages. As a result, serializing the data can involve serializing the same sub-structures over and over.
Anthony Li’s project was to build a library supporting a very different – and much more efficient – approach.
The idea is to get rid of the serialization and deserialization altogether, and to just pass pointers to the values in question instead. This requires that the space of objects in question is visible to both processes, so it means we need to allocate those objects within a shared memory segment.
We already have support for managing pools of objects in a shared memory segment, so this sounds easy enough at first glance. But the tricky bit is figuring out when you can recycle one of your pooled objects.
We can’t rely on OCaml’s ordinary GC for this because the data resides in a shared-memory segment between two processes, each with their own GC. And anyway, we don’t want to be churning the garbage collector in a latency-sensitive trading system.
Instead, Anthony’s project was to use a tried-and-true technique for this: reference counting.
Safer refcounting through modes
Reference counting is tricky to integrate into a language like OCaml that doesn’t have it designed in from the start. There are really three invariants you need to get right for this system to work:
- There are no data-races on the refcounts or the objects themselves
- Refcounts are incremented every time a new reference is created
- Refcounts are decremented every time a reference is destroyed
But how do we ensure that these rules are followed when they’re not natively enforced by the runtime? This is a bit like the problem Aryan ran into with reference counting in PyTorch, and again, the solution is to leverage the system of modes to ensure the necessary invariants.
We’ll need different modes at different times, so in order to manage this, we’re going to have a special handle object 'o Handle.t that guards access to the underlying object (of type 'o). We can both use modes to protect the use of the handle itself, and the handle can release the underlying object with specific modal types under specific circumstances.
That’s all a bit abstract, so let’s talk about the details:
Eliminating data races
There are really two data-race questions to handle here: one is about the refcounts, and the other is about the actual objects being managed. For the refcounts, an atomic compare-and-set operation can be used to manage them in a safe way, so that’s pretty simple, and doesn’t require anything from the mode system.
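As a rough illustration of that compare-and-set loop, here’s a sketch using OCaml’s ordinary Atomic.t (the real library manages counts that live in the shared-memory segment itself):
let rec incr_refcount count =
  let current = Atomic.get count in
  (* Retry if another thread or process won the race to update the count. *)
  if not (Atomic.compare_and_set count current (current + 1)) then incr_refcount count

let rec decr_refcount count ~on_zero =
  let current = Atomic.get count in
  if Atomic.compare_and_set count current (current - 1)
  then (if current = 1 then on_zero () (* dropped the last reference: recycle *))
  else decr_refcount count ~on_zero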
The mutability of the objects is more complicated, because the rules are different at different times. The objects must be mutable on initialization, since they have to be filled in at that point. But once you have multiple readers of the object, you really need them to not change. It turns out we can leverage OxCaml’s visibility mode axis, which includes the immutable, read, and read_write modes.
Specifically:
- During initialization, we expose the value under the read_write mode (which is the default), so the data in the object can be set. Notably, at this point, we’re guaranteed there’s only one reference to the object in question.
- When reading, we expose objects under the read mode. This way, multiple readers (even across processes) can access the same object without fear of a race.
Notably, once an object’s reference count goes back to zero, it can again be the subject of an initialization, so it can again be exposed read_write.
Another interesting aspect of this is that when we release the underlying values, we do so under the local mode, to prevent the value from escaping its intended scope. As such, what we’re implementing is analogous to borrow-checking in Rust.
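To make that a bit more concrete, here’s a hypothetical sketch of what such signatures could look like (the names and mode annotations are illustrative, not the library’s actual interface):
(* [init] exposes the freshly-claimed object read_write so it can be filled in;
   [read] exposes it read (and local), so readers can neither mutate it nor let
   it escape the callback.  A real interface would presumably also hand the
   unique handle back to the caller. *)
val init : 'o Handle.t @ unique -> f:('o @ local read_write -> unit) -> unit
val read : 'o Handle.t @ unique -> f:('o @ local read -> 'a) -> 'a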
Managing increments and decrements
The key invariant here is that people don’t just go about duplicating handles without incrementing the associated reference count. To ensure this, each Handle.t is created under the unique mode, and all operations that use handles require that they be provided uniquely.
This guarantees that all handles that are used are held uniquely, and so if you want to refer to the handle in multiple places, an explicit copy function must be called. And, critically, that copy function increments the reference count.
There’s also a free operation that consumes a handle and decrements the reference count. And a way of sending a handle to another process, at which point the sending handle is consumed, and a receiving handle is created, without changing the reference count.
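Putting those operations together, a sketch of the handle interface might look something like this (hypothetical names and annotations, not the actual library):
module Handle : sig
  type 'o t

  (* Duplicating a handle must go through [copy], which bumps the refcount. *)
  val copy : 'o t @ unique -> ('o t * 'o t) @ unique

  (* [free] consumes a handle and decrements the refcount. *)
  val free : 'o t @ unique -> unit

  (* Sending consumes the local handle and creates one on the receiving side,
     leaving the refcount unchanged.  [Channel.t] is a stand-in for the real
     shared-memory transport. *)
  val send : 'o t @ unique -> Channel.t -> unit
end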
Anthony’s library is complete, and the team is now working it into our production systems. We hope that this will be a library that’s useful to multiple teams across the firm.
Join us!
If this sounds like a fun way to spend your summer, you should apply to our internship program. Jane Street interns get a chance to solve fun and challenging problems that have real impact. I hope this post gives you a sense of that!