What the interns have wrought: RPC_parallel and Core_profiler

We’re in the midst of intern hiring season, and so we get a lot of questions about what it’s like to be an intern at Jane Street. One of the things people most want to know is what kind of projects they might work on as an intern.

That’s of course hard to answer precisely, since we don’t yet know what next summer’s intern projects are going to be. But you can get a sense by seeing some of what interns have done in the past. To that end, I thought I’d describe a couple of intern projects that were done this past summer.

Rpc_parallel

Rpc_parallel is a new library written by Todd Lubin (who will be returning to Jane Street full-time next fall), aimed at making parallel programming in OCaml simpler and safer.

Writing programs that take advantage of multiple cores, and even of multiple computers, is of course a necessity. In OCaml, such parallelism is typically achieved by writing multiple single-threaded programs that use message passing to coordinate.

We have lots of parallel message-passing systems, but while these share a lot of infrastructure (notably, the Async_rpc library), we’ve never really settled on a lightweight way to construct complete parallel programs. Instead, each such program tends to have its own little process coordination framework.

A while back, we wrote a library called Async_parallel, (described here). Async_parallel does a lot of things right – in particular, it makes it easy to spin up and manage new processes and distribute work between them.

But Async_parallel has a problem. It is based on OCaml’s marshal facility, which allows you to serialize arbitrary data, including functions, between processes. This sounds great, but it has a dark side: marshalling arbitrary functions turns out to be error prone. In particular, in order to send a function over the wire, you also have to send a copy of any data that that function implicitly depends on.

Unfortunately, it’s hard to know what kind of data is hidden behind a function, which can cause a few different forms of trouble: it might send much more data than you expect, it might fail unexpectedly if it hits one of the forms of data that can’t be marshalled, or it might lead to crazy and hard-to-predict behavior if some of the data required by the function is meant to be mutable shared state.

Instead, we wanted a library that was more typeful and constrained in terms of what was sent over the wire. This pushed us towards a design where, at the cost of some extra verbosity, we explicitly declare the types of data that is sent. In exchange, we get a system that is easier to understand and debug.

One of the great things about Rpc_parallel is how fast it came together. Todd got it into a sufficiently good state by the middle of the summer that he was able to use it for his other projects (interns typically have at least two major projects over the course of the summer).

Rpc_parallel also benefitted from some world-hopping collaboration. Interns spend at least a week in an office other than their home office, and Todd ended up visiting Hong Kong. While there, he ended up spending a lot of time talking and working with Chengqi Song, who had a lot of experience with Async_parallel. Out of those discussions came a complete redesign and rewrite of the library, factoring out the core primitives for coordinating across multiple processes, and making the overall library simpler and more general.

By the end of the summer, a few other people picked up and starting using it for other projects, and last week, it was released open source, so you can take a look at it yourself on github.

Core_profiler

Profiling is surprisingly hard, and as such it’s perhaps unsurprising that there are lots of ways of doing it. If you want to understand the cost of an individual operation, you might want a micro-benchmarking library like Haskell’s Criterion, or our own Core_bench. If you’re trying to understand properties of a whole application, like which lines of code it’s spending its time on, or how many cache-misses are occurring, you might want to use something like Linux’s perf tools, which use CPU-level counters to efficiently gather profiling statistics from your program.

Another useful technique is for the programmer to modify the source to add explicit probes that keep track of when certain program points are reached, and write out that information to a log that can be analyzed after the fact.

Daniel Richman (who will be returning for an internship next summer) worked along with Roshan James (formerly an intern himself, now fulltime) on a library called Core_profiler, which aims to make the use of such probes easy and efficient. Efficiency matters quite a bit, because if logging a probe takes more time than the thing you’re trying to measure, you basically can’t extract reliable data. Keeping the overhead small, therefore, is a must.

Accordingly, a lot of Daniel’s time was spent thinking very carefully about how to write the probes in a way that would only minimally disrupt the execution of the program. He started a simple but less efficient library, called Stats_reporting, which took about 60ns and two words of allocation per probe, and started machining it down from there.

The first step was to avoid all allocation, which meant we could no longer use bin-prot, our standard binary serialization technique, since bin-prot requires that you allocate an OCaml object representing the data to be serialized. So they moved to using an internal library called Protogen for generating zero-allocation serialization code.

That brought us down to about 30ns, and zero allocations. We then decided to try out writing our own hand-rolled binary protocol, so we could have a yet-more-closely optimized binary layout. That brought us down to about 20-25ns. The next hack was to customize the binary format yet more by packing multiple values into a single, 63-bit OCaml integer. That, plus some cleverness to make sure that writes were word-aligned, brought the cost of a simple probe down to 12ns. In addition to the design changes, Daniel also spent a lot of time carefully examining the assembly code, to make sure that there were no surprises from the code generator.

We’re pretty happy with the end result. We think that Core_profiler probes are now a good bit cheaper than the dynamic probes you can insert using a system like DTrace, and are in any case the best tool we have for tracking the performance of even relatively quick code-paths. And, as a nice bonus to all this optimization, the offline processing got about twice as fast as a side effect of the runtime improvements.

There’s more to be said about both of these projects, and about the many other projects that were done this summer. If you’re interested in applying, you can do so here:

http://janestreet.com

You don’t need to know anything about finance or functional programming to apply. But if you come, you’ll learn a ton about both by the time you’re done.