Our traders and researchers love Python for its agility and for its huge open-source ecosystem, especially when it comes to machine learning. But the heavy use of notebooks can make that Python code difficult to support. Notebooks have a very different lifecycle than regular code, and aren’t always rigorously version controlled. And while most of our code (much of it written in OCaml) lives in a monorepo, putting all notebooks there is difficult; many notebooks end up being stored all over the place.

That leads to a serious problem: notebooks living outside of the repository depend on various libraries inside of it. We have to expose those libraries somehow and manage their release and deployment cycle. This is where Python environments come in: they are the layer between libraries and notebooks. In an ideal world, they’d allow library authors to iterate quickly while still allowing notebooks to be upgraded at their own pace.

Over the last few years we have become more convinced that declarative, reproducible, centrally built Python environments are where we want to be. Unfortunately, a lot of open-source Python environment tools were designed for smaller, mutable environments. There have been some improvements lately, notably Pipenv and Poetry for reproducibility and Docker for deployment, but we can’t use these tools off the shelf: our Python environments need to interact with OCaml code during the build process, and Docker requires more privileges on the hosts it runs on than we’re comfortable allowing.

So instead we’ve developed our own system for building and deploying Python environments. This tool, which we call js-python, relies on our internal build system for OCaml integration and uses a format called XAR for deployment. We think it works pretty well: it allows us to easily create and manage new environments, which in turn allows more decentralization and improves robustness. Let’s take a look at how we arrived here, and what we like about it.

How we started with Conda and discovered our problems

We deployed our first widely shared Conda environment in 2018. Conda is great: it’s open-source, has lots of packages, and is well documented online. Our environment was working reasonably well and people started to use it more. And they started to request more and more packages.

Pretty soon two major problems became very apparent. The first has to do with how the environment is stored on the filesystem: it is just a directory with a bunch of files. This feels very normal in much of the world, but some internal context complicates things. Because the environment was shared, it was deployed on a network filesystem. NFS is a very powerful piece of technology, but our NFS server becomes slow when tens of thousands of files are read in a tight loop. Starting Python and running the usual library imports actually triggers this situation. Starting a kernel from the local disk took less than one second, but starting the same kernel from NFS took on the order of minutes if you included all imports.
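To get a feel for the scale involved, you can simply count how many files an environment exposes on disk; on a cold import, each one is a potential round trip to the NFS server. A rough sketch (not our internal tooling, and the environment path is hypothetical):

```python
# Rough illustration: count the files a Python environment exposes on disk.
# On NFS, a cold import can end up touching a large fraction of these,
# one small read at a time.
import pathlib

env_root = pathlib.Path("/nfs/envs/research")  # hypothetical environment root
n_files = sum(1 for p in env_root.rglob("*") if p.is_file())
print(f"{n_files} files under {env_root}")
```

Python’s built-in `python -X importtime` flag is also handy for seeing where the startup time actually goes.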

Second: the environment was not reproducible. We had a wiki page with instructions and records of how it was built, but it wasn’t complete and didn’t account for differences in machines and conditions during the build. At some point people became cautious about installing packages because they were afraid they would break the environment in subtle ways and wouldn’t be able to go back!

Switching to XAR files for fast startup and relocatability

The first major change came when we discovered the XAR file format. At a very high level, a XAR file contains an entire filesystem packed together into one file. It can be mounted and appear to the user as the usual tree of files. XARs use efficient compression (Zstd), and are pretty easy to work with. They are built on top of squashfs, which has been used in the Linux world for a long time and has robust tooling.
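The details of the XAR tooling aren’t important here, but the underlying idea is easy to demonstrate with the standard squashfs tools. A rough sketch (paths are made up, and the real XAR format adds an executable wrapper on top of the squashfs image):

```python
# Pack a directory tree into a single zstd-compressed squashfs image, then
# mount it with squashfuse (FUSE, so no root needed). This shows only the
# squashfs layer; XARs wrap a similar image so it can be mounted and run
# transparently.
import subprocess

src = "/tmp/my-python-env"         # hypothetical environment directory
image = "/tmp/my-python-env.sqfs"
mountpoint = "/tmp/my-python-env-mnt"

subprocess.run(["mksquashfs", src, image, "-comp", "zstd"], check=True)
subprocess.run(["squashfuse", image, mountpoint], check=True)
```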

XARs seemed like a perfect fit for our NFS deployments: from the NFS perspective they are just a single file, so the reads are fast; but from the Python perspective once the XAR is mounted it’s not that different from a normal Python environment.

However, the reproducibility problem remained. It was difficult to write a script that built an exact copy of our shared environment, especially one that could run in an isolated build system the way we wanted it to. Also, Conda relies on a central configuration and a standard directory structure, which makes it somewhat tricky to build a Conda environment in one place and then move it to another. All of this made it difficult to integrate Conda with XARs, which are designed to be relocatable.

Faced with these challenges, we decided to drop the idea of packing Conda environments into XARs. Instead we’d try to take more control over how the environment was structured and initialized. We wanted to keep the structure simple so it would be easy to maintain and integrate with other tools. In the end we set up something similar to virtualenv using our XAR entrypoint script to configure environment isolation.
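We won’t go through the entrypoint in detail, but its shape is roughly “point the bundled interpreter at the mounted tree and shut everything else out.” A minimal sketch, with the mount-point variable and directory layout invented for illustration (this is not the actual js-python implementation):

```python
#!/usr/bin/env python3
# Minimal sketch of a XAR entrypoint. The XAR runtime mounts the image and
# tells us where; we exec the packed interpreter against the packed
# environment, isolated from the user's own Python configuration.
import os
import sys

mount_root = os.environ["XAR_MOUNT_ROOT"]  # hypothetical variable set by the XAR runtime

env = dict(os.environ)
env["PYTHONHOME"] = mount_root       # use the stdlib and site-packages inside the XAR
env["PYTHONNOUSERSITE"] = "1"        # ignore packages installed in ~/.local
env.pop("PYTHONPATH", None)          # don't let the caller's paths leak in

interpreter = os.path.join(mount_root, "bin", "python3")
os.execve(interpreter, [interpreter, *sys.argv[1:]], env)
```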

It worked, and it was fast! In the spirit of boring names, we called the resulting environment format js-python. But it still wasn’t perfect.

Where XARs leak: dynamic components

We could build and deploy XARs pretty easily, and we were eager to move the build process into our continuous integration system so we could run tests and catch bugs early. This wasn’t possible with our Conda setup, but seemed doable with XARs.

In practice, though, we quickly ran into an issue: our XARs were not fully self-contained. Because environments were historically hard to deploy, our library code was developed and released separately as two major components: pure Python libraries and Python bindings for OCaml libraries. The environment itself contained only the interpreter and third-party libraries. This made testing and releasing changes that spanned more than one component (like adding a third-party library to the environment and using it in internal code) quite painful.

For the same reason, it was harder to troubleshoot production issues: if you looked at the deployment history of any one component to find the cause of a breakage, you could easily miss deployments or rollbacks of the other two. Rolling back during production issues was not fun.

So we did get some benefits of XARs, but because of the way our XARs were structured, we didn’t get all of them.

Static bundling: making XARs actually self-contained

If we zoom out, the problem we were facing was the classic one: static versus dynamic linking. Jane Street libraries were essentially dynamically linked into the environment, with the pros and cons that come with that.

Outside of Python, we normally build statically linked executables, and that has a lot of advantages. Library owners can expect to be able to make and release changes without risking breaking production jobs, and often with less consideration for versioning. The monorepo encourages this as well: it’s harder to take full advantage of refactoring capabilities that a monorepo provides if you have to care about your libraries being dynamically linked into old executables.

Python did not have to follow the same model. But if it did, testing and deploying changes that span Python, OCaml, and external libraries would be much easier. This is very attractive: there are hundreds of OCaml libraries at Jane Street that could be useful in Python, and the lower the barrier for exposing them, the more power Python users would have.

So we went ahead and converted the internal Python libraries and OCaml bindings to be statically linked into the environment! This resulted in a set of new Python-specific build rules in our build system that let you specify, at build time, how Python libraries depend on each other and on OCaml libraries.
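Our rules live in our internal OCaml build system, so we can’t show the real syntax here, but the dependency shape they express will look familiar from other build systems. As a rough analogy only (a Bazel-style declaration, with every target name invented for illustration), a Python library can declare dependencies on other Python libraries, on third-party packages, and on OCaml bindings:

```python
# Hypothetical, Bazel-style analogy for the dependency structure our rules
# express; none of these rule names or targets are real.
py_library(
    name = "research_utils",
    srcs = ["research_utils.py"],
    deps = [
        ":plotting",                     # another internal Python library
        "//third_party:pandas",          # a third-party library baked into the environment
        "//ocaml/pricing:py_bindings",   # Python bindings for an OCaml library
    ],
)
```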

Versioning deployed environments: XARs need names

Now that we could build and deploy environments continuously, one of the first user concerns was: how do I make sure my notebooks don’t break unexpectedly?

Historically our environments were used through their “nightly” instance. Static linking means that when we don’t deploy “nightly”, the environment doesn’t change (which is a good property!). But we do want to deploy “nightly” continuously, to provide new features and fixes to users quickly.

To accommodate both the users who want the latest changes and those who prefer stability, we introduced a distinction between “nightly” and “stable” tags. Stable tags are created on a schedule, and once created they never change. Both nightly and stable tags are exposed as notebook kernels, so users can easily switch between them when they want to “time-travel” between environment versions.
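Exposing a tag as a kernel is just a matter of registering a kernelspec that launches the corresponding XAR. A sketch of what that might look like, with the deployment paths and tag names invented for illustration:

```python
# Illustrative only: register one Jupyter kernel per environment tag, each
# pointing at the deployed XAR for that tag. Paths and tag names are made up.
import json
import pathlib

kernels_dir = pathlib.Path.home() / ".local/share/jupyter/kernels"

for tag in ["nightly", "stable-20210601"]:
    spec = {
        # The XAR is executable, so it can sit directly in argv[0].
        "argv": [
            f"/deploy/research/{tag}/python.xar",
            "-m", "ipykernel_launcher",
            "-f", "{connection_file}",
        ],
        "display_name": f"research/{tag}",
        "language": "python",
    }
    kernel_dir = kernels_dir / f"research-{tag}"
    kernel_dir.mkdir(parents=True, exist_ok=True)
    (kernel_dir / "kernel.json").write_text(json.dumps(spec, indent=2))
```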

A single-revision world

All of the above has led us to a state where a Python environment is exactly described by the revision of the code it was built from. Users can switch between revisions (usually by means of stable tags), and maintainers have full control over a deployed instance like “research/nightly”: when you roll it, you roll “everything”, and when things break, you can roll back “everything”.

We’re really happy with the results. This model makes environments easier to maintain: we no longer have to worry about interactions and varying roll-out schedules between third-party code and internal Python and OCaml code.

That makes it easier to mint new environments, which in turn isolates workflows and lets us distribute work across teams. In the new world, a critical notebook in Hong Kong might use a dedicated local environment, which means it won’t get broken by a roll happening out of New York. Ultimately, centralizing the build allows us to decentralize the environments more, which increases the overall resiliency of our Python code.

It comes at a cost though. Hotfixes are now much less trivial to make: you cannot edit a XAR file in vim when you find a bug. At the very least you have to build a new XAR file with the fix and deploy it. Even if you skip most of the normal pipeline, and you are familiar with the tools, it can still take 10 to 20 minutes. We’re working on bringing this time down, but some traders consider this fundamentally too slow for running critical applications. If you use Python code to provide knobs to your system, you expect to be able to modify it on the fly, not wait 20 minutes. Python is a pretty good “rich config” format, so we don’t want to take away this possibility.

The way we think about this currently is that Python code is split into “core” and “leaf”. “Core” code is part of the environment: it has clear separation between development and production phases. After it’s deployed, it’s immutable. “Leaf” code exists outside of the environment. Different teams have different setups for it, but usually it involves some lighter-weight version control than our main monorepo, and the ability to edit the code in the deployment location. Leaf code is usually notebooks or thin scripts which call into core code.
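To make the split concrete, here is a hypothetical example (the module and function names are made up): the leaf script is a thin, editable wrapper holding the knobs, and everything substantial lives in core code inside the environment.

```python
# Hypothetical "leaf" script, living outside the environment and editable in
# place. The research_core package and its API are invented for illustration;
# the real work happens in core code shipped inside the XAR.
from research_core import pricing  # core: immutable, part of the deployed environment

# Leaf-level knobs that someone may need to adjust on the fly.
SYMBOLS = ["ABC", "XYZ"]
RISK_LIMIT = 1_000_000

if __name__ == "__main__":
    for symbol in SYMBOLS:
        print(symbol, pricing.fair_value(symbol, risk_limit=RISK_LIMIT))
```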

The simplicity of the single-revision model is very attractive, and we try to encourage users to put as much code as possible into the environment. But we don’t want to take away that last-mile agility that many teams rely on, and the core/leaf distinction lets us strike a good compromise.