Getting from tested to battle-tested

Testing is an essential part of building reliable software. It’s a form of documentation, a reminder of mistakes of the past, and a boost of confidence when you want to refactor. But mostly, testing is a way of showing that your code is correct and resilient. Because it’s so important, we’ve invested a lot of effort at Jane Street to develop techniques that make tests clearer, more effective, and more pleasant to write.

But testing is still hard. It takes time to write good tests, and in any non-trivial system, your tests are an approximation at best. In the real world, programs are messy. The conditions a program runs under are always changing – user behavior is unpredictable, the network blips, a hardware failure causes a host to reboot. It’s inherently chaotic. And that’s the hard thing about developing high-availability systems: for all the careful tests that you think to write, there are some things you can only learn by experiencing that chaos. That’s what it takes to go from merely being tested to being battle-tested.

We spend a considerable amount of time thinking about this problem in our development of an internal distributed system called Aria. Aria is a low-latency shared message bus with strong ordering and reliability guarantees – you might recognize it from an episode of Signals and Threads where I talked about how it acts as a platform for other teams to build their own resilient systems with strict uptime requirements.

More and more teams have been adopting Aria at Jane Street, which is great! But it also means that each week that goes by without an incident becomes less of a tiny victory and more of an obligation to keep the system running smoothly. Not to mention, the system has to continue to grow in scale and complexity to meet the needs of the teams that use it. How do we mitigate the risks that naturally come with change so that we can keep evolving the system? Testing goes a long way here, but it’s all too easy for your tests to miss the critical scenario that will expose your mistake.

Earlier this year we started using Antithesis, an end-to-end automated testing platform, to fill those gaps. We’ve become huge fans of the service (and are now leading their next funding round! More on that later), and part of the point of this post is to explain why.

But before we get to that, let’s lay some groundwork for how Aria approaches testing.

Testing everything you can think of

While none of this is exactly novel, we’ve built up a rather extensive toolbox of different testing techniques:

Unit tests of modules and data structures without side-effects, including many simple state machines.
Integration tests with a simulated networking layer which allows for testing very fine-grained interactions between services, including delaying and dropping packets and manipulating time.
Quickcheck tests that can produce random orderings of events which we can feed into a simulation.
Version skew tests to ensure that new client library changes work with existing servers and older client libraries will be compatible with newer servers.
Fuzz tests using AFL which will turn the fuzzer’s byte input stream into a sequence of state updates in an attempt to catch unsafe behavior in performance-optimized state machines.
Lab tests to check for performance regressions which run nightly in a dedicated lab environment that is set up similar to production.
Chaos testing where our staging environment runs a newer version of the code while we apply simulated production-like load and restart services randomly.

Each one of these adds real value, but the simulated networking is maybe the most important piece. The ability to write tests which don’t require excess mocking and are also fast and deterministic means that you can express more edge cases with less effort, get more introspection on the state of components, and run the entire suite in every build without worrying about flakiness. It is an invaluable tool when writing new features, as well as a great way to write reproduction tests when verifying bug fixes.

Aria’s testing story requires a lot of effort and has evolved organically over time, but it also has been quite successful. Incidents in production are few and far between, even as we deploy new changes each week.

When we do encounter a bug that slipped through, there’s always a sense of “oh, that’s a really tricky case, it’s no wonder we didn’t think to test it”. Even our quickcheck and fuzz tests are limited to the confines of the artificial environments we construct for them, and the chaos testing barely scratches the surface of what’s possible.

Testing everything you didn’t think of

Last year we had a chance to talk with the team at Antithesis and got really excited about their product. The amazing thing that Antithesis does is run your whole system in a virtual machine controlled by a completely deterministic hypervisor, and then adds a little manufactured chaos by interfering with scheduling and networking. It uses this setup to explore many different scenarios, and to discover circumstances where your system might fail.

Part of what’s great about this is that you don’t need to change your system to use Antithesis. You can run your system in a realistic environment – network, file system, shared memory, it’s all there. You get to interact with your system using real client code. And if they do manage to make a process crash or cause an assertion to fail, you can replay events to get back to that state and interact with the system as much as you want to understand what happened.

We weren’t sure how effective it was going to be, so we started with a trial period to find out. Sure enough, on our first run, Antithesis surfaced two previously unknown bugs – notably, one had just been introduced a month prior, and seemed pretty likely to eventually occur in production, and with fairly consequential effects. We’d actually thought about the possibility of this kind of failure when designing the change, but a simple bug in the code slipped through, and we just forgot to write an explicit test.

There’s something really attractive about running your system in a way that looks and feels like production. You can be a bit more confident that you’re not accidentally hiding away some race condition by rewiring everything to fit into a little box. I find the “API” of Antithesis to be quite elegant: provide some Docker images and a compose file that describes the individual parts of your system, and they will call docker compose up inside the VM. That gets the system into a running state, but you obviously need to make it do something. So, you can create a directory in a container full of executable files that each take some kind of action on your system – like actions users or admins would take in production – and Antithesis will decide how and when to run them. And by and large, that’s it.

Of course, the generality here is a double-edged sword: the space of all possible states and inputs is enormous. Even if you threw tons of hardware at the problem, you’d probably only do a bit better than our chaos testing. That’s why the second half of Antithesis – the exploration engine – is so important. One of the cool properties of determinism is not just that you can reconstruct a state at any time, you can also reconstruct a prior state too. So you can effectively rewind time and try a new approach. If the explorer is getting feedback from which branches of code it managed to hit, it can know when it got into an interesting or rare state, and it can spend more time taking different actions around that moment. Will Wilson, one of the co-founders of Antithesis, gave a talk which demonstrates some of the principles behind this search using the NES game Super Mario Bros. as a test subject – it’s such a fun talk; I highly recommend checking it out.

So let’s say Antithesis stumbles upon a bug. What does that look like, and where do you go from there?

A real bug

We kick off a test run each night with the most recent revision of code, and one morning we came in to find results that showed an unexpected container shutdown. At first glance, the logs included this.

118.738

standby.replicator.1

Info: Streaming tcp receiver connected to 10.89.5.61:42679

118.738

standby.replicator.1

Error: Unhandled exception raised:

118.738

standby.replicator.1

(monitor.ml.Error

118.738

standby.replicator.1

(message_part.ml.Malformed_message

118.738

standby.replicator.1

("00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

118.738

standby.replicator.1

"00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

118.738

standby.replicator.1

"00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

118.738

standby.replicator.1

"00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

The “replicator” service connected to a server and shortly after, raised an exception and crashed. The 118.738 is the time in seconds since the test started. The exception made it look like it was being served corrupt data, which should never happen under any circumstances. Antithesis also has a tool that can investigate a specific instance of a failure by rewinding a bit, running with different input, and seeing whether it failed again. It produces a graph like this.

This is showing that somewhere about 6 seconds before the crash, something happened that put us in a state where it was very likely to reproduce. If we go back through the logs, we can find out that Antithesis randomly killed a different service around that time.

111.861

fault_injector

{"fault":{"name":"kill","affected_nodes":["primary.tip-retransmitter.1"]}}

We can also filter the logs down to look for that specific service.

113.911

primary.tip-retransmitter.1

Info: Starting from snapshot with stream time 2025-11-28 16:59:51.362900555-05:00

113.911

primary.tip-retransmitter.1

Debug: Streaming TCP retransmitter listening on 10.89.5.61:42679

And that also lists the same host and port that the replicator connected to. But this still doesn’t say much – a server restarted, a client connected to it, and then the client got corrupt data? At this point we can jump into Antithesis’ debugger environment, which lets you write notebook-style snippets to run inside the virtual machine. By rewinding time by one second before the crash and running tcpdump, we can capture the exact traffic that was exchanged between the client and the server.

branch = moment.rewind(Time.seconds(1)).branch()
container = 'standby.replicator.1'
print(bash`tcpdump -nn -X tcp and host 10.89.5.61`.run_in_background({ branch, container }))
branch.wait(Time.seconds(5))

And with a little grit, we can extract the query that the client sent.

16:59:57.631701 IP 10.89.5.36.35922 > 10.89.5.61.42679: Flags [P.], seq 40:67, ack 40, win 32768, options [nop,nop,TS val 2689576410 ecr 3733209032], length 27 0x0000: 4500 004f c841 4000 4006 5355 0a59 0524 E..O.A@.@.SU.Y.$ 0x0010: 0a59 053d 8c52 a6b7 934f 4b7a df85 101d .Y.=.R...OKz.... 0x0020: 8018 8000 1f54 0000 0101 080a a04f adda .....T.......O.. 0x0030: de84 3fc8 1900 1500 0000 51c8 0400 0000 ..?.......Q..... 0x0040: 0000 0041 3131 3233 0000 00ff ffff ff ...A1123.......

This highlighted portion is the byte offset that was requested by the client. It’s a little-endian 64-bit integer whose value is 0x04c851, or 313425 in decimal. Okay, so what did that snapshot contain?

container = 'primary.tip-retransmitter.1'
print(bash`aria admin get-latest-snapshot -max-stream-time '2025-11-28T16:59:51.362900555-05:00' \
            | sexp get '.snapshot.metadata.core_stream_length'`.run({ branch, container }))

Here we not only get to use our own admin command to talk to a server, but we also can simply pipe the output to another tool of ours that dissects and pretty-prints the output.

((stream_time 2025-11-28T16:59:51.362900555-05:00) (byte_offset 315567))

This is telling us that the server started from byte offset 315567, which is after the offset of the request. It should have served the client an error, not bad data! At this point we have enough of a picture to read through the code and figure out what’s wrong.

The gritty details

This bug was related to a new feature extending the “tip-retransmitter” service which was mentioned in the logs above. These services provide data to clients (the “replicator” in this case) on demand from an in-memory ring buffer – only the most recent data in the stream, or the “tip”, is available. These services had been in use for a long time but recently were given the ability to serve clients in other regions in addition to local clients. Something about this new behavior was buggy.

After closer inspection, we realized that the implementation made some incorrect assumptions about the state of its ring buffer when checking if the client request was valid. However, this only manifests

after the server was restarted and loaded a snapshot,
before the ring buffer was filled up, and
if the client sends a request for data before the snapshot.

This is exactly what Antithesis managed to reproduce. Instead of an error, the server incorrectly sent back NUL bytes from an empty region in the ring buffer. At the time the original code was written, snapshots didn’t exist, so the bug couldn’t have occurred. It was only introduced later on.

But hold on a second, loading from snapshots had been around for a while, yet this only failed once we extended it to serve other regions. Had it always been broken? Well, sort of. It turns out that local clients use a different method of service discovery which means they won’t even try to talk to a server which was started from a later snapshot because they knew it didn’t have the data. The clients in another region used a different method of service discovery and simply had to optimistically try.

This had all the ingredients for a tricky bug:

It required a niche situation where a server was restarted and a client connected to it after it advertised and before it filled up its ring buffer, asking for data from before its snapshot.
It was code that had already been running in production for a long time, but the bug was being masked by the service discovery mechanism.
Because we were leveraging existing code, we didn’t think to write a new test, especially for this situation.

And the potential impact was really bad, since it involved serving corrupt data.

Happily, Antithesis was just what we needed to catch the bug before it caused real problems.

Antithesis found the bug shortly after the feature was completed and the new services added to our Antithesis config. This time delay was short enough that we knew that something about our recent change was the culprit.

It also gave us the tools to actually dig in and figure out what happened. If this happened in production, we would have gotten the exception, and we might have been able to notice the log lines, but we wouldn’t have had enough data to narrow down the situation, and we wouldn’t have had a good way to verify the fix we wrote was fixing the actual bug.

It’s not that Antithesis replaces all of our existing testing. Each different flavor of test really serves it’s own unique purpose. But the way in which Antithesis tests whole-system scenarios that we either wouldn’t have thought to test is its own kind of magic. Enough so that we’ve noticed a small cultural shift on the team where we feel like we can tackle more ambitious projects by relying on Antithesis to fill in any gaps along the way.

Where do we go from here?

Antithesis has been really useful for Aria, and we’ve started working on applying it to other applications within Jane Street. We’re starting out with some similar, high-assurance distributed systems, like a new distributed object store that’s in development.

But we think there are lots of other opportunities for applying the tool. For one thing, we’re excited about using Antithesis on systems whose testing story is less developed than Aria’s. Not every system at Jane Street has gone to the trouble of using mockable network and timing services that let you build nice, deterministic simulation tests. Sometimes, that kind of testing is simply infeasible, since some parts of the system rely on external software that we don’t fully control. But that kind of software is still easy to run in Antithesis.

We also think that Antithesis holds a lot of promise in the context of agentic coding tools. One of the key problems with coding agents is that it’s hard to build confidence that they’ve done the right thing. We think that Antithesis holds a lot of promise as a source of feedback, both for using and for training such models.

A future partnership

There’s one last part of this story to talk about: we were so impressed by the product and the team behind it that we wanted to invest, and in the end, we’re leading their next round of funding. We love these kinds of partnerships because not only is this a technology that feels unique and aligned with our technical culture ¹, but also because Antithesis has been so receptive to feedback, and is so passionate about what they’re building.

This all lines up with Jane Street’s broader approach to private investing: we like to provide long-term capital to companies where we understand the technology deeply and can see the potential; where we like and believe in the people doing the work; and where they’ve built something we’re excited to use ourselves as a customer. Antithesis hits all those marks.

On a personal note, I’m really excited about this. The team at Antithesis is an absolute pleasure to work with. I’ve never used a SaaS product where I got to talk directly to their engineers about bugs or specific behaviors, or to their designers about UX. And a countless number of my colleagues have had to hear me gush about just how cool it is. I’m always strangely excited to see what it digs up next.

After all, we already abstracted away our entire network layer to get this high-fidelity integration testing ↩