People seem to enjoy talking about programming methodologies. They give them cute names, like eXtreme programming, Agile, and Scrum; run conferences and build communities around them; write books that describe how to use them in excruciating detail; and draft manifestos that lay out their philosophy.

This approach always leaves me cold.

I prefer stories to manifestos. Big overarching theories of programming are hard to come by (or at least good ones are) because so much depends on the details of the technology used, the problem to be solved, and the culture of the organization in question.

Instead, I like to hear people describe the things they’ve tried and how those choices have worked out in practice. Such stories are hard to draw general conclusions from, but hearing them helps to build up your intuition about what the possibilities are.

In that spirit, I want to tell a story about how we develop software. In particular, I want to describe a style of development that has gained some traction with us in the last couple of years. For lack of a better name, let’s call this the Iron style of development, since it depends a lot on Iron, our code review and release management system.

The Iron style combines the following approaches:

  • Lots of expect tests. Expect tests are used pervasively, serving as a way of capturing program traces that expose aspects of the system’s behavior to reviewers.

  • Small changes. Individual changes are kept small most of the time, with most changes being somewhere from ten to a few hundred lines. Large changes are often created (and reviewed) as chains of dependent changes.

  • Fast turnaround. Review is done eagerly, with review taking precedence over writing new code.

This sounds like a fairly mundane list of good practices, and it mostly is. But the details are important, and not obvious (or at least, they weren’t obvious to us). We were surprised by the way in which the tools affected which workflows we were willing to choose, and by the interplay between the different elements of this style.

So, without further ado, let’s talk about the details.

Lots of expect tests

Expect tests (also known as unified tests) let you interleave your test code with the printed output of those tests. The test framework is responsible for capturing the output and integrating it into your source file. If such integration would lead to the file changing, then the test has failed.

Here’s a small example using our expect test framework to test the function List.group from Core, which splits a list into sublists:

let%expect_test _ =
  let test l =
    let stringify sub = List.map sub ~f:Int.to_string |> String.concat ~sep:"-" in
    List.group l ~break:(fun x y -> y < x)
    |> List.iter ~f:(fun sub -> print_endline (stringify sub))
  in
  test [1;2;3;2;3;3;6;1;2;36;7];
  [%expect {|
    1-2-3
    2-3-3-6
    1-2-36
    7 |}]
We don’t actually need to fill in the expect declaration by hand. If it starts empty, then the test runner will generate a corrected file with the output shown above, which we can accept by copying it over the original source file. If the output changes again at some later point, it’s easy to look at the diff to see if the change should be accepted.

Expect tests make it easy to create simple regression tests. For example, when writing a parser for some data source, we might write a test that consumes and prints out the result of running the parser over some sample data. Similarly, with systems that contain complex state machines, we often write expect tests that sequence a set of transactions, periodically dumping out summaries of the internal state of the system.
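To make the parser case concrete, here’s a sketch of what such a regression test might look like. Everything in it is hypothetical: `parse_kv` stands in for a real parser, and the test assumes Core and our expect-test framework are available, as they are throughout our tree.

```ocaml
(* Hypothetical sketch: a tiny "key=value" parser, and a regression test
   that captures its behavior (including the error path) as printed output. *)
let parse_kv line =
  match String.lsplit2 line ~on:'=' with
  | Some (key, value) -> Ok (key, value)
  | None -> Error "missing '='"

let%expect_test "kv parser over sample data" =
  List.iter [ "host=localhost"; "port=8080"; "oops" ] ~f:(fun line ->
    print_s [%sexp (parse_kv line : (string * string, string) Result.t)]);
  [%expect {|
    (Ok (host localhost))
    (Ok (port 8080))
    (Error "missing '='") |}]
```

A change to the parser’s error handling would then show up as a diff to the `[%expect]` block, which is exactly what a reviewer wants to look at.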

This is useful as a way to verify the behavior of new code, but the real value is how it helps reviewers understand changes to existing code. Reading the diffs of these program traces lets you visualize how a given feature changes the semantics of your program.

In this style, you may not write an expect test for every change, but you do expect nearly every semantic change to be reflected in one way or another, either via a new test, or via a diff to an old one.

Small changes

The other aspect of this approach is that features are kept small, mostly by breaking up large changes into chains of smaller features. Sometimes the initial author will do this from the get-go, and sometimes a reviewer helps to break down what starts as a monolithic feature.

The goal is to express a large change as a sequence of smaller ones that are themselves coherent enough to be read, tested, and often even released on their own. This dovetails with expect tests, since the expect tests make the effect of each feature easier to comprehend, and the fact that the semantic changes are small makes the diffs of the expect tests easier to read.

Sometimes, the program trace you want for demonstrating the effect of a change isn’t there yet, in which case you can mint a parent feature that adds the program trace, so you can read the diff in the substantive feature.

This is particularly valuable when squashing bugs. Adding an expect test in one feature that demonstrates the buggy behavior, and then fixing it in the followup feature, is a good way of convincing yourself that the bug really works the way you think it does, and demonstrating that the bugfix resolves the issue.
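A minimal sketch of that two-feature pattern, using a made-up `midpoint` function with a classic overflow bug. The parent feature adds a test that records the wrong answer; the follow-up feature fixes the function, producing an expect diff that documents the fix.

```ocaml
(* Parent feature: capture the buggy behavior.  [midpoint] is a
   hypothetical example; (lo + hi) can overflow for large ints
   (OCaml ints are 63-bit on a 64-bit machine). *)
let midpoint lo hi = (lo + hi) / 2

let%expect_test "midpoint near max_int" =
  printf "%d\n" (midpoint (Int.max_value - 1) (Int.max_value - 1));
  [%expect {| -2 |}]

(* Follow-up feature: change [midpoint] to [lo + (hi - lo) / 2].  The
   expect block then diffs from -2 to the correct answer, and accepting
   that diff is what demonstrates the fix to reviewers. *)
```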

Fast turnaround

Another aspect of the Iron style is fast review. That’s easier said than done, of course, but keeping features small and heavily tested helps. That’s because it takes less mental effort to convince yourself that a feature is right if it’s small enough to fit in your head, and there are good tests that demonstrate the behavior.

At Jane Street, a given change may be reviewed by a decent number of people, particularly if it touches many parts of the codebase. But there is a special reviewer, called the seconder, whose job is to review the feature in its entirety, and who is often a full collaborator with the author on the change in question.

Part of the seconder’s job is to encourage authors to make review easier, both by breaking features down into smaller pieces, and by adding more tests. That increases the amount of work that an author needs to do to complete a given code change; but because review needs to be done by many people, that effort is generally well spent.

Beyond that, it gives authors more autonomy to get their features done, since they can effectively cut the amount of work they need from other people. And the tests they add in the process have lasting value, helping prevent future bugs.

There’s also a nice interplay between small features and fast turnaround when it comes to merge conflicts. Small features are less likely to conflict, and when they do, the resolution of the conflict tends to be easier to understand. At the same time, small features are easier to get out quickly, which reduces the likelihood of conflicts yet more.

Why now?

You might wonder why this approach emerged now. After all, Jane Street has been around for more than 15 years, and building systems in OCaml for 12 of those years.

I think the answer has to do with tools. Tools change the constants of your development universe, warping the fabric of your day-to-day work by enough that different equilibria become possible. The Iron style depends on a collection of different improvements to our tools, and if you take just a few of them away, the style starts to fall apart.

The major systems that are relevant here are:

  • Iron, our code review and release management system
  • Our inline test frameworks, expect tests among them
  • Jenga, our build system

Let’s talk about each of these individually.


Iron

Iron is responsible for managing the different changes (called features) that are being worked on, organizing both code review and the management and merging of these features. The push towards long chains of small features puts a lot of pressure on Iron’s performance. If a change that would have been one feature becomes seven, you need to do seven times as many feature operations. As a result, the performance of those operations becomes absolutely critical to the user experience.

To that end, we’ve done a lot of work to make the system more responsive and lightweight. For example, Iron keeps a cache of pre-populated source checkouts to make creating a new feature almost instantaneous. Iron also uses the fact that it knows what review a given user has to do, and uses that information to prefetch the relevant features.

We’ve also done work to simplify working with chains of features. When you have seven features instead of one, you really want the ability to release the entire chain as a single action. We’ve built automation in Iron to do just that.

The need to make review fast also means that you need Iron to act as an effective communication mechanism. Iron comes with a dashboard that tells all the users involved what features they need to work on and what they’re expected to do next.

Inline tests

Many developers are reluctant to write tests, and I think a large part of the reason isn’t that writing the tests themselves is painful; it’s all the work around setting up the tests that people find disheartening.

For that reason, we’ve long thought it important to make adding a new test require as little bureaucracy as possible. In our codebase, just adding a let%expect_test declaration to the source ensures that the test is registered and will be included in our continuous-integration testing, meaning that no one can release a feature that breaks that test.
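For instance, in a hypothetical library file, something like the following is the entirety of what’s needed; there’s no registry to edit and no test binary to declare (the names and values here are illustrative):

```ocaml
(* Dropped anywhere in a library's source: both tests below are
   discovered and run automatically by the build and by CI. *)
let%test "rev is an involution" =
  List.equal Int.equal (List.rev (List.rev [ 1; 2; 3 ])) [ 1; 2; 3 ]

let%expect_test "rev" =
  print_s [%sexp (List.rev [ 1; 2; 3 ] : int list)];
  [%expect {| (3 2 1) |}]
```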

For expect tests in particular, we’ve built pretty good editor-side support, with key bindings for rerunning a test, bringing up the colorized diff showing how a given test failed, and accepting the new version of a test.

These are small efficiencies, but they add up. The end result is that we do a lot more testing with these tools in place than we did without them.


Jenga

Jenga has contributed to the Iron style in a number of ways. Probably the single most important thing is compilation speed. Small features are a lot more palatable if switching between features is efficient, and a big part of that efficiency comes from Jenga.

Jenga is also critical to the automation of tests I mentioned above. Jenga is a key contributor to the low overhead of adding a new test. And the efficient parallelization that Jenga provides helps make testing faster too, which makes it possible to add more tests.

Summing up

One of the things that’s become clear to me over the years is that tools are critical to keeping people efficient as an organization grows. By default, everything gets harder as you get bigger: you try to solve tougher problems; as your software grows, there are more complex interactions between the different parts of your infrastructure; and the organization itself becomes more complex.

One key tool for maintaining the ability of each individual developer to get things done is to invest in sharpening their tools. And if you do it right, that tool sharpening doesn’t just make the things they’re doing now easier. It can open up new and unexpected ways of working.