Centralizing distributed version control, revisited

7 years ago, I wrote a blog post about how we at Jane Street were using our distributed version control system (hg, though the story would be the same for git) in a partially centralized way. Essentially, we built a centralized repo and a continuous integration system whose job was to merge in new changesets. The key responsibility of this system was to make sure that a change was rejected unless it merged, compiled and tested cleanly.

This half-distributed, half-centralized approach let us enjoy the benefits of a DVCS, while still getting the coherence and easy of sharing that comes from having a central authoritative source.

Since then, our development tools have changed a lot, including the arrival of a new code review and release management system called Iron. In writing Iron we discovered that centralization was valuable in ways we hadn’t considered before. In particular, despite the fact that good support for merging is central to a DVCS, centralization is actually a critical ingredient to making merges work better.

To understand how centralization can help, let’s talk about one reason why merging is a fraught process to begin with.

The criss-cross merge

The basic approach to merging in a DVCS like hg or git is pretty simple. Here are the basic steps that are taken to merge two heads A and B.

Find the greatest common ancestor (GCA(A,B)) of the heads to be merged.
Compute the patch from that base point to one of the two heads, say, A.
Take the patch you just computed, and apply it to B. Conflicts appear when the patch, which was actually based on GCA(A,B), doesn’t apply cleanly to B. The result of this process is the merge.

The above discussion oversimplifies the story by assuming there’s a well defined GCA, but this just isn’t always true. To see why, consider a repository staring with a root revision R, and two revisions made independently on top of R.

  A
 /
R
 \
  B

Now, imagine that two different developers each concurrently decide to merge the heads A and B and do some further development. Note that in both of the cases shown below, the GCA for the merge between A and B is R.

Developer 1                Developer 2

  A---C--D                    A     
 /   /                       / \        
R   /                       R   \       
 \ /                         \   \  
  B                           B---E--F

Now, if we bring these two separate histories together into one repo, we have something like this.

  A---C--D
 / \ /
R   \
 \ / \
  B---E--F

Now, what happens if we want to merge D and F? In particular, what is GCA(D,F)? Both A and B are common ancestors, but neither one is greater than the other. In this case, there are in some sense two different GCAs, or, more precisely, there are multiple maximal common ancestors, or MCAs. This case is often described as a criss-cross merge, and is the source of much wailing and gnashing of teeth among developers and users of DVCSs.

git and hg have different ways of dealing with the case of multiple MCAs. By default, hg just picks one of the MCAs arbitrarily and does the merge based on that. Given that different choices of the merge base will lead to different results, making that choice arbitrarily is pretty disturbing.

git, on the other hand, has a strategy called recursive merge that repeatedly merges together the MCAs, and then uses that merged MCA as the basis for computing the diffs to A and B on which the final merge will be based. And hg has a new strategy called bid merge that is willing to make different choices as to the GCA to use on a file by file basis.

None of these approaches amount to principled solutions, and while they work better in some cases and worse in others, they all sometimes lead to bad results. It’s tempting to look for a way out of this conundrum altogether, by avoiding the possibility of criss cross merges in the first place.

Avoiding the criss-cross merge

For those who haven’t read my previous posts about how Iron approaches merges, I’ll describe it briefly here. Iron organizes its branches into a hierarchy: every repository has a root feature, and that feature can have children, and those can have children as well. Thus, our main repository, called Jane, has a root feature called jane, and one can develop changes to jane in child features, such as jane/Core.Applicative or jane/quickcheck.

Critically, the merging of features is constrained. Note that in Iron, every feature is defined by its base and tip revision, where the diff between those two revisions is effectively the contents of the feature. Here are some of the key operations allowed on features.

fe release, moves changes from a child feature into a parent. This can only be done once the child feature is fully merged with its parent, and has the effect of setting the tip of parent to be the tip of the child, and typically deleting the child.

As an example, if the jane/quickcheck feature is based at the current tip of jane (and is fully reviewed, and all its tests pass), then calling fe release jane/quickcheck will move the tip of jane forward to be equal to the tip of jane/quickcheck, and will delete jane/quickcheck.

fe rebase lets you merge a feature with its parents, effectively pulling changes from a parent feature into a child. This has the effect of changing the base of the feature to be the tip of its parent, and the tip of the feature to be the result of the merge.

So, if other features have been released into jane since the jane/Core.Applicative feature was created, then the base of jane/Core.Applicative will no longer be the tip of jane. Calling fe rebase jane/Core.Applicative will merge the tip of jane/Core.Applicative with the tip of jane, and will set the base of jane/Core.Applicative to the tip of jane.

fe rename, which in addition to allowing you to simply change the name of a feature, also lets you introduce a parent-child relationship between features that didn’t previously have one. e.g., calling fe rename jane/Core.Applicative jane/quickcheck/Core.Applicative causes the Core.Applicative feature to become a child of, and so be able to depend on the changes in, thequickcheck feature.

All of these operations are implemented against a single, centralized server which keeps track of the state of all our features. This centralization lets Iron enforce some useful invariants along the way, critically, that the GCA of a feature and its parent is well defined, and is equal to the base of the feature. This simple property turns out to outlaw criss-cross merges, which avoids all of the mess we described earlier.

The happy outcome turns out to depend critically on the fact that we built a central server that could enforce the invariants in question, or, more precisely, that we built a consistent service

discovered by chance, the existence of the central server is key to enforcing the necessary invariant. In particular, the scenario of two different users concurrently releasing into the same feature or rebasing the same feature simply isn’t possible when there’s a centralized monitor determining who goes first.

In retrospect, this shouldn’t be too surprising. The criss-cross merge is really a result of concurrency, and the idea that introducing a lock (which is what a centralized server does for you) can be used to exclude unwanted concurrent executions in a distributed systems should surprise no one.

In the end, you can trace it all back to the CAP theorem: If you want progress while partitioned, you need to give up on consistency in some way. And criss cross merges are caused by a kind of inconsistency.

Centralization obviously has downsides, but I think Iron picks a nice point along the spectrum here: writing code is totally doable while disconnected, but operations like rebase and release that affect how information is shared between features requires you to be connected. I think it’s a small price to pay to never have to deal with a criss-cross merge.