How to shuffle a big dataset

At Jane Street, we often work with data that has a very low signal-to-noise ratio, but fortunately we also have a lot of data. Where practitioners in many fields might be accustomed to having tens or hundreds of thousands of correctly labeled examples, some of our problems are more like having a billion training examples whose labels have only a slight tendency to be correct. These large datasets present a number of interesting engineering challenges. The one we address here: How do you shuffle a really large dataset? (If you’re not familiar with why one might need this, jump to the section Why shuffle below.)

For a dataset x₀ , . . . , x_{n - 1} that fits in RAM, you can shuffle using something like Fisher–Yates:

for i = 0, ..., n - 2 do
  swap x[i] and x[j], where j is a random draw from {i, ..., n - 1}

But what if your dataset doesn’t fit in RAM?

I will present the algorithm I use for shuffling large datasets. It isn’t novel, and one can find multiple instances of people reinventing it or something similar (and in essence it descends from Rao). However, I don’t know of anywhere that states the algorithm, shows why it’s correct, and gets into the particular practical issues we address below. Also, when I first encountered this problem and searched online for an answer, I didn’t find any of the good examples above, just lots of bad ideas, so hopefully this post will improve the odds for the next person.

To be clear, this is not some minor performance hack. For large datasets, it makes the difference between feasible and infeasible. (See appendix for a more quantitative comparison.)

A 2-pass shuffle algorithm

Suppose we have data x₀ , . . . , x_{n - 1}. Choose an M sufficiently large that a set of n/M points can be shuffled in RAM using something like Fisher–Yates, but small enough that you can have M open files for writing (with decent buffering). Create M “piles” p₀ , . . . , p_{M - 1} that we can write data to. The mental model of a “pile” here is that it’s a file you can append to, but in practice you might, say, have several piles exist as datasets in the same HDF5 file. The first pass of the algorithm is to split the data into these M piles, and the second pass shuffles each pile and appends it to the final result.

-- First pass
create empty piles p[0], ..., p[M - 1]
for i = 0, ..., n - 1 do
  j := uniform random draw from {0, ..., M - 1}
  append x[i] to pile p[j]

-- Second pass (perhaps done lazily)
for j = 0, ..., M - 1 do
  shuffle p[j] in RAM with Fisher-Yates or whatever is convenient
  append p[j] to output file

Example of a shuffle: We start with unshuffled data (top); the first pass leaves M=6 piles (middle); the second pass yields shuffled data (bottom).

Assuming you have enough memory to satisfy the above constraint on M and assuming that drawing a random number is O(1), this is a linear time algorithm; the constant factor is dominated by having to read and write each data point twice in external storage (but the reading/writing can be done in blocks rather than one point at a time). Since the reading and writing is stream-oriented, the algorithm still works for data with variable record length.

To see that the 2-pass shuffle yields an unbiased random permutation, consider another algorithm already known to be correct: draw U₀ , . . . , U_{n - 1} ~ Uniform(0,1), associate x_i with U_i, and sort by U_i; this yields an unbiased permutation. Our algorithm above can be seen to be equivalent to this: for M=1000, the choice of pile is like radix sorting on the first 3 digits of U_i, and then shuffling within each pile is like sorting on the remaining digits.

Dealing with oversized piles

Even if the expected pile size would be small enough to shuffle in RAM, there is some chance of getting an oversized pile that is too large to shuffle in RAM. You can make the probability of getting an oversized pile very small: if expected pile size is s, the stdev is slightly under √s, so you can just arrange for, say, s + 6√s to be a size that you can still shuffle in RAM. Even with M=1000, the chance that some pile will be larger than expected by 6 stdevs is about 10⁻⁶. (This 6√s business is just a formality. In practice, you just leave yourself what feels like a sufficient amount of headroom, and if you get an oversized pile, it’s overwhelmingly likely that you overestimated how many points you could fit in memory rather than getting unlucky, and you try again with smaller pile size.)

In the rare case that you end up with an oversized pile, you could recursively apply the algorithm to the oversized pile, but it’s also okay just to start over. Because the probability of having to restart is small, the expected runtime is only slightly increased. You might worry that starting over would introduce some bias into the shuffle, but—surprisingly, perhaps—it doesn’t, because the tuple of pile sizes that results from the first pass is independent of the permutation that is generated. (Consider the above way of thinking of the algorithm as associating each point with some U_i and then sorting; if I tell you how many of the U_i happened to fall in certain intervals, I still haven’t given you any information about the relative ordering among the U_i.)

A similar consideration applies if the way you are storing your data makes it necessary or advantageous to preallocate the storage for each pile: you preallocate s + 6√s for each pile, on average waste 6√s per pile, and very rarely have to restart if you exceed the storage you had preallocated.

Parallelizing, and other practical considerations

As a practical matter, with very large data sets, the input is often broken across several files rather than being in a single file, and it would be desirable for the result of the shuffle to be broken across several files as well. The above algorithm adapts naturally to this context.

Suppose the input is spread across files X₀ , . . . , X_{K - 1}. We do the first pass for each of these files in parallel, leaving many sets of piles p^k₀ , . . . , p^k_{M -
1} for k = 0 , . . . , K - 1.
For j = 0 , . . . , M - 1, combine p⁰_j , . . . , p^{K - 1}_j into p_j.
Proceed with second pass as above.

Commonly, the data you are trying to shuffle was the output of some preprocessing step. The first pass can be integrated into the preprocessing, so that the extra cost incurred by the first pass is near zero: during preprocessing, where you would have written preprocessed data to one file, you instead write it to many piles.

Also, in practice, it can be handy to have the resulting chunks be small enough that they can be shuffled in RAM while also training your model. Then, the second pass is done lazily: You only shuffle the piles as they are loaded for training. This is often a net win, depending on how many times you are going to consume the data without re-shuffling. (Fancier still, if the piles are small enough that you can fit 2 in memory at the same time, you can have a better input pipeline: while you are training on one pile, you start loading and shuffling the next one.)

Leaving piles unshuffled also allows for another trick pointed out by my colleague David Wu: Suppose new data is arriving at a roughly constant rate, and you want to maintain a moving window of length Y years. Think of each pile as a circular buffer, with its contents in chronological order. As new data comes in, when you write to a pile, you remove outdated data and append the new data. In this way you can incrementally maintain a shuffled copy of the last Y years of data. (Okay, it’s only a half-shuffled copy, but the remaining work is easy to do when you load each pile.)

Leaving the data in many piles, rather than combining into a single monolithic output, also allows you to get imperfect (but for many purposes good enough) reshuffles by permuting the order in which you load piles (and shuffling within each pile when you load it).

Why shuffle

When training neural nets by stochastic gradient descent (or a variant thereof), it is common practice to shuffle the data. Without getting bogged down in a detailed discussion, let’s try to get a sense for why this shuffling is useful by considering an extreme example. Suppose you are training a classifier to tell cats from dogs, and your training set is 50,000 cats followed by 50,000 dogs. If you don’t shuffle, you will get poor training performance. Strictly speaking the problem arises from having serial correlation in the noise of your gradients, combined with non-commutativity of parameter updates (if training on x and then y were equivalent to training on y and then x, then shuffling would have no effect); intuitively, your net will spend 50,000 examples learning “everything’s a cat” followed by 50,000 examples learning “no, everything’s a dog,” and most of the finer structure you might have learned along the way will get drowned out.

If you only locally shuffle (e.g., maintain a reservoir of 10,000 examples that you draw from randomly, which is replenished by streaming through your dataset) then that could be sufficient if serial correlations in your data persist for much fewer than 10,000 examples, but it would be insufficient in our 50,000 cat–50,000 dog example.

That’s not to say that shuffling is itself optimal. E.g., you might get better training performance by making sure each consecutive pair of training examples has one cat and one dog (though we’ve found there are other problems that crop up with this idea). Or, there are approaches like curriculum learning (Bengio et al.).

Appendix: Performance comparison

The 2-pass shuffle seemed so obviously better than random access into a file that I hadn’t bothered to measure how much faster it actually is. One approach works, the other doesn’t, what’s there to measure? But the post was met with a lot of skepticism about whether it is faster at all, apparently on the basis that the 2-pass algorithm has an extra read/write and SSDs are fast. So I measured the difference and found that, for my data and how it is stored, the 2-pass approach is 1000 times as fast as random access (and that’s before incorporating further improvements to the 2-pass approach that are done in practice, which are to parallelize the first pass and integrate it with the data preprocessing). If this sounds too good to be true, bear in mind that this is not a comparison to some highly-regarded practice; it is a comparison to a bad idea, like quicksort against bubblesort.

Even with uncompressed data on local SSDs, sequential traversals are 48 times as fast as random access traversals for my data.

Obviously the performance gap will depend on how large your training examples are, your storage setup, what file format you’re using, whether the data is compressed, and so on. In particular, if individual examples are very large (500kB each?) then random access could be competitive.

The dataset I tested this on is 220 million examples, 9kB each. It would be 2TB uncompressed. It is 320GB compressed (4 HDF5 files, 80GB each, using HDF5’s internal compression). If I try to traverse the data by grabbing one random example at a time, it takes 394,000μs per example (random access into compressed 80GB files is SLOW). At that rate, it would take 2.75 years to traverse the data once. (That’s not doing anything obviously wrong like reopening the file for each read—the four files are only opened once. The only obviously wrong thing it’s doing is trying to traverse the data via random access.)

By comparison, reading the data sequentially in big blocks, it takes 120μs/example, and a single traversal of the dataset takes 7.3 hours. Taking into account the fact that with the 2-pass algorithm you have to read each data point twice and do an intermediate write, it takes about a day, starting from unshuffled data, to do a random traversal. This is a 1000x speedup over random access, without incorporating anything like parallelizing the first pass, or piggybacking the first pass on top of whatever preprocessing you’re already doing. If I put some effort into optimizing the silly approach, I can get the factor to be smaller. E.g., if I go to the trouble of putting the data on local storage (a RAID array of SSDs in this case), still compressed, and only reading from one file, it’s “only” a 460x speedup. Using uncompressed data (I tested with a memory-mapped .npy file) on locally attached SSD storage yields a hefty speedup for both approaches, with random reading taking 720μs/example and sequential reading taking 15μs/example. This narrows the gap, but not enough to make random access competitive.

So, the relative speed of sequential access more than compensates for the cost of the first pass (which itself is negligible if you are going to preprocess the data anyway, as pointed out earlier). You might wonder: even in RAM, sequential access is faster than random access; does this mean that we can make in-memory shuffles faster using an algorithm like this rather than Fisher–Yates (where RAM is the new disk, and cache is the new RAM)? According to the Sanders paper mentioned in the introduction, the answer is yes, and he claims a 4x speedup on contemporary hardware. (Of course, in the context of our problem here, where the in-memory operations are cheap relative to getting stuff off the disk, that 4x speed up for the in-memory shuffle would make little difference for us.)