A lot of “capture-the-flag” style ML puzzles give you a black box neural net, and your job is to figure out what it does. When we were thinking of creating our own ML puzzle early last year, we wanted to do something a little different. We thought it’d be neat to give users a complete specification of the neural net, weights and all. They would then be forced to use the tools of mechanistic interpretability to reverse engineer the network—which is a situation we sometimes find ourselves facing in our own research, when trying to interpret features of complex models.

We published the puzzle last February. At the time, we weren’t even sure it was solvable. The neural network we’d designed would output 0 for almost all inputs. A reasonable solver might assume that the goal was to furnish an input that produced 1 or some other nonzero value. But we’d engineered the network in such a way, as you’ll soon see, that you couldn’t use traditional methods to brute force your way to an answer—say, by backpropagating a nonzero output all the way back to the input layer. You had to actually think about what the net was doing.

We were amazed by the response the puzzle got. Mostly by luck, it seemed like we’d calibrated the difficulty just so: it wasn’t so hard that no one could solve it, and wasn’t so easy that we were flooded with responses. In fact if you can solve this puzzle, there’s a decent chance you’d fit in well here at Jane Street.

We’ll restate the problem below, but be warned that the rest of this post contains huge spoilers. If you want to try solving the puzzle yourself, avert your eyes. The rest of this post will walk through the process that an actual solver took, with all the twists and turns before they finally cracked it.

The problem

Today I went on a hike and found a pile of tensors hidden underneath a neolithic burial mound! I sent it over to the local neural plumber, and they managed to cobble together this.

model.pt

Anyway, I’m not sure what it does yet, but it must have been important to this past civilization. Maybe start by looking at the last two layers.

Model Input

vegetable dog

Model Output

0

If you do figure it out, please let us know.

That model.pt file is basically just a pickled PyTorch model.

A solution

Getting started

A senior at university named Alex was in his dorm room when a roommate told him about a puzzle that was making the rounds on Twitter. The roommate had tried it himself but given up after two nights. Alex, in his final winter at school, was looking for something to do and decided to have a look.

He started by downloading the model and poking around, focusing on the last layer in particular:

import torch
import plotly.express as px
model = torch.load('./model.pt')
linears = [x for x in model if isinstance(x, torch.nn.Linear)]
px.imshow(linears[-1].weight.detach())

Immediately it was plain that this was not an ordinary neural network. It clearly hadn’t been trained: all the weights had integer values. Instead, it had been designed by hand, probably to carry out some very specific computation.

The last layer was a 48x1 matrix, but apparently broken into three sections. And indeed if you looked at the activations from the previous layer, they were always three repetitions of the same thing. The second-to-last layer appeared to be three repetitions of the same weights, while its bias contained the same 16 bytes, but incremented by 1 each time, as if encoding a vector v, then v + 1, and v + 2. Here’s what the weights on that second-to-last layer looked like:

px.imshow(linears[-2].weight.detach())

and the biases:

px.imshow(linears[-2].bias.detach().unsqueeze(0))

Thinking about it some—and about the fact that the last layer emitted a single bit—Alex realized that this second-to-last ReLU layer must be computing whether two 16-byte values were equal, one byte per neuron. The way it seemed to work: the network made three copies of the input vector v, a 16-byte number, and checked each copy against a reference number x (determined by the bias of the second-to-last layer), so the three copies actually represented v - x - 1, v - x, and v - x + 1. The last layer applied weights 1, -2, and 1 to these respectively. Doing some casework on an individual byte: consider the value of ReLU(v-x-1) - 2ReLU(v-x) + ReLU(v-x+1). If v = x, this equals ReLU(-1) - 2ReLU(0) + ReLU(1) = 1. Every other case comes out to 0; for instance if v = x + 1, it's ReLU(0) - 2ReLU(1) + ReLU(2) = -2 + 2 = 0. The bias on the last layer was -15, so the 16 per-byte indicators had to sum to 16—that is, the final neuron would only fire when v = x in all 16 bytes.
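The equality gadget is easy to verify in a few lines of Python (a standalone sketch, not the network's actual code):

```python
def relu(z):
    return max(0, z)

def byte_match(v, x):
    """The three-copy trick: returns 1 if v == x, else 0, for integers."""
    return relu(v - x - 1) - 2 * relu(v - x) + relu(v - x + 1)

# The gadget fires exactly on equality, across all byte values:
assert all(byte_match(v, 7) == (1 if v == 7 else 0) for v in range(256))

def digests_equal(vs, xs):
    """16 byte-match neurons summed, with the last layer's bias of -15:
    the result is 1 only when every byte matches, and 0 otherwise."""
    return max(0, sum(byte_match(v, x) for v, x in zip(vs, xs)) - 15)
```

A single mismatched byte drops the sum to at most 15, so the -15 bias zeroes the output.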

So now the question became: how do we get the activations of the second-to-last layer to equal x?

Reverse-engineering the program at the heart of the network

Alex figured that if there's some number the network is checking against at the very end, then the rest of the network must be computing some sort of big equation. There did indeed appear to be a lot of structure in the network, as you can see just from plotting the sizes of the roughly 2,500 linear layers (about half the full network):

px.line([l.out_features for l in linears])

So Alex began looking at various sub-networks, tracing their dependencies. This involved staring at a lot of graph structures.

But after hours of searching for legible sub-circuits, he came up short. For the moment there just seemed to be too much complexity to trace by hand. So he had a new idea: what if I treat this thing as a linear program and just solve it?

This is, of course, not possible with so many ReLU layers—ReLUs aren't linear—but each ReLU can be modeled by adding a binary indicator variable corresponding to the statement "this activation is negative." You can thereby treat the whole network as an integer linear program and hand it to a constraint solver capable of integer programming. So that's what Alex did: he dutifully wrote some code to convert the layers of the neural network into a giant linear program and let it run.
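The standard encoding (a sketch here, with a hypothetical big-M bound of 256) introduces one binary indicator per ReLU and four linear constraints; for bounded integer activations, the only feasible output is exactly the ReLU value:

```python
def relu_feasible_set(x, M=256):
    """Enumerate the (y, z) pairs satisfying the big-M encoding of y = ReLU(x):
        y >= 0,  y >= x,  y <= M*z,  y <= x + M*(1 - z),  z in {0, 1}.
    For |x| < M, the feasible y values collapse to {max(0, x)}."""
    ys = set()
    for z in (0, 1):
        for y in range(0, M + 1):
            if y >= x and y <= M * z and y <= x + M * (1 - z):
                ys.add(y)
    return ys

# Every input maps to exactly one feasible output: the ReLU value.
assert all(relu_feasible_set(x) == {max(0, x)} for x in range(-255, 256))
```

A real solver handles these constraints symbolically, of course; the brute-force enumeration here just checks that the encoding is tight.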

And let it run.

That seemed to be going nowhere—so Alex now attempted to reduce the number of variables in the program. Perhaps there were some reductions you could do? Alex found that if you looked at a bunch of layers, they mostly looked like identity matrices. In fact in 1500 or so layers, 80% of the nodes were just performing an identity operation.

Alex treated each neuron in the network as a node in a DAG, where each node feeds into nodes in the next layer with some weights; if a node has in-degree 1 and its single incoming weight is exactly 1, you can merge it with its parent. (You know this is safe to do because the network has integer values everywhere: all the inputs are integers, as are all the weights.)

There were slightly fancier reductions. For instance, if you have a node whose every incoming edge has positive weight, then the fact that you're doing ReLU doesn't matter, because it's never going to hit the negative clamp—and so you can forward its in-edges to its children, directly passing them to the next layer. Also, if two neurons in a layer have exactly the same input vector, you can combine them and redirect their descendants to the new merged neuron. And you can repeat this process many times.
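The positive-weight reduction is easy to check on a toy neuron (a pure-Python sketch; note it also assumes the bias is nonnegative, which the prose glosses over):

```python
import itertools

def neuron(weights, bias, inputs, use_relu=True):
    """One integer-valued neuron, with or without its ReLU."""
    s = sum(w * a for w, a in zip(weights, inputs)) + bias
    return max(0, s) if use_relu else s

# If every incoming weight is positive, the inputs are themselves
# post-ReLU (hence nonnegative), and the bias is nonnegative, the
# pre-activation can never go negative—so the ReLU is a no-op:
for inputs in itertools.product(range(4), repeat=3):
    assert neuron([1, 2, 3], 0, inputs, use_relu=True) == \
           neuron([1, 2, 3], 0, inputs, use_relu=False)
```

Once the ReLU is known to be a no-op, the node is purely linear and can be folded into its children.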

Alex by now had poured hours into this analysis. He'd found circuits that appeared to be repeated across many layers. He'd print out different equivalence classes of nodes, looking at the sequence of weights each node had as input, and discovered that there were only a few kinds of nodes. For instance there was one class of nodes that would effectively forward a value from two layers back. Collapsing these, among other similar reductions, brought the size of the linear program down from something like 2 million nodes to 75,000.

But after all that, Alex ran the solver again and again it churned without terminating.

The final reductions

A new idea: what if you propagated bounds through the network? Just by reasoning through one layer at a time, you could figure out the maximum value that any given node could achieve; you'd do this simply by looking at the bounds on its inputs. It turns out that with fairly conservative assumptions, many nodes end up with very tight bounds, e.g. from 0 to 1. Maybe this was enough to make the program tractable?

At this point Alex switched from a linear program to a SAT solver, since the total number of values had gotten so much smaller. In the SAT version, there was a boolean variable for each node taking each value in its range. All told this resulted in 200,000 variables after all the reductions. After a day of running, the SAT solver reduced the program to 20,000 variables. From there it didn't seem to reduce further.

In effect Alex had discovered that inside this neural network there was a core program, irreducibly complex, that—much to his disappointment—was still too large to brute force. So after many days, he had to take a step back, effectively having gotten nowhere.

Glancing off the solution

He thought meta: this has to be a solvable puzzle, right? How would someone build a puzzle like this that would be interesting to solve? If you generated random weights, a SAT solver would probably be able to solve it by brute force. This network was created by a human. At its core there seemed to be a function that you couldn’t just use search or optimization to recover. It was an irreversible function. What were some go-to examples of irreversible functions?

Alex asked ChatGPT for some common hash functions and compared them against some basic plots of the layer widths, which looked periodic. In fact the widths formed 32 exact repetitions of a pattern of length 48. Maybe the network was doing 32 blocks of the same computation? To ChatGPT again: are there any common hash functions that use 32 blocks of computation? Bingo. It turned out that roughly all of them do.

To determine which one was in play here, he explored by hand: he’d input some string into the network, compute various hash flavors with separate programs, then look at the second-to-last layer. It turned out that md5 lined up and the other common hash functions didn’t.
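That comparison is a one-liner per hash with the standard library (here using "vegetable dog", the example input from the puzzle, as a stand-in): the 16 digest bytes of md5 are what you'd line up against the second-to-last layer's activations.

```python
import hashlib

msg = b"vegetable dog"
for name in ("md5", "sha1", "sha256"):
    digest = hashlib.new(name, msg).digest()
    print(name, len(digest), list(digest)[:4])  # digest length and first bytes

# md5 is the flavor whose digest is 16 bytes—one per final-layer check.
assert len(hashlib.md5(msg).digest()) == 16
```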

This was nice, because he already knew what the hash was supposed to be by looking at the second-to-last layer’s biases. So the problem reduced to finding an input string that produced that particular md5 hash. But it was not obvious how to solve that—especially since he didn’t have a real proof that this network always produces an md5 hash. Maybe the solution was to dig deeper, and hack the network to make it reversible?

A glitch in the matrix

Alex noticed something odd in the network. It seemed to have a bug: if your input was longer than 32 characters, the network no longer produced the correct md5 hash. Perhaps somewhere in that bug was a key to reversing the hash value that was built into the network?

He spent the next two days reverse-engineering the bug. To start, he got Gemini to write an implementation of the md5 hashing function. Then he matched up every neuron in the network to the corresponding variable in the md5 algorithm. He wrote some code that would store the sequence of values for a given intermediate variable, then search each of the 32 blocks in the network for that value; this would pick out which ranges of neurons corresponded to the bits for each variable. It turned out that some ranges of bits exactly corresponded to the variables, and others were intermediate computation values.

Then, with inputs that were >32, he could painstakingly trace through the blocks to find the exact spot where the network diverged from the correct algorithm.

The crux of it was in the first 7 layers—there was a circuit that would compute the length of the input and attempt to store it in 4 bytes, in little-endian order. But when the length was 256 bits or greater, the first length byte held the full value instead of the correct encoding. That is, if the length were 384 bits, the length bytes should be 128 1 0 0, but what the network encoded instead was 384 0 0 0.

Then the question was: is it possible to exploit this bug by crafting a message of length 256 bits or greater? Some more painstaking tracing revealed a few observations. First, there aren't that many possible lengths—the network only took 55 input bytes—so he could do an exhaustive search to see how the network behaved with respect to these weird values. Second, the broken length value was converted to binary and then propagated through every layer in the entire network. In that conversion, all of the higher-order bits came out as 1, with the remainder concentrated in the lowest-order bit, so 384 would be encoded as 130, 1, 1, 1, 1, 1, 1, 1 (least-significant bit first): 130*1 + 1*2 + 1*4 + ... + 1*128 = 384. Third, the invalid bytes from the length of the message were only used in a few blocks of the md5 computation, which always reads bytes from the input in the same order.
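The arithmetic of the bug is easy to reproduce (a sketch of the encodings, not the network's actual circuit):

```python
def correct_length_bytes(bits):
    """md5 stores the message length in bits, little-endian
    (only the first 4 of the 8 length bytes shown here)."""
    return [(bits >> (8 * i)) & 0xFF for i in range(4)]

def buggy_length_bytes(bits):
    """What the network did instead: the whole value lands in one byte."""
    return [bits, 0, 0, 0]

assert correct_length_bytes(384) == [128, 1, 0, 0]
assert buggy_length_bytes(384) == [384, 0, 0, 0]

# Downstream, the broken byte's bit expansion saturates: the high bits
# all read as 1 and the remainder piles into the lowest-order bit, so
# 384 comes out as 130, 1, 1, 1, 1, 1, 1, 1 (least-significant first):
bits = [130, 1, 1, 1, 1, 1, 1, 1]
assert sum(b << i for i, b in enumerate(bits)) == 384
```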

Using these observations, it’s possible to write down a modified version of the md5 algorithm, which corrects itself at the necessary blocks to be in line with the neural network. Looking closely at this, however, it still seems very difficult to reverse in general.

This took about two days to figure out, but—disappointment again—didn't lead Alex any closer to the solution. He wrote to the email address provided in the puzzle with what he'd discovered so far. What he heard back surprised him: the bug was not intentional. With that in mind, why not try to solve it one last time?

The return of brute force

It turned out that once you knew the hash encoded in the bias of the second-to-last layer, you were done. Figuring that out was the meat of the puzzle. The puzzle creator had intentionally made the hash easy to brute force, leaving various small hints in the puzzle description and Python code that the solution was composed of two English words, lowercased, joined by a space.

Alex had actually tried to brute force the hash earlier, but the list of the top 10,000 most popular words he'd downloaded turned out not to be big enough. Once he had a big enough word list, he got the answer.
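The search itself is a couple of lines with hashlib (a sketch with a toy word list; `target_hex` stands in for the digest read off the biases, not the real answer):

```python
import hashlib
from itertools import product

def crack(target_hex, words):
    """Search 'word1 word2' pairs for one whose md5 matches the target."""
    for w1, w2 in product(words, repeat=2):
        candidate = f"{w1} {w2}".encode()
        if hashlib.md5(candidate).hexdigest() == target_hex:
            return candidate.decode()
    return None

# Toy demonstration: recover a pair we hashed ourselves.
words = ["vegetable", "dog", "neolithic", "tensor"]
target_hex = hashlib.md5(b"neolithic tensor").hexdigest()
assert crack(target_hex, words) == "neolithic tensor"
```

With a large English word list this is on the order of a hundred million candidates—trivial for md5.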

Another puzzle

One of the things that made this puzzle challenging was designing a network of the right complexity. Using logic gates means the network won't be differentiable; but if you make the program encoded by those gates too complex, there'd be little hope of reverse engineering it. MD5 felt like a good compromise, though it was by no means trivial. Because md5 uses modular addition, creating the puzzle required implementing a parallel carry adder in 20ish layers of a neural network. Not easy! We were impressed that some solvers managed to figure that out—and Alex's discovery of the >32 bug was unexpected and quite extraordinary.

The experience of creating and releasing the puzzle, and engaging with the folks who solved it, went well enough that we’ve done it again. Here you’ll find the latest. In this new puzzle, a neural network whose layers have been jumbled up needs to be put back in the right order… Can you help?

If this kind of thing is interesting to you, consider applying. You’ll join a close-knit group of brilliant, supportive colleagues, harnessing tens of thousands of GPUs, petabytes of training data, and the agility and resources to invest in the best ideas.