Attention is a computational primitive at the core of modern language models, allowing internal representations to reference and influence each other. It’s how these models handle sequential data in the first place.

Yet, naively implemented, attention doesn’t have any notion of position. In the core attention computation, you calculate the dot product between a given “query” and a “key,” which tells you how much to attend to the value at that key—and this dot product says nothing about where in the sequence the queries and keys are. Since in most settings that information matters a great deal, you generally want to somehow perturb your dot-product calculation so that it depends on the positions (usually just the relative positions) of the query and key.

The so-called “positional encoding” that you use represents an important modeling decision, because it governs how the model views the passage of time. The most popular positional encoding, called RoPE, rotates components of key and query vectors by an angle that depends on the position in the sequence, like the hands of a clock. This works well in practice, but it’s far from the only option.
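To make the clock analogy concrete, here's a minimal sketch (in NumPy, with the usual geometric frequency schedule; not the exact code of any particular implementation) of how RoPE rotates pairs of query/key components by position-dependent angles:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive pairs of components of x by angles proportional to pos.

    x: 1D array of even length; pos: scalar position.
    Pair (x[2i], x[2i+1]) is rotated by pos * theta_i, with theta_i following
    the usual geometric frequency schedule.
    """
    d = x.shape[0]
    theta = base ** (-np.arange(d // 2) / (d // 2))  # one frequency per 2D pair
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = cos * x[0::2] - sin * x[1::2]
    out[1::2] = sin * x[0::2] + cos * x[1::2]
    return out

# The resulting attention score depends only on the relative position:
q, k = np.random.randn(8), np.random.randn(8)
s1 = rope_rotate(q, pos=5).dot(rope_rotate(k, pos=2))
s2 = rope_rotate(q, pos=105).dot(rope_rotate(k, pos=102))
assert np.allclose(s1, s2)  # same offset (3), same score
```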

On a recent Friday afternoon I wondered, well, what are all the options? I’m an ML researcher at Jane Street, and when we’re working with sequential models we’ve often debated whether we’re using the best positional encoding. This raises the question: what positional encodings are even possible? As I started investigating, I discovered to my surprise that the space is quite constrained.

The reason is that there are a few key properties that any desirable positional encoding should have, and once you formalize those, you’re left with a very particular mathematical structure. (Spoiler: a one-parameter group.) In exploring that structure, I was able to show that there are only a few families of valid positional encoding—the demonstration of that fact forms the bulk of this blog post—and actually all of the sensible ones are already being used in real systems.

It was a reassuring finding, because it means that we don’t need to rack our brains to come up with some perfect positional encoding, as we are probably already using it. Even so, in doing this analysis I did discover a strange, unlikely-to-work-out-in-practice, but technically legal class of positional encodings that seems to be totally unexplored.

Formalizing a positional encoding

Let’s suppose we have a time series of queries \(q(t)\), and a time series of keys \(k(t)\). We write \(q\) and \(k\) as functions of time so that we can accommodate continuous or irregularly sampled inputs, but we could just as well restrict ourselves to only integer times if we prefer. “Time” here is a stand-in for any increasing quantity that we might care to use to measure progression through a sequence; it could be a sequence index, or literal elapsed time in a time series modeling problem, or even some kind of learned notion of time, as used in the Mamba family of models. We’re only concerned with the time-dependent aspects of the positional encoding, so we’ll allow ourselves to apply certain time-independent operations, most notably changes of basis, to \(q\) and \(k\) without really considering the positional encoding to have changed. (When \(q\) and \(k\) are produced by linear layers in neural networks, the basis is arbitrary anyway.)

In the absence of positional encodings, attention would require us to compute inner products like \(q(t)^\top k(s)\). We want to modify these inner products to encode the progression of time. For the sake of computational efficiency, we won’t modify every pairwise inner product independently; instead we’ll transform the queries and keys themselves, using some explicitly time-dependent functions \(f\) and \(g\). We’ll set \(\tilde{q}(t) = f(q(t), t)\), \(\tilde{k}(s) = g(k(s), s)\), so that the attention score of the pair becomes \(\tilde{q}(t)^\top \tilde{k}(s)\). We’ve assumed here that \(f\) and \(g\) don’t change the dimension of our queries and keys, which we may do without loss of generality. (If we wanted a different dimension, we could have applied a time-independent projection to our keys and queries prior to the positional encoding.) Now let’s add some more restrictions.

Properties of good positional encodings

Linearity: \(f(q, t)\) is linear in \(q\), and \(g(k, s)\) is linear in \(k\). This one isn’t rigorously justifiable, but we’re working with a vector space, so it’s only right that our encoding should be linear. We’ll leave the study of nonlinear encodings to others.

Linearity ensures that we can write \(f(q, t) = F(t)\, q\) and \(g(k, s) = G(s)\, k\), with \(F(t)\) and \(G(s)\) being square matrices. Now our attention dot product is

\[\tilde{q}(t)^\top \tilde{k}(s) = q(t)^\top F(t)^\top G(s)\, k(s).\]

We can see at this point that any positional encoding scheme will be fully characterized by the square matrix \(F(t)^\top G(s)\), which determines how we modify the inner product between times \(t\) and \(s\).
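As a minimal sketch of this setup (the particular `F` and `G` below are arbitrary placeholders, not a recommended encoding), the score between a query at time \(t\) and a key at time \(s\) looks like:

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)

def F(t):
    # Placeholder time-dependent transform applied to queries; any square
    # matrix-valued function of t satisfies the linearity assumption.
    return np.eye(d) + t * np.diag(np.ones(d - 1), k=1)

def G(s):
    # Placeholder time-dependent transform applied to keys.
    return np.eye(d) - s * np.diag(np.ones(d - 1), k=1)

def score(q, k, t, s):
    # Positionally encoded attention score: q^T F(t)^T G(s) k.
    return (F(t) @ q) @ (G(s) @ k)

q, k = rng.standard_normal(d), rng.standard_normal(d)
print(score(q, k, t=3.0, s=1.0))
```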

Translation invariance: For any times \(t\) and \(s\), and any shift \(\tau\), we have \(F(t + \tau)^\top G(s + \tau) = F(t)^\top G(s)\). This property ensures that only relative positions are observable, which will help with generalization to longer sequences, among other things. (If the absolute index is observable in your positional encoding, what do you do when you train on sequences only up to length \(N\), and now need to deal with a longer one?)

If you consider computing \(F(t)^\top G(s)\) pairwise as you range over \(t\) and \(s\), you get a table like:

\[
\begin{array}{c|ccc}
 & s = 0 & s = 1 & s = 2 \\
\hline
t = 0 & F(0)^\top G(0) & F(0)^\top G(1) & F(0)^\top G(2) \\
t = 1 & F(1)^\top G(0) & F(1)^\top G(1) & F(1)^\top G(2) \\
t = 2 & F(2)^\top G(0) & F(2)^\top G(1) & F(2)^\top G(2)
\end{array}
\]

Translation invariance is basically just saying that all you need to care about are the diagonals in this table, since by assumption along the diagonals the values will be equal.

We make one more assumption, namely that \(F(0)^\top G(0) = I\). Effectively this is saying that for equal times, our positional encoding modification simply drops out. (We can do this without loss of generality, because if we’d started with \(F(0)^\top G(0) = M\) instead, we could have folded \(M\) into the keys by redefining \(k(s) \mapsto M^{-1} k(s)\), and as usual we’re happy to ignore such time-independent transformations.) Therefore \(F(t)^\top G(t) = I\) for all \(t\) by translation invariance. And since the matrices are square, we also now know that \(F(t)^\top\) and \(G(t)\) are inverses, so \(F(t)^\top G(s)\) is also equal to \(G(t)^{-1} G(s)\).

Let’s define a new matrix-valued function \(A\) by \(A(t - s) = F(t)^\top G(s)\); translation invariance is exactly what makes this well defined. You can think of \(A\) as giving a single value for each of the diagonals in the table above:

\[
\begin{array}{c|ccc}
 & s = 0 & s = 1 & s = 2 \\
\hline
t = 0 & A(0) & A(-1) & A(-2) \\
t = 1 & A(1) & A(0) & A(-1) \\
t = 2 & A(2) & A(1) & A(0)
\end{array}
\]

\(A(\delta)\) says how much you should perturb the inner product when shifted by \(\delta\). In other words, \(A(t - s)\) is a translation across time that brings the key from the past, at time \(s\), to the present, at time \(t\), where it can be compared to the query \(q(t)\). This key piece of intuition simultaneously suggests that:

  • Shifting by \(0\) should be a no-op
  • Shifting by \(\delta_1\) and then by \(\delta_2\) should be the same as shifting by \(\delta_1 + \delta_2\)

And indeed, thanks to the assumptions above we can immediately conclude that \(A(0) = I\), and that:

\[A(\delta_1)\, A(\delta_2) = A(\delta_1 + \delta_2).\]

The property \(A(\delta_1)\, A(\delta_2) = A(\delta_1 + \delta_2)\) is a very powerful constraint. If you’re familiar with group theory, the family of matrices \(\{A(\delta)\}\) forms a one-parameter group as long as we require that \(A(\delta)\) is continuous in \(\delta\). This fact makes our enumeration possible, because such groups are well understood. So our last assumption is:

Continuity: \(A(\delta)\) is a continuous function of \(\delta\). There are discontinuous functions that satisfy our constraints, but they require the axiom of choice to describe and are far too deranged to implement on a physical computer.

Enumerating all the representations

Remember that we’re trying to use these constraints to characterize the space of possible positional encodings. The fact that any positional encoding satisfying the above assumptions gives rise to a one-parameter group lets us bring known results about such groups to bear on this problem.

In particular, we know that every one-parameter matrix group has the form \(A(\delta) = e^{\delta M}\) for some fixed generator matrix \(M\). So now our task comes down to determining how the choice of generator \(M\) affects our encoding. The presence of a matrix exponential hints that we should try to diagonalize \(M\).
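As a quick numerical sanity check (a sketch using SciPy’s `expm` and an arbitrary random generator), the exponential form automatically gives the group properties from the previous section:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))   # an arbitrary generator

def A(delta):
    # One-parameter group element: A(delta) = exp(delta * M).
    return expm(delta * M)

# Group properties from the text:
assert np.allclose(A(0.0), np.eye(4))        # shifting by 0 is a no-op
assert np.allclose(A(0.3) @ A(1.1), A(1.4))  # A(d1) A(d2) = A(d1 + d2)
```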

Diagonalizable generators

There’s a lot more we can say when \(M\) is diagonalizable. Through a time-independent change of basis that we may fold into our keys and queries, we can convert \(M\) to have the form:

\[M = \begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix}\]

Diagonalizing is convenient because it lets us chop up our positional encoding into non-interacting components. More precisely, our \(n\)-dimensional vector space is decomposable into a direct sum of \(n\) different 1-dimensional vector spaces, where our positional encoding acts on each 1D space separately via multiplication by the scalar \(e^{\delta \lambda_i}\).
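For a diagonal generator, this decoupling is easy to see numerically (a small sketch; the eigenvalues are arbitrary choices):

```python
import numpy as np
from scipy.linalg import expm

lams = np.array([-0.5, 0.0, 0.3])  # diagonal entries (eigenvalues) of the generator
M = np.diag(lams)
delta = 2.0

# exp(delta * M) acts on each 1D component independently,
# scaling component i by exp(delta * lam_i).
assert np.allclose(expm(delta * M), np.diag(np.exp(delta * lams)))
```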

When we do this decomposition, we encounter a problem: the new basis will in general be complex, not real, and the resulting diagonal matrix will also have complex entries in general. However, if the original un-diagonalized \(M\) was real, then after diagonalization, the non-real values will come in conjugate pairs. We will therefore pick a slightly coarser direct sum decomposition: real diagonal elements get cut into their own subspaces, and pairs of conjugate diagonal elements get 2D subspaces, where \(A(\delta)\) acts like:

\[\begin{pmatrix} e^{\delta \lambda} & 0 \\ 0 & e^{\delta \bar{\lambda}} \end{pmatrix}\]

Here \(\bar{\lambda}\) denotes the conjugate of \(\lambda\).

The action of \(A\) on a 1D subspace corresponds to exponential decay or blowup. We can write the single diagonal value of \(M\) as the scalar \(\lambda\), for some real \(\lambda\). Recalling how \(A\) operates on the dot products, we find that, restricted to this subspace, \(A(t - s) = e^{\lambda (t - s)}\). That allows us to contemplate a few cases (see the sketch after this list):

  • If we have \(\lambda > 0\), then the influence of key \(k(s)\) blows up exponentially as time proceeds. This is clearly impractical—you don’t want your attention to increase exponentially as the time gap increases—so we throw out this possibility.
  • When \(\lambda = 0\), our positional encoding is \(1\) everywhere, and we’ve recovered NoPE.
  • When \(\lambda < 0\), \(A(t - s)\) exponentially suppresses the influence of key \(k(s)\) as time goes on, and we recover the exponential decay that so commonly appears in linear attention variants. Note that, when analyzing gated models (which learn a data-dependent decay factor) through the lens of positional encoding, we should think of them as learning how far to advance time, rather than as changing the rate of decay. (Also note that the \(\lambda < 0\) case only makes sense when we are running causal attention, where the future is masked out. Otherwise tokens in the far future have exponentially increased influence on our early queries, which presents its own problems.)
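Here is a small sketch of those three cases on a single 1D component (purely illustrative; the \(\lambda\) values are hand-picked):

```python
import numpy as np

def causal_scores_1d(q, k, times, lam):
    """Causal attention scores for a single 1D component,
    with the positional factor A(t - s) = exp(lam * (t - s))."""
    n = len(times)
    scores = np.full((n, n), -np.inf)           # -inf = masked (future) positions
    for i, t in enumerate(times):
        for j, s in enumerate(times[: i + 1]):  # only attend to s <= t
            scores[i, j] = np.exp(lam * (t - s)) * q[i] * k[j]
    return scores

times = np.arange(8.0)
q = np.ones(8)
k = np.ones(8)
print(causal_scores_1d(q, k, times, lam=0.0)[-1])   # lam = 0: NoPE, all keys weighted equally
print(causal_scores_1d(q, k, times, lam=-0.5)[-1])  # lam < 0: older keys are suppressed exponentially
# lam > 0 would instead amplify distant-past keys exponentially, which is why we discard it.
```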

The action of \(A(\delta)\) on the 2D subspaces can be rewritten in polar form as:

\[\begin{pmatrix} e^{\delta a}\, e^{i \delta \theta} & 0 \\ 0 & e^{\delta a}\, e^{-i \delta \theta} \end{pmatrix}\]

with \(a\) and \(\theta\) real, writing \(\lambda = a + i\theta\). The role of \(a\) in this subspace is exactly the same as the role of \(\lambda\) in the 1D subspace, so we’ll require \(a \le 0\).

To connect to something more familiar, we’ll first note that:

\[\begin{pmatrix} e^{i \delta \theta} & 0 \\ 0 & e^{-i \delta \theta} \end{pmatrix}\]

can be converted by change of basis into the 2D rotation matrix \(R(\delta \theta) = \begin{pmatrix} \cos \delta \theta & -\sin \delta \theta \\ \sin \delta \theta & \cos \delta \theta \end{pmatrix}\). This is called the real canonical form, and as long as our original matrix was real, the change of basis required to convert it to blocks of this form will also be real. We can implement our choice by taking \(F(t) = e^{a t} R(t \theta)\), so that \(G(t) = F(t)^{-\top} = e^{-a t} R(t \theta)\). We can immediately check that

\[F(t)^\top G(t) = e^{a t} e^{-a t}\, R(t \theta)^\top R(t \theta) = I\]

and

\[F(t)^\top G(s) = e^{a (t - s)}\, R\big((s - t) \theta\big),\]

which depends only on the difference \(t - s\), as required.
(Figure: how \(F(t)\) and \(G(t)\) behave as we vary \(a\) and \(\theta\).)

Thus \(F\) and \(G\) are constant-frequency rotations on our query and key subspaces. We’ve derived RoPE, with a new exponential damping factor induced by \(a\). This exponentially damped RoPE is the positional encoding used in RetNet and Mamba-3, among others.
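A minimal numerical check of this construction on a single 2D subspace (a sketch; the decay rate `a` and frequency `theta` are arbitrary choices):

```python
import numpy as np

def R(angle):
    # 2D rotation matrix.
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

a, theta = -0.1, 0.7              # decay rate and rotation frequency for this subspace

def F(t):
    return np.exp(a * t) * R(t * theta)   # applied to queries

def G(t):
    return np.exp(-a * t) * R(t * theta)  # = F(t)^{-T}, applied to keys

# F(t)^T G(t) = I, and F(t)^T G(s) depends only on t - s:
assert np.allclose(F(2.0).T @ G(2.0), np.eye(2))
assert np.allclose(F(5.0).T @ G(3.0), F(9.0).T @ G(7.0))
```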

Defective generators

Finally, let’s examine the possibility that our generator \(M\) is a defective matrix, and thus not diagonalizable. Instead, when we put it in Jordan normal form, we find that Jordan blocks appear around repeated eigenvalues. For simplicity, we can assume for now that \(M\) consists entirely of one Jordan block.

To get some intuition for what happens here, let’s analyze the defective matrix \(M = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}\). This matrix satisfies \(M^2 = 0\), and therefore \(A(\delta) = e^{\delta M} = I + \delta M = \begin{pmatrix} 1 & \delta \\ 0 & 1 \end{pmatrix}\). If we apply \(A(\delta)\) to a 2D vector \((x, v)\), we get the result \((x + \delta v,\, v)\).

This is not just some mathematical pathology; this choice of \(M\) appears naturally in certain dynamical systems. Consider a frictionless hockey puck launched from position \(x\) with velocity \(v\) at time \(0\). At time \(\delta\), its new position is \(x + \delta v\), and its velocity remains \(v\). The time evolution of this simple physical system is governed precisely by \(A(\delta)\).
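A short sketch of this shear action (the `expm` call just confirms the closed form \(I + \delta M\), and the vector plays the role of the puck’s position and velocity):

```python
import numpy as np
from scipy.linalg import expm

M = np.array([[0.0, 1.0],
              [0.0, 0.0]])        # defective: one Jordan block with eigenvalue 0

def A(delta):
    return expm(delta * M)        # equals I + delta * M because M @ M = 0

delta = 3.0
assert np.allclose(A(delta), np.eye(2) + delta * M)

x_v = np.array([2.0, 0.5])        # (position, velocity) of the hockey puck
print(A(delta) @ x_v)             # -> [3.5, 0.5]: position advances by delta * velocity
```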

Our analysis above covered only the simplest case, where the Jordan block had the shared eigenvalue \(0\), but the more general case of a real shared eigenvalue \(\lambda\) adds exponential decays to the mix: there \(A(\delta) = e^{\delta \lambda} \begin{pmatrix} 1 & \delta \\ 0 & 1 \end{pmatrix}\).

The complex eigenvalue case is messier—the simplest examples involve Jordan blocks built from a conjugate pair of repeated eigenvalues—but there’s nothing fundamentally new there, and the analysis is left as an exercise for the reader.

In general, unlike diagonalizable matrices which produce only exponential and trigonometric factors, defective matrices give rise to positional encodings involving polynomial terms. Whereas exponential factors produce time decay and rotations produce a sort of analog clock, it’s entirely unclear how to interpret these polynomials in the context of attention. I failed to find any existing literature which directly addresses the possibility of defective generators for positional encodings in deep learning, and most likely they have no practical application. Nevertheless, the curious possibility remains.
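To see those polynomial terms concretely, here’s a sketch with a single 3×3 nilpotent Jordan block, whose exponential contains entries \(1\), \(\delta\), and \(\delta^2/2\):

```python
import numpy as np
from scipy.linalg import expm

# A single 3x3 Jordan block with eigenvalue 0 (nilpotent: N^3 = 0).
N = np.diag([1.0, 1.0], k=1)
delta = 2.0

# exp(delta * N) = I + delta * N + (delta^2 / 2) * N^2 -- polynomial in delta.
expected = np.array([[1.0, delta, delta**2 / 2],
                     [0.0, 1.0,   delta],
                     [0.0, 0.0,   1.0]])
assert np.allclose(expm(delta * N), expected)
```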