Rotation intuition

Why OSCAR uses \(U \cdot H_{\mathrm{Had}} \cdot P\)

OSCAR rotates each key row before INT2 quantization with \(R_K = U_Q H_{\mathrm{Had}} P_K\). The same construction is used for values with \(R_V = U_S H_{\mathrm{Had}} P_V\). These factors are not interchangeable: \(U\) chooses the attention-relevant basis, \(H_{\mathrm{Had}}\) equalizes and mixes that basis, and \(P\) places the mixed channels into groups that the INT2 kernel can quantize stably.

\[ \left\lVert QK^\top - Q\widehat{K}^{\top}\right\rVert_F^2 = \operatorname{tr}\!\left(R_K^\top C_Q R_K \cdot E_K\right) \]

Here \(C_Q = Q^\top Q\), and \(E_K\) is the residual covariance of \(KR_K - Q_2(KR_K)\). The rotation changes both terms: the attention-importance metric and the distribution seen by the per-token, per-group INT2 quantizer.

The two jobs of the rotation

A useful key rotation must do more than reduce raw \(K\) reconstruction error. It must reduce the error after multiplication by \(Q\), while also making token-wise INT2 scales less dominated by outlier channels.

A

Shape the attention metric

\(R_K^\top C_Q R_K\) tells us how much quantization noise in each rotated coordinate is amplified in the attention logits. Error in a high-query-energy direction is expensive; error in a low-query-energy direction is much less visible.

B

Shape the quantizer residual

\(E_K\) depends on the values that the INT2 quantizer sees. A single scale is fitted per token and per group, so one large channel can expand the range and make the other channels in that group use a much coarser quantization step.

What each factor contributes

1

\(U_Q\): expose attention-importance directions

Eigendecompose \(C_Q = U_Q\Lambda_Q U_Q^\top\), with eigenvalues sorted from large to small. If \(R_K = U_Q\), the rotated importance metric becomes \(\Lambda_Q\). Channel \(j\) of \(KU_Q\) is the projection onto query principal direction \(u_j\), and its noise is weighted by \(\lambda_j = \lVert Q u_j\rVert_2^2\).

This is the right first step because it makes attention importance explicit. It is not sufficient by itself: real LLM spectra are anisotropic, and after the eigenbasis rotation the large-variance coordinates can make per-group min-max quantization worse.

2

\(H_{\mathrm{Had}}\): equalize importance and suppress outliers

Once \(U_Q\) diagonalizes the importance metric, normalized Hadamard mixing gives an exact invariant: every diagonal entry of \(H_{\mathrm{Had}}^\top \Lambda_Q H_{\mathrm{Had}}\) equals \(\operatorname{tr}(C_Q)/d\). The peaky importance spectrum is replaced by a uniform diagonal.

The same Hadamard matrix also mixes each token across all channels with equal-magnitude signed weights. A single principal-direction outlier is spread across many coordinates, reducing the max-min range that determines the INT2 scale.

3

\(P_K\): balance contiguous INT2 groups

The kernel quantizes contiguous groups of \(G_K\) channels. Hadamard mixing makes channels more uniform marginally, but each group still needs a balanced sample of the importance hierarchy.

OSCAR uses permuted bit-reversal: the largest eigenvalue direction goes to position \(0\), the second to \(64\), the third to \(32\), the fourth to \(96\), and so on for \(d = 128\). For any power-of-two group size, top directions are recursively spread across different groups instead of being clustered in one quantization block.

Why \(K^\top K\) is not enough

A tempting alternative is to rotate keys by their own covariance. That optimizes the geometry of raw \(K\), but attention consumes keys through \(QK^\top\). A low raw reconstruction error can still be harmful if it lands in directions heavily used by queries.

Empirically, \(Q^\top Q\) and \(K^\top K\) do not share a useful eigenbasis. On Qwen3-8B, the top-8 eigenvector self-alignment across layers is about \(0.05\text{--}0.15\), close to random alignment at \(d = 128\). Rotating key covariance by \(U_Q\) can even reduce its diagonal concentration.

This is why OSCAR uses \(Q^\top Q\) for keys and \(V^\top S^\top S V\) for values. The targets match the quantities consumed by attention: logits for keys, and score-weighted value aggregation for values.

Q and K covariance eigenbasis alignment at layer 16

Q/K eigenbasis alignment, Qwen3-8B layer 16

A diagonal-bright alignment matrix would indicate shared eigenvectors. The near-uniform pattern means the raw key covariance basis is not a substitute for the query-importance basis.

OSCAR error propagation through attention and layers

The advantage appears after attention consumes the cache

Raw cache reconstruction does not fully explain downstream quality. The gap is clearer in attention-score KL, attention-block output error, and propagated hidden-state error.

Why downstream errors separate

OSCAR is not designed to minimize plain Euclidean MSE of \(K\) or \(V\). Its targets are chosen so that the quantization error is small after the operations that matter: \(QK^\top\) for keys and \(SV\) for values.

This explains why some baselines can look acceptable on raw tensor reconstruction but fail after attention. A rotation can reduce visible outliers while still placing residual error in high-query-energy directions. OSCAR aligns the basis with the attention metric first, then uses Hadamard and bit-reversal to make INT2 grouping viable.

Worked example: Qwen3-4B-Thinking layer 10

This example uses \(d = 128\), \(G_K = 64\), 8000 calibration tokens, and the production query covariance averaged over GQA groups. The table separates what the quantizer sees from how attention weights the resulting residual.

Rotation max |K˃| per-group max-min importance max/mean \(\operatorname{tr}(E_K)\) Interpretation
I (no rotation)36.325.743.0233large outliers and highly uneven attention importance
\(H_{\mathrm{Had}}\)11.113.31.72206outliers shrink, but the attention metric is not exactly equalized
\(U_Q\)29.422.446.91354importance is exposed, but the INT2 group range gets worse
\(U_QH_{\mathrm{Had}}\)7.4613.41.00208Hadamard equalizes the diagonal importance metric exactly
\(U_QH_{\mathrm{Had}}P_K\) (OSCAR)7.4611.61.00169equalized importance plus lower INT2 residual

No rotation. One raw token has group ranges \(44.81\) and \(7.19\); a few outliers set the INT2 scale, so most coordinates in that group are quantized too coarsely.

\(U_Q\) alone. The attention importance is now explicit, but K-side variance is pushed into the eigenbasis unevenly. In the example, the difficult group flips: the second group range jumps to \(37.85\).

\(U_QH_{\mathrm{Had}}\). Hadamard spreads the spike across all channels and forces \(\max/\mathrm{mean} = 1.00\) for the diagonal of the importance metric. The per-group ranges become nearly balanced.

\(U_QH_{\mathrm{Had}}P_K\). The permutation does not change the set of 128 values; it changes which values share an INT2 group. Averaged across tokens, the mean per-group max-min drops from \(13.40\) to \(11.58\), reducing \(\operatorname{tr}(E_K)\) from \(208\) to \(169\).

Summary

\(U\) gives the right metric

It turns the attention-aware covariance into eigenvalue weights, so the rotation knows which directions matter to attention.

\(H_{\mathrm{Had}}\) makes the metric usable

It equalizes the importance diagonal exactly after \(U\) and spreads token-level outliers before INT2 scaling.

\(P\) makes grouping usable

It balances high-importance directions across contiguous quantization groups; when \(G_K = d\), this step becomes a no-op.

The value rotation follows the same logic with \(C_S = V^\top S^\top S V\): choose the basis for the quantity attention consumes, equalize and mix it, then lay it out for the low-bit group quantizer.

Back to the project page

The main page contains the serving pipeline, accuracy table, and links to code, paper, and RotationZoo.

Return to OSCAR