OSCAR Offline Spectral Covariance-Aware Rotation
for 2-bit KV Cache Quantization

Zhongzhu Zhou*1,2, Donglin Zhuang*1,2, Jisen Li1,3, Ziyan Chen2
Shuaiwen Leon Song1, Ben Athiwaratkun1, Xiaoxia Wu†1

1Together AI    2University of Sydney    3University of Illinois Urbana-Champaign

*Equal contribution.   Corresponding author.

INT2 KV-cache serving for long-context reasoning LLMs: attention-aware fixed rotations, BF16 sink/recent protection, and SGLang-ready kernels at 2.28 effective bits per KV element. On Qwen3-4B-Thinking, Qwen3-8B, Qwen3-32B, and GLM-4.7-FP8, the mean gap to BF16 is 3.78, 1.42, −0.02, and +0.27 points respectively, while reducing KV memory by ~ and increasing large-batch throughput by up to ~.

Quantization Pipeline

Offline calibration -> online INT2 serving

Q
K
V
CK = QQ
CS = VSSV

Run a calibration pass. Dump activations and compute attention-aware covariance targets per layer/head.

UQ
×
HHad
×
Pbr
=
RK, RV

Eigendecompose covariances, compose with Hadamard mixing and bit-reversal permutation, then fit token-wise clip thresholds.

BF16 sink
INT2 history
BF16 recent

Logical cache: [1, S₀] ‖ [S₀+1, t−W] ‖ [t−W+1, t]. New tokens land in recent; oldest recent demotes to INT2.

clip → INT2 → pack

4 × 2-bit values per byte

k⁺ = Q₂⁺(clip(k·RK, τ)) — token-wise affine asymmetric INT2: each 2-bit code selects one of four reconstruction levels under the token/group scale and zero point.

BF16 kernel sink + recent
INT2 kernel unpack · dequant · accumulate
⊕ merge
Attention output

Two first-stage kernels over BF16 and INT2 segments; online softmax merge produces the same result as full-precision attention.

Calibrate
2.28 BPE
~8× KV memory
~7× throughput
128K reasoning traces

Where INT2 KV Quantization Fails

KV activations have severe channel-wise outliers. At INT2, only four quantization levels exist — outliers dominate the scale and many non-outlier entries map to the same code. Hadamard rotations reduce coordinate spikes, but they do not use the query/value statistics that determine attention error.

Outlier-dominated scale

A few extreme channels set the min-max range; most values collapse into useless INT2 bins.

Rotation target changes the error

Random or Hadamard rotations mix coordinates; they do not choose directions from QQ or score-weighted value covariance.

Attention consumes correlations

Keys affect logits through QK; values affect the weighted sum after softmax. Raw tensor MSE is therefore an indirect objective.

The OSCAR Recipe

OSCAR separates calibration from serving: fixed rotations and clip thresholds are computed offline, then the online path stores the long-history KV cache in INT2 while preserving a small BF16 window.

OSCAR pipeline overview: offline calibration and online INT2 serving
OSCAR pipeline overview. Offline: estimate attention-aware key/value covariances, build fixed rotations (R = U · HHad · Pbr), and fit clipping thresholds — Hadamard mixing flattens raw outlier peaks while OSCAR rotations align quantization with attention. Online: keep sink and recent tokens in BF16; apply rotate–clip–INT2 to history KV inside a paged SGLang cache. Right panels: main-table accuracy and serving throughput under the same 2.28-BPE cache layout.
01

Attention-aware rotations

Offline calibration estimates CK = QQ for keys and CS = VSSV for values, then builds R = U · HHad · Pbr.

02

BF16 sink & recent

Protect attention sinks and a sliding recent window in full precision while the long history lives in INT2.

03

SGLang INT2 serving

Fused Triton kernels rotate, clip, quantize, and pack K/V rows (4×2-bit values per byte); the decode path is compatible with SGLang paged attention.

Rotation Target Changes Accuracy

A rotation that minimizes raw cache reconstruction can still distort the directions used by attention. OSCAR aligns rotations to QQ (keys) and VSSV (values) rather than raw cache reconstruction.

The ablation changes the rotation target or removes one factor of R = U · HHad · Pbr; the INT2 layout and BF16 windows stay fixed.

Setup. Qwen3-8B, same INT2 KV budget (sink 64 + recent 256, calibration-derived clip). Each number is the unweighted mean pass@1 accuracy (%) over five benchmarks: GPQA, HumanEval, LiveCodeBench v6, AIME 2025, and MATH-500 (3 seeds in the paper).

Full OSCAR 70.01 Attention-aware U · Hadamard · bit-reversal
Tensor-reconstruction target 31.12 U from KK / VV instead
w/o attention-aware U 32.82 Hadamard + bit-reversal only (QuaRot-style)
No rotation 4.23 Clip + sink/recent only
Quantization error at every pipeline stage

OSCAR limits quantization error at every stage

Qwen3-4B-Thinking-2507, AIME

  • (a) Relative MSE in quantized K / V caches (keys solid, values dashed).
  • (b) KL(pFP16pq) between FP16 and quantized attention-score distributions.
  • (c) Relative MSE at the attention-block output (post-WO).
  • (d) Relative MSE in propagated hidden states across layers.

Naive INT2, Hadamard-only, and clip-only baselines diverge in (b)–(d); full OSCAR stays closest to FP16.

Rotation intuition

Why R = U · HHad · Pbr is ordered this way

OSCAR separates the three factors: U exposes attention-importance directions, Hadamard equalizes the diagonal importance metric and spreads token outliers, and bit-reversal balances high-importance directions across quantization groups.

Read intuition page

Production Serving Path

OSCAR implements an SGLang INT2 KV-cache mode for paged attention and prefix caching.

Prefill BF16 sink high precision INT2 history rotated · packed · paged BF16 recent sliding window Triton decode kernel
  • Fused rotate–clip–quantize–pack during prefill and decode demotion
  • Value rotation absorbed into projection weights for compute savings
  • FlashAttention-3 prefill + Triton INT2 decode hybrid mode
  • Compatible with SGLang paged attention and prefix cache

Accuracy & Efficiency

Evaluated on reasoning models with up to 32K-token traces across five reasoning and coding benchmarks. RULER-NIAH is evaluated up to 128K on Qwen3 models.

Qwen3-4B-Thinking

3.78

BF16 gap (pts) @ 2.28 BPE

Qwen3-8B

1.42

BF16 gap (pts) @ 2.28 BPE

Qwen3-32B & GLM-4.7

≈ BF16

−0.02 / +0.27 mean gap

vs TurboQuant

+40.1

mean score on Qwen3-4B (2.28 vs 3.25 BPE)

Main accuracy on four model configurations and five benchmarks (32K generation; μ±σ over 5 seeds unless noted). BPE = effective bits per KV element @ 128K context. Drop = gap to BF16 mean.

Model Method BPE GPQA HumanE LCB v6 AIME25 MATH500 Mean Drop
Qwen3-4B-ThinkingBF1616.0067.2794.0548.6674.6793.5575.64
Saw-INT44.2566.3789.7846.2070.0093.1973.11−2.53
TurboQuant*3.2541.4131.830.5816.6768.2031.74−43.90
QuaRot-INT22.250.340.980.000.005.671.40−74.24
OSCAR2.2864.9592.2445.3864.0092.7571.86−3.78
Qwen3-8BBF1616.0056.6785.9549.0170.0092.5970.84
Saw-INT44.2554.8586.4447.9568.0092.6369.97−0.87
TurboQuant*3.2555.0574.6321.0546.6787.0056.88−13.96
QuaRot-INT22.2514.989.800.582.2223.1310.14−60.70
OSCAR2.2855.0587.8846.3266.6792.2269.42−1.42
Qwen3-32BBF1616.0058.4991.1959.0668.6793.5574.19
TurboQuant*3.2558.6988.4155.5666.6790.6071.99−2.20
OSCAR2.2860.4090.1253.5774.0092.7574.17−0.02
GLM-4.7-FP8BF1616.0073.2391.4649.1280.0095.6677.89
TurboQuant*3.2566.6790.2458.4880.0095.3978.15+0.26
OSCAR2.2873.5791.0652.6378.8994.6678.16+0.27

Notes. For a fair comparison at a comparable bit budget, TurboQuant results use vLLM's implementation modified so that all layers are quantized (no mixed precision); the original TurboQuant keeps the first, last, and selected middle layers in full precision. We run it in its K3V3 configuration (3-bit K, 3-bit V) to land near the OSCAR bit budget. TurboQuant entries are single-run results because its vLLM path is too slow for repeated 32K-context evaluations under our compute budget.

QuaRot-INT2 is the standard 2-bit KV-quant recipe (data-free Hadamard rotation per layer). Saw-INT4 is an INT4 reference for context. Naive INT2 is per-token symmetric INT2 with no rotation.

Effect of prefix-cache hit ratio on end-to-end serving throughput

Effect of prefix-cache hit ratio on end-to-end serving throughput

Effect of prefix-cache hit ratio on end-to-end serving throughput (100k ISL, 1K OSL). Each subplot shows median per-user throughput \(U\) (x-axis, tok/s) versus mean per-GPU throughput \(G\) (y-axis, tok/s), with markers denoting batch sizes \(\mathrm{BS} \in \{1,8,16,32\}\). Rows correspond to Qwen3-4B-Thinking-2507 and GLM-4.7-FP8. Columns correspond to radix cache disabled, radix cache enabled during normal execution, and immediate replay after warmup (near-100% hit ratio).

OSCAR vs TurboQuant

TurboQuant OSCAR
Target Generic online vector quantization Attention-aware 2-bit KV serving
Rotation Data-oblivious random Covariance-aware offline
Effective BPE 3.25 2.28
Protection BF16 sink + recent
System Quantizer backend SGLang cache layout + Triton kernels

TurboQuant and OSCAR both use rotations and low-bit codes, but their measured objectives differ: TurboQuant reports a vector-quantizer backend; OSCAR reports an SGLang KV-cache layout and decode path.

Are OSCAR and TurboQuant compatible?

A hybrid is possible as a new experiment, but it is not the code path evaluated here. TurboQuant exposes an online quantizer; OSCAR fixes the KV layout, BF16 windows, and SGLang decode kernels around its rotation target.

TurboQuant

  • Online vector quantizer
  • Data-oblivious random rotation
  • Optimizes MSE / inner-product distortion
  • Rate–distortion analysis for the quantizer backend
  • ~3.25 BPE in our fair KV comparison (no mixed-precision exemptions)

OSCAR

  • Attention-aware 2-bit KV-cache serving method
  • Offline covariance-aware rotations (QQ, VSSV)
  • BF16 sink + recent protection + INT2 history layout
  • SGLang / Triton path with paged KV
  • 2.28 BPE; BF16 mean gap reported in the table above

Possible hybrid to test: One could explore TurboQuant-style codebooks after OSCAR's attention-aware rotation, or test OSCAR's rotation targets inside a TurboQuant-like backend.

Why this is not a one-line change: Swapping one method for the other would require new kernels, revised bit accounting (effective BPE), and new accuracy/throughput measurements for the resulting cache layout.

In the current implementations, they are not interchangeable modules. TurboQuant targets generic vector quantization; OSCAR fixes the rotation target, BF16 windows, INT2 cache layout, and SGLang kernels as one serving path.

Citation

Please cite the arXiv version of OSCAR using the BibTeX entry below.

@misc{zhou2026oscarofflinespectralcovarianceaware,
  title         = {OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization},
  author        = {Zhongzhu Zhou and Donglin Zhuang and Jisen Li and Ziyan Chen and Shuaiwen Leon Song and Ben Athiwaratkun and Xiaoxia Wu},
  year          = {2026},
  eprint        = {2605.17757},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2605.17757},
}
Open arXiv