KDA (Kimi Delta Attention): From Matrix Multiplication to Affine Transformation

This article assumes familiarity with linear algebra (matrix multiplication, outer products, inverse matrices) and basic sequence-modeling concepts.

Abstract

This article derives the chunk-wise parallel algorithm for KDA (Kimi Delta Attention). Core contributions:

- Proving that KDA's chunk state update can be expressed as an affine transformation: $\mathbf{S}' = \mathbf{M}\mathbf{S} + \mathbf{B}$
- Decomposing the residual computation into history-independent components via the WY representation, enabling parallel computation
- Deriving the mathematical foundation for CP (context parallelism) from the compositional properties of affine transformations

Advantages of KDA over standard attention: $O(N)$ complexity, a constant-size memory state, and suitability for ultra-long sequences. ...
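
To make the compositional property behind the third contribution concrete, here is a minimal NumPy sketch, not the article's implementation: shapes, names, and the random stand-ins for the per-chunk matrices are illustrative assumptions. It checks that two affine chunk updates applied in sequence equal a single pre-composed affine map, which is what lets CP ranks reduce their local $(\mathbf{M}, \mathbf{B})$ pairs independently.

```python
import numpy as np

# Hypothetical shapes: state S is (d_k, d_v); each chunk contributes a
# transition matrix M of shape (d_k, d_k) and an additive term B of
# shape (d_k, d_v). All values here are random stand-ins.
d_k, d_v = 4, 3
rng = np.random.default_rng(0)

def apply_chunk(S, M, B):
    """One chunk's state update as an affine map: S' = M @ S + B."""
    return M @ S + B

def compose(first, second):
    """Compose two affine maps applied in order second(first(S)):
    (M2, B2) o (M1, B1) = (M2 @ M1, M2 @ B1 + B2)."""
    M1, B1 = first
    M2, B2 = second
    return M2 @ M1, M2 @ B1 + B2

M1, B1 = rng.standard_normal((d_k, d_k)), rng.standard_normal((d_k, d_v))
M2, B2 = rng.standard_normal((d_k, d_k)), rng.standard_normal((d_k, d_v))
S0 = rng.standard_normal((d_k, d_v))

# Sequential application vs. one pre-composed map: the results agree.
S_seq = apply_chunk(apply_chunk(S0, M1, B1), M2, B2)
M12, B12 = compose((M1, B1), (M2, B2))
S_par = apply_chunk(S0, M12, B12)
assert np.allclose(S_seq, S_par)
```

Because this composition is associative, per-chunk affine maps can be reduced across devices in any grouping before ever touching the state, which is the property the CP derivation relies on.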

February 17, 2026 · 23 min · 4827 words · Zhiyuan Li