The Mathematics of DPLR (Diagonal Plus Low Rank): Parallel Computing with Explicit Transition Matrices

This article assumes familiarity with linear algebra (matrix multiplication, outer products, inverse matrices) and basic sequence modeling concepts. It is recommended to read The Mathematics of KDA first.

Abstract: This article derives the chunk-wise parallel algorithm for DPLR (Diagonal Plus Low Rank). DPLR is an important variant of the generalized Delta Rule, used in architectures such as RWKV-7. The core contributions are:

- Establishing the explicit transition-matrix form of DPLR: $\mathbf{P}_t = \text{diag}(\exp(\mathbf{g}_t)) + \mathbf{b}_t \mathbf{a}_t^T$ (a minimal sketch of this step appears after the list)
- Deriving the WY representation for DPLR, decomposing the cumulative transition matrix into diagonal and low-rank components
- Proving that DPLR also satisfies the Affine transformation form, naturally supporting Context Parallelism (CP)
- Comparing DPLR, KDA, and IPLR, revealing the unified mathematical framework behind the linear attention family

Advantages of DPLR over the standard Delta Rule: explicit control of diagonal decay (dimension-wise forgetting) and low-rank updates provides stronger expressiveness. In chunk form, however, it introduces significant additional computational cost and requires more HBM space to store intermediate variables. ...
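To make the transition-matrix form concrete, here is a minimal NumPy sketch of one DPLR step, assuming a recurrence of the form $\mathbf{S}_t = \mathbf{P}_t \mathbf{S}_{t-1} + \mathbf{k}_t \mathbf{v}_t^T$; the key/value injection term, variable names, and shapes are illustrative assumptions, not the article's exact formulation. It checks that materializing $\mathbf{P}_t$ explicitly agrees with the matrix-free diagonal-plus-rank-1 update.

```python
# Minimal sketch of one DPLR transition step, per the stated form
# P_t = diag(exp(g_t)) + b_t a_t^T. The recurrence S_t = P_t @ S_{t-1} + k_t v_t^T
# and all shapes below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v = 4, 3                       # key/value head dimensions (assumed)

S = rng.standard_normal((d_k, d_v))   # running state S_{t-1}
g = -rng.random(d_k)                  # g_t <= 0 so exp(g_t) in (0, 1]: per-dim decay
a = rng.standard_normal(d_k)          # low-rank factors of the transition
b = rng.standard_normal(d_k)
k = rng.standard_normal(d_k)          # key/value injection (assumed update term)
v = rng.standard_normal(d_v)

# (1) Explicit transition matrix: materialize the d_k x d_k matrix P_t.
P = np.diag(np.exp(g)) + np.outer(b, a)
S_explicit = P @ S + np.outer(k, v)

# (2) Equivalent matrix-free update: row-wise diagonal scaling plus a
#     rank-1 correction, P_t @ S = exp(g_t) * S + b_t (a_t^T S).
S_implicit = np.exp(g)[:, None] * S + np.outer(b, a @ S) + np.outer(k, v)

assert np.allclose(S_explicit, S_implicit)
```

The matrix-free path avoids ever forming the $d_k \times d_k$ matrix $\mathbf{P}_t$, at the price of the extra intermediate terms the abstract alludes to.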

February 21, 2026 · 16 min · 3207 words · Zhiyuan Li

KDA (Kimi Delta Attention): From Matrix Multiplication to Affine Transformation

This article assumes familiarity with linear algebra (matrix multiplication, outer products, inverse matrices) and basic sequence modeling concepts.

Abstract: This article derives the chunk-wise parallel algorithm for KDA (Kimi Delta Attention). The core contributions are:

- Proving that KDA's chunk state update can be expressed as an Affine transformation: $\mathbf{S}' = \mathbf{M}\mathbf{S} + \mathbf{B}$ (see the sketch after the list)
- Decomposing the residual computation into history-independent components via the WY representation, enabling parallel computation
- Deriving the mathematical foundation for CP (Context Parallel) from the compositional properties of Affine transformations

Advantages of KDA over standard Attention: $O(N)$ complexity and a constant-size memory state, making it suitable for ultra-long sequences. ...
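The reason the Affine form matters for CP is that affine maps compose associatively: each chunk (or device) can compute its local $(\mathbf{M}, \mathbf{B})$ independently, and the pairs can then be reduced in any grouping. A minimal NumPy sketch, where random per-chunk $(\mathbf{M}, \mathbf{B})$ pairs stand in for the chunk-local computation and the shapes are assumed:

```python
# Minimal sketch of affine composition, per the stated form S' = M @ S + B.
# Per-chunk (M, B) pairs are random stand-ins for the chunk-local computation.
import numpy as np
from functools import reduce

rng = np.random.default_rng(0)
d_k, d_v = 4, 3                                # state S is d_k x d_v (assumed)

def compose(later, earlier):
    """Compose affine maps: apply (M1, B1) first, then (M2, B2).
    (M2, B2) o (M1, B1) = (M2 @ M1, M2 @ B1 + B2)."""
    (M2, B2), (M1, B1) = later, earlier
    return M2 @ M1, M2 @ B1 + B2

S0 = rng.standard_normal((d_k, d_v))           # initial state entering the sequence
chunks = [(rng.standard_normal((d_k, d_k)),    # per-chunk transition M (stand-in)
           rng.standard_normal((d_k, d_v)))    # per-chunk injection B (stand-in)
          for _ in range(4)]

# Sequential reference: thread the state through each chunk in order.
S_seq = S0
for M, B in chunks:
    S_seq = M @ S_seq + B

# CP-style: fold all per-chunk maps into one (M, B) first. Since compose is
# associative, this reduction can run as a tree/scan across devices.
M_tot, B_tot = reduce(compose, reversed(chunks))
S_par = M_tot @ S0 + B_tot

assert np.allclose(S_seq, S_par)
```

The practical upshot, under these assumptions, is that only the compact $(\mathbf{M}, \mathbf{B})$ pairs need to cross chunk or device boundaries rather than per-token activations.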

February 17, 2026 · 23 min · 4827 words · Zhiyuan Li