The Mathematics of DPLR (Diagonal Plus Low Rank): Parallel Computing with Explicit Transition Matrices

This article assumes familiarity with linear algebra (matrix multiplication, outer products, matrix inverses) and basic sequence-modeling concepts. Reading The Mathematics of KDA first is recommended.

Abstract

This article derives the chunk-wise parallel algorithm for DPLR (Diagonal Plus Low Rank), an important variant of the generalized Delta Rule used in architectures such as RWKV-7. The core contributions are:

- Establishing the explicit transition-matrix form of DPLR: $\mathbf{P}_t = \text{diag}(\exp(\mathbf{g}_t)) + \mathbf{b}_t \mathbf{a}_t^T$
- Deriving the WY representation for DPLR, decomposing the cumulative transition matrix into diagonal and low-rank components
- Proving that DPLR also satisfies the affine-transformation form, naturally supporting Context Parallelism (CP)
- Comparing DPLR, KDA, and IPLR, revealing the unified mathematical framework of the linear attention family

Compared with the standard Delta Rule, DPLR offers explicit control over diagonal decay (per-dimension forgetting) and low-rank updates, giving it stronger expressiveness. In chunk-wise form, however, it introduces significant additional computation and requires more HBM to store intermediate variables. ...
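The transition-matrix form above can be sketched numerically. The snippet below is a minimal illustration (not the article's implementation): it builds $\mathbf{P}_t = \text{diag}(\exp(\mathbf{g}_t)) + \mathbf{b}_t \mathbf{a}_t^T$ for one step with assumed vector shapes, and checks that applying it to a state vector needs only the structured $O(d)$ form, never the dense $O(d^2)$ matrix.

```python
import numpy as np

d = 4  # head dimension (illustrative)
rng = np.random.default_rng(0)

g = rng.standard_normal(d)   # per-dimension log-decay g_t
b = rng.standard_normal(d)   # low-rank factor b_t
a = rng.standard_normal(d)   # low-rank factor a_t

# Explicit DPLR transition matrix: P_t = diag(exp(g_t)) + b_t a_t^T
P = np.diag(np.exp(g)) + np.outer(b, a)

# Applying P_t to a state vector never requires materializing P_t:
s = rng.standard_normal(d)
dense = P @ s                              # O(d^2) reference
structured = np.exp(g) * s + b * (a @ s)   # O(d), diagonal + rank-1 parts
assert np.allclose(dense, structured)
```

The same diagonal-plus-rank-1 split is what the chunk-wise algorithm exploits: products of such matrices stay decomposable into a diagonal part and a low-rank part, which is the content of the WY representation derived in the article.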

February 21, 2026 · 16 min · 3207 words · Zhiyuan Li