The Mathematics of DPLR (Diagonal Plus Low Rank): Parallel Computing with Explicit Transition Matrices

This article assumes familiarity with linear algebra (matrix multiplication, outer products, inverse matrices) and basic sequence modeling concepts. It is recommended to read The Mathematics of KDA first.

Abstract: This article derives the chunk-wise parallel algorithm for DPLR (Diagonal Plus Low Rank), an important variant of the generalized Delta Rule used in architectures such as RWKV-7. The core contributions are:

- Establishing the explicit transition matrix form of DPLR: $\mathbf{P}_t = \text{diag}(\exp(\mathbf{g}_t)) + \mathbf{b}_t \mathbf{a}_t^T$
- Deriving the WY representation for DPLR, decomposing the cumulative transition matrix into diagonal and low-rank components
- Proving that DPLR also satisfies the Affine transformation form, naturally supporting Context Parallelism (CP)
- Comparing DPLR, KDA, and IPLR, revealing the unified mathematical framework of the linear attention family

Advantages of DPLR over the standard Delta Rule: explicit control over diagonal decay (dimension-wise forgetting) and low-rank updates, providing stronger expressiveness. However, the chunk-wise form introduces significant additional computation and requires more HBM to store intermediate variables. ...
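The transition matrix above can be sketched numerically. This is a minimal illustration, not the article's implementation: the state shape, the outer-product write $\mathbf{k}_t \mathbf{v}_t^T$, and all variable names are assumptions for the sake of the example. It also shows why "diagonal plus rank-1" matters computationally: the transition can be applied without ever materializing the dense matrix.

```python
import numpy as np

# Hypothetical one-step DPLR recurrence: S_t = P_t @ S_{t-1} + k_t v_t^T,
# with P_t = diag(exp(g_t)) + b_t a_t^T. Shapes and the write term are
# illustrative assumptions, not taken from the article.
d = 4
rng = np.random.default_rng(0)
S_prev = rng.standard_normal((d, d))
g = -np.abs(rng.standard_normal(d))      # log-decay kept negative so exp(g) <= 1
a = rng.standard_normal(d)
b = rng.standard_normal(d)
k = rng.standard_normal(d)
v = rng.standard_normal(d)

P = np.diag(np.exp(g)) + np.outer(b, a)  # explicit diagonal-plus-rank-1 transition
S = P @ S_prev + np.outer(k, v)          # one recurrent step, dense form

# The same update without building P: a per-dimension scale plus a
# rank-1 correction, O(d^2) instead of the O(d^3) dense matmul.
S_cheap = np.exp(g)[:, None] * S_prev + np.outer(b, a @ S_prev) + np.outer(k, v)
assert np.allclose(S, S_cheap)
```

The assertion confirms that the factored application matches the dense one, which is the structural fact the chunk-wise algorithm exploits.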

February 21, 2026 · 16 min · 3207 words · Zhiyuan Li

KDA (Kimi Delta Attention): From Matrix Multiplication to Affine Transformation

This article assumes familiarity with linear algebra (matrix multiplication, outer products, inverse matrices) and basic sequence modeling concepts.

Abstract: This article derives the chunk-wise parallel algorithm for KDA (Kimi Delta Attention). Core contributions:

- Proving that KDA's chunk state update can be expressed as an Affine transformation: $\mathbf{S}' = \mathbf{M}\mathbf{S} + \mathbf{B}$
- Decomposing residual computation into history-independent components via the WY representation to enable parallel computation
- Deriving the mathematical foundation for CP (Context Parallelism) based on the compositional properties of Affine transformations

Advantages of KDA over standard Attention: $O(N)$ complexity and a constant-size memory state, making it suitable for ultra-long sequences. ...
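The compositional property behind the CP claim can be checked in a few lines. This is a hedged sketch with illustrative shapes and names: applying two affine chunk updates in sequence is the same as applying one pre-composed affine map, so each rank can compute its chunk's $(\mathbf{M}, \mathbf{B})$ independently and a short scan combines them.

```python
import numpy as np

# Hypothetical affine chunk updates (M1, B1) and (M2, B2) acting on a
# state S. All shapes are illustrative; the article's actual M and B
# come from the WY representation of each chunk.
d = 3
rng = np.random.default_rng(1)
S = rng.standard_normal((d, d))
M1, B1 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
M2, B2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))

# Sequential application: chunk 1, then chunk 2.
S_seq = M2 @ (M1 @ S + B1) + B2

# Composed form: M = M2 M1, B = M2 B1 + B2 gives the same result,
# which is what makes a parallel scan over chunks possible.
M, B = M2 @ M1, M2 @ B1 + B2
S_par = M @ S + B
assert np.allclose(S_seq, S_par)
```

Because composition is associative, the per-chunk $(\mathbf{M}, \mathbf{B})$ pairs can be combined in any bracketing, which is exactly the structure a parallel scan (and hence Context Parallelism) requires.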

February 17, 2026 · 23 min · 4827 words · Zhiyuan Li

Tech Stack

This site is built with the following technologies:

| Technology | Purpose |
|---|---|
| Hugo | Fast static site generator written in Go |
| PaperMod | Clean and elegant Hugo theme |
| GitHub Pages | Free static site hosting |
| GitHub Actions | Automated deployment |

Features:

- Blazing Fast: Hugo's Go implementation ensures sub-second builds
- SEO Friendly: Built-in Open Graph, Twitter Cards, and structured data
- Dark/Light Mode: Automatically follows the system theme
- Full-text Search: Site-wide search powered by Fuse.js
- Responsive Design: Perfectly adapted for mobile devices
- Multi-language: Supports both Chinese and English

Deployment Workflow: Local Writing → Git Push → GitHub Actions → GitHub Pages → Live Site. A fully automated deployment pipeline that lets the author focus on content creation. ...

February 16, 2026 · 1 min · 111 words · Zhiyuan Li