[{"content":" This article assumes familiarity with linear algebra (matrix multiplication, outer products, inverse matrices) and basic sequence modeling concepts. It is recommended to read The Mathematics of KDA first.\nAbstract This article derives the chunk-wise parallel algorithm for DPLR (Diagonal Plus Low Rank). DPLR is an important variant of the generalized Delta Rule, applied in architectures such as RWKV-7. The core contributions are:\nEstablishing the explicit transition matrix form of DPLR: $\\mathbf{P}_t = \\text{diag}(\\exp(\\mathbf{g}_t)) + \\mathbf{b}_t \\mathbf{a}_t^T$ Deriving the WY representation for DPLR, decomposing the cumulative transition matrix into diagonal and low-rank components Proving that DPLR also satisfies the Affine transformation form, naturally supporting Context Parallelism (CP) Comparing DPLR, KDA, and IPLR, revealing the unified mathematical framework of the linear attention family Advantages of DPLR over standard Delta Rule: explicit control of diagonal decay (dim-wise forgetting) and low-rank updates, providing stronger expressiveness. However, in chunk form, it significantly introduces additional computational complexity and requires more HBM space to store intermediate variables.\nTable of Contents Introduction: From Delta Rule to DPLR Notation and Conventions Core Lemmas The Recurrent Form of DPLR WY Representation: Decomposition of Cumulative Transition Matrices Core Theorem: Chunk-wise Affine Form Algorithm Implementation: From Theory to Code DPLR vs KDA vs IPLR CP Parallelism and Multi-Level Parallelism Summary Introduction: From Delta Rule to DPLR Limitations of Delta Rule The state update of standard Delta Rule (and KDA) can be written as:\n$$\\mathbf{s}_t = \\mathbf{s}_{t-1} + \\beta_t \\cdot \\mathbf{k}_t^T (\\mathbf{v}_t - \\mathbf{k}_t \\mathbf{s}_{t-1})$$The transition matrix in this form is implicit:\nThe state update is indirectly affected through the residual $(\\mathbf{v}_t - \\mathbf{k}_t \\mathbf{s}_{t-1})$ The forgetting mechanism is implemented through the gate $\\boldsymbol{\\lambda}_t$ Mathematically, this is equivalent to:\n$$\\mathbf{s}_t = (\\mathbf{I} - \\beta_t \\mathbf{k}_t^T \\mathbf{k}_t)\\mathbf{s}_{t-1} + \\beta_t \\mathbf{k}_t^T \\mathbf{v}_t$$The transition matrix $\\mathbf{I} - \\beta_t \\mathbf{k}_t^T \\mathbf{k}_t$ is in the form of identity matrix + low-rank (rank-1), known as the IPLR (Identity Plus Low Rank) structure.\nThe Core Idea of DPLR DPLR (Diagonal Plus Low Rank) adopts an explicit transition matrix form:\n$$\\mathbf{S}_t = \\exp(\\mathbf{g}_t) \\odot \\mathbf{S}_{t-1} + \\mathbf{k}_t^T \\mathbf{v}_t + \\mathbf{b}_t (\\mathbf{a}_t^T \\mathbf{S}_{t-1})$$Or more compactly:\n$$\\mathbf{S}_t = (\\mathbf{D}_t + \\mathbf{b}_t \\mathbf{a}_t^T) \\mathbf{S}_{t-1} + \\mathbf{k}_t^T\\mathbf{v}_t$$Where:\n$\\mathbf{D}_t = \\text{diag}(\\exp(\\mathbf{g}_t)) \\in \\mathbb{R}^{K \\times K}$ is the diagonal decay matrix $\\mathbf{a}_t, \\mathbf{b}_t \\in \\mathbb{R}^{K \\times 1}$ (column vectors) are the two vectors for low-rank update The transition matrix $\\mathbf{P}_t = \\mathbf{D}_t + \\mathbf{b}_t \\mathbf{a}_t^T$ has the Diagonal Plus Low Rank (DPLR) structure Why \u0026ldquo;Diagonal Plus Low Rank\u0026rdquo;? 
### Why "Diagonal Plus Low Rank"?

The matrix $\mathbf{P}_t = \mathbf{D}_t + \mathbf{b}_t \mathbf{a}_t^T$ combines two parts:

- **Diagonal part** $\mathbf{D}_t$: controls independent decay for each dimension
- **Low-rank part** $\mathbf{b}_t \mathbf{a}_t^T$: a rank-1 update providing cross-dimensional coupling

This structure has been extensively studied in numerical linear algebra and is particularly suitable for fast matrix-vector multiplication.

### Relationship with RWKV-7

RWKV-7 adopts a Dynamic State Evolution architecture based on the DPLR concept. In our underlying parallel implementation, RWKV-7's state update formula is essentially a specific instantiation of the DPLR framework.

While traditional linear attention tries to directly match $\{k, v\}$ pairs, RWKV-7 simulates dynamic gradient descent to update the state $S$, guided by the L2 loss $L=\frac{1}{2} \left\Vert v - S k \right\Vert^2$. The theoretical update formula is:

$$S_t = S_{t-1} \text{Diag}(d_t) - \eta_t \cdot S_{t-1} k_t k_t^{\top} + \eta_t \cdot v_t k_t^{\top}$$

In the algorithm implementation, this gradient-based update is generalized into a more flexible DPLR form:

$$S_t = S_{t-1} \odot \exp(-e^{w_t}) + (S_{t-1} a_t) b_t^T + v_t k_t^T$$

The parameter mapping in our parallel system is as follows:

- $w_t$ maps to the logarithmic decay term (specifically $-\exp(w_t)$)
- $a_t$ maps to the low-rank update vector $a$ (dynamic learning rate modulator / in-context learning rate)
- $b_t$ maps to the low-rank update vector $b$ (state update modulator)

These features enable RWKV-7 to achieve:

- **Dynamic decay and learning rate:** $w_t, a_t, b_t$ are all data-dependent, allowing the model to dynamically determine the strength of forgetting and updating based on the context.
- **Enhanced expressiveness:** by introducing explicit state evolution, RWKV-7 can recognize all regular languages. Its theoretical expressiveness surpasses TC0 (Transformers) and reaches NC1.
- **Seamless integration with DPLR chunk parallelism:** because its core is a DPLR structure, RWKV-7 can directly reuse the DPLR chunk-wise algorithm to achieve highly efficient parallel training for long sequences.
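The left/right-multiplication duality between the FLA form and the native RWKV-7 form is a pure transposition, which is easy to check numerically. A small sketch (numpy, shapes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 4, 3
M = rng.normal(size=(K, K))   # affine transition matrix
B = rng.normal(size=(K, V))   # affine bias
S = rng.normal(size=(K, V))   # FLA-convention state (K x V)

left = M @ S + B              # FLA left-multiplication update
right = S.T @ M.T + B.T       # RWKV-7-style right-multiplication on the V x K state
assert np.allclose(left, right.T)   # the two conventions are transposes of each other
```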
## Notation and Conventions

| Symbol | Dimensions | Meaning |
|---|---|---|
| $\mathbf{s}_t$ | $\mathbb{R}^{K \times V}$ | Token-level state matrix |
| $\mathbf{S}$ | $\mathbb{R}^{K \times V}$ | Chunk-level initial state |
| $\mathbf{S}'$ | $\mathbb{R}^{K \times V}$ | Chunk-level final state |
| $\mathbf{k}_t, \mathbf{q}_t$ | $\mathbb{R}^{1 \times K}$ (row vectors) | Token-level key/query |
| $\mathbf{v}_t$ | $\mathbb{R}^{1 \times V}$ (row vector) | Token-level value |
| $\mathbf{a}_t, \mathbf{b}_t$ | $\mathbb{R}^{K \times 1}$ (column vectors) | The two vectors of the low-rank update |
| $\mathbf{K}, \mathbf{V}$ | $\mathbb{R}^{C \times K}$ / $\mathbb{R}^{C \times V}$ | Chunk-level key/value matrices; row $i$ is $\mathbf{k}_i$ / $\mathbf{v}_i$ |
| $\mathbf{A}^{\text{lr}}$ | $\mathbb{R}^{C \times K}$ | Matrix form of the low-rank vectors $\mathbf{a}$; row $i$ is $\mathbf{a}_i^T$ (column vectors arranged as rows) |
| $\mathbf{B}^{\text{lr}}$ | $\mathbb{R}^{C \times K}$ | Matrix form of the low-rank vectors $\mathbf{b}$; row $i$ is $\mathbf{b}_i^T$ (column vectors arranged as rows) |
| $\mathbf{g}_t$ | $\mathbb{R}^{K}$ | Log decay vector (before cumsum) |
| $\mathbf{g}_t^{\text{cum}}$ | $\mathbb{R}^{K}$ | Cumulative log decay (after cumsum) |
| $\mathbf{D}_t = \text{diag}(\exp(\mathbf{g}_t))$ | $\mathbb{R}^{K \times K}$ | Diagonal decay matrix (per-step decay) |
| $\boldsymbol{\Gamma}_i^t = \prod_{j=i}^t \mathbf{D}_j$ | $\mathbb{R}^{K \times K}$ | Cumulative diagonal decay matrix |
| $\mathbf{P}_t = \mathbf{D}_t + \mathbf{b}_t \mathbf{a}_t^T$ | $\mathbb{R}^{K \times K}$ | Transition matrix (diagonal plus rank-1 outer product) |
| $\mathbf{A}_{ab}, \mathbf{A}_{ak}$ | $\mathbb{R}^{C \times C}$ | Strictly lower-triangular attention matrices |
| $\mathbf{W}, \mathbf{U}$ | $\mathbb{R}^{C \times K}$ / $\mathbb{R}^{C \times V}$ | Weighted matrices in the WY representation |
| $\mathbf{w}_i, \mathbf{u}_i$ | $\mathbb{R}^{K}$ / $\mathbb{R}^{V}$ | Weighted vectors in the WY representation (the $i$-th rows) |
| $\tilde{\mathbf{u}}_i$ | $\mathbb{R}^{V}$ | Corrected vector including historical state contributions |
| $\mathbf{M}$ | $\mathbb{R}^{K \times K}$ | Affine transition matrix |
| $\mathbf{B}$ | $\mathbb{R}^{K \times V}$ | Affine bias matrix |
| $\odot$ | – | Hadamard product (element-wise multiplication) |

**Important conventions:**

- In the flash-linear-attention implementation, DPLR adopts the left-multiplication form: $\mathbf{S}_t = \mathbf{P}_t \mathbf{S}_{t-1} + \mathbf{k}_t^T \mathbf{v}_t$
- The state matrix is $\mathbf{S} \in \mathbb{R}^{K \times V}$ (key dim × value dim)
- Note: the native RWKV-7 formula uses the dual right-multiplication form, where the state matrix is $\mathbf{S}_{\text{rwkv}} \in \mathbb{R}^{V \times K}$ and the update is $\mathbf{S}_t = \mathbf{S}_{t-1} \mathbf{P}_t^T + \mathbf{v}_t \mathbf{k}_t^T$.
In the FLA framework, to maintain consistency with KDA and other linear attention mechanisms, we transpose the state matrix to unify everything under the left-multiplication form.

Comparison with KDA:

| Property | KDA | DPLR (FLA Implementation) | RWKV-7 Native |
|---|---|---|---|
| Multiplication direction | Left | Left | Right |
| State dimensions | $\mathbb{R}^{K \times V}$ | $\mathbb{R}^{K \times V}$ | $\mathbb{R}^{V \times K}$ |
| Affine form | $\mathbf{S}' = \mathbf{M}\mathbf{S} + \mathbf{B}$ | $\mathbf{S}' = \mathbf{M}\mathbf{S} + \mathbf{B}$ | $\mathbf{S}' = \mathbf{S}\mathbf{M}^T + \mathbf{B}^T$ |
| Transition matrix | Implicit (Delta Rule) | Explicit (DPLR) | Explicit (DPLR) |

## Core Lemmas

### Lemma 1: Inverse of Lower Triangular Matrices

Let $\mathbf{L} \in \mathbb{R}^{C \times C}$ be a unit lower triangular matrix (diagonal entries are 1, upper triangle is 0). Then $\mathbf{L}^{-1}$ is also unit lower triangular and can be computed via forward substitution.

In particular, if $\mathbf{L} = \mathbf{I} - \mathbf{N}$, where $\mathbf{N}$ is strictly lower triangular (diagonal entries are 0), then:

$$\mathbf{L}^{-1} = \mathbf{I} + \mathbf{N} + \mathbf{N}^2 + \cdots + \mathbf{N}^{C-1}$$

**Proof:** Directly verify that $(\mathbf{I} - \mathbf{N})(\mathbf{I} + \mathbf{N} + \cdots + \mathbf{N}^{C-1}) = \mathbf{I} - \mathbf{N}^C = \mathbf{I}$ (since $\mathbf{N}^C = 0$).
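A quick numerical check of Lemma 1 (a numpy sketch; the chunk size $C = 8$ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
C = 8
N = np.tril(rng.normal(size=(C, C)), k=-1)   # strictly lower triangular, so N^C = 0
L = np.eye(C) - N                            # unit lower triangular

L_inv = np.linalg.inv(L)
neumann = sum(np.linalg.matrix_power(N, p) for p in range(C))  # I + N + ... + N^{C-1}
assert np.allclose(L_inv, neumann)
assert np.allclose(np.tril(L_inv), L_inv)    # the inverse is again unit lower triangular
```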
### Lemma 2: Product Structure of DPLR Matrices

Let $\mathbf{P}_i = \mathbf{D}_i + \mathbf{b}_i \mathbf{a}_i^T$, where $\mathbf{D}_i$ is a diagonal matrix. Then the reverse cumulative product $\mathbf{P}_{t:1} = \prod_{i=t}^1 \mathbf{P}_i = \mathbf{P}_t \mathbf{P}_{t-1} \cdots \mathbf{P}_1$ can be expressed as:

$$\mathbf{P}_{t:1} = \boldsymbol{\Gamma}_1^t + \sum_{i=1}^t (\boldsymbol{\Gamma}_{i+1}^t \mathbf{b}_i) \cdot \mathbf{w}_i^T$$

for suitable row vectors $\mathbf{w}_i^T$, constructed explicitly by the WY representation below. Note that $\mathbf{w}_i^T$ is not simply $\mathbf{a}_i^T \boldsymbol{\Gamma}_1^{i-1}$: whenever an $\mathbf{a}_i^T$ meets an earlier $\mathbf{b}_j$ in the product, a scalar cross term $\mathbf{a}_i^T \boldsymbol{\Gamma}_{j+1}^{i-1} \mathbf{b}_j$ appears and must be folded into $\mathbf{w}_i^T$.

**Note on product direction:** the product accumulates from right to left ($\mathbf{P}_t$ on the leftmost), consistent with the form obtained by expanding the state recurrence $\mathbf{S}_t = \mathbf{P}_t \mathbf{S}_{t-1} + \mathbf{k}_t^T \mathbf{v}_t$. In the expanded summation terms, $\boldsymbol{\Gamma}_{i+1}^t$ is the cumulative decay to the left of $\mathbf{b}_i$ (from $i+1$ to $t$), and the leading term of $\mathbf{w}_i^T$ carries the cumulative decay $\boldsymbol{\Gamma}_1^{i-1}$ to the right of $\mathbf{a}_i^T$ (from $1$ to $i-1$).

**Significance:** this lemma guarantees that the DPLR structure is closed under matrix multiplication, forming the foundation for the existence of the WY representation. The specific form shows that the cumulative product maintains a "diagonal + low-rank" structure.

### Lemma 3: Decomposition of Logarithmic Decay

For cumulative logarithmic decay, we have:

$$\exp(\mathbf{g}_i^{\text{cum}} - \mathbf{g}_j^{\text{cum}}) = \exp(\mathbf{g}_i^{\text{cum}}) \odot \exp(-\mathbf{g}_j^{\text{cum}})$$

This allows the decay computation to be expressed as the outer product of two gated vectors.

## The Recurrent Form of DPLR

### Basic Recurrence

The state update equation for DPLR is:

$$\mathbf{S}_t = \exp(\mathbf{g}_t) \odot \mathbf{S}_{t-1} + \mathbf{k}_t^T \mathbf{v}_t + \mathbf{b}_t (\mathbf{a}_t^T \mathbf{S}_{t-1})$$

Or in matrix form:

$$\mathbf{S}_t = (\mathbf{D}_t + \mathbf{b}_t \mathbf{a}_t^T) \mathbf{S}_{t-1} + \mathbf{k}_t^T \mathbf{v}_t$$

Where:

- First term $\mathbf{S}_{t-1} \odot \exp(\mathbf{g}_t)$: dimension-wise decay (Hadamard product form)
- Second term $\mathbf{k}_t^T \mathbf{v}_t$: standard key-value outer product update
- Third term $\mathbf{b}_t (\mathbf{a}_t^T \mathbf{S}_{t-1})$: low-rank update, projecting the state through $\mathbf{a}_t^T$ (yielding $1 \times V$) and expanding through $\mathbf{b}_t$ (yielding $K \times V$)

### Expanding the Recurrence

To understand chunk-wise parallelism, let's expand the first few time steps:

$$
\begin{aligned}
\mathbf{S}_1 &= \mathbf{P}_1 \mathbf{S}_0 + \mathbf{k}_1^T \mathbf{v}_1 \\
\mathbf{S}_2 &= \mathbf{P}_2 \mathbf{S}_1 + \mathbf{k}_2^T \mathbf{v}_2 \\
&= \mathbf{P}_2 (\mathbf{P}_1 \mathbf{S}_0 + \mathbf{k}_1^T \mathbf{v}_1) + \mathbf{k}_2^T \mathbf{v}_2 \\
&= \mathbf{P}_2 \mathbf{P}_1 \mathbf{S}_0 + \mathbf{P}_2 \mathbf{k}_1^T \mathbf{v}_1 + \mathbf{k}_2^T \mathbf{v}_2
\end{aligned}
$$

General form:

$$\mathbf{S}_t = \left( \prod_{i=t}^1 \mathbf{P}_i \right) \mathbf{S}_0 + \sum_{i=1}^t \left( \prod_{j=t}^{i+1} \mathbf{P}_j \right) \mathbf{k}_i^T \mathbf{v}_i$$

**Challenge:** directly computing the cumulative transition matrix $\mathbf{P}_{t:1} = \prod_{i=t}^1 \mathbf{P}_i$ requires $O(t)$ matrix multiplications. How can we achieve parallelism?
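To make the challenge concrete, the following numpy sketch (small arbitrary shapes) checks the expanded general form against the recurrence; note the nested products, i.e., $O(t)$ dense matrix multiplications:

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, V = 5, 4, 3
k, a, b, g = (rng.normal(size=(T, K)) * 0.1 for _ in range(4))
v = rng.normal(size=(T, V))
S0 = rng.normal(size=(K, V))

P = [np.diag(np.exp(g[t])) + np.outer(b[t], a[t]) for t in range(T)]

# Recurrence
S = S0.copy()
for t in range(T):
    S = P[t] @ S + np.outer(k[t], v[t])

def prod(mats):
    """Left-accumulated product P_t ... P_1 (identity if empty)."""
    out = np.eye(K)
    for M in mats:
        out = M @ out
    return out

# Expanded general form: S_T = (P_T ... P_1) S_0 + sum_i (P_T ... P_{i+1}) k_i^T v_i
S_expanded = prod(P) @ S0 + sum(prod(P[i + 1:]) @ np.outer(k[i], v[i]) for i in range(T))
assert np.allclose(S, S_expanded)
```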
## WY Representation: Decomposition of Cumulative Transition Matrices

### Core Problem

We need to represent the product of cumulative transition matrices efficiently (note the left-multiplication order, accumulating from right to left):

$$\mathbf{P}_{t:1} = \prod_{i=t}^1 (\mathbf{D}_i + \mathbf{b}_i \mathbf{a}_i^T)$$

**Key insight:** the product of diagonal-plus-low-rank matrices retains the "diagonal + low-rank" structure and can be decomposed into a diagonal accumulation plus weighted sums of low-rank outer products.

### Defining Cumulative Diagonal Decay

Let:

$$\boldsymbol{\Gamma}_i^t = \prod_{j=i}^t \mathbf{D}_j = \text{diag}\left(\exp\left(\sum_{j=i}^t \mathbf{g}_j\right)\right)$$

When $i > t$, define $\boldsymbol{\Gamma}_i^t = \mathbf{I}$ (the identity matrix).

### Theorem (WY Representation for DPLR)

The cumulative transition matrix can be decomposed as:

$$\mathbf{P}_{t:1} = \boldsymbol{\Gamma}_1^t + \sum_{i=1}^t (\boldsymbol{\Gamma}_{i+1}^t \mathbf{b}_i) \cdot \mathbf{w}_i^T$$

where the weighted row vectors $\mathbf{w}_i^T$ are defined recursively by

$$\mathbf{w}_i^T = \mathbf{a}_i^T \boldsymbol{\Gamma}_1^{i-1} + \sum_{j=1}^{i-1} (\mathbf{a}_i^T \boldsymbol{\Gamma}_{j+1}^{i-1} \mathbf{b}_j) \cdot \mathbf{w}_j^T$$

and each coefficient $(\mathbf{a}_i^T \boldsymbol{\Gamma}_{j+1}^{i-1} \mathbf{b}_j)$ is a scalar.

**Motivation for the definition:** to make the WY representation compact, $\mathbf{w}_i^T$ accumulates the influence of all historical low-rank updates up to step $i$: its leading term is the directly decayed $\mathbf{a}_i^T \boldsymbol{\Gamma}_1^{i-1}$, and the sum collects the scalar cross terms created when $\mathbf{a}_i^T$ meets an earlier $\mathbf{b}_j$. This is analogous to how the classical WY representation accumulates the weights of Householder transformations.

**Connection to the classical WY representation:** the classical WY representation decomposes a product of Householder matrices as $\mathbf{Q} = \mathbf{I} - \mathbf{W}\mathbf{Y}^T$. The DPLR WY representation is its generalization: $\mathbf{I}$ is replaced by $\boldsymbol{\Gamma}_1^t$ (diagonal accumulation), and the standard low-rank outer product is replaced by a weighted sum.

### Proof (by Induction)

**Base case $t=1$:**

$$\mathbf{P}_1 = \mathbf{D}_1 + \mathbf{b}_1 \mathbf{a}_1^T = \boldsymbol{\Gamma}_1^1 + (\boldsymbol{\Gamma}_2^1 \mathbf{b}_1) \cdot \mathbf{w}_1^T$$

Since $\boldsymbol{\Gamma}_1^1 = \mathbf{D}_1$, $\boldsymbol{\Gamma}_2^1 = \mathbf{I}$, and $\mathbf{w}_1^T = \mathbf{a}_1^T$, the equality holds.

**Inductive step:** assume the formula holds for $t$; prove it for $t+1$.

$$
\begin{aligned}
\mathbf{P}_{t+1:1} &= \mathbf{P}_{t+1} \cdot \mathbf{P}_{t:1} \\
&= (\mathbf{D}_{t+1} + \mathbf{b}_{t+1} \mathbf{a}_{t+1}^T)\left(\boldsymbol{\Gamma}_1^t + \sum_{i=1}^t (\boldsymbol{\Gamma}_{i+1}^t \mathbf{b}_i) \cdot \mathbf{w}_i^T\right) \\
&= \boldsymbol{\Gamma}_1^{t+1} + \sum_{i=1}^t (\boldsymbol{\Gamma}_{i+1}^{t+1} \mathbf{b}_i) \cdot \mathbf{w}_i^T \\
&\quad + \mathbf{b}_{t+1} \cdot \underbrace{\left(\mathbf{a}_{t+1}^T \boldsymbol{\Gamma}_1^t + \sum_{i=1}^t (\mathbf{a}_{t+1}^T \boldsymbol{\Gamma}_{i+1}^t \mathbf{b}_i) \cdot \mathbf{w}_i^T\right)}_{\eqqcolon \mathbf{w}_{t+1}^T} \\
&= \boldsymbol{\Gamma}_1^{t+1} + \sum_{i=1}^{t+1} (\boldsymbol{\Gamma}_{i+1}^{t+1} \mathbf{b}_i) \cdot \mathbf{w}_i^T
\end{aligned}
$$

Where we used $\boldsymbol{\Gamma}_{t+2}^{t+1} = \mathbf{I}$. Q.E.D.
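The theorem is easy to verify numerically. The sketch below (numpy, small arbitrary shapes, 0-indexed tokens) builds the recursive $\mathbf{w}_i$ and compares the decomposition against the directly accumulated product:

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 5, 4
a, b, g = (rng.normal(size=(T, K)) * 0.1 for _ in range(3))
D = [np.diag(np.exp(g[t])) for t in range(T)]

def gamma(i, t):
    """Gamma_i^t = D_i ... D_t (diagonal), identity if i > t."""
    out = np.eye(K)
    for j in range(i, t + 1):
        out = out @ D[j]
    return out

# Recursive weighted vectors w_i^T
w = []
for i in range(T):
    wi = a[i] @ gamma(0, i - 1)
    for j in range(i):
        wi = wi + (a[i] @ gamma(j + 1, i - 1) @ b[j]) * w[j]  # scalar cross term
    w.append(wi)

# Directly accumulated product P_T ... P_1
P_cum = np.eye(K)
for t in range(T):
    P_cum = (D[t] + np.outer(b[t], a[t])) @ P_cum

# WY decomposition: Gamma + sum of decayed low-rank outer products
wy = gamma(0, T - 1) + sum(np.outer(gamma(i + 1, T - 1) @ b[i], w[i]) for i in range(T))
assert np.allclose(P_cum, wy)
```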
### WY Representation of the State

Substituting the WY representation into the state recurrence (taking the chunk-initial state $\mathbf{S}_0 = \mathbf{0}$ for now; the initial-state contribution is restored below via $\tilde{\mathbf{u}}_i$), we obtain:

$$\mathbf{S}_t = \sum_{i=1}^t (\boldsymbol{\Gamma}_{i+1}^t \mathbf{k}_i^T \mathbf{v}_i + \boldsymbol{\Gamma}_{i+1}^t \mathbf{b}_i \mathbf{u}_i^T)$$

Where $\mathbf{u}_i^T$ ($1 \times V$ row vector) satisfies:

$$
\mathbf{u}_i^T =
\begin{cases}
\mathbf{0}, & i=1 \\
\sum_{j=1}^{i-1} (\mathbf{a}_i^T \boldsymbol{\Gamma}_{j+1}^{i-1} \mathbf{k}_j^T \mathbf{v}_j + \mathbf{a}_i^T \boldsymbol{\Gamma}_{j+1}^{i-1} \mathbf{b}_j \mathbf{u}_j^T), & i \geq 2
\end{cases}
$$

### Matrix Form of the Linear System

Define matrices within a chunk (row $i$ is the corresponding vector; the following applies to the left-multiplication DPLR):

- $\mathbf{A}_{ab} \in \mathbb{R}^{C \times C}$: $[\mathbf{A}_{ab}]_{ij} = \mathbf{a}_i^T \boldsymbol{\Gamma}_{j+1}^{i-1} \mathbf{b}_j$ for $i > j$, zero elsewhere
- $\mathbf{A}_{ak} \in \mathbb{R}^{C \times C}$: $[\mathbf{A}_{ak}]_{ij} = \mathbf{a}_i^T \boldsymbol{\Gamma}_{j+1}^{i-1} \mathbf{k}_j^T$ for $i > j$, zero elsewhere

Because the recursions for $\mathbf{w}_i^T$ and $\mathbf{u}_i^T$ add their historical terms with a *plus* sign, the system matrix is $\mathbf{I} - \mathbf{A}_{ab}$, which is unit lower triangular.

Let:

- $\mathbf{A}^{\text{gate}} = \mathbf{A}^{\text{lr}} \odot \exp(\mathbf{G}^{\text{cum}}) \in \mathbb{R}^{C \times K}$ (gated low-rank vector matrix), where $\mathbf{A}^{\text{lr}} \in \mathbb{R}^{C \times K}$ has row $i$ equal to $\mathbf{a}_i^T$, and $\mathbf{G}^{\text{cum}}$ has row $i$ equal to $\mathbf{g}_{i-1}^{\text{cum}}$ (the shifted cumsum, matching $\boldsymbol{\Gamma}_1^{i-1}$; the implementation absorbs one extra decay factor, see the Index Absorption note below)

The matrix form of the WY representation is:

$$\mathbf{W} = (\mathbf{I} - \mathbf{A}_{ab})^{-1} \mathbf{A}^{\text{gate}}$$

$$\mathbf{U} = (\mathbf{I} - \mathbf{A}_{ab})^{-1} \mathbf{A}_{ak} \mathbf{V}$$

This is structurally similar to the WY representation in KDA. The difference: in KDA the Delta Rule residual subtracts history, so $\tilde{\mathbf{V}} = \mathbf{U} - \mathbf{W}\mathbf{S}$ (minus sign) and the system matrix is $\mathbf{I} + \mathbf{A}_{kk}$; in DPLR the low-rank update superimposes history positively, so $\tilde{\mathbf{U}} = \mathbf{U} + \mathbf{W}\mathbf{S}$ (plus sign) and the system matrix is $\mathbf{I} - \mathbf{A}_{ab}$. The same sign difference appears in the Affine parameter $\mathbf{M}$: KDA uses $\text{diag}(\cdot) - \mathbf{K}^T \mathbf{W}$, while DPLR uses $\text{diag}(\cdot) + \mathbf{B}^T \mathbf{W}$.
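Putting the matrix form into code: a sketch (numpy + scipy assumed) that builds $\mathbf{A}_{ab}$, $\mathbf{A}_{ak}$ via the Lemma 3 gating trick, using the shifted cumsum of the mathematical definition, and solves the unit lower triangular system:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(2)
C, K, V = 6, 4, 3
a, b, k, g = (rng.normal(size=(C, K)) * 0.1 for _ in range(4))
v = rng.normal(size=(C, V))
g_cum = np.cumsum(g, axis=0)                    # g_i^cum
g_prev = np.vstack([np.zeros(K), g_cum[:-1]])   # shifted cumsum: g_{i-1}^cum

# Lemma 3: a_i^T Gamma_{j+1}^{i-1} b_j = sum_d a_{i,d} exp(g_{i-1,d} - g_{j,d}) b_{j,d}
ag = a * np.exp(g_prev)                         # A^gate (gated a, shifted cumsum)
A_ab = np.tril(ag @ (b * np.exp(-g_cum)).T, k=-1)
A_ak = np.tril(ag @ (k * np.exp(-g_cum)).T, k=-1)

L = np.eye(C) - A_ab                            # unit lower triangular system matrix
W = solve_triangular(L, ag, lower=True, unit_diagonal=True)        # rows are w_i^T
U = solve_triangular(L, A_ak @ v, lower=True, unit_diagonal=True)  # rows are u_i^T
```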
## Core Theorem: Chunk-wise Affine Form

### Theorem (Chunk-wise Affine Form for DPLR)

Let the state at the beginning of a chunk be $\mathbf{S} \in \mathbb{R}^{K \times V}$. Then the state at the end of the chunk is:

$$\mathbf{S}' = \mathbf{M} \mathbf{S} + \mathbf{B}$$

Where:

- Transition matrix $\mathbf{M} \in \mathbb{R}^{K \times K}$:
$$\mathbf{M} = \text{diag}(\exp(\mathbf{g}_{\text{last}})) + \mathbf{B}_{\text{decayed}}^T \mathbf{W}$$
- Bias matrix $\mathbf{B} \in \mathbb{R}^{K \times V}$:
$$\mathbf{B} = \mathbf{K}_{\text{decayed}}^T \mathbf{V} + \mathbf{B}_{\text{decayed}}^T \mathbf{U}$$

And the chunk output is:

$$\mathbf{O} = \mathbf{Q} \mathbf{S} + \text{mask}(\mathbf{A}_{qk}) \mathbf{V} + \text{mask}(\mathbf{A}_{qb}) (\mathbf{U} + \mathbf{W} \mathbf{S})$$

### Proof

State update:

$$
\begin{aligned}
\mathbf{S}' &= \text{diag}(\exp(\mathbf{g}_{\text{last}})) \mathbf{S} + \sum_{i=1}^C \exp(\mathbf{g}_{\text{last}} - \mathbf{g}_i) \odot (\mathbf{k}_i^T \mathbf{v}_i + \mathbf{b}_i \tilde{\mathbf{u}}_i) \\
&= \text{diag}(\exp(\mathbf{g}_{\text{last}})) \mathbf{S} + \mathbf{K}_{\text{decayed}}^T \mathbf{V} + \mathbf{B}_{\text{decayed}}^T \tilde{\mathbf{U}}
\end{aligned}
$$

Where $\tilde{\mathbf{u}}_i = \mathbf{u}_i + \mathbf{w}_i \mathbf{S}$ ($1 \times V$ row vector) is the corrected vector including historical state contributions. Here $\mathbf{w}_i \in \mathbb{R}^{1 \times K}$ (row vector), $\mathbf{S} \in \mathbb{R}^{K \times V}$, and the product $\mathbf{w}_i \mathbf{S} \in \mathbb{R}^{1 \times V}$, so the dimensions match.

Substitute the matrix form of the WY representation, $\tilde{\mathbf{U}} = \mathbf{U} + \mathbf{W} \mathbf{S}$. Note the plus sign here, different from KDA, where $\tilde{\mathbf{V}} = \mathbf{U} - \mathbf{W} \mathbf{S}$ carries a minus sign: KDA's WY representation separates the Delta Rule residual $\mathbf{v}_i - \mathbf{k}_i \mathbf{S}$, where the minus comes from "subtracting the historical prediction"; DPLR has no Delta Rule structure, the low-rank part $\mathbf{b}_i \mathbf{a}_i^T$ is directly superimposed onto the state, so the contribution from historical states accumulates positively:

$$
\begin{aligned}
\mathbf{S}' &= \text{diag}(\exp(\mathbf{g}_{\text{last}})) \mathbf{S} + \mathbf{K}_{\text{decayed}}^T \mathbf{V} + \mathbf{B}_{\text{decayed}}^T (\mathbf{U} + \mathbf{W} \mathbf{S}) \\
&= \underbrace{(\text{diag}(\exp(\mathbf{g}_{\text{last}})) + \mathbf{B}_{\text{decayed}}^T \mathbf{W})}_{\mathbf{M}} \mathbf{S} + \underbrace{(\mathbf{K}_{\text{decayed}}^T \mathbf{V} + \mathbf{B}_{\text{decayed}}^T \mathbf{U})}_{\mathbf{B}}
\end{aligned}
$$

(Note: a detailed derivation of the cross terms requires considering the specific relationship between $\mathbf{W}$ and $\mathbf{K}_{\text{decayed}}$; the main structure is presented here.)

Output computation follows similarly.

## Algorithm Implementation: From Theory to Code

Based on the above theorems, the chunk-wise algorithm for DPLR proceeds as follows (pseudocode; `exp`, `diag`, `mask`, `triu_mask`, and `forward_substitution_inverse` are assumed helpers):

```python
def chunk_dplr(Q, K, V, A, B_lr, G, S=None):
    """
    Q, K:    [C, K]  queries / keys within the chunk
    V:       [C, V]  values
    A, B_lr: [C, K]  low-rank vectors a, b (row i is a_i^T / b_i^T)
    G:       [C, K]  cumulative log decay
    S:       [K, V]  state at chunk start (None means zeros)
    """
    # Step 1: gated inputs (relative-decay trick for numerical stability)
    ag = A * exp(G)             # gated a; absorbs exp(g_i) (see Index Absorption note)
    kg = K * exp(G[-1] - G)     # keys decayed to the end of the chunk (K_decayed)
    bg = B_lr * exp(G[-1] - G)  # b decayed to the end of the chunk (B_decayed)
    qg = Q * exp(G)             # forward-gated queries

    # Step 2: strictly lower triangular A_ab, A_ak (Lemma 3 decomposition)
    # A_ab[i, j] = dot(a_i * exp(g_i - g_j), b_j) for i > j
    A_ab = (ag @ (B_lr * exp(-G)).T).masked_fill_(triu_mask, 0)
    A_ak = (ag @ (K * exp(-G)).T).masked_fill_(triu_mask, 0)

    # Step 3: invert the unit lower triangular system via forward substitution
    # (minus sign: the w/u recursions add history with a plus sign)
    A_ab_inv = forward_substitution_inverse(I - A_ab)

    # Step 4: WY representation
    W = A_ab_inv @ ag            # [C, K]
    U = A_ab_inv @ (A_ak @ V)    # [C, V]

    # Step 5: Affine parameters
    M = diag(exp(G[-1])) + bg.T @ W    # [K, K] transition matrix
    B_mat = kg.T @ V + bg.T @ U        # [K, V] bias matrix

    # Step 6: state update (with S = 0, S_next = B_mat)
    S = zeros_like(B_mat) if S is None else S
    S_next = M @ S + B_mat

    # Step 7: chunk output, O = Q S + mask(A_qk) V + mask(A_qb) (U + W S)
    O = qg @ S \
        + mask(qg @ (K * exp(-G)).T) @ V \
        + mask(qg @ (B_lr * exp(-G)).T) @ (U + W @ S)
    return M, B_mat, S_next, O, W, U
```
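The pseudocode leaves `forward_substitution_inverse` abstract. A minimal numpy sketch of such a helper (hypothetical, not the FLA kernel) could be:

```python
import numpy as np

def forward_substitution_inverse(L):
    """Invert a unit lower triangular matrix row by row (Lemma 1).

    O(C^3) for chunk size C; each row i only needs already-finished rows j < i.
    """
    C = L.shape[0]
    X = np.eye(C)
    for i in range(1, C):
        X[i, :i] = -L[i, :i] @ X[:i, :i]   # X[i] = e_i - sum_{j<i} L[i,j] X[j]
    return X
```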
### Key Implementation Details

- **Matrix inversion:** $(\mathbf{I} - \mathbf{A}_{ab})^{-1}$ is the inverse of a unit lower triangular matrix, which can be computed via forward substitution in $O(C^3)$ time ($C$ is the chunk size, typically 64 or 128)
- **Relative decay trick:** the code uses $\exp(\mathbf{g}_{\text{last}} - \mathbf{g})$ rather than $\exp(\mathbf{g})$ directly, for numerical stability
- **Index absorption convention:** in the code, `ag = A * exp(G)` absorbs $\exp(\mathbf{g}_i)$ into $\mathbf{a}_i$, so the computed $\mathbf{A}_{ab}$ is actually $[\mathbf{A}_{ab}]_{ij} = \mathbf{a}_i^T \boldsymbol{\Gamma}_{j+1}^{i} \mathbf{b}_j$ (including the $\mathbf{g}_i$ factor), rather than $\mathbf{a}_i^T \boldsymbol{\Gamma}_{j+1}^{i-1} \mathbf{b}_j$ from the mathematical definition. Correspondingly, the computed $\mathbf{W}$ also absorbs this extra factor, ensuring the final Affine parameters $\mathbf{M}, \mathbf{B}$ remain correct. This absorption simplifies the implementation by avoiding explicit index shifts
- **Block-wise computation:** when $K$ is large, the key/value dimensions need to be blocked to fit in GPU shared memory
- **Precision control:** as in KDA, intermediate computations use float32, while storage uses bf16/fp16

## DPLR vs KDA vs IPLR

### A Unified Perspective on Three Variants

| Variant | Transition Matrix | Multiplication Direction | Core Feature |
|---|---|---|---|
| IPLR | $\mathbf{I} + \mathbf{b}\mathbf{a}^T$ | Right (historically) | Identity + low rank, no explicit decay |
| KDA | Implicit (via Delta Rule) | Left | Per-dim decay + Delta Rule |
| DPLR | $\text{diag}(\exp(\mathbf{g})) + \mathbf{b}\mathbf{a}^T$ | Left | Diagonal decay + low rank |

### Mathematical Connections

- **IPLR is a special case of DPLR:** when $\mathbf{g}_t = \mathbf{0}$ (i.e., $\mathbf{D}_t = \mathbf{I}$), DPLR reduces to IPLR
- **Duality between RWKV-7 and DPLR:**
  - DPLR (FLA): $\mathbf{S}' = \mathbf{M}\mathbf{S} + \mathbf{B}$ (left multiplication, column-space update)
  - RWKV-7: $\mathbf{S}' = \mathbf{S}\mathbf{M}^T + \mathbf{B}^T$ (right multiplication, row-space update)
- **Unified framework:** both ultimately reduce to the Affine transformation form

## CP Parallelism and Multi-Level Parallelism

### Affine Chain Rule (Left-Multiplication Version)

DPLR state updates also satisfy the Affine form and permit chain composition. Let:

- $\mathbf{S}_1 = \mathbf{M}_0 \mathbf{S}_0 + \mathbf{B}_0$
- $\mathbf{S}_2 = \mathbf{M}_1 \mathbf{S}_1 + \mathbf{B}_1$

Then:

$$\mathbf{S}_2 = \underbrace{(\mathbf{M}_1 \mathbf{M}_0)}_{\mathbf{M}_{01}} \mathbf{S}_0 + \underbrace{(\mathbf{M}_1 \mathbf{B}_0 + \mathbf{B}_1)}_{\mathbf{B}_{01}}$$

### CP Parallelism Algorithm

Similar to KDA:

1. **Local computation:** each rank assumes $\mathbf{S} = \mathbf{0}$ and computes $(\mathbf{M}_r, \mathbf{B}_r)$
2. **All-Gather:** collect the Affine parameters from all ranks
3. **Prefix scan:** rank $r$ computes its true initial state
$$\mathbf{S}_r = \sum_{j=0}^{r-1} \left( \prod_{k=j+1}^{r-1} \mathbf{M}_k \right) \mathbf{B}_j$$
4. **Local recomputation:** recompute the chunk outputs with the correct $\mathbf{S}_r$
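A serial sketch of steps 1–3 (numpy; a production system would overlap this with communication or use a parallel scan). `compose` implements the affine chain rule from above:

```python
import numpy as np

def compose(M1, B1, M0, B0):
    """Affine chain rule: apply (M0, B0) first, then (M1, B1)."""
    return M1 @ M0, M1 @ B0 + B1

def prefix_states(Ms, Bs, K, V):
    """True initial state for every rank from the locally computed (M_r, B_r).

    Equivalent to S_r = sum_{j<r} (M_{r-1} ... M_{j+1}) B_j with S_0 = 0.
    """
    S = [np.zeros((K, V))]
    for r in range(len(Ms) - 1):
        S.append(Ms[r] @ S[r] + Bs[r])   # S_{r+1} = M_r S_r + B_r
    return S
```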
### SM Parallelism

SM parallelism is also applicable: long sequences are divided into multiple subsequences, and states are merged through two-level Affine composition.

## Summary

We have established a complete mathematical theory for DPLR from the perspective of explicit transition matrices:

1. **Core of DPLR:** the diagonal-plus-low-rank transition matrix $\mathbf{P}_t = \text{diag}(\exp(\mathbf{g}_t)) + \mathbf{b}_t \mathbf{a}_t^T$
2. **WY representation:** decomposing the cumulative transition matrix into diagonal and low-rank components
$$\mathbf{P}_{t:1} = \boldsymbol{\Gamma}_1^t + \sum_{i=1}^t (\boldsymbol{\Gamma}_{i+1}^t \mathbf{b}_i) \cdot \mathbf{w}_i^T$$
3. **Chunk-wise Affine:** $\mathbf{S}' = \mathbf{M}\mathbf{S} + \mathbf{B}$
4. **Unified framework:** DPLR, KDA, and IPLR are all special cases of Affine transformations, supporting the same parallel paradigms

The mathematical derivations in this article are based on our theoretical framework and implementations in Flash Linear Attention (FLA).

---

# The Mathematics of KDA (Kimi Delta Attention)

> This article assumes familiarity with linear algebra (matrix multiplication, outer products, inverse matrices) and basic sequence modeling concepts.

## Abstract

This article derives the **chunk-wise parallel algorithm** for **KDA (Kimi Delta Attention)**.
Core contributions:

1. Proving that KDA's chunk state update can be expressed as an Affine transformation: $\mathbf{S}' = \mathbf{M}\mathbf{S} + \mathbf{B}$
2. Decomposing the residual computation into history-independent components via the WY representation, enabling parallel computation
3. Deriving the mathematical foundation for CP (Context Parallelism) based on the compositional properties of Affine transformations

Advantages of KDA over standard Attention: $O(N)$ complexity and a constant-memory state, suitable for ultra-long sequences.

## Table of Contents

- Introduction: From Transformer to Linear Attention
- The Development of Linear Attention
- Notation and Conventions
- Background: From GDN to KDA
- Core Lemmas
- State Update Mechanism of KDA
- WY Representation: Separation of Dependencies
- Core Theorem: Chunk-wise Affine Form
- Algorithm Implementation: From Theory to Code
- CP Parallelism and SM Parallelism
- Summary
- Appendix: GDN vs KDA
- References

## Introduction: From Transformer to Linear Attention

### Bottleneck of Standard Attention

Since its introduction in 2017, the Transformer architecture has become the mainstream method for natural language processing and sequence modeling. Its core component, the Self-Attention mechanism, captures long-range dependencies by computing attention weights between all pairs of tokens in a sequence:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

However, this standard Softmax Attention has significant computational bottlenecks:

- **$O(N^2)$ complexity:** computing the attention matrix requires $O(N^2)$ time and space
- **Memory wall:** as the sequence length $N$ increases, memory usage grows quadratically
- **Low inference efficiency:** autoregressive generation requires caching all historical KV, a huge memory overhead

For long-sequence tasks (e.g., document understanding, code generation, multi-turn dialogue), $N$ can reach hundreds of thousands or even millions, making standard Attention infeasible.

### Motivation for Linear Attention

Linear Attention [1] removes the Softmax and rewrites attention in RNN form. The complete form includes both a numerator (value accumulation) and a denominator (normalization accumulation):

$$\mathbf{o}_t = \frac{\phi(\mathbf{q}_t)^T \mathbf{S}_t}{\phi(\mathbf{q}_t)^T \mathbf{Z}_t}$$

where both states are updated recursively:

$$
\begin{aligned}
\mathbf{S}_t &= \mathbf{S}_{t-1} + \phi(\mathbf{k}_t) \otimes \mathbf{v}_t \\
\mathbf{Z}_t &= \mathbf{Z}_{t-1} + \phi(\mathbf{k}_t)
\end{aligned}
$$

Here $\mathbf{S}_t \in \mathbb{R}^{d_k \times d_v}$ is the state matrix and $\mathbf{Z}_t \in \mathbb{R}^{d_k}$ is the normalizer vector.
In practice, the denominator normalization can be approximated by subsequent layers such as RMSNorm, so it is often omitted to simplify computation, yielding a cleaner form:

$$\mathbf{S}_t = \mathbf{S}_{t-1} + \phi(\mathbf{k}_t) \otimes \mathbf{v}_t, \quad \mathbf{o}_t = \phi(\mathbf{q}_t)^T \mathbf{S}_t$$

This form has $O(N)$ complexity, and inference only requires maintaining a fixed-size state matrix.

### Contributions of This Article

This article focuses on Kimi Delta Attention (KDA), introduced in Kimi Linear [2], a new generation of Linear Attention architecture that combines:

- **Delta Rule:** only updates information related to prediction errors
- **Per-dimension decay:** different dimensions can have independent forgetting rates
- **Chunk-wise parallelism:** hardware-efficient parallel training through the WY representation

We will build the complete mathematical theory of KDA from the most basic matrix multiplication lemmas.

## The Development of Linear Attention

Linear Attention research has evolved from early attempts to mimic Softmax Attention, to gradually developing its own characteristics, and recently to exploring higher-level guiding principles (such as the Delta Rule), going through several important stages.

### 1. Foundational Period (2020): From Approximation to Reconstruction

Katharopoulos et al. [1] published the groundbreaking work "Transformers are RNNs" at ICML 2020, first reformulating Transformers into RNN form. They proved that through a feature mapping $\phi$, linear-complexity attention mechanisms can be constructed.

Early Linear Attention mainly mimicked and approximated Softmax Attention:

- Directly removing exp from softmax to obtain $O = (QK^\top \odot M)V$
- Adding non-negative activation functions (e.g., elu+1) to Q, K for numerical stability
- Performer [3] used random Fourier features to approximate Softmax

However, subsequent research found that normalization along the sequence dimension cannot completely avoid numerical instability; it is better to use post-hoc normalization (e.g., RMSNorm), and activation functions for Q, K are not strictly necessary.

### 2. Introduction of Forgetting Mechanisms (2021–2023)

Pure Linear Attention is essentially a cumsum, weighting all historical information equally, so information from distant tokens contributes almost nothing. The introduction of forgetting mechanisms solved this problem:

- **LRU (2023):** Linear Recurrent Unit, introducing scalar decay factors
- **RetNet (2023):** first combined forgetting factors with Linear Attention, $S_t = \gamma S_{t-1} + v_t k_t^\top$, where $\gamma \in (0,1)$ is a constant decay
- **RWKV-4 [4] (2023):** pure RNN architecture combining the constant inference memory of RNNs with the parallel training advantages of Transformers, using channel-wise decay

A detail of RetNet is adding RoPE to Q, K, equivalent to generalizing decay to the complex domain; from the LRU perspective, this considers complex eigenvalues.
### 3. Data-Dependent Decay (2023–2024)

Extending static decay to input-dependent dynamic decay led to a series of works:

- **Mamba [5]:** introduced input-dependent gating mechanisms
- **Mamba2 [6][7]:** proposed the SSD framework, reinterpreting from the state space model perspective
- **GLA [8]:** used the outer-product form of forgetting gates, enabling GPU-efficient matrix multiplication parallelism
- **RWKV-5/6 [9] (2024):** Eagle and Finch architectures, introducing matrix-valued states and dynamic recurrence

Works at this stage are very similar to the "forgetting gates" in traditional RNNs like GRU and LSTM, except that to maintain linearity, the gating's dependence on the state is removed.

### 4. RWKV: An Independent Pure RNN Architecture

RWKV (Receptance Weighted Key Value) is a series of pure-RNN-architecture LLMs proposed by Peng Bo et al., developed in parallel with Linear Attention but along a different technical route: RWKV emphasizes maintaining a pure RNN form (passing historical information only through a fixed-size state), while Linear Attention focuses on using matrix multiplication to achieve chunk-wise parallel computation.

| Version | Time | Core Features |
|---|---|---|
| RWKV-4 [4] | 2023 | Basic architecture, introducing the Receptance mechanism and channel-wise time decay |
| RWKV-5 (Eagle) [9] | 2024 | Matrix-valued states, enhanced expressiveness |
| RWKV-6 (Finch) [9] | 2024 | Data-dependent token shift and dynamic recurrence |
| RWKV-7 [10] | 2025 | Generalized Delta Rule, vector-valued gating and in-context learning rate |

The unique aspect of RWKV is its completely RNN-based form, achieving efficient sequence modeling through carefully designed state update mechanisms.

### 5. The Rise of the Delta Rule (2024–2025)

The Delta Rule was originally a parameter update rule in neural networks (the Widrow-Hoff rule), recently introduced into sequence modeling as a form of "Test Time Training":

- **TTT (2024):** treats sequence model construction as an online learning problem, building RNNs with optimizers
- **DeltaNet [11] (NeurIPS 2024):** applied the Delta Rule to Linear Attention
- **Gated DeltaNet (GDN) [12] (2024):** introduced gating mechanisms to control information flow
- **RWKV-7 [10] (2025):** independently introduced a generalized Delta Rule
- **KDA [2] (2025):** introduced in Kimi Linear, extending scalar decay to per-dimension decay

The core idea of the Delta Rule is to update the state only when new information differs from historical predictions, similar to human incremental learning and highly aligned with TTT's "online learning" perspective.

### Comparison of Variants

| Method | Update Rule | Complexity | Key Features |
|---|---|---|---|
| Softmax Attention | $\text{softmax}(QK^T)V$ | $O(N^2)$ | Global dependencies, accurate but slow |
| Linear Attention | $\phi(Q)^T \sum \phi(K)V^T$ | $O(N)$ | Fixed state, efficient but weaker expressiveness |
| RetNet | $S_t = \gamma S_{t-1} + v_t k_t^\top$ | $O(N)$ | Constant decay + RoPE |
| RWKV-4/5/6 | Receptance + time decay | $O(N)$ | Pure RNN architecture, parallel training |
| Mamba | Input-dependent state transition | $O(N)$ | Selective, hardware-optimized |
| GLA | Gated Linear Attention | $O(N)$ | Outer-product gates, GPU-efficient |
| DeltaNet | Delta Rule | $O(N)$ | Content-aware incremental updates |
| GDN | Delta + scalar gating | $O(N)$ | Global forgetting control |
| RWKV-7 | Generalized Delta Rule | $O(N)$ | Vector-valued gating |
| KDA | Delta + per-dim gating | $O(N)$ | Dimension-selective forgetting |
## Notation and Conventions

| Symbol | Dimension | Meaning |
|---|---|---|
| $\mathbf{s}_t$ | $\mathbb{R}^{K \times V}$ | Token-level state matrix |
| $\mathbf{S}$ | $\mathbb{R}^{K \times V}$ | Chunk-level initial state |
| $\mathbf{S}'$ | $\mathbb{R}^{K \times V}$ | Chunk-level final state |
| $\mathbf{k}_t, \mathbf{q}_t$ | $\mathbb{R}^{1 \times K}$ (row vectors) | Token-level key/query |
| $\mathbf{v}_t$ | $\mathbb{R}^{1 \times V}$ (row vector) | Token-level value |
| $\mathbf{K}, \mathbf{Q}, \mathbf{V}$ | $\mathbb{R}^{C \times K}$ / $\mathbb{R}^{C \times V}$ | Chunk-level matrices; row $i$ is $\mathbf{k}_i$ |
| $\mathbf{g}_t^{\text{raw}}$ | $\mathbb{R}^K$ | Raw log decay |
| $\mathbf{g}_t$ | $\mathbb{R}^K$ | Cumulative log decay (after cumsum) |
| $\boldsymbol{\lambda}_t = \exp(\mathbf{g}_t^{\text{raw}})$ | $\mathbb{R}^K$ | Per-dimension decay factor (raw decay) |
| $\beta_t$ | scalar | Delta Rule weight |
| $\mathbf{A}_{kk}$ | $\mathbb{R}^{C \times C}$ | Strictly lower triangular weight matrix |
| $\mathbf{W}, \mathbf{U}$ | $\mathbb{R}^{C \times K}$ / $\mathbb{R}^{C \times V}$ | WY-representation weighted keys/values |
| $\mathbf{M}$ | $\mathbb{R}^{K \times K}$ | Affine transition matrix |
| $\mathbf{B}$ | $\mathbb{R}^{K \times V}$ | Affine bias matrix |
| $\otimes$ | – | Outer product: $(\mathbf{k}\otimes\mathbf{v})_{ab} = k_a \cdot v_b$ |
| $\odot$ | – | Hadamard product (element-wise multiplication) |

**Conventions:**

- Lowercase bold ($\mathbf{s}, \mathbf{k}, \mathbf{v}$) denotes token-level row vectors
- Uppercase bold ($\mathbf{S}, \mathbf{K}, \mathbf{V}$) denotes chunk-level matrices
- Matrix $\mathbf{K} \in \mathbb{R}^{C \times K}$ has row $i$ equal to $\mathbf{k}_i \in \mathbb{R}^{1 \times K}$
- Matrix $\mathbf{V} \in \mathbb{R}^{C \times V}$ has row $i$ equal to $\mathbf{v}_i \in \mathbb{R}^{1 \times V}$
- States $\mathbf{s}_t \in \mathbb{R}^{K \times V}$ and $\mathbf{S} \in \mathbb{R}^{K \times V}$ are matrices (not vectors)

### About Chunks

A chunk divides a long sequence into fixed-length contiguous segments (typically $C = 64$ or $128$), each containing $C$ tokens. The choice of $C$ is tied to GPU Tensor Core matrix multiplication dimensions:

- Optimal Tensor Core matmul dimensions typically satisfy $M, N, K \in \{64, 128, 256\}$
- The chunk size $C$ corresponds to the $M$ or $N$ dimension of the matmul
- A larger $C$ (e.g., 256) increases shared-memory usage; a smaller $C$ (e.g., 16) cannot fully utilize Tensor Core parallelism

## Linear Attention: A Simple Starting Point

As a warm-up, let's first look at Linear Attention, the simplest recurrent attention form.

### Definition

$$\mathbf{s}_t = \mathbf{s}_{t-1} + \mathbf{k}_t \otimes \mathbf{v}_t, \quad \mathbf{o}_t = \mathbf{q}_t^\top \mathbf{s}_t$$

where $\mathbf{s}_t \in \mathbb{R}^{K \times V}$ is the state matrix.

### Chunk-wise Form

Divide the sequence into chunks of $C$ tokens each. Let $\mathbf{S} \in \mathbb{R}^{K \times V}$ be the state at the beginning of the chunk; then the state at position $i$ within the chunk is:

$$\mathbf{s}_i = \mathbf{S} + \sum_{j=1}^i \mathbf{k}_j \otimes \mathbf{v}_j$$

The chunk output $\mathbf{O} \in \mathbb{R}^{C \times V}$ (row $i$ is $\mathbf{o}_i^\top$):

$$\mathbf{O} = \mathbf{Q} \mathbf{S} + \text{mask}(\mathbf{Q} \mathbf{K}^\top) \mathbf{V}$$

where $\text{mask}(\cdot)$ denotes causal masking (the lower triangular part). This form is entirely composed of matrix multiplications.

**Reference:** the foundational work on Linear Attention is Katharopoulos et al. (ICML 2020) [1], which first reformulated Transformers into RNN form. Hardware-efficient chunk-wise parallel training methods are described in Yang et al. (ICML 2024) [8].
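A small numerical check of the chunk-wise form against the token-level recurrence (a numpy sketch with arbitrary shapes):

```python
import numpy as np

rng = np.random.default_rng(3)
C, K, V = 8, 4, 3
Q, Kmat = rng.normal(size=(C, K)), rng.normal(size=(C, K))
Vmat = rng.normal(size=(C, V))
S = rng.normal(size=(K, V))          # state at chunk start

# Token-by-token recurrence
s = S.copy()
O_ref = np.zeros((C, V))
for i in range(C):
    s = s + np.outer(Kmat[i], Vmat[i])
    O_ref[i] = Q[i] @ s

# Chunk-wise form: O = Q S + mask(Q K^T) V  (causal mask keeps j <= i)
mask = np.tril(np.ones((C, C)))
O_chunk = Q @ S + (mask * (Q @ Kmat.T)) @ Vmat
assert np.allclose(O_ref, O_chunk)
```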
## Background: From GDN to KDA

### Gated DeltaNet (GDN)

Gated DeltaNet (GDN) is a Delta Rule-based sequence modeling method using scalar decay:

$$\mathbf{s}_t = \lambda_t \cdot \mathbf{s}_{t-1} + \beta_t \cdot \mathbf{k}_t^\top (\mathbf{v}_t - \mathbf{k}_t (\lambda_t \cdot \mathbf{s}_{t-1}))$$

where $\lambda_t = \exp(g_t)$ is a scalar (one value per head): all dimensions share the same forgetting rate.

### Kimi Delta Attention (KDA)

KDA extends GDN by generalizing the scalar decay to per-dimension decay:

$$\mathbf{s}_t = \boldsymbol{\lambda}_t \odot \mathbf{s}_{t-1} + \beta_t \cdot \mathbf{k}_t^\top (\mathbf{v}_t - \mathbf{k}_t (\boldsymbol{\lambda}_t \odot \mathbf{s}_{t-1}))$$

where $\boldsymbol{\lambda}_t \in \mathbb{R}^K$ is a vector (one value per dimension), allowing different dimensions to have independent forgetting rates.

### Objective of This Article

This article focuses on KDA as the main subject, establishing its mathematical theory for chunk-wise parallelism and CP parallelism. GDN, as a special case of KDA (scalar decay), is discussed in the appendix.

## Core Lemmas

### Lemma 1: Matrix Form of Outer Product Accumulation

**Lemma 1:** Let $\mathbf{k}_1, \ldots, \mathbf{k}_C \in \mathbb{R}^K$ and $\mathbf{v}_1, \ldots, \mathbf{v}_C \in \mathbb{R}^V$ be two sets of vectors. Then:

$$\sum_{i=1}^C \mathbf{k}_i \otimes \mathbf{v}_i = \mathbf{K}^\top \mathbf{V}$$

where:

- $\mathbf{K} \in \mathbb{R}^{C \times K}$ is the matrix formed by the $C$ vectors $\mathbf{k}_i$
- $\mathbf{V} \in \mathbb{R}^{C \times V}$ is the matrix formed by the $C$ vectors $\mathbf{v}_i$
- $\otimes$ denotes the outer product: $(\mathbf{k} \otimes \mathbf{v})_{ab} = k_a \cdot v_b$

**Proof:** Directly compute element $(a, b)$ of the right-hand side:

$$(\mathbf{K}^\top \mathbf{V})_{ab} = \sum_{i=1}^C K_{ia} V_{ib} = \sum_{i=1}^C k_{i,a} \cdot v_{i,b} = \sum_{i=1}^C (\mathbf{k}_i \otimes \mathbf{v}_i)_{ab}$$

By Lemma 1, outer product accumulation within a chunk can be expressed as a matrix multiplication (GEMM, General Matrix Multiply), providing the mathematical foundation for chunk-wise parallelism.
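Lemma 1 in a few lines of numpy (shapes arbitrary): the sum of $C$ rank-1 updates collapses into one GEMM.

```python
import numpy as np

rng = np.random.default_rng(4)
C, K, V = 8, 4, 3
Kmat = rng.normal(size=(C, K))
Vmat = rng.normal(size=(C, V))

acc = sum(np.outer(Kmat[i], Vmat[i]) for i in range(C))  # C rank-1 outer products
assert np.allclose(acc, Kmat.T @ Vmat)                   # one matrix multiplication
```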
### Lemma 2: Inverse of a Lower Triangular Matrix

**Lemma 2:** Let $\mathbf{L} \in \mathbb{R}^{C \times C}$ be a unit lower triangular matrix (diagonal is 1, upper triangle is 0). Then $\mathbf{L}^{-1}$ is also a unit lower triangular matrix and can be computed via forward substitution.

In particular, if $\mathbf{L} = \mathbf{I} - \mathbf{N}$, where $\mathbf{N}$ is a strictly lower triangular matrix (diagonal is 0), then:

$$\mathbf{L}^{-1} = \mathbf{I} + \mathbf{N} + \mathbf{N}^2 + \cdots + \mathbf{N}^{C-1}$$

**Proof:** Directly verify $(\mathbf{I} - \mathbf{N})(\mathbf{I} + \mathbf{N} + \cdots + \mathbf{N}^{C-1}) = \mathbf{I} - \mathbf{N}^C = \mathbf{I}$ (since $\mathbf{N}^C = 0$: the $C$-th power of a strictly lower triangular matrix vanishes).

### Lemma 3: Linear Decomposition of the Log-Decay Matrix ($\exp g$ and $\exp -g$)

**Lemma 3:** For given cumulative log-decay vectors $\mathbf{g}_1, \dots, \mathbf{g}_C \in \mathbb{R}^K$ (computed via cumsum), the decay terms in the attention matrix can be decomposed as:

$$\exp(\mathbf{g}_i - \mathbf{g}_j) = \exp(\mathbf{g}_i) \odot \exp(-\mathbf{g}_j)$$

This allows logic that would otherwise require per-position loops to be written directly as a standard matrix multiplication of two "gating matrices":

$$\mathbf{A} = (\mathbf{K} \odot \exp(\mathbf{G})) \cdot (\mathbf{K} \odot \exp(-\mathbf{G}))^\top$$

**Dimension notes:**

- $\mathbf{K} \in \mathbb{R}^{C \times K}$: keys matrix within the chunk; row $i$ is $\mathbf{k}_i$
- $\mathbf{G} \in \mathbb{R}^{C \times K}$: cumulative log-decay matrix; row $i$ is $\mathbf{g}_i$
- $\mathbf{A} \in \mathbb{R}^{C \times C}$: intermediate attention matrix (before applying $\beta$ and the causal mask)

**Decomposition form:**

- $\mathbf{K}_{\text{exp}} = \mathbf{K} \odot \exp(\mathbf{G})$: forward decay (keys after cumulative decay)
- $\mathbf{K}_{\text{inv}} = \mathbf{K} \odot \exp(-\mathbf{G})$: reverse decay (keys after inverse decay)

$$\mathbf{A} = \mathbf{K}_{\text{exp}} \cdot \mathbf{K}_{\text{inv}}^\top$$

**Significance:**

- **Eliminates loops:** transforms $O(C)$ loops and complex einsums into a single standard matrix multiplication (GEMM)
- **Hardware acceleration:** leverages GPU Tensor Cores, shifting the computation from memory-bound to compute-bound
- **Memory savings:** no need to store $C \times C \times K$ intermediate tensors; only the $C \times K$ gating matrices are stored
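A numerical illustration of Lemma 3: the loop form and the single-GEMM form agree (numpy sketch, small arbitrary shapes):

```python
import numpy as np

rng = np.random.default_rng(5)
C, K = 6, 4
Kmat = rng.normal(size=(C, K))
G = np.cumsum(rng.normal(size=(C, K)) * 0.1, axis=0)   # cumulative log decay g_i

# Loop form: A[i, j] = sum_d k_{i,d} * exp(g_{i,d} - g_{j,d}) * k_{j,d}
A_loop = np.zeros((C, C))
for i in range(C):
    for j in range(C):
        A_loop[i, j] = np.sum(Kmat[i] * np.exp(G[i] - G[j]) * Kmat[j])

# Lemma 3: one GEMM of two gated matrices
K_exp = Kmat * np.exp(G)      # forward decay
K_inv = Kmat * np.exp(-G)     # reverse decay
assert np.allclose(A_loop, K_exp @ K_inv.T)
```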
## State Update Mechanism of KDA

### Origin of the Delta Rule

The Delta Rule (also known as the Widrow-Hoff learning rule or LMS algorithm) was originally a parameter update rule in neural networks:

$$\Delta w = \eta \cdot (y - \hat{y}) \cdot x$$

where $(y - \hat{y})$ is the prediction error (delta) and $\eta$ is the learning rate. This rule corrects weights using error signals.

In sequence models, the Delta Rule is reinterpreted as a state update mechanism:

- The historical state $\mathbf{s}_{t-1}$ is viewed as a "prediction" of the current input
- $\mathbf{k}_t \mathbf{s}_{t-1}$ computes the "expected value"
- The residual $\mathbf{v}_t - \mathbf{k}_t \mathbf{s}_{t-1}$ (row vector, $\mathbb{R}^{1 \times V}$) represents the difference between "new information" and "historical expectation"; the outer product $\mathbf{k}_t^\top (\cdot)$ maps the result back to the state matrix $\mathbb{R}^{K \times V}$
- Only this difference (not the full value) updates the state

### Recurrence Formula of KDA

KDA state update mechanism (Delta Rule + per-dim gate):

$$\mathbf{s}_t = \boldsymbol{\lambda}_t \odot \mathbf{s}_{t-1} + \beta_t \cdot \mathbf{k}_t^\top (\mathbf{v}_t - \mathbf{k}_t (\boldsymbol{\lambda}_t \odot \mathbf{s}_{t-1}))$$

where:

- $\boldsymbol{\lambda}_t = \exp(\mathbf{g}_t^{\text{raw}}) \in \mathbb{R}^K$ is the per-dimension decay factor (vector)
- $\beta_t$ is the delta rule weight
- In the residual term $\mathbf{v}_t - \mathbf{k}_t (\boldsymbol{\lambda}_t \odot \mathbf{s}_{t-1})$: $\mathbf{k}_t (\boldsymbol{\lambda}_t \odot \mathbf{s}_{t-1}) \in \mathbb{R}^{1 \times V}$ (row vector) is the expected value; comparison with $\mathbf{v}_t$ yields the residual (in row-vector form); the product $\mathbf{k}_t^\top (\cdot)$ maps the result to the state matrix $\mathbb{R}^{K \times V}$

Note:

- The expected value in the residual is computed using the gated state $\boldsymbol{\lambda}_t \odot \mathbf{s}_{t-1}$
- $\boldsymbol{\lambda}_t$ is a vector; each dimension $i$ has an independent decay rate $\lambda_{t,i}$
- When $\boldsymbol{\lambda}_t = \lambda_t \cdot \mathbf{1}$ (all dimensions identical), KDA reduces to GDN

### Comparison: Linear Attention vs KDA

| Mechanism | Update Rule | Features |
|---|---|---|
| Linear Attention | $\mathbf{s}_t = \mathbf{s}_{t-1} + \mathbf{k}_t \otimes \mathbf{v}_t$ | Accumulates all historical information |
| GDN | $\mathbf{s}_t = \lambda_t \mathbf{s}_{t-1} + \beta_t \cdot \mathbf{k}_t^\top (\mathbf{v}_t - \mathbf{k}_t (\lambda_t \mathbf{s}_{t-1}))$ | Scalar decay, global forgetting |
| KDA | $\mathbf{s}_t = \boldsymbol{\lambda}_t \odot \mathbf{s}_{t-1} + \beta_t \cdot \mathbf{k}_t^\top (\mathbf{v}_t - \mathbf{k}_t (\boldsymbol{\lambda}_t \odot \mathbf{s}_{t-1}))$ | Per-dimension decay, dimension-selective forgetting |

### Problem: The Residual Depends on the Historical State

Expanding the first two steps of the recurrence (note: the gated state is used in the residual):

$$\mathbf{s}_1 = \boldsymbol{\lambda}_1 \odot \mathbf{s}_0 + \beta_1 \cdot \mathbf{k}_1^\top (\mathbf{v}_1 - \mathbf{k}_1 (\boldsymbol{\lambda}_1 \odot \mathbf{s}_0))$$

$$\mathbf{s}_2 = \boldsymbol{\lambda}_2 \odot \mathbf{s}_1 + \beta_2 \cdot \mathbf{k}_2^\top (\mathbf{v}_2 - \mathbf{k}_2 (\boldsymbol{\lambda}_2 \odot \mathbf{s}_1))$$

Each $\mathbf{s}_i$ depends on $\mathbf{S}$ in a complicated way and cannot be directly written in $\mathbf{K}^\top \mathbf{V}$ form using Lemma 1.

**Problem to solve:** separate the parts that "depend on $\mathbf{S}$" from the parts that are "independent of $\mathbf{S}$".
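Before separating the dependencies, it helps to have the naive recurrence written down as ground truth. A minimal numpy sketch (shapes assumed; `lam` is the per-dimension decay $\boldsymbol{\lambda}_t$):

```python
import numpy as np

def kda_recurrence(q, k, v, lam, beta, S0):
    """Token-by-token KDA: s_t = lam_t * s_{t-1} + beta_t k_t^T (v_t - k_t (lam_t * s_{t-1})).

    q, k, lam: [T, K]; v: [T, V]; beta: [T]; S0: [K, V].
    Returns outputs [T, V] and the final state.
    """
    S = S0.copy()
    O = np.zeros_like(v)
    for t in range(len(k)):
        S_decayed = lam[t][:, None] * S               # per-dimension decay
        resid = v[t] - k[t] @ S_decayed               # delta-rule residual (1 x V)
        S = S_decayed + beta[t] * np.outer(k[t], resid)
        O[t] = q[t] @ S
    return O, S
```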
## WY Representation: Separation of Dependencies

### Objective

Let's write out the dependence of $\mathbf{s}_i$ on $\mathbf{S}$ explicitly. Define the corrected value:

$$\tilde{\mathbf{v}}_i = \mathbf{v}_i - \mathbf{k}_i (\boldsymbol{\lambda}_i \odot \mathbf{s}_{i-1}) \in \mathbb{R}^{1 \times V}$$

Since $\mathbf{s}_{i-1}$ itself depends on $\mathbf{S}$, we need to find a representation satisfying:

$$\tilde{\mathbf{v}}_i = \mathbf{u}_i - \mathbf{w}_i \mathbf{S}$$

where $\mathbf{u}_i, \mathbf{w}_i$ depend only on $\{\mathbf{k}_j, \mathbf{v}_j\}$ within the chunk and are independent of $\mathbf{S}$.

### Deriving the WY Representation

**Step 1: Write the recurrence for $\mathbf{s}_i$**

$$\mathbf{s}_i = \boldsymbol{\lambda}_i \odot \mathbf{s}_{i-1} + \beta_i \cdot \mathbf{k}_i^\top (\mathbf{v}_i - \mathbf{k}_i (\boldsymbol{\lambda}_i \odot \mathbf{s}_{i-1}))$$

**Step 2: Define cumulative quantities**

Let $\boldsymbol{\Lambda}^{(i)} = \prod_{j=1}^i \text{diag}(\boldsymbol{\lambda}_j) \in \mathbb{R}^{K \times K}$ (the diagonal cumulative decay matrix), and define the normalized state:

$$\hat{\mathbf{s}}_i = (\boldsymbol{\Lambda}^{(i)})^{-1} \mathbf{s}_i$$

**Step 3: Transform to a lower triangular linear system**

Substituting the normalized state $\hat{\mathbf{s}}_i = (\boldsymbol{\Lambda}^{(i)})^{-1} \mathbf{s}_i$ into the recurrence and rearranging:

$$\hat{\mathbf{s}}_i = \hat{\mathbf{s}}_{i-1} + \beta_i \cdot \check{\mathbf{k}}_i^\top (\mathbf{v}_i - \hat{\mathbf{k}}_i \hat{\mathbf{s}}_{i-1})$$

with two differently gated keys (note: the value needs no decay relative to the state):

$$\hat{\mathbf{k}}_i = \mathbf{k}_i \odot \exp(\mathbf{g}_i), \qquad \check{\mathbf{k}}_i = \mathbf{k}_i \odot \exp(-\mathbf{g}_i)$$

The forward-gated $\hat{\mathbf{k}}_i$ reads the normalized state (from $\mathbf{k}_i \boldsymbol{\Lambda}^{(i)} = \hat{\mathbf{k}}_i$), while the reverse-gated $\check{\mathbf{k}}_i$ writes to it (from $(\boldsymbol{\Lambda}^{(i)})^{-1} \mathbf{k}_i^\top = \check{\mathbf{k}}_i^\top$). The residual can then be written as (a row vector):

$$\tilde{\mathbf{v}}_i = \mathbf{v}_i - \hat{\mathbf{k}}_i \hat{\mathbf{s}}_{i-1} \in \mathbb{R}^{1 \times V}$$

Expanding $\hat{\mathbf{s}}_{i-1}$ in recursive form (with initial state $\hat{\mathbf{s}}_0 = \mathbf{S}$):

$$\hat{\mathbf{s}}_{i-1} = \mathbf{S} + \sum_{j=1}^{i-1} \beta_j \cdot \check{\mathbf{k}}_j \otimes \tilde{\mathbf{v}}_j$$

Substituting into the residual expression:

$$\tilde{\mathbf{v}}_i = \mathbf{v}_i - \hat{\mathbf{k}}_i \mathbf{S} - \sum_{j=1}^{i-1} \beta_j \cdot \hat{\mathbf{k}}_i \check{\mathbf{k}}_j^\top \cdot \tilde{\mathbf{v}}_j$$

Note: here $\tilde{\mathbf{v}}_j \in \mathbb{R}^{1 \times V}$ is a row vector and $\hat{\mathbf{k}}_i \check{\mathbf{k}}_j^\top$ is a scalar (a $K$-dimensional inner product).

To rearrange into matrix form, multiply row $i$ by $\beta_i$ and absorb it into the residual; from here on, $\tilde{\mathbf{v}}_i$ denotes $\beta_i (\mathbf{v}_i - \hat{\mathbf{k}}_i \hat{\mathbf{s}}_{i-1})$, which is exactly the weighting the state accumulation needs. Define:

- Matrix $\tilde{\mathbf{V}} \in \mathbb{R}^{C \times V}$ with rows $\tilde{\mathbf{v}}_i$
- Matrix $\mathbf{A}_{kk} \in \mathbb{R}^{C \times C}$, strictly lower triangular, for $i > j$: $[\mathbf{A}_{kk}]_{ij} = \beta_i (\mathbf{k}_i \odot \exp(\mathbf{g}_i)) (\mathbf{k}_j \odot \exp(-\mathbf{g}_j))^\top$

This yields the linear system:

$$(\mathbf{I} + \mathbf{A}_{kk}) \tilde{\mathbf{V}} = \text{diag}(\boldsymbol{\beta}) (\mathbf{V} - \mathbf{K}_{\text{gated}} \mathbf{S})$$

where row $i$ of $\mathbf{K}_{\text{gated}}$ is $\mathbf{k}_i \odot \exp(\mathbf{g}_i)$.

**Step 4: Apply Lemma 2**

By Lemma 2, $\mathbf{L} = \mathbf{I} + \mathbf{A}_{kk}$ is a unit lower triangular matrix; its inverse $\mathbf{L}^{-1} = (\mathbf{I} + \mathbf{A}_{kk})^{-1}$ is also unit lower triangular. Solving the linear system:

$$\tilde{\mathbf{V}} = (\mathbf{I} + \mathbf{A}_{kk})^{-1} \text{diag}(\boldsymbol{\beta}) \mathbf{V} - (\mathbf{I} + \mathbf{A}_{kk})^{-1} \text{diag}(\boldsymbol{\beta}) \mathbf{K}_{\text{gated}} \mathbf{S}$$

**Step 5: Define the WY representation**

Define the weighted matrices (corresponding to `u = A @ v` and `w = A @ (exp(g) * k)` in the code):

$$\mathbf{U} = (\mathbf{I} + \mathbf{A}_{kk})^{-1} \text{diag}(\boldsymbol{\beta}) \mathbf{V}$$

$$\mathbf{W} = (\mathbf{I} + \mathbf{A}_{kk})^{-1} \text{diag}(\boldsymbol{\beta}) (\mathbf{K} \odot \exp(\mathbf{G}))$$

yielding the separated form:

$$\tilde{\mathbf{V}} = \mathbf{U} - \mathbf{W} \mathbf{S}$$

This is the WY representation.

**Reference:** the WY representation was originally proposed by Bischof & Van Loan (1987) [13] for representing products of Householder matrices, later improved to a compact form by Schreiber & Van Loan (1989) [14]. In sequence models, DeltaNet [11] first applied this technique to the parallel computation of linear attention; Gated DeltaNet [12] further introduced gating mechanisms.
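A sketch of the construction above (numpy + scipy assumed; `solve_triangular` plays the role of forward substitution):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(6)
C, K, V = 6, 4, 3
Kmat = rng.normal(size=(C, K))
Vmat = rng.normal(size=(C, V))
beta = rng.uniform(0.1, 1.0, size=C)
G = np.cumsum(rng.normal(size=(C, K)) * 0.1, axis=0)    # cumulative log decay

K_gated = Kmat * np.exp(G)                              # k_i * exp(g_i)
# A_kk[i, j] = beta_i (k_i * exp(g_i)) . (k_j * exp(-g_j)) for i > j
A_kk = np.tril((beta[:, None] * K_gated) @ (Kmat * np.exp(-G)).T, k=-1)

L = np.eye(C) + A_kk                                    # unit lower triangular
U = solve_triangular(L, beta[:, None] * Vmat, lower=True, unit_diagonal=True)
W = solve_triangular(L, beta[:, None] * K_gated, lower=True, unit_diagonal=True)
# Corrected values for any chunk-initial state S: V_tilde = U - W @ S
```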
### Explanation of the WY Representation

- $\mathbf{W} \in \mathbb{R}^{C \times K}$: weighted keys; row $i$ is $\mathbf{w}_i \in \mathbb{R}^{1 \times K}$
- $\mathbf{U} \in \mathbb{R}^{C \times V}$: weighted values; row $i$ is $\mathbf{u}_i \in \mathbb{R}^{1 \times V}$
- $\tilde{\mathbf{v}}_i = \mathbf{u}_i - \mathbf{w}_i \mathbf{S}$: corrected value (row vector, $\mathbb{R}^{1 \times V}$)

From the above derivation, $\mathbf{U}, \mathbf{W}$ are independent of $\mathbf{S}$ and can be precomputed before computing $\mathbf{S}$.

## Core Theorem: Chunk-wise Affine Form

Now we can state the core theorem.

### Theorem (Chunk-wise Affine Form of KDA/GDN)

Let the state at the chunk start be $\mathbf{S} \in \mathbb{R}^{K \times V}$; then the state at the chunk end is:

$$\mathbf{S}' = \mathbf{M} \cdot \mathbf{S} + \mathbf{B}$$

where:

- Transition matrix $\mathbf{M} \in \mathbb{R}^{K \times K}$:
$$\mathbf{M} = \text{diag}(\exp(\mathbf{g}_{\text{last}})) - \mathbf{K}_{\text{decayed}}^\top \mathbf{W}$$
- Bias matrix: $\mathbf{B} = \mathbf{K}_{\text{decayed}}^\top \mathbf{U} \in \mathbb{R}^{K \times V}$
- Row $i$ of $\mathbf{K}_{\text{decayed}}$ is $\mathbf{k}_i \odot \exp(\mathbf{g}_{\text{last}} - \mathbf{g}_i)$, where $\mathbf{g}_{\text{last}}$ denotes the cumulative log decay at the last position of the chunk

And the chunk output is:

$$\mathbf{O} = (\mathbf{Q} \odot \exp(\mathbf{g}_q)) \cdot \mathbf{S} + \text{mask}(\mathbf{A}_{qk}) \cdot (\mathbf{U} - \mathbf{W} \mathbf{S})$$

where $\mathbf{g}_q$ is the cumulative gate for the queries, and $\odot$ denotes broadcast multiplication.

### Proof

State update (taking KDA as the example):

$$
\begin{aligned}
\mathbf{S}' &= \text{diag}(\exp(\mathbf{g}_{\text{last}})) \mathbf{S} + \sum_{i=1}^C \exp(\mathbf{g}_{\text{last}} - \mathbf{g}_i) \odot (\mathbf{k}_i^\top \tilde{\mathbf{v}}_i) \\
&= \text{diag}(\exp(\mathbf{g}_{\text{last}})) \mathbf{S} + \mathbf{K}_{\text{decayed}}^\top \tilde{\mathbf{V}} \quad \text{(Lemma 1: outer product accumulation = matrix multiplication)} \\
&= \text{diag}(\exp(\mathbf{g}_{\text{last}})) \mathbf{S} + \mathbf{K}_{\text{decayed}}^\top (\mathbf{U} - \mathbf{W} \mathbf{S}) \quad \text{(substitute the WY representation } \tilde{\mathbf{V}} = \mathbf{U} - \mathbf{W} \mathbf{S} \text{)} \\
&= (\text{diag}(\exp(\mathbf{g}_{\text{last}})) - \mathbf{K}_{\text{decayed}}^\top \mathbf{W}) \mathbf{S} + \mathbf{K}_{\text{decayed}}^\top \mathbf{U} \\
&= \mathbf{M} \mathbf{S} + \mathbf{B}
\end{aligned}
$$

For GDN, simply replace the diagonal matrix $\text{diag}(\boldsymbol{\lambda}^{\text{last}})$ with the scalar $\lambda^{\text{last}} \mathbf{I}$.

Output computation follows similarly.

### Form of the Affine Transformation

$$\mathbf{S}' = \underbrace{\mathbf{M}}_{K \times K} \cdot \underbrace{\mathbf{S}}_{K \times V} + \underbrace{\mathbf{B}}_{K \times V}$$

This is an Affine transformation:

- **Linear part:** $\mathbf{M} \cdot \mathbf{S}$ represents the decay and projection of the historical state
- **Translation part:** $\mathbf{B}$ represents the new information introduced by the current chunk

## Algorithm Implementation: From Theory to Code

Based on the above theorem, we can write the chunk-wise algorithm:
Algorithm Implementation: From Theory to Code

Based on the theorem above, we can write the chunk-wise algorithm:

```python
import torch

def chunk_kda(K, V, Q, g, beta):
    """
    K, Q:  [C, K_dim]  keys / queries within the chunk
    V:     [C, V_dim]  values within the chunk
    g:     [C, K_dim]  cumulative gate (cumsum of log decay)
    beta:  [C]         per-token weights for the delta rule
    """
    C = K.shape[0]
    eye = torch.eye(C, device=K.device, dtype=K.dtype)
    strict_lower = torch.tril(torch.ones_like(eye), diagonal=-1)
    causal = torch.tril(torch.ones_like(eye))

    # Step 1: strictly lower-triangular matrix A (without beta).
    # Lemma 3 decomposition: A = (K * exp(g)) @ (K * exp(-g)).T
    K_exp = K * torch.exp(g)     # gated keys, also used in Step 3
    K_inv = K * torch.exp(-g)
    A = (K_exp @ K_inv.T) * strict_lower

    # Step 2: unit lower-triangular system including beta (Lemma 2).
    # L = I + diag(beta) @ A; its inverse is applied by forward substitution.
    L = eye + beta[:, None] * A

    # Step 3: prepare weighted inputs
    V_weighted = V * beta[:, None]       # [C, V_dim], V * beta
    K_weighted = K_exp * beta[:, None]   # [C, K_dim], gated K * beta

    # Step 4: WY representation (solve L @ X = Y by forward substitution)
    # U = L^{-1} @ (V * beta),  W = L^{-1} @ (K * exp(g) * beta)
    U = torch.linalg.solve_triangular(L, V_weighted, upper=False)  # [C, V_dim]
    W = torch.linalg.solve_triangular(L, K_weighted, upper=False)  # [C, K_dim]

    # Step 5: affine parameters.
    # Row i of K_decayed is k_i * exp(g_last - g_i).
    K_decayed = K * torch.exp(g[-1] - g)                 # [C, K_dim]
    M = torch.diag(torch.exp(g[-1])) - K_decayed.T @ W   # [K_dim, K_dim]
    B = K_decayed.T @ U                                  # [K_dim, V_dim]

    # Step 6: assuming initial state S = 0, the local end-of-chunk state is B
    S_next = B

    # Step 7: chunk output under S = 0 (the full output also needs the
    # S-dependent terms from the theorem).
    # Decayed keys give the relative decay exp(g_i - g_j) inside A_qk (cf. Lemma 3).
    Q_gated = Q * torch.exp(g)                       # [C, K_dim], gated queries
    O_local = ((Q_gated @ K_inv.T) * causal) @ U     # [C, V_dim]

    return M, B, O_local, S_next, W, U
```

Notes:

- KDA uses the per-dimension decay diag(exp(g_last)); GDN uses the scalar exp(g_last) * I
- Both queries and keys need gating: queries for the output computation, keys for the residual computation
- g is the cumulative gate with shape [C, K], i.e., per-dimension log decay

CP Parallelism and SM Parallelism

CP Parallelism: Affine Chain Rule

Now that we have a consistent Affine interface, we can naturally extend to Context Parallelism (CP).

Compositional Properties of Affine Transformations

Lemma 4: The composition of two Affine transformations is still an Affine transformation.

Let:

- $\mathbf{S}_1 = \mathbf{M}_0 \mathbf{S}_0 + \mathbf{B}_0$
- $\mathbf{S}_2 = \mathbf{M}_1 \mathbf{S}_1 + \mathbf{B}_1$

Then:

$$\mathbf{S}_2 = \underbrace{(\mathbf{M}_1 \mathbf{M}_0)}_{\mathbf{M}_{01}} \mathbf{S}_0 + \underbrace{(\mathbf{M}_1 \mathbf{B}_0 + \mathbf{B}_1)}_{\mathbf{B}_{01}}$$

CP Algorithm

Assume $R$ ranks, where rank $r$ holds chunk $r$.

Step 1: Local Computation. Each rank assumes $\mathbf{S} = \mathbf{0}$ and computes:

- $(\mathbf{M}_r, \mathbf{B}_r)$: the Affine parameters
- $\mathbf{B}_r$: the final state under a zero initial state (i.e., the local accumulation, corresponding to $h_{ext}$ in KCP)

Step 2: All-Gather. Collect $\{ (\mathbf{M}_r, \mathbf{B}_r) \}_{r=0}^{R-1}$ from all ranks.

Step 3: Prefix Scan (Fold). Rank $r$ computes its true initial state:

$$\mathbf{S}_r = \sum_{j=0}^{r-1} \left( \prod_{k=j+1}^{r-1} \mathbf{M}_k \right) \mathbf{B}_j$$

Step 4: Local Recomputation. Recompute the chunk output with the correct $\mathbf{S}_r$:

$$\mathbf{O}_r = \mathbf{O}_r^{\text{local}} + \mathbf{Q}_r \mathbf{S}_r - \text{mask}(\mathbf{A}_{qk}) \mathbf{W}_r \mathbf{S}_r$$
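A minimal sketch of Steps 2 and 3, assuming the all-gather has already produced the lists `Ms` and `Bs` on every rank (function and variable names here are illustrative, not the FLA API):

```python
def cp_initial_states(Ms, Bs, S0=None):
    """Fold the gathered affine parameters into each rank's true initial
    state: S_{r+1} = M_r @ S_r + B_r (Lemma 4, applied left to right)."""
    K_dim, V_dim = Bs[0].shape
    S = S0 if S0 is not None else Bs[0].new_zeros(K_dim, V_dim)
    inits = []
    for M, B in zip(Ms, Bs):
        inits.append(S)      # rank r consumes S_r before processing chunk r
        S = M @ S + B        # compose the next chunk's affine map
    return inits             # inits[r] equals the prefix-scan formula for S_r
```

Because affine composition is associative (Lemma 4), this sequential fold can also be organized as a parallel prefix scan over the $(\mathbf{M}_r, \mathbf{B}_r)$ pairs when $R$ is large.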
Mathematical Foundation of CP Parallelism

CP parallelism is possible because of the compositional properties of Affine transformations:

- Each chunk is an Affine transformation
- Applying multiple chunks in sequence = a product of Affine transformations
- Cross-rank state transfer = accumulation of Affine parameters

SM Parallelism: Fine-grained Parallelism within a Single Card

Problem Background

In single-card (intra-card) inference scenarios, SMs become underutilized when sequences are very long:

- A GPU has a fixed number of SMs (Streaming Multiprocessors; an A100, for example, has 108)
- The number of chunks per head is $T / (H \times C)$, where $T$ is the sequence length, $H$ the number of heads, and $C$ the chunk size
- When the sequence is long but the number of heads is small, each head carries many sequential chunks while only a few thread blocks are active, leaving most SMs idle

Solution: Subsequence Splitting

SM Parallelism splits a long sequence into multiple subsequences such that

$$\text{subseq\_len} = \text{target\_chunks} \times C \approx \text{num\_sms} \times C$$

where:

- $\text{num\_sms}$: the number of SMs on the GPU
- $C$: the chunk size (typically 64)

Each subsequence then contains enough chunks to saturate all SMs.

Mathematical Form

Let the original sequence be split into $M$ subsequences, with subsequence $m$ having initial state $\mathbf{S}_m$.

Step 1: Intra-subsequence CP. Each subsequence internally executes the standard CP pre-process: compute $(\mathbf{M}_m^{\text{local}}, \mathbf{B}_m^{\text{local}})$, the local accumulation under $\mathbf{S}_m = \mathbf{0}$.

Step 2: Inter-subsequence Merge. States are merged across the subsequences of the same original sequence:

$$\mathbf{S}_{m+1} = \mathbf{M}_m^{\text{local}} \cdot \mathbf{S}_m + \mathbf{B}_m^{\text{local}}$$

This is again chain composition of Affine transformations.

Step 3: Final Computation. Recompute the output of each subsequence with its correct initial state.

Relationship with CP Parallelism

| Parallelism Level | Split Dimension | Communication | Applicable Scenario |
| --- | --- | --- | --- |
| CP Parallelism | Cross-GPU (inter-card) | NCCL All-Gather | Multi-card training/inference |
| SM Parallelism | Within a single card (intra-card) | Shared memory | Single-card long-sequence inference |

Both share the same mathematical essence, chain composition of Affine transformations, just at different granularities: CP Parallelism at the rank level, SM Parallelism at the subsequence level.

Implementation Points

- Dynamic splitting: compute subseq_len dynamically from the sequence length and the number of SMs (see the sketch after this list)
- Split-info management: maintain the mapping between subsequences and the original sequence
- Two-level computation: intracard_pre_scan computes the local $(\mathbf{M}, \mathbf{B})$ of all subsequences in parallel; intracard_merge merges the subsequence states of the same original sequence

Implementation reference: fla/ops/common/intracard_cp.py
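A minimal sketch of the dynamic splitting and the inter-subsequence merge, following the formula above (names such as `plan_subsequences` are illustrative, not the `intracard_cp.py` API):

```python
def plan_subsequences(T, chunk_size=64, num_sms=108):
    # Choose subseq_len ~ num_sms * chunk_size so that each subsequence
    # holds roughly num_sms chunks, enough to occupy every SM.
    subseq_len = num_sms * chunk_size
    bounds = list(range(0, T, subseq_len)) + [T]
    return list(zip(bounds[:-1], bounds[1:]))   # (start, end) per subsequence

def merge_subsequence_states(Ms_local, Bs_local, S0):
    # Step 2: S_{m+1} = M_m^local @ S_m + B_m^local, the same affine fold
    # as cp_initial_states above, now at subsequence granularity.
    S, inits = S0, []
    for M, B in zip(Ms_local, Bs_local):
        inits.append(S)
        S = M @ S + B
    return inits, S
```

On the kernel side the merge happens inside one card via shared memory rather than NCCL, but the algebra is identical to the cross-rank CP fold.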
Summary

We have established the complete mathematical framework for KDA (with GDN as a special case), starting from the most basic lemmas:

- Lemma 1: outer-product accumulation = matrix multiplication → the motivation for chunk-wise parallelism
- Lemma 2: inverse of a lower triangular matrix → the theoretical foundation of the WY representation
- Lemma 3: decomposition of log decay → the matrix-multiplication form of decay computation
- The challenge of KDA: the residual depends on the historical state
- WY representation: separate the dependencies to obtain $\tilde{\mathbf{V}} = \mathbf{U} - \mathbf{W} \mathbf{S}$
- Core theorem: the chunk-wise Affine form $\mathbf{S}' = \mathbf{M} \mathbf{S} + \mathbf{B}$
- CP parallelism: chain composition of Affine transformations

Key Insights

- Essence of the WY representation: explicitly separating the parts that depend on the historical state $\mathbf{S}$ is what makes parallel computation possible
- Role of the Affine form: it provides a unified state-update interface, naturally supporting multi-level parallelism (CP, SM)
- Advantage of per-dimension decay: different feature dimensions get independent forgetting rates, enhancing expressiveness

Notation Conventions

- Lowercase $\mathbf{s}, \mathbf{k}, \mathbf{v}$: token-level vectors
- Uppercase $\mathbf{S}, \mathbf{K}, \mathbf{V}, \mathbf{M}, \mathbf{B}$: chunk-level matrices
- GDN (scalar decay) and KDA (per-dimension decay) differ only in the diagonal part of the transition matrix

Appendix: GDN vs KDA

| Feature | GDN | KDA |
| --- | --- | --- |
| Decay | Scalar $\lambda$ | Vector $\boldsymbol{\lambda} \in \mathbb{R}^K$ |
| Transition | $\mathbf{M} = \lambda \mathbf{I} - \mathbf{K}^\top \mathbf{W}$ | $\mathbf{M} = \text{diag}(\boldsymbol{\lambda}) - \mathbf{K}^\top \mathbf{W}$ |
| Expressiveness | Global forgetting | Dimension-selective forgetting |
| Computation | Slightly faster | Slightly slower |

Both are Affine forms; only the diagonal part of $\mathbf{M}$ differs.

Reference: Gated DeltaNet is detailed in Yang et al. (2024) [12]; Kimi Delta Attention (KDA) extends it in the per-dimension-decay direction.
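To make the table concrete, here is a minimal sketch of the two transition matrices (the helper names are illustrative, not library functions):

```python
import torch

def transition_kda(lam, K_decayed, W):
    # KDA: per-dimension decay vector lam in R^{K_dim}
    return torch.diag(lam) - K_decayed.T @ W

def transition_gdn(lam, K_decayed, W):
    # GDN: a single scalar decay lam, i.e. lam * I
    K_dim = K_decayed.shape[1]
    eye = torch.eye(K_dim, dtype=K_decayed.dtype, device=K_decayed.device)
    return lam * eye - K_decayed.T @ W
```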
References

The mathematical derivations and algorithm descriptions in this article are based on the Flash Linear Attention (FLA) framework implementation.

1. Katharopoulos, A., et al. (2020). "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention". ICML 2020. https://arxiv.org/abs/2006.16236
2. Kimi Team. (2025). "Kimi Linear: An Expressive, Efficient Attention Architecture". arXiv:2510.26692. https://arxiv.org/abs/2510.26692
3. Choromanski, K., et al. (2021). "Rethinking Attention with Performers". ICLR 2021. https://arxiv.org/abs/2009.14794
4. Peng, B., et al. (2023). "RWKV: Reinventing RNNs for the Transformer Era". EMNLP 2023. https://arxiv.org/abs/2305.13048
5. Gu, A., & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces". https://arxiv.org/abs/2312.00752
6. Dao, T., & Gu, A. (2024). "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality". https://arxiv.org/abs/2405.21060
7. Dao, T., & Gu, A. (2024). "Mamba2" (in "Transformers are SSMs"). https://arxiv.org/abs/2405.21060
8. Yang, S., et al. (2024). "Gated Linear Attention Transformers with Hardware-Efficient Training". ICML 2024. https://arxiv.org/abs/2312.06635
9. Peng, B., et al. (2024). "Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence". arXiv:2404.05892. https://arxiv.org/abs/2404.05892
10. Peng, B., et al. (2025). "RWKV-7 'Goose' with Expressive Dynamic State Evolution". arXiv:2503.14456. https://arxiv.org/abs/2503.14456
11. Yang, S., et al. (2024). "Parallelizing Linear Transformers with the Delta Rule over Sequence Length". NeurIPS 2024. https://arxiv.org/abs/2406.06484
12. Yang, S., Kautz, J., & Hatamizadeh, A. (2024). "Gated Delta Networks: Improving Mamba2 with Delta Rule". arXiv:2412.06464. https://arxiv.org/abs/2412.06464
13. Bischof, C., & Van Loan, C. (1987). "The WY Representation for Products of Householder Matrices". SIAM Journal on Scientific and Statistical Computing, 8(1). https://epubs.siam.org/doi/abs/10.1137/0908009
14. Schreiber, R., & Van Loan, C. (1989). "A Storage-Efficient WY Representation for Products of Householder Transformations". SIAM Journal on Scientific and Statistical Computing, 10(1). https://epubs.siam.org/doi/10.1137/0910005