arxiv:2512.07782

GatedFWA: Linear Flash Windowed Attention with Gated Associative Memory

Published on Dec 8, 2025

Abstract

GatedFWA addresses the instability issues of sliding window attention by introducing learnable gates that stabilize memory updates and improve gradient flow in autoregressive models.

Modern autoregressive models rely on attention, yet the Softmax full attention in Transformers scales quadratically with sequence length. Sliding Window Attention (SWA) achieves linear-time encoding/decoding by constraining the attention pattern, but under an Associative Memory interpretation, its difference-style update renders the training objective effectively unbounded. In contrast, Softmax attention normalizes updates, leading to memory shrinkage and gradient vanishing. We propose GatedFWA: a Memory-Gated (Flash) Windowed Attention mechanism that preserves SWA's efficiency while stabilizing memory updates and making gradient flow controllable. In essence, GatedFWA accumulates a per-token/head gate into a decay bias added to the attention logits, acting as a learnable contraction in the memory recurrence. We implement a fused one-pass gate preprocessing step and a FlashAttention-compatible kernel that injects the gate under a sliding mask, ensuring I/O efficiency and numerical stability. On language modelling benchmarks, GatedFWA delivers competitive throughput with negligible overhead and better use of global context. It also integrates cleanly with token compression/selection methods such as NSA and generalizes to various autoregressive domains.
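The gate-as-decay-bias idea described in the abstract can be illustrated with a small reference sketch. This is not the paper's kernel and should not be read as its exact formulation: it assumes a single head, takes the per-token log-gates as a given input (each value at most 0), accumulates them with a prefix sum, and adds the difference `cum[t] - cum[j]` to the logit of query `t` against key `j` inside a causal window. The function name `gated_window_attention` and its parameters are hypothetical, and the loop is written in plain Python for readability rather than fused or I/O-efficient.

```python
import math


def gated_window_attention(q, k, v, log_gates, window):
    """Reference sketch of sliding-window attention with a gate-derived decay bias.

    q, k, v   : lists of T vectors (lists of floats), single head
    log_gates : list of T floats, each <= 0 (assumed log of a gate in (0, 1])
    window    : number of most recent positions each query may attend to
    """
    T, d = len(q), len(q[0])

    # One pass over the gates: cum[t] = log_gates[0] + ... + log_gates[t].
    cum, running = [], 0.0
    for lg in log_gates:
        running += lg
        cum.append(running)

    out = []
    for t in range(T):
        lo = max(0, t - window + 1)  # causal sliding window [lo, t]
        logits = []
        for j in range(lo, t + 1):
            score = sum(a * b for a, b in zip(q[t], k[j])) / math.sqrt(d)
            # Assumed decay bias: sum of log-gates over positions j+1..t.
            # It is 0 for j == t and non-positive for older keys.
            logits.append(score + (cum[t] - cum[j]))

        # Numerically stable softmax over the window.
        m = max(logits)
        weights = [math.exp(x - m) for x in logits]
        z = sum(weights)

        row = [0.0] * len(v[0])
        for w_j, j in zip(weights, range(lo, t + 1)):
            for i in range(len(row)):
                row[i] += (w_j / z) * v[j][i]
        out.append(row)
    return out
```

In this sketch, setting every entry of `log_gates` to `0.0` recovers plain sliding-window softmax attention, while strongly negative log-gates push each query toward its most recent key, which is one way to picture the gate acting as a contraction on older memory.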

