Method

VideoNSA builds on Qwen2.5-VL-7B. Each layer splits tokens by position IDs into vision and text paths: text tokens follow grouped-query attention, while vision tokens run Native Sparse Attention with compression, selection, and sliding-window branches fused by per-head two-layer MLP gates. Vision tokens aggregate into frame blocks, pass through the three branches, and rejoin the text stream after gating.
Compression Branch
Preserves salient cues by averaging frame KV blocks and routing them through learnable gates, keeping the compute budget linear in context length.
Selection Branch
Ranks candidate blocks by importance scores and retains the most informative segments, focusing attention on discriminative events.
Sliding Window Branch
Guarantees local temporal coverage with a lightweight windowed path so fine-grained motion details persist alongside global context.
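To make the three branches concrete, below is a minimal single-head sketch of how compression, selection, and sliding-window attention could be computed over a vision-token sequence. Block size, block count, and window size follow the training setup described below; the simplifications (no causal mask on the compressed path, sequence-level rather than per-query block selection) are ours and do not reflect the fused kernels VideoNSA actually uses.

import torch
import torch.nn.functional as F

def three_branch_attention(q, k, v, block=64, n_select=32, window=256):
    """Illustrative single-head sketch of the compression / selection / sliding-window
    branches (not VideoNSA's fused kernels). q, k, v: [L, d]."""
    L, d = k.shape
    scale = d ** -0.5

    # Compression: mean-pool keys/values inside each block -> one token per block.
    nb = L // block
    k_cmp = k[: nb * block].reshape(nb, block, d).mean(dim=1)
    v_cmp = v[: nb * block].reshape(nb, block, d).mean(dim=1)
    attn_cmp = F.softmax(q @ k_cmp.T * scale, dim=-1)            # [L, nb]
    out_cmp = attn_cmp @ v_cmp                                   # [L, d]

    # Selection: rank blocks by the compressed attention scores and keep the top-k.
    scores = attn_cmp.mean(dim=0)                                # importance per block
    top = scores.topk(min(n_select, nb)).indices
    idx = (top[:, None] * block + torch.arange(block)).reshape(-1)
    k_sel, v_sel = k[idx], v[idx]
    out_sel = F.softmax(q @ k_sel.T * scale, dim=-1) @ v_sel     # [L, d]

    # Sliding window: each query attends only to the previous `window` tokens.
    pos = torch.arange(L)
    mask = (pos[None, :] <= pos[:, None]) & (pos[None, :] > pos[:, None] - window)
    logits = (q @ k.T * scale).masked_fill(~mask, float("-inf"))
    out_win = F.softmax(logits, dim=-1) @ v                      # [L, d]

    return out_cmp, out_sel, out_win

# Shape check with random tensors.
L, d = 1_024, 128
q = k = v = torch.randn(L, d)
out_cmp, out_sel, out_win = three_branch_attention(q, k, v)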
We train VideoNSA end to end so the vision pathway learns data-dependent sparse connectivity inside the language model. The dataset is a filtered split of LLaVA-Video-178K sampled at 4 fps, keeping clips with 350 to 550 frames for about 216K question–answer pairs. We set pixels per frame at 50,176 and limit each training example to a 36K token context. Training uses block size 64, block count 32, and sliding window 256. The full training consumes roughly 4,600 H100 GPU hours.
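For reference, the training hyperparameters above can be collected into a single configuration sketch; the field names (and the exact integer behind "36K") are our assumptions, while the values are taken from the paragraph above.

from dataclasses import dataclass

@dataclass
class VideoNSATrainConfig:
    # Field names are illustrative; values come from the training description above.
    base_model: str = "Qwen2.5-VL-7B"
    dataset: str = "LLaVA-Video-178K (filtered split)"
    sample_fps: int = 4
    min_frames: int = 350                 # clips kept if frame count is in [350, 550]
    max_frames: int = 550
    num_qa_pairs: int = 216_000           # "about 216K" question-answer pairs
    pixels_per_frame: int = 50_176
    max_context_tokens: int = 36_000      # "36K"-token limit per example (exact value assumed)
    block_size: int = 64
    block_count: int = 32
    sliding_window: int = 256
    h100_gpu_hours: int = 4_600           # approximate total training cost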
Key Findings
01 Do learned sparse attention weights remain beneficial in dense attention settings?
Yes, but only partially. Sparse-trained QKV weights help under dense inference. The transferred Dense-NSA often surpasses Dense-SFT, indicating an inductive bias toward better attention distributions. However, gains are limited and inconsistent. The full VideoNSA with runtime sparsity and dynamic gating remains best, showing that improvements come from learned weights plus execution-time sparsity, not weight transfer alone.
02 How far can VideoNSA scale in context length?
VideoNSA scales reliably to 128K vision–text contexts, with task-dependent budgeting. Trained at a smaller budget of 36K, VideoNSA generalizes beyond its training length and continues to improve as context grows. Under a fixed budget, the optimal split is task dependent: benchmarks emphasizing spatial detail prefer more tokens per frame, those emphasizing temporal coverage prefer more frames, and mixed settings show shifting preferences as length increases. This indicates that longer contexts benefit VideoNSA while it adapts to diverse spatiotemporal demands.

03 How should the attention budget be allocated?
Allocate toward global attention and tune near the training allocation. Model performance is highly sensitive to how the attention budget is split between global blocks and local sliding windows. Across tasks, settings close to the training ratio generally work best, and small adjustments around it help more than simply enlarging the overall budget. Under the same budget, increasing global attention typically outperforms increasing local attention. VideoNSA attains leading performance with only 3.6% of the full attention budget.
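As an illustration of what "splitting the budget" means, the snippet below enumerates ways to divide a fixed per-query key budget between globally selected blocks and the local window, with the training-time allocation (32 blocks of 64 plus a 256-token window) appearing as one point on that line. The 2,304-key total and the step size are assumptions chosen for illustration, not the paper's sweep.

# Enumerate splits of a fixed per-query key budget between "global" selected blocks
# and the "local" sliding window. Assumes block size 64; the 2,304-key total and the
# 256-token step are illustrative, chosen so the training allocation appears as one row.
def budget_splits(total_keys=2_304, block=64, step=256):
    for window in range(0, total_keys + 1, step):
        yield {"selected_blocks": (total_keys - window) // block, "window": window}

for split in budget_splits():
    print(split)   # e.g. {'selected_blocks': 32, 'window': 256} is the training setup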

04 What roles do compression, selection, and sliding-window gates play in VideoNSA?
The compression branch merges redundant tokens to preserve salient content and support long-range aggregation, the selection branch routes attention sparsely to the most informative blocks for global context, and the sliding-window branch enforces local temporal coverage and smooth frame-to-frame integration. A final blend uses dynamic gates to mix these signals per head and per layer.
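A compact sketch of the per-head, two-layer MLP gating described above; the gate input (the per-head query states) and the sigmoid parameterization are assumptions rather than the released implementation.

import torch
import torch.nn as nn

class BranchGate(nn.Module):
    """Per-head two-layer MLP producing mixing weights for the three branch outputs.
    A sketch: VideoNSA's exact gate inputs and normalization may differ."""
    def __init__(self, head_dim: int, hidden: int = 64, n_branches: int = 3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(head_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, n_branches),
        )

    def forward(self, q, out_cmp, out_sel, out_win):
        # q and each out_*: [batch, heads, length, head_dim]
        gates = torch.sigmoid(self.mlp(q))                            # [B, H, L, 3]
        branches = torch.stack((out_cmp, out_sel, out_win), dim=-1)   # [B, H, L, D, 3]
        return (branches * gates.unsqueeze(-2)).sum(dim=-1)           # gated per-head blend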




What roles do the gates play across depth? Compression dominates at every layer, selection and sliding window peak in the early and middle layers before tapering, and all three intensify again at the final layer for late fusion. Compression serves as the backbone that reduces redundancy while preserving salient features. Selection and sliding window contribute more in early and middle layers, sometimes overtaking compression, then weaken as the model aggregates higher-level features. In the last layer, all three branches become strongly active again, indicating a late fusion stage.

How similar are heads within each gate? In the middle layers, selection and sliding window show high inter-head similarity while compression remains diverse across heads, and at the first and final layers similarity is low for all gates. This mid-layer alignment suggests synchronized behaviors for block selection and local temporal integration. The compression gate maintains low inter-head similarity across depths, operating largely in a head-independent manner. Early and final layers keep inter-head similarity weak across all gates to preserve representational diversity and to support feature mixing at the top.
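One way to quantify the inter-head similarity discussed here is the mean pairwise cosine similarity of a gate's per-head activations within a layer; the exact metric used for the analysis is not stated, so treat the following as an assumed formulation.

import torch
import torch.nn.functional as F

def mean_interhead_similarity(gate_values: torch.Tensor) -> float:
    """gate_values: [heads, length] gate activations for one branch in one layer.
    Returns the mean pairwise cosine similarity across heads (assumed metric)."""
    normed = F.normalize(gate_values, dim=-1)        # unit-norm per head
    sim = normed @ normed.T                          # [H, H] cosine similarities
    h = sim.shape[0]
    off_diag = sim[~torch.eye(h, dtype=torch.bool)]  # drop self-similarity
    return off_diag.mean().item()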

05 Where does the efficiency bottleneck come from?
The compression branch is the primary latency bottleneck as context scales to 128K.
We measure wall-clock latency for each branch from 1K to 128K tokens and observe runtime becoming increasingly dominated by compression, while the selection and sliding-window paths contribute only modestly at long horizons. Ideally, compression grows roughly linearly with length, O(L); sliding windows behave like O(L·w) for window size w; and selection incurs O(L²/b) work for importance scoring over block size b. In practice, hardware parallelism, memory access, and kernel-launch overheads shift these curves, yet compression remains the limiting route, signaling that kernel and memory optimizations there would deliver the biggest wins.
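To restate the idealized scaling in code form, the snippet below evaluates the three asymptotic cost terms with unit constants; it deliberately ignores the hardware effects just mentioned, which is exactly why measured latency can diverge from these curves.

# Idealized operation counts per branch (totals over a length-L sequence), following
# the asymptotics above: compression ~ O(L), sliding window ~ O(L·w), selection ~ O(L²/b).
# Constants are set to 1; real kernels add memory and launch overheads on top.
def ideal_ops(L, b=64, w=256):
    return {
        "compression": L,            # block pooling touches each token once
        "sliding_window": L * w,     # each query attends to a fixed local window
        "selection": L * L // b,     # importance scoring over L/b blocks per query
    }

for L in (1_024, 8_192, 32_768, 131_072):
    print(f"{L:>7} tokens: {ideal_ops(L)}")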

06 Do learnable sparse mechanisms induce attention sinks?
Attention sinks are tokens, often the first few in decoder-only Transformers, that attract a disproportionate share of attention regardless of content. They arise from softmax normalization and positional or initialization biases. Sink tokens typically show small key and value norms alongside large query activations, which yields high cosine similarity and large attention weights while contributing little to the residual stream, because the value norms are low. This effect pulls probability mass away from informative tokens, weakens long-range information flow, and becomes more pronounced as context length increases.
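A hedged sketch of how a sink ratio could be computed from an attention map: flag key positions whose average received attention far exceeds the uniform 1/L baseline. The threshold factor and the averaging over queries are our assumptions, not necessarily the protocol behind the numbers reported below.

import torch

def sink_ratio(attn: torch.Tensor, factor: float = 10.0) -> float:
    """attn: [queries, keys] attention weights (rows sum to 1).
    A key position counts as a sink if the average attention it receives
    exceeds `factor` times the uniform baseline 1/keys (assumed criterion)."""
    n_keys = attn.shape[-1]
    received = attn.mean(dim=0)                  # mean attention mass per key position
    sinks = received > factor / n_keys
    return sinks.float().mean().item()           # fraction of key positions flagged

# Example: a random attention map with an artificial sink injected at position 0.
attn = torch.softmax(torch.randn(512, 512), dim=-1)
attn[:, 0] += 0.5
attn = attn / attn.sum(dim=-1, keepdim=True)
print(f"sink ratio: {sink_ratio(attn):.3%}")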

Learnable sparse mechanisms can induce dynamic attention sinks, but the effect is branch specific and controlled. Under the same sparse configuration with 256 tokens per frame, the compression branch produces the most sinks, forming banded patterns along the value-norm axis because token merging amplifies some norms while suppressing others. The selection branch yields almost no sinks because its top-k block filtering smooths the value-norm distribution. The sliding window branch reveals a clear split between sink and non-sink tokens and helps regularize norms. With dynamic gating, VideoNSA offsets compression-induced sinks and keeps the overall sink ratio near 0.3%.


VideoNSA keeps sink ratios low and stable across layers, while dense attention and strict locality are more prone to sink accumulation. As depth increases, dense attention’s sink ratio climbs steadily. By branch, compression shows the highest levels with occasional spikes, selection remains near zero, and sliding window stays low but exhibits mid-to-late layer peaks, indicating that locality can reintroduce bias on long sequences. Learned sparsity and gating in VideoNSA prevent sink buildup at scale.

VideoNSA avoids both early-position bias and uniform diffusion, keeping sink positions controlled and structured. Dense attention spreads sinks broadly across the sequence. Compression concentrates sinks near the beginning with a steep decay. Selection yields very few sinks. Sliding window shows sparse peaks near periodic local boundaries. Dynamic gating smooths temporal coverage and mitigates over-reliance on early tokens.

Sparse hyperparameters strongly shape sink position and density: compression is the main source, selection is largely immune, and balanced block/window choices trade early peaks for coverage. In the compression branch, smaller blocks create sharper, higher peaks at the sequence start, while larger blocks damp the initial spike but spread low-density sinks with periodic boundary bumps. The selection branch keeps densities near zero because top-k filtering reliably suppresses sinks. The sliding-window branch concentrates sinks near the first tokens and decays with depth; larger windows reduce overall density but broaden coverage. Training with w=256 offers a balanced profile, showing sparse periodic clusters mid-to-late sequence that mark learned local boundaries.



BibTeX
@misc{song2025videonsanativesparseattention,
      title={VideoNSA: Native Sparse Attention Scales Video Understanding},
      author={Enxin Song and Wenhao Chai and Shusheng Yang and Ethan Armand and Xiaojun Shan and Haiyang Xu and Jianwen Xie and Zhuowen Tu},
      year={2025},
      eprint={2510.02295},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.02295},
}