Ablation study on transferring sparse attention weights to dense attention across tasks.

| Model | LongVideoBench | MLVU | TimeScope | LongTimeScope | Tomato | VSIBench |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 58.7 | 51.2 | 81.0 | 40.7 | 22.6 | 29.7 |
| Dense-SFT | 57.8 (-1.5%) | 51.2 (+0.0%) | 76.8 (-5.2%) | 40.2 (-1.2%) | 21.7 (-4.0%) | 30.6 (+2.1%) |
| Dense-NSA | 56.1 (-4.4%) | 51.6 (+0.8%) | 83.0 (+2.5%) | 40.9 (+0.5%) | 23.4 (+3.5%) | 33.1 (+10.7%) |
| VideoNSA | 59.4 (+1.1%) | 51.8 (+1.2%) | 82.7 (+2.1%) | 44.4 (+9.1%) | 26.2 (+15.9%) | 36.1 (+20.3%) |
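
The parenthesized percentages in the table read as relative changes against the Qwen2.5-VL-7B row. The source does not state this, so it is an assumption. A minimal sketch of that calculation, checked here only against the LongTimeScope column, follows; the helper name `relative_change` is hypothetical and introduced for illustration.

```python
def relative_change(score: float, baseline: float) -> float:
    """Percent change of score relative to baseline, rounded to one decimal."""
    return round((score - baseline) / baseline * 100, 1)


# LongTimeScope column from the table above.
# Assumption: the Qwen2.5-VL-7B row is the reference for the percentages.
baseline = 40.7
scores = {"Dense-SFT": 40.2, "Dense-NSA": 40.9, "VideoNSA": 44.4}

changes = {model: relative_change(s, baseline) for model, s in scores.items()}

for model, delta in changes.items():
    # The "+" format flag prints an explicit sign, as in the table.
    print(f"{model}: {delta:+.1f}%")
```

Under this assumption the three values come out as -1.2%, +0.5%, and +9.1%, which agrees with the LongTimeScope entries reported for Dense-SFT, Dense-NSA, and VideoNSA.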